CSI5180: Topics in Artificial Intelligence: Natural Language Processing, A Statistical Approach
Due: Feb 10, 2012, 21:00; extended until Feb 17, 21:00.
Corpus processing and comparison of a specialized corpus and a corpus of generic English [25 points]
We will use a corpus of Computer Science articles as a domain-specific corpus, and a part of the British National Corpus (BNC, http://www.natcorp.ox.ac.uk/) as a corpus of generic English. You can download the zipped files here: cs.zip (approximately 1.5 million words) and bnc.zip (approximately 6 million words). The Computer Science corpus is a part of the NUS Corpus (plain text was extracted from PDF files; the abstracts and domain categories were kept). Test your programs on a small sample first.
You will need to do preprocessing and word tokenization. You can ignore upper/lower-case differences when counting words. Please describe in your report all the decisions you made regarding pre-processing and tokenization. Include in your report the output of your tokenizer on this small file.
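A minimal tokenizer along these lines could look as follows. This is only one possible scheme (the exact treatment of apostrophes, hyphens, and punctuation is among the decisions you should document in your report):

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens.
    One possible scheme: words may contain internal apostrophes or
    hyphens; every other non-alphanumeric character becomes a separate
    punctuation token."""
    text = text.lower()
    # \w+(?:['-]\w+)* matches words like "don't" or "domain-specific";
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)
```

For example, `tokenize("Don't split domain-specific terms, please.")` keeps "don't" and "domain-specific" as single word tokens and emits the comma and period as separate punctuation tokens.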
Provide in your report the following information for each of the two corpora:
a) For the top 100 most frequent words, print the word and its frequency, including stopwords (exclude punctuation tokens).
b) Print the same information for the subsequent ranks (>100), but do not print every line: print only one line for every 1000 words (otherwise there would be too many lines), in decreasing order of frequency and, among words with the same frequency, in decreasing alphabetical order.
c) Print the sequence c1, c2, c3, ..., c99, c100, c>100, where ci is the number of words in the corpus that have frequency count i (the frequency of frequencies) and c>100 is the number of words that occur more than 100 times. Discuss your findings.
d) What is the type/token ratio for the corpus? The type/token ratio is defined as the number of unique words (types) divided by the number of words (tokens), punctuation excluded. How many types did you have? How many tokens?
e) For the top 100 most frequent words, print the word and its frequency, excluding stopwords and punctuation. You can use this list of stopwords (or any other).
f) For the top 100 most frequent pairs of two consecutive words, print the pair and its frequency, excluding stopwords and punctuation.
g) Compute strong collocations of two words, using, for example, mutual information, chi-square, or other measures (include in your report the first 20 collocations for two measures; optionally for more measures). You can implement your own collocation extractor, or use tools such as the NSP statistical package http://www.d.umn.edu/~tpederse/nsp.html.
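The statistics in parts (a)–(d) can all be derived from one frequency table. A sketch, assuming the input is a list of lowercased word tokens with punctuation already removed:

```python
from collections import Counter

def word_stats(tokens):
    """Compute the statistics for parts (a)-(d) from a list of
    already-lowercased word tokens (punctuation removed)."""
    counts = Counter(tokens)
    # Rank by decreasing frequency and, among equal frequencies, by
    # decreasing alphabetical order: sort by word descending first,
    # then stably by frequency descending.
    ranked = sorted(counts.items(), key=lambda kv: kv[0], reverse=True)
    ranked.sort(key=lambda kv: kv[1], reverse=True)
    # (a) the top 100 most frequent words
    top100 = ranked[:100]
    # (b) beyond rank 100, one line for every 1000 ranks
    sampled = ranked[100::1000]
    # (c) frequency of frequencies: c_i for i = 1..100, plus c_>100
    freq_of_freq = Counter(counts.values())
    c = [freq_of_freq.get(i, 0) for i in range(1, 101)]
    c_over_100 = sum(n for f, n in freq_of_freq.items() if f > 100)
    # (d) type/token ratio
    ttr = len(counts) / len(tokens)
    return top100, sampled, (c, c_over_100), ttr
```

Note the two-pass sort: Python's sort is stable, so sorting by word descending and then by frequency descending leaves ties in decreasing alphabetical order, as part (b) requires.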
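Parts (e) and (f) only add a stopword filter and a sliding window of two. A sketch, assuming lowercased word tokens; the stopword set below is a tiny illustrative placeholder for the full list the assignment links to:

```python
from collections import Counter

# A tiny illustrative stopword set; substitute the assignment's full
# list (or any other standard list) here.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "for"}

def content_word_and_bigram_counts(tokens):
    """Parts (e) and (f): frequencies of single content words and of
    pairs of consecutive words, with stopwords and punctuation excluded.
    Assumes `tokens` is a list of lowercased word tokens."""
    content = [t for t in tokens if t not in STOPWORDS]
    word_counts = Counter(content)
    # Pair consecutive surviving tokens. One decision to document in
    # your report: whether a pair may span a removed stopword (as done
    # here) or must be adjacent in the original text.
    bigram_counts = Counter(zip(content, content[1:]))
    return word_counts.most_common(100), bigram_counts.most_common(100)
```

`Counter.most_common(100)` already returns entries in decreasing order of frequency, so only the tie-breaking order among equal counts would still need adjusting.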
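For part (g), one of the suggested measures, pointwise mutual information, can be sketched as follows; the minimum-count threshold is a common (and here assumed) filtering choice, since PMI is notoriously inflated for rare pairs:

```python
import math
from collections import Counter

def pmi_collocations(tokens, top_k=20, min_count=5):
    """Score adjacent word pairs by pointwise mutual information:
    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ),
    which with corpus counts reduces to log2( f(x,y) * N / (f(x) * f(y)) ).
    Pairs occurring fewer than `min_count` times are discarded."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scored = []
    for (x, y), f in bigrams.items():
        if f < min_count:
            continue
        pmi = math.log2(f * n / (unigrams[x] * unigrams[y]))
        scored.append(((x, y), pmi))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

The same counting loop extends to chi-square or other association measures by swapping the scoring formula; alternatively, the NSP package mentioned above computes these measures directly.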
Include in the report a discussion of the results, with focus on comparison between the two corpora.
What other statistics could be useful when comparing two corpora?
Can you identify words and collocations (specialized terms) that appear only in the Computer Science corpus?
Prepare written answers (typed) in a file named report (.txt, .doc, or .pdf). In the report, present your results and discuss them. Write your name and student number in the report. Submit your report file by email to firstname.lastname@example.org