CSI5386: Natural Language Processing

 

 

Assignment 1

Social media corpus analysis

 

Due: Fri, Sept 29, 2017, 10pm

 

Note: This assignment should be done in groups of two students. Only one student from each group needs to submit via the Virtual Campus (or by email), but the submission must specify the names of both partners.

                                                                                         

Part 1: Corpus processing: tokenization and word counting [60 points]

 

Implement a word tokenizer for Twitter messages that splits the text of the messages into tokens and separates punctuation marks and other symbols from the words. Please describe in your report all the decisions you made relative to pre-processing and tokenization. 
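A tokenizer along these lines can be sketched with a single regular expression. This is an illustration of one possible approach, not a prescribed solution; the patterns for URLs, @usernames, and #hashtags are assumptions about which tweet-specific units should stay intact.

```python
import re

# A minimal regex-based tweet tokenizer sketch: keeps @usernames, #hashtags,
# and URLs intact, and splits punctuation off ordinary words. The alternatives
# are tried in order, so the more specific patterns come first.
TOKEN_RE = re.compile(r"""
    https?://\S+          # URLs
  | @\w+                  # @usernames
  | \#\w+                 # hashtags
  | \w+(?:'\w+)?          # words, with an optional internal apostrophe (it's)
  | [^\w\s]               # any remaining punctuation mark or symbol
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens found in one tweet."""
    return TOKEN_RE.findall(text)

print(tokenize("@alice check http://t.co/xyz #nlp, it's great!"))
# → ['@alice', 'check', 'http://t.co/xyz', '#nlp', ',', "it's", 'great', '!']
```

A real submission would need to document further decisions, e.g. how emoticons, contractions, and non-ASCII characters are handled, as the assignment asks.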

 

Implement a program that counts the number of occurrences of each token in the corpus.
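Once the corpus is tokenized, the counting step is straightforward. A sketch using Python's collections.Counter (the one-tweet-per-line, space-separated input format is an assumption matching the tokenizer output described above):

```python
from collections import Counter

def count_tokens(lines):
    """Count token frequencies over an iterable of tokenized lines,
    where each line holds one tweet's tokens separated by spaces."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Toy example: .most_common() yields (token, frequency) pairs,
# most frequent first, which matches the format asked for in Tokens.txt.
counts = count_tokens(["the cat sat", "the dog"])
print(counts.most_common(2))  # → [('the', 2), ('cat', 1)]
```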

 

You can use any tools for tokenization; write programs only if you need to put the data in the right format or to compute additional information. There are many NLP tools that include tokenizers, and some of them are adapted to social media texts, for example the POS tagging tools mentioned in Part 2.

 

Use this corpus of 48,401 Twitter messages as input to your tokenizer. The format is one Twitter message per line. Provide the following information about the corpus in your report:

a)     Submit a file microblog2011_tokenized.txt with the tokenizer’s output for the whole corpus. Include in your report the output for the first 20 sentences in the corpus.

b)    How many tokens did you find in the corpus? How many types (unique tokens) did you have? What is the type/token ratio for the corpus? The type/token ratio is defined as the number of types divided by the number of tokens.

c)     For each token, print the token and its frequency in a file called Tokens.txt (from the most frequent to the least frequent) and include the first 100 lines in your report.

d)    How many tokens appeared only once in the corpus?

e)     From the list of tokens, extract only words, by excluding punctuation and other symbols. How many words did you find? List the top 100 most frequent words in your report, with their frequencies. What is the type/token ratio when you use only word tokens (called lexical diversity)?

f)     From the list of words, exclude stopwords. List the top 100 most frequent words and their frequencies. You can use this list of stopwords (or any other that you consider adequate).

g)    Compute all the pairs of two consecutive words (excluding stopwords and punctuation). List the most frequent 100 pairs and their frequencies in your report. Also compute the type/token ratio when you use only word tokens without stopwords (called lexical density).

h)    Extract multi-word expressions (composed of two or more words, where the meaning of the expression is more than the composition of the meanings of its words). You can use an existing tool or your own method (explain which tool or method you used). List the 100 most frequent expressions extracted.
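Several of the quantities above, e.g. the type/token ratio in b), the lexical diversity in e), and the word pairs in g), reduce to a few lines of code. A sketch: the stopword set here is a tiny placeholder for whichever list you adopt, and it assumes pairs are formed after removing stopwords and punctuation (one reasonable reading of part g).

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and"}  # placeholder; substitute the provided list

def type_token_ratio(tokens):
    """Number of types (unique tokens) divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

def word_pairs(tokens):
    """Frequencies of consecutive word pairs, with stopwords and
    punctuation removed first (isalpha() is a crude word filter)."""
    words = [t for t in tokens if t.isalpha() and t.lower() not in STOPWORDS]
    return Counter(zip(words, words[1:]))

toks = ["the", "new", "york", "times", "and", "new", "york", "!"]
print(type_token_ratio(toks))          # 6 types / 8 tokens = 0.75
print(word_pairs(toks).most_common(1)) # → [(('new', 'york'), 2)]
```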

 

Part 2: Part-of-Speech Tagging   [40 points]

 

Use this corpus of 10,000 Twitter messages, already tokenized. It is in the format of one sentence per line. Run one or more part-of-speech (POS) taggers on the corpus, and compute the tagging accuracy. Use the Penn Treebank tags plus 4 extra tags for Twitter messages: USR, HT, URL, and RT for @usernames, #hashtags, URLs, and retweet symbols, respectively.

 

You can use any part-of-speech tagger. Here is a list of tools adapted for social media text: the GATE Twitter POS tagger, the CMU Twitter NLP and POS tagger, Alan Ritter's Twitter POS tagger, or any other tool.

For computing the tagging accuracy, you can implement your own script or use one included in the tool that you chose. Here is the expected solution to compare against.
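If you write your own accuracy script, it can be as simple as comparing tags token by token. A sketch, assuming both the tagger output and the expected solution use the word/TAG format with one sentence per line; splitting on the last '/' is a convention chosen here so that words that themselves contain slashes, such as URLs, are handled.

```python
def tagging_accuracy(predicted_lines, gold_lines):
    """Fraction of tokens whose predicted tag matches the gold tag.
    Both inputs: iterables of lines, each line 'word/TAG word/TAG ...'."""
    correct = total = 0
    for pred, gold in zip(predicted_lines, gold_lines):
        for p, g in zip(pred.split(), gold.split()):
            # compare only the tag after the last '/'
            correct += p.rsplit("/", 1)[1] == g.rsplit("/", 1)[1]
            total += 1
    return correct / total

# Toy example: one of two tags is correct.
print(tagging_accuracy(["I/PRP run/VBP"], ["I/PRP run/VB"]))  # → 0.5
```

In practice you would also want to check that the two files align token for token before trusting the score.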

 

a)     Submit a file POS_results.txt with the tagger’s output for the whole corpus.

Include in your report the POS tagger’s output for the first 20 sentences in the corpus.

b)    What is the POS tagging accuracy for the whole corpus?

c)     Include in your report the frequency of each POS tag in the corpus.

 

We will have a mini-competition, with chocolate prizes, for the best tagging accuracy on the whole corpus.

 

 

Submission instructions:

 

1. Prepare a report with written answers for the two parts. Summarize the methods that you implemented, any additional resources that you used, present the results that you obtained, and discuss them. Write the names and student numbers of the team members at the beginning of the report and explain how the tasks were divided.

 

2. Submit your report, result files (microblog2011_tokenized.txt, Tokens.txt, and POS_results.txt), and code electronically through the Virtual Campus or by email. Archive everything in a .zip file. Do not include the initial data files or any external tools. Include a readme file explaining how to run each program that you implemented and how to use any external resources.