CSI5180: Topics in Artificial Intelligence: Natural Language Processing, A Statistical Approach



Assignment 2

Due: Fri, Mar 25, 21:00



Keyphrase extraction [50 points]


Keyphrases are expressions (one or more words, most often noun phrases) that describe the main topic of a document. As keyphrases represent the key ideas of documents, extracting good keyphrases benefits various natural language processing (NLP) applications, such as summarization, information retrieval, and question answering. In summarization, keyphrases can be used as semantic metadata. In search engines, keyphrases can supplement full-text indexing and assist users in creating good queries. Therefore, the quality of keyphrases has a direct impact on the quality of downstream NLP applications.


Your task is to implement a method to extract keyphrases from scientific articles. You will use the data from SemEval 2010 Task 5 on automatic keyphrase extraction. It is split into three archives: trial, train, and test.


You need to tokenize each file in order to be able to extract words from it. Feel free to clean the files of unwanted symbols if needed (or to fix errors in them, since some text might have been obtained from pdf files, with conversion errors). You can use your tokenizer from Assignment 1.
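
As a starting point, the cleaning and tokenization step could be sketched as follows. This is a minimal illustration, not a required design: the function name tokenize and the token pattern are assumptions, and your tokenizer from Assignment 1 may work just as well.

```python
import re

# Lowercased alphanumeric tokens, allowing internal hyphens or apostrophes
# (e.g. "tf-idf", "author's"). The pattern is an assumption, not a requirement.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")


def tokenize(text):
    """Return a list of lowercased word tokens from raw article text."""
    # Rejoin words split by a hyphen at a line break, a common PDF artifact.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    return TOKEN_RE.findall(text.lower())
```

Note that rejoining at line breaks also merges genuinely hyphenated compounds that happen to be split across lines, so you may want to inspect a few files before relying on it.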


You can choose any method for keyphrase extraction, from a simple tf-idf method to a sophisticated machine learning approach. One possibility is to detect important words, and then to choose the noun phrases that contain those words as the final keyphrases. Alternatively, you can use any tool or keyword extraction system available on the Internet.
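
To make the simple end of that range concrete, here is a minimal sketch of tf-idf word scoring. The function names, the smoothed idf variant, and the input layout (one token list per document) are assumptions chosen for illustration; other tf-idf formulations are equally valid.

```python
import math
from collections import Counter


def tfidf_scores(doc_tokens, corpus_tokens):
    """Score each word in doc_tokens by tf * idf.

    corpus_tokens is a list of token lists, one per document
    (a hypothetical input layout for this sketch).
    """
    n_docs = len(corpus_tokens)
    # Document frequency: number of documents that contain each word.
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))
    tf = Counter(doc_tokens)
    scores = {}
    for word, count in tf.items():
        # Add-one smoothing keeps idf finite for words unseen in the corpus.
        idf = math.log((1 + n_docs) / (1 + df[word])) + 1.0
        scores[word] = (count / len(doc_tokens)) * idf
    return scores


def top_words(scores, k=15):
    """Return the k highest-scoring words, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]
```

The top-scoring words could then serve as the "important words" used to select candidate noun phrases.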


Please write a report describing your method. Also submit a file with results for the trial data and a file with results for the test data (the train data is for your own use, if you need it).


Please mention in your report your best results in terms of Precision, Recall, and F-measure over the top 15 extracted keyphrases, on the trial data and on the test data. The expected solutions are provided by the SemEval task in three versions: as assigned by the authors of the articles, as assigned by human judges who read the articles, and the two combined.
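
For a quick sanity check during development, the three measures for a single document could be computed as in the sketch below. It assumes exact string matching between predicted and expected keyphrases, and the function name prf_at_k is hypothetical; the scores reported by the official evaluation script remain the reference.

```python
def prf_at_k(predicted, gold, k=15):
    """Precision, recall, and F-measure for the top k predicted keyphrases.

    predicted is a ranked list of keyphrases; gold is the expected list.
    Matching is exact string equality (an assumption of this sketch).
    """
    top = predicted[:k]
    gold_set = set(gold)
    # Count each distinct correct keyphrase once.
    correct = sum(1 for phrase in set(top) if phrase in gold_set)
    precision = correct / len(top) if top else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```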


The result files should contain one line for each document, in the following format:

FILENAME : KEYPHRASE_LIST

where KEYPHRASE_LIST contains the keyphrases separated by commas.

For example, if the FILENAME is C-1, the corresponding line could be:

C-1 : keyphrase extraction,competition,test,performance evaluation


Please list 15 candidate keyphrases per document.
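
Writing the result files in this format could look like the sketch below. The function names and the (doc_id, keyphrases) input layout are assumptions for illustration; removing duplicate keyphrases before truncating to 15 is a design choice of this sketch, not a stated requirement.

```python
def format_result_line(doc_id, keyphrases, k=15):
    """Build one result line: 'DOC_ID : phrase1,phrase2,...'."""
    seen = set()
    unique = []
    for phrase in keyphrases:
        # Collapse internal whitespace so each phrase stays on one line.
        phrase = " ".join(phrase.split())
        # Keep the first occurrence of each phrase, preserving rank order.
        if phrase and phrase not in seen:
            seen.add(phrase)
            unique.append(phrase)
    return f"{doc_id} : {','.join(unique[:k])}"


def write_results(path, results):
    """Write one line per document; results is a list of (doc_id, keyphrases)."""
    with open(path, "w", encoding="utf-8") as out:
        for doc_id, keyphrases in results:
            out.write(format_result_line(doc_id, keyphrases) + "\n")
```

For the example above, format_result_line("C-1", ["keyphrase extraction", "competition", "test", "performance evaluation"]) produces the line shown for C-1.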

You can use the script performance.pl that comes with the trial / test data in order to compute the scores (usage: perl performance.pl <your_file_name>).



Your goal is to achieve the best F-score on the trial data (for the combined author- plus reader-assigned expected solution). Please do not optimize your algorithm on the test data; run on the test data only after you have finished developing your algorithm. We will have a mini-competition (chocolate prizes) for the best F-score on the test data.


Please put the extracted keyphrases in separate result files: one in the original version (as extracted from the text) and one in the stemmed version (using the Porter stemmer). For the trial data, you have the expected solution both in lemmatized form (base forms for nouns) and in stemmed form. For the test data, the solution is available only in stemmed form. We will focus on the stemmed version for computing P, R, and F, but it is nice to have the original keyphrases too, to be able to look at them.


The four result files that you will submit should be named TRIAL.lem, TRIAL.stm, TEST.lem, and TEST.stm.


For more information, read the task webpage and the readme files that come with the data sets.



Submission instructions:


1. Prepare a report (.pdf, .doc, or .txt). Describe your methods, any tools that you used, present the results, and analyze / discuss them.


2. Submit your report by email to diana@site.uottawa.ca. There is no need to submit your code. Please submit the 4 result files as separate files, and put all the files in a single zip archive.