CSI4107, Winter 2013

Assignment 1

Due Monday Feb 11, extended till Fri Feb 15, 22:00

Geographic Information Retrieval System [100 points]

Note: This assignment should be done in groups of two students.

You will implement a geographic Information Retrieval (IR) system which will find Wikipedia entries / documents that answer a particular information need. Feel free to add geographical reasoning of some sort in your system.

We will use the data from GikiCLEF 2009, but only the English queries and the English Wikipedia documents.

You will submit the results of your system on a set of 50 test queries (available here). Put the results of your IR system in a file name Results.txt (whose format is explained below in point 4.). You also have their ideal answers, the relevance judgments, available here (also in trec_eval here with two additional columns) so that you can evaluate the performance of your system. Please mention the evaluation score in your report.

The collection of documents is a sub-part of the English Wikipedia, and it is available here (the older incomplete version is here, but it is no longer needed). The documents are in xml format. You need to extract the text from them or any other information that you want to use.

Here are two examples of test queries:

GC-2009-01: List the Italian places where Ernest Hemingway visited during his life.

GC-2009-02: Which countries have the white, green and red colors in their national flag?

Here is an exemple of expected solution, from the relevance judgements file.

GC-2009-01 en/a/c/c/Acciaroli

GC-2009-01 en/a/v/e/Aveto_Valley_caf9

GC-2009-01 en/f/o/s/Fossalta_di_Piave_e6b6

GC-2009-01 en/h/a/r/Harry_'s_Bar__(Venice_)_5e7a

GC-2009-01 en/v/e/n/Venice

GC-2009-02 en/d/o/m/Dominican_Republic_1b44

GC-2009-02 ...

...

The relevance judgements file contains the correct answers for one query, then for the next query, and so on. The answers are not necessarily complete. They were produced by human judges, using the pooling method: from all the answers retrieved by all the systems that participated in gikiCLEF 2009, they selected the answers that they considered correct. If your system produces additional answers, we will consider them incorrect, with the risk of excluding a small number of possibly correct answers.

You can use an existing IR system from the Internet, and adapt it to work on this collection and queries (this system can use the vector space model or a more advanced model, and might need some adaptation in regard to the format of the documents, queries and output). Alternatively, you can implement you own indexing scheme based on the vector space model, as discussed in class (see the implementation slides). Most tools that you can use include the following steps, or you can implement some of the steps and use a tool for other steps.

Feel free to include special treatment for geographical entities if you want.

1. Preprocessing the texts [10 points]

Implement preprocessing functions for tokenization and stopword removal. The index terms will be all the words left after filtering out markup that is not part of the text, punctuation tokens, numbers, stopwords, etc. Optionally, you can use the Porter stemmer to stem the index words.

- Input: Documents that are read one by one from the collection (process the fields to extract only what you need).

- Output: Tokens to be added to the index (the vocabulary)

The same preprocessing should be applied on the text of the queries.

2. Indexing the text [20 points]

Build an inverted index, with an entry for each word in the vocabulary. You can use any appropriate data structure (hash table, linked lists, Access database, etc.). Note: if you use an existing IR system, use its indexing mechanism.

- Input: Tokens obtained from the preprocessing module (the vocabulary)

- Output: An inverted index for fast access

3. Document retrieval and ranking [20 points]

Use the inverted index to find the limited set of documents that contain at least one of the query words. Compute the similarity scores between a query and each document (cosine similarity or any other).

- Input: One query and the inverted index from Step 2

- Output: Similarity values between the query and each of the documents. Rank the documents in decreasing order of similarity scores.

4. Results [20 points]

Run your system on the set of 50 test queries. Save the output in a file named Results.txt and submit this file with your assignment.

For each query, include maximum 20 results for each query. The file should have the following format:

query_id 1 docno rank score tag
where: query_id is the query name in order (all the answers for query 1, than for query 2, ..., till the 50th query); the second position is an unused field (set to 1); docno is the document id (without the extension .xml but including the full path); rank is the rank assigned by your system to the document (1 is the highest rank); score is the computed degree of match between the text and the query; and tag is a unique identifier you chose for this run (same for every query and document). Example (fictional):

GC-2009-01 1 en/f/o/s/Fossalta_di_Piave_e6b6 1 0.8032 run_name

GC-2009-01 1 en/h/a/r/Harry_'s_Bar__(Venice_)_5e7a 2 0.7586 run_name

GC-2009-01 1 en/h/o/l/Spain 3 0.6517 run_name

...

Evaluation: The Results.txt file should be compared automatically with the expected solution file, so that you can compute how correct the automatic answers are. As evaluation measure we will use Precision: from how many answers your system retrieves how many are relevant, summed over all the queries and Recall: from how many answers there are in the expected solution file, how many did your system retrieve, as well as the F-measure (the harmonic mean of P and R, F =2 P R / (P+R). Implement your own evaluation script and include it in your submission (another possibility is to use the trec_eval script). Please mention the evaluation scores in your report.

5. Report [30 points]

- write a README file (plain text, Word format, or pdf) including:

* the name and student number for the two students in the group, and specify how the tasks were divided.

* a detailed note about the functionality of your programs, and auxiliary tools.

* complete instructions on how to run them

* Explain the algorithms, data structures, and optimizations that you used. How big was the vocabulary? Include a sample of 100 tokens from your vocabulary. Include the answers to queries 1 and 25. Discuss your results.

* Include the score computed for the results on the 50 test queries.

Tools and resources:

You can use any other resources from the Internet, as long as you explain in your report how you used them (compilation, installation, adaptation, etc.).

Here is a list of stopwords to use in Step 1 (or use another list). In Step1 you can use the Java regular expressions library to help with the filtering. If you want to use stemming, you can use the Porter stemmer.

You can use any text IR system available on the Internet (Terrier, Lucene, Lemur, etc.).

Another resource you can explore for text IR using cosine vector space model is the package ir.vsr. Some documentation on ir.vsr is available here.

Submission instructions:

- include the REAMDE file (the report).

- include the Results.txt file with the results for all the test queries, in the required format.

- include all your programs, but not the extra tools (for those you only have to explain how to install and use them).

- submit your assignment, including programs, README file, and Results.txt file, as a zip file through Virtual Campus (Blackboard Learn version).

- do not include the initial collection.

Note: Only one student form each group needs to submit the assignment.

Have fun!!!