CSI4107, Winter 2012

Assignment 1

Due Friday Feb 10, 14:00, extended to Feb 20, 22:00.

Wikipedia Image Information Retrieval System [100 points]

Note: This assignment should be done in groups of two students.

You will implement an Information Retrieval (IR) system for a collection of Wikipedia images annotated with text descriptions. You will submit the results of your system on a set of 75 test queries (available here; the images from the queries that are not in the collection are available here). You also have their ideal answers, the relevance judgments (available here), so that you can evaluate the performance of your system before submission. To compute evaluation measures, you can use the trec_eval script. Here is a trec_eval script that works in Windows.

The data:

The Wikipedia image collection was used in imageCLEF 2008. It was created and employed by the INEX Multimedia (MM) Track (2006-2007). This WikipediaMM image collection consists of 151,519 images that cover diverse topics of interest. Each image is associated with user-generated alphanumeric, unstructured metadata in English. These metadata usually contain a brief caption or description of the image, the Wikipedia user who uploaded the image, and the copyright information. These descriptions are highly heterogeneous and of varying length. The figure below provides an example image and its associated metadata.

The wikipediaMM image collection is available here:

The thumbnails of the wikipediaMM image collection can be downloaded: here (2 GB)
The metadata of the images in wikipediaMM image collection can be downloaded: here (20 MB)
Additional information on the wikipediaMM collection can be downloaded here (1.7 MB) (optional). This contains:

A README-wikipediaMM file describing the provided data.
An imagesIDs.txt file listing all image identifiers.
An imagefile2metadatafile.txt file listing the correspondence between image and metadata files.

I recommend using the thumbnails of the images (2GB), to save disk space and processing time. If you want the full images (14 GB), email me and I will give you a password to download it from the imageCLEF 2008 website.

The topics are multimedia queries that can consist of a textual, visual and a conceptual part, with the latter two parts being optional. An example topic in the appropriate format is the following:

<topic>
  <number> 1 </number>
  <title> cities by night </title>
  <concept> building </concept>
  <image> http://www.bushland.de/hksky2.jpg </image>
  <narrative> I am decorating my flat and as I like photos of cities at night, I would like to find some that I could possibly print into posters. I would like to find photos of skylines or photos that contain parts of a city at night (including streets and buildings). Photos of cities (or the earth) from space are not relevant. </narrative>
</topic>

Therefore, the topics include the following fields:

title: query by keywords
concept: query by one or more concepts (optional)
image: query by one or more images (optional)
narrative: description of the information need, where the definitive definition of relevance and irrelevance is given
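
To make the topic format concrete, here is a minimal sketch of reading the topics into Python dictionaries. It assumes that every field has a properly closed tag and that all topics sit in one text file; the function name parse_topics and the choice of regular expressions over an XML parser are illustrative assumptions, not part of the assignment.

```python
import re

# Field names taken from the topic format described above.
FIELDS = ("number", "title", "concept", "image", "narrative")

def parse_topics(text):
    """Return a list of dicts, one per <topic> block (hypothetical helper)."""
    topics = []
    for block in re.findall(r"<topic>(.*?)</topic>", text, flags=re.S):
        topic = {}
        for field in FIELDS:
            # Collect every occurrence, since concept and image may repeat.
            values = re.findall(rf"<{field}>(.*?)</{field}>", block, flags=re.S)
            values = [v.strip() for v in values]
            if field in ("concept", "image"):
                topic[field] = values          # optional, possibly several
            else:
                topic[field] = values[0] if values else ""
        topics.append(topic)
    return topics
```

Keeping concept and image as lists reflects that a topic may carry one or more of each, or none at all.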

Relevance judgements:

The file contains binary relevance scores in the format used by trec_eval:

TopicNumber DummyColumn DOCNO BinaryRelevance:

where DummyColumn is always zero, DOCNO is the document name (without the extension .jpg), and BinaryRelevance is either 0 or 1.  The relevance scores are sorted by topic (query) number, in ascending order. Example:

1 0 1311 1
1 0 1228 1
1 0 12757 0
1 0 627788 1
…
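
If you want to evaluate inside your own program as well as with trec_eval, the judgments can be loaded with a few lines of code. This is only a sketch: the function name load_qrels is an assumption, and it keeps DOCNOs and topic numbers as strings.

```python
from collections import defaultdict

def load_qrels(lines):
    """Map topic number -> set of DOCNOs judged relevant (hypothetical helper)."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _dummy, docno, rel = parts
        if rel == "1":
            relevant[topic].add(docno)
    return relevant
```

A typical call would pass an open file object, for example load_qrels(open(path)), where path is wherever you saved the judgments file.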

Read more about the format of the collection (documents/images, queries, and relevance judgements) at http://www.imageclef.org/2008/wikipedia.

For text indexing and ranking, you can write your own code or adapt an existing IR system from the Internet to work on this collection and these queries (the system can use the vector space model or a more advanced model). Most such tools already perform the steps below; you can also implement some of the steps yourself and use a tool for the others.

1. Preprocessing the text descriptions [10 points]

Implement preprocessing functions for tokenization and stopword removal. The index terms will be all the words left after filtering out markup that is not part of the text, punctuation tokens, numbers, stopwords, etc. Optionally, you can use the Porter stemmer to stem the index words.

• Input: Documents that are read one by one from the collection (process the fields: title, description, notes, location)

• Output: Tokens to be added to the index (the vocabulary)

The same preprocessing should be applied on the text of the queries.
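
As an illustration of this step, the sketch below strips markup, lowercases the text, keeps alphabetic tokens only, and removes stopwords. The stopword set shown is a tiny placeholder (an assumption); a real run would load a full list, and stemming is left out.

```python
import re

# Tiny illustrative stopword list (assumption); load a complete list in practice.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "by", "is", "to", "for"}

def preprocess(text):
    """Turn raw field text into a list of index terms."""
    text = re.sub(r"<[^>]+>", " ", text)          # drop markup tags
    tokens = re.findall(r"[a-z]+", text.lower())  # letters only: no punctuation, no numbers
    return [t for t in tokens if t not in STOPWORDS]
```

For example, preprocess("<caption>Cities by Night, 2008!</caption>") yields ["cities", "night"]. Calling the same function on the query text keeps documents and queries in the same vocabulary.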

2. Indexing the text descriptions [10 points]

Build an inverted index, with an entry for each word in the vocabulary. You can use any appropriate data structure (hash table, linked lists, Access database, etc.). An example of a possible index is presented below. Note: if you use an existing IR system, use its indexing mechanism.

• Input: Tokens obtained from the preprocessing module (the vocabulary)

• Output: An inverted index for fast access
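
One possible in-memory layout, sketched here under the assumption that the whole collection fits in memory, is a hash table from each term to its postings, where the postings record the term frequency per document. The name build_index is illustrative.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """docs maps DOCNO -> list of tokens; returns term -> {DOCNO: term frequency}."""
    index = defaultdict(dict)
    for docno, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term][docno] = tf
    return index
```

The number of entries in index[term] is the document frequency of that term, which is what an idf weight needs later.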

3. Text retrieval and ranking [10 points]

Use the inverted index to find the limited set of documents that contain at least one of the query words. Compute the cosine similarity scores between a query and each document.

• Input: One query (the text) and the inverted index from Step 2

• Output: Similarity values between the query and each of the documents. Rank the documents in decreasing order of similarity scores.
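
The sketch below shows one way to carry out this step. It assumes tf-idf weights with a natural-log idf (the assignment does not prescribe a weighting scheme) and an index of the form term -> {DOCNO: term frequency}; the function name rank is hypothetical.

```python
import math
from collections import Counter, defaultdict

def rank(query_tokens, index, n_docs):
    """Return [(DOCNO, cosine)] in decreasing order of similarity.

    index maps term -> {DOCNO: term frequency}; n_docs is the collection size.
    """
    idf = {t: math.log(n_docs / len(postings)) for t, postings in index.items()}

    # Squared length of every document vector (precompute once in a real system).
    doc_len2 = defaultdict(float)
    for t, postings in index.items():
        for docno, tf in postings.items():
            doc_len2[docno] += (tf * idf[t]) ** 2

    # Query vector, restricted to terms that occur in the index.
    q = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in index}
    q_len = math.sqrt(sum(w * w for w in q.values()))

    # Accumulate dot products only for documents sharing a query term.
    dot = defaultdict(float)
    for t, w in q.items():
        for docno, tf in index[t].items():
            dot[docno] += w * tf * idf[t]

    scored = []
    for docno, d in dot.items():
        denom = q_len * math.sqrt(doc_len2[docno])
        scored.append((docno, d / denom if denom else 0.0))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Only documents that contain at least one query word ever receive a score, which is the point of going through the inverted index instead of scanning the collection.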

4. Indexing the images [10 points]

Use a tool to extract features from images (color, shape, texture, or other features). Index the images using these features, in order to allow Content-based Image Retrieval.

• Input: The set of all the features extracted from images

• Output: An inverted index for fast access
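
As one simple example of a colour feature, the sketch below builds a quantized RGB histogram and indexes each image under its dominant colour bins. Everything here is an assumption made for illustration: the pixels are taken to be (r, g, b) tuples already decoded by whatever image tool you choose, and the bin count and the 10% threshold are arbitrary.

```python
from collections import defaultdict

def color_histogram(pixels, bins_per_channel=4):
    """pixels: iterable of (r, g, b) with values 0..255. Returns a normalized histogram."""
    k = bins_per_channel
    hist = [0.0] * (k ** 3)
    n = 0
    for r, g, b in pixels:
        # Map each channel to a bin in 0..k-1, then flatten to one index.
        idx = (r * k // 256) * k * k + (g * k // 256) * k + (b * k // 256)
        hist[idx] += 1
        n += 1
    return [h / n for h in hist] if n else hist

def index_images(histograms, min_share=0.1):
    """histograms: DOCNO -> histogram. Returns bin -> {DOCNO: share} for dominant bins."""
    index = defaultdict(dict)
    for docno, hist in histograms.items():
        for b, share in enumerate(hist):
            if share >= min_share:
                index[b][docno] = share
    return index
```

The second function plays the same role as the text index: given the dominant bins of a query image, it returns only the images that share at least one of them.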

5. Image retrieval and ranking [10 points]

• Input: One query (the image) and the inverted index from Step 4

• Output: Similarity values between the query and each of the images. Rank the documents/images in decreasing order of similarity scores.
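
A minimal sketch of this step, assuming each image is represented by a normalized histogram of equal length, is shown below. Histogram intersection is only one of many possible similarity measures, and the function names are hypothetical; in a full system the candidates passed in would come from the index built in Step 4 instead of the whole collection.

```python
def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for two normalized histograms of equal length."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def rank_images(query_hist, histograms):
    """histograms: DOCNO -> histogram. Returns [(DOCNO, score)], best first."""
    scored = [(docno, histogram_intersection(query_hist, h))
              for docno, h in histograms.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```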

6. Text and image retrieval and ranking [10 points]

Combine the similarity scores for text matching and image matching to provide a final similarity score for matching a query with a document/image.
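
One common way to combine the two score lists, sketched here as an assumption and not as the required method, is to rescale each list to [0, 1] with min-max normalization and take a weighted sum. The weight alpha is a hypothetical parameter that you would tune against the relevance judgments.

```python
def min_max(scores):
    """Rescale a {DOCNO: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def combine(text_scores, image_scores, alpha=0.7):
    """Weighted sum of normalized scores; a missing score counts as 0."""
    t, i = min_max(text_scores), min_max(image_scores)
    fused = {d: alpha * t.get(d, 0.0) + (1 - alpha) * i.get(d, 0.0)
             for d in set(t) | set(i)}
    return sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))
```

Normalizing first matters because text and image similarities usually live on different scales, so a raw sum would let one modality dominate.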

7. Results [15 points]

Run your system on the set of 75 test queries. Include the output in your submission as a file named Results (for your best run).

The file should have the following format, for the top-100 results for each query (the queries should be ordered in ascending order):

topic_id 1 docno rank score tag
where topic_id is the topic/query number; the second field is unused (set to 1); docno is the document id taken from the DOCNO field of the image/document (without the extension .jpg); rank is the rank assigned by your system to the image/document (1 is the highest rank); score is the computed degree of match between the image and/or text and the topic; and tag is a unique identifier you choose for this run (the same for every topic and document). Example:

1 1 236 1 0.8032 run_name
1 1 555 2 0.7586 run_name
1 1 444 3 0.6517 run_name
….
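
The lines of the Results file can be produced with a small helper like the one sketched below. The function name and the four-decimal score format are assumptions; the field order follows the format described above.

```python
def format_results(topic_id, ranked, tag, limit=100):
    """ranked: [(DOCNO, score)] sorted best first. Returns lines for the Results file."""
    lines = []
    for rank, (docno, score) in enumerate(ranked[:limit], start=1):
        lines.append(f"{topic_id} 1 {docno} {rank} {score:.4f} {tag}")
    return lines
```

Calling it once per topic, in ascending topic order, and writing the returned lines one per row gives a file in the required layout.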

8. Report [25 points]

- write a Report file (plain text, Word, or pdf) including:

* names and student numbers (for the group of two, specify how the tasks were divided).

* a detailed note about the functionality of your programs, and auxiliary tools.

* complete instructions on how to run them.

* explain the algorithms, data structures, and optimizations that you used. Discuss your results.

* include the Mean Average Precision (MAP) score computed with trec_eval for the results on the 75 test queries for your best results (for the matching of both text and images). Include the MAP score for other runs (such as only text matching, only image matching).
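
For a quick check while developing, MAP can also be computed directly, as sketched below. This is an illustration under stated assumptions (it averages over every judged topic that has at least one relevant document, and the function names are hypothetical); the score you report should still come from trec_eval, whose handling of edge cases may differ.

```python
def average_precision(ranked_docnos, relevant):
    """ranked_docnos: list, best first; relevant: set of relevant DOCNOs."""
    hits, total = 0, 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank   # precision at this relevant document
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """runs: topic -> ranked DOCNOs; qrels: topic -> set of relevant DOCNOs."""
    topics = [t for t in qrels if qrels[t]]
    if not topics:
        return 0.0
    return sum(average_precision(runs.get(t, []), qrels[t]) for t in topics) / len(topics)
```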

Tools and resources:

You can use any other resources from the Internet, as long as you explain in your report how you used them (compilation, installation, adaptation, etc.). As mentioned, you can use any text IR system available on the Internet (Terrier, Lucene, Lemur, etc.).

Content-based image retrieval tools: (some of them deal only with images, some might also deal with text annotations; you need to see which tool can actually be used)

Several systems: http://en.wikipedia.org/wiki/CBIR#Free_.26_Open_Source

The FIRE image retrieval system

Some baseline systems on the ImageCLEF webpage

Submission instructions:

- include the Report file.

- include a file named Results with the results for all the test queries for your best run, in the required format.

- include all your programs, but not the extra tools (for those you only have to explain how to install and use them).

- submit your assignment, including programs, Report file, and Results file, as a zip file through Virtual Campus

- don’t include the initial collection.

Note: If you worked in a group only one student needs to submit the assignment.

Have fun!!!