CSI4107, Winter 2012

Assignment 2

Due Monday April 2, 22:00

Web Spider and Domain-Specific Corpus Collection [50 points]

Note: You will work in groups of two students.

In this assignment, you will create and unleash a web spider that retrieves web pages about a specific topic.

You can choose the topic. Manually select 10 webpages that are about the topic you chose.

Examples of topics: computer science, software engineering, geography, math, history, etc.

You will store the texts extracted from the collected webpages in order to create a domain-specific corpus.

The collected texts could be used with the IR system that you implemented in A1 (this is optional).

As part of your web spider (also known as a crawler, robot, agent, etc.), implement the following:

1. [10 marks] Implement a URL extractor for HTML webpages. Make sure you transform the links to canonical form.

Compute the number of links going out of a webpage. Include in your README file all the links extracted from the following two webpages: http://www.uottawa.ca/welcome.html, http://www.site.uottawa.ca/eng/index.html. How many links were there for each of the two webpages?

2. [5 marks] Implement a text extractor that extracts the text from an HTML webpage.
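One possible shape for the URL extractor is sketched below in Python, using only the standard library. The names LinkExtractor, canonicalize, and extract_links are illustrative, and the canonical form chosen here (lowercase scheme and host, default port removed, fragment dropped, relative links resolved against the page URL) is one reasonable convention, not the required one.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit


class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def canonicalize(url, base):
    """Resolve url against base and normalize it (assumed convention)."""
    parts = urlsplit(urljoin(base, url.strip()))
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment: it only points inside the same page.
    return urlunsplit((scheme, host, path, parts.query, ""))


def extract_links(html, base):
    """Return the distinct canonical http(s) links, in order of appearance."""
    parser = LinkExtractor()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        link = canonicalize(href, base)
        if link.startswith(("http://", "https://")) and link not in seen:
            seen.add(link)
            links.append(link)
    return links
```

The number of outgoing links of a page is then len(extract_links(html, page_url)). Whether duplicates should be counted once or several times is a choice to state in the README.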

Include in your README file the plain text extracted from the same two webpages.
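A minimal sketch of a text extractor, again in Python with the standard library only. The names TextExtractor and extract_text are illustrative. It assumes that the contents of script and style elements are not wanted, and it separates text chunks with a space, which can occasionally split a word that is interrupted by an inline tag.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Text inside a skipped element is ignored.
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_text(html):
    """Return the text of an HTML page with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the chunks, then collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.chunks).split())
```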

3. [10 marks] Implement a text similarity module for detecting domain-specific texts.

Implement a measure of similarity between two texts. It can be a simple measure, such as the number of terms in common, normalized by the lengths of the two texts. You will use it to compute the similarity between the text of a new webpage found by your web crawler and the text of the 10 webpages about your topic (concatenated). Using the similarity measure, the crawler will be able to decide if the new webpage is about your chosen topic.
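As an illustration of the simple measure described above, the sketch below counts the distinct terms two texts have in common and divides by the geometric mean of their vocabulary sizes. This particular normalization, the tokenizer, and the threshold value in is_on_topic are assumptions made for the example; the threshold in particular would need tuning on real pages.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def similarity(text_a, text_b):
    """Shared distinct terms, normalized by the sizes of both vocabularies."""
    terms_a, terms_b = set(tokenize(text_a)), set(tokenize(text_b))
    if not terms_a or not terms_b:
        return 0.0
    common = len(terms_a & terms_b)
    return common / math.sqrt(len(terms_a) * len(terms_b))


def is_on_topic(text, reference, threshold=0.1):
    """Decide relevance; reference is the 10 seed texts concatenated.

    The default threshold is a placeholder, not a recommended value.
    """
    return similarity(text, reference) >= threshold
```

The score is 1.0 for two texts with identical vocabularies and 0.0 for texts with no term in common. Removing stop words before comparing is a natural refinement.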

4. [15 marks] The actual crawler. Your agent will have to perform the following tasks:

1) Start with the URLs for the 10 web pages that you collected (extract all the links from them, so that you can follow these links).

2) Perform a Web traversal using a breadth-first strategy.

3) Keep track of the traversed URLs, making sure that

a. they were not already traversed (i.e., avoid duplicates and cycles),

b. you respect the robot exclusion protocol (robots.txt files) and robots meta tags,

c. you set a time limit in case some pages are not available.

4) For each new URL, extract the text, and store it only if it is about the chosen topic. Use the similarity with the 10 texts concatenated in order to decide if the new text should be kept (for example, if the similarity is over a threshold).

5) Make sure you collect at least 1000 texts (that is, it is OK to stop your agent after it has collected 1000 texts, but feel free to collect more if you want).
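Steps 1) to 5) can be sketched as a single breadth-first loop. In the Python sketch below, crawl and fetch are illustrative names, and the link extractor, text extractor, relevance test, and robots check are passed in as functions so the loop itself stays short. Two assumptions are made that the steps above do not fix: the seed pages are expanded but not stored in the corpus, and links are followed from every fetched page whether or not the page was kept. For the robots check, Python's standard urllib.robotparser module is one option for robots.txt files; robots meta tags would need a separate check on the fetched HTML.

```python
from collections import deque
import urllib.request


def fetch(url, timeout=10):
    """Return the page as text, or None if it cannot be retrieved in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (OSError, ValueError, LookupError):
        # Covers timeouts, network errors, bad URLs, unknown charsets.
        return None


def crawl(seeds, fetch_page, links_of, text_of, is_relevant, is_allowed,
          max_texts=1000):
    """Breadth-first traversal; returns (traversed URLs, {url: kept text})."""
    seed_set = set(seeds)
    queue = deque(seeds)
    seen = set(seeds)        # every URL ever queued: avoids duplicates and cycles
    traversed = []
    corpus = {}
    while queue and len(corpus) < max_texts:
        url = queue.popleft()            # FIFO order gives breadth-first search
        if not is_allowed(url):          # e.g. a robots.txt check
            continue
        html = fetch_page(url)
        if html is None:                 # unavailable or timed out
            continue
        traversed.append(url)
        text = text_of(html)
        if url not in seed_set and is_relevant(text):
            corpus[url] = text
        for link in links_of(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return traversed, corpus
```

The lengths of the two returned collections are the two counts requested for the README: pages traversed and pages collected.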

Include the following information in the README file: the number of pages (URLs) traversed by your agent, and the number of pages that you collected about the chosen topic.

Manually look at 30 of the collected texts. Report the precision (the fraction of the 30 texts that were about the chosen topic). Include 2 of these texts in your README file.
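The precision computation is simple arithmetic; a small Python sketch follows. Drawing the 30 texts at random is an assumption made here to keep the manual check unbiased, and the function names are illustrative.

```python
import random


def sample_for_review(urls, k=30, seed=0):
    """Pick k collected URLs at random for the manual check."""
    rng = random.Random(seed)
    urls = list(urls)
    return rng.sample(urls, min(k, len(urls)))


def precision(judgements):
    """Fraction of manually judged texts marked on-topic (True)."""
    if not judgements:
        return 0.0
    return sum(judgements) / len(judgements)
```

For example, if 24 of the 30 inspected texts are on topic, the precision is 24/30 = 0.8.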

Resources: Any resources you want to use. Include in the README file explanations on how you used them. One resource you can explore is the package ir.vsr. Some documentation on ir.vsr is available here. Slides that might help with implementation details are here.

Submission instructions:

- Write a README file (plain text, PDF, or Word format) [10 points for this report] including:

* the names and student numbers of the students in the group, and specify how the tasks were divided,

* the topic you chose and the URLs of the 10 pages you collected as a starting point,

* a detailed note about the functionality of your programs,

* complete instructions on how to run them,

* an explanation of the algorithms and data structures that you used, etc.

- Produce a file named Result containing the list of all URLs that you extracted. How many are there?

- Produce a zip file (named Corpus.zip) containing the texts collected by your web crawler. How many texts did you collect?

- Submit your assignment (programs, README file, Result file, and Corpus.zip file) as a single zip file through the Virtual Campus.

Have fun!!!