CSI4107, Winter 2012
Assignment 2
Due Monday April 2, 22:00
Web Spider and Domain-Specific Corpus Collection [50 points]
Note: You will work in groups of two students.
In this assignment, you will create and unleash a web spider that retrieves web pages about a specific topic.
You can choose the topic. Manually select 10 web pages that are about the topic you chose.
Examples of topics: computer science, software engineering, geography, math, history, etc.
You will store the texts extracted from the collected web pages in order to create a
domain-specific corpus.
The collected texts could be used with the IR system that you implemented in A1 (this is optional).
As part of your web spider (also known as a crawler, robot, agent, etc.), implement the following:
1. [10 marks] Implement a URL extractor for HTML web pages. Make sure you transform the links to
canonical form.
Compute the number of links going out of a web page. Include in your README file all the links
extracted from the following two web pages: http://www.uottawa.ca/welcome.html and
http://www.site.uottawa.ca/eng/index.html.
How many links were there for each of the two web pages?
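One possible shape for the URL extractor is sketched below, using only the Python standard library. The names LinkParser, canonicalize and extract_links are hypothetical, and the notion of "canonical form" used here (resolve relative links, lowercase the host, drop default ports and fragments) is an assumption; your own definition may differ. Duplicate links are removed, which is also a choice you should state in your README.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit


class LinkParser(HTMLParser):
    """Collects the raw href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value.strip())


def canonicalize(base_url, href):
    """Resolve href against base_url and normalize it; None for non-HTTP links."""
    parts = urlsplit(urljoin(base_url, href))
    if parts.scheme not in ("http", "https"):
        return None
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}[parts.scheme]
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    # Drop the fragment; an empty path becomes "/".
    return urlunsplit((parts.scheme, host, parts.path or "/", parts.query, ""))


def extract_links(base_url, html):
    """Return the distinct canonical links of a page, in document order."""
    parser = LinkParser()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        url = canonicalize(base_url, href)
        if url and url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

The number of outgoing links of a page is then len(extract_links(url, html)).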
2. [5 marks] Implement a text extractor that extracts the text out of an HTML web page.
Include in your README file the plain text extracted from the same two web pages.
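A minimal sketch of a text extractor, again with the standard library only. TextExtractor and extract_text are hypothetical names. The sketch assumes that the contents of script and style elements are not wanted and that runs of whitespace can be collapsed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_text(html):
    """Return the visible text of an HTML page with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(parser.chunks).split())
```

Joining the chunks with a space keeps words in adjacent block elements apart, at the cost of occasionally splitting a word that is interrupted by an inline tag.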
3. [10 marks] Implement a text similarity module for detecting domain-specific texts.
Implement a measure of similarity between two texts. It can be a simple measure, such as the
number of terms in common, normalized by the lengths of the two texts. You will use it to compute
the similarity between the text of a new web page found by your web crawler and
the text of the 10 web pages about your topic (concatenated). Using the similarity measure,
the crawler will be able to decide if the new web page is about your chosen topic.
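As an illustration of a simple measure of this kind, the sketch below counts the distinct terms two texts share and normalizes by the sizes of their vocabularies (the Dice coefficient). This is only one possible choice; the tokenizer, the use of distinct terms rather than term frequencies, and the names terms and similarity are all assumptions.

```python
import re


def terms(text):
    """Lowercased set of alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(text_a, text_b):
    """Dice coefficient over distinct terms: 2 * |A & B| / (|A| + |B|)."""
    a, b = terms(text_a), terms(text_b)
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))
```

For example, similarity("web crawler design", "crawler design notes") shares 2 of 3 terms on each side and gives 2*2/(3+3), about 0.67. Because the concatenated topic text is much longer than a single page, the scores for real pages will usually be small, so the threshold has to be tuned on examples.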
4. [15 marks] The actual crawler. Your agent will have to perform
the following tasks:
1) Start with the URLs for the 10 web
pages that you collected (extract all the links from them, so that you can
follow these links).
2) Perform a Web traversal using a
breadth-first strategy.
3) Keep track of the traversed URLs, making sure that
a. each URL is traversed only once (i.e., avoid duplicates, avoid cycles);
b. you respect the robot exclusion protocol (robots.txt files) and robots meta tags;
c. you set a time limit on each request, in case some pages are not available.
4) For each new URL, extract the text,
and store it only if it is about the chosen topic. Use the similarity with the
10 texts concatenated in order to decide if the new text should be kept (for
example, if the similarity is over a threshold).
5) Make sure you collect at least 1000 texts (that is, it is OK to stop your agent after it
has collected 1000 texts, but feel free to collect more if you want).
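The steps above can be sketched as a breadth-first loop over a FIFO queue. All names here (crawl, fetch, is_allowed, USER_AGENT) are hypothetical. The crawl function takes its page-fetching, link-extraction, text-extraction and topic-test functions as parameters, so it can be tried on a small fake web before being pointed at real sites. The sketch assumes that seed pages are only expanded and not stored, that pages decode as UTF-8, and that a host whose robots.txt cannot be reached is skipped. Robots meta tags are not handled here.

```python
import http.client
import urllib.request
import urllib.robotparser
from collections import deque
from urllib.parse import urlsplit

USER_AGENT = "csi4107-spider"  # hypothetical agent name
_robots = {}                   # one parsed robots.txt per site root


def is_allowed(url):
    """Check robots.txt for the URL's host (fetched once per host)."""
    parts = urlsplit(url)
    root = f"{parts.scheme}://{parts.netloc}"
    if root not in _robots:
        rp = urllib.robotparser.RobotFileParser(root + "/robots.txt")
        try:
            rp.read()
        except (OSError, ValueError, http.client.HTTPException):
            rp = None  # assumption: skip hosts whose robots.txt is unreachable
        _robots[root] = rp
    rp = _robots[root]
    return rp is not None and rp.can_fetch(USER_AGENT, url)


def fetch(url, timeout=10):
    """Return the page as text, or None on any error or timeout."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        return None


def crawl(seeds, fetch, get_links, get_text, is_on_topic, is_allowed, max_texts=1000):
    """Breadth-first traversal; returns (traversed URLs, {url: kept text})."""
    seed_set = set(seeds)
    queue = deque(dict.fromkeys(seeds))  # FIFO queue, duplicate seeds removed
    seen = set(seeds)                    # every URL ever enqueued
    traversed, kept = [], {}
    while queue and len(kept) < max_texts:
        url = queue.popleft()
        if not is_allowed(url):
            continue
        html = fetch(url)
        if html is None:
            continue
        traversed.append(url)
        if url not in seed_set:          # assumption: seeds are not stored
            text = get_text(html)
            if is_on_topic(text):
                kept[url] = text
        for link in get_links(url, html):
            if link not in seen:         # avoids duplicates and cycles
                seen.add(link)
                queue.append(link)
    return traversed, kept
```

len(traversed) and len(kept) are the two counts requested for the README. A real crawler would also pause between requests to the same host.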
Include the following information in
the README file: the number of pages (URLs) traversed by your agent, and the
number of pages that you collected about the chosen topic.
Manually look at 30 of the collected
texts. Report the precision (how many of the 30 texts were about the chosen
topic). Include 2 of these texts in your README file.
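The precision over the sample is the fraction of the inspected texts that you judged to be on topic. A small sketch follows, with hypothetical helper names; drawing the 30 texts at random rather than taking the first 30 is a suggestion, not a requirement of the assignment.

```python
import random


def sample_for_review(texts, k=30, seed=0):
    """Pick up to k texts at random (fixed seed so the sample is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(texts, min(k, len(texts)))


def precision(judgments):
    """judgments: list of booleans, True when the inspected text is on topic."""
    if not judgments:
        raise ValueError("no judgments given")
    return sum(judgments) / len(judgments)
```

For instance, if 24 of the 30 inspected texts are on topic, the precision is 24/30 = 0.8.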
Resources: Any resources you want to use. Include in the
README file explanations on how you used them. One resource you can explore is
the package ir.vsr.
Some documentation on ir.vsr is available here.
Slides that might help with implementation details are here.
Submission instructions:
- Write a README file (plain text, PDF, or Word format) [10 points for this report] including:
* the names and student numbers of the students in the group, and how the tasks were divided,
* the topic you chose and the URLs of the 10 pages you collected as a starting point,
* a detailed note about the functionality of your programs,
* complete instructions on how to run them,
* an explanation of the algorithms and data structures that you used, etc.
- Produce a file named Result containing the list of all URLs that you extracted. How many are there?
- Produce a zip file (named Corpus.zip) containing the texts collected by your web crawler.
How many texts did you collect?
- Submit your assignment, including programs, README file, the Result file, and the Corpus.zip
file, as a zip file through the Virtual Campus.
Have fun!!!