Winter 2012
CSI4107: Information Retrieval and the Internet
Instructor: Diana Inkpen
Office: SITE 5015
E-mail: diana@site.uottawa.ca
Telephone: 562-5800 ext. 6711
Meeting Times and Locations
- Wed 4-5:30pm in LPR 154 and Fri 2:30-4pm in LMX 106
Office Hours: Fri 12:30-1:30pm or by email appointment, in SITE 5015.
Overview
Basic principles of Information Retrieval. Indexing methods. Query processing. Linguistic aspects of Information Retrieval. Agents and artificial intelligence approaches to Information Retrieval. Relation of Information Retrieval to the World Wide Web. Search engines. Servers and clients. Browser and server side programming for Information Retrieval.
Pre-Requisites: (CSI3103 or ELG3300), (CSI3125 or CSI2115 or SEG2101), or permission from the instructor.
Announcements:
- The final marks are posted in the Virtual Campus. Here is the solution to the final exam.
- There is an extra lecture (exam revision) on Tue, April 10, 2:30 pm in LMX 106: the lecture from Fri, April 6 (Easter break) is moved to the last day of classes, Tue, April 10.
- Exam preparation page.
- The midterm marks are available in the Virtual Campus. Here is the solution to the midterm.
- Midterm preparation page. There will be extra office hours Mon, Feb 27, 2-4pm (and no office hours Fri, Feb 24).
- Please note the change of classroom locations.
- Assignment 1 is posted. The due date is extended to Feb 20, 22:00.
Evaluation
Students will be evaluated on:
- Two written and programming assignments / group project (30%). The programming language will be Java.
  Note: The assignments will be submitted electronically through Virtual Campus. No late assignments are accepted.
- Midterm exam (15%)
- One in-class presentation (15%) (see Presentations schedule)
- Final exam (40%)
- Bonus points for class participation
Timetable (no late assignments are considered)
- Assignment 1, due Fri, Feb 10, 14:00, extended to Feb 20, 22:00.
- In-class presentation (see Presentations schedule)
- Midterm (Fri, March 2, 14:30, in class)
- Assignment 2, due Mon, April 2, 22:00.
- Final exam (during exam period)
Recommended Textbook
Information Retrieval, by D. Grossman and O. Frieder, Springer, 2004 (second edition).
Other books:
- An online book: Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008.
- Another online book: Information Retrieval, by C. J. van Rijsbergen (1979).
- Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 1999. Companion website to this book.
- Course notes (additional reading, pdf file)
Syllabus (subject to minor modifications)
(The lecture slides will be in pdf format; you can read them with Acrobat Reader.)
Credit: some of the lecture notes were originally designed by Prof. Ray Mooney, University of Texas at Austin.
Week 1: Jan 5, 7
Preliminaries.
Introduction: Goals and history of IR. The impact of the web on IR. The role of artificial intelligence (AI) in IR.
The Internet and the WWW: History of the Internet. TCP/IP. IP addresses. WWW. HTTP. HTML. Web servers and clients.
Links:
- Top search engines in US in 2010
- Search engine watch
- TREC
- CLEF
Week 2: Jan 11, 13
Basic IR Models: Boolean and vector-space retrieval models; ranked retrieval; text-similarity metrics; TF-IDF (term frequency/inverse document frequency) weighting; cosine similarity.
Slides on Implementation of Vector Space Model
Extra slides on cosine measure
Example discussed in class
Solution to the example.
Week 3: Jan 18, 20
Experimental Evaluation of IR: Performance metrics: recall, precision, and F-measure; evaluations on benchmark text collections. Interpolated precision.
Example discussed in class
Solution to the example.
Week 4: Jan 25, 27
Query Operations and Languages: Relevance feedback; query expansion; query languages.
Example discussed in class
Solution (do it by yourself first)
Links:
- WordNet
- Corpus-based Similarity Demo
- Dekang Lin's Demos
- WordNet::Similarity
Week 5: Feb 1, 3
Image Information Retrieval
Links:
- Content-based image retrieval
- ESP Game for labeling images
Week 6: Feb 8, 10
Web Search: Search engines; spidering; metacrawlers; directed spidering; link analysis (e.g. hubs and authorities, Google PageRank); shopping agents.
Extra slides on Link Analysis: the hubs and authorities algorithm, and the PageRank algorithm.
PageRank
Hubs and authorities example discussed in class
Solution (do it by yourself first)
PageRank examples
Links:
- Google Tech Overview
- Google - Parallel architecture
- Slides about the Google 1998 paper
Week 7: Feb 15, 17
Text Representation: Word statistics; Zipf's law; Porter stemmer; morphology; index term selection; using thesauri. Metadata and markup languages (SGML, HTML, XML).
More slides on Web markup languages: HTML, XML, XHTML, RDF, OWL
Links:
- Semantic Web
Example: term frequencies in Tom Sawyer
Week 8: Feb 22, 24
Study Break (Reading Week, no classes)
Week 9: Feb 29, Mar 2
Wed, Feb 29: Midterm revision
Fri, Mar 2: Midterm (in class)
Week 10: Mar 7, 9
Text Categorization: Categorization algorithms: decision trees; Rocchio; k-nearest neighbor; Naive Bayes.
Links:
- Weka data mining tool
Extra slides on Naive Bayes
Week 11: Mar 14, 16
Text Clustering: Clustering algorithms: agglomerative clustering; k-means. Applications to information filtering and organization.
Examples of text classification and clustering discussed in class
Solution (do it by yourself first)
Week 12: Mar 21, 23
Advanced IR Models: Probabilistic models; Latent Semantic Indexing (LSI); Language Models.
Extra slides on LSI.
Language Models for Information Retrieval.
Week 13: Mar 28, 30
Question Answering: Retrieving precise short answers to natural language queries.
QA System Demos. Slides about IBM's Watson.
Links to IBM's Watson:
- Deep QA
- Answers
- Ottawa Citizen article
Week 14: Apr 4, 10
Cross-Language IR
Exam revision