Winter 2013
CSI4107: Information Retrieval and the Internet
Instructor: Diana Inkpen
Office: SITE 5015
E-mail: diana@site.uottawa.ca
Telephone: 562-5800 ext. 6711
Meeting Times and Locations
Office Hours: TBA or by email appointment, in SITE 5015.
Overview
Basic principles of Information Retrieval. Indexing methods. Query
processing. Linguistic aspects of Information Retrieval. Agents and
artificial intelligence approaches to Information Retrieval. Relation
of Information Retrieval to the World Wide Web. Search engines. Servers
and clients. Browser and server side programming for Information
Retrieval.
Pre-Requisites: (CSI3103 or ELG3300), (CSI3125 or CSI2115 or SEG2101),
or permission from the instructor.
Announcements:
- The final marks are posted in the Blackboard Learn system. Here is the
solution to the exam. If you need to see
your exam, there will be office hours on Monday, April 29, 2-4pm.
- Exam preparation page
- Assignment 2 is posted.
- Midterm preparation page
- The deadline for A1 was extended to Feb 15. A new version of the
corpus for Assignment 1 is available here. Please use this version instead of
en.zip because it is more complete.
- Assignment 1 is posted.
Evaluation
Students will be evaluated on:
- Two written and programming assignments / group project (30%). The
programming language will be Java.
Note: The assignments will be submitted electronically
through Virtual Campus (the new version, called Blackboard Learn). No
late assignments are accepted.
- Midterm exam (15%)
- One in-class presentation (15%) (see Presentations schedule)
- Final exam (40%)
- Bonus points for class participation
Timetable (no late assignments are considered)
- Assignment 1, due Mon, Feb 11, extended till Fri, Feb 15, 22:00.
- In-class presentation (see Presentations schedule)
- Midterm (Fri, March 1, 14:30, in class) Solutions
- Assignment 2, due Mon, April 1, extended till April 8, 22:00.
- Final exam (during exam period)
Recommended Textbook
Introduction to Information Retrieval, by Christopher D. Manning,
Prabhakar Raghavan and Hinrich Schütze, Cambridge University
Press, 2008 (online version available)
Other books:
Information Retrieval, by D. Grossman and O. Frieder, Springer,
2004 (second edition).
Another online book
Information Retrieval, by C. J. van Rijsbergen (1979)
Modern Information Retrieval, by Ricardo Baeza-Yates and
Berthier Ribeiro-Neto, 1999.
Companion website to this book.
Course notes (additional reading, pdf file)
Syllabus (subject to minor modifications)
(The lecture slides will be in pdf format; you can read them with Acrobat Reader.)
Credit: some of the lecture notes were originally designed by
Prof. Ray Mooney, University of Texas at Austin.
Week 1: Jan 9, 11
Preliminaries.
Introduction:
Goals and history of IR. The impact of the web on IR. The
role of artificial intelligence (AI) in IR.
The Internet and the WWW:
History of the Internet. TCP/IP. IP
addresses. WWW. HTTP. HTML. Web servers and clients.
Links: Top search engines in US in 2010
Search engine watch
TREC
CLEF
Week 2: Jan 16, 18
Basic IR Models:
Boolean and vector-space retrieval models; ranked retrieval;
text-similarity metrics; TF-IDF (term frequency/inverse document
frequency) weighting; cosine similarity.
Slides on Implementation of Vector Space Model
Extra slides on cosine measure
Example discussed in class
Solution to the example.
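As a rough illustration of the TF-IDF weighting and cosine similarity topics of this week, here is a minimal Java sketch. It is not the course's reference implementation: the class and method names are hypothetical, tokenization is plain whitespace splitting, and the weight is assumed to be tf * log2(N / df), which is only one of several common variants.

```java
import java.util.*;

// Minimal sketch: TF-IDF vectors and cosine similarity over a tiny in-memory corpus.
// Assumption: weight = tf * log2(N / df); names are illustrative only.
final class TfIdfCosine {

    // Raw term frequencies of a lowercased, whitespace-tokenized text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Document frequency: in how many documents each term occurs.
    static Map<String, Integer> documentFrequencies(List<String> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            for (String term : termFrequencies(doc).keySet()) df.merge(term, 1, Integer::sum);
        }
        return df;
    }

    // TF-IDF vector of one text; terms that never occur in the corpus are skipped.
    static Map<String, Double> tfIdf(String text, Map<String, Integer> df, int numDocs) {
        Map<String, Double> vec = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFrequencies(text).entrySet()) {
            Integer d = df.get(e.getKey());
            if (d == null) continue;
            double idf = Math.log((double) numDocs / d) / Math.log(2);
            vec.put(e.getKey(), e.getValue() * idf);
        }
        return vec;
    }

    // Cosine of the angle between two sparse vectors; 0 if either vector is all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

To rank documents for a query, one would build the query vector with the same tfIdf method and sort the documents by decreasing cosine score.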
Week 3: Jan 23, 25
Experimental Evaluation of IR:
Performance metrics: recall, precision, and F-measure; evaluations on
benchmark text collections. Interpolated precision.
Example discussed in class
Solution to example.
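The metrics of this week can be sketched in a few lines of Java. This is an illustrative sketch under stated assumptions, not the course's own code: the class and method names are hypothetical, documents are identified by string ids, the F-measure is the balanced F1, and interpolated precision at a recall level is taken as the maximum precision over all rank cutoffs whose recall is at least that level.

```java
import java.util.*;

// Minimal sketch of set-based and rank-based evaluation metrics (names are illustrative).
final class IrMetrics {

    // Number of retrieved documents that are relevant (assumes no duplicate ids).
    static int hits(List<String> retrieved, Set<String> relevant) {
        int count = 0;
        for (String id : retrieved) {
            if (relevant.contains(id)) count++;
        }
        return count;
    }

    // Fraction of retrieved documents that are relevant.
    static double precision(List<String> retrieved, Set<String> relevant) {
        return retrieved.isEmpty() ? 0.0 : (double) hits(retrieved, relevant) / retrieved.size();
    }

    // Fraction of relevant documents that were retrieved.
    static double recall(List<String> retrieved, Set<String> relevant) {
        return relevant.isEmpty() ? 0.0 : (double) hits(retrieved, relevant) / relevant.size();
    }

    // Balanced F-measure (harmonic mean of precision and recall).
    static double fMeasure(double p, double r) {
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    // Interpolated precision: best precision at any rank cutoff with recall >= level.
    static double interpolatedPrecision(List<String> ranked, Set<String> relevant, double level) {
        double best = 0.0;
        for (int k = 1; k <= ranked.size(); k++) {
            List<String> topK = ranked.subList(0, k);
            if (recall(topK, relevant) >= level) {
                best = Math.max(best, precision(topK, relevant));
            }
        }
        return best;
    }
}
```

For a ranking d1..d5 with relevant set {d1, d3, d6}, precision is 2/5, recall is 2/3, and the balanced F-measure works out to 0.5.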
Week 4: Jan 30, Feb 1
Query Operations and Languages:
Relevance feedback; Query expansion; Query languages.
Example discussed in class
Solution (do it by yourself first)
Links:
WordNet
Corpus-based Similarity Demo
Dekang Lin's Demos
WordNet::Similarity
Week 5: Feb 6, 8
Image Information Retrieval
Links: Content-based image retrieval
ESP Game for labeling images
Text Representation:
Word statistics; Zipf's law; Porter stemmer; morphology; index term
selection; using thesauri. Metadata and markup languages (SGML, HTML, XML).
More slides on Web markup languages: HTML, XML, XHTML, RDF, OWL
Links: Semantic Web
Example: term frequencies in Tom Sawyer
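To experiment with word statistics and Zipf's law on any text, a rank-frequency table can be built as in the sketch below. This is only an illustration with hypothetical names; it assumes English text, treats every run of non-letters as a separator, and does no stemming or stop-word removal. Zipf's law predicts that frequency times rank stays roughly constant on a large text.

```java
import java.util.*;

// Minimal sketch: rank-frequency table for inspecting Zipf's law (names are illustrative).
final class ZipfTable {

    // Returns (term, count) pairs sorted by decreasing count, ties broken alphabetically.
    static List<Map.Entry<String, Integer>> rankedFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return ranked;
    }

    // Prints rank, term, frequency, and rank * frequency for the top entries.
    static void print(List<Map.Entry<String, Integer>> ranked, int top) {
        for (int i = 0; i < Math.min(top, ranked.size()); i++) {
            Map.Entry<String, Integer> e = ranked.get(i);
            System.out.println((i + 1) + "\t" + e.getKey() + "\t" + e.getValue()
                    + "\t" + (i + 1) * e.getValue());
        }
    }
}
```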
Week 6: Feb 13, 15
Web Search:
Search engines; spidering; metacrawlers; directed spidering; link
analysis (e.g. hubs and authorities, Google PageRank); shopping agents.
Extra slides on Link Analysis: the hubs and authorities algorithm, and
the PageRank algorithm.
PageRank
Hubs and authorities example discussed in class
Solution (do it by yourself first)
PageRank examples
Links: Google Tech Overview
Google - Parallel architecture
Slides about the Google 1998 paper
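For the link analysis part of this week, the following Java sketch shows PageRank computed by power iteration on a small graph. It is a simplified illustration, not the algorithm as presented in the course slides: the class name is hypothetical, the graph is an array of out-link lists, and the rank of pages without out-links is assumed to be spread uniformly over all pages. The damping factor (commonly 0.85) is passed in as a parameter.

```java
import java.util.Arrays;

// Minimal sketch: PageRank by power iteration (names and dangling-node handling are assumptions).
final class PageRankSketch {

    // outLinks[u] lists the pages that page u links to. Returns ranks that sum to 1.
    static double[] pageRank(int[][] outLinks, double damping, int iterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0;
            for (int u = 0; u < n; u++) {
                if (outLinks[u].length == 0) {
                    dangling += rank[u]; // no out-links: redistribute uniformly below
                    continue;
                }
                double share = rank[u] / outLinks[u].length;
                for (int v : outLinks[u]) next[v] += share;
            }
            for (int v = 0; v < n; v++) {
                next[v] = (1 - damping) / n + damping * (next[v] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }
}
```

On the three-page graph 0 -> {1, 2}, 1 -> {2}, 2 -> {0} with damping 0.85, page 2 ends up with the highest rank and page 1 with the lowest, since page 2 receives links from both other pages.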
Week 7: Feb 20, 22
Study Break
(Reading Week, no classes)
Week 8: Feb 27, Mar 1
Wed, Feb 27: Midterm revision. Fri, Mar 1, in class: Midterm.
Week 9: Mar 6, 8
Text Categorization:
Categorization algorithms: decision trees; Rocchio; k-nearest neighbor;
Naive Bayes.
Links: Weka data mining tool
Extra slides on Naive Bayes
Week 11: Mar 20, 22
Text Clustering
Clustering algorithms: agglomerative clustering; k-means.
Applications to information filtering and organization.
Examples of text classification and clustering discussed in class
Solution (do it by yourself first)
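The k-means algorithm listed for this week can be sketched as below. This is a bare-bones illustration with hypothetical names, assuming dense numeric vectors (for text, these would be document vectors such as TF-IDF), squared Euclidean distance, and caller-supplied initial centroids, which the method updates in place. Real systems usually choose initial centroids randomly or with a seeding heuristic.

```java
import java.util.Arrays;

// Minimal sketch: k-means with alternating assignment and update steps (names are illustrative).
final class KMeansSketch {

    // Returns the cluster index of each point; `centroids` is modified in place.
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int it = 0; it < maxIterations; it++) {
            boolean changed = false;
            // Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (squaredDistance(points[i], centroids[c])
                            < squaredDistance(points[i], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) break; // converged: no point switched cluster
            // Update step: move each centroid to the mean of its points.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[centroids[c].length];
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] != c) continue;
                    for (int d = 0; d < sum.length; d++) sum[d] += points[i][d];
                    count++;
                }
                if (count == 0) continue; // an empty cluster keeps its old centroid
                for (int d = 0; d < sum.length; d++) centroids[c][d] = sum[d] / count;
            }
        }
        return assignment;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return sum;
    }
}
```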
Week 10: Mar 13, 15
Advanced IR Models:
Probabilistic models;
Latent Semantic Indexing (LSI); Language Models.
Extra slides on LSI.
Language Models for Information Retrieval.
Week 12: Mar 27, 29
Question Answering:
Retrieving precise short answers to natural language queries.
QA System Demos. Slides about IBM's Watson.
Links to IBM's Watson
Deep QA
Answers
Ottawa Citizen article
Week 13: Apr 3, 5
Cross-Language IR
Exam revision