Winter 2013

CSI4107: Information Retrieval and the Internet


Instructor:  Diana Inkpen

Office: SITE 5015
E-mail: diana@site.uottawa.ca   Telephone: 562-5800 ext. 6711

Meeting Times and Locations

Office Hours: TBA or by email appointment, in SITE 5015.

Overview

Basic principles of Information Retrieval. Indexing methods. Query processing. Linguistic aspects of Information Retrieval. Agents and artificial intelligence approaches to Information Retrieval. Relation of Information Retrieval to the World Wide Web. Search engines. Servers and clients. Browser and server side programming for Information Retrieval.
Prerequisites: (CSI3103 or ELG3300), (CSI3125 or CSI2115 or SEG2101), or permission of the instructor.

Announcements:

Evaluation  Students will be evaluated on:

Timetable  (no late assignments are considered)

Recommended Textbook

Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008 (online version available)

Other books:
Information Retrieval, by D. Grossman and O. Frieder, Springer, 2004 (second edition).
Information Retrieval, by C. J. van Rijsbergen, 1979 (another online book).
Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 1999. Companion website to this book.

Course notes (additional reading, pdf file)

Syllabus (subject to minor modifications)  (The lecture slides will be in PDF format; you can read them with Acrobat Reader.)
Credit: some of the lecture notes were originally designed by Prof. Ray Mooney, University of Texas at Austin.


Week 1:  Jan 9, 11
Preliminaries. Introduction: Goals and history of IR. The impact of the web on IR. The role of artificial intelligence (AI) in IR.
The Internet and the WWW: History of the Internet. TCP/IP. IP addresses. WWW. HTTP. HTML. Web servers and clients.
Links: Top search engines in the US in 2010; Search Engine Watch; TREC; CLEF

Week 2: Jan 16, 18
Basic IR Models: Boolean and vector-space retrieval models; ranked retrieval; text-similarity metrics; TF-IDF (term frequency/inverse document frequency) weighting; cosine similarity.
Slides on Implementation of Vector Space Model; Extra slides on cosine measure; Example discussed in class; Solution to the example.
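As an informal illustration of the TF-IDF weighting and cosine similarity listed for this week, here is a minimal Python sketch. It is not taken from the course slides: the raw-count tf, the log(N/df) form of idf, the function names and the three toy documents are all assumptions chosen for brevity, and the course material may use a different weighting variant.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document.

    Assumed weighting: raw term frequency times log(N / df).
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy collection, tokenized by whitespace.
docs = [
    "information retrieval and the web".split(),
    "web search engines".split(),
    "vector space retrieval models".split(),
]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # positive: the documents share "web"
print(cosine(vecs[1], vecs[2]))  # zero: no terms in common
```

With this idf, a term that occurs in every document gets weight zero, which is why the zero-norm guard in cosine() is needed.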

Week 3:  Jan 23, 25
Experimental Evaluation of IR: Performance metrics: recall, precision, and F-measure; Evaluations on benchmark text collections.
Interpolated Precision; Example discussed in class; Solution to example.
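The metrics named for this week can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the document ids and the relevance judgments below are made up, and interpolated precision is taken to be the highest precision observed at any rank whose recall is at least the requested level.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and balanced F-measure."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def interpolated_precision(ranking, relevant, recall_level):
    """Highest precision at any rank where recall >= recall_level (0.0 if none)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    best, hits = 0.0, 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        if hits / len(relevant) >= recall_level:
            best = max(best, hits / rank)
    return best

# Hypothetical ranked result list and relevance judgments.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d5"}

p, r, f = precision_recall_f1(ranking, relevant)
print(p, r, f)
print(interpolated_precision(ranking, relevant, 0.5))
```

Here two of the five retrieved documents are relevant, and one relevant document (d5) is never retrieved, so recall cannot reach 1.0 and the interpolated precision at that level is 0.0.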

Week 4:  Jan 30, Feb 1
Query Operations and Languages: Relevance feedback; Query expansion; Query languages.    
Example discussed in class   Solution (do it by yourself first)
Links: WordNet; Corpus-based Similarity Demo; Dekang Lin's Demos; WordNet::Similarity

Week 5:  Feb 6, 8
Image Information Retrieval
Links: Content-based image retrieval; ESP Game for labeling images
Text Representation: Word statistics; Zipf's law; Porter stemmer; morphology; index term selection; using thesauri. Metadata and markup languages (SGML, HTML, XML).
More slides on Web markup languages: HTML, XML, XHTML, RDF, OWL.
Links: Semantic Web
Example: term frequencies in Tom Sawyer

Week 6:  Feb  13, 15
Web Search: Search engines; spidering; metacrawlers; directed spidering; link analysis (e.g. hubs and authorities, Google PageRank); shopping agents.
Extra slides on Link Analysis: the hubs and authorities algorithm, and the PageRank algorithm. PageRank
Hubs and authorities example discussed in class; Solution (do it by yourself first); PageRank examples
Links: Google Tech Overview; Google - Parallel architecture; Slides about the Google 1998 paper
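For readers who want to experiment with the link analysis topic before looking at the class examples, the following Python sketch runs PageRank by power iteration on a tiny graph. It is an assumption-laden illustration, not the version from the slides: the damping factor 0.85, the fixed iteration count, the even redistribution of rank from pages without outlinks, and the four-page graph are all chosen here for the example. Every link target is assumed to appear as a key of the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict mapping page -> list of outlinked pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's damped rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its damped rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this graph C collects links from A, B and D and ends up with the highest score, while D, which nobody links to, keeps only the random-jump share (1 - damping) / n.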

Week 7:  Feb 20, 22    Study Break (Reading Week, no classes)
Week 8:  Feb 27, Mar 1    Wed, Feb 27: Midterm revision. Fri, Mar 1: Midterm (in class).

Week 9:  Mar 6, 8
Text Categorization: Categorization algorithms: decision trees; Rocchio; k-nearest neighbor; Naive Bayes.
Links: Weka data mining tool
Extra slides on Naive Bayes

Week 11: Mar 20, 22
Text Clustering: Clustering algorithms: agglomerative clustering; k-means. Applications to information filtering and organization.
Examples of text classification and clustering discussed in class; Solution (do it by yourself first)

Week 10: Mar 13, 15
Advanced IR Models: Probabilistic models; Latent Semantic Indexing (LSI); Language Models.
Extra slides on LSI. Language Models for Information Retrieval.

Week 12:  Mar 27, 29
Question Answering: Retrieving precise short answers to natural language queries.
QA System Demos. Slides about IBM's Watson. Links to IBM's Watson Deep QA Answers Ottawa Citizen article

Week 13:  Apr 3, 5
Cross-Language IR

Exam revision