Winter 2016

CSI4107: Information Retrieval and the Internet

Instructor:  Diana Inkpen

Office: SITE 5015
E-mail:   Telephone: 562-5800 ext. 6711

Meeting Times and Locations

Office Hours: TBA or by email appointment, in SITE 5015.


Basic principles of Information Retrieval. Indexing methods. Query processing. Linguistic aspects of Information Retrieval. Agents and artificial intelligence approaches to Information Retrieval. Relation of Information Retrieval to the World Wide Web. Search engines. Servers and clients. Browser and server side programming for Information Retrieval.
Prerequisites: (CSI3103 or ELG3300), (CSI3125 or CSI2115 or SEG2101), or permission from the instructor.


Evaluation  Students will be evaluated on:

Timetable (late assignments will not be accepted)

Recommended Textbook

Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008 (online version available).

Other books:
Information Retrieval, by D. Grossman and O. Frieder, Springer, 2004 (second edition).
Information Retrieval, by C. J. van Rijsbergen, 1979 (available online).
Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 1999. Companion website to this book.

Course notes (additional reading, pdf file)

Syllabus (subject to minor modifications)
The lecture slides are in PDF format; you can read them with Acrobat Reader.
Credit: some of the lecture notes were originally designed by Prof. Ray Mooney, University of Texas at Austin.

Week 1: 
Preliminaries. Introduction: Goals and history of IR. The impact of the web on IR. The role of artificial intelligence (AI) in IR.
The Internet and the WWW: History of Internet. TCP/IP. IP addresses. WWW. HTTP. HTML. Web servers and clients.
Links: Top search engines in the US in 2010; Search Engine Watch; TREC; CLEF

Week 2:
Basic IR Models: Boolean and vector-space retrieval models; ranked retrieval; text-similarity metrics; TF-IDF (term frequency/inverse document frequency) weighting; cosine similarity.
Slides on Implementation of Vector Space Model; Extra slides on cosine measure; Example discussed in class; Solution to the example.
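As an illustration of the Week 2 topics (not part of the course materials), here is a minimal Python sketch of TF-IDF weighting and cosine similarity. It assumes raw term counts for TF and log(N/df) for IDF; the slides may use a different variant. The function names `tf_idf_vectors` and `cosine` and the three toy documents are made up for this example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized document.

    Assumed weighting: raw term frequency times log(N / df).
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy collection (hypothetical), already tokenized by whitespace.
docs = [
    "information retrieval on the web".split(),
    "web search engines".split(),
    "vector space retrieval model".split(),
]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # documents sharing the term "web"
print(cosine(vecs[1], vecs[2]))  # documents with no terms in common
```

A term that occurs in every document gets weight 0 under this IDF, which is why the sketch guards against zero-length vectors.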

Week 3: 
Experimental Evaluation of IR: Performance metrics: recall, precision, and F-measure; Evaluations on benchmark text collections.
Interpolated Precision; Example discussed in class; Solution to the example.
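As an illustration of the Week 3 metrics (not part of the course materials), the following Python sketch computes precision, recall, the balanced F-measure, and interpolated precision at 11 standard recall levels. It assumes the common definition in which interpolated precision at recall level r is the highest precision observed at any recall of at least r. The function names and the toy ranking are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and balanced F-measure."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f


def interpolated_precision(ranking, relevant, levels=11):
    """Interpolated precision at evenly spaced recall levels 0.0 .. 1.0."""
    relevant = set(relevant)
    points = []  # (recall, precision) measured at each relevant document
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    result = []
    for i in range(levels):
        level = i / (levels - 1)
        # Highest precision at any recall >= this level, or 0 if none.
        candidates = [p for r, p in points if r >= level]
        result.append(max(candidates, default=0.0))
    return result


# Toy example (hypothetical): 5 ranked results, 3 relevant documents overall.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d5"}
print(precision_recall_f1(ranking, relevant))
print(interpolated_precision(ranking, relevant))
```

In the toy example the relevant document d5 is never retrieved, so recall cannot exceed 2/3 and the interpolated precision is 0 at the higher recall levels.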
Week 4: 
Query Operations and Languages: Relevance feedback; Query expansion; Query languages.
Example discussed in class; Solution (try it yourself first)
Links: WordNet; Corpus-based Similarity Demo; Dekang Lin's Demos; WordNet::Similarity
Image Information Retrieval
Links: Content-based image retrieval; ESP Game for labeling images
Text Representation: Word statistics; Zipf's law; Porter stemmer; morphology; index term selection; using thesauri. Metadata and markup languages (SGML, HTML, XML).
More slides on Web markup languages: HTML, XML, XHTML, RDF, OWL; Semantic Web and Linked Data.
Links: Semantic Web; Linked Data video
Example: term frequencies in Tom Sawyer
Week 5: 
Web Search: Search engines; spidering; metacrawlers; directed spidering; link analysis (e.g. hubs and authorities, Google PageRank); shopping agents.
Extra slides on Link Analysis: the hubs and authorities algorithm, and the PageRank algorithm; PageRank
Hubs and authorities example discussed in class; Solution (try it yourself first); PageRank examples
Links: Google Tech Overview; Google - Parallel architecture; Slides about the Google 1998 paper
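As an illustration of the link analysis topic (not part of the course materials), here is a minimal Python sketch of PageRank computed by power iteration. It assumes a damping factor of 0.85, a fixed number of iterations, and that pages with no outgoing links spread their rank evenly over all pages; the in-class examples may use other conventions. The function name `pagerank` and the three-page graph are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores for a graph given as {page: [outlinks]}."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                # Split this page's damped rank evenly over its outlinks.
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Assumed handling of dangling pages: spread over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Toy graph (hypothetical): A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items()):
    print(page, round(score, 3))
```

With these conventions the scores always sum to 1, so they can be read as the probability that a random surfer is on each page.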

Week 6:  Feb 15-19 Study Break (Reading Week, no classes)

Week 7: 
Feb 24, Midterm revision; Feb 26, in class: Midterm

Week 8:
Text Categorization: Categorization algorithms: decision trees; Rocchio; k-nearest neighbor; Naive Bayes. Sentiment Analysis.
Links: Weka data mining tool; Extra slides on Naive Bayes

Week 9: 
Text Clustering: Clustering algorithms: agglomerative clustering; k-means. Applications to information filtering and organization.
Examples of text classification and clustering discussed in class; Solution (try it yourself first)
Week 10:
Advanced IR Models: Probabilistic models; Latent Semantic Indexing (LSI); Language Models.
Extra slides on LSI. Language Models for Information Retrieval.

Week 11:
Question Answering: Retrieving precise short answers to natural language queries.
QA System Demos. Slides about IBM's Watson. Links to IBM's Watson Deep QA Answers

Week 12: 
Cross-Language IR. Learning to Rank.

Week 13: 
Deep Learning for Natural Language Processing
Week 14: 
Exam revision