CS584: Information Retrieval

http://www.mathcs.emory.edu/~eugene/teaching/CS584_IR/

Course Description

Basic and advanced techniques for text-based information systems: text indexing; Boolean, vector space, and probabilistic retrieval models; evaluation, feedback, and interface issues; Web search including crawling, link-based algorithms, and Web metadata; text/Web clustering, classification; information extraction and text mining.

Course Mechanics

Instructor

Eugene Agichtein: eugene@mathcs.emory.edu, Office: N420; webpage: http://www.mathcs.emory.edu/~eugene/

Time/Place

Mondays and Wednesdays 4:00 – 5:30 pm  Room: E 408

Office Hours: N420

Mondays 12-1pm Thursdays 2-3pm

Text

·         Main text ( MRS ): Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2007.
Book draft available online at http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html

·         Supplemental text (CJR): C. J. van RIJSBERGEN, INFORMATION RETRIEVAL, http://www.dcs.gla.ac.uk/Keith/Preface.html, London: Butterworths, 1979

·         Additional readings will be posted on class web page

Grading:

·         2 projects: 60% total (20% first project, 40% second project)

·         Exams: 40% (20% midterm, 20% final)

 

Schedule

Date

Topic

Notes & Comments

  9/1

Intro & course overview

Topics, mechanics, software, expectations, tentative project descriptions

9/6

IR architecture, Index construction

The indexing pipeline: tokenization, inverted list construction

9/8

Project 1 description posted

Project 1 description here

9/11

Index construction: continued

Index support for phrases, proximity, intro to analyzing search complexity and index size; term distributions

9/13

Document ranking

Finish up zones/fields; Intro to document ranking; vector-space retrieval model; TF*IDF

9/18

Ranking (continued)

Vector space model, dimensionality reduction (intro), evaluation

9/20

Ranking (continued)

Relevance Feedback, intro do probabilistic retrieval/term weighting: BIR model, probabilistic ranking principle
Readings:

9/25 Mon

Ranking (Continued)

Probabilistic term weighting: review of BIR model, extensions: BM25, BM25F; intro to LM (maybe)

Readings:

9/27 Wed

No class

 

9/28 Thu

Special: Grad seminar@4pm

Implicit structures in unstructured data, query processing/cost estimation for text information extraction

9/29 Fri

Language modeling

Language modeling for IR.
Readings:

 

10/2 Mon

No class

Note: no class, no office hours (make-up office hours: Tuesday, 10:30am-11:30am)

10/3 Tue

Project 1 due on Tuesday, 10/3

 Office hours: 10:30am-11:30am

10/4 Wed

Semantics, dimensionality reduction

Query/document mismatch, synonymy, polysemy, LSA, PLSA

Readings:

10/6 Fri

Catch-up, Intro to Web Search

History, some (known) architectures, crawling, index construction, query processing

10/9 Mon

No class (fall break)

 

10/11 Wed

Link analysis

Citation graphs, high-level intro to PageRank and HITS algorithms

10/16 Mon

Link analysis catch-up, midterm review

Finish link analysis (as needed), review material for midterm. NOTE: try to come prepared with specific questions.

10/18 Wed

Midterm

 

10/23

Postmortem; project 2 topics/ideas

 

10/25

Web crawling

10/30

Crawling (cont), Index freshness

  • Effective change dectection using sampling, J. Cho & A. Ntoulas, VLDB 2002

  • What’s new on the web: Ntoulas & Cho, WWW 2004

  • User-Centric Web Crawling,  CS. Pandey, C. Olston, WWW 2005

11/1

XML Retrieval 

 Chapter 10 in MRS: http://nlp.stanford.edu/IR-book/pdf/10-xml.pdf

11/6 XML Retrieval II XIRQL, indexing, also:
11/8 Project 2 proposal due
(Web) Information extraction

Extracting Entities, Relations, Facts from the web to improve web search
11/13 Structured Web Search Vertical search: Froogle, object search
11/15 Recommendation Systems & User Interaction Recommender Systems, collaborative filtering, user interactions for web search
11/20 Classification I Naive Bayes classification, Feature Selection
Readings: MSR Ch. 13: Text classification & Naive Bayes
11/22 Classification II Vector-space classification, SVMs, practical issues
Readings:
MRS Ch. 14: Vector space classification
MRS Ch. 15: Support vector machines & kernel functions
 
11/27 Clustering MRS Chs 16, 17: Partitional clustering and Hierarchical Clustering
11/29 Spam Web spam, click spam, ad spam, everywhere search spam
12/4 Cross-lingual retrieval  
12/6 Question answering and wrap-up  
12/10 Sun Project 2 due Readme, data, code
12/11 Mon No class  

12/13 W

Project 2 presentations

 Last class. Everyone presents their project

12/20, W

Project 2 final report due

Final report replaces final examination (note: NO LATE SUBMISSIONS - grades are due soon after)

 

Expectations

Prerequisites

Some familiarity with Java and object-oriented programming is required (i.e., you should be comfortable reading Java documentation and implementing functions and classes), and have basic understanding of common data structures. Should also be comfortable with basic linear algebra, and probability & statistics, some graph theory exposure would help.

Software

Projects will be implemented on top of the Lucene search engine (http://lucene.apache.org/java/docs/index.html) -- a high-performance, full-featured text search engine library written entirely in Java..

Datasets

The datasets will depend on student interests. Default text collections that will be provided:

Wikipedia (1.3 million English articles): http://en.wikipedia.org/wiki/Main_Page

Blogs: (1 million blogs, 10 million blogs entries, courtesy of BuzzMetrics)

PubMed: (National Library of Medicine, about 14 Million medical journal citations): http://www.ncbi.nlm.nih.gov/entrez/query/static/overview.html#Medline

OpenDirectory: (DMOZ/Open Directory project: over 4 million sites manually organized in over 500,000 categories): http://dmoz.org/