Project 1: Implementing Ranking Functions

Goal

Understand and implement the classic vector-space similarity/ranking functions and incorporate them into a working IR system implementation (Lucene). You will implement the TF*IDF and BM25 functions (details on weighting in class). The project is out of 100 points. There are also 2 extra credit options, each worth 10 extra points (but only applicable if the baseline system works). The points from extra credit can be used (in weighted form) towards the calculation of the final grade. The project consists of 4 parts. I recommend doing them sequentially and testing that you have a working system after every step.

 

Due date

Tuesday, 10/3/2006. Delivery mechanism and exact deliverables: UPDATED, see below.

Lateness policy: 10% off for every day late; late submissions are accepted until Friday, 10/6.

 

A: Implement the baseline Lucene system for the News data: (40%)

  1. Download the Lucene distribution from http://www.wmwweb.com/apache/lucene/java/lucene-2.0.0.tar.gz.
  2. Make a Lucene subdirectory in your class project directory:  /aut/proj/cs584/your_login and extract the distribution there
  3. Test your distribution by following instructions in “Indexing Files” at http://lucene.apache.org/java/docs/demo.html

·        Make an index: java -classpath lucene-core-2.0.0.jar:lucene-demos-2.0.0.jar org.apache.lucene.demo.IndexFiles src/

·        Test your index: java -classpath lucene-core-2.0.0.jar:lucene-demos-2.0.0.jar org.apache.lucene.demo.SearchFiles

 

  1. Read and understand (enough to modify) the demo source code: src/demo/org/apache/lucene/demo/
  2. Now you are ready to write your own indexer for the news data: /aut/proj/cs584/cs584000/project1/text/
    You may start with the demo IndexFiles.java code: src/demo/org/apache/lucene/demo/IndexFiles.java
    However: you will need to write your own document parser to process the files (to detect individual document boundaries within each file, marked by <DOC> and </DOC>).
    Warning: the dataset is large (1.9 GB). Do not copy the input documents; instead, give the original location of the text directory as input to the indexer and store the index in your own project directory.
  3. Make sure you index both the filename and the document id (look at one of the documents in the collection to see the difference).
  4. Pay attention to the tokenizer (analyzer in Lucene lingo) used. You will have to use the same one at search time.
  5. Test your index with the modified SearchFiles.java program (hint: just make sure you can tell it where your index is, the rest can stay the same).
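
As a starting point for the document parser in step 2, here is a minimal sketch of the boundary detection. It is plain Java with no Lucene calls, and it is not the required design. The names ParsedDoc and DocSplitter are made up for illustration. It assumes that each <DOC> and </DOC> tag sits on its own line and that the document id is stored in a <DOCNO> element; look at a real file in the collection to confirm both before relying on it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** One parsed document: its id plus the raw text between <DOC> and </DOC>. */
class ParsedDoc {
    final String docId;
    final String text;
    ParsedDoc(String docId, String text) { this.docId = docId; this.text = text; }
}

class DocSplitter {
    /** Streams a multi-document file and returns one ParsedDoc per <DOC>...</DOC> block. */
    static List<ParsedDoc> split(Reader in) throws IOException {
        List<ParsedDoc> docs = new ArrayList<ParsedDoc>();
        BufferedReader reader = new BufferedReader(in);
        StringBuilder body = null; // non-null only while inside a <DOC> block
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.equals("<DOC>")) {
                body = new StringBuilder();
            } else if (trimmed.equals("</DOC>")) {
                if (body != null) {
                    String text = body.toString();
                    docs.add(new ParsedDoc(extractDocId(text), text));
                }
                body = null;
            } else if (body != null) {
                body.append(line).append('\n');
            }
        }
        return docs;
    }

    /** Convenience wrapper for small in-memory inputs (e.g. tests). */
    static List<ParsedDoc> splitText(String content) {
        try {
            return split(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** ASSUMPTION: the id sits in <DOCNO>...</DOCNO>; returns "" when it is absent. */
    static String extractDocId(String text) {
        int start = text.indexOf("<DOCNO>");
        int end = text.indexOf("</DOCNO>");
        if (start < 0 || end < start) return "";
        return text.substring(start + "<DOCNO>".length(), end).trim();
    }
}
```

Each ParsedDoc would then become one entry in your index, with the filename and the document id kept as separate fields. For the real collection you would probably hand each document to the indexer as soon as it is complete rather than collecting them all in a list.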

B: Implement the TF*IDF ranking: (40%)

  1. You will need to collect IDF (inverse document frequency) statistics for the collection. You will also need to index all term frequencies when building the index. Hint: refer to the FAQ http://wiki.apache.org/jakarta-lucene/LuceneFAQ and look at the different ways a field can be indexed.
  2. Re-order up to 1,000 hits according to your TF*IDF values and return the results (see the FAQ for some hints on speed-ups).
  3. Keep the code modular so that you can plug in different ranking functions.
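
One possible way to keep the ranking code modular is sketched below. The interface and class names (RankingFunction, TfIdfRanker) are hypothetical and not part of Lucene. The weighting shown, raw term frequency times ln(N / df), is only an assumption standing in for the exact variant covered in class; substitute the class formula where indicated.

```java
import java.util.Map;

/** Pluggable ranking function: one implementation per scoring scheme. */
interface RankingFunction {
    /**
     * @param queryTerms   analyzed query terms
     * @param docTermFreqs term -> frequency of that term in the document
     */
    double score(String[] queryTerms, Map<String, Integer> docTermFreqs);
}

/** TF*IDF with raw tf and idf = ln(N / df). ASSUMPTION: replace with the class weighting. */
class TfIdfRanker implements RankingFunction {
    private final int numDocs;                  // N: number of documents in the collection
    private final Map<String, Integer> docFreq; // term -> number of documents containing it

    TfIdfRanker(int numDocs, Map<String, Integer> docFreq) {
        this.numDocs = numDocs;
        this.docFreq = docFreq;
    }

    public double score(String[] queryTerms, Map<String, Integer> docTermFreqs) {
        double total = 0.0;
        for (String term : queryTerms) {
            Integer tf = docTermFreqs.get(term);
            Integer df = docFreq.get(term);
            // Terms missing from the document or the collection contribute nothing.
            if (tf == null || df == null || df == 0) continue;
            total += tf * Math.log((double) numDocs / df);
        }
        return total;
    }
}
```

With this shape, re-ordering the top hits amounts to calling score once per hit and sorting by the result, and adding another ranking function means adding another implementation of the same interface.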

C: Implement the BM25 ranking: (10%)

  1. Same as for TF*IDF. Hopefully your code is modular enough that you only have to add a little extra code specific to BM25.
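
For reference, a minimal sketch of a BM25 scorer follows. It uses the commonly published form of the function with idf = ln((N - df + 0.5) / (df + 0.5)), which can go negative for terms that occur in more than half of the documents. The parameter values k1 = 1.2 and b = 0.75 are assumed defaults, not values given in this assignment, and the class name Bm25 is hypothetical. Use the formula and parameters from class if they differ.

```java
import java.util.Map;

class Bm25 {
    static final double K1 = 1.2;  // ASSUMPTION: common default for term-frequency saturation
    static final double B = 0.75;  // ASSUMPTION: common default for length normalization

    /**
     * @param docTermFreqs term -> frequency in this document
     * @param docLength    number of tokens in this document
     * @param avgDocLength average number of tokens per document in the collection (> 0)
     * @param numDocs      number of documents in the collection
     * @param docFreq      term -> number of documents containing the term
     */
    static double score(String[] queryTerms, Map<String, Integer> docTermFreqs,
                        int docLength, double avgDocLength,
                        int numDocs, Map<String, Integer> docFreq) {
        // Longer-than-average documents get a larger denominator, hence lower scores.
        double lengthNorm = K1 * (1.0 - B + B * docLength / avgDocLength);
        double total = 0.0;
        for (String term : queryTerms) {
            Integer tf = docTermFreqs.get(term);
            Integer df = docFreq.get(term);
            if (tf == null || df == null) continue;
            double idf = Math.log((numDocs - df + 0.5) / (df + 0.5));
            total += idf * (tf * (K1 + 1.0)) / (tf + lengthNorm);
        }
        return total;
    }
}
```

Compared with TF*IDF, the only extra collection statistics needed are the document length and the average document length, so it is worth storing both at indexing time.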

D: Implement the evaluation harness: (10%)

  1. Input: parse and convert “topics” into queries
    1. Each topic is marked by <top> and </top>
    2. Be able to read the queries in the test file: /aut/proj/cs584/cs584000/eval/topics.351-400
    3. Use the union of the Title and Description fields as your query.
  2. Output: for each topic, return up to 1,000 documents, one result per line, tab-delimited, in the following format:
           topic_id  \t Q0 \t document_id \t rank \t score \t your_login
    e.g.,
           351  \t Q0 \t FBIS-41571 \t 143 \t 101.24 \t eugene      
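
The input and output steps of the harness could be sketched roughly as follows. All class names here (Topic, ScoredDoc, TopicParser, RunFormatter) are hypothetical. The parser assumes the usual TREC topic layout in which <num>, <title>, <desc> and <narr> appear in that order and carry the labels "Number:" and "Description:"; check topics.351-400 to confirm. The formatter assumes ranks start at 1 and prints scores with two decimals, neither of which is specified above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** A topic reduced to what the searcher needs: its id and the query text. */
class Topic {
    final String id;
    final String query;
    Topic(String id, String query) { this.id = id; this.query = query; }
}

/** A retrieved document with its ranking score. */
class ScoredDoc {
    final String docId;
    final double score;
    ScoredDoc(String docId, double score) { this.docId = docId; this.score = score; }
}

class TopicParser {
    /** Splits the topics file on <top>...</top> and builds title + description queries. */
    static List<Topic> parse(String fileContent) {
        List<Topic> topics = new ArrayList<Topic>();
        int pos = 0;
        while (true) {
            int start = fileContent.indexOf("<top>", pos);
            if (start < 0) break;
            int end = fileContent.indexOf("</top>", start);
            if (end < 0) break;
            String block = fileContent.substring(start + "<top>".length(), end);
            // ASSUMPTION: fields appear in the order <num>, <title>, <desc>, <narr>.
            String id = section(block, "<num>", "<title>").replace("Number:", "").trim();
            String title = section(block, "<title>", "<desc>");
            String desc = section(block, "<desc>", "<narr>").replace("Description:", "");
            String query = (title + " " + desc).replaceAll("\\s+", " ").trim();
            topics.add(new Topic(id, query));
            pos = end + "</top>".length();
        }
        return topics;
    }

    /** Returns the text after openTag up to nextTag (or to the end if nextTag is missing). */
    private static String section(String block, String openTag, String nextTag) {
        int from = block.indexOf(openTag);
        if (from < 0) return "";
        from += openTag.length();
        int to = block.indexOf(nextTag, from);
        return block.substring(from, to < 0 ? block.length() : to);
    }
}

class RunFormatter {
    static final int MAX_RESULTS = 1000;

    /** Sorts hits by descending score and renders at most MAX_RESULTS output lines. */
    static List<String> format(String topicId, List<ScoredDoc> hits, String login) {
        List<ScoredDoc> sorted = new ArrayList<ScoredDoc>(hits);
        Collections.sort(sorted, new Comparator<ScoredDoc>() {
            public int compare(ScoredDoc a, ScoredDoc b) {
                return Double.compare(b.score, a.score);
            }
        });
        List<String> lines = new ArrayList<String>();
        for (int i = 0; i < sorted.size() && i < MAX_RESULTS; i++) {
            ScoredDoc d = sorted.get(i);
            // ASSUMPTION: ranks start at 1 and scores are printed with two decimals.
            lines.add(topicId + "\tQ0\t" + d.docId + "\t" + (i + 1) + "\t"
                    + String.format(Locale.ROOT, "%.2f", d.score) + "\t" + login);
        }
        return lines;
    }
}
```

The query string produced here still has to go through the same analyzer you used at indexing time before it is matched against the index.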

Formal Evaluation

  1. An automatic evaluation program will be provided soon, which will give you lots of performance/accuracy numbers. Don’t worry about it yet (note: relative relevance scores are not directly part of the grade; however, a randomly ordered list of documents will not be accepted).

 

Extra Credit 1: Improve on D by taking advantage of the fields of the query, and the document (10%)

  1. “Improve” means achieving a higher relevance score than the default “union” approach. This implies that you can correctly parse the input and return legal output to enable empirical evaluation.
  2. Fields in the query refer to the “narrative” component. Fields in the document – be creative. For example, titles, dates, etc. can all be indexed as separate fields (and are clearly marked in the XML for each document).

Extra Credit 2: Implement BM25F (the “fielded” version of BM25) and show improvement over baseline BM25 (10%)

  1. Instead of a single score for a document, index (and compute) fields such as title of the document separately from the body. Details in class.
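
To give a rough idea of what “fielded” scoring can look like, here is a sketch of one commonly published BM25F formulation: per-field term frequencies are length-normalized, weighted, and summed into a single pseudo-frequency before the BM25 saturation is applied. This is an assumption about the variant, not necessarily the one presented in class. The class name Bm25f, the per-field weights, and the per-field b values are all hypothetical parameters you would have to choose or tune.

```java
class Bm25f {
    /**
     * Score contribution of ONE query term to one document. Arrays are indexed by field
     * (for example 0 = title, 1 = body) and must all have the same length.
     *
     * @param fieldTf     frequency of the term in each field of the document
     * @param fieldLen    length of each field of the document, in tokens
     * @param avgFieldLen collection-wide average length of each field (each > 0)
     * @param fieldWeight boost given to each field (ASSUMPTION: chosen by you)
     * @param fieldB      length-normalization strength per field, in [0, 1]
     * @param k1          saturation parameter
     * @param numDocs     number of documents in the collection
     * @param docFreq     number of documents containing the term in any field
     */
    static double termScore(int[] fieldTf, int[] fieldLen, double[] avgFieldLen,
                            double[] fieldWeight, double[] fieldB,
                            double k1, int numDocs, int docFreq) {
        double pseudoTf = 0.0;
        for (int f = 0; f < fieldTf.length; f++) {
            double norm = 1.0 - fieldB[f] + fieldB[f] * fieldLen[f] / avgFieldLen[f];
            pseudoTf += fieldWeight[f] * fieldTf[f] / norm;
        }
        double idf = Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
        // Saturation is applied once, to the combined frequency, not per field.
        return idf * pseudoTf / (k1 + pseudoTf);
    }
}
```

The document score is the sum of termScore over the query terms. Setting a single field with weight 1 reduces this to a BM25-style score without the (k1 + 1) factor, which is a constant multiplier and does not change the ranking.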

 

 

 

Deliverables for Project 1

For the basic project, you will submit the locations of 3 files containing the outputs of your system in the format described on the project description page; the input is the query file topics.351-400. Each output will be generated using a different scoring function: Lucene original similarity, TF*IDF, and BM25 (TBD in class on Monday). Name the files your_login.Lucene (for the Lucene similarity output), your_login.TFIDF, and your_login.BM25, respectively.

For the extra credit submission files, the format is the same, but name the files your_login.BM25F and your_login.

 

Evaluation Code

To get quantitative scores for your programs (before they are submitted), I am providing a set of queries and relevance judgments, and a program to compute Recall-Precision values as well as Average Precision (R-Precision) and Precision at K that we covered in class.
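
For intuition about two of these numbers, here is a small sketch of Precision at K and (non-interpolated) Average Precision for a single query. The class name EvalMetrics is hypothetical. The sketch assumes the ranked list contains no duplicate document ids and that Average Precision is divided by the total number of relevant documents for the query. Use trec_eval, not this code, for the numbers you report.

```java
import java.util.List;
import java.util.Set;

class EvalMetrics {
    /** Fraction of the top k results that are relevant (missing results count as misses). */
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) return 0.0;
        int hits = 0;
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) hits++;
        }
        return (double) hits / k;
    }

    /** Mean of the precision values measured at the rank of each relevant document found. */
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        // Relevant documents that were never retrieved contribute zero to the sum.
        return sum / relevant.size();
    }
}
```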

The evaluation data and code are available in /aut/proj/cs584/cs584000/project1/eval/:

test queries: topics.351-400

sample system output: input.ATT

relevance judgements: qrels.trec7

code: trec_eval

Usage: trec_eval [-h] [-q] [-a] [-o] [-v] relevance_file system_output

   -h: Give full help information, including other options
   -q: In addition to summary evaluation, give evaluation for each query
   -a: Print all evaluation measures, instead of just official measures
   -o: Print requested measures in old non-relational format
e.g., trec_eval -o qrels.trec7 input.ATT

This information will also be posted on the project1 web page.

http://www.mathcs.emory.edu/~eugene/teaching/CS584_IR/project1/