Project 1: Implementing Ranking Functions
Goal
Understand and implement the classic vector-space
similarity/ranking functions and incorporate them into a working IR system
implementation (Lucene). You will implement the TF*IDF and BM25 functions (details
on weighting in class). The project is out of 100 points. There are also 2
extra credit options, each worth 10 extra points (but only applicable if the
baseline system works). The extra credit points count, in weighted form,
toward the calculation of the final grade. The project consists of 4
parts. I recommend doing them sequentially and testing that you have a working
system after every step.
Due date: Tuesday, 10/3/2006.
Delivery mechanism and exact deliverables: UPDATED; see below.
Lateness policy: 10% off for every day late; late submissions accepted until Friday, 10/6.
A: Implement the baseline Lucene system for the News data: (40%)
- Download the Lucene distribution from http://www.wmwweb.com/apache/lucene/java/lucene-2.0.0.tar.gz.
- Make a Lucene subdirectory in your class project directory, /aut/proj/cs584/your_login, and extract the distribution there.
- Test your distribution by following the instructions in “Indexing Files” at http://lucene.apache.org/java/docs/demo.html
· Make an index: java -classpath lucene-core-2.0.0.jar:lucene-demos-2.0.0.jar org.apache.lucene.demo.IndexFiles src/
· Test your index: java -classpath lucene-core-2.0.0.jar:lucene-demos-2.0.0.jar org.apache.lucene.demo.SearchFiles
- Read and understand (enough to modify) the demo source code: src/demo/org/apache/lucene/demo/
- Now you are ready to write your own indexer for the news data: /aut/proj/cs584/cs584000/project1/text/
You may start with the demo IndexFiles.java code: src/demo/org/apache/lucene/demo/IndexFiles.java
However, you will need to write your own document parser to process the files (to detect individual document boundaries within each file, marked by <DOC> and </DOC>).
Warning: the dataset is large (1.9 GB). Do not copy the input documents; rather, give the original location of the text directory as input to the indexer and store the index in your own project directory.
- Make sure you index both the filename and the document id (look at one of the documents in the collection to see the difference).
- Pay attention to the tokenizer (analyzer in Lucene lingo) used. You will have to use the same one at search time.
- Test your index with the modified SearchFiles.java program (hint: just make sure you can tell it where your index is; the rest can stay the same).
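The document-boundary detection described above can be sketched in plain Java, independent of Lucene. This is only a minimal sketch under stated assumptions: the class name DocSplitter and its methods are hypothetical, each <DOC> and </DOC> tag is assumed to sit on a line of its own, and the document id is assumed to be wrapped in a <DOCNO> tag (check one of the files in the collection before relying on either assumption).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits the lines of one input file into documents.
class DocSplitter {

    // Returns the text found between each <DOC> and </DOC> pair.
    // Assumes each of the two tags appears on a line of its own.
    static List<String> split(Iterable<String> lines) {
        List<String> docs = new ArrayList<String>();
        StringBuilder current = null; // null while outside a document
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.equals("<DOC>")) {
                current = new StringBuilder();
            } else if (trimmed.equals("</DOC>")) {
                if (current != null) {
                    docs.add(current.toString());
                }
                current = null;
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        return docs;
    }

    // Returns the trimmed text inside <tag>...</tag>, or null if absent.
    static String extractTag(String doc, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        int start = doc.indexOf(open);
        if (start < 0) {
            return null;
        }
        int end = doc.indexOf(close, start + open.length());
        if (end < 0) {
            return null;
        }
        return doc.substring(start + open.length(), end).trim();
    }

    public static void main(String[] args) {
        // In-memory sample; the real input would be the lines of a news file.
        List<String> lines = Arrays.asList(
            "<DOC>", "<DOCNO> FBIS-1 </DOCNO>", "first body", "</DOC>",
            "<DOC>", "<DOCNO> FBIS-2 </DOCNO>", "second body", "</DOC>");
        for (String doc : split(lines)) {
            System.out.println(extractTag(doc, "DOCNO"));
        }
    }
}
```

Each string returned by split would then become one Lucene document, with the filename and the extracted id stored as separate fields.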
B: Implement the TF*IDF ranking: (40%)
- You will need to collect IDF (inverse document frequency) statistics for the collection. You will also need to index all term frequencies when building the index. Hint: refer to the FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ and look at the different ways a field can be indexed.
- Re-order up to 1,000 hits according to your TF*IDF values and return the results (see the FAQ for some hints on speed-ups).
- Keep the code modular so that you can plug in different ranking functions.
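One way to keep the scoring code modular is sketched below. It is a sketch only: the weighting used here, tf * ln(N / df), is one common TF*IDF variant and may differ from the one given in class, and the names TermScorer, TfIdfScorer and TfIdfDemo are hypothetical. In Lucene the document frequency and collection size could presumably be read from IndexReader.docFreq(Term) and IndexReader.numDocs(); here they are passed in as plain numbers.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface: scores one (term, document) pair, so that
// TF*IDF can later be swapped for another ranking function.
interface TermScorer {
    // tf: occurrences of the term in the document
    // df: number of documents containing the term
    // numDocs: number of documents in the collection
    double score(int tf, int df, int numDocs);
}

// One common TF*IDF variant (an assumption): tf * ln(numDocs / df).
class TfIdfScorer implements TermScorer {
    public double score(int tf, int df, int numDocs) {
        if (tf <= 0 || df <= 0) {
            return 0.0; // term absent from the document or the collection
        }
        return tf * Math.log((double) numDocs / df);
    }
}

class TfIdfDemo {
    // Document score = sum of the per-term scores over the query terms.
    static double scoreDocument(TermScorer scorer, String[] queryTerms,
                                Map<String, Integer> docTf,
                                Map<String, Integer> collectionDf,
                                int numDocs) {
        double total = 0.0;
        for (String term : queryTerms) {
            Integer tf = docTf.get(term);
            Integer df = collectionDf.get(term);
            if (tf != null && df != null) {
                total += scorer.score(tf, df, numDocs);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up statistics, for illustration only.
        Map<String, Integer> docTf = new HashMap<String, Integer>();
        docTf.put("petroleum", 3);
        Map<String, Integer> df = new HashMap<String, Integer>();
        df.put("petroleum", 50);
        df.put("exploration", 200);
        String[] query = {"petroleum", "exploration"};
        System.out.println(
            scoreDocument(new TfIdfScorer(), query, docTf, df, 100000));
    }
}
```

A ranking function that also needs the document length would require extra parameters on the interface; the three-argument form is kept here only for brevity.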
C: Implement the BM25 ranking: (10%)
- Same as for TF*IDF. Hopefully your code is modular enough that you only have to add a little extra code specific to BM25.
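For orientation, a per-term BM25 score might look like the sketch below. The formula is the commonly published Okapi BM25 form and the parameter values k1 = 1.2 and b = 0.75 are frequently used defaults; both are assumptions here, and the version presented in class takes precedence. The class name Bm25Scorer is hypothetical.

```java
// Minimal BM25 sketch for one query term and one document.
class Bm25Scorer {
    private final double k1; // term-frequency saturation
    private final double b;  // strength of document-length normalization

    Bm25Scorer(double k1, double b) {
        this.k1 = k1;
        this.b = b;
    }

    // IDF component. Note that it becomes negative when the term occurs
    // in more than half of the documents.
    double idf(int df, int numDocs) {
        return Math.log((numDocs - df + 0.5) / (df + 0.5));
    }

    // tf: occurrences of the term in the document
    // df: number of documents containing the term
    // docLen, avgDocLen: this document's length and the collection average
    double score(int tf, int df, int numDocs, double docLen, double avgDocLen) {
        if (tf <= 0) {
            return 0.0;
        }
        double lengthNorm = k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf(df, numDocs) * (tf * (k1 + 1.0)) / (tf + lengthNorm);
    }

    public static void main(String[] args) {
        // Assumed default parameters and made-up statistics.
        Bm25Scorer bm25 = new Bm25Scorer(1.2, 0.75);
        // Same term statistics, short document versus long document.
        System.out.println(bm25.score(3, 50, 100000, 120.0, 300.0));
        System.out.println(bm25.score(3, 50, 100000, 900.0, 300.0));
    }
}
```

Compared with TF*IDF, the only extra inputs are the document length and the average document length, so both need to be available at scoring time.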
D: Implement the evaluation harness: (10%)
- Input: parse and convert “topics” into queries
- Each topic is marked by <top> and </top>
- Be able to read the queries in the test file: /aut/proj/cs584/cs584000/eval/topics.351-400
- Use the union of the Title and Description fields as your query.
- Output: for each topic, return up to 1,000 documents, one result per line, tab-delimited, in the following format:
topic_id \t Q0 \t document_id \t rank \t score \t your_login
e.g.,
351 \t Q0 \t FBIS-41571 \t 143 \t 101.24 \t eugene
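The input and output steps above could be sketched roughly as follows. The topic layout is an assumption (the usual TREC layout, with "<num> Number:", "<title>", "<desc> Description:" and "<narr> Narrative:" markers at the start of a line), so compare it against the actual topics file first. The class TopicRun and its methods are hypothetical names, and printing the score with four decimals is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class TopicRun {

    // One parsed topic: its id and the query built from title plus description.
    static class Topic {
        final String id;
        final String query;

        Topic(String id, String query) {
            this.id = id;
            this.query = query;
        }
    }

    // Parses topics from the lines of a topics file (assumed TREC layout).
    static List<Topic> parseTopics(Iterable<String> lines) {
        List<Topic> topics = new ArrayList<Topic>();
        String id = null;
        StringBuilder query = null; // null while outside <top>...</top>
        boolean inDesc = false;     // true while reading description lines
        for (String raw : lines) {
            String line = raw.trim();
            if (line.equals("<top>")) {
                id = null;
                query = new StringBuilder();
                inDesc = false;
            } else if (line.equals("</top>")) {
                if (query != null && id != null) {
                    String text = query.toString().trim().replaceAll("\\s+", " ");
                    topics.add(new Topic(id, text));
                }
                query = null;
            } else if (query == null) {
                continue; // text outside any topic
            } else if (line.startsWith("<num>")) {
                id = line.substring(5).replace("Number:", "").trim();
                inDesc = false;
            } else if (line.startsWith("<title>")) {
                query.append(line.substring(7).trim()).append(' ');
                inDesc = false;
            } else if (line.startsWith("<desc>")) {
                query.append(line.substring(6).replace("Description:", "").trim())
                     .append(' ');
                inDesc = true;
            } else if (line.startsWith("<")) {
                inDesc = false; // any other field (e.g. the narrative) is skipped
            } else if (inDesc) {
                query.append(line).append(' ');
            }
        }
        return topics;
    }

    // Formats one result line: topic_id, Q0, document_id, rank, score, login.
    static String formatResult(String topicId, String docId, int rank,
                               double score, String login) {
        return topicId + "\tQ0\t" + docId + "\t" + rank + "\t"
                + String.format(Locale.US, "%.4f", score) + "\t" + login;
    }

    public static void main(String[] args) {
        // In-memory sample topic; the wording is illustrative only.
        List<String> lines = Arrays.asList(
            "<top>", "<num> Number: 351", "<title> sample title",
            "<desc> Description:", "sample description text",
            "<narr> Narrative:", "not used in the baseline query", "</top>");
        for (Topic t : parseTopics(lines)) {
            System.out.println(t.id + " -> " + t.query);
            System.out.println(formatResult(t.id, "FBIS-41571", 1, 101.24, "eugene"));
        }
    }
}
```

Titles that continue onto a second line are not handled by this sketch; only description text is collected across lines.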
Formal Evaluation
- An automatic evaluation program will be provided soon, which will give you lots of performance/accuracy numbers. Don’t worry about it yet. (Note: relative relevance scores are not directly part of the grade; however, a randomly ordered list of documents will not be accepted.)
Extra Credit 1: Improve on D by taking advantage of the fields of the query and the document (10%)
- “Improve” means achieving a higher relevance score than the default “union” approach. This implies that you can correctly parse the input and return legal output to enable empirical evaluation.
- Fields in the query refer to the “narrative” component. Fields in the document: be creative. For example, titles, dates, etc. can all be indexed as separate fields (and are clearly marked in the XML for each document).
Extra Credit 2: Implement BM25F (the “fielded” version of BM25) and show improvement over baseline BM25 (10%)
- Instead of a single score for a document, index (and compute) fields such as the title of the document separately from the body. Details in class.
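As a rough illustration of the fielded idea, the sketch below follows one published formulation of BM25F: per-field term frequencies are length-normalized, weighted, and summed into a single pseudo-frequency, which is then saturated once. This may not match the version given in class. The class name Bm25fScorer, the field order (title, body), and all weights and parameter values are hypothetical.

```java
// Minimal BM25F sketch for one query term and one document.
class Bm25fScorer {
    private final double k1;
    private final double[] fieldWeight; // boost per field, e.g. {title, body}
    private final double[] fieldB;      // length-normalization strength per field
    private final double[] avgFieldLen; // average length of each field

    Bm25fScorer(double k1, double[] fieldWeight, double[] fieldB,
                double[] avgFieldLen) {
        this.k1 = k1;
        this.fieldWeight = fieldWeight;
        this.fieldB = fieldB;
        this.avgFieldLen = avgFieldLen;
    }

    // Combines the per-field term frequencies into one weighted,
    // length-normalized pseudo-frequency.
    double pseudoTf(int[] tf, double[] fieldLen) {
        double sum = 0.0;
        for (int f = 0; f < tf.length; f++) {
            double norm = 1.0 - fieldB[f] + fieldB[f] * fieldLen[f] / avgFieldLen[f];
            sum += fieldWeight[f] * tf[f] / norm;
        }
        return sum;
    }

    // Saturates the pseudo-frequency once and applies a BM25-style IDF.
    double score(int[] tf, double[] fieldLen, int df, int numDocs) {
        double x = pseudoTf(tf, fieldLen);
        if (x <= 0.0) {
            return 0.0;
        }
        double idf = Math.log((numDocs - df + 0.5) / (df + 0.5));
        return idf * x / (k1 + x);
    }

    public static void main(String[] args) {
        // Hypothetical setup: field 0 = title (weight 3), field 1 = body (weight 1).
        Bm25fScorer scorer = new Bm25fScorer(1.2,
            new double[] {3.0, 1.0}, new double[] {0.5, 0.75},
            new double[] {10.0, 300.0});
        double[] lengths = {10.0, 300.0};
        System.out.println(scorer.score(new int[] {1, 0}, lengths, 50, 100000));
        System.out.println(scorer.score(new int[] {0, 1}, lengths, 50, 100000));
    }
}
```

With these made-up weights, one occurrence in the title contributes more than one occurrence in the body, which is the effect the fielded version is meant to capture.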
Deliverables for Project 1
For the basic project, you will
submit the locations of 3 files containing the outputs of your system in
the format described on the project description page; the input is the query file
topics.351-400. Each output will be generated using a different scoring
function: the original Lucene similarity, TF*IDF, and BM25 (TBD in class on
Monday). Name the files as follows: your_login.Lucene
for the Lucene similarity output, your_login.TFIDF, and your_login.BM25,
respectively.
For the extra credit submission
files, the format is the same, but name the files your_login.BM25F and
your_login.
Evaluation Code
To get quantitative scores for
your programs (before they are submitted), I am providing a set of queries and
relevance judgments, and a program to compute Recall-Precision values as well
as Average Precision, R-Precision, and Precision at K, which we covered in class.
The evaluation data and code are
available in /aut/proj/cs584/cs584000/project1/eval/:
test queries: topics.351-400
sample system output: input.ATT
relevance judgements: qrels.trec7
code: trec_eval
Usage: trec_eval [-h] [-q] [-a] [-o] [-v] relevance_file system_output
-h: Give full help information, including other options
-q: In addition to summary evaluation, give evaluation for each query
-a: Print all evaluation measures, instead of just official measures
-o: Print requested measures in old non-relational format
e.g., trec_eval -o qrels.trec7 input.ATT
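For reference, the ranking measures named in this section can be sketched for a single query as shown below. These are the textbook definitions only, not the trec_eval implementation, whose output may differ in details; the class name EvalMeasures and the document ids are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class EvalMeasures {

    // Fraction of the top k results that are relevant.
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        return (double) hits / k;
    }

    // Mean of the precision values at each rank where a relevant document
    // appears, divided by the total number of relevant documents.
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    // Precision at rank R, where R is the number of relevant documents.
    static double rPrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return precisionAtK(ranked, relevant, relevant.size());
    }

    public static void main(String[] args) {
        // Made-up ranking and judgments for one query.
        List<String> ranked = Arrays.asList("d1", "d2", "d3", "d4");
        Set<String> relevant = new HashSet<String>(Arrays.asList("d1", "d3", "d9"));
        System.out.println(precisionAtK(ranked, relevant, 2));
        System.out.println(averagePrecision(ranked, relevant));
        System.out.println(rPrecision(ranked, relevant));
    }
}
```

Note that a relevant document that is never retrieved (d9 in the sample) still counts in the denominator of Average Precision.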
This information will also be posted on the project1 web page:
http://www.mathcs.emory.edu/~eugene/teaching/CS584_IR/project1/