Natural Language Processing (CS571) Project 2
Named Entity Tagging
Due: Thursday, October 25, 2007


Goals

  • Use Hidden Markov Models to identify named entities in text and classify them into the appropriate classes

  • Gain experience with off-the-shelf libraries for HMM-based tagging and text classification

Entity types

  • Person (PER): Person name

  • Location (LOC): Location name such as a city (LocCity), a state (LocState), or a country (LocCountry)

  • Company (ORG): Organization such as a political party (OrgPol) or a company (OrgCom)

Note that more precise entity subtypes are provided in the data; ignore the subtypes for the main assignment.

Method

  • Baseline:
    Train a classifier (such as the one used in Project 1) to classify all noun phrases (NPs) as PER, LOC, ORG, or Other (if none of these apply). You may use heuristics as needed.
  • HMM (or CMM):
    Train an HMM (or CMM or MEMM or CRF) to tag the named entities.
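
To make the HMM option concrete, here is a minimal sketch of a bigram HMM tagger with Viterbi decoding. It is an illustration only, not the required implementation: the input representation (a sentence as a list of (token, tag) pairs), the "OTHER" tag name, the add-alpha smoothing, and the function names train_hmm and viterbi are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start state


def train_hmm(tagged_sentences, alpha=1.0):
    """Estimate add-alpha smoothed log-probabilities from (token, tag) sentences."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][token] = count
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for token, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][token] += 1
            vocab.add(token)
            prev = tag
    tags = sorted(emit)
    vocab_size = len(vocab) + 1  # one extra slot reserves mass for unseen tokens

    def log_trans(prev, tag):
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + alpha) / (total + alpha * len(tags)))

    def log_emit(tag, token):
        total = sum(emit[tag].values())
        return math.log((emit[tag][token] + alpha) / (total + alpha * vocab_size))

    return tags, log_trans, log_emit


def viterbi(tokens, tags, log_trans, log_emit):
    """Return the most probable tag sequence for tokens under the HMM."""
    if not tokens:
        return []
    # best[i][t]: log-probability of the best path that ends in tag t at position i
    best = [{t: log_trans(START, t) + log_emit(t, tokens[0]) for t in tags}]
    back = []  # back[i][t]: best previous tag for tag t at position i + 1
    for token in tokens[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + log_trans(p, t))
            scores[t] = best[-1][prev] + log_trans(prev, t) + log_emit(t, token)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

For example, after tags, log_trans, log_emit = train_hmm(training_sentences), calling viterbi(["Mary", "visited", "Atlanta"], tags, log_trans, log_emit) returns one tag per token. A real system would likely need better handling of unseen words (e.g., capitalization or suffix features) than the single reserved slot used here.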

Libraries

You may use the HMM (and/or CMM or CRF, or machine learning/classification components) of the following libraries, but not the full-fledged NER taggers. If you prefer to use other libraries, make sure to check with the instructor.

Evaluation

  • Split data (labeled files) into disjoint training, development, and test sets. Suggested split:
    60% for training, 20% for dev-test (validation/parameter tuning), and 20% for test (reporting results)
  • Metrics: per-class precision and recall; overall accuracy.
  • Note: I will provide an evaluation script that takes your predictions and outputs a variety of scores.
  • Format for your output: same format as the input, except the Named Entity label should be generated by your system.
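
The split and the metrics above can be sketched as follows. This is only an illustration under stated assumptions: the provided evaluation script may compute its scores differently (for example, per entity rather than per token), and the function names split_data and evaluate, the fixed random seed, and the "OTHER" label are hypothetical choices made for this example.

```python
import random
from collections import Counter


def split_data(items, seed=0, train_pct=60, dev_pct=20):
    """Shuffle items and split them into disjoint train, dev, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = len(items) * train_pct // 100
    n_dev = len(items) * dev_pct // 100
    return (items[:n_train],
            items[n_train:n_train + n_dev],
            items[n_train + n_dev:])


def evaluate(gold, predicted, classes=("PER", "LOC", "ORG")):
    """Return ({class: (precision, recall)}, overall accuracy) for label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct, n_pred, n_gold = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        n_gold[g] += 1
        n_pred[p] += 1
        if g == p:
            correct[g] += 1
    report = {}
    for c in classes:
        precision = correct[c] / n_pred[c] if n_pred[c] else 0.0
        recall = correct[c] / n_gold[c] if n_gold[c] else 0.0
        report[c] = (precision, recall)
    accuracy = sum(correct.values()) / len(gold) if gold else 0.0
    return report, accuracy
```

Splitting at the file level (passing a list of file names to split_data) rather than at the token level keeps whole documents in a single partition, which avoids leaking test sentences into training.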

Deliverables

  • Project code: Create a Project2 directory in /aut/proj/cs571/your_id/ and place all code and scripts there. I must be able to run your code.
  • Teams: Collaboration (i.e., working in pairs and submitting a joint project) is strongly encouraged. Discussion of project-related issues, questions, and difficulties is also encouraged; please use the Discussion board at http://classes.emory.edu/
  • Short (< 3 pages) writeup:
    • Implementation details (e.g., approach taken and main algorithms used, programming language, external libraries used, code implemented, etc.)
    • Sufficient documentation on how to run your code
    • Summary of accuracy (as a plot or table) for varying amounts of training data (e.g., 20%, 40%, 60%, 80%, 100% of the available training data)
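
One way to produce the accuracy summary for varying amounts of training data is sketched below. It assumes you already have two functions of your own: train_fn, which builds a model from a list of training items, and accuracy_fn, which scores a model on held-out items. Both names, as well as training_subsets, learning_curve, and format_table, are hypothetical and used only for this example.

```python
def training_subsets(train_items, percentages=(20, 40, 60, 80, 100)):
    """Yield (percentage, subset) pairs; each subset is a prefix of the next."""
    for pct in percentages:
        yield pct, train_items[: len(train_items) * pct // 100]


def learning_curve(train_items, heldout_items, train_fn, accuracy_fn):
    """Train on growing subsets and score each model on the same held-out set."""
    rows = []
    for pct, subset in training_subsets(train_items):
        model = train_fn(subset)
        rows.append((pct, len(subset), accuracy_fn(model, heldout_items)))
    return rows


def format_table(rows):
    """Render (percentage, size, accuracy) rows as a plain-text table."""
    lines = ["train%  items  accuracy"]
    for pct, n_items, acc in rows:
        lines.append(f"{pct:>6}  {n_items:>5}  {acc:8.3f}")
    return "\n".join(lines)
```

Using nested subsets (each one a prefix of the next) and a single fixed held-out set keeps the points on the curve comparable with one another.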

Data

  •  /home/cs571000/Project2/data/

Extra Credit

To receive credit, describe exactly what was done and provide appropriate experimental results. You may use up to 2 additional pages to describe the extra credit submission.

  • Entity Subtypes (20pts): Instead of recognising only the main classes (PER, ORG, LOC), predict the precise entity subtype (as specified in the training data). Take advantage of the type hierarchy/taxonomy (e.g., combine training evidence for the general ORG class to predict OrgPol).
  • Language-independent NER (20pts): Show that your code generalizes to other languages (Spanish and Dutch) by training and testing on the data provided, which is in a format similar to that of the main task.
    data: /home/cs571000/project2/extra/multilingual/
  • Unsupervised NER (50pts): Develop a way to exploit a large unlabeled (untagged) corpus (one year's worth of NYT) to improve accuracy over your supervised-only approach. This is a difficult task and can grow into a final project for the semester.
    data: /home/cs571000/project2/extra/unlabeled/NYT95/
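
For the Entity Subtypes option, one possible way to use the type hierarchy is to interpolate a subtype's emission estimate with that of its parent class, so that evidence pooled over all ORG examples still informs OrgPol when subtype data is sparse. The sketch below is an assumption-laden illustration, not a prescribed method: the PARENT mapping covers only the subtype names mentioned in this handout, and the interpolation weight lam and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical subtype-to-class mapping, built from the names in this handout.
PARENT = {
    "LocCity": "LOC", "LocState": "LOC", "LocCountry": "LOC",
    "OrgPol": "ORG", "OrgCom": "ORG",
}


def count_emissions(tagged_tokens):
    """Count token occurrences per subtype and per parent class."""
    sub_counts = defaultdict(Counter)
    main_counts = defaultdict(Counter)
    for token, subtype in tagged_tokens:
        sub_counts[subtype][token] += 1
        # Labels without a known parent act as their own parent.
        main_counts[PARENT.get(subtype, subtype)][token] += 1
    return sub_counts, main_counts


def backoff_emission(token, subtype, sub_counts, main_counts, lam=0.7):
    """Interpolate P(token | subtype) with P(token | parent class)."""
    parent = PARENT.get(subtype, subtype)
    sub_total = sum(sub_counts[subtype].values())
    main_total = sum(main_counts[parent].values())
    p_sub = sub_counts[subtype][token] / sub_total if sub_total else 0.0
    p_main = main_counts[parent][token] / main_total if main_total else 0.0
    return lam * p_sub + (1 - lam) * p_main
```

With this estimate, a token seen only with OrgCom in training still receives a small nonzero probability under OrgPol through the shared ORG counts. The weight lam would normally be tuned on the dev-test set.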