Method
- Baseline: Train a classifier (such as one used in Project 1) to classify all
NPs into Per, Loc, Org, or Other (if none applies). You may use heuristics
as needed.
- HMM (or CMM): Train an HMM (or CMM, MEMM, or CRF) to tag the named
entities.
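As a rough illustration of the HMM option, the sketch below estimates transition and emission probabilities by counting and decodes with Viterbi. It is a minimal sketch and not a required design: the tag names (PER, LOC, ORG, O), the `<s>` start symbol, the add-alpha smoothing, and the function names `train_hmm` and `viterbi` are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # assumed start-of-sentence symbol


def train_hmm(tagged_sentences, alpha=1.0):
    """Count transitions and emissions from sentences of (word, tag) pairs."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return {
        "trans": trans,
        "emit": emit,
        "tags": sorted(emit),
        "vocab_size": len(vocab) + 1,  # +1 reserves mass for unseen words
        "alpha": alpha,
    }


def _log_prob(counter, key, n_outcomes, alpha):
    """Add-alpha smoothed log probability of `key` under `counter`."""
    total = sum(counter.values())
    return math.log((counter[key] + alpha) / (total + alpha * n_outcomes))


def viterbi(model, words):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags, alpha = model["tags"], model["alpha"]
    n_tags, v = len(tags), model["vocab_size"]
    # best[i][t] = (log-prob of the best path ending in tag t at i, backpointer)
    best = [{}]
    for t in tags:
        score = (_log_prob(model["trans"][START], t, n_tags, alpha)
                 + _log_prob(model["emit"][t], words[0], v, alpha))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            e = _log_prob(model["emit"][t], words[i], v, alpha)
            best[i][t] = max(
                (best[i - 1][p][0]
                 + _log_prob(model["trans"][p], t, n_tags, alpha) + e, p)
                for p in tags
            )
    # Follow backpointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]
```

The model here conditions only on the word itself; features such as capitalization or suffixes would need a CMM/MEMM or CRF rather than this plain HMM.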
Libraries
You may use the HMM (and/or CMM or CRF, or machine learning/classification
components) of the following libraries, but not the full-fledged NER taggers.
If you prefer to use other libraries, make sure to check with the
instructor.
Evaluation
- Split the data (labeled files) into disjoint training, development, and
test sets. Suggested split: 60% for training, 20% for dev-test
(validation/parameter tuning), and 20% for test (reporting results).
- Metrics: per-class precision and recall; overall accuracy.
- Note: I will provide an evaluation script that takes your
predictions and outputs a variety of scores.
- Format for your output: same format as the input, except that the
Named Entity label should be generated by your system.
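For local sanity checks before the provided evaluation script is available, the split and the metrics above can be sketched as follows. This is not the instructor's script: the function names `split_60_20_20` and `evaluate`, the fixed shuffle seed, and token-level (rather than entity-level) scoring are assumptions of this example.

```python
import random
from collections import Counter


def split_60_20_20(items, seed=0):
    """Shuffle and split into disjoint train/dev/test lists (60/20/20)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = len(items) * 3 // 5
    n_dev = len(items) // 5
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


def evaluate(gold, pred):
    """Return ({label: (precision, recall)}, accuracy) over aligned labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    correct, n_pred, n_gold = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        n_gold[g] += 1
        n_pred[p] += 1
        if g == p:
            correct[g] += 1
    scores = {}
    for label in sorted(set(n_gold) | set(n_pred)):
        # A label that is never predicted (or never appears in gold) scores 0.0.
        precision = correct[label] / n_pred[label] if n_pred[label] else 0.0
        recall = correct[label] / n_gold[label] if n_gold[label] else 0.0
        scores[label] = (precision, recall)
    accuracy = sum(correct.values()) / len(gold) if gold else 0.0
    return scores, accuracy
```

Splitting by file (passing the list of labeled files as `items`) keeps sentences from one document out of both training and test data.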
Deliverables
- Project code: Create a Project2 directory in
/aut/proj/cs571/your_id/ and place all code/scripts
there. I must be able to run your code.
- Teams: Collaboration (i.e., working in pairs and
submitting a joint project) is strongly encouraged. Discussion
of project-related issues, questions, and difficulties is also encouraged;
please use the Discussion board at
http://classes.emory.edu/
- Short (< 3 pages) writeup:
- Describe implementation details (e.g.,
approach taken and main algorithms used, programming language,
external libraries used, code implemented, etc.).
- Provide sufficient documentation on how to run your code.
- Summarize accuracy (plot or
tabular format) for varying amounts of training data
(e.g., 20%, 40%, 60%, 80%, 100% of available training data).
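One way to produce the accuracy summary for the writeup is to retrain on growing prefixes of the training set and score each model on the same held-out data. The sketch below assumes hypothetical callables `train_fn(train_subset)` and `accuracy_fn(model, test)` that wrap your own system; the majority-class functions are toy stand-ins included only so the example runs.

```python
from collections import Counter


def learning_curve(train, test, train_fn, accuracy_fn,
                   fractions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Train on growing prefixes of `train`; return (fraction, n, accuracy) rows."""
    rows = []
    for frac in fractions:
        n = max(1, round(frac * len(train)))
        model = train_fn(train[:n])
        rows.append((frac, n, accuracy_fn(model, test)))
    return rows


def format_table(rows):
    """Render the rows as a plain-text table for the writeup."""
    lines = ["fraction  n_train  accuracy"]
    for frac, n, acc in rows:
        lines.append(f"{frac:>8.0%}  {n:>7d}  {acc:>8.3f}")
    return "\n".join(lines)


# Toy stand-ins (assumptions): a model that always predicts the majority tag.
def train_majority(tagged_tokens):
    return Counter(tag for _, tag in tagged_tokens).most_common(1)[0][0]


def majority_accuracy(model, tagged_tokens):
    return sum(tag == model for _, tag in tagged_tokens) / len(tagged_tokens)
```

Calling `print(format_table(learning_curve(train, test, train_majority, majority_accuracy)))` prints one row per training fraction. Shuffle the training data once before taking prefixes so that each subset is a representative sample.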
Data
- /home/cs571000/Project2/data/
Extra Credit
To receive credit,
describe exactly what was done and provide appropriate experimental
results. You may use up to 2 additional pages to describe the extra
credit submission.
- Entity Subtypes (20pts): Instead of
recognising only the main classes (PER, ORG, LOC), predict the precise
entity subtype (as specified in the training data). Take advantage
of the type hierarchy/taxonomy (e.g., combine training evidence for
general ORG to predict OrgPol).
- Language-independent NER
(20pts): Show that your code generalizes to other languages
(Spanish and Dutch) by training and testing on data provided in
a format similar to the main task.
data: /home/cs571000/project2/extra/multilingual/
- Unsupervised NER (50pts):
Develop a way to exploit a large unlabeled (untagged) corpus
(one year's worth of NYT) to improve accuracy compared to your
supervised-only approach. This is a difficult task and can grow
into a final project for the semester.
data: /home/cs571000/project2/extra/unlabeled/NYT95/
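For the Entity Subtypes option, one possible way to combine evidence across the type hierarchy is to interpolate a subtype's word estimate with the estimate pooled over its parent class. The sketch below is only an illustration under stated assumptions: the `PARENT` mapping is hypothetical apart from the OrgPol/ORG pair named above, and the interpolation weight `lam` and the function names are choices made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical subtype -> parent mapping; extend it from the training data labels.
PARENT = {"OrgPol": "ORG"}


def train_emissions(tagged_tokens, parent=PARENT):
    """Count words per exact label and per parent class (subtypes pooled)."""
    sub_counts = defaultdict(Counter)
    parent_counts = defaultdict(Counter)
    for word, label in tagged_tokens:
        sub_counts[label][word] += 1
        parent_counts[parent.get(label, label)][word] += 1
    return sub_counts, parent_counts


def backoff_prob(word, label, sub_counts, parent_counts, lam=0.7, parent=PARENT):
    """Interpolate P(word | subtype) with P(word | parent class)."""
    def max_likelihood(counter):
        total = sum(counter.values())
        return counter[word] / total if total else 0.0

    top = parent.get(label, label)
    return (lam * max_likelihood(sub_counts[label])
            + (1 - lam) * max_likelihood(parent_counts[top]))
```

With this scheme a word seen only under general ORG still receives non-zero probability for OrgPol, which is the kind of evidence sharing the hierarchy makes possible; `lam` would be tuned on the dev-test set.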