Natural Language Processing (CS571) Project 1, Part 1
Named Entity Classification
Due: Tuesday, 25 September, 2007

 


Goal:

Use class-conditioned character language models to classify a given list of entities into one of four target classes: Person, Location, Organization, or Movie. Use Bayes' rule to make the final decision based on the language model predictions.
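
Written out, the decision rule looks roughly as follows. The notation is an assumption made here for illustration: s is the entity string, c_1 ... c_m are its characters, n is the n-gram order, and C_i ranges over the four classes.

```latex
\hat{C} = \arg\max_{C} P(C \mid s)
        = \arg\max_{C} \frac{P(s \mid C)\, P(C)}{\sum_{i} P(s \mid C_i)\, P(C_i)}
        = \arg\max_{C} P(s \mid C)\, P(C),
\qquad
P(s \mid C) \approx \prod_{j=1}^{m} P(c_j \mid c_{j-n+1} \dots c_{j-1},\, C)
```

The denominator is the same for every class, so it can be dropped when picking the most likely class; it only matters when a normalized score is needed.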

 

Entity types:

  • Person (PER): Name, mostly U.S./English  Source: Hoovers' company database, 1999

  • Location (LOC): City, predominantly U.S.  Source: Hoovers' company database, 1999

  • Company (ORG): Organization, mostly U.S. Source: Hoovers' company database, 1999

  • Movie (MOV): Big-screen/significant movie. Source: Wikipedia

Method

  • Train n-gram (n=1,2,3,4,5) character language models for each entity type (class) to estimate probability of a given string (entity) generated from that class.
  • Combine these probabilities with Bayes' rule to estimate the posterior probability of each class given the string, and predict the most likely class.
  • Use backoff smoothing (e.g., Good-Turing + backoff with linear interpolation) to approximate values with lower-order statistics.
  • Compute confidence for each class prediction: Conf(C) = P(C|string) / Sum_i ( P(C_i | string ) )
  • When designing your code, keep in mind that Part 2 will use your classifier from Part 1, so make sure there is a reasonable way to call your code to classify a given string.
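
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: it swaps in simple fixed-weight linear interpolation with an add-one unigram floor in place of Good-Turing backoff, and the names CharNgramLM, train_classifier, classify, lam and vocab_size are hypothetical. The confidence is computed by normalizing P(string|C) P(C) over the classes, which gives the same value as the Conf(C) formula above.

```python
import math
from collections import Counter, defaultdict

BOS, EOS = "\x02", "\x03"  # hypothetical start/end padding symbols


class CharNgramLM:
    """Character n-gram model; orders 1..n combined by linear interpolation."""

    def __init__(self, n, lam=0.7, vocab_size=256):
        self.n = n
        self.lam = lam                # assumed weight on the higher-order estimate
        self.vocab_size = vocab_size  # assumed alphabet size for the add-one floor
        self.ngrams = Counter()       # counts of (context, next_char)
        self.contexts = Counter()     # counts of context

    def _pad(self, s):
        return BOS * (self.n - 1) + s + EOS

    def train(self, strings):
        for s in strings:
            p = self._pad(s)
            for i in range(self.n - 1, len(p)):
                for k in range(self.n):  # k = context length, 0..n-1
                    ctx = p[i - k:i]
                    self.ngrams[(ctx, p[i])] += 1
                    self.contexts[ctx] += 1

    def logprob(self, s):
        """log P(s | class), summed over the characters of s plus the end symbol."""
        p = self._pad(s)
        total = 0.0
        for i in range(self.n - 1, len(p)):
            # Add-one smoothed unigram keeps every probability non-zero.
            prob = (self.ngrams[("", p[i])] + 1) / (self.contexts[""] + self.vocab_size)
            for k in range(1, self.n):
                ctx = p[i - k:i]
                seen = self.contexts[ctx]
                if seen:  # unseen context: keep the lower-order estimate
                    prob = self.lam * self.ngrams[(ctx, p[i])] / seen + (1 - self.lam) * prob
            total += math.log(prob)
        return total


def train_classifier(labeled, n=3):
    """labeled: list of (string, class). Returns per-class models and log priors."""
    by_class = defaultdict(list)
    for s, label in labeled:
        by_class[label].append(s)
    models, log_prior = {}, {}
    for label, strings in by_class.items():
        models[label] = CharNgramLM(n)
        models[label].train(strings)
        log_prior[label] = math.log(len(strings) / len(labeled))
    return models, log_prior


def classify(s, models, log_prior):
    """Return (most likely class, confidence) for string s."""
    scores = {c: log_prior[c] + m.logprob(s) for c, m in models.items()}
    top = max(scores.values())  # subtract the max before exp() to limit underflow
    unnorm = {c: math.exp(v - top) for c, v in scores.items()}
    z = sum(unnorm.values())
    conf = {c: u / z for c, u in unnorm.items()}
    best = max(conf, key=conf.get)
    return best, conf[best]


# Toy usage with made-up training strings (not taken from entities.txt).
toy = [("John Smith", "PER"), ("Mary Jones", "PER"), ("Atlanta", "LOC"), ("Boston", "LOC")]
models, log_prior = train_classifier(toy, n=3)
label, confidence = classify("Jane Smith", models, log_prior)
```

Working in log space avoids underflow when multiplying many small character probabilities, and a single classify(string) entry point is one reasonable way to make the classifier callable from later code.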

Evaluation

  • Use k-fold cross-validation (usually k=3, 5 or 10): train your models on a random subset of the data (training size = (k-1)/k) and test on the rest. Repeat k times, so that each fold is used for testing once.
  • Metrics: Per-Class Precision, Recall. Overall: Accuracy.
  • Note: I will provide an evaluation script that takes your predictions and outputs a variety of scores.
  • Format for your output: String \t Predicted_Type \t Confidence Score \t True_Type
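
A rough sketch of the evaluation loop pieces is given below, intended only for local sanity checks alongside the provided evaluation script. The function names k_fold_splits, format_line and evaluate are hypothetical, and printing the confidence with four decimal places is an assumption, since the output format above does not fix a precision.

```python
import random
from collections import Counter


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


def format_line(string, predicted, confidence, true_type):
    """Tab-separated line: String, Predicted_Type, Confidence Score, True_Type."""
    return f"{string}\t{predicted}\t{confidence:.4f}\t{true_type}"


def evaluate(pairs):
    """pairs: iterable of (predicted, true).

    Returns overall accuracy and a dict class -> (precision, recall).
    A class with no predictions (or no true items) gets 0.0 for that metric.
    """
    pairs = list(pairs)
    tp, pred_n, true_n = Counter(), Counter(), Counter()
    for pred, true in pairs:
        pred_n[pred] += 1
        true_n[true] += 1
        if pred == true:
            tp[pred] += 1
    per_class = {
        c: (
            tp[c] / pred_n[c] if pred_n[c] else 0.0,
            tp[c] / true_n[c] if true_n[c] else 0.0,
        )
        for c in set(pred_n) | set(true_n)
    }
    accuracy = sum(tp.values()) / len(pairs)
    return accuracy, per_class
```

Fixing the shuffle seed makes the folds reproducible across runs, which helps when comparing different n-gram orders on the same splits.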

Deliverables

  • Project code: Create a Project1.1 directory in /aut/proj/cs571/your_id/ and place all code/scripts there. I must be able to run your code.
  • Teams: Collaboration (i.e., working in pairs and submitting a joint project) is strongly encouraged. Discussion of project-related issues, questions, and difficulties is also encouraged; please use the Discussion board at http://classes.emory.edu/
  • Short (< 3 page) writeup:
    • Implementation details (e.g., programming language, main classes/modules, smoothing methods used, etc.)
    • Sufficient documentation on how to run your code
    • Summary of the results (plot or table) for varying n-gram size or any other interesting parameter

Data

The set of entities (99550 total) for training and testing is available on MathCS: /home/cs571000/Project1.1/entities.txt
Note that the dataset is not balanced (there are more Person and Location entities than ORG and MOV).