CS571: Natural Language Processing
Fall 2007

Tuesdays and Thursdays, 10am-11:15am
Math & Science Center
W408



Course Description

This course introduces students to the fundamental concepts and ideas in natural language processing (NLP) and brings them up to speed with current research in the area. It develops an understanding of the algorithms available for processing linguistic information and of the underlying computational properties of text. Word-level, syntactic, and semantic processing are considered from an algorithmic perspective. The focus is on modern quantitative techniques in NLP: the use of large corpora and statistical models for acquisition, disambiguation, parsing, and information extraction. Advanced topics will include text mining and knowledge discovery from text data, applied to the bioinformatics and medical informatics domains.

 

Prerequisites:

Proficiency in Java (or C, or Perl) programming, comfort with basic probability and statistics, CS253 (data structures & algorithms) or equivalent.
 

Topics:

PART I: General Concepts and Techniques

  • Preliminaries: basic probability & statistics: estimation, smoothing; information theory: entropy, mutual information, perplexity

  • Words: Morphology, word sense disambiguation, collocations

  • Language Models: LMs in text and speech processing

  • Syntax and Grammar: Context-Free Grammars (CFGs), top-down and bottom-up parsing, Probabilistic Context-Free Grammars (PCFGs), modern statistical parsers

  • Semantics and Pragmatics: Semantic parsing, role labeling, contextual processing, logical representation

  • Context and World Knowledge: using world knowledge for parsing, interpretation/disambiguation, reasoning

 

PART II: Applications and Domain-Specific Techniques (tentative)

  • Information Extraction: Part-of-speech tagging and sequence inference: Hidden Markov Models and refinements, Named Entity Recognition (NER), Relation Extraction (RE), anaphora resolution.

  • Information Retrieval: Indexing, retrieval, and presentation of information in text; ranking; Web search

  • Text Mining: Classification, clustering, information integration

  • Knowledge Representation and Reasoning: frames, quantification, procedural semantics

  • Biomedical NLP: term and relation extraction, ontologies, knowledge acquisition.

  • Other advanced topics (based on student interest): discourse structure, language generation, question answering, …

 

Readings:

 

Grading:

20% Project 1: small well-defined project

30% Project 2: larger open-ended project

20% Midterm: focused on Part I

30% Final: cumulative, but focused more on advanced material in Part II

 

Collaboration:

 

For both projects, you are free to work alone, but you are also allowed (and indeed encouraged) to work in pairs. This means developing ideas together, writing code together, and submitting a joint report. If you choose to collaborate, your submission must include a statement describing the contributions of each collaborator. For example: "We did the entire project as pair programming over several late nights." Or: "Sue built the initial parser, while Joe worked on improving parse quality through the use of features." Ordinarily, all team members will receive the same grade for an assignment unless their contributions were significantly and obviously unequal.

Instructor:

Eugene Agichtein
 

Office:

N420 Math & Science Center. Telephone: (404) 727-7962

Office Hours:

Catch me immediately after class (might as well make it official)

Tue 5-6pm

Wed 5-6pm

Other days/times: by appointment

Email: eugene@mathcs.emory.edu

Last modified: 3 September 2007