6.863J/9.611J Natural Language Processing
 
Fifty Ways to leave your 6.863: Including past projects

Projects from the NLTK docs (at the end of each chapter) and others

A list of projects by type, including other past projects (well beyond 50 projects...)
• NLTK projects: http://nltk.org/index.php/Projects (over 60 NLTK-based projects, more coding). These are typically very ambitious, but can be scaled down to make them more doable.
• NLTK contributed packages can be extended as the foundation of a term project. Look at the code in the Athena dir or browse the subversion directory on the NLTK site (under the nltk_contrib section). Current packages to build on include the following:
1. Brill part of speech tagging
2. Readability indices (i.e., a program that 'scores' how hard or easy a text is; Micros*ft Word tries to do this)
3. Classifiers for texts (e.g., email, web data, etc.)
• Particular topics in the NLTK Guides (http://nltk.org/doc/guides/) make good term projects, including:
1. Discourse representation theory (DRT) – essentially invented by Prof. Irene Heim here at MIT, a method used to 'resolve anaphora' in a discourse in a systematic way; see http://nltk.org/doc/guides/discourse.html and http://nltk.org/doc/guides/drt.html. (You should also read chapter 12 of the NLTK book on semantics and logic.) References: Irene Heim and Angelika Kratzer. Semantics in Generative Grammar. Blackwell, 1998; Patrick Blackburn and Johan Bos. Representation and Inference for Natural Language: A First Course in Computational Semantics. CSLI Publications, Stanford, CA, 2005.
2. Chunk parsing: Read section 7.3 of the NLTK documentation and develop a chunking parser by extending the exercises at the end of section 7.4 (combine exercises 7, 8, 9 and run on the CoNLL corpus); see also section 7.7 on 'named entity extraction' to use chunking in conjunction with 'shallow interpretation' (this was an actual DARPA task at one time). For more on 'named entity extraction' in NLTK see http://nltk.org/doc/guides/relextract.html.
3. Lexical acquisition: exercise 7, section 8.4.6 – use chunking to find lexical semantic patterns, and apply it to the Penn Treebank to evaluate this method for 'learning' lexical-semantic classes. (Reference: Levin, English Verb Classes and Their Alternations.)
4. Parsing: Exercise 4 of section 8.5.4. Develop a left-corner parser based on the recursive descent parser, and inheriting from ParseI. (Note, this exercise requires knowledge of Python classes, covered in Chapter 10.) Test the speed of this parser (via counting the edges it creates) systematically against the Earley parser, preferably using two distinct languages, e.g., English vs. German (or Japanese), since these two languages have very different 'left corner' structures. See: http://nltk.org/doc/guides/parse.html.
5. Parsing: Get the 'slash category' grammars described in chapter 11 of the NLTK book to work on the full set of Gazdar sentences (we exhibit a partially working grammar in lecture). For the Gazdar sentences see here.
6. Parsing: Read section 8.5 and develop a parallel LR parser; test it on the so-called 'Gazdar sentences' (which we can provide).
7. Semantics: Read chapter 12 of the NLTK book and extend the chat80 code there so that it will extract data from a relational database using SQL queries. (We have RDB information from the so-called ATIS system so that you can use that too.)
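As a starting point for the left-corner parsing project (item 4 above), here is a minimal sketch of computing the left-corner relation of a context-free grammar, which is the table a left-corner parser consults and the thing that differs between languages. It is plain Python and does not use NLTK; the toy grammar and the function name left_corner_table are made up for illustration, and a real project would read the rules from an NLTK grammar instead.

```python
from collections import defaultdict

# Toy grammar (hypothetical, for illustration only): (lhs, rhs) pairs.
TOY_GRAMMAR = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("NP", ("NP", "PP")),
    ("VP", ("V", "NP")),
    ("VP", ("VP", "PP")),
    ("PP", ("P", "NP")),
]


def left_corner_table(rules):
    """Map each nonterminal to the set of symbols that can be its left corner."""
    # Direct left corners: the first symbol on each right-hand side.
    table = defaultdict(set)
    for lhs, rhs in rules:
        if rhs:
            table[lhs].add(rhs[0])
    # Transitive closure: if B is a left corner of A, so are B's left corners.
    changed = True
    while changed:
        changed = False
        for lhs in list(table):
            for corner in list(table[lhs]):
                new = table.get(corner, set()) - table[lhs]
                if new:
                    table[lhs] |= new
                    changed = True
    return dict(table)


if __name__ == "__main__":
    for category, corners in sorted(left_corner_table(TOY_GRAMMAR).items()):
        print(category, sorted(corners))
```

Running the same function over an English grammar and a German or Japanese one is a quick way to see how different their left-corner tables are before measuring parser speed.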
And there is more. Many further ideas can be found below:

  1. Morphology and words (11 project suggestions)
  2. Tagging (8 project suggestions)
  3. Parsing (13 project suggestions)
  4. Semantic interpretation & lexical-conceptual structure (21 project suggestions)

Links to a potpourri of recent past projects (the scope varies tremendously - don't be intimidated!)

  1. Learning quantifiers from examples, a Bayesian approach
  2. Parsing seminar announcements (syntax to semantics)
  3. Identifying and classifying newsgroup requests
  4. Learning Spanish verbal morphology by Bayesian methods
  5. Email processing (Sorry, the demo links won't work!)
  6. Intelligent pronoun resolution
  7. Extending semantic interpretation to handle quantifiers
  8. Adding general inference to semantic interpretation
  9. Italian-English machine translation using lexical-conceptual structure
  10. Parsing conjunctions in generalized phrase structure grammar
  11. Parsing prescriptions for medical databases

 
