Course project: LIN 353C Introduction to Computational Linguistics


LIN 353C course projects will usually be done in groups of two students. If you want to work in a larger or smaller group, you need the instructor's prior approval.

The section "Project topic suggestions" lists possible project topics. I suggest that you choose something from this list, but you can also choose something different.

All projects need to address an NLP problem and involve programming. All projects need to be evaluated in some form. 

As you need some initial results to discuss in your intermediate progress report, I suggest that you start out with some simple rule-based approach, then improve on it, maybe with some change in technique, in the second half of the semester.

Project topic suggestions

  • Create a system that can guess what language a given text is written in. To start out, you could use hand-written cues, for example typical words or character sequences. For more of a challenge, use training data to learn about frequent words in each language, or frequent character sequences.

  • Create a dialog agent that pretends to be a specific kind of person (for example, Parry was a 20-something single guy with specific hobbies and some paranoid tendencies). This can be done with rules that hard-code behaviors, or react to particular key words. You can also add probabilities if you like, such that your agent would, for example, have a particular probability of reacting with anger.

  • Create a system that will extract meeting specifics (time, place) from emails. I can make available part of the ENRON emails (which were made publicly available by the Federal Energy Regulatory Commission) for you to work with. Some rules encoding frequent patterns will let you extract some times and places of meetings, but time and date expressions can be surprisingly complicated. If you want a challenge, take on relative time expressions like "next Monday".

  • Create a morphological analyzer for a language other than English, ideally a morphologically complex one.

  • Create a system that can identify different types of "named entities", such as person names, locations, and organizations. To start out, you can use rules to identify named entities: for example, a sequence of capitalized words preceded by "Ms." is likely a person name. You can also train a machine learning system to do this task.
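To make the rule-based starting point concrete, here is a minimal sketch of the "Ms." rule from the named-entity suggestion. It is only an illustration, not a required design: the title list, the pattern, and the function name find_person_names are assumptions made up for this example, and a real system would need many more rules.

```python
import re

# Hypothetical, deliberately short list of titles (an assumption for this
# sketch); a usable system would need a much longer list.
TITLES = r"\b(?:Ms|Mr|Mrs|Dr|Prof)\."

# Rule: a title followed by one or more capitalized words.
PERSON_PATTERN = re.compile(TITLES + r"(?:\s+[A-Z][a-z]+)+")


def find_person_names(text):
    """Return every substring that matches the title + capitalized words rule."""
    return [match.group(0) for match in PERSON_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "Yesterday Ms. Jane Smith met Dr. Lee in Austin."
    print(find_person_names(sample))
```

A rule like this misses names without a title (and names such as "O'Brien" or "McDonald"), which is exactly the kind of error you can discuss in the intermediate report and try to fix in the second half of the semester.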


Requirements


You will write two documents about your course project: the intermediate report and the final report. You will also have a chance to discuss your project in the last week of classes, but this discussion will not be graded.

Intermediate report

At the time of the intermediate report, you need to have some system that addresses your problem. This can be a very simple, rule-based system; it need not be the final system.
  • 2-3 pages
  • contents:
    • Introduction with motivation: What is the project about, and why is it important?
    • What algorithms, rules, and data structures you are using
    • What corpus resources (if any) you are using
    • Initial results: At a minimum, a discussion of some things your system currently gets right or wrong. You can also report performance using an evaluation measure, but you do not have to.

Final report

The final report is about your final system. This should improve over the system as it was at the time of the intermediate report, either by using a different technique, or by using the same technique in a more sophisticated way. 
  • 4-5 pages
  • This is a revised version of your intermediate report. Do take into account all feedback that you got on your intermediate report. Do not omit introduction and motivation just because they were already in the intermediate report: The final report has to be self-contained.
  • Write this as a research report to an audience of computational linguists.
  • contents:
    • Introduction with motivation: What is the project about, and why is it important?
    • What algorithms, rules, and data structures you are using
    • What corpus resources (if any) you are using
    • Results: Describe as clearly as possible what your system can (and cannot) do. You can show examples of things your system gets right and of errors it makes. If you can, quantify performance with an evaluation measure.
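The reports do not prescribe a particular performance measure. As one possible choice, the sketch below computes precision, recall, and F1 by comparing the items your system extracted against a hand-labeled gold set. The function name precision_recall_f1 and the example data are assumptions for illustration only; other measures (for example, plain accuracy for language identification) may suit your project better.

```python
def precision_recall_f1(predicted, gold):
    """Compare system output against gold labels; both are collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    # Precision: what fraction of the system's output is correct?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what fraction of the gold items did the system find?
    recall = true_positives / len(gold) if gold else 0.0

    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up example data, not real results.
    system_output = {"Ms. Jane Smith", "Austin Energy"}
    gold_labels = {"Ms. Jane Smith", "Dr. Lee", "Enron"}
    p, r, f = precision_recall_f1(system_output, gold_labels)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Reporting precision and recall separately is often more informative than a single number, because it shows whether your system errs by proposing too much or by finding too little.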

