Course project: LIN 353C Introduction to Computational Linguistics

LIN 353C course projects will usually be done in groups of 2 students. If you want to work in a larger or smaller group, you need prior approval of the instructor.

The section "Project topic suggestions" below lists possible project topics. I suggest that you choose something from this list, but you can also choose something different.

All projects need to address an NLP problem and involve programming. All projects need to be evaluated in some form.

Because you need some initial results to discuss in your intermediate progress report, I suggest that you start out with a simple rule-based approach, then improve on it, possibly with a change in technique, in the second half of the semester.

The process

    • Around week 6: We discuss project ideas in class. Come prepared with ideas for what you might want to do. The aim is to make sure all projects are right-sized and doable.

    • Around week 9: Intermediate progress reports due. We also discuss projects in class in order to clear any roadblocks and to share data and methods that may be helpful to other projects.

    • Around week 14: Final project reports due.

    • Last week of classes: In-class presentations about projects. These are not graded.

Project topic suggestions

    • Create a system that can guess what language a given text is written in. To start out, you could use hand-written cues, for example typical words or character sequences. For more of a challenge, use training data to learn about frequent words in each language, or frequent character sequences.

    • Create a dialog agent that pretends to be a specific kind of person (for example, Parry was a 20-something single guy with specific hobbies and some paranoid tendencies). This can be done with rules that hard-code behaviors or that react to particular keywords. You can also add probabilities if you like, so that your agent would, for example, have a particular probability of reacting with anger.

    • Create a system that will extract meeting specifics (time, place) from emails. I can make available part of the ENRON emails (which were made publicly available by the Federal Energy Regulatory Commission) for you to work with. Some rules encoding frequent patterns will let you extract some times and places of meetings, but time and date expressions can be surprisingly complicated. If you want a challenge, take on relative time expressions like "next Monday".

    • Create a morphological analyzer for a language other than English, ideally a morphologically complex one.

    • Create a system that can identify different types of "named entities", such as person names, locations, and organizations. To start out, you can use rules to identify named entities; for example, a sequence of capitalized words of which the first is "Ms." is likely a person name. You can also train a machine learning system to do this task.

    • Create a grammar checker. To start out, you can use some rules; for example, you could flag sentences that end in a preposition. Then you can experiment with a language model to flag unlikely word sequences.
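To illustrate what a first rule-based version might look like, here is a minimal sketch for the language guessing suggestion above, using hand-written cue words. The cue word lists and the function name guess_language are made up for this example and are far too small for real use; treat this as one possible starting point, not as the required approach.

```python
import re
from collections import Counter

# Hypothetical hand-written cue words, invented for this illustration.
# A real project would need many more cues, and probably character
# sequences as well.
CUE_WORDS = {
    "english": {"the", "and", "is", "of", "to", "with"},
    "german": {"der", "die", "und", "ist", "nicht", "mit"},
    "spanish": {"el", "la", "y", "es", "los", "con"},
}


def guess_language(text):
    """Return the language whose cue words occur most often in the text,
    or None if no cue word occurs at all."""
    tokens = re.findall(r"\w+", text.lower())
    scores = Counter()
    for language, cues in CUE_WORDS.items():
        scores[language] = sum(1 for token in tokens if token in cues)
    best_language, best_score = scores.most_common(1)[0]
    return best_language if best_score > 0 else None


print(guess_language("Der Hund ist nicht mit der Katze im Garten"))
print(guess_language("The dog is in the garden with the cat"))
```

A natural next step, as suggested above, would be to replace the hand-written cue lists with word or character sequence counts learned from training data.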

Previous projects have also included:

    • Automatically determining the genre of a tweet, for example news, sports, entertainment

    • Automatically summarizing a document by automatically selecting the most important sentences

    • Automatically producing text in the style of some famous person, using language models to learn to mimic their style

    • Automatically producing a syntactic analysis of a sentence, or automatically assigning part-of-speech tags

    • Detecting the level of reading difficulty of a text

    • Authorship attribution: automatically detecting who, out of a number of possible authors, wrote a given text

Working in groups

If you are working in a group:

    • In the intermediate report, please include a short paragraph that briefly describes which part of the project each group member is responsible for.

    • In the final report: The report is written jointly by the group, but it needs to include a separate section from each group member that describes the part of the project done by that group member.


You will write two documents about your course project, the intermediate report and the final report. You will also have a chance to discuss your project in the last week of classes, but this discussion will not be graded.

Intermediate report

At the time of the intermediate report, you need to have some system that addresses your problem. This can be a very simple, rule-based system. It need not be the final system.

    • 2-3 pages

    • contents:

      • Introduction with motivation: What is the project about, and why is it important?

      • What algorithms, rules, and data structures you are using

      • What corpus resources (if any) you are using

      • Initial results. At a minimum, discuss some things your system is currently doing right or wrong. You can report a quantitative performance measure, but you do not have to.

      • If you are working in a group: who does what (as described above)

Final report

The final report is about your final system. This should improve over the system as it was at the time of the intermediate report, either by using a different technique, or by using the same technique in a more sophisticated way.

    • 4-5 pages

    • This is a revised version of your intermediate report. Do take into account all feedback that you got on your intermediate report. Do not omit introduction and motivation just because they were already in the intermediate report: The final report has to be self-contained.

    • Write this as a research report to an audience of computational linguists.

    • contents:

      • Introduction with motivation: What is the project about, and why is it important?

      • What algorithms, rules, and data structures you are using

      • What corpus resources (if any) you are using

      • Results: Describe as clearly as possible what your system can (and cannot) do. You can show examples of things your system is getting correct and of errors it is making. If you can, report performance using a quantitative measure.

      • If you are working in a group: separate section describing who did what (as described above)
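If you want to report a performance measure in your results section, accuracy, precision, recall, and F1 are common choices for systems that assign labels. The following is a minimal sketch; the function names and the toy gold and predicted labels are invented for this illustration, and other measures may fit your project better.

```python
def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def precision_recall_f1(gold, predicted, target):
    """Precision, recall, and F1 for one target label."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == target and p == target)
    predicted_pos = sum(1 for p in predicted if p == target)
    gold_pos = sum(1 for g in gold if g == target)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels, made up for this example (not real results).
gold = ["PER", "LOC", "PER", "ORG", "PER", "LOC"]
predicted = ["PER", "PER", "PER", "ORG", "LOC", "LOC"]

print("accuracy:", accuracy(gold, predicted))
print("PER precision/recall/F1:", precision_recall_f1(gold, predicted, "PER"))
```

Precision and recall are computed per label, so a report would typically list them for each label your system can assign.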


Are you building a supervised classification system? Then check chapter 6 of the NLTK book, which covers classification.

Also, you may want to use scikit-learn, a Python machine learning package.
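To give a rough idea of what a supervised classifier in scikit-learn can look like, here is a minimal sketch that learns character sequences for language guessing. The toy training sentences are invented for this illustration, and the choice of character n-gram counts with a Naive Bayes classifier is just one reasonable assumption; your project may call for different features or a different classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for this illustration only.
train_texts = [
    "the cat sat on the mat",
    "where is the train station",
    "I would like a cup of tea",
    "die Katze sitzt auf der Matte",
    "wo ist der Bahnhof",
    "ich möchte eine Tasse Tee",
]
train_labels = ["english", "english", "english", "german", "german", "german"]

# Features: counts of character 1- to 3-grams within word boundaries.
# Classifier: multinomial Naive Bayes.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    MultinomialNB(),
)
model.fit(train_texts, train_labels)

# Classify a new, unseen sentence.
prediction = model.predict(["the station is near the cat"])[0]
print(prediction)
```

With this little training data the predictions are not reliable; for a real evaluation you would train on a larger corpus and test on held-out data.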

Whatever kind of system you build, you will need to do an error analysis. Contrary to its somewhat negative-sounding name, an error analysis is not just a sad list of errors, but an in-depth look at how your system deals with the language data it sees: both what it does right and where it does something wrong. For a discussion of the general spirit of error analysis, check Emily Bender's blog post on "putting the linguistics in computational linguistics".
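One simple way to get started on an error analysis is to group your test items by what the system should have said and what it actually said, and then read through each group. The sketch below does this; the function name group_by_outcome and the toy data are made up for this example, and the grouping is only a first step before you look at the items themselves.

```python
from collections import defaultdict


def group_by_outcome(texts, gold, predicted):
    """Group input texts by their (gold label, predicted label) pair."""
    groups = defaultdict(list)
    for text, g, p in zip(texts, gold, predicted):
        groups[(g, p)].append(text)
    return dict(groups)


# Toy data, invented for this illustration (not real system output).
texts = ["Ms. Smith", "Austin", "Mr. Austin", "Apple", "Paris Hilton"]
gold = ["PER", "LOC", "PER", "ORG", "PER"]
predicted = ["PER", "LOC", "LOC", "ORG", "LOC"]

groups = group_by_outcome(texts, gold, predicted)

# Print the largest groups first, covering both correct and incorrect cases.
for (g, p), members in sorted(groups.items(), key=lambda item: -len(item[1])):
    status = "correct" if g == p else "ERROR"
    print(f"{status}: gold={g} predicted={p} ({len(members)}): {members}")
```

Reading the items in each group, both the correct ones and the errors, often suggests patterns (here, person names that contain a place name) that you can discuss in your report.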