LIN392 Working with Corpora: Links

Here are some links that students might find useful while taking this course. If you find additional useful links, let me know. I am always happy to add to the list.

Online texts for the course

There is no course textbook. Readings will be from the following resources available online:

Corpus studies over breakfast

How about some recreational corpus linguistics? Language Log has some fun studies, particularly in Googlometry:


Sources for finding corpora:

  • The Linguistic Data Consortium (LDC) distributes corpora under a membership model.
  • ELRA is the European equivalent of the LDC. It has an associated conference devoted to language resources, LREC.
  • The Corpora mailing list is devoted to work with corpora. Its archives are a good place to look for corpus resources.

Some useful corpora:
  • Project Gutenberg offers free e-books whose copyright has expired, in many languages. All of them are also available as plain text.
  • WaCKY: web as corpus resources, huge corpora in multiple languages
  • OPUS: a collection of parallel corpora, searchable with CQP queries. (Parallel corpora contain the same text in multiple languages.)

Corpus search tools

Search tools:

  • tregex: a powerful tool for searching syntactically analyzed corpora, though the query language is quite complex (link currently unavailable, hopefully back soon)

Automatic corpus analysis tools


Learning Unix

Here is an online tutorial.

About Python