LIN392 Working with Corpora: Links

Here are some links that students might find useful while taking this course. If you find additional useful links, let me know. I am always happy to add to the list.

Online texts for the course

There is no course textbook. Readings will be from the following resources available online:

Corpus studies over breakfast

How about some recreational corpus linguistics? Language Log has some fun studies, particularly in Googlometry:


Sources for finding corpora:

    • The Linguistic Data Consortium (LDC) distributes corpora under a membership model.

    • ELRA is the European equivalent of the LDC. It has an associated conference devoted to language resources, LREC.

    • The Corpora mailing list is devoted to work with corpora. Its archives are a good place to look for corpus resources.

Some useful corpora:

    • Project Gutenberg has free e-books whose copyright has expired, in many languages. All of these books are also available as plain text.

    • WaCky: web-as-corpus resources, with huge corpora in multiple languages

    • OPUS: a collection of parallel corpora, searchable with CQP queries. (Parallel corpora contain the same text in multiple languages.)
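
To make the idea of a parallel corpus concrete, here is a minimal Python sketch. It assumes two sentence-aligned texts held in plain lists; the toy sentences and the helper names `align` and `tokenize` are made up for illustration and do not come from OPUS or any other resource listed here.

```python
from collections import Counter

# Hypothetical toy data: the same two sentences in English and German.
english = ["The house is small.", "I read a book."]
german = ["Das Haus ist klein.", "Ich lese ein Buch."]

PUNCTUATION = ".,;:!?\"'()"


def align(source, target):
    """Pair up sentences by position; both texts must have the same length."""
    if len(source) != len(target):
        raise ValueError("texts are not sentence-aligned")
    return list(zip(source, target))


def tokenize(sentence):
    """Very rough tokenizer: lowercase, split on whitespace, strip punctuation."""
    tokens = (word.strip(PUNCTUATION) for word in sentence.lower().split())
    return [token for token in tokens if token]


# Each pair holds one sentence and its translation.
pairs = align(english, german)
for en, de in pairs:
    print(en, "|||", de)

# Word frequencies on the English side of the corpus.
counts = Counter(token for en, _ in pairs for token in tokenize(en))
print(counts.most_common(3))
```

Real parallel corpora are usually aligned by a dedicated tool and stored in their own file formats, so treat this only as a picture of the data structure, not as a way to read any particular corpus.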

Corpus search tools

Search tools:

    • Tregex: a powerful tool for searching syntactically analyzed corpora, though its query language is quite complex (link currently unavailable, hopefully back soon)

    • The IMS Corpus Workbench

Automatic corpus analysis tools


Learning Unix

Here is an online tutorial.

About Python

Installing Python: