Working with Corpora: Links
LIN392 Working with Corpora: Links
Here are some links that students might find useful while taking this course. If you find additional useful links, let me know. I am always happy to add to the list.
Online texts for the course
There is no course textbook. Readings will be from the following resources available online:
Martin Wynne (ed.): Developing Linguistic Corpora: a Guide to Good Practice
Allen B. Downey, Jeffrey Elkner, and Chris Meyers: How to Think Like a Computer Scientist: Learning with Python
Steven Bird, Ewan Klein, and Edward Loper: Natural Language Processing - Analyzing Text with Python and the Natural Language Toolkit
Corpus studies over breakfast
Do women really produce more words than men? Let's ask a corpus.
Google counts and French politics, and other corpus studies by Jean Veronis
Sources for finding corpora:
The Linguistic Data Consortium (LDC) distributes corpora under a membership model.
The Corpora mailing list is devoted to work with corpora. Its archives are a good place to look for corpus resources.
Some useful corpora:
Project Gutenberg has free e-books for which the copyright has expired. There are books in many languages, and all of them are also available as plain text.
WaCKY: web as corpus resources, huge corpora in multiple languages
OPUS: a collection of parallel corpora, searchable with CQP queries. (Parallel corpora contain the same text in multiple languages.)
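To make the idea of a parallel corpus concrete, here is a minimal Python sketch. It assumes the simplest possible representation, a list of sentence pairs that are already aligned. The sentences, the variable parallel_corpus, and the function find_aligned are all invented for illustration. They are not taken from OPUS and do not reflect its file formats or its query interface.

```python
# Toy parallel corpus: each entry pairs an English sentence with a German
# translation. The sentences are made up for illustration (not OPUS data).
parallel_corpus = [
    ("the house is small", "das Haus ist klein"),
    ("the book is old", "das Buch ist alt"),
    ("the house is old", "das Haus ist alt"),
]


def find_aligned(corpus, word):
    """Return all sentence pairs whose English side contains the given word."""
    return [(en, de) for (en, de) in corpus if word in en.split()]


# Look up every sentence pair in which the English side mentions "house".
matches = find_aligned(parallel_corpus, "house")
for en, de in matches:
    print(en + " | " + de)
```

Searching one side and reading off the other is the basic way parallel corpora are used, for example to see how a word is translated in context. Real parallel corpora are far larger and usually need an explicit sentence alignment step first.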
Corpus search tools
Automatic corpus analysis tools
Bookmarks for Corpus-based Linguists: http://tiny.cc/corpora
Here is an online tutorial.
Here is the download page for Python. In the course, we use version 2.7.7.