Step 0: Have a research question
What do you want to know? What data can you use to answer that question?
Step 1: Get the data onto your hard disk
You need the data for your course project to be on your hard disk, in electronically readable form. There are several ways in which you can achieve this:
Then you are done. If your data is plain text
with open("projectdata.txt") as f:
    text = f.read()

After this, the whole text is in the variable text, as one gigantic string. You can now split the text into words, and apply part-of-speech tagging or lemmatization as needed, as described in the notebook on preprocessing text. After that, count words (or part-of-speech tags, or lemmas, or whatever you need) in the text using an NLTK FreqDist, and transform the counts into a pandas data frame, as shown in the notebook on turning dictionaries into Pandas data frames.

Step 3: What does your data look like?
Once you have the data in a Pandas data frame, you first need to do exploratory data analysis:
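A first look at word counts can be sketched with the standard library alone. This is only an illustration under some assumptions: collections.Counter stands in for the NLTK FreqDist (FreqDist is built on Counter and offers the same most_common method), the regular expression is a crude substitute for a real tokenizer, and the sample sentence is made up so that the snippet runs without projectdata.txt.

```python
import re
from collections import Counter

# Made-up sample text; in your project this would be the string
# you read from projectdata.txt.
text = "The cat sat on the mat. The dog sat on the log."

# Crude tokenizer (an assumption, not the course's preprocessing):
# lowercase everything and keep runs of letters and apostrophes.
words = re.findall(r"[a-z']+", text.lower())

# Counter stands in for nltk.FreqDist here.
counts = Counter(words)

# Some basic numbers for a first look at the data.
num_tokens = sum(counts.values())   # running words
num_types = len(counts)             # distinct words
top_words = counts.most_common(3)   # most frequent words with their counts

print("tokens:", num_tokens)
print("types:", num_types)
print("most frequent:", top_words)
```

Because counts behaves like a dictionary, it can be turned into a data frame in the way the notebook on dictionaries and data frames describes.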
If your research question (or one of your questions) is of the type "I want to know what typical words/patterns can be found in...", then you may want to do some clustering analysis on your data. See the upcoming clustering notebook for this.

If your research question (or one of your questions) is one that can be answered with hypothesis testing, this is the point at which you do this. There will be notebooks on this.
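To give a rough idea of what a hypothesis test on word counts can look like, here is a minimal sketch. It assumes a question of the form "is this word more frequent in text A than in text B?" and uses Pearson's chi-square test on a 2x2 table, without continuity correction. That choice of test is an assumption made for illustration; the course notebooks may use a different one. The function name chi_square_2x2 and all counts are made up.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns the test statistic and the p-value (1 degree of freedom,
    no continuity correction).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, the upper tail of the chi-square
    # distribution can be written with the complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Made-up counts: rows are two texts, columns are
# (occurrences of the word, all other tokens).
table = [[30, 970],
         [10, 990]]

chi2, p_value = chi_square_2x2(table)
print("chi-square:", round(chi2, 3))
print("p-value:", p_value)
```

A small p-value suggests that the word's frequency differs between the two texts by more than chance alone would explain.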