Doing your course project
Step 0: Have a research question
What do you want to know? What data can you use to answer that question?
Do you have a question that looks like this?
"I want to explore what words / patterns are used in ..."
Then you will need to do an in-depth exploratory data analysis, as described below.
Do you have a research hypothesis that looks something like this?
"I think that X has an influence on Y"
"I think that A's use more Xs than Bs do."
Then you will need to do hypothesis testing, as described below.
Step 1: Get the data onto your hard disk
You need the data for your course project to be on your hard disk, in electronically readable form. There are several ways in which you can achieve this:
Your data is an established dataset, and is available with NLTK.
If you have already downloaded it, you are set.
If you haven't downloaded it, use nltk.download() to get it.
Your data is an established dataset, and is available as a file or files for download -- then you download it, and are done.
Just one thing to watch out for: Has the data been cleaned? For example, if you are working with Project Gutenberg files, they have some "small print" before and after the actual text. If that is the case, you can delete the "small print", or remove it with Python code in the way you practiced in homework 2.
You can copy-and-paste the data off of webpages.
This is relatively straightforward, but recommended only if you need just a small number of webpages.
You want to automatically download pages off the web. For this, and for how to remove HTML formatting, see the notebook on preprocessing text.
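For the Project Gutenberg case mentioned above, removing the "small print" with code could look roughly like this. This is only a sketch and not necessarily the approach from homework 2: it assumes the file contains the usual marker lines beginning with "*** START OF" and "*** END OF", and the function name strip_small_print is made up for illustration.

```python
import re

# Assumption: the file uses the usual Project Gutenberg marker lines, which
# begin with "*** START OF" and "*** END OF". Some files may differ.
START_MARKER = re.compile(r"^\*\*\* ?START OF.*$", re.MULTILINE)
END_MARKER = re.compile(r"^\*\*\* ?END OF.*$", re.MULTILINE)


def strip_small_print(raw_text):
    """Return the text between the start and end marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_MARKER.search(raw_text)
    begin = start.end() if start else 0
    # Look for the end marker only after the start marker.
    end = END_MARKER.search(raw_text, begin)
    finish = end.start() if end else len(raw_text)
    return raw_text[begin:finish].strip()


# Tiny made-up sample that imitates the layout of a Gutenberg file.
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Call me Ishmael.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Small print follows here.\n"
)
print(strip_small_print(sample))
```

It is worth opening your file in a text editor first to check what its marker lines actually look like, and adjusting the two patterns if needed.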
Step 2: Preprocess the data
If your data is already in the shape of a Pandas data frame, with all the counts you need
Then you are done.
If your data is plain text
Read the data into Python. If the data is in the same directory as your Jupyter notebook, and is in a file called, say, "projectdata.txt", then you load it as follows:
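A minimal sketch of the loading step, assuming the file is UTF-8 encoded. The helper name read_project_text is made up for illustration.

```python
from pathlib import Path


def read_project_text(filename):
    """Return the full contents of a plain-text file as one string."""
    # Assumption: the file is UTF-8 encoded; change the encoding if yours is not.
    return Path(filename).read_text(encoding="utf-8")


# Load the file only if it is present in the notebook's directory.
data_file = Path("projectdata.txt")
if data_file.exists():
    text = read_project_text(data_file)
    print(len(text), "characters loaded")
```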
After this, the whole text is in the variable text, as a gigantic string.
You can now split the text into words, and apply part-of-speech tagging or lemmatization as needed, as described in the notebook on preprocessing text.
After that, count words (or part-of-speech tags, or lemmas, or whatever) in the text using an NLTK FreqDist, and transform the counts into a pandas data frame, as shown in the notebook on turning dictionaries into pandas data frames.
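The splitting and counting steps can be sketched as follows. To keep the example self-contained it makes two substitutions, both assumptions rather than the method from the course notebooks: a simple regular expression stands in for the NLTK tokenizer, and collections.Counter stands in for FreqDist (FreqDist is a subclass of Counter, so the same conversion works for it). The sample sentence and the column names "word" and "count" are also made up.

```python
import re
from collections import Counter

import pandas as pd

# Made-up sample; in your project this would be the text you loaded.
text = "The cat sat on the mat. The dog sat too."

# Stand-in tokenizer: lowercase the text and keep runs of letters/apostrophes.
tokens = re.findall(r"[a-z']+", text.lower())

# Counter stands in for NLTK's FreqDist here.
counts = Counter(tokens)

# One row per word, most frequent first.
df = pd.DataFrame(counts.most_common(), columns=["word", "count"])
print(df.head())
```

Note that the counts column has to be accessed as df["count"], because df.count is already the name of a data frame method.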
Step 3: What does your data look like?
Once you have the data in a Pandas data frame, you first need to do exploratory data analysis:
What is the shape of your data? Make a histogram or density plot of your data: see the notebook on exploring data.
What is its location, and its spread? Analyze central tendencies and spread as described in the notebook on central tendency and spread.
Inspect the data more closely: Ask a few questions about rows and columns in your data to get a feel for what you have. See the notebook on exploring data.
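A first look at a data frame of counts might go roughly like this. The data and the column names "word" and "count" are hypothetical; the course notebooks may use different steps.

```python
import pandas as pd

# Hypothetical word counts; replace with your own data frame.
df = pd.DataFrame({
    "word": ["the", "sat", "cat", "on", "mat", "dog", "too"],
    "count": [3, 2, 1, 1, 1, 1, 1],
})

# Size of the data as (rows, columns).
print(df.shape)

# Location and spread: count, mean, std, min, quartiles, max.
print(df["count"].describe())

# The median is less sensitive to a few very frequent words than the mean.
print(df["count"].median())

# A question about rows: which words occur more often than average?
frequent = df[df["count"] > df["count"].mean()]
print(frequent)
```

If matplotlib is installed, df["count"].plot.hist() draws a histogram of the same column.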
Step 4: An in-depth exploratory data analysis
If your research question (or one of your questions) is of the type "I want to know what typical words/patterns can be found in...", then you may want to do some clustering analysis on your data.
See the upcoming clustering notebook for this.
Step 5: Hypothesis testing
If your research question (or one of your questions) is one that can be answered with hypothesis testing, this is the point at which you do it. There will be notebooks on this.