Contingency tables: How often does each value occur?
Contingency tables can answer the question "How many times does each value occur in my data frame?"
For example, we can check, in the inaugural speeches dataset, how often we find each president's name:
Note the slightly weird format: We have to put a tilde (~) before the name of the column that we want to analyze.
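A minimal sketch of such a call. The data frame inaug.toy is a small stand-in invented for illustration; only the column name "president" is taken from the inaugural speeches dataset described here.

```r
# Toy stand-in for the inaugural speeches data frame (a handful of rows only).
inaug.toy <- data.frame(
  president = c("Bush", "Clinton", "Clinton", "Bush", "Obama"),
  year      = c(1989, 1993, 1997, 2001, 2009)
)

# The tilde goes before the column we want to count.
president.counts <- xtabs(~ president, data = inaug.toy)
print(president.counts)
```

The result behaves like a named vector of counts, so president.counts["Bush"] picks out a single count.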
Or we can ask, again about the inaugural speeches dataset: How many inaugural speeches mentioned "America" never, just once, twice, and so on?
We can send the output of xtabs() directly to barplot() for graphical inspection.
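Here is a sketch of both steps on invented data. We assume a column named "America" that holds the number of mentions per speech; the name and the numbers are made up for illustration.

```r
# Toy data: number of times each (made-up) speech mentions "America".
inaug.toy <- data.frame(America = c(0, 2, 1, 0, 2, 2, 5))

# How many speeches had 0 mentions, 1 mention, 2 mentions, ...?
america.counts <- xtabs(~ America, data = inaug.toy)
print(america.counts)

# Send the table straight to barplot(): one bar per number of mentions.
barplot(america.counts,
        xlab = "Mentions of \"America\"",
        ylab = "Number of speeches")
```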
xtabs() becomes especially useful when we cross-analyze two different columns. This is better demonstrated with the dataset "verbs" from the languageR package. (You will need to install this before you can use it. In the "Packages and Data" menu of R, choose the Package Installer. When you install a package, don't forget to tick "also install dependencies".)
The "verbs" dataset describes corpus occurrences of the dative alternation, that is, occurrences of either the pattern "John gave Mary the book" or "John gave the book to Mary". It encodes a number of different characteristics (features) of each occurrence, such that one can check what circumstances coincide with (and maybe lead to) which form of the alternation.
Here is how to check how often the receiver ("Mary" in the example above) was realized as an NP versus a PP, and how often it was animate versus inanimate:
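A sketch of a two-column table. To keep it runnable without languageR, we build a tiny made-up data frame verbs.toy. The column names RealizationOfRec and AnimacyOfRec are, as far as we recall, the ones used in "verbs"; check names(verbs) after loading the package.

```r
# Made-up stand-in for a few rows of the "verbs" dataset.
verbs.toy <- data.frame(
  RealizationOfRec = c("NP", "NP", "PP", "NP", "PP", "NP"),
  AnimacyOfRec     = c("animate", "animate", "inanimate",
                       "animate", "animate", "inanimate")
)

# Two columns after the tilde, joined by "+", give a two-way table.
rec.table <- xtabs(~ RealizationOfRec + AnimacyOfRec, data = verbs.toy)
print(rec.table)
```

With the package installed, the same call with data = verbs should give the table for the full dataset.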
xtabs() answers the question "How often did each value occur?" But for other questions about the data, for example, "What was the average number of words per speaker in the Fisher speech corpus?" or "Separately for male and female speakers, what was the average number of words spoken?", we cannot use xtabs().
The first of the two questions above can be answered simply by using
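For instance, with a made-up data frame fisher.toy (the column names sex and numWords are assumptions for illustration, not necessarily those of the real Fisher data):

```r
# Toy data: four invented speakers with their word counts.
fisher.toy <- data.frame(
  speaker  = c("s1", "s2", "s3", "s4"),
  sex      = c("f", "m", "f", "m"),
  numWords = c(1200, 900, 1500, 1100)
)

# Average number of words over all speakers.
overall.mean <- mean(fisher.toy$numWords)
print(overall.mean)
```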
But for the second question, we first need to group rows by the gender of the speaker, and then compute the average. The R function aggregate() can do this, like this:
What aggregate() does is this:
Take the column(s) listed in the first argument
group them by the values of the column(s) listed in the second argument
then apply the R function given as the 3rd argument.
The second argument, the grouping column(s), has to be given as an R data type called "list", which we do by just putting list() around the column names. (Basically, an R list is something like a data frame, but instead of equal-length column vectors, the different parts of a list can be anything.)
We can also apply functions other than the mean, for example the standard deviation:
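The three steps can be sketched as follows, again on a made-up data frame fisher.toy whose column names sex and numWords are assumptions for illustration.

```r
# Toy data: four invented speakers with their word counts.
fisher.toy <- data.frame(
  speaker  = c("s1", "s2", "s3", "s4"),
  sex      = c("f", "m", "f", "m"),
  numWords = c(1200, 900, 1500, 1100)
)

# 1st argument: the values to summarize
# 2nd argument: the grouping column, wrapped in list()
# 3rd argument: the function to apply within each group
words.by.sex <- aggregate(fisher.toy$numWords, list(fisher.toy$sex), mean)
print(words.by.sex)
```

The result is a data frame with one row per group. The grouping column is called Group.1 and the computed values are in column x; writing list(sex = fisher.toy$sex) gives the grouping column a more readable name.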
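A sketch with sd() in place of mean(), on the same kind of made-up data (column names are assumptions):

```r
# Toy data: four invented speakers with their word counts.
fisher.toy <- data.frame(
  sex      = c("f", "m", "f", "m"),
  numWords = c(1200, 900, 1500, 1100)
)

# Same call as for the mean, only the function changes.
sd.by.sex <- aggregate(fisher.toy$numWords, list(fisher.toy$sex), sd)
print(sd.by.sex)
```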
We can also group rows by values of more than one column. For example, say we want to know mean number of words by gender and age. Looking at each individual age will give us too many values. So we use the R function cut() to cut the age values into 6 different bins. We can then aggregate rows by gender and age group, and compute a mean for each of them.
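A sketch of binning and two-way grouping. The data frame and its column names (sex, age, numWords, agegroup) are invented for illustration.

```r
# Toy data: seven invented speakers.
fisher.toy <- data.frame(
  sex      = c("f", "m", "f", "f", "m", "f", "m"),
  age      = c(22, 25, 27, 41, 44, 63, 67),
  numWords = c(1200, 900, 1400, 1500, 1100, 1300, 1000)
)

# cut() turns the numeric ages into a factor with 6 equal-width bins.
fisher.toy$agegroup <- cut(fisher.toy$age, breaks = 6)

# Two grouping columns inside list(): one mean per sex/age-group combination.
words.by.sex.age <- aggregate(fisher.toy$numWords,
                              list(fisher.toy$sex, fisher.toy$agegroup),
                              mean)
print(words.by.sex.age)
```

Only combinations that actually occur in the data show up as rows in the result.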
The R function merge() is very helpful for adding information to a data frame, or combining data from multiple data frames. As the first, simplest example, suppose we have two data frames with information about the Fisher telephone communication corpus. One contains background information about speakers:
The second data frame has the actual experimental data we collected, namely the number of words that each speaker spoke:
Suppose we want to know: "How many words did men say on average, and how many words did women say on average?" Then we need information from both dataframes. We can merge the two dataframes to combine the data:
The "merge()" command has combined the two dataframes based on the column that they have in common, "speaker". Note that merge() leaves the original dataframes fisher.bg and fisher.numWords unchanged. It creates a new, merged dataframe, which we have saved in fisher.merged.
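A minimal sketch of this merge. The names fisher.bg, fisher.numWords, fisher.merged and "speaker" come from the text; the rows and the other column names (sex, age, numWords) are made up for illustration.

```r
# Background information about (invented) speakers.
fisher.bg <- data.frame(
  speaker = c("s1", "s2", "s3", "s4"),
  sex     = c("f", "m", "f", "m"),
  age     = c(22, 25, 41, 44)
)

# Word counts for the same speakers, deliberately in a different row order.
fisher.numWords <- data.frame(
  speaker  = c("s3", "s1", "s4", "s2"),
  numWords = c(1500, 1200, 1100, 900)
)

# merge() matches rows on the shared column "speaker".
fisher.merged <- merge(fisher.bg, fisher.numWords)
print(fisher.merged)

# Now both sex and numWords live in one data frame.
print(aggregate(fisher.merged$numWords, list(fisher.merged$sex), mean))
```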
But what if the column names did not match? Suppose that fisher.numWords looked like this instead:
So fisher.numWords has a column "person" that matches the column "speaker" in fisher.bg. In this case, we have to tell merge() explicitly which columns have the same data:
The function merge() considers the first dataframe its "x" dataframe, and the second one its "y" dataframe. So "by.x" means: Use this column from dataframe x (that is, the first dataframe, which is actually fisher.bg). And "by.y" means: Use this column from dataframe y (that is, the second dataframe, fisher.numWords).
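A sketch with mismatched key names, again on invented rows (only "speaker" and "person" are taken from the text):

```r
# x dataframe: key column is called "speaker".
fisher.bg <- data.frame(
  speaker = c("s1", "s2", "s3", "s4"),
  sex     = c("f", "m", "f", "m")
)

# y dataframe: the matching key column is called "person".
fisher.numWords <- data.frame(
  person   = c("s3", "s1", "s4", "s2"),
  numWords = c(1500, 1200, 1100, 900)
)

# Tell merge() which column to use from each side.
fisher.merged <- merge(fisher.bg, fisher.numWords,
                       by.x = "speaker", by.y = "person")
print(fisher.merged)
```

In the result, the key column keeps the name it had in the x dataframe, here "speaker".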
The function merge() can do even more than that. It works even when the two dataframes do not have the same number of rows. We make a new data frame inaug.last that is a small piece of the "inaugural speeches" dataset, with only the few most recent entries. We then make a data frame that lists, for each of the surnames in inaug.last, a party affiliation. Note that the party data frame has each last name only once, so it is shorter than inaug.last.
Now we can merge the data frames "inaug.last" and "party" by the "president" column of "inaug.last" and the "name" column of "party". This results in a new data frame that combines "inaug.last" and "party". The row about Johnson, for example, combines the inaugural speech information for Johnson with the party affiliation information. Note that when the same president occurs multiple times, like Bush, the same party affiliation information is added to each of them.
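A sketch of such a merge with data frames of different lengths. The rows below are a hand-picked illustration rather than the actual inaug.last, and the column name "affiliation" is an assumption.

```r
# A few inaugural speeches; "Bush" occurs three times.
inaug.last <- data.frame(
  president = c("Johnson", "Bush", "Clinton", "Bush", "Bush"),
  year      = c(1965, 1989, 1993, 2001, 2005)
)

# One row per surname, so this data frame is shorter.
party <- data.frame(
  name        = c("Bush", "Clinton", "Johnson"),
  affiliation = c("Republican", "Democratic", "Democratic")
)

inaug.party <- merge(inaug.last, party, by.x = "president", by.y = "name")
print(inaug.party)
```

The result has as many rows as inaug.last: each of the three Bush rows receives the same affiliation.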
Graphical data exploration
The following plotting commands start a new canvas:
plot(): general purpose plotting. Set the "type" parameter to get different types of plots. See the help page for additional parameters for color, point type, and so on.
plot(inaug.all$length, type="p") plots points
plot(inaug.all$length, type="l") plots a line
plot(inaug.all$length, type = "b") plots a line with points superimposed
barplot() plots a barplot
boxplot() gives a box-and-whiskers plot that shows median and 1st and 3rd quartiles.
hist() draws a histogram
To get a "true histogram" in which the area of the bars adds up to 1 (like in a density plot), do
You may need to install the package MASS first.
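As a sketch that needs no extra package, base R's hist() with freq = FALSE also scales the bars so that their total area is 1. The function truehist() from MASS is presumably what is meant above; it is shown only as a comment here. The speech lengths are made up.

```r
# Invented speech lengths, for illustration only.
speech.lengths <- c(1100, 1300, 1600, 1700, 2100, 2200, 2400, 2500, 3300, 4500)

# freq = FALSE plots densities instead of counts.
h <- hist(speech.lengths, freq = FALSE)

# Check: bar height times bar width, summed over all bars, gives 1.
bar.area <- sum(h$density * diff(h$breaks))
print(bar.area)

# With MASS installed, the equivalent would presumably be:
# library(MASS); truehist(speech.lengths)
```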
The following plotting commands do not start a new canvas, but superimpose the next plot on the canvas that you last plotted:
lines() draws a line
points() draws points
text() adds text at the given coordinates on the canvas
By default, R draws one plot at a time. But sometimes you may want to draw more than one. For example, here we draw histograms for number of words spoken separately for each of the age groups that we have added above. The par() command sets general plotting parameters. It has a huge number of possible settings -- see the help function. This particular command sets plotting to 2 rows and 3 columns of plotting canvases.
To circumvent problems with missing values (NA), we omit them all for now:
Then we plot:
You get back to a single canvas by typing
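The whole sequence might look like this. The data frame is generated artificially, and its column names (age, numWords, agegroup) are assumptions for illustration.

```r
# Artificial data: 60 speakers aged 20 to 79, with two missing word counts.
fisher.toy <- data.frame(age = 20:79)
fisher.toy$numWords <- 800 + (fisher.toy$age * 37) %% 500
fisher.toy$numWords[c(5, 30)] <- NA
fisher.toy$agegroup <- cut(fisher.toy$age, breaks = 6)

# Drop every row that contains an NA.
fisher.clean <- na.omit(fisher.toy)

# 2 rows and 3 columns of plotting canvases.
par(mfrow = c(2, 3))

# One histogram per age group; each call fills the next free canvas.
for (g in levels(fisher.clean$agegroup)) {
  hist(fisher.clean$numWords[fisher.clean$agegroup == g],
       main = g, xlab = "Number of words")
}

# Back to a single canvas.
par(mfrow = c(1, 1))
```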
A recurring problem when we superimpose multiple plots is calculating the boundaries of the canvas. If you plot only one thing, then R draws the canvas in such a way that all points will fit. But if you superimpose something on this plot, it need not fit. Try for example:
To solve this problem, you need to use the "xlim" and "ylim" parameters of plot() to set up the canvas size yourself. You can use range() to determine the range you need to accommodate. Note that you can hand "range" multiple vectors at once, and it gives you the overall range for all of them combined!
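A sketch of the problem and the fix, using two made-up vectors a and b:

```r
a <- c(1, 3, 2, 4, 3)
b <- c(2, 6, 9, 12, 15)

# Problem: the canvas is sized for a only, so most of b falls outside it.
plot(a, type = "l")
lines(b, col = "red")

# Fix: compute the overall range of both vectors and pass it as ylim.
y.range <- range(a, b)
plot(a, type = "l", ylim = y.range)
lines(b, col = "red")
```

Here both vectors share the same x positions (1 to 5), so only ylim needs adjusting; when the x-coordinates differ too, xlim can be set in the same way.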
Data exploration: What to plot?
To get an overview of the possible values of a single vector:
For discrete-valued variables: a barplot
For continuous-valued variables: ordered values, histogram, density, boxplot, quantiles -- see the Baayen book for examples.
Plotting two or more variables in comparison:
We can send the result of xtabs() directly on to barplot(), also for the case where we compare multiple variables.
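A sketch on a tiny made-up stand-in for the "verbs" data (the column names are assumed to match the real dataset):

```r
# Made-up stand-in for a few rows of the "verbs" dataset.
verbs.toy <- data.frame(
  RealizationOfRec = c("NP", "NP", "PP", "NP", "PP", "NP"),
  AnimacyOfRec     = c("animate", "animate", "inanimate",
                       "animate", "animate", "inanimate")
)

# Rows: animacy; columns: realization.
rec.table <- xtabs(~ AnimacyOfRec + RealizationOfRec, data = verbs.toy)

# One group of bars per column of the table, one bar per row.
# beside = TRUE puts the bars next to each other instead of stacking them.
barplot(rec.table, beside = TRUE, legend.text = TRUE,
        xlab = "Realization of recipient", ylab = "Count")
```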
Then there are also mosaic plots, discussed in the Baayen book, but I find them hard to read, so I will not go into them any further.
Scatter plots are another standard way of plotting the relation between two variables. Here we look at the rate at which the word "freedom" is used in inaugural speeches over the years (we divide by length to abstract from the lengths of the speeches):
When we plot two vectors of the same length, here "year" and normalized "freedom", the first vector describes the x-coordinates of each data point, and the second vector describes the y-coordinates. We should put informative labels on the axes to say what the plot shows:
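A sketch of a labeled scatter plot. The counts and lengths below are invented; the column names year, length and freedom follow the description in the text.

```r
# Invented values, for illustration only.
inaug.toy <- data.frame(
  year    = seq(1985, 2005, by = 4),
  length  = c(2500, 2300, 1600, 2200, 1600, 2100),
  freedom = c(10, 4, 2, 3, 5, 20)
)

# Normalize: occurrences of "freedom" per word of the speech.
inaug.toy$freedom.rate <- inaug.toy$freedom / inaug.toy$length

# First vector: x-coordinates; second vector: y-coordinates.
plot(inaug.toy$year, inaug.toy$freedom.rate,
     xlab = "Year",
     ylab = "Occurrences of \"freedom\" per word")
```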
Is there a trend? We add a line to the plot to sketch the main tendency.
Such a curve is often called a scatterplot smoother. lowess() computes one possible smoothing line (there are multiple ways of estimating the main trend), and lines() adds the smoother to the current plot.
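A sketch of adding a smoother, on made-up values (the vectors year and freedom.rate are invented for illustration):

```r
# Invented data: ten years and a made-up normalized rate for each.
year         <- seq(1969, 2005, by = 4)
freedom.rate <- c(0.0010, 0.0015, 0.0012, 0.0020, 0.0018,
                  0.0025, 0.0011, 0.0022, 0.0030, 0.0040)

# Start a new canvas with the scatter plot ...
plot(year, freedom.rate, xlab = "Year", ylab = "Rate of \"freedom\"")

# ... then compute the smoother and superimpose it.
smoother <- lowess(year, freedom.rate)
lines(smoother)
```

lowess() returns a list with components x and y, which lines() accepts directly.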
See the Baayen book for more information on scatterplot smoothers, and also for more plotting variants.