R worksheet: contingency tables
Contingency tables can answer the question "How many times does each value occur in a column of my data frame?"
For example, we can check, in the inaugural speeches dataset, how often we find each president's name:
> xtabs(~president, data = inaugural)
     Adams   Buchanan       Bush     Carter  Cleveland    Clinton   Coolidge
         2          1          3          1          2          2          1
Eisenhower   Garfield      Grant    Harding   Harrison      Hayes     Hoover
         2          1          2          1          2          1          1
   Jackson  Jefferson    Johnson    Kennedy    Lincoln    Madison   McKinley
         2          2          1          1          2          2          2
    Monroe      Nixon      Obama     Pierce       Polk     Reagan  Roosevelt
         2          2          1          1          1          2          5
      Taft     Taylor     Truman   VanBuren Washington     Wilson
         1          1          1          1          2          2
Note the slightly unusual format: we have to put a tilde (~) before the name of the column that we want to analyze. For each value that occurs in the column "president", this call counts how often it appears.
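To try the tilde notation without the inaugural data, here is a minimal sketch on a made-up data frame. The name toy and its values are invented for illustration only.

```r
# A small made-up data frame (hypothetical; not the inaugural dataset)
toy <- data.frame(president = c("Adams", "Bush", "Adams", "Bush", "Bush", "Taft"))

# One-sided formula: nothing to the left of the tilde,
# the column to be counted to the right of it
counts <- xtabs(~president, data = toy)
print(counts)

# An individual count can be looked up by its value
counts["Bush"]
```

The result behaves like a named vector of counts, so single entries can be picked out by name.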
Or we can ask, again about the inaugural speeches dataset: How many inaugural speeches mentioned "America" never, just once, twice, and so on?
We can send the output of xtabs() directly to barplot for graphical inspection.
> xtabs(~America, data = inaugural)
 0  1  2  3  4  5  6  7  8 10 11 15 19 20 21
27  6  4  2  1  2  3  3  1  1  1  2  1  1  1
> barplot(xtabs(~America, data = inaugural))
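Take a moment to think about what this xtabs() call shows: In this case, it is a count of counts: How often does each count of the word "America" appear? For example, 27 speeches do not mention "America" at all, and 6 speeches mention it exactly once.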
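The "count of counts" idea can be sketched on a handful of invented numbers. The data frame toy below is hypothetical, with one made-up word count per speech.

```r
# Hypothetical per-speech counts of a word, one number per speech
toy <- data.frame(America = c(0, 0, 3, 1, 0, 3, 7))

# Each distinct count becomes a category;
# the table says how many speeches had that count
count_of_counts <- xtabs(~America, data = toy)
print(count_of_counts)

# The cells add up to the number of speeches (rows)
sum(count_of_counts) == nrow(toy)
```

Counts that never occur get no column at all, which is why the output for the inaugural speeches jumps from 8 to 10.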
Cross-analyzing two columns
xtabs() becomes especially useful when we cross-analyze two different columns. This is better demonstrated with the dataset "verbs" from the languageR package. (You will need to install this before you can use it. In the "Packages and Data" menu of R, choose the Package Installer. When you install a package, don't forget to tick "also install dependencies".)
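If you prefer the console to the menu, the following sketch should do the same job, assuming your machine is online and a CRAN mirror is reachable. The check with requireNamespace() is an addition here so that the package is not reinstalled every time.

```r
# Install languageR from CRAN together with the packages it depends on
# (only needed once; requires an internet connection)
if (!requireNamespace("languageR", quietly = TRUE)) {
  install.packages("languageR", dependencies = TRUE)
}

# Load the package in every new R session, then look at the first rows
library(languageR)
head(verbs)
```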
The "verbs" dataset describes corpus occurrences of the dative alternation, that is, occurrences of either the pattern "John gave Mary the book" or "John gave the book to Mary". It encodes a number of different characteristics (features) of each occurrence, such that one can check what circumstances coincide with (and maybe lead to) which form of the alternation.
Here is how to check how often the recipient ("Mary" in the example above) was realized as an NP versus a PP, and how often it was animate versus inanimate:
> xtabs(~RealizationOfRec + AnimacyOfRec, data = verbs)
                AnimacyOfRec
RealizationOfRec animate inanimate
              NP     521        34
              PP     301        47
This shows how often each realization of recipient -- NP versus PP -- appears with each value for the animacy of the recipient, either animate or inanimate.
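How the two columns map onto rows and columns of the table can be seen in a small sketch. The data frame toy and its column names realization and animacy are made up for illustration and are not part of languageR.

```r
# Made-up observations (hypothetical; not the languageR verbs data)
toy <- data.frame(
  realization = c("NP", "NP", "PP", "NP", "PP", "NP"),
  animacy     = c("animate", "inanimate", "animate",
                  "animate", "animate", "animate")
)

# The first variable in the formula gives the rows, the second the columns
tab <- xtabs(~realization + animacy, data = toy)
print(tab)

# A single cell is addressed as [row, column]
tab["NP", "animate"]
```

Combinations that never occur in the data still get a cell, with a count of 0.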
When you send this to barplot(), you get one bar for "animate" and one for "inanimate". Each bar is split into two differently colored segments, stacked on top of one another, that show the counts for NP and PP.
> barplot(xtabs(~RealizationOfRec + AnimacyOfRec, data = verbs))
If you would like to compare the NP and PP counts for each value of AnimacyOfRec, it helps to ask barplot() to draw the differently colored bars next to one another rather than on top of one another:
> barplot(xtabs(~RealizationOfRec + AnimacyOfRec, data = verbs), beside = TRUE)
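The two layouts can be compared on a small made-up table. This is only a sketch: the data frame toy is invented, and the legend.text argument is an addition here that labels the colors with the row names of the table.

```r
# A small made-up 2 x 2 table (hypothetical counts, not the verbs data)
toy <- data.frame(
  realization = c("NP", "NP", "PP", "NP", "PP", "NP", "PP"),
  animacy     = c("animate", "inanimate", "animate", "animate",
                  "animate", "animate", "inanimate")
)
tab <- xtabs(~realization + animacy, data = toy)

# Stacked: one bar per column of the table, the rows are the colored segments
stacked <- barplot(tab, legend.text = TRUE)

# Side by side: one bar per cell, grouped by column
grouped <- barplot(tab, beside = TRUE, legend.text = TRUE)
```

barplot() returns the horizontal positions of the bars, which is handy if you later want to add labels to the plot.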