Data exploration

Contingency tables: How often does each value occur?

Contingency tables can answer the question "How many times does each value occur in my data frame?"

For example, we can check, in the inaugural speeches dataset, how often we find each president's name:

> xtabs(~president, data = inaug.all)

president

     Adams   Buchanan       Bush     Carter  Cleveland    Clinton   Coolidge

         2          1          3          1          2          2          1

Eisenhower   Garfield      Grant    Harding   Harrison      Hayes     Hoover

         2          1          2          1          2          1          1

   Jackson  Jefferson    Johnson    Kennedy    Lincoln    Madison   McKinley

         2          2          1          1          2          2          2

    Monroe      Nixon      Obama     Pierce       Polk     Reagan  Roosevelt

         2          2          1          1          1          2          5

      Taft     Taylor     Truman   VanBuren Washington     Wilson

         1          1          1          1          2          2

Note the slightly unusual format: we have to put a tilde (~) before the name of the column that we want to analyze. In R, an expression containing a tilde is called a formula.
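As a minimal sketch of this syntax, the following uses a small made-up data frame (the names toy and fruit are hypothetical and not part of the inaugural speeches data). For a single column, xtabs() should give the same counts as the base R function table().

```r
# Hypothetical toy data, only for illustration
toy = data.frame(fruit = c("apple", "pear", "apple", "plum", "apple", "pear"))

# The one-sided formula ~fruit says: count the rows for each value of "fruit"
counts = xtabs(~fruit, data = toy)
print(counts)

# table() applied to the column itself yields the same counts
print(table(toy$fruit))
```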

Or we can ask, again about the inaugural speeches dataset: How many inaugural speeches were there that mentioned "America" never, just once, twice, and so on?

We can send the output of xtabs() directly to barplot for graphical inspection.

> xtabs(~America, data = inaug.all)

America

 0  1  2  3  4  5  6  7  8 10 11 15 19 20 21

27  6  4  2  1  2  3  3  1  1  1  2  1  1  1

> barplot(xtabs(~America, data = inaug.all))

xtabs() becomes especially useful when we cross-analyze two different columns. This is better demonstrated with the dataset "verbs" from the languageR package. (You will need to install this package before you can use it. In the "Packages and Data" menu of R, choose the Package Installer. When you install a package, don't forget to tick "also install dependencies".)

The "verbs" dataset describes corpus occurrences of the dative alternation, that is, occurrences of either the pattern "John gave Mary the book" or "John gave the book to Mary". It encodes a number of different characteristics (features) of each occurrence, such that one can check what circumstances coincide with (and maybe lead to) which form of the alternation.

Here is how to check how often the receiver ("Mary" in the example above) was realized as an NP versus a PP, and how often it was animate versus inanimate:

> library(languageR)

> xtabs(~RealizationOfRec + AnimacyOfRec, data = verbs)

                AnimacyOfRec

RealizationOfRec animate inanimate

              NP     521        34

              PP     301        47
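Raw counts can be hard to compare when the rows have very different totals. As a hedged sketch, the following rebuilds the table above from its printed counts (so that it runs without the languageR package) and converts it to row proportions with prop.table(). The object names tab and props are our own.

```r
# Rebuild the 2x2 table from the counts printed above
# (matrix() fills column by column: first "animate", then "inanimate")
tab = as.table(matrix(c(521, 301, 34, 47), nrow = 2,
                      dimnames = list(RealizationOfRec = c("NP", "PP"),
                                      AnimacyOfRec = c("animate", "inanimate"))))

# margin = 1: divide each cell by its row total
props = prop.table(tab, margin = 1)
print(round(props, 3))
```

With these counts, about 94% of the NP recipients are animate, compared to about 86% of the PP recipients.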

Aggregate

xtabs() answers the question "how often did each value occur?" But if we have other questions about the data, for example, "what was the average number of words in the Fisher speech corpus?" or "separately for male and female speakers, what was the average number of words spoken?", we cannot use xtabs().

The first of the two questions above can be answered simply by using

> mean(fisher$numWords)

[1] 952.7249

But for the second question, we first need to group rows by the gender of the speaker, and then compute the average. The R function aggregate() can do this, like this:

> aggregate(fisher$numWords, list(fisher$gender), mean)

  Group.1       x

1       F 935.242

2       M 974.933

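What aggregate() does is this: it splits the values in its first argument into groups according to the grouping columns in its second argument, and then applies the function given as its third argument to each group.

The second argument, the grouping columns, has to be given as an R data type called "list", which we do by just putting list() around the column names. (Basically, an R list is something like a data frame, but instead of equal-length column vectors, the different parts of a list can be anything.)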
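The three arguments can be seen at work in the following sketch. It uses a made-up data frame (toy is a hypothetical stand-in for the Fisher data, with invented values). It also shows two optional variants: naming the list element, so that the grouping column is not called "Group.1", and the formula interface of aggregate().

```r
# Hypothetical toy data standing in for the Fisher data frame
toy = data.frame(gender   = c("F", "M", "F", "M"),
                 numWords = c(900, 1000, 1100, 800))

# Naming the list element gives the grouping column a readable name
# ("gender" instead of "Group.1")
by.gender = aggregate(toy$numWords, list(gender = toy$gender), mean)
print(by.gender)

# The formula interface does the same and also names the value column
by.gender2 = aggregate(numWords ~ gender, data = toy, FUN = mean)
print(by.gender2)
```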

We can also apply a function other than the mean, for example the standard deviation:

> aggregate(fisher$numWords, list(fisher$gender), sd)

  Group.1        x

1       F 288.0056

2       M 301.1527

We can also group rows by the values of more than one column. For example, say we want to know the mean number of words by gender and age. Looking at each individual age would give us too many values. So we use the R function cut() to cut the age values into 6 bins of equal width. We can then aggregate rows by gender and age group, and compute a mean for each combination.

> fisher$agegroup = cut(fisher$age, 6)

> aggregate(fisher$numWords, list(fisher$agegroup, fisher$gender), mean)
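When cut() is given only a number of bins, it chooses the bin boundaries itself, which can lead to labels that are awkward to read. As a sketch on made-up data (toy and its values are hypothetical), the following passes explicit break points instead and then aggregates by both grouping columns.

```r
# Hypothetical toy data standing in for the Fisher data frame
toy = data.frame(age      = c(22, 35, 41, 58, 63, 29),
                 gender   = c("F", "M", "F", "M", "F", "M"),
                 numWords = c(800, 900, 1000, 1100, 700, 950))

# Explicit break points give bins with round boundaries:
# (20,40], (40,60], (60,80]
toy$agegroup = cut(toy$age, breaks = c(20, 40, 60, 80))
print(levels(toy$agegroup))

# Mean number of words for each combination of age group and gender
# that actually occurs in the data
by.both = aggregate(toy$numWords,
                    list(agegroup = toy$agegroup, gender = toy$gender),
                    mean)
print(by.both)
```

Combinations that do not occur in the data, such as male speakers in the oldest bin of this toy example, do not appear in the result.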

Merge

The R function merge() is very helpful for adding information to a data frame, or combining data from multiple data frames. As the first, simplest example, suppose we have two data frames with information about the Fisher telephone communication corpus. One contains background information about speakers:

> fisher.bg

  speaker gender age

1    2602      M  34

2    1790      F  24

3    2152      F  34

4    9998      M  NA

5    5897      F  20

6    9997      M  NA

The second data frame has the actual experimental data we collected, namely the number of words that each speaker spoke:

> fisher.numWords

  speaker callID numWords

1    2602      1      785

2    1790      1      882

3    2152      2      632

4    9998      2     1457

5    5897      3      522

6    9997      3     1096

Suppose we want to know: "How many words did men say on average, and how many words did women say on average?" Then we need information from both data frames. We can merge the two data frames to combine the data:

> fisher.merged = merge(fisher.bg, fisher.numWords)

> fisher.merged

  speaker gender age callID numWords

1    1790      F  24      1      882

2    2152      F  34      2      632

3    2602      M  34      1      785

4    5897      F  20      3      522

5    9997      M  NA      3     1096

6    9998      M  NA      2     1457

The "merge()" command has combined the two dataframes based on the column that they have in common, "speaker". Note that merge() leaves the original dataframes fisher.bg and fisher.numWords unchanged. It creates a new, merged dataframe, which we have saved in fisher.merged.
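To try this out without the original files, the following sketch rebuilds the two data frames by hand from the six rows printed above, merges them, and then computes the averages per gender that the question asked for. Naming the shared column with by = "speaker" is optional here; it only makes explicit what merge() does by default. The averages describe these six rows only, not the whole corpus.

```r
# The two data frames shown above, rebuilt by hand
fisher.bg = data.frame(speaker = c(2602, 1790, 2152, 9998, 5897, 9997),
                       gender  = c("M", "F", "F", "M", "F", "M"),
                       age     = c(34, 24, 34, NA, 20, NA))
fisher.numWords = data.frame(speaker  = c(2602, 1790, 2152, 9998, 5897, 9997),
                             callID   = c(1, 1, 2, 2, 3, 3),
                             numWords = c(785, 882, 632, 1457, 522, 1096))

# Name the shared column explicitly instead of relying on the default
fisher.merged = merge(fisher.bg, fisher.numWords, by = "speaker")

# Average number of words per gender, computed on the merged data
means = aggregate(fisher.merged$numWords,
                  list(gender = fisher.merged$gender), mean)
print(means)
```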

But what if the column names did not match? Suppose that fisher.numWords looked like this instead:

> fisher.numWords

  person callID numWords

1   2602      1      785

2   1790      1      882

3   2152      2      632

4   9998      2     1457

5   5897      3      522

6   9997      3     1096

So fisher.numWords has a column "person" that matches the column "speaker" in fisher.bg. In this case, we have to tell merge() explicitly which columns have the same data:

> fisher.merged = merge(fisher.bg, fisher.numWords, by.x = "speaker", by.y = "person")

> fisher.merged

  speaker gender age callID numWords

1    1790      F  24      1      882

2    2152      F  34      2      632

3    2602      M  34      1      785

4    5897      F  20      3      522

5    9997      M  NA      3     1096

6    9998      M  NA      2     1457

The function merge() considers the first data frame its "x" data frame, and the second one its "y" data frame. So "by.x" means: use this column from data frame x (that is, the first data frame, here fisher.bg). And "by.y" means: use this column from data frame y (that is, the second data frame, fisher.numWords).

The function merge() can do even more than that. It works even when the two dataframes do not have the same number of rows. We make a new data frame inaug.last that is a small piece of the "inaugural speeches" dataset, with only the few most recent entries. We then make a data frame that lists, for each of the surnames in inaug.last, a party affiliation. Note that the party data frame has each last name only once, so it is shorter than inaug.last.

> inaug.last = inaug.all[45:56,]

> party = data.frame(name = c("Johnson", "Nixon", "Carter", "Reagan", "Bush", "Clinton", "Obama"), party = c("D", "R", "D", "R", "R", "D", "D"))

> party

     name party

1 Johnson     D

2   Nixon     R

3  Carter     D

4  Reagan     R

5    Bush     R

6 Clinton     D

7   Obama     D

Now we can merge the data frames "inaug.last" and "party" by the "president" column of "inaug.last" and the "name" column of "party". This results in a new data frame that combines "inaug.last" and "party". The row about Johnson, for example, combines the inaugural speech information for Johnson with the party affiliation information. Note that when the same name occurs multiple times, like Bush, the same party affiliation information is added to each of the matching rows.

> merge(inaug.last, party, by.x = "president", by.y = "name")

   president year length America citizen citizens democracy freedom  I me duties party

1       Bush 1989   2713       7       1        2         5       6 26  0      0     R

2       Bush 2001   1825      11       1        9         2       5 11  1      0     R

3       Bush 2005   2376      20       1        6         1      27  9  0      2     R

4     Carter 1977   1380       2       0        0         0       4  6  3      1     D

5    Clinton 1993   1855      19       0        2         4       3  7  1      0     D

6    Clinton 1997   2462      15       1        7         4       2  2  0      0     D

7    Johnson 1965   1715       3       3        1         1       2 15  2      0     D

8      Nixon 1969   2425       6       0        1         0       2 21  1      0     R

9      Nixon 1973   2028      21       0        1         0       4 12  2      0     R

10     Obama 2009   2726      10       0        1         0       3  3  0      2     D

11    Reagan 1981   2801       6       0        3         0       8 23  3      0     R

12    Reagan 1985   2946       7       0        6         2      14 12  4      0     R
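One point worth checking in your own data: by default, merge() keeps only those rows that find a match in both data frames. The following sketch uses made-up miniature data frames (speeches and party.toy are hypothetical names) in which one president has no entry in the party table, and shows the all.x argument of merge() as one way to keep such rows.

```r
# Hypothetical toy data: "Taft" has no entry in the party table
speeches = data.frame(president = c("Bush", "Bush", "Obama", "Taft"),
                      year      = c(2001, 2005, 2009, 1909))
party.toy = data.frame(name  = c("Bush", "Obama"),
                       party = c("R", "D"))

# Default: rows without a match are dropped
inner = merge(speeches, party.toy, by.x = "president", by.y = "name")
print(inner)

# all.x = TRUE keeps every row of the first data frame;
# missing party information is filled in as NA
left = merge(speeches, party.toy, by.x = "president", by.y = "name",
             all.x = TRUE)
print(left)
```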

Graphical data exploration

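Some plotting commands start a new canvas, for example plot(), barplot(), and truehist(), all of which are used in this section.

Other plotting commands do not start a new canvas, but superimpose the next plot on the canvas that you last plotted on. An example is lines(), which is used below.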

By default, R draws one plot at a time. But sometimes you may want to draw several plots side by side. For example, here we draw histograms of the number of words spoken, separately for each of the age groups that we added above. The par() command sets general plotting parameters. It has a large number of possible settings; see help(par). This particular command splits the plotting area into 2 rows and 3 columns of canvases.

> par(mfrow = c(2,3))

To circumvent problems with missing values (NA), we omit them all for now:

> fisher.a = na.omit(fisher)

Then we plot. The function truehist() comes from the MASS package, which we load first:

> library(MASS)

> levels(fisher.a$agegroup)

[1] "(13.9,25.6]" "(25.6,37.3]" "(37.3,49]"   "(49,60.7]"   "(60.7,72.4]" "(72.4,84.1]"

> truehist(fisher.a[fisher.a$agegroup == "(13.9,25.6]",]$numWords, xlab = "age 13.9-25.6")

> truehist(fisher.a[fisher.a$agegroup == "(25.6,37.3]",]$numWords, xlab = "age 25.6-37.3")

> truehist(fisher.a[fisher.a$agegroup == "(37.3,49]",]$numWords, xlab = "age 37.3-49")

> truehist(fisher.a[fisher.a$agegroup == "(49,60.7]",]$numWords, xlab = "age 49-60.7")

> truehist(fisher.a[fisher.a$agegroup == "(60.7,72.4]",]$numWords, xlab = "age 60.7-72.4")

> truehist(fisher.a[fisher.a$agegroup == "(72.4,84.1]",]$numWords, xlab = "age 72.4-84.1")

You get back to a single canvas by typing

> par(mfrow = c(1,1))
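Several nearly identical histogram calls, one per group, can also be written as a loop over the levels of the grouping column. The following sketch uses made-up data (toy and its two age groups are hypothetical) and hist() from base R in place of truehist(), so that it runs without extra packages. It also saves the old par() settings and restores them afterwards, as an alternative to resetting mfrow by hand.

```r
# Hypothetical toy data: word counts for speakers in two age groups
set.seed(1)
toy = data.frame(agegroup = factor(rep(c("young", "old"), each = 50)),
                 numWords = round(rnorm(100, mean = 950, sd = 300)))

# par() returns the old settings, so that we can restore them afterwards
old.par = par(mfrow = c(1, 2))

# One histogram per age group
for (g in levels(toy$agegroup)) {
  hist(toy$numWords[toy$agegroup == g], main = "",
       xlab = paste("age group", g))
}

# Restore the layout that was in effect before
par(old.par)
```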

A recurring problem when we superimpose multiple plots is choosing the boundaries of the canvas. If you plot only one thing, R sizes the canvas so that all points fit. But anything you superimpose on this plot afterwards need not fit. Try for example:

> plot(inaug.all$duties, type = "l", col = "blue")

> lines(inaug.all$America, type = "l", col = "red")

To solve this problem, use the "xlim" and "ylim" parameters of plot() to set the canvas size yourself. You can use range() to determine the range you need to accommodate. Note that you can pass several vectors to range() at once, and it returns the overall range of all of them combined:

> range(inaug.all$duties)

[1] 0 9

> range(inaug.all$America)

[1]  0 21

> range(inaug.all$duties, inaug.all$America)

[1]  0 21

> plot(inaug.all$duties, type = "l", col = "blue", ylim = range(inaug.all$duties, inaug.all$America))

> lines(inaug.all$America, type = "l", col = "red")
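The same pattern can be tried on any two vectors. Here is a minimal sketch with made-up values (a and b are hypothetical and merely mimic the value ranges seen above).

```r
# Hypothetical toy vectors with different value ranges
a = c(0, 2, 9, 4, 1)
b = c(3, 21, 7, 0, 15)

# Overall range across both vectors
ylim.both = range(a, b)
print(ylim.both)

# Set up the canvas so that both lines fit, then superimpose the second
plot(a, type = "l", col = "blue", ylim = ylim.both, ylab = "count")
lines(b, col = "red")
```

If the vectors contain missing values, range() returns NA unless you add na.rm = TRUE.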

Data exploration: What to plot?

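To get an overview of the possible values of a single vector, we can use the plots shown above: barplot() on the output of xtabs() for counts of each value, and truehist() for a histogram of numeric values.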

Plotting two or more variables in comparison:

We can send the result of xtabs() directly on to barplot(), also for the case where we compare multiple variables.

# The following command draws two stacked bars, one for "animate" and one for "inanimate",

# each with the counts for NP and PP on top of each other

> barplot(xtabs(~RealizationOfRec + AnimacyOfRec, data = verbs))

# The following command draws two bars for "animate" and two for "inanimate",

# with the bars for NP and PP next to each other

# instead of on top of each other

> barplot(xtabs(~RealizationOfRec + AnimacyOfRec, data = verbs), beside = TRUE)
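The difference between the two layouts can also be seen in what barplot() returns, namely the x-positions of the bars. The following sketch rebuilds the table from the counts shown earlier in this section (so that it runs without the languageR package; the object names are our own) and adds a legend with the legend.text argument, so that the NP and PP parts can be told apart.

```r
# Rebuild the 2x2 table from the counts shown earlier in this section
tab = as.table(matrix(c(521, 301, 34, 47), nrow = 2,
                      dimnames = list(RealizationOfRec = c("NP", "PP"),
                                      AnimacyOfRec = c("animate", "inanimate"))))

# Stacked: one bar per column of the table;
# barplot() returns one x-position per bar
mid.stacked = barplot(tab, legend.text = TRUE)

# Side by side: one bar per cell of the table;
# barplot() returns a matrix of x-positions
mid.beside = barplot(tab, beside = TRUE, legend.text = TRUE)
```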

Then there are also mosaic plots, discussed in the Baayen book, but I find them hard to read, so I will not go into them any further.

Scatter plots are another standard way of plotting the relation between two variables. Here we look at the rate at which the word "freedom" is used in inaugural speeches over the years (we divide by length to abstract away from the differing lengths of the speeches):

> plot(inaug.all$year, inaug.all$freedom/inaug.all$length)

When we plot two vectors of the same length, here "year" and normalized "freedom", the first vector describes the x-coordinates of each data point, and the second vector describes the y-coordinates. We should put informative labels on the axes to say what the plot does:

> plot(inaug.all$year, inaug.all$freedom/inaug.all$length, xlab = "Year", ylab = "rel.freq. of 'freedom'")

Is there a trend? We add a line to the plot to sketch the main tendency.

> lines(lowess(inaug.all$year, inaug.all$freedom/inaug.all$length), col="dark grey")

Such a curve is often called a scatterplot smoother. lowess() computes one possible smoothing line (there are multiple ways of estimating the main trend), and lines() adds the smoother to the current plot.
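To see what lowess() actually returns, here is a small sketch on made-up data (x and y are hypothetical, a noisy upward trend). It also tries the parameter f of lowess(), the smoother span, which controls how closely the curve follows the data.

```r
# Hypothetical toy data: a noisy upward trend
set.seed(1)
x = 1:50
y = 0.1 * x + rnorm(50)

# lowess() returns a list with components x and y:
# the coordinates of the smoothed curve
sm = lowess(x, y)
str(sm)

# f is the smoother span: smaller values follow the data more closely
sm.tight = lowess(x, y, f = 0.2)

plot(x, y)
lines(sm, col = "dark grey")
lines(sm.tight, col = "blue", lty = 2)
```

Because the result has components named x and y, it can be handed to lines() directly.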

See the Baayen book for more information on scatterplot smoothers, and also for more plotting variants.