R worksheets: merge and aggregate

Merging data frames

The inaugural data frame does not contain the party affiliation of the presidents. If we had that, we could run a number of additional analyses, for example of speech length or of how often presidents use terms like "freedom" or "democracy". This is a good opportunity to demonstrate how to merge data frames in R.

For convenience, we will use a smaller amount of data. Here is a small dataset of "recent inaugural addresses":

inaug.new = inaugural[ inaugural$year > 1960, ]

It contains 13 entries. We would like to make a new data frame that maps inaugural address years to the affiliation of the president. But how do you make a new data frame by hand (rather than reading one in from a file)?

Here is how: You use the function data.frame( ). Inside the parentheses, you specify each column of the data frame. For example, to make a data frame with one column called "A" and one called "B", you could type:

my.example = data.frame(A = c(1,2,3), B = c(20, 30, 40))

So the column A contains the sequence of values 1,2,3 (in that order), and the column B contains the sequence 20, 30, 40.
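As a quick check, it can help to inspect a hand-made data frame right after creating it. The following is a minimal sketch that rebuilds the same toy data frame and looks at it with a few standard base R functions:

```r
# Rebuild the toy data frame from the example above
my.example = data.frame(A = c(1, 2, 3), B = c(20, 30, 40))

n.rows = nrow(my.example)       # number of rows
col.names = names(my.example)   # column names
col.b = my.example$B            # the column B as a plain vector
second.row = my.example[2, ]    # the second row, still a data frame

print(my.example)
```

All columns given to data.frame( ) need to have the same length (or a length that can be recycled to it), otherwise R reports an error.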

Now we are ready for the data frame mapping inaugural address years to the affiliation of each president:

affiliation = data.frame(year = inaug.new$year, party = c("D", "D", "R", "R", "D", "R", "R", "R", "D", "D", "R", "R", "D"))

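The "year" column in this data frame is copied from inaug.new, and the "party" column is a sequence of "D" and "R" values. The party values must be listed in the same order as the rows of inaug.new, because data.frame( ) pairs the two columns up by position.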

Now we want to merge the two data frames by their common column "year". And that is exactly what we say:

merge(inaug.new, affiliation, by = "year")

This produces a new data frame, and we can give it a name:

inaug.new.withparty = merge(inaug.new, affiliation, by = "year")

What merge( ) has done is to link every row in inaug.new to the matching row in affiliation.
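To see the matching in isolation, here is a small sketch on toy data. The data frames speeches and parties are hypothetical stand-ins for inaug.new and affiliation, and the length values are invented for illustration. It also shows what happens when a year appears in only one of the two data frames:

```r
# Toy stand-ins for inaug.new and affiliation (lengths are invented)
speeches = data.frame(year = c(1961, 1965, 1969),
                      length = c(1500, 1700, 2400))
parties = data.frame(year = c(1961, 1965, 1973),
                     party = c("D", "D", "R"))

# Default behavior: keep only the years found in both data frames
inner = merge(speeches, parties, by = "year")

# all.x = TRUE keeps every row of the first data frame;
# where no party is found, the party column is NA
left = merge(speeches, parties, by = "year", all.x = TRUE)

print(inner)
print(left)
```

In the worksheet's own data every year in inaug.new has a match in affiliation, so the default is enough there.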

We could have saved ourselves some work if we had defined party affiliations for presidents rather than years, since some presidents appear multiple times in our data frame. (We even have two different presidents with the same name and the same affiliation in our data, the two Bushes, so a single "Bush" row covers both.) So we could also say:

affiliation = data.frame(president = c("Kennedy", "Johnson", "Nixon", "Carter", "Reagan", "Bush", "Clinton", "Obama"), party = c("D", "D", "R", "D", "R", "R", "D", "D"))

But now affiliation is a data frame with 8 rows, and inaug.new has 13 rows. What will merge( ) do?

merge(inaug.new, affiliation, by = "president")

It actually does the right thing: It links each row in inaug.new to the single matching row in affiliation. After all, there is only one affiliation row with the right president name that each row in inaug.new could link to.
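The many-to-one matching can be sketched on a toy example. The data frames toy.speeches and toy.affiliation are hypothetical, scaled-down versions of the ones above:

```r
# Toy data: some presidents appear twice in the speeches
toy.speeches = data.frame(president = c("Nixon", "Nixon", "Carter", "Reagan", "Reagan"),
                          year = c(1969, 1973, 1977, 1981, 1985))
toy.affiliation = data.frame(president = c("Nixon", "Carter", "Reagan"),
                             party = c("R", "D", "R"))

# Each speech row is linked to the one affiliation row with the same name
toy.merged = merge(toy.speeches, toy.affiliation, by = "president")

# The result has one row per speech, not one per president
n.merged = nrow(toy.merged)

# merge( ) sorts the result by the "by" column, so re-sort by year if needed
toy.merged = toy.merged[order(toy.merged$year), ]
print(toy.merged)
```

Note that the result is ordered by president name by default, which is why the sketch re-sorts by year at the end.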

We save this new data frame to inaug.new, with the effect that we update the data frame to have a column named party.

inaug.new = merge(inaug.new, affiliation, by = "president")


Now suppose we wanted to know how often "freedom" was used by the Republican vs. the Democratic presidents.

We can just use a constraint:

> sum(inaug.new[inaug.new$party == "D",]$freedom)

[1] 18

> sum(inaug.new[inaug.new$party == "R",]$freedom)

[1] 66

(If we were really interested in this, though, we should use relative frequencies, not absolute frequencies.)

Or, if we want to do a lot of processing on the Republicans versus the Democrats separately, we can give names to the two sub-data frames:

> inaug.new.d = inaug.new[inaug.new$party == "D",]

> inaug.new.r = inaug.new[inaug.new$party == "R",]

> sum(inaug.new.d$freedom)

[1] 18

> sum(inaug.new.r$freedom)

[1] 66

Or better with relative frequencies:

> sum(inaug.new.d$freedom)/sum(inaug.new.d$length)

[1] 0.001540568

> sum(inaug.new.r$freedom)/sum(inaug.new.r$length)

[1] 0.003856492

And there is a third option: We can use the R function aggregate( ). It first divides a data frame up into sub-data frames according to the values of one or more columns (for example party), and then applies a function (for example sum()) to each sub-data frame separately. Note that it does not return the sub-data frames; it only returns the result for each group:

> aggregate(inaug.new$freedom, list(inaug.new$party), sum)

  Group.1  x

1       D 18

2       R 66

This says: compute sum() (3rd parameter) over inaug.new$freedom (1st parameter), but do it separately for each value of inaug.new$party (2nd parameter). The 2nd parameter is a list() because we could in principle separate the data frame by more than one column.
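The role of the list() can be sketched on toy data. The data frame toy and its term column are hypothetical and only serve to give us a second grouping column, and the counts are invented:

```r
# Toy data with two possible grouping columns (all values invented)
toy = data.frame(party = c("D", "D", "R", "R", "R"),
                 term = c("first", "second", "first", "first", "second"),
                 freedom = c(2, 4, 5, 7, 3))

# Same call shape as above: one grouping column
by.party = aggregate(toy$freedom, list(toy$party), sum)

# Naming the list element replaces "Group.1" with a readable column name
by.party.named = aggregate(toy$freedom, list(party = toy$party), sum)

# Two grouping columns: one result row per combination that occurs
by.both = aggregate(toy$freedom, list(party = toy$party, term = toy$term), sum)

print(by.party)
print(by.both)
```

Giving names inside the list() is optional, but it makes the output easier to read than Group.1 and Group.2.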

Here is the mean length of speeches separately by party:

> aggregate(inaug.new$length, list(inaug.new$party), mean)

  Group.1        x

1       D 1947.333

2       R 2444.857

Other options: You can also define a function yourself and give it as the 3rd parameter to aggregate( ). And there are further options for applying functions to whole data frames: tapply, lapply, and sapply, as well as the functions in the package plyr.
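As a rough illustration of two of these options, here is a sketch with a self-defined function and with tapply. The data frame toy, its values, and the function name spread are made up for this example:

```r
# Toy data (lengths invented for illustration)
toy = data.frame(party = c("D", "D", "R", "R", "R"),
                 length = c(1500, 2100, 2400, 1800, 2700))

# A self-defined function: difference between longest and shortest speech
spread = function(v) {
  max(v) - min(v)
}

# Use it as the 3rd parameter, just like sum or mean
spread.by.party = aggregate(toy$length, list(party = toy$party), spread)

# tapply( ) does the same kind of grouping, but returns the results
# as a named array (one entry per group) instead of a data frame
mean.by.party = tapply(toy$length, toy$party, mean)

print(spread.by.party)
print(mean.by.party)
```

Which of the two is more convenient depends on what you want to do with the result: aggregate( ) gives a data frame that can be merged again, while the tapply( ) result can be indexed by group name, as in mean.by.party["D"].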