R worksheet: plotting
For this worksheet we will again use the inaugural dataset, which has data on inaugural speeches, including their lengths and counts of particular words. We also use the fisher dataset, which has word counts in telephone conversations among total strangers.
Plotting in R
The following plotting commands start a new canvas:
plot(): general purpose plotting. Set the "type" parameter to get different types of plots. See the help page for additional parameters for color, point type, and so on.
plot(inaug.all$length, type="p") plots points
plot(inaug.all$length, type="l") plots a line
plot(inaug.all$length, type = "b") plots a line with points superimposed
...
barplot() plots a barplot
boxplot() gives a box-and-whiskers plot that shows median and 1st and 3rd quartiles.
hist() draws a histogram
To get a "true histogram" in which the area of the bars adds up to 1 (like in a density plot), do
library(MASS)
truehist(inaug.all$length)
You may need to install the package MASS first.
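To see what "the area of the bars adds up to 1" means, here is a minimal sketch. It uses made-up numbers (the vector speech.len is hypothetical, not the inaugural data) and base R's hist() with freq = FALSE, which puts densities rather than counts on the y axis, so the bar areas can be checked directly.

```r
# Made-up speech lengths, for illustration only (not the real inaugural data)
speech.len <- c(1200, 1500, 1800, 2100, 2300, 2500, 3000, 3400, 4100, 5200)

# freq = FALSE puts densities rather than counts on the y axis
h <- hist(speech.len, freq = FALSE, main = "Density-scaled histogram")

# Area of each bar = height (density) * width of the bin
bar.areas <- h$density * diff(h$breaks)
total.area <- sum(bar.areas)
print(total.area)
```

The printed total should be 1 up to rounding error, whatever bin width R happens to choose.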
The following plotting commands do not start a new canvas, but superimpose the next plot on the canvas that you last plotted:
lines() draws a line
points() draws points
text() adds text at the given coordinates on the canvas
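As a small sketch of how these superimposing commands combine, the following draws an empty canvas and then adds a line, points, and a text label to it. The values in x and y are made up for illustration.

```r
# Made-up values, for illustration only
x <- 1:5
y <- c(2, 4, 3, 6, 5)

plot(x, y, type = "n")           # start a new, empty canvas sized to fit x and y
lines(x, y, col = "blue")        # add a line to that canvas
points(x, y, pch = 16)           # add filled circles at the same coordinates
text(3, 5.5, labels = "a note")  # add text at x = 3, y = 5.5
```

If you call lines(), points(), or text() before any plot exists, R reports an error, because there is no canvas to draw on yet.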
By default, R draws one plot at a time. But sometimes you may want to draw more than one, for example a separate histogram of the number of words spoken for each age group in a dataset. The par() command sets general plotting parameters. It has a huge number of possible settings; see its help page (?par). The following command sets plotting to 2 rows and 3 columns of plotting canvases:
par(mfrow = c(2,3))
To circumvent problems with missing values (NA), we omit them all for now:
> fisher.a = na.omit(fisher)
We can then plot the number of words that speakers said in the Fisher telephone conversations based on their gender, like this, setting up the canvas to plot two histograms one above the other:
par(mfrow = c(2,1))
truehist(fisher.a[fisher.a$gender == "M",]$numWords, xlab = "male")
truehist(fisher.a[fisher.a$gender == "F",]$numWords, xlab = "female")
You get back to a single canvas by typing
par(mfrow = c(1,1))
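Instead of resetting the layout by hand, you can also store the old settings and restore them afterwards. The sketch below assumes two made-up vectors, group.a and group.b, standing in for the word counts of two groups of speakers.

```r
# Made-up word counts for two groups, for illustration only
group.a <- c(310, 420, 515, 610, 480, 390)
group.b <- c(250, 380, 405, 590, 700, 455)

# par() returns the previous settings invisibly, so we can store them
old.par <- par(mfrow = c(2, 1))

hist(group.a, main = "Group A", xlab = "number of words")
hist(group.b, main = "Group B", xlab = "number of words")

# Restore whatever layout was active before
par(old.par)
```

This is handy when you do not know what layout was in use before your code ran.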
Text as points in a plot
The R command text() plots text at the given x and y coordinates. This can sometimes be a fun visualization option.
To plot how many counts of "freedom" versus "duties" each (recent) president has, we can use the count of freedom as the x axis and the count of duties as the y axis, and plot each president name at the matching coordinates. We only use speeches more recent than 1960, otherwise the plot gets too busy.
inaug.new = inaugural[inaugural$year > 1960,]
plot(inaug.new$freedom, inaug.new$duties, type = "n", xlab = "freedom", ylab = "duties")
text(inaug.new$freedom, inaug.new$duties, labels = inaug.new$president)
The first command makes a data frame inaug.new of recent speeches. The second command, plot(), then plots nothing: that is what type = "n" does. It just sets up the canvas to have the right size (so it fits all counts of freedom on the x axis, and all counts of duties on the y axis) and labels the axes. We need to do that because text() does not start a new canvas; it superimposes on the previous one.
The third command, "text", then prints each president's name at the matching coordinates. For example, Obama is at coordinates (3,2) because his speech contains "freedom" three times and "duties" twice.
Setting xlim and ylim
A recurring problem when we superimpose multiple plots is working out the boundaries of the canvas. If you plot only one thing, then R draws the canvas in such a way that all points will fit. But if you superimpose something on this plot, it need not fit. Try for example:
plot(inaugural$duties, type = "l", col = "blue")
lines(inaugural$America, type = "l", col = "red")
As you can see, the red line runs out of the plotted area and becomes invisible.
To solve this problem, you need to use the "xlim" and "ylim" parameters of plot() to set up the canvas size yourself. xlim sets the limits of the canvas on the x axis, and ylim does the same for the y axis. You can use range() to determine the range you need to accommodate. Note that you can hand "range" multiple vectors at once, and it gives you the overall range for all of them combined.
> range(inaugural$duties)
[1] 0 9
> range(inaugural$America)
[1] 0 21
> range(inaugural$duties, inaugural$America)
[1] 0 21
> plot(inaugural$duties, type = "l", col = "blue", ylim = range(inaugural$duties, inaugural$America))
> lines(inaugural$America, type = "l", col = "red")
Now all of the red line is visible.
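One thing to watch out for: if any of the vectors contains a missing value, range() returns NA unless you ask it to ignore missing values. The following sketch uses two made-up vectors a and b (not the inaugural data) to show the difference.

```r
# Made-up counts, for illustration only; note the missing value in b
a <- c(0, 3, 9, 4)
b <- c(2, NA, 21, 7)

# Without na.rm = TRUE a single NA makes the whole range NA
print(range(a, b))

# With na.rm = TRUE the missing value is ignored
y.limits <- range(a, b, na.rm = TRUE)
print(y.limits)

plot(a, type = "l", col = "blue", ylim = y.limits)
lines(b, type = "l", col = "red")
```

Passing an NA range to ylim makes plot() fail, so na.rm = TRUE is worth remembering when your data has gaps.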
Labeling axes
If you want a plot for official purposes like a research paper, you need to have proper labels on the x and y axes, and ideally also a label for the whole graph. Here is an example, plotting counts of "freedom" versus "duties" with the year on the x axis, and with x and y labels:
plot(inaugural$year, inaugural$freedom, type="l", col="red", xlab = "year in which speech was given", ylab = "word count", main = "Word counts in inaugural speeches")
lines(inaugural$year, inaugural$duties, type="l", col="blue")
Providing a legend
A legend is a little box inside your plot that indicates what each color or line type in your plot means. You need this whenever you superimpose multiple pieces of information in your plot.
Continuing with the year vs. freedom plot from above, we can state:
legend(1800, 25, legend = c("freedom", "duties"), col = c("red", "blue"), pch=15, cex = 0.8)
This command places a legend at x value 1800 and y value 25, showing the words "freedom" and "duties" with a filled box (point character "pch" 15) next to each of them. The boxes are to be red and blue, respectively. The text size, "cex", is reduced to 80% of normal.
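If you would rather not work out coordinates by hand, legend() also accepts a position keyword such as "topright" or "topleft" in place of the x and y values. Here is a minimal sketch with made-up counts (the vectors year, freedom, and duties below are invented, not taken from the inaugural data).

```r
# Made-up counts, for illustration only (not the real inaugural data)
year    <- c(1961, 1965, 1969, 1973, 1977, 1981)
freedom <- c(4, 2, 5, 3, 1, 6)
duties  <- c(1, 0, 2, 1, 3, 0)

plot(year, freedom, type = "l", col = "red",
     ylim = range(freedom, duties),
     xlab = "year", ylab = "word count")
lines(year, duties, col = "blue")

# "topright" positions the legend by keyword; lty = 1 draws a short
# solid line sample next to each label instead of a box
legend("topright", legend = c("freedom", "duties"),
       col = c("red", "blue"), lty = 1, cex = 0.8)
```

Using lty in the legend makes the samples look like the lines in the plot, which can be easier to read than boxes.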
More graphics parameters
You can look up graphing parameters by typing "?par". The point characters are explained under the entry for "points", so look them up using "?points".
Barrier-free plotting
The graphs as we had them up to now are not friendly to colorblind people, who may not see the difference between the red and the blue lines. Here are some options:
Use different line types. lty = "dashed" gives you a dashed line. Other options include "dotted" and "dotdash". Look up "?par" to see all options.
Use type="b" to get both points and lines, and use a different point character for each line by setting the parameter pch.
Using "?points" you can see a number of pretty point types available: pch=0 is an unfilled box, pch=1 an unfilled circle, and so on.
You can also set pch="F" to use the letter "F" as a point. For our example, here is how to use "F" for freedom and "D" for duties as the points:
plot(inaugural$year, inaugural$freedom, type="b", col="red", xlab = "year in which speech was given", ylab = "word count", main = "Word counts in inaugural speeches", pch = "F")
lines(inaugural$year, inaugural$duties, type="b", col="blue", pch="D")
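The options above can also be combined. The following sketch drops color altogether and distinguishes the two lines by line type and point character only, so it should survive a grayscale printout. The vectors year, freedom, and duties are made up for illustration.

```r
# Made-up counts, for illustration only (not the real inaugural data)
year    <- c(1961, 1965, 1969, 1973, 1977, 1981)
freedom <- c(4, 2, 5, 3, 1, 6)
duties  <- c(1, 0, 2, 1, 3, 0)

# Solid line with unfilled boxes for "freedom"
plot(year, freedom, type = "b", lty = "solid", pch = 0,
     ylim = range(freedom, duties),
     xlab = "year", ylab = "word count")

# Dashed line with unfilled circles for "duties"
lines(year, duties, type = "b", lty = "dashed", pch = 1)

# The legend repeats the same line types and point characters
legend("topright", legend = c("freedom", "duties"),
       lty = c("solid", "dashed"), pch = c(0, 1), cex = 0.8)
```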
Now over to you:
(1) Plot the counts of "I" and the counts of "we" across the years: The x axis should show the years, and the y axis the counts. Label your axes, and label the whole graph.
(2) Add a legend.
(3) If you haven't already done so, experiment with different ways to make the plot friendly to colorblind viewers (and to people who view it in grayscale printout).
(4) Plot the names of all recent presidents (choose how recent -- it doesn't have to be 1960, maybe you can fit more or you need to show less) using their count of "I" as the x axis and their count of "we" as the y axis.
(5) Add two new columns to the data frame that show the relative frequency of "I" and of "we" in each speech. Redo the plotting (both (3) and (4)) with relative frequencies.
Data exploration: What to plot?
To get an overview of the possible values of a single vector:
For discrete-valued variables: a barplot
For continuous-valued variables: ordered values, histogram, density, boxplot, quantiles -- see the Baayen book for examples.
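As a rough sketch of these options, the code below draws four overview plots in a 2-by-2 layout and prints the quantiles. The vectors genders and num.words are made up for illustration and merely stand in for a discrete-valued and a continuous-valued variable.

```r
# Made-up data, for illustration only
genders   <- c("F", "M", "F", "F", "M", "F")          # discrete-valued
num.words <- c(310, 420, 515, 610, 480, 390, 905)     # continuous-valued

par(mfrow = c(2, 2))
barplot(table(genders), main = "barplot of counts")   # one bar per category
plot(sort(num.words), main = "ordered values")        # values from smallest to largest
plot(density(num.words), main = "density")            # smoothed density estimate
boxplot(num.words, main = "boxplot")                  # median, quartiles, outliers
par(mfrow = c(1, 1))

# Quantiles as numbers rather than a picture
q <- quantile(num.words)
print(q)
```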
Plotting two or more variables in comparison:
We can send the result of xtabs() directly on to barplot(), also for the case where we compare multiple variables.
# The following command draws one stacked bar each for "animate" and "inanimate",
# with the NP and PP counts stacked on top of each other
> barplot(xtabs(~RealizationOfRec + AnimacyOfRec, data = verbs))
# The following command draws two bars for "animate" and two for "inanimate",
# with the bars for NP and PP next to each other
# instead of on top of each other
> barplot(xtabs(~RealizationOfRec + AnimacyOfRec, data = verbs), beside = TRUE)
Then there are also mosaic plots, discussed in the Baayen book, but I find them hard to read, so I will not go into them any further.
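To try stacked versus side-by-side bar plots without the verbs data, here is a small self-contained sketch. The data frame toy is invented: its column names mimic the ones used above, but the values are made up. The legend.text = TRUE argument asks barplot() to add a legend from the row names of the table.

```r
# Made-up data frame, for illustration only; the column names mimic
# those used above but the values are invented
toy <- data.frame(
  RealizationOfRec = c("NP", "NP", "PP", "NP", "PP", "NP", "PP", "NP"),
  AnimacyOfRec     = c("animate", "animate", "animate", "inanimate",
                       "inanimate", "animate", "inanimate", "animate")
)

# Cross-tabulate: rows are NP/PP, columns are animate/inanimate
counts <- xtabs(~ RealizationOfRec + AnimacyOfRec, data = toy)
print(counts)

par(mfrow = c(1, 2))
# One bar per column (animacy), with the rows (NP, PP) stacked inside each bar
barplot(counts, legend.text = TRUE, main = "stacked")
# Same counts, but NP and PP drawn side by side within each animacy group
barplot(counts, beside = TRUE, legend.text = TRUE, main = "beside")
par(mfrow = c(1, 1))
```

Printing the table first is a useful habit: it tells you which variable ends up in the rows (stacked segments) and which in the columns (separate bars).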
Scatter plots are another standard way of plotting the relation between two variables. Here we look at the rate at which the word "freedom" is used in inaugural speeches over the years (we divide by length to abstract away from the differing lengths of the speeches):
> plot(inaug.all$year, inaug.all$freedom/inaug.all$length)
When we plot two vectors of the same length, here "year" and normalized "freedom", the first vector describes the x-coordinates of each data point, and the second vector describes the y-coordinates. We should put informative labels on the axes to say what the plot does:
> plot(inaug.all$year, inaug.all$freedom/inaug.all$length, xlab = "Year", ylab = "rel.freq. of 'freedom'")
Is there a trend? We add a line to the plot to sketch the main tendency.
> lines(lowess(inaug.all$year, inaug.all$freedom/inaug.all$length), col="dark grey")
Such a curve is often called a scatterplot smoother. lowess() computes one possible smoothing line (there are multiple ways of estimating the main trend), and lines() adds the smoother to the current plot.
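To get a feel for what lowess() returns and how its smoothing can be adjusted, here is a sketch on made-up data (the vectors year and rate are simulated, not the inaugural data). The argument f is the smoother span of lowess(); its default is 2/3, and smaller values make the curve follow the points more closely.

```r
# Made-up data with a rising trend plus noise, for illustration only
set.seed(1)
year <- seq(1800, 2000, by = 10)
rate <- 0.001 + (year - 1800) * 1e-5 + rnorm(length(year), sd = 5e-4)

# lowess() returns a list with components x and y: the smoothed curve
smooth <- lowess(year, rate)
str(smooth)

plot(year, rate, xlab = "Year", ylab = "made-up rate")
lines(smooth, col = "dark grey")                    # default span f = 2/3
lines(lowess(year, rate, f = 0.3), lty = "dashed")  # smaller span, follows the points more closely
```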
See the Baayen book for more information on scatterplot smoothers, and also for more plotting variants.