R worksheet: descriptive statistics
In this worksheet, we use the data frame inaugural (columns separated by commas).
Measuring central tendency
mean(inaugural$length)
# which is the same as:
sum(inaugural$length) / length(inaugural$length)
median(inaugural$length)
# which is the same as:
quantile(inaugural$length, probs = 0.5)
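The equivalences above can be checked on a small vector. This is only a sketch: x holds made-up values, not the inaugural data.

```r
x <- c(3, 1, 4, 1, 5, 9, 2, 6)  # made-up values, for illustration only

# Mean: built-in function versus sum divided by number of values
m1 <- mean(x)
m2 <- sum(x) / length(x)

# Median: built-in function versus the 50% quantile
med1 <- median(x)
med2 <- quantile(x, probs = 0.5)  # same number, but labeled "50%"
```

m1 and m2 hold the same number, and so do med1 and med2; the only difference is that quantile() attaches a name to its result.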
If you just ask for the quantiles without further parameters, you get the lowest value, the 1st quartile, the median, the 3rd quartile, and the highest value:
> quantile(inaugural$length)
     0%     25%     50%     75%    100% 
 147.00 1544.00 2380.00 3172.25 9165.00
If you feel like it, you can also request deciles:
> quantile(inaugural$length, probs = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
 147.0 1223.5 1478.0 1770.0 1935.0 2380.0 2585.0 2873.5 3693.0 4406.5 9165.0
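Typing out all the probabilities is tedious. As a sketch, the base R function seq() can generate them instead; x again holds made-up values rather than the inaugural data.

```r
x <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5)  # made-up values, for illustration only

# Deciles with the probabilities written out by hand
d.long <- quantile(x, probs = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))

# The same probabilities generated by seq(): from 0 to 1 in steps of 0.1
d.short <- quantile(x, probs = seq(0, 1, by = 0.1))
```

Both calls return eleven values, from the minimum (0%) to the maximum (100%).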
[ This worksheet previously also explained how to compute the mode, the value that appears most often in a data set. However, this is nontrivial in R, and gets us into many details of contingency tables. If you really need the mode, you can find its computation in the R code snippets. ]
Mean absolute distance to the mean (not usually used):
mean(abs(inaugural$length - mean(inaugural$length)))
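Measures of spread that are used more often are the variance, the standard deviation, and the range. Here is a minimal sketch using the base R functions var(), sd() and range() on a small made-up vector x (not the inaugural data):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values, for illustration only

# Mean absolute distance to the mean, computed as above
mad.mean <- mean(abs(x - mean(x)))

# Variance: sum of squared distances to the mean, divided by n - 1
v <- var(x)
v.by.hand <- sum((x - mean(x))^2) / (length(x) - 1)

# Standard deviation: the square root of the variance
s <- sd(x)

# Range: the smallest and the largest value
r <- range(x)
```

Note that var() and sd() divide by n - 1 rather than n, which is why the by-hand version does the same.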
hist() and the command truehist() from the MASS package show histograms. To use truehist() you may first have to install the package MASS. You can do this in RStudio in the menu Tools, entry Install Packages. Here are the histograms:
par(mfrow = c(1,2))
This command sets the canvas up for plotting two things at once, next to one another: "par" is for setting general graphics parameters, "mfrow" is for setting up the plotting canvas with a given number of rows and columns (which are then filled in row by row), and c(1,2) says that we want one row with two columns, so two plots next to one another. par(mfrow = c(2,3)) would set up the plotting canvas for 6 plots: 2 rows of 3 columns each. And par(mfrow = c(1,1)) resets the canvas to just show one plot.
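Put together, plotting two histograms side by side could look like the sketch below. The values in x are randomly generated stand-ins for inaugural$length, and pdf(NULL) is only there so that the example draws to a throwaway device; in RStudio you would leave out the pdf() and dev.off() lines.

```r
library(MASS)  # provides truehist()

set.seed(1)
x <- rnorm(100, mean = 2000, sd = 500)  # made-up values, for illustration only

pdf(NULL)  # assumption: throwaway graphics device, no file is written

old.par <- par(mfrow = c(1, 2))  # one row, two columns; previous setting is saved
hist(x)                          # left plot
truehist(x)                      # right plot
layout.now <- par("mfrow")       # the layout currently in effect

par(old.par)                     # restore the previous layout
invisible(dev.off())             # close the throwaway device
```

Saving the result of par() and passing it back later is an alternative to calling par(mfrow = c(1,1)) by hand.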
Instead of plotting the overall distribution of the data as a histogram, which bins the data, you can also do a density plot, which does not bin the data and instead estimates a smooth density curve from the data points.
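A minimal sketch of a density plot, using the base R functions density() and plot(). The values in x are randomly generated; for the worksheet data you would pass inaugural$length instead. As before, pdf(NULL) only serves to draw to a throwaway device and can be left out in RStudio.

```r
set.seed(1)
x <- rnorm(100, mean = 2000, sd = 500)  # made-up values, for illustration only

# Kernel density estimate: d$x holds positions, d$y the estimated density there
d <- density(x)

pdf(NULL)             # assumption: throwaway graphics device
plot(d)               # draws the density curve
invisible(dev.off())
```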
boxplot() shows the first and third quartile as a box with the median as a line through the box. By default, each whisker extends to the most extreme data point that is no more than 1.5 times the length of the box away from the box (you can change that with the range argument), and outliers further out than that are shown as dots.
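boxplot() also returns the numbers it draws, which makes the whisker rule easy to inspect. The sketch below uses a made-up vector in which one value lies far from the rest; pdf(NULL) again just provides a throwaway device.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)  # made-up values; 30 is far from the rest

pdf(NULL)             # assumption: throwaway graphics device
b <- boxplot(x)
invisible(dev.off())

b$stats  # lower whisker, bottom of box, median, top of box, upper whisker
b$out    # the points drawn as dots
```

Here the box runs from 3 to 8, so its length is 5 and the upper whisker may reach at most 8 + 1.5 * 5 = 15.5. The value 30 lies beyond that, so the whisker stops at 9 and 30 is reported in b$out.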
Over to you
Show a histogram of the use of "I" (absolute frequency) in inaugural speeches, using both hist() and truehist(). Also do a density plot of the absolute frequencies of "I". Show all three graphs next to one another on the same canvas.
The relative frequency of "I" is its absolute frequency in a speech divided by the length of the speech. This is the fraction of words in the speech that were "I". To obtain the relative frequency of "I" in each speech, you divide the absolute frequencies by the speech lengths. You can do this for all speeches at once, like this:
relfreq.I = inaugural$I / inaugural$length
Set up your canvas to hold 2 rows of 3 plots each. Then in the first row, show the histogram, both hist() and truehist(), and density plot of the absolute frequencies of "I". In the second row, do a histogram, both hist() and truehist(), and a density plot of the relative frequencies of "I".
Compare the two.
Show a boxplot of the relative frequencies of "I" in inaugural speeches. Also use quantile() to determine the 1st and 3rd quartile and the median exactly.
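The division used for the relative frequencies above works element by element: each absolute frequency is divided by the length in the same row. A small sketch with a hypothetical three-row data frame toy (the column names mirror inaugural, the numbers are made up):

```r
# Hypothetical mini data frame; numbers are made up for illustration
toy <- data.frame(I = c(5, 20, 0), length = c(1000, 2000, 500))

# Element-wise division: one relative frequency per row (speech)
toy.relfreq.I <- toy$I / toy$length
```

The result has one entry per row of toy, just as relfreq.I has one entry per speech.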
Problems using the dative dataset
The dative dataset is available in the package languageR. Once you have installed the package (again, using "Install Packages" in the menu "Tools" in RStudio), you make it available using
library(languageR)
The dative dataset is the extended version of the verbs dataset. Get an idea of what it contains using, for example,
head(dative)
summary(dative)
This dataset analyzes corpus occurrences of ditransitive verbs: In English you can say both "John gave Mary the book" and "John gave the book to Mary". Are these two truly interchangeable, or are there cases when people prefer one form over the other? The column RealizationOfRecipient is the outcome we are interested in: "NP" stands for the form "John gave Mary the book", and "PP" stands for "John gave the book to Mary".
Using the dative dataset:
What is the mean length of theme where RealizationOfRecipient is "NP"? What is the median? What is the standard deviation? What is the range?
Now do the same for the case where RealizationOfRecipient is "PP".
Draw a histogram of the Length of Theme, as well as a boxplot. (Without paying attention to NP versus PP realization of recipient.)
The R package "lattice" contains several commands for comparison plots. For example, here are boxplots of Length of Theme separately by Realization of Recipient:
> library(lattice)
> bwplot(LengthOfTheme ~ RealizationOfRecipient, data = dative)
Here, bwplot is showing LengthOfTheme depending on RealizationOfRecipient. We could also have said
bwplot(dative$LengthOfTheme ~ dative$RealizationOfRecipient)
but that is clunkier to write, and the long names then show up on the axes.
Inspect the plot you got: Do you think there is an effect of length of theme on realization of recipient?
Using bwplot(), inspect the LengthOfRecipient separately for NP and PP recipients. Do you think there is an effect?
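If the formula notation of bwplot() is unfamiliar, it can be tried out on a small data frame first. The sketch below uses a hypothetical data frame toy with made-up numbers (not the dative data); pdf(NULL) only provides a throwaway graphics device and can be left out in RStudio.

```r
library(lattice)

# Hypothetical data: a numeric measurement and a two-level grouping factor
toy <- data.frame(
  len   = c(1, 2, 2, 3, 4, 6, 7, 9),
  group = factor(c("a", "a", "a", "a", "b", "b", "b", "b"))
)

# numeric ~ factor: one box per level of group
p <- bwplot(len ~ group, data = toy)

pdf(NULL)             # assumption: throwaway graphics device
print(p)              # lattice plots are drawn when they are printed
invisible(dev.off())

# Per-group medians, to compare with the dots inside the boxes
med <- tapply(toy$len, toy$group, median)
```

The left-hand side of the formula is what gets plotted, and the right-hand side says how to split it into groups.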