
R worksheet: descriptive statistics

In this worksheet, we use the data frame inauguralX (columns separated by whitespace). It is the same as inaugural, except that it has an extra column counting occurrences of "we". Below, we refer to it under the name "inaugural".
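
If you have a whitespace-separated file saved locally, read.table() can load it. The following is only a sketch: the file contents are an invented stand-in (only the column names length, freedom and we come from this worksheet; year and all the numbers are made up), written to a temporary file so that the example runs on its own.

```r
# Invented stand-in for a whitespace-separated data file; the values are made up.
txt <- c(
  "year length freedom we",
  "1901 2200 3 10",
  "1905 980 1 4",
  "1909 5400 6 22"
)
path <- tempfile(fileext = ".txt")
writeLines(txt, path)

# header = TRUE takes the column names from the first line;
# the default separator (sep = "") is any run of whitespace.
toy <- read.table(path, header = TRUE)
str(toy)
```

With the real file, you would pass its path instead of the temporary one and assign the result to inaugural.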

Measuring central tendency

Mean:
mean(inaugural$length)
# which is the same as:
sum(inaugural$length) / length(inaugural$length)

Median:
median(inaugural$length)
# which is the same as:
quantile(inaugural$length, probs = 0.5)
If you just ask for the quantiles without further parameters, you get the minimum, the 1st quartile, the median, the 3rd quartile, and the maximum:
> quantile(inaugural$length)
     0%     25%     50%     75%    100%
 147.00 1544.00 2380.00 3172.25 9165.00
If you feel like it, you can also request deciles:
> quantile(inaugural$length, probs = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%
 147.0 1223.5 1478.0 1770.0 1935.0 2380.0 2585.0 2873.5 3693.0 4406.5
  100%
9165.0
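
To see the relation between median() and quantile() on something small enough to check by eye, here is a sketch on a toy vector (the numbers are made up, not taken from the inaugural data). It also shows seq() as a shorter way to write out the decile probabilities.

```r
# Toy data, not taken from the inaugural data frame.
x <- c(2, 4, 4, 5, 7, 9, 30)

median(x)                    # middle value of the sorted data: 5
quantile(x, probs = 0.5)     # same value, labelled "50%"

# seq() saves typing out all eleven probabilities.
quantile(x, probs = seq(0, 1, by = 0.1))
```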

Mode: R has no built-in function for this (mode() reports how an object is stored, not its most frequent value), but here is how you do it. We use the counts of "freedom" rather than the speech lengths: the speech lengths never repeat, so there is no mode (or everything is a mode).

freq.df = data.frame( xtabs( ~inaugural$freedom) )
freq.df[ which(freq.df$Freq == max(freq.df$Freq)), ]$inaugural.freedom
The first line creates a data frame that counts how often each count of "freedom" occurs. (It takes the output of xtabs and transforms it to a data frame.) Its columns are "inaugural.freedom" and "Freq". Print this data frame by itself to see what it looks like.
The second line finds all the rows of freq.df in which the Freq value equals the maximum Freq value, and displays their inaugural.freedom values, that is, all the values that are maximally frequent. So it displays multiple modes if there is more than one.
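
The same idea can be wrapped in a small function. This is a sketch of one possible helper, not something that ships with R: the name stat_mode is made up for illustration, and it uses table() instead of xtabs() to do the counting.

```r
# Hypothetical helper (not part of base R): returns every value that
# occurs with the maximum frequency, so ties give several modes.
stat_mode <- function(x) {
  counts <- table(x)
  modes <- names(counts)[counts == max(counts)]
  # table() keeps the values as character labels; convert back for numeric input
  if (is.numeric(x)) as.numeric(modes) else modes
}

stat_mode(c(0, 1, 1, 2, 5, 5, 7))   # two modes: 1 and 5
stat_mode(c("a", "b", "b"))         # "b"
```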

Spread

Range:
range(inaugural$length)


Mean absolute deviation from the mean (not commonly used):
mean(abs(inaugural$length - mean(inaugural$length)))
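
A word of caution about a similar-sounding built-in: R's mad() computes the median absolute deviation from the median (scaled by a constant), which is a different statistic. The sketch below puts the two next to each other on a made-up toy vector.

```r
x <- c(2, 4, 4, 5, 7, 9, 30)   # toy data

# Mean absolute deviation from the mean, as in the line above
mean(abs(x - mean(x)))

# mad() is a different statistic: the median absolute deviation from the
# median, multiplied by a constant (1.4826 by default)
mad(x)
mad(x, constant = 1)   # the raw median of abs(x - median(x)): 2
```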


Variance:
var(inaugural$length)


Standard deviation:
sd(inaugural$length)
# which is the same as:
sqrt(var(inaugural$length))
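
Just as the mean can be computed by hand, so can the variance. Note that var() computes the sample variance, which divides by n - 1 rather than n. A sketch on a made-up toy vector:

```r
x <- c(2, 4, 4, 5, 7, 9, 30)   # toy data
n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1
by_hand <- sum((x - mean(x))^2) / (n - 1)

all.equal(by_hand, var(x))        # TRUE
all.equal(sqrt(by_hand), sd(x))   # TRUE
```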

Visualization

hist() and truehist() (from the MASS package) draw histograms. Here are both histograms and a density plot. (The first line sets up the canvas to show three plots side by side in one row.)
par(mfrow = c(1,3))
hist(inaugural$length)
library(MASS)
truehist(inaugural$length)
plot(density(inaugural$length))
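
One visible difference between the two histograms is the vertical axis: hist() shows counts by default, while truehist() by default uses a density scale, on which the bar areas add up to 1 (the two may also choose different bins). The following sketch uses hist() with plot = FALSE on made-up toy data to look at both scales without drawing anything.

```r
x <- c(2, 4, 4, 5, 7, 9, 30)   # toy data

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(x, plot = FALSE)

h$counts                          # bar heights on the count scale
h$density                         # bar heights on the density scale
sum(h$density * diff(h$breaks))   # the bar areas add up to 1

# hist(x, freq = FALSE) draws the density-scale version
```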

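boxplot() draws a box from the first to the third quartile, with the median as a line through the box. By default, each whisker extends to the most extreme data point that lies no more than 1.5 times the length of the box (the interquartile range) beyond the box; you can change this with the range parameter. Points further out than that are outliers and are shown as dots.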
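
To see the numbers behind a boxplot, you can call boxplot.stats(), which computes the box, whisker, and outlier values without drawing. A sketch on a made-up toy vector (strictly speaking, the box edges are "hinges", which can differ slightly from the quartiles that quantile() reports):

```r
x <- c(2, 4, 4, 5, 7, 9, 30)   # toy data

s <- boxplot.stats(x)   # coef = 1.5 by default
s$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker: 2 4 5 8 9
s$out     # points beyond the whiskers, drawn as dots by boxplot(): 30
```

Here the box runs from 4 to 8, so the upper whisker may reach at most 8 + 1.5 * 4 = 14. The largest data point within that limit is 9, and 30 is flagged as an outlier.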

Over to you

  • Show a histogram of the use of "I" (relative frequency, not absolute) in inaugural speeches, using both hist() and truehist(). Also do a density plot of the relative frequencies of "I". Show all three graphs next to one another on the same canvas.
  • Show a boxplot of the relative frequencies of "I" in inaugural speeches. Also use quantile() to determine the 1st and 3rd quartile and the median exactly.
  • What is the most frequent frequency of "I" in inaugural speeches? What is the most frequent frequency of "we"? (This is the mode.)
  • Use xtabs() to check how often each frequency of "duties" occurred. Do the same for "I" and "we". Visualize the result of xtabs() using barplot(). Use the parameter beside=TRUE to show the bars next to each other rather than on top of each other.

Problems using the dative dataset

The dative dataset is available in the package languageR. Once you have installed the package, you make it available using
library(languageR)
The dative dataset is the extended version of the verbs dataset. Get an idea of what it contains using
head(dative)

The column RealizationOfRecipient is the outcome we are interested in: "NP" stands for the form "John gave Mary the book", and "PP" stands for "John gave the book to Mary".

Using the dative dataset:
  • What is the mean length of theme where RealizationOfRecipient is "NP"? What is the median? What is the standard deviation? What is the range?
  • Now do the same for the case where RealizationOfRecipient is "PP".
  • Draw a histogram of the Length of Theme, as well as a boxplot. (Without paying attention to NP versus PP realization of recipient.)
  • The R package "lattice" contains several commands for comparison plots. For example, here are boxplots of Length of Theme separately by Realization of Recipient:
    > library(lattice)
    > bwplot(LengthOfTheme ~ RealizationOfRecipient, data = dative)
    Here, bwplot is showing LengthOfTheme depending on RealizationOfRecipient. We could also have said
    bwplot(dative$LengthOfTheme ~ dative$RealizationOfRecipient), but that is clunkier to write, and the long names show up on the axes.

    Inspect the plot you got: Do you think there is an effect of length of theme on realization of recipient?
    Using bwplot(), inspect the LengthOfRecipient separately for NP and PP recipients. Do you think there is an effect?
  • Use xtabs() to explore the effect of different categorical variables (that is, variables that have factor values, such as PronomOfTheme being either "pronominal" or "nonpronominal") on RealizationOfRecipient. Use barplot() to visualize the results of xtabs(). Which variables seem to have an influence?
  • Try using xtabs() with more than 2 variables, like xtabs(~ RealizationOfRecipient + PronomOfRec + AccessOfRec). What happens?
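
If you have not used xtabs() with two variables before, here is a sketch of the mechanics on a tiny made-up data frame. The column names outcome and pronoun and all the values are invented for illustration; they are not columns of dative.

```r
# Toy data frame with invented values; the column names are hypothetical.
toy <- data.frame(
  outcome = c("NP", "NP", "PP", "NP", "PP", "PP", "NP", "NP"),
  pronoun = c("yes", "yes", "no", "no", "no", "no", "yes", "no")
)

# Cross-tabulate: rows are outcome, columns are pronoun
tab <- xtabs(~ outcome + pronoun, data = toy)
tab

# Column-wise proportions make groups of different sizes comparable
prop.table(tab, margin = 2)

# beside = TRUE draws one bar per cell, grouped by column
barplot(tab, beside = TRUE, legend.text = TRUE)
```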




