R worksheet: descriptive statistics

In this worksheet, we use the data frame inaugural (columns separated by commas).
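If you are unsure what a comma-separated data frame looks like, here is a minimal sketch. The rows below are made up for illustration and are not the real inaugural data; the column name speech is an assumption, and only length appears in this worksheet.

```r
# Made-up CSV text standing in for a real file (values are illustrative only)
csv_text <- "speech,length
s1,1200
s2,2500
s3,1800"

# read.csv() splits columns at commas and uses the first line as column names
toy <- read.csv(text = csv_text)

str(toy)       # one row per speech, one column per variable
toy$length     # the $ operator picks out a single column
```

With a real file you would pass the file name instead of text =, for example read.csv("myfile.csv"), where myfile.csv is a placeholder name.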

Measuring central tendency

Mean:

mean(inaugural$length)

# which is the same as:

sum(inaugural$length) / length(inaugural$length)

Median:

median(inaugural$length)

# which is the same as:

quantile(inaugural$length, probs = 0.5)
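To see how mean and median differ, it can help to try both on a small vector with one extreme value. This is a sketch on toy numbers, not on the inaugural data:

```r
x <- c(2, 4, 4, 5, 100)    # toy data with one extreme value

mean(x)                    # 23: pulled upwards by the value 100
sum(x) / length(x)         # the same computation by hand

median(x)                  # 4: the middle value after sorting
quantile(x, probs = 0.5)   # the same value, labelled "50%"
```

The median is much less affected by the single large value than the mean is.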

If you call quantile() without further arguments, you get the minimum, the 1st quartile, the median, the 3rd quartile, and the maximum:

> quantile(inaugural$length)

     0%     25%     50%     75%    100%

 147.00 1544.00 2380.00 3172.25 9165.00

If you feel like it, you can also request deciles:

> quantile(inaugural$length, probs = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))

    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%

 147.0 1223.5 1478.0 1770.0 1935.0 2380.0 2585.0 2873.5 3693.0 4406.5

  100%

9165.0
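Typing out all the probabilities is tedious. One way to shorten this is to generate them with seq(). A small sketch on toy data:

```r
x <- 1:11                                     # toy data

# seq(0, 1, by = 0.1) produces 0, 0.1, 0.2, ..., 1
q <- quantile(x, probs = seq(0, 1, by = 0.1))
q
names(q)                                      # "0%", "10%", ..., "100%"
```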

[ This worksheet previously also explained how to compute the mode, the value that appears most often in a data set. However, this is nontrivial in R, and gets us into many details of contingency tables. If you really need the mode, you find its computation in R code snippets. ]
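To give a flavor of what is involved, here is one possible sketch of a mode computation on toy data. It is not necessarily the version in the R code snippets, and the variable name mode_value is made up (mode() itself is already taken by an unrelated R function).

```r
x <- c(3, 1, 3, 2, 1, 3)       # toy data

counts <- table(x)             # contingency table: how often each value occurs
counts

# Keep every value whose count equals the highest count (there may be ties)
mode_value <- names(counts)[counts == max(counts)]
mode_value                     # "3", as a character string
as.numeric(mode_value)         # convert back to a number
```

Note that table() stores the values as names, so the result comes back as text and has to be converted.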

Spread

Range:

range(inaugural$length)

Mean absolute deviation from the mean (not commonly used):

mean(abs(inaugural$length - mean(inaugural$length)))

Variance:

var(inaugural$length)

Standard deviation:

sd(inaugural$length)

# which is the same as:

sqrt(var(inaugural$length))
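To see what var() and sd() compute, you can redo the calculation by hand on a small vector. Note that R divides by n - 1 (the sample variance), not by n. A sketch on toy data; the variable names are made up:

```r
x <- c(2, 4, 4, 5, 10)               # toy data, mean is 5
n <- length(x)

dev <- x - mean(x)                   # deviations from the mean: -3 -1 -1 0 5

manual_var <- sum(dev^2) / (n - 1)   # sample variance: 36 / 4 = 9
manual_sd  <- sqrt(manual_var)       # standard deviation: 3
manual_mad <- mean(abs(dev))         # mean absolute deviation: 10 / 5 = 2

manual_var
var(x)                               # same as manual_var
manual_sd
sd(x)                                # same as manual_sd
```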

Visualization

hist() and truehist() from the MASS package both draw histograms. To use truehist() you may first have to install the package MASS. You can do this in RStudio in the menu Tools, entry Install Packages.

par(mfrow = c(1,2))

hist(inaugural$length)

library(MASS)

truehist(inaugural$length)

The first line sets the canvas up for plotting two things at once, next to one another: par() sets general graphics parameters, "mfrow" (multiple figures, filled row by row) sets up the plotting canvas with a given number of rows and columns, and c(1,2) says that we want one row with two columns, so two plots next to one another. par(mfrow = c(2,3)) would set up the plotting canvas for 6 plots: 2 rows of 3 columns each. And par(mfrow = c(1,1)) resets the canvas to show just one plot.
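Instead of resetting by hand, you can also store the old settings and restore them afterwards, because par() returns the previous values of whatever you change. A sketch on toy data; the name old_par is made up:

```r
x <- c(2, 4, 4, 5, 7, 8, 12)      # toy data

old_par <- par(mfrow = c(1, 2))   # switch to 1 row, 2 columns; keep the old settings
hist(x)                           # drawn in the left panel
boxplot(x)                        # drawn in the right panel
par(old_par)                      # restore the settings from before
```

This way you do not need to remember what the canvas looked like before you changed it.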

Instead of plotting the overall distribution of the data as a histogram, which bins the data, you can also draw a density plot, which does not bin the data but estimates a smooth density curve from it:

plot(density(inaugural$length))
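To compare the two views directly, you can draw the density curve on top of a histogram. The sketch below uses simulated data rather than the inaugural data. The adjust argument of density() controls how smooth the curve is.

```r
set.seed(1)                              # make the simulated data reproducible
x <- rnorm(200, mean = 50, sd = 10)      # simulated toy data

hist(x, freq = FALSE)                    # freq = FALSE: densities on the y axis, not counts
d <- density(x)                          # kernel density estimate with default smoothing
lines(d)                                 # add the curve to the histogram
lines(density(x, adjust = 2), lty = 2)   # a smoother curve, drawn dashed
```

The histogram has to be drawn with freq = FALSE here, since otherwise its y axis shows counts and the curve would be on a different scale.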

boxplot() shows the first and third quartile as a box with the median as a line through the box. By default, each whisker extends to the most extreme data point that lies within 1.5 times the length of the box from the box edge (you can change this with the range argument), and outliers further away than that are shown as dots.

boxplot(inaugural$length)
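If you want the numbers behind the picture, boxplot.stats() returns them without drawing anything. A sketch on toy data with one obvious outlier. Note that R computes the box edges as "hinges", which can differ slightly from the quartiles that quantile() reports.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # toy data, 30 is far from the rest

s <- boxplot.stats(x)

s$stats   # lower whisker, lower box edge, median, upper box edge, upper whisker
s$out     # points beyond the whiskers, drawn as dots in the boxplot
```

Here the box runs from 3 to 8, so its length is 5 and the whiskers may reach at most 7.5 beyond the box. The upper whisker therefore stops at 9, the largest data point below 15.5, and 30 is reported as an outlier.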

Over to you

Problems using the dative dataset

The dative dataset is available in the package languageR. Once you have installed the package (again, using "Install packages" in the menu item "Tools" in RStudio), you make it available using

library(languageR)

The dative dataset is the extended version of the verbs dataset. Get an idea of what it contains using

head(dative)

This dataset records corpus occurrences of ditransitive verbs. In English you can say both "John gave Mary the book" and "John gave the book to Mary". Are these two truly interchangeable, or are there cases when people prefer one form over the other? The column RealizationOfRecipient is the outcome we are interested in: "NP" stands for the form "John gave Mary the book", and "PP" stands for "John gave the book to Mary".

Using the dative dataset: