First steps in R

When you start R, you should see something like this:

This is the R shell. You see a blinking cursor next to the ">". Here you can type something.

Using R as a calculator

The simplest things you can type in after the ">" are arithmetic expressions. Try some out -- you will see that R provides the answer, next to a number in square brackets:

(In the following, lines starting with > are what you type in, and the other lines are what the system will answer.)

> 1+3

[1] 4

> 10 * 2

[1] 20

> 10 ** 2

[1] 100

We will get back to the [1] below. For now, just ignore it. The two stars, "**", stand for "to the power of", so "10 ** 2" means 10 squared. (R also accepts "^" for the same operation: "10 ^ 2".)

More interesting data, and variables

Mostly when we do statistics, we want to deal not with single numbers but with a whole collection of values. Here is an example: the lengths of all the inaugural addresses (in words) from 1789 to 2009.

c(1538, 147, 2585, 1935, 2384, 1265, 1304, 3693, 4909, 3150, 1208, 1267, 4171, 9165, 5196, 1182, 3657, 3098, 4005, 785,

1239, 1478, 2724, 3239, 1828, 4750, 2153, 4371, 2450, 1091, 5846, 1905, 1656, 3756, 4442, 3890, 2063, 2019, 1536, 637,

2528, 2775, 1917, 1546, 1715, 2425, 2028, 1380, 2801, 2946, 2713, 1855, 2462, 1825, 2376, 2726)

This is what R calls a "vector" -- simply an ordered sequence. The "c()" is part of this: Every time you want to give R an ordered sequence, you have to type c() around it, and commas between the values. (Why "c"? It stands for "combine": c() combines its arguments into one vector.)

If you paste this into R, it will quote the whole vector back to you. This will look approximately like this (depending on how wide your R window is):

[1] 1538 147 2585 1935 2384 1265 1304 3693 4909 3150 1208 1267 4171 9165 5196 1182 3657 3098 4005 785 1239 1478

[23] 2724 3239 1828 4750 2153 4371 2450 1091 5846 1905 1656 3756 4442 3890 2063 2019 1536 637 2528 2775 1917 1546

[45] 1715 2425 2028 1380 2801 2946 2713 1855 2462 1825 2376 2726

It is showing a [1] in front of the first line: The first line starts with the first inaugural address. The second line starts with [23]: The next number is the length of the 23rd inaugural address. The third line says [45]: The next number belongs to the 45th inaugural address.

What can we do with a vector? For example, we can ask how many inaugural addresses we have, that is, what the length of the vector is:

> length(c(1538, 147, 2585, 1935, 2384, 1265, 1304, 3693, 4909, 3150, 1208, 1267, 4171, 9165, 5196, 1182, 3657, 3098, 4005, 785,

1239, 1478, 2724, 3239, 1828, 4750, 2153, 4371, 2450, 1091, 5846, 1905, 1656, 3756, 4442, 3890, 2063, 2019, 1536, 637,

2528, 2775, 1917, 1546, 1715, 2425, 2028, 1380, 2801, 2946, 2713, 1855, 2462, 1825, 2376, 2726))

[1] 56

So length() gives us the length of a vector. But typing in the whole vector every time we want to do something with it is very cumbersome. Fortunately, there is a simple solution: Just give this vector a name, and refer to it by that name. I will call it "inaug.len" because it has the lengths of inaugural addresses.

> inaug.len = c(1538, 147, 2585, 1935, 2384, 1265, 1304, 3693, 4909, 3150, 1208, 1267, 4171, 9165, 5196, 1182, 3657, 3098, 4005, 785, 1239, 1478, 2724, 3239, 1828, 4750, 2153, 4371, 2450, 1091, 5846, 1905, 1656, 3756, 4442, 3890, 2063, 2019, 1536, 637,

2528, 2775, 1917, 1546, 1715, 2425, 2028, 1380, 2801, 2946, 2713, 1855, 2462, 1825, 2376, 2726)

So whenever I want to give a name to a vector (or, actually, some other kind of data), I write the name, then "=", then the vector. For example:

xyz = c(1,2,3,4)

Names for vectors and other pieces of data are called variables.
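Many R users write the assignment with an arrow, "<-", instead of "=". Here is a minimal sketch showing that both forms create the same kind of variable. The name abc is just a made-up example.

```r
# Two ways of giving a name to a vector.
xyz = c(1, 2, 3, 4)    # assignment with "="
abc <- c(1, 2, 3, 4)   # assignment with "<-", the arrow form

# Both names now refer to vectors with the same contents.
print(identical(xyz, abc))   # TRUE
print(length(abc))           # 4
```

In this text I will keep using "=", but you will see "<-" in a lot of other R code.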

Now that our vector of inaugural address lengths has a name, we can simply say:

> length(inaug.len)

[1] 56

What can variable names look like? I can't use just anything I like. For example, the following won't work:

> 1x = c(1,2,3,4)

Error: unexpected symbol in "1x"

> this!is!the!name!! = c(1,2,3,4)

Error: unexpected '!' in "this!"

Here's the rule: A variable name can contain any letter or digit (A-Z, a-z, 0-9), the underscore _, and the full stop. It cannot begin with a digit or an underscore (or with a full stop followed by a digit).
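If you want to test a name without provoking an error, one option is the function make.names(), which repairs invalid names. The sketch below assumes that a name is fine exactly when make.names() leaves it unchanged. The candidate names are made up for illustration.

```r
# make.names() turns any string into a syntactically valid name.
# If it leaves a string unchanged, the string was already valid.
candidates = c("inaug.len", "inaug_len", "x1", "1x", "_x", "this!is")
is.valid = make.names(candidates) == candidates

# Show each candidate next to the verdict.
print(data.frame(candidates, is.valid))
```

Note that make.names() also changes reserved words such as "if", so it is slightly stricter than the rule stated here.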

Working with vectors

What else can we do with inaug.len? Think of at least 3 questions that we may want to answer about the vector of inaugural address lengths. (Scroll down to see my suggestions. But think first.)

Here are some questions that I came up with:

# How long is the longest address?

> max(inaug.len)

[1] 9165

# And which one is it?

> which.max(inaug.len)

[1] 14

# (This only tells us that it's the 14th address. We'll see later

# how we can figure out what year, and what president, it was)

# How would we determine how long the shortest address is, and which

# one it is? Try to guess the name of the command.

# How long were the inaugural addresses on average?

> mean(inaug.len)

[1] 2602.411
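The average that mean() reports is the sum of all entries divided by the number of entries. Here is a small sketch with a made-up vector v (not part of the inaugural data), so that the arithmetic is easy to check by hand:

```r
# A short made-up vector.
v = c(10, 20, 30, 100)

print(sum(v))               # 160
print(sum(v) / length(v))   # 40
print(mean(v))              # 40, the same value
print(median(v))            # 25, the middle of the sorted values
```

The median is less affected by a single very large entry than the mean is, which is worth remembering when one speech is much longer than all the others.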

And here is something else that we can do with the vector: We can visualize it.

# This will plot each inaugural address length as a point.

> plot(inaug.len)

# A bit easier to read: each inaugural address length as a bar.

> barplot(inaug.len)

Accessing individual entries in a vector: indices

Suppose we wanted to know what the length of the first inaugural address was. How can we say this? Like this:

> inaug.len[1]

[1] 1538

So: name of the vector, then square brackets, and the number of the address that we are interested in. This number is called an index. Here are some more:

> inaug.len[14]

[1] 9165

> inaug.len[2]

[1] 147

> inaug.len[100]

[1] NA

We don't have 100 inaugural addresses, so the entry number 100 is "NA", Not Available.
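You can test for "NA" with the function is.na(). A minimal sketch, using a short made-up vector v:

```r
# A vector with three entries.
v = c(5, 8, 13)

print(v[3])          # 13, the last entry
print(v[4])          # NA: there is no 4th entry
print(is.na(v[4]))   # TRUE
print(length(v))     # 3: asking for a missing entry does not change the vector
```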

But there is more that we can do with indices. What does the following do, and why?


How about this?

> inaug.len[1:10]

(Scroll down to read what this is.)

1:10 is actually a vector: It is an abbreviation for c(1,2,3,4,5,6,7,8,9,10). If you would like to check that this is true, try

> 1:10

Now, what could this mean?

> inaug.len[-1]

It's all the inaugural addresses except for the first one.
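The general idea is that the index inside the brackets can itself be a vector. Here is a small sketch with a made-up vector v, chosen so that you can check every answer by eye:

```r
v = c(10, 20, 30, 40, 50)

print(v[c(1, 3)])   # 10 30: the 1st and 3rd entries
print(v[2:4])       # 20 30 40: entries 2 through 4
print(v[-1])        # 20 30 40 50: everything except the 1st entry
print(v[-(1:2)])    # 30 40 50: everything except the first two entries
```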

But vector indices can do even more. What do you think this does?

> inaug.len[inaug.len > 2000]

This gives you all the inaugural addresses that were more than 2000 words long.

We can also combine two conditions. "&" stands for "and", and "|" stands for "or":

> inaug.len[inaug.len > 1000 & inaug.len < 2000]

[1] 1538 1935 1265 1304 1208 1267 1182 1239 1478 1828 1091 1905 1656 1536 1917 1546 1715 1380 1855 1825

> inaug.len[inaug.len < 200 | inaug.len > 4000]

[1] 147 4909 4171 9165 5196 4005 4750 4371 5846 4442

What other conditions can you put in square brackets? Experiment a bit.
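It helps to look at a condition on its own, outside the brackets. A condition such as "v > 2000" is again a vector, with one TRUE or FALSE per entry. Here is a sketch using the first six lengths, stored under the made-up names v and long:

```r
v = c(1538, 147, 2585, 1935, 2384, 1265)   # the first six lengths

long = v > 2000
print(long)          # FALSE FALSE TRUE FALSE TRUE FALSE
print(v[long])       # 2585 2384: the entries where the condition is TRUE
print(which(long))   # 3 5: the positions of the TRUE entries
print(sum(long))     # 2: TRUE counts as 1, FALSE as 0
```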

Data frames

A vector is just a simple sequence of values. Above, we used inaug.len, a vector of lengths of inaugural addresses. But it would have been nice to have more information, for example the name of the president and the year for each inaugural address. Without that, it is not so interesting to know how long each inaugural address is. So what we need is a table, something like this:

In R, tables are called data frames. Each row in a data frame stands for one datapoint, in our case one inaugural address. Each column in the data frame stands for one type of data, for example presidents or years. Each column has a name, for example "President" or "Year".

Here is a small example of just the first few inaugural addresses as an R data frame:

inaug = data.frame(president = c("Washington", "Washington", "Adams", "Jefferson", "Jefferson", "Madison"),

year = c(1789, 1793, 1797, 1801, 1805, 1809),

length = c(1538, 147, 2585, 1935, 2384, 1265))

We make a data frame and name it "inaug" (that's a variable again). A data frame is described by saying data.frame(). Compare to vectors above: They had c(), and data frames have data.frame().

Within the data frame, there are three vectors, one for each column. The first one is named "president", and what follows has the form c() because it is a vector. All the vectors within a data frame must have the same length (because we are describing a table).
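To check the shape of a data frame, you can ask for its number of rows and columns and for its column names. The sketch below repeats the small data frame from above. The last step uses tryCatch() only to show, without stopping the script, that vectors of length 3 and 2 are rejected.

```r
inaug = data.frame(president = c("Washington", "Washington", "Adams",
                                 "Jefferson", "Jefferson", "Madison"),
                   year = c(1789, 1793, 1797, 1801, 1805, 1809),
                   length = c(1538, 147, 2585, 1935, 2384, 1265))

print(nrow(inaug))    # 6 rows, one per address
print(ncol(inaug))    # 3 columns
print(names(inaug))   # "president" "year" "length"

# Vectors of length 3 and 2 do not fit into one table.
result = tryCatch(data.frame(a = c(1, 2, 3), b = c(1, 2)),
                  error = function(e) "error")
print(result)         # "error"
```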

If we now type "inaug" at the prompt, R will print the table.

> inaug

president year length

1 Washington 1789 1538

2 Washington 1793 147

3 Adams 1797 2585

4 Jefferson 1801 1935

5 Jefferson 1805 2384

6 Madison 1809 1265

How can we access parts of a data frame?

We can access a column by the name of the data frame, then "$", then the name of the column:

> inaug$president

Inside a column, you can index entries just like you would in a vector. Here is how you access the length of the 2nd inaugural address:

> inaug$length[2]

There is another way to access parts of a data frame: In a vector, we used a single number for indexing, for example "2" to access the second entry. In a data frame, we need to talk about both the row and the column. Here is how we get at row 3, column 1:

> inaug[3,1]

If you want a complete row, just leave the column index empty (but do not forget to put the comma). Here's the row for the 4th inaugural address:

> inaug[4,]

Now suppose you want to access a complete column: column 3, which has the length of each speech. You know (or can figure out) two ways of doing this. What are they?

You can access a column by name or by index:

> inaug$length

[1] 1538 147 2585 1935 2384 1265

> inaug[,3]

[1] 1538 147 2585 1935 2384 1265
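If you want R to confirm that these ways really give the same result, identical() will do it. The sketch below repeats the small data frame and also tries a third variant, the column name in quotes inside the brackets. The variable names by.name, by.index and by.label are made up.

```r
inaug = data.frame(president = c("Washington", "Washington", "Adams",
                                 "Jefferson", "Jefferson", "Madison"),
                   year = c(1789, 1793, 1797, 1801, 1805, 1809),
                   length = c(1538, 147, 2585, 1935, 2384, 1265))

by.name  = inaug$length        # column by name
by.index = inaug[, 3]          # column by number
by.label = inaug[, "length"]   # a quoted column name also works as an index

print(identical(by.name, by.index))   # TRUE
print(identical(by.name, by.label))   # TRUE
```

Using names rather than numbers has the advantage that the code keeps working if the columns are later rearranged.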

Look back to all the things that we did with indices for a vector. Can you do them with a data frame as well?

Yes, you can do the following:

# Get rows 1 and 3.

# And by the way, R code lines that start with "#" are comments. They are ignored by R.

# Note the comma after c(1,3): This is because we are selecting rows, not columns.

> inaug[c(1,3),]

# Get columns 2 and 3

> inaug[,c(2,3)]

# Selecting rows with lengthy speeches

inaug[inaug$length > 2000,]

# Selecting rows where the president's name has fewer than 10 characters.

# as.character(...) transforms to a string.

# nchar(...) determines the number of characters in a string

> nchar("Washington")

[1] 10

> nchar(as.character("Washington"))

[1] 10

> nchar(as.character(inaug$president))

[1] 10 10 5 9 9 7

> inaug[nchar(as.character(inaug$president)) < 10, ]

president year length

3 Adams 1797 2585

4 Jefferson 1801 1935

5 Jefferson 1805 2384

6 Madison 1809 1265

# Selecting rows where the president's name has fewer than 10 characters

# but the speech is long.

# "&" again stands for "and"

> inaug[nchar(as.character(inaug$president)) < 10 & inaug$length > 2000, ]

president year length

3 Adams 1797 2585

5 Jefferson 1805 2384

Try some other conditions!

Sometimes we are only interested in one column of the selected rows. Here are two ways in which we can answer the following question:

In what years did presidents give long speeches (more than 2000 words)?

> inaug[inaug$length > 2000, ]$year

[1] 1797 1805

> inaug[inaug$length > 2000, 2]

[1] 1797 1805

Working with more data

We are currently working with a very short dataset of inaugural speeches. Things would be more interesting with more data. But typing in larger data frames by hand would be very tedious. Fortunately R has an easy mechanism for reading in data from files.

On the course schedule page, there is a link called "Data: Word counts in inaugural speeches". Follow the link to see a longer version of the small dataset that we have been working with. This is a file in "csv" format, which means "comma separated values". It is basically a very simple text file: Its first line contains the names of all columns. In the following lines there are the data for all inaugural speeches from 1789 to 2009, with columns separated by commas. Save this data as a file to your computer. Then you can read it into R like this:

> inaug.all = read.csv("/Users/katrinerk/Downloads/inaugural.csv")

(I gave the path where I stored the inaugural speeches file. This will probably be different on your computer.)

Incidentally, .csv is a format that Excel can write, so you can collect data in an Excel spreadsheet, "Save As" .csv, and then read the data into R.
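If you would like to see the whole round trip without downloading anything, here is a minimal sketch. It writes a tiny made-up table to a temporary file and reads it back. The names small, path and back are my own, and tempfile() simply picks a scratch location on your computer.

```r
# Write a tiny table to a temporary .csv file.
small = data.frame(president = c("Washington", "Adams"),
                   year = c(1789, 1797),
                   length = c(1538, 2585))
path = tempfile(fileext = ".csv")
write.csv(small, path, row.names = FALSE)

# The file is plain text: a header line, then one line per row.
print(readLines(path))

# Reading it back gives a data frame with the same columns.
back = read.csv(path)
print(names(back))   # "president" "year" "length"
print(nrow(back))    # 2
```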

If you now just type inaug.all at the prompt, you see the whole file fly by. (Try it so you will know that you have more data now.) To see just the first few lines, you can say

> head(inaug.all)

To get a quick overview of our table, type
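Two functions commonly used for such an overview are summary() and str(). Here is a sketch on a small stand-in data frame, which I call small (a made-up name; with the real data you would pass inaug.all instead):

```r
small = data.frame(year = c(1789, 1793, 1797, 1801),
                   length = c(1538, 147, 2585, 1935))

print(head(small, 2))   # only the first 2 rows
print(summary(small))   # minimum, quartiles, median, mean, maximum per column
str(small)              # one line per column: its type and first values
```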


Now we can try working with indices and constraints again to get some more interesting answers. Try answering the following questions:

    • How many speeches were there that were more than 2000 words long? Here is a new command that will be useful for this: With nrow() you can determine the number of rows in a data frame.

    • How many in the last century?

    • Who was the president that gave the shortest speech? Who gave the longest?

    • In which year was the word "democracy" used the most?

    • How many times was "citizen" mentioned in the year when "duties" was used the most?


R has a function for automatically sorting vectors. Here is how it works on our vector of speech lengths from above:

> sort(inaug.len)

[1] 147 637 785 1091 1182 1208 1239 1265 1267 1304 1380 1478 1536 1538 1546 1656 1715 1825 1828 1855 1905 1917

[23] 1935 2019 2028 2063 2153 2376 2384 2425 2450 2462 2528 2585 2713 2724 2726 2775 2801 2946 3098 3150 3239 3657

[45] 3693 3756 3890 4005 4171 4371 4442 4750 4909 5196 5846 9165

This sorts the lengths of inaugural addresses from shortest to longest.

Here is what happens when we sort the president names (as strings):

> sort(as.character(inaug.all$president))

[1] "Adams" "Adams" "Buchanan" "Bush" "Bush" "Bush" "Carter" "Cleveland"

[9] "Cleveland" "Clinton" "Clinton" "Coolidge" "Eisenhower" "Eisenhower" "Garfield" "Grant"

[17] "Grant" "Harding" "Harrison" "Harrison" "Hayes" "Hoover" "Jackson" "Jackson"

[25] "Jefferson" "Jefferson" "Johnson" "Kennedy" "Lincoln" "Lincoln" "Madison" "Madison"

[33] "McKinley" "McKinley" "Monroe" "Monroe" "Nixon" "Nixon" "Obama" "Pierce"

[41] "Polk" "Reagan" "Reagan" "Roosevelt" "Roosevelt" "Roosevelt" "Roosevelt" "Roosevelt"

[49] "Taft" "Taylor" "Truman" "VanBuren" "Washington" "Washington" "Wilson" "Wilson"

So strings are sorted alphabetically.

But how about if we wanted to sort president names by the lengths of their speeches? In this case, what we do is:

# order() gives the positions of the speeches from shortest to longest;

# we use these positions to extract the presidents

> inaug.all$president[order(inaug.all$length)]

[1] Washington Roosevelt Lincoln Roosevelt Taylor Jackson Grant Madison Jackson Madison

[11] Carter Grant Roosevelt Washington Kennedy Wilson Johnson Bush Cleveland Clinton

[21] Wilson Eisenhower Jefferson Roosevelt Nixon Roosevelt Cleveland Bush Jefferson Nixon

[31] McKinley Clinton Truman Adams Bush Hayes Obama Eisenhower Reagan Reagan

[41] Buchanan Adams Garfield Pierce Monroe Harding Hoover Lincoln VanBuren McKinley

[51] Coolidge Harrison Monroe Polk Taft Harrison

How does this work? (Note: This will get more involved than all previous R commands that we have looked at. But it will be useful to understand this, not only for sorting stuff, but also to understand what you can do with indices.)

As a first step, let's look at the order() command. What does it give you? Let's look at a simpler vector first.

> myvec = c(4,1,10, 20, 9)

> order(myvec)

[1] 2 1 5 3 4

What does this do?

The R function order() returns, for each position in the sorted sequence, the index of the element that belongs there. So, it answers the question: "If I sorted the vector, which element would come first, which second, and so on?" So, in the example from above,

> myvec = c(4,1,10, 20, 9)

> order(myvec)

[1] 2 1 5 3 4

the first element in the ordered vector would be the 2nd in the original, namely the number 1. The second element in the ordered vector would be the 1st in the original, namely 4. And so on.
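You can check this reading of order() directly: using its result as an index should give the sorted vector. The sketch below also shows rank(), a different function that is easy to confuse with order(). The variable name positions is made up.

```r
myvec = c(4, 1, 10, 20, 9)

positions = order(myvec)
print(positions)          # 2 1 5 3 4
print(myvec[positions])   # 1 4 9 10 20
print(sort(myvec))        # 1 4 9 10 20, the same result

# rank() answers a different question: where does each entry end up?
print(rank(myvec))        # 2 1 4 5 3
```

So order() tells you where to fetch each sorted element from, while rank() tells you where each original element would go.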

Why is that interesting? Because we can now sort one vector by the order of another vector. In our case (and let's go back to the small dataset), we want to sort inaug$president by the order of inaug$length.

> order(inaug$length)

[1] 2 6 1 4 5 3

> inaug$president

[1] Washington Washington Adams Jefferson Jefferson Madison

So if we select from inaug$president the entries in the order of order(inaug$length), we select:

    • the 2nd president: Washington

    • then the 6th president: Madison

    • then the 1st president: Washington

    • then the 4th president: Jefferson

and so on.

So, when we say

inaug$president[ order(inaug$length) ]

we are using order(inaug$length) as a vector of indices that we use to select from inaug$president. We select every entry from inaug$president in turn, but we select them in the order given by order(inaug$length).

We can sort whole data frames this way, but we have to put a comma after the order() command, because now we are sorting rows of a data frame:

inaug.all[ order(inaug.all$length), ]
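Here is the same idea as a sketch on the small data frame, where the result is easy to check by hand. It also sorts from longest to shortest, using the decreasing argument of order(). The names shortest.first and longest.first are made up.

```r
inaug = data.frame(president = c("Washington", "Washington", "Adams",
                                 "Jefferson", "Jefferson", "Madison"),
                   year = c(1789, 1793, 1797, 1801, 1805, 1809),
                   length = c(1538, 147, 2585, 1935, 2384, 1265))

# Rows from shortest to longest speech.
shortest.first = inaug[order(inaug$length), ]
print(shortest.first$year)   # 1793 1809 1789 1801 1805 1797

# Rows from longest to shortest speech.
longest.first = inaug[order(inaug$length, decreasing = TRUE), ]
print(longest.first$year)    # 1797 1805 1801 1789 1809 1793
```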

Adding columns to a data frame

Adding a column to a data frame is straightforward: Here, for example, we add a column that stores the counts for both "I" and "me":

> inaug.all$I_me = inaug.all$I + inaug.all$me
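The addition works entry by entry: row 1 of the new column is the sum of row 1 of the two old columns, and so on. Here is a sketch with a tiny data frame named counts, whose numbers are made up for illustration only:

```r
# Made-up word counts for three speeches.
counts = data.frame(I = c(20, 1, 13), me = c(9, 0, 2))

# The new column is computed row by row from the two existing ones.
counts$I_me = counts$I + counts$me

print(counts$I_me)     # 29 1 15
print(names(counts))   # "I" "me" "I_me"
```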

Now we do something more complicated: Say we want to see whether 2nd speeches by the same president tend to be shorter. But that means that we have to figure out which speeches are the 2nd ones by the same president. Another way of putting it: We need to determine, for each speech, who gave the previous speech. If it's the same person, then we have a 2nd speech by the same president.

Here is our list of presidents throughout the years:

> inaug.all$president

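Even if the result may look like a vector of strings, these may not be strings. In R versions before 4.0, read.csv() stores text columns as factors by default -- we will get into this later. (From R 4.0 on, they are read as plain strings, and the conversion below simply leaves them unchanged.) For now, let us turn the vector of presidents into a vector of strings: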

> as.character(inaug.all$president)

As you can see, we now have quotes around all of them, and the "Levels" are gone.

We can get a vector of "previous presidents" by shifting this vector by one: We cut off the last president because he is not anyone's previous president:

> as.character(inaug.all$president)[1:(length(inaug.all$president) - 1)]

Then we add that for the very first president, the previous president was no one (X).

> inaug.all$prev = c("X", as.character(inaug.all$president)[1:(length(inaug.all$president) - 1)])

Now we can compare the average length of speeches where the current president is the same as the previous one to speeches where that is not the case. And in fact, there is a difference:

> mean(inaug.all[inaug.all$president == inaug.all$prev,]$length)

[1] 1900.059

> mean(inaug.all[inaug.all$president != inaug.all$prev,]$length)

[1] 2908.564
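To see the shifting trick at a size where every step can be checked by hand, here is a sketch on the small data frame with six speeches. The names pres, prev and second are made up. With so few rows the averages say nothing about presidents in general; the point is only how the comparison works.

```r
inaug = data.frame(president = c("Washington", "Washington", "Adams",
                                 "Jefferson", "Jefferson", "Madison"),
                   year = c(1789, 1793, 1797, 1801, 1805, 1809),
                   length = c(1538, 147, 2585, 1935, 2384, 1265))

pres = as.character(inaug$president)

# Shift by one: drop the last president, put "X" in front.
prev = c("X", pres[1:(length(pres) - 1)])
inaug$prev = prev

# TRUE where a speech was given by the same president as the one before.
second = pres == prev
print(second)                        # FALSE TRUE FALSE FALSE TRUE FALSE

print(mean(inaug$length[second]))    # 1265.5
print(mean(inaug$length[!second]))   # 1830.75
```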

Drawing graphs

Did the inaugural addresses generally get longer over time, or shorter? The easiest way to get a quick first impression of what the answer will be is to visualize. This is really easy in R:

> plot(inaug.all$length)

This is maybe not the best way to visualize this particular data, though. A single line would be a more typical (and more readable) way to visualize changing lengths of speeches. This, too, is easy: Just tell R to plot the data with type "l" for "line":

> plot(inaug.all$length,type = "l")

That is better. Try this with some of the other columns in the data frame: How about the number of times the word "citizen" was used? The word "America"? The word "I"?

(Is this a good way to check whether these words were used more or less over the years? There is one thing that we are not taking into account -- what is it?)

We can also visualize multiple columns in the same plot. But we cannot use the "plot" command both times. "plot()" erases what was on the canvas, and starts from scratch. There are separate commands for adding lines or points to a plot that is already there. To add a line, the command is "lines()":

> plot(inaug.all$America, type = "l")

> lines(inaug.all$citizen)

Now it is hard to tell the two lines apart. We had better use some color and/or different line types:

> plot(inaug.all$America, type = "l", col = "blue")

> lines(inaug.all$citizen, col= "red", lty = "dotted")
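One thing to watch out for: plot() chooses the height of the canvas from the first line only, so a second line with larger values can run off the top. Here is a sketch that sets the range by hand and adds a legend. The two count vectors are made up, and pdf(NULL) is only there so the example runs without opening a window.

```r
# Made-up counts for two words over six speeches.
america = c(2, 0, 5, 8, 3, 12)
citizen = c(9, 6, 4, 7, 1, 2)

pdf(NULL)   # draw to a throwaway device; leave this out at the R prompt

# Make the y axis tall enough for both lines.
y.range = range(c(america, citizen))
plot(america, type = "l", col = "blue", ylim = y.range,
     xlab = "speech number", ylab = "count")
lines(citizen, col = "red", lty = "dotted")

# Say which line is which.
legend("topleft", legend = c("America", "citizen"),
       col = c("blue", "red"), lty = c("solid", "dotted"))

invisible(dev.off())
```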

Here are some more ways of plotting: A barplot shows each entry as a bar. A boxplot illustrates the median, the first and third quartiles, and outliers:

> barplot(inaug.all$length)

> boxplot(inaug.all$length)
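The numbers that a boxplot draws can also be computed directly. Here is a sketch with a small made-up vector x that contains one obvious outlier:

```r
x = c(1, 2, 3, 4, 100)

print(median(x))                     # 3
print(quantile(x, c(0.25, 0.75)))    # first and third quartile: 2 and 4
print(boxplot.stats(x)$out)          # 100, the point drawn as an outlier
```

boxplot.stats() returns the statistics that boxplot() uses, so its "out" entry lists exactly the points that would appear as separate dots.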