First steps in R
When you start R, you should see something like this:
This is the R shell. You see a blinking cursor next to the ">". Here you can type something.
Using R as a calculator
The simplest things you can type after the ">" are arithmetic expressions. Try some out -- you will see that R provides the answer, next to a number in square brackets:
(In the following, lines starting with > are what you type in, and the other lines are what the system will answer.)
We will get back to the [1] below. For now, just ignore it. The two stars, "**", stand for "to the power of". So "10 ** 2" means 10 squared.
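A few expressions of this kind, as a minimal sketch. The prompt is left out here, and the value R prints is given in the comment after each line. Note that "^" is the more common way to write a power in R; "**" gives the same result.

```r
# Basic arithmetic; the printed value is shown in the trailing comment.
1 + 2        # 3
7 / 2        # 3.5
10 ** 2      # 100, same as 10 ^ 2
(3 + 4) * 5  # 35
```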
More interesting data, and variables
Mostly when we do statistics, we want to deal not with single numbers but with a whole collection of values. Here is an example: the lengths of all the inaugural addresses (in words) from 1789 to 2009.
This is what R calls a "vector" -- simply an ordered sequence. The "c()" is part of this: Every time you want to give R an ordered sequence, you have to type c() around it, and commas between the values. (Why "c"? It stands for "combine": c() combines its arguments into one vector.)
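A shortened sketch of such a vector, covering only the first six addresses. The word counts are approximate and meant as illustration; the full vector has one entry per inaugural address.

```r
# Word counts of the first six inaugural addresses (approximate, illustrative).
c(1431, 135, 2321, 1730, 2166, 1177)
```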
If you paste this into R, it will quote the whole vector back to you. This will look approximately like this (depending on how wide your R window is):
It is showing a [1] in front of the first line: The first line starts with the first inaugural address. The second line starts with [23]: The next number is the length of the 23rd inaugural address. The third line says [45]: The next number belongs to the 45th inaugural address.
What can we do with a vector? For example, we can ask how many inaugural addresses we have, that is, what the length of the vector is:
So length() gives us the length of a vector. But typing in the whole vector every time we want to do something with it is very cumbersome. Fortunately, there is a simple solution: Just give this vector a name, and refer to it by that name. I will call it "inaug.len" because it has the lengths of inaugural addresses.
So whenever I want to give a name to a vector (or, actually, some other kind of data), I write the name, then "=", then the vector. For example:
Names for vectors and other pieces of data are called variables.
Now that our vector of inaugural address lengths has a name, we can simply say:
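A minimal sketch of naming a vector and then using the name. Only six illustrative (approximate) word counts are used here, so the length comes out as 6 rather than the number of addresses in the full data.

```r
# Give the vector a name (illustrative values for the first six addresses).
inaug.len = c(1431, 135, 2321, 1730, 2166, 1177)

inaug.len          # typing the name prints the stored vector
length(inaug.len)  # 6
```

Many R users write the assignment as "inaug.len <- c(...)"; both forms work at the prompt.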
What can variable names look like? I can't use just anything I like. For example, the following won't work:
Here's the rule: A variable name can contain any letter or digit (A-Z, a-z, 0-9), the underscore _, and the full stop. It cannot begin with a digit or an underscore.
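The rule can be tried out as follows. The names below are made up for illustration; the invalid ones are commented out so that the block runs.

```r
# Valid variable names (hypothetical examples).
inaug.len  = 1431
inaug_len2 = 135
InaugLen   = 2321

# Not valid: each of these lines would raise an error if uncommented.
# 2nd.speech = 135   # begins with a digit
# inaug-len  = 135   # "-" is read as minus, so this is not a name
```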
Working with vectors
What else can we do with inaug.len? Think of at least 3 questions that we may want to answer about the vector of inaugural address lengths. (Scroll down to see my suggestions. But think first.)
Here are some questions that I came up with:
And here is something else that we can do with the vector: We can visualize it.
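Some questions of this kind, with base R functions that would answer them, followed by a simple plot. This is an illustrative selection using six approximate word counts, not necessarily the same questions as in the original list.

```r
inaug.len = c(1431, 135, 2321, 1730, 2166, 1177)  # illustrative values

mean(inaug.len)  # average length: 1493.333
min(inaug.len)   # shortest address: 135
max(inaug.len)   # longest address: 2321
sum(inaug.len)   # total number of words: 8960

plot(inaug.len)  # one point per address, in the order of the vector
```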
Accessing individual entries in a vector: indices
Suppose we wanted to know what the length of the first inaugural address was. How can we say this? Like that:
So: name of the vector, then square brackets, and the number of the address that we are interested in. This is called an index. Here are some more:
We don't have 100 inaugural addresses, so the entry number 100 is "NA", Not Available.
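A small sketch of single-number indices, again with six illustrative values standing in for the full vector.

```r
inaug.len = c(1431, 135, 2321, 1730, 2166, 1177)  # illustrative values

inaug.len[1]    # first entry: 1431
inaug.len[3]    # third entry: 2321
inaug.len[100]  # NA, because there is no 100th entry
```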
But there is more that we can do with indices. What does the following do, and why?
How about this?
(Scroll down to read what this is.)
1:10 is actually a vector: It is an abbreviation for c(1,2,3,4,5,6,7,8,9,10). If you would like to check that this is true, try
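One way to check this, and to see a vector being used as an index. The data are six illustrative values, so the index range is 1:3 instead of 1:10.

```r
inaug.len = c(1431, 135, 2321, 1730, 2166, 1177)  # illustrative values

1:10                                           # 1 2 3 4 5 6 7 8 9 10
all(1:10 == c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))  # TRUE

inaug.len[c(1, 3)]  # entries 1 and 3: 1431 2321
inaug.len[1:3]      # entries 1 to 3: 1431 135 2321
```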
Now, what could this mean?
It's all the inaugural addresses except for the first one.
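One expression with this meaning is a negative index, which drops the entry at that position instead of selecting it. A sketch with six illustrative values:

```r
inaug.len = c(1431, 135, 2321, 1730, 2166, 1177)  # illustrative values

inaug.len[-1]  # everything except the first entry: 135 2321 1730 2166 1177
```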
But vector indices can do even more. What do you think this does?
This gives you all the inaugural addresses that were more than 2000 words long.
We can also combine two conditions. "&" stands for "and", and "|" stands for "or":
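A sketch of conditions used as indices, with six illustrative values. The thresholds other than 2000 are chosen only for the example. A condition on a vector yields one TRUE or FALSE per entry, and the index keeps the entries where it is TRUE.

```r
inaug.len = c(1431, 135, 2321, 1730, 2166, 1177)  # illustrative values

inaug.len > 2000                                # FALSE FALSE TRUE FALSE TRUE FALSE
inaug.len[inaug.len > 2000]                     # 2321 2166
inaug.len[inaug.len > 1000 & inaug.len < 2000]  # 1431 1730 1177
inaug.len[inaug.len < 1000 | inaug.len > 2000]  # 135 2321 2166
```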
What other conditions can you put in square brackets? Experiment a bit.
Vectors are just a simple sequence of numbers. Above, we used inaug.len, a vector of lengths of inaugural addresses. But it would have been nice to have more information, for example the name of the president and the year for each inaugural address. Without that, it is not so interesting to know how long each inaugural address is. So what we need is a table, something like this:
In R, tables are called data frames. Each row in a data frame stands for one datapoint, in our case one inaugural address. Each column in the data frame stands for one type of data, for example presidents or years. Each column has a name, for example "President" or "Year".
Here is a small example of just the first few inaugural addresses as an R data frame:
We make a data frame and name it "inaug" (that's a variable again). A data frame is described by saying data.frame(). Compare to vectors above: They had c(), and data frames have data.frame().
Within the data frame, there are three vectors, one for each column. The first one is named "president", and what follows has the form c() because it is a vector. All the vectors within a data frame must have the same length (because we are describing a table).
If we now type "inaug" at the prompt, R will print the table.
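A sketch of how such a data frame could be built, with six rows. The column names president, year and length follow the text; the word counts are approximate and meant as illustration.

```r
# Three vectors of equal length, one per column.
inaug = data.frame(
  president = c("Washington", "Washington", "Adams",
                "Jefferson", "Jefferson", "Madison"),
  year   = c(1789, 1793, 1797, 1801, 1805, 1809),
  length = c(1431, 135, 2321, 1730, 2166, 1177)
)

inaug  # prints the table, one row per address
```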
How can we access parts of a data frame?
We can access a column by the name of the data frame, then "$", then the name of the column:
Inside a column, you can index entries just like you would in a vector. Here is how you access the length of the 2nd inaugural address:
There is another way to access parts of a data frame: In a vector, we used a single number for indexing, for example "2" to access the second entry. In a data frame, we need to talk about both the row and the column. Here is how we get at row 3, column 1:
If you want a complete row, just leave the column index empty (but do not forget to put the comma). Here's the row for the 4th inaugural address:
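The access patterns described so far, sketched on a six-row data frame with illustrative (approximate) word counts:

```r
inaug = data.frame(
  president = c("Washington", "Washington", "Adams",
                "Jefferson", "Jefferson", "Madison"),
  year   = c(1789, 1793, 1797, 1801, 1805, 1809),
  length = c(1431, 135, 2321, 1730, 2166, 1177)
)

inaug$length     # the whole column named length
inaug$length[2]  # length of the 2nd address: 135
inaug[3, 1]      # row 3, column 1: Adams
inaug[4, ]       # all of row 4: Jefferson, 1801, 1730
```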
Now suppose you want to access a complete column: column 3, which has the length of each speech. You know (or can figure out) two ways of doing this. What are they?
You can access a column by name or by index:
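Both ways, sketched on the same kind of small data frame (illustrative values). Leaving the row index empty selects all rows.

```r
inaug = data.frame(
  president = c("Washington", "Washington", "Adams",
                "Jefferson", "Jefferson", "Madison"),
  year   = c(1789, 1793, 1797, 1801, 1805, 1809),
  length = c(1431, 135, 2321, 1730, 2166, 1177)
)

inaug$length  # by name: 1431 135 2321 1730 2166 1177
inaug[, 3]    # by index: the same values
```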
Look back to all the things that we did with indices for a vector. Can you do them with a data frame as well?
Yes, you can do the following:
Try some other conditions!
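A sketch of vector-style indexing applied to the rows of a data frame, using six illustrative rows. The index goes before the comma because it selects rows.

```r
inaug = data.frame(
  president = c("Washington", "Washington", "Adams",
                "Jefferson", "Jefferson", "Madison"),
  year   = c(1789, 1793, 1797, 1801, 1805, 1809),
  length = c(1431, 135, 2321, 1730, 2166, 1177)
)

inaug[1:3, ]                  # rows 1 to 3
inaug[-1, ]                   # all rows except the first
inaug[inaug$length > 2000, ]  # rows for speeches longer than 2000 words
inaug[inaug$year > 1800 & inaug$length < 2000, ]  # two conditions combined
```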
Sometimes we are only interested in some columns of a selected row. Here are two ways in which we can answer the following question:
In what years did presidents give long speeches (more than 2000 words)?
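Two ways that would answer this, sketched on six illustrative rows: either select the rows and name the column inside the brackets, or take the column first and index it like a vector.

```r
inaug = data.frame(
  president = c("Washington", "Washington", "Adams",
                "Jefferson", "Jefferson", "Madison"),
  year   = c(1789, 1793, 1797, 1801, 1805, 1809),
  length = c(1431, 135, 2321, 1730, 2166, 1177)
)

inaug[inaug$length > 2000, "year"]  # 1797 1805
inaug$year[inaug$length > 2000]     # 1797 1805
```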
Working with more data
We are currently working with a very short dataset of inaugural speeches. Things would be more interesting with more data. But typing in larger data frames by hand would be very tedious. Fortunately R has an easy mechanism for reading in data from files.
On the course schedule page, there is a link called "Data: Word counts in inaugural speeches". Follow the link to see a longer version of the small dataset that we have been working with. This is a file in "csv" format, which means "comma separated values". It is basically a very simple text file: Its first line contains the names of all columns. In the following lines there are the data for all inaugural speeches from 1789 to 2009, with columns separated by commas. Save this data as a file to your computer. Then you can read it into R like this:
(I gave the path where I stored the inaugural speeches file. This will probably be different on your computer.)
Incidentally, .csv is a format that Excel can write, so you can collect data in an Excel spreadsheet, "Save As" .csv, and then read the data into R.
If you now just type inaug.all at the prompt, you see the whole file fly by. (Try it so you will know that you have more data now.) To see just the first few lines, you can say
To get a quick overview of our table, type
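A runnable sketch of reading a csv file and inspecting it with read.csv(), head() and summary(). Since the course file is not available here, the block first writes a tiny stand-in file to a temporary location; csv.path is a hypothetical name, and on your computer it would be the path of the file you saved.

```r
# Stand-in for the downloaded file (three rows, illustrative values).
csv.path = tempfile(fileext = ".csv")
writeLines(c("president,year,length",
             "Washington,1789,1431",
             "Washington,1793,135",
             "Adams,1797,2321"), csv.path)

inaug.all = read.csv(csv.path)  # the first line becomes the column names
head(inaug.all)                 # the first rows (at most six)
summary(inaug.all)              # a per-column overview
```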
Now we can try working with indices and constraints again to get some more interesting answers. Try answering the following questions:
How many speeches were there that were more than 2000 words long?
Here is a new command that will be useful for this: With nrow() you can determine the number of rows in a data frame.
How many in the last century?
Who was the president that gave the shortest speech? Who gave the longest?
In which year was the word "democracy" used the most?
How many times was "citizen" mentioned in the year when "duties" was used the most?
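A sketch of how the first and third question could be approached, on six illustrative rows instead of the full file, so the answers differ from those for the real data. which.min() and which.max() return the position of the smallest and largest entry.

```r
inaug = data.frame(
  president = c("Washington", "Washington", "Adams",
                "Jefferson", "Jefferson", "Madison"),
  year   = c(1789, 1793, 1797, 1801, 1805, 1809),
  length = c(1431, 135, 2321, 1730, 2166, 1177)
)

nrow(inaug[inaug$length > 2000, ])       # number of long speeches: 2
inaug$president[which.min(inaug$length)]  # shortest speech: Washington
inaug$president[which.max(inaug$length)]  # longest speech: Adams
```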
R has a function for automatically sorting vectors. Here is how it works on our vector of speech lengths from above:
This sorts the lengths of inaugural addresses from shortest to longest.
Here is what happens when we sort the president names (as strings):
So strings are sorted alphabetically.
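Both kinds of sorting, sketched with sort() on six illustrative entries:

```r
inaug.len = c(1431, 135, 2321, 1730, 2166, 1177)  # illustrative values
presidents = c("Washington", "Washington", "Adams",
               "Jefferson", "Jefferson", "Madison")

sort(inaug.len)   # 135 1177 1431 1730 2166 2321
sort(presidents)  # Adams, Jefferson, Jefferson, Madison, Washington, Washington
```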
But how about if we wanted to sort president names by the lengths of their speeches? In this case, what we do is:
How does this work? (Note: This will get more involved than all previous R commands that we have looked at. But it will be useful to understand this, not only for sorting stuff, but also to understand what you can do with indices.)
As a first step, let's look at the order() command. What does it give you? Let's look at a simpler vector first.
What does this do?
The R function order() returns the positions of the items in sorted order. So, it answers the question: "If I ordered the elements of the vector, which position in the original vector would each of them come from?" So, in the example from above, the first element in the ordered vector would be the 2nd in the original, namely the number 1. The second element in the ordered vector would be the 1st in the original, namely 4. And so on.
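A sketch with a hypothetical four-element vector x, chosen so that it matches the description (it starts with 4 and 1). Using the result of order() as an index gives the sorted vector.

```r
x = c(4, 1, 7, 5)  # hypothetical example vector

order(x)     # 2 1 4 3: the smallest value sits at position 2, the next at 1, ...
x[order(x)]  # 1 4 5 7, the same as sort(x)
```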
Why is that interesting? Because we can now sort one vector by the order of another vector. In our case -- and let's go back to the small dataset -- we want to sort inaug$president by the order of inaug$length.
So if we select from inaug$president the entries in the order of order(inaug$length), we select:
the 2nd entry: Washington
then the 6th entry: Madison
then the 1st entry: Washington
then the 4th entry: Jefferson
and so on.
So, when we say
we are using order(inaug$length) as a vector of indices that we use to select from inaug$president. We select every entry from inaug$president in turn, but we select them in the order of order(inaug$length).
We can sort whole data frames this way, but we have to put a comma after the order() command, because now we are sorting rows of a data frame:
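Both steps, sketched on a six-row data frame. The word counts are approximate and illustrative, chosen so that the order starts with entries 2, 6, 1 and 4 as described above.

```r
inaug = data.frame(
  president = c("Washington", "Washington", "Adams",
                "Jefferson", "Jefferson", "Madison"),
  year   = c(1789, 1793, 1797, 1801, 1805, 1809),
  length = c(1431, 135, 2321, 1730, 2166, 1177)
)

order(inaug$length)                   # 2 6 1 4 5 3
inaug$president[order(inaug$length)]  # names from shortest to longest speech
inaug[order(inaug$length), ]          # whole rows, sorted by speech length
```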
Adding columns to a data frame
Adding a column to a data frame is straightforward: Here, for example, we add a column that stores the counts for both "I" and "me":
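A sketch of adding such a column. The column names I and me, the new name I.me, and all counts below are assumptions made up for this example; the real file may name its columns differently.

```r
# Hypothetical word-count columns; the counts are made up.
inaug.all = data.frame(
  president = c("Washington", "Washington", "Adams"),
  I  = c(23, 6, 13),
  me = c(8, 1, 3)
)

# Assigning to a column name that does not exist yet creates the column.
inaug.all$I.me = inaug.all$I + inaug.all$me
inaug.all$I.me  # 31 7 16
```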
Now we do something more complicated: Say we want to see whether 2nd speeches by the same president tend to be shorter. But that means that we have to figure out which speeches are the 2nd ones by the same president. Another way of putting it: We need to determine, for each speech, who gave the previous speech. If it's the same person, then we have a 2nd speech by the same president.
Here is our list of presidents throughout the years:
Even if the result may look like a vector of strings, they may not be strings. In R versions before 4.0, text columns are stored as factors by default -- we will get into this later. For now, let us turn the vector of presidents into a vector of strings:
As you can see, we now have quotes around all of them, and the "Levels" are gone.
We can get a vector of "previous presidents" by shifting this vector by one. We cut off the last president because he is not anyone's previous president:
Then we add that for the very first president, the previous president was no one (X).
Now we can compare the average length of speeches where the current president is the same as the previous one to speeches where that is not the case. And in fact, there is a difference:
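The whole procedure as a sketch on six illustrative rows. The names pres, prev and same are hypothetical helper variables, and the averages below hold only for these six values, not for the full dataset.

```r
inaug = data.frame(
  president = c("Washington", "Washington", "Adams",
                "Jefferson", "Jefferson", "Madison"),
  year   = c(1789, 1793, 1797, 1801, 1805, 1809),
  length = c(1431, 135, 2321, 1730, 2166, 1177)
)

pres = as.character(inaug$president)  # strings, whether or not it was a factor
prev = c("X", pres[-length(pres)])    # drop the last entry, put "X" in front
same = pres == prev                   # TRUE where the previous speaker is the same

mean(inaug$length[same])   # 2nd speeches by the same president: 1150.5
mean(inaug$length[!same])  # all other speeches: 1664.75
```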
Did the inaugural addresses generally get longer over time, or shorter? The easiest way to get a quick first impression of what the answer will be is to visualize. This is really easy in R:
Plotting individual points, as plot() does by default, is maybe not the best way to visualize this particular data, though. A single line would be a more typical (and more readable) way to visualize changing lengths of speeches. This, too, is easy: Just tell R to plot the data with type "l" for "line":
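Both variants, sketched on six illustrative rows with the year on the x-axis and the speech length on the y-axis:

```r
inaug = data.frame(
  year   = c(1789, 1793, 1797, 1801, 1805, 1809),
  length = c(1431, 135, 2321, 1730, 2166, 1177)  # illustrative values
)

plot(inaug$year, inaug$length)              # default: one point per speech
plot(inaug$year, inaug$length, type = "l")  # the same data as a single line
```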
That is better. Try this with some of the other columns in the data frame: How about the number of times the word "citizen" was used? The word "America"? The word "I"?
(Is this a good way to check whether these words were used more or less over the years? There is one thing that we are not taking into account -- what is it?)
We can also visualize multiple columns in the same plot. But we cannot use the "plot" command both times. "plot()" erases what was on the canvas, and starts from scratch. There are separate commands for adding lines or points to a plot that is already there. To add a line, the command is "lines()":
Now we can't see anything. We better use some color and/or different line types:
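A sketch of two lines in one plot, told apart by color (col) and line type (lty). The data frame counts, its column names, and all values are made up for illustration. The ylim argument is set so that both lines fit on the canvas.

```r
# Hypothetical per-speech counts of two words; the values are made up.
counts = data.frame(
  year    = c(1789, 1793, 1797, 1801, 1805, 1809),
  citizen = c(5, 1, 6, 7, 10, 1),
  america = c(2, 1, 8, 0, 0, 0)
)

# plot() starts a new canvas with the first line.
plot(counts$year, counts$citizen, type = "l", col = "blue", lty = 1,
     ylim = range(counts$citizen, counts$america))
# lines() adds the second line to the existing plot.
lines(counts$year, counts$america, col = "red", lty = 2)
```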
Here are some more ways of plotting: A barplot shows each entry as a bar. A boxplot illustrates the median, the first and third quartile, and outliers:
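Both plot types, sketched with barplot() and boxplot() on six illustrative speech lengths. The names.arg argument labels each bar, here with the year.

```r
inaug = data.frame(
  year   = c(1789, 1793, 1797, 1801, 1805, 1809),
  length = c(1431, 135, 2321, 1730, 2166, 1177)  # illustrative values
)

barplot(inaug$length, names.arg = as.character(inaug$year))  # one bar per speech
boxplot(inaug$length)                                        # distribution of lengths
```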