Getting started with R
After downloading RStudio from here, start it up. You will see an interface that looks approximately like this:
On the left, you will be able to type in R commands. On the lower right, you can get help on all things R, and if you produce a graph, it will be displayed there too. But before you can do that, you need data.
Converting an Excel spreadsheet into a format that R can read
Let's start with some data that I have extracted from a corpus (a text collection) of U.S. inaugural addresses. Often when you produce data, you first have it in a spreadsheet program, such as Excel. So let's first practice getting data out of Excel and into R, if you have Excel. (Otherwise skip this step and go to "Reading data into RStudio".)
You can download the inaugural addresses data as an Excel .xlsx file here. R cannot directly read this file -- the internal Excel format looks like gibberish to it. But you can export the data from Excel in a format called "CSV", or comma-separated values:
In Excel, choose the "File" menu, and from the "File" menu choose "Save As..."
In the panel that comes up, change "Format" to "Comma Separated Values (.csv)"
Reading data into RStudio
If you have skipped the previous step, you can find the inaugural data as a .csv file here.
In RStudio, choose "Import Dataset" on the upper right side of the window, and choose "From Text File":
Choose the .csv file with the inaugural addresses. You now see a panel with many choices that you can make about the encoding (you should be fine leaving everything as is), and a preview on the right-hand side. Hit "Import".
Congratulations: you have imported your first dataset into R. You can now see the table in the upper left panel in RStudio:
Each row describes one speech.
The first column contains row numbers
The second column has president names
The third column contains the year the address was given
The 4th column is the length of the speech in words
All the rest of the columns are word counts. For example, the column headed "America" says how often this speech contained the word "America".
The lower left panel of RStudio is for typing in R commands. Instead of importing the file by clicking on "Import Dataset", you could also have typed a command in the lower left panel to achieve the same effect. And in fact, RStudio is now showing in the lower left panel the command that you could have typed. In my case, it is
> inaugural <- read.csv("~/Teaching/repeating_classes/stats_intro/materials/data/inaugural.csv")
> View(inaugural)
The ">" is not something you would type. It is just the prompt that R gives you to show that you can type in a command. "read.csv", somewhat unsurprisingly, is a command that reads a csv file. The contents of the file have been stored under the name "inaugural" using the left-arrow: "inaugural <- read.csv(...)". The command "View(inaugural)" views the contents of "inaugural" in the upper left panel. You can also view the whole thing in the lower left panel by typing "inaugural" and hitting Return. To see just the first few lines of the table, type
> head(inaugural)
(Again, don't type the ">", it's what R shows you automatically.) This should give you the following output:
X president year length America citizen citizens democracy freedom I me duties
1 1 Washington 1789 1538 0 0 5 0 0 23 8 1
2 2 Washington 1793 147 1 0 1 0 0 6 1 0
3 3 Adams 1797 2585 5 1 5 0 0 13 5 1
4 4 Jefferson 1801 1935 0 0 7 0 4 21 4 2
5 5 Jefferson 1805 2384 0 0 10 0 2 18 8 4
6 6 Madison 1809 1265 0 0 1 0 1 11 8 2
Note: In this course, we will try to get by with relatively little R. The commands that we do in this worksheet will be about the most complex you will have to master. R can do much more, and if you are curious for additional R constructions or commands, let me know. But you will not need more for this course.
Inspecting a data frame
inaugural is a table -- R calls this a data frame. A data frame consists of cells. For example, the cell at row 1, column 2 contains the word "Washington". You can get this information from R as follows:
> inaugural[1,2]
[1] Washington
34 Levels: Adams Buchanan Bush Carter Cleveland Clinton Coolidge Eisenhower Garfield Grant Harding ... Wilson
"inaugural[ 1, 2]" says that you want to access row 1, column 2 of the data frame stored in inaugural. You have to use square brackets for this. When you type inaugural[1,2], the first piece of data that R gives you, labeled [1], is "Washington". (In this case it is also the only piece of data. But when you type in a query that lets R give you tens or hundreds of pieces of data as answer, you may appreciate the numbering.) The data that it outputs is "Washington", the entry at row 1 column 2.
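Additionally, it tells you that there are 34 different values for column 2, ranging from "Adams" to "Wilson". (You can ignore that information for now. In R version 4.0 and later, read.csv reads text columns as plain strings by default, so you may see just [1] "Washington" without the "Levels" line.)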
Try it yourself:
What is the name of the president who gave the 3rd inaugural address? (For now, look in the upper left panel to find out what the column number is for the "president" column. We will discuss a simpler way of doing this later.)
What was the year in which the 10th inaugural address was given?
In the 20th inaugural address, how often was the word "democracy" mentioned?
Helpful hint: If you hit the "uparrow" in the R panel (the lower left panel of RStudio), you see the previous command you entered, and you can edit it. When you are done, hit "Return" to execute it.
If you want to see a whole row, just give the number of the row and nothing for the column (but don't forget the comma). For example, to bring up the whole entry for the 5th inaugural address you write inaugural[5, ]: row 5, then a comma, then nothing for the column.
> inaugural[5,]
X president year length America citizen citizens democracy freedom I me duties
5 5 Jefferson 1805 2384 0 0 10 0 2 18 8 4
The same way, if you want to see all entries for a column, just give the number of the column but no number for the row. So to see all years in which inaugural addresses were given, you put inaugural[, 3]: nothing for the row, then a comma, then column 3.
> inaugural[,3]
[1] 1789 1793 1797 1801 1805 1809 1813 1817 1821 1825 1829 1833 1837 1841 1845 1849 1853 1857 1861 1865 1869 1873
[23] 1877 1881 1885 1889 1893 1897 1901 1905 1909 1913 1917 1921 1925 1929 1933 1937 1941 1945 1949 1953 1957 1961
[45] 1965 1969 1973 1977 1981 1985 1989 1993 1997 2001 2005 2009
Incidentally, this is the first time that the numbers in brackets at the beginning of the output lines become useful: You can see from the output above that the 1st address, number [1], was in 1789. The 23rd, number [23], was in 1877, and the 45th was in 1965.
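If you would like to experiment with these three kinds of access without the full data set, here is a small sketch. It types in the first six rows from the head( ) output above by hand. The name inaug6 is made up for this example, and because only three of the columns are included, the column numbers differ from those in the full table.

```r
# First six rows of the inaugural table, typed in by hand.
# inaug6 is a made-up name; only three columns are included.
inaug6 <- data.frame(
  president = c("Washington", "Washington", "Adams",
                "Jefferson", "Jefferson", "Madison"),
  year      = c(1789, 1793, 1797, 1801, 1805, 1809),
  length    = c(1538, 147, 2585, 1935, 2384, 1265)
)

inaug6[1, 1]   # one cell: row 1, column 1 (the president in this small table)
inaug6[5, ]    # a whole row: row 5, nothing for the column
inaug6[, 2]    # a whole column: nothing for the row, column 2 (the years)
```

The pattern is always [ row , column ]; leaving one side empty means "all of them".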
Try it yourself:
Bring up the row for the 100th inaugural address
Show how often the word "citizen" was used in each inaugural address
Column names
Suppose you are interested in the lengths of all inaugural addresses. Then it is a bit cumbersome that you first have to look at the table and count columns to find out that the column of interest is column 4.
And in fact, there is a simpler way: Just ask for the column by name. Just type the name of the whole data frame, inaugural, then a $, then length. Here is the same output twice, produced first using the column number and then using the column name:
> inaugural[,4]
[1] 1538 147 2585 1935 2384 1265 1304 3693 4909 3150 1208 1267 4171 9165 5196 1182 3657 3098 4005 785 1239 1478
[23] 2724 3239 1828 4750 2153 4371 2450 1091 5846 1905 1656 3756 4442 3890 2063 2019 1536 637 2528 2775 1917 1546
[45] 1715 2425 2028 1380 2801 2946 2713 1855 2462 1825 2376 2726
> inaugural$length
[1] 1538 147 2585 1935 2384 1265 1304 3693 4909 3150 1208 1267 4171 9165 5196 1182 3657 3098 4005 785 1239 1478
[23] 2724 3239 1828 4750 2153 4371 2450 1091 5846 1905 1656 3756 4442 3890 2063 2019 1536 637 2528 2775 1917 1546
[45] 1715 2425 2028 1380 2801 2946 2713 1855 2462 1825 2376 2726
In general, you first put the name of the data frame, then a $ to separate the data frame name and the column name, and then the column name.
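Here is a minimal sketch of the same idea on a small hand-typed table (inaug6 is a made-up name; the values are the first six rows of the inaugural data). It also uses names( ), a command that this worksheet does not otherwise need, which lists all column names so that you do not have to look them up in the viewer.

```r
# Small stand-in for the inaugural table (made-up name, three columns only).
inaug6 <- data.frame(
  president = c("Washington", "Washington", "Adams",
                "Jefferson", "Jefferson", "Madison"),
  year      = c(1789, 1793, 1797, 1801, 1805, 1809),
  length    = c(1538, 147, 2585, 1935, 2384, 1265)
)

names(inaug6)          # lists the column names
inaug6[, 3]            # the length column, selected by number
inaug6$length          # the same column, selected by name with $
inaug6[, "length"]     # the same column again, name inside the brackets
```

All three selections return the same sequence of numbers, so you can use whichever you find easiest to read.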
Try it yourself:
Show how often the word "citizen" was used in each inaugural address, using the column name.
List all the presidents (with duplicates if they gave two inaugural addresses) that appear in our data frame
Plots
Now that we can select columns from a data frame, we can also plot them. Here is a scatter plot for how often the word "duties" was used through the years:
> plot(inaugural$duties)
A scatter plot has one point in the graph for each data point. When you issue a plot( ) command in the lower left-hand panel of RStudio, the plot shows up in the lower right-hand panel.
What do you observe? Do you see any trends in the use of the word "duties"? Now try plotting the uses of the word "freedom" instead. What do you see?
> plot(inaugural$freedom)
Variants:
Try
plot(inaugural$duties, type="l")
(that's a lowercase ell). What is different?
Other possible types include p, b, and h (this is not a complete list; there are more). What does each option do? And which option do you find most appropriate for the data?
Try
plot(inaugural$duties, type="p", col="red")
What other colors work? If you want to see a complete list, try
> colors()
The plot( ) command has many parameters. To see more information, type
?plot
in the R panel, or choose "Help" in the lower right-hand panel and type "plot" into the search window.
Two graphs in one frame
So far you have only plotted one graph per frame. Whenever you issued the next plot( ) command, the previous frame was erased. Here is how you can put more than one graph into a single frame:
> plot(inaugural$freedom, type="l", col="red")
> lines(inaugural$duties, col="blue")
The command plot( ) cleans the frame before plotting. The command "lines" (which by default uses the type "l") does not.
What do you observe when you compare "duties" and "freedom"?
One thing that you may observe is that both lines tend to go up and down in sync. Why is that? (leaving some space free so you can think before you go on reading)
Confounders and relative frequency
It is because we are plotting the absolute numbers of occurrences (the counts) for "duties" and "freedom". If the overall speech is longer, then there will in general be both more "duties" and more "freedom". So what we have been plotting is not really a good indicator of whether presidents tend to talk more about duties or freedom: We have the length of the overall speech as a confounder. That is, our graph shows a mixture of (at least) two effects, the tendency to talk about duties versus freedom, and the length of the speech. We need to separate the two to see what is really going on.
One way to separate the two effects is by graphing the relative frequency of "duties" instead of the absolute: What percentage of the words is "duties"? Is 0.1% of all words "duties"? 0.4%?
Let's look at an example, the very first speech:
> inaugural[1,]
X president year length America citizen citizens democracy freedom I me duties
1 1 Washington 1789 1538 0 0 5 0 0 23 8 1
>
We have one occurrence of "duties" per 1538 words of length, or 1/1538. Let's use R as a calculator to determine the percentage.
> 1/1538
[1] 0.0006501951
That is about 0.065%.
If we want to determine the relative frequency of "duties" across all speeches, we can do this relatively easily:
> inaugural$duties / inaugural$length
[1] 0.0006501951 0.0000000000 0.0003868472 0.0010335917 0.0016778523 0.0015810277
[7] 0.0007668712 0.0016246954 0.0008148299 0.0019047619 0.0024834437 0.0015785320
[13] 0.0002397507 0.0005455537 0.0017321016 0.0042301184 0.0008203445 0.0006455778
[19] 0.0002496879 0.0000000000 0.0008071025 0.0000000000 0.0011013216 0.0003087373
[25] 0.0010940919 0.0010526316 0.0004644682 0.0004575612 0.0016326531 0.0018331806
[31] 0.0008552857 0.0000000000 0.0000000000 0.0002662407 0.0002251238 0.0000000000
[37] 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0003955696 0.0000000000
[43] 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0007246377
[49] 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
[55] 0.0008417508 0.0007336757
We simply divide the "duties" column by the "length" column.
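To see what this division does, here is a small sketch using only the first six speeches, with the counts typed in by hand from the head( ) output. The names duties, len and rel are made up for this example. R divides the first entry by the first entry, the second by the second, and so on.

```r
# Counts of "duties" and speech lengths for the first six addresses,
# copied by hand from the head() output.
duties <- c(1, 0, 1, 2, 4, 2)
len    <- c(1538, 147, 2585, 1935, 2384, 1265)

rel <- duties / len   # element by element: 1/1538, 0/147, 1/2585, ...
rel

rel[1]                # the same number as typing 1/1538
100 * rel             # the same values expressed as percentages
```

Both vectors need to have the same number of entries for this to give you one relative frequency per speech.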
Now we can plot relative frequencies. Here it is for "duties":
> plot(inaugural$duties / inaugural$length)
What do you see? Now plot the relative frequencies of "freedom". What do you see?
Now plot both in the same plot, again using "lines". What do you see?
Getting the frame size right
There are 2 options for plotting both the relative frequencies of "duties" and "freedom" in one frame. You can either plot "duties" first:
> plot(inaugural$duties / inaugural$length, type = "l")
> lines(inaugural$freedom / inaugural$length, type = "l", col = "red")
Or you can plot "freedom" first:
> plot(inaugural$freedom / inaugural$length, type = "l", col = "red")
> lines(inaugural$duties / inaugural$length, type = "l")
In the second case, you can see both graphs completely. In the first case, parts of "freedom" are cut off. Why could that be? (Again, space left free so you can ponder a bit before reading on.)
When you issue the command plot(inaugural$duties / inaugural$length), R uses the values in inaugural$duties / inaugural$length to decide what range of values to show on the x-axis and on the y-axis. This makes sense: When you are plotting values that lie between 0 and about 0.004 (as is the case when you plot relative frequencies of "duties"), you don't want the y-axis to run from 0 to 100, because then the graph would be a straight line very close to the x-axis, and you wouldn't be able to see anything. But when you then use lines( ) to add relative frequencies that are slightly larger, they lie above the top of the frame and are cut off.
What to do? The simplest solution is to try out different orders of plotting things, and plotting the largest numbers first. There is also a more elegant way to do this that uses additional R commands, but it leads to somewhat more involved commands to type. Ask me if you would like to know about it.
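For the curious, here is one possible way to set the frame size yourself. It may or may not be the one hinted at above. The idea is to tell plot( ) the y-axis limits through its ylim parameter, and to compute those limits with range( ), which returns the smallest and the largest value of everything you give it. This sketch uses only the first six speeches, typed in by hand; the names len, duties, freedom and y.limits are made up for the example.

```r
# Relative frequencies for the first six addresses (hand-typed counts).
len     <- c(1538, 147, 2585, 1935, 2384, 1265)
duties  <- c(1, 0, 1, 2, 4, 2) / len
freedom <- c(0, 0, 0, 4, 2, 1) / len

# Smallest and largest value across BOTH vectors.
y.limits <- range(duties, freedom)

# The frame is now tall enough for both lines, whichever is plotted first.
plot(duties, type = "l", ylim = y.limits)
lines(freedom, col = "red")
```

With ylim set this way, the order of plotting no longer matters, at the price of a slightly longer command.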
Plotting years against counts of "duties"
Another thing that you may have noticed is that our x-axis is not too informative. It just says "Index" and shows consecutive numbers from 0 to 56. That is not to say that it is useless -- it shows the row numbers, or consecutive indices of the speeches. But wouldn't it be nice if the x-axis could show the year of each speech instead?
If we want to do that, then what we need is to plot, say, the count of "duties" against the year.
When you give the plot( ) command one list of numbers, for example by saying plot( inaugural$duties ), you are saying that the values of inaugural$duties are what should be shown on the y-axis. If you don't say anything about the x-axis, the x-axis will be just indices. Now we need to give plot( ) two lists of numbers: the x-values and the y-values, in this order. So we say:
> plot(inaugural$year, inaugural$duties)
This is how you tell R to plot the years against the counts of "duties". Remember that the x-values come first, then the y-values.
More column arithmetic, and extending a data frame by a new column
A number of people have tried to gauge how egocentric different people are (or how egocentric different eras of history were) by counting the number of times people say "I" or "me" or "mine". That is a bit of a dubious argument (see what UT's Jamie Pennebaker says about it), but never mind: Could we do the same with the inaugural speeches?
The inaugural data frame has one column that shows the number of times the word "I" occurs in an inaugural speech. And it has another column that shows the count of the word "me". So we can visualize them separately. (Try it!) Can we also visualize the count for all the I-words together?
That is not difficult. You have seen above that you can divide one data frame column by another to get relative frequency, like this:
> inaugural$duties / inaugural$length
In the same way, you can add two columns of a data frame:
> inaugural$I + inaugural$me
Now say we want to do a number of analyses on I-words. If we know we will need them often, we can just add a new column to the data frame like this:
> inaugural$Iwords = inaugural$I + inaugural$me
Check the data frame in the upper left panel of RStudio: You can see that the table really has a new column now, which contains the sum of the columns for "I" and "me". Now we can visualize the I-words:
> plot(inaugural$Iwords)
That does not look like much of a pattern. But then, we are again looking at absolute frequency rather than relative frequency -- remember the discussion of confounders above? So can you plot the relative frequency of I-words in all inaugural speeches? Is there anything you notice now?
Filtering
Note: The commands in this section are a bit more involved, but this is the most complex R you will have to do in this course.
In the data frame viewer in the upper left panel of RStudio, there is a button labeled "Filter". When you click it, you can choose to filter the data frame rows. For example, click on the field labeled "All " in the "length" column. You can now slide the slider to decide which rows to display, for example only those speeches that have a length of 2000 words or more. Try it: How many rows are left?
This is nice, but it is confined to the viewer in the upper left panel: You cannot directly use this filtered view in the R code panel (as far as I know). But there are R commands that let you do the same kinds of filters.
Remember the command you used to choose a particular row:
> inaugural[ 1, ]
This chooses the first row. (Note: I put more spaces inside the square brackets to make things more readable. You can put additional spaces or not, it makes no difference.) So within the square brackets, there are two entries, separated by a comma: [ row_description , column_description ]. And if you don't put anything for column_description, you select all columns of the row.
So far, the only row description we have used is a row index, like 1. Instead, you can put a filtering condition. For example, you could be interested in all rows where
inaugural$length > 2000
You can put that directly as a row description:
> inaugural[ inaugural$length > 2000 , ]
Here are some more filtering conditions that we can use:
Speeches of 4000 words or more: (note the >= )
inaugural[ inaugural$length >= 4000,]
Speeches of less than 1000 words:
inaugural[ inaugural$length < 1000, ]
Speeches that have the word "freedom" exactly once: note that it's ==, not just =
inaugural[ inaugural$freedom == 1, ]
Speeches that have the word "freedom" two times or less, and that took place before 1900:
inaugural[ inaugural$freedom <= 2 & inaugural$year < 1900, ]
To string two conditions together with "and", use &. You can also use | (vertical pipe) for "or".
Speeches given by a president Bush: Note that I have to use == again, and that I have to put quotes around a string.
inaugural[ inaugural$president == "Bush", ]
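It can help to look at what a filtering condition produces by itself. The sketch below, on a small hand-typed table with the first six speeches (inaug6 and is.long are made-up names), shows that a condition is just a sequence of TRUE and FALSE values, one per row. Putting it in the row position keeps exactly the rows marked TRUE.

```r
# Small stand-in for the inaugural table (made-up name, three columns only).
inaug6 <- data.frame(
  president = c("Washington", "Washington", "Adams",
                "Jefferson", "Jefferson", "Madison"),
  year      = c(1789, 1793, 1797, 1801, 1805, 1809),
  length    = c(1538, 147, 2585, 1935, 2384, 1265)
)

is.long <- inaug6$length > 2000   # one TRUE or FALSE per row
is.long

inaug6[is.long, ]                 # keeps only the rows where is.long is TRUE

# Two conditions joined with & must both be TRUE for a row to be kept.
inaug6[inaug6$length > 2000 & inaug6$year < 1800, ]
```

Typing the condition alone like this is also a handy way to check a filter that does not return what you expected.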
Try it yourself: Can you find...
Speeches that occurred after 1950
Speeches that were given by Jackson
Speeches that were less than 500 words and occurred before 1900
Speeches that were either less than 500 words or more than 4000 words (use | for "or")
Speeches that had more occurrences of "freedom" than "duties"
Filtering produces new data frames
Suppose you run a filtering command, for example to find all speeches given after 1960:
inaugural[ inaugural$year > 1960, ]
What that gives you is, in fact, a new data frame. You can give it a name:
inaug.new = inaugural[ inaugural$year > 1960, ]
You can check how many rows it has:
nrow(inaug.new)
You can get a list of all presidents occurring in inaug.new:
inaug.new$president
And since you can do that with inaug.new, you can do the same with inaugural[ inaugural$year > 1960, ]: What are the presidents who have given inaugural speeches after 1960?
inaugural[ inaugural$year > 1960, ]$president
Try it for yourself:
How long was each of the speeches given by a president named Bush?
What presidents gave speeches that were more than 3000 words long?
In what years did "America" appear more often than "citizens"?
Vectors
inaugural is a data frame, a table. inaugural$length is what R calls a vector, a sequence of values. To address a cell in a data frame, you need to know its row and column, so you need two indices. To address a value in a sequence, you only need to know one index. Here is the length of the 1st speech:
inaugural$length[1]
You can use conditions to filter a vector the same way as a data frame:
inaug.length = inaugural$length
inaug.length[ inaug.length > 2000 ]
Here we do not need a comma after the condition, because we only have one index, not two.
To make a vector "by hand", type the sequence of values, with a c() around it. If you have a sequence of strings, remember to put quotes around each of them. Here are two example vectors:
primes = c(2,3,5,7,11,13)
recent.presidents = c("Bush", "Clinton", "Bush", "Obama")
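Here is a short sketch that puts the pieces of this section together on the two example vectors: making a vector with c( ), picking out one value by its index, and filtering with a condition. It also uses length( ), which tells you how many values a vector has.

```r
primes <- c(2, 3, 5, 7, 11, 13)
recent.presidents <- c("Bush", "Clinton", "Bush", "Obama")

primes[1]                    # the first value
length(primes)               # how many values the vector has
primes[primes > 5]           # only the values greater than 5

# Conditions work on strings too; note the == and the quotes.
recent.presidents[recent.presidents == "Bush"]
```

Note that there is no comma inside the square brackets here, since a vector has only one index.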
Selecting multiple rows or multiple columns
You can select more than one row in a data frame using a condition:
inaugural[inaugural$freedom > 5,]
You can also just put multiple values as the row index. The next command selects the first 10 rows, the command after that selects rows 1, 3, and 5:
inaugural[1:10,]
inaugural[c(1,3,5), ]
The same works for columns, but you can also select multiple columns by name. This selects columns 5, 6, and 8, either by number or by name:
inaugural[ , c(5,6,8) ]
inaugural[ , c("America", "citizen" , "democracy") ]
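You can also fill in both the row part and the column part at the same time. The following sketch, on a small hand-typed table with the first six speeches (inaug6, first.three and later are made-up names), selects some rows and some columns in a single command.

```r
# Small stand-in for the inaugural table (made-up name, three columns only).
inaug6 <- data.frame(
  president = c("Washington", "Washington", "Adams",
                "Jefferson", "Jefferson", "Madison"),
  year      = c(1789, 1793, 1797, 1801, 1805, 1809),
  length    = c(1538, 147, 2585, 1935, 2384, 1265)
)

# Rows 1 to 3, and only the president and year columns.
first.three <- inaug6[1:3, c("president", "year")]
first.three

# A condition for the rows, names for the columns.
later <- inaug6[inaug6$year > 1800, c("president", "length")]
later
```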
Sorting and ordering
You can sort a vector using the command sort( ). Here is how you sort speech lengths from the shortest to the longest:
inaug.length = inaugural$length
sort(inaug.length)
You can use this sorted vector, for example, to determine the 10 shortest speeches:
sort(inaug.length)[ 1:10]
For data frames, you usually want to do something slightly different. For example, you might not want to sort just the lengths of the inaugural speeches, but sort all the rows in the order of their speech lengths. The command for this is order( ), and you put it where you would put a filtering constraint. The following command says "I want the inaugural table, in the order of the speech lengths":
inaugural[ order( inaugural$length) , ]
Both sort( ) and order( ) can optionally be given another parameter, "decreasing = TRUE", which makes them sort from largest to smallest instead of smallest to largest.
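The difference between sort( ) and order( ) is easiest to see on a short vector. In this sketch, len holds the lengths of the first six speeches, typed in by hand (the name is made up). sort( ) returns the values themselves, while order( ) returns the positions at which those values are found, which is why it can be used as a row index.

```r
len <- c(1538, 147, 2585, 1935, 2384, 1265)

sort(len)         # the values, smallest first
order(len)        # the positions of those values: 2 6 1 4 5 3
len[order(len)]   # using the positions as an index gives the same as sort(len)
```

So order(len) says: the smallest value is at position 2, the next smallest at position 6, and so on. Put into the row part of the square brackets, these positions list the rows of a data frame in sorted order.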
Try it for yourself:
Sort the speech lengths (by themselves, not the whole data frame) from largest to smallest, and determine the lengths of the 3 longest speeches.
Sort the whole inaugural data frame by lengths of speeches, longest to shortest.
Sort the whole inaugural data frame by how often each speech mentions "democracy"
Merging data frames
Our inaugural data frame does not contain the party affiliation of the presidents. If we had that, we could do a number of additional analyses, for speech length or the degree to which presidents use terms like "freedom" or "democracy". This is a good opportunity to demonstrate how to merge data frames in R.
For convenience, we will use a smaller amount of data. Above we made a "recent inaugural addresses" data frame inaug.new using
inaug.new = inaugural[ inaugural$year > 1960, ]
It contains 13 entries. We would like to make a new data frame that maps inaugural address years to the affiliation of the president. But how do you make a new data frame by hand (rather than reading one in from file)?
Here is how: You use the function data.frame( ). Inside the parentheses, you specify each column of the data frame. For example, to make a data frame with one column called "A" and one called "B", you could type:
my.example = data.frame(A = c(1,2,3), B = c(20, 30, 40))
So the column A contains the sequence of values 1,2,3 (in that order), and the column B contains the sequence 20, 30, 40.
Now we are ready for the data frame mapping inaugural address years to the affiliation of each president:
affiliation = data.frame(year = inaug.new$year, party = c("D", "D", "R", "R", "D", "R", "R", "R", "D", "D", "R", "R", "D"))
The "year" column in this data frame is a sequence of numbers that are the same as in inaug.new, and the "party" column is a sequence of either "D" or "R".
Now we want to merge the two data frames by their common column "year". And that is exactly what we say:
merge(inaug.new, affiliation, by = "year")
This produces a new data frame, and we can give it a name:
inaug.new.withparty = merge(inaug.new, affiliation, by = "year")
What merge( ) has done is to link every row in inaug.new to the matching row in affiliation.
We could have saved ourselves some work if we had defined party affiliations for presidents, not years, as some presidents appear multiple times in our data frame. (And we even have two different presidents with the same name and same affiliation in our data.) So we could also say:
affiliation = data.frame(president = c("Kennedy", "Johnson", "Nixon", "Carter", "Reagan", "Bush", "Clinton", "Obama"), party = c("D", "D", "R", "D", "R", "R", "D", "D"))
But now affiliation is a data frame with 8 rows, and inaug.new has 13 rows. What will merge( ) do?
merge(inaug.new, affiliation, by = "president")
It actually does the right thing: It links each row in inaug.new to the single matching row in affiliation. After all, there is only one affiliation row with the right president name that each row in inaug.new could link to.
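If you want to watch this behavior on a table small enough to check by eye, here is a sketch with made-up data frames called speeches and parties. Johnson is left out of parties on purpose, to show what happens to a row that has no match: by default merge( ) drops it. The parameter all.x = TRUE, which this worksheet does not otherwise use, keeps such rows and fills in NA ("not available") for the missing value.

```r
# Four speeches, two of them by the same president.
speeches <- data.frame(
  president = c("Kennedy", "Johnson", "Nixon", "Nixon"),
  year      = c(1961, 1965, 1969, 1973)
)

# One row per president; Johnson is deliberately missing.
parties <- data.frame(
  president = c("Kennedy", "Nixon"),
  party     = c("D", "R")
)

# Both Nixon rows are linked to the single Nixon row in parties.
# The Johnson row has no match and is dropped.
merged <- merge(speeches, parties, by = "president")
merged

# all.x = TRUE keeps every row of the first data frame; Johnson gets NA.
merged.all <- merge(speeches, parties, by = "president", all.x = TRUE)
merged.all
```

It is worth comparing nrow( ) before and after a merge: if the merged data frame has fewer rows than you started with, some names probably did not match.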
Reading and writing data in the console
You can always read data using the "Import Dataset" button in RStudio, and that is probably the most convenient route. But you can also read and write data using commands in the console.
read.csv() reads CSV files, and write.csv() writes data to a file in CSV format.
read.table() reads files that contain a table with whitespace between values, and write.table() writes data to a file with whitespace between values.
Watch out: When you use read.table(), and the table has header information in the file, you have to add the parameter header=TRUE (header=T for short), or your data will be mangled.
After reading in new data, say to a data frame named my.df, I recommend always visualizing the first few lines of the data frame to make sure it has been read in without any trouble:
head(my.df)
The biggest issue with using read.csv() and read.table() is usually figuring out the locations of the files.
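Here is a small sketch of writing a data frame to a CSV file and reading it back, along with two commands that can help with locating files. The file is put in a temporary location chosen by tempfile( ), so the path is not a real folder from this course; the names my.df, csv.file and my.df.again are made up for the example.

```r
my.df <- data.frame(A = c(1, 2, 3), B = c(20, 30, 40))

# A temporary file name ending in .csv (the location is picked by R).
csv.file <- tempfile(fileext = ".csv")

# row.names = FALSE leaves out the row numbers; without it, write.csv()
# adds them as an extra first column.
write.csv(my.df, csv.file, row.names = FALSE)

file.exists(csv.file)    # TRUE if R can find a file under that name
getwd()                  # the folder where R looks for relative file names

my.df.again <- read.csv(csv.file)
head(my.df.again)        # check that the data came back as expected
```

If read.csv() complains that it cannot open a file, file.exists( ) with the same file name is a quick way to find out whether the path is the problem. When a file is written without row.names = FALSE, read.csv() names the extra first column X; that may be where the X column in the inaugural data comes from, though that is only a guess.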