Try it yourself:
In general, you first put the name of the data frame, then a $ to separate the data frame name and the column name, and then the column name.
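Here is a minimal sketch of the $ notation. It uses a tiny stand-in data frame with made-up values rather than the real inaugural data, and the column names (year, president, length) are assumptions for illustration.

```r
# Tiny stand-in for the inaugural data frame (made-up values, assumed column names)
inaugural <- data.frame(
  year      = c(1789, 1793, 1797),
  president = c("Washington", "Washington", "Adams"),
  length    = c(1538, 147, 2585)
)

# Data frame name, then $, then column name
inaugural$length
inaugural$president
```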
Try it yourself:
Additionally, it tells you that there are 34 different values for column 2, ranging from "Adams" to "Wilson". (You can ignore that information for now.)
After downloading RStudio from here, start it up. You will see an interface that looks approximately like this: On the left, you will be able to type in R commands. On the lower right, you can get help on all things R, and if you produce a graph, it will be displayed there too. But before you can do that, you need data.
You can download the inaugural addresses data as an Excel .xlsx file here. R cannot directly read this file -- the internal Excel format looks like gibberish to it. But you can export the data from Excel in a format called "CSV", or comma-separated values:
If you have skipped the previous step, you can find the inaugural data as a .csv file here.
In RStudio, choose "Import Dataset" on the upper right side of the window, and choose "From Text File":
Congratulations: you have imported your first dataset into R. You can now see the table in the upper left panel in RStudio:
The lower left panel of RStudio is for typing in R commands. Instead of importing the file by clicking on "Import Dataset", you could also have typed a command in the lower left panel to achieve the same effect. And in fact, RStudio is now showing in the lower left panel the command that you could have typed. In my case, it is
The ">" is not something you would type. It is just the prompt that R gives you to show that you can type in a command. "read.csv", somewhat unsurprisingly, is a command that reads a csv file. The contents of the file have been stored under the name "inaugural" using the left-arrow: "inaugural <- read.csv(...)". The command "View(inaugural)" views the contents of "inaugural" in the upper left panel. You can also view the whole thing in the lower left panel by typing "inaugural" and hitting Return. To see just the first few lines of the table, type
(Again, don't type the ">", it's what R shows you automatically.) This should give you the following output:
Note: In this course, we will try to get by with relatively little R. The commands that we do in this worksheet will be about the most complex you will have to master. R can do much more, and if you are curious for additional R constructions or commands, let me know. But you will not need more for this course.
"inaugural[ 1, 2]" says that you want to access row 1, column 2 of the data frame stored in
Helpful hint: If you hit the "uparrow" in the R panel (the lower left panel of RStudio), you see the previous command you entered, and you can edit it. When you are done, hit "Return" to execute it.
If you want to see a whole row, just give the number of the row and nothing for the column (but don't forget the comma). For example, to bring up the whole entry for the 5th inaugural address you write
The same way, if you want to see all entries for a column, just give the number of the column but no number for the row. So to see all years in which inaugural addresses were given, you put
Incidentally, this is the first time that the numbers in brackets at the beginning of the output lines become useful: You can see from the output above that the 1st address, number [1], was in 1789. The 23rd, number [23], was in 1877, and the 45th was in 1965.
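The three ways of indexing described above (one cell, a whole row, a whole column) can be sketched as follows. The data frame here is a small stand-in: the speech lengths are made up, and the column names and positions are assumptions, so the column numbers will not match the real table.

```r
# Small stand-in data frame (lengths are made-up values, column names are assumed)
inaugural <- data.frame(
  year      = c(1789, 1793, 1797, 1801, 1805),
  president = c("Washington", "Washington", "Adams", "Jefferson", "Jefferson"),
  length    = c(1500, 150, 2500, 1900, 2400)
)

inaugural[1, 2]   # one cell: row 1, column 2
inaugural[5, ]    # whole row: row 5, all columns (note the comma)
inaugural[, 1]    # whole column: all rows, column 1
```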
Try it yourself:
Suppose you are interested in the lengths of all inaugural addresses. Then it is a bit cumbersome that you first have to look at the table and count columns to find out that the column of interest is column 4.
And in fact, there is a simpler way: Just ask for the column by name. Just type the name of the whole data frame,
A scatter plot has one point in the graph for each data point. When you issue a
What do you observe? Do you see any trends in the use of the word "duties"? Now try plotting the uses of the word "freedom" instead. What do you see?
So far you have only plotted one graph per frame. Whenever you issued the next plot( ) command, the previous frame was erased. Here is how you can put more than one graph into a single frame:
The command plot( ) cleans the frame before plotting. The command "lines" (which by default uses the type "l") does not.
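As a rough illustration of the difference between the two commands, here is a sketch with two short made-up vectors of counts (they are not taken from the inaugural data). The color argument is an optional extra to tell the two lines apart.

```r
# Made-up counts for two words across five speeches
duties  <- c(1, 3, 2, 5, 4)
freedom <- c(0, 2, 4, 3, 5)

plot(duties, type = "l")      # plot() starts a fresh frame
lines(freedom, col = "red")   # lines() draws into the frame that is already there
```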
What do you observe when you compare "duties" and "freedom"?
One thing that you may observe is that both lines tend to go up and down in sync. Why is that? (leaving some space free so you can think before you go on reading)
It is because we are plotting the absolute numbers of occurrences (the counts) for "duties" and "freedom". If the overall speech is longer, then there will in general be both more "duties" and more "freedom". So what we have been plotting is not really a good indicator of whether presidents tend to talk more about duties or freedom: We have the length of the overall speech as a confounder. That is, our graph shows a mixture of (at least) two effects, the tendency to talk about duties versus freedom, and the length of the speech. We need to separate the two to see what is really going on.
One way to separate the two effects is by graphing the relative frequency of "duties" instead of the absolute: What percentage of the words is "duties"? Is 0.1% of all words "duties"? 0.4%?
We have one occurrence of "duties" per 1538 words of length, or 1/1538. Let's use R as a calculator to determine the percentage.
If we want to determine the relative frequency of "duties" across all speeches, we can do this relatively easily:
We simply divide the "duties" column by the "length" column.
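Both calculations can be sketched like this. The first line uses R as a calculator for the single speech mentioned above; the rest divides one column by another in a stand-in data frame whose values are made up for illustration. The name rel.duties is a hypothetical choice.

```r
# One occurrence in 1538 words, as a percentage (roughly 0.065)
1 / 1538 * 100

# Stand-in data frame with made-up counts and lengths
inaugural <- data.frame(
  duties = c(1, 0, 4),
  length = c(1538, 147, 2585)
)

# Dividing two columns works element by element: one result per speech
rel.duties <- inaugural$duties / inaugural$length
rel.duties * 100   # the same values as percentages
```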
Now we can plot relative frequencies. Here it is for "duties":
What do you see? Now plot the relative frequencies of "freedom". What do you see?
Now plot both in the same plot, again using "lines". What do you see?
There are 2 options for plotting both the relative frequencies of "duties" and "freedom" in one frame. You can either plot "duties" first:
Or you can plot "freedom" first:
In the second case, you can see both graphs completely. In the first case, parts of "freedom" are cut off. Why could that be? (Again, space left free so you can ponder a bit before reading on.)
When you issue the command
What to do? The simplest solution is to try out different orders of plotting things, and plotting the largest numbers first. There is also a more elegant way to do this that uses additional R commands, but it leads to somewhat more involved commands to type. Ask me if you would like to know about it.
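For the curious, one common way to avoid the cut-off, which may or may not be the approach meant above, is to tell plot( ) the y-axis range up front with the ylim parameter. The sketch below uses made-up relative frequencies, and the names rel.duties, rel.freedom, and y.range are hypothetical.

```r
# Made-up relative frequencies for two words
rel.duties  <- c(0.0007, 0.0010, 0.0004, 0.0002)
rel.freedom <- c(0.0005, 0.0020, 0.0030, 0.0025)

# range() returns the smallest and largest value across both vectors
y.range <- range(rel.duties, rel.freedom)

# With ylim set, the frame is tall enough for both lines, whichever is plotted first
plot(rel.duties, type = "l", ylim = y.range)
lines(rel.freedom, col = "red")
```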
Another thing that you may have noticed is that our x-axis is not too informative. It just says "Index" and shows consecutive numbers from 0 to 56. That is not to say that it is useless -- it shows the row numbers, or consecutive indices of the speeches. But wouldn't it be nice if the x-axis could show the year of each speech instead?
If we want to do that, then what we need is to plot, say, the count of "duties" against the year.
When you give the
This is how you tell R to plot the counts of "duties" against the years. Remember that the x-values come first, then the y-values.
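A minimal sketch of a plot with explicit x-values and y-values, using a stand-in data frame with made-up counts and assumed column names:

```r
# Stand-in data frame (made-up counts, assumed column names)
inaugural <- data.frame(
  year   = c(1789, 1793, 1797, 1801),
  duties = c(1, 0, 4, 2)
)

# First argument: x-values (years). Second argument: y-values (counts).
plot(inaugural$year, inaugural$duties, type = "l")
```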
A number of people have tried to gauge how egocentric different people are (or how egocentric different eras of history were) by counting the number of times people say "I" or "me" or "mine". That is a bit of a dubious argument (see what UT's Jamie Pennebaker says about it), but never mind: Could we do the same with the inaugural speeches?
The inaugural data frame has one column that shows the number of times the word "I" occurs in an inaugural speech. And it has another column that shows the count of the word "me". So we can visualize them separately. (Try it!) Can we also visualize the count for all the I-words together?
That is not difficult. You have seen above that you can divide one data frame column by another to get relative frequency, like this:
In the same way, you can add two columns of a data frame:
Now say we want to do a number of analyses on I-words. If we know we will need them often, we can just add a new column to the data frame like this:
Check the data frame in the upper left panel of RStudio: You can see that the table really has a new column now, which contains the sum of the columns for "I" and "me". Now we can visualize the I-words:
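The steps above (adding two columns, storing the sum as a new column, and plotting it) might look like the following sketch. The counts are made up, and the new column name Iwords is a hypothetical choice; any name that is not yet in use would do.

```r
# Stand-in data frame with made-up counts of "I" and "me"
inaugural <- data.frame(
  I  = c(20, 6, 13),
  me = c(8, 1, 3)
)

# Adding two columns gives one sum per row
inaugural$I + inaugural$me

# Assigning to a column name that does not exist yet creates that column
inaugural$Iwords <- inaugural$I + inaugural$me

plot(inaugural$Iwords, type = "l")
```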
That does not look like much of a pattern. But then, we are again looking at absolute frequency rather than relative frequency -- remember the discussion of confounders above? So can you plot the relative frequency of I-words in all inaugural speeches? Is there anything you notice now?
Note: The commands in this section are a bit more involved, but this is the most complex R you will have to do in this course.
In the data frame viewer in the upper left panel of RStudio, there is a button labeled "Filter". When you click it, you can choose to filter the data frame rows. For example, click on the field labeled "All" in the "length" column. You can now slide the slider to decide which rows to display, for example only those speeches that have a length of 2000 words or more. Try it: How many rows are left?
This is nice, but it is confined to the viewer in the upper left panel: You cannot directly use this filtered view in the R code panel (as far as I know). But there are R commands that let you do the same kinds of filters.
Note: I put more spaces inside the straight brackets to make things more readable. You can put additional spaces or not, it makes no difference.
chooses the first row. So within the straight brackets, there are two entries: [ row_description , column_description ]. And if you don't put anything for column_description, you select all columns of the row.
So far, the only row description we have used is a row index, like 1. Instead, you can put a filtering condition. For example, you could be interested in all rows where
inaugural$length > 2000
You can put that directly as a row description:
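Here is a small sketch of a filtering condition used as a row description. The data frame is a stand-in with made-up lengths and assumed column names.

```r
# Stand-in data frame (made-up lengths, assumed column names)
inaugural <- data.frame(
  year   = c(1789, 1793, 1797, 1801),
  length = c(1538, 147, 2585, 2321)
)

# The condition on its own yields one TRUE or FALSE per row
inaugural$length > 2000

# Used as the row description, it keeps only the rows where it is TRUE
inaugural[ inaugural$length > 2000, ]
```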
Here are some more filtering conditions that we can use:
Try it yourself: Can you find...
Suppose you run a filtering command, for example to find all speeches given after 1960:
What that gives you is, in fact, a new data frame. You can give it a name:
You can check its length:
You can get a list of all presidents occurring in inaug.new:
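A sketch of these three steps on a small stand-in data frame follows. The column name president is an assumption, and nrow( ) and unique( ) are just one way to count rows and list distinct values; the commands used in the course may differ.

```r
# Stand-in data frame (assumed column names)
inaugural <- data.frame(
  year      = c(1953, 1957, 1961, 1965, 1969),
  president = c("Eisenhower", "Eisenhower", "Kennedy", "Johnson", "Nixon")
)

# The filtered result is itself a data frame, so it can get its own name
inaug.new <- inaugural[ inaugural$year > 1960, ]

nrow(inaug.new)              # number of rows that passed the filter
unique(inaug.new$president)  # each president listed once
```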
And since you can do that with
Try it for yourself:
You can use conditions to filter a vector the same way as a data frame;
Here we do not need a comma after the condition, because we only have one index, not two.
To make a vector "by hand", type the sequence of values, with a c() around it. If you have a sequence of strings, remember to put quotes around each of them. Here are two example vectors:
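As an illustration, here are two hand-made vectors and a condition applied to one of them. The vector names and values are made up for this sketch.

```r
# A vector of numbers and a vector of strings (strings need quotes)
some.numbers <- c(5, 12, 7, 30)
some.names   <- c("Adams", "Lincoln", "Wilson")

# Filtering a vector: only one index, so no comma after the condition
some.numbers[ some.numbers > 6 ]
```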
You can select more than one row in a data frame using a condition:
inaugural[inaugural$freedom > 5,]
You can also just put multiple values as the row index. The next command selects the first 10 rows, the command after that selects rows 1, 3, and 5:
The same works for columns, but you can also select multiple columns by name. This selects columns 5, 6, and 8, either by number or by name:
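The selections described above can be sketched on a toy data frame. It has only three columns with hypothetical names a, b, and c, so the column numbers here differ from those in the inaugural table.

```r
# Toy data frame with 12 rows and three columns (hypothetical names)
toy <- data.frame(a = 1:12, b = 101:112, c = 201:212)

toy[1:10, ]         # rows 1 through 10
toy[c(1, 3, 5), ]   # rows 1, 3, and 5

toy[, c(1, 3)]      # columns 1 and 3, by number
toy[, c("a", "c")]  # the same columns, by name
```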
You can sort a vector using the command sort( ). Here is how you sort speech lengths from the shortest to the longest:
You can use this sorted vector, for example, to determine the 10 shortest speeches:
For data frames, you usually want to do something slightly different. For example, you might not want to sort just the lengths of the inaugural speeches, but sort all the rows in the order of their speech lengths. The command for this is order( ), and you put it where you would put a filtering constraint. The following command says "I want the inaugural table, in the order of the speech lengths":
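Here is a sketch contrasting sort( ) on a vector with order( ) on a data frame, using a stand-in data frame with made-up lengths. With only four rows, it picks the 2 shortest speeches rather than the 10 shortest.

```r
# Stand-in data frame (made-up lengths, assumed column names)
inaugural <- data.frame(
  year   = c(1789, 1793, 1797, 1801),
  length = c(1538, 147, 2585, 2321)
)

sort(inaugural$length)        # just the lengths, shortest first
sort(inaugural$length)[1:2]   # the 2 shortest lengths

# order() goes in the row position: whole rows, arranged by length
inaugural[ order(inaugural$length), ]
```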
Try it for yourself:
Our inaugural data frame does not contain the party affiliation of the presidents. If we had that, we could do a number of additional analyses, for speech length or the degree to which presidents use terms like "freedom" or "democracy". This is a good opportunity to demonstrate how to merge data frames in R.
For convenience, we will use a smaller amount of data. Above we made a "recent inaugural addresses" data frame inaug.new using
It contains 13 entries. We would like to make a new data frame that maps inaugural address years to the affiliation of the president. But how do you make a new data frame by hand (rather than reading one in from file)?
Here is how: You use the function data.frame( ). Inside the parentheses, you specify each column of the data frame. For example, to make a data frame with one column called "A" and one called "B", you could type:
So the column A contains the sequence of values 1,2,3 (in that order), and the column B contains the sequence 20, 30, 40.
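A minimal sketch of such a call, with the columns and values described above. The name my.frame is a hypothetical choice.

```r
# Each column is given as name = vector of values
my.frame <- data.frame(A = c(1, 2, 3), B = c(20, 30, 40))
my.frame
```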
Now we are ready for the data frame mapping inaugural address years to the affiliation of each president:
The "year" column in this data frame is a sequence of numbers that are the same as in inaug.new, and the "party" column is a sequence of either "D" or "R".
Now we want to merge the two data frames by their common column "year". And that is exactly what we say:
This produces a new data frame, and we can give it a name:
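The merge step can be sketched as follows, on a shortened version of the data covering only five years. The speech lengths are made up, and the names party.df and inaug.party are hypothetical.

```r
# Shortened stand-in for inaug.new (made-up lengths)
inaug.new <- data.frame(
  year   = c(1993, 1997, 2001, 2005, 2009),
  length = c(1600, 2200, 1600, 2100, 2400)
)

# Hand-made data frame mapping years to party affiliation
party.df <- data.frame(
  year  = c(1993, 1997, 2001, 2005, 2009),
  party = c("D", "D", "R", "R", "D")
)

# Rows with the same value of "year" are combined into one row
inaug.party <- merge(inaug.new, party.df, by = "year")
inaug.party
```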
What merge( ) has done is to link every row in
We could have saved ourselves some work if we had defined party affiliations for presidents, not years, as some presidents appear multiple times in our data frame. (And we even have two different presidents with the same name and same affiliation in our data.) So we could also say:
It actually does the right thing: It links each row in
You can always read data using the "Import Dataset" button in RStudio, and that is probably the most convenient route. But you can also read and write data using commands in the console.
Watch out: When you use read.table(), and the table has header information in the file, you have to add the parameter header=T, or your data will be mangled.
After reading in new data, say to a data frame named my.df, I recommend always visualizing the first few lines of the data frame to make sure it has been read in without any trouble:
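To make the round trip concrete, here is a sketch that writes a small made-up data frame to a temporary .csv file and reads it back in both ways. The temporary file path stands in for wherever your own file lives, and the names csv.path, from.csv, and from.table are hypothetical.

```r
# Small made-up data frame to write out
my.df <- data.frame(year = c(1789, 1793), length = c(1538, 147))

# A temporary file location; in practice you would give your own path
csv.path <- tempfile(fileext = ".csv")
write.csv(my.df, csv.path, row.names = FALSE)

# read.csv() assumes a header line and commas
from.csv <- read.csv(csv.path)

# read.table() needs to be told about the header and the separator
from.table <- read.table(csv.path, header = TRUE, sep = ",")

# Look at the first few lines to check that nothing was mangled
head(from.csv)
```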
The biggest issue with using read.csv() and read.table() is usually figuring out the locations of the files.