After downloading RStudio from here, start it up. You will see an interface that looks approximately like this:

On the left, you can type in R commands. On the lower right, you can get help on all things R, and if you produce a graph, it will be displayed there too. But before you can do that, you need data.

You can download the inaugural addresses data as an Excel .xlsx file here. R cannot directly read this file -- the internal Excel format looks like gibberish to it. But you can export the data from Excel in a format called "CSV", or comma-separated values:
Reading data into RStudio

If you have skipped the previous step, you can find the inaugural data as a .csv file here. In RStudio, choose "Import Dataset" on the upper right side of the window, and choose "From Text File":

Congratulations: you have imported your first dataset into R. You can now see the table in the upper left panel in RStudio:
The lower left panel of RStudio is for typing in R commands. Instead of importing the file by clicking on "Import Dataset", you could also have typed a command in the lower left panel to achieve the same effect. And in fact, RStudio is now showing in the lower left panel the command that you could have typed. In my case, it is
The ">" is not something you would type. It is just the prompt that R gives you to show that you can type in a command. "read.csv", somewhat unsurprisingly, is a command that reads a csv file. The contents of the file have been stored under the name "inaugural" using the left-arrow: "inaugural <- read.csv(...)". The command "View(inaugural)" shows the contents of "inaugural" in the upper left panel. You can also view the whole thing in the lower left panel by typing "inaugural" and hitting Return. To see just the first few lines of the table, type

    > head(inaugural)

(Again, don't type the ">", it's what R shows you automatically.) This should give you the following output:

      X  president year length America citizen citizens democracy freedom  I me duties
    1 1 Washington 1789   1538       0       0        5         0       0 23  8      1
    2 2 Washington 1793    147       1       0        1         0       0  6  1      0
    3 3      Adams 1797   2585       5       1        5         0       0 13  5      1
    4 4  Jefferson 1801   1935       0       0        7         0       4 21  4      2
    5 5  Jefferson 1805   2384       0       0       10         0       2 18  8      4
    6 6    Madison 1809   1265       0       0        1         0       1 11  8      2

Note: In this course, we will try to get by with relatively little R. The commands in this worksheet are about the most complex you will have to master. R can do much more, and if you are curious about additional R constructions or commands, let me know. But you will not need more for this course.

Inspecting a data frame

inaugural is a table -- R calls this a data frame. A data frame consists of cells. For example, the cell at row 1, column 2 contains the word "Washington". You can get this information from R as follows:

    > inaugural[1,2]
    [1] Washington
    34 Levels: Adams Buchanan Bush Carter Cleveland Clinton Coolidge Eisenhower Garfield Grant Harding ... Wilson

Put the row and column numbers after the name of the data frame, inaugural. You have to use square brackets for this. When you type inaugural[1,2], the first piece of data that R gives you, labeled [1], is "Washington", the entry at row 1, column 2. (In this case it is also the only piece of data. But when you type in a query that makes R give you tens or hundreds of pieces of data as an answer, you may appreciate the numbering.)
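The cell lookup described here can be tried without the full data set. The sketch below is only an illustration: toy is a made-up name, and its values are copied from four of the columns in the head(inaugural) output. It also shows the whole-row and whole-column forms of the same bracket notation.

```r
# Stand-in for the first six rows of the inaugural data; "toy" is a
# hypothetical name used only in this illustration.
toy <- data.frame(
  president = c("Washington", "Washington", "Adams",
                "Jefferson", "Jefferson", "Madison"),
  year      = c(1789, 1793, 1797, 1801, 1805, 1809),
  length    = c(1538, 147, 2585, 1935, 2384, 1265),
  duties    = c(1, 0, 1, 2, 4, 2)
)

cell  <- toy[1, 1]   # row 1, column 1: president of the first speech
row5  <- toy[5, ]    # every column of row 5 (note the comma)
years <- toy[, 2]    # every row of column 2

print(cell)
print(row5)
print(years)
```

Note that the column positions in toy differ from those in inaugural, because toy leaves out most of the columns.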
Helpful hint: If you hit the up-arrow key in the R panel (the lower left panel of RStudio), you see the previous command you entered, and you can edit it. When you are done, hit "Return" to execute it.

If you want to see a whole row, just give the number of the row and nothing for the column (but don't forget the comma). For example, to bring up the whole entry for the 5th inaugural address you write

    > inaugural[5,]
      X president year length America citizen citizens democracy freedom  I me duties
    5 5 Jefferson 1805   2384       0       0       10         0       2 18  8      4

In the same way, if you want to see all entries for a column, just give the number of the column but no number for the row. So to see all years in which inaugural addresses were given, you put
Incidentally, this is the first time that the numbers in brackets at the beginning of the output lines become useful: You can see from the output above that the 1st address, number [1], was in 1789. The 23rd, number [23], was in 1877, and the 45th was in 1965. Try it yourself:
Column names

Suppose you are interested in the lengths of all inaugural addresses. Then it is a bit cumbersome that you first have to look at the table and count columns to find out that the column of interest is column 4. In fact, there is a simpler way: ask for the column by name. Just type the name of the whole data frame,
Plots

Now that we can select columns from a data frame, we can also plot them. Here is a scatter plot for how often the word "duties" was used through the years:

    > plot(inaugural$duties)

A scatter plot has one point in the graph for each data point. When you issue a plot( ) command in the lower left-hand panel of RStudio, the plot shows up in the lower right-hand panel.

What do you observe? Do you see any trends in the use of the word "duties"? Now try plotting the uses of the word "freedom" instead. What do you see?

    > plot(inaugural$freedom)

Variants:
Two graphs in one frame

So far you have only plotted one graph per frame. Whenever you issued the next plot( ) command, the previous frame was erased. Here is how you can put more than one graph into a single frame:
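A minimal sketch of the idea, using invented counts rather than the inaugural data: plot( ) opens the frame and lines( ) adds a second graph to it. The names word.a and word.b are made up for this illustration.

```r
# Invented counts for two words across eight speeches (illustration only).
word.a <- c(1, 0, 3, 2, 6, 4, 2, 1)
word.b <- c(2, 1, 4, 5, 3, 5, 1, 0)

plot(word.a, type = "l")     # starts a fresh frame and draws the first line
lines(word.b, col = "red")   # adds a red line to the same frame
```

With the real data, the two vectors would be columns such as inaugural$duties and inaugural$freedom.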
The command plot( ) cleans the frame before plotting. The command "lines" (which by default uses the type "l") does not. What do you observe when you compare "duties" and "freedom"? One thing that you may observe is that both lines tend to go up and down in sync. Why is that?

(leaving some space free so you can think before you go on reading)

Confounders and relative frequency

It is because we are plotting the absolute numbers of occurrences (the counts) for "duties" and "freedom". If the overall speech is longer, then there will in general be both more "duties" and more "freedom". So what we have been plotting is not really a good indicator of whether presidents tend to talk more about duties or freedom: We have the length of the overall speech as a confounder. That is, our graph shows a mixture of (at least) two effects, the tendency to talk about duties versus freedom, and the length of the speech. We need to separate the two to see what is really going on.

One way to separate the two effects is by graphing the relative frequency of "duties" instead of the absolute: What percentage of the words is "duties"? Is 0.1% of all words "duties"? 0.4%?
In the first speech, we have one occurrence of "duties" per 1538 words of length, or 1/1538. Let's use R as a calculator to determine this fraction.

    > 1/1538
    [1] 0.0006501951

That is about 0.065% of the words. If we want to determine the relative frequency of "duties" across all speeches, we can do this relatively easily:

    > inaugural$duties / inaugural$length
     [1] 0.0006501951 0.0000000000 0.0003868472 0.0010335917 0.0016778523 0.0015810277
     [7] 0.0007668712 0.0016246954 0.0008148299 0.0019047619 0.0024834437 0.0015785320
    [13] 0.0002397507 0.0005455537 0.0017321016 0.0042301184 0.0008203445 0.0006455778
    [19] 0.0002496879 0.0000000000 0.0008071025 0.0000000000 0.0011013216 0.0003087373
    [25] 0.0010940919 0.0010526316 0.0004644682 0.0004575612 0.0016326531 0.0018331806
    [31] 0.0008552857 0.0000000000 0.0000000000 0.0002662407 0.0002251238 0.0000000000
    [37] 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0003955696 0.0000000000
    [43] 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0007246377
    [49] 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
    [55] 0.0008417508 0.0007336757

We simply divide the "duties" column by the "length" column; R does the division separately for each speech. Now we can plot relative frequencies. Here it is for "duties":

    > plot(inaugural$duties / inaugural$length)

What do you see? Now plot the relative frequencies of "freedom". What do you see? Now plot both in the same plot, again using "lines". What do you see?

Getting the frame size right

There are two options for plotting both the relative frequencies of "duties" and "freedom" in one frame. You can either plot "duties" first:
Or you can plot "freedom" first:

    > plot(inaugural$freedom / inaugural$length, type = "l", col = "red")
    > lines(inaugural$duties / inaugural$length, type = "l")

In the second case, you can see both graphs completely. In the first case, parts of "freedom" are cut off. Why could that be?

(Again, space left free so you can ponder a bit before reading on.)

When you issue the command plot(inaugural$duties / inaugural$length), R uses the values in inaugural$duties/inaugural$length to figure out which range of values needs to be shown on the x-axis, and which range on the y-axis. This makes sense: When you are plotting values that are between 0 and 0.1 (as is the case when you plot relative frequencies of "duties"), you don't want the y-axis to show values between 0 and 100, because then the graph would be a straight line very close to the x-axis, and you wouldn't be able to see anything. But when you then use lines( ) to graph relative frequencies that are slightly larger, they end up above the top of the frame.

What to do? The simplest solution is to try out different orders of plotting things, plotting the largest numbers first. There is also a more elegant way to do this that uses additional R commands, but it leads to somewhat more involved commands to type. Ask me if you would like to know about it.

Plotting years against counts of "duties"

Another thing that you may have noticed is that our x-axis is not too informative. It just says "Index" and shows consecutive numbers from 0 to 56. That is not to say that it is useless -- it shows the row numbers, or consecutive indices of the speeches. But wouldn't it be nice if the x-axis could show the year of each speech instead? If we want to do that, then what we need is to plot, say, the count of "duties" against the year. To do this, give plot( ) two columns:

    > plot(inaugural$year, inaugural$duties)

This is how you tell R to plot the years against the counts of "duties". Remember that the x-values come first, then the y-values.
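One possible way to avoid the cut-off graph without reordering the plot commands is to compute a y-axis range that covers both vectors and pass it to plot( ) through its ylim argument. This is not necessarily the approach the author has in mind. The sketch also puts years on the x-axis. All values and the names year, rel.a, rel.b and y.range are invented for the illustration.

```r
# Invented years and relative frequencies for two words (illustration only).
year  <- c(1961, 1965, 1969, 1973, 1977)
rel.a <- c(0.0004, 0.0000, 0.0007, 0.0002, 0.0008)
rel.b <- c(0.0020, 0.0031, 0.0012, 0.0045, 0.0016)

# range() returns the smallest and the largest value across both vectors.
y.range <- range(rel.a, rel.b)

# x-values first, then y-values; ylim makes the frame tall enough for both.
plot(year, rel.a, type = "l", ylim = y.range)
lines(year, rel.b, col = "red")
```

Because the frame is sized from both vectors, it no longer matters which of the two is plotted first.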
A number of people have tried to gauge how egocentric different people are (or how egocentric different eras of history were) by counting the number of times people say "I" or "me" or "mine". That is a bit of a dubious argument (see what UT's Jamie Pennebaker says about it), but never mind: Could we do the same with the inaugural speeches? The inaugural data frame has one column that shows the number of times the word "I" occurs in an inaugural speech. And it has another column that shows the count of the word "me". So we can visualize them separately. (Try it!) Can we also visualize the count for all the I-words together?

That is not difficult. You have seen above that you can divide one data frame column by another to get relative frequency, like this:

    > inaugural$duties / inaugural$length

In the same way, you can add two columns of a data frame:

    > inaugural$I + inaugural$me

Now say we want to do a number of analyses on I-words. If we know we will need them often, we can just add a new column to the data frame like this:

    > inaugural$Iwords = inaugural$I + inaugural$me

Check the data frame in the upper left panel of RStudio: You can see that the table really has a new column now, which contains the sum of the columns for "I" and "me". Now we can visualize the I-words:

    > plot(inaugural$Iwords)

That does not look like much of a pattern. But then, we are again looking at absolute frequency rather than relative frequency -- remember the discussion of confounders above? So can you plot the relative frequency of I-words in all inaugural speeches? Is there anything you notice now?

Filtering

Note: The commands in this section are a bit more involved, but this is the most complex R you will have to do in this course.

In the data frame viewer in the upper left panel of RStudio, there is a button labeled "Filter". When you click it, you can choose to filter the data frame rows. For example, click on the field labeled "All" in the "length" column.
You can now move the slider to decide which rows to display, for example only those speeches that have a length of 2000 words or more. Try it: How many rows are left?

This is nice, but it is confined to the viewer in the upper left panel: You cannot directly use this filtered view in the R code panel (as far as I know). But there are R commands that let you do the same kinds of filters.

    > inaugural[ 1, ]

chooses the first row. So within the square brackets, there are two entries, separated by a comma:

    [ row_description , column_description ]

And if you don't put anything for column_description, you select all columns of the row. So far, the only row description we have used is a row index, like 1. Instead, you can put a filtering condition. For example, you could be interested in all rows where

    inaugural$length > 2000

You can put that directly as a row description:

    > inaugural[ inaugural$length > 2000 , ]

Here are some more filtering conditions that we can use:
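A few conditions of this kind, sketched on a small stand-in data frame. toy is a made-up name; its rows are the first six speeches, with values taken from the head(inaugural) output. The operators ==, & (and) and | (or) are standard R, but whether they match the conditions the author lists is an assumption.

```r
# Stand-in for the first six rows of the inaugural data (illustration only).
toy <- data.frame(
  president = c("Washington", "Washington", "Adams",
                "Jefferson", "Jefferson", "Madison"),
  year      = c(1789, 1793, 1797, 1801, 1805, 1809),
  length    = c(1538, 147, 2585, 1935, 2384, 1265),
  freedom   = c(0, 0, 0, 4, 2, 1)
)

long.speeches <- toy[toy$length > 2000, ]                    # longer than 2000 words
by.jefferson  <- toy[toy$president == "Jefferson", ]         # exact match on a name
long.or.free  <- toy[toy$length > 2000 | toy$freedom > 0, ]  # either condition holds
early.long    <- toy[toy$year < 1800 & toy$length > 1000, ]  # both conditions hold

print(early.long)
```

Note the double equals sign in ==: a single = would be read as an assignment, not a comparison.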
Try it yourself: Can you find...
Filtering produces new data frames

Suppose you run a filtering command, for example to find all speeches given after 1960:

    inaugural[ inaugural$year > 1960, ]

What that gives you is, in fact, a new data frame. You can give it a name:

    inaug.new = inaugural[ inaugural$year > 1960, ]

You can check how many rows it has:

    nrow(inaug.new)

You can get a list of all presidents occurring in inaug.new:

    inaug.new$president

And since you can do that with inaug.new, you can do the same with inaugural[ inaugural$year > 1960, ]: Which presidents have given inaugural speeches after 1960?

    inaugural[ inaugural$year > 1960, ]$president

Try it for yourself:
Vectors

inaugural is a data frame, a table. inaugural$length is what R calls a vector, a sequence of values. To address a cell in a data frame, you need to know its row and column, so you need two indices. To address a value in a sequence, you only need to know one index. Here is the length of the 1st speech:

    inaugural$length[1]

You can use conditions to filter a vector the same way as a data frame:

    inaug.length = inaugural$length
    inaug.length[ inaug.length > 2000 ]

To make a vector "by hand", type the sequence of values, with a c() around it. If you have a sequence of strings, remember to put quotes around each of them. Here are two example vectors:

    primes = c(2,3,5,7,11,13)
    recent.presidents = c("Bush", "Clinton", "Bush", "Obama")

Selecting multiple rows or multiple columns

You can select more than one row in a data frame using a condition:

    inaugural[inaugural$freedom > 5,]

You can also just put multiple values as the row index. The next command selects the first 10 rows, the command after that selects rows 1, 3, and 5:
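Both kinds of row selection can be sketched on a small stand-in data frame. toy and its values are invented for the illustration, and 1:3 stands in for 1:10 because the toy table has only six rows.

```r
# Invented six-row data frame (illustration only).
toy <- data.frame(id = 1:6, count = c(10, 20, 30, 40, 50, 60))

first.three <- toy[1:3, ]         # 1:3 is shorthand for c(1, 2, 3)
odd.rows    <- toy[c(1, 3, 5), ]  # rows 1, 3 and 5, listed with c()

print(first.three)
print(odd.rows)
```

The row index is itself a vector, which is why both the 1:3 shorthand and a c() list work in the same position.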
The same works for columns, but you can also select multiple columns by name. This selects columns 5, 6, and 8, either by number or by name:

    inaugural[ , c(5,6,8) ]
    inaugural[ , c( "America", "citizen" , "democracy") ]

Sorting and ordering

You can sort a vector using the command sort( ). Here is how you sort speech lengths from the shortest to the longest:
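A minimal sketch of sort( ) on the lengths of the first six speeches (values from the head(inaugural) output; the variable names are made up). It also shows order( ), a related base R function that returns positions rather than values, which is what makes it usable as a row index for a data frame.

```r
# Lengths of the first six inaugural speeches, in their original order.
speech.length <- c(1538, 147, 2585, 1935, 2384, 1265)

shortest.first <- sort(speech.length)                     # values, ascending
longest.first  <- sort(speech.length, decreasing = TRUE)  # values, descending
positions      <- order(speech.length)                    # indices, not values

print(shortest.first)
print(positions)
```

Here positions starts with 2 because the shortest speech is the second element of speech.length.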
You can use this sorted vector, for example, to determine the lengths of the 10 shortest speeches:

    sort(inaug.length)[ 1:10]

For data frames, you usually want to do something slightly different. For example, you might not want to sort just the lengths of the inaugural speeches, but sort all the rows in the order of their speech lengths. The command for this is order( ), and you put it where you would put a filtering constraint. The following command says "I want the inaugural table, in the order of the speech lengths":

    inaugural[ order( inaugural$length) , ]

Try it for yourself:
Merging data frames

Our inaugural data frame does not contain the party affiliation of the presidents. If we had that, we could do a number of additional analyses, for speech length or the degree to which presidents use terms like "freedom" or "democracy". This is a good opportunity to demonstrate how to merge data frames in R. For convenience, we will use a smaller amount of data. Above we made a "recent inaugural addresses" data frame inaug.new using

    inaug.new = inaugural[ inaugural$year > 1960, ]

It contains 13 entries. We would like to make a new data frame that maps inaugural address years to the affiliation of the president. But how do you make a new data frame by hand (rather than reading one in from a file)? Here is how: You use the function data.frame( ). Inside the parentheses, you specify each column of the data frame. For example, to make a data frame with one column called "A" and one called "B", you could type:

    my.example = data.frame(A = c(1,2,3), B = c(20, 30, 40))

So the column A contains the sequence of values 1,2,3 (in that order), and the column B contains the sequence 20, 30, 40. Now we are ready for the data frame mapping inaugural address years to the affiliation of each president:

    affiliation = data.frame(year = inaug.new$year, party = c("D", "D", "R", "R", "D", "R", "R", "R", "D", "D", "R", "R", "D"))

The "year" column in this data frame is a sequence of numbers that are the same as in inaug.new, and the "party" column is a sequence of either "D" or "R". Now we want to merge the two data frames by their common column "year". And that is exactly what we say:

    merge(inaug.new, affiliation, by = "year")

To keep the result, store it under a new name:

    inaug.new.withparty = merge(inaug.new, affiliation, by = "year")

What merge( ) has done is to link every row in inaug.new to the row in affiliation that has the same year.

We could have saved ourselves some work if we had defined party affiliations for presidents, not years, as some presidents appear multiple times in our data frame. (And we even have two different presidents with the same name and same affiliation in our data.) So we could also say:

    affiliation = data.frame(president = c("Kennedy", "Johnson", "Nixon", "Carter", "Reagan", "Bush", "Clinton", "Obama"), party = c("D", "D", "R", "D", "R", "R", "D", "D"))

But now

    merge(inaug.new, affiliation, by = "president")

links several rows in inaug.new to the single matching row in affiliation. After all, there is only one affiliation row with the right president name that each row in inaug.new could link to.

Reading and writing data in the console

You can always read data using the "Import Dataset" button in RStudio, and that is probably the most convenient route. But you can also read and write data using commands in the console.
Watch out: When you use read.table(), and the table has header information in the file, you have to add the parameter header=TRUE (or header=T for short), or your data will be mangled. After reading in new data, say to a data frame named my.df, I recommend always looking at the first few lines of the data frame to make sure it has been read in without any trouble:

    head(my.df)

The biggest issue with using read.csv() and read.table() is usually figuring out the locations of the files.
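A minimal round trip, assuming a small made-up data frame and a temporary file so that the file location is not an issue. write.csv( ), read.csv( ), read.table( ) and tempfile( ) are base R functions; the names my.df, path, from.csv and from.table are hypothetical.

```r
# Hypothetical data frame and a temporary file path (illustration only).
my.df <- data.frame(year = c(1789, 1793), length = c(1538, 147))
path  <- tempfile(fileext = ".csv")

# Write the data frame as CSV, leaving out the row names.
write.csv(my.df, path, row.names = FALSE)

# read.csv() expects a header line by default; read.table() must be told.
from.csv   <- read.csv(path)
from.table <- read.table(path, header = TRUE, sep = ",")

# Always check the first few lines after reading.
head(from.csv)
```

Without row.names = FALSE, write.csv( ) also writes the row names as an unnamed first column, which read.csv( ) then reads back as a column called X. For your own files, replace path with the location of the file on your computer.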