Reading a file
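The books in these examples are plain-text files from Project Gutenberg. Here is a minimal sketch of one way to read such a file in; the file path is just a placeholder, and scan() with what = "character" splits the file into a vector of words at whitespace. The later sections refer to this vector as "text".

# read the book as a vector of words, splitting on whitespace
# (the path is only an example; point it at your own copy of the text)
text = scan("~/Desktop/oz.txt", what = "character", quote = "")
# the first few words
text[1:10]
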
Counting words

Re-using the result of "scan", we can map words to their counts in the text using xtabs:
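A minimal sketch of what that can look like, assuming "text" is the vector of words produced by scan() above. The lookup word "oz" is just an example; note that since punctuation has not been removed yet, entries like "oz." and "oz," are counted separately.

# lowercase so that "The" and "the" count as the same word
oz.words.lower = tolower(text)
# xtabs gives a table of counts, one entry per distinct word
oz.word.counts = xtabs(~ oz.words.lower)
# look up the count for a particular word
oz.word.counts["oz"]
# the most frequent words
head(sort(oz.word.counts, decreasing = TRUE))
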
Package "tau": lowercasing, removing punctuation, and countingThe package "tau" lets you count how often each word appears in a text, but while reading in the text, you can preprocess your text. The relevant function is textcnt(). It can do the following preprocessing:
The result is a vector with names on the entries.

# to do word counting, we need to paste the text all together into a string again
oz.str = paste(text, collapse=" ")
library(tau)
# this does the counting, lowercasing everything first
oz.counts = textcnt(oz.str, n=1, method="string", tolower=T)
# oz.counts is a vector with names on the entries.
# Here is how you access entries:
oz.counts[["oz"]]
# and here is how you get a list of all the words that have counts:
names(oz.counts)
# if you would rather have it as a data frame with one column "word" and one column "count",
# do this:
oz.counts.df = data.frame(word = names(oz.counts), count = as.vector(oz.counts))
# now you find the entry for "oz" like this:
oz.counts.df[oz.counts.df$word == "oz",]

Package "tm": lowercasing, removing punctuation and stopwords, stemming, and counting

The R package tm for "text mining" has useful functions for processing text. It goes a bit further than the package "tau": it also lowercases and removes punctuation, but in addition it can remove stopwords (words that you often may not want to count, such as "in", "the", "a", "of", "and"), and it can do stemming. Stemming is like lemmatization, only simpler: it basically just hacks off what might be affixes and hopes for the best. Here is an example, which assumes that you have several "Wizard of Oz" books (from Project Gutenberg) in a directory called "ozbooks". This directory contains nothing except for these files, which you want to process. This code builds on an article on the tm package by Ingo Feinerer and a blog post on the tm package.

library(tm)
oz <- Corpus(DirSource("~/Desktop/ozbooks/"))
# normalization of the text:
oz <- tm_map(oz, tolower) # lowercase
oz <- tm_map(oz, removePunctuation, preserve_intra_word_dashes = FALSE) # remove punctuation
oz <- tm_map(oz, removeWords, stopwords("english")) # remove stopwords
oz <- tm_map(oz, stemDocument) # reduce word forms to stems
# inspecting terms that appear at least 100 times in the first book of the collection
oz.tdm.1 <- TermDocumentMatrix(oz[1])
findFreqTerms(oz.tdm.1, 100)
# inspecting terms that appear at least 50 times in the second book of the collection
oz.tdm.2 <- TermDocumentMatrix(oz[2])
findFreqTerms(oz.tdm.2, 50)
# count how often the term "woodman" appears in each of the documents in the collection
tdm = TermDocumentMatrix(oz)
tm_term_score(tdm, "woodman")

Counting n-grams instead of words

Suppose you want to count word sequences of length 2, also called word bigrams. (I do bigrams here in the example, but you can also do trigrams or longer n-grams.)

# whole text in one string again
oz.str = paste(text, collapse = " ")
# we use the package tm to lowercase everything
# and to remove punctuation.
# Here is how you turn a single text, given as a string,
# into a tm object:
oz.corpus = Corpus(VectorSource(oz.str))
oz.corpus = tm_map(oz.corpus, tolower)
oz.corpus = tm_map(oz.corpus, removePunctuation, preserve_intra_word_dashes = FALSE)
# Now change this tm object back into a long string,
# lowercased and minus the punctuation
cleaned.oz.str = as.character(oz.corpus)[1]
# split into words
oz.words = strsplit(cleaned.oz.str, " ", fixed = T)[[1]]
library(NLP)
# the NLP function "ngrams" returns a list of pairs of words.
# We would rather have a vector of strings, each consisting
# of two words with a space in between
oz.bigrams = vapply(ngrams(oz.words, 2), paste, "", collapse = " ")
# Here is what the first 3 bigrams look like. The first one is a bit broken.
> oz.bigrams[1]
[1] "list(\"the project"
> oz.bigrams[2]
[1] "project gutenberg"
> oz.bigrams[3]
[1] "gutenberg ebook"

# we count them using xtabs,
# and put the result into a data frame.
oz.bigram.counts = as.data.frame(xtabs(~oz.bigrams))
# here are the most frequent bigrams
> head(oz.bigram.counts[order(oz.bigram.counts$Freq, decreasing = T),])
         oz.bigrams Freq
12304        of the  363
16976 the scarecrow  208
18196        to the  201
1529        and the  169
9123         in the  160
14351      said the  152

Getting access to lots of n-grams

The R package ngramr gives you access to the Google n-grams. These are frequencies of word n-grams computed from a massive number of books. So if your project requires you to find general frequencies of particular word n-grams in a reasonable approximation of English in general, this could be useful.

Package "koRpus": part-of-speech tagging and lemmatization

If you need better preprocessing that tells you the lemma of each word rather than the stem, and that tells you the part of speech of each word (so you can distinguish object-the-noun from object-the-verb, or so you can find the verbs in all the sentences), here is one way to do it: Download the TreeTagger, a tool that can do part-of-speech tagging and lemmatization. Then install the "koRpus" package in R. It provides a simple interface to the TreeTagger:
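Here is a minimal sketch of what that can look like. It assumes the TreeTagger is installed under ~/treetagger with the English parameter files (adjust the paths and language to your installation), and it uses the koRpus function treetag(); newer versions of koRpus may also ask you to install the language support package koRpus.lang.en.

library(koRpus)
# run the TreeTagger on a plain-text file
# (both the text file and the TreeTagger path are placeholders)
oz.tagged = treetag("~/Desktop/oz.txt",
                    treetagger = "manual",
                    lang = "en",
                    TT.options = list(path = "~/treetagger", preset = "en"))
# the result has one row per token,
# with its part-of-speech tag and its lemma
head(taggedText(oz.tagged))
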
Downloading and cleaning text from webpages

If you have the URL, then downloading the page is really simple in R:

# NOTE: We are downloading from a Project Gutenberg mirror
# that is set up to handle automatic text requests,
# **not** from the main page.
raw.oz.lines = readLines("ftp://gutenberg.pglaf.org/mirrors/gutenberg/5/55/55-h/55-h.htm")

Now you have the contents of the page as a vector of strings, each containing a line of "text". It's "text" in scare quotes because it also includes all the HTML commands. So the bigger problem is: how do you get at the plain text? There are several options.

# put the whole page into a single string
raw.oz = paste(raw.oz.lines, collapse = "\n")

# Option 1: package tm.plugin.webmining
library(tm.plugin.webmining)
# This gives you the result as a single long string
oz.text.1 = extractHTMLStrip(raw.oz)
# We inspect the results: there is some
# remaining code, but mostly it looks okay
substring(oz.text.1, 1, 700)

# Option 2: package qdap
library(qdap)
oz.text.2 = bracketX(raw.oz, "angle")
# The results are pretty much the same:
# some leftover code, but mostly okay
substring(oz.text.2, 1, 700)

# Option 3: package rvest
library(rvest)
oz.obj = html(raw.oz)
oz.text.3 = html_text(oz.obj)
# and same result again
substring(oz.text.3, 1, 700)

# Option 4: package XML
library(XML)
oz.obj.2 = htmlParse(raw.oz, asText = T)
# This gives you the result as a vector of strings
oz.lines = xpathSApply(oz.obj.2, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
# Much cleaner than with the other options
oz.lines[1:5]

With this straightforward HTML page, there was not much difference between the four packages. But if you use a page with more formatting, the differences become more pronounced:
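For instance, you could try a Wikipedia article, which comes with navigation menus, sidebars, and scripts; the URL below is just one arbitrary example of such a page.

# a page with much more markup than the Project Gutenberg book
# (this URL is only an example; substitute any heavily formatted page)
raw.page.lines = readLines("https://en.wikipedia.org/wiki/The_Wonderful_Wizard_of_Oz")
raw.page = paste(raw.page.lines, collapse = "\n")
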
Now try the four options again.

Encodings

Different texts come with different encodings, to accommodate different alphabets. You can check the encoding of a string with the function Encoding():

> Encoding("über")
[1] "UTF-8"
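If a text turns out to be in a different encoding than the one your system expects, base R's iconv() can convert between encodings. A minimal sketch (the encoding names and file name here are just examples; use the ones that match your data):

# suppose a string came in as Latin-1 and we want UTF-8
x = "\xfcber"                    # "über" in Latin-1
Encoding(x) = "latin1"           # declare which encoding the bytes are in
x.utf8 = iconv(x, from = "latin1", to = "UTF-8")
Encoding(x.utf8)
# you can also declare the encoding while reading a file:
# readLines("somefile.txt", encoding = "latin1")   # file name is a placeholder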