R code: reading, preprocessing and counting text.

Reading a file

# reading in one of the Oz books (this file is The Marvellous Land of Oz):

# this produces a vector of strings, one per line

textlines = readLines("~/Desktop/ozbooks/pg55.txt")
> head(textlines)
[1] "The Project Gutenberg EBook of The Marvellous Land of Oz, by L. Frank Baum"
[2] ""                                                                         
[3] "This eBook is for the use of anyone anywhere at no cost and with"         
[4] "almost no restrictions whatsoever.  You may copy it, give it away or"     
[5] "re-use it under the terms of the Project Gutenberg License included"      
[6] "with this eBook or online at www.gutenberg.net"     

# this produces a vector of strings, one per word

> text = scan("~/Desktop/ozbooks/pg55.txt", quote=NULL, what="x")
Read 45518 items
> head(text)
[1] "The"       "Project"   "Gutenberg" "EBook"     "of"      
[6] "The"  
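
If you want to try the two functions without the Oz file, here is a small sketch that writes three lines to a temporary file and reads them back both ways. The file contents and the names toy.lines and toy.words are made up for illustration.

```r
# write a tiny made-up text to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("The Road to Oz", "", "by L. Frank Baum"), tmp)

# one string per line; the empty line is kept as ""
toy.lines <- readLines(tmp)

# one string per whitespace-separated word; quote = "" switches off quote handling,
# so apostrophes in the text are not treated as string delimiters
toy.words <- scan(tmp, what = "character", quote = "", quiet = TRUE)

length(toy.lines)   # 3
length(toy.words)   # 8

unlink(tmp)
```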

Counting words

Re-using the result of "scan", we can map words to their counts in the text using xtabs:

counts = as.data.frame(xtabs(~text))

# the resulting data frame has two columns:
> names(counts)
[1] "text" "Freq"

> head(counts)
         text Freq
1           -    7
2        -The    1
3 'Advantages    1
4     'AS-IS'    1
5         'em    2
6   'Emperor'    1
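
The rows come out in alphabetical order, which is why punctuation-heavy entries show up first. Here is a minimal sketch, on a made-up word vector, of how to sort such a data frame by frequency and how to look up a single word. The names toy.text, toy.counts and toy.sorted are only for illustration.

```r
# a made-up vector of words, standing in for the result of scan()
toy.text <- c("the", "lion", "and", "the", "tiger", "and", "the", "bear")

# same counting step as above: one row per distinct word
toy.counts <- as.data.frame(xtabs(~toy.text))

# sort by frequency, most frequent word first
toy.sorted <- toy.counts[order(toy.counts$Freq, decreasing = TRUE), ]
head(toy.sorted, 3)

# look up the count of one particular word
toy.counts$Freq[toy.counts$toy.text == "lion"]
```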

Package "tau": lowercasing, removing punctuation, and counting

The package "tau" lets you count how often each word appears in a text, and it can preprocess the text as part of the counting. The relevant function is textcnt(). It can do the following preprocessing:

  • lowercase all words: tolower=T
  • discard all words with a count lower than, say, 10: lower = 10

The result is a vector with names on the entries.

# load the package
library(tau)

# to do word counting, we need to paste it all together into a string again
oz.str = paste(text, collapse=" ")

# this does the counting, lowercasing everything first
oz.counts = textcnt(oz.str, n=1, method="string", tolower=T)

# oz.counts is a vector with names on the entries.
# Here is how you access entries:
oz.counts["oz"]

# and here is how you get a list of all the words that have counts:
names(oz.counts)

# if you would rather have it as a data frame with one column "word" and one column "count",
# do this:
oz.counts.df = data.frame(word = names(oz.counts), count = c(oz.counts))
# now you find the entry for "oz" like this:
oz.counts.df[oz.counts.df$word == "oz",]
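
To make the two preprocessing options more concrete, here is a rough base-R sketch of the same idea: lowercase, split into words, count, and keep only words that reach a minimum count. This is not how textcnt() works internally, and the exact boundary behavior of its "lower" argument may differ. The sample sentence and the threshold of 2 are made up.

```r
oz.sample <- "The Scarecrow and the Tin Woodman saw the Scarecrow's field. The end."

# lowercase, then split on anything that is not a letter
tokens <- strsplit(tolower(oz.sample), "[^a-z]+")[[1]]
tokens <- tokens[tokens != ""]   # drop empty strings, if any

# count each word; the result is a named table, similar to a named vector
toy.counts <- table(tokens)

# keep only words that appear at least twice
frequent <- toy.counts[toy.counts >= 2]
frequent
```

Only "scarecrow" (2) and "the" (4) survive the threshold here. Note that the crude split turns "Scarecrow's" into "scarecrow" plus a stray "s".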

Package "tm": lowercasing, removing punctuation and stopwords, stemming, and counting

The R package tm (for "text mining") has useful functions for processing text. It goes a bit further than the package "tau": like tau, it lowercases and removes punctuation, but it can also remove stopwords (words that you often may not want to count, such as "in", "the", "a", "of", "and"), and it can do stemming. Stemming is like lemmatization, only simpler. It basically just hacks off what might be affixes and hopes for the best.
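
To get a feel for what stopword removal and stemming do, here is a toy sketch in base R. It is not the algorithm that tm uses: the stopword list is just the five words named above, and the "stemmer" strips only three suffixes. The word list is made up.

```r
words <- c("the", "scarecrow", "walked", "in", "the", "fields",
           "talking", "of", "crows", "king")

# stopword removal: drop every word that is on the list
stop.list <- c("in", "the", "a", "of", "and")
content.words <- words[!words %in% stop.list]

# toy stemming: hack off a final "ing", "ed" or "s"
crude.stems <- sub("(ing|ed|s)$", "", content.words)
crude.stems
# "scarecrow" "walk" "field" "talk" "crow" "k"
```

The last entry shows the "hopes for the best" part: "king" loses its "ing" and becomes "k". Real stemmers add conditions to avoid the worst of these cases, but they still make mistakes of this kind.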

Here is an example, which assumes that you have several "Wizard of Oz" books (from Project Gutenberg) in a directory called "ozbooks". This directory contains nothing except for these files, which you want to process. This code builds on an article on the tm package by Ingo Feinerer and a blog post on the tm package.


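library(tm)
# stemDocument below also needs the SnowballC package to be installed

oz <- Corpus(DirSource("~/Desktop/ozbooks/"))

# normalization of the text:
# (in tm 0.6 and later, wrap base functions like tolower in content_transformer())
oz <- tm_map(oz, content_transformer(tolower)) #lowercase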
oz <- tm_map(oz, removePunctuation, preserve_intra_word_dashes = FALSE) # remove punctuation
oz <- tm_map(oz, removeWords, stopwords("english")) # remove stopwords
oz <- tm_map(oz, stemDocument) # reduce word forms to stems

# inspecting terms that appear at least 100 times in the first book of the collection
oz.tdm.1 <- TermDocumentMatrix(oz[1])
findFreqTerms(oz.tdm.1, 100)

# inspecting terms that appear at least 50 times in the second book of the collection
oz.tdm.2 <- TermDocumentMatrix(oz[2])
findFreqTerms(oz.tdm.2, 50)

# count how often the term "woodman" appears in each of the documents in the collection
tdm = TermDocumentMatrix(oz)
tm_term_score(tdm, "woodman")
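
If you do not have the Oz files at hand, the following sketch runs the same kind of pipeline on two made-up one-sentence "books" held in memory. It assumes a recent version of tm. The names docs, toy, toy.tdm and toy.counts are only for illustration. Turning the term-document matrix into an ordinary matrix is just one convenient way to look at the counts.

```r
library(tm)

# two tiny made-up documents instead of a directory of files
docs <- c("The Scarecrow met the Tin Woodman. The Woodman smiled.",
          "Dorothy and the Scarecrow walked on.")
toy <- VCorpus(VectorSource(docs))

# same normalization steps as above, minus the stemming
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, removePunctuation)
toy <- tm_map(toy, removeWords, stopwords("english"))

# rows are terms, columns are documents
toy.tdm <- TermDocumentMatrix(toy)
toy.counts <- as.matrix(toy.tdm)

# how often does "woodman" appear in each document?
toy.counts["woodman", ]
```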

Counting n-grams instead of words

Suppose you want to count word sequences of length 2, also called word bigrams. (I do bigrams here in the example, but you can also do trigrams or longer n-grams.)

# whole text in one string again
oz.str = paste(text, collapse = " ")

# we use the package tm to lowercase everything
# and to remove punctuation.
# Here is how you turn a single text, given as a string,
# into a tm object:
library(tm)
oz.corpus = Corpus(VectorSource(oz.str))
# (in tm 0.6 and later, wrap tolower in content_transformer())
oz.corpus = tm_map(oz.corpus, content_transformer(tolower))
oz.corpus = tm_map(oz.corpus, removePunctuation, preserve_intra_word_dashes = FALSE)

# Now change this tm object back into a long string,
# lowercased and minus the punctuation
cleaned.oz.str = as.character(oz.corpus)[1]

# split into words
oz.words = strsplit(cleaned.oz.str, " ", fixed = T)[[1]]

# the function "ngrams" from the package NLP returns a list of pairs of words.
# We would rather have a vector of strings, each consisting
# of two words with a space in between
oz.bigrams = vapply(ngrams(oz.words, 2), paste, "", collapse = " ")
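# Here is what the first 3 bigrams look like. The first one is a bit broken:
# as.character() on the whole corpus leaves a stray 'list("' at the start of the string.
# Extracting the single document with as.character(oz.corpus[[1]]) should avoid this.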
> oz.bigrams[1]
[1] "list(\"the project"
> oz.bigrams[2]
[1] "project gutenberg"
> oz.bigrams[3]
[1] "gutenberg ebook"

# we count them using xtabs,
# and put the result into a data frame.
oz.bigram.counts = as.data.frame(xtabs(~oz.bigrams))
# here are the most frequent bigrams
> head(oz.bigram.counts[order(oz.bigram.counts$Freq, decreasing = T),])
         oz.bigrams Freq
12304        of the  363
16976 the scarecrow  208
18196        to the  201
1529        and the  169
9123         in the  160
14351      said the  152

Getting access to lots of n-grams

The R package ngramr gives you access to the Google n-grams. These are frequencies of word n-grams computed from a very large collection of books. So if your project requires general frequencies of particular word n-grams in a reasonable approximation of English as a whole, this could be useful.

Package "koRpus": part-of-speech tagging and lemmatization

If you need more thorough preprocessing that gives you the lemma of each word rather than the stem, and the part of speech of each word (so you can distinguish object-the-noun from object-the-verb, or find the verbs in all the sentences), here is one way to do it:

Download the TreeTagger, a tool that can do part-of-speech tagging and lemmatization. Then install the "koRpus" package in R. It provides a simple interface to the TreeTagger:


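# lemmatization and tagging in R
# install the TreeTagger somewhere!
# then load koRpus and set the TreeTagger path and the language (English):
library(koRpus)
# (newer koRpus versions also need the language package: library(koRpus.lang.en))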

# Global settings: where the Tree Tagger is on my system,
# and that I want to use English data
set.kRp.env(TT.cmd = "~/Software/treetagger/cmd/tree-tagger-english", lang = "en")

# tagging a file, this produces some object
oz.tagged.obj <- treetag("~/Desktop/ozbooks/pg55.txt")

# Extracting the tagging results:
oz.tagged = taggedText(oz.tagged.obj)

# This yields a data frame. The relevant columns are "token",
# "tag", and "lemma"
> head(oz.tagged[, c("token", "tag", "lemma")])
      token tag     lemma
1       The  DT       the
2   Project  NP   Project
3 Gutenberg  NP Gutenberg
4     EBook  NP <unknown>
5        of  IN        of
6       The  DT       the
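
Once you have this data frame, ordinary data frame operations are enough to pull out verbs, nouns, or lemma counts. Here is a sketch on a hand-made data frame with the same three columns. The rows and tags are made up for illustration and are not real TreeTagger output. It assumes that verb tags start with "V" and noun tags with "N", as in the tags shown above.

```r
# hand-made stand-in for the result of taggedText()
toy.tagged <- data.frame(
  token = c("They", "object", "when", "the", "object", "falls", "."),
  tag   = c("PP",   "VVP",    "WRB",  "DT",  "NN",     "VVZ",   "SENT"),
  lemma = c("they", "object", "when", "the", "object", "fall",  "."),
  stringsAsFactors = FALSE
)

# all verbs: rows whose tag starts with "V"
toy.verbs <- toy.tagged[grepl("^V", toy.tagged$tag), ]

# object-the-noun versus object-the-verb
object.rows <- toy.tagged[toy.tagged$lemma == "object", ]
n.noun <- sum(grepl("^N", object.rows$tag))
n.verb <- sum(grepl("^V", object.rows$tag))

# counting lemmas instead of word forms, so "falls" counts as "fall"
toy.lemma.counts <- as.data.frame(xtabs(~lemma, data = toy.verbs))
```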

Downloading and cleaning text from webpages

If you have the URL, then downloading the page is really simple in R:

# NOTE: We are downloading from a Project Gutenberg Mirror
# that is set up to handle automatic text requests,
# **not** from the main page.
raw.oz.lines = readLines("ftp://gutenberg.pglaf.org/mirrors/gutenberg/5/55/55-h/55-h.htm")

Now you have the contents of the page as a vector of strings, each containing a line of "text". It's "text" in scare quotes because it also includes all the HTML commands.

So the bigger problem is: How do you get at the plain text? There are several options.

raw.oz = paste(raw.oz.lines, collapse = "\n")

# Option 1: package tm.plugin.webmining
library(tm.plugin.webmining)
# This gives you the result as a single long string
oz.text.1 = extractHTMLStrip(raw.oz)
# We inspect the results: there is some
# remaining code, but mostly it looks okay
substring(oz.text.1, 1, 700)

# Option 2: package qdap
library(qdap)
oz.text.2 = bracketX(raw.oz, "angle")
# The results are pretty much the same:
# Some leftover code, but mostly okay
substring(oz.text.2, 1, 700)

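# Option 3: package rvest
library(rvest)
# (read_html() replaces the older html(), which is deprecated in newer rvest versions)
oz.obj = read_html(raw.oz)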
oz.text.3 = html_text(oz.obj)
# and same result again
substring(oz.text.3, 1, 700)

# Option 4: package XML
library(XML)
oz.obj.2 = htmlParse(raw.oz, asText = T)
# This gives you the result as a vector of strings
oz.lines = xpathSApply(oz.obj.2, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
# Much cleaner than with the other options

With this straightforward HTML page, there was not much difference between the four packages. But if you use a page with more formatting, the differences become more pronounced:

raw.news.lines = readLines("http://www.livescience.com/52259-quantum-teleportation-sets-distance-record.html")

raw.news = paste(raw.news.lines, collapse = "\n")

Now try the four options again.
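
To see why pages with more formatting are harder, it helps to try stripping the markup by hand. The following is a crude base-R sketch using regular expressions. It is not what any of the four packages do internally, and it will fail on messy real-world HTML. The toy page is made up.

```r
toy.html <- paste(
  "<html><head><title>Oz</title>",
  "<style>p { color: green; }</style>",
  "<script>var x = 1 < 2;</script></head>",
  "<body><h1>The Emerald City</h1>",
  "<p>Dorothy &amp; Toto <b>walked</b> on.</p></body></html>",
  sep = "\n")

# step 1: drop script and style blocks including their contents
# ((?s) lets "." match newlines, .*? stops at the first closing tag)
no.code <- gsub("(?s)<(script|style)[^>]*>.*?</\\1>", " ", toy.html,
                perl = TRUE, ignore.case = TRUE)

# step 2: replace every remaining tag with a space
no.tags <- gsub("<[^>]+>", " ", no.code)

# step 3: squeeze runs of whitespace and trim the ends
toy.clean <- trimws(gsub("\\s+", " ", no.tags))
toy.clean
# "Oz The Emerald City Dorothy &amp; Toto walked on."
```

The leftover "&amp;" is an example of the remaining code mentioned above: entities are not tags, so this sketch does not touch them. Skipping step 1 would leave the style rule and the script contents in the text, which is roughly what the XPath expression in Option 4 guards against.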


Encodings

Different texts come with different encodings, to accommodate different alphabets. You can check the encoding of a string with the function Encoding():
> Encoding("über")
[1] "UTF-8"

To convert a string from one encoding to another, you can use the function iconv(). The function iconvlist() gives you a list of all available encodings.
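
Here is a small sketch of such a conversion. It builds the word "über" from its Latin-1 bytes, so that the example does not depend on how this file itself is encoded, and then converts it to UTF-8. The variable names are made up.

```r
# "über" in Latin-1: the ü is the single byte 0xfc
x <- rawToChar(as.raw(c(0xfc, 0x62, 0x65, 0x72)))
Encoding(x) <- "latin1"    # tell R how to interpret these bytes

# convert to UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")

Encoding(y)           # "UTF-8"
nchar(x, "bytes")     # 4
nchar(y, "bytes")     # 5, because ü takes two bytes in UTF-8
charToRaw(y)          # c3 bc 62 65 72
```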

The function translate() of package "tau" can sometimes fix bad encodings.