
Topic modeling: a demo




library(tm)
library(topicmodels)

# If you have a directory called /Users/someone/Desktop/oz that contains
# a number of Wizard of Oz books in plain text (and nothing else)
# you can read them into the right format like this:
oz <- Corpus(DirSource("/Users/someone/Desktop/oz/"))

# Now we create a Document-Term matrix, which counts
# how often each word appears in each document.
# This call also does preprocessing of the data:
# - lowercasing
# - removing punctuation
# - removing stopwords
dtm.oz <- DocumentTermMatrix(oz, control = list(tolower = TRUE, removePunctuation = TRUE, stopwords = TRUE))
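
# A minimal base-R sketch of what a document-term matrix holds, using two
# made-up sentences and a hypothetical three-word stopword list (not the
# list tm uses). It only illustrates the counting idea; tm's own
# preprocessing differs in detail.

```r
# Two toy "documents" (invented for illustration)
toy.docs <- c(doc1 = "The Wizard of Oz. The wizard!",
              doc2 = "Dorothy and the Scarecrow")
toy.stopwords <- c("the", "of", "and")  # hypothetical stopword list

# Lowercase, strip punctuation, split on whitespace, drop stopwords
toy.tokens <- lapply(toy.docs, function(d) {
  d <- gsub("[[:punct:]]", "", tolower(d))
  words <- strsplit(d, "\\s+")[[1]]
  words[!(words %in% toy.stopwords)]
})

# Count every vocabulary word in every document:
# rows are documents, columns are terms
toy.vocab <- sort(unique(unlist(toy.tokens)))
toy.dtm <- t(sapply(toy.tokens, function(w) table(factor(w, levels = toy.vocab))))
toy.dtm
```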

# Inspect the first 5 documents and a few term columns. In my version,
# columns 237:242 are the first terms that are not numbers; the indices
# may differ for your corpus.
inspect(dtm.oz[1:5, 237:242])

# For topic modeling, we use the function LDA()
# from the topicmodels package.
# For more information, see the topicmodels package documentation at
# http://cran.r-project.org/web/packages/topicmodels/topicmodels.pdf
#
# k is the number of topics.
# 'control' holds further parameters of the LDA model; here alpha, the
# parameter of the Dirichlet prior on each document's topic distribution.
# The default estimation method is VEM;
# pass method = "Gibbs" to LDA() to use Gibbs sampling instead.
ldaobj <- LDA(dtm.oz, k = 20, control = list(alpha = 0.1))
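
# To give a feel for what k and alpha control, here is a rough base-R
# sketch of how LDA assumes a single document is generated. The vocabulary,
# the topic-term probabilities in toy.beta, and the helper rdirichlet.sketch
# are all made up for illustration; none of them come from topicmodels.

```r
set.seed(1)
k <- 3        # number of topics
alpha <- 0.1  # small alpha tends to give documents dominated by few topics

toy.vocab <- c("dorothy", "oz", "witch", "wizard")
# Hypothetical term probabilities for each topic (each row sums to 1)
toy.beta <- matrix(c(0.7, 0.1, 0.1, 0.1,
                     0.1, 0.1, 0.7, 0.1,
                     0.1, 0.4, 0.1, 0.4),
                   nrow = k, byrow = TRUE, dimnames = list(NULL, toy.vocab))

# One draw from a symmetric Dirichlet, via normalized gamma variables
rdirichlet.sketch <- function(n.topics, a) {
  g <- rgamma(n.topics, shape = a)
  g / sum(g)
}

theta <- rdirichlet.sketch(k, alpha)                      # topic mix of the document
z <- sample(k, size = 10, replace = TRUE, prob = theta)   # one topic per word slot
toy.words <- sapply(z, function(topic) sample(toy.vocab, 1, prob = toy.beta[topic, ]))
toy.words
```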

# Now we inspect the 20 most likely terms for each topic
# to get some insight into what the topics mean
terms(ldaobj, 20)

# We can also obtain the 5 most likely topics for each document
# to get some insight into what the different documents cover
topics(ldaobj, 5)

# Or you can get all the information:
# for each term, what is its probability under each topic,
# and for each document, what is the probability of each topic.
lda.inf <- posterior(ldaobj, dtm.oz)

# lda.inf$terms is a very large matrix
# with one row for each of the k topics
# and one column for each word in the vocabulary.
# The entry in row i and column j is the probability of term j under topic i.
lda.inf$terms[1,300:305]
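
# A small sketch of how a topic-by-term matrix like lda.inf$terms relates
# to a "most likely terms" listing: sort each row and keep the top entries.
# toy.terms is a made-up 2 x 4 matrix, not output from the model above.

```r
# Hypothetical term probabilities for two topics (each row sums to 1)
toy.terms <- matrix(c(0.5, 0.3, 0.1, 0.1,
                      0.1, 0.1, 0.2, 0.6),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("1", "2"),
                                    c("dorothy", "oz", "witch", "wizard")))

# For each topic (row), the names of the 2 most probable terms.
# The result has one column per topic.
top.terms <- apply(toy.terms, 1, function(p) names(sort(p, decreasing = TRUE))[1:2])
top.terms
```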

# lda.inf$topics is a matrix with one row for each document
# and one column for each topic. The entry in row i and column j is
# the probability of topic j for document i.
lda.inf$topics
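
# A small sketch of reading a document-by-topic matrix like lda.inf$topics:
# each row is a probability distribution over topics, and which.max picks
# the single most likely topic per document. toy.topics and its file names
# are made up for illustration.

```r
# Hypothetical topic probabilities for two documents and three topics
toy.topics <- matrix(c(0.7, 0.2, 0.1,
                       0.1, 0.1, 0.8),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("book1.txt", "book2.txt"),
                                     c("1", "2", "3")))

# Each row sums to 1
rowSums(toy.topics)

# Index of the most likely topic for each document
best.topic <- apply(toy.topics, 1, which.max)
best.topic
```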

