`library(tm)`
`library(topicmodels)`

`# If you have a directory called /Users/someone/Desktop/oz that contains`
`# a number of Wizard of Oz books in plain text (and nothing else),`
`# you can read them into the right format like this:`
`oz <- Corpus(DirSource("/Users/someone/Desktop/oz/"))`

`# Now we create a document-term matrix, which counts`
`# how often each word appears in each document.`
`# This call also preprocesses the data:`
`# - lowercasing`
`# - removing punctuation`
`# - removing stopwords`
`dtm.oz = DocumentTermMatrix(oz, control = list(tolower = TRUE, removePunctuation = TRUE, stopwords = TRUE))`

`# Inspecting the first 5 documents and some term frequencies (in my version, columns 237 to 242`
`# are the first terms that are not numbers):`
`inspect(dtm.oz[1:5, 237:242])`

`# For topic modeling, we use the function LDA()`
`# from the topicmodels package.`
`# For more information, see the topicmodels package documentation at`
`# http://cran.r-project.org/web/packages/topicmodels/topicmodels.pdf`
`#`
`# k is the number of topics.`
`# 'control' contains further parameters of the LDA model (here alpha).`
`# You can also set the learning method for LDA: the default is VEM,`
`# and the alternative is method = "Gibbs".`
`ldaobj = LDA(dtm.oz, k = 20, control = list(alpha = 0.1))`

`# Now we inspect the 20 most likely terms for each topic`
`# to get some insight into what the topics mean.`
`terms(ldaobj, 20)`

`# We can also obtain the 5 most likely topics for each document`
`# to get some insight into what the different documents cover.`
`topics(ldaobj, 5)`

`# Or you can get all the information:`
`# for each term, what is its probability under each topic,`
`# and for each document, what is the probability of each topic.`
`lda.inf <- posterior(ldaobj, dtm.oz)`

`# lda.inf$terms is a gigantic matrix`
`# with one column for each word in the vocabulary`
`# and one row for each of the k topics.`
`# The entry in row i and column j is the probability of term j under topic i.`
`lda.inf$terms[1, 300:305]`

`# lda.inf$topics is a matrix with a row for each document`
`# and a column for each topic. The entry in row i and column j is`
`# the probability of topic j for document i.`
`lda.inf$topics`
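
To make the shape of the document-topic matrix more concrete, here is a minimal base-R sketch. It does not use the Oz corpus or the topicmodels package: `doc_topics` is a tiny made-up matrix standing in for `lda.inf$topics`, and the names `row_totals`, `top_topic` and `top_two` are illustrative only. It checks that each row is a probability distribution over topics, and it ranks the topics within each document, which is conceptually what `topics(ldaobj, 5)` reports.

```r
# Toy stand-in for lda.inf$topics: 3 documents, 4 topics.
# The values are invented for illustration, not model output.
doc_topics <- matrix(
  c(0.70, 0.10, 0.15, 0.05,
    0.05, 0.05, 0.10, 0.80,
    0.25, 0.40, 0.30, 0.05),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), paste0("topic", 1:4))
)

# Each row is a probability distribution over topics, so it should sum to 1.
row_totals <- rowSums(doc_topics)

# Most likely topic for each document (column index of the row maximum).
top_topic <- apply(doc_topics, 1, which.max)

# The two most likely topics per document, most likely first.
# The result has one column per document and one row per rank.
top_two <- apply(doc_topics, 1, function(p) order(p, decreasing = TRUE)[1:2])

print(row_totals)
print(top_topic)
print(top_two)
```

The same row-wise ranking could be applied to a real `lda.inf$topics` matrix; for `lda.inf$terms`, the rows are topics rather than documents, so ranking within a row gives the most probable terms of a topic.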
