Blog posts on Data Science, Machine Learning, Data Mining, Artificial Intelligence, Spark Machine Learning

Sunday, October 6, 2013

Topic Modeling in R

As part of my Twitter data analysis series, I have so far covered movie reviews using R and document classification using R. Today we will discover topics in tweets, i.e. mine tweet data for its underlying topics, an approach known as topic modeling.
What is Topic Modeling?

Topic modeling is a statistical approach for discovering abstract topics in a collection of text documents, based on the statistics of the words in each document. In simple terms, it looks through a large collection of documents, identifies clusters of words that tend to occur together, and groups the documents by the word clusters they share.
Consider the statements below:
  1. I love playing cricket.
  2. Sachin is my favorite cricketer.
  3. Titanic is a heart-touching movie.
  4. Data analytics is the future of IT.
  5. Data analytics and Big Data complement each other.
When we apply topic modeling to the above statements, we will be able to group statements 1 & 2 as Topic-1 (later we can identify that the topic is sport), statement 3 as Topic-2 (the topic is movies), and statements 4 & 5 as Topic-3 (the topic is data analytics).
fig: Identifying topics in Documents and classifying as Topic 1 & Topic 2

Latent Dirichlet Allocation (LDA):
Topic modeling can be done with the Latent Dirichlet Allocation algorithm. Without going into the nuts and bolts of the algorithm: LDA treats every document as a mixture of topics and every topic as a probability distribution over the words in the corpus, and it learns both sets of probabilities from the data without any labels.
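To make the idea concrete, here is a minimal base R sketch of the generative story that LDA assumes. It does not fit a model; it only shows how a document could be produced from topic mixtures and per-topic word distributions. The vocabulary, the number of topics, the document length and the helper names `rdirichlet1` and `generate_doc` are all assumptions made for illustration.

```r
# Toy illustration of the generative story LDA assumes (not a fitting routine).
set.seed(1)

# Draw one sample from a Dirichlet distribution by normalising gamma draws.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

vocab <- c("cricket", "match", "movie", "actor", "data", "analytics")  # hypothetical vocabulary
k <- 3          # number of topics (assumed)
n_words <- 8    # words per generated document (assumed)

# Each topic is a probability distribution over the vocabulary (rows sum to 1).
beta <- t(replicate(k, rdirichlet1(rep(0.5, length(vocab)))))
colnames(beta) <- vocab

# Each document gets its own mixture of topics, then one topic per word slot.
generate_doc <- function() {
  theta <- rdirichlet1(rep(0.5, k))                       # topic proportions for this document
  z <- sample(k, n_words, replace = TRUE, prob = theta)   # topic assigned to every word slot
  words <- vapply(z, function(topic) sample(vocab, 1, prob = beta[topic, ]), character(1))
  list(theta = theta, topics = z, words = words)
}

doc <- generate_doc()
print(round(doc$theta, 2))
print(doc$words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures (`theta`) and the topic word distributions (`beta`).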
Twitter Data Analysis Using LDA:
Steps Involved:
  1. Fetch tweets data using ‘twitteR’ package.
  2. Load the data into the R environment.
  3. Clean the data to remove re-tweet information, links, special characters, emoticons, and frequent words like is, as, this etc.
  4. Create a Document Term Matrix (DTM) using the ‘tm’ package.
  5. Calculate TF-IDF, i.e. Term Frequency Inverse Document Frequency, for all the words in the matrix created in Step 4.
  6. Exclude all the words with tf-idf below 0.1, to remove the words that occur in so many tweets that they do not help to tell topics apart.
  7. Calculate the optimal number of topics (K) in the corpus using the log-likelihood method on the matrix filtered in Step 6.
  8. Apply LDA method using topicmodels Package to discover topics.
  9. Evaluate the model.
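Steps 4 to 6 can be sketched in base R on a handful of made-up tweets. This is only an illustration under stated assumptions: the example texts are invented, and the matrix is built by hand instead of with the ‘tm’ package, but the tf-idf definition (mean term frequency over the documents containing the term, times log2 of the inverse document frequency) is the same one used in the script further down.

```r
# Made-up example tweets, already cleaned and lower-cased.
docs <- c("food was great on this flight",
          "flight delayed again",
          "great crew on this flight",
          "lost my luggage on the flight")
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term.
dtm <- t(vapply(tokens,
                function(tok) as.numeric(table(factor(tok, levels = vocab))),
                numeric(length(vocab))))
colnames(dtm) <- vocab

tf <- dtm / rowSums(dtm)          # term frequency within each document
df <- colSums(dtm > 0)            # number of documents containing each term
mean_tf <- colSums(tf) / df       # mean tf over the documents that contain the term
tfidf <- mean_tf * log2(nrow(dtm) / df)

print(round(sort(tfidf), 3))
kept <- names(tfidf)[tfidf >= 0.1]   # same threshold as Step 6
print(kept)
```

A word such as "flight", which appears in every tweet, gets a tf-idf of zero and is dropped, while a word that appears in only one or two tweets survives the filter.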
Topic modeling with LDA is a good method for discovering the topics underlying a collection of documents. The analysis gives good results only if the corpus is large. In the above analysis, using tweets about the top 5 airlines, I found that one of the topics people talk about is the FOOD being served. We can then use sentiment analysis techniques to mine what people think and say about products, companies, etc.

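#Load packages
library(tm)          # corpus handling and text cleaning
library(topicmodels) # LDA() and CTM()
library(slam)        # row_sums() and col_sums() on sparse matrices
#Load Text
con <- file("tweets.txt", "rt")
tweets = readLines(con)
close(con)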
#Clean Text
tweets = gsub("(RT|via)((?:\\b\\W*@\\w+)+)","",tweets)
tweets = gsub("http[^[:blank:]]+", "", tweets)
tweets = gsub("@\\w+", "", tweets)
tweets = gsub("[ \t]{2,}", " ", tweets) # collapse runs of blanks to one space so words are not glued together
tweets = gsub("^\\s+|\\s+$", "", tweets)
tweets <- gsub('\\d+', '', tweets)
tweets = gsub("[[:punct:]]", " ", tweets)
corpus = Corpus(VectorSource(tweets))
corpus = tm_map(corpus,removePunctuation)
corpus = tm_map(corpus,stripWhitespace)
corpus = tm_map(corpus,content_transformer(tolower)) # wrap base functions so the corpus structure is kept
corpus = tm_map(corpus,removeWords,stopwords("english"))
tdm = DocumentTermMatrix(corpus) # Creating a document-term matrix (rows = tweets, columns = terms)
# create tf-idf matrix
term_tfidf <- tapply(tdm$v/row_sums(tdm)[tdm$i], tdm$j, mean) * log2(nDocs(tdm)/col_sums(tdm > 0))
tdm <- tdm[,term_tfidf >= 0.1]
tdm <- tdm[row_sums(tdm) > 0,]
#Deciding best K value using Log-likelihood method
best.model <- lapply(seq(2, 50, by = 1), function(d){LDA(tdm, d)})
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
#calculating LDA
k = 50;#number of topics
SEED = 786; # random seed, so the fitted models are reproducible
CSC_TM <- list(
  VEM = LDA(tdm, k = k, control = list(seed = SEED)),
  VEM_fixed = LDA(tdm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
  Gibbs = LDA(tdm, k = k, method = "Gibbs", control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000)),
  CTM = CTM(tdm, k = k, control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3))))
#To compare the fitted models we first investigate the alpha values of the models fitted with VEM and alpha estimated and with VEM and alpha fixed
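sapply(CSC_TM[1:2], slot, "alpha")
#Mean entropy of the per-document topic distributions, one value per model
sapply(CSC_TM, function(x) mean(apply(posterior(x)$topics, 1, function(z) - sum(z * log(z)))))
#Most likely topic for each tweet, and the 8 most probable terms of each topic
Topic <- topics(CSC_TM[["VEM"]], 1)
Terms <- terms(CSC_TM[["VEM"]], 8)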
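The entropy line in the script is easier to read with a small example. The sketch below uses a hypothetical matrix of document-topic probabilities (the numbers and the helper name `entropy` are assumptions, not output from the airline tweets) to show what the mean entropy measures: it is low when each document is dominated by one topic and reaches log(k) when every document is spread evenly over all k topics.

```r
# Entropy of a probability vector; 0 * log(0) is treated as 0.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# Hypothetical document-topic probabilities (rows = documents, columns = topics).
topic_probs <- rbind(
  focused = c(0.97, 0.01, 0.01, 0.01),
  mixed   = c(0.40, 0.30, 0.20, 0.10),
  uniform = c(0.25, 0.25, 0.25, 0.25)
)

per_doc <- apply(topic_probs, 1, entropy)
print(round(per_doc, 3))
print(mean(per_doc))   # same kind of summary the script computes per fitted model
```

A model with a lower mean entropy assigns its documents to topics more decisively, which is one way to compare the VEM, Gibbs and CTM fits.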



  3. Thank you so much for this article!
    I am an absolute newbie to R and topic modeling. I would like to do an LDA analysis on a corpus of 7000+ articles containing a certain term in order to understand the topics associated with said term. I downloaded the articles and now I have a folder with 53 .html files... from here, I really don't know what to do. I have been looking for manuals, tutorials and explanations but they're all too "basic" (beginner guides to R) or too complex for me (in-depth insights on topic modeling).
    I know the theory behind, i.e. what steps are involved in such a topic modeling, but I am having a hard time coding.
    Could you help me out?