Thursday, October 24, 2013

Fetch Twitter data using R


This short post explains how to fetch Twitter data using the twitteR & streamR packages available in R. In order to connect to the Twitter API, we need to go through an authentication process known as OAuth, explained in my previous post.
Twitter data can be fetched in two ways: a) the REST API and b) the Streaming API.
In today's blog post we shall go through both.
twitteR Package:
One of the packages available in R for fetching Twitter data. The package can be obtained from here.
This package allows us to make REST API calls to Twitter using the ConsumerKey & ConsumerSecret codes. The code below illustrates how to extract Twitter data.
This package offers the following functionality (a few of these calls are sketched after the authentication code below):
  • Authenticate with Twitter API
  • Fetch User timeline
  • User Followers
  • User Mentions
  • Search twitter
  • User Information
  • User friends information
  • Location based Trends
  • Convert JSON object to dataframes
REST API CALLS using R - twitteR package: 
  1. Register your application with Twitter.
  2. After registration, you will get a ConsumerKey & ConsumerSecret code, which need to be used when calling the Twitter API.
  3. Load the twitteR library in the R environment.
  4. Call the Twitter API using the OAuthFactory$new() method with the ConsumerKey & ConsumerSecret code as input params.
  5. The above step will return an authorization link, which needs to be copied & pasted into a web browser.
  6. You will be redirected to the Twitter application authentication page, where you need to authenticate yourself by providing your Twitter credentials.
  7. After authenticating, you will be provided with an authorization code, which needs to be pasted into the R console.
  8. Call registerTwitterOAuth().
Source Code:
library(twitteR)
library(ROAuth)   # provides the OAuthFactory class used below

requestURL     <- "https://api.twitter.com/oauth/request_token"
accessURL      <- "https://api.twitter.com/oauth/access_token"
authURL        <- "https://api.twitter.com/oauth/authorize"
consumerKey    <- "XXXXXXXXXXXX"
consumerSecret <- "XXXXXXXXXXXXXXXX"

twitCred <- OAuthFactory$new(consumerKey=consumerKey,
                             consumerSecret=consumerSecret,
                             requestURL=requestURL,
                             accessURL=accessURL,
                             authURL=authURL)

# Download the CA certificate bundle needed for the SSL handshake
download.file(url="http://curl.haxx.se/ca/cacert.pem",
              destfile="cacert.pem")

# Opens the authorization URL; paste the PIN shown by Twitter into the R console
twitCred$handshake(cainfo="cacert.pem")

# Save the credentials so they can be reloaded later without repeating the handshake
save(list="twitCred", file="twitteR_credentials")
load("twitteR_credentials")
registerTwitterOAuth(twitCred)  # register the OAuth credentials with the twitteR session
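Once the credentials are registered, the functions listed earlier can be used to pull data. Below is a minimal sketch of a few of them; the screen name, search term, and WOEID are illustrative placeholders, and the exact arguments may vary slightly between twitteR versions:
# Search Twitter for recent tweets containing a keyword
bigdata_tweets <- searchTwitter("#bigdata", n = 100)
bigdata_df <- twListToDF(bigdata_tweets)   # convert the list of statuses to a data frame

# Fetch a user's timeline and profile information (placeholder handle)
user_tweets <- userTimeline("some_user", n = 50)
user_info   <- getUser("some_user")

# Location-based trends (WOEID 1 = worldwide)
trends <- getTrends(1)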
streamR Package:
This package allows users to fetch Twitter data in real time by connecting to the Twitter Streaming API. We can obtain the package from here. A few important functions this package offers: it allows R users to access Twitter's filter, sample, and user streams, and to parse the output into data frames.

filterStream() - Opens a connection to Twitter's Streaming API that returns public statuses matching one or more filter predicates, such as search keywords. Tweets can be filtered by keywords, users, language, and location. The output can be saved as an object in memory or written to a text file.
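For example, here is a minimal sketch of filtering by specific users instead of keywords; the user IDs and file name are placeholders, and twitCred is assumed to be the OAuth credentials object created in the first section:
library(streamR)
# Capture tweets posted by specific accounts for 5 minutes (IDs are placeholders)
filterStream(file.name = "user_tweets.json",
             follow = c("123456789", "987654321"),
             timeout = 300,
             oauth = twitCred)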

parseTweets() - This function parses tweets downloaded using filterStream, sampleStream or userStream and returns a data frame.
Below code example shows how to fetch data in real time using streamR:
library(streamR)
library(twitteR)
load("twitteR_credentials")  # make using the save credentials in the previous code.
registerTwitterOAuth(twitCred)
filterStream(file.name = "tweets.json", track = "#bigdata",timeout = 0, locations=c(-74,40,-73,41), oauth = twitCred)
Executing the above captures tweets containing "#bigdata" as well as tweets posted from the New York bounding box (the Streaming API combines the track and locations filters with an OR). Setting timeout = 0 tells it to fetch continuously; to fetch records for a fixed period instead, use something like timeout = 300 (fetches data for 300 seconds).
To Parse the fetched tweets use the below code:
tweets.df <- parseTweets("tweets.json")
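parseTweets() returns one row per tweet, with the tweet text and its metadata as columns; a quick way to inspect the result (column names other than text may vary with the streamR version):
nrow(tweets.df)        # number of tweets captured
head(tweets.df$text)   # first few tweet texts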


Sunday, October 6, 2013

Topic Modeling in R

As a part of Twitter data analysis, so far I have completed Movie Review using R & Document Classification using R. Today we will be dealing with discovering topics in tweets, i.e. mining the tweet data to discover the underlying topics, an approach known as Topic Modeling.
What is Topic Modeling?
A statistical approach for discovering abstract "topics" in a collection of text documents, based on the statistics of each word. In simple terms, it is the process of looking at a large collection of documents, identifying clusters of words that occur together, and grouping documents by the patterns of clusters that appear in them.
Consider the below Statements:
  1. I love playing cricket.
  2. Sachin is my favorite cricketer.
  3. Titanic is heart touching movie.
  4. Data Analytics is next Future in IT.
  5. Data Analytics & Big Data complements each other.
When we apply Topic Modeling to the above statements, we will be able to group statements 1 & 2 as Topic-1 (later we can identify that the topic is Sport), statement 3 as Topic-2 (Movies), and statements 4 & 5 as Topic-3 (Data Analytics).
fig: Identifying topics in Documents and classifying as Topic 1 & Topic 2

Latent Dirichlet Allocation algorithm (LDA):
Topic Modeling can be achieved by using the Latent Dirichlet Allocation (LDA) algorithm. Without going into the nuts & bolts of the algorithm, LDA learns to assign probabilities to each and every word in the corpus and groups words (and the documents that contain them) into topics. A simple explanation of LDA can be found here.
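To make this concrete, here is a minimal sketch (using the tm and topicmodels packages) that fits LDA with three topics on the five example statements above; with such a tiny corpus the groupings are only illustrative:
library(tm)
library(topicmodels)

statements <- c("I love playing cricket",
                "Sachin is my favorite cricketer",
                "Titanic is heart touching movie",
                "Data Analytics is next Future in IT",
                "Data Analytics and Big Data complements each other")

# Build a document-term matrix after basic cleaning
corpus <- Corpus(VectorSource(statements))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)

# Fit LDA with 3 topics and inspect the groupings
lda_fit <- LDA(dtm, k = 3, control = list(seed = 123))
topics(lda_fit)     # most likely topic for each statement
terms(lda_fit, 3)   # top 3 terms per topic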
Twitter Data Analysis Using LDA:
Steps Involved:
  1. Fetch tweets data using ‘twitteR’ package.
  2. Load the data into the R environment.
  3. Clean the data to remove re-tweet markers, links, special characters, emoticons, and frequent stop words like is, as, this, etc.
  4. Create a Document-Term Matrix using the ‘tm’ package.
  5. Calculate TF-IDF, i.e. Term Frequency - Inverse Document Frequency, for all the words in the matrix created in Step 4.
  6. Exclude all the words with tf-idf <= 0.1, to drop terms that appear in most documents and therefore carry little discriminating information.
  7. Calculate the optimal number of topics (K) in the corpus using the log-likelihood method for the matrix calculated in Step 6.
  8. Apply LDA method using topicmodels Package to discover topics.
  9. Evaluate the model.
Conclusion:
Topic modeling using LDA is a very good method for discovering the topics underlying a collection of documents. The analysis will give good results only if we have a large corpus. In the above analysis, using tweets from the top 5 airlines, I could find that one of the topics people are talking about is the FOOD being served. We can also apply sentiment analysis techniques to mine what people think about products, companies, etc.

SourceCode:
library("tm")
library("wordcloud")
library("slam")
library("topicmodels")
#Load Text
con <- file("tweets.txt", "rt")
tweets = readLines(con)
close(con)
#Clean Text: strip retweet markers, links, mentions, extra whitespace, numbers and punctuation
tweets = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets)  # retweet / via markers
tweets = gsub("http[^[:blank:]]+", "", tweets)            # links
tweets = gsub("@\\w+", "", tweets)                        # @mentions
tweets = gsub("[ \t]{2,}", " ", tweets)                   # collapse repeated whitespace
tweets = gsub("^\\s+|\\s+$", "", tweets)                  # leading/trailing whitespace
tweets = gsub("\\d+", "", tweets)                         # numbers
tweets = gsub("[[:punct:]]", " ", tweets)                 # punctuation
corpus = Corpus(VectorSource(tweets))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace)
corpus = tm_map(corpus, content_transformer(tolower))  # wrap base functions for newer tm versions
corpus = tm_map(corpus, removeWords, stopwords("english"))
tdm = DocumentTermMatrix(corpus)  # create a document-term matrix (stored as 'tdm')
# Mean tf-idf of each term across documents (as in the topicmodels vignette)
term_tfidf <- tapply(tdm$v/row_sums(tdm)[tdm$i], tdm$j, mean) * log2(nDocs(tdm)/col_sums(tdm > 0))
summary(term_tfidf)
tdm <- tdm[, term_tfidf >= 0.1]  # keep only reasonably informative terms
tdm <- tdm[row_sums(tdm) > 0, ]  # drop documents that became empty
summary(col_sums(tdm))
#Deciding best K value using Log-likelihood method
best.model <- lapply(seq(2, 50, by = 1), function(d){LDA(tdm, d)})
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
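# One common way to choose K: plot the log-likelihood of each fitted model against the
# number of topics and pick the value where it is highest (illustrative sketch)
best.model.logLik.df <- data.frame(topics = seq(2, 50, by = 1),
                                   LL = as.numeric(unlist(best.model.logLik)))
plot(best.model.logLik.df$topics, best.model.logLik.df$LL, type = "b",
     xlab = "Number of topics (K)", ylab = "Log-likelihood")
best.model.logLik.df$topics[which.max(best.model.logLik.df$LL)]  # candidate value for K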
#calculating LDA
k = 50      # number of topics
SEED = 786  # random seed for reproducibility
CSC_TM <- list(
  VEM       = LDA(tdm, k = k, control = list(seed = SEED)),
  VEM_fixed = LDA(tdm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
  Gibbs     = LDA(tdm, k = k, method = "Gibbs",
                  control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000)),
  CTM       = CTM(tdm, k = k,
                  control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3)))
)
# To compare the fitted models, first inspect the alpha values of the models fitted
# with VEM (alpha estimated) and with VEM (alpha fixed)
sapply(CSC_TM[1:2], slot, "alpha")
# Mean entropy of the per-document posterior topic distributions for each model
sapply(CSC_TM, function(x) mean(apply(posterior(x)$topics, 1, function(z) - sum(z * log(z)))))
Topic <- topics(CSC_TM[["VEM"]], 1)  # most likely topic for each document
Terms <- terms(CSC_TM[["VEM"]], 8)   # top 8 terms per topic
Terms