Blog posts on Data Science, Machine Learning, Data Mining, Artificial Intelligence, Spark Machine Learning

Tuesday, August 20, 2013

Sentiment Analysis using R

September 23, 2013

Today I will explain you how to create a basic Movie review engine based on the tweets by people using R.
The implementation of the Review Engine will be as follows:
  •          Gets Tweets from Twitter
  •          Clean the data
  •          Create a Word Cloud
  •          Create a data dictionary
  •          Score each tweet.

Gets Tweets from Twitter:
                First step is to fetch the data from Twitter. In R, we have facility to call the twitter API using package twitter. Below are the steps for fetch the tweets using twitter package. Each tweet data contains:
  • Text
  • Is re-tweeted
  • Re-tweet count
  • Tweeted User name
  • Latitude/Longitude 
  • Replied to, etc.
For our case we only consider Text feature of the Tweet as we are interested on the review of the movie. We can also use the other features such as Latitude/Longitude, replied to, etc. do other analysis on the tweeted data.

          tweets = searchTwitter("#ChennaiExpress", n=500, lang="en")

Clean the data:
In the next step, we need to clean the data so that we can use it for our analysis. Cleaning of data is a very important step in Data Analysis. This step includes:

Extracting only text from Tweets:
tweets_txt = sapply(tweets,function(x) x$getText())

Removing Url links, Reply to, punctuations, non-alphanumeric, symbols, spaces etc.
             tweets_cl = gsub("(RT|via)((?:\\b\\W*@\\w+)+)","",tweets)
             tweets_cl = gsub("http[^[:blank:]]+", "", tweets_cl)
             tweets_cl = gsub("@\\w+", "", tweets_cl)
             tweets_cl = gsub("[ \t]{2,}", "", tweets_cl)
             tweets_cl = gsub("^\\s+|\\s+$", "", tweets_cl)
             tweets_cl = gsub("[[:punct:]]", " ", tweets_cl)
             tweets_cl = gsub("[^[:alnum:]]", " ", tweets_cl)
             tweets_cl <- gsub('\\d+', '', tweets_cl)
Create a Word Cloud:
At this point let us view Word-Cloud of frequently tweeted words in the data considered for visual understanding/analyzing the data.
Create a data dictionary:
In this step, we create use a Dictionary of words containing positive, negative words which are downloaded from here. These 2 types of words are used as keywords for classifying the each tweet into one of the 4 categories: Very Positive, Positive, Negative and Very Negative.
Score each tweet:
In this step, we will write a function which will calculate rating of the movie. The function is given below. After calculating the scores we plot graphs showing the rating as “WORST”,”BAD”,”GOOD”,”VERYGOOD”

Future steps in this project will be:
  • To create a UI preferably using .NET, as I’m a dot-net developer ;)
  • To Build a Movie Review Model which can classify a new tweet as and when provided?

#include required libraries

#get the tweets
tweets = searchTwitter("#ChennaiExpress", n=500, lang="en")
tweets_txt = sapply(tweets[1:50],function(x) x$getText())

#function to clean data
cleanTweets = function(tweets)
tweets_cl = gsub("(RT|via)((?:\\b\\W*@\\w+)+)","",tweets)
tweets_cl = gsub("http[^[:blank:]]+", "", tweets_cl)
tweets_cl = gsub("@\\w+", "", tweets_cl)
tweets_cl = gsub("[ \t]{2,}", "", tweets_cl)
tweets_cl = gsub("^\\s+|\\s+$", "", tweets_cl)
tweets_cl = gsub("[[:punct:]]", " ", tweets_cl)
tweets_cl = gsub("[^[:alnum:]]", " ", tweets_cl)
tweets_cl <- gsub('\\d+', '', tweets_cl)

#function to calculate number of words in each category within a sentence
sentimentScore <- function(sentences, vNegTerms, negTerms, posTerms, vPosTerms){
  final_scores <- matrix('', 0, 5)
  scores <- laply(sentences, function(sentence, vNegTerms, negTerms, posTerms, vPosTerms){
    initial_sentence <- sentence
    #remove unnecessary characters and split up by word
        sentence = cleanTweets(sentence)
        sentence <- tolower(sentence)
        wordList <- str_split(sentence, '\\s+')
    words <- unlist(wordList)
    #build vector with matches between sentence and each category
    vPosMatches <- match(words, vPosTerms)
    posMatches <- match(words, posTerms)
    vNegMatches <- match(words, vNegTerms)
    negMatches <- match(words, negTerms)
    #sum up number of words in each category
    vPosMatches <- sum(!
    posMatches <- sum(!
    vNegMatches <- sum(!
    negMatches <- sum(!
    score <- c(vNegMatches, negMatches, posMatches, vPosMatches)
    #add row to scores table
    newrow <- c(initial_sentence, score)
    final_scores <- rbind(final_scores, newrow)
  }, vNegTerms, negTerms, posTerms, vPosTerms)

#load pos,neg statements
afinn_list <- read.delim(file='~/AFINN-111.txt', header=FALSE, stringsAsFactors=FALSE)
names(afinn_list) <- c('word', 'score')
afinn_list$word <- tolower(afinn_list$word)

#categorize words as very negative to very positive and add some movie-specific words
vNegTerms <- afinn_list$word[afinn_list$score==-5 | afinn_list$score==-4]
negTerms <- c(afinn_list$word[afinn_list$score==-3 | afinn_list$score==-2 | afinn_list$score==-1], "second-rate", "moronic", "third-rate", "flawed", "juvenile", "boring", "distasteful", "ordinary", "disgusting", "senseless", "static", "brutal", "confused", "disappointing", "bloody", "silly", "tired", "predictable", "stupid", "uninteresting", "trite", "uneven", "outdated", "dreadful", "bland")
posTerms <- c(afinn_list$word[afinn_list$score==3 | afinn_list$score==2 | afinn_list$score==1], "first-rate", "insightful", "clever", "charming", "comical", "charismatic", "enjoyable", "absorbing", "sensitive", "intriguing", "powerful", "pleasant", "surprising", "thought-provoking", "imaginative", "unpretentious")
vPosTerms <- c(afinn_list$word[afinn_list$score==5 | afinn_list$score==4], "uproarious", "riveting", "fascinating", "dazzling", "legendary")   

#Calculate score on each tweet
tweetResult <-, vNegTerms, negTerms, posTerms, vPosTerms))
tweetResult$'2' = as.numeric(tweetResult$'2')
tweetResult$'3' = as.numeric(tweetResult$'3')
tweetResult$'4' = as.numeric(tweetResult$'4')
tweetResult$'5' = as.numeric(tweetResult$'5')
counts = c(sum(tweetResult$'2'),sum(tweetResult$'3'),sum(tweetResult$'4'),sum(tweetResult$'5'))
names = c("Worst","BAD","GOOD","VERY GOOD")
mr = list(counts,names)
colors = c("red", "yellow", "green", "violet")
barplot(mr[[1]], main="Movie Review", xlab="Number of votes",legend=mr[[2]],col=colors)