Tuesday, December 31, 2013

Data Scientist. The Path I Chose

As we march into the New Year, I would like to post about my plan to become a Data Scientist, my 2014 resolution on the professional front. I first came across the term Data Science around this time last year. Since then I have been researching and gathering the necessary information, and I decided to become a Data Scientist. One year on, I want to look back and understand where I stand now and what still needs to be done.
Power of Data – possible Use Cases:
Automobile companies are installing devices that continuously feed data about driving behaviour, the roads taken, the places visited, etc. to insurance companies, which can analyze this information to customize insurance plans for their customers.
Another possible use case: by making use of the data we share on social media, we can develop products that understand us and sometimes guide us. For example, based on people's posts on social media, combined with location, time, the road you are travelling and live traffic feeds, we could develop an app that suggests which road to take to your destination.
Who is a Data Scientist? 
Termed the Sexiest Job of the 21st Century, Data Scientist is among the most sought-after and well-paid jobs across the industry. The role of a Data Scientist is to make sense of the tons of data being generated every day and to build products that redefine, more statistically and scientifically, the way current businesses run. The best part of this job is that a Data Scientist can fit into any industry wherever data exists.
What does it take to become a Data Scientist? 
"Web is my University, Time is the only Investment" 

Going back to where my journey started: after my initial research with colleagues and over the internet, I took up the Machine Learning course on Coursera.org, an online offering from Stanford University and definitely a recommended starting point for anyone who wants to be a Data Scientist.
After completing the course, I started playing with data, predictive models being my first hands-on exercise. At this point I got a piece of advice from a colleague:

 “It is easy (because free and online) in order to grab some interesting key messages and get a rather complete overview within an 8-week timeframe. Be sure you have a sufficient mathematical background to follow (maybe Calculus/mathematics before?) and, more importantly, be certain you can practice afterwards!” 

My first analytics assignment on predictive modelling taught me a lot: the need to brush up on mathematics basics, statistics fundamentals, data mining techniques and visualization techniques.
I was lucky enough to find many of the courses on my TODO list at Coursera.org and Udacity:
  • Machine Learning – for Machine Learning Algorithms 
  • Neural Networks for Machine Learning 
  • Data Analysis – introduction to R, a programming language for data analysis 
  • Making Sense of Data - basics of Statistics 
  • Introduction to Data Science 
  • Social Network Analysis
The image below, which I found on the internet, is a simplified road map for a Data Scientist; it has become my target for the next few months, combined with hands-on practice.
Road map to data science 
After completing the courses mentioned above, I got the opportunity to play around with data using text mining, classification and predictive modelling techniques, both within my organization and outside (kaggle.com).
This post would be incomplete if I didn't mention Kaggle, the website where I found the much-needed hands-on experience. Kaggle is an online open competition forum where you can find data science use cases posted by big companies. The advantage of working on this site is that you not only get hands-on experience but also grow your network of similarly competent techies.
Apart from Kaggle, I have made use of LinkedIn data science/data analysis groups, where experts in the field gave me suggestions and answers to almost every question I had.
Data Analytics Should Meet Big Data: 
“Learn a bit of Java, Learn a bit of Linux – shell scripting, Learn Hadoop” – an advice from my friend.

After completing a few POCs and beginning to work on real-time projects, I soon realized the need to handle huge amounts of data and to learn Big Data. In my personal opinion, a Data Scientist should have very strong knowledge of Big Data and Hadoop.
For this, I plan to take the path of learning Big Data Hadoop concepts, Linux shell scripting and MongoDB for data storage over the next 3 months.
That is where I am, folks, after one year; there is a long way to go before becoming a Data Scientist, but I would still love to hear someone call me one. Over the last year I have spent most of my time learning new concepts and technologies, and I am eager to see the theory shaping into work.
In my next blog post I will explain the tools, concepts, technologies and online forums available on the internet.

Tuesday, December 17, 2013

Cluster Analysis using R

In this post I will explain Cluster Analysis: the process of grouping objects/individuals together in such a way that objects/individuals within one group are more similar to each other than to those in other groups.
For example, from a ticket booking engine database we can identify clients with similar booking activities and group them together (into clusters). These identified clusters can later be targeted for business improvement, for instance by issuing special offers.
Cluster Analysis falls under unsupervised learning algorithms: the data to be analyzed is provided to a clustering algorithm, which identifies the hidden patterns within it, as shown in the figure below.
In the image above, the clustering algorithm has grouped the input data into two groups. There are three popular clustering algorithms: Hierarchical Cluster Analysis, K-Means Cluster Analysis and Two-Step Cluster Analysis, of which today I will be dealing with K-Means clustering.
Explaining the k-Means Clustering Algorithm: 
In the K-means algorithm, k stands for the number of clusters (groups) to be formed; hence this algorithm is used when the number of groups to be found in the analyzed data is known in advance.
K-means is an iterative algorithm with two steps: first a Cluster Assignment step, and second a Move Centroid step.
CLUSTER ASSIGNMENT STEP: In this step we randomly choose two cluster centroids (the red and green dots) and assign each data point to whichever centroid is closer to it (top part of the image below).
MOVE CENTROID STEP: In this step we take the average of all the points assigned to each group and move the centroid to that newly calculated mean position (bottom part of the image below).
The two steps are repeated until the data points settle into their two groups and the centroids no longer move at the end of the Move Centroid step.


By repeating these steps, the final grouping of the input data is obtained.
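To make the two steps concrete, here is a small illustrative sketch in R (my own addition, using a toy randomly generated 2-D data set; the actual analysis below relies on R's built-in kmeans() function instead):

set.seed(42)
pts <- matrix(rnorm(40), ncol = 2)        # 20 random 2-D points
centroids <- pts[sample(nrow(pts), 2), ]  # pick 2 points as initial centroids
for (iter in 1:10) {
  # Cluster assignment step: assign each point to its nearest centroid
  d1 <- rowSums((pts - matrix(centroids[1, ], nrow(pts), 2, byrow = TRUE))^2)
  d2 <- rowSums((pts - matrix(centroids[2, ], nrow(pts), 2, byrow = TRUE))^2)
  assignment <- ifelse(d1 < d2, 1, 2)
  # Move centroid step: move each centroid to the mean of its assigned points
  centroids[1, ] <- colMeans(pts[assignment == 1, , drop = FALSE])
  centroids[2, ] <- colMeans(pts[assignment == 2, , drop = FALSE])
}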

Cluster Analysis on Accidental Deaths by Natural Causes in India using R 
An implementation of the k-means algorithm is readily available in R, along with the cluster package (used below to visualize the results), which can be downloaded from CRAN. Using these, we shall do a cluster analysis of accidental deaths by natural causes in India.
The steps implemented are discussed below: 
The data for our analysis, covering the years 2001 to 2012, was downloaded from www.data.gov.in. The input data is displayed below: 
For cluster analysis, all the features have to be converted to numeric values, and the larger values in the Year column are converted to z-scores for better results. 
The Elbow method (code available below) is run to find the optimal number of clusters present within the data points. 
The K-means clustering function is then run and the results are visualized as below:
Code:
#Fetch data
data = read.csv("Cluster Analysis.csv")
#Keep only the Andhra Pradesh records and aggregate male/female death counts
APStats = data[which(data$STATE == 'ANDHRA PRADESH'),]
APMale = rowSums(APStats[,4:8])      #male deaths across age groups
APFemale = rowSums(APStats[,9:13])   #female deaths across age groups
APStats[,'APMale'] = APMale
APStats[,'APFemale'] = APFemale
data = APStats[c(2,3,14,15)]         #keep CAUSE, Year, APMale, APFemale
library(cluster)
library(graphics)
library(ggplot2)
#factor the categorical CAUSE field and convert it to numeric codes
data$CAUSE = as.numeric(factor(data$CAUSE))
#Convert the Year column to z-scores
data$Year = (data$Year - mean(data$Year)) / sd(data$Year)
#Run K-means (cluster assignment & move centroid steps) for k = 1..100
cost_df <- data.frame()

for(i in 1:100){
  km <- kmeans(x=data, centers=i, iter.max=100)
  cost_df <- rbind(cost_df, cbind(i, km$tot.withinss))
}
names(cost_df) <- c("cluster", "cost")

#Elbow method to identify the ideal number of clusters
#Cost plot
ggplot(data=cost_df, aes(x=cluster, y=cost, group=1)) +
theme_bw(base_family="Garamond") +
geom_line(colour = "darkgreen") +
theme(text = element_text(size=20)) +
ggtitle("Reduction In Cost For Values of 'k'\n") +
xlab("\nClusters") +
ylab("Within-Cluster Sum of Squares\n")
#Fit the final model with the k chosen from the elbow plot (5 here)
clust = kmeans(data, 5)
#Visualize the clusters; labels=2 labels points and ellipses (valid values are 0-5)
clusplot(data, clust$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
#Attach the cluster assignment to the data and inspect one cluster
data[,'cluster'] = clust$cluster
head(data[which(data$cluster == 5),])



Thursday, October 24, 2013

Fetch Twitter data using R


This short post explains how you can fetch Twitter data using the twitteR and streamR packages available in R. In order to connect to the Twitter API, we need to go through an authentication process known as OAuth, explained in my previous post.
Twitter data can be fetched in two ways: a) the REST API and b) the Streaming API.
In today's blog post we shall go through both.
twitteR Package:
twitteR is one of the packages available in R for fetching Twitter data. The package can be obtained from here.
This package allows us to make REST API calls to Twitter using the ConsumerKey & ConsumerSecret codes. The code below illustrates how to extract Twitter data.
This package offers the following functionality:
  • Authenticate with Twitter API
  • Fetch User timeline
  • User Followers
  • User Mentions
  • Search twitter
  • User Information
  • User friends information
  • Location based Trends
  • Convert JSON object to dataframes
REST API CALLS using R - twitteR package: 
  1. Register your application with Twitter.
  2. After registration, you will get the ConsumerKey & ConsumerSecret codes, which are needed for calling the Twitter API.
  3. Load the twitteR library in the R environment.
  4. Create the OAuth object using the OAuthFactory$new() method, with the ConsumerKey & ConsumerSecret codes as input parameters.
  5. The handshake step will return an authorization link, which needs to be copied & pasted into a web browser.
  6. You will be redirected to the Twitter application authentication page, where you need to authenticate yourself by providing your Twitter credentials.
  7. After authenticating, you will be given an authorization code, which needs to be pasted into the R console.
  8. Call registerTwitterOAuth().
Source Code:
library(twitteR)
#Twitter OAuth endpoints
requestURL <-  "https://api.twitter.com/oauth/request_token"
accessURL =    "https://api.twitter.com/oauth/access_token"
authURL =      "https://api.twitter.com/oauth/authorize"
#Keys obtained when registering your application with Twitter
consumerKey =   "XXXXXXXXXXXX"
consumerSecret = "XXXXXXXXXXXXXXXX"
twitCred <- OAuthFactory$new(consumerKey=consumerKey,
                             consumerSecret=consumerSecret,
                             requestURL=requestURL,
                             accessURL=accessURL,
                             authURL=authURL)
#Download the CA certificate bundle needed for the SSL handshake
download.file(url="http://curl.haxx.se/ca/cacert.pem",
              destfile="cacert.pem")
#Opens the authorization link; paste the authorization code back into the R console
twitCred$handshake(cainfo="cacert.pem")
#Save the credentials so they can be reused without repeating the handshake
save(list="twitCred", file="twitteR_credentials")
load("twitteR_credentials")
registerTwitterOAuth(twitCred)#Register your app with Twitter.
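Once the handshake is complete and the credentials are registered, REST calls can be made right away. As a small illustrative example (the hashtag and count below are arbitrary choices of mine, not from the original post):

#Search recent tweets and convert the result to a data frame
bigdata_tweets = searchTwitter("#bigdata", n=100, lang="en")
bigdata_df = twListToDF(bigdata_tweets)
head(bigdata_df$text)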
streamR Package:
This package allows users to fetch Twitter data in real time by connecting to the Twitter Streaming API. We can obtain the package from here. It gives R users access to Twitter's filtered, sampled and user streams and parses the output into data frames. A few important functions this package offers are:

filterStream() - opens a connection to Twitter's Streaming API that returns public statuses matching one or more filter predicates, such as search keywords. Tweets can be filtered by keywords, users, language and location. The output can be saved as an object in memory or written to a text file.

parseTweets() - This function parses tweets downloaded using filterStream, sampleStream or userStream and returns a data frame.
The code example below shows how to fetch data in real time using streamR:
library(streamR)
library(twitteR)
load("twitteR_credentials")  # make using the save credentials in the previous code.
registerTwitterOAuth(twitCred)
filterStream(file.name = "tweets.json", track = "#bigdata",timeout = 0, locations=c(-74,40,-73,41), oauth = twitCred)
Executing the above captures tweets containing "#bigdata" or originating from the New York area (the Streaming API combines the track and locations filters with an OR). Setting timeout = 0 makes it fetch continuously; to fetch records for a fixed period instead, use e.g. timeout = 300 (fetches data for 300 seconds).
To parse the fetched tweets, use the code below:
tweets.df <- parseTweets("tweets.json")
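The resulting data frame contains one row per tweet, with the tweet text in the text column. A quick, illustrative sanity check on the capture:

nrow(tweets.df)       # number of tweets captured
head(tweets.df$text)  # first few tweet texts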


Sunday, October 6, 2013

Topic Modeling in R

As part of my Twitter data analysis series, so far I have covered a movie review engine using R and document classification using R. Today we will deal with discovering topics in tweets, i.e. mining the tweet data to discover its underlying topics - an approach known as Topic Modeling.
What is Topic Modeling?
Topic modeling is a statistical approach for discovering abstract "topics" in a collection of text documents, based on the statistics of each word. In simple terms, it is the process of looking at a large collection of documents, identifying clusters of words that tend to appear together, grouping them by similarity, and identifying the patterns in which those clusters appear.
Consider the below Statements:
  1. I love playing cricket.
  2. Sachin is my favorite cricketer.
  3. Titanic is heart touching movie.
  4. Data Analytics is the next big thing in IT.
  5. Data Analytics & Big Data complement each other.
When we apply topic modeling to the above statements, we are able to group statements 1 & 2 as Topic 1 (which we can later identify as Sport), statement 3 as Topic 2 (Movies), and statements 4 & 5 as Topic 3 (Data Analytics).
fig: Identifying topics in Documents and classifying as Topic 1 & Topic 2

Latent Dirichlet Allocation algorithm (LDA):
Topic modeling can be achieved using the Latent Dirichlet Allocation (LDA) algorithm. Without going into the nuts and bolts of the algorithm, LDA automatically learns to assign probabilities to each and every word in the corpus and classifies the documents into topics. A simple explanation of LDA can be found here.
Twitter Data Analysis Using LDA:
Steps Involved:
  1. Fetch tweets data using ‘twitteR’ package.
  2. Load the data into the R environment.
  3. Clean the data to remove: re-tweet information, links, special characters, emoticons, and frequent words like is, as, this, etc.
  4. Create a Term Document Matrix (TDM) using the 'tm' package.
  5. Calculate TF-IDF, i.e. Term Frequency - Inverse Document Frequency, for all the words in the matrix created in Step 4.
  6. Exclude all words with TF-IDF <= 0.1, to remove the uninformative words that appear across most of the tweets.
  7. Calculate the optimal number of topics (k) in the corpus using the log-likelihood method on the matrix from Step 6.
  8. Apply the LDA method using the topicmodels package to discover topics.
  9. Evaluate the model.
Conclusion:
Topic modeling using LDA is a very good method for discovering the underlying topics. The analysis gives good results only when we have a large corpus. In the above analysis, using tweets about the top 5 airlines, I could find that one of the topics people talk about is the FOOD being served. We can then use sentiment analysis techniques to mine what people think and say about products, companies, etc.

Source Code:
library("tm")
library("wordcloud")
library("slam")
library("topicmodels")
#Load Text
con <- file("tweets.txt", "rt")
tweets = readLines(con)
#Clean Text
tweets = gsub("(RT|via)((?:\\b\\W*@\\w+)+)","",tweets)
tweets = gsub("http[^[:blank:]]+", "", tweets)
tweets = gsub("@\\w+", "", tweets)
tweets = gsub("[ \t]{2,}", "", tweets)
tweets = gsub("^\\s+|\\s+$", "", tweets)
tweets <- gsub('\\d+', '', tweets)
tweets = gsub("[[:punct:]]", " ", tweets)
corpus = Corpus(VectorSource(tweets))
corpus = tm_map(corpus,removePunctuation)
corpus = tm_map(corpus,stripWhitespace)
corpus = tm_map(corpus,tolower)
corpus = tm_map(corpus,removeWords,stopwords("english"))
tdm = DocumentTermMatrix(corpus) # Create a document-term matrix (documents as rows, terms as columns)
# create tf-idf matrix
term_tfidf <- tapply(tdm$v/row_sums(tdm)[tdm$i], tdm$j, mean) * log2(nDocs(tdm)/col_sums(tdm > 0))
summary(term_tfidf)
tdm <- tdm[,term_tfidf >= 0.1]
tdm <- tdm[row_sums(tdm) > 0,]
summary(col_sums(tdm))
#Deciding best K value using Log-likelihood method
best.model <- lapply(seq(2, 50, by = 1), function(d){LDA(tdm, d)})
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
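# Illustrative addition (not in the original post): one simple way to choose k is to
# plot the log-likelihood against the number of topics and pick the k where it peaks
# or levels off.
best.model.logLik.df <- data.frame(topics = seq(2, 50, by = 1),
                                   logLik = as.numeric(unlist(best.model.logLik)))
with(best.model.logLik.df, plot(topics, logLik, type = "b",
                                xlab = "Number of topics k", ylab = "Log-likelihood"))
best.model.logLik.df$topics[which.max(best.model.logLik.df$logLik)]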
#calculating LDA
k = 50 #number of topics
SEED = 786 #random seed for reproducibility
CSC_TM <- list(
  VEM = LDA(tdm, k = k, control = list(seed = SEED)),
  VEM_fixed = LDA(tdm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
  Gibbs = LDA(tdm, k = k, method = "Gibbs",
              control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000)),
  CTM = CTM(tdm, k = k,
            control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3)))
)
#To compare the fitted models, we first look at the alpha values of the model fitted with VEM (alpha estimated) and with VEM (alpha fixed)
sapply(CSC_TM[1:2], slot, "alpha")
sapply(CSC_TM, function(x) mean(apply(posterior(x)$topics, 1, function(z) - sum(z * log(z)))))
Topic <- topics(CSC_TM[["VEM"]], 1)
Terms <- terms(CSC_TM[["VEM"]], 8)
Terms



Tuesday, August 20, 2013

Sentiment Analysis using R



Today I will explain how to create a basic movie review engine in R, based on people's tweets.
The implementation of the review engine will be as follows:
  • Get tweets from Twitter
  • Clean the data
  • Create a word cloud
  • Create a data dictionary
  • Score each tweet
Get Tweets from Twitter:
The first step is to fetch the data from Twitter. In R, we can call the Twitter API using the twitteR package; the steps for fetching tweets with it are shown below. Each tweet contains:
  • Text
  • Is re-tweeted
  • Re-tweet count
  • Tweeted User name
  • Latitude/Longitude 
  • Replied to, etc.
For our case we consider only the text of the tweet, as we are interested in the review of the movie. The other features, such as latitude/longitude, replied-to, etc., can be used for other kinds of analysis on the tweeted data.

          library(twitteR)  #searchTwitter() comes from the twitteR package
          tweets = searchTwitter("#ChennaiExpress", n=500, lang="en")


Clean the data:
In the next step, we need to clean the data so that we can use it for our analysis. Cleaning the data is a very important step in data analysis. This step includes:

Extracting only text from Tweets:
tweets_txt = sapply(tweets,function(x) x$getText())

Removing URL links, reply-to tags, punctuation, non-alphanumeric characters, extra spaces, etc.:
             tweets_cl = gsub("(RT|via)((?:\\b\\W*@\\w+)+)","",tweets)
             tweets_cl = gsub("http[^[:blank:]]+", "", tweets_cl)
             tweets_cl = gsub("@\\w+", "", tweets_cl)
             tweets_cl = gsub("[ \t]{2,}", "", tweets_cl)
             tweets_cl = gsub("^\\s+|\\s+$", "", tweets_cl)
             tweets_cl = gsub("[[:punct:]]", " ", tweets_cl)
             tweets_cl = gsub("[^[:alnum:]]", " ", tweets_cl)
             tweets_cl <- gsub('\\d+', '', tweets_cl)
Create a Word Cloud:
At this point, let us view a word cloud of the most frequently tweeted words in the data, for a visual understanding of what we are analyzing.
library(wordcloud)
wordcloud(tweets_cl)
               
Create a data dictionary:
In this step, we use a dictionary containing positive and negative words (the AFINN word list), which is downloaded from here. These two types of words are used as keywords for classifying each tweet into one of 4 categories: Very Positive, Positive, Negative and Very Negative.
Score each tweet:
In this step, we write a function that calculates the rating of the movie; the function is given below. After calculating the scores, we plot a graph showing the ratings as "WORST", "BAD", "GOOD" and "VERY GOOD".

Future steps in this project will be:
  • To create a UI preferably using .NET, as I’m a dot-net developer ;)
  • To build a movie review model which can classify a new tweet as and when it is provided.
Code:

#include required libraries
library(plyr)
library(twitteR)
library(stringr)


#get the tweets
tweets = searchTwitter("#ChennaiExpress", n=500, lang="en")
tweets_txt = sapply(tweets[1:50],function(x) x$getText())

#function to clean data
cleanTweets = function(tweets)
{
tweets_cl = gsub("(RT|via)((?:\\b\\W*@\\w+)+)","",tweets)
tweets_cl = gsub("http[^[:blank:]]+", "", tweets_cl)
tweets_cl = gsub("@\\w+", "", tweets_cl)
tweets_cl = gsub("[ \t]{2,}", "", tweets_cl)
tweets_cl = gsub("^\\s+|\\s+$", "", tweets_cl)
tweets_cl = gsub("[[:punct:]]", " ", tweets_cl)
tweets_cl = gsub("[^[:alnum:]]", " ", tweets_cl)
tweets_cl <- gsub('\\d+', '', tweets_cl)
return(tweets_cl)
}

#function to calculate number of words in each category within a sentence
sentimentScore <- function(sentences, vNegTerms, negTerms, posTerms, vPosTerms){
  final_scores <- matrix('', 0, 5)
  scores <- laply(sentences, function(sentence, vNegTerms, negTerms, posTerms, vPosTerms){
    initial_sentence <- sentence
    #remove unnecessary characters and split up by word
        sentence = cleanTweets(sentence)
        sentence <- tolower(sentence)
        wordList <- str_split(sentence, '\\s+')
    words <- unlist(wordList)
    #build vector with matches between sentence and each category
    vPosMatches <- match(words, vPosTerms)
    posMatches <- match(words, posTerms)
    vNegMatches <- match(words, vNegTerms)
    negMatches <- match(words, negTerms)
    #sum up number of words in each category
    vPosMatches <- sum(!is.na(vPosMatches))
    posMatches <- sum(!is.na(posMatches))
    vNegMatches <- sum(!is.na(vNegMatches))
    negMatches <- sum(!is.na(negMatches))
    score <- c(vNegMatches, negMatches, posMatches, vPosMatches)
    #add row to scores table
    newrow <- c(initial_sentence, score)
    final_scores <- rbind(final_scores, newrow)
    return(final_scores)
  }, vNegTerms, negTerms, posTerms, vPosTerms)
  return(scores)
}


#load pos,neg statements
afinn_list <- read.delim(file='~/AFINN-111.txt', header=FALSE, stringsAsFactors=FALSE)
names(afinn_list) <- c('word', 'score')
afinn_list$word <- tolower(afinn_list$word)

#categorize words as very negative to very positive and add some movie-specific words
vNegTerms <- afinn_list$word[afinn_list$score==-5 | afinn_list$score==-4]
negTerms <- c(afinn_list$word[afinn_list$score==-3 | afinn_list$score==-2 | afinn_list$score==-1], "second-rate", "moronic", "third-rate", "flawed", "juvenile", "boring", "distasteful", "ordinary", "disgusting", "senseless", "static", "brutal", "confused", "disappointing", "bloody", "silly", "tired", "predictable", "stupid", "uninteresting", "trite", "uneven", "outdated", "dreadful", "bland")
posTerms <- c(afinn_list$word[afinn_list$score==3 | afinn_list$score==2 | afinn_list$score==1], "first-rate", "insightful", "clever", "charming", "comical", "charismatic", "enjoyable", "absorbing", "sensitive", "intriguing", "powerful", "pleasant", "surprising", "thought-provoking", "imaginative", "unpretentious")
vPosTerms <- c(afinn_list$word[afinn_list$score==5 | afinn_list$score==4], "uproarious", "riveting", "fascinating", "dazzling", "legendary")   

#Calculate score on each tweet
tweetResult <- as.data.frame(sentimentScore(tweets_txt, vNegTerms, negTerms, posTerms, vPosTerms))
tweetResult$'2' = as.numeric(tweetResult$'2')
tweetResult$'3' = as.numeric(tweetResult$'3')
tweetResult$'4' = as.numeric(tweetResult$'4')
tweetResult$'5' = as.numeric(tweetResult$'5')
counts = c(sum(tweetResult$'2'),sum(tweetResult$'3'),sum(tweetResult$'4'),sum(tweetResult$'5'))
names = c("Worst","BAD","GOOD","VERY GOOD")
mr = list(counts,names)
colors = c("red", "yellow", "green", "violet")
barplot(mr[[1]], main="Movie Review", xlab="Number of votes",legend=mr[[2]],col=colors)

Thursday, July 25, 2013

Document Classification using R



Recently I have developed an interest in analyzing data to find trends, predict future events, etc., and have started working on a few data analytics POCs such as predictive analysis and text mining. This blog post is on data mining - more specifically, document classification using the R programming language, one of the powerful languages used for statistical analysis.
What is Document classification?
Document classification, or document categorization, is the task of classifying documents into one or more classes/categories, either manually or algorithmically. Today we try to classify algorithmically. Document classification falls under supervised machine learning.
Technically speaking, we create a machine learning model using a number of text documents (called a corpus) as input and their corresponding classes/categories (called labels) as output. The model thus generated will be able to assign a class when a new text is supplied.
Inside the Black Box:
Let's have a look at what happens inside the black box in the figure above. We can divide the process into the following steps:
  • Creation of the corpus
  • Preprocessing of the corpus
  • Creation of the Term Document Matrix
  • Preparing features & labels for the model
  • Creating train & test data
  • Running the model
  • Testing the model
To understand the above steps in detail, let us consider a small use case: we have speeches of the US presidential contestants Mr. Obama and Mr. Romney, and we need to create a classifier which can tell whether a particular new speech belongs to Mr. Obama or Mr. Romney.
Implementation

We implement the document classification using the tm and plyr packages; as a preliminary step, we need to load the required libraries into the R environment.

  • Step I: Corpus creation
    A corpus is a large and structured set of texts used for analysis. In our case, we create two corpora, one for each contestant.

  • Step II: Preprocessing of the corpus
    The created corpus needs to be cleaned before we use the data for our analysis. Preprocessing involves removal of punctuation, white space and stop words such as is, the, for, etc.

  • Step III: Term Document Matrix
    This step involves the creation of a Term Document Matrix, i.e. a matrix holding the frequency of the terms that occur in the collection of documents. For example:
    D1 = “I love Data analysis”
    D2 = “I love to create data models”
    TDM (terms vs. documents):
                 D1   D2
    i             1    1
    love          1    1
    data          1    1
    analysis      1    0
    to            0    1
    create        0    1
    models        0    1


  • Step IV: Feature extraction & labels for the model
    In this step, we extract the input feature words that are useful in distinguishing the documents, and attach the corresponding classes as labels.

  • Step V: Train & test data preparation
    In this step, we first randomize the data and then divide the data containing features and labels into training (70%) and test (30%) sets before we feed them into our model.

  • Step VI: Running the model
    We create our model using the training data separated in the earlier step. We use the KNN model, a description of which can be found here.

  • Step VII: Testing the model
    Now that the model is created, we have to test its accuracy using the test data created in Step V.
Find the complete code here.
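For readers who want a self-contained starting point, here is a minimal sketch of the pipeline described above, using the tm and class packages. The file names (obama.txt and romney.txt, assumed to contain one speech per line), the sparsity threshold and the 70/30 split are my own illustrative assumptions, not the exact setup of the original code.

library(tm)
library(class)  # provides knn()

# Assumed input: one speech per line, one file per speaker (hypothetical file names)
obama  <- readLines("obama.txt")
romney <- readLines("romney.txt")
texts  <- c(obama, romney)
labels <- factor(c(rep("Obama", length(obama)), rep("Romney", length(romney))))

# Steps I & II: build and preprocess the corpus
corpus <- Corpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Step III: term document matrix (documents as rows, terms as columns)
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.99)   # drop very rare terms to keep the matrix manageable
features <- as.data.frame(as.matrix(dtm))

# Steps IV & V: randomize and split into 70% training / 30% test
set.seed(123)
idx <- sample(nrow(features), size = 0.7 * nrow(features))
train <- features[idx, ];  train_labels <- labels[idx]
test  <- features[-idx, ]; test_labels  <- labels[-idx]

# Step VI: run the KNN classifier
pred <- knn(train = train, test = test, cl = train_labels, k = 3)

# Step VII: test the model's accuracy
mean(pred == test_labels)
table(predicted = pred, actual = test_labels)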

Tuesday, July 16, 2013

OAuth Authentication - Part 2 - Signup using Social Networking Sites



As a continuation of my previous post, in this post I will explain how to implement sign-up to a web application through social networks using the OAuth 2.0 protocol.
I will not be going through the coding part; instead I will explain the steps to be followed.
In this post, I will explain how to register our application with the Google server, access the user's profile details from Google, and capture them in our database tables for future user authentication.
The implementation includes the steps below:
  • Registration of the application with the social network
  • Authentication, along with requesting the scope of access
  • User authorization for the application to access resource server data
  • Call to the resource server API for access
  • Saving user details to the application database table
Flow Diagram:


Registering the Application with the Google Server:
Every application that needs to access resources from the Google server has to be registered with it.
Registration steps are explained below:
  • Enter the name for your project & click Create Project.
  • As a next step, we need to create a Client ID for the newly created project. Click on “Create an OAuth 2.0 Client ID”.
  • Enter your product name & home page URL. Click on the Next button.
  • Select Web application and provide the Redirect URL in the next step; this is the page the Google server will redirect to after the user authenticates & authorizes.
  • Click on Create Client ID.
  • Now the Client ID and Client Secret for your newly registered application will be generated, as shown in the image. The Client ID & Client Secret are very important, as they are used while making API calls to fetch information from Google. Make sure you do not share them with others.

Authentication Step:
In the next step, our application should take the user through an authentication process with the Google server, along with the SCOPE of access that the application is requesting for the user's account.
The parameters required for this authentication step are as follows:
  • scope: the access level the application is requesting from the Google server.
  • response_type: should be code.
  • state: should include the value of an anti-forgery unique session token.
  • redirect_uri: the URL to which the Google server will redirect after the user authenticates. This is the same URL you gave in the registration step.
Note: This post just explains how to sign up to a website with Google accounts; in a similar way we can access resources from Google/Facebook/LinkedIn/Twitter by sending the proper scope parameter and API call.
A sample call to the Google server is shown below:
https://accounts.google.com/o/oauth2/auth?state=%2Fprofile&redirect_uri=http://localhost:14964/GoogleRedirectUrl.aspx&response_type=code&client_id=565625245779.apps.googleusercontent.com&approval_prompt=force&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.profile
Authorization Step:
A call to the above link takes the user to the Google login screen, where the user needs to authenticate. Once the user authenticates, he is redirected to Google's authorization screen. Google's authorization server displays the name of your application and the Google services it is requesting permission to access on the user's behalf. The user can then consent or refuse to grant access to your application. Either way, Google will then redirect the user to the redirect URL that you specified in the registration step.
At this point, the user can check the details that Google will expose to the application by simply clicking on the information icon (i) in the image below.

If the user grants access to your application, Google appends a code parameter to the redirect_uri and returns control to the application.
http://localhost/redirecturl?code=4/ux5gNj-_mIu4DOD_gNZdjX9EtOFf
The code obtained in the above step is a temporary authorization code that can be exchanged for an access_token by making an HTTPS POST request to Google's token endpoint, which should include the parameters below (the standard OAuth 2.0 authorization-code exchange parameters):
  • code: the authorization code returned in the previous step
  • client_id & client_secret: obtained when registering the application
  • redirect_uri: the same redirect URL specified in the registration step
  • grant_type: authorization_code
The above HTTPS request returns a JSON object with the details below:

{
  "access_token"  : "ya29.AHES6ZTtm7SuokEB-RGtbBty9IIlNiP9-eNMMQKtXdMP3sfjL1Fc",
  "token_type"    : "Bearer",
  "expires_in"    : 3600,
  "refresh_token" : "1/HKSmLFXzqP0leUihZp2xUt3-5wkU7Gmu2Os_eBnzw74"
}
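As an illustration only (the original application was a .NET web app), the same token exchange could be performed from R with the httr package; the client ID, client secret and authorization code below are placeholders:

library(httr)

# Placeholder values - replace with the ones from your registered application
client_id     <- "YOUR_CLIENT_ID.apps.googleusercontent.com"
client_secret <- "YOUR_CLIENT_SECRET"
redirect_uri  <- "http://localhost:14964/GoogleRedirectUrl.aspx"
auth_code     <- "CODE_RETURNED_ON_THE_REDIRECT_URL"

# Exchange the temporary authorization code for an access token
resp <- POST("https://accounts.google.com/o/oauth2/token",
             body = list(code          = auth_code,
                         client_id     = client_id,
                         client_secret = client_secret,
                         redirect_uri  = redirect_uri,
                         grant_type    = "authorization_code"),
             encode = "form")
token <- content(resp)   # parsed JSON: access_token, token_type, expires_in, refresh_token
token$access_token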
Accessing APIs Step:
Finally, an API call is made, sending the access_token received in the previous step, to fetch the required profile details from the resource server.


Google returns the user's profile details to the application as shown below, and we can store the required data in our application tables for subsequent login authentication.



Thank you folks for your encouragement; please contact me for the code-level implementation.