Thursday, July 25, 2013

Document Classification using R

September 23, 2013


Recently I have developed an interest in analyzing data to find trends and predict future events, and have started working on a few POCs in data analytics such as predictive analysis and text mining. This post is about data mining, more specifically document classification using the R programming language, one of the most powerful languages for statistical analysis.
What is Document classification?
Document classification, or document categorization, means assigning documents to one or more classes/categories, either manually or algorithmically. Here we classify algorithmically. Document classification is a supervised machine learning technique.
Technically speaking, we create a machine learning model using a number of text documents (called the corpus) as input and their corresponding classes/categories (called labels) as output. The resulting model can then assign a class to any new text it is given.
Inside the Black Box:
Let’s look at what happens inside the black box in the above figure. We can divide the process into these steps:
  • Creation of Corpus
  • Preprocessing of Corpus
  • Creation of Term Document Matrix
  • Preparing Features & Labels for Model
  • Creating Train & Test data
  • Running the model
  • Testing the model
To understand the above steps in detail, let us consider a small use case:
We have speeches of the US presidential contestants Mr. Obama & Mr. Romney. We need to create a classifier that can tell whether a particular new speech belongs to Mr. Obama or Mr. Romney.
Implementation

We implement the document classification using the tm and plyr packages. As a preliminary step, we need to load the required libraries into the R environment.

  • Step I: Corpus creation
        A corpus is a large, structured set of texts used for analysis.
        In our case, we create two corpora, one for each contestant.
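
A minimal base R sketch of this step, assuming each contestant's speeches are stored as plain-text files in a folder of their own. The folder names, file names and sample sentences are made up for illustration; with the tm package a corpus is typically built from a directory source instead.

```r
# Hypothetical layout: one folder per candidate, one .txt file per speech.
# A temporary directory with made-up text stands in for the real speeches.
base_dir <- file.path(tempdir(), "speeches")
candidates <- c("obama", "romney")
for (cand in candidates) {
  dir.create(file.path(base_dir, cand), recursive = TRUE, showWarnings = FALSE)
}
writeLines("We will invest in education and health care.",
           file.path(base_dir, "obama", "speech1.txt"))
writeLines("We will cut taxes and reduce spending.",
           file.path(base_dir, "romney", "speech1.txt"))

# Read every .txt file in a folder into a character vector,
# one element per speech.
read_corpus <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  vapply(files,
         function(f) paste(readLines(f, warn = FALSE), collapse = " "),
         character(1), USE.NAMES = FALSE)
}

# One corpus per candidate, kept in a named list.
corpora <- lapply(setNames(candidates, candidates),
                  function(cand) read_corpus(file.path(base_dir, cand)))
str(corpora)
```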

  • Step II: Preprocessing of Corpus
        The created corpus needs to be cleaned before we use the data for our analysis.
        Preprocessing involves removing punctuation, white space, and stop words such as is,
        the, for, etc.
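
The cleaning described above can be sketched in base R as follows. The stop-word list and the sample sentence are illustrative assumptions; the tm package ships a much longer English stop-word list and its own cleaning helpers.

```r
# Tiny illustrative stop-word list; real lists are much longer.
stop_words <- c("is", "the", "for", "a", "an", "and", "to", "of")

clean_text <- function(x) {
  x <- tolower(x)                        # normalise case
  x <- gsub("[[:punct:]]+", " ", x)      # replace punctuation with a space
  words <- strsplit(x, "[[:space:]]+")   # split on runs of white space
  vapply(words, function(w) {
    w <- w[nzchar(w) & !(w %in% stop_words)]  # drop empties and stop words
    paste(w, collapse = " ")
  }, character(1))
}

clean_text("The economy is growing,   and jobs are the priority for America!")
```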
  • Step III: Term Document Matrix
        This step involves creating a term document matrix, i.e. a matrix holding the frequency of the terms that occur in a collection of documents. For example:
        D1 = “I love Data analysis”
        D2 = “I love to create data models”
        TDM (terms lower-cased):

                    D1  D2
        i            1   1
        love         1   1
        data         1   1
        analysis     1   0
        to           0   1
        create       0   1
        models       0   1
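
The matrix for D1 and D2 can be computed with a short base R sketch. This is only an illustration of the idea (rows are terms, columns are documents); the tm package provides its own term document matrix constructor, which is what the post relies on.

```r
docs <- c(D1 = "I love Data analysis",
          D2 = "I love to create data models")

# Lower-case and split each document into words.
tokens <- strsplit(tolower(docs), "[[:space:]]+")

# Vocabulary: every distinct term, in order of first appearance.
terms <- unique(unlist(tokens))

# Count each term in each document; rows are terms, columns are documents.
tdm <- vapply(tokens,
              function(w) as.vector(table(factor(w, levels = terms))),
              integer(length(terms)))
rownames(tdm) <- terms
tdm
```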


  • Step IV: Feature Extraction & Labels for the model
        In this step, we extract the input feature words that are useful in distinguishing the
        documents and attach the corresponding classes as labels.
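
As a hedged sketch of what the feature/label table can look like, the snippet below turns a few made-up, already-cleaned speeches into one row of term counts per document and attaches the known speaker as the label. The sentences and the column name `speaker` are assumptions for illustration.

```r
# Made-up, already-cleaned speeches; 'speaker' holds the known class of each.
speeches <- c("jobs economy health care", "health care education jobs",
              "taxes spending business", "business taxes economy")
speaker  <- c("obama", "obama", "romney", "romney")

tokens <- strsplit(speeches, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Documents as rows, term counts as feature columns.
features <- t(vapply(tokens,
                     function(w) as.vector(table(factor(w, levels = terms))),
                     integer(length(terms))))
colnames(features) <- terms

# Attach the label column so each row carries its class.
model_data <- data.frame(features, speaker = factor(speaker))
model_data
```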

  • Step V: Train & test data preparation
        In this step, we first randomize the data and then divide the data containing features &
        labels into training (70%) & test (30%) sets before we feed it into our model.
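
A minimal sketch of the 70/30 split in base R. The data frame here is a randomly generated stand-in for the feature/label table from Step IV, and the fixed seed is only there to make the shuffle reproducible.

```r
set.seed(42)  # fixed seed so the shuffle is reproducible

# Hypothetical stand-in for the feature/label table from Step IV.
model_data <- data.frame(f1 = rnorm(20), f2 = rnorm(20),
                         speaker = rep(c("obama", "romney"), each = 10))

# Shuffle the row indices, then take the first 70% for training.
shuffled  <- sample(nrow(model_data))
n_train   <- round(0.7 * nrow(model_data))
train_idx <- shuffled[seq_len(n_train)]

train_data <- model_data[train_idx, ]
test_data  <- model_data[-train_idx, ]

c(train = nrow(train_data), test = nrow(test_data))
```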

  • Step VI: Running the model
        We create our model using the training data separated in the earlier step. We use the KNN model, whose description can be found here.
  • Step VII: Testing the model
        Now that the model is created, we test its accuracy using the test data created in Step V.
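
To make Steps VI and VII concrete, here is a hand-rolled k-nearest-neighbour sketch in base R. It is not the post's own code: the two-feature data is randomly generated, `knn_predict` is a hypothetical helper written for this illustration, and in practice a packaged KNN implementation (for example the one in the class package) would normally be used.

```r
set.seed(1)

# Made-up two-feature data: the two classes sit around different centres.
n <- 50
features <- rbind(matrix(rnorm(2 * n, mean = 0), ncol = 2),
                  matrix(rnorm(2 * n, mean = 4), ncol = 2))
labels <- factor(rep(c("obama", "romney"), each = n))

# 70/30 split as in Step V.
train_idx <- sample(nrow(features), round(0.7 * nrow(features)))
train_x <- features[train_idx, ];  train_y <- labels[train_idx]
test_x  <- features[-train_idx, ]; test_y  <- labels[-train_idx]

# Hand-rolled kNN: for each test row, take the majority label among the
# k training rows with the smallest Euclidean distance.
knn_predict <- function(train_x, train_y, test_x, k = 3) {
  preds <- apply(test_x, 1, function(row) {
    d <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    nearest <- train_y[order(d)[seq_len(k)]]
    names(which.max(table(nearest)))
  })
  factor(preds, levels = levels(train_y))
}

predicted <- knn_predict(train_x, train_y, test_x, k = 3)

# Confusion matrix and accuracy on the held-out 30%.
confusion <- table(actual = test_y, predicted = predicted)
accuracy  <- sum(diag(confusion)) / sum(confusion)
confusion
accuracy
```

The confusion matrix shows how many test speeches of each class were assigned to each predicted class; accuracy is the share that landed on the diagonal.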
Find the complete code here.

Tuesday, July 16, 2013

OAuth Authentication - Part 2 - Signup using Social Networking Sites


September 23, 2013

As a continuation of my previous post, in this post I will explain how to implement sign-up to a web application through social networks using the OAuth 2.0 protocol.
I will not go through the coding part; instead I will explain the steps to be followed.
In this post, I will explain how to register our application with the Google server, access user profile details from Google, and capture them in our database tables for future user authentication.
The implementation includes the steps below:
  • Registration of the application with the social network
  • Authentication along with requesting the scope of access
  • User authorizing the application to access resource server data
  • Call to the resource server API for access
  • Saving user details to the application database table
Flow Diagram:


Registering Application with Google Server:
Every application that needs to access resources from the Google server must be registered with it.
Registration steps are explained below:
  • Enter the name for your project & click Create Project.
  • As a next step, we need to create a Client ID for the newly created project. Click on “Create an OAuth 2.0 Client ID”.
  • Enter your product name & home page URL. Click on the Next button.
  • Select Web application and provide the redirect URL in the next step; this is the page to which the Google server redirects after the user authenticates & authorizes.
  • Click on Create Client ID.
  • The Client ID & Client Secret for your newly registered application will now be generated, as shown in the image. The Client ID & Client Secret are very important, as they are used while making API calls to fetch information from Google. Make sure you do not share the Client ID & Client Secret with others.

Authentication Step:
In the next step, our application should take the user through an authentication process with the Google server, passing along the SCOPE of access that the application is requesting for the user’s account.
The parameters required for this authentication step are as below:
  • SCOPE: the access level that the application requests from the Google server.
  • Response_type: should be code.
  • State: should include the value of the anti-forgery unique session token.
  • Redirect_Uri: the URL to which the Google server redirects after the user authenticates. This is the same URL that you gave in the registration step.
Note: This post only explains how to sign up to a website with a Google account; in a similar way we can access resources from Google/FB/LinkedIn/Twitter by sending the proper SCOPE parameter and API call.
A sample call to the Google server is shown below:
https://accounts.google.com/o/oauth2/auth?state=%2Fprofile&redirect_uri=http://localhost:14964/GoogleRedirectUrl.aspx&response_type=code&client_id=565625245779.apps.googleusercontent.com&approval_prompt=force&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.profile
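
A sketch of how such a URL can be assembled from its parameters in base R, using the values from the sample call. Two assumptions differ from the sample: every value is percent-encoded (including the redirect URL), and the two scopes are joined with a space that encodes to %20 rather than the form-style "+". In a real application the state value would be the anti-forgery session token rather than "/profile".

```r
# Percent-encode a single value for use in a query string.
enc <- function(x) utils::URLencode(x, reserved = TRUE)

# Values copied from the sample call; the scopes are joined with a space.
params <- c(
  state           = "/profile",
  redirect_uri    = "http://localhost:14964/GoogleRedirectUrl.aspx",
  response_type   = "code",
  client_id       = "565625245779.apps.googleusercontent.com",
  approval_prompt = "force",
  scope           = paste("https://www.googleapis.com/auth/userinfo.email",
                          "https://www.googleapis.com/auth/userinfo.profile")
)

# Join name=value pairs with '&' and append them to the endpoint.
query <- paste0(names(params), "=", vapply(params, enc, character(1)),
                collapse = "&")
auth_url <- paste0("https://accounts.google.com/o/oauth2/auth?", query)
auth_url
```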
Authorization Step:
A call to the above link takes the user to the Google login screen, where the user needs to authenticate. Once authenticated, the user is redirected to Google’s authorization screen. Google's authorization server displays the name of your application and the Google services that it is requesting permission to access on the user's behalf. The user can then consent or refuse to grant access to your application. Either way, Google then redirects the user to the redirect URL that you specified in the registration step.
In this step, the user can check the details that Google will expose to the application by clicking on the information icon (i) in the image below.

If the user grants access to your application, Google appends a code parameter to the redirect_uri and returns to the application:
http://localhost/redirecturl?code=4/ux5gNj-_mIu4DOD_gNZdjX9EtOFf
The code obtained in the above step is a temporary authorization code that can be exchanged for an access_token by making an HTTPS POST request that includes the parameters below:
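
A minimal sketch of the request body, assuming the standard OAuth 2.0 authorization-code fields (RFC 6749, sections 4.1.3 and 2.3.1). The client secret is a placeholder, the redirect URL is the one from the sample call, and the snippet only builds the form-encoded body; it does not send the request.

```r
enc <- function(x) utils::URLencode(x, reserved = TRUE)

# Standard fields for exchanging an authorization code for a token.
# The secret below is a placeholder, not a real credential.
token_params <- c(
  code          = "4/ux5gNj-_mIu4DOD_gNZdjX9EtOFf",  # code from the redirect
  client_id     = "565625245779.apps.googleusercontent.com",
  client_secret = "YOUR_CLIENT_SECRET",
  redirect_uri  = "http://localhost:14964/GoogleRedirectUrl.aspx",
  grant_type    = "authorization_code"
)

# Body of the POST request, sent as application/x-www-form-urlencoded.
post_body <- paste0(names(token_params), "=",
                    vapply(token_params, enc, character(1)),
                    collapse = "&")
post_body
```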



The above HTTPS request returns a JSON object with the details below:

{
  "access_token" : "ya29.AHES6ZTtm7SuokEB-RGtbBty9IIlNiP9-eNMMQKtXdMP3sfjL1Fc",
  "token_type" : "Bearer",
  "expires_in" : 3600,
  "refresh_token" : "1/HKSmLFXzqP0leUihZp2xUt3-5wkU7Gmu2Os_eBnzw74"
}
Accessing APIs Step:
Finally, an API call should be made, sending the access_token received in the previous step, to fetch the required profile details from the resource server.
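
As a rough illustration, the token can be pulled out of the JSON response and turned into a bearer Authorization header (the usual way of presenting an OAuth 2.0 access token). The token value below is a placeholder, `json_field` is a hypothetical helper, and the regular expression only handles flat string fields; a JSON parser is the robust choice in real code.

```r
# Placeholder token response with the same shape as the one in the previous step.
token_json <- '{ "access_token" : "ya29.EXAMPLE_TOKEN", "token_type" : "Bearer", "expires_in" : 3600 }'

# Pull one string field out of a flat JSON object with a regular expression.
json_field <- function(json, field) {
  pattern <- sprintf('"%s"[[:space:]]*:[[:space:]]*"([^"]*)"', field)
  m <- regmatches(json, regexec(pattern, json))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

access_token <- json_field(token_json, "access_token")
token_type   <- json_field(token_json, "token_type")

# Header to send with the profile request to the resource server.
auth_header <- c(Authorization = paste(token_type, access_token))
auth_header
```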


Google returns the user profile details to the application as shown below, and we can store the required user data in our application tables for subsequent login authentication.



Thank you, folks, for your encouragement. Please contact me for the code-level implementation.

Sunday, July 7, 2013

OAuth Authentication - Part 1


As part of the tutorial series on Data Science with R from Data Perspective, in this first chapter we learn about the basic data types in R.

What we learn:

  • Assignment Operator
  • Numeric
  • Integer
  • Complex number
  • Character
  • Factor
  • Vector
  • Data Frame

At the end of the chapter, you are provided with an R console so that you can practice what you have learnt.
  • Assignment Operator
# assigning string literal to variable x
x = 'welcome to R programming' 
x
[1] "welcome to R programming"
#to check the data type of the variable x
typeof(x) 
[1] "character"