A Journey on Data Analysis, Predictive Analytics, Data Mining.

Thursday, July 25, 2013

Document Classification using R

September 23, 2013


Recently I have developed an interest in analyzing data to find trends and to predict future events, and I have started working on a few POCs in data analytics, such as predictive analysis and text mining. This post is about data mining, more specifically document classification using the R programming language, one of the powerful languages used for statistical analysis.

What is Document classification?

Document classification, or document categorization, is the task of assigning documents to one or more classes/categories, either manually or algorithmically. Today we classify algorithmically. Document classification is a supervised machine learning technique.
Technically speaking, we create a machine learning model using a number of text documents (called the corpus) as input and their corresponding classes/categories (called labels) as output. The model thus generated can then assign a class to any new text supplied to it.


Inside the Black Box:

Let’s have a look at what happens inside the black box in the above figure. We can divide the steps into:

  •      Creation of Corpus
  •      Preprocessing of Corpus 
  •      Creation of Term Document Matrix   
  •      Preparing Features & Labels for Model 
  •      Creating Train & test data 
  •      Running the model 
  •      Testing the model 
To understand the above steps in detail, let us consider a small use case:
We have speeches by two US presidential contestants, Mr. Obama and Mr. Romney. We need to create a classifier that can tell whether a particular new speech belongs to Mr. Obama or Mr. Romney.

Implementation

We implement the document classification using the tm and plyr packages. As a preliminary step, we need to load the required libraries into the R environment.
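
A minimal sketch of that setup, assuming the tm and plyr packages are already installed. Loading class is an assumption on my part: the post does not say which package supplies its KNN implementation, and class is simply a common choice.

```r
# Install once if needed: install.packages(c("tm", "plyr"))
library(tm)     # corpus handling and text preprocessing
library(plyr)   # data frame helpers such as rbind.fill()
library(class)  # knn(); assumed here, the post does not name its KNN package
```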

  • Step I: Corpus creation:

          A corpus is a large and structured set of texts used for analysis.
          In our case, we create two corpora, one for each contestant.
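
As a rough sketch of this step, the snippet below builds one corpus per contestant with tm. The texts are made-up placeholders rather than real speeches, and the variable names are hypothetical; in practice the speeches would more likely be read from one directory per contestant, for example with tm's DirSource().

```r
library(tm)

# Placeholder texts standing in for the speech files (not real speeches)
obama_texts  <- c("first placeholder speech about jobs",
                  "second placeholder speech about health")
romney_texts <- c("third placeholder speech about taxes",
                  "fourth placeholder speech about business")

# Build one corpus per contestant from the in-memory character vectors
obama_corpus  <- VCorpus(VectorSource(obama_texts))
romney_corpus <- VCorpus(VectorSource(romney_texts))

# Each element of a corpus is one document
length(obama_corpus)
length(romney_corpus)
```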


  • Step II: Preprocessing of Corpus

        The created corpus now needs to be cleaned before we use the data for our analysis.
        Preprocessing involves removing punctuation, extra white space and stop words such
        as is, the, for, etc.
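
The cleaning described above might look like the sketch below. The helper name clean_corpus and the sample sentence are hypothetical. Lowercasing is an extra step I have assumed, so that stop words match regardless of case; it is not in the list above.

```r
library(tm)

# Hypothetical raw text standing in for a speech
raw <- c("This is the FIRST sample,   for   testing!")
corpus <- VCorpus(VectorSource(raw))

# Apply the cleaning steps one after another
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))  # assumed extra step
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

cleaned <- clean_corpus(corpus)
content(cleaned[[1]])
```

The order matters a little: stop words are removed after lowercasing, and white space is collapsed last because removing words leaves gaps behind.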


  • Step III: Term Document Matrix

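         This step involves creating a term document matrix (TDM), i.e. a matrix holding the
         frequency of each term that occurs in a collection of documents.
         For example:

         D1 = “I love Data analysis”
         D2 = “I love to create data models”

         TDM (terms counted case-insensitively, so "Data" and "data" are the same term):

                     D1   D2
         i            1    1
         love         1    1
         data         1    1
         analysis     1    0
         to           0    1
         create       0    1
         models       0    1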
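
The term document matrix for D1 and D2 can be computed with tm roughly as follows. This is a sketch: the wordLengths setting is my addition so that short terms like "i" and "to" are kept, since tm drops terms shorter than three characters by default.

```r
library(tm)

docs <- c(D1 = "I love Data analysis", D2 = "I love to create data models")
corpus <- VCorpus(VectorSource(docs))

# tolower = TRUE merges "Data" and "data";
# wordLengths = c(1, Inf) keeps short terms such as "i" and "to"
tdm <- TermDocumentMatrix(
  corpus,
  control = list(tolower = TRUE, wordLengths = c(1, Inf))
)

# Convert the sparse matrix to a plain one: rows are terms, columns are documents
m <- as.matrix(tdm)
colnames(m) <- names(docs)
m
```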





  • Step IV: Feature Extraction & Labels for the model:

        In this step, we extract the input feature words that are useful in distinguishing the
        documents and attach the corresponding classes as labels.
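
One way to sketch this step: transpose each term document matrix so that every document becomes a row of term counts, attach a label column, and stack the two classes with plyr's rbind.fill(). The helper labelled_features, the column name target_label and the placeholder texts are all hypothetical. For simplicity every term is kept as a feature here; a selection step such as tm's removeSparseTerms() could be applied to the matrix first.

```r
library(tm)
library(plyr)

# Turn a character vector of documents into a labelled feature data frame:
# one row per document, one column per term, plus a label column.
labelled_features <- function(texts, label) {
  tdm <- TermDocumentMatrix(VCorpus(VectorSource(texts)))
  df <- as.data.frame(t(as.matrix(tdm)), stringsAsFactors = FALSE)
  df$target_label <- label
  df
}

# Placeholder documents for two classes "A" and "B"
a <- labelled_features(c("placeholder alpha words", "alpha beta words"), "A")
b <- labelled_features(c("placeholder gamma text", "gamma delta text"), "B")

# rbind.fill stacks the rows; terms missing from one class come back as NA
stacked <- rbind.fill(a, b)
stacked[is.na(stacked)] <- 0  # a missing term means a count of zero
stacked
```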



  • Step V: Train & test data preparation

        In this step, we first randomize the data and then divide the data containing features
        and labels into training data (70%) and test data (30%) before feeding it into our model.
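
A small base R sketch of the shuffle and the 70/30 split. The data frame, its column names and the seed are made up for illustration.

```r
set.seed(42)  # arbitrary seed so the shuffle is reproducible

# Hypothetical feature data: 10 documents, 2 term counts and a label
data <- data.frame(
  term_a = c(3, 2, 4, 3, 5, 0, 1, 0, 1, 0),
  term_b = c(0, 1, 0, 1, 0, 4, 3, 5, 2, 4),
  target_label = rep(c("A", "B"), each = 5)
)

# Randomize the row order, then cut at 70%
shuffled <- data[sample(nrow(data)), ]
n_train <- round(0.7 * nrow(shuffled))

train <- shuffled[seq_len(n_train), ]
test  <- shuffled[-seq_len(n_train), ]

nrow(train)
nrow(test)
```

Shuffling first matters because the rows are usually stacked class by class; cutting without shuffling would put most of one class in the test set.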


  • Step VI: Running the model:

          We create our model using the training data separated in the earlier step.
          We use a KNN (k-nearest neighbours) model, whose description can be found here.
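
For illustration, here is a KNN run on a tiny made-up data set using knn() from the class package. The choice of package, the value k = 3 and all the numbers are assumptions, not taken from the original model.

```r
library(class)

# Hypothetical term counts for six labelled training documents
train_x <- data.frame(term_a = c(3, 2, 4, 0, 1, 0),
                      term_b = c(0, 1, 0, 4, 3, 5))
train_y <- factor(c("A", "A", "A", "B", "B", "B"))

# Two new documents to classify
test_x <- data.frame(term_a = c(4, 0),
                     term_b = c(1, 3))

# Each test row gets the majority label of its k nearest training rows
predicted <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
predicted
```

Note that knn() trains and predicts in a single call: there is no separate fitted model object, because KNN simply compares each new document with the stored training rows.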



  • Step VII: Test Model:

         Now that the model is created, we have to test the accuracy of the model using the test
         data created in Step V.
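
A common way to measure this is a confusion matrix plus the share of correct predictions. The sketch below uses made-up label vectors in place of real model output.

```r
# Hypothetical true and predicted labels for a held-out test set
actual    <- factor(c("A", "A", "B", "B", "A", "B"), levels = c("A", "B"))
predicted <- factor(c("A", "B", "B", "B", "A", "A"), levels = c("A", "B"))

# Rows are predictions, columns are the true classes
confusion <- table(Predicted = predicted, Actual = actual)
confusion

# Accuracy = correctly classified documents / all test documents
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```

The diagonal of the confusion matrix holds the correct predictions, so dividing its sum by the total gives the accuracy; the off-diagonal cells show which class gets mistaken for which.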