Blog posts on Data Science, Machine Learning, Data Mining, Artificial Intelligence, Spark Machine Learning

Thursday, July 25, 2013

Document Classification using R

September 23, 2013

Recently I have developed an interest in analyzing data to find trends and predict future events, and I have started working on a few POCs in data analytics, such as predictive analysis and text mining. This post is about data mining, more specifically document classification using the R programming language, one of the powerful languages used for statistical analysis.
What is Document classification?

Document classification, or document categorization, means assigning documents to one or more classes/categories, either manually or algorithmically. Today we classify algorithmically. Document classification is a supervised machine learning technique.
Technically speaking, we create a machine learning model using a number of text documents (called the corpus) as input and their corresponding classes/categories (called labels) as output. The model thus generated will be able to assign a class when a new text is supplied.
Inside the Black Box:
Let’s look at what happens inside the black box in the above figure. We can divide the steps into:
  • Creation of the corpus
  • Preprocessing of the corpus
  • Creation of the Term Document Matrix
  • Preparing features & labels for the model
  • Creating train & test data
  • Running the model
  • Testing the model
To understand the above steps in detail, let us consider a small use case:
We have speeches from the two US presidential contestants, Mr. Obama and Mr. Romney. We need to create a classifier that can tell whether a particular new speech belongs to Mr. Obama or Mr. Romney.

We implement the document classification using the tm and plyr packages. As a preliminary step, we load the required libraries into the R environment.

  • Step I: Corpus creation
        A corpus is a large, structured set of texts used for analysis.
        In our case, we create two corpora, one for each contestant.
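
As a rough sketch of this step in base R, each contestant's speeches can be held as a small table of documents. The texts below are placeholders rather than real speech excerpts, and `make_corpus` is a hypothetical helper introduced only for illustration. With the tm package the same idea is usually expressed with `VCorpus(DirSource(...))` pointed at a folder of text files.

```r
# Placeholder documents standing in for the two speech collections.
# These strings are invented for illustration, not real speech text.
speeches <- list(
  obama  = c("Placeholder text for speech one.",
             "Placeholder text for speech two."),
  romney = c("Placeholder text for speech three.",
             "Placeholder text for speech four.")
)

# Hypothetical helper: one "corpus" per contestant, stored as a data frame
# with one row per document and the contestant name attached.
make_corpus <- function(texts, label) {
  data.frame(doc_id = paste0(label, "_", seq_along(texts)),
             text   = texts,
             label  = label,
             stringsAsFactors = FALSE)
}

# Build the two corpora; the list keeps the contestant names.
corpora <- Map(make_corpus, speeches, names(speeches))
str(corpora$obama)
```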

  • Step II: Preprocessing of Corpus
        The corpus now needs to be cleaned before we use the data for our analysis.
        Preprocessing involves removing punctuation, extra white space, and stop words such as
        is, the, for, etc.
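
The cleaning step can be sketched in base R as below. The stop-word list is a short assumed sample and `clean_text` is a hypothetical helper. The tm package offers `removePunctuation`, `stripWhitespace`, `removeWords` and `stopwords("english")` for the same purpose, applied through `tm_map`.

```r
# Small illustrative stop-word list (an assumption; real lists are much longer).
stop_words <- c("is", "the", "for", "a", "an", "and", "to", "of", "in")

# Hypothetical helper that cleans a character vector of documents.
clean_text <- function(x) {
  x <- tolower(x)                               # normalise case
  x <- gsub("[[:punct:]]", "", x)               # remove punctuation
  words <- strsplit(x, "[[:space:]]+")          # split on runs of white space
  vapply(words, function(w) {
    w <- w[nzchar(w) & !(w %in% stop_words)]    # drop empties and stop words
    paste(w, collapse = " ")                    # re-join with single spaces
  }, character(1))
}

cleaned <- clean_text("The  economy is growing, and jobs are the priority for us!")
print(cleaned)
```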
  • Step III: Term Document Matrix
        This step involves creating a Term Document Matrix, i.e. a matrix that holds the frequency of the terms that occur in a collection of documents.
        For example:
        D1 = “I love Data analysis”
        D2 = “I love to create data models”
        TDM (terms lower-cased):

                      D1   D2
          analysis     1    0
          create       0    1
          data         1    1
          i            1    1
          love         1    1
          models       0    1
          to           0    1
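
The term counts for D1 and D2 can be computed with a few lines of base R. This is only a sketch of the idea; in the tm package the corresponding function is `TermDocumentMatrix`, whose default settings may filter terms differently.

```r
docs <- c(D1 = "I love Data analysis",
          D2 = "I love to create data models")

# Tokenise: lower-case, then split on white space.
tokens <- strsplit(tolower(docs), "[[:space:]]+")

# Vocabulary: every distinct term, sorted alphabetically.
terms <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, cells are term counts.
tdm <- sapply(tokens, function(tok) table(factor(tok, levels = terms)))
print(tdm)
```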

  • Step IV: Feature extraction & labels for the model
        In this step, we extract the input feature words that are useful in distinguishing the
        documents and attach the corresponding classes as labels.
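
One possible way to build the feature table, sketched in base R: transpose the term counts so that each document is a row, keep only the terms that appear in enough documents, and add the class as a label column. The toy documents and the "at least two documents" threshold are assumptions made for this example.

```r
# Toy cleaned documents with known classes (made-up words for illustration).
docs   <- c("alpha beta gamma", "beta delta alpha",
            "epsilon zeta eta", "zeta epsilon theta")
labels <- c("obama", "obama", "romney", "romney")

tokens <- strsplit(docs, " ")
terms  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term.
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = terms))))

# Keep only terms that occur in at least two documents (assumed threshold),
# a crude stand-in for dropping sparse terms.
keep     <- colSums(dtm > 0) >= 2
features <- as.data.frame(dtm[, keep, drop = FALSE])

# Attach the class of each document as the label column.
features$label <- factor(labels)
print(features)
```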

  • Step V: Train & test data preparation
        In this step, we first randomize the data and then divide the data containing features &
        labels into training data (70%) and test data (30%) before we feed it into our model.
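
A minimal sketch of the shuffle and 70/30 split in base R. The data frame is randomly generated stand-in data, and the seed and column names are arbitrary choices for the example.

```r
set.seed(42)  # fixed seed so the shuffle is reproducible (arbitrary value)

# Stand-in for the features + label table from the previous step.
data <- data.frame(f1 = rnorm(20), f2 = rnorm(20),
                   label = rep(c("obama", "romney"), each = 10))

# Randomise the row order, then take the first 70% of rows for training.
shuffled <- data[sample(nrow(data)), ]
n_train  <- round(0.7 * nrow(shuffled))

train <- shuffled[seq_len(n_train), ]
test  <- shuffled[-seq_len(n_train), ]

print(c(train = nrow(train), test = nrow(test)))
```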

  • Step VI: Running the model
        We create our model using the training data separated in the previous step. We use a KNN (k-nearest neighbours) model, whose description can be found here.
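
As an illustration of the KNN step, the sketch below uses `knn()` from the `class` package, which normally ships with R. The term counts are invented toy numbers, the column names are placeholders, and `k = 3` is an arbitrary choice rather than a tuned value.

```r
library(class)  # provides knn()

# Toy term-count features (invented numbers): two terms, two classes.
train_x <- data.frame(term_a = c(3, 4, 2, 0, 1, 0),
                      term_b = c(0, 1, 0, 3, 4, 2))
train_y <- factor(c("romney", "romney", "romney", "obama", "obama", "obama"))

# New, unlabelled speeches described by the same two term counts.
test_x <- data.frame(term_a = c(4, 0), term_b = c(0, 3))

# Each test row gets the majority class of its k nearest training rows.
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
print(pred)
```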
  • Step VII: Testing the model
        Now that the model is created, we have to test its accuracy using the test data created in Step V.
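
Accuracy can be read off a confusion matrix of predicted versus actual labels. The two label vectors below are hypothetical values chosen to show the arithmetic, not results from the actual classifier.

```r
# Hypothetical true labels and predictions for six test speeches.
actual    <- factor(c("obama", "obama", "romney", "romney", "obama", "romney"))
predicted <- factor(c("obama", "romney", "romney", "romney", "obama", "obama"),
                    levels = levels(actual))

# Confusion matrix: rows are predictions, columns are the true classes.
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

# Accuracy = correctly classified documents / all documents.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```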
Find the complete code here.

