September 23, 2013
Recently I have developed an interest in analyzing data to find trends and predict future events, and I have started working on a few proofs of concept (POCs) in data analytics, such as predictive analysis and text mining. This post covers data mining, specifically document classification using R, a powerful programming language for statistical analysis.
What is Document classification?
Document classification (or document categorization) assigns documents to one or more classes or categories, either manually or algorithmically. Here we classify algorithmically. Document classification is a supervised machine learning technique.
Technically speaking, we build a machine learning model from a collection of text documents (called a corpus) as input and their corresponding classes or categories (called labels) as output. The resulting model can then assign a class to any new text it is given.
Inside the Black Box:
Let’s look at what happens inside the black box in the figure above. We can divide the process into these steps:
- Creation of Corpus
- Preprocessing of Corpus
- Creation of Term Document Matrix
- Preparing Features & Labels for Model
- Creating Train & test data
- Running the model
- Testing the model
We have speeches by two US presidential contestants, Mr. Obama and Mr. Romney. We need a classifier that can tell whether a new speech belongs to Mr. Obama or Mr. Romney.
We implement the document classification using the tm and plyr packages. As a preliminary step, we load the required libraries into the R environment.
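A minimal sketch of the library loading step. The tm and plyr packages are the ones named above; the class package is an assumption here, chosen because it provides a knn() function that a later step could use.

```r
# Run once if the packages are not installed yet:
# install.packages(c("tm", "plyr", "class"))

library(tm)     # corpus handling, preprocessing, term-document matrix
library(plyr)   # data-frame helpers such as rbind.fill
library(class)  # assumed choice for the KNN model (provides knn)
```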
- Step I: Corpus creation:
A corpus is a large, structured set of texts used for analysis.
In our case, we create two corpora, one for each contestant.
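As a rough sketch, the two corpora could be built with tm as follows. The short texts are hypothetical placeholders, not real quotes, and the variable names are illustrative; with real speeches stored as files, tm's DirSource could be used in place of VectorSource.

```r
library(tm)

# Placeholder texts standing in for real speeches (hypothetical, not quotes).
obama_texts  <- c("placeholder speech one about energy",
                  "placeholder speech two about health")
romney_texts <- c("placeholder speech one about taxes",
                  "placeholder speech two about business")

# One corpus per contestant; VectorSource treats each element as a document.
obama_corpus  <- VCorpus(VectorSource(obama_texts))
romney_corpus <- VCorpus(VectorSource(romney_texts))

# Number of documents in each corpus
length(obama_corpus)
length(romney_corpus)
```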
- Step II: Preprocessing of Corpus
The corpus needs to be cleaned before we use the data for our analysis.
Preprocessing involves removing punctuation, extra white space, and stop words such as is, the, for, etc.
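The cleaning described above might look like the sketch below. The helper name clean_corpus and the sample sentence are made up for illustration; lowercasing is an extra step assumed here so that stop-word matching is case-insensitive.

```r
library(tm)

# Hypothetical helper: apply the usual cleaning steps to a corpus.
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))        # assumed extra step
  corpus <- tm_map(corpus, removePunctuation)                   # drop punctuation
  corpus <- tm_map(corpus, removeWords, stopwords("english"))   # drop stop words
  corpus <- tm_map(corpus, stripWhitespace)                     # collapse repeated spaces
  corpus
}

raw     <- VCorpus(VectorSource("This is the plan, for the   economy!"))
cleaned <- clean_corpus(raw)

# Text of the first (and only) document after cleaning
cleaned_text <- as.character(cleaned[[1]])
print(cleaned_text)
```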
- Step III: Term Document Matrix
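This step creates a term-document matrix, i.e. a matrix holding the frequency of each term across a collection of documents. For example, take these two documents:
D1 = “I love Data analysis”
D2 = “I love to create data models”
Ignoring case, the term “data” occurs once in each document, while “models” occurs only in D2.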
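To make the idea concrete, here is a small sketch that builds the matrix for D1 and D2 with tm. The control settings are assumptions made for this toy case: by default tm drops words shorter than three characters, so wordLengths is lowered to keep "i" and "to".

```r
library(tm)

docs   <- c("I love Data analysis", "I love to create data models")
corpus <- VCorpus(VectorSource(docs))

# Rows are terms, columns are documents, cells are term frequencies.
tdm <- TermDocumentMatrix(
  corpus,
  control = list(tolower = TRUE,            # treat "Data" and "data" as one term
                 wordLengths = c(1, Inf))   # keep short words such as "i" and "to"
)

m <- as.matrix(tdm)
print(m)
```

Each column of m corresponds to one document (D1 first, then D2), so a row such as "data" shows how often that term appears in each.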
- Step IV: Feature Extraction & Labels for the model:
In this step, we extract the feature words that are useful for distinguishing the documents and attach the corresponding classes as labels.
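One possible way to assemble features and labels is sketched below. The helper to_labeled_frame and the placeholder texts are hypothetical. Each contestant's term-document matrix is transposed so that rows are documents, a label column is attached, and plyr's rbind.fill stacks the two tables even though their term columns differ.

```r
library(tm)
library(plyr)

# Hypothetical helper: documents-by-terms data frame plus a label column.
to_labeled_frame <- function(texts, label) {
  tdm <- TermDocumentMatrix(VCorpus(VectorSource(texts)))
  df  <- as.data.frame(t(as.matrix(tdm)), stringsAsFactors = FALSE)
  df$label <- label
  df
}

# Placeholder texts (hypothetical, not real quotes).
obama  <- to_labeled_frame(c("invest education energy",
                             "health care families"), "obama")
romney <- to_labeled_frame(c("cut taxes reduce deficit",
                             "business growth creates jobs"), "romney")

# rbind.fill fills terms missing from one table with NA; a missing term
# simply means a count of zero.
stacked <- rbind.fill(obama, romney)
stacked[is.na(stacked)] <- 0

features <- stacked[, names(stacked) != "label", drop = FALSE]
labels   <- factor(stacked$label)
```

With real speeches, a call such as removeSparseTerms on each matrix could be added to drop rare terms before stacking.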
- Step V: Train & test data preparation
In this step, we first randomize the data and then divide it, features and labels together, into training data (70%) and test data (30%) before feeding it to our model.
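The split can be sketched in base R as follows. The toy table and the seed are assumptions made purely so the example is reproducible; sampling row indices without replacement covers both the shuffling and the 70/30 division.

```r
set.seed(42)  # assumed seed, only for reproducibility

# Toy stand-in for the feature and label table (hypothetical values).
data <- data.frame(term_a = rpois(10, 2),
                   term_b = rpois(10, 2),
                   label  = rep(c("obama", "romney"), each = 5))

n         <- nrow(data)
train_idx <- sample(n, size = round(0.7 * n))    # random 70% of the rows
test_idx  <- setdiff(seq_len(n), train_idx)      # the remaining 30%

train <- data[train_idx, ]
test  <- data[test_idx, ]
```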
- Step VI: Running the model:
We create our model using the training data separated in the previous step.
We use a k-nearest neighbors (KNN) model, a description of which can be found here.
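As an illustration, a KNN classifier can be run with knn() from the class package; using that package, the value k = 3, and the toy term counts are all assumptions of this sketch. Note that knn() does not produce a separate fitted model: it takes the training data and the new data together and returns the predicted classes.

```r
library(class)

# Toy term-frequency features (hypothetical): two well-separated groups.
train_x <- data.frame(term_a = c(5, 4, 6, 0, 1, 0),
                      term_b = c(0, 1, 0, 5, 6, 4))
train_y <- factor(c("romney", "romney", "romney", "obama", "obama", "obama"))

# Two unseen documents to classify.
test_x <- data.frame(term_a = c(5, 0),
                     term_b = c(1, 5))

# Each test row gets the majority class of its 3 nearest training rows.
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
print(pred)
```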
- Step VII: Test Model:
Now that the model is created, we test its accuracy using the test data created in Step V.
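Accuracy can be checked by comparing predictions with the true labels in a confusion matrix. The sketch below uses made-up label vectors in place of real model output; accuracy is the share of test documents on the diagonal.

```r
# Hypothetical true and predicted labels for five test documents.
actual    <- factor(c("obama", "obama", "romney", "romney", "obama"))
predicted <- factor(c("obama", "romney", "romney", "romney", "obama"),
                    levels = levels(actual))

# Rows are predictions, columns are true classes.
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```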
Find the complete code here.