A Series of blog posts on Data Science, Data Mining.

Saturday, June 21, 2014

Assessing Model Accuracy - Part1

Recently I started reading the book "An Introduction to Statistical Learning", which has a good introduction to assessing model accuracy. This post contains excerpts from that chapter:

We often take different statistical approaches to solve a data analysis problem. Why is it necessary to introduce so many different approaches, rather than a single best method? The answer is that in statistics no single method dominates all other methods over all possible datasets. One statistical method may work well on a specific dataset, while another method may work better on a similar but different dataset. So it is important to decide, for a particular dataset, which method produces the best results.

In order to evaluate the performance of a statistical learning method, we need to measure how well its predictions actually match the observed data. Today we will look at the different measures that help us assess model accuracy:
  •  Mean Squared Error
  •  Variance/bias 

Mean Squared Error:

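In regression analysis, the most commonly used measure is the Mean Squared Error (MSE), given by the equation:

MSE = (1/n) Σ (f(xi) - yi)^2, with the sum running over i = 1, ..., n

where n is the number of observations, yi is the observed response for the ith observation, and f(xi) is the prediction that f gives for the ith observation.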
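As a quick illustration of the formula, here is a minimal sketch in plain Python. The function name `mse` and the toy numbers are my own assumptions for illustration; they do not come from the book.

```python
def mse(predicted, observed):
    """Mean squared error: the average of (f(xi) - yi)^2 over all observations."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("need at least one observation")
    squared_errors = [(p - y) ** 2 for p, y in zip(predicted, observed)]
    return sum(squared_errors) / len(squared_errors)


# Toy values, made up for illustration only.
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# The squared errors are 1, 0 and 4, so the MSE is 5/3.
print(mse(predicted, observed))
```

A single badly missed observation (the last one here) contributes most of the total, which is why the MSE grows quickly when a few predictions are far off.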

The MSE will be small if the predicted responses are very close to the true responses, and large if, for some of the observations, the predicted and true responses differ substantially. The MSE calculated above is called the training MSE, because it is computed on the same data that was used to fit the model. However, we are usually more interested in the test MSE, i.e. the MSE of the model on previously unseen test data, than in the training MSE. The equation for calculating the test MSE is

Test MSE = Avg((f(x0) - y0)^2)

where the test MSE is the average of the squared differences between the estimated response f(x0) and the actual response y0, taken over the previously unseen test observations (x0, y0).
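To make the difference between the two quantities concrete, here is a small sketch in plain Python. It fits a straight line on one simulated sample and then computes the MSE on that sample (training MSE) and on a second, unseen sample (test MSE). The data-generating process (y = 1 + 2x plus noise), the sample sizes and all function names are assumptions made for this example.

```python
import random


def mse(predicted, observed):
    """Average of the squared differences between predictions and responses."""
    return sum((p - y) ** 2 for p, y in zip(predicted, observed)) / len(observed)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


rng = random.Random(0)  # fixed seed so the run is repeatable


def simulate(n):
    # Assumed data-generating process: y = 1 + 2x + Gaussian noise (sd 1).
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


x_train, y_train = simulate(50)  # used to fit the model
x_test, y_test = simulate(50)    # never seen during fitting

intercept, slope = fit_line(x_train, y_train)


def predict(x):
    return intercept + slope * x


train_mse = mse([predict(x) for x in x_train], y_train)
test_mse = mse([predict(x) for x in x_test], y_test)
print(f"training MSE: {train_mse:.3f}")
print(f"test MSE:     {test_mse:.3f}")
```

The important point is that the test observations play no part in `fit_line`; they are used only to score the finished model.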

Consider the situation below, where three different regression fits are applied to a randomly generated data set, as shown in the figure:
Note: The above images are taken from An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

In the above figure, the black curve is the true function from which the data were generated, and the orange, blue and green curves are three different fits to that data. In the right portion of the image we can see that the blue curve is the closest to the black curve, so it is the best of the three models for prediction.

Now consider the left part of the image: the red curve is the test MSE for the three model fits and the grey curve is the training MSE for the same three fits. We can observe that the grey curve, the training MSE, keeps declining as the models become more complex, so a low training MSE does not by itself tell us that a model will predict well. The test MSE, on the other hand, is a U-shaped curve that decreases initially and then rises. The ideal model is the one at which the test MSE is lowest, i.e. the point after which the test MSE starts to increase. In our example this is the blue point, where the curve starts rising. The model we build should not overfit, a condition in which the method yields a small training MSE but a large test MSE. Overfitting means that the method is trying too hard to find patterns in the training data and is picking up patterns that occurred only by chance.
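The behaviour described above can be reproduced with a small experiment. The sketch below is not the set of fits shown in the book's figure; as a stand-in, it uses k-nearest-neighbours regression, where a smaller k gives a more flexible fit. The true function (sin x), the noise level, the sample sizes and the values of k are all assumptions chosen for illustration.

```python
import math
import random


def mse(predicted, observed):
    """Average of the squared differences between predictions and responses."""
    return sum((p - y) ** 2 for p, y in zip(predicted, observed)) / len(observed)


def knn_predict(x0, xs, ys, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


rng = random.Random(1)  # fixed seed so the run is repeatable


def simulate(n):
    # Assumed true function: sin(x), plus Gaussian noise with sd 0.3.
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


x_train, y_train = simulate(100)
x_test, y_test = simulate(100)

results = {}
for k in (50, 20, 10, 5, 2, 1):  # smaller k = more flexible fit
    fit_train = [knn_predict(x, x_train, y_train, k) for x in x_train]
    fit_test = [knn_predict(x, x_train, y_train, k) for x in x_test]
    results[k] = (mse(fit_train, y_train), mse(fit_test, y_test))
    print(f"k={k:>2}  training MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With k = 1 every training point is its own nearest neighbour, so the training MSE is exactly zero, yet the test MSE is not. That is overfitting in its most extreme form: the fit reproduces the noise in the training data rather than the underlying pattern. The k to prefer is the one with the lowest test MSE, not the lowest training MSE.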

In my next post I will explain the bias-variance trade-off involved in choosing the best-fit model.