Blog posts on Data Science, Machine Learning, Data Mining, Artificial Intelligence, Spark Machine Learning

Thursday, July 31, 2014

Assessing Model Accuracy - Part 2

In my last post I explained MSE. Today I will explain the variance-bias trade-off and the precision-recall trade-off, both of which matter when assessing model accuracy.

What are the Variance and Bias of a Statistical Learning Method?
Variance refers to the amount by which the estimated function $\hat{f}$ would change if we estimated it using a different training dataset. Since the training data is used to fit the statistical learning method, different training sets will result in a different $\hat{f}$.



Ideally, the estimate should not vary much between training sets.
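To make the idea of variance concrete, here is a minimal sketch that refits two methods on many simulated training sets and measures how much their prediction at a single point moves around. Everything in it is an assumption for illustration: the true function sin(2πx), the noise level, the query point x0 = 0.25, and the helper names are not from the original analysis.

```python
import math
import random
import statistics

def true_f(x):
    # Assumed "true" relationship, chosen only for illustration
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n=30, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def fit_linear(xs, ys):
    # Simple least-squares line (inflexible method)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_1nn(xs, ys):
    # 1-nearest-neighbour regression (very flexible method)
    def predict(x):
        nearest = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[nearest]
    return predict

def prediction_variance(fit, x0=0.25, n_sets=200, seed=0):
    # Refit on n_sets different training sets and record the prediction at x0
    rng = random.Random(seed)
    preds = [fit(*sample_training_set(rng))(x0) for _ in range(n_sets)]
    return statistics.pvariance(preds)

var_linear = prediction_variance(fit_linear)
var_1nn = prediction_variance(fit_1nn)
print(f"variance of linear fit at x0: {var_linear:.4f}")
print(f"variance of 1-NN fit at x0:   {var_1nn:.4f}")
```

In this setup the flexible 1-NN method should show a noticeably larger variance than the straight line, because each 1-NN prediction copies the noise of a single training point.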
Bias refers to the error that is introduced by approximating a complicated problem by a simpler model.
For example, consider a dataset whose true relationship is non-linear (black curve). If a simple linear regression (orange curve) is fitted to data that actually needs a much more flexible model (blue curve), the linear model introduces bias: no matter how much training data we collect, a straight line cannot capture the curvature.
Fig: linear regression provides a very poor fit to the data
Explanation:
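The expected test MSE at a point x0 can be decomposed into three terms: the variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$, and the variance of the error term ɛ.
$$E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\varepsilon)$$
Here the left-hand side is the expected test MSE at x0. On the right-hand side, the first term is the variance, the second term is the squared bias, and the third term is the variance of the error, which is irreducible: no model can remove it.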
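The decomposition can be checked numerically. The sketch below is a simulation under assumed conditions (true function sin(2πx), Gaussian noise with standard deviation 0.3, a straight-line fit, query point x0 = 0.25; none of these come from the original post). It estimates each term separately and compares their sum with the directly measured expected test MSE.

```python
import math
import random
import statistics

NOISE_SD = 0.3  # assumed standard deviation of the error term

def true_f(x):
    # Assumed true relationship, for illustration only
    return math.sin(2 * math.pi * x)

def fit_linear(xs, ys):
    # Least-squares straight line
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def decompose(x0=0.25, n_train=30, n_sets=2000, seed=1):
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(n_sets):
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0, NOISE_SD) for x in xs]
        pred = fit_linear(xs, ys)(x0)
        y0 = true_f(x0) + rng.gauss(0, NOISE_SD)  # fresh test observation at x0
        preds.append(pred)
        sq_errors.append((y0 - pred) ** 2)
    variance = statistics.pvariance(preds)                 # Var(f_hat(x0))
    bias_sq = (statistics.mean(preds) - true_f(x0)) ** 2   # [Bias(f_hat(x0))]^2
    noise = NOISE_SD ** 2                                  # Var(epsilon)
    test_mse = statistics.mean(sq_errors)                  # E[(y0 - f_hat(x0))^2]
    return variance, bias_sq, noise, test_mse

variance, bias_sq, noise, test_mse = decompose()
print(f"variance + bias^2 + noise = {variance + bias_sq + noise:.3f}")
print(f"simulated test MSE        = {test_mse:.3f}")
```

The two printed numbers should agree up to simulation error. For the straight line in this setup, the squared bias is the dominant term, which matches the poor linear fit described above.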
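To build the best model for an analysis, we need a statistical learning method that achieves both low variance and low bias. This is called the variance-bias trade-off, because it is easy to obtain a method with low bias but high variance, or low variance but high bias; the challenge is to keep both low at once.
Fig: Variance-bias trade-off (explained below)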

As a general rule, as the complexity (flexibility) of the model increases, the variance increases and the bias decreases. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. Initially, as flexibility increases, the bias decreases faster than the variance increases, so the test MSE decreases. However, at some point increasing flexibility has little further effect on the bias, while the variance starts to increase significantly, so the test MSE rises again. The flexibility at which the test MSE is lowest can be treated as the optimal point for model selection. In short, while assessing model accuracy we need to take the variance-bias trade-off into consideration.
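The U-shaped behaviour of the test MSE can be reproduced with a small simulation. This is only a sketch under assumptions that are not in the original post: k-nearest-neighbour regression stands in for "a model of adjustable flexibility" (smaller k means more flexible), and the data come from an assumed sin(2πx) curve with Gaussian noise.

```python
import math
import random

def true_f(x):
    # Assumed true relationship, for illustration only
    return math.sin(2 * math.pi * x)

def make_data(rng, n, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def knn_test_mse(ks, n_train=100, n_test=200, n_repeats=20, seed=0):
    # Average test MSE of k-NN regression for each k, over repeated train/test draws
    rng = random.Random(seed)
    total = {k: 0.0 for k in ks}
    for _ in range(n_repeats):
        train_x, train_y = make_data(rng, n_train)
        test_x, test_y = make_data(rng, n_test)
        for x0, y0 in zip(test_x, test_y):
            # Training responses ordered by distance from x0
            by_distance = sorted(zip((abs(x - x0) for x in train_x), train_y))
            neighbours = [y for _, y in by_distance]
            for k in ks:
                pred = sum(neighbours[:k]) / k
                total[k] += (y0 - pred) ** 2
    return {k: total[k] / (n_repeats * n_test) for k in ks}

ks = [100, 50, 20, 10, 5, 2, 1]  # flexibility increases from left to right
mse = knn_test_mse(ks)
for k in ks:
    print(f"k = {k:3d}  test MSE = {mse[k]:.3f}")
best_k = min(mse, key=mse.get)
print("lowest test MSE at k =", best_k)
```

With k = 100 the model simply predicts the training mean (high bias), and with k = 1 it chases the noise (high variance); the lowest test MSE should fall somewhere in between.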

What are the Precision and Recall of a Statistical Learning Method?
When dealing with a classification problem, we can assess the quality of the model using precision and recall.
Consider the confusion matrix below, one of the best ways to represent the results of a classifier:

                      ACTUAL POSITIVE    ACTUAL NEGATIVE
PREDICTED POSITIVE    TRUE POSITIVE      FALSE POSITIVE
PREDICTED NEGATIVE    FALSE NEGATIVE     TRUE NEGATIVE

Let’s understand the confusion matrix:
TRUE POSITIVE: The actual is positive and the model predicted positive.
FALSE POSITIVE: The actual is negative but the model predicted positive (a FALSE ALARM).
FALSE NEGATIVE: The actual is positive but the model predicted negative (a MISS).
TRUE NEGATIVE: The actual is negative and the model predicted negative.
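The four cells can be counted directly from a list of actual labels and a list of predicted labels. The sketch below uses a small made-up example (the labels and the helper name `confusion_counts` are illustrative, not from a real dataset or library), with 1 meaning positive and 0 meaning negative.

```python
def confusion_counts(actual, predicted):
    # Count each cell of the confusion matrix; 1 = positive, 0 = negative
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["TP"] += 1   # predicted positive, actually positive
        elif p == 1 and a == 0:
            counts["FP"] += 1   # false alarm
        elif p == 0 and a == 1:
            counts["FN"] += 1   # miss
        else:
            counts["TN"] += 1   # predicted negative, actually negative
    return counts

# Hypothetical labels, for illustration only
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```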
Precision: the fraction of the items predicted as positive that are actually positive.
Precision = TP/(TP+FP)
Recall: the fraction of the actually positive items that the model correctly predicted as positive.
Recall = TP/(TP+FN)
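As a worked example of the two formulas, the sketch below plugs in made-up counts (TP = 40, FP = 10, FN = 20 are hypothetical numbers chosen for easy arithmetic). Returning 0.0 when a denominator is zero is a convention assumed here, not something the formulas themselves dictate.

```python
def precision(tp, fp):
    # TP / (TP + FP); returns 0.0 if nothing was predicted positive (assumed convention)
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    # TP / (TP + FN); returns 0.0 if there are no actual positives (assumed convention)
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Hypothetical counts, for illustration only
tp, fp, fn = 40, 10, 20

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision = {p:.3f}")  # precision = 0.800
print(f"recall    = {r:.3f}")  # recall    = 0.667
```

Here the model is right 80% of the time when it says "positive", but it only finds two thirds of the items that really are positive.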
Precision and recall usually pull in opposite directions: tuning a model to raise recall (for example, by lowering its decision threshold) tends to lower precision, and vice versa. The right balance between precision and recall varies from problem to problem. For example, in a legal document classification problem the model needs high recall, because it must retrieve as many of the relevant documents as possible.
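One common way to see this trade-off is to vary the decision threshold of a classifier that outputs scores. The sketch below uses ten made-up scores and labels (purely hypothetical, not from a real model) and computes precision and recall at three thresholds.

```python
# Hypothetical classifier scores and true labels (1 = positive), for illustration only
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    1,    0,    1,    0,    0,    0]

def precision_recall_at(threshold):
    # An item is predicted positive when its score is at or above the threshold
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

for threshold in (0.85, 0.55, 0.25):
    p, r = precision_recall_at(threshold)
    print(f"threshold = {threshold:.2f}  precision = {p:.3f}  recall = {r:.3f}")
# threshold = 0.85  precision = 1.000  recall = 0.400
# threshold = 0.55  precision = 0.800  recall = 0.800
# threshold = 0.25  precision = 0.625  recall = 1.000
```

Lowering the threshold lets the model catch every positive item (recall reaches 1.0), but more of its positive predictions are false alarms (precision drops to 0.625). A high-recall application such as the legal example would lean towards the lower threshold.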