Sunday, October 5, 2014

Regression Analysis using R

What is a Prediction Problem?
A business problem which involves predicting future events by extracting patterns from historical data. Prediction problems are solved using statistical techniques, mathematical models, or machine learning techniques.
For example: forecasting the stock price for the next week, or predicting which football team will win the World Cup.

What is Regression analysis, where is it applicable?
While dealing with any prediction problem, the easiest, most widely used, yet powerful technique is linear regression. Regression analysis is used for modeling the relationship between a response variable and one or more input variables.
In simpler terms, regression analysis helps us to:
  • predict future observations;
  • find associations and relationships between variables;
  • identify which variables contribute most towards predicting future outcomes.

Types of regression problems:

Simple Linear Regression:

If a model deals with one input variable, called the independent or predictor variable, and one output variable, called the dependent or response variable, then it is called Simple Linear Regression. This type of linear regression assumes that there exists a linear relation between the predictor and the response variable of the form:
Y ≈ β0 + β1X + ε
In the above equation, β0 and β1 are unknown constants that represent the intercept and slope of a straight line, which we learned in high school. These unknown constants are known as the model coefficients or parameters. X is the known input variable, and if we can estimate β0 and β1 by some method, then Y can be predicted. In order to predict future outcomes, we use the training data to estimate the unknown model parameters (β̂0, β̂1), giving the equation:
ŷ = β̂0 + β̂1x, where ŷ, β̂0, β̂1 are the estimates.
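As a concrete sketch, this estimation step can be carried out in R with lm(); the data below is made up purely for illustration (the true values β0 = 3 and β1 = 2 are chosen by us, not taken from any real dataset):

```r
# Simple linear regression on synthetic data: estimate b0-hat and b1-hat
set.seed(42)
x <- 1:50
y <- 3 + 2 * x + rnorm(50, sd = 2)            # known truth plus noise
fit <- lm(y ~ x)
coef(fit)                                     # estimates close to 3 and 2
predict(fit, newdata = data.frame(x = 60))    # predict a future observation
```

Because the noise is small relative to the signal, the estimated coefficients land close to the true intercept and slope.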
Multiple Linear Regression:
If the problem contains more than one input variables and one response variable, then it is called Multiple Linear regression.

How do we apply Regression analysis using R?
Let us apply regression analysis on the power plant dataset, available here. The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with a full load. Features consist of hourly average ambient variables Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V), used to predict the net hourly electrical energy output (PE) of the plant.
  • Read the data into R environment:
library(xlsx)  # provides read.xlsx()
sample1 = read.xlsx("C:\\Suresh\\blogs\\datasets\\CCPP\\Folds5x2_pp.xlsx",sheetIndex=1)
  • Understand and observe the data: View(sample1)
Check for missing values, the range of the variables, and density plots for each variable:
sum(  # [1] 0 -- no missing values
range(sample1$AT)      # 1.81, 37.11
mean(sample1$AT)       # 19.65
Fig: density plot for Temperature (AT)

Scatter plots show us that temperature (AT) and vacuum (V) are inversely related to power output, while pressure (AP) and RH are not strongly related to it.
  • Check for correlation among the variables. This step is very important for understanding the relation of the dependent variable with the independent variables, and the correlations among the variables. In general, there shouldn't be any correlation among the independent variables.
cor(sample1)
           AT          V         AP          RH         PE
AT  1.0000000  0.8441067 -0.5075493 -0.5425347 -0.9481285
V   0.8441067  1.0000000 -0.4135022 -0.3121873 -0.8697803
AP -0.5075493 -0.4135022  1.0000000  0.0995743  0.5184290
RH -0.5425347 -0.3121873  0.0995743  1.0000000  0.3897941
Inferences: AT has a strong negative correlation with PE, V is also highly (negatively) correlated with PE, and the other two are moderately correlated with PE.
  • Divide the data into training and test sets, train the model with linear regression using the lm() method available in R, and then make predictions on new test data using the predict() method.
rand = sample1[sample(nrow(sample1)),]  # shuffle the rows before splitting
tr = rand[1:6697,]
ts = rand[6698:9568,]
model2 = lm(PE~AT+V+AP+RH,data=tr)
summary(model2)

Call:
lm(formula = PE ~ AT + V + AP + RH, data = tr)

Residuals:
    Min      1Q  Median      3Q     Max
-43.533  -3.170  -0.068   3.229  17.451

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)
(Intercept) 457.729155  11.794172   38.810  < 2e-16 ***
AT           -1.987307   0.018208 -109.147  < 2e-16 ***
V            -0.231996   0.008692  -26.689  < 2e-16 ***
AP            0.059235   0.011442    5.177 2.32e-07 ***
RH           -0.159916   0.005015  -31.886  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.585 on 6692 degrees of freedom
Multiple R-squared:  0.9281,      Adjusted R-squared:  0.9281
F-statistic: 2.161e+04 on 4 and 6692 DF,  p-value: < 2.2e-16

New predictions are made using the predict() method.
pred = predict(model2, ts[,1:4])

Comparing the actual vs. predicted values shows the predictions are quite accurate. From the summary results of the model, the key takeaways are:
  • The model is accurate, as R² is near 1 (0.9281).
  • The model states that all the variables are significant; the *** marks indicate the significance level.
  • The p-value is less than 0.05 and the F-statistic is significantly high.
  • The residuals-vs-fitted and normal Q-Q plots also look good, with the mean of the errors around 0.
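The steps above (shuffle, split, fit, predict, evaluate) can be sketched end-to-end on synthetic data. The data-generating numbers below are made up to loosely mimic the CCPP variables; they are not the real dataset:

```r
set.seed(1)
n  <- 1000
df <- data.frame(AT = runif(n, 2, 37),    V  = runif(n, 25, 82),
                 AP = runif(n, 993, 1034), RH = runif(n, 25, 100))
df$PE <- 460 - 2 * df$AT - 0.2 * df$V + 0.06 * df$AP - 0.16 * df$RH +
         rnorm(n, sd = 4)                 # linear signal plus noise

rand <- df[sample(nrow(df)), ]            # shuffle rows before splitting
tr   <- rand[1:700, ]                     # ~70% training
ts   <- rand[701:1000, ]                  # ~30% test

fit  <- lm(PE ~ AT + V + AP + RH, data = tr)
pred <- predict(fit, ts[, 1:4])
rmse <- sqrt(mean((ts$PE - pred)^2))      # close to the noise sd (~4)
```

Since the synthetic relationship really is linear, lm() recovers it well and the test RMSE sits near the noise level, mirroring the behaviour seen on the real data above.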

In the next blog we will learn about model validation and extensions of linear regression.

Thursday, July 31, 2014

Assessing Model Accuracy - Part 2

In my last post I explained MSE; today I will explain the variance-bias trade-off and the precision-recall trade-off used while assessing model accuracy.

What is Variance and bias of a statistical learning Method?
Variance refers to the amount by which the estimated function f̂ would change if we estimated it using a different training dataset. Since the training data is used to fit the statistical learning method, different training sets will result in different estimates f̂.
Ideally, the estimate should not vary much between training sets.
Bias refers to the error that is introduced by approximating a complicated problem by a simpler model.
For example, consider the distribution of a dataset (black fit curve). If a simple linear regression (orange fit curve) is fitted to a dataset which actually needs a much more flexible model (blue fit curve), the simple linear regression induces bias in the model.
Fig: linear regression provides a very poor fit to the data
The test MSE calculated here can be decomposed into three parts: variance, bias, and irreducible error.
E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε)
Here E[(y0 − f̂(x0))²] is the expected test MSE; the first term on the right is the variance, the second is the squared bias, and the third is the variance of the irreducible error ε.
In order to develop best model for any analysis, we need to select a statistical method which achieves low Variance and low bias. This is called Variance-bias trade-off.
Fig: Variance-bias trade-off, explained in detail below.

As a general rule, as the complexity of the model increases the variance will increase & Bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. Initially, as the flexibility increases the bias decreases faster than the variance increases. As a result, test MSE decreases. However, at some point increasing flexibility has little effect on the bias and after this point the variance tends to increase significantly. This point can be treated as optimal point for model selection. As an end note, while assessing the model accuracy we need to take variance-bias trade-off into consideration.
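This behaviour can be simulated. Below, models of increasing flexibility (polynomial degree, a choice made here purely for illustration) are fitted to many noisy training sets drawn from the same curved truth, and the average test MSE is lowest at a moderate degree:

```r
set.seed(7)
truth  <- function(x) sin(2 * x)                 # the "complicated problem"
x_test <- seq(0.2, 2.8, length.out = 50)         # fixed test points

avg_test_mse <- function(degree) {
  mean(replicate(200, {
    x   <- runif(40, 0, 3)                       # a fresh training set
    y   <- truth(x) + rnorm(40, sd = 0.3)
    fit <- lm(y ~ poly(x, degree))
    mean((truth(x_test) - predict(fit, data.frame(x = x_test)))^2)
  }))
}
avg_test_mse(1)    # rigid straight line: test MSE dominated by bias
avg_test_mse(5)    # moderate flexibility: lowest test MSE
avg_test_mse(15)   # very flexible: test MSE rises again, dominated by variance
```

The degree with the lowest average test MSE is the optimal point described above, where bias has fallen but variance has not yet taken over.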

What is Precision-Recall of a statistical learning Method?
While dealing with a classification problem, we validate the relevance of the model using Precision Recall methods.
Consider the below confusion matrix, one of the best ways to represent the results of a classifier:


Let’s understand the confusion matrix:
TRUE POSITIVE: the actual is +ve and the model predicted +ve.
FALSE POSITIVE: the actual is -ve but the model predicted +ve - a FALSE ALARM.
FALSE NEGATIVE: the actual is +ve but the model predicted -ve - a MISS.
TRUE NEGATIVE: the actual is -ve and the model predicted -ve.
Precision: the percentage of predicted positives that are correct.
Precision = TP/(TP+FP)
Recall: the percentage of actual positives that the model catches.
Recall = TP/(TP+FN)
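With the counts from such a confusion matrix, both quantities are one-liners in R (the counts below are illustrative, not from any real classifier):

```r
# Illustrative confusion-matrix counts
TP <- 40; FP <- 10; FN <- 20; TN <- 30

precision <- TP / (TP + FP)  # of all predicted +ve, the fraction correct: 0.8
recall    <- TP / (TP + FN)  # of all actual +ve, the fraction caught: ~0.667
```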
Precision and recall are inversely related: as precision increases, recall falls, and vice versa. The trade-off between precision and recall varies from problem to problem. For example, in a legal document classification problem, the model needs high recall, as it must extract/classify as many relevant documents as possible.

Saturday, June 21, 2014

Assessing Model Accuracy - Part1

Recently I started reading the book "An Introduction to Statistical Learning", which has a good introduction to assessing model accuracy. This post contains excerpts from that chapter:

Often we take different statistical approaches to build a solution for a data analysis problem. Why is it necessary to have so many different approaches, rather than a single best method? The answer: in statistics, no single method dominates all others over all possible datasets. One statistical method may work well on a specific dataset, and some other method may work better on a similar but different dataset. So it is important to decide, for a particular dataset, which method produces the best results.

Sunday, May 25, 2014

Basic recommendation engine using R

In our day-to-day life, we come across a large number of recommendation engines, like Facebook's engine for friend suggestions and similar-page suggestions, or YouTube's engine suggesting videos similar to our previous searches/preferences. In today's blog post I will explain how to build a basic recommender system.

Types of Collaborative Filtering:

  1. User based Collaborative Filtering
  2. Item based Collaborative filtering
In this post I will explain user-based collaborative filtering. This algorithm usually works by searching a large group of people and finding a smaller set with tastes similar to yours. It looks at other things they like and combines them to create a ranked list of suggestions.

Implementing User Based Collaborative Filtering:
This involves two steps:
  1. Calculating Similarity Function 
  2. Recommend items to users based on user Similarity Score
Consider the below data sample of Movie critics and their movie rankings, the objective is to recommend the unrated movies based on similar users:

Step1- Calculate Similarity Score for CHAN:

Creating a similarity score for people helps us to identify similar people. We use a cosine-based similarity function to calculate the similarity between the users. Know more about cosine similarity here. In R we have a cosine function readily available:
user_sim = cosine(as.matrix(t(x)))
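The cosine() call above comes from the lsa package. As a sketch of what it computes, here is the same user-user similarity in base R, on a tiny made-up ratings matrix with one column per user (the names and ratings are illustrative, not the post's actual critics data):

```r
# Cosine similarity between the columns of a matrix
cosine_sim <- function(m) {
  norms <- sqrt(colSums(m^2))        # length of each column vector
  crossprod(m) / (norms %o% norms)   # dot products scaled to unit length
}

ratings <- cbind(Toby = c(5, 3, 4), Chan = c(5, 3, 4), Mick = c(1, 5, 2))
round(cosine_sim(ratings), 3)        # Toby and Chan rate identically: sim = 1
```

Each entry of the result is the cosine of the angle between two users' rating vectors: 1 for identical tastes, smaller values for more dissimilar ones.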

Step2- recommending Movies for CHAN:

To recommend movies for Chan using the above similarity matrix, we first need to fill in the N/A entries where he has not rated. As a first step, separate the movies not rated by Chan, and create a weighted matrix by multiplying the user similarity scores (user_sim[,7]) by the ratings given by the other users.
The next step is to sum up all the columns of the weighted matrix, then divide each sum by the sum of the similarities of the critics that reviewed that movie. The resulting calculation gives what the user might rate each movie; the results are below:
The above explanation is written in the below R function:
rec_itm_for_user = function(userNo)
{
  rat_user = critics[userNo, 2:7]   # the user's ratings (NA = not rated)
  col_sums = list()                 # column-wise sums of the weighted matrix
  tot = list()                      # sums of similarities of critics who rated
  z = 1
  for (i in 1:ncol(rat_user)) {
    # calculate the column-wise sum of the weighted ratings
    col_sums[[z]] = sum(weight_mat[, i], na.rm = TRUE)
    temp =[, i])
    sum_temp = 0
    for (j in 1:nrow(temp)) {
      if (![j, 1]))
        sum_temp = sum_temp + user_sim[j, 7]
    }
    tot[[z]] = sum_temp
    z = z + 1
  }
  z = 1
  for (i in 1:ncol(rat_user)) {
    if ([1, i]))               # only fill movies the user has not rated
      rat_user[1, i] = col_sums[[z]] / tot[[z]]
    z = z + 1
  }
  rat_user
}
Calling the above function gives the below results:

Titanic  Batman  Inception  Superman.Returns  spiderMan    Matrix
  2.811     4.5   2.355783                 4          1  3.481427
Recommending movies for Chan will be in the order: Matrix (3.48), Titanic (2.81), Inception (2.36).
The complete source code is available on GitHub.

Thursday, March 20, 2014

Build Web applications using Shiny R

Ever since I’ve started working on R , I always wondered how I can present
the results of my statistical models as web applications. After doing some
research over the internet I’ve come across – ShinyR – a new package
from RStudio which can be used to develop interactive web applications with R.
Before going into how to build web apps using R, let me give you some overview
about ShinyR.

Monday, February 3, 2014

Data Analysis Steps

After going through the overview of tools & technologies needed to become a Data scientist in my previous blog post, in this post, we shall understand how to tackle a data analysis problem.
Any data analysis project starts with identifying a business problem where historical data exists. A business problem can be anything which can include prediction problems, analyzing customer behavior, identifying new patterns from past events, building recommendation engines etc.

Tuesday, January 7, 2014

Data Analysis Tools

As mentioned in my previous post, in this post I will list out the tools, blogs, forums, and online courses that I have gathered over the past year, which I felt were necessary in my journey and which will be helpful to my fellow data science aspirants.
 Skillset Required:
  • Knowledge of statistics – exploratory analysis: doing initial analysis of the data and understanding it to decide which techniques need to be applied. This, I feel, is a must-know subject.
  • Mathematics – basics of calculus, algebra, etc., for mathematical formulation of the problem statement.
  • Understanding of machine learning algorithms for predictive modeling, recommendation engines, classification models, cluster analysis, and social network analysis.
  • Data mining skills, like data cleaning/munging, and applying machine learning techniques to the data.
  • Visualization skills to display the results, and to understand them while building data models.
Tools required: 
Programming Languages: Proficiency in any two of the below mentioned languages would be advisable:
  •  R 
  • Python 
  • Java – comes in handy when we work on Hadoop 
  • C,C++
Tools required: Since I’m using Open source tools, I will be confined to them:
  • R-Studio
  • NLTK toolkit 
  • Rapid Miner
  • Weka
Important Point: 
Most of the machine learning algorithms have already been implemented as packages in the above languages/tools. We just need to download and make use of them.
Big data Tools: 
  • Hadoop setup from Cloudera/Hortonworks
  • Mongodb- NoSQL DB
Visualization tools: 
Though I have not explored much in this area, to date I'm happy with the R packages for visualization.
  • Data exploration in R/Python 
Few Books I have referred: 
  • The Elements of Statistical Learning - 2nd Edition 
  • Simon Sheather, A Modern Approach to Regression
  •  Data Mining 3rd Edition by Ian H. Witten, Eibe Frank, Mark A. Hall
Online Courses: 
Though a lot of courses are available online, I have stuck to very few sites, listed below.
For Data Analysis, Stats, Maths:
For Big Data: 
Blogs and forums: 
Online forums are one place where I used to get a lot of information; in LinkedIn groups I could get answers to all my trivial questions. Post any query and you will get elaborate answers from research scholars to industry experts. I really love this place. I will list down a few LinkedIn groups I follow:
Blogs I follow:
I will add more when I come across new tools. I hope this will serve you as a starting point for the journey. All the best, and Happy New Year! Please do add any new tools and technologies to the above list.
In my next post, I shall write about how to tackle a data analysis problem.