Friday, December 25, 2015

Data Science with R

As the R programming language becomes more and more popular among the data science community, with industries, researchers, and companies embracing it, I will be writing posts on learning data science using R. The tutorial course will cover R data types, handling data with R, probability theory, machine learning, supervised and unsupervised learning, data visualization with R, and more. Before going further, let's look at some stats and tidbits on data science and R.

"A data scientist is simply someone who is highly adept at studying large amounts of often unorganized/undigested data"

“R programming language is becoming the Magic Wand for Data Scientists”

Why R for Data Science?

Daryl Pregibon, a research scientist at Google, said: “R is really important to the point that it’s hard to overvalue it. It allows statisticians to do very intricate and complicated data analysis without knowing the blood and guts of computing systems.”

Some brief stats on R's popularity

“The shortage of data scientists is becoming a serious constraint in some sectors”
David Smith, Chief Community Officer at Revolution Analytics, said:
“Investing in R, whether from the point of view of an individual Data Scientist or a company as a whole is always going to pay off because R is always available. If you’ve got a Data Scientist new to an organization, you can always use R. If you’re a company and you’re putting your practice on R, R is always going to be available. And, there’s also an ecosystem of companies built up around R including Revolution Enterprise to help organizations implement R into their machine critical production processes.”
Enough praise for R; the topics I will cover in the course are:

  1. R Basics
  2. Probability theory in R
  3. Machine Learning in R
  4. Supervised machine learning
  5. Unsupervised machine learning
  6. Advanced Machine Learning in R
  7. Data Visualization in R
See you in the first chapter, meanwhile read about the various data analysis steps involved.

Wednesday, November 18, 2015

Item Based Collaborative Filtering Recommender Systems in R

Continuing the series on implementing recommendation engines: in my previous blog post about recommendation systems in R, I explained how to implement the user-based collaborative filtering approach. In this post, I will explain a basic implementation of item-based collaborative filtering recommender systems in R.
Intuition:

Item based Collaborative Filtering:
Unlike user-based collaborative filtering, discussed previously, item-based collaborative filtering considers the set of items the user has rated and computes their similarity to the target item. Once similar items are found, the rating for the new item is predicted by taking a weighted average of the user's ratings on these similar items.
Let's understand this with an example: consider the dataset below, containing users' ratings of movies. Let us build an algorithm to recommend movies to CHAN.
Implementing an item-based recommender system, like user-based collaborative filtering, requires two steps:
  • Calculating item similarities
  • Predicting the target item's rating for the target user

Step1: Calculating Item Similarity:
This is a critical step; we calculate the similarity between co-rated items, using cosine similarity or Pearson similarity. The output of this step is an item-to-item similarity matrix.

Code snippet:
# Step 1: item-similarity calculation. Co-rated items are considered and
# similarity between two items is calculated using cosine similarity
library(lsa)
ratings = read.csv("Rating Matrix.csv")
x = ratings[,2:7]
x[is.na(x)] = 0

item_sim = cosine(as.matrix(x))
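If the CSV file is not at hand, the same computation can be sketched on a toy ratings matrix (the movie names and ratings here are illustrative, not the actual dataset); `crossprod` in base R gives the same result as `lsa::cosine`:

```r
# Toy ratings matrix: rows = users, columns = movies (NA = not rated)
toy <- matrix(c(5, 3, 4,
                4, NA, 5,
                1, 2, NA),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("Batman", "SuperMan", "Spiderman")))
toy[is.na(toy)] <- 0

# Cosine similarity between columns, equivalent to lsa::cosine(toy)
cosine_sim <- function(m) {
  cp <- crossprod(m)        # dot products between all pairs of columns
  norms <- sqrt(diag(cp))   # column vector lengths
  cp / (norms %o% norms)
}

toy_sim <- cosine_sim(toy)
round(toy_sim, 2)
```

Each diagonal entry is 1 (every item is perfectly similar to itself), and the matrix is symmetric.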
Step 2: Predicting the target item rating for the target user, CHAN.
In this most important step, we predict ratings for the items the user has not rated, using the ratings he has given to previously rated items and the similarity values calculated in the previous step. First we select the item to be predicted, in our case INCEPTION. We predict the rating for INCEPTION by calculating a weighted sum of the ratings given to movies similar to it: for each movie CHAN has rated, we take its similarity score w.r.t. INCEPTION, multiply it by the corresponding rating, and sum over all the rated movies. This sum is then divided by the total of the similarity scores of the rated movies w.r.t. INCEPTION.
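The weighted average just described can be sketched in a few lines; the similarity scores and ratings below are made-up illustrative numbers, not the actual values from the dataset:

```r
# Hypothetical similarities of CHAN's rated movies to INCEPTION,
# and CHAN's ratings for those same movies (illustrative values)
sim_to_inception <- c(Batman = 0.8, SuperMan = 0.6, Spiderman = 0.3)
chan_ratings     <- c(Batman = 4.5, SuperMan = 4.0, Spiderman = 1.0)

# Predicted rating = weighted sum of ratings / sum of similarity weights
pred_inception <- sum(sim_to_inception * chan_ratings) / sum(sim_to_inception)
pred_inception
```

Dividing by the sum of the weights keeps the prediction on the same 1-to-5 scale as the original ratings.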
Recommending the top N items:
Once ratings for all the non-rated movies are predicted, we recommend the top N movies to CHAN. Full code for item-based collaborative filtering in R:
 #data input
 ratings = read.csv("~Rating Matrix.csv")

"step 1: item-similarity calculation\nco-rated items are considered and similarity between two items\nare calculated using cosine similarity"

 library(lsa)
 x = ratings[,2:7]
 x[is.na(x)] = 0
 item_sim = cosine(as.matrix(x))
 
"Recommending items for chan: since three movies are not rated\nas a first step we have to predict rating value for each movie\nin CHANs case we have to first predict values for Titanic, Inception,Matrix"

rec_itm_for_user = function(userno)
{
  # Split the user's columns into rated and non-rated movies
  userRatings = ratings[userno,]
  non_rated_movies = list()
  rated_movies = list()
  for(i in 2:ncol(userRatings)){
    if(is.na(userRatings[,i])){
      non_rated_movies = c(non_rated_movies, colnames(userRatings)[i])
    } else {
      rated_movies = c(rated_movies, colnames(userRatings)[i])
    }
  }
  non_rated_movies = unlist(non_rated_movies)
  rated_movies = unlist(rated_movies)
  # Predict a score for each non-rated movie as a similarity-weighted
  # average of the user's ratings on the movies he has rated
  non_rated_pred_score = list()
  for(j in 1:length(non_rated_movies)){
    # Similarities of every movie to the movie being predicted
    df = item_sim[which(rownames(item_sim) == non_rated_movies[j]),]
    # Denominator: total similarity of the rated movies to this movie
    temp_sum = 0
    for(i in 1:length(rated_movies)){
      temp_sum = temp_sum + df[which(names(df) == rated_movies[i])]
    }
    # Numerator: similarity-weighted ratings (NAs are the non-rated movies)
    weight_mat = df * ratings[userno, 2:7]
    non_rated_pred_score = c(non_rated_pred_score,
                             rowSums(weight_mat, na.rm = TRUE) / temp_sum)
  }
  pred_rat_mat = as.data.frame(non_rated_pred_score)
  names(pred_rat_mat) = non_rated_movies
  # Fill the predicted scores back into the user's rating row
  for(k in 1:ncol(pred_rat_mat)){
    ratings[userno,][which(names(ratings[userno,]) == names(pred_rat_mat)[k])] = pred_rat_mat[1,k]
  }
  return(ratings[userno,])
}

> rec_itm_for_user(7)
  Users  Titanic Batman Inception SuperMan Spiderman   matrix

7  CHAN 3.085298    4.5  2.940811        4         1 3.170034

Calling the above function gives predicted values for the previously unseen movies Titanic, Inception, and Matrix. Now we can sort the predictions and recommend the top items.
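The sorting step can be sketched directly on the predicted scores; the values below are taken from the function output shown above:

```r
# Predicted ratings for CHAN's previously unrated movies (from the run above)
pred <- c(Titanic = 3.085298, Inception = 2.940811, matrix = 3.170034)

# Recommend the top-N movies by predicted rating
top_n <- function(scores, n = 2) {
  names(sort(scores, decreasing = TRUE))[seq_len(min(n, length(scores)))]
}

top_n(pred, 2)
```

With these scores, Matrix and Titanic come out ahead of Inception, so those would be the two movies recommended to CHAN.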
This is all about collaborative filtering in R; in my upcoming posts I will talk about content-based recommender systems in R.

Monday, October 19, 2015

Data Mining Standard Process across Organizations

Recently I came across the term CRISP-DM, a data mining standard. Though the process is not new, I felt every analyst should know about this commonly used, industry-wide process. In this post I will explain the different phases involved in creating a data mining solution.

CRISP-DM, an acronym for Cross-Industry Standard Process for Data Mining, is a data mining process model describing commonly used approaches that data analytics organizations take to tackle data mining business problems. Polls conducted on the same website (KDnuggets) in 2002, 2004, 2007 and 2014 show that it was the leading methodology among the industry data miners who responded to the survey.

Wednesday, October 7, 2015

Introduction to Logistic Regression with R

In my previous blog post I explained linear regression. In today's post I will explain logistic regression.
        Consider a scenario where we need to predict a medical condition of a patient, HIGH BP or NO HIGH BP, based on some observed symptoms: age, weight, smoking status, systolic value, diastolic value, race, etc. In this scenario we have to build a model that takes the above-mentioned symptoms as input values and HBP as the response variable. Note that the response variable (HBP) takes a value from a fixed set of classes: HIGH BP or NO HIGH BP.

Logistic regression – a classification problem, not a prediction problem:

In my previous blog post I said that we use linear regression for scenarios involving prediction. But there is a catch: linear regression cannot be applied when the response variable is not continuous. In our case the response variable is not continuous but a value from a fixed set of classes, so we call this a classification problem rather than a prediction problem. In such scenarios, where the response variable is qualitative rather than continuous, we have to apply a more suitable model: logistic regression.
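As a preview, here is a minimal sketch of fitting such a classifier in R with `glm`. The data is simulated, and the variable names simply mirror the blood-pressure example; a real study would of course use measured patient data:

```r
set.seed(42)
n <- 200

# Simulated predictors (illustrative, not real patient data)
age     <- round(runif(n, 20, 80))
weight  <- round(rnorm(n, 75, 12))
smoking <- rbinom(n, 1, 0.3)

# Simulate a binary HIGH BP outcome whose odds rise with age, weight, smoking
lin <- -10 + 0.08 * age + 0.05 * weight + 0.9 * smoking
hbp <- rbinom(n, 1, plogis(lin))

# Logistic regression: family = binomial uses the logit link
fit <- glm(hbp ~ age + weight + smoking, family = binomial)
summary(fit)$coefficients

# Predicted probability of HIGH BP for a new patient
predict(fit, newdata = data.frame(age = 55, weight = 90, smoking = 1),
        type = "response")
```

Unlike linear regression, the model outputs a probability between 0 and 1, which can then be thresholded to assign the HIGH BP / NO HIGH BP class.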

Thursday, April 9, 2015

Exposing R-script as API

R is becoming a popular programming language in the area of data science. Integrating R scripts with web UI pages is a challenge many application developers face. In this blog post I will explain how to expose an R script as an API using rApache and the Apache web server.
rApache is a project supporting web application development using the R statistical language and environment and the Apache web server.

Exposing an R script as an API typically involves three steps:
  1. Pre-requisites
  2. Installing rApache
  3. Configuring rApache
Step #1 - Pre-requisites:
  • Linux environment - Ubuntu
  • Latest version of R - R 3.1.0

Step #2 - Installing rApache. Install the Apache web server packages and build rApache from source as below:
apt-get install r-base-dev apache2-mpm-prefork apache2-prefork-dev
wget http://www.rapache.net/rapache-1.2.3.tar.gz
tar xzvf rapache-1.2.3.tar.gz
cd rapache-1.2.3
./configure
make
make install
Step #3 - Configuring rApache:
  • Once rApache is installed, start the Apache server as below:
  • service apache2 start
  • Since we are installing for the first time, let us test whether the setup is installed properly with the steps below.
  • sudo vim /etc/apache2/apache2.conf # and add the following:

  • Once you have done the above step, access the pages below:
  • IPAddress/html/index.html

    IPAddress/RApacheInfo.html

  • Before getting into exposing an R script as an API, let us understand the rApache configuration.
    Once rApache is installed, its folder structure is as below:

  • In the apache2.conf file, we need to configure all the sites/R scripts that we want to expose as an API. Below we have set a directory at path “/var/www/RProject“ inside Directory tags in the apache2.conf file.


  • Now place the R script below (test.r), which generates a random normal distribution, in the RProject folder specified above. In the code, GET$q is the input parameter we pass from the URL.


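The script itself is not reproduced above, so here is a minimal sketch of what such a test.r could look like. It assumes rApache's `GET` list and `setContentType()` helper, which exist only when the script is served by rApache; the stubs at the top are purely so the sketch can be dry-run in a plain R session:

```r
# test.r -- served by rApache, which provides setContentType() and the GET list
# (the stubs below only allow a dry run outside rApache; remove for deployment)
if (!exists("setContentType")) setContentType <- function(type) invisible(type)
if (!exists("GET")) GET <- list(q = "10")

setContentType("text/plain")

# Sample size requested via the URL, e.g. .../test.r?q=10
q <- suppressWarnings(as.numeric(GET$q))
if (is.na(q)) q <- 10   # fall back to a default sample size

# Generate and print a random normal sample, one value per line
cat(rnorm(q), sep = "\n")
```

Requesting the script with ?q=10 would then return ten draws from a standard normal distribution as plain text.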
  • The API is ready for testing. We can access it using IP.XXX.XXX.XXX/RProject/test.r?q=10, and the results would be as below:
  • Detailed documentation of the installation can be found here