Friday, March 18, 2016

apply lapply rapply sapply functions in R

As part of the Data Science with R series, this is the third tutorial, following basic data types and control structures in R.

One of the issues with a for loop is its memory consumption and its slowness in executing a repetitive task. When dealing with large data and iterating over it, a for loop is often not advised. R provides many alternatives that can be applied to vectors for looping operations. In this section, we deal with the apply function and its variants:

Datasets for apply family tutorial
 For understanding the apply functions in R we use the data from the 1974 Motor Trend
US magazine, which comprises fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles (1973–74 models).
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Reynolds (1994) describes a small part of a study of the long-term temperature dynamics
of the beaver Castor canadensis in north-central Wisconsin. Body temperature was measured by
telemetry every 10 minutes for four females, but data from one period of less than a
day for each of two animals is used here.

        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]    [,9]   [,10]
day   346.00 346.00 346.00 346.00 346.00 346.00 346.00 346.00  346.00  346.00
time  840.00 850.00 900.00 910.00 920.00 930.00 940.00 950.00 1000.00 1010.00
temp   36.33  36.34  36.35  36.42  36.55  36.69  36.71  36.75   36.81   36.88
activ   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00    0.00    0.00

apply() is the base function of the family. We will learn the apply family of functions by trying out the code. apply() takes 3 arguments:
  • data matrix
  • row/column operation: 1 for row-wise operation, 2 for column-wise operation
  • function to be applied on the data.
When 1 is passed as the second argument, the function max is applied row wise and gives
us the result. In the below example the row-wise maximum value is calculated; since we
have four types of attributes we get 4 results.
apply(t(beaver1), 1, max)
    day    time    temp   activ 
 347.00 2350.00   37.53    1.00 
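To see both margins at once, here is a minimal sketch on a small matrix built inline (independent of the datasets above):

```r
m = matrix(1:6, nrow = 2)   # 2 rows, 3 columns, filled column-wise
apply(m, 1, sum)            # row sums: 9 12
apply(m, 2, sum)            # column sums: 3 7 11
```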

When 2 is passed as the second argument, the function mean is applied column wise.
In the below example the mean function is applied on each column of mtcars and the mean of each
column is calculated; hence we see a result for each column.
apply(mtcars, 2, mean)
       mpg        cyl       disp         hp       drat         wt       qsec         vs         am       gear       carb 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750   0.437500   0.406250   3.687500   2.812500 
We can also pass a custom function instead of the built-in ones. For example, in
the below example let us take each column element modulo 10.
For this we use a custom function which takes each element from each column and
applies the modulo operation.
head(apply(mtcars,2,function(x) x%%10))
                  mpg cyl disp hp drat    wt qsec vs am gear carb
Mazda RX4         1.0   6    0  0 3.90 2.620 6.46  0  1    4    4
Mazda RX4 Wag     1.0   6    0  0 3.90 2.875 7.02  0  1    4    4
Datsun 710        2.8   4    8  3 3.85 2.320 8.61  1  1    4    1
Hornet 4 Drive    1.4   6    8  0 3.08 3.215 9.44  1  0    3    1
Hornet Sportabout 8.7   8    0  5 3.15 3.440 7.02  0  0    3    2
Valiant           8.1   6    5  5 2.76 3.460 0.22  1  0    3    1
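apply() also forwards extra arguments to the function being applied; for instance (a small sketch with a made-up matrix), na.rm can be passed through to mean():

```r
m = matrix(c(1, NA, 3, 4), nrow = 2)
apply(m, 2, mean, na.rm = TRUE)   # column means ignoring NA: 1.0 3.5
```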

lapply() is used for operations on list objects and returns a list object of the same length as the original set: each element of the result is the result of applying FUN to the corresponding element of the input list.
 #create a list with 2 elements
l = list(a = 1:10, b = 11:20)
 # the mean of the values in each element
lapply(l, mean)
$a
[1] 5.5

$b
[1] 15.5

class(lapply(l, mean))
[1] "list"
  # the sum of the values in each element
lapply(l, sum)
$a
[1] 55

$b
[1] 155

sapply() is a wrapper around lapply, the difference being that it returns a vector or matrix instead of a list object.
 # create a list with 2 elements
 l = list(a = 1:10, b = 11:20)
 # mean of values using sapply
sapply(l, mean)
   a    b 
 5.5 15.5
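A quick way to see the difference between the two return types (and that sapply simplifies to a matrix when FUN returns a vector of equal lengths):

```r
l = list(a = 1:10, b = 11:20)
class(lapply(l, mean))   # "list"
class(sapply(l, mean))   # a named numeric vector
sapply(l, range)         # 2x2 matrix: one column per list element
```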

tapply() is a very powerful function that lets you break a vector into pieces and then apply some function to each of the pieces. In the below code, the mpg values in the mtcars data are grouped by cylinder type and then the mean() function is applied to each group.
str(mtcars$cyl)
 num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
levels(factor(mtcars$cyl))
[1] "4" "6" "8"

In the dataset we have 3 types of cylinders and now we want to see the average mpg
for each cylinder type.
tapply(mtcars$mpg, mtcars$cyl, mean)

       4        6        8 
26.66364 19.74286 15.10000 

In the output above we see that the average mpg for a 4-cylinder engine
is 26.66, for a 6-cylinder engine 19.74, and for an 8-cylinder engine 15.10.
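The same pattern works on any pair of value/grouping vectors; a minimal sketch with made-up numbers:

```r
mpg = c(21.0, 22.8, 18.7, 30.4)   # hypothetical mileage values
cyl = c(6, 4, 8, 4)               # grouping vector
tapply(mpg, cyl, mean)            # one mean per cylinder group
```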

by() works similarly to the GROUP BY clause in SQL: it is applied to factors, and we may apply operations on the individual result sets. In the below example, we apply the colMeans() function to the observations of the iris dataset grouped by Species.
str(iris)
'data.frame': 150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
by(iris[, 1:4], iris$Species, colMeans)

iris$Species: setosa
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       5.006        3.428        1.462        0.246 
iris$Species: versicolor
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       5.936        2.770        4.260        1.326 
iris$Species: virginica
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       6.588        2.974        5.552        2.026 

rapply() is a recursive version of lapply.
rapply() applies a function recursively to each element of a list, with several modes controlled by the "how" parameter. If how = "replace", each element of the list which is not itself a list and has a class included in classes is replaced by the result of applying f to the element. If how = "list" or how = "unlist", the list is copied; all non-list elements which have a class included in classes are replaced by the result of applying f to the element, and all others are replaced by deflt. Finally, if how = "unlist", unlist(recursive = TRUE) is called on the result.
l2 = list(a = 1:10, b = 11:20, c = c('d','a','t','a'))
l2
$a
 [1]  1  2  3  4  5  6  7  8  9 10

$b
 [1] 11 12 13 14 15 16 17 18 19 20

$c
[1] "d" "a" "t" "a"

rapply(l2, mean, how = "list", classes = "integer")
$a
[1] 5.5

$b
[1] 15.5

$c
NULL

rapply(l2, mean, how = "unlist", classes = "integer")
 a    b 
 5.5 15.5 
rapply(l2, mean, how = "replace", classes = "integer")
$a
[1] 5.5

$b
[1] 15.5

$c
[1] "d" "a" "t" "a"
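The deflt argument controls what happens to elements whose class is not in classes; a small sketch:

```r
l2 = list(a = 1:10, b = 11:20, c = c("d", "a", "t", "a"))
# the character element 'c' is replaced by the default value NA
rapply(l2, mean, how = "unlist", classes = "integer", deflt = NA)
```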

mapply() is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on. Arguments are recycled if necessary. Its purpose is to vectorize arguments to a function that does not usually accept vectors as arguments. In short, mapply applies a function to multiple list or vector arguments. In the below example the word function is applied to the vector argument LETTERS.
word = function(C, k) paste(rep.int(C, k), collapse = "")
utils::str(mapply(word, LETTERS[1:6], 6:1, SIMPLIFY = FALSE))
List of 6
 $ A: chr "AAAAAA"
 $ B: chr "BBBBB"
 $ C: chr "CCCC"
 $ D: chr "DDD"
 $ E: chr "EE"
 $ F: chr "F"
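A simpler illustration of the element-wise pairing mapply performs:

```r
mapply(function(x, y) x + y, 1:3, 4:6)   # 1+4, 2+5, 3+6 -> 5 7 9
```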

Saturday, February 27, 2016

Control Structures Loops in R

As part of the Data Science tutorial series, in my previous post I wrote about basic data types in R. I have kept the tutorial very simple so that beginners of R programming may take off immediately.
Please find the online R editor at the end of the post so that you can execute the code on the page itself.
In this section we learn about control structures and loops used in R. Control structures in R comprise conditionals and loop statements, like any other programming language.
Loops are very important and form the backbone of any programming language. Before we get into the control structures in R, try the examples below in RStudio:

If else statement:
#See the code syntax below for if else statement
x = 10
if(x > 1){
     print("x is greater than 1")
} else {
     print("x is less than 1")
}
[1] "x is greater than 1"

#See the code below for nested if else statement

if(x > 1 & x < 7){
     print("x is between 1 and 7")
} else if(x > 8 & x < 15){
     print("x is between 8 and 15")
}

[1] "x is between 8 and 15" 
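Besides the if/else statement, R also has a vectorized ifelse() function that applies the test to every element of a vector at once:

```r
x = c(0.5, 3, 9)
ifelse(x > 1, "big", "small")   # "small" "big" "big"
```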

For loops:
As we know for loops are used for iterating items
 #Below code shows for loop implementation
x = c(1,2,3,4,5)
for(i in 1:5){
     print(x[i])
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

While loop :

 #Below code shows while loop in R
x = 2.987
while(x <= 4.987) { 
     x = x + 0.987
     print(c(x, x - 2, x - 1))
}
[1] 3.974 1.974 2.974
[1] 4.961 2.961 3.961
[1] 5.948 3.948 4.948

Repeat Loop:
The repeat loop is an infinite loop and used in association with a break statement.

 #Below code shows repeat loop:
a = 1
 repeat {
     print(a)
     a = a + 1
     if(a > 4) break
 }
[1] 1
[1] 2
[1] 3
[1] 4

Break statement:
A break statement is used in a loop to stop the iterations and pass control outside of the loop.

 #Below code shows break statement:
x = 1:10 
 for (i in x){ 
     if (i == 2){ 
         break
     }
     print(i)
 }
[1] 1

Next statement:
The next statement enables skipping the current iteration of a loop without terminating it.

 #Below code shows next statement 
x = 1: 4 
 for (i in x) { 
     if (i == 2){ 
         next
     }
     print(i)
 }
[1] 1
[1] 3
[1] 4

Creating a function in R:
function() is a built-in R function whose job is to create functions. In the below example function() takes one parameter x and executes a for loop over it.
The function object thus created using function() is assigned to a variable ('words.names'). Now this created function can be called using the variable 'words.names'.

 #Below code shows us, how a function is created in R:

function_name = function(parameters,..){ code}
words = c("R", "datascience", "machinelearning","algorithms","AI") 
words.names = function(x) {
     for(name in x){ 
         print(name)
     }
}
#Calling the function
words.names(words)
[1] "R"
[1] "datascience"
[1] "machinelearning"
[1] "algorithms"
[1] "AI"
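Functions can also return values and take default arguments; a minimal sketch:

```r
pow = function(x, k = 2) x^k   # k defaults to 2
pow(3)      # 9
pow(3, 3)   # 27
```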

Hands on exercise of what we have learnt so far

We create a data frame DF, run for loop, ifelse in a function and call the function
#create 3 vectors name,age,salary
name = c("David","John","Mathew")
age = c(30,40,50)
salary = c(30000,120000,55000)
#create a data frame DF by combining the 3 vectors using cbind() function
DF = data.frame(cbind(name,age,salary))
#display DF
DF
    name age salary
1  David  30  30000
2   John  40 120000
3 Mathew  50  55000
#dimensions of DF
dim(DF)
[1] 3 3
#write a function which returns the highest salaried person's name
findHighSalary = function(df){
     Maxsal = 0
     empname = ""
     for(i in 1:nrow(df)){
         tmpsal = as.numeric(as.character(df[i,3]))
         if(tmpsal > Maxsal){
             Maxsal = tmpsal
             empname = df[i,1]
         }
     }
     return(as.character(empname))
}
#calling the function
findHighSalary(DF)
[1] "John"

Principal Component Analysis using R

Curse of Dimensionality:
One of the most commonly faced problems while dealing with data analytics problems, such as recommendation engines or text analytics, is high-dimensional and sparse data. Many times we face a situation where we have a large set of features and fewer data points, or we have data with very high feature vectors. In such scenarios, fitting a model to the dataset results in lower predictive power of the model. This scenario is often termed the curse of dimensionality. In general, adding more data points or decreasing the feature space, also known as dimensionality reduction, often reduces the effects of the curse of dimensionality.
In this blog, we will discuss principal component analysis, a popular dimensionality reduction technique. PCA is a useful statistical method that has found application in a variety of fields and is a common technique for finding patterns in data of high dimension.

Principal component analysis:
Consider below scenario:
The data we want to work with is in the form of a matrix A of m×n dimension, shown below, where Ai,j represents the value of the i-th observation of the j-th variable.
Thus each of the n columns of the matrix can be identified with an m-dimensional vector of observations of one variable. If n is very large it is often desirable to reduce the number of variables to a smaller number, say k variables as in the image below, while losing as little information as possible.
Mathematically spoken, PCA is a linear orthogonal transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
The algorithm when applied linearly transforms m-dimensional input space to n-dimensional (n < m) output space, with the objective to minimize the amount of information/variance lost by discarding (m-n) dimensions. PCA allows us to discard the variables/features that have less variance.
Technically speaking, PCA uses orthogonal projection of highly correlated variables to a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This linear transformation is defined in such a way that the first principal component has the largest possible variance. It accounts for as much of the variability in the data as possible by considering highly correlated features. Each succeeding component in turn has the highest variance using the features that are less correlated with the first principal component and that are orthogonal to the preceding component.
In the above image, u1 & u2 are principal components wherein u1 accounts for highest variance in the dataset and u2 accounts for next highest variance and is orthogonal to u1.
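The transformation described above can be checked numerically: the principal component standard deviations are the square roots of the eigenvalues of the data's covariance matrix. A sketch on a few mtcars columns (not part of the crimtab example below):

```r
X = mtcars[, c("mpg", "disp", "hp", "wt")]
eig = eigen(cov(X))    # eigen-decomposition of the covariance matrix
pc = prcomp(X)         # PCA (centering is on by default)
all.equal(pc$sdev, sqrt(eig$values))   # TRUE
```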

PCA implementation in R:
For today’s post we use the crimtab dataset available in R: data on 3000 male criminals over 20 years old undergoing their sentences in the chief prisons of England and Wales. The 42 row names ("9.4", "9.5" ...) correspond to midpoints of intervals of finger lengths, whereas the 22 column names ("142.24", "144.78" ...) correspond to (body) heights of the 3000 criminals, see also below.
    142.24 144.78 147.32 149.86 152.4 154.94 157.48 160.02 162.56 165.1 167.64 170.18 172.72 175.26 177.8 180.34
9.4      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
9.5      0      0      0      0     0      1      0      0      0     0      0      0      0      0     0      0
9.6      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
9.7      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
9.8      0      0      0      0     0      0      1      0      0     0      0      0      0      0     0      0
9.9      0      0      1      0     1      0      1      0      0     0      0      0      0      0     0      0
    182.88 185.42 187.96 190.5 193.04 195.58
9.4      0      0      0     0      0      0
9.5      0      0      0     0      0      0
9.6      0      0      0     0      0      0
9.7      0      0      0     0      0      0
9.8      0      0      0     0      0      0
9.9      0      0      0     0      0      0
dim(crimtab)
[1] 42 22
str(crimtab)
 'table' int [1:42, 1:22] 0 0 0 0 0 0 1 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:42] "9.4" "9.5" "9.6" "9.7" ...
  ..$ : chr [1:22] "142.24" "144.78" "147.32" "149.86" ...

sum(crimtab)
[1] 3000

colnames(crimtab)
 [1] "142.24" "144.78" "147.32" "149.86" "152.4"  "154.94" "157.48" "160.02" "162.56" "165.1"  "167.64" "170.18" "172.72" "175.26" "177.8"  "180.34"
[17] "182.88" "185.42" "187.96" "190.5"  "193.04" "195.58"

Let us use apply() on the crimtab dataset column wise to calculate the variance and see how each variable varies.
apply(crimtab, 2, var)

We observe that column “165.1” contains the maximum variance in the data. Now let us apply PCA using prcomp().
pca =prcomp(crimtab)

Note: the resultant components of the pca object from the above code are the standard deviations and the rotation. From the standard deviations we can observe that the 1st PC explains most of the variation, followed by the other PCs. Rotation contains the principal component loadings matrix, which explains the proportion of each variable along each principal component.
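The proportion of variance explained by each component can be computed directly from those standard deviations (a short sketch continuing with the pca object):

```r
pca = prcomp(crimtab)
pr_var = pca$sdev^2            # variance of each principal component
pve = pr_var / sum(pr_var)     # proportion of variance explained
cumsum(pve)                    # cumulative proportion, reaches 1
```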

Let’s plot all the principal components and see how the variance is accounted for by each component.
par(mar = rep(2, 4))
plot(pca)

Clearly the first principal component accounts for maximum information.
Let us interpret the results of pca using a biplot graph. A biplot is used to show the proportions of each variable along the two principal components.
#below code changes the directions of the biplot; if we do not include the below two lines the plot will be a mirror image of the one shown.
pca$rotation = -pca$rotation
pca$x = -pca$x
biplot(pca, scale = 0)

The output of the preceding code is as follows:

In the preceding image, known as a biplot, we can see the two principal components (PC1 and PC2) of the crimtab dataset. The red arrows represent the loading vectors, which represent how the feature space varies along the principal component vectors.
From the plot, we can see that the first principal component vector, PC1, places more or less equal weight on three features: 165.1, 167.64, and 170.18. This means that these three features are more correlated with each other than with the 160.02 and 162.56 features.
In the second principal component, PC2 places more weight on 160.02 and 162.56 than on the three features 165.1, 167.64, and 170.18, which are less correlated with them.
Complete Code for PCA implementation in R:

So by now we have understood how to run PCA and how to interpret the principal components. Where do we go from here? How do we apply the reduced-variable dataset? In our next post we shall answer these questions.

Tuesday, February 16, 2016

Basic Data Types in R

As part of the tutorial series on Data Science with R from Data Perspective, this first tutorial introduces the very basics of the R programming language: basic data types in R.

What we learn:
At the end of the chapter, you are provided with an R console so that you can practice what you have learnt in this chapter.

R assignment operator
x = 'welcome to R programming' # assigning a string literal to variable x 
x
[1] "welcome to R programming"
typeof(x) #to check the data type of the variable x
[1] "character"
Numeric: Numeric data represents decimal data.
x = 1.5 #assigning decimal value 1.5 to x
x
[1] 1.5 
To check the data type we use the class() function:
class(x)
[1] "numeric"
To check whether the variable “x” is numeric or not, we use is.numeric():
is.numeric(x)
[1] TRUE
To convert any compatible data into numeric, we use:
 x = '1' #assigning character value '1' to variable x
class(x)
[1] "character"
 x = as.numeric(x)
x
[1] 1
class(x)
[1] "numeric"

Note: if we try to convert a string literal to numeric data type we get the following result.

x = 'welcome to R programming'
as.numeric(x)
[1] NA
Warning message:
NAs introduced by coercion
Integer: We use the as.integer() function to convert into integers. This converts a numeric value to an integer value.
x = 1.34
x
[1] 1.34
class(x)
[1] "numeric"
y = as.integer(x)
class(y)
[1] "integer"
y
[1] 1
Note: to check if a value is an integer or not we use the is.integer() function.
In the below example ‘x’ is a numeric (decimal) value whereas ‘y’ is an integer.
is.integer(y)
[1] TRUE
Complex: Complex data types are shown below, though we use them rarely in day-to-day data analysis:
c = 3.5+4i
c
[1] 3.5+4i
is.complex(c)
[1] TRUE
class(c)
[1] "complex"
Logical: Logical is one of the frequently used data types, usually used for comparing two values. The values a logical data type takes are TRUE or FALSE.
logical = T
logical
[1] TRUE
Character: String literals or string values are stored as character objects in R.
str = "R Programming"
str
[1] "R Programming"
class(str)
[1] "character"
is.character(str)
[1] TRUE
We can convert other data types to the character data type using the as.character() function.
x = as.character(1)
x
[1] "1"
class(x)
[1] "character"
Note: There are a variety of operations that can be applied to characters, such as taking substrings and finding lengths; these will be dealt with as and when appropriate.
So far we have learnt about the basic data types in R; let’s get into slightly more complex data types.
Vector: How do we hold a collection of values of the same data type? We come across this requirement very frequently. We have the vector data type to solve this problem.
Consider a numerical vector below:
num_vec = c(1,2,3,4,5)
num_vec
[1] 1 2 3 4 5
class(num_vec)
[1] "numeric"
We can apply many operations on the vector variables such as length, accessing values or members of the vector variable.
The length of a vector can be found using the length() function.
length(num_vec)
[1] 5
We access each element or member of the vector num_vec using its index, starting from 1.
In the below example we access the members at the 1st, 2nd and 3rd positions.
num_vec[1]
[1] 1
num_vec[2]
[1] 2
num_vec[3]
[1] 3
Similarly string vectors, logical vectors, integer vectors can be created.
char_vec = c("A", "Course", "On", "Data science", "R programming")
char_vec
[1] "A"  "Course" "On" "Data science" "R programming"
length(char_vec)
[1] 5
char_vec[1]
[1] "A"
char_vec[2]
[1] "Course"
char_vec[4]
[1] "Data science"
Matrix: The matrix data type is used when we want to represent data as a collection of numerical values in m×n (m by n) dimensions. Matrices are used mostly when dealing with mathematical equations, machine learning and text mining algorithms.
Now how do we create a matrix?
m = matrix(c(1,2,3,6,7,8),nrow = 2,ncol = 3)
m
     [,1] [,2] [,3]
[1,]    1    3    7
[2,]    2    6    8
class(m)
[1] "matrix"
The dimension of the matrix is found using dim():
dim(m)
[1] 2 3
How do we access the elements of matrix m?
#accessing individual elements is done using indexes as shown below. In the below example we access the 1st, 2nd and 6th elements of matrix m.
m[1]
[1] 1
m[2]
[1] 2
m[6]
[1] 8
m[2,3] # here we access the 2nd row, 3rd column element.
[1] 8
# accessing all elements of each row of the matrix m shown below. 
m[1,]
[1] 1 3 7
m[2,]
[1] 2 6 8
#accessing all elements of each column
m[,1]
[1] 1 2
m[,2]
[1] 3 6
m[,3]
[1] 7 8 
What happens when we add different data types to a vector?
v = c("a","b",1,2,3,T)
v
[1] "a"    "b"    "1"    "2"    "3"    "TRUE"
class(v)
[1] "character"
v[6]
[1] "TRUE"
class(v[6])
[1] "character"
What happened in the above example is that R coerced all the different data types into the single data type character, to maintain the condition that a vector holds a single data type.
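Coercion follows a fixed hierarchy (logical < integer < double < character): mixing types promotes everything to the most general type present. A quick sketch:

```r
class(c(TRUE, 1L))        # "integer": logical promoted
class(c(TRUE, 1L, 2.5))   # "numeric": integer promoted to double
class(c(1, "a"))          # "character": everything becomes character
```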
List: What if we want to handle different data types in a single object?
The list data type helps us store elements of different data types in a single object.
We create list objects using the list() function.
In the below example I have created a list object “list_exp” with 6 different elements of character, numeric and logical data types.
list_exp = list("r programming","data perspective",12345,67890,TRUE,F)
list_exp
[[1]]
[1] "r programming"

[[2]]
[1] "data perspective"

[[3]]
[1] 12345

[[4]]
[1] 67890

[[5]]
[1] TRUE

[[6]]
[1] FALSE
Using the str() function, we can know the structure of a list object, i.e. the internal structure of the list object. This is one of the very important functions which we use in our day-to-day analysis.
In the below example we see a list of 6 elements of character, numerical and logical data types.
str(list_exp)
List of 6
 $ : chr "r programming"
 $ : chr "data perspective"
 $ : num 12345
 $ : num 67890
 $ : logi TRUE
 $ : logi FALSE
#accessing the data type of list_exp
class(list_exp)
[1] "list"
#length of the list
length(list_exp)
[1] 6
#accessing the list elements using indexing.
list_exp[[1]]
[1] "r programming"
list_exp[[7]] # when we try accessing a non-existing element we get the below error.
Error in list_exp[[7]] : subscript out of bounds
# finding the class of an individual list element
class(list_exp[[5]])
[1] "logical"
class(list_exp[[3]])
[1] "numeric"
class(list_exp[[1]])
[1] "character"
Data Frame: Most of us come from a bit of a SQL background and are very comfortable handling data in the form of SQL tables, because of the functionality a SQL table offers for working with data.
How would it be if we had such a data type object available in R, which could be used to store and manipulate data in a very easy, efficient and convenient way?
R offers the data frame data type for exactly this. We can treat a data frame similarly to a SQL table.
How do we create a data frame?
#creating a data frame
data_frame = data.frame(first=c(1,2,3,4),second=c("a","b","c","d"))
data_frame
  first second
1     1      a
2     2      b
3     3      c
4     4      d
#accessing the data type of the object
class(data_frame)
[1] "data.frame"
#finding out the row count of data_frame using nrow()
nrow(data_frame)
[1] 4
#finding out the column count of data_frame using ncol()
ncol(data_frame)
[1] 2
#finding out the dimensions of data_frame using dim()
dim(data_frame)
[1] 4 2
#finding the structure of the data frame using str()
str(data_frame)
'data.frame': 4 obs. of  2 variables:
 $ first : num  1 2 3 4
 $ second: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
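Whether the second column comes back as a factor (as shown above) or as character depends on your R version: before R 4.0.0, data.frame() defaulted to stringsAsFactors = TRUE, afterwards to FALSE. The behaviour can be pinned explicitly:

```r
df1 = data.frame(second = c("a", "b"), stringsAsFactors = TRUE)
class(df1$second)   # "factor"
df2 = data.frame(second = c("a", "b"), stringsAsFactors = FALSE)
class(df2$second)   # "character"
```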
#accessing an entire row of the data frame using the row index. Observe below that if we use data_frame[1,] without specifying a column number, it means that we want to access all the columns of row 1.
data_frame[1,]
  first second
1     1      a
#similarly, to access only the 1st column values without row information use data_frame[,1] 
data_frame[,1]
[1] 1 2 3 4
#accessing the row names of the data frame.
rownames(data_frame)
[1] "1" "2" "3" "4"
#accessing the column names of the data frame
colnames(data_frame)
[1] "first"  "second"
#column data can be accessed using the column names explicitly instead of column indexes
data_frame$first
[1] 1 2 3 4
data_frame$second
[1] a b c d
Levels: a b c d
#accessing individual values using row and column indexes
data_frame[1,1] # accessing first row first column
[1] 1
data_frame[2,2] # accessing second row second column
[1] b
Levels: a b c d
data_frame[3,2]  # accessing third row second column
[1] c
Levels: a b c d
data_frame[3,1] # accessing third row first column
[1] 3
Note: Observe the below data frame:
dt_frame = data.frame(first=c(1,2,3,4,5,6,7),second=c("Big data","Python","R","NLP","machine learning","data science","data perspective"))
dt_frame
  first           second
1     1         Big data
2     2           Python
3     3                R
4     4              NLP
5     5 machine learning
6     6     data science
7     7 data perspective
Assume we have a dataset with 1000 rows instead of the 7 rows shown in the above data frame. If we want to see just a sample of the data frame's data, how do we do it?
Using the head() function.
head(dt_frame)
  first           second
1     1         Big data
2     2           Python
3     3                R
4     4              NLP
5     5 machine learning
6     6     data science
head() returns the first six rows of a data frame so that we can have a look at what the data frame is like.
Also we can use the tail() function to see the last six rows of the data frame.
tail(dt_frame)
  first           second
2     2           Python
3     3                R
4     4              NLP
5     5 machine learning
6     6     data science
7     7 data perspective
We have View() function to see the values of a data frame in a tabular form.

Friday, December 25, 2015

Data Science with R

As the R programming language is becoming more and more popular among the data science community, with industries, researchers and companies embracing R, going forward I will be writing posts on learning Data Science using R. The tutorial course will include topics on R data types, handling data using R, probability theory, Machine Learning, supervised and unsupervised learning, Data Visualization using R, etc. Before going further, let’s see some stats and tidbits on data science and R.

"A data scientist is simply someone who is highly adept at studying large amounts of often unorganized/undigested data"

“R programming language is becoming the Magic Wand for Data Scientists”

Why R for Data Science?

Daryl Pregibon, a research scientist at Google said- “R is really important to the point that it’s hard to overvalue it. It allows statisticians to do very intricate and complicated data analysis without knowing the blood and guts of computing systems.”

A brief stats for R popularity

“The shortage of data scientists is becoming a serious constraint in some sectors”
David Smith, Chief Community Officer at Revolution Analytics said –
“Investing in R, whether from the point of view of an individual Data Scientist or a company as a whole is always going to pay off because R is always available. If you’ve got a Data Scientist new to an organization, you can always use R. If you’re a company and you’re putting your practice on R, R is always going to be available. And, there’s also an ecosystem of companies built up around R including Revolution Enterprise to help organizations implement R into their machine critical production processes.”
Enough of the candy about R. The topics which I cover in the course are:

  1. R Basics
  2. Probability theory in R
  3. Machine Learning in R
  4. Supervised machine learning 
  5. Unsupervised machine learning 
  6. Advanced Machine Learning in R 
  7. Data Visualization in R 
See you in the first chapter, meanwhile read about the various data analysis steps involved.

Wednesday, November 18, 2015

Item Based Collaborative Filtering Recommender Systems in R

In the series on implementing recommendation engines, in my previous blog about recommendation systems in R I explained how to implement the user-based collaborative filtering approach using R. In this post, I will explain the basic implementation of item-based collaborative filtering recommender systems in R.

Item based Collaborative Filtering:
Unlike the user-based collaborative filtering discussed previously, in item-based collaborative filtering we consider the set of items rated by the user and compute item similarities with the targeted item. Once similar items are found, the rating for the new item is predicted by taking a weighted average of the user’s ratings on these similar items.
Let's understand with an example: consider the below dataset, containing users' ratings for movies. Let us build an algorithm to recommend movies to CHAN.
Implementing Item based recommender systems, like user based collaborative filtering, requires two steps:
  • Calculating item similarities
  • Predicting the targeted item rating for the targeted user.

Step1: Calculating Item Similarity:
This is a critical step; we calculate the similarity between co-rated items. We use cosine similarity or Pearson similarity to compute the similarity between items. The output of this step is a similarity matrix between items.
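The cosine() used in the snippet below comes from the lsa package; the underlying formula for two rating vectors is simply their dot product divided by the product of their norms. A hand-rolled sketch:

```r
cos_sim = function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cos_sim(c(5, 3, 0), c(5, 3, 0))   # identical vectors -> 1
cos_sim(c(1, 0), c(0, 1))         # orthogonal vectors -> 0
```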

Code snippet:
#step 1: item-similarity calculation: co-rated items are considered and similarity between two items
#are calculated using cosine similarity
library(lsa)   # provides cosine()
ratings = read.csv("Rating Matrix.csv")
x = ratings[,2:7]
x[is.na(x)] = 0

item_sim = cosine(as.matrix(x))
Step2: Predicting the targeted item rating for the targeted User CHAN.
In this most important step, we first predict the items which the user has not rated, by making use of the ratings he has made on previously interacted items and the similarity values calculated in the previous step. First we select an item to be predicted, in our case “INCEPTION”. We predict the rating for the INCEPTION movie by calculating the weighted sum of ratings made to movies similar to INCEPTION: we take the similarity score of each movie rated by CHAN w.r.t. INCEPTION, multiply it by the corresponding rating, and sum up over all the rated movies. This final sum is divided by the total sum of similarity scores of the rated items w.r.t. INCEPTION.
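Numerically, the prediction is just a similarity-weighted average. A sketch with hypothetical similarity scores and ratings (these numbers are made up, not taken from the dataset):

```r
sims = c(Batman = 0.8, SuperMan = 0.6, Spiderman = 0.3)  # similarity to INCEPTION (hypothetical)
rats = c(Batman = 4.5, SuperMan = 4.0, Spiderman = 1.0)  # CHAN's ratings (hypothetical)
pred = sum(sims * rats) / sum(sims)   # weighted average, about 3.71
pred
```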
Recommending Top N items:
Once all the non-rated movies are predicted, we recommend the top N movies to CHAN. Code for item-based collaborative filtering in R:
 #data input
 library(lsa)   # provides cosine()
 ratings = read.csv("~Rating Matrix.csv")

 #step 1: item-similarity calculation
 #co-rated items are considered and similarity between two items
 #are calculated using cosine similarity

 x = ratings[,2:7]
 x[is.na(x)] = 0
 item_sim = cosine(as.matrix(x))

 #Recommending items for CHAN: since three movies are not rated,
 #as a first step we have to predict a rating value for each movie;
 #in CHAN's case we have to first predict values for Titanic, Inception, Matrix

 rec_itm_for_user = function(userno){
   #separate the movies rated and not rated by the user
   userRatings = ratings[userno,]
   non_rated_movies = list()
   rated_movies = list()
   for(i in 2:ncol(userRatings)){
     if(is.na(userRatings[,i])){
       non_rated_movies = c(non_rated_movies,colnames(userRatings)[i])
     }else{
       rated_movies = c(rated_movies,colnames(userRatings)[i])
     }
   }
   non_rated_movies = unlist(non_rated_movies)
   rated_movies = unlist(rated_movies)
   #create weighted similarity for all the movies rated by the user
   non_rated_pred_score = list()
   for(j in 1:length(non_rated_movies)){
     temp_sum = 0
     df = item_sim[which(rownames(item_sim)==non_rated_movies[j]),]
     for(i in 1:length(rated_movies)){
       temp_sum = temp_sum + df[which(names(df)==rated_movies[i])]
     }
     weight_mat = df*ratings[userno,2:7]
     non_rated_pred_score = c(non_rated_pred_score,rowSums(weight_mat,na.rm=T)/temp_sum)
   }
   pred_rat_mat = as.data.frame(non_rated_pred_score)
   names(pred_rat_mat) = non_rated_movies
   #fill the predicted scores back into the user's rating row
   for(k in 1:ncol(pred_rat_mat)){
     ratings[userno,][which(names(ratings[userno,]) == names(pred_rat_mat)[k])] = pred_rat_mat[1,k]
   }
   return(ratings[userno,])
 }

> rec_itm_for_user(7)
  Users  Titanic Batman Inception SuperMan Spiderman   matrix

7  CHAN 3.085298    4.5  2.940811        4         1 3.170034

Calling the above function gives the predicted values for the previously unrated movies Titanic, Inception, and Matrix. Now we can sort these and recommend the top items.
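The sorting step can be sketched as below, using the predicted values from the output above:

```r
# Keep only the movies that were originally unrated, sort them by
# predicted rating, and recommend the top N to the user
predicted = c(Titanic = 3.085298, Inception = 2.940811, matrix = 3.170034)
N = 2
top_n = names(sort(predicted, decreasing = TRUE))[1:N]
top_n  # "matrix" and "Titanic" come out on top for CHAN
```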
This is all about collaborative filtering in R. In my upcoming posts I will talk about content-based recommender systems in R.

Monday, October 19, 2015

Data Mining Standard Process across Organizations

Recently I came across the term CRISP-DM, a data mining standard. Though this process is not a new one, I felt every analyst should know about this commonly used industry-wide process. In this post I will explain the different phases involved in creating a data mining solution.

CRISP-DM, an acronym for Cross Industry Standard Process for Data Mining, is a data mining process model that describes commonly used approaches that data analytics organizations use to tackle data mining business problems. It breaks a project into six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Polls conducted on the same website (KDnuggets) in 2002, 2004, 2007 and 2014 show that it was the leading methodology used by the industry data miners who chose to respond to the surveys.

Wednesday, October 7, 2015

Introduction to Logistic Regression with R

In my previous blog I explained linear regression. In today's post I will explain logistic regression.
        Consider a scenario where we need to predict a medical condition of a patient (HBP), HAVE HIGH BP or NO HIGH BP, based on some observed symptoms: age, weight, IsSmoking, systolic value, diastolic value, race, etc. In this scenario we have to build a model which takes the above mentioned symptoms as input values and HBP as the response variable. Note that the response variable (HBP) is a value from a fixed set of classes: HAVE HIGH BP or NO HIGH BP.

Logistic regression – a classification problem, not a prediction problem:

In my previous blog I said that we use linear regression for scenarios that involve prediction. But there is a catch: linear regression cannot be applied in scenarios where the response variable is not continuous. In our case the response variable is not continuous but a value from a fixed set of classes. We call such scenarios classification problems rather than prediction problems. In scenarios where the response variable is qualitative rather than continuous, we have to apply a more suitable model, namely logistic regression, for classification.
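As a quick preview, here is a hedged sketch of fitting such a model in R with glm(); the data is simulated and the variable names and coefficients are made up for illustration, not taken from a real medical study:

```r
set.seed(42)
n = 200
age    = rnorm(n, mean = 50, sd = 10)
weight = rnorm(n, mean = 80, sd = 12)
# simulate HIGH BP (1) / NO HIGH BP (0) from an assumed logistic model
hbp = rbinom(n, 1, plogis(-20 + 0.25 * age + 0.08 * weight))
# logistic regression: family = binomial models log-odds of the class
fit = glm(hbp ~ age + weight, family = binomial)
# predicted probability of HIGH BP for a new patient
predict(fit, newdata = data.frame(age = 60, weight = 95), type = "response")
```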

Thursday, April 9, 2015

Exposing R-script as API

R is becoming a popular programming language in the area of data science. Integrating R scripts with web UI pages is a challenge many application developers face. In this blog post I will explain how we can expose an R script as an API, using rApache and the Apache web server.
rApache is a project supporting web application development using the R statistical language and environment and the Apache web server.

Exposing an R script as an API typically involves 3 steps:
  1. Pre-requisites
  2. Installing rApache
  3. Configuring rApache
Step #1 - Pre-requisites:
  • Linux environment - Ubuntu
  • Latest version of R - R 3.1.0

Step #2 - Install the Apache web server prerequisites and build rApache as below:
apt-get install r-base-dev apache2-mpm-prefork apache2-prefork-dev
tar xzvf rapache-1.2.3.tar.gz
cd rapache-1.2.3
./configure
make && make install
Step#3: Configuring rApache.
  • Once rApache is installed, start the rApache server as below:
  • service apache2 start
  • Since we are installing for the first time, let us test if the setup is installed properly. Do the below steps.
  • sudo vim /etc/apache2/apache2.conf # add the rApache configuration here (the LoadModule line for mod_R.so and a test handler, as described in the rApache documentation)

  • Once you have done the above step, access the below page:
  • IPAddress/html/index.html


  • Before getting into exposing the R script as an API, let us understand the rApache configuration.
    Once rApache is installed, the folder structure of rApache is as below:

  • In the apache2.conf file, we need to configure all the sites/R scripts that we need to expose as an API. Below we have set a directory at path “/var/www/RProject“ inside Directory tags in the apache2.conf file (in a typical rApache setup, this block sets SetHandler r-script and RHandler sys.source for that path).

  • Now place the below R script (test.r), which generates a random normal distribution, in the RProject folder specified above. In the code below, GET$p is the input parameter we pass from the URL.

  • The API is ready for testing. We can access the API using IP.XXX.XXX.XXX/RProject/test.r?p=10 and the results would be as below:
  • Detailed documentation of the installation can be found here
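Since the script itself did not survive into this post, here is a sketch of what test.r could look like. Note that GET and setContentType() are injected by rApache at request time; the fallbacks below exist only so the sketch runs outside the server:

```r
# test.r - generate a random normal sample of size ?p=... and print it
if (!exists("GET")) GET = list(p = "10")                        # rApache supplies GET
if (!exists("setContentType")) setContentType = function(t) invisible(t)
setContentType("text/plain")
p = as.numeric(GET$p)          # input parameter from the URL query string
x = rnorm(p)                   # p draws from the standard normal distribution
cat(paste(round(x, 3), collapse = ", "))
```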

Sunday, October 5, 2014

Regression Analysis using R

What is a Prediction Problem?
A prediction problem is a business problem that involves predicting future events by extracting patterns from historical data. Prediction problems are solved using statistical techniques, mathematical models, or machine learning techniques.
For example: Forecasting stock price for the next week, predicting which football team wins the world cup, etc.

What is Regression analysis, where is it applicable?
While dealing with any prediction problem, the easiest, most widely used yet powerful technique is linear regression. Regression analysis is used for modeling the relationship between a response variable and one or more input variables.
In simpler terms, regression analysis helps us to:
  • Predict future observations
  • Find associations and relationships between variables
  • Identify which variables contribute most towards predicting future outcomes

Types of regression problems:

Simple Linear Regression:

If the model deals with one input variable, called the independent or predictor variable, and one output variable, called the dependent or response variable, then it is called simple linear regression. This type of linear regression assumes that there exists a linear relation between the predictor and response variable of the form
Y ≈ β0 + β1X + e.
In the above equation, β0 and β1 are the unknown constants that represent the intercept and slope of a straight line, which we learned in high school. These unknown constants are known as the model coefficients or parameters. In the above equation, X is the known input variable, and if we can estimate β0 and β1 by some method then Y can be predicted. In order to predict future outcomes, we use the training data to estimate the unknown model parameters (β̂0, β̂1) via the equation
ŷ = β̂0 + β̂1x, where ŷ, β̂0 and β̂1 are the estimates.
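A toy sketch of this estimation in R, with data simulated from a known line (intercept 2, slope 3) so the lm() estimates can be checked against the truth:

```r
set.seed(1)
x = runif(50, 0, 10)
y = 2 + 3 * x + rnorm(50)   # Y = b0 + b1*X + e, with b0 = 2 and b1 = 3
fit = lm(y ~ x)             # least-squares estimates of b0, b1
coef(fit)                   # estimates land close to 2 and 3
```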
Multiple Linear Regression:
If the problem contains more than one input variable and one response variable, then it is called multiple linear regression, of the form Y ≈ β0 + β1X1 + β2X2 + · · · + βpXp + e.

How do we apply Regression analysis using R?
Let us apply regression analysis on power plant dataset available from here. The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. Features consist of hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (EP) of the plant.
  • Read the data into R environment:
library(xlsx) #read.xlsx() comes from the xlsx package
sample1 = read.xlsx("C:\\Suresh\\blogs\\datasets\\CCPP\\Folds5x2_pp.xlsx",sheetIndex=1)
  • Understand and observe the data: View(sample1)
Check for missing values, range of variables, and density plots for each of the variables:
sum(is.na(sample1)) #[1] 0 - no missing values
range(sample1$AT) #1.81, 37.11
mean(sample1$AT) #19.65
Density plot for Temperature

The scatter plots show us that temperature (AT) and vacuum (V) are inversely related to power while pressure (AP) and RH are not strongly related.
  • Check for correlation among the variables. This step is very important to understand the relation of the dependent variable with the independent variables and the correlations among the variables. In general, there shouldn't be any strong correlation among the independent variables.
           AT          V          AP          RH         PE
AT  1.0000000  0.8441067 -0.50754934 -0.54253465 -0.9481285
V   0.8441067  1.0000000 -0.41350216 -0.31218728 -0.8697803
AP -0.5075493 -0.4135022  1.00000000  0.09957432  0.5184290
RH -0.5425347 -0.3121873  0.09957432  1.00000000  0.3897941
#inferences --> AT has a strong negative relation with PE
#V is also strongly (negatively) related to PE
#the other two are only moderately related to PE
  • Divide the data into training and test sets, train the model with linear regression using the lm() method available in R, and then make predictions on the new test data using the predict() method.
rand = sample1[sample(nrow(sample1)),] #shuffle the rows before splitting
tr = rand[1:6697,]
ts = rand[6698:9568,]
model2 = lm(PE~AT+V+AP+RH,data=tr)

Call:
lm(formula = PE ~ AT + V + AP + RH, data = tr)

Residuals:
    Min      1Q  Median      3Q     Max
-43.533  -3.170  -0.068   3.229  17.451

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 457.729155  11.794172   38.810  < 2e-16 ***
AT           -1.987307   0.018208 -109.147  < 2e-16 ***
V            -0.231996   0.008692  -26.689  < 2e-16 ***
AP            0.059235   0.011442    5.177 2.32e-07 ***
RH           -0.159916   0.005015  -31.886  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.585 on 6692 degrees of freedom
Multiple R-squared:  0.9281,      Adjusted R-squared:  0.9281
F-statistic: 2.161e+04 on 4 and 6692 DF,  p-value: < 2.2e-16

New predictions are made using the predict method.
pred = predict(model2,ts[,1:4])

The actual vs predicted results are quite accurate. From the summary results of the model, below are the key takeaways:
  • The model is accurate as R² is near 1 (0.9281).
  • The model states that all the variables are significant; the *** marks indicate the significance.
  • The p-values are less than 0.05 and the F-statistic is significantly high.
  • The residuals vs fitted plot and the normal Q-Q plot also look good, with the errors centered around 0.
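One more quick check on accuracy is the root mean squared error on the test set. A minimal sketch, where actual and predicted stand in for ts$PE and pred from the code above (the numbers are illustrative):

```r
# RMSE: the typical size of the prediction error, in the units of PE (MW)
actual    = c(480.5, 445.2, 438.9, 470.1)   # illustrative test-set values
predicted = c(478.9, 447.0, 440.2, 468.5)
rmse = sqrt(mean((actual - predicted)^2))
rmse
```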

In the next blog we will learn about model validation and extensions of linear regression.