Saturday, February 27, 2016

Control Structures Loops in R

As part of the Data Science tutorial series, my previous post covered the basic data types in R. I have kept the tutorial very simple so that beginners of R programming can take off immediately.
Please find the online R editor at the end of the post so that you can execute the code on the page itself.
In this section we learn about the control structures and loops used in R. Control structures in R include conditionals and loop statements, as in any other programming language.
Loops are very important and form the backbone of any programming language. Before we get into the control structures in R, note that you can type ?Control in RStudio to bring up the built-in help page on control flow.

If else statement:
#See the code syntax below for the if else statement
x = 2
if(x > 1){
    print("x is greater than 1")
} else {
    print("x is less than 1")
}
[1] "x is greater than 1"

#See the code below for a nested if else statement
x = 10
if(x > 1 & x < 7){
    print("x is between 1 and 7")
} else if(x > 8 & x < 15){
    print("x is between 8 and 15")
}

[1] "x is between 8 and 15"
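Besides the block form above, R also provides the vectorized ifelse() function, which applies the test to every element of a vector at once; a minimal sketch:

```r
# ifelse(test, yes, no) evaluates the condition element-wise
x = c(0.5, 3, 9)
ifelse(x > 1, "greater than 1", "not greater than 1")
# [1] "not greater than 1" "greater than 1"     "greater than 1"
```

This is handy when you want one answer per element rather than a single branch of code.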

For loops:
As we know, for loops are used for iterating over items.
#Below code shows a for loop implementation
x = c(1,2,3,4,5)
for(i in 1:5){
    print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

While loop:

#Below code shows a while loop in R
x = 2.987
while(x <= 4.987) {
    x = x + 0.987
    print(x)
}
[1] 3.974
[1] 4.961
[1] 5.948

Repeat Loop:
The repeat loop is an infinite loop and is used in association with a break statement.

#Below code shows a repeat loop:
a = 1
repeat {
    print(a)
    a = a + 1
    if(a > 4) break
}
[1] 1
[1] 2
[1] 3
[1] 4

Break statement:
A break statement is used in a loop to stop the iterations and pass control outside of the loop.

#Below code shows a break statement:
x = 1:10
for (i in x){
    if (i == 2){
        break
    }
    print(i)
}
[1] 1

Next statement:
The next statement skips the current iteration of a loop without terminating it.

#Below code shows a next statement
x = 1:4
for (i in x) {
    if (i == 2){
        next
    }
    print(i)
}
[1] 1
[1] 3
[1] 4

Creating a function in R:
function() is a built-in R construct whose job is to create functions. In the below example, function() takes one parameter x and executes a for loop over it.
The function object thus created is assigned to a variable ('words.names'). The function is then called through that same variable, 'words.names'.

#Below code shows us how a function is created in R:

function_name = function(parameters, ...){ code }
words = c("R", "datascience", "machinelearning","algorithms","AI")
words.names = function(x) {
    for(name in x){
        print(name)
    }
}
#Calling the function
words.names(words)
[1] "R"
[1] "datascience"
[1] "machinelearning"
[1] "algorithms"
[1] "AI"

Hands-on exercise of what we have learned so far

We create a data frame DF, run a for loop with if else inside a function, and call the function.
#create 3 vectors name, age, salary
name = c("David","John","Mathew")
age = c(30,40,50)
salary = c(30000,120000,55000)
#create a data frame DF by combining the 3 vectors using the cbind() function
DF = data.frame(cbind(name,age,salary))
#display DF
DF
    name age salary
1  David  30  30000
2   John  40 120000
3 Mathew  50  55000
#dimensions of DF
dim(DF)
[1] 3 3
#write a function which returns the name of the highest-salaried person
findHighSalary = function(df){
     Maxsal = 0
     empname = ""
     for(i in 1:nrow(df)){
         # cbind() coerced every column to character, so convert salary back to numeric
         tmpsal = as.numeric(df[i,3])
         if(tmpsal > Maxsal){
             Maxsal = tmpsal
             empname = df[i,1]
         }
     }
     return(empname)
}
#calling the function
findHighSalary(DF)
[1] "John"
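For comparison, a loop-free sketch of the same lookup: if the data frame is built with data.frame() directly (rather than via cbind(), which coerces every column to character), salary stays numeric and which.max() finds the answer in one line.

```r
name = c("David", "John", "Mathew")
age = c(30, 40, 50)
salary = c(30000, 120000, 55000)
# building the data frame directly preserves each column's original type
DF2 = data.frame(name, age, salary, stringsAsFactors = FALSE)
# which.max() returns the index of the largest salary
DF2$name[which.max(DF2$salary)]
# [1] "John"
```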

Principal Component Analysis using R

Curse of Dimensionality:
One of the most commonly faced problems in data analytics tasks such as recommendation engines or text analytics is high-dimensional, sparse data. We often face situations where we have a large set of features with few data points, or data with very high-dimensional feature vectors. In such scenarios, fitting a model to the dataset results in lower predictive power; this is often termed the curse of dimensionality. In general, adding more data points or decreasing the feature space (also known as dimensionality reduction) reduces the effects of the curse of dimensionality.
In this blog post, we will discuss principal component analysis (PCA), a popular dimensionality reduction technique. PCA is a useful statistical method that has found application in a variety of fields and is a common technique for finding patterns in high-dimensional data.

Principal component analysis:
Consider below scenario:
The data we want to work with is in the form of a matrix A of dimension m x n, shown below, where Ai,j represents the value of the i-th observation of the j-th variable.
Thus the m rows of the matrix can be viewed as m points in an n-dimensional space, each variable corresponding to one coordinate. If n is very large it is often desirable to reduce the number of variables to a smaller number, say k variables as in the image below, while losing as little information as possible.
Mathematically speaking, PCA is a linear orthogonal transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
When applied, the algorithm linearly transforms the n-dimensional input space to a k-dimensional (k < n) output space, with the objective of minimizing the amount of information/variance lost by discarding the remaining (n - k) dimensions. PCA thus allows us to discard the directions along which the data varies least.
Technically speaking, PCA uses an orthogonal projection of possibly correlated variables onto a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The transformation is defined so that the first principal component has the largest possible variance, accounting for as much of the variability in the data as possible. Each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
In the above image, u1 and u2 are principal components, wherein u1 accounts for the highest variance in the dataset and u2 accounts for the next highest variance and is orthogonal to u1.
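As a quick toy illustration of this idea (using made-up data, not the dataset below): when two variables are strongly correlated, almost all of the variance falls on the first principal component.

```r
set.seed(42)                      # for reproducibility
x = rnorm(100)
y = 2 * x + rnorm(100, sd = 0.1)  # y is nearly a linear function of x
toy_pca = prcomp(cbind(x, y))
toy_pca$sdev                      # the first sd is far larger than the second
```

Discarding the second component here would lose almost no information, which is exactly the situation PCA exploits.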

PCA implementation in R:
For today's post we use the crimtab dataset available in R: data on 3000 male criminals over 20 years old undergoing their sentences in the chief prisons of England and Wales. The 42 row names ("9.4", "9.5", ...) correspond to midpoints of intervals of finger lengths, whereas the 22 column names ("142.24", "144.78", ...) correspond to (body) heights of the 3000 criminals; see also below.
head(crimtab)
    142.24 144.78 147.32 149.86 152.4 154.94 157.48 160.02 162.56 165.1 167.64 170.18 172.72 175.26 177.8 180.34
9.4      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
9.5      0      0      0      0     0      1      0      0      0     0      0      0      0      0     0      0
9.6      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
9.7      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
9.8      0      0      0      0     0      0      1      0      0     0      0      0      0      0     0      0
9.9      0      0      1      0     1      0      1      0      0     0      0      0      0      0     0      0
    182.88 185.42 187.96 190.5 193.04 195.58
9.4      0      0      0     0      0      0
9.5      0      0      0     0      0      0
9.6      0      0      0     0      0      0
9.7      0      0      0     0      0      0
9.8      0      0      0     0      0      0
9.9      0      0      0     0      0      0
dim(crimtab)
[1] 42 22
str(crimtab)
 'table' int [1:42, 1:22] 0 0 0 0 0 0 1 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:42] "9.4" "9.5" "9.6" "9.7" ...
  ..$ : chr [1:22] "142.24" "144.78" "147.32" "149.86" ...

sum(crimtab)
[1] 3000

colnames(crimtab)
 [1] "142.24" "144.78" "147.32" "149.86" "152.4"  "154.94" "157.48" "160.02" "162.56" "165.1"  "167.64" "170.18" "172.72" "175.26" "177.8"  "180.34"
[17] "182.88" "185.42" "187.96" "190.5"  "193.04" "195.58"

Let us use apply() on the crimtab dataset column-wise (MARGIN = 2, one variance per height variable) to see how much each variable varies.

apply(crimtab, 2, var)

We observe that column "165.1" contains the maximum variance in the data. Now apply PCA using prcomp():
pca = prcomp(crimtab)

Note: the resulting pca object from the above code contains, among other things, the standard deviations and the rotation. From the standard deviations we can observe that the first principal component explains most of the variation, followed by the remaining components. The rotation contains the principal component loadings matrix, which describes the contribution of each variable along each principal component.
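These pieces can be inspected directly on the fitted object; a short sketch:

```r
pca = prcomp(crimtab)   # crimtab ships with base R
pca$sdev[1:3]           # standard deviations of the first three components
dim(pca$rotation)       # loadings matrix: one row per variable, one column per component
summary(pca)            # proportion of variance explained by each component
```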

Let's plot all the principal components and see how the variance is accounted for by each component.
par(mar = rep(2, 4))
plot(pca)

Clearly the first principal component accounts for maximum information.
Let us interpret the results of the PCA using a biplot. A biplot shows the proportions of each variable along the first two principal components.
#The two lines below flip the signs of the scores and loadings to change the direction of the biplot; if we do not include them the plot will be a mirror image of the one below.
pca$rotation = -pca$rotation
pca$x = -pca$x
biplot(pca, scale = 0)

The output of the preceding code is as follows:

In the preceding image, known as a biplot, we can see the two principal components (PC1 and PC2) of the crimtab dataset. The red arrows are the loading vectors, which show how the feature space varies along the principal component directions.
From the plot, we can see that the first principal component, PC1, places roughly equal weight on three features: 165.1, 167.64, and 170.18. This means these three features are more correlated with each other than with the 160.02 and 162.56 features.
The second principal component, PC2, places more weight on 160.02 and 162.56, which are less correlated with the three features 165.1, 167.64, and 170.18.
Complete Code for PCA implementation in R:
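The consolidated listing did not survive in this copy; below is a sketch that re-assembles the steps walked through above into one script. The two sign-flip lines before biplot() are an assumption about the omitted direction-changing lines mentioned earlier.

```r
# crimtab ships with base R: 42 finger-length rows x 22 height columns
data(crimtab)
dim(crimtab)              # 42 rows, 22 columns
sum(crimtab)              # 3000 criminals in total

apply(crimtab, 2, var)    # variance of each height column
pca = prcomp(crimtab)     # run principal component analysis

par(mar = rep(2, 4))
plot(pca)                 # variance accounted for by each component

# assumed sign flip so the biplot is not mirrored
pca$rotation = -pca$rotation
pca$x = -pca$x
biplot(pca, scale = 0)
```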

So by now we have understood how to run PCA and how to interpret the principal components. Where do we go from here? How do we apply the reduced-variable dataset? In our next post we shall answer these questions.

Tuesday, February 16, 2016

Basic Data Types in R

As part of the tutorial series on Data Science with R from Data Perspective, this first tutorial introduces the very basics of the R programming language: the basic data types in R.

What we learn:
At the end of the chapter, you are provided with an R console so that you can practice what you have learned in this chapter.

R assignment operator
x = 'welcome to R programming' # assigning a string literal to variable x
x
[1] "welcome to R programming"
typeof(x) #to check the data type of the variable x
[1] "character"
Numeric: Numeric data represents decimal values.
x = 1.5 #assigning decimal value 1.5 to x
x
[1] 1.5
To check the data type we use the class() function:
class(x)
[1] "numeric"
To check whether the variable "x" is numeric or not, we use:
is.numeric(x)
[1] TRUE
To convert any compatible data into numeric, we use as.numeric():
x = '1' #assigning the character value '1' to variable x
class(x)
[1] "character"
x = as.numeric(x)
x
[1] 1
class(x)
[1] "numeric"

Note: if we try to convert a non-numeric string literal to the numeric data type we get the following result.

x = 'welcome to R programming'
as.numeric(x)
[1] NA
Warning message:
NAs introduced by coercion
Integer: We use the as.integer() function to convert numeric values into integer values.
x = 1.34
x
[1] 1.34
class(x)
[1] "numeric"
y = as.integer(x)
class(y)
[1] "integer"
y
[1] 1
Note: to check whether a value is an integer or not we use the is.integer() function.
In the below example 'x' is a numeric (decimal) value whereas 'y' is an integer.
is.integer(y)
[1] TRUE
Complex: Complex data types are shown below, though we use them rarely in day to day data analysis:
c = 3.5+4i
c
[1] 3.5+4i
is.complex(c)
[1] TRUE
class(c)
[1] "complex"
Logical: The logical data type is one of the most frequently used data types, usually for comparing two values. The values a logical data type takes are TRUE and FALSE.
logical = T
logical
[1] TRUE
Character: String literals or string values are stored as character objects in R.
str = "R Programming"
str
[1] "R Programming"
class(str)
[1] "character"
is.character(str)
[1] TRUE
We can convert other data types to the character data type using the as.character() function.
x = as.character(1)
x
[1] "1"
class(x)
[1] "character"
Note: There are a variety of operations that can be applied to characters, such as taking substrings and finding lengths; these will be dealt with as and when appropriate.
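A few of those character operations, sketched briefly:

```r
s = "R Programming"
nchar(s)            # number of characters: 13
substr(s, 3, 13)    # substring from position 3 to 13: "Programming"
toupper(s)          # "R PROGRAMMING"
paste(s, "basics")  # concatenation: "R Programming basics"
```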
So far we have learned about the basic data types in R; let's get into slightly more complex data types.
Vector: How do we hold a collection of values of the same data type? We come across this requirement very frequently, and the vector data type solves this problem.
Consider a numerical vector below:
num_vec = c(1,2,3,4,5)
num_vec
[1] 1 2 3 4 5
class(num_vec)
[1] "numeric"
We can apply many operations to vector variables, such as finding the length or accessing individual members of the vector.
The length of the vector can be found using the length() function.
length(num_vec)
[1] 5
We access each element or member of the vector num_vec using its index, starting from 1.
In the below example we access the members at the 1st, 2nd and 3rd positions.
num_vec[1]
[1] 1
num_vec[2]
[1] 2
num_vec[3]
[1] 3
Similarly, string vectors, logical vectors and integer vectors can be created.
char_vec = c("A", "Course","On","Data science","R programming")
char_vec
[1] "A"            "Course"       "On"           "Data science" "R programming"
length(char_vec)
[1] 5
char_vec[1]
[1] "A"
char_vec[2]
[1] "Course"
char_vec[4]
[1] "Data science"
Matrix: The matrix data type is used when we want to represent data as a collection of numerical values in m x n (m by n) dimensions. Matrices are used mostly when dealing with mathematical equations, machine learning and text mining algorithms.
Now how do we create a matrix?
m = matrix(c(1,2,3,6,7,8),nrow = 2,ncol = 3)
m
     [,1] [,2] [,3]
[1,]    1    3    7
[2,]    2    6    8
class(m)
[1] "matrix"
Knowing the dimensions of the matrix:
dim(m)
[1] 2 3
How do we access elements of matrix m?
#individual elements are accessed using indexes as shown below; matrices are stored column by column, so here we access the 1st, 2nd and 6th elements of matrix m.
m[1]
[1] 1
m[2]
[1] 2
m[6]
[1] 8
m[2,3] # here we access the 2nd row, 3rd column element.
[1] 8
# accessing all elements of each row of the matrix m:
m[1,]
[1] 1 3 7
m[2,]
[1] 2 6 8
#accessing all elements of each column:
m[,1]
[1] 1 2
m[,2]
[1] 3 6
m[,3]
[1] 7 8
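Since matrices back most of the mathematical work mentioned above, a short sketch of basic matrix arithmetic on the same matrix:

```r
m = matrix(c(1,2,3,6,7,8), nrow = 2, ncol = 3)
t(m)        # transpose: a 3 x 2 matrix
m * 2       # element-wise multiplication by a scalar
m %*% t(m)  # matrix multiplication: (2 x 3) %*% (3 x 2) gives a 2 x 2 matrix
```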
What happens when we add different data types to a vector?
v = c("a","b",1,2,3,T)
v
[1] "a"    "b"    "1"    "2"    "3"    "TRUE"
class(v)
[1] "character"
v[6]
[1] "TRUE"
class(v[6])
[1] "character"
What happened in the above example is that R coerced all the different data types into a single data type, character, to maintain the condition that a vector holds a single data type.
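Coercion follows a fixed hierarchy, logical < integer < numeric < character: every element is promoted to the most general type present in the vector. A quick sketch:

```r
class(c(TRUE, 1L))   # logical promoted to integer: "integer"
class(c(1L, 2.5))    # integer promoted to numeric: "numeric"
class(c(1.5, "a"))   # everything promoted to character: "character"
```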
List: What if we want to handle different data types in a single object?
The list data type helps us store elements of different data types in a single object.
We create list objects using the list() function.
In the below example I have created a list object "list_exp" with 6 elements of character, numeric and logical data types.
list_exp = list("r programming","data perspective",12345,67890,TRUE,F)
list_exp[[1]]
[1] "r programming"
list_exp[[2]]
[1] "data perspective"
list_exp[[3]]
[1] 12345
list_exp[[4]]
[1] 67890
list_exp[[5]]
[1] TRUE
Using the str() function we can inspect the structure of the list object, i.e. the internal structure of the list. This is one of the most important functions we use in day to day analysis.
In the below example we can see a list of 6 elements of character, numeric and logical data types.
str(list_exp)
List of 6
 $ : chr "r programming"
 $ : chr "data perspective"
 $ : num 12345
 $ : num 67890
 $ : logi TRUE
 $ : logi FALSE
#accessing the data type of list_exp
class(list_exp)
[1] "list"
#length of the list
length(list_exp)
[1] 6
#accessing the list elements using indexing.
list_exp[[1]]
[1] "r programming"
list_exp[[7]] # when we try accessing a non-existent element we get the below error.
Error in list_exp[[7]] : subscript out of bounds
# finding the class of individual list elements
class(list_exp[[5]])
[1] "logical"
class(list_exp[[3]])
[1] "numeric"
class(list_exp[[1]])
[1] "character"
Data Frame: Many of us come from a SQL background and are comfortable handling data in the form of SQL tables, because of the functionality a SQL table offers for working with data.
Wouldn't it be convenient to have a data type in R that can store and manipulate data in an equally easy, efficient and convenient way?
R offers the data frame data type for exactly this. We can treat a data frame much like a SQL table.
How do we create a data frame?
#creating a data frame
data_frame = data.frame(first=c(1,2,3,4),second=c("a","b","c","d"))
data_frame
  first second
1     1      a
2     2      b
3     3      c
4     4      d
#accessing the data type of the object
class(data_frame)
[1] "data.frame"
#finding out the row count of data_frame using nrow()
nrow(data_frame)
[1] 4
#finding out the column count of data_frame using ncol()
ncol(data_frame)
[1] 2
#finding out the dimensions of data_frame using dim()
dim(data_frame)
[1] 4 2
#finding the structure of the data frame using str()
str(data_frame)
'data.frame': 4 obs. of  2 variables:
 $ first : num  1 2 3 4
 $ second: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
#accessing an entire row of the data frame using the row index. Observe below that if we use data_frame[1,] without specifying the column number, it means we want all the columns of row 1.
data_frame[1,]
  first second
1     1      a
#similarly, to access only the 1st column values without row information, use data_frame[,1]
data_frame[,1]
[1] 1 2 3 4
#accessing the row names of the data frame
rownames(data_frame)
[1] "1" "2" "3" "4"
#accessing the column names of the data frame
colnames(data_frame)
[1] "first"  "second"
#column data can be accessed using the column names explicitly instead of column indexes
data_frame$first
[1] 1 2 3 4
data_frame$second
[1] a b c d
Levels: a b c d
#accessing individual values using row and column indexes
data_frame[1,1] # accessing first row, first column
[1] 1
data_frame[2,2] # accessing second row, second column
[1] b
Levels: a b c d
data_frame[3,2] # accessing third row, second column
[1] c
Levels: a b c d
data_frame[3,1] # accessing third row, first column
[1] 3
Note: Observe the below data frame:
dt_frame = data.frame(first=c(1,2,3,4,5,6,7),second=c("Big data","Python","R","NLP","machine learning","data science","data perspective"))
dt_frame
  first           second
1     1         Big data
2     2           Python
3     3                R
4     4              NLP
5     5 machine learning
6     6     data science
7     7 data perspective
Assume we have a dataset with 1000 rows instead of the 7 rows shown in the above data frame. If we want to see just a sample of the data frame's data, how do we do it?
Using the head() function.
head(dt_frame)
  first           second
1     1         Big data
2     2           Python
3     3                R
4     4              NLP
5     5 machine learning
6     6     data science
The head() function returns the first six rows of a data frame so that we can get a feel for what the data frame looks like.
Similarly, we can use the tail() function to see the last six rows of the data frame.
tail(dt_frame)
  first           second
2     2           Python
3     3                R
4     4              NLP
5     5 machine learning
6     6     data science
7     7 data perspective
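Both functions also take an optional n argument to control how many rows are returned; a quick sketch using the same data frame:

```r
dt_frame = data.frame(first = 1:7,
                      second = c("Big data", "Python", "R", "NLP",
                                 "machine learning", "data science", "data perspective"))
head(dt_frame, n = 3)   # first three rows instead of the default six
tail(dt_frame, n = 2)   # last two rows
```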
We also have the View() function to see the values of a data frame in tabular form.