Friday, March 18, 2016

apply lapply rapply sapply functions in R

As part of the Data Science with R series, this is the third tutorial, following the tutorials on basic data types and control structures in R.

One of the issues with the for loop is its memory consumption and its slowness in executing repetitive tasks. When dealing with large data that has to be iterated over, a for loop is often not advised. R provides several alternatives that can be applied to vectors for looping operations. In this section, we deal with the apply() function and its variants:
?apply
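As a quick illustration of the difference in style, here is a minimal sketch (using a small made-up vector, not one of the datasets below) that squares each element first with a for loop and then with sapply():

# squares of a small vector, first with an explicit loop
v = c(1, 2, 3, 4, 5)
squares = numeric(length(v))   # pre-allocate the result vector
for (i in seq_along(v)) {
    squares[i] = v[i]^2
}
squares
[1]  1  4  9 16 25
# the same result with sapply: no indexing and no pre-allocation needed
sapply(v, function(x) x^2)
[1]  1  4  9 16 25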


Datasets for apply family tutorial
 To understand the apply functions in R we use data from the 1974 Motor Trend
US magazine, which comprises fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles (1973–74 models).
 
data("mtcars")
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
 

Reynolds (1994) describes a small part of a study of the long-term temperature dynamics
of the beaver Castor canadensis in north-central Wisconsin. Body temperature was measured by
telemetry every 10 minutes for four females, but data from one period of less than a
day for each of two animals are used here.

data(beavers)
head(t(beaver1)[1:4,1:10])
        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]    [,9]   [,10]
day   346.00 346.00 346.00 346.00 346.00 346.00 346.00 346.00  346.00  346.00
time  840.00 850.00 900.00 910.00 920.00 930.00 940.00 950.00 1000.00 1010.00
temp   36.33  36.34  36.35  36.42  36.55  36.69  36.71  36.75   36.81   36.88
activ   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00    0.00    0.00

apply():
apply() is the base function of the family. We will learn how the apply family functions work by trying out the code. apply() takes 3 arguments:
  • the data (a matrix or array)
  • the margin: 1 for a row-wise operation, 2 for a column-wise operation
  • the function to be applied to the data.
 
When 1 is passed as the second argument, the function max is applied row wise. In the
below example, the row-wise maximum value is calculated. Since we have four types of
attributes, we get 4 results.
 
apply(t(beaver1),1,max) 
    day    time    temp   activ 
 347.00 2350.00   37.53    1.00 

 
When 2 is passed as the second argument, the function mean is applied column wise.
In the below example the mean function is applied to each column, so we get one
result per column.
 
apply(mtcars,2,mean) 
       mpg        cyl       disp         hp       drat         wt       qsec         vs         am       gear       carb 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750   0.437500   0.406250   3.687500   2.812500 
 
We can also pass a custom function instead of a built-in one. For example, in
the below example let us take each column element modulo 10.
For this we use a custom function which takes each element of each column and
applies the modulus operation.
 
head(apply(mtcars,2,function(x) x%%10))
                  mpg cyl disp hp drat    wt qsec vs am gear carb
Mazda RX4         1.0   6    0  0 3.90 2.620 6.46  0  1    4    4
Mazda RX4 Wag     1.0   6    0  0 3.90 2.875 7.02  0  1    4    4
Datsun 710        2.8   4    8  3 3.85 2.320 8.61  1  1    4    1
Hornet 4 Drive    1.4   6    8  0 3.08 3.215 9.44  1  0    3    1
Hornet Sportabout 8.7   8    0  5 3.15 3.440 7.02  0  0    3    2
Valiant           8.1   6    5  5 2.76 3.460 0.22  1  0    3    1

lapply():
lapply() is used for operations on list objects and returns a list object of the same length as the original set: each element of the result is obtained by applying FUN to the corresponding element of the input list.
 #create a list with 2 elements
l = list(a = 1:10, b = 11:20)
 # the mean of the values in each element
lapply(l, mean)
$a
[1] 5.5
$b
[1] 15.5
class(lapply(l, mean))
[1] "list
  # the sum of the values in each element 
lapply(l, sum)
$a
[1] 55

$b
[1] 155



sapply():
sapply() is a wrapper around lapply(), the difference being that it returns a vector or matrix instead of a list object.
 
 # create a list with 2 elements
 l = list(a = 1:10, b = 11:20)
 # mean of values using sapply
sapply(l, mean)
   a    b 
 5.5 15.5
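When the applied function returns more than one value per element, sapply() simplifies the result to a matrix rather than a vector. A small sketch using the same list l:

 # range() returns two values per element, so the result is a 2 x 2 matrix
sapply(l, range)
      a  b
[1,]  1 11
[2,] 10 20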

tapply():
tapply() is a very powerful function that lets you break a vector into pieces and then apply some function to each of the pieces. In the below code, mpg in the mtcars data is first grouped by cylinder type and then mean() is applied to each group.
str(mtcars$cyl)
 num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
levels(as.factor(mtcars$cyl))
[1] "4" "6" "8"

In the dataset we have 3 types of cylinders and now we want to see the average mpg
for each cylinder type.

tapply(mtcars$mpg,mtcars$cyl,mean)
       4        6        8 
26.66364 19.74286 15.10000 

In the output above we see that the average mpg for a 4-cylinder engine
is 26.66, for a 6-cylinder engine is 19.74 and for an 8-cylinder engine is 15.10.
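tapply() can also group by more than one factor at a time. A minimal sketch (output omitted) that averages mpg for every combination of cylinder count and gear count:

# average mpg per cylinder/gear combination; combinations with no cars give NA
tapply(mtcars$mpg, list(mtcars$cyl, mtcars$gear), mean)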

by():
by() works similarly to GROUP BY in SQL: the data is split by a factor and an operation is applied to each resulting subset. In the below example, we apply the colMeans() function to the observations of the iris dataset grouped by Species.
data(iris)
str(iris)
'data.frame': 150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

by(iris[,1:4],iris$Species,colMeans)
iris$Species: setosa
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       5.006        3.428        1.462        0.246 
------------------------------------------------------------------------------------ 
iris$Species: versicolor
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       5.936        2.770        4.260        1.326 
------------------------------------------------------------------------------------ 
iris$Species: virginica
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       6.588        2.974        5.552        2.026 

rapply():
rapply() is a recursive version of lapply.
rapply() applies a function recursively to each element of a list, with the behaviour controlled by the "how" parameter. If how = "replace", each element of the list which is not itself a list and has a class included in classes is replaced by the result of applying f to that element. If how = "list" or how = "unlist", the list is copied, all non-list elements which have a class included in classes are replaced by the result of applying f to the element, and all others are replaced by deflt. Finally, if how = "unlist", unlist(recursive = TRUE) is called on the result.
l2 = list(a = 1:10, b = 11:20,c=c('d','a','t','a'))
l2
$a
 [1]  1  2  3  4  5  6  7  8  9 10

$b
 [1] 11 12 13 14 15 16 17 18 19 20

$c
[1] "d" "a" "t" "a"

rapply(l2, mean, how = "list", classes = "integer")
$a
[1] 5.5

$b
[1] 15.5

$c
NULL

rapply(l2, mean, how = "unlist", classes = "integer")
 a    b 
 5.5 15.5 
 
rapply(l2, mean, how = "replace", classes = "integer")
$a
[1] 5.5

$b
[1] 15.5

$c
[1] "d" "a" "t" "a"

mapply():
By R's definition, mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on. Arguments are recycled if necessary. Its purpose is to vectorize arguments to a function that does not usually accept vectors as arguments. In short, mapply applies a function to multiple list or vector arguments. In the below example the word function is applied to the vector arguments LETTERS[1:6] and 6:1.
word = function(C, k) paste(rep.int(C, k), collapse = "")
utils::str(mapply(word, LETTERS[1:6], 6:1, SIMPLIFY = FALSE))
List of 6
 $ A: chr "AAAAAA"
 $ B: chr "BBBBB"
 $ C: chr "CCCC"
 $ D: chr "DDD"
 $ E: chr "EE"
 $ F: chr "F"
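A simpler sketch of the same idea: mapply() takes the first elements of both vectors, then the second elements, and so on, so the call below adds the two vectors element by element:

mapply(function(x, y) x + y, 1:3, 4:6)   # 1+4, 2+5, 3+6
[1] 5 7 9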
 

Saturday, February 27, 2016

Control Structures Loops in R

As part of the Data Science tutorial series, in my previous post I wrote about basic data types in R. I have kept the tutorial very simple so that beginners of R programming can take off immediately.
Please find the online R editor at the end of the post so that you can execute the code on the page itself.
In this section we learn about the control structures and loops used in R. Control structures in R include conditionals and loop statements, as in any other programming language.
Loops are very important and form the backbone of any programming language. Before we get into the control structures in R, just type the following in RStudio:
 ?Control

If else statement:
#See the code syntax below for if else statement
x = 5
if(x > 1){
 print("x is greater than 1")
 }else{
  print("x is less than or equal to 1")
  }

#See the code below for nested if else statement

 x = 10
 if(x > 1 & x < 7){
     print("x is between 1 and 7")
 }else if(x > 8 & x < 15){
     print("x is between 8 and 15")
 }

[1] "x is between 8 and 15" 

For loops:
As we know, for loops are used to iterate over the items of a sequence.
 #Below code shows for loop implementation
x = c(1,2,3,4,5)
 for(i in 1:5){
     print(x[i])
 }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

While loop :

 #Below code shows while loop in R
x = 2.987
while(x <= 4.987) { 
     x = x + 0.987
     print(c(x,x-2,x-1)) 
 }
[1] 3.974 1.974 2.974
[1] 4.961 2.961 3.961
[1] 5.948 3.948 4.948

Repeat Loop:
The repeat loop is an infinite loop and is used in association with a break statement.

 #Below code shows repeat loop:
a = 1
 repeat { print(a); a = a + 1; if(a > 4) break }
[1] 1
[1] 2
[1] 3
[1] 4

Break statement:
A break statement is used inside a loop to stop the iterations and pass control to the statement following the loop.

 #Below code shows break statement:
x = 1:10 
 for (i in x){ 
     if (i == 2){ 
         break 
     }
     print(i)
 }
[1] 1

Next statement:
The next statement enables us to skip the current iteration of a loop without terminating it.

 #Below code shows next statement 
x = 1: 4 
 for (i in x) { 
     if (i == 2){ 
         next}
     print(i)
 }
[1] 1
[1] 3
[1] 4

Creating a function in R:
function() is the built-in R construct whose job is to create functions. In the below example, the function takes one parameter x and runs a for loop over it.
The function object thus created is assigned to a variable ('words.names'). The created function is then called using the name 'words.names'.

 #Below code shows us, how a function is created in R:

Syntax: 
function_name = function(parameters, ...){ code }
 
words = c("R", "datascience", "machinelearning","algorithms","AI") 
words.names = function(x) {
     for(name in x){ 
         print(name) 
     }
} 
#Calling the function
 words.names(words)
[1] "R"
[1] "datascience"
[1] "machinelearning"
[1] "algorithms"
[1] "AI"

Hands on exercise of what we have learnt so far


We create a data frame DF, write a function that uses a for loop and an if condition, and then call the function.
#create 3 vectors name,age,salary
name = c("David","John","Mathew")
age = c(30,40,50)
salary = c(30000,120000,55000)
#create a data frame DF by combining the 3 vectors using cbind() function
DF = data.frame(cbind(name,age,salary))
#display DF
DF
    name age salary
1  David  30  30000
2   John  40 120000
3 Mathew  50  55000
#dimensions of DF
 dim(DF)
[1] 3 3
 
#write a function which returns the name of the highest salaried person
findHighSalary = function(df){
     Maxsal = 0
     empname = ""
     for(i in 1:nrow(df)){
         #columns are factors after cbind(), so convert to character before as.numeric
         tmpsal = as.numeric(as.character(df[i,3]))
         if(tmpsal > Maxsal){
             Maxsal = tmpsal
             empname = df[i,1]
         }
     }
     return(as.character(empname))
 }
#calling the function
findHighSalary(DF)
[1] "Mathew"

Principal Component Analysis using R

Curse of Dimensionality:
One of the most commonly faced problems while dealing with data analytics tasks, such as recommendation engines or text analytics, is high-dimensional and sparse data. Many times we face a situation where we have a large set of features and fewer data points, or we have data with very high-dimensional feature vectors. In such scenarios, fitting a model to the dataset results in lower predictive power of the model. This scenario is often termed the curse of dimensionality. In general, adding more data points or decreasing the feature space, also known as dimensionality reduction, often reduces the effects of the curse of dimensionality.
In this post, we will discuss principal component analysis (PCA), a popular dimensionality reduction technique. PCA is a useful statistical method that has found application in a variety of fields and is a common technique for finding patterns in high-dimensional data.


Principal component analysis:
Consider the following scenario:
The data we want to work with is in the form of a matrix A of dimension m x n, where A[i, j] represents the value of the i-th observation of the j-th variable.
Thus the m rows of the matrix can be viewed as m observations, each of them an n-dimensional vector of variable values. If n is very large it is often desirable to reduce the number of variables to a smaller number, say k variables, while losing as little information as possible.
Mathematically speaking, PCA is a linear orthogonal transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
When applied, the algorithm linearly transforms the n-dimensional input space into a k-dimensional (k < n) output space, with the objective of minimizing the amount of information/variance lost by discarding the remaining (n - k) dimensions. PCA allows us to discard the directions along which the data has the least variance.
Technically speaking, PCA uses an orthogonal projection of possibly correlated variables onto a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This linear transformation is defined in such a way that the first principal component has the largest possible variance, accounting for as much of the variability in the data as possible. Each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
For example, if u1 and u2 are the first two principal components, u1 accounts for the highest variance in the dataset while u2 accounts for the next highest variance and is orthogonal to u1.
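To make this concrete, here is a small illustrative sketch (on randomly generated data, not on a dataset from this post) showing that the components found by prcomp() correspond to the eigen-decomposition of the covariance matrix:

set.seed(1)
X = matrix(rnorm(100 * 5), ncol = 5)   # 100 observations of 5 variables
eig = eigen(cov(X))                    # eigenvectors are the directions of maximum variance
eig$values                             # variance along each principal direction
pca_x = prcomp(X)                      # prcomp centers the data by default
pca_x$sdev^2                           # matches eig$values (up to rounding)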

PCA implementation in R:
For today’s post we use the crimtab dataset available in R. It contains data on 3000 male criminals over 20 years old undergoing their sentences in the chief prisons of England and Wales. The 42 row names ("9.4", "9.5", ...) correspond to midpoints of intervals of finger lengths whereas the 22 column names ("142.24", "144.78", ...) correspond to (body) heights of the 3000 criminals; see also below.
head(crimtab)
    142.24 144.78 147.32 149.86 152.4 154.94 157.48 160.02 162.56 165.1 167.64 170.18 172.72 175.26 177.8 180.34
9.4      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
9.5      0      0      0      0     0      1      0      0      0     0      0      0      0      0     0      0
9.6      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
9.7      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
9.8      0      0      0      0     0      0      1      0      0     0      0      0      0      0     0      0
9.9      0      0      1      0     1      0      1      0      0     0      0      0      0      0     0      0
    182.88 185.42 187.96 190.5 193.04 195.58
9.4      0      0      0     0      0      0
9.5      0      0      0     0      0      0
9.6      0      0      0     0      0      0
9.7      0      0      0     0      0      0
9.8      0      0      0     0      0      0
9.9      0      0      0     0      0      0
 dim(crimtab)
[1] 42 22
str(crimtab)
 'table' int [1:42, 1:22] 0 0 0 0 0 0 1 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:42] "9.4" "9.5" "9.6" "9.7" ...
  ..$ : chr [1:22] "142.24" "144.78" "147.32" "149.86" ...

sum(crimtab)
[1] 3000

colnames(crimtab)
 [1] "142.24" "144.78" "147.32" "149.86" "152.4"  "154.94" "157.48" "160.02" "162.56" "165.1"  "167.64" "170.18" "172.72" "175.26" "177.8"  "180.34"
[17] "182.88" "185.42" "187.96" "190.5"  "193.04" "195.58"

Let us use apply() on the crimtab dataset column wise to calculate the variance and see how each variable varies.
apply(crimtab,2,var)

We observe that the column “165.1” has the maximum variance in the data. Now let us apply PCA using prcomp().
pca =prcomp(crimtab)
pca

Note: the resulting pca object from the above code contains the standard deviations and the rotation. From the standard deviations we can observe that the 1st principal component explains most of the variation, followed by the remaining components. Rotation contains the principal component loadings matrix, whose values describe the contribution of each variable along each principal component.
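The proportion of variance explained by each component can be derived from these standard deviations; a short sketch (output omitted):

pve = pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance explained
round(pve, 3)
summary(pca)                         # reports the same proportions per component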

Let’s plot all the principal components and see how the variance is accounted with each component.
par(mar = rep(2, 4))
plot(pca)

Clearly the first principal component accounts for maximum information.
Let us interpret the results of the PCA using a biplot. A biplot shows the contribution of each variable along the two principal components.
#the below code changes the direction of the biplot; if we do not include the next two lines the plot will be a mirror image of the one below.
pca$rotation=-pca$rotation
pca$x=-pca$x
biplot (pca , scale =0)

The output of the preceding code is as follows:

In the preceding image, known as a biplot, we can see the two principal components (PC1 and PC2) of the crimtab dataset. The red arrows represent the loading vectors, which show how the feature space varies along the principal component directions.
From the plot, we can see that the first principal component vector, PC1, more or less places equal weight on three features: 165.1, 167.64, and 170.18. This means that these three features are more correlated with each other than the 160.02 and 162.56 features.
The second principal component, PC2, places more weight on 160.02 and 162.56 than on the three features 165.1, 167.64 and 170.18, which are less correlated with them.
Complete Code for PCA implementation in R:
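The original complete-code listing is not reproduced here; what follows is a consolidated sketch of the steps walked through above:

#load the dataset (42 finger-length intervals x 22 height classes)
data(crimtab)
dim(crimtab)
#column-wise variance of each height class
apply(crimtab, 2, var)
#run PCA and inspect the standard deviations and loadings
pca = prcomp(crimtab)
pca$sdev
pca$rotation
#variance accounted for by each component
par(mar = rep(2, 4))
plot(pca)
#flip the signs so the biplot is not mirrored, then draw it
pca$rotation = -pca$rotation
pca$x = -pca$x
biplot(pca, scale = 0)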

By now we have understood how to run PCA and how to interpret the principal components. Where do we go from here? How do we use the reduced-variable dataset? In our next post we shall answer these questions.

Tuesday, February 16, 2016

Basic Data Types in R

As part of the tutorial series on Data Science with R from Data Perspective, this first tutorial introduces the basic data types of the R programming language.

What we learn:
At the end of the chapter, you are provided with an R console so that you can practice what you have learnt in this chapter.




R assignment operator
x = 'welcome to R programming' # assigning string literal to variable x 
x
[1] "welcome to R programming"
typeof(x) #to check the data type of the variable x
[1] "character"
Numeric: Numeric data represents decimal data.
x = 1.5 #assigning decimal value 1.5 to x
x
[1] 1.5 
To check the data type we use class() function:
class(x)
[1] "numeric"
To check if the variable “x” is of numerical or not, we use
is.numeric(x)
[1] TRUE
To convert any compatible data into numeric, we use:
 x = '1' #assigning value 1 to variable x
 class(x)
[1] "character"
 x = as.numeric(x)
 x
[1] 1
 class(x)
[1] "numeric"

Note: if we try to convert a string literal to numeric data type we get the following result.

x= 'welcome to R programming'
as.numeric(x)
[1] NA
Warning message:
NAs introduced by coercion
Integer: We use the as.integer() function to convert values into integers. This converts numeric values to integer values.
x = 1.34
x
[1] 1.34
class(x)
[1] "numeric"
y = as.integer(x)
class(y)
[1] "integer"
y
[1] 1
Note: to check whether a value is an integer or not we use the is.integer() function.
In the example below, ‘y’ is an integer whereas ‘x’ is a numeric (decimal) value.
 is.integer(y)
[1] TRUE
 is.integer(x)
[1] FALSE
Complex: Complex data types are shown below, though we use them rarely in day-to-day data analysis:
c = 3.5+4i
c
[1] 3.5+4i
is.complex(c)
[1] TRUE
class(c)
[1] "complex"
Logical: The logical data type is one of the frequently used data types, usually arising from comparing two values. The values a logical data type takes are TRUE or FALSE.
logical = T
logical
[1] TRUE
l = FALSE
l
[1] FALSE
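Logical values most often come from comparison operators; a minimal sketch:

x = 5
x > 3    # is x greater than 3?
[1] TRUE
x == 10  # is x equal to 10?
[1] FALSE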
Character: String literals or string values are stored as character objects in R.
str = "R Programming"
str
[1] "R Programming"
class(str)
[1] "character"
is.character(str)
[1] TRUE
We can convert other data types to character data type using as.character() function.
x = as.character(1)
x
[1] "1"
class(x)
[1] "character"
Note: There are a variety of operations that can be applied to characters, such as taking substrings and finding lengths; these will be dealt with as and when appropriate.
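A few of those operations, shown as a brief sketch on the str variable created above:

nchar(str)              # number of characters in the string
[1] 13
substr(str, 1, 1)       # substring from position 1 to position 1
[1] "R"
paste(str, "tutorial")  # concatenate two strings
[1] "R Programming tutorial"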
So far we have learnt about the basic data types in R; let's now get into slightly more complex data types.
Vector: How do we hold a collection of values of the same data type? We come across this requirement very frequently. The vector data type solves this problem.
Consider a numerical vector below:
num_vec = c(1,2,3,4,5)
num_vec
[1] 1 2 3 4 5
class(num_vec)
[1] "numeric"
We can apply many operations to vector variables, such as finding the length or accessing individual members of the vector.
The length of the vector can be found using the length() function.
length(num_vec)
[1] 5
We access each element or member of the vector num_vec using its index, starting from 1.
In the below example we access the members at the 1st, 2nd and 3rd positions.
num_vec[1]
[1] 1
num_vec[2]
[1] 2
num_vec[3]
[1] 3
Similarly string vectors, logical vectors, integer vectors can be created.
char_vec = c("A", "Course","On","Data science","R programming")
char_vec
[1] "A"  "Course" "On" "Data science" "R programming"
length(char_vec)
[1] 5
char_vec[1]
[1] "A"
char_vec[2]
[1] "Course"
char_vec[4]
[1] "Data science"
Matrix: The matrix data type is used when we want to represent the data as a collection of numerical values with m x n (m by n) dimensions. Matrices are used mostly when dealing with mathematical equations, machine learning and text mining algorithms.
Now how do we create a matrix?
m = matrix(c(1,2,3,6,7,8),nrow = 2,ncol = 3)
m
     [,1] [,2] [,3]
[1,]    1    3    7
[2,]    2    6    8
class(m)
[1] "matrix"
To know the dimensions of the matrix:
dim(m)
[1] 2 3
How do we access elements of matrix m:
#accessing individual elements is done using indexes as shown below. In the below example we access the 1st, 2nd and 6th elements of matrix m (counting column by column).
m[1]
[1] 1
m[2]
[1] 2
m[6]
[1] 8
m[2,3] # here we access the 2nd row, 3rd column element.
[1] 8
# accessing all elements of rows of the matrix m shown below. 
m[1,]
[1] 1 3 7
m[2,]
[1] 2 6 8
#accessing all elements of each column
m[,1]
[1] 1 2
m[,2]
[1] 3 6
m[,3]
[1] 7 8 
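Beyond indexing, matrices support element-wise arithmetic, transposition and matrix multiplication. A short sketch using the same matrix m (output omitted):

m * 2        # element-wise: every entry doubled
t(m)         # transpose: a 3 x 2 matrix
m %*% t(m)   # matrix product: a 2 x 2 matrix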
What happens when we add different data types to a vector?
v = c("a","b",1,2,3,T)
v
[1] "a"    "b"    "1"    "2"    "3"    "TRUE"
class(v)
[1] "character"
v[6]
[1] "TRUE"
class(v[6])
[1] "character"
What happened in the above example is that R coerced all the different values into a single data type, character, since a vector can only hold one data type.
List: What if we want to handle different data types in a single object?
The list data type helps us store elements of different data types in a single object.
We create list objects using the list() function.
In the below example I have created a list object “list_exp” with 6 elements of character, numeric and logical data types.
list_exp = list("r programming","data perspective",12345,67890,TRUE,F)
list_exp
[[1]]
[1] "r programming"
[[2]]
[1] "data perspective"
[[3]]
[1] 12345
[[4]]
[1] 67890
[[5]]
[1] TRUE
[[6]]
[1] FALSE
Using the str() function we can inspect the internal structure of a list object. This is one of the most important functions we use in day-to-day analysis.
In the below example we can see a list of 6 elements of character, numeric and logical data types.
str(list_exp)
List of 6
 $ : chr "r programming"
 $ : chr "data perspective"
 $ : num 12345
 $ : num 67890
 $ : logi TRUE
 $ : logi FALSE
#accessing the  data type of list_exp
class(list_exp)
[1] "list"
length(list_exp)
[1] 6
list_exp[1]
[[1]]
[1] "r programming"
#accessing the list elements using indexing.
list_exp[[1]]
[1] "r programming"
list_exp[[6]]
[1] FALSE
list_exp[[7]] # when we try to access a non-existent element we get the below error.
Error in list_exp[[7]] : subscript out of bounds
# finding the class of individual list element
class(list_exp[[6]])
[1] "logical"
class(list_exp[[3]])
[1] "numeric"
class(list_exp[[1]])
[1] "character"
Data Frame: Many of us come from a SQL background and are comfortable handling data in the form of a SQL table because of the functionality a SQL table offers when working with data.
What if such a data type were available in R, one that can store and manipulate data in an easy, efficient and convenient way?
R offers the data frame for exactly this. We can treat a data frame much like a SQL table.
How do we create a data frame?
#creating a data frame
data_frame = data.frame(first=c(1,2,3,4),second=c("a","b","c","d"))
data_frame
  first second
1     1      a
2     2      b
3     3      c
4     4      d
#accessing  the data type of the object
class(data_frame)
[1] "data.frame"
#finding out the row count of data_frame using nrow()
nrow(data_frame)
[1] 4
#finding out the column count of data_frame using ncol()
ncol(data_frame)
[1] 2
#finding out the dimensions of data_frame using dim()
dim(data_frame)
[1] 4 2
#finding the structure of the data frame using str()
str(data_frame)
'data.frame': 4 obs. of  2 variables:
 $ first : num  1 2 3 4
 $ second: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
#accessing the entire row of data frame using row index number. Observe below that if we use data_frame[1,] without specifying the column number it means that we want to access all the columns of row 1.
data_frame[1,]
  first second
1     1      a
#similarly to access only 1st column values without row information use data_frame[,1] 
data_frame[,1]
[1] 1 2 3 4
#accessing the row names of the data frame.
rownames(data_frame)
[1] "1" "2" "3" "4"
#accessing the column names of the data frame
colnames(data_frame)
[1] "first"  "second"
#column data can accessed using the column names explicitly instead of column indexes
data_frame$first
[1] 1 2 3 4
data_frame$second
[1] a b c d
Levels: a b c d
#accessing individual values using row and column indexes
data_frame[1,1] # accessing first row first column
[1] 1
data_frame[2,2] # accessing second row second column
[1] b
Levels: a b c d
data_frame[3,2]  # accessing third row second column
[1] c
Levels: a b c d
data_frame[3,1] # accessing third row first column
[1] 3
Note: Observe the below data frame:
dt_frame = data.frame(first=c(1,2,3,4,5,6,7),second=c("Big data","Python","R","NLP","machine learning","data science","data perspective"))
dt_frame
  first           second
1     1         Big data
2     2           Python
3     3                R
4     4              NLP
5     5 machine learning
6     6     data science
7     7 data perspective
Assume we have a dataset with 1000 rows instead of the 7 rows shown in the above data frame. If we want to see a sample of the data frame, how do we do it?
Using head() function.
head(dt_frame)
  first           second
1     1         Big data
2     2           Python
3     3                R
4     4              NLP
5     5 machine learning
6     6     data science
The head() function returns the first six rows of a data frame so that we can get a feel for what the data frame looks like.
Also we can use tail() function to see the last six rows of the data frame.
tail(dt_frame)
  first           second
2     2           Python
3     3                R
4     4              NLP
5     5 machine learning
6     6     data science
7     7 data perspective
We have View() function to see the values of a data frame in a tabular form.
View(dt_frame)


Friday, December 25, 2015

Data Science with R

As the R programming language becomes more and more popular among the data science community, with industries, researchers and companies embracing R, going forward I will be writing posts on learning data science using R. The tutorial course will include topics such as data types in R, handling data with R, probability theory, machine learning (supervised and unsupervised), and data visualization using R. Before going further, let’s look at some stats and tidbits on data science and R.

"A data scientist is simply someone who is highly adept at studying large amounts of often unorganized/undigested data"

“R programming language is becoming the Magic Wand for Data Scientists”

Why R for Data Science?

Daryl Pregibon, a research scientist at Google said- “R is really important to the point that it’s hard to overvalue it. It allows statisticians to do very intricate and complicated data analysis without knowing the blood and guts of computing systems.”

A brief stats for R popularity

“The shortage of data scientists is becoming a serious constraint in some sectors”
David Smith, Chief Community Officer at Revolution Analytics said –
“Investing in R, whether from the point of view of an individual Data Scientist or a company as a whole is always going to pay off because R is always available. If you’ve got a Data Scientist new to an organization, you can always use R. If you’re a company and you’re putting your practice on R, R is always going to be available. And, there’s also an ecosystem of companies built up around R including Revolution Enterprise to help organizations implement R into their machine critical production processes.”
Enough praise for R. The topics I will cover in the course are:

  1. R Basics
  2. Probability theory in R
  3. Machine Learning in R
  4. Supervised machine learning
  5. Unsupervised machine learning
  6. Advanced Machine Learning in R
  7. Data Visualization in R
See you in the first chapter, meanwhile read about the various data analysis steps involved.

Wednesday, December 16, 2015

Pearson Correlation Coefficient

Since my original research question doesn't include any quantitative response or explanatory variable, I have chosen a new set of variables for this assignment.
Below are the hypotheses I have chosen:
Null hypothesis: there is no association between income per person and the amount of alcohol consumed in a year.
Alternative hypothesis: there is an association between income and alcohol consumption.
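As a hedged sketch only (the assignment itself was run in a different environment, and the data frame and column names below are hypothetical, not taken from the original code), a Pearson correlation test for such a hypothesis could look like this in R:

# 'gapminder_data', 'incomeperperson' and 'alcconsumption' are hypothetical names
gapminder_data = read.csv("gapminder.csv")
cor.test(gapminder_data$incomeperperson, gapminder_data$alcconsumption, method = "pearson")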

Saturday, December 12, 2015

Friday, December 11, 2015

Tuesday, December 8, 2015

Data Management Methods

Problem Statement: Prevalence of Self-treatment of psychiatric disorders with alcohol or drugs to improve the mood in NESARC

As part of Assignment_2, I have created a small program to perform the following data management steps for a few of the parameters/variables I have chosen for my research question:
  • coding out missing data,
  • coding in valid data,
  • recoding variables,
  • creating secondary variables,
  • binning or grouping variables.

The assignment is in two steps:
  1. Complete code for the assignment
  2. Output of the code after running it in Spyder