Tuesday, December 31, 2013

Data Scientist: The Path I Chose

As we all march into the New Year, I would like to write about my plan to become a Data Scientist, my 2014 resolution on the professional front. The term Data Science was first introduced to me around this time a year ago. Since then I have been researching and gathering the necessary information, and I decided to become a Data Scientist. One year on, I want to look back and understand where I stand now and what still needs to be done.
Power of Data – possible Use Cases:
Automobile companies are installing special devices that continuously feed data about driving behaviour, the roads taken, the places visited, etc. to insurance companies, which can analyze this information to customize insurance plans for their customers.
Another possible use case is making use of the data we store on social media to develop products that understand us and sometimes guide us. For example, by combining people's social media posts with location, time, the road you are travelling on, and live traffic feeds, we could develop an app that suggests which road to take to your destination.
Who is a Data Scientist? 
Data Scientist has been termed the Sexiest Job of the 21st Century, and it is among the most sought-after and well-paid jobs across the industry. The role of a Data Scientist is to make sense of the tons of data being generated every day and to build products that redefine, more statistically and scientifically, the way current businesses run. The best part of this job is that a Data Scientist can fit into any industry where data exists.
What does it take to become a Data Scientist? 
"Web is my University, Time is the only Investment" 

Going back to where I started my journey: after my initial research with my folks and over the internet, I took up the Machine Learning course on Coursera.org, an online offering from Stanford University, definitely a recommended starting point for anyone who wants to become a Data Scientist.
After completing the course, I started playing with data, predictive models being my first hands-on exercise. At this point I got a piece of advice from a colleague:

 “It is easy (because free and online) to grab some interesting key messages and get a rather complete overview within an 8-week timeframe. Be sure you have a sufficient mathematical background to follow (maybe calculus/mathematics beforehand?) and, more importantly, be certain you can practice afterwards!” 

My first analytics assignment on predictive modelling taught me a lot, including the need to brush up on mathematics basics, statistics fundamentals, data mining techniques, and visualization techniques.
I was lucky enough to find many of the courses on my TODO list at Coursera.org and Udacity:
  • Machine Learning – for Machine Learning Algorithms 
  • Neural Networks for Machine Learning 
  • Data Analysis – introduction to R, a programming language for data analysis 
  • Making Sense of Data - basics of Statistics 
  • Introduction to Data Science 
  • Social Network Analysis
The image below, which I found on the internet, is a simplified road map for a Data Scientist; it has become my target for the next few months, combined with hands-on practice.
Road map to data science 
After completing the required courses mentioned above, I got opportunities to play around with data using text mining, classification techniques, and predictive modelling techniques, both within my organization and outside it (Kaggle.com).
This post would be incomplete if I didn't mention Kaggle, the website where I found the much-needed hands-on experience. Kaggle is an online open competition forum where you can find data science use cases posted by big companies. The advantage of working on this site is that you not only get hands-on experience but also grow your network of similarly competent techies.
Apart from Kaggle, I have made use of LinkedIn data science/data analysis groups, where experts in the field gave me suggestions and answers to almost every question of mine.
Data Analytics Should Meet Big Data: 
“Learn a bit of Java, Learn a bit of Linux – shell scripting, Learn Hadoop” – an advice from my friend.

After completing a few POCs and beginning to work on real-time projects, I soon realized the need to handle huge amounts of data and to learn Big Data. In my personal opinion, a Data Scientist should have very strong knowledge of Big Data and Hadoop.
For this, I plan to take the path of learning Big Data/Hadoop concepts, Linux shell scripting, and MongoDB for data storage over the next 3 months.
That is me, folks, after one year; there is a long way to go before becoming a Data Scientist, but I would still love to hear someone call me one. Over the last year I have spent most of my time learning new concepts and technologies, and I am eager to see the theory shape into work.
In my next blog post I will explain the tools, concepts, technologies, and online forums available on the internet.

Tuesday, December 17, 2013

Cluster Analysis using R

In this post, I will explain cluster analysis: the process of grouping objects/individuals together in such a way that objects/individuals in one group are more similar to each other than to those in other groups.
 For example, from a ticket booking engine database we can identify clients with similar booking activities and group them together (these groups are called clusters). These clusters can later be targeted for business improvement, for instance by issuing special offers.
Cluster analysis falls under unsupervised learning algorithms: the data to be analyzed is fed to a cluster analysis algorithm, which identifies the hidden patterns within it, as shown in the figure below.
 In the image above, the clustering algorithm has grouped the input data into two groups. There are three popular clustering algorithms, Hierarchical Cluster Analysis, K-Means Cluster Analysis, and Two-Step Cluster Analysis, of which today I will be dealing with K-Means clustering.
Explaining the k-Means Clustering Algorithm: 
In the k-means algorithm, k stands for the number of clusters (groups) to be formed; hence this algorithm is used to group the data into a known number of groups.
K-means is an iterative algorithm with two steps: first a cluster assignment step, and second a move centroid step.
CLUSTER ASSIGNMENT STEP: In this step, we randomly choose two cluster centroids (the red dot and the green dot) and assign each data point to whichever of the two centroids is closer to it (top part of the image below).
MOVE CENTROID STEP: In this step, we take the average of all the points assigned to each group and move that group's centroid to the calculated mean position (bottom part of the image below).
The above steps are repeated until the assignment of the data points to the two groups, and the mean positions at the end of the move centroid step, no longer change.


 By repeating the above steps, the final grouping of the input data is obtained.
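The two steps above can be sketched in a few lines of R. This is an illustrative toy version only: the made-up 2-D data, k = 2, the fixed iteration count, and the assumption that neither cluster goes empty are all mine, and in practice the built-in kmeans() function does this work for us.

```r
# Toy k-means: two well-separated 2-D blobs, k = 2
set.seed(42)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 3), ncol = 2))
k <- 2
centroids <- x[sample(nrow(x), k), ]   # random initial centroids

for (iter in 1:10) {
  # CLUSTER ASSIGNMENT STEP: squared distance from every point to each
  # centroid, then pick the nearest centroid for each point
  d <- sapply(1:k, function(j)
    rowSums((x - matrix(centroids[j, ], nrow(x), 2, byrow = TRUE))^2))
  assign <- max.col(-d)                # column index of the smallest distance

  # MOVE CENTROID STEP: move each centroid to the mean of its points
  # (assumes neither cluster has become empty)
  centroids <- t(sapply(1:k, function(j)
    colMeans(x[assign == j, , drop = FALSE])))
}
table(assign)                          # sizes of the two resulting groups
```

With blobs this well separated, the assignments should stabilize after a couple of iterations; real implementations add convergence checks and multiple random restarts.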

Cluster Analysis on Accidental Deaths by Natural Causes in India using R 
An implementation of the k-means algorithm can be readily downloaded as the R package cluster. Using this package, we shall do a cluster analysis of accidental deaths by natural causes in India.
The steps implemented are discussed below: 
The data for our analysis, covering 2001 to 2012, was downloaded from www.data.gov.in. The input data is displayed below: 
For any cluster analysis, all the features have to be converted to numerical values, and the larger values in the year column are converted to z-scores for better results. 
The elbow method (code available below) is run to find the optimal number of clusters present within the data points. 
The k-means cluster method of the R package is run, and the results are visualized as below:
Code:
#Fetch data
data <- read.csv("Cluster Analysis.csv")
#Keep the Andhra Pradesh rows and total the male/female count columns
APStats <- data[which(data$STATE == 'ANDHRA PRADESH'), ]
APStats$APMale <- rowSums(APStats[, 4:8])
APStats$APFemale <- rowSums(APStats[, 9:13])
#Retain cause, year and the two totals (columns 2, 3, 14, 15)
data <- APStats[c(2, 3, 14, 15)]
library(cluster)
library(graphics)
library(ggplot2)
#Factor the categorical CAUSE field into numeric codes
data$CAUSE <- as.numeric(factor(data$CAUSE))
#Z-score for the Year column: (x - mean) / sd
data$Year <- (data$Year - mean(data$Year)) / sd(data$Year)
#Run k-means for increasing k and record the total within-cluster cost
cost_df <- data.frame()

for (i in 1:100) {
  fit <- kmeans(x = data, centers = i, iter.max = 100)
  cost_df <- rbind(cost_df, cbind(i, fit$tot.withinss))
}
names(cost_df) <- c("cluster", "cost")

#Elbow method to identify the ideal number of clusters
#Cost plot
ggplot(data = cost_df, aes(x = cluster, y = cost, group = 1)) +
  theme_bw(base_family = "Garamond") +
  geom_line(colour = "darkgreen") +
  theme(text = element_text(size = 20)) +
  ggtitle("Reduction In Cost For Values of 'k'\n") +
  xlab("\nClusters") +
  ylab("Within-Cluster Sum of Squares\n")
#Fit the final model with k = 5 (chosen from the elbow plot) and visualize
clust <- kmeans(data, 5)
clusplot(data, clust$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
#Attach the cluster label to each row and inspect cluster 5
data[, 'cluster'] <- clust$cluster
head(data[which(data$cluster == 5), ])