A series of blog posts on data science and data mining.

Tuesday, January 7, 2014

Data Analysis Tools

As mentioned in my previous post, in this post I will list the tools, blogs, forums, and online courses that I have gathered over the past year. I found them necessary in my own journey, and I hope they will be helpful to my fellow data science aspirants.




 Skillset Required:
  •  Knowledge of statistics – exploratory analysis, that is, doing an initial analysis of the data and understanding it well enough to decide which techniques need to be applied. I feel this is a must-know subject.
  • Mathematics – basics of calculus, algebra, etc., for the mathematical formulation of the problem statement.
  •  Understanding of machine learning algorithms for predictive modeling, recommendation engines, classification models, cluster analysis, and social network analysis.
  • Data mining skills such as data cleaning/data munging, and applying machine learning techniques to the data.
  •  Visualization skills to display the final results and to understand intermediate results while building data models.
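To make the "initial analysis" idea concrete, here is a minimal sketch in Python of the kind of first look I mean. It uses only the standard `statistics` module; the `summarize` helper and the `orders` sample are made up for illustration and are not from any real dataset.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
        "iqr": q3 - q1,
    }


# Hypothetical sample: daily order counts with one suspicious value (95).
orders = [12, 15, 11, 14, 13, 95, 12, 16]
summary = summarize(orders)

for name, value in summary.items():
    print(f"{name}: {value}")
```

When the mean sits far above the median, as it does here, that is usually a hint to look for outliers or skew before choosing a modeling technique.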
Tools required: 
Programming Languages: Proficiency in any two of the languages below is advisable:
  •  R 
  • Python 
  • Java – comes in handy when we work on Hadoop 
  • C, C++
Tools required: Since I use open source tools, I will confine myself to them:
  • RStudio
  • NLTK (Natural Language Toolkit)
  • RapidMiner
  • Weka
Important Point: 
Most machine learning algorithms have already been implemented as packages in the above languages/tools. We just need to download them and make use of them.
Big data Tools: 
  • Hadoop setup from Cloudera/Hortonworks
  • MongoDB – NoSQL database
  • HBase, Pig, Hive
Visualization tools: 
Though I have not explored much in this area, so far I am happy with the R packages for visualization.
  • Data exploration in R/Python 
A few books I have referred to:
  • The Elements of Statistical Learning - 2nd Edition 
  • Simon Sheather, A Modern Approach to Regression
  •  Data Mining 3rd Edition by Ian H. Witten, Eibe Frank, Mark A. Hall
Online Courses: 
Though a lot of courses are available online, I have stuck to a very few sites, listed below:
For Data Analysis, Stats, Maths:
For Big Data: 
Blogs and forums: 
Online forums are one place where I get a lot of information. In LinkedIn Groups I could get answers to all my trivial questions. You post any query and you get elaborate answers from research scholars and industry experts alike. I really love this place. I will list a few LinkedIn groups I follow:
Blogs I follow:
I will add more as I come across new tools. I hope this will serve you as a starting point for the journey. All the best, and Happy New Year. Please do add any new tools and technologies to the above list.
In my next post, I shall write about how to tackle a data analysis problem.