A Series of blog posts on Data Science, Data Mining.

Monday, October 19, 2015

Data Mining Standard Process across Organizations

Recently I came across the term CRISP-DM, a data mining standard. The process is not new, but I feel every analyst should know this commonly used, industry-wide process. In this post I explain the different phases involved in creating a data mining solution.

CRISP-DM, an acronym for Cross Industry Standard Process for Data Mining, is a data mining process model that describes the approaches data analytics organizations commonly use to tackle business problems with data mining. Polls conducted on the same website (KDnuggets) in 2002, 2004, 2007 and 2014 show that it was the leading methodology among the industry data miners who responded to the survey.

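The CRISP-DM model is a phased approach to tackling a business problem. The standard itself defines six phases (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment); in this post I break the work into the following eight steps: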
  • Use case Identification
  • Business Understanding
  • Data Acquisition and Data Understanding
  • Data Preparation
  • Exploratory Analysis
  • Data Modeling
  • Data Evaluation
  • Deployment
Let us look at each phase in detail:
Use case Identification: This is the initial phase of the process, in which a potential business problem is formulated as a data mining use case. Brainstorming sessions are held with the different stakeholders to define the problem statement, its impact on the business, a clear objective for the solution, and its timelines.
Audience:
  1. Higher management
  2. IT teams – Application team, DBA team
  3. Analytics team – Data Scientist
Business Understanding: This phase focuses on understanding the business flow: how the current system addresses the business problem under consideration, which data sources will be required for data mining, and which key features are needed for modeling from a domain point of view.
Audience:
  1. Domain experts - for domain knowledge and understanding of business rules
  2. IT teams - for identifying data sources and key features of the system
  3. Analytics teams - Data Scientist
Data Acquisition and Data Understanding: This phase involves data collection activities such as pulling data from the client databases into a central repository where the Data Analytics team develops solutions. Depending on the problem at hand, the data sources might range from SQL databases to text files, log files, web pages, etc. This phase also involves getting familiar with the data: assessing data quality, checking for missing values, extracting basic insights, and identifying interesting patterns to form hypotheses.
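
As a rough illustration of the "getting familiar with the data" part, here is a minimal sketch that reports how much of each column is missing. The sample rows, the column names and the rule for what counts as "missing" are all assumptions made up for this example; in practice they depend on your data sources.

```python
from collections import Counter

# Hypothetical sample rows, as they might arrive from a database or a CSV file.
rows = [
    {"customer_id": 1, "age": 34, "city": "Pune"},
    {"customer_id": 2, "age": None, "city": "Delhi"},
    {"customer_id": 3, "age": 29, "city": ""},
    {"customer_id": 4, "age": None, "city": "Pune"},
]


def is_missing(value):
    """Assumption: None and empty/whitespace-only strings count as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def missing_report(rows):
    """Return {column: fraction of rows in which the value is missing}."""
    missing = Counter()
    columns = []
    for row in rows:
        for column, value in row.items():
            if column not in columns:
                columns.append(column)
            if is_missing(value):
                missing[column] += 1
    return {column: missing[column] / len(rows) for column in columns}


report = missing_report(rows)
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} missing")
```

A report like this is often the first input to the conversation with domain experts about which columns can be trusted.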

Data Preparation: This phase of CRISP-DM involves processing and cleaning the raw data so that it can be fed into data mining algorithms. It is one of the most crucial steps in data mining, because the accuracy of the solution depends on the quality of the data. All the activities required to create the final dataset are done here: handling missing data using methods such as imputation, converting data into proper formats (for example, unstructured to structured), identifying outliers, normalizing the data, etc.
Audience:
  1. Data Analytics team
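
To make the preparation activities above a little more concrete, here is a minimal sketch of three of them on a single numeric column: median imputation, outlier identification with the interquartile-range (IQR) rule, and min-max normalization. The data is made up, and the specific choices (median rather than mean, the 1.5 x IQR threshold, dropping the outliers before normalizing) are my assumptions for illustration, not something CRISP-DM prescribes.

```python
import statistics

# Hypothetical raw column with missing values (None) and one extreme value.
raw_incomes = [42.0, 38.5, None, 51.0, 47.5, None, 250.0, 44.0]


def impute_with_median(values):
    """Replace missing entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


imputed = impute_with_median(raw_incomes)
outliers = iqr_outliers(imputed)
# Assumption: outliers are dropped here; flagging or capping them are alternatives.
cleaned = [v for v in imputed if v not in outliers]
normalized = min_max_normalize(cleaned)

print("outliers:", outliers)
print("normalized:", [round(v, 2) for v in normalized])
```

Whether an outlier should be removed, capped or kept is a business question as much as a statistical one, which is one more reason to loop back to the domain experts.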
Exploratory Analysis: This is the visual analytics phase, in which the data scientist tries to understand the patterns hidden in the data. The objective of this step is to understand the main characteristics of the data. The analysis is generally done using tools such as Tableau, R, etc.
Performing an exploratory analysis helps us:
  1. To understand the causes of an observed event.
  2. To understand the nature of the data we are dealing with.
  3. To assess the assumptions on which our analysis will be based.
  4. To identify the key features in the data needed for the analysis.
Exploratory analysis involves both quantitative analysis, using basic statistics such as mean/median/mode and standard deviation, and graphical analysis, using normal plots, histograms, box plots, scatter plots, etc.
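
Both kinds of analysis can be sketched in a few lines. The example below computes the basic statistics mentioned above and prints a crude text histogram; the data and the bin width are invented for illustration, and in a real project you would normally use a proper plotting tool rather than printing characters.

```python
import statistics
from collections import Counter

# Hypothetical numeric feature, e.g. order values.
order_values = [12, 15, 15, 18, 21, 22, 15, 30, 95]

# Quantitative analysis: basic summary statistics.
summary = {
    "mean": statistics.mean(order_values),
    "median": statistics.median(order_values),
    "mode": statistics.mode(order_values),
    "stdev": statistics.stdev(order_values),  # sample standard deviation
}


def histogram_counts(values, bin_width):
    """Count values per bin; each bin is keyed by its lower edge."""
    return dict(sorted(Counter((v // bin_width) * bin_width for v in values).items()))


# Graphical analysis: a very rough text histogram.
bins = histogram_counts(order_values, bin_width=20)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
for lower_edge, count in bins.items():
    print(f"{lower_edge:>3}-{lower_edge + 19:<3} {'#' * count}")
```

Even this tiny example shows why the two views complement each other: the mean is pulled well above the median, and the histogram makes the single large value responsible easy to spot.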

Data Modeling: In this phase, various modeling techniques are selected and applied to the data to extract features, build the model, and tune its parameters to optimal values. Typically this involves applying suitable data mining or machine learning algorithms to the dataset. Some problems can be solved with a single method, whereas others require a combination of multiple techniques.
For example, Netflix's recommendation system uses a combination of Boltzmann machines, gradient boosted decision trees, logistic regression, etc.
Sometimes different methods are also applied separately to find the best method for the problem at hand.
For example, logistic regression, decision trees and random forests are each applied to the dataset to see which yields the best model.
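
The idea of trying several methods and keeping the best one can be sketched as follows. To stay self-contained, this example does not use logistic regression, decision trees or random forests; it swaps in two deliberately simple stand-in models (a mean baseline and a least-squares straight line) and made-up data. The selection logic is the point: fit every candidate on the same data, score each on data it has not seen, and keep the one with the lowest error.

```python
import statistics


def fit_mean_baseline(xs, ys):
    """Stand-in model 1: always predict the mean of the training targets."""
    mean_y = statistics.mean(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Stand-in model 2: ordinary least-squares straight line."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: the models are fitted on one part and scored on the other.
train_x, train_y = [1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
holdout_x, holdout_y = [7, 8], [14.1, 15.9]

candidates = {"mean baseline": fit_mean_baseline, "straight line": fit_line}
scores = {
    name: mean_squared_error(fit(train_x, train_y), holdout_x, holdout_y)
    for name, fit in candidates.items()
}
best = min(scores, key=scores.get)
print("best model:", best)
```

With real algorithms the structure stays the same; only the entries in `candidates` change.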
In this phase, the dataset is divided into two sets: a Training Set and a Test Set. The model is built using the Training Set, and the Test Set is used to evaluate it.
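
A minimal sketch of such a split is shown below. The function name, the 80/20 ratio and the fixed seed are my assumptions for illustration; the right ratio depends on how much data you have.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of 100 records, represented here by their ids.
records = list(range(100))
train_set, test_set = train_test_split(records, test_fraction=0.2)
print(len(train_set), "training records,", len(test_set), "test records")
```

Shuffling before splitting matters when the records are stored in some meaningful order, for example by date or by customer; otherwise the test set may not resemble the training set.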

Data Evaluation: This is the follow-up step to the Data Modeling phase. The model built in the previous step needs to be thoroughly validated before moving to deployment, and it should address all the business objectives mentioned in the problem statement. The Test Set created in the previous step is used to test the model. The objective of this step is to measure the prediction error the model makes on the test set. If the prediction error is low, the model is good to go. Sometimes the error is large, which indicates underfitting or overfitting. Based on the results, we might have to go back to previous phases and tune the model.
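
One common way to tell the two failure modes apart is to compare the error on the training set with the error on the test set. The sketch below uses mean squared error and a deliberately simple rule of thumb; the `acceptable_error` threshold and the example numbers are assumptions for illustration, and a real project would set the threshold from the business objectives.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def diagnose(train_error, test_error, acceptable_error):
    """Rule of thumb (an assumption, not part of CRISP-DM):
    - high error even on the training data -> underfitting
    - low training error but high test error -> overfitting
    """
    if train_error > acceptable_error:
        return "underfitting"
    if test_error > acceptable_error:
        return "overfitting"
    return "good to go"


# Hypothetical actual values and model predictions.
train_actual, train_predicted = [3.0, 5.0, 7.0, 9.0], [3.1, 4.9, 7.2, 8.8]
test_actual, test_predicted = [11.0, 13.0], [14.0, 9.0]

train_mse = mean_squared_error(train_actual, train_predicted)
test_mse = mean_squared_error(test_actual, test_predicted)
diagnosis = diagnose(train_mse, test_mse, acceptable_error=1.0)
print(f"train MSE={train_mse:.3f}, test MSE={test_mse:.3f}: {diagnosis}")
```

Here the model fits the training data closely but does badly on unseen data, which is the typical signature of overfitting and a signal to go back to the modeling phase.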

Deployment: Once model building and evaluation are complete and we are satisfied with the results, the next step is to present the results to the business users. The published results should be in a form that users can read and understand. Most of the time they are published as reports or through a UI. For example, if top management needs the results to make key business decisions, visual reports are the appropriate format. If an end user needs to be recommended a new item on an e-commerce website, the results should be displayed in the web UI.

Most of the time, going back and forth between phases is required. For example, if while evaluating the model we find that it suffers from overfitting, we can go back to the modeling phase and fine-tune the model. As another example, if in the modeling phase we observe that a sparsely populated feature column is critical to the solution, we go back to the Business Understanding step and consult the domain experts to find out whether we can derive more information about that column and impute it with relevant values.
For more information about CRISP-DM, see its Wikipedia page.