Saturday, November 4, 2017
information retrieval document search using vector space model in R
Labels:
cosine similarity,
document search,
information retrieval,
inverse document frequency,
jaccard similarity,
natural language processing,
nlp,
R,
term document matrix,
term frequency,
tfidf,
vector space model,
VSM
Blogger on Data Science - www.dataperspective.info
Friday, March 18, 2016
apply lapply rapply sapply functions in R
As part of the Data Science with R series, this is the third tutorial, following the posts on basic data types and control structures in R.
One issue with the for loop in R is its memory consumption and its slowness when executing a repetitive task. When iterating over large data, an explicit for loop is often not advised. R provides several alternatives for applying looping operations to vectors. In this section, we deal with the apply function and its variants:
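As a rough illustration of the idea, here is a minimal sketch of three of the variants in base R. The matrix `m` and the list `scores` are made-up data for this example, not taken from the tutorial, and an explicit for loop is included only for comparison.

```r
# A small numeric matrix: 2 rows, 3 columns (filled column by column).
m <- matrix(1:6, nrow = 2)

# apply(): run a function over rows (MARGIN = 1) or columns (MARGIN = 2).
row_sums  <- apply(m, 1, sum)    # one value per row
col_means <- apply(m, 2, mean)   # one value per column

# lapply(): apply a function to each element of a list; always returns a list.
scores <- list(a = c(1, 2, 3), b = c(10, 20))
list_means <- lapply(scores, mean)

# sapply(): like lapply(), but simplifies the result to a vector when it can.
vec_means <- sapply(scores, mean)

# The equivalent explicit for loop, for comparison.
loop_means <- numeric(length(scores))
for (i in seq_along(scores)) {
  loop_means[i] <- mean(scores[[i]])
}

print(row_sums)
print(col_means)
print(vec_means)
```

The apply-style calls say what is computed per row, column, or element, while the loop has to manage the index and the result vector by hand.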
Labels:
apply function r,
apply r,
functions R,
lapply r,
mapply r,
R,
r programming,
sapply r,
tapply r
Saturday, February 27, 2016
Control Structures Loops in R
As part of the Data Science tutorial series, my previous post covered basic data types in R. I have kept the tutorial very simple so that beginners in R programming can take off immediately.
Please find the online R editor at the end of the post so that you can execute the code on the page itself.
In this section we learn about the control structures and loops used in R. As in any other programming language, control structures in R include conditionals and loop statements.
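The following is a minimal sketch of the constructs named in this post's labels (if/else, for, while, repeat with break, and ifelse). The variable names and values are illustrative assumptions, not code from the original tutorial.

```r
# if / else: choose one branch based on a condition.
x <- 7
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}

# for: iterate over a known sequence.
total <- 0
for (i in 1:5) {
  total <- total + i
}

# while: repeat as long as the condition holds.
n <- 1
while (n < 100) {
  n <- n * 2
}

# repeat: loop forever until an explicit break.
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}

# ifelse(): vectorised conditional, one result per element.
kinds <- ifelse(c(1, 2, 3, 4) %% 2 == 0, "even", "odd")

print(parity)
print(total)
print(n)
print(count)
print(kinds)
```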
Labels:
break statement R,
for loop statement R,
functions R,
ifelse statement R,
R,
repeat statement R,
while statement R
Principal Component Analysis using R
Curse of Dimensionality:
One of the most commonly faced problems in data analytics tasks such as recommendation engines and text analytics is high-dimensional, sparse data. We often face a situation where we have a large set of features and few data points, or data with very high-dimensional feature vectors. In such scenarios, fitting a model to the dataset results in lower predictive power. This is often termed the curse of dimensionality. In general, adding more data points or decreasing the feature space, also known as dimensionality reduction, often reduces the effects of the curse of dimensionality.
In this post, we will discuss principal component analysis (PCA), a popular dimensionality reduction technique. PCA is a useful statistical method that has found application in a variety of fields and is a common technique for finding patterns in high-dimensional data.
Principal component analysis:
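As a hedged sketch of what dimensionality reduction with PCA can look like in R, the example below uses base R's `prcomp()` on the built-in `iris` data set. The choice of `iris` and of keeping two components are assumptions made for this illustration; they are not taken from the original post.

```r
# Use the four numeric measurement columns of the built-in iris data
# (an illustrative choice, not the data set used in the post).
features <- iris[, 1:4]

# Centre and scale each feature, then compute the principal components.
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Keep only the first two components: 4 features reduced to 2.
reduced <- pca$x[, 1:2]

print(round(var_explained, 3))
print(dim(reduced))
```

Scaling matters here because PCA is sensitive to the units of each feature; without it, the feature with the largest variance tends to dominate the first component.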
Labels:
Curse of Dimensionality,
dimensionality reduction,
feature extraction,
image processing,
machine learning,
matrix,
PCA,
principal component analysis,
R,
Recommendation Engine,
speech recognition,
text processing
Tuesday, February 16, 2016
Basic Data Types in R
As part of the tutorial series on Data Science with R from Data Perspective, this first tutorial introduces the very basics of the R programming language: its basic data types.
What we learn:
At the end of the chapter, you are provided with an R console so that you can practice what you have learnt.
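For readers who want something to try in the console, here is a minimal sketch of the data types named in this post's labels (integer, list, matrix, data frame) alongside the other atomic types. All values are made up for illustration.

```r
# Atomic types.
num <- 3.14    # numeric (double)
int <- 42L     # integer (note the L suffix)
chr <- "R"     # character
lgl <- TRUE    # logical

# A vector holds elements of a single type.
vec <- c(1, 2, 3)

# A matrix is a vector with two dimensions.
mat <- matrix(1:4, nrow = 2)

# A list can mix types and is accessed by name or position.
lst <- list(name = "Ann", age = 30L)

# A data frame is a table whose columns may have different types.
df <- data.frame(id = 1:2, score = c(9.5, 7.0))

print(class(num))
print(class(int))
print(dim(mat))
print(lst$name)
print(str(df))
```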
Labels:
data analysis,
data frame R,
data mining,
data types R,
integer R,
list R,
matrix R,
R
Friday, December 25, 2015
Data Science with R
As the R programming language becomes more and more popular among data scientists, and as industries, researchers, and companies embrace R, going forward I will be writing posts on learning data science using R. The tutorial course will include topics on data types in R, handling data using R, probability theory, machine learning, supervised and unsupervised learning, data visualization using R, etc. Before going further, let’s just see some stats and tidbits on data science and R.
"A data scientist is simply someone who is highly adept at studying large amounts of often unorganized/undigested data"
Labels:
Data Science,
machine learning,
R,
supervised learning,
unsupervised learning,
visualization
Wednesday, November 18, 2015
Item Based Collaborative Filtering Recommender Systems in R
Intuition:
Labels:
collaborative filtering,
content based recommender system,
item based collaborative filtering,
R,
Recommendation Engine,
recommender systems,
user based collaborative filtering
Monday, October 19, 2015
Data Mining Standard Process across Organizations
Recently I came across a term, CRISP-DM, a data mining standard. Though this process is not new, I felt every analyst should know about this commonly used industry-wide process. In this post I will explain the different phases involved in creating a data mining solution.
CRISP-DM, an acronym for Cross Industry Standard Process for Data Mining, is a data mining process model that describes the approaches data analytics organizations commonly use to tackle business problems related to data mining. Polls conducted on the same website (KDnuggets) in 2002, 2004, 2007 and 2014 show that it was the leading methodology used by the industry data miners who responded to the survey.
Labels:
data mining standards,
data modeling,
Data Preparation,
Data Scientist,
Hypothesis,
logistic regression,
missing values imputation,
normalization,
R,
visualization
Wednesday, October 7, 2015
Introduction to Logistic Regression with R
In my previous post I explained linear regression. In today’s post I will explain logistic regression.
Consider a scenario where we need to predict a patient's medical condition (HBP), either HAVE HIGH BP or NO HIGH BP, based on some observed symptoms: Age, weight, Issmoking, Systolic value, Diastolic value, RACE, etc. In this scenario we have to build a model that takes the above-mentioned symptoms as input values and HBP as the response variable. Note that the response variable (HBP) is a value among a fixed set of classes, HAVE HIGH BP or NO HIGH BP.
Logistic regression – a classification problem, not a prediction problem:
In my previous post I said that we use linear regression for scenarios that involve prediction. But there is a catch: linear regression cannot be applied in scenarios where the response variable is not continuous. In our case the response variable is not continuous but a value among a fixed set of classes. We call such scenarios classification problems rather than prediction problems. Where the response variable is qualitative rather than continuous, we have to apply a more suitable model, namely logistic regression for classification.
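To make the classification setup concrete, here is a minimal sketch using base R's `glm()` with a binomial family. The data are simulated purely for illustration: the column names `age`, `weight` and `hbp`, the coefficients used to generate the outcome, and the 0.5 cut-off are all assumptions, not values from the original post or from any real patient data.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated predictors (hypothetical, not real patient data).
n <- 200
age    <- round(runif(n, min = 20, max = 80))
weight <- round(rnorm(n, mean = 75, sd = 12))

# Simulated outcome: the chance of high BP rises with age and weight.
# The coefficients below are made up for the illustration.
p   <- plogis(-8 + 0.08 * age + 0.04 * weight)
hbp <- rbinom(n, size = 1, prob = p)   # 1 = HAVE HIGH BP, 0 = NO HIGH BP

patients <- data.frame(age, weight, hbp)

# Fit a logistic regression: binary response, binomial family (logit link).
model <- glm(hbp ~ age + weight, data = patients, family = binomial)

# Predict the probability of high BP for a new, hypothetical patient.
new_patient <- data.frame(age = 65, weight = 90)
prob <- predict(model, newdata = new_patient, type = "response")

# Turn the probability into a class label using an assumed 0.5 cut-off.
predicted_class <- ifelse(prob > 0.5, "HAVE HIGH BP", "NO HIGH BP")

print(coef(model))
print(prob)
print(predicted_class)
```

The model outputs a probability between 0 and 1 rather than a continuous measurement, which is what makes it suitable for a response with a fixed set of classes.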
Labels:
algorithm,
big data,
classification,
data mining,
Data Science,
Exploratory analysis,
logistic regression,
machine learning,
R,
Statistics,
supervised learning
Thursday, April 9, 2015
Exposing R-script as API
R is becoming a popular programming language in the area of data science. Integrating R scripts with web UI pages is a challenge that many application developers face. In this blog post I will explain how we can expose an R script as an API using rApache and the Apache web server.
rApache is a project supporting web application development using the R statistical language and environment and the Apache web server.
Labels:
apache,
API,
data analysis,
R,
Rapache,
web service