Data Perspective: Introduction to Logistic Regression with R

In my previous post I explained linear regression. In today's post I will explain logistic regression.
Consider a scenario where we need to predict a patient's medical condition (HBP), HAVE HIGH BP or NO HIGH BP, based on some observed symptoms: Age, weight, Issmoking, Systolic value, Diastolic value, RACE, etc. In this scenario we have to build a model that takes the above-mentioned symptoms as input values and HBP as the response variable. Note that the response variable (HBP) takes a value from a fixed set of classes: HAVE HIGH BP or NO HIGH BP.

Logistic regression – a classification problem, not a prediction problem:

In my previous post I said that we use linear regression for scenarios that involve prediction. There is a catch, however: linear regression cannot be applied when the response variable is not continuous. In our case the response variable is not continuous but takes a value from a fixed set of classes. We call such scenarios classification problems rather than prediction problems. When the response variable is qualitative rather than continuous, we have to apply a more suitable model, such as logistic regression, for classification.

Definition:

Assume we have a binary output variable Y and a vector of p input variables X. Rather than modeling the response Y directly, logistic regression models the conditional probability Pr(Y = 1|X = x), that is, the probability that Y belongs to a particular category, as a function of x.

Mathematically, logistic regression is expressed as:
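
In its standard form for a single predictor x (an assumption here, chosen to match the two parameters β0 and β1 used in this post; with p predictors the exponent becomes β0 + β1x1 + ... + βpxp), the model can be written as:

```latex
\Pr(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}
\qquad\Longleftrightarrow\qquad
\log\frac{\Pr(Y = 1 \mid X = x)}{1 - \Pr(Y = 1 \mid X = x)} = \beta_0 + \beta_1 x
```

The right-hand form shows that the log-odds of the response are linear in x, which is why the fitted probability always stays between 0 and 1.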

Estimating Coefficients – Maximum Likelihood function:
The unknown parameters, β0 and β1, are estimated by the maximum likelihood method using the available training data. The likelihood function expresses the probability of the observed data as a function of the unknown parameters. The maximum likelihood estimators are the parameter values that maximize this function; in other words, they are the values that agree most closely with the observed data.
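
For n independent training observations (x_i, y_i) with y_i equal to 0 or 1, and writing p(x) for Pr(Y = 1|X = x), the standard likelihood and log-likelihood (given here as a sketch for the single-predictor case; the notation is an assumption, not taken from the original post) are:

```latex
L(\beta_0, \beta_1) = \prod_{i=1}^{n} p(x_i)^{y_i} \, \bigl(1 - p(x_i)\bigr)^{1 - y_i},
\qquad
p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}

\ell(\beta_0, \beta_1) = \log L(\beta_0, \beta_1)
  = \sum_{i=1}^{n} \Bigl[ y_i (\beta_0 + \beta_1 x_i) - \log\bigl(1 + e^{\beta_0 + \beta_1 x_i}\bigr) \Bigr]
```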

For now, we take it as given that maximizing the likelihood function yields the estimates of the unknown parameters.

In R, we use glm(), which takes the training data as input and returns the fitted model with the estimated parameters; we will see it in a later section.
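
To make the idea of maximum likelihood concrete, here is a minimal sketch on simulated data. The data, the "true" coefficients -0.5 and 1.2, and the names neg_log_lik, fit_optim and fit_glm are all assumptions made for illustration. It minimizes the negative log-likelihood numerically with optim() and compares the result with what glm() returns:

```r
# Simulated data (hypothetical, not the NHANES file used later in this post)
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Negative log-likelihood of a single-predictor logistic model;
# beta[1] is the intercept, beta[2] is the slope
neg_log_lik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log1p(exp(eta)))
}

# Maximize the likelihood by minimizing its negative with a general optimizer
fit_optim <- optim(c(0, 0), neg_log_lik, method = "BFGS")

# Fit the same model with glm()
fit_glm <- glm(y ~ x, family = binomial)

print(fit_optim$par)
print(unname(coef(fit_glm)))
```

The two sets of estimates should agree closely, since glm() with family = binomial maximizes the same likelihood (using iteratively reweighted least squares rather than BFGS).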

Making Predictions:

Once the coefficients have been estimated, it is a simple matter to compute the probability of the response for any given input by substituting β0, β1 and x into the model equation, Pr(Y = 1|X = x) = e^(β0 + β1x) / (1 + e^(β0 + β1x)).
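
As a small worked sketch of this substitution, assume a single predictor (age) and two made-up coefficient values; beta0 and beta1 below are hypothetical and are not estimates from the NHANES data:

```r
# Hypothetical coefficients, chosen only for illustration
beta0 <- -5.0   # intercept
beta1 <- 0.08   # coefficient for age

age <- c(30, 50, 70)

# Linear predictor (log-odds)
eta <- beta0 + beta1 * age

# Logistic function written out by hand
p_manual <- exp(eta) / (1 + exp(eta))

# The same computation using R's built-in logistic CDF
p_builtin <- plogis(eta)

print(data.frame(age = age, p_manual = p_manual, p_builtin = p_builtin))
```

Because beta1 is positive in this sketch, the computed probability increases with age.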

Note: R provides predict(), which takes the fitted model and new input values and returns the predicted response.
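
A minimal sketch of predict() on a fitted glm() model, using simulated data. The data frame, the column names age and hbp, and the 0.5 cut-off are assumptions for illustration only:

```r
# Simulated training data (hypothetical)
set.seed(1)
train <- data.frame(age = round(runif(300, min = 20, max = 80)))
train$hbp <- rbinom(300, size = 1, prob = plogis(-5 + 0.08 * train$age))

# Fit the logistic regression; bare column names in the formula
# let predict() look the predictors up in newdata
fit <- glm(hbp ~ age, data = train, family = binomial)

# New inputs must use the same column names as the training data
new_patients <- data.frame(age = c(35, 60))

# type = "response" returns probabilities; the default returns log-odds
p_hat <- predict(fit, newdata = new_patients, type = "response")

# Turn probabilities into class labels with an assumed 0.5 threshold
class_hat <- ifelse(p_hat > 0.5, 1, 0)

print(p_hat)
print(class_hat)
```

Without type = "response", predict() returns values on the log-odds scale, which is a common source of confusion.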

Use case – Classify if a person may have HBP/not HBP :

Let us take a use case and implement logistic regression in R. Let us classify/predict whether a person suffers from high blood pressure (HBP) given the input predictors AGE, SEX, IsSmoking, average systolic BP, average diastolic BP, RACE, body weight, height, etc. Access the data set from here. The dataset is the NHANES III dataset. To implement logistic regression, we follow the steps below:

data = read.csv('~/nhanes.csv')

str(data)

'data.frame': 15643 obs. of 10 variables:

$ HSAGEIR : int 21 32 48 35 48 44 42 56 82 44 ...

$ HSSEX : int 1 0 0 1 1 1 0 0 0 1 ...

$ BMPWTLBS: num 180 136 150 204 155 ...

$ BMPHTIN : num 70.4 63.9 61.8 69.8 66.2 70.2 62.6 67.6 59.5 71.1 ...

$ TCP : int 268 160 236 225 260 187 216 156 179 162 ...

$ HBP : int 0 0 0 0 0 0 0 0 1 0 ...

$ RACE_2 : num 0 0 0 0 0 1 1 0 0 0 ...

$ RACE_3 : num 0 0 0 0 0 0 0 0 0 0 ...

$ SMOKE_2 : num 0 0 1 0 1 0 0 1 0 0 ...

$ SMOKE_3 : num 0 0 0 0 0 1 1 0 0 1 ...

#removing SEQN, SDPPSU6, SDPSTRA6 from the dataset; from the description of the dataset,

#we concluded that they are not relevant to the analysis.

data = data[,-c(1:3)]

#excluding HAR3 for now.

data = data[,-c(1,10)]

# remove every row that has a missing value (NA) in any column.

data = na.omit(data)

# from the description we consider HBP as response variable.

names(data)

'[1] "HSAGEIR" "HSSEX" "DMARACER" "BMPWTLBS" "BMPHTIN" "PEPMNK1R" "PEPMNK5R" "HAR1" "SMOKE" "TCP" "HBP" '

summary(data)

Data Transformations:

# Create dummy (design) variables for DMARACER and SMOKE. A variable with k levels needs k-1 dummy variables,

# i.e. if there are 3 levels, create 3-1 = 2 dummy variables.

# For RACE create 2 dummy variables, RACE_2 and RACE_3: White (1) -> (0,0), Black (2) -> (1,0), Other (3) -> (0,1).

for(level in unique(data$DMARACER)){

data[paste("RACE", level, sep = "_")] <- ifelse(data$DMARACER == level, 1, 0)

}

#removing race_1

data = data[,-12]

for(level in unique(data$SMOKE)){

data[paste("SMOKE", level, sep = "_")] <- ifelse(data$SMOKE == level, 1, 0)

}

#removing SMOKE_1

data = data[,-14]
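
The positional removals above (data[,-12] and data[,-14]) depend on the order in which unique() happens to return the levels. A minimal sketch on a small made-up data frame (toy is hypothetical, not the NHANES file) that creates the same kind of dummy columns and drops the reference level by name instead:

```r
# Hypothetical toy data with a 3-level race code
toy <- data.frame(DMARACER = c(2, 1, 3, 1, 2))

# One 0/1 column per level; sort() makes the column order predictable
for (level in sort(unique(toy$DMARACER))) {
  toy[paste("RACE", level, sep = "_")] <- ifelse(toy$DMARACER == level, 1, 0)
}

# Drop the reference level (1) by name rather than by column position
toy$RACE_1 <- NULL

print(toy)
```

Dropping by name keeps working even if columns are added or reordered earlier in the script.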

table(data$HSSEX,data$HBP)

     0    1

  0 6576 1702

  1 5817 1548

Fem_HBP = 1702/(6576+1702)  # 0.2056052

Mal_HBP = 1548/(1548+5817)  # 0.2101833
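
The same proportions can be obtained without typing the counts into a formula by hand. As a sketch, the counts from the table above are re-entered here as a matrix (counts and row_props are illustrative names), and prop.table() computes the share of HBP within each HSSEX group:

```r
# Counts copied from the table(data$HSSEX, data$HBP) output above
counts <- matrix(
  c(6576, 1702,
    5817, 1548),
  nrow = 2, byrow = TRUE,
  dimnames = list(HSSEX = c("0", "1"), HBP = c("0", "1"))
)

# margin = 1 divides each row by its row total
row_props <- prop.table(counts, margin = 1)

print(round(row_props, 4))
```

On the real data, prop.table(table(data$HSSEX, data$HBP), margin = 1) would give the same result directly.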

Modelling:

#remove HAR1,PEPMNK1R as SMOKE & HBP are created from these variables.

data = data[,-c(3,6,7,8,9,16)]

#create training & testing data

train_x = data

smp_size <- floor(0.75 * nrow(train_x))

train_ind <- sample(seq_len(nrow(train_x)), size = smp_size)

train <- train_x[train_ind, ]

test <- train_x[-train_ind, ]
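
Because sample() draws at random, the split above changes on every run. A minimal sketch of the same 75/25 split made reproducible with set.seed(), shown on a small hypothetical data frame (toy and the seed value 123 are assumptions):

```r
# Fix the random number generator so the split is repeatable
set.seed(123)

# Hypothetical data standing in for the cleaned NHANES data frame
toy <- data.frame(id = 1:20, value = rnorm(20))

smp_size <- floor(0.75 * nrow(toy))
train_ind <- sample(seq_len(nrow(toy)), size = smp_size)

train <- toy[train_ind, ]
test <- toy[-train_ind, ]

# Sanity checks: sizes add up and no row appears in both sets
print(c(train = nrow(train), test = nrow(test)))
print(length(intersect(train$id, test$id)))
```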

# use bare column names in the formula so that predict() can later be applied to new data such as test

glm2 = glm(HBP~HSAGEIR+HSSEX+BMPWTLBS+BMPHTIN+TCP+RACE_2+RACE_3,data=train,family=binomial)

The summary of the fitted model can be viewed with summary(glm2). In my next post, we will evaluate this logistic regression model, consider a few other models, and choose the best among them.


Wednesday, October 7, 2015
