From Basics to Brilliance: Insights into Logistic Regression Assumptions



Logistic regression is an effective way to fit a regression model when the response variable is binary. As the independent variables change, the dependent variable responds with one of two outcomes, such as “yes or no”, “0 or 1”, or “true or false”.

In this section, we will look at the assumptions logistic regression makes, which you should check before applying it to your data.


Assumptions of logistic regression

  • Response variable is binary 

This is basically the whole point of logistic regression. It assumes that the response (dependent) variable can take only two values, for example:

  • Yes/No
  • True/False
  • Disable/Enable
  • In/Out

A simple way to check this assumption is to count how many unique outcomes the response variable takes.
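Counting the unique outcomes can be sketched in a few lines of Python. The helper name `is_binary` and the two sample columns are made up for illustration, and ignoring missing entries is an assumption of this sketch rather than a rule.

```python
def is_binary(values):
    """Return True if the response takes exactly two distinct values."""
    # Assumption: missing entries (None) are ignored rather than counted.
    distinct = {v for v in values if v is not None}
    return len(distinct) == 2


# Hypothetical response columns, for illustration only
purchased = ["yes", "no", "no", "yes", "yes"]
rating = [1, 2, 3, 2, 1]

print(is_binary(purchased))  # True
print(is_binary(rating))     # False
```

A column like `rating`, with three distinct values, would need a different model or would have to be recoded into two categories first.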

  • Observations are independent

Logistic regression assumes that the observations are independent of each other and do not come from repeated measurements. No individual should be measured more than once, and repeated measurements of the same individual should not be entered into the model.

A way to check this assumption is to plot the residuals against the order of the observations and look for a random pattern. If a clear pattern appears, or if the observations were not collected at random and without bias, the assumption is likely violated.
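One part of this check, that no individual is measured more than once, is easy to automate when each observation carries a subject identifier. The sketch below assumes such an identifier exists; the function name and the IDs are hypothetical. It only catches repeated measurement and does not detect other kinds of dependence between observations.

```python
from collections import Counter


def repeated_subjects(subject_ids):
    """Return the IDs that appear more than once, with their counts."""
    counts = Counter(subject_ids)
    return {sid: n for sid, n in counts.items() if n > 1}


# Hypothetical subject identifiers, one per observation
ids = ["a01", "a02", "a03", "a02", "a04", "a02"]
print(repeated_subjects(ids))  # {'a02': 3}
```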

  • Explanatory variables have no multicollinearity

Multicollinearity occurs when two or more explanatory variables are highly correlated with each other, so they do not provide unique information to the model. Highly correlated variables cause problems when fitting and interpreting the regression model.

Let’s say you are studying babies and record the following variables:

  • Weight of the baby
  • Baby’s clothes’ weight
  • Baby’s diet

Here, the weight of the baby and the weight of its clothes carry more or less the same information, so including both adds little while taking up space in the model.

A common way to detect multicollinearity is the variance inflation factor (VIF), which measures how strongly each explanatory variable is correlated with the other explanatory variables.
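As a rough illustration, the VIF of each explanatory variable equals the matching diagonal entry of the inverse of the predictors' correlation matrix. The sketch below uses only the Python standard library. The variable names `x1`, `x2`, `x3` and their values are invented for the example, and the threshold mentioned in the comments is a common rule of thumb, not a fixed rule.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column up.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """columns: dict of name -> list of values. Returns name -> VIF."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}


# Hypothetical predictors: x2 is almost a copy of x1, x3 is unrelated.
predictors = {
    "x1": [1, 3, 1, 3, 5, 7, 5, 7],
    "x2": [0.8, 3.2, 1.2, 2.8, 5.2, 6.8, 4.8, 7.2],
    "x3": [2, 2, 6, 6, 2, 2, 6, 6],
}
scores = vif(predictors)
for name, value in scores.items():
    print(name, round(value, 2))
# x1 and x2 come out far above 10 (often read as a warning sign),
# while x3 stays close to 1.
```

The elimination routine fails with a division by zero if two predictors are perfectly correlated, which is itself a sign of multicollinearity.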


  • No extreme outliers

Logistic regression assumes that there are no extreme outliers or influential observations that distort the model.

Cook’s distance is an effective way to detect outliers and influential observations in a dataset. You can remove them from the data, replace them with a value such as the mean or median, or keep them, but remember to report them in the regression results.

  • The explanatory variables and the Logit of response variable have a linear relationship between them.

The logit is defined as:

Logit(p) = log(p / (1-p))

where p is the probability of a positive outcome.

Logistic regression assumes that the logit of the response variable is linearly related to the explanatory variables.
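To make the definition concrete, here is a minimal sketch of the logit and its inverse in Python. The function names are chosen for this example, and `log` is taken to be the natural logarithm, as is standard for logistic regression.

```python
import math


def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))


def inverse_logit(z):
    """Map a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-z))


print(logit(0.5))            # 0.0
print(round(logit(0.8), 4))  # 1.3863
```

A probability of 0.5 corresponds to a logit of 0, and probabilities above 0.5 give positive logits. The linearity assumption says that this log-odds value, not the probability itself, changes linearly with each explanatory variable.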

The Box-Tidwell test can be used to check whether this assumption holds for your dataset.

  • Sufficient sample size 

Logistic regression assumes that the sample is large enough to draw reliable conclusions from the fitted model.

A rule of thumb makes this assumption concrete: you need at least 10 cases of the least frequent outcome for each explanatory variable. Let’s say you have 5 explanatory variables and expect the probability of the least frequent outcome to be 0.30. The sample size should then be at least (10*5)/0.30 ≈ 166.7, which rounds up to 167.
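The rule of thumb is simple arithmetic, sketched below as a small helper. The function and parameter names are made up for this example, and the default of 10 cases per explanatory variable is the rule of thumb from the text, not a strict requirement.

```python
import math


def min_sample_size(n_predictors, p_least_frequent, cases_per_predictor=10):
    """Smallest sample size that satisfies the cases-per-variable rule."""
    # The least frequent of two outcomes cannot have probability above 0.5.
    if not 0 < p_least_frequent <= 0.5:
        raise ValueError("p_least_frequent must be in (0, 0.5]")
    # Round up, because a fractional observation is not possible.
    return math.ceil(cases_per_predictor * n_predictors / p_least_frequent)


print(min_sample_size(5, 0.30))  # 167
```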

  • Unlike linear regression, logistic regression does not require:

  • A linear relationship between the explanatory variables and the response variable.
  • Normally distributed residuals.
  • Homoscedasticity of the residuals.