Logistic regression has proven to be an effective way of fitting a regression model when the response variable is binary: as the independent variables change, the dependent variable responds in a “yes or no”, “0 or 1”, “true or false” manner.
In this section, we will look at the assumptions logistic regression makes before you apply it to a model.
This is essentially the whole point of logistic regression: it assumes that the response (dependent) variable can take only two values.
A simple way to check this assumption is to count how many unique outcomes the response variable can possibly take.
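The check above can be done in one line of plain Python. The data here is hypothetical, standing in for your response column:

```python
# Quick check that the response variable is binary.
# 'outcomes' is hypothetical example data standing in for your response column.
outcomes = [0, 1, 1, 0, 1, 0, 0, 1]

unique_values = set(outcomes)
print(sorted(unique_values))            # [0, 1]
print(len(unique_values) == 2)          # True -> binary response
```

If the response takes more than two values, ordinal or multinomial logistic regression is the appropriate tool instead.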
Logistic regression assumes the observations to be independent of each other and not to come from repeated measurements. No individual should be measured more than once or entered into the model multiple times.
One way to check this assumption is to examine the order of the observations. Make sure the observations were collected at random, without any bias; otherwise the assumption is violated.
Multicollinearity among explanatory variables occurs when two or more of them fail to provide unique information to the model. In that case, the explanatory variables are correlated with each other and carry similar information. When variables are highly correlated, they create discrepancies in fitting and interpreting the regression model.
Let’s say you want to observe the weight of babies, and your observations include both the baby’s own weight and the combined weight of the baby and its clothes.
Here, these two variables give out more or less the same data, needlessly taking up space in the model.
The best way to look out for multicollinearity is the VIF (variance inflation factor). It measures the strength of the correlation between each explanatory variable and the remaining ones.
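The VIF for a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. Here is a NumPy-only sketch on hypothetical data (in practice, statsmodels’ `variance_inflation_factor` does the same job):

```python
import numpy as np

# Hypothetical data: column 1 is nearly a multiple of column 0 (collinear),
# column 2 is an independent predictor.
rng = np.random.default_rng(0)
x0 = rng.normal(size=100)
x1 = 2 * x0 + rng.normal(scale=0.1, size=100)   # highly collinear with x0
x2 = rng.normal(size=100)                        # independent
X = np.column_stack([x0, x1, x2])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2) from regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])      # add intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

for j in range(X.shape[1]):
    print(f"VIF for column {j}: {vif(X, j):.1f}")
```

Columns 0 and 1 come out with very large VIFs, flagging the collinearity, while column 2 stays near 1. A common rule of thumb treats VIF above 5 (or 10) as a sign of problematic multicollinearity.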
Logistic regression assumes that there are no extreme outliers or overly influential observations in the data that goes into the model.
Cook’s distance is an effective way to identify outliers and influential observations in a dataset. You can choose to remove them from the data, replace them with the mean or median, or leave them in; if you keep them, remember to report them in the regression results.
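As a sketch of how Cook’s distance flags influential points, here is the linear-regression form computed by hand on hypothetical data with one planted outlier (influence tools in libraries such as statsmodels handle this, including for logistic fits):

```python
import numpy as np

# Hypothetical data: a simple linear relationship with one planted outlier.
rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 1.5 * x + rng.normal(scale=0.5, size=30)
y[0] += 8.0                                     # extreme outlier at index 0

X = np.column_stack([np.ones_like(x), x])       # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
p = X.shape[1]
s2 = resid @ resid / (len(y) - p)               # residual variance estimate
h = np.diag(H)                                  # leverage of each observation
cooks = resid**2 / (p * s2) * h / (1 - h)**2    # Cook's distance

print("Largest Cook's distance at index:", int(np.argmax(cooks)))
```

The planted outlier at index 0 dominates the Cook’s distances, which is exactly the kind of observation you would investigate, remove, impute, or report.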
The logit is defined as:
Logit(p) = log(p / (1-p))
Where p is the probability of a positive outcome.
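The logit transform above can be computed directly:

```python
import math

# The logit (log-odds) transform from the formula above.
def logit(p):
    return math.log(p / (1 - p))

print(logit(0.5))            # 0.0 (even odds)
print(round(logit(0.9), 3))  # 2.197
```

Note that the logit maps probabilities in (0, 1) onto the whole real line, which is what makes a linear model on the logit scale possible.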
Logistic regression assumes that this logit of the response variable is linearly related to the explanatory variables.
The Box-Tidwell test is used to see if this assumption holds true in your dataset for the regression model.
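The test works by adding an x·ln(x) interaction term for each continuous predictor and checking whether those terms are significant in the logistic fit. A sketch of the term construction on hypothetical data (the significance test itself would be run with a logistic-regression fit from a stats library):

```python
import numpy as np

# Building the x * ln(x) terms used by the Box-Tidwell test.
# Predictors must be strictly positive for ln(x) to be defined.
rng = np.random.default_rng(2)
X = rng.uniform(0.5, 10.0, size=(50, 2))   # two hypothetical continuous predictors

bt_terms = X * np.log(X)                   # element-wise x * ln(x)
X_augmented = np.column_stack([X, bt_terms])

print(X_augmented.shape)   # (50, 4)
```

If the coefficients on the added terms come out significant, the linearity-in-the-logit assumption is violated, and transforming the offending predictors is a common remedy.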
Logistic regression assumes that the sample size from which the observations are drawn is large enough to yield reliable conclusions for the regression model.
There is a rule of thumb to put this assumption in place: you need at least 10 cases of the least frequent outcome for each explanatory variable. Let’s say you have 5 explanatory variables and you expect the probability of the least frequent outcome to be 0.30; the model then demands a sample size of at least (10 × 5) / 0.30 ≈ 166.7, i.e. 167 observations.
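The rule of thumb above reduces to a one-line calculation, rounding up to the next whole observation:

```python
import math

# Minimum sample size per the rule of thumb:
# (events_per_predictor * number of predictors) / P(least frequent outcome).
def min_sample_size(n_predictors, p_least_frequent, events_per_predictor=10):
    return math.ceil(events_per_predictor * n_predictors / p_least_frequent)

print(min_sample_size(5, 0.30))   # 167
```

This is only a heuristic; formal power analysis gives a more defensible sample size when the stakes warrant it.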