The linearity assumption for continuous variables in logistic regression is often overlooked by researchers, in my experience reading articles. It is a key assumption of these regression models: the log-odds of the outcome are assumed to change linearly with the predictor. It deserves attention because most continuous variables show a higher frequency of the outcome at both high and low values, with a range of normal values in the middle that is not associated with adverse outcomes.
In the first part of this three-part tutorial, I will explain how to recognize this problem in the data. In the second part, I will suggest ways to model the non-linearity of continuous variables without resorting to more complex types of regression. Finally, and most importantly, in Part III I will show a detailed example of a study where this assumption was not checked, and how checking it could have improved model fit.
Checking the linearity assumption in logistic regression
Let’s look at an example to showcase the issue. I took the NHANES III dataset published by the CDC, with systolic blood pressure as the predictor variable and death as the outcome. I categorized the predictor variable into quintiles, to show the risk of the outcome in each category, and ran a logistic regression model.
In the dataset, systolic blood pressure categorized into quintiles is named sbp.c. This model resulted in a linear increase in the risk of death, shown by the increasing values of the β coefficient at each successive quintile of the predictor variable.
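To make the quintile approach concrete, here is a minimal Python sketch. It does not use the NHANES III data: it simulates a systolic blood pressure variable whose risk of death rises linearly on the log-odds scale (the sample size, intercept and slope are invented for illustration), cuts it into quintiles, and computes the log odds ratio of death in each quintile against the first. With a single categorical predictor, these log odds ratios are the same β coefficients a logistic regression with dummy variables would estimate, so no regression library is needed here. The names sbp_c, quintile_labels and quintile_coefficients are illustrative, not taken from the original analysis.

```python
import math
import random

random.seed(0)

# Synthetic stand-in for the real data (NOT NHANES III): SBP in mmHg, with a
# risk of death that is linear on the log-odds scale. All numbers are invented.
n = 20000
sbp = [random.gauss(125, 20) for _ in range(n)]
died = [1 if random.random() < 1 / (1 + math.exp(-(-7 + 0.04 * x))) else 0
        for x in sbp]


def quintile_labels(values):
    """Assign each value a quintile from 1 to 5 based on its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * 5 // len(values) + 1
    return labels


def quintile_coefficients(labels, outcome):
    """Log odds ratio of the outcome in each quintile vs. quintile 1.

    These match the dummy-variable coefficients of a logistic regression
    with the quintile as its only (categorical) predictor.
    """
    log_odds = {}
    for q in range(1, 6):
        events = sum(y for lab, y in zip(labels, outcome) if lab == q)
        non_events = sum(1 - y for lab, y in zip(labels, outcome) if lab == q)
        log_odds[q] = math.log(events / non_events)
    return {q: log_odds[q] - log_odds[1] for q in range(1, 6)}


sbp_c = quintile_labels(sbp)
betas = quintile_coefficients(sbp_c, died)
for q in range(1, 6):
    print(f"quintile {q}: beta = {betas[q]:.2f}, OR = {math.exp(betas[q]):.2f}")
```

On data simulated this way, the coefficients climb from one quintile to the next, which is the pattern described above for sbp.c.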
Now let’s look at another example with a different predictor variable. This time I categorized sodium into quintiles and ran a logistic regression model:
You can now see that sodium (sodium.c) has a U-shaped relationship with death, meaning that both low and high values of sodium may correlate with mortality (notice the "may": we do not really know this yet, and have to find out).
To show more clearly the relationship of sodium to mortality, check these graphs:
As you can see, the coefficients of the logistic regression with sodium in quintiles show that the second to fourth quintiles have a weaker association with mortality, and the last quintile a stronger one. The first quintile is the reference category for the odds ratios in Figure A, so the line goes from an odds ratio of 1 in the first quintile to 0.7 in the second.
The second graph (Figure B) shows this more clearly: the raw case-fatality rate in each quintile of the predictor traces a U shape.
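The same check can be done without a plot. The sketch below is a minimal Python illustration on simulated data, not on NHANES III: it assumes sodium values in mmol/L centred on 140 with mortality lowest near the centre (all parameters are invented), computes the raw case-fatality rate in each quintile, and flags a U shape when the lowest-risk quintile is an interior one. The names sodium_c, case_fatality_by_quintile and is_u_shaped are illustrative.

```python
import math
import random

random.seed(1)


def death_probability(x):
    """Invented risk curve: log-odds of death rise quadratically away from 140."""
    return 1 / (1 + math.exp(-(-2.5 + 0.03 * (x - 140) ** 2)))


# Synthetic sodium values (assumed to be in mmol/L); NOT real NHANES III data.
n = 20000
sodium = [random.gauss(140, 4) for _ in range(n)]
died = [1 if random.random() < death_probability(x) else 0 for x in sodium]


def quintile_labels(values):
    """Assign each value a quintile from 1 to 5 based on its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * 5 // len(values) + 1
    return labels


def case_fatality_by_quintile(labels, outcome):
    """Proportion of deaths within each quintile."""
    rates = {}
    for q in range(1, 6):
        group = [y for lab, y in zip(labels, outcome) if lab == q]
        rates[q] = sum(group) / len(group)
    return rates


sodium_c = quintile_labels(sodium)
rates = case_fatality_by_quintile(sodium_c, died)
for q in range(1, 6):
    bar = "#" * round(rates[q] * 100)  # crude text bar, one '#' per percent
    print(f"quintile {q}: {rates[q]:.3f} {bar}")

# A U shape shows up as a minimum in an interior quintile (2, 3 or 4).
lowest = min(rates, key=rates.get)
is_u_shaped = lowest not in (1, 5) and rates[1] > rates[lowest] < rates[5]
print("lowest-risk quintile:", lowest, "| U-shaped:", is_u_shaped)
```

A monotone predictor such as the blood pressure example would instead put the lowest rate in the first or last quintile, so the flag would be False.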
U-shaped relationships of this type can usually be modeled with a second-degree polynomial logistic regression; more on that in Part II of this tutorial.
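For orientation, a second-degree polynomial logistic regression simply adds a squared term to the linear predictor. Written in generic notation (the symbols x and β are placeholders, not the fitted values from this dataset):

```latex
\operatorname{logit}\bigl(P(\text{death} = 1 \mid x)\bigr)
  = \log\frac{P(\text{death} = 1 \mid x)}{1 - P(\text{death} = 1 \mid x)}
  = \beta_0 + \beta_1 x + \beta_2 x^2
```

When \beta_2 > 0 the log-odds curve is U-shaped, with its lowest point at x = -\beta_1 / (2\beta_2); when \beta_2 = 0 the model reduces to the usual linear one.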
We have now finished the first part of this tutorial. If you have any comments, or see any errors, send me an email at angel.paternina @ gmail.com.
In Part II of this tutorial, I will show how to fit a polynomial (curvilinear) model for a non-linear relationship using logistic regression.