TUTORIAL: The linearity assumption in logistic regression, Part I

In my experience reading published articles, researchers often overlook the linearity assumption for continuous variables in logistic regression. It is a key assumption of these models, because many continuous variables show a higher frequency of the outcome at both low and high values, with a middle range of normal values that does not correlate with any adverse outcome.

In the first part of this three-part tutorial, I will explain how to recognize this problem in the data. In the second part, I will suggest ways to model the non-linearity of continuous variables without resorting to other, more complex types of regression. Finally, and most importantly, in Part III I will walk through a detailed example of a published study where this assumption was not checked, and show how checking it could have improved model fit.

 

Checking the linearity assumption in logistic regression

Let’s work through an example to showcase the issue. I took the NHANES III dataset published by the CDC and fit a logistic regression model with systolic blood pressure as the predictor variable and death as the outcome. To examine the shape of the relationship, I categorized the predictor into quintiles and ran the model on the categorized variable, so the risk of the outcome can be seen at each cut-off point.

R code:

> fit <- glm(death ~ factor(sbp.c), data = dba, family = binomial)
> summary(fit)

Call:
glm(formula = death ~ factor(sbp.c), family = binomial, data = dba)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.3569  -0.6137  -0.4616   1.0079   2.3372

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)   -2.66385 0.06690 -39.818  < 2e-16 ***
factor(sbp.c)2 0.47810 0.08604   5.557 2.75e-08 ***
factor(sbp.c)3 1.08992 0.07857  13.872  < 2e-16 ***
factor(sbp.c)4 2.03612 0.07441  27.363  < 2e-16 ***
factor(sbp.c)5 3.07652 0.07440  41.351  < 2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 22601 on 19660 degrees of freedom
Residual deviance: 18688 on 19656 degrees of freedom
(389 observations deleted due to missingness)
AIC: 18698

Number of Fisher Scoring iterations: 5

Here, systolic blood pressure was categorized into quintiles (sbp.c). The model shows an approximately linear increase in the risk of death: the β coefficient rises steadily at each successive level of the predictor.
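For reference, the quintile variable can be constructed with cut() and quantile(). A minimal sketch, assuming the continuous blood pressure values are stored in a column named sbp of the data frame dba (the column name is an assumption on my part, not from the NHANES codebook):

```r
# Sketch: cut a continuous predictor into quintiles.
# Assumes a data frame `dba` with a continuous column `sbp`;
# the column name is illustrative.
dba$sbp.c <- cut(dba$sbp,
                 breaks = quantile(dba$sbp, probs = seq(0, 1, by = 0.2),
                                   na.rm = TRUE),
                 include.lowest = TRUE,
                 labels = 1:5)

# Fit the model on the categorized predictor, as above.
fit <- glm(death ~ factor(sbp.c), data = dba, family = binomial)
summary(fit)
```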

Now let’s look at another example with a different predictor variable. This time I categorized serum sodium into quintiles and ran the same kind of logistic regression model:

R code:

> fit <- glm(death ~ factor(sodium.c), data = dba, family = binomial)
> summary(fit)

Call:
glm(formula = death ~ factor(sodium.c), family = binomial, data = dba)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.8427  -0.8092  -0.6924  -0.6914   1.7598

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)       -0.94829 0.03931 -24.126 < 2e-16 ***
factor(sodium.c)2 -0.35777 0.05694 -6.284 3.31e-10 ***
factor(sodium.c)3 -0.36107 0.05776 -6.251 4.07e-10 ***
factor(sodium.c)4 -0.25931 0.05685 -4.561 5.08e-06 ***
factor(sodium.c)5  0.09559 0.05414  1.766   0.0775 .

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 18826 on 16853 degrees of freedom
Residual deviance: 18711 on 16849 degrees of freedom
(3196 observations deleted due to missingness)
AIC: 18721

Number of Fisher Scoring iterations: 4

 

You can now see that sodium (sodium.c) has a U-shaped relationship with death; that is, both low and high sodium values may correlate with mortality (note the may: we do not really know this yet, and have to find out).

To show more clearly the relationship of sodium to mortality, check these graphs:

As you can see, the coefficients of the logistic regression with sodium in quintiles show that the second to fourth quintiles are associated with lower mortality, and the fifth quintile with higher mortality. The first quintile is the reference category for the odds ratios in Figure A, so the line starts at an odds ratio of 1 in the first quintile and drops to about 0.7 in the second.

The second graph (Figure B) shows this more clearly: the raw case-fatality rate within each quintile of the predictor traces a U-shaped relationship with mortality.
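Plots like Figures A and B can be drawn in base R. A minimal sketch, assuming the same data frame dba with the outcome death and the quintile variable sodium.c, with the first quintile as the reference:

```r
# Figure A (sketch): odds ratios per sodium quintile, reference OR = 1.
fit <- glm(death ~ factor(sodium.c), data = dba, family = binomial)
or <- c(1, exp(coef(fit)[-1]))  # drop the intercept, exponentiate the betas
plot(1:5, or, type = "b",
     xlab = "Sodium quintile", ylab = "Odds ratio vs. 1st quintile")

# Figure B (sketch): raw case-fatality rate within each quintile.
cfr <- tapply(dba$death, dba$sodium.c, mean, na.rm = TRUE)
plot(1:5, cfr, type = "b",
     xlab = "Sodium quintile", ylab = "Case-fatality rate")
```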

This type of U-shaped relationship can often be modeled with a second-degree polynomial term in the logistic regression; more on that in Part II of this tutorial.
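As a brief preview, a quadratic term can be added directly to the model formula. A minimal sketch, assuming the continuous sodium values are stored in a column named sodium (again, an illustrative name):

```r
# Sketch: second-degree polynomial logistic regression.
# I(sodium^2) adds a quadratic term, allowing the modeled risk to rise
# at both low and high sodium values (a U-shape).
fit2 <- glm(death ~ sodium + I(sodium^2), data = dba, family = binomial)
summary(fit2)
```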

This concludes the first part of this tutorial. If you have any comments, or spot any errors, send me an email at angel.paternina @ gmail.com.

In Part II of this tutorial, I will show how to model this non-linear relationship using polynomial terms in logistic regression.