4.1 Logistic regression

So far, while we’ve explored binary independent variables in our regressions, such as White vs. non-White for individuals, we have not considered how to construct a regression analysis when the outcome (dependent) variable is binary, like an individual getting a college degree or not. In this case, we can frame that binary outcome as a likelihood (between 0 and 1) of the outcome happening, which essentially turns our binary outcome back into a continuous variable. However, we cannot use the simple linear regression from the previous chapter, because the nature of the outcomes here is quite different. Instead of the relationship between x and y being best represented by a straight line (with normally distributed residuals), our possible outcomes here are often represented by an S-shaped curve that is constrained to 0 and 1 as minimum and maximum likelihoods, with movement in x (or in a combination of multiple x variables) in one direction leading to an increasingly likely outcome, and vice versa (you could design your own Galton Board-style experiment to convince yourself that this curve is a good approximation of real-world phenomena).

This curve seems like a completely different model of the relationship between x and y from what we’ve dealt with before, but transforming the y axis into a “logit” (log odds, as explained in the link above) turns the S-curve back into a linear relationship between x and y, which we can evaluate with our previous regression method. Again, all of this still depends on the assumption that a logit model is an appropriate representation of our observed data, which involves the same kind of pre-analysis inspection and experience that simple linear regression requires. But a logit (or logistic) regression is a powerful and generalizable technique for using linear combinations of one or more independent variables to (non-causally) predict a categorical outcome. The simplest version, which we’ll cover in this section, involves a binary outcome coded as having a 0 to 1 likelihood of occurring, but logistic regression can be extended to predict among a set of multiple outcomes.
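
To get a feel for this transformation, here is a minimal base-R sketch (purely illustrative, not part of the analysis that follows): the logistic curve is S-shaped and bounded between 0 and 1 in probability space, while the same values expressed as log odds fall on a straight line.

# Illustrative only: the logistic curve and its logit (log odds) transformation
x <- seq(-6, 6, by = 0.1)
p <- plogis(x)       # logistic (S-shaped) curve: values bounded between 0 and 1
logit <- qlogis(p)   # log odds, i.e. log(p / (1 - p)); recovers a straight line in x

plot(x, p, type = "l")      # the S-curve in probability space
plot(x, logit, type = "l")  # the same relationship is linear in log-odds space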

For our example, let’s use individual-level PUMS data, specifically information about age, race, and ethnicity, to predict whether an individual speaks only English at home or another language. The processing steps involved should be familiar to you from previous chapters.

library(tidycensus)
library(tidyverse)
library(tigris)
library(sf)

census_api_key("c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")

pums_vars_2018 <- 
  pums_variables %>%
  filter(year == 2018, survey == "acs5")

ca_pums <- get_pums(
  variables = c(
    "PUMA",
    "AGEP",
    "RAC1P",
    "HISP",
    "LANX"
  ),
  state = "CA",
  survey = "acs5",
  recode = T
)

ca_pumas <-
  pumas("CA", cb = T, progress_bar = F)

bay_county_names <-
  c(
    "Alameda",
    "Contra Costa",
    "Marin",
    "Napa",
    "San Francisco",
    "San Mateo",
    "Santa Clara",
    "Solano",
    "Sonoma"
  )

bay_counties <-
  counties("CA", cb = T, progress_bar = F) %>%
  filter(NAME %in% bay_county_names)

bay_pumas <-
  ca_pumas %>% 
  st_centroid() %>% 
  .[bay_counties, ] %>% 
  st_set_geometry(NULL) %>% 
  left_join(ca_pumas %>% select(GEOID10)) %>% 
  st_as_sf()

bay_pums <-
  ca_pums %>% 
  filter(PUMA %in% bay_pumas$PUMACE10)

We’ll use AGEP, age, as a continuous variable, but for race and ethnicity we’ll code white and hispanic as binary 0/1 variables. We’ll do the same for the dependent variable whose probability we’d like to predict, english, though keep in mind that this 0/1 coding plays a distinct role in the logistic regression: the outcome will be treated as “odds” and converted to a log value.

bay_pums_language <-
  bay_pums %>% 
  mutate(
    # 1 if the individual is White alone, 0 otherwise
    white = ifelse(
      RAC1P_label == "White alone",
      1,
      0
    ),
    HISP_label = HISP_label %>% as.character(),
    # 1 if the individual is Hispanic/Latino of any kind, 0 otherwise
    hispanic = ifelse(
      HISP_label == "Not Spanish/Hispanic/Latino",
      0,
      1
    ),
    # 1 if the individual speaks only English at home, 0 otherwise
    english = ifelse(
      LANX_label == "No, speaks only English",
      1,
      0
    )
  )

Now, instead of lm(), we will use a more general function called glm(), which can perform the simple linear regression from the last chapter but can also receive different arguments to family = that change the “type” of regression to be performed (i.e., what kind of error distribution to expect in the data). Here, we’ll specify family = binomial() to tell glm() to treat the outcome as binary and fit a logit model.

logit_model <- glm(
  english ~ AGEP + white + hispanic,
  family = binomial(),
  data = bay_pums_language,
  weights = PWGTP  # PUMS person weights, so results reflect the full population
)

summary(logit_model)
## 
## Call:
## glm(formula = english ~ AGEP + white + hispanic, family = binomial(), 
##     data = bay_pums_language, weights = PWGTP)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -25.067   -4.091    2.410    3.703   23.245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.551e-01  1.726e-03  -553.2   <2e-16 ***
## AGEP         1.010e-02  3.579e-05   282.1   <2e-16 ***
## white        1.587e+00  1.592e-03   996.7   <2e-16 ***
## hispanic    -1.145e+00  4.372e-03  -261.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10585108  on 376056  degrees of freedom
## Residual deviance:  9286183  on 376053  degrees of freedom
## AIC: 9286191
## 
## Number of Fisher Scoring iterations: 6

Note that because of family = binomial(), glm() interprets the dependent and independent variables in the appropriate way for logistic regression. If you wanted to use glm() for simple linear regression (“ordinary least squares”), you could specify family = gaussian() instead. The glm() output is not immediately intuitive to interpret because the y value is the “log odds” of speaking English. If we take exp() of all the regression coefficients, the y value becomes “odds”.

exp(coef(logit_model))
## (Intercept)        AGEP       white    hispanic 
##   0.3847781   1.0101468   4.8876598   0.3183645

Odds divided by (1 + odds) yields the probability of the outcome occurring, which most people find more intuitive (a quick sketch of this conversion follows the list below). At face value, the negative and positive signs of the glm() coefficients indicate:

  • As age increases, holding other variables constant, probability of speaking English increases
  • If the individual is white, holding other variables constant, probability of speaking English increases
  • If the individual is Hispanic, holding other variables constant, probability of speaking English decreases
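
Here is a small sketch of that odds-to-probability conversion by hand, using the exponentiated coefficients printed above (odds_to_prob is just a helper defined here for illustration):

# Converting odds to probability: probability = odds / (1 + odds)
odds_to_prob <- function(odds) odds / (1 + odds)

# Baseline odds from the intercept (AGEP = 0, white = 0, hispanic = 0)
odds_to_prob(0.3847781)                # ~0.28 probability of speaking only English

# Each exponentiated coefficient multiplies the odds; e.g. white = 1
odds_to_prob(0.3847781 * 4.8876598)    # ~0.65 probability, holding AGEP = 0, hispanic = 0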

The exact probabilities can be computed, or predicted for specific combinations of independent variables, using predict() with the extra argument type = "response" to convert the result into a probability of occurrence:

predict(logit_model, data.frame(AGEP = 40, white = 1, hispanic = 0), type = "response")
##         1 
## 0.7379715
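
As a sanity check, we can reproduce this result by hand from the fitted coefficients: construct the linear predictor (the log odds) for this combination of independent variables and pass it through the inverse-logit function plogis(). This sketch is essentially what predict() with type = "response" does internally.

# Reproduce the prediction manually: linear predictor (log odds) -> probability
log_odds <- sum(coef(logit_model) * c(1, 40, 1, 0))  # intercept, AGEP = 40, white = 1, hispanic = 0
plogis(log_odds)  # inverse logit, exp(log_odds) / (1 + exp(log_odds)); same value as predict() above (~0.738)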