2.1 Housing costs

One common indicator of housing is “affordability” or “housing cost burden”, which is generally defined by a household spending more than 30% of their income on housing costs (50% is also used as a “severe” cost burden threshold). This measure can be used for both renters and owners. Using PUMS data, we can not only determine whether individual respondents are housing burdened or not (and extrapolate total population estimates using weights), but also quantify the total “amount of housing cost burden”, which can be thought of as a quantity of money that, if somehow cut from the appropriate households’ housing costs, would eliminate housing cost burden in the population. Below, we’ll demonstrate how to perform these calculations using 5-yr PUMS data for the whole Bay Area.

library(tidyverse)
library(tigris)
library(sf)
library(leaflet)

In the intro curriculum, we used censusapi to load PUMS data. In this curriculum, we’ll switch to using tidycensus, which does not provide access to as many census datasets as censusapi but provides a more streamlined interface for certain datasets, particularly PUMS. You can use the same Census API key as you’ve used before, but you provide it using a different function, as shown below.

library(tidycensus)

census_api_key("c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")

This census_api_key() line puts the key in your “R Environment”, which is not the same as the environment on your RStudio screen. This environment is a .Renviron file which you can find somewhere on your computer, probably your Documents folder. Try finding it and opening it in RStudio or another text editor, and you should see a line that starts with CENSUS_API_KEY, followed by your key. The next time you restart R (or you can use readRenviron("~/.Renviron") right now, which should re-read that file), RStudio should automatically know your Census API key.

Here we prepare the PUMS shapes in the Bay Area, as we’ve done before.

bay_county_names <-
  c(
    "Alameda",
    "Contra Costa",
    "Marin",
    "Napa",
    "San Francisco",
    "San Mateo",
    "Santa Clara",
    "Solano",
    "Sonoma"
  )

bay_counties <-
  counties("CA", cb = T, progress_bar = F) %>%
  filter(NAME %in% bay_county_names)

ca_pumas <-
  pumas("CA", cb = T, progress_bar = F)

bay_pumas <-
  ca_pumas %>% 
  st_centroid() %>% 
  .[bay_counties, ] %>% 
  st_set_geometry(NULL) %>% 
  left_join(ca_pumas %>% select(GEOID10)) %>% 
  st_as_sf()

With tidycensus, you can get a data dictionary as follows. Recall that you can also make use of the official codebook.

pums_vars_2019 <- 
  pums_variables %>%
  filter(year == 2019, survey == "acs5")

To get to as holistic of a measure of housing costs as possible, we need to consider many PUMS variables, as well as the different data fields related to renters vs. owners. The following variables may constitute housing costs:

RNTP: Monthly rent for renters
MHP: Annual mobile home costs for mobile home residents
MRGP: Monthly first mortgage payment for owners
MRGT: A flag for whether MRGP includes annual property taxes for a given record. If not, then the following field should be added.
TAXAMT: Annual property tax for owners
MRGI: A flag for whether MRGP includes insurance for a given record. If not, then the following field should be added.
INSP: Annual fire/hazard/flood insurance for owners
SMP: Any other monthly mortgage payments for owners
CONP: Monthly condo (or HOA) fees for condo owners
ELEP: Monthly electricity costs for all households
GASP: Monthly gas costs for all households
FULP: Other annual fuel costs for all households
WATP: Annual water costs for all households

Fortunately, there are pre-aggregated variables as well: GRNTP for all monthly renter housing costs, and SMOCP for all monthly owner housing costs.

There is one other important refinement to consider when working with dollar amounts in PUMS 5-yr samples. Since the individual responses can come from any of the 5 years (2015-2019), any dollar calculations should involve adjustments for inflation, so that, say, all values are represented in 2019 dollars. The PUMS data provides ADJHSG and ADJINC for each record, which vary in their values by year of the response, which you can use to adjust housing-related and income-related fields, respectively, to 2019 dollars.

The loading of PUMS data is done as follows:

ca_pums <- get_pums(
  variables = c(
    "PUMA",
    "GRNTP",
    "SMOCP",
    "ADJHSG",
    "HINCP",
    "ADJINC"
  ),
  state = "CA",
  year = 2019,
  survey = "acs5"
)

bay_pums <-
  ca_pums %>% 
  filter(PUMA %in% bay_pumas$PUMACE10)

Compared to the PUMS output from censusapi, this output is smarter about the class of variables, but not perfect. The cost and income variables above are automatically numeric, but the inflation-adjustments are characters and will need to be converted to numeric. Also note that the output automatically includes SERIALNO (unique for each household), SPORDER (where 1 is the head of household), WGTP (household weight), PWGTP (person weight), and ST (state).

Now, let’s compute housing cost burden as a percentage as well as an absolute dollar amount, given a specific burden threshold, for each record:

burden_threshold <- 0.3

bay_burden <-
  bay_pums %>% 
  filter(HINCP > 0) %>%
  filter(SPORDER == 1) %>% 
  transmute(
    PUMA,
    weight = WGTP,
    housingcost = ifelse(
      SMOCP > 0,
      SMOCP*12*as.numeric(ADJHSG),
      GRNTP*12*as.numeric(ADJHSG)
    ),
    income = HINCP*as.numeric(ADJINC),
    burden_perc = housingcost/income,
    burden = housingcost - burden_threshold*income
  )

Notes:

Our measure of housing burden is only meaningful if the household has non-zero income, so we filter out some households. But an alternate analysis can include these excluded households as part of a more holistic assessment of housing stress or stability.
Filtering SPORDER == 1 leaves you with one record per household. All housing-level variables (noted in the data dictionary) are the same for all members of the household anyway. But the person-level variables will be specific to the “head of household”.
Because we are analyzing at the household level, we use WGTP instead of PWGTP.
You can observe that any record with a positive SMOCP value doesn’t have a positive GRNTP value and vice versa, so this serves as an indicator of whether a household is a renter or owner. Our ifelse() statement takes advantage of this logic. It also applies ADJHSG to adjust the cost values per inflation. Notice how the ADJHSG value differs based on year, which is part of SERIALNO.
We adjust HINCP using ADJINC. Note that we multiplied the costs by 12 because they were defined as monthly, while the incomes are not multiplied because they are defined as yearly.
burden_perc is a percentage, while burden is the dollar amount over or under the exact amount that would be considered “affordable”, per the threshold we picked (in this case 0.3). If positive, then the household pays more than 30% of income on housing; if negative, then the household pays less.

There are many opportunities at this stage to further refine the analysis, given that housing cost burden may be more of a concern for households under a certain overall income, or with other characteristics. You might also be interested in multiple different burden thresholds as well. For now, let’s go ahead and summarize our results for each PUMA:

bay_burden_pumas <-
  bay_burden %>% 
  mutate(
    burdened = ifelse(
      burden_perc >= burden_threshold,
      weight,
      0
    ),
    excess = ifelse(
      burden < 0,
      burden,
      0
    ),
    burden = ifelse(
      burden > 0,
      burden,
      0
    )
  ) %>% 
  group_by(PUMA) %>% 
  summarize(
    burdened = sum(burdened),
    households = sum(weight),
    burden = sum(burden*weight),
    excess = sum(excess*weight)
  ) %>% 
  mutate(
    burdened_perc = burdened/households
  ) %>% 
  left_join(bay_pumas %>% select(PUMA = PUMACE10)) %>% 
  st_as_sf()

sum(bay_burden_pumas$burdened)/sum(bay_burden_pumas$households)

## [1] 0.3539971

sum(bay_burden_pumas$burden) %>% prettyNum(",") %>% paste0("$",.)

## [1] "$11,235,617,585"

Notes:

In the first ifelse() we use weight because we want to count the number of households that particular burdened households represents in the full population.
excess and burden are just holding the negative and positive values from the original burden field, respectively, so we can separately count “money available” and “money needed” within two different sets of households. Note that we couldn’t flip the order in which we did these two mutations, because the mutation of burden overwrites the original burden field.
When summarizing excess and burden, we need to multiply the dollar amounts by weight to get the total dollar amount for all households that each particular record represents.
prettyNum() adds commas where you’d expect them in normal numeric outputs, and we use paste0() to quickly add a dollar sign on the front.

Based on the quick summary statistics above, it appears that, across 2015-2019, about a third of households in the Bay Area were paying more than 30% of their income on housing, and the total amount of that housing cost above the 30% threshold, for all of those households combined, was over $11 billion per year, in 2019 dollars. In other words, external assistance in the form of universal basic income or housing vouchers at this scale would technically eliminate the “housing affordability problem” in the Bay Area. Here are two maps to visualize these results geospatially:

burden_pal1 <- colorNumeric(
  palette = "Purples",
  domain = bay_burden_pumas$burdened_perc
)

bay_burden_pumas %>% 
  leaflet() %>% 
  addProviderTiles(provider = providers$CartoDB.Positron) %>% 
  addPolygons(
    fillColor = ~burden_pal1(burdened_perc),
    fillOpacity = 0.5,
    color = "white",
    weight = 0.5,
    label = ~paste0(round(burdened_perc*100), "% of households paying 30%+ of income on housing"),
    highlightOptions = highlightOptions(
      weight = 2
    )
  ) %>% 
  addLegend(
    pal = burden_pal1,
    values = ~burdened_perc,
    title = "% Cost-burdened<br>households"
  )

Note that using ACS 5-yr data (e.g., B25074, B25095), you could create a very similar map at the block group level. But the PUMS method, trading off geographic granularity, provides a more accurate estimate of the actual dollar amount of burden, and allows for more customizations on this analysis.

burden_pal2 <- colorNumeric(
  palette = "Reds",
  domain = bay_burden_pumas$burden/1e6
)

bay_burden_pumas %>% 
  leaflet() %>% 
  addProviderTiles(provider = providers$CartoDB.Positron) %>% 
  addPolygons(
    fillColor = ~burden_pal2(burden/1e6),
    fillOpacity = 0.5,
    color = "white",
    weight = 0.5,
    label = ~paste0("$", round(burden/1e6), "M total annual cost burden"),
    highlightOptions = highlightOptions(
      weight = 2
    )
  ) %>% 
  addLegend(
    pal = burden_pal2,
    values = ~burden/1e6,
    title = "Total housing cost<br>burden, in $ millions"
  )

Note how burden is divided 1e6 and then reported in the hover-over label with M (million) for a more concise notation.

As previously noted, further refinement is strongly recommended to isolate just those who are truly “vulnerable” in the sense of having low income, or having a certain kind of less stable employment/income, or having certain household characteristics, or having even higher than 30% cost-burden. You could also quantify the “excess” value and consider this as money elsewhere in the population that could potentially be redistributed through progressive policies to enable more equitable housing. Otherwise, a big factor that may contribute to high housing costs is the scarcity of housing, which may be affected by local zoning, which we’ll explore next.