2.2 ACS data

censusapi is a package that allows for direct pulling of data from the American Communities Survey (ACS) and other Census datasets. This is a really good overview (as you can see, also written as a .Rmd file), as well as this online textbook by the creator of tigris. He also created tidycensus which is another popular choice for pulling Census data, with the advantage of allowing you to load tigris style shapefiles at the same time, but I tend to prefer censusapi because it gives you access to more Census datasets, everything you can see here. If you’re interested in the most recent and granular ACS data, you’d search for the latest 5-year survey on that page, which at the time of writing is the 2015-2019 5-yr survey, with a full data dictionary viewable here. The 5-yr datasets sample from the full range of the 5 years, which allows for more geographic granularity (and lower margins of error), while the 1-yr datasets sample from only 1 year. Usually the tradeoff is recency vs. geographic granularity. You might actually not get the information you are looking for from the 1-yr datasets, even at the County level if there isn’t enough population, so unless it’s absolutely critical to have recent data, I generally default to the 5-yr datasets.

Recall in the setup chunk we had a Sys.setenv() call. As you can see from the censusapi tutorial, this is needed to set an API key to access the Census API. You should request your own here. I’m including my own in the chunk below.

It’s useful to use listCensusMetadata() to read the contents of the 5-yr data dictionary, which will provide the ability to replace an illegible code like B01001_003E with the understanding that that variable provides counts of males under 5 years old.

library(tidyverse)
library(sf)
library(tigris)
library(leaflet)
library(censusapi)

Sys.setenv(CENSUS_KEY="c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")

acs_vars_2019_5yr <-
  listCensusMetadata(
    name = "2019/acs/acs5",
    type = "variables"
  )

This can take a while to load but is regularly useful, so this is a perfect kind of variable to save using saveRDS() for easy retrieval in the future. You can inspect this dataframe in the Source window in much the same way as you’d review the API page, or http://data.census.gov, to explore what variables exist in the ACS.

Next is a workflow using getCensus(), the fundamental function in censusapi, along with a pipeline of dplyr manipulations that should mostly look familiar to you already, in order to generate a clean dataset of, say, sex by age for CBGs in San Mateo County.

smc_sexbyage <-
  getCensus(
    name = "acs/acs5",
    vintage = 2019,
    region = "block group:*", 
    regionin = "state:06+county:081",
    vars = "group(B01001)"
  ) %>%
  mutate(
    cbg =
      paste0(state,county,tract,block_group)
  ) %>% 
  select(!c(GEO_ID,state,county,tract,block_group,NAME) & !ends_with(c("EA","MA","M"))) %>%
  pivot_longer(
    ends_with("E"),
    names_to = "variable",
    values_to = "estimate"
  ) %>%
  left_join(
    acs_vars_2019_5yr %>% 
      select(name, label), 
    by = c("variable" = "name")
  ) %>% 
  select(-variable) %>% 
  separate(
    label,
    into = c(NA,NA,"sex","age"),
    sep = "!!"
  ) %>% 
  filter(!is.na(age))

Notes:

  • getCensus() has lots of required parameters, best understood through reviewing the official documentation linked at the top of this section.
  • ‘06’ refers to California and 081 refers to San Mateo County; FIPS codes for states and counties can be found on their Wikipedia pages.
  • Knowledge of which variable codes to use comes from exploring the Census dataset and experience over time. Depending on what Census data you’re interested in, often there are multiple variable codes that you can use to patch a desired result together. Here we are working towards a percentage of elderly people in each CBG, and the “sex by age” dataset happens to be a pretty straightforward way to create any age group of interest, but you have to work with disaggregated sex data for some portion of the process, even if you’re not specifically interested in it.
  • I use ends_with() within select() for a more specialized search. The census data comes with “estimate” fields ending in “E”, “margins of error” fields ending in “M”, and “annotation” fields for either ending in “EA” or “MA” (which are usually empty”, so this operation gets rid of unused fields quickly. More specialized selection methods can be reviewed here.
  • We haven’t used separate() before, but it should be relatively straightforward. The into = argument is your opportunity to create new fields that will hold the separated contents of label, which holds a string like “Estimate!!Total!!Female!!65 to 74 years”, but since we have no use for the first two separated elements which are always “Estimate” and “Total”, I set NA for the first two elements, then specify field names sex and age for the next two based on my conceptual understanding of the structure of this particular dataset. http://data.census.gov gives you the most intuitive view of how the various datasets are structured.
  • Census datasets tend to have double-counting in the sense that there will be a “Total” field followed by rows disaggregating the “Total” into “Subtotals”. If you were to directly add up all the counts, you may end up double counting multiple times. In this specific case, the simplest way to remove double-counting is to filter(!is.na(age)), where age holds the last information available in the label which represents the “lowest level of disaggregation”. Any row that has an NA in this field would be a higher-level “Total” that would be a double-count of a field that doesn’t have an NA. Note that !is.na() is read as “is not empty”.

If we want to now get “percent elderly” for each CBG, we could do the following:

smc_elderly <- 
  smc_sexbyage %>% 
  mutate(
    elderly = 
      ifelse(
        age %in% c(
          "65 and 66 years",
          "67 to 69 years",
          "70 to 74 years",
          "75 to 79 years",
          "80 to 84 years",
          "85 years and over"
        ),
        estimate,
        NA
      )
  ) %>% 
  group_by(cbg) %>% 
  summarize(
    elderly = sum(elderly, na.rm = T),
    total_pop = sum(estimate, na.rm = T)
  ) %>% 
  mutate(
    percent_elderly = elderly/total_pop*100
  ) %>% 
  filter(!is.na(percent_elderly))

The only new technique here is ifelse() which takes three arguments: a logical condition, which within a mutate() will be assessed for each row; the result that should be given if the condition is true; and the result that should be given if the condition is false.

And to conclude, here’s a map of the result, using tigris and sf techniques from the previous chapter:

smc_blockgroups <- block_groups("CA", "San Mateo", cb = T, progress_bar = F)
elderly_pal <- colorNumeric(
  palette = "Blues",
  domain = 
    smc_elderly$percent_elderly
)

leaflet() %>% 
  addProviderTiles(provider = providers$CartoDB.Positron) %>% 
  addPolygons(
    data = 
      smc_elderly %>% 
        left_join(
          smc_blockgroups %>% select(GEOID), 
          by = c("cbg" = "GEOID")
        ) %>% 
        st_as_sf(),
    fillColor = ~elderly_pal (percent_elderly),
    color = "white",
    opacity = 0.5,
    fillOpacity = 0.75,
    weight = 1,
    label = ~paste0(
      round(percent_elderly), 
      "% over age 65"
    ),
    highlightOptions = highlightOptions(
      weight = 2,
      opacity = 1
    )
  ) %>% 
  addLegend(
    data = smc_elderly,
    pal = elderly_pal,
    values = ~percent_elderly,
    title = "% over 65"
  )