2.3 Decennial Data

The 2020 Decennial Census was an attempt to count 100% of residents across the U.S. While it did not necessarily reach this goal, the resultant block-level summary files that are now available are one of the best sources of ground-truth about populations that we have to work with. Note that the Decennial Census questionnaire was not the same as the annual American Community Survey, so the data we have at the block level is mostly demographics. In this section, we’ll demonstrate how to read and visualize this data.

library(tigris)
library(sf)
library(censusapi)

Sys.setenv(CENSUS_KEY="c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")

smc_pop_2020 <-
  getCensus(
    name = "dec/pl",
    vintage = 2020,
    region = "block:*", 
    regionin = "state:06+county:081",
    vars = "P1_001N"
  ) %>% 
  transmute(
    block =
      paste0(state,county,tract,block),
    pop = P1_001N
  )

The above pipeline is the shortcut method when you only have 1 or a small number of variables to read, and you can directly rename the fields into understandable names, like pop = P1_001N. Below, I review how to use the “group” parameter within the vars argument to grab all the categories and subcategories of race contained within the “P1” grouping, and use the corresponding metadata file to grab legible field names.

dec_vars_2020 <-
  listCensusMetadata(
    name = "2020/dec/pl",
    type = "variables"
  )

smc_pop_race_2020 <-
  getCensus(
    name = "dec/pl",
    vintage = 2020,
    region = "block:*", 
    regionin = "state:06+county:081",
    vars = "group(P1)"
  ) %>% 
  mutate(
    block =
      paste0(state,county,tract,block)
  ) %>% 
  select(!c(GEO_ID,state,county,tract,NAME) & !ends_with(c("NA"))) %>% 
  pivot_longer(
    ends_with("N"),
    names_to = "name",
    values_to = "estimate"
  ) %>%
  left_join(
    dec_vars_2020 %>% 
      select(name, label)
  ) %>% 
  select(-name) %>% 
  separate(
    label,
    into = c(NA,NA,"category1","category2"),
    sep = "!!"
  )

Note the categorical structure of P1, extracted below from dec_vars_2020 (click on the “right” button to see the second column):

dec_vars_2020 %>% 
  filter(grepl("P1",name)) %>% 
  select(name, label) %>% 
  arrange(name)

First, note that the labels all start with a space followed by “!!”, which means that the separate() function creates a first column filled with just a space. We get rid of that and the next column filled with “Total:” using the first two NA parameters.

The next step is to decide which rows to keep, since there are usually double-counted rows to deal with in the getCensus group results. The data is structured such that there is first a subdivision “Population of one race” to individual races, and then there’s a subdivision “Population of two or more races”, which itself is further subdivided and provides a lot of possible unique combinations of races. Depending on your particular question, you may want to retain as much detail as possible, but if we are comfortable just using the category “Two or more races” alongside all the single-race options, then we can selectively filter in the following way:

smc_pop_race_2020 <- smc_pop_race_2020 %>% 
  mutate(
    race = case_when(
      category1 == "Population of two or more races:" & is.na(category2) ~ "Two or more races",
      category1 == "Population of two or more races:" ~ "",
      !is.na(category2) ~ category2,
      TRUE ~ ""
    )
  )

Here we are using case_when() for the first time. It’s like an ifelse() but used when you need to string multiple ones in a row, which is to say when you want to manually manage multiple relationships between some initial condition and a resulting assignment. Notice the unique structure of the case_when() arguments, using a ~ to relate the logical condition and the result, and the use of TRUE as the final catch-all for whatever rows are left undealt with.

If you inspect the dataframe at this stage, you can see how we have created values under race for rows we would like to retain, and all other rows have an empty string. So the final step involves removing those other rows. Note that I am “overwriting” the object smc_pop_race_2020 in these chunks, which is always fine to do, but possibly inconvenient while you are troubleshooting, since you lose intermediate steps and might have to run code upstream. These are also always opportunities to integrate different lines of code all into one larger pipeline, when you are ready to finalize your script.

smc_pop_race_2020 <- smc_pop_race_2020 %>% 
  filter(race != "") %>% 
  select(block, race, pop = estimate)

Note that in the select() function above, you can rename fields, but you can’t perform mutate() style changes within select(). This final dataframe is ideal for being able to filter down to population counts for specific racial groups, and you can still easily get to a total population per block with the help of group_by() and summarize(), which you should be able to confirm gets you the same result as if you had only pulled “P1_001N” earlier.

Now let’s bring in the updated 2020 block shapefiles and map population count for a particular community, North Fair Oaks in San Mateo County. tigris functions may not default to the particular year you want, so in this case we are specifying that we want “2020” blocks. Keep in mind that these are administrative boundaries that change regularly as infrastructure development happens, and sometimes for political reasons, as is particularly clear in the redistricting process which motivates the Decennial Census.

Note: as of writing, there was a bug in the blocks() function that caused 2020 shapefiles to not be correctly loaded. The sign of this issue is if smc_blocks_2020 below has the field GEOID10 instead of GEOID20. You can see this particular issue being raised and addressed at https://github.com/walkerke/tigris/issues/131, which itself is a good demonstration of the open-source community on GitHub. That being said, even though the package creator fixed the issue, the fix may not have propagated to the “CRAN” version, which is the version stored in the official R “clearinghouse” for vetted packages, which is what you can access via a standard install.packages() call. Keep in mind that anybody can create a package and put it on their personal GitHub, but only “popular” ones make it to CRAN. In this case, what you need is a version of tigris slightly more updated than what’s available on CRAN, so in cases like this or cases when you want to use a non-CRAN package and trust the GitHub author, you can typically use the devtools package (remember to install it with install.packages() if you don’t already have it) and run something like below:

library(devtools)
install_github('walkerke/tigris')
library(tigris)

This should overwrite your existing tigris package saved locally, like an update. Now, we should be able to accurately grab 2020 shapefiles. Of course, you can avoid this all by directly downloading the data from online and reading it in directly using st_read for shapefiles.

blocks_2020 <- blocks("CA", "San Mateo", year = 2020, progress_bar = F)

nfo_boundary <- places("CA", progress_bar = F) %>% 
  filter(NAME == "North Fair Oaks")

nfo_pop_2020 <- smc_pop_2020 %>% 
  left_join(blocks_2020 %>% select(block = GEOID20)) %>% 
  st_as_sf() %>% 
  st_centroid() %>% 
  .[nfo_boundary, ] %>% 
  st_drop_geometry() %>% 
  left_join(blocks_2020 %>% select(block = GEOID20)) %>% 
  st_as_sf()
mapview(nfo_pop_2020, zcol = "pop")