2.2 ACS data
censusapi
is a package that allows for direct pulling of data from the American Communities Survey (ACS) and other Census datasets. This is a really good overview (as you can see, also written as a .Rmd file), as well as this online textbook by the creator of tigris
. He also created tidycensus
which is another popular choice for pulling Census data, with the advantage of allowing you to load tigris
style shapefiles at the same time, but I tend to prefer censusapi
because it gives you access to more Census datasets, everything you can see here. If you’re interested in the most recent and granular ACS data, you’d search for the latest 5-year survey on that page, which at the time of writing is the 2015-2019 5-yr survey, with a full data dictionary viewable here. The 5-yr datasets sample from the full range of the 5 years, which allows for more geographic granularity (and lower margins of error), while the 1-yr datasets sample from only 1 year. Usually the tradeoff is recency vs. geographic granularity. You might actually not get the information you are looking for from the 1-yr datasets, even at the County level if there isn’t enough population, so unless it’s absolutely critical to have recent data, I generally default to the 5-yr datasets.
Recall in the setup chunk we had a Sys.setenv()
call. As you can see from the censusapi
tutorial, this is needed to set an API key to access the Census API. You should request your own here. I’m including my own in the chunk below.
It’s useful to use listCensusMetadata()
to read the contents of the 5-yr data dictionary, which will provide the ability to replace an illegible code like B01001_003E
with the understanding that that variable provides counts of males under 5 years old.
library(tidyverse)
library(sf)
library(tigris)
library(leaflet)
library(censusapi)
Sys.setenv(CENSUS_KEY="c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")
<-
acs_vars_2019_5yr listCensusMetadata(
name = "2019/acs/acs5",
type = "variables"
)
This can take a while to load but is regularly useful, so this is a perfect kind of variable to save using saveRDS()
for easy retrieval in the future. You can inspect this dataframe in the Source window in much the same way as you’d review the API page, or http://data.census.gov, to explore what variables exist in the ACS.
Next is a workflow using getCensus()
, the fundamental function in censusapi
, along with a pipeline of dplyr
manipulations that should mostly look familiar to you already, in order to generate a clean dataset of, say, sex by age for CBGs in San Mateo County.
<-
smc_sexbyage getCensus(
name = "acs/acs5",
vintage = 2019,
region = "block group:*",
regionin = "state:06+county:081",
vars = "group(B01001)"
%>%
) mutate(
cbg =
paste0(state,county,tract,block_group)
%>%
) select(!c(GEO_ID,state,county,tract,block_group,NAME) & !ends_with(c("EA","MA","M"))) %>%
pivot_longer(
ends_with("E"),
names_to = "variable",
values_to = "estimate"
%>%
) left_join(
%>%
acs_vars_2019_5yr select(name, label),
by = c("variable" = "name")
%>%
) select(-variable) %>%
separate(
label,into = c(NA,NA,"sex","age"),
sep = "!!"
%>%
) filter(!is.na(age))
Notes:
getCensus()
has lots of required parameters, best understood through reviewing the official documentation linked at the top of this section.- ‘06’ refers to California and
081
refers to San Mateo County; FIPS codes for states and counties can be found on their Wikipedia pages. - Knowledge of which variable codes to use comes from exploring the Census dataset and experience over time. Depending on what Census data you’re interested in, often there are multiple variable codes that you can use to patch a desired result together. Here we are working towards a percentage of elderly people in each CBG, and the “sex by age” dataset happens to be a pretty straightforward way to create any age group of interest, but you have to work with disaggregated sex data for some portion of the process, even if you’re not specifically interested in it.
- I use
ends_with()
withinselect()
for a more specialized search. The census data comes with “estimate” fields ending in “E”, “margins of error” fields ending in “M”, and “annotation” fields for either ending in “EA” or “MA” (which are usually empty”, so this operation gets rid of unused fields quickly. More specialized selection methods can be reviewed here. - We haven’t used
separate()
before, but it should be relatively straightforward. Theinto =
argument is your opportunity to create new fields that will hold the separated contents oflabel
, which holds a string like “Estimate!!Total!!Female!!65 to 74 years”, but since we have no use for the first two separated elements which are always “Estimate” and “Total”, I setNA
for the first two elements, then specify field namessex
andage
for the next two based on my conceptual understanding of the structure of this particular dataset. http://data.census.gov gives you the most intuitive view of how the various datasets are structured. - Census datasets tend to have double-counting in the sense that there will be a “Total” field followed by rows disaggregating the “Total” into “Subtotals”. If you were to directly add up all the counts, you may end up double counting multiple times. In this specific case, the simplest way to remove double-counting is to
filter(!is.na(age))
, whereage
holds the last information available in the label which represents the “lowest level of disaggregation”. Any row that has anNA
in this field would be a higher-level “Total” that would be a double-count of a field that doesn’t have anNA
. Note that!is.na()
is read as “is not empty”.
If we want to now get “percent elderly” for each CBG, we could do the following:
<-
smc_elderly %>%
smc_sexbyage mutate(
elderly =
ifelse(
%in% c(
age "65 and 66 years",
"67 to 69 years",
"70 to 74 years",
"75 to 79 years",
"80 to 84 years",
"85 years and over"
),
estimate,NA
)%>%
) group_by(cbg) %>%
summarize(
elderly = sum(elderly, na.rm = T),
total_pop = sum(estimate, na.rm = T)
%>%
) mutate(
percent_elderly = elderly/total_pop*100
%>%
) filter(!is.na(percent_elderly))
The only new technique here is ifelse()
which takes three arguments: a logical condition, which within a mutate()
will be assessed for each row; the result that should be given if the condition is true; and the result that should be given if the condition is false.
And to conclude, here’s a map of the result, using tigris
and sf
techniques from the previous chapter:
<- block_groups("CA", "San Mateo", cb = T, progress_bar = F) smc_blockgroups
<- colorNumeric(
elderly_pal palette = "Blues",
domain =
$percent_elderly
smc_elderly
)
leaflet() %>%
addProviderTiles(provider = providers$CartoDB.Positron) %>%
addPolygons(
data =
%>%
smc_elderly left_join(
%>% select(GEOID),
smc_blockgroups by = c("cbg" = "GEOID")
%>%
) st_as_sf(),
fillColor = ~elderly_pal (percent_elderly),
color = "white",
opacity = 0.5,
fillOpacity = 0.75,
weight = 1,
label = ~paste0(
round(percent_elderly),
"% over age 65"
),highlightOptions = highlightOptions(
weight = 2,
opacity = 1
)%>%
) addLegend(
data = smc_elderly,
pal = elderly_pal,
values = ~percent_elderly,
title = "% over 65"
)