2.4 Microdata

The survey data we’ve been using through censusapi has been in the form of “bucket aggregations”, where the buckets are ranges like “67 to 69 years”. However, a subset of ACS results (1% of the population) is provided as anonymized, un-summarized responses in the Public Use Microdata Sample (PUMS). Generally, you’d need to download the large state CSVs directly from the Census FTP site, but recently a separate package, tidycensus, has added a more user-friendly set of functions to not only load PUMS data but also analyze it, including making use of the included weighting data to make extrapolations about the full population.

First, for the sake of completeness, I’ll demonstrate how you’d download the raw PUMS data. You could, of course, directly download a CSV from the FTP site yourself and read_csv() it into your Environment, as we did with PG&E data. However, in this case, the online zip file can be loaded directly using the technique we used on the CalEnviroScreen data: download to a temporary file, unzip the contents from there, then delete the temporary file. Again, this is not necessarily saving you much effort beyond manually downloading and unzipping the zip file and working with the contents that way, and again, the tidycensus functions will provide you with an even more controlled procedure. But this technique is useful to keep in mind generally for zip files if you don’t want to deal with the manual hassle of unzipping (with the PG&E website providing a unique challenge because of its contact form pop-up).

# Download the zip to a temporary file, read the CSV inside, then clean up
temp <- tempfile()
download.file("https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/csv_hca.zip", destfile = temp, mode = "wb")

pums_hca_2018_5yr <- read_csv(unzip(temp, "psam_h06.csv"))

unlink(temp)

This will take some time to load. As you can see, even before considering special concepts like replicate weights, PUMS is quite a large and unwieldy dataset, so it will be great to take advantage of package functions designed to streamline the processing of this data.

Now, following the tidycensus documentation, we’ll practice loading in PUMS data using get_pums(). Note that tidycensus generally is a nice alternative to censusapi for pulling in ACS data, since it provides data pre-manipulated into tidy form and can pre-join it to Census geographies. However, I personally still prefer censusapi because it provides access to the full Census API directory. That being said, PUMS loading is not available (yet) through censusapi, so it’s good to approach packages with the mindset of using the right tool for the right job.

Note that as of September 2020, you’ll need a version of tidycensus that isn’t yet incorporated into the official CRAN package directory, which is what you have access to using RStudio’s package installation options. Instead, you will need to download directly from the creator’s GitHub repo (which generally you should consider risky compared to packages vetted for CRAN, but Kyle Walker would be the kind of creator to trust, given his prominence in the R community). To do this for a package that is set up on GitHub in this way, you use a package called devtools and the following function:

install.packages("devtools")

devtools::install_github("walkerke/tidycensus")
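With tidycensus installed, here is a quick sketch of the tidy ACS workflow mentioned above, so you can see the contrast with censusapi. The variable code B19013_001 (median household income) is used purely as an illustration; you would confirm codes in the dictionary before relying on them.

```r
library(tidycensus)

# get_acs() returns one row per geography-variable pair, already in
# tidy form, with estimate and margin-of-error (moe) columns included.
ca_county_income <- get_acs(
  geography = "county",
  variables = "B19013_001",
  state = "CA",
  year = 2018,
  survey = "acs5"
)
```

Because each row pairs a geography with a variable, the result drops straight into dplyr pipelines without the reshaping steps we needed for censusapi output.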

Our first step is to set up a data dictionary, similar to acs_vars_2018_5yr, which we set up for working with ACS summary data. The underlying questionnaire behind both datasets is the same, but the structure is different (buckets vs. individual responses), and PUMS data will in certain cases give you significantly more detailed answer options. In fact, the dictionary, pums_variables, is built into the tidycensus package, so you simply have to filter it down to the specific dictionary you need. We need to specify the year, the ACS 1-yr or 5-yr version, and the level (person-level or household-level responses, which per the questionnaire end up being different questions). For this demonstration, we’ll use the latest available dataset, the 2018 5-yr version, and choose the household-level set (which is smaller). Note that census_api_key() is the way to set up your API key for the tidycensus package, which is a different setup from the one censusapi uses, but go ahead and input your same API key.

library(tidycensus)

census_api_key("c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")

pums_vars_2018 <- 
  pums_variables %>%
  filter(year == 2018, survey == "acs5")

pums_vars_2018_distinct_hh <- 
  pums_vars_2018 %>%
  distinct(var_code, var_label, data_type, level) %>% 
  filter(level == "housing")
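
Since the dictionary is just a data frame, you can also search it by keyword rather than scrolling through it manually. As a quick sketch, using str_detect() from stringr (loaded as part of the tidyverse):

```r
# Find household-level variables whose label mentions "income"
pums_vars_2018_distinct_hh %>% 
  filter(str_detect(tolower(var_label), "income"))
```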

You can View(pums_vars_2018_distinct_hh) and use this as a way to check for available data, in lieu of the online query tools. Now we will pull the actual data, which you can do at the state level. You need to specify the variables of interest at this point; for our demonstration, I’ll pick NP for household size, HHL for household language, HINCP for household income, and TEN for tenure. We’ll also need PUMA, which is the Public Use Microdata Area where the respondent is located (bigger than a census tract). This will take some time to load.

ca_pums <- get_pums(
  variables = c(
    "PUMA",
    "NP",
    "HHL",
    "HINCP",
    "TEN"
  ),
  state = "CA",
  year = 2018,
  survey = "acs5",
  recode = T
)

Notes:

  • recode = T provides label fields directly in the dataset; otherwise you’d need to join with your data dictionary.
  • Note that a few additional fields were automatically provided. SERIALNO gives a unique ID for each household; notice that each household appears in as many rows as its NP value, which is essentially household size. If you want to create statistics based on households, as our chosen fields are intended to be, we’ll need to remove duplicate SERIALNO rows.
  • WGTP is the number of households that the one sample record “represents”, in the sense that if you sum up WGTP, you should get the total number of households in California. The weights are designed to make the unique household characteristics representative of the “correct distribution” at the state level, but of course this can only be so accurate. You can also get eighty replicate weights created by the Census Bureau to simulate multiple samples and create more precise standard error estimates, but we’ll cover that in a future chapter.
  • Given the time it takes to load this, you may want to grab a large number of fields you might be interested in at one time, and then save that as an .rds file for easy retrieval later. This also helps prevent get_pums() from having to run every time you knit a document; it also produces a lot of unnecessary progress bar output in the knitted result which can’t be removed using the standard opts_chunk techniques.
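
As a sketch of that save-and-reload workflow (the file name here is just an example):

```r
# Save the expensive get_pums() result once...
saveRDS(ca_pums, "ca_pums.rds")

# ...then retrieve it instantly in later sessions or knits
ca_pums <- readRDS("ca_pums.rds")
```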

Next, let’s filter to just PUMAs in the Bay Area, using techniques from the last chapter. pumas() is the appropriate function from tigris.

ca_pumas <-
  pumas("CA", cb = T, progress_bar = F)

bay_pumas <-
  ca_pumas %>% 
  st_centroid() %>% 
  .[bay_counties, ] %>% # keep PUMAs whose centroid falls within a Bay Area county
  st_set_geometry(NULL) %>% 
  left_join(ca_pumas %>% select(GEOID10)) %>% # rejoin the full polygon geometries
  st_as_sf()

bay_pums <-
  ca_pums %>% 
  filter(PUMA %in% bay_pumas$PUMACE10)

To create summary statistics, we can use dplyr functions, but be sure to filter to just unique households and weight each household response with WGTP. Given the specific fields we have, let’s construct a custom measure: the percent of households in a PUMA that are non-English-speaking renter households with a household income of less than $100,000 per year.

bay_pums_example <-
  bay_pums %>% 
  filter(!duplicated(SERIALNO)) %>% 
  mutate(
    nonenglish_renter_lowinc = ifelse(
      (HHL_label != "English only") &
        (TEN_label == "Rented") &
        (HINCP < 100000),
      WGTP,
      0
    )
  ) %>% 
  group_by(PUMA) %>% 
  summarize(
    perc_nonenglish_renter_lowinc =
      sum(nonenglish_renter_lowinc, na.rm = T)/sum(WGTP, na.rm = T)*100
  ) %>% 
  left_join(
    bay_pumas %>% 
      select(PUMACE10),
    by = c("PUMA" = "PUMACE10")
  ) %>% 
  st_as_sf()

pums_pal <- colorNumeric(
  palette = "Oranges",
  domain = bay_pums_example$perc_nonenglish_renter_lowinc
)

leaflet() %>%
  addTiles() %>% 
  addPolygons(
    data = bay_pums_example,
    fillColor = ~pums_pal(perc_nonenglish_renter_lowinc),
    color = "white",
    opacity = 0.5,
    fillOpacity = 0.5,
    weight = 1,
    label = ~paste0(
      round(perc_nonenglish_renter_lowinc), 
      "% non-English-speaking renter households making less than $100K"
    ),
    highlightOptions = highlightOptions(
      weight = 2,
      opacity = 1
    )
  ) %>% 
  addLegend(
    data = bay_pums_example,
    pal = pums_pal,
    values = ~perc_nonenglish_renter_lowinc,
    title = "% non-English-speaking<br>renter households<br>making less than $100K"
  )