3.1 Microdata

The survey data we’ve been using through censusapi has been in the form of “bucket aggregations”, where the buckets are ranges like “67 to 69 years”. However, a subset of ACS results (1% of the population) is provided as anonymized, un-summarized responses in the Public Use Microdata Sample (PUMS). I recommend a more formal reading about PUMS before getting started. You can download the large state CSVs directly from the Census FTP site, but the Census API now also hosts the PUMS files, so we can build off of the techniques we already know.

But for the sake of completeness, I’ll demonstrate how you’d download the raw PUMS data. You could, of course, download a CSV from the FTP site yourself and read_csv() it into your Environment, as we did with PG&E data. In this case, though, the zip file online can be loaded directly using the third option we listed in Chapter 1.5: save the zip file to a temporary file, unzip from there, then delete the file. This doesn’t necessarily save you much effort over manually downloading and unzipping the zip file and then using read_csv(), but the technique is useful to keep in mind for zip files generally if you don’t want to deal with the manual hassle of unzipping. (Note that this method doesn’t work for the PG&E data because of a special pop-up web interface that requires you to fill out a form.)

library(censusapi)
library(tidyverse)
library(tigris)
library(sf)
library(leaflet)

Sys.setenv(CENSUS_KEY="c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")

# Create a temporary file path to hold the downloaded zip
temp <- tempfile()

download.file("https://www2.census.gov/programs-surveys/acs/data/pums/2019/1-Year/csv_hca.zip", destfile = temp)

# Extract just the household-level CSV from the zip and read it in
pums_hca_2019_1yr <- read_csv(unzip(temp, "psam_h06.csv"))

# Delete the temporary zip file
unlink(temp)

Pair temp <- tempfile() with unlink(temp) so the temporary zip file is removed when you’re done. This will take some time to load. As you can see, before even considering special concepts like replicate weights, PUMS is quite a large and unwieldy dataset, so it will be worthwhile to take advantage of package functions designed to process just the subsets of this data we need.

Now, we’ll practice loading PUMS data using getCensus(). At the time of writing, the availability of PUMS data on the Census API is new enough that censusapi doesn’t fully support it yet; specifically, listCensusMetadata() doesn’t work for grabbing the PUMS data dictionary, which is quite different from the ACS datasets. You can view the data dictionary using the standard Census API variables view, but since this doesn’t distinguish between questions asked at the individual vs. household level (which matters now that we’re looking at individual responses), I recommend also referring to the official codebook.
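
If you’d rather browse the data dictionary within R, the raw variable metadata can be pulled straight from the Census API as JSON. Here’s a minimal sketch, assuming the standard variables.json endpoint for the 2019 1-yr PUMS dataset, and using the jsonlite package for illustration:

library(jsonlite)

pums_variables <- fromJSON("https://api.census.gov/data/2019/acs/acs1/pums/variables.json")

# Each variable is an entry under $variables; for example, the label for
# household language:
pums_variables$variables$HHL$label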

Keep in mind that the underlying questionnaire for PUMS data is the same as the ACS, but the structure of PUMS data is different (buckets vs. individual responses), and PUMS data will in certain cases give you significantly more detailed answer options. PUMS data is also provided as 1-yr or 5-yr samples, representing 1% or 5% of the population, respectively. For this demonstration, we’ll use the latest available dataset, the 2019 1-yr version.

You need to specify the variables of interest at this point; for our demonstration, I’ll pick NP for household size, HHL for household language (with the meanings of the coded numeric values listed in the formal data dictionary), HINCP for household income, and TEN for tenure. There are also four “meta-variables” that are always useful to load:

  • SERIALNO gives a unique ID for each household; you’ll see as many rows with the same ID as the value of NP, the household size. If you want to create statistics at the household level, as our chosen fields are intended, you’ll need to remove duplicate SERIALNO rows.
  • SPORDER is the order in which individual household members are listed in the data. The only relevant number here is 1: when the Census Bureau identifies certain household characteristics like race, it usually defaults to the characteristic of the “head of household” (which isn’t necessarily representative of the whole household). So if you want to replicate this concept with the PUMS data, you’d use characteristics of records where SPORDER == 1, but with PUMS data you have the option of designing other kinds of variables.
  • PWGTP is the number of people that the one sample record “represents”, in the sense that if you sum up PWGTP, you should get the total number of people in the state you’ve selected. The weights are designed to make the unique individual characteristics representative of the “correct distribution” at the state level, but of course this can only be so accurate. You can also get eighty replicate weights created by the Census Bureau to simulate multiple samples and create more precise standard error estimates, which is outside the scope of this curriculum.
  • WGTP is the same concept as PWGTP, but weights for households (meaning you would want to be working with just one record per household before using these). Adding up WGTP should get you the total number of households in the state. We’ll sanity-check both of these sums right after pulling the data below.

pums_2019_1yr <- getCensus(
  name = "acs/acs1/pums",
  vintage = 2019,
  region = "public use microdata area:*", 
  regionin = "state:06",
  vars = c(
    "SERIALNO",
    "SPORDER",
    "PWGTP",
    "WGTP",
    "NP",
    "HHL",
    "HINCP",
    "TEN"
  )
)
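
As a quick sanity check on the weights described above, the sketch below sums PWGTP across all person records (which should approximate California’s total population) and WGTP across unique households (which should approximate the total household count). This assumes the pull above completed; note that the weights come back from the API as character fields, a detail we’ll deal with again later.

# Person weights: should sum to roughly California's total population
sum(as.numeric(pums_2019_1yr$PWGTP))

# Household weights, counted once per household: should sum to roughly
# the total number of California households
pums_2019_1yr %>% 
  filter(!duplicated(SERIALNO)) %>% 
  pull(WGTP) %>% 
  as.numeric() %>% 
  sum()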

Given the time it takes to load this, you may want to grab a large number of fields you might be interested in at one time, and then save that as an .rds file for easy retrieval later.
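
For example, a minimal sketch of that caching pattern (the file path here is just a placeholder):

saveRDS(pums_2019_1yr, "pums_2019_1yr.rds")

# In a later session, skip the API call entirely:
pums_2019_1yr <- readRDS("pums_2019_1yr.rds")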

Next, let’s filter to just PUMAs in the Bay Area, using techniques from the last chapter. Recall that many manipulation functions like filter() come from dplyr, which is one of the packages in the tidyverse.

pumas() is the appropriate function from tigris for the special “public use microdata area” geographies that PUMS data is provided at; PUMAs are bigger than census tracts.

ca_pumas <-
  pumas("CA", cb = T, progress_bar = F)

bay_county_names <-
  c(
    "Alameda",
    "Contra Costa",
    "Marin",
    "Napa",
    "San Francisco",
    "San Mateo",
    "Santa Clara",
    "Solano",
    "Sonoma"
  )

bay_counties <-
  counties("CA", cb = T, progress_bar = F) %>%
  filter(NAME %in% bay_county_names)

bay_pumas <-
  ca_pumas %>% 
  st_centroid() %>%                            # convert PUMA polygons to centroid points
  .[bay_counties, ] %>%                        # keep centroids that fall within a Bay Area county
  st_drop_geometry() %>%                       # drop the centroid geometry
  left_join(ca_pumas %>% select(GEOID10)) %>%  # re-attach the original polygon geometry by GEOID10
  st_as_sf()

bay_pums <-
  pums_2019_1yr %>% 
  mutate(
    PUMA = str_pad(public_use_microdata_area,5,"left","0")
  ) %>% 
  filter(PUMA %in% bay_pumas$PUMACE10)

Note: str_pad() is used to pad the field public_use_microdata_area with extra zeros on the left, because that is how the field PUMACE10 in bay_pumas stores its PUMA codes (as 5-character strings with leading zeros). This kind of mismatch between string and numeric versions of the same ID is common. We could have instead converted PUMACE10 into a numeric field using as.numeric(), but this is an opportunity to demonstrate the arguments you’d use in str_pad().
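
To see exactly what str_pad() is doing, here’s a quick standalone example with a hypothetical unpadded PUMA code:

str_pad("101", 5, "left", "0")
# "00101"

The arguments are, in order: the string to pad, the target width, the side to pad on, and the padding character.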

To create summary statistics, we can use dplyr functions, but be sure to filter to just unique households, and weight each household response with WGTP. Given the specific fields we have, let’s construct a unique measure of the percent of households in a PUMA that are non-English-speaking renter households with a household income of less than $100,000 per year.

bay_pums_example <-
  bay_pums %>% 
  filter(!duplicated(SERIALNO)) %>% 
  mutate(
    WGTP = as.numeric(WGTP),
    nonenglish_renter_lowinc = ifelse(
      (HHL != 1) &
        (TEN == 3) &
        (HINCP < 100000),
      WGTP,
      0
    )
  ) %>% 
  group_by(PUMA) %>% 
  summarize(
    perc_nonenglish_renter_lowinc =
      sum(nonenglish_renter_lowinc, na.rm = T)/sum(WGTP, na.rm = T)*100
  ) %>% 
  left_join(
    bay_pumas %>% 
      select(PUMACE10),
    by = c("PUMA" = "PUMACE10")
  ) %>% 
  st_as_sf()

Notes:

  • The filter !duplicated(SERIALNO) is the most straightforward way to de-duplicate by a specific field’s values. Recall that SERIALNO is unique to households, and the default order of the data downloaded has SPORDER starting from 1, so filtering out duplicates of SERIALNO has the effect of removing anybody who is not the “head of household”.
  • WGTP happens to be a character type field when pulled from the Census API. We mutate it to a numeric field so we can perform arithmetic on it.
  • According to the official codebook, a value of 1 in HHL means “English only”, so the logical condition HHL != 1 gets households whose household language is anything other than English. Similarly, a value of 3 in TEN means “Rented”.
  • The ifelse() statement causes the household to be “counted” (i.e., to contribute its full weight) if all three conditions are true, and not counted otherwise. That means summing up nonenglish_renter_lowinc gets you an estimate of all households with those characteristics, while summing up WGTP gets you all households, period. For example, if a PUMA had just two sampled households with weights 120 and 80, and only the first met all three conditions, the measure would be 120/(120+80)*100 = 60%.

pums_pal <- colorNumeric(
  palette = "Oranges",
  domain = bay_pums_example$perc_nonenglish_renter_lowinc
)

leaflet() %>%
  addTiles() %>% 
  addPolygons(
    data = bay_pums_example,
    fillColor = ~pums_pal(perc_nonenglish_renter_lowinc),
    color = "white",
    opacity = 0.5,
    fillOpacity = 0.5,
    weight = 1,
    label = ~paste0(
      round(perc_nonenglish_renter_lowinc), 
      "% non-English-speaking renter households making less than $100K"
    ),
    highlightOptions = highlightOptions(
      weight = 2,
      opacity = 1
    )
  ) %>% 
  addLegend(
    data = bay_pums_example,
    pal = pums_pal,
    values = ~perc_nonenglish_renter_lowinc,
    title = "% non-English-speaking<br>renter households<br>making less than $100K"
  )

Notice the difference in the size of PUMAs relative to other TIGER boundaries we’ve used. They are not as granular as we’d like, but still useful for neighborhood-level inquiry and problem-solving.