2.5 Spatial subsets

It is common for other non-Census datasets to not share the same kinds of geometries as Census data, but you might want to somehow join the two kinds of information together. For example, the Bay Area regional governments have used Priority Development Areas as a policy tool to encourage urban development (housing and jobs) in areas that are well serviced by transit. We can load the geometries for these PDAs the same way we’ve loaded data from the MTC Open Data website before and see what they look like, in relation to CBGs (showing just San Mateo County for illustration).

library(tigris)
library(sf)
library(leaflet)
library(tidyverse)

bay_pdas <- st_read("https://opendata.arcgis.com/datasets/36dd7a36576f42d4a3d6b0708e3982f4_0.geojson")
## Reading layer `priority_development_areas_current' from data source `https://opendata.arcgis.com/datasets/36dd7a36576f42d4a3d6b0708e3982f4_0.geojson' using driver `GeoJSON'
## Simple feature collection with 218 features and 8 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.0216 ymin: 36.98737 xmax: -121.5564 ymax: 38.81092
## geographic CRS: WGS 84
smc_blockgroups <- 
  block_groups("CA", "San Mateo", cb = T, progress_bar = F) %>% 
  st_transform(st_crs(bay_pdas))

leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(
    data = bay_pdas,
    stroke = F,
    fillOpacity = 0.5
  ) %>% 
  addPolygons(
    data = smc_blockgroups,
    color = "red",
    weight = 0.75,
    fill = F
  )

Zooming in on specific PDAs, you can see that they don’t necessarily have any relationship to CBGs. Let’s say you want to estimate the population within the PDAs. First off, using Census population data at all will at least involve the uncertainty associated with outdated and sample-based information (at least until the 2020 Decennial data is available at the block level, but that has its own uncertainties as well). In this section, though, we will focus on a different kind of uncertainty associated with the geometry boundaries not matching up.

Focusing on the PDAs in San Mateo County, let’s see what population counts we get using two different subsetting methods we’ve already encountered in the curriculum. Keep in mind that if you just need population, data, pretty much every ACS dataset you can pick has as its first _001E field a total population or household count, so have many possible choices, but let’s go ahead and use B01001, which we previously used to get breakdowns of “SEX BY AGE”, but here we’ll just use it to get full population counts for each CBG, which means we can skip some of the dplyr manipulations and grab just the one field of data we need (we’ve done this once before in section 2.2).

smc_pdas <-
  bay_pdas %>% 
  filter(county == "San Mateo")

smc_cbg_pop <-
  getCensus(
    name = "acs/acs5",
    vintage = 2018,
    region = "block group:*", 
    regionin = "state:06+county:081",
    vars = "B01001_001E"
  ) %>%
  transmute(
    cbg =
      paste0(state,county,tract,block_group),
    pop = B01001_001E
  ) %>% 
  left_join(
    smc_blockgroups %>% 
      select(GEOID), 
    by = c("cbg" = "GEOID")
  ) %>% 
  st_as_sf()

smc_pda_pop1 <-
  smc_cbg_pop[smc_pdas, ]

sum(smc_pda_pop1$pop)
## [1] 320563
leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(
    data = smc_pdas,
    stroke = F,
    fillOpacity = 0.5
  ) %>% 
  addPolygons(
    data = smc_pda_pop1,
    color = "red",
    weight = 0.75,
    fill = F
  )

The above method overcounts population because it includes any CBG that even so much as touches the PDA boundaries. Now let’s try the method that subsets to only CBGs with their centroid inside of the PDA boundaries, which we’ve generally used throughout the curriculum so far:

smc_pda_pop2 <-
  smc_cbg_pop %>% 
  st_centroid() %>% 
  .[smc_pdas, ] %>% 
  st_set_geometry(NULL) %>% 
  left_join(smc_cbg_pop %>% select(cbg)) %>% 
  st_as_sf()

sum(smc_pda_pop2$pop)
## [1] 145635
leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(
    data = smc_pdas,
    stroke = F,
    fillOpacity = 0.5
  ) %>% 
  addPolygons(
    data = smc_pda_pop2,
    color = "red",
    weight = 0.75,
    fill = F
  )

The population estimate has dropped by more than half, and we’d expect that any further Census analysis of demographics would also be substantively different between the two subsetting approaches. The map seems to suggest that this approach is an underestimate in some places and an overestimate in others.

Let’s consider one other approach, where we cut the CBGs as if the CBGs were cookie dough and the PDAs were the cookie cutters. In this case, one could calculate the area of a given CBG and then the area of the portion that is left over after being cut using a sf package function, st_area(). We’ll need to ensure that the coordinate system used is a planar coordinate system, so that the result of st_area() is in cartesian units (EPSG:26910, a common planar coordinate system for this part of the U.S., will do the trick). The following adds an area field to smc_cbg_pop in square meters:

smc_cbg_pop_area <-
  smc_cbg_pop %>% 
  st_transform(26910) %>% 
  mutate(area = st_area(.))

Note that a period (or geometry) needs to be added in st_area() when used within a mutate(). Now, we’ll demonstrate st_intersection() which is the “cookie cutter” procedure.

smc_cbg_pop_intersection <-
  smc_cbg_pop_area %>% 
  st_intersection(
    smc_pdas %>% 
      st_transform(26910)
  )

leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(
    data = smc_pdas,
    stroke = F,
    fillOpacity = 0.5
  ) %>% 
  addPolygons(
    data = smc_cbg_pop_intersection %>% 
      st_transform(4326),
    color = "red",
    weight = 0.75,
    fill = F
  )

This example shows that st_intersection() needs to receive the “cookie dough” in the pipeline, and its own argument should be the “cookie cutter”, and the result will be just the portions of cookie dough left within the cookie cutter. Now, st_area() can be applied again to get the “leftover area”, which can then be divided by the original area to get a ratio of “% area”. What does it mean to then apply this ratio to the original population? Is it akin to “leftover population” inside the cookie cutter?

smc_pda_pop3 <-
  smc_cbg_pop %>% 
  st_transform(26910) %>% 
  mutate(original_area = st_area(.)) %>% 
  st_intersection(
    smc_pdas %>% 
      st_transform(26910)
  ) %>% 
  mutate(
    leftover_area = st_area(.),
    perc_area = leftover_area / original_area,
    pop = pop * perc_area
  )

sum(smc_pda_pop3$pop) %>% round()
## 142732 [1]

The critical assumption implied in this approach is that population is evenly distributed within the original geometry. This of course is not true, but if you have reason to believe that this assumption is reasonable enough for the purposes of creating a better population estimate than the previous two approaches we’ve tried, then you may state that your population subsetting is based on the “proportional area” of CBGs within the PDAs.

The result turns out to actually be pretty close to the second estimate we had, which speaks to the general usefulness of the centroid method. But keep in mind that the proportional area method materially affects the assortment (and proportion) of CBGs that are considered part of the PDA, which will materially affect the further demographic analyses that occur. It’s relatively easy to keep holding on to this “proportional area adjustment factor” alongside the CBG IDs for the entirety of the analysis, waiting until the very end to multiply population counts by it.