2.4 Spatial subsets

It is common for other non-Census datasets to not share the same kinds of geometries as Census data, but you might want to somehow join the two kinds of information together. For example, the Bay Area regional governments have used Priority Development Areas as a policy tool to encourage urban development (housing and jobs) in areas that are well serviced by transit. We can load the geometries for these PDAs the same way we’ve loaded data from the MTC Open Data website in Chapter 2.1 and see what they look like, in relation to blocks (showing just San Mateo County for illustration).

library(tigris)
library(sf)
library(leaflet)
library(tidyverse)

bay_pdas <- st_read("https://opendata.arcgis.com/datasets/4df9cb38d77346a289252ced4ffa0ca0_0.geojson")

smc_pdas <-
  bay_pdas %>% 
  filter(county == "San Mateo") %>% 
  st_transform(4269)

smc_pdas_blocks <- smc_blocks_2020[smc_pdas, ]
leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(
    data = smc_pdas,
    stroke = F,
    fillOpacity = 0.5
  ) %>% 
  addPolygons(
    data = smc_pdas_blocks,
    color = "red",
    weight = 0.75,
    fill = F
  )

Here we’re switching from the default map to a stylized map for clearer viewing. You can see what other basemaps are built into leaflet() here. There are ways to pull in other basemaps too, which we’ll explore later.

Zooming in on specific PDAs (blue), you can see that they don’t necessarily have any relationship to blocks (red). Let’s say you want to estimate the population within the PDAs. First off, using Census population data at all will at least involve the uncertainty associated with outdated and sample-based information (the 2020 Decennial data in this case is as good as it gets, but has its own uncertainties as well). In this section, though, we will focus on a different kind of uncertainty associated with the geometry boundaries not matching up.

Focusing on the PDAs in San Mateo County, let’s see what population counts we get using two different subsetting methods we’ve already encountered in the curriculum. Keep in mind that if you just need population data, pretty much every Census dataset you can pick has as its first _001 field a total population or household count, so you have many possible choices, but let’s go ahead and use P1_001N for total Decennial Census population for each block. We’ve previously already prepared this information as the field pop in our smc_blocks_2020 dataframe.

Our first method is simply to use the bracket-based spatial filtering we already did in the map above, take all those red-outlined blocks, and sum their population.

smc_pdas_blocks_1 <- smc_pdas_blocks %>% 
  select(block = GEOID20) %>% 
  left_join(smc_pop_2020)

sum(smc_pdas_blocks_1$pop)
## [1] 208919

The above method overcounts population because it includes any block that even so much as touches the PDA boundaries, as first introduced in Chapter 2.1. Now let’s try the method that subsets to only blocks with their centroid inside of the PDA boundaries, which we’ve generally used throughout the curriculum so far:

smc_pdas_blocks_2 <-
  smc_pdas_blocks_1 %>% 
  st_centroid() %>% 
  .[smc_pdas, ] %>% 
  st_drop_geometry() %>% 
  left_join(smc_pdas_blocks_1 %>% select(block)) %>% 
  st_as_sf()

sum(smc_pdas_blocks_2$pop)
## [1] 157300
leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(
    data = smc_pdas,
    stroke = F,
    fillOpacity = 0.5
  ) %>% 
  addPolygons(
    data = smc_pdas_blocks_2,
    color = "red",
    weight = 0.75,
    fill = F
  )

The population estimate has dropped by about 25%, and we’d expect that any further Census analysis of demographics would also be substantively different between the two subsetting approaches. The map seems to suggest that this approach is an underestimate in some places and an overestimate in others.

Let’s consider one other approach, where we cut the blocks as if the blocks were cookie dough and the PDAs were the cookie cutters. In this case, one could calculate the area of a given block and then the area of the portion that is left over after being cut using a sf package function, st_area(). We’ll need to ensure that the coordinate system used is a planar coordinate system, so that the result of st_area() is in cartesian units (EPSG:26910, a common planar coordinate system for this part of the U.S., will do the trick). The following adds an area field to smc_pdas_blocks in square meters:

smc_pdas_blocks_area <-
  smc_pdas_blocks %>% 
  st_transform(26910) %>% 
  mutate(area = st_area(.))

Note that a period (or geometry) needs to be added in st_area() when used within a mutate(). Now, we’ll demonstrate st_intersection() which is the “cookie cutter” procedure.

smc_pdas_blocks_intersection <-
  smc_pdas_blocks_area %>% 
  st_intersection(
    smc_pdas %>% 
      st_transform(26910)
  )
leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(
    data = smc_pdas,
    stroke = F,
    fillOpacity = 0.5
  ) %>% 
  addPolygons(
    data = smc_pdas_blocks_intersection %>% 
      st_transform(4269),
    color = "red",
    weight = 0.75,
    fill = F
  )

This example shows that st_intersection() needs to receive the “cookie dough” in the pipeline, and its own argument should be the “cookie cutter”, and the result will be just the portions of cookie dough left within the cookie cutter. Now, st_area() can be applied again to get the “leftover area”, which can then be divided by the original area to get a ratio of “% area”. What does it mean to then apply this ratio to the original population? Is it akin to “leftover population” inside the cookie cutter?

smc_pdas_blocks_3 <-
  smc_pdas_blocks %>% 
  select(block = GEOID20) %>% 
  left_join(smc_pop_2020) %>% 
  st_transform(26910) %>% 
  mutate(original_area = st_area(.)) %>% 
  st_intersection(
    smc_pdas %>% 
      st_transform(26910)
  ) %>% 
  mutate(
    leftover_area = st_area(.),
    perc_area = leftover_area / original_area,
    pop = pop * perc_area
  )
sum(smc_pdas_blocks_3$pop) %>% round()
## 157399 [1]

The critical assumption implied in this approach is that population is evenly distributed within the original geometry. This of course is not true, but if you have reason to believe that this assumption is reasonable enough for the purposes of creating a better population estimate than the previous two approaches we’ve tried, then you may state that your population subsetting is based on the “proportional area” of blocks within the PDAs. You can also apply “dough” that’s bigger than blocks, like CBGs, but the higher granularity your dough, the more fine-tuned your result.

The result turns out to actually be pretty close to the second estimate we had, which speaks to the general usefulness of the centroid method. But keep in mind that the proportional area method materially affects the assortment (and proportion) of blocks that are considered part of the PDA, which will materially affect the further demographic analyses that occur. It’s relatively easy to keep holding on to this “proportional area adjustment factor” alongside the block IDs for the entirety of the analysis, waiting until the very end to multiply population counts by it.