2.3 Census-to-parcel disaggregation

As soon as we start working with parcel-level data, as we have started to do in this chapter, it may become useful to estimate population characteristics for specific parcels, understanding that no matter what, our efforts will still be at best an educated guess. We could use spatial subsetting, where we assume that a population is homogeneously distributed across a CBG and then cut proportional areas out of that CBG. But if we have some intelligence contained in parcel information, then we could make better guesses than the homogeneity assumption. We’ll explore some possibilities in this section.

First, let’s focus on just one CBG in San Francisco, where we also happen to have easy access to Assessor-Recorder data. But again, be prepared for lots of data wrangling when working with messy parcel data.

library(tidyverse)
library(readxl)
library(tigris)
library(censusapi)
library(sf)
library(leaflet)

Sys.setenv(CENSUS_KEY="c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")

acs_vars_2019_5yr <-
  listCensusMetadata(
    name = "2019/acs/acs5",
    type = "variables"
  )

sf_cbgs <- block_groups("CA", "San Francisco", cb = T, progress_bar = F)

potrero_cbg <- 
  sf_cbgs %>% 
  filter(GEOID == "060750614001")

leaflet() %>% 
  addProviderTiles(provider = providers$CartoDB.Positron) %>% 
  addPolygons(
    data = potrero_cbg,
    fill = F
  )

This census block group is recognizeable by those who’ve driven on 101-N as the one that includes the floating van on a pole on the right side of the freeway, or by those who’ve taken the 48 Muni bus to/from the Caltrain 22nd St. Station and the Mission. It notably has a clear distinction between large parcels of public housing for renters and smaller lots of townhomes, condos, and other apartments. Because of this clear heterogeneity of building types, with clear distinctions available in the parcel data, we can probably do better than using the homogeneity assumption when allocating populations to sub-CBG levels.

Similar to the last section, let’s load in various datasets available online and link them together. We won’t need zoning information for this analysis, and we’ll select 2019 Assessor-Recorder data to better match with 2015-2019 ACS data.

sf_parcels_shape <- 
  st_read("https://data.sfgov.org/api/geospatial/acdm-wktn?method=export&format=GeoJSON") %>% 
  filter(active == "true") %>% 
  select(
    apn = blklot
  )

temp <- tempfile()
download.file("https://sfassessor.org/sites/default/files/uploaded/2020.7.10_SF_ASR_Secured_Roll_Data_2018-2019.xlsx",destfile = temp, mode = "wb")

sf_secured_19 <- read_excel(temp, sheet = "Roll Data 2018-2019")
datakey_19 <- read_excel(temp, sheet = "Data Key")
usecode_19 <- read_excel(temp, sheet = "Class Code Only")

unlink(temp)

potrero_parcels_shape <-
  sf_parcels_shape %>% 
  st_transform(4269) %>% 
  st_centroid() %>% 
  .[potrero_cbg, ] %>% 
  st_set_geometry(NULL) %>% 
  left_join(sf_parcels_shape %>% select(apn)) %>% 
  st_as_sf()

leaflet() %>% 
  addProviderTiles(provider = providers$CartoDB.Positron) %>% 
  addPolygons(
    data = potrero_cbg,
    fill = F
  ) %>% 
  addPolygons(
    data = potrero_parcels_shape,
    fillColor = "blue",
    color = "black",
    weight = 0.5,
    label = ~apn
  )

potrero_parcels <-
  potrero_parcels_shape %>% 
  left_join(
    sf_secured_19 %>% 
      mutate(
        apn = RP1PRCLID %>% 
          str_replace(" ","")
      )
  )

table(potrero_parcels$RP1CLACDE) %>% 
  as.data.frame() %>% 
  left_join(usecode_19, by = c("Var1" = "CODE"))

A few observations:

We will filter out non-residential parcels based on the use codes we can see in the table above.
Condos are once again separated out into copies of the same land parcel geometry, as is visually clear from the more opaque parts of the map, and from the large count of “Z” use codes.
We can use the homeowner’s property tax exemption, which is generally $7,000 off your annual taxable property value if you occupy your own property, to identify owner occupied units. There is a field RP1EXMVL1 that includes this 7000 number that can be used to make this determination, but some units paid 5600 instead, which is because they were late to file. So it’s cleaner to go off of the “11” code in EXEMPTYPE.
By the way, there is a residential unit, 1345 Kansas, that appears to have EXEMPTYPE “06” for “Religious” which results in $277,405 off their taxable value. On Google Maps, this looks like an apartment but is clearly also “House of God in San Francisco”. Without any other information, we’ll go ahead and remove this APN manually from our analysis. This is the kind of peculiar finding you can expect once you’re down at this granularity of analysis.
Within potrero_parcels, similar to what we’ve seen before, a few condos seem to have 0 units, which we will assume are incorrectly labeled and should be at least 1 unit, given that these records also contain bedrooms, and some have homeowner exemptions.

Before cleaning up the parcel data, let’s also bring in some ACS data that can be used to meaningfully disaggregate population data by parcel-level characteristics. Generally, the most categorically useful fields in our parcel data are tenure (owner-occupied vs. renter-occupied, using the homeowner exemption field) and the number of units in the structure (where we’ll base this off of group_by() on the land geometry). You could also explore using year built, date of last sale, and possibly sale price or land value as a proxy of socioeconomic characteristics. Given these available parcel-level characteristics, then you can search for corresponding ACS tables that give you cross-sections of individual and household counts by one or more of these parcel-level characteristics. For example, here we’ll use B25033 which is “TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY UNITS IN STRUCTURE”. This won’t give us any other information other than population count, but will help us allocate population based on the combination of tenure and units in structure, which is how we can then choose to structure our parcel data to enable that join.

sf_pop_tenure_units <-
  getCensus(
    name = "acs/acs5",
    vintage = 2019,
    region = "block group:*", 
    regionin = "state:06+county:075",
    vars = "group(B25033)"
  ) %>% 
  mutate(
    cbg =
      paste0(state,county,tract,block_group)
  ) %>% 
  select(!c(GEO_ID,state,county,tract,block_group,NAME) & !ends_with(c("EA","MA","M"))) %>%
  pivot_longer(
    ends_with("E"),
    names_to = "variable",
    values_to = "estimate"
  ) %>%
  left_join(
    acs_vars_2019_5yr %>% 
      select(name, label), 
    by = c("variable" = "name")
  ) %>% 
  select(-variable) %>% 
  separate(
    label,
    into = c(NA,NA,"tenure","units_in_structure"),
    sep = "!!"
  ) %>% 
  filter(
    cbg == potrero_cbg$GEOID,
    !is.na(units_in_structure),
    estimate > 0
  )

sf_pop_tenure_units

Note that 17 mobile home residents appear to be recorded in this CBG in the dataset, and we won’t be able to attach these 17 residents to APNs.

potrero_parcels_res <-
  potrero_parcels %>% 
  filter(!is.na(RP1CLACDE)) %>% 
  filter(!RP1CLACDE %in% c("E","I","IZ","P","VA1","VCI","VPU","VR","X")) %>% 
  filter(apn != "4260020") %>% 
  mutate(
    UNITS = pmax(UNITS, 1),
    tenure = ifelse(
      is.na(EXEMPTYPE),
      "Renter occupied:",
      "Owner occupied:"
    )
  ) %>% 
  as.data.frame() %>% 
  mutate(geometry = geometry %>% st_as_text()) %>% 
  group_by(geometry) %>% 
  mutate(
    units_in_structure = sum(UNITS, na.rm = T),
    units_in_structure = case_when(
      units_in_structure == 1 ~ "1, detached or attached",
      units_in_structure < 5 ~ "2 to 4",
      TRUE ~ "5 or more"
    )
  ) %>% 
  group_by(tenure, units_in_structure) %>% 
  mutate(
    units_in_bin = sum(UNITS, na.rm = T)
  ) %>% 
  ungroup() %>% 
  select(-geometry) %>% 
  left_join(potrero_parcels %>% select(apn)) %>% 
  st_as_sf()

potrero_parcels_res %>% 
  st_drop_geometry() %>%  
  select(apn, tenure, units_in_structure, UNITS, units_in_bin) %>% 
  head()

sum(potrero_parcels_res$UNITS)

## [1] 1030

Notes:

For tenure and units_in_structure, we’re explicitly choosing to match the labels with our ACS data to enable the left_join().
We are doing group_by() twice. The second group_by() overwrites the first one.
We aren’t using summarize() because in this case we don’t have a reason to combine rows in our dataframe. Instead, we are adding the “summarized” values to each and every APN record, which will be useful for the math in the next step. In this situation, you can group_by() and then mutate() the same kinds of arguments as you would put in summarize().
We group_by(geometry) first so we can calculate the total number of units in each “structure”. This will help us know if a owner-occupied or renter-occupied unit would have selected “1, detached or attached”, “2 to 4”, or “5 or more” for “total units in structure” in the ACS questionnaire.
Then we group_by(geometry, tenure) to calculate the total number of units in the “bins” of tenure and units in structure, which match up with the “bins” we get from the ACS. We need to attach units_in_bin to each row so that we know what fraction of the total units are represented in each record. Then, we can use this same percentage to allocate some proportion of the ACS population for that bin into the APN. In a sense, it’s at this “bin” scale that we are still using a kind of homogeneity assumption, absent other information.
A mutate() doesn’t remove the grouping; a summarize() only removes the last grouping variable, so if you had grouped with more than one, there will still be some grouping applied to the dataframe. So I’d recommend always using ungroup() after your intended use, to deselect any and all groupings such that they do not mess up subsequent steps.

The table output shows how the structure of this dataframe now matches the ACS table and is ready for the final join. We also can see that there are 1,030 units total; according to the ACS there are ~2,600 residents we’ll need to allocate into these 1,030 units. That allocation will happen within the next command:

potrero_parcels_census <-
  potrero_parcels_res %>% 
  left_join(
    sf_pop_tenure_units %>% select(-cbg),
    by = c("tenure", "units_in_structure")
  ) %>% 
  mutate(
    pop = UNITS/units_in_bin*estimate
  )

potrero_parcels_census %>% 
  st_set_geometry(NULL) %>% 
  select(apn, tenure, pop) %>% 
  head()

The final estimate of population takes into account tenure, units in structure, and number of units in the APN. Note that the populations are fractions, which are a reminder that these, of course, are estimates using blunt math, but you can hold on to these fractions through other analyses and then round off your final answers, such as counts of renters vs. owners within accessibility bands of amenities, as we’ll do in a later chapter.

leaflet() %>% 
  addProviderTiles(provider = providers$CartoDB.Positron) %>% 
  addPolygons(
    data = potrero_cbg,
    fill = F
  ) %>% 
  addPolygons(
    data = potrero_parcels_census,
    fillColor = ~ifelse(tenure == "Owner occupied:", "red", "blue"),
    color = ~ifelse(tenure == "Owner occupied:", "red", "blue"),
    weight = 0.5,
    fillOpacity = ~pop/100
  )

In this map, we’ve assigned red colors to owner-occupied APNs and blue colors to renter-occupied APNs, and also allowed opacities to stack on each other so that the more opaque parcels have more population. This just allows us to, at a quick glance, see a much more heterogeneous distribution of two sub-populations, compared to an assumption of homogeneity.

Note that with decennial census data, you could do a similar census-to-parcel disaggregation, but allocating race or ethnicity at the block level. Combined with parcel-level tenure via the homeowner exemption data, you could produce a more recent estimate of tenure by race/ethnicity. Or, you could combine ACS 5-yr data and the most recent decennial census data to construct many different parcel-level disaggregations.