3.2 Commute data

Given the particular impact of commute travel on our urban systems, the Census Bureau puts out special data products to understand “Longitudinal Employer-Household Dynamics”. The most granular dataset available is the LEHD Origin-Destination Employment Statistics (LODES). The data can be viewed through an online user interface called OnTheMap (which does not show the highest granularity of data available), and the Tech Doc provides an explanation of the downloaded data’s structure. While one can use the website interface to download the raw data, there is an R package called lehdr which can be used to download the data within RStudio.

devtools::install_github("jamgreen/lehdr")

library(lehdr)
library(tidyverse)
library(tigris)
library(sf)
library(leaflet)
library(censusapi)

Sys.setenv(CENSUS_KEY="c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")

The function grab_lodes() requires many arguments which are also necessary to understand when manually accessing the data online:

  • state: LODES can only be grabbed for a whole state at once. Given the size of California, this will mean that the function runs pretty slowly for us. I’d recommend saving the result as an .rds file for faster retrieval.
  • year: The most recent available year of data is 2019.
  • lodes_type: There are three datasets available through LODES. “od” is the origin-destination paired data which is generally the most interesting because it allows for routing. “rac” and “wac” provide more detailed data for just one location, either focusing on the employed residents in the location or the workers who work in the location, respectively.
  • job_type: LODES counts a few different categories of data. The difference between “JT00” (All Jobs) and “JT01” (Primary Jobs) is that the first may include side jobs. If one is analyzing regular commute patterns, then JT01 is likely the preferred selection. There are also options here to focus on private or federal jobs.
  • state_part: “main” filters to just O-D pairs within California, while “aux” can include origins from outside of the state.
  • agg_geo: The data is available all the way down at the census block level, but this dataset would take a very long time to download. It is worth trying it once (give at least an hour if not more), then saving the result on your machine. This also would involve running blocks() from tigris to join to geometries, which also takes a long time to load. In both cases, you might find it easier to download directly from online and unzip the results (see method after the grab_lodes code). Otherwise, selecting “bg” or “tract” at this stage is more manageable.
ca_od <- grab_lodes(
  state = "ca", 
  year = 2019, 
  lodes_type = "od", 
  job_type = "JT01",
  state_part = "main", 
  agg_geo = "tract"
)

If you download directly from online, you will see that the initial dataframe is slightly different from the result of grab_lodes. In order to get the online download in the correct format to work with the code in this section, you can apply the following script:

ca_od_read <- read_csv(".../ca_od_main_JT01_2019.csv.gz") # insert file path of the downloaded file on your local machine

ca_od <-
  ca_od_read %>%
  mutate(
    h_tract = substr(h_geocode,1,11),
    w_tract = substr(w_geocode,1,11)
  ) %>%
  dplyr::select(-createdate,-w_geocode,-h_geocode) %>%
  group_by(h_tract,w_tract) %>%
  summarise_at(vars(S000:SI03),sum) %>%
  mutate(
    state = "CA",
    year = "2019"
  )

The download from online initially has variables h_geocode and w_geocode, which are at the census block group level rather than the census tract level that we see from grab_lodes. To convert them to tracts, we add two new columns (mutate), where we create the h_tract and w_tract variables by parsing the CBG value to only include the characters in the 1-10 position (substr ), and also add a 0 at the beginning (paste0 ). We then deselect the variables that are irrelevant, group_by the origin-destination pairs and summarize the variables that start with “S…”. Finally, we add the state and year columns. At this point, the dataframe will match the result of grab_lodes.

The resultant dataframe has the same structure as described in the data dictionary. Besides the pair of geographies that make up the origin-destination pair, the dataframe includes 10 fields of counts of workers, starting with a total count (S000) and followed by three sets of triplets, breaking that worker down by three age bins, three wage bins, and three sector bins:

Three age groups:

  • Number of jobs of workers age 29 or younger
  • Number of jobs for workers age 30 to 54
  • Number of jobs for workers age 55 or older

Three wage tiers:

  • Number of jobs with earnings $1250/month or less
  • Number of jobs with earnings $1251/month to $3333/month
  • Number of jobs with earnings greater than $3333/month

Three broad sectors:

  • Number of jobs in Goods Producing industry sectors
  • Number of jobs in Trade, Transportation, and Utilities industry sectors
  • Number of jobs in All Other Services industry sectors

Generally, even though this information is available down at the block level, I would recommend using such O-D information only aggregated up to tract or higher, so as to reduce sensitivity to imputation errors, given that the data collection process would not have been 100% perfect. In the demonstration below, I’ll filter the origin tracts to those that make up East Palo Alto, and map total worker count in all other tracts (setting a minimum threshold of 5):

ca_tracts <- tracts("CA", cb = T, progress_bar = F)

epa_boundary <- places("CA", cb = T, progress_bar = F) %>% filter(NAME == "East Palo Alto")

epa_tracts <- 
  ca_tracts %>% 
  st_centroid() %>% 
  .[epa_boundary, ] %>% 
  st_set_geometry(NULL) %>% 
  left_join(ca_tracts %>% select(GEOID)) %>% 
  st_as_sf()
epa_tracts %>% 
  leaflet() %>% 
  addTiles() %>% 
  addPolygons()
epa_od <-
  ca_od %>% 
  filter(h_tract %in% epa_tracts$GEOID) %>% 
  left_join(
    ca_tracts, 
    by = c("w_tract" = "GEOID")
  ) %>% 
  st_as_sf() %>% 
  filter(S000 >= 5)
lodes_pal <- colorNumeric(
  palette = "YlGnBu",
  domain = epa_od$S000
)

leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(
    data = epa_od,
    fillColor = ~lodes_pal(S000),
    label = ~paste0(w_tract, ": ", S000),
    color = "black",
    weight = 0.5,
    fillOpacity = 0.25
  ) %>% 
  addPolygons(
    data = epa_boundary,
    fill = F,
    color = "red",
    weight = 2
  ) %>% 
  addLegend(
    data = epa_od,
    pal = lodes_pal,
    values = ~S000,
    title = "Where EPA residents<br>work, LODES 2019<br>(5 or more workers)"
  )

Note that some worker destinations are shown as being in Southern California. In theory this is possible, but it’s more likely that these are the headquarter locations of companies that have Bay Area offices as well, so one might choose to add a maximum distance threshold on the origin-destination pairs (in the next section we’ll demonstrate how to create travel routes). Otherwise, this LODES dataset provides a rich amount of information that can be manipulated to produce any number of insights, such as the % of residents who work in or outside of their home County, the change in this measure over time, and an estimate of commute vehicle miles traveled (incorporating ACS data to estimate the % of commute trips in each distance range that are made by vehicle).

As a last demonstration, I’ll use the “Workplace Area Characteristics” from LODES to more quickly arrive at job counts for a given location (available through the O-D dataset as well, but this will run more quickly because it doesn’t create O-D pairs). These job counts can then be compared with a count of “employed residents” from ACS to create a jobs-to-employed-residents (J/ER) ratio, which is a common indicator of a kind of fiscal sustainability for California cities.

bay_county_names <-
  c(
    "Alameda",
    "Contra Costa",
    "Marin",
    "Napa",
    "San Francisco",
    "San Mateo",
    "Santa Clara",
    "Solano",
    "Sonoma"
  )

bay_counties <-
  counties("CA", cb = T, progress_bar = F) %>%
  filter(NAME %in% bay_county_names)

bay_wac <-
  2010:2019 %>%
  map_dfr(function(year){
    grab_lodes(
      state = "ca",
      year = year,
      lodes_type = "wac",
      job_type = "JT01",
      segment = "S000",
      state_part = "main",
      agg_geo = "county"
    ) %>%
    filter(w_county %in% bay_counties$GEOID)
  })

bay_employedresidents <-
  2010:2019 %>%
  map_dfr(function(year){
    getCensus(
      name = "acs/acs1/subject",
      vintage = year,
      vars = c("S2301_C01_001E","S2301_C03_001E","S2301_C04_001E"),
      region = "county:001,013,041,055,075,081,085,095,097",
      regionin = "state:06"
    ) %>%
    transmute(
      Population16andOlder = S2301_C01_001E,
      PercEmployedResidents = S2301_C03_001E,
      EmployedResidents = PercEmployedResidents/100*Population16andOlder,
      year = year,
      county = paste0(state,county)
    )
  })

bay_jer <-
  bay_wac %>% 
  select(
    year,
    county = w_county,
    Jobs = C000
  ) %>% 
  left_join(
    bay_employedresidents,
    by = c("year", "county")
  ) %>% 
  mutate(
    jer = Jobs/EmployedResidents
  ) %>% 
  left_join(bay_counties, by = c("county" = "GEOID")) %>% 
  st_as_sf()
jer_pal <- colorNumeric(
  palette = "RdYlGn",
  domain = bay_jer %>% filter(year == 2019) %>% pull(jer)
)

leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(
    data = bay_jer %>% 
      filter(year == 2019),
    fillColor = ~jer_pal(jer),
    label = ~paste0(NAME, ": ", jer %>% round(2)),
    color = "black",
    weight = 0.5,
    fillOpacity = 0.25
  ) %>% 
  addLegend(
    data = bay_jer %>% 
      filter(year == 2019),
    pal = jer_pal,
    values = ~jer,
    title = "Jobs to Employed<br>Residents Ratio<br>LODES 2019"
  )

And here is a time series plot comparing the Bay Area counties over the years of available LODES data, 2010-2019:

plot <- bay_jer %>% 
  mutate(year = year %>% as.character()) %>% 
  ggplot() +
  geom_line(
    aes(
      x = year,
      y = jer,
      group = NAME,
      color = NAME
    )
  ) +
  labs(
    x = "Year",
    y = "Jobs to Employed Residents Ratio",
    color = "County"
  )

plot %>% 
  ggplotly() %>% 
  config(displayModeBar = F)