4.1 Vehicle emissions

We have already introduced commute data in Section 3.2 and routing techniques in Section 3.3. To estimate vehicle emissions (say, as part of a GHG inventory), we’ll just need to add two additional steps. First, we’ll use travel survey data to estimate trip modes and convert trip counts to vehicle counts. Second, we’ll use the emissions models to estimate vehicle and fuel types and associated emissions rates.

One issue with typical GHG inventories is that they only measure transportation emissions related to commute behavior, but as we saw in Section 3.1, commute trips alone do not comprise a majority of trips or trip miles in the Bay Area. The non-work trips people make every day are just as important to account for in a comprehensive emissions inventory. So, instead of using LODES data, which just focuses on commutes, we will use SafeGraph data, which includes trips broadly to places of interest (POIs). The free month of Neighborhood Patterns data online (September 2021) provides data about every CBG in the U.S. for that one month, with the data schema shown here. It is designed to capture visits in general to destination CBGs, so it should include both work commuters and visitors. This neighborhood_patterns.gz file will be a good test of the processing limits of your computer. Save it onto your computer and try loading it with the appropriate file path.

library(tigris)
library(tidyverse)
library(censusapi)
library(sf)
library(leaflet)
library(mapboxapi)
library(jsonlite)

acs_vars_2019_5yr <-
  listCensusMetadata(
    name = "2019/acs/acs5",
    type = "variables"
  )

sept21_patterns <- read_csv("neighborhood_patterns.csv.gz") # adjust your filepath as necessary

When this successfully loads, then we can reduce our focus area significantly to just the destination CBGs that comprise Stanford campus, and then immediately work with just that file, removing the large raw file. Recall that with large files, it is quicker to view in the Source window using View(head(sept21_patterns)).

ca_cbgs <- block_groups("CA", cb = T, progress_bar = F)

stanford_boundary <- places("CA", cb = T, progress_bar = F) %>% 
  filter(NAME == "Stanford")

stanford_cbgs <- 
  ca_cbgs %>% 
  st_centroid() %>% 
  .[stanford_boundary, ] %>% 
  st_drop_geometry() %>% 
  left_join(ca_cbgs %>% select(GEOID)) %>% 
  st_as_sf()
leaflet() %>% 
  addMapboxTiles(
    style_id = "streets-v11",
    username = "mapbox"
  ) %>% 
  addPolygons(
    data = stanford_cbgs,
    fill = F
  )

This selection could be further refined, if desired, to add the hospital and shopping center and remove some other peripheral residential areas.

stanford_patterns <-
  sept21_patterns %>% 
  filter(area %in% stanford_cbgs$GEOID)

rm(sept21_patterns)

While there are many interesting variables in this Neighborhood Patterns product, the one we’ll focus on is device_home_areas, which gives, in JSON string format, counts of trips made by devices whose home locations are in other CBGs. (A complete GHG inventory would go even further and account for trips made by those living in Stanford CBGs traveling elsewhere, but since this data product is organized based on destination, that origin-based view will be outside of the scope of this demonstration.) Let’s grab the 7 JSON strings we have and convert them to a dataframe, with the knowledge that the destination in each case is “Stanford”. To do this, we’ll need a new package, jsonlite, to convert the JSON strings to dataframes.

json <-
  stanford_patterns$device_home_areas[1] %>% 
  fromJSON() %>%
  unlist() %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  rename(
    origin_cbg = "rowname",
    device_count = "."
  )

head(json)

You are encouraged to inspect the results at each step of the pipeline. The immediate output isn’t exactly what you need, since a JSON isn’t necessarily perfectly translatable to a dataframe, so we need to unlist() and then convert to a dataframe, and further clean up the result from there.

Also note that, according to the data documentation, device counts of 2-4 are masked and displayed only as “4” to provide further privacy (instances of “1” aren’t shown at all). For our purposes, we will manually convert values of “4” to “3” to more accurately reflect the average of options between 2 and 4.

Now we’ll loop through all seven rows and combine the JSON results together, using the map_dfr() technique.

origin_cbgs <-
  1:nrow(stanford_patterns) %>% 
  map_dfr(function(row){
    
    json <-
      stanford_patterns$device_home_areas[row] %>% 
      fromJSON() %>%
      unlist() %>% 
      as.data.frame() %>% 
      rownames_to_column() %>% 
      rename(
        origin_cbg = "rowname",
        device_count = "."
      ) %>% 
      mutate(
        device_count = ifelse(
          device_count == 4,
          3,
          device_count
        )
      )
    
  })

Since origin_cbgs is the combined result for 7 destination CBGs that comprise Stanford campus, and a stop is counted as long as a device was stopped for 1 minute, we might assume that any duplicate CBGs in our dataframe should be merged together, because they are likely to have been the same “trips” that happened to include coverage of more than one Stanford CBG.

origin_cbgs_merged <-
  origin_cbgs %>% 
  group_by(origin_cbg) %>% 
  summarize(
    device_count = max(device_count)
  )

Now we will make a few refinements to turn these into true visit estimates, given that SafeGraph is just a sample of devices. To normalize, we can compare these device counts to the total number of devices SafeGraph monitors in an origin CBG, which is the function of the separate dataset, home_panel_summary. The ratio of census population to number_devices_residing is the “scale factor” to extrapolate the device_count to estimates of total visits from a CBG to Stanford that month.

home_panel_summary <- read_csv("home_panel_summary.csv") # adjust your filepath as necessary

We’ll need to grab population counts for CBGs, as we’ve done before. In this case, the trips to Stanford may include origins from all around the country, some of which may have involved air travel. For our purposes, we’ll focus on reasonable car trips, and therefore will only grab initial population numbers for CBGs in California.

ca_cbgs_pop <-
  counties("CA", cb = T, progress_bar = F) %>%
  pull(COUNTYFP) %>% 
  map_dfr(function(x){
    getCensus(
      name = "acs/acs5",
      vintage = 2019,
      region = "block group:*",
      regionin = paste0("state:06+county:", x),
      vars = "B01001_001E"
    )
  }) %>% 
  transmute(
    census_block_group =
      paste0(state,county,tract,block_group),
    pop = B01001_001E
  )
origin_cbgs_pop <-
  origin_cbgs_merged %>% 
  left_join(
    ca_cbgs_pop,
    by = c("origin_cbg" = "census_block_group")
  )
sum(is.na(origin_cbgs_pop$pop))
## [1] 294

It looks like there are 294 origins that were outside of CA that we won’t consider in this demonstration.

origin_cbgs_normalized <-
  origin_cbgs_pop %>% 
  filter(!is.na(pop)) %>% 
  left_join(
    home_panel_summary %>% 
      select(origin_cbg = census_block_group, number_devices_residing)
  ) %>% 
  mutate(
    visits = (device_count * pop / number_devices_residing) %>% round()
  ) %>% 
  left_join(ca_cbgs %>% select(origin_cbg = GEOID)) %>% 
  st_as_sf()
visits_pal <- colorNumeric(
  palette = "Reds",
  domain = origin_cbgs_normalized %>% 
    arrange(desc(visits)) %>% 
    pull(visits) %>% 
    .[-c(1:6)]
)

leaflet() %>% 
  addMapboxTiles(
    style_id = "light-v9",
    username = "mapbox"
  ) %>% 
  addPolygons(
    data = stanford_cbgs,
    fill = F
  ) %>% 
  addPolygons(
    data = origin_cbgs_normalized,
    fillColor = ~visits_pal(visits),
    color = "red",
    weight = 1,
    fillOpacity = 0.75,
    label = ~visits
  ) %>% 
  addLegend(
    data = origin_cbgs_normalized,
    pal = visits_pal,
    values = origin_cbgs_normalized %>% 
      arrange(desc(visits)) %>% 
      pull(visits) %>% 
      .[-c(1:6)],
    title = "Visits to<br>Stanford in<br>Sept 2021"
  )

In the map, we’re removing the top 6 highest visit counts so they don’t distort the color range; these are destination CBGs themselves as origin CBGs. Clearly the distribution of origin CBGs is vast, and some of these very well may have been air travel (or other kinds of data errors). We’ll eventually use ACS data to estimate the subset that are personal vehicle trips.

Now, as we did in Section 3.3, we’ll need to get travel distances for each origin CBG to Stanford (which we’ll go ahead and consolidate as one destination, the centroid of stanford_boundary). We’ll use mapboxapi again.

stanford_origin <-
  origin_cbgs_normalized %>% 
  st_centroid() %>% 
  st_coordinates()

stanford_destination <-
  stanford_boundary %>% 
  st_centroid() %>% 
  st_coordinates()

stanford_route <- 
  1:nrow(stanford_origin) %>%
  map_dfr(function(x){
    mb_directions(
      origin = stanford_origin[x, ],
      destination = stanford_destination,
      profile = "driving-traffic"
    )
  }) %>% 
  st_as_sf()
leaflet() %>%
  addMapboxTiles(
    style_id = "light-v9",
    username = "mapbox"
  ) %>%
  addPolylines(
    data = stanford_route
  )

Next, we’ll factor in trip mode based on travel time to estimate what percentage of trips made on each of these routes is by single occupancy vehicle or carpool (which will be the extent of our GHG calculation for this demonstration). While this is possible using the NHTS data, we’ll demonstrate using ACS data, where the key caveat is that the ACS information is specifically in response to commute trips, not all trips generally (which is what our SafeGraph data represents). But for our purposes, we’ll assume that the distribution of modes we see for commute trips will be similar enough to the distribution of modes for trips overall.

travel_time_mode <-
  counties("CA", cb = T, progress_bar = F) %>%
  pull(COUNTYFP) %>% 
  map_dfr(function(x){
    getCensus(
      name = "acs/acs5",
      vintage = 2019,
      region = "block group:*",
      regionin = paste0("state:06+county:", x),
      vars = "group(B08134)"
    )
  }) %>% 
  mutate(
    cbg =
      paste0(state,county,tract,block_group)
  ) %>%
  filter(cbg %in% origin_cbgs_normalized$origin_cbg) %>% 
  select(!c(GEO_ID,state,county,tract,block_group,NAME) & !ends_with(c("EA","MA","M"))) %>%
  pivot_longer(
    ends_with("E"),
    names_to = "variable",
    values_to = "estimate"
  ) %>%
  left_join(
    acs_vars_2019_5yr %>% 
      select(name, label), 
    by = c("variable" = "name")
  ) %>% 
  select(-variable) %>% 
  separate(
    label,
    into = c(NA, NA, "total", "mode", "carpool", "time"),
    sep = "!!"
  ) %>% 
  mutate(
    mode = case_when(
      total %in% c(
        "Less than 10 minutes",
        "10 to 14 minutes",
        "15 to 19 minutes",
        "20 to 24 minutes",
        "25 to 29 minutes",
        "30 to 34 minutes",
        "35 to 44 minutes",
        "45 to 59 minutes",
        "60 or more minutes"
      ) ~ "Total",
      mode == "Drove alone:" ~ mode,
      carpool %in% c(
        "In 2-person carpool:",
        "In 3-or-more-person carpool:"
      ) ~ carpool
    ),
    time = case_when(
      mode == "Total" ~ total,
      mode == "Drove alone:" ~ carpool,
      mode == carpool ~ time
    )
  ) %>% 
  filter(!is.na(time)) %>% 
  select(-total, -carpool) %>% 
  pivot_wider(
    names_from = mode,
    values_from = estimate
  ) %>% 
  mutate(
    perc_veh1 = `Drove alone:`/Total,
    perc_veh2 = `In 2-person carpool:`/Total,
    perc_veh3 = `In 3-or-more-person carpool:`/Total
  )

Note that the B08134 ACS dataset for “MEANS OF TRANSPORTATION TO WORK BY TRAVEL TIME TO WORK” has a particularly convoluted structure, so a lot of mutate() work needs to be done to get just the subset of data we’re interested in.

stanford_trips <-
  origin_cbgs_normalized %>% 
  cbind(
    stanford_route %>% 
      st_drop_geometry()
  ) %>% 
  mutate(
    time = case_when(
      duration < 10 ~ "Less than 10 minutes",
      duration < 15 ~ "10 to 14 minutes",
      duration < 20 ~ "15 to 19 minutes",
      duration < 25 ~ "20 to 24 minutes",
      duration < 30 ~ "25 to 29 minutes",
      duration < 35 ~ "30 to 34 minutes",
      duration < 45 ~ "35 to 44 minutes",
      duration < 60 ~ "45 to 59 minutes",
      TRUE ~ "60 or more minutes"
    )
  ) %>% 
  left_join(
    travel_time_mode %>% 
      select(
        origin_cbg = cbg,
        time,
        perc_veh1,
        perc_veh2,
        perc_veh3
      ),
    by = c("origin_cbg", "time")
  ) %>% 
  mutate(
    vehicles = 
      visits * perc_veh1 + 
      visits * perc_veh2 / 2 +
      visits * perc_veh3 / 3,
    vmt = vehicles * distance * 2
  )

The result of the left_join() in the chunk above is to grab the relevant ACS data for each origin CBG from the Stanford route dataframe, given the specific trip duration that we calculated with mapboxapi. The relevant ACS data gives us an estimate of what percent of (commute) trips generally of that duration, starting from that origin, are made in personal vehicles with 1, 2, or more passengers. Note that we are assuming that trips made “In 2-person carpool” involve 1 vehicle for every 2 visitors, and trips made “In 3-or-more-person carpool” involve 1 vehicle for every 3 visitors, conservatively (for some of these trips that are way longer than 60 minutes, we may be quite off in our estimates of the number of people; for example, they may be road trips with the whole family to visit Stanford as a college option).

The final result, vmt, is “vehicle miles traveled”, the fundamental GHG-related measure we’re interested in. This is multiplied by 2, assuming these trips are roundtrips. The last potential refinement we can choose to make is to account for other visits detected by SafeGraph, but that didn’t get accounted for in our steps above. One subset includes those origins that were outside of CA; if we believe that some of them were potentially driving trips, then some attempt to account for them would potentially be significant because each marginal trip would have lots of VMTs associated with it. Another subset would be all the trips with origins masked from the original device_home_areas field because there was just one device recorded. We could attempt to account for both of these subsets if we consider the relationship between raw_device_counts from the original stanford_patterns dataframe and the device_count from stanford_trips, which is just the sum of device counts from recorded origin CBGs.

recorded_visits <- sum(stanford_trips$device_count)/
  sum(stanford_patterns$raw_device_counts)

recorded_visits
## [1] 0.3948736

The ratio suggests that when all is said and done, we’re not capturing 60% of the actual trips made, from other CBGs that we haven’t identified. These missing trips may also, on average, be from further away CBGs, since by definition they were more “one-off” visitors from a CBG. But that means these are also more likely to be airplane trips. Let’s go ahead and adjust our overall VMT estimate, assuming that the 60% of missed trips is on average similar in VMT to the 40% we did account for:

sum(stanford_trips$vmt, na.rm = T)/recorded_visits
## [1] 29167585

That’s almost 30 million estimated vehicle miles traveled in Sept 2021 by visitors to Stanford (for comparison, the same analysis run for June 2020 yielded 8 million). This is the kind of measure that Stanford is particularly concerned with when considering its transportation planning, but it might be more comprehensive than traditional measures that tend to focus on commute trips. However, this demonstration does not account for trips made by Stanford residents traveling to other places, which should also be considered (to some degree) as part of the GHG transportation footprint of the Stanford community. Ultimately, the GHG inventory would include a key determination of what percentage of incoming and outgoing trip emissions are considered the “responsibility” of Stanford to monitor and mitigate.

Now, for our final step, we’ll use the California Air Resources Board (CARB) Emission Factors (EMFAC) model which includes emissions rates data, year by year (including projections into the future) for different geographies. Since these models are focused fundamentally on air pollution within air basins, its notion of “geography” would focus on “vehicles on the road” and not necessarily the home locations of vehicles. So for our purposes, we could choose to use model numbers “state-wide”, or the Bay Area air basin, or the jurisdictional area for the Bay Area Air Quality Management District. Input the following selections (with modifications, if desired) and download the resultant CSV:

  • Output: Onroad Emission Rates
  • Model Version: EMFAC2021 v1.0.1
  • Region Type: Air District
  • Region: Bay Area AQMD
  • Calendar Year: 2021
  • Season: Summer
  • Vehicle Category: EMFAC202x Categories, LDA (“Passenger Cars”) and LDT1 (“Light-Duty Trucks (GVWR <6000 lbs and ETW ≤3750 lbs)”)
  • Model Year: Aggregate
  • Speed: Aggregate
  • Fuel: Gasoline, Diesel, Electricity
  • Output Unit: tons/year
emfac <- 
  read_csv("your-file-path.csv", skip = 8) %>% 
  transmute(
    Category = `Vehicle Category`,
    Fuel_Type = Fuel,
    Percent_Trips = Trips/sum(Trips),
    Percent_Miles = `Total VMT`/sum(`Total VMT`),
    `MTCO2_Running_Exhaust` = CO2_RUNEX/`Total VMT`,
    `MTCO2_Start_Exhaust` = CO2_STREX/Trips
  )
emfac

Note that the CSV downloaded from the EMFAC website has poor formatting, so we’re skipping the first 8 lines of meta information.

Emissions from N2O and CH4, along with emissions from other types of vehicles, were left out of this analysis, but can be accounted for in much the same way.

In the most straightforward calculation, each we can now do the following:

  • Allocate our trips and VMT to these six different vehicle and fuel categories
  • Calculate emissions associated with the trip itself by multiplying VMT by gCO2_Running_Exhaust
  • Add emissions associated with starting the vehicle by multiplying trip count by gCO2_Start_Exhaust (times 2)
stanford_trips_ghg <-
  emfac %>% 
  mutate(
    trips = Percent_Trips * sum(stanford_trips$visits, na.rm = T),
    vmt = Percent_Miles * sum(stanford_trips$vmt, na.rm = T),
    ghg = vmt*MTCO2_Running_Exhaust + trips*MTCO2_Start_Exhaust*2
  )
sum(stanford_trips_ghg$ghg)/recorded_visits
## [1] 9791.206

The final amount here is in metric tonnes, and includes the same adjustment as we did before to account for unrecorded visits. So, holding the rough benchmark of the average American emitting 10 tonnes of CO2 equivalent annually, visitors to Stanford in June 2020 emitted the equivalent of almost a thousand Americans’ full years’ worth of emissions, just through vehicle emissions alone.

The electric vehicle categories in EMFAC, as you’d expect do not contribute to GHG emissions, which is a reason why cities are pushing for electrification of their vehicle fleets as well as incentives for residents to switch to electric vehicles. However, keep in mind that while electric vehicles may reduce GHG emissions associated with driving, they do not fundamentally change other challenges in our urban systems related to car dependence. In general, this final methodology can be used to simulate what GHGs may look like in the next few decades, with different rates of EV adoption, and other assumptions around changes in carpooling, transit use, telework, etc.