6.1 Matching

Now we move towards quasi-experimental designs, which are potential tools for causal inference where a gold-standard randomized control trial may not be possible for practical or ethical reasons. Across a few different techniques available, it is often useful to conduct “matching”, where a sample of units that receive a treatment is matched with a sample of units that did not receive the treatment, and otherwise are comparable on all observed covariates (independent variables). For our purposes, we’ll usually focus on CBGs or similar Census geographies as the unit of analysis. The goal is generally to be able to find matching CBGs for an initial CBG of interest, and to base that matching off of a handful of ACS variables we can observe for the CBG. Because those multiple variables are different dimensions of information, we need a mathematical approach to “similarity” across multiple dimensions.

One technique, propensity score matching, effectively involves creating a logit model as we did in Chapter 5.4, and matching based on the resultant one-dimensional logit score. However, for a propensity score to work, one would need an “outcome variable” to be predicting, which is not always available or relevant. Another technique is called Mahalanobis matching, which is a way to measure distance between a point and a distribution of observed points across multiple dimensions. This technique works best with fewer than 8 variables, and preferably continuous variables. This tends to reflect the information we work with when working with CBGs or other Census geographies.

Let’s say we were looking at tracts in the Bay Area and comparing them along the characteristics of income, race, and education. As we’ve done before, we first load the data using censusapi. We’ll need a new package, StatMatch, to use calculate Mahalanobis distance, so go ahead and install.packages("StatMatch") and load the library.

library(tidyverse)
library(censusapi)
library(StatMatch)
library(leaflet)
library(sf)
library(tigris)

Sys.setenv(CENSUS_KEY="c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")

acs_vars_2019_5yr <-
  listCensusMetadata(
    name = "2019/acs/acs5",
    type = "variables"
  )

bay_county_names <-
  c(
    "Alameda",
    "Contra Costa",
    "Marin",
    "Napa",
    "San Francisco",
    "San Mateo",
    "Santa Clara",
    "Solano",
    "Sonoma"
  )

bay_tracts <-
  tracts("CA", bay_county_names, cb = T, progress_bar = F)

bay_multiple_tract <- 
  getCensus(
    name = "acs/acs5",
    vintage = 2019,
    region = "tract:*",
    regionin = "state:06+county:001,013,041,055,075,081,085,095,097",
    vars = c(
      "B06009_001E",
      "B06009_002E",
      "B06009_003E",
      "B19001_001E",
      "B19001_014E",
      "B19001_015E",
      "B19001_016E",
      "B19001_017E",
      "B19001A_001E"
    )
  ) %>% 
  transmute(
    tract = paste0(state, county, tract),
    perc_college = 1 - (B06009_002E + B06009_003E) / B06009_001E,
    perc_over100k = (B19001_014E + B19001_015E + B19001_016E + B19001_017E) / B19001_001E,
    perc_white = B19001A_001E / B19001_001E
  ) %>% 
  filter(
    !is.na(perc_college), 
    !is.na(perc_over100k),
    !is.na(perc_white)
  )

Next, we’ll retain just the three fields of relevant variables in matrix form:

obs_matrix <-
  bay_multiple_tract %>% 
  select(
    perc_white, 
    perc_over100k,
    perc_college
  ) %>% 
  as.matrix()

Now, mahalanobis.dist() from the StatMatch package can be used. When given one argument (in this case our matrix), the function will create a resultant matrix that has as many rows and columns as number of observations in the original data, and each cell will be the Mahalanobis distance between the pair of records, a single number that accounts for our three ACS dimensions (the lower the better). For simplicity of reference in the following steps, we’ll also rename the rownames() and colnames() of the matrix by the relevant tract ID, which will have maintained the same order as the original bay_multiple_tract dataframe.

dist_matrix <- mahalanobis.dist(obs_matrix)

rownames(dist_matrix) <- bay_multiple_tract$tract
colnames(dist_matrix) <- bay_multiple_tract$tract

If you View(dist_matrix), you’ll notice the renamed rownames and colnames, and also that the diagonal values are 0 because there will be no distance when pairing a tract with itself. The matrix is also symmetrical, so only the top-right or bottom-left triangle of the matrix is technically necessary to convey all unique information, but the practical usage of this matrix will involve referencing entire rows.

If we wanted to pair every tract with its one closest match, we could hold identify the minimum Malahanobis distance value from each row (besides the diagonal 0s) and then retain the appropriate column name:

dist_matrix_pairmatch <- dist_matrix
diag(dist_matrix_pairmatch) <- NA

matched_pair_tract <-
  1:nrow(dist_matrix_pairmatch) %>% 
  map_dfr(function(x){
    
    min_index <- which(dist_matrix_pairmatch[x, ] == min(dist_matrix_pairmatch[x, ], na.rm = T))
   
    data.frame(
      tract = bay_multiple_tract$tract[x],
      matched_tract = bay_multiple_tract$tract[min_index]
    )
    
  })

Note that which() specifically returns the index for which the logical condition is true, which is particularly useful to then use within a bracket in the following step. Also, diag() <- NA helps to prevent the 0s in the diagonals from being considered when we look for the min() distance score. The following map shows three examples of pairs:

leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(
    data = bay_tracts %>% 
      filter(GEOID == matched_pair_tract[2,1])
  ) %>% 
  addPolygons(
    data = bay_tracts %>% 
      filter(GEOID == matched_pair_tract[2,2])
  ) %>% 
  addPolygons(
    data = bay_tracts %>% 
      filter(GEOID == matched_pair_tract[3,1]),
    color = "green"
  ) %>% 
  addPolygons(
    data = bay_tracts %>% 
      filter(GEOID == matched_pair_tract[3,2]),
    color = "green"
  ) %>% 
  addPolygons(
    data = bay_tracts %>% 
      filter(GEOID == matched_pair_tract[4,1]),
    color = "red"
  ) %>% 
  addPolygons(
    data = bay_tracts %>% 
      filter(GEOID == matched_pair_tract[4,2]),
    color = "red"
  )

If we had a specific tract of interest (“treated”), then to identify the twenty closest matching tracts (which might then be used as “controls”), we could do the following:

match_set_tract <- dist_matrix["06081611900", ] %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  rename(
    tract = rowname,
    match = "."
  ) %>% 
  right_join(
    bay_multiple_tract
  ) %>% 
  arrange(match) %>% 
  .[1:21, ] %>% 
  left_join(bay_tracts %>% select(tract = GEOID)) %>% 
  st_as_sf()

leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(
    data = match_set_tract[1, ],
    color = "red"
  ) %>% 
  addPolygons(
    data = match_set_tract[-1, ]
  )