2.2 Equity analysis

What we mean by “equity analysis” is the visualization and quantification of disproportionate outcomes across population groups of interest. Without diving deeply into the ethics that might motivate such an analysis, I’ll just assume that, with the examples shown in this section, you will agree that these ways of looking at population data are meaningful for understanding some kind of problem we might want to solve in society (however difficult the solutions may actually be).

The simplest version of this type of analysis involves only ACS summary data as we’ve used in the last chapter, but requires that the specific dataset loaded has an “outcome of interest”, broken into some number of buckets or tiers, and “population groups of interest”, with which we can count membership in those outcome buckets. For demonstration, we’ll use a well-established example: income vs. race. Income acts here as the “outcome of interest”, which is to say that if we break income up into different tiers, generally, everyone would agree that it’s a better outcome for a household to be in a higher income tier. Race acts here as the “population group of interest”, because if we discover that different racial groups have different household income outcomes, generally, many people would consider that social arrangement to be worse than a world in which there were no differences in income distribution. Put another way, if you buy into the idea that we can’t choose our individual races when we are born into this world and can’t choose what household we grow up in, and you agree that therefore we should set up society so that everybody at least has an equal opportunity to thrive in life, then finding disproportionate household income by racial group would be a clear signal of systems work to be done. Of course this topic is fraught with political controversy, but for our purposes, we are simply focusing on how to create objective measures that can be used to support the process of deliberation in urban governance. Let me further stress that we’re basically going to be doing a specific kind of math on a two-dimensional disaggregation of a population sample, but just applying it to any “X by Y” breakdown of a population doesn’t necessarily yield a “meaningful” result from an ethical perspective; so you’ll need to think carefully about the point you are trying to communicate, and whether it makes sense to conduct this kind of analysis.

Let’s go ahead and try to find the relevant dataset in the ACS database. A quick search online would point you to a long list of datasets that all appear to provide some combination of income and race information. Here, some amount of experience with the ACS universe is key; I’ll go ahead and point you to the B19001 series as the most useful datasets, specifically B19001A through B19001G for the different key racial groups recorded by the Census Bureau (you’ll often find that different racial groups are given entire datasets of their own, given the importance of data about race). However, this presents an extra processing step of having to load in data from multiple “datasets” (i.e. different getCensus() calls) and merge them together. No matter what, you are likely to need to do some data wrangling to get your data ready for a structured equity analysis, so this will be a good example to demonstrate, and you might find that your own chosen pairing is easier (or harder) to orchestrate.

If you don’t already have this ready to go, make sure to have censusapi loaded, the Census API key set, and the acs_vars_2018_5yr data dictionary loaded (from online or from your stored location):

library(tidyverse)
library(censusapi)

Sys.setenv(CENSUS_KEY="c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")

acs_vars_2018_5yr <-
  listCensusMetadata(
    name = "2018/acs/acs5",
    type = "variables"
  )

As a first test, let’s load just B19001A, which is household income for households led by somebody who is “White alone”.

bay_income_white <-
  getCensus(
    name = "acs/acs5",
    vintage = 2018,
    region = "county:001,013,041,055,075,081,085,095,097",
    regionin = "state:06",
    vars = "group(B19001A)"
  ) %>%
  select(!c(GEO_ID,state,NAME) & !ends_with(c("EA","MA","M"))) %>%
  pivot_longer(
    ends_with("E"),
    names_to = "variable",
    values_to = "estimate"
  ) %>%
  left_join(
    acs_vars_2018_5yr %>% 
      select(name, label), 
    by = c("variable" = "name")
  ) %>% 
  select(-variable) %>% 
  separate(
    label,
    into = c(NA,NA,"income"),
    sep = "!!"
  ) %>% 
  filter(!is.na(income))

Note that this is very similar to what we’ve done before, but since we’re choosing counties as the unit of data, our region = argument can include a list of all 9 Bay Area counties by their FIPS codes, and some of the data manipulation changes because of the different fields of data involved.

Conceptually, all we need to do to combine all 7 datasets is to load them individually like this, use mutate() to add a race field with the appropriate racial group label manually entered, and then bind them together. Since we have done a test run, we should be able to quickly set up a map_dfr() loop without running into many errors; otherwise, you should always feel comfortable using for loops for simple processes like this (a for loop version is sketched after the notes below).

Let’s first create a simple lookup table that we can use to help with looping and supplying the race label to the mutate() step. You can always create a quick new dataframe like this using data.frame(); keep it in mind as a generalizable technique. The racial labels come directly from the ACS questionnaire options.

census_race_labels <- data.frame(
  code = c("A","B","C","D","E","F","G"),
  label =
    c(
      "White Alone",
      "Black or African American",
      "American Indian and Alaska Native Alone",
      "Asian Alone",
      "Native Hawaiian and Other Pacific Islander Alone",
      "Some Other Race Alone",
      "Two or More Races"
    )
)

We’re basically just setting up a relationship between the letter codes and the race labels that we can use in a number of ways. Also note that you could replicate the list of letters using the built-in R object LETTERS, like LETTERS[1:7]. In fact, since the codes are so straightforward, we can simplify the process further, as shown below:

census_race_labels <- 
  c(
    "White Alone",
    "Black or African American",
    "American Indian and Alaska Native Alone",
    "Asian Alone",
    "Native Hawaiian and Other Pacific Islander Alone",
    "Some Other Race Alone",
    "Two or More Races"
  )

bay_income_race <-
  1:7 %>% 
  map_dfr(function(x){
    getCensus(
      name = "acs/acs5",
      vintage = 2018,
      region = "county:001,013,041,055,075,081,085,095,097",
      regionin = "state:06",
      vars = paste0("group(B19001",LETTERS[x],")")
    ) %>%
      select(!c(GEO_ID,state,NAME) & !ends_with(c("EA","MA","M"))) %>%
      pivot_longer(
        ends_with("E"),
        names_to = "variable",
        values_to = "estimate"
      ) %>%
      left_join(
        acs_vars_2018_5yr %>% 
          select(name, label), 
        by = c("variable" = "name")
      ) %>% 
      select(-variable) %>% 
      separate(
        label,
        into = c(NA,NA,"income"),
        sep = "!!"
      ) %>% 
      filter(!is.na(income)) %>% 
      mutate(race = census_race_labels[x])
  })

Notes:

  • census_race_labels has changed from a dataframe to just a vector of 7 strings. However, keep in mind the dataframe approach for more generalizable look-up tables.
  • We’re just feeding the vector of integers 1:7 into function(x) because we can easily refer to LETTERS[x] when necessary to get A through G. See how paste0() is used here to construct the appropriate ACS variable code.
  • The only other change is the addition of the mutate() at the end, where now we can refer to census_race_labels[x] to get the appropriate racial label, as long as we’ve ensured that the order is correct.
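
For reference, here’s a minimal sketch of the equivalent for loop (the bay_income_race_loop and one_race names are just for demonstration); it builds the same combined dataframe one racial group at a time with rbind():

bay_income_race_loop <- NULL # start with an empty object to bind onto

for(x in 1:7){
  
  # same pipeline as inside map_dfr() above, for one racial group at a time
  one_race <-
    getCensus(
      name = "acs/acs5",
      vintage = 2018,
      region = "county:001,013,041,055,075,081,085,095,097",
      regionin = "state:06",
      vars = paste0("group(B19001",LETTERS[x],")")
    ) %>%
    select(!c(GEO_ID,state,NAME) & !ends_with(c("EA","MA","M"))) %>%
    pivot_longer(
      ends_with("E"),
      names_to = "variable",
      values_to = "estimate"
    ) %>%
    left_join(
      acs_vars_2018_5yr %>% 
        select(name, label), 
      by = c("variable" = "name")
    ) %>% 
    select(-variable) %>% 
    separate(
      label,
      into = c(NA,NA,"income"),
      sep = "!!"
    ) %>% 
    filter(!is.na(income)) %>% 
    mutate(race = census_race_labels[x])
  
  # append this racial group's rows to the growing combined dataframe
  bay_income_race_loop <-
    rbind(bay_income_race_loop, one_race)
  
}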

Now we are ready to conduct the equity analysis. Note that the last step was unique to our situation involving racial data because it happens to be organized in the ACS database in this specific way, with the A through G codes. For a different initial pairing of “outcome” and “population group”, you might be able to load just one ACS variable code, similar to our test of household income for White alone households, and arrive immediately at this step. What matters is that at this point you should have a tidy dataset with a “geography” field (here filled with the 9 counties), an “outcome” field (here filled with the 16 income tiers the ACS provides us), a “population group” field (here filled with our 7 racial labels), and a “count” field with the counts of individuals or households in each of these “buckets”.

The simplest form of equity analysis is plotting this data in a way that clearly visualizes disproportionate outcomes across the different population groups. My preference is to show stacked bar charts, both absolute and percentage-based, that show the distribution of the population groups (in this case, race) for the total population of the geographic region (e.g. the whole Bay Area, though you could filter your current dataset to one individual county at a time), followed by bars for each individual outcome tier (in this case, income). Let’s get to our desired visualization in a series of attempts, using ggplot.

bay_income_race %>% 
  group_by(income, race) %>% 
  summarize(estimate = sum(estimate)) %>% 
  ggplot() +
  geom_bar(
    aes(
      x = income %>% factor(),
      y = estimate,
      fill = race
    ),
    stat = "identity",
    position = "stack"
  ) +
  labs(
    x = "Household income",
    y = "Number of households",
    title = "Bay Area household income by race",
    fill = "Race of householder"
  )

This is sort of a hot mess because the income labels are all jumbled together. The easy solution will be to add + coord_flip() to the operation (if you didn’t know, you’d find out with a quick Google search).

bay_income_race %>% 
  group_by(income, race) %>% 
  summarize(estimate = sum(estimate)) %>% 
  ggplot() +
  geom_bar(
    aes(
      x = income %>% factor(),
      y = estimate,
      fill = race
    ),
    stat = "identity",
    position = "stack"
  ) +
  labs(
    x = "Household income",
    y = "Number of households",
    title = "Bay Area household income by race",
    fill = "Race of householder"
  ) +
  coord_flip()

Now we can also notice a problem with the order of the income tiers. ggplot appears to have ordered the string labels alphabetically, but what we want is the existing order from the ACS data, which is captured in bay_income_race$income[1:16] (you can try this in the Console to confirm). This is similar to the problem we experienced when we plotted PG&E data and needed to convert the “month” data to factors so that they weren’t incorrectly interpreted as integers, but here we also need to deal with an automatic misordering problem. So the appropriate technique is to pipe to factor(), but also specify levels = as an argument, in which you provide a character vector of all the possible values in the order you’d like. I’ll add this fix below as well.

bay_income_race %>% 
  group_by(income, race) %>% 
  summarize(estimate = sum(estimate)) %>% 
  ggplot() +
  geom_bar(
    aes(
      x = income %>% factor(levels = bay_income_race$income[1:16]),
      y = estimate,
      fill = race
    ),
    stat = "identity",
    position = "stack"
  ) +
  labs(
    x = "Household income",
    y = "Number of households",
    title = "Bay Area household income by race",
    fill = "Race of householder"
  ) +
  coord_flip()

This is now looking pretty good, but there may be additional aesthetic preferences. In particular, while these “absolute” bar heights may helpfully show the relative difference in the number of households in each income tier, to get at a distributional issue within the tiers, we need to be able to better see the smaller bars, so position = “fill” instead of position = “stack” will address that (make sure to change the axis label accordingly). At the same time, let’s also reverse the order of the income tiers, just because we might prefer to have “Less than $10,000” at the top, and it’s good to know how to do this with rev(). Lastly, let’s move the legend to the bottom so there’s more room to see the bars, using some arguments in theme().

bay_income_race %>% 
  group_by(income, race) %>% 
  summarize(estimate = sum(estimate)) %>% 
  ggplot() +
  geom_bar(
    aes(
      x = income %>% factor(levels = rev(bay_income_race$income[1:16])),
      y = estimate,
      fill = race
    ),
    stat = "identity",
    position = "fill"
  ) +
  labs(
    x = "Household income",
    y = "Proportion of households",
    title = "Bay Area household income by race",
    fill = "Race of householder"
  ) +
  coord_flip() +
  theme(
    legend.position = "bottom",
    legend.direction = "vertical"
  )

You might consider this “done”, but as I noted earlier, I like to add a bar at the very top representing the “total” population, not broken up into income tiers, so that needs to be manually added to the initial dataset as shown below:

bay_race_total <-
  bay_income_race %>% 
  group_by(race) %>% 
  summarize(estimate = sum(estimate)) %>% 
  mutate(income = "Total")

bay_income_race %>% 
  group_by(income, race) %>% 
  summarize(estimate = sum(estimate)) %>% 
  rbind(bay_race_total) %>% 
  ggplot() +
  geom_bar(
    aes(
      x = income %>% factor(levels = rev(c("Total",bay_income_race$income[1:16]))),
      y = estimate,
      fill = race
    ),
    stat = "identity",
    position = "fill"
  ) +
  labs(
    x = "Household income",
    y = "Proportion of households",
    title = "Bay Area household income by race",
    fill = "Race of householder"
  ) +
  coord_flip() +
  theme(
    legend.position = "bottom",
    legend.direction = "vertical"
  )

Ultimately, all I’ve shown is a specific visual design technique to communicate an underlying equity issue that one might want to address. The specific design choices are, of course, entirely up to you. Besides the charts, you might generally find it useful to communicate specific numbers from the chart in the following way: “Overall, 42% of Bay Area households are non-White (by race of householder), but among households making less than $30,000 (roughly what would be considered extremely low income in the Bay Area), 49% are non-White. If income were ‘race-blind’, these percentages would be the same, but instead, non-White households appear to be 17% more likely to be extremely low income than we’d expect.”

The numbers above are the kinds of specific figures you would pull directly from the data, as shown below; these are just a few examples of the countless ways you could manipulate the tabular data to get them:

((sum(bay_race_total$estimate[1:6])/sum(bay_race_total$estimate))*100) %>% round()
## [1] 42
((bay_income_race %>% 
  filter(income %in% bay_income_race$income[1:5]) %>% 
  filter(race != "White Alone") %>% 
  pull(estimate) %>% 
  sum()) /
  (bay_income_race %>% 
    filter(income %in% bay_income_race$income[1:5]) %>% 
    pull(estimate) %>% 
    sum()) * 100) %>% 
  round()
## [1] 49
((((bay_income_race %>% 
  filter(income %in% bay_income_race$income[1:5]) %>% 
  filter(race != "White Alone") %>% 
  pull(estimate) %>% 
  sum()) /
  (bay_income_race %>% 
    filter(income %in% bay_income_race$income[1:5]) %>% 
    pull(estimate) %>% 
    sum())) / (sum(bay_race_total$estimate[1:6])/sum(bay_race_total$estimate)) - 1) * 100) %>% 
  round()
## [1] 17

If the above is difficult to parse, consider this a good opportunity to practice using these R techniques to produce the same numbers another way. pull(), by the way, is just the dplyr way of grabbing one field from a dataframe, similar to how you’d use the $ sign. Also note that the numbers I showed in the quote above were actually generated directly with this same code, used as inline code, similar to the Sys.Date() trick I introduced in the YAML introduction.
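
For example, the first number in the quote above could be generated right in the .Rmd prose with an inline code chunk, something like this sketch (reusing the same expression as in the first chunk above):

Overall, `r ((sum(bay_race_total$estimate[1:6])/sum(bay_race_total$estimate))*100) %>% round()`% of Bay Area households are non-White...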

Equity analyses can also be constructed out of CBG-level information, in which the relevant bucketed information (e.g. income by race) at the CBG level is aggregated up by group_by() and summarize() to the desired overall geographic region of interest, at which point the same visualizations and math can be performed. If the ACS provides you the information you’re looking for both at the CBG level and at the larger geographic scale you’re interested in (like our previous County and Bay Area example), then there’s no reason to work with CBG level data. But if you need to construct your own geographic region of interest that isn’t an official Census geography, then you’d want to work with the most granular data available and build up to your desired region of interest. You may also have “outcome” information that isn’t available in Census geographies to begin with (and doesn’t otherwise have any population data associated with it), in which case you would first need to join your outcome data to some kind of population information, and CBGs would be the best option in order to match differentiated outcomes to differentiated populations in the most granular way possible; then, you can use that information for a broader equity analysis.
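
To make that aggregation step concrete, here’s a minimal sketch, assuming a hypothetical tidy CBG-level dataframe cbg_income_race (with cbg, income, race, and estimate fields) and a hypothetical vector my_region_cbgs holding the GEOIDs of the CBGs that make up your custom region:

my_region_income_race <-
  cbg_income_race %>% 
  # keep only the CBGs that make up the custom region of interest
  filter(cbg %in% my_region_cbgs) %>% 
  # aggregate the bucketed counts up to the whole region
  group_by(income, race) %>% 
  summarize(estimate = sum(estimate, na.rm = T))

From there, the same visualizations and math shown earlier apply.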

As a partial example of this more circuitous approach, let’s consider CalEnviroScreen, which is a product of the California Office of Environmental Health Hazard Assessment and the California Environmental Protection Agency, meant essentially to assess disproportionate environmental outcomes for California communities. The report in fact includes the kinds of equity analysis visualizations we’ve covered in this section (see Figures 2, 9, and 11). We can download the CalEnviroScreen data and replicate this type of analysis for a more specific measure within its index, such as PM2.5. Note that this dataset provides “outcome” data (in this case, air pollution) already processed to the census tract level, which means we can skip the processing step described in the last paragraph of converting outcome data from some other geographic form to Census geographies. In this case, we benefit from OEHHA doing that work for us (e.g. converting point-based air quality sensor data, though we’ll do this ourselves in a later chapter). So let’s go ahead and see what that air pollution data looks like for census tracts in the Bay Area, and then decide how to proceed from there to an equity analysis.

The main web page provides a few different ways to download the data. Let’s use this as an opportunity to try reading Excel spreadsheets, such as the ces3results.xlsx link, using readxl, which is similar to readr and also part of the tidyverse but not automatically loaded with library(tidyverse), meaning you’ll need to load it individually.

Note that read_excel(), the equivalent of read_csv(), does not yet allow for reading directly from a URL; you’d have to download ces3results.xlsx to your local machine and then read from there, similar to our PG&E example. Here I’ll demonstrate a technique for “downloading” to a temporary location just as a staging area for reading the data into your R environment, and then removing that temporary file right away, so as not to store unnecessary data.

library(readxl)

temp <- tempfile()
download.file("https://oehha.ca.gov/media/downloads/calenviroscreen/document/ces3results.xlsx",destfile = temp, mode = "wb")

ces3 <- read_excel(temp, sheet = "CES 3.0 (2018 Update)")

unlink(temp)

Notes:

  • mode = “wb” stands for “write binary”, which prevents binary files like Excel spreadsheets from being corrupted during download (particularly on Windows); you are welcome to Google it and see plenty of arguments from R users who would want to see this argument become the default for downloading files from the web, which would mean we don’t have to specify it in the future.
  • If you don’t specify the sheet = argument in read_excel(), it will load the first sheet in the Excel file by default. Of course, you don’t necessarily know in advance what the contents of the Excel file are; you could use excel_sheets() on the file to view the sheet names (see the quick sketch after these notes).
  • Keep in mind that Excel sheets can be formatted in strange ways, so be prepared for strange results when trying to read from them. This is why we generally prefer simple CSVs that are more likely to be “just data” and not a “formatted document of data”.
  • Pair temp <- tempfile() with unlink(temp) so that the temporary file is removed once you’ve read the data into your environment.
  • Keep in mind that packages can be updated by their creators at any time, and read_excel() could very likely handle URLs directly in the near future, in which case you’d update your package and then be able to simplify this step.
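
For reference, the sheet-name check mentioned above is a one-liner, run while the temporary file still exists (i.e. before unlink(temp)):

# list the sheet names in the downloaded workbook before choosing sheet =
excel_sheets(temp)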

ces3 now contains a lot of information at the census tract level. The online documentation indicates that the PM2.5 field is “Annual mean concentration of PM2.5 (average of quarterly means, µg/m3), over three years (2012 to 2014)”. Furthermore, the documentation notes: “The US EPA has set a new standard for ambient PM2.5 concentration of 12 µg/m3, down from 15 µg/m3. According to EPA’s projections, by the year 2020 only seven counties nationwide will have PM2.5 concentrations that exceed this standard. All are in California (US EPA, 2012b).” Let’s use that field as our “outcome” and create our outcome tiers based on the distribution of outcomes we see in the Bay Area.

bay_county_names <-
  c(
    "Alameda",
    "Contra Costa",
    "Marin",
    "Napa",
    "San Francisco",
    "San Mateo",
    "Santa Clara",
    "Solano",
    "Sonoma"
  )

library(tigris)
library(sf)
library(leaflet)

ca_tracts <- tracts("CA", cb = T, progress_bar = F)

ces3_bay_pm25 <-
  ces3 %>% 
  filter(`California County` %in% bay_county_names) %>% 
  select(`Census Tract`, PM2.5) %>% 
  left_join(
    ca_tracts %>% 
      transmute(GEOID = as.numeric(GEOID)), 
    by = c("Census Tract" = "GEOID")
  ) %>% 
  st_as_sf()

pm25_pal <- colorNumeric(
  palette = "Reds",
  domain = ces3_bay_pm25$PM2.5
)

leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(
    data = ces3_bay_pm25,
    fillColor = ~pm25_pal(PM2.5),
    color = "white",
    weight = 0.5,
    fillOpacity = 0.5,
    label = ~PM2.5
  )

Notes:

  • California County and Census Tract are written with backticks because of the spaces in the variable names, which is generally how you deal with special variable names in dplyr functions, though it’s not always obvious when to use backticks and when to use quotation marks; that just comes with experience.
  • When working with new data, make sure to check for NAs. The easiest way to do this is with something like sum(is.na(ces3_bay_pm25$PM2.5)) in the Console. And if you need to remove NAs, the easiest method is filter(!is.na(PM2.5)) in the pipeline.
  • The GEOID field in the tigris dataset ca_tracts comes as a character type field with a “0” in front of the codes, which won’t join with the Census Tract field in ces3, which is a numeric type field. In this case, the easier way to align the types is to convert the character field to a numeric field, which automatically drops the leading zero. transmute() is a mutate() that also keeps only what is mutated, so it’s like a mutate() + select() (with the caveat that an sf object will still keep its geometry field, which is good for us in this case).
  • Here we’re switching from the default map to a stylized map for clearer viewing. You can see what other basemaps are built into leaflet() here. There are ways to pull in other basemaps too, which we’ll explore in the future.

Let’s quickly check the distribution of PM2.5 values recorded by CalEnviroScreen in the Bay Area, particularly the minimum and maximum:

summary(ces3_bay_pm25$PM2.5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.506   7.860   8.698   8.722   9.536  10.370

Now, the key assumption we have to make, if we want to do the kind of equity analysis we did, is that this “outcome” affects everybody in a given census tract the same way. Essentially, if our population grouping of interest is again race, we are assuming that the outcome is “race-blind” at the census tract level, but then when aggregating up to the entire Bay Area, we may find that being of a given race appears to be associated with better or worse air quality on average. Keep in mind that this assumption may very well be incorrect depending on the outcome, and it could be incorrect in a direction that would mean even greater inequity. But let’s go ahead and see where that assumption takes us for this example. For a given census tract, we’ll now look up population counts in racial groups, and append this outcome data directly to all buckets within a census tract, then recreate the regional tidy dataframe we need to perform the equity analysis.

census_race_labels <- 
  c(
    "White Alone",
    "Black or African American",
    "American Indian and Alaska Native Alone",
    "Asian Alone",
    "Native Hawaiian and Other Pacific Islander Alone",
    "Some Other Race Alone",
    "Two or More Races"
  )

bay_race_tracts <-
  1:7 %>% 
  map_dfr(function(x){
    getCensus(
      name = "acs/acs5",
      vintage = 2018,
      region = "tract:*",
      regionin = "state:06+county:001,013,041,055,075,081,085,095,097",
      vars = paste0("B19001",LETTERS[x],"_001E")
    ) %>%
      mutate(
        tract = paste0(state, county, tract) %>% as.numeric(),
        race = census_race_labels[x]
      ) %>% 
      select(
        tract,
        race,
        estimate = paste0("B19001",LETTERS[x],"_001E")
      )
  })

Notes:

  • This loop is very similar to what we did for the income by race equity analysis, but since we need tract level data instead of county level data, note the different structure of the region= and regionin= arguments, based on what getCensus() is designed to let you specify. The asterisk in tract:* specifically means “all tracts in” the counties specified by regionin=.
  • Keep in mind that when you’re trying something for the first time that you might eventually want to loop, it’s always helpful to do a test with just one dataset first. I’m not showing that testing here, but it’d be something you do directly in your Source document as you are developing your code.
  • In this case, since we just need total counts by race in each tract, we don’t need all the pivoting steps we’ve done before, and can simply select the first field of data (e.g. B19001A_001E) based on our experience with this kind of ACS dataset and knowing that this field contains the total count (of a racial group). Notice how the vars= argument, instead of specifying a group(), grabs just a single variable directly, in this case the _001E field from the given dataset.
  • Notice the mutate() added to recreate the tract field as a paste0() of state, county, and tract when working with tract data, since the ACS API doesn’t give you the most useful form of the tract ID directly. We are also converting the tract ID to a numeric. You might not know to do this in advance, but as soon as you get to a step in which you need to join tract IDs, say with our PM2.5 data, you’d realize that you need to fix something upstream in your process. This kind of foresight just takes looking closely at any new data you encounter, and learning through trial and error.
  • Also notice how within a select(), you can rename fields that you choose to select.

Now, we join the PM2.5 data to this dataframe and then group_by() and summarize():

bay_pm25_race <-
  bay_race_tracts %>% 
  left_join(
    ces3_bay_pm25 %>% 
      st_set_geometry(NULL),
    by = c("tract" = "Census Tract")
  ) %>% 
  mutate(
    PM2.5_tier =
      case_when(
        PM2.5 < 5 ~ "4-5",
        PM2.5 < 6 ~ "5-6",
        PM2.5 < 7 ~ "6-7",
        PM2.5 < 8 ~ "7-8",
        PM2.5 < 9 ~ "8-9",
        PM2.5 < 10 ~ "9-10",
        TRUE ~ "10-11" # catch-all: values of 10 and above (and any tracts with no PM2.5 value)
      ) 
  ) %>% 
  group_by(race, PM2.5_tier) %>% 
  summarize(estimate = sum(estimate, na.rm = T))

Here we are using case_when() for the first time. It’s like ifelse(), but used when you need to string multiple conditions together, which is to say when you want to manually manage multiple relationships between some initial condition and a resulting assignment. Here, we’re choosing to break PM2.5 into simple integer intervals because they will be easy to understand in a chart, and we happen to know the minimum and maximum intervals to specify because we previewed the PM2.5 distribution with summary(). Keep in mind that this interval choice is up to you; we could have, for example, used percentiles instead. Notice the unique structure of the case_when() arguments, using a ~ to relate the logical condition to the result, and the use of TRUE as the final catch-all for whatever rows are left unmatched.
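
If you wanted to try the percentile route instead, a minimal sketch might look like the following (the bay_pm25_race_quartiles name and quartile labels are just for demonstration; ntile() is a dplyr helper that bins a vector into n groups of roughly equal size):

bay_pm25_race_quartiles <-
  bay_race_tracts %>% 
  left_join(
    ces3_bay_pm25 %>% 
      st_set_geometry(NULL),
    by = c("tract" = "Census Tract")
  ) %>% 
  # drop tracts with no PM2.5 value before binning into quartiles
  filter(!is.na(PM2.5)) %>% 
  mutate(
    PM2.5_tier = paste0("Quartile ", ntile(PM2.5, 4))
  ) %>% 
  group_by(race, PM2.5_tier) %>% 
  summarize(estimate = sum(estimate, na.rm = T))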

This dataframe is now in the same form as the one we had for income by race, so we can copy and paste much of the same code to produce visualizations and investigate the degree of inequity in this particular pairing.

bay_pm25_race_stacked <-
  bay_pm25_race %>% 
  group_by(PM2.5_tier, race) %>% 
  summarize(estimate = sum(estimate)) %>% 
  rbind(
    bay_pm25_race %>% 
      group_by(race) %>% 
      summarize(estimate = sum(estimate)) %>% 
      mutate(PM2.5_tier = "Total")
  ) %>% 
  ggplot() +
  geom_bar(
    aes(
      x = PM2.5_tier %>% factor(levels = rev(c("Total","4-5","5-6","6-7","7-8","8-9","9-10","10-11"))),
      y = estimate,
      fill = race
    ),
    stat = "identity",
    position = "stack"
  ) +
  labs(
    x = "PM2.5",
    y = "Number of households",
    title = "Bay Area PM2.5 exposure by race",
    subtitle = "Annual mean concentration of PM2.5 (average of quarterly means, µg/m3),\nover three years (2012 to 2014), aggregated by Census tract",
    fill = "Race of householder"
  ) +
  coord_flip() +
  theme(
    legend.position = "bottom",
    legend.direction = "vertical"
  )

bay_pm25_race_fill <-
  bay_pm25_race %>% 
  group_by(PM2.5_tier, race) %>% 
  summarize(estimate = sum(estimate)) %>% 
  rbind(
    bay_pm25_race %>% 
      group_by(race) %>% 
      summarize(estimate = sum(estimate)) %>% 
      mutate(PM2.5_tier = "Total")
  ) %>% 
  ggplot() +
  geom_bar(
    aes(
      x = PM2.5_tier %>% factor(levels = rev(c("Total","4-5","5-6","6-7","7-8","8-9","9-10","10-11"))),
      y = estimate,
      fill = race
    ),
    stat = "identity",
    position = "fill"
  ) +
  labs(
    x = "PM2.5",
    y = "Proportion of households",
    title = "Bay Area PM2.5 exposure by race",
    subtitle = "Annual mean concentration of PM2.5 (average of quarterly means, µg/m3),\nover three years (2012 to 2014), aggregated by Census tract",
    fill = "Race of householder"
  ) +
  coord_flip() +
  theme(
    legend.position = "bottom",
    legend.direction = "vertical"
  )

bay_pm25_race_stacked

bay_pm25_race_fill

Notes:

  • I gave the two plots variable names so I could place them adjacent to each other in the final web page.
  • I streamlined some steps from earlier, such as incorporating the rbind() of the “Total” counts of the racial groups into one pipeline. Keep in mind that that extra step is just there to add a “Total” bar to the top of the bar charts.
  • Because this measure is more technical, I’ve included a subtitle through labs() to provide a more technical definition of the PM2.5 measure. Note how I use \n to create a line break in this text; sometimes you need to use \n to achieve this effect, sometimes you need to use <br>, which just depends on HTML details that you probably don’t need to worry about; just be prepared to try either.

The lowest tiers turned out to have such low population counts that they could probably be merged into a single “4-7” tier, as sketched below. Otherwise, based on these particular inputs and the assumptions of our methodology, there appears to be a disproportionate burden of poor air quality on Asian households, largely driven by the air quality and population in San Jose.
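
If you wanted to make that merge, a minimal sketch would be to re-run the tiering step with adjusted case_when() cutoffs, and then update the factor levels in the plotting code to rev(c("Total","4-7","7-8","8-9","9-10","10-11")) to match:

bay_pm25_race <-
  bay_race_tracts %>% 
  left_join(
    ces3_bay_pm25 %>% 
      st_set_geometry(NULL),
    by = c("tract" = "Census Tract")
  ) %>% 
  mutate(
    PM2.5_tier =
      case_when(
        # collapse everything below 7 into a single tier
        PM2.5 < 7 ~ "4-7",
        PM2.5 < 8 ~ "7-8",
        PM2.5 < 9 ~ "8-9",
        PM2.5 < 10 ~ "9-10",
        TRUE ~ "10-11"
      ) 
  ) %>% 
  group_by(race, PM2.5_tier) %>% 
  summarize(estimate = sum(estimate, na.rm = T))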