library(tidyverse)
library(plotly)

On August 17, 2020, a rare lightning storm hit the Bay Area, causing multiple fires to start. A few days later, the air quality continually got worse. We all know that the August lightning complex fire caused the air quality to deteriorate, but for the purposes of demonstrating a difference-in-difference analysis, lets suppose we are uncertain if the lightning/fire event was the culprit.

We will use data scraped from the Purple Air website. This data is derived from air quality sensors in different locations all around the world. For the purposes of this analysis, we will consider the air quality data from three sensors near Stanford, and three sensors near Harvard (in the Boston, MA area) 5 days before and after the August Lightning complex fire started. Stanford will be the “treatment” group (in proximity of the fire) and Harvard will be the “control” group (not near the fire).

I have already labeled the data, so I first create a list with the file names (the names of the sensors) and read in/combine all of the relevant files

sensor_name <- c("Meow's Eye View","Central Square","Lubaro","Escobita","Menlo Atherton","Los Altos Ave.")

all_locations <- read_csv("C:/Users/sadegh/Documents/GitHub/hazard/DID demo/Escobita.csv")[numeric(0),]


for(i in 1:length(sensor_name)){
  
  temp_df <- 
    read_csv(paste0(sensor_name[i],".csv")) %>%
    mutate(
      state = ifelse(sensor_name[i] %in% c("Escobita","Menlo Atherton","Los Altos Ave."),"CA","MA") #here I create a flag for the state in which the sensor is located
    )
  
  all_locations <-
    all_locations %>% 
    rbind(temp_df)
  
}

Next, I chose to clean the data to make processing more straightforward. This includes renaming some variables, making the date column of class Date, filtering to the time frame of August 12 - August 22, and taking the PM 2.5 average by date and state.

all_locations_clean <-
  all_locations %>% 
  select(created_at,`PM2.5_CF_1_ug/m3_A`,Name,state) %>% 
  rename(
    PM2.5 = `PM2.5_CF_1_ug/m3_A`,
    date = "created_at"
  ) %>% 
  mutate(
    date = substr(date,1,10) %>% as.Date()
  ) %>% 
  filter(date >= "2020-08-12" %>% as.Date() & date <="2020-08-22") %>% 
  group_by(date,Name,state) %>% 
  summarise(PM2.5 = mean(PM2.5)) %>% 
  na.omit()

Before doing any stastical analysis, we can first plot the data and visually see the impact of the August Lightning fire on Bay Area air quality. We can see that before the event, the Bay Area and Boston air quality were more or less similar (with a PM 2.5 measurement ranging from 5 - 15 micrograms). These areas are established as “equal”, and we can now see if the event had any impact on air quality. If the air quality between the Bay Area and Boston follows a similar trend after the fire, then we could conclude that the fire had no impact on Bay Area air quality. The plot tells a different story, however. There is a sharp spike in Bay Area air quality after the event, indicating that the fire could indeed be a reason for the exacerbation in air quality.

graph_example <- function(state_input){
  df_new <-
    all_locations_clean %>% 
    group_by(date,state) %>% 
    summarise(PM2.5 = mean(PM2.5)) %>% 
    filter(state == state_input)
  
  return(df_new)
}

ca_example <- graph_example("CA")
ma_example <- graph_example("MA")

graph <-
  plot_ly() %>% 
  add_trace(
    x = ca_example$date,
    y = ca_example$PM2.5,
    type = "scatter",
    mode = "lines",
    line = list(color = "rgb(0,0,255)"),
    name = "Bay Area, Mean",
    showlegend=T
  ) %>% 
  add_trace(
    x = ma_example$date,
    y = ma_example$PM2.5,
    type = "scatter",
    mode = "lines",
    line = list(color = "rgb(255,0,0)"),
    name = "Boston, Mean",
    showlegend=T
  ) %>% 
  add_trace(
      x = c("2020-08-17" %>% as.Date(),"2020-08-17" %>% as.Date()),
      y = c(0,max(ca_example$PM2.5)+10),
      type = "scatter",
      mode = "lines",
      line = list(color = "rgb(63,63,63)",dash = "dot"),
      name = "August Lightning Complex Fire Starts",
      showlegend=F
    ) %>% 
  layout(
      margin = list(t = 25),
      paper_bgcolor='rgb(255,255,255)', 
      plot_bgcolor='rgb(229,229,229)',
      xaxis = list(
        title = "",
        gridcolor = 'rgb(255,255,255)',
        showgrid = TRUE,
        showline = FALSE,
        showticklabels = TRUE,
        tickcolor = 'rgb(127,127,127)',
        ticks = 'outside',
        zeroline = FALSE
      ),
      yaxis = list(
        title = "PM 2.5 (micrograms)",
        gridcolor = 'rgb(255,255,255)',
        showgrid = TRUE,
        showline = FALSE,
        showticklabels = TRUE,
        tickcolor = 'rgb(127,127,127)',
        ticks = 'outside',
        zeroline = FALSE
      ),
      annotations = list(
        xref = "x",
        yref = "y",
        x = "2020-08-17" %>% as.Date(),
        y = max(ca_example$PM2.5)+10,
        text = "Event \n August Lighning Complex Fire Starts",
        xanchor = "center",
        yanchor = "bottom",
        showarrow = F
      ),
      legend = list(
        x = 0.05, 
        y = 0.15,
        bgcolor = 'rgba(229,229,229,0.5)'
      )
    ) %>% 
    config(displayModeBar = F)
  
  
graph

We will now use a statistical difference-in-difference technique to quantify how much of an impact the fire had on air quality in the Bay Area compared to the baseline air quality in Boston.

did_prep <-
  all_locations_clean %>% 
  mutate(
    time = ifelse(date >= "2020-08-17" %>% as.Date(),1,0),
    treatment = ifelse(state == "CA",1,0),
    did_indicator = time * treatment
  )



did_reg <- lm(PM2.5 ~ time + treatment + did_indicator, data = did_prep)

summary(did_reg)
## 
## Call:
## lm(formula = PM2.5 ~ time + treatment + did_indicator, data = did_prep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.997  -3.059  -0.835   2.245  85.060 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     10.856      5.999   1.810   0.0761 .  
## time            -3.210      8.231  -0.390   0.6981    
## treatment       -2.343      8.999  -0.260   0.7956    
## did_indicator   53.402     12.559   4.252 8.83e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.23 on 52 degrees of freedom
## Multiple R-squared:  0.4546, Adjusted R-squared:  0.4232 
## F-statistic: 14.45 on 3 and 52 DF,  p-value: 5.734e-07