This document demonstrates various normalization procedures for Safegraph patterns data. Safegraph data does not count the actual number of visits or visitors to places, but rather the movement of a subset of location-activated smartphones. To estimate the actual number of individual visits to places, we ‘normalize’ the number of raw visits, i.e. smartphone location pings in places.

library(tidyverse)
library(tigris)
library(censusapi)
library(sf)
library(mapview)
library(rjson)
library(knitr)

options(
  tigris_class = "sf",
  tigris_use_cache = T
)

# https://api.census.gov/data/key_signup.html
Sys.setenv(CENSUS_KEY="c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")

# This is the particular file path for our Stanford course. Safegraph COVID-19 data consortium members have access to the same source data.
safegraph_address <- 'P:/SFBI/Restricted Data Library/Safegraph/covid19analysis/'

Obtaining Census data

For normalization of Safegraph data, we need to know the population count of all census blocks relevant to our region of analysis. In this example, we consider the San Francisco Bay Area and its subregions. We use the American Community Survey via censusapi for population counts, and the census TIGER/Line shapefiles via tigris for identifying the geographic region.

#All data currently using acs_year 2018.
acs_year = 2018

#Names of Bay Area counties
norm_bay_county_names <-
  c(
    "Alameda",
    "Contra Costa",
    "Marin",
    "Napa",
    "San Francisco",
    "San Mateo",
    "Santa Clara",
    "Solano",
    "Sonoma"
  )

#FIPS codes of Bay Area counties
norm_bay_counties <-
  counties("CA",progress_bar=F) %>% 
  filter(NAME %in% norm_bay_county_names) %>% 
  pull(COUNTYFP)

#Population of Bay Area
norm_bay_pop <-
  getCensus(
    name = "acs/acs1",
    vintage = acs_year,
    region = "county:*", 
    regionin = "state:06",
    vars = "B01003_001E"
  ) %>% 
  filter(county %in% norm_bay_counties) %>% 
  pull(B01003_001E) %>% #This is an ACS code for population counts.
  sum()

#Bay Area blockgroup GEOIDs.
norm_bay_blockgroups <-
  norm_bay_county_names %>%
  map(function(x){
    block_groups("CA",x,progress_bar=F) %>%
      pull(GEOID)
  }) %>% 
  unlist()

#California counties FIPS codes.
norm_ca_counties_fips <-
  counties("CA", progress_bar=F) %>%
  pull(COUNTYFP)

#California population by blockgroup.
norm_ca_pop_blockgroup <-
  norm_ca_counties_fips %>% 
  map_dfr(function(x){
    getCensus(
      name = "acs/acs5",
      vintage = acs_year,
      region = "block group:*", 
      regionin = paste0("state:06+county:",x),
      vars = "B01003_001E"
    )
  }) %>% 
  transmute(
    origin_census_block_group = paste0(
      state,county,tract,block_group
    ),
    pop = B01003_001E
  )

Normalization Functions

There are many ways to perform normalization, but first let us outline the general mathematical procedure.

  1. Compare Census population counts with the number of devices tracked by Safegraph within a given region to estimate the ratio of actual visitors to raw visitors.
  2. Use this ratio to map raw visitor counts to actual visitor counts.
  3. Assume that every visitor makes the same number of visits. The number of visits per visitor at a place is then the ratio of its raw visit count to its raw visitor count; both raw counts are provided in the Safegraph patterns dataset.
  4. To get actual visit counts, take the visitor count derived in step 2 and multiply it by the ratio of raw visit counts to raw visitor counts.

The assumption of the same number of visits per visitor in step 3 is a major simplification. Since Safegraph does not distinguish the number of visits made by individual visitors, this is the easiest approach for estimating actual visit counts.
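To make the arithmetic concrete, here is a minimal sketch of steps 1-4 with made-up numbers. The region population, device count, and raw counts below are purely illustrative; the functions later in this document implement the real procedure.

#Minimal sketch of steps 1-4 with hypothetical numbers.
region_pop         <- 7750000  #hypothetical census population of the region
region_devices     <- 775000   #hypothetical number of Safegraph devices residing in the region
raw_visitor_counts <- 20       #hypothetical raw visitor count at one place
raw_visit_counts   <- 21       #hypothetical raw visit count at the same place

visitor_multiplier <- region_pop/region_devices              #step 1: people per tracked device
visitor_counts     <- raw_visitor_counts*visitor_multiplier  #step 2: estimated actual visitors
visits_per_visitor <- raw_visit_counts/raw_visitor_counts    #step 3: assumed constant per visitor
visit_counts       <- visitor_counts*visits_per_visitor      #step 4: estimated actual visits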

We show two ways of performing the normalization in step 1. The first is a macro approach.

Bay Area-wide Normalization

Since the San Francisco Bay Area is our area of interest, we first estimate the population of this region (stored in norm_bay_pop). Then we compute the total number of devices tracked by Safegraph within this region. This number can be obtained from the home panel summary dataset of Safegraph. The ratio of the population estimate to the number of devices in the region provides us with a multiplier which allows us to translate the raw number of visitors to an estimate of the actual number of visitors.

#Function to do Bay Area-wide normalization.
#param patterns: Safegraph patterns dataset.
#pre patterns must have the column 'raw_visit_counts'.
#param home_summary: Safegraph home_panel_summary dataset.
#pre home_summary must have columns 'census_block_group' and 'number_devices_residing'.
#param bay_blockgroups: list of blockgroup GEOIDs in the Bay Area for filtering of the Safegraph datasets.
#param bay_pop: numeric. Population of the Bay Area.
#returns patterns dataset with a column visit_counts that multiplies raw visits by the ratio of Bay Area population to Safegraph device population.
normBay <- function(patterns, home_summary, bay_blockgroups = norm_bay_blockgroups, bay_pop = norm_bay_pop)
{
  #Compute the Safegraph device population from home_summary by taking Bay Area blockgroups and summing the number of devices.
  bay_sg_pop <-
    home_summary %>% 
    filter(census_block_group %in% bay_blockgroups) %>% 
    pull(number_devices_residing) %>% 
    sum()
  
  #Create the Bay Area-wide multiplier.
  bay_multiplier <- bay_pop/bay_sg_pop
  
  return(patterns %>% mutate(visit_counts = raw_visit_counts*bay_multiplier))
}

Blockgroup-based Normalization

The second approach is a micro-level normalization. We obtain the ratio of population to number of devices for each blockgroup. Since the Safegraph patterns dataset provides the origin census blockgroups of some visitors, we can apply a different ratio depending on which blockgroup a device resides in.
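As a minimal sketch of the per-blockgroup multiplier (using a hypothetical two-row home panel summary; normBG() below does this on the real datasets):

#Hypothetical home panel summary with two blockgroups, for illustration only.
home_summary_example <-
  tibble(
    census_block_group = c("060855018002", "060855045053"),
    number_devices_residing = c(120, 80)
  )

#Per-blockgroup multiplier: census population divided by Safegraph devices residing.
bg_multipliers <-
  norm_ca_pop_blockgroup %>%
  inner_join(
    home_summary_example,
    by = c("origin_census_block_group" = "census_block_group")
  ) %>%
  mutate(bg_multiplier = pop/number_devices_residing)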

This approach has one problem: not all visitors have a recorded origin census blockgroup. This is part of Safegraph’s effort to preserve privacy. If the number of visitors from a given origin blockgroup falls below a prescribed threshold (currently 2), then the home blockgroup is not recorded for those visitors.

So how do we estimate the number of visitors for those devices which have no recorded origin blockgroup? Here we make another simplifying assumption. We assume that the ratio of visitors to devices in the case of unrecorded origin blockgroups matches the overall ratio for the visitors with recorded blockgroups. In other words, we take all the devices whose origin blockgroup is known, convert them to number of visitors, then sum up the number of visitors over all such origin blockgroups and divide by the total number of devices over all such origin blockgroups. We use this ratio for the devices with no recorded origin blockgroup.
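As a sketch of this assumption (reusing the hypothetical home_summary_example above; normBG() below computes the same quantity as recorded_ratio):

#Blockgroups that actually appear as recorded origins in the expanded patterns data (hypothetical here).
recorded_bgs <- c("060855018002", "060855045053")

#Total census population of the recorded origin blockgroups.
recorded_pop <-
  norm_ca_pop_blockgroup %>%
  filter(origin_census_block_group %in% recorded_bgs) %>%
  pull(pop) %>%
  sum()

#Total Safegraph devices residing in those same blockgroups.
recorded_devices <-
  home_summary_example %>%
  filter(census_block_group %in% recorded_bgs) %>%
  pull(number_devices_residing) %>%
  sum()

#Aggregate ratio applied to visitors whose origin blockgroup is unrecorded.
recorded_ratio <- recorded_pop/recorded_devices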

Patterns data expansion

In order to perform the above blockgroup-based normalization, we need to unpack the data about origin census blockgroups from the patterns data. This column is named visitor_home_cbgs and is in a JSON dictionary format of the form {blockgroup id: number of raw visitors}. We expand the patterns data so that for each place, there is one row for each of the origin blockgroups of its raw visitors. This will streamline the normalization process. This particular expansion is given in the function expandOrigins(). WARNING: this function takes a lot of time. So it is best to use it on filtered regions such as individual cities, rather than whole patterns datasets.
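For reference, here is how a single (hypothetical) visitor_home_cbgs value can be unpacked with rjson, mirroring the steps inside expandOrigins():

#Hypothetical visitor_home_cbgs value for one place.
example_cbgs <- '{"060855018002":4,"060855045053":2}'

example_cbgs %>%
  fromJSON() %>%
  unlist() %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  rename(
    origin_census_block_group = "rowname",
    origin_raw_visitor_counts_high = "."
  )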

Adjustment for 2-4 visitor origins

An adjustment needs to be made for a specific detail in the most recent versions of patterns data:

Beginning with the 04-05 release, we apply differential privacy to the counts in home_cbgs, country_of_origin and device_types. We do not include a census block group or country of origin unless there are at least 2 visitors from that census block group or country of origin or with that device type as applicable. If there are between 2 and 4 visitors we show this count as 4.

Because a recorded origin shown as 4 may actually have 2, 3, or 4 visitors, the normalized visits we estimate for that block group could be off by a factor of 2. So, in our function, we consider the lower (2) and upper (4) bounds for every recorded origin listed with a raw visitor count of 4, which propagates lower and upper bounds through the rest of the normalization process.
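As a small worked example of this bookkeeping (matching the Gardner Elementary rows in the demonstration table below), suppose a place has 5 raw visitors and a single recorded origin blockgroup shown as 4:

raw_visitor_counts <- 5  #total raw visitors at the place

#Bounds for the single recorded origin shown as 4 (its true count is 2, 3, or 4).
origin_raw_visitor_counts_high <- 4
origin_raw_visitor_counts_low  <- 2

#The unrecorded remainder is paired with each bound, so its value moves in the opposite direction.
unrecorded_raw_visitor_counts_high <- raw_visitor_counts - origin_raw_visitor_counts_high  #1
unrecorded_raw_visitor_counts_low  <- raw_visitor_counts - origin_raw_visitor_counts_low   #3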

#Function to expand the patterns dataset by origin blockgroup.
#param patterns_data: Safegraph patterns dataset.
#pre It must have a column visitor_home_cbgs containing a JSON dictionary of blockgroup GEOIDs and raw visitor counts.
#returns patterns_data with one row per origin blockgroup (plus an unrecorded-origin row) per place.
expandOrigins <- function(patterns_data)
{
  
  expanded_patterns <- NULL
  
  #Loop over each row so that we can convert each row to multiple rows, one for each origin census block group of visitors.
  for(row in 1:nrow(patterns_data)){
    
    #If no recorded blockgroup, then just save as one row with no origin_census_block_group.   
    if(patterns_data$visitor_home_cbgs[row] == "{}") {
      
      data_row <- 
        data.frame(
          safegraph_place_id = patterns_data$safegraph_place_id[row],
          origin_census_block_group = NA, #origin block group is not available.
          origin_raw_visitor_counts_high = patterns_data$raw_visitor_counts[row],
          origin_raw_visitor_counts_low = patterns_data$raw_visitor_counts[row]
        )
      
    } 
    # Else, need to unpack the JSON dictionary.
    else {
      #Unpack, unlist, rename.
      json <- 
        patterns_data$visitor_home_cbgs[row] %>% 
        fromJSON() %>% 
        unlist() %>% 
        as.data.frame() %>% 
        rownames_to_column() %>% 
        rename(
          origin_census_block_group = "rowname",
          origin_raw_visitor_counts_high = "."
        ) %>% 
        mutate( # Dealing with the 2-4 visitor issue
          origin_raw_visitor_counts_low =
            ifelse(
              origin_raw_visitor_counts_high == 4,
              2,
              origin_raw_visitor_counts_high
            )
        )
      
      #Compute the unrecorded raw visitors by subtracting blockgroup-mapped origin_raw_visitor_counts from raw_visitor_counts. Special cases need to be considered because of the 2-4 visitor issue
      if(patterns_data$raw_visitor_counts[row] > sum(json$origin_raw_visitor_counts_high)){
        
        unrecorded_raw_visitor_counts_high <- 
          patterns_data$raw_visitor_counts[row] - sum(json$origin_raw_visitor_counts_high)
      
        unrecorded_raw_visitor_counts_low <-
          patterns_data$raw_visitor_counts[row] - sum(json$origin_raw_visitor_counts_low)
        
      } else {
        
        #In this rare case, there were enough origins overcounted as having 4 raw visitors that the total count for recorded origins exceeded the actual raw_visitor_counts. We downscale all recorded origin visitor counts so that their sum equals the actual raw_visitor_counts exactly.
        json$origin_raw_visitor_counts_high <-
          json$origin_raw_visitor_counts_high/sum(json$origin_raw_visitor_counts_high)*patterns_data$raw_visitor_counts[row]
        
        unrecorded_raw_visitor_counts_high <- 0
        
        unrecorded_raw_visitor_counts_low <-
          max((patterns_data$raw_visitor_counts[row] - sum(json$origin_raw_visitor_counts_low)),0)
        
      }
        
      #Add one more row which contains unrecorded raw_visitor_counts.
      data_row <- 
        json %>% 
        rbind(
          data.frame(
            origin_census_block_group = NA,
            origin_raw_visitor_counts_high = unrecorded_raw_visitor_counts_high,
            origin_raw_visitor_counts_low = unrecorded_raw_visitor_counts_low
          )
        ) %>% 
        mutate(
          safegraph_place_id = patterns_data$safegraph_place_id[row]
        )
      
    }
    #Put all the rows together.
    expanded_patterns <-
      expanded_patterns %>% 
      rbind(data_row)
  }
  
  #Join the rest of the patterns data using safegraph place id.
  expanded_patterns <-
    expanded_patterns %>% 
    left_join(patterns_data, by = 'safegraph_place_id')
  
  return(expanded_patterns)
}

#Function to do blockgroup-wise normalization.
#param patterns: Safegraph patterns dataset.
#pre patterns must have the columns 'safegraph_place_id', 'visitor_home_cbgs', 'raw_visit_counts', and 'raw_visitor_counts'.
#param home_summary: Safegraph home_panel_summary dataset.
#pre home_summary must have columns 'census_block_group' and 'number_devices_residing'.
#param bay_blockgroups: list of blockgroup GEOIDs in the Bay Area for filtering of the Safegraph datasets.
#param ca_pop_blockgroup: dataframe. Census population count for each blockgroup in CA.
#returns expanded patterns dataset with columns visit_counts_high and visit_counts_low, estimated from blockgroup-level ratios of census population to Safegraph devices.
normBG <- function(patterns, home_summary, bay_blockgroups = norm_bay_blockgroups, ca_pop_blockgroup = norm_ca_pop_blockgroup){
  
  #Expand and categorize visitors by origin_census_block_group.
  #Also join population and home summary data.
  patterns <- 
    expandOrigins(patterns) %>%
    left_join(home_summary, by = c('origin_census_block_group' = 'census_block_group')) %>%
    left_join(ca_pop_blockgroup, by = 'origin_census_block_group')
  
  #Potential cleaning of lack of population data if not present within California.
  # lack_of_pop <- which(is.na(patterns$pop) & !is.na(patterns$origin_census_block_group))
  # for (i in lack_of_pop) {
  #   geoid = patterns[i,'origin_census_block_group']
  #   state <- substring(geoid,1,2)
  #   county <- substring(geoid,3,5)
  #   tract <- substring(geoid,6,11)
  #   bg <- substring(geoid,12,12)
  #   #Get population data for non-California blockgroups.
  #   patterns[i,'pop'] = getCensus(
  #     name = "acs/acs5",
  #     vintage = acs_year,
  #     region = paste0("block group:", bg), 
  #     regionin = paste0("state:",state,"+county:",county,"+tract:",tract),
  #     vars = "B01003_001E"
  #   ) %>% pull('B01003_001E')
  # }
  
  #Estimate true visitor counts based on census population for recorded block groups.
  
  #To estimate number of visitors from unrecorded block groups, use the visitors-to-devices ratio generated by the aggregate recorded block groups.
  recorded_pop <-
    ca_pop_blockgroup %>% 
    filter(
      origin_census_block_group %in% patterns$origin_census_block_group
    ) %>% 
    pull(pop) %>% 
    sum()

  #Compute total number of devices with recorded origin blockgroups.
  recorded_sg_pop <-
    home_summary %>% 
    filter(census_block_group %in% patterns$origin_census_block_group) %>% 
    pull(number_devices_residing) %>% 
    sum()

  #Compute ratio
  recorded_ratio <- recorded_pop/recorded_sg_pop
  #Use the above ratio for the raw visitors with the unrecorded origin blockgroups.
  
  patterns <-
    patterns %>% 
    mutate(
      origin_visitor_counts_high =
        ifelse(
          !is.na(origin_census_block_group),
          origin_raw_visitor_counts_high*pop/number_devices_residing,
          origin_raw_visitor_counts_high*recorded_ratio
        ),
      origin_visitor_counts_low =
        ifelse(
          !is.na(origin_census_block_group),
          origin_raw_visitor_counts_low*pop/number_devices_residing,
          origin_raw_visitor_counts_low*recorded_ratio
        )
    )
  
  #Finally get visit_counts by multiplying by ratio of raw visit:visitor ratio.
  patterns <- 
    patterns %>% 
    mutate(
      visit_counts_high = origin_visitor_counts_high*raw_visit_counts/raw_visitor_counts,
      visit_counts_low = origin_visitor_counts_low*raw_visit_counts/raw_visitor_counts
    )
  
  return(patterns)
}

Demonstration

We now demonstrate the output of these functions. We first load one week of patterns data (pre-processed to include only California) along with the home panel summary data. Then, using the City of San Jose as an example, we apply both normalization techniques.

ca_weekly_200503 <- readRDS(paste0(safegraph_address,"weekly-patterns/2020-05-03-ca-weekly-patterns.rds"))
home <- readRDS(paste0(safegraph_address,"weekly-patterns/2020-05-03-home-panel-summary.rds"))
                          
sj <- ca_weekly_200503 %>% filter(city == 'San Jose')
#BAYWIDE Method
sj_visits_bay <- normBay(sj,home)
kable(
  head(sj_visits_bay,10) %>% 
    dplyr::select(location_name, raw_visitor_counts, raw_visit_counts, visit_counts) %>% 
    mutate(visit_counts = round(visit_counts))
)
location_name | raw_visitor_counts | raw_visit_counts | visit_counts
Gardner Elementary | 5 | 10 | 355
C & D Semiconductor Services | 5 | 9 | 320
Seventh Day Adventist Church | 1 | 4 | 142
Noah’s New York Bagels | 20 | 21 | 746
Musicians Mobile | 10 | 22 | 781
Friendship Hall | 7 | 8 | 284
Moon Family Care | 1 | 1 | 36
Friendly | 11 | 12 | 426
Temple Emanu El Preschool | 1 | 1 | 36
Frank M Santana Park | 16 | 19 | 675
#Compute total week visits with baywide method
sj_bay_visit_count <- 
  sj_visits_bay %>% 
  pull(visit_counts) %>% 
  sum(na.rm = T)

Bay-wide normalization for San Jose gives total visits in the week of 5/3/20 as 7,594,160.

#BLOCKGROUP Method
sj_visits_bg <- normBG(sj,home)
kable(
  head(sj_visits_bg, 27) %>% 
    transmute(
      `location name` = location_name, 
      `origin census block group` = origin_census_block_group, 
      `origin raw visitor counts high` = origin_raw_visitor_counts_high, 
      `origin raw visitor counts low` = origin_raw_visitor_counts_low, 
      `visit counts high` = round(visit_counts_high), 
      `visit counts low` = round(visit_counts_low)
    )
)
location name | origin census block group | origin raw visitor counts high | origin raw visitor counts low | visit counts high | visit counts low
Gardner Elementary | 060855018002 | 4 | 2 | 252 | 126
Gardner Elementary | NA | 1 | 3 | 53 | 158
C & D Semiconductor Services | 060855045053 | 4 | 2 | 175 | 88
C & D Semiconductor Services | NA | 1 | 3 | 47 | 142
Seventh Day Adventist Church | NA | 1 | 1 | 105 | 105
Noah’s New York Bagels | 060855068031 | 4 | 2 | 241 | 121
Noah’s New York Bagels | 060855030033 | 4 | 2 | 156 | 78
Noah’s New York Bagels | 060855120312 | 4 | 2 | 104 | 52
Noah’s New York Bagels | NA | 8 | 14 | 221 | 387
Musicians Mobile | 060855014022 | 4 | 2 | 345 | 172
Musicians Mobile | 060855023011 | 4 | 2 | 410 | 205
Musicians Mobile | NA | 2 | 6 | 116 | 348
Friendship Hall | NA | 7 | 7 | 211 | 211
Moon Family Care | NA | 1 | 1 | 26 | 26
Friendly | 060855034012 | 4 | 2 | 211 | 105
Friendly | 060855123121 | 4 | 2 | 96 | 48
Friendly | NA | 3 | 7 | 86 | 201
Temple Emanu El Preschool | 060855004002 | 1 | 2 | 50 | 101
Temple Emanu El Preschool | NA | 0 | 0 | 0 | 0
Frank M Santana Park | 060014446022 | 5 | 5 | 157 | 157
Frank M Santana Park | 060855064012 | 4 | 2 | 173 | 86
Frank M Santana Park | NA | 7 | 9 | 219 | 281
Xplore Yoga | 060855044102 | 4 | 2 | 153 | 76
Xplore Yoga | 060855043181 | 4 | 2 | 185 | 93
Xplore Yoga | 060855022012 | 4 | 2 | 195 | 98
Xplore Yoga | 060133270001 | 4 | 2 | 156 | 78
Xplore Yoga | NA | 28 | 36 | 1039 | 1335

Notice how the output table includes multiple rows for locations with recorded origin census block groups, one row per origin block group. Rows with an NA block group hold the remaining “unrecorded origin” visitors. The upper and lower bounds for raw visitor counts are shown along with their effect on the overall visit counts. Note that the labels “high” and “low” refer specifically to the choice of 4 vs. 2 for recorded origins with a raw visitor count shown as 4; the effect on the raw visitor counts for unrecorded origins is flipped, and it is possible for the effect on the final visit counts to be flipped as well.

#Compute total week visits with blockgroup method
sj_bg_visit_count_high <- 
  sj_visits_bg %>% 
  pull(visit_counts_high) %>% 
  sum(na.rm = T)

sj_bg_visit_count_low <- 
  sj_visits_bg %>% 
  pull(visit_counts_low) %>% 
  sum(na.rm = T)

if(sj_bg_visit_count_low > sj_bg_visit_count_high){
  hold <- sj_bg_visit_count_low
  sj_bg_visit_count_low <- sj_bg_visit_count_high
  sj_bg_visit_count_high <- hold
}

Blockgroup-based normalization for San Jose gives total visits in the week of 5/3/20 as 6,488,523-6,921,297.

The ratio of the output of the blockgroup method to the baywide method is 0.85-0.91.
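This ratio can be computed directly from the totals above (a quick check, not part of the original workflow):

#Ratio of blockgroup-based visit totals to the baywide total.
sj_bg_visit_count_low/sj_bay_visit_count
sj_bg_visit_count_high/sj_bay_visit_count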

We believe that the blockgroup approach, given its higher spatial resolution, is a more accurate way of normalizing Safegraph data.

Contact Derek Ouyang if you have any questions.