ACS summary data

ACS, I prefer censusapi. See https://stanfordfuturebay.github.io/course/census-data.html (and chapter 1 for any intro skills you need to review)

How you load in the metadata, which is basically https://api.census.gov/data/2018/acs/acs5/variables.html in dataframe form:

Sys.setenv(CENSUS_KEY="c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")
acs_vars_2018_5yr <-
  listCensusMetadata(
    name = "2018/acs/acs5",
    type = "variables"
  )

But since it takes a bit of time, I like to save a RDS in the repo.

acs_vars_2018_5yr <- readRDS("acs_vars_2018_5yr.rds")

ACS example TBD

PUMS

PUMS microdata generally needs to be grabbed directly from FTP site, but apparently very recently an API was set up, and tidycensus has created get_pums() function. https://walker-data.com/tidycensus/articles/pums-data.html looks pretty straightforward, but I’ll go ahead and demo how to get to an age vs. race correlation for the PUMAs in SCC

This is new so need to download the dev version of the package.

devtools::install_github("walkerke/tidycensus")
library(tidycensus)
pums_vars_2018 <- 
  pums_variables %>%
  filter(year == 2018, survey == "acs5")

pums_vars_2018_distinct_person <- 
  pums_vars_2018 %>%
  distinct(var_code, var_label, data_type, level) %>% 
  filter(level == "person")

I have picked a specific set that I think are sufficient for our purposes, but there could be other useful variables. This exploration involves going back and forth between data dictionary and seeing samples of data.

ca_pums <- get_pums(
  variables = c(
    "PUMA",
    "SEX",
    "AGEP",
    "RAC1P",
    "RAC2P",
    "RAC3P",
    "HISP",
    "LANX",
    "LANP"
  ),
  state = "CA",
  survey = "acs5",
  recode = T
)

I’ll go ahead and save this to the repo (19.5mb)

saveRDS(ca_pums, "ca_pums.rds")

Filter to just SCC PUMAs

ca_pumas <-
  pumas("CA", cb = T, progress_bar = F)

scc_county <- 
  counties("CA", cb = T, progress_bar = F) %>% 
  filter(NAME == "Santa Clara")

scc_pumas <-
  ca_pumas %>% 
  st_centroid() %>% 
  .[scc_county, ] %>% 
  st_set_geometry(NULL) %>% 
  left_join(ca_pumas %>% select(GEOID10)) %>% 
  st_as_sf()

scc_pums <-
  ca_pums %>% 
  filter(PUMA %in% scc_pumas$PUMACE10)

Here’s quick summary stats on LANP_label, which according to https://www2.census.gov/programs-surveys/acs/methodology/questionnaires/2018/quest18.pdf is best interpreted as the “primary language spoken, if not English”

scc_pums %>% 
  count(LANP_label, wt = PWGTP)
## # A tibble: 115 x 2
##    LANP_label               n
##    <chr>                <dbl>
##  1 Afrikaans              327
##  2 Akan (incl. Twi)        90
##  3 Albanian                 8
##  4 Amharic               2841
##  5 Arabic                7411
##  6 Armenian              1317
##  7 Assyrian Neo-Aramaic  3075
##  8 Bengali               4246
##  9 Bosnian               1012
## 10 Bulgarian              869
## # ... with 105 more rows

NA is “English”. This will be something we can disaggregate by PUMA.

Loading with the replicate weights to be able to calculate standard errors

ca_pums_weights <- get_pums(
  variables = c(
    "PUMA",
    "SEX",
    "AGEP",
    "RAC1P",
    "RAC2P",
    "RAC3P",
    "HISP",
    "LANX",
    "LANP"
  ),
  state = "CA",
  survey = "acs5",
  recode = T,
  rep_weights = "person"
)

This one is a bit too large for the GitHub repo, so I’m saving in G drive.

saveRDS(
  ca_pums_weights, 
  paste0(
    path,
    "PUMS/ca_pums_weights.rds"
  )
)
scc_pums_weights <-
  ca_pums_weights %>% 
  filter(PUMA %in% scc_pumas$PUMACE10)

saveRDS(scc_pums_weights,"scc_pums_weights.rds")
library(survey)
library(srvyr)

scc_survey_design <- to_survey(scc_pums_weights)

Now, same table but with standard errors

scc_survey_design %>%
  survey_count(LANP_label)
## # A tibble: 115 x 3
##    LANP_label               n   n_se
##    <chr>                <dbl>  <dbl>
##  1 Afrikaans              327 151.  
##  2 Akan (incl. Twi)        90  58.3 
##  3 Albanian                 8   8.21
##  4 Amharic               2841 465.  
##  5 Arabic                7411 760.  
##  6 Armenian              1317 307.  
##  7 Assyrian Neo-Aramaic  3075 448.  
##  8 Bengali               4246 539.  
##  9 Bosnian               1012 357.  
## 10 Bulgarian              869 225.  
## # ... with 105 more rows

Quick regression model example using age and race

scc_model <-
  scc_pums_weights %>% 
  mutate(
    nonwhite = ifelse(
      RAC1P_label == "White alone",
      0,
      1
    ),
    nonenglish = ifelse(
      LANX_label == "No, speaks only English",
      0,
      1
    )
  ) %>% 
  to_survey(design = "rep_weights")
model <- survey::svyglm(
  nonenglish ~ AGEP + nonwhite,
  design = scc_model,
  family = "binomial"
)

summary(model)
## 
## Call:
## survey::svyglm(formula = nonenglish ~ AGEP + nonwhite, design = scc_model, 
##     family = "binomial")
## 
## Survey design:
## Called via srvyr
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.3817914  0.0237582  -16.07   <2e-16 ***
## AGEP        -0.0071476  0.0003803  -18.80   <2e-16 ***
## nonwhite     1.6675958  0.0211839   78.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 97473.19)
## 
## Number of Fisher Scoring iterations: 4
predict(model, data.frame(AGEP = 20, nonwhite = 1), type = "response")
##   response     SE
## 1   0.7582 0.0032