ACS, I prefer censusapi
. See https://stanfordfuturebay.github.io/course/census-data.html (and chapter 1 for any intro skills you need to review)
How you load in the metadata, which is basically https://api.census.gov/data/2018/acs/acs5/variables.html in dataframe form:
Sys.setenv(CENSUS_KEY="c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")
acs_vars_2018_5yr <-
listCensusMetadata(
name = "2018/acs/acs5",
type = "variables"
)
But since it takes a bit of time, I like to save a RDS in the repo.
acs_vars_2018_5yr <- readRDS("acs_vars_2018_5yr.rds")
ACS example TBD
PUMS microdata generally needs to be grabbed directly from FTP site, but apparently very recently an API was set up, and tidycensus
has created get_pums()
function. https://walker-data.com/tidycensus/articles/pums-data.html looks pretty straightforward, but I’ll go ahead and demo how to get to an age vs. race correlation for the PUMAs in SCC
This is new so need to download the dev version of the package.
devtools::install_github("walkerke/tidycensus")
library(tidycensus)
pums_vars_2018 <-
pums_variables %>%
filter(year == 2018, survey == "acs5")
pums_vars_2018_distinct_person <-
pums_vars_2018 %>%
distinct(var_code, var_label, data_type, level) %>%
filter(level == "person")
I have picked a specific set that I think are sufficient for our purposes, but there could be other useful variables. This exploration involves going back and forth between data dictionary and seeing samples of data.
ca_pums <- get_pums(
variables = c(
"PUMA",
"SEX",
"AGEP",
"RAC1P",
"RAC2P",
"RAC3P",
"HISP",
"LANX",
"LANP"
),
state = "CA",
survey = "acs5",
recode = T
)
I’ll go ahead and save this to the repo (19.5mb)
saveRDS(ca_pums, "ca_pums.rds")
Filter to just SCC PUMAs
ca_pumas <-
pumas("CA", cb = T, progress_bar = F)
scc_county <-
counties("CA", cb = T, progress_bar = F) %>%
filter(NAME == "Santa Clara")
scc_pumas <-
ca_pumas %>%
st_centroid() %>%
.[scc_county, ] %>%
st_set_geometry(NULL) %>%
left_join(ca_pumas %>% select(GEOID10)) %>%
st_as_sf()
scc_pums <-
ca_pums %>%
filter(PUMA %in% scc_pumas$PUMACE10)
Here’s quick summary stats on LANP_label
, which according to https://www2.census.gov/programs-surveys/acs/methodology/questionnaires/2018/quest18.pdf is best interpreted as the “primary language spoken, if not English”
scc_pums %>%
count(LANP_label, wt = PWGTP)
## # A tibble: 115 x 2
## LANP_label n
## <chr> <dbl>
## 1 Afrikaans 327
## 2 Akan (incl. Twi) 90
## 3 Albanian 8
## 4 Amharic 2841
## 5 Arabic 7411
## 6 Armenian 1317
## 7 Assyrian Neo-Aramaic 3075
## 8 Bengali 4246
## 9 Bosnian 1012
## 10 Bulgarian 869
## # ... with 105 more rows
NA is “English”. This will be something we can disaggregate by PUMA.
Loading with the replicate weights to be able to calculate standard errors
ca_pums_weights <- get_pums(
variables = c(
"PUMA",
"SEX",
"AGEP",
"RAC1P",
"RAC2P",
"RAC3P",
"HISP",
"LANX",
"LANP"
),
state = "CA",
survey = "acs5",
recode = T,
rep_weights = "person"
)
This one is a bit too large for the GitHub repo, so I’m saving in G drive.
saveRDS(
ca_pums_weights,
paste0(
path,
"PUMS/ca_pums_weights.rds"
)
)
scc_pums_weights <-
ca_pums_weights %>%
filter(PUMA %in% scc_pumas$PUMACE10)
saveRDS(scc_pums_weights,"scc_pums_weights.rds")
library(survey)
library(srvyr)
scc_survey_design <- to_survey(scc_pums_weights)
Now, same table but with standard errors
scc_survey_design %>%
survey_count(LANP_label)
## # A tibble: 115 x 3
## LANP_label n n_se
## <chr> <dbl> <dbl>
## 1 Afrikaans 327 151.
## 2 Akan (incl. Twi) 90 58.3
## 3 Albanian 8 8.21
## 4 Amharic 2841 465.
## 5 Arabic 7411 760.
## 6 Armenian 1317 307.
## 7 Assyrian Neo-Aramaic 3075 448.
## 8 Bengali 4246 539.
## 9 Bosnian 1012 357.
## 10 Bulgarian 869 225.
## # ... with 105 more rows
Quick regression model example using age and race
scc_model <-
scc_pums_weights %>%
mutate(
nonwhite = ifelse(
RAC1P_label == "White alone",
0,
1
),
nonenglish = ifelse(
LANX_label == "No, speaks only English",
0,
1
)
) %>%
to_survey(design = "rep_weights")
model <- survey::svyglm(
nonenglish ~ AGEP + nonwhite,
design = scc_model,
family = "binomial"
)
summary(model)
##
## Call:
## survey::svyglm(formula = nonenglish ~ AGEP + nonwhite, design = scc_model,
## family = "binomial")
##
## Survey design:
## Called via srvyr
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3817914 0.0237582 -16.07 <2e-16 ***
## AGEP -0.0071476 0.0003803 -18.80 <2e-16 ***
## nonwhite 1.6675958 0.0211839 78.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 97473.19)
##
## Number of Fisher Scoring iterations: 4
predict(model, data.frame(AGEP = 20, nonwhite = 1), type = "response")
## response SE
## 1 0.7582 0.0032