4.2 Training and testing data

As an extension of the logistic regression technique, let’s use a logistic model “trained” on a portion of our PUMS data to try to predict the language of some other “testing” portion of our data, and explicitly evaluate how well the model performs in this test, relative to the actual values from the PUMS data. First, let’s use sample() in a similar way to how we used it in Section 3.1 to split our PUMS data. We’ll pick an arbitrary 80% of the PUMS data to use as “training” data and the other 20% to “test” our model.

# Randomly flag each row TRUE (train) or FALSE (test), with 80%/20% odds
sample <- sample(
  c(TRUE, FALSE), 
  nrow(bay_pums_language), 
  replace = TRUE, 
  prob = c(0.8, 0.2)
)

train <- bay_pums_language[sample, ]
test <- bay_pums_language[!sample, ]
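One caveat: sample() draws randomly, so the split (and every result downstream) will change on each run. Calling set.seed() first makes the split reproducible. A minimal sketch, using a stand-in data frame since bay_pums_language was built in an earlier section (the seed value itself is arbitrary):

```r
df <- data.frame(x = 1:1000)  # stand-in for bay_pums_language

set.seed(2023)  # fix the random stream; any integer works
in_train <- sample(c(TRUE, FALSE), nrow(df), replace = TRUE, prob = c(0.8, 0.2))

train <- df[in_train, , drop = FALSE]
test <- df[!in_train, , drop = FALSE]

nrow(train) / nrow(df)  # roughly 0.8, and identical on every rerun
```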

Then, we run glm() with train:

logit_model_training <- glm(
  english ~ AGEP + white + hispanic,
  family = binomial(),
  data = train,
  weights = PWGTP
)

summary(logit_model_training)
## 
## Call:
## glm(formula = english ~ AGEP + white + hispanic, family = binomial(), 
##     data = train, weights = PWGTP)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -25.086   -4.087    2.404    3.699   23.262  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.577e-01  1.934e-03  -495.1   <2e-16 ***
## AGEP         1.010e-02  4.009e-05   252.0   <2e-16 ***
## white        1.592e+00  1.782e-03   893.5   <2e-16 ***
## hispanic    -1.143e+00  4.886e-03  -234.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8459524  on 300603  degrees of freedom
## Residual deviance: 7416756  on 300600  degrees of freedom
## AIC: 7416764
## 
## Number of Fisher Scoring iterations: 6

Then, we create a set of predictions for the test dataset with this trained model. Note that we can simply supply newdata = test, and predict() will pick out the correct fields from test to use as the independent variables; setting type = "response" returns predicted probabilities rather than log-odds. The resulting predictions are distinct from the “real” outcomes, which remain untouched in the field english.

test_predicted <-
  predict(logit_model_training, newdata = test, type = "response")
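To see what type = "response" buys us, here is a toy demonstration on the built-in mtcars data (standing in for our PUMS data): by default, predict() on a glm returns values on the link scale (log-odds), while type = "response" applies the inverse logit to give probabilities.

```r
# Toy logistic model: automatic/manual transmission as a function of weight
m <- glm(am ~ wt, family = binomial(), data = mtcars)

p_link <- predict(m, newdata = mtcars[1:3, ])                     # log-odds scale
p_prob <- predict(m, newdata = mtcars[1:3, ], type = "response")  # probability scale

# The two scales are related by the inverse logit transform
all.equal(unname(p_prob), unname(1 / (1 + exp(-p_link))))
```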

Finally, as one way of evaluating the performance of this model, we can create a 2x2 matrix using table(). table() is a general function for cross-tabulating counts; if we give it the “real” values of english (yes or no) as one vector, and our predicted probabilities converted to TRUE/FALSE at a threshold of 0.5 as another, then table() will give us the counts of each of the four possible pair combinations of these two vectors.
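As a quick toy illustration of this pairing behavior (with made-up vectors, not our PUMS data):

```r
actual    <- c("Yes", "No", "Yes", "No", "Yes")
predicted <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# table() counts every (actual, predicted) combination;
# e.g. conf_toy["Yes", "TRUE"] is 2, since two rows pair "Yes" with TRUE
conf_toy <- table(actual, predicted)
conf_toy
```

The same mechanics produce the 2x2 matrix for our actual test data below.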

summary_2x2 <-
  test %>% 
  mutate(
    english = ifelse(
      english == 1, 
      "Yes (English)", 
      "No (ESL)"
    )
  ) %>% 
  pull(english) %>% 
  table(test_predicted > 0.5)

summary_2x2
##                
## .               FALSE  TRUE
##   No (ESL)      22919  9065
##   Yes (English) 12923 30546

This 2x2 matrix can be read as follows:

  • The bottom-right cell is the number of test records who were truly English speakers and whom our model correctly predicted as such, using just the variables of age, White/non-White, and Hispanic/non-Hispanic.
  • The top-left cell is the number of test records who were ESL speakers and whom our model also correctly predicted. Together, these two cells mean 71% of test records were correctly predicted one way or the other.
  • The top-right cell is the number of test records who were ESL speakers but whom our model incorrectly predicted to be English speakers. Treating “English speaker” as the positive class, these are called “false positives” or “Type I errors”.
  • The bottom-left cell is the number of test records who were English speakers but whom our model incorrectly predicted to be ESL speakers. These are called “false negatives” or “Type II errors”.
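These four counts can be rolled up into standard summary measures. A sketch, with the counts hard-coded from the table above (in practice you could index summary_2x2 directly; “English speaker” is treated as the positive class):

```r
true_neg  <- 22919  # ESL, correctly predicted ESL
false_pos <- 9065   # ESL, incorrectly predicted English
false_neg <- 12923  # English, incorrectly predicted ESL
true_pos  <- 30546  # English, correctly predicted English

accuracy    <- (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)
sensitivity <- true_pos / (true_pos + false_neg)  # share of English speakers caught
specificity <- true_neg / (true_neg + false_pos)  # share of ESL speakers caught

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
# accuracy is ~0.709, matching the 71% noted above
```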

Depending on the purpose of the model, one may have slightly different objectives here, but generally one would want to limit both false positives and false negatives. This “training” and “testing” technique can be applied to simple linear regression models too, where the objective may be to reduce the average error between the predicted result and the “real” result in the test data.
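For the linear regression case, the same workflow might look like the following sketch, using the built-in mtcars data as a stand-in and a fixed handful of held-out rows for brevity (the model formula and held-out rows are arbitrary choices for illustration):

```r
test_rows <- c(1, 5, 9, 15, 20, 28)  # hold out ~20% of mtcars' 32 rows

# Fit on the training rows only
lm_train <- lm(mpg ~ wt + hp, data = mtcars[-test_rows, ])

# Predict on the held-out rows and measure average error
pred <- predict(lm_train, newdata = mtcars[test_rows, ])
mae <- mean(abs(pred - mtcars$mpg[test_rows]))
mae  # mean absolute error on the test rows, in miles per gallon
```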