1.6 Loops

Let’s look at for loops, a foundational technique in most programming languages, using the task of reading multiple data files as our example. A for loop is useful when, say, loading the multiple PG&E datasets, whose file names are identical except for one part that changes systematically, like Q1 to Q2 to Q3 to Q4.

For loops are written with the structure for(dummy_variable_name in range_of_real_objects) {code_to_execute}. In the example below, I’ll use this structure to loop through quarter in quarters, where quarters <- 1:4 is just a vector of the integers 1 through 4. quarter becomes a variable that holds each integer in turn as the code within the for loop is executed 4 times. I then use paste0() to combine this changing variable with an otherwise fixed set of text fragments, creating a string in the variable filename that represents the file path to one of the PG&E CSVs we’ve downloaded into the working directory (note that I put them in a sub-folder called “pge”). Then I read_csv(filename), in a similar way as we’ve practiced before, into a variable called temp, for “temporary”. Lastly, I use rbind() to “stack like pancakes” two dataframes that share the same column names: the stack built so far and the newly read temp. So by the end of the for loop, four separate CSVs have been read into R and stacked together.

library(tidyverse)

year <- 2020
quarters <- 1:4
type <- "Electric"

pge_20_elec <- NULL

for(quarter in quarters) {
  
  filename <- 
    paste0(
      "pge/PGE_",
      year,
      "_Q",
      quarter,
      "_",
      type,
      "UsageByZip.csv"
    )

  print(filename)
  
  temp <- read_csv(filename)
  
  pge_20_elec <- rbind(pge_20_elec,temp)
  # Note rbind() requires every new dataframe you add to have the same column names as the stack.

  saveRDS(pge_20_elec, "pge_20_elec.rds")
}
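
If you don’t have the PG&E files handy, the same read-and-stack pattern can be tried on toy data. The following is only a sketch under stated assumptions: the folder pge_toy, the column names, and the values are all made up for illustration, and I use base R’s write.csv() and read.csv() in place of read_csv() so that it runs without any downloads or packages.

```r
# Toy stand-ins for the four quarterly CSVs (hypothetical columns and values),
# written to a temporary folder so the sketch runs without the real PG&E files.
toy_dir <- file.path(tempdir(), "pge_toy")
dir.create(toy_dir, showWarnings = FALSE)

for (quarter in 1:4) {
  toy <- data.frame(
    ZIPCODE = c(94301, 94305),
    QUARTER = quarter,
    TOTALKWH = c(100, 200) * quarter
  )
  write.csv(
    toy,
    file.path(toy_dir, paste0("PGE_2020_Q", quarter, "_ElectricUsageByZip.csv")),
    row.names = FALSE
  )
}

# Same pattern as the loop above: build the file name, read, stack.
stacked <- NULL

for (quarter in 1:4) {
  filename <- file.path(
    toy_dir,
    paste0("PGE_2020_Q", quarter, "_ElectricUsageByZip.csv")
  )
  temp <- read.csv(filename)
  stacked <- rbind(stacked, temp)
}

dim(stacked)             # 8 rows (2 per quarter) and 3 columns
unique(stacked$QUARTER)  # 1 2 3 4
```

Each pass adds two rows to stacked, so four passes give eight rows, one pair per quarter.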

Additional notes:

  • When running a for loop in RStudio, place your cursor on the line with the for and press Ctrl+Enter, which will run the entire loop. Starting from a line inside the for loop will only run individual lines. To troubleshoot inside of a for loop, you can manually set a value for quarter, like quarter <- 1, then run individual lines within the for loop. Note that quarter is not private to the for loop: once the loop has run, quarter stays in your Environment holding its last value (here, 4), so set it explicitly before stepping through lines by hand.
  • I’ve also created variables for year and type, even though I give each of them a single fixed value which doesn’t change in the for loop. But part of the assignment at the end of this chapter will be to create nested for loops in which these other variables also become loopable.
  • For the rbind() technique to work, the variable that holds the growing “pancake stack” needs to exist before the for loop begins, hence pge_20_elec <- NULL which creates an empty container that is ready for the first iteration of the for loop.
  • I’ve included print(filename) which will display the value of filename 4 times in the Console as you run through the for loop. This is useful when you have long loops and you want to monitor progress. If you are running through tens of thousands of iterations and you don’t want a printout for every step, assuming your dummy variable row contains numbers, you could write if(row %% 100 == 0) print(row) which only prints the row number if it’s divisible by 100, which is to say, 100, 200, 300, etc.
  • paste0() joins the string fragments you give it as comma-separated arguments into one string, leaving nothing between them. paste() is similar, but it puts a space between the fragments by default, and you can add a final argument such as sep = "," to choose any delimiter you like.
  • I’ve included saveRDS() at the end of the for loop, which will go ahead and save the “pancake” at each stage. In this case this is pretty much useless and slows the for loop, but as you start building much more complicated and intensive loop operations, you might appreciate the assurance that your progress is being saved (especially if it’s possible for your for loop to crash your computer midway through, thereby risking the loss of all progress). If you don’t need to save every single step, then similar to above, you could use an if statement to trigger a save only every one hundred steps, or so.
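
A few of the notes above can be tried out in a small sketch. Everything here is illustrative rather than part of the PG&E workflow: the loop of 350 steps, the stand-in “work” of squaring a number, and the file name progress.rds are all assumptions I’ve made up to show paste0() versus paste(), and the if statement that prints and saves only every 100 steps.

```r
# paste0() joins its arguments with nothing in between;
# paste() joins them with a space unless sep says otherwise.
paste0("Q", 1, "_", "Electric")       # "Q1_Electric"
paste("Q", 1, "Electric")             # "Q 1 Electric"
paste("Q", 1, "Electric", sep = "-")  # "Q-1-Electric"

# Hypothetical long loop: report progress and save only every 100 steps.
checkpoint <- file.path(tempdir(), "progress.rds")  # hypothetical file name
results <- NULL

for (row in 1:350) {
  results <- c(results, row^2)  # stand-in for the real work

  if (row %% 100 == 0) {
    print(row)                    # prints 100, 200, 300
    saveRDS(results, checkpoint)  # overwrites the previous save
  }
}

length(readRDS(checkpoint))  # 300, the state at the last save
```

The saved file lags behind the loop by up to 99 steps, which is the trade-off for not writing to disk on every iteration.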

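Note that while for loops provide all the customizable conveniences I’ve shown, when it comes to guiding large and complicated operations, they can end up being slow compared to other loop techniques in R, which we’ll encounter in a later chapter. This is especially true when an object is grown one piece at a time, as with rbind() here, because the growing stack gets copied on every pass.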
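
As a rough preview, here is one possible alternative in base R. This is a sketch only: the later chapter may well use a different technique, and read_quarter() is a hypothetical stand-in for reading a file, returning made-up values. The idea is to collect all the pieces in a list first and stack them once at the end, instead of growing the stack on every pass.

```r
# Hypothetical stand-in for reading one quarter's CSV:
# builds a small data frame with made-up values.
read_quarter <- function(quarter) {
  data.frame(QUARTER = quarter, TOTALKWH = c(100, 200) * quarter)
}

# lapply() returns a list holding one data frame per quarter...
pieces <- lapply(1:4, read_quarter)

# ...and do.call(rbind, ...) stacks the whole list in a single step.
stacked_once <- do.call(rbind, pieces)

nrow(stacked_once)  # 8
```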