The following is a brief guide to a few R concepts I think are going to be essential to any R-based work students do in 218Z. Note that this probably won’t work for those starting from scratch; you probably need to have used R in some capacity in the past, and I’m refreshing your memory and pointing you to packages and techniques you might not have directly used.

First note that if you are reading this you are likely starting on http://stanfordfuturebay.github.io/rbasics. You’re seeing text interspersed with code and code outputs. This is an .Rmd file that has been “knitted” into an HTML file, which is then published using Github Pages. I’ll be encouraging all student work to be shared internally and externally using this technique. I’ll be including explanation of how to format such a document, since this itself is a document formatted in that way.

The basics of setting up your R project work environment

  • Download the latest version of R here.
  • Download RStudio here.
  • Make sure you’re set up fully on Slack. You should be on a main #218z channel and probably on many more soon. Note you can insert code using the “Code block” option. Otherwise, on Slack you can just say “hey my code is updated on the covid19 repo” and others will know how to access your code.
  • (UPDATE: In April we switched to a new server, “P Drive”. See new steps here.) Get connected to our server which stores big files including “restricted files” you won’t be allowed to download onto your machine. Click here for instructions. This tends to need troubleshooting, especially on a Mac, and is very important to get working, so work on this one ASAP and reach out to Neel on Slack for help.
  • Make sure your GitHub account is working. When you click here you should be able to see documents.
  • Download GitHub Desktop here. Follow the online tutorials to clone stanfordfuturebay/covid19 onto your own machine. That means you should be able to go to a folder on your computer and see files like rbasics.Rmd.

Once all the above is ready to go, then I’d recommend you open up RStudio, open rbasics.Rmd, and then you should be able to follow along in the .Rmd version of what you also see on the webpage.

Tips for the RStudio interface

  • Double check your RStudio version under Help > About RStudio, and your R version under Tools > Global Options. You will also always be able to check package versions under Tools > Check for package updates. When you experience errors, we’ll want to be able to investigate version compatibility as a source of the error. As of now, I (Derek) am using RStudio Version 1.2.1335 and R Version 3.6.1.
  • Typically the first thing I do is set the working directory for whatever I’m working. That’s typically the bottom right window, “Files” tab, navigate to your desired folder (usually the repo you want to ultimately push to), then click the gear button > Set as working directory.
  • In the script window itself, the white gear wheel has an option to show chunk output “inline” or “in console”. By default it’s “inline”, but I almost always switch to “in console”.
  • Reminder that you can type commands directly into the console. Usually I use this to view data frames, typing View(whatever), view field names quickly using colnames(whatever), map things quickly using mapview(whatever), etc.
  • For running script “inline”, you can put your blinking typing cursor on any part of a line and do Ctrl+Enter (or the Mac equivalent) and the whole line will run. If you only want to run part of a line (especially %>% pipelines), then you highlight a section before typing Ctrl+Enter. Ctrl+Shift+Enter will run the entire chunk. I also use Ctrl+Up/Dn and Ctrl+PgUp/PgDn for fast navigation through the document.
  • Generally keep in mind that most stuff is “designed” to be easy and you often have to do the Googling yourself to seek out those efficiency gains. Don’t assume that something can’t be done, and so you have to be stuck doing something that seems difficult to you. This is of course very true when it comes to functions and packages too.
  • You can type ?somefunction in the console to get a quick tutorial in the bottom right window, though this isn’t always as useful as Googling.
  • Comment out and uncomment lines by putting your cursor anywhere on the line and typing Ctrl+Shift+C.

How .Rmd is structured

We’ll be making the full use of .Rmd format capabilities, which involves knitting to HTML.

When you open a new .Rmd file, you’ll notice a template is provided. For the most part though, I tend to just erase that and copy/paste script from existing .Rmd files.

At the top, between two --- lines, is something called YAML (I leave you to Google to your heart’s desire for more insight, and just give you the basic orientation here), which feeds high-level information to the “knitting” operation, which takes your document and does the “HTML web development” for you. For example, it’s where you can select preset table of contents parameters so you never have to bother with designing the HTML. Just know there are many cool parameters you can learn to use up there to make your web page even cooler; otherwise, ignore it.

The template gives you a setup chunk, which I’m including now. A chunk is created by a pair of three back-ticks, and a bracketed set of parameters. Note that you can quickly create a chunk using Ctrl+Alt+I, one of the most common shortcuts I use.

library(knitr)
opts_chunk$set(echo = T, warning = F, message = F)

The chunk above has 2 lines. It also has setup and warning = F written after r. I’ll explain warning = F soon. The word you put right after r basically just names the chunk. This can help with quick navigation using the drop-down menu on the bottom of this script window; otherwise I usually don’t name my chunks out of laziness. It might be easier to organize using headers in the non-chunk areas using hashtags, as you’ve seen me use already. Ctrl+Shift+O opens a Google Doc style outline on the right tab and can help with navigating your document.

I assume you already know how to install packages and the importance of library() to load packages. knitr is basically a must for turning .Rmd to HTML. opts_chunk$set() comes from knitr, and it sets global options for how every chunk is displayed in your HTML.

  • echo = T (you can type T/F or TRUE/FALSE) shows the code in a gray box. Keep in mind you can see how this looks in the web page version.
  • warning = F and message = F prevents a lot of annoying red messages from showing up in the web page, that otherwise you sometimes see in the RStudio console. Trust me this is a good idea to not show in your web pages.
  • Another useful one we’re setting as a global option is include = F. This would execute code in the background but not show the code itself on the web page. Usually the chunk above would not be included (but you would need to run the code itself in the background).
  • There are other potentially useful parameters you can Google.
  • Note these any of these can otherwise be individually listed after commas in the {r} bracket, affecting only the one chunk. That’s why the chunk above had warning = F, to prevent a warning that might show up when running library(knitr) (just a versioning notice).

Next you would usually have a “loading libraries” chunk. I usually do libraries and options in one chunk.

library(tidyverse)
library(plotly)
library(sf)
library(mapview)
library(leaflet)
library(tigris)
library(censusapi)

options(
  tigris_class = "sf"
)

Sys.setenv(CENSUS_KEY="c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")

This happens to be the “essential” packages in my tutorial, and we’ll go over each one by one.

For options(), there are plenty you might end up learning to use, specific to different packages. tigris_class = "sf" just happens to be an important one for tigris functions, which we’ll explain later. I’ll also explain the Sys.setenv() function later in the censusapi section.

Otherwise, there’s not much else unique to .Rmd files. Most of the action of course happens in the code chunks themselves. The writing that’s happening in non-chunk areas is in “markdown” language, like an easy HTML, and you can find really good cheat sheets online for bold, italics, tables, etc. which honestly I just look up when I want to do something specific.

Lastly, just note there’s a “Knit” button next to the white gear, and that’s what you ultimately click to export HTML. If successful, it’ll put a rbasics.html into your working directory. You are welcome to try it, though you don’t have to push this back to GitHub.

Speaking of: some pointers on using GitHub

I myself know very little about GitHub, and just get by with the “user friendly” version that’s GitHub Desktop. Basically, GitHub is like Google Drive but you have to actively “fetch/pull” to get updates from the cloud downloaded to your machine, and “commit/push” to send your changes to the overwrite what’s in the cloud. We have repos which are like Drive folders with members who can edit. GitHub keeps track of changes and often irritates you by being very very meticulous about it. Best practice is to always pull before you get started on anything, to make sure you’re working off the latest version someone else might have edited, and then push regularly and especially when you’re done working on something, so others can access it. You’re trying to avoid branches that can’t be merged. If you want to play it safe, then you just create your own “something_yourname.Rmd” copies of everything that nobody else would edit, which is probably fine to start with.

GitHub Desktop makes things easy, similar to having a Drive or Dropbox folder synced to your desktop. But you still need to use the GitHub Desktop window to do pull/push commands carefully. Honestly the best thing to do is just go on the GitHub Desktop site and do their tutorial if you’ve never used it. Practice cloning the covid19 repo, maybe make a copy of this script with your own name, and then practice pushing it back to the covid19 repo where everyone will see it. You’re going to want to get comfortable with this quickly.

The final step of publishing your HTML involves cloning the stanfordfuturebay.github.io repo, manually copying/moving your HTML file in that repo, and then pushing. It’s as simple as that. This is a special feature repo that specifically plugs into the GitHub Pages tool, and basically that special repo treats its contents like web page contents. whatever your name.html file is named, a minute or so after pushing, you can go to stanfordfuturebay.github.io/name URL like any webpage. Note that when you start working with more complex interactive maps and charts, you might start to have other support files/folders that get generated by the “knitting” function that you have to copy/paste over along with the .html file. And that’s basically all there is to it. Practice this as much as you want to start to create your own URLs (though at some point we’ll start to structure/manage this shared webpage more carefully).

OK, now we’ll finally get into R coding.

tidyverse

You probably need to have used tidyverse packages in the past to be able to follow a lot of my code. By far the best place to actually learn it is here (full disclosure: I myself haven’t gone through the full textbook and learn as I go). Here I’ll just highlight some of the key functionality.

  • read_csv() and write_csv() from readr within tidyverse, especially important to load in some datasets we’re using, and sharing outputs to those who work in Excel.
  • paste0() is pretty common when working with filepaths.

The following chunk shows a quick example. You need (UPDATE: Now P Drive) F Drive access to do this. It’s worth using this to verify that your F Drive access works. If you’re off campus, you’ll need to get the VPN running, as noted in the F Drive guide. If this isn’t working yet, you can practice with any CSV.

year <- 2019
quarter <- 1
type <- "Electric"

filename <- 
  paste0(
    "P:/SFBI/Data Library/PG&E/PGE_", 
    #"/Users/douyang1/F/Data Library/PG&E/PGE_", # Mac version probably looks something like this
    year,
    "_Q",
    quarter,
    "_",
    type,
    "UsageByZip.csv"
  )

test <- read_csv(filename)

write_csv(test, "test.csv") # This will go to your working directory.
  • Pipes %>% are critical to understand how to use. This is good to review. Basically, pipes are a shortcut for something you can always do in a more convoluted way, so it’s to your benefit that you learn to use them as much as possible. But when solving a coding problem, it’s often helpful to break things back down to incremental steps.
  • I’d say the most common functions from dplyr, typically used in conjunction with pipes, are mutate(), select(), filter(), group_by(), summarize(), and left_join(). Please use the textbook to go through examples of each of these, and if you can manipulate random dataframes of your own successfully and understand what’s happening, you’re good to go.
  • Note that I select() shows up in a few different packages, so I always write dplyr::select() to specify the package to avoid errors. For most other functions you don’t need to do the package:: specification, but just know that you can always do it.

Briefly to base R

  • For saving dataframes in a compressed format that’s optimized for R, I use saveRDS() and then load with readRDS(). For multiple dataframes and other objects in a collection, you can use save() and load(). Note that a key advantage to readRDS() over load() is that you can specify the variable name as you load something in. In that sense, it’s more similar to read_csv().
saveRDS(test, file = "test.rds")
save(test, file = "test.Rdata")

load("test.Rdata")
test_with_name_changed <- readRDS("test.rds")
# notice how you don't have a direct name change option with load()

save.image(file = "test.Rdata")
# this saves everything in your environment. Note it overwrites test.Rdata.
  • For loops are pretty common. Here’s how you’d do something quick and easy with base R for loops and rbind()
year <- 2019
quarters <- 1:4
type <- "Electric"

test_2019data <- NULL # This makes the variable as an empty container so it's ready to be used in the first iteration of the for loop.

for(quarter in quarters){
  
  print(quarter) # Note how I print something so I can see progress in the console.
  
  filename <- 
    paste0(
      "P:/SFBI/Data Library/PG&E/PGE_",
      year,
      "_Q",
      quarter,
      "_",
      type,
      "UsageByZip.csv"
    )
  
  temp <- read_csv(filename)
  
  test_2019data <-
    test_2019data %>% 
    rbind(
      temp
    )
  # Note rbind requires field names to be consistent for every new thing that you add.
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
  • There are more efficient ways to loop than for loops, particularly map() and map_dfr() from purrr in tidyverse. One thing I like about for loops is that it allows you to more easily debug by seeing exactly where something failed, and also create save points along the way. We’ll give guidance if anything we do gets computationally expensive enough that every second becomes precious.
  • Generally, notice the indenting conventions I use. It’s entirely up to you (the code generally knows how to run itself regardless of your indenting), but I’d recommend adopting something that makes your code as legible as possible.

ggplot2 from tidyverse

See basics here. Once you use it successfully a few times then you get the hang of the syntax, and just copy/paste old scripts and use them over and over again.

The example demonstrates one quick plot, but particularly how a dataframe can be piped directly to ggplot(). Note + signs for anything after the first ggplot(). Also note the use of as.factor() to turn months from normal numbers to discrete choices, which helps when plotting the x axis (remove it to see what I mean).

Also note that since the chunk below actually provides an output (instead of just storing the result within an object), then when you knit to HTML, the output will show. You will tend to display ggplot2 results (sometimes converting to an interactive object with something like plotly), sometimes tables using kable(), and interactive maps using mapview or leaflet. For maps and ggplot2 objects, you can add fig.cap in the top bracket to render a caption.

chart <-
  test_2019data %>% 
  filter(CUSTOMERCLASS %in% c("Elec- Residential","Elec- Commercial")) %>% 
  group_by(MONTH, CUSTOMERCLASS) %>% 
  summarize(TOTALKWH = sum(TOTALKWH, na.rm=T)) %>% 
  ggplot(
    aes(
      x = MONTH %>% as.factor(),
      y = TOTALKWH,
      fill = CUSTOMERCLASS
    )
  ) + 
  geom_bar(
    stat = "identity",
    position = "stack"
  ) + 
  labs(
    x = "Month", 
    y = "KWH", 
    title = "PG&E Territory Monthly Electricity Usage, 2019"
  ) + 
  scale_fill_discrete( # this add-on lets you rename the legend options.
    name="Electricity Type",
    labels = c("Residential","Commercial")
  )

chart # This is what creates an output in HTML.
Figure name

Figure name

Here’s an example that uses ggplotly() from plotly to make the chart interactive.

chart %>% ggplotly()

sf, tigris, and mapview/leaflet

We’re now moving to sf which is the standard package and GIS format nowadays for any geospatial work in R. This is the best overall reading. For our purposes, I’ll first show how to bring in census geographies using tigris. Here’s a webinar about tigris.

ca_counties <- counties("CA", cb=F, progress_bar = FALSE)
# Note you need to have included the option in the top chunk for this to automatically be in sf format. cb=F gives the most detailed geometry shape.

It’s important in GIS work to know your coordinate reference system. Here’s how to check for an object.

st_crs(ca_counties)
## Coordinate Reference System:
##   User input: 4269 
##   wkt:
## GEOGCS["GCS_North_American_1983",
##     DATUM["North_American_Datum_1983",
##         SPHEROID["GRS_1980",6378137,298.257222101]],
##     PRIMEM["Greenwich",0],
##     UNIT["Degree",0.017453292519943295],
##     AUTHORITY["EPSG","4269"]]

EPSG: 4269 refers to this. I won’t explain geography concepts here in detail, but just note that when joining two sf objects you need matching coordinate reference systems, and these EPSG codes are recognized by sf as a standard format. tigris data is always in 4269. You can transform using st_transform(). Google Maps coordinates are in EPSG:4326. GIS operations in planar geometry (like buffering) need to happen in a projected coordinate system, and I typically use EPSG:26910 for that, or my own custom version that is converted from meters to feet, and written out in a more detailed way. The chunk below demonstrates how to go back and forth between all of these.

ca_counties_transformed <- 
  ca_counties %>% 
  st_transform(4326) %>% 
  st_transform(26910) %>% 
  st_transform(st_crs(ca_counties)) # back to original

projection <- "+proj=utm +zone=10 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=ft +no_defs"

ca_counties_transformed <-
  ca_counties %>% 
  st_transform(projection)

ca_counties_centroids <-
  ca_counties_transformed %>% 
  st_centroid() # note the warning that shows up which is a key kind of assumption in geography. The UTM Zone 10 is planar focused on NorCal so slight disruptions due to the curvature of the earth are negligible.

The fastest way to map is using mapview(). You can add map layers together with +. (UPDATE: Adding multiple maps together no longer seems to work with mapview. Instead I am showing how to do this in leaflet, which is my recommended advanced mapper anyway.)

# mapview(ca_counties_centroids) +
#   mapview(ca_counties_transformed)

leaflet() %>% 
  addTiles() %>% 
  addMarkers(
    data = ca_counties_centroids %>% 
      st_transform(4326)
  ) %>% 
  addPolygons(
    data = ca_counties_transformed %>% 
      st_transform(4326)
  )

This will usually show up right in the bottom right window. If it’s too large, it’ll open up a browser (and that’s a hint that the map is probably too complex). Usually the map is too large because it’s rendering all the dataframe data to be viewable when clicking on features. If you don’t need them all, I generally recommend getting rid of information.

ca_counties %>% 
  dplyr::select(NAME) %>% # Note that an sf object will retain the geometry field through a select() command.
  mapview()

Notice here that it does a lot of automatic formatting in the background, in this case putting color on the names, which is not a meaningful kind of color representation. So this is a reminder that with mapview sometimes it does things you don’t want it to do, and then it’s a bit difficult to customize, and you might as well switch to more sophisticated set of tools like the leaflet package, which we’ll introduce as appropriate.

Here’s a quick way to get to just Bay Area counties. You might keep this handy as its own .rds file for your own purposes.

bay_area_county_names <-
  c(
    "Alameda",
    "Contra Costa",
    "Marin",
    "Napa",
    "San Francisco",
    "San Mateo",
    "Santa Clara",
    "Solano",
    "Sonoma"
  )

bay_area_counties <-
  ca_counties %>%
  filter(NAME %in% bay_area_county_names)

mapview(bay_area_counties)

I personally never like the way these files include lots of random water area, and the random islands in the middle of nowhere. tigris has water shapes we can use to cut with. The chunk below includes many more sophisticated operations that I don’t have the time to explain right now. But I’d recommend you break this apart and run piece by piece (highlight from top to right before a pipe and Ctrl+Shift+Enter) to learn what’s happening. I won’t explain everything that’s happening in detail (though you’re welcome to ask), but just demonstrate what GIS work often ends up looking like, and involves using a lot of different tricks and Googling how they work. We’ll walk through specific GIS techniques as appropriate, but you’re always welcome to try things on your own and learn new tricks.

water <- 
  bay_area_county_names %>% 
  map_dfr(function(county){ # this is the tidyverse version of a for loop. You can practice writing this the normal for loop way.
    area_water("CA", county, progress_bar = FALSE) %>% as.data.frame()
  }) %>% 
  st_as_sf() %>% 
  st_set_crs(st_crs(ca_counties)) %>% 
  st_union() %>% 
  as_Spatial() %>% 
  sp::disaggregate() %>% 
  st_as_sf() %>% 
  st_set_crs(st_crs(ca_counties)) %>% 
  mutate(area = st_area(.) %>% as.numeric()) %>% 
  filter(area == max(area)) %>% 
  dplyr::select(-area)
mapview(water)
# Get rid of the weird island zone attached to the San Francisco shape
bay_area_counties[which(bay_area_counties$NAME == "San Francisco"),] <- 
  bay_area_counties %>%
  filter(NAME == "San Francisco") %>% 
  as_Spatial() %>% 
  sp::disaggregate() %>% 
  st_as_sf() %>% 
  st_set_crs(st_crs(ca_counties)) %>% 
  mutate(area = st_area(.) %>% as.numeric()) %>% 
  filter(area == max(area)) %>% 
  dplyr::select(-area)

bay_area_counties_fixed <-
  bay_area_counties %>% 
  st_difference(water)
  
mapview(bay_area_counties_fixed)

The most common tigris datasets to bring in are counties, cities, tracts, and block groups. The next chunk demonstrates how to do another common operation, a spatial join.

ca_cities <- places("CA", cb = F, progress_bar = FALSE)

bay_area_cities <- 
  ca_cities %>%
  dplyr::select(PLACEFP) %>% 
  st_centroid() %>% # convert to centroid for a clean spatial join
  st_join(bay_area_counties, left = F) %>% # left=F keeps only matches on both sides of the join, in this case only cities that are within the bay area counties
  st_set_geometry(NULL) %>% # Remove the point geometry since we're going to get back the shape geometries
  left_join(ca_cities %>% dplyr::select(PLACEFP)) %>% 
  st_as_sf() # Since the left side has "forgotten" that it is an sf object, but you just got a geometry field from the right side, this helps it "acknowledge" that.

mapview(bay_area_cities)

You can practice a similar approach to get “only the census tracts in a given city”.

censusapi

censusapi is a package that allows for direct pulling of data from the American Communities Survey and other Census datasets. This is a really good overview (as you can see, also written as a .Rmd file).

Recall in one of the top chunks we had a Sys.setenv() call. As you can see from the censusapi tutorial, this is needed to set an API key to access the Census API. You should request your own.

Below is a full workflow to get some demographic data mapped. It has many steps that deserve a deeper dive explanation which we’ll do depending on your project needs (and this ultimately is a signal of some of the background R knowledge we need you to bring to the table, or at least the familiarity to be able to study this piece by piece on your own and ask pointed questions). Typically, I would use this in conjunction with https://data.census.gov/cedsci/ and https://api.census.gov/data.html which I’m surfing to look for specific census variables I’m interested in.

acs_vars <-
  listCensusMetadata(
    name = "2018/acs/acs5",
    type = "variables"
  )
# This takes a bit of time to load. I usually save an .rds version on my machine to more quickly reload in the future.

smc_sexbyage <-
  getCensus(
    name = "acs/acs5",
    vintage = 2018,
    region = "block group:*", 
    regionin = "state:06+county:081",
    vars = "group(B01001)"
  ) %>%
  mutate(
    blockgroup =
      paste0(state,county,tract,block_group)
  ) %>% 
  select_if(!names(.) %in% c("GEO_ID","state","county","tract","block_group","NAME")) %>%
  dplyr::select(-c(contains("EA"),contains("MA"),contains("M"))) %>%
  gather(
    key = "variable",
    value = "estimate",
    -blockgroup
  ) %>%
  mutate(
    label = acs_vars$label[match(variable,acs_vars$name)]
  ) %>% 
  select(-variable) %>% 
  separate(
    label,
    into = c(NA,NA,"sex","age"),
    sep = "!!"
  )

# This wasn't necessarily the quickest way to get to % elderly figures, which you can find more readily processed in a census dataset, but the way I loaded in data above holds a rich amount of population breakdown that can be used for different purposes.

smc_elderly <- 
  smc_sexbyage %>% 
  filter(!is.na(age)) %>% 
  mutate(
    elderly = 
      ifelse(
        age %in% c(
          "65 and 66 years",
          "67 to 69 years",
          "70 to 74 years",
          "75 to 79 years",
          "80 to 84 years",
          "85 years and over"
        ),
        estimate,
        NA
      )
  ) %>% 
  group_by(blockgroup) %>% 
  summarize(
    elderly = sum(elderly, na.rm = T),
    total_population = sum(estimate, na.rm = T)
  ) %>% 
  mutate(
    percent_elderly = elderly/total_population
  )
smc_blockgroups <- block_groups("CA","San Mateo", cb = F, progress_bar = F)

smc_elderly %>% 
  left_join(smc_blockgroups %>% dplyr::select(GEOID), by = c("blockgroup" = "GEOID")) %>% 
  st_as_sf() %>% 
  filter(!is.na(percent_elderly)) %>% 
  mapview(zcol = "percent_elderly")

The last sections here on sf and censusapi definitely bring in many more techniques that go beyond a “Basics” tutorial, but just note that if you’re to be successfully working in R in this quarter, by the end you should basically be able to design scripts that look like what you’ve seen here. If you were able to follow most of this, then I’m confident you’ll be able to get there. As always, use #218z-r to ask specific questions about anything you don’t understand or don’t know how to do in R.