1.5 Reading and saving files

Generally you will be loading data into RStudio from some source, making it more useful or generating insights from it in some way, and then exporting that data out in raw or visualized form. Here are the common ways you’d read data into RStudio:

  1. Find a link online that lets you directly download a data file (e.g. CSV), and use that to directly load data into your Environment. If you find such a link or API, and if it’s stably maintained on the webpage, this is the easiest method.
  2. If for whatever reason it’s not easy to use the URL as the file path, but you can download the data onto your personal machine or a server, then read the data using the local file path instead. This is also very easy and often a great solution for teams that share the same server, but it assumes you have set up such a file storage solution, and you might need to worry about changing file paths for different users, as we’ve discussed.
  3. Sometimes, the thing you’re downloading from online is a zip file or Excel file, in which case there’s a slightly more involved method to read the data by creating a temporary file location on your computer. This will be demonstrated in Chapter 3.1.
  4. A variety of packages have functions designed to essentially do Option 1 but with many convenient wrappers around the process, so that instead of having to find URLs or APIs and work with them yourself, you can interact with function parameters and get various pre-processed results. We’ll experience this later in this chapter.
  5. A variation on downloading datasets from online is scraping content directly from web pages, which will be the subject of a later chapter.

In this section, we’ll demonstrate Options 1 and 2.

In either case, functions from the tidyverse package will come in handy. The tidyverse is a popular collection of packages that all “share an underlying design philosophy, grammar, and data structures”. Its creator, Hadley Wickham, was a contributor to the Data Challenge Lab course that used to be taught at Stanford (which, by the way, has a more comprehensive online curriculum that you’re encouraged to also use as a resource for learning R). We, along with many others, believe that the tidyverse is a deeply intuitive way to learn and use R, and through this curriculum we will attempt to convince you of that too.

We’ll dive much more deeply into tidyverse later, but for now, we’ll focus on the functions within its packages that are designed to read data. readr is the main one.

In an earlier chunk, I’ve already loaded library(tidyverse). The tidyverse is unusual in that calling library() on it attaches multiple core packages at once; in principle, you could load library(readr) on its own.
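
If you’re curious which packages come along for the ride, the tidyverse package provides a helper that lists them (a quick sketch; the exact output will vary by your tidyverse version):

tidyverse_packages() # lists the packages in the tidyverse, including readr and dplyr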

The tidyverse packages have great documentation, so it’s worth generally reviewing what information is available for readr. Since CSVs are one of the most common types of data format you’d find online, read_csv() is the most common function you’d use, along with its exporting equivalent, write_csv().
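
If you’d rather browse that documentation from within RStudio itself, the ? operator opens help pages in the Help window:

?read_csv         # help page for read_csv(), assuming tidyverse (or readr) is loaded
?readr::write_csv # the package-qualified form works even before calling library()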

For our example, let’s look at a large health dataset provided by the Centers for Disease Control and Prevention (CDC) called PLACES. Without going into the full background of this data project, which you can read about, I’ll simply note that this dataset provides modeled estimates of many health outcomes that are typically difficult to find public data on at the census tract level (more on geographies soon). The dataset is fairly large, so it’s also a good test of your personal machine and its ability to handle the kinds of datasets you would expect to be handling in R.

On the CDC website, if you were looking to grab the data directly from R, you’d be looking for a direct link to the .csv data, which requires some manual searching. In this case, it can be found under the “Export” button, and copying the link address yields https://chronicdata.cdc.gov/api/views/cwsq-ngmh/rows.csv?accessType=DOWNLOAD (the ?accessType=DOWNLOAD portion is not necessary). read_csv() has a straightforward schema, as shown below:

library(tidyverse)

health_full <- read_csv("https://chronicdata.cdc.gov/api/views/cwsq-ngmh/rows.csv?accessType=DOWNLOAD")

<- is used to assign the result of functions to a variable in your Environment. You can supply a variable name of your choice on the left side. There are some general rules around this, like having no spaces, not leading with a number, and not using certain special characters. You should also develop the habit of using understandable variable names while keeping them relatively short.
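
As a quick sketch of those naming rules, using toy placeholder values:

health_ca_2021 <- 5 # fine: letters, numbers, underscores, and periods are allowed
# 2021_health <- 5  # error: a variable name can't lead with a number
# health ca <- 5    # error: no spaces allowed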

Running the read_csv() line above may take a minute or more. In your version of RStudio, you should see a progress counter in the Console, and a stop sign at the top right which lets you know that something is processing. When it’s complete, you’ll see a > show up again in the Console, and you’ll see the newly created variable in your Environment. You can click on it (or type View(health_full)), which will create a new tab in the Source window to view the tabular data. As noted previously, names(health_full) in the Console quickly shows the names of the column headers (which match the data dictionary you can find on the CDC site).
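
Besides names() and View(), a few other base R functions are handy for quick inspection in the Console:

nrow(health_full) # number of rows
ncol(health_full) # number of columns
head(health_full) # print the first six rows in the Console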

Reading files can introduce lots of unique challenges, so part of getting good at R is learning how to troubleshoot the unexpected problems you’ll face when interfacing with web pages and different file types. We’ll encounter some more complicated situations throughout this curriculum and revisit advanced reading methods as appropriate.

We’ll be able to manipulate this data in a variety of ways, and generally with such large datasets you’d be interested in filtering to just the data you need, saving that locally, and then removing the larger original file. I’ll go ahead and demonstrate filter() which is from the dplyr package in tidyverse (more dplyr functions will be showcased soon):

health_ca <- filter(health_full, StateAbbr == "CA")

filter() takes as its first parameter some existing data object, and as its second parameter a logical condition. Here I’m referencing an existing field name, StateAbbr, which I know is spelled that way because I’ve reviewed the data structure, and I’m filtering to only the rows of health_full that have “CA” in the StateAbbr field. The result goes from about 2M rows down to about 224K.
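
Note that filter() can also take multiple comma-separated conditions, which are combined with AND. As a hedged sketch, MeasureId is another field I’d expect from the PLACES data dictionary, and “OBESITY” one of its values; verify both against the dictionary before relying on this:

health_ca_obesity <- filter(
  health_full,
  StateAbbr == "CA",      # same condition as before
  MeasureId == "OBESITY"  # additional condition, ANDed with the first
)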

Note that in this particular instance, the PLACES data is available through a Socrata platform, which provides some programmatic abilities to filter in the query process itself (i.e., using SQL, or Structured Query Language, techniques). On the web page, instructions for this approach can be found under API > API Docs. Essentially, a filter like the one we just performed can also be achieved as part of the read_csv() step by using the URL https://chronicdata.cdc.gov/resource/cwsq-ngmh.csv?&stateabbr=CA&$limit=1000000. If the full file takes too long to load, this approach is more efficient because it avoids loading more than you need. But the webpage you’re reading data from won’t always provide this type of convenience.
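
In code, that query URL drops straight into read_csv(). Note that the Socrata field names in the URL are lowercase, and $limit=1000000 raises Socrata’s default row cap so no California rows are cut off:

health_ca <- read_csv("https://chronicdata.cdc.gov/resource/cwsq-ngmh.csv?&stateabbr=CA&$limit=1000000")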

For Bay Area work we could further filter to find just the cities in the Bay Area, but for now we’ll go ahead and move on to other examples.

Now let’s try an example that uses a local file path. PG&E is the local energy utility for the Bay Area, and it provides some ZIP-code-level data to the public online. Because the downloads come packaged as zip files, this site could also serve as a good example of Option 3 presented before. But for our purposes, this is a good opportunity to first try Option 2, which involves manually downloading data directly to your local machine. Go ahead and download 2020 electricity usage for all four quarters, and make sure to unzip the CSV contents of the zip folders directly into your “working directory”, which you should have created at the end of the last section. You should see these CSVs immediately in your Files window.

Since we are using the working directory, you can use relative file paths as follows:

pge_20_q1_elec <- read_csv("PGE_2020_Q1_ElectricUsageByZip.csv")

No additional file path information is required; RStudio will by default look inside the working directory for a “PGE_2020_Q1_ElectricUsageByZip.csv” unless you specify otherwise. This means it’s generally easiest to put input data as well as output contents directly in the working directory, but since this is likely also a cloned repo, you will have constraints on file size if you are syncing to GitHub (100MB per file). So as you start working with bigger files, you’d generally interface instead with a server location (e.g., Google Drive File Stream) and provide a longer file path.
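
For comparison, here’s what such a longer file path might look like; the path below is purely hypothetical, and you’d substitute the location of your own shared drive:

pge_20_q1_elec <- read_csv("G:/Shared drives/team/data/PGE_2020_Q1_ElectricUsageByZip.csv")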

Now let’s practice saving some outputs. If you want to create a CSV, then write_csv() is your best bet. Here’s what it would look like to save your California-filtered version of the CDC dataset:

write_csv(health_ca, "health_ca.csv")

Note that in this case, no <- is needed. write_csv() requires the data object you’re interested in, and a file path destination, which in this case we provided as a relative file path into the working directory (we could have supplied any name for this CSV). This should be ~56MB.

If you don’t need the output to be a CSV for reading into Excel or Google Sheets, and just want to save an output that can be loaded back into R in the future, then the best choice is R’s own data format, .rds. RDS files are more compact than CSVs, and you can hold any kind of data in .rds, as long as it’s “one” data object. saveRDS() can be used very similarly to write_csv():

saveRDS(health_ca, "health_ca.rds")

Notice how the .rds file is a tenth the size of the CSV, even though it holds the same information. So if you’re trying to conserve storage space and allow for quick saving/loading, RDS files are the way to go.
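
You can verify the size difference yourself with base R’s file.size(), which reports sizes in bytes:

file.size("health_ca.csv") # roughly 56 million bytes
file.size("health_ca.rds") # roughly a tenth of that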

Reading RDS files, by the way, looks like this:

health_ca <- readRDS("health_ca.rds")

In this case, you’ll just be overwriting the existing health_ca data object in your Environment. Note that, like read_csv(), you get to pick whatever name you want on the left side for your variable.
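
For example, you could read the same file into a differently named object without disturbing health_ca:

health_ca_backup <- readRDS("health_ca.rds") # any valid name works on the left side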

There are of course many other possible formats you can output data in, which we’ll encounter as we go. The last point I’ll make in this section is that a set of variables from your Environment can be saved together in an .rda file, which you can think of as similar to .rds but able to hold more than one object at a time. For example, you could run:

save(health_ca, pge_20_q1_elec, file = "working_datasets.rda")
load("working_datasets.rda")

Note:

  • The relevant function is now save().
  • You could list as many variables as you’d like, separated by commas.
  • The argument file = needs to be explicitly spelled out here. In fact, it could have been spelled out for saveRDS() too, but it was unnecessary there; here it’s necessary to avoid misinterpretation by the function, given that there are many other kinds of arguments that can be fed to save().
  • To read in an .rda file, you need to use load(). One key difference between readRDS() and load() is that with load() you can’t pick a variable name, since there might be multiple variables incoming; they arrive under the names they were saved with, as demonstrated below.
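
To see load() in action, you could remove the two objects and restore them (rm() and ls() are base R functions for removing and listing Environment objects, respectively):

rm(health_ca, pge_20_q1_elec) # clear the two objects to simulate a fresh session
load("working_datasets.rda")  # both return under their original saved names
ls()                          # confirm they're back in the Environment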

If you just wanted to save the entire Environment, then you could do:

# SAVE POINT
save.image("progress1.rda")
load("progress1.rda")

I like using this like the “save” feature in a video game, placing specific checkpoints throughout a long script. If you don’t want to lose the contents of “progress1.rda”, you’d generally leave the save.image() line commented out using Ctrl+Shift+C, and only un-comment it when you want to change the contents of the .rda file.