1.5 Reading and saving files

Generally you will be loading data into RStudio from some source, making it more useful or generating insights from it in some way, and then exporting that data out in raw or visualized form. Here are the common ways you’d read data into RStudio:

  1. Find a link online that lets you download a data file directly (e.g., a CSV), and use that URL to load the data straight into your Environment. If you find such a link or API, and if it’s stably maintained on the webpage, this is the easiest method.
  2. If for whatever reason it’s not easy to use the URL as the file path, but you can download the data onto your personal machine or a server, then read the data using the local file path instead. This is also very easy, and often a great solution for teams that share the same server, but it assumes you have set up such a file storage solution, and you might need to worry about changing file paths for different users, as we’ve discussed.
  3. A variety of packages have functions designed to essentially do Option 1 but with many convenient wrappers around the process, so that instead of having to find URLs or APIs and work with them yourself, you can interact with function parameters and get various pre-processed results. We’ll experience this later in this chapter.
  4. A variation on downloading datasets from online is scraping content directly from web pages, which will be the subject of a later chapter.

In this section, we’ll demonstrate Options 1 and 2.

In either case, functions from the tidyverse package will come in handy. The tidyverse is a popular collection of packages that all “share an underlying design philosophy, grammar, and data structures”. Its creator, Hadley Wickham, was a contributor to the Data Challenge Lab course that used to be taught at Stanford (and which, by the way, has a more comprehensive online curriculum that you’re encouraged to also use as a resource to learn R). Basically, we, along with many others, believe that the tidyverse is a deeply intuitive way to learn and use R, and through this curriculum we will essentially be attempting to convince you of that too.

We’ll dive much more deeply into tidyverse later, but for now, we’ll focus on the functions within its packages that are designed to read data. readr is the main one.

In an earlier chunk, I’ve already loaded library(tidyverse). tidyverse uniquely brings in multiple core packages when you call it with library(), but in principle you could load library(readr) on its own.

The tidyverse packages have great documentation, so it’s worth generally reviewing what information is available for readr. Since CSVs are one of the most common types of data format you’d find online, read_csv() is the most common function you’d use, along with its exporting equivalent, write_csv().

For our example, let’s look at a large health dataset provided by the Centers for Disease Control and Prevention (CDC) called 500 Cities. Without going into the full background of this data project, which you can read about, I’ll simply note that this dataset provides estimates of many health outcomes that are typically difficult to find public data on at the census tract level (more on geographies soon) in the 500 largest U.S. cities. The dataset is fairly large, so it is also a good test of your personal machine and its ability to handle the kinds of datasets you would expect to be handling in R. On the CDC website, if you were looking for how to grab the data directly from R, you’d be looking for a direct link to .csv data, which requires manual searching. In this case, it can be found under the “Export” button, and copying the link address yields https://chronicdata.cdc.gov/api/views/6vp6-wxuq/rows.csv?accessType=DOWNLOAD. read_csv() has a straightforward syntax, as shown below:

library(tidyverse)

health_full <- read_csv("https://chronicdata.cdc.gov/api/views/6vp6-wxuq/rows.csv?accessType=DOWNLOAD")

<- is used to assign the result of functions to an Environment variable. You can supply a variable name of your choice on the left side. There are some general rules around this, like having no spaces, not leading with a number, and not using certain special characters. You should also develop the habit of using understandable variable names while keeping them relatively short.
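To illustrate these naming rules, here’s a quick sketch; the object names below are arbitrary examples, not part of our datasets:

```r
# Valid assignments: no spaces, doesn't start with a number,
# uses only letters, numbers, dots, and underscores
health_data <- 1
health.2020 <- 2

# Names like these would throw errors if uncommented:
# 2020_health <- 3   # can't start with a number
# health data <- 4   # can't contain a space
```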

Running the line above may take a minute or more to load. You’ll notice a stop sign shows up in the Console, which lets you know that something is processing. When it’s complete, you’ll see a > show up again in the Console, and you’ll see the newly created variable in your Environment. You can click on it (or type View(health_full)) which will create a new tab in the Source window to view tabular data. As noted previously, names(health_full) in the Console quickly shows the names of the column headers (which matches the data dictionary you can find on the CDC site).

Reading files can introduce lots of unique challenges, so part of getting good at R is learning how to troubleshoot the unexpected problems you’ll face when interfacing with web pages and different file types. We’ll encounter some more complicated situations throughout this curriculum and revisit advanced reading methods as appropriate.

We’ll be able to manipulate this data in a variety of ways, and generally with such large datasets you’d be interested in filtering to just the data you need, saving that locally, and then removing the larger original file. I’ll go ahead and demonstrate filter() which is from the dplyr package in tidyverse (more dplyr functions will be showcased soon):

health_ca <- filter(health_full, StateAbbr == "CA")

filter() takes as its first parameter some existing data object, and as its second parameter a logical condition. Here I’m referencing an existing field name, StateAbbr, which I know is spelled that way because I’ve reviewed the data structure, and I’m filtering to only rows of health_full that have “CA” in the StateAbbr field. The result goes from 800K to 150K rows. For Bay Area work we could further filter to find just the cities in the Bay Area, but for now we’ll go ahead and move on to other examples.
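As a sketch of what that further filtering might look like, dplyr conditions can match against a vector of values with `%in%`. Here I’m assuming the city name field is called CityName (confirm with names(health_full) against the data dictionary), and the three cities listed are just an illustration, not a complete Bay Area list:

```r
library(tidyverse)

# Hypothetical further filter: keep only a few Bay Area cities.
# %in% keeps rows whose CityName matches any value in the vector.
health_bay <- filter(
  health_ca,
  CityName %in% c("San Francisco", "Oakland", "San Jose")
)
```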

Now let’s try an example that uses a local file path. PG&E is the local energy utility for the Bay Area, and it provides some ZIP code level data to the public online. This site would be a good example of Option 1, except it has an extra pop-up window that shows up asking for the user to provide some info and fill out some boxes which means we can’t do exactly what we did in the last example (there would be a more involved approach akin to Option 4’s web scraping). But for our purposes, this is a good opportunity to try downloading data directly to your local machine. Go ahead and download 2019 electricity usage for four quarters, and make sure to unzip the CSV contents of the zip folders directly into your “working directory”, which you should have created at the end of the last section. You should see these CSVs immediately in your Files window.

Since we are using the working directory, you can use relative file paths as follows:

pge_19_q1_elec <- read_csv("PGE_2019_Q1_ElectricUsageByZip.csv")

No additional file path information is required; RStudio will by default look inside the working directory for a “PGE_2019_Q1_ElectricUsageByZip.csv” unless you specify otherwise. This means it’s generally easiest to put input data as well as output contents directly in the working directory, but since this is likely also a cloned repo, you will have constraints on file size if you are syncing to GitHub (generally 100MB per file). So as you start working with bigger files, you’d generally interface instead with a server location (e.g., Google Drive File Stream) and provide a longer file path.
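For comparison, a read from a shared server location would just use a longer, absolute path instead of a relative one. The path below is purely illustrative; yours will depend on your own file storage setup and operating system:

```r
library(tidyverse)

# Hypothetical absolute path to a shared drive (illustrative only).
# Note that R uses forward slashes even on Windows.
pge_19_q1_elec <-
  read_csv("G:/Shared drives/team_data/PGE_2019_Q1_ElectricUsageByZip.csv")
```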

Now let’s practice saving some outputs. If you wanted to create a CSV, then write_csv() is your best bet. Here’s what it would look like to save your California-filtered version of the CDC dataset:

write_csv(health_ca, "health_ca.csv")

Note that in this case, no <- is needed. write_csv() requires the data object you’re interested in, and a file path destination, which in this case we provided as a relative file path into the working directory (we could have supplied any name for this CSV). This should be ~40MB.

If you don’t need the output to be a CSV for reading into Excel or Google Sheets, and just want to save an output that can be loaded back into R in the future, then the best choice is R’s own data format, .rds. RDS files are more compact than CSVs, and you can hold any kind of data in .rds, as long as it’s “one” data object. saveRDS() can be used very similarly to write_csv():

saveRDS(health_ca, "health_ca.rds")

Notice how the .rds file is a tenth the size of the CSV, even though it holds the same information. So if you’re trying to conserve storage space and allow for quick saving/loading, RDS files are the way to go.
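If you want to check the size difference yourself, base R’s file.size() reports a file’s size in bytes; exact values will vary with your data:

```r
# Compare the on-disk sizes of the two exports (in bytes)
file.size("health_ca.csv")
file.size("health_ca.rds")

# Or express the compression as a ratio directly
file.size("health_ca.rds") / file.size("health_ca.csv")
```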

Reading RDS files, by the way, looks like this:

health_ca <- readRDS("health_ca.rds")

In this case, you’ll just be overwriting the existing health_ca data object in your Environment. Note that, like read_csv(), you get to pick whatever name you want on the left side for your variable.

There are of course many other possible formats you can output data to, which we’ll encounter as we go. The last point I’ll make in this section is that a set of Environment variables can be saved together in an .rda file, which you can think of as similar to .rds but able to hold more than one object at a time. For example, you could run:

save(health_ca, pge_19_q1_elec, file = "working_datasets.rda")
load("working_datasets.rda")

Note:

  • The relevant function is now save().
  • You could list as many Environment variables as you’d like, separated by commas.
  • The argument file = needs to be explicitly spelled out here. In fact, this argument could have been spelled out for saveRDS() too, but it was unnecessary there; here it’s necessary to avoid misinterpretation, given that there are many other kinds of arguments that can be fed to save().
  • To read in an .rda file, you need to use load(). One key difference between readRDS() and load() is that you can’t specify a variable name, since there might be multiple variables incoming.
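One way to see that last difference in action: remove the objects from your Environment, then let load() restore them. This is a sketch assuming you’ve already run the save() call above:

```r
# Clear the two objects from the Environment, then restore them.
# load() recreates objects under the names they were saved with --
# you don't (and can't) pick a new variable name on the left side.
rm(health_ca, pge_19_q1_elec)
load("working_datasets.rda")
```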

If you just wanted to save the entire Environment, then you could do:

# SAVE POINT
save.image("progress1.rda")
load("progress1.rda")

I like using this like a “save” button in a video game, designed as specific checkpoints throughout a long script. If you don’t want to lose the contents of “progress1.rda”, you’d generally leave the save.image() line commented out using Ctrl+Shift+C, and only un-comment it if you want to change the contents of the .rda file.