1.3 R Markdown files
The two most common types of R code documents are .R and .Rmd files. First, keep in mind that in either case they’re really just text files, but fed into R development environments like RStudio, they end up “doing R things”. A normal .R file is literally just code top to bottom to execute. An .Rmd file intersperses code with space for non-code commentary (in a language called Markdown), which is especially useful for creating web documents just like the one you’re viewing right now. Since we want to teach you how to create such public-facing documents for a wide variety of uses, and because it’s often convenient anyway to intersperse code with explanatory text, we’re going to use .Rmd files as the preferred format. But that means we have a few different formatting details to explain that, from the outset, may look strange.
You should have a new .Rmd file open that has a standard template. Usually, when I am starting a new .Rmd, I immediately erase this template and start to copy/paste script from other existing .Rmd files, but for now it’s useful to review this “scaffolding” to learn.
At the top, between two --- lines, is something called a YAML header (I leave you to Google to your heart’s desire for more insight, and just give you the basic orientation here), which feeds high-level information to what’s called a “knitting” operation (explained soon), which takes your document and does the “web development” (or conversion to other end formats like PDF) for you. For example, the YAML header is where you can type a preset “table of contents” parameter so you never have to bother with designing a table of contents in HTML (this extensive multi-page website, with the left sidebar navigation, is also structured using some simple YAML commands). Just know there are many cool parameters you can learn to use up there to make your web page even cooler, which you can look up when appropriate. Generally, besides many nifty HTML tricks, there are two fundamental things you should always do in the YAML:
- Give your .Rmd file a title, which will render at the top of the web document. This doesn’t have to be the same as the file name you give to your .Rmd file.
- You should specify your date based on when you last edited this code. As an advanced technique, you can type the following inside of the quotation marks: `r format(Sys.Date(), '%B %d, %Y')`. This will automatically update the date based on when you “knit”. There are three important concepts here. First, if you are not inside of a “chunk”, which we’ll get to soon, then generally you can’t execute code. But if you do a pair of backticks, and include “r” right after the first backtick, then it’s like you’ve created a “mini chunk” that lets you run one line of code (as if you were typing it directly into the Console). So what you’ve put right into the YAML is the text output of the following code: format(Sys.Date(), '%B %d, %Y'), which you are free to try typing directly into the Console. Now this code itself is a function within a function. Sys.Date() is a standard function in base R (which means you have access to it anytime) which does exactly what you think it does; try it directly in the Console on its own. Nothing is required between the parentheses for certain self-evident functions like this. Finally, format() can do a lot of generally useful formatting things depending on what you feed in as parameters (it’ll do nothing on its own). When format() is given a Date object, which is what Sys.Date() produces, as its first parameter, followed by a string in which you specify how you’d like to format the date, then it can convert something like “2020-09-14” into “September 14, 2020”. The exact schema itself is something you have to find guidance on; try Googling “r date format” and typing ?format in the Console.
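As a quick illustration of the date formatting just described, here is a minimal sketch you could paste into the Console. The fixed date and the variable names are placeholders I made up for illustration, and the month name produced by %B depends on your computer’s language settings, so the comment below assumes an English locale.

```r
# A fixed Date object, used instead of Sys.Date() so the result is repeatable
example_date <- as.Date("2020-09-14")

# Same format string as in the YAML example: month name, day, four-digit year
pretty_date <- format(example_date, "%B %d, %Y")
print(pretty_date)  # "September 14, 2020" in an English locale

# Individual pieces of the format schema can also be pulled out on their own
day_part <- format(example_date, "%d")   # two-digit day of the month
year_part <- format(example_date, "%Y")  # four-digit year

# Sys.Date() produces the same kind of object, so it formats the same way
print(class(Sys.Date()))
```

The full list of % codes is documented on the ?strptime help page.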
Now we’ll explain chunks, which are a special formatting in .Rmd files to contain code you want to execute. Below I’m copying the first chunk you get in the template. A chunk is created by a pair of three backticks, and a curly-bracketed set of parameters right after the first three backticks. Note that you can quickly create a chunk using Ctrl+Alt+I, one of the most common shortcuts I use.
knitr::opts_chunk$set(echo = TRUE)
Note that on this web page, you don’t see the pair of three backticks, but if you were to view the .Rmd file that created this webpage, you’d see the backticks and bracketed parameters. So one of the nice features of an .Rmd file is that you can publish documents with chunks nicely rendered to display code, and if there are outputs to the code (which these don’t have), you can easily choose to display them right below the chunk.
I’ll now show a second chunk that has the same effect but a different formatting choice:
library(knitr)
opts_chunk$set(echo = T)
The first chunk has one line of code while the second has two. The first example is basically a shortcut for the second, which is the more general approach. knitr is what’s called a “package”, and packages contain functions; in this case, opts_chunk is an object in knitr (note that its functions, like set(), are reached via a $ sign, which is rare). If you have used library() to load a package, then you don’t need to put knitr:: before the function call (the only situation in which this might be needed is if two loaded packages both have a function of the same name, in which case you need to specify which one you’re using; sometimes I need to do this for dplyr::select()). Generally, you’ll load a bunch of packages with library() calls in your first chunk, which I’ll demonstrate soon.
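To make the package:: versus library() distinction concrete, here is a small sketch. I’m using the tools package only because it ships with R, so nothing needs to be installed; the file names are made up for illustration.

```r
# Option 1: call a function with the package:: prefix, without loading the package
ext_1 <- tools::file_ext("report.Rmd")

# Option 2: load the package once, then call its functions without the prefix
library(tools)
ext_2 <- file_ext("report.html")

print(ext_1)  # "Rmd"
print(ext_2)  # "html"
```

Both calls reach the same function; the prefix just tells R explicitly which package to look in.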
In this case, opts_chunk$set() does something generally important for .Rmd files and relates to “knitting”. Its functionality is also something built into the structure of .Rmd files. I’ll now show the same chunk above, but with the full formatting that you would see in the .Rmd file:
```{r}
library(knitr)
opts_chunk$set(echo = T)
```
I have to do some extra-special formatting of my own .Rmd file to get the pairs of triple backticks to show on this page, because the whole point of .Rmd files is that they convert that kind of information into backend information that then renders web content. But it’s important to show you what the .Rmd formatting looks like.
For a chunk to be a chunk, you need the pairs of triple backticks and you need the bracket with an “r”. But you can add some parameters right after the “r” which affect that specific chunk. And these include the parameters you can feed into opts_chunk$set(); for example, you could add , echo = T after the “r” in the chunk above. The difference is that putting these parameters in the brackets affects only the one chunk, while opts_chunk$set() sets defaults for all subsequent chunks. Here are the most common parameters you would likely want to set:
- echo = T (in R generally you can type T/F or TRUE/FALSE for boolean logic) shows the code in a gray box when you render into a web page. This is the default case (I’m actually not sure why the template has it the way it is), so consider this parameter only useful if you wanted to set echo = F (say you’re creating a report in which the reader wouldn’t care about the code and only wants to see output graphs and maps, which would still show; this is what you should do on class assignment submissions).
- warning = F and message = F prevent a lot of annoying warning messages from showing up in the web page that you sometimes see in the RStudio console depending on what the code is doing. Generally these parameters are a good idea to set in opts_chunk$set(), which applies them to all chunks, so as to clean up your web documents.
- include = F executes code in the background but shows neither the code nor its output on the web page.
- eval = F prevents the code from being evaluated, but the code is still visible on the web page. Sometimes you’ll have some chunk that takes hours to run, and you don’t want to have to run it again when you knit, so you’ll end up saving the output of that long process as a file in your working directory and following up the first chunk with a second smaller chunk that just loads the completed file into your Environment. In this case, you might decide to show the first chunk on a web page but set it as eval = F, then hide the second chunk with include = F, but it is the one that actually retrieves the relevant output to continue using in the rest of your script.
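Here is a rough sketch of the “save once, load later” pattern described in the last bullet. The data frame is a tiny stand-in for a result that would really take hours to compute, the object and file names are hypothetical, and I’m writing to a temporary folder rather than the working directory just so the example cleans up after itself. saveRDS() and readRDS() are base R functions, but any save/load pair would work.

```r
# --- What would go in the first chunk (shown on the page, set to eval = F) ---
# Stand-in for a computation that takes hours to run
slow_result <- data.frame(id = 1:3, value = c(10, 20, 30))

# Save the finished result to a file (hypothetical file name)
cache_file <- file.path(tempdir(), "slow_result.rds")
saveRDS(slow_result, cache_file)

# --- What would go in the second chunk (hidden with include = F) ---
# Load the saved result back into the Environment instead of recomputing it
reloaded_result <- readRDS(cache_file)
print(reloaded_result)
```

When you knit, only the second part actually runs, so the document builds quickly while the reader still sees the code that originally produced the result.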
As you can see, most of these details basically have to do with how the resultant web page looks, and don’t have much to do with the code itself, but it’s useful to understand these parameters now and get into the practice of having something like the following always as your first chunk (feel free to copy and paste this into your own .Rmd file):
```{r setup, include = F}
knitr::opts_chunk$set(warning = F, message = F)
```
Note that the chunk itself won’t display in a web page because of include = F, but it will evaluate, and it specifically sets warning = F and message = F to apply to all future chunks. You can still apply other parameters individually to subsequent chunks as desired. Note also that I went back to using knitr:: since there’s basically no other need for knitr package functions in most cases. Lastly, note the label setup written after r but before a comma. The word you put right after r basically just names the chunk. This basically helps with quick navigation using a small drop-down menu you can find on the bottom of the Source window. There’s also value in doing this if you want to add automatically numbered captions to your plots, which won’t be taught in this curriculum. Otherwise I usually don’t name my chunks because I find it easier and more user-friendly to organize my document using hashtag headers in the non-chunk areas, as you can see done in your template. Ctrl+Shift+O opens a Google Doc style outline on the right side of the Source window which is useful for navigating your document.
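If you’re curious what opts_chunk$set() is actually changing, the sketch below checks the defaults before and after from the Console. It assumes the knitr package is installed (it comes along with a standard R Markdown setup) and uses opts_chunk$get(), the companion of opts_chunk$set(), to look up the current values.

```r
# Only run if knitr is installed, so the example fails gracefully otherwise
if (requireNamespace("knitr", quietly = TRUE)) {
  # Look up the current default for the warning option
  print(knitr::opts_chunk$get("warning"))

  # Same call as in the setup chunk: change the defaults for later chunks
  knitr::opts_chunk$set(warning = FALSE, message = FALSE)

  # The defaults now reflect what was just set
  print(knitr::opts_chunk$get("warning"))
  print(knitr::opts_chunk$get("message"))
}
```

Typing this in the Console only affects your current R session; in a real document the setup chunk is what makes these defaults apply when you knit.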
After that first setup chunk, next you would usually have a “loading libraries” chunk. Mine typically looks something like this (note I am no longer forcing the triple backticks portion to show, so you’ll just see the code within the chunk):
library(tidyverse)
library(plotly)
library(sf)
library(tigris)
library(leaflet)
library(censusapi)
Sys.setenv(CENSUS_KEY="c8aa67e4086b4b5ce3a8717f59faa9a28f611dab")
All the library() calls are loading packages that you know you need to run the code in the rest of the script. Generally you’ll have some idea of the essential ones you always use, but as you are working on a novel problem, you may discover a new special package you want to use, and you’d scroll up and add it to this chunk. The list above happens to be the “essential” packages I’ll cover later in this chapter.
You have to install packages before you can load them with library(), and generally you just install once and you’re good to go. Packages can be installed (and updated) using the toolbar options (Tools > Install Packages), or by typing install.packages("name_of_package") in the console.
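One way to check whether a package is already installed before trying to load it is sketched below. The helper name is_installed is my own invention for illustration, not a built-in function, and the install line is commented out so nothing gets downloaded unless you choose to run it.

```r
# Hypothetical helper: TRUE if the named package is installed on this machine
is_installed <- function(package_name) {
  requireNamespace(package_name, quietly = TRUE)
}

# "stats" ships with R, so this should always be TRUE
print(is_installed("stats"))

# Check one of the packages from the list above
if (!is_installed("tigris")) {
  message("tigris is not installed yet; install it once with install.packages()")
  # install.packages("tigris")  # uncomment to actually install
}
```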
Usually after the library() calls, you would set other kinds of environmental “settings” that are relevant to specific code you’ll use later on; setting them here merely provides the benefit of having all these settings in one place, but they could go anywhere as long as they are executed before the relevant code they affect. Here I am calling Sys.setenv() which relates to the censusapi package, which I’ll explain in the next chapter. Other common settings you might put in this chunk:
- If you are sharing code between users who might have different file paths, then generally it’s a good idea to create a variable like path <- "G:/Shared drives/SFBI-Restricted/". This holds a string of text which will then be prefixed to other text later on, say to create a full file path to grab some CSV that is in that folder. This happens to be the right file path for me on a PC, but if somebody else has access to the same drive but is a Mac user, they might need to replace this with path <- "/Volumes/GoogleDrive/Shared drives/SFBI-Restricted/". So then it’s easy for somebody to make this adjustment once in this line of this early chunk. (Note that often code will sit in cloned GitHub repos that you would have set your working directory to manually, in which case any smaller and simpler files you load in from the same repo don’t need an absolute file path like this; absolute file paths tend to be for grabbing large or sensitive data from a secure server; GitHub repos will be explained more in the next section.) (Also note that “environment variables” can also be used to enable Person 1 to have a different path variable than Person 2, which get stored on their respective local machines and can be referenced in a generic way. This technique will not be covered in this curriculum.)
- A few packages will have special options you need to set, but they’ll generally be options you can set inside of options(), delineated by commas. We’ll encounter some of these later in the curriculum.
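The two settings above can be sketched as follows. The CSV file name is hypothetical (I’m only building the path as text, not reading the file), and the two options shown are base R options picked purely to show the comma-separated syntax; package-specific options are set the same way.

```r
# Shared folder prefix, as in the first bullet (adjust once per user)
path <- "G:/Shared drives/SFBI-Restricted/"

# Prefix it to a file name to build a full file path (hypothetical file name)
csv_path <- paste0(path, "example_data.csv")
print(csv_path)  # "G:/Shared drives/SFBI-Restricted/example_data.csv"

# Several options can be set in one options() call, separated by commas
options(digits = 4, scipen = 999)

# getOption() looks up the current value of a single option
print(getOption("digits"))  # 4
```

You would then pass csv_path to whatever function reads the file, so nobody else has to edit anything below the first chunk.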
Of course, most of the “R” action happens in the code chunks themselves, which we haven’t yet dived deeply into. The writing that’s happening in non-chunk areas is in Markdown language, which you can think of as an easy language for displaying text, and you can always easily Google for the right Markdown syntax to do things like bold, italics, tables, etc (don’t feel like you need to memorize since you can Google).
Lastly, note there’s a “Knit” button next to the white gear, and that’s what you ultimately click to export an HTML file. If successful, after a few seconds you’ll see a filename.html in your working directory (likely you’ll see this right away in your Files window since you’re typically looking at your working directory), and by default RStudio will open up a window that gives you a preview of what the HTML file would look like as a web page. You are welcome to try it now, and notice that your simple .Rmd file doesn’t really do much interesting in terms of code, but you can already practice all the Markdown formatting you want.
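To tie the pieces of this section together, the sketch below writes a tiny .Rmd file (YAML header, a line of Markdown, and one chunk) and then knits it from the Console instead of with the Knit button. The file name and contents are made up for illustration, and the knitting step assumes the rmarkdown package and Pandoc are available, which is normally the case inside RStudio; if they aren’t, the sketch just writes the file and stops.

```r
# Build the three-backtick fence as text so it can be written into the file
fence <- strrep("`", 3)

# Write a minimal .Rmd file to a temporary folder (hypothetical file name)
rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(c(
  "---",
  "title: \"Example\"",
  "date: \"`r format(Sys.Date(), '%B %d, %Y')`\"",
  "output: html_document",
  "---",
  "",
  "Some *Markdown* text.",
  "",
  paste0(fence, "{r}"),
  "1 + 1",
  fence
), rmd_file)

# Knit to HTML only if the required tools are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  print(html_file)  # path to the knitted .html file
}
```

Opening the resulting .html file in a browser shows the same preview RStudio gives you after clicking Knit.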
If you need a deeper explanation of these and many other fundamental R Markdown concepts we’ll skip over in this course, start here, then do more Googling.