1.7 Manipulating data

The vast majority of “coding” that happens in R is manipulating data in ways that allow you to extract more insight from the raw information. Usually you are manipulating “dataframes”, which, if you are familiar with Excel or Google Sheets, can be thought of as akin to spreadsheets of data. There are hundreds of techniques you can eventually learn for manipulating data in exactly the way that is relevant to a given problem, but for now I’ll just highlight the most basic techniques, which generally involve functions from dplyr (which you already have loaded if you ran library(tidyverse) earlier; for convenience, I’ll re-execute all necessary library() calls in the first chunk of each chapter section).

Note: when I use the term dataframe in this curriculum, I’m technically referring to one of two types of “tables” in R, a “data.frame” or a “tibble”. The difference between these two is not substantive enough to matter in this curriculum, so you generally don’t have to think about whether your table is one or the other.

We’ve already seen filter(); here’s another example using the PG&E 2020 electricity data we just created:

library(tidyverse)

pge_filter <- filter(pge_20_elec, CUSTOMERCLASS %in% c("Elec- Residential","Elec- Commercial"))

Here, the == logic has been replaced with %in%, which can be thought of as “membership in”, followed by c(), which is the easiest way to quickly create vectors of elements, in this case two “candidate” values for CUSTOMERCLASS. By the way, there are many types of “vectors”, including “lists”, which have a technical meaning in R; we’ll encounter important cases where we need to distinguish between these, but for now, if you’re curious to learn more, read here. Anyway, the logical expression will be TRUE if CUSTOMERCLASS is either “Elec- Residential” or “Elec- Commercial”, and therefore only these rows will be preserved in the filter. Performing this operation requires that you have some conceptual notion of why you would want to filter this way and what the correct names are; you’d figure this out by inspecting pge_20_elec.
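
If you want to build intuition for how %in% evaluates on its own, you can test it on a small made-up vector (the left-hand values below are just for illustration, including a class that isn’t in our candidate list):

c("Elec- Residential", "Elec- Agricultural") %in% c("Elec- Residential", "Elec- Commercial")
## [1]  TRUE FALSE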

Now, a brief digression to talk about style (with a more thorough treatment here). The previous operation was a single line, and R will understand it to be one operation even if it’s not literally written as one line, simply based on the natural structure of the operation. So, one aspect of “style” in programming is to break up long lines of code to make them more readable, while not changing the operation itself. See the modified example below.

pge_filter <- 
  filter(
    pge_20_elec,
    CUSTOMERCLASS %in% 
      c(
        "Elec- Residential",
        "Elec- Commercial"
      )
  )

The technique simply involves pressing Enter at strategic points in the line of code, which will also automatically create useful indenting in RStudio, so that the different conceptual parts of the operation are easier to visualize. Note that you can’t break the line anywhere you want; functions need to stay attached to their first open parenthesis, and variable names can’t be broken. Otherwise, there is quite a lot of “personal preference” involved here; many would consider the example above a bit overdone in its indenting. But in my opinion, real estate is “free” in a code document in the sense that you can have as many lines as you want, and indenting tends to make code easier to view in a narrow computer window and easier to understand. I’ll start stylizing all my code from here on using my own preferences, and will let you experiment and decide on your own style.

Another fundamental function is select(). Typically you would perform select() after reviewing the names() of a dataframe and deciding which columns you want to keep and which you don’t need. There are many ways to customize your select() arguments, many of which have only recently been made possible through an update to dplyr, and which you’d learn as you discover relevant use cases. For now, I’ll just show the most common approaches, “selectively keeping” and “selectively removing”.

names(pge_filter)
## [1] "ZIPCODE"        "MONTH"          "YEAR"           "CUSTOMERCLASS" 
## [5] "COMBINED"       "TOTALCUSTOMERS" "TOTALKWH"       "AVERAGEKWH"
head(pge_filter)

head() shows the first few rows of a dataframe, which allows for quick inspection. By the way, brackets can be used to view slices of a dataframe using the general convention of dataframe[row, column]. See the results of all the following permutations:

pge_filter[1,1]
pge_filter[1:5,1:5]
pge_filter[1:5,c("ZIPCODE","MONTH","YEAR")]
pge_filter[1:5, ]
pge_filter[1:5, ]$YEAR
## [1] 2020 2020 2020 2020 2020

Note that using a colon creates a vector of consecutive numbers, while c() again lets you create your own arbitrary vectors, as shown in the quick illustration below. Also note that in the last two examples, no “column” information is given in the brackets, which is then interpreted as “all columns”. You could do the same on the “row” side as well. Using $ “extracts” one column from a dataframe, thereby leaving just a vector.
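
As a quick standalone illustration of these two vector-building shorthands (the numbers here are arbitrary):

1:5
## [1] 1 2 3 4 5
c(1, 3, 5)
## [1] 1 3 5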

Anyway, we can see that YEAR is not useful if the entire dataframe is from 2020. The following two commands have the same outcome, but in this case, “selectively removing” using a - sign is much simpler.

pge_select <-
  select(
    pge_filter,
    ZIPCODE,
    MONTH,
    CUSTOMERCLASS,
    COMBINED,
    TOTALCUSTOMERS,
    TOTALKWH,
    AVERAGEKWH
  )

pge_select <-
  select(
    pge_filter,
    -YEAR
  )

There are two other fields that you might consider removing at this stage. COMBINED, according to the original PG&E dataset, lets you know whether the given ZIP code’s results have been combined with a neighboring ZIP code to meet state privacy requirements. If you wanted to quickly see what the distribution of yeses and nos is for this field, you could use table() and the $ technique to extract one column vector:

table(pge_select$COMBINED)
## 
##     N     Y 
## 10549  9599

Depending on what analysis you’re conducting, you might consider “combined” rows problematic, but for our purposes, as long as there’s no double-counting, we will ultimately be moving towards aggregated results, so we may simply choose to remove this field. Another unnecessary field might be AVERAGEKWH, since as we start to aggregate results, we’d want to recalculate average kilowatt-hours using aggregated numerators and denominators. The following just shows a select() again, this time using the - in front of the c() to selectively remove that list of candidates (! also works, as shown in the alternative sketch after the chunk below).

pge_select <-
  select(
    pge_filter,
    -c(YEAR, COMBINED, AVERAGEKWH)
  )
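
For completeness, here is the same removal written with ! instead of -; this is just a sketch of the alternative negation syntax and produces the same result:

pge_select <-
  select(
    pge_filter,
    !c(YEAR, COMBINED, AVERAGEKWH)
  )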

group_by() and summarize() are typically used in conjunction with each other to create aggregated results that are similar to what you might be familiar with if you’ve used pivot tables in Excel or Google Sheets before. In group_by() you specify the columns you want to “keep” as discrete categories while collapsing information from other fields. For example, if we didn’t want to retain separate energy data for separate ZIP codes, but we wanted to retain separate energy data for separate months and customer classes, then we’d do:

pge_group <-
  group_by(
    pge_select,
    MONTH,
    CUSTOMERCLASS
  )
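
If you ever want to confirm which grouping has been registered, dplyr provides group_vars() as a quick sanity check; given the grouping above, you should see:

group_vars(pge_group)
## [1] "MONTH"         "CUSTOMERCLASS"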

Nothing will look different about this output, but you can imagine it “knowing” something special about the MONTH and CUSTOMERCLASS fields, which can be leveraged for certain follow-up operations like summarize() and mutate(), which both allow you to create new fields of data (like adding a new column in Excel with formulas referencing existing columns). First we’ll demonstrate summarize() which specifically “collapses” data based on some kind of mathematical operation, like sum(), mean(), min(), etc.

pge_summarize <-
  summarize(
    pge_group,
    TOTALKWH = 
      sum(
        TOTALKWH, 
        na.rm = T
      )
  )

summarize() produces an entirely new dataframe that will, by default, only retain the fields you called out in group_by(), so the additional arguments you provide to summarize() are essentially new fields you want to create based on summarizations of the original fields. Here, the first TOTALKWH is the creation of a whole new variable, and you typically use = inside such operations instead of <-, since these aren’t “Environment variables”. Note we’re intentionally giving this new variable the same name as what it’s based on, since it isn’t conceptually different, but we could have given it a completely different name. The sum() operation will take all values of TOTALKWH from the original pge_group dataframe that share the same MONTH and CUSTOMERCLASS and add them together. na.rm = T is an additional argument that tells sum() to ignore empty values (“NAs”) it may encounter; otherwise, a single NA would make the entire sum come out as NA. You should be aware of whether NAs exist, since they might be an important signal that something is wrong with the data, but as long as they’re not fundamental problems, you’d generally include na.rm = T in any of these summarization operations.
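
You can see this NA behavior for yourself with a toy vector (the numbers here are arbitrary):

sum(c(10, 20, NA))
## [1] NA
sum(c(10, 20, NA), na.rm = T)
## [1] 30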

You’ll notice that pge_summarize is significantly altered, holding only 24 rows (24 unique combinations of 12 months and 2 customer classes), and holding only one field of actual “observed information”, the total kilowatt-hours for that month and customer class. You could have added additional summarization fields, but first I’ll note that pge_summarize, as is, is what is called “tidy” data, a concept pioneered by the tidyverse. This chapter from Hadley’s textbook gives the best explanation. pivot_longer() and pivot_wider() can be used to convert back and forth between “tidy” data and what we might be more used to from an Excel background, as shown:

pge_wide <-
  pivot_wider(
    pge_summarize,
    names_from = CUSTOMERCLASS,
    values_from = TOTALKWH
  )

pge_wide

Generally, wider data is easier to look at, but longer, “tidier” data is easier to do a variety of operations on, so you should be familiar with how to go back and forth.

pge_tidy <-
  pivot_longer(
    pge_wide,
    c("Elec- Commercial", "Elec- Residential"),
    names_to = "CUSTOMERCLASS",
    values_to = "TOTALKWH"
  )

pge_tidy is the same as pge_summarize. Note that the field names here are enclosed in quotation marks; since “Elec- Commercial” and “Elec- Residential” contain spaces and hyphens, they aren’t standard R names and need to be quoted (or wrapped in backticks) when referenced, unlike the standard column names we passed to select() and other fundamental functions. Beyond that, functions do vary in when they require quotation marks, so consider this the kind of function-specific know-how you’ll develop with experience.

Now let’s go back to the summarize() step to add TOTALCUSTOMERS as well, and then re-compute AVERAGEKWH after the summarization operation using mutate().

pge_summarize <-
  summarize(
    pge_group,
    TOTALKWH = 
      sum(
        TOTALKWH, 
        na.rm = T
      ),
    TOTALCUSTOMERS =
      sum(
        TOTALCUSTOMERS,
        na.rm = T
      )
  )

pge_mutate <-
  mutate(
    pge_summarize,
    AVERAGEKWH =
      TOTALKWH/TOTALCUSTOMERS
  )

mutate() is the closest thing to creating a new field in Excel, typing in a formula, and then “dragging” that formula down the length of the spreadsheet. It’ll likely become the dplyr function you use the most.
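
One convenient detail is that a single mutate() call can create several columns at once, and later arguments can reference columns created earlier in the same call. Here’s a small sketch; AVERAGEMWH is a made-up field purely for illustration, and we won’t use it going forward:

pge_mutate_demo <-
  mutate(
    pge_summarize,
    AVERAGEKWH =
      TOTALKWH/TOTALCUSTOMERS,
    # AVERAGEMWH is illustrative only; note it references the
    # AVERAGEKWH column created just above in the same call
    AVERAGEMWH =
      AVERAGEKWH/1000
  )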

One final, very important technique before we move on. If you haven’t already noticed, the last few chunks have produced lots of individual outputs in the Environment that you might consider “useless” if all you really wanted was the final result, pge_mutate. There was a lot of redundancy in having to constantly refer to the previous interim data object as part of making a new interim data object.

The “pipe”, %>%, is designed to string functions together like an assembly line, where an “object” is passed from function to function in stages of manipulation. It’s worth getting a full explanation directly from Hadley here, but my own paraphrase is that wherever you can conceive of a pipeline, you should build the habit of coding with pipes, which will simplify the code for both you and your readers. I consider pipes to fundamentally clarify the nature of many operations, so that the code better resembles our mental model of what’s “happening” to a data object, treating functions more as “verbs” that sequentially “act” on a single object. The pipe itself is very easy to type with the shortcut Ctrl+Shift+M.

Also note that the pipe comes from the package magrittr, which is part of the tidyverse, which means %>% only works if you have that package loaded. This is not a big deal if you tend to use the tidyverse in your code. But as of 2021, R also has its own “native pipe”, which looks like |>. It behaves largely the same as %>%, but there are enough small but significant advantages to %>% that I recommend sticking with %>% (at least until |> has some feature upgrades).
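
As a minimal illustration (using head(), which we’ve already seen), the following lines all do the same thing; note the native pipe requires R 4.1 or newer:

head(pge_20_elec)
pge_20_elec %>% head()
pge_20_elec |> head()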

Reviewing many of the previous chunks, you’ll notice that the first argument in a function is often the dataframe being acted upon (this is not 100% the case, but we’ll be able to deal with those outlier situations without any problem once we encounter one). One key thing a pipeline does is remove the need to specify this first argument, since it’s self-evidently “the object in the pipeline”. So an important conceptual change in how you use functions is that you start to “ignore” the argument that asks for the “data”, which is often the first argument. Besides that, once you see the example below, the rest of the pipeline technique should appear straightforward.

pge_final <-
  pge_20_elec %>% 
  filter(
    CUSTOMERCLASS %in% 
      c(
        "Elec- Residential",
        "Elec- Commercial"
      )
  ) %>% 
  select(
    -c(YEAR, COMBINED, AVERAGEKWH)
  ) %>% 
  group_by(MONTH, CUSTOMERCLASS) %>% 
  summarize(
    TOTALKWH = 
      sum(
        TOTALKWH, 
        na.rm = T
      ),
    TOTALCUSTOMERS =
      sum(
        TOTALCUSTOMERS,
        na.rm = T
      )
  ) %>% 
  mutate(
    AVERAGEKWH =
      TOTALKWH/TOTALCUSTOMERS
  )
pge_final

Notice, at a practical level, how much less text is required when writing this. As before, the indenting here is my style preference; I like to treat pipes as the “and then” at the end of a verse, like poetry. You can put your cursor anywhere in this pipeline and press Ctrl+Enter, and the whole pipeline will run.

As great as pipes may appear to be, they do make troubleshooting more complicated: you have now strung together many individual operations, and if you hit an error you might not immediately know which step caused it. So in some sense, it’s still a good idea to develop your process function by function, like we originally did, and convert to a pipeline only after you have checked that each individual step works. Of course, the more comfortable you become with coding, the fewer mistakes you make, and the more naturally you can default to writing pipelines. As a middle ground, note that if you select and drag over the text from the top of a pipeline to any intermediate step right before a new pipe, and press Ctrl+Enter, you can run a “partial pipeline”, which essentially allows you to check intermediate stages for errors (i.e., if you can run from the top to halfway through with no errors, then you know the error isn’t in the first half of your pipeline steps); see the sketch below. Overall, the question of “what should be combined into one pipeline” really becomes a conceptual design one, based on which steps you think should be understood as linked into one.
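
For example, selecting and running just this portion of the pipeline above (everything up to, but not including, the %>% after the select() step) would let you check the filter and select stages on their own; nothing is assigned, so the intermediate result simply prints:

pge_20_elec %>% 
  filter(
    CUSTOMERCLASS %in% 
      c(
        "Elec- Residential",
        "Elec- Commercial"
      )
  ) %>% 
  select(
    -c(YEAR, COMBINED, AVERAGEKWH)
  )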

From this point on, I will code using pipes by default, and I encourage you to do the same.