As I started visualizing the Current Reservoir Levels data, I realized that a technical issue I was grappling with was interesting in its own right. One of my variables didn’t apply to all of the data, and in order to display data in the way I wanted to, it was necessary to build a custom plot.
I explored the inbuild ggplot2
options and tried to visualize the inbuilt faceting feature, but due to the inconsistent nature of my data facets I ended up using patchwork
to build the plot exactly as I wanted it to be.
The Current Reservoir Levels structure
Exploring data is a large topic of study in itself, called exploratory data analysis (EDA). It isn’t a rigid structure, but more of a philosophy towards how to approach the data. I am still learning about EDA, how other people explore data, and developing my own techniques.
One common technique I know from my background in mathematics and social science is to look at points of change like borders, limits, extremes, asymptotes, and changes in direction. When a function modeling slope has a derivative that becomes negative, at that point there will be an interesting change.
In human geography, it is telling to analyze the specific details about how a group of people define who is at the edge of their group, who is tolerated, and who is held outside despite their attempts to be included in the group. Applying this to data, analyzing statistical outliers using box plots is one useful first step.
In visualizing data for EDA, there are other options and box plots have significant limitations. I am fascinated by more exotic visualizations like raincloud plots and want to work with them, but I have a lot of basic competence to build up with ggplot2
and related libraries first.
Before diving into the plots, let’s look at a slice of the data in tabular form with head()
:
Date | Reservoir | Site | Storage | Elevation | Release |
---|---|---|---|---|---|
2017-11-01 | Ashokan | East | 65.36 | NA | NA |
2017-11-01 | Ashokan | East | NA | 577.86 | NA |
2017-11-01 | Ashokan | West | NA | 36.34 | NA |
2017-11-01 | Ashokan | West | 577.89 | NA | NA |
2017-11-01 | Ashokan | NA | NA | NA | 18.00 |
2017-11-01 | Schoharie | NA | 12.92 | NA | NA |
2017-11-01 | Schoharie | NA | NA | 1109.86 | NA |
2017-11-01 | Schoharie | NA | NA | NA | 0.00 |
2017-11-01 | Rondout | NA | 47.26 | NA | NA |
2017-11-01 | Rondout | NA | NA | 835.85 | NA |
2017-11-01 | Rondout | NA | NA | NA | 9.92 |
2017-11-01 | Neversink | NA | 30.74 | NA | NA |
2017-11-01 | Neversink | NA | NA | 1430.03 | NA |
2017-11-01 | Neversink | North | NA | NA | 38.90 |
2017-11-01 | Neversink | South | NA | NA | 0.00 |
2017-11-01 | Neversink | Conservation Flow | NA | NA | 0.00 |
2017-11-01 | Pepacton | NA | 116.60 | NA | NA |
2017-11-01 | Pepacton | NA | NA | 1264.45 | NA |
2017-11-01 | Pepacton | North | NA | NA | 51.70 |
2017-11-01 | Pepacton | South | NA | NA | 0.00 |
2017-11-01 | Pepacton | Conservation Flow | NA | NA | 0.00 |
2017-11-01 | Cannonsville | NA | 47.69 | NA | NA |
2017-11-01 | Cannonsville | NA | NA | 1112.87 | NA |
2017-11-01 | Cannonsville | NA | NA | NA | 97.30 |
2017-11-02 | Ashokan | East | 64.95 | NA | NA |
If you’re looking closely, you can see that the Sites column is a complicating factor in this data. As not every reservoir has multiple sites for every type of measurement, a lot of common techniques for visualization will become more difficult.
The purpose of the data set is to record three measures across six reservoirs: total storage measured in billions of gallons per day, elevation of the water surface in feet, and water released from reservoirs measured in millions of gallons per day.
If we visualize the each variable across the entire system with a basic box plot that doesn’t break out the sites, here’s what we get for storage:
and elevation:
and release:
These plots could benefit from some improved scaling, but a bigger problem is that our Sites are actually different subsets of the respective reservoirs. For example, the Ashokan reservoir has two basins, East and West, which is why the Storage and Elevation numbers have such wide distributions in the box plot.
Let’s visualize the Ashokan sites separately, starting with storage:
and then elevation:
Despite the scales squishing the box plots to the point that they’re effectively useless, we can see that the East and West basins are completely different from each other. The East basin holds far less water, but has a higher surface elevation, which seems like it would indicate a much deeper reservoir.
The Neversink and Pepacton reservoirs are similar for Release values: both have North, South, and Conservation Flow sites. In short, when visualizing this data set, displaying aggregate data for all reservoirs is misleading as it collapses the differences between the different sites.
This would not be a problem for many visualizations, but the point of using a box plot is to visually identify outliers for subsequent analysis and possible removal. So how can we visualize boxplots for the Reservoirs by Site when there are multiple sites, but not when there aren’t sites?
Both Reservoir and Sites indicated on X axis?
The first possibility I investigated is whether both Reservoir and Sites can be mapped on the X axis. I don’t think this is possible. ggplot is very flexible, but the X and Y axis inputs seem to be pretty strict about what input they’ll accept. Here’s what I input to graph the storage of all reservoirs above:
data %>%
ggplot() +
geom_boxplot(
mapping = aes(
x = Reservoir,
y = Storage,
fill = Reservoir)
) +
labs(
title = "Storage of Upstate Reservoirs (billion of gallons)",
caption = "Source: NYC Department of Environmental Protection")
If you’re not familiar with ggplot, it works by cumulatively adding layers to build the plot, thus the plus signs. In this case, we pipe (%>%) our data set into ggplot()
which creates a blank plot, then we add a boxplot with geom_boxplot()
, and then add labels with labs()
. The mapping
argument tells the boxplot which variables to use.
The ?aes()
doc page explains more about what can go into the x and y aesthetic mappings:
List of name-value pairs in the form aesthetic = variable describing which variables in the layer data should be mapped to which aesthetics used by the paired geom/stat. The expression variable is evaluated within the layer data, so there is no need to refer to the original dataset (i.e., use ggplot(df, aes(variable)) instead of ggplot(df, aes(df$variable))). The names for x and y aesthetics are typically omitted because they are so common; all other aesthetics must be named.
It’s pretty specific about taking variables as input rather than something like outputs from paste()
or subset()
. I didn’t experiment with it extensively, but I am confident it’s not intended to work this specific way. While looking for a solution, I was reminded of an inbuilt feature of ggplot that is designed to do something like what I want.
Faceting in ggplot2
Faceting is a standard feature of ggplot designed to subset and graph data for situations like this. Kevin Donovan wrote a great introduction to the topic. So what would it look like if we facet the Current Reservoir Levels data?
Here’s the Release levels using facet_grid()
:
data %>%
ggplot() +
geom_boxplot(
mapping = aes(
x = Reservoir,
y = Release,
fill = Reservoir)
) +
facet_grid(
Site ~ .,
scales = "free_y",
drop = TRUE) +
theme_test() +
theme(axis.text.x = element_text(angle = 90),
strip.text.y = element_text(angle = 0),
strip.background.y = element_rect(fill = "#FFFFFF"),
strip.placement = "outside")
Here is the same data using facet_wrap()
:
data %>%
ggplot() +
geom_boxplot(
mapping = aes(
x = Reservoir,
y = Release,
fill = Reservoir)
) +
facet_wrap(
Site ~ .,
scales = "free_y",
drop = TRUE) +
theme_test() +
theme(axis.text.x = element_text(angle = 90),
strip.placement = "outside")
There are a number of thematic tweaks I would make to these maps, and I will spend time focusing on ggplot theming in another post, but these are a significant improvement, particularly the plot made with facet_wrap()
. By setting the scales to move freely along the Y axis, most of the plots make sense and some even have analytical value.
The problems with faceting in our case is that our Sites variable is only applicable for a subset of each measurement variable, which creates a confusing and wasteful facet grid because ggplot faceting expects facets to apply to all values of a variable.
In the above Release graphs, East and West are wasting of previous space and it’s not clear why the reservoirs categorized “NA” (those with only one release site) are together on one graph. Furthermore, the reservoirs are all stuck on one scale, making Rondout and Schoharie hard to understand, and Neversink and Pepacton are duplicated and left blank.
For this data set, I cannot think of a clean, compact, and concise way to look at all the Reservoirs and Sites without just plotting them individually. There are two problems with this approach: it is super verbose and will create an excessive number of plots to display one after the other.
I don’t believe there is a way to make this anything but verbose, but fortunately R has a number of good solutions for condensing multiple plots into single images.
patchwork
and its siblings
patchwork
is a ggplot2
extension designed for exactly this use case: we have multiple plots we want to display as one graphic. There are other solutions for this, like gridExtra
and cowplot
, but I like patchwork
because it is very concise, readable, and easy to write.
By building a grid of plots, we can give them all adequate scales and plot only what we want. This requires a lot of verbosity, repetition, and manual tweaking for each plot, so if I were to make plots like this regularly I would find a way to simplify it. Here is the Storage as a patchwork
graphic:
# build 7 plots individually
# theme() and lab() parameters to reduce visual clutter
sp1 <- subset(data, Reservoir %in% 'Ashokan' & Site %in% 'West') %>%
ggplot() +
geom_boxplot(
mapping = aes(
x = Site,
y = Storage),
width = .3
) +
theme_test() +
theme (
panel.border = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank()) +
labs(x = "Ashokan")
sp2 <- subset(data, Reservoir %in% 'Ashokan' & Site %in% 'East') %>%
ggplot() +
geom_boxplot(
mapping = aes(
x = Site,
y = Storage),
width = .3
) +
theme_test() +
theme (
panel.border = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank()) +
labs(
x = "Ashokan",
y = element_blank())
sp3 <- subset(data, Reservoir %in% 'Schoharie') %>%
ggplot() +
geom_boxplot(
mapping = aes(
x = Reservoir,
y = Storage),
width = .3
) +
theme_test() +
theme (
panel.border = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank()) +
labs(
x = element_blank(),
y = element_blank())
sp4 <- subset(data, Reservoir %in% 'Rondout') %>%
ggplot() +
geom_boxplot(
mapping = aes(
x = Reservoir,
y = Storage),
width = .3
) +
theme_test() +
theme (
panel.border = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank()) +
labs(
x = element_blank(),
y = element_blank())
sp5 <- subset(data, Reservoir %in% 'Neversink') %>%
ggplot() +
geom_boxplot(
mapping = aes(
x = Reservoir,
y = Storage),
width = .3
) +
theme_test() +
theme (
panel.border = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank()) +
labs(
x = element_blank(),
y = element_blank())
sp6 <- subset(data, Reservoir %in% 'Pepacton') %>%
ggplot() +
geom_boxplot(
mapping = aes(
x = Reservoir,
y = Storage),
width = .3
) +
theme_test() +
theme (
panel.border = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank()) +
labs(
x = element_blank(),
y = element_blank())
sp7 <- subset(data, Reservoir %in% 'Cannonsville') %>%
ggplot() +
geom_boxplot(
mapping = aes(
x = Reservoir,
y = Storage),
width = .3
) +
theme_test() +
theme (
panel.border = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank()) +
labs(
x = element_blank(),
y = element_blank())
# build the patchwork.
# parallel plots instead of stacked charts so each plot is
# tall enough for the box plot to be legible.
(sp1 | sp2 | sp3 | sp4 | sp5 | sp6 | sp7) + plot_annotation(
title = "Storage Capacity of Upstate Reservoirs (billions of gallons)",
caption = "Source: NYC Department of Environmental Protection")
This is now the graphic I had in my head when I started analyzing the data. There are many visual elements we would definitely tweak if we wanted this to be a production grade image, since it currently has about as much pizzaz as a Nutrition Facts label, but we have all the essential elements for understanding, analyzing, and asking further questions.
For instance, we can see that the top whiskers are much smaller than the lower whiskers. This makes sense, as a reservoir can be drained but it won’t ever go beyond 100% of its capacity since it would just overflow or (more realistically) the DEP would just increase the outflow from the bottom of the reservoir.
Looking at each individual reservoir, the ranges make sense. The storage values for Ashokan West and Rondout are within 10% of their medians, while the rest are distributed a great deal more widely. Schoharie appears to have at least one zero measurement, which either indicates an error or the reservoir went entirely dry at some point.
This is what a good graphic should do: propel the reader forward so that they are asking (or are prepared for) the next question in the analysis. If people are confused or sidetracked by a graphic’s quirks, then it is undermining rather than supporting the venture.
Now for the Elevation numbers, without the code chunk this time:
I have more questions this time. Pepacton immediately stands out because it has outliers that are several hundred feet above the median measurement. I am not aware of any Biblical floods since November 2017 that placed the Catskills under 200 feet of water, so these look like errors.
The Schoharie zero measurement(s) show up in the Elevation data, too, which makes me wonder if there wasn’t an infrastructure project at some point that required the reservoir to be temporarily drained. A quick search turns up this article which talks about a project at Schoharie from 2019, and there are regularly major capital projects across the water system, so the zero value(s) might potentially be valid.
I’m curious about Ashokan West’s elevation figure, too. The other reservoirs range from 550 to 1450 feet in elevation, so Ashokan West having a median elevation of 40 seems remarkably low. As this reservoir also stores the most water by far, its shallow depth makes me wonder how the two figures square with each other. These values aren’t outliers, of course, just curious.
Now for the Release figures, I will need to structure the graphic a little differently. Since I have two reservoirs, Neversink and Pepacton, with three release sites (North, South, Conservation Flow) each, that means there are 10 total plots.
Here’s what it looks like on one line:
(rp1 | rp2 | rp3 | rp4 | rp5 | rp6 | rp7 | rp8 | rp9 | rp10) +
plot_annotation(
title = "Release from Upstate Reservoirs (millions of gallons)",
caption = "Source: NYC Department of Environmental Protection"
)
It’s a promising attempt, and maybe with more effort and experience I could make this work, but for our purposes here I don’t think everything will fit on one line. I tried to massage the labels as much as possible, but there just wasn’t enough room for everything.
It would be possible to break this into two plots, which would leave plenty of room for labels and spacing, but I believe we can fit everything into one image without making it too visually crowded or difficult to visually parse. Here’s the box plot on two lines:
(rp1 | rp2 | rp3 | rp7 | rp8) /
(rp4 | rp5 | rp6 | rp9 | rp10) +
plot_annotation(
title = "Release from Upstate Reservoirs (millions of gallons)",
caption = "Source: NYC Department of Environmental Protection"
)
I think this is okay. We don’t have quite as much height for the box plots, but they are still useable. patchwork::plot_annotation()
does not allow us to set a global Y variable, so I had to label both rows. I am tempted to rename()
Conservation Flow to just Conservation, since flow is implied as a facet of water release.
As with the Storage and Elevation graphics, this one is useable for actually parsing the outliers and distribution of the data. The first thing that jumps out is that Release numbers are much more widely distributed, with Cannonsville ranging from zero to one billion gpd. Conservation Flow numbers seem intermittent, so most non-zero values are outliers.
I’d want to take a closer look at outliers for the Rondout and Schoharie reservoirs. Rondout’s release numbers are clustered below 50 mgd, but then there are at least two outliers in the 600 mgd range, which would be high even for the largest reservoir in the system, Ashokan. Schoharie’s fourth quartile has quite a long whisker, so I expect most of the outliers are real values, but there are several that exceed 450 mgd that seem questionable.
Bringing it all together
Now we have successfully built box plots that break down the data set by reservoir and site, which turned out to be an unexpectedly complicated task due to the site variable only applying to a subset of each measurement variable. patchwork
allowed us to visualize all the information we wanted to see with appropriate scales and minimal visual clutter.
EDA activities are not generally what gets published, and so this exercise was not strictly necessary. The flexibility of ggplot meant that I easily modified the plots in a working document to see exactly what I wanted, and could have just described all of these activities as “cleaning the data.”
After data is cleaned, though, there is still value in creating visualizations that describe the shapes of the data, like the raincloud plots I mentioned earlier. Many data-oriented audiences prefer visualizations and discussions of the shapes of the data, since it gives them insights into what techniques will work well and their limitations.
Messy facets like our Sites variable are common enough, and patchwork
helped me to build custom graphics to manage this data set’s quirks, but the library is also useful more generally for showing multiple different types of visualizations in one graphic instead of displaying them serially.