Now that I’ve familiarized myself with the R ecosystem, including RStudio, blogdown, querying databases, and basic data wrangling and visualization techniques, I’d like to start on some work of substance. To pace myself, I want to begin with work that is small in scope.
One of the factors that discourages people from tech work is the disconnect between the headlines and the reality of what needs to be done once work starts.
The cool and sexy work products and analyses that make the news are actually the apex of a very large pyramid, and focusing only on them means ignoring what most data work actually is. Everybody wants to hammer the golden spike, but the reality of most data work is laying track in Council Bluffs, Iowa.
Many people have covered the data pyramid topic over the years, so I won’t repeat what a simple Google search will return, other than to say that the work I am interested in for the near future sits about halfway up the pyramid.
Since I am interested in working with open data sets, the decisions and work related to collection and storage are already taken care of: I am in effect a customer of unrefined data. The wrangling work of an analyst is predicated on data engineers who send raw data flows into data storage using ETL (extract, transform, load) techniques.
The first data set I will be working on is the Current Reservoir Levels data set from the NYC Department of Environmental Protection (NYCDEP) - not to be confused with the similarly named New York State Department of Environmental Conservation (NYSDEC).
My goals here are to confirm whether the data is tidy, transform it to a long form with pivot_longer(), and split columns with extract(). I will address outliers, missing values, and other data cleanliness issues, and visualize the data, in a separate post.
First look at the data
We’ll use the head() function to take a look.
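First, here’s a minimal sketch of the import, assuming the CSV export from the NYC Open Data portal was saved locally (the file name is hypothetical):

library(tidyverse)

# Hypothetical local copy of the NYC Open Data CSV export
data <- read_csv("current-reservoir-levels.csv")
head(data)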
Point_time | AUGEVolume | AUGEASTLEVANALOG | AUGWVOLUME | AUGWESTLEVANALOG | ASHREL | SICRESVOLUME | SICRESELEVANALOG | STPALBFLW | RECRESVOLUME | RECRESELEVANALOG | RECREL | NICRESVOLUME | NICRESELEVANALOG | NICNTHFLW | NICSTHFLW | NICCONFLW | EDIRESVOLUME | EDIRESELEVANALOG | EDRNTHFLW | EDRSTHFLW | EDRCONFLW | WDIRESVOLUME | WDIRESELEVANALOG | WDRFLW |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
02/01/2019 | 75.04 | 585.17 | 39.67 | 584.65 | 619 | 16.81 | 1125.75 | 48.3 | 47.33 | 836.77 | 10.21 | 35.21 | 1440.08 | 58.0 | 64.8 | 0 | 142.01 | 1279.56 | 225.8 | 225.6 | 0 | 88.91 | 1146.33 | 959.2 |
02/02/2019 | 74.48 | 584.98 | 39.50 | 584.38 | 619 | 16.82 | 1125.75 | 64.6 | 47.19 | 836.57 | 10.21 | 35.21 | 1440.09 | 58.1 | 64.8 | 0 | 141.62 | 1279.35 | 225.9 | 226.1 | 0 | 88.49 | 1146.05 | 967.4 |
02/03/2019 | 73.88 | 584.62 | 39.34 | 584.21 | 598 | 16.84 | 1125.75 | 68.1 | 47.09 | 836.42 | 10.19 | 35.22 | 1440.10 | 58.1 | 64.8 | 0 | 141.25 | 1279.15 | 225.7 | 226.0 | 0 | 88.16 | 1145.83 | 965.7 |
02/04/2019 | 73.31 | 584.26 | 39.17 | 584.04 | 586 | 16.93 | 1125.81 | 62.5 | 46.99 | 836.27 | 10.20 | 35.23 | 1440.12 | 58.1 | 64.9 | 0 | 140.94 | 1278.98 | 225.7 | 225.6 | 0 | 87.91 | 1145.66 | 966.5 |
02/05/2019 | 72.76 | 583.90 | 39.19 | 583.85 | 584 | 17.16 | 1126.03 | 57.5 | 46.96 | 836.20 | 10.19 | 35.27 | 1440.19 | 58.0 | 64.9 | 0 | 140.92 | 1278.97 | 225.8 | 225.5 | 0 | 88.24 | 1145.88 | 960.3 |
02/06/2019 | 72.26 | 583.55 | 39.52 | 583.88 | 583 | 17.25 | 1126.70 | 49.4 | 47.09 | 836.42 | 10.19 | 35.28 | 1440.21 | 58.1 | 64.9 | 0 | 141.22 | 1279.13 | 225.7 | 225.8 | 0 | 88.64 | 1146.15 | 964.2 |
My first observations are that these column names are completely unintelligible and that everything appears to be numeric, which should simplify getting the correct types. Let’s look at how read_csv() interpreted the column types.
I’ll use skimr::skim() to check because it returns the column types without fuss and, unlike glimpse(), it works when piped into kable(), which makes it prettier to display. I’m starting to learn more about the complexity of package interaction: there are a lot of packages and use cases that do not play well together.
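For reference, here’s the call (a sketch; skim() returns a data frame, so kable() can render it directly):

library(skimr)
library(knitr)

data %>%
  skim() %>%
  kable()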
skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
character | Point_time | 0 | 1 | 10 | 10 | 0 | 1307 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
numeric | AUGEVolume | 0 | 1 | NA | NA | NA | NA | NA | 70.7547670 | 8.0147475 | 43.32 | 65.44 | 73.38 | 77.33 | 81.45 | ▁▂▂▅▇ |
numeric | AUGEASTLEVANALOG | 0 | 1 | NA | NA | NA | NA | NA | 581.9706493 | 5.2570404 | 563.21 | 578.84 | 583.71 | 586.35 | 588.83 | ▁▁▂▅▇ |
numeric | AUGWVOLUME | 0 | 1 | NA | NA | NA | NA | NA | 38.6308480 | 6.0850618 | 21.81 | 35.21 | 39.75 | 43.30 | 47.69 | ▁▃▃▇▆ |
numeric | AUGWESTLEVANALOG | 0 | 1 | NA | NA | NA | NA | NA | 582.6484570 | 6.7243636 | 562.37 | 579.03 | 584.34 | 587.85 | 591.46 | ▁▂▂▇▇ |
numeric | ASHREL | 0 | 1 | NA | NA | NA | NA | NA | 160.8898396 | 226.6211923 | 0.00 | 13.00 | 18.00 | 324.00 | 623.00 | ▇▁▁▁▂ |
numeric | SICRESVOLUME | 0 | 1 | NA | NA | NA | NA | NA | 16.4825898 | 2.2500972 | 0.00 | 15.75 | 16.92 | 18.26 | 19.72 | ▁▁▁▂▇ |
numeric | SICRESELEVANALOG | 0 | 1 | NA | NA | NA | NA | NA | 1124.0199083 | 10.3972483 | 1035.00 | 1121.90 | 1125.88 | 1129.08 | 1362.67 | ▁▇▁▁▁ |
numeric | STPALBFLW | 0 | 1 | NA | NA | NA | NA | NA | 73.5948052 | 85.3216774 | 0.00 | 1.00 | 37.90 | 125.90 | 569.60 | ▇▅▁▁▁ |
numeric | RECRESVOLUME | 0 | 1 | NA | NA | NA | NA | NA | 47.7186173 | 0.7529521 | 45.34 | 47.22 | 47.73 | 48.29 | 49.31 | ▁▂▇▇▃ |
numeric | RECRESELEVANALOG | 0 | 1 | NA | NA | NA | NA | NA | 837.2233309 | 1.1499935 | 833.74 | 836.46 | 837.31 | 838.11 | 839.65 | ▁▃▇▇▃ |
numeric | RECREL | 0 | 1 | NA | NA | NA | NA | NA | 15.5071429 | 47.0380886 | 9.46 | 9.94 | 10.21 | 14.72 | 731.00 | ▇▁▁▁▁ |
numeric | NICRESVOLUME | 0 | 1 | NA | NA | NA | NA | NA | 32.9248052 | 2.6853581 | 23.33 | 31.39 | 33.74 | 35.22 | 35.90 | ▁▁▂▃▇ |
numeric | NICRESELEVANALOG | 0 | 1 | NA | NA | NA | NA | NA | 1435.1561877 | 5.7888105 | 1413.39 | 1431.88 | 1437.02 | 1440.08 | 1441.42 | ▁▁▂▃▇ |
numeric | NICNTHFLW | 0 | 1 | NA | NA | NA | NA | NA | 30.3886784 | 28.9678642 | 0.00 | 0.00 | 38.50 | 61.00 | 78.25 | ▇▁▂▆▁ |
numeric | NICSTHFLW | 0 | 1 | NA | NA | NA | NA | NA | 50.5201986 | 25.1924814 | 0.00 | 39.10 | 61.09 | 65.10 | 99.99 | ▂▁▁▇▁ |
numeric | NICCONFLW | 0 | 1 | NA | NA | NA | NA | NA | 0.0807869 | 1.0604576 | 0.00 | 0.00 | 0.00 | 0.00 | 20.00 | ▇▁▁▁▁ |
numeric | EDIRESVOLUME | 0 | 1 | NA | NA | NA | NA | NA | 129.4795111 | 13.1585229 | 96.83 | 118.20 | 132.75 | 141.55 | 152.67 | ▂▅▅▇▆ |
numeric | EDIRESELEVANALOG | 0 | 1 | NA | NA | NA | NA | NA | 1276.3336520 | 27.9253472 | 1252.55 | 1265.90 | 1274.39 | 1279.33 | 1572.72 | ▇▁▁▁▁ |
numeric | EDRNTHFLW | 0 | 1 | NA | NA | NA | NA | NA | 92.5472574 | 87.4711606 | 0.00 | 0.00 | 80.50 | 191.50 | 235.10 | ▇▅▃▁▆ |
numeric | EDRSTHFLW | 0 | 1 | NA | NA | NA | NA | NA | 99.7966234 | 84.2206902 | 0.00 | 12.30 | 88.30 | 193.90 | 234.90 | ▇▇▅▁▇ |
numeric | EDRCONFLW | 0 | 1 | NA | NA | NA | NA | NA | 0.2058824 | 1.9839503 | 0.00 | 0.00 | 0.00 | 0.00 | 23.00 | ▇▁▁▁▁ |
numeric | WDIRESVOLUME | 0 | 1 | NA | NA | NA | NA | NA | 79.3403438 | 16.3171543 | 38.98 | 70.24 | 85.01 | 91.96 | 99.11 | ▂▂▂▅▇ |
numeric | WDIRESELEVANALOG | 0 | 1 | NA | NA | NA | NA | NA | 1138.7774714 | 12.4987642 | 1106.23 | 1132.51 | 1143.66 | 1148.33 | 1151.95 | ▂▁▁▃▇ |
numeric | WDRFLW | 0 | 1 | NA | NA | NA | NA | NA | 467.1185638 | 325.5462756 | 17.00 | 213.90 | 365.50 | 832.10 | 1009.60 | ▆▇▂▁▆ |
There’s a lot of information here, but we’re interested in the first two columns. Every number was correctly interpreted, but Point_time was interpreted as a character when it is actually a date. So we have two fairly simple corrections to make: rename the columns, and change the date column to a date data type.
Adding human-friendly column headers
After taking a look at the data dictionary from the NYC Open Data page to figure out how to interpret these columns, I see that it provides labels that are much more human-friendly. We’ll use those instead, renaming with dplyr::rename():
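Here’s a sketch of the renaming step, abbreviated to the first few columns; the remaining mappings follow the same pattern from the data dictionary. This is also where I sort the rows with arrange(), which comes up again below:

data <- data %>%
  rename(
    `Date` = Point_time,
    `Ashokan East Storage` = AUGEVolume,
    `Ashokan East Elevation` = AUGEASTLEVANALOG,
    `Ashokan Release` = ASHREL
    # ...and so on for the remaining columns
  ) %>%
  arrange(Date)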
Date | Ashokan East Storage | Ashokan East Elevation | Ashokan West Elevation | Ashokan West Storage | Ashokan Release | Schoharie Storage | Schoharie Elevation | Schoharie Release | Rondout Storage | Rondout Elevation | Rondout Release | Neversink Storage | Neversink Elevation | Neversink North Flow Release | Neversink South Flow Release | Neversink Conservation Flow Release | Pepacton Storage | Pepacton Elevation | Pepacton North Flow Release | Pepacton South Flow Release | Pepacton Conservation Flow Release | Cannonsville Storage | Cannonsville Elevation | Cannonsville Release |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
01/01/2018 | 59.94 | 574.00 | 32.67 | 574.33 | 10 | 13.10 | 1110.44 | 176.0 | 48.26 | 837.34 | 9.74 | 29.64 | 1427.56 | 35.70 | 0.00 | 0 | 111.39 | 1261.24 | 47.6 | 0.0 | 0 | 48.80 | 1113.87 | 87.5 |
01/01/2019 | 74.47 | 584.62 | 42.99 | 587.85 | 594 | 17.08 | 1126.06 | 5.8 | 47.94 | 837.69 | 10.05 | 35.41 | 1440.47 | 61.80 | 61.00 | 0 | 143.69 | 1280.39 | 226.0 | 226.2 | 0 | 90.17 | 1147.18 | 969.0 |
01/01/2020 | 75.18 | 585.06 | 37.84 | 582.65 | 588 | 16.96 | 1126.55 | 1.4 | 47.62 | 837.41 | 10.19 | 32.61 | 1435.03 | 70.80 | 0.00 | 0 | 126.95 | 1270.95 | 97.3 | 96.8 | 0 | 84.57 | 1143.07 | 389.7 |
01/01/2021 | 72.75 | 583.55 | 41.40 | 586.21 | 592 | 16.81 | 1125.69 | 1.9 | 47.17 | 836.53 | 9.97 | 35.28 | 1440.22 | 61.27 | 62.35 | 0 | 129.68 | 1572.72 | 97.6 | 97.2 | 0 | 81.50 | 1141.20 | 152.7 |
01/02/2018 | 59.89 | 573.99 | 32.57 | 574.29 | 10 | 12.81 | 1109.46 | 189.1 | 48.17 | 837.20 | 9.76 | 28.92 | 1425.94 | 36.10 | 0.00 | 0 | 111.01 | 1261.01 | 47.6 | 0.0 | 0 | 48.95 | 1114.00 | 87.4 |
01/02/2019 | 74.50 | 584.62 | 42.77 | 587.85 | 601 | 17.01 | 1126.06 | 1.0 | 47.60 | 837.18 | 9.93 | 35.33 | 1440.32 | 61.70 | 60.90 | 0 | 143.43 | 1280.33 | 225.7 | 226.2 | 0 | 90.13 | 1147.15 | 969.4 |
The names are good, but notice how the dates are arranged now. I added arrange(Date) to sort the rows by date because they were out of order in the raw data set. Since the dates are character class, arrange() interpreted them alphanumerically; in other words, 01/01/2019 is followed by 01/01/2020 rather than by 01/02/2019.
To correct this, we’ll convert the first column to dates using lubridate::mdy(), confirm with class(), and look again.
data$Date <- data$Date %>%
  mdy()    # parse "mm/dd/yyyy" strings into Date objects

class(data$Date)
## [1] "Date"
Date | Ashokan East Storage | Ashokan East Elevation | Ashokan West Elevation | Ashokan West Storage | Ashokan Release | Schoharie Storage | Schoharie Elevation | Schoharie Release | Rondout Storage | Rondout Elevation | Rondout Release | Neversink Storage | Neversink Elevation | Neversink North Flow Release | Neversink South Flow Release | Neversink Conservation Flow Release | Pepacton Storage | Pepacton Elevation | Pepacton North Flow Release | Pepacton South Flow Release | Pepacton Conservation Flow Release | Cannonsville Storage | Cannonsville Elevation | Cannonsville Release |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2017-11-01 | 65.36 | 577.86 | 36.34 | 577.89 | 18 | 12.92 | 1109.86 | 0 | 47.26 | 835.85 | 9.92 | 30.74 | 1430.03 | 38.9 | 0 | 0 | 116.60 | 1264.45 | 51.7 | 0 | 0 | 47.69 | 1112.87 | 97.3 |
2017-11-02 | 64.95 | 577.58 | 36.82 | 578.59 | 11 | 13.28 | 1111.05 | 0 | 47.32 | 835.94 | 9.92 | 30.68 | 1429.90 | 39.0 | 0 | 0 | 116.87 | 1264.61 | 51.7 | 0 | 0 | 48.20 | 1113.33 | 97.6 |
2017-11-03 | 64.36 | 577.31 | 37.18 | 579.14 | 12 | 13.56 | 1111.97 | 0 | 47.22 | 835.79 | 9.89 | 30.72 | 1429.99 | 39.0 | 0 | 0 | 117.08 | 1264.74 | 51.8 | 0 | 0 | 48.66 | 1113.74 | 97.8 |
2017-11-04 | 63.71 | 576.94 | 37.49 | 579.56 | 12 | 13.78 | 1112.73 | 0 | 47.28 | 835.88 | 9.89 | 30.83 | 1430.22 | 39.1 | 0 | 0 | 117.15 | 1264.78 | 51.8 | 0 | 0 | 49.05 | 1114.09 | 98.0 |
2017-11-05 | 63.15 | 576.52 | 37.78 | 579.92 | 12 | 13.98 | 1113.39 | 0 | 47.36 | 836.00 | 9.82 | 30.94 | 1430.47 | 39.1 | 0 | 0 | 117.11 | 1264.76 | 51.8 | 0 | 0 | 49.35 | 1114.36 | 97.7 |
2017-11-06 | 62.53 | 576.14 | 38.04 | 580.24 | 13 | 14.19 | 1114.04 | 0 | 47.46 | 836.14 | 9.76 | 31.07 | 1430.75 | 40.7 | 0 | 0 | 117.23 | 1264.83 | 54.0 | 0 | 0 | 49.75 | 1114.72 | 102.5 |
Great. The date is now arranged in ascending date order and displayed in R’s standard YYYY-MM-DD format, which is fine for our purposes here. Now we’re ready to talk about the tidiness of the data.
Semantics of the data
Tidy data is not a new concept, but its formal study and the popularity of the term have grown in stature alongside the importance of machine learning in modern business and the need for consistent approaches to manipulating data. Hadley Wickham’s 2014 paper on the topic and his continued work on the tidyverse have also facilitated the spread of the concept, and there are plenty of good introductions and breakdowns of the topic. I like the R tidy data vignette.
Tidy data by definition has three elements:
- Each column represents a variable
- Each row represents an observation
- Each cell contains one value
Some treatments also add “each table represents one level of observation.” So, is the Current Reservoir Levels data set tidy?
To answer that, we need to explore the semantics a bit. The larger context of the data set is the water supply of New York City, which is a fascinating world of engineering in itself.
Our data set covers the six upstate reservoirs that supply about 90% of the city’s water. They come from the Catskill (Ashokan and Schoharie) and Delaware (Rondout, Neversink, Pepacton, Cannonsville) watersheds. For each reservoir there are three types of measurements, and some basins have multiple sites for each type:
- Storage in billion gallons (BG)
- Elevation in feet (FT)
- Release in millions of gallons per day (MGD)
Storage represents how much water is in the reservoir, elevation represents the height of the surface of the water, and release represents how much water leaves the reservoir. Now we can consider what each column name means in terms of tidiness.
When our second column says “Ashokan East Storage,” that means it’s a measurement from the Ashokan reservoir, measuring storage, at the East Basin. When another column says “Schoharie Storage,” that means it’s a measurement from the Schoharie reservoir measuring storage. This means some column headers encode three variables, and every measurement column encodes at least two.
Since tidy data requires that each variable is its own column, our data is not tidy. To correct this, we will pivot our table using pivot_longer() and regular expressions (regex).
Pivoting the data
It’s simple to say what we want to do to make the data tidy: move the reservoir name to its own column, move the measurement type to its own column, and move the measurement site (where applicable) to its own column. Finally, we want to move the values to their own column.
This is where the actual difficulty and frustration in tech work happens. How do I do this task given the tools I know? Are there other tools? There are many ways to do any given part, but how do I chain everything together in a way that works? Am I asking the wrong question? Am I approaching the solution from the wrong angle?
I suspect there’s no way to get over this initial hump other than having people to ask for help, expending elbow grease, taking deep dives into the documentation, gaining experience, and perfecting your Google Fu.
Once we break down the complex operation we want to perform into this simpler series of instructions, and then identify a set of tools that act accordingly (and in the way we expect them to), it’s not too difficult to transform the data. In this case, we are pivoting variables embedded in the column headers to transform them into new rows.
To tell R how we specifically want to break up the column headers, we will use regex. To avoid a single operation that is more complex than necessary, and possibly impossible, we will break this pivot down into multiple steps. First, we’ll separate the reservoir name from the measurement type.
The reservoir name is the first word in the column header, separated by a space from the rest of the title (the measurement site and type). We want a regex that says: one word, followed by exactly one space (which we’ll ignore), and then everything else. Those expressions are \w*, \s{1}, and .* respectively.
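To sanity-check the pattern before handing it to pivot_longer(), we can try it on a single column name with stringr::str_match() (a quick sketch):

library(stringr)

# Capture groups: (first word)(exactly one space, ignored)(everything else)
str_match("Ashokan East Storage", "(\\w*)\\s{1}(.*)")
##      [,1]                   [,2]      [,3]
## [1,] "Ashokan East Storage" "Ashokan" "East Storage"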
There are tons of search results and doc pages, but I am starting to notice a pattern: R’s (and particularly the tidyverse’s) vignettes are very helpful when approaching a new topic. Vignettes are long-form narratives on a topic that include worked cases with code. If docs are a cookbook, vignettes are a supermarket.
In this case, the “Many variables in column names” section of the vignette("pivot") page gave me exactly the use case I needed to put my code together. I thought about trying to achieve all three transformations in one step, but I already had smoke coming out of my ears and didn’t want to overdo it.
This is not my first experience with pivoting, but it has taken a lot of mental effort to get used to the logic behind the operation and the specific way that pivot_longer() achieves it. Once I was able to mentally visualize the spatial transformation the function was meant to achieve, I felt comfortable that I wasn’t just typing randomly.
Here’s the solution I came up with:
data <- data %>%
  pivot_longer(
    cols = 2:25,                          # every column except Date
    names_to = c("Reservoir", "Measurement Type"),
    names_pattern = "(\\w*)\\s{1}(.*)",   # (first word)(one space)(the rest)
    values_to = "Value"
  )
glimpse(data)
## Rows: 31,416
## Columns: 4
## $ Date <date> 2017-11-01, 2017-11-01, 2017-11-01, 2017-11-01, 20…
## $ Reservoir <chr> "Ashokan", "Ashokan", "Ashokan", "Ashokan", "Ashoka…
## $ `Measurement Type` <chr> "East Storage", "East Elevation", "West Elevation",…
## $ Value <dbl> 65.36, 577.86, 36.34, 577.89, 18.00, 12.92, 1109.86…
I used glimpse() here instead of head() because I wanted to illustrate a point: untidy data is much more compact. The original data had 1,309 rows; pivoted, it has 31,416. The original data set, with 25 columns, occupied 32,725 cells, while the transformed data, with 4 columns, occupies 125,664 cells.
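The arithmetic checks out: each of the 1,309 original rows contributes one new row per value column.

1309 * 24    # rows after pivoting (24 value columns per date)
## [1] 31416
1309 * 25    # cells before the pivot
## [1] 32725
31416 * 4    # cells after the pivot
## [1] 125664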
Splitting columns with extract()
To split the Site variable off of the Measurement Type column, I’ll call extract(). I’m sure it’s possible to write a regex that does this in one step, but since functions like extract() and pivot_longer() call underlying functions with their own rules and quirks, I’m not going to fight the gods and will instead just run it through extract() twice.

I’ll also rename the current Measurement Type column, because we’ll be replacing it with a newly extracted column.
data <- data %>%
  rename(temp = `Measurement Type`) %>%
  extract(
    "temp",
    "Site",
    "(North|South|Conservation Flow|West|East)",
    remove = FALSE    # keep temp for the second extract()
  ) %>%
  extract(
    "temp",
    "Measurement Type",
    "(Storage|Elevation|Release)"    # temp is dropped by default
  )
data %>%
head(100) %>%
kbl() %>%
kable_styling(font_size=12) %>%
scroll_box(height="15em")
Date | Reservoir | Measurement Type | Site | Value |
---|---|---|---|---|
2017-11-01 | Ashokan | Storage | East | 65.36 |
2017-11-01 | Ashokan | Elevation | East | 577.86 |
2017-11-01 | Ashokan | Elevation | West | 36.34 |
2017-11-01 | Ashokan | Storage | West | 577.89 |
2017-11-01 | Ashokan | Release | NA | 18.00 |
2017-11-01 | Schoharie | Storage | NA | 12.92 |
2017-11-01 | Schoharie | Elevation | NA | 1109.86 |
2017-11-01 | Schoharie | Release | NA | 0.00 |
2017-11-01 | Rondout | Storage | NA | 47.26 |
2017-11-01 | Rondout | Elevation | NA | 835.85 |
2017-11-01 | Rondout | Release | NA | 9.92 |
2017-11-01 | Neversink | Storage | NA | 30.74 |
2017-11-01 | Neversink | Elevation | NA | 1430.03 |
2017-11-01 | Neversink | Release | North | 38.90 |
2017-11-01 | Neversink | Release | South | 0.00 |
2017-11-01 | Neversink | Release | Conservation Flow | 0.00 |
2017-11-01 | Pepacton | Storage | NA | 116.60 |
2017-11-01 | Pepacton | Elevation | NA | 1264.45 |
2017-11-01 | Pepacton | Release | North | 51.70 |
2017-11-01 | Pepacton | Release | South | 0.00 |
2017-11-01 | Pepacton | Release | Conservation Flow | 0.00 |
2017-11-01 | Cannonsville | Storage | NA | 47.69 |
2017-11-01 | Cannonsville | Elevation | NA | 1112.87 |
2017-11-01 | Cannonsville | Release | NA | 97.30 |
2017-11-02 | Ashokan | Storage | East | 64.95 |
2017-11-02 | Ashokan | Elevation | East | 577.58 |
2017-11-02 | Ashokan | Elevation | West | 36.82 |
2017-11-02 | Ashokan | Storage | West | 578.59 |
2017-11-02 | Ashokan | Release | NA | 11.00 |
2017-11-02 | Schoharie | Storage | NA | 13.28 |
2017-11-02 | Schoharie | Elevation | NA | 1111.05 |
2017-11-02 | Schoharie | Release | NA | 0.00 |
2017-11-02 | Rondout | Storage | NA | 47.32 |
2017-11-02 | Rondout | Elevation | NA | 835.94 |
2017-11-02 | Rondout | Release | NA | 9.92 |
2017-11-02 | Neversink | Storage | NA | 30.68 |
2017-11-02 | Neversink | Elevation | NA | 1429.90 |
2017-11-02 | Neversink | Release | North | 39.00 |
2017-11-02 | Neversink | Release | South | 0.00 |
2017-11-02 | Neversink | Release | Conservation Flow | 0.00 |
2017-11-02 | Pepacton | Storage | NA | 116.87 |
2017-11-02 | Pepacton | Elevation | NA | 1264.61 |
2017-11-02 | Pepacton | Release | North | 51.70 |
2017-11-02 | Pepacton | Release | South | 0.00 |
2017-11-02 | Pepacton | Release | Conservation Flow | 0.00 |
2017-11-02 | Cannonsville | Storage | NA | 48.20 |
2017-11-02 | Cannonsville | Elevation | NA | 1113.33 |
2017-11-02 | Cannonsville | Release | NA | 97.60 |
2017-11-03 | Ashokan | Storage | East | 64.36 |
2017-11-03 | Ashokan | Elevation | East | 577.31 |
2017-11-03 | Ashokan | Elevation | West | 37.18 |
2017-11-03 | Ashokan | Storage | West | 579.14 |
2017-11-03 | Ashokan | Release | NA | 12.00 |
2017-11-03 | Schoharie | Storage | NA | 13.56 |
2017-11-03 | Schoharie | Elevation | NA | 1111.97 |
2017-11-03 | Schoharie | Release | NA | 0.00 |
2017-11-03 | Rondout | Storage | NA | 47.22 |
2017-11-03 | Rondout | Elevation | NA | 835.79 |
2017-11-03 | Rondout | Release | NA | 9.89 |
2017-11-03 | Neversink | Storage | NA | 30.72 |
2017-11-03 | Neversink | Elevation | NA | 1429.99 |
2017-11-03 | Neversink | Release | North | 39.00 |
2017-11-03 | Neversink | Release | South | 0.00 |
2017-11-03 | Neversink | Release | Conservation Flow | 0.00 |
2017-11-03 | Pepacton | Storage | NA | 117.08 |
2017-11-03 | Pepacton | Elevation | NA | 1264.74 |
2017-11-03 | Pepacton | Release | North | 51.80 |
2017-11-03 | Pepacton | Release | South | 0.00 |
2017-11-03 | Pepacton | Release | Conservation Flow | 0.00 |
2017-11-03 | Cannonsville | Storage | NA | 48.66 |
2017-11-03 | Cannonsville | Elevation | NA | 1113.74 |
2017-11-03 | Cannonsville | Release | NA | 97.80 |
2017-11-04 | Ashokan | Storage | East | 63.71 |
2017-11-04 | Ashokan | Elevation | East | 576.94 |
2017-11-04 | Ashokan | Elevation | West | 37.49 |
2017-11-04 | Ashokan | Storage | West | 579.56 |
2017-11-04 | Ashokan | Release | NA | 12.00 |
2017-11-04 | Schoharie | Storage | NA | 13.78 |
2017-11-04 | Schoharie | Elevation | NA | 1112.73 |
2017-11-04 | Schoharie | Release | NA | 0.00 |
2017-11-04 | Rondout | Storage | NA | 47.28 |
2017-11-04 | Rondout | Elevation | NA | 835.88 |
2017-11-04 | Rondout | Release | NA | 9.89 |
2017-11-04 | Neversink | Storage | NA | 30.83 |
2017-11-04 | Neversink | Elevation | NA | 1430.22 |
2017-11-04 | Neversink | Release | North | 39.10 |
2017-11-04 | Neversink | Release | South | 0.00 |
2017-11-04 | Neversink | Release | Conservation Flow | 0.00 |
2017-11-04 | Pepacton | Storage | NA | 117.15 |
2017-11-04 | Pepacton | Elevation | NA | 1264.78 |
2017-11-04 | Pepacton | Release | North | 51.80 |
2017-11-04 | Pepacton | Release | South | 0.00 |
2017-11-04 | Pepacton | Release | Conservation Flow | 0.00 |
2017-11-04 | Cannonsville | Storage | NA | 49.05 |
2017-11-04 | Cannonsville | Elevation | NA | 1114.09 |
2017-11-04 | Cannonsville | Release | NA | 98.00 |
2017-11-05 | Ashokan | Storage | East | 63.15 |
2017-11-05 | Ashokan | Elevation | East | 576.52 |
2017-11-05 | Ashokan | Elevation | West | 37.78 |
2017-11-05 | Ashokan | Storage | West | 579.92 |
Are these particularly graceful regexes? Absolutely not. A computer science student would get points deducted for submitting something like this on a homework assignment. Yet they work with this data set in this context with these function calls, and that’s good enough for now. Not every meal can be a feast of legend.
The other issue I see after extracting these columns is that their type is character. Since Reservoir, Site, and Measurement Type are categories, they should be factors. We’ll convert them using as_factor().
# Convert each categorical column from character to factor
data <- data %>%
  mutate(across(c(Reservoir, `Measurement Type`, Site), as_factor))
summary(data)
## Date Reservoir Measurement Type
## Min. :2017-11-01 Ashokan :6545 Storage : 9163
## 1st Qu.:2018-09-24 Schoharie :3927 Elevation: 9163
## Median :2019-08-17 Rondout :3927 Release :13090
## Mean :2019-08-17 Neversink :6545
## 3rd Qu.:2020-07-09 Pepacton :6545
## Max. :2021-05-31 Cannonsville:3927
## Site Value
## East : 2618 Min. : 0.00
## West : 2618 1st Qu.: 18.42
## North : 2618 Median : 74.83
## South : 2618 Mean : 349.25
## Conservation Flow: 2618 3rd Qu.: 586.46
## NA's :18326 Max. :1572.72
Finally, we’ll save the data locally so that we don’t need to redo all of these transformations the next time we work with the data.
write.csv(data, file="/path/to/file.csv")
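One caveat: a CSV won’t remember the factor and Date types we just set, so they’d have to be re-declared on import. Here’s a sketch of an alternative using R’s native serialization, which preserves column types (the path is a placeholder):

# saveRDS() keeps factor and Date columns intact across sessions
saveRDS(data, file = "/path/to/file.rds")
# data <- readRDS("/path/to/file.rds")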
Conclusion
After retrieving the data, and before doing higher-level work that explores the meaning of the data and looks for value, we have to make sure that the data is in a form we can use and that it is not garbage. Changing the form is data wrangling (or munging), which is what we’ve done here.
I am now more comfortable with the form of this data. I don’t really like that Value mixes multiple units in one column (feet, billions of gallons, millions of gallons per day), because it makes summary statistics calculated on the Value column meaningless. As I explore the data more, it may make sense to break this column apart.
Since the data is tidy, we are ready to look at the values present and make sure they are accurate, sane, and ready for higher level exploratory data analysis (EDA) techniques. Wrangling the data is not a one-time process, and as we explore and visualize the data, the shape of the table can change to meet the needs of the analysis.