Now that I’ve familiarized myself with the R ecosystem, including RStudio, blogdown, querying databases, and basic data wrangling and visualization techniques, I’d like to start with some work of substance. I want to start with work that is small in scope to pace myself.

One of the factors that discourages people from tech work is the disconnect between the headlines and the reality of what needs to be done once work starts.

The cool and sexy work products and analysis that make the news are actually the apex of a very large pyramid, and only focusing on that means ignoring the reality of what most data work actually is. Everybody wants to hammer the golden spike, but the reality of most data work is laying track in Council Bluffs, Iowa.

Many people have covered the data pyramid topic over the years, so I won’t repeat what a simple Google search will return other than to say that the work I am interested in for the near future is about half way up the pyramid.

Since I am interested in working with open data sets, the decisions and work related to collection and storage are already taken care of: I am in effect a customer of unrefined data. The wrangling work of an analyst is predicated on data engineers who send raw data flows into data storage using ETL (extract, transform, load) techniques.

The first set I will be working on is Current Reservoir Levels data set from the NYC Department of Environmental Protection (NYCDEP) - not to be confused with the similarly named New York State Department of Environmental Conservation (NYSDEC).

My goals here are to confirm whether the data is tidy, transform it to a long form with pivot_longer(), and split columns with extract(). I will address outliers, missing values, and other data cleanliness issues and visualize the data in a separate post.

First look at the data

We’ll use the head() function to take a look.

Point_timeAUGEVolumeAUGEASTLEVANALOGAUGWVOLUMEAUGWESTLEVANALOGASHRELSICRESVOLUMESICRESELEVANALOGSTPALBFLWRECRESVOLUMERECRESELEVANALOGRECRELNICRESVOLUMENICRESELEVANALOGNICNTHFLWNICSTHFLWNICCONFLWEDIRESVOLUMEEDIRESELEVANALOGEDRNTHFLWEDRSTHFLWEDRCONFLWWDIRESVOLUMEWDIRESELEVANALOGWDRFLW
02/01/201975.04585.1739.67584.6561916.811125.7548.347.33836.7710.2135.211440.0858.064.80142.011279.56225.8225.6088.911146.33959.2
02/02/201974.48584.9839.50584.3861916.821125.7564.647.19836.5710.2135.211440.0958.164.80141.621279.35225.9226.1088.491146.05967.4
02/03/201973.88584.6239.34584.2159816.841125.7568.147.09836.4210.1935.221440.1058.164.80141.251279.15225.7226.0088.161145.83965.7
02/04/201973.31584.2639.17584.0458616.931125.8162.546.99836.2710.2035.231440.1258.164.90140.941278.98225.7225.6087.911145.66966.5
02/05/201972.76583.9039.19583.8558417.161126.0357.546.96836.2010.1935.271440.1958.064.90140.921278.97225.8225.5088.241145.88960.3
02/06/201972.26583.5539.52583.8858317.251126.7049.447.09836.4210.1935.281440.2158.164.90141.221279.13225.7225.8088.641146.15964.2

My first observations are that these column names are completely unintelligible and that everything appears to be numbers, which should simplify getting the correct types. Let’s look at how read_csv() interpreted the column types.

I’ll use skimr::skim() to return the types because it returns the column types without fuss and, unlike glimpse(), it works when piped into kable() and therefore is prettier to display. I’m starting to learn more about the complexity of package interaction: there are a lot of packages and use cases which do not play well together.

skim_typeskim_variablen_missingcomplete_ratecharacter.mincharacter.maxcharacter.emptycharacter.n_uniquecharacter.whitespacenumeric.meannumeric.sdnumeric.p0numeric.p25numeric.p50numeric.p75numeric.p100numeric.hist
characterPoint_time011010013070NANANANANANANANA
numericAUGEVolume01NANANANANA70.75476708.014747543.3265.4473.3877.3381.45▁▂▂▅▇
numericAUGEASTLEVANALOG01NANANANANA581.97064935.2570404563.21578.84583.71586.35588.83▁▁▂▅▇
numericAUGWVOLUME01NANANANANA38.63084806.085061821.8135.2139.7543.3047.69▁▃▃▇▆
numericAUGWESTLEVANALOG01NANANANANA582.64845706.7243636562.37579.03584.34587.85591.46▁▂▂▇▇
numericASHREL01NANANANANA160.8898396226.62119230.0013.0018.00324.00623.00▇▁▁▁▂
numericSICRESVOLUME01NANANANANA16.48258982.25009720.0015.7516.9218.2619.72▁▁▁▂▇
numericSICRESELEVANALOG01NANANANANA1124.019908310.39724831035.001121.901125.881129.081362.67▁▇▁▁▁
numericSTPALBFLW01NANANANANA73.594805285.32167740.001.0037.90125.90569.60▇▅▁▁▁
numericRECRESVOLUME01NANANANANA47.71861730.752952145.3447.2247.7348.2949.31▁▂▇▇▃
numericRECRESELEVANALOG01NANANANANA837.22333091.1499935833.74836.46837.31838.11839.65▁▃▇▇▃
numericRECREL01NANANANANA15.507142947.03808869.469.9410.2114.72731.00▇▁▁▁▁
numericNICRESVOLUME01NANANANANA32.92480522.685358123.3331.3933.7435.2235.90▁▁▂▃▇
numericNICRESELEVANALOG01NANANANANA1435.15618775.78881051413.391431.881437.021440.081441.42▁▁▂▃▇
numericNICNTHFLW01NANANANANA30.388678428.96786420.000.0038.5061.0078.25▇▁▂▆▁
numericNICSTHFLW01NANANANANA50.520198625.19248140.0039.1061.0965.1099.99▂▁▁▇▁
numericNICCONFLW01NANANANANA0.08078691.06045760.000.000.000.0020.00▇▁▁▁▁
numericEDIRESVOLUME01NANANANANA129.479511113.158522996.83118.20132.75141.55152.67▂▅▅▇▆
numericEDIRESELEVANALOG01NANANANANA1276.333652027.92534721252.551265.901274.391279.331572.72▇▁▁▁▁
numericEDRNTHFLW01NANANANANA92.547257487.47116060.000.0080.50191.50235.10▇▅▃▁▆
numericEDRSTHFLW01NANANANANA99.796623484.22069020.0012.3088.30193.90234.90▇▇▅▁▇
numericEDRCONFLW01NANANANANA0.20588241.98395030.000.000.000.0023.00▇▁▁▁▁
numericWDIRESVOLUME01NANANANANA79.340343816.317154338.9870.2485.0191.9699.11▂▂▂▅▇
numericWDIRESELEVANALOG01NANANANANA1138.777471412.49876421106.231132.511143.661148.331151.95▂▁▁▃▇
numericWDRFLW01NANANANANA467.1185638325.546275617.00213.90365.50832.101009.60▆▇▂▁▆

There’s a lot of information here, but we’re interested in the first two columns. Every number was correctly interpreted, but Point_time was interpreted as a character when it is actually a date. So we have two fairly simple corrections to make: rename the columns, and change the date column to a date data type.

Adding human-friendly column headers

After taking a look at the data dictionary from the NYC Open Data page to figure out how to interpret these columns, I see that they provide labels that are much more human-friendly. We’ll use those instead and rename using dplyr::rename():

DateAshokan East StorageAshokan East ElevationAshokan West ElevationAshokan West StorageAshokan ReleaseSchoharie StorageSchoharie ElevationSchoharie ReleaseRondout StorageRondout ElevationRondout ReleaseNeversink StorageNeversink ElevationNeversink North Flow ReleaseNeversink South Flow ReleaseNeversink Conservation Flow ReleasePepacton StoragePepacton ElevationPepacton North Flow ReleasePepacton South Flow ReleasePepacton Conservation Flow ReleaseCannonsville StorageCannonsville ElevationCannonsville Release
01/01/201859.94574.0032.67574.331013.101110.44176.048.26837.349.7429.641427.5635.700.000111.391261.2447.60.0048.801113.8787.5
01/01/201974.47584.6242.99587.8559417.081126.065.847.94837.6910.0535.411440.4761.8061.000143.691280.39226.0226.2090.171147.18969.0
01/01/202075.18585.0637.84582.6558816.961126.551.447.62837.4110.1932.611435.0370.800.000126.951270.9597.396.8084.571143.07389.7
01/01/202172.75583.5541.40586.2159216.811125.691.947.17836.539.9735.281440.2261.2762.350129.681572.7297.697.2081.501141.20152.7
01/02/201859.89573.9932.57574.291012.811109.46189.148.17837.209.7628.921425.9436.100.000111.011261.0147.60.0048.951114.0087.4
01/02/201974.50584.6242.77587.8560117.011126.061.047.60837.189.9335.331440.3261.7060.900143.431280.33225.7226.2090.131147.15969.4

The names are good, but notice how the dates are arranged now. I added arrange(Date) to sort the rows by date because they were out of order in the raw data set. Since they are a character class, arrange() interpreted them alphanumberically, in other words 01/01/2019 comes before 01/01/2020 rather than coming before 01/02/2019.

To correct this, we’ll convert the first column to dates using lubridate::mdy(), confirm with class(), and look again.

data$Date <- data$Date %>%
  mdy()
class(data$Date)
## [1] "Date"
DateAshokan East StorageAshokan East ElevationAshokan West ElevationAshokan West StorageAshokan ReleaseSchoharie StorageSchoharie ElevationSchoharie ReleaseRondout StorageRondout ElevationRondout ReleaseNeversink StorageNeversink ElevationNeversink North Flow ReleaseNeversink South Flow ReleaseNeversink Conservation Flow ReleasePepacton StoragePepacton ElevationPepacton North Flow ReleasePepacton South Flow ReleasePepacton Conservation Flow ReleaseCannonsville StorageCannonsville ElevationCannonsville Release
2017-11-0165.36577.8636.34577.891812.921109.86047.26835.859.9230.741430.0338.900116.601264.4551.70047.691112.8797.3
2017-11-0264.95577.5836.82578.591113.281111.05047.32835.949.9230.681429.9039.000116.871264.6151.70048.201113.3397.6
2017-11-0364.36577.3137.18579.141213.561111.97047.22835.799.8930.721429.9939.000117.081264.7451.80048.661113.7497.8
2017-11-0463.71576.9437.49579.561213.781112.73047.28835.889.8930.831430.2239.100117.151264.7851.80049.051114.0998.0
2017-11-0563.15576.5237.78579.921213.981113.39047.36836.009.8230.941430.4739.100117.111264.7651.80049.351114.3697.7
2017-11-0662.53576.1438.04580.241314.191114.04047.46836.149.7631.071430.7540.700117.231264.8354.00049.751114.72102.5

Great. The date is now arranged in ascending date order and displayed in R’s standard YYYY-MM-DD format, which is fine for our purposes here. Now we’re ready to talk about the tidiness of the data.

Semantics of the data

Tidy data is not a new concept, but its formal study and the popularity of the term have recently grown in stature with the importance of machine learning in modern business and the need for consistent approaches to manipulating data.

Hadley Wickham’s 2013 paper on the topic and continued work on the tidyverse have also facilitated the spread of the concept, and there are plenty of good introductions and breakdowns of the topic. I like the R tidy data vignette.

Tidy data by definition has three elements:

  1. Each column represents a variable
  2. Each row represents an observation
  3. Each cell contains one value

Some treatments also add “each table represents one level of observation.” So, is the Current Reservoir Levels data set tidy?

To answer that, we need to explore the semantics a bit. The larger context of the data set is the water supply of New York City, which is a fascinating world of engineering in itself.

In our data set, we have the six upstate reservoirs represented which supply about 90% of the city’s water. They are from the Catskills (Ashokan and Schoharie) and Delaware (Rondout, Neversink, Pepacton, Cannonsville) watersheds. For each reservoir, there are three types of measurements, and some basins have multiple sites for each type:

  1. Storage in billion gallons (BG)
  2. Elevation in feet (FT)
  3. Release in millions of gallons per day (MGD)

Storage represents how much water is in the reservoir, elevation represents the height of the surface of the water, and release represents how much water leaves the reservoir. Now we can consider what each column name means in terms of tidiness.

When our second column says “Ashokan East Storage,” that means it’s a measurement from the Ashokan reservoir, measuring storage, in the East Basin. When another column says “Schoharie Storage,” that means it’s a measurement from the Schoharie reservoir measuring storage. This means some columns have three variables per column, and all columns have at least two variables.

Since tidy data requires that each variable is its own column, our data is not tidy. To correct this, we will pivot our table using pivot_longer() and regular expressions (regex).

Pivoting the data

It’s simple to say what we want to do to make the data tidy: move the reservoir name to its own column, move the measurement type to its own column, and move the measurement site (where applicable) to its own column. Finally, we want to move the values to their own column.

This is where the actual difficulty and frustration in tech work happnes. How do I do this task given the tools I know? Are there other tools? There are many ways to do any given part, but how do I chain everything together in a way that works? Am I asking the wrong question? Am I approaching the solution from the wrong angle?

I suspect there’s no way to get over this initial hump other than having people to ask for help, expending elbow grease, taking deep dives into the documentation, gaining experience, and perfecting your Google Fu.

Once we break down the complex operation we want to perform into this simpler series of instructions, and then identify a set of tools that act accordingly (and in the way we expect them to), it’s not too difficult to transform the data. In this case, we are pivoting variables embedded in the column headers to transform them into new rows.

To tell R how we specifically want to break up the column headers, we will use regex. To avoid an overly complex single operation that is unnecessary and possibly impossible, we will break this pivot down into multiple steps. First, we’ll separate the reservoir name and measurement type.

The reservoir name is the first word in the columns, and is then separated with a space from the rest of the title (the measurement site and type). We want a regex that says: one word, followed by exactly one space (which we’ll ignore), and then everything else. Those expressions are \w*, \s{1}, and .* respectively.

There are tons of search results and doc pages, but I am starting to find a pattern in that R’s (and particularly the tidyverse’s) vignettes are very helpful when approaching a new topic. I like R’s vignettes, which are long-form narratives on a topic that include narrative cases with code. If docs are a cookbook, vignettes are a supermarket.

In this case, the vignette("pivot") page’s “Many variables in column names” section gave me exactly the use case I needed to put my code together. I thought about trying to achieve all three transformations in one step, but I already had smoke coming out of my ears and didn’t want to overdo it.

This is not my first experience with pivoting, but it has taken a lot of mental effort to get used to the logic behind the operation and the specific way that pivot_longer() achieves it. Once I was able to mentally visualize the spatial transformation the function was meant to achieve, I felt comfortable that I wasn’t just typing randomly.

Here’s the solution I came up with:

data <- data %>%
  pivot_longer(
    2:25,
    names_to = c("Reservoir", "Measurement Type"),
    names_pattern = "(\\w*)\\s{1}(.*)",
    values_to = "Value"
  )
glimpse(data)
## Rows: 31,416
## Columns: 4
## $ Date               <date> 2017-11-01, 2017-11-01, 2017-11-01, 2017-11-01, 20…
## $ Reservoir          <chr> "Ashokan", "Ashokan", "Ashokan", "Ashokan", "Ashoka…
## $ `Measurement Type` <chr> "East Storage", "East Elevation", "West Elevation",…
## $ Value              <dbl> 65.36, 577.86, 36.34, 577.89, 18.00, 12.92, 1109.86…

I used glimpse() here instead of head() because I wanted to illustrate a point: untidy data is much more compact. The original data had 1,309 rows, and pivoted it has 31,416 rows. The original dataset, with 25 columns, had an area of 32,725 cells, and the transformed data, with 4 columns, has 125,664 cells.

Splitting columns with extract()

To split the Site variable off of the Measurement Type column, I’ll call extract(). I’m sure it’s possible to write a regex to do this in one step, but since functions like extract() and pivot_longer() are calling underlying functions that have their own rules and quirks, I’m not going to fight the gods and will instead just run it through extract() twice.

I’ll also rename the current Measurement Type column because we’ll be replacing it with a newly extracted column.

data <- data %>%
  rename('temp'='Measurement Type') %>%
  extract(
    "temp",
    "Site",
    "(North|South|Conservation Flow|West|East)",
    remove=FALSE
    ) %>%
  extract(
    "temp",
    "Measurement Type",
    "(Storage|Elevation|Release)"
    )
data %>%
  head(100) %>%
  kbl() %>%
  kable_styling(font_size=12) %>%
  scroll_box(height="15em")
DateReservoirMeasurement TypeSiteValue
2017-11-01AshokanStorageEast65.36
2017-11-01AshokanElevationEast577.86
2017-11-01AshokanElevationWest36.34
2017-11-01AshokanStorageWest577.89
2017-11-01AshokanReleaseNA18.00
2017-11-01SchoharieStorageNA12.92
2017-11-01SchoharieElevationNA1109.86
2017-11-01SchoharieReleaseNA0.00
2017-11-01RondoutStorageNA47.26
2017-11-01RondoutElevationNA835.85
2017-11-01RondoutReleaseNA9.92
2017-11-01NeversinkStorageNA30.74
2017-11-01NeversinkElevationNA1430.03
2017-11-01NeversinkReleaseNorth38.90
2017-11-01NeversinkReleaseSouth0.00
2017-11-01NeversinkReleaseConservation Flow0.00
2017-11-01PepactonStorageNA116.60
2017-11-01PepactonElevationNA1264.45
2017-11-01PepactonReleaseNorth51.70
2017-11-01PepactonReleaseSouth0.00
2017-11-01PepactonReleaseConservation Flow0.00
2017-11-01CannonsvilleStorageNA47.69
2017-11-01CannonsvilleElevationNA1112.87
2017-11-01CannonsvilleReleaseNA97.30
2017-11-02AshokanStorageEast64.95
2017-11-02AshokanElevationEast577.58
2017-11-02AshokanElevationWest36.82
2017-11-02AshokanStorageWest578.59
2017-11-02AshokanReleaseNA11.00
2017-11-02SchoharieStorageNA13.28
2017-11-02SchoharieElevationNA1111.05
2017-11-02SchoharieReleaseNA0.00
2017-11-02RondoutStorageNA47.32
2017-11-02RondoutElevationNA835.94
2017-11-02RondoutReleaseNA9.92
2017-11-02NeversinkStorageNA30.68
2017-11-02NeversinkElevationNA1429.90
2017-11-02NeversinkReleaseNorth39.00
2017-11-02NeversinkReleaseSouth0.00
2017-11-02NeversinkReleaseConservation Flow0.00
2017-11-02PepactonStorageNA116.87
2017-11-02PepactonElevationNA1264.61
2017-11-02PepactonReleaseNorth51.70
2017-11-02PepactonReleaseSouth0.00
2017-11-02PepactonReleaseConservation Flow0.00
2017-11-02CannonsvilleStorageNA48.20
2017-11-02CannonsvilleElevationNA1113.33
2017-11-02CannonsvilleReleaseNA97.60
2017-11-03AshokanStorageEast64.36
2017-11-03AshokanElevationEast577.31
2017-11-03AshokanElevationWest37.18
2017-11-03AshokanStorageWest579.14
2017-11-03AshokanReleaseNA12.00
2017-11-03SchoharieStorageNA13.56
2017-11-03SchoharieElevationNA1111.97
2017-11-03SchoharieReleaseNA0.00
2017-11-03RondoutStorageNA47.22
2017-11-03RondoutElevationNA835.79
2017-11-03RondoutReleaseNA9.89
2017-11-03NeversinkStorageNA30.72
2017-11-03NeversinkElevationNA1429.99
2017-11-03NeversinkReleaseNorth39.00
2017-11-03NeversinkReleaseSouth0.00
2017-11-03NeversinkReleaseConservation Flow0.00
2017-11-03PepactonStorageNA117.08
2017-11-03PepactonElevationNA1264.74
2017-11-03PepactonReleaseNorth51.80
2017-11-03PepactonReleaseSouth0.00
2017-11-03PepactonReleaseConservation Flow0.00
2017-11-03CannonsvilleStorageNA48.66
2017-11-03CannonsvilleElevationNA1113.74
2017-11-03CannonsvilleReleaseNA97.80
2017-11-04AshokanStorageEast63.71
2017-11-04AshokanElevationEast576.94
2017-11-04AshokanElevationWest37.49
2017-11-04AshokanStorageWest579.56
2017-11-04AshokanReleaseNA12.00
2017-11-04SchoharieStorageNA13.78
2017-11-04SchoharieElevationNA1112.73
2017-11-04SchoharieReleaseNA0.00
2017-11-04RondoutStorageNA47.28
2017-11-04RondoutElevationNA835.88
2017-11-04RondoutReleaseNA9.89
2017-11-04NeversinkStorageNA30.83
2017-11-04NeversinkElevationNA1430.22
2017-11-04NeversinkReleaseNorth39.10
2017-11-04NeversinkReleaseSouth0.00
2017-11-04NeversinkReleaseConservation Flow0.00
2017-11-04PepactonStorageNA117.15
2017-11-04PepactonElevationNA1264.78
2017-11-04PepactonReleaseNorth51.80
2017-11-04PepactonReleaseSouth0.00
2017-11-04PepactonReleaseConservation Flow0.00
2017-11-04CannonsvilleStorageNA49.05
2017-11-04CannonsvilleElevationNA1114.09
2017-11-04CannonsvilleReleaseNA98.00
2017-11-05AshokanStorageEast63.15
2017-11-05AshokanElevationEast576.52
2017-11-05AshokanElevationWest37.78
2017-11-05AshokanStorageWest579.92

Are these particularly graceful regexes? Absolutely not. A computer science student would get points deducted for submitting something like this on a homework assignment. Yet they work with this data set in this context with these function calls, and that’s good enough for now. Not every meal can be a feast of legend.

The other issue I see after converting these columns is that their type is character. Since Reservoir, Site, and Measurement Type are categories, they should be factor data types. We’ll convert them using as_factor().

data$Reservoir <- data$Reservoir %>%
  as_factor()
data$`Measurement Type` <- data$`Measurement Type` %>%
  as_factor()
data$Site <- data$Site %>%
  as_factor()
summary(data)
##       Date                   Reservoir     Measurement Type
##  Min.   :2017-11-01   Ashokan     :6545   Storage  : 9163  
##  1st Qu.:2018-09-24   Schoharie   :3927   Elevation: 9163  
##  Median :2019-08-17   Rondout     :3927   Release  :13090  
##  Mean   :2019-08-17   Neversink   :6545                    
##  3rd Qu.:2020-07-09   Pepacton    :6545                    
##  Max.   :2021-05-31   Cannonsville:3927                    
##                 Site           Value        
##  East             : 2618   Min.   :   0.00  
##  West             : 2618   1st Qu.:  18.42  
##  North            : 2618   Median :  74.83  
##  South            : 2618   Mean   : 349.25  
##  Conservation Flow: 2618   3rd Qu.: 586.46  
##  NA's             :18326   Max.   :1572.72

Finally, we’ll save the data locally so that we don’t need to rewrite functions for all the transformations we’ve done the next time we’re working the data.

write.csv(data, file="/path/to/file.csv")

Conclusion

After retrieving the data, and before doing higher level work that explores the meaning of the data and looks for value, we have to make sure that the data is in a form we can use and ensure that it is not garbage. Changing the form is data wrangling (or munging), which we’ve done here.

I am now more comfortable with the form of this data. I don’t really like Values having multiple scales represented in one column (feet, billions of gallons, millions of gallons per day) because it makes the statistics calculated for the Values column meaningless. As I explore the data more, it may make sense to break this column apart.

Since the data is tidy, we are ready to look at the values present and make sure they are accurate, sane, and ready for higher level exploratory data analysis (EDA) techniques. Wrangling the data is not a one-time process, and as we explore and visualize the data, the shape of the table can change to meet the needs of the analysis.