To learn more about creating interactive visualizations, I am going to use the Leaflet library to map the locations of New York City’s water quality sampling stations. These sites feed into at least two data sets of interest, the Drinking Water Quality Distribution Monitoring Data and the Distribution Sites LCR Monitoring Results.

Glancing at the data

The data sets come with a data dictionary, which I will import and examine with head():
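Here is a minimal sketch of that import; the filename is hypothetical, and I read the first two columns in as factors (which will matter in a moment):

library(readr)

# Hypothetical filename; the dictionary sheet was exported to CSV first
data <- read_csv(
  "sampling_site_data_dictionary.csv",
  col_types = cols(
    `Site` = col_factor(),
    `Sample Station (SS) - LOCATION DESCRIPTION` = col_factor()
  )
)
head(data)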

| Site  | Sample Station (SS) - LOCATION DESCRIPTION | X - Coordinate | Y - Coordinate |
|-------|--------------------------------------------|----------------|----------------|
| 1S03  | SS - Shaft 3 of City Tunnel No.1 - E/S Goulden Ave, S/O Sedgwick Ave, OPP JPR GH7 | 1024950 | 264277 |
| 1S04  | SS - Shaft 4 of City Tunnel No.1 - IFO 2780 Reservoir Ave, E/S Reservoir Ave, 1st SS N/O Strong St (@ intersection of Reservoir Ave & Goulden) | 1012568 | 256577 |
| 1S03A | SS - Shaft 3A of City Tunnel No.2 - S/S E 233rd St, W/O Bronxwood Ave IFO 862 E 233rd St. | 1024721 | 264392 |
| 1S07  | SS - Shaft 7 of City Tunnel No.1 - NE/S W 167th St, 1st SE/O Sedgwick Ave IFO 1260 Sedgwick Ave | 1004013 | 245233 |
| 1S03B | SS - Shaft 3B of City Tunnel No.3 - W/S Jerome Ave, S/O access road to Mosholu Golf Course (SW corner of entrance to park) | 1014949 | 260688 |
| 1SCL1 | Jerome Park Reservoir (W205th St & Goulden Ave, Bx) - Croton Distribution Chamber - low pressure service tunnel, Tap in Gatehouse #5, treated water | 1013921 | 259223 |

Okay, this is pretty straightforward. When I look at the summary() of the data, though, I notice something unusual:

| Site | Sample Station (SS) - LOCATION DESCRIPTION | X - Coordinate | Y - Coordinate |
|------|--------------------------------------------|----------------|----------------|
| 39550 : 2 | SS - W/S Park Ave, 1st SS S/O E 95th St, 12" : 2 | Min. : 916207 | Min. :122378 |
| 1S03 : 1 | SS - Shaft 3 of City Tunnel No.1 - E/S Goulden Ave, S/O Sedgwick Ave, OPP JPR GH7 : 1 | 1st Qu.: 989596 | 1st Qu.:179318 |
| 1S04 : 1 | SS - Shaft 4 of City Tunnel No.1 - IFO 2780 Reservoir Ave, E/S Reservoir Ave, 1st SS N/O Strong St (@ intersection of Reservoir Ave & Goulden) : 1 | Median :1003601 | Median :202666 |
| 1S03A : 1 | SS - Shaft 3A of City Tunnel No.2 - S/S E 233rd St, W/O Bronxwood Ave IFO 862 E 233rd St. : 1 | Mean :1003025 | Mean :203317 |
| 1S07 : 1 | SS - Shaft 7 of City Tunnel No.1 - NE/S W 167th St, 1st SE/O Sedgwick Ave IFO 1260 Sedgwick Ave : 1 | 3rd Qu.:1019878 | 3rd Qu.:228501 |
| 1S03B : 1 | SS - Shaft 3B of City Tunnel No.3 - W/S Jerome Ave, S/O access road to Mosholu Golf Course (SW corner of entrance to park) : 1 | Max. :1062706 | Max. :270095 |
| (Other) : 392 | (Other) : 392 | NA | NA |

I imported the Site and Sample Station columns as factors, so summary() gives me a descending count of how many times each level appears. This result means that site 39550, which corresponds to a sampling station at Park Avenue and East 95th Street, appears twice in the data. Let’s look at that site specifically with filter(Site == '39550'):
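With dplyr loaded and the imported tibble stored in data, that call is simply:

library(dplyr)

# Pull only the rows for the duplicated site
data %>%
  filter(Site == "39550")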

| Site | Sample Station (SS) - LOCATION DESCRIPTION | X - Coordinate | Y - Coordinate |
|------|--------------------------------------------|----------------|----------------|
| 39550 | SS - W/S Park Ave, 1st SS S/O E 95th St, 12" | 997170 | 225580 |
| 39550 | SS - W/S Park Ave, 1st SS S/O E 95th St, 12" | 997172 | 225580 |

Normally, duplicate rows can be identified and removed with unique(), duplicated(), or dplyr::distinct(), but these are not true duplicates: the two rows for site 39550 have different X coordinates. There’s no way to determine from the data alone whether one of them is an error.
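To make that concrete, here is a quick sketch of why the usual dedupers do not help here: distinct() and duplicated() compare entire rows, and these two rows differ in X.

library(dplyr)

# Both rows survive deduplication because no two rows are identical
data %>%
  filter(Site == "39550") %>%
  distinct()

# Likewise, duplicated() flags neither row
data %>%
  filter(Site == "39550") %>%
  duplicated()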

I looked at Google Maps Street View to see if there was any way to clear up the confusion, and there are in fact two sampling stations at the intersection of Park and 95th: one on the southwest corner and one on the northwest corner. I wanted to see if I could determine from the coordinates whether one of the points is a duplicate, but then I realized I had a bigger problem.

Basics of geospatial mapping

The coordinates of our problem site (997170/997172, 225580) do not look like the standard GPS coordinates I am familiar with. To figure out what was going on, I started Googling and soon found myself venturing down a small geospatial mapping rabbit hole.

My immediate goal here is to transform the numbers I have into points that Leaflet can use to create a map. To do this, I needed to learn some basics of the modern geospatial data landscape:

  • The basic system used for mapping points is called a coordinate reference system (CRS), and a whole lot of these exist
  • The deepest expertise in geospatial mapping developed in the energy sector, so the European Petroleum Survey Group (EPSG) became an early driver for organizing GIS standards, comparable to how the World Wide Web Consortium (W3C) and the International Organization for Standardization (ISO) develop standards for the web and cross-industry contexts, respectively
  • The emerging standard for describing a CRS is Well-Known Text (WKT), and Well-Known IDs (WKIDs) are also popular
  • WKTs and WKIDs are associated with authorities like EPSG, and many CRSs are cross-listed across authorities
  • New York State and NYC use the North American Datum of 1983 (NAD83). More specifically, NYC uses the Long Island zone of the State Plane Coordinate System, which is based on NAD83
  • The Long Island zone is catalogued as EPSG:2263, so 2263 is the CRS code I will need for transforming the sampling site coordinate data (a quick lookup is sketched after this list)
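As a quick sanity check (a sketch; the sf package is introduced properly below), the full definition registered under any EPSG code can be pulled up by number:

library(sf)

# Look up the CRS definition catalogued as EPSG:2263
st_crs(2263)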

This information is important, but it didn’t give me the specific technical solution that would let me map these points. For that, I had to read up on how map data is organized, stored, accessed, and manipulated in general, and specifically how this is implemented in R.

  • While EPSG remains an important authority for identifying CRSs, there are many more elements involved in mapping than just coordinates, and the Open Geospatial Consortium (OGC) is a major standards body for mapping more generally
  • Maps are complex entities composed of many elements, so the formats used for storing maps are also complex and highly variable across contexts and implementations. Shapefiles (associated with Esri and ArcGIS) are one well-known format
  • Naturally, since both modern GIS systems and R’s predecessor S date to the 1970s, many ways of working with geospatial data have evolved
  • The relevant OGC specification for our purposes is called Simple Features, which standardizes how geographic features are stored and accessed
  • R’s implementation of Simple Features is the sf package, and sf is also the class of the R object (a spatial data frame) that stores map data

So that’s a lot of information, but it gives us what we need to transform our existing coordinates and then plot them on a Leaflet map.

Transforming the data

When I imported the data dictionary, I first saved it as a .csv file from a spreadsheet program and then used read_csv(). It’s possible to import directly from an Excel file, but it’s a trivial amount of work to convert one file and I’d rather not risk having proprietary Microsoft metadata surprise me.
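For completeness, the direct route would look something like this sketch (using the readxl package, with a hypothetical filename):

library(readxl)

# Hypothetical filename; read the dictionary sheet straight from the workbook
data <- read_excel("sampling_site_data_dictionary.xlsx")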

The imported data we looked at above is a tibble, a type of data frame. Since the data is not saved in a fully fledged map format, we will build a very basic spatial data frame using sf::st_as_sf(), which converts non-spatial objects, including tibbles, into sf objects.

sf objects can hold much more information, but we’re only interested in the coordinates here. After the conversion, we’ll display the CRS info that we’ve attached with st_crs().

library(sf)
library(dplyr)  # loaded for the %>% pipe

# Convert the tibble to an sf object, naming the coordinate columns and
# declaring the CRS those coordinates are expressed in (EPSG:2263)
sf <- st_as_sf(
  data,
  coords = c(
    "X - Coordinate",
    "Y - Coordinate"),
  crs = 2263
  )

# Display the CRS information now attached to the object
sf %>%
  st_crs()
## Coordinate Reference System:
##   User input: EPSG:2263 
##   wkt:
## PROJCRS["NAD83 / New York Long Island (ftUS)",
##     BASEGEOGCRS["NAD83",
##         DATUM["North American Datum 1983",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4269]],
##     CONVERSION["SPCS83 New York Long Island zone (US Survey feet)",
##         METHOD["Lambert Conic Conformal (2SP)",
##             ID["EPSG",9802]],
##         PARAMETER["Latitude of false origin",40.1666666666667,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8821]],
##         PARAMETER["Longitude of false origin",-74,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8822]],
##         PARAMETER["Latitude of 1st standard parallel",41.0333333333333,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8823]],
##         PARAMETER["Latitude of 2nd standard parallel",40.6666666666667,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8824]],
##         PARAMETER["Easting at false origin",984250,
##             LENGTHUNIT["US survey foot",0.304800609601219],
##             ID["EPSG",8826]],
##         PARAMETER["Northing at false origin",0,
##             LENGTHUNIT["US survey foot",0.304800609601219],
##             ID["EPSG",8827]]],
##     CS[Cartesian,2],
##         AXIS["easting (X)",east,
##             ORDER[1],
##             LENGTHUNIT["US survey foot",0.304800609601219]],
##         AXIS["northing (Y)",north,
##             ORDER[2],
##             LENGTHUNIT["US survey foot",0.304800609601219]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["USA - New York - SPCS - Long Island"],
##         BBOX[40.47,-74.26,41.3,-71.8]],
##     ID["EPSG",2263]]

Here we can see much more detailed information about the CRS. The User input line is just that: what we passed to st_as_sf(). What follows is the WKT definition associated with EPSG:2263. One interesting part for our purposes is the BBOX, which gives the bounds of the CRS’s area of use in the more familiar latitude and longitude numbers.

The X and Y coordinates of our data set fall within these bounds, so the coordinates make sense as long as they are given to something that understands EPSG:2263. Unfortunately, Leaflet does not support EPSG:2263 out of the box.

It’s possible to use other CRSs with the Proj4Leaflet extension, but that requires a lot of extra work (like tile management) for no good reason here. Out of the box, Leaflet renders its tiles in Web Mercator (EPSG:3857) and expects input coordinates as longitude/latitude pairs on the World Geodetic System of 1984 (WGS84) datum.

WGS84 longitude/latitude is what OpenStreetMap, Google Maps, Apple Maps, and most other online maps take as input, so the simplest solution is to convert our coordinates from EPSG:2263. We will use sf::st_transform() to perform the transformation.

# Reproject from EPSG:2263 (US survey feet) to longitude/latitude on WGS84
sf <- st_transform(
  sf,
  "+proj=longlat +datum=WGS84"
  )

# Confirm the new CRS
sf %>%
  st_crs()
## Coordinate Reference System:
##   User input: +proj=longlat +datum=WGS84 
##   wkt:
## GEOGCRS["unknown",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ID["EPSG",6326]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433],
##         ID["EPSG",8901]],
##     CS[ellipsoidal,2],
##         AXIS["longitude",east,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433,
##                 ID["EPSG",9122]]],
##         AXIS["latitude",north,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433,
##                 ID["EPSG",9122]]]]
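As an aside, and this is a note on an alternative rather than what I ran above: passing the EPSG code to st_transform() yields the same longitude/latitude coordinates, with the bonus that the resulting CRS comes back fully identified as EPSG:4326 instead of “unknown”:

# Equivalent for our purposes, but the CRS is labelled rather than "unknown"
sf <- st_transform(sf, crs = 4326)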

We can see that the transformation was successful, and our data is now in a format ready to be used by Leaflet. Let’s take a look at the head() of the data again.

| Site | Sample Station (SS) - LOCATION DESCRIPTION | geometry |
|------|--------------------------------------------|----------|
| 1S03 | SS - Shaft 3 of City Tunnel No.1 - E/S Goulden Ave, S/O Sedgwick Ave, OPP JPR GH7 | POINT (-73.85279 40.89196) |
| 1S04 | SS - Shaft 4 of City Tunnel No.1 - IFO 2780 Reservoir Ave, E/S Reservoir Ave, 1st SS N/O Strong St (@ intersection of Reservoir Ave & Goulden) | POINT (-73.89761 40.87087) |
| 1S03A | SS - Shaft 3A of City Tunnel No.2 - S/S E 233rd St, W/O Bronxwood Ave IFO 862 E 233rd St. | POINT (-73.85362 40.89228) |
| 1S07 | SS - Shaft 7 of City Tunnel No.1 - NE/S W 167th St, 1st SE/O Sedgwick Ave IFO 1260 Sedgwick Ave | POINT (-73.92858 40.83976) |
| 1S03B | SS - Shaft 3B of City Tunnel No.3 - W/S Jerome Ave, S/O access road to Mosholu Golf Course (SW corner of entrance to park) | POINT (-73.88898 40.88215) |
| 1SCL1 | Jerome Park Reservoir (W205th St & Goulden Ave, Bx) - Croton Distribution Chamber - low pressure service tunnel, Tap in Gatehouse #5, treated water | POINT (-73.89271 40.87813) |

We can see that the columns for X and Y coordinates have been replaced by a single geometry column, and that the values in this column are the familiar longitude and latitude numbers in common use (note that POINT lists longitude first, then latitude). This sf object can now be piped directly into leaflet().

Creating the map

I’ll modify some of the default display options. Many of the arguments are listed in the docs accessed with ?leaflet().

library(leaflet)

sf %>%
  leaflet() %>%
  # Light grayscale basemap
  addProviderTiles(
    providers$Stamen.TonerLite) %>%
  # Center the initial view on New York City
  setView(
    lng = -73.9062,
    lat = 40.7001,
    zoom = 11
    ) %>%
  # One circle marker per sampling station, with an HTML popup
  addCircleMarkers(
    radius = 5,
    weight = 3,
    opacity = 1,
    color = "#5af",
    popup = ~paste(
      "<strong>Site</strong>: ",
      Site,
      "<br>",
      "<strong>Description</strong>: ",
      `Sample Station (SS) - LOCATION DESCRIPTION`,
      "<br>",
      "<strong>Coordinates</strong>: ",
      geometry
      )
    )

Now I want to confirm that the points correspond with where the sampling stations are on the street. To do this, I picked four points from the corners of the distribution and one point from the center, and visually confirmed their locations on Google Maps Street View (a sketch for pulling a station’s transformed coordinates follows the list). I used the following points:

  • Van Zandt & 248th in Queens
  • Empire & Virginia in Far Rockaway
  • Main Street south of Hylan in Staten Island
  • Riverdale Avenue south of 261st in Riverdale
  • Humboldt south of Boerum in Brooklyn
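As promised, here is a sketch of pulling a station’s transformed coordinates by matching its description; the str_detect() pattern is an illustrative guess at the matching text:

library(sf)
library(dplyr)
library(stringr)

# Hypothetical lookup: match one verification point by its description,
# then extract the longitude/latitude pair from its geometry
sf %>%
  filter(str_detect(
    `Sample Station (SS) - LOCATION DESCRIPTION`, "Van Zandt")) %>%
  st_coordinates()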

All five points were mapped by Leaflet within spitball distance of where they appeared on Street View, so I expect that any errors in point placement are in the underlying data and were not caused by transforming the projection.

What about our duplicate point? Both points are displayed on the southwest corner of Park and 95th. Many sampling sites have two stations, and I believe that one of these is an auxiliary in case there is a problem with the first station. Looking at Street View, I saw many of these second sites but none of them were listed in the data as secondary points.
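There is another hint (a sketch, assuming the transformed object is still named sf): the two records differ by only two survey feet in X, which puts them well under a meter apart, far closer together than the corner stations visible on Street View.

library(sf)
library(dplyr)

# Measure the ground distance between the two records for site 39550;
# st_distance() returns meters for geographic coordinates
dupes <- sf %>% filter(Site == "39550")
st_distance(dupes[1, ], dupes[2, ])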

This makes me more confident that one of the duplicates is in fact an error. In an enterprise setting, I would reach out to the data maintainers and get the data fixed, or at least confirm which point to exclude. Since I cannot be sure which point is the duplicate, and since it has no impact on the work here, I’ll leave it alone.

Conclusion

Working with libraries ported to the R environment with the htmlwidgets framework was easier than I expected. One of the challenges of working in the modern tech landscape is the huge number of technologies in use, and life is easier when the integration work is handled under the hood. That frees up effort for more substantive work.

At some point I will probably learn more JavaScript, but it’s great that learning it is optional rather than a prerequisite for presenting any work. I like to put effort into the presentation of visualizations, but there is a great deal of benefit in having sane defaults and templates to build from, even at the cost of some control and flexibility.

I wasn’t expecting to take a deep dive into GIS, but I’m not too surprised that it happened either. Tech topics have a way of taking learners on unplanned side quests. I had never felt a great deal of interest in geospatial work before, but now that I’ve seen a bit about how it is done I find it more appealing.

A huge amount of recent high-impact environmental research is geospatial in nature, so it is fortunate that I now understand and enjoy the workflow. Next, I want to look at some of the libraries for non-spatial visualizations. I want to do more work with ggplot2, so plotly and ggiraph are logical next steps.