When I worked with Leaflet to map the NYC water quality testing stations, I had to learn a bit about how geospatial data is structured in order to transform my data points into a format that I could then plot on a generated Leaflet map.

For that task, I did not need to go especially deep into the workings of modern digital cartography because Leaflet has built-in access to a range of tile servers. Essentially, I took a series of points and dropped them on top of a pre-existing map retrieved from a server.

I did not need to know anything about how that map was built and did not need to interact with it directly other than to call it. In cartographic terms, I worked with one layer consisting of a series of points. The rest of the layers, including roads, buildings, feature names, water boundaries, state boundary lines, and so on, were all handled for me.
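
To make that concrete, here is a minimal sketch of that workflow in R's leaflet package: fetch background tiles from a tile server and drop a single layer of points on top. The station names and coordinates below are made up for illustration; the real project used the NYC testing station locations.

```r
library(leaflet)

# Hypothetical points standing in for the water quality testing stations
stations <- data.frame(
  name = c("Station A", "Station B"),
  lng  = c(-73.97, -73.94),
  lat  = c(40.78, 40.70)
)

leaflet(stations) %>%
  addTiles() %>%                                            # basemap tiles pulled from a tile server
  addCircleMarkers(lng = ~lng, lat = ~lat, label = ~name)   # my one layer: the points
```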

Leaflet is a wonderful package because it abstracts away most of this low-level work, but that manual, fine-tuned construction is the entire point of many mapping projects. Learning the finer points of cartographic workflows is a key skill in geospatial analysis, particularly for projects that are novel or highly specific, where readily available mapping solutions do not apply.

Much of the work that interests me involves environmental analysis. For instance, a recent PBS NOVA documentary called “Can We Cool the Planet?” discusses high-resolution maps of rainfall and tree cover, such as GEDI, that are intended to help identify sites around the globe where engineered forests can reasonably be expected to thrive.

I am also fascinated by organizations like Justdiggit, which uses the low-tech approach of digging half-moon-shaped cuts in arid areas. These cuts, made in large numbers, become waterlogged during the rainy season and allow grasses to grow in a moist environment protected from the wind; collectively, they form the basis of new grasslands.

More abstractly, the DeepSolar project used computer vision and geospatial imagery to identify all solar panels installed in the United States and cross-referenced the location data with socioeconomic factors from the Census to make inferences about the deployment of solar capacity.

In these cases and in many others, GIS work enables more impactful use of resources by answering questions about large areas of land that could not be efficiently answered through ground-level observations alone. The open-ended and variable nature of the work, however, makes it harder to pin down broadly useful skills; the honest answer is usually “well, what do you want to do?”

What attracted me to R in the first place was its wide use in academia and its focus on visual presentation of results. I have found this characterization to be accurate, but a lot of exploration is necessary to understand everything that is possible. I’ve bounced around quite a bit to see what exists and to identify useful learning projects that move my knowledge and abilities forward.

That complexity has recently given me pause, and I feel a bit overextended mentally, so I wanted to take some time to think about how to limit what I am doing and avoid boiling the ocean. At a high level, I’m not sure exactly what type of impact and role I want to focus on yet.

I believe a wide range of skills in analysis and presentation is the type of evergreen skillset that can fit into many teams. Wrangling, cleaning, and presenting data is a “cockroach” function that will exist after the apocalypse, so it’s never a waste to invest in such skills.

Geospatial work, however, is a little more nuanced and specialized, and I’m less familiar with the space. Having worked in finance and budgeting, I have a body of professional experience I can lean on when approaching novel data analysis and visualization tasks, but I’m a lot less confident with GIS work.

I don’t know much about common workflows, how to move up the value chain, or what work is impactful in the GIS space, which means I have some non-technical work to do: researching the field and talking to people experienced in this line of work.

For now, I believe that the ability to work with and combine differently formatted data sources, build maps, understand their constituent parts, and make the results visually appealing and publication-worthy is a set of skills worth pursuing.

On the nitty-gritty details, I’ve been reading a number of sources and got a lot out of Robin Lovelace’s book, this excellent post on mapping with ggplot, and the meta-source R posts you might have missed!

I’ve also started building a loose framework for approaching projects as an amateur:

  1. The project at hand. If there is no clear goal, the number of shiny things and capabilities in the R ecosystem will lead me astray. Everything is complex, so “I’ll just build a bunch of stuff and see what looks cool” is not practicable.
  2. The data itself. There is a gigantic amount of mapping information out there. The spData package collects many of the reference data sets that are the geospatial equivalent of iris, mtcars, and palmerpenguins. New Zealand and North Carolina seem to be quite popular. Beyond this, I am only beginning to understand the details of the many available data sets and how to utilize them. Until I become more comfortable working with shapefiles, I will focus on US Census and US Geological Survey data.
  3. Formatting, wrangling, and cleaning the data. Over the last decade, there have been tremendous changes in how GIS data is stored, converted, and manipulated. Fortunately, R seems to be coalescing around the Simple Features standard implemented in the sf package, which makes dealing with industry-standard shapefiles predictable (there is a small sketch of this read-transform-plot pipeline after this list). For now I am sticking to well-manicured data sets, because data cleaning is one too many things to juggle.
  4. Choosing a mapping package. R has no shortage of mapping options. leaflet, tmap, ggplot2, plotly, mapview, sf, raster, mapdeck, and mapsf (formerly cartography) are some major ones. Each has strengths and weaknesses, formats and companion packages (ggrepel, ggforce, viridis, scico, and patchwork, to name a few) it does or does not play well with, unique abilities of its own, and outputs of varying publication-worthiness. The second sketch after this list shows the same small map rebuilt with a different package.
  5. Building the project. I’ve been hesitant to use reference data sets because I think they’re corny, but after playing around and trying to build “something simple” I’ve realized that no such thing exists in cartography. I may eventually get to a point where building maps can be routinized, but I am absolutely not there yet. Putting all the pieces together is a significant challenge because each piece is complex in its own right.
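
As promised, here is a small sketch of the read-transform-plot pipeline from points 3 and 4, using the North Carolina shapefile that ships with sf (one of the reference data sets mentioned above) and ggplot2’s geom_sf. The choice of EPSG:32119 (a North Carolina State Plane CRS) and the BIR74 column are just convenient examples, not a recommendation.

```r
library(sf)
library(ggplot2)

# Read an industry-standard shapefile into a Simple Features object
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Reproject from geographic coordinates to a regional projected CRS (metres)
nc <- st_transform(nc, 32119)

# Map a single attribute with geom_sf and a viridis fill scale
ggplot(nc) +
  geom_sf(aes(fill = BIR74)) +
  scale_fill_viridis_c() +
  labs(title = "North Carolina counties", fill = "Births, 1974") +
  theme_minimal()
```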
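
And the same data in tmap, mostly to illustrate how much the syntax shifts from one mapping package to another; this assumes the nc object created in the previous sketch.

```r
library(tmap)

# Same map, different grammar: tmap builds the plot from tm_shape() plus layer functions
tm_shape(nc) +
  tm_polygons("BIR74", title = "Births, 1974")
```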

I have felt a bit overwhelmed lately trying to understand everything out there, very much like I am swallowing a grapefruit whole, so I am taking some time to get everything down. One day at a time.