In exploring available geospatial data sets, it’s very quickly apparent that there is a huge amount of information available at no cost and that the limiting factor is not availability and licensing, but knowledge of what data is available and experience in manipulating it.

Since I am exploring very unfamiliar terrain by starting to work on cartography, I am not aware of best practices and have no industry knowledge to guide my activities. Until I make some time to intentionally research what learning activities are efficient, I am exploring very much out of curiosity. To that end, I’d like to explore building maps as an unstructured exercise.

Narrowing focus

At this point, I have some basic ideas of how geospatial data is structured and how to approach a project, but I’m unfamiliar with the mapping resources available and how to use them.

When exploring a new topic, it’s helpful to connect it with a familiar topic to give some sense of anchoring. While I would eventually like to explore the entire world, I will start with the town of Beaver, Pennsylvania, where I grew up. I’ll also focus on open source data from US government agencies, since they are major creators of mapping assets.

From what I can tell with some basic exploration, it seems that there are two major agencies producing non-classified general mapping products. The National Geospatial-Intelligence Agency presumably produces maps, but none of these appear to be publicly available.

These two are not mutually exclusive, and the two agencies regularly share data and collaborate with each other. Many other agencies also have mapping efforts for more specific purposes and often include viewers to visualize their data sets that use USGS Topo or Census TIGER data as the base map to display basic map features.

Here are a few examples:

  • The National Oceanic and Atmospheric Administration’s National Centers for Environmental Information (NCEI) publishes petabytes of environmental data
  • The Energy Information Administration (EIA) publishes maps related to energy use
  • The Federal Emergency Management Agency (FEMA) publishes the National Flood Hazard Layer (NFHL) that is used as the basis for flood insurance rates, making it a very politically consequential data set. Notably, it is described as a layer rather than as a map
  • The Bureau of Land Management (BLM) publishes some map layers related to the lands it manages, which are located mostly in the western half of the country

This list is not exhaustive and there are many other agencies that create and publish geospatial data in the form of layers. The agencies of the Federal Statistical System mostly all produce maps, and the Multi-Resolution Land Characteristics (MRLC) consortium works on land cover data.

The Department of Defense, if their geospatial work is like everything else the DoD does, probably has seven or eight components doing GIS work in addition to the NGA and each of these duplicate each other’s efforts while seeking to be the apex agency coordinating GIS policy for the entire DoD.

There are quite a few of agencies creating mapping data, and these often overlap with each other. That said, for general mapping purposes the USGS’ TNM/Topo and the Census’ TIGER are the two major national mapping efforts that include a full range of basic layers and details that other maps can be built on top of.

It makes sense to be familiar with these data sources for creating maps of areas located in the United States.

Tile servers, redrawing, and computer resources

As I have started exploring the USGS and Census map data portals and doing basic exploratory plots, my immediate observation is that these data sets are gigantic. I remember learning back when Google Maps became popular that mapping services do not use the map layers directly for the tile servers that underlie modern mapping applications.

Instead, on the back end the maintainers create huge numbers of tiles in raster format that combine all the layers to be used, and these raster tiles (i.e. static images) are what the servers actually send to users. When users zoom in, a new raster image is served that represents the correct resolution.

Older GPS mapping devices used vector maps and drew them out when you booted up the map to navigate. While many layers exist in raster format, particularly satellite imagery, vector form is very common. Even when purely raster data is used, multiple layers being rendered each time users interact with the map is computationally very taxing.

Vector data, which is at its basic level a list of points that can be drawn at any scale, can very quickly reach into the millions of points. When the NOAA says it makes petabytes of data available, this is not hard to believe given the size of the planet and the great size of even well-compressed geospatial data.

I am not especially knowledgeable about efficiency and optimization of code, and my computer has an above average amount of RAM (40 gigabytes), but I can tell it will not be difficult to crash my system if I am not mindful of the size of these sets. After loading my base system and three map layers, I have used 9.8 gigs and both plotly and ggplot2 are operating quite slowly.

It seems that being targeted and deliberate when mapping is necessary, since the unfocused exploration I am used to when using tile servers is not practical given the size of the files in question. I will have to get the specific file or files I need and methodically build the maps I want to. So how do I use TIGER and TNM to access the files I want?

To start, I will focus on TIGER and then address TNM in a separate post. They are both large enough to warrant their own exploratory posts. I will also not give exact instructions for how to get the shapefiles from the USGS and Census bueau websites for a specific reason.

US government websites are almost as bad about breaking their links with updates and changing their user interfaces as Microsoft, so any instructions I write will likely be outdated within weeks. US government websites are great examples of how to design a website if you don’t want your users to find anything.

Instead, I’ll write in more general terms about how the data sources are built and what they contain, factors which do not change as frequently. I will leave the specifics as a Google Fu exercise, as modern search engine algorithms are a much more efficient way to find the information you are seeking than manually maintained site maps.

Accessing TIGER

The structure of the TIGER data is a little simpler for my purposes, so I’ll start with it. The Geography Program portal has options to access shapefile downloads from a web interface or FTP (File Transfer Protocol) portal. The FTP option is somewhat hilarious, but don’t use it.

If you were born after 1980, you’ve probably never used FTP. In its heyday, it was used heavily for pirating mp3 music files, games, and proprietary software before Napster was invented in 1999. Support for FTP has been dropped by all modern browsers, and I wouldn’t be surprised if the protocol has unpatched zero day security exploits. Let’s use the web interface instead.

Being a product of the Census Bureau, the purpose of TIGER is to facilitate the census taken every 10 years and help the public make use of the agency’s data. This means that most of the information in TIGER data sets is about administrative divisions, unlike the US Geological Survey’s TNM which is mainly concerned with the physical world.

These include state, county, tribal, and other political boundaries, but also census blocks, school districts, legislative districts, voting districts, Congressional districts, city boundaries, and statistical areas (metropolitan areas, micropolitan areas, Combined Statistical Areas (CSA), and Core Based Statistical Areas (CBSAs)).

These administrative divisions, and physical features including coastline, landmarks, roads, rails, military installations, and water, are the types of layers available for download and each layer is updated annually. Files are downloaded as zip archives that include a range of types from the TIGER Shapefile download page.

These archives are what we will be working with, so let’s begin exploring the data.

Structure and contents of TIGER shapefile archives

The Library of Congress ESRI Shapefile description page has a great history and explanation of the format and the different types of files that are found in a shapefile directory. For our purposes in R, files with .shp extensions are the only type that matters.

As I mentioned above, we will be focusing on Beaver County, which is in the state of Pennsylvania. TIGER internally uses FIPS codes to represent items in a standardized way, so 42 represents Pennsylvania and 007 represents Beaver County.

Getting the data we want is not as simple as just downloading all the shapefiles for Beaver County, however. Shapefiles for different layer types are not consistent about whether they show data at the national, state, or county level.

The scale shown seems to be based on file size and number of points, with layers consisting of a smaller number of points (like rails and coastlines) shown at the national level, and more dense data (like secondary roads and waterways) shown at the county level.

It’s worthwhile to reiterate here that national data particularly can get very large and resource intensive. The .shp file for US county boundary data is just under 130 megabytes on disk, but when I draw on my system RAM usage spikes to over 12 gigabytes at various points and my CPU temperature goes very high.

Definitely be mindful of how resource intensive these large drawing operations can become and treat your system accordingly. If you use try to perform a task that requires more RAM than your system has available, it may fail and your computer can crash. Most CPUs have a temperature where they will shut down to prevent damage, usually in the 100 degrees Celsius range. Cooking your CPU by running it in this range is not advisable.

National level layers

We’ll start by looking at the national data layers, and load the shapefiles using sf::st_read(). I’ve suppressed the st_read() lines below because they generate a lot of console output, but anywhere a variable titled sf shows up below, it represents an sf object read in using st_read(). The more verbose source document for this page is also available here.

The variables in each layer are not standardized throughout TIGER, so each layer must be analyzed individually. For instance, the national county boundary shapefile helpfully contains columns for state and county IDs:

STATEFPCOUNTYFPCOUNTYNSGEOIDNAMENAMELSADLSADCLASSFPMTFCCCSAFPCBSAFPMETDIVFPFUNCSTATALANDAWATERINTPTLATINTPTLONgeometry
310390083584131039CumingCuming County06H1G4020NANANAA147764534510690204+41.9158651-096.7885168MULTIPOLYGON (((-97.01952 4…
530690151327553069WahkiakumWahkiakum County06H1G4020NANANAA68097623161568965+46.2946377-123.4244583MULTIPOLYGON (((-123.4364 4…
350110093305435011De BacaDe Baca County06H1G4020NANANAA601681894629090018+34.3592729-104.3686961MULTIPOLYGON (((-104.5674 3…
311090083587631109LancasterLancaster County06H1G402033930700NAA216927297022847034+40.7835474-096.6886584MULTIPOLYGON (((-96.91075 4…
311290083588631129NuckollsNuckolls County06H1G4020NANANAA14896451881718484+40.1764918-098.0468422MULTIPOLYGON (((-98.27367 4…
720850180452372085Las PiedrasLas Piedras Municipio13H1G402049041980NAA8774841832509+18.1871483-065.8711890MULTIPOLYGON (((-65.91048 1…
460990126577246099MinnehahaMinnehaha County06H1G4020NA43620NAA208970143918188795+43.6674723-096.7957261MULTIPOLYGON (((-96.89022 4…
483270138394948327MenardMenard County06H1G4020NANANAA2336237980613559+30.8852677-099.8588613MULTIPOLYGON (((-99.82187 3…
060910027731006091SierraSierra County06H1G4020NANANAA246869458223299110+39.5769252-120.5219926MULTIPOLYGON (((-120.6556 3…
210530051687321053ClintonClinton County06H1G4020NANANAA51087575721152699+36.7272577-085.1360977MULTIPOLYGON (((-85.2391 36…
390630107404439063HancockHancock County06H1G402053422300NAA13761249806021322+41.0002170-083.6659471MULTIPOLYGON (((-83.88076 4…
481890138388048189HaleHale County06H1G402035238380NAA2602109423246678+34.0684364-101.8228879MULTIPOLYGON (((-102.0874 3…
010270016153901027ClayClay County06H1G4020NANANAA15642554265281616+33.2703999-085.8635254MULTIPOLYGON (((-85.97879 3…
480110138379148011ArmstrongArmstrong County06H1G402010811100NAA235461758512183672+34.9641790-101.3566363MULTIPOLYGON (((-101.6251 3…
390030107401539003AllenAllen County06H1G402033830620NAA104258738911152061+40.7716274-084.1061032MULTIPOLYGON (((-84.39719 4…
131890034879413189McDuffieMcDuffie County06H1G4020NA12260NAA66659001423114032+33.4806126-082.4795333MULTIPOLYGON (((-82.44998 3…
551110158111555111SaukSauk County06H1G402035712660NAA215368601345692969+43.4280010-089.9433184MULTIPOLYGON (((-90.19196 4…
051370006990205137StoneStone County06H1G4020NANANAA15706732948145444+35.8570011-092.1404819MULTIPOLYGON (((-92.41528 3…
410630115513541063WallowaWallowa County06H1G4020NANANAA814783533314191752+45.5937530-117.1855796MULTIPOLYGON (((-117.7475 4…
420070121411242007BeaverBeaver County06H1G402043038300NAA112585481924162295+40.6841401-080.3507209MULTIPOLYGON (((-80.51896 4…


These columns make it easy to filter for Pennsylvania and Beaver County. Thus filtered, simple maps are easy to build:

pa <- sf %>%
  filter(STATEFP == '42')

bc <- sf %>%
  filter(STATEFP == '42' & COUNTYFP == '007')

pa <- pa %>%
  ggplot() +
  geom_sf(fill = "#fef5e7") +
  coord_sf(
    datum = sf::st_crs(4326)
  ) +
  labs(title = "Pennsylvania Boundaries by County")

bc <- bc %>%
  ggplot() +
  geom_sf(
    fill = "#fef5e7"
    ) +
  coord_sf(
    datum = sf::st_crs(4326)
  ) +
  labs(title = "Beaver County Boundaries")

# Use the Patchwork library to combine plots
pa | bc

Most of the other columns in the data don’t have obvious meanings and I’m not sure exactly how to determine what variables like MTFCC mean at the moment. I’m guessing there is a data dictionary available somewhere that breaks down these variables (or Google), but for our immediate, exploratory purposes here the polygons are what we are looking for.

Other national sets are less clear cut. The military installations layer, for instance, doesn’t include any information that could be used to easily tie these installations to the state or county level:

ANSICODEAREAIDFULLNAMEMTFCCALANDAWATERINTPTLATINTPTLONgeometry
NA110509768024Tripler Army Medical CtrK211014434580+21.3620086-157.8896492MULTIPOLYGON (((-157.8982 2…
NA110507841923Ng Havre de Grace Mil ResK21103041090+39.5327115-76.1041621MULTIPOLYGON (((-76.11008 3…
NA110509768183Aliamanu Mil ResK211021041330+21.3609483-157.9117716MULTIPOLYGON (((-157.9238 2…
NA110509768096Waianae-Kai Mil ResK2110344070+21.4469221-158.1909990MULTIPOLYGON (((-158.1931 2…
NA110509768142Schofield Barracks Mil ResK2110205091120+21.4993768-158.0824984MULTIPOLYGON (((-158.1151 2…
NA110509768059Helemano Mil ResK211012046080+21.5361123-158.0194801MULTIPOLYGON (((-158.0267 2…
NA1108310928834West Point Mil ResK2110527996402434478+41.3590138-74.0322985MULTIPOLYGON (((-74.10504 4…
NA1108310958136Ng Florence Mil ResK2110759936170+33.1375881-111.3291098MULTIPOLYGON (((-111.3878 3…
NA110509767829Makua Mil ResK2110174064010+21.5307680-158.2068764MULTIPOLYGON (((-158.2406 2…
NA110435324283Stone Ranch Mil ResK211074409460+41.3654109-72.2716641MULTIPOLYGON (((-72.26356 4…
NA1108310660208Ng Mta Gunpowder Mil ResK2110103257010372+39.4306656-76.5068215MULTIPOLYGON (((-76.51083 3…
NA1108301385692Ng Papago Mil ResK211018445090+33.4673291-111.9621997MULTIPOLYGON (((-111.9698 3…
NA110159819765Keaukaha Mil ResK211024508600+19.7068079-155.0391915MULTIPOLYGON (((-155.0528 1…
NA1104755641127Stone Ranch Mil ResK21101789290+41.3516867-72.2836503MULTIPOLYGON (((-72.2868 41…
NA1108311012199Thunderbird Youth AcdmyK2110233501038530+36.2981857-95.2927873MULTIPOLYGON (((-95.30986 3…
NA110443570421Offutt AFBK21107233193513297+41.1178814-95.9065457MULTIPOLYGON (((-95.92719 4…
NA110510128963Macdill AFBK2110214994461412049+27.8491777-82.5107811MULTIPOLYGON (((-82.53755 2…
NA110760434687Langley AFBK211011687672104265+37.0856955-76.3635705MULTIPOLYGON (((-76.38734 3…
NA1101110616060Edwards AFBK211012403834707042912+34.9080209-117.8278171MULTIPOLYGON (((-118.1409 3…
NA110457239915Cannon AFBK21101380479067511+34.3821366-103.3169436MULTIPOLYGON (((-103.3372 3…

Beaver County itself doesn’t have any military installations, the closest one is the 911th Airlift Wing Air Force reserve station in Allegheny County. The AREAID appears to be the unique feature identifier in this and other layers, so after locating the AREAID for the 911th, we can filter the layer and map the station relative to Beaver County:

afrc_911 <- sf %>%
  filter(
    AREAID == '1102202009432'
    )

afrc_911 %>%
  ggplot() +
  geom_sf(
    fill = "#1fffb4",
    size = .25
    ) +
  geom_sf(
    data = bc,
    fill = "#c7c1b3",
    alpha = .3
    ) +
  coord_sf(
    datum = sf::st_crs(4326)
    ) +
  labs(title = "911th Reserve Station Relative to Beaver County")

Plotting is more difficult when dealing with rails since the railway layer is a large collection of lines and it would be a great deal of work to identify the unique AREAID for each rail line running through Beaver County. For this layer, I think the best approach is to set geographic boundaries in ggplot’s coord_sf() function:

sf %>%
  ggplot() +
  geom_sf(
    data = bc,
    color = "#239b56",
    fill = "#239b56",
    alpha = .1
    ) +
  geom_sf(
    color = "#000000",
    size = .2,
    alpha = 1
    ) +
  coord_sf(
    xlim = c(-80.52, -80.15),
    ylim = c(40.47, 40.86),
    datum = sf::st_crs(4326)
  ) +
  labs(
    title = "Rail Network in Beaver County"
  )

We would tighten up the visual appeal of this map if it was intended for publication, but for rough demonstration purposes this is sufficient.

Looking at the plot itself, I see several breaks in the continuity of the lines, which is not something that railway operators usually do intentionally. To confirm whether it’s an error, I looked at the data a little closer using plotly, and I believe the breaks represent a gap in the data rather than in the way ggplot2 rendered the lines.

Perhaps the gaps are present in old lines that are not in operation, but they are included for administrative reasons. If this were a relevant aspect of the analysis, I would investigate closer.

There are other national layers aside from the counties, military installations, and railways, but I believe that these three are sufficient to demonstrate the basic structure and challenges of working with national layers.

State level layers

Many TIGER layers that are more dense are shown at the state level. As with the federal data, layers are made of points, lines, and polygons that only occasionally coincide directly with county data. Here are a few of the state-level layers:

As with the federal data above, isolating specific cases from the data relevant to the analysis at hand requires a combination of using prior knowledge (e.g. knowing that Blackhawk school district is part of Beaver County but that Seneca Valley is not), examining the data with exploratory plotting, Google Fu, and so on.

As with the variable names in some of the federal layers, there are some aspects of the state-level layers that aren’t immediately clear to me. Particularly, there is a layer called “Places” whose polygons seem to correspond with settled areas (i.e. everything that is not an unincorporated area.

Since I am more interested in environmental analysis, the census-related layers like Census tracts and Congressional districts are of less direct interest to me, but their structures are all fairly similar to the layers listed above. State level layers appear to be the most common type, so familiarity with them is important for TIGER workflows.

County level layers

At the county level, the level of detail available in each TIGER layer goes up tremendously, so visual clutter becomes an issue. There is a combined layer called All Lines (written as edges in the archive) that contains all the geographical features tracked by the census in one layer. With appropriate lines and color, this layer is good for monochrome mapping:

sf %>%
  ggplot() +
  geom_sf(
    fill = "#145a32",
    size = .05
    ) +
  coord_sf(
    datum = sf::st_crs(4326)
  ) +
  labs(
    title = "All Lines in Beaver County"
  )

While this has the benefit of including a large range of data immediately, fine control is lost. All geometries are in a single column, so you couldn’t easily change the size or color of a single feature or feature type to highlight it. The data included isn’t clearly and consistently labeled, either.

That means it would take extra work to figure out what exactly certain lines represent. Some cases are named, and “Third Street” is pretty obviously a road, but others have blank name columns and only obscure variable names like MTFCC.

The MTFCC codes tells us what features mean in the context of the TIGER data, for instance S1400 represents “Minor, unpaved roads usable by ordinary cars and trucks.” Other code explanations are less helpful, like “P0001: Nonvisible Linear Legal/Statistical Boundary.”

Depending on the nature of the analysis, it could make sense to use dplyr functions to break out this data into separate layers. Alternatively, it’s possible to download the separate layers from the Census site, import them as sf objects, and modify them individually before combining them into one map. I prefer this latter approach.

To demonstrate building up a map this way, let’s make one that combines layers for line water (i.e. small creeks and streams), area water (ponds, lakes, rivers), all roads (including “minor, unpaved roads usable by ordinary cars and trucks”), and school districts. We’ll plot these on top of the county outline from the national data set.

sf_bc %>%
  ggplot() +
  geom_sf(
    color = "#732700",
    fill = NA,
    size = 1
    ) +
  geom_sf(
    data = sf_roads,
    color = "#ffac33",
    size = .05,
    alpha = 1
  ) +
  geom_sf(
    data = sf_areawater,
    color = NA,
    fill = "#3e721d",
    size = .5,
    alpha = .3
  ) +
  geom_sf(
    data = sf_linearwater,
    color = "#3e721d",
    size = .2,
    alpha = .3
  ) +
  geom_sf(
    data = sf_unsd,
    color = "#55aaff",
    fill = NA,
    size = .6
  ) +
  coord_sf(
    datum = sf::st_crs(4326)
  ) +
  labs(
    title = "Beaver County School Districts"
  )

This was pretty fun to make. I’m not trying to apply “real” cartographic practice here, since there are a lot of semantic conventions that I don’t understand well enough to actually use. In ggplot, the details of each layer (called aesthetics) can be customized heavily, and this flexibility is a big reason the Grammar of Graphics (the gg in ggplot) approach has been successful.

I’m still very much a novice in visual design, so I played around with color here to try and break myself from the lazy conventions of “blue water, green trees, dark colored roads.” I also played with the alpha (transparency) layer to deemphasize the water and roads while emphasizing the school districts.

Mapping like this is a good way to look at the layering approach used by both ggplot2 and mapping software. The grammar of graphics approach pairs well with cartography and is very intuitive: take the Beaver County outline, then place yellow roads on top of it. Then put green lakes and rivers on top of that. Finally, add blue school district boundaries and a title.

Editing maps with ggplot is super simple, with the biggest limitation being the time it takes for drawing operations to complete. Other feature layers are simple to add, particularly text with geom_text(). I did not add text to this plot because the work of managing text layers is one too many items to juggle at the moment.

I will work with text more later, however. Thinking more broadly about data, it would be possible to add any almost plot that ggplot2 can render to a map, so it’s totally possible to analyze data and plot the results right on a given point.

Conclusion

When building maps, it’s necessary to have accessible base layers, preferrably ones that do not require expensive licensing and which work at the scale(s) that will be eventually used. The TIGER shapefiles are a good start for US-based mapping and any work that involves Census-related data.

TIGER does not include extensive information on physical features, but the layers it has are sufficient for a great amount of mapping work. When mapping information that requires a map of features from the local level up to the national level in the background for reference, TIGER data should be more than sufficient to get the job done.

For this post, I used ggplot2 exclusively to map for consistenty’s sake, but also because I am most familiar with it and because I am curious about how its various extensions work (or don’t work) with cartographic workflows. I want to explore the US Geological Survey’s The National Map (TNM) next, but then I’d like to explore alternative tools.

There is a great deal of variation between major mapping packages, and while ggplot is excellent, I do not want to limit myself. While ggplot has some interactivity when combined with plotly’s ggplotly() function and possibly ggiraph?, there are many other places where flexible interactive maps are necessary. tmap in particular seems quite powerful.

There is also a great deal of work I will do once I’ve created basic maps: using text labels and icon overlays like route numbers, and symbols for features like hospitals, airports, or forests. Eventually, I will start working on more meaningful analysis that requires using other sf columns, joining separate data frames together.

Even with some basic knowledge and effort, I am pleased with how much I can do and impressed with the power and flexibility of the tools and data available. I’m looking forward to doing more.