Now I have successfully connected to my pi-gres database, but I still haven’t developed a smooth way of querying the database directly. The database connection is created as an object in memory, and something about this object does not play well when R Markdown documents (.Rmd extension) are processed by blogdown.

R Markdown and packages like blogdown are designed to make R code more portable and presentable. It makes sense that formats designed for sharing would not work seamlessly with a complex entity like a database connection, since that’s beyond their scope.

Imagine a .pdf document trying to establish a remote connection whenever you opened it. It’s not a use case that the modern internet security apparatus would be particularly thrilled about. Perhaps somebody with more knowledge of databases in RStudio would be able to make it work elegantly, securely, and stably, but I’m not at that point yet and I don’t think it’s necessary for my workflow.

SQL/RStudio workflow, mk 2

Instead of trying to do all of the connecting, importing, manipulation, and analysis from within one monolithic document, I will split it up:

  • Import raw data into the pi-gres database and manage it through pgAdmin4
  • Connection details saved in a dedicated R script for easy reference
  • One working R Markdown document for blogdown-unfriendly work that requires a database connection, i.e. SQL queries
  • Another Rmd document for presenting the analysis
  • Where necessary or desired, relevant code will be copied from the working document into the analysis using non-functional code blocks (```)
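For concreteness, a copied block in the analysis document could look like the sketch below. A bare ``` fence works as described above, but an R chunk with eval=FALSE (just a suggestion, not something the workflow requires) keeps syntax highlighting while still never being evaluated when blogdown builds the site:

```{r, eval=FALSE}
# Copied from the working document for display only; blogdown never runs this
data <- read_csv("nyc_data_sets.csv")
glimpse(data)
```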

To demonstrate and test that this actually works, I have been reading about and exploring various open data sets to find ones that might be interesting to work on. My criteria: the data should let me demonstrate technical ability, answer questions, tell stories, and be compelling, awe-inspiring, and entertaining.

The New York City Open Data site has many great environmental data sets, particularly those relating to water quality. I picked a number of sets that seemed promising, opened a spreadsheet program, and created a table describing them, which I then exported to .csv, the most common platform-agnostic format for sharing data sets. I will use this file to test my workflow.

Examine and import the data

Before we import the data into pgAdmin4, we need to create the table in postgres with the correct headers, or bad things will probably happen. I’m not in an explore-and-break mood: I recently had to reinstall RStudio and troubleshoot a bunch of recursive package-management craziness with dpkg and aptitude after an update made RStudio crash every time ggplot() was called (oops). So I’ll be conservative for the moment.

  • Within our working script, import the data into RStudio and look at the column headers so that we know what columns to create in our postgres table. Technically, we don’t have to do this step in RStudio; we could just as easily open the .csv file in a text editor or spreadsheet program and look at the headers there.
> library(tidyverse)
> data <- read_csv("nyc_data_sets.csv")
> glimpse(data)
Rows: 8
Columns: 8
$ dataset_title    <chr> "Harbor Water Quality", "Current R…
$ url_slug         <chr> "5uug-f49n", "zkky-n5j3", "bkwf-xf…
$ update_frequency <chr> "weekly", "monthly", "monthly", "b…
$ num_columns      <dbl> 100, 25, 11, 10, 10, 10, 5, 7
$ num_rows         <dbl> 92000, 1278, 95400, 1411, 8181, 24…
$ each_row_is      <chr> "sample", "city reservoir", "sampl…
$ notes            <chr> "Many sites (40+), rows, and obser…
$ url              <chr> "https://data.cityofnewyork.us/Env…
  • It’s good practice to enforce character lengths in database fields, so we also need to get an idea of how long to make each field. Use the nchar() function to count the length of each entry to figure out how many characters we’ll need (a more compact one-liner is sketched after the output below):
> nchar(data$dataset_title)
[1] 20 24 51 41 34 28 42 42
> nchar(data$url_slug)
[1] 9 9 9 9 9 9 9 9
> nchar(data$update_frequency)
[1] 6 7 7 8 8 7 6 8
> nchar(data$num_columns)
[1] 3 2 2 2 2 2 1 1
> nchar(data$num_rows)
[1] 5 4 5 4 4 4 4 4
> nchar(data$each_row_is)
[1]  6 14  6  6  6 11  6  6
> nchar(data$notes)
[1] 100  87  91 173  89 129 246  97
> nchar(data$url)
[1]  72  76 102  93  86  80  92  94
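Rather than eyeballing each column one at a time, the maximum length per column can be pulled out in one pass. A compact equivalent of the checks above (same information, just summarized per column):

# Maximum number of characters in each column, after coercing to character
sapply(data, function(x) max(nchar(as.character(x))))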
  • Before we can query the database and retrieve data, it needs to exist and it needs to have data in it. Create the database in pgAdmin4 using either the query tool or the GUI:
CREATE DATABASE environment
    WITH 
    OWNER = postgres
    ENCODING = 'UTF8'
    CONNECTION LIMIT = -1;
  • Within the environment database, create a schema for open data sets:
CREATE SCHEMA open_data
    AUTHORIZATION postgres;
  • Using the lengths we got with nchar() above to inform our character field lengths, create the NYC open data sets table:
CREATE TABLE IF NOT EXISTS open_data.nyc_data_sets
(
    dataset_title character varying(100) NOT NULL,
    url_slug character varying(20) NOT NULL,
    update_frequency character varying(20),
    num_columns integer,
    num_rows integer,
    each_row_is character varying(20),
    notes character varying(400) NOT NULL,
    url character varying(200) NOT NULL,
    PRIMARY KEY (url_slug)
);

ALTER TABLE open_data.nyc_data_sets
    OWNER to postgres;
  • Import the data using the GUI in pgAdmin. I’m sure there’s a way to do this with SQL, but I’m not that knowledgeable yet. I could go and search for how to import data with raw SQL, but minimizing cognitive burden is the name of the game right now.
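For the record, the same import can be done from R once a connection exists. A sketch, assuming a connection object like the con created in the next section and that the .csv columns match the table definition:

# Append the rows already loaded in R to the existing postgres table
# Assumes `con` exists and that column names/types match open_data.nyc_data_sets
DBI::dbWriteTable(con,
                  DBI::Id(schema = "open_data", table = "nyc_data_sets"),
                  data,
                  append = TRUE)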

Confirm everything loaded properly with a quick SELECT query in pgAdmin4, e.g. SELECT * FROM open_data.nyc_data_sets;

Now the data is in pi-gres, so we’re ready to connect from RStudio, query the data, and build a table.

Connect and query

In addition to the odbc and RPostgres packages for handling connections, there’s also a package called pool that intelligently manages database connections and is designed for use cases like Shiny web apps, which need to flexibly create, (re)assign, idle, and break connections as demand requires. For our purposes, the RPostgres package is the most beginner-friendly, so we’ll start with that.
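For reference, a minimal pool sketch looks roughly like this (host and credentials are placeholders, and nothing later in this post depends on it):

# Sketch only: a small connection pool using the pool package
library(pool)
library(RPostgres)

pg_pool <- pool::dbPool(RPostgres::Postgres(),
                        dbname = "environment",
                        host = "0.0.0.0", # placeholder: the server's IP
                        user = "postgres",
                        password = "change_me")

DBI::dbGetQuery(pg_pool, "SELECT 1;") # a connection is checked out and returned automatically
pool::poolClose(pg_pool) # close all pooled connections when finished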

  • Create the connection object. We will use database connections in later work, so it’s probably prudent to save the script as a separate file. odbc and pool syntax are slightly different, so creating separate scripts for them is a good idea too.
# db_connection_RPostgres.R
# Make sure you're disconnected from VPN!
# Connect using RPostgres

library(pacman)
library(DBI)
library(RPostgres)
# Connection objects are created one per database
# Change dbname or create multiple connection objects to access other databases
con <- DBI::dbConnect(RPostgres::Postgres(),
                      dbname = "environment",
                      host = "0.0.0.0", # Replace with actual server's IP
                      port = 5432,
                      user = rstudioapi::askForPassword("Input database user name"),
                      password = rstudioapi::askForPassword("Input database user password")
                      )

DBI::dbDisconnect(con) # Disconnect when finished
rm(list = ls()) # Clear environment
pacman::p_unload(all) # Clear packages
  • In the working script, run a SQL query to pull the data into R. We don’t need all the columns, so only select what we need:
```{sql, connection=con, output.var = "data"}
SELECT dataset_title,
       num_columns,
       num_rows,
       notes,
       url
FROM open_data.nyc_data_sets;
```
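If you’d rather stay in an R chunk, the same query can be run through DBI instead of a SQL chunk. A sketch equivalent to the chunk above:

# Equivalent R-side query; returns the same columns into `data`
data <- DBI::dbGetQuery(con, "
  SELECT dataset_title, num_columns, num_rows, notes, url
  FROM open_data.nyc_data_sets;
")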
  • Save the retrieved data as a .csv file which can then be used in the main analysis:
write.csv(data, "table.csv", row.names=FALSE)
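In the analysis document, that file can then be read back in without any database connection (the variable name here is just illustrative):

# In the analysis .Rmd: no database connection required
library(readr)
table_data <- read_csv("table.csv")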

Conclusion

Data analysts, data visualization designers, data scientists, and most professionals who need to query databases are not database administrators, so knowledge of the plumbing necessary to administer a database is beyond the scope of their duties.

In most corporate environments, there is a DBA (or a team of them) who would simplify and facilitate this work. Sometimes analysts don’t even have direct access to the database, and instead request data pulls based on an available data dictionary explaining what’s what.

In those cases, that last SQL query I wrote would be sent to the DBA, and the resulting .csv file would be sent back for analysis. It is precisely because of my experience in these environments that I wanted to bootstrap the process myself. I would never be fully comfortable playing around in a production database, and I am sure any self-respecting management team wouldn’t be comfortable with it, either.

One downside of “training wheels” environments (and to be fair, production environments are not where you want to play around) is that they stifle learning. By nature I gravitate towards subject matter expertise (SME) rather than the generalist understanding expected of most leaders and “customers” of a given workflow. I want to roll up my sleeves and get my hands dirty.

There’s a big difference between the MBA-level understanding of what a schema is and the knowledge gained by actually linking tables with a foreign key and writing a data dictionary. More to the point, ownership builds competence, improves risk awareness, reduces vulnerability to bullshitting and foot-dragging, and facilitates understanding of a topic in a way that just does not happen when working on somebody else’s farm.

If I want to DROP TABLE, nobody is going to stop me, and looking inside the guts of an actual database and breaking it makes me better able to carefully and skillfully use the databases I’m not supposed to break.

By owning the entire process from data creation and storage through to extraction with SQL, I have worked along all points of a workflow that is qualitatively the same as what is done in enterprise settings, even though it’s just a rinky-dink Raspberry Pi setup. I also have a home option for when it’s time to learn more advanced work like web scraping, storing GIS data, and creating machine learning models.

I don’t know if running more advanced data workflows off the Pi will make sense, of course. I had considered bootstrapping this blog, too, but Netlify proved to be a much more efficient route for hosting and deployment. By using their service, I also didn’t have to allocate time to teach myself nginx, domain security, and all the other topics that go into site reliability.

I expect there are organizations with resources dedicated to helping students like me learn these advanced topics. It’s in their interest to have a larger pool of talent, and the cost of making resources available is lower than the cost of not having enough trained workers. Knowing that I could do all of those things if I wanted to, however, is a great feeling.