After successfully querying my pi-gres database with SQL commands, retrieving data, and saving it locally, it’s time to make the data presentable. This is easier said than done: while it’s possible to scrape something together that works well enough, I have several other considerations:
- Portability and reusability: as much fun as I have taking deep dives into the documentation (and I only say that half sarcastically), I want to write code that I can use in other projects without substantially rebuilding it every time.
- Compatibility: my primary concern is blogging for now, so the ability to demonstrate code and style the appearance of outputs is important. R Markdown and associated packages are quite capable, but I also eventually want to create
htmlwidgets
, web apps withshiny
, and maybe dashboards. - Readability and best practices: while I don’t have any reservations about going against the grain, especially when cobbling a project together at the start, best practices exist for a reason and it’s always good to consider the poor people who will have to read, or God forbid maintain, this code.
I also want to be mindful of the balance between overthinking and flailing aimlessly in the name of “just doing it.” With that in mind, this is how I am proceeding.
Setup and R Markdown
I’m new to R Markdown and still learning its ins and outs, so I’ll be leaning a lot on two courses that were very helpful for me: Creating Reports and Presentations with R Markdown and RStudio taught by Charlie Joey Hadley, and Data Wrangling in R taught by Mike Chapple.
As an aside, LinkedIn Learning is available for free through the New York Public Library. I have used their catalog heavily in recent years and recommend them if you don’t already have a training site (Coursera, Khan Academy, Udemy, edX, etc.) that you like.
As I work here, I have several goals. Specifically, I want to:
- Accurately manipulate the table to make it appear how I want
- Use
knit
,kable
, andkableExtra
packages to make it look pretty and play well with my blog’s theme - Control the appearance of live code chunks and make sure I don’t break things
R Markdown documents are different from most Markdown in that code can be run live and the results displayed. By contrast, regular code blocks made with ``` don’t do anything:
penguin@linux:~$ sudo dd if=/dev/urandom of=/dev/nvme0n1 bs=1M
The above code is inside a code block and doesn’t do anything, which is fortunate because it is a command to write over my hard drive with random data. A live code block looks like the one below when written in code, but I’ve escaped it with this technique so it’s not live:
```{r setup}
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
```
Now here the same chunk is again, but this time it’s live:
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
In this case, the code is setting the defaults for messages generated when code is run. The point of a blog is presentation, and hopefully for my audience’s sake I’m not presenting a bunch of junk code that throws errors, as console output is visually distracting, very probably unrelated to the message, and something most readers generally don’t want.
Then again, I do like to take pictures of the Blue Screen of Death whenever I see it in unexpected places, like a subway train arrival time board. The next time I have to make a really boring presentation, maybe I’ll lighten the mood by including a kernel panic.
In an actual presentation, I’ll hide a code block like this with the include=FALSE
flag so that the code is run, but it isn’t shown in the final document. This is a post to demonstrate, though, so code blocks make some sense here. Next, we’ll load the table from my local drive and display it with the default settings.
library(tidyverse)
data <- read_csv("table.csv")
print(data)
## # A tibble: 8 x 5
## dataset_title num_columns num_rows notes url
## <chr> <dbl> <dbl> <chr> <chr>
## 1 Drinking Water Qual… 11 95400 Many sites (~400)… https://data.cit…
## 2 Distribution Sites … 10 1411 Many testing site… https://data.cit…
## 3 Entry Point LCR Mon… 10 8181 Testing sites are… https://data.cit…
## 4 Watershed Water Qua… 10 2431 Simple data set c… https://data.cit…
## 5 DEP - Cryptosporidi… 5 1065 Sample sites are … https://data.cit…
## 6 Compliance at-the-t… 7 2871 Compelling data s… https://data.cit…
## 7 Harbor Water Quality 100 92000 Many sites (40+),… https://data.cit…
## 8 Current Reservoir L… 25 1278 Rows are reservoi… https://data.cit…
Golly, that’s ugly. I really like the IBM Plex Mono typeface, but the table had nothing to do with that. It needs a little work before it’s fit for human consumption.
Very basic data wrangling and table formatting
When I first learned about data wrangling in R, I thought it would be fun to look at the NYC Parking Violations dataset. Then I ran the unique()
command on the vehicle color column and discovered that the fine people of the NYC Transportation Bureau had listed 1,830 different vehicle colors.
My favorite entry was “B.” Blue? Black? Beige? Brown? Burple? ¯\_(ツ)_/¯ Long story short, I’m not ready to tackle that dataset yet. I look forward to doing so, but it’s a stretch goal. The reason I created this example table was to list some data sets that seemed reasonably easy to manage and progressively build my skills against.
The challenge with data wrangling, as with much of programming, is that ideas that are simple for a human to express are difficult to translate into code. In this case, these are the very basic manipulations I want to do to this table:
- Turn the dataset title into a hyperlink
- Clean up the column names
- Get rid of the URL column once it’s no longer needed
Fortunately, I’ve had some experience with string manipulation dating back to my very first coding project (RIP Pebble) so the topic wasn’t completely new. Essentially, the instructions above turned into this:
- Add brackets [] before and after the dataset title and add parentheses () before and after the URL
- Crate a new column using these modified columns and call it Dataset (voila! Markdown hyperlink)
- Recreate the table with Dataset at the front, rename Columns, Rows, and Notes, drop dataset_title and url
Again, this is super simple and even this required a bit of diving around the documentation. In any language as old as R, there are multiple ways to do simple operations (e.g. paste()
), there are deprecated packages (I encountered write_csv()
recently), and then there are best practices to consider (like consistency in using the tidyverse
).
After spinning those wheels for a while I had to stop myself from overthinking and just write. I’m not going to get it perfect, but perfect is the enemy of finishing anything. This is what I did:
data <- data %>%
mutate(Dataset = str_c("[",data$dataset_title,"]",
"(",data$url,")")
) %>%
select(Dataset, Columns = num_columns,
Rows = num_rows, Notes = notes,
-dataset_title, -url) %>%
print()
## # A tibble: 8 x 4
## Dataset Columns Rows Notes
## <chr> <dbl> <dbl> <chr>
## 1 [Drinking Water Quality Distribu… 11 95400 Many sites (~400), fewer vari…
## 2 [Distribution Sites LCR Monitori… 10 1411 Many testing sites, compellin…
## 3 [Entry Point LCR Monitoring Resu… 10 8181 Testing sites are rows, would…
## 4 [Watershed Water Quality Data](h… 10 2431 Simple data set conceptually …
## 5 [DEP - Cryptosporidium And Giard… 5 1065 Sample sites are listed in ro…
## 6 [Compliance at-the-tap Lead and … 7 2871 Compelling data set, has pote…
## 7 [Harbor Water Quality](https://d… 100 92000 Many sites (40+), rows, and o…
## 8 [Current Reservoir Levels](https… 25 1278 Rows are reservoirs, columns …
Now it’s even less legible, but the Dataset column is now capable of hyperlinking with Markdown and the column headers are human-friendly. I’ll turn it into a cleaner table. There are a lot of R packages for working with tables. A whole lot.
condformat
, DT
, flextable
, formattable
, ftextra
, gt
, huxtable
, kableExtra
, knitr
, pander
, pixiedust
, reactable
, rhandsontable
, stargazer
, tables
, tangram
, xtable
, ztable
.
At some point I’d like to experiment with the different packages, but for now I’ll start with knitr::kable()
because it’s part of the tidyverse
, which in my limited experience is a cohesively designed set of packages with a clean API and good documentation. Being easy for beginners is more important for me right now than being the “best” technical offering.
Here’s the table with kable()
:
data %>% knitr::kable()
Dataset | Columns | Rows | Notes |
---|---|---|---|
Drinking Water Quality Distribution Monitoring Data | 11 | 95400 | Many sites (~400), fewer variables, many observations, would be good set for data wrangling |
Distribution Sites LCR Monitoring Results | 10 | 1411 | Many testing sites, compelling data (lead and copper at the tap), relatively simple set would be good for more impactful analysis (e.g. statistical approaches, storytelling) |
Entry Point LCR Monitoring Results | 10 | 8181 | Testing sites are rows, would be good tidy data presentation, is a fairly simple data set |
Watershed Water Quality Data | 10 | 2431 | Simple data set conceptually and structurally, could be used for more substantial analyses, e.g. stats, multivariate, regressions |
DEP - Cryptosporidium And Giardia Data Set | 5 | 1065 | Sample sites are listed in rows rather than as column variables, this would be good tidy data presentation. Data overlaps with Watershed Water Quality Data, tables could be pulled together for more complex SQL analysis or data table manipulations |
Compliance at-the-tap Lead and Copper Data | 7 | 2871 | Compelling data set, has potential interest or applicability for municipal nerds and civil groups |
Harbor Water Quality | 100 | 92000 | Many sites (40+), rows, and observation, would be good exercise for data wrangling and visualization |
Current Reservoir Levels | 25 | 1278 | Rows are reservoirs, columns are measurements, would make a good tidy data presentation |
It looks much better, and honestly I’d be comfortable publishing this. I know standard practice is to right-justify numbers, but that visually makes the most sense when there are many rows of numbers and they’re bound by a grid. When there are only a few numbers and lots of white space, I think centering numbers is more sensible. Let’s also add a caption.
data %>%
knitr::kable(align='lccl',
caption='NYC Environmental Open Data Sets of Interest')
Dataset | Columns | Rows | Notes |
---|---|---|---|
Drinking Water Quality Distribution Monitoring Data | 11 | 95400 | Many sites (~400), fewer variables, many observations, would be good set for data wrangling |
Distribution Sites LCR Monitoring Results | 10 | 1411 | Many testing sites, compelling data (lead and copper at the tap), relatively simple set would be good for more impactful analysis (e.g. statistical approaches, storytelling) |
Entry Point LCR Monitoring Results | 10 | 8181 | Testing sites are rows, would be good tidy data presentation, is a fairly simple data set |
Watershed Water Quality Data | 10 | 2431 | Simple data set conceptually and structurally, could be used for more substantial analyses, e.g. stats, multivariate, regressions |
DEP - Cryptosporidium And Giardia Data Set | 5 | 1065 | Sample sites are listed in rows rather than as column variables, this would be good tidy data presentation. Data overlaps with Watershed Water Quality Data, tables could be pulled together for more complex SQL analysis or data table manipulations |
Compliance at-the-tap Lead and Copper Data | 7 | 2871 | Compelling data set, has potential interest or applicability for municipal nerds and civil groups |
Harbor Water Quality | 100 | 92000 | Many sites (40+), rows, and observation, would be good exercise for data wrangling and visualization |
Current Reservoir Levels | 25 | 1278 | Rows are reservoirs, columns are measurements, would make a good tidy data presentation |
I’m not a fan of the forced “Table 1” prefix, and I’d like to give the Notes tab some more breathing space, but the kable documentation shows me that there really isn’t a lot of ability to modify kable
tables. To do a little more, I’ll use the kableExtra
package.
library(kableExtra)
data %>%
kbl(align='lccl') %>%
kable_styling(font_size=12)
Dataset | Columns | Rows | Notes |
---|---|---|---|
Drinking Water Quality Distribution Monitoring Data | 11 | 95400 | Many sites (~400), fewer variables, many observations, would be good set for data wrangling |
Distribution Sites LCR Monitoring Results | 10 | 1411 | Many testing sites, compelling data (lead and copper at the tap), relatively simple set would be good for more impactful analysis (e.g. statistical approaches, storytelling) |
Entry Point LCR Monitoring Results | 10 | 8181 | Testing sites are rows, would be good tidy data presentation, is a fairly simple data set |
Watershed Water Quality Data | 10 | 2431 | Simple data set conceptually and structurally, could be used for more substantial analyses, e.g. stats, multivariate, regressions |
DEP - Cryptosporidium And Giardia Data Set | 5 | 1065 | Sample sites are listed in rows rather than as column variables, this would be good tidy data presentation. Data overlaps with Watershed Water Quality Data, tables could be pulled together for more complex SQL analysis or data table manipulations |
Compliance at-the-tap Lead and Copper Data | 7 | 2871 | Compelling data set, has potential interest or applicability for municipal nerds and civil groups |
Harbor Water Quality | 100 | 92000 | Many sites (40+), rows, and observation, would be good exercise for data wrangling and visualization |
Current Reservoir Levels | 25 | 1278 | Rows are reservoirs, columns are measurements, would make a good tidy data presentation |
After experimenting with the package a bit, I can see a lot of its options butt heads with my blog’s theme, presumably the blog’s CSS styling overrides what kableExtra wants to do. I’m okay with that. For now, the table is reasonably sized and styled in a way I’m comfortable with.
I want to think about formatting and style, but I’d rather wait until the thing I am polishing is not a turd. Now that I have the table in a presentable format, I think it’s time to wrap this project up and start working on something else.