This post describes the setup of a Raspberry Pi running Alpine Linux. I will install PostgreSQL later and use the system as a practical way to run “real” SQL queries against a live database, building the database manipulation and data wrangling skills that are critical for a good analyst.

Skills in data manipulation, analysis, and visualization are also the foundation of more complex work, including statistical analysis, data engineering, geographic information systems (GIS), and data science.

The combination of simple projects and regular posts will also act as a professional communication exercise and work portfolio demonstrating domain proficiency, writing ability, and thought processes.

Why this project

For a few years now, I have been following academic research across a range of environmental fields, and I am familiar with many examples of how modern data tools and techniques are influencing other research areas.

This nexus of data, environmental work, and research commercialization fascinates me and so at the beginning of this year I decided that I want to specialize in data work related to the environment.

While the work of researchers is important and related to the larger work I want to do, my experience in higher education administration convinced me that I do not want to be a researcher myself.

I believe that supporting this work as a data professional, either as a freelancer, consultant, teacher, or in a relevant corporate setting, will be more enjoyable and sustainable for me and likely more impactful.

Since the areas that I want to focus on in the near future (data analysis, data visualization, statistical applications, and machine learning) are much more applied in nature, portfolios are a common way to demonstrate competence and communication ability and to showcase project work.

I have started exploring open datasets available from various sources and believe that these are a good starting point for building projects. I will approach these datasets with several questions in mind:

  • How can I use this to demonstrate technical ability?
  • What questions can this dataset answer, and why are they important?
  • What stories are in this data and how can I tell them best?
  • How can I create something out of this that is beautiful, compelling, and entertaining?

In undertaking work like this, it is important to find a balance between big-picture strategic concerns and the nitty-gritty details. It is also important to strike a balance between exploration and focus: by habit I explore more than I focus, so creating this blog is also an effort to push myself to focus on concrete products.

Into the nitty gritty

I will write more about how I am “project managing” myself and my efforts, but I have been focusing first on understanding the larger landscape of contemporary data work. Tool-wise, I have been studying the R programming language for analysis and discovering the ecosystem that exists around it, including the blogdown package I am using to create this blog.

Competence in using R is important, but the training exercises containing well-structured and well-behaved data are not representative of the real world. Most data worth studying has to be dragged into the analysis kicking and screaming.

To that end, I have started brushing up on my limited SQL skills and will need a “real” database to run queries against. Eventually, I may be able to use it for web crawling, ingesting data, and other more intensive work, but for now I will consider the project successful if I can connect to the database, import existing CSV-formatted datasets into tables, and successfully run queries against them.
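To make that success criterion concrete, below is a rough sketch of the kind of psql session I have in mind once the database exists. The table name, columns, and CSV file are placeholders of my own invention, and the details will likely change once PostgreSQL is actually installed.

-- create a table matching the (hypothetical) CSV's columns
CREATE TABLE air_quality (
    station_id  text,
    sample_date date,
    pm25        numeric
);

-- import an existing CSV file into the table using psql's client-side \copy
\copy air_quality FROM 'air_quality.csv' WITH (FORMAT csv, HEADER true)

-- run a query against the imported data
SELECT station_id, avg(pm25) AS mean_pm25
FROM air_quality
GROUP BY station_id
ORDER BY mean_pm25 DESC;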

Hardware and software choices

In order to run learning projects involving SQL, I am going to set up a basic home server on a Raspberry Pi. The Pi has plenty of quirks and drawbacks, but it is cheap, I already have one, and it is powerful enough for the tasks at hand.

In addition to studying SQL, I am interested in (eventually) learning more about containerization and cloud deployment using technologies like Docker, so I will use Alpine Linux, a distribution designed to be small and secure that also runs from RAM, making it fast. Plus, I like trying new distros.

There are many well-regarded enterprise-grade database programs, and I do not have enough experience with database administration in a business context to have a strong preference among them. I am picking PostgreSQL for the following reasons, in no particular order:

  • It is widespread and proven technology
  • It is widely taught, making it easier to find courses, guides, and forum posts for specific issues
  • It is free and open source software (FOSS)
  • Its SQL dialect is fairly standard
  • It has good documentation
  • It has grown in popularity in recent developer surveys and job posting estimates

Install Alpine to SD

(If you are not interested in reading highly specific technical instructions, feel free to stop reading here.)

The base system for the database server will be Alpine Linux running on a Raspberry Pi. Since I will be connecting remotely, there is no need for a desktop environment, and I will run the server headless (no monitor or keyboard).

I used this guide for preparation of the SD card and this headless installation guide from the Alpine wiki.

  1. Get the tarball from Alpine. Since we are using a more recent Raspberry Pi 4, download the aarch64 build (alpine-rpi-3.13.5-aarch64.tar.gz as of writing).
  2. Open the GNOME Disks utility (package gnome-disks; dd or other command-line tools can also be used, as in the sketch after this list), click the triple-dot icon, and select ‘Format Disk.’
  3. In the ‘Erase’ dropdown, select either Quick or Slow; it makes no difference for our purposes here. In the ‘Partitioning’ dropdown, select the ‘GPT’ option.
  4. Click the + icon to create a partition in the unallocated space. I am using a 32 GB SD card and allocated all available space to the partition. The volume label can be anything; ALPINE is fine. The partition type is ‘FAT’. Once the new partition is created, it may be mounted automatically. You should see a ⏵ icon below the Volumes diagram; if you see a square icon instead, click it to unmount the SD card.
  5. Keep Disks open and switch to an archiving utility to extract the tarball onto the SD card’s root directory. I used another GNOME utility, Archive Manager (file-roller), to do this. Be careful not to extract the . directory itself to the SD card; instead, extract the contents of that directory.
  6. While the drive is still unmounted, open a terminal and run the command lsblk to list your devices. SD cards will usually show up as sdX or mmcblkX, where X is a letter or number; in my case the device is mmcblk0 and the partition is mmcblk0p1.
  7. Run the command fatlabel /dev/mmcblk0p1 ALPINE to change the volume name, substituting your partition’s device name for mmcblk0p1. This is necessary because of a current bug in the RPi firmware; the solution came from this issue. If you do not change the volume name, or if you do not unmount the volume before running fatlabel, your Pi will not boot and the red indicator light will stay solid.
  8. Back in Disks, click the ⏵ icon to mount the drive. We are done with the Disks program now.
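For anyone who prefers the terminal, the same preparation can be done from the command line. The sketch below assumes the card shows up as /dev/mmcblk0 and that the tarball is in the current directory; double-check the device name before running anything, because these commands destroy all data on the card.

# create a GPT partition table with a single FAT32 partition spanning the card
parted --script /dev/mmcblk0 mklabel gpt mkpart ALPINE fat32 1MiB 100%

# format the partition and set the ALPINE volume label (same effect as the fatlabel step above)
mkfs.vfat -n ALPINE /dev/mmcblk0p1

# mount the partition and extract the contents of the tarball onto it
mount /dev/mmcblk0p1 /mnt
tar xzf alpine-rpi-3.13.5-aarch64.tar.gz -C /mnt
umount /mnt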

At this point, you could hook the Pi up to a keyboard, monitor, and ethernet jack and it should boot properly. To run the server headless without additional setup from the local command line, we need to do some extra work.

Create headless overlay

  1. On your local computer (not on the SD card), type mkdir etc from the terminal to create a directory.
  2. Inside this directory, type touch .default_boot_services and then mkdir local.d runlevels to create the next level of the tree. Type ls -a to confirm that all three entries were created.
  3. Type cd runlevels to navigate to that directory, and type mkdir default to create a new directory. Type cd default to navigate there. In this directory, create a symlink by typing ln -s /etc/init.d/local local.
  4. Move to etc/local.d by typing cd ../../local.d and then touch headless.start to create a script file.
  5. Type nano headless.start to open the file, paste in the following script, then type Ctrl+o and Enter to save and Ctrl+x to exit. Type cd .. to return to etc.
#!/bin/sh
# headless.start: run once at boot to bring up networking and SSH
# so the Pi can be reached without a keyboard or monitor.

# write a basic /etc/network/interfaces with loopback plus DHCP on the chosen interface
__create_eni()
{
    cat <<-EOF > /etc/network/interfaces
    auto lo
    iface lo inet loopback

    auto ${iface}
    iface ${iface} inet dhcp
            hostname localhost
    EOF
}

# write a minimal wpa_supplicant config using the wifi credentials from wifi.txt
__create_eww()
{
    cat <<-EOF > /etc/wpa_supplicant/wpa_supplicant.conf
    network={
            ssid="${ssid}"
            psk="${psk}"
    }
    EOF
}

# temporarily allow empty-password root logins over SSH for the first connection
__edit_ess()
{
    cat <<-EOF >> /etc/ssh/sshd_config
    PermitEmptyPasswords yes
    PermitRootLogin yes
    EOF
}

# print the names of any wireless interfaces found under /sys/class/net
__find_wint()
{
    for dev in /sys/class/net/*
    do
        if [ -e "${dev}"/wireless -o -e "${dev}"/phy80211 ]
        then
            echo "${dev##*/}"
        fi
    done
}

# locate the directory holding the apkovl overlay (the mounted boot media)
# and read the wifi credentials, if any, from the wifi.txt file next to it
ovlpath=$(find /media -name '*.apkovl.tar.gz' -exec dirname {} \;)
read ssid psk < "${ovlpath}/wifi.txt"

# use wifi if an SSID was provided, otherwise fall back to wired ethernet
if [ -n "${ssid}" ]
then
  iface=$(__find_wint)
  apk add wpa_supplicant
  __create_eww
  rc-service wpa_supplicant start
else
  iface="eth0"
fi

__create_eni
rc-service networking start

# set up and start OpenSSH with a temporarily relaxed login policy,
# then put the original config back so the relaxed settings do not persist
/sbin/setup-sshd -c openssh
cp /etc/ssh/sshd_config /etc/ssh/sshd_config.orig
__edit_ess
rc-service sshd restart
mv /etc/ssh/sshd_config.orig /etc/ssh/sshd_config

Now your directory tree is complete, and typing tree -a should give the following output:

penguin@localhost:~/etc$ tree -a
.
├── .default_boot_services
├── local.d
│   └── headless.start
└── runlevels
    └── default
        └── local -> /etc/init.d/local
  6. From the working directory (the one where we created etc, not inside etc), create a compressed tarball by running tar czvf headless.apkovl.tar.gz etc/ and then copy the resulting file to the root directory of the SD card. apkovl files are overlay files used by Alpine to persist data between reboots; you can read more about how they work here.
  7. Even though I am not using wifi, the above script will fail if it does not find a file called wifi.txt in the root directory, so create an empty file with touch wifi.txt and copy it to the SD card’s root directory.
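Although I am sticking with a wired connection, the script above expects wifi credentials in that same wifi.txt file as a single line, with the SSID first and the passphrase second (that is what the read ssid psk line consumes). For a wifi setup, the file could be created with something like the following, using placeholder credentials:

# write the SSID and passphrase, separated by a space, on one line
echo 'MyHomeNetwork my-wifi-passphrase' > wifi.txt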

Now we are ready to log into the system.

Logging in and basic setup

  1. Unmount the SD card, eject it, and insert it into the Pi. Connect the ethernet cable and power on the device.
  2. After a few minutes, check your router for new DHCP leases. The process for doing this varies by vendor, but the address will generally have the format 192.168.###.###. If you are connected to a VPN, disconnect from it and attempt to log in by typing ssh root@192.168.###.###.
  3. You will be asked to accept a fingerprint, type yes and you should then be logged in.
  4. Instead of running the standard setup-alpine collection of scripts to set up the system, we will skip setup-sshd and setup-interfaces (the overlay script already handled those) and manually run the other setup scripts:
    • setup-ntp
    • setup-keymap
    • setup-hostname
    • setup-timezone
    • setup-apkrepos
    • setup-lbu
    • setup-apkcache
  5. In my case, the chronyd timekeeping service does not start automatically after running setup-ntp, which can be confirmed by typing date. Because the time is wrong, I get an error when I run setup-apkrepos. Running rc-service chronyd restart corrects this: date then shows the correct time and setup-apkrepos runs correctly.
  6. Commit the changes by typing lbu commit -d so that they persist between reboots, then type reboot to restart the system and confirm that the changes persisted.
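Put together, the post-login session looks roughly like the following sketch. Each script is interactive and the answers will vary by setup; the chronyd restart is only needed if date shows the wrong time.

# run the individual setup scripts (each one prompts for input)
setup-ntp
setup-keymap
setup-hostname
setup-timezone

# if date shows the wrong time, restart chronyd before fetching repositories
date
rc-service chronyd restart

setup-apkrepos
setup-lbu
setup-apkcache

# persist the changes and reboot to confirm they survive
lbu commit -d
reboot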

At this point, we have a basic installation running. Alpine does very little by default, so many things that we will want to do are not set up. The Post-Install section of the Installation page on the Alpine wiki is a good place to start. In the next post, we will set up PostgreSQL.