TUI Acceptance Testing

Update: I had created this blog post as a draft and forgotten about it. Since this went into a talk at LCA2020 I thought I publish it for completeness.

Since Purebred is closing in on it’s seconds birthday on 18th of July, I wanted to highlight how useful our suite of acceptance tests has become and what work went into it.

The current state of affairs

We use tmux extensively to simulate user input and basically black box test the entire application. Currently all 25 tests run on travis in 1 minute and 25 seconds. Most of it comes down to IO performance. On a modern i5 laptop it’s down to 30seconds.

Screenshot_2020-01-17 Job #1366 1 - purebred-mua purebred - Travis CI

Each test runs performs a setup, starts the application and runs through a series of steps performing user input, waiting for the terminal to repaint and assert a given text to be present. Since everything is text in a terminal – even colours – this makes it really is to design a test suite if you can bridge the gap using tmux.

I think I should explain more precisely what I mean by “waiting for the terminal to repaint”. At this point in time, we poll tmux for our assertion string to be present. That happens in quick succession by checking the hardcopy of the terminal window (basically a screen shot) with an exponential back off.

Problems we encountered

tmux different behaviour between releases

The travis containers run a much older version of GNU/Linux and therefore tmux. We’ve subcommands accepting different arguments or changed escape sequences in the terminal screen shot.

Tests fail randomly because of too generic assertion strings

That took a bit of figuring out, but in hind sight it is really obvious. Since we determine the screen to be repainted once our assertion string shows up, we sometimes used a string which showed up in subsequent screens. Next steps were executed and the test failed at subsequent steps. This is confusing, since you wonder how the test steps have even gotten to this screen.

We solved this not really technically, but simply by fixing the failing problems and documenting this potential pitfall.

Races between new-session and tmux server

Initially each test set up a new session and cleaned the session during a tear down. While this intention was good, it lead to randomly failing tests. The single session being torn down meant that also the server was killed. However, if the next session is created immediately after the old session is being removed we find our self in a race between the tmux server being killed and newly started.

We solved this problem with a keep-alive session which runs as long the entire suite runs and numbering the test sessions.

Asserting against terminal colours

Some tests assert specific how widgets are specifically rendered including their ANSI colour codes. Since we do write the tests on our own computers which support more than 16 colours, the terminals the CI runs is typically less sophisticated. This can lead to randomly failing tests because the colours in CI are different depending on the type of the terminal.

We solved this problem by simply setting a “dumb” terminal only supporting 16 colours.

Line wrapping in a terminal

A terminal comes with a hard character line width. By default it’s 80 characters (and 24 in height). Part of those 80 characters will be eaten up by your PS1 (the command prompt) setting of your shell, the rest can be consumed by command input.

We did run into randomly failing tests if lines wrapped at random points in the input and therefore introduced newlines in the output.

We solved this by invoking tmux with an additional “-J” parameter to join wrapped lines.

Optimisations

Initially, each step was waiting up to 6 seconds for a redraw. With an increasing amount of tests, the wait time for a pass or fail increased as well. Since we faced some flaky tests, we felt fixing those first, before making the tests run faster. The downside of optimising first and fixing flaky tests afterwards is that random test failures become more pronounced eroding the confidence in any automated test suite.

After were were sure we had caught all problems, we introduced an exponential back off patch which would wait cumulatively up to a second for the UI to be repainted. That’s a long time for the UI to change.

User permissions checking

Background

I’ve currently been working on restricting user access to executables on a Linux box. I removed all executable rights for others and added them via access control lists for certain groups. So for example for cat it looked like this:
# getfacl /usr/bin/cat
getfacl: Removing leading '/' from absolute path names
# file: usr/bin/cat
# owner: root
# group: root
user::rwx
group::r-x
other::r--

For an executable I want the user access to the permissions looked like this:


# getfacl /usr/bin/ping
getfacl: Removing leading '/' from absolute path names
# file: usr/bin/ping
# owner: root
# group: root
user::rwx
group::r-x
group:staff:--x
mask::r-x
other::r--

The user is in the staff group and can execute ‘ping’, while everyone else get’s a permission denied.

The Test

As an automated test, I thought I go over all commands and produce a whitelist of executables a given user has access to.

The script looks a bit like this for a single executable:

# cat /tmp/foo.py
import os

# the users group id
os.setgid(2000)
# the users ID
os.setuid(2003)

print(os.access(‘/usr/bin/cat’, os.X_OK))
print(os.access(‘/usr/bin/ping’, os.X_OK))

When I run it I expected the test for the first executable to be false and the second to be true:


# python /tmp/foo.py
True
True

Aeh what? Doesn’t the script run as my user who’s in the staff group?

Turns out there is more to the process than just the group and user id. There are also supplementary groups and capabilities. When changing the script to call print(os.getgroups()) it printed the supplementary groups of the user I was running the script as, which was root in this circumstance. Changing the script to also set the supplementary groups to the one of the user:


import os

# the users group id
os.setgid(2000)
os.setgroups([2000, 2003])
# the users ID
os.setuid(2003)

print(os.access(‘/usr/bin/cat’, os.X_OK))
print(os.access(‘/usr/bin/ping’, os.X_OK))

and running it returns the right results:

# python /tmp/foo.py
False
True

Caveats

Restricting permissions with ACLs and testing the way I demonstrated above can lead to false positives for scripts. You can not remove executable permissions from the script interpreter (e.g. /usr/bin/python) while keeping it with an ACL on the actual script. The test above will tell you it’s all fine and dandy, while in reality the user will run into a permission denied.

GHC: can’t find a package database

In case you’re using the nix package manager your nix build fails with:


these derivations will be built:
/nix/store/7xk0m6r07x85rwlh01b3wvq8bbzwbw1n-purebred-0.1.0.0.drv
/nix/store/dmj2ax3qsa55jjl6by9fb9sk929k98nl-ghc-8.6.3-with-packages.drv
/nix/store/j9fl8cmq9c6kjnz9dj79rmbs1kzafyys-purebred-with-packages-8.6.3.drv
building '/nix/store/7xk0m6r07x85rwlh01b3wvq8bbzwbw1n-purebred-0.1.0.0.drv'...
setupCompilerEnvironmentPhase
Build with /nix/store/cclv7n6jr311i5ywwkms1m3iz4lsg37j-ghc-8.6.3.
unpacking sources
unpacking source archive /nix/store/j23vlzlg2rmqy0a706h235j4v9zh4m9s-purebred
source root is purebred
patching sources
compileBuildDriverPhase
setupCompileFlags: -package-db=/build/setup-package.conf.d -j4 -threaded
Loaded package environment from /build/purebred/.ghc.environment.x86_64-linux-8.6.3
ghc: can't find a package database at /home/rjoost/.cabal/store/ghc-8.6.3/package.db
builder for '/nix/store/7xk0m6r07x85rwlh01b3wvq8bbzwbw1n-purebred-0.1.0.0.drv' failed with exit code 1
cannot build derivation '/nix/store/dmj2ax3qsa55jjl6by9fb9sk929k98nl-ghc-8.6.3-with-packages.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/j9fl8cmq9c6kjnz9dj79rmbs1kzafyys-purebred-with-packages-8.6.3.drv': 1 dependencies couldn't be built
error: build of '/nix/store/j9fl8cmq9c6kjnz9dj79rmbs1kzafyys-purebred-with-packages-8.6.3.drv' failed

then the solution to it is actually easier then you think. It happens when you run


cabal new-repl

inside a nix shell, because cabal creates a hidden environment file. So look for a


.ghc.environment.--
# for example on Linux with GHC 8.6.3
.ghc.environment.x86_64-linux-8.6.3

Delete it and you should be good to go.

Docker volume mount fails

I recently stumbled over this odd error message in one of our gitlab runners:

ERROR: for nginx_proxy Cannot start service load_balancer: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"/builds/group/project/nginx/nginx.conf\\\" to rootfs \\\"/data/docker/overlay2/88fb8a0ee201dd14cfc9aa9befe4d7a5eb28e5ec816a2d76726040316853ed11/merged\\\" at \\\"/data/docker/overlay2/88fb8a0ee201dd14cfc9aa9befe4d7a5eb28e5ec816a2d76726040316853ed11/merged/etc/nginx/nginx.conf\\\" caused \\\"not a directory\\\"\"": unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type

It is the result of using the docker-compose up myservice command which is defined to use just an image and mounts files like so:

- ./nginx/nginx.conf:/etc/nginx/nginx.conf

I’ve spent a bit of time figuring out what the underlying problem is. In hindsight, the error message already gives it away, but I was unable to reproduce the issue on my host machine. That is because the problem is actually more related to docker than your host.

When I found out, that the runner in gitlab is actually a docker container, it dawned upon me, that the operation we do here is a container-in-a-container operation.  The container typically shares the same docker instance with the host system. The bind mount actually happens on the host machine. It tries to mount the path from a directory/file which doesn’t exist on the host machine.

To verify if we can reproduce the same error on the host, I tried to bind mount a volume with a path which doesn’t exist and voila:

$ sudo docker run --rm -it --volume /build/nginx/nginx.conf:/etc/nginx/nginx.conf nginx --help
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"/build/nginx/nginx.conf\\\" to rootfs \\\"/var/lib/docker/devicemapper/mnt/2fab14f3dc592d19b1408618a5ba26e88e334d88fe6b7524dc6c30bb0d26bbfc/rootfs\\\" at \\\"/var/lib/docker/devicemapper/mnt/2fab14f3dc592d19b1408618a5ba26e88e334d88fe6b7524dc6c30bb0d26bbfc/rootfs/etc/nginx/nginx.conf\\\" caused \\\"not a directory\\\"\"": unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type.

Be weary of running docker in docker when you need to bind mount volumes. Prefer a bare metal or VM as a runner.

“Start request repeated too quickly”

If one of your units is not running any more and you find this in your journal: 


● getmail.service - getmail
Loaded: loaded (/home/rjoost/.config/systemd/user/getmail.service; enabled; vendor preset: enabled)
Active: failed (Result: start-limit-hit) since Thu 2018-11-29 18:42:17 AEST; 3s ago
Process: 20142 ExecStart=/usr/bin/getmail --idle=INBOX (code=exited, status=0/SUCCESS)
Main PID: 20142 (code=exited, status=0/SUCCESS)

Nov 29 18:42:17 bali systemd[3109]: getmail.service: Service hold-off time over, scheduling restart.
Nov 29 18:42:17 bali systemd[3109]: getmail.service: Scheduled restart job, restart counter is at 5.
Nov 29 18:42:17 bali systemd[3109]: Stopped getmail.
Nov 29 18:42:17 bali systemd[3109]: getmail.service: Start request repeated too quickly.
Nov 29 18:42:17 bali systemd[3109]: getmail.service: Failed with result 'start-limit-hit'.
Nov 29 18:42:17 bali systemd[3109]: Failed to start getmail.

it might be because your command really exits immediately and you may want to run the command manually to verify if that’s the case. Also check if you indeed have the unit configured with

Restart: always

.

I you’re sure it really does not restart too quickly, you can reset the counter with:

$ systemctl reset-failed unit

Further information can be found in the man pages of: systemd.unit(5) and systemd.service(5)

Best practices for diffing two online MySQL databases

We’ve had to move our internal Red Hat Beaker instance to a new MySQL database version. We made the jump with a 5min downtime of Beaker. One of the items we wanted to make sure is to not to loose any data.

Setup and Motivation

A database dump is about 135 GB compressed with gzip. The main database was being served by a MySQL 5.1 master/slave setup.

We discussed two possible strategies for switching to MariaDB. Either a dump and load which meant a downtime of 16h, or the use of an additional MariaDB slave which will be promoted to the new master. We chose the latter: a new MariaDB 10.2 slave promoted to be the new master.

We wanted to make sure that both slaves, the MySQL 5.1 and new MariaDB 10.2, were in sync and with promoting the MariaDB 10.2 slave to master we would not loose any data. To verify data consistency across the slaves, we diffed both databases.

Diffing

I went through a few iterations of dumping and diffing. Here are the items, which worked best.

Ignore mysql-utils if you only have read access

MySQL comes with a bunch of utilities and one of them is a tool to compare two databases, called mysqldbcompare and mysqldiff. I’ve tried mysqldiff first, but, after studying the source code, decided against using it. Reason being is that you will have to grant it additional write privileges to the databases which are arguably small, but still too much I was comfortable with.

Use the “at” utility to schedule mysqldump

The best way I found to kick off performing the database dumps at the same time is to use at. Scheduling a mysqldump manually for the two databases introduces way too much noisy differences. I guess, it goes without mention, that the database hosts clocks are synchronized (e.g. by the use of chronyd).

Dump the entire database at once

The mysqldump tool can dump each table separately, but that is not what you want. Also the default options which are geared towards a dump and load is not what you want.

Instead I dumped MySQL with:

mysqldump --single-transaction --order-by-primary --skip-extended-insert beaker | gzip > mysql.sql.gz;

while for MariaDB I used:

mysqldump --order-by-primary --skip-extended-insert beaker | gzip > mariadb.sql.gz;

The options used are aiding the later diff:

  • –order-by-primary orders every dumped table row consistently by their primary keys
  • –single-transaction keeps a transaction open until the dump has finished so you get a comparable database snapshot across the two databases for the same starting point
  • –skip-extended-inserts is used to have an INSERT statement for each row, otherwise they’re collapsed to multi-row insert statements which are harder to compare

Compression (GZip) and shell pipes are your friend

With big databases, like the Beaker production database, you want to avoid writing anything uncompressed. Linux ships additional gzip wrappers for cat (zcat), less (zless) and so on, which will help with creating shell pipes in order to process the data.

Cut up the dump

Once you have both database dumps, cut them up into their separate tables. Purpose of this is not to sift through the dumps with your own eye, but rather to cater for diff. The diff tool loads the entire file into memory and you will face, with large database dumps, it is running out of memory quickly:

diff mysql-beaker.sql.gz mariadb-replica-beaker.sql.gz
diff: memory exhausted

While I did found a tool to diff both large files, having a unified diff output is easier to compare data with.

Example: Using gzip and a pipe from my point above:

diff -u <(zcat mysql/table1.sql.gz) <(zcat mariadb/table1.sql.gz) > diffed/table1.diff

Now you can use your SHELL foo to loop over all cut up tables and write the diff into a separate folder which then lets you easily compare.

Debugging with RPM packages

With one of our internal web applications based on Ruby on Rails, we’ve discovered a file descriptor leak in one of the delayed job worker processes. The worker leaked descriptors whenever it invoked a message being send to the message bus using qpid-messaging.

Since we’re using gems compiled as C++ and C extensions, in order to find the root cause, I used the packages provided through the package manager and gdb.

Big thanks to Dan Callaghan who walked me through most of the process and then found the leak in the C++ sources.

TL;DR;

  • identify the leaking descriptors and reproduce it with lsof
  • attach strace to the process and identify file descriptors which are not being closed
  • install debuginfo packages for all dependencies
  • use gdb to figure out what is going on

Reproducer

I’ve used lsof and a friend wrote a small script to quickly monitor the worker process. Looking at the opened files of the process revealed a long list which looked like half closed sockets. It turned out later, that it wasn’t the same problem since the sockets were created, but never bound/connected.

I was unable to reproduce the problem on my local development environment, but found away to do it on our staging environment which resembles production much closer. So whenever I invoked an action in the UI which resulted in a message being sent, I was able to see another file descriptor leak with lsof.

Strace the process

With the reproducer at hand, I started to strace the process:

# Note we're not filtering system calls with -e here.
# Weirdly CLOSE was not reported when just filtering network calls
strace -s 1000 -p  -o strace_output_log.strace

Dan helped me looking through the produced log output, which revealed that the system under investigation created a socket and called getpeername right after it, without binding it resulting in a leaked file descriptor.

10971 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 35
10971 getpeername(35, 0x7fffae712a90, [112]) = -1 ENOTCONN (Transport endpoint is not connected)

Install debuginfo packages and use gdb

In order to debug the system, we need debuginfo packages installed, otherwise you wont be able to step through the sources using gdb. When you attach gdb to the process it will tell you what packages it is missing, for example:

Missing separate debuginfos, use: debuginfo-install qpid-proton-c-0.10-3.fc25.x86_64

You then go install those (be mindful that you the repositories configured e.g. section name fedora-debuginfo):

debuginfo-install qpid-proton-c-0.10-3.fc25.x86_64

and basically start debugging.

Our first suspicion was the qpid messaging library and we check if it’s invocation of getpeername was leaking the file descriptors. I’ve added a break point at the point of the source code we thought was suspicious and in a separate terminal used lsof to see which file descriptor number is leaked. For example:

# I've used a watch, which executes the lsof every 2 seconds by
# default. The grep filters some of the files I'm not interested in
$ watch "lsof -p  | grep -v REG"

The lsof output will show you the leaked file descriptor number in column 4 by default. With that you can check in gdb if the file descriptor being handled in the source code is the one which leaked.

Since that achieved no results, we used gdb to break on invocations of the getpeername identifier and used backtrace to pin point in the sources where the leak occurred.