TUI Acceptance Testing

Update: I had created this blog post as a draft and forgotten about it. Since this went into a talk at LCA2020 I thought I publish it for completeness.

Since Purebred is closing in on it’s seconds birthday on 18th of July, I wanted to highlight how useful our suite of acceptance tests has become and what work went into it.

The current state of affairs

We use tmux extensively to simulate user input and basically black box test the entire application. Currently all 25 tests run on travis in 1 minute and 25 seconds. Most of it comes down to IO performance. On a modern i5 laptop it’s down to 30seconds.

Screenshot_2020-01-17 Job #1366 1 - purebred-mua purebred - Travis CI

Each test runs performs a setup, starts the application and runs through a series of steps performing user input, waiting for the terminal to repaint and assert a given text to be present. Since everything is text in a terminal – even colours – this makes it really is to design a test suite if you can bridge the gap using tmux.

I think I should explain more precisely what I mean by “waiting for the terminal to repaint”. At this point in time, we poll tmux for our assertion string to be present. That happens in quick succession by checking the hardcopy of the terminal window (basically a screen shot) with an exponential back off.

Problems we encountered

tmux different behaviour between releases

The travis containers run a much older version of GNU/Linux and therefore tmux. We’ve subcommands accepting different arguments or changed escape sequences in the terminal screen shot.

Tests fail randomly because of too generic assertion strings

That took a bit of figuring out, but in hind sight it is really obvious. Since we determine the screen to be repainted once our assertion string shows up, we sometimes used a string which showed up in subsequent screens. Next steps were executed and the test failed at subsequent steps. This is confusing, since you wonder how the test steps have even gotten to this screen.

We solved this not really technically, but simply by fixing the failing problems and documenting this potential pitfall.

Races between new-session and tmux server

Initially each test set up a new session and cleaned the session during a tear down. While this intention was good, it lead to randomly failing tests. The single session being torn down meant that also the server was killed. However, if the next session is created immediately after the old session is being removed we find our self in a race between the tmux server being killed and newly started.

We solved this problem with a keep-alive session which runs as long the entire suite runs and numbering the test sessions.

Asserting against terminal colours

Some tests assert specific how widgets are specifically rendered including their ANSI colour codes. Since we do write the tests on our own computers which support more than 16 colours, the terminals the CI runs is typically less sophisticated. This can lead to randomly failing tests because the colours in CI are different depending on the type of the terminal.

We solved this problem by simply setting a “dumb” terminal only supporting 16 colours.

Line wrapping in a terminal

A terminal comes with a hard character line width. By default it’s 80 characters (and 24 in height). Part of those 80 characters will be eaten up by your PS1 (the command prompt) setting of your shell, the rest can be consumed by command input.

We did run into randomly failing tests if lines wrapped at random points in the input and therefore introduced newlines in the output.

We solved this by invoking tmux with an additional “-J” parameter to join wrapped lines.

Optimisations

Initially, each step was waiting up to 6 seconds for a redraw. With an increasing amount of tests, the wait time for a pass or fail increased as well. Since we faced some flaky tests, we felt fixing those first, before making the tests run faster. The downside of optimising first and fixing flaky tests afterwards is that random test failures become more pronounced eroding the confidence in any automated test suite.

After were were sure we had caught all problems, we introduced an exponential back off patch which would wait cumulatively up to a second for the UI to be repainted. That’s a long time for the UI to change.

Docker volume mount fails

I recently stumbled over this odd error message in one of our gitlab runners:

ERROR: for nginx_proxy Cannot start service load_balancer: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"/builds/group/project/nginx/nginx.conf\\\" to rootfs \\\"/data/docker/overlay2/88fb8a0ee201dd14cfc9aa9befe4d7a5eb28e5ec816a2d76726040316853ed11/merged\\\" at \\\"/data/docker/overlay2/88fb8a0ee201dd14cfc9aa9befe4d7a5eb28e5ec816a2d76726040316853ed11/merged/etc/nginx/nginx.conf\\\" caused \\\"not a directory\\\"\"": unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type

It is the result of using the docker-compose up myservice command which is defined to use just an image and mounts files like so:

- ./nginx/nginx.conf:/etc/nginx/nginx.conf

I’ve spent a bit of time figuring out what the underlying problem is. In hindsight, the error message already gives it away, but I was unable to reproduce the issue on my host machine. That is because the problem is actually more related to docker than your host.

When I found out, that the runner in gitlab is actually a docker container, it dawned upon me, that the operation we do here is a container-in-a-container operation.  The container typically shares the same docker instance with the host system. The bind mount actually happens on the host machine. It tries to mount the path from a directory/file which doesn’t exist on the host machine.

To verify if we can reproduce the same error on the host, I tried to bind mount a volume with a path which doesn’t exist and voila:

$ sudo docker run --rm -it --volume /build/nginx/nginx.conf:/etc/nginx/nginx.conf nginx --help
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"/build/nginx/nginx.conf\\\" to rootfs \\\"/var/lib/docker/devicemapper/mnt/2fab14f3dc592d19b1408618a5ba26e88e334d88fe6b7524dc6c30bb0d26bbfc/rootfs\\\" at \\\"/var/lib/docker/devicemapper/mnt/2fab14f3dc592d19b1408618a5ba26e88e334d88fe6b7524dc6c30bb0d26bbfc/rootfs/etc/nginx/nginx.conf\\\" caused \\\"not a directory\\\"\"": unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type.

Be weary of running docker in docker when you need to bind mount volumes. Prefer a bare metal or VM as a runner.

Profiling Haskell: Don’t chase the red herring

I’m currently working on a small Haskell tool which helps me minimize the waiting time for catching a train into the city (or out). One feature I’ve implemented recently is an automated import of aprox. 25MB compressed CSV data into an SQLite3 database, which was very slow in the beginning. Not focusing on the first results of the profiling information helped to optimize the implementation for a swift import.

Background

The data comes as a 25MB zip archive of text files in a CSV format. All imported, the SQLite database grows to about 800 MiB. My work-in-progress solution was a cruddy shell + SQL script which imports the CSV files into an SQLite database. With this solution, the import takes about 30 seconds, excluding the time you need to manually download the zip file. But this is not very portable, as I wanted to have a more user friendly solution.

The initial Haskell implementation using mostly the esqueleto and persistent DSL functions showed an abysmal performance. I had to stop the process after half an hour.

Finding the culprit

A first profiling pass showed this result summary:

COST CENTRE          MODULE                         %time %alloc                                                                                                                               
                                                                                                                                                                                               
stepError            Database.Sqlite                 77.2    0.0                                                                                                                               
concat.ts'           Data.Text                        1.8   14.5                                                                                                                               
compareText.go       Data.Text                        1.4    0.0                                                                                                                               
concat.go.step       Data.Text                        1.0    8.2                                                                                                                               
concat               Data.Text                        0.9    1.4                                                                                                                               
concat.len           Data.Text                        0.8   13.9                                                                                                                               
sumP.go              Data.Text                        0.8    2.1                                                                                                                               
concat.go            Data.Text                        0.7    2.6                                                                                                                               
singleton_           Data.Text.Show                   0.6    4.0                                                                                                                               
run                  Data.Text.Array                  0.5    3.1                                                                                                                               
escape               Database.Persist.Sqlite          0.5    7.8                                                                                                                               
>>=.\                Data.Attoparsec.Internal.Types   0.5    1.4                                                                                                                               
singleton_.x         Data.Text.Show                   0.4    2.9                                                                                                                               
parseField           CSV.StopTime                     0.4    1.6                                                                                                                               
toNamedRecord        Data.Csv.Types                   0.3    1.2                                                                                                                               
fmap.\.ks'           Data.Csv.Conversion              0.3    2.9                                                                                                                               
insertSql'.ins       Database.Persist.Sqlite          0.2    1.4                                                                                                                               
compareText.go.(...) Data.Text                        0.1    4.3                                                                                                                               
compareText.go.(...) Data.Text                        0.1    4.3

Naturally I checked the implementation of the first function, since that seemed to have the largest impact. It is a simple foreign function call to C. Fraser Tweedale made me aware, that there is not more speed to gain here, since it’s already calling a C function. With that in mind I had to focus on the next entries. It turned out that’s where I gained most of the speed to something more competitive against the crude SQL script and having it more user friendly.

It turned out that Data.Persistent uses primarily Data.Text concatenation to create the SQL statements. That being done for every insert statement is very costly, since it prepares, binds values and executes the statement for each insert (for reference see this Stack Overflow answer).

The solution

My current solution is to prepare the statement once and only bind the values for each insert.

Having done another benchmark, the import time now comes down to approximately a minute on my Thinkpad X1 Carbon.

(Locally) Testing ansible deployments

I’ve always felt my playbooks undertested. I know about a possible solution of spinning up new OpenStack instances with the ansible nova module, but felt it to be too complex as a good idea to implement. Now I’ve found a quicker way to test your playbooks by using Docker.

In principal, all my test does is:

  1. create a docker container
  2. create a copy of the current ansible playbook in a temporary directory and mount it as a volume
  3. inside the docker container, run the playbook

This is obviously not perfect, since:

  • running a playbook locally vs connecting via ssh can be a different beast to test
  • can become resource intensive if you want to test different scenarios represented as docker images.

There is possibly more, but for myself in small it is a workable solution so far.

Find the code on github if you’d like to have a look. Improvements welcome!

 

(lxml) XPath matching against nodes with unprintable characters

Sometimes you want to clean up HTML by removing tags with unprintable characters in them (whitespace, non breaking space, etc). Sometimes encoding this back and forth results in weird characters when the HTML is rendered. Anyways, here is the snippet you might find useful:


def clean_empty_tags(node):
    """
    Finds all tags with a whitespace in it. They come out broke and
    we won't need them anyways.
    """
    for empty in node.xpath("//p[.='\xa0']"):
        empty.getparent().remove(empty)

Common docker pitfalls

I’ve ran into a few problems with docker I’d like to document myself and how to solve them.

Overwriting an entrypoint

If you’ve configured a script as an entrypoint which fails, you can run the docker image with a shell in order to fiddle with the script (instead of continously rebuilding the image):

#--entrypoint (provides a new entry point which is the nominated shell)
docker run -i --entrypoint='/bin/bash'  -t f5d4a4d6a8eb

Possible errors you face otherwise are these:

/bin/bash: /bin/bash: cannot execute binary file

Weird errors when building the image

I’ve ran into this a few times. Errors like:

Error in PREIN scriptlet in rpm package libvirt-daemon-0.9.11.4-3.fc17.x86_64
or
useradd: failure while writing changes to /etc/passwd

If you’ve set SELinux to enforcing, you may want to temporarily disable SELinux for just building the image. Don’t disable SELinux permanently.

Old (base) image

Check if your base image has changed (e.g. docker images) and pull it again (docker pull <image>)

hamburg001