“Start request repeated too quickly”

If one of your units is not running any more and you find this in your journal: 


● getmail.service - getmail
Loaded: loaded (/home/rjoost/.config/systemd/user/getmail.service; enabled; vendor preset: enabled)
Active: failed (Result: start-limit-hit) since Thu 2018-11-29 18:42:17 AEST; 3s ago
Process: 20142 ExecStart=/usr/bin/getmail --idle=INBOX (code=exited, status=0/SUCCESS)
Main PID: 20142 (code=exited, status=0/SUCCESS)

Nov 29 18:42:17 bali systemd[3109]: getmail.service: Service hold-off time over, scheduling restart.
Nov 29 18:42:17 bali systemd[3109]: getmail.service: Scheduled restart job, restart counter is at 5.
Nov 29 18:42:17 bali systemd[3109]: Stopped getmail.
Nov 29 18:42:17 bali systemd[3109]: getmail.service: Start request repeated too quickly.
Nov 29 18:42:17 bali systemd[3109]: getmail.service: Failed with result 'start-limit-hit'.
Nov 29 18:42:17 bali systemd[3109]: Failed to start getmail.

it might be because your command really exits immediately and you may want to run the command manually to verify if that’s the case. Also check if you indeed have the unit configured with

Restart: always

.

I you’re sure it really does not restart too quickly, you can reset the counter with:

$ systemctl reset-failed unit

Further information can be found in the man pages of: systemd.unit(5) and systemd.service(5)

Advertisements

Best practices for diffing two online MySQL databases

We’ve had to move our internal Red Hat Beaker instance to a new MySQL database version. We made the jump with a 5min downtime of Beaker. One of the items we wanted to make sure is to not to loose any data.

Setup and Motivation

A database dump is about 135 GB compressed with gzip. The main database was being served by a MySQL 5.1 master/slave setup.

We discussed two possible strategies for switching to MariaDB. Either a dump and load which meant a downtime of 16h, or the use of an additional MariaDB slave which will be promoted to the new master. We chose the latter: a new MariaDB 10.2 slave promoted to be the new master.

We wanted to make sure that both slaves, the MySQL 5.1 and new MariaDB 10.2, were in sync and with promoting the MariaDB 10.2 slave to master we would not loose any data. To verify data consistency across the slaves, we diffed both databases.

Diffing

I went through a few iterations of dumping and diffing. Here are the items, which worked best.

Ignore mysql-utils if you only have read access

MySQL comes with a bunch of utilities and one of them is a tool to compare two databases, called mysqldbcompare and mysqldiff. I’ve tried mysqldiff first, but, after studying the source code, decided against using it. Reason being is that you will have to grant it additional write privileges to the databases which are arguably small, but still too much I was comfortable with.

Use the “at” utility to schedule mysqldump

The best way I found to kick off performing the database dumps at the same time is to use at. Scheduling a mysqldump manually for the two databases introduces way too much noisy differences. I guess, it goes without mention, that the database hosts clocks are synchronized (e.g. by the use of chronyd).

Dump the entire database at once

The mysqldump tool can dump each table separately, but that is not what you want. Also the default options which are geared towards a dump and load is not what you want.

Instead I dumped MySQL with:

mysqldump --single-transaction --order-by-primary --skip-extended-insert beaker | gzip > mysql.sql.gz;

while for MariaDB I used:

mysqldump --order-by-primary --skip-extended-insert beaker | gzip > mariadb.sql.gz;

The options used are aiding the later diff:

  • –order-by-primary orders every dumped table row consistently by their primary keys
  • –single-transaction keeps a transaction open until the dump has finished so you get a comparable database snapshot across the two databases for the same starting point
  • –skip-extended-inserts is used to have an INSERT statement for each row, otherwise they’re collapsed to multi-row insert statements which are harder to compare

Compression (GZip) and shell pipes are your friend

With big databases, like the Beaker production database, you want to avoid writing anything uncompressed. Linux ships additional gzip wrappers for cat (zcat), less (zless) and so on, which will help with creating shell pipes in order to process the data.

Cut up the dump

Once you have both database dumps, cut them up into their separate tables. Purpose of this is not to sift through the dumps with your own eye, but rather to cater for diff. The diff tool loads the entire file into memory and you will face, with large database dumps, it is running out of memory quickly:

diff mysql-beaker.sql.gz mariadb-replica-beaker.sql.gz
diff: memory exhausted

While I did found a tool to diff both large files, having a unified diff output is easier to compare data with.

Example: Using gzip and a pipe from my point above:

diff -u <(zcat mysql/table1.sql.gz) <(zcat mariadb/table1.sql.gz) > diffed/table1.diff

Now you can use your SHELL foo to loop over all cut up tables and write the diff into a separate folder which then lets you easily compare.

Debugging with RPM packages

With one of our internal web applications based on Ruby on Rails, we’ve discovered a file descriptor leak in one of the delayed job worker processes. The worker leaked descriptors whenever it invoked a message being send to the message bus using qpid-messaging.

Since we’re using gems compiled as C++ and C extensions, in order to find the root cause, I used the packages provided through the package manager and gdb.

Big thanks to Dan Callaghan who walked me through most of the process and then found the leak in the C++ sources.

TL;DR;

  • identify the leaking descriptors and reproduce it with lsof
  • attach strace to the process and identify file descriptors which are not being closed
  • install debuginfo packages for all dependencies
  • use gdb to figure out what is going on

Reproducer

I’ve used lsof and a friend wrote a small script to quickly monitor the worker process. Looking at the opened files of the process revealed a long list which looked like half closed sockets. It turned out later, that it wasn’t the same problem since the sockets were created, but never bound/connected.

I was unable to reproduce the problem on my local development environment, but found away to do it on our staging environment which resembles production much closer. So whenever I invoked an action in the UI which resulted in a message being sent, I was able to see another file descriptor leak with lsof.

Strace the process

With the reproducer at hand, I started to strace the process:

# Note we're not filtering system calls with -e here.
# Weirdly CLOSE was not reported when just filtering network calls
strace -s 1000 -p  -o strace_output_log.strace

Dan helped me looking through the produced log output, which revealed that the system under investigation created a socket and called getpeername right after it, without binding it resulting in a leaked file descriptor.

10971 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 35
10971 getpeername(35, 0x7fffae712a90, [112]) = -1 ENOTCONN (Transport endpoint is not connected)

Install debuginfo packages and use gdb

In order to debug the system, we need debuginfo packages installed, otherwise you wont be able to step through the sources using gdb. When you attach gdb to the process it will tell you what packages it is missing, for example:

Missing separate debuginfos, use: debuginfo-install qpid-proton-c-0.10-3.fc25.x86_64

You then go install those (be mindful that you the repositories configured e.g. section name fedora-debuginfo):

debuginfo-install qpid-proton-c-0.10-3.fc25.x86_64

and basically start debugging.

Our first suspicion was the qpid messaging library and we check if it’s invocation of getpeername was leaking the file descriptors. I’ve added a break point at the point of the source code we thought was suspicious and in a separate terminal used lsof to see which file descriptor number is leaked. For example:

# I've used a watch, which executes the lsof every 2 seconds by
# default. The grep filters some of the files I'm not interested in
$ watch "lsof -p  | grep -v REG"

The lsof output will show you the leaked file descriptor number in column 4 by default. With that you can check in gdb if the file descriptor being handled in the source code is the one which leaked.

Since that achieved no results, we used gdb to break on invocations of the getpeername identifier and used backtrace to pin point in the sources where the leak occurred.

Profiling Haskell: Don’t chase the red herring

I’m currently working on a small Haskell tool which helps me minimize the waiting time for catching a train into the city (or out). One feature I’ve implemented recently is an automated import of aprox. 25MB compressed CSV data into an SQLite3 database, which was very slow in the beginning. Not focusing on the first results of the profiling information helped to optimize the implementation for a swift import.

Background

The data comes as a 25MB zip archive of text files in a CSV format. All imported, the SQLite database grows to about 800 MiB. My work-in-progress solution was a cruddy shell + SQL script which imports the CSV files into an SQLite database. With this solution, the import takes about 30 seconds, excluding the time you need to manually download the zip file. But this is not very portable, as I wanted to have a more user friendly solution.

The initial Haskell implementation using mostly the esqueleto and persistent DSL functions showed an abysmal performance. I had to stop the process after half an hour.

Finding the culprit

A first profiling pass showed this result summary:

COST CENTRE          MODULE                         %time %alloc                                                                                                                               
                                                                                                                                                                                               
stepError            Database.Sqlite                 77.2    0.0                                                                                                                               
concat.ts'           Data.Text                        1.8   14.5                                                                                                                               
compareText.go       Data.Text                        1.4    0.0                                                                                                                               
concat.go.step       Data.Text                        1.0    8.2                                                                                                                               
concat               Data.Text                        0.9    1.4                                                                                                                               
concat.len           Data.Text                        0.8   13.9                                                                                                                               
sumP.go              Data.Text                        0.8    2.1                                                                                                                               
concat.go            Data.Text                        0.7    2.6                                                                                                                               
singleton_           Data.Text.Show                   0.6    4.0                                                                                                                               
run                  Data.Text.Array                  0.5    3.1                                                                                                                               
escape               Database.Persist.Sqlite          0.5    7.8                                                                                                                               
>>=.\                Data.Attoparsec.Internal.Types   0.5    1.4                                                                                                                               
singleton_.x         Data.Text.Show                   0.4    2.9                                                                                                                               
parseField           CSV.StopTime                     0.4    1.6                                                                                                                               
toNamedRecord        Data.Csv.Types                   0.3    1.2                                                                                                                               
fmap.\.ks'           Data.Csv.Conversion              0.3    2.9                                                                                                                               
insertSql'.ins       Database.Persist.Sqlite          0.2    1.4                                                                                                                               
compareText.go.(...) Data.Text                        0.1    4.3                                                                                                                               
compareText.go.(...) Data.Text                        0.1    4.3

Naturally I checked the implementation of the first function, since that seemed to have the largest impact. It is a simple foreign function call to C. Fraser Tweedale made me aware, that there is not more speed to gain here, since it’s already calling a C function. With that in mind I had to focus on the next entries. It turned out that’s where I gained most of the speed to something more competitive against the crude SQL script and having it more user friendly.

It turned out that Data.Persistent uses primarily Data.Text concatenation to create the SQL statements. That being done for every insert statement is very costly, since it prepares, binds values and executes the statement for each insert (for reference see this Stack Overflow answer).

The solution

My current solution is to prepare the statement once and only bind the values for each insert.

Having done another benchmark, the import time now comes down to approximately a minute on my Thinkpad X1 Carbon.

Changing a website using the developer console

If you need to quickly change a website, you can use a combination of CSS/XPath selectors and a function to hide/remove DOM nodes. I had to find my way through a long list of similar items which was really hard to go through by simply looking at it.

For example, you can simply delete all links you’re not interested in by a simple combination of selector and function:

$x('//li/a[contains(., "not-interesting")]').map(function(n) { n.parentNode.removeChild(n) })

If you’ve made a mistake, reload the website.

git: Moving partial changes between commits

Now and then I face the fact that I’ve added changes to a commit I’d like to have moved into a different commit. Here is what you do:

What’s there

We have two commits. For illustration purposes I’ve trimmed the log output down:

$ git log --stat
commit 19c698a9ee91a5f03f1c3240fc957e6b328931f5

    WIP: adding tests

 parts/tests/functional/conftest.py       |  4 ++--
 parts/tests/functional/test_frobfrob.py  | 43 ++++++++++
 frobfrob.py                              | 14 +++++++++++++-

commit c7ef6c3014ca9d049dea46fbed44010acf53ae79

    prepare frob frob schemas

 parts/tests/functional/conftest.py           | 31 +++++++++++++
 frobfrob/models.py                           | 32 +++++++++++++

commit 5b30d351f51fda40d37d2f7dc25d2367bd37845a
[...]

Now I want to move the changes made to conftest.py from commit c7ef6c3014ca9d049dea46fbed44010acf53ae79 into commit 19c698a9ee91a5f03f1c3240fc957e6b328931f5 (or HEAD).

Pluck out the commit

In order to pluck out the changes to conftest.py, we’ll reset the file against the previous commit 5b30d351f51fda40d37d2f7dc25d2367bd37845a (you could also use HEAD~3).

$ git reset 5b30d351f51fda40d37d2f7dc25d2367bd37845a parts/tests/functional/conftest.py
Unstaged changes after reset:
M       parts/tests/functional/conftest.py

$ git status -s
MM parts/tests/functional/conftest.py

As you can see, we will have staged changes and unstaged changes. The staged changes remove the additions to the conftest.py file and the unstaged changes add our code to conftest.py

Remove and Add

We now create two commits:

  1. Use the staged changes for a new commit which we’ll squash with c7ef6c3014ca9d049dea46fbed44010acf53ae79.
  2. Stage the unstaged changes and create another commit which we’ll squash with 19c698a9ee91a5f03f1c3240fc957e6b328931f5 or HEAD.
# 1. commit Message is something like: squash: removes changes to conftest.py
$ git commit

# 2. commit
# stage changes
$ git add -p

# commit, message will be something like: squash: adds changes to conftest.py
$ git commit

# we end up with two additional commits
$ git log --oneline
492ff22 Adds changes to conftest
8485946 removes conftest files
19c698a WIP: adding tests
c7ef6c3 prepare frob frob schemas
Interactive rebase put’s it all together

Now use an interactive rebase to squash the changes with the right commits:

$ git rebase -i HEAD~5