Debugging with RPM packages

With one of our internal web applications based on Ruby on Rails, we’ve discovered a file descriptor leak in one of the delayed job worker processes. The worker leaked descriptors whenever it invoked a message being send to the message bus using qpid-messaging.

Since we’re using gems compiled as C++ and C extensions, in order to find the root cause, I used the packages provided through the package manager and gdb.

Big thanks to Dan Callaghan who walked me through most of the process and then found the leak in the C++ sources.

TL;DR;

  • identify the leaking descriptors and reproduce it with lsof
  • attach strace to the process and identify file descriptors which are not being closed
  • install debuginfo packages for all dependencies
  • use gdb to figure out what is going on

Reproducer

I’ve used lsof and a friend wrote a small script to quickly monitor the worker process. Looking at the opened files of the process revealed a long list which looked like half closed sockets. It turned out later, that it wasn’t the same problem since the sockets were created, but never bound/connected.

I was unable to reproduce the problem on my local development environment, but found away to do it on our staging environment which resembles production much closer. So whenever I invoked an action in the UI which resulted in a message being sent, I was able to see another file descriptor leak with lsof.

Strace the process

With the reproducer at hand, I started to strace the process:

# Note we're not filtering system calls with -e here.
# Weirdly CLOSE was not reported when just filtering network calls
strace -s 1000 -p  -o strace_output_log.strace

Dan helped me looking through the produced log output, which revealed that the system under investigation created a socket and called getpeername right after it, without binding it resulting in a leaked file descriptor.

10971 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 35
10971 getpeername(35, 0x7fffae712a90, [112]) = -1 ENOTCONN (Transport endpoint is not connected)

Install debuginfo packages and use gdb

In order to debug the system, we need debuginfo packages installed, otherwise you wont be able to step through the sources using gdb. When you attach gdb to the process it will tell you what packages it is missing, for example:

Missing separate debuginfos, use: debuginfo-install qpid-proton-c-0.10-3.fc25.x86_64

You then go install those (be mindful that you the repositories configured e.g. section name fedora-debuginfo):

debuginfo-install qpid-proton-c-0.10-3.fc25.x86_64

and basically start debugging.

Our first suspicion was the qpid messaging library and we check if it’s invocation of getpeername was leaking the file descriptors. I’ve added a break point at the point of the source code we thought was suspicious and in a separate terminal used lsof to see which file descriptor number is leaked. For example:

# I've used a watch, which executes the lsof every 2 seconds by
# default. The grep filters some of the files I'm not interested in
$ watch "lsof -p  | grep -v REG"

The lsof output will show you the leaked file descriptor number in column 4 by default. With that you can check in gdb if the file descriptor being handled in the source code is the one which leaked.

Since that achieved no results, we used gdb to break on invocations of the getpeername identifier and used backtrace to pin point in the sources where the leak occurred.

Advertisements

Profiling Haskell: Don’t chase the red herring

I’m currently working on a small Haskell tool which helps me minimize the waiting time for catching a train into the city (or out). One feature I’ve implemented recently is an automated import of aprox. 25MB compressed CSV data into an SQLite3 database, which was very slow in the beginning. Not focusing on the first results of the profiling information helped to optimize the implementation for a swift import.

Background

The data comes as a 25MB zip archive of text files in a CSV format. All imported, the SQLite database grows to about 800 MiB. My work-in-progress solution was a cruddy shell + SQL script which imports the CSV files into an SQLite database. With this solution, the import takes about 30 seconds, excluding the time you need to manually download the zip file. But this is not very portable, as I wanted to have a more user friendly solution.

The initial Haskell implementation using mostly the esqueleto and persistent DSL functions showed an abysmal performance. I had to stop the process after half an hour.

Finding the culprit

A first profiling pass showed this result summary:

COST CENTRE          MODULE                         %time %alloc                                                                                                                               
                                                                                                                                                                                               
stepError            Database.Sqlite                 77.2    0.0                                                                                                                               
concat.ts'           Data.Text                        1.8   14.5                                                                                                                               
compareText.go       Data.Text                        1.4    0.0                                                                                                                               
concat.go.step       Data.Text                        1.0    8.2                                                                                                                               
concat               Data.Text                        0.9    1.4                                                                                                                               
concat.len           Data.Text                        0.8   13.9                                                                                                                               
sumP.go              Data.Text                        0.8    2.1                                                                                                                               
concat.go            Data.Text                        0.7    2.6                                                                                                                               
singleton_           Data.Text.Show                   0.6    4.0                                                                                                                               
run                  Data.Text.Array                  0.5    3.1                                                                                                                               
escape               Database.Persist.Sqlite          0.5    7.8                                                                                                                               
>>=.\                Data.Attoparsec.Internal.Types   0.5    1.4                                                                                                                               
singleton_.x         Data.Text.Show                   0.4    2.9                                                                                                                               
parseField           CSV.StopTime                     0.4    1.6                                                                                                                               
toNamedRecord        Data.Csv.Types                   0.3    1.2                                                                                                                               
fmap.\.ks'           Data.Csv.Conversion              0.3    2.9                                                                                                                               
insertSql'.ins       Database.Persist.Sqlite          0.2    1.4                                                                                                                               
compareText.go.(...) Data.Text                        0.1    4.3                                                                                                                               
compareText.go.(...) Data.Text                        0.1    4.3

Naturally I checked the implementation of the first function, since that seemed to have the largest impact. It is a simple foreign function call to C. Fraser Tweedale made me aware, that there is not more speed to gain here, since it’s already calling a C function. With that in mind I had to focus on the next entries. It turned out that’s where I gained most of the speed to something more competitive against the crude SQL script and having it more user friendly.

It turned out that Data.Persistent uses primarily Data.Text concatenation to create the SQL statements. That being done for every insert statement is very costly, since it prepares, binds values and executes the statement for each insert (for reference see this Stack Overflow answer).

The solution

My current solution is to prepare the statement once and only bind the values for each insert.

Having done another benchmark, the import time now comes down to approximately a minute on my Thinkpad X1 Carbon.

Changing a website using the developer console

If you need to quickly change a website, you can use a combination of CSS/XPath selectors and a function to hide/remove DOM nodes. I had to find my way through a long list of similar items which was really hard to go through by simply looking at it.

For example, you can simply delete all links you’re not interested in by a simple combination of selector and function:

$x('//li/a[contains(., "not-interesting")]').map(function(n) { n.parentNode.removeChild(n) })

If you’ve made a mistake, reload the website.

git: Moving partial changes between commits

Now and then I face the fact that I’ve added changes to a commit I’d like to have moved into a different commit. Here is what you do:

What’s there

We have two commits. For illustration purposes I’ve trimmed the log output down:

$ git log --stat
commit 19c698a9ee91a5f03f1c3240fc957e6b328931f5

    WIP: adding tests

 parts/tests/functional/conftest.py       |  4 ++--
 parts/tests/functional/test_frobfrob.py  | 43 ++++++++++
 frobfrob.py                              | 14 +++++++++++++-

commit c7ef6c3014ca9d049dea46fbed44010acf53ae79

    prepare frob frob schemas

 parts/tests/functional/conftest.py           | 31 +++++++++++++
 frobfrob/models.py                           | 32 +++++++++++++

commit 5b30d351f51fda40d37d2f7dc25d2367bd37845a
[...]

Now I want to move the changes made to conftest.py from commit c7ef6c3014ca9d049dea46fbed44010acf53ae79 into commit 19c698a9ee91a5f03f1c3240fc957e6b328931f5 (or HEAD).

Pluck out the commit

In order to pluck out the changes to conftest.py, we’ll reset the file against the previous commit 5b30d351f51fda40d37d2f7dc25d2367bd37845a (you could also use HEAD~3).

$ git reset 5b30d351f51fda40d37d2f7dc25d2367bd37845a parts/tests/functional/conftest.py
Unstaged changes after reset:
M       parts/tests/functional/conftest.py

$ git status -s
MM parts/tests/functional/conftest.py

As you can see, we will have staged changes and unstaged changes. The staged changes remove the additions to the conftest.py file and the unstaged changes add our code to conftest.py

Remove and Add

We now create two commits:

  1. Use the staged changes for a new commit which we’ll squash with c7ef6c3014ca9d049dea46fbed44010acf53ae79.
  2. Stage the unstaged changes and create another commit which we’ll squash with 19c698a9ee91a5f03f1c3240fc957e6b328931f5 or HEAD.
# 1. commit Message is something like: squash: removes changes to conftest.py
$ git commit

# 2. commit
# stage changes
$ git add -p

# commit, message will be something like: squash: adds changes to conftest.py
$ git commit

# we end up with two additional commits
$ git log --oneline
492ff22 Adds changes to conftest
8485946 removes conftest files
19c698a WIP: adding tests
c7ef6c3 prepare frob frob schemas
Interactive rebase put’s it all together

Now use an interactive rebase to squash the changes with the right commits:

$ git rebase -i HEAD~5

Ansible Variables all of a Sudden Go Missing?

I’ve written a playbook which deploys a working development environment for some of our internal systems. I’ve tested it with various versions of RHEL. Yet when I ran it against a fresh install of Fedora it failed:

fatal: [192.168.1.233] => {'msg': "One or more undefined variables: 'ansible_lsb' is undefined", 'failed': True}

It turned out, that ansible gets it’s facts through different programs on the remote machine. If some of these programs are not available (in this instance it was lsb_release) the variables are not populated resulting in this error.

So check if all variables you access are indeed available with:

$ ansible -m setup <yourhost>

Abort a git commit –amend

The situation

You hack on a patch, add files to the index and with a knee jerk reaction do:

git commit --amend

(In fact, I do this in my editor with the vim-fugitive plug-in, but it also happened in the terminal). For the commit message git places you in your text editor. If you quit, your changes are merged with the last commit. Being aware of your trapped situation, what do you do?

The solution

Simply delete the commit message (up to where the comments start with #). The typical git commit-hook will see it as a commit with an empty message and abort the commit and therefore the merge.

Touché

I just ran into a problem in which I got myself cornered. I’ve edited a configuration file with elevated permission using sudo:

sudo vim /etc/httpd/conf.d/metrics.conf

From time to time when editing, I background the vim session in order to perform some commands in the shell, then continue working in vim. This time, I’ve sent the editor into the background. When trying to foreground it I got this:

~/Metrics(branch:bug_947304*) » fg
[1]  + 11403 continued  sudo vim /etc/httpd/conf.d/metrics.conf
[1]  + 11403 suspended (tty input)  sudo vim /etc/httpd/conf.d/metrics.conf

What? How? When? Any attempt to foreground this process failed. Reason was the elevated permission I’ve started the editor with. Long story short, you’ll need to kill the editor in order to get back into editing. This mailing list thread explains why.