Debugging with RPM packages

While working on one of our internal Ruby on Rails web applications, we discovered a file descriptor leak in one of the delayed job worker processes. The worker leaked a descriptor whenever it sent a message to the message bus using qpid-messaging.

Since the gems involved are compiled C and C++ extensions, I used gdb together with the debuginfo packages provided through the package manager to find the root cause.

Big thanks to Dan Callaghan who walked me through most of the process and then found the leak in the C++ sources.

TL;DR

  • identify the leaking descriptors and reproduce the leak with lsof
  • attach strace to the process and identify file descriptors which are not being closed
  • install debuginfo packages for all dependencies
  • use gdb to figure out what is going on

Reproducer

I used lsof, and a friend wrote a small script to quickly monitor the worker process. Looking at the open files of the process revealed a long list of what looked like half-closed sockets. It later turned out that this was a different problem: the sockets had been created, but never bound or connected.

I was unable to reproduce the problem in my local development environment, but found a way to do it in our staging environment, which resembles production much more closely. Whenever I invoked an action in the UI which resulted in a message being sent, I was able to see another leaked file descriptor with lsof.
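
The monitoring script itself is not reproduced here. A minimal sketch of what such a monitor could look like, assuming Linux with /proc mounted and permission to read the worker's fd directory; the function names count_fds and monitor_fds are made up for illustration:

```shell
# Hypothetical helper: print the number of open file descriptors of the
# process whose PID is given as $1. Prints 0 if the process does not exist.
count_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Hypothetical helper: print a timestamped descriptor count every
# INTERVAL seconds (default 2) until the process exits.
monitor_fds() {
    pid=$1
    interval=${2:-2}
    while [ -d "/proc/$pid" ]; do
        printf '%s %s\n' "$(date +%T)" "$(count_fds "$pid")"
        sleep "$interval"
    done
}
```

Running monitor_fds <pid> in a second terminal while triggering the action in the UI should show the count growing by one for every leaked descriptor.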

Strace the process

With the reproducer at hand, I started to strace the process:

# Note we're not filtering system calls with -e here.
# Weirdly, close was not reported when filtering for network calls only.
strace -s 1000 -p <pid> -o strace_output_log.strace

Dan helped me look through the resulting log, which revealed that the process created a socket and called getpeername right after it, without ever binding or connecting the socket, resulting in a leaked file descriptor.

10971 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 35
10971 getpeername(35, 0x7fffae712a90, [112]) = -1 ENOTCONN (Transport endpoint is not connected)
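
Scanning a long strace log by eye is tedious. The following is a rough sketch that pairs socket calls with close calls and prints the sockets that were never closed. It assumes the log format shown above, with or without the leading PID, and ignores descriptors created by other calls such as accept or dup as well as lines split into "unfinished" and "resumed" parts. The function name leaked_sockets is made up for illustration:

```shell
# Hypothetical helper: list sockets in the strace log $1 that were opened
# but never closed.
leaked_sockets() {
    awk '
        # socket(...) = N : remember descriptor N together with its line
        /socket\(.*\) += [0-9]+$/ { open[$NF] = $0 }
        # close(N) = 0 : descriptor N was released again
        /close\([0-9]+\) += 0/ {
            fd = $0
            sub(/.*close\(/, "", fd)
            sub(/\).*/, "", fd)
            delete open[fd]
        }
        END { for (fd in open) print fd ": " open[fd] }
    ' "$1"
}
```

For example, leaked_sockets strace_output_log.strace prints one line per descriptor that is still open at the end of the log.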

Install debuginfo packages and use gdb

To debug the process, you need the debuginfo packages installed; otherwise you won’t be able to step through the sources in gdb. When you attach gdb to the process, it tells you which packages are missing, for example:

Missing separate debuginfos, use: debuginfo-install qpid-proton-c-0.10-3.fc25.x86_64

You then install those (make sure you have the debuginfo repositories configured, e.g. the section named fedora-debuginfo):

debuginfo-install qpid-proton-c-0.10-3.fc25.x86_64

and basically start debugging.

Our first suspicion was the qpid messaging library, so we checked whether its invocation of getpeername was leaking the file descriptors. I added a breakpoint at the point in the source code we thought was suspicious and, in a separate terminal, used lsof to see which file descriptor number was leaked. For example:

# I've used a watch, which executes the lsof every 2 seconds by
# default. The grep filters some of the files I'm not interested in
$ watch "lsof -p <pid> | grep -v REG"

The lsof output will show you the leaked file descriptor number in column 4 by default. With that you can check in gdb if the file descriptor being handled in the source code is the one which leaked.
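
To compare descriptor numbers rather than whole lsof lines, column 4 can be extracted with a small filter. This is only a sketch under the assumption of the default lsof column layout; the function name lsof_fds is made up for illustration:

```shell
# Hypothetical helper: read lsof output on stdin and print the numeric file
# descriptors from column 4 (FD). The header line and entries such as
# cwd/txt/mem are skipped, and the trailing mode letter is dropped (35u -> 35).
lsof_fds() {
    awk 'NR > 1 && $4 ~ /^[0-9]+/ { sub(/[^0-9].*$/, "", $4); print $4 }'
}
```

Saving lsof -p <pid> | lsof_fds | sort to one file before the action and to another one afterwards, then running comm -13 on the two files, lists the descriptor numbers that only exist in the second snapshot.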

Since that achieved no results, we used gdb to break on invocations of getpeername and used the backtrace to pinpoint where in the sources the leak occurred.
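
We did this interactively. A non-interactive variant could look like the sketch below, which writes a gdb command file that logs a backtrace for every getpeername call and lets the process continue. The file name, the helper name write_gdb_commands, and the use of the rdi register to read the descriptor (which assumes x86_64) are my assumptions, not what we actually typed:

```shell
# Hypothetical helper: write a gdb command file to the path given as $1.
write_gdb_commands() {
    cat > "$1" <<'EOF'
# Assumption: x86_64, where the first integer argument (the fd) is in rdi.
break getpeername
commands
  print (int)$rdi
  backtrace
  continue
end
continue
EOF
}
```

After write_gdb_commands leak.gdb, running gdb -p <pid> -x leak.gdb attaches to the process and prints the descriptor number and a backtrace whenever getpeername is hit, which can then be matched against the number lsof reports.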

Ansible Variables all of a Sudden Go Missing?

I’ve written a playbook which deploys a working development environment for some of our internal systems. I’ve tested it with various versions of RHEL. Yet when I ran it against a fresh install of Fedora it failed:

fatal: [192.168.1.233] => {'msg': "One or more undefined variables: 'ansible_lsb' is undefined", 'failed': True}

It turned out that Ansible gathers its facts by running different programs on the remote machine. If one of these programs is not available (in this instance it was lsb_release), the corresponding variables are not populated, resulting in this error.

So check if all variables you access are indeed available with:

$ ansible -m setup <yourhost>
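
The setup output is long, so a small filter helps when looking for a single fact. A minimal sketch, assuming the facts are printed as JSON with quoted keys; the function name has_fact is made up for illustration:

```shell
# Hypothetical helper: read "ansible -m setup" output on stdin and succeed
# if the fact named in $1 appears as a JSON key.
has_fact() {
    grep -q "\"$1\":"
}
```

For example: ansible -m setup <yourhost> | has_fact ansible_lsb || echo "ansible_lsb is missing". The setup module also accepts a filter argument (-a 'filter=ansible_lsb*') which narrows the output on the Ansible side.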

Common docker pitfalls

I’ve run into a few problems with Docker. I’d like to document them and how to solve them.

Overwriting an entrypoint

If you’ve configured a script as an entrypoint and it fails, you can run the docker image with a shell in order to fiddle with the script (instead of continuously rebuilding the image):

# --entrypoint overrides the image's entry point, here with a shell
docker run -i --entrypoint='/bin/bash' -t f5d4a4d6a8eb

Otherwise you may face errors like this one:

/bin/bash: /bin/bash: cannot execute binary file
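
One plausible source of this message (an assumption, not confirmed for the image above) is that bash ends up being started with /bin/bash as its argument, for instance when the shell is passed as the command while the entry point is already a shell. Bash then tries to read its own binary as a script. The effect can be reproduced without Docker; the function name bash_on_itself is made up for illustration:

```shell
# Hypothetical helper: run bash with its own binary as the "script" argument
# and print the resulting error message.
bash_on_itself() {
    bash_path=$(command -v bash) || return 1
    "$bash_path" "$bash_path" 2>&1 || true
}
```

Calling bash_on_itself prints the same kind of "cannot execute binary file" message as above.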

Weird errors when building the image

I’ve run into this a few times. Errors like:

Error in PREIN scriptlet in rpm package libvirt-daemon-0.9.11.4-3.fc17.x86_64
or
useradd: failure while writing changes to /etc/passwd

If you’ve set SELinux to enforcing, you may want to temporarily disable SELinux for just building the image. Don’t disable SELinux permanently.

Old (base) image

Check whether your base image is out of date (e.g. with docker images) and pull it again (docker pull <image>).

Could not initialize Opera

I ran into a problem this morning: Opera had worked fine until version 12.

On start-up you get something like this:

captainmoonlite :: ~ » opera
Could not initialize Opera.

Running the startup script with strace reveals the culprit:

captainmoonlite :: ~ » strace opera
execve("/usr/bin/opera", ["opera"], [/* 71 vars */]) = 0
[...]
lstat("/home/roman/.kde/share/config/kcmnspluginrc", 0x7fffbd443080) = -1 EACCES (Permission denied)
write(2, "Could not initialize Opera.\n", 28Could not initialize Opera.

After fixing the permissions of .kde/share/config, Opera started just fine. Opera v.11 must not have accessed this directory or file. strace FTW!
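
When the strace output is long, the failing paths can be pulled out of a saved log. A minimal sketch, assuming the log was written with strace -o and uses the line format shown above; the function name denied_paths is made up for illustration:

```shell
# Hypothetical helper: print the first quoted path of every system call in
# the strace log $1 that failed with EACCES (Permission denied).
denied_paths() {
    grep 'EACCES' "$1" | sed -n 's/^[^"]*"\([^"]*\)".*/\1/p'
}
```

For example: strace -o opera.strace opera, followed by denied_paths opera.strace.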

Debugging Byobu

Ubuntu ships with a neat GNU screen enhancement called byobu. One of its nice features is the ability to run custom scripts. The output of your custom byobu scripts is shown in the status line of your byobu session.

Byobu runs custom commands

I’ve converted my former screen script to run as a custom script in byobu, but it suddenly stopped working. I was wondering why and found a way to see what the problem was.

What you need

My script scans my mail directory and checks for new mail. I placed it in my home directory under:

$ ls /home/roman/.byobu/bin
3_maildircheck

Debugging

The following points should give you a clue why your custom script won’t work with byobu:

  1. Check if you have enabled custom scripts in byobu (press F9 in a byobu session).
  2. Run the custom command by itself from the plugins directory, not from your home directory. On Ubuntu, the plugins directory is /usr/lib/byobu/custom.
  3. The output of custom scripts is written to a cache file under /var/run/screen. Check what the cache files tell you.
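
Before digging into byobu itself, it is worth ruling out problems with the script. The following sketch checks that a script is executable and shows what it prints and how it exits when run without a terminal. It is a generic check, not something byobu provides, and the function name check_script is made up for illustration:

```shell
# Hypothetical helper: sanity-check the script given as $1.
check_script() {
    script=$1
    if [ ! -x "$script" ]; then
        echo "not executable: $script"
        return 1
    fi
    # Run it non-interactively and capture stdout and stderr.
    if output=$("$script" 2>&1 </dev/null); then status=0; else status=$?; fi
    echo "exit status: $status"
    echo "output: $output"
}
```

For example: check_script /home/roman/.byobu/bin/3_maildircheck.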

Upgrading Ghostscript and CentOS

The Ghostscript version installed on CentOS is usually quite old compared to distributions like Ubuntu. Upgrading Ghostscript without compiling the whole package is a challenge, but via Chris Schuld’s Blog I found a link to the http://blackopsoft.com/ repository, which provides a recent Ghostscript version.

The Problem

Nevertheless, I still had trouble with ImageMagick. When converting PDFs to images, for example, the installed Ghostscript version was unable to look up certain font files, resulting in an error similar to this one:

Error: /invalidfont in findfont
Operand stack:
   Arial-ISO   Arial-ISO   Arial   Font   Arial
Execution stack:
   [...]
Dictionary stack:
  [...]
Current allocation mode is local
Last OS error: 2
Current file position is 279
GNU Ghostscript 7.05: Unrecoverable error, exit code 1

The Solution

I hunted around in the system. Apparently Ghostscript still uses a fontmap to look up fonts. The question was which fontmap was in use. I tried changing the fontmap in /etc/ghostscript, which had no effect.

I finally found a bogus fontmap under

/usr/share/ghostscript/8.70/Resource/Init

which provided all font entries and aliases, but none of the aliases had a font file associated with it. For example, Helvetica-Bold pointed to NimbusSanL-BoldCond, but NimbusSanL-BoldCond had no pointer to an existing font file in the file system. It was simply missing.

The Fix

I now have it working with this in /usr/share/ghostscript/8.70/Resource/Init:

%!
% See Fontmap.GS for the syntax of real Fontmap files.
%% Replace 1 (Fontmap.GS)
/NimbusSanL-Regu        (n019003l.pfb)  ;
/NimbusSanL-ReguItal    (n019023l.pfb)  ;
/NimbusSanL-Bold        (n019004l.pfb)  ;
/NimbusSanL-BoldItal    (n019024l.pfb)  ;

/NimbusSanL-ReguCond    (n019043l.pfb)  ;
/NimbusSanL-ReguCondItal        (n019063l.pfb)  ;
/NimbusSanL-BoldCond    (n019044l.pfb)  ;
/NimbusSanL-BoldCondItal        (n019064l.pfb)  ;

(Fontmap.GS) .runlibfile
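
To verify that every font file named in a fontmap actually exists, something like the following sketch can help. It only looks at entries of the form /Name (file) ; and skips aliases. The font directory is passed in as a parameter because its location on your system is an assumption I cannot fill in, and the function name missing_fonts is made up for illustration:

```shell
# Hypothetical helper: print every font file referenced in the fontmap $1
# that does not exist in the directory $2.
missing_fonts() {
    fontmap=$1
    fontdir=$2
    # Extract the file name between the parentheses of each font entry.
    sed -n 's|^/[^[:space:]]*[[:space:]]*(\([^)]*\))[[:space:]]*;.*|\1|p' "$fontmap" |
    while read -r file; do
        [ -e "$fontdir/$file" ] || echo "$file"
    done
}
```

An empty result means all referenced files were found in the given directory.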

References: