Profiling Haskell: Don’t chase the red herring

I’m currently working on a small Haskell tool which helps me minimize the waiting time for catching a train into the city (or out). One feature I’ve implemented recently is an automated import of approx. 25 MB of compressed CSV data into an SQLite3 database, which was very slow in the beginning. Looking past the first result of the profiling information helped me optimize the implementation for a swift import.

Background

The data comes as a 25 MB zip archive of text files in CSV format. Fully imported, the SQLite database grows to about 800 MiB. My work-in-progress solution was a crude shell + SQL script which imports the CSV files into an SQLite database. With this solution, the import takes about 30 seconds, excluding the time needed to download the zip file manually. But this is not very portable, and I wanted a more user-friendly solution.

The initial Haskell implementation, using mostly the esqueleto and persistent DSL functions, showed abysmal performance. I had to stop the process after half an hour.

Finding the culprit

A first profiling pass showed this result summary:

COST CENTRE          MODULE                         %time %alloc                                                                                                                               
                                                                                                                                                                                               
stepError            Database.Sqlite                 77.2    0.0                                                                                                                               
concat.ts'           Data.Text                        1.8   14.5                                                                                                                               
compareText.go       Data.Text                        1.4    0.0                                                                                                                               
concat.go.step       Data.Text                        1.0    8.2                                                                                                                               
concat               Data.Text                        0.9    1.4                                                                                                                               
concat.len           Data.Text                        0.8   13.9                                                                                                                               
sumP.go              Data.Text                        0.8    2.1                                                                                                                               
concat.go            Data.Text                        0.7    2.6                                                                                                                               
singleton_           Data.Text.Show                   0.6    4.0                                                                                                                               
run                  Data.Text.Array                  0.5    3.1                                                                                                                               
escape               Database.Persist.Sqlite          0.5    7.8                                                                                                                               
>>=.\                Data.Attoparsec.Internal.Types   0.5    1.4                                                                                                                               
singleton_.x         Data.Text.Show                   0.4    2.9                                                                                                                               
parseField           CSV.StopTime                     0.4    1.6                                                                                                                               
toNamedRecord        Data.Csv.Types                   0.3    1.2                                                                                                                               
fmap.\.ks'           Data.Csv.Conversion              0.3    2.9                                                                                                                               
insertSql'.ins       Database.Persist.Sqlite          0.2    1.4                                                                                                                               
compareText.go.(...) Data.Text                        0.1    4.3                                                                                                                               
compareText.go.(...) Data.Text                        0.1    4.3

Naturally, I first checked the implementation of the top entry, stepError, since it seemed to have the largest impact. It is a simple foreign function call into C. Fraser Tweedale made me aware that there is no more speed to gain there, since the time is already spent inside a C function. With that in mind, I focused on the next entries instead. That is where I gained most of the speed, bringing the import closer to the crude SQL script while keeping it more user friendly.

It turned out that persistent (Database.Persist.Sqlite) primarily uses Data.Text concatenation to create its SQL statements. Doing that for every single insert is very costly, since each insert builds the statement text, prepares it, binds the values and executes it (for reference see this Stack Overflow answer).

The solution

My current solution is to prepare the statement once and only bind the values for each insert.
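The Haskell code itself is not shown here, so the following is only a minimal sketch of the same idea in Python with the standard sqlite3 module, not the actual implementation. The table stop_time and its columns are hypothetical names chosen for illustration. The SQL text stays constant, so the statement is compiled once and only the values are bound per row, all inside a single transaction.

```python
import sqlite3


def import_rows(conn, rows):
    """Insert all rows using one parameterised statement in one transaction."""
    # Hypothetical table and columns; the SQL text never changes, so
    # sqlite3 compiles it once and only re-binds the values for each row.
    sql = "INSERT INTO stop_time (trip_id, stop_id, arrival) VALUES (?, ?, ?)"
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(sql, rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stop_time (trip_id TEXT, stop_id TEXT, arrival TEXT)")

rows = [
    ("t1", "s1", "08:00:00"),
    ("t1", "s2", "08:05:00"),
    ("t2", "s1", "09:00:00"),
]
import_rows(conn, rows)

count = conn.execute("SELECT COUNT(*) FROM stop_time").fetchone()[0]
```

The contrast is with building a fresh SQL string for every row, which is what showed up as Data.Text concatenation in the profile above.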

In another benchmark, the import time came down to approximately a minute on my ThinkPad X1 Carbon.


(Locally) Testing ansible deployments

I’ve always felt my playbooks were undertested. I know about a possible solution of spinning up new OpenStack instances with the ansible nova module, but that felt too complex to be worth implementing. Now I’ve found a quicker way to test playbooks by using Docker.

In principle, all my test does is:

  1. create a docker container
  2. create a copy of the current ansible playbook in a temporary directory and mount it as a volume
  3. inside the docker container, run the playbook
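The three steps could be sketched roughly like this in Python. This is an illustration under assumptions, not the code from the repository: the image name, the /playbook mount point and the site.yml file name are all hypothetical, and the image is assumed to have ansible installed.

```python
import shutil
import tempfile
from pathlib import Path


def copy_playbook(source_dir):
    """Copy the playbook directory into a fresh temporary directory."""
    target = Path(tempfile.mkdtemp(prefix="playbook-test-")) / "playbook"
    shutil.copytree(source_dir, target)
    return target


def docker_test_command(playbook_copy, image, playbook="site.yml"):
    """Build the argument list that runs the playbook inside a container."""
    return [
        "docker", "run", "--rm",
        "-v", f"{playbook_copy}:/playbook",  # mount the copy as a volume
        "-w", "/playbook",                   # run from the mounted directory
        image,                               # hypothetical image with ansible
        "ansible-playbook", "-i", "localhost,", "-c", "local", playbook,
    ]


# Only builds the command; pass it to subprocess.run(cmd, check=True) to run it.
cmd = docker_test_command("/tmp/playbook-copy", "ansible-test-image")
```

Working on a copy keeps whatever the container writes into the mounted directory away from the real checkout.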

This is obviously not perfect, since:

  • running a playbook locally vs connecting via ssh can be a different beast to test
  • it can become resource intensive if you want to test different scenarios represented as docker images.

There are possibly more drawbacks, but for my own small-scale use it has been a workable solution so far.

Find the code on github if you’d like to have a look. Improvements welcome!

 

(lxml) XPath matching against nodes with unprintable characters

Sometimes you want to clean up HTML by removing tags that contain nothing but unprintable characters (whitespace, non-breaking space, etc.). Sometimes encoding this back and forth results in weird characters when the HTML is rendered. Anyways, here is a snippet you might find useful:


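def clean_empty_tags(node):
    """
    Removes every <p> below node whose whole text is a single
    non-breaking space. They come out broken and we won't need them anyway.
    """
    # ".//p" keeps the search below node; "//p" would scan the whole document.
    # xpath() returns a list, so removing elements while looping is safe.
    # Note that lxml's remove() also drops the tail text of the element.
    for empty in node.xpath(u".//p[.='\xa0']"):
        empty.getparent().remove(empty)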

Common docker pitfalls

I’ve run into a few problems with docker that I’d like to document for myself, along with how to solve them.

Overwriting an entrypoint

If you’ve configured a script as an entrypoint and it fails, you can run the docker image with a shell in order to fiddle with the script (instead of continuously rebuilding the image):

# --entrypoint replaces the image's entrypoint, here with the nominated shell
docker run -i -t --entrypoint='/bin/bash' f5d4a4d6a8eb

Without overriding the entrypoint, you may run into errors like this one:

/bin/bash: /bin/bash: cannot execute binary file

Weird errors when building the image

I’ve run into this a few times. Errors like:

Error in PREIN scriptlet in rpm package libvirt-daemon-0.9.11.4-3.fc17.x86_64
or
useradd: failure while writing changes to /etc/passwd

If you’ve set SELinux to enforcing, you may want to temporarily switch it to permissive mode (setenforce 0) just for building the image, and switch it back afterwards (setenforce 1). Don’t disable SELinux permanently.

Old (base) image

Check whether your base image has changed (e.g. with docker images) and pull it again (docker pull <image>).


Abort a git commit --amend

The situation

You hack on a patch, add files to the index and, in a knee-jerk reaction, do:

git commit --amend

(In fact, I do this in my editor with the vim-fugitive plug-in, but it has also happened to me in the terminal.) Git then places you in your text editor for the commit message. If you just quit, your staged changes are merged into the last commit. Now that you realize you are trapped, what do you do?

The solution

Simply delete the commit message (everything up to where the comment lines starting with # begin), then save and quit. Git sees a commit with an empty message and aborts it, and therefore the merge into the last commit never happens.