I’m currently working on a small Haskell tool which helps me minimize the waiting time for catching a train into the city (or out). One feature I’ve implemented recently is an automated import of approx. 25 MB of compressed CSV data into an SQLite3 database, which was very slow in the beginning. Looking beyond the first entry of the profiling output helped me optimize the implementation for a swift import.
The data comes as a 25 MB zip archive of text files in CSV format. Fully imported, the SQLite database grows to about 800 MiB. My work-in-progress solution was a crude shell + SQL script which imports the CSV files into an SQLite database. With this solution the import takes about 30 seconds, excluding the time needed to manually download the zip file. But this is not very portable, and I wanted a more user-friendly solution.
The initial Haskell implementation, using mostly the esqueleto and persistent DSL functions, showed abysmal performance. I had to stop the process after half an hour.
Finding the culprit
A first profiling pass showed this result summary:
COST CENTRE MODULE %time %alloc
stepError Database.Sqlite 77.2 0.0
concat.ts' Data.Text 1.8 14.5
compareText.go Data.Text 1.4 0.0
concat.go.step Data.Text 1.0 8.2
concat Data.Text 0.9 1.4
concat.len Data.Text 0.8 13.9
sumP.go Data.Text 0.8 2.1
concat.go Data.Text 0.7 2.6
singleton_ Data.Text.Show 0.6 4.0
run Data.Text.Array 0.5 3.1
escape Database.Persist.Sqlite 0.5 7.8
>>=.\ Data.Attoparsec.Internal.Types 0.5 1.4
singleton_.x Data.Text.Show 0.4 2.9
parseField CSV.StopTime 0.4 1.6
toNamedRecord Data.Csv.Types 0.3 1.2
fmap.\.ks' Data.Csv.Conversion 0.3 2.9
insertSql'.ins Database.Persist.Sqlite 0.2 1.4
compareText.go.(...) Data.Text 0.1 4.3
compareText.go.(...) Data.Text 0.1 4.3
Naturally I checked the implementation of the first function, since it seemed to have the largest impact. It is a simple foreign function call to C. Fraser Tweedale made me aware that there is no more speed to gain here, since it is already calling a C function. With that in mind I focused on the next entries, and it turned out that is where most of the speed was to be gained, bringing the import closer to the crude SQL script while keeping the tool more user friendly.
It turned out that persistent builds its SQL statements primarily through Data.Text concatenation. Doing that for every insert is very costly, since a fresh statement is prepared, has its values bound and is executed for each single insert (for reference see this Stack Overflow answer).
My current solution is to prepare the statement once and only bind the values for each insert.
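My fix lives in the Haskell code, but the idea is easy to sketch with Python’s sqlite3 module; the table and column names below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stop_times (trip_id TEXT, stop_id TEXT, seq INT)")
rows = [("t1", "s1", 1), ("t1", "s2", 2), ("t2", "s1", 1)]

# Slow pattern: a fresh SQL string per row, so SQLite prepares,
# binds and executes a brand new statement every time.
for trip, stop, seq in rows:
    conn.execute(f"INSERT INTO stop_times VALUES ('{trip}', '{stop}', {seq})")

# Fast pattern: one parameterised statement, prepared once; only the
# values are bound for each row.
conn.executemany("INSERT INTO stop_times VALUES (?, ?, ?)", rows)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM stop_times").fetchone()[0])  # 6
```

`executemany` reuses the prepared statement under the hood and only rebinds the parameters per row, which is exactly what the Haskell code now does by hand.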
After another benchmark run, the import time is now down to approximately a minute on my ThinkPad X1 Carbon.
I gave a talk about my experience learning Haskell:
The slides can be found here: http://redhat.slides.com/rjoost/deck-2
I’ve always felt my playbooks were undertested. I knew about a possible solution of spinning up new OpenStack instances with the ansible nova module, but felt it was too complex to be worth implementing. Now I’ve found a quicker way to test playbooks by using Docker.
In principle, all my test does is:
- create a docker container
- create a copy of the current ansible playbook in a temporary directory and mount it as a volume
- inside the docker container, run the playbook
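The three steps above can be sketched roughly like this; the image name, playbook path and the dnf-based ansible install are assumptions for illustration, not the actual repository code:

```shell
# 1. create a copy of the playbook in a temporary directory
tmpdir=$(mktemp -d)
cp -r ./playbook/. "$tmpdir"

# 2. + 3. start a container with the playbook mounted as a volume
#         and run it against localhost inside the container
docker run --rm -v "$tmpdir:/playbook:ro" fedora \
    /bin/bash -c 'dnf -y install ansible && \
                  ansible-playbook -i localhost, -c local /playbook/site.yml'
```

The trailing comma in `-i localhost,` makes ansible treat the argument as an inline inventory list, and `-c local` skips ssh entirely inside the container.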
This is obviously not perfect, since:
- running a playbook locally vs connecting via ssh can be a different beast to test
- it can become resource intensive if you want to test different scenarios, each represented as a docker image.
There are possibly more caveats, but for my small use case it has been a workable solution so far.
Find the code on github if you’d like to have a look. Improvements welcome!
Sometimes you want to clean up HTML by removing tags that contain only unprintable characters (whitespace, non-breaking space, etc.). Sometimes encoding back and forth results in weird characters when the HTML is rendered. Anyway, here is a snippet you might find useful:
# Find all <p> tags containing only a non-breaking space. They come
# out broken after re-encoding and we won't need them anyway.
# (node is an lxml element, e.g. from lxml.html.fromstring)
for empty in node.xpath("//p[.='\xa0']"):
    empty.getparent().remove(empty)
I’ve run into a few problems with Docker that I’d like to document for myself, along with how to solve them.
Overwriting an entrypoint
If you’ve configured a script as an entrypoint and it fails, you can run the docker image with a shell in order to fiddle with the script (instead of continuously rebuilding the image):
# --entrypoint provides a new entry point, here the nominated shell
docker run -i --entrypoint='/bin/bash' -t f5d4a4d6a8eb
Otherwise you may face errors like this:
/bin/bash: /bin/bash: cannot execute binary file
Weird errors when building the image
I’ve run into this a few times. Errors like:
If you’ve set SELinux to enforcing, you may want to temporarily disable SELinux for just building the image. Don’t disable SELinux permanently.
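If SELinux turns out to be the cause, something like the following keeps the change temporary; the image name is a placeholder:

```shell
getenforce                 # shows "Enforcing" when SELinux is enforcing
sudo setenforce 0          # switch to permissive mode until next boot
docker build -t myimage .
sudo setenforce 1          # switch back to enforcing right afterwards
```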
Old (base) image
Check whether your base image has changed (e.g. with docker images) and pull it again (docker pull <image>).
You hack on a patch, add files to the index and, with a knee-jerk reaction, do:
git commit --amend
(In fact, I do this in my editor with the vim-fugitive plug-in, but it has also happened to me in the terminal.) For the commit message, git places you in your text editor. If you quit, your staged changes are merged into the last commit. Aware of your trapped situation, what do you do?
Simply delete the commit message (up to where the comments start with #). Git sees a commit with an empty message, aborts the commit and therefore the amend.
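You can try the escape hatch in a scratch repository; passing `-m ""` simulates deleting the whole message in the editor:

```shell
demo=$(mktemp -d) && cd "$demo"
git init -q
git -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "original message"
echo fix > patch.txt && git add patch.txt
# An empty message makes git abort the commit, so the amend never
# happens and the staged change stays out of the last commit.
git -c user.name=me -c user.email=me@example.com commit --amend -m ""
git log --format=%s -n 1    # still "original message"
```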
When it comes to a full hard disk in one of your virtual machines, I found this article very useful for resizing it:
KVM Linux – Expanding a Guest LVM File System Using Virt-resize
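The gist of the workflow described there, with disk paths and the partition name as placeholder assumptions:

```shell
# Create a larger target disk, then let virt-resize copy the guest
# over while expanding the partition that holds the LVM volume.
qemu-img create -f qcow2 newdisk.qcow2 50G
virt-resize --expand /dev/sda2 olddisk.qcow2 newdisk.qcow2
# Afterwards, inside the guest, grow the PV, LV and filesystem, e.g.:
#   pvresize /dev/sda2
#   lvextend -l +100%FREE /dev/mapper/vg-root
#   resize2fs /dev/mapper/vg-root
```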