If you’re using the nix package manager and your
nix build fails with:
these derivations will be built:
Build with /nix/store/cclv7n6jr311i5ywwkms1m3iz4lsg37j-ghc-8.6.3.
unpacking source archive /nix/store/j23vlzlg2rmqy0a706h235j4v9zh4m9s-purebred
source root is purebred
setupCompileFlags: -package-db=/build/setup-package.conf.d -j4 -threaded
Loaded package environment from /build/purebred/.ghc.environment.x86_64-linux-8.6.3
ghc: can't find a package database at /home/rjoost/.cabal/store/ghc-8.6.3/package.db
builder for '/nix/store/7xk0m6r07x85rwlh01b3wvq8bbzwbw1n-purebred-0.1.0.0.drv' failed with exit code 1
cannot build derivation '/nix/store/dmj2ax3qsa55jjl6by9fb9sk929k98nl-ghc-8.6.3-with-packages.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/j9fl8cmq9c6kjnz9dj79rmbs1kzafyys-purebred-with-packages-8.6.3.drv': 1 dependencies couldn't be built
error: build of '/nix/store/j9fl8cmq9c6kjnz9dj79rmbs1kzafyys-purebred-with-packages-8.6.3.drv' failed
then the solution is actually easier than you think. It happens after you have run cabal
inside a nix shell, because cabal creates a hidden environment file. So look for a
.ghc.environment.* file in the source root:
.ghc.environment.x86_64-linux-8.6.3 # for example on Linux with GHC 8.6.3
Delete it and you should be good to go.
I’m currently working on a small Haskell tool which helps me minimize the time I spend waiting for a train into the city (or out). One feature I’ve implemented recently is an automated import of approx. 25 MB of compressed CSV data into an SQLite3 database, which was very slow in the beginning. Looking beyond the top entry of the profiling output is what helped me optimize the implementation for a swift import.
The data comes as a 25 MB zip archive of text files in CSV format. Once everything is imported, the SQLite database grows to about 800 MiB. My work-in-progress solution was a cruddy shell + SQL script which imports the CSV files into an SQLite database. With this solution, the import takes about 30 seconds, excluding the time needed to download the zip file manually. But this is not very portable, and I wanted a more user-friendly solution.
The initial Haskell implementation, which mostly used the esqueleto and persistent DSL functions, showed abysmal performance. I had to stop the process after half an hour.
Finding the culprit
A first profiling pass showed this result summary:
COST CENTRE           MODULE                          %time  %alloc
stepError             Database.Sqlite                  77.2     0.0
concat.ts'            Data.Text                         1.8    14.5
compareText.go        Data.Text                         1.4     0.0
concat.go.step        Data.Text                         1.0     8.2
concat                Data.Text                         0.9     1.4
concat.len            Data.Text                         0.8    13.9
sumP.go               Data.Text                         0.8     2.1
concat.go             Data.Text                         0.7     2.6
singleton_            Data.Text.Show                    0.6     4.0
run                   Data.Text.Array                   0.5     3.1
escape                Database.Persist.Sqlite           0.5     7.8
>>=.\                 Data.Attoparsec.Internal.Types    0.5     1.4
singleton_.x          Data.Text.Show                    0.4     2.9
parseField            CSV.StopTime                      0.4     1.6
toNamedRecord         Data.Csv.Types                    0.3     1.2
fmap.\.ks'            Data.Csv.Conversion               0.3     2.9
insertSql'.ins        Database.Persist.Sqlite           0.2     1.4
compareText.go.(...)  Data.Text                         0.1     4.3
compareText.go.(...)  Data.Text                         0.1     4.3
Naturally I checked the implementation of the first function, stepError, since it seemed to have the largest impact. It is a simple foreign function call to C. Fraser Tweedale made me aware that there is no more speed to gain here, since it’s already calling a C function. With that in mind, I had to focus on the next entries. It turned out that’s where I gained most of the speed, bringing the import closer to the crude SQL script while keeping it more user-friendly.
It turned out that persistent (Database.Persist.Sqlite) primarily uses Data.Text concatenation to create the SQL statements. Doing that for every insert is very costly, since it prepares the statement, binds the values and executes it for each insert (for reference see this Stack Overflow answer).
My current solution is to prepare the statement once and only bind the values for each insert.
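As a rough illustration of the idea, here is a minimal sketch using the low-level Database.Sqlite module that ships with persistent-sqlite. It is not the code from my tool: the stop_time table, its two columns, the sample rows and the helper names runSql and importStopTimes are made up for this example. The point is only that the INSERT is prepared once, and each row then costs a bind, a step and a reset.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Control.Exception (bracket)
import Control.Monad (forM_, void)
import Data.Int (Int64)
import Data.Text (Text)
import Database.Persist (PersistValue (..))
import qualified Database.Sqlite as Sqlite

-- Run a parameterless statement and return the columns of its first row
-- (an empty list if the statement produces no row).
runSql :: Sqlite.Connection -> Text -> IO [PersistValue]
runSql conn sql =
  bracket (Sqlite.prepare conn sql) Sqlite.finalize $ \stmt -> do
    result <- Sqlite.step stmt
    case result of
      Sqlite.Row -> Sqlite.columns stmt
      Sqlite.Done -> pure []

-- Prepare the INSERT once, then only bind, step and reset for every row.
-- The table and column names are hypothetical.
importStopTimes :: Sqlite.Connection -> [(Text, Int64)] -> IO ()
importStopTimes conn rows =
  bracket (Sqlite.prepare conn insertSql) Sqlite.finalize $ \stmt ->
    forM_ rows $ \(tripId, stopSequence) -> do
      Sqlite.bind stmt [PersistText tripId, PersistInt64 stopSequence]
      void (Sqlite.step stmt)
      Sqlite.reset conn stmt -- make the statement reusable for the next row
  where
    insertSql = "INSERT INTO stop_time (trip_id, stop_sequence) VALUES (?, ?)"

main :: IO ()
main =
  bracket (Sqlite.open ":memory:") Sqlite.close $ \conn -> do
    void (runSql conn "CREATE TABLE stop_time (trip_id TEXT, stop_sequence INTEGER)")
    importStopTimes conn [("trip-1", 1), ("trip-1", 2), ("trip-2", 1)]
    count <- runSql conn "SELECT COUNT(*) FROM stop_time"
    print count
```

The sketch uses an in-memory database so it can be run without touching any files; for a real import the connection would point at the database file instead.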
In another benchmark, the import time came down to approximately a minute on my ThinkPad X1 Carbon.