Haskell - Optimization of the differential equation solver

I am learning Haskell and trying to write code as fast as I can in C. For this exercise, I am writing the Euler integrator for a simple one-dimensional physical system.

  • C code compiled with GCC 4.5.4 and -O3 . It runs in 1,166 seconds.
  • Haskell code compiled with GHC 7.4.1 and -O3 . It works in 21.3 seconds.
  • If I compile Haskell with -O3 -fllvm , it starts in 4.022 seconds.

So, did I miss something to optimize my Haskell code?

PS .: I used the following arguments: 1e-8 5 .

C code:

 #include <stdio.h> double p, v, a, t; double func(double t) { return t * t; } void euler(double dt) { double nt = t + dt; double na = func(nt); double nv = v + na * dt; double np = p + nv * dt; p = np; v = nv; a = na; t = nt; } int main(int argc, char ** argv) { double dt, limit; sscanf(argv[1], "%lf", &dt); sscanf(argv[2], "%lf", &limit); p = 0.0; v = 0.0; a = 0.0; t = 0.0; while(t < limit) euler(dt); printf("%f %f %f %f\n", p, v, a, t); return 0; } 

Haskell Code:

 import System.Environment (getArgs) data EulerState = EulerState !Double !Double !Double !Double deriving(Show) type EulerFunction = Double -> Double main = do [dt, l] <- fmap (map read) getArgs print $ runEuler (EulerState 0 0 0 0) (**2) dt l runEuler :: EulerState -> EulerFunction -> Double -> Double -> EulerState runEuler s@ (EulerState _ _ _ t) f dt limit = let s' = euler sf dt in case t `compare` limit of LT -> s' `seq` runEuler s' f dt limit _ -> s' euler :: EulerState -> EulerFunction -> Double -> EulerState euler (EulerState pvat) f dt = (EulerState p' v' a' t') where t' = t + dt a' = ft' v' = v + a'*dt p' = p + v'*dt 
+7
source share
4 answers

I got a good boost by applying the conversion of the working wrapper to runEuler .

 runEuler :: EulerState -> EulerFunction -> Double -> Double -> EulerState runEuler sf dt limit = go s where go s@ (EulerState _ _ _ t) = if t < limit then go (euler sf dt) else s 

This helps f get into the loop (which probably also happens in version C), getting rid of a lot of overhead.

+6
source

Key points have already been mentioned by hammar and from Philip JF . But let me collect them and, nevertheless, add a little explanation.

I will go from top to bottom.

 data EulerState = EulerState !Double !Double !Double !Double 

Your type has strict fields, so whenever an object of this type is evaluated in WHNF, its fields are also evaluated in WHNF. In this case, this means that the object is fully evaluated. This is good, but in our case, unfortunately, is not enough. Objects of this type can still be constructed using pointers to raw data instead of unpacking the raw data into the constructor, and this is what happens with the acceleration field (modulo the fact that the loop does not directly use the type, but passes components like arguments). Since this field is not used in euler , you get

 Rec { Main.$wrunEuler [Occ=LoopBreaker] :: GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Types.Double -> GHC.Prim.Double# -> Main.EulerFunction -> GHC.Prim.Double# -> GHC.Prim.Double# -> (# GHC.Types.Double, GHC.Types.Double, GHC.Types.Double, GHC.Types.Double #) 

loop with an argument in a box. This means that at each iteration you need to add some Double# , and some Double - unboxed. Boxing and unpacking are not very expensive operations, but in a cycle that would otherwise be tight, they can cost a lot of performance. Another instance of the same boxing / unboxing problem is associated with an argument of type EulerFunction , more on that later. -funbox-strict-fields as suggested by Philp JF , or the pragma {-# UNPACK #-} , at least here the acceleration field helps, but the difference becomes relevant only when boxing / unpacking for function evaluation is also eliminated.

 print $ runEuler (EulerState 0 0 0 0) (**2) dt l 

You pass (** 2) here as an argument. This is not the same function as you use in C; the corresponding C function will be return pow(t,2); , and with my gcc, using this almost doubles the runtime for the C program (although there is no clang for the difference). The big performance issue is that (**) is a slow function. Since (** 2) has different results from \x -> x*x for many arguments, there is no rewrite rule for this, so you really get this slow function with the GHC code generator (it seems that LLVM replaces it with \x -> x*x nonetheless, because of the huge performance difference between the two GHC backends and the result of clang). If you pass (\x -> x*x) or (^ 2) instead of (** 2) , you get multiplication (there is a rewrite rule for (^ 2) from 7.4). At the moment, on my system, there is not much difference between the performance of the code generated by the NCG and the LLVM-generated, but the NCG is about 10% faster.

Now a huge problem

 runEuler :: EulerState -> EulerFunction -> Double -> Double -> EulerState runEuler s@ (EulerState _ _ _ t) f dt limit = let s' = euler sf dt in case t `compare` limit of LT -> s' `seq` runEuler s' f dt limit _ -> s' 

runEuler is recursive, so it cannot be embedded. This means that the function passed cannot be specified there either, and the dt and limit arguments are passed for each iteration. The fact that the function cannot be built-in means that in the loop its argument must be boxed before passing the function, and then its result must be unpacked. It is expensive. And this means that no optimizations that can be made after embedding a function argument can ever be made.

If you do the worker / wrapper conversion and static argument conversion suggested by hammar , runEuler can be runEuler , therefore, the function passed can be inlined and, in this case, the argument box and unpacking of its result can be eliminated. In addition, and even more influence, in this case, the function call can be eliminated and replaced by a single machine operation. This results in a good stiff loop, as shown in the figure.

  174,208 bytes allocated in the heap 3,800 bytes copied during GC 

compared with

 16,000,174,912 bytes allocated in the heap 1,475,432 bytes copied during GC 

the original.

Together, this achieves about half the speed of a C program with a native code generator and at the same speed with the LLVM backend on my box (the source code generator is not particularly good at loop optimization, whereas LLVM, since loops are very common in many languages โ€‹โ€‹compiled through LLVM) .

+11
source

At the moment, I do not have a valid LLVM, but I get twice the ratio

  • Using -O2 instead of -O3 in GHC (although I doubt this is important, -O3 not supported)
  • Using -funbox-strict-fields
  • Using x*x instead of x ** 2 (same as your C code)
  • Moving your "Euler function" as an independent function is the same as in C.

t

  func :: EulerFunction func x = x * x runEuler :: EulerState -> Double -> Double -> EulerState runEuler s@ (EulerState _ _ _ t) dt limit = let s' = euler s dt in case t `compare` limit of LT -> s' `seq` runEuler s' dt limit _ -> s' euler :: EulerState -> Double -> EulerState euler (EulerState pvat) dt = (EulerState p' v' a' t') where t' = t + dt a' = func t' v' = v + a'*dt p' = p + v'*dt 

maybe you can move it further (or maybe a Haskell performance expert like Dons will come up with a solution), I havenโ€™t even looked at the kernel that generates it, but in general the way to make Haskell code is โ€œas fast as C "means" write it to C and use FFI ".

+5
source

A few links:

The following is evangelism representing common folklore. So take it with salt.

You cannot get stable C-like performance in different micro-libraries from any less mature than prehistoric languages โ€‹โ€‹like C, Fortran, Ada and C ++. Even Java is not quite there yet. Sometimes you get, but sometimes the compiler fails, and the GHC fails quite often.

But microbenchmarks do not tell you everything.

The problem is that getting finely tuned, low-level C code around the world is not financially feasible for large projects. Thus, C programs ultimately have poor algorithms, poor architecture, low bottlenecks, and plans to eventually rewrite everything. In C, it's easy to set up low-level code, but hard to make massive architectural changes. In Haskell, it's the other way around, so writing in a mix of haskell and C should give you the best of both worlds.

+4
source

All Articles