Key points have already been mentioned by hammar and from Philip JF . But let me collect them and, nevertheless, add a little explanation.
I will go from top to bottom.
data EulerState = EulerState !Double !Double !Double !Double
Your type has strict fields, so whenever an object of this type is evaluated in WHNF, its fields are also evaluated in WHNF. In this case, this means that the object is fully evaluated. This is good, but in our case, unfortunately, is not enough. Objects of this type can still be constructed using pointers to raw data instead of unpacking the raw data into the constructor, and this is what happens with the acceleration field (modulo the fact that the loop does not directly use the type, but passes components like arguments). Since this field is not used in euler , you get
Rec { Main.$wrunEuler [Occ=LoopBreaker] :: GHC.Prim.Double
loop with an argument in a box. This means that at each iteration you need to add some Double# , and some Double - unboxed. Boxing and unpacking are not very expensive operations, but in a cycle that would otherwise be tight, they can cost a lot of performance. Another instance of the same boxing / unboxing problem is associated with an argument of type EulerFunction , more on that later. -funbox-strict-fields as suggested by Philp JF , or the pragma {-# UNPACK #-} , at least here the acceleration field helps, but the difference becomes relevant only when boxing / unpacking for function evaluation is also eliminated.
print $ runEuler (EulerState 0 0 0 0) (**2) dt l
You pass (** 2) here as an argument. This is not the same function as you use in C; the corresponding C function will be return pow(t,2); , and with my gcc, using this almost doubles the runtime for the C program (although there is no clang for the difference). The big performance issue is that (**) is a slow function. Since (** 2) has different results from \x -> x*x for many arguments, there is no rewrite rule for this, so you really get this slow function with the GHC code generator (it seems that LLVM replaces it with \x -> x*x nonetheless, because of the huge performance difference between the two GHC backends and the result of clang). If you pass (\x -> x*x) or (^ 2) instead of (** 2) , you get multiplication (there is a rewrite rule for (^ 2) from 7.4). At the moment, on my system, there is not much difference between the performance of the code generated by the NCG and the LLVM-generated, but the NCG is about 10% faster.
Now a huge problem
runEuler :: EulerState -> EulerFunction -> Double -> Double -> EulerState runEuler s@ (EulerState _ _ _ t) f dt limit = let s' = euler sf dt in case t `compare` limit of LT -> s' `seq` runEuler s' f dt limit _ -> s'
runEuler is recursive, so it cannot be embedded. This means that the function passed cannot be specified there either, and the dt and limit arguments are passed for each iteration. The fact that the function cannot be built-in means that in the loop its argument must be boxed before passing the function, and then its result must be unpacked. It is expensive. And this means that no optimizations that can be made after embedding a function argument can ever be made.
If you do the worker / wrapper conversion and static argument conversion suggested by hammar , runEuler can be runEuler , therefore, the function passed can be inlined and, in this case, the argument box and unpacking of its result can be eliminated. In addition, and even more influence, in this case, the function call can be eliminated and replaced by a single machine operation. This results in a good stiff loop, as shown in the figure.
174,208 bytes allocated in the heap 3,800 bytes copied during GC
compared with
16,000,174,912 bytes allocated in the heap 1,475,432 bytes copied during GC
the original.
Together, this achieves about half the speed of a C program with a native code generator and at the same speed with the LLVM backend on my box (the source code generator is not particularly good at loop optimization, whereas LLVM, since loops are very common in many languages โโcompiled through LLVM) .