Call the three programs A, B, and C, respectively.
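For concreteness, here is a plausible reconstruction of the three programs (a sketch, not necessarily the original listings, inferred from the core excerpts below: print, product, head, the [1..1000] enumerations, and the c = 1000 - a - b subtraction). Each would live in its own file (A.hs, B.hs, C.hs) as main.

-- Plausible reconstruction (a sketch, not necessarily the original listings).

-- A: the ordering constraint is folded into b's range in the list monad
mainA :: IO ()
mainA = print . product . head $
  [ [a, b, c] | a <- [1..1000], b <- [a..1000]
              , let c = 1000 - a - b, a ^ 2 + b ^ 2 == c ^ 2 ]

-- B: no ordering constraint; a and b both range over the full interval
mainB :: IO ()
mainB = print . product . head $
  [ [a, b, c] | a <- [1..1000], b <- [1..1000]
              , let c = 1000 - a - b, a ^ 2 + b ^ 2 == c ^ 2 ]

-- C: the same ranges as B, plus an explicit guard b >= a
mainC :: IO ()
mainC = print . product . head $
  [ [a, b, c] | a <- [1..1000], b <- [1..1000], b >= a
              , let c = 1000 - a - b, a ^ 2 + b ^ 2 == c ^ 2 ]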
C vs B
Let's start with the simplest comparison: C has one additional constraint (b >= a) relative to B. Intuitively, this means that the search space traversed by C is smaller than B's. We can rationalize this by noting that instead of iterating over every possible pair (a, b) (of which we know there are 1000^2 = 1,000,000), we simply skip all the cases where b is less than a. Presumably, checking b >= a costs a little extra work (a comparison), but that is outweighed by the computation the comparison lets us avoid, so we see a (slight) speedup. Fair enough.
B vs A
The next comparison is a bit more subtle: A appears to have the same constraint as C (b >= a), but encoded differently, namely in the range of values that b is drawn from in the list monad (b runs over [a..1000] rather than [1..1000]). We might therefore expect A to run faster than B (since it should behave much like C). Clearly, our intuition is not enough here.
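To make the "encoded in the range" distinction concrete, here is roughly how the two encodings look when written out in the list monad with do-notation (again a sketch based on the reconstructions above, not the original listings):

import Control.Monad (guard)

-- C's encoding: b ranges over the full interval, and the constraint is a
-- separate filtering step in the list monad
candidatesC :: [[Integer]]
candidatesC = do
  a <- [1..1000]
  b <- [1..1000]
  guard (b >= a)
  let c = 1000 - a - b
  guard (a ^ 2 + b ^ 2 == c ^ 2)
  return [a, b, c]

-- A's encoding: the same constraint, folded into b's enumeration range
candidatesA :: [[Integer]]
candidatesA = do
  a <- [1..1000]
  b <- [a..1000]                 -- b >= a holds by construction
  let c = 1000 - a - b
  guard (a ^ 2 + b ^ 2 == c ^ 2)
  return [a, b, c]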
To the core
Now, since we cannot always trust our intuition, we have to examine what actually happens in the generated GHC Core. Let's dump the core for our three programs (without optimization):
for p in A B C; do ghc -ddump-simpl $p.hs > $p.core; done
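(As an aside: on more recent GHC versions, printing flags such as -dsuppress-all or -dsuppress-module-prefixes make the dumped core considerably easier to read; the excerpts below keep the fully qualified output.)

for p in A B C; do ghc -ddump-simpl -dsuppress-all $p.hs > $p.core; done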
If we compare B.core and C.core, we note that both files have approximately the same structure:
Both start by calling a few familiar functions: (System.IO.print ...), (Data.List.product ...) and (GHC.List.head ...)
Next, we define a pair of nested recursive functions with signatures of the form:
ds_dxd [Occ=LoopBreaker] :: [GHC.Integer.Type.Integer] -> [[GHC.Integer.Type.Integer]]
We then call each of these recursive functions over an enumeration of the form:
(GHC.Enum.enumFromTo @ GHC.Integer.Type.Integer GHC.Num.$fEnumInteger (GHC.Integer.smallInteger 1) (GHC.Integer.smallInteger 1000)))
and execute our logic inside the innermost function. Note that in B.core we see:
case GHC.Classes.== @ GHC.Integer.Type.Integer ...
       (GHC.Num.+ ...
          (GHC.Real.^ ... ds3_dxc (GHC.Integer.smallInteger 2))
          (GHC.Real.^ ... ds8_dxg (GHC.Integer.smallInteger 2)))
       (GHC.Real.^ ... c_abw (GHC.Integer.smallInteger 2))
corresponding to a naive filter of all possible values against our triplet constraint, whereas in C.core we instead have:
case GHC.Classes.>= @ GHC.Integer.Type.Integer GHC.Classes.$fOrdInteger ds8_dxj ds3_dxf
of _ {
  GHC.Bool.False -> ds5_dxh ds9_dxk;
  GHC.Bool.True  -> let { ... case GHC.Classes.== ...
corresponding to checking the additional >= constraint before our triplet constraint and, therefore, examining fewer integers and running for less time, as our intuition predicted.
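To see why the unoptimized core takes this shape, it may help to write out by hand roughly what the desugared comprehension for C looks like (a sketch with invented names; the real core uses machine-generated workers such as ds_dxd): a nested pair of recursive loops over enumerations, with the guards becoming branches that either continue or skip.

-- Hand-written approximation of the unoptimized desugaring of C's comprehension
-- (names invented; the real core uses machine-generated workers like ds_dxd).
triplesC :: [[Integer]]
triplesC = outer (enumFromTo 1 1000)
  where
    outer []       = []
    outer (a : as) = inner a (enumFromTo 1 1000) ++ outer as

    inner _ []       = []
    inner a (b : bs)
      | b >= a    =                         -- the extra constraint is checked first...
          let c = 1000 - a - b
          in if a ^ 2 + b ^ 2 == c ^ 2      -- ...and only then the triplet constraint
               then [a, b, c] : inner a bs
               else inner a bs
      | otherwise = inner a bs              -- b < a: skip without computing c or the squares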
Comparing A.core and B.core, we immediately see a familiar structure (a nested pair of recursive functions, each called over an enumeration), and, as it turns out, the core output for A and B is almost identical! The difference appears to be in the innermost enumeration:
ds5_dxd (GHC.Enum.enumFromTo @ GHC.Integer.Type.Integer GHC.Num.$fEnumInteger ds3_dxb (GHC.Integer.smallInteger 1000))
where the enumeration runs from the current value of the induction variable ds3_dxb up to 1000, rather than over the static range [1..1000].
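In source terms (using the reconstructed comprehensions above), the two enumerations differ only in their lower bound:

-- B and C: the inner enumeration is a static range, independent of the outer loop
innerStatic :: [Integer]
innerStatic = enumFromTo 1 1000        -- i.e. [1..1000]

-- A: the inner enumeration starts at the current induction variable, so a new
-- range is produced for every iteration of the outer loop
innerDependent :: Integer -> [Integer]
innerDependent a = enumFromTo a 1000   -- i.e. [a..1000]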
What gives, then? Doesn't this mean that A should run faster than B? (We naively expect A to behave like C, given that they implement the same constraint.) Well, it turns out that the various compiler optimizations in play produce extremely complex behavior, and different combinations often give non-intuitive (and frankly bizarre) results; in this case we have two compilers interacting with each other: GHC and GCC. To make sense of these results, we have to look at the generated optimized core (ultimately it is the generated assembly that really counts, but we will ignore that this time).
To the Optimized Core, and Beyond
Let's generate the optimized core:
for p in A B C; do ghc -O3 -ddump-simpl $p.hs > $p.core; done
and compare our problem child (A) with its faster counterparts. By comparison, B and C both benefit from a class of optimizations that A does not: let-floating and lambda lifting. We can see this by noting that the recursive functions in B and C have roughly 40 fewer lines of code, which translates into tighter inner loops. To understand why A misses out on this optimization, let's take a look at the code that doesn't get floated out:
let {
  c1_s10T :: GHC.Integer.Type.Integer
             -> [[GHC.Integer.Type.Integer]]
             -> [[GHC.Integer.Type.Integer]]
  [LclId, Arity=2, Str=DmdType LL]
  c1_s10T =
    \ (ds2_dxg :: GHC.Integer.Type.Integer)
      (ds3_dxf :: [[GHC.Integer.Type.Integer]]) ->
      let {
        c2_s10Q [Dmd=Just L] :: GHC.Integer.Type.Integer
        [LclId, Str=DmdType]
        c2_s10Q = GHC.Integer.minusInteger lvl2_s10O ds2_dxg } in   -- subtract
      case GHC.Integer.eqInteger
             (GHC.Integer.plusInteger
                lvl3_s10M                       -- add two squares (lvl3_s10M has been floated out)
                (GHC.Real.^ ds2_dxg lvl_r11p))
             (GHC.Real.^ c2_s10Q lvl_r11p)      -- compared to this square
      of _ {
        GHC.Bool.False -> ds3_dxf;
        GHC.Bool.True ->
          GHC.Types.: @ [GHC.Integer.Type.Integer]
            (GHC.Types.: @ GHC.Integer.Type.Integer ds_dxe
              (GHC.Types.: @ GHC.Integer.Type.Integer ds2_dxg
                (GHC.Types.: @ GHC.Integer.Type.Integer c2_s10Q
                  (GHC.Types.[] @ GHC.Integer.Type.Integer))))
            ds3_dxf
      } } in
i.e., the subtraction (minusInteger), the equality test (eqInteger), and the two squarings (GHC.Real.^) are all performed in the critical section of our loop (the body of the helper function), while the corresponding helper function in C.core contains far fewer computations (and, digging deeper, this is because GHC cannot determine whether it is safe to float these computations out at this stage of optimization). This is consistent with our earlier analysis: although the constraint (b >= a) is indeed present in A, unlike in C we were not able to float most of the redundant computation out of the loop.
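As a reminder of what let-floating buys in general, here is a hand-written illustration using computations of the same shape as the inner loop above (it is not a transcription of the actual core, and GHC does manage to float some pieces even in A, e.g. lvl3_s10M per the comment in the dump): work that depends only on a is hoisted out of the loop over b, so the hot path does less per iteration.

-- Before let-floating: (1000 - a) and a^2 depend only on a, yet they are
-- recomputed on every iteration of the inner loop over b.
innerNoFloat :: Integer -> [Integer] -> [[Integer]]
innerNoFloat a bs =
  [ [a, b, c] | b <- bs, let c = 1000 - a - b, a ^ 2 + b ^ 2 == c ^ 2 ]

-- After let-floating: the loop-invariant pieces are computed once, outside
-- the loop, and the per-iteration work is correspondingly smaller.
innerFloated :: Integer -> [Integer] -> [[Integer]]
innerFloated a bs =
  let rest = 1000 - a
      aSq  = a ^ 2
  in [ [a, b, c] | b <- bs, let c = rest - b, aSq + b ^ 2 == c ^ 2 ]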
To confirm this, let's arbitrarily increase the loop bounds (purely for demonstration), say to [1..10000]. We should expect the runtime behavior of A to be asymptotically close to that of C, and we expect B to be left in the dust.
➜ time ./A
./A  0.37s user 0.01s system 74% cpu 0.502 total
➜ time ./B
./B  3.21s user 0.02s system 99% cpu 3.246 total
➜ time ./C
./C  0.33s user 0.01s system 99% cpu 0.343 total
And what do you know, just as we expected! Our initial bounds were simply too small for any interesting performance characteristics to show up (whatever theory may tell you, constant overheads matter in practice). Another way to look at this result is that our initial intuition, that A should match the performance of C, was actually more accurate than the first measurements suggested.
Of course, all this may be superfluous for the sample code at hand, but such an analysis can be very useful in environments with limited resources.