You definitely have too many thunks allocated. I will show how to analyze the code:
merge (A ta) (A tb) = A (merge ta tb)
Here you highlight A constructor with one argument, which is thunk. Can you tell when a piece of merge ta tb will be forced? Probably only at the very end when the resulting tree is used. Try adding a hit to each argument of each Tree constructor to make sure that it is strictly speaking:
data Tree a = AB !(Tree a) !(Tree a) | A !(Tree a) | B !(Tree a) | C !a
The following example:
go lw !r = go (l-1) (shiftR w 1) (cons (w.&.1) r)
Here you highlight thunk for l-1 , shifrR w 1 and cons (w.&.1) r . The first will be forced in the next iterations when comparing l with o , the second will be forced when you will force 3d thunk in the next iteration ( w is used here), and the third thunk will be forced in the next iteration due to the explosion on r . So probably this particular situation is in order.
Yuras source share