Why is this Haskell code slower with -O?

This piece of Haskell code runs much slower with -O, but -O should not be dangerous. Can someone tell me what happened? If it matters, this is an attempt to solve this problem, and it uses binary search and a persistent segment tree:

    import Control.Monad
    import Data.Array

    data Node = Leaf Int             -- value
              | Branch Int Node Node -- sum, left child, right child
    type NodeArray = Array Int Node

    -- create an empty node with range [l, r)
    create :: Int -> Int -> Node
    create l r
        | l + 1 == r = Leaf 0
        | otherwise  = Branch 0 (create l m) (create m r)
            where m = (l + r) `div` 2

    -- Get the sum in range [0, r). The range of the node is [nl, nr)
    sumof :: Node -> Int -> Int -> Int -> Int
    sumof (Leaf val) r nl nr
        | nr <= r   = val
        | otherwise = 0
    sumof (Branch sum lc rc) r nl nr
        | nr <= r   = sum
        | r > nl    = (sumof lc r nl m) + (sumof rc r m nr)
        | otherwise = 0
            where m = (nl + nr) `div` 2

    -- Increase the value at x by 1. The range of the node is [nl, nr)
    increase :: Node -> Int -> Int -> Int -> Node
    increase (Leaf val) x nl nr = Leaf (val + 1)
    increase (Branch sum lc rc) x nl nr
        | x < m     = Branch (sum + 1) (increase lc x nl m) rc
        | otherwise = Branch (sum + 1) lc (increase rc x m nr)
            where m = (nl + nr) `div` 2

    -- signature said it all
    tonodes :: Int -> [Int] -> [Node]
    tonodes n = reverse . tonodes' . reverse
        where
            tonodes' :: [Int] -> [Node]
            tonodes' (h : t) = increase h' h 0 n : s'
                where s'@(h' : _) = tonodes' t
            tonodes' _ = [create 0 n]

    -- find the minimum m in [l, r] such that (predicate m) is True
    binarysearch :: (Int -> Bool) -> Int -> Int -> Int
    binarysearch predicate l r
        | l == r      = r
        | predicate m = binarysearch predicate l m
        | otherwise   = binarysearch predicate (m + 1) r
            where m = (l + r) `div` 2

    -- main, literally
    main :: IO ()
    main = do
        [n, m] <- fmap (map read . words) getLine
        nodes <- fmap (listArray (0, n) . tonodes n . map (subtract 1) . map read . words) getLine
        replicateM_ m $ query n nodes
        where
            query :: Int -> NodeArray -> IO ()
            query n nodes = do
                [p, k] <- fmap (map read . words) getLine
                print $ binarysearch (ok nodes n p k) 0 n
                where
                    ok :: NodeArray -> Int -> Int -> Int -> Int -> Bool
                    ok nodes n p k s =
                        (sumof (nodes ! min (p + s + 1) n) s 0 n) - (sumof (nodes ! max (p - s) 0) s 0 n) >= k

(This is exactly the same code as in my Code Review question, but this question is about a different issue.)

This is my input generator, in C++:

    #include <cstdio>
    #include <cstdlib>
    using namespace std;

    int main (int argc, char * argv[]) {
        srand(1827);
        int n = 100000;
        if(argc > 1)
            sscanf(argv[1], "%d", &n);
        printf("%d %d\n", n, n);
        for(int i = 0; i < n; i++)
            printf("%d%c", rand() % n + 1, i == n - 1 ? '\n' : ' ');
        for(int i = 0; i < n; i++) {
            int p = rand() % n;
            int k = rand() % n + 1;
            printf("%d %d\n", p, k);
        }
    }

If you do not have a C++ compiler, this is the result of ./gen.exe 1000 .

This is the result of execution on my computer:

    $ ghc --version
    The Glorious Glasgow Haskell Compilation System, version 7.8.3
    $ ghc -fforce-recomp 1827.hs
    [1 of 1] Compiling Main             ( 1827.hs, 1827.o )
    Linking 1827.exe ...
    $ time ./gen.exe 1000 | ./1827.exe > /dev/null
    real    0m0.088s
    user    0m0.015s
    sys     0m0.015s
    $ ghc -fforce-recomp -O 1827.hs
    [1 of 1] Compiling Main             ( 1827.hs, 1827.o )
    Linking 1827.exe ...
    $ time ./gen.exe 1000 | ./1827.exe > /dev/null
    real    0m2.969s
    user    0m0.000s
    sys     0m0.045s

And this is a heap profile summary:

    $ ghc -fforce-recomp -rtsopts ./1827.hs
    [1 of 1] Compiling Main             ( 1827.hs, 1827.o )
    Linking 1827.exe ...
    $ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null
          70,207,096 bytes allocated in the heap
           2,112,416 bytes copied during GC
             613,368 bytes maximum residency (3 sample(s))
              28,816 bytes maximum slop
                   3 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0       132 colls,     0 par    0.00s    0.00s     0.0000s    0.0004s
      Gen  1         3 colls,     0 par    0.00s    0.00s     0.0006s    0.0010s

      INIT    time    0.00s  (  0.00s elapsed)
      MUT     time    0.03s  (  0.03s elapsed)
      GC      time    0.00s  (  0.01s elapsed)
      EXIT    time    0.00s  (  0.00s elapsed)
      Total   time    0.03s  (  0.04s elapsed)

      %GC     time       0.0%  (14.7% elapsed)

      Alloc rate    2,250,213,011 bytes per MUT second

      Productivity 100.0% of total user, 83.1% of total elapsed

    $ ghc -fforce-recomp -O -rtsopts ./1827.hs
    [1 of 1] Compiling Main             ( 1827.hs, 1827.o )
    Linking 1827.exe ...
    $ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null
       6,009,233,608 bytes allocated in the heap
         622,682,200 bytes copied during GC
             443,240 bytes maximum residency (505 sample(s))
              48,256 bytes maximum slop
                   3 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0     10945 colls,     0 par    0.72s    0.63s     0.0001s    0.0004s
      Gen  1       505 colls,     0 par    0.16s    0.13s     0.0003s    0.0005s

      INIT    time    0.00s  (  0.00s elapsed)
      MUT     time    2.00s  (  2.13s elapsed)
      GC      time    0.87s  (  0.76s elapsed)
      EXIT    time    0.00s  (  0.00s elapsed)
      Total   time    2.89s  (  2.90s elapsed)

      %GC     time      30.3%  (26.4% elapsed)

      Alloc rate    3,009,412,603 bytes per MUT second

      Productivity  69.7% of total user, 69.4% of total elapsed
optimization haskell ghc compiler-bug
Apr 02 '15 at 2:29
1 answer

I think the time has come to answer this question.

What happened to your code with -O

Let me zoom in on your main function and rewrite it slightly:

    main :: IO ()
    main = do
        [n, m] <- fmap (map read . words) getLine
        line <- getLine
        let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line
        replicateM_ m $ query n nodes

Clearly, the intention here is that the NodeArray is created once and then used in each of the m invocations of query.

Unfortunately, GHC effectively transforms this code into

    main = do
        [n, m] <- fmap (map read . words) getLine
        line <- getLine
        replicateM_ m $ do
            let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line
            query n nodes

and you can immediately see the problem here.
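The same shape of program can be reproduced in isolation. The following is a minimal, hypothetical sketch (the names expensive and v are mine, not from the code above): a pure value is bound once outside a repeated IO action, and the question is whether the optimiser pushes its binding inside the loop, recomputing it per iteration.

```haskell
import Control.Monad (replicateM_)

-- stands in for building the segment tree: pure but costly
expensive :: Int -> Integer
expensive n = sum (map fromIntegral [1 .. n])

main :: IO ()
main = do
    -- intended: v is computed once and shared across all iterations
    let v = expensive 100000
    -- with the state hack, GHC may float the binding into the
    -- replicated action, recomputing v on every iteration
    replicateM_ 3 (print v)
```

Profiling such a program with +RTS -s under -O, as done above, is the easiest way to see whether the sharing actually survived.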

What is the state hack, and why does it ruin my program's performance?

The culprit is the state hack, which says (roughly): "When something is of type IO a, assume it is called only once." The official documentation is not much more elaborate:

-fno-state-hack

Turn off the "state hack" whereby any lambda with a State# token as argument is considered to be single-entry, hence it is considered okay to inline things inside it. This can improve performance of IO and ST monad code, but it runs the risk of reducing sharing.

Roughly speaking, the idea is this: if you define a function with an IO type and a where clause, e.g.

    foo x = do
        putStrLn y
        putStrLn y
      where y = ...x...

something of type IO a can be thought of as something of type RealWorld -> (a, RealWorld). In that view, the above becomes (roughly)

    foo x =
        let y = ...x... in
        \world1 ->
            let (world2, ()) = putStrLn y world1 in
            let (world3, ()) = putStrLn y world2 in
            (world3, ())

A call to foo will (usually) look like foo argument world. But the definition of foo takes only one argument; the other is consumed only later, by a local lambda expression! That makes for a very slow call to foo. It would be much faster if the code looked like this:

    foo x world1 =
        let y = ...x... in
        let (world2, ()) = putStrLn y world1 in
        let (world3, ()) = putStrLn y world2 in
        (world3, ())
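The world-passing view can be made concrete with a toy model. Everything here is illustrative only: GHC's real RealWorld is an unboxed, zero-width token, not a list of strings, and putStrLn' and World are names I made up for the sketch.

```haskell
-- toy model: the "world" is just a log of the output produced so far
type World = [String]

-- a pretend putStrLn in the world-passing style: it returns the new
-- world paired with the action's () result, matching the (world, ())
-- tuple order used in the pseudocode above
putStrLn' :: String -> World -> (World, ())
putStrLn' s w = (w ++ [s], ())

-- the eta-expanded foo from above, made runnable
foo :: String -> World -> (World, ())
foo x world1 =
    let y = x ++ "!" in
    let (world2, ()) = putStrLn' y world1 in
    let (world3, ()) = putStrLn' y world2 in
    (world3, ())

main :: IO ()
main = mapM_ putStrLn (fst (foo "hello" []))
```

Running foo against an empty initial world threads the log through both calls, so the final world contains the line "hello!" twice.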

This is called eta-expansion, and it is done for various reasons (e.g. by analysing the function's definition, by checking how it is called, and, in this case, by type-directed heuristics).

Unfortunately, this degrades performance if the call to foo actually has the form let fooArgument = foo argument, i.e. with an argument but no world passed (yet). In the original code, if fooArgument is then used multiple times, y is still computed only once and shared. In the eta-expanded code, y is recomputed every time, and this is exactly what happened to your nodes.
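The difference between the two shapes can be observed directly on pure code (the helper names sharedAdd and unsharedAdd are mine). Debug.Trace.trace logs to stderr each time the let-bound value is actually computed; when this sketch is run without optimisation, the partially applied version traces once, while the eta-expanded version traces once per call.

```haskell
import Debug.Trace (trace)

-- partial application: y lives in the returned closure,
-- so it is computed at most once and shared by all calls
sharedAdd :: Int -> (Int -> Int)
sharedAdd x = let y = trace "computing y" (x * x) in \z -> y + z

-- the eta-expanded version: y is rebound on every call,
-- so it is recomputed each time
unsharedAdd :: Int -> Int -> Int
unsharedAdd x z = let y = trace "computing y" (x * x) in y + z

main :: IO ()
main = do
    let f = sharedAdd 10
    print (map f [1, 2, 3])
    print (map (unsharedAdd 10) [1, 2, 3])
```

Both versions produce the same results; only the amount of recomputation differs, which is why the transformation is semantically sound but can still be a performance disaster.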

Can things be fixed?

Maybe. See #9388 for an attempt to do so. The problem with fixing it is that it will cost performance in the many cases where the transformation happens to be fine, even though the compiler cannot know that for sure. And there are probably cases where it technically is not fine, i.e. sharing is lost, but the transformation is still beneficial, because the speedup from the faster call outweighs the extra cost of recomputation. So it is not clear where to go from here.

Jun 02 '15 at 17:58


