Haskell: suboptimal parallel GC work balance, no speedup from parallel execution

The description of my problem is almost identical to the one in this post, and while I think I understand the solution suggested there, I don't see how it relates to my problem, if it does at all.

Here is my sample program

{-# LANGUAGE BangPatterns #-}

import System.Random (randoms, mkStdGen)
import Control.Parallel.Strategies
import Control.DeepSeq (NFData)
import Data.List

data Point = Point !Double !Double

fmod :: Double -> Double -> Double
fmod a b
  | a < 0     = b - fmod (abs a) b
  | otherwise = if a < b
                  then a
                  else let q = a / b
                       in b * (q - fromIntegral (floor q :: Int))

standardMap :: Double -> Point -> Point
standardMap k (Point q p) =
  Point (fmod (q + p) (2 * pi)) (fmod (p + k * sin q) (2 * pi))

iterate' gen !p = p : (iterate' gen $ gen p)

iterateN :: (Point -> Point) -> [Int] -> Point -> [Point]
iterateN _   []       p = [p]
iterateN gen (dn:dns) p =
  p : (iterateN gen dns $ (head . drop dn) $ iterate' gen p)

ensemble :: [Point]
ensemble = zipWith Point qs ps
  where
    qs = randoms (mkStdGen 42)
    ps = randoms (mkStdGen 21)

main =
  let dns  = take 100 $ repeat 10000
      ens  = take 1000 ensemble
      obs  = \(Point q p) -> p^2 - q^2
      work = map obs . iterateN (standardMap 7.0) dns
      ps   = parMap rdeepseq work ens
  in putStrLn $ show (foldl' (+) 0 $ map (foldl' (+) 0) ps)

The problem is that this program does not scale well with the number of threads. For example, on Debian 3.2.46-1 x86_64 with GHC 7.4.1 I get

$ ghc -O3 --make stmap.hs -threaded

$ time ./stmap +RTS -N1
real    1m9.791s
user    1m9.448s
sys     0m0.208s

$ time ./stmap +RTS -N2
real    0m36.981s
user    1m13.113s
sys     0m0.656s

$ time ./stmap +RTS -N4
real    0m23.110s
user    1m31.310s
sys     0m0.792s

$ time ./stmap +RTS -N8
real    0m20.537s
user    2m21.921s
sys     0m21.017s

These numbers can fluctuate considerably between runs. The only indicator I have found of where the problem might lie is the suboptimal parallel GC work balance, for example:

$ ./stmap +RTS -N8 -sstderr 1>/dev/null

 112,032,905,392 bytes allocated in the heap
      59,112,296 bytes copied during GC
         971,520 bytes maximum residency (35 sample(s))
          96,416 bytes maximum slop
               8 MB total memory in use (1 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     27032 colls, 27031 par    6.49s    0.81s     0.0000s    0.0015s
  Gen  1        35 colls,    35 par    0.39s    0.05s     0.0014s    0.0028s

  Parallel GC work balance: 4.05 (6799831 / 1680927, ideal 8)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :   14.81s    ( 14.84s)       0.96s    (  0.97s)
  Task  1 (worker) :    0.00s    ( 15.81s)       0.00s    (  0.00s)
  Task  2 (bound)  :    0.03s    ( 15.80s)       0.01s    (  0.01s)
  Task  3 (worker) :   14.72s    ( 14.82s)       0.98s    (  0.99s)
  Task  4 (worker) :   14.70s    ( 14.84s)       0.96s    (  0.97s)
  Task  5 (worker) :   14.69s    ( 14.82s)       0.98s    (  0.99s)
  Task  6 (worker) :   14.69s    ( 14.82s)       0.98s    (  0.99s)
  Task  7 (worker) :   14.72s    ( 14.81s)       0.99s    (  1.00s)
  Task  8 (worker) :   14.76s    ( 14.83s)       0.97s    (  0.98s)
  Task  9 (worker) :   14.76s    ( 14.81s)       1.00s    (  1.00s)

  SPARKS: 1000 (1000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time  118.87s  ( 14.95s elapsed)
  GC      time    6.87s  (  0.86s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  125.74s  ( 15.81s elapsed)

  Alloc rate    942,488,358 bytes per MUT second

  Productivity  94.5% of total user, 751.8% of total elapsed

gc_alloc_block_sync: 1130880
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 175

Here it is about 4, but in the very next run it was much worse, about 2:

$ ./stmap +RTS -N8 -sstderr
60364.38698300099

 112,033,885,088 bytes allocated in the heap
   4,626,963,592 bytes copied during GC
       2,101,264 bytes maximum residency (1846 sample(s))
         652,528 bytes maximum slop
              13 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     25497 colls, 25496 par   29.42s    3.70s     0.0001s    0.0022s
  Gen  1      1846 colls,  1846 par   17.97s    2.26s     0.0012s    0.0071s

  Parallel GC work balance: 2.00 (577773617 / 288947149, ideal 8)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :   14.86s    ( 15.03s)       6.07s    (  6.10s)
  Task  1 (worker) :    0.00s    ( 21.13s)       0.00s    (  0.00s)
  Task  2 (bound)  :    0.03s    ( 21.11s)       0.02s    (  0.02s)
  Task  3 (worker) :   14.92s    ( 14.99s)       6.06s    (  6.14s)
  Task  4 (worker) :   14.88s    ( 15.02s)       6.07s    (  6.11s)
  Task  5 (worker) :   14.91s    ( 15.02s)       6.09s    (  6.12s)
  Task  6 (worker) :   14.92s    ( 15.04s)       6.07s    (  6.10s)
  Task  7 (worker) :   14.86s    ( 15.03s)       6.03s    (  6.11s)
  Task  8 (worker) :   14.86s    ( 15.03s)       6.07s    (  6.10s)
  Task  9 (worker) :   14.92s    ( 15.00s)       6.11s    (  6.13s)

  SPARKS: 1000 (1000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time  120.36s  ( 15.18s elapsed)
  GC      time   47.39s  (  5.96s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  167.75s  ( 21.13s elapsed)

  Alloc rate    930,821,901 bytes per MUT second

  Productivity  71.7% of total user, 569.5% of total elapsed

gc_alloc_block_sync: 1253157
whitehole_spin: 21
gen[0].sync: 4
gen[1].sync: 19789

What is responsible for these run-to-run fluctuations? And, most importantly, how can I improve the parallel GC work balance, both in this specific example and in general?

performance multithreading haskell ghc
1 answer

The variation is probably due to the fact that running with +RTS -Nn creates one bound thread plus n worker threads (see the task list in the output above), so one of the workers has to share a physical core with the bound thread and they interfere with each other. It is therefore generally recommended to pass a number smaller than the total number of available physical cores as the argument to +RTS -N.
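As a minimal sketch of that idea (my own illustration, not from the original answer, and assuming GHC 7.4 or later, where setNumCapabilities and getNumProcessors are available), the program can pick the capability count itself instead of relying on the +RTS -N argument, reserving one core for the bound main thread:

import Control.Concurrent (setNumCapabilities, getNumCapabilities)
import GHC.Conc (getNumProcessors)

main :: IO ()
main = do
  cores <- getNumProcessors               -- number of cores the RTS can see
  setNumCapabilities (max 1 (cores - 1))  -- leave one core free for the bound main thread
  caps <- getNumCapabilities
  putStrLn $ "using " ++ show caps ++ " of " ++ show cores ++ " cores"
  -- the parMap-based work from the question would run here

The program still has to be compiled with -threaded; setNumCapabilities then adjusts the capability count at runtime regardless of what -N value was given on the command line.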

Another potential problem is load balancing: if there is a load imbalance you may need to split the work differently (a thread profile would help diagnose this). Look at … for more details on tuning.
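As one possible way of splitting the work differently (an illustration of my own, not something the answer prescribes), the per-point parMap could be replaced with parListChunk from Control.Parallel.Strategies, so that each spark covers a chunk of the ensemble rather than a single trajectory; chunkSize is a tuning knob to adjust while watching the thread profile:

import Control.Parallel.Strategies (withStrategy, parListChunk, rdeepseq)
import Data.List (foldl')

-- Hypothetical replacement for 'ps = parMap rdeepseq work ens' in the question:
-- evaluate the mapped list in chunks, one spark per chunk, to coarsen the grain.
chunkedSum :: (a -> [Double]) -> [a] -> Double
chunkedSum work ens =
  let chunkSize = 50   -- tuning knob: ensemble members per spark
      ps = withStrategy (parListChunk chunkSize rdeepseq) (map work ens)
  in foldl' (+) 0 (map (foldl' (+) 0) ps)

Whether chunking actually helps here depends on how uniform the per-point work is; with 1000 equally sized trajectories the existing sparks are already fairly coarse, so this is mainly worth trying if the thread profile shows capabilities sitting idle.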

