I have a program that I'm trying to parallelize (full gist with executable code here).
I profiled and found that most of the time was spent in findNearest , which is essentially a simple foldr over a large Data.Map .
```haskell
findNearest :: RGB -> M.Map k RGB -> (k, Word32)
findNearest rgb m0 =
    M.foldrWithKey' minDistance (k0, distance rgb r0) m0
  where
    (k0, r0) = M.findMin m0
    minDistance k r x@(_, d1) =
        -- Euclidean distance in RGB-space
        let d0 = distance rgb r
        in if d0 < d1 then (k, d0) else x
```
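(The actual definitions of `RGB` and `distance` are in the linked gist; for a self-contained snippet, they are along these lines — the field layout and the squared-distance choice here are my simplification:)

```haskell
import Data.Word (Word8, Word32)

-- Assumed pixel type: a strict triple of 8-bit channels.
data RGB = RGB !Word8 !Word8 !Word8

-- Assumed metric: squared Euclidean distance in RGB-space, which
-- preserves the ordering that findNearest's comparison relies on.
distance :: RGB -> RGB -> Word32
distance (RGB r1 g1 b1) (RGB r2 g2 b2) =
    sq r1 r2 + sq g1 g2 + sq b1 b2
  where
    sq a b = let d = fromIntegral a - fromIntegral b :: Int
             in fromIntegral (d * d)
```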
parFindNearest is supposed to run findNearest in parallel over the larger subtrees of the Map.
```haskell
parFindNearest :: NFData k => RGB -> M.Map k RGB -> (k, Word32)
parFindNearest rgb =
    minimumBy (comparing snd)
  . parMap rdeepseq (findNearest rgb)
  . M.splitRoot
```
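(For what it's worth, `M.splitRoot` only breaks the map into its left subtree, the root binding, and the right subtree, in key order, so each call to `parFindNearest` creates at most a handful of sparks. A quick check of that behavior — the example map is mine:)

```haskell
import qualified Data.Map as M

-- splitRoot returns the subtrees under the root plus the root binding
-- itself, in key order; concatenating the pieces gives back the
-- original map's contents.
pieces :: [M.Map Int Int]
pieces = M.splitRoot (M.fromList [(k, k * 10) | k <- [1 .. 7]])
```

Running `mapM_ (print . M.toList) pieces` shows the key-ordered fragments.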
Unfortunately, GHC garbage-collects most of my sparks before they are converted into useful parallelism.
Here's the result of compiling with ghc -O2 -threaded and running with +RTS -s -N2:
```
     839,892,616 bytes allocated in the heap
     123,999,464 bytes copied during GC
       5,320,184 bytes maximum residency (19 sample(s))
       3,214,200 bytes maximum slop
              16 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1550 colls,  1550 par    0.23s    0.11s     0.0001s    0.0004s
  Gen  1        19 colls,    18 par    0.11s    0.06s     0.0030s    0.0052s

  Parallel GC work balance: 16.48% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)

  SPARKS: 215623 (1318 converted, 0 overflowed, 0 dud, 198111 GC'd, 16194 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    3.72s  (  3.66s elapsed)
  GC      time    0.34s  (  0.17s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    4.07s  (  3.84s elapsed)

  Alloc rate    225,726,318 bytes per MUT second

  Productivity  91.6% of total user, 97.1% of total elapsed

gc_alloc_block_sync: 9862
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2103
```
As you can see, most sparks are GC'd or fizzle before being converted. I tried experimenting with different strictness: having findNearest return a strict pair data type instead of a tuple, and using rdeepseq from Control.Parallel.Strategies, but my sparks are still GC'd.
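(The strict pair I substituted for the `(k, Word32)` tuple looks roughly like this; the type name is my own:)

```haskell
import Control.DeepSeq (NFData (..))
import Data.Word (Word32)

-- A strict replacement for the (k, Word32) result tuple: both fields
-- are forced as soon as a P is constructed.
data P k = P !k !Word32

instance NFData k => NFData (P k) where
  rnf (P k _) = rnf k  -- the Word32 field is already forced by its bang
</code>
```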
I'd like to know
- Why are my sparks GC'd before being converted?
- How can I change my program to take advantage of parallelism?