I am trying to add parallelism to a program that converts a .bmp to a grayscale .bmp. I consistently see the parallel code run 2-4x slower than the sequential version. I've been tuning the parBuffer and chunk sizes and still can't explain it. Looking for guidance.
The whole source file used here: http://lpaste.net/106832
I use Codec.BMP to read in the pixel stream, represented as type RGBA = (Word8, Word8, Word8, Word8) . To convert to grayscale, I simply map a luma (brightness) conversion over every pixel.
The sequential implementation is literally:
toGray :: [RGBA] -> [RGBA]
toGray x = map luma x
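The luma function itself isn't shown in the excerpt. For reference, a typical definition (an assumption on my part, using the standard Rec. 601 weights and passing alpha through unchanged) would look something like this:

```haskell
import Data.Word (Word8)

type RGBA = (Word8, Word8, Word8, Word8)

-- Hypothetical luma: Rec. 601 brightness weights, computed in Double
-- to avoid Word8 overflow; the alpha channel is left untouched.
luma :: RGBA -> RGBA
luma (r, g, b, a) = (y, y, y, a)
  where
    y = round (0.299 * fromIntegral r
             + 0.587 * fromIntegral g
             + 0.114 * fromIntegral b :: Double)
```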
Test input .bmp - 5184 x 3456 (71.7 MB).
The serial implementation runs in ~10 s, ~550 ns/pixel. The ThreadScope profile looks clean:

Why is it so fast? I suspect it has something to do with lazy ByteStrings (although Codec.BMP uses a strict ByteString; is there an implicit conversion here?) and with fusion.
Adding Parallelism
My first attempt at adding parallelism went through parList . Oh boy. The program consumed ~4-5 GB of memory and thrashed the system.
Then I read the section "Parallelizing Lazy Streams with parBuffer" in Simon Marlow's O'Reilly book (Parallel and Concurrent Programming in Haskell) and tried parBuffer with a large size. This still did not produce the desired result. The individual sparks were incredibly small.
Then I tried to increase the spark size by chunking the lazy list, keeping parBuffer for the parallelism:
toGrayPar :: [RGBA] -> [RGBA]
toGrayPar x = concat $ (withStrategy (parBuffer 500 rpar) . map (map luma))
                       (chunk 8000 x)

chunk :: Int -> [a] -> [[a]]
chunk n [] = []
chunk n xs = as : chunk n bs
  where (as,bs) = splitAt (fromIntegral n) xs
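One detail worth checking (an observation about the code above, not a confirmed diagnosis): rpar only evaluates each chunk to weak head normal form, i.e. to the first cons cell, so each spark does almost no work and the consumer ends up evaluating the list anyway, which would be consistent with a large fizzled-spark count. A sketch of the same pipeline using rdeepseq to force each chunk fully, parameterized over the per-pixel function since luma is not shown here:

```haskell
import Control.Parallel.Strategies (parBuffer, rdeepseq, withStrategy)
import Data.Word (Word8)

type RGBA = (Word8, Word8, Word8, Word8)

-- Sketch only: same structure as toGrayPar, but each spark fully
-- evaluates its chunk (rdeepseq) instead of stopping at WHNF (rpar).
-- Tuples of Word8 already have the NFData instance rdeepseq needs.
toGrayPar' :: (RGBA -> RGBA) -> [RGBA] -> [RGBA]
toGrayPar' f = concat
             . withStrategy (parBuffer 500 rdeepseq)
             . map (map f)
             . chunk 8000

chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk n xs = as : chunk n bs
  where (as, bs) = splitAt n xs
```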
But this still does not give the desired performance:
  18,934,235,760 bytes allocated in the heap
  15,274,565,976 bytes copied during GC
     639,588,840 bytes maximum residency (27 sample(s))
     238,163,792 bytes maximum slop
            1910 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     35277 colls, 35277 par   19.62s   14.75s     0.0004s    0.0234s
  Gen  1        27 colls,    26 par   13.47s    7.40s     0.2741s    0.5764s

  Parallel GC work balance: 30.76% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)

  SPARKS: 4480 (2240 converted, 0 overflowed, 0 dud, 2 GC'd, 2238 fizzled)

  INIT    time    0.00s  (  0.01s elapsed)
  MUT     time   14.31s  ( 14.75s elapsed)
  GC      time   33.09s  ( 22.15s elapsed)
  EXIT    time    0.01s  (  0.12s elapsed)
  Total   time   47.41s  ( 37.02s elapsed)

  Alloc rate    1,323,504,434 bytes per MUT second

  Productivity  30.2% of total user, 38.7% of total elapsed

gc_alloc_block_sync: 7433188
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 1017408

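The stats show GC time (~33 s) dominating mutator time (~14 s), with only ~30% parallel GC work balance, so GC pressure alone could explain much of the slowdown. Before blaming the strategy, it may be worth enlarging the nursery; a hypothetical invocation (the program and file names are placeholders):

```shell
# -A64m enlarges the per-core allocation area, which often reduces
# Gen 0 collections for list-heavy streaming code; -s prints GC stats.
./grayscale input.bmp output.bmp +RTS -N2 -A64m -s
```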
How can I better reason about what is going on here?