I wrote a ByteString parser using the Attoparsec library:
    import qualified Data.ByteString.Char8 as B
    import qualified Data.Attoparsec.ByteString.Char8 as P

    parseComplex :: P.Parser Complex
My intention was to use this to parse large (> 5 GB) files, so the implementation applies this parser lazily:
    import Control.Applicative (liftA)
    import qualified Data.ByteString.Lazy.Char8 as LB
    import qualified Data.Attoparsec.ByteString.Lazy as LP

    extr :: LP.Result a -> a

    main = do
        rawData <- liftA LB.words (LB.readFile "/mnt/hgfs/outputs/out.txt")
        let formatedData = map (extr . LP.parse parseComplex) rawData
        ...
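extr just unwraps a successful parse; a minimal sketch, assuming a failed parse should simply abort, would be:

    extr :: LP.Result a -> a
    extr (LP.Done _ r)     = r       -- return the parsed value
    extr (LP.Fail _ _ msg) = error msg  -- abort on a parse failure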
Executing this on a test file with the -O2 and -s flags, I see:
       3,509,019,048 bytes allocated in the heap
           2,086,240 bytes copied during GC
              58,256 bytes maximum residency (30 sample(s))
             126,240 bytes maximum slop
                   2 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0      6737 colls,     0 par    0.03s    0.03s     0.0000s    0.0001s
      Gen  1        30 colls,     0 par    0.00s    0.00s     0.0001s    0.0002s

      INIT    time    0.00s  (  0.00s elapsed)
      MUT     time    0.83s  (  0.83s elapsed)
      GC      time    0.04s  (  0.04s elapsed)
      EXIT    time    0.00s  (  0.00s elapsed)
      Total   time    0.87s  (  0.86s elapsed)

      %GC     time       4.3%  (4.3% elapsed)

      Alloc rate    4,251,154,493 bytes per MUT second

      Productivity  95.6% of total user, 95.8% of total elapsed
Since I am simply mapping a function over a list, I thought this code might benefit from parallelization. I had never done anything like this in Haskell before, but after fiddling with the Control.Monad.Par library, I wrote a simple, naive, static partitioning function that I thought would run my parsing in parallel:
    import Control.DeepSeq (force)
    import Control.Monad.Par

    parseMap :: [LB.ByteString] -> [Complex]
    parseMap x = runPar $ do
        let (as, bs) = force $ splitAt (length x `div` 2) x
        a <- spawnP $ map (extr . LP.parse parseComplex) as
        b <- spawnP $ map (extr . LP.parse parseComplex) bs
        c <- get a
        d <- get b
        return $ c ++ d
I did not expect much from this function, but parallel performance was far worse than the sequential computation. Here is the main function and the results, compiled with -O2 -threaded -rtsopts and executed with +RTS -s -N2:
    main = do
        rawData <- liftA LB.words (LB.readFile "/mnt/hgfs/outputs/out.txt")
        let formatedData = parseMap rawData
        ...
       3,641,068,984 bytes allocated in the heap
         356,490,472 bytes copied during GC
          82,325,144 bytes maximum residency (10 sample(s))
          14,182,712 bytes maximum slop
                 253 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0      4704 colls,  4704 par    0.50s    0.25s     0.0001s    0.0006s
      Gen  1        10 colls,     9 par    0.57s    0.29s     0.0295s    0.1064s

      Parallel GC work balance: 19.77% (serial 0%, perfect 100%)

      TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

      SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

      INIT    time    0.00s  (  0.00s elapsed)
      MUT     time    1.11s  (  0.72s elapsed)
      GC      time    1.07s  (  0.54s elapsed)
      EXIT    time    0.02s  (  0.02s elapsed)
      Total   time    2.20s  (  1.28s elapsed)

      Alloc rate    3,278,811,516 bytes per MUT second

      Productivity  51.2% of total user, 88.4% of total elapsed

    gc_alloc_block_sync: 149514
    whitehole_spin: 0
    gen[0].sync: 0
    gen[1].sync: 32
As you can see, there is a lot of garbage-collector activity in the parallel case, and the loads are quite poorly balanced. I profiled the execution using ThreadScope and got the following:
[ThreadScope profile of the parallel run]
I can see very clearly that the garbage collector running on HEC 1 interrupts the calculation on HEC 2, and that HEC 1 clearly has less work than HEC 2. As a test, I tried adjusting the relative sizes of the two list sections to rebalance the load, but I saw no perceivable difference in the program's behavior after doing so. I have also tried running this on inputs of different sizes, with larger minimum heap allocations, and using the parMap function included in the Control.Monad.Par library, but none of these efforts affected the result.
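For reference, the parMap version is essentially a one-liner; this sketch assumes an NFData instance for Complex, which parMap requires and which is not shown above:

    -- Same work as parseMap, but with one task per list element:
    parseMap' :: [LB.ByteString] -> [Complex]
    parseMap' = runPar . parMap (extr . LP.parse parseComplex)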
I assume there is a space leak somewhere, possibly from the let (as, bs) = ... assignment, since memory usage is much higher in the parallel case. Is that the problem? If so, how can I resolve it?
EDIT: Splitting up the input manually as suggested, I now see some slight improvements in the timings. For a 6m-point input file, I manually split the file into two 3m-point files and into three 2m-point files, then reran the code using 2 and 3 cores respectively. Rough timings:
1 core: 6.5s
2 cores: 5.7s
3 cores: 4.5s
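For concreteness, a sketch of the driver for the split runs; the file names here are placeholders rather than my actual paths, and spawnP again requires NFData Complex:

    -- One spawned task per pre-split input file:
    main :: IO ()
    main = do
        chunks <- mapM (fmap LB.words . LB.readFile)
                       [ "/mnt/hgfs/outputs/out1.txt"
                       , "/mnt/hgfs/outputs/out2.txt" ]
        let formatedData = runPar $ do
                ivars <- mapM (spawnP . map (extr . LP.parse parseComplex)) chunks
                fmap concat (mapM get ivars)
        print (length formatedData)  -- stand-in for the real processing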
The new ThreadScope profile looks like this:
[ThreadScope profile after manual splitting]
The strange behavior towards the beginning is gone, but it still looks to me as if there are some obvious load-balancing problems.