I wrote a ByteString parser using the Attoparsec library:
    import qualified Data.ByteString.Char8 as B
    import qualified Data.Attoparsec.ByteString.Char8 as P

    parseComplex :: P.Parser Complex
My intention was to use this to parse large (> 5 GB) files, so the implementation applies this parser lazily:
    import Control.Applicative (liftA)
    import qualified Data.ByteString.Lazy.Char8 as LB
    import qualified Data.Attoparsec.ByteString.Lazy as LP

    extr :: LP.Result a -> a

    main = do
        rawData <- liftA LB.words (LB.readFile "/mnt/hgfs/outputs/out.txt")
        let formatedData = map (extr . LP.parse parseComplex) rawData
        ...
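extr just unwraps a successful parse; a minimal sketch, assuming a failed parse should simply abort, would be:

    extr :: LP.Result a -> a
    extr (LP.Done _ r)     = r       -- return the parsed value
    extr (LP.Fail _ _ msg) = error msg  -- abort on a parse failure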
Executing this on a test file with the -O2 and -s flags, I see:
       3,509,019,048 bytes allocated in the heap
           2,086,240 bytes copied during GC
              58,256 bytes maximum residency (30 sample(s))
             126,240 bytes maximum slop
                   2 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0      6737 colls,     0 par    0.03s    0.03s     0.0000s    0.0001s
      Gen  1        30 colls,     0 par    0.00s    0.00s     0.0001s    0.0002s

      INIT    time    0.00s  (  0.00s elapsed)
      MUT     time    0.83s  (  0.83s elapsed)
      GC      time    0.04s  (  0.04s elapsed)
      EXIT    time    0.00s  (  0.00s elapsed)
      Total   time    0.87s  (  0.86s elapsed)

      %GC     time       4.3%  (4.3% elapsed)

      Alloc rate    4,251,154,493 bytes per MUT second

      Productivity  95.6% of total user, 95.8% of total elapsed
Since I am simply mapping a function over a list, I thought this code might benefit from parallelization. I had never done anything like this in Haskell before, but after fiddling with the Control.Monad.Par library, I wrote a simple, naive, static partitioning function that I thought would run my parsing in parallel:
    import Control.DeepSeq (force)
    import Control.Monad.Par

    parseMap :: [LB.ByteString] -> [Complex]
    parseMap x = runPar $ do
        let (as, bs) = force $ splitAt (length x `div` 2) x
        a <- spawnP $ map (extr . LP.parse parseComplex) as
        b <- spawnP $ map (extr . LP.parse parseComplex) bs
        c <- get a
        d <- get b
        return $ c ++ d
I did not expect much from this function, but parallel performance was far worse than the sequential computation. Here is the main function and the results, compiled with -O2 -threaded -rtsopts and executed with +RTS -s -N2:
    main = do
        rawData <- liftA LB.words (LB.readFile "/mnt/hgfs/outputs/out.txt")
        let formatedData = parseMap rawData
        ...
       3,641,068,984 bytes allocated in the heap
         356,490,472 bytes copied during GC
          82,325,144 bytes maximum residency (10 sample(s))
          14,182,712 bytes maximum slop
                 253 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0      4704 colls,  4704 par    0.50s    0.25s     0.0001s    0.0006s
      Gen  1        10 colls,     9 par    0.57s    0.29s     0.0295s    0.1064s

      Parallel GC work balance: 19.77% (serial 0%, perfect 100%)

      TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

      SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

      INIT    time    0.00s  (  0.00s elapsed)
      MUT     time    1.11s  (  0.72s elapsed)
      GC      time    1.07s  (  0.54s elapsed)
      EXIT    time    0.02s  (  0.02s elapsed)
      Total   time    2.20s  (  1.28s elapsed)

      Alloc rate    3,278,811,516 bytes per MUT second

      Productivity  51.2% of total user, 88.4% of total elapsed

    gc_alloc_block_sync: 149514
    whitehole_spin: 0
    gen[0].sync: 0
    gen[1].sync: 32
As you can see, there is a lot of garbage-collector activity in the parallel case, and the loads are quite poorly balanced. I profiled the execution using ThreadScope and got the following:
[ThreadScope profile of the parallel run]
I can see very clearly that the garbage collector running on HEC 1 interrupts the calculation on HEC 2, and that HEC 1 clearly has less work than HEC 2. As a test, I tried adjusting the relative sizes of the two list sections to rebalance the load, but I saw no perceivable difference in the program's behavior after doing so. I have also tried running this on inputs of different sizes, with larger minimum heap allocations, and using the parMap function included in the Control.Monad.Par library, but none of these efforts affected the result.
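For reference, the parMap version is essentially a one-liner; this sketch assumes an NFData instance for Complex, which parMap requires and which is not shown above:

    -- Same work as parseMap, but with one task per list element:
    parseMap' :: [LB.ByteString] -> [Complex]
    parseMap' = runPar . parMap (extr . LP.parse parseComplex)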
I assume there is a space leak somewhere, possibly from the let (as, bs) = ... assignment, since memory usage is much higher in the parallel case. Is that the problem? If so, how can I resolve it?
EDIT: Splitting up the input manually as suggested, I now see some slight improvements in the timings. For a 6m-point input file, I manually split the file into two 3m-point files and into three 2m-point files, then reran the code using 2 and 3 cores respectively. Rough timings:
1 core: 6.5s
2 cores: 5.7s
3 cores: 4.5s
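For concreteness, a sketch of the driver for the split runs; the file names here are placeholders rather than my actual paths, and spawnP again requires NFData Complex:

    -- One spawned task per pre-split input file:
    main :: IO ()
    main = do
        chunks <- mapM (fmap LB.words . LB.readFile)
                       [ "/mnt/hgfs/outputs/out1.txt"
                       , "/mnt/hgfs/outputs/out2.txt" ]
        let formatedData = runPar $ do
                ivars <- mapM (spawnP . map (extr . LP.parse parseComplex)) chunks
                fmap concat (mapM get ivars)
        print (length formatedData)  -- stand-in for the real processing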
The new ThreadScope profile looks like this:
[ThreadScope profile after manual splitting]
The strange behavior towards the beginning is gone, but it still looks to me as if there are some obvious load-balancing problems.