Profiling multithreaded Haskell performance - no speedup when using parallel strategies

After trying to add multithreading to a Haskell program, I noticed that performance did not improve at all. Chasing it up, I got the following data from ThreadScope:

Graph 1

Green indicates running, while orange indicates garbage collection.

Graph 2

Graph 3

Here, the vertical green bars indicate spark creation, the blue bars indicate parallel GC requests, and the light blue bars indicate thread creation.

Graph 4

The labels are: spark created, requesting parallel GC, creating thread n, and stealing spark from cap 2.

On average, I get only about 25% activity across 4 cores, which is no improvement at all over the single-threaded program.

Of course, the question is not valid without a description of the actual program. Essentially, I create a traversable data structure (for example, a tree), then fmap a function over it, and then feed the result into an image writing routine (which explains the clearly single-threaded segment at the end of the program run, after the 15-second mark). Both the construction and the fmap of the function take considerable time to run, although the second takes slightly longer.

The above graphs were made by adding a parTraversable strategy for this data structure before it is consumed by the image writer. I also tried using toList on the data structure and then applying various parallel list strategies (parList, parListChunk, parBuffer), but the results were the same every time for a wide range of parameters (even using large chunks).
I also tried to fully evaluate the traversable data structure before fmapping the function over it, but the same problem came up.
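
For readers who want something concrete to experiment with, the two approaches described above can be sketched roughly as follows. This is not the original program: the tree shape, the function `expensive`, and the chunk size 64 are hypothetical placeholders, and the sketch assumes the `parallel` and `containers` packages are available.

```haskell
import Control.Parallel.Strategies (parListChunk, parTraversable, rdeepseq, using)
import Data.Foldable (toList)
import Data.Tree (Tree, unfoldTree)

-- Hypothetical stand-in for the costly function that is fmapped.
expensive :: Int -> Int
expensive n = sum [1 .. n * 1000]

-- Hypothetical traversable structure: a binary tree labelled 1..1023.
structure :: Tree Int
structure = unfoldTree (\n -> (n, if n < 512 then [2 * n, 2 * n + 1] else [])) 1

-- One spark per tree element, each fully evaluated.
viaTraversable :: Tree Int
viaTraversable = fmap expensive structure `using` parTraversable rdeepseq

-- Flatten first, then evaluate the list in chunks of 64 elements.
viaChunks :: [Int]
viaChunks = map expensive (toList structure) `using` parListChunk 64 rdeepseq

main :: IO ()
main = do
  print (sum viaTraversable)
  print (sum viaChunks)
```

Both variants compute the same values; they differ only in how the work is divided into sparks. Parallel execution needs a build with `-threaded` and a run with `+RTS -N`.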

Here are some additional statistics (for another launch of the same program):

 5,702,829,756 bytes allocated in the heap
   385,998,024 bytes copied during GC
    55,819,120 bytes maximum residency (8 sample(s))
     1,392,044 bytes maximum slop
           133 MB total memory in use (0 MB lost due to fragmentation)

                                   Tot time (elapsed)  Avg pause  Max pause
 Gen  0     10379 colls, 10378 par    5.20s    1.40s     0.0001s    0.0327s
 Gen  1         8 colls,     8 par    1.01s    0.25s     0.0319s    0.0509s

 Parallel GC work balance: 1.24 (96361163 / 77659897, ideal 4)

                       MUT time (elapsed)       GC time  (elapsed)
 Task  0 (worker) :    0.00s    ( 15.92s)       0.02s    (  0.02s)
 Task  1 (worker) :    0.27s    ( 14.00s)       1.86s    (  1.94s)
 Task  2 (bound)  :   14.24s    ( 14.30s)       1.61s    (  1.64s)
 Task  3 (worker) :    0.00s    ( 15.94s)       0.00s    (  0.00s)
 Task  4 (worker) :    0.25s    ( 14.00s)       1.66s    (  1.93s)
 Task  5 (worker) :    0.27s    ( 14.09s)       1.69s    (  1.84s)

 SPARKS: 595854 (595854 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

 INIT    time    0.00s  (  0.00s elapsed)
 MUT     time   15.67s  ( 14.28s elapsed)
 GC      time    6.22s  (  1.66s elapsed)
 EXIT    time    0.00s  (  0.00s elapsed)
 Total   time   21.89s  ( 15.94s elapsed)

 Alloc rate    363,769,460 bytes per MUT second

 Productivity  71.6% of total user, 98.4% of total elapsed

I am not sure what other useful information I can give to help answer this. Profiling does not show anything interesting: it matches the single-core statistics, except for an added IDLE entry that takes 75% of the time, as expected from the above.

What is happening that prevents useful parallelism?

1 answer

Sorry that I was not able to provide the code in a timely manner to help respondents. It took me a while to uncover the exact location of the problem.

The problem was this: I was fmapping a function

 f :: a -> S b 

over the traversable data structure

 structure :: T a 

where S and T are two Traversable functors.

Then, using parTraversable, I mistakenly wrote

 Compose (fmap f structure) `using` parTraversable rdeepseq 

instead of

 Compose (fmap f structure `using` parTraversable rdeepseq) 

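so I was mistakenly using the Traversable instance of Compose T S (from Data.Functor.Compose) for the parallelism. That instance traverses the innermost elements of type b, so each spark covered a single b rather than a whole result of f.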
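
The difference in granularity can be seen without any parallelism, since `parTraversable` visits exactly the elements that the `Traversable` instance exposes. Here is a minimal sketch using only `base`, with lists standing in for both T and S (an assumption for illustration; the original types are not shown) and a hypothetical helper `visits` that counts the visited elements:

```haskell
import Data.Functor.Compose (Compose (..))
import Data.Monoid (Sum (..))

-- Hypothetical stand-in for a value of type T (S b), with T = S = [].
nested :: [[Int]]
nested = [[1, 2, 3], [4, 5]]

-- Count how many elements a Traversable instance visits, using the
-- writer-like applicative ((,) (Sum Int)) to tally one per element.
visits :: Traversable t => t a -> Int
visits = getSum . fst . traverse (\_ -> (Sum 1, ()))

main :: IO ()
main = do
  print (visits nested)           -- prints 2: one visit per inner list
  print (visits (Compose nested)) -- prints 5: one visit per innermost Int
```

With `parTraversable`, each visit corresponds to one spark, so wrapping in `Compose` before applying the strategy turns a few coarse sparks into many fine-grained ones.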

(This looks like it should have been easy to catch, but it took me a while to extract the above mistake from the code!)
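
For comparison, here is a rough, self-contained sketch of the two groupings side by side. The concrete choices are assumptions and not the original code: T and S are both lists, `f` and `structure` are toy definitions, and the `parallel` package is assumed to be installed. Explicit parentheses are used so that the grouping does not depend on operator fixities.

```haskell
import Control.Parallel.Strategies (parTraversable, rdeepseq, using)
import Data.Functor.Compose (Compose (..))

-- Hypothetical stand-ins for the original types: T = [] and S = [].
f :: Int -> [Int]
f n = [sum [1 .. n * k] | k <- [1 .. 50]]

structure :: [Int]
structure = [1 .. 2000]

-- Mistaken grouping: the strategy is applied to Compose [] [] Int,
-- so each spark forces a single Int inside an inner list.
perInnerElement :: Compose [] [] Int
perInnerElement = Compose (fmap f structure) `using` parTraversable rdeepseq

-- Intended grouping: the strategy is applied to [[Int]],
-- so each spark forces one whole result of f.
perOuterElement :: Compose [] [] Int
perOuterElement = Compose (fmap f structure `using` parTraversable rdeepseq)

main :: IO ()
main = do
  print (sum perInnerElement)
  print (sum perOuterElement)
```

Both expressions yield the same value; only the spark granularity differs. How much the mistaken grouping costs depends on where the work of `f` lies: if producing the S structure itself is expensive, that part is done by the traversing thread before any spark can help.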

Now it looks much better!

Graph 1

Graph 2
