After adding parallelism to a Haskell program, I noticed that performance did not improve at all. Investigating, I got the following ThreadScope profile for one run:
[ThreadScope screenshot: green = thread running, orange = garbage collection. Vertical green bars mark spark creation, blue bars mark parallel GC requests, and light blue bars mark thread creation. Event labels include "created spark", "requesting parallel GC", "creating thread n", and spark events on capability 2.]
On average, I get only about 25% activity across 4 cores, which is no improvement at all over the single-threaded program.
Of course, the question would be meaningless without a description of the actual program. Essentially, I construct a Traversable data structure (a tree, for example), fmap a function over it, and then feed it to an image-writing routine (which accounts for the unambiguously single-threaded segment during the last ~15 seconds of the run). Both the construction and the fmapped function take considerable time to run, the latter slightly more.
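Schematically (with a hypothetical Tree type and placeholder functions standing in for the real construction, per-element work, and image-writing consumer; none of these names are from the actual program), the pipeline looks like this:

```haskell
{-# LANGUAGE DeriveFunctor, DeriveFoldable, DeriveTraversable #-}
import Data.Foldable (toList)

-- Hypothetical stand-in for the real Traversable structure.
data Tree a = Leaf a | Node (Tree a) (Tree a)
  deriving (Functor, Foldable, Traversable)

-- Placeholder for the (expensive) construction step.
build :: Int -> Tree Int
build 0 = Leaf 1
build n = Node (build (n - 1)) (build (n - 1))

-- Placeholder for the (expensive) per-element function.
f :: Int -> Int
f = (* 2)

main :: IO ()
main = do
  let t  = build 10        -- construct the structure
      t' = fmap f t        -- map the expensive function over it
  print (sum (toList t'))  -- stand-in for the image-writing consumer
```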
The graphs above were produced by adding a parTraversable strategy to this data structure before it is consumed by the image-writing routine. I also tried calling toList on the data structure and then using various parallel list strategies (parList, parListChunk, parBuffer), but the results were the same every time across a wide range of parameters (even with large chunks).
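For reference, the two approaches described above look roughly like this (a sketch only: `f` is a placeholder for the expensive per-element function, a plain list stands in for the real Traversable structure, and the chunk size is arbitrary):

```haskell
import Control.Parallel.Strategies
import Data.Foldable (toList)

-- Placeholder for the expensive per-element work.
f :: Int -> Int
f x = x * x

main :: IO ()
main = do
  let xs = [1 .. 1000] :: [Int]
      -- Strategy applied directly to the Traversable (here a list):
      ys = fmap f xs `using` parTraversable rdeepseq
      -- Alternative: convert to a list and use a chunked list strategy:
      zs = map f (toList xs) `using` parListChunk 100 rdeepseq
  print (sum ys, sum zs)
```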
I also tried fully evaluating the Traversable data structure before fmapping over it, but the same problem came up.
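The forcing step was along these lines (a sketch using force from Control.DeepSeq and evaluate from Control.Exception; again a plain list stands in for the real structure):

```haskell
import Control.DeepSeq (force)
import Control.Exception (evaluate)

main :: IO ()
main = do
  let t = [1 .. 1000] :: [Int]  -- placeholder for the real structure
  t' <- evaluate (force t)      -- fully evaluate it before fmapping
  print (sum (fmap (* 2) t'))
```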
Here are some additional statistics (from a different run of the same program):
       5,702,829,756 bytes allocated in the heap
         385,998,024 bytes copied during GC
          55,819,120 bytes maximum residency (8 sample(s))
           1,392,044 bytes maximum slop
                 133 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0     10379 colls, 10378 par    5.20s    1.40s     0.0001s    0.0327s
      Gen  1         8 colls,     8 par    1.01s    0.25s     0.0319s    0.0509s

      Parallel GC work balance: 1.24 (96361163 / 77659897, ideal 4)

                            MUT time (elapsed)       GC time  (elapsed)
      Task  0 (worker) :    0.00s    ( 15.92s)       0.02s    (  0.02s)
      Task  1 (worker) :    0.27s    ( 14.00s)       1.86s    (  1.94s)
      Task  2 (bound)  :   14.24s    ( 14.30s)       1.61s    (  1.64s)
      Task  3 (worker) :    0.00s    ( 15.94s)       0.00s    (  0.00s)
      Task  4 (worker) :    0.25s    ( 14.00s)       1.66s    (  1.93s)
      Task  5 (worker) :    0.27s    ( 14.09s)       1.69s    (  1.84s)

      SPARKS: 595854 (595854 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

      INIT    time    0.00s  (  0.00s elapsed)
      MUT     time   15.67s  ( 14.28s elapsed)
      GC      time    6.22s  (  1.66s elapsed)
      EXIT    time    0.00s  (  0.00s elapsed)
      Total   time   21.89s  ( 15.94s elapsed)

      Alloc rate    363,769,460 bytes per MUT second

      Productivity  71.6% of total user, 98.4% of total elapsed
I am not sure what other information would be useful here. Profiling does not show anything interesting: it is the same as the single-core statistics, except that an added IDLE entry takes 75% of the time, matching the ThreadScope view above.
What is going on that prevents useful parallelism?