You can see immediately that there is a leak by looking at the GC heap statistics and the timing:
    USDGHTVVH1$ time ./A 1000 10000 +RTS -s
    10000000
       1,208,298,840 bytes allocated in the heap
       1,352,861,868 bytes copied during GC
         280,181,684 bytes maximum residency (9 sample(s))
           4,688,680 bytes maximum slop
                 545 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0      1677 colls,     0 par    2.27s    2.22s     0.0013s    0.0063s
      Gen  1         9 colls,     0 par    1.66s    1.77s     0.1969s    1.0273s

      INIT    time    0.00s  (  0.00s elapsed)
      MUT     time    0.70s  (  0.77s elapsed)
      GC      time    3.92s  (  4.00s elapsed)
      EXIT    time    0.00s  (  0.00s elapsed)
      Total   time    4.62s  (  4.77s elapsed)

      %GC     time      84.8%  (83.8% elapsed)

      Alloc rate    1,718,469,461 bytes per MUT second

      Productivity  15.2% of total user, 14.7% of total elapsed

    real    0m4.752s
    user    0m0.015s
    sys     0m0.046s
280M residency and 85% GC. Lots of thunks are being allocated and thrown away.
The heap profile makes it obvious.

The clue is the "stg_app*" entries (that is, STG machine application thunks).
The issue here is a subtle but notable problem with the modify family of functions: with a lazy atomicModifyIORef, there is simply no way to be strict in updating the cell without demanding the value.
So all your careful use of atomicModifyIORef r (\ !x' -> (x' + k, ())) does is build up a chain of applications of the (+) function, such that when the result of the chain (i.e. the value in the cell) is finally observed, each addition will be strict in its argument. Not what you want! None of the strictness annotations on the argument of the modifying function have any effect on the cell itself. Now, usually a lazy modify is what you want: it is just a pointer swap, so you can have very short atomic sections.
But sometimes this is not what you want.
(For background on this issue, see GHC ticket #5926. However, the issue was known at least as far back as 2007, when I wrote strict-concurrency to avoid this problem with MVars. It was discussed in 2009, and as of 2012 we have strict versions.)
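To make the laziness concrete, here is a minimal sketch that seeds a cell with a value that throws when evaluated. The names bumpLazy and cellThrows are hypothetical helpers, not part of the original program. If the bang on the lambda argument forced anything at modification time, the modify itself would throw. Instead, the error should only surface once the cell is read and its contents demanded.

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Exception (ErrorCall, evaluate, try)
import Data.IORef

-- Hypothetical helper: the lazy modify from the text, bang included.
-- The new cell contents are stored as an unevaluated thunk.
bumpLazy :: Int -> IORef Int -> IO ()
bumpLazy k r = atomicModifyIORef r (\ !x -> (x + k, ()))

-- Hypothetical helper: True when demanding the cell's contents
-- raises an ErrorCall.
cellThrows :: IORef Int -> IO Bool
cellThrows r = do
  outcome <- try (readIORef r >>= evaluate) :: IO (Either ErrorCall Int)
  return (either (const True) (const False) outcome)

main :: IO ()
main = do
  -- Seed the cell with a value that throws as soon as it is evaluated.
  r <- newIORef (error "poisoned cell" :: Int)
  -- The modify only swaps in a thunk; the bang has not fired yet.
  bumpLazy 1 r
  putStrLn "modify returned without evaluating the cell"
  -- Observing the cell runs the suspended addition and hits the error.
  threw <- cellThrows r
  print threw
```

In the leaking program the same deferral happens on every call, so each modify stacks one more suspended addition on top of the previous one.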
By demanding the value first, you can remove the problem. For example:
    shW !k !r = -- atomicModifyIORef r (\x -> (x + k, ()))
        do x <- shR r
           writeIORef r $! (x + k)  -- ($!) forces the sum before it is stored
Note that this problem is now documented in the libraries, and you can use atomicModifyIORef' to avoid it.
With the strict write in shW above, we get:
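As a rough sketch of the atomicModifyIORef' route, the following self-contained program keeps the update atomic while forcing the new value on each step. The names addStrict and countStrict are hypothetical and stand in for the original program's writer and driver, which are not shown here.

```haskell
import Control.Monad (replicateM_)
import Data.IORef

-- Hypothetical helper: add k to the counter. atomicModifyIORef' forces
-- the new contents to weak head normal form before it returns, so no
-- chain of suspended additions can build up in the cell.
addStrict :: Int -> IORef Int -> IO ()
addStrict k r = atomicModifyIORef' r (\x -> (x + k, ()))

-- Hypothetical driver: perform n increments of 1 and return the total.
countStrict :: Int -> IO Int
countStrict n = do
  r <- newIORef 0
  replicateM_ n (addStrict 1 r)
  readIORef r

main :: IO ()
main = countStrict 10000000 >>= print
```

Unlike a separate read followed by a strict write, this variant stays a single atomic operation, which matters if several threads update the same cell.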
    USDGHTVVH1$ time ./A 1000 10000 +RTS -s
    10000000
         120,738,836 bytes allocated in the heap
           3,758,476 bytes copied during GC
              73,020 bytes maximum residency (1 sample(s))
              16,348 bytes maximum slop
                   1 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0       230 colls,     0 par    0.00s    0.01s     0.0000s    0.0001s
      Gen  1         1 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

      INIT    time    0.00s  (  0.00s elapsed)
      MUT     time    0.17s  (  0.17s elapsed)
      GC      time    0.00s  (  0.01s elapsed)
      EXIT    time    0.00s  (  0.00s elapsed)
      Total   time    0.19s  (  0.17s elapsed)

      %GC     time       0.0%  (3.9% elapsed)

      Alloc rate    643,940,458 bytes per MUT second

      Productivity 100.0% of total user, 108.3% of total elapsed

    real    0m0.218s
    user    0m0.000s
    sys     0m0.015s
That is, a 22x speedup, and memory use becomes constant. For laughs, here's the new heap profile:
