Fetching a collector with respect to getting the length of the list and selecting random items

Question

Fetching a collector with respect to getting the length of the list and selecting random items

I wrote two functions to select a random element from a list of unknown length. The first uses a collector sample (with a tank of size 1), and the second gets the length of the list to select a random index and return it. For some reason, the first is much faster.

The first function uses one traversal and selects each element with probability (1 / i), where i is the index of the element in the list. This gives an equal probability of choosing each item.

pickRandom :: [a] -> IO a pickRandom [] = error "List is empty" pickRandom (x:xs) = do stdgen <- newStdGen return (pickRandom' xs x 1 stdgen) -- Pick a random number using reservoir sampling pickRandom' :: (RandomGen g) => [a] -> a -> Int -> g -> a pickRandom' [] xi _ _ = xi pickRandom' (x:xs) xi n gen = let (rand, gen') = randomR (0, n) gen in if (rand == 0) then pickRandom' xs x (n + 1) gen' -- Update value else pickRandom' xs xi (n + 1) gen' -- Keep previous value

The second version runs once to get its length, and then selects an index between 0 and the length of the input list (-1) to get one of the elements, again with equal probability. The expected number of crawls of the list is 1.5:

 -- Traverses the list twice pickRandomWithLen :: [a] -> IO a pickRandomWithLen [] = error "List is empty" pickRandomWithLen xs = do gen <- newStdGen (e, _) <- return $ randomR (0, (length xs) - 1) gen return $ xs !! e

Here is the code I use to compare these two functions:

 main :: IO () main = do gen <- newStdGen let size = 2097152 inputList = getRandList gen size defaultMain [ bench "Using length" (pickRandomWithLen inputList) , bench "Using reservoir" (pickRandom inputList) ]

Here is a stripped-down conclusion:

 benchmarking Using reservoir mean: 82.72108 ns, lb 82.02459 ns, ub 83.61931 ns, ci 0.950 benchmarking Using length mean: 17.12571 ms, lb 16.97026 ms, ub 17.37352 ms, ci 0.950

In other words, the first function is about 200 times faster than the second. I expected that the runtime will be affected mainly by the generation of random numbers and the number of transitions on lists (1 versus 1.5). What other factors can explain such a huge difference?

+4

performance haskell criterion

Antoine Dec 25 '12 at 23:43

source share

1 answer

Daniel Fischer · Accepted Answer · 2012-12-26T00:30:36+0000

Your control actions do not actually evaluate the result,

 pickRandom :: [a] -> IO a pickRandom [] = error "List is empty" pickRandom (x:xs) = do stdgen <- newStdGen return (pickRandom' xs x 1 stdgen)

gets only the new StdGen and returns thunk. It's very fast.

 pickRandomWithLen :: [a] -> IO a pickRandomWithLen [] = error "List is empty" pickRandomWithLen xs = do gen <- newStdGen (e, _) <- return $ randomR (0, (length xs) - 1) gen return $ xs !! e

computes the length of the list and then returns thunk, which of course is much slower.

Coercion is how to evaluate the result,

 return $! ...

speeds up length with the version much

 benchmarking Using length mean: 14.65655 ms, lb 14.14580 ms, ub 15.16942 ms, ci 0.950 std dev: 2.631668 ms, lb 2.378186 ms, ub 2.937339 ms, ci 0.950 variance introduced by outliers: 92.581% variance is severely inflated by outliers benchmarking Using reservoir collecting 100 samples, 1 iterations each, in estimated 47.00930 s mean: 451.5571 ms, lb 448.4355 ms, ub 455.7812 ms, ci 0.950 std dev: 18.50427 ms, lb 14.45557 ms, ub 24.74350 ms, ci 0.950 found 4 outliers among 100 samples (4.0%) 2 (2.0%) high mild 2 (2.0%) high severe variance introduced by outliers: 38.511% variance is moderately inflated by outliers

(after forcing the input list to be computed before printing the sum), since this requires only one PRNG call, while the collector fetch uses length list - 1 calls.

The difference is likely to be less if a faster PRNG is used than StdGen .

Indeed, using System.Random.Mersenne instead of StdGen ( StdGen is required to have an IO a result type, and since it offers no generation in a certain range, but only the default range slightly distorts the distribution of the selected elements, but since we are only interested in time, necessary to generate pseudorandom numbers, it doesn’t matter), the collector sampling time drops to

 mean: 51.83185 ms, lb 51.77620 ms, ub 51.91259 ms, ci 0.950 std dev: 482.4712 us, lb 368.4433 us, ub 649.1758 us, ci 0.950

(the pickRandomWithLen time pickRandomWithLen not change noticeably, of course, since it uses only one generation). About nine times faster, which shows that pseudo-random generation is the dominant factor.

Fetching a collector with respect to getting the length of the list and selecting random items

More articles: