I want to take the n largest items from a lazy list.
I have heard that the merge sort implemented in Data.List.sort is lazy and does not produce more elements than necessary. That may be true in terms of the number of comparisons, but it is certainly not the case when it comes to memory usage. The following program illustrates the problem:
    {-# LANGUAGE ScopedTypeVariables #-}
    module Main where

    import qualified Data.Heap as Heap
    import qualified Data.List as List
    import System.Random.MWC
    import qualified Data.Vector.Unboxed as Vec
    import System.Environment

    limitSortL n xs = take n (List.sort xs)

    limitSortH n xs = List.unfoldr Heap.uncons
        (List.foldl' (\acc x -> Heap.take n (Heap.insert x acc)) Heap.empty xs)

    main = do
        st <- create
        rxs :: [Int] <- Vec.toList `fmap` uniformVector st (10^7)
        args <- getArgs
        case args of
            ["LIST"] -> print (limitSortL 20 rxs)
            ["HEAP"] -> print (limitSortH 20 rxs)
        return ()
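For reference, this is how I think of limitSortH: the accumulator is cut back to n elements on every step, so it never grows with the input. The following is only a sketch of that idea; it assumes the heaps-style Data.Heap API (Heap.take returning a heap of the n least elements, Heap.size being O(1)), and boundedSmallest/step are names I made up here.

    -- Sketch only: assumes the heaps-style Data.Heap API, where Heap.take
    -- returns a heap of the n least elements and Heap.size is O(1).
    import qualified Data.Heap as Heap
    import qualified Data.List as List

    -- Keep at most n elements in the accumulator at every step, so residency
    -- is bounded by n (plus whatever of the input list is still alive).
    boundedSmallest :: Ord a => Int -> [a] -> [a]
    boundedSmallest n = List.unfoldr Heap.uncons . List.foldl' step Heap.empty
      where
        step acc x
            | Heap.size acc < n = Heap.insert x acc
            | otherwise         = Heap.take n (Heap.insert x acc)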
Runtime:
Data.List:
./lazyTest LIST +RTS -s
[-9223371438221280004,-9223369283422017686,-9223368296903201811,-9223365203042113783,-9223364809100004863,-9223363058932210878,-9223362160334234021,-9223359019266180408,-9223358851531436915,-9223345045262962114,-9223343191568060219,-9223342956514809662,-9223341125508040302,-9223340661319591967,-9223337771462470186,-9223336010230770808,-9223331570472117335,-9223329558935830150,-9223329536207787831,-9223328937489459283]
2,059,921,192 bytes allocated in the heap
2,248,105,704 bytes copied during GC
552,350,688 bytes maximum residency (5 sample(s))
3,390,456 bytes maximum slop
1168 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 3772 collections, 0 parallel, 1.44s, 1.48s elapsed
Generation 1: 5 collections, 0 parallel, 0.90s, 1.13s elapsed
INIT time 0.00s (0.00s elapsed)
MUT time 0.82s (0.84s elapsed)
GC time 2.34s (2.61s elapsed)
EXIT time 0.00s (0.00s elapsed)
Total time 3.16s (3.45s elapsed)
% GC time 74.1% (75.7% elapsed)
Alloc rate 2,522,515,156 bytes per MUT second
Productivity 25.9% of total user, 23.7% of total elapsed
Data.Heap:
./lazyTest HEAP +RTS -s
[-9223371438221280004,-9223369283422017686,-9223368296903201811,-9223365203042113783,-9223364809100004863,-9223363058932210878,-9223362160334234021,-9223359019266180408,-9223358851531436915,-9223345045262962114,-9223343191568060219,-9223342956514809662,-9223341125508040302,-9223340661319591967,-9223337771462470186,-9223336010230770808,-9223331570472117335,-9223329558935830150,-9223329536207787831,-9223328937489459283]
177,559,536,928 bytes allocated in the heap
237,093,320 bytes copied during GC
80,031,376 bytes maximum residency (2 sample(s))
745,368 bytes maximum slop
78 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 338539 collections, 0 parallel, 1.24s, 1.31s elapsed
Generation 1: 2 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s (0.00s elapsed)
MUT time 35.24s (35.46s elapsed)
GC time 1.24s (1.31s elapsed)
EXIT time 0.00s (0.00s elapsed)
Total time 36.48s (36.77s elapsed)
% GC time 3.4% (3.6% elapsed)
Alloc rate 5,038,907,812 bytes per MUT second
Productivity 96.6% of total user, 95.8% of total elapsed
Clearly limitSortL is much faster, but it is also very memory hungry. On larger lists it hits the available RAM.
Is there a faster algorithm to solve this problem that is not memory hungry?
Edit: Clarification: I am using Data.Heap from the heaps package; I have not tried the heap package.
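For reference, these are the operations the program relies on; I am assuming the following signatures from the heaps package's Data.Heap (they may differ in other packages):

    -- Assumed signatures (heaps package, Data.Heap), for reference only:
    -- empty  :: Heap a
    -- insert :: Ord a => a -> Heap a -> Heap a
    -- take   :: Int -> Heap a -> Heap a           -- heap of the n least elements
    -- uncons :: Heap a -> Maybe (a, Heap a)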