Haskell batch file processing does not improve space profile

I have a simple algorithm to implement: compare each line with every other line. Each line contains one number, and the comparison function is the distance. The sum of all distances is the final result.
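Stated as a one-line specification (my own sketch, not code from the question; `sumOfDistances` is a hypothetical name), the result is the sum of |x - y| over all unordered pairs of lines:

```haskell
import Data.List (tails)

-- Specification sketch: sum the distance |x - y| over every unordered
-- pair drawn from the input list. Pairs are enumerated via 'tails', so
-- each pair is counted exactly once.
sumOfDistances :: [Integer] -> Integer
sumOfDistances xs = sum [abs (x - y) | (x : ys) <- tails xs, y <- ys]
```

For example, `sumOfDistances [1, 3, 6]` is 2 + 5 + 3 = 10.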

This can be implemented as follows:

    sumOfDistancesOnSmallFile :: FilePath -> IO Integer
    sumOfDistancesOnSmallFile path = withFile path ReadMode $ \h -> do
        distances <- liftM (map textRead) $ hListToEOF Text.hGetLine h
        let subSet = distances
        let emptySubSet = null subSet
        return $ if emptySubSet then 0 else distancesManyToMany subSet

    hListToEOF :: (Handle -> IO a) -> Handle -> IO [a]
    hListToEOF func h = do
        element <- func h
        atEOF <- hIsEOF h
        rest <- case atEOF of
            True  -> return []
            False -> hListToEOF func h
        return $ element : rest

    distancesManyToMany :: [Integer] -> Integer
    distancesManyToMany (x:xs) = distancesOneToMany x xs + distancesManyToMany xs
    distancesManyToMany _      = 0

    distancesOneToMany :: Integer -> [Integer] -> Integer
    distancesOneToMany one many = sum $ map (distance one) many

    distance :: Integer -> Integer -> Integer
    distance a b = abs (a - b)

To get reasonable big data for each row, I used the following file generator:

    createTestFile :: Int -> FilePath -> IO ()
    createTestFile n path = writeFile path $ unlines $ map show $ take n $ infiniteList 0 1
        where
            infiniteList :: Integer -> Integer -> [Integer]
            infiniteList i j = (i + j) * (i + j) : infiniteList j (i + j)

The 2000 line file, 840 KB in size, takes 1.92 seconds, allocates 1.5 GB in total, and has a maximum residency of about 1.5 MB.

A 6000 line file, 7.5 MB in size, takes 22 seconds, allocates 34 GB in total, with a maximum residency of about 15 MB.

Unfortunately, my data will be millions of rows. At first I tried to improve speed (which I asked about in two previous questions, concerning MapReduce and Iteratee IO), but space is the real limiting problem.

Intermediate thought: the space problem could be overcome by re-reading the full file for each number that needs to be compared. This costs a lot of extra time, because the file has to be opened and parsed once per line that is compared against the rest of the file. Also, the number of memory allocations becomes quadratic. So this is not very useful as a final solution.
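The re-reading idea can be sketched as follows (my own illustration, not the question's code; `readAll` is a hypothetical action standing in for reopening and re-parsing the file):

```haskell
import Control.Monad (forM)

-- Sketch of the "re-read the whole file per line" idea. 'readAll' stands
-- in for reopening and re-parsing the file; it is executed once per line,
-- so I/O and parsing cost become quadratic in the number of lines, even
-- though resident memory stays small.
sumOfDistancesRereading :: IO [Integer] -> IO Integer
sumOfDistancesRereading readAll = do
    n <- fmap length readAll                     -- one pass to count lines
    partials <- forM [0 .. n - 1] $ \i -> do
        xs <- readAll                            -- "reopen" for line i
        case drop i xs of
            (x : rest) -> return (sum (map (\y -> abs (x - y)) rest))
            []         -> return 0
    return (sum partials)
```

With `readAll = return [1, 3, 6]` this yields the same 10 as the in-memory version, but `readAll` runs once per line.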

Next step: this was my first move towards the goal, batch execution. I want to take a batch of k lines into memory, apply the many-to-many algorithm to that batch, and then iterate over the rest of the file. At each iteration step only one successive line needs to be read and parsed, which is then compared with all the elements of the in-memory batch.

When choosing a batch size large enough, the file does not need to be re-read often. My implementation is as follows:

    sumOfDistancesOnBigFileUsingBatches :: FilePath -> Int -> Int -> IO Integer
    sumOfDistancesOnBigFileUsingBatches path batchSize offset = do
        (firstResult, maybeRecurse) <- singleResultBatch path batchSize offset
        recursiveResult <- case maybeRecurse of
            Nothing        -> return 0
            Just newOffset -> sumOfDistancesOnBigFileUsingBatches path batchSize newOffset
        return $ firstResult + recursiveResult

    singleResultBatch :: FilePath -> Int -> Int -> IO (Integer, Maybe Int)
    singleResultBatch path batchSize offset = withFile path ReadMode $ \h -> do
        distances <- readDistances h
        let (batch, subSet) = splitAt batchSize $ drop offset distances
        let batchInner = distancesManyToMany batch
        let recursionTerminated = null subSet
        let (batchToSubSet, newOffset) =
                if recursionTerminated
                    then (0, Nothing)
                    else (distancesSetToSet batch subSet, Just (offset + batchSize))
        return (batchInner + batchToSubSet, newOffset)
        where
            readDistances h = liftM (map textRead) $ hListToEOF Text.hGetLine h

    distancesSetToSet :: [Integer] -> [Integer] -> Integer
    distancesSetToSet xs ys = sum $ map (\one -> distancesOneToMany one xs) ys

On a 2k line file with a batch size of 500, it finishes in 2.6 s, allocates 2.2 GB, and needs about 6 MB of space. That is 4 times the space of the simplest version! It may be a coincidence, but 4 batches are also used...

What surprised me is that all the required space is consumed at the start, and afterwards the used space only decreases. This becomes a problem with a 50k line (500 MB) file, because there it runs out of memory.

My question is: why does the batch solution consume more memory? It seems to keep the entire file in memory for each batch, although it should (at least, this is my intention) contain only one batch in memory.

EDIT: I removed the details of the 6k line file with 500-line batches (I had taken the wrong profile file).

As a replacement, here is the space profile generated using a 2k line file and 500-line batches: [space profile of batched solution on a 2k line file (840 KB)]

EDIT 2: Retainer profiling resulted in:

    total time  =  2.24 secs   (112 ticks @ 20 ms)
    total alloc = 2,126,803,896 bytes  (excludes profiling overheads)

    COST CENTRE         MODULE                      %time %alloc
    textRead            MapReduceTestStrictStrings   47.3   44.4
    distance            MapReduceTestStrictStrings   25.9   25.3
    distancesOneToMany  MapReduceTestStrictStrings   18.8   29.5
    singleResultBatch   MapReduceTestStrictStrings    4.5    0.0
    readTextDevice      Data.Text.IO.Internal         2.7    0.0

                                                                                      individual     inherited
    COST CENTRE                           MODULE                      no.    entries  %time %alloc  %time %alloc
    MAIN                                  MAIN                        1      0          0.0   0.0  100.0 100.0
     main                                 Main                        1604   2          0.0   0.0  100.0 100.0
      sumOfDistancesOnBigFileUsingBatches MapReduceTestStrictStrings  1605   4          0.0   0.0  100.0 100.0
       singleResultBatch                  MapReduceTestStrictStrings  1606   20         4.5   0.0  100.0 100.0
        distancesSetToSet                 MapReduceTestStrictStrings  1615   3          0.0   0.0   34.8  43.3
         distancesOneToMany               MapReduceTestStrictStrings  1616   3000      14.3  23.2   34.8  43.2
          distance                        MapReduceTestStrictStrings  1617   1500000   20.5  20.0   20.5  20.0
        textRead                          MapReduceTestStrictStrings  1614   5000      47.3  44.4   47.3  44.4
        distancesManyToMany               MapReduceTestStrictStrings  1611   2004       0.0   0.0    9.8  11.7
         distancesOneToMany               MapReduceTestStrictStrings  1612   2000       4.5   6.3    9.8  11.6
          distance                        MapReduceTestStrictStrings  1613   499000     5.4   5.3    5.4   5.3
        hListToEOF                        MapReduceTestStrictStrings  1609   23996      0.9   0.6    3.6   0.6
         readTextDevice                   Data.Text.IO.Internal       1610   1660       2.7   0.0    2.7   0.0
     CAF:main4                            Main                        1591   1          0.0   0.0    0.0   0.0
     CAF:main5                            Main                        1590   1          0.0   0.0    0.0   0.0
      main                                Main                        1608   0          0.0   0.0    0.0   0.0
     CAF                                  GHC.Num                     1580   1          0.0   0.0    0.0   0.0
     CAF                                  GHC.IO.Handle.FD            1526   2          0.0   0.0    0.0   0.0
     CAF                                  GHC.IO.FD                   1510   2          0.0   0.0    0.0   0.0
     CAF                                  System.Event.Thread         1508   3          0.0   0.0    0.0   0.0
     CAF                                  GHC.IO.Encoding.Iconv       1487   2          0.0   0.0    0.0   0.0
     CAF                                  System.Event.Internal       1486   2          0.0   0.0    0.0   0.0
     CAF                                  System.Event.Unique         1483   1          0.0   0.0    0.0   0.0
     CAF                                  GHC.Conc.Signal             1480   1          0.0   0.0    0.0   0.0
     CAF                                  Data.Text.Internal          813    1          0.0   0.0    0.0   0.0
     CAF                                  Data.Text.Array             811    1          0.0   0.0    0.0   0.0

    Retainer sets created during profiling:
    SET 2   = {<MAIN.SYSTEM>}
    SET 3   = {<MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
    SET 15  = {<GHC.IO.FD.CAF>}
    SET 17  = {<System.Event.Thread.CAF>}
    SET 18  = {<>}
    SET 44  = {<GHC.IO.Handle.FD.CAF>}
    SET 47  = {<GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>}
    SET 56  = {<GHC.Conc.Signal.CAF>}
    SET 57  = {<>, <MAIN.SYSTEM>}
    SET 66  = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
    SET 67  = {<System.Event.Thread.CAF>, <>, <MAIN.SYSTEM>}
    SET 72  = {<GHC.Conc.Sync.CAF>, <MAIN.SYSTEM>}
    SET 76  = {<MapReduceTestStrictStrings.hListToEOF,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
    SET 81  = {<GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
    SET 83  = {<GHC.IO.Encoding.Iconv.CAF>, <GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
    SET 86  = {<GHC.Conc.Signal.CAF>, <>}
    SET 95  = {<MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesManyToMany,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
    SET 96  = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesManyToMany,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
    SET 100 = {<MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>, <MapReduceTestStrictStrings.hListToEOF,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
    SET 102 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesManyToMany,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
    SET 103 = {<MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
    SET 136 = {<GHC.IO.Handle.FD.CAF>, <MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
    SET 143 = {<GHC.Conc.Sync.CAF>, <GHC.IO.Handle.FD.CAF>, <MAIN.SYSTEM>}
    SET 144 = {<MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
    SET 145 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
    SET 146 = {<MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
    SET 147 = {<MAIN.SYSTEM>, <MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}
    SET 148 = {<MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>, <MapReduceTestStrictStrings.distancesOneToMany,MapReduceTestStrictStrings.distancesSetToSet,MapReduceTestStrictStrings.singleResultBatch,MapReduceTestStrictStrings.sumOfDistancesOnBigFileUsingBatches,Main.main>}

And the resulting .hp image: [retainer heap profile for a 2k line file and 500-line batches]

EDIT 3: All of the code above used these packages:

    Data.Text.IO
    Data.Text
    Data.Text.Read

When I use the lazy versions of these packages, total time / allocation / space use does not really change: 2.62 s, 2.25 GB, and 5.5 MB of space.

Solution found: the lazy versions did not help, because hListToEOF forcibly reads the full file (I had expected the : constructor to be built lazily).

The solution is to use the following imports:

    import qualified Data.ByteString.Char8 as Str
    import qualified Data.Text.Lazy.IO as TextIO
    import qualified Data.Text.Lazy as T
    import Data.Text.Lazy.Read

and in the singleResultBatch function, the following modification:

    readDistances = liftM (map textRead . T.lines) $ TextIO.readFile path

With that, both the speed (2.72 s) and the memory allocation (2.3 GB) stay roughly the same, which is expected.

The heap profile (space usage) improves (1.8 MB instead of 5.5 MB), as seen in: [heap profile when the Text.Lazy packages are used]

2 answers

You need to process the data incrementally. Currently hListToEOF reads all the data in one go, and you then slowly consume it (hence the initial burst of memory while everything is read, followed by the slow decrease as the list is consumed).

Instead of performing your own I/O with hListToEOF, read the file lazily (for example with readFile from Data.Text.Lazy.IO) and map your processing functions over it.
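For illustration, here is a minimal sketch of that idea using Prelude's lazy `readFile` (the Data.Text.Lazy.IO version is analogous; `sumParsedLines` is a hypothetical name, not from the answer):

```haskell
import Data.List (foldl')

-- Streaming sketch: readFile is lazy, so lines are read, parsed and
-- consumed on demand; the strict left fold keeps the accumulator
-- evaluated, and no prefix of the line list is retained.
sumParsedLines :: FilePath -> IO Integer
sumParsedLines path = do
    contents <- readFile path                       -- lazy I/O: read on demand
    return $! foldl' (+) 0 (map read (lines contents))
```

The key point is that the consumer (the fold) drives the I/O, so resident memory stays bounded regardless of file size; the pairwise-distance computation would replace the simple sum here.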


I think most of your problem is that you are holding onto lists, which are quite space-inefficient when you need to keep them around. Try switching to a more compact representation, such as Vector. This is a fairly direct translation of some of your functions; there is probably still a lot of room for optimization:

    sumOfDistancesOnSmallFile :: FilePath -> IO Integer
    sumOfDistancesOnSmallFile path = withFile path ReadMode $ \h -> do
        distances <- liftM (V.fromList . map textRead) $ hListToEOF Text.hGetLine h
        let subSet = distances
        let emptySubSet = V.null subSet
        return $ if emptySubSet then 0 else distancesManyToMany subSet

    distancesManyToMany :: V.Vector Integer -> Integer
    distancesManyToMany mp0 =
        fst $ V.foldl' (\(!acc, mp) _ ->
            let (el, mp1) = (V.head mp, V.tail mp)
            in (acc + distancesOneToMany el mp1, mp1)) (0, mp0) mp0

    distancesOneToMany :: Integer -> V.Vector Integer -> Integer
    distancesOneToMany one many = V.sum $ V.map (distance one) many

On my system, this runs on a 10,000-line test data file in about 14 seconds, with a maximum memory usage of 125 MB.

