How to parse a 7GB file using Data.ByteString?

I need to analyze the file, and to do that I first have to read it. Here is my program:

 import qualified Data.ByteString.Char8 as B
 import System.Environment

 main = do
     args <- getArgs
     let path = args !! 0
     content <- B.readFile path
     let lines = B.lines content
     foobar lines

 foobar :: [B.ByteString] -> IO ()
 foobar _ = return ()

but after compilation

 > ghc --make -O2 tmp.hs 

when run on a 7 GB file, the program fails with the following error:

 > ./tmp big_big_file.dat
 > tmp: {handle: big_big_file.dat}: hGet: illegal ByteString size (-1501792951): illegal operation

Thanks for any answer!

2 answers

On a platform where Int is 32 bits, a strict ByteString can hold at most 2 GB, because its length is stored as an Int. You must use lazy ByteStrings for this to work.


The length of a ByteString is an Int. If Int is 32 bits, the size of a 7 GB file exceeds the Int range, so the buffer request wraps around to the wrong size, which can easily be negative.

The readFile code converts the file size to an Int for the buffer request:

 readFile :: FilePath -> IO ByteString
 readFile f = bracket (openBinaryFile f ReadMode) hClose
     (\h -> hFileSize h >>= hGet h . fromIntegral)

and if that conversion overflows, the most likely outcomes are the "illegal ByteString size" error or a segmentation fault.
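To make the wrap-around concrete, here is a minimal sketch of the same narrowing done with an explicit Int32. The file size used is an assumption: 7088141641 bytes is one value that is consistent with the negative number in the error message from the question, not a size the question states. The names `wrapToInt32` and `assumedFileSize` are made up for the illustration.

```haskell
import Data.Int (Int32)

-- Narrow a file size (hFileSize returns an Integer) to 32 bits,
-- which is what fromIntegral does when Int is 32 bits wide.
wrapToInt32 :: Integer -> Int32
wrapToInt32 = fromIntegral

-- Hypothetical file size, chosen to be consistent with the error above.
assumedFileSize :: Integer
assumedFileSize = 7088141641

main :: IO ()
main = print (wrapToInt32 assumedFileSize)  -- prints -1501792951
```

The conversion keeps only the low 32 bits of the size, so any file larger than 2 GB can produce a nonsensical or negative buffer request.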

If at all possible, use a lazy ByteString to process large files. In your case you pretty much have to, since with 32-bit Ints it is not possible to create a 7 GB strict ByteString.

If your processing needs strict ByteString lines and the lines are not too long, you can go through a lazy ByteString to get them:

 import qualified Data.ByteString.Lazy.Char8 as LC
 import qualified Data.ByteString.Char8 as C

 main = do
     ...
     content <- LC.readFile path
     let llns = LC.lines content
         slns = map (C.concat . LC.toChunks) llns
     foobar slns

but if you can change your processing to handle lazy ByteStrings directly, that will probably be better.
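As a rough sketch of what processing lazy ByteStrings directly might look like, the program below folds over the lines of the file in a single pass, so only the chunks currently being examined need to be in memory. The `summarize` function (line count and longest line length) is a hypothetical stand-in for whatever analysis `foobar` is meant to do.

```haskell
import qualified Data.ByteString.Lazy.Char8 as LC
import Data.Int (Int64)
import Data.List (foldl')
import System.Environment (getArgs)

-- Hypothetical analysis: count the lines and track the longest line length.
-- Int64 is used so the counters do not overflow when Int is 32 bits.
summarize :: LC.ByteString -> (Int64, Int64)
summarize = foldl' step (0, 0) . LC.lines
  where
    step (n, longest) l =
      let n'       = n + 1
          longest' = max longest (LC.length l)
      -- Force both counters so no chain of thunks builds up.
      in n' `seq` longest' `seq` (n', longest')

main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> LC.readFile path >>= print . summarize
    _      -> putStrLn "usage: tmp FILE"
```

The strict left fold matters here: if the whole list of lines were kept alive (for example by traversing it twice), the file would end up fully resident in memory again.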
