Haskell: a more efficient way to parse a file of lines of numbers

So, I have an 8 MB file, each line of which has 6 integers separated by spaces.

My current parsing method is:

    tuplify6 :: [a] -> (a, a, a, a, a, a)
    tuplify6 [l, m, n, o, p, q] = (l, m, n, o, p, q)

    toInts :: String -> (Int, Int, Int, Int, Int, Int)
    toInts line = tuplify6 $ map read stringNumbers
      where stringNumbers = split " " line

and mapping toInts over

 liftM lines . readFile 

which gives me a list of tuples. However, when I run this, it takes about 25 seconds to load the file and parse it. Is there any way I can speed this up? The file is just plain text.

1 answer

You can speed it up by using ByteStrings, for example:

    module Main (main) where

    import System.Environment (getArgs)
    import qualified Data.ByteString.Lazy.Char8 as C
    import Data.Char

    main :: IO ()
    main = do
        args <- getArgs
        mapM_ doFile args

    doFile :: FilePath -> IO ()
    doFile file = do
        bs <- C.readFile file
        let tups = buildTups 0 [] $ C.dropWhile (not . isDigit) bs
        print (length tups)

    -- k counts the Ints collected so far for the current tuple.
    buildTups :: Int -> [Int] -> C.ByteString -> [(Int,Int,Int,Int,Int,Int)]
    -- acc is built back to front, so reverse it to restore file order.
    buildTups 6 acc bs = tuplify6 (reverse acc) : buildTups 0 [] bs
    buildTups k acc bs
        | C.null bs = if k == 0 then [] else error ("Bad file format " ++ show k)
        | otherwise = case C.readInt bs of
            Just (i,rm) -> buildTups (k+1) (i:acc) $ C.dropWhile (not . isDigit) rm
            Nothing     -> error ("No Int found: " ++ show (C.take 100 bs))

    tuplify6 :: [a] -> (a, a, a, a, a, a)
    tuplify6 [l, m, n, o, p, q] = (l, m, n, o, p, q)

runs pretty fast:

    $ time ./fileParse IntList
    200000

    real    0m0.119s
    user    0m0.115s
    sys     0m0.003s

for the 8.1 MiB file.

On the other hand, using String and your conversion (with a couple of seqs to force evaluation) also took only 0.66 s, so most of the time is apparently spent not on parsing but on working with the result.
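The answer does not show where its seq calls were placed, so the following is only a minimal sketch of one way to force the parsed numbers when timing a lazy parser. The names forceTuple and forcedLength are hypothetical and not part of the original code. The point is that length alone walks the list spine without evaluating the Ints inside the tuples, so a lazy read may never run.

```haskell
import Data.List (foldl')

-- Hypothetical helper: force all six components of one tuple.
-- For Int, weak head normal form is the fully evaluated number.
forceTuple :: (Int, Int, Int, Int, Int, Int) -> ()
forceTuple (a, b, c, d, e, f) =
    a `seq` b `seq` c `seq` d `seq` e `seq` f `seq` ()

-- Hypothetical helper: count the tuples while forcing each one,
-- so the parsing work is actually performed before the count is returned.
forcedLength :: [(Int, Int, Int, Int, Int, Int)] -> Int
forcedLength = foldl' (\n t -> forceTuple t `seq` (n + 1)) 0
```

Printing forcedLength tups instead of length tups would make the measured time include the conversion of every field.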

Oops, I missed a seq, so the reads were not actually evaluated in the String version. With that fixed, String + read takes about four seconds, and a bit over one second with the custom Int parser from @Rotsor's comment

    foldl' (\a c -> 10*a + fromEnum c - fromEnum '0') 0

so parsing apparently did take a significant share of the time.
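As a rough sketch of how the digit fold could be plugged into the String version from the question: the name parseInt is hypothetical, and words from the Prelude stands in for the question's split " " (an assumption made to keep the example free of extra packages). The fold does no validation, so it assumes every field consists only of the characters '0' to '9', with no sign.

```haskell
import Data.List (foldl')

-- Hypothetical name for the digit fold from the comment.
-- Assumes the input contains only '0'..'9'; no sign handling, no checks.
parseInt :: String -> Int
parseInt = foldl' (\a c -> 10 * a + fromEnum c - fromEnum '0') 0

-- Same shape as in the question, plus an explicit failure case.
tuplify6 :: [a] -> (a, a, a, a, a, a)
tuplify6 [l, m, n, o, p, q] = (l, m, n, o, p, q)
tuplify6 _ = error "expected exactly 6 fields"

-- Parse one line of six space-separated non-negative integers.
toInts :: String -> (Int, Int, Int, Int, Int, Int)
toInts = tuplify6 . map parseInt . words
```

This avoids read, which is general-purpose and comparatively slow on String, while keeping the rest of the question's pipeline unchanged.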
