Writing "wc -l" using Iteratee library - how to filter for a new line?

I am trying to come up with the equivalent of "wc -l" using the Haskell Iteratee library. Below is the code for "wc" (which just counts words - similar to the code in the example with iteration on the hack) and works very quickly:


{-# LANGUAGE BangPatterns #-} import Data.Iteratee as I import Data.ListLike as LL import Data.Iteratee.IO import Data.ByteString length1 :: (Monad m, Num a, LL.ListLike s el) => Iteratee sma length1 = liftI (step 0) where step !i (Chunk xs) = liftI (step $ i + fromIntegral (LL.length xs)) step !i stream = idone i stream {-# INLINE length1 #-} main = do i' <- enumFile 1024 "/usr/share/dict/words" (length1 :: (Monad m) => Iteratee ByteString m Int) result <- run i' print result {- Time measured on a linux x86 box: $ time ./test ## above haskell compiled code 4950996 real 0m0.013s user 0m0.004s sys 0m0.007s $ time wc -c /usr/share/dict/words 4950996 /usr/share/dict/words real 0m0.003s user 0m0.000s sys 0m0.002s -} 

Now, how do you expand it to count the number of lines that are fast too? I used the Prelude.filter version to filter only "\ n" to length, but slower than linux "wc -l" due to too much memory, and gc (lazy rating, I think). So, I wrote another version using Data.ListLike.filter, but it will not compile because it does not enter check-help here, it will be appreciated:


  {-# LANGUAGE BangPatterns #-} import Data.Iteratee as I import Data.ListLike as LL import Data.Iteratee.IO import Data.ByteString import Data.Char import Data.ByteString.Char8 (pack) numlines :: (Monad m, Num a, LL.ListLike s el) => Iteratee sma numlines = liftI $ step 0 where step !i (Chunk xs) = liftI (step $i + fromIntegral (LL.length $ LL.filter (\x -> x == Data.ByteString.Char8.pack "\n") xs)) step !i stream = idone i stream {-# INLINE numlines #-} main = do i' <- enumFile 1024 "/usr/share/dict/words" (numlines :: (Monad m) => Iteratee ByteString m Int) result <- run i' print result 
+7
source share
4 answers

There are already many good answers; I have very few performance suggestions, but few styles.

First I will write like this:

 import Prelude as P import Data.Iteratee import qualified Data.Iteratee as I import qualified Data.Iteratee.IO as I import qualified Data.ByteString as B import Data.Char import System.Environment -- numLines has a concrete stream type so it not necessary to provide an -- annotation later. It could have a more general type. numLines :: Monad m => I.Iteratee B.ByteString m Int numLines = I.foldl' step 0 where --step :: Int -> Word8 -> Int step acc el = if el == (fromIntegral $ ord '\n') then acc + 1 else acc main = do f:_ <- getArgs words <- run =<< I.enumFile 65536 f numLines print words 

The biggest difference is that it uses Data.Iteratee.ListLike.foldl' . Note that for the step function, only the individual elements of the stream are important, not the type of stream. This is exactly the same function as you would use, for example, Data.ByteString.Lazy.foldl' .

Using foldl' also means you don't need to manually write iterations using liftI . I would discourage users from doing this if absolutely necessary. The result is usually longer and more complex to support with little or no benefit.

Finally, I significantly increased the size of the buffer. On my system, this is slightly faster than the enumerator default 4096, which again is slightly faster (with iteration) than your choice of 1024. YMMV with this setting, of course.

+1
source

So, I was experimenting a bit, and I have wc -l, which is about twice as slow as wc -l. This is better performance than even the wc-c version shown above.

 {-# LANGUAGE OverloadedStrings #-} import qualified Data.ByteString.Lazy.Char8 as BSL import qualified Data.ByteString.Char8 as BS import qualified Data.Enumerator as E import qualified Data.Enumerator.Binary as EB import Control.Monad.IO.Class (liftIO) import Data.Int numlines :: Int64 -> E.Iteratee BS.ByteString IO () numlines n = do chunk <- EB.take 1024 case chunk of "" -> do liftIO $ print n return () a -> do let ct = BSL.count '\n' a numlines (n+ct) main = do let i = EB.enumFile "/usr/share/dict/words" E.$$ numlines 0 E.run_ i 

Run compared to native:

 Eriks-MacBook-Air:skunk erikhinton$ time wc -l "/usr/share/dict/words" 235886 /usr/share/dict/words real 0m0.009s user 0m0.006s sys 0m0.002s Eriks-MacBook-Air:skunk erikhinton$ time ./wcl 235886 real 0m0.019s user 0m0.013s sys 0m0.005s 

[EDIT]

Here's an even faster, smaller size and a much more concise / expressive way to do this. These counters are starting to have fun.

 {-# LANGUAGE OverloadedStrings #-} import qualified Data.ByteString.Lazy.Char8 as BSL import qualified Data.ByteString.Char8 as BS import qualified Data.Enumerator as E import qualified Data.Enumerator.Binary as EB import qualified Data.Enumerator.List as EL import Control.Monad.IO.Class (liftIO) import Data.Int numlines :: E.Iteratee BS.ByteString IO () numlines = do num <- EL.fold (\nb -> (BS.count '\n' b) + n ) 0 liftIO . print $ num main = do let i = EB.enumFile "/usr/share/dict/words" E.$$ numlines E.run_ i 

And time

 Eriks-MacBook-Air:skunk erikhinton$ time ./wcl2 235886 real 0m0.015s user 0m0.010s sys 0m0.004s 
+3
source

If you are reading ByteString fragments, you can use the count function of Data.ByteString , then the appropriate step will be

 step !i (Chunk xs) = liftI (step $ i + count 10 xs) 

(possibly with Integral). Data.ByteString.count pretty fast; it should not be too slow than wc -l.

+1
source

I figured out how to fix a type error. The key to a commit type error is to understand the relationship between the Data.ListLike.filter input and the ByteString that is passed to this filter. Here is the Data.ListLike.filter type:

 Data.ListLike.filter :: Data.ListLike.Base.ListLike full item => (item -> Bool) -> full -> full 

full refers to the stream in the context of the / iteratee enumerator, if I understand it correctly. item refers to a stream item.

Now, if we want to filter a new line in the input file, we need to know the type of stream of the input files and the type of elements in this stream. In this case, the input file is read as a ByteString stream. ByteString is documented as a spatially efficient representation of a Word8 vector. So the item type is Word8.

So, when we write the filter, in the step function, we need to make sure that the Bool operation is defined for Word8, since this is the type of element passed to the filter (as explained above). We are filtering for a new line. Thus, a bool function like the one below, which builds a newline representation of Word8 and checks equality against x of type Word8, should work:

 \x -> x == Data.ByteString.Internal.c2w '\n' 

There is another missing element - for some reason, the compiler (v7.0.3 Mac) cannot infer the type el in a signature type of type numfile (if anyone has ideas on why this is why, please discuss). So, an explicit indication that Word8 solves the compilation problem:

 numlines :: (Monad m, Num a, LL.ListLike s Word8) => Iteratee sma 

The full code is below - it compiles and works pretty fast.

 {-# LANGUAGE BangPatterns,FlexibleContexts #-} import Data.Iteratee as I import Data.ListLike as LL import Data.Iteratee.IO import Data.ByteString import GHC.Word (Word8) import Data.ByteString.Internal (c2w) numlines :: (Monad m, Num a, LL.ListLike s Word8) => Iteratee sma numlines = liftI $ step 0 where step !i (Chunk xs) = let newline = c2w '\n' in liftI (step $i + fromIntegral (LL.length $ LL.filter (\x -> x == newline) xs)) step !i stream = idone i stream {-# INLINE numlines #-} main = do i' <- enumFile 1024 "/usr/share/dict/words" (numlines :: (Monad m) => Iteratee ByteString m Int) result <- run i' print result {- Time to run on mac OSX: $ time ./test ## above compiled program: ghc --make -O2 test.hs 235886 real 0m0.011s user 0m0.007s sys 0m0.004s $ time wc -l /usr/share/dict/words 235886 /usr/share/dict/words real 0m0.005s user 0m0.002s sys 0m0.002s -} 
+1
source

All Articles