I assume that you start with something like
    import Prelude hiding (getContents, putStrLn)
    import Data.ByteString
    import Text.Parsec.ByteString
Here is what I have so far. There are two versions, and both compile. Probably neither is exactly what you want, but they should help the discussion and help you clarify your question.
Something I noticed along the way:
- If you import Text.Parsec.ByteString, then this uses the uncons from Data.ByteString.Char8, which in turn uses the w2c from Data.ByteString.Internal to convert every byte it reads to a Char. This lets Parsec's line and column error reporting work sensibly, and also lets you use string and friends without problems.
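The byte-to-Char conversion described above can be seen directly, without Parsec. This is a minimal sketch: the names sample, asChars, firstToken and roundTrip are made up for illustration, and the input is assumed to be UTF-8 bytes. It shows that each byte becomes exactly one Char, so a multi-byte UTF-8 character arrives as several Chars rather than being decoded.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as Char8
import Data.ByteString.Internal (c2w, w2c)
import Data.Word (Word8)

-- Hypothetical sample: the UTF-8 bytes of "a,é" ('é' is 0xC3 0xA9).
sample :: B.ByteString
sample = B.pack [0x61, 0x2C, 0xC3, 0xA9]

-- One Char per byte: the two bytes of 'é' show up as two separate Chars.
asChars :: String
asChars = Char8.unpack sample

-- Char8.uncons is what hands the parser its next token.
firstToken :: Maybe (Char, B.ByteString)
firstToken = Char8.uncons sample

-- w2c followed by c2w gives back the original byte for all 256 values.
roundTrip :: Bool
roundTrip = all (\w -> c2w (w2c w) == w) [minBound .. maxBound :: Word8]
```

Loading this in GHCi and evaluating asChars makes the point: anything outside ASCII reaches the parser byte by byte.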
So the easy version of the CSV parser does exactly that:
    import Prelude hiding (getContents, putStrLn)
    import Data.ByteString (ByteString)
    import qualified Prelude (getContents, putStrLn)
    import qualified Data.ByteString as ByteString (getContents)
    import Text.Parsec
    import Text.Parsec.ByteString

    csvFile :: Parser [[String]]
    csvFile = endBy line eol

    line :: Parser [String]
    line = sepBy cell (char ',')

    cell :: Parser String
    cell = quotedCell <|> many (noneOf ",\n\r")

    quotedCell :: Parser String
    quotedCell = do
      _ <- char '"'
      content <- many quotedChar
      _ <- char '"' <?> "quote at end of cell"
      return content

    -- A doubled quote inside a quoted cell stands for one literal quote.
    quotedChar :: Parser Char
    quotedChar = noneOf "\"" <|> try (string "\"\"" >> return '"')

    eol :: Parser String
    eol = try (string "\n\r")
      <|> try (string "\r\n")
      <|> string "\n"
      <|> string "\r"
      <?> "end of line"

    parseCSV :: ByteString -> Either ParseError [[String]]
    parseCSV = parse csvFile "(unknown)"

    main :: IO ()
    main = do
      c <- ByteString.getContents
      case parse csvFile "(stdin)" c of
        Left e -> do
          Prelude.putStrLn "Error parsing input:"
          print e
        Right r -> mapM_ print r
But that was so trivial to get working that I assume it cannot be what you want. Perhaps you want everything to remain ByteString or [Word8] or something similar all the way through? Hence my second attempt below. I am still importing Text.Parsec.ByteString, which may be a mistake, and the code is hopelessly riddled with conversions.
But it compiles and has full type annotations, and should therefore make a sound starting point.
    import Prelude hiding (getContents, putStrLn)
    import Data.ByteString (ByteString)
    import Control.Monad (liftM)
    import qualified Prelude (getContents, putStrLn)
    import qualified Data.ByteString as ByteString (pack, getContents)
    import qualified Data.ByteString.Char8 as Char8 (pack)
    import Data.Word (Word8)
    import Data.ByteString.Internal (c2w)
    import Text.Parsec ((<|>), (<?>), parse, try, endBy, sepBy, many)
    import Text.Parsec.ByteString
    import Text.Parsec.Prim (tokens, tokenPrim)
    import Text.Parsec.Pos (updatePosChar, updatePosString)
    import Text.Parsec.Error (ParseError)

    csvFile :: Parser [[ByteString]]
    csvFile = endBy line eol

    line :: Parser [ByteString]
    line = sepBy cell (char ',')

    cell :: Parser ByteString
    cell = quotedCell <|> liftM ByteString.pack (many (noneOf ",\n\r"))

    quotedCell :: Parser ByteString
    quotedCell = do
      _ <- char '"'
      content <- many quotedChar
      _ <- char '"' <?> "quote at end of cell"
      return (ByteString.pack content)

    quotedChar :: Parser Word8
    quotedChar = noneOf "\"" <|> try (string "\"\"" >> return (c2w '"'))

    eol :: Parser ByteString
    eol = try (string "\n\r")
      <|> try (string "\r\n")
      <|> string "\n"
      <|> string "\r"
      <?> "end of line"

    parseCSV :: ByteString -> Either ParseError [[ByteString]]
    parseCSV = parse csvFile "(unknown)"

    main :: IO ()
    main = do
      c <- ByteString.getContents
      case parse csvFile "(stdin)" c of
        Left e -> do
          Prelude.putStrLn "Error parsing input:"
          print e
        Right r -> mapM_ print r

    -- replacements for some of the functions in the Parsec library

    noneOf :: String -> Parser Word8
    noneOf cs = satisfy (\b -> b `notElem` [c2w c | c <- cs])

    char :: Char -> Parser Word8
    char c = byte (c2w c)

    byte :: Word8 -> Parser Word8
    byte c = satisfy (==c) <?> show [c]

    -- The stream still yields Chars; each one is converted to a Word8 here.
    satisfy :: (Word8 -> Bool) -> Parser Word8
    satisfy f = tokenPrim (\c -> show [c])
                          (\pos c _cs -> updatePosChar pos c)
                          (\c -> if f (c2w c) then Just (c2w c) else Nothing)

    string :: String -> Parser ByteString
    string s = liftM Char8.pack (tokens show updatePosString s)
Your main efficiency concern should probably be the two ByteString.pack calls in the definitions of cell and quotedCell. You could try replacing the Text.Parsec.ByteString module so that, instead of making strict ByteStrings an instance of Stream with token type Char, you make ByteStrings an instance of Stream with token type Word8. But this will not help with efficiency; it will just give you a headache, because you have to reimplement all the SourcePos functions to keep track of your position in the input for error messages.
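For reference, here is roughly what the Word8-token route involves. This is only a sketch under assumptions: Bytes, advance, satisfyByte, field and parseField are hypothetical names, the newtype wrapper is used because parsec already ships a Stream instance for ByteString with token type Char, and the position tracking only handles a bare newline byte. It is meant to show the extra bookkeeping, not to recommend the approach.

```haskell
{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}

import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.Word (Word8)
import Text.Parsec (ParseError, Parsec, Stream (..), many, parse, tokenPrim)
import Text.Parsec.Pos (SourcePos, incSourceColumn, incSourceLine, setSourceColumn)

-- Hypothetical wrapper: a separate type is needed to choose a new token type.
newtype Bytes = Bytes ByteString

instance Monad m => Stream Bytes m Word8 where
  uncons (Bytes b) = return $ case B.uncons b of
    Nothing      -> Nothing
    Just (w, b') -> Just (w, Bytes b')

-- Hand-written position tracking: byte 10 ('\n') starts a new line,
-- every other byte advances the column by one.
advance :: SourcePos -> Word8 -> SourcePos
advance pos 10 = setSourceColumn (incSourceLine pos 1) 1
advance pos _  = incSourceColumn pos 1

-- Byte-level counterpart of Parsec's satisfy.
satisfyByte :: (Word8 -> Bool) -> Parsec Bytes () Word8
satisfyByte f =
  tokenPrim show (\pos w _rest -> advance pos w) (\w -> if f w then Just w else Nothing)

-- Everything up to (not including) the next comma, byte 44.
field :: Parsec Bytes () [Word8]
field = many (satisfyByte (/= 44))

parseField :: ByteString -> Either ParseError [Word8]
parseField = parse field "(example)" . Bytes
```

Every combinator that reports positions (char, string, noneOf and so on) would need a similar byte-level rewrite, which is the headache referred to above.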
No, the way to make it more efficient is to change the types of cell, quotedCell and string to Parser [Word8], and the types of line and csvFile to Parser [[Word8]] and Parser [[[Word8]]] respectively. You can even change the type of eol to Parser (). The necessary changes look something like this:
    cell :: Parser [Word8]
    cell = quotedCell <|> many (noneOf ",\n\r")

    quotedCell :: Parser [Word8]
    quotedCell = do
      _ <- char '"'
      content <- many quotedChar
      _ <- char '"' <?> "quote at end of cell"
      return content

    -- tokens returns the matched Chars inside the parser, so the conversion
    -- to bytes has to be mapped over its result (liftM is from Control.Monad).
    string :: String -> Parser [Word8]
    string s = liftM (map c2w) (tokens show updatePosString s)
You do not need to worry about all the c2w calls as far as efficiency is concerned, because they cost essentially nothing.
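The eol change mentioned earlier (returning () instead of the matched text) could look like the sketch below. For brevity it uses Parsec's stock Char-based string rather than the replacement defined in the second version, and endsLine is a hypothetical helper added only to try it out.

```haskell
import Control.Monad (void)
import qualified Data.ByteString.Char8 as Char8
import Text.Parsec (eof, parse, string, try, (<?>), (<|>))
import Text.Parsec.ByteString (Parser)

-- Same alternatives as before, but the matched text is thrown away,
-- so nothing has to be packed or converted for a line ending.
eol :: Parser ()
eol = void (try (string "\n\r")
        <|> try (string "\r\n")
        <|> string "\n"
        <|> string "\r")
      <?> "end of line"

-- Hypothetical helper: True when the whole input is exactly one line ending.
endsLine :: String -> Bool
endsLine s =
  either (const False) (const True) (parse (eol >> eof) "(example)" (Char8.pack s))
```

Since endBy discards the separator's result anyway, csvFile = endBy line eol needs no change for this.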
If this does not answer your question, please tell me.