Matching Bytes in Parsec

I'm currently trying to use the full CSV parser presented in Real World Haskell. I tried to change the code to use ByteString instead of String, but it relies on the string combinator, which only works with String.

Is there a Parsec combinator like string that works with ByteString, without having to convert back and forth?

I saw that there is an alternative parser that handles ByteString, attoparsec, but I would prefer to stick with Parsec since I have only just learned how to use it.

haskell parsec bytestring
2 answers

I assume that you start with something like

    import Prelude hiding (getContents, putStrLn)
    import Data.ByteString
    import Text.Parsec.ByteString

Here is what I have so far. There are two versions, and both compile. Probably neither is exactly what you want, but they should aid discussion and help you clarify your question.

Something I noticed along the way:

  • If you import Text.Parsec.ByteString, it uses uncons from Data.ByteString.Char8, which in turn uses w2c from Data.ByteString.Internal to convert every byte read into a Char. This lets Parsec's line and column error reporting work sensibly, and also lets you use string and friends without problems.
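As a minimal sketch of that point: with Text.Parsec.ByteString in scope, the ordinary string combinator can be run directly on a strict ByteString. The names idField, parseId and sample are made up for illustration and are not part of Parsec.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as Char8
import Text.Parsec (ParseError, digit, many1, parse, string)
import Text.Parsec.ByteString (Parser)

-- Hypothetical example parser: the literal "id=" followed by digits.
-- 'string' and 'digit' are the ordinary Parsec combinators; they work
-- here because the ByteString stream hands Parsec one Char per byte.
idField :: Parser String
idField = string "id=" >> many1 digit

-- Run the parser directly on a strict ByteString input.
parseId :: ByteString -> Either ParseError String
parseId = parse idField "(example)"

-- Illustrative input only.
sample :: ByteString
sample = Char8.pack "id=42"
```

Note that the results are still String values; only the input is a ByteString.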

So, a simple version of the CSV parser that does exactly that:

    import Prelude hiding (getContents, putStrLn)
    import Data.ByteString (ByteString)
    import qualified Prelude (getContents, putStrLn)
    import qualified Data.ByteString as ByteString (getContents)
    import Text.Parsec
    import Text.Parsec.ByteString

    csvFile :: Parser [[String]]
    csvFile = endBy line eol

    line :: Parser [String]
    line = sepBy cell (char ',')

    cell :: Parser String
    cell = quotedCell <|> many (noneOf ",\n\r")

    quotedCell :: Parser String
    quotedCell = do
      _ <- char '"'
      content <- many quotedChar
      _ <- char '"' <?> "quote at end of cell"
      return content

    quotedChar :: Parser Char
    quotedChar = noneOf "\"" <|> try (string "\"\"" >> return '"')

    eol :: Parser String
    eol =   try (string "\n\r")
        <|> try (string "\r\n")
        <|> string "\n"
        <|> string "\r"
        <?> "end of line"

    parseCSV :: ByteString -> Either ParseError [[String]]
    parseCSV = parse csvFile "(unknown)"

    main :: IO ()
    main = do
      c <- ByteString.getContents
      case parse csvFile "(stdin)" c of
        Left e -> do
          Prelude.putStrLn "Error parsing input:"
          print e
        Right r -> mapM_ print r

But this was so trivial to get working that I assume it cannot be what you want. Perhaps you want everything to remain a ByteString or [Word8] or something similar all the way through? Hence my second attempt below. I am still importing Text.Parsec.ByteString, which may be a mistake, and the code is hopelessly riddled with conversions.

But it compiles and has full type annotations, so it should make a sound starting point.

    import Prelude hiding (getContents, putStrLn)
    import Data.ByteString (ByteString)
    import Control.Monad (liftM)
    import qualified Prelude (getContents, putStrLn)
    import qualified Data.ByteString as ByteString (pack, getContents)
    import qualified Data.ByteString.Char8 as Char8 (pack)
    import Data.Word (Word8)
    import Data.ByteString.Internal (c2w)
    import Text.Parsec ((<|>), (<?>), parse, try, endBy, sepBy, many)
    import Text.Parsec.ByteString
    import Text.Parsec.Prim (tokens, tokenPrim)
    import Text.Parsec.Pos (updatePosChar, updatePosString)
    import Text.Parsec.Error (ParseError)

    csvFile :: Parser [[ByteString]]
    csvFile = endBy line eol

    line :: Parser [ByteString]
    line = sepBy cell (char ',')

    cell :: Parser ByteString
    cell = quotedCell <|> liftM ByteString.pack (many (noneOf ",\n\r"))

    quotedCell :: Parser ByteString
    quotedCell = do
      _ <- char '"'
      content <- many quotedChar
      _ <- char '"' <?> "quote at end of cell"
      return (ByteString.pack content)

    quotedChar :: Parser Word8
    quotedChar = noneOf "\"" <|> try (string "\"\"" >> return (c2w '"'))

    eol :: Parser ByteString
    eol =   try (string "\n\r")
        <|> try (string "\r\n")
        <|> string "\n"
        <|> string "\r"
        <?> "end of line"

    parseCSV :: ByteString -> Either ParseError [[ByteString]]
    parseCSV = parse csvFile "(unknown)"

    main :: IO ()
    main = do
      c <- ByteString.getContents
      case parse csvFile "(stdin)" c of
        Left e -> do
          Prelude.putStrLn "Error parsing input:"
          print e
        Right r -> mapM_ print r

    -- replacements for some of the functions in the Parsec library

    noneOf :: String -> Parser Word8
    noneOf cs = satisfy (\b -> b `notElem` [c2w c | c <- cs])

    char :: Char -> Parser Word8
    char c = byte (c2w c)

    byte :: Word8 -> Parser Word8
    byte c = satisfy (==c) <?> show [c]

    satisfy :: (Word8 -> Bool) -> Parser Word8
    satisfy f = tokenPrim (\c -> show [c])
                          (\pos c _cs -> updatePosChar pos c)
                          (\c -> if f (c2w c) then Just (c2w c) else Nothing)

    string :: String -> Parser ByteString
    string s = liftM Char8.pack (tokens show updatePosString s)

Your main efficiency concern should probably be the two ByteString.pack calls in the definitions of cell and quotedCell. You could try replacing the Text.Parsec.ByteString module so that, instead of "making strict ByteStrings an instance of Stream with Char token type", you make ByteStrings an instance of Stream with Word8 token type. But this will not help with efficiency; it will just give you a headache as you reimplement all the source position functions to keep track of your position in the input for error messages.
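For reference, here is a rough sketch of what such a Word8 stream could look like. It is an illustration under assumptions, not a recommendation: Bytes, ByteParser, satisfyByte, field and parseField are hypothetical names, and the newtype is there because Parsec already provides an instance for ByteString with Char tokens. The position handling is deliberately crude, which is exactly the headache described above.

```haskell
{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}

import qualified Data.ByteString as B
import Data.Word (Word8)
import Text.Parsec.Error (ParseError)
import Text.Parsec.Pos (incSourceColumn)
import Text.Parsec.Prim (Parsec, Stream (..), many, parse, tokenPrim)

-- Hypothetical wrapper type, so that this instance does not clash with
-- the existing ByteString instance that uses Char tokens.
newtype Bytes = Bytes B.ByteString

instance Monad m => Stream Bytes m Word8 where
  uncons (Bytes b) = return $ case B.uncons b of
    Nothing      -> Nothing
    Just (w, b') -> Just (w, Bytes b')

type ByteParser = Parsec Bytes ()

-- Accept one byte satisfying the predicate. Every byte simply advances
-- the column by one, so line numbers in error messages are not tracked.
satisfyByte :: (Word8 -> Bool) -> ByteParser Word8
satisfyByte p = tokenPrim show
                          (\pos _ _ -> incSourceColumn pos 1)
                          (\w -> if p w then Just w else Nothing)

-- Example: everything up to, but not including, the first comma (byte 44).
field :: ByteParser B.ByteString
field = B.pack <$> many (satisfyByte (/= 44))

parseField :: B.ByteString -> Either ParseError B.ByteString
parseField = parse field "(example)" . Bytes
```

Even in this form, field still collects a [Word8] and packs it, so the ByteString.pack cost is unchanged.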

No, the way to make it more efficient is to change the types of cell, quotedCell and string to Parser [Word8], and the types of line and csvFile to Parser [[Word8]] and Parser [[[Word8]]] respectively. You could even change the type of eol to Parser (). The necessary changes would look something like this:

    cell :: Parser [Word8]
    cell = quotedCell <|> many (noneOf ",\n\r")

    quotedCell :: Parser [Word8]
    quotedCell = do
      _ <- char '"'
      content <- many quotedChar
      _ <- char '"' <?> "quote at end of cell"
      return content

    -- tokens yields a parser of [Char], so map c2w over its result
    string :: String -> Parser [Word8]
    string s = liftM (map c2w) (tokens show updatePosString s)

You do not need to worry about all the c2w calls as far as efficiency is concerned, because they cost nothing.

If this does not answer your question, please tell me.


I don't believe so. You will need to create one yourself using tokens. Although the documentation for it is a little ... nonexistent, the first two arguments are a function used to show the expected tokens in an error message and a function to update the source position that is printed with errors.
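A minimal sketch of such a combinator, assuming the Char-token ByteString stream from Text.Parsec.ByteString. The names byteString and matchHello are made up for illustration; the literal is unpacked once to build the parser and the match is packed again, so this hides the conversions rather than removing them.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as Char8
import Text.Parsec.ByteString (Parser)
import Text.Parsec.Error (ParseError)
import Text.Parsec.Pos (updatePosString)
import Text.Parsec.Prim (parse, tokens)

-- Hypothetical helper: match a literal ByteString and return it.
-- First argument of 'tokens': how to show the expected tokens in errors.
-- Second argument: how to advance the source position past the match.
-- Third argument: the list of tokens to match.
byteString :: ByteString -> Parser ByteString
byteString b = Char8.pack <$> tokens show updatePosString (Char8.unpack b)

-- Example use: match the literal "hello" at the start of the input.
matchHello :: ByteString -> Either ParseError ByteString
matchHello = parse (byteString (Char8.pack "hello")) "(example)"
```

As with Parsec's own string, a partial match may consume input before failing, so wrap byteString in try when it is used as one branch of an alternative.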
