Why does the Parsec “choice” combinator seem to get stuck on the first choice?

After looking at the sample CSV code in Real World Haskell, I tried to build a small XML parser. But closing tags fail with 'unexpected "/"' errors. Can you tell me why my closeTag parser doesn't work (or maybe never gets called)? Thanks!

    import Text.ParserCombinators.Parsec

    xmlFile = manyTill line eof
    line = manyTill tag eol
    eol = char '\n'

    word = many1 (noneOf "></")

    tag = choice [openTag, closeTag, nullTag, word]

    nullTag = between (char '<') (string "/>") word
    closeTag = between (string "</") (char '>') word
    openTag = between (char '<') (char '>') tagContent
    attrval = between (char '"') (char '"') word
    atts = do { (char ' ') ; sepBy attr (char ' ') }
    attr = do { word ; char '=' ; attrval }
    tagContent = do { w <- word ; option [] atts ; return w }

    parseXML :: String -> Either ParseError [[String]]
    parseXML input = parse xmlFile "(unknown)" input

    main =
        do c <- getContents
           case parse xmlFile "(stdin)" c of
                Left e -> do putStrLn "Error parsing input:"
                             print e
                Right r -> mapM_ print r
1 answer

Parsec's strategy is essentially LL(1), which means that it “commits” to the current branch as soon as any input is consumed. Your openTag parser consumes the < with its char '<', so when the next character turns out to be / (as in a closing tag) rather than the start of a tag name, the whole parse fails with an error instead of trying the next choice. If openTag had failed without consuming any input, the next choice would have been tried. Parsec does this for efficiency (the alternative, unrestricted backtracking, can take exponential time!) and for reasonable error messages.
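To make the commit rule concrete, here is a minimal sketch. The parsers openP, closeP and textP are hypothetical, simplified stand-ins for the ones in the question (tag names are letters only, no attributes), so treat it as an illustration rather than a fix for the original code.

```haskell
import Text.ParserCombinators.Parsec

-- Hypothetical, simplified parsers: tag names are letters only.
openP, closeP, textP :: Parser String
openP  = between (char '<')    (char '>') (many1 letter)
closeP = between (string "</") (char '>') (many1 letter)
textP  = many1 letter

-- openP fails on 'a' without consuming anything, so textP gets its turn.
okNoConsume :: Either ParseError String
okNoConsume = parse (choice [openP, textP]) "(demo)" "abc"

-- openP consumes '<' and only then fails on '/', so closeP is never tried.
stuck :: Either ParseError String
stuck = parse (choice [openP, closeP]) "(demo)" "</a>"
```

In GHCi, okNoConsume should evaluate to Right "abc", while stuck should be a Left reporting the unexpected "/", which is the same symptom as in the question.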

You have two options. The preferred one, when it's reasonable to pull off, is to factor your grammar so that every choice is decided without consuming input, for example:

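    tag = word <|> (char '<' >> tagbody)
      where
        tagbody = do
            content <- tagContent
            -- "/>" and ">" begin with different characters, so no backtracking is needed
            _ <- choice [ string "/>", string ">" ]
            return content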

Modulo errors and style (my brain is a little fried right now :-P).
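A somewhat fuller sketch of the factored approach, including closing tags, might look like the following. The Node type, the name parser, and the decision to ignore attributes are assumptions made for this illustration; none of them come from the question's code.

```haskell
import Text.ParserCombinators.Parsec

-- Hypothetical result type, introduced only for this sketch.
data Node = Open String | Close String | Empty String | Text String
  deriving (Show, Eq)

-- Assumption: tag names are alphanumeric and attributes are ignored.
name :: Parser String
name = many1 alphaNum

node :: Parser Node
node = text <|> (char '<' >> tagBody)
  where
    text = Text <$> many1 (noneOf "<>")
    -- After '<', a single character decides the branch: '/' means a closing tag.
    tagBody = closing <|> openOrEmpty
    closing = do
      _ <- char '/'
      n <- name
      _ <- char '>'
      return (Close n)
    openOrEmpty = do
      n <- name
      -- "/>" and ">" start with different characters, so no try is needed.
      (string "/>" >> return (Empty n)) <|> (char '>' >> return (Open n))

-- Parse a whole input as a flat list of nodes.
nodes :: Parser [Node]
nodes = many node <* eof
```

Every alternative here either succeeds or fails on its first character, so the parser never needs to back up. For instance, parse nodes "(demo)" "<a>hi</a><b/>" should give Right [Open "a",Text "hi",Close "a",Empty "b"].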

The other way, which locally changes Parsec's semantics (at the expense of the aforementioned error messages and performance, though this is usually not so bad because it is local), is to use the try combinator. It lets a parser consume input and still fail “softly”, so that a different choice can be tried:

    nullTag = try $ between (char '<') (string "/>") word
    -- etc.

Sometimes using try is cleaner and easier than factoring as above, since factoring can obscure the “deep structure” of the language. It's a stylistic trade-off.
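Applied to a cut-down version of the question's parsers, the try approach could be sketched as follows. Attributes are left out to keep it short, and the tags helper is a hypothetical addition for running the example; word keeps the definition from the question.

```haskell
import Text.ParserCombinators.Parsec

word :: Parser String
word = many1 (noneOf "></")

-- Same shapes as in the question, minus attribute handling (an assumption
-- made to keep the sketch small).
openTag, closeTag, nullTag :: Parser String
openTag  = between (char '<')    (char '>')    word
closeTag = between (string "</") (char '>')    word
nullTag  = between (char '<')    (string "/>") word

-- try rewinds the input when an alternative fails part-way through,
-- so the next alternative starts again from the '<'.
tag :: Parser String
tag = choice [try openTag, try closeTag, try nullTag, word]

-- Hypothetical helper: parse an entire input as a sequence of tags and words.
tags :: Parser [String]
tags = many tag <* eof
```

With this, parse tags "(demo)" "<a>text</a><b/>" should give Right ["a","text","a","b"]. The last alternative needs no try because nothing follows it in the choice.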
