Is it possible to efficiently view more than one Char in Attoparsec?

Question

Is it possible to efficiently view more than one Char in Attoparsec?

I am trying to expand the Haskell Attoparsec parser library with a function

takeRegex :: Regex -> Parser ByteString

using one of the regular expression implementations.

(Motivation: good regex libraries can provide performance that is linear in input length, while attoparsec requires backtracking. Part of my input is especially suitable for parsing using regexes and even for backtracking Text.Regex.PCRE library gets me 4x speed over attoparsec code for this part.)

Attoparsec had a getInput :: Parser ByteString to get the remaining input (no consumption); this would probably be perfectly ideal for my purposes, since my input is not incremental, strict, and small enough - I run the parser for the log file line at a time. With this, it seems, I could do something like

 takeRegex re = do input <- getInput m <- matchM re input take $ length m

Unfortunately, in recent versions of attoparsec this feature is missing. Is there any way to achieve the same? Why was the feature removed?

Now there is a function takeByteString :: Parser ByteString that accepts and consumes the rest of the input. If there was a function for trying to parse and get the result without actually consuming anything, it could be used in combination with this, but I can not find (or figure out how to implement) such a function.

Is there a way to achieve this using the current version of attoparsec?

+6

regex parsing haskell attoparsec

Sami liedes Jan 03 '14 at 1:24

source share

1 answer

jamshidh · Accepted Answer · 2014-01-03T05:29:04+0000

There are several solutions for this, but no one is great.

Method 1- Quickly implement, but not so fast to launch

Well, (according to http://hackage.haskell.org/package/attoparsec-0.10.1.1/docs/Data-Attoparsec-ByteString.html ), attoparsec always backs down, so you can always do something like this -

 parseLine1 = do line <- takeTill (== '\n') char '\n' case <some sort of test on line, ie- a regex> of Just -> return <some sort of data type> Nothing -> fail "Parse Error"

then many of these chains will work as expected later

 parseLine = parseLine1 <|> parseLine2

The problem with this solution is that, as you can see, you are still doing a bunch of digressions, which can really slow things down.

Method 2- Traditional Method

The usual way to deal with this type is to rewrite the grammar, or in the case of the parser combinator, to move things around so that the complete algorithm needs only one look symbol. This can almost always be done in practice, although sometimes it makes logic much harder to follow ....

For example, suppose you have a grammar creation rule like this -

 pet = "dog" | "dolphin"

To do this, you need two view characters before any path is allowed. Instead, you can leave it all as a thing

 pet => "ca" halfpet halfpet => "g" | "lphin"

No parallel processing is required, but the grammar is much uglier. (Although I wrote this as a production rule, there is a one-to-one comparison with a similar parser combinator).

Method 3. The correct way, but for the record.

The true way you want to do this is to directly compile the regular expression into a parser combinator ... After compiling any ordinary language, the resulting algorithm always needs only one lookahead character, so the resulting attoparsec code should (for example, the procedure in method 1 to read one character), but the job will be to compile the regular expression.

Regular expression compilation is considered in many textbooks, so I won’t go into details here, but it basically comes down to replacing all the ambiguous paths in the regex state machine with new states. Or, to put it another way, he automatically “left the factors” in all cases that need backtracking.

(I wrote a library that automatically “left factors” in many cases in the context of free grammars, turning almost any contextual free grammar into a linear parser once, but I haven't made it available ... sometime when I cleaned it, I will).

Is it possible to efficiently view more than one Char in Attoparsec?

More articles: