Writing a parser for regular expressions

Even after many years of programming, I am ashamed to say that I have never fully understood regular expressions. In general, when a problem calls for a regex, I can usually (after a bunch of fiddling with the syntax) come up with a suitable one, but it's a technique I don't use very often.

So, in order to teach myself regular expressions properly and really understand them, I decided to do what I always do when trying to learn something: i.e., try to write something ambitious that I will probably abandon as soon as I feel I've learned enough.

To this end, I want to write a regular expression parser in Python. In this case, "learning enough" means that I want to implement a parser that can fully understand Perl's extended regular expression syntax. It doesn't have to be the most efficient parser, or even be usable in the real world. It just needs to correctly decide whether a pattern matches a given string.

Question: where do I start? I know almost nothing about how regular expressions are parsed and interpreted, apart from the fact that it somehow involves a state machine. Any suggestions on how to approach this rather daunting problem would be much appreciated.

EDIT: I should clarify that, although I'm going to implement the regex parser in Python, I'm not too fussy about which programming language the examples or articles are written in. As long as it's not in Brainfuck, I will probably understand it, and it will be worth my time.

+57
python regex parsing
Sep 03 '10 at 21:07
5 answers

Writing an implementation of a regex engine is indeed quite a challenge.

But if you are interested in how it's done, even if you never dig into enough detail to actually implement one, I would recommend that you at least take a look at this article:

Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)

It explains how many popular programming languages implement regular expressions in a way that can be very slow for some patterns, and describes a slightly different method that runs faster. The article includes some details on how the proposed implementation works, including some C source code. It may be a difficult read if you are just starting to learn regular expressions, but I think it's worth knowing about the difference between the two approaches.
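The article's key idea - tracking every state the automaton could be in simultaneously, instead of backtracking through alternatives one at a time - can be sketched in a few lines. This is a toy illustration of that simulation strategy (a hand-built NFA for the pattern `a*b`), not code from the article:

```python
# Hand-built NFA for the pattern a*b: state 0 loops on 'a' and moves
# to state 1 on 'b'; state 1 is the accepting state.
NFA = {
    (0, 'a'): {0},  # stay in state 0 on 'a' (the a* loop)
    (0, 'b'): {1},  # move to the accepting state on 'b'
}
START, ACCEPT = 0, 1

def nfa_match(text: str) -> bool:
    """Simulate the NFA by tracking all reachable states at once."""
    current = {START}
    for ch in text:
        # Advance every live state simultaneously; no backtracking.
        current = {t for s in current for t in NFA.get((s, ch), set())}
        if not current:
            return False  # no state survives: cannot match
    return ACCEPT in current
```

Because the set of live states is bounded by the number of NFA states, the whole match runs in time linear in the input length - the property the article contrasts with backtracking engines.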

+34
03 Sep '10 at 21:13

I already gave Byers +1, but as far as I remember, that article doesn't really explain how regex matching works, beyond showing why one algorithm is bad and another is much better. Maybe something in the links?

I will focus on the good approach - constructing finite state machines. If you limit yourself to deterministic automata without minimization, this isn't too complicated.

What I describe (very quickly) below is the approach taken in Modern Compiler Design.

Imagine you have the following regular expression ...

a (bc)* d 

The letters represent literal characters to match. * is the usual zero-or-more-repetitions operator. The basic idea is to derive states based on dotted rules, where the dot marks the current position. We take state zero to be the state in which nothing has been matched yet, so the dot goes at the front...

 0 : .a (bc)* d 

The only possible match is 'a', so the next state we get is ...

 1 : a.(bc)* d 

Now we have two possibilities - match "b" (if there is at least one repeat of "bc") or otherwise match "d". Note: we are basically doing a digraph search here (depth-first, breadth-first, or whatever), but we are discovering the digraph as we search it. Assuming a breadth-first strategy, we would need to queue up one of our cases for later consideration, but I'll ignore that issue here. Anyway, we have discovered two new states...

 2 : a (b.c)* d 
 3 : a (bc)* d. 

State 3 is a final state (there may be several). For state 2, we can only match "c", but afterwards we need to be careful with the new dot position. We get "a.(bc)* d" - which is the same as state 1, so we don't need a new state.
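For this example, the derivation above can be written out directly as a transition table. The sketch below is my own illustration of the four states just derived, not code from the book:

```python
# Hand-derived DFA for the pattern a(bc)*d, using the dotted-rule
# states above: 0 = .a(bc)*d, 1 = a.(bc)*d, 2 = a(b.c)*d, 3 = a(bc)*d.
TRANSITIONS = {
    (0, 'a'): 1,
    (1, 'b'): 2,
    (1, 'd'): 3,
    (2, 'c'): 1,  # back to state 1: another repeat or 'd' may follow
}
ACCEPTING = {3}

def matches(text: str) -> bool:
    """Run the DFA over text; accept only if we end in a final state."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False  # no transition for this character: reject
    return state in ACCEPTING
```

For instance, "abcbcd" walks through states 0, 1, 2, 1, 2, 1, 3 and is accepted, while "abd" gets stuck in state 2 and is rejected.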

IIRC, the approach in Modern Compiler Design is to translate the rule when you hit an operator, in order to simplify the handling of the dot. State 1 would be transformed to...

 1 : a.bc (bc)* d 
     a.d 

So your next options are either to match the first repetition or to skip the repetitions. The next states from this are equivalent to states 2 and 3. An advantage of this approach is that you can discard all of your past matches (everything before the "."), since you only care about future matches. This usually gives a smaller state model (though not necessarily a minimal one).

EDIT: If you discard the already-matched part, your state description is a representation of the set of strings that can occur from that point on.

In terms of abstract algebra, this is a kind of closure of a set. An algebra is basically a set with one (or more) operators. Our set contains state descriptions, and our operators are the transitions (character matches). A closed set is one where applying any operator to any members of the set always produces another member that is in the set. The closure of a set is the smallest superset that is closed. So basically, starting from the obvious start state, we construct the minimal set of states that is closed under our set of transition operators - the minimal set of reachable states.

"Minimal" here refers to the closure process - there may be a smaller equivalent automaton, which is what "minimal" usually refers to.

Given this basic idea, it's not so difficult to answer "if I have two state machines representing two sets of strings, how do I derive a third representing the union" (or intersection, or set difference...). Instead of dotted rules, your state representations will be a current state (or set of current states) from each input automaton, perhaps with additional details.

If your regular grammar is getting complicated, you can minimize it. The basic idea here is relatively simple. You group all your states into one equivalence class, or "block". Then you repeatedly test whether blocks need to be split (because their states are not actually equivalent) with respect to a particular type of transition. If all the states in a particular block can match the same character and, in doing so, arrive at the same next block, they are equivalent.
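That block-splitting idea can be sketched as a naive partition refinement (a Moore-style version rather than Hopcroft's optimized one - the function name and representation here are my own):

```python
def minimize(states, alphabet, delta, accepting):
    """Naive partition refinement: split blocks until nothing changes.

    states: iterable of state ids
    alphabet: iterable of symbols
    delta: dict mapping (state, symbol) -> state (a total function)
    accepting: set of accepting states
    Returns a set of blocks, each a frozenset of equivalent states.
    """
    # Initial partition: accepting vs. non-accepting states.
    partition = {frozenset(accepting), frozenset(set(states) - set(accepting))}
    partition.discard(frozenset())  # drop an empty block if one side is empty
    while True:
        # Map each state to the block it currently belongs to.
        block_of = {s: b for b in partition for s in b}
        new_partition = set()
        for block in partition:
            # States stay together only if every symbol sends them
            # to the same next block.
            groups = {}
            for s in block:
                key = tuple(block_of[delta[(s, a)]] for a in alphabet)
                groups.setdefault(key, set()).add(s)
            new_partition.update(frozenset(g) for g in groups.values())
        if new_partition == partition:
            return partition  # stable: no block needed splitting
        partition = new_partition
```

For example, a three-state DFA over {'a'} where states 1 and 2 are both accepting and both loop into accepting states collapses to two blocks, merging 1 and 2.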

Hopcroft's algorithm is an efficient way to handle this basic idea.

What is particularly interesting about minimization is that every deterministic finite automaton has exactly one minimal form. Furthermore, Hopcroft's algorithm will produce the same representation of that minimal form, no matter which representation of which larger case it started from. That is, it's a "canonical" representation that can be used to derive a hash, or for arbitrary-but-consistent orderings. This means you can use minimal automata as keys into containers.

The above is probably a bit sloppy WRT definitions, so make sure you look up any terms yourself before using them, but with a bit of luck it gives a fair quick intro to the basic ideas.

BTW - take a look around the rest of Dick Grune's site - he has a free PDF book on parsing techniques. The first edition of Modern Compiler Design is pretty good IMO, but, as you will see, a second edition is due.

+18
03 Sep '10 at 22:26

This paper takes an interesting approach. The implementation is in Haskell, but it has been reimplemented in Python at least once.

+6
03 Sep '10 at 21:39

There is an interesting (if a little short) chapter in Brian Kernighan's Beautiful Code, appropriately called "A Regular Expression Matcher". In it, he discusses a simple matcher that can match literal characters and the metacharacters . ^ $ *.
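For flavor, here is a rough Python translation of the kind of matcher that chapter describes (the function names follow the C original, but this port is my own sketch, not code from the book):

```python
def match(pattern: str, text: str) -> bool:
    """Search for pattern anywhere in text."""
    if pattern.startswith('^'):
        return match_here(pattern[1:], text)
    # Try the pattern at every starting position, including the end.
    for i in range(len(text) + 1):
        if match_here(pattern, text[i:]):
            return True
    return False

def match_here(pattern: str, text: str) -> bool:
    """Match pattern at the beginning of text."""
    if not pattern:
        return True  # empty pattern matches everything
    if len(pattern) >= 2 and pattern[1] == '*':
        return match_star(pattern[0], pattern[2:], text)
    if pattern == '$':
        return text == ''  # $ must coincide with the end of text
    if text and (pattern[0] == '.' or pattern[0] == text[0]):
        return match_here(pattern[1:], text[1:])
    return False

def match_star(c: str, pattern: str, text: str) -> bool:
    """Match c* followed by pattern at the beginning of text."""
    while True:
        if match_here(pattern, text):
            return True  # zero or more c's consumed, rest matched
        if not text or (text[0] != c and c != '.'):
            return False
        text = text[1:]  # consume one more c and retry
```

For example, `match("b.d", "abcde")` succeeds, while `match("^ab$", "cab")` fails because of the anchor.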

+4
Sep 03 '10 at 21:24

I agree that building a regex engine will improve understanding, but have you taken a look at ANTLR? It automatically generates parsers for any language. So perhaps you can try your hand with one of the language grammars listed under "Grammar Examples" and walk through the AST and parser that it generates. It produces really complex code, but you will get a good understanding of how a parser works.

0
Sep 23 '10 at 7:03


