A brief example of a regular expression converted to a state machine?

In podcast Qaru # 36 ( http://blog.stackoverflow.com/2009/01/podcast-36/ ) the view was expressed: Once you understand how easy it is to configure the state machine, you will never try to use the regular expression inappropriately again .

I did a bunch of searches. I found some scientific articles and other complex examples, but I would like to find a simple example that will help me understand this process. I use a lot of regular expressions and I would like to make sure that I will never use "inappropriate" again.

+15
regex finite-state-machine fsm
Feb 08 '09 at 2:11
source share
6 answers

Of course, although you will need more complex examples to really understand how REs work. Consider the following RE:

^[A-Za-z][A-Za-z0-9_]*$ 

which is a typical identifier (must begin with alpha and can contain any number of alphanumeric and fuzzy characters, including none). The following pseudo-code shows how this can be done using a state machine:

 state = FIRSTCHAR for char in all_chars_in(string): if state == FIRSTCHAR: if char is not in the set "AZ" or "az": error "Invalid first character" state = SUBSEQUENTCHARS next char if state == SUBSEQUENTCHARS: if char is not in the set "AZ" or "az" or "0-9" or "_": error "Invalid subsequent character" state = SUBSEQUENTCHARS next char 

Now, as I said, this is a very simple example. It does not show how to make greedy / genuine matches, rollback, matching within a string (and not the entire string) and other more esoteric state functions of machines that are easily handled by RE syntax.

This is why REs are so strong. The actual final machine code needed to accomplish what a single-line RE can do is usually very long and complex.

The best you could do is grab a copy of some lex / yacc code (or equivalent) for a particular simple language and see the code that it generates. It is not very beautiful (it is not necessary, since it should not be read by people, they should look at the lex / yacc code), but it can give you a better idea of ​​how they work.

+21
Feb 08 '09 at 2:36
source share

Make your own on the fly!

http: //osteele.com/tools/reanimator/ ???

A finite state machine

This is a really well-matched tool that renders regular expressions like FSM. It does not support any syntax that you will find in real regular expression systems, but it is certainly enough to understand exactly what is going on.

+19
Feb 08 '09 at 2:46
source share

A reasonably convenient way to help with this is to use the little-known python re.DEBUG flag for any template:

 >>> re.compile(r'<([AZ][A-Z0-9]*)\b[^>]*>(.*?)</\1>', re.DEBUG) literal 60 subpattern 1 in range (65, 90) max_repeat 0 65535 in range (65, 90) range (48, 57) at at_boundary max_repeat 0 65535 not_literal 62 literal 62 subpattern 2 min_repeat 0 65535 any None literal 60 literal 47 groupref 1 literal 62 

The numbers after "literal" and "range" refer to the integer values ​​of the ascii characters that they must match.

+19
Feb 08 '09 at 3:01
source share

The question arises: "How to choose transition states and conditions?" or "How to implement my abstract automaton in Foo?"

How to choose transition conditions and conditions?

I usually use FSM for fairly simple tasks and select them intuitively. In answer to another question about regular expressions , I just looked at the parsing problem as one of Inside or outside , and wrote out the transitions from there (with the initial and final state to keep the implementation clean).

How to implement my abstract automaton in Foo?

If your implementation language supports a structure like c switch , then you turn on the current state and process the input to see which action and / or transition are also performed next.

Without switch structures, or if they are somehow imperfect, you fork the if style. Ugh.

Everything is written in one place in c , the example that I linked will look something like this:

 token_t token; state_t state=BEGIN_STATE; do { switch ( state.getValue() ) { case BEGIN_STATE; state=OUT_STATE; break; case OUT_STATE: switch ( token.getValue() ) { case CODE_TOKEN: state = IN_STATE; output(token.string()); break; case NEWLINE_TOKEN; output("<break>"); output(token.string()); break; ... } break; ... } } while (state != END_STATE); 

which is pretty dirty, so I usually tore up state cases to separate functions.

+4
Feb 08 '09 at 2:49
source share

I'm sure someone has better examples, but you could check out this post by Phil Haack , which has an example of a regular expression and the state machine does the same (there is a previous post with a few examples of regular expressions there, as I think. .)

Check out the “HenriFormatter” on this page.

+2
Feb 08 '09 at 2:18
source share

I do not know what academic documents you have already read, but in fact it is not so difficult to understand how to implement the final machine. There is interesting math, but the idea is actually very trivial to understand. The easiest way to understand FSM is through input and output (in fact, this includes most formal definitions, which I will not describe here). A “state” essentially simply describes a set of input and output signals that have occurred and can come from a specific point.

Finite state machines are most easily understood from diagrams. For example:

alt text http://img6.imageshack.us/img6/7571/mathfinitestatemachinedco3.gif

All this suggests that if you start in some state q0 (one that has a "Start" symbol next to it), you can switch to other states. Each state is a circle. Each arrow represents an input or output (depending on how you look at it). Another way to think about the final machine is with a "valid" or "acceptable" entry. There are certain output lines that are NOT possible for certain state machines; this will allow you to match expressions.

Now suppose you start with q0. Now, if you enter 0, you will go to state q1. However, if you enter 1, you will go to state q2. You can see this with the symbols above the I / O arrows.

Say you start with q0 and get this input

0, 1, 0, 1, 1, 1

This means that you have passed the state (there is no input for q0, you are just starting there):

q0 → q1 → q0 → q1 → q0 → q2 → q3 → q3

Trace the image with your finger if that makes no sense. Note that q3 returns to itself for both inputs 0 and 1.

Another way of saying all this is: "If you are in q0 state and you see 0, go to q1, but if you see 1, go to q2." If you make these conditions for each state, you are almost finished defining your state apparatus. All you have to do is have a state variable and then a way to input the input, and that is basically what it is.

Okay, so why is this important in relation to Joel's statement? Well, creating a “ONE TRUE REGULAR EXPRESSION TO USE THEM ALL” can be very difficult, as well as difficult to maintain a modification or even for others to come back and understand. In addition, in some cases it is more effective.

Of course, state machines have many other uses. Hope this helps in some way. Note that I did not delve into the theory, but there is some interesting evidence regarding FSM.

+1
Feb 08 '09 at 2:43
source share



All Articles