There are two large families of regex engines: NFA and DFA (I use the terminology from Jeffrey Friedl's book):
- Nondeterministic state machine
- Deterministic state machine
An NFA implementation will roughly work as follows:
- Keep a pointer to the current offset in the input line
- Hold the pointer to the current position in the template (which is interpreted as a graph or tree).
Then use the template as a recipe for moving forward in the input line. If the pattern describes a , for example, and if the current input offset points to the character a , then this character will be consumed, and both pointers will move to the next position. If the character does not match, there will be a countdown (the input pointer will go to the previous valid position, and the template pointer will be set to another possible alternative in the input position).
The fact is that recognition is controlled by a pattern .
(the above explanation is very crude, since it does not include optimization, etc.), and modern templates in any case cannot be implemented using a formal machine)
DFA implementation works the other way around:
There is another pointer to input, but there are several pointers to a pointer. An input pattern will advance character by character, and pattern pointers will track the actual state in the pattern for that input.
The tag is controlled by the input .
Both of these methods have very different properties:
- NFA engines can offer much more features, but their runtime depends on the combination of input and template itself.
- DFA engines offer fewer features, but their complexity is O (n), where n is the length of the input string.
Some regex mechanisms (such as PCRE) can implement both recognition methods. I recommend that you read PCRE docs about two relevant algorithms that explain the differences in more technical terms.
As for the actual implementation, it is highly dependent on the regular expression mechanism itself. PCRE has several of them:
- Tree-Based NFA Algorithm
- Optimized version above based on JIT compilation (one version for each supported command set)
- DFA implementation
So you can see that there are several possible approaches for the NFA. Other engines have different implementations that allow you to use a different set of functions. For example, .NET regular expressions can be matched from left to right or right to left, and therefore can easily provide a variable lookbehind length, while PCRE lookbehind is implemented by temporarily switching the input pointer left to the expected input lookbehind length and matching from left to right from this offset.