How do regular expressions work behind the scenes (processor level)?

Do interpreters and compilers (and ultimately the CPU) walk the two strings and potentially match them character by character, left to right? Or is a binary value (for example, a bit pattern) assigned to each string in the comparison function? Or does it depend on how the string is encoded (ASCII or UTF-32), or on the interpreter, compiler, database engine, or programming language?

Reorganizing a data store (data files or databases) is a significant effort. The answer to a similar question on Stack Overflow did not give a definitive account of the encoding question (whether bit patterns or actual letter characters are evaluated). The answer to this question may matter for optimization.

I do not want to know how to implement a regex engine (e.g. write my own). I want to know, for educational purposes, how to use existing regular expressions in an optimal way (for example, when it is time to design data that will be stored as a composition of substrings, should I assume evaluation proceeds left to right?). A similar Stack Overflow answer (linked through a site with an untrusted certificate) focuses on state machines (string-matching theory). That answer outlines how matching can work, and the computational complexity of string comparison, which suggests evaluation is left to right, but I do not think it was conclusive: the article was largely specific to Perl and to Thompson's language-agnostic nondeterministic finite automaton algorithm. I would like to know specifically about these three combinations of technologies: 1) native Java functions with ASCII data files, 2) MySQL (table data and SELECT statements), and 3) native Python functions with UTF-32 data files.

My question and approach differ from the old post in that I am not trying to develop a parser that executes regular expressions; I am trying to organize stored data for future development. I want to know how to use existing regex tools in an optimal way. I believe Stack Overflow is the right forum because the question is central to regular expressions, and it was upvoted in its original, less detailed form.

At the CPU level, are bit patterns the representations of the characters in a string? Is there a short-lived index of bit patterns, one per character, for the strings involved in the comparison (one of which is fixed)? I would expect this to vary with the technology (e.g. database, programming language, and/or data encoding).
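As a rough illustration of the bit-pattern question (a sketch only, not how any particular interpreter works internally): at the machine level a string is an array of numeric code units, and comparing two characters is an integer comparison of those units. The example below uses Python to make the UTF-32 code units visible.

```python
# Sketch: a string is stored as numeric code units, and a character
# comparison is an integer comparison of those units.
s = "abc"
encoded = s.encode("utf-32-le")          # UTF-32: fixed 4 bytes per code point
units = [int.from_bytes(encoded[i:i + 4], "little")
         for i in range(0, len(encoded), 4)]
assert units == [ord("a"), ord("b"), ord("c")]   # 0x61, 0x62, 0x63

# Matching the literal pattern "b" at offset 1 reduces to:
assert units[1] == ord("b")
```

So "bit patterns" and "actual letter characters" are the same thing here: the letter is its bit pattern, in whatever encoding the engine uses internally.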

+7
optimization string filter regex
3 answers

There are two large families of regex engines, NFA and DFA (I use the terminology from Jeffrey Friedl's book):

  • NFA: nondeterministic finite automaton
  • DFA: deterministic finite automaton

An NFA implementation will roughly work as follows:

  • Keep a pointer to the current offset in the input string
  • Keep a pointer to the current position in the pattern (which is interpreted as a graph or tree)

Then use the pattern as a recipe for advancing through the input string. If the pattern expects an "a", for example, and the current input offset points to the character "a", then that character is consumed and both pointers move forward. If the character does not match, the engine backtracks: the input pointer returns to the previous valid position, and the pattern pointer is set to another possible alternative for that input position.
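The backtracking scheme described above can be sketched in a few lines. This is an illustration only (literals, "." and "*", prefix matching), not any real engine's code:

```python
def match_here(pattern, text):
    """Return True if pattern matches at the start of text.

    Supports literal characters, '.' (any char), and '*' (zero or
    more of the preceding element), with backtracking on failure.
    """
    if not pattern:
        return True                       # pattern exhausted: success
    if len(pattern) >= 2 and pattern[1] == "*":
        i = 0
        while True:
            # Try to match the rest of the pattern after i repetitions;
            # on failure, consume one more character and retry (backtracking).
            if match_here(pattern[2:], text[i:]):
                return True
            if i < len(text) and pattern[0] in (".", text[i]):
                i += 1
            else:
                return False
    if text and pattern[0] in (".", text[0]):
        return match_here(pattern[1:], text[1:])
    return False

assert match_here("a*b", "aaab")      # backtracks until "b" lines up
assert match_here("a.c", "abc")
assert not match_here("ab", "ac")
```

Note how the pattern drives everything: each recursive call asks "what does the pattern want next?" and moves the input pointer accordingly.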

The point is that matching is driven by the pattern.

(The above explanation is very rough: it leaves out optimizations, and in any case modern pattern features such as backreferences cannot be implemented with a formal finite automaton.)

A DFA implementation works the other way around:

There is still a single pointer into the input, but there can be several pointers into the pattern. The input advances character by character, and the pattern pointers track every state the pattern could be in for the input read so far.

Matching is driven by the input.
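The multiple-pointer idea can be sketched with a hand-built state machine for the pattern "ab|ac" (an illustrative toy, not a real engine; the state numbers and transition table are made up for this example):

```python
# Hand-built automaton for the pattern "ab|ac".
# transitions[state][char] -> set of next states.
transitions = {
    0: {"a": {1, 2}},   # after 'a', BOTH alternatives stay live
    1: {"b": {3}},
    2: {"c": {3}},
}
ACCEPT = {3}

def simulate(text):
    """Advance through text once, tracking all live pattern pointers."""
    states = {0}
    for ch in text:
        states = {nxt
                  for st in states
                  for nxt in transitions.get(st, {}).get(ch, set())}
        if not states:       # no live pointers left: reject early
            return False
    return bool(states & ACCEPT)

assert simulate("ab") and simulate("ac")
assert not simulate("ad")
```

After reading "a" the simulation holds two pattern pointers at once, so it never needs to back up in the input: each character is examined exactly once.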

Both of these methods have very different properties:

  • NFA engines can offer many more features, but their runtime depends on both the input and the pattern itself.
  • DFA engines offer fewer features, but their complexity is O(n), where n is the length of the input string.
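A concrete example of the feature gap: backreferences require the engine to remember what a group captured earlier in the same match attempt, which a pure DFA cannot do. Python's `re` module is a backtracking (NFA-family) engine, so it supports them:

```python
import re

# Backreference: \1 must repeat whatever group 1 captured.
doubled = re.compile(r"\b(\w+) \1\b")        # a word repeated twice in a row
assert doubled.search("the the cat") is not None
assert doubled.search("the cat") is None

# The price of such features: runtime depends on pattern AND input.
# A pattern like r"(a+)+$" run against "a" * 30 + "b" triggers
# exponential backtracking in engines of this family.
```

The `(a+)+$` example is the classic "catastrophic backtracking" case; a DFA engine would reject the same input in linear time, but could never support `\1`.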

Some regex engines (such as PCRE) implement both matching methods. I recommend reading the PCRE documentation on the two matching algorithms, which explains the differences in more technical terms.

As for the actual implementation, it depends heavily on the regex engine itself. PCRE has several:

  • An interpreted, tree-based NFA algorithm
  • A JIT-compiled, optimized version of the above (one backend per supported instruction set)
  • A DFA implementation

So you can see that there are several possible approaches even for an NFA. Other engines have different implementations that enable different feature sets. For example, .NET regular expressions can match either left to right or right to left, and can therefore easily provide variable-length lookbehind, while PCRE implements lookbehind by temporarily moving the input pointer left by the expected lookbehind length and then matching left to right from that offset.
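This lookbehind limitation is visible from Python as well, since its `re` engine uses the same "step left, then match left to right" strategy and therefore must know the lookbehind's width at compile time:

```python
import re

# Fixed-width lookbehind works: step back len("USD") characters, match forward.
assert re.search(r"(?<=USD)\d+", "price USD42").group() == "42"

# Variable-width lookbehind is rejected when the pattern is compiled,
# because the engine cannot know how far left to move the input pointer.
try:
    re.compile(r"(?<=USD\s*)\d+")
except re.error as exc:
    print("rejected:", exc)
```

A right-to-left matcher like .NET's never needs this restriction, because it can simply continue matching leftward from the current position.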

+6

Regular expressions do not specify implementation details; it is up to the language to decide how best to compile them down to machine code. For example, the regex.c implementation appears to examine more than one character at a time.

+2
source share

How regular expressions work is an implementation detail. They can be implemented one way or another.

In fact, some languages implement them inefficiently.

If you want to know more, I can refer to this article: https://swtch.com/~rsc/regexp/regexp1.html

0
