Find out the position where the regular expression failed

I am trying to write a lexer in JavaScript to find the tokens of a simple domain-specific language. I started with a simple implementation that tries each regular expression in turn at the current position in the string and accepts the token if one of them matches.

The problem is that when the input does not match such a regular expression, the whole regexp fails, so I don't know which character caused the error.

Is there a way to find out the position in the line that caused the regular expression to fail?

To be clear: I am not asking for help debugging my regular expression or checking its correctness. It is already correct: it matches the valid lines and rejects the invalid ones. I just want to know programmatically where exactly the regexp stops matching, so that I can report the position of the incorrect character in the user's input and how many characters before it were valid.

Is there a way to do this with simple regular expressions instead of continuing to implement a full-blown state machine?

+7
javascript regex lexical-analysis
3 answers

Short answer

There is no such thing as a "position in a string that causes a regular expression to fail."

However, I will show you the approach to answer the reverse question:

Which regex token made the engine unable to match the string?

Discussion

In my opinion, the question of which position in the string caused the regular expression to fail is upside down. As the engine moves down the string with the left hand and the pattern with the right hand, a regex token that matches six characters at one moment can later, because of quantifiers and backtracking, be reduced to matching zero characters or expanded to match ten.

In my opinion, a more appropriate question is:

Which regex token made the engine unable to match the string?

For example, consider the regular expression ^\w+\d+$ and the string abc132z .

\w+ can actually match the whole string. Yet the regex as a whole fails. Does it make sense to say that the regex fails at the end of the string? I do not think so. Consider this.

Initially, \w+ will match abc132z . Then the engine advances to the next token: \d+ . At this point, the engine backtracks in the string, gradually letting \w+ give up 2z (so that \w+ now matches only abc13 ), which allows \d+ to match 2 .

At this point, the $ assertion fails because z remains. The engine backtracks, letting \w+ give up the character 3 , then 1 (so that \w+ now matches only abc ), eventually allowing \d+ to match 132 . At each step, the engine tries the $ assertion and fails. Depending on the engine's internals, more backtracking may occur: \d+ will give up 2 and 3 once again, then \w+ will give up c and b. When the engine finally gives up, \w+ matches only the initial a . Can you say that the regular expression "fails" on the "3"? On the "b"?

No. But if you look at the regex pattern from left to right, you can argue that it fails on $ , because this is the first token that we could not add to the match. Keep in mind that there are other ways to argue this.
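The first stage of this walkthrough can be observed from JavaScript by wrapping the two quantified tokens in capture groups. This is only a small sketch: the capture groups and the variable names are additions for illustration and are not part of the pattern discussed above.

```javascript
// Capture groups expose how the two quantified tokens split the input.
const input = "abc132z";

// Without the final $, the engine settles on \w+ = "abc13" and \d+ = "2".
const partial = /^(\w+)(\d+)/.exec(input);
console.log(partial[1], partial[2]); // abc13 2

// With $ added, every split the engine tries is rejected, so exec returns null.
const full = /^(\w+)(\d+)$/.exec(input);
console.log(full); // null
```

The null result carries no position at all, which is why the failure is easier to attribute to a token of the pattern than to a character of the string.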

Further down, I will give you a screenshot to visualize this. But first, let's see if we can answer the other question.

The other question

Are there techniques that let us answer the other question:

Which regex token made the engine unable to match the string?

It depends on your regular expression. If you can slice the regular expression into clean components, you can build an expression from a series of optional lookaheads containing capture groups, which allows the match to always succeed. The first unset capture group is the one that caused the failure.

JavaScript is a bit stingy with optional lookaheads, but you can write something like this:

 ^(?:(?=(\w+)))?(?:(?=(\w+\d+)))?(?:(?=(\w+\d+$)))?. 

In PCRE, .NET, Python ... you can write this more compactly:

 ^(?=(\w+))?(?=(\w+\d+))?(?=(\w+\d+$))?. 

What's going on here? Each lookahead builds incrementally on the previous one, adding one token at a time. Therefore, we can test each token separately. The dot at the end is an optional flourish for visual feedback: we can see in a debugger that at least one character is matched, but we do not care about this character, we only care about the capture groups.

  • Group 1 tests the token \w+
  • Group 2 appears to test \w+\d+ , so incrementally it tests the token \d+
  • Group 3 appears to test \w+\d+$ , so incrementally it tests the token $

There are three capture groups. If all three are set, the match is a complete success. If only group 3 is not set (as with abc123a ), you can say that $ caused the failure. If group 1 is set but not group 2 (as with abc ), you can say that \d+ caused the failure.
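The same incremental idea can also be sketched in JavaScript without optional lookaheads, by testing each cumulative prefix of the pattern as its own regular expression. This is a variation on the technique, not the single-regex form shown above; `stages` and `firstFailingToken` are hypothetical names introduced for illustration, and the sketch assumes the pattern can be cut cleanly into the three tokens \w+ , \d+ and $ .

```javascript
// Hypothetical helper data: each stage is the pattern so far plus one more token.
const stages = [
  { token: "\\w+", prefix: /^\w+/ },
  { token: "\\d+", prefix: /^\w+\d+/ },
  { token: "$", prefix: /^\w+\d+$/ },
];

// Returns the first token that could not be added to the match,
// or null when the full pattern matches.
function firstFailingToken(str) {
  for (const { token, prefix } of stages) {
    if (!prefix.test(str)) return token;
  }
  return null;
}

console.log(firstFailingToken("abc123a")); // $
console.log(firstFailingToken("abc"));     // \d+
console.log(firstFailingToken("abc123"));  // null
console.log(firstFailingToken("!!!"));     // \w+
```

Running several small patterns costs more than one combined regex, but each stage is a plain pass/fail test, so the result does not depend on how a particular engine treats captures inside optional groups.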

For reference: Internal view of the failure path

For what it's worth, here is a view of the failure path from the RegexBuddy debugger.

RegexBuddy Debug

+19

You can use a RegExp negated character set,

 [^xyz] [^a-c] 

A negated or complemented character set. That is, it matches anything that is not enclosed in the brackets. You can specify a range of characters using a hyphen, but if the hyphen appears as the first or last character enclosed in the square brackets, it is taken as a literal hyphen to be included in the character set as a normal character.

and the index property of the array returned by String.prototype.match():

The returned array has an additional input property that contains the original string that was parsed. In addition, it has an index property that represents the zero-based index of the match in the string.

For example, to log the index at which the digit matches the RegExp /[^a-zA-Z]/ in the string aBcD7zYx

    var re = /[^a-zA-Z]/;
    var str = "aBcD7zYx";
    var i = str.match(re).index;
    console.log(i); // 4
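One caveat: String.prototype.match() returns null when nothing matches, so reading .index directly throws on input that contains only letters. A minimal guarded sketch follows; the function name `firstNonLetterIndex` and the -1 return convention are assumptions made for illustration.

```javascript
// Hypothetical helper: index of the first character that is not an
// ASCII letter, or -1 when every character is a letter.
function firstNonLetterIndex(str) {
  const m = str.match(/[^a-zA-Z]/);
  return m === null ? -1 : m.index;
}

console.log(firstNonLetterIndex("aBcD7zYx")); // 4
console.log(firstNonLetterIndex("aBcDzYx"));  // -1
```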
+1

Is there a way to find out the position in the line that caused the regular expression to fail?

No, there isn't. A regular expression either matches or it doesn't. Nothing in between.

Parts of the expression may match while the pattern as a whole does not. Therefore, the engine always has to evaluate the entire expression:

Take the string Hello my World and the pattern /Hello World/ . While each word matches individually, the whole expression fails. You cannot tell whether Hello or World failed, because taken independently both match. The whitespace between them is present as well.
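This can be checked quickly in JavaScript. The variable name `sentence` is just an illustration:

```javascript
const sentence = "Hello my World";

// The full pattern fails...
console.log(/Hello World/.test(sentence)); // false

// ...although each of its parts is present in the string.
console.log(/Hello/.test(sentence)); // true
console.log(/World/.test(sentence)); // true
console.log(/ /.test(sentence));     // true
```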

0