Possessive general quantifier {m,n}+ not implemented in Ruby 1.9.3?

Possessive quantifiers are greedy and refuse to backtrack. The regular expression /.{1,3}+b/ should mean: match any character except a line break, from 1 to 3 times, as many times as possible, without backtracking. Then match b.

In this example:

 'ab'.sub /.{1,3}+b/, 'c' #=> "c" 

No replacement should occur, yet it does.

The results of these two examples differ:

 'aab'.sub /.{0,1}+b/, 'c' #=> "c"
 'aab'.sub /.?+b/, 'c' #=> "ac"

Compare this to Scala, where they give the same answer:

 scala> ".{0,1}+b".r.replaceAllIn("aab", "c")
 res1: String = ac

 scala> ".?+b".r.replaceAllIn("aab", "c")
 res2: String = ac

Is this a Ruby bug, or is there a motivation for this behavior? Perhaps Oniguruma, for some reason, implements possessiveness for the quantifiers ?, * and +, but not for the general quantifier {m,n}? If so, why?

2 answers

This seems to be intentional in Oniguruma. The documentation says: "{n,m}+, {n,}+, {n}+ are possessive op. in ONIG_SYNTAX_JAVA only". I assume this is for backward compatibility, or is there another reason?


What is really going on

It seems that a + following a range quantifier does not make the range quantifier possessive. Rather, it is treated as "repeat the preceding item one or more times". Taking .{1,3}+b as an example, it is equivalent to (?:.{1,3})+b .
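The claimed equivalence can be checked directly. This is a minimal sketch, assuming the behaviour reported in the question for Ruby 1.9.3 (I assume later Ruby versions parse the pattern the same way); the sample strings are arbitrary choices of mine:

```ruby
# Compare the pattern from the question with its presumed expansion.
nested   = /.{1,3}+b/       # as written in the question
explicit = /(?:.{1,3})+b/   # "+" applied to the whole range-quantified item

samples = ['ab', 'aab', '2343333ab', 'xyz']

samples.each do |s|
  a = s[nested]     # matched substring, or nil when there is no match
  b = s[explicit]
  puts "#{s.inspect}: #{a.inspect} vs #{b.inspect}"
end
```

If the two columns agree for every sample, the + is acting as an ordinary one-or-more repeat rather than as a possessive marker.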

Workaround

You can work around this with a more general construct, the non-backtracking group (also called an atomic group), (?>pattern) . Let's take the general case pattern{n,m}+ and build an equivalent regular expression from non-backtracking groups (equivalent to Java's behavior when matching pattern{n,m}+ ):

 (?>(?>pattern){n,m}) 

Why are there 2 levels of non-backtracking groups? Both are necessary because:

  • Once a match for pattern has been found (one instance of the repetition), backtracking into pattern is prohibited. (Note that until the instance is found, backtracking within pattern is allowed.) This is emulated by the inner non-backtracking group.
  • Once no more instances of pattern can be found, backtracking to give up any of the instances already matched is prohibited. This is emulated by the outer non-backtracking group.

I am not sure whether there are any caveats here. Please leave a comment if you find a case that this method fails to emulate.
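The two-level construction can be wrapped in a small helper. This is only a sketch: possessive_range is a hypothetical name of mine, not part of Ruby or Oniguruma, and it assumes pattern is passed as a regex source string.

```ruby
# Hypothetical helper: returns regex source that emulates pattern{min,max}+
# with two nested atomic groups. When max is omitted it emits {min,}.
def possessive_range(pattern, min, max = nil)
  range = max ? "{#{min},#{max}}" : "{#{min},}"
  "(?>(?>#{pattern})#{range})"
end

emulated = Regexp.new(possessive_range('.', 1, 3) + 'b')

p emulated.source             # "(?>(?>.){1,3})b"
p 'ab'.sub(emulated, 'c')     # "ab": the dots consume the b and never give it back
p 'ab'.sub(/.{1,3}+b/, 'c')   # "c": Ruby's own reading of {1,3}+, as in the question
```

Building the source as a string keeps the helper independent of any particular pattern; the caller is responsible for passing a syntactically valid fragment.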

Testing

Test 1

First I tested this regex:

 (.{1,3}+)b 

I initially tested without a capture group, but the result was so unexpected that I needed a capture group to confirm what was happening.

On this input:

 2343333ab 

As a result, the entire string is matched, and the capture group contains 2343333a (without the trailing b). This shows that the upper bound was somehow exceeded.

DEMO in rubular
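The same test can be reproduced locally. A minimal sketch, assuming a Ruby interpreter that behaves like the one used for the demo:

```ruby
# Test 1: capture what .{1,3}+ actually consumes before the final b.
m = /(.{1,3}+)b/.match('2343333ab')

puts m[0]          # the whole match
puts m[1]          # capture group 1
puts m[1].length   # more than 3 means the {1,3} bound alone did not limit the run
```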

Test 2

This second test shows that the range quantifier {n} cannot be made possessive, and it is likely that the same applies to the other range quantifiers {n,} and {n,m} . Instead, the following + simply exhibits its usual "repeat 1 or more times" behavior.

(My initial conclusion was that + overrides the upper bound, but that turned out to be wrong.)

Regular expression:

 (.{3}+)b 

Input lines:

 23d4344333ab
 234344333ab
 23434433ab

The lengths of the text captured in capture group 1 are all multiples of 3. From top to bottom, the regular expression skips 2, 1 and 0 characters of the input line, respectively.

The input lines, annotated ( [] denotes the match of the entire regular expression, () denotes the text captured by capture group 1):

 23[(d4344333a)b]
 2[(34344333a)b]
 [(23434433a)b]

DEMO in rubular
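The annotated output can be regenerated with a few lines of Ruby. A minimal sketch, again assuming the interpreter treats {3}+ as described in this answer:

```ruby
# Test 2: show how many characters are skipped and how many are captured.
inputs = %w[23d4344333ab 234344333ab 23434433ab]

inputs.each do |line|
  m = /(.{3}+)b/.match(line)
  skipped  = m.pre_match   # text before the overall match
  captured = m[1]          # text consumed by .{3}+
  puts "#{skipped}[(#{captured})b]  skipped=#{skipped.length} captured=#{captured.length}"
end
```

Each captured length being a multiple of 3 is what one would expect from (?:.{3})+ , not from an overridden upper bound.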

Testing the workaround

This is test code in Java showing that both the outer and the inner non-backtracking groups are needed. (Demo on ideone.)

 class TestPossessive {
     public static void main(String args[]) {
         String inputText = "123456789012";
         System.out.println("Input string: " + inputText);
         System.out.println("Expected: " + inputText.replaceFirst("(?:\\d{3,4}(?![89])){2,}+", ">$0<"));
         System.out.println("Outer possessive group: " + inputText.replaceFirst("(?>(?:\\d{3,4}(?![89])){2,})", ">$0<"));
         System.out.println("Inner possessive group: " + inputText.replaceFirst("(?>\\d{3,4}(?![89])){2,}", ">$0<"));
         System.out.println("Both: " + inputText.replaceFirst("(?>(?>\\d{3,4}(?![89])){2,})", ">$0<"));

         System.out.println();

         inputText = "aab";
         System.out.println("Input string: " + inputText);
         System.out.println("Expected: " + inputText.replaceFirst(".{1,3}+b", ">$0<"));
         System.out.println("Outer possessive group: " + inputText.replaceFirst("(?>.{1,3})b", ">$0<"));
         System.out.println("Inner possessive group: " + inputText.replaceFirst("(?>.){1,3}b", ">$0<"));
         System.out.println("Both: " + inputText.replaceFirst("(?>(?>.){1,3})b", ">$0<"));
     }
 }