Regular expression to match only the smallest

I have an expression like c.{0,2}?m and a string like "abcemtcmncefmf" . Currently, it will match three substrings: cem , cm and cefm ( see here ). But I like to match only the smallest of them, in this case cm .

My problem is that I do not have global matching support, only the first match, because I use the MariaDB REGEXP_SUBSTR() function. My current solution is the stored procedure that I created to solve my problem. But it is 10 times slower than usual expression for simple cases.

I also tried to do something like: (cm|c.{0,1}?m|c.{0,2}?m) , but this did not work, because it would correspond to the first of any group patterns, and don't try one at a time in the whole topic.

I know that regular expressions (PCRE) have black magic, but I have not found anything to solve my problem.


  • Note I am using an unwanted pattern ( .{0,2}? ) For the current pattern;
  • Question Regular expression to find the smallest possible match . Not my problem;
+7
regex perl pcre
source share
3 answers

You can simply use alternation in the reset group branch:

 /^(?|.*(cm)|.*(cm)|.*(c..m))/s 

(result in group 1)

or like this:

 /^.*\Kcm|^.*\Kc.m|^.*\Kc..m/s 

The first successful branch wins.

+3
source share

There are many things that regular expressions can do - some of them are - as you say - "dark magic." But the main problem is quite fundamental, regular expressions relate to the choice of text. They do not compare comparisons or scores — they either match or not.

You can see what the regex does by turning it on in debug mode. I use perl for this because you can set use re 'debug'; ':

 #!/usr/bin/env perl use strict; use warnings; use re 'debug'; my @matches = "abcemtcmncefmf" =~ m/(cm|cm|c..m)/; print join "\n", @matches; 

This will print what the regex engine does when it goes:

 Compiling REx "(cm|cm|c..m)" Final program: 1: OPEN1 (3) 3: TRIE-EXACT[c] (19) <cm> (19) <c> (9) 9: REG_ANY (10) 10: EXACT <m> (19) <c> (15) 15: REG_ANY (16) 16: REG_ANY (17) 17: EXACT <m> (19) 19: CLOSE1 (21) 21: END (0) stclass AHOCORASICK-EXACT[c] minlen 1 Matching REx "(cm|cm|c..m)" against "abcemtcmncefmf" Matching stclass AHOCORASICK-EXACT[c] against "abcemtcmncefmf" (14 bytes) 0 <> <abcemtcmnc> | Scanning for legal start char... 2 <ab> <cemtcmncef> | Charid: 1 CP: 63 State: 1, word=0 - legal 3 <abc> <emtcmncefm> | Charid: 0 CP: 65 State: 2, word=2 - fail 3 <abc> <emtcmncefm> | Fail transition to State: 1, word=0 - fail Matches word #2 at position 2. Trying full pattern... 2 <ab> <cemtcmncef> | 1:OPEN1(3) 2 <ab> <cemtcmncef> | 3:TRIE-EXACT[c](19) 2 <ab> <cemtcmncef> | State: 1 Accepted: N Charid: 1 CP: 63 After State: 2 3 <abc> <emtcmncefm> | State: 2 Accepted: Y Charid: 0 CP: 65 After State: 0 got 2 possible matches TRIE matched word #2, continuing 3 <abc> <emtcmncefm> | 9: REG_ANY(10) 4 <abce> <mtcmncefmf> | 10: EXACT <m>(19) 5 <abcem> <tcmncefmf> | 19: CLOSE1(21) 5 <abcem> <tcmncefmf> | 21: END(0) Match successful! Freeing REx: "(cm|cm|c..m)" 

I hope you see what he does here?

  • works from left to right
  • shows the first 'c'
  • checks if 'cm' matches (doesn't work)
  • checks if 'cm' matches (successful).
  • called here and returns hits.

Turn g on and you will get it several times - I will not play it, but that is quite a lot.

While you can do a lot of smart tricks with PCRE, for example, look around, look forward, greedy / inaudible match .... quite fundamentally, here you are trying to choose some valid matches and choose the shortest one. And regex cannot do this.

I would suggest, though - with the same perl , the process of finding the shortest is pretty simple:

 use List::Util qw/reduce/; print reduce { length( $a ) < length( $b ) ? $a : $b } @matches; 
+4
source share

Technically, this can be done.

 my ($match) = / ^ (?:(?! c[^m]{0,2}m ).)*+ # Skip past area with no matches. (?: (?:(?! c[^m]{0,1}m ).)*+ # Skip past area with no matches except longuest. (?: (?:(?! c[^m]{0,0}m ).)*+ # Skip past area with no matches except 2 longuest. )? )? ( c[^m]{0,2}m ) /xs; 

[Note. Removing possessive quantifier ( + ) modifiers will significantly affect performance.]

But it is usually much better to find all the matches and find the smallest one.

 use List::Util qw( reduce ); my ($match) = reduce { length($a) <= length($b) ? $a : $b } /c[^m]{0,2}m/g; 
+2
source share

All Articles