Regular expression to match only the smallest

Question

Regular expression to match only the smallest

I have an expression like c.{0,2}?m and a string like "abcemtcmncefmf" . Currently, it will match three substrings: cem , cm and cefm ( see here ). But I like to match only the smallest of them, in this case cm .

My problem is that I do not have global matching support, only the first match, because I use the MariaDB REGEXP_SUBSTR() function. My current solution is the stored procedure that I created to solve my problem. But it is 10 times slower than usual expression for simple cases.

I also tried to do something like: (cm|c.{0,1}?m|c.{0,2}?m) , but this did not work, because it would correspond to the first of any group patterns, and don't try one at a time in the whole topic.

I know that regular expressions (PCRE) have black magic, but I have not found anything to solve my problem.

Note I am using an unwanted pattern ( .{0,2}? ) For the current pattern;
Question Regular expression to find the smallest possible match . Not my problem;

+7

regex perl pcre

David Rodrigues Jan 29 '16 at 18:42

source share

3 answers

There are many things that regular expressions can do - some of them are - as you say - "dark magic." But the main problem is quite fundamental, regular expressions relate to the choice of text. They do not compare comparisons or scores — they either match or not.

You can see what the regex does by turning it on in debug mode. I use perl for this because you can set use re 'debug'; ':

 #!/usr/bin/env perl use strict; use warnings; use re 'debug'; my @matches = "abcemtcmncefmf" =~ m/(cm|cm|c..m)/; print join "\n", @matches;

This will print what the regex engine does when it goes:

 Compiling REx "(cm|cm|c..m)" Final program: 1: OPEN1 (3) 3: TRIE-EXACT[c] (19) <cm> (19) <c> (9) 9: REG_ANY (10) 10: EXACT <m> (19) <c> (15) 15: REG_ANY (16) 16: REG_ANY (17) 17: EXACT <m> (19) 19: CLOSE1 (21) 21: END (0) stclass AHOCORASICK-EXACT[c] minlen 1 Matching REx "(cm|cm|c..m)" against "abcemtcmncefmf" Matching stclass AHOCORASICK-EXACT[c] against "abcemtcmncefmf" (14 bytes) 0 <> <abcemtcmnc> | Scanning for legal start char... 2 <ab> <cemtcmncef> | Charid: 1 CP: 63 State: 1, word=0 - legal 3 <abc> <emtcmncefm> | Charid: 0 CP: 65 State: 2, word=2 - fail 3 <abc> <emtcmncefm> | Fail transition to State: 1, word=0 - fail Matches word #2 at position 2. Trying full pattern... 2 <ab> <cemtcmncef> | 1:OPEN1(3) 2 <ab> <cemtcmncef> | 3:TRIE-EXACT[c](19) 2 <ab> <cemtcmncef> | State: 1 Accepted: N Charid: 1 CP: 63 After State: 2 3 <abc> <emtcmncefm> | State: 2 Accepted: Y Charid: 0 CP: 65 After State: 0 got 2 possible matches TRIE matched word #2, continuing 3 <abc> <emtcmncefm> | 9: REG_ANY(10) 4 <abce> <mtcmncefmf> | 10: EXACT <m>(19) 5 <abcem> <tcmncefmf> | 19: CLOSE1(21) 5 <abcem> <tcmncefmf> | 21: END(0) Match successful! Freeing REx: "(cm|cm|c..m)"

I hope you see what he does here?

works from left to right
shows the first 'c'
checks if 'cm' matches (doesn't work)
checks if 'cm' matches (successful).
called here and returns hits.

Turn g on and you will get it several times - I will not play it, but that is quite a lot.

While you can do a lot of smart tricks with PCRE, for example, look around, look forward, greedy / inaudible match .... quite fundamentally, here you are trying to choose some valid matches and choose the shortest one. And regex cannot do this.

I would suggest, though - with the same perl , the process of finding the shortest is pretty simple:

 use List::Util qw/reduce/; print reduce { length( $a ) < length( $b ) ? $a : $b } @matches;

+4

Sobrique Jan 29 '16 at 19:30

source share

Technically, this can be done.

 my ($match) = / ^ (?:(?! c[^m]{0,2}m ).)*+ # Skip past area with no matches. (?: (?:(?! c[^m]{0,1}m ).)*+ # Skip past area with no matches except longuest. (?: (?:(?! c[^m]{0,0}m ).)*+ # Skip past area with no matches except 2 longuest. )? )? ( c[^m]{0,2}m ) /xs;

[Note. Removing possessive quantifier ( + ) modifiers will significantly affect performance.]

But it is usually much better to find all the matches and find the smallest one.

 use List::Util qw( reduce ); my ($match) = reduce { length($a) <= length($b) ? $a : $b } /c[^m]{0,2}m/g;

+2

ikegami Jan 29 '16 at 20:55

source share

Casimir et Hippolyte · Accepted Answer · 2016-01-30T02:16:31+0000

You can simply use alternation in the reset group branch:

 /^(?|.*(cm)|.*(cm)|.*(c..m))/s

(result in group 1)

or like this:

 /^.*\Kcm|^.*\Kc.m|^.*\Kc..m/s

The first successful branch wins.

Regular expression to match only the smallest

More articles: