There are many things that regular expressions can do, and some of them are, as you say, "dark magic". But the core problem here is quite fundamental: regular expressions are about selecting text. They do not compare or rank matches; they either match or they don't.
You can see what a regex is doing by turning on debug mode. I'm using Perl for this, because you can set use re 'debug';:
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use re 'debug';

    my @matches = "abcemtcmncefmf" =~ m/(cm|c.m|c..m)/;
    print join "\n", @matches;

This prints what the regex engine is doing as it goes:

    Compiling REx "(cm|c.m|c..m)"
    Final program:
       1: OPEN1 (3)
       3:   TRIE-EXACT[c] (19)
            <cm> (19)
            <c> (9)
       9:     REG_ANY (10)
      10:     EXACT <m> (19)
            <c> (15)
      15:     REG_ANY (16)
      16:     REG_ANY (17)
      17:     EXACT <m> (19)
      19: CLOSE1 (21)
      21: END (0)
    stclass AHOCORASICK-EXACT[c] minlen 1
    Matching REx "(cm|c.m|c..m)" against "abcemtcmncefmf"
    Matching stclass AHOCORASICK-EXACT[c] against "abcemtcmncefmf" (14 bytes)
       0 <> <abcemtcmnc>         | Scanning for legal start char...
       2 <ab> <cemtcmncef>       | Charid:  1 CP:  63 State:    1, word=0 - legal
       3 <abc> <emtcmncefm>      | Charid:  0 CP:  65 State:    2, word=2 - fail
       3 <abc> <emtcmncefm>      | Fail transition to State:    1, word=0 - fail
    Matches word #2 at position 2. Trying full pattern...
       2 <ab> <cemtcmncef>       |  1:OPEN1(3)
       2 <ab> <cemtcmncef>       |  3:TRIE-EXACT[c](19)
       2 <ab> <cemtcmncef>       |    State:    1 Accepted: N Charid:  1 CP:  63 After State:    2
       3 <abc> <emtcmncefm>      |    State:    2 Accepted: Y Charid:  0 CP:  65 After State:    0
                                      got 2 possible matches
                                      TRIE matched word #2, continuing
       3 <abc> <emtcmncefm>      |  9:  REG_ANY(10)
       4 <abce> <mtcmncefmf>     | 10:  EXACT <m>(19)
       5 <abcem> <tcmncefmf>     | 19:CLOSE1(21)
       5 <abcem> <tcmncefmf>     | 21:END(0)
    Match successful!
    Freeing REx: "(cm|c.m|c..m)"
Hopefully you can see what it is doing here?
- works from left to right
- finds the first 'c'
- checks if 'cm' matches (fails)
- checks if 'c.m' matches (succeeds)
- bails out here and returns the hit
Turn on the g flag and the engine repeats this process several times. I won't reproduce that output here, because it is quite long.
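To give a feel for what g changes without the debug noise, here is a minimal sketch. It assumes the same input string and the pattern shown in the compiled program above, and it relies on m//g returning every non-overlapping match when used in list context:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# In list context, m//g returns the captured text of every
# non-overlapping match, scanning from left to right.
my @matches = "abcemtcmncefmf" =~ m/(cm|c.m|c..m)/g;

# Prints one match per line: cem, cm, cefm
print join( "\n", @matches ), "\n";
```

Note that the matches come back in the order they are found, not ordered by length.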
While you can do a lot of clever tricks with PCRE (lookaround, lookahead, greedy/non-greedy matching, for example), quite fundamentally you are trying to take several valid matches and pick the shortest one. And a regex cannot do that.
What I would suggest instead, staying with Perl, is to collect the matches and pick the shortest yourself, which is pretty simple:
    use List::Util qw/reduce/;
    print reduce { length( $a ) < length( $b ) ? $a : $b } @matches;
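Put together, the whole thing might look something like the sketch below. The helper name shortest_match is made up for illustration, and it assumes the pattern you pass in has no capture groups of its own (otherwise m//g would return more than one value per match):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw/reduce/;

# Hypothetical helper: return the shortest match of $pattern in $text,
# or undef if nothing matches. Assumes $pattern contains no capture groups.
sub shortest_match {
    my ( $text, $pattern ) = @_;

    # Collect every non-overlapping match, left to right.
    my @matches = $text =~ m/($pattern)/g;

    # Keep the shorter of each pair; on a tie the earlier match wins.
    # reduce returns undef for an empty list.
    return reduce { length($a) <= length($b) ? $a : $b } @matches;
}

# Prints "cm"
print shortest_match( "abcemtcmncefmf", qr/cm|c.m|c..m/ ), "\n";
```

Bear in mind this only picks the shortest of the matches the engine actually returns; with g those are non-overlapping, so a shorter candidate hidden inside an earlier match would not be seen.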
Sobrique