Make sure one regex closes another regex

I am trying to implement a text clustering algorithm. The algorithm clusters similar lines of source text, replacing them with regular expressions, and aggregates the number of patterns matching each regular expression to provide a neat summary of the input text instead of showing duplicate patterns from the input text. In this attempt, I was faced with the need to find if one regular expression closes another.

Suppose we are talking only about regular expressions with wildcards β€œ*” and β€œ+”, that is, β€œ*” means zero or more cases of the alphabet, and β€œ+” means 1 or more cases of the alphabet. Also suppose the character set is ASCII.

For example:

1. AB covers AB This is straightforward. 2. ABC* covers ABC Because ABC* can generate: ABC, ABCC, ABCCC etc. 3. A*B+C* covers AB+C* Because A*B+C* can generate ABBC, AABBC, AABBCC etc. which covers all strings generated by AB+C*. 4. A+M+BC* covers AMM+B+C+M+BC* Similar to case [3] above. 

Basically I am looking for an effective implementation of the following method, which tells if strA (may contain a regular expression) spans strB (may contain a regular expression). Note that there should also be a way to avoid the regex characters '*' and '+' in the input strings strA and strB.

Method signature in C ++:

 bool isParentRegex(const string& strA, const string& strB) 

I believe that implementation requires a recursive approach, and this can be a little more complicated. But I'm curious to know if I can reuse existing implementations instead of reinventing the wheel, or if there are other simple ways to do this.

+8
c ++ regex cluster-analysis data-mining
source share
4 answers

Given the simple regex grammar you suggest, the solution is pretty trivial.

Take a more complex example, A+M+BC* covers AMM+B+C+M+BC* You can rewrite it as A{1,}M{1,}B{1,1}C{0,} covers A{1,1}M{2,}B{1,}C{1,}M{1,}B{1,1}C{0,}

This leads us to a simple rule: R1 spans R2 , if all characters appear in the same order, all lower bounds of R1 less than or equal to the values ​​in R2 , and the upper bounds of R1 greater than or equal to the values ​​of R2 .

Now there is one small problem with a simple rule. AB*C covers AC , that is, the possibility that an optional character appears in R1 rather than in R2 . This can be solved by inserting {0,0} in R2 , when in R1 there is a (optional) character that does not appear at the equivalent position in R2 . For example. AB*C covers AB{0,0}C

The "optional character" rule is optimization. If the character in R1 is not optional, R1 , of course, does not cover R2 . For example. AB+C does not cover AC . Therefore, there is no need to insert B{0,0} . But if you do, you will see that A{1,1}B{1,}C{1,1} does not cover A{1,1}B{0,0}C{1,1} , since the lower bound of R1 on B (1) is larger than the lower bound of R2 on B (0)

+4
source share

I would do something like implement a minimum DFA lookup function from a given regular expression. Let's pretend that

DFA GetMinimalDFA (Regex r1) does this.

 bool isParentRegex(Regex r1, Regex r2) { DFA a = GetMinimalDFA(r1); DFA b = GetMinimalDFA(Regex.OR(r1,r2)) return a.Equals(b); } 
+2
source share

In Perl, that would be pretty simple. The first step is to normalize each regular expression by changing A+ to AA* , A*A to AA* and A*A* to A* :

 sub normalize_regex($) { local $_ = shift; s/(.)\+/$1$1*/g; 1 while s/(.)\*\1(?!\*)/$1$1*/g or s/(.\*)\1/$1/g; return $_; } 

Step two is to convert the first regular expression from the regular expression that matches the strings itself to Perl-regex, which matches the normalized regular expressions that match those strings; for example, AA*B will be converted to ^AA*\*?B$ , which means β€œbeginning of line”, followed by A, followed by zero or more A, optionally followed by an asterisk, followed by B, and then the end of the line ":

 sub regex_to_metaregex($) { local $_ = shift; s/(.)(\*?)/$2 ? "\Q$1\E*(\Q$1\E\\*)?" : "\Q$1"/eg; return qr/^$_$/; } 

Step three needs no explanation:

 sub does_regex1_cover_regex2($$) { my ($r1, $r2) = @_; $r1 = regex_to_metaregex normalize_regex $r1; $r2 = normalize_regex $r2; return scalar $r2 =~ m/$r1/; } 

This returns the true value for your cases # 1 - 3. However, it returns a false value for your case # 4, because if I really lose nothing, A+M+BC* does not cover AMM+B+C+M+BC* ?

Note that there should also be a way to avoid duplicate characters "*" and "+" in the input lines strA and strB.

I was not worried about this in the above code, but since you are only concerned with ASCII, the preprocessing step can process the \* value * , \+ , which means + and \\ , which means \ , translating them into separate characters outside the ASCII range:

 sub process_escapes($) { local $_ = shift; s/\\\\/\x80/g; s/\\\+/\x81/g; s/\\\*/\x82/g; s/\x80/\\/g; return $_; } 

(although this is clearly pretty hacky).

In C ++ you can use the same approach - there are libraries that implement all the necessary functions of Perl regexes - although obviously this will be a bit more.

+2
source share

please check this source of the perl module , but remember that it will not work for all regular expressions (as this will solve the problem with stopping .

0
source share

All Articles