Simplifying OR Regular Expressions

I was asked today whether there is a library that takes a list of strings and computes the most efficient regular expression that matches only those strings. I think that is an NP-complete problem in itself, but I think we can narrow the scope a bit.

How do I create and simplify a regex to match a subset of hosts from a larger set of all hosts on my network? (Knowing that I may not get the most efficient regular expression.)

The first step is very simple. From the following list:

  • appserver1.domain.tld
  • appserver2.domain.tld
  • appserver3.domain.tld

I can escape them and OR them together into

appserver1\.domain\.tld|appserver2\.domain\.tld|appserver3\.domain\.tld
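The escape-and-join step above can be sketched in JavaScript as follows. This is only an illustration: the helper names `escapeRegex` and `orPattern` are made up for this example, and the anchoring with `^(?:...)$` is an added assumption so that only whole host names match.

```javascript
// Escape regex metacharacters so each host name is matched literally.
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Join the escaped names with "|" to form one alternation.
function orPattern(names) {
  return names.map(escapeRegex).join("|");
}

const hosts = [
  "appserver1.domain.tld",
  "appserver2.domain.tld",
  "appserver3.domain.tld",
];

const pattern = orPattern(hosts);

// Assumption: anchor the alternation so partial matches are rejected.
const hostRe = new RegExp("^(?:" + pattern + ")$");

console.log(pattern);
console.log(hostRe.test("appserver2.domain.tld"));
```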

And I know how to manually simplify that regular expression into

appserver[123]\.domain\.tld
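Whatever produces the simplified pattern, it may be worth checking that it still selects exactly the intended subset of the full host list. A minimal sketch, assuming a hypothetical inventory (`allHosts` and the two extra host names are invented for illustration):

```javascript
// Hypothetical inventory: every host on the network.
const allHosts = [
  "appserver1.domain.tld",
  "appserver2.domain.tld",
  "appserver3.domain.tld",
  "appserver4.domain.tld",
  "dbserver1.domain.tld",
];

// The subset the regex is supposed to select (the first three hosts).
const wanted = new Set(allHosts.slice(0, 3));

// Hand-simplified pattern, anchored so it matches whole names only.
const simplified = /^appserver[123]\.domain\.tld$/;

// Hosts where the regex and the intended subset disagree.
const mismatches = allHosts.filter(
  (h) => simplified.test(h) !== wanted.has(h)
);

console.log(
  mismatches.length === 0 ? "regex matches exactly the subset" : mismatches
);
```

A check like this catches both over-matching (an unwanted host slips in) and under-matching (a wanted host is lost) after any simplification.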

Is there a library (Perl, JavaScript, or C#) that can do this simplification automatically for longer lists?

Perl or JavaScript would be preferred.


The Regex::PreSuf module does this.

For example:

use Regex::PreSuf;

my $re = presuf(qw(foobar fooxar foozap));

# $re should be now 'foo(?:zap|[bx]ar)'

The Perl regular expression compiler already builds a branching trie data structure from an alternation whose branches share common parts:

 $ perl -Mre=debug -ce '"whatever" =~ /appserver1\.domain\.tld|appserver2\.domain\.tld|appserver3\.domain\.tld/'
Compiling REx "appserver1\.domain\.tld|appserver2\.domain\.tld|appserver3\."...
Final program:
   1: EXACT <appserver> (5)
   5: TRIEC-EXACT[123] (25)
      <1.domain.tld> 
      <2.domain.tld> 
      <3.domain.tld> 
  25: END (0)
anchored "appserver" at 0 (checking anchored) minlen 21 
-e syntax OK
Freeing REx: "appserver1\.domain\.tld|appserver2\.domain\.tld|appserver3\."...
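For JavaScript, which does not expose such a trie, the same idea can be applied by hand when generating the pattern. The following is a rough sketch, not Perl's internal algorithm and not what Regex::PreSuf does: it factors common prefixes only, and the names `escapeLiteral`, `escapeClass`, `buildTrie` and `trieToRegex` are invented for this example.

```javascript
// Escape regex metacharacters in a literal string.
function escapeLiteral(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Escape the characters that are special inside a character class.
function escapeClass(s) {
  return s.replace(/[\\\]^-]/g, "\\$&");
}

// Build a character trie; the "" key marks the end of a word.
function buildTrie(words) {
  const root = new Map();
  for (const word of words) {
    let node = root;
    for (const ch of word) {
      if (!node.has(ch)) node.set(ch, new Map());
      node = node.get(ch);
    }
    node.set("", null);
  }
  return root;
}

// Turn a trie node into regex source for everything below it.
function trieToRegex(node) {
  const optional = node.has(""); // a word may end at this node
  const singles = [];  // branches that are exactly one character long
  const branches = []; // longer branches, already rendered as regex source
  for (const [ch, child] of node) {
    if (ch === "") continue;
    const rest = trieToRegex(child);
    if (rest === "") singles.push(ch);
    else branches.push(escapeLiteral(ch) + rest);
  }
  const alternatives = branches.slice();
  if (singles.length === 1) {
    alternatives.push(escapeLiteral(singles[0]));
  } else if (singles.length > 1) {
    alternatives.push("[" + singles.map(escapeClass).join("") + "]");
  }
  if (alternatives.length === 0) return "";
  let body;
  if (alternatives.length === 1 && (branches.length === 0 || !optional)) {
    // A single atom, or a single sequence that needs no "?" suffix.
    body = alternatives[0];
  } else {
    body = "(?:" + alternatives.join("|") + ")";
  }
  return optional ? body + "?" : body;
}

const hostPattern = trieToRegex(buildTrie([
  "appserver1.domain.tld",
  "appserver2.domain.tld",
  "appserver3.domain.tld",
]));
console.log(hostPattern);
```

For the three hosts this should print `appserver(?:1\.domain\.tld|2\.domain\.tld|3\.domain\.tld)`, which mirrors the shape of the trie in the debug output above. Because common suffixes are not merged, it does not reach `appserver[123]\.domain\.tld`.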