Generating Regular Expressions

Usually in our work, we use regular expressions in the operations of capture or matching.

However, regular expressions can be used - at least manually - to generate legal proposals that match the regular expression. Of course, some regular expressions may correspond to infinitely long sentences, for example, an expression .+ .

I have a problem that can be solved with a regular expression sentence generation algorithm.

In pseudocode, it will work something like this:

 re = generate("foo(bar|baz)?", max_match = 100); #Don't give me more than 100 results assert re == ("foobar", "foobaz", "foo"); 

What algorithm will do this for me?

+7
regex regular-language generative-programming
source share
2 answers

Microsoft has a free SMT-enabled tool (MSRL-licensed) for Rex: http://research.microsoft.com/en-us/downloads/7f1d87be-f6d9-495d-a699-f12599cea030/

In the Introduction section of Rex: Symbolic Regular Expression Explorer:

We translate (extended) regular expressions or regular expressions [5] into the symbolic representation of finite state machines called SFAs. In SFA, moves are indicated by formulas representing character sets, not individual characters. SFA A is converted to a set of (recursive) axioms that describe the acceptance condition for the strings accepted by A and are based on the presentation of strings in lists.

Since the SMT solver can output all possible solutions within a certain size limit, this may be close to what you are looking for.

The CPAN Regexp :: Genex module from http://search.cpan.org/dist/Regexp-Genex/ can work on a more statistical and less formal front.

You can use it with something like this:

 #!/usr/bin/env perl use Regexp::Genex ':all'; my $hits = 100; my $re = qr/[az](123|456)/; local $Regexp::Genex::DEFAULT_LEN = length $re; my %seen; while ((time - $^T) < 2) { @seen{strings($re)} = (); $Regexp::Genex::DEFAULT_LEN++; } print "$_\n" for (sort %seen)[0..$hits-1]; 

If necessary, adjust the time and sample size. Hope this helps!

+2
source share

Check out Xeger (Google Code) .

The Visual Studio Team System also has an inverse regular expression generator , but it doesn't look like the algorithm is open source.

+1
source share

All Articles