Find the simplest regular expression matching all given lines

Is there an algorithm that can derive a regular expression (possibly limited to a simplified grammar) from a set of strings, such that enumerating all strings that match the regular expression reproduces the original set of strings?

It is probably unrealistic to find such an algorithm for regular expression grammars with very "complex" syntax (including arbitrary repetitions, assertions, etc.), so let's start with a simplified grammar that allows only OR groups of substrings:

foo(a|b|cd)bar must match fooabar, foobbar and foocdbar.

Examples

Given the set of strings h_q1_a, h_q1_b, h_q1_c, h_p2_a, h_p2_b, h_p2_c, the desired result of the algorithm is h_(q1|p2)_(a|b|c).

Given the set of strings h_q1_a, h_q1_b, h_p2_a, the desired result of the algorithm is h_(q1_(a|b)|p2_a). Note that h_(q1|p2)_(a|b) would not be correct, because it expands to 4 strings, including h_p2_b, which was not in the original set.
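Since the simplified grammar has no repetition, any candidate pattern matches only finitely many strings and can be enumerated, so a result can be checked mechanically. A minimal Python sketch of such a checker (the expand helper is my own name, not part of the question):

```python
import re

def expand(pattern):
    """Expand a pattern restricted to literals and (x|y|z) groups
    into the full set of strings it matches."""
    m = re.search(r'\(([^()]*)\)', pattern)   # an innermost group has no nested parens
    if m is None:
        return {pattern}                      # no groups left: a single literal string
    results = set()
    for alt in m.group(1).split('|'):
        results |= expand(pattern[:m.start()] + alt + pattern[m.end():])
    return results

strings = {"h_q1_a", "h_q1_b", "h_p2_a"}
print(expand("h_(q1_(a|b)|p2_a)") == strings)   # True: exact cover
print(expand("h_(q1|p2)_(a|b)") == strings)     # False: also yields h_p2_b
```

A candidate is correct exactly when expand(candidate) equals the original set, which is the second example's point about over-generating patterns.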

Use case

I have a long list of labels that were generated from substrings. Instead of printing the full list of strings, I would like compact output indicating which labels are in the list. Since the complete list was generated programmatically (from a finite set of prefixes and suffixes), I expect the compact notation to be (potentially) much shorter than the original list.

(The (simplified) regular expression should be as short as possible, although I am more interested in a practical solution than in the optimal one. The trivial answer, of course, is to simply concatenate all the strings, as in A|B|C|D|..., which does not help.)

+7
2 answers

You could try using the Aho-Corasick algorithm to build a finite state machine from the input strings, after which it should be fairly easy to generate a simplified regular expression. Using your example input strings:

 h_q1_a
 h_q1_b
 h_q1_c
 h_p2_a
 h_p2_b
 h_p2_c

it will produce a state machine that probably looks like this:

         [h_]        <- level 0
         /  \
      [q1]  [p2]     <- level 1
         \  /
         [_]         <- level 2
        / | \
       a  b  c       <- level 3

Now, for each level/depth of the trie, all branches (if there are several) are grouped in OR parentheses, therefore:

 h_ (q1|p2) _ (a|b|c)
 L0   L1    L2   L3
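A bare prefix tree (rather than the full Aho-Corasick automaton) is enough to get this kind of grouping started. A hedged Python sketch, with helper names of my own choosing:

```python
def make_trie(strings):
    """Build a prefix tree as nested dicts; '$' marks end of string."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node["$"] = {}                      # end-of-string marker
    return root

def trie_to_regex(node):
    """Read the trie back off as a regex with | groups for branching nodes."""
    branches = []
    for ch, child in sorted(node.items()):
        if ch == "$":
            branches.append("")             # this node can end a string
        else:
            branches.append(ch + trie_to_regex(child))
    if len(branches) == 1:
        return branches[0]
    return "(" + "|".join(branches) + ")"

print(trie_to_regex(make_trie(["h_q1_a", "h_q1_b", "h_p2_a"])))
# prints h_(p2_a|q1_(a|b))
```

Note that a plain trie shares prefixes only, so for the first example it yields h_(p2_(a|b|c)|q1_(a|b|c)); obtaining the shorter h_(q1|p2)_(a|b|c) additionally requires merging identical suffix branches, as the level diagram above suggests and as the DFA-minimization answer below makes precise.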
+2

There is a direct solution to this problem if you want to find a minimal finite state machine (FSM) for the set of strings. Since the resulting FSM cannot have loops (otherwise it would match an infinite number of strings), it is easy to convert it to a regular expression using only concatenation and disjunction ( | ). Although it may not be the shortest possible regex, it will result in the smallest compiled regex if the regex library you use compiles to a minimized DFA. (Alternatively, you could use the DFA directly with a library such as Ragel.)

The procedure is simple if you have access to standard FSM algorithms:

  • Build a non-deterministic finite automaton (NFA) by simply adding each string as a sequence of states, with each sequence starting from the start state. This is clearly O(N) in the total size of the strings, since there will be exactly one NFA state for each character in the input strings.

  • Build a deterministic finite automaton (DFA) from the NFA. Since the NFA is a tree (not even a DAG), this avoids the exponential worst case of the standard subset construction. In effect, you are just building a prefix tree here, and you could skip step 1 and build the prefix tree directly, treating it directly as a DFA. The prefix tree cannot have more nodes than the original number of characters (and will have exactly that many nodes if no strings share a prefix), so its size is O(N), although I have no proof off the top of my head that the construction is also O(N) in time.

  • Minimize DFA.
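The steps above can be sketched compactly for the acyclic case. The prefix tree is already a DFA, and minimizing an acyclic DFA amounts to merging states with equal right languages, which can be done by canonicalizing subtrees bottom-up (the idea behind the linear-time algorithms discussed below). All helper names here are mine:

```python
def prefix_tree(strings):
    """Step 1-2: build the prefix tree directly; '#' marks acceptance
    (assumed not to occur in the alphabet)."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node["#"] = None
    return root

def count_states(node):
    return 1 + sum(count_states(child) for child in node.values()
                   if child is not None)

def canonical_id(node, registry):
    """Step 3: equal ids <=> states with the same right language,
    i.e. states that a minimizer would merge."""
    if node is None:
        return -1                      # the single accepting state
    sig = tuple(sorted((ch, canonical_id(child, registry))
                       for ch, child in node.items()))
    if sig not in registry:
        registry[sig] = len(registry)
    return registry[sig]

strings = ["h_q1_a", "h_q1_b", "h_q1_c", "h_p2_a", "h_p2_b", "h_p2_c"]
tree = prefix_tree(strings)
registry = {}
canonical_id(tree, registry)
print(count_states(tree), "prefix-tree states ->", len(registry), "after merging")
# prints: 15 prefix-tree states -> 8 after merging
```

In the merged automaton the states reached after q1 and after p2 coincide, which is exactly what licenses the factored form h_(q1|p2)_(a|b|c) when the minimal automaton is read back off as a regex.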

DFA minimization is a well-studied problem. Hopcroft's algorithm is a worst-case O(NS log N) algorithm, where N is the number of states in the DFA and S is the size of the alphabet. Usually S is taken to be a constant; in any case, the expected running time of Hopcroft's algorithm is much better.

For acyclic DFAs, linear-time algorithms exist; the most frequently cited one is due to Dominique Revuz, and I found a rough description of it in English; the original paper seems to be paywalled, but Revuz's thesis (in French) is available.

+2
