Is there an algorithm that can derive a regular expression (possibly restricted to a simplified grammar) from a set of strings, such that enumerating all strings that match the regular expression reproduces exactly the original set?
It is probably unrealistic to find such an algorithm for regular expression grammars with very complex syntax (arbitrary repetition, assertions, etc.), so let's start with a simplified one that allows only alternation (OR) of substrings:
foo(a|b|cd)bar must match fooabar, foobbar and foocdbar.
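To make the "reproduces the original set" requirement checkable, here is a minimal sketch of the reverse direction: expanding a pattern of this simplified grammar into the set of strings it matches. The function name expand is my own, and the sketch assumes the pattern contains only literal characters, balanced parentheses and |, with no escaping.

```python
def expand(pattern):
    """Return the set of strings matched by a pattern that uses only
    literal characters, parentheses and '|' (no repetition, no escapes).
    Assumes the parentheses are balanced."""
    pos = 0

    def parse_alternation():
        # alternation := sequence ('|' sequence)*
        nonlocal pos
        results = set(parse_sequence())
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1
            results |= parse_sequence()
        return results

    def parse_sequence():
        # sequence := (literal | '(' alternation ')')*
        nonlocal pos
        results = {""}
        while pos < len(pattern) and pattern[pos] not in "|)":
            if pattern[pos] == "(":
                pos += 1
                options = parse_alternation()
                pos += 1  # skip the closing ')'
            else:
                options = {pattern[pos]}
                pos += 1
            # Concatenate every string built so far with every option.
            results = {r + o for r in results for o in options}
        return results

    return parse_alternation()


print(sorted(expand("foo(a|b|cd)bar")))
# ['fooabar', 'foobbar', 'foocdbar']
```

A candidate pattern is acceptable exactly when expand(pattern) equals the original set of strings.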
Examples
Given the set of strings h_q1_a, h_q1_b, h_q1_c, h_p2_a, h_p2_b, h_p2_c, the desired result of the algorithm is h_(q1|p2)_(a|b|c).
Given the set of strings h_q1_a, h_q1_b, h_p2_a, the desired result of the algorithm is h_(q1_(a|b)|p2_a). Note that h_(q1|p2)_(a|b) would not be correct, because it expands to 4 strings, including h_p2_b, which was not in the original set.
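For illustration, here is a rough sketch of one heuristic that handles both examples above. It is not a known optimal algorithm, just one way I could imagine approaching it: factor out the common prefix and suffix, split the remainder by first character, and merge branches that are followed by exactly the same set of tails. The names compress, common_suffix and alternation are made up for this sketch, and it assumes the input strings contain none of the characters (, ) and |. The result is exact but not guaranteed to be the shortest possible, and alternatives come out in sorted order rather than input order.

```python
import os
from collections import defaultdict


def common_suffix(words):
    # Longest common suffix, via the common prefix of the reversed words.
    return os.path.commonprefix([w[::-1] for w in words])[::-1]


def alternation(parts):
    # A single part needs no parentheses; several parts become (a|b|...).
    parts = sorted(parts)
    return parts[0] if len(parts) == 1 else "(" + "|".join(parts) + ")"


def compress(strings):
    """Return a pattern (literals, parentheses, '|') that matches exactly
    the given non-empty collection of strings."""
    strings = frozenset(strings)
    if not strings:
        raise ValueError("need at least one string")
    if len(strings) == 1:
        return next(iter(strings))

    # Step 1: factor out what all strings share at both ends.
    prefix = os.path.commonprefix(list(strings))
    rest = {s[len(prefix):] for s in strings}
    suffix = common_suffix(list(rest))
    core = {s[:len(s) - len(suffix)] for s in rest}

    # Step 2: for each first character, extend it to the longest
    # unambiguous label and record the set of tails that can follow.
    groups = defaultdict(list)  # set of tails -> labels leading to it
    for c in {s[0] for s in core if s}:
        tails = [s[1:] for s in core if s.startswith(c)]
        shared = os.path.commonprefix(tails)
        groups[frozenset(t[len(shared):] for t in tails)].append(c + shared)

    # Step 3: labels followed by identical tails share one sub-pattern.
    parts = [""] if "" in core else []
    for tails, labels in groups.items():
        label_suffix = common_suffix(labels)
        heads = [l[:len(l) - len(label_suffix)] for l in labels]
        parts.append(alternation(heads) + label_suffix + compress(tails))

    return prefix + alternation(parts) + suffix


print(compress(["h_q1_a", "h_q1_b", "h_q1_c", "h_p2_a", "h_p2_b", "h_p2_c"]))
# h_(p2|q1)_(a|b|c)
print(compress(["h_q1_a", "h_q1_b", "h_p2_a"]))
# h_(p2_a|q1_(a|b))
```

Merging only happens when two branches have identical tail sets, so the pattern never expands to a string outside the input; the price is that partially overlapping tail sets, as in the second example, are not merged at all.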
Use case
I have a long list of labels that were built from substrings. Instead of printing an extensive list of lines, I would like to have compact output indicating which labels are in the list. Since the complete list was generated programmatically (from a finite set of prefixes and suffixes), I expect the compact notation to be (potentially) much shorter than the original list.
The (simplified) regular expression should be as short as possible, although I am more interested in a practical solution than in an optimal one. (The trivial answer, of course, is to simply join all strings with alternation, as in A|B|C|D|..., which, however, does not help.)