Here is an example using Haskell. The algorithm should be easily transferred to another language.
Only some import templates;
import Data.Maybe import Data.List import Data.Function
Here is the data type for representing Regex. You will need to parse it yourself or use the library:
data Regex = Concat Regex Regex
Here is the algorithm:
regexMustMatch (Single x) = [Just x]
Some methods to use the results:
selectAll = map (concatMap (return . fromJust)) . filter (isJust . head) . groupBy ((==) `on` isJust) selectLongest x = case selectAll x of [] -> "" xs -> maximumBy (compare `on` length) xs
And some examples:
main = do -- your tests -- /ab/ print . selectAll . regexMustMatch $ (Single 'a' `Concat` Single 'b') -- /a|b/ print . selectAll . regexMustMatch $ (Single 'a' `Alt` Single 'b') -- /[ab]/ print . selectAll . regexMustMatch $ (CharClass [('a','a'),('b','b')]) -- /[ab]c/ print . selectAll . regexMustMatch $ ((Single 'a' `Alt` Single 'b') `Concat` Single 'c') -- a few more -- /[a]/ print . selectAll . regexMustMatch $ (CharClass [('a','a')]) -- /ab*c/ print . selectAll . regexMustMatch $ (Single 'a' `Concat` Star (Single 'b') `Concat` Single 'c') -- /s(ab*)(cd)/ - these aren't capturing parens, just grouping to test associativity print . selectAll . regexMustMatch $ (Single 's' `Concat` (Single 'a' `Concat` Star (Single 'b')) `Concat` (Single 'c' `Concat` Single 'd'))
Conclusion:
["ab"] -- /ab/ [] -- /a|b/ [] -- /[ab]/ ["c"] -- /[ab]c/ ["a"] -- /[a]/ ["a","c"] -- /ab*c/ ["sa","cd"] -- /s(ab*)(cd)/
The main area where this can be improved is the alternation algorithm.
If we have the regular expression /a*bc*|d*be*/ , then b needs to be matched, but it will not be selected.
Edit : The alternation algorithm has been improved here:
regexMustMatch (Alt xy) | x' == y' = x' | otherwise = start ++ [Nothing] ++ common ++ [Nothing] ++ end where x' = regexMustMatch x y' = regexMustMatch y start = map fst $ takeWhile (uncurry (==)) (zip x' y') end = map fst $ reverse $ takeWhile (uncurry (==)) (zip (reverse (drop (length start) x')) (reverse (drop (length start) y'))) dropEnds = drop (length start) . reverse . drop (length end) . reverse common = intercalate [Nothing] $ map (map Just) (selectAll (dropEnds x') `intersect` selectAll (dropEnds y'))
A few more tests with improved alternation:
/a*bc*|d*be*/ == b /s(abc*|abe*)/ == sab /s(a*bc*|d*be*)/ == s, b /sa*b|b*/ == s /(abc*|abe*)s/ == ab, s /(a*bc*|d*be*)s/ == b, s /(a*b|b*)s/ == s /s(ab|b)e/ == s, be /s(ba|b)e/ == sb, e /s(b|b)e/ == sbe /s(ac*b|ac*b)e/ == sa, be