If the regex contains one immutable segment

I am writing a search tool that is optimized to search for fixed phrases of characters in sentences - consider it simple: "Does the sentence contain a specific sequence of characters." The result will be a set of found offers that you can continue to search in the second stage.

For this second step, I like to use regex search for convenience. But I need to pre-select the elements first in the first stage, and I can’t just get all the offers - the API that I need to use for the first stage requires me to search for the phrase of at least one matching char. So there is no way around this.

Now the user will enter only one regular expression, and my software must first determine if it can perform the first step of the search on this. If the user enters something ambiguous, then I will tell the user to change his regular expression.

I need an algorithm that defines all the substrings that I can use to search for the first stage.

Here are some examples of expected results:

  • ab - Yes (first searches for "a" or "b")
  • a | b - No (there are two different first level queries)
  • [ab] - No (same problem: not an explicit goal to search for the first stage)
  • [ab] c - Yes (first search for "c")

These are simple examples. But since a regex can be quite complicated, I wonder if I can build a regex or another test that will tell me if I have a useful result. I could also live with a limitation of regex syntax to more common cases, if that simplifies the test, for example. no recursion or anything else helped.

0
source share
1 answer

Here is an example using Haskell. The algorithm should be easily transferred to another language.

Only some import templates;

import Data.Maybe import Data.List import Data.Function 

Here is the data type for representing Regex. You will need to parse it yourself or use the library:

 data Regex = Concat Regex Regex -- eg /ab/ | Alt Regex Regex -- eg /a|b/ | Single Char -- eg /a/ | Star Regex -- eg /a*/ | CharClass [(Char,Char)] -- a list of ranges. for non-range (eg [a]) just use the same char twice 

Here is the algorithm:

 regexMustMatch (Single x) = [Just x] -- has to match the character regexMustMatch (Alt _ _) = [Nothing] -- doesn't need to match one thing (you could actually check for equality here, so something like /a|a/ would work) regexMustMatch (Star _) = [Nothing] -- doesn't need to match one thing regexMustMatch (CharClass ((a,b):[])) | a == b = [Just a] -- char class must match if it only has one character regexMustMatch (CharClass _) = [Nothing] -- otherwise doesn't need to match one thing regexMustMatch (Concat xy) = (regexMustMatch x) ++ (regexMustMatch y) -- must match both parts in sequence 

Some methods to use the results:

 selectAll = map (concatMap (return . fromJust)) . filter (isJust . head) . groupBy ((==) `on` isJust) selectLongest x = case selectAll x of [] -> "" xs -> maximumBy (compare `on` length) xs 

And some examples:

 main = do -- your tests -- /ab/ print . selectAll . regexMustMatch $ (Single 'a' `Concat` Single 'b') -- /a|b/ print . selectAll . regexMustMatch $ (Single 'a' `Alt` Single 'b') -- /[ab]/ print . selectAll . regexMustMatch $ (CharClass [('a','a'),('b','b')]) -- /[ab]c/ print . selectAll . regexMustMatch $ ((Single 'a' `Alt` Single 'b') `Concat` Single 'c') -- a few more -- /[a]/ print . selectAll . regexMustMatch $ (CharClass [('a','a')]) -- /ab*c/ print . selectAll . regexMustMatch $ (Single 'a' `Concat` Star (Single 'b') `Concat` Single 'c') -- /s(ab*)(cd)/ - these aren't capturing parens, just grouping to test associativity print . selectAll . regexMustMatch $ (Single 's' `Concat` (Single 'a' `Concat` Star (Single 'b')) `Concat` (Single 'c' `Concat` Single 'd')) 

Conclusion:

 ["ab"] -- /ab/ [] -- /a|b/ [] -- /[ab]/ ["c"] -- /[ab]c/ ["a"] -- /[a]/ ["a","c"] -- /ab*c/ ["sa","cd"] -- /s(ab*)(cd)/ 

The main area where this can be improved is the alternation algorithm.

If we have the regular expression /a*bc*|d*be*/ , then b needs to be matched, but it will not be selected.

Edit : The alternation algorithm has been improved here:

 regexMustMatch (Alt xy) | x' == y' = x' | otherwise = start ++ [Nothing] ++ common ++ [Nothing] ++ end where x' = regexMustMatch x y' = regexMustMatch y start = map fst $ takeWhile (uncurry (==)) (zip x' y') end = map fst $ reverse $ takeWhile (uncurry (==)) (zip (reverse (drop (length start) x')) (reverse (drop (length start) y'))) dropEnds = drop (length start) . reverse . drop (length end) . reverse common = intercalate [Nothing] $ map (map Just) (selectAll (dropEnds x') `intersect` selectAll (dropEnds y')) 

A few more tests with improved alternation:

 /a*bc*|d*be*/ == b /s(abc*|abe*)/ == sab /s(a*bc*|d*be*)/ == s, b /sa*b|b*/ == s /(abc*|abe*)s/ == ab, s /(a*bc*|d*be*)s/ == b, s /(a*b|b*)s/ == s /s(ab|b)e/ == s, be /s(ba|b)e/ == sb, e /s(b|b)e/ == sbe /s(ac*b|ac*b)e/ == sa, be 
+1
source

All Articles