Recursive solutions for matching globe patterns

I am currently studying the implementation of pattern matching in the UNIX style. Generally, the fnmatch library is a good reference implementation for this function.

Looking at some implementations , as well as reading various blogs / tutorials about it, it seems that this algorithm is usually implemented recursively.

As a rule, any algorithm requiring "backtracking" or requiring a return to an earlier state is perfect for a recursive solution. This includes things like traversing a tree or parsing nested structures.

But I had trouble understanding why glob pattern matching, in particular, is so often implemented recursively. I understand that sometimes aabaabxbaab is required, for example, if we have aabaabxbaab string and a*baab , the characters after * will correspond to the first substring "baab", for example aa(baab)xbaab , and then do not coincide with the rest of the string. Thus, the algorithm will have to back off so that the match of the character counter baab , and we can match the second occurrence of baab , for example: aabaabx(baab) .

Good, but usually recursion is used when we may need several nested return levels, and I don’t see how it is necessary in this case. It seems that we only had to return one level at a time when the iterator in the template and the iterator in the input line do not match. At this point, the template iterator will need to return to the last "save point", which would be either the beginning of the line or the last * that successfully matched something. This does not require a stack - only one savepoint.

The only complication that I can think of is in case of coincidence of coincidences. For example, if we have an aabaabaab input aabaabaab and a*baab , we won’t be able to match, since the “b” in the last baab can be part of either the first match or the second match. But it seems that this can be solved by simply discarding the input iterator if the distance between the last save point of the template iterator and the end of the template is greater than the distance between the iterator input position and the end of the input string.

So, as far as I can see, this should not be too big a problem for implementing this glob matching algorithm iteratively (assuming there is a very simple glob helper that uses the * character to mean “match” zero or more characters. ”Also, the matching strategy will be greedy.)


So, I assume that I am definitely wrong about this because everyone else does it recursively - so I have to skip something. I just don’t see what I am missing here. So I implemented a simple glob resolver in C ++ (which only supports the * operator) to find out if I can understand what I am missing. This is a very simple and easy iterative solution that just uses the inner loop to match wildcards. It also writes indexes that the * character matches the vector of pairs:

 bool match_pattern(const std::string& pattern, const std::string& input, std::vector<std::pair<std::size_t, std::size_t>>& matches) { const char wildcard = '*'; auto pat = std::begin(pattern); auto pat_end = std::end(pattern); auto it = std::begin(input); auto end = std::end(input); while (it != end && pat != pat_end) { const char c = *pat; if (*it == c) { ++it; ++pat; } else if (c == wildcard) { matches.push_back(std::make_pair(std::distance(std::begin(input), it), 0)); ++pat; if (pat == pat_end) { matches.back().second = input.size(); return true; } auto save = pat; std::size_t matched_chars = 0; while (it != end && pat != pat_end) { if (*it == *pat) { ++it; ++pat; ++matched_chars; if (pat == pat_end && it != end) { pat = save; matched_chars = 0; // Check for an overlap and back up the input iterator if necessary // std::size_t d1 = std::distance(it, end); std::size_t d2 = std::distance(pat, pat_end); if (d2 > d1) it -= (d2 - d1); } } else if (*pat == wildcard) { break; } else { if (pat == save) ++it; pat = save; matched_chars = 0; } } matches.back().second = std::distance(std::begin(input), it) - matched_chars; } else break; } return it == end && pat == pat_end; } 

Then I wrote a series of tests to see if I could find a pattern or input line that would require several levels of return (and therefore recursion or a stack), but I couldn't think of anything.

Here is my test function:

 void test(const std::string& input, const std::string& pattern) { std::vector<std::pair<std::size_t, std::size_t>> matches; bool result = match_pattern(pattern, input, matches); auto match_iter = matches.begin(); std::cout << "INPUT: " << input << std::endl; std::cout << "PATTERN: " << pattern << std::endl; std::cout << "INDICES: "; for (auto& p : matches) { std::cout << "(" << p.first << "," << p.second << ") "; } std::cout << std::endl; if (result) { std::cout << "MATCH: "; for (std::size_t idx = 0; idx < input.size(); ++idx) { if (match_iter != matches.end()) { if (idx == match_iter->first) std::cout << '('; else if (idx == match_iter->second) { std::cout << ')'; ++match_iter; } } std::cout << input[idx]; } if (!matches.empty() && matches.back().second == input.size()) std::cout << ")"; std::cout << std::endl; } else { std::cout << "NO MATCH!" << std::endl; } std::cout << std::endl; } 

And my actual tests:

 test("aabaabaab", "a*b*ab"); test("aabaabaab", "a*"); test("aabaabaab", "aa*"); test("aabaabaab", "aaba*"); test("/foo/bar/baz/xlig/fig/blig", "/foo/*/blig"); test("/foo/bar/baz/blig/fig/blig", "/foo/*/blig"); test("abcdd", "*d"); test("abcdd", "*d*"); test("aabaabqqbaab", "a*baab"); test("aabaabaab", "a*baab"); 

So the outputs are:

 INPUT: aabaabaab PATTERN: a*b*ab INDICES: (1,2) (3,7) MATCH: a(a)b(aaba)ab INPUT: aabaabaab PATTERN: a* INDICES: (1,9) MATCH: a(abaabaab) INPUT: aabaabaab PATTERN: aa* INDICES: (2,9) MATCH: aa(baabaab) INPUT: aabaabaab PATTERN: aaba* INDICES: (4,9) MATCH: aaba(abaab) INPUT: /foo/bar/baz/xlig/fig/blig PATTERN: /foo/*/blig INDICES: (5,21) MATCH: /foo/(bar/baz/xlig/fig)/blig INPUT: /foo/bar/baz/blig/fig/blig PATTERN: /foo/*/blig INDICES: (5,21) MATCH: /foo/(bar/baz/blig/fig)/blig INPUT: abcdd PATTERN: *d INDICES: (0,4) MATCH: (abcd)d INPUT: abcdd PATTERN: *d* INDICES: (0,3) (4,5) MATCH: (abc)d(d) INPUT: aabaabqqbaab PATTERN: a*baab INDICES: (1,8) MATCH: a(abaabqq)baab INPUT: aabaabaab PATTERN: a*baab INDICES: (1,5) MATCH: a(abaa)baab 

The brackets that appear on the output after "MATCH: " show the substrings that were matched / consumed by each character * . This way, it works fine, and I cannot understand why it would be necessary to cancel several levels here - at least if we restrict our pattern to allow only * characters, and we accept greedy matches.

Therefore, I assume that I am definitely mistaken in this matter and probably do not notice anything simple - can someone help me understand why this algorithm may require several levels of return (and therefore recursion or stack)?

+5
source share
1 answer

I did not check all the details of your implementation, but it is certainly true that you can complete the match without a recursive return.

In fact, you can perform glob matching without returning by creating a simple state machine. You can translate the globe into regex and build DFA in the usual way, or you can use something very similar to an Aho-Corasick machine; if you change your algorithm a little, you will achieve the same result. (The key is that you don’t really need to reserve an input scan, you just need to find out the correct scan state, which can be pre-calculated.)

Classic fnmatch implementations are not optimized for speed; they are based on the assumption that patterns and target lines are short. This assumption is usually reasonable, but if you allow untrusted patterns, you start a DoS attack. And based on this assumption, an algorithm similar to the one you imagine that does not require preliminary calculation is probably faster in the vast majority of cases than any algorithm that requires preliminary calculation of state transition tables, avoiding exponential bloating with pathological patterns.

+2
source

All Articles