I am currently studying the implementation of pattern matching in the UNIX style. Generally, the fnmatch library is a good reference implementation for this function.
Looking at some implementations , as well as reading various blogs / tutorials about it, it seems that this algorithm is usually implemented recursively.
As a rule, any algorithm requiring "backtracking" or requiring a return to an earlier state is perfect for a recursive solution. This includes things like traversing a tree or parsing nested structures.
But I had trouble understanding why glob pattern matching, in particular, is so often implemented recursively. I understand that sometimes aabaabxbaab is required, for example, if we have aabaabxbaab string and a*baab , the characters after * will correspond to the first substring "baab", for example aa(baab)xbaab , and then do not coincide with the rest of the string. Thus, the algorithm will have to back off so that the match of the character counter baab , and we can match the second occurrence of baab , for example: aabaabx(baab) .
Good, but usually recursion is used when we may need several nested return levels, and I don’t see how it is necessary in this case. It seems that we only had to return one level at a time when the iterator in the template and the iterator in the input line do not match. At this point, the template iterator will need to return to the last "save point", which would be either the beginning of the line or the last * that successfully matched something. This does not require a stack - only one savepoint.
The only complication that I can think of is in case of coincidence of coincidences. For example, if we have an aabaabaab input aabaabaab and a*baab , we won’t be able to match, since the “b” in the last baab can be part of either the first match or the second match. But it seems that this can be solved by simply discarding the input iterator if the distance between the last save point of the template iterator and the end of the template is greater than the distance between the iterator input position and the end of the input string.
So, as far as I can see, this should not be too big a problem for implementing this glob matching algorithm iteratively (assuming there is a very simple glob helper that uses the * character to mean “match” zero or more characters. ”Also, the matching strategy will be greedy.)
So, I assume that I am definitely wrong about this because everyone else does it recursively - so I have to skip something. I just don’t see what I am missing here. So I implemented a simple glob resolver in C ++ (which only supports the * operator) to find out if I can understand what I am missing. This is a very simple and easy iterative solution that just uses the inner loop to match wildcards. It also writes indexes that the * character matches the vector of pairs:
bool match_pattern(const std::string& pattern, const std::string& input, std::vector<std::pair<std::size_t, std::size_t>>& matches) { const char wildcard = '*'; auto pat = std::begin(pattern); auto pat_end = std::end(pattern); auto it = std::begin(input); auto end = std::end(input); while (it != end && pat != pat_end) { const char c = *pat; if (*it == c) { ++it; ++pat; } else if (c == wildcard) { matches.push_back(std::make_pair(std::distance(std::begin(input), it), 0)); ++pat; if (pat == pat_end) { matches.back().second = input.size(); return true; } auto save = pat; std::size_t matched_chars = 0; while (it != end && pat != pat_end) { if (*it == *pat) { ++it; ++pat; ++matched_chars; if (pat == pat_end && it != end) { pat = save; matched_chars = 0;
Then I wrote a series of tests to see if I could find a pattern or input line that would require several levels of return (and therefore recursion or a stack), but I couldn't think of anything.
Here is my test function:
void test(const std::string& input, const std::string& pattern) { std::vector<std::pair<std::size_t, std::size_t>> matches; bool result = match_pattern(pattern, input, matches); auto match_iter = matches.begin(); std::cout << "INPUT: " << input << std::endl; std::cout << "PATTERN: " << pattern << std::endl; std::cout << "INDICES: "; for (auto& p : matches) { std::cout << "(" << p.first << "," << p.second << ") "; } std::cout << std::endl; if (result) { std::cout << "MATCH: "; for (std::size_t idx = 0; idx < input.size(); ++idx) { if (match_iter != matches.end()) { if (idx == match_iter->first) std::cout << '('; else if (idx == match_iter->second) { std::cout << ')'; ++match_iter; } } std::cout << input[idx]; } if (!matches.empty() && matches.back().second == input.size()) std::cout << ")"; std::cout << std::endl; } else { std::cout << "NO MATCH!" << std::endl; } std::cout << std::endl; }
And my actual tests:
test("aabaabaab", "a*b*ab"); test("aabaabaab", "a*"); test("aabaabaab", "aa*"); test("aabaabaab", "aaba*"); test("/foo/bar/baz/xlig/fig/blig", "/foo/*/blig"); test("/foo/bar/baz/blig/fig/blig", "/foo/*/blig"); test("abcdd", "*d"); test("abcdd", "*d*"); test("aabaabqqbaab", "a*baab"); test("aabaabaab", "a*baab");
So the outputs are:
INPUT: aabaabaab PATTERN: a*b*ab INDICES: (1,2) (3,7) MATCH: a(a)b(aaba)ab INPUT: aabaabaab PATTERN: a* INDICES: (1,9) MATCH: a(abaabaab) INPUT: aabaabaab PATTERN: aa* INDICES: (2,9) MATCH: aa(baabaab) INPUT: aabaabaab PATTERN: aaba* INDICES: (4,9) MATCH: aaba(abaab) INPUT: /foo/bar/baz/xlig/fig/blig PATTERN: /foo/*/blig INDICES: (5,21) MATCH: /foo/(bar/baz/xlig/fig)/blig INPUT: /foo/bar/baz/blig/fig/blig PATTERN: /foo/*/blig INDICES: (5,21) MATCH: /foo/(bar/baz/blig/fig)/blig INPUT: abcdd PATTERN: *d INDICES: (0,4) MATCH: (abcd)d INPUT: abcdd PATTERN: *d* INDICES: (0,3) (4,5) MATCH: (abc)d(d) INPUT: aabaabqqbaab PATTERN: a*baab INDICES: (1,8) MATCH: a(abaabqq)baab INPUT: aabaabaab PATTERN: a*baab INDICES: (1,5) MATCH: a(abaa)baab
The brackets that appear on the output after "MATCH: " show the substrings that were matched / consumed by each character * . This way, it works fine, and I cannot understand why it would be necessary to cancel several levels here - at least if we restrict our pattern to allow only * characters, and we accept greedy matches.
Therefore, I assume that I am definitely mistaken in this matter and probably do not notice anything simple - can someone help me understand why this algorithm may require several levels of return (and therefore recursion or stack)?