An elegant way to match two wildcards

Question

An elegant way to match two wildcards

I am writing text from two different sources. Everyone can make mistakes in different places where they do not recognize the letter / group of letters. If they do not recognize something, is it replaced by a symbol ?. For example, if the word is Roflcopter , one source may return Ro?copter and the other Roflcop?er . I would like the function to return two equivalent equivalents, allowing the use of several ? s. Example:

 match("Ro?copter", "Roflcop?er") --> True match("Ro?copter", "Roflcopter") --> True match("Roflcopter", "Roflcop?er") --> True match("Ro?co?er", "Roflcop?er") --> True

So far, I could match one OCR with a perfect one using regular expressions:

 >>> def match(tn1, tn2): tn1re = tn1.replace("?", ".{0,4}") tn2re = tn2.replace("?", ".{0,4}") return bool(re.match(tn1re, tn2) or re.match(tn2re, tn1)) >>> match("Roflcopter", "Roflcop?er") True >>> match("R??lcopter", "Roflcopter") True

But that will not work if they both have? s in different places:

 >>> match("R??lcopter", "Roflcop?er") False

+7

python string string-matching regex

Claudiu Oct 14 '10 at 18:57

source share

4 answers

OK if one? matches one character, then I can offer a tool and a fairly compact method.

 def match(str1, str2): if len(str1) != len(str2): return False for index, ch1 in enumerate(str1): ch2 = str2[index] if ch1 == '?' or ch2 == '?': continue if ch1 != ch2: return False return True

 >>> ================================ RESTART ================================ >>> >>> match("Roflcopter", "Roflcop?er") True >>> match("R??lcopter", "Roflcopter") True >>> >>> match("R??lcopter", "Roflcop?er") True >>>

Edit: Part B), now brain damage.

 def sets_match(set1, set2): return any(match(str1, str2) for str1 in set1 for str2 in set2)

 >>> ================================ RESTART ================================ >>> >>> s1 = set(['a?', 'fg']) >>> s2 = set(['?x']) >>> sets_match(s1, s2) # a? = x? True >>>

+2

Hamish grubijan Oct 14 '10 at 19:06

source share

Using Levenshtein distance may be useful. This will give a value of how similar the lines are to each other. This will work if they are also different. There are several psuedocode on the linked page to get you started.

You end up with something like this:

 >>> match("Roflcopter", "Roflcop?er") 1 >>> match("R??lcopter", "Roflcopter") 2 >>> match("R?lcopter", "Roflcop?er") 3

Thus, you may have a maximum threshold below which you say that they can match.

+1

Joshd Oct 14 '10 at 19:54

source share

This may not be the most Pythonic options, but if a ? allowed to match any number of characters, then the following search in the opposite direction performs the trick:

 def match(a,b): def matcher(i,j): if i == len(a) and j == len(b): return True elif i < len(a) and a[i] == '?' \ or j < len(b) and b[j] == '?': return i < len(a) and matcher(i+1,j) \ or j < len(b) and matcher(i,j+1) elif i == len(a) or j == len(b): return False else: return a[i] == b[j] and matcher(i+1,j+1) return matcher(0,0)

This can be adapted to be more rigorous in what needs to be matched. In addition, to save the stack space, the final case ( i+1,j+1 ) can be transformed into a non-recursive solution.

Edit : A few more clarifications in response to the reactions below. This is an adaptation of the naive matching algorithm for simplified regular expressions / NFA (see Kernighan contrib to Beautiful Code, O'Reilly 2007 or Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2009).

How it works: the matcher function recursively scans both lines / patterns starting at (0,0) . This succeeds when it reaches the end of both lines (len(a),len(b)) ; it fails when it encounters two unequal characters or the end of one line, while on the other line there are characters that must match.

When matcher encounters a variable ( ? ) matcher any line (say a ), it can do two things: either skip the variable (matching zero characters) or skip the next character in b , but keep pointing to the variable in a so that it matches the larger number of characters.

+1

Fred foo Oct 14 '10 at 20:50

source share

Claudiu · Accepted Answer · 2010-10-14T20:48:13+0000

Thanks to Hamish Grubizhan for this idea. Everyone? in my names ocr'd can be from 0 to 3 letters. What I am doing is expanding each line to a list of possible extensions:

 >>> list(expQuestions("?flcopt?")) ['flcopt', 'flcopt@', 'flcopt@@', 'flcopt@@@', '@flcopt', '@flcopt@', '@flcopt@@', '@flcopt@@@', '@@flcopt', '@@flcopt@', '@@flcopt@@', '@@flcopt@@@', '@@@flcopt', '@@@flcopt@', '@@@flcopt@@', '@@@flcopt@@@']

then I expand both and use its match function, which I called matchats :

 def matchOCR(l, r): for expl in expQuestions(l): for expr in expQuestions(r): if matchats(expl, expr): return True return False

Works as desired:

 >>> matchOCR("Ro?co?er", "?flcopt?") True >>> matchOCR("Ro?co?er", "?flcopt?z") False >>> matchOCR("Ro?co?er", "?flc?pt?") True >>> matchOCR("Ro?co?e?", "?flc?pt?") True

Match Function:

 def matchats(l, r): """Match two strings with @ representing exactly 1 char""" if len(l) != len(r): return False for i, c1 in enumerate(l): c2 = r[i] if c1 == "@" or c2 == "@": continue if c1 != c2: return False return True

and an extension function, where cartesian_product does just that:

 def expQuestions(s): """For OCR w/ a questionmark in them, expand questions with @s for all possibilities""" numqs = s.count("?") blah = list(s) for expqs in cartesian_product([(0,1,2,3)]*numqs): newblah = blah[:] qi = 0 for i,c in enumerate(newblah): if newblah[i] == '?': newblah[i] = '@'*expqs[qi] qi += 1 yield "".join(newblah)

An elegant way to match two wildcards

More articles: