Thanks to Hamish Grubizhan for this idea. Every ? in my OCR'd names can stand for 0 to 3 letters. What I am doing is expanding each string into a list of possible expansions, with @ marking exactly one unknown character:
    >>> list(expQuestions("?flcopt?"))
    ['flcopt', 'flcopt@', 'flcopt@@', 'flcopt@@@',
     '@flcopt', '@flcopt@', '@flcopt@@', '@flcopt@@@',
     '@@flcopt', '@@flcopt@', '@@flcopt@@', '@@flcopt@@@',
     '@@@flcopt', '@@@flcopt@', '@@@flcopt@@', '@@@flcopt@@@']
Then I expand both strings and run the match function, which I called matchats, on every pair of expansions:
    def matchOCR(l, r):
        # Try every expansion of l against every expansion of r
        for expl in expQuestions(l):
            for expr in expQuestions(r):
                if matchats(expl, expr):
                    return True
        return False
Works as desired:
    >>> matchOCR("Ro?co?er", "?flcopt?")
    True
    >>> matchOCR("Ro?co?er", "?flcopt?z")
    False
    >>> matchOCR("Ro?co?er", "?flc?pt?")
    True
    >>> matchOCR("Ro?co?e?", "?flc?pt?")
    True
Match Function:
    def matchats(l, r):
        """Match two strings with @ representing exactly 1 char"""
        if len(l) != len(r):
            return False
        for i, c1 in enumerate(l):
            c2 = r[i]
            if c1 == "@" or c2 == "@":
                continue  # a wildcard on either side matches anything
            if c1 != c2:
                return False
        return True
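To make the wildcard rule concrete, here is a small sketch of the same check written with zip and all, followed by a few cases. The name matchats_compact is mine and only for illustration; it is meant to behave like matchats above, not to replace it.

```python
def matchats_compact(l, r):
    """Same rule as matchats: equal length, '@' on either side matches any one char."""
    if len(l) != len(r):
        return False
    return all(c1 == c2 or "@" in (c1, c2) for c1, c2 in zip(l, r))

# Same length, every position is either equal or covered by an '@'
print(matchats_compact("Ro@@co@@er", "@@flcopt@@"))  # True
# Lengths differ, so no match even though the prefix agrees
print(matchats_compact("abc", "a@"))                 # False
# Same length, but 'c' and 'd' differ and neither is '@'
print(matchats_compact("abc", "abd"))                # False
```

The first case is the pair of expansions that lets "Ro?co?er" match "?flcopt?": each ? has been expanded to two @s.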
And the expansion function, where cartesian_product does just what its name says:
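    from itertools import product

    def cartesian_product(seqs):
        # Assumed stand-in for the helper, which is not shown in this answer:
        # yields every combination that takes one item from each sequence.
        return product(*seqs)

    def expQuestions(s):
        """For OCR'd strings with question marks in them, expand each ? to 0-3 @s, for all possibilities"""
        numqs = s.count("?")
        blah = list(s)
        for expqs in cartesian_product([(0, 1, 2, 3)] * numqs):
            newblah = blah[:]
            qi = 0  # index of the next ? to expand
            for i, c in enumerate(newblah):
                if c == '?':
                    newblah[i] = '@' * expqs[qi]
                    qi += 1
            yield "".join(newblah)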
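For anyone who wants to run the whole thing in one go, here is a self-contained sketch of the same approach. It assumes itertools.product in place of cartesian_product, and the snake_case names and the max_len parameter are mine, not part of the code above.

```python
from itertools import product

def exp_questions(s, max_len=3):
    """Yield every expansion of s where each '?' becomes 0..max_len '@' chars."""
    parts = s.split("?")  # literal text between the question marks
    for counts in product(range(max_len + 1), repeat=len(parts) - 1):
        out = [parts[0]]
        for n, part in zip(counts, parts[1:]):
            out.append("@" * n + part)
        yield "".join(out)

def matchats(l, r):
    """Equal length, and '@' on either side matches exactly one char."""
    if len(l) != len(r):
        return False
    return all(a == b or "@" in (a, b) for a, b in zip(l, r))

def match_ocr(l, r):
    """True if any expansion of l matches any expansion of r."""
    return any(matchats(el, er)
               for el in exp_questions(l)
               for er in exp_questions(r))

print(match_ocr("Ro?co?er", "?flcopt?"))   # True
print(match_ocr("Ro?co?er", "?flcopt?z"))  # False
```

One thing to keep in mind: a string with n question marks has 4**n expansions, so comparing two strings checks up to 4**(n1 + n2) pairs. For "Ro?co?er" against "?flcopt?" that is 16 * 16 = 256 pairs, which is fine for short names but grows quickly.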