Approximate match of author name strings - modules and strategies

I created a small program that checks if authors are present in the authors database. I could not find any specific modules for this problem, so I am writing this from scratch, using modules for approximate string matching.

The database contains about 6,000 authors and is very poorly formatted (many typos, variations, names, such as "Doctor", etc.). The list of query authors is usually between 500-1000 (and I have many such lists), which makes speed very important.

My overall strategy is to crop and filter the database as much as possible and look for exact matches. If no matches are found, I proceed to approximate string matching.

I am currently using the built-in difflib.get_close_matches , which does exactly what I want, however it is very slow (a few minutes). Therefore, I am looking for other options:

  • What is the fastest module that can return the best, say, 3 matches above some threshold in the database giving a query string?
  • What is the fastest module for comparing two strings?

The only thing I found is fuzzy wuzzy, which is even slower than difflib.

+4
source share
2 answers

Try fuzzywuzzy with native-C python-levenshtein lib.

I run a benchmark on my PC to search for the best candidates of 8 words in ~ ~ 19k word-lists with and without C-native levenshtein backend (using pip install python_Levenshtein-0.12.0-cp34-none-win_amd64.whl ) and I got these timings:

  • No C-backend:
    Compared to 151664 words in 48.591717004776 seconds (0.00032039058052521366 seconds / Search).
  • Installed C-backend:
    Compared to 151664 words in 13.034106969833374 sec (8.594067787895198e-05 sec / search).

This is ~ x4 faster (but not as much as I expected).

Here are the results:

 0 of 8: Compared 'Lemaire' --> `[('L.', 90), ('Le', 90), ('A', 90), ('Re', 90), ('Em', 90)]` 1 of 8: Compared 'Peil' --> `[('L.', 90), ('E.', 90), ('Pfeil', 89), ('Gampel', 76), ('Jo-pei', 76)]` 2 of 8: Compared 'Singleton' --> `[('Eto', 90), ('Ng', 90), ('Le', 90), ('to', 90), ('On', 90)]` 3 of 8: Compared 'Tagoe' --> `[('Go', 90), ('A', 90), ('T', 90), ('E.', 90), ('Sagoe', 80)]` 4 of 8: Compared 'Jgoun' --> `[('Go', 90), ('Gon', 75), ('Journo', 73), ('Jaguin', 73), ('Gounaris', 72)]` 5 of 8: Compared 'Ben' --> `[('Benfer', 90), ('Bence', 90), ('Ben-Amotz', 90), ('Beniaminov', 90), ('Benczak', 90)]` 6 of 8: Compared 'Porte' --> `[('Porter', 91), ('Portet', 91), ('Porten', 91), ('Po', 90), ('Gould-Porter', 90)]` 7 of 8: Compared 'Nyla' --> `[('L.', 90), ('A', 90), ('Sirichanya', 76), ('Neyland', 73), ('Greenleaf', 67)]` 

And here is the standard python code:

 import os import zipfile from urllib import request as urlrequest from fuzzywuzzy import process as fzproc import time import random download_url = 'http://www.outpost9.com/files/wordlists/actor-surname.zip' zip_name = os.path.basename(download_url) fname, _ = os.path.splitext(zip_name) def fuzzy_match(dictionary, search): nsearch = len(search) for i, s in enumerate(search): best = fzproc.extractBests(s, dictionary) print("%i of %i: Compared '%s' --> `%s`" % (i, nsearch, s, best)) def benchmark_fuzzy_match(wordslist, dict_split_ratio=0.9996): """ Shuffle and split words-list into `dictionary` and `search-words`. """ rnd = random.Random(0) rnd.shuffle(wordslist) nwords = len(wordslist) ndictionary = int(dict_split_ratio * nwords) dictionary = wordslist[:ndictionary] search = wordslist[ndictionary:] fuzzy_match(dictionary, search) return ndictionary, (nwords - ndictionary) def run_benchmark(): if not os.path.exists(zip_name): urlrequest.urlretrieve(download_url, filename=zip_name) with zipfile.ZipFile(zip_name, 'r') as zfile: with zfile.open(fname) as words_file: blines = words_file.readlines() wordslist = [line.decode('ascii').strip() for line in blines] wordslist = wordslist[4:] # Skip header. t_start = time.time() ndict, nsearch = benchmark_fuzzy_match(wordslist) t_finish = time.time() t_elapsed = t_finish - t_start ncomparisons = ndict * nsearch sec_per_search = t_elapsed / ncomparisons msg = "Compared %s words in %s sec (%s sec/search)." print(msg % (ncomparisons, t_elapsed, sec_per_search)) if __name__ == '__main__': run_benchmark() 
+1
source

The Python Natural Language Toolkit (nltk) may have additional resources that you could try - this google group stream seems like a good start to this, Just an idea.

0
source

All Articles