Is there a way to filter a Django queryset based on string similarity (a la Python's difflib)?

I need to match cold leads against our customer database.

The leads come from a third-party provider in bulk (thousands of records), and sales is asking us to (in their words) "filter out our customers" so that they do not try to sell our service to an established customer.

Obviously there are misspellings and name variants in the leads. Charles becomes Charlie, Joseph becomes Joe, etc. Therefore, I cannot just do an exact filter comparing lead_first_name with client_first_name, etc.

I need to use some kind of string similarity mechanism.

Right now I'm using the excellent difflib module to compare the names of the leads with the list generated by Client.objects.all(). It works, but due to the number of customers it tends to be slow.

I know that most SQL databases have soundex and difference functions. See my test of them in the edit below: they do not work as well as difflib does.
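For reference, the approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in use: the helper names `full_name` and `split_leads` and the 0.8 cutoff are assumptions, and the plain list of tuples stands in for something like `Client.objects.values_list('first_name', 'last_name')`.

```python
import difflib


def full_name(first, last):
    # Normalise case so "JOE LOPES" and "Joe Lopes" compare equal.
    return f"{first} {last}".strip().lower()


def split_leads(leads, customers, cutoff=0.8):
    """Split leads into (new, known) by fuzzy-matching full names.

    `cutoff` is an assumed threshold; tune it against real data.
    """
    customer_names = [full_name(first, last) for first, last in customers]
    new, known = [], []
    for first, last in leads:
        # n=1: one sufficiently close customer is enough to flag the lead.
        hits = difflib.get_close_matches(
            full_name(first, last), customer_names, n=1, cutoff=cutoff
        )
        (known if hits else new).append((first, last))
    return new, known


# Stand-in data; in Django this would come from the Client table.
customers = [("Joseph", "Lopes"), ("Carlos", "Lopes")]
leads = [("Joe", "Lopes"), ("Joe", "Satriani")]
new_leads, known_leads = split_leads(leads, customers)
```

Every lead is compared against every customer name, so the cost grows with leads × customers, which is consistent with the slowness described above.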

Is there any other solution? Is there a better solution?

Edit:

Soundex, at least in my db, doesn't behave as well as difflib.

Here is a simple test: find "Joe Lopes" in a table containing "Joseph Lopes":

    with temp (first_name, last_name) as (
        select 'Joseph', 'Lopes'
        union select 'Joe', 'Satriani'
        union select 'CZ', 'Lopes'
        union select 'Blah', 'Lopes'
        union select 'Antonio', 'Lopes'
        union select 'Carlos', 'Lopes'
    )
    select first_name, last_name
    from temp
    where difference(first_name + ' ' + last_name, 'Joe Lopes') >= 3
    order by difference(first_name + ' ' + last_name, 'Joe Lopes')

The above returns "Joe Satriani" as the only match. Even lowering the similarity threshold to 2 does not return "Joseph Lopes" as a potential match.

But difflib does a much better job:

    difflib.get_close_matches('Joe Lopes', ['Joseph Lopes', 'Joe Satriani', 'CZ Lopes', 'Blah Lopes', 'Antonio Lopes', 'Carlos Lopes'])
    ['Joseph Lopes', 'CZ Lopes', 'Carlos Lopes']

Edit after gruszczy's answer:

Before writing my own, I searched and found a T-SQL implementation of Levenshtein distance in the repository of all knowledge.

In testing, it still doesn't do a better job than difflib.

This led me to look into which algorithm is behind difflib. It appears to be a modified version of Ratcliff-Obershelp.
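To see the scores behind the difflib result, `difflib.SequenceMatcher` can be called directly. A small sketch, where the `similarity` wrapper is just an illustrative helper: `ratio()` returns 2*M/T, with M the number of matched characters and T the combined length of both strings.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # ratio() is 2*M/T: M matched characters, T total length of both strings.
    return SequenceMatcher(None, a, b).ratio()


candidates = [
    "Joseph Lopes", "Joe Satriani", "CZ Lopes",
    "Blah Lopes", "Antonio Lopes", "Carlos Lopes",
]

# Rank every candidate against the lead name, best score first.
ranked = sorted(
    ((similarity("Joe Lopes", name), name) for name in candidates),
    reverse=True,
)
for score, name in ranked:
    print(f"{score:.3f}  {name}")
```

"Joseph Lopes" shares the long block " Lopes" plus "Jo" and "e" with "Joe Lopes", which is why it ranks first, while "Joe Satriani" only shares "Joe ".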

Unfortunately, I cannot find any other kind soul who has already created a difflib-based T-SQL implementation... I will try my hand at it when I can.

If no one else comes up with a better answer in the next few days, I will award it to gruszczy. Thank you, good sir.

2 answers

Soundex will not help, because it is a phonetic algorithm. Joe and Joseph are not phonetically similar, so soundex will not mark them as similar.

You can try the Levenshtein distance, which is implemented in PostgreSQL. It may be available in your database as well; if not, you should be able to write a stored procedure that calculates the distance between two strings and use it in your queries.
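As an illustration of what such a procedure would compute, here is a minimal Python sketch of the standard dynamic-programming Levenshtein distance. It is shown in Python only to make the algorithm concrete; the function name is arbitrary and a stored procedure would follow the same recurrence.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Joe", "Joseph"))   # 3 insertions: s, p, h
print(levenshtein("Lopes", "Lopez"))  # 1 substitution
```

Note that "Joe" and "Joseph" are three edits apart, as many edits as "Joe" has letters, so an absolute distance threshold that accepts this pair will also accept many unrelated short names.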


This is possible with the trigram_similar lookup, available since Django 1.10. See the docs on PostgreSQL-specific lookups and full-text search.
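To give a feel for what a trigram comparison measures, here is a rough pure-Python sketch. It approximates the idea (lower-case each word, pad it with two leading spaces and one trailing space, then compare the sets of three-character substrings); the exact normalisation done by the database extension may differ, so treat the numbers as illustrative only.

```python
import re


def trigrams(text):
    """Set of padded three-character substrings for each word in `text`."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # assumed padding: two spaces before, one after
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(trigram_similarity("Joe Lopes", "Joseph Lopes"))
print(trigram_similarity("Joe Lopes", "Joe Satriani"))
```

In Django itself, assuming a `Client` model with a `last_name` field, the lookup would be written along the lines of `Client.objects.filter(last_name__trigram_similar='Lopes')`. It requires PostgreSQL with the pg_trgm extension installed and `django.contrib.postgres` in `INSTALLED_APPS`, and the comparison then runs in the database rather than in Python.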
