String similarity with Python + SQLite (Levenshtein distance / edit distance)

Question

Is there a string similarity measure available in Python + SQLite, for example through the sqlite3 module?

Usage example:

    import sqlite3

    conn = sqlite3.connect(':memory:')
    c = conn.cursor()
    c.execute('CREATE TABLE mytable (id integer, description text)')
    # SQL string literals use single quotes; double quotes denote identifiers
    c.execute("INSERT INTO mytable VALUES (1, 'hello world, guys')")
    c.execute("INSERT INTO mytable VALUES (2, 'hello there everybody')")

This query should match the row with id 1, but not the row with id 2:

    c.execute("SELECT * FROM mytable WHERE dist(description, 'He lo wrold gyus') < 6")

How can this be done with SQLite + Python?

Notes on what I have found so far:

Levenshtein distance, i.e. the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other, could be useful, but I'm not sure whether an official implementation exists in SQLite (I have only seen a few custom implementations).
Damerau-Levenshtein distance is the same, except that it also allows the transposition of two adjacent characters; it is also referred to as edit distance.
I know it is possible to define a function myself, but implementing such a distance would be non-trivial (making natural-language comparison functions efficient enough for databases is really non-trivial), which is why I wanted to see whether Python / SQLite already offers such a tool.
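For reference, here is a minimal sketch of the "define a function on your own" route mentioned above, assuming a plain dynamic-programming implementation is fast enough for the table size. The names edit_distance and dist are hypothetical; with transpositions=True the function computes the restricted Damerau-Levenshtein variant (optimal string alignment), which counts a swap of two adjacent characters as one edit. It is registered through sqlite3's create_function, so every row triggers a Python call and no index can be used.

```python
import sqlite3


def edit_distance(a, b, transpositions=False):
    """Levenshtein distance; with transpositions=True, the restricted
    Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


def dist(a, b):
    """SQL wrapper: NULL in, NULL out."""
    if a is None or b is None:
        return None
    return edit_distance(a, b, transpositions=True)


conn = sqlite3.connect(':memory:')
conn.create_function('dist', 2, dist)  # expose the Python function to SQL
conn.execute('CREATE TABLE mytable (id integer, description text)')
conn.execute("INSERT INTO mytable VALUES (1, 'hello world, guys')")
conn.execute("INSERT INTO mytable VALUES (2, 'hello there everybody')")

matches = conn.execute(
    "SELECT * FROM mytable WHERE dist(description, 'He lo wrold gyus') < 6"
).fetchall()
print(matches)  # expected: only the row with id 1
```

Note that the comparison is case-sensitive, and that plain Levenshtein (transpositions=False) counts each swapped pair as two edits, so the threshold would need to be higher for the same query.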

SQLite has FTS (Full Text Search) features: FTS3, FTS4, FTS5

    CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);      /* FTS3 table */
    CREATE TABLE enrondata2(content TEXT);                         /* Ordinary table */
    SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';   /* 0.03 seconds */
    SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%';  /* 22.5 seconds */

but I don't see any string comparison based on such a "similarity distance": the FTS MATCH and NEAR features do not seem to offer a similarity measure that accounts for changed letters, etc.

Moreover, this answer shows that:
The SQLite FTS engine is based on tokens: keywords that the search engine tries to match.
Various tokenizers are available, but they are relatively simple. The "simple" tokenizer just splits the text into words and lowercases them: for example, in the string "Fast brown fox jumps over a lazy dog" the word "jumps" will match, but not "jump". The porter tokenizer is a bit more advanced: it strips word conjugations, so that "jumps" and "jumping" will match, but a typo like "jmups" will not.
The latter (the fact that "jmups" cannot be found as similar to "jumping") unfortunately makes this impractical for my use case.
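The tokenizer behaviour described above can be checked with a small sketch like the following, assuming the SQLite library behind Python's sqlite3 module was compiled with FTS3/FTS4 (otherwise the CREATE VIRTUAL TABLE statement fails with "no such module"). The table name docs and the helper count_matches are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Assumption: FTS4 is available in this SQLite build
conn.execute("CREATE VIRTUAL TABLE docs USING fts4(content, tokenize=porter)")
conn.execute(
    "INSERT INTO docs(content) VALUES ('Fast brown fox jumps over a lazy dog')"
)


def count_matches(term):
    """Number of rows whose content matches the given full-text term."""
    query = "SELECT count(*) FROM docs WHERE content MATCH ?"
    return conn.execute(query, (term,)).fetchone()[0]


print(count_matches('jumping'))  # the stemmer reduces 'jumping' and 'jumps' to the same token
print(count_matches('jmups'))    # a typo is not stemmed to a known token
```

The first query finds the row because both words share a stem; the second finds nothing, which is the limitation that rules FTS out for typo-tolerant matching.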

python sqlite string-comparison sqlite3 similarity

Basj Apr 11 '18 at 15:41

1 answer

Basj · Answer 1 · 2018-04-13T10:58:14+0000

Here is a ready-to-use test.py example:

    import sqlite3

    db = sqlite3.connect(':memory:')
    db.enable_load_extension(True)
    db.load_extension('./spellfix.so')    # for Linux
    #db.load_extension('./spellfix.dll')  # <-- UNCOMMENT HERE FOR WINDOWS
    db.enable_load_extension(False)       # switch extension loading off again

    c = db.cursor()
    c.execute('CREATE TABLE mytable (id integer, description text)')
    # SQL string literals use single quotes; double quotes denote identifiers
    c.execute("INSERT INTO mytable VALUES (1, 'hello world, guys')")
    c.execute("INSERT INTO mytable VALUES (2, 'hello there everybody')")
    c.execute("SELECT * FROM mytable WHERE editdist3(description, 'hel o wrold guy') < 600")
    print(c.fetchall())
    # Output: [(1, u'hello world, guys')]  (no u prefix on Python 3)
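Instead of commenting lines in and out per platform, the loading step could be wrapped in a small helper. This is only a sketch: spellfix_filename and try_load_extension are hypothetical names, the file names are the ones built in this answer, and other platforms (such as macOS) are not covered. The helper also copes with Python builds where the sqlite3 module lacks extension loading support.

```python
import sqlite3
import sys


def spellfix_filename():
    """File name of the compiled extension as built in this answer."""
    return 'spellfix.dll' if sys.platform.startswith('win') else 'spellfix.so'


def try_load_extension(conn, path):
    """Return True if the extension at `path` was loaded, False otherwise."""
    if not hasattr(conn, 'enable_load_extension'):
        return False  # sqlite3 module was built without extension loading
    conn.enable_load_extension(True)
    try:
        conn.load_extension(path)
        return True
    except sqlite3.OperationalError:
        return False  # file missing or not a loadable extension
    finally:
        conn.enable_load_extension(False)  # always switch loading off again


db = sqlite3.connect(':memory:')
loaded = try_load_extension(db, './' + spellfix_filename())
print('spellfix loaded:', loaded)
```

If loaded is False, editdist3 is not available and the query above will fail with "no such function".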

Important note: editdist3 costs are scaled so that an insertion or a deletion costs 100 and a substitution costs 150.
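To get a feeling for that scale, and for why a threshold of 600 fits the example, here is a rough pure-Python sketch of a weighted edit distance using the same three costs. It is not editdist3 itself (the real function may differ in details, for instance when a custom cost table is used); weighted_distance is a hypothetical name.

```python
INSERT_COST = 100
DELETE_COST = 100
SUBSTITUTE_COST = 150


def weighted_distance(source, target):
    """Minimum total cost of turning source into target."""
    # previous[j] = cost of building target[:j] from an empty prefix of source
    previous = [j * INSERT_COST for j in range(len(target) + 1)]
    for i, cs in enumerate(source, start=1):
        current = [i * DELETE_COST]
        for j, ct in enumerate(target, start=1):
            substitution = 0 if cs == ct else SUBSTITUTE_COST
            current.append(min(previous[j] + DELETE_COST,
                               current[j - 1] + INSERT_COST,
                               previous[j - 1] + substitution))
        previous = current
    return previous[-1]


print(weighted_distance('cat', 'cats'))  # one insertion: 100
print(weighted_distance('cat', 'cut'))   # one substitution: 150
print(weighted_distance('hello world, guys', 'hel o wrold guy'))
print(weighted_distance('hello there everybody', 'hel o wrold guy'))
```

With these costs the first description stays below 600 (one substitution, one deletion plus one insertion for the swapped letters, and two more deletions add up to 550 at most), while the second one ends up well above it.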

Here's what to do first on Windows:

Download https://sqlite.org/2016/sqlite-src-3110100.zip and https://sqlite.org/2016/sqlite-amalgamation-3110100.zip, and unzip them.
Replace C:\Python27\DLLs\sqlite3.dll with the new sqlite3.dll. If you skip this step, you will later get sqlite3.OperationalError: The specified procedure could not be found

Run:

    call "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat"
    cl /I sqlite-amalgamation-3110100/ sqlite-src-3110100/ext/misc/spellfix.c /link /DLL /OUT:spellfix.dll
    python test.py

(With MinGW, it would be: gcc -g -shared spellfix.c -I ~/sqlite-amalgamation-3230100/ -o spellfix.dll)

Here's how to do it on Debian Linux:

(based on this answer)

    apt-get -y install unzip build-essential libsqlite3-dev
    wget https://sqlite.org/2016/sqlite-src-3110100.zip
    unzip sqlite-src-3110100.zip
    gcc -shared -fPIC -Wall -Isqlite-src-3110100 sqlite-src-3110100/ext/misc/spellfix.c -o spellfix.so
    python test.py
