Does Python + SQLite have a string similarity function, for example in the sqlite3 module?
Usage example:
import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute("INSERT INTO mytable VALUES (1, 'hello world, guys')")
c.execute("INSERT INTO mytable VALUES (2, 'hello there everybody')")
This query should match the row with id 1, but not the row with id 2:
c.execute("SELECT * FROM mytable WHERE dist(description, 'He lo wrold gyus') < 6")
How can this be done with SQLite + Python?
Notes on what I have found so far:
Levenshtein distance, that is, the minimum number of single-character edits (insertion, deletion or substitution) needed to change one word into the other, could be useful, but I'm not sure whether an official implementation exists in SQLite (I have seen a few custom implementations, for example this one).
Damerau-Levenshtein distance is the same, except that it also allows the transposition of two adjacent characters; it is also called the edit distance.
I know it is possible to define such a function myself, but implementing the distance would be non-trivial (making natural-language string comparison efficient enough for a database is genuinely hard), so I wanted to see whether Python / SQLite already provides such a tool.
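For reference, this is roughly what the "define it myself" route would look like: a minimal, unoptimized sketch that registers a plain Levenshtein function under the SQL name dist via sqlite3's create_function. The helper name levenshtein is my own, and the threshold is looser than the < 6 in my example query because plain Levenshtein is case-sensitive and charges two edits for every swapped pair of letters. It also runs the Python function on every row, so it is a full table scan and nothing like an indexed search.

```python
import sqlite3

def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance (case-sensitive)."""
    if a is None or b is None:
        return None  # propagate SQL NULL
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

conn = sqlite3.connect(':memory:')
# Expose the Python function to SQL as a 2-argument function named dist.
conn.create_function('dist', 2, levenshtein)
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute("INSERT INTO mytable VALUES (1, 'hello world, guys')")
c.execute("INSERT INTO mytable VALUES (2, 'hello there everybody')")

# Looser threshold than in the question, see the note above.
rows = c.execute(
    'SELECT id FROM mytable WHERE dist(description, ?) < 8',
    ('He lo wrold gyus',),
).fetchall()
print(rows)
```

A Damerau-Levenshtein variant would add one more case to the inner min for adjacent transpositions; the registration step would stay the same.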
SQLite has FTS (Full Text Search) features: FTS3, FTS4, FTS5
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);
CREATE TABLE enrondata2(content TEXT);
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%';
but I don't see any string comparison with such a "similarity distance": the FTS MATCH and NEAR features don't seem to offer a similarity measure that accounts for letter changes, etc.
Moreover, this answer shows that:
The SQLite FTS engine is based on tokens - keywords that the search engine is trying to match.
Various tokenizers are available, but they are relatively simple. The "simple" tokenizer just splits the text into words and lowercases each one: for example, in the string "The quick brown fox jumps over the lazy dog", the word "jumps" would match, but not "jump". The porter tokenizer is a little more advanced: it strips word conjugations, so that "jumps" and "jumping" would match, but a typo like "jmups" would not.
The latter (the fact that "jmups" cannot be found as similar to "jumping") unfortunately makes this impractical for my use case.
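To illustrate the limitation, here is a small sketch of the porter behaviour described in that answer. It assumes the SQLite library behind sqlite3 was compiled with FTS5 (common, but not guaranteed); the table name docs and the helper count_matches are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Assumption: FTS5 is available in this SQLite build.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(content, tokenize='porter')")
conn.execute(
    'INSERT INTO docs(content) VALUES (?)',
    ('The quick brown fox jumps over the lazy dog',),
)

def count_matches(term):
    """Number of rows whose content matches the given full-text term."""
    return conn.execute(
        'SELECT count(*) FROM docs WHERE docs MATCH ?', (term,)
    ).fetchone()[0]

stemmed = count_matches('jumping')  # same stem as "jumps"
typo = count_matches('jmups')       # misspelling, different stem
print(stemmed, typo)
```

The stemmed form finds the row, while the misspelled one does not, which is exactly the gap a distance-based comparison would have to fill.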