How to do a fuzzy string search without a heavy database?

I have a mapping of directory numbers to product names:

35 cozy comforter 35 warm blanket 67 pillow 

and you need a search in which erroneous mixed names are found, such as "warm cmfrter" .

We have code using edit-distance (difflib), but it probably won't scale for 18000 names.

I achieved something similar with Lucene, but since PyLucene only wraps Java, which will complicate deployment for end users.

SQLite usually does not contain full text or scoring.

Xapian bindings are similar to C ++ and have some learning curve.

Whoosh is not well documented yet, but includes offensive spell checking.

What else do you have?

+6
python database full-text-search fuzzy-search
source share
6 answers

Apparently the only way to quickly make fuzzy comparisons is to make fewer of them;)

Instead of recording another n-gram search or improving one that is in Whoosh, we now save the word index, extract all records that contain at least one (correctly spelled) word along with the query, and use difflib to rank these words . Works well in this case.

+4
source share

Nucular has a full-text search, but it does not support matching matches out of the box. You can try adding an additional field to each record that indexes the sound translation of terms, and then searches using the user input conversion soundex. I really don't know how well this will work ...

Take a look at advas http://advas.sourceforge.net/news.php , which has a nice demo comparing various soundex-like methods:

 advas/examples Aaron$ python phonetic_algorithms.py soundex metaphone nyiis caverphone ==================================================================================================== schmidt : S253 sxmtt sssnad SKMT111111 schmid : S253 sxmt sssnad SKMT111111 schmitt : S253 sxmt sssnatt SKMT111111 smith : S530 sm0h snatt SMT1111111 smythe : S530 smy0h snatt SMT1111111 schmied : S253 sxmt sssnaad SKMT111111 mayer : M600 myr naaar MA11111111 meier : M600 mr naaar MA11111111 .... 

I donโ€™t know if any of them are suitable for your unnamed language ...

+2
source share

You will get too many false positives when implementing SOUNDEX. In total, only 26,000 (at most) SOUNDEX codes are available.

Although the Metaphone algorithm was designed for English surnames, it is great for spelling mistakes; I used it once in a branch locator, and it was pretty successful.

Add a field with a Metaphone translation and match it if no exact match is found. You will still receive false positives, but less than with other algorithms.

+2
source share

The usual way to calculate the editing distance between two lines is a rather expensive algorithm (if I remember correctly, its temporal complexity is quadratic). Perhaps if you used a different string affinity metric, your problem would disappear.

One of my favorite methods for matching fuzzy strings is matching trigrams . Comparing two lines using this method has linear time complexity, which is much better than the mentioned editing distance. You can find my Python implementation on Github . There is also a PostgreSQL contrib module . It should not be too difficult to adapt it to work with SQLite3.

+1
source share

Sybase SQL Anywhere has a free version of Web Edition / Developer Edition and comes with full text indexing / search and the FUZZY operator (and some restriction implementation).

To quote from the documentation:

 Specifying 'FUZZY "500 main street"' is equivalent to '500 OR mai OR ain OR str OR tre OR ree OR eet'. 

Another approach would be to use scoring for full-text searches.

+1
source share

sqlite3 supports python callback functions. Matthew Barnett's regular expression (http://code.google.com/p/mrab-regex-hg/) now supports approximate matching.

So, like this:

 try: import regex except ImportError: sys.stderr.write("Can't import mrab-regex; see http://pypi.python.org/pypi/regex\n") sys.exit(1) def _sqlite3_regex(expr, item): return (not (not regex.search(expr, item))) def main(): ... database = sqlite3.connect(dbfile) database.create_function("regexp", 2, _sqlite3_regex) pattern = '(?:%s){e<=%d}' % (queriedname, distance) print [x for x in database.cursor().execute( "SELECT * FROM products WHERE (productname regexp '%s')" % pattern)] 
+1
source share

All Articles