Levenshtein Search

I work on a site that sells, let's say, stuff, and offers a "seller search." In this search, you enter your city, zip code, or region plus a distance (in km or miles), and the site returns a list of suppliers.

For this, I have a database of suppliers. In the form used to save a supplier, you enter the full address, and when you click the "Save" button, a request is made to Google Maps to get the supplier's latitude and longitude.
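A simplified sketch of what that save-time geocoding step might look like (the endpoint and JSON fields are those of Google's public Geocoding web service; the function name, key handling, and error handling are illustrative assumptions):

    <?php
    // Geocode a free-form address into lat/lng via the Google Geocoding API.
    // Returns null when the address cannot be resolved.
    function geocodeAddress(string $address, string $apiKey): ?array
    {
        $url = 'https://maps.googleapis.com/maps/api/geocode/json'
             . '?address=' . urlencode($address)
             . '&key=' . urlencode($apiKey);

        $response = file_get_contents($url);
        if ($response === false) {
            return null; // network problem
        }

        $data = json_decode($response, true);
        if (($data['status'] ?? '') !== 'OK') {
            return null; // address not found, quota exceeded, ...
        }

        $loc = $data['results'][0]['geometry']['location'];
        return ['lat' => $loc['lat'], 'lng' => $loc['lng']];
    }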

When someone performs a search, I look at the table where I store all the search terms and their lat / lng. This table looks like this:

 +--------+-------+------+
 | term   | lat   | lng  |
 +--------+-------+------+

So the first query is very simple:

 select lat, lng from my_search_table where term = "the term" 

If I find a result, I then search, with a suitable method, for all the suppliers within the range the visitor wants, and plot the results on a map.
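One common way to do that "suppliers in range" step (an assumption, since the question only says "a suitable method") is a haversine distance computed in SQL; the suppliers table and its column names below are illustrative:

    <?php
    // Return suppliers within $radiusKm of the given point, nearest first.
    // 6371 is the Earth's radius in km; MySQL's HAVING extension lets us
    // filter on the computed alias without repeating the whole expression.
    function suppliersWithin(PDO $db, float $lat, float $lng, float $radiusKm): array
    {
        $sql = "
            SELECT id, name, lat, lng,
                   (6371 * ACOS(
                       COS(RADIANS(:lat1)) * COS(RADIANS(lat)) *
                       COS(RADIANS(lng) - RADIANS(:lng1)) +
                       SIN(RADIANS(:lat2)) * SIN(RADIANS(lat))
                   )) AS distance_km
            FROM suppliers
            HAVING distance_km <= :radius
            ORDER BY distance_km";

        $stmt = $db->prepare($sql);
        $stmt->execute([
            'lat1'   => $lat,
            'lng1'   => $lng,
            'lat2'   => $lat,
            'radius' => $radiusKm,
        ]);
        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }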

If I can’t find a result, I search using the Levenshtein distance, because people writing "bruxelle" or "bruxeles" instead of "bruxelles" is really common, and I don’t want to query Google Maps constantly (I also have a "how many times searched" column in my table to collect some statistics).

So I query my_search_table without a WHERE clause and loop over all the results to find the smallest Levenshtein distance. If the smallest distance is greater than 2, I request the coordinates from Google Maps.
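Roughly, that fallback amounts to something like the following sketch (the table and column names follow the question; everything else is assumed). This full scan is exactly the part that gets slow at 15-20k rows:

    <?php
    // Find the stored term closest to the visitor's input, using PHP's
    // built-in levenshtein(). Returns null when nothing is within distance 2,
    // which is the signal to fall back to Google Maps.
    function closestTerm(PDO $db, string $input): ?array
    {
        $best     = null;
        $bestDist = PHP_INT_MAX;

        foreach ($db->query('SELECT term, lat, lng FROM my_search_table') as $row) {
            $d = levenshtein(strtolower($input), strtolower($row['term']));
            if ($d < $bestDist) {
                $bestDist = $d;
                $best     = $row;
            }
        }

        return ($best !== null && $bestDist <= 2) ? $best : null;
    }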

Here is my problem. For some countries (we have several sites around the world), my_search_table has 15-20k+ entries... and PHP doesn’t really like looping over that many rows (which I understand very well), so my request hits the PHP timeout. I could increase that timeout, but the same problem will just come back within a few months.

So I tried the MySQL levenshtein function (found in https://stackoverflow.com/a/13677/ ... btw), but it is also very slow.

So my question is: "Is there a way to quickly do this search even on very large datasets?"

+7
3 answers

My suggestion is based on three things:

  • First, your dataset is large: big enough to rule out "select everything" + "run levenshtein() in the PHP application".
  • Second, you have control over your database, so you can change some architecture-related things.
  • Finally, the performance of SELECT queries is what matters most, and the performance of adding new data does not matter much.

The thing is, you cannot make a Levenshtein search fast as such, because computing the Levenshtein distance is itself slow (the standard algorithm is O(m·n) in the lengths of the two strings). So you cannot solve the problem with a "smart search" alone; you will need to prepare some data.

Possible solution: create a group index and assign it when adding/updating data. That means you store an additional column holding a hash of sorts (a numeric one, for example). When adding new data you will:

  • Perform a Levenshtein-distance search of the inserted term against all entries in the table (using either your application code or the function you already mentioned)
  • Set the new row's group index to the group index of the rows found in the previous step
  • If nothing is found, assign a new group index value (this is the "first row, no similar rows yet" case), different from any group index value already in the table

To find the relevant rows you then just select the rows with the same group index value, which means your search queries will be very fast. But yes, this leads to a very large overhead when adding/modifying data, so it is not applicable when update/insert performance matters.
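A minimal sketch of that insert-time grouping, assuming an extra numeric column called group_idx and a distance tolerance of 2 (both are assumptions; the answer leaves the details open):

    <?php
    // Decide which group a new term belongs to. This is the expensive step,
    // but it runs only once per inserted term.
    function assignGroupIndex(PDO $db, string $newTerm): int
    {
        foreach ($db->query('SELECT term, group_idx FROM my_search_table') as $row) {
            if (levenshtein(strtolower($newTerm), strtolower($row['term'])) <= 2) {
                return (int) $row['group_idx']; // join the existing group
            }
        }

        // No similar term yet: open a new group.
        $max = (int) $db->query('SELECT COALESCE(MAX(group_idx), 0) FROM my_search_table')
                        ->fetchColumn();
        return $max + 1;
    }

    // On insert:
    //   $idx = assignGroupIndex($db, $term);
    //   INSERT INTO my_search_table (term, lat, lng, group_idx) VALUES (..., $idx)
    //
    // At search time, selecting by group_idx is cheap and can use an ordinary index:
    //   SELECT term, lat, lng FROM my_search_table WHERE group_idx = :idx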

+4

You can try MySQL's SOUNDS LIKE operator (shorthand for comparing SOUNDEX() values):

 SELECT lat, lng FROM my_search_table WHERE term SOUNDS LIKE "the term" 
+1

To speed up the search, you can use a k-d tree or a ternary search tree. The idea is to use binary search.
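As a concrete illustration of the tree idea, here is a rough PHP sketch of a BK-tree, a metric tree often used for edit-distance lookups (note this is a different structure from the k-d/ternary trees named above, and all class and method names are made up):

    <?php
    // One node per stored term; children are keyed by edit distance.
    class BKNode {
        public string $term;
        /** @var array<int, BKNode> */
        public array $children = [];

        public function __construct(string $term) {
            $this->term = $term;
        }
    }

    class BKTree {
        private ?BKNode $root = null;

        public function add(string $term): void {
            if ($this->root === null) {
                $this->root = new BKNode($term);
                return;
            }
            $node = $this->root;
            while (true) {
                $d = levenshtein($term, $node->term);
                if ($d === 0) {
                    return; // already stored
                }
                if (!isset($node->children[$d])) {
                    $node->children[$d] = new BKNode($term);
                    return;
                }
                $node = $node->children[$d];
            }
        }

        /** Return all stored terms within $maxDist edits of $query. */
        public function search(string $query, int $maxDist): array {
            if ($this->root === null) {
                return [];
            }
            $results = [];
            $stack   = [$this->root];
            while ($stack) {
                $node = array_pop($stack);
                $d = levenshtein($query, $node->term);
                if ($d <= $maxDist) {
                    $results[$node->term] = $d;
                }
                // Triangle inequality: only children keyed in
                // [$d - $maxDist, $d + $maxDist] can contain matches.
                foreach ($node->children as $key => $child) {
                    if ($key >= $d - $maxDist && $key <= $d + $maxDist) {
                        $stack[] = $child;
                    }
                }
            }
            return $results;
        }
    }

    // Usage: build the tree once from my_search_table, then query it in memory.
    $tree = new BKTree();
    foreach (['bruxelles', 'brussel', 'paris'] as $term) {
        $tree->add($term);
    }
    print_r($tree->search('bruxeles', 2)); // -> Array ( [bruxelles] => 1 )

Instead of comparing the query against every row, the search only descends into branches that can still contain a match, which is what makes it faster than the full scan.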

0
