SQL query nearest SQL neighbor

Question

SQL query nearest SQL neighbor

I am having problems with efficient SQL query to handle the following situation:

Suppose we have a table with two columns

groupId : int value : float

The table is huge (several million rows). There are different "values" for "groupId" - they say something between 100 and 50,000. All float values are greater than or equal to zero, but otherwise unlimited.

For this group of Id, the query should return all other groups sorted by decreasing similarity, where “similar” is defined as the minimum Euclidean distance between all possible pairs of 30 values in two groups.

This definition of similarity is what kills me. I think that to calculate the similarity, as defined above, the naiive algorithm is O (n ^ 2). Now I am looking for ideas to redefine "similarity" or to effectively implement the above. I could imagine a solution involving the nearest neighbor k, something like the geometric nearest neighbors of PostGis or maybe the largest general subsequence algorithm (although I would need a “fuzzy” implementation of the latter, because the “values” are hardly ever compare exactly the same).

We are currently on mySQL, if that matters.

amuses

 Sören

+6

sql nearest-neighbor

BuschnicK Apr 6 '09 at 9:25

source share

3 answers

Daniel Brückner · Answer 1 · 2009-04-06T19:14:22+0000

Could you confirm the correctness of the question?

Your table represents the vectors identified by groupId. Each vector has a dimension from 100 to 50,000, but there is no order defined for the measurement. This vector from the table is actually a representative of the equivalence class.

Now you define the similarity of the two equivalence classes as the minimum Euclidean distance of the projections of any two representatives of the equivalence classes into the subspace of the first 30 dimensions.

Examples for projection in two dimensions:

 A = <1, 2, 3, 4> B = <5, 6, 7, 8, 9, 10>

A represents the following equivalence class of vectors.

 <1, 2, 3, 4> <2, 1, 2, 3> <3, 1, 2, 4> <4, 1, 2, 3> <1, 2, 4, 4> <2, 1, 3, 2> <3, 1, 4, 2> <4, 1, 3, 2> <1, 3, 2, 4> <2, 3, 1, 4> <3, 2, 1, 4> <4, 2, 1, 3> <1, 3, 4, 2> <2, 3, 4, 1> <3, 2, 4, 1> <4, 2, 3, 1> <1, 4, 2, 2> <2, 4, 1, 3> <3, 4, 1, 2> <4, 3, 1, 2> <1, 4, 3, 2> <2, 4, 3, 1> <3, 4, 2, 1> <4, 3, 2, 1>

Projecting the entire representative of this equivalence class on the first two dimensions gives.

 <1, 2> <1, 3> <1, 4> <2, 1> <2, 3> <2, 4> <3, 1> <3, 2> <3, 4> <4, 1> <4, 2> <4, 3>

B represents an equivalence class with 720 elements. The projection on the first two dimensions gives 30 elements.

 < 5, 6> < 5, 7> < 5, 8> < 5, 9> < 5, 10> < 6, 5> < 6, 7> < 6, 8> < 6, 9> < 6, 10> < 7, 5> < 7, 6> < 7, 8> < 7, 9> < 7, 10> < 8, 5> < 8, 6> < 8, 7> < 8, 9> < 8, 10> < 9, 5> < 9, 6> < 9, 7> < 9, 8> < 9, 10> <10, 5> <10, 6> <10, 7> <10, 8> <10, 9>

Thus, the distance A and B is the square root of 8, since this is the minimum distance of two vectors from the projections. For example, <3, 4> and <5, 6> give this distance.

So, do I agree with understanding the problem?

A truly naive algorithm for n vectors with m components would have to calculate (n - 1) distances. For each distance, the algorithm would calculate the distances m! / (M - 30)! projection for each vector. Thus, for 100 dimensions (your lower bound) there are 2.65 * 10 ^ 32 possible projections for the vector. To do this, you need to calculate about 7 * 10 ^ 64 distances between projections and find the minimum to find the distance of two vectors. And then repeat this once n times.

I hope that I misunderstood you or made a mistake. Otherwise, it sounds something between really complex and impossible.

Something I was thinking about was arranging the vector components and trying to match them. Using Manhattan distance - if possible - can help simplify the solution.

fuzzy-waffle · Answer 2 · 2009-04-07T01:47:44+0000

Here are some good approximations:

You can calculate the center of mass of each group and then compare based on the distance from each center of mass of the group.

Another way you could do this is a hash, the coordinates of each line and lines that the hash in the same place are considered similar, and thus two groups are updated.

Additional information would be helpful, for example:

Whether information is constantly updated, and if so, at what interval. How to update and how accurate should it be?

Friguy · Answer 3 · 2009-04-07T02:21:38+0000

The naive version will look something like this: (does not start through the query analyzer)

 select groupid, min(distance) as mindist from (select other.groupid as groupid, min(abs(other.value - us.value)) as distance from g us join g other on other.groupid != us.groupid where us.groupid = ?) order by mindist group by groupid

Then, to follow the directions:

 select groupid, min(abs(value - usvalue)) as mindist from (select other.groupid as groupid, max(other.value) as value, us.value as usvalue from g us join g other on other.groupid != us.groupid and other.value <= us.value where us.groupid = ? union select other.groupid as groupid, min(other.value) as value, us.value as usvalue from g us join g other on other.groupid != us.groupid and other.value >= us.value where us.groupid = ?) order by mindist group by groupid

This will hopefully allow mysql to use the index to quickly find the closest neighbors in the connection.

There may be errors in this, but I hope this line of thinking will help.

SQL query nearest SQL neighbor

More articles: