Implementing a kd-tree to search for "nearest neighbors" in MySQL?

I am developing automated trading software for the forex market. In a MySQL database I have five-minute market data. For each row I have 4 different indicators along with price and time.

[Time|Price|M1|M2|M3|M4] x ~400,000

Time is the primary key, and M1 through M4 are different indicators (for example, standard deviation or moving average slope).

Here is a real example (excerpt):

 +------------+--------+-----------+--------+-----------+-----------+
 | Time       | Price  | M1        | M2     | M3        | M4        |
 +------------+--------+-----------+--------+-----------+-----------+
 | 1105410300 | 1.3101 | 12.9132   | 0.4647 | 29.6703   | 50        |
 | 1105410600 | 1.3103 | 14.056    | 0.5305 | 29.230801 | 50        |
 | 1105410900 | 1.3105 | 15.3613   | 0.5722 | 26.8132   | 25        |
 | 1105411200 | 1.3106 | 16.627501 | 0.4433 | 24.395599 | 26.47059  |
 | 1105411500 | 1.3112 | 18.7843   | 1.0019 | 24.505501 | 34.375    |
 | 1105411800 | 1.3111 | 19.8375   | 0.5626 | 20        | 32.8125   |
 | 1105412100 | 1.3105 | 20.0168   | 0.6718 | 9.7802    | 23.4375   |
 | 1105412400 | 1.3105 | 20.4538   | 0.8943 | 7.033     | 23.4375   |
 | 1105412700 | 1.3109 | 21.6078   | 0.4902 | 11.7582   | 29.6875   |
 | 1105413000 | 1.3104 | 21.2045   | 1.565  | 8.6813    | 21.875    |
 +------------+--------+-----------+--------+-----------+-----------+
 ...400k more

Given input values for M1, M2, M3 and M4, I want to (quickly and accurately) find the 5,000 closest matches.

Input Example:

 +------------+--------+-----------+--------+-----------+-----------+
 | Time       | Price  | M1        | M2     | M3        | M4        |
 +------------+--------+-----------+--------+-----------+-----------+
 | 1205413000 | 1.4212 | 20.1045   | 1.0012 | 9.1013    | 11.575    |
 +------------+--------+-----------+--------+-----------+-----------+

I realized that each of these metrics can be considered a “dimension” and that I can do the nearest neighbor search to find the nearest data points in this multidimensional space.

It seems that the easiest way to do this is to iterate over each data point and measure the multidimensional distance to my input point; but speed matters!
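For reference, the brute-force scan described above can be sketched in a few lines of Python. This is only an illustration: it assumes the rows have already been fetched into memory as `(time, (m1, m2, m3, m4))` tuples, the function name `brute_force_nearest` is hypothetical, and it measures Euclidean distance on the raw values rather than on ranks.

```python
import heapq
import math

def brute_force_nearest(rows, query, k):
    """Return the k rows closest to `query` by Euclidean distance.

    rows  : iterable of (time, (m1, m2, m3, m4)) tuples (assumed layout)
    query : (m1, m2, m3, m4)
    """
    # nsmallest scans every row once and keeps only the k best candidates.
    return heapq.nsmallest(k, rows, key=lambda row: math.dist(row[1], query))

# A few rows from the excerpt above, and the example input.
rows = [
    (1105410300, (12.9132, 0.4647, 29.6703, 50.0)),
    (1105412100, (20.0168, 0.6718, 9.7802, 23.4375)),
    (1105412400, (20.4538, 0.8943, 7.033, 23.4375)),
    (1105413000, (21.2045, 1.565, 8.6813, 21.875)),
]
query = (20.1045, 1.0012, 9.1013, 11.575)
nearest = brute_force_nearest(rows, query, k=2)
```

Every query touches every row, which is exactly the cost a spatial index is meant to avoid.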

I read about something called kd-trees, which are used for this purpose. Can someone explain, or point me to material that explains, how to implement this in MySQL?

It may be relevant that I can pre-process the table, but the input arrives in real time.
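To make the idea concrete, here is a minimal kd-tree sketch in Python. It is an assumption-laden illustration, not a MySQL feature: the tree lives in application memory, is built once from the pre-processed table, and the names `build_kdtree`, `sq_dist` and `k_nearest` are hypothetical. Each node splits the points on one coordinate, cycling through the four metrics by depth, and the search skips a subtree when the splitting plane is farther away than the current k-th best match.

```python
import heapq

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree as nested tuples (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])            # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                   # median point becomes the node
    return (
        points[mid],
        axis,
        build_kdtree(points[:mid], depth + 1),
        build_kdtree(points[mid + 1:], depth + 1),
    )

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_nearest(tree, query, k):
    """Return the k points nearest to `query`, closest first."""
    heap = []  # max-heap emulated with negated squared distances

    def visit(node):
        if node is None:
            return
        point, axis, left, right = node
        d2 = sq_dist(point, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d2, point))
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, (-d2, point))
        diff = query[axis] - point[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        # Cross the splitting plane only if it could hide a closer point.
        if len(heap) < k or diff * diff < -heap[0][0]:
            visit(far)

    visit(tree)
    return [p for _, p in sorted(heap, reverse=True)]

# Usage with (M1, M2, M3, M4) tuples taken from the excerpt above.
points = [
    (20.0168, 0.6718, 9.7802, 23.4375),
    (20.4538, 0.8943, 7.033, 23.4375),
    (21.6078, 0.4902, 11.7582, 29.6875),
    (21.2045, 1.565, 8.6813, 21.875),
]
tree = build_kdtree(points)
closest = k_nearest(tree, (20.1045, 1.0012, 9.1013, 11.575), k=2)
```

With k as large as 5,000 the pruning step is likely to discard fewer subtrees than it would for a handful of neighbors, so the gain over a full scan should be measured rather than assumed.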

Currently, I'm just doing a rough cluster around the data for each dimension independently:

 INSERT INTO Dim1 SELECT * FROM myTable AS myTable USE INDEX(M1)
     WHERE myTable.M1 < currentM1 ORDER BY M1 DESC LIMIT 2500;
 INSERT INTO Dim1 SELECT * FROM myTable AS myTable USE INDEX(M1)
     WHERE myTable.M1 > currentM1 ORDER BY M1 ASC LIMIT 2500;

 INSERT INTO Dim2 SELECT * FROM myTable AS myTable USE INDEX(M2)
     WHERE myTable.M2 < currentM2 ORDER BY M2 DESC LIMIT 2500;
 INSERT INTO Dim2 SELECT * FROM myTable AS myTable USE INDEX(M2)
     WHERE myTable.M2 > currentM2 ORDER BY M2 ASC LIMIT 2500;

 INSERT INTO Dim3 SELECT * FROM myTable AS myTable USE INDEX(M3)
     WHERE myTable.M3 < currentM3 ORDER BY M3 DESC LIMIT 2500;
 INSERT INTO Dim3 SELECT * FROM myTable AS myTable USE INDEX(M3)
     WHERE myTable.M3 > currentM3 ORDER BY M3 ASC LIMIT 2500;

 INSERT INTO Dim4 SELECT * FROM myTable AS myTable USE INDEX(M4)
     WHERE myTable.M4 < currentM4 ORDER BY M4 DESC LIMIT 2500;
 INSERT INTO Dim4 SELECT * FROM myTable AS myTable USE INDEX(M4)
     WHERE myTable.M4 > currentM4 ORDER BY M4 ASC LIMIT 2500;

It’s important to understand that I am interested in distance by rank, and not by value.

Edit: I'm a little closer to understanding how to do this (I think): I need to pre-process each row of each metric and assign it a percentile that will represent its location (in percent) in its range.

For example, for any given value of M1:

 percentile = (# rows with values less than input)/(# total rows) 

If I calculate the percentile of each input value and use that, instead of the raw value, to find the nearest neighbors, I would effectively put the various indicators on a common scale so that they can be combined into one distance measure.

I'm still lost on how to do the actual search. Can this even be done efficiently in MySQL?
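As a small sketch of that percentile formula in Python: assuming the values of one metric have been loaded and sorted once during pre-processing, a binary search gives the number of rows below the input. The helper name `percentile_rank` is hypothetical.

```python
from bisect import bisect_left

def percentile_rank(sorted_values, value):
    """Fraction of rows whose value is strictly less than `value`."""
    return bisect_left(sorted_values, value) / len(sorted_values)

# The M1 column from the excerpt, sorted once up front.
m1_sorted = sorted([12.9132, 14.056, 15.3613, 16.627501, 18.7843,
                    19.8375, 20.0168, 20.4538, 21.6078, 21.2045])

# Seven of the ten sample rows have M1 below the example input 20.1045.
p = percentile_rank(m1_sorted, 20.1045)
```

Applying the same function to M2, M3 and M4 maps every metric into the range 0 to 1, which is what makes distances across indicators comparable.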

1 answer

You should run a query along these lines:

 SELECT * FROM myTable
 WHERE M1 BETWEEN searchM1 - radiusM1 AND searchM1 + radiusM1
   AND M2 BETWEEN searchM2 - radiusM2 AND searchM2 + radiusM2
   AND M3 BETWEEN searchM3 - radiusM3 AND searchM3 + radiusM3
   AND M4 BETWEEN searchM4 - radiusM4 AND searchM4 + radiusM4

If you want the search region to extend equally in every dimension, as with a sphere, all the radius values will be the same, of course. Then adjust the radius until you get close to the number of records you need. I would suggest a binary search on the radius.
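A rough sketch of that radius search, in Python. Several things here are assumptions made for the illustration: an in-memory SQLite table stands in for the MySQL table, the metrics are taken to be already scaled to the range 0 to 1 (so a radius of 1.0 is guaranteed to cover every row), and the helper names `count_in_box` and `find_radius` are hypothetical.

```python
import random
import sqlite3

def count_in_box(conn, center, radius):
    """Count rows inside the box center +/- radius on every metric."""
    bounds = []
    for c in center:
        bounds += [c - radius, c + radius]
    sql = ("SELECT COUNT(*) FROM myTable "
           "WHERE M1 BETWEEN ? AND ? AND M2 BETWEEN ? AND ? "
           "AND M3 BETWEEN ? AND ? AND M4 BETWEEN ? AND ?")
    return conn.execute(sql, bounds).fetchone()[0]

def find_radius(conn, center, target, lo=0.0, hi=1.0, steps=30):
    """Binary-search for a small radius whose box holds at least `target` rows.

    Assumes the box of radius `hi` already contains at least `target` rows.
    """
    for _ in range(steps):
        mid = (lo + hi) / 2
        if count_in_box(conn, center, mid) >= target:
            hi = mid   # enough rows: try a smaller box
        else:
            lo = mid   # too few rows: grow the box
    return hi

# Stand-in data: 2,000 random rows with each metric in [0, 1).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (M1 REAL, M2 REAL, M3 REAL, M4 REAL)")
random.seed(1)
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?, ?)",
    [tuple(random.random() for _ in range(4)) for _ in range(2000)],
)
center = (0.5, 0.5, 0.5, 0.5)
radius = find_radius(conn, center, target=50)
```

Each step costs one COUNT query, so about 30 round trips narrow the radius to roughly one billionth of the starting range. The rows returned by the final box can then be sorted by true distance and trimmed to the exact number required.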

I'm not sure whether you want to take the distribution into account, but if you do, you just need to give each search value a rank that lies between the ranks of the two table values it falls between. For example, if the value at rank 5 is 5.5, the value at rank 6 is 5.9, and the search value is 5.6, then the search rank can be taken as 5.5.
