I am developing automated trading software for the forex market. In a MySQL database, I have five-minute market data. For each data point I have 4 different indicators along with price and time.
[Time|Price|M1|M2|M3|M4] x ~400,000
Time is the primary key, and M1 through M4 are different indicators (for example, standard deviation or moving average slope).
Here is a real example (excerpt):
+------------+--------+-----------+--------+-----------+-----------+
| Time       | Price  | M1        | M2     | M3        | M4        |
+------------+--------+-----------+--------+-----------+-----------+
| 1105410300 | 1.3101 | 12.9132   | 0.4647 | 29.6703   | 50        |
| 1105410600 | 1.3103 | 14.056    | 0.5305 | 29.230801 | 50        |
| 1105410900 | 1.3105 | 15.3613   | 0.5722 | 26.8132   | 25        |
| 1105411200 | 1.3106 | 16.627501 | 0.4433 | 24.395599 | 26.47059  |
| 1105411500 | 1.3112 | 18.7843   | 1.0019 | 24.505501 | 34.375    |
| 1105411800 | 1.3111 | 19.8375   | 0.5626 | 20        | 32.8125   |
| 1105412100 | 1.3105 | 20.0168   | 0.6718 | 9.7802    | 23.4375   |
| 1105412400 | 1.3105 | 20.4538   | 0.8943 | 7.033     | 23.4375   |
| 1105412700 | 1.3109 | 21.6078   | 0.4902 | 11.7582   | 29.6875   |
| 1105413000 | 1.3104 | 21.2045   | 1.565  | 8.6813    | 21.875    |
+------------+--------+-----------+--------+-----------+-----------+
...400k more
Given input values for M1, M2, M3 and M4, I want to (quickly and accurately) find the 5,000 closest matches.
Input Example:
+------------+--------+-----------+--------+-----------+-----------+
| Time       | Price  | M1        | M2     | M3        | M4        |
+------------+--------+-----------+--------+-----------+-----------+
| 1205413000 | 1.4212 | 20.1045   | 1.0012 | 9.1013    | 11.575    |
+------------+--------+-----------+--------+-----------+-----------+
I realized that each of these metrics can be considered a “dimension”, and that I can do a nearest-neighbor search to find the closest data points in this multidimensional space.
It seems that the easiest way to do this is to iterate over each data point and measure the multidimensional distance to my input point; but speed matters!
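To make the brute-force idea concrete, here is a minimal sketch in Python (run outside MySQL, on rows already loaded into memory). The function name `k_nearest_brute_force` and the row layout `(time, m1, m2, m3, m4)` are my own assumptions for illustration, and it uses plain Euclidean distance on the raw values, which is probably not the right metric when the indicators are on different scales.

```python
import heapq
import math

def k_nearest_brute_force(rows, query, k):
    """Return the k rows closest to `query`, nearest first.

    rows  -- iterable of (time, m1, m2, m3, m4) tuples (assumed layout)
    query -- (m1, m2, m3, m4) tuple for the incoming data point
    """
    def distance(row):
        # Euclidean distance over the four indicator columns only.
        return math.dist(row[1:], query)

    # nsmallest scans every row once and keeps only the k best seen so far.
    return heapq.nsmallest(k, rows, key=distance)

# A few rows from the excerpt above (Price omitted, it is not a dimension).
rows = [
    (1105410300, 12.9132, 0.4647, 29.6703, 50.0),
    (1105412100, 20.0168, 0.6718, 9.7802, 23.4375),
    (1105412400, 20.4538, 0.8943, 7.033, 23.4375),
    (1105413000, 21.2045, 1.565, 8.6813, 21.875),
]
query = (20.1045, 1.0012, 9.1013, 11.575)
nearest = k_nearest_brute_force(rows, query, k=2)
```

Every query touches all ~400,000 rows, which is exactly the cost I would like to avoid.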
I read about something called KD Trees used for this purpose. Can someone explain, or point me to some material that explains, how to implement this in MySQL?
It may be appropriate to mention that I can pre-process the table, but the input is received in real time.
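As far as I understand it, the pre-processing step would build the tree once and each real-time input would then only query it. Below is a rough in-memory sketch of that idea in Python, not in MySQL; the names `build_kdtree` and `k_nearest` are hypothetical, the points are bare `(M1, M2, M3, M4)` tuples, and the distance is squared Euclidean on raw values. I also suspect the pruning helps less when k is as large as 5,000.

```python
import heapq

def build_kdtree(points, depth=0):
    """Pre-processing: build a k-d tree; each node is (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through M1..M4
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # the median splits the data in half
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def k_nearest(node, query, k, depth=0, heap=None):
    """Query: return a max-heap of (-squared_distance, point) with the k nearest."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    point, left, right = node
    dist2 = sum((a - b) ** 2 for a, b in zip(point, query))
    if len(heap) < k:
        heapq.heappush(heap, (-dist2, point))
    elif dist2 < -heap[0][0]:
        # Closer than the current k-th best: replace the worst candidate.
        heapq.heapreplace(heap, (-dist2, point))
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    k_nearest(near, query, k, depth + 1, heap)
    # Visit the far side only if the splitting plane is closer than the
    # current k-th best distance (or fewer than k points were found so far).
    if len(heap) < k or diff ** 2 < -heap[0][0]:
        k_nearest(far, query, k, depth + 1, heap)
    return heap

# (M1, M2, M3, M4) for the ten rows of the excerpt above.
points = [
    (12.9132, 0.4647, 29.6703, 50.0),
    (14.056, 0.5305, 29.230801, 50.0),
    (15.3613, 0.5722, 26.8132, 25.0),
    (16.627501, 0.4433, 24.395599, 26.47059),
    (18.7843, 1.0019, 24.505501, 34.375),
    (19.8375, 0.5626, 20.0, 32.8125),
    (20.0168, 0.6718, 9.7802, 23.4375),
    (20.4538, 0.8943, 7.033, 23.4375),
    (21.6078, 0.4902, 11.7582, 29.6875),
    (21.2045, 1.565, 8.6813, 21.875),
]
query = (20.1045, 1.0012, 9.1013, 11.575)

tree = build_kdtree(points)                       # done once, offline
found = k_nearest(tree, query, k=3)               # done per real-time input
nearest_points = [p for _, p in sorted((-d, p) for d, p in found)]
```

What I cannot see is how to express this tree structure and its traversal in MySQL tables and queries.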
Currently, I'm just doing a rough cluster around the data for each dimension independently:
INSERT INTO Dim1 SELECT * FROM myTable AS myTable USE INDEX(M1) WHERE myTable.M1 < currentM1 ORDER BY M1 DESC LIMIT 2500;
INSERT INTO Dim1 SELECT * FROM myTable AS myTable USE INDEX(M1) WHERE myTable.M1 > currentM1 ORDER BY M1 ASC LIMIT 2500;

INSERT INTO Dim2 SELECT * FROM myTable AS myTable USE INDEX(M2) WHERE myTable.M2 < currentM2 ORDER BY M2 DESC LIMIT 2500;
INSERT INTO Dim2 SELECT * FROM myTable AS myTable USE INDEX(M2) WHERE myTable.M2 > currentM2 ORDER BY M2 ASC LIMIT 2500;

INSERT INTO Dim3 SELECT * FROM myTable AS myTable USE INDEX(M3) WHERE myTable.M3 < currentM3 ORDER BY M3 DESC LIMIT 2500;
INSERT INTO Dim3 SELECT * FROM myTable AS myTable USE INDEX(M3) WHERE myTable.M3 > currentM3 ORDER BY M3 ASC LIMIT 2500;

INSERT INTO Dim4 SELECT * FROM myTable AS myTable USE INDEX(M4) WHERE myTable.M4 < currentM4 ORDER BY M4 DESC LIMIT 2500;
INSERT INTO Dim4 SELECT * FROM myTable AS myTable USE INDEX(M4) WHERE myTable.M4 > currentM4 ORDER BY M4 ASC LIMIT 2500;
It’s important to understand that I am interested in distance by rank, and not by value.
Edit: I'm a little closer to understanding how to do this (I think): I need to pre-process each row of each metric and assign it a percentile that will represent its location (in percent) in its range.
For example, for any given value of M1 :
percentile = (
If I calculate the percentile of the input and use it to find the nearest neighbors instead of the actual value, I would effectively scale the various indicators so that they can be used as comparable dimensions.
I'm still lost on how to do the actual search. Is it even possible to do this efficiently in MySQL?
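To illustrate the scaling idea, here is a small Python sketch of one possible rank-based percentile. The definition used here (the share of stored values strictly below the given value) is an assumption chosen for the example, and `percentile_ranker` is a hypothetical helper name.

```python
from bisect import bisect_left

def percentile_ranker(values):
    """Return a function mapping a value to its percentile rank (0-100).

    Assumed definition: percentage of stored values strictly below the value.
    """
    ordered = sorted(values)               # pre-processing: sort the column once
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one value to rank against")

    def rank(x):
        # bisect_left counts how many stored values are < x in O(log n).
        return 100.0 * bisect_left(ordered, x) / n

    return rank

# M1 column from the ten-row excerpt above.
m1_values = [12.9132, 14.056, 15.3613, 16.627501, 18.7843,
             19.8375, 20.0168, 20.4538, 21.6078, 21.2045]
m1_rank = percentile_ranker(m1_values)

# Percentile of the input's M1 within the stored M1 values.
input_m1_percentile = m1_rank(20.1045)
```

With one such ranker per indicator, both the stored rows and the incoming point would live in the same 0 to 100 range on every dimension, so a distance between them compares ranks rather than raw values.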