How can I calculate the point between two overlapping linear datasets?

I have two datasets that overlap a bit (see graph below). I need to find a point between these sets where it could be assumed that an unknown data point would belong to a certain category.

If I have a new data point (say 5000 ) and you had to put $$$ on whether it belongs to group A or group B, how can I calculate the point, makes the bet more confident?

See an example dataset and accompanying graph below with an approximate point between these groups (calculated by eye).

 GROUP A [385,515,975,1136,2394,2436,4051,4399,4484,4768,4768,4849,4856,4954,5020,5020,5020,5020,5020,5020,5020,5020,5020,5052,5163,5200,5271,5421,5421,5442,5746,5765,5903,5992,5992,6046,6122,6205,6208,6239,6310,6360,6416,6512,6536,6543,6581,6609,6696,6699,6752,6796,6806,6855,6859,6886,6906,6911,6923,6953,7016,7072,7086,7089,7110,7232,7278,7293,7304,7309,7348,7367,7378,7380,7419,7453,7454,7492,7506,7549,7563,7721,7723,7731,7745,7750,7751,7783,7791,7813,7813,7814,7818,7833,7863,7875,7886,7887,7902,7907,7935,7942,7942,7948,7973,7995,8002,8013,8013,8015,8024,8025,8030,8038,8041,8050,8056,8060,8064,8071,8081,8082,8085,8093,8124,8139,8142,8167,8179,8204,8214,8223,8225,8247,8248,8253,8258,8264,8265,8265,8269,8277,8278,8289,8300,8312,8314,8323,8328,8334,8363,8369,8390,8397,8399,8399,8401,8436,8442,8456,8457,8471,8474,8483,8503,8511,8516,8533,8560,8571,8575,8583,8592,8593,8626,8635,8635,8644,8659,8685,8695,8695,8702,8714,8715,8717,8729,8732,8740,8743,8750,8756,8772,8772,8778,8797,8828,8840,8840,8843,8856,8865,8874,8876,8878,8885,8887,8893,8896,8905,8910,8955,8970,8971,8991,8995,9014,9016,9042,9043,9063,9069,9104,9106,9107,9116,9131,9157,9227,9359,9471] GROUP B [12,16,29,32,33,35,39,42,44,44,44,45,45,45,45,45,45,45,45,45,47,51,51,51,57,57,60,61,61,62,71,75,75,75,75,75,75,76,76,76,76,76,76,79,84,84,85,89,93,93,95,96,97,98,100,100,100,100,100,102,102,103,105,108,109,109,109,109,109,109,109,109,109,109,109,109,110,110,112,113,114,114,116,116,118,119,120,121,122,124,125,128,129,130,131,132,133,133,137,138,144,144,146,146,146,148,149,149,150,150,150,151,153,155,157,159,164,164,164,167,169,170,171,171,171,171,173,174,175,176,176,177,178,179,180,181,181,183,184,185,187,191,193,199,203,203,205,205,206,212,213,214,214,219,224,224,224,225,225,226,227,227,228,231,234,234,235,237,240,244,245,245,246,246,246,248,249,250,250,251,255,255,257,264,264,267,270,271,271,281,282,286,286,291,291,292,292,294,295,299,301,302,304,304,304,304,304,306,308,314,318,329,340,344,345,356,359,363,368,368,371,375,379,386,389,390,392,394,408,418,438,440,456,456,458,460,461,467,491,503,505,508,524,557,558,568,591,609,622,656,665,668,687,705,728,817,839,965,1013,1093,1126,1512,1935,2159,2384,2424,2426,2484,2738,2746,2751,3006,3184,3184,3184,3184,3184,4023,5842,5842,6502,7443,7781,8132,8237,8501] 

Data plot

Array statistics:

  Group A Group B Total Numbers 231 286 Mean 7534.71 575.56 Standard Deviation 1595.04 1316.03 
+7
sorting arrays algorithm machine-learning classification
source share
5 answers

I just want to indicate a different approach using a density estimate.

Given your data, it is easy to insert a smoothed pdf using kernel density estimation . The python code below shows how to use the kde module in scipy .

 from scipy.stats.kde import gaussian_kde from numpy import linspace import matplotlib.pyplot as plt data1 = [385,515,975,1136,2394,2436,4051,4399,4484,4768,4768,4849,4856,4954,5020,5020,5020,5020,5020,5020,5020,5020,5020,5052,5163,5200,5271,5421,5421,5442,5746,5765,5903,5992,5992,6046,6122,6205,6208,6239,6310,6360,6416,6512,6536,6543,6581,6609,6696,6699,6752,6796,6806,6855,6859,6886,6906,6911,6923,6953,7016,7072,7086,7089,7110,7232,7278,7293,7304,7309,7348,7367,7378,7380,7419,7453,7454,7492,7506,7549,7563,7721,7723,7731,7745,7750,7751,7783,7791,7813,7813,7814,7818,7833,7863,7875,7886,7887,7902,7907,7935,7942,7942,7948,7973,7995,8002,8013,8013,8015,8024,8025,8030,8038,8041,8050,8056,8060,8064,8071,8081,8082,8085,8093,8124,8139,8142,8167,8179,8204,8214,8223,8225,8247,8248,8253,8258,8264,8265,8265,8269,8277,8278,8289,8300,8312,8314,8323,8328,8334,8363,8369,8390,8397,8399,8399,8401,8436,8442,8456,8457,8471,8474,8483,8503,8511,8516,8533,8560,8571,8575,8583,8592,8593,8626,8635,8635,8644,8659,8685,8695,8695,8702,8714,8715,8717,8729,8732,8740,8743,8750,8756,8772,8772,8778,8797,8828,8840,8840,8843,8856,8865,8874,8876,8878,8885,8887,8893,8896,8905,8910,8955,8970,8971,8991,8995,9014,9016,9042,9043,9063,9069,9104,9106,9107,9116,9131,9157,9227,9359,9471] data2 = [12,16,29,32,33,35,39,42,44,44,44,45,45,45,45,45,45,45,45,45,47,51,51,51,57,57,60,61,61,62,71,75,75,75,75,75,75,76,76,76,76,76,76,79,84,84,85,89,93,93,95,96,97,98,100,100,100,100,100,102,102,103,105,108,109,109,109,109,109,109,109,109,109,109,109,109,110,110,112,113,114,114,116,116,118,119,120,121,122,124,125,128,129,130,131,132,133,133,137,138,144,144,146,146,146,148,149,149,150,150,150,151,153,155,157,159,164,164,164,167,169,170,171,171,171,171,173,174,175,176,176,177,178,179,180,181,181,183,184,185,187,191,193,199,203,203,205,205,206,212,213,214,214,219,224,224,224,225,225,226,227,227,228,231,234,234,235,237,240,244,245,245,246,246,246,248,249,250,250,251,255,255,257,264,264,267,270,271,271,281,282,286,286,291,291,292,292,294,295,299,301,302,304,304,304,304,304,306,308,314,318,329,340,344,345,356,359,363,368,368,371,375,379,386,389,390,392,394,408,418,438,440,456,456,458,460,461,467,491,503,505,508,524,557,558,568,591,609,622,656,665,668,687,705,728,817,839,965,1013,1093,1126,1512,1935,2159,2384,2424,2426,2484,2738,2746,2751,3006,3184,3184,3184,3184,3184,4023,5842,5842,6502,7443,7781,8132,8237,8501] pdf1 = gaussian_kde(data1) pdf2 = gaussian_kde(data2) x = linspace(0, 9500, 1000) plt.plot(x, pdf1(x),'r') plt.plot(x, pdf2(x),'g') plt.legend(['data1 pdf', 'data2 pdf']) plt.show() 

enter image description here

In the graph green is pdf for the second data set; red is the pdf for the first dataset. It is clear that the boundary of the solution is a vertical line passing through the point where green intersects red.

To find the border numerically, we can do something like below (suppose there is only one intersection, otherwise it does not make sense):

 min_diff = 10000 min_diff_x = -1 for x in linspace(3600, 4000, 400): diff = abs(pdf1(x) - pdf2(x)) if diff < min_diff: min_diff = diff min_diff_x = x print min_diff, min_diff_x 

We found that the border is located at approximately 3762.

If there are several intersections of two PDF files, in order to predict which class the data point x belongs to, we calculate pdf1(x) and pdf2(x) , the maximum is the class that minimizes the risk of bikes. See here for more information on Bayesian risk and an estimate of the probability of prediction errors.

The following is an example that actually includes three PDF files, at any point in the x query, we need to ask three PDF files separately and choose one with the maximum pdf(x) value as the predicted class.

enter image description here

+4
source share

This can be seen as a binary classification problem with a single continuous predictor. You could consider this as a suitable one simple decision tree by finding the threshold value t so that you predict group A when the value> = t.

To do this, you choose t, which minimizes the entropy of the resulting splits. Let's say you have the following calculations for some t:

| | <t | >= t | | Group A | X | Y | | Group B | Z | W |

Entropy <rank - (X / (X + Z)) * log (X / (X + Z)) - (Z / (X + Z)) * log (Z / (X + Z)). Entropy> = splits - (Y / (Y + W)) * log (Y / (Y + W)) - (W / (Y + W)) * log (W / (Y + W)), It looks messier than there is; it's just the sum of -p * log (p) for the fraction p of each group in the split.

You take the weighted average of two, weighted by the total size of the split. Thus, the first term is weighted (X + Z) / (X + Y + Z + W), and the second - (Y + W) / (X + Y + Z + W).

+4
source share

Under reasonable assumptions, a good discriminant is the unique value of the data, which leads to the fact that the area of ​​the probability density B to the left of the separation point is equal to the area A to the right (or vice versa, which gives the same point).

An easy way to find this is to compute the two cumulative empiric distribution functions (CDFs) as shown here , and look for them to ensure dot separation. This is the point at which two CDFs add up to 1.

In short, creating empirical CDFs simply sorts each data set and uses the data as x-axis values. Drawing a curve from left to right, start with y = 0 and take a step 1 / n up for each value of x. Such a curve increases asymptotically from 0 for x <= data 1 to y = CDF (x) = 1 for x> = data [n]. There is a slightly more complicated method that gives a continuous stepwise linear curve, rather than step steps, which for certain assumptions are the best estimate of true CDF. IN

Please note that the discussion above is only an intuition. CDF is beautifully represented by a sorted data array. No new data structure is required; that is, x [i], i = 1,2, ..., n is the value of x at which the curve reaches y = i / n.

With two CDFs, R (x) and B (x), according to your diagram, you want to find a unique point x such that | 1 - R (x) - B (x) | (with a piecewise linear CDF you can always make that zero). This can be done quite simply using binary search.

The good thing about this method is that you can make it dynamic by supporting two CDFs in sorted sets (balanced binary search trees). As you add points, the new split point is easy to find.

For ordered sets, you need "order statistics." Here is the link . By this, I mean that you will need to request a sorted set to get the serial number of any stored x value in the CDF. This can be done using skip lists as well as trees.

I encoded one version of this algorithm. It uses piecewise approximation of CDF, but also allows for “vertical steps” at repeated data points. This complicates the algorithm somewhat, but it is not so bad. Then I used a halving (not a combinatorial binary search) to find the split point. The normal bisection algorithm needs to be modified to accommodate vertical “steps” in the CDF. I think everything is fine with me, but this is slightly verified.

One edge register that is not processed is if the datasets have disjoint ranges. This will find a point between the top of the bottom and bottom of the top, which is a perfectly acceptable discriminator. But you might want to do something more fun, like returning a weighted average.

Please note that if you have a good idea of ​​the actual minimum and maximum values ​​that the data can reach, and they do not appear in the data, you should consider adding them so that the CDFs are not inadvertently biased.

In your example data, the code produces 4184.76, which looks pretty close to the value you selected in the diagram (slightly below half the gap between the min and max data).

Note. I did not sort the data because it already was. Sorting is definitely necessary.

 public class SplitData { // Return: i such that a[i] <= x < a[i+1] if i,i+1 in range // else -1 if x < a[0] // else a.length if x >= a[a.length - 1] static int hi_bracket(double[] a, double x) { if (x < a[0]) return -1; if (x >= a[a.length - 1]) return a.length; int lo = 0, hi = a.length - 1; while (lo + 1 < hi) { int mid = (lo + hi) / 2; if (x < a[mid]) hi = mid; else lo = mid; } return lo; } // Return: i such that a[i-1] < x <= a[i] if i-1,i in range // else -1 if x <= a[0] // else a.length if x > a[a.length - 1] static int lo_bracket(double[] a, double x) { if (x <= a[0]) return -1; if (x > a[a.length - 1]) return a.length; int lo = 0, hi = a.length - 1; while (lo + 1 < hi) { int mid = (lo + hi) / 2; if (x <= a[mid]) hi = mid; else lo = mid; } return hi; } // Interpolate the CDF value for the data a at value x. Returns a range. static void interpolate_cdf(double[] a, double x, double[] rtn) { int lo_i1 = lo_bracket(a, x); if (lo_i1 == -1) { rtn[0] = rtn[1] = 0; return; } int hi_i0 = hi_bracket(a, x); if (hi_i0 == a.length) { rtn[0] = rtn[1] = 1; return; } if (hi_i0 + 1 == lo_i1) { // normal interpolation rtn[0] = rtn[1] = (hi_i0 + (x - a[hi_i0]) / (a[lo_i1] - a[hi_i0])) / (a.length - 1); return; } // we're on a joint or step; return range answer rtn[0] = (double)lo_i1 / (a.length - 1); rtn[1] = (double)hi_i0 / (a.length - 1); assert rtn[0] <= rtn[1]; } // Find the data value where the two given data set empirical CDFs // sum to 1. This is a good discrimination value for new data. // This deals with the case where there a step in either or both CDFs. static double find_bisector(double[] a, double[] b) { assert a.length > 0; assert b.length > 0; double lo = Math.min(a[0], b[0]); double hi = Math.max(a[a.length - 1], b[b.length - 1]); double eps = (hi - lo) * 1e-7; double[] a_rtn = new double[2], b_rtn = new double[2]; while (hi - lo > eps) { double mid = 0.5 * (lo + hi); interpolate_cdf(a, mid, a_rtn); interpolate_cdf(b, mid, b_rtn); if (1 < a_rtn[0] + b_rtn[0]) hi = mid; else if (a_rtn[1] + b_rtn[1] < 1) lo = mid; else return mid; // 1 is included in the interpolated range } return 0.5 * (lo + hi); } public static void main(String[] args) { double split = find_bisector(a, b); System.err.println("Split at x = " + split); } static final double[] a = { 385, 515, 975, 1136, 2394, 2436, 4051, 4399, 4484, 4768, 4768, 4849, 4856, 4954, 5020, 5020, 5020, 5020, 5020, 5020, 5020, 5020, 5020, 5052, 5163, 5200, 5271, 5421, 5421, 5442, 5746, 5765, 5903, 5992, 5992, 6046, 6122, 6205, 6208, 6239, 6310, 6360, 6416, 6512, 6536, 6543, 6581, 6609, 6696, 6699, 6752, 6796, 6806, 6855, 6859, 6886, 6906, 6911, 6923, 6953, 7016, 7072, 7086, 7089, 7110, 7232, 7278, 7293, 7304, 7309, 7348, 7367, 7378, 7380, 7419, 7453, 7454, 7492, 7506, 7549, 7563, 7721, 7723, 7731, 7745, 7750, 7751, 7783, 7791, 7813, 7813, 7814, 7818, 7833, 7863, 7875, 7886, 7887, 7902, 7907, 7935, 7942, 7942, 7948, 7973, 7995, 8002, 8013, 8013, 8015, 8024, 8025, 8030, 8038, 8041, 8050, 8056, 8060, 8064, 8071, 8081, 8082, 8085, 8093, 8124, 8139, 8142, 8167, 8179, 8204, 8214, 8223, 8225, 8247, 8248, 8253, 8258, 8264, 8265, 8265, 8269, 8277, 8278, 8289, 8300, 8312, 8314, 8323, 8328, 8334, 8363, 8369, 8390, 8397, 8399, 8399, 8401, 8436, 8442, 8456, 8457, 8471, 8474, 8483, 8503, 8511, 8516, 8533, 8560, 8571, 8575, 8583, 8592, 8593, 8626, 8635, 8635, 8644, 8659, 8685, 8695, 8695, 8702, 8714, 8715, 8717, 8729, 8732, 8740, 8743, 8750, 8756, 8772, 8772, 8778, 8797, 8828, 8840, 8840, 8843, 8856, 8865, 8874, 8876, 8878, 8885, 8887, 8893, 8896, 8905, 8910, 8955, 8970, 8971, 8991, 8995, 9014, 9016, 9042, 9043, 9063, 9069, 9104, 9106, 9107, 9116, 9131, 9157, 9227, 9359, 9471 }; static final double[] b = { 12, 16, 29, 32, 33, 35, 39, 42, 44, 44, 44, 45, 45, 45, 45, 45, 45, 45, 45, 45, 47, 51, 51, 51, 57, 57, 60, 61, 61, 62, 71, 75, 75, 75, 75, 75, 75, 76, 76, 76, 76, 76, 76, 79, 84, 84, 85, 89, 93, 93, 95, 96, 97, 98, 100, 100, 100, 100, 100, 102, 102, 103, 105, 108, 109, 109, 109, 109, 109, 109, 109, 109, 109, 109, 109, 109, 110, 110, 112, 113, 114, 114, 116, 116, 118, 119, 120, 121, 122, 124, 125, 128, 129, 130, 131, 132, 133, 133, 137, 138, 144, 144, 146, 146, 146, 148, 149, 149, 150, 150, 150, 151, 153, 155, 157, 159, 164, 164, 164, 167, 169, 170, 171, 171, 171, 171, 173, 174, 175, 176, 176, 177, 178, 179, 180, 181, 181, 183, 184, 185, 187, 191, 193, 199, 203, 203, 205, 205, 206, 212, 213, 214, 214, 219, 224, 224, 224, 225, 225, 226, 227, 227, 228, 231, 234, 234, 235, 237, 240, 244, 245, 245, 246, 246, 246, 248, 249, 250, 250, 251, 255, 255, 257, 264, 264, 267, 270, 271, 271, 281, 282, 286, 286, 291, 291, 292, 292, 294, 295, 299, 301, 302, 304, 304, 304, 304, 304, 306, 308, 314, 318, 329, 340, 344, 345, 356, 359, 363, 368, 368, 371, 375, 379, 386, 389, 390, 392, 394, 408, 418, 438, 440, 456, 456, 458, 460, 461, 467, 491, 503, 505, 508, 524, 557, 558, 568, 591, 609, 622, 656, 665, 668, 687, 705, 728, 817, 839, 965, 1013, 1093, 1126, 1512, 1935, 2159, 2384, 2424, 2426, 2484, 2738, 2746, 2751, 3006, 3184, 3184, 3184, 3184, 3184, 4023, 5842, 5842, 6502, 7443, 7781, 8132, 8237, 8501 }; } 
+4
source share

You can calculate the distance Mahalanobis of a new point relative to each set. The most likely match is for which the new point has the lowest distance.

The Mahalanobis distance is a measure of the distance between point P and the distribution D introduced by PK Mahalanobis in 1936. 1 multidimensional generalization of the idea of ​​measuring the number of standard deviations of P from the average value of D. This distance is zero if P is on average from D and grows when P moves away from the average

Since your space is one-dimensional, the calculation should be simplified to:

  • Calculate the standard deviation of each distribution
  • Calculate the average for each distribution
  • For each distribution, calculate how many standard deviations are at a point from the average distribution.
+3
source share

You describe a one-dimensional problem, a statistical classification , in which you look for a “decision boundary”. You have many options to choose from:

  • logistic regression
  • nearest neighbor classifier
  • supporting vector machines
  • multilayer perceptrons
  • ...

but since the problem is simple (one-dimensional, two well-separated classes), and the decision boundary is a rather empty area, I suspect that not a single heavy statistical method will significantly exceed the simple eye principle.

+3
source share

All Articles