Splitting a float array into similar segments (clustering)

I have an array of such floats:

[1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200] 

Now I want to split the array as follows:

 [[1.91, 2.87, 3.61] , [10.91, 11.91, 12.82] , [100.73, 100.71, 101.89] , [200]] 

// [200] will be considered an outlier due to less cluster support

I need to find this segment for multiple arrays, and I don't know what section size should be. I tried to do this using hierarchical clustering (Agglomerative) and it gives me satisfactory results. However, the problem is that I was asked not to use clustering algorithms for a one-dimensional problem, since their theoretical justification (as well as for multidimensional data) does not have for this.

I spent a lot of time to find a solution. However, the sentences look very different: this and this VS. and and.

I found another suggestion, not clustering, i.e. natural gap optimization . However, it is also necessary to declare the partition number, for example, K-means (right?).

This is rather confusing (especially because I have to perform such segmentation on multiple arrays, and it is impossible to find out the optimal number of partitions).

Is there a way to find partitions (so we can reduce the variance within partitions and maximize the difference between partitions) with some theoretical justification?

Any pointers to articles / articles (if available for implementing C / C ++ / Java) with some theoretical justification will be very useful to me.

+8
java c ++ algorithm data-partitioning cluster-analysis
source share
2 answers

I think that I would sort the data (if they were not already) and then accept the related differences. Divide the differences into smaller numbers in which the difference is between getting a percentage change. Set a threshold value, and when the change exceeds this threshold, start a new "cluster".

Edit: Quick demo code in C ++:

 #include <iostream> #include <vector> #include <algorithm> #include <iterator> #include <numeric> #include <functional> int main() { std::vector<double> data{ 1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200 }; // sort the input data std::sort(data.begin(), data.end()); // find the difference between each number and its predecessor std::vector<double> diffs; std::adjacent_difference(data.begin(), data.end(), std::back_inserter(diffs)); // convert differences to percentage changes std::transform(diffs.begin(), diffs.end(), data.begin(), diffs.begin(), std::divides<double>()); // print out the results for (int i = 0; i < data.size(); i++) { // if a difference exceeds 40%, start a new group: if (diffs[i] > 0.4) std::cout << "\n"; // print out an item: std::cout << data[i] << "\t"; } return 0; } 

Result:

 1.91 2.87 3.61 10.91 11.91 12.82 100.71 100.73 101.89 200 
+8
source share

Typically, clustering involves multidimensional data.

If you have one-dimensional data, sort , then use either a kernel density estimate, or just scan the largest spaces.

In 1 dimension, the problem becomes much simpler because the data can be sorted. If you use the clustering algorithm, unfortunately, it will not use it, so use the 1-dimensional method instead!

Consider finding the largest gap in one-dimensional data. This is trivial: sort (n log n, but in practice as fast as it can get), and then look at two adjacent values โ€‹โ€‹for the biggest difference.

Now try to determine the "largest gap" in 2 dimensions and an effective algorithm to find it ...

+2
source share

All Articles