Splitting a float array into similar segments (clustering)

Question

Splitting a float array into similar segments (clustering)

I have an array of such floats:

[1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200]

Now I want to split the array as follows:

 [[1.91, 2.87, 3.61] , [10.91, 11.91, 12.82] , [100.73, 100.71, 101.89] , [200]]

// [200] will be considered an outlier due to less cluster support

I need to find this segment for multiple arrays, and I don't know what section size should be. I tried to do this using hierarchical clustering (Agglomerative) and it gives me satisfactory results. However, the problem is that I was asked not to use clustering algorithms for a one-dimensional problem, since their theoretical justification (as well as for multidimensional data) does not have for this.

I spent a lot of time to find a solution. However, the sentences look very different: this and this VS. and and.

I found another suggestion, not clustering, i.e. natural gap optimization . However, it is also necessary to declare the partition number, for example, K-means (right?).

This is rather confusing (especially because I have to perform such segmentation on multiple arrays, and it is impossible to find out the optimal number of partitions).

Is there a way to find partitions (so we can reduce the variance within partitions and maximize the difference between partitions) with some theoretical justification?

Any pointers to articles / articles (if available for implementing C / C ++ / Java) with some theoretical justification will be very useful to me.

+8

java c ++ algorithm data-partitioning cluster-analysis

alessandro Jul 05 '13 at 1:33

source share

2 answers

Typically, clustering involves multidimensional data.

If you have one-dimensional data, sort , then use either a kernel density estimate, or just scan the largest spaces.

In 1 dimension, the problem becomes much simpler because the data can be sorted. If you use the clustering algorithm, unfortunately, it will not use it, so use the 1-dimensional method instead!

Consider finding the largest gap in one-dimensional data. This is trivial: sort (n log n, but in practice as fast as it can get), and then look at two adjacent values for the biggest difference.

Now try to determine the "largest gap" in 2 dimensions and an effective algorithm to find it ...

+2

Anony-mousse Jul 05 '13 at 7:44

source share

Jerry Coffin · Accepted Answer · 2013-07-05T01:57:34+0000

I think that I would sort the data (if they were not already) and then accept the related differences. Divide the differences into smaller numbers in which the difference is between getting a percentage change. Set a threshold value, and when the change exceeds this threshold, start a new "cluster".

Edit: Quick demo code in C ++:

 #include <iostream> #include <vector> #include <algorithm> #include <iterator> #include <numeric> #include <functional> int main() { std::vector<double> data{ 1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200 }; // sort the input data std::sort(data.begin(), data.end()); // find the difference between each number and its predecessor std::vector<double> diffs; std::adjacent_difference(data.begin(), data.end(), std::back_inserter(diffs)); // convert differences to percentage changes std::transform(diffs.begin(), diffs.end(), data.begin(), diffs.begin(), std::divides<double>()); // print out the results for (int i = 0; i < data.size(); i++) { // if a difference exceeds 40%, start a new group: if (diffs[i] > 0.4) std::cout << "\n"; // print out an item: std::cout << data[i] << "\t"; } return 0; }

Result:

 1.91 2.87 3.61 10.91 11.91 12.82 100.71 100.73 101.89 200

Splitting a float array into similar segments (clustering)

More articles: