Minimum Size Clustering Algorithm

I have a set of data clusters in groups k , each cluster has a minimum size limit m

I did a few repetitions of the data. So, now I have this set of points, each of which contains one or more of the best clusters, but cannot be switched individually, because it will violate the size limit.

Purpose: to minimize the sum of the distance from each point to the center of its cluster.

Depending on: Minimum cluster size m

I want to find an algorithm for reassigning all points without violating the constraint, while guaranteeing a reduction in the goal.

I thought about using Graph to represent the interconnection between points. But I'm not sure how to reassign, since there is the possibility of a large dense cycle, and I got lost when replacing several points between several clusters.

I also created a list of cluster pairs with possible candidate replacements, but so far I have not been able to find a way for the optimal goal.

I hope I explained my situation. I am new to the algorithm and not familiar with jargon and rules. If any other information is needed, please let me know.

I did a lot of research, I tried the algorithm in this article, but to no avail, since the sum of the degree of membership does not necessarily correlate with the size of the cluster. Size Clustering

I also read other similar posts about SO, but did not find a drill-down algorithm that I could implement.

I tried to build a weighted directed graph, where the vertex representing the clusters and edges from A to B represents the points in cluster A that want to move to cluster B., and the weight is the sum of the points

But according to my data, all nodes are in a huge cycle with very dense edges. Due to my limited experience, I still could not figure out how to reassign among many clusters. Any suggestions are welcome!

Something like that. enter image description here

+5
source share
2 answers

To get a minimal (unfortunately, not minimal) solution:

  • First, eagerly reveal any points that you can without violating the minimum size limit.
  • Then build a directional multigraph as follows:
    • Each cluster becomes a node.
    • An edge (A, B) is added for each point in A, which is closer to the center of B (so there are potentially several edges between the same pair of nodes); its weight should be proportional to the benefit gained by moving it.
  • Searching for cycles in this graph will allow you to find the moves (where the movement consists of moving each vertex in the cycle).
  • Select the cycle with the highest total weight and skip the nodes corresponding to the edges.
  • Repeat steps 1-4 until there are no more cycles.

Creating a graph will have O (kn) complexity, where you have k and n total points, and can create the same amount of verbosity. Tarjan's algorithm will have O (k 2 ) complexity, assuming you are missing multiedges at the same destination in DFS. Each time you eliminate a cycle, you reduce the global distance by a certain amount and remove at least one edge from the graph, so the total time of the algorithm cannot exceed O (k 4 m 2 ). This is quite extravagant; I'm sure there may be a heuristic to improve the lower bound (and probably a more detailed analysis).

+4
source

This issue is addressed in this article:

Bradley, P.S., C.P. Bennett and Ayhan Demiriz. "Limited clustering of k-values." Microsoft Research, Redmond (2000): 1-8.

We propose explicitly adding the constraints of $ k $ to the main problem of clustering optimization, which requires that the cluster $ h $ contain at least $ \ tau_h $ points.

I have an implementation of an algorithm in python.

+1
source

All Articles