Minimum Size Clustering Algorithm

Question

Minimum Size Clustering Algorithm

I have a set of data clusters in groups k , each cluster has a minimum size limit m

I did a few repetitions of the data. So, now I have this set of points, each of which contains one or more of the best clusters, but cannot be switched individually, because it will violate the size limit.

Purpose: to minimize the sum of the distance from each point to the center of its cluster.

Depending on: Minimum cluster size m

I want to find an algorithm for reassigning all points without violating the constraint, while guaranteeing a reduction in the goal.

I thought about using Graph to represent the interconnection between points. But I'm not sure how to reassign, since there is the possibility of a large dense cycle, and I got lost when replacing several points between several clusters.

I also created a list of cluster pairs with possible candidate replacements, but so far I have not been able to find a way for the optimal goal.

I hope I explained my situation. I am new to the algorithm and not familiar with jargon and rules. If any other information is needed, please let me know.

I did a lot of research, I tried the algorithm in this article, but to no avail, since the sum of the degree of membership does not necessarily correlate with the size of the cluster. Size Clustering

I also read other similar posts about SO, but did not find a drill-down algorithm that I could implement.

I tried to build a weighted directed graph, where the vertex representing the clusters and edges from A to B represents the points in cluster A that want to move to cluster B., and the weight is the sum of the points

But according to my data, all nodes are in a huge cycle with very dense edges. Due to my limited experience, I still could not figure out how to reassign among many clusters. Any suggestions are welcome!

Something like that. enter image description here

+5

algorithm cluster-analysis

qshngv May 07 '15 at 10:01

source share

2 answers

This issue is addressed in this article:

Bradley, P.S., C.P. Bennett and Ayhan Demiriz. "Limited clustering of k-values." Microsoft Research, Redmond (2000): 1-8.

We propose explicitly adding the constraints of $ k $ to the main problem of clustering optimization, which requires that the cluster $ h $ contain at least $ \ tau_h $ points.

I have an implementation of an algorithm in python.

+1

Behrouz babaki Feb 10 '17 at 14:11

source share

Kittsil · Accepted Answer · 2015-05-07T22:15:27+0000

To get a minimal (unfortunately, not minimal) solution:

First, eagerly reveal any points that you can without violating the minimum size limit.
Then build a directional multigraph as follows:
- Each cluster becomes a node.
- An edge (A, B) is added for each point in A, which is closer to the center of B (so there are potentially several edges between the same pair of nodes); its weight should be proportional to the benefit gained by moving it.
Searching for cycles in this graph will allow you to find the moves (where the movement consists of moving each vertex in the cycle).
Select the cycle with the highest total weight and skip the nodes corresponding to the edges.
Repeat steps 1-4 until there are no more cycles.

Creating a graph will have O (kn) complexity, where you have k and n total points, and can create the same amount of verbosity. Tarjan's algorithm will have O (k ² ) complexity, assuming you are missing multiedges at the same destination in DFS. Each time you eliminate a cycle, you reduce the global distance by a certain amount and remove at least one edge from the graph, so the total time of the algorithm cannot exceed O (k ⁴ m ² ). This is quite extravagant; I'm sure there may be a heuristic to improve the lower bound (and probably a more detailed analysis).

Minimum Size Clustering Algorithm

More articles: