How can we tell that all the points contained under an internal node of a ball tree belong to one cluster when running the k-means algorithm?

I am currently reading Data Mining: Practical Machine Learning Tools and Techniques, third edition. Section 4.8, Clustering, discusses how to use kd-trees or ball trees to improve the performance of the k-means algorithm.

After building a ball tree over all the data points, the book checks, for the leaf nodes, which of the pre-selected cluster centers each one is closest to. It then notes that sometimes the region represented by a higher, internal node falls entirely within the domain of a single cluster center. In that case we do not need to descend into its child nodes, and all the data points under it can be processed in one go.
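To make the question concrete, here is a minimal sketch of the traversal I have in mind (not the book's code; BallNode and owns_region are hypothetical names used only for illustration). The owns_region test is exactly the part I do not know how to implement:

    # Minimal sketch, assuming Euclidean distance and a simple ball-tree node.
    import numpy as np

    class BallNode:
        def __init__(self, center, radius, children=None, points=None):
            self.center = np.asarray(center, dtype=float)  # center of the bounding hypersphere
            self.radius = float(radius)                    # radius of the bounding hypersphere
            self.children = children or []                 # non-empty for internal nodes
            self.points = points                           # list of points for leaf nodes, else None

    def owns_region(node, centers):
        """The test this question asks about: return the index of the center whose
        domain contains the whole ball, or None if we cannot be sure."""
        raise NotImplementedError

    def assign(node, centers, out):
        """Assign all points under `node` to centers, pruning whole nodes when possible."""
        owner = owns_region(node, centers)   # index of the center owning the whole ball, or None
        if owner is not None:
            out.append((node, owner))        # the whole node is processed in one go
        elif node.points is not None:        # leaf: assign each point to its nearest center
            for p in node.points:
                out.append((p, int(np.argmin([np.linalg.norm(p - c) for c in centers]))))
        else:                                # internal node that cannot be pruned: recurse
            for child in node.children:
                assign(child, centers, out)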

Question: when implementing the data structure and the algorithm, how can we decide whether the region represented by an internal node falls entirely within the domain of a single cluster center?

In two- or three-dimensional space this is not difficult: we can check whether the perpendicular bisector of each pair of cluster centers intersects the region represented by the node.

But how can this be detected in higher-dimensional spaces? Is there a general method for it?

+6
2 answers

You need to consider the maximum and minimum distances.

If the maximum distance from one mean to a spatial container (say, a ball of radius r) is smaller than the minimum distance from every other mean to that container, then all objects inside the container belong to that mean. Because if

 maxdist(mean_i, container) < min over all j != i of mindist(mean_j, container) 

then, in particular, for any object in the container

 dist(mean_i, obj_in_container) < min over all j != i of dist(mean_j, obj_in_container) 

i.e. the object belongs to mean i.

Minimum and maximum distances to spheres and rectangles can be computed trivially in arbitrary dimensions. However, in higher dimensions mindist and maxdist become very similar, and the condition is rarely satisfied. It also matters a great deal whether your tree is well structured (e.g. small containers) or badly structured (overlapping containers).
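As a concrete illustration, here is a small sketch of this criterion for a ball-shaped container, assuming Euclidean distance and the bounds mindist = max(0, d - r) and maxdist = d + r, where d is the distance from a mean to the ball's center (the function names are mine, not from any particular library):

    import numpy as np

    def maxdist(mean, center, radius):
        """Upper bound on the distance from `mean` to any point inside the ball."""
        return np.linalg.norm(np.asarray(mean) - np.asarray(center)) + radius

    def mindist(mean, center, radius):
        """Lower bound on the distance from `mean` to any point inside the ball."""
        return max(0.0, np.linalg.norm(np.asarray(mean) - np.asarray(center)) - radius)

    def owning_mean(means, center, radius):
        """Return i if maxdist(mean_i, ball) < mindist(mean_j, ball) for all j != i, else None."""
        maxd = [maxdist(m, center, radius) for m in means]
        mind = [mindist(m, center, radius) for m in means]
        i = int(np.argmin(maxd))                              # candidate mean
        rival = min(d for j, d in enumerate(mind) if j != i)  # closest competitor's lower bound
        return i if maxd[i] < rival else None

    # A ball deep inside one mean's domain is claimed entirely; a ball near the border is not.
    print(owning_mean([[0.0, 0.0], [10.0, 0.0]], center=[1.0, 0.0], radius=0.5))  # -> 0
    print(owning_mean([[0.0, 0.0], [10.0, 0.0]], center=[5.0, 0.0], radius=2.0))  # -> None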

kd-trees are good for in-memory, read-only operation. For inserts they are not as good; R*-trees are much better there. In addition, the improved split strategy of R*-trees pays off, because it produces more square-like rectangles than other strategies.

+1

I have come up with a solution on my own, but in my opinion it is not very good.
In k-dimensional space, the perpendicular bisector of a pair of points is a hyperplane. We compute it for every pair of cluster centers.
When the problem above comes up, we are looking at a node that represents a region, a hypersphere in space. First, we find the cluster center closest to the center point P of the hypersphere. Then, for each perpendicular bisector (hyperplane) between that center and one of the other centers, we compute the Euclidean distance from P to the hyperplane. If this distance is less than the radius of the hypersphere, the hypersphere intersects that bisector, so the node may contain data points that do not belong to the closest cluster center, and the iteration breaks. If the iteration finishes without a break, we can safely assign all the data points contained in this node to that cluster center.
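A sketch of this test, under the same assumptions (the node is a hypersphere with center p and radius r, distances are Euclidean, and the function names are only illustrative):

    import numpy as np

    def bisector_distance(p, a, b):
        """Distance from point p to the perpendicular bisector hyperplane of centers a and b."""
        p, a, b = (np.asarray(x, dtype=float) for x in (p, a, b))
        n = b - a                                    # normal of the bisector hyperplane
        c = (np.dot(b, b) - np.dot(a, a)) / 2.0      # the hyperplane is {x : n . x = c}
        return abs(np.dot(n, p) - c) / np.linalg.norm(n)

    def node_owner(p, r, centers):
        """Return the index of the closest center if no bisector cuts the hypersphere, else None."""
        centers = np.asarray(centers, dtype=float)
        p = np.asarray(p, dtype=float)
        i = int(np.argmin(np.linalg.norm(centers - p, axis=1)))   # closest center to P
        for j in range(len(centers)):
            if j != i and bisector_distance(p, centers[i], centers[j]) < r:
                return None                          # a bisector intersects the hypersphere: cannot prune
        return i                                     # safe: all points in the node go to center i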

0
