Clustering 2d integer coordinates into sets of at most N points

I have several points on a relatively small 2-dimensional grid that wraps around in both dimensions. The coordinates are integers. I need to divide the points into sets of no more than N points that are close together, where N will be a fairly small cut-off, I suspect no more than 10.

I am developing an AI for a game, and I'm 99% sure that running minimax over all the game pieces will give me a useful lookahead of about one turn, if that. However, distant parts of the game shouldn't influence each other until we look ahead a large number of moves, so I want to split the game into several sub-games of N pieces at a time. I need to make sure I select a reasonable N pieces at a time, though, i.e. ones that are close to each other.

I don't care whether outliers are left on their own or lumped in with their least-distant cluster. Breaking up natural clusters larger than N is inevitable, and only needs to be done reasonably. Since this is used in a game AI with a limited response time, I'm looking for as fast an algorithm as possible, and am willing to trade accuracy for performance.

Does anyone have suggestions for algorithms to look at adapting? K-means and its relatives don't seem appropriate, as I don't know how many clusters I want to find, but I do have a bound on how large I want the clusters to be. I've seen some evidence that approximating a solution by snapping points to a grid can help some clustering algorithms, so I'm hopeful that the integer coordinates make the problem easier. Hierarchical distance-based clustering will be easy to adapt to the wrapped coordinates, as I just plug in a different distance function, and it's also relatively easy to cap the cluster size. Are there any other ideas I should look at?

I'm more interested in algorithms than libraries, although libraries with good documentation on how they work will be welcome.

EDIT: I originally asked this question while working on an entry for the Fall 2011 AI Challenge, which I unfortunately never finished. The page I linked to has a reasonably short, reasonably high-level description of the game.

Two key points:

  • Each player has a potentially large number of ants.
  • Every ant is given orders every turn, moving 1 square either north, south, east or west; this means the branching factor of the game is O(4^ants).

The competition also had strict time limits on each bot's turn. I had thought to approach the game by using minimax (the turns are actually simultaneous, but as a heuristic I thought this would be okay), but I feared there wouldn't be time to look ahead very many moves if I considered the whole game at once. But as each ant moves only one square per turn, two ants that are N squares apart by the shortest route cannot interfere with each other until we've looked ahead N/2 moves.

So the solution I was looking for was a good way to pick smaller groups of ants at a time and run minimax on each group separately. I had hoped this would allow me to search deeper into the move tree without losing much accuracy. But obviously there's no point in using a very expensive clustering algorithm as a time-saving heuristic!

I'm still interested in the answer to this question, though more for what I can learn from the techniques than for this particular contest, since it's over! Thanks for all the answers.

+7
5 answers

The median cut algorithm is very simple to implement in 2D and would work well here. Your outliers would end up as clusters of size 1, which you could drop or whatever.

Further explanation: Median cut is a quantization algorithm, but all quantization algorithms are special-case clustering algorithms. In this case the algorithm is extremely simple: find the smallest bounding box containing all the points, split the box along its longest side (and shrink it to fit the points), and repeat until the target number of boxes is reached.
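For concreteness, here is a minimal Python sketch of that idea (names and structure are my own); rather than stopping at a fixed number of boxes, it keeps splitting until every box holds at most N points, which matches the size cap in the question:

```python
def median_cut(points, max_size):
    """Partition `points` (a list of (x, y) tuples) into clusters of
    at most max_size points each, by repeated median-cut splitting."""
    boxes = [list(points)]
    done = []
    while boxes:
        box = boxes.pop()
        if len(box) <= max_size:
            done.append(box)
            continue
        xs = [p[0] for p in box]
        ys = [p[1] for p in box]
        # Split along the longer side of the bounding box.
        axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
        box.sort(key=lambda p: p[axis])
        # Cut at the median point, so each half shrinks to fit its points.
        mid = len(box) // 2
        boxes.append(box[:mid])
        boxes.append(box[mid:])
    return done
```

Each split halves the point count, so the recursion depth is logarithmic and the whole thing is dominated by the sorting.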

A more detailed description and a coded example

The color quantization wiki page has good visuals and links

+6

Since you are writing a game where (I presume) only a limited number of pieces move between each clustering, you can take advantage of an online clustering algorithm to get very fast update times.

The property of not locking yourself into a fixed number of clusters is called nonstationary, I believe.

This paper has an efficient algorithm with both of those properties: Improving the Robustness of "Online Agglomerative Clustering Method" Based on Kernel-Induced Distance Measures (you can also find it elsewhere).

Here is a nice video showing the algorithm in action.

+4

Build the graph G = (V, E) over your grid and partition it. Since you're interested in algorithms rather than libraries, here is a recent paper:

Daniel Delling, Andrew V. Goldberg, Ilya Razenshteyn, and Renato F. Werneck. Graph Partitioning with Natural Cuts. In 25th International Parallel and Distributed Processing Symposium (IPDPS'11). IEEE Computer Society, 2011. [PDF]

From the text:

The goal of the graph partitioning problem is to find a minimum-cost partition P such that the size of each cell is bounded by U.

So you set U = 10.

+1

You could compute the minimum spanning tree and remove the longest edges. Then you can compute the k-means. Remove another long edge and compute the k-means again. Rinse and repeat until you have N = 10. I believe this algorithm is called single-linkage k-clustering, and the clusters look like Voronoi diagrams:

"The single-channel k-clustering algorithm ... is exactly the Kruskal algorithm ... equivalent to finding MST and removing the k-1 most expensive edges."

See here: https://stats.stackexchange.com/questions/1475/visualization-software-for-clustering
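As a sketch of how this idea adapts to the size cap (Python, my own naming; instead of deleting the k − 1 longest edges for a fixed k, this variant runs Kruskal's algorithm but refuses any merge that would push a cluster above N):

```python
def mst_clusters(points, max_size):
    """Kruskal-style single-linkage clustering of (x, y) tuples,
    capped at max_size points per cluster."""
    n = len(points)
    parent = list(range(n))
    size = [1] * n

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by squared Euclidean distance
    # (O(n^2) edges, fine for the small point counts in the question).
    edges = sorted(
        ((a, b) for a in range(n) for b in range(a + 1, n)),
        key=lambda e: (points[e[0]][0] - points[e[1]][0]) ** 2
                    + (points[e[0]][1] - points[e[1]][1]) ** 2,
    )
    for a, b in edges:
        ra, rb = find(a), find(b)
        # Merge only if it keeps the cluster within the size cap.
        if ra != rb and size[ra] + size[rb] <= max_size:
            parent[ra] = rb
            size[rb] += size[ra]

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(points[i])
    return list(clusters.values())
```

For the wrapped grid in the question, the distance key would be swapped for the toroidal shortest distance; the rest is unchanged.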

+1

Consider the case where you only need two clusters. If you run k-means, you get two centres, and the boundary between the two clusters is the plane orthogonal to the line between those centres. You can find out which cluster a point is in by projecting it onto the line and then comparing its position along the line with a threshold (e.g. take the dot product between the direction of the line and the vector from either of the two cluster centres to the point).

For two clusters, this means you can adjust the sizes of the clusters by moving the threshold. You can sort the points by their distance along the line joining the two cluster centres, and then it's quite easy to move the threshold along that line, trading off how even the split is against how tight the clusters are.

You probably don't have k = 2, but you can do this hierarchically, by splitting into two clusters and then splitting each of those.
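A rough Python sketch of the projection-and-threshold split, assuming cluster centres a and b have already come from a k = 2 k-means run (helper names are mine):

```python
def split_by_projection(points, a, b, max_size):
    """Sort points by their projection onto the line from a to b, then
    cut at the middle, nudged so neither side exceeds max_size."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    # Dot product with (b - a): each point's position along the line.
    ordered = sorted(points, key=lambda p: p[0] * dx + p[1] * dy)
    cut = len(ordered) // 2
    # Clamp the cut so both halves respect the size cap (assumes
    # len(points) <= 2 * max_size, i.e. a feasible two-way split).
    cut = max(len(ordered) - max_size, min(cut, max_size))
    return ordered[:cut], ordered[cut:]
```

Here the cut is simply nudged toward an even split; a better version would slide the cut along the sorted projections and score each candidate threshold as described below the fold.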

(After the comment)

I'm not much good at pictures, but here is some relevant algebra.

With k-means, we assign points according to their distance from the cluster centres, so for a point X and two centres A and B we are interested in the sign of

SUM_i (Xi - Ai)^2 - SUM_i (Xi - Bi)^2

which is SUM_i Ai^2 - SUM_i Bi^2 + 2 SUM_i (Bi - Ai) Xi

So a point is assigned to one cluster or the other depending on the sign of K + 2(B − A)·X: a constant plus the dot product between the point and the vector joining the two cluster centres. In two dimensions, the boundary between the points of the plane that land in one cluster and the points that land in the other is a line perpendicular to the line between the two cluster centres. What I'm suggesting is that, to control the number of points on each side of the split, you compute (B − A)·X for each point X, and then choose a threshold that divides all the points in one cluster from all the points in the other. This amounts to sliding the dividing line up or down the line between the two cluster centres, keeping it perpendicular to the line between them.

When you have the dot products Yi, where Yi = SUM_j (Bj − Aj) Xij, a measure of how tightly a cluster is grouped is SUM_i (Yi − Ym)^2, where Ym is the mean of the Yi within that cluster. I'm suggesting you use the sum of these values over the two clusters to tell how good a split is. To shift a point into or out of a cluster and get the new sum of squares without recomputing everything from scratch, note that SUM_i (Si + T)^2 = SUM_i Si^2 + 2T SUM_i Si + nT^2 for n points, so if you keep track of sums and sums of squares you can work out what happens to the sum of squares when you add or remove a point, as the mean of the cluster changes when a point is added or removed.
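As a sketch of that bookkeeping (Python, my own naming): keeping the per-cluster count, sum, and sum of squares of the projections Yi gives the scatter SUM_i (Yi − Ym)^2 in O(1) per update, via the identity SUM_i (Yi − Ym)^2 = SUM_i Yi^2 − (SUM_i Yi)^2 / n.

```python
class ProjectedCluster:
    """Incremental scatter of projected values Y_i for one cluster."""

    def __init__(self):
        self.n = 0
        self.s = 0.0   # sum of projections
        self.ss = 0.0  # sum of squared projections

    def add(self, y):
        self.n += 1
        self.s += y
        self.ss += y * y

    def remove(self, y):
        self.n -= 1
        self.s -= y
        self.ss -= y * y

    def scatter(self):
        """SUM_i (Y_i - mean)^2, i.e. how tightly the cluster is grouped."""
        if self.n == 0:
            return 0.0
        return self.ss - self.s * self.s / self.n
```

Sliding the threshold then means moving one point at a time from one ProjectedCluster to the other and picking the threshold that minimizes the sum of the two scatters, subject to the size constraint.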

+1
