3D Clustering Algorithm

Problem: I have the following problem:

A 3D space contains more than a billion points. The goal is to find the top N points that have the largest number of neighbors within a given distance R. An additional condition is that the distance between any two of these top N points must be greater than R. The distribution of the points is not uniform: some regions of space are much denser than others.

Purpose: To find an algorithm that scales well across many processors and has a small memory footprint.

Thoughts: A regular spatial decomposition is not enough for this kind of problem because of the uneven distribution. An irregular spatial decomposition that divides the points evenly might help here. I would really appreciate it if someone could shed some light on how to solve this problem.

+7
Tags: algorithm, data-partitioning, cluster-analysis, 3d, spatial
5 answers

Use an octree. For 3D data with a limited range of values, it scales very well to huge data sets.

Many of the methods mentioned above, such as locality-sensitive hashing, are approximate approaches designed for much higher dimensionality, where you can no longer split space this way.

Splitting each level into 8 boxes (2^d for d = 3) works very well. Since you can stop subdividing when a cell contains too few points, and build a deeper tree where there are many points, this should meet your requirements quite well.

See Wikipedia for details:

https://en.wikipedia.org/wiki/Octree
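To illustrate the "go deeper only where it is dense" idea, here is a minimal sketch of an adaptive octree that keeps only counts and cell geometry at inner nodes; the class name and parameters (max_points, max_depth) are made up for the example, not a drop-in solution for a billion points.

```python
import numpy as np

class OctreeNode:
    """Cell that stores a point count and subdivides only where it is dense."""
    def __init__(self, points, center, half_size,
                 max_points=1000, depth=0, max_depth=10):
        self.center, self.half_size = center, half_size
        self.count = len(points)
        self.children = []
        if self.count > max_points and depth < max_depth:
            for octant in range(8):                 # 2^d children for d = 3
                sign = np.array([(octant >> k) & 1 for k in range(3)]) * 2 - 1
                child_center = center + sign * half_size / 2.0
                mask = np.all((points >= child_center - half_size / 2.0) &
                              (points <  child_center + half_size / 2.0), axis=1)
                if mask.any():
                    self.children.append(OctreeNode(points[mask], child_center,
                                                    half_size / 2.0, max_points,
                                                    depth + 1, max_depth))

def collect_leaves(node, out):
    """Append (count, center, half_size) for every leaf cell."""
    if not node.children:
        out.append((node.count, node.center, node.half_size))
    for child in node.children:
        collect_leaves(child, out)

# Usage sketch (P is an (n, 3) float array; names are illustrative):
# root = OctreeNode(P, center=(P.max(0) + P.min(0)) / 2,
#                   half_size=(P.max(0) - P.min(0)).max() / 2 * 1.001)  # pad the boundary
# leaves = []; collect_leaves(root, leaves)
# leaves.sort(key=lambda t: t[0], reverse=True)   # densest cells first
```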

Alternatively, you can try building an R-tree. But the R-tree tries to stay balanced, which makes it harder to find the densest areas. For your particular task, this "flaw" of the octree is actually useful! The R-tree goes to great lengths to keep the depth of the tree the same everywhere, so that every point can be found in roughly the same time. But you are only interested in the dense areas, which will be found on the longest paths in the octree, without even looking at the actual points!

+3

I do not have a specific answer for you, but I can suggest an approach that might lead to a solution.

I think it's worth exploring locality-sensitive hashing. Dividing the points evenly and then applying this kind of LSH to each subset should be easily parallelizable. If you design your hash so that the bucket size is defined in terms of R, then for a given set of points divided into buckets, the points that meet your criteria are likely to be found in the fullest buckets.
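A minimal sketch of the bucketing idea, assuming the "hash" is simply the integer grid cell of side R that each point falls into; the function name is illustrative, not part of the answer.

```python
import numpy as np
from collections import Counter

def bucket_counts(points, R):
    """Hash each point to the integer grid cell of side R that contains it,
    and count how many points land in each cell."""
    cells = np.floor(points / R).astype(np.int64)    # (n, 3) integer cell ids
    return Counter(map(tuple, cells))

# The fullest buckets are the first places to look for points with many R-neighbors:
# counts = bucket_counts(chunk_of_points, R)
# candidate_cells = counts.most_common(100)
```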

By doing this locally, you could then use some kind of map-reduce strategy to combine spatial buckets from different parallel runs of the LSH algorithm in stages, exploiting the fact that you can start excluding parts of your problem space by discarding entire buckets. Obviously you need to be careful about clusters that straddle different buckets, but I suspect that at each merge stage you can apply different bucket sizes/offsets to counter this effect (e.g. merge spatially equivalent buckets as well as adjacent buckets). I believe this method can keep the memory requirements small (i.e. you never need to store much more than the points themselves at any time, and you always work with small(ish) subsets).
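A sketch of the merge step under the same assumptions: each parallel worker returns a Counter of cell → count, and the reduce step sums them and keeps only the fullest buckets. The pruning threshold here is a made-up illustration of "discarding entire buckets", not something the answer specifies.

```python
from collections import Counter

def merge_buckets(partial_counts, keep_top=10_000):
    """Reduce step: sum per-worker bucket counts, then keep only the
    fullest buckets so memory stays bounded as shards are merged."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return Counter(dict(total.most_common(keep_top)))
```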

If you are looking for a heuristic, I think this will immediately give you something close to a "good" solution, i.e. a small number of likely points that you can then check against your criteria. If you are looking for the exact answer, you will have to use other methods to prune the search space as you start combining the parallel buckets.

Another thought I had was that this could be related to the metric k-center problem. It is definitely not the same problem, but perhaps some of the methods used in its solutions are applicable here. The catch is that it assumes a metric space in which computing the distance metric is cheap; in your case, however, having a billion points makes any global traversal (for example, sorting pairwise distances) undesirable and impractical. As I said, just a thought, and perhaps a source of further inspiration.

+2

Here are some possible pieces of a solution. There are various choices at each stage, depending on Ncluster, on how fast the data changes, and on what you want to do with the cluster means.

3 steps: quantize, box, K-means.

1) Quantize: reduce the XYZ input coordinates to, say, 8 bits each by taking 2^8 percentiles of X, Y, Z separately. This speeds up the whole pipeline without significant loss of detail. You can sort all 1G points, or just a random 1M sample, to get 8-bit thresholds x0 < x1 < ... < x256, y0 < y1 < ... < y256, z0 < z1 < ... < z256 with roughly 2^(30-8) points in each slice. To map a float X to an 8-bit x, a binary search (unrolled for speed) is fast; see Bentley, Programming Pearls p. 95.
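A minimal sketch of the percentile quantization, using a random sample and numpy's vectorized searchsorted in place of a hand-rolled binary search; the sample size and nbit follow the answer's suggestion, everything else (function names, the rng seed) is illustrative.

```python
import numpy as np

def make_edges(coords, nbit=8, sample=1_000_000, rng=np.random.default_rng(0)):
    """Build per-axis percentile cut points from a random sample of one axis."""
    sub = rng.choice(coords, size=min(sample, len(coords)), replace=False)
    # 2^nbit - 1 interior cut points -> integer codes 0 .. 2^nbit - 1
    return np.quantile(sub, np.linspace(0.0, 1.0, 2**nbit + 1)[1:-1])

def quantize(coords, edges):
    """Map float coordinates to nbit integer codes (vectorized binary search)."""
    return np.searchsorted(edges, coords).astype(np.uint8)

# Usage sketch (P is an (n, 3) float array):
# xe, ye, ze = (make_edges(P[:, i]) for i in range(3))
# codes = np.stack([quantize(P[:, 0], xe),
#                   quantize(P[:, 1], ye),
#                   quantize(P[:, 2], ze)], axis=1)
```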

Added: Kd-trees split any point cloud into boxes of different sizes, each with ~Leafsize points; that is much better than the axis-wise XYZ splitting above. But afaik you would have to roll your own Kd-tree code to split down to only the first, say, 16M boxes, and keep only the counts, not the points.

2) Box: count the number of points in each box [xj .. xj+1, yj .. yj+1, zj .. zj+1]. The average box will have 2^(30 - 3*8) points; the distribution will depend on how clumpy the data is. If some boxes are too big or contain too many points, you can a) split them into 8, or b) track the mean of the points in each box; otherwise just take the box midpoints.
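Given the 8-bit codes from step 1, counting points per box is a single bincount over a packed 24-bit index; a sketch with illustrative names:

```python
import numpy as np

def box_counts(codes, nbit=8):
    """codes: (n, 3) uint8 array from the quantize step.
    Returns a 2^(3*nbit) array of per-box point counts (64 MiB of int32 for nbit=8)."""
    flat = (codes[:, 0].astype(np.int64) << (2 * nbit)) \
         | (codes[:, 1].astype(np.int64) << nbit) \
         | codes[:, 2].astype(np.int64)
    return np.bincount(flat, minlength=2**(3 * nbit)).astype(np.int32)
```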

3) K-means: cluster on the 2^(3*8) box centers. (Google parallel "k means" → 121k hits.) This depends heavily on K aka Ncluster, and also on your radius R. A rough approach would be to build a heap of, say, the 27*Ncluster boxes with the most points, then take the biggest ones subject to the radius constraint. (I like to start with a minimum spanning tree, then remove the K-1 longest links to get K clusters.) See also color quantization.
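For the final selection (before or instead of a full K-means pass), a greedy pick of the fullest boxes subject to the "centers more than R apart" constraint could look like this; the names are illustrative, and box_centers would come from step 2.

```python
import numpy as np

def top_boxes(box_centers, counts, n_cluster, R):
    """Greedy: walk the boxes in decreasing point count and keep a center
    only if it is more than R away from every center already kept."""
    order = np.argsort(counts)[::-1]
    chosen = []
    for i in order:
        c = box_centers[i]
        if all(np.linalg.norm(c - box_centers[j]) > R for j in chosen):
            chosen.append(i)
            if len(chosen) == n_cluster:
                break
    return chosen   # indices of the selected boxes
```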

I would make Nbit, here 8, a parameter from the start.

What is your Ncluster?

Added: if your points move over time, see collision-detection-of-huge-number-of-circles on SO.

+1

I also suggest using an octree. The OctoMap framework is very good at handling huge 3D point clouds. It does not store every point directly, but updates the occupancy density of each node (i.e. of each 3D box). Once the tree is built, you can use a simple iterator to find the node with the highest density. If you want to model the point density or distribution inside the nodes, OctoMap is very easy to adapt.

Here you can see how it was extended to model the distribution of points with a planar model.

+1

Just an idea: build a graph with the given points as vertices and an edge between two points whenever their distance is < R.

Building such a graph is similar in spirit to a spatial decomposition. You can then answer your question with local searches on the graph: first, find the vertices with the highest degree; second, among those high-degree vertices, find a maximal set that is pairwise non-adjacent, so the chosen points are more than R apart.

I think both the graph construction and the searches can be done in parallel. Such an approach may need a lot of memory, though; partitioning the domain and building the graph per small volume can reduce the memory requirements.
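A sketch of this idea for one volume small enough to fit in memory, using scipy's cKDTree so the graph never has to be materialized: the degree of a point is its number of neighbors within R, and the selection is a greedy pass over the highest-degree points. The function name is illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def top_n_separated(points, R, n):
    """Degree = number of other points within R; then greedily keep the
    highest-degree points that are pairwise more than R apart."""
    tree = cKDTree(points)
    degree = np.array([len(nb) - 1                     # subtract the point itself
                       for nb in tree.query_ball_point(points, r=R)])
    chosen = []
    for i in np.argsort(degree)[::-1]:
        if all(np.linalg.norm(points[i] - points[j]) > R for j in chosen):
            chosen.append(i)
            if len(chosen) == n:
                break
    return chosen, degree
```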

0
