Nearest neighbor [OpenCL] using Euclidean distance

I am using OpenCL to find, for each point in one set of 3D points, the nearest point in a second set.

Nearest neighbor: for each point (x, y, z) in the dataset, I need to find the nearest point in the model. Squared distance = (Ax-Bx)^2 + (Ay-By)^2 + (Az-Bz)^2

Here is what I have done so far:

    struct point { int x; int y; int z; };

    __kernel void nearest_neighbour(__global struct point *model,
                                    __global struct point *dataset,
                                    __global int *nearest,
                                    const unsigned int model_size)
    {
        int g_dataset_id = get_global_id(0);
        int dmin = -1;
        int d, dx, dy, dz;

        for (int i = 0; i < model_size; ++i) {
            dx = model[i].x - dataset[g_dataset_id].x;
            dx = dx * dx;
            dy = model[i].y - dataset[g_dataset_id].y;
            dy = dy * dy;
            dz = model[i].z - dataset[g_dataset_id].z;
            dz = dz * dz;
            d = dx + dy + dz;
            if (dmin == -1 || d < dmin) {
                nearest[g_dataset_id] = i;
                dmin = d;
            }
        }
    }

The code works, but I'm sure it can be optimized. I would like to know how I can use local memory to make it better.

thanks

PS: I know there are other (better) methods for nearest-neighbor search, for example a kd-tree, but for now I would like to keep it simple.

+7
7 answers

The compiler probably hoists these loop invariants for you, but to be sure it happens, try this code, which assigns them to temporaries called datum_x , etc. In addition, initializing dmin to INT_MAX avoids the extra comparison with -1 on every iteration. Another approach is to unroll the first iteration of the loop (i=0) to initialize dmin.

    int dmin = INT_MAX;
    int d, dx, dy, dz;
    int datum_x, datum_y, datum_z;
    datum_x = dataset[g_dataset_id].x;
    datum_y = dataset[g_dataset_id].y;
    datum_z = dataset[g_dataset_id].z;

    for (int i = 0; i < model_size; ++i) {
        dx = model[i].x - datum_x;
        dx = dx * dx;
        dy = model[i].y - datum_y;
        dy = dy * dy;
        dz = model[i].z - datum_z;
        dz = dz * dz;
        d = dx + dy + dz;
        if (d < dmin) {
            nearest[g_dataset_id] = i;
            dmin = d;
        }
    }
+4

Perhaps a quick pre-filter can speed things up. Instead of immediately computing the full squared distance, you can first check whether a single coordinate's squared difference already reaches dmin; if it does, the point cannot beat the current minimum. So you could replace your inner loop body with

    {
        dx = model[i].x - datum_x;
        dx = dx * dx;
        if (dx >= dmin) continue;  /* already farther than the current best */
        dy = model[i].y - datum_y;
        dy = dy * dy;
        if (dy >= dmin) continue;
        dz = model[i].z - datum_z;
        dz = dz * dz;
        if (dz >= dmin) continue;
        d = dx + dy + dz;
        if (d < dmin) {
            nearest[g_dataset_id] = i;
            dmin = d;
        }
    }

I'm not sure whether the extra comparisons and branches per point will filter out enough candidates to pay for themselves, but it is a fairly simple change to try.

+2

The first thing that occurred to me relates to the suggestion Heath made. Every work item accesses the same memory element model[i] at the same time. Depending on how good the compiler's optimizer is, it may be better for each work item to access a different element of the array. One way to stagger the accesses:

    int datum_x, datum_y, datum_z;
    datum_x = dataset[g_dataset_id].x;
    datum_y = dataset[g_dataset_id].y;
    datum_z = dataset[g_dataset_id].z;

    for (int i = 0; i < model_size; ++i) {
        int j = (i + g_dataset_id) % model_size;  /* i --> j by cyclic permutation */
        dx = model[j].x - datum_x;
        dx = dx * dx;
        dy = model[j].y - datum_y;
        dy = dy * dy;
        /* and so on... */
    }

However, it is possible that the access to model[i] in your code is treated as a "broadcast" read, in which case my version will run more slowly.

+1

Heath's suggestion can also be applied to the output index: keep the nearest index in a private variable and write it to global memory only once, after the loop.

Instead of a 3-component structure, I would experiment with int4 vectors and use vector operations.

+1

I am sure a lot of time is being spent writing the current minimum to nearest[g_dataset_id] . Global memory access is often very slow, so you are better off keeping the current minimum index in a register, as you already do with dmin = d .

Similar:

    ...
    int dmin = INT_MAX;
    int imin = 0;
    ...
    for (...) {
        ...
        if (d < dmin) {
            imin = i;
            dmin = d;
        }
    }
    nearest[g_dataset_id] = imin;  /* write to global memory only once */
+1

Following Eric Bainville's suggestion, I tried to get rid of the point structure and used float4 instead. Here is what I did:

    __kernel void nearest_neighbour(__global float4 *model,
                                    __global float4 *dataset,
                                    __global unsigned int *nearest,
                                    const unsigned int model_size)
    {
        int g_dataset_id = get_global_id(0);
        float dmin = MAXFLOAT;
        float d;
        float4 local_xyz = dataset[g_dataset_id];  /* cache the point in private memory */
        float4 d_xyz;
        int imin = 0;

        for (int i = 0; i < model_size; ++i) {
            d_xyz = model[i] - local_xyz;
            d_xyz *= d_xyz;
            d = d_xyz.x + d_xyz.y + d_xyz.z;
            if (d < dmin) {
                imin = i;
                dmin = d;
            }
        }
        nearest[g_dataset_id] = imin;  /* write to global memory only once */
    }

The problem is that this version is a bit slower than the one based on the point struct, probably because in the struct version I used a pre-filter:

    dx = model[i].x - local_x;
    dx = dx * dx;
    if (dx >= dmin) continue;
    dy = model[i].y - local_y;
    dy = dy * dy;
    if (dy >= dmin) continue;
    dz = model[i].z - local_z;
    dz = dz * dz;
    if (dz >= dmin) continue;
    d = dx + dy + dz;

I cannot use that kind of early-exit pre-filter with float4, since the whole vector is subtracted at once. In your opinion, is there any other optimization I can make to the float4 version?

Thank you all for your valuable suggestions.

+1

To your specific question, "I would like to know how I can use local memory to make it better."

Using GPU local memory can be tricky. Before tackling the problem, it is worth spending some quality time with the SDK sample code and the vendor's programming guide.

Basically, you use local memory to cache a block of global data β€” in your case, the model[] array β€” so you can read it from local memory, which is much faster than reading it from global memory every time. If you want to try it, it looks like this in pseudocode:

    For each block of the model array {
        1) Read data from __global and write it to __local
        2) Barrier
        3) For each model datum in the __local cache,
           read it and process it
        4) Barrier
    }

Step 3 is basically the loop you have now, except that it processes one block of the model data at a time instead of the whole array.

Steps 2 and 4 are absolutely necessary when you use local memory. You must synchronize all the work items in your work-group. The barrier forces every work item to finish executing the code before the barrier before any of them is allowed to continue past it. This prevents work items from reading data from local memory before other work items have written it there. I don't remember the exact syntax of the barrier instruction, but it is in the OpenCL docs.

Step 1: each work item reads a different element from global memory and writes it into the local cache.

Something like this (careful, this is simplified and untested!):

    /* Assumes work-group size == CACHESIZE and model_size is a multiple of it. */
    __local float4 modelcache[CACHESIZE];
    int me = get_local_id(0);

    for (int j = 0; j < model_size; j += CACHESIZE) {
        modelcache[me] = model[j + me];  /* each work item loads one element */
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int i = 0; i < CACHESIZE; ++i) {
            d_xyz = modelcache[i] - local_xyz;
            /* ... etc. */
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

Then come the design questions: how big should the local cache be, and what should the work-group size be?

Local memory is shared by the work items in a work-group. If your ND-range runs multiple work-groups in parallel, each work-group has its own copy of the model cache.

If you make the local arrays too small, you get little or no benefit. If you make them too large, the GPU cannot run as many work-groups in parallel, and the kernel may actually run much slower.

Finally, I have to say that this particular algorithm is unlikely to benefit much from a local-memory cache. In your program, all work items read the same model[i] at the same time, and most GPUs have hardware specifically optimized to perform that kind of broadcast read quickly.

+1
