To your specific question, "I would like to know how I can use local memory to make it better."
Using local GPU memory can be tricky. If you haven't used it before, you'll need to spend some quality time with the SDK sample code and the programming guide first.
Basically, you use local memory to cache a block of global data (in your case, the model[] array) so that you read it from somewhere faster than global memory. If you want to try it, it looks like this pseudocode:
For each block of the model array {
    1) Read data from __global and write it to __local
    2) Barrier
    3) For each model datum in the __local cache, read it and process it
    4) Barrier
}
Step 3 is basically the loop you have now, except that it only processes one block of the model data at a time instead of the whole thing.
Steps 2 and 4 are absolutely necessary when you use local memory: you must synchronize all the work items in the workgroup. The barrier forces every work item to finish executing the code before the barrier before any of them is allowed to continue with the code after it. This prevents work items from reading data out of local memory before other work items have written it there. The full details are in the OpenCL docs.
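For reference, the barrier is a built-in function in OpenCL C; this is its standard form:

    // No work item in the workgroup proceeds past this call until all of
    // them have reached it; the flag also makes earlier __local writes
    // visible to the whole workgroup.
    barrier(CLK_LOCAL_MEM_FENCE);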
Step 1: each work item must read a different datum from global memory and write it into the local cache.
Something like this (careful, this is simplified and untested!):
    __local float4 modelcache[CACHESIZE];
    int me = get_local_id(0);
    // Assumes the workgroup size equals CACHESIZE and that
    // model_size is a multiple of CACHESIZE.
    for (int j = 0; j < model_size; j += CACHESIZE) {
        // Step 1: each work item copies one element of this block to local memory.
        modelcache[me] = dataset[j + me];
        // Step 2: wait until the whole block has been cached.
        barrier(CLK_LOCAL_MEM_FENCE);
        // Step 3: the loop you already have, restricted to this block.
        for (int i = 0; i < CACHESIZE; ++i) {
            d_xyz = modelcache[i] - local_xyz;
            // ... etc.
        }
        // Step 4: wait before the cache is overwritten in the next iteration.
        barrier(CLK_LOCAL_MEM_FENCE);
    }
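One caveat: the sketch above reads past the end of dataset whenever model_size is not an exact multiple of CACHESIZE. A guarded variant (equally untested; same names as above) would look like:

    for (int j = 0; j < model_size; j += CACHESIZE) {
        // Guard the load: the final block may be shorter than CACHESIZE.
        if (j + me < model_size)
            modelcache[me] = dataset[j + me];
        // The barrier must stay outside the if: every work item has to reach it.
        barrier(CLK_LOCAL_MEM_FENCE);
        // Only process the elements that were actually loaded.
        int limit = min(CACHESIZE, model_size - j);
        for (int i = 0; i < limit; ++i) {
            // ... same per-datum work as before ...
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }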
That leaves the design questions: how big should the local cache be, and how big should the workgroup be?
Local memory is shared by the work items within a workgroup. If your NDRange runs multiple workgroups in parallel, each workgroup has its own copy of the model cache.
If you make the local data arrays too small, you get little or no benefit. If you make them too large, the GPU won't be able to run as many workgroups in parallel, and everything may run much slower.
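If you'd rather measure than guess, the host-side OpenCL API can report the relevant limits. A minimal sketch (standard API calls; `device` and `kernel` are assumed to already exist, and error checking is omitted):

    #include <CL/cl.h>

    cl_ulong local_mem_bytes;
    size_t max_wg_size;

    /* Total __local memory available to one workgroup on this device. */
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem_bytes), &local_mem_bytes, NULL);

    /* Largest workgroup size this particular kernel can be launched with. */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg_size), &max_wg_size, NULL);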
Finally, I have to say that this particular algorithm is unlikely to benefit much from a local memory cache: all work items read the same model[i] at the same time, and most GPUs have memory hardware specifically optimized to do that quickly.
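If you want to exploit that hardware with less effort than a hand-rolled cache, one common alternative is to pass the model in the __constant address space. A sketch (the kernel name and parameters are made up for illustration; it assumes the model fits within CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE):

    // __constant reads that hit the same address across a workgroup are
    // typically broadcast from the constant cache, which is exactly the
    // access pattern here.
    __kernel void score_points(__constant float4 *model,
                               __global const float4 *points,
                               const int model_size)
    {
        float4 local_xyz = points[get_global_id(0)];
        for (int i = 0; i < model_size; ++i) {
            float4 d_xyz = model[i] - local_xyz;
            // ... etc.
        }
    }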