How to structure this brute-force OpenCL code

I'm just starting to play with OpenCL, and I'm stuck on how to structure the program reasonably efficiently (mainly by avoiding lots of data transfer to/from the GPU, or wherever the work ends up being done).

What I'm trying to do is this: given

v = r*i + g*j + b*k

where I know v for lots of different values of r, g and b, but i, j and k are unknown, I want to work out reasonable values for i/j/k by brute force.

In other words, I have a bunch of "raw" RGB pixel values, and I have a desaturated version of those colors. I do not know the weights (i/j/k) that were used to calculate the desaturated values.

My initial plan was as follows (there's a rough sketch of this structure just after the list):

  • load the pixel data into CL buffers (the input r/g/b values, plus the expected output values)

  • have a kernel that takes the three candidate matrix values plus the various pixel-data buffers.

    It then computes v = r*i + g*j + b*k, subtracts that from the known value, and stores the difference in a "score" buffer

  • a second kernel then calculates the RMS error over the "score" buffer (if the difference is zero for all input values, the i/j/k values are "correct")
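
To make the plan above concrete, here is roughly what I have in mind, sketched with PyOpenCL rather than the PyCL the linked code actually uses. The kernel source, argument order and names (out_score, score_buf and so on) are just illustrative assumptions, and the RMS at the end is still computed serially on the host, which is exactly the slow part I describe below:

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags

    # Some stand-in pixel data (the real code would load its own).
    n = 16 * 16
    in_r = np.random.rand(n).astype(np.float32)
    in_g = np.random.rand(n).astype(np.float32)
    in_b = np.random.rand(n).astype(np.float32)
    expected = (0.2 * in_r + 0.7 * in_g + 0.1 * in_b).astype(np.float32)

    # Four read-only buffers, copied to the device once.
    r_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=in_r)
    g_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=in_g)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=in_b)
    e_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=expected)

    # One score value per pixel, for a single candidate (i, j, k).
    score = np.empty(n, dtype=np.float32)
    score_buf = cl.Buffer(ctx, mf.WRITE_ONLY, score.nbytes)

    prg = cl.Program(ctx, """
    __kernel void score(__global const float *r, __global const float *g,
                        __global const float *b, __global const float *expected,
                        __global float *out_score, float i, float j, float k)
    {
        int gid = get_global_id(0);
        float v = r[gid] * i + g[gid] * j + b[gid] * k;
        float d = v - expected[gid];
        out_score[gid] = d * d;          /* squared error for this pixel */
    }
    """).build()

    # Score one candidate matrix, then compute the RMS serially on the host.
    prg.score(queue, (n,), None, r_buf, g_buf, b_buf, e_buf, score_buf,
              np.float32(0.3), np.float32(0.6), np.float32(0.1))
    cl.enqueue_copy(queue, score, score_buf)
    rms = np.sqrt(score.sum() / n)   # this serial step is the bottleneck below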

I have this working (written using Python and PyCL, the code is here), but I'm wondering how I can parallelize more of the work, i.e. score several candidate i/j/k matrices at once.

As I see it, I can have 4 read-only buffers (3 for the input values, 1 for the expected values), but I then need a separate "score" buffer for every i/j/k combination being tried.

Another problem is that the RMS calculation is the slowest part, since it is effectively single-threaded (sum all the values in "score", then sqrt() the total).
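
Just to spell out what that slow step involves: only the summation loop needs parallelizing, since the divide and sqrt at the end are constant-time. A plain-Python sketch (the names are mine):

    import math

    def rms_from_scores(score):
        # `score` holds one squared error per pixel.
        total = 0.0
        for s in score:          # this O(n) serial loop is the actual bottleneck
            total += s
        return math.sqrt(total / len(score))   # O(1) once the sum is known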

Basically, I wonder if there is a reasonable way to structure such a program.

This seems like a task well suited to OpenCL, but hopefully the description of my goal wasn't too confusing! As mentioned, my current code is here, and in case it's clearer, this is the pure-Python version of what I am trying to do:

    import sys
    import math
    import random

    def make_test_data(w = 128, h = 128):
        in_r, in_g, in_b = [], [], []

        print "Make raw data"
        for x in range(w):
            for y in range(h):
                in_r.append(random.random())
                in_g.append(random.random())
                in_b.append(random.random())

        # the unknown values
        mtx = [random.random(), random.random(), random.random()]
        print "Secret numbers were: %s" % mtx

        out_r = [(r*mtx[0] + g*mtx[1] + b*mtx[2]) for (r, g, b) in zip(in_r, in_g, in_b)]

        return {'in_r': in_r, 'in_g': in_g, 'in_b': in_b,
                'expected_r': out_r}

    def score_matrix(ir, ig, ib, expected_r, mtx):
        ms = 0
        for i in range(len(ir)):
            val = ir[i] * mtx[0] + ig[i] * mtx[1] + ib[i] * mtx[2]
            ms += abs(val - expected_r[i]) ** 2
        rms = math.sqrt(ms / float(len(ir)))
        return rms

    # Make random test data
    test_data = make_test_data(16, 16)

    lowest_rms = sys.maxint
    closest = []

    divisions = 10
    for possible_r in range(divisions):
        for possible_g in range(divisions):
            for possible_b in range(divisions):
                pr, pg, pb = [x / float(divisions-1) for x in (possible_r, possible_g, possible_b)]
                rms = score_matrix(
                    test_data['in_r'], test_data['in_g'], test_data['in_b'],
                    test_data['expected_r'],
                    mtx = [pr, pg, pb])
                if rms < lowest_rms:
                    closest = [pr, pg, pb]
                    lowest_rms = rms

    print closest
2 answers

Are i, j, k independent? I'll assume they are. Two things are hurting your performance:

  • launching too many small kernels
  • using global memory to communicate between score_matrix and rm_to_rms

You can rewrite both kernels as a single one, with the following changes:

  • have each OpenCL work-group work on a different i, j, k combination - you can pre-generate these on the CPU
  • to do that, each thread has to process several elements of the array; you can do it like this:

      int i = get_local_id(0);
      float my_sum = 0;
      /* stride loop: each work-item accumulates several pixels */
      for (; i < array_size; i += get_local_size(0)) {
          float val = in_r[i] * mtx_r + in_g[i] * mtx_g + in_b[i] * mtx_b;
          my_sum += pow(fabs(expect_r[i] - val), 2);
      }
  • after that, each thread writes its my_sum into local memory, and you sum those partial results with a reduction algorithm (O(log(n)) steps)

  • store the result in global memory (there's a fuller sketch of this combined kernel just after this list)
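
Here is a sketch of what that combined kernel and its launch could look like, with one work-group per candidate matrix and a PyOpenCL host side to show the launch geometry. All of the names are my own, and it assumes the work-group size is a power of two; treat it as an illustration of the structure rather than drop-in code:

    import numpy as np
    import pyopencl as cl

    KERNEL_SRC = """
    __kernel void score_all(__global const float *in_r, __global const float *in_g,
                            __global const float *in_b, __global const float *expect_r,
                            __global const float *mtx_r, __global const float *mtx_g,
                            __global const float *mtx_b, __global float *sq_err_sums,
                            __local float *partial, int array_size)
    {
        int group = get_group_id(0);   /* one work-group per candidate matrix */
        int lid   = get_local_id(0);
        int lsize = get_local_size(0);
        float mr = mtx_r[group], mg = mtx_g[group], mb = mtx_b[group];

        /* stride loop: each work-item accumulates several pixels */
        float my_sum = 0.0f;
        for (int i = lid; i < array_size; i += lsize) {
            float val = in_r[i] * mr + in_g[i] * mg + in_b[i] * mb;
            float d = val - expect_r[i];
            my_sum += d * d;
        }

        /* tree reduction in local memory: O(log(lsize)) steps */
        partial[lid] = my_sum;
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int offset = lsize / 2; offset > 0; offset /= 2) {
            if (lid < offset)
                partial[lid] += partial[lid + offset];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            sq_err_sums[group] = partial[0];  /* one squared-error total per candidate */
    }
    """

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags

    n = 16 * 16
    in_r, in_g, in_b = (np.random.rand(n).astype(np.float32) for _ in range(3))
    expect_r = (0.2 * in_r + 0.7 * in_g + 0.1 * in_b).astype(np.float32)

    # Pre-generate every candidate (i, j, k) on the CPU, 10 divisions per channel.
    grid = np.linspace(0.0, 1.0, 10).astype(np.float32)
    mr, mg, mb = (a.ravel().copy() for a in np.meshgrid(grid, grid, grid, indexing="ij"))
    n_candidates = mr.size

    def ro(a):  # helper: read-only device buffer initialised from a host array
        return cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)

    r_buf, g_buf, b_buf, e_buf = ro(in_r), ro(in_g), ro(in_b), ro(expect_r)
    mr_buf, mg_buf, mb_buf = ro(mr), ro(mg), ro(mb)
    sums = np.empty(n_candidates, dtype=np.float32)
    sums_buf = cl.Buffer(ctx, mf.WRITE_ONLY, sums.nbytes)

    prg = cl.Program(ctx, KERNEL_SRC).build()
    local_size = 64   # assumed to be a power of two
    prg.score_all(queue, (local_size * n_candidates,), (local_size,),
                  r_buf, g_buf, b_buf, e_buf, mr_buf, mg_buf, mb_buf,
                  sums_buf, cl.LocalMemory(4 * local_size), np.int32(n))
    cl.enqueue_copy(queue, sums, sums_buf)

    rms = np.sqrt(sums / n)          # only a trivial amount of host work remains
    best = int(np.argmin(rms))
    print([mr[best], mg[best], mb[best]])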

Alternatively, if you need to evaluate the i, j, k combinations sequentially, you can look at the barrier and memory-fence functions in the OpenCL specification and use them instead of launching two kernels; just remember to do the first summation pass, write the partial results to global memory, synchronize all the threads, and then sum again.


There are two possible problems:

  • Kernel launch overhead can be large if the amount of work needed to process each of your images is small. This is what you'd address by combining the evaluation of several i,j,k values into one kernel.
  • The serialization of the sum calculation for the RMS error. This is probably the bigger problem at the moment.

To address (2), note that summation can be evaluated in parallel, but it is not as trivial as mapping a function independently over every input pixel. That's because summing requires passing values between neighbouring elements rather than processing each element on its own. This pattern is commonly called a reduction.
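
As an illustration of the pattern (plain Python, just to show the shape of the computation, not how you'd write it on a GPU): each pass combines neighbouring pairs, so the number of values halves every pass and the whole sum takes O(log n) passes instead of one long serial chain.

    def tree_sum(values):
        # Pairwise/tree reduction: combine neighbours until one value remains.
        values = list(values)
        while len(values) > 1:
            if len(values) % 2:            # pad odd-length lists with the neutral element
                values.append(0.0)
            values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        return values[0] if values else 0.0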

PyOpenCL includes high-level support for common reductions. The one you want here is a sum reduction: pyopencl.array.sum(array).
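
For example (the array name here is my own; pyopencl.array.sum returns a single-element device array, so .get() brings the value back to the host):

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    # Per-pixel squared errors, already on the device (random stand-in data here).
    sq_err = cl_array.to_device(queue, np.random.rand(16 * 16).astype(np.float32))

    total = cl_array.sum(sq_err).get()        # parallel sum reduction on the device
    rms = float(np.sqrt(total / sq_err.size))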

For more detail on how this is implemented in raw OpenCL, Apple's OpenCL docs include an example of a parallel sum reduction. The most relevant parts for what you want to do are the kernel and the main and create_reduction_pass_counts functions of the C host program that drives the reduction.

