Global optimization in CUDA

I am trying to optimize (say, find the minimum of) a function of n parameters (X1, ..., Xn). Each Xi is constrained to a certain range (for example, from -200 to 200), and the function grows toward infinity very quickly if any parameter leaves this range. However, n can be large (from 20 to 60-70), and evaluating the function takes a long time.

I don't think the details of the function are of great importance, but here are some of them: it consists of a weighted sum of 20-30 smaller functions (all different), which in turn consist of sums of dot products under inverse trigonometric functions (arcsin, arccos, arctan, etc.). Something like arcsin(X1 . X2) + arcsin(X4 . X7) + ...
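
For illustration only, something along these lines (the index pairs and the out-of-range penalty below are placeholders, not my actual function):

#include <stdio.h>
#include <math.h>

// Hypothetical stand-in for the objective: pairs (X1*X2), (X3*X4), ... under asin,
// plus a steep penalty outside [-200, 200]. The real index pairs and weights differ.
__host__ __device__ float objective(const float *x, int n)
{
    float f = 0.0f;
    for (int i = 0; i + 1 < n; i += 2) {
        float p = x[i] * x[i + 1];                 // dot-product-like term
        p = fmaxf(-1.0f, fminf(1.0f, p));          // keep asinf's argument in [-1, 1]
        f += asinf(p);
    }
    for (int i = 0; i < n; ++i)                    // grow very quickly out of range
        if (fabsf(x[i]) > 200.0f) f += 1e6f * (fabsf(x[i]) - 200.0f);
    return f;
}

int main() {
    float x[60];
    for (int i = 0; i < 60; ++i) x[i] = 0.01f * i;   // a sample point inside the range
    printf("objective = %f\n", objective(x, 60));
    return 0;
}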

The function has many local minima in general, so approaches such as (naive) conjugate gradient or quasi-Newton methods are useless. Brute-force search of the entire domain is too slow.

My initial idea was to use some kind of massive parallelization in combination with a genetic algorithm, performing many searches at different points of the function's domain and regularly checking whether some of the searches have reached a local minimum. If so, it compares them, discards all results but the smallest, and continues the search until a sufficiently small value is found.

My two questions are:

1) Is it possible to implement this problem in CUDA or similar technology? Can CUDA calculate the value of such a function fast enough?

2) Would it be better / faster to implement the problem on a multi-core PC (with 12+ cores)?

3 answers

Of course, it is possible to implement global optimization algorithms on GPUs, but you may have to modify the algorithms to get the best performance. I have personally implemented about half a dozen population-based metaheuristics on GPUs, and you can get excellent speedups relative to a multi-threaded CPU.

With population-based algorithms that iterate over many generations, you may find that the population size becomes the limiting factor for performance. If your objective function is not easily parallelized, then the maximum number of parallel threads is usually the number of candidate solutions in the population. GPUs work best with, at a minimum, tens of thousands of simultaneous threads, which is not always practical for population-based algorithms due to population inertia and other effects.
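
To make this concrete, here is a minimal sketch of the simplest mapping, one thread per candidate solution (the kernel, its names and the placeholder Sphere objective are illustrative, not taken from any particular library); with this mapping the number of concurrent threads is exactly the population size:

#include <cuda_runtime.h>
#include <stdio.h>

// One thread per candidate solution: the number of concurrent threads equals the
// population size, so a small population cannot keep the GPU busy on its own.
__global__ void evaluatePopulation(const float * __restrict__ population,   // popSize x n, row-major
                                   float * __restrict__ fitness, int popSize, int n)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= popSize) return;
    const float *x = population + p * n;
    float f = 0.0f;
    for (int i = 0; i < n; ++i) f += x[i] * x[i];    // placeholder objective (Sphere)
    fitness[p] = f;
}

int main() {
    const int popSize = 256, n = 60;                 // a typical, GPU-unfriendly population size
    float *d_pop, *d_fit;
    cudaMalloc(&d_pop, popSize * n * sizeof(float));
    cudaMalloc(&d_fit, popSize * sizeof(float));
    cudaMemset(d_pop, 0, popSize * n * sizeof(float));    // dummy population, all zeros
    evaluatePopulation<<<(popSize + 127) / 128, 128>>>(d_pop, d_fit, popSize, n);
    cudaDeviceSynchronize();
    printf("Evaluated %d candidates using at most %d concurrent threads\n", popSize, popSize);
    cudaFree(d_pop); cudaFree(d_fit);
    return 0;
}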

To some extent, you can get around population-size limits by running several instances of the computation in parallel, each starting from a different random seed. In addition, you can have the parallel instances communicate over some network topology (this would be an island-model algorithm), so that the parallel instances can jointly evolve solutions to hard problems. I actually implemented this in OpenCL for my MScE thesis.
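
As a rough illustration of the idea (this is only a sketch with a placeholder objective, per-island seeding and no migration step; it is not the OpenCL code from my thesis), each block below acts as an independent island performing random search:

#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <math.h>
#include <stdio.h>

// Each block is an independent "island" running a random search with its own RNG stream.
// A real island model would also migrate the best members between islands periodically.
__global__ void islandRandomSearch(float *bestValues, unsigned long long seed, int evals, int n)
{
    if (threadIdx.x != 0) return;               // one worker thread per island, for brevity only
    curandState state;
    curand_init(seed, blockIdx.x, 0, &state);   // different subsequence per island

    float best = 1e30f;
    for (int e = 0; e < evals; ++e) {
        float f = 0.0f;
        for (int i = 0; i < n; ++i) {
            float xi = -200.0f + 400.0f * curand_uniform(&state);   // sample in [-200, 200]
            f += xi * xi;                                           // placeholder objective
        }
        best = fminf(best, f);
    }
    bestValues[blockIdx.x] = best;
}

int main() {
    const int nIslands = 64, evals = 1000, n = 60;
    float *d_best;  cudaMalloc(&d_best, nIslands * sizeof(float));
    islandRandomSearch<<<nIslands, 32>>>(d_best, 1234ULL, evals, n);
    float h_best[nIslands];
    cudaMemcpy(h_best, d_best, nIslands * sizeof(float), cudaMemcpyDeviceToHost);
    float globalBest = h_best[0];
    for (int k = 1; k < nIslands; ++k) globalBest = fminf(globalBest, h_best[k]);
    printf("Best value over all islands: %f\n", globalBest);
    cudaFree(d_best);
    return 0;
}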

In general, here are the short answers to your questions:

1) Yes, you can implement this in CUDA or a similar technology. Your speedup will depend on how parallelizable your objective function is and on how many parallel instances you want to run at the same time.

2) It would almost certainly be faster to implement on the CPU, owing to the wider range of existing libraries and the fact that CPU programming models are conceptually simpler than GPU models. Whether it would be "better" depends on what you value.

This is still an area of active research, and it will probably take you longer to build a working GPU implementation than a working multi-core CPU implementation. If implementation time is your primary concern, I would recommend you take a look at the PaGMO project (and its Python bindings, PyGMO), which is an excellent implementation of an island-model optimizer with a wide range of local and global optimization functions included. The islands can each be assigned any arbitrary algorithm to run independently, and you can specify exactly how they communicate to jointly optimize your objective function.

http://sourceforge.net/apps/mediawiki/pagmo/index.php?title=Main_Page

Your question is well within my field of research, and I would be happy to help you further if you need it.


Can CUDA calculate the value of such a function fast enough?

Many cost functionals are expressed as a sum of a certain number of terms. Examples are:

Sphere function

f(x) = sum_{i=1..n} x_i^2

Rosenbrock function

f(x) = sum_{i=1..n-1} [ 100 (x_{i+1} - x_i^2)^2 + (x_i - 1)^2 ]

Styblinski-Tang function

f(x) = (1/2) sum_{i=1..n} ( x_i^4 - 16 x_i^2 + 5 x_i )

In all these cases, the evaluation of the cost function can be carried out by a reduction or, better, by a transformation followed by a reduction.

CUDA Thrust has thrust::transform_reduce, which can certainly serve the purpose, but of course you can also set up your own transformation + reduction routines.
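
As a minimal illustration of the transform + reduce pattern, the Sphere function above can be evaluated with thrust::transform_reduce in a few lines (the input here is just a constant vector so the result can be checked by hand):

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <stdio.h>

int main() {
    const int N = 1 << 20;
    thrust::device_vector<float> d_x(N, 0.5f);                   // all parameters set to 0.5

    // Sphere function: f(x) = sum_i x_i^2, computed as transform (square) + reduce (sum)
    float f = thrust::transform_reduce(d_x.begin(), d_x.end(),
                                       thrust::square<float>(),  // per-element transformation
                                       0.0f,                     // initial value of the sum
                                       thrust::plus<float>());   // reduction operator

    printf("Sphere function value = %f (expected %f)\n", f, 0.25f * N);
    return 0;
}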

Below, I show how the Rosenbrock functional can be evaluated using either CUDA Thrust or a customized version of the reduction routine offered by the CUDA samples. In the latter case, a pointer to a __device__ transformation function is passed to the customized transform_reduce function if the EXTERNAL keyword is defined, or the transformation function is defined and compiled in the compilation unit of the customized transform_reduce routine.

Some performance results on a Kepler K20c card for the non-EXTERNAL case:

N = 90000      Thrust = 0.055 ms    Customized = 0.059 ms
N = 900000     Thrust = 0.67 ms     Customized = 0.14 ms
N = 9000000    Thrust = 0.85 ms     Customized = 0.87 ms

Here is the code. For the timing routines, see the OrangeOwlSolutions/Timing and OrangeOwlSolutions/CUDA_Utilities GitHub projects.

Please note that separate compilation is required.

kernel.cu

// --- Requires separate compilation

#include <stdio.h>

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/functional.h>

#include "transform_reduce.cuh"
#include "Utilities.cuh"

#include "TimingCPU.h"
#include "TimingGPU.cuh"

/***************************************/
/* COST FUNCTION - GPU/CUSTOMIZED CASE */
/***************************************/
// --- Transformation function
__device__ __forceinline__ float transformation(const float * __restrict__ x, const int i) {
    return (100.f * (x[i+1] - x[i] * x[i]) * (x[i+1] - x[i] * x[i]) + (x[i] - 1.f) * (x[i] - 1.f));
}

// --- Device-side function pointer
__device__ pointFunction_t dev_pfunc = transformation;

/***********************************/
/* COST FUNCTION - GPU/THRUST CASE */
/***********************************/
struct CostFunctionStructGPU {
    template <typename Tuple>
    __host__ __device__ float operator()(Tuple a) {
        float temp1 = (thrust::get<1>(a) - thrust::get<0>(a) * thrust::get<0>(a));
        float temp2 = (thrust::get<0>(a) - 1.f);
        return 100.f * temp1 * temp1 + temp2 * temp2;
    }
};

/********/
/* MAIN */
/********/
int main() {

    const int N = 90000000;

    float *x = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) x[i] = 3.f;

    float *d_x; gpuErrchk(cudaMalloc((void**)&d_x, N * sizeof(float)));
    gpuErrchk(cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice));

    /************************************************/
    /* "CUSTOMIZED" DEVICE-SIDE TRANSFORM REDUCTION */
    /************************************************/
    // --- Warm-up call, then timed call
    float customizedDeviceResult = transform_reduce(d_x, N - 1, &dev_pfunc);

    TimingGPU timerGPU;
    timerGPU.StartCounter();
    customizedDeviceResult = transform_reduce(d_x, N - 1, &dev_pfunc);
    printf("Timing for 'customized', device-side transform reduction = %f\n", timerGPU.GetCounter());
    printf("Result for 'customized', device-side transform reduction = %f\n", customizedDeviceResult);
    printf("\n\n");

    /************************************************/
    /* THRUST-BASED DEVICE-SIDE TRANSFORM REDUCTION */
    /************************************************/
    thrust::device_vector<float> d_vec(N, 3.f);

    timerGPU.StartCounter();
    float ThrustResult = thrust::transform_reduce(thrust::make_zip_iterator(thrust::make_tuple(d_vec.begin(),         d_vec.begin() + 1)),
                                                  thrust::make_zip_iterator(thrust::make_tuple(d_vec.begin() + N - 1, d_vec.begin() + N)),
                                                  CostFunctionStructGPU(), 0.f, thrust::plus<float>());
    printf("Timing for Thrust-based, device-side transform reduction = %f\n", timerGPU.GetCounter());
    printf("Result for Thrust-based, device-side transform reduction = %f\n", ThrustResult);
    printf("\n\n");

    /*********************************/
    /* HOST-SIDE TRANSFORM REDUCTION */
    /*********************************/
    // thrust::host_vector<float> h_vec(d_vec);
    // sum_host = sum_host + transformation(thrust::raw_pointer_cast(h_vec.data()), i);

    TimingCPU timerCPU;
    timerCPU.StartCounter();

    float sum_host = 0.f;
    for (int i = 0; i < N - 1; i++) {
        float temp = (100.f * (x[i+1] - x[i] * x[i]) * (x[i+1] - x[i] * x[i]) + (x[i] - 1.f) * (x[i] - 1.f));
        sum_host = sum_host + temp;
        //printf("%i %f %f\n", i, temp, sum_host);
    }
    printf("Timing for host-side transform reduction = %f\n", timerCPU.GetCounter());
    printf("Result for host-side transform reduction = %f\n", sum_host);
    printf("\n\n");

    // --- Same host-side summation, but with Kahan (compensated) summation for better accuracy
    sum_host = 0.f;
    float c = 0.f;
    for (int i = 0; i < N - 1; i++) {
        float temp = (100.f * (x[i+1] - x[i] * x[i]) * (x[i+1] - x[i] * x[i]) + (x[i] - 1.f) * (x[i] - 1.f)) - c;
        float t = sum_host + temp;
        c = (t - sum_host) - temp;
        sum_host = t;
    }
    printf("Result for host-side transform reduction = %f\n", sum_host);

    cudaDeviceReset();
}

transform_reduce.cuh

#ifndef TRANSFORM_REDUCE_CUH
#define TRANSFORM_REDUCE_CUH

// --- Function pointer type
// --- Complete with your own favourite instantiations
typedef float(*pointFunction_t)(const float * __restrict__, const int);

template <class T> T transform_reduce(T *, unsigned int, pointFunction_t *);

#endif

transform_reduce.cu

#include <stdio.h>

#include "Utilities.cuh"
#include "transform_reduce.cuh"

#define BLOCKSIZE 512
#define warpSize 32

// --- Host-side function pointer
pointFunction_t h_pfunc;

// --- Uncomment if you want to apply CUDA error checking to the kernel launches
//#define DEBUG

//#define EXTERNAL

/*******************************************************/
/* CALCULATING THE NEXT POWER OF 2 OF A CERTAIN NUMBER */
/*******************************************************/
unsigned int nextPow2(unsigned int x)
{
    --x;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return ++x;
}

/*************************************/
/* CHECK IF A NUMBER IS A POWER OF 2 */
/*************************************/
// --- Note: although x = 1 is a power of 2 (1 = 2^0), this routine returns 0 for x == 1.
bool isPow2(unsigned int x) {
    if (x == 1) return 0;
    else        return ((x&(x-1)) == 0);
}

/***************************/
/* TRANSFORMATION FUNCTION */
/***************************/
template <class T>
__host__ __device__ __forceinline__ T transformation(const T * __restrict__ x, const int i) {
    return ((T)100 * (x[i+1] - x[i] * x[i]) * (x[i+1] - x[i] * x[i]) + (x[i] - (T)1) * (x[i] - (T)1));
}

/********************/
/* REDUCTION KERNEL */
/********************/
/*
    This version adds multiple elements per thread sequentially. This reduces the overall
    cost of the algorithm while keeping the work complexity O(n) and the step complexity
    O(log n). (Brent Theorem optimization)

    Note, this kernel needs a minimum of 64*sizeof(T) bytes of shared memory.
    In other words if blockSize <= 32, allocate 64*sizeof(T) bytes.
    If blockSize > 32, allocate blockSize*sizeof(T) bytes.
*/
template <class T, unsigned int blockSize, bool nIsPow2>
__global__ void reductionKernel(T *g_idata, T *g_odata, unsigned int N, pointFunction_t pPointTransformation)
{
    extern __shared__ T sdata[];

    unsigned int tid      = threadIdx.x;                              // Local thread index
    unsigned int i        = blockIdx.x*(blockDim.x*2) + threadIdx.x;  // Global thread index - Fictitiously double the block dimension
    unsigned int gridSize = blockSize*2*gridDim.x;

    // --- Performs the first level of reduction in registers when reading from global memory on multiple elements per thread.
    //     More blocks will result in a larger gridSize and therefore fewer elements per thread
    T mySum = 0;

    while (i < N) {
#ifdef EXTERNAL
        mySum += (*pPointTransformation)(g_idata, i);
#else
        mySum += transformation(g_idata, i);
#endif
        // --- Ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
        if (nIsPow2 || i + blockSize < N)
#ifdef EXTERNAL
            mySum += (*pPointTransformation)(g_idata, i+blockSize);
#else
            mySum += transformation(g_idata, i+blockSize);
#endif
        i += gridSize;
    }

    // --- Each thread puts its local sum into shared memory
    sdata[tid] = mySum;
    __syncthreads();

    // --- Reduction in shared memory. Fully unrolled loop.
    if ((blockSize >= 512) && (tid < 256)) sdata[tid] = mySum = mySum + sdata[tid + 256];
    __syncthreads();

    if ((blockSize >= 256) && (tid < 128)) sdata[tid] = mySum = mySum + sdata[tid + 128];
    __syncthreads();

    if ((blockSize >= 128) && (tid < 64)) sdata[tid] = mySum = mySum + sdata[tid + 64];
    __syncthreads();

#if (__CUDA_ARCH__ >= 300)
    // --- Single warp reduction by shuffle operations
    if (tid < 32) {
        // --- Last iteration removed from the for loop, but needed for shuffle reduction
        mySum += sdata[tid + 32];
        // --- Reduce final warp using shuffle
        for (int offset = warpSize/2; offset > 0; offset /= 2) mySum += __shfl_down(mySum, offset);
        //for (int offset=1; offset < warpSize; offset *= 2) mySum += __shfl_xor(mySum, i);
    }
#else
    // --- Reduction within a single warp. Fully unrolled loop.
    if ((blockSize >= 64) && (tid < 32)) sdata[tid] = mySum = mySum + sdata[tid + 32];
    __syncthreads();

    if ((blockSize >= 32) && (tid < 16)) sdata[tid] = mySum = mySum + sdata[tid + 16];
    __syncthreads();

    if ((blockSize >= 16) && (tid < 8)) sdata[tid] = mySum = mySum + sdata[tid + 8];
    __syncthreads();

    if ((blockSize >= 8) && (tid < 4)) sdata[tid] = mySum = mySum + sdata[tid + 4];
    __syncthreads();

    if ((blockSize >= 4) && (tid < 2)) sdata[tid] = mySum = mySum + sdata[tid + 2];
    __syncthreads();

    if ((blockSize >= 2) && (tid < 1)) sdata[tid] = mySum = mySum + sdata[tid + 1];
    __syncthreads();
#endif

    // --- Write result for this block to global memory. At the end of the kernel, global memory will contain the results for the summations of
    //     individual blocks
    if (tid == 0) g_odata[blockIdx.x] = mySum;
}

/******************************/
/* REDUCTION WRAPPER FUNCTION */
/******************************/
template <class T>
T transform_reduce_inner(T *g_idata, unsigned int N, pointFunction_t h_pfunc) {

    // --- Reduction parameters
    const int NumThreads = (N < BLOCKSIZE) ? nextPow2(N) : BLOCKSIZE;
    const int NumBlocks  = (N + NumThreads - 1) / NumThreads;
    const int smemSize   = (NumThreads <= 32) ? 2 * NumThreads * sizeof(T) : NumThreads * sizeof(T);

    // --- Device memory space where storing the partial reduction results
    T *g_odata; gpuErrchk(cudaMalloc((void**)&g_odata, NumBlocks * sizeof(T)));

    if (isPow2(N)) {
        switch (NumThreads) {
            case 512: reductionKernel<T, 512, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case 256: reductionKernel<T, 256, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case 128: reductionKernel<T, 128, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case  64: reductionKernel<T,  64, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case  32: reductionKernel<T,  32, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case  16: reductionKernel<T,  16, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case   8: reductionKernel<T,   8, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case   4: reductionKernel<T,   4, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case   2: reductionKernel<T,   2, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case   1: reductionKernel<T,   1, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
        }
#ifdef DEBUG
        gpuErrchk(cudaPeekAtLastError());
        gpuErrchk(cudaDeviceSynchronize());
#endif
    }
    else {
        switch (NumThreads) {
            case 512: reductionKernel<T, 512, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case 256: reductionKernel<T, 256, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case 128: reductionKernel<T, 128, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case  64: reductionKernel<T,  64, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case  32: reductionKernel<T,  32, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case  16: reductionKernel<T,  16, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case   8: reductionKernel<T,   8, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case   4: reductionKernel<T,   4, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case   2: reductionKernel<T,   2, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
            case   1: reductionKernel<T,   1, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
        }
#ifdef DEBUG
        gpuErrchk(cudaPeekAtLastError());
        gpuErrchk(cudaDeviceSynchronize());
#endif
    }

    // --- The last part of the reduction, which would be expensive to perform on the device, is executed on the host
    T *host_vector = (T *)malloc(NumBlocks * sizeof(T));
    gpuErrchk(cudaMemcpy(host_vector, g_odata, NumBlocks * sizeof(T), cudaMemcpyDeviceToHost));

    T sum_transformReduce = (T)0;
    for (int i = 0; i < NumBlocks; i++) sum_transformReduce = sum_transformReduce + host_vector[i];

    return sum_transformReduce;
}

template <class T>
T transform_reduce(T *g_idata, unsigned int N, pointFunction_t *dev_pfunc) {
#ifdef EXTERNAL
    gpuErrchk(cudaMemcpyFromSymbol(&h_pfunc, *dev_pfunc, sizeof(pointFunction_t)));
#endif
    T customizedDeviceResult = transform_reduce_inner(g_idata, N, h_pfunc);
    return customizedDeviceResult;
}

// --- Complete with your own favourite instantiations
template float transform_reduce(float *, unsigned int, pointFunction_t *);

My answer above is mostly suited to problems with a very large number of unknowns, which are typically handled by local optimization algorithms. I will leave it here as a possible reference for other users.

As you mentioned, you are dealing with 60-70 unknowns, a scenario that is more manageable for global optimization algorithms.

As emphasized above, cost functionals often consist of summations, so their evaluation boils down to transform followed by reduce operations. With this few unknowns, a reduction in shared memory can be an interesting option. Fortunately, CUB offers primitives for block-wide reductions in shared memory.

Here is an example of using CUB to evaluate a large number of cost-functional values for problems with a moderate number of unknowns. The cost functional in this case is chosen to be the Rastrigin function, but the example can be adapted to other cost functionals simply by changing the corresponding __device__ function.

#include <cub/cub.cuh>
#include <cuda.h>

#include "Utilities.cuh"

#include <iostream>

#define BLOCKSIZE 256

const int N = 4096;

/************************/
/* RASTRIGIN FUNCTIONAL */
/************************/
__device__ float rastrigin(float x) {

    return x * x - 10.0f * cosf(2.0f * x) + 10.0f;

}

/******************************/
/* TRANSFORM REDUCTION KERNEL */
/******************************/
__global__ void CostFunctionalCalculation(const float * __restrict__ indata, float * __restrict__ outdata) {

    // --- Global index of the unknown handled by this thread; each block evaluates
    //     one cost functional of BLOCKSIZE unknowns
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // --- Specialize BlockReduce for type float.
    typedef cub::BlockReduce<float, BLOCKSIZE> BlockReduceT;

    // --- Allocate temporary storage for the block-wide reduction in shared memory
    __shared__ typename BlockReduceT::TempStorage temp_storage;

    float result;
    if (tid < N) result = BlockReduceT(temp_storage).Sum(rastrigin(indata[tid]));

    // --- Thread 0 of each block writes the block-wide sum, i.e., one cost-functional value
    if (threadIdx.x == 0) outdata[blockIdx.x] = result;

    return;
}

/********/
/* MAIN */
/********/
int main() {

    // --- Allocate host-side space for data and results
    float *h_data   = (float *)malloc(N * sizeof(float));
    float *h_result = (float *)malloc((N / BLOCKSIZE) * sizeof(float));

    // --- Allocate device-side space for data and results
    float *d_data;   gpuErrchk(cudaMalloc(&d_data,   N * sizeof(float)));
    float *d_result; gpuErrchk(cudaMalloc(&d_result, (N / BLOCKSIZE) * sizeof(float)));

    for (int i = 0; i < N; i++) h_data[i] = 1.f;

    gpuErrchk(cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice));

    CostFunctionalCalculation<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_data, d_result);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(h_result, d_result, (N / BLOCKSIZE) * sizeof(float), cudaMemcpyDeviceToHost));

    std::cout << "output: \n";
    for (int k = 0; k < N / BLOCKSIZE; k++) std::cout << k << " " << h_result[k] << "\n";
    std::cout << std::endl;

    gpuErrchk(cudaFree(d_data));
    gpuErrchk(cudaFree(d_result));

    return 0;
}
