Fast two-dimensional convolution in C

I am trying to implement a convolutional neural network in Python. I originally used scipy.signal.convolve2d to do the convolution, but it has a lot of overhead, and since I know exactly what my input looks like it would be faster to implement my own algorithm in C and call it from Python.

I implemented two functions:

  • Convolution of a matrix with a non-separable kernel
  • Convolution of a matrix with a separable kernel (for now I assume that Python checks separability and splits the kernel into its two 1D factors before passing them to C)

Neither of these functions applies any padding, since I want the output to shrink ("valid" convolution).

Non-separable 2D convolution

// a - 2D matrix (as a 1D array), w - kernel
double* conv2(double* a, double* w, double* result)
{
    register double acc;
    register int i;
    register int j;
    register int k1, k2;
    register int l1, l2;
    register int t1, t2;

    for(i = 0; i < RESULT_DIM; i++)
    {
        t1 = i * RESULT_DIM; // loop invariants
        for(j = 0; j < RESULT_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                t2 = k1 * FILTER_DIM; // loop invariants
                for(l1 = FILTER_DIM - 1, l2 = 0; l1 >= 0; l1--, l2++)
                {
                    acc += w[t2 + l1] * a[(i + k2) * IMG_DIM + (j + l2)];
                }
            }
            result[t1 + j] = acc;
        }
    }

    return result;
}

Separable 2D convolution

// a - 2D matrix, w1, w2 - the separated 1D kernels
double* conv2sep(double* a, double* w1, double* w2, double* result)
{
    register double acc;
    register int i;
    register int j;
    register int k1, k2;
    register int t;
    double* tmp = (double*)malloc(IMG_DIM * RESULT_DIM * sizeof(double));

    for(i = 0; i < RESULT_DIM; i++) // convolve with w1
    {
        t = i * RESULT_DIM;
        for(j = 0; j < IMG_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                acc += w1[k1] * a[k2 * IMG_DIM + t + j];
            }
            tmp[t + j] = acc;
        }
    }

    for(i = 0; i < RESULT_DIM; i++) // convolve with w2
    {
        t = i * RESULT_DIM;
        for(j = 0; j < RESULT_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                acc += w2[k1] * tmp[t + (j + k2)];
            }
            result[t + j] = acc;
        }
    }

    free(tmp);
    return result;
}

Compiling with gcc -O3 and testing on a 2.7 GHz Intel i7 with a 4000x4000 matrix and a 5x5 kernel, I get respectively (over 5 runs):

 non-separable: 271.21900 ms
 separable:     127.32000 ms

This is already a significant improvement over scipy.signal.convolve2d, which takes around 2 seconds for the same operation, but I need more speed, since I will be calling this function thousands of times. Changing the data type to float is not an option at the moment, although it could give a significant speed-up.

Can these algorithms be optimised further? Are there any tricks or cache-friendly techniques I can apply to speed them up?

Any suggestions would be appreciated.

1 answer

If you are targeting x86 only, consider using SSE or AVX SIMD optimisation. For double data the throughput improvement will be modest, but if you can switch to float you may get around a 4x improvement with SSE or 8x with AVX. There are a number of questions and answers on this topic on StackOverflow from which you can get some ideas for an implementation. Alternatively, there are many libraries available that include high-performance convolution (filtering) routines, typically using SIMD internally, e.g. Intel IPP (commercial) or OpenCV (free).
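To make the SIMD suggestion concrete, here is a minimal sketch of the non-separable inner loops vectorised with AVX intrinsics. It assumes a switch to float, that RESULT_DIM is a multiple of 8 (otherwise a scalar tail loop is needed), and compilation with -mavx; IMG_DIM, FILTER_DIM, RESULT_DIM and the flipped-kernel indexing follow the question's code, while conv2_avx is just an illustrative name:

#include <immintrin.h>

// Sketch only: computes 8 float outputs per iteration of the j loop
void conv2_avx(const float* restrict a, const float* restrict w,
               float* restrict result)
{
    for (int i = 0; i < RESULT_DIM; i++) {
        for (int j = 0; j < RESULT_DIM; j += 8) {
            __m256 acc = _mm256_setzero_ps();
            for (int k = 0; k < FILTER_DIM; k++) {
                for (int l = 0; l < FILTER_DIM; l++) {
                    // broadcast one kernel coefficient (kernel flipped as in conv2)
                    __m256 coef = _mm256_set1_ps(
                        w[(FILTER_DIM - 1 - k) * FILTER_DIM + (FILTER_DIM - 1 - l)]);
                    // multiply it by 8 neighbouring input pixels and accumulate
                    __m256 pix = _mm256_loadu_ps(&a[(i + k) * IMG_DIM + (j + l)]);
                    acc = _mm256_add_ps(acc, _mm256_mul_ps(coef, pix));
                }
            }
            _mm256_storeu_ps(&result[i * RESULT_DIM + j], acc);
        }
    }
}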

Another possibility is to use multiple cores - split the image into blocks and run each block in its own thread. E.g. if you have a 4-core CPU, split the image into 4 blocks. (See pthreads.)
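As a rough sketch of the multi-core idea, the following splits the output rows across 4 pthreads (compile and link with -pthread). conv2_rows is a hypothetical variant of the question's conv2 that computes only rows [row_start, row_end) of the result:

#include <pthread.h>

typedef struct {
    const double *a, *w;
    double *result;
    int row_start, row_end;
} conv_job;

static void* conv_worker(void* p)
{
    conv_job* job = (conv_job*)p;
    // hypothetical helper: same as conv2(), restricted to a range of output rows
    conv2_rows(job->a, job->w, job->result, job->row_start, job->row_end);
    return NULL;
}

void conv2_threaded(const double* a, const double* w, double* result)
{
    enum { NTHREADS = 4 };
    pthread_t threads[NTHREADS];
    conv_job jobs[NTHREADS];
    int rows_per_thread = RESULT_DIM / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        jobs[t] = (conv_job){ a, w, result,
                              t * rows_per_thread,
                              (t == NTHREADS - 1) ? RESULT_DIM
                                                  : (t + 1) * rows_per_thread };
        pthread_create(&threads[t], NULL, conv_worker, &jobs[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
}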

Of course, you can combine both of these ideas if you really want to fully optimize this operation.


Some small optimizations that you can apply to your current code, and to any future implementations (for example, SIMD):

  • if your kernels are symmetric (or anti-symmetric), you can reduce the number of operations by adding (or subtracting) the symmetrically placed input values and doing one multiplication instead of two (see the first sketch after this list)

  • for the separable case, rather than allocating a full-frame temporary buffer, consider a strip-mining approach - allocate a smaller buffer that is full width but only a relatively small number of rows, then process the image in strips, applying the horizontal kernel and the vertical kernel alternately. The advantage is much more cache-friendly access and a smaller memory footprint (see the second sketch after this list).
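To illustrate the symmetric-kernel trick, here is one output sample of a 5-tap 1D convolution; x and j are hypothetical names (an input row and output index), and the kernel is assumed symmetric so that w[0] == w[4] and w[1] == w[3]:

/* 3 multiplies instead of 5 by pairing the symmetric taps */
acc  = w[2] *  x[j + 2];             /* centre tap   */
acc += w[1] * (x[j + 1] + x[j + 3]); /* w[1] == w[3] */
acc += w[0] * (x[j] + x[j + 4]);     /* w[0] == w[4] */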

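And a rough sketch of the strip-mining idea for the separable case, reduced to its simplest form: a rolling buffer of just FILTER_DIM horizontally filtered rows replaces the full-frame tmp buffer, and each output row is emitted as soon as its input rows have been filtered. This is only an illustration (horizontal kernel w1 applied first, names taken from the question's globals), not drop-in replacement code:

void conv2sep_strips(const double* a, const double* w1, const double* w2,
                     double* result)
{
    // FILTER_DIM filtered rows, reused cyclically
    // (FILTER_DIM * RESULT_DIM doubles instead of a full frame)
    double rows[FILTER_DIM][RESULT_DIM];

    for (int i = 0; i < IMG_DIM; i++) {
        // horizontal pass for input row i into the ring buffer
        double* row = rows[i % FILTER_DIM];
        for (int j = 0; j < RESULT_DIM; j++) {
            double acc = 0.0;
            for (int k = 0; k < FILTER_DIM; k++)
                acc += w1[FILTER_DIM - 1 - k] * a[i * IMG_DIM + j + k];
            row[j] = acc;
        }

        // once FILTER_DIM filtered rows are available, emit one output row (vertical pass)
        if (i >= FILTER_DIM - 1) {
            int out_row = i - (FILTER_DIM - 1);
            for (int j = 0; j < RESULT_DIM; j++) {
                double acc = 0.0;
                for (int k = 0; k < FILTER_DIM; k++)
                    acc += w2[FILTER_DIM - 1 - k] * rows[(out_row + k) % FILTER_DIM][j];
                result[out_row * RESULT_DIM + j] = acc;
            }
        }
    }
}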

A few comments about the coding style:

  • the register keyword has been redundant for many years - modern compilers effectively ignore it (and may even complain about it), so save some noise (and some typing) by dropping it

  • casting the result of malloc in C is frowned upon - it is redundant and potentially dangerous

  • make any input-only parameters const (i.e. read-only) and use restrict for any parameters which can never be aliased (e.g. a and result) - this not only helps prevent programming errors (at least in the case of const), but in some cases it can also help the compiler generate better-optimised code (particularly where pointers could otherwise alias)
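As a sketch, the prototypes with these qualifiers applied to the question's functions might look like:

double* conv2(const double* restrict a, const double* restrict w,
              double* restrict result);

double* conv2sep(const double* restrict a, const double* restrict w1,
                 const double* restrict w2, double* restrict result);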

