I am trying to implement a convolutional neural network in Python. I originally used the scipy.signal.convolve2d function to do the convolution, but it has a lot of overhead, and since I know exactly what my input looks like, it would be faster to implement my own algorithm in C and call it from Python.
I implemented two functions:
- Convolution of a matrix with a non-separable kernel
- Convolution of a matrix with a separable kernel (for now I assume that Python checks the rank and splits the kernel into two 1D vectors before passing them to C)
Neither function applies any padding, so the output dimensions are reduced (RESULT_DIM = IMG_DIM - FILTER_DIM + 1).
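For context, the rank check and split that I assume happens on the Python side could be sketched in C roughly like this (split_kernel, n, and eps are illustrative names, not part of my actual code):

```c
#include <math.h>

// Attempt to split an n x n kernel w into a column vector w1 and a row
// vector w2 such that w[i][j] == w1[i] * w2[j]. Returns 1 if w has rank 1
// (i.e. is separable), 0 otherwise.
int split_kernel(const double* w, int n, double* w1, double* w2, double eps)
{
    // find a pivot row containing a nonzero entry
    int p = -1;
    for (int i = 0; i < n && p < 0; i++)
        for (int j = 0; j < n; j++)
            if (fabs(w[i * n + j]) > eps) { p = i; break; }
    if (p < 0) return 0; // all-zero kernel, nothing to split

    // find a nonzero column q in the pivot row
    int q = 0;
    while (fabs(w[p * n + q]) <= eps) q++;

    // take the pivot row as w2, and the ratio of each row to it as w1
    for (int j = 0; j < n; j++) w2[j] = w[p * n + j];
    for (int i = 0; i < n; i++) w1[i] = w[i * n + q] / w2[q];

    // verify rank 1: every entry must match the outer product w1 * w2
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (fabs(w[i * n + j] - w1[i] * w2[j]) > eps) return 0;
    return 1;
}
```

If the check fails, the caller falls back to the non-separable routine.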
Non-separable 2D convolution
```c
// a - 2D matrix (as a 1D array), w - kernel
double* conv2(double* a, double* w, double* result)
{
    register double acc;
    register int i;
    register int j;
    register int k1, k2;
    register int l1, l2;
    register int t1, t2;

    for(i = 0; i < RESULT_DIM; i++)
    {
        t1 = i * RESULT_DIM; // loop invariants
        for(j = 0; j < RESULT_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                t2 = k1 * FILTER_DIM; // loop invariants
                for(l1 = FILTER_DIM - 1, l2 = 0; l1 >= 0; l1--, l2++)
                {
                    acc += w[t2 + l1] * a[(i + k2) * IMG_DIM + (j + l2)];
                }
            }
            result[t1 + j] = acc;
        }
    }

    return result;
}
```
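To pin down the indexing convention: this computes a "valid" convolution with the kernel flipped in both axes, matching scipy.signal.convolve2d with mode='valid'. A minimal reference for a single output element, using small illustrative dimensions (conv_at is a name introduced here just for the example):

```c
#define IMG_DIM 3
#define FILTER_DIM 2
#define RESULT_DIM (IMG_DIM - FILTER_DIM + 1)

/* Reference: one output element of a "valid" 2D convolution, with the
 * kernel flipped in both axes, as conv2 computes it. */
static double conv_at(const double* a, const double* w, int i, int j)
{
    double acc = 0.0;
    for (int k = 0; k < FILTER_DIM; k++)
        for (int l = 0; l < FILTER_DIM; l++)
            acc += w[(FILTER_DIM - 1 - k) * FILTER_DIM + (FILTER_DIM - 1 - l)]
                 * a[(i + k) * IMG_DIM + (j + l)];
    return acc;
}
```

For a 3x3 input and 2x2 kernel this expands to, e.g., result[0][0] = w[1][1]*a[0][0] + w[1][0]*a[0][1] + w[0][1]*a[1][0] + w[0][0]*a[1][1].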
Separable 2D convolution
```c
// a - 2D matrix, w1, w2 - the separated 1D kernels
double* conv2sep(double* a, double* w1, double* w2, double* result)
{
    register double acc;
    register int i;
    register int j;
    register int k1, k2;
    register int t1, t2;
    double* tmp = (double*)malloc(IMG_DIM * RESULT_DIM * sizeof(double));

    for(i = 0; i < RESULT_DIM; i++) // convolve the columns with w1
    {
        t1 = i * IMG_DIM; // row stride of tmp is IMG_DIM
        for(j = 0; j < IMG_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                acc += w1[k1] * a[(i + k2) * IMG_DIM + j];
            }
            tmp[t1 + j] = acc;
        }
    }

    for(i = 0; i < RESULT_DIM; i++) // convolve the rows with w2
    {
        t1 = i * IMG_DIM;    // row stride of tmp
        t2 = i * RESULT_DIM; // row stride of result
        for(j = 0; j < RESULT_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                acc += w2[k1] * tmp[t1 + (j + k2)];
            }
            result[t2 + j] = acc;
        }
    }

    free(tmp);
    return result;
}
```
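The stride bookkeeping is the subtle part here: the intermediate buffer has RESULT_DIM rows of IMG_DIM columns, so its row stride differs from the output's. A compact reference version of the same two-pass scheme, on small illustrative dimensions and stack buffers (conv2sep_ref is a name introduced for this sketch, not one of my functions):

```c
#define IMG_DIM 4
#define FILTER_DIM 2
#define RESULT_DIM (IMG_DIM - FILTER_DIM + 1)

/* Two-pass separable "valid" convolution: vertical pass with the column
 * kernel w1, then horizontal pass with the row kernel w2. tmp has
 * RESULT_DIM rows of IMG_DIM columns (row stride IMG_DIM), while result
 * has row stride RESULT_DIM. */
static void conv2sep_ref(const double* a, const double* w1, const double* w2,
                         double* result)
{
    double tmp[RESULT_DIM * IMG_DIM];

    for (int i = 0; i < RESULT_DIM; i++)      /* vertical pass with w1 */
        for (int j = 0; j < IMG_DIM; j++) {
            double acc = 0.0;
            for (int k = 0; k < FILTER_DIM; k++)
                acc += w1[FILTER_DIM - 1 - k] * a[(i + k) * IMG_DIM + j];
            tmp[i * IMG_DIM + j] = acc;
        }

    for (int i = 0; i < RESULT_DIM; i++)      /* horizontal pass with w2 */
        for (int j = 0; j < RESULT_DIM; j++) {
            double acc = 0.0;
            for (int k = 0; k < FILTER_DIM; k++)
                acc += w2[FILTER_DIM - 1 - k] * tmp[i * IMG_DIM + j + k];
            result[i * RESULT_DIM + j] = acc;
        }
}
```

For a rank-1 kernel, this produces the same output as the direct 2D convolution of the outer product w1*w2, at 2*FILTER_DIM instead of FILTER_DIM^2 multiplies per output pixel.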
Compiling with gcc and the -O3 flag, and testing on a 2.7 GHz Intel i7 with a 4000x4000 matrix and a 5x5 kernel, I get, respectively (out of 5 runs):
271.21900 ms (non-separable)
127.32000 ms (separable)
This is already a significant improvement over scipy.signal.convolve2d, which takes about 2 seconds for the same operation, but I need more speed, since I will be calling these functions thousands of times. Switching the data type to float is not an option at the moment, even though it could give a significant speedup.
Can these algorithms be optimized further? Are there any tricks or cache optimizations I can apply to speed them up?
Any suggestions would be appreciated.