Optimizing a 1D convolution

Is there a way to speed up this 1D convolution? I tried to make dy cache-efficient, but compiling with g++ and -O3 gave worse results.

I convolve with [-1., 0., 1.] in both directions. Not homework.

    #include <iostream>
    #include <cstdlib>
    #include <sys/time.h>

    void print_matrix(int height, int width, float *matrix){
        for (int j = 0; j < height; j++){
            for (int i = 0; i < width; i++){
                std::cout << matrix[j * width + i] << ",";
            }
            std::cout << std::endl;
        }
    }

    void fill_matrix(int height, int width, float *matrix){
        for (int j = 0; j < height; j++){
            for (int i = 0; i < width; i++){
                matrix[j * width + i] = ((float)rand() / (float)RAND_MAX);
            }
        }
    }

    #define RESTRICT __restrict__

    void dx_matrix(int height, int width, float * RESTRICT in_matrix, float * RESTRICT out_matrix, float *min, float *max){
        // init min, max
        *min = *max = -1.F * in_matrix[0] + in_matrix[1];
        for (int j = 0; j < height; j++){
            float *row = in_matrix + j * width;
            for (int i = 1; i < width - 1; i++){
                float res = -1.F * row[i-1] + row[i+1];
                /* -1.F * value + 0.F * value + 1.F * value; */
                if (res > *max) *max = res;
                if (res < *min) *min = res;
                out_matrix[j * width + i] = res;
            }
        }
    }

    void dy_matrix(int height, int width, float * RESTRICT in_matrix, float * RESTRICT out_matrix, float *min, float *max){
        // init min, max
        *min = *max = -1.F * in_matrix[0] + in_matrix[width + 1];
        for (int j = 1; j < height - 1; j++){
            for (int i = 0; i < width; i++){
                float res = -1.F * in_matrix[(j-1) * width + i] + in_matrix[(j+1) * width + i];
                if (res > *max) *max = res;
                if (res < *min) *min = res;
                out_matrix[j * width + i] = res;
            }
        }
    }

    double now(void){
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
    }

    int main(int argc, char **argv){
        int width, height;
        float *in_matrix;
        float *out_matrix;

        if (argc < 3){
            std::cout << argv[0] << " usage: width height" << std::endl;
            return -1;
        }

        srand(123);
        width = atoi(argv[1]);
        height = atoi(argv[2]);
        std::cout << "Width:" << width << " Height:" << height << std::endl;

        if (width < 3){
            std::cout << "Width too short" << std::endl;
            return -1;
        }
        if (height < 3){
            std::cout << "Height too short" << std::endl;
            return -1;
        }

        in_matrix = (float *) malloc(height * width * sizeof(float));
        out_matrix = (float *) malloc(height * width * sizeof(float));

        fill_matrix(height, width, in_matrix);
        //print_matrix(height, width, in_matrix);

        float min, max;
        double a = now();
        dx_matrix(height, width, in_matrix, out_matrix, &min, &max);
        std::cout << "dx min:" << min << " max:" << max << std::endl;
        dy_matrix(height, width, in_matrix, out_matrix, &min, &max);
        double b = now();
        std::cout << "dy min:" << min << " max:" << max << std::endl;
        std::cout << "time: " << b - a << " sec" << std::endl;

        return 0;
    }
5 answers

First of all, I would rewrite the dy loop to get rid of the in_matrix[(j-1) * width + i] and in_matrix[(j+1) * width + i] indexing and do something like:

    float *p, *q, *out;
    p = &in_matrix[(j-1) * width];
    q = &in_matrix[(j+1) * width];
    out = &out_matrix[j * width];
    for (int i = 0; i < width; i++){
        float res = -1.F * p[i] + q[i];
        if (res > *max) *max = res;
        if (res < *min) *min = res;
        out[i] = res;
    }

But this is a trivial optimization that the compiler may already be doing for you.

It would be slightly faster to compute q[i] - p[i] instead of -1.F * p[i] + q[i], but again, the compiler may be smart enough to do that behind your back.

Beyond that, all of this would benefit from SSE2 and multithreading. I would guess at least a 3x speedup from SSE2 alone. Multithreading can be added with OpenMP in just a few lines of code.
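As a rough sketch of what the OpenMP version might look like (assuming the same dy_matrix signature as in the question; the min/max reduction clauses require OpenMP 3.1 or later, and without -fopenmp the pragma is simply ignored and the code runs serially):

```cpp
#include <cfloat>

#define RESTRICT __restrict__

// Hypothetical OpenMP variant of dy_matrix. Output rows are independent,
// so the outer loop parallelizes cleanly; min/max are tracked through
// OpenMP reductions instead of shared pointers.
void dy_matrix_omp(int height, int width,
                   float* RESTRICT in, float* RESTRICT out,
                   float* min, float* max) {
    float mn = FLT_MAX, mx = -FLT_MAX;
    #pragma omp parallel for reduction(min:mn) reduction(max:mx)
    for (int j = 1; j < height - 1; j++) {
        const float* p = in + (j - 1) * width;  // row above
        const float* q = in + (j + 1) * width;  // row below
        float* o = out + j * width;
        for (int i = 0; i < width; i++) {
            float res = q[i] - p[i];
            if (res > mx) mx = res;
            if (res < mn) mn = res;
            o[i] = res;
        }
    }
    *min = mn;
    *max = mx;
}
```

Compile with -fopenmp (g++) to enable the parallelism; the serial result is identical either way.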


Use local variables to calculate min and max. Every time you do this:

    if (res > *max) *max = res;
    if (res < *min) *min = res;

max and min have to be written back to memory. Adding restrict to the pointers would help (by telling the compiler the writes don't alias), but an even better approach is something like:

    // Setup
    float tempMin = ...
    float tempMax = ...
    ...
    // Inner loop
    tempMin = (res < tempMin) ? res : tempMin;
    tempMax = (res > tempMax) ? res : tempMax;
    ...
    // End
    *min = tempMin;
    *max = tempMax;
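Applied to dx_matrix from the question, a filled-in sketch might look like this (seeding the temporaries with FLT_MAX sentinels, which assumes at least one element is processed, i.e. width >= 3):

```cpp
#include <cfloat>

#define RESTRICT __restrict__

// dx_matrix with min/max kept in local variables so the compiler can
// hold them in registers instead of re-reading and re-writing memory
// on every iteration.
void dx_matrix_local(int height, int width,
                     float* RESTRICT in, float* RESTRICT out,
                     float* min, float* max) {
    float tempMin = FLT_MAX, tempMax = -FLT_MAX;
    for (int j = 0; j < height; j++) {
        const float* row = in + j * width;
        for (int i = 1; i < width - 1; i++) {
            float res = row[i + 1] - row[i - 1];
            tempMin = (res < tempMin) ? res : tempMin;
            tempMax = (res > tempMax) ? res : tempMax;
            out[j * width + i] = res;
        }
    }
    *min = tempMin;  // single write at the end
    *max = tempMax;
}
```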

Well, the compiler can take care of this, but here are a few little things:

a) Why are you multiplying by -1.F? Why not just subtract? For instance:

 float res = -1.F * row[i-1] + row[i+1]; 

could simply be:

 float res = row[i+1] - row[i-1]; 

b) These:

    if (res > *max) *max = res;
    if (res < *min) *min = res;

can become:

    if (res > *max) *max = res;
    else if (res < *min) *min = res;

and similarly elsewhere. If the first condition is true, the second cannot be, so there is no need to test it.

Addendum:

Here is one more thing. To eliminate the multiplications in the indexing, change

    for (int j = 1; j < height-1; j++){
        for (int i = 0; i < width; i++){
            float res = -1.F * in_matrix[(j-1) * width + i] + in_matrix[(j+1) * width + i];

to

    int h = -width;
    int width2 = 2 * width;
    for (int j = 1; j < height-1; j++){
        h += width;   // h is now the index of element (j-1, 0)
        for (int i = h; i < h + width; i++){
            float res = in_matrix[i + width2] - in_matrix[i];

and at the end of the inner loop body:

  out_matrix[i + width] = res; 

You can do similar things in the other places, but hopefully you get the idea. Also, there is a small mistake:

    *min = *max = -1.F * in_matrix[0] + in_matrix[ width + 1 ];

should be just in_matrix[width] at the end.
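Putting this answer's suggestions together, the whole routine might look like the sketch below (the function name is hypothetical; min/max are seeded here from the first value the loop actually computes, element (j=1, i=0), i.e. in[2*width] - in[0], which sidesteps the initialization question entirely):

```cpp
// A sketch of dy_matrix combining the suggestions above: one running
// row index instead of (j +/- 1) * width multiplications, a plain
// subtraction instead of multiplying by -1.F, and the else-if test.
void dy_matrix_v2(int height, int width,
                  const float* in, float* out,
                  float* min, float* max) {
    const int width2 = 2 * width;
    float mn = in[width2] - in[0];  // first value the loop produces
    float mx = mn;
    int h = 0;                      // index of element (j-1, 0)
    for (int j = 1; j < height - 1; j++) {
        for (int i = h; i < h + width; i++) {
            float res = in[i + width2] - in[i];
            if (res > mx) mx = res;
            else if (res < mn) mn = res;
            out[i + width] = res;   // row j of the output
        }
        h += width;
    }
    *min = mn;
    *max = mx;
}
```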


The compiler may notice this, but you create and destroy many variables on the stack as you enter and leave the {} scopes. Instead of:

    for (int j = 0; j < height; j++){
        float *row = in_matrix + j * width;
        for (int i = 1; i < width-1; i++){
            float res = -1.F * row[i-1] + row[i+1];

What about:

    int i, j;
    float *row;
    float res;
    for (j = 0; j < height; j++){
        row = in_matrix + j * width;
        for (i = 1; i < width-1; i++){
            res = -1.F * row[i-1] + row[i+1];

Profiling this at -O3 and -O2 with both the clang and g++ compilers on OS X, I found that:

30% of the time was spent filling the input matrix:

  matrix[j * width + i] = ((float)rand() / (float)RAND_MAX) ; 

40% of the time was spent in dx_matrix on the line:

    out_matrix[j * width + i] = row[i+1] - row[i-1];

Approximately 9% of the time was spent in the min/max tests in dx_matrix. I split them out into a separate loop to see if that helped, but it did not change anything.

Shark suggested that this could be improved with SSE instructions.
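For reference, a sketch of what an SSE2 version of one dy row might look like (the function name and signature are made up for illustration; it assumes width is a multiple of 4 and uses unaligned loads/stores for simplicity, so a real version would need a scalar tail loop):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical SSE2 inner loop for dy: res = q[i] - p[i], four floats
// at a time, tracking min/max in vector registers and reducing the
// lanes to scalars once at the end.
void dy_row_sse2(const float* p, const float* q, float* out, int width,
                 float* min, float* max) {
    __m128 vmin = _mm_set1_ps(*min);
    __m128 vmax = _mm_set1_ps(*max);
    for (int i = 0; i < width; i += 4) {
        __m128 res = _mm_sub_ps(_mm_loadu_ps(q + i), _mm_loadu_ps(p + i));
        vmin = _mm_min_ps(vmin, res);
        vmax = _mm_max_ps(vmax, res);
        _mm_storeu_ps(out + i, res);
    }
    // Reduce the four lanes back to scalar min/max.
    float mins[4], maxs[4];
    _mm_storeu_ps(mins, vmin);
    _mm_storeu_ps(maxs, vmax);
    for (int k = 0; k < 4; k++) {
        if (mins[k] < *min) *min = mins[k];
        if (maxs[k] > *max) *max = maxs[k];
    }
}
```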

Interestingly, only about 19% of the time was spent in the dy_matrix routine.

This was run on a 10k by 10k matrix (about 1.6 seconds total).

Please note that your results may vary if you use a different compiler, different OS, etc.

