Fast image transposition and Sobel filter optimization in C (SIMD)

I want to implement a really (really) fast Sobel operator for a ray tracer a friend of mine wrote (sources can be found here ). What follows is what I have found out so far...

First, suppose the image is a greyscale image stored row by row in an 8-bit unsigned array.

To write a real Sobel filter, I need to calculate Gx and Gy for each pixel. Each of these numbers is calculated from 6 pixels around the source pixel, but SIMD instructions let me process 16 or even 32 (AVX) pixels at once. Fortunately, the operator's kernel has a nice property (it is separable), so I can calculate Gy like this:

  • subtracting rows i and i + 2 of the source and storing the result in row i + 1 of a temporary image (array)
  • adding column i, twice column i + 1, and column i + 2 of that temporary image to get column i + 1 of the final image

I would do the same (but transposed) to calculate Gx, then add the two images.
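The two passes above can be sketched in plain scalar C before any SIMD. The function names and the int16_t intermediate buffer are my own choices (the intermediate avoids the overflow question for the moment):

```c
#include <stdint.h>
#include <stddef.h>

/* Pass 1: vertical difference. Row i+1 of tmp receives
   row i minus row i+2 of the source image. */
static void sobel_gy_pass1(const uint8_t *src, int16_t *tmp, size_t w, size_t h)
{
    for (size_t i = 0; i + 2 < h; ++i)
        for (size_t x = 0; x < w; ++x)
            tmp[(i + 1) * w + x] =
                (int16_t)src[i * w + x] - (int16_t)src[(i + 2) * w + x];
}

/* Pass 2: horizontal [1 2 1] smoothing. Column i+1 of the output is
   tmp column i plus twice column i+1 plus column i+2. */
static void sobel_gy_pass2(const int16_t *tmp, int16_t *dst, size_t w, size_t h)
{
    for (size_t y = 0; y < h; ++y)
        for (size_t x = 0; x + 2 < w; ++x)
            dst[y * w + x + 1] = tmp[y * w + x]
                               + 2 * tmp[y * w + x + 1]
                               + tmp[y * w + x + 2];
}
```

Both inner loops touch only contiguous memory, which is exactly what makes them candidates for processing 16 pixels per SSE instruction.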

Some notes:

  • I do not care about memory allocation, since everything will be allocated at the beginning.
  • I can cope with the overflow and sign problems by dividing the values by four (thanks to _mm_srli_epi8):

    (uint8_t >> 2 - uint8_t >> 2) = int7_t // really stored as int8_t
    int7_t + uint8_t << 1 >> 2 + int7_t = uint8_t
    // some precision is lost, but I don't care
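A quick scalar check of that scaling idea (the function name is mine; plain C rather than intrinsics): after shifting both operands right by 2, each is at most 63, so their difference always fits in a signed 8-bit value.

```c
#include <stdint.h>

/* Scaled difference: both inputs are pre-shifted right by 2, so the
   result lies in [-63, +63] and always fits in an int8_t. */
static int8_t scaled_diff(uint8_t a, uint8_t b)
{
    return (int8_t)((a >> 2) - (b >> 2));
}
```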

The real problem I ran into is moving from rows to columns: I cannot load columns of the image into a SIMD register any other way, so I have to transpose the image at least three times, right? First the original image; then I can compute the first pass for Gx and Gy; and then transpose the intermediate results to compute the second pass.

So here are my questions:

  • Is such an implementation a good idea?
  • Is there a way to transpose an array faster than the naive algorithm? (I do not think so)
  • Where will the bottlenecks be? (any guesses? :P)
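On question 2: a common improvement over the naive element-by-element loop is cache blocking, i.e. transposing fixed-size tiles so that both the reads and the writes stay within a few cache lines. A minimal sketch (the tile size and names are my own):

```c
#include <stdint.h>
#include <stddef.h>

#define TILE 16  /* tune to the cache; a 16x16-byte tile touches 16 lines */

static void transpose_blocked(const uint8_t *src, uint8_t *dst,
                              size_t w, size_t h)
{
    for (size_t by = 0; by < h; by += TILE)
        for (size_t bx = 0; bx < w; bx += TILE) {
            size_t ymax = (by + TILE < h) ? by + TILE : h;
            size_t xmax = (bx + TILE < w) ? bx + TILE : w;
            /* transpose one tile; dst is h-wide because it is transposed */
            for (size_t y = by; y < ymax; ++y)
                for (size_t x = bx; x < xmax; ++x)
                    dst[x * h + y] = src[y * w + x];
        }
}
```

With SSE one can go further and transpose each 16x16-byte tile entirely in registers using passes of _mm_unpacklo_epi8 / _mm_unpackhi_epi8.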
Tags: optimization, c, sse, simd
1 answer

I think a transpose/2-pass approach is not suitable for optimizing Sobel operator code. The Sobel operator is not a computationally expensive function, so spending extra memory traffic on transpose/2-pass accesses is not worth it in this case. I wrote some test code for the Sobel operator to find out how fast SSE can get. This code does not process the first and last edge pixels, and uses the FPU to calculate the sqrt() value.

The Sobel operator needs 14 multiplications, 1 square root, 11 additions, 2 min/max, 12 read accesses and 1 write access per pixel. This means you can process one component in 20-30 cycles if you optimize the code well.

The FloatSobel () function took 2113044 CPU cycles for processing 256 * 256 image processing 32.76 cycles / component. I will convert this sample code to SSE.

    void FPUSobel()
    {
        BYTE* image_0 = g_image + g_image_width * 0;
        BYTE* image_1 = g_image + g_image_width * 1;
        BYTE* image_2 = g_image + g_image_width * 2;
        DWORD* screen = g_screen + g_screen_width * 1;

        for(int y = 1; y < g_image_height - 1; ++y)
        {
            for(int x = 1; x < g_image_width - 1; ++x)
            {
                float gx = image_0[x-1] * (+1.0f) +
                           image_0[x+1] * (-1.0f) +
                           image_1[x-1] * (+2.0f) +
                           image_1[x+1] * (-2.0f) +
                           image_2[x-1] * (+1.0f) +
                           image_2[x+1] * (-1.0f);

                float gy = image_0[x-1] * (+1.0f) +
                           image_0[x+0] * (+2.0f) +
                           image_0[x+1] * (+1.0f) +
                           image_2[x-1] * (-1.0f) +
                           image_2[x+0] * (-2.0f) +
                           image_2[x+1] * (-1.0f);

                int result = (int)min(255.0f, max(0.0f, sqrtf(gx * gx + gy * gy)));
                screen[x] = 0x01010101 * result;
            }
            image_0 += g_image_width;
            image_1 += g_image_width;
            image_2 += g_image_width;
            screen += g_screen_width;
        }
    }

The SseSobel () function took 613220 processor cycles to process the same 256 * 256 images. It took 9.51 cycles / component and 3.4 times faster than FPUSobel (). There are several gaps for optimization, but it will not be faster than 4 times, because it uses 4-way SIMD.

This function uses the SoA approach to process 4 pixels at a time. SoA is better than AoS for most arrays and images, because with AoS you need transposes/shuffles before you can use SIMD. And with SoA it is much easier to turn ordinary C code into SSE code.

    void SseSobel()
    {
        BYTE* image_0 = g_image + g_image_width * 0;
        BYTE* image_1 = g_image + g_image_width * 1;
        BYTE* image_2 = g_image + g_image_width * 2;
        DWORD* screen = g_screen + g_screen_width * 1;

        __m128 const_p_one = _mm_set1_ps(+1.0f);
        __m128 const_p_two = _mm_set1_ps(+2.0f);
        __m128 const_n_one = _mm_set1_ps(-1.0f);
        __m128 const_n_two = _mm_set1_ps(-2.0f);

        for(int y = 1; y < g_image_height - 1; ++y)
        {
            for(int x = 1; x < g_image_width - 1; x += 4)
            {
                // load 16 components. (0~6 will be used)
                __m128i current_0 = _mm_unpacklo_epi8(_mm_loadu_si128((__m128i*)(image_0 + x - 1)), _mm_setzero_si128());
                __m128i current_1 = _mm_unpacklo_epi8(_mm_loadu_si128((__m128i*)(image_1 + x - 1)), _mm_setzero_si128());
                __m128i current_2 = _mm_unpacklo_epi8(_mm_loadu_si128((__m128i*)(image_2 + x - 1)), _mm_setzero_si128());

                // image_00 = { image_0[x-1], image_0[x+0], image_0[x+1], image_0[x+2] }
                __m128 image_00 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(current_0, _mm_setzero_si128()));
                // image_01 = { image_0[x+0], image_0[x+1], image_0[x+2], image_0[x+3] }
                __m128 image_01 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_0, 2), _mm_setzero_si128()));
                // image_02 = { image_0[x+1], image_0[x+2], image_0[x+3], image_0[x+4] }
                __m128 image_02 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_0, 4), _mm_setzero_si128()));
                __m128 image_10 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(current_1, _mm_setzero_si128()));
                __m128 image_12 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_1, 4), _mm_setzero_si128()));
                __m128 image_20 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(current_2, _mm_setzero_si128()));
                __m128 image_21 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_2, 2), _mm_setzero_si128()));
                __m128 image_22 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_2, 4), _mm_setzero_si128()));

                __m128 gx = _mm_add_ps(_mm_mul_ps(image_00, const_p_one),
                            _mm_add_ps(_mm_mul_ps(image_02, const_n_one),
                            _mm_add_ps(_mm_mul_ps(image_10, const_p_two),
                            _mm_add_ps(_mm_mul_ps(image_12, const_n_two),
                            _mm_add_ps(_mm_mul_ps(image_20, const_p_one),
                                       _mm_mul_ps(image_22, const_n_one))))));

                __m128 gy = _mm_add_ps(_mm_mul_ps(image_00, const_p_one),
                            _mm_add_ps(_mm_mul_ps(image_01, const_p_two),
                            _mm_add_ps(_mm_mul_ps(image_02, const_p_one),
                            _mm_add_ps(_mm_mul_ps(image_20, const_n_one),
                            _mm_add_ps(_mm_mul_ps(image_21, const_n_two),
                                       _mm_mul_ps(image_22, const_n_one))))));

                __m128 result = _mm_min_ps(_mm_set1_ps(255.0f),
                                _mm_max_ps(_mm_set1_ps(0.0f),
                                _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(gx, gx), _mm_mul_ps(gy, gy)))));

                __m128i pack_32 = _mm_cvtps_epi32(result);            //R32,G32,B32,A32
                __m128i pack_16 = _mm_packs_epi32(pack_32, pack_32);  //R16,G16,B16,A16,R16,G16,B16,A16
                __m128i pack_8  = _mm_packus_epi16(pack_16, pack_16); //RGBA,RGBA,RGBA,RGBA

                __m128i unpack_2 = _mm_unpacklo_epi8(pack_8, pack_8);     //RRGG,BBAA,RRGG,BBAA
                __m128i unpack_4 = _mm_unpacklo_epi8(unpack_2, unpack_2); //RRRR,GGGG,BBBB,AAAA

                _mm_storeu_si128((__m128i*)(screen + x), unpack_4);
            }
            image_0 += g_image_width;
            image_1 += g_image_width;
            image_2 += g_image_width;
            screen += g_screen_width;
        }
    }
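To make the SoA-vs-AoS point from the answer concrete, here is a minimal illustration of the two layouts for an RGBA image (the names and the RGBA example are mine, not from the code above):

```c
#include <stdint.h>
#include <stddef.h>

#define N 1024

/* AoS: one struct per pixel. The four channels of a pixel are adjacent,
   so gathering 4 reds into one SIMD register needs shuffles. */
struct pixel_aos { uint8_t r, g, b, a; };
static struct pixel_aos image_aos[N];

/* SoA: one array per channel. 16 consecutive reds are contiguous and can
   be fetched with a single unaligned 128-bit load. */
struct image_soa {
    uint8_t r[N];
    uint8_t g[N];
    uint8_t b[N];
    uint8_t a[N];
};
static struct image_soa img;
```

A greyscale image is trivially SoA already (one plane), which is why the answer's code can load 16 neighbouring pixels per instruction without any shuffling.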
