By calling RGB888ToPlanar8, you scatter the data into separate planes and then gather it again. This is very, very, very bad. If you can afford a 33% memory overhead, try using an RGBA format and swap the B and R bytes.
If you want to save those 33%, I can suggest the following. Iterate over all the pixels, but read only in multiples of 4 bytes (lcm(3, 4) is 12, i.e. 3 words, which hold exactly 4 pixels).
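For the RGBA route, the swap touches only bytes 0 and 2 of each 4-byte pixel. Here is a minimal sketch, assuming a tightly packed 32-bit-per-pixel buffer with no row padding; the function name `swap_rb_rgba_inplace` is made up for illustration and is not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap the R and B channels of a 4-bytes-per-pixel buffer in place.
 * ASSUMPTION: pixels are tightly packed, with no padding between rows.
 * Works on the in-memory byte order: bytes 0 and 2 of each pixel trade
 * places, bytes 1 (G) and 3 (A) stay where they are. */
static void swap_rb_rgba_inplace(uint8_t *pixels, size_t num_pixels)
{
    assert(pixels != NULL || num_pixels == 0);
    for (size_t i = 0; i < num_pixels; i++) {
        uint8_t *p = pixels + 4 * i;
        uint8_t tmp = p[0];
        p[0] = p[2];
        p[2] = tmp;
    }
}
```

Every pixel starts on a 4-byte boundary here, which is what makes this layout easy for the compiler to vectorize.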
uint8_t* src_image;
uint8_t* dst_image;
uint32_t* src = (uint32_t*)src_image;
uint32_t* dst = (uint32_t*)dst_image;
uint32_t v1, v2, v3;
uint32_t nv1, nv2, nv3;
for(int i = 0 ; i < num_pixels / 12 ; i++) {
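The loop above stops at its opening brace. One possible way to fill it in is sketched below; it is my own reconstruction, not necessarily the original body. It assumes a little-endian CPU and a pixel count that is a multiple of 4, and it uses `memcpy` for the word loads and stores instead of the pointer cast, to avoid relying on alignment and aliasing. Since 12 bytes hold 4 pixels, the sketch runs `num_pixels / 4` iterations, which assumes `num_pixels` counts pixels rather than bytes. The name `swap_rb_24bit_words` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Swap R and B in a packed 24-bit image, 4 pixels (12 bytes, 3 words) per step.
 * ASSUMPTIONS: little-endian CPU; num_pixels is a multiple of 4; src and dst
 * each hold 3 * num_pixels bytes (they may be the same buffer). */
static void swap_rb_24bit_words(const uint8_t *src, uint8_t *dst, size_t num_pixels)
{
    assert(num_pixels % 4 == 0);
    for (size_t i = 0; i < num_pixels / 4; i++) {
        uint32_t v1, v2, v3, nv1, nv2, nv3;

        /* Read 12 bytes: [R1 G1 B1 R2 | G2 B2 R3 G3 | B3 R4 G4 B4] */
        memcpy(&v1, src, 4);
        memcpy(&v2, src + 4, 4);
        memcpy(&v3, src + 8, 4);

        /* [B1 G1 R1 B2] */
        nv1 = ((v1 >> 16) & 0xFFu) | (v1 & 0x0000FF00u) |
              ((v1 & 0xFFu) << 16) | ((v2 & 0xFF00u) << 16);
        /* [G2 R2 B3 G3] */
        nv2 = (v2 & 0xFF0000FFu) | ((v1 >> 16) & 0xFF00u) |
              ((v3 & 0xFFu) << 16);
        /* [R3 B4 G4 R4] */
        nv3 = ((v2 >> 16) & 0xFFu) | ((v3 >> 16) & 0xFF00u) |
              (v3 & 0x00FF0000u) | ((v3 & 0xFF00u) << 16);

        /* Write 12 bytes */
        memcpy(dst, &nv1, 4);
        memcpy(dst + 4, &nv2, 4);
        memcpy(dst + 8, &nv3, 4);

        src += 12;
        dst += 12;
    }
}
```

Any leftover pixels (when the count is not a multiple of 4) would need a plain byte-by-byte loop after this one.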
You can do even better with NEON.
See the ARM website for how a 24-bit swap is performed.
BGR-to-RGB can be done as follows:
void neon_asm_convert_BGR_TO_RGB(uint8_t* img, int numPixels24)
{
    // numPixels24 is the number of 24-byte blocks (8 pixels each); it must be > 0
    __asm__ volatile(
        "0:                        \n"
        // load 3 64-bit regs with de-interleave:
        "vld3.8 {d0,d1,d2}, [%0]   \n"
        // swap d0 and d2 - R and B
        "vswp d0, d2               \n"
        // store 3 64-bit regs and advance the pointer:
        "vst3.8 {d0,d1,d2}, [%0]!  \n"
        "subs %1, %1, #1           \n"
        "bne 0b                    \n"
        // both operands are modified by the loop, so they are read-write outputs
        : "+r"(img), "+r"(numPixels24)
        :
        : "d0", "d1", "d2", "cc", "memory"
    );
}
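If you would rather not maintain inline assembly, the same load / swap / store pattern can be written with NEON intrinsics. This is a sketch rather than a drop-in replacement: `swap_rb_24bit_inplace` is a name I made up, the `__ARM_NEON` check assumes a compiler that defines the ACLE macros, and the scalar loop handles both the tail and builds without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Swap R and B in place in a packed 24-bit image of any length. */
static void swap_rb_24bit_inplace(uint8_t *img, size_t num_pixels)
{
    size_t i = 0;
    assert(img != NULL || num_pixels == 0);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* 8 pixels (24 bytes) per step, like the vld3.8 / vswp / vst3.8 loop. */
    for (; i + 8 <= num_pixels; i += 8) {
        uint8x8x3_t px = vld3_u8(img + 3 * i); /* de-interleave into 3 planes */
        uint8x8_t tmp = px.val[0];
        px.val[0] = px.val[2];
        px.val[2] = tmp;
        vst3_u8(img + 3 * i, px);              /* re-interleave and store */
    }
#endif

    /* Scalar path: the remaining pixels, or all of them without NEON. */
    for (; i < num_pixels; i++) {
        uint8_t *p = img + 3 * i;
        uint8_t tmp = p[0];
        p[0] = p[2];
        p[2] = tmp;
    }
}
```

Unlike the assembly version, this one accepts any pixel count, including zero.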
Viktor Latypov