I am writing a cross-platform C library that performs various operations on webcam images. All of the operations are per-pixel and highly parallelizable: for example, applying bit masks, multiplying color values by constants, and so on. I therefore think I can gain performance by using SSE/SSE2 intrinsics.
However, I have a problem with the data format. My webcam library gives me frames as a pointer (void *) to a buffer containing 24- or 32-bit pixels in BGR or ABGR byte order. I cast the pointer to char * so that ptr++ and the like behave correctly. However, all SSE/SSE2 operations expect either four integers or four floats in the __m128 or __m64 data types. If I do this (having read the color values from the buffer into the chars r, g and b):
float pixel[] = {(float) r, (float) g, (float) b, 0.0f};
then load another float array, full of constants,
float constants[] = {0.299f, 0.587f, 0.114f, 0.0f};
cast both float pointers to __m128 * and use _mm_mul_ps to compute r * 0.299, g * 0.587, etc., there is no overall performance gain, because all the shuffling of data takes so long!
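For reference, here is a minimal, self-contained sketch of what I am doing now. The helper name weight_pixel and the B, G, R byte order are only illustrative assumptions, and I use _mm_loadu_ps here instead of the pointer cast so that the arrays do not need to be 16-byte aligned:

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

/* Hypothetical helper: multiplies one pixel's channels by the constants.
 * Assumes the three bytes arrive in B, G, R order. */
static void weight_pixel(const unsigned char *bgr, float out[4])
{
    /* Scalar byte-to-float conversion: the slow "shuffling" step. */
    float pixel[4] = { (float) bgr[2], (float) bgr[1], (float) bgr[0], 0.0f };
    const float constants[4] = { 0.299f, 0.587f, 0.114f, 0.0f };

    __m128 p = _mm_loadu_ps(pixel);       /* unaligned load of r, g, b, 0 */
    __m128 c = _mm_loadu_ps(constants);   /* unaligned load of the weights */

    /* One packed multiply: r * 0.299, g * 0.587, b * 0.114, 0 * 0 */
    _mm_storeu_ps(out, _mm_mul_ps(p, c));
}
```

Calling weight_pixel once per pixel means three scalar conversions and two loads for a single packed multiply, which is where I suspect the time goes.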
Does anyone have any suggestions on how to quickly and efficiently load these byte pixel values into SSE registers so that I can get a performance boost from working with them as such?