Question

I am writing a cross-platform C library that performs various operations on webcam images. All operations are per-pixel and highly parallelizable, for example applying bit masks or multiplying color values by constants, so I think I can gain performance by using SSE/SSE2 intrinsics.

However, I am having a data format problem. My webcam library gives me webcam frames as a pointer (void*) to a buffer containing 24- or 32-bit byte pixels in ABGR or BGR format. I have been casting these to char* so that ptr++ etc behaves correctly. However, all the SSE/SSE2 operations expect either four integers or four floats, in the __m128 or __m64 data types. If I do this (assuming I have read the color values from the buffer into chars r, g, and b):

float pixel[] = {(float)r, (float)g, (float)b, 0.0f};

then load another float array full of constants

float constants[] = {0.299f, 0.587f, 0.114f, 0.0f};

cast both float pointers to __m128, and use the _mm_mul_ps intrinsic to compute r * 0.299, g * 0.587, and so on, there is no overall performance gain, because shuffling the data around takes up so much time!

Does anyone have any suggestions for how I can load these byte pixel values quickly and efficiently into the SSE registers so that I actually get a performance gain from operating on them as such?

Answer 1

If you are willing to use MMX...

MMX gives you a set of 64-bit registers, each of which can be treated as eight 8-bit values.

Like the 8-bit values you're working with.

There's a good primer here.

Answer 2

I think your performance bottleneck could be the cast to float, which is a rather expensive operation.

If I remember correctly, that cast costs about 50 clock cycles on most architectures. Even in the worst case, where the FP multiplications take, let's say, about 4 clocks each with no overlap in the pipeline, doing all of them in parallel in 1 cycle would save you about 15 cycles at most, so there is still no gain.

I'd definitely stick to a single number format throughout (integer in this case); if the data is streamed through MMX as Shmoopty suggested, even better.
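To illustrate staying in integer format, here is a minimal sketch that multiplies byte values by a constant using SSE2 integer intrinsics and 8.8 fixed-point arithmetic instead of floats. The function name scale_bytes_sse2 and the choice of a single scale factor for every byte are assumptions made for illustration; they are not from the original posts. Per-channel weights would need a constant vector laid out to match the pixel format.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: multiply every byte in buf by scale/256, in place.
   scale is an 8.8 fixed-point factor in [0, 256], so 128 means 0.5. */
void scale_bytes_sse2(uint8_t *buf, size_t n, uint16_t scale)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i k = _mm_set1_epi16((short)scale);
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        /* Unaligned load of 16 bytes; works for any pointer. */
        __m128i px = _mm_loadu_si128((const __m128i *)(buf + i));

        /* Widen the bytes to 16-bit lanes by interleaving with zero. */
        __m128i lo = _mm_unpacklo_epi8(px, zero);
        __m128i hi = _mm_unpackhi_epi8(px, zero);

        /* 255 * 256 fits in 16 bits, so the low half of the product is exact.
           The logical shift by 8 drops the fractional part. */
        lo = _mm_srli_epi16(_mm_mullo_epi16(lo, k), 8);
        hi = _mm_srli_epi16(_mm_mullo_epi16(hi, k), 8);

        /* Narrow back to 16 bytes and store. */
        _mm_storeu_si128((__m128i *)(buf + i), _mm_packus_epi16(lo, hi));
    }

    /* Scalar tail for the bytes that do not fill a whole register. */
    for (; i < n; ++i)
        buf[i] = (uint8_t)((buf[i] * scale) >> 8);
}
```

Calling scale_bytes_sse2(frame, width * height * 4, 128) would halve every channel of a 32-bit frame, where frame, width, and height are placeholder names. The result is truncated rather than rounded, which may or may not be acceptable for a given filter.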

Answer 3

First, the data you're copying from (I'm guessing it's pointed to by that void* pointer) should be memory-aligned for optimal performance; if it is not, copy it to a memory-aligned buffer.

Second, you can still use SSE2 once you've moved your data into a memory-aligned buffer. It's quite easy: I used the code here without any issues with the intrinsics (but had problems with the assembly, as detailed here).

Hope this is useful. I too worked with images, stored them as unsigned char in main memory, and copied them to the SSE2 registers (which made sense since R, G, or B varied from 0-255), but I used the assembly code since I felt it was easier.

But if you want to make it cross-platform, I suppose using the intrinsics would be cleaner.
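As a rough sketch of the intrinsics route, the following applies a bit mask to a buffer of unsigned char pixel data, 16 bytes per SSE2 register. The function name mask_bytes_sse2 is hypothetical, and this is not the code linked above. It uses the unaligned load and store intrinsics so that it works on any pointer; if the buffer is known to be 16-byte aligned, _mm_load_si128 and _mm_store_si128 can be used instead.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: AND every byte of buf with mask, in place. */
void mask_bytes_sse2(uint8_t *buf, size_t n, uint8_t mask)
{
    /* Broadcast the mask byte into all 16 lanes. */
    const __m128i m = _mm_set1_epi8((char)mask);
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m128i px = _mm_loadu_si128((const __m128i *)(buf + i));
        _mm_storeu_si128((__m128i *)(buf + i), _mm_and_si128(px, m));
    }

    /* Scalar tail for the remaining bytes. */
    for (; i < n; ++i)
        buf[i] &= mask;
}
```

For example, mask_bytes_sse2(frame, frame_size, 0xF0) would keep only the top four bits of each channel, with frame and frame_size standing in for the webcam buffer and its length in bytes.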

Good luck!
