Processing byte pixels with SSE/SSE2 intrinsics in C

I am writing a cross-platform C library that does various things to webcam images. All operations are per-pixel and highly parallelizable, for example applying bit masks, multiplying color values by constants, etc. I therefore think I can gain performance by using SSE/SSE2 intrinsics.

However, I have a data format problem. My webcam library gives me webcam frames as a pointer (void*) to a buffer containing 24- or 32-bit byte pixels in ABGR or BGR format. I have been casting these to char* so that ptr++ etc. behaves correctly. However, all the SSE/SSE2 operations seem to expect either four integers or four floats, in the __m128 or __m64 data types. If I do this (assuming I have read the color values from the buffer into chars r, g, and b):

float pixel[] = {(float)r, (float)g, (float)b, 0.0f};

then load another float array full of constants

float constants[] = {0.299f, 0.587f, 0.114f, 0.0f};

cast both float pointers to __m128, and use the _mm_mul_ps intrinsic to do r * 0.299, g * 0.587, etc., there is no overall performance gain, because all the shuffling of data takes up so much time!

Does anyone have any suggestions for how I can load these byte pixel values quickly and efficiently into the SSE registers so that I actually get a performance gain from operating on them as such?

Best answer

If you are willing to use MMX...

MMX gives you a set of 64-bit registers, each of which can be treated as eight 8-bit values.

Like the 8-bit values you're working with.

There's a good primer here.
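The same idea of packed 8-bit lanes is also available in SSE2, where one 128-bit `__m128i` register holds 16 bytes, i.e. four 32-bit pixels. As a minimal sketch (using SSE2 rather than MMX, and assuming 32-bit pixels on a little-endian x86 target; `mask_pixels32` is a hypothetical name, not part of any library), a per-pixel bit mask could be applied without ever converting to float:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: AND every 32-bit pixel in buf with mask32.
 * Byte k of each pixel (in memory order) is ANDed with bits 8k..8k+7
 * of mask32. Four pixels (16 bytes) are processed per iteration. */
void mask_pixels32(uint8_t *buf, size_t n_pixels, uint32_t mask32)
{
    assert(buf != NULL);
    const __m128i mask = _mm_set1_epi32((int)mask32);
    size_t i = 0;

    for (; i + 4 <= n_pixels; i += 4) {
        __m128i *p = (__m128i *)(buf + 4 * i);
        __m128i px = _mm_loadu_si128(p);              /* unaligned load is allowed */
        _mm_storeu_si128(p, _mm_and_si128(px, mask)); /* 16 bytes ANDed at once */
    }

    /* Scalar tail for the last 0..3 pixels. */
    for (; i < n_pixels; ++i)
        for (int k = 0; k < 4; ++k)
            buf[4 * i + k] &= (uint8_t)(mask32 >> (8 * k));
}
```

For example, `mask_pixels32(frame, width * height, 0x00FFFFFFu)` would clear the fourth byte of every pixel in memory; `frame`, `width` and `height` are placeholders here.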

Answers

I think your performance bottleneck could come from the cast to float, which is a rather expensive operation.

If I remember correctly, that cast costs about 50 clock cycles on most architectures. Even in the worst case, where each FP multiplication takes, let's say, about 4 clocks with no overlap in the pipeline, doing all four of them in parallel in 1 cycle would save you at most 15 cycles, so there is still no gain.

I'd definitely stick to a single number format throughout (integer in this case); if it is streamed with MMX as Shmoopty said, even better.
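Staying in integers could look roughly like the sketch below, which replaces the float weights from the question (0.299, 0.587, 0.114) with 8-bit fixed-point weights 77, 150 and 29 (they sum to 256) and uses SSE2 rather than MMX. It assumes 32-bit pixels stored in memory as B, G, R, A bytes, which may not match your webcam library; `bgra_to_gray` and `pair_sums` are hypothetical names.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Multiply eight 16-bit channels by their weights and add them up per
 * pixel. The per-pixel sums end up in 32-bit lanes 0 and 2. */
static __m128i pair_sums(__m128i px16, __m128i weights)
{
    __m128i m = _mm_madd_epi16(px16, weights);       /* [B*wb+G*wg, R*wr+A*0, ...] */
    return _mm_add_epi32(m, _mm_srli_epi64(m, 32));  /* lane0 += lane1, lane2 += lane3 */
}

/* Assumed layout: src holds n_pixels pixels as B,G,R,A bytes;
 * dst receives one gray byte per pixel. */
void bgra_to_gray(const uint8_t *src, uint8_t *dst, size_t n_pixels)
{
    assert(src != NULL && dst != NULL);
    const __m128i zero = _mm_setzero_si128();
    /* Lowest lane is the last argument: B=29, G=150, R=77, A=0. */
    const __m128i w = _mm_set_epi16(0, 77, 150, 29, 0, 77, 150, 29);
    const __m128i half = _mm_set1_epi32(128);        /* rounding term */
    size_t i = 0;

    for (; i + 4 <= n_pixels; i += 4) {
        __m128i px = _mm_loadu_si128((const __m128i *)(src + 4 * i));
        /* Widen bytes to 16 bits, then compute weighted sums. */
        __m128i lo = pair_sums(_mm_unpacklo_epi8(px, zero), w);  /* pixels 0,1 */
        __m128i hi = pair_sums(_mm_unpackhi_epi8(px, zero), w);  /* pixels 2,3 */
        /* Move lanes 0 and 2 next to each other, then join both halves. */
        lo = _mm_shuffle_epi32(lo, _MM_SHUFFLE(3, 1, 2, 0));
        hi = _mm_shuffle_epi32(hi, _MM_SHUFFLE(3, 1, 2, 0));
        __m128i y = _mm_unpacklo_epi64(lo, hi);                  /* 4 sums */
        y = _mm_srli_epi32(_mm_add_epi32(y, half), 8);           /* (sum + 128) / 256 */
        y = _mm_packs_epi32(y, y);                               /* 32 -> 16 bit */
        y = _mm_packus_epi16(y, y);                              /* 16 -> 8 bit */
        int32_t out = _mm_cvtsi128_si32(y);                      /* 4 gray bytes */
        memcpy(dst + i, &out, 4);
    }

    /* Scalar tail using the same fixed-point formula. */
    for (; i < n_pixels; ++i) {
        const uint8_t *p = src + 4 * i;
        dst[i] = (uint8_t)((29 * p[0] + 150 * p[1] + 77 * p[2] + 128) >> 8);
    }
}
```

The largest possible sum is 255 * 256 = 65280, so nothing overflows and the final value always fits in a byte. No int-to-float conversion happens anywhere in the loop.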

First, the data you're copying from (I'm guessing it's pointed to by that void* pointer) should be memory-aligned for optimal performance; if it is not, copy it to a memory-aligned buffer.

Second, you can still use SSE2 once you've moved your data into a memory-aligned buffer. It's quite easy: I used the code here without any issues with the intrinsics (but had problems with the assembly, as detailed here).
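To make the alignment point concrete, here is a hedged sketch of multiplying every color byte by a constant with SSE2. Instead of copying into an aligned buffer, it takes a different route to the same goal: it processes the leading bytes one at a time until the pointer reaches a 16-byte boundary, then switches to aligned loads and stores. `scale_bytes` and `scale16` are hypothetical names, and the scale factor is assumed to be a fixed-point value between 0 and 256 (256 meaning 1.0).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Scale 16 unsigned bytes by scale/256 using 16-bit intermediate math. */
static __m128i scale16(__m128i px, __m128i scale)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i lo = _mm_mullo_epi16(_mm_unpacklo_epi8(px, zero), scale);
    __m128i hi = _mm_mullo_epi16(_mm_unpackhi_epi8(px, zero), scale);
    /* Products are at most 255 * 256, so they fit in 16 unsigned bits. */
    return _mm_packus_epi16(_mm_srli_epi16(lo, 8), _mm_srli_epi16(hi, 8));
}

/* Multiply each of the n bytes in buf by scale/256, in place. */
void scale_bytes(uint8_t *buf, size_t n, unsigned scale)
{
    assert(buf != NULL && scale <= 256);
    const __m128i s = _mm_set1_epi16((short)scale);
    size_t i = 0;

    /* Scalar head: advance until buf + i is 16-byte aligned. */
    for (; i < n && ((uintptr_t)(buf + i) & 15) != 0; ++i)
        buf[i] = (uint8_t)((buf[i] * scale) >> 8);

    /* Aligned body: _mm_load_si128/_mm_store_si128 require alignment. */
    for (; i + 16 <= n; i += 16) {
        __m128i *p = (__m128i *)(buf + i);
        _mm_store_si128(p, scale16(_mm_load_si128(p), s));
    }

    /* Scalar tail for the remaining bytes. */
    for (; i < n; ++i)
        buf[i] = (uint8_t)((buf[i] * scale) >> 8);
}
```

If you would rather not handle the head separately, `_mm_loadu_si128` and `_mm_storeu_si128` accept unaligned addresses; whether they are slower than the aligned forms depends on the CPU, so it is worth measuring.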

Hope this is useful. I too worked with images, stored them as unsigned char in main memory, and copied them to the SSE2 registers (which made sense, since R, G, and B each vary from 0 to 255), but I used assembly code since I felt it was easier.

But if you want to make it cross-platform, I suppose using the intrinsics would be cleaner.

Good luck!




