SSE2: How to reduce a _m128 to a word
  • Time: 2009-11-13 11:29:54
  • Tags:
  • sse
  • simd

What's the best way (SSE2) to reduce a _m128 (4 words a, b, c, d) to one word? I want the low byte of each of the _m128 components:

int result = ( _m128.a & 0x000000ff ) << 24
           | ( _m128.b & 0x000000ff ) << 16
           | ( _m128.c & 0x000000ff ) << 8
           | ( _m128.d & 0x000000ff ) << 0;

Is there an intrinsic for that? Thanks!

Answers

FYI, the SSSE3 intrinsic _mm_shuffle_epi8 does the job (with the mask 0x0004080c in the low 32 bits of the control vector in this case). Note that it belongs to SSSE3, not SSE3 or SSE2.
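A minimal sketch of the shuffle approach, assuming that "a" is the lowest 32-bit element of the vector and that SSSE3 is available on the target CPU. The names reduce_ssse3 and TARGET_SSSE3 are made up for illustration; the target attribute is a GCC/Clang convenience that stands in for compiling with -mssse3.

```c
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Let GCC/Clang compile this one function for SSSE3 without -mssse3. */
#if defined(__GNUC__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

/* Hypothetical helper: gathers the low byte of each 32-bit element.
   Assumes "a" is element 0 (the lowest dword) and "d" is element 3. */
static TARGET_SSSE3 uint32_t reduce_ssse3(__m128i x)
{
    /* Control bytes, lowest first: 0x0c, 0x08, 0x04, 0x00.
       Result byte 0 <- source byte 12 (d), byte 1 <- 8 (c),
       byte 2 <- 4 (b), byte 3 <- 0 (a).
       The upper 12 control bytes do not matter here because only
       the low 32 bits of the shuffled vector are read back. */
    const __m128i ctrl = _mm_cvtsi32_si128(0x0004080c);
    return (uint32_t)_mm_cvtsi128_si32(_mm_shuffle_epi8(x, ctrl));
}
```

Called as reduce_ssse3(_mm_setr_epi32(a, b, c, d)), this should give a in the top byte and d in the bottom byte, matching the formula in the question.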

The SSE2 answer takes more than one instruction:

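#include <emmintrin.h>  /* SSE2 intrinsics */

unsigned benoit(__m128i x)
{
    __m128i zero = _mm_setzero_si128(), mask = _mm_set1_epi32(255);
    /* Keep the low byte of each dword, then narrow twice with unsigned
       saturation. Element 0's low byte ends up in bits 0-7 of the result. */
    return _mm_cvtsi128_si32(
                _mm_packus_epi16(
                        _mm_packus_epi16(
                                _mm_and_si128(x, mask), zero), zero));
}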

The above amounts to 5 machine ops, given the input in %xmm1 and output in %eax:

 pxor     %xmm0, %xmm0
 pand     MASK, %xmm1
 packuswb %xmm0, %xmm1
 packuswb %xmm0, %xmm1
 movd     %xmm1, %eax
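One thing to watch: the two-pack sequence puts the low byte of element 0 in bits 0-7 of the result. If "a" in the question is element 0 and should land in the top byte, the elements need to be reversed first. The sketch below is one way to do that under this assumption, at the cost of one extra pshufd; reduce_scalar and reduce_sse2_a_high are illustrative names, not part of the original answer.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar reference for the formula in the question;
   v[0] plays the role of "a", v[3] the role of "d". */
static uint32_t reduce_scalar(const uint32_t v[4])
{
    return (v[0] & 0xffu) << 24 | (v[1] & 0xffu) << 16
         | (v[2] & 0xffu) << 8  | (v[3] & 0xffu);
}

/* SSE2-only variant that assumes "a" is element 0 of x. */
static uint32_t reduce_sse2_a_high(__m128i x)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i mask = _mm_set1_epi32(255);

    /* Reverse the four dwords so that "d" becomes element 0. */
    x = _mm_shuffle_epi32(x, _MM_SHUFFLE(0, 1, 2, 3));
    x = _mm_and_si128(x, mask);     /* keep the low byte of each dword   */
    x = _mm_packus_epi16(x, zero);  /* each value now fills a 16-bit lane */
    x = _mm_packus_epi16(x, zero);  /* each value now fills an 8-bit lane */
    return (uint32_t)_mm_cvtsi128_si32(x);
}
```

If the vector was built with _mm_set_epi32(a, b, c, d), then "a" is already the highest element and the reversal is unnecessary.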

If you want to see some unusual uses of SSE2, including high-speed bit-matrix transpose, string search and bitonic (GPGPU-style) sort, you might want to check my blog, Coding on the edges.

Anyway, hope that helps.




