How do you populate an x86 XMM register with 4 identical floats from another XMM register entry?

I'm trying to implement some inline assembler (in C/C++ code) to take advantage of SSE. I'd like to copy and duplicate values (from an XMM register, or from memory) to another XMM register. For example, suppose I have some values {1, 2, 3, 4} in memory. I'd like to copy these values such that xmm1 is populated with {1, 1, 1, 1}, xmm2 with {2, 2, 2, 2}, and so on.

Looking through the Intel reference manuals, I couldn't find an instruction to do this. Do I just need to use a combination of repeated MOVSS and rotates (via PSHUFD)?

Best answer

There are two ways:

  1. Use shufps exclusively:

    __m128 first = ...;
    __m128 xxxx = _mm_shuffle_ps(first, first, 0x00); // _MM_SHUFFLE(0, 0, 0, 0)
    __m128 yyyy = _mm_shuffle_ps(first, first, 0x55); // _MM_SHUFFLE(1, 1, 1, 1)
    __m128 zzzz = _mm_shuffle_ps(first, first, 0xAA); // _MM_SHUFFLE(2, 2, 2, 2)
    __m128 wwww = _mm_shuffle_ps(first, first, 0xFF); // _MM_SHUFFLE(3, 3, 3, 3)
    
  2. Let the compiler choose the best way using _mm_set1_ps and _mm_cvtss_f32:

    __m128 first = ...;
    __m128 xxxx = _mm_set1_ps(_mm_cvtss_f32(first));
    

Note that the second method produces poor code on MSVC, as discussed here, and yields only xxxx (a broadcast of element 0), unlike the first option.
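
As a rough sketch of the two options above, the following C helpers wrap each approach. The function names `splat_lanes` and `splat_lane0` are made up for illustration and are not part of any library; the intrinsics and the `_MM_SHUFFLE` macro come from `<xmmintrin.h>`. It assumes an x86 target with SSE enabled.

```c
#include <xmmintrin.h> /* SSE: __m128, _mm_shuffle_ps, _mm_set1_ps, _mm_cvtss_f32 */

/* Hypothetical helper (option 1): broadcast each lane of v into its own
 * register using shufps with v as both source operands. */
static inline void splat_lanes(__m128 v, __m128 out[4])
{
    out[0] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)); /* mask 0x00 */
    out[1] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)); /* mask 0x55 */
    out[2] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)); /* mask 0xAA */
    out[3] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3)); /* mask 0xFF */
}

/* Hypothetical helper (option 2): extract lane 0 as a scalar and let the
 * compiler pick the instruction sequence for the broadcast. */
static inline __m128 splat_lane0(__m128 v)
{
    return _mm_set1_ps(_mm_cvtss_f32(v));
}
```

For an input register holding {1, 2, 3, 4}, `out[1]` from `splat_lanes` holds {2, 2, 2, 2}, while `splat_lane0` can only ever return {1, 1, 1, 1}.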

"I'm trying to implement some inline assembler (in C/C++ code) to take advantage of SSE"

This is highly unportable. Use intrinsics.

Other answers

Copy the source into the destination register with MOVAPS, then apply SHUFPS with the destination register as both operands and the mask that selects the element you want.

The following example broadcasts the value of XMM2.x to XMM0.xyzw:

    MOVAPS XMM0, XMM2
    SHUFPS XMM0, XMM0, 0x00
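
If inline assembly really is required, the same two instructions might be embedded roughly as follows. This is only a sketch under several assumptions: GCC or Clang extended asm, the default AT&T operand order (source first, destination last), and an x86 target with SSE. The function name `broadcast_x_asm` is invented for illustration. The `x` constraint asks the compiler for any XMM register, so no register is hard-coded.

```c
#include <xmmintrin.h> /* __m128 */

/* Hypothetical helper: broadcast lane 0 of src to all four lanes.
 * Assumes GCC/Clang extended asm in AT&T syntax. */
static inline __m128 broadcast_x_asm(__m128 src)
{
    __m128 dst;
    __asm__("movaps %1, %0\n\t"        /* dst = src */
            "shufps $0x00, %0, %0"     /* dst = {dst[0], dst[0], dst[0], dst[0]} */
            : "=x"(dst)                /* output: any XMM register */
            : "x"(src));               /* input: any XMM register */
    return dst;
}
```

Changing the immediate to `$0x55`, `$0xAA`, or `$0xFF` selects lane 1, 2, or 3 instead. MSVC does not accept this syntax, which is one reason the intrinsics route is usually preferred.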

If your values are 16-byte aligned in memory:

    movdqa    (mem),    %xmm1
    pshufd    $0xff,    %xmm1,    %xmm4
    pshufd    $0xaa,    %xmm1,    %xmm3
    pshufd    $0x55,    %xmm1,    %xmm2
    pshufd    $0x00,    %xmm1,    %xmm1

If not, you can do an unaligned load, or four scalar loads. On newer platforms, the unaligned load should be faster; on older platforms the scalar loads may win.

As others have noted, you can also use shufps.
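
For comparison, here is a sketch of the load-from-memory variants using intrinsics rather than raw assembly. It assumes SSE2 (`<emmintrin.h>`) for `_mm_shuffle_epi32`, which corresponds to pshufd. The helper names `splat_from_memory` and `splat_scalar_loads` are hypothetical. The first uses an unaligned load; if the data is known to be 16-byte aligned, `_mm_load_ps` could be substituted.

```c
#include <emmintrin.h> /* SSE2: _mm_shuffle_epi32, cast intrinsics; also pulls in SSE */

/* Hypothetical helper: one unaligned load followed by four pshufd-style
 * broadcasts. The casts only reinterpret bits and generate no instructions. */
static inline void splat_from_memory(const float *src, __m128 out[4])
{
    __m128i v = _mm_castps_si128(_mm_loadu_ps(src)); /* unaligned 16-byte load */
    out[0] = _mm_castsi128_ps(_mm_shuffle_epi32(v, 0x00));
    out[1] = _mm_castsi128_ps(_mm_shuffle_epi32(v, 0x55));
    out[2] = _mm_castsi128_ps(_mm_shuffle_epi32(v, 0xAA));
    out[3] = _mm_castsi128_ps(_mm_shuffle_epi32(v, 0xFF));
}

/* Hypothetical helper: the "four scalar loads" alternative, where each
 * element is loaded and broadcast separately. */
static inline void splat_scalar_loads(const float *src, __m128 out[4])
{
    for (int i = 0; i < 4; ++i) {
        out[i] = _mm_load1_ps(src + i);
    }
}
```

Both helpers should produce identical results; which one is faster depends on the target CPU, so it is worth measuring rather than assuming.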




