Question

I'm benchmarking some SSE code (multiplying 4 floats by 4 floats) against traditional C code doing the same thing. I think my benchmark code must be incorrect in some way, because it seems to say that the non-SSE code is faster than the SSE code by a factor of 2-3.

Can someone tell me what is wrong with the benchmarking code below, and perhaps suggest another approach that accurately shows the speeds for both the SSE and non-SSE code?

#include <time.h>
#include <string.h>
#include <stdio.h>

#define ITERATIONS 100000

#define MULT_FLOAT4(X, Y) ({ \
asm volatile ( \
    "movaps (%0), %%xmm0\n\t" \
    "mulps (%1), %%xmm0\n\t" \
    "movaps %%xmm0, (%1)" \
    :: "r" (X), "r" (Y)); })

int main(void)
{
    int i, j;
    float a[4] __attribute__((aligned(16))) = { 10, 20, 30, 40 };
    time_t timer, sse_time, std_time;

    timer = time(NULL);
    for(j = 0; j < 5000; ++j)
        for(i = 0; i < ITERATIONS; ++i) {
            float b[4] __attribute__((aligned(16))) = { 0.1, 0.1, 0.1, 0.1 };

            MULT_FLOAT4(a, b);

        }
    sse_time = time(NULL) - timer;

    timer = time(NULL);
    for(j = 0; j < 5000; ++j)
        for(i = 0; i < ITERATIONS; ++i) {
            float b[4] __attribute__((aligned(16))) = { 0.1, 0.1, 0.1, 0.1 };

            b[0] *= a[0];
            b[1] *= a[1];
            b[2] *= a[2];
            b[3] *= a[3];

        }
    std_time = time(NULL) - timer;

    printf("sse_time %d\nstd_time %d\n", sse_time, std_time);

    return 0;
}

Answer 1

With optimizations enabled, the compiler eliminates the non-SSE code completely because its results are never used, whereas the SSE code stays because it is an asm volatile block, so that case is trivial. The more interesting case is with optimizations turned off: the SSE code is still slower, even though the loop code around it is the same.
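For the optimized case, one possible way to keep the scalar loop from being thrown away is sketched below. This is only an illustration, not the asker's code: the names keep_alive and time_scalar are made up here, the empty asm barrier is a GCC/Clang extension, and clock() is swapped in for time(NULL) because time() only has one-second resolution.

```c
#include <assert.h>
#include <time.h>

/* Compiler barrier (GCC/Clang extension, hypothetical helper name).
   The empty asm receives the pointer and declares a "memory" clobber,
   so the compiler must assume the pointed-to data is read and modified
   here. It emits no instructions of its own. */
static inline void keep_alive(void *p)
{
    __asm__ volatile ("" : : "r" (p) : "memory");
}

/* Runs the scalar multiply `iterations` times, copies the last result
   into out, and returns the elapsed CPU time in seconds. */
double time_scalar(long iterations, float out[4])
{
    float a[4] = { 10, 20, 30, 40 };
    float b[4] = { 0, 0, 0, 0 };
    int k;
    long i;
    clock_t start, end;

    assert(iterations >= 0);

    start = clock();
    for (i = 0; i < iterations; ++i) {
        b[0] = b[1] = b[2] = b[3] = 0.1f;
        keep_alive(b);   /* b's contents now look unknown to the optimizer */
        b[0] *= a[0];
        b[1] *= a[1];
        b[2] *= a[2];
        b[3] *= a[3];
        keep_alive(b);   /* the products look used, so they are not dropped */
    }
    end = clock();

    for (k = 0; k < 4; ++k)
        out[k] = b[k];
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_scalar(500000000L, out) and printing the returned value gives a scalar timing that should survive -O2. The SSE loop would need the same treatment to keep the comparison fair.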

Non-SSE code of the innermost loop's body:

movl    $0x3dcccccd, %eax
movl    %eax, -80(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -76(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -72(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -68(%rbp)
movss   -80(%rbp), %xmm1
movss   -48(%rbp), %xmm0
mulss   %xmm1, %xmm0
movss   %xmm0, -80(%rbp)
movss   -76(%rbp), %xmm1
movss   -44(%rbp), %xmm0
mulss   %xmm1, %xmm0
movss   %xmm0, -76(%rbp)
movss   -72(%rbp), %xmm1
movss   -40(%rbp), %xmm0
mulss   %xmm1, %xmm0
movss   %xmm0, -72(%rbp)
movss   -68(%rbp), %xmm1
movss   -36(%rbp), %xmm0
mulss   %xmm1, %xmm0
movss   %xmm0, -68(%rbp)

SSE code of the innermost loop's body:

movl    $0x3dcccccd, %eax
movl    %eax, -64(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -60(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -56(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -52(%rbp)
leaq    -48(%rbp), %rax
leaq    -64(%rbp), %rdx
movaps (%rax), %xmm0
mulps (%rdx), %xmm0
movaps %xmm0, (%rdx)

I'm not sure about this, but here's my guess:

As you can see, the compiler stores the four float values with four 32-bit stores, and they are then read back by a single 16-byte load. This causes a store-forwarding stall, which is costly when it happens; you can look this up in the Intel manuals. The stall doesn't occur in the scalar version, and that accounts for the performance difference.

To make it faster you need to make sure that this stall doesn't occur. If you are using a constant array of 4 floats, make it const and store the results in another aligned array. This way the compiler hopefully won't emit those unnecessary 4-byte movs before the load. Or, if you need to fill the resulting array, do it with a 16-byte store instruction. If you can't avoid those 4-byte movs, you need to do something else after the stores but before the load (for example, calculating something else).
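A minimal sketch of the first two suggestions, written with SSE intrinsics from <xmmintrin.h> rather than the inline asm from the question. The names factors, scale4_sse and fill4_sse are invented for this illustration, and whether the compiler really avoids the 4-byte stores still depends on the compiler and optimization level, so check the generated assembly.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>   /* SSE intrinsics: _mm_load_ps, _mm_mul_ps, ... */

/* Constant factors (hypothetical name). Being static const, the data is
   not rebuilt on the stack each iteration, so nothing has to store it
   in 4-byte pieces right before the 16-byte load. */
static const float factors[4] __attribute__((aligned(16))) =
    { 0.1f, 0.1f, 0.1f, 0.1f };

/* dst[i] = a[i] * factors[i] for i = 0..3.
   The result goes to a separate array; dst and a must be 16-byte aligned. */
void scale4_sse(float *dst, const float *a)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)a % 16 == 0);
    __m128 va = _mm_load_ps(a);
    __m128 vf = _mm_load_ps(factors);
    _mm_store_ps(dst, _mm_mul_ps(va, vf));
}

/* Fills a 16-byte aligned array of 4 floats with one value, writing it
   with an aligned 16-byte store instead of four separate 4-byte stores. */
void fill4_sse(float *dst, float value)
{
    assert((uintptr_t)dst % 16 == 0);
    _mm_store_ps(dst, _mm_set1_ps(value));
}
```

With float a[4] and float r[4] both declared __attribute__((aligned(16))), scale4_sse(r, a) leaves a untouched and writes the products to r, and fill4_sse(r, 0.1f) resets r in one store if it has to be refilled before the next load.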
