Benchmarking SSE instructions

I'm benchmarking some SSE code (multiplying 4 floats by 4 floats) against traditional C code doing the same thing. I think my benchmark code must be incorrect in some way, because it seems to say that the non-SSE code is faster than the SSE code by a factor of 2-3.

Can someone tell me what is wrong with the benchmarking code below? And perhaps suggest another approach that accurately shows the speeds for both the SSE and non-SSE code.

#include <time.h>
#include <string.h>
#include <stdio.h>

#define ITERATIONS 100000

#define MULT_FLOAT4(X, Y) ({ \
asm volatile ( \
    "movaps (%0), %%xmm0\n\t" \
    "mulps (%1), %%xmm0\n\t" \
    "movaps %%xmm0, (%1)" \
    :: "r" (X), "r" (Y)); })

int main(void)
{
    int i, j;
    float a[4] __attribute__((aligned(16))) = { 10, 20, 30, 40 };
    time_t timer, sse_time, std_time;

    timer = time(NULL);
    for(j = 0; j < 5000; ++j)
        for(i = 0; i < ITERATIONS; ++i) {
            float b[4] __attribute__((aligned(16))) = { 0.1, 0.1, 0.1, 0.1 };

            MULT_FLOAT4(a, b);
        }

    sse_time = time(NULL) - timer;

    timer = time(NULL);
    for(j = 0; j < 5000; ++j)
        for(i = 0; i < ITERATIONS; ++i) {
            float b[4] __attribute__((aligned(16))) = { 0.1, 0.1, 0.1, 0.1 };

            b[0] *= a[0];
            b[1] *= a[1];
            b[2] *= a[2];
            b[3] *= a[3];
        }

    std_time = time(NULL) - timer;

    printf("sse_time %d\nstd_time %d\n", sse_time, std_time);

    return 0;
}
When you enable optimizations, the non-SSE code is eliminated completely because its result is never used, whereas the asm volatile SSE code has to stay, so that case is trivial. The more interesting case is with optimizations turned off: the SSE code is still slower, even though the surrounding loop code is the same.

Non-SSE code of the innermost loop's body:

movl    $0x3dcccccd, %eax
movl    %eax, -80(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -76(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -72(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -68(%rbp)
movss   -80(%rbp), %xmm1
movss   -48(%rbp), %xmm0
mulss   %xmm1, %xmm0
movss   %xmm0, -80(%rbp)
movss   -76(%rbp), %xmm1
movss   -44(%rbp), %xmm0
mulss   %xmm1, %xmm0
movss   %xmm0, -76(%rbp)
movss   -72(%rbp), %xmm1
movss   -40(%rbp), %xmm0
mulss   %xmm1, %xmm0
movss   %xmm0, -72(%rbp)
movss   -68(%rbp), %xmm1
movss   -36(%rbp), %xmm0
mulss   %xmm1, %xmm0
movss   %xmm0, -68(%rbp)

SSE code of the innermost loop's body:

movl    $0x3dcccccd, %eax
movl    %eax, -64(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -60(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -56(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -52(%rbp)
leaq    -48(%rbp), %rax
leaq    -64(%rbp), %rdx
movaps (%rax), %xmm0
mulps (%rdx), %xmm0
movaps %xmm0, (%rdx)

I'm not sure about this, but here's my guess:

As you can see, the compiler stores the four float values with four 32-bit stores, and they are then read back by a single 16-byte load. This causes a store-forwarding stall, which is costly when it happens. You can look this up in the Intel manuals. It doesn't occur in the scalar version, where each 4-byte load reads back exactly one 4-byte store, and that accounts for the performance difference.

To make it faster, you need to make sure that this stall doesn't occur. If you are using a constant array of 4 floats, make it const and store the results in another aligned array. That way the compiler hopefully won't emit those unnecessary 4-byte movs before the load. Or, if you need to fill the resulting array, do it with a single 16-byte store. If you can't avoid those 4-byte movs, do some other work after the stores but before the load (for example, calculating something else).
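
Here is a minimal sketch of those suggestions, assuming GCC or Clang on x86. It is not the asker's code: the names mul4_scalar, mul4_sse and time_mul are hypothetical, and the intrinsics from <xmmintrin.h> are swapped in for the hand-written inline assembly. The inputs are const and the result goes to a separate aligned array, so nothing is written with 4-byte stores right before a 16-byte load. clock() is used instead of time(), whose one-second resolution is very coarse for this kind of measurement. The empty asm statement is a common GCC/Clang idiom intended to stop the optimizer from discarding the stores; treat it as an assumption about those compilers rather than portable C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Plain C version: out[i] = a[i] * b[i] for n floats. */
static void mul4_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
}

/* SSE version. Assumes n is a multiple of 4 and that a, b and out are
 * 16-byte aligned. Falls back to the scalar loop when SSE is unavailable. */
static void mul4_sse(const float *a, const float *b, float *out, size_t n)
{
#if defined(__SSE__)
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)out % 16 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);             /* one 16-byte load  */
        __m128 vb = _mm_load_ps(b + i);             /* one 16-byte load  */
        _mm_store_ps(out + i, _mm_mul_ps(va, vb));  /* one 16-byte store */
    }
#else
    mul4_scalar(a, b, out, n);
#endif
}

typedef void (*mul_fn)(const float *, const float *, float *, size_t);

/* Runs fn `reps` times and returns the elapsed CPU time in seconds. */
static double time_mul(mul_fn fn, const float *a, const float *b,
                       float *out, size_t n, long reps)
{
    clock_t start = clock();
    for (long r = 0; r < reps; ++r) {
        fn(a, b, out, n);
        /* Compiler barrier (GCC/Clang extension): pretend memory reachable
         * through `out` is inspected here, so the stores are not removed. */
        __asm__ volatile("" : : "r"(out) : "memory");
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

To compare the two, call time_mul(mul4_sse, ...) and time_mul(mul4_scalar, ...) on the same aligned arrays with a large reps value, and build both with the same optimization flags. With only four floats per call the loop and call overhead may dominate, so larger arrays will probably give a more meaningful comparison.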


