Question

We are planning to port a big part of our Digital Signal Processing routines from hardware-specific chips to a common desktop CPU architecture, such as a quad-core. I am trying to estimate the limits of such an architecture for a program built with GCC. I am mostly interested in high SDRAM-CPU bandwidth [Gb/sec] and in a high number of 32-bit IEEE-754 floating-point Multiply-Accumulate operations per second.

I have selected a typical representative of the modern desktop CPUs -
Quad Core, about 10Mb cache, 3GHz, 45nm.
Can you please help me to find out its limits:

1) Highest possible number of Multiply-Accumulate operations per second, assuming all cores are used and the CPU-specific instructions are enabled through GCC command-line flags. The source code itself must not require changes if we decide to port it to a different CPU architecture, such as Altivec on PowerPC; the best option is to use GCC flags like -msse or -maltivec. I also assume the program has to have 4 threads in order to utilize all available cores, right?

2) SDRAM-CPU bandwidth (the upper limit, i.e. independent of the mainboard).

UPDATE: Since GCC 3, GCC can automatically generate SSE/SSE2 scalar code when the target supports those instructions. Automatic vectorization for SSE/SSE2 has been added since GCC 4. SSE4.1 introduces DPPS, DPPD instructions - Dot product for Array of Structs data. New 45nm Intel processors support SSE4 instructions.

Answer 1

First off, know that it will most likely not be possible for your code to both run as fast as possible on modern vector FPU units and be completely portable across architectures. It is possible to abstract away some aspects of the architectures via macros, etc, but compilers are (at present) capable of generating nearly optimal auto-vectorized code only for very simple programs.
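To give an idea of what "very simple" means here, below is a minimal sketch of a multiply-accumulate loop of the kind an auto-vectorizer can typically handle. The function name `mac_arrays` is made up for illustration, and whether your GCC version actually vectorizes it depends on the version and flags, so check the generated assembly rather than assuming.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise multiply-accumulate: acc[i] += a[i] * b[i].
 * `restrict` tells the compiler the three arrays do not overlap,
 * which removes one common obstacle to auto-vectorization.
 * (mac_arrays is a hypothetical name, not from any library.) */
void mac_arrays(float *restrict acc, const float *restrict a,
                const float *restrict b, size_t n)
{
    assert(acc != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        acc[i] += a[i] * b[i];  /* independent per element, no reduction */
    }
}
```

The same source can be built with something like -std=c99 -O3 -msse2 on x86, or with -maltivec on PowerPC. Each iteration is independent of the others, so there is no horizontal step for the compiler to work around.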

Now, on to your questions: current x86 hardware does not have a multiply-accumulate, but is capable of one vector add and one vector multiply per cycle per core. Assuming that your code achieves full computational density, and you either hand-write vector code or your code is simple enough for the compiler to handle the task, the peak throughput that can be achieved independent of memory access latency is:

number of cores * cycles per second * flops per cycle * vector width

Which in your case sounds like:

4 * 3.2 GHz * 2 vector flops/cycle * 4 floats/vector = 102.4 Gflops

If you are going to write scalar code, divide that by four. If you are going to write vector code in C with some level of portable abstraction, plan to be leaving some performance on the table, but you can certainly go substantially faster than scalar code will allow. 50% of theoretical peak is a conservative guess (I would expect to do better assuming the algorithms are amenable to vectorization, but make sure you have some headroom in your estimates).
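The arithmetic above is easy to redo for other parts. Here is a small sketch (the function name `peak_gflops` is just something I made up) that evaluates the same formula, so you can plug in your own core count and clock:

```c
#include <assert.h>

/* Theoretical peak in Gflops:
 *   cores * clock (GHz) * vector flops per cycle * floats per vector.
 * Pass vector_width = 1 for the scalar-code estimate.
 * (peak_gflops is a hypothetical helper for illustration.) */
double peak_gflops(int cores, double clock_ghz,
                   int flops_per_cycle, int vector_width)
{
    assert(cores > 0 && clock_ghz > 0.0);
    assert(flops_per_cycle > 0 && vector_width > 0);
    return cores * clock_ghz * flops_per_cycle * vector_width;
}
```

With the numbers used above, `peak_gflops(4, 3.2, 2, 4)` gives 102.4, the scalar case `peak_gflops(4, 3.2, 2, 1)` gives 25.6, and halving the vector figure for the conservative 50% estimate leaves about 51.2 Gflops to plan around.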

edit: notes on DPPS:

DPPS is not a multiply-add, and using it as one is a performance hazard on current architectures. Looking it up in the Intel Optimization Manual, you will find that it has a latency of 11 cycles, and throughput is only one vector result every two cycles. DPPS does up to four multiplies and three adds, so you're getting 2 multiplies and 1.5 adds per cycle, whereas using MULPS and ADDPS would get you 4 of each every cycle.

More generally, horizontal vector operations should be avoided unless absolutely necessary; lay out your data so that your operations stay within vector lanes to the maximum extent possible.

In fairness to Intel, if you can't change your data layout, and DPPS happens to be exactly the operation that you need, then you want to use it. Just be aware that you're limiting yourself to less than 50% of peak FP throughput right off the bat by doing so.
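To illustrate the layout advice, here is a rough sketch of computing many 4-element dot products with a structure-of-arrays layout instead of an array of structs. The names `Vec4SoA` and `dot4_soa` are hypothetical. The point is that each output element only needs vertical multiplies and adds across `i`, so the work can map onto MULPS/ADDPS-style operations without any horizontal step.

```c
#include <assert.h>
#include <stddef.h>

/* Structure-of-arrays: each component lives in its own contiguous
 * array, so element i of every array belongs to the same 4-vector.
 * (Vec4SoA and dot4_soa are hypothetical names for illustration.) */
typedef struct {
    const float *x;
    const float *y;
    const float *z;
    const float *w;
} Vec4SoA;

/* out[i] = dot((x[i], y[i], z[i], w[i]), c) for i in [0, n). */
void dot4_soa(float *restrict out, Vec4SoA v, const float c[4], size_t n)
{
    assert(out != NULL && c != NULL);
    /* Copy the coefficients to locals so they can stay in registers. */
    const float c0 = c[0], c1 = c[1], c2 = c[2], c3 = c[3];
    for (size_t i = 0; i < n; ++i) {
        /* Only lane-wise multiplies and adds; neighbouring i values
         * never interact, so several i can be processed per vector. */
        out[i] = v.x[i] * c0 + v.y[i] * c1 + v.z[i] * c2 + v.w[i] * c3;
    }
}
```

With an array-of-structs layout the four products for one result sit inside a single vector register and have to be summed horizontally, which is the case DPPS was designed for.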

Answer 2

This may not directly answer your question, but have you considered using the PC's graphics cards for parallel floating-point computations? It's getting to the point where GPUs will outperform CPUs for some tasks, and the nice thing is that graphics cards are reasonably competitively priced.

I'm short on details, sorry; this is just to give you an idea.

Answer 3

Some points you should consider:

1) Intel's i7 architecture is at the moment your fastest option for 1 or 2 CPUs. Only with 4 or more sockets can AMD's Opterons compete.

2) Intel's compilers generate code that is often significantly faster than code generated by other compilers (when it is used on AMD's CPUs, you have to patch away some CPU checks that Intel puts in to prevent AMD from looking good).

3) No x86 CPU supports multiply-and-add yet; AMD's next architecture, "Bulldozer", will probably be the first to support it.

4) You get high memory bandwidth on any AMD CPU, and on Intel only with the new i7 architecture (socket 1366 is better than 775).

5) Use Intel's highly efficient libraries if possible.
