C - what are the speed limits of desktop CPUs for a program built with GCC using all optimization flags?

We are planning to port a large part of our digital signal processing routines from hardware-specific chips to a common desktop CPU architecture, such as a quad-core. I am trying to estimate the limits of such an architecture for a program built with GCC. I am mostly interested in high SDRAM-CPU bandwidth [Gb/sec] and in a high number of 32-bit IEEE-754 floating-point multiply-accumulate operations per second.

I have selected a typical representative of modern desktop CPUs:
quad core, about 10Mb cache, 3GHz, 45nm.
Can you please help me find out its limits:

1) The highest possible number of multiply-accumulate operations per second, if the CPU's specific instructions that GCC supports via input flags are used and all cores are used. The source code itself must not require changes if we decide to port it to a different CPU architecture such as AltiVec on PowerPC; the best option is to use GCC flags like -msse or -maltivec. I also assume the program has to have 4 threads in order to utilize all available cores, right?

2) SDRAM-CPU bandwidth (the highest limit, so independent of the mainboard).

UPDATE: Since GCC 3, GCC can automatically generate SSE/SSE2 scalar code when the target supports those instructions. Automatic vectorization for SSE/SSE2 has been added since GCC 4. SSE4.1 introduces DPPS, DPPD instructions - Dot product for Array of Structs data. New 45nm Intel processors support SSE4 instructions.
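As a rough illustration of the kind of source the question has in mind, here is a minimal sketch of a multiply-accumulate loop written in plain C99, with no intrinsics, so that the same file could be built with -msse2 on x86 or -maltivec on PowerPC. The function name mac_arrays is made up for this example. The OpenMP pragma is one possible way to spread the loop over all cores and is only honoured when compiling with -fopenmp. Whether GCC actually vectorizes the loop depends on the compiler version and flags, so it is worth checking the generated assembly.

```c
#include <stddef.h>

/* Multiply-accumulate over arrays: acc[i] += a[i] * b[i].
 * mac_arrays is a hypothetical name used only for this sketch.
 * `restrict` tells the compiler the arrays do not overlap, which is
 * usually needed before the auto-vectorizer will emit packed code. */
void mac_arrays(float *restrict acc, const float *restrict a,
                const float *restrict b, size_t n)
{
    /* Honoured only when built with -fopenmp; otherwise the loop runs
     * on a single thread. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i)
        acc[i] += a[i] * b[i];
}
```

A possible build line on x86 would be gcc -std=c99 -O3 -msse2 -ftree-vectorize -fopenmp; on PowerPC, -msse2 would be replaced by -maltivec.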

Best answer

First off, know that it will most likely not be possible for your code to both run as fast as possible on modern vector FPU units and be completely portable across architectures. It is possible to abstract away some aspects of the architectures via macros, etc, but compilers are (at present) capable of generating nearly optimal auto-vectorized code only for very simple programs.

Now, on to your questions: current x86 hardware does not have a multiply-accumulate instruction, but it is capable of one vector add and one vector multiply per cycle per core. Assuming that your code achieves full computational density, and you either hand-write vector code or your code is simple enough for the compiler to handle the task, the peak throughput that can be achieved independent of memory access latency is:

number of cores * cycles per second * flops per cycle * vector width

Which in your case sounds like:

4 * 3.2 GHz * 2 vector flops/cycle * 4 floats/vector = 102.4 Gflops
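The formula above can be written as a small helper to try other configurations. This is only a sketch; the function and parameter names are made up for the example.

```c
/* Theoretical peak in Gflops:
 *   cores * clock (GHz) * vector ops per cycle * floats per vector.
 * peak_gflops and its parameter names are hypothetical. */
double peak_gflops(int cores, double clock_ghz,
                   int vector_ops_per_cycle, int floats_per_vector)
{
    return cores * clock_ghz * vector_ops_per_cycle * floats_per_vector;
}
```

Calling peak_gflops(4, 3.2, 2, 4) reproduces the 102.4 Gflops figure. With the 3 GHz clock mentioned in the question, the same formula gives 96 Gflops.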

If you are going to write scalar code, divide that by four. If you are going to write vector code in C with some level of portable abstraction, plan to be leaving some performance on the table, but you can certainly go substantially faster than scalar code will allow. 50% of theoretical peak is a conservative guess (I would expect to do better assuming the algorithms are amenable to vectorization, but make sure you have some headroom in your estimates).

edit: notes on DPPS:

DPPS is not a multiply-add, and using it as one is a performance hazard on current architectures. Looking it up in the Intel Optimization Manual, you will find that it has a latency of 11 cycles, and throughput is only one vector result every two cycles. DPPS does up to four multiplies and three adds, so you're getting 2 multiplies and 1.5 adds per cycle, whereas using MULPS and ADDPS would get you 4 of each every cycle.

More generally, horizontal vector operations should be avoided unless absolutely necessary; lay out your data so that your operations stay within vector lanes to the maximum extent possible.
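One way to keep operations within vector lanes is a structure-of-arrays layout, where each component is stored in its own array and many dot products are computed side by side. The following is a minimal sketch of that idea in plain C; the type vec4_soa and the function dot4_soa are made up for this example, and whether the compiler turns the loop into MULPS/ADDPS sequences should be checked in the generated assembly.

```c
#include <stddef.h>

/* Structure-of-arrays layout: one array per component.
 * vec4_soa and dot4_soa are hypothetical names for this sketch. */
typedef struct {
    const float *x, *y, *z, *w;
} vec4_soa;

/* Computes n independent 4-element dot products:
 *   out[i] = a.x[i]*b.x[i] + a.y[i]*b.y[i] + a.z[i]*b.z[i] + a.w[i]*b.w[i]
 * Every multiply and add combines element i with element i, so a
 * vectorized version needs no horizontal (cross-lane) operations. */
void dot4_soa(float *restrict out, vec4_soa a, vec4_soa b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i]
               + a.z[i] * b.z[i] + a.w[i] * b.w[i];
}
```

With an array-of-structs layout, each dot product would instead have to sum across the four elements of one vector, which is the horizontal pattern DPPS exists for.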

In fairness to Intel, if you can't change your data layout, and DPPS happens to be exactly the operation that you need, then you want to use it. Just be aware that you're limiting yourself to less than 50% of peak FP throughput right off the bat by doing so.
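The "less than 50%" figure follows from the numbers quoted above, assuming DPPS is the only floating-point instruction being issued. A small sketch of the arithmetic (the function name is made up for the example):

```c
/* Fraction of peak FP throughput reached when only DPPS is issued,
 * using the figures quoted in this answer.
 * dpps_fraction_of_peak is a hypothetical name. */
double dpps_fraction_of_peak(void)
{
    const double dpps_flops = 4.0 + 3.0;           /* 4 multiplies + 3 adds per DPPS */
    const double dpps_cycles = 2.0;                /* one result every two cycles */
    const double peak_flops_per_cycle = 4.0 + 4.0; /* one MULPS + one ADDPS per cycle */
    return (dpps_flops / dpps_cycles) / peak_flops_per_cycle;
}
```

This works out to 3.5 flops per cycle against a peak of 8, or about 44% of peak.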

Other answers

This may not directly answer your question, but have you considered using the PC's graphics cards for parallel floating-point computations? It's getting to the point where GPUs will outperform CPUs for some tasks, and the nice thing is that graphics cards are reasonably competitively priced.

I'm short on details, sorry; this is just to give you an idea.

Some points you should consider:

1) Intel's i7 architecture is at the moment your fastest option for 1 or 2 CPUs. AMD's Opterons can compete only with 4 or more sockets.

2) Intel's compilers generate code that is often significantly faster than code generated by other compilers (when used on AMD's CPUs you have to patch away some CPU checks that Intel puts in to prevent AMD from looking good).

3) No x86 CPU supports multiply-and-add yet; AMD's next architecture, "Bulldozer", will probably be the first to support it.

4) You get high memory bandwidth on any AMD CPU, and on Intel only with the new i7 architecture (socket 1366 is better than 775).

5) Use Intel's highly efficient libraries if possible.




