English 中文(简体)
Prefetch for Intel Core 2 Duo
原标题:

Has anyone had experience using prefetch instructions for the Core 2 Duo processor?

I ve been using the (standard?) prefetch set (prefetchnta, prefetcht1, etc) with success for a series of P4 machines, but when running the code on a Core 2 Duo it seems that the prefetcht(i) instructions do nothing, and that the prefetchnta instruction is less effective.

My criteria for assessing performance is the timing results for a BLAS 1 vector-vector (axpy) operation, when the vector size is large enough for out-of-cache behaviour.

Have Intel introduced new prefetch instructions?

问题回答

From an Intel reference document on Intel 64 and IA-32 Architectures, check out page 163 and 77:

Pentium 4 and Intel Xeon processors based on Intel NetBurst microarchitecture introduced hardware prefetching in addition to software prefetching. The hardware prefetcher operates transparently to fetch data and instruction streams from memory without requiring programmer intervention. Subsequent microarchitectures continue to improve and add features to the hardware prefetching mechanisms. Earlier implementations of hardware prefetching mechanisms focus on prefetching data and instruction from memory to L2; more recent implementations provide additional features to prefetch data from L2 to L1. In Intel NetBurst microarchitecture, the hardware prefetcher can track 8 independent streams.

The Pentium M processor also provides a hardware prefetcher for data. It can track 12 separate streams in the forward direction and 4 streams in the backward direction. The processor’s PREFETCHNTA instruction also fetches 64-bytes into the firstlevel data cache without polluting the second-level cache.

Intel Core Solo and Intel Core Duo processors provide more advanced hardware prefetchers for data than Pentium M processors. Key differences are summarized in Table 2-10.

I don t know whether it might be an issue with your code, but consider that the cache line size (which determines the stride size for use with prefetch instructions) may vary between different processors. Therefore, if you use code which is optimised under the assumption of a different cache line size on a CPU where this assumption isn t met, it s bound to deteriorate performance.

This question here asked how to determine prefetch cache line size.

I ve tried this once on a tight loop I was trying to optimize that loaded 4 doubles and did about 15 floating point operations per loop. I found that to have a positive effect on a core 2 duo, the prefetch needed to be set for at least 16 loops ahead in the code, where for older processors 4 loops ahead was enough.





相关问题
List of suspected Malicious patterns

I am doing an anti-virus project by disassembling its code and analyzing it. So I want a list of the Suspected Malicious pattern codes, so I can observe which is suspected and which is not? So I want ...

Prefetch for Intel Core 2 Duo

Has anyone had experience using prefetch instructions for the Core 2 Duo processor? I ve been using the (standard?) prefetch set (prefetchnta, prefetcht1, etc) with success for a series of P4 ...

How are mutex and lock structures implemented?

I understand the concept of locks, mutex and other synchronization structures, but how are they implemented? Are they provided by the OS, or are these structures dependent on special CPU instructions ...

Installing GNU Assembler in OSX

No matter how hard I google, I can t seem to find a (relatively) easy-to-follow instruction on how to install the GNU Assembler on a mac. I know I can use gcc -c (Apple Clang on a Mac) to assemble .s ...

8086 assembler,INT 16,2

I got stuck at this thing,I want to see if right shift button has been pressed,so I have this assambler code: mov ah,2 int 16h ;calling INT 16,2 - Read Keyboard Flags interrupt mov ah,...

Translate a FOR to assembler

I need to translate what is commented within the method, to assembler. I have a roughly idea, but can t. Anyone can help me please? Is for an Intel x32 architecture: int secuencia ( int n, ...

热门标签