How can I do a CPU cache flush in x86 Windows?

I am interested in forcing a CPU cache flush in Windows (for benchmarking reasons, I want to emulate starting with no data in CPU cache), preferably a basic C implementation or Win32 call.

Is there a known way to do this with a system call or even something as sneaky as doing say a large memcpy?

Intel i686 platform (P4 and up is okay as well).

Best answer

Fortunately, there is more than one way to explicitly flush the caches.

The instruction "wbinvd" writes back modified cache content and marks the caches empty. It executes a bus cycle to make external caches flush their data. Unfortunately, it is a privileged instruction. But if it is possible to run the test program under something like DOS, this is the way to go. This has the advantage of keeping the cache footprint of the "OS" very small.

Additionally, there is the "invd" instruction, which invalidates the caches without writing modified lines back to main memory. Any modified data still in the cache is lost, which breaks the coherency of main memory and cache, so you have to take care of that yourself. Not really recommended.

For benchmarking purposes, the simplest solution is probably copying a large memory block to a region marked as WC (write combining) instead of WB (write back). The memory-mapped region of the graphics card is a good candidate, or you can mark a region as WC yourself via the MTRR registers.

You can find some resources about benchmarking short routines at Test programs for measuring clock cycles and performance monitoring.

Answers

There are x86 assembly instructions that force the CPU to flush specific cache lines (such as CLFLUSH), but they are pretty obscure. Note that CLFLUSH does not empty the whole cache: it flushes only the cache line containing the given address, but it does so from all levels of cache (L1, L2, L3).
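
As a rough sketch of the CLFLUSH approach: compilers expose the instruction through the SSE2 intrinsic _mm_clflush (declared in emmintrin.h). The helper name flush_range and the 64-byte line size are assumptions made for illustration; the real line size can be queried via CPUID.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_clflush, _mm_mfence */
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size. 64 bytes is typical for current x86 CPUs,
   but this is an assumption, not something the code detects. */
#define CACHE_LINE 64

/* Hypothetical helper: flushes every cache line that overlaps
   [p, p + len) and returns how many CLFLUSH instructions were issued. */
static size_t flush_range(const void *p, size_t len)
{
    if (len == 0)
        return 0;
    assert(p != NULL);

    /* Round the start address down to a line boundary. */
    uintptr_t addr = (uintptr_t)p & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)p + len;
    size_t lines = 0;

    _mm_mfence();  /* order earlier loads/stores before the flushes */
    for (; addr < end; addr += CACHE_LINE) {
        _mm_clflush((const void *)addr);
        lines++;
    }
    _mm_mfence();  /* make sure the flushes are done before continuing */
    return lines;
}
```

Calling flush_range(data, size) immediately before the timed region should leave only that buffer cold, while code and stack stay cached. On 32-bit GCC builds the -msse2 flag may be needed.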

something as sneaky as doing say a large memcopy?

Yes, this is the simplest approach: touching a block that is several times larger than the last-level cache will, in practice, push your data out of all levels of cache. Just exclude the cache flushing time from your benchmarks and you should get a good idea of how your program performs under cache pressure.
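
A minimal sketch of that idea, assuming a 64-byte cache line. The function name thrash_cache is made up for this example, and the right buffer size depends on the CPU (something like 64 MiB should exceed the last-level cache of most desktop parts, but that is an assumption to verify).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define LINE 64  /* assumed cache line size in bytes */

/* Hypothetical helper: writes and then reads one byte in every cache
   line of a scratch buffer of `bytes` bytes, so that previously cached
   data is likely evicted. Returns the number of lines touched, or 0 if
   the allocation failed. */
static size_t thrash_cache(size_t bytes)
{
    assert(bytes >= LINE);

    unsigned char *scratch = malloc(bytes);
    if (scratch == NULL)
        return 0;

    /* Access through a volatile pointer so the compiler cannot
       optimize the loops (or the allocation) away. */
    volatile unsigned char *p = scratch;
    size_t lines = 0;
    unsigned sum = 0;

    for (size_t i = 0; i < bytes; i += LINE) {
        p[i] = (unsigned char)i;   /* write pass */
        lines++;
    }
    for (size_t i = 0; i < bytes; i += LINE)
        sum += p[i];               /* read pass */
    (void)sum;

    free(scratch);
    return lines;
}
```

Call it between benchmark iterations, outside the timed region, for example thrash_cache((size_t)64 << 20). Because cache replacement is not guaranteed to be strict LRU, this is an approximation rather than a guaranteed flush.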

There is unfortunately no way to explicitly flush the cache. A few of your options are:

1.) Thrash the cache by doing some very large memory operations between iterations of the code you're benchmarking.

2.) Set the Cache Disable (CD) bit in the x86 control register CR0 and benchmark that. This will probably disable the instruction cache as well, which may not be what you want.

3.) Implement the portion of the code you're benchmarking (if possible) using non-temporal instructions. These are only hints to the processor about cache usage, though; it's still free to do what it wants.

Option 1 is probably the easiest and sufficient for your purposes.
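
Option 3 can be sketched with the SSE2 streaming-store intrinsic _mm_stream_si128. The function name fill_nt and its alignment requirements are choices made for this example; it only illustrates non-temporal stores, which cover writes and not reads.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_set1_epi32, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fills dst[0..count) with `value` using
   non-temporal (streaming) stores, which hint to the CPU that the
   written data should not be kept in the cache.
   Assumptions: dst is 16-byte aligned and count is a multiple of 4. */
static void fill_nt(int32_t *dst, size_t count, int32_t value)
{
    assert(((uintptr_t)dst & 15) == 0);
    assert(count % 4 == 0);

    __m128i v = _mm_set1_epi32(value);  /* four copies of value */
    for (size_t i = 0; i < count; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);

    /* Streaming stores are weakly ordered; the fence makes them
       globally visible before later stores. */
    _mm_sfence();
}
```

The stored values are the same as with ordinary stores; only the caching behavior differs, and the processor may still treat the hint differently from what you expect.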

Edit: Oops, I stand corrected: there is an instruction to invalidate the x86 cache; see drhirsch's answer.

The x86 instruction WBINVD writes back and invalidates all caches. It is described as:

Writes back all modified cache lines in the processor’s internal cache to main memory and invalidates (flushes) the internal caches. The instruction then issues a special-function bus cycle that directs external caches to also write back modified data and another bus cycle to indicate that the external caches should be invalidated.

Importantly, the instruction can only be executed in ring 0, i.e. by the operating system kernel, so your userland programs can't simply use it. On Linux, you can write a kernel module that executes the instruction on demand. Actually, someone already wrote such a kernel module: https://github.com/batmac/wbinvd

Luckily, the kernel module's code is really tiny, so you can actually check it before loading code from strangers on the internet into your kernel. You can use that module (and trigger the WBINVD instruction) by reading /proc/wbinvd, for example via cat /proc/wbinvd.

However, I found that this instruction (or at least this kernel module) is really slow. On my i7-6700HQ I measured it to take 750 µs! This number seems really high to me, so I might have made a mistake measuring this; please keep that in mind! The documentation of the instruction only says:

The amount of time or cycles for WBINVD to complete will vary due to size and other factors of different cache hierarchies.
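
For reference, a measurement like the one above could be sketched as follows on Linux. This assumes the kernel module is loaded and exposes /proc/wbinvd as described; the function name timed_read is made up for this example, and the result includes the overhead of opening and reading the file, not just the instruction itself.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: reads the file at `path` to the end and returns
   the elapsed wall-clock time in nanoseconds, or -1 if the file could
   not be opened (e.g. the module is not loaded). */
static long long timed_read(const char *path)
{
    assert(path != NULL);

    struct timespec t0, t1;
    char buf[64];

    clock_gettime(CLOCK_MONOTONIC, &t0);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    while (fread(buf, 1, sizeof buf, f) > 0) {
        /* discard the contents; the read itself is the trigger */
    }
    fclose(f);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (t1.tv_nsec - t0.tv_nsec);
}
```

Calling timed_read("/proc/wbinvd") several times and comparing the results gives a rough idea of the cost on a given machine.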




