How to store items in the LIFO stack in a cache-friendly manner?
EDIT / DISCLAIMER: It appears that my understanding of cache lines was wrong, which makes this question irrelevant and misleading. I thought that whenever the CPU tries to fetch memory at a specific index, it also grabs the indexes right after the first one, which is NOT true. See "How do cache lines work?" for reference.

I was about to write a data container for storing a contiguous, resizable memory block in which items can only be accessed from one end by pushing or popping: basically a LIFO stack. I've been reading a lot about CPU caches and memory access patterns, and I wanted said container to be as cache-friendly as possible. So I took a look at common implementations of LIFO stacks on the internet. Most of them suggest using a dynamic array as the base and accessing the data by appending or removing items at the end of the array. In this case the unused capacity (the gap) is stored after the data, in the following way:

stack bottom             stack top
array begin              |               array end
v                        v                v
|<--------- data --------->|<---- gap --->|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|8 |7 |6 |5 |4 |3 |2 |1 |0 |  |  |  |  |  |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|       cache line      |

However, this does not seem very cache-friendly to me. Whenever one looks up the item on top of the stack, the CPU would fetch memory from the gap, which contains garbage, or at least that is how I understand it.

Would the approach where the gap sits at the beginning of the memory block and the data at the end be more performant? In this case the indexes of the items would be reversed, as would bottom and top, and a cache line could hit more items, like so:

                stack top      stack bottom
                v                       v
|<---- gap --->|<--------- data --------->|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|  |  |  |  |  |0 |1 |2 |3 |4 |5 |6 |7 |8 |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
               |       cache line      |

Is my understanding correct here? Which approach is more performant and cache-friendly? I could be totally wrong in my assumptions. As I said, most of the implementations that I saw use the first approach, so there must be something to it. Cheers!
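(For reference, here is a minimal sketch of the first, grows-up layout in C++, assuming trivially copyable elements so that C realloc can be used for growth; the class and member names are made up for illustration:)

    #include <cassert>
    #include <cstddef>
    #include <cstdlib>
    #include <type_traits>

    // Sketch of the conventional layout: data at the front of the
    // allocation, the spare capacity (the gap) after it.
    template <typename T>
    class GrowUpStack {
        static_assert(std::is_trivially_copyable_v<T>,
                      "realloc may move the block with a raw byte copy");
        T*          data_     = nullptr;
        std::size_t size_     = 0;   // items in use; bottom is index 0
        std::size_t capacity_ = 0;   // total slots in the allocation
    public:
        GrowUpStack() = default;
        GrowUpStack(const GrowUpStack&) = delete;
        GrowUpStack& operator=(const GrowUpStack&) = delete;
        ~GrowUpStack() { std::free(data_); }

        void push(T value) {
            if (size_ == capacity_) grow();
            data_[size_++] = value;      // top of stack = end of the data
        }
        T pop() {
            assert(size_ > 0);
            return data_[--size_];       // the gap grows back over the old top
        }
    private:
        void grow() {
            std::size_t new_cap = capacity_ ? capacity_ * 2 : 16;
            // realloc can often extend in place; new space appears at the
            // END of the block, exactly where this layout keeps its gap.
            T* p = static_cast<T*>(std::realloc(data_, new_cap * sizeof(T)));
            assert(p && "sketch: no real out-of-memory handling");
            data_ = p;
            capacity_ = new_cap;
        }
    };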
Best answer
LIFO stacks using an array are equally cache-friendly for either growth direction, unless you have a specific pattern of sizes that happens to favour an odd-sized allocation and growing down.

APIs for efficient reallocation (without copying the data), like Linux mremap or C realloc, are designed around adding new space at the end of an allocation, not the start, so that's a point in favour of the normal way if your stack can grow beyond the initial allocation. At least if you're using an allocation API that supports those, not like C++'s crippled new/delete, which essentially forces std::vector to suck.

Other than that, they're equal unless your high-water mark is commonly 1 or 2 elements more than a multiple of 64 bytes, like you've shown here to make your other version look good. In that version, the first element pushed is in a cache line by itself, but the top-of-stack hasn't yet crossed into a new cache line. If you push one more element in your second diagram, it will be the only in-use element in its cache line. It only looks good because you picked a number of elements for the current stack state that happens to work well with the misalignment (relative to cache-line boundaries) you chose for the first element (the "bottom" of the stack, at the highest address for your grows-down stack).

With your grows-down, gap-at-the-front design, the first element pushed (highest address) is potentially in the same cache line as some other useful data, so that cache line staying hot can be helpful (if your stack frequently becomes empty, so you actually access that element). Memory allocators often round allocations up to multiples of some larger size like 16 bytes, but there could still be other allocations in the same cache line. Still, unless you know something special about your expected stack sizes and/or what comes next, normally you'd want to allocate a chunk of memory that's a multiple of the cache-line size.

Fun fact: the asm call stack grows down on mainstream ISAs (and in calling conventions for ISAs like MIPS that don't favour one stack-growth direction); e.g. x86 push and call decrement the stack-pointer register (RSP). The reasons for this aren't cache efficiency; they're more historical, dating back to before CPUs had caches. It doesn't present any particular problems for CPU caches, though. CPU/asm stacks are allocated such that the highest-address cache line is fully used, unlike yours, where the end of the allocation isn't the end of a cache line. (Typically "the stack" is a multiple of the page size, so there won't be other stuff next to it in the same line anyway.)
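To make the sizing advice concrete, here is a small sketch (my illustration, not code from the answer) that rounds a requested capacity up to a whole number of cache lines and allocates it with C++17 std::aligned_alloc (not available on MSVC), so the block both starts and ends on a line boundary. The 64-byte line size and the helper name round_up_to_lines are assumptions:

    #include <cstdio>
    #include <cstdlib>

    // Assumed line size; real code could query the OS or
    // std::hardware_destructive_interference_size instead.
    constexpr std::size_t kCacheLine = 64;

    // Hypothetical helper: round a byte count up to whole cache lines.
    constexpr std::size_t round_up_to_lines(std::size_t bytes) {
        return (bytes + kCacheLine - 1) / kCacheLine * kCacheLine;
    }

    int main() {
        // aligned_alloc requires the size to be a multiple of the alignment,
        // which round_up_to_lines guarantees. The block then starts and ends
        // on cache-line boundaries, so neither growth direction leaves a
        // partially used line at either edge of the allocation.
        std::size_t bytes = round_up_to_lines(1000);   // -> 1024
        void* block = std::aligned_alloc(kCacheLine, bytes);
        if (!block) return 1;
        std::printf("%zu bytes = %zu cache lines\n", bytes, bytes / kCacheLine);
        std::free(block);
        return 0;
    }

Note that growing such a block later with realloc would not preserve the cache-line alignment guarantee, so a growth step would typically allocate a new aligned block and copy.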
Answers

No answers yet



