Question

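I have been tasked with generating a certain number of data-cache misses and instruction-cache misses. I have been able to handle the data-cache part without any trouble.

So I'm left with generating the instruction-cache misses. I have no idea what causes them. Can someone suggest a method of generating them?

I'm using GCC on Linux.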

Answer 1

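As people have explained, an instruction cache miss is conceptually the same as a data-cache miss: the instructions are not in the cache. This happens because the processor's program counter (PC) has jumped to a place that has not been loaded into the cache, or that has been flushed out because the cache filled up and that cache line was the one chosen for eviction (usually the least recently used).

It is a bit harder to write enough code by hand to force an instruction miss than it is to force a data-cache miss.

One way to get lots of code for little effort is to write a program that generates source code.

For example, write a program (in C) that generates a function containing one huge switch statement: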

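#include <stdio.h>

/* Generator: prints the C source of one huge switch function to stdout. */
int main(void) {
    printf("int bigswitch(int n) {\n    switch (n) {\n");
    for (int i = 1; i < 100000; ++i) {
        /* No break is emitted, so each case falls through to the next. */
        printf("        case %d: n += %d;\n", i, i + i/2);
    }
    printf("    }\n    return n;\n}\n");
    return 0;
}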

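Then you can call this from another function, and you can control how far along the cache lines each call jumps.

A property of a switch statement is that the code can be forced to execute backwards, or in patterns, by choosing the parameter. So you can work with the prefetching and prediction mechanisms, or try to work against them.

The same technique could be applied to generate lots of functions too, to ensure the cache can be "busted" at will. So you might have bigswitch001, bigswitch002, etc. You might call these using a switch which you also generate.

If you can make each function (approximately) some number of i-cache lines in size, and also generate more functions than will fit in the cache, then the problem of generating instruction-cache misses becomes easier to control.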

You can see exactly how big a function, an entire switch statement, or each leg of a switch statement is by dumping the assembler (using gcc -S), or objdump the .o file. So you could tune the size of a function by adjusting the number of case: statements. You could also choose how many cache lines are hit, by judicious choice of the parameter to bigswitchNNN().
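A minimal sketch of that tuning step, assuming you have already measured how many bytes one case occupies in the generated assembly. The names cases_needed and emit_bigswitch, and the per-case size passed in, are hypothetical and not part of the original answer; the real per-case size depends on the compiler and optimization level.

```c
#include <stdio.h>

/* Round up: how many case labels are needed to reach target_bytes of code.
 * bytes_per_case is an assumption that must be measured from the
 * compiler's output (gcc -S or objdump), not guessed. */
int cases_needed(int target_bytes, int bytes_per_case)
{
    return (target_bytes + bytes_per_case - 1) / bytes_per_case;
}

/* Print the source of one function, bigswitchNNN, with n_cases
 * fall-through cases, to the stream `out`. */
void emit_bigswitch(FILE *out, int index, int n_cases)
{
    fprintf(out, "int bigswitch%03d(int n) {\n    switch (n) {\n", index);
    for (int i = 1; i <= n_cases; ++i)
        fprintf(out, "        case %d: n += %d;\n", i, i + index);
    fprintf(out, "    }\n    return n;\n}\n\n");
}
```

For instance, emit_bigswitch(stdout, 1, cases_needed(8192, 7)) would print a bigswitch001 intended to occupy roughly 8K, if a case really compiled to 7 bytes. Check the result with objdump and adjust.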

Answer 2

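In addition to all the other ways mentioned here, another very reliable way to force an instruction cache miss is to write self-modifying code.

If you write to a page of code in memory (assuming you have configured the OS to permit this), then of course the corresponding line of the instruction cache immediately becomes invalid, and the processor is forced to refetch it.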

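It is not branch "prediction" that causes icache misses, by the way, but simply branching. You miss the instruction cache whenever the processor tries to run an instruction that has not been run recently. Modern x86 is smart enough to prefetch instructions in sequence, so you are unlikely to miss the icache just by walking forward from one instruction to the next. But any branch (conditional or otherwise) jumps out of sequence to a new address. If the new instruction address has not been run recently, and is not near the code you were already running, it is likely to be out of cache, and the processor must stop and wait for the instructions to come in from main RAM. This is exactly like the data cache.

Some very modern processors (such as the i7) can look at upcoming branches in the code and start prefetching the possible targets into the icache, but many cannot (video game consoles, for example). Fetching data from main RAM into the icache is entirely different from the "instruction fetch" stage of the pipeline, which is what branch prediction is about.

"Instruction fetch" is part of the CPU's execution pipeline, and refers to bringing an opcode from the icache into the CPU's execution unit, where it can start being decoded and doing work. That is different from "instruction cache" fetching, which must happen many cycles earlier and involves the cache circuitry asking the main memory unit to send some bytes across the bus. The first is an interaction between two stages of the CPU pipeline. The second is an interaction between the pipeline and the memory cache and main RAM, which is a much more complicated piece of circuitry. The names are confusingly similar, but the two are completely separate operations.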

So one other way to cause instruction cache misses would be to write (or generate) lots of really big functions, so that your code segment is huge. Then call wildly from one function to another, so that from the CPU's point of view you are doing crazy GOTOs all over memory.

Answer 3

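Your project requires awareness of your target system's cache hardware, including but not limited to its cache size (the overall size of the cache), cache line size (the smallest cacheable entity), associativity, and write and replacement policies. Any really good algorithm designed to test a cache's performance must take all of this into account, as there is no single general algorithm that would effectively test all cache configurations, though you may be able to design an effective parameterized test routine generator, which could produce a suitable test routine given enough of the particulars of a given target's cache architecture. Despite this, I think my suggestion below is a pretty good general-case test, but first I wanted to mention: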

You mention that you have a working data cache test that uses a “large integer array a[100].... [which accesses] the elements in such a way that the distance between the two elements is greater than the cache-line size(32 bytes in my case).” I am curious how you’ve determined that your test algorithm works and how you’ve determined how many data cache misses are a result of your algorithm, as opposed to misses caused by other stimuli. Indeed, with a test array of 100*sizeof(int), your test data area is only 400 bytes long on most general-purpose platforms today (perhaps 800 bytes if you’re on a 64-bit platform, or 200 bytes if you’re using a 16-bit platform). For the vast majority of cache architectures, that entire test array will fit into the cache many times over, meaning that randomized accesses to the array will bring the entire array into the cache in somewhere around (400/cache_line_size)*2 accesses, and every access after that will be a cache hit regardless of how you order your accesses, unless some hardware or OS tick timer interrupt pops in and flushes out some or all of your cached data.
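To put numbers on that footprint, here is a small sketch of the arithmetic. The function name lines_spanned is hypothetical, and it assumes the array starts on a cache-line boundary.

```c
#include <stddef.h>

/* Number of cache lines covered by an array of `count` elements of
 * `elem_size` bytes each, assuming the array starts on a line boundary. */
size_t lines_spanned(size_t count, size_t elem_size, size_t line_size)
{
    size_t bytes = count * elem_size;
    return (bytes + line_size - 1) / line_size;  /* round up */
}
```

With 4-byte ints and 32-byte lines, a[100] covers only 13 lines, so once each of those lines has been touched, further accesses hit in any cache larger than a few hundred bytes.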

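Regarding the instruction cache: others have suggested using a large switch-case statement or calls to functions in disparate locations, neither of which would be predictably effective without carefully (and I mean CAREFULLY) designing the size of the code in the respective case branches, or the locations and sizes of the disparately located functions. The reason is that bytes throughout memory "fold into" (technically, "alias one another" in) the cache in a totally predictable pattern. If you carefully control the number of instructions in each branch of a switch-case statement, you might be able to get somewhere with your test, but if you just throw a large, indiscriminate number of instructions into each one, you have no idea how they will fold into the cache, or which cases of the switch-case statement alias each other so that you can use them to evict each other from the cache.

I'm guessing you're not overly familiar with assembly code, but you've got to believe me here: this project is screaming for it. Trust me, I'm not one to use assembly code where it's not called for, and I strongly prefer using the STL and polymorphic ADT hierarchies whenever possible. But in your case there's really no other foolproof way of doing this, and assembly will give you the absolute control over code block sizes that you really need in order to generate specified cache hit ratios effectively. You wouldn't have to become an assembly expert, and you probably wouldn't even need to learn the instructions and structure required to implement a C-language prolog and epilog (search for "C-callable assembly function"). You write some extern "C" function prototypes for your assembly functions, and away you go. If you do care to learn some assembly, the more of the test logic you put in the assembly functions, the less of a "Heisenberg effect" you impose on your test, since you can carefully control where the test control instructions go (and thus their effect on the instruction cache). But for the bulk of your test code, you can just use a bunch of "nop" instructions (the instruction cache doesn't really care what instructions it contains), and probably just put your processor's "return" instruction at the bottom of each block of code.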

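Now let's say your instruction cache is 32K (pretty small by today's standards, but perhaps still common in many embedded systems). If your cache is 4-way associative, you can create eight separate, content-identical 8K assembly functions (which, as you hopefully noticed, is 64K worth of code, twice the size of the cache), the bulk of which is just a bunch of NOP instructions. You make them all fall one after another in memory (generally by simply defining them one after the other in the source file). Then you call them from a test control function using carefully computed sequences to generate any cache hit ratio you desire (with rather coarse granularity, of course, since each function is a full 8K long).

If you call the 1st, 2nd, 3rd, and 4th functions one after another, you know you've filled the entire cache with those test functions' code. Calling any of those again at this point will not result in an instruction cache miss (with the exception of lines evicted by the test control function's own instructions), but calling any of the others (the 5th, 6th, 7th, or 8th; let's just choose the 5th) will evict one of the first four. At this point, the only one you can call and know you WON'T evict another is the one you just called (the 5th), and the only ones you can call and know you WILL evict another are the ones you haven't yet called (the 6th, 7th, or 8th). To make this easier, just maintain a static array with as many entries as you have test functions. To trigger an eviction, call the function at the end of the array and move its pointer to the top of the array, shifting the others down. To NOT trigger an eviction, call the one you most recently called (the one at the top of the array; be sure NOT to shift the others down in this case!).

Do some variations on this (perhaps make 16 separate 4K assembly functions) if you need finer granularity. Of course, all of this depends on the size of the test control logic being insignificant in comparison to the size of each associative "way" of the cache. For more positive control you could put the test control logic in the test functions themselves, and for perfect control you'd have to design the control logic entirely without internal branching (branching only at the end of each assembly function), but I think I'll stop here since that's probably over-complicating things.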
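The array bookkeeping described above might be sketched as follows. This is only an illustration: the stub functions stand in for the eight NOP-filled assembly functions, and every name here (test_fn, order, last_called, call_evicting, call_hitting) is hypothetical.

```c
#define N_FUNCS 8

typedef void (*test_fn)(void);

/* Placeholder stubs standing in for the 8K assembly functions; each one
 * records its own number so the call order can be observed. */
int last_called = -1;
#define DEFINE_STUB(k) void stub##k(void) { last_called = k; }
DEFINE_STUB(0) DEFINE_STUB(1) DEFINE_STUB(2) DEFINE_STUB(3)
DEFINE_STUB(4) DEFINE_STUB(5) DEFINE_STUB(6) DEFINE_STUB(7)

/* order[0] is the most recently called function; order[N_FUNCS - 1] is
 * the one called longest ago (or not at all yet). */
test_fn order[N_FUNCS] = {
    stub0, stub1, stub2, stub3, stub4, stub5, stub6, stub7
};

/* Intended to trigger an eviction: take the function at the end of the
 * array, move it to the top (shifting the others down), then call it. */
void call_evicting(void)
{
    test_fn fn = order[N_FUNCS - 1];
    for (int i = N_FUNCS - 1; i > 0; --i)
        order[i] = order[i - 1];
    order[0] = fn;
    fn();
}

/* Intended NOT to trigger an eviction: call the most recently called
 * function again and leave the array untouched. */
void call_hitting(void)
{
    order[0]();
}
```

A test control function would then interleave call_evicting() and call_hitting() in whatever proportion gives the desired miss ratio. Note that the first few call_evicting() calls are cold misses that fill the cache rather than evictions.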

Off the cuff and untested, the whole of one of the assembly functions for x86 might look like this:

myAsmFunc1:
   nop
   nop
   nop  # ...exactly enough NOPs to fill one "way" of the cache
   nop  # minus however many bytes a "ret" instruction is (1?)
   .
   .
   .
   nop
   ret  # return to the caller

For PowerPC it might look like this (also untested):

myAsmFunc1:
   nop
   nop
   nop   # ...exactly enough NOPs to fill one "way" of the cache
   .     # minus 4 bytes for the "blr" instruction.  Note that
   .     # on PPC, all instructions (including NOP) are 4 bytes.
   .
   nop
   blr   # return to the caller

In both cases, the C++ and C prototypes for calling these functions would be:

extern "C" void myAsmFunc1();    // Prototype for calling from C++ code
void myAsmFunc1(void);           /* Prototype for calling from C code */

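Depending on your assembler, you may need to put an underscore in front of the function name in the assembly code itself (but not in your C++/C function prototype).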
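Rather than typing thousands of NOPs, the assembly source could itself be generated. The sketch below assumes GNU assembler syntax (.text, .globl, .rept/.endr) and takes the instruction sizes as parameters; nop_count and emit_nop_function are hypothetical names, and the output is as untested on real hardware as the listings above.

```c
#include <stdio.h>

/* Number of NOPs such that the NOPs plus one return instruction
 * occupy exactly way_bytes. */
long nop_count(long way_bytes, long nop_bytes, long ret_bytes)
{
    return (way_bytes - ret_bytes) / nop_bytes;
}

/* Write one C-callable function in GNU assembler syntax to `out`.
 * ret_insn is the target's return instruction, e.g. "ret" or "blr". */
void emit_nop_function(FILE *out, const char *name,
                       const char *ret_insn, long nops)
{
    fprintf(out, "    .text\n    .globl %s\n%s:\n", name, name);
    fprintf(out, "    .rept %ld\n    nop\n    .endr\n", nops);
    fprintf(out, "    %s\n", ret_insn);
}
```

For the 8K-per-way example, x86 would use nop_count(8192, 1, 1) and PowerPC nop_count(8192, 4, 4), since both its NOP and blr are 4 bytes.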

Answer 4

For instruction cache misses, you need to execute code segments that are far apart. Splitting your logic among multiple function calls would be one way to do that.

Answer 5

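I was doing similar experiments with an ARM M7 CPU while investigating the Playdate hardware and trying to confirm the size and behaviour of the instruction cache.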

I did something similar to @phonetagger's answer, using inline assembly to create functions of known size. I thought it best to generate lots of small functions, because large functions without branches will allow the branch prediction logic to work flawlessly and preload the instruction cache very effectively.

My current test scenario is based on a table of 256 function pointers, each pointing to a function that is 64 bytes long, or two cache lines (in the case of the ARM M7). In total, the 256 functions occupy 256 x 64 = 16K of memory, which is four times the 4K instruction cache size - based on the data sheet that I think matches the part in the Playdate, which also indicates that the instruction cache is 2-way associative.

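My test strategy is to repeatedly run a set of functions that add up to a known amount of memory, and to vary that amount to see when everything fits in the cache and when misses occur. For example, to test 2K of instruction memory, I need to run 2048 / 64 = 32 functions, so my code is: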

int n = 32;
for (int calls = 0; calls < 100000; calls++)
{
    functable[calls%n]();
}

I do 100,000 calls to ensure it takes long enough to be able to get consistent timings. Obviously the loop logic is also being run, but that should only consume a couple of cache lines so it shouldn't throw the results off too much.
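The answer does not show how the loop was timed. One possible sketch on a hosted system uses the standard clock() function; on an embedded target a hardware timer would likely be needed instead. The names time_calls, tick, and tick_count are hypothetical, and tick is only a placeholder for the real 64-byte functions.

```c
#include <time.h>

typedef void (*test_fn)(void);

/* Placeholder for a real fixed-size test function. */
volatile long tick_count = 0;
void tick(void) { tick_count++; }

/* Call the first n entries of `table` round-robin, `calls` times in
 * total, and return the elapsed processor time in seconds. */
double time_calls(test_fn const *table, int n, long calls)
{
    clock_t start = clock();
    for (long c = 0; c < calls; ++c)
        table[c % n]();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls(functable, n, 100000) for each n from 1 to 256 would reproduce the sweep described here, with the caveat that clock() may be too coarse to resolve short runs.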

I repeat the above test for n running from 1 to 256, thus testing 64 bytes up to 16K of instructions, and time how long it takes. Here are the results:

I am puzzled by a few things:

Why is there an early spike in time taken, up to and a little beyond the 1K mark?
Why does performance start to drop off at the 8K mark, instead of the 4K mark like I expected from the 4K cache size?
Why isn't there a greater drop in performance? Performance is more than half as good when apparently missing the cache. From messing around with data caches I expected the cache load time to be a more significant hit than that.

All my functions are laid out linearly in memory, so I wondered if the CPU was prefetching subsequent functions, so I tried calling the functions in random order. I used insertion sort to randomize the first n entries in the function table before starting the timing loop. The results were very similar, though surprisingly the early spike in time taken - though still present - was lower than in the linear order case.
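For reference, one common way to randomize the first n table entries is a Fisher-Yates shuffle. This is a different technique from the insertion-sort approach mentioned above, shown only as a sketch. The stubs and the names shuffle_first_n and last_id are hypothetical.

```c
#include <stdlib.h>

typedef void (*test_fn)(void);

/* Placeholder stubs; each records a distinct bit so calls can be told
 * apart. Real test functions would be the fixed-size ones. */
int last_id = 0;
void stub_a(void) { last_id = 1; }
void stub_b(void) { last_id = 2; }
void stub_c(void) { last_id = 4; }
void stub_d(void) { last_id = 8; }

/* Shuffle the first n entries of `table` in place (Fisher-Yates).
 * rand() % k has a slight modulo bias, which should not matter here. */
void shuffle_first_n(test_fn *table, int n)
{
    for (int i = n - 1; i > 0; --i) {
        int j = rand() % (i + 1);
        test_fn tmp = table[i];
        table[i] = table[j];
        table[j] = tmp;
    }
}
```

The shuffle runs before the timing loop starts, so its own cost and cache footprint stay out of the measurement.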

In summary, I think my procedure is fairly sound, but I'm puzzled by the results and would appreciate additional insight.

Answer 6

Use a chain of if/else statements on unpredictable conditions (e.g. input or randomly generated data), where the code in both the if branch and the else branch is larger than a cache line.
