How to determine CPE: Cycles Per Element

How do I determine the CPE of a program? For example, I have this assembly code for a loop:

# inner4: data_t = float
# udata in %rbx, vdata in %rax, limit in %rcx,
# i in %rdx, sum in %xmm1
.L87:                                   # loop:
  movss  (%rbx,%rdx,4), %xmm0           #  Get udata[i]
  mulss  (%rax,%rdx,4), %xmm0           #  Multiply by vdata[i]
  addss  %xmm0, %xmm1                   #  Add to sum
  addq  $1, %rdx                        #  Increment i
  cmpq  %rcx, %rdx                      #  Compare i:limit
  jl .L87                               #  If <, goto loop
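
For orientation, a loop of this shape is what a compiler might emit for a plain inner-product routine. The sketch below is an assumption, not the original source: the function name `inner_product` and its signature are hypothetical, chosen so that `udata`, `vdata`, `limit`, `i`, and `sum` line up with the register comments in the listing.

```c
typedef float data_t;  /* matches "data_t = float" in the listing */

/* Hypothetical source for the loop: one multiply and one add per element,
 * with sum carried from one iteration to the next. */
data_t inner_product(const data_t *udata, const data_t *vdata, long limit)
{
    data_t sum = 0.0f;
    for (long i = 0; i < limit; i++) {
        sum = sum + udata[i] * vdata[i];  /* load, multiply, add to sum */
    }
    return sum;
}
```

Each iteration handles one element, so the CPE of this loop is the number of cycles it takes divided by `limit`.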

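I have to find the lower bound on the CPE determined by the critical path, using data type float. I believe the critical path refers to the slowest path, so it would be the one where the program has to execute the instruction that takes the most clock cycles.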

However, there doesn't seem to be any clear way to determine the CPE. If one instruction takes two clock cycles, and another takes one, can the latter start after the first clock cycle of the former? Any help would be appreciated. Thanks

Answer

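If you want to know how long it takes, measure it. Run the loop for about 10^10 iterations, take the elapsed time, and multiply it by the clock frequency. That gives the total number of cycles; divide by 10^10 to get the number of clock cycles per loop iteration.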
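
The measurement described above could be sketched in C roughly as follows. Everything here is an assumption for illustration: the helper names `cycles_per_element` and `time_dot` are hypothetical, `clock()` reports CPU time with limited resolution, and the clock frequency is something you have to supply for your own machine (it may also drift with frequency scaling).

```c
#include <time.h>

/* cycles per element = elapsed seconds * clock frequency / elements processed */
double cycles_per_element(double seconds, double clock_hz, double elements)
{
    return seconds * clock_hz / elements;
}

/* The loop under test: a scalar dot product (hypothetical stand-in). */
static float dot(const float *u, const float *v, long n)
{
    float sum = 0.0f;
    for (long i = 0; i < n; i++)
        sum += u[i] * v[i];
    return sum;
}

/* Time `repeats` passes over n elements and return the CPU seconds used.
 * Each result is stored through a volatile pointer, which is meant to keep
 * the compiler from discarding the work; check the generated assembly. */
double time_dot(const float *u, const float *v, long n, long repeats,
                volatile float *sink)
{
    clock_t start = clock();
    for (long r = 0; r < repeats; r++)
        *sink = dot(u, v, n);
    clock_t stop = clock();
    return (double)(stop - start) / (double)CLOCKS_PER_SEC;
}
```

As a purely illustrative calculation, 10^10 iterations that take 10 seconds at an assumed 3 GHz clock come out to `cycles_per_element(10.0, 3.0e9, 1.0e10)`, that is 3 cycles per iteration.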

Theoretical predictions of execution time are almost never correct (and most of the time too low), because numerous effects determine the speed:

  • Pipelining (there can easily be about 20 stages in the pipeline)
  • Superscalar execution (up to 5 instructions in parallel, cmp and jl may be fused)
  • Decoding to µOps and reordering
  • The latencies of caches or memory
  • The throughput of the instructions (are there enough execution ports free?)
  • The latencies of the instructions
  • Bank conflicts, aliasing issues and more esoteric stuff

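Depending on the CPU, and provided all memory accesses hit the cache, I believe the loop needs at least 3 clock cycles per iteration, because the longest dependency chain is 3 elements long. On a slower CPU, where the mulss or addss instructions take longer, the time needed increases.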
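
One way to make a dependency-chain estimate concrete is to look only at the values carried from one iteration to the next, since those limit how closely iterations can overlap. In the scalar loop these are `sum` (updated by `addss`) and `i` (updated by `addq`); the load and multiply for the next element do not need the previous sum, so an out-of-order core can start them early. The sketch below computes the resulting latency bound. The function name is hypothetical, and the latencies passed in are placeholders to be replaced with figures for your CPU, not measurements.

```c
/* Latency bound on cycles per element: the slowest loop-carried chain,
 * divided by the number of elements each iteration handles.
 * chain_latency[k] is the assumed latency, in cycles, of chain k. */
double latency_bound_cpe(const int *chain_latency, int chains, int elems_per_iter)
{
    int worst = 0;
    for (int k = 0; k < chains; k++) {
        if (chain_latency[k] > worst)
            worst = chain_latency[k];
    }
    return (double)worst / (double)elems_per_iter;
}
```

With placeholder latencies of 3 cycles for the `addss` chain and 1 cycle for the `addq` chain, `latency_bound_cpe((int[]){3, 1}, 2, 1)` gives 3.0, the same figure as the estimate above. This is only a lower bound; the effects in the list can push the measured value higher.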

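If you are actually interested in speeding up the code, and not just in theoretical observations, you should vectorize it. That can improve performance by a factor of 4-8.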

.L87:                               # loop:
vmovaps (%rbx,%rdx,4), %ymm0        #  Get udata[i]..udata[i+7]
vmulps  (%rax,%rdx,4), %ymm0, %ymm0 #  Multiply by vdata[i]..vdata[i+7]
vaddps  %ymm0, %ymm1, %ymm1         #  Add to the 8 partial sums
addq    $8, %rdx                    #  Increment i
cmpq    %rcx, %rdx                  #  Compare i:limit
jl .L87                             #  If <, goto loop

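After the loop you need to horizontally add all 8 elements, and of course make sure that the data is 32-byte aligned and the loop count is divisible by 8.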
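
The same structure can be sketched in portable C without intrinsics: eight independent partial sums stand in for the eight lanes of `%ymm1`, followed by the horizontal add. This is an illustration under assumptions, not the AVX code itself: the name `dot8` is hypothetical, whether a compiler turns the inner loop into vector instructions depends on its flags and target, and the scalar tail loop is an addition that handles counts not divisible by 8.

```c
#define LANES 8  /* number of floats in one 256-bit register */

float dot8(const float *u, const float *v, long n)
{
    float acc[LANES] = {0.0f};  /* one partial sum per lane */
    long i = 0;

    /* Main loop: 8 elements per iteration, each lane has its own add chain. */
    for (; i + LANES <= n; i += LANES) {
        for (int k = 0; k < LANES; k++)
            acc[k] += u[i + k] * v[i + k];
    }

    /* Horizontal add of the 8 partial sums. */
    float sum = 0.0f;
    for (int k = 0; k < LANES; k++)
        sum += acc[k];

    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; i++)
        sum += u[i] * v[i];

    return sum;
}
```

Because floating-point addition is not associative, the result can differ in the last bits from the single-accumulator loop.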

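If you are running an Intel CPU, you can find good documentation on instruction latency and throughput for the various CPUs. Here is the reference: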

Intel® 64 and IA-32 Architectures Optimization Reference Manual




