Why does the order of loops in a matrix multiply algorithm affect performance? [duplicate]

I was given two functions for multiplying two matrices:

 void MultiplyMatrices_1(int **a, int **b, int **c, int n){
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++)
              for (int k = 0; k < n; k++)
                  c[i][j] = c[i][j] + a[i][k]*b[k][j];
  }

 void MultiplyMatrices_2(int **a, int **b, int **c, int n){
      for (int i = 0; i < n; i++)
          for (int k = 0; k < n; k++)
              for (int j = 0; j < n; j++)
                  c[i][j] = c[i][j] + a[i][k]*b[k][j];
 }

I compiled and profiled two executables using gprof, each identical except for this function. The second one is significantly faster (about 5 times) for matrices of size 2048x2048. Any ideas as to why?

Best answer

I think what you're looking at here is the effect of the memory hierarchy:

Typically, the memory a computer can access is divided into several types with different performance characteristics (this is usually called the memory hierarchy). The fastest memory is in the processor's registers, which can (usually) be accessed and read in a single clock cycle. However, there are only a handful of these registers (usually no more than 1KB worth). The computer's main memory, on the other hand, is huge (say, 8GB) but much slower to access. To improve performance, the computer is usually physically built with several levels of caches sitting between the processor and main memory. These caches are slower than registers but much faster than main memory, so if a memory access finds what it needs in a cache it tends to be much faster than having to go out to main memory (typically somewhere between 5-25x faster). When accessing memory, the processor first checks the caches for the value before going to main memory to read it. If you consistently access values that are in the cache, you will end up with far better performance than if you skip around memory, accessing values at random.

Most programs are written in a way where, if a single value is read from memory, the program later reads multiple other values from around that memory region as well. Consequently, these caches are typically designed so that when you read a single value from memory, a block of memory (usually somewhere between 1KB and 1MB) of values around that single value is also pulled into the cache. That way, if your program reads the nearby values, they're already in the cache and you don't have to go to main memory.

Now, one last detail: in C/C++, arrays are stored in row-major order, meaning that all the values in a single row of a matrix are stored next to each other. Thus, in memory, the array looks like the first row, then the second row, then the third row, and so on.
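As a quick illustration of the effect this has (a sketch added here, not part of the original answer): the snippet below sums the same matrix twice, once row by row and once column by column. Both loops touch exactly the same elements, but the column-wise walk jumps a whole row ahead in memory on every access, so on most machines it runs several times slower.

    #include <stdio.h>
    #include <time.h>

    #define N 4096                     /* 4096 x 4096 ints = 64 MB, larger than typical caches */

    static int m[N][N];                /* row-major: m[i][j] and m[i][j+1] are adjacent in memory */

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = i + j;

        clock_t t0 = clock();
        long long rowSum = 0;
        for (int i = 0; i < N; i++)    /* row-wise: unit stride, one cache line serves many reads */
            for (int j = 0; j < N; j++)
                rowSum += m[i][j];
        clock_t t1 = clock();

        long long colSum = 0;
        for (int j = 0; j < N; j++)    /* column-wise: every read jumps a full row ahead */
            for (int i = 0; i < N; i++)
                colSum += m[i][j];
        clock_t t2 = clock();

        printf("row-wise:    %.2fs (sum %lld)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, rowSum);
        printf("column-wise: %.2fs (sum %lld)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, colSum);
        return 0;
    }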

With that in mind, take a look at your code. The first version looks like this:

  for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
          for (int k = 0; k < n; k++)
              c[i][j] = c[i][j] + a[i][k]*b[k][j];

Now, look at the innermost line of that code. On each iteration, the value of k is changing. This means that when running the innermost loop, each iteration is likely to have a cache miss when loading the value of b[k][j]. The reason is that, because the matrix is stored in row-major order, each time you increment k you skip over an entire row of the matrix and jump much further into memory, possibly well past the values you have cached. However, you won't have a miss when looking up c[i][j] (since i and j are the same as on the previous iteration), nor are you likely to miss a[i][k], since the values are laid out in row-major order and, if the value of a[i][k] was cached on the previous iteration, the value read on this iteration is in an adjacent memory location. Consequently, on each iteration of the innermost loop you are likely to have one cache miss.

But now consider the second version:

  for (int i = 0; i < n; i++)
      for (int k = 0; k < n; k++)
          for (int j = 0; j < n; j++)
              c[i][j] = c[i][j] + a[i][k]*b[k][j];

Now, since you are incrementing j on each iteration, think about how many cache misses you are likely to have on the innermost statement. Because the values are in row-major order, the value of c[i][j] is likely to be in the cache, since the value of c[i][j] from the previous iteration was probably cached as well and is ready to be read. Similarly, b[k][j] is probably cached, and since i and k aren't changing, chances are a[i][k] is cached too. This means that on each iteration of the inner loop you are likely to have no cache misses.

Overall, this means that the second version of the code is unlikely to have a cache miss on each iteration of the loop, while the first version almost certainly will. Consequently, the second loop is likely to be faster than the first, which is exactly what you've seen.
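If you want to reproduce the measurement, a minimal timing harness might look like the sketch below; it assumes the two functions from the question are compiled into the same program, and the matrix size and initial values are arbitrary.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    void MultiplyMatrices_1(int **a, int **b, int **c, int n);   /* i-j-k order from the question */
    void MultiplyMatrices_2(int **a, int **b, int **c, int n);   /* i-k-j order from the question */

    /* Allocate a zero-filled n x n matrix as an array of row pointers, matching int**. */
    static int **AllocMatrix(int n) {
        int **m = malloc(n * sizeof *m);
        for (int i = 0; i < n; i++)
            m[i] = calloc(n, sizeof **m);
        return m;
    }

    int main(void) {
        int n = 1024;                            /* small enough that both versions finish quickly */
        int **a = AllocMatrix(n), **b = AllocMatrix(n), **c = AllocMatrix(n);

        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                a[i][j] = i + j;
                b[i][j] = i - j;
            }

        clock_t t0 = clock();
        MultiplyMatrices_1(a, b, c, n);          /* likely a cache miss on b[k][j] every iteration */
        clock_t t1 = clock();
        MultiplyMatrices_2(a, b, c, n);          /* unit-stride inner loop */
        clock_t t2 = clock();

        printf("version 1: %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("version 2: %.2fs\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }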

Interestingly, many compilers are starting to get prototype support for detecting that the second version of the code is faster than the first. Some will try to automatically rewrite the code to maximize parallelism. If you have a copy of the Purple Dragon Book, chapter 11 discusses how these compilers work.

Additionally, you can optimize this loop even further using more complex techniques. For example, using a technique called blocking, you can notably increase performance by splitting the array into sub-regions that can stay in the cache longer, then performing multiple operations on those blocks to compute the overall result.
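A minimal sketch of what blocking could look like for this loop (not from the original answer; the block size is a placeholder that would normally be tuned to the machine's cache sizes):

    /* Blocked (tiled) multiply: work on BLOCK x BLOCK sub-matrices so that the
       tiles of a, b and c stay in cache while they are being reused.
       n does not need to be a multiple of BLOCK; c accumulates, as in the question. */
    #define BLOCK 64

    void MultiplyMatrices_Blocked(int **a, int **b, int **c, int n) {
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int jj = 0; jj < n; jj += BLOCK) {
                    int iMax = ii + BLOCK < n ? ii + BLOCK : n;
                    int kMax = kk + BLOCK < n ? kk + BLOCK : n;
                    int jMax = jj + BLOCK < n ? jj + BLOCK : n;
                    /* Same i-k-j order as the fast version, but within one tile. */
                    for (int i = ii; i < iMax; i++)
                        for (int k = kk; k < kMax; k++)
                            for (int j = jj; j < jMax; j++)
                                c[i][j] += a[i][k] * b[k][j];
                }
    }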

Hope this helps!

Other answers

This is most likely memory locality. When you reorder the loops, the memory needed in the innermost loop is nearby and can be cached, while in the inefficient version you need to access memory from across the whole data set.

The way to test this hypothesis is to run a cache debugger (like cachegrind) on the two pieces of code and see how many cache misses they incur.

Besides memory locality there is also compiler optimisation. A key one for vector and matrix operations is loop unrolling.

for (int k = 0; k < n; k++)
   c[i][j] = c[i][j] + a[i][k]*b[k][j];

You can see that i and j do not change inside this loop, which means it can be rewritten as:

for (int k = 0; k < n; k += 4) {
   int *aik = &a[i][k];
   c[i][j] += aik[0]*b[k][j]
            + aik[1]*b[k+1][j]
            + aik[2]*b[k+2][j]
            + aik[3]*b[k+3][j];
}

You can see that there will be:

  • four times fewer loop iterations and accesses to c[i][j]
  • a[i][k] accessed contiguously in memory
  • memory accesses and multiplies that can be pipelined (almost concurrently) in the CPU.

What if n is not a multiple of 4, 6 or 8 (or whatever the compiler decides to unroll to)? The compiler handles this tidy-up for you. ;)
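For illustration, if you were unrolling by hand rather than relying on the compiler, the usual shape is a main loop stepping by 4 plus a short tail loop for the leftover iterations (a sketch using the same variables as above):

    int k = 0;
    /* Main unrolled loop: covers the largest multiple of 4 that fits in n. */
    for (; k + 3 < n; k += 4) {
        int *aik = &a[i][k];
        c[i][j] += aik[0]*b[k][j]
                 + aik[1]*b[k+1][j]
                 + aik[2]*b[k+2][j]
                 + aik[3]*b[k+3][j];
    }
    /* Tail loop: the remaining 0-3 iterations when n is not a multiple of 4. */
    for (; k < n; k++)
        c[i][j] += a[i][k]*b[k][j];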

To speed this solution up even more, you could try transposing the b matrix first. This is a little extra work and coding, but it means that accesses to the transposed b are also contiguous in memory (as you are swapping [k] with [j]).
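A sketch of that idea (hypothetical code, not taken from the answer): build a transposed copy of b once, after which both operands are read with unit stride in the inner loop.

    #include <stdlib.h>

    /* Multiply using a transposed copy of b, so the inner loop reads both
       a[i][*] and bT[j][*] sequentially. c accumulates, as in the question. */
    void MultiplyMatrices_Transposed(int **a, int **b, int **c, int n) {
        /* One-off O(n^2) transpose; the multiply itself is O(n^3). */
        int **bT = malloc(n * sizeof *bT);
        for (int j = 0; j < n; j++) {
            bT[j] = malloc(n * sizeof **bT);
            for (int k = 0; k < n; k++)
                bT[j][k] = b[k][j];
        }

        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                int sum = 0;
                for (int k = 0; k < n; k++)
                    sum += a[i][k] * bT[j][k];   /* both operands walked contiguously */
                c[i][j] += sum;
            }

        for (int j = 0; j < n; j++) free(bT[j]);
        free(bT);
    }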

The other thing you could do is make the multiplication multi-threaded. This can give roughly a factor-of-3 performance improvement on a quad-core machine.
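One possible way to do that is OpenMP, which is an assumption on my part rather than something the answer specifies (compile with -fopenmp on GCC/Clang). Each iteration of the outer i loop writes a different row of c, so the rows can be split across threads without any locking:

    #include <omp.h>

    /* Multi-threaded version of the fast i-k-j loop: rows of c are independent,
       so the outer loop can be divided among threads with no synchronisation. */
    void MultiplyMatrices_Parallel(int **a, int **b, int **c, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    c[i][j] += a[i][k] * b[k][j];
    }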

Finally, consider using float or double instead of int. You might think int would be faster, but that is not always the case, since floating-point operations can be more heavily optimised (both in hardware and by the compiler).

In the second case, the values that change on each iteration of the inner loop make things harder to optimise.

Probably the slower one has to skip around in memory more to access the array elements. It might be something else too; you could check the compiled code to see what is actually happening.




