Question

I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call pow(a,2) by compiling it into a*a, but the call pow(a,6) is not optimized and will actually call the library function pow, which greatly slows down the performance. (In contrast, Intel C++ Compiler, executable icc, will eliminate the library call for pow(a,6).)

What I am curious about is that when I replaced pow(a,6) with a*a*a*a*a*a using GCC 4.5.1 and options "-O3 -lm -funroll-loops -msse4", it uses 5 mulsd instructions:

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13

while if I write (a*a*a)*(a*a*a), it will produce

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm13, %xmm13

which reduces the number of multiply instructions to 3. icc has similar behavior.

Why do compilers not recognize this optimization trick?

Answer 1

Because floating-point math is not associative (http://en.wikipedia.org/wiki/Floating_point#Accuracy_problems). The way you group the operands in a floating-point multiplication has an effect on the numerical accuracy of the answer.

As a result, most compilers are very conservative about reordering floating-point calculations unless they can be sure that the answer will stay the same, or unless you tell them you don't care about numerical accuracy. For example: the -fassociative-math option of gcc, which allows gcc to reassociate floating-point operations, or the -ffast-math option, which allows even more aggressive tradeoffs of accuracy against speed.
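To make the non-associativity concrete, here is a minimal sketch that counts how many single-precision values in [1, 2) give different results for the two groupings from the question. The function names are made up for illustration, and the sketch assumes IEEE-754 binary32 floats evaluated in float precision, compiled without -ffast-math.

```c
#include <stdint.h>
#include <string.h>

/* Build a float from its raw IEEE-754 bit pattern (hypothetical helper). */
static float float_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Count the floats a in [1, 2) for which a*a*a*a*a*a and
   (a*a*a)*(a*a*a) round to different results. */
unsigned long count_grouping_mismatches(void) {
    unsigned long mismatches = 0;
    /* 0x3F800000 is 1.0f; 0x3FFFFFFF is the largest float below 2.0f. */
    for (uint32_t bits = 0x3F800000u; bits <= 0x3FFFFFFFu; ++bits) {
        float a = float_from_bits(bits);
        float left_to_right = a * a * a * a * a * a;
        float two_cubes = (a * a * a) * (a * a * a);
        if (left_to_right != two_cubes)
            ++mismatches;
    }
    return mismatches;
}
```

A nonzero count means the compiler cannot swap one form for the other without changing some results, which is why it needs a flag such as -fassociative-math before doing so.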

Answer 2

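Lambdageek correctly points out that because associativity does not hold for floating-point numbers, the "optimization" of a*a*a*a*a*a to (a*a*a)*(a*a*a) may change the value. This is why it is disallowed by C99 (unless specifically allowed by the user, via a compiler flag or pragma). Generally, the assumption is that the programmer wrote what they did for a reason, and the compiler should respect that. If you want (a*a*a)*(a*a*a), write that.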

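That can be a pain to write, though; why can't the compiler just do the right thing when you use pow(a,6)? Because it would be the wrong thing to do. On a platform with a good math library, pow(a,6) is significantly more accurate than either a*a*a*a*a*a or (a*a*a)*(a*a*a). Just to provide some data, I ran a small experiment on my Mac Pro, measuring the worst error in evaluating a^6 for all single-precision floating-point numbers in [1,2):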

worst relative error using    powf(a, 6.f): 5.96e-08
worst relative error using (a*a*a)*(a*a*a): 2.94e-07
worst relative error using     a*a*a*a*a*a: 2.58e-07

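Using pow instead of a multiplication tree reduces the error bound by a factor of 4. Compilers should not (and generally do not) make "optimizations" that increase error unless licensed to do so by the user (e.g. via -ffast-math).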
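An experiment of this kind could look roughly like the following. This is not the code behind the numbers above, only a sketch: the names chain6, cubes6 and worst_relative_error are made up, the reference value is computed in double precision, and IEEE-754 binary32 floats are assumed.

```c
#include <stdint.h>
#include <string.h>

/* The two multiplication trees under test. */
float chain6(float a) { return a * a * a * a * a * a; }
float cubes6(float a) { return (a * a * a) * (a * a * a); }

/* Worst relative error of pow6(a) over every float a in [1, 2),
   measured against a double-precision reference. */
double worst_relative_error(float (*pow6)(float)) {
    double worst = 0.0;
    for (uint32_t bits = 0x3F800000u; bits <= 0x3FFFFFFFu; ++bits) {
        float a;
        memcpy(&a, &bits, sizeof a);
        double x = a;
        double reference = (x * x * x) * (x * x * x);
        double err = ((double)pow6(a) - reference) / reference;
        if (err < 0.0)
            err = -err;
        if (err > worst)
            worst = err;
    }
    return worst;
}
```

Calling worst_relative_error(chain6) and worst_relative_error(cubes6) gives the two multiplication rows; passing a small wrapper around powf(a, 6.f) (which needs <math.h> and usually -lm) would give the powf row. The exact figures depend on the platform and its math library.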

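Note that GCC provides __builtin_powi(x,n) as an alternative to pow(), which should generate an inline multiplication tree. Use that if you want to trade off accuracy for performance but do not want to enable fast-math.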
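A minimal use of the builtin might look like this. The wrapper name pow6_builtin is made up, and the builtin is a GCC extension (Clang accepts it too), so this is not portable C. Whether it is expanded inline or turned into a runtime helper call depends on the compiler and flags.

```c
/* GCC extension: raise x to an integer power without going through pow(). */
double pow6_builtin(double x) {
    return __builtin_powi(x, 6);
}
```

It is worth checking the generated assembly (for example with -S) before relying on the inline expansion.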

Answer 3

Another similar case: most compilers won't optimize a + b + c + d to (a + b) + (c + d) (this is an optimization since the second expression can be pipelined better) and instead evaluate it as given (i.e. as (((a + b) + c) + d)). This too is because of corner cases:

float a = 1e35, b = 1e-5, c = -1e35, d = 1e-5;
printf("%e %e\n", a + b + c + d, (a + b) + (c + d));

This outputs 1.000000e-05 0.000000e+00

Answer 4

Fortran (designed for scientific computing) has a built-in power operator, and as far as I know Fortran compilers will commonly optimize raising to integer powers in a similar fashion to what you describe. C/C++ unfortunately don't have a power operator, only the library function pow(). This doesn't prevent smart compilers from treating pow specially and computing it in a faster way for special cases, but it seems they do it less commonly ...

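Some years ago I was trying to make it more convenient to calculate integer powers in an optimal way, and came up with the following. It's C++, not C, though, and it still depends on the compiler being somewhat smart about how to optimize/inline things. Anyway, I hope you might find it useful in practice: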

template<unsigned N> struct power_impl;

template<unsigned N> struct power_impl {
    template<typename T>
    static T calc(const T &x) {
        if (N%2 == 0)
            return power_impl<N/2>::calc(x*x);
        else if (N%3 == 0)
            return power_impl<N/3>::calc(x*x*x);
        return power_impl<N-1>::calc(x)*x;
    }
};

template<> struct power_impl<0> {
    template<typename T>
    static T calc(const T &) { return 1; }
};

template<unsigned N, typename T>
inline T power(const T &x) {
    return power_impl<N>::calc(x);
}

Clarification for the curious: this does not find the optimal way to compute powers, but since finding the optimal solution is an NP-complete problem and this is only worth doing for small powers anyway (as opposed to using pow), there's no reason to fuss with the detail.

Then just use it as power<6>(a).

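This makes it easy to type powers (no need to spell out 6 a's with parentheses), and lets you have this kind of optimization without -ffast-math in case other parts of your code are precision-dependent (cases where the order of operations is essential).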

You can probably also forget that this is C++ and just use it in a C program (if it is compiled with a C++ compiler).

Hope this can be useful.
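If a C++ compiler is not an option, a rough plain-C counterpart is ordinary exponentiation by squaring in a static inline function. This is not the template above: power_u is a made-up name, the multiplication order differs from what the template generates, and the result only collapses to a few multiplications if the compiler inlines the call and unrolls the loop for a constant exponent.

```c
/* Exponentiation by squaring; n is expected to be a small constant
   so that an optimizing compiler can unroll the loop. */
static inline double power_u(double x, unsigned n) {
    double result = 1.0;
    while (n != 0) {
        if (n & 1u)
            result *= x;   /* fold in the current power of x */
        x *= x;            /* x, x^2, x^4, ... */
        n >>= 1;
    }
    return result;
}
```

Usage would be power_u(a, 6). As with the template, the multiplication order is now fixed by the source code rather than chosen by the compiler.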

This is what I get from my compiler:

For a*a*a*a*a*a:

    movapd  %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0

For (a*a*a)*(a*a*a):

    movapd  %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm0, %xmm0

For power<6>(a):

    mulsd   %xmm0, %xmm0
    movapd  %xmm0, %xmm1
    mulsd   %xmm0, %xmm1
    mulsd   %xmm0, %xmm1

Answer 5

GCC does actually optimize a*a*a*a*a*a to (a*a*a)*(a*a*a) when a is an integer. I tried with this command:

$ echo 'int f(int x) { return x*x*x*x*x*x; }' | gcc -o - -O2 -S -masm=intel -x c -

There are a lot of gcc flags but nothing fancy. They mean: Read from stdin; use O2 optimization level; output assembly language listing instead of a binary; the listing should use Intel assembly language syntax; the input is in C language (usually language is inferred from input file extension, but there is no file extension when reading from stdin); and write to stdout.

Here is the important part of the output. I've annotated it with some comments indicating what is going on in the assembly language:

; x is in edi to begin with.  eax will be used as a temporary register.
mov  eax, edi  ; temp = x
imul eax, edi  ; temp = x * temp
imul eax, edi  ; temp = x * temp
imul eax, eax  ; temp = temp * temp

I'm using the system GCC on Linux Mint 16 Petra, an Ubuntu derivative. Here's the gcc version:

$ gcc --version
gcc (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1

As other posters have noted, this optimization is not possible in floating point, because floating-point arithmetic is not associative.

Answer 6

Because a 32-bit floating-point number - such as 1.024 - is not 1.024. In a computer, 1.024 is an interval: from (1.024-e) to (1.024+e), where "e" represents an error. Some people fail to realize this and also believe that * in a*a stands for multiplication of arbitrary-precision numbers without there being any errors attached to those numbers. The reason why some people fail to realize this is perhaps the math computations they exercised in elementary schools: working only with ideal numbers without errors attached, and believing that it is OK to simply ignore "e" while performing multiplication. They do not see the "e" implicit in "float a=1.2", "a*a*a" and similar C codes.

Should the majority of programmers recognize (and be able to act on) the idea that the C expression a*a*a*a*a*a is not actually working with ideal numbers, the GCC compiler would then be FREE to optimize "a*a*a*a*a*a" into, say, "t=(a*a); t*t*t", which requires fewer multiplications. But unfortunately, the GCC compiler does not know whether the programmer writing the code thinks that "a" is a number with or without an error. And so GCC will only do what the source code looks like, because that is what GCC sees with its "naked eye".

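... of course, once you know what kind of programmer you are, you can use the "-ffast-math" switch to tell GCC "Hey, GCC, I know what I'm doing!". This will allow GCC to convert a*a*a*a*a*a into a different piece of text. It looks different from a*a*a*a*a*a, but it still computes a number within the error interval of a*a*a*a*a*a. This is OK, since you already know you are working with intervals, not ideal numbers.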

Answer 7

No posters have mentioned the contraction of floating expressions yet (ISO C standard, 6.5p8 and 7.12.2). If the FP_CONTRACT pragma is set to ON, the compiler is allowed to regard an expression such as a*a*a*a*a*a as a single operation, as if evaluated exactly with a single rounding. For instance, a compiler may replace it by an internal power function that is both faster and more accurate. This is particularly interesting as the behavior is partly controlled by the programmer directly in the source code, while compiler options provided by the end user may sometimes be used incorrectly.

The default state of the FP_CONTRACT pragma is implementation-defined, so a compiler is allowed to do such optimizations by default. Thus portable code that needs to strictly follow the IEEE 754 rules should explicitly set it to OFF.

If a compiler doesn't support this pragma, it must be conservative by avoiding any such optimization, in case the developer has chosen to set it to OFF.

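GCC doesn't support this pragma, but with the default options it assumes it to be ON; thus, if one wants to prevent the transformation of a*b+c to fma(a,b,c), one needs to provide an option such as -ffp-contract=off (to explicitly set the pragma to OFF) or -std=c99 (to tell GCC to conform to some C standard version, here C99, and thus follow the above paragraph). In the past, the latter option did not prevent the transformation, meaning that GCC was not conforming on this point: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37845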
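In source form, turning contraction off with the standard pragma looks like the sketch below. The function name mul_then_add is made up for illustration. As described above, GCC may ignore the pragma (typically with a warning), in which case -ffp-contract=off on the command line is the way to get the same effect.

```c
/* Standard C pragma: ask the compiler not to contract a*b + c
   into a single fused multiply-add. */
#pragma STDC FP_CONTRACT OFF

double mul_then_add(double a, double b, double c) {
    /* With contraction off, the product is rounded before the addition. */
    return a * b + c;
}
```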

Answer 8

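Library functions like "pow" are usually carefully crafted to yield the minimum possible error (in the generic case). This is usually achieved by approximating functions with splines (according to Pascal's comment, the most common implementation seems to use the Remez algorithm).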

Fundamentally, the following operation:

pow(x,y);

has an inherent error of approximately the same magnitude as the error in any single multiplication or division.

While the following operation:

float a=someValue; float b=a*a*a*a*a*a;

has an inherent error that is more than 5 times the error of a single multiplication or division (because you are combining 5 multiplications).

The compiler should be really careful about the kind of optimization it is doing:

if optimizing pow(a,6) to a*a*a*a*a*a it may improve performance, but drastically reduce the accuracy for floating point numbers.

if optimizing a*a*a*a*a*a to pow(a,6) it may actually reduce the accuracy because "a" was some special value that allows multiplication without error (a power of 2 or some small integer number)

if optimizing pow(a,6) to (a*a*a)*(a*a*a) or (a*a)*(a*a)*(a*a) there still can be a loss of accuracy compared to pow function.
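The special-value case mentioned above is easy to check. In this sketch (mul6 is a made-up name) every intermediate product for a small integer or a power of 2 is exactly representable in a float, so the chain of multiplications incurs no rounding error at all.

```c
/* Plain left-to-right product; exact whenever every intermediate
   result fits in the 24-bit float significand. */
float mul6(float a) {
    return a * a * a * a * a * a;
}
```

For example, mul6(3.0f) passes through 9, 27, 81, 243 and 729, all exactly representable, so the result is exactly 729. Rewriting such a call as pow(a,6) could not improve on that.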

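In general, you know that for arbitrary floating-point values "pow" has better accuracy than any function you could eventually write, but in some special cases multiple multiplications may have better accuracy and performance. It is up to the developer to choose what is more appropriate, and eventually to comment the code so that no one else will "optimize" it.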

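The only thing that makes sense to optimize (personal opinion, and apparently a choice GCC makes without any particular optimization or compiler flag) is replacing "pow(a,2)" with "a*a". That would be the only sane thing a compiler vendor should do.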

Answer 9

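As Lambdageek pointed out, float multiplication is not associative and you can get less accuracy. But even when you would get better accuracy, you can argue against the optimization, because you want a deterministic application. For example, in a client/server game simulation, where every client has to simulate the same world, you want floating-point calculations to be deterministic.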

Answer 10

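I would not have expected this case to be optimized at all. It can't be very often that an expression contains subexpressions that can be regrouped to remove entire operations. I would expect compiler writers to invest their time in areas that are more likely to result in noticeable improvements, rather than covering a rarely encountered edge case.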

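I was surprised to learn from the other answers that this expression can indeed be optimized with the proper compiler switches. Either the optimization is trivial, or it is an edge case of a much more common optimization, or the compiler writers were extremely thorough.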

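There's nothing wrong with providing hints to the compiler as you've done here. It's a normal and expected part of the micro-optimization process to rearrange statements and expressions to see what differences they bring.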

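While the compiler may be justified in considering the two expressions to deliver inconsistent results (without the proper switches), there's no need for you to be bound by that restriction. The difference will be incredibly tiny, so much so that if the difference matters to you, you should not be using standard floating-point arithmetic in the first place.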

Answer 11

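There are already a few good answers to this question, but for the sake of completeness I want to point out that the applicable section of the C standard is 5.1.2.2.3/15 (which is the same as section 1.9/9 in the C++11 standard). This section states that operators can only be regrouped if they are really associative or commutative.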

Answer 12

gcc actually can do this optimization, even for floating-point numbers. For example,

double foo(double a) {
  return a*a*a*a*a*a;
}

becomes

foo(double):
    mulsd   %xmm0, %xmm0
    movapd  %xmm0, %xmm1
    mulsd   %xmm0, %xmm1
    mulsd   %xmm1, %xmm0
    ret

with -O -funsafe-math-optimizations. This reordering violates IEEE-754, though, so it requires the flag.

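For signed integers, as Peter Cordes pointed out in a comment, this optimization can be done without -funsafe-math-optimizations, since it holds exactly when there is no overflow, and if there is overflow you get undefined behavior. So you get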

foo(long):
    movq    %rdi, %rax
    imulq   %rdi, %rax
    imulq   %rdi, %rax
    imulq   %rax, %rax
    ret

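with just -O. For unsigned integers it's even easier, since they work modulo powers of 2 and so can be reordered freely even in the face of overflow.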
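The unsigned case can be checked directly. In this sketch (function names made up, and assuming uint32_t is not narrower than int, as on common platforms) both groupings are computed modulo 2^32, so they agree for every input, including ones that wrap around.

```c
#include <stdint.h>

/* Left-to-right product, reduced modulo 2^32 at every step. */
uint32_t chain6_u32(uint32_t x) {
    return x * x * x * x * x * x;
}

/* Cube once, then square the cube: three multiplications. */
uint32_t cubes6_u32(uint32_t x) {
    uint32_t cube = x * x * x;
    return cube * cube;
}
```

Because arithmetic modulo 2^32 is associative, the compiler is free to turn the first form into the second without any extra flags.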
