Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?

I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call pow(a,2) by compiling it into a*a, but the call pow(a,6) is not optimized and will actually call the library function pow, which greatly slows down the performance. (In contrast, Intel C++ Compiler, executable icc, will eliminate the library call for pow(a,6).)

What I am curious about is that when I replaced pow(a,6) with a*a*a*a*a*a using GCC 4.5.1 and options "-O3 -lm -funroll-loops -msse4", it uses 5 mulsd instructions:

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13

If I write (a*a*a)*(a*a*a) instead, the compiler produces:

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm13, %xmm13

which reduces the number of multiply instructions to 3. icc behaves similarly.

Why do compilers not recognize this optimization trick?

Best answer

Because floating-point math is not associative (http://en.wikipedia.org/wiki/Floating_point#Accuracy_problems). The way you group the operands in floating-point multiplication has an effect on the numerical accuracy of the answer.

As a result, most compilers are very conservative about reordering floating-point calculations unless they can be sure that the answer will stay the same, or unless you tell them you don't care about numerical accuracy. For example, the -fassociative-math option of gcc allows gcc to reassociate floating-point operations, and the -ffast-math option allows even more aggressive tradeoffs of accuracy against speed.
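To make the effect concrete, here is a minimal sketch that counts how many single-precision inputs in [1, 2) give different results for the two groupings. It assumes float is IEEE-754 binary32 and that float expressions are evaluated in float precision (as on x86-64 with SSE); the function names are mine, chosen for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float (assumes IEEE-754 binary32). */
static float float_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Count the floats a in [1, 2) for which a*a*a*a*a*a and (a*a*a)*(a*a*a)
   round to different values. */
unsigned long count_regrouping_mismatches(void) {
    unsigned long mismatches = 0;
    /* 0x3f800000 is 1.0f; 0x3fffffff is the largest float below 2.0f. */
    for (uint32_t bits = 0x3f800000u; bits <= 0x3fffffffu; ++bits) {
        float a = float_from_bits(bits);
        float chain = a * a * a * a * a * a;  /* ((((a*a)*a)*a)*a)*a */
        float cube = a * a * a;
        float tree = cube * cube;             /* (a*a*a)*(a*a*a) */
        if (chain != tree)
            ++mismatches;
    }
    return mismatches;
}
```

Calling count_regrouping_mismatches() from a test program compiled without -ffast-math should report a nonzero count, which is exactly why the compiler may not substitute one form for the other on its own.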

Answers

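Lambdageek correctly points out that, because associativity does not hold for floating-point numbers, the "optimization" of a*a*a*a*a*a to (a*a*a)*(a*a*a) may change the value. This is why it is disallowed by C99 (unless specifically allowed by the user, via compiler flag or pragma). Generally, the assumption is that the programmer wrote what they did for a reason, and the compiler should respect that. If you want (a*a*a)*(a*a*a), write that.

That can be a pain to write, though; why can't the compiler just do what you consider the right thing when you use pow(a,6)? Because it would be the wrong thing to do. On a platform with a good math library, pow(a,6) is significantly more accurate than either a*a*a*a*a*a or (a*a*a)*(a*a*a). Just to provide some data, I ran a small experiment on my Mac Pro, measuring the worst error in evaluating a^6 for all single-precision floating-point numbers in [1,2]: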

worst relative error using    powf(a, 6.f): 5.96e-08
worst relative error using (a*a*a)*(a*a*a): 2.94e-07
worst relative error using     a*a*a*a*a*a: 2.58e-07

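Using pow instead of a multiplication tree reduces the error bound by a factor of 4. Compilers should not (and generally do not) make "optimizations" that increase error unless licensed to do so by the user (e.g. via -ffast-math).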
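For readers who want to reproduce the two multiplication rows of that experiment, here is a rough sketch, not the original test program. It assumes IEEE-754 binary32 floats evaluated in float precision, scans [1, 2) rather than including 2.0, and uses a double-precision product as the reference (accurate to about 1e-16, far below the errors being measured). The function names are hypothetical. The powf row is left out so that the sketch does not depend on the quality of any particular math library.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float (assumes IEEE-754 binary32). */
static float float_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Worst relative error of a^6 over all floats in [1, 2).
   use_tree == 0: a*a*a*a*a*a;  use_tree != 0: (a*a*a)*(a*a*a). */
double worst_relative_error(int use_tree) {
    double worst = 0.0;
    for (uint32_t bits = 0x3f800000u; bits <= 0x3fffffffu; ++bits) {
        float a = float_from_bits(bits);
        float cube = a * a * a;
        float approx = use_tree ? cube * cube : a * a * a * a * a * a;

        /* Reference value computed in double precision. */
        double d = a;
        double d3 = d * d * d;
        double reference = d3 * d3;

        double err = ((double)approx - reference) / reference;
        if (err < 0.0)
            err = -err;
        if (err > worst)
            worst = err;
    }
    return worst;
}
```

Printing worst_relative_error(0) and worst_relative_error(1) with "%.2e" gives figures to compare against the table above.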

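Note that GCC provides __builtin_powi(x,n) as an alternative to pow( ), which should generate an inline multiplication tree. Use that if you want to trade accuracy for performance but do not want to enable fast-math.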
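A minimal sketch of that builtin in use follows. It only compiles with GCC-compatible compilers, and the wrapper name sixth_power is mine. Whether the call is expanded inline or routed to a runtime helper depends on the optimization level.

```c
/* Compute x^6 through GCC's __builtin_powi, which makes no accuracy
   promise comparable to pow() and is typically expanded into a short
   sequence of multiplications when the exponent is a constant. */
double sixth_power(double x) {
    return __builtin_powi(x, 6);
}
```

For small integer inputs such as 2.0 or 3.0 every intermediate product is exactly representable, so the result matches the mathematical value.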

Another similar case: most compilers won't optimize a + b + c + d to (a + b) + (c + d) (this is an optimization since the second expression can be pipelined better) and instead evaluate it as given (i.e. as (((a + b) + c) + d)). This too is because of corner cases:

float a = 1e35, b = 1e-5, c = -1e35, d = 1e-5;
printf("%e %e\n", a + b + c + d, (a + b) + (c + d));

This outputs 1.000000e-05 0.000000e+00

Fortran (designed for scientific computing) has a built-in power operator, and as far as I know Fortran compilers will commonly optimize raising to integer powers in a similar fashion to what you describe. C/C++ unfortunately don't have a power operator, only the library function pow(). This doesn't prevent smart compilers from treating pow specially and computing it in a faster way for special cases, but it seems they do it less commonly ...

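Some years ago I tried to make it more convenient to compute integer powers in an efficient way, and came up with the following. It is C++, not C, and it still depends on the compiler being reasonably smart about optimizing and inlining. Anyway, I hope you find it useful in practice: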

template<unsigned N> struct power_impl;

template<unsigned N> struct power_impl {
    template<typename T>
    static T calc(const T &x) {
        if (N%2 == 0)
            return power_impl<N/2>::calc(x*x);
        else if (N%3 == 0)
            return power_impl<N/3>::calc(x*x*x);
        return power_impl<N-1>::calc(x)*x;
    }
};

template<> struct power_impl<0> {
    template<typename T>
    static T calc(const T &) { return 1; }
};

template<unsigned N, typename T>
inline T power(const T &x) {
    return power_impl<N>::calc(x);
}

Clarification for the curious: this does not find the optimal way to compute powers, but since finding the optimal solution is an NP-complete problem and this is only worth doing for small powers anyway (as opposed to using pow), there's no reason to fuss with the detail.

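Then just use it as power<6>(a).

This makes powers easy to type (no need to spell out six a's with parentheses), and it gives you this kind of optimization without -ffast-math, in case some other part of your code is precision-dependent (that is, the order of operations there is essential).

You can probably also forget that this is C++ and just use it in a C program (if it is compiled with a C++ compiler).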
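If a C++ compiler is not an option, the same recursion can be sketched in plain C. This is my own adaptation, not part of the answer above: power_u is a hypothetical name, the exponent becomes a run-time argument instead of a template parameter, and whether the calls collapse into a handful of multiplications depends on the compiler inlining them for a constant exponent.

```c
/* Raise x to a small non-negative integer power n, following the same
   splitting rule as the template: halve even exponents, divide
   multiples of 3 by three, otherwise peel off one factor. */
static inline double power_u(double x, unsigned n) {
    if (n == 0)
        return 1.0;
    if (n % 2 == 0)
        return power_u(x * x, n / 2);
    if (n % 3 == 0)
        return power_u(x * x * x, n / 3);
    return power_u(x, n - 1) * x;
}
```

Usage mirrors the template: power_u(a, 6) evaluates ((a*a)*(a*a))*(a*a), the same grouping that power<6>(a) produces.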

Hope this is useful.

This is what I get from my compiler:

For a*a*a*a*a*a:

    movapd  %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0

For (a*a*a)*(a*a*a):

    movapd  %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm0, %xmm0

For power<6>(a):

    mulsd   %xmm0, %xmm0
    movapd  %xmm0, %xmm1
    mulsd   %xmm0, %xmm1
    mulsd   %xmm0, %xmm1

GCC does actually optimize a*a*a*a*a*a to (a*a*a)*(a*a*a) when a is an integer. I tried it with this command:

$ echo 'int f(int x) { return x*x*x*x*x*x; }' | gcc -o - -O2 -S -masm=intel -x c -

There are a lot of gcc flags but nothing fancy. They mean: Read from stdin; use O2 optimization level; output assembly language listing instead of a binary; the listing should use Intel assembly language syntax; the input is in C language (usually language is inferred from input file extension, but there is no file extension when reading from stdin); and write to stdout.

Here is the important part of the output. I have annotated it with comments indicating what is going on in the assembly language:

; x is in edi to begin with.  eax will be used as a temporary register.
mov  eax, edi  ; temp = x
imul eax, edi  ; temp = x * temp
imul eax, edi  ; temp = x * temp
imul eax, eax  ; temp = temp * temp

I'm using system GCC on Linux Mint 16 Petra, an Ubuntu derivative. Here's the gcc version:

$ gcc --version
gcc (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1

As other posters have noted, this optimization is not possible in floating point, because floating-point arithmetic is not associative.

Because a 32-bit floating-point number - such as 1.024 - is not 1.024. In a computer, 1.024 is an interval: from (1.024-e) to (1.024+e), where "e" represents an error. Some people fail to realize this and also believe that * in a*a stands for multiplication of arbitrary-precision numbers without there being any errors attached to those numbers. The reason why some people fail to realize this is perhaps the math computations they exercised in elementary schools: working only with ideal numbers without errors attached, and believing that it is OK to simply ignore "e" while performing multiplication. They do not see the "e" implicit in "float a=1.2", "a*a*a" and similar C codes.

If the majority of programmers recognized (and were able to act on) the idea that the C expression a*a*a*a*a*a is not actually working with ideal numbers, the GCC compiler would then be FREE to optimize "a*a*a*a*a*a" into, say, "t=(a*a); t*t*t", which requires a smaller number of multiplications. But unfortunately, the GCC compiler does not know whether the programmer writing the code thinks that "a" is a number with or without an error. And so GCC will only do what the source code looks like, because that is what GCC sees with its "naked eye".

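... once you know what kind of programmer YOU are, you can use the "-ffast-math" switch to tell GCC "Hey, GCC, I know what I am doing!". This allows GCC to convert a*a*a*a*a*a into a different piece of text. It looks different from a*a*a*a*a*a, but it still computes a number within the error interval of a*a*a*a*a*a. This is OK, since you already know you are working with intervals, not ideal numbers.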

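Library functions like "pow" are usually carefully crafted to yield the minimum possible error (in the generic case). This is usually achieved by approximating the function with splines (according to Pascal's comment, the most common implementation seems to use the Remez algorithm).

Fundamentally, the following operation: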

pow(x,y);

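has an inherent error of approximately the same magnitude as the error in any single multiplication or division.

While the following operation: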

float a=someValue;
float b=a*a*a*a*a*a;

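has an inherent error that can be up to about 5 times the error of a single multiplication or division (because you are combining 5 multiplications).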
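A rough way to see where the factor of 5 comes from, under the standard model of floating-point arithmetic (each operation returns the exact result times (1 + δ) with |δ| ≤ u, where u is the unit roundoff; this notation is mine, not the answer's, and a is assumed to be exactly representable):

```latex
\operatorname{fl}(a \cdot a \cdot a \cdot a \cdot a \cdot a)
  = a^{6} \prod_{i=1}^{5} (1 + \delta_i), \qquad |\delta_i| \le u,
\qquad\Longrightarrow\qquad
\left| \frac{\operatorname{fl}(a \cdot a \cdot a \cdot a \cdot a \cdot a) - a^{6}}{a^{6}} \right|
  \le (1 + u)^{5} - 1 = 5u + O(u^{2}).
```

For single precision, u = 2^{-24} ≈ 5.96e-08, so the bound is about 2.98e-07. That is in line with the measurements quoted in the earlier answer: 5.96e-08 for powf versus 2.58e-07 and 2.94e-07 for the two multiplication orders.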

The compiler should be really careful about the kind of optimization it is doing:

  1. if optimizing pow(a,6) to a*a*a*a*a*a it may improve performance, but drastically reduce the accuracy for floating point numbers.
  2. if optimizing a*a*a*a*a*a to pow(a,6) it may actually reduce the accuracy because "a" was some special value that allows multiplication without error (a power of 2 or some small integer number)
  3. if optimizing pow(a,6) to (a*a*a)*(a*a*a) or (a*a)*(a*a)*(a*a) there still can be a loss of accuracy compared to pow function.
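Point 2 can be checked directly. In the sketch below (assuming IEEE-754 single precision; mul6 is a name I made up), every intermediate product fits in the 24-bit significand, so the chained multiplication is exact for these inputs:

```c
/* Chained multiplication, evaluated left to right in float precision. */
float mul6(float a) {
    return a * a * a * a * a * a;
}
```

For example, mul6(2.0f) is exactly 64, mul6(3.0f) is exactly 729, and mul6(0.5f) is exactly 0.015625, with no rounding at any step.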

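In general, you know that for arbitrary floating-point values "pow" has better accuracy than any function you could write yourself, but in some special cases multiple multiplications may give better accuracy and performance. It is up to the developer to choose what is more appropriate, and ideally to comment the code so that no one else "optimizes" it later.

The only optimization that makes sense (personal opinion, and apparently what GCC does without any particular optimization or compiler flag) is replacing "pow(a,2)" with "a*a". That would be the only sane thing a compiler vendor should do.

As Lambdageek pointed out, float multiplication is not associative and you can get less accuracy. But even when you would get better accuracy, you can argue against the optimization, because you want a deterministic application. For example, in a client/server game simulation where every client has to simulate the same world, you want floating-point calculations to be deterministic.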

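I would not have expected this case to be optimized at all. It can't be very often that an expression contains subexpressions that can be regrouped to remove entire operations. I would expect compiler writers to invest their time in areas more likely to result in noticeable improvements, rather than covering a rarely encountered edge case.

I was surprised to learn from the other answers that this expression can indeed be optimized with the proper compiler switches. Either the optimization is trivial, or it is an edge case of a much more common optimization, or the compiler writers were extremely thorough.

There is nothing wrong with providing hints to the compiler as you have done here. Rearranging statements and expressions to see what difference they make is a normal and expected part of the micro-optimization process.

While the compiler may be justified in considering the two expressions to deliver inconsistent results (without the proper switches), there is no need for you to be bound by that restriction. The difference will be incredibly tiny, so much so that if the difference matters to you, you should not be using standard floating-point arithmetic in the first place.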

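There are already a few good answers to this question, but for the sake of completeness I wanted to point out that the applicable section of the C standard is 5.1.2.2.3/15 (which is the same as section 1.9/9 in the C++11 standard). This section states that operators can only be regrouped if they are really associative or commutative.

gcc actually can do this optimization, even for floating-point numbers. For example,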

double foo(double a) {
  return a*a*a*a*a*a;
}

becomes

foo(double):
    mulsd   %xmm0, %xmm0
    movapd  %xmm0, %xmm1
    mulsd   %xmm0, %xmm1
    mulsd   %xmm1, %xmm0
    ret

with -O -funsafe-math-optimizations. This reordering violates IEEE-754, though, so it requires the flag.

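As Peter Cordes pointed out in a comment, for signed integers this optimization can be done without -funsafe-math-optimizations, because it holds exactly when there is no overflow, and if there is overflow you get undefined behavior. So you get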

foo(long):
    movq    %rdi, %rax
    imulq   %rdi, %rax
    imulq   %rdi, %rax
    imulq   %rax, %rax
    ret

For unsigned integers it is even easier: since they work modulo a power of 2, they can be reordered freely even in the face of overflow.
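The unsigned case is easy to check by hand. The following sketch (function names are mine) computes the sixth power both ways; because unsigned arithmetic in C wraps modulo 2^N, the two groupings agree for every input, including ones that overflow:

```c
/* Sixth power as written: ((((x*x)*x)*x)*x)*x, wrapping modulo 2^N. */
unsigned int pow6_chain_u(unsigned int x) {
    return x * x * x * x * x * x;
}

/* Regrouped form: (x*x*x)*(x*x*x), also wrapping modulo 2^N. */
unsigned int pow6_tree_u(unsigned int x) {
    unsigned int cube = x * x * x;
    return cube * cube;
}
```

Since both functions always return the same value, a compiler is free to emit the three-multiply form for either one without any special flag.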




