My essential problem is how to make floating-point arithmetic on x86 behave the same as it does on PowerPC, moving from Classic Mac OS (CodeWarrior) to Windows (VS 2008).
The code in question, of which there is a lot, contains a pile of algorithms that are highly iterative and numerically very sensitive.
A typical complex line is:
Ims_sd = sqrt((4.0*Ams*sqr(nz) - 8.0*(Ams+Dms)*nz + 12.0*sqr(Ams)) /
              (4.0*sqr(Ams)*(sqr(nz)-1)) -
              sqr(Ims_av)) * sqrt(nz-1);
It is written using a typedef'd float as the base type.
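For context, the numeric setup looks something like this (real_t is my invented name for the actual typedef; sqr() is the helper used in the snippet above, its definition here is assumed):

    typedef float real_t;   // the base numeric type used by all the algorithms

    static inline real_t sqr(real_t x) { return x * x; }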
Changing to double gets very similar results on both platforms, but unfortunately the numbers are not acceptable, so we can't take that easy way out.
The Mac code is compiled with CodeWarrior, and just turning off generation of the FMADD & FMSUB (fused multiply-add/subtract) instructions had a drastic effect on the numbers produced. So my starting point was to search for the Visual Studio (2008) options that seemed most similar, making sure fused add was being used. We suspect the key lies in how the compiler allocates intermediate storage during computations.
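To see why the fused instructions matter so much: a fused multiply-add rounds a*b + c once, where a separate multiply and add round twice. A minimal, self-contained illustration (it uses std::fma from C99/C++11, which VS 2008 lacks; it is only here to show the effect):

    #include <cmath>
    #include <cstdio>

    int main()
    {
        double eps = std::ldexp(1.0, -27);       // 2^-27
        double a = 1.0 + eps, b = 1.0 - eps;     // a*b == 1 - 2^-54 exactly
        double fused    = std::fma(a, b, -1.0);  // one rounding: -2^-54
        double separate = a * b - 1.0;           // a*b rounds to 1.0 first: 0
        std::printf("fused = %g, separate = %g\n", fused, separate);
        return 0;
    }

The same expression gives about -5.55e-17 fused and exactly 0 unfused; in a highly iterative algorithm such differences compound quickly.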
Currently the best results are being obtained with a combination of enabling SSE2 and /fp:fast. Enabling intrinsic functions causes values to drift further from the Mac values.
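For reference, those settings expressed as command-line switches (the file name and /O2 level are placeholders; /arch:SSE2 and /fp:fast are the two switches in question):

    cl /O2 /arch:SSE2 /fp:fast model.cpp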
The /fp switch documentation says that only /fp:strict turns off the fused-add behaviour.
MSDN talks about linking FP10.OBJ "before LIBC.LIB, LIBCMT.LIB, or MSVCRT.LIB" to guarantee 64-bit precision. I've apparently achieved this by specifying FP10.OBJ in the linker input field (verbose linker output shows it prior to MSVCRTD.lib).
I've also set 64-bit precision by invoking
_controlfp_s(&control_word, _PC_64, MCW_PC);
in DllMain.
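Spelled out, the hook looks roughly like this (the error handling is my own embellishment; note also that MCW_PC only affects the x87 unit in 32-bit builds, so SSE2 code paths ignore it):

    #include <windows.h>
    #include <float.h>

    BOOL APIENTRY DllMain(HMODULE hModule, DWORD reason, LPVOID reserved)
    {
        if (reason == DLL_PROCESS_ATTACH) {
            unsigned int control_word = 0;
            // Ask the x87 FPU for a 64-bit significand (80-bit extended format).
            errno_t err = _controlfp_s(&control_word, _PC_64, MCW_PC);
            if (err != 0) {
                // Illustrative only; the real code may handle this differently.
                OutputDebugStringA("DllMain: failed to set 64-bit FP precision\n");
            }
        }
        return TRUE;
    }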
Note that the problem is not due to differences in floating-point exception handling between the platforms, nor to the (delightful) way PowerPC handles integer division by zero (it just returns zero): these areas have already been audited and addressed, thanks hugely to PC-Lint. The program runs and produces somewhat plausible output, just not quite good enough.
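For the record, the division fixes were of this general shape (an illustrative guard, not the actual code):

    // PowerPC quietly returns zero for an integer division by zero; x86
    // raises a hardware exception, so every such site needed an explicit
    // guard to reproduce the PPC result.
    static inline int safe_div(int num, int den)
    {
        return (den != 0) ? num / den : 0;
    }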
UPDATE:
An interesting comment from a friend: one possibility is that the PPC has a large number of temporary registers that can hold 64-bit intermediate values, whereas the x86 code may have to unload and reload the FPU, truncating intermediates to 4 bytes and losing precision.
This may be why SSE2 works better: (IIRC) it has more registers and more scope for preserving intermediate values.
Another possibility: can your code be compiled as 64-bit? x64 also has more registers for intermediates, and better FP instructions, so it may be closer to the PPC in design and execution.
Initial testing with a 64-bit build did indeed get closer, as he suggested it might (I first thought it overshot, but that was due to an incorrect modeling setting).
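To make the truncation theory above concrete, here is a small demonstration (values are mine, chosen to sit right at the edge of float precision): forcing an intermediate through a 4-byte float discards bits that a wider register would have kept.

    #include <cstdio>

    int main()
    {
        float a = 16777216.0f;                 // 2^24: beyond this, float
        float b = 1.0f;                        // cannot hold odd integers
        double wide  = (double)a + (double)b;  // kept wide: 16777217 exactly
        float  spill = a + b;                  // spilled to float: rounds back
                                               // to 16777216, b vanishes
        std::printf("wide - a = %g, spill - a = %g\n",
                    wide - a, (double)spill - (double)a);  // prints 1 and 0
        return 0;
    }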
Final Resolution
I'm sure anyone interested in this topic is sufficiently obsessive that they would like to know how it all worked out in the end. The software is finished and delivering consistent numeric results. We were never able to get all the algorithms to deliver results identical to the Mac, but they were close enough to be statistically acceptable. Given that the processing is guided by an expert user selecting the areas of interest, and that the user input is partly reactive to how the model progresses, the chief scientist deemed it acceptable (this was not an overnight decision!). The remaining numeric differences are well below the level that would drive a different clinical result, so no differing diagnoses have been seen in testing.