My essential problem is how to make floating-point arithmetic on x86 behave the same as it does on PowerPC, moving from Classic Mac OS (CodeWarrior) to Windows (VS 2008).
The code in question, of which there is a lot, contains a pile of algorithms that are highly iterative and numerically very sensitive.
A typical complex line is:
Ims_sd = sqrt((4.0*Ams*sqr(nz) - 8.0*(Ams+Dms)*nz + 12.0*sqr(Ams)) /
              (4.0*sqr(Ams)*(sqr(nz)-1)) -
              sqr(Ims_av)) * sqrt(nz-1);
It is written using a typedef'd float as the base type.
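For context, the numeric setup looks something like this (real_t is my invented name for the actual typedef; sqr() is the helper used in the snippet above, its definition here is assumed):

    typedef float real_t;   // the base numeric type used by all the algorithms

    static inline real_t sqr(real_t x) { return x * x; }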
Changing to double gets very similar results on both platforms, but unfortunately the numbers are not acceptable, so we can't take that easy way out.
The Mac code is compiled with CodeWarrior, and just turning off generation of the FMADD & FMSUB (fused multiply-add/subtract) instructions had a drastic effect on the numbers produced. So my starting point was to search for the Visual Studio (2008) options that seemed most similar, making sure fused add was being used. We suspect the key lies in how the compiler allocates intermediate storage during computations.
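To see why the fused instructions matter so much: a fused multiply-add rounds a*b + c once, where a separate multiply and add round twice. A minimal, self-contained illustration (it uses std::fma from C99/C++11, which VS 2008 lacks; it is only here to show the effect):

    #include <cmath>
    #include <cstdio>

    int main()
    {
        double eps = std::ldexp(1.0, -27);       // 2^-27
        double a = 1.0 + eps, b = 1.0 - eps;     // a*b == 1 - 2^-54 exactly
        double fused    = std::fma(a, b, -1.0);  // one rounding: -2^-54
        double separate = a * b - 1.0;           // a*b rounds to 1.0 first: 0
        std::printf("fused = %g, separate = %g\n", fused, separate);
        return 0;
    }

The same expression gives about -5.55e-17 fused and exactly 0 unfused; in a highly iterative algorithm such differences compound quickly.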
Currently the best results are being obtained with a combination of enabling SSE2 and /fp:fast. Enabling intrinsic functions causes values to drift further from the Mac values.
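For reference, those settings expressed as command-line switches (the file name and /O2 level are placeholders; /arch:SSE2 and /fp:fast are the two switches in question):

    cl /O2 /arch:SSE2 /fp:fast model.cpp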
The /fp switch documentation says that only /fp:strict turns off the fused-add behaviour.
MSDN talks about linking FP10.OBJ "before LIBC.LIB, LIBCMT.LIB, or MSVCRT.LIB" to guarantee 64-bit precision. I've apparently achieved this by specifying FP10.OBJ in the linker input field (verbose linker output shows it prior to MSVCRTD.lib).
I've also set 64-bit precision by invoking
_controlfp_s(&control_word, _PC_64, MCW_PC);
in DllMain.
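Spelled out, the hook looks roughly like this (the error handling is my own embellishment; note also that MCW_PC only affects the x87 unit in 32-bit builds, so SSE2 code paths ignore it):

    #include <windows.h>
    #include <float.h>

    BOOL APIENTRY DllMain(HMODULE hModule, DWORD reason, LPVOID reserved)
    {
        if (reason == DLL_PROCESS_ATTACH) {
            unsigned int control_word = 0;
            // Ask the x87 FPU for a 64-bit significand (80-bit extended format).
            errno_t err = _controlfp_s(&control_word, _PC_64, MCW_PC);
            if (err != 0) {
                // Illustrative only; the real code may handle this differently.
                OutputDebugStringA("DllMain: failed to set 64-bit FP precision\n");
            }
        }
        return TRUE;
    }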
Note that the problem is not due to differences in floating-point exception handling between the platforms, nor to the (delightful) way PowerPC handles integer division by zero (it just returns zero): these areas have already been audited and addressed, thanks hugely to PC-Lint. The program runs and produces somewhat plausible output, just not quite good enough.
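For the record, the division fixes were of this general shape (an illustrative guard, not the actual code):

    // PowerPC quietly returns zero for an integer division by zero; x86
    // raises a hardware exception, so every such site needed an explicit
    // guard to reproduce the PPC result.
    static inline int safe_div(int num, int den)
    {
        return (den != 0) ? num / den : 0;
    }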
UPDATE:
An interesting comment from a friend: one possibility is that the PPC has a large number of temporary registers that can hold 64-bit intermediate values, whereas the x86 code may have to unload and reload the FPU, truncating intermediates to 4 bytes and losing precision.
This may be why SSE2 works better: (IIRC) it has more registers and more scope for preserving intermediate values.
Another possibility: can your code be compiled as 64-bit? x64 also has more registers for intermediates, and better FP instructions, so it may be closer to the PPC in design and execution.
Initial testing with a 64-bit build did indeed get closer, as he suggested it might (I first thought it overshot, but that was due to an incorrect modeling setting).
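To make the truncation theory above concrete, here is a small demonstration (values are mine, chosen to sit right at the edge of float precision): forcing an intermediate through a 4-byte float discards bits that a wider register would have kept.

    #include <cstdio>

    int main()
    {
        float a = 16777216.0f;                 // 2^24: beyond this, float
        float b = 1.0f;                        // cannot hold odd integers
        double wide  = (double)a + (double)b;  // kept wide: 16777217 exactly
        float  spill = a + b;                  // spilled to float: rounds back
                                               // to 16777216, b vanishes
        std::printf("wide - a = %g, spill - a = %g\n",
                    wide - a, (double)spill - (double)a);  // prints 1 and 0
        return 0;
    }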
Final Resolution
I'm sure anyone interested in this topic is sufficiently obsessive that they would like to know how it all worked out in the end. The software is finished and delivering consistent numeric results. We were never able to get all the algorithms to deliver results identical to the Mac, but they were close enough to be statistically acceptable. Given that the processing is guided by an expert user selecting the areas of interest, and that the user input is partly reactive to how the model progresses, the chief scientist deemed it acceptable (this was not an overnight decision!). The remaining numeric differences are well below the level that would drive a different clinical result, so no differing diagnoses have been seen in testing.