11

I am working on a project with a lot of math calculations. After switching to a new test machine, I noticed that a lot of tests failed. It is also important to note that the tests failed on my development machine and on some other developers' machines as well. After tracing values and comparing them with values from the old machine, I found that some functions from math.h (at the moment I have only found cosine) sometimes return slightly different values (for example: 40965.8966304650828827e-01 and 40965.8966304650828816e-01, -3.3088623618085204e-08 and -3.3088623618085197e-08).

New CPU: Intel Xeon Gold 6230R (Intel64 Family 6 Model 85 Stepping 7)

Old CPU: Exact model is unknown (Intel64 Family 6 Model 42 Stepping 7)

My CPU: Intel Core i7-4790K

Test results don't depend on the Windows version (7 and 10 were tested).

I also tested with a binary statically linked against the standard library, to rule out different library versions being loaded on different machines and Windows versions, but the results were the same.

The project is compiled with /fp:precise; switching to /fp:strict changed nothing.

MSVC from Visual Studio 2015 is used: 19.00.24215.1 for x64.

How to make calculations fully reproducible?
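For reference, here is a minimal sketch (the input value below is made up for illustration; the real failing inputs come from our test suite) of the kind of check that exposes the difference between machines:

/* Hypothetical repro sketch: print cos() results with enough digits to
 * round-trip a double, then diff the output between machines. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 12345.6789;  /* made-up input */
    printf("cos(%.17e) = %.17e\n", x, cos(x));
    return 0;
}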

  • You are beyond the precision of a double in the first example. Remember that a double has 15 to 17 digits of precision. Commented Oct 14, 2022 at 20:05
  • "How to make calculations fully reproducible?" - Don't rely on exact results when it comes to floating point math. Commented Oct 14, 2022 at 20:08
  • I suspect you should update your test cases to relax their acceptance criteria slightly. Not a lot (you still want to catch real errors), but you don't want to call something a failure when it is, as here, an acceptable and unavoidable variation in the last one or two bits of a floating-point result. I can't tell you exactly how to do this, because it can be a hard problem, with some real subtleties. You might want to retain a consultant with expertise in floating-point numerical analysis. Commented Oct 14, 2022 at 20:11
  • All your tests did was show the obvious in terms of how floating point works. If you tested maybe 4 or 5 digits of precision, OK. But all 15 to 17 digits? That is bound to fail, if not guaranteed to fail. Commented Oct 14, 2022 at 20:16
  • Floating-point results can be exact, but often they're not, and usually it's not appropriate to expect them to be. If you were doing Quality Control in a widget manufacturing plant, and the widgets were supposed to be 17.5 inches long, you would probably check that they were 17.5 ±0.01 inches long, or maybe ±0.001 inches, or maybe ±0.0001 inches. But you would not insist that they be 17.5 ±0.00000000001 inches. And for a great many programs that compute floating-point results, the same principle applies. Commented Oct 14, 2022 at 20:26

3 Answers

7

Since you are on Windows, I am pretty sure the different results occur because the UCRT detects at runtime whether FMA3 (fused multiply-add) instructions are available on the CPU and, if so, uses them in transcendental functions such as cosine. This gives slightly different results. The solution is to place the call _set_FMA3_enable(0); at the very start of your main() or WinMain() function, as described here.
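A minimal sketch of that fix, assuming the x64 UCRT's <math.h> declares _set_FMA3_enable (recent Windows SDKs do; if yours does not, declare it yourself as int _set_FMA3_enable(int);):

/* Minimal sketch: force the UCRT onto its non-FMA3 code paths before the
 * first transcendental call, so cos(), sin(), exp(), etc. behave
 * identically on CPUs with and without FMA3 support. */
#include <math.h>

int main(void)
{
#if defined(_M_X64)
    _set_FMA3_enable(0);  /* must run before any math.h transcendental call */
#endif
    /* ... rest of the program, now reproducible across x64 CPUs ... */
    return 0;
}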

If you also want reproducibility between different operating systems, things become harder or even impossible. See e.g. this blog post.

In response to the comments stating that you should just use some tolerance, I do not agree with this as a general statement. Certainly, there are many applications where that is the way to go. But I do think that exactly reproducible floating-point results can be a sensible requirement for some applications, at least when staying on the same OS (Windows, in this case). In fact, we had the very same issue with _set_FMA3_enable a while ago.

I am a software developer for a traffic simulation, and minor differences such as 10^-16 often build up and eventually lead to entirely different simulation results. Naturally, one is supposed to run many simulations with different seeds and average over all of them, making the different behavior irrelevant for the final result. But sometimes customers have a problem at a specific simulation second for a specific seed (e.g. an application crash or incorrect behavior of an entity), and not being able to reproduce it on our developer machines due to a different CPU makes it much harder to diagnose and fix the issue.

Moreover, if the test system consists of a mixture of older and newer CPUs and test cases are not bound to specific machines, tests can sometimes deviate seemingly without reason (flaky tests), which is certainly not desired. Requiring exact reproducibility also makes writing tests much easier, because you do not need heuristic thresholds (e.g. a tolerance or some guessed number of samples). Finally, our customers expect the results to remain stable for a specific version of the program, since they calibrated (more or less...) their traffic networks against real data. This is somewhat questionable, since (again) one should actually look at averages, but the naive expectation usually wins in practice.


6 Comments

As one of the commenters who was advocating for some tolerance, I appreciate your comments about striving for exactitude. In the end I think it's a tradeoff: there can be costs to inexactitude, such as the non-reproducibility of simulations that you mentioned, but then again, there are costs to tracking down and then finding a way to correct every inexactitude, and those can be high, too! So sometimes you have to choose your poison. (Btw, in the context of traffic simulations, on first reading I completely misinterpreted those words "a problem — e.g. a crash". :-) )
Heh, you're right ;-) I amended the text to make it clearer. Also, I agree with it being a trade-off. However, in practice we have not had much trouble over the years requiring reproducibility when staying on a single platform (OS + toolchain). AFAIK it was mostly the _set_FMA3_enable thing a few years ago and a change by Microsoft to the behavior of printf one or two years ago.
This is a plausible explanation of the FP difference. It's not clear whether you knew or were just guessing, but Intel64 Family 6 Model 42 Stepping 7 dates from before Intel CPUs supported FMA3, whereas the other two postdate the addition of FMA3 to Intel designs.
As far as the question of testing exact FP results, I can appreciate that doing so might be desirable in conjunction with the particular application you describe, but the constraints on that application are atypical, and indeed, not entirely sensible (as you acknowledge yourself). I would accept that there are cases where you do want to test exact FP results, but this answer seems to suggest that it is always reasonable and appropriate to test exact FP results, and I utterly reject that.
@JohnBollinger I completely agree with you that exact reproducibility is not always (or even rarely) a sensible requirement, and I am sorry that my post sounded like this (although I did mention "for some applications"). I edited it to emphasize that my statement holds only for certain applications. On the other hand, I do reject the notion of the comments suggesting that it can never be a sensible requirement. It really depends on the application, and without knowing the situation of the OP, we cannot make a fair judgment.
2
  • First of all, 40965.8966304650828827e-01 cannot be a result from the cos() function: for real-valued arguments, cos(x) always returns a value in the interval [-1.0, 1.0], so the value shown cannot be its output.

  • Second, you have probably read somewhere that double values have a precision of roughly 17 digits in the significand, while you are trying to show 21 digits. You cannot get correct data past the ...508, as you are pushing the result past the 17-digit limit.

The reason you get different results on different computers is that whatever is shown past the precise digits is effectively unspecified, so it is normal to get different values there (you could even get different values on different runs on the same machine with the same program).
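A minimal sketch of this point: DBL_DECIMAL_DIG (17 for double) is the number of significant decimal digits needed to round-trip a double, and digits printed beyond that carry no extra information about the computation; they merely expand the stored binary value in decimal.

/* Minimal sketch: printing more than DBL_DECIMAL_DIG significant digits
 * does not reveal extra precision; the tail is just the decimal expansion
 * of the nearest representable binary value. */
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    double v = cos(1.0);
    printf("DBL_DECIMAL_DIG = %d\n", DBL_DECIMAL_DIG);  /* 17 */
    printf("%.16e\n", v);  /* 17 significant digits: round-trips exactly */
    printf("%.20e\n", v);  /* 21 digits: the tail is not meaningful */
    return 0;
}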

3 Comments

1. Yes, that was my mistake, but this is a real problematic value from the test results.
2. Yes, but accumulation of error over multiple computations can result in large errors. And we need to know what can affect the results, because significant resources would now be required to make the current code generate results fully independent of the hardware.
1. Then you have a mistake in your code; the error can grow, but not that much. 2. Your accumulated error can become huge if you subtract two quantities of about the same magnitude (e.g. 1.1235645 - 1.1235558), because the relative error can grow up to ~100%, and this can be amplified if you multiply that result by a large number. But normally accumulation errors tend to compensate in pairs and keep the relative error the same. Look for a book on error theory. You need it.
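To illustrate the cancellation effect mentioned in the last comment, a minimal sketch (values chosen arbitrarily):

/* Catastrophic cancellation: subtracting two nearly equal numbers leaves
 * only their least significant (and least accurate) digits, so the
 * relative error of the difference can be huge, and a later
 * multiplication amplifies the absolute error. */
#include <stdio.h>

int main(void)
{
    double a = 1.1235645;
    double b = 1.1235558;
    double diff = a - b;            /* ~8.7e-06: few accurate digits remain */
    double amplified = diff * 1e8;  /* magnifies the absolute error */
    printf("diff      = %.17e\n", diff);
    printf("amplified = %.17e\n", amplified);
    return 0;
}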
0

IEEE-754 double-precision binary floating point provides only 15 to 17 significant decimal digits of precision. You are looking at the "noise" of different library implementations and possibly different FPU implementations.

How to make calculations fully reproducible?

That is an X-Y problem. The answer is: you can't. But it is the wrong question. You would do better to ask how to implement valid and robust tests that are sympathetic to this well-known and unavoidable technical issue with floating-point representation. Without the test code you are using, it is not possible to answer that directly.

Generally, you should avoid comparing floating-point values for exact equality; instead, subtract the result from the expected value and test for an acceptable discrepancy within the supported precision of the FP type used. For example:

#define EXPECTED_RESULT  40965.8966304650
#define RESULT_PRECISION 00000.0000000001

double actual_result = test();
/* Fail only if the result deviates from the expected value by more than
   the chosen precision. */
bool error = fabs( actual_result - EXPECTED_RESULT ) > RESULT_PRECISION;
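A fixed absolute tolerance like the above only makes sense near one magnitude. A common refinement (the helper name here is made up) scales the tolerance with the size of the operands and keeps an absolute floor for values near zero:

#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: relative tolerance scaled by operand magnitude,
 * with an absolute floor so comparisons near zero still work. */
static bool nearly_equal(double a, double b, double rel_tol, double abs_tol)
{
    double diff  = fabs(a - b);
    double scale = fmax(fabs(a), fabs(b));
    return diff <= fmax(rel_tol * scale, abs_tol);
}

/* Usage: nearly_equal(actual_result, EXPECTED_RESULT, 1e-14, 1e-300) */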

2 Comments

Thanks, but we are already using that.
@armoken you missed my point. Striving for the reproducibility you expect is futile. This is the solution to the inevitability of FP error.
