Revisions to Solutions for floating point rounding errors

added 2 characters in body

edited Feb 22, 2016 at 19:38

Floating point arithmetic is usually quite precise (15 decimal digits for a double) and quite flexible. The problems crop up when you are doing math that significantly reduces the number of digits of precision. Here are some examples:

Cancellation on subtraction: 1234567890.12345 - 1234567890.12300 gives 0.00045, a result with only two decimal digits of precision. This strikes whenever you subtract two numbers of similar magnitude.
Swallowing of precision: 1234567890.12345 + 0.123456789012345 evaluates to 1234567890.24691; the last ten digits of the second operand are lost.
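Multiplications: If you multiply two 15 digit numbers, the exact result has 30 digits that would need to be stored. But you can't store them, so the last 15 digits are lost. This is especially irksome when a later step depends on those digits, for example when the product is subtracted from a nearly equal value. A sum of squares under a sqrt() (as in sqrt(x*x + y*y)) is more forgiving: both terms are positive, so the result is still good to roughly 15 digits.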
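
The first two effects are easy to reproduce. Here is a minimal sketch in C, assuming double is an IEEE 754 64-bit type (true on practically every current platform, but still an assumption); the function names cancel_demo and swallow_demo are made up for this illustration:

```c
/* Cancellation: each literal is rounded to the nearest double, which
   leaves an error on the order of 1e-7. The leading digits cancel in the
   subtraction, but that error stays and is now large relative to the result. */
double cancel_demo(void) {
    double a = 1234567890.12345;
    double b = 1234567890.12300;
    return a - b;              /* on paper: 0.00045 */
}

/* Swallowing: adding a small value to a large one and then taking the
   large one away again does not give the small value back. */
double swallow_demo(double big, double small) {
    double sum = big + small;  /* low digits of small are rounded away here */
    return sum - big;          /* on paper: small */
}
```

Printing cancel_demo() and swallow_demo(1234567890.12345, 0.123456789012345) with "%.17g" should show values that are close to 0.00045 and 0.123456789012345 but differ from them well before the 15th digit.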

These are the main pitfalls that you need to be aware of. Once you are aware of them, you can try to formulate your math in a way that avoids them. For example, if you need to increment a value over and over again in a loop, avoid doing this:

for(double f = f0; f < f1; f += df) { /* use f */ }

After a few iterations, the larger f will swallow part of the precision of df. Worse, the errors will add up, leading to the counterintuitive situation that a smaller df may lead to worse overall results. It is better to write this:

for(int i = 0; i < (f1 - f0)/df; i++) {
    double f = f0 + i*df;
    /* use f */
}

Because you are combining the increments in a single multiplication, the resulting f will be precise to 15 decimal digits.
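
To compare the two approaches directly, here is a small sketch in C. The helper names step_by_addition and step_by_index are hypothetical, picked only for this example, and IEEE 754 doubles are assumed:

```c
/* Reaches the n-th step by adding df over and over.
   Every addition rounds, and the rounding errors accumulate. */
double step_by_addition(double f0, double df, int n) {
    double f = f0;
    for (int i = 0; i < n; i++) {
        f += df;
    }
    return f;
}

/* Reaches the n-th step with one multiplication and one addition,
   so only two roundings are involved no matter how large n is. */
double step_by_index(double f0, double df, int n) {
    return f0 + n * df;
}
```

Calling both with f0 = 0.0, df = 0.1 and n = 1000000 and printing the results with "%.17g" should show the accumulated version drifting visibly away from 100000 while the index-based version stays right next to it.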

This is only one example; there are other ways to avoid loss of precision in other situations. But it already helps a lot to think about the magnitude of the values involved, and to imagine what would happen if you did your math with pen and paper, rounding to a fixed number of digits after every step.
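
The pen-and-paper experiment can also be played through in code. The following is only a rough sketch in C: round_sig and accumulate_rounded are hypothetical helpers written for this illustration, they are meant for small digit counts and moderate magnitudes, and they imitate decimal rounding on top of binary doubles rather than implementing true decimal arithmetic.

```c
/* Rounds x to `digits` significant decimal digits, the way you would
   when working on paper with a fixed number of digits. */
double round_sig(double x, int digits) {
    if (x == 0.0) return 0.0;
    double sign = x < 0.0 ? -1.0 : 1.0;
    double ax = sign * x;
    double hi = 1.0;
    for (int i = 0; i < digits; i++) hi *= 10.0;   /* 10^digits */
    double lo = hi / 10.0;
    double up = 1.0, down = 1.0;                   /* scaled value is ax * up / down */
    while (ax * up / down >= hi) down *= 10.0;
    while (ax * up / down < lo) up *= 10.0;
    double scaled = ax * up / down;                /* now in [lo, hi) */
    double rounded = (double)(long long)(scaled + 0.5);
    return sign * rounded * down / up;
}

/* Adds df to f0 n times, rounding to `digits` digits after every step. */
double accumulate_rounded(double f0, double df, int n, int digits) {
    double f = round_sig(f0, digits);
    for (int i = 0; i < n; i++) {
        f = round_sig(f + df, digits);
    }
    return f;
}
```

With four digits, 1000 + 0.1 rounds back to 1000 at every step, so accumulate_rounded(1000.0, 0.1, 100, 4) never leaves 1000, whereas round_sig(1000.0 + 100 * 0.1, 4) gives 1010. That is the swallowing effect from the loop above, just with few enough digits to follow by hand.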

added 1 character in body

edited Feb 21, 2016 at 9:04

cmaster - reinstate monica

answered Feb 21, 2016 at 8:59

cmaster - reinstate monica
