In the Numbers étude, many students failed to grasp
either that the model code was computing *relative errors*
or the practical significance of the relative errors obtained,
given that the model code uses single-precision floats.

Recall that IEEE single-precision floating point numbers are

| S | E | M |
|---|---|---|
| 1 bit | 8 bits | 23 bits |

where S is the sign bit, E is the exponent field, and M is the significand aligned so that its most significant 1 bit is just off the edge of what's stored. (This is called a “hidden bit” and means that the significand is effectively 24 bits.)

There are two very special values for the exponent field.

- All bits 0. This is the smallest possible exponent. These numbers
  represent (-1)^{S}×M×2^{-149}. When M = 0, this is zero, and yes,
  IEEE arithmetic distinguishes between +0.0 and -0.0. When M > 0,
  these numbers are called “subnormal” because they are smaller (in
  magnitude) than the smallest normalised number. If you try to
  represent a positive number less than or equal to 2^{-150}, it will
  be rounded to 0.0. This is called *underflow*.
- All bits 1. This is the largest possible exponent. These numbers
  represent ∞ (S=0) or -∞ (S=1) if M = 0; if M > 0 they are
  “Not-a-Number” (NaN) values. An ∞ result indicates a result that
  was too big to fit (overflow); NaN results indicate some sort of
  error, such as sqrt(-1).
- All other values of E are not special. These numbers are called
  “normalised” numbers because the significand is scaled to put the
  most significant 1 bit in a fixed place. They represent
  (-1)^{S}×(2^{23}+M)×2^{E-150}. Zero, subnormal numbers, and
  normalised numbers are collectively called *finite* numbers.

Suppose we have a true value `T` and a calculated
or otherwise estimated value `E`. Just how wrong is
`E`? One answer is the *absolute error*:

abserr(E,T) = abs(E-T)

This is a good way to measure error if you expect the sizes of errors not to vary much with the size of the true values, or when the range of possible true values is not great. For example, if you can measure temperatures to ±0.2K (which is way better than almost all weather measurements) then since temperatures typically range from 233K to 313K in places people are likely to care about, absolute error is a good quality measure. (See this article for accuracy of clinical thermometers. 0.2K wasn't picked out of thin air.)

Supposing `T` to be non-zero, we can
define the *relative error*:

relerr(E,T) = abs((E-T)/T)

That is, the absolute error scaled by the size of `T`.
Let's return to our temperature example. Real measuring instruments
tend to get worse at the ends of their ranges. Let's use some
actual numbers from
a Texas
Instruments data sheet.

| range | abserr | relerr |
|---|---|---|
| 20–42°C | 0.13°C | 0.012 |
| -20–90°C | 0.20°C | 0.0034 |
| 90–110°C | 0.23°C | 0.0033 |
| -55–150°C | 0.36°C | 0.0042 |

The absolute error isn't even close to constant.

Except in the middle of the range (where it is worse), the relative error is close to 1 part in 300. The typical error in the middle of the range is quoted as ±0.05°C, which is comparable to the worst-case error elsewhere; that is reasonable.

The next number in that data sheet is worth noting:
two *adjacent* sensors on the same tape, manufactured
just instants apart, could differ by ±0.1°C.
We see that two “sibling” sensors placed in
different circuits measuring close but different places
could disagree by ±0.2°C *and that would not
mean there was any real difference at all*. Again,
the 0.2K figure wasn't pulled out of thin air! Next
time someone tells you we know global temperatures to
better than ±1°C, laugh yourself sick.

Because *physical* measurement errors tend to
grow with the measured values, floating point, which has
*representation* errors that grow with the
represented value, isn't a bad fit. Single precision
floating point numbers, for example, are much better than
we need to record temperatures. (Even for “ultra-precise”
measurements taken with extremely expensive laboratory equipment.)

Of course calculations introduce their own errors. This makes
it useful to have a feel for how much error is *unavoidable*,
just because of the way numbers are represented.

The following C program shows the absolute and
relative errors for *adjacent* single precision
floats. That is, we are concerned here with numbers
that are as close to each other as they can possibly
be without being the same number.

```c
#include <float.h>
#include <math.h>
#include <stdio.h>

static float abserr(float derived, float correct) {
    return fabsf(derived - correct);
}

static float relerr(float derived, float correct) {
    return fabsf((derived - correct)/correct);
}

static void show(char const *label, float derived, float correct) {
    printf("%s %.1e %.1e\n", label,
           abserr(derived, correct), relerr(derived, correct));
}

int main(void) {
    union { float f; unsigned u; } pun;  /* reinterpret float bits as an integer */
    float const m = powf(2.0f, -24);

    printf("absolute relative\naround error error\n");
    pun.f = FLT_MIN; pun.u++; show("FLT_MIN+", pun.f, FLT_MIN);
    pun.f = m;       pun.u++; show("6.0e-8+",  pun.f, m);
    pun.f = 1.0f;    pun.u--; show("1.0-",     pun.f, 1.0f);
    pun.f = 1.0f;    pun.u++; show("1.0+",     pun.f, 1.0f);
    pun.f = 1.0f/m;  pun.u--; show("1.7e+7-",  pun.f, 1.0f/m);
    pun.f = FLT_MAX; pun.u--; show("FLT_MAX-", pun.f, FLT_MAX);
    return 0;
}
```

Here's the output of that program.

| around | absolute error | relative error |
|---|---|---|
| FLT_MIN+ | 1.4e-45 | 1.2e-07 |
| 6.0e-8+ | 7.1e-15 | 1.2e-07 |
| 1.0- | 6.0e-08 | 6.0e-08 |
| 1.0+ | 1.2e-07 | 1.2e-07 |
| 1.7e+7- | 1.0e+00 | 6.0e-08 |
| FLT_MAX- | 2.0e+31 | 6.0e-08 |

We see that the absolute errors grow with the numbers, while the relative errors fluctuate between 1.2e-7 and half that. In fact the official figure is

FLT_EPSILON = 1.1920928955078125×10^{-7}

If the relative error of a single precision result is
a small multiple of 1.2e-7, that means it is about as good
as you are likely to get. (The C `<math.h>` library
and `java.lang.Math` class go to great lengths to get 1-bit
worst case error. *You* aren't likely to write code
that accurate, and neither am I.) It's certainly much smaller than the
error of most physical measurements.

If the relative error of a calculation is 1.0, that typically means that the result underflowed to 0.0.

If the relative error of a calculation is ∞, that typically means that the result overflowed.

If the relative error is NaN, that means that somewhere in the calculation something stupid happened.

Of course, only in carefully constructed test cases
are we likely to know what the true value `T` is.
If we have two different calculations for the same result,
we can compute their absolute and relative errors. It
is common to use

relerr(E,F) = abs(E-F)/max(abs(E),abs(F))

in this case. A large relative error in such a case indicates trouble. A small relative error may simply mean that both are wrong.