x = 1.0
println(nextfloat(x) - x)   # 2.220446049250313e-16
println(eps(Float64))       # 2.220446049250313e-16
Numerical Analysis
NYU Paris
Chapter structure:
Binary representation of real numbers
Floating point formats
Arithmetic operations
IEEE 754 encoding (Inf, -Inf, NaN)
Integer formats
A real number \(x\) in base \(\beta\) can be written as:
\[x = \pm \sum_{k=-n}^{\infty} a_k \beta^{-k}, \quad a_k \in \{0, \dots, \beta-1\}\]
Base 2 (binary)
Base 16 (hexadecimal)
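As a rough illustration of the expansion formula, the sketch below evaluates \(\sum_k a_k \beta^{-k}\) from explicit digit lists. The function name `expansion_value` and its argument layout are assumptions made for this example, not part of the course material; `string(n, base=b)` and `parse(Int, s, base=b)` are Julia's built-in integer conversions.

```julia
# Evaluate x = sum_{k=-n}^{m} a_k * β^(-k) from explicit digit lists.
# `int_digits` holds a_{-n}, ..., a_0 and `frac_digits` holds a_1, ..., a_m.
# (Hypothetical helper, written only to illustrate the formula.)
function expansion_value(int_digits, frac_digits, β)
    n = length(int_digits) - 1
    x = 0.0
    for (i, a) in enumerate(int_digits)
        k = -n + (i - 1)          # k runs from -n up to 0
        x += a * float(β)^(-k)
    end
    for (k, a) in enumerate(frac_digits)
        x += a * float(β)^(-k)    # k runs from 1 up to m
    end
    return x
end

println(expansion_value([1, 0, 1], [1, 0, 1], 2))   # (101.101)_2 = 5.625
println(expansion_value([15, 15], [8], 16))         # (FF.8)_16 = 255.5

# Built-in conversions for integers
println(string(10, base=2))          # "1010"
println(parse(Int, "ff", base=16))   # 255
```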
Algorithm to convert a decimal fraction \(x\) to binary:
| i | \(x\) | Bits so far |
|---|---|---|
| 1 | 1/3 | 0.0 |
| 2 | 2/3 | 0.01 |
| 3 | 1/3 | 0.010 |
| 4 | 2/3 | 0.0101 |
| 5 | 1/3 | 0.01010 |
Pattern repeats → \((1/3)_{10} = (0.\overline{01})_2\)
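The table is consistent with the usual doubling procedure: multiply \(x\) by 2, record a 1 and subtract 1 if the result is at least 1, otherwise record a 0. The sketch below assumes that procedure; the name `fraction_bits` is illustrative. A `Rational` input keeps the arithmetic exact, so the repeating pattern is not disturbed by rounding.

```julia
# Doubling algorithm (assumed from the table): the integer part of 2x
# is the next bit, and the fractional part is carried to the next step.
function fraction_bits(x, nbits)
    bits = Int[]
    for _ in 1:nbits
        x *= 2
        if x >= 1
            push!(bits, 1)
            x -= 1
        else
            push!(bits, 0)
        end
    end
    return bits
end

println(fraction_bits(1//3, 5))    # [0, 1, 0, 1, 0], i.e. 0.01010
println(fraction_bits(0.625, 3))   # [1, 0, 1], i.e. 0.101
```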
\[ F(p, E_{min}, E_{max}) = \Big\{ (-1)^s 2^E (b_0.b_1 b_2 \dots b_{p-1})_2 \Big\} \]
Parameters:
For \(x \in F(p, E_{min}, E_{max})\):
| Format | Precision \(p\) | \(E_{min}\) | \(E_{max}\) |
|---|---|---|---|
| Half (Float16) | 11 | -14 | 15 |
| Single (Float32) | 24 | -126 | 127 |
| Double (Float64) | 53 | -1022 | 1023 |
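The values in the table can be cross-checked in Julia. This is only a sketch: it reads \(E_{min}\) as the exponent of the smallest positive normal number and \(E_{max}\) as the exponent of the largest finite number, which is an interpretation of the table rather than something stated on the slide.

```julia
for T in (Float16, Float32, Float64)
    p    = precision(T)             # significand bits, including b_0
    Emin = exponent(floatmin(T))    # exponent of the smallest positive normal number
    Emax = exponent(floatmax(T))    # exponent of the largest finite number
    println(T, ": p = ", p, ", Emin = ", Emin, ", Emax = ", Emax)
end
```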
Float16, Float32, Float64 (default)
Definition of the machine epsilon
\[ \varepsilon = 2^{-(p-1)} \]
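A quick way to check this formula is to compare it with Julia's `eps` for each format, and with the gap between 1 and the next representable number. The loop below is a minimal sketch using only built-in functions.

```julia
for T in (Float16, Float32, Float64)
    p = precision(T)
    formula_ok = eps(T) == 2.0^(1 - p)                    # ε = 2^-(p-1)
    gap_ok     = nextfloat(one(T)) - one(T) == eps(T)     # gap just above 1
    println(T, ": eps = ", eps(T), ", formula holds: ", formula_ok, ", gap above 1: ", gap_ok)
end
```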
eps(Float16), eps(Float32), eps(Float64)
Floating point arithmetic is not exact due to limited precision.
Rounding: the exact result of each operation is rounded to the nearest representable number.
Each intermediate operation may introduce a small round-off error.
Errors accumulate in sequences of operations.
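A small experiment makes the accumulation visible. The helper `repeated_sum` is a name chosen for this sketch; it adds the same value repeatedly, so each addition is rounded separately.

```julia
println(0.1 + 0.2 == 0.3)   # false
println(0.1 + 0.2)          # 0.30000000000000004

# Add x to itself n times; every intermediate sum is rounded.
function repeated_sum(x, n)
    s = zero(x)
    for _ in 1:n
        s += x
    end
    return s
end

s = repeated_sum(0.1, 10)
println(s == 1.0)           # false
println(abs(s - 1.0))       # a small but nonzero error
```

Values such as 0.5 that are exactly representable in binary do not show this effect: `repeated_sum(0.5, 4)` is exactly `2.0`.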
Unique Representation
The representation of a representable number as \((-1)^s 2^E (b_0.b_1 \dots b_{p-1})_2\) is made unique by requiring that \(b_0 = 1\) whenever \(E > E_{min}\).
Float16, Float32, and Float64 use 16, 32, and 64 bits, respectively.
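The bit counts can be related to the precision \(p\) from the table. The sketch below assumes the standard IEEE 754 layout (1 sign bit, an exponent field, and \(p - 1\) stored fraction bits, with \(b_0\) left implicit), which is not spelled out in this section.

```julia
for T in (Float16, Float32, Float64)
    total = 8 * sizeof(T)    # storage size in bits
    p = precision(T)         # significand bits, including the implicit b_0
    println(T, ": ", total, " bits = 1 sign + ", total - p, " exponent + ", p - 1, " fraction")
end

# Raw bit pattern of 1.0 in single precision: sign, then exponent, then fraction
println(bitstring(1.0f0))
```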