x = 1.0
println(nextfloat(x) - x)
println(eps(Float64))
2.220446049250313e-16
2.220446049250313e-16
Numerical Analysis
NYU Paris
Chapter structure:
Binary representation of real numbers
Floating point formats
Arithmetic operations
IEEE 754 encoding (Inf, -Inf, NaN)
Integer formats
A real number \(x\) in base \(\beta\) can be written as:
\[x = \pm \sum_{k=-n}^{\infty} a_k \beta^{-k}, \quad a_k \in \{0, \dots, \beta-1\}\]
Base 2 (binary)
Base 16 (hexadecimal)
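As a small illustration of the expansion above, the sketch below uses Julia's built-in `string` and `parse` with the `base` keyword to move between bases, then evaluates the digit sum directly for a short binary fraction. The numbers chosen (10, 255, 0.375) and the variable names are arbitrary examples, not taken from the slides.

```julia
# Write integers in base 2 and base 16, then read one back.
bin = string(10, base=2)            # binary digits of 10
hex = string(255, base=16)          # hexadecimal digits of 255
back = parse(Int, hex, base=16)     # recover 255 from its hex digits
println(bin, " ", hex, " ", back)

# Evaluate the sum a_k * 2^(-k) for the fractional digits (0.011)_2.
frac_digits = [0, 1, 1]
x = sum(a * 2.0^(-k) for (k, a) in enumerate(frac_digits))
println(x)
```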
Algorithm to convert a decimal fraction \(x\) to binary:
i | \(x\) | Bits so far |
---|---|---|
1 | 1/3 | 0.0 |
2 | 2/3 | 0.01 |
3 | 1/3 | 0.010 |
4 | 2/3 | 0.0101 |
5 | 1/3 | 0.01010 |
Pattern repeats → \((1/3)_{10} = (0.\overline{01})_2\)
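The table above can be reproduced with a short sketch of the repeated-doubling procedure: double \(x\), emit bit 1 and subtract 1 if the result is at least 1, otherwise emit bit 0. The function name `binary_fraction_bits` is hypothetical, and exact `Rational` arithmetic is an assumption made here so that the repeating pattern is not blurred by rounding.

```julia
# Binary digits of a fraction 0 <= x < 1 by repeated doubling.
# Rational arithmetic keeps x exact at every step.
function binary_fraction_bits(x::Rational, nbits::Integer)
    bits = Int[]
    for _ in 1:nbits
        x *= 2
        if x >= 1
            push!(bits, 1)   # the doubled value reached 1: emit 1, keep the remainder
            x -= 1
        else
            push!(bits, 0)   # still below 1: emit 0
        end
    end
    return bits
end

println(join(binary_fraction_bits(1//3, 6)))
```

Calling it with `1//3` alternates between the states 1/3 and 2/3, which is the cycle shown in the table.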
\[ F(p, E_{min}, E_{max}) = \Big\{ (-1)^s 2^E (b_0.b_1 b_2 \dots b_{p-1})_2 \Big\} \]
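Parameters: \(p\) is the precision (the number of significand bits), and \(E_{min}\), \(E_{max}\) are the minimum and maximum exponents.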
For \(x \in F(p, E_{min}, E_{max})\):
Format | Precision \(p\) | \(E_{min}\) | \(E_{max}\) |
---|---|---|---|
Half (Float16) | 11 | -14 | 15 |
Single (Float32) | 24 | -126 | 127 |
Double (Float64) | 53 | -1022 | 1023 |
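The rows of this table can be cross-checked against Julia's own constants. The sketch below assumes that `floatmax(T)` equals \(2^{E_{max}} (2 - 2^{-(p-1)})\), the largest significand \((1.1\dots1)_2\) at the largest exponent, and that `floatmin(T)` equals \(2^{E_{min}}\), the smallest positive number with \(b_0 = 1\). The helper name `check_format` is hypothetical.

```julia
# Compare one table row (p, E_min, E_max) with Julia's built-in constants.
function check_format(T, p, emin, emax)
    # largest finite value: 2^E_max * (2 - 2^-(p-1))
    largest = ldexp(one(T), emax) * (T(2) - ldexp(one(T), -(p - 1)))
    # smallest positive value with leading bit 1: 2^E_min
    smallest = ldexp(one(T), emin)
    return (precision(T) == p, largest == floatmax(T), smallest == floatmin(T))
end

for (T, p, emin, emax) in [(Float16, 11, -14, 15),
                           (Float32, 24, -126, 127),
                           (Float64, 53, -1022, 1023)]
    println(T, ": ", check_format(T, p, emin, emax))
end
```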
Julia types: Float16, Float32, Float64 (default)
Definition of the machine epsilon
\[ \varepsilon = 2^{-(p-1)} \]
eps(Float16), eps(Float32), eps(Float64)
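A quick way to check the formula \(\varepsilon = 2^{-(p-1)}\) is to compare it with `eps` for each format, taking \(p\) from Julia's `precision`. The helper name `eps_from_precision` is hypothetical; this is a sketch of the comparison, not part of the slides.

```julia
# Machine epsilon computed from the precision p as 2^-(p-1).
function eps_from_precision(T)
    p = precision(T)
    return ldexp(one(T), -(p - 1))
end

for T in (Float16, Float32, Float64)
    println(T, ": eps = ", eps(T), ", formula agrees: ", eps(T) == eps_from_precision(T))
end
```

The same quantity is the gap between 1 and the next representable number, which is what `nextfloat(x) - x` measures at `x = 1.0`.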
Floating point arithmetic is not exact due to limited precision.
Rounding: the result of each operation is rounded to the nearest representable number.
Each intermediate operation may introduce a small round-off error.
Errors accumulate in sequences of operations.
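The points above can be seen in a minimal example, assuming the default `Float64` type. Neither 0.1 nor 0.2 is exactly representable in binary, so a single addition already differs from the literal 0.3, and adding 0.1 ten times does not give exactly 1. The function name `repeated_sum` is hypothetical.

```julia
# One rounded operation: the computed sum differs from the nearest double to 0.3.
println(0.1 + 0.2 == 0.3)

# Accumulation: add x to itself n times, rounding after every addition.
function repeated_sum(x, n)
    acc = zero(x)
    for _ in 1:n
        acc += x
    end
    return acc
end

s = repeated_sum(0.1, 10)
println(s == 1.0)                       # exact comparison fails
println(abs(s - 1.0) <= eps(Float64))   # but the error is of the order of eps
```

This is why floating point results are usually compared with a tolerance, for example with `isapprox`, rather than with `==`.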
Unique Representation
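The representation of a representable number as \((-1)^s 2^E (b_0.b_1 \dots b_{p-1})_2\) is made unique by requiring that \(b_0 = 1\) whenever \(E > E_{min}\); a leading bit \(b_0 = 0\) is allowed only when \(E = E_{min}\).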
Float16, Float32, Float64 use 16, 32, 64 bits, respectively.
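The bit counts can be inspected directly in Julia. The sketch below assumes the usual IEEE 754 layout, in which one bit stores the sign, the leading bit \(b_0\) is not stored, \(p - 1\) bits store the remaining significand digits, and the rest encode the exponent. The helper name `bit_layout` is hypothetical.

```julia
# Split the total width of a format into exponent and fraction bits.
function bit_layout(T)
    total = 8 * sizeof(T)               # sizeof returns bytes
    fraction_bits = precision(T) - 1    # b_0 is implicit (assumed IEEE 754 layout)
    exponent_bits = total - 1 - fraction_bits
    return (total, exponent_bits, fraction_bits)
end

for T in (Float16, Float32, Float64)
    total, e, f = bit_layout(T)
    println(T, ": ", total, " bits = 1 sign + ", e, " exponent + ", f, " fraction")
end

# Raw bit pattern of 1.0 in half precision.
println(bitstring(Float16(1)))
```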