Floating point arithmetic

Numerical Analysis

Urbain Vaes

NYU Paris

Introduction

  • Numerical algorithms usually assume exact operations.
  • On computers, only a subset of real numbers can be stored.
  • Many operations are only approximate → round-off errors.

Chapter structure:

  • Binary representation of real numbers

  • Floating point formats

  • Arithmetic operations

  • IEEE 754 encoding (Inf, -Inf, NaN)

  • Integer formats

Binary Representation of Real Numbers

A real number \(x\) in base \(\beta\) can be written as:

\[x = \pm \sum_{k=-n}^{\infty} a_k \beta^{-k}, \quad a_k \in \{0, \dots, \beta-1\}\]

  • Humans: usually base 10
  • Computers: usually base 2

Common Bases

Base 2 (binary)

  • Digits are 0 or 1
  • Easy to store in circuits
  • Multiplying/dividing by 2 = bit shifts


Base 16 (hexadecimal)

  • Digits: 0–9, A–F
  • Compact representation
  • Used in colors, IPv6 addresses

Conversion algorithm (Decimal → Binary)

Algorithm to convert a decimal fraction \(x \in (0, 1)\) to binary:

Initialize i = 1
While x ≠ 0:
    Multiply x by 2
    If x ≥ 1:
        bᵢ = 1
        x = x - 1
    Else:
        bᵢ = 0
    i = i + 1
  • The bits \(b_1, b_2, \dots\) give the output \(x = (0.b_1 b_2 \dots)_2\)
  • If a value of \(x\) recurs in the loop, the binary expansion is infinite and periodic
  • The base 2 representation can be infinite even when the base 10 representation is finite (e.g. \(0.1\))
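The loop above can be sketched in Julia (the function name `binary_fraction` and the `nbits` cap are illustrative assumptions; the cap is needed because the expansion may be infinite, and rational inputs keep the doublings exact):

```julia
# Convert a fraction x ∈ (0, 1) to its first binary digits,
# following the multiply-by-2 algorithm above.
function binary_fraction(x::Rational; nbits = 16)
    bits = Int[]
    while x != 0 && length(bits) < nbits
        x *= 2              # shift the next bit in front of the point
        if x >= 1
            push!(bits, 1)
            x -= 1          # remove the bit that was just read off
        else
            push!(bits, 0)
        end
    end
    return bits
end

binary_fraction(1//3)   # first bits of (0.010101…)₂
```

For \(x = 1/2\) the loop terminates after one step with the single bit 1; for \(x = 1/3\) it never terminates on its own, matching the repeating pattern shown on the next slide.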

Example: \(1/3\) in Binary


i   \(x\) at start of step   Bit \(b_i\)   Bits so far
1   1/3                      0             0.0
2   2/3                      1             0.01
3   1/3                      0             0.010
4   2/3                      1             0.0101
5   1/3                      0             0.01010


Pattern repeats → \((1/3)_{10} = (0.\overline{01})_2\)

Set of Representable Values

  • Computers store only a subset of real numbers.
  • IEEE 754 standard defines floating point formats:

\[ F(p, E_{min}, E_{max}) = \Big\{ (-1)^s 2^E (b_0.b_1 b_2 \dots b_{p-1})_2 : s \in \{0, 1\},\ b_i \in \{0, 1\},\ E_{min} \leq E \leq E_{max} \Big\} \]

Parameters:

  • \(p\): number of significant bits (precision)
  • \(E_{min}, E_{max}\): minimum and maximum exponents
  • Special values: Inf, -Inf, NaN
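To make the definition concrete, here is a small sketch enumerating the positive numbers of a toy format with \(p = 3\), \(E_{min} = -1\), \(E_{max} = 1\) and leading bit \(b_0 = 1\) (the toy parameters are illustrative, not part of IEEE 754):

```julia
# Positive normalized numbers of F(3, -1, 1): (1.b₁b₂)₂ × 2^E.
vals = Float64[]
for E in -1:1, b1 in 0:1, b2 in 0:1
    push!(vals, 2.0^E * (1 + b1/2 + b2/4))
end
sort!(unique!(vals))
println(vals)   # 12 values between 0.5 and 3.5
```

Listing the values shows the pattern discussed below: within each exponent range \([2^E, 2^{E+1})\) the numbers are equally spaced, and the spacing doubles from one range to the next.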

Components of a Floating Point Number

For \(x \in F(p, E_{min}, E_{max})\):

  • \(s\): sign
  • \(E\): exponent
  • \((b_0.b_1 b_2 \dots b_{p-1})_2\): significand
  • \(b_0\): leading bit

Common Formats


Format             Precision \(p\)   \(E_{min}\)   \(E_{max}\)
Half (Float16)     11                -14           15
Single (Float32)   24                -126          127
Double (Float64)   53                -1022         1023


  • \(F_{16} \subset F_{32} \subset F_{64}\)
  • Half-precision introduced in IEEE 754-2008
  • In Julia, Float16, Float32, Float64 (default)

Relative Error and Machine Epsilon

Definition of the machine epsilon

\[ \varepsilon = 2^{-(p-1)} \]

  • Depends on the floating point format (through parameter \(p\))
  • In Julia, eps(Float16), eps(Float32), eps(Float64)
  • Indicates the maximum relative spacing between representable numbers
  • Example: the next representable number after 1 is \(1 + \varepsilon\)
x = 1.0
println(nextfloat(x) - x)
println(eps(Float64))
2.220446049250313e-16
2.220446049250313e-16
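The formula \(\varepsilon = 2^{-(p-1)}\) can be checked for all three formats (a short sketch using Julia's built-in `precision`, which returns \(p\) for each format):

```julia
# eps(T) should equal 2^(1 - p), with p = precision(T): 11, 24, 53.
for T in (Float16, Float32, Float64)
    println(T, ": ", eps(T) == T(2)^(1 - precision(T)))
end
```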

Absolute Density of Floating Point Numbers

  • More numbers near 0 → denser representation.
  • Spacing grows with \(|x|\).
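A quick check of the growing spacing, using `nextfloat` (the sample points are arbitrary):

```julia
# Gap between x and the next representable Float64 grows with |x|.
for x in (1.0, 1.0e8, 1.0e16)
    println(x, " → ", nextfloat(x) - x)
end
```

The absolute gap grows from about \(10^{-16}\) at \(x = 1\) to exactly 2.0 at \(x = 10^{16}\), while the relative gap stays of order \(\varepsilon\).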

Relative Density of Floating Point Numbers

  • Relative spacing oscillates between \(\frac{1}{2} \varepsilon\) and \(\varepsilon\) in normalized region
  • In the subnormal (denormalized) region of very small numbers, the relative spacing can be larger

Arithmetic Operations and Rounding

  • Floating point arithmetic is not exact due to limited precision.

  • Rounding: result of any operation is rounded to nearest representable number.

  • Each intermediate operation may introduce a small round-off error.

  • Errors accumulate in sequences of operations.

ε = eps()
@show .1 + .2 == .3     
@show sqrt(2)^2 == 2    
@show exp(ε) == 1 + ε   
@show exp(ε/2) == 1 + ε 
@show exp(ε/3) == 1;
0.1 + 0.2 == 0.3 = false
sqrt(2) ^ 2 == 2 = false
exp(ε) == 1 + ε = true
exp(ε / 2) == 1 + ε = true
exp(ε / 3) == 1 = true
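A further sketch of error accumulation: since 0.1 has no finite binary representation, summing it ten times left to right does not give exactly 1.0.

```julia
# Each addition rounds, and the round-off errors accumulate.
s = foldl(+, fill(0.1, 10))   # sequential left-to-right sum
println(s == 1.0)             # false
println(s)
```

`foldl` forces a strictly sequential sum; `sum` may reorder the additions (pairwise summation) and can return a different rounded result.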

Encoding of Floating Point Numbers

  • Set of representable numbers specified by \((p, E_{min}, E_{max})\)
  • Encoding = how they are actually stored
  • Encoding does not affect magnitude or propagation of round-off errors.

Unique Representation Rules

Unique Representation

The representation of a representable number as \((-1)^s 2^E (b_0.b_1 \dots b_{p-1})_2\) is made unique by requiring that

  • Either \(E > E_{min}\) and leading bit \(b_0 = 1\)
  • Or \(E = E_{min}\), in which case \(b_0\) may be 0 (subnormal numbers) or 1
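The second case can be observed in Julia with `floatmin` and `issubnormal` (a small sketch):

```julia
# floatmin gives the smallest positive *normalized* number,
# i.e. E = E_min with leading bit b₀ = 1.
x = floatmin(Float64)         # 2^-1022
println(issubnormal(x))       # false: still normalized
println(issubnormal(x / 2))   # true: E stays at E_min, b₀ drops to 0
```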


  • Storage in memory: Float16, Float32, Float64 use 16, 32, 64 bits.
  • Example layout (32-bit Float32):
    • Sign: 1 bit
    • Encoded exponent: 8 bits
    • Encoded significand: 23 bits
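The layout above can be inspected with Julia's `bitstring` (a sketch; for \(-0.75 = (-1)^1 \, 2^{-1} \, (1.1)_2\), the encoded exponent is \(-1 + 127 = 126\), and the leading bit \(b_0 = 1\) is implicit, so only the 23 fraction bits are stored):

```julia
# Split the 32 bits of a Float32 into sign | exponent | significand.
s = bitstring(Float32(-0.75))
println(s[1:1], " | ", s[2:9], " | ", s[10:32])
# 1 | 01111110 | 10000000000000000000000
```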