Floating point arithmetic

Numerical Analysis

Urbain Vaes

NYU Paris

Introduction

  • Numerical algorithms usually assume exact operations.
  • On computers, only a subset of real numbers can be stored.
  • Many operations are only approximate → round-off errors.

Chapter structure:

  • Binary representation of real numbers

  • Floating point formats

  • Arithmetic operations

  • IEEE 754 encoding (Inf, -Inf, NaN)

  • Integer formats

Binary Representation of Real Numbers

A real number \(x\) in base \(\beta\) can be written as:

\[x = \pm \sum_{k=-n}^{\infty} a_k \beta^{-k}, \quad a_k \in \{0, \dots, \beta-1\}\]

  • Humans: usually base 10
  • Computers: usually base 2

Common Bases

Base 2 (binary)

  • Digits are 0 or 1
  • Easy to store in circuits
  • Multiplying/dividing by 2 = bit shifts


Base 16 (hexadecimal)

  • Digits: 0–9, A–F
  • Compact representation
  • Used in colors, IPv6 addresses

Conversion algorithm (Decimal → Binary)

Algorithm to convert a decimal fraction \(x \in (0, 1)\) to binary:

Initialize i = 1
While x ≠ 0:
    Multiply x by 2
    If x ≥ 1:
        bᵢ = 1
        x = x - 1
    Else:
        bᵢ = 0
    i = i + 1
  • The bits \(b_1, b_2, \dots\) give the output \(x = (0.b_1 b_2 \dots)_2\)
  • If a value of \(x\) recurs in the loop, the binary expansion is infinite and periodic
  • The base 2 representation can be infinite even when the base 10 representation is finite (e.g. \(0.1\))
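The loop above can be sketched in Julia (the function name `binary_fraction` and the `nbits` cap are illustrative assumptions; the cap is needed because the expansion may be infinite, and rational inputs keep the doublings exact):

```julia
# Convert a fraction x ∈ (0, 1) to its first binary digits,
# following the multiply-by-2 algorithm above.
function binary_fraction(x::Rational; nbits = 16)
    bits = Int[]
    while x != 0 && length(bits) < nbits
        x *= 2              # shift the next bit in front of the point
        if x >= 1
            push!(bits, 1)
            x -= 1          # remove the bit that was just read off
        else
            push!(bits, 0)
        end
    end
    return bits
end

binary_fraction(1//3)   # first bits of (0.010101…)₂
```

For \(x = 1/2\) the loop terminates after one step with the single bit 1; for \(x = 1/3\) it never terminates on its own, matching the repeating pattern shown on the next slide.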

Example: \(1/3\) in Binary


i   \(x\) at start of step   Bit \(b_i\)   Bits so far
1   1/3                      0             0.0
2   2/3                      1             0.01
3   1/3                      0             0.010
4   2/3                      1             0.0101
5   1/3                      0             0.01010


Pattern repeats → \((1/3)_{10} = (0.\overline{01})_2\)

Set of Representable Values

  • Computers store only a subset of real numbers.
  • IEEE 754 standard defines floating point formats:

\[ F(p, E_{min}, E_{max}) = \Big\{ (-1)^s 2^E (b_0.b_1 b_2 \dots b_{p-1})_2 : s \in \{0, 1\},\ b_i \in \{0, 1\},\ E_{min} \leq E \leq E_{max} \Big\} \]

Parameters:

  • \(p\): number of significant bits (precision)
  • \(E_{min}, E_{max}\): minimum and maximum exponents
  • Special values: Inf, -Inf, NaN
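To make the definition concrete, here is a small sketch enumerating the positive numbers of a toy format with \(p = 3\), \(E_{min} = -1\), \(E_{max} = 1\) and leading bit \(b_0 = 1\) (the toy parameters are illustrative, not part of IEEE 754):

```julia
# Positive normalized numbers of F(3, -1, 1): (1.b₁b₂)₂ × 2^E.
vals = Float64[]
for E in -1:1, b1 in 0:1, b2 in 0:1
    push!(vals, 2.0^E * (1 + b1/2 + b2/4))
end
sort!(unique!(vals))
println(vals)   # 12 values between 0.5 and 3.5
```

Listing the values shows the pattern discussed below: within each exponent range \([2^E, 2^{E+1})\) the numbers are equally spaced, and the spacing doubles from one range to the next.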

Components of a Floating Point Number

For \(x \in F(p, E_{min}, E_{max})\):

  • \(s\): sign
  • \(E\): exponent
  • \((b_0.b_1 b_2 \dots b_{p-1})_2\): significand
  • \(b_0\): leading bit

Common Formats


Format             Precision \(p\)   \(E_{min}\)   \(E_{max}\)
Half (Float16)     11                -14           15
Single (Float32)   24                -126          127
Double (Float64)   53                -1022         1023


  • \(F_{16} \subset F_{32} \subset F_{64}\)
  • Half-precision introduced in IEEE 754-2008
  • In Julia, Float16, Float32, Float64 (default)

Relative Error and Machine Epsilon

Definition of the machine epsilon

\[ \varepsilon = 2^{-(p-1)} \]

  • Depends on the floating point format (through parameter \(p\))
  • In Julia, eps(Float16), eps(Float32), eps(Float64)
  • Indicates the maximum relative spacing between representable numbers
  • Example: the next representable number after 1 is \(1 + \varepsilon\)
x = 1.0
println(nextfloat(x) - x)
println(eps(Float64))
2.220446049250313e-16
2.220446049250313e-16
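The formula \(\varepsilon = 2^{-(p-1)}\) can be checked for all three formats (a short sketch using Julia's built-in `precision`, which returns \(p\) for each format):

```julia
# eps(T) should equal 2^(1 - p), with p = precision(T): 11, 24, 53.
for T in (Float16, Float32, Float64)
    println(T, ": ", eps(T) == T(2)^(1 - precision(T)))
end
```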

Absolute Density of Floating Point Numbers

  • More numbers near 0 → denser representation.
  • Spacing grows with \(|x|\).
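A quick check of the growing spacing, using `nextfloat` (the sample points are arbitrary):

```julia
# Gap between x and the next representable Float64 grows with |x|.
for x in (1.0, 1.0e8, 1.0e16)
    println(x, " → ", nextfloat(x) - x)
end
```

The absolute gap grows from about \(10^{-16}\) at \(x = 1\) to exactly 2.0 at \(x = 10^{16}\), while the relative gap stays of order \(\varepsilon\).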

Relative Density of Floating Point Numbers

  • Relative spacing oscillates between \(\frac{1}{2} \varepsilon\) and \(\varepsilon\) in normalized region
  • In the subnormal (denormalized) region of very small numbers, the relative spacing can be larger

Arithmetic Operations and Rounding

  • Floating point arithmetic is not exact due to limited precision.

  • Rounding: result of any operation is rounded to nearest representable number.

  • Each intermediate operation may introduce a small round-off error.

  • Errors accumulate in sequences of operations.

ε = eps()
@show .1 + .2 == .3     
@show sqrt(2)^2 == 2    
@show exp(ε) == 1 + ε   
@show exp(ε/2) == 1 + ε 
@show exp(ε/3) == 1;
0.1 + 0.2 == 0.3 = false
sqrt(2) ^ 2 == 2 = false
exp(ε) == 1 + ε = true
exp(ε / 2) == 1 + ε = true
exp(ε / 3) == 1 = true
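A further sketch of error accumulation: since 0.1 has no finite binary representation, summing it ten times left to right does not give exactly 1.0.

```julia
# Each addition rounds, and the round-off errors accumulate.
s = foldl(+, fill(0.1, 10))   # sequential left-to-right sum
println(s == 1.0)             # false
println(s)
```

`foldl` forces a strictly sequential sum; `sum` may reorder the additions (pairwise summation) and can return a different rounded result.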

Encoding of Floating Point Numbers

  • Set of representable numbers specified by \((p, E_{min}, E_{max})\)
  • Encoding = how they are actually stored
  • Encoding does not affect magnitude or propagation of round-off errors.

Unique Representation Rules

Unique Representation

The representation of a representable number as \((-1)^s 2^E (b_0.b_1 \dots b_{p-1})_2\) is made unique by requiring that

  • Either \(E > E_{min}\) and leading bit \(b_0 = 1\)
  • Or \(E = E_{min}\), in which case \(b_0\) may be 0 (subnormal numbers) or 1
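The second case can be observed in Julia with `floatmin` and `issubnormal` (a small sketch):

```julia
# floatmin gives the smallest positive *normalized* number,
# i.e. E = E_min with leading bit b₀ = 1.
x = floatmin(Float64)         # 2^-1022
println(issubnormal(x))       # false: still normalized
println(issubnormal(x / 2))   # true: E stays at E_min, b₀ drops to 0
```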


  • Storage in memory: Float16, Float32, Float64 use 16, 32, 64 bits.
  • Example layout (32-bit Float32):
    • Sign: 1 bit
    • Encoded exponent: 8 bits
    • Encoded significand: 23 bits
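The layout above can be inspected with Julia's `bitstring` (a sketch; for \(-0.75 = (-1)^1 \, 2^{-1} \, (1.1)_2\), the encoded exponent is \(-1 + 127 = 126\), and the leading bit \(b_0 = 1\) is implicit, so only the 23 fraction bits are stored):

```julia
# Split the 32 bits of a Float32 into sign | exponent | significand.
s = bitstring(Float32(-0.75))
println(s[1:1], " | ", s[2:9], " | ", s[10:32])
# 1 | 01111110 | 10000000000000000000000
```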