In this post, we’re going to get our hands dirty writing a toy arbitrary-precision floating point implementation in Rust. I’m therefore going to assume basic knowledge of IEEE754 (binary) floating point numbers. If you feel unprepared, I can recommend the following articles, which contain excellent visualizations:

Having an arbitrary-precision implementation is useful for pedagogical purposes, because a good mental model for IEEE754 floating point computations is to pretend that each computation happens in ideal math land, and is then brought back down to finite precision land as a separate post-processing step. Everything beyond that is “just” optimizations.

Let’s ignore the special values for now, and focus on the values everyone imagines when they think about floating-point values. Well, we basically just want a mantissa, an exponent, and a sign, right?

struct ArbFloat {
    sign: bool,
    exponent: i32,
    mantissa: u32,
}

Is this it? Well, we have 32 bits of precision here, the number of bits we granted the mantissa field. A bit of a far cry from arbitrary-precision, if you ask me. It’s sufficient to hold single-precision values, which require 24 bits of precision. But it will struggle to hold double-precision values, which require 53 bits of precision. And it will definitely struggle to hold the results of arithmetic between two such floats.

Introducing num-bigint. This crate has types for arbitrary-precision integers, so we can use them to hold our mantissa:

struct ArbFloat {
    sign: bool,
    exponent: i32,
    mantissa: BigUint,
}

But wait, shouldn’t we also make the exponent field a BigUint? The exponent field describes, well, exponential growth of our floating-point number. A value of 10000 in the exponent field means that the number has the value $\mathrm{mantissa} \cdot 2^{10000}$. That’s plenty, so i32 will suffice.

Also, rather than having separate sign and BigUint fields, we can use the BigInt type, which already has a sign. This will let us avoid dealing with sign management code, and we won’t need to carefully avoid underflows when subtracting BigUint-s:

struct ArbFloat {
    exponent: i32,
    number: BigInt,
}

Now, let’s handle the three special value categories we neglected earlier: zeros, infinities, and NaNs. The former two have a meaningful sign, but no behavior is ascribed to the sign bit of NaNs. Signed zeros, and the sign has meaningful behavior, wild!

enum ArbFloat {
    Regular { exponent: i32, number: BigInt },
    Zero { sign: bool },
    Infinity { sign: bool },
    NaN,
}

It’s kind of annoying that we’ll have to handle the sign’s behavior non-uniformly across these variants. We can instead hoist the number field out of Regular, and make the sign state of the other variants ride on it. We’ll also be giving NaNs a sign, but it doesn’t matter. We’ll also use this opportunity to slightly shorten the names of the fields.

#[derive(Debug, Copy, Clone)]
enum FloatKind {
    Regular { exp: i32 },
    Zero,
    Infinity,
    NaN,
}

#[derive(Debug, Clone)]
struct ArbFloat {
    kind: FloatKind,
    num: BigInt,
}

impl ArbFloat {
    fn new(kind: FloatKind, num: BigInt) -> Self {
        Self { kind, num }
    }
}

There’s a hidden invariant here we haven’t mentioned yet, which must be maintained for this data structure. The sign maintained by the BigInt type is of the following type: pub enum Sign { Minus, NoSign, Plus }. But we already have a Zero variant, and we need to maintain a sign state for it. So basically num should always be nonzero. For Zero, Infinity, and NaN, we’ll restrict ourselves even further and say that it should always be equal to $\pm 1$.

Alright, say we have a floating point value in one of the many IEEE754 or IEEE754-like binary interchange formats. How do we ingest such a value into our data structure?

We don’t want to write a separate parsing routine for each such format, that would be quite duplicative and bug-prone. So we want a parametric description of an IEEE754-like format. Basically, all we need to know is how many bits are in the mantissa field, and how many bits are in the exponent field (the sign field always has one bit).

For simplicity’s sake, we’ll only support interchange formats up to 64 bits in size. This fits formats like binary16 (half-precision), bfloat16 (“brain floating point”), binary32 (single-precision), and binary64 (double-precision).

type IntStorage = u64;

#[derive(Debug, Copy, Clone)]
struct FormatDesc {
    frac_bits: u8,
    exp_bits: u8,
}

impl FormatDesc {
    const BINARY32: Self = Self {
        frac_bits: 23,
        exp_bits: 8,
    };
    const BINARY64: Self = Self {
        frac_bits: 52,
        exp_bits: 11,
    };
}

fn parse(desc: FormatDesc, storage: IntStorage) -> ArbFloat {
    todo!()
}

Let’s add some convenience methods on FormatDesc so that we’ll be able to extract the three different fields of the storage format.

impl FormatDesc {
    fn mask(bits: u32) -> IntStorage {
        IntStorage::MAX >> (IntStorage::BITS - bits)
    }

    fn frac_mask(&self) -> IntStorage {
        Self::mask(self.frac_bits as u32)
    }

    fn frac_shift(&self) -> IntStorage {
        0
    }

    fn biased_exp_mask(&self) -> IntStorage {
        Self::mask(self.exp_bits as u32)
    }

    fn biased_exp_shift(&self) -> IntStorage {
        self.frac_shift() + self.frac_bits as IntStorage
    }

    fn sign_mask(&self) -> IntStorage {
        1
    }

    fn sign_shift(&self) -> IntStorage {
        self.biased_exp_shift() + self.exp_bits as IntStorage
    }
}

Great. Now let’s start parsing.

fn parse(desc: FormatDesc, storage: IntStorage) -> ArbFloat {
    let frac = (storage >> desc.frac_shift()) & desc.frac_mask();
    let biased_exp = (storage >> desc.biased_exp_shift()) & desc.biased_exp_mask();
    let sign = ((storage >> desc.sign_shift()) & desc.sign_mask()) != 0;

    ArbFloat::new(todo!(), todo!())
}

Now that we have the raw fields, how do we reason about them to create an ArbFloat? Let’s start with computing the FloatKind. We know that when the biased_exp field is all-ones, then we’re dealing either with a NaN (when frac != 0) or with an Infinity (when frac == 0). We also know that when the biased_exp field is all-zeros, then we’re dealing either with a subnormal number (when frac != 0) or with a zero (when frac == 0). Let’s sketch that out:

fn parse(desc: FormatDesc, storage: IntStorage) -> ArbFloat {
    let frac = (storage >> desc.frac_shift()) & desc.frac_mask();
    let biased_exp = (storage >> desc.biased_exp_shift()) & desc.biased_exp_mask();
    let sign = ((storage >> desc.sign_shift()) & desc.sign_mask()) != 0;

    let kind = if biased_exp == desc.biased_exp_mask() {
        if frac == 0 {
            FloatKind::Infinity
        } else {
            FloatKind::NaN
        }
    } else if biased_exp == 0 {
        if frac == 0 {
            FloatKind::Zero
        } else {
            // Subnormals.
            FloatKind::Regular { exp: todo!() }
        }
    } else {
        FloatKind::Regular { exp: todo!() }
    };
    ArbFloat::new(kind, todo!())
}

Almost there. We’re only missing an assignment for exp and num. We’ll need to introduce a couple of helper methods on the FormatDesc structure:

impl FormatDesc {
    fn precision(&self) -> i32 {
        self.frac_bits as i32 + 1
    }

    // Move the unsigned (biased) exponent in the [1, 2^bits - 2] range to
    // the signed [-2^{bits - 1} + 2, 2^{bits - 1} - 1] range.
    fn exp_bias(&self) -> i32 {
        (1 << (self.exp_bits - 1)) - 1
    }

    // The hidden bit -- the implicit bit to the left of the radix point.
    fn integer_bit(&self) -> IntStorage {
        1 << self.frac_bits
    }
}

And now we can put everything together:

fn parse(desc: FormatDesc, storage: IntStorage) -> ArbFloat {
    let frac = (storage >> desc.frac_shift()) & desc.frac_mask();
    let biased_exp = (storage >> desc.biased_exp_shift()) & desc.biased_exp_mask();
    let sign = ((storage >> desc.sign_shift()) & desc.sign_mask()) != 0;

    // We add the precision to the bias to be able to interpret the fraction
    // field as an integer rather than as a fixed-point number in [1, 2).
    // We've essentially moved the radix point to the right by multiplying
    // the fraction by `2^(precision - 1)`, so we compensate by subtracting
    // `precision - 1` from the exponent.
    let exp = biased_exp as i32 - (desc.exp_bias() + desc.precision() - 1);
    let mut num = BigInt::from(if sign { -1 } else { 1 });
    let kind = if biased_exp == desc.biased_exp_mask() {
        if frac == 0 {
            FloatKind::Infinity
        } else {
            FloatKind::NaN
        }
    } else if biased_exp == 0 {
        if frac == 0 {
            FloatKind::Zero
        } else {
            // Subnormals.
            num *= frac;
            FloatKind::Regular { exp: exp + 1 }
        }
    } else {
        num *= frac | desc.integer_bit();
        FloatKind::Regular { exp }
    };
    ArbFloat::new(kind, num)
}

This is a bit gnarly, but then so is IEEE754. 😉 Let’s see what this prints out for a couple of examples:

fn print_examples() {
    println!("{:?}", parse(FormatDesc::BINARY32, 0x8000_0000)); // -0f32
    println!("{:?}", parse(FormatDesc::BINARY32, 0x7F80_0000)); // f32::INFINITY
    println!("{:?}", parse(FormatDesc::BINARY32, 0x7FC0_0000)); // f32::NAN
    println!("{:?}", parse(FormatDesc::BINARY32, 0x3F80_0000)); // 1f32
    println!("{:?}", parse(FormatDesc::BINARY64, 0x3FF0_0000_0000_0000)); // 1f64
}

This prints out:

ArbFloat { kind: Zero, num: -1 }
ArbFloat { kind: Infinity, num: 1 }
ArbFloat { kind: NaN, num: -1 }
ArbFloat { kind: Regular { exp: -23 }, num: 8388608 }
ArbFloat { kind: Regular { exp: -52 }, num: 4503599627370496 }

This is almost right, but we’re now faced with an inconvenient reality of floating-point numbers: uniqueness of representation is not a given, it must be earned! In particular, we see here two different representations for the number 1.

This stems from the fact that we can “trade” between the mantissa and the exponent by multiplying the mantissa by 2 and then subtracting 1 from the exponent to compensate, and vice versa. Therefore, every number $m \cdot 2^{e}$ has a many different “aliases”, of the form $(m \cdot 2^{d}) \cdot 2^{e - d}$.

In the IEEE754 binary interchange formats, this problem is neatly defined out of existence by saying that the mantissa always represents a fraction in the interval $[1, 2)$. And throw away the top bit, since it’s always 1, while you’re at it. In other words, the single alias with $1 \le m < 2$ is the only one that can be encoded. Other aliases would need to either set bits above the implicit bit or unset the implicit bit itself.

I did lie in the previous paragraph. Just a bit. The mantissa doesn’t always represent a fraction in the interval $[1, 2)$, due to the existence of subnormals. But they’re defined in a way that does not break uniqueness of representation, and meshes well with gradual underflow towards zero.

For subnormal numbers (biased_exp == 0, mantissa != 0), IEEE754 essentially degrades into a fixed-point representation. The meaning of the number is $0.\mathrm{mantissa} \cdot 2^{E_\mathrm{min}}$. The drop from biased_exp == 1 to biased_exp == 0 does not represent a drop in the numeric value of the exponent in the regular way, but rather, we start digging into the mantissa and losing precision gradually. Dividing a number with biased_exp == 2 by 2 would leave us with biased_exp == 1 and the same mantissa. Dividing a number with biased_exp == 1 by 2 would leave us with biased_exp == 0, but also shift the mantissa right one bit position, essentially shifting the hidden bit into the mantissa itself (which means that the hidden bit becomes zero; we stole it into the mantissa!).

So, how does all this apply to us? What do we do to always have a single representation for every number? Well, we could just “right-justify” the BigInt to the best degree we can. We divide the mantissa by the largest power of two that divides it, and donate it to the exponent. This does depart from the IEEE754 perspective of putting the normalization anchor “to the left” of the number, in the radix point. We’re putting our normalization anchor “to the right”, by saying that we only ever deal with integers in our analogue for the mantissa (and only odd integers, at that).

impl ArbFloat {
    fn new(mut kind: FloatKind, mut num: BigInt) -> Self {
        if let FloatKind::Regular { exp } = &mut kind {
            let adjustment = num.trailing_zeros().unwrap() as i32;
            *exp += adjustment;
            num >>= adjustment;
        }
        Self { kind, num }
    }
}

Re-running, we now get the following output:

ArbFloat { kind: Zero, num: -1 }
ArbFloat { kind: Infinity, num: 1 }
ArbFloat { kind: NaN, num: 1 }
ArbFloat { kind: Regular { exp: 0 }, num: 1 }
ArbFloat { kind: Regular { exp: 0 }, num: 1 }

We can even parse the fabled/cursed binary3 format!

fn print_binary3() {
    const BINARY3: FormatDesc = FormatDesc {
        frac_bits: 1,
        exp_bits: 1,
    };
    for x in 0..8 {
        println!("{:?}", parse(BINARY3, x));
    }
}
ArbFloat { kind: Zero, num: 1 }
ArbFloat { kind: Regular { exp: 0 }, num: 1 }
ArbFloat { kind: Infinity, num: 1 }
ArbFloat { kind: NaN, num: 1 }
ArbFloat { kind: Zero, num: -1 }
ArbFloat { kind: Regular { exp: 0 }, num: -1 }
ArbFloat { kind: Infinity, num: -1 }
ArbFloat { kind: NaN, num: -1 }

Next time, we’ll take a shot at implementing multiplication and rounding.