In this post, we’re going to get our hands dirty writing a toy arbitrary-precision floating point implementation in Rust. I’m therefore going to assume basic knowledge of IEEE754 (binary) floating point numbers. If you feel unprepared, I can recommend the following articles, which contain excellent visualizations:
- Bartosz Ciechanowski’s Exposing Floating Point.
- Michael Matloka’s Floating-point arithmetic – all you need to know, explained interactively.
Having an arbitrary-precision implementation is useful for pedagogical purposes, because a good mental model for IEEE754 floating point computations is to pretend that each computation happens in ideal math land, and is then brought back down to finite precision land as a separate post-processing step. Everything beyond that is “just” optimizations.
Let’s ignore the special values for now, and focus on the values everyone imagines when they think about floating-point values. Well, we basically just want a mantissa, an exponent, and a sign, right?
struct ArbFloat {
sign: bool,
exponent: i32,
mantissa: u32,
}
Is this it? Well, we have 32 bits of precision here, the number of bits we
granted the mantissa field. A bit of a far cry from arbitrary-precision,
if you ask me. It’s sufficient to hold single-precision values, which require
24 bits of precision. But it will struggle to hold double-precision values,
which require 53 bits of precision. And it will definitely struggle to hold
the results of arithmetic between two such floats.
Introducing num-bigint.
This crate has types for arbitrary-precision integers, so we can use them
to hold our mantissa:
struct ArbFloat {
sign: bool,
exponent: i32,
mantissa: BigUint,
}
But wait, shouldn’t we also make the exponent field a BigUint? The exponent
field describes, well, exponential growth of our floating-point number. A value
of 10000 in the exponent field means that the number has the value
$\mathrm{mantissa} \cdot 2^{10000}$. That’s plenty, so i32 will suffice.
Also, rather than having separate sign and BigUint fields, we can use
the BigInt type, which already has a sign. This will let us avoid dealing
with sign management code, and we won’t need to carefully avoid underflows
when subtracting BigUint-s:
struct ArbFloat {
exponent: i32,
number: BigInt,
}
Now, let’s handle the three special value categories we neglected earlier: zeros, infinities, and NaNs. The former two have a meaningful sign, but no behavior is ascribed to the sign bit of NaNs. Signed zeros, and the sign has meaningful behavior, wild!
enum ArbFloat {
Regular { exponent: i32, number: BigInt },
Zero { sign: bool },
Infinity { sign: bool },
NaN,
}
It’s kind of annoying that we’ll have to handle the sign’s behavior
non-uniformly across these variants. We can instead hoist the number field
out of Regular, and make the sign state of the other variants ride on it.
We’ll also be giving NaNs a sign, but it doesn’t matter. We’ll also use this
opportunity to slightly shorten the names of the fields.
#[derive(Debug, Copy, Clone)]
enum FloatKind {
Regular { exp: i32 },
Zero,
Infinity,
NaN,
}
#[derive(Debug, Clone)]
struct ArbFloat {
kind: FloatKind,
num: BigInt,
}
impl ArbFloat {
fn new(kind: FloatKind, num: BigInt) -> Self {
Self { kind, num }
}
}
There’s a hidden invariant here we haven’t mentioned yet, which must be
maintained for this data structure. The sign maintained by the BigInt type
is of the following type: pub enum Sign { Minus, NoSign, Plus }. But we
already have a Zero variant, and we need to maintain a sign state for it.
So basically num should always be nonzero. For Zero, Infinity, and NaN,
we’ll restrict ourselves even further and say that it should always be equal
to $\pm 1$.
Alright, say we have a floating point value in one of the many IEEE754 or IEEE754-like binary interchange formats. How do we ingest such a value into our data structure?
We don’t want to write a separate parsing routine for each such format, that would be quite duplicative and bug-prone. So we want a parametric description of an IEEE754-like format. Basically, all we need to know is how many bits are in the mantissa field, and how many bits are in the exponent field (the sign field always has one bit).
For simplicity’s sake, we’ll only support interchange formats up to 64 bits in
size. This fits formats like binary16 (half-precision),
bfloat16 (“brain floating point”), binary32 (single-precision), and
binary64 (double-precision).
type IntStorage = u64;
#[derive(Debug, Copy, Clone)]
struct FormatDesc {
frac_bits: u8,
exp_bits: u8,
}
impl FormatDesc {
const BINARY32: Self = Self {
frac_bits: 23,
exp_bits: 8,
};
const BINARY64: Self = Self {
frac_bits: 52,
exp_bits: 11,
};
}
fn parse(desc: FormatDesc, storage: IntStorage) -> ArbFloat {
todo!()
}
Let’s add some convenience methods on FormatDesc so that we’ll be able
to extract the three different fields of the storage format.
impl FormatDesc {
fn mask(bits: u32) -> IntStorage {
IntStorage::MAX >> (IntStorage::BITS - bits)
}
fn frac_mask(&self) -> IntStorage {
Self::mask(self.frac_bits as u32)
}
fn frac_shift(&self) -> IntStorage {
0
}
fn biased_exp_mask(&self) -> IntStorage {
Self::mask(self.exp_bits as u32)
}
fn biased_exp_shift(&self) -> IntStorage {
self.frac_shift() + self.frac_bits as IntStorage
}
fn sign_mask(&self) -> IntStorage {
1
}
fn sign_shift(&self) -> IntStorage {
self.biased_exp_shift() + self.exp_bits as IntStorage
}
}
Great. Now let’s start parsing.
fn parse(desc: FormatDesc, storage: IntStorage) -> ArbFloat {
let frac = (storage >> desc.frac_shift()) & desc.frac_mask();
let biased_exp = (storage >> desc.biased_exp_shift()) & desc.biased_exp_mask();
let sign = ((storage >> desc.sign_shift()) & desc.sign_mask()) != 0;
ArbFloat::new(todo!(), todo!())
}
Now that we have the raw fields, how do we reason about them to create an ArbFloat?
Let’s start with computing the FloatKind. We know that when the biased_exp
field is all-ones, then we’re dealing either with a NaN (when frac != 0)
or with an Infinity (when frac == 0). We also know that when the biased_exp
field is all-zeros, then we’re dealing either with a subnormal number (when frac != 0)
or with a zero (when frac == 0). Let’s sketch that out:
fn parse(desc: FormatDesc, storage: IntStorage) -> ArbFloat {
let frac = (storage >> desc.frac_shift()) & desc.frac_mask();
let biased_exp = (storage >> desc.biased_exp_shift()) & desc.biased_exp_mask();
let sign = ((storage >> desc.sign_shift()) & desc.sign_mask()) != 0;
let kind = if biased_exp == desc.biased_exp_mask() {
if frac == 0 {
FloatKind::Infinity
} else {
FloatKind::NaN
}
} else if biased_exp == 0 {
if frac == 0 {
FloatKind::Zero
} else {
// Subnormals.
FloatKind::Regular { exp: todo!() }
}
} else {
FloatKind::Regular { exp: todo!() }
};
ArbFloat::new(kind, todo!())
}
Almost there. We’re only missing an assignment for exp and num. We’ll need
to introduce a couple of helper methods on the FormatDesc structure:
impl FormatDesc {
fn precision(&self) -> i32 {
self.frac_bits as i32 + 1
}
// Move the unsigned (biased) exponent in the [1, 2^bits - 2] range to
// the signed [-2^{bits - 1} + 2, 2^{bits - 1} - 1] range.
fn exp_bias(&self) -> i32 {
(1 << (self.exp_bits - 1)) - 1
}
// The hidden bit -- the implicit bit to the left of the radix point.
fn integer_bit(&self) -> IntStorage {
1 << self.frac_bits
}
}
And now we can put everything together:
fn parse(desc: FormatDesc, storage: IntStorage) -> ArbFloat {
let frac = (storage >> desc.frac_shift()) & desc.frac_mask();
let biased_exp = (storage >> desc.biased_exp_shift()) & desc.biased_exp_mask();
let sign = ((storage >> desc.sign_shift()) & desc.sign_mask()) != 0;
// We add the precision to the bias to be able to interpret the fraction
// field as an integer rather than as a fixed-point number in [1, 2).
// We've essentially moved the radix point to the right by multiplying
// the fraction by `2^(precision - 1)`, so we compensate by subtracting
// `precision - 1` from the exponent.
let exp = biased_exp as i32 - (desc.exp_bias() + desc.precision() - 1);
let mut num = BigInt::from(if sign { -1 } else { 1 });
let kind = if biased_exp == desc.biased_exp_mask() {
if frac == 0 {
FloatKind::Infinity
} else {
FloatKind::NaN
}
} else if biased_exp == 0 {
if frac == 0 {
FloatKind::Zero
} else {
// Subnormals.
num *= frac;
FloatKind::Regular { exp: exp + 1 }
}
} else {
num *= frac | desc.integer_bit();
FloatKind::Regular { exp }
};
ArbFloat::new(kind, num)
}
This is a bit gnarly, but then so is IEEE754. 😉 Let’s see what this prints out for a couple of examples:
fn print_examples() {
println!("{:?}", parse(FormatDesc::BINARY32, 0x8000_0000)); // -0f32
println!("{:?}", parse(FormatDesc::BINARY32, 0x7F80_0000)); // f32::INFINITY
println!("{:?}", parse(FormatDesc::BINARY32, 0x7FC0_0000)); // f32::NAN
println!("{:?}", parse(FormatDesc::BINARY32, 0x3F80_0000)); // 1f32
println!("{:?}", parse(FormatDesc::BINARY64, 0x3FF0_0000_0000_0000)); // 1f64
}
This prints out:
ArbFloat { kind: Zero, num: -1 }
ArbFloat { kind: Infinity, num: 1 }
ArbFloat { kind: NaN, num: -1 }
ArbFloat { kind: Regular { exp: -23 }, num: 8388608 }
ArbFloat { kind: Regular { exp: -52 }, num: 4503599627370496 }
This is almost right, but we’re now faced with an inconvenient reality of floating-point numbers: uniqueness of representation is not a given, it must be earned! In particular, we see here two different representations for the number 1.
This stems from the fact that we can “trade” between the mantissa and the exponent by multiplying the mantissa by 2 and then subtracting 1 from the exponent to compensate, and vice versa. Therefore, every number $m \cdot 2^{e}$ has a many different “aliases”, of the form $(m \cdot 2^{d}) \cdot 2^{e - d}$.
In the IEEE754 binary interchange formats, this problem is neatly defined out of existence by saying that the mantissa always represents a fraction in the interval $[1, 2)$. And throw away the top bit, since it’s always 1, while you’re at it. In other words, the single alias with $1 \le m < 2$ is the only one that can be encoded. Other aliases would need to either set bits above the implicit bit or unset the implicit bit itself.
I did lie in the previous paragraph. Just a bit. The mantissa doesn’t always represent a fraction in the interval $[1, 2)$, due to the existence of subnormals. But they’re defined in a way that does not break uniqueness of representation, and meshes well with gradual underflow towards zero.
For subnormal numbers (biased_exp == 0, mantissa != 0), IEEE754 essentially
degrades into a fixed-point representation. The meaning of the number is
$0.\mathrm{mantissa} \cdot 2^{E_\mathrm{min}}$. The drop from biased_exp == 1
to biased_exp == 0 does not represent a drop in the numeric value of the
exponent in the regular way, but rather, we start digging into the mantissa
and losing precision gradually. Dividing a number with biased_exp == 2 by 2
would leave us with biased_exp == 1 and the same mantissa. Dividing a
number with biased_exp == 1 by 2 would leave us with biased_exp == 0, but
also shift the mantissa right one bit position, essentially shifting the hidden
bit into the mantissa itself (which means that the hidden bit becomes zero; we
stole it into the mantissa!).
So, how does all this apply to us? What do we do to always have a single
representation for every number? Well, we could just “right-justify” the BigInt
to the best degree we can. We divide the mantissa by the largest power of two
that divides it, and donate it to the exponent. This does depart from the IEEE754
perspective of putting the normalization anchor “to the left” of the number,
in the radix point. We’re putting our normalization anchor “to the right”, by
saying that we only ever deal with integers in our analogue for the mantissa
(and only odd integers, at that).
impl ArbFloat {
fn new(mut kind: FloatKind, mut num: BigInt) -> Self {
if let FloatKind::Regular { exp } = &mut kind {
let adjustment = num.trailing_zeros().unwrap() as i32;
*exp += adjustment;
num >>= adjustment;
}
Self { kind, num }
}
}
Re-running, we now get the following output:
ArbFloat { kind: Zero, num: -1 }
ArbFloat { kind: Infinity, num: 1 }
ArbFloat { kind: NaN, num: 1 }
ArbFloat { kind: Regular { exp: 0 }, num: 1 }
ArbFloat { kind: Regular { exp: 0 }, num: 1 }
We can even parse the fabled/cursed binary3 format!
fn print_binary3() {
const BINARY3: FormatDesc = FormatDesc {
frac_bits: 1,
exp_bits: 1,
};
for x in 0..8 {
println!("{:?}", parse(BINARY3, x));
}
}
ArbFloat { kind: Zero, num: 1 }
ArbFloat { kind: Regular { exp: 0 }, num: 1 }
ArbFloat { kind: Infinity, num: 1 }
ArbFloat { kind: NaN, num: 1 }
ArbFloat { kind: Zero, num: -1 }
ArbFloat { kind: Regular { exp: 0 }, num: -1 }
ArbFloat { kind: Infinity, num: -1 }
ArbFloat { kind: NaN, num: -1 }
Next time, we’ll take a shot at implementing multiplication and rounding.