IEEE Standard
The rule of thumb is "round to nearest, ties to even", by which I mean the IEEE-recommended rounding for floating point representation, though it is not strictly followed everywhere; case in point, TensorFlow seems not to, simply truncating the mantissa when converting fp32 to bf16 (check floating point - Convert FP32 to Bfloat16 in C++ - Stack Overflow).
- Round to nearest, ties to even – rounds to the nearest representable value; if the number falls exactly midway between two representable values, it is rounded to the one with an even least significant digit.
Relevance
This is mostly relevant in deep learning, where a precision mismatch in a few critical weights can matter: research shows weights are not created equal; some can be modified with no obvious change to model output, while others change it drastically when tweaked. Therefore, in commercial deployment, the service provider usually wants to make sure computation matches exactly the model testing done in lab setups.
Usually it goes like this: researchers compare GPU results against a CPU golden reference, and then NPUs and other ASICs playing catch-up compare their results against the GPU.
Large corporations capable of providing LLM services will also likely test the workload's tolerance to precision loss/changes beforehand.
Different libraries or Python packages use different rounding, so make sure you check beforehand how they implement fp322bf16, or implement the rounding yourself (a minimal sketch follows after the link below).
https://www.corsix.org/content/converting-fp32-to-fp16
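For reference, here is a minimal sketch of fp32-to-bf16 with round-to-nearest, ties-to-even, using the common add-a-rounding-bias bit trick; the function name and the NaN handling choice are mine, and real libraries may differ in the details:

```cpp
#include <cstdint>
#include <cstring>

// A minimal sketch: fp32 -> bf16 with round-to-nearest, ties-to-even.
uint16_t fp32_to_bf16_rne(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));        // type-pun without undefined behavior
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {    // NaN: handle first, or rounding could turn it into Inf
        return static_cast<uint16_t>((bits >> 16) | 0x0040u); // keep a quiet-NaN mantissa bit set
    }
    uint32_t lsb = (bits >> 16) & 1u;            // LSB of the 16 bits we keep
    bits += 0x7FFFu + lsb;                       // a tie (discarded == 0x8000) rounds up only if the kept LSB is odd
    return static_cast<uint16_t>(bits >> 16);    // a mantissa carry propagates into the exponent automatically
}
// e.g. fp32_to_bf16_rne(1.0f) == 0x3F80
```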
Breakdown
From a numerical point of view, the tie2even rule is simple (rounding to integers, 1.5 goes to 2.0 and 0.5 goes to 0.0), but binary representations make it easy to get confused, so here is a breakdown:
(use E for the exponent encoding, i.e. before subtracting the bias, and exp for the true exponent value)
in general:
for N + ".5", where N is an integer, we have a tie; ".5" means the value falls exactly midway between the two nearest representable values,
e.g. for fp322bf16 (examples using Float Toy), setting the 7 kept mantissa MSBs aside, the discarded bits being a leading 1 followed by all 0s is a ".5" tie
else
we simply round to nearest
where we round the mantissa in its binary form as usual, and generate a carry into the exponent when the mantissa overflows, e.g.
to bf16:
and if the carry results in dst. dtype exponent overflow, then follow the Inf/NaN protocol of the dst. dtype.
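A worked carry example (the input value is one I picked; it assumes the fp32_to_bf16_rne sketch above):

```cpp
#include <cassert>

// fp32 0x3FFFFFFF ~= 1.99999988:
//   sign 0 | E = 0111'1111 | kept mantissa MSBs = 1111111, discarded 16 bits = 0xFFFF
// the discarded bits sit above the 0x8000 midpoint, so we round up;
// 1111111 + 1 overflows: the mantissa becomes 0000000 and the carry bumps E to 1000'0000,
// giving bf16 0x4000 = 2.0.
void carry_example() {
    assert(fp32_to_bf16_rne(1.9999998807907104f) == 0x4000);
}
```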
Tie Breaking
FP322BF16
by the tie2even rule, we break the tie in favor of the candidate whose dst. dtype mantissa LSB is 0, hence
breaks to
- the numerical "evenness" of the mantissa does not follow integer intuition, so write out the binary and follow the strict definition of "tie2even" when confused.
- note that a tie can still round up (when the kept mantissa LSB is 1, e.g. a kept mantissa of all 1s), so tie2even can still generate a carry into the exponent; the carry logic above is still needed
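A few concrete tie cases, with inputs I picked, again assuming the fp32_to_bf16_rne sketch from earlier:

```cpp
#include <cassert>

void tie_examples() {
    // exact tie, kept mantissa 0000000 (LSB even): stays put
    // fp32 0x3F808000 = 1.00390625 -> bf16 0x3F80 = 1.0
    assert(fp32_to_bf16_rne(1.00390625f) == 0x3F80);

    // exact tie, kept mantissa 0000001 (LSB odd): rounds up to the even neighbor
    // fp32 0x3F818000 = 1.01171875 -> bf16 0x3F82 = 1.015625
    assert(fp32_to_bf16_rne(1.01171875f) == 0x3F82);

    // exact tie, kept mantissa 1111111 (odd): the even neighbor is the rounded-up one,
    // so the carry propagates into the exponent
    // fp32 0x3FFF8000 = 1.99609375 -> bf16 0x4000 = 2.0
    assert(fp32_to_bf16_rne(1.99609375f) == 0x4000);
}
```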
FP322FP16
a simple case:
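For illustration, a sketch of the normal (non-overflowing, non-subnormal) path with round-to-nearest-even; the name and structure are mine, and a complete converter also needs subnormal, Inf, NaN, and overflow handling, see the corsix.org post linked above:

```cpp
#include <cstdint>
#include <cstring>

// The normal-range path only, just to show the mechanics of fp32 -> fp16.
uint16_t fp32_to_fp16_normal_rne(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t sign    = (bits >> 31) & 0x1u;
    int32_t  exp     = static_cast<int32_t>((bits >> 23) & 0xFFu) - 127;  // true exponent
    uint32_t mant    = bits & 0x7FFFFFu;                                  // 23-bit fp32 mantissa
    uint32_t kept    = mant >> 13;                                        // top 10 bits survive
    uint32_t dropped = mant & 0x1FFFu;                                    // low 13 bits are rounded away
    // round to nearest, ties to even, on the 13 dropped bits
    if (dropped > 0x1000u || (dropped == 0x1000u && (kept & 1u)))
        kept += 1;                                                        // may carry into the exponent
    uint32_t e16 = static_cast<uint32_t>(exp + 15) + (kept >> 10);        // rebias to 5 bits, absorb the carry
    return static_cast<uint16_t>((sign << 15) | (e16 << 10) | (kept & 0x3FFu));
}
// e.g. fp32_to_fp16_normal_rne(1.0f) == 0x3C00, fp32_to_fp16_normal_rne(-2.5f) == 0xC100
```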
Exponent "Rounding"
There is no actual rounding for the exponent; when it overflows, we usually either set the value to NaN or Inf.
Exponent conversion is numerically a simple integer copy, or (value) truncation on overflow (which, by IEEE encoding, is either NaN or Inf) and underflow. This fp-value underflow caused by exponent absolute-value overflow should not be confused with the innate denormal, aka subnormal, underflow, see https://en.wikipedia.org/wiki/Subnormal_number. To avoid confusion, we will consider the exponent to be only capable of "overflow".
In binary operation, because of the innate minus-bias representation, the MSB of the exponent is a "bias sign bit" and always needs to be kept, hence:
keep MSB
if Upgrading:
    set the non-MSB extra (newly added) bits all to (~MSB)
    keep the lower bits
elif Downgrading:
    if (the non-MSB extra (dropped) bits are all (~MSB)):
        keep the lower bits
    else:  # exponent overflow
        if (overflow2NaN):
            set ALL exponent bits to 1s, including the MSB
            set the mantissa to a non-0 value
        if (overflow2Inf):
            set ALL exponent bits to copies of the MSB  # MSB=1 gives Inf, MSB=0 flushes to zero
            set the mantissa to 0
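A sketch of that exponent re-encoding in isolation (helper names and the std::optional return are my own choices; it assumes normal numbers only, so zero/subnormal and the all-ones Inf/NaN encodings still need their own handling):

```cpp
#include <cstdint>
#include <optional>

// Widen an exponent field from w_src to w_dst bits:
// keep the MSB, fill the newly added bits with ~MSB, keep the lower bits.
uint32_t widen_exponent(uint32_t e, int w_src, int w_dst) {
    uint32_t msb   = (e >> (w_src - 1)) & 1u;
    uint32_t low   = e & ((1u << (w_src - 1)) - 1u);
    uint32_t extra = msb ? 0u : ((1u << (w_dst - w_src)) - 1u);   // (w_dst - w_src) copies of ~MSB
    return (msb << (w_dst - 1)) | (extra << (w_src - 1)) | low;
}

// Narrow an exponent field from w_src to w_dst bits: representable only if the
// dropped non-MSB bits are all ~MSB; otherwise the destination exponent
// "overflows" and the caller applies the NaN/Inf/zero policy of the dst. dtype.
std::optional<uint32_t> narrow_exponent(uint32_t e, int w_src, int w_dst) {
    uint32_t msb      = (e >> (w_src - 1)) & 1u;
    uint32_t low      = e & ((1u << (w_dst - 1)) - 1u);
    uint32_t dropped  = (e >> (w_dst - 1)) & ((1u << (w_src - w_dst)) - 1u);
    uint32_t expected = msb ? 0u : ((1u << (w_src - w_dst)) - 1u);
    if (dropped != expected) return std::nullopt;                 // overflow
    return (msb << (w_dst - 1)) | low;
}
// e.g. widen_exponent(0b01111, 5, 8) == 0b01111111 (fp16 exp 0 -> fp32 exp 0)
//      narrow_exponent(0b10010000, 8, 5) has no value (fp32 exp 17 does not fit in fp16)
```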
examples
tie break and MSB=0
to nearest and MSB=1
overflow2NaN (the non-0 default is usually just setting the mantissa MSB to 1)
overflow2Zero
overflow2Inf
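For reference, the bf16 bit patterns those last three outcomes land on (spelled out by me, sign bit shown as 0):

```cpp
#include <cstdint>

constexpr uint16_t BF16_QNAN   = 0x7FC0;  // E = 1111'1111, mantissa MSB set (the usual non-0 default)
constexpr uint16_t BF16_POSINF = 0x7F80;  // E = 1111'1111, mantissa = 0
constexpr uint16_t BF16_ZERO   = 0x0000;  // E and mantissa all 0 (the overflow2Zero, i.e. underflow, case)
```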
On Practice For Deep Learning
First of all, provided no exponent absolute-value overflow happens, the rounding loss mostly has a negligible impact on training and inference.
The famous vanishing/exploding gradient problem has to do with the representation's innate subnormal behavior and range, not with rounding.
Rounding loss is pertinent to mixed-precision training and inference, where we frequently upgrade the dtype for mmad and downgrade it for writing to memory, or run the whole regularize, activate, normalize stage that must be processed on a vector unit with much lower throughput than a tensor unit. Even then, without exponent overflow, the loss should be insignificant; the precision differences introduced by accumulation order might be more of a concern.
When the exponent does overflow, we usually just set NaN, because we do not want to risk carrying on with a missing gradient correction in training, nor a missing weight or input in inference.
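For instance, a hypothetical downcast policy built on the normal-case fp32_to_fp16_normal_rne sketch from earlier, where anything outside fp16's finite range comes back as NaN rather than Inf (the threshold handling is deliberately simplistic):

```cpp
#include <cstdint>

// Overflow-to-NaN policy: an out-of-range weight or gradient surfaces as NaN
// downstream instead of silently saturating to +/-Inf.
uint16_t fp32_to_fp16_overflow_to_nan(float f) {
    const float FP16_MAX = 65504.0f;             // largest finite fp16 value
    if (!(f >= -FP16_MAX && f <= FP16_MAX))      // also catches NaN inputs
        return 0x7E00u;                          // a quiet fp16 NaN
    return fp32_to_fp16_normal_rne(f);           // normal-range path from the earlier sketch
}
```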