IEEE Floating Point Rounding

IEEE Standard

The rule of thumb is "round to nearest, ties to even": it is the rounding IEEE 754 recommends for floating point conversion, though it is not strictly followed everywhere; case in point, TensorFlow seems not to, apparently just truncating the mantissa when converting FP32 to BF16 (see "floating point - Convert FP32 to Bfloat16 in C++" on Stack Overflow).

  • Round to nearest, ties to even – rounds to the nearest value; if the number falls midway, it is rounded to the nearest value with an even least significant digit.
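For a quick decimal intuition, Python's built-in round() uses the same round-half-to-even rule, so it can serve as a sanity check (a minimal sketch of the rule in decimal):

# Round-half-to-even in decimal: exact halves go to the even neighbor.
assert round(0.5) == 0   # tie: 0 is even
assert round(1.5) == 2   # tie: 2 is even
assert round(2.5) == 2   # tie: 2 is even
assert round(1.6) == 2   # not a tie: plain round to nearest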

Relevance

This is mostly relevant in deep learning, where a precision mismatch on a few critical weights can matter. Research shows that weights are not created equal: some can be modified with no obvious change to model output, while others change it drastically when tweaked. Therefore, in commercial deployment, the service provider usually wants to make sure computation matches exactly what was observed during model testing in lab setups.

The usual workflow is that researchers compare GPU results against a CPU golden reference, and then NPUs and other ASICs playing catch-up compare their results against the GPU.

Large corporations capable of providing LLM services will also likely test the workload's tolerance to precision loss/changes beforehand.

Different libraries and Python packages use different rounding, so make sure you check beforehand how they implement fp322bf16, or implement the rounding yourself (a sketch follows the link below):

https://www.corsix.org/content/converting-fp32-to-fp16
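For example, here is a minimal sketch of an FP32-to-BF16 conversion with round-to-nearest, ties-to-even, done directly on the bit pattern (the function name fp32_to_bf16_rne and the NaN policy of forcing the mantissa MSB to 1 are my own choices, not prescribed by the link above):

import struct

def fp32_to_bf16_rne(x: float) -> int:
    """Return the 16-bit BF16 encoding of x, rounded to nearest, ties to even."""
    # Reinterpret the float as its 32-bit IEEE 754 pattern.
    bits = struct.unpack("<I", struct.pack("<f", float(x)))[0]
    # Keep NaN a NaN: force the mantissa MSB to 1 so the payload cannot
    # collapse to all zeros (which would read back as Inf).
    if (bits & 0x7FFFFFFF) > 0x7F800000:
        return ((bits >> 16) & 0xFFFF) | 0x0040
    # BF16 keeps the top 16 bits (sign, 8 exponent bits, 7 mantissa bits).
    # The 16 discarded bits decide the rounding:
    #   below 0x8000 -> round down, above 0x8000 -> round up,
    #   exactly 0x8000 -> tie, break toward the candidate whose kept LSB is 0.
    lsb = (bits >> 16) & 1
    bits += 0x7FFF + lsb          # may carry into the exponent, as intended
    return (bits >> 16) & 0xFFFF

assert fp32_to_bf16_rne(1.0) == 0x3F80    # exactly representable, no rounding
assert fp32_to_bf16_rne(-2.0) == 0xC000

The single add works because any value strictly above the halfway point pushes past 0x7FFF on its own, while an exact tie only rounds up when the kept LSB is 1.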

Breakdown

From a numerical point of view, the tie2even rule is simple (1.5 rounds to 2.0, 0.5 rounds to 0.0, etc.), but it gets confusing when working with binary, so here is a breakdown:

(E denotes the exponent encoding, i.e. before subtracting the bias; exp denotes the true exponent value)

in general:

for N + ".5", where N is an integer in units of the destination's LSB, we have a tie; ".5" means exactly midway between two representable values,

e.g. for fp322bf16 (examples made with Float Toy), setting the 7 kept mantissa MSBs aside, the 16 discarded bits being a leading 1 followed by all 0s is a ".5" tie;

else

simply round to nearest,

where we round the mantissa in its binary form as usual, and generate a carry into the exponent when the mantissa overflows, e.g.

to bf16: [Float Toy screenshot in the original; a worked carry example follows below]

and if the carry in turn overflows the destination dtype's exponent, then follow the Inf/NaN protocol of the destination dtype.
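As a concrete instance, using the fp32_to_bf16_rne sketch from earlier (this block assumes that definition):

# 0x3FFFFFFF is 1.9999998807907104 in FP32: exponent 127, mantissa all 1s.
# Rounding the mantissa to 7 bits rounds up, the mantissa wraps to all 0s,
# and the carry bumps the exponent, so the BF16 result is exactly 2.0 (0x4000).
assert fp32_to_bf16_rne(1.9999998807907104) == 0x4000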

Tie Breaking

FP322BF16

by the tie2even rule, we break the tie in favor of the candidate whose destination-dtype LSB is 0, hence

[FP32 pattern breaks to BF16 pattern: Float Toy screenshots in the original; a numeric tie-break example follows the notes below]

  • the numerical "evenness" of the mantissa does not follow integer intuition, so write out the binary and apply the strict definition of "tie2even" when confused.
  • note that a tie broken downward (kept LSB already 0) never introduces a carry; a tie broken upward can still carry into the exponent when the kept mantissa is all 1s, just like an ordinary round-up.
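Numerically, with the fp32_to_bf16_rne sketch from above (again assuming that definition), the tie cases look like this:

# Tie with kept LSB = 0: 1 + 2**-8 breaks DOWN to 1.0 (kept mantissa 0000000).
assert fp32_to_bf16_rne(1.0 + 2**-8) == 0x3F80
# Tie with kept LSB = 1: 1 + 3*2**-8 breaks UP to 1.015625 (kept mantissa 0000010).
assert fp32_to_bf16_rne(1.0 + 3 * 2**-8) == 0x3F82
# Tie with the kept mantissa all 1s: breaking up carries into the exponent,
# e.g. 1.99609375 (kept mantissa 1111111, discarded bits exactly the tie) goes to 2.0.
assert fp32_to_bf16_rne(1.99609375) == 0x4000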

FP322FP16

a simple case: [Float Toy screenshot in the original; a numeric version follows]
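A numeric version of a simple FP32-to-FP16 tie, using NumPy's cast (which, as far as I know, rounds to nearest, ties to even; verify this assumption for whichever library you use):

import numpy as np

# FP16 keeps 10 mantissa bits, so 1 + 2**-11 is exactly midway between
# 1.0 and the next FP16 value, 1 + 2**-10.
assert np.float16(np.float32(1.0 + 2**-11)) == np.float16(1.0)               # kept LSB 0: tie breaks down
assert np.float16(np.float32(1.0 + 3 * 2**-11)) == np.float16(1.0 + 2**-9)   # kept LSB 1: tie breaks up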

Exponent "Rounding"

There is no actual rounding for the exponent; on overflow, we usually set the value to either NaN or Inf.

Exponent conversion is numerically either a simple integer copy, or a saturation of the value on overflow (which, by IEEE encoding, is either NaN or Inf) or underflow. This floating-point-value underflow, caused by the exponent's absolute value going out of range, should not be confused with the innate denormal, aka subnormal, underflow; see https://en.wikipedia.org/wiki/Subnormal_number. To avoid confusion, we will treat the exponent as only ever being able to "overflow".

In binary terms, because of the innate minus-bias representation, the MSB of the exponent acts as a "bias sign bit" and must always be kept, hence:

keep E_MSB                                    # the bias "sign" bit survives either way
if Upgrading (destination exponent is wider):
    set the inserted non-MSB extra bits to all (~E_MSB)s
    keep the lower bits
elif Downgrading (destination exponent is narrower):
    if (the dropped non-MSB extra bits are all (~E_MSB)s):
        keep the lower bits                   # the exponent fits in the narrower encoding
    else:                                     # exponent overflow, pick a policy:
        if (overflow2NaN):
            set ALL exponent bits to 1s, including the MSB
            set the mantissa to a non-0 value
        if (overflow2Inf or overflow2Zero):
            set ALL exponent bits to E_MSB    # MSB=1 gives Inf, MSB=0 gives zero
            set the mantissa to 0
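A runnable check of the upgrading branch, FP16 to FP32 exponents (the helper name widen_exponent is mine; only the normal exponent encodings 1..30 are covered, the all-zeros and all-ones encodings need their own handling):

def widen_exponent(e5: int) -> int:
    """Widen a 5-bit FP16 exponent encoding to the 8-bit FP32 encoding."""
    msb = (e5 >> 4) & 1              # the bias "sign" bit, always kept
    low = e5 & 0b1111                # lower bits, always kept
    fill = 0b000 if msb else 0b111   # the three inserted bits are ~MSB
    return (msb << 7) | (fill << 4) | low

# The bit trick agrees with the plain re-bias formula E8 = E5 - 15 + 127.
for e5 in range(1, 31):
    assert widen_exponent(e5) == e5 - 15 + 127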

examples (Float Toy screenshots in the original; only the captions remain):

  • tie break, with E MSB = 0
  • round to nearest, with E MSB = 1
  • overflow2NaN (the non-0 default is usually just setting M_MSB to 1)
  • overflow2Zero
  • overflow2Inf
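For a library-level illustration of the Inf and zero outcomes (using NumPy's FP32-to-FP16 cast as the example; the NaN policy is a separate hardware/library choice and is not what NumPy does here):

import numpy as np

# FP16's largest finite value is 65504; exponent overflow saturates to Inf.
print(np.float16(np.float32(70000.0)))   # inf
# Far below the FP16 subnormal range, the value flushes to (signed) zero.
print(np.float16(np.float32(-1e-10)))    # -0.0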

On Practice For Deep Learning

First of all, provided no exponent absolute-value overflow happens, the rounding loss mostly has a negligible impact on training and inference.

The famous vanishing/exploding gradient problem has to do with the representation's innate range and subnormal behavior, not with rounding.

Rounding loss is pertinent to mixed-precision training and inference, where we frequently upgrade the dtype for matrix multiply-accumulate and downgrade it for writing to memory, or run the whole regularize/activate/normalize stage on a vector unit with much lower throughput than a tensor unit. Even so, without exponent overflow, the loss should be insignificant; precision differences from accumulation order may be more of a concern (a small illustration follows).
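As a tiny illustration of the accumulation-order point (NumPy FP16 scalars; the values are chosen only to make the effect deterministic):

import numpy as np

a = np.float16(1.0)
b = np.float16(2**-11)
# Same three numbers, different grouping, different FP16 results:
left  = (a + b) + b      # 1.0 + 2**-11 ties back down to 1.0, twice
right = a + (b + b)      # b + b = 2**-10 is exact, and 1.0 + 2**-10 is representable
print(left, right)       # 1.0 vs 1.001 (= 1.0009765625)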

When the exponent does overflow, we usually just set NaN, because we do not want to risk carrying on with a missing gradient correction in training, nor a missing weight or input in inference.
