An introduction to the IEEE 754 standard for binary floating-point arithmetic

IEEE 754

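The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) is the most widely used standard for floating-point computation and is followed by many CPU floating-point units. The standard defines formats for representing floating-point numbers (including negative zero, −0, and denormal numbers) and special values (infinities and NaNs), together with a set of floating-point operations on these values. It also specifies four rounding modes and five exceptions (including when they occur and how they are handled).

IEEE 754 specifies four formats for representing floating-point values: single precision (32 bits), double precision (64 bits), single-extended precision (at least 43 bits, rarely used) and double-extended precision (at least 79 bits, usually implemented as 80 bits). Only the 32-bit format is required; the others are optional. Most programming languages provide IEEE formats and arithmetic, though some list them as optional. For example, the C language, which predates IEEE 754, now includes IEEE arithmetic but does not make it mandatory (C's float typically denotes IEEE single precision and double denotes double precision).

The full title of the standard is IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985), also known as IEC 60559:1989, Binary floating-point arithmetic for microprocessor systems (originally numbered IEC 559:1989).[1] The later IEEE 854-1987 standard for "radix-independent floating-point" arithmetic covers radix 2 and radix 10.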


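Anatomy of a floating-point number

The following describes the floating-point formats defined by the standard.

Bit conventions used in this article

Bits within a word of width W are indexed by integers from 0 to W−1. The rightmost bit is usually numbered 0, so that the lowest-numbered bit coincides with the least significant bit (lsb: the bit with the smallest place value, whose change affects the value least).

General layout

Binary floating-point numbers are stored in sign-magnitude form. The most significant bit is the sign bit. The "exponent", the next e most significant bits, holds the exponent after an exponent bias has been applied. The "fraction", the remaining f bits, holds the significand without its most significant bit.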
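As a rough illustration of this layout, the sketch below splits a word into its three fields with shifts and masks. The helper name split_fields and its parameters are illustrative choices, not part of the standard; the default widths (e = 8, f = 23) assume the 32-bit single-precision format.

```python
import struct

def split_fields(word, e_bits=8, f_bits=23):
    """Split an unsigned integer holding a float's bits into (sign, exponent, fraction).

    Bit 0 is the least significant bit. split_fields is a hypothetical helper;
    the default widths assume single precision.
    """
    fraction = word & ((1 << f_bits) - 1)               # lowest f bits
    exponent = (word >> f_bits) & ((1 << e_bits) - 1)   # next e bits (still biased)
    sign = (word >> (f_bits + e_bits)) & 1              # most significant bit
    return sign, exponent, fraction

# Reinterpret the single-precision value 1.0 as a 32-bit unsigned integer.
(word,) = struct.unpack(">I", struct.pack(">f", 1.0))
print(split_fields(word))   # (0, 127, 0)
```

The same function should work for double precision by passing e_bits=11 and f_bits=52 with a 64-bit word.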


Exponent biasing

The exponent is biased by $2^{e-1} - 1$: the stored exponent is the actual exponent plus this value (see excess-N under signed number representations). Biasing is done because exponents have to be signed values in order to represent both tiny and huge values, but two's complement, the usual representation for signed values, would make comparison harder. To solve this, the exponent is biased before being stored, by adjusting its value to put it within an unsigned range suitable for comparison.

For example, to represent a number whose actual exponent is 17, the stored exponent field is $17 + 2^{e-1} - 1$.
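The bias formula can be checked with a few lines of Python. This is only a sketch; the function names exponent_bias and stored_exponent are made up for the illustration.

```python
def exponent_bias(e_bits):
    """Bias for an exponent field that is e_bits wide: 2**(e_bits - 1) - 1."""
    return (1 << (e_bits - 1)) - 1

def stored_exponent(actual_exponent, e_bits):
    """Value written to the exponent field of a normalized number."""
    return actual_exponent + exponent_bias(e_bits)

print(exponent_bias(8), exponent_bias(11))   # 127 1023 (single and double precision)
print(stored_exponent(17, 8))                # 144, i.e. 17 + 127
```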

Cases

The most significant bit of the significand (not stored) is determined by the value of the exponent field. If $0 < exponent < 2^e - 1$, the most significant bit of the significand is 1, and the number is said to be normalized. If exponent is 0, the most significant bit of the significand is 0 and the number is said to be denormalized. Three special cases arise:

1.    if exponent is 0 and fraction is 0, the number is ±0 (depending on the sign bit)

2.    if exponent = $2^e - 1$ and fraction is 0, the number is ±infinity (again depending on the sign bit), and

3.    if exponent = $2^e - 1$ and fraction is not 0, the number being represented is not a number (NaN).

This can be summarized as:

Type                    Exponent            Fraction
Zeroes                  0                   0
Denormalized numbers    0                   non zero
Normalized numbers      1 to $2^e - 2$      any
Infinities              $2^e - 1$           0
NaNs                    $2^e - 1$           non zero
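The classification above translates directly into a small decision function. The sketch below is illustrative only: classify is a hypothetical name, and it takes the already-extracted biased exponent and fraction fields as plain integers.

```python
def classify(exponent, fraction, e_bits=8):
    """Name the kind of value encoded by a biased exponent field and a fraction field."""
    all_ones = (1 << e_bits) - 1      # 2**e - 1, the largest exponent field value
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == all_ones:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"               # 1 <= exponent <= 2**e - 2

print(classify(0, 0))        # zero
print(classify(0, 1))        # denormalized
print(classify(127, 0))      # normalized
print(classify(255, 0))      # infinity
print(classify(255, 1))      # NaN
```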

Single-precision 32 bit

A single-precision binary floating-point number is stored in 32 bits.

Figure: Bit values for the IEEE 754 32-bit float 0.15625

The exponent is biased by $2^{8-1} - 1 = 127$ in this case (exponents in the range −126 to +127 are representable; see the explanation above for why biasing is done). An exponent of −127 would be biased to the value 0, but this is reserved to encode that the value is a denormalized number or zero. An exponent of 128 would be biased to the value 255, but this is reserved to encode an infinity or not a number (NaN). See the table above.

For normalized numbers, the most common case, Exp is the biased exponent and Fraction is the significand without its most significant bit.

The number has value v:

$v = s \times 2^{e} \times m$

Where

s = +1 (positive numbers) when the sign bit is 0

s = −1 (negative numbers) when the sign bit is 1

e = Exp − 127 (in other words the exponent is stored with 127 added to it, also called "biased with 127")

m = 1.fraction in binary (that is, the significand is the binary number 1 followed by the radix point followed by the binary bits of the fraction). Therefore, $1 \le m < 2$.

In the example shown above, the sign is zero, the exponent is −3, and the significand is 1.01 (in binary, which is 1.25 in decimal). The represented number is therefore $+1.25 \times 2^{-3}$, which is +0.15625.
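The formula for v can be evaluated by hand in Python. The sketch below assumes the bit pattern of 0.15625 is 0x3E200000 (sign 0, Exp 01111100, fraction 0100...0); decode_single is a hypothetical helper that only handles normalized numbers.

```python
import struct

def decode_single(word):
    """Evaluate v = s * 2**e * m for a normalized single-precision bit pattern."""
    sign_bit = word >> 31
    exp_field = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    if not 0 < exp_field < 255:
        raise ValueError("this sketch only handles normalized numbers")
    s = -1 if sign_bit else 1
    e = exp_field - 127            # remove the bias
    m = 1 + fraction / 2**23       # 1.fraction
    return s * 2.0**e * m

word = 0x3E200000
print(decode_single(word))         # 0.15625

# Cross-check against the platform's own interpretation of the same bits.
print(struct.unpack(">f", struct.pack(">I", word))[0])   # 0.15625
```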

Notes:

1.    Denormalized numbers are the same except that e = −126 and m is 0.fraction. (e is NOT −127: the fraction has to be shifted to the right by one more bit, in order to include the leading bit, which is not always 1 in this case. This is balanced by incrementing the exponent to −126 for the calculation.)

2.    −126 is the smallest exponent for a normalized number

3.    There are two zeroes, +0 (sign bit is 0) and −0 (sign bit is 1)

4.    There are two infinities, +∞ (sign bit is 0) and −∞ (sign bit is 1)

5.    NaNs may have a sign and a fraction, but these have no meaning other than for diagnostics; the first bit of the fraction is often used to distinguish signaling NaNs from quiet NaNs

6.    NaNs and Infinities have all 1s in the Exp field.

7.    The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are

$\pm 2^{-149} \approx \pm 1.4012985 \times 10^{-45}$

8.    The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the Exp field and 0 in the fraction field) are

$\pm 2^{-126} \approx \pm 1.175494351 \times 10^{-38}$

9.    The finite positive and finite negative numbers furthest from zero (represented by the value with 254 in the Exp field and all 1s in the fraction field) are

$\pm (1 - (1/2)^{24}) \times 2^{128}$ [2] $\approx \pm 3.4028235 \times 10^{38}$
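The three extreme values in notes 7 to 9 can be reproduced from their bit patterns. The sketch below assumes Python floats are IEEE 754 doubles (true on mainstream platforms), which hold every single-precision value exactly; bits_to_single is a hypothetical helper.

```python
import struct

def bits_to_single(word):
    """Reinterpret a 32-bit unsigned integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

smallest_denormal = bits_to_single(0x00000001)   # Exp = 0,   Fraction = 1
smallest_normal = bits_to_single(0x00800000)     # Exp = 1,   Fraction = 0
largest_finite = bits_to_single(0x7F7FFFFF)      # Exp = 254, Fraction all 1s

print(smallest_denormal == 2.0**-149)                  # True
print(smallest_normal == 2.0**-126)                    # True
print(largest_finite == (1 - 2.0**-24) * 2.0**128)     # True
```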


Here is the summary table from the previous section with some 32-bit single-precision examples:

Type                       Exponent     Significand                     Value
Zero                       0000 0000    000 0000 0000 0000 0000 0000    0.0
One                        0111 1111    000 0000 0000 0000 0000 0000    1.0
Denormalized number        0000 0000    100 0000 0000 0000 0000 0000    $5.9 \times 10^{-39}$
Large normalized number    1111 1110    111 1111 1111 1111 1111 1111    $3.4 \times 10^{38}$
Small normalized number    0000 0001    000 0000 0000 0000 0000 0000    $1.18 \times 10^{-38}$
Infinity                   1111 1111    000 0000 0000 0000 0000 0000    Infinity
NaN                        1111 1111    non zero                        NaN

A more complex example

Figure: Bit values for the IEEE 754 32-bit float −118.625

Let us encode the decimal number −118.625 using the IEEE 754 system.

1.    First we need to get the sign, the exponent and the fraction. Because it is a negative number, the sign is "1".

2.    Now, we write the number (without the sign; i.e. unsigned, no two's complement) using binary notation. The result is 1110110.101.

3.    Next, let's move the radix point left, leaving only a 1 at its left: $1110110.101 = 1.110110101 \times 2^{6}$. This is a normalized floating-point number. The fraction is the part at the right of the radix point, filled with 0 on the right until we get all 23 bits. That is 11011010100000000000000.

4.    The exponent is 6, but we need to convert it to binary and bias it (so the most negative exponent is 0, and all exponents are non-negative binary numbers). For the 32-bit IEEE 754 format, the bias is 127 and so 6 + 127 = 133. In binary, this is written as 10000101.

5.    Putting the sign, exponent and fraction together gives the 32-bit pattern 1 10000101 11011010100000000000000.
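The encoding steps can be replayed in code. The sketch below assembles the word from the sign, the unbiased exponent and the 23-bit fraction string derived in the steps, then cross-checks the result against struct; encode_single is a hypothetical helper name.

```python
import struct

def encode_single(sign, actual_exponent, fraction_bits):
    """Assemble a single-precision bit pattern.

    sign is 0 or 1, actual_exponent is unbiased, and fraction_bits is a
    23-character string of binary digits.
    """
    exp_field = actual_exponent + 127                  # apply the bias
    return (sign << 31) | (exp_field << 23) | int(fraction_bits, 2)

word = encode_single(1, 6, "11011010100000000000000")
print(f"{word:032b}")    # 11000010111011010100000000000000
print(f"{word:08X}")     # C2ED4000

# Cross-check: pack -118.625 as a big-endian single and compare the bytes.
print(struct.pack(">f", -118.625).hex())   # c2ed4000
```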

Double-precision 64 bit

Figure: The three fields in a 64-bit IEEE 754 float

Double precision is essentially the same except that the fields are wider: 1 sign bit, 11 exponent bits and 52 fraction bits.

The fraction part is much larger, while the exponent is only slightly larger. The standard creators believed precision is more important than range.

NaNs and Infinities are represented with Exp being all 1s (2047).

For normalized numbers the exponent bias is +1023 (so e is Exp − 1023). For denormalized numbers the exponent is −1022 (the minimum exponent for a normalized number; it is not −1023 because normalized numbers have a leading 1 digit before the binary point and denormalized numbers do not). As before, both infinity and zero are signed.

Notes:

1.    The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are

$\pm 2^{-1074} \approx \pm 5 \times 10^{-324}$

2.    The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the Exp field and 0 in the fraction field) are

$\pm 2^{-1022} \approx \pm 2.2250738585072014 \times 10^{-308}$

3.    The finite positive and finite negative numbers furthest from zero (represented by the value with 2046 in the Exp field and all 1s in the fraction field) are

$\pm (1 - (1/2)^{53}) \times 2^{1024}$ [2] $\approx \pm 1.7976931348623157 \times 10^{308}$
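The double-precision limits can be checked the same way as the single-precision ones. The sketch assumes Python floats are IEEE 754 doubles; bits_to_double is a hypothetical helper. Because $2^{1024}$ itself is not representable, the largest value is computed in the equivalent form $(2 - 2^{-52}) \times 2^{1023}$.

```python
import math
import struct
import sys

def bits_to_double(word):
    """Reinterpret a 64-bit unsigned integer as an IEEE 754 double-precision value."""
    return struct.unpack(">d", struct.pack(">Q", word))[0]

smallest_denormal = bits_to_double(0x0000000000000001)   # Exp = 0,    Fraction = 1
smallest_normal = bits_to_double(0x0010000000000000)     # Exp = 1,    Fraction = 0
largest_finite = bits_to_double(0x7FEFFFFFFFFFFFFF)      # Exp = 2046, Fraction all 1s

print(smallest_denormal == math.ldexp(1.0, -1074))                # True
print(smallest_normal == sys.float_info.min)                      # True
print(largest_finite == sys.float_info.max)                       # True
print(largest_finite == (2 - 2.0**-52) * math.ldexp(1.0, 1023))   # True
```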

Comparing floating-point numbers

The IEEE formats are laid out so that floating-point numbers order lexicographically by sign, exponent and fraction. If NaNs are excluded, two IEEE floating-point numbers can be compared by comparing their bit patterns as sign-magnitude integers.
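A small sketch of this property for single precision: each value is mapped to a signed integer built from its sign bit and the remaining 31 bits, and the integers are compared instead of the floats. The helper names are made up for the example, and NaNs are deliberately left out.

```python
import struct

def single_bits(x):
    """Bit pattern of x rounded to single precision, as an unsigned 32-bit integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def sign_magnitude_key(x):
    """Map a non-NaN value to an integer that orders the same way as the float."""
    word = single_bits(x)
    magnitude = word & 0x7FFFFFFF          # exponent and fraction bits
    return -magnitude if word >> 31 else magnitude

values = [float("-inf"), -118.625, -1.0, -0.0, 0.0, 0.15625, 1.0, float("inf")]
keys = [sign_magnitude_key(v) for v in values]
print(keys == sorted(keys))                # True
```

Note that −0 and +0 both map to the integer 0, which matches the rule that the two zeroes compare equal.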

Rounding floating-point numbers

The IEEE standard has four different rounding modes; the first is the default; the others are called directed roundings.

·         Round to Nearest: rounds to the nearest value; if the number falls midway it is rounded to the nearest value with an even (zero) least significant bit, which occurs 50% of the time (in IEEE 754r this mode is called roundTiesToEven to distinguish it from another round-to-nearest mode)

·         Round toward 0: directed rounding towards zero

·         Round toward +∞: directed rounding towards positive infinity

·         Round toward −∞: directed rounding towards negative infinity.
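The default mode's tie-breaking rule can be observed when converting integers that are too large for a double. The sketch assumes Python floats are IEEE 754 doubles with a 53-bit significand and relies on CPython rounding int-to-float conversions to nearest, ties to even. It does not demonstrate the directed modes, which plain Python does not expose.

```python
# Just above 2**53, representable doubles are spaced 2 apart,
# so every odd integer lies exactly halfway between two neighbours.
tie_low = float(2**53 + 1)     # halfway between 2**53 and 2**53 + 2
tie_high = float(2**53 + 3)    # halfway between 2**53 + 2 and 2**53 + 4

# Each tie goes to the neighbour whose least significant significand bit is 0.
print(tie_low == 2.0**53)          # True  (rounded down)
print(tie_high == 2.0**53 + 4)     # True  (rounded up)
```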

Extending the real numbers

The IEEE standard employs (and extends) the affinely extended real number system, with separate positive and negative infinities. During drafting, there was a proposal for the standard to incorporate the projectively extended real number system, with a single unsigned infinity, by providing programmers with a mode selection option. In the interest of reducing the complexity of the final standard, the projective mode was dropped, however. The Intel 8087 and Intel 80287 floating point co-processors both support this projective mode.[3][4][5]

Recommended functions and predicates

·         Under some C compilers, copysign(x,y) returns x with the sign of y, so abs(x) equals copysign(x,1.0). This is one of the few operations which operates on a NaN in a way resembling arithmetic. copysign is a new function under the C99 standard.

·         −x returns x with the sign reversed. Note that this is different from 0−x in some cases, notably when x is 0. So −(0) is −0, but the sign of 0−0 depends on the rounding mode.

·         scalb (y, N)

·         logb (x)

·         finite (x) a predicate for "x is a finite value", equivalent to −Inf < x < Inf

·         isnan (x) a predicate for "x is a nan", equivalent to "x ≠ x"

·         x <> y which turns out to have different exception behavior than NOT(x = y).

·         unordered (x, y) is true when "x is unordered with y", i.e., either x or y is a NaN.

·         class (x)

·         nextafter(x,y) returns the next representable value from x in the direction towards y
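Several of these functions have close counterparts in Python's math module, which makes their behaviour easy to try out. The mapping below is an assumption made for illustration (for example, math.ldexp is used as a stand-in for scalb); it is not a statement about how the standard names them.

```python
import math

nan = float("nan")
inf = float("inf")

negated = math.copysign(3.0, -0.0)                  # 3.0 with the sign of -0.0
sign_of_neg_zero = math.copysign(1.0, -0.0)         # -(0) is -0
sign_of_difference = math.copysign(1.0, 0.0 - 0.0)  # 0 - 0 under the default rounding mode

print(negated)                 # -3.0
print(sign_of_neg_zero)        # -1.0
print(sign_of_difference)      # 1.0
print(nan != nan)              # True: the "x is not equal to x" test for NaN
print(math.isnan(nan))         # True
print(math.isfinite(inf))      # False
print(math.ldexp(0.625, 3))    # 5.0, i.e. 0.625 * 2**3
```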

References

1.    Codes (in English)

2.    Prof. W. Kahan. "Lecture Notes on the Status of IEEE 754" (PDF). October 1, 1997 3:36 am. Elect. Eng. & Computer Science, University of California. Retrieved on 2007-04-12.

3.    John R. Hauser (March 1996). "Handling Floating-Point Exceptions in Numeric Programs" (PDF). ACM Transactions on Programming Languages and Systems 18 (2).

4.    David Stevenson (March 1981). "IEEE Task P754: A proposed standard for binary floating-point arithmetic". Computer 14 (3): 51–62.

5.    Kahan, W. and Palmer, J. (1979). "On a proposed floating-point standard". SIGNUM Newsletter 14 (Special): 13–21.

·         Floating Point Unit by Jidan Al-Eryani

Revision of the standard

Note that the IEEE 754 standard is currently under revision. See: IEEE 754r

See also

·         −0 (negative zero)

·         IEEE 754r working group to revise IEEE 754-1985.

·         NaN (Not a Number)

·         minifloat for simple examples of properties of IEEE 754 floating point numbers

·         Intel 8087 (early implementation effort)

·         Q (number format) For constant resolution

 

External links

·         IEEE 754 references

·         Let's Get To The (Floating) Point by Chris Hecker

·         What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg - a good introduction and explanation.

·         IEEE 854-1987 History and minutes

·         Converter

·         Another Converter

·         Converter as MS-Windows program

·         Comparing doubles in C++

·         An Interview with the Old Man of Floating-Point

·         Coprocessor.info: x87 FPU pictures, development and manufacturer information
