An Introduction to the IEEE 754 Standard for Binary Floating-Point Arithmetic

dllbl

IEEE 754

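The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) is the most widely used standard for floating-point computation and is followed by many CPUs and floating-point units. The standard defines formats for representing floating-point numbers (including negative zero, −0, and denormal numbers), some special values (infinities and NaNs, "not a number"), and a set of floating-point operations on these values. It also specifies four rounding modes and five exceptions (including when the exceptions occur and how they are handled).

IEEE 754 specifies four formats for representing floating-point values: single precision (32 bits), double precision (64 bits), single-extended precision (43 bits or more, rarely used) and double-extended precision (79 bits or more, usually implemented with 80 bits). Only the 32-bit format is required by the standard; the others are optional. Most programming languages provide IEEE formats and arithmetic, although some treat them as optional. For example, the C language, which predates IEEE 754, now includes IEEE arithmetic but does not require it (the C float usually means IEEE single precision and double usually means IEEE double precision).

The full title of the standard is IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985). It is also known as IEC 60559:1989, Binary floating-point arithmetic for microprocessor systems (the original reference number was IEC 559:1989)^[1]. A later standard, IEEE 854-1987, covers "radix-independent floating point" and specifies the cases where the radix is 2 or 10.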

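Anatomy of a floating-point number

The following is a description of the standard's format for floating-point numbers.

Bit conventions used in this article

The bits of a computer word of width W are numbered with the integers 0 to W−1. The rightmost bit is usually numbered 0, so that the lowest-numbered bit is the least significant bit (lsb, the bit with the smallest place value, whose change affects the value the least).

General layout

Binary floating-point numbers are stored in sign-magnitude form. The most significant bit is the sign bit. The "exponent" field, the next e bits, holds the exponent after exponent biasing. The "fraction" field, the remaining f bits, holds the significand without its most significant bit.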

Exponent biasing

The stored exponent is the true exponent plus a bias of 2^(e−1) − 1, where e is the width of the exponent field in bits (see Excess-N for this way of handling signed numbers). Biasing is done because exponents have to be signed values in order to represent both tiny and huge values, but two's complement, the usual representation for signed values, would make comparison harder. To solve this, the exponent is biased before being stored, which puts it within an unsigned range suitable for comparison.

For example, to represent a number whose exponent is 17, the stored exponent field is 17 + 2^(e−1) − 1. In single precision (e = 8) that is 17 + 127 = 144.
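The biasing rule can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `exponent_bias` and `stored_exponent` are made up for this example and do not come from the standard.

```python
def exponent_bias(exponent_bits: int) -> int:
    """Bias for a format whose exponent field is `exponent_bits` wide."""
    return 2 ** (exponent_bits - 1) - 1


def stored_exponent(true_exponent: int, exponent_bits: int) -> int:
    """Unsigned value written into the exponent field (hypothetical helper)."""
    return true_exponent + exponent_bias(exponent_bits)


# Single precision has an 8-bit exponent field, double precision 11 bits.
print(exponent_bias(8), exponent_bias(11))   # 127 1023
print(stored_exponent(17, 8))                # 144

# Biased exponents compare as plain unsigned integers.
print(stored_exponent(-3, 8) < stored_exponent(17, 8))   # True
```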

Examples

The most significant bit of the significand (not stored) is determined by the value of exponent. If 0 < exponent < 2^e − 1, the most significant bit of the significand is 1, and the number is said to be normalized. If exponent is 0, the most significant bit of the significand is 0 and the number is said to be denormalized. Three special cases arise:

1. if exponent is 0 and fraction is 0, the number is ±0 (depending on the sign bit)

2. if exponent = 2^e − 1 and fraction is 0, the number is ±infinity (again depending on the sign bit), and

3. if exponent = 2^e − 1 and fraction is not 0, the number being represented is not a number (NaN).

This can be summarized as:

Type	Exponent	Fraction
Zeroes	0	0
Denormalized numbers	0	non zero
Normalized numbers	1 to 2^e − 2	any
Infinities	2^e − 1	0
NaNs	2^e − 1	non zero
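As a rough illustration of the table above, the following sketch classifies a value by inspecting the exponent and fraction fields of its single-precision encoding. It assumes the 32-bit layout (8 exponent bits, 23 fraction bits), and `classify32` is a hypothetical helper name, not a standard function.

```python
import struct


def classify32(x: float) -> str:
    """Classify x by the fields of its single-precision bit pattern."""
    # Pack as a big-endian 32-bit float, then reread the bytes as an integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    exponent = (bits >> 23) & 0xFF   # 8-bit exponent field
    fraction = bits & 0x7FFFFF       # 23-bit fraction field
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 0xFF:             # 2^e - 1 with e = 8
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"


for value in (0.0, -0.0, 1.0, 2.0 ** -149, float("inf"), float("nan")):
    print(value, classify32(value))
```

Note that `struct.pack(">f", x)` raises `OverflowError` for finite values too large for single precision, so this sketch is only meant for values that fit the format.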

Single-precision 32 bit

A single-precision binary floating-point number is stored in 32 bits.

Bit values for the IEEE 754 32-bit float 0.15625: sign 0, exponent 0111 1100 (124), fraction 010 0000 0000 0000 0000 0000

The exponent is biased by 2⁸⁻¹ − 1 = 127 in this case (Exponents in the range −126 to +127 are representable. See the above explanation to understand why biasing is done). An exponent of −127 would be biased to the value 0 but this is reserved to encode that the value is a denormalized number or zero. An exponent of 128 would be biased to the value 255 but this is reserved to encode an infinity or not a number (NaN). See the chart above.

For normalized numbers, the most common case, Exp is the biased exponent and Fraction is the significand without its most significant bit.

The number has value v:

v = s × 2^e × m

Where

s = +1 (positive numbers) when the sign bit is 0

s = −1 (negative numbers) when the sign bit is 1

e = Exp − 127 (in other words the exponent is stored with 127 added to it, also called "biased with 127")

m = 1.fraction in binary (that is, the significand is the binary number 1 followed by the radix point followed by the binary bits of the fraction). Therefore, 1 ≤ m < 2.

In the example shown above, the sign is zero, the exponent is −3, and the significand is 1.01 (in binary, which is 1.25 in decimal). The represented number is therefore +1.25 × 2⁻³, which is +0.15625.
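The decoding of 0.15625 can be reproduced as follows. This is a minimal sketch for normalized single-precision values only; the function names `decode32` and `value_of_normalized` are assumptions made for this example.

```python
import struct


def decode32(x: float):
    """Return (sign bit, Exp field, Fraction field) of x as a 32-bit float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign_bit = bits >> 31
    exp_field = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign_bit, exp_field, fraction


def value_of_normalized(sign_bit: int, exp_field: int, fraction: int) -> float:
    """Evaluate v = s * 2^e * m for a normalized number (0 < Exp < 255)."""
    s = -1.0 if sign_bit else 1.0
    e = exp_field - 127               # remove the bias
    m = 1.0 + fraction / 2 ** 23      # 1.fraction in binary
    return s * 2.0 ** e * m


fields = decode32(0.15625)
print(fields)                         # (0, 124, 2097152)
print(value_of_normalized(*fields))   # 0.15625
```

The fraction value 2097152 is 2^21, that is, the bit pattern 010 0000 0000 0000 0000 0000, which gives m = 1.25.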

Notes:

1. Denormalized numbers are the same except that e = −126 and m is 0.fraction. (e is NOT −127 : The fraction has to be shifted to the right by one more bit, in order to include the leading bit, which is not always 1 in this case. This is balanced by incrementing the exponent to −126 for the calculation.)

2. −126 is the smallest exponent for a normalized number

3. There are two Zeroes, +0 (s is 0) and −0 (s is 1)

4. There are two Infinities +∞ (s is 0) and −∞ (s is 1)

5. NaNs may have a sign and a fraction, but these have no meaning other than for diagnostics; the first bit of the fraction is often used to distinguish signaling NaNs from quiet NaNs

6. NaNs and Infinities have all 1s in the Exp field.

7. The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are

±2⁻¹⁴⁹ ≈ ±1.4012985×10⁻⁴⁵

8. The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the Exp field and 0 in the fraction field) are

±2⁻¹²⁶ ≈ ±1.175494351×10⁻³⁸

9. The finite positive and finite negative numbers furthest from zero (represented by the value with 254 in the Exp field and all 1s in the fraction field) are

±((1-(1/2)²⁴)2¹²⁸) ^[2] ≈ ±3.4028235×10³⁸

Here is the summary table from the previous section with some 32-bit single-precision examples:

Type	Exponent	Fraction	Value
Zero	0000 0000	000 0000 0000 0000 0000 0000	0.0
One	0111 1111	000 0000 0000 0000 0000 0000	1.0
Denormalized number	0000 0000	100 0000 0000 0000 0000 0000	5.9×10⁻³⁹
Large normalized number	1111 1110	111 1111 1111 1111 1111 1111	3.4×10³⁸
Small normalized number	0000 0001	000 0000 0000 0000 0000 0000	1.18×10⁻³⁸
Infinity	1111 1111	000 0000 0000 0000 0000 0000	Infinity
NaN	1111 1111	non zero	NaN
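The boundary values in the notes and the rows of the table above can be reproduced by building single-precision values directly from their bit patterns. The helper `float_from_bits32` is a hypothetical name used only in this sketch.

```python
import struct


def float_from_bits32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


smallest_denormal = float_from_bits32(0x00000001)  # Exp = 0, Fraction = 1
table_denormal = float_from_bits32(0x00400000)     # Exp = 0, top fraction bit set
smallest_normal = float_from_bits32(0x00800000)    # Exp = 1, Fraction = 0
one = float_from_bits32(0x3F800000)                # Exp = 127, Fraction = 0
largest_finite = float_from_bits32(0x7F7FFFFF)     # Exp = 254, Fraction all 1s
positive_infinity = float_from_bits32(0x7F800000)  # Exp = 255, Fraction = 0

print(smallest_denormal == 2.0 ** -149)                    # True
print(table_denormal == 2.0 ** -127)                       # True
print(smallest_normal == 2.0 ** -126)                      # True
print(largest_finite == (1 - 2.0 ** -24) * 2.0 ** 128)     # True
print(one, positive_infinity)                              # 1.0 inf
```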

A more complex example

Bit values for the IEEE 754 32-bit float −118.625: sign 1, exponent 1000 0101 (133), fraction 110 1101 0100 0000 0000 0000

Let us encode the decimal number −118.625 using the IEEE 754 system.

1. First we need to get the sign, the exponent and the fraction. Because it is a negative number, the sign is "1".

2. Now, we write the number (without the sign; i.e. unsigned, no two's complement) using binary notation. The result is 1110110.101.

3. Next, let's move the radix point left, leaving only a 1 at its left: 1110110.101 = 1.110110101 × 2⁶. This is a normalized floating point number. The fraction is the part at the right of the radix point, filled with 0 on the right until we get all 23 bits. That is 11011010100000000000000.

4. The exponent is 6, but we need to convert it to binary and bias it (so the most negative exponent is 0, and all exponents are non-negative binary numbers). For the 32-bit IEEE 754 format, the bias is 127 and so 6 + 127 = 133. In binary, this is written as 10000101.
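The steps above can be sketched in Python by assembling the three fields into one 32-bit word and comparing the result with what the `struct` module produces for the same number. The function `encode32_fields` is a hypothetical helper written for this example.

```python
import struct


def encode32_fields(sign_bit: int, true_exponent: int, fraction_bits: str) -> int:
    """Assemble sign, biased exponent and fraction into a 32-bit word."""
    exp_field = true_exponent + 127                   # apply the bias
    fraction = int(fraction_bits.ljust(23, "0"), 2)   # pad with 0s on the right
    return (sign_bit << 31) | (exp_field << 23) | fraction


# -118.625 = -1.110110101 (binary) x 2^6
word = encode32_fields(1, 6, "110110101")
packed = struct.unpack(">I", struct.pack(">f", -118.625))[0]

print(f"{word:032b}")    # 11000010111011010100000000000000
print(hex(word))         # 0xc2ed4000
print(word == packed)    # True
```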

Double-precision 64 bit

The three fields in a 64-bit IEEE 754 float: 1 sign bit, 11 exponent bits and 52 fraction bits

Double precision is essentially the same as single precision except that the fields are wider.

The fraction part is much larger, while the exponent is only slightly larger. The standard creators believed precision is more important than range.

NaNs and Infinities are represented with Exp being all 1s (2047).

For normalized numbers the exponent bias is +1023 (so e is Exp − 1023). For denormalized numbers the exponent is −1022, the minimum exponent for a normalized number. It is not −1023 because normalized numbers have a leading 1 digit before the binary point and denormalized numbers do not. As before, both infinity and zero are signed.

Notes:

1. The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are

±2⁻¹⁰⁷⁴ ≈ ±5×10⁻³²⁴

2. The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the Exp field and 0 in the fraction field) are

±2⁻¹⁰²² ≈ ±2.2250738585072014×10⁻³⁰⁸

3. The finite positive and finite negative numbers furthest from zero (represented by the value with 2046 in the Exp field and all 1s in the fraction field) are

±((1-(1/2)⁵³)2¹⁰²⁴) ^[2] ≈ ±1.7976931348623157×10³⁰⁸
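The double-precision fields and boundary values can be inspected in the same way as for single precision. The sketch below relies on the fact that a Python `float` is an IEEE 754 double on common platforms; `decode64` and `float_from_bits64` are hypothetical helper names.

```python
import struct
import sys


def decode64(x: float):
    """Return (sign bit, Exp field, Fraction field) of a 64-bit double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign_bit = bits >> 63
    exp_field = (bits >> 52) & 0x7FF      # 11-bit exponent field
    fraction = bits & ((1 << 52) - 1)     # 52-bit fraction field
    return sign_bit, exp_field, fraction


def float_from_bits64(bits: int) -> float:
    """Reinterpret a 64-bit pattern as an IEEE 754 double."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


smallest_denormal = float_from_bits64(0x0000000000000001)
smallest_normal = float_from_bits64(0x0010000000000000)
largest_finite = float_from_bits64(0x7FEFFFFFFFFFFFFF)

print(decode64(1.0))                          # (0, 1023, 0)
print(smallest_denormal)                      # 5e-324
print(smallest_normal == sys.float_info.min)  # True
print(largest_finite == sys.float_info.max)   # True
print(decode64(largest_finite)[1])            # 2046
```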

Comparing floating-point numbers

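IEEE floating-point formats are laid out so that values can be ordered lexicographically by their bits. If NaNs are excluded, IEEE floating-point numbers can be compared as sign-magnitude integers (the two zeros, +0 and −0, compare equal even though their sign bits differ).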
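A small sketch of this property for doubles: the hypothetical helper `ordered_key` below reads the bit pattern as a sign-magnitude integer, and sorting by that key gives the same order as sorting by value. NaNs are rejected because they are unordered.

```python
import struct


def ordered_key(x: float) -> int:
    """Map a non-NaN double to an integer that sorts in the same order."""
    if x != x:
        raise ValueError("NaNs are unordered and have no place in this ordering")
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    magnitude = bits & 0x7FFFFFFFFFFFFFFF        # everything except the sign bit
    return -magnitude if bits >> 63 else magnitude


samples = [3.0e300, -1.5, 0.0, float("inf"), -0.0, 1.0, 2.0 ** -1074,
           float("-inf"), 1.0000000000000002, -(2.0 ** -1074)]

by_bits = sorted(samples, key=ordered_key)
by_value = sorted(samples)
print(by_bits == by_value)   # True

# Adjacent doubles differ by exactly 1 when viewed as integers.
print(ordered_key(1.0000000000000002) - ordered_key(1.0))   # 1
```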

Rounding floating-point numbers

The IEEE standard has four different rounding modes; the first is the default; the others are called directed roundings.

· Round to Nearest – rounds to the nearest value; if the number falls midway it is rounded to the nearest value with an even (zero) least significant bit, which occurs 50% of the time (in IEEE 754r this mode is called roundTiesToEven to distinguish it from another round-to-nearest mode)

· Round toward 0 – directed rounding towards zero

· Round toward +∞ – directed rounding towards positive infinity

· Round toward −∞ – directed rounding towards negative infinity.
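The tie-breaking rule of the default mode can be observed with doubles near 2⁵³, where consecutive representable values are 2 apart. The sketch below assumes the default round-to-nearest mode is in effect, which is the usual situation for a Python program. The directed modes are not shown because the Python standard library offers no portable way to switch the rounding mode of binary floating-point arithmetic.

```python
# Above 2**53 consecutive doubles are 2 apart, so adding an odd integer
# produces an exact tie between two representable neighbours.
base = 2.0 ** 53

tie_a = base + 1.0   # exact result lies midway between 2**53 and 2**53 + 2
tie_b = base + 3.0   # exact result lies midway between 2**53 + 2 and 2**53 + 4

print(int(tie_a) - 2 ** 53)   # 0: rounded to the neighbour with an even last bit
print(int(tie_b) - 2 ** 53)   # 4: rounded to the neighbour with an even last bit
```

In the first case the tie goes down and in the second it goes up, so the choice depends on which neighbour has a zero least significant bit rather than on a fixed direction.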

Extending the real numbers

The IEEE standard employs (and extends) the affinely extended real number system, with separate positive and negative infinities. During drafting, there was a proposal for the standard to incorporate the projectively extended real number system, with a single unsigned infinity, by providing programmers with a mode selection option. In the interest of reducing the complexity of the final standard, the projective mode was dropped, however. The Intel 8087 and Intel 80287 floating point co-processors both support this projective mode.^[3][4][5]

Recommended functions and predicates

· Under some C compilers, copysign(x,y) returns x with the sign of y, so abs(x) equals copysign(x,1.0). This is one of the few operations that acts on a NaN in a way resembling arithmetic. In C, copysign is a new function under the C99 standard.

· −x returns x with the sign reversed. Note that this is different from 0−x in some cases, notably when x is 0. So −(0) is −0, but the sign of 0−0 depends on the rounding mode.

· scalb (y, N) returns y × 2^N for an integer N, without computing 2^N

· logb (x) returns the unbiased exponent of x

· finite (x) a predicate for "x is a finite value", equivalent to −Inf < x < Inf

· isnan (x) a predicate for "x is a nan", equivalent to "x ≠ x"

· x <> y which turns out to have different exception behavior than NOT(x = y).

· unordered (x, y) is true when "x is unordered with y", i.e., either x or y is a NaN.

· class (x) tells which class x falls into (for example NaN, infinity, normalized, denormalized or zero)

· nextafter(x,y) returns the next representable value from x in the direction towards y
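Several of these operations have close counterparts in Python's `math` module, which makes their behaviour easy to try out. The mapping below is an approximation chosen for illustration, not an exact implementation of the standard's functions: `math.ldexp` stands in for scalb, `math.frexp` is used to derive a logb-like exponent, and `unordered` is a hypothetical helper defined here.

```python
import math

nan = float("nan")


def unordered(x: float, y: float) -> bool:
    """True when either operand is a NaN (hypothetical helper)."""
    return math.isnan(x) or math.isnan(y)


# copysign(x, y): magnitude of x with the sign of y, even when y is -0.
sign_copied = math.copysign(3.0, -0.0)

# Negation flips the sign of zero, while 0 - 0 is +0 under round-to-nearest.
sign_of_negated = math.copysign(1.0, -0.0)
sign_of_difference = math.copysign(1.0, 0.0 - 0.0)

# scalb-like scaling: 0.75 * 2**4 without forming 2**4 separately.
scaled = math.ldexp(0.75, 4)

# frexp returns a mantissa in [0.5, 1), so the logb-style exponent is one less.
mantissa, exponent = math.frexp(8.0)
logb_of_8 = exponent - 1

print(sign_copied, sign_of_negated, sign_of_difference)   # -3.0 -1.0 1.0
print(scaled, logb_of_8)                                  # 12.0 3
print(math.isfinite(1.0), math.isfinite(float("inf")))    # True False
print(math.isnan(nan), nan != nan)                        # True True
print(unordered(nan, 1.0), unordered(1.0, 2.0))           # True False
```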

References

1. ↑ Codes (in English)

2. ↑ Prof. W. Kahan. "Lecture Notes on the Status of IEEE 754" (PDF). October 1, 1997 3:36 am. Elect. Eng. & Computer Science, University of California. Retrieved on 2007-04-12.

3. ↑ John R. Hauser (March 1996). "Handling Floating-Point Exceptions in Numeric Programs" (PDF). ACM Transactions on Programming Languages and Systems 18 (2).

4. ↑ David Stevenson (March 1981). "IEEE Task P754: A proposed standard for binary floating-point arithmetic". Computer 14 (3): 51–62.

5. ↑ Kahan, W. and Palmer, J. (1979). "On a proposed floating-point standard". SIGNUM Newsletter 14 (Special): 13–21.

· Floating Point Unit by Jidan Al-Eryani

Revision of the standard

This article describes IEEE 754-1985. The revision effort known as IEEE 754r was later published as IEEE 754-2008. See: IEEE 754r

See also

· −0 (negative zero)

· IEEE 754r working group to revise IEEE 754-1985.

· NaN (Not a Number)

· minifloat for simple examples of properties of IEEE 754 floating point numbers

· Intel 8087 (early implementation effort)

· Q (number format) For constant resolution

External links

· IEEE 754 references

· Let's Get To The (Floating) Point by Chris Hecker

· What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg - a good introduction and explanation.

· IEEE 854-1987 History and minutes

· Converter

· Another Converter

· Converter as MS-Windows program

· Comparing doubles in C++

· An Interview with the Old Man of Floating-Point

· Coprocessor.info: x87 FPU pictures, development and manufacturer information
