Numerical Analysis Notes - L2 - Floating Point Arithmetic (overseas course material)

JoynerWall

Article link: https://blog.csdn.net/apple_52225972/article/details/120199765

1. Finite precision number systems

1.1 Issues

2. Normalised systems

2.1 A general representation

2.2 Normalisation

2.3 Standard

2.3 Characteristics of a machine number system

3. Errors and their representations

3.1 Bounding the errors

4. Machine Precision

4.1 Why is this important?

4.2 An alternative definition of eps

5. “Features” of finite precision

5.1 Is this all academic?

6. Summary

1. Finite precision number systems

Computers store numbers with finite precision, i.e. using a finite set of bits (binary digits), typically 32 or 64 of them.

Many numbers cannot be stored exactly:

Some numbers cannot be represented precisely using any finite set of digits: e.g. $\sqrt{2}$ = 1.41421..., π = 3.14159..., etc.
Some cannot be represented precisely in a given number base: e.g. $\frac{1}{9}$ = 0.111... (decimal), $\frac{1}{5}$ = 0.00110011... (binary).
Others can be represented by a finite number of digits, but only by using more digits than are available: e.g. 1.526374856437 cannot be stored exactly using 10 decimal digits.
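The second and third cases above can be checked directly. The following is a minimal sketch using only the Python standard library; the helper name `binary_digits` is made up for this illustration and is not part of the original notes.

```python
from decimal import Context
from fractions import Fraction


def binary_digits(frac: Fraction, n: int) -> str:
    """Return the first n binary digits of a fraction in [0, 1) as a string."""
    bits = []
    for _ in range(n):
        frac *= 2
        bit = int(frac)          # integer part is the next binary digit
        bits.append(str(bit))
        frac -= bit
    return "0." + "".join(bits)


# 1/5 has a repeating expansion in base 2, so it cannot be stored exactly.
fifth_in_binary = binary_digits(Fraction(1, 5), 12)
print(fifth_in_binary)

# Fraction(0.2) is the exact value of the double that actually gets stored.
stored_exactly = Fraction(0.2) == Fraction(1, 5)
print(stored_exactly)

# Keeping only 10 significant decimal digits loses the trailing digits.
ten_digits = Context(prec=10).create_decimal("1.526374856437")
print(ten_digits)
```

Working with `Fraction` keeps the digit extraction exact, so the repeating pattern is not itself disturbed by rounding.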

1.1 Issues

The inaccuracies inherent in finite precision arithmetic must be modelled in order to understand:

how the numbers are represented (and the nature of the associated limitations);
the errors in their representation;
the errors which occur when arithmetic operations are applied to them.

The examples shown here will be in decimal, but the issues apply to any base, e.g. binary.

2. Normalised systems

2.1 A general representation

Any finite precision number can be written using the floating point representation:

$x = \pm 0.b_1 b_2 b_3 \ldots b_{t-1} b_t \times \beta^e$

The digits $b_i$ are integers satisfying 0 ≤ $b_i$ ≤ β − 1.
The mantissa, $b_1b_2b_3 \ldots b_{t-1}b_t$, contains t digits (t is a positive integer, the word length).
β is the base (always a positive integer, typically 2, 8, 10 or 16).
e is the integer exponent and is bounded (L ≤ e ≤ U), where L and U are fixed integers.

F(β, t, L, U) fully defines a finite precision number system (a machine number system).

2.2 Normalisation

Normalised finite precision systems will be considered here, i.e. systems for which the leading digit is nonzero: $b_1 \neq 0$ (0 < $b_1$ ≤ β − 1).

2.2.1 Examples:

In the case (β, t, L, U) = (10, 4, −49, 50) (base 10): 10000 = .1000 × $10^5$, 22.64 = .2264 × $10^2$, 0.0000567 = .5670 × $10^{-4}$.
In the case (β, t, L, U) = (2, 6, −7, 8) (binary): 10000 = .100000 × $2^5$, 1011.11 = .101111 × $2^4$, 0.000011 = .110000 × $2^{-4}$.
Zero is always taken to be a special case, e.g. 0 = ±.00...0 × $\beta^0$.
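The base 10 examples above can be reproduced with a short sketch. The function `normalise` is a hypothetical helper written for this note; it assumes the input has at most t significant digits and ignores the exponent bounds L and U.

```python
from decimal import Decimal


def normalise(value: str, t: int):
    """Return (sign, mantissa digits, exponent) so that value = sign .mantissa x 10^exponent."""
    d = Decimal(value)
    if d == 0:
        return ("+", "0" * t, 0)            # zero is treated as a special case
    sign = "-" if d < 0 else "+"
    digits = "".join(str(k) for k in d.as_tuple().digits).rstrip("0")
    if len(digits) > t:
        raise ValueError("value needs more than t digits")
    mantissa = digits.ljust(t, "0")         # pad on the right to exactly t digits
    exponent = d.adjusted() + 1             # shift so the leading digit follows the point
    return (sign, mantissa, exponent)


print(normalise("10000", 4))
print(normalise("22.64", 4))
print(normalise("0.0000567", 4))
```

`Decimal.adjusted()` gives the exponent of the leading digit in scientific notation, so adding one moves the point in front of that digit, which is the convention used in these notes.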

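2.3 Standard

The IEEE single precision standard is roughly (β, t, L, U) = (2, 23, −127, 128): 23 mantissa bits are stored explicitly, and an implicit leading bit gives 24 significant bits. This is available via numpy.single.
The IEEE double precision standard is roughly (β, t, L, U) = (2, 52, −1023, 1024): 52 mantissa bits are stored explicitly, and an implicit leading bit gives 53 significant bits. This is available via numpy.double.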
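The parameters of these two formats can be inspected with `numpy.finfo`. This is only a sketch: NumPy normalises the mantissa differently from the 0.$b_1b_2\ldots$ convention used in these notes, so the exponent bounds it reports should not be expected to match L and U exactly.

```python
import numpy as np

# Collect the stored mantissa bits and exponent bounds reported by NumPy.
formats = {}
for dtype in (np.single, np.double):
    info = np.finfo(dtype)
    formats[dtype] = (info.bits, info.nmant, info.minexp, info.maxexp)
    print(dtype.__name__, "bits:", info.bits, "stored mantissa bits:", info.nmant,
          "minexp:", info.minexp, "maxexp:", info.maxexp, "eps:", info.eps)
```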

Example 1:

Consider the number system given by (β, t, L, U) = (10, 2, −1, 2), which gives x = ±.$b_1b_2$ × $10^e$ where −1 ≤ e ≤ 2.

Q1: How many numbers can be represented by this normalised system?

A: 9 × 10 × 4 × 2 + 1 = 721

Q2: What are the two largest positive numbers in this system?

A: .99 × $10^2$, .98 × $10^2$

Q3: What are the two smallest positive numbers?

A: .10 × $10^{-1}$, .11 × $10^{-1}$

Q4: What is the smallest possible difference between two numbers in this system?

A: The smallest difference is .01 × $10^{-1}$ = 0.001.

Example 2:

Consider the number system given by (β, t, L, U) = (10, 3, −3, 3), which gives x = ±.$b_1b_2b_3$ × $10^e$ where −3 ≤ e ≤ 3.

Q1: How many numbers can be represented by this normalised system?

A: 9 × $10^2$ × 7 × 2 + 1 = 12601

Q2: What are the two largest positive numbers in this system?

A: .999 × $10^3$, .998 × $10^3$

Q3: What are the two smallest positive numbers?

A: .100 × $10^{-3}$, .101 × $10^{-3}$

Q4: What is the smallest possible difference between two numbers in this system?

A: .001 × $10^{-3}$ = 0.000001

Q5: What is the smallest possible difference between two numbers x and y in this system for which x < 100 < y?

A: 100 = .100 × $10^3$, so x = .999 × $10^2$ = 99.9 and y = .101 × $10^3$ = 101. The smallest difference is therefore 1.1.
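The answers to both examples can be cross-checked by brute force. The sketch below enumerates every number in a normalised system F(β, t, L, U) with exact rational arithmetic; `enumerate_system` is a name invented for this illustration.

```python
from bisect import bisect_left
from fractions import Fraction


def enumerate_system(beta: int, t: int, L: int, U: int):
    """Return every number in the normalised system F(beta, t, L, U), sorted."""
    positives = set()
    for e in range(L, U + 1):
        # Integer mantissas with a nonzero leading digit: beta^(t-1) .. beta^t - 1.
        for m in range(beta ** (t - 1), beta ** t):
            positives.add(Fraction(m, beta ** t) * Fraction(beta) ** e)
    ordered = sorted(positives)
    return [-v for v in reversed(ordered)] + [Fraction(0)] + ordered


def smallest_gap(values):
    """Smallest difference between two neighbouring numbers."""
    return min(b - a for a, b in zip(values, values[1:]))


ex1 = enumerate_system(10, 2, -1, 2)
ex2 = enumerate_system(10, 3, -3, 3)
pos1 = [v for v in ex1 if v > 0]
pos2 = [v for v in ex2 if v > 0]

print(len(ex1), float(pos1[-1]), float(pos1[-2]), float(pos1[0]), float(pos1[1]))
print(len(ex2), float(pos2[-1]), float(pos2[-2]), float(pos2[0]), float(pos2[1]))
print(float(smallest_gap(ex1)), float(smallest_gap(ex2)))

# Example 2, Q5: the neighbours of 100 on either side.
i = bisect_left(ex2, Fraction(100))
x, y = ex2[i - 1], ex2[i + 1]
print(float(x), float(y), float(y - x))
```

Using `Fraction` avoids the irony of checking a floating point system with floating point arithmetic; values are converted to `float` only for printing.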

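2.3 Characteristics of a machine number system:

It is a finite, discrete set.
It has a nonzero number of largest absolute value (M) and a nonzero number of smallest absolute value (m).
A number whose absolute value exceeds M causes an overflow error; one whose absolute value is below m causes an underflow error.
On overflow the program is interrupted; on underflow the number is replaced by zero and the program continues. Both cases are referred to as out-of-range (overflow) errors.
The computer represents zero as the number with zero mantissa and the smallest exponent.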

3. Errors and their representations

From now on fl(x) will be used to represent the (approximate) stored value of x.

The error in this representation can be expressed in two ways:

$\text{Absolute error} = |x - fl(x)|$

$\text{Relative error} = \frac{|x - fl(x)|}{|x|}$ (for x ≠ 0)

The number fl(x) is said to approximate x to t significant digits (or figures) if t is the largest non-negative integer for which Relative error < 0.5 × $\beta^{1-t}$.
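As an illustration of these definitions, the sketch below stores x = 0.1 in single precision and measures both errors exactly. The helper `significant_digits` is a hypothetical name that simply applies the criterion above with β = 10 by default.

```python
from fractions import Fraction

import numpy as np


def significant_digits(rel_err: Fraction, beta: int = 10) -> int:
    """Largest t with rel_err < 0.5 * beta**(1 - t); returns 0 if no digit is achieved."""
    if rel_err == 0:
        raise ValueError("the stored value is exact")
    t = 0
    while rel_err < Fraction(1, 2) * Fraction(beta) ** (1 - (t + 1)):
        t += 1
    return t


x = Fraction(1, 10)                       # the exact value 0.1
fl_x = Fraction(float(np.float32(0.1)))   # exact value of the stored single

abs_err = abs(x - fl_x)
rel_err = abs_err / abs(x)
digits = significant_digits(rel_err)

print(float(abs_err), float(rel_err), digits)
```

Converting the stored value to a `Fraction` gives its exact binary value, so the errors printed are the true representation errors rather than rounded estimates.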

3.1 Bounding the errors

In the number system given by (β, t, L, U), the nearest (larger) representable number to

x = 0.$b_1b_2b_3 \ldots b_{t-1}b_t$ × $\beta^e$ is

$\tilde{x} = x + 0.\underbrace{00 \ldots 01}_{t \text{ digits}} \times \beta^e = x + \beta^{e-t}$

Any number y ∈ (x, $\tilde{x}$) is stored as either x or $\tilde{x}$ by rounding to the nearest representable number, so

the largest possible error is $\frac{1}{2}\beta^{e-t}$,
which means that |y − fl(y)| ≤ $\frac{1}{2}\beta^{e-t}$.
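A quick numerical check of this bound in base 10 can be sketched with the `decimal` module. The function `fl` below is an assumed stand-in for storage in a t digit decimal system: it rounds to the nearest t digit number and ignores the exponent limits L and U.

```python
from decimal import ROUND_HALF_EVEN, Context, Decimal


def fl(y: str, t: int) -> Decimal:
    """Round y to t significant decimal digits (nearest representable number)."""
    return Context(prec=t, rounding=ROUND_HALF_EVEN).create_decimal(y)


t = 3
y = Decimal("3.14159")
stored = fl("3.14159", t)

e = y.adjusted() + 1                          # exponent in the 0.b1b2... x 10^e form
bound = Decimal(1) / 2 * Decimal(10) ** (e - t)
error = abs(y - stored)

print(stored, error, bound)
```

Here y lies between x = .314 × $10^1$ and $\tilde{x}$ = .315 × $10^1$, and the observed error stays below half the gap between them.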

Supplement:

Excerpted from: Numerical Analysis (2nd edition), Timothy Sauer, China Machine Press

4. Machine Precision

It follows from y > x ≥ .100...00 × $\beta^e$ = $\beta^{e-1}$ that

$\frac{|y - fl(y)|}{|y|} < \frac{\frac{1}{2}\beta^{e-t}}{\beta^{e-1}}$

and this provides a bound on the relative error: for any y

$\frac{|y - fl(y)|}{|y|} < \frac{1}{2}\beta^{1-t}$

The last term is known as machine precision or unit roundoff and is often called $\epsilon$ (eps).

Examples:

1. The number system (β, t, L, U) = (10, 2, −1, 2) gives

eps = $\frac{1}{2}\beta^{1-t}$ = $\frac{1}{2}10^{1-2}$ = 0.05.

2. The number system (β, t, L, U) = (10, 3, −3, 3) gives

eps = $\frac{1}{2}\beta^{1-t}$ = $\frac{1}{2}10^{1-3}$ = 0.005.

3. The number system (β, t, L, U) = (10, 7, 2, 10) gives

eps = $\frac{1}{2}\beta^{1-t}$ = $\frac{1}{2}10^{1-7}$ = 0.0000005.
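The three values follow directly from the formula eps = $\frac{1}{2}\beta^{1-t}$. A minimal sketch, with `unit_roundoff` as an illustrative function name, evaluates it exactly:

```python
from fractions import Fraction


def unit_roundoff(beta: int, t: int) -> Fraction:
    """Machine precision eps = (1/2) * beta**(1 - t) for a t digit system in base beta."""
    return Fraction(1, 2) * Fraction(beta) ** (1 - t)


eps_values = [unit_roundoff(10, t) for t in (2, 3, 7)]
for t, eps in zip((2, 3, 7), eps_values):
    print(t, float(eps))
```

Note that eps depends only on β and t, not on the exponent bounds L and U.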

4.1 Why is this important?

Arithmetic operations are usually carried out as though infinite precision is available, after which the result is rounded to the nearest representable number.

This means that arithmetic cannot be completely trusted

e.g. x + y = ?

and the usual rules don’t necessarily apply

e.g. x + (y + z) = (x + y) + z?
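A small experiment in double precision shows both effects. The values 0.1, 0.2 and 0.3 are chosen here purely for illustration; they are not taken from the original examples.

```python
x, y, z = 0.1, 0.2, 0.3

# The rounded sum of 0.1 and 0.2 is not the double nearest to 0.3.
sum_xy = x + y
print(sum_xy, sum_xy == 0.3)

# Associativity fails: the two groupings round differently.
left = (x + y) + z
right = x + (y + z)
print(left, right, left == right)
```

Each intermediate result is rounded before the next operation, so the order of operations changes which rounding errors are committed.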

4.2 An alternative definition of eps

Machine precision is the smallest positive number eps such that 1 + eps > 1 in finite precision arithmetic, i.e. it is half the difference between 1 and the next larger representable number.
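This definition suggests a simple experiment in double precision: keep halving a candidate until adding half of it to 1 no longer changes the result. The sketch below assumes the default round-to-nearest mode. What the loop finds is the gap between 1 and the next larger double, which is the value Python reports as `sys.float_info.epsilon`; the unit roundoff in the sense of these notes is half of that gap.

```python
import sys

gap = 1.0
while 1.0 + gap / 2 > 1.0:
    gap /= 2

# gap is now the spacing between 1.0 and the next larger double.
unit_roundoff = gap / 2

print(gap, unit_roundoff)
print(1.0 + gap > 1.0, 1.0 + unit_roundoff > 1.0)
```

Adding exactly half the gap produces a tie, which is rounded back to 1.0, so strictly speaking the smallest eps with 1 + eps > 1 lies just above the unit roundoff.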

5. “Features” of finite precision

Overflow: the number is too large to be represented, e.g. multiply the largest representable number by 10. With numpy.double values this gives inf (infinity) and is usually “fatal”.

Underflow: the number is too small to be represented, e.g. divide the smallest representable number by 10. This gives 0 and may not be immediately obvious.

Divide by zero: gives a result of inf, but $\frac{0}{0}$ gives nan (not a number).

Divide by inf: gives 0.0 with no warning.
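Each of these cases can be reproduced with numpy.double values. In this sketch the warnings NumPy would normally emit are silenced with `np.errstate` so that only the results are shown; plain Python floats behave differently in some cases (for example, `1.0 / 0.0` raises `ZeroDivisionError`).

```python
import numpy as np

with np.errstate(all="ignore"):
    largest = np.finfo(np.double).max
    overflowed = largest * 10                      # too large to represent

    smallest = np.nextafter(np.double(0.0), np.double(1.0))  # smallest positive double
    underflowed = smallest / 10                    # too small to represent

    div_zero = np.double(1.0) / np.double(0.0)
    zero_over_zero = np.double(0.0) / np.double(0.0)
    div_inf = np.double(1.0) / np.double(np.inf)

print(overflowed, underflowed, div_zero, zero_over_zero, div_inf)
```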

5.1 Is this all academic?

No! There are many examples of major software errors that have occurred because programmers did not understand the issues associated with computer arithmetic:

In February 1991, a basic rounding error within software for the US Patriot missile system caused it to fail, contributing to the loss of 28 lives.
In June 1996, the European Space Agency’s Ariane rocket exploded shortly after take-off: the error was due to failing to handle overflow correctly.
In October 2020, a driverless car drove straight into a wall due to faulty handling of a floating point error.

6. Summary

1. There is inaccuracy in almost all computer arithmetic.

2. Care must be taken to minimise its effects, for example:

add the smallest terms in an expression first;
avoid taking the difference of two very similar terms;
even checking whether a = b is dangerous!
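The three pieces of advice above can each be illustrated in a few lines. This is a sketch with values chosen for the purpose: single precision is used for the summation so that the effect shows up with few terms, and the rewritten form 2 sin²(x/2) of 1 − cos(x) is a standard trigonometric identity rather than something taken from the original notes.

```python
import math

import numpy as np

# 1. Add the smallest terms first (single precision).
big = np.float32(1.0)
small_terms = [np.float32(1e-8)] * 1000

large_first = big
for term in small_terms:
    large_first = large_first + term      # each tiny term is rounded away

small_first = np.float32(0.0)
for term in small_terms:
    small_first = small_first + term      # tiny terms accumulate first
small_first = small_first + big

print(large_first, small_first)

# 2. Avoid the difference of two very similar terms.
x = 1e-8
naive = 1.0 - math.cos(x)                 # cos(x) is extremely close to 1
stable = 2.0 * math.sin(x / 2) ** 2       # same quantity without the subtraction
print(naive, stable)

# 3. Compare with a tolerance rather than testing a == b.
a = 0.1 + 0.2
print(a == 0.3, math.isclose(a, 0.3))
```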

3. The usual mathematical rules no longer apply.

4. There is no point in trying to compute a solution to a problem to a greater accuracy than can be stored by the computer.

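Note: the answers given for some of the exercises in this article are my own working and are not guaranteed to be correct; corrections are welcome!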
