Statistical Computing ①: Floating-point Arithmetic and Stability

除了鸽什么也不会

Category: Statistics

Article link: https://blog.csdn.net/qq_23095921/article/details/82808955

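Representation of Real Numbers in a Computer

The fractional part (decimal part) of a real number may be infinitely long.
Fixed-point system
The radix point sits at a fixed position in the digit string, e.g. with 8 digits in total, 00012345 represents 0001.2345.
Drawbacks: the scale is fixed and inflexible; overflow/underflow occurs easily; storage is wasted.
Float: Floating-point number
"Floating point" means that the radix point of the number can "float".
${\text{significand}}\times {\text{base}}^{\text{exponent}}$
Here the significand is an integer, the exponent is an integer, and the base is 2, 10, or 16.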

https://en.wikipedia.org/wiki/Floating-point_arithmetic

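In computer science, floating point (FP) is a way of representing approximations of real numbers: a significand (also called the mantissa) is scaled by a power, usually an integer power of some base. A number represented this way is called a floating-point number. Computing with such numbers is called floating-point arithmetic, and it usually involves approximation or rounding because results cannot always be represented exactly.

Early CPUs had no hardware support for floating-point arithmetic, so a standalone floating-point unit (FPU), such as the Intel 8087, was sold separately.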

https://zh.wikipedia.org/wiki/浮点运算器

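A floating-point unit (FPU) is the hardware that performs floating-point operations. It is usually implemented as a circuit inside a computer chip. It was a major step after the integer unit: before the FPU was invented, floating-point operations were emulated with integer arithmetic, which was very inefficient. Results from an FPU always carry some error, yet scientific and engineering computing still relies heavily on it, so precision has to be considered when a program is designed.

A floating-point system can represent only a finite set of rational numbers. Irrational numbers (non-terminating, non-repeating expansions) cannot be represented exactly, and neither can rationals whose expansion in the base is infinitely repeating.
e.g. 0.2 in binary is $0.\overline{0011}_{(2)} = 0.001100110011\dots_{(2)}$, which never terminates.
Extension: conversion between decimal and binary numbers (base conversion).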

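Integers:
For the integer part, divide the decimal number by 2 repeatedly until the quotient is 0. Reading the remainders from bottom to top gives the binary digits of the integer part.
For whole numbers, repeatedly divide by the base and record the remainders.

Fractions:
For the fractional part, multiply it by 2 and record the integer part of the result, then repeat with the remaining fractional part until it becomes 0 (for some inputs, such as 0.2, it never does). Reading the recorded integer parts from top to bottom gives the binary digits after the radix point.
For the fractional part, repeatedly multiply the part after the radix by the base, and record the part before the radix.

Convert $59.25_{(10)}$ to binary:
Integer part: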

59 ÷ 2 = 29 ... 1
29 ÷ 2 = 14 ... 1
14 ÷ 2 =  7 ... 0
 7 ÷ 2 =  3 ... 1
 3 ÷ 2 =  1 ... 1
 1 ÷ 2 =  0 ... 1

Fractional part:

0.25 × 2 = 0.5 ... 0
0.50 × 2 = 1.0 ... 1

$59.25_{(10)}=111011.01_{(2)}$
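The two procedures above can be sketched in Python as follows. This is a minimal illustration, not a library routine: the function names `int_to_binary` and `frac_to_binary` and the `max_digits` cutoff are assumptions introduced here, and `fractions.Fraction` is used so that the input itself is held exactly.

```python
from fractions import Fraction

def int_to_binary(n: int) -> str:
    """Convert a non-negative integer to a binary digit string."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)      # quotient and remainder of division by 2
        remainders.append(str(r))
    # The first remainder is the least significant bit, so read in reverse.
    return "".join(reversed(remainders))

def frac_to_binary(x: Fraction, max_digits: int = 20) -> str:
    """Convert a fraction in [0, 1) to the binary digits after the radix point.

    Stops when the fractional part reaches 0 or after max_digits digits.
    """
    digits = []
    while x != 0 and len(digits) < max_digits:
        x *= 2
        bit = int(x)             # the part before the radix point (0 or 1)
        digits.append(str(bit))
        x -= bit                 # keep only the part after the radix point
    return "".join(digits)

print(int_to_binary(59) + "." + frac_to_binary(Fraction("0.25")))  # 111011.01
print("0." + frac_to_binary(Fraction("0.2"), max_digits=12))       # 0.001100110011
```

The second call shows why a cutoff is needed: for 0.2 the fractional part cycles and never reaches 0, so the digits repeat indefinitely.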

The speed of floating-point operations, commonly measured in terms of FLOPS, is an important characteristic of a computer system, especially for applications that involve intensive mathematical calculations.

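FLOPS, short for floating-point operations per second (also called peak speed per second), is the number of floating-point operations executed each second. It is used to estimate computer performance, especially in scientific computing that relies heavily on floating-point arithmetic. The trailing S in FLOPS stands for "second" and is not a plural, so it cannot be dropped.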

Single:

https://en.wikipedia.org/wiki/Single-precision_floating-point_format
Example: the bit pattern 0 01111100 01000000000000000000000 (sign, exponent, fraction) decodes as
${\text{sign}}=+1$
${\text{exponent}}=(-127)+124=-3$
${\text{significand}}=1+2^{-2}=1.25$
${\text{value}}=(+1)\times 1.25\times 2^{-3}=+0.15625$

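The first bit holds the sign, the next 8 bits hold the exponent, and the last 23 bits store the significand (which has 24 bits of precision).
$Value = (-1)^{b_{31}}\times 2^{(b_{30}b_{29}\dots b_{23})_{2}-127}\times (1.b_{22}b_{21}\dots b_{0})_{2}$
i.e. $Value = (-1)^{sign} \times (1.Mantissa)_{2} \times 2^{exponent-127}$
Sign bit (Sign): the first bit; 0 means positive and 1 means negative.
Exponent bits (Exponent): the middle eight bits can encode $2^8=256$ values. An exponent could be stored in two's complement, or as an unsigned value from 0 to 255 with a bias of 127, so that 0 to 126 stand for -127 to -1, 127 stands for zero, and 128 to 255 stand for 1 to 128. IEEE 754 uses the biased form and reserves the stored values 0 and 255 for special cases (zero and subnormals, infinities and NaN), so normal numbers have exponents from -126 to 127.
Significand / Fraction / Mantissa bits: the leftmost 1 is not stored because it is always present (the first significant digit of a normalized binary number must be 1, so this implicit leading 1 is left out). In other words, the significand has 24 bits, of which 23 are actually stored.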
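To make the field layout concrete, the following sketch pulls the three fields out of a single-precision encoding with Python's standard `struct` module and rebuilds the value from them. The helper names `decode_single` and `value_from_fields` are hypothetical, and the reconstruction only covers normal numbers (stored exponent 1 to 254).

```python
import struct

def decode_single(x: float):
    """Split the IEEE 754 single-precision encoding of x into its fields."""
    # Pack as big-endian binary32, then reinterpret the 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # b31
    exponent = (bits >> 23) & 0xFF    # b30..b23, stored with bias 127
    fraction = bits & 0x7FFFFF        # b22..b0, the 23 stored bits
    return sign, exponent, fraction

def value_from_fields(sign: int, exponent: int, fraction: int) -> float:
    """Rebuild the value of a normal number (stored exponent 1..254)."""
    significand = 1 + fraction / 2**23    # restore the hidden leading 1
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)

sign, exponent, fraction = decode_single(0.15625)
print(sign, exponent, format(fraction, "023b"))  # 0 124 01000000000000000000000
print(value_from_fields(sign, exponent, fraction))  # 0.15625
```

For 0.15625 the stored exponent is 124 and only the second fraction bit is set, which matches $1.25\times 2^{-3}$.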

Representable range
Largest positive number:
https://www.coursera.org/learn/jisuanji-xitong/lecture/TBTLG/w2-4-1-fu-dian-shu-de-biao-shi-fan-wei
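As a rough check that does not rely on the lecture, the largest finite single-precision value can be built directly from the bit layout described earlier. The sketch assumes the IEEE 754 convention that the stored exponent 255 is reserved, so the largest usable exponent field is 254 with every fraction bit set; the variable names are illustrative only.

```python
import struct

# Largest finite single: sign 0, exponent field 254 (255 is reserved),
# and all 23 fraction bits set to 1.
bits = (0 << 31) | (254 << 23) | 0x7FFFFF
(largest,) = struct.unpack(">f", struct.pack(">I", bits))

# The same value from the formula (1.11...1)_2 * 2^(254 - 127).
by_formula = (2 - 2.0**-23) * 2.0**127

print(hex(bits), largest, largest == by_formula)
```

The result is on the order of $3.4\times 10^{38}$.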

Double:

https://en.wikipedia.org/wiki/Double-precision_floating-point_format

$Value = (-1)^{sign} \times (1.Mantissa)_{2} \times 2^{exponent-1023}$

By the IEEE 754 Standard,
https://en.wikipedia.org/wiki/IEEE_754

Type	Sign bits	Exponent bits	Significand bits (stored)	Approx. decimal digits
Half precision	1	5	10	~3.3
Single precision	1	8	23	~7.2
Double precision	1	11	52	~16.0

Note:
$\log_{10}2^{10+1} = 3.31, \log_{10}2^{23+1} = 7.22, \log_{10}2^{52+1} = 15.95$
These are the precisions converted to decimal digits; the extra 1 is the hidden bit that comes from normalization.
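The decimal-digit figures can be reproduced with a few lines of Python. This is only a sketch: the `precisions` dictionary is an assumption that restates the stored significand widths plus the hidden bit, and `sys.float_info.mant_dig` reports the precision of Python's own `float`, which is a double on typical platforms.

```python
import math
import sys

# Significand precision p (stored bits + 1 hidden bit) for each format.
precisions = {"half": 10 + 1, "single": 23 + 1, "double": 52 + 1}

for name, p in precisions.items():
    digits = p * math.log10(2)        # equals log10(2**p)
    print(f"{name:6s} p={p:2d}  ~{digits:.2f} decimal digits")

# Precision of the platform's float type, in bits.
print(sys.float_info.mant_dig)
```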

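Floating-point Arithmetic

Addition and Subtraction
The exponents are aligned to the larger one, so the operand with the smaller magnitude may lose significant digits (absorption).
e.g. with 7 significant decimal digits, $1.234567\times 10^5 + 1.017654\times 10^2 = 1.235585\times 10^5$ after rounding; the last three digits (654) of the smaller operand are lost.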
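The same loss of digits can be observed in binary double precision. The sketch below assumes Python's `float` is an IEEE 754 double with the default round-to-nearest mode; the variable names are illustrative. Near $10^{16}$ the spacing between adjacent doubles is 2, so adding 1 leaves the larger operand unchanged.

```python
big = 1.0e16      # exactly representable; adjacent doubles here are 2 apart
small = 1.0

# Aligning exponents pushes `small` below the last kept bit, so it is absorbed.
print(big + small == big)

# Order matters: adding the small terms one at a time to `big` loses all of them,
# while summing the small terms first lets their total survive.
total_big_first = big
for _ in range(10):
    total_big_first += small
total_small_first = sum([small] * 10) + big

print(total_big_first - big, total_small_first - big)   # 0.0 10.0
```

Summing terms in increasing order of magnitude is a common way to reduce this effect.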

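Multiplication and Division (safe)
The significands are multiplied and the exponents are added (for division, divided and subtracted); the result is then rounded and normalized.
Neither cancellation nor absorption occurs.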

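Cancellation
Cancellation typically occurs when two nearly equal numbers are subtracted, and it is the main source of lost significant digits. The digits that remain have low accuracy because they carry the rounding errors of the operands.

Catastrophic Cancellation
Occurs when at least one of the operands has already been approximated (rounded).
Benign Cancellation
Occurs when the operands are exact. Carrying one extra digit of precision as protection (a guard digit) keeps the relative error of the subtraction within 2 machine epsilon.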
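A small experiment illustrates catastrophic cancellation. The example function $(1-\cos x)/x^2$ is chosen here for illustration and is not taken from the original notes; its true value for small $x$ is close to 0.5. The names `naive` and `stable` are hypothetical.

```python
import math

def naive(x: float) -> float:
    # cos(x) is rounded to a value extremely close to 1 and then subtracted
    # from 1: the leading digits cancel and mostly rounding error is left.
    return (1.0 - math.cos(x)) / x**2

def stable(x: float) -> float:
    # Same quantity via the identity 1 - cos(x) = 2 sin^2(x/2),
    # which avoids subtracting nearly equal numbers.
    return 2.0 * math.sin(x / 2.0) ** 2 / x**2

x = 1.0e-8
print(naive(x), stable(x))   # the exact value is close to 0.5
```

Rewriting the expression so that the subtraction disappears is the usual remedy; the rounded operand `cos(x)` is what makes the cancellation catastrophic rather than benign.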

Machine epsilon (macheps)
Machine epsilon gives an upper bound on the relative error caused by rounding in floating-point arithmetic.
See also:
https://en.wikipedia.org/wiki/Guard_digit
https://en.wikipedia.org/wiki/Machine_epsilon
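Machine epsilon can be estimated experimentally by halving a candidate until adding half of it to 1 no longer changes the result. The sketch below assumes IEEE 754 doubles with round-to-nearest; the function name `machine_epsilon` is hypothetical. Note that two conventions exist: this loop finds the gap between 1 and the next larger double, while some texts define machine epsilon as half of that gap (the unit roundoff).

```python
import sys

def machine_epsilon() -> float:
    """Halve eps until 1 + eps/2 can no longer be distinguished from 1."""
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps

eps = machine_epsilon()
print(eps, eps == sys.float_info.epsilon, eps == 2.0**-52)
```

On such a platform the loop stops at $2^{-52}$, the value Python exposes as `sys.float_info.epsilon`.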

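Accuracy Problem

The precision of an approximate real number is the number of decimal digits that are treated as significant in a computation.
The accuracy of an approximate real number is the number of decimal digits to the right of the decimal point.
Reference: Numerical Analysis

Cancellation is the main source of inaccuracy.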