float VS double

skypeGNU

Article link: https://blog.csdn.net/skypeGNU/article/details/9260449

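Floating-point types: float and double

float is the single-precision floating-point type; double is the double-precision floating-point type.

Storage layout

Floating-point values are stored in a binary form of scientific notation.

Scientific notation is a way of writing very large or very small numbers. A number is written as the product of a real number that is at least 1 and less than 10 (the mantissa) and a power of 10. To keep the representation unique, the mantissa never reaches 10: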

782300=7.823×10^5
0.00012=1.2×10^-4
10000=1×10^4

Computers and calculators usually write the power of 10 with E or e (for "exponent"):

7.823E5=782300
1.2e-4=0.00012

float and double follow IEC 60559:1989 (IEEE 754), the standard for binary floating-point arithmetic.

Type	Sign	Exponent	Mantissa
float	1 bit	8 bit	23 bit
double	1 bit	11 bit	52 bit

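Sign:

0 means positive, 1 means negative.

Exponent:

The exponent field is stored with a bias (exponent bias): the encoded value is the actual exponent plus a fixed constant. IEEE 754 sets that constant to 2^(e-1) - 1, where e is the number of bits in the exponent field. For float the exponent field has 8 bits, so the bias is 2^(8-1) - 1 = 128 - 1 = 127. For normalized single-precision numbers the actual exponent ranges from -126 to +127; the encoded values 0 and 255 are reserved for zero and subnormal numbers and for infinity and NaN. For example, an actual exponent of 17 is encoded in single precision as 144, because 144 = 17 + 127.

Mantissa:

The mantissa is stored as a binary fraction of the form 1.XXX..., whose value is at least 1 and less than 2. Normal numbers are always stored in this normalized form: the mantissa is shifted until its leading bit is 1. Because that bit is always 1, it is not stored; only XXX... is.

For example, the decimal number 120.5 is 1111000.1 in binary, or (1.1110001)*(2^6) in scientific notation. The encoded exponent is 6+127 = 133 and the stored mantissa is 1110001. The mantissa bits are written in directly; unused bits are filled with 0, and if there are too many bits the value is rounded (by default to the nearest representable value). So the decimal number 120.5 is stored as a float as follows (in binary):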

Sign	Exponent	Mantissa
0	1000 0101	111 0001 0000 0000 0000 0000
0	6+127=133	1110001 (padded with zeros)

Online demo

IEEE-754 Analysis

Value ranges

In Microsoft Visual C++ the ranges are as follows:

Type Name	Bytes	Range of Values
__int32	4	–2,147,483,648 to 2,147,483,647
unsigned __int32	4	0 to 4,294,967,295
__int64	8	–9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
unsigned __int64	8	0 to 18,446,744,073,709,551,615
float	4	3.4E +/- 38 (7 digits)
double	8	1.7E +/- 308 (15 decimal digits)

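Comparing floating-point numbers for equality

Decimal floating-point values usually have no exact binary representation. This is a side effect of how the CPU represents floating-point data. As a result, you may see some loss of precision, and some floating-point operations may produce unexpected results. The cause is one of the following:

The binary representation of the decimal number may not be exact.

There is a type mismatch between the numbers used (for example, mixing float and double).

To work around this, most programmers either make sure the value is slightly larger or smaller than needed, or obtain and use a binary-coded decimal (BCD) library that preserves precision. The binary representation of floating-point values affects the precision and accuracy of floating-point calculations.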

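Comparing within a tolerance

A recommended approach is to define an acceptable difference between two values (for example .000001) instead of testing them for equality. If the absolute difference between the two values is less than or equal to that tolerance, the difference is probably due to precision, and the two values can be treated as equal. The following example uses this approach to compare .3333333 and 1/3.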

#include <math.h>
#include <stdio.h>

#define EPSILON         .000001   // Define your own tolerance
#define FLOAT_EQ(x,v)   ((((v) - EPSILON) < (x)) && ((x) < ((v) + EPSILON)))

int main(void)
{
    double double1 = .3333333;
    double double2 = (double) 1/3;
    // Comparison using the macro
    if (FLOAT_EQ(double1, double2))
        printf("double1 and double2 are equal.\n");
    else
        printf("double1 and double2 are unequal.\n");
    // The same comparison using fabs()
    if (fabs(double1 - double2) < EPSILON)
        printf("double1 and double2 are equal.\n");
    else
        printf("double1 and double2 are unequal.\n");
    return 0;
}

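For EPSILON you can use the constants FLT_EPSILON (defined for float as 1.192092896e-07F) or DBL_EPSILON (defined for double as 2.2204460492503131e-016) from <float.h>. Each is the gap between 1 and the next representable value, so it suits values close to 1; for much larger or smaller values the tolerance has to be scaled.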

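Comparing a float with a double using ==, >, < and so on does not cause a compile error, but it can easily give a surprising result. The float operand is converted to double before the comparison, but the value it holds was already rounded to single precision, so it generally differs from the double that was rounded from the same decimal number. A more reliable approach is to convert both to the same type and compare their difference: if it is smaller than a chosen small value, treat them as equal.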

Difference between float and double

Huge difference.

Floating point numbers in C use IEEE 754 encoding on virtually all current platforms.

This type of encoding uses a sign, a significand, and an exponent.

Because of this encoding, many values cannot be represented exactly, so you can never guarantee that a stored value is exactly the one you wrote.

Also, the number of significant digits can change slightly since it is a binary representation, not a decimal one.

Single precision (float) gives you 23 bits of significand, 8 bits of exponent, and 1 sign bit.

Double precision (double) gives you 52 bits of significand, 11 bits of exponent, and 1 sign bit.

As the name implies, a double has roughly 2x the precision of float [1]: 53 significant bits versus 24, counting the implicit leading bit. In general a double has 15 to 16 decimal digits of precision, while float has only about 7.

With this loss of precision, rounding errors accumulate and surface much more easily, e.g.

float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++i)
    b += a;
printf("%.7g\n", b);   // prints 9.000023

while

double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++i)
    b += a;
printf("%.15g\n", b);   // prints 8.99999999999996

Also, the maximum value of float is only about 3.4e38, while that of double is about 1.7e308, so float hits Infinity much more easily than double in something as simple as computing 60!.

Maybe their test cases contain such huge numbers, which causes your program to fail.

Of course sometimes even double isn't accurate enough, hence we have long double[1] (the above example gives 9.000000000000000066 on Mac), but all these floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.

BTW, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.

When do you use float and when do you use double

Frequently in my programming experience I need to decide whether I should use float or double for my real numbers. Sometimes I go for float, sometimes I go for double, but really this feels rather subjective. If I were asked to defend my decision, I would probably not be able to give sound reasons.

When do you use float and when do you use double? Do you always use double, and go for float only when memory constraints are present? Or do you always use float unless the precision requirement forces you to use double? Are there substantial differences in the computational cost of basic arithmetic between float and double? What are the pros and cons of using float or double? And have you ever used long double?

The default choice for a floating-point type should be double. This is also the type that you get with floating-point literals without a suffix or (in C) standard functions that operate on floating point numbers (e.g. exp, sin, etc.).

float should only be used if you need to operate on a lot of floating-point numbers (think on the order of thousands or more) and analysis of the algorithm has shown that the reduced range and accuracy don't pose a problem.

long double can be used if you need more range or accuracy than double, and if it provides this on your target platform.

In summary, float and long double should be reserved for use by the specialists, with double for "every-day" use.

There is rarely cause to use float instead of double in code targeting modern computers. The extra precision reduces (but does not eliminate) the chance of rounding errors or other imprecision causing problems.

The main reasons I can think of to use float are:

1. You are storing large arrays of numbers and need to reduce your program's memory consumption.

2. You are targeting a system that doesn't natively support double-precision floating point. Until recently, many graphics cards only supported single precision floating points. I'm sure there are plenty of low-power and embedded processors that have limited floating point support too.

3. You are targeting hardware where single-precision is faster than double-precision, and your application makes heavy use of floating point arithmetic. On modern Intel CPUs I believe scalar floating point calculations run about as fast in double precision as in single precision, so you don't gain much here.

4. You are doing low-level optimization, for example using special CPU instructions that operate on multiple numbers at a time.

So, basically, double is the way to go unless you have hardware limitations or unless analysis has shown that storing double precision numbers is contributing significantly to memory usage.

Use double for all your calculations and temp variables. Use float when you need to maintain an array of numbers - float[] (if precision is sufficient), and you are dealing with over tens of thousands of float numbers.

Many/most math functions or operators convert/return double, and you don't want to cast the numbers back to float for any intermediate steps.

E.g. If you have an input of 100,000 numbers from a file or a stream and need to sort them, put the numbers in a float[].

http://www.cppexample.com/standard/float.html
http://en.wikipedia.org/wiki/Single_precision_floating-point_format
http://en.wikipedia.org/wiki/Double_precision_floating-point_format
