Floating-Point overflow and underflow

howlowl

（一）What happens on overflow and underflow?

Floating-Point Overflow and Underflow
Suppose the biggest possible float value on your system is about 3.4E38 and you do this:

float toobig = 3.4E38 * 100.0f;
printf("%e\n", toobig);

What happens? This is an example of overflow, which occurs when a calculation leads to a number
too large to be expressed. The behavior for this case used to be undefined, but now C specifies
(for implementations that follow IEEE 754 floating-point arithmetic, which covers most current
systems) that toobig gets assigned a special value that stands for infinity and that printf()
displays either inf or infinity (or some variation on that theme) for the value.
What about dividing very small numbers? Here the situation is more involved. Recall that a
float number is stored as an exponent and as a value part, or mantissa. There will be a number
that has the smallest possible exponent and also the smallest value that still uses all the bits
available to represent the mantissa. This will be the smallest number that still is represented to
the full precision available to a float value. Now divide it by 2. Normally, this reduces the
exponent, but the exponent already is as small as it can get. So, instead, the computer moves
the bits in the mantissa over, vacating the first position and losing the last binary digit. An
analogy would be taking a base 10 value with four significant digits, such as 0.1234E-10,
dividing by 10, and getting 0.0123E-10. You get an answer, but you've lost a digit in the
process. This situation is called underflow, and C refers to floating-point values that have lost
the full precision of the type as subnormal. So dividing the smallest positive normal floating-point
value by 2 results in a subnormal value. If you divide by a large enough value, you lose all
the digits and are left with 0. Since C99, <math.h> provides classification macros such as
fpclassify() and isnormal() that let you check whether your computations are producing
subnormal values.
There's another special floating-point value that can show up: NaN, or not-a-number. For
example, you give the asin() function a value, and it returns the angle that has that value as
its sine. But the magnitude of a sine can't be greater than 1, so the function is undefined for
arguments outside the range -1 to 1. In such cases, the function returns the NaN value, which
printf() displays as nan, NaN, or something similar.

（二）How to detect it

https://stackoverflow.com/questions/15655070/how-to-detect-double-precision-floating-point-overflow-and-underflow

To be perfectly portable, you have to check before the operation, e.g. (for addition):

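#include <cmath>   // std::abs
#include <limits>  // std::numeric_limits

// a and b are the two double operands of the addition.
if ( (a < 0.0) == (b < 0.0)
    && std::abs( b ) > std::numeric_limits<double>::max() - std::abs( a ) ) {
    //  Same sign and |a| + |b| exceeds the largest finite double: a + b would overflow.
}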

Similar logic can be used for the four basic operators.

If all of the machines you target support IEEE (which is probably the case if you don't have to consider mainframes), you can just do the operations, then use isfinite or isinf on the results.

For underflow, the first question is whether a gradual underflow counts as underflow or not. If not, then simply checking if the results are zero and a != -b would do the trick. If you want to detect gradual underflow (which is probably only present if you have IEEE), then you can use isnormal—this will return false if the results correspond to gradual underflow. (Unlike overflow, you test for underflow after the operation.)

（三）C++: std::numeric_limits

https://hk.saowen.com/a/d229a08eae12ccedc4ce2dd540eaec49920c861c11679d6bb88a4276318fdf3f

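The standard <limits> header provides these values:

#include <iostream>
#include <limits>
using namespace std;

int main(void) {
    // The parentheses around (numeric_limits<T>::max) keep the name from being
    // expanded by a function-like max/min macro (for example from <windows.h>).
    cout << "int " << "    size in bytes: " << sizeof(int);
    cout << "    max: " << (numeric_limits<int>::max)();
    cout << "    min: " << (numeric_limits<int>::min)() << endl;

    // For floating-point types, min() is the smallest positive normal value,
    // not the most negative one (that is numeric_limits<double>::lowest()).
    cout << "double " << "    size in bytes: " << sizeof(double);
    cout << "    max: " << (numeric_limits<double>::max)();
    cout << "    smallest positive normal: " << (numeric_limits<double>::min)() << endl;

    return 0;
}

Output (on a platform with a 4-byte int and an IEEE 754 double):

int     size in bytes: 4    max: 2147483647    min: -2147483648
double     size in bytes: 8    max: 1.79769e+308    smallest positive normal: 2.22507e-308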

（四）C

integer limits: limits.h

floating point limits: float.h

Inside float.h

Macros                        Required value      Stands for   Description
FLT_MAX, DBL_MAX, LDBL_MAX    1E+37 or greater    MAXimum      Maximum finite representable floating-point number.
FLT_MIN, DBL_MIN, LDBL_MIN    1E-37 or smaller    MINimum      Minimum normalized positive floating-point number.