The golden rule for floating-point number comparison

Article link: https://blog.csdn.net/prototype/article/details/1753670

How to compare floating-point numbers is an ooold, yet frequently asked, question. I am not going to repeat the old answers here: as you know, there are tons of good links on the web that explain well how floating-point numbers work in computers and how comparisons should in principle be carried out. Instead, I am going to give you straight away a few pieces of code that work universally for IEEE 754 floating-point numbers, regardless of their magnitudes and of the rigor of the comparison. They are based on the golden rule that I am about to show you. At the end, I will discuss performance optimization for a few special cases.

OK, here we go. But... wait a second, let me first be super clear about a few things:
1. I am talking about IEEE 754 floating-point numbers. This almost goes without saying, because the standard is adopted by most, if not all, modern computers. But just to be super clear, I am assuming your computer follows it too.
2. I assume that you, the reader, know some basics about floating-point numbers, such as the fact that they are not continuously represented in computers: only a finite set of values is representable.
3. I assume that you, the reader, can read C++. This is important since all code and examples here are given in C++.
If any of these three assumptions does not hold in your case, then you should not read this article.

Congrats, you passed the "configure" stage. :-) We now set off.

// Compares two floating-point numbers for equality.
inline bool is_equal( Real x, Real y )
{
   Real x1  = abs_( x );
   Real y1  = abs_( y );
   Real z1  = (x1 > y1)  ? x1 : y1;                           // larger magnitude
   Real eps = (z1 > 1.0) ? z1 * REAL_EPSILON : REAL_EPSILON;  // range of "sameness"

   return abs_( x - y ) <= eps;
}

This piece of code reveals the golden rule; everything else here is derived from it. In this code, Real is a floating-point type, be it float, double or even long double; abs_ is a function that returns the absolute value of a given number, be it a standard function or your own; REAL_EPSILON is an adjustable parameter representing the rigor of the comparison, which I will elaborate on later.

What this code does is first calculate the proper range within which the two real numbers are considered the same. This is the real thing: all lines before the return statement are dedicated to this task. The range is calculated automatically from the preset rigor as represented by, again, REAL_EPSILON. Once the range is obtained, the rest of the comparison becomes trivial: just calculate the difference of the two numbers and compare its absolute value with the range; if it does not exceed the range, the two numbers are considered the same; otherwise, different. You may ask why the range is calculated this way. In one sentence: the gap between neighboring floating-point values grows with their magnitude, so the range is scaled by the larger of the two magnitudes once that exceeds 1, and stays at REAL_EPSILON otherwise. To understand that fully, you need the basics of floating-point numbers, which I have to skip, as I said.
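To make the rule easy to try out, here is a minimal self-contained sketch of the same function. It assumes Real is double, abs_ simply wraps std::fabs, and REAL_EPSILON is 7 * DBL_EPSILON; all three are choices made for illustration only, not something the rule itself requires.

```cpp
#include <cfloat>
#include <cmath>

typedef double Real;                        // assumption: Real is double

const Real REAL_EPSILON = 7 * DBL_EPSILON;  // assumption: illustrative rigor

// assumption: abs_ is a thin wrapper around the standard function
inline Real abs_( Real x ) { return std::fabs( x ); }

inline bool is_equal( Real x, Real y )
{
   Real x1  = abs_( x );
   Real y1  = abs_( y );
   Real z1  = (x1 > y1)  ? x1 : y1;                           // larger magnitude
   Real eps = (z1 > 1.0) ? z1 * REAL_EPSILON : REAL_EPSILON;  // allowed range

   return abs_( x - y ) <= eps;
}
```

With these settings, is_equal( 0.1 + 0.2, 0.3 ) holds even though the direct comparison 0.1 + 0.2 == 0.3 does not, and is_equal( 1e10, 1e10 + 1e-5 ) holds because the range grows with the magnitude, while is_equal( 1.0, 1.001 ) is rejected.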

What really needs explaining here is the value of the rigor. What value should it take? You might immediately think of the *_EPSILON macros (FLT_EPSILON, DBL_EPSILON, LDBL_EPSILON) defined in the <cfloat> header. That is, however, not a very good choice. The reason is that, for a given value x, probably only 2 or 3 representable numbers are considered equal to x at this rigor, which is too rigorous and not much different from a direct comparison. I recommend a rigor of 5-10 times *_EPSILON for general applications, but your mileage may vary.
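The effect of the rigor can be checked by counting how many representable values next to x still compare equal. The sketch below assumes double and uses std::nextafter (C++11) to walk upwards from x; count_equal_neighbors_above is a hypothetical helper introduced only for this experiment, and is_equal here takes the rigor as an explicit third argument.

```cpp
#include <cfloat>
#include <cmath>

// The golden-rule comparison, with the rigor passed in explicitly.
inline bool is_equal( double x, double y, double real_epsilon )
{
   double x1  = std::fabs( x );
   double y1  = std::fabs( y );
   double z1  = (x1 > y1)  ? x1 : y1;
   double eps = (z1 > 1.0) ? z1 * real_epsilon : real_epsilon;

   return std::fabs( x - y ) <= eps;
}

// Hypothetical helper: counts the representable doubles just above x
// that still compare equal to x at a rigor of gauge * DBL_EPSILON.
// The search is capped at 1000 neighbors.
inline int count_equal_neighbors_above( double x, int gauge )
{
   const double real_epsilon = gauge * DBL_EPSILON;
   int    count = 0;
   double y     = std::nextafter( x, HUGE_VAL );   // next double above x

   while ( count < 1000 && is_equal( x, y, real_epsilon ) )
   {
      ++count;
      y = std::nextafter( y, HUGE_VAL );
   }
   return count;
}
```

For x = 1.0, a gauge of 1 accepts a single neighbor above x, whereas a gauge of 7 accepts seven, which is the kind of breathing room the recommendation is after.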

// Compares equality with adjustable rigor.

// A slight meta-trick to figure out REAL_EPSILON for different Reals.
// Yes, the meta stuff is rather verbose if you don't have a meta library.
template < typename T >
struct what_real;

template < > struct what_real < float >       { static const int value = 0; };
template < > struct what_real < double >      { static const int value = 1; };
template < > struct what_real < long double > { static const int value = 2; };
template < > struct what_real < my_real >     { static const int value = 3; };

extern Real REAL_EPSILON; // Its value has to be set outside the header, in a .cc file.

namespace {
const int GAUGE = 7;
}

inline bool is_equal( Real x, Real y, Real real_epsilon = GAUGE * REAL_EPSILON )
{
   ...
   Real eps = (z1 > 1.0) ? z1 * real_epsilon : real_epsilon;

   return abs_( x - y ) <= eps;
}

The meta stuff is just added for fun. The main point is that you can now call the function like is_equal( x, y, 1E-6 ), where 1E-6 specifies the rigor right at the call site. More convenient, huh!?
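If you would rather not maintain REAL_EPSILON by hand, one alternative (not the meta-trick above, but a swapped-in technique) is to let std::numeric_limits supply the machine epsilon for each type. The following is a sketch under that assumption; the template parameter T and the use of std::abs are choices made here for illustration.

```cpp
#include <cmath>
#include <limits>

const int GAUGE = 7;

// Sketch: the default rigor is GAUGE times the machine epsilon of T,
// looked up through std::numeric_limits instead of a hand-set constant.
template < typename T >
inline bool is_equal( T x, T y,
                      T real_epsilon = GAUGE * std::numeric_limits<T>::epsilon() )
{
   T x1  = std::abs( x );
   T y1  = std::abs( y );
   T z1  = (x1 > y1)   ? x1 : y1;
   T eps = (z1 > T(1)) ? z1 * real_epsilon : real_epsilon;

   return std::abs( x - y ) <= eps;
}
```

With this version, is_equal( 1.0, 1.0 + 1e-7 ) is false at the default rigor for double, while is_equal( 1.0, 1.0 + 1e-7, 1e-6 ) is true. Note that all arguments must have the same type for T to be deduced.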

With the above code understood, we can easily define greater-than and less-than functions (so they are omitted here).
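For readers who want to see them anyway, here is one possible way to write the two functions. It is only a sketch of one reading: x counts as greater than y when it is larger and the two are not equal under the golden rule. The names is_greater and is_less, and the choice of double with DBL_EPSILON, are assumptions made for this example.

```cpp
#include <cfloat>
#include <cmath>

typedef double Real;                    // assumption: Real is double
const int  GAUGE        = 7;
const Real REAL_EPSILON = DBL_EPSILON;  // assumption: machine epsilon of double

inline Real abs_( Real x ) { return std::fabs( x ); }

inline bool is_equal( Real x, Real y, Real real_epsilon = GAUGE * REAL_EPSILON )
{
   Real x1  = abs_( x );
   Real y1  = abs_( y );
   Real z1  = (x1 > y1)  ? x1 : y1;
   Real eps = (z1 > 1.0) ? z1 * real_epsilon : real_epsilon;

   return abs_( x - y ) <= eps;
}

// x is greater than y only if it is larger AND the two are not "equal".
inline bool is_greater( Real x, Real y, Real real_epsilon = GAUGE * REAL_EPSILON )
{
   return x > y && !is_equal( x, y, real_epsilon );
}

// x is less than y only if it is smaller AND the two are not "equal".
inline bool is_less( Real x, Real y, Real real_epsilon = GAUGE * REAL_EPSILON )
{
   return x < y && !is_equal( x, y, real_epsilon );
}
```

Defined this way, exactly one of is_less, is_equal and is_greater holds for any pair of finite numbers, so 0.1 + 0.2 is neither greater nor less than 0.3.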

Now let's deal with performance in some special cases. A common case is comparing a floating-point number against zero. It is better to dedicate a function to this case. Substituting zero for the argument y, we reduce the code to the following:

inline bool is_zero( Real x, Real real_epsilon = GAUGE * REAL_EPSILON )
{
   return abs_( x ) <= real_epsilon;
}
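The reduction is valid as long as the rigor is smaller than 1: for a magnitude of x above 1, neither the general test nor the shortcut can succeed. The sketch below puts both functions side by side so that this can be checked; it assumes Real is double and REAL_EPSILON is DBL_EPSILON, and shortcut_agrees is a hypothetical helper added only for the check.

```cpp
#include <cfloat>
#include <cmath>

typedef double Real;                    // assumption: Real is double
const int  GAUGE        = 7;
const Real REAL_EPSILON = DBL_EPSILON;  // assumption: machine epsilon of double

inline Real abs_( Real x ) { return std::fabs( x ); }

inline bool is_equal( Real x, Real y, Real real_epsilon = GAUGE * REAL_EPSILON )
{
   Real x1  = abs_( x );
   Real y1  = abs_( y );
   Real z1  = (x1 > y1)  ? x1 : y1;
   Real eps = (z1 > 1.0) ? z1 * real_epsilon : real_epsilon;

   return abs_( x - y ) <= eps;
}

inline bool is_zero( Real x, Real real_epsilon = GAUGE * REAL_EPSILON )
{
   return abs_( x ) <= real_epsilon;
}

// Hypothetical check: does the shortcut agree with the general function?
inline bool shortcut_agrees( Real x )
{
   return is_zero( x ) == is_equal( x, 0.0 );
}
```

For instance, is_zero( 0.1 + 0.2 - 0.3 ) is true while is_zero( 1e-3 ) is false, and in both cases the answer matches is_equal( x, 0.0 ).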

That's simple and good. How about 1, 2, 8, or an arbitrary integer? We cannot reasonably write a separate function for each of these values, so what can we do? Templates are the answer. We can write code like the following:

namespace {

// Compile-time selection: value is A if C is true, B otherwise.
template <bool C, int A, int B>
struct select_value
{
   static const int value = A;
};

template <int A, int B>
struct select_value<false, A, B>
{
   static const int value = B;
};

// Compile-time absolute value of the integer X.
template <int X>
struct meta_abs
{
   static const int value = select_value<(X>0), X, -X>::value;
};

}

template < int Y >
inline bool is_equal_to( Real x, Real real_epsilon = GAUGE * REAL_EPSILON )
{
   // The parentheses matter: without them, <= and > would be evaluated left to right.
   return abs_( x - Y ) <= ( (meta_abs<Y>::value > 1) ? real_epsilon * meta_abs<Y>::value : real_epsilon );
}

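Here, in the anonymous namespace, we define a meta-function that calculates the absolute value of a given constant integer Y. This moves part of the calculation to compile time, optimizing the overall performance of the comparison. Note that the range is scaled by the magnitude of Y alone, not by the larger of the two magnitudes as in is_equal; whenever x is close enough to Y to pass the test, the two magnitudes are nearly the same anyway.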
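Since this listing is the one most likely to trip up a compiler, here is a self-contained sketch that can be built as is. It assumes Real is double, REAL_EPSILON is DBL_EPSILON and abs_ wraps std::fabs, and it parenthesizes the conditional expression so that the range is computed before the comparison with <=.

```cpp
#include <cfloat>
#include <cmath>

typedef double Real;                    // assumption: Real is double
const int  GAUGE        = 7;
const Real REAL_EPSILON = DBL_EPSILON;  // assumption: machine epsilon of double

inline Real abs_( Real x ) { return std::fabs( x ); }

namespace {

// Compile-time selection: value is A if C is true, B otherwise.
template <bool C, int A, int B>
struct select_value { static const int value = A; };

template <int A, int B>
struct select_value<false, A, B> { static const int value = B; };

// Compile-time absolute value of the integer X.
template <int X>
struct meta_abs
{
   static const int value = select_value<(X > 0), X, -X>::value;
};

}

// Compares x with the compile-time integer constant Y.
template < int Y >
inline bool is_equal_to( Real x, Real real_epsilon = GAUGE * REAL_EPSILON )
{
   const Real eps = (meta_abs<Y>::value > 1)
                  ? real_epsilon * meta_abs<Y>::value
                  : real_epsilon;

   return abs_( x - Y ) <= eps;
}
```

A call then reads is_equal_to<8>( x ): the constant goes into the template argument and only x is passed at run time. For example, is_equal_to<8>( 8.0 + 1e-15 ) is accepted and is_equal_to<8>( 8.001 ) is rejected.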

That is all I'd like to say about floating-point number comparison. The rest is up to you.

BTW, not all of the code here was tested with a compiler.