Is my floating point number equal to zero?

One often sees code like the following, embodying the feeling programmers have that “you should never compare a float to zero”.

double func(double a, double b) {
   double diff = a - b;
   if (fabs(diff) < EPSILON) 
     throw ...;
   else
     return 1/diff;
}

But what are you trying to avoid here? Is it the divide by zero? Then avoid that…

double func(double a, double b) {
   double diff = a - b;
   if (diff == 0) 
     throw ...;
   else
     return 1/diff;
}

“But”, you say, “what if diff is really small”? You mean like 1e-128? Then this function will return 1e+128. Floating point can deal with it. If you didn't need that dynamic range, you'ld be using float instead of double.

OK, what about 1e-309? Aha, different issue (denormal floats). Now you won't get a divide by zero, most likely you'll return Inf, and… your code might run very slowly. Aha! No-one wants slow code, right? OK, but then why the “if” in the first place? That's slowing down every call to the function. If your workloads mostly don't call the function with nearly equal very small numbers, then you can go a lot faster in the common case by not checking.

“Ah, but I want the additional error checking”. OK, fine. I'm with you. But is this the right place? Why did the caller call the routine with nearly equal very small numbers? Is that physically likely in whatever real-world problem you're trying to solve? If not, one should probably protect at the call site with an assert.

Or maybe it's a precondition of the function call that a not be equal to b, and it's a coding bug if not satisfied.

double func(double a, double b) {
   assert(a - b != 0);
 
   return 1/(a-b);
}

Don't forget to leave your asserts on in all builds unless the profiler tells you otherwise. But hang on, you're on a platform which will give you line number here when you get the divby0 right? So the assert is superfluous – just enable exceptions on divide by zero, underflow, and overflow.

And there's also a very simple case. If I stored 0.3f somewhere, I expect to get it back. It's irrelevant that 0.3f doesn't have an exact binary representation – whatever it translates to, I should get exactly that back

void my_unit_test() {
  my_matrix<2,2,float> a = {3.0f, 1.1f, 2.2f, 4.0f};
  TEST_ASSERT_EQ_APPROX(a(0,0), 3.0f, 1e-8f); // No, no, no.  There is no reason for the stored value to be different.
  TEST_ASSERT_EQ(a(0,0), 3.0f); // I gave you whatever 3.0f is.  I want it back.
}

You may worry about excess precision on x86, but there's no computation going on here, so we can and should ask for exact equality.

But my case is more subtle than that

double sinc(double x) {
   if (fabs(x) < EPSILON)
     return 1;
   else
     return sin(x)/x;
}

Even here, you can think about what a better value to compare to might be. First note that sin(x)/x works fine for all floats except zero, so you can avoid a dependency on fabs and a definition of epsilon by just comparing to zero.

But perhaps the profiler has told you that a big chunk of program time is spent in computing sin() and dividing, and furthermore you know that most times it's called with values near zero. Well then, let's do what we do with any special function that goes slow: chop up the domain and special case. It's true that sin(x)/x is exactly floating point 1.0 when x is small. In fact, compute the taylor series, see that

sin(x)/x = 1 - x^2/6 + …

and realize the threshold is sqrt(eps*6). In fact it can be a bit larger, which may help if those values “near” zero were in the range 3e-8 to 1e-6. Experiment a bit and find this code:

double sinc(double x) {
   if (fabs(x) < 1e-6)
     return 1;
   else
     return sin(x)/x;
}

It's a good idea then to write a blog post about the issue, because Toby Sharp might comment (see below) and point out that there is a much better choice:

“… since we need some kind of test for x=0 anyway, I don't see why we shouldn't take the opportunity to improve accuracy in this vicinity. And yes, it only costs 2 multiplies which is less than using the sin in those cases. So we are improving both performance and accuracy in the range (0, threshold). threshold is chosen to be the largest value such that the quadratic approximation is as good as the direct evaluation.”
float sinc(float x) { 
  // Toby's threshold, determined through binary search as the best value for minimizing absolute error
  if (std::abs(x) < 0.0406015441f)
      return 1.0f - (x * x) * (1.0f / 6.0f);
  else
      return std::sin(x) / x;
} 

And templates?

Yes, the bare constant above should tell you that for templating over floating types you need to care about values of epsilons. It's not just a matter of numeric_limits<T>, but needs to be worked out for each case.

Another reason to compare to zero when you can.

Also, compares to zero may be faster on your platform. And you don't have to define epsilon somewhere.

And more...

double sincminusone(double x) {
   if (fabs(x) < 1e-6)
     return 0; // But x^2/6 gives the right answer, not just truncating to the nearest zero.
   else
     return sin(x)/x-1; // Or try (sin(x)-x)/x and compare to zero instead...
}

See also * https://stackoverflow.com/questions/51134021/floating-point-equality

Discussion

Colin Davidson, 2015/03/11 08:11, 2015/03/12 19:38

Nice post, tricky subject. Every divide is a crash waiting to happen, or worse, a nan. Maybe people are concerned about denormalised numbers, whose reciprocals would overflow iirc?

I like to tell people that half the representable floating point numbers are between -1 and 1. So I think that units are pretty important. Should you really care about e-10 if your units are metres? Or radians? Where the units stop making sense is where you're going to find numerical problems.

Toby Sharp, 2015/03/11 10:16

The way I like to compute sinc: quadratic approximation close to zero. The threshold for when to take this approximation can be pre-determined by minimizing error against a more precisely computed value.

double sinc(double x) {

  // Determined through binary search as the best value for minimizing absolute error
  const double eps = 0.00028606786360068216; 
  if (std::abs(x) < eps)
  {
      // Use a quadratic approximation to the rational polynomial
      const double one_over_three_fact = 1.0 / 6.0;
      return 1.0 - (x * x) * one_over_three_fact;
  }
  return std::sin(x) / x;

}

Andrew Fitzgibbon, 2015/03/11 13:59

Toby, but for what range is 1-x^2/6 different from what sin(x)/x will produce? As far as I can see, it's only a speed question?

Toby Sharp, 2015/03/13 12:06

As an example, let's consider using floats with x = 0.00043f. (std::sin(x)/x) evaluates to 1, whereas the code below evaluates to 0.99999994. So using the direct method has an error of 6e-8. For doubles, the direct method has an error of 2.2e-16 compared to the Taylor series at x=0.000279. OK, these are small errors, but since we need some kind of test for x=0 anyway, I don't see why we shouldn't take the opportunity to improve accuracy in this vicinity. And yes, it only costs 2 multiplies which is less than using the sin in those cases. So we are improving both performance and accuracy in the range (0, threshold). threshold is chosen to be the largest value such that the quadratic approximation is as good as the direct evaluation.

float sinc(float x) {

  // Determined through binary search as the best value for minimizing absolute error
  const float eps = 0.0406015441f; 
  
  if (std::abs(x) < eps)
  {
      // Use a quadratic approximation to the rational polynomial
      const float one_over_three_fact = 1.0f / 6.0f;
      return 1.0f - (x * x) * one_over_three_fact;
  }
  return std::sin(x) / x;

}

Andrew Fitzgibbon, 2015/03/15 18:29

Thanks Toby, great comments. Quite interesting how large the threshold is for these cases: 0.04 is certainly not something I would have guessed, or even obtained from considerations around sqrt(eps).

Andrew Fitzgibbon, 2020/07/24 14:12

From Tom Minka:

1. If you are going to compare the result of any calculation to zero, make sure to cast to double first, like so:

if ((double)x == 0) return 1;

Otherwise x could have temporary excess precision that makes it nonzero at the time of comparison but zero at the time of calculation. In C++ and C#, typecasts explicitly round intermediate computations. I once had code like this that didn't cast, and I was very confused why the debugger kept telling me that x was zero, yet the branch wasn't taken.

2. Toby’s magic number 0.0406015441f can be determined without search in a better way that works for any floating-point type. The general formula is sqrt(sqrt(60*ulp(0.999999))) where ulp(x) is the distance to the next number above x. For single precision, ulp(0.999999) = 5.9604644775390625e-8 and the formula yields 0.043486837f.

The formula comes from the fact that the error of that truncated Taylor series is bounded by its next term, which happens to be x^4/120. (There is also roundoff error in computing the truncated Taylor series, but that is negligible by comparison.) We want the error to be at most half of an ulp, so solving x^4/120 = ulp/2 gives the formula.

Finding the threshold by binary search on your development machine is not a good idea in general because the error of sin(x) depends on its implementation, which (according to the C++ specification) is allowed to vary based on processor and compiler intrinsics. It is better to assume that the error of sin(x) is 1 ulp (as allowed by the specification) and work backward from there.

Andrew Fitzgibbon, 2020/07/24 14:13, 2020/07/24 14:14

AWF> Do you think we can construct an example of point 1? Is it just a denormal? We would have to find a compiler that generated assembly with those properties, and exhibit the assembly.

TM> The example of point 1 that I ran into is at Link

AWF> For point 2, what is special about 0.999999? Its ulp is close to the standard “eps”/2, but it seems like a strange constant.

TM> That constant is chosen to be near the value of sinc. You could also obtain ulp(0.999999) via ulp(1)/2.

Toby Sharp, 2020/08/07 04:12

Another constructive example might be a function qreal(a, b, c) which returns the real roots of the quadratic equation ax^2 + bx + c = 0. When a is zero and b is non-zero we have exactly one root at -c/b. When a is denormal, i.e. std::abs(a) < std::numeric_limits<T>::min() then I suggest we would rather solve the linear equation than deal with the division by 2*a in the quadratic case. Making this function work for all possible values of (a, b, c) is interesting. (And see also Kahan's method for computing the discriminant (b^2 - 4ac) exactly using FMA.)

Enter your comment. Wiki syntax is allowed:
 
floating-point-equality.txt · Last modified: 2018/09/05 15:14 by awf
CC Attribution 4.0 International
Driven by DokuWiki Recent changes RSS feed Valid CSS Valid XHTML 1.0