
I did my PhD on measuring floating-point error in large numerical simulations. In my experience, most of the time, if 64-bit floats are not enough, then trying to increase precision further is unlikely to help much (there are exceptions).

In general, a well-placed compensated (or exact) sum or dot-product algorithm will solve most of your problems at low cost. Most people are now familiar with Kahan summation (and related tricks like sorting the numbers), but they do not realize that it often helps less than switching to 128-bit floats would, and that the state of the art for these algorithms has moved quite a bit since. Nowadays, you can ensure that your sum/BLAS operation produces a result within one ulp of the exact result!
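
To make this concrete, here is a minimal sketch (Python, purely as an illustration) of classic Kahan summation next to the naive loop and math.fsum, which implements Shewchuk's exact algorithm and returns the correctly rounded sum, i.e. the "within one ulp" behaviour mentioned above. The data set is contrived so the effect is visible:

    import math

    def kahan_sum(values):
        total = 0.0
        c = 0.0                  # running compensation for lost low-order bits
        for v in values:
            y = v - c            # apply the correction carried from previous steps
            t = total + y        # low-order bits of y may be lost here...
            c = (t - total) - y  # ...but are recovered algebraically into c
            total = t
        return total

    # One large term followed by many terms that individually vanish against it.
    data = [1.0] + [1e-16] * 1_000_000

    print(sum(data))         # naive: 1.0, every small term is rounded away
    print(kahan_sum(data))   # compensated: ~1.0000000001
    print(math.fsum(data))   # exact (correctly rounded): 1.0000000001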

I highly recommend perusing the Handbook of Floating-Point Arithmetic (the Accurate Algorithms online page [0] also looks fine) to get an idea of what can be done. Implementations of these ideas can be found in most languages used for numerical computation (a quick Google search gave me options for C++/Python [1], Rust [2], etc.).

[0]: https://accurate-algorithms.readthedocs.io/en/latest/index.h...

[1]: https://github.com/yafshar/xsum

[2]: https://github.com/bsteinb/accurate



Here's an example where higher precision could be useful. You need to compute the expectation of some function over a distribution, and all you have is the moment generating function (MGF) of that distribution. The moments are the derivatives of the MGF, but a rough rule of thumb is that if the precision of a function is epsilon, then the precision of its n-th derivative is epsilon^(1/n). If you have 12 significant digits for the MGF, you can count on only 1 significant digit for the 12th derivative. Let's say that is simply unacceptable for you: you need 4 digits, so you can only use the first 3 derivatives. If you double the precision, you can use 6 derivatives, so you can match 6 moments, which might be enough for your problem.

Of course, you can find better algorithms: use cumulants instead of moments, devise a Gaussian quadrature method (via the Golub-Welsch algorithm) that uses them, etc. But that requires doing a lot of hard math (with the inevitable initial bugs), while plugging in a higher-precision library gives you results immediately.
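
A quick back-of-the-envelope check of that rule of thumb, just to spell out the arithmetic (Python, with the digit counts from the paragraph above):

    # epsilon**(1/n) means the n-th derivative keeps roughly digits/n of the
    # MGF's significant digits, so requiring 4 digits caps the usable moments.
    for mgf_digits in (12, 24):
        usable = max(n for n in range(1, 50) if mgf_digits / n >= 4)
        print(f"{mgf_digits}-digit MGF -> {usable} derivatives with at least 4 digits")
        # prints 3 for 12 digits, 6 for 24 digits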


Yes! And no!

You will indeed need higher precision if you need a very large number of digits (more than what double precision gives you).

However, even with high precision, the algorithms you use to compute your moments matter: you might not get as many digits as you hope for. Variance, for example, is famously numerically unstable when computed naively (sometimes even coming out negative). A stable summation algorithm plugged into a variance computation (even a carefully written one) can sometimes give you six additional digits of precision that you would not get by doubling the precision. It depends on how large a cancellation your operations produced: too large, and doubling the precision will not be enough to recover, while a compensated summation would do fine.
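
To illustrate that cancellation (a sketch only, with numpy and synthetic data standing in): the textbook one-pass formula E[x^2] - E[x]^2 can lose every digit for data with a large mean, while a two-pass computation whose residuals go through an exact accumulator does not.

    import math
    import numpy as np

    rng = np.random.default_rng(0)
    x = 1e9 + rng.standard_normal(100_000)       # mean ~1e9, true variance ~1

    # Naive one-pass formula: mean of squares minus square of the mean.
    naive = float(np.sum(x * x) / x.size - (np.sum(x) / x.size) ** 2)

    # Two-pass formula with an exact (compensated) accumulator for the residuals.
    mean = math.fsum(x) / x.size
    stable = math.fsum((xi - mean) ** 2 for xi in x) / x.size

    print(naive)    # typically has no correct digits, and can even be negative
    print(stable)   # close to the true variance of about 1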

I have seen this play out once in a clustering algorithm that would only work on some data when using a compensated summation to compute the variance.


Well, my point is that it's not as easy as saying "just pick a smarter algorithm". Not all problems needing high precision can resort to a smart summation. Sometimes that smart algorithm just doesn't exist yet.

You are also right that, if higher precision is what solves your problem, just quad precision is unlikely to be enough. Take the inverse Laplace transform, for example: in some cases you need thousands of digits. The cases where double precision is not enough but quad precision is are probably rare. Still, it's worth a try.
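
For what it's worth, this is the kind of place where an arbitrary-precision library earns its keep. A sketch with mpmath (assuming a recent version, which ships an invertlaplace routine); the working precision is set far beyond double because methods like Gaver-Stehfest shed digits rapidly as you add terms:

    from mpmath import mp, invertlaplace, exp

    mp.dps = 100                   # 100 decimal digits of working precision

    F = lambda s: 1 / (s + 1)      # Laplace transform of f(t) = exp(-t)

    t = mp.mpf(1)
    print(invertlaplace(F, t, method='talbot'))
    print(exp(-t))                 # reference value for comparison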


That’s what I was taught about numerics while doing my PhD in the 1990s: you shouldn’t even use doubles if you could get away with singles.


That's also my experience.

It's interesting that 64 bits are often necessary (as opposed to 32) but 128 bits so rarely are.


Yes! I suspect that it boils down to 32 bits not being that much (you can enumerate them to test a function), while 64 bits is a lot. By that point, if you are having problems, then you are likely hitting a wall that will not be crossed by throwing additional bits at the problem.
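
As an aside, "enumerate them" is quite literal: all 2^32 bit patterns can be walked in chunks and a single-precision function compared against a double-precision reference. A rough sketch (numpy assumed; the reciprocal square root is just a placeholder for whatever function you want to test, and the helper name is made up):

    import numpy as np

    def max_rel_error_float32(f32, f64, chunk=1 << 24):
        worst = 0.0
        for start in range(0, 1 << 32, chunk):
            bits = np.arange(start, start + chunk, dtype=np.int64).astype(np.uint32)
            x = bits.view(np.float32)
            x = x[np.isfinite(x) & (x > 0)]   # finite, positive inputs only here
            if x.size == 0:
                continue
            ref = f64(x.astype(np.float64))
            rel = np.abs(f32(x).astype(np.float64) - ref) / np.abs(ref)
            worst = max(worst, float(rel.max()))
        return worst

    # Example: float32 reciprocal square root against the float64 reference.
    print(max_rel_error_float32(lambda x: np.float32(1) / np.sqrt(x),
                                lambda x: 1.0 / np.sqrt(x)))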


Thanks, this is very helpful. Two questions:

1. What are the implications when parallelization is considered? I.e., are those accurate sum algorithms parallelizable, and how do they compare to a traditional parallelized sum?

2. What performance penalty should I be expecting?


1. Parallelization: Yes, they do parallelize! Conceptually, it is fairly easy to use an accurate algorithm locally, then reduce the local accumulators. A side effect of doing this is cutting (or at least reducing) the noise introduced in the output by the parallelization (not to be confused with deterministic summation/BLAS, which enforces the order of operations to achieve the same side effect but does not improve accuracy).
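
A sketch of that "accurate locally, then reduce" structure (Python's math.fsum standing in for the local exact accumulator; note the final result can still differ from an exact whole-array sum by a rounding or so, unless the workers exchange their full error terms rather than one rounded partial each):

    import math
    from concurrent.futures import ProcessPoolExecutor

    def accurate_parallel_sum(values, workers=4):
        chunks = [values[i::workers] for i in range(workers)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            partials = list(pool.map(math.fsum, chunks))
        return math.fsum(partials)          # reduce the per-worker accumulators

    if __name__ == "__main__":
        data = [1e16, 1.0, -1e16] * 1_000_000   # heavy cancellation on purpose
        print(sum(data))                        # naive sequential sum: 0.0
        print(accurate_parallel_sum(data))      # 1000000.0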

2. It varies wildly depending on the algorithm you use, the implementation of said algorithm, and the operations you perform outside of the summation (if you sum the results of a loop performing non-trivial operations, that work can easily overlap with the summation overhead, making it essentially free). The only rule of thumb I can give is that runtime cost increases with the desired accuracy. My recommendation would be to switch from a naive sum to an exact accumulator and observe the difference in the output value (telling you how impactful numerical errors in that sum were) and in the runtime (telling you how expensive the switch is for you), then dial it down if needed.
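
The crudest possible version of that measurement, as a sketch (adapt the data and the accurate accumulator to your own setup):

    import math
    import random
    import time

    data = [random.uniform(-1e12, 1e12) for _ in range(1_000_000)]

    for name, fn in (("naive sum", sum), ("exact fsum", math.fsum)):
        t0 = time.perf_counter()
        result = fn(data)
        dt = time.perf_counter() - t0
        print(f"{name}: {result!r}  ({dt * 1e3:.1f} ms)")

The gap between the two values tells you how much the rounding errors mattered; the runtime ratio tells you what the accuracy costs.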


Thanks! That’s very interesting! Now I need to find my next project to try this…



