
Educational material that misinforms its readers isn't educational, and it's insanely counterproductive.

People have already ragged on you for using a Taylor approximation, and I'm not the best expert on the numerical analysis of implementing transcendental functions, so I won't pursue that further. But there are still several other unaddressed errors in your trigonometric code:

* If your function is going to omit range reduction, say so upfront. Saying "use me to get a 40× speedup because I omit part of the specification" is misleading to users, especially because you should assume that most users are not knowledgeable about floating-point and thus they aren't even going to understand they're missing something without you explicitly telling them!

* You're evaluating the polynomial as `x * a + (x * x) * b + (x * x * x) * c`, which is neither the conventional way of doing it nor a fast one. If you're trying to be educational, use Horner's method, `((c * x + b) * x + a) * x`; that's how it's done in practice, and that's how it should be taught (see the sketch after this list).

* Also, doing `x / 6.0` is a disaster for performance, because fdiv is one of the slowest operations you can do. Why not do `x * (1.0 / 6.0)` instead?

* Doing really, really dumb floating-point code and then relying on -ffast-math to make the compiler unpick your dumbness is... a bad way of doing stuff. Especially since you're recommending people go for it for the easy speedup and saying absolutely nothing about where it can go catastrophically wrong. As Simon Byrne said, "Friends don't let friends use -ffast-math" (and the title of my explainer on floating-point will inevitably be "Floating Point, or How I Learned to Start Worrying and Hate -ffast-math").
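Since the point is to be educational, here's roughly what the second and third bullets look like in practice, as a minimal sketch (my code and my coefficient choices, just the first Taylor terms, not the repo's actual values):

    // Degree-5 Taylor-style sin() for [-pi/2, pi/2], written the naive
    // way and the Horner way.
    #include <cstdio>

    // Naive form: separate powers, plus an fdiv hiding in every x / k term.
    static double sin_naive(double x) {
        return x - (x * x * x) / 6.0 + (x * x * x * x * x) / 120.0;
    }

    // Horner form in x^2, with the divisions folded into compile-time
    // reciprocal constants so no fdiv executes at runtime.
    static double sin_horner(double x) {
        const double c1 = -1.0 / 6.0;
        const double c2 =  1.0 / 120.0;
        const double x2 = x * x;
        return x * (1.0 + x2 * (c1 + x2 * c2));
    }

    int main() {
        for (double x = -1.5; x <= 1.5; x += 0.5)
            std::printf("% .2f  naive % .9f  horner % .9f\n",
                        x, sin_naive(x), sin_horner(x));
        return 0;
    }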



-ffast-math has about 15 separate compiler flags that go into it, and on any given piece of code, about 3-5 of them are disastrous for your numerical accuracy (but which 3-5 changes by application). If you do the other 10, you get most of the benefit without the inaccuracy. -ffast-math is especially dumb because it encourages people to go for all of them or nothing.
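To make that concrete, here's a sketch (my example, nothing from the repo) of code where exactly one of those sub-flags, -fassociative-math, is the disastrous one: Kahan compensated summation. Once the compiler is allowed to reassociate, it may simplify the compensation term `(t - sum) - y` to zero and quietly turn this back into a plain sum.

    #include <cstdio>
    #include <vector>

    // Kahan compensated summation. Under -fassociative-math (one of the
    // flags bundled into -ffast-math), the compiler may treat FP addition
    // as associative and optimize the compensation away, silently making
    // this a plain, less accurate sum.
    double kahan_sum(const std::vector<double>& xs) {
        double sum = 0.0, c = 0.0;
        for (double x : xs) {
            double y = x - c;
            double t = sum + y;
            c = (t - sum) - y;   // low-order bits lost when forming t
            sum = t;
        }
        return sum;
    }

    int main() {
        std::vector<double> xs(10000000, 0.1);
        // Compare a plain -O2 build against -O2 -ffast-math.
        std::printf("%.15f\n", kahan_sum(xs));
        return 0;
    }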


I'd say `x * a + (x * x) * b + (x * x * x) * c` is likely faster (assuming the compiler is reasonable) than the Horner form `((c * x + b) * x + a) * x`, because it has a shorter longest dependency chain. Adds and multiplies are pipelined (their reciprocal throughput is much lower than their latency), so the latency chain dominates performance and the few extra instructions just get hidden by instruction-level parallelism.

Also, x / 6 vs. x * (1/6) is not as bad as it used to be; fdiv keeps getting faster. On my Zen 2 it's 10 cycles latency and 0.33/cycle throughput for (vector) div, versus 3 cycles latency and 2/cycle throughput for (vector) add. So roughly a third of the speed, and worse if you have a lot of divs and the pipeline fills up. Going back to the Pentium, the difference was ~10x and you didn't get to hide it with instruction-level parallelism.

* The first expression has a dependency chain of 4 instructions, each of which can't start before the previous one finishes (`(((x * x) * x) * c) +` the rest), whereas in the Horner version the entire expression is one such chain. Using FMA instructions changes this a bit, making all the adds in both expressions 'free', but that changes the precision and needs -ffast-math or similar, which I agree is dangerous and generally ill-advised.
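For what it's worth, if you want the fused behaviour without leaving it to contraction flags, you can spell it out with std::fma. A sketch for the cubic a*x + b*x^2 + c*x^3 (placeholder coefficients, not the repo's):

    #include <cmath>

    // Horner evaluation of a*x + b*x^2 + c*x^3 with the multiply-adds
    // written explicitly via std::fma, so the rounding behaviour doesn't
    // depend on whether the compiler decides to contract a*b + c itself.
    static double poly3(double x, double a, double b, double c) {
        return x * std::fma(x, std::fma(x, c, b), a);
    }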


Someone is clearly still putting tremendous effort into this project, so I reckon it would be worthwhile to submit this obviously well-thought-through criticism as a PR to the repo!


For the range reduction, I've always been a fan of using revolutions rather than radians as the angle measure, since you can just extract the fractional bits to range reduce. Note that this comes at the cost of slightly messier calculus (derivatives pick up factors of 2π).
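A sketch of what I mean, assuming the angle is already expressed in revolutions (the names are mine, not from any particular library):

    #include <cmath>

    // Range reduction in revolutions (turns): 1.0 is a full circle, so
    // reduction is just "throw away the integer part".
    static double reduce_turns(double turns) {
        return turns - std::nearbyint(turns);   // result in [-0.5, 0.5]
    }

    // The polynomial can then be fitted directly in turns, or the reduced
    // value can be scaled by 2*pi before a radian-based approximation.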

I can't for the life of me find the Sony presentation, but the fastest polynomial evaluation is somewhere between Horner's method (which has a long dependency chain that's bad for pipelining) and fully expanded evaluation (which does redundant work).
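I can't vouch for what the Sony slides did, but the usual in-between technique is Estrin-style evaluation; a sketch for a cubic:

    // Estrin-style evaluation of a + b*x + c*x^2 + d*x^3: the halves
    // (a + b*x) and (c + d*x) are independent, so they can execute in
    // parallel and be combined with x^2, shortening the dependency chain
    // relative to Horner at the cost of an extra multiply.
    static double estrin3(double x, double a, double b, double c, double d) {
        const double x2 = x * x;
        return (a + b * x) + (c + d * x) * x2;
    }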

Totally with you on not relying on fast math! Not that I had much choice when I was working on games because that decision was made higher up!


I don't know who the target audience is supposed to be, but what kind of person tries to implement performance-critical numerical code yet doesn't know the implications of Taylor-expanding the sine function?


People who found that sin() is the performance bottleneck in their code and are trying to find a way to speed it up.

One of the big problems with floating-point code in general is that users are largely ignorant of floating-point issues. Even something as basic as "0.1 + 0.2 != 0.3" shouldn't be that surprising to a programmer if you spend about five minutes explaining it, but the evidence is clear that it is a shocking surprise to a very large fraction of programmers. And that's the most basic floating-point issue, the one you're virtually guaranteed to stumble across if you do anything with floating-point; there are so many more issues that you're not going to think about until you uncover them for the first time (e.g., different hardware gives different results).
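For anyone who hasn't hit it yet, the whole surprise fits in a couple of lines:

    #include <cstdio>

    int main() {
        // None of 0.1, 0.2, 0.3 is exactly representable in binary
        // floating point, and the rounding errors don't happen to cancel.
        std::printf("%d\n", 0.1 + 0.2 == 0.3);   // prints 0
        std::printf("%.17g\n", 0.1 + 0.2);       // prints 0.30000000000000004
        return 0;
    }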


Thanks to a lot of effort by countless legions of people, we're largely past the years when it was common for different hardware to give different results. It's pretty much just contraction, FTZ/DAZ, funsafe/ffast-math, and NaN propagation. Anyone interested in practical reproducibility really only has to consider the first two among the basic parts of the language, and they're relatively straightforward to manage.
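On x86, "manage" mostly means pinning the two MXCSR bits yourself; a sketch (the helper name is mine):

    #include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE
    #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE

    // FTZ and DAZ are just two bits in MXCSR. Some toolchains set them at
    // startup when you build with -ffast-math, so clearing them explicitly
    // is the reproducible option if you want denormals handled properly.
    static void keep_denormals(void) {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_OFF);
    }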


Divergent math library implementations are the other main category, and in many practical cases you might have to worry about the parallelization factor changing things. For completeness' sake, I might as well add approximate functions, but if you're using an approximate inverse-square-root instruction, well, you should probably expect the results to differ on different hardware.

On the plus side, x87 excess precision is largely a thing of the past, and we've seen some major pushes towards getting rid of FTZ/DAZ (I think we're at the point where even the offload architectures are mandating denormal support?). Assuming Intel figures out how to fully get rid of denormal penalties in its hardware, we're probably a decade or so out from -ffast-math no longer implying denormal flushing, yay. (Also, we're seeing a lot of progress on high-speed implementations of correctly rounded libm functions, so I expect standard libraries to eventually require correctly rounded implementations as well.)


The definition I use for determinism is "same inputs and same order = same results", down to the compiler level. All modern compilers on all modern platforms that I've tested take steps to ensure that for everything except transcendental and special functions (where it'd be an unreasonable guarantee).

I'm somewhat less interested in correctness of the results, so long as they're consistent. rlibm and related are definitely neat, but I'm not optimistic they'll become mainstream.


There are lots of cases where you can get away with moderate accuracy. Rotating a lot of batched sprites would be one of them; you could easily get away with a mediocre Taylor-series approximation, even though it leaves free accuracy on the table compared to a minimax fit.

But not having _range reduction_ is a bigger problem; I can't see many uses for a sin() approximation that's only good for half a period. And as others have said, if you need range reduction for the approximation to work in its intended use case, that needs to be included in the benchmark, because you're going to be paying that cost relative to `std::sin()`.


> tries to implement performance critical numerical codes but doesn't know the implications of Taylor expanding the sine function?

That would be me, I’m afraid. I know little about Taylor series, but I’m pretty sure it’s less than ideal for the use case.

Here’s a better way to implement fast trigonometric functions in C++: https://github.com/Const-me/AvxMath/blob/master/AvxMath/AvxM... That particular source file implements them for 4-wide FP64 AVX vectors.


I am genuinely quite surprised that the sine approximation is the eyeball catcher in that entire repo.

It would only take a 5-line PR to add Horner's method and Chebyshev polynomials, plus probably around 20 lines of explanation, and everyone passionate about the topic is welcome to add them.

There are more than enough examples in the libraries mentioned above ;)


It's eye-catching because it's advertised as a 40x speedup without any caveats.


Oh, I missed that part. It's hard to fit the README's bullet points on a single line, and I probably removed too many relevant words.

I'll update the README statement in a second; I've already patched the sources to explicitly focus on the [-π/2, π/2] range.

Thanks!



