There are a bunch of real situations where you can assume the input will be in a small range. And while reducing from [-pi;pi] or [-2*pi;2*pi] or whatever will slow it down somewhat, I'm pretty sure it wouldn't be significant compared to the FP arithmetic itself. (And branching on inputs outside even that expanded target expected range is a fine strategy, realistically.)
Most real math libraries will do this with only a quarter of the period, accounting for both sine and cosine in the same numerical approximation. You can then do range reduction into the region [0, pi/2) and run your approximation, flipping the X or Y axis as appropriate for either sine or cosine. This can be done branchlessly and in a SIMD-friendly way, and is far better than using a higher-order approximation to cover a larger region.
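A rough scalar sketch of that idea, assuming the common variant that folds to the nearest multiple of pi/2 and evaluates on [-pi/4, pi/4] (the coefficients are plain Taylor terms for illustration, huge arguments aren't handled, M_PI is assumed available from math.h, and a SIMD version would replace the switch with masked selects):

    #include <math.h>

    /* One small sin polynomial and one small cos polynomial on
       [-pi/4, pi/4] cover the whole circle; the quadrant index picks
       which one to use and fixes the sign. */
    static double sin_poly(double r) {            /* ~sin(r), |r| <= pi/4 */
        double r2 = r * r;
        return r * (1.0 + r2 * (-1.0/6.0 + r2 * (1.0/120.0)));
    }

    static double cos_poly(double r) {            /* ~cos(r), |r| <= pi/4 */
        double r2 = r * r;
        return 1.0 + r2 * (-0.5 + r2 * (1.0/24.0));
    }

    double fast_sin(double x) {
        double k = nearbyint(x * (2.0 / M_PI));   /* nearest multiple of pi/2 */
        double r = x - k * (M_PI / 2.0);          /* residual in [-pi/4, pi/4] */
        switch ((int)k & 3) {                     /* quadrant 0..3 */
            case 0:  return  sin_poly(r);
            case 1:  return  cos_poly(r);
            case 2:  return -sin_poly(r);
            default: return -cos_poly(r);
        }
    }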
That's only if they're unpredictable; sure, perhaps on some workload it'll be unpredictable whether the input to sin/cos is greater than 2*pi, but I'm pretty sure on most it'll be nearly always a "no". Perhaps not an optimization to take in general, but if you've got a workload where you're fine with 0.5% error, you can also spend a couple of seconds thinking about what range of inputs to handle in the fast path. (Hence "target expected range": unexpected inputs getting unexpected branches won't slow things down if you've calibrated your expectations correctly; edited my comment slightly to make it clearer that this is about being outside the expanded range, not just [-pi/2,pi/2].)
I'm of course not suggesting branching in cases where you expect a 30% misprediction rate. You'd do branchless reduction from [-2*pi;2*pi] or whatever you expect to be frequent, and branch on inputs with magnitude greater than 2*pi if you want to be extra sure you don't get wrong results if usage changes.
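A sketch of what that split might look like, where fast_sin stands in for whatever branchless-reduction approximation covers the expected range, and the 2*pi cutoff / libm fallback are placeholders for whatever you actually expect to be frequent:

    #include <math.h>

    double fast_sin(double x);   /* assumed: branchless reduction +
                                    polynomial, valid for |x| <= 2*pi */

    double fast_sin_guarded(double x) {
        /* Rare, well-predicted branch: anything outside the calibrated
           range falls back to libm, so results stay correct if usage
           changes; in-range inputs only pay for the compare. */
        if (fabs(x) > 2.0 * M_PI)
            return sin(x);
        return fast_sin(x);
    }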
Again, we're in a situation where we know we can tolerate a 0.5% error, we can spare a bit of time to think about what range needs to be handled fast or supported at all.
Those reductions need to be part of the function being benchmarked, though. Even assuming a range limitation of [-pi,pi] would be reasonable; there are certainly cases where you don't need multiple revolutions around a circle. But this can't even do that, so it's simply not a substitute for sin, and claiming 40x faster is a sham.
Right; the range reduction from [-pi;pi] would be like 5 instrs ("x = abs(x) > pi/2 ? copysign(pi, x) - x : x" or so, using sin(pi - x) = sin(x)), ~2 cycles throughput-wise, I think; that's slightly more significant than I was imagining, hmm.
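For concreteness, a guess at that fold in scalar C, using sin(pi - x) = sin(x); whether the compiler actually emits a blend rather than a branch for the ternary isn't guaranteed:

    #include <math.h>

    /* Fold [-pi, pi] into [-pi/2, pi/2] without changing sin(x):
       abs + compare + copysign + subtract + select, roughly the
       5 instructions / ~2 cycles of throughput mentioned above. */
    static double fold_to_half_pi(double x) {
        double mirrored = copysign(M_PI, x) - x;  /* sin(pi - x) = sin(x) */
        return fabs(x) > M_PI_2 ? mirrored : x;
    }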
It's indeed not a substitute for sin in general, but it could be in some use-cases, and for those it could really be 40x faster (say, cases where you're already doing range reduction externally because it's needed for some other reason; in general you don't want your angles accumulating magnitude forever anyway).