But the optimization there is rewriting the algorithm (which is 'brittle' because it depends on the operation used being commutative in that example).
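To make the brittleness concrete, here is a rough sketch in C. The example being discussed isn't reproduced here, so the functions below (`sum_rec`, `sum_acc`, `sub_rec`, `sub_acc`) are hypothetical stand-ins. The accumulator rewrite changes the order in which the operation is applied, so it only gives the same answer when that order doesn't matter:

```c
#include <stddef.h>

/* Non-tail recursion as written: the addition happens after the
   recursive call returns, so the caller's frame has to stay alive. */
long sum_rec(const long *xs, size_t n) {
    if (n == 0) return 0;
    return xs[0] + sum_rec(xs + 1, n - 1);
}

/* Rewritten with an accumulator: the recursive call is now the last
   thing the function does, i.e. it is in tail position. */
long sum_acc(const long *xs, size_t n, long acc) {
    if (n == 0) return acc;
    return sum_acc(xs + 1, n - 1, acc + xs[0]);
}

/* The same pair of functions with subtraction instead of addition. */
long sub_rec(const long *xs, size_t n) {
    if (n == 0) return 0;
    return xs[0] - sub_rec(xs + 1, n - 1);   /* 1 - (2 - (3 - 0)) */
}

long sub_acc(const long *xs, size_t n, long acc) {
    if (n == 0) return acc;
    return sub_acc(xs + 1, n - 1, acc - xs[0]);   /* ((0 - 1) - 2) - 3 */
}
```

For the input `{1, 2, 3}` both sum versions agree, but `sub_rec` and `sub_acc(..., 0)` do not, which is why a compiler can't apply this kind of rewrite blindly.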

Just optimizing tail calls as written in the programmer's code (which is what is normally meant by "TCO") doesn't need such rewriting, and would be trivial to implement via a jump if the back-end supports that. Except C also needs to take into consideration structures on the stack, and perhaps other details. Hence, apparently, gcc's handling isn't straightforward even for plain TCO, and it doesn't get applied in all applicable cases (or at least that's what happened ~4 years ago).
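A minimal sketch of the two situations, in C. The names (`gcd`, `struct pair`, `span`, `span_of`) are made up for illustration, and nothing here claims what any particular gcc version actually emits; that depends on the version and optimization flags:

```c
/* Plain tail call as written: nothing is left to do after the call
   returns, so call + return can in principle become a jump. */
unsigned gcd(unsigned a, unsigned b) {
    if (b == 0) return a;
    return gcd(b, a % b);
}

struct pair { long lo, hi; };

long span(const struct pair *p) {
    return p->hi - p->lo;
}

/* Syntactically also a tail call, but the callee gets a pointer to a
   structure living in this function's stack frame. The frame can't be
   thrown away before the call, so a simple jump is not safe here. */
long span_of(long lo, long hi) {
    struct pair p = { lo, hi };
    return span(&p);
}
```

The first function needs no rewriting at all to be optimized; the second shows the kind of stack-structure detail a C compiler has to check before it can reuse the frame.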