Both the RISC-V and Thumb code here are no longer recursive fib; GCC has quite correctly transformed one of the recursions into a normal nonrecursive loop by exploiting the commutative and associative properties of integer addition. I infer that the same thing may be true of the MSP430 code from the presence of only a single call instruction, but I don't see the loop; perhaps the br to 0 will be relinked to point at .L3 before execution?
I haven't seen MSP430 code before, due to my somewhat embarrassing and unfair distaste for TI as a company, and I agree that it is a very nice encoding indeed!
If you insist that the code contain at least two recursive calls, two subtractions, an addition, and a conditional for the base case, even Thumb2 and RVC no longer look so competitive with stack machines. If not, there is no particular reason to leave one of the recursions in; you may as well optimize that out too, reducing an exponential-time algorithm to a linear-time one (or a linear one to logarithmic, if the parameter is the number of bits of input). This also makes it smaller, something like (untested):
mov r1, #1
mov r2, #1
1: subs r0, #1
itt lo
movlo r0, r1
bxlo lr
mov r3, r2
adds r2, r1
mov r1, r3
b 1b
I think that's maybe 22 bytes?
But I don't think that tells us much about the relative densities of stack code, register code, accumulator code, and belt code, because it's a very different mix of operations.
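For clarity, here is the loop the assembly above implements, translated into Python; this is my reading of the register shuffle, so caveat lector:

def fib(n):
    a, b = 1, 1          # r1 and r2 in the assembly above
    for _ in range(n):   # each trip is one subs/mov/adds/mov round
        a, b = b, a + b
    return a             # the movlo r0, r1 on loop exit

Note that this indexing has fib(0) = fib(1) = 1.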
> Both the RISC-V and Thumb code here are no longer recursive fib
If the programmer wrote a doubly-recursive fib() and the compiler transformed one to tail-recursion, then that's a doubly-recursive fib() as far as I'm concerned. If you don't want that to happen then use a benchmark that isn't so easily optimised, such as maybe Ackermann.
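For reference, here is the usual two-argument Ackermann in Python; the inner call nested in the argument position is the part no optimiser can loop away:

def ack(m, n):
    if m == 0:
        return n + 1
    if n == 0:
        return ack(m - 1, 1)          # a tail call, so a compiler can loop it
    return ack(m - 1, ack(m, n - 1))  # but not the nested inner call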
Note that the tail-recursion transformation doesn't change the calculation performed or the execution time from being exponential; it merely saves some stack manipulation.
Having access to good compilers is part of the attraction of using a real ISA instead of a made-up one.
Changing the algorithm to an iterative linear-time form is not an optimisation I'd expect a compiler to do.
I agree, but what we're trying to look at here is how space-efficient the ISA is at encoding computational operations. If one of the example programs encodes a totally different set of computational operations, it stops being an informative comparison.
But, you may reasonably object, this is handwaving! If a subroutine call is a "computational operation", why isn't duplicating a stack item or storing a variable? The stack code is cheating by not storing all its temporaries into variables! And I think the answer is that there's a bit of a slippery slope there, where more extensive program transformations give us less interesting answers. I would argue that transformations like not allocating registers to temporaries (or, say, subroutine arguments) are more broadly applicable than the transformation you're exhibiting, where the compiler has changed a non-tail call to dumbfib, followed by an addition, into an addition followed by a tail-recursive call implemented as a jump. That depends on being able to reassociate the entire chain of additions through the entire call chain, which is something I wouldn't have expected a compiler to be able to do.
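To make the transformation concrete, here is the shape of what I understand GCC to have done, sketched in Python; this is my hand-derived reconstruction, not the compiler's actual output:

def fib(n):              # as written: doubly recursive
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def fib_as_compiled(n):  # the fib(n-2) recursion, reassociated into an
    acc = 0              # accumulator, becomes this loop; only the
    while n >= 2:        # fib(n-1) recursion remains a real call
        acc += fib_as_compiled(n - 1)
        n -= 2
    return acc + n       # n is 0 or 1 here: the base case

The additions performed, and the exponential running time, are unchanged; only the call-and-return bookkeeping for one of the two recursions is gone.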
I agree that it's more interesting to see benchmarks that aren't so easily gamed by optimizers, which is why in https://dernocua.github.io/notes/c-stack-bytecode.html I selected a bunch of small C and Pascal subroutines quasi-randomly and hand-compiled them into different candidate stack bytecode representations. I'm not sure that Ackermann or Tak would be as interesting.
The main weakness is that -- opposite of dumbfib -- it has no function calls at all!
But I find it correlates well with other CPU core & L1 cache/SRAM benchmarks such as Dhrystone and Coremark. The Embench guys added a cut-down version -- smaller arrays, shorter execution time [1] -- to their benchmark suite a couple of years back.
[1] baseline ~4s on typical microcontrollers ... I'd started at ~5s on Ivy Bridge, 10-30s on the Arm boards I had back then
It does have loops, integer multiplication, and array indexing (mostly reading), and all of those are pretty important.
Subroutine calls are particularly difficult to benchmark because they are so critical to real-world performance and so easy for compilers to remove from a benchmark if they can guess that a particular callsite will be hot. (LuaJIT takes this to the extreme.)
You'd think non-tail recursion would stymie this, but I've seen GCC unroll a recursive loop several layers deep and constant-fold away several layers above the base case. Impressive and useful, but it completely invalidated my dumb benchmark.
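The shape of the trick, hand-illustrated in Python rather than taken from actual compiler output:

def fib(n):
    if n < 2: return n
    if n == 2: return 1   # layers near the base case constant-folded
    if n == 3: return 2
    # one level of unrolling: the fib(n - 1) call expanded inline
    return (fib(n - 2) + fib(n - 3)) + fib(n - 2)

The additions performed are unchanged, but each level of unrolling removes a constant fraction of the calls the benchmark was trying to count.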
I would guess that most programs have a subroutine call on the order of once every 100 instructions, spending something like 15% of their time on subroutine call and return. This is a very difficult number to quantify because of the pervasive indirect costs of subroutine calls: less efficient register allocation, forgone opportunities for constant folding and specialization, objects stored in memory rather than in registers, allocation on the heap rather than a stack, etc.
I wasn't able to find good information about how many instructions per subroutine call computers typically run these days. Probably I'm not using the right search terms, but here's what I found.
I just ran Emacs a few times under valgrind --tool=callgrind. Perhaps surprisingly, it was usably fast once it started up. I analyzed callgrind's output file with the following highly sophisticated methodology to determine the total number of subroutine calls (and presumably returns):
import re, sys

# total up the calls=N records in callgrind's output file, read on stdin
print(sum(int(mo.group(1)) for line in sys.stdin
          for mo in [re.match(r'^calls=(\d+)', line)] if mo))
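(The input here is the callgrind.out.<pid> file that valgrind leaves in the current directory.)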
This, plus some division by numbers of executed instructions reported, revealed that it had a subroutine call for about every 55–59 instructions, and about a hundred and fifty thousand extra subroutine calls (and 7.7 million extra instructions) per character typed.
This suggests that, at least in Emacs, the cost of subroutine call and return is probably higher than 15%: if each call costs something like 15 instructions' worth of direct and indirect overhead, as my guess above assumes, then one call per 57 instructions works out to more like 25%.
Other programs I tested:
- GNU ls: 79 instructions per call
- PARI/GP: 360 instructions per call
- CPython with Numpy: 88 instructions per call
- Minetest (rather unplayable): 88 instructions per call
Are you able to figure out how many of those are calls of leaf functions?
On modern ISAs and ABIs calling a small leaf function is very cheap -- just the `jal` and `ret`, plus getting the arguments into the right registers first. In many cases memory is not touched at all (unless the code in the function necessarily touches memory), no stack frame is created etc.
Hmm, I think that ought to be pretty doable with the callgrind output actually. I'm using amd64 rather than a modern ISA, so the calls do necessarily touch "memory", but leaf functions are still especially cheap—they can use temporary registers for local variables. In the case of Emacs most of the functions are compiled Lisp, and I doubt the Lisp compiler is smart enough to exploit leafness in its register allocator.
Okay, so, I wrote a program http://canonical.org/~kragen/sw/dev3/leafcalls.py to list the leaf functions with the number of calls to each, and totaled up the results. Assuming I didn't fuck up the analysis, which is never a good bet, here it is:
I did look at the output file listing the leaf functions and it is at least not obviously wrong.
So, if this is right, typically almost half of all calls are calls to leaf functions (functions that never call another function), although my Emacs and Numpy runs have only about a third and a fourth of their calls being leaf calls. All of these numbers are surprisingly low to me; I would have expected the answer to be more like two thirds to three fourths.
There's an optimization I've applied sometimes, the opportunity for which wouldn't show up here: when a function can return before it calls any other subroutine, you can sometimes avoid doing all the foofaraw of a non-leaf call by delaying it until after you know that you're on a non-leaf path. This isn't always a pure win, though, and I've never seen a compiler do it. The call graph output by callgrind only has subroutine granularity, so, absent any alternative, I'm only counting calls as leaf calls if they are calls to a leaf subroutine, that is, a subroutine that never calls other subroutines.
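In outline the analysis is something like the following sketch (my reconstruction of the idea, not the actual leafcalls.py linked above); it assumes the fn=/cfn=/calls= record structure of callgrind's output and its (id) name-compression scheme:

import re, sys
from collections import Counter

names = {}             # callgrind's (id) name-compression table
calls_to = Counter()   # callee name -> number of calls made to it
makes_calls = set()    # functions observed making at least one call
cur_fn = cur_callee = None

def decode(s):
    # "(42) name" defines compressed name 42; a bare "(42)" refers back to it
    mo = re.match(r'\((\d+)\)\s*(.*)', s)
    if not mo:
        return s
    num, name = mo.groups()
    if name:
        names[num] = name
    return names.get(num, num)

for line in sys.stdin:
    line = line.rstrip('\n')
    if line.startswith('fn='):
        cur_fn = decode(line[3:])
    elif line.startswith('cfn='):
        cur_callee = decode(line[4:])
    elif line.startswith('calls='):
        makes_calls.add(cur_fn)
        calls_to[cur_callee] += int(line.split()[0][6:])

leaf = sum(n for f, n in calls_to.items() if f not in makes_calls)
total = sum(calls_to.values())
print(leaf, 'of', total, 'calls are to leaf functions')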
Note: these could be tail-calls, and are if you use `-O2` or `-Os`, but that's not important. Make it `foo(n) + 13` and you still get the "don't make a stack frame unless needed" optimisation.
Oh, here's an interesting one. Arm proactively pushes everything, x86 and RISC-V only if `err()` will be called.
I also would expect the number of function executions which are dynamically leaf functions [1] to be greater than 50%, except for the kind of perverse call chains you get in Java and perhaps some other OO code bases, where there are huge numbers of layers of abstraction and each method simply calls one other lower-level method.
[1] i.e. that don't call another function on that execution, though in other circumstances they could.
I wonder why the second example frustrates the optimization—but only on ARM? Also, why is it pushing r0 and r1? That's stupid. It's also stupid that it's using an extra `add` instruction to un-push them—if it doesn't want to overwrite r0, it could pop the worthlessly pushed values into r1 and r2, or r3 and r4. I tried a bunch of different versions of GCC on your example code, and they all seem to produce code that's bad in obvious ways, but they're different obvious ways.
On further thought, if you have an instruction cache, adding 8 to SP is a faster way to discard the two useless values than popping them into unused registers. So that part is arguably defensible. But why push them? There can't be a requirement for a 16-byte-aligned stack pointer at all times; the add itself would violate it.
That's 24 bytes because gas is using Thumb2 mov.w instructions to initialize r1 and r2. It took me a long time to figure out that the problem was mov vs. movs; this is the correct 20-byte version:
.cpu cortex-m4
.thumb
.syntax unified
.thumb_func
.globl fib
fib: movs r1, #1
movs r2, #1
1: subs r0, #1
itt lo
movlo r0, r1
bxlo lr
mov r3, r2
adds r2, r1
mov r1, r3
b 1b
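(That's ten instructions, each with a 16-bit encoding, for 20 bytes.)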
I swear, I keep making the same n00b mistakes, year after year!
It might be possible to get it down to 18 bytes, but I don't see how.