Both the RISC-V and Thumb code here are no longer recursive fib; GCC has quite correctly transformed one of the recursions into a normal nonrecursive loop by exploiting the commutative and associative properties of integer addition. I infer that the same thing may be true of the MSP430 code from the presence of only a single call instruction, but I don't see the loop; perhaps the br to 0 will be relinked to point at .L3 before execution?
I haven't seen MSP430 code before, due to my somewhat embarrassing and unfair distaste for TI as a company, and I agree that it is a very nice encoding indeed!
If you insist that the code contain at least two recursive calls, two subtractions, an addition, and a conditional for the base case, even Thumb2 and RVC no longer look so competitive with stack machines. If not, there is no particular reason to leave one of the recursions in; you may as well optimize that out too, reducing an exponential-time algorithm to a linear-time one (or a linear one to logarithmic, if the parameter is the number of bits of input). This also makes it smaller, something like (untested):
mov r1, #1
mov r2, #1
1: subs r0, #1
itt lo
movlo r0, r1
bxlo lr
mov r3, r2
adds r2, r1
mov r1, r3
b 1b
I think that's maybe 22 bytes?
But I don't think that tells us much about the relative densities of stack code, register code, accumulator code, and belt code, because it's a very different mix of operations.
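For clarity, here is the loop the assembly above implements, translated into Python; this is my reading of the register shuffle, so caveat lector:

def fib(n):
    a, b = 1, 1          # r1 and r2 in the assembly above
    for _ in range(n):   # each trip is one subs/mov/adds/mov round
        a, b = b, a + b
    return a             # the movlo r0, r1 on loop exit

Note that this indexing has fib(0) = fib(1) = 1.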
> Both the RISC-V and Thumb code here are no longer recursive fib
If the programmer wrote a doubly-recursive fib() and the compiler transformed one to tail-recursion, then that's a doubly-recursive fib() as far as I'm concerned. If you don't want that to happen then use a benchmark that isn't so easily optimised, such as maybe Ackermann.
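For reference, here is the usual two-argument Ackermann in Python; the inner call nested in the argument position is the part no optimiser can loop away:

def ack(m, n):
    if m == 0:
        return n + 1
    if n == 0:
        return ack(m - 1, 1)          # a tail call, so a compiler can loop it
    return ack(m - 1, ack(m, n - 1))  # but not the nested inner call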
Note that the tail-recursion transformation doesn't change the calculation performed or the execution time from being exponential; it merely saves some stack manipulation.
Having access to good compilers is part of the attraction of using a real ISA instead of a made-up one.
Changing the algorithm to an iterative linear-time form is not an optimisation I'd expect a compiler to do.
I agree, but what we're trying to look at here is how space-efficient the ISA is at encoding computational operations. If one of the example programs encodes a totally different set of computational operations, it stops being an informative comparison.
But, you may reasonably object, this is handwaving! If a subroutine call is a "computational operation", why isn't duplicating a stack item or storing a variable? The stack code is cheating by not storing all its temporaries into variables! And I think the answer is that there's a bit of a slippery slope there, where more extensive program transformations give us less interesting answers. I would argue that transformations like not allocating registers to temporaries (or, say, subroutine arguments) are more broadly applicable than the transformation you're exhibiting, where the compiler has changed a non-tail call to dumbfib, followed by an addition, into an addition followed by a tail-recursive call implemented as a jump. That depends on being able to reassociate the entire chain of additions through the entire call chain, which is something I wouldn't have expected a compiler to be able to do.
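To make the transformation concrete, here is the shape of what I understand GCC to have done, sketched in Python; this is my hand-derived reconstruction, not the compiler's actual output:

def fib(n):              # as written: doubly recursive
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def fib_as_compiled(n):  # the fib(n-2) recursion, reassociated into an
    acc = 0              # accumulator, becomes this loop; only the
    while n >= 2:        # fib(n-1) recursion remains a real call
        acc += fib_as_compiled(n - 1)
        n -= 2
    return acc + n       # n is 0 or 1 here: the base case

The additions performed, and the exponential running time, are unchanged; only the call-and-return bookkeeping for one of the two recursions is gone.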
I agree that it's more interesting to see benchmarks that aren't so easily gamed by optimizers, which is why in https://dernocua.github.io/notes/c-stack-bytecode.html I selected a bunch of small C and Pascal subroutines quasi-randomly and hand-compiled them into different candidate stack bytecode representations. I'm not sure that Ackermann or Tak would be as interesting.
The main weakness is that -- opposite of dumbfib -- it has no function calls at all!
But I find it correlates well with other CPU core & L1 cache/SRAM benchmarks such as Dhrystone and Coremark. The Embench guys added a cut-down version -- smaller arrays, shorter execution time [1] -- to their benchmark suite a couple of years back.
[1] baseline ~4s on typical microcontrollers ... I'd started at ~5s on Ivy Bridge, 10-30s on the Arm boards I had back then
It does have loops, integer multiplication, and array indexing (mostly reading), and all of those are pretty important.
Subroutine calls are particularly difficult to benchmark because they are so critical to real-world performance and so easy for compilers to remove from a benchmark if they can guess that a particular callsite will be hot. (LuaJIT takes this to the extreme.)
You'd think non-tail recursion would stymie this, but I've seen GCC unroll a recursive loop several layers deep and constant-fold away several layers above the base case. Impressive and useful, but it completely invalidated my dumb benchmark.
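The shape of the trick, hand-illustrated in Python rather than taken from actual compiler output:

def fib(n):
    if n < 2: return n
    if n == 2: return 1   # layers near the base case constant-folded
    if n == 3: return 2
    # one level of unrolling: the fib(n - 1) call expanded inline
    return (fib(n - 2) + fib(n - 3)) + fib(n - 2)

The additions performed are unchanged, but each level of unrolling removes a constant fraction of the calls the benchmark was trying to count.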
I would guess that most programs have a subroutine call on the order of once every 100 instructions, spending something like 15% of their time on subroutine call and return. This is a very difficult number to quantify because of the pervasive indirect costs of subroutine calls: less efficient register allocation, forgone opportunities for constant folding and specialization, objects stored in memory rather than in registers, allocation on the heap rather than a stack, etc.
I wasn't able to find good information about how many instructions per subroutine call computers typically run these days. Probably I'm not using the right search terms, but here's what I found.
I just ran Emacs a few times under valgrind --tool=callgrind. Perhaps surprisingly, it was usably fast once it started up. I analyzed callgrind's output file with the following highly sophisticated methodology to determine the total number of subroutine calls (and presumably returns):
import re, sys

# total up the calls=N records in callgrind's output file, read on stdin
print(sum(int(mo.group(1)) for line in sys.stdin
          for mo in [re.match(r'^calls=(\d+)', line)] if mo))
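(The input here is the callgrind.out.<pid> file that valgrind leaves in the current directory.)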
This, plus some division by numbers of executed instructions reported, revealed that it had a subroutine call for about every 55–59 instructions, and about a hundred and fifty thousand extra subroutine calls (and 7.7 million extra instructions) per character typed.
This suggests that, at least in Emacs, the cost of subroutine call and return is probably higher than 15%: if each call costs something like 15 instructions' worth of direct and indirect overhead, as my guess above assumes, then one call per 57 instructions works out to more like 25%.
Other programs I tested:
- GNU ls: 79 instructions per call
- PARI/GP: 360 instructions per call
- CPython with Numpy: 88 instructions per call
- Minetest (rather unplayable): 88 instructions per call
Are you able to figure out how many of those are calls of leaf functions?
On modern ISAs and ABIs calling a small leaf function is very cheap -- just the `jal` and `ret`, plus getting the arguments into the right registers first. In many cases memory is not touched at all (unless the code in the function necessarily touches memory), no stack frame is created etc.
Hmm, I think that ought to be pretty doable with the callgrind output actually. I'm using amd64 rather than a modern ISA, so the calls do necessarily touch "memory", but leaf functions are still especially cheap—they can use temporary registers for local variables. In the case of Emacs most of the functions are compiled Lisp, and I doubt the Lisp compiler is smart enough to exploit leafness in its register allocator.
Okay, so, I wrote a program http://canonical.org/~kragen/sw/dev3/leafcalls.py to list the leaf functions with the number of calls to each, and totaled up the results. Assuming I didn't fuck up the analysis, which is never a good bet, here it is:
I did look at the output file listing the leaf functions and it is at least not obviously wrong.
So, if this is right, typically almost half of all calls are calls to leaf functions (functions that never call another function), although my Emacs and Numpy runs have only about a third and a fourth of their calls being leaf calls. All of these numbers are surprisingly low to me; I would have expected the answer to be more like two thirds to three fourths.
There's an optimization I've applied sometimes, the opportunity for which wouldn't show up here: when a function can return before it calls any other subroutine, you can sometimes avoid doing all the foofaraw of a non-leaf call by delaying it until after you know that you're on a non-leaf path. This isn't always a pure win, though, and I've never seen a compiler do it. The call graph output by callgrind only has subroutine granularity, so, absent any alternative, I'm only counting calls as leaf calls if they are calls to a leaf subroutine, that is, a subroutine that never calls other subroutines.
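In outline the analysis is something like the following sketch (my reconstruction of the idea, not the actual leafcalls.py linked above); it assumes the fn=/cfn=/calls= record structure of callgrind's output and its (id) name-compression scheme:

import re, sys
from collections import Counter

names = {}             # callgrind's (id) name-compression table
calls_to = Counter()   # callee name -> number of calls made to it
makes_calls = set()    # functions observed making at least one call
cur_fn = cur_callee = None

def decode(s):
    # "(42) name" defines compressed name 42; a bare "(42)" refers back to it
    mo = re.match(r'\((\d+)\)\s*(.*)', s)
    if not mo:
        return s
    num, name = mo.groups()
    if name:
        names[num] = name
    return names.get(num, num)

for line in sys.stdin:
    line = line.rstrip('\n')
    if line.startswith('fn='):
        cur_fn = decode(line[3:])
    elif line.startswith('cfn='):
        cur_callee = decode(line[4:])
    elif line.startswith('calls='):
        makes_calls.add(cur_fn)
        calls_to[cur_callee] += int(line.split()[0][6:])

leaf = sum(n for f, n in calls_to.items() if f not in makes_calls)
total = sum(calls_to.values())
print(leaf, 'of', total, 'calls are to leaf functions')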
Note: these could be tail-calls, and are if you use `-O2` or `-Os`, but that's not important. Make it `foo(n) + 13` and you still get the "don't make a stack frame unless needed" optimisation.
Oh, here's an interesting one. Arm proactively pushes everything, x86 and RISC-V only if `err()` will be called.
I also would expect the number of function executions which are dynamically leaf functions [1] to be greater than 50%, except for the kind of perverse call chains you get in Java and perhaps some other OO code bases, where there are huge numbers of layers of abstraction and each method simply calls one other lower-level method.
[1] i.e. that don't call another function on that execution, though in other circumstances they could.
I wonder why the second example frustrates the optimization—but only on ARM? Also, why is it pushing r0 and r1? That's stupid. It's also stupid that it's using an extra `add` instruction to un-push them—if it doesn't want to overwrite r0, it could pop the worthlessly pushed values into r1 and r2, or r3 and r4. I tried a bunch of different versions of GCC on your example code, and they all seem to produce code that's bad in obvious ways, but they're different obvious ways.
On further thought, if you have an instruction cache, adding 8 to SP is a faster way to discard the two useless values than popping them into unused registers. So that part is arguably defensible. But why push them? There can't be a requirement for a 16-byte-aligned stack pointer at all times; the add itself would violate it.
That's 24 bytes because gas is using Thumb2 mov.w instructions to initialize r1 and r2. It took me a long time to figure out that the problem was mov vs. movs; this is the correct 20-byte version:
.cpu cortex-m4
.thumb
.syntax unified
.thumb_func
.globl fib
fib: movs r1, #1
movs r2, #1
1: subs r0, #1
itt lo
movlo r0, r1
bxlo lr
mov r3, r2
adds r2, r1
mov r1, r3
b 1b
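(That's ten instructions, each with a 16-bit encoding, for 20 bytes.)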
I swear, I keep making the same n00b mistakes, year after year!
It might be possible to get it down to 18 bytes, but I don't see how.