It certainly does depend on which stack machine. I'm surprised that Wasm doesn't have a one-byte encoding for (local.get 0) or (local.set 0)! As you point out, even the JVM has one. My survey of bytecode instruction sets in https://dercuano.github.io/notes/tiny-interpreters-for-micro... found such an 8-bit or 5-bit instruction in nearly every example except CPython, with a 3-to-5-bit operand field in the local.get and local.set instruction byte, but of course Wasm didn't exist in 02007. The JVM only offering 2 bits is atypically stingy.
RVC and Thumb2 are pretty extreme cases of optimized register-machine encoding, and probably not something you'd want to decode in software on a 6502. The comparable stack-machine code would be something like Chuck Moore's x18/F18A with its four instructions per 18-bit word or James Bowman's J1A.
It also depends, as you say, on the code. Most code is somewhere in between the extremes of x += y and your example of
dstRect.bottom := Max(dstRect.bottom,dstRect.top + txMinHeight);
with statements like these:
thePort^.grafProcs := Nil; { restore to normal }
if (n != 0 && --n != 0) ...
target_command = entry_i + CAM_CONTENT_COUNT * j;
tfd = open(ptr, O_RDWR|O_TRUNC|O_EXCL|O_CREAT, 0600);
for (size_t i = 1; i < n; i++) ...
if (event == EV_enter_state) ... else if (event == EV_tick) ... else if (event == EV_reenter_state) ...
Those are all taken from my examples in https://dernocua.github.io/notes/c-stack-bytecode.html. There are a lot of subroutine calls, some reuse of the same value in multiple computations, many more variable reads than writes, and an occasional infix expression with more than one operator. These are all factors that favor stack instruction sets more than x += y does.
Subroutine calls in particular are typically a single bytecode once all the operands are on the stack, referring to a subroutine name or address stashed elsewhere with a short literal index, which similarly is usually allocated a 3–5-bit field in the opcode byte. Very popular subroutines like car or #at:put: can get their own bytecode.
There's a bit of a difference between the 8 registers in the PDP-11 or the 8086 and the 3-bit local variable field in an Emacs Lisp bytecode byte. The stack pointer soaks up one of those slots; on the PDP, the program counter soaks up another. (On ARM the link register is sometimes a third. Thumb and RVC do better here by excluding these registers from their 3-bit namespaces.) Analogously, one of the 8 local-variable indices in Elisp is "see next byte". Also, though, stack machines don't need to spend local variable indices on temporaries, and they often have instance-variable-access opcodes as well. So in effect register machines have fewer local variables than at first appears.
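Just to make that "see next byte" escape concrete, here's a toy Python sketch of decoding one instruction of a made-up stack bytecode with a 5-bit opcode field and a 3-bit local-variable field; the numbers and field widths are invented for illustration, not Elisp's (or anything else's) actual encoding:

def decode(code, pc):
    """Decode one instruction of a hypothetical stack bytecode.

    Layout (invented for illustration): high 5 bits = opcode,
    low 3 bits = local-variable index, with index 7 meaning
    "the real index is in the next byte"."""
    byte = code[pc]
    op, idx = byte >> 3, byte & 0b111
    if idx == 7:                  # escape: wide operand
        return op, code[pc + 1], pc + 2
    return op, idx, pc + 1        # common case: one byte

# local.get of locals 0..6 costs one byte; local 200 costs two:
print(decode(bytes([0b00010_010]), 0))        # (2, 2, 1)
print(decode(bytes([0b00010_111, 200]), 0))   # (2, 200, 2)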
Incidentally, as that note mentions, Aztec C for the 6502 did support a bytecode interpreter to more than halve code size:
> As an alternative, the pseudo-code C compiler, CCI, produces machine language for a theoretical machine with 8, 16 and 32 bit capabilities. This machine language is interpreted by an assembly language program that is about 3000 bytes in size.
> The effects of using CCI are twofold. First, since one instruction can manipulate a 16 or 32 bit quantity, the size of the compiled program is generally more than fifty percent smaller than the same program compiled with C65 [the Aztec C native code compiler for the 6502]. However, interpreting the pseudo-code incurs an overhead which causes the execution speed to be anywhere from five to twenty times slower.
But I don't know whether it was a stack bytecode, an accumulator bytecode, a register bytecode, or something weirder.
My experience programming in FORTH is that often I can profitably keep about one variable on the operand stack, so accessing that variable takes no instructions and no bytes. In theory this is something that a compiler could do, but the only case I know of is when you store to a variable in Smalltalk inside a larger expression. Similarly it's common in FORTH to spend no instructions on accessing arguments or returning a value; you just don't drop it before returning.
With respect to your tiny interpreters page, I'd just like to point out that Thumb code for recursive fib() is 26 bytes, very close to Squeak (this is out of GCC with -Os, I haven't checked if I can do better manually).
Both the RISC-V and Thumb code here are no longer recursive fib; GCC has quite correctly transformed one of the recursions into a normal nonrecursive loop by exploiting the commutative and associative properties of integer addition. I infer that perhaps the same thing may be true of the MSP430 code from the presence of only a single call instruction, but I don't see the loop; perhaps the br to 0 will be relinked to go to .L3 before execution?
I haven't seen MSP430 code before, due to my somewhat embarrassing and unfair distaste for TI as a company, and I agree that it is a very nice encoding indeed!
If you insist that the code contain at least two recursive calls, two subtractions, an addition, and a conditional for the base case, even Thumb2 and RVC no longer look so competitive with stack machines. If not, there is no particular reason to leave one of the recursions in; you may as well optimize that out too, thus reducing an exponential-time algorithm to a linear-time one (or linear to logarithmic, if the parameter is the number of bits of input). This also makes it smaller, something like (untested):
mov r1, #1
mov r2, #1
1: subs r0, #1
itt lo
movlo r0, r1
bxlo lr
mov r3, r2
adds r2, r1
mov r1, r3
b 1b
I think that's maybe 22 bytes?
But I don't think that tells us much about the relative densities of stack code, register code, accumulator code, and belt code, because it's a very different mix of operations.
> Both the RISC-V and Thumb code here are no longer recursive fib
If the programmer wrote a doubly-recursive fib() and the compiler transformed one to tail-recursion, then that's a doubly-recursive fib() as far as I'm concerned. If you don't want that to happen then use a benchmark that isn't so easily optimised, such as maybe Ackermann.
Note that this optimisation doesn't change the calculation performed or the execution time from being exponential; it merely saves some stack manipulation.
Having access to good compilers is part of the attraction of using a real ISA instead of a made-up one.
Changing the algorithm to an iterative linear-time one is not an optimisation I'd expect a compiler to do.
I agree, but what we're trying to look at here is how space-efficient the ISA is at encoding computational operations. If one of the example programs encodes a totally different set of computational operations, it stops being an informative comparison.
But, you may reasonably object, this is handwaving! If a subroutine call is a "computational operation", why isn't duplicating a stack item or storing a variable? The stack code is cheating by not storing all its temporaries into variables! And I think the answer is that there's a bit of a slippery slope there, where more extensive program transformations give us less interesting answers. I would argue that transformations like not allocating registers to temporaries (or, say, subroutine arguments) are more broadly applicable than the transformation you're exhibiting, where the compiler has changed a non-tail call to dumbfib, followed by an addition, into an addition followed by a tail-recursive call implemented as a jump. That depends on being able to reassociate the entire chain of additions through the entire call chain, which is something I wouldn't have expected a compiler to be able to do.
I agree that it's more interesting to see benchmarks that aren't so easily gamed by optimizers, which is why in https://dernocua.github.io/notes/c-stack-bytecode.html I selected a bunch of small C and Pascal subroutines quasi-randomly and hand-compiled them into different candidate stack bytecode representations. I'm not sure that Ackermann or Tak would be as interesting.
The main weakness is -- the opposite of dumbfib -- that it has no function calls at all!
But I find it correlates well with other CPU core & L1 cache/SRAM benchmarks such as Dhrystone and Coremark. The Embench guys added a cut-down version -- smaller arrays, shorter execution time [1] -- to their benchmark suite a couple of years back.
[1] baseline ~4s on typical microcontrollers ... I'd started at ~5s on Ivy Bridge, 10-30s on the Arm boards I had back then
It does have loops, integer multiplication, and array indexing (mostly reading), and all of those are pretty important.
Subroutine calls are particularly difficult to benchmark because they are so critical to real-world performance and so easy for compilers to remove from a benchmark if they can guess that a particular callsite will be hot. (LuaJIT takes this to the extreme.)
You'd think non-tail recursion would stymie this, but I've seen GCC unroll a recursive loop several layers deep and constant-fold away several layers over the base case. Impressive, useful, and completely invalidated my dumb benchmark.
I would guess that most programs have a subroutine call on the order of once every 100 instructions, spending something like 15% of their time on subroutine call and return. This is a very difficult number to quantify because of the pervasive indirect costs of subroutine calls: less efficient register allocation, forgone opportunities for constant folding and specialization, objects stored in memory rather than in registers, allocation on the heap rather than a stack, etc.
I wasn't able to find good information about how many instructions per subroutine call computers typically run these days. Probably I'm not using the right search terms, but here's what I found.
I just ran Emacs a few times under valgrind --tool=callgrind. Perhaps surprisingly, it was usably fast once it started up. I analyzed callgrind's output file with the following highly sophisticated methodology to determine the total number of subroutine calls (and presumably returns):
import re, sys
print(sum(int(mo.group(1)) for line in sys.stdin
          for mo in [re.match(r'^calls=(\d+)', line)] if mo))
This, plus some division by numbers of executed instructions reported, revealed that it had a subroutine call for about every 55–59 instructions, and about a hundred and fifty thousand extra subroutine calls (and 7.7 million extra instructions) per character typed.
This suggests that, at least in Emacs, the cost of subroutine call and return is probably higher than 15%, maybe 25%.
Other programs I tested:
- GNU ls: 79 instructions per call
- PARI/GP: 360 instructions per call
- CPython with Numpy: 88 instructions per call
- Minetest (rather unplayable): 88 instructions per call
Are you able to figure out how many of those are calls of leaf functions?
On modern ISAs and ABIs calling a small leaf function is very cheap -- just the `jal` and `ret`, plus getting the arguments into the right registers first. In many cases memory is not touched at all (unless the code in the function necessarily touches memory), no stack frame is created etc.
Hmm, I think that ought to be pretty doable with the callgrind output actually. I'm using amd64 rather than a modern ISA, so the calls do necessarily touch "memory", but leaf functions are still especially cheap—they can use temporary registers for local variables. In the case of Emacs most of the functions are compiled Lisp, and I doubt the Lisp compiler is smart enough to exploit leafness in its register allocator.
Okay, so, I wrote a program http://canonical.org/~kragen/sw/dev3/leafcalls.py to list the leaf functions with the number of calls to each, and totaled up the results. Assuming I didn't fuck up the analysis, which is never a good bet, here it is:
I did look at the output file listing the leaf functions and it is at least not obviously wrong.
So, if this is right, typically almost half of all calls are calls to leaf functions (functions that never call another function), although my Emacs and Numpy runs have only about a third and a fourth of their calls being leaf calls. All of these numbers are surprisingly low to me; I would have expected the answer to be more like two thirds to three fourths.
There's an optimization I've sometimes applied whose opportunity wouldn't show up here: on a leaf call that returns before it calls any other subroutines, you can sometimes avoid doing all the foofaraw for a non-leaf call by delaying it until after you know that you're in a non-leaf case. This isn't always a pure win, though, and I've never seen a compiler do it. The call graph output by callgrind only has subroutine granularity, so, absent any alternative, I'm only counting calls as leaf calls if they are calls to a leaf subroutine, that is, a subroutine that never calls other subroutines.
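For what it's worth, that counting rule is easy to approximate directly from the callgrind output file; here's a minimal Python sketch of the idea, assuming callgrind was run with --compress-strings=no so the fn= and cfn= lines carry literal names (the actual leafcalls.py linked above may differ in its details):

import re
import sys
from collections import defaultdict

calls_to = defaultdict(int)   # total dynamic calls into each function
callers = set()               # functions seen making at least one call
current_fn = callee = None

for line in sys.stdin:
    if line.startswith('fn='):
        current_fn = line[3:].strip()
    elif line.startswith('cfn='):
        callee = line[4:].strip()
        if current_fn is not None:
            callers.add(current_fn)    # current_fn is not a leaf
    else:
        mo = re.match(r'calls=(\d+)', line)
        if mo and callee is not None:
            calls_to[callee] += int(mo.group(1))

total = sum(calls_to.values())
leaf = sum(n for fn, n in calls_to.items() if fn not in callers)
if total:
    print(f'{leaf} of {total} calls ({leaf/total:.0%}) go to leaf functions')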
Note: these could be tail-calls, and are if you use `-O2` or `-Os`, but that's not important. Make it `foo(n) + 13` and you still get the "don't make a stack frame unless needed" optimisation.
Oh, here's an interesting one. Arm proactively pushes everything, x86 and RISC-V only if `err()` will be called.
I also would expect the number of function executions which are dynamically leaf functions [1] to be greater than 50%. Except for the kind of perverse call chains you get in Java and perhaps some other OO code bases where there are huge numbers of layers of abstraction where each method simply calls one other lower level method.
[1] i.e. that don't call another function on that execution, though in other circumstances they could.
I wonder why the second example frustrates the optimization—but only on ARM? Also, why is it pushing r0 and r1? That's stupid. It's also stupid that it's using an extra `add` instruction to un-push them—if it doesn't want to overwrite r0, it could pop the worthlessly pushed values into r1 and r2, or r3 and r4. I tried a bunch of different versions of GCC on your example code, and they all seem to produce code that's bad in obvious ways, but they're different obvious ways.
On further thought, if you have an instruction cache, adding 8 to SP is a faster way to discard the two useless values than popping them into unused registers. So that part is arguably defensible. But why push them? There can't be a requirement for a 16-byte-aligned stack pointer at all times—the add would violate it.
That's 24 bytes because gas is using Thumb2 mov.w instructions to initialize r1 and r2. It took me a long time to figure out that the problem was mov vs. movs; this is the correct 20-byte version:
.cpu cortex-m4
.thumb
.syntax unified
.thumb_func
.globl fib
fib: movs r1, #1
movs r2, #1
1: subs r0, #1
itt lo
movlo r0, r1
bxlo lr
mov r3, r2
adds r2, r1
mov r1, r3
b 1b
I swear, I keep making the same n00b mistakes, year after year!
It might be possible to get it down to 18 bytes, but I don't see how.
Oh, cool, I'll take a look. Having started with the Apple ][ in 1980 I'm already well aware of SWEET16 and admire its effectiveness, its integration with native code, and the small size of the interpreter.
> it's common in FORTH to spend no instructions on accessing arguments or returning a value
Yes, good point. Stack machines might not be good for code inside a function, but they often are good on function call/return, if you can arrange things to not need extra stack shuffling. Again, this matches very well to evaluating complex arithmetic expressions where some of the operations are function calls not inline code.
RISC-V has recent extensions -- so far the RP2350 is the most common chip that implements them -- that optimise function prologues and epilogues, and the marshalling of function arguments into S registers and of S registers into function argument registers for a call. See the Zcmp extension.
Also, Zcmt provides optimised 2-byte jumps and calls to up to 256 well-known arbitrary entry points: 32 jumps and 224 calls.
> RVC and Thumb2 are pretty extreme cases of optimized register-machine encoding, and probably not something you'd want to decode in software on a 6502.
I wasn't thinking of 6502. I don't think either JVM or WASM runs there :-) If you're JITing then decoding complexity really doesn't matter much, providing it doesn't increase the VM's code size too much.
I've a couple of times worked on some notes and the beginnings of an interpreter for MSP430 on 6502. I think it can be relatively small and relatively fast. There are very few instructions, and the encoding is very regular. And there is a good optimising GCC for it.
If you're not familiar with it, think "a PDP-11 extended to 16 registers and a 4-bit opcode (plus byte/word) instead of 3 bits, by trimming the src addressing modes to 4 (Rs, @Rs, @Rs+, 0xNNNN(Rs)), plus immediate and absolute (OK, PC-relative) obtained by punning on PC, and trimming the dst addressing modes to just 2 (Rd, 0xNNNN(Rd)), plus the punning on PC".
Decoding is fairly 6502-friendly, with the src and dst registers in the low 4 bits of each byte of the instruction, the opcode in the high 4 bits of one byte, and the two addressing modes and B/W flag in the high bits of the other byte.
If you keep the hi and lo bytes of each register in separate arrays in Zero Page then you don't even need to shift the register numbers -- just mask them off and shove them into the X & Y registers. Similarly, the opcode can just be masked and then shoved into a location that is the bottom half of a modified JMP or JSR, or a JMP (zp), giving 16 bytes of code to get started. Or shift it once if 8 bytes is enough. Similarly, the two addressing modes and the B/W flag don't need to be parsed with shifting and masking (at least if you care about speed more than interpreter size); you can just mask that byte and also use a jump table to "decode" it.
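Just to pin down that field layout, here's a little Python sketch splitting an MSP430 two-operand (format I) instruction word into those fields per the standard encoding; this is only the bit-field extraction, not any of the 6502 jump-table machinery:

def decode_format1(lo, hi):
    """Split an MSP430 format-I (two-operand) instruction into fields.

    hi byte: oooo ssss       (opcode nibble, source register)
    lo byte: A b mm dddd     (Ad, B/W, As, destination register)"""
    opcode = hi >> 4          # high nibble of one byte: the operation
    src    = hi & 0x0F        # low nibble: source register number
    dst    = lo & 0x0F        # low nibble: destination register number
    ad     = (lo >> 7) & 1    # destination addressing mode (1 bit)
    bw     = (lo >> 6) & 1    # 0 = word operation, 1 = byte operation
    as_    = (lo >> 4) & 3    # source addressing mode (2 bits)
    return opcode, src, dst, ad, bw, as_

# MOV R10, R11 encodes as 0x4A0B: opcode 4, register-to-register, word size
print(decode_format1(0x0B, 0x4A))   # (4, 10, 11, 0, 0, 0)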
--
Also, there is an intermediate stage between full interpretation and full JITing. I don't know if you've looked at the "Spike" RISC-V reference emulator. It doesn't normally decode instructions, it hashes them to look up a pre-decoded form with src and dst registers and any immediate/offset decoded into a struct, along with a pointer to the code to interpret the instruction. The hash table size is 8000 and that gets a good hit rate on most code. Of course if there is a hash table miss then the instruction is actually decoded the hard way and inserted into the hash table.
This wouldn't be a good fit for an 8 bit micro (unless you knocked the hash table size down a lot) but would be fine with even 1 MB of RAM.
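Roughly, the scheme looks like this (a toy Python sketch of the caching idea, not Spike's actual code; the table size and the stand-in decoder are placeholders):

TABLE_SIZE = 8192
cache = [None] * TABLE_SIZE       # each slot: (raw word, decoded form)

def full_decode(word):
    # Stand-in for the real, slow decoder: return whatever pre-chewed
    # form the execute loop wants (here, a few RISC-V-style fields).
    opcode = word & 0x7F
    rd  = (word >> 7)  & 0x1F
    rs1 = (word >> 15) & 0x1F
    rs2 = (word >> 20) & 0x1F
    return (opcode, rd, rs1, rs2)

def decode(word):
    slot = word % TABLE_SIZE              # cheap hash of the raw bits
    entry = cache[slot]
    if entry is not None and entry[0] == word:
        return entry[1]                   # hit: reuse the predecoded form
    decoded = full_decode(word)           # miss: decode the hard way...
    cache[slot] = (word, decoded)         # ...and remember it
    return decoded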
--
>> The effects of using CCI are twofold. First, since one instruction can manipulate a 16 or 32 bit quantity, the size of the compiled program is generally more than fifty percent smaller than the same program compiled with C65 [the Aztec C native code compiler for the 6502]. However, interpreting the pseudo-code incurs an overhead which causes the execution speed to be anywhere from five to twenty times slower.
I'll just point out again that the native code technique I showed in my original message up-thread cuts a 16 bit register-to-register add down from 13 bytes of code to 7 bytes (or fewer for sequential operations on the same register) at a cost of increasing execution time from 20 cycles to 42 i.e. not a 5x-20x slowdown but only a 2.1x slowdown.
For a 32 bit add it reduces the code size from 25 bytes to 7, and increases the execution time from 38 cycles to 66, a 1.7x slowdown.
Well, I don't know how "native" the native code compilation for C65 is. I'm assuming all operations are inline for speed.
cm.push and cm.popret seem like very appealing instructions from this point of view, yes! They look more expressive than ldm/stm. I imagine they would hurt interrupt latency? How do they affect page fault handling in practice?
If you are going to add non-RISC features to your ISA to ensmallen prologues and epilogues, even at the expense of interrupt latency, another possibility is to add semantics to the call and return instructions. Perhaps you could describe the required activation record structure in a header word before the subroutine entry point that need not follow the usual ISA encoding at all, and if the return instruction can reliably find the same header word (on the stack, say, or in a callee-saved register like lr) there is no need for it to contain a redundant copy of the same information.
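Purely as a strawman (this is not any real ISA, just an illustration of the idea), such a header word might hold a callee-saved-register mask and a frame size, read symmetrically by the call and return instructions:

# Hypothetical activation-record header word: low 12 bits = bitmask of
# callee-saved registers to spill, high 4 bits = extra frame size in words.

def pack_header(saved_reg_mask, frame_words):
    assert saved_reg_mask < (1 << 12) and frame_words < (1 << 4)
    return (frame_words << 12) | saved_reg_mask

def call_effect(header):
    mask, frame = header & 0xFFF, header >> 12
    regs = [r for r in range(12) if mask & (1 << r)]
    # call would push these registers and drop SP by `frame` words;
    # return finds the same header (via lr or the stack) and undoes it.
    return regs, frame

print(call_effect(pack_header(0b000000001110, 3)))   # ([1, 2, 3], 3)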
The MSP430 sounds very appealing in some ways, and Mecrisp-Stellaris is a usable self-licking development environment for it. Too bad it's 16-bit. How small is the smallest FPGA implementation? OpenMSP430 says it's 1650 6-LUTs.
I had no idea about the Spike hash table approach. It's a totally new approach to emulation to me. It makes a lot of sense especially in the RISC-V context where EEs threw the instruction word layout in a blender to reduce their propagation delays—a decision I endorse despite my initial horror.
> I'm interested to hear what you used SWEET16 for.
Just generic coding that needed 16 bit calculations but wasn't speed-critical. Pointer-chasing code is particularly tiresome on 6502.
> cm.push and cm.popret seem like very appealing instructions from this point of view, yes! They look more expressive than ldm/stm. I imagine they would hurt interrupt latency?
They are designed to be easily interruptible / restartable with a variety of possible implementations (non-interruptible, complete restart, continue where you left off). The tradeoff is up to the designer of the individual CPU. A basic idea is to hold them in the decoder and emit normal instructions to the execution pipeline, while simplifying the original instruction towards a base case. A correct (interrupt-safe) sequence of normal instructions to generate is given in the specification.
Also, that code didn't show other instructions to copy `a0` and `a1` to two arbitrary `sN` registers, or to copy two arbitrary `sN` registers to `a0` and `a1`. This helps a lot both implementing and calling functions with two or more arguments, if you're not using arithmetic to marshal the arguments anyway.
> How do they affect page fault handling in practice?
It is unlikely that anyone will ever make a machine with both Zcmp (and Zcmt) and page faults. Architects of high performance CPUs would have a fit, and besides, they use the same opcode space as RVC instructions for floating point load/store, making them incompatible with standard Linux software.
> Perhaps you could describe the required activation record structure in a header word before the subroutine entry point that need not follow the usual ISA encoding at all, and if the return instruction can reliably find the same header word (on the stack, say, or in a callee-saved register like lr) there is no need for it to contain a redundant copy of the same information.
VAX, begone!
As Patterson showed, JSR subroutines outperform CALLS/G subroutines.
Yes, VAX CALLS/CALLG is pretty much what I'm talking about, but with arguments in registers.
I think the performance landscape has changed somewhat since Patterson; if your instruction decoder is shattering each instruction into a dependency graph of separately executable micro-ops anyway to support OoO, you don't have to slow down your whole pipeline to support elaborate instructions like cm.push or CALLG. As I see it, though, a couple of things haven't changed:
- Having arguments in registers is faster than having arguments in memory.
- Page fault handling is unfriendly to architectural instructions having multiple effects, because the page fault can't be handled at the micro-op level; it has to be handled at the architectural level. But maybe this is a smaller problem with OoO and especially speculative execution, I don't know.
If you can safely get by with a complete restart of the instruction, though, the page fault problem goes away.
I think that solves the problems Patterson identified?