It certainly does depend on which stack machine. I'm surprised that Wasm doesn't have a one-byte encoding for (local.get 0) or (local.set 0)! As you point out, even the JVM has one. My survey of bytecode instruction sets in https://dercuano.github.io/notes/tiny-interpreters-for-micro... found such an 8-bit or 5-bit instruction in nearly every example except CPython, with a 3-to-5-bit operand field in the local.get and local.set instruction byte, but of course Wasm didn't exist in 02007. The JVM only offering 2 bits is atypically stingy.
RVC and Thumb2 are pretty extreme cases of optimized register-machine encoding, and probably not something you'd want to decode in software on a 6502. The comparable stack-machine code would be something like Chuck Moore's x18/F18A with its four instructions per 18-bit word or James Bowman's J1A.
It also depends, as you say, on the code. Most code is somewhere in between the extremes of x += y and your example of
dstRect.bottom := Max(dstRect.bottom,dstRect.top + txMinHeight);
with statements like these:
thePort^.grafProcs := Nil; { restore to normal }
if (n != 0 && --n != 0) ...
target_command = entry_i + CAM_CONTENT_COUNT * j;
tfd = open(ptr, O_RDWR|O_TRUNC|O_EXCL|O_CREAT, 0600);
for (size_t i = 1; i < n; i++) ...
if (event == EV_enter_state) ... else if (event == EV_tick) ... else if (event == EV_reenter_state) ...
Those are all taken from my examples in https://dernocua.github.io/notes/c-stack-bytecode.html. There are a lot of subroutine calls, some reuse of the same value in multiple computations, many more variable reads than writes, and an occasional infix expression with more than one operator. These are all factors that favor stack instruction sets more than x += y does.
Subroutine calls in particular are typically a single bytecode once all the operands are on the stack, referring to a subroutine name or address stashed elsewhere with a short literal index, which similarly is usually allocated a 3–5-bit field in the opcode byte. Very popular subroutines like car or #at:put: can get their own bytecode.
There's a bit of a difference between the 8 registers in the PDP-11 or the 8086 and the 3-bit local variable field in an Emacs Lisp bytecode byte. The stack pointer soaks up one of those slots; on the PDP, the program counter soaks up another. (On ARM the link register is sometimes a third. Thumb and RVC do better here by excluding these registers from their 3-bit namespaces.) Analogously, one of the 8 local-variable indices in Elisp is "see next byte". Also, though, stack machines don't need to spend local variable indices on temporaries, and they often have instance-variable-access opcodes as well. So in effect register machines have fewer local variables than at first appears.
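Just to make that "see next byte" escape concrete, here's a toy Python sketch of decoding one instruction of a made-up stack bytecode with a 5-bit opcode field and a 3-bit local-variable field; the numbers and field widths are invented for illustration, not Elisp's (or anything else's) actual encoding:

def decode(code, pc):
    """Decode one instruction of a hypothetical stack bytecode.

    Layout (invented for illustration): high 5 bits = opcode,
    low 3 bits = local-variable index, with index 7 meaning
    "the real index is in the next byte"."""
    byte = code[pc]
    op, idx = byte >> 3, byte & 0b111
    if idx == 7:                  # escape: wide operand
        return op, code[pc + 1], pc + 2
    return op, idx, pc + 1        # common case: one byte

# local.get of locals 0..6 costs one byte; local 200 costs two:
print(decode(bytes([0b00010_010]), 0))        # (2, 2, 1)
print(decode(bytes([0b00010_111, 200]), 0))   # (2, 200, 2)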
Incidentally, as that note mentions, Aztec C for the 6502 did support a bytecode interpreter to more than halve code size:
> As an alternative, the pseudo-code C compiler, CCI, produces machine language for a theoretical machine with 8, 16 and 32 bit capabilities. This machine language is interpreted by an assembly language program that is about 3000 bytes in size.
> The effects of using CCI are twofold. First, since one instruction can manipulate a 16 or 32 bit quantity, the size of the compiled program is generally more than fifty percent smaller than the same program compiled with C65 [the Aztec C native code compiler for the 6502]. However, interpreting the pseudo-code incurs an overhead which causes the execution speed to be anywhere from five to twenty times slower.
But I don't know whether it was a stack bytecode, an accumulator bytecode, a register bytecode, or something weirder.
My experience programming in FORTH is that often I can profitably keep about one variable on the operand stack, so accessing that variable takes no instructions and no bytes. In theory this is something that a compiler could do, but the only case I know of is when you store to a variable in Smalltalk inside a larger expression. Similarly it's common in FORTH to spend no instructions on accessing arguments or returning a value; you just don't drop it before returning.
With respect to your tiny interpreters page, I'd just like to point out that Thumb code for recursive fib() is 26 bytes, very close to Squeak (this is out of GCC with -Os, I haven't checked if I can do better manually).
Both the RISC-V and Thumb code here are no longer recursive fib; GCC has quite correctly transformed one of the recursions into a normal nonrecursive loop by exploiting the commutative and associative properties of integer addition. I infer that perhaps the same thing may be true of the MSP430 code from the presence of only a single call instruction, but I don't see the loop; perhaps the br to 0 will be relinked to go to .L3 before execution?
I haven't seen MSP430 code before, due to my somewhat embarrassing and unfair distaste for TI as a company, and I agree that it is a very nice encoding indeed!
If you insist that the code contain at least two recursive calls, two subtractions, an addition, and a conditional for the base case, even Thumb2 and RVC no longer look so competitive with stack machines. If not, there is no particular reason to leave one of the recursions in; you may as well optimize that out too, thus reducing an exponential-time algorithm to a linear-time one (or linear to logarithmic, if the parameter is the number of bits of input). This also makes it smaller, something like (untested):
mov r1, #1
mov r2, #1
1: subs r0, #1
itt lo
movlo r0, r1
bxlo lr
mov r3, r2
adds r2, r1
mov r1, r3
b 1b
I think that's maybe 22 bytes?
But I don't think that tells us much about the relative densities of stack code, register code, accumulator code, and belt code, because it's a very different mix of operations.
> Both the RISC-V and Thumb code here are no longer recursive fib
If the programmer wrote a doubly-recursive fib() and the compiler transformed one to tail-recursion, then that's a doubly-recursive fib() as far as I'm concerned. If you don't want that to happen then use a benchmark that isn't so easily optimised, such as maybe Ackermann.
Note that this optimisation doesn't change the calculation performed or the execution time from being exponential; it merely saves some stack manipulation.
Having access to good compilers is part of the attraction of using a real ISA instead of a made-up one.
Changing the algorithm to an iterative linear-time one is not an optimisation I'd expect a compiler to do.
I agree, but what we're trying to look at here is how space-efficient the ISA is at encoding computational operations. If one of the example programs encodes a totally different set of computational operations, it stops being an informative comparison.
But, you may reasonably object, this is handwaving! If a subroutine call is a "computational operation", why isn't duplicating a stack item or storing a variable? The stack code is cheating by not storing all its temporaries into variables! And I think the answer is that there's a bit of a slippery slope there, where more extensive program transformations give us less interesting answers. I would argue that transformations like not allocating registers to temporaries (or, say, subroutine arguments) are more broadly applicable than the transformation you're exhibiting, where the compiler has changed a non-tail call to dumbfib, followed by an addition, into an addition followed by a tail-recursive call implemented as a jump. That depends on being able to reassociate the entire chain of additions through the entire call chain, which is something I wouldn't have expected a compiler to be able to do.
I agree that it's more interesting to see benchmarks that aren't so easily gamed by optimizers, which is why in https://dernocua.github.io/notes/c-stack-bytecode.html I selected a bunch of small C and Pascal subroutines quasi-randomly and hand-compiled them into different candidate stack bytecode representations. I'm not sure that Ackermann or Tak would be as interesting.
The main weakness is -- the opposite of dumbfib -- that it has no function calls at all!
But I find it correlates well with other CPU core & L1 cache/SRAM benchmarks such as Dhrystone and Coremark. The Embench guys added a cut-down version -- smaller arrays, shorter execution time [1] -- to their benchmark suite a couple of years back.
[1] baseline ~4s on typical microcontrollers ... I'd started at ~5s on Ivy Bridge, 10-30s on the Arm boards I had back then
It does have loops, integer multiplication, and array indexing (mostly reading), and all of those are pretty important.
Subroutine calls are particularly difficult to benchmark because they are so critical to real-world performance and so easy for compilers to remove from a benchmark if they can guess that a particular callsite will be hot. (LuaJIT takes this to the extreme.)
You'd think non-tail recursion would stymie this, but I've seen GCC unroll a recursive loop several layers deep and constant-fold away several layers over the base case. Impressive, useful, and completely invalidated my dumb benchmark.
I would guess that most programs have a subroutine call on the order of once every 100 instructions, spending something like 15% of their time on subroutine call and return. This is a very difficult number to quantify because of the pervasive indirect costs of subroutine calls: less efficient register allocation, forgone opportunities for constant folding and specialization, objects stored in memory rather than in registers, allocation on the heap rather than a stack, etc.
I wasn't able to find good information about how many instructions per subroutine call computers typically run these days. Probably I'm not using the right search terms, but here's what I found.
I just ran Emacs a few times under valgrind --tool=callgrind. Perhaps surprisingly, it was usably fast once it started up. I analyzed callgrind's output file with the following highly sophisticated methodology to determine the total number of subroutine calls (and presumably returns):
import re, sys
print(sum(int(mo.group(1)) for line in sys.stdin
          for mo in [re.match(r'^calls=(\d+)', line)] if mo))
This, plus some division by numbers of executed instructions reported, revealed that it had a subroutine call for about every 55–59 instructions, and about a hundred and fifty thousand extra subroutine calls (and 7.7 million extra instructions) per character typed.
This suggests that, at least in Emacs, the cost of subroutine call and return is probably higher than 15%, maybe 25%.
Other programs I tested:
- GNU ls: 79 instructions per call
- PARI/GP: 360 instructions per call
- CPython with Numpy: 88 instructions per call
- Minetest (rather unplayable): 88 instructions per call
Are you able to figure out how many of those are calls of leaf functions?
On modern ISAs and ABIs calling a small leaf function is very cheap -- just the `jal` and `ret`, plus getting the arguments into the right registers first. In many cases memory is not touched at all (unless the code in the function necessarily touches memory), no stack frame is created etc.
Hmm, I think that ought to be pretty doable with the callgrind output actually. I'm using amd64 rather than a modern ISA, so the calls do necessarily touch "memory", but leaf functions are still especially cheap—they can use temporary registers for local variables. In the case of Emacs most of the functions are compiled Lisp, and I doubt the Lisp compiler is smart enough to exploit leafness in its register allocator.
Okay, so, I wrote a program http://canonical.org/~kragen/sw/dev3/leafcalls.py to list the leaf functions with the number of calls to each, and totaled up the results. Assuming I didn't fuck up the analysis, which is never a good bet, here it is:
I did look at the output file listing the leaf functions and it is at least not obviously wrong.
So, if this is right, typically almost half of all calls are calls to leaf functions (functions that never call another function), although my Emacs and Numpy runs have only about a third and a fourth of their calls being leaf calls. All of these numbers are surprisingly low to me; I would have expected the answer to be more like two thirds to three fourths.
There's an optimization I've sometimes applied whose opportunity wouldn't show up here: on a leaf call that returns before it calls any other subroutines, you can sometimes avoid doing all the foofaraw for a non-leaf call by delaying it until after you know that you're in a non-leaf case. This isn't always a pure win, though, and I've never seen a compiler do it. The call graph output by callgrind only has subroutine granularity, so, absent any alternative, I'm only counting calls as leaf calls if they are calls to a leaf subroutine, that is, a subroutine that never calls other subroutines.
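For what it's worth, that counting rule is easy to approximate directly from the callgrind output file; here's a minimal Python sketch of the idea, assuming callgrind was run with --compress-strings=no so the fn= and cfn= lines carry literal names (the actual leafcalls.py linked above may differ in its details):

import re
import sys
from collections import defaultdict

calls_to = defaultdict(int)   # total dynamic calls into each function
callers = set()               # functions seen making at least one call
current_fn = callee = None

for line in sys.stdin:
    if line.startswith('fn='):
        current_fn = line[3:].strip()
    elif line.startswith('cfn='):
        callee = line[4:].strip()
        if current_fn is not None:
            callers.add(current_fn)    # current_fn is not a leaf
    else:
        mo = re.match(r'calls=(\d+)', line)
        if mo and callee is not None:
            calls_to[callee] += int(mo.group(1))

total = sum(calls_to.values())
leaf = sum(n for fn, n in calls_to.items() if fn not in callers)
if total:
    print(f'{leaf} of {total} calls ({leaf/total:.0%}) go to leaf functions')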
Note: these could be tail-calls, and are if you use `-O2` or `-Os`, but that's not important. Make it `foo(n) + 13` and you still get the "don't make a stack frame unless needed" optimisation.
Oh, here's an interesting one. Arm proactively pushes everything, x86 and RISC-V only if `err()` will be called.
I also would expect the number of function executions which are dynamically leaf functions [1] to be greater than 50%. Except for the kind of perverse call chains you get in Java and perhaps some other OO code bases where there are huge numbers of layers of abstraction where each method simply calls one other lower level method.
[1] i.e. that don't call another function on that execution, though in other circumstances they could.
I wonder why the second example frustrates the optimization—but only on ARM? Also, why is it pushing r0 and r1? That's stupid. It's also stupid that it's using an extra `add` instruction to un-push them—if it doesn't want to overwrite r0, it could pop the worthlessly pushed values into r1 and r2, or r3 and r4. I tried a bunch of different versions of GCC on your example code, and they all seem to produce code that's bad in obvious ways, but they're different obvious ways.
On further thought, if you have an instruction cache, adding 8 to SP is a faster way to discard the two useless values than popping them into unused registers. So that part is arguably defensible. But why push them? There can't be a requirement for a 16-byte-aligned stack pointer at all times—the add would violate it.
That's 24 bytes because gas is using Thumb2 mov.w instructions to initialize r1 and r2. It took me a long time to figure out that the problem was mov vs. movs; this is the correct 20-byte version:
.cpu cortex-m4
.thumb
.syntax unified
.thumb_func
.globl fib
fib: movs r1, #1
movs r2, #1
1: subs r0, #1
itt lo
movlo r0, r1
bxlo lr
mov r3, r2
adds r2, r1
mov r1, r3
b 1b
I swear, I keep making the same n00b mistakes, year after year!
It might be possible to get it down to 18 bytes, but I don't see how.
Oh, cool, I'll take a look. Having started with the Apple ][ in 1980 I'm already well aware of SWEET16 and admire its effectiveness, its integration with native code, and the small size of the interpreter.
> it's common in FORTH to spend no instructions on accessing arguments or returning a value
Yes, good point. Stack machines might not be good for code inside a function, but they often are good on function call/return, if you can arrange things to not need extra stack shuffling. Again, this matches very well to evaluating complex arithmetic expressions where some of the operations are function calls not inline code.
RISC-V has recent extensions -- so far the RP2350 is the most common chip that implements them -- that optimise function prologues and epilogues, and the marshalling of function arguments into S registers and of S registers into function argument registers for a call. See the Zcmp extension.
Also, Zcmt provides optimised 2-byte jumps and calls to up to 256 well-known arbitrary entry points: 32 jumps and 224 calls.
> RVC and Thumb2 are pretty extreme cases of optimized register-machine encoding, and probably not something you'd want to decode in software on a 6502.
I wasn't thinking of 6502. I don't think either JVM or WASM runs there :-) If you're JITing then decoding complexity really doesn't matter much, providing it doesn't increase the VM's code size too much.
I've a couple of times worked on some notes and the beginnings of an interpreter for MSP430 on 6502. I think it can be relatively small and relatively fast. There are very few instructions, and the encoding is very regular. And there is a good optimising GCC for it.
If you're not familiar with it, think "a PDP-11 extended to 16 registers and a 4-bit opcode (plus byte/word) instead of 3 bits, by trimming the src addressing modes to 4 (Rs, @Rs, @Rs+, 0xNNNN(Rs)), plus immediate and absolute (OK, PC-relative) obtained by punning on PC, and trimming the dst addressing modes to just 2 (Rd, 0xNNNN(Rd)), plus the punning on PC".
Decoding is fairly 6502-friendly, with the src and dst registers in the low 4 bits of each byte of the instruction, the opcode in the high 4 bits of one byte, and the two addressing modes and B/W flag in the high bits of the other byte.
If you keep the hi and lo bytes of each register in separate arrays in Zero Page then you don't even need to shift the register numbers -- just mask them off and shove them into the X & Y registers. Similarly, the opcode can just be masked and then shoved into a location that is the bottom half of a modified JMP or JSR, or a JMP (zp), giving 16 bytes of code to get started. Or shift it once if 8 bytes is enough. Similarly, the two addressing modes and the B/W flag don't need to be parsed with shifting and masking (at least if you care about speed more than interpreter size); you can just mask that byte and also use a jump table to "decode" it.
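Just to pin down that field layout, here's a little Python sketch splitting an MSP430 two-operand (format I) instruction word into those fields per the standard encoding; this is only the bit-field extraction, not any of the 6502 jump-table machinery:

def decode_format1(lo, hi):
    """Split an MSP430 format-I (two-operand) instruction into fields.

    hi byte: oooo ssss       (opcode nibble, source register)
    lo byte: A b mm dddd     (Ad, B/W, As, destination register)"""
    opcode = hi >> 4          # high nibble of one byte: the operation
    src    = hi & 0x0F        # low nibble: source register number
    dst    = lo & 0x0F        # low nibble: destination register number
    ad     = (lo >> 7) & 1    # destination addressing mode (1 bit)
    bw     = (lo >> 6) & 1    # 0 = word operation, 1 = byte operation
    as_    = (lo >> 4) & 3    # source addressing mode (2 bits)
    return opcode, src, dst, ad, bw, as_

# MOV R10, R11 encodes as 0x4A0B: opcode 4, register-to-register, word size
print(decode_format1(0x0B, 0x4A))   # (4, 10, 11, 0, 0, 0)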
--
Also, there is an intermediate stage between full interpretation and full JITing. I don't know if you've looked at the "Spike" RISC-V reference emulator. It doesn't normally decode instructions, it hashes them to look up a pre-decoded form with src and dst registers and any immediate/offset decoded into a struct, along with a pointer to the code to interpret the instruction. The hash table size is 8000 and that gets a good hit rate on most code. Of course if there is a hash table miss then the instruction is actually decoded the hard way and inserted into the hash table.
This wouldn't be a good fit for an 8 bit micro (unless you knocked the hash table size down a lot) but would be fine with even 1 MB of RAM.
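Roughly, the scheme looks like this (a toy Python sketch of the caching idea, not Spike's actual code; the table size and the stand-in decoder are placeholders):

TABLE_SIZE = 8192
cache = [None] * TABLE_SIZE       # each slot: (raw word, decoded form)

def full_decode(word):
    # Stand-in for the real, slow decoder: return whatever pre-chewed
    # form the execute loop wants (here, a few RISC-V-style fields).
    opcode = word & 0x7F
    rd  = (word >> 7)  & 0x1F
    rs1 = (word >> 15) & 0x1F
    rs2 = (word >> 20) & 0x1F
    return (opcode, rd, rs1, rs2)

def decode(word):
    slot = word % TABLE_SIZE              # cheap hash of the raw bits
    entry = cache[slot]
    if entry is not None and entry[0] == word:
        return entry[1]                   # hit: reuse the predecoded form
    decoded = full_decode(word)           # miss: decode the hard way...
    cache[slot] = (word, decoded)         # ...and remember it
    return decoded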
--
>> The effects of using CCI are twofold. First, since one instruction can manipulate a 16 or 32 bit quantity, the size of the compiled program is generally more than fifty percent smaller than the same program compiled with C65 [the Aztec C native code compiler for the 6502]. However, interpreting the pseudo-code incurs an overhead which causes the execution speed to be anywhere from five to twenty times slower.
I'll just point out again that the native code technique I showed in my original message up-thread cuts a 16 bit register-to-register add down from 13 bytes of code to 7 bytes (or fewer for sequential operations on the same register) at a cost of increasing execution time from 20 cycles to 42 i.e. not a 5x-20x slowdown but only a 2.1x slowdown.
For a 32 bit add it reduces the code size from 25 bytes to 7, and increases the execution time from 38 cycles to 66, a 1.7x slowdown.
Well, I don't know how "native" the native code compilation for C65 is. I'm assuming all operations are inline for speed.
cm.push and cm.popret seem like very appealing instructions from this point of view, yes! They look more expressive than ldm/stm. I imagine they would hurt interrupt latency? How do they affect page fault handling in practice?
If you are going to add non-RISC features to your ISA to ensmallen prologues and epilogues, even at the expense of interrupt latency, another possibility is to add semantics to the call and return instructions. Perhaps you could describe the required activation record structure in a header word before the subroutine entry point that need not follow the usual ISA encoding at all, and if the return instruction can reliably find the same header word (on the stack, say, or in a callee-saved register like lr) there is no need for it to contain a redundant copy of the same information.
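Purely as a strawman (this is not any real ISA, just an illustration of the idea), such a header word might hold a callee-saved-register mask and a frame size, read symmetrically by the call and return instructions:

# Hypothetical activation-record header word: low 12 bits = bitmask of
# callee-saved registers to spill, high 4 bits = extra frame size in words.

def pack_header(saved_reg_mask, frame_words):
    assert saved_reg_mask < (1 << 12) and frame_words < (1 << 4)
    return (frame_words << 12) | saved_reg_mask

def call_effect(header):
    mask, frame = header & 0xFFF, header >> 12
    regs = [r for r in range(12) if mask & (1 << r)]
    # call would push these registers and drop SP by `frame` words;
    # return finds the same header (via lr or the stack) and undoes it.
    return regs, frame

print(call_effect(pack_header(0b000000001110, 3)))   # ([1, 2, 3], 3)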
The MSP430 sounds very appealing in some ways, and Mecrisp-Stellaris is a usable self-licking development environment for it. Too bad it's 16-bit. How small is the smallest FPGA implementation? OpenMSP430 says it's 1650 6-LUTs.
I had no idea about the Spike hash table approach. It's a totally new approach to emulation to me. It makes a lot of sense especially in the RISC-V context where EEs threw the instruction word layout in a blender to reduce their propagation delays—a decision I endorse despite my initial horror.
> I'm interested to hear what you used SWEET16 for.
Just generic coding that needed 16 bit calculations but wasn't speed-critical. Pointer-chasing code is particularly tiresome on 6502.
> cm.push and cm.popret seem like very appealing instructions from this point of view, yes! They look more expressive than ldm/stm. I imagine they would hurt interrupt latency?
They are designed to be easily interruptible / restartable with a variety of possible implementations (non-interruptible, complete restart, continue where you left off). The tradeoff is up to the designer of the individual CPU. A basic idea is to hold them in the decoder and emit normal instructions to the execution pipeline, while simplifying the original instruction towards a base case. A correct (interrupt-safe) sequence of normal instructions to generate is given in the specification.
Also, that code didn't show other instructions to copy `a0` and `a1` to two arbitrary `sN` registers, or to copy two arbitrary `sN` registers to `a0` and `a1`. This helps a lot both implementing and calling functions with two or more arguments, if you're not using arithmetic to marshal the arguments anyway.
> How do they affect page fault handling in practice?
It is unlikely that anyone will ever make a machine with both Zcmp (and Zcmt) and page faults. Architects of high performance CPUs would have a fit, and besides, they use the same opcode space as RVC instructions for floating point load/store, making them incompatible with standard Linux software.
> Perhaps you could describe the required activation record structure in a header word before the subroutine entry point that need not follow the usual ISA encoding at all, and if the return instruction can reliably find the same header word (on the stack, say, or in a callee-saved register like lr) there is no need for it to contain a redundant copy of the same information.
VAX, begone!
As Patterson showed, JSR subroutines outperform CALLS/G subroutines.
Yes, VAX CALLS/CALLG is pretty much what I'm talking about, but with arguments in registers.
I think the performance landscape has changed somewhat since Patterson; if your instruction decoder is shattering each instruction into a dependency graph of separately executable micro-ops anyway to support OoO, you don't have to slow down your whole pipeline to support elaborate instructions like cm.push or CALLG. As I see it, though, a couple of things haven't changed:
- Having arguments in registers is faster than having arguments in memory.
- Page fault handling is unfriendly to architectural instructions having multiple effects, because the page fault can't be handled at the micro-op level; it has to be handled at the architectural level. But maybe this is a smaller problem with OoO and especially speculative execution, I don't know.
If you can safely get by with a complete restart of the instruction, though, the page fault problem goes away.
I think that solves the problems Patterson identified?