
> My survey of bytecode instruction sets in https://dercuano.github.io/notes/tiny-interpreters-for-micro...

Oh, cool, I'll take a look. Having started with the Apple ][ in 1980 I'm already well aware of SWEET16, and I admire its effectiveness, its integration with native code, and the small size of its interpreter.

> it's common in FORTH to spend no instructions on accessing arguments or returning a value

Yes, good point. Stack machines might not be good for code inside a function, but they are often good at function call/return, if you can arrange things so as not to need extra stack shuffling. Again, this matches very well with evaluating complex arithmetic expressions where some of the operations are function calls rather than inline code.

RISC-V has recent extensions -- so far the RP2350 is the most common chip that implements them -- that optimise function prologues and epilogues, and the marshalling of function arguments into S registers (and of S registers back into argument registers for a call). See the Zcmp extension.

https://docs.openhwgroup.org/projects/cva6-user-manual/01_cv...

Also, Zcmt provides optimised 2-byte jumps and calls to up to 256 well-known arbitrary entry points: 32 jumps and 224 calls.

> RVC and Thumb2 are pretty extreme cases of optimized register-machine encoding, and probably not something you'd want to decode in software on a 6502.

I wasn't thinking of the 6502. I don't think either JVM or WASM runs there :-) If you're JITing then decoding complexity really doesn't matter much, provided it doesn't increase the VM's code size too much.

I've a couple of times worked on some notes and the beginnings of an interpreter for MSP430 on 6502. I think it can be relatively small and relatively fast. There are very few instructions, and the encoding is very regular. And there is a good optimising GCC for it.

If you're not familiar with it, think "PDP-11 extended to 16 registers and a 4-bit opcode (plus a byte/word flag) instead of 3 bits, by trimming src addressing modes to 4 (Rs, @Rs, @Rs+, 0xNNNN(Rs), plus immediate and absolute (OK, PC-relative) by punning on PC) and trimming dst addressing modes to just 2 (Rd and 0xNNNN(Rd), plus the same punning on PC)".

Decoding is fairly 6502-friendly, with the src and dst registers in the low 4 bits of each byte of the instruction, the opcode in the high 4 bits of one byte, and the two addressing modes and B/W flag in the high bits of the other byte.

If you keep the hi and lo bytes of each register in separate arrays in Zero Page then you don't even need to shift the register numbers -- just mask them off and shove them into the X and Y registers. Similarly, the opcode can just be masked and stored into the low byte of a modified JMP or JSR, or a JMP (zp), giving 16 bytes of code per handler to get started -- or shift it once if 8 bytes is enough. Likewise, the two addressing modes and B/W flag don't need to be parsed with shifting and masking (at least if you care about speed more than interpreter size): you can just mask that byte and use a jump table to "decode" it.

--

Also, there is an intermediate stage between full interpretation and full JITing. I don't know if you've looked at the "Spike" RISC-V reference emulator. It doesn't normally decode instructions: it hashes them to look up a pre-decoded form, with the src and dst registers and any immediate/offset decoded into a struct, along with a pointer to the code that interprets the instruction. The hash table size is 8000 entries, which gets a good hit rate on most code. Of course, if there is a hash table miss then the instruction is decoded the hard way and inserted into the hash table.

This wouldn't be a good fit for an 8 bit micro (unless you knocked the hash table size down a lot) but would be fine with even 1 MB of RAM.

--

>> The effects of using CCI are twofold. First, since one instruction can manipulate a 16 or 32 bit quantity, the size of the compiled program is generally more than fifty percent smaller than the same program compiled with C65 [the Aztec C native code compiler for the 6502]. However, interpreting the pseudo-code incurs an overhead which causes the execution speed to be anywhere from five to twenty times slower.

I'll just point out again that the native code technique I showed in my original message up-thread cuts a 16 bit register-to-register add down from 13 bytes of code to 7 bytes (or fewer for sequential operations on the same register), at a cost of increasing execution time from 20 cycles to 42, i.e. not a 5x-20x slowdown but only a 2.1x slowdown.

For a 32 bit add it reduces the code size from 25 bytes to 7, and increases the execution time from 38 cycles to 66, a 1.7x slowdown.

Well, I don't know how "native" the native code compilation for C65 is. I'm assuming all operations are inline for speed.



I'm interested to hear what you used SWEET16 for.

cm.push and cm.popret seem like very appealing instructions from this point of view, yes! They look more expressive than ldm/stm. I imagine they would hurt interrupt latency? How do they affect page fault handling in practice?

If you are going to add non-RISC features to your ISA to ensmallen prologues and epilogues, even at the expense of interrupt latency, another possibility is to add semantics to the call and return instructions. Perhaps you could describe the required activation record structure in a header word before the subroutine entry point that need not follow the usual ISA encoding at all, and if the return instruction can reliably find the same header word (on the stack, say, or in a callee-saved register like lr) there is no need for it to contain a redundant copy of the same information.

The MSP430 sounds very appealing in some ways, and Mecrisp-Stellaris is a usable self-licking development environment for it. Too bad it's 16-bit. How small is the smallest FPGA implementation? OpenMSP430 says it's 1650 6-LUTs.

I had no idea about the Spike hash table approach. It's a totally new approach to emulation to me. It makes a lot of sense especially in the RISC-V context where EEs threw the instruction word layout in a blender to reduce their propagation delays—a decision I endorse despite my initial horror.


> I'm interested to hear what you used SWEET16 for.

Just generic coding that needed 16 bit calculations but wasn't speed-critical. Pointer-chasing code is particularly tiresome on 6502.

> cm.push and cm.popret seem like very appealing instructions from this point of view, yes! They look more expressive than ldm/stm. I imagine they would hurt interrupt latency?

They are designed to be easily interruptible / restartable with a variety of possible implementations (non-interruptible, complete restart, continue where you left off). The tradeoff is up to the designer of the individual CPU. A basic idea is to hold them in the decoder and emit normal instructions to the execution pipeline, while simplifying the original instruction towards a base case. A correct (interrupt-safe) sequence of normal instructions to generate is given in the specification.

Also, that doc didn't show the other instructions, which copy `a0` and `a1` to two arbitrary `sN` registers, or copy two arbitrary `sN` registers to `a0` and `a1`. These help a lot both in implementing and in calling functions with two or more arguments, if you're not using arithmetic to marshal the arguments anyway.

> How do they affect page fault handling in practice?

It is unlikely that anyone will ever make a machine with both Zcmp (and Zcmt) and page faults. Architects of high performance CPUs would have a fit, and besides which they use the same opcode space as the RVC instructions for floating point load/store, making them incompatible with standard Linux software.

> Perhaps you could describe the required activation record structure in a header word before the subroutine entry point that need not follow the usual ISA encoding at all, and if the return instruction can reliably find the same header word (on the stack, say, or in a callee-saved register like lr) there is no need for it to contain a redundant copy of the same information.

VAX, begone!

As Patterson showed, JSR subroutines outperform CALLS/G subroutines.


I see, thanks!

Yes, VAX CALLS/CALLG is pretty much what I'm talking about, but with arguments in registers.

I think the performance landscape has changed somewhat since Patterson; if your instruction decoder is shattering each instruction into a dependency graph of separately executable micro-ops anyway to support OoO, you don't have to slow down your whole pipeline to support elaborate instructions like cm.push or CALLG. As I see it, though, a couple of things haven't changed:

- Having arguments in registers is faster than having arguments in memory.

- Page fault handling is unfriendly to architectural instructions having multiple effects, because the page fault can't be handled at the micro-op level; it has to be handled at the architectural level. But maybe this is a smaller problem with OoO and especially speculative execution, I don't know.

If you can safely get by with a complete restart of the instruction, though, the page fault problem goes away.

I think that solves the problems Patterson identified?



