SQLite's design docs were the first time I had seen a database use a virtual machine instead of walking a tree. I later noticed VMs in libraries, embedded DSLs, and other applications outside of large general-purpose programming languages. That really drove home for me that VMs could be anywhere and were often a useful step in handling a user's expressions.
Stack-based VMs, like SQLite's (I think), ARE trees. A stack-based VM's bytecode (without DUP and POP) is just the post-order depth-first traversal of the corresponding expression tree. With DUP you have a connected DAG. With POP you have a DAG with one or more components. With loops you have a full graph.
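To make that concrete, here's a tiny sketch (Python, with made-up node and op names) of the compile direction: a post-order walk of an expression tree emits exactly the flat op list a stack VM would run.

```python
# Sketch: compiling an expression tree to stack bytecode is just a
# post-order walk. Node/op names here are illustrative, not SQLite's.

def compile_expr(node, out):
    """Emit post-order: children first, then the operator."""
    if isinstance(node, int):
        out.append(("LITERAL", node))
    else:
        op, left, right = node
        compile_expr(left, out)
        compile_expr(right, out)
        out.append((op, None))
    return out

# 2 + (3 * 4) as a nested-tuple tree
tree = ("PLUS", 2, ("MUL", 3, 4))
code = compile_expr(tree, [])
# code == [("LITERAL", 2), ("LITERAL", 3), ("LITERAL", 4),
#          ("MUL", None), ("PLUS", None)]
```

Running the resulting ops left to right on a value stack evaluates the tree; that's the whole trick.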
Looked at this way, a VM actually makes the most sense: a pointer-heavy tree implementation is terrible for caching and wastes a lot of memory. Also, most SQL plans are trees (until you get to WITH RECURSIVE).
I'm working on a CEP in C right now and a stack-based bytecode is simply the most 'obvious' way to represent a tree when you don't want to have to deal with lots of memory allocations. Just pre-allocate a buffer large enough and increment and go.
Either way, try taking bytecode like the following:
LITERAL 2
LITERAL 3
LITERAL 4
MUL
PLUS
Now write something that prints the following to a terminal:
Plus
  2
  Mul
    3
    4
You should be able to do this in a simple iteration through the ops. Try it! Then try it with the variations above.
Now, try writing ops to manipulate and rewrite it. It's actually a really fun way to represent trees.
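For instance, the printing exercise can be done in one forward pass by keeping a stack of rendered subtrees instead of values (a Python sketch; op names are illustrative):

```python
# Sketch: print a tree from postfix ops in a single forward pass.
# Instead of pushing values, push the rendered text of each subtree.

def render(ops):
    stack = []

    def indent(s):
        return "\n".join("  " + line for line in s.splitlines())

    for op, arg in ops:
        if op == "LITERAL":
            stack.append(str(arg))
        else:
            b = stack.pop()          # right operand's rendering
            a = stack.pop()          # left operand's rendering
            stack.append(op.title() + "\n" + indent(a) + "\n" + indent(b))
    return stack.pop()

ops = [("LITERAL", 2), ("LITERAL", 3), ("LITERAL", 4),
       ("MUL", None), ("PLUS", None)]
print(render(ops))
# Plus
#   2
#   Mul
#     3
#     4
```

Each operator pops the rendered text of its operands and pushes its own block; the final pop is the whole tree.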
This was the question I was going to ask! I've recently been diving into Lua 5.x internals and have been extremely impressed with Lua's register-based bytecode implementation. Lua keeps its bytecode interface stable between patch releases (5.4.0 -> 5.4.1, etc.) but not between major/minor revisions (5.3 -> 5.4). SQLite, on the other hand, does not consider its bytecode a public interface at all!
> "Remember: The VDBE opcodes are not part of the interface definition for SQLite. The number of opcodes and their names and meanings change from one release of SQLite to the next."
https://www.sqlite.org/opcode.html#the_opcodes
For anyone interested in understanding how a register-based VM operates I highly recommend:
Not a public one, because they want to have the freedom to change how it works. But if you type in `EXPLAIN CREATE TABLE x(y);` or `EXPLAIN SELECT 3;` or whatever into `sqlite3`, you can see it. And it's reasonably straightforward to dig into the source code (SQLite is very well-written C) to hook that up yourself if you really wanted to.
> Also, most SQL plans are trees (Until you get to WITH RECURSIVE).
WITH RECURSIVE can also generally be implemented tree-style; you just loop a bunch in one of the iterators.
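Something like this sketch (Python, hypothetical names): the recursive-CTE node is an ordinary iterator that keeps feeding rows from the recursive term back into itself until nothing new comes out.

```python
# Hypothetical sketch: a WITH RECURSIVE node as a plain iterator that
# loops internally. `seed` is the non-recursive term; `step` maps one
# row to the batch of rows produced by the recursive term.

def recursive_cte(seed, step):
    seen = set()          # UNION semantics: drop duplicate rows
    frontier = list(seed)
    while frontier:
        nxt = []
        for row in frontier:
            if row in seen:
                continue
            seen.add(row)
            yield row
            nxt.extend(step(row))
        frontier = nxt

# e.g. WITH RECURSIVE t(n) AS (SELECT 1 UNION SELECT n+1 FROM t WHERE n < 5)
rows = list(recursive_cte([1], lambda n: [n + 1] if n < 5 else []))
# rows == [1, 2, 3, 4, 5]
```

From the parent operator's point of view it's just another iterator in the tree, which is the point.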
There are some databases, like Umbra, which can have DAG query plans for the benefit of more efficient groupjoin (GROUP BY pushed down into a join). IIRC some of the unnesting patterns can also benefit from non-tree plans.
VMs really can be. They have a long history in code portability. In the old days it wasn't uncommon to build around interpretable bytecode running in a VM, where the reusable VM was all that needed porting to different architectures and operating systems. This all happened well before Java.
It was really big in gaming: Zork, the Sierra games, the LucasArts games, and even a few more "action" games like Flashback were all designed around VMs to make porting somewhat sane. Running the bytecode is usually not the performance bottleneck in these cases; drawing stuff to the screen is, and the VM had to handle that too.
The Burroughs B5000 used an almost memory-safe systems language, ESPOL, with zero Assembly: all CPU capabilities were exposed via intrinsics, in one of the first recorded uses of unsafe code blocks; there was tagging and capabilities support, and the CPUs were microcoded. All of this in 1961, a decade before C came to be.
ESPOL was quickly replaced by NEWP, although there is very little data on when that happened; probably a couple of years later.
Nowadays it is still sold by Unisys as ClearPath MCP, marketed as a mainframe system for those who value security above all, and you can still get the NEWP manual.
the b5000 was one of the first non-virtual stack machines, but its instruction set isn't any more of a virtual machine than the 8088's or the pdp-10's. there were a number of interpreted stack languages in the 50s, though not nearly as many as there would be later
When one does digital archeology, it is quite common to see Assembly referred to as bytecode, when the CPUs are actually interpreters written in microcode.
Another example, all the Xerox PARC workstations, which loaded the respective interpreter (Smalltalk, Interlisp, Mesa, Mesa/Cedar) into the CPU as first step during the boot process.
According to five seconds on Google, the x86 has always been microcoded. I guess the argument is that “machine language” is the public API of the CPU, and “assembly language” is the higher-level script used, mostly, to directly create machine language.
Is Intel microcode able to be changed by anyone? I understand that even CPUs get field updates nowadays.
Microcode updates are encrypted, and only Intel has the key. There are exploits to extract the key, but otherwise you're pretty much locked out.
I don't know a whole lot about microarchitecture, but from my understanding, it's not possible to modify microcode's actual uOps in software, it's the translation from assembly to microcode that's being patched in updates.
i have never seen assembly, or the code generated from it, referred to as 'bytecode' unless there was no physical hardware or microcode that could interpret it — except in the case of smalltalk, whose bytecode was, as you say, interpreted by microcode on the alto and the d-machines. but possibly your archeology has delved into cultures mine has not? i'd be very interested in reading more, if you happen to have any links handy
Wozniak found that 16-bit arithmetic was much easier to perform on a 16-bit machine than on the 8-bit 6502. So he wrote SWEET16, a VM with its own bytecode implementing a virtual 16-bit machine, to work with 16-bit pointers in Apple's Integer BASIC.
Similarly, the TI-99/4 series of computers had a cumbersome way of accessing memory, because most of the RAM in the machine was exclusively controlled by the video chip and could only be accessed by poking a few registers. So the designers implemented a bytecode VM in ROM called GPL (Graphics Programming Language) that would make accesses to this video RAM seem like straightforward RAM accesses. TI encouraged the use of GPL to develop games and other applications for the system. Unfortunately, the fact that it was interpreted by the CPU, plus the fact that the CPU could only access video-chip memory during the horizontal and vertical retrace intervals, made GPL execution very slow. And since the machine's in-ROM BASIC interpreter was written in GPL, you now know why BASIC programs on the TI-99/4 and /4A crept along at the most leisurely of paces.
So VMs were definitely a thing used to make certain kinds of tradeoffs, even way back in the 8-bit days when systems had mere kilobytes of memory. Who knows how many might be lurking in a modern system!
VMs are persistently underrated by people who think the conversation begins and ends at VirtualBox. They are a phenomenal way to constrain the surface area of what one's own code has to target, and that's always a plus.
Doesn't MS Word internally run a Forth-like VM? I remember reading an article by someone who decompiled an early MS-DOS version of Word only to discover that there was a VM inside.
"Originally code-named “EP” (for “Electronic Paper”), Multiplan was written to use a very clever p-code compiler system created by Richard Brodie. This p-code system, which was also used later by Microsoft Word, allowed Microsoft to target Multiplan to a huge number of different 8-bit and 16-bit computer systems. (Charles Simonyi once estimated that Multiplan was available for 100 platforms.)"
"Microsoft has introduced a code compression technology in its C/C++ Development System for Windows version 7.0 (C/C++ 7.0) called p-code (short for packed code) that provides programmers with a flexible and easy-to-implement solution for minimizing an application's memory requirements. In most cases, p-code can reduce the size of an executable file by about 40 percent. For example, the Windows Project Manager version 1.0 (resource files not included) shrinks from 932K (C/C++ 7.0 with size optimizations turned on) to 556K when p-code is employed."
"Until now, p-code has been a proprietary technology developed by the applications group at Microsoft and used on a variety of internal projects. The retail releases of Microsoft Excel, Word, PowerPoint®, and other applications employ this technology to provide extensive breadth of functionality without consuming inordinate amounts of memory."
Or indeed any time complexity needs to be layered. A VM design doesn't even have to have an explicit bytecode compilation stage -- you can write an interpreter that executes the instructions as they are issued.
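A minimal illustration of that (Python, hypothetical API): the "instructions" are just method calls, executed the moment they're issued, with no bytecode array in between.

```python
# Sketch: the same stack-machine "instructions", but executed the moment
# they are issued -- no bytecode array, no separate compile stage.

class ImmediateVM:
    def __init__(self):
        self.stack = []

    def literal(self, v):
        self.stack.append(v)

    def mul(self):
        b, a = self.stack.pop(), self.stack.pop()
        self.stack.append(a * b)

    def plus(self):
        b, a = self.stack.pop(), self.stack.pop()
        self.stack.append(a + b)

vm = ImmediateVM()
vm.literal(2); vm.literal(3); vm.literal(4)
vm.mul(); vm.plus()
# vm.stack == [14]   i.e. 2 + 3*4
```

Swap the method bodies for ones that append to a buffer and the same "issuing" code becomes a bytecode compiler; that's why the two designs layer so naturally.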
MonetDB is another database that uses a VM [1], using an instruction set called the MonetDB Assembly Language (MAL). I believe it pioneered the technique [2]. The VM is basically designed to express the query as vectorized functions on columns.
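Roughly the idea, sketched in Python (illustrative only, not actual MAL syntax or operators): each instruction is a function applied to whole columns rather than to one row at a time.

```python
# Illustrative sketch of column-at-a-time execution: every "instruction"
# consumes and produces entire columns. Function names are made up.

def col_mul(xs, ys):
    # elementwise multiply of two columns
    return [x * y for x, y in zip(xs, ys)]

def col_select_gt(xs, k):
    # return the positions where xs[i] > k (a selection vector)
    return [i for i, x in enumerate(xs) if x > k]

price = [10, 20, 30]
qty   = [1, 5, 2]
revenue = col_mul(price, qty)       # [10, 100, 60]
hits = col_select_gt(revenue, 50)   # positions [1, 2]
```

The interpreter overhead is paid once per column instead of once per row, which is what makes a high-level VM like this viable for analytics.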
> Tree-Of-Objects → The input SQL is translated in a tree of objects that represent the processing to be done. The SQL is executed by walking this tree. This is the technique used by MySQL and PostgreSQL.
Note that this only holds true for fairly recent MySQL (MySQL 8.x, not including the oldest 8.0.x releases). 5.7 and older, and by extension MariaDB, instead models pretty much everything as a large fixed-function recursive function that calls itself for each new table in the join, plus some function pointers for handling of GROUP BY and such.
TBH, when it comes to query execution (A JOIN B JOIN C GROUP BY …), I think the difference between SQLite's bytecode and a tree of iterators (the classical Volcano executor) is fairly small in practice; they are quite interchangeable in terms of what they can do, and similar when it comes to performance. The difference between bytecode and tree structure is much larger when it comes to evaluation of individual expressions (a + b * cos(c)), especially since that involves much more of a type system; that is more obviously in the “bytecode is the way to go” camp to me.
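For the expression case, here's a sketch of what the bytecode approach looks like (Python, made-up opcodes; a real engine would also dispatch on types, which is where most of the complexity lives):

```python
import math

# Sketch: expression-level bytecode for  a + b * cos(c),
# evaluated on a value stack. Opcode names are invented.

def eval_expr(code, env):
    stack = []
    for op, arg in code:
        if op == "LOAD":
            stack.append(env[arg])
        elif op == "COS":
            stack.append(math.cos(stack.pop()))
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack.pop()

# a + b * cos(c), compiled post-order
code = [("LOAD", "a"), ("LOAD", "b"), ("LOAD", "c"),
        ("COS", None), ("MUL", None), ("ADD", None)]
result = eval_expr(code, {"a": 1.0, "b": 2.0, "c": 0.0})
# cos(0) == 1, so result == 1.0 + 2.0 * 1.0 == 3.0
```

Compiled once per query and run once per row, this flat loop beats re-walking a node tree for every row, which is the "bytecode is the way to go" case above.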
Yes, postgres for example. It maps pretty closely to what you see from `explain`, where the ops are pretty high level, reducing interpreter overhead. The JIT's big gain is speeding up reading values out of row data.