
SQLite's design docs were the first time I had seen a database use a virtual machine instead of walking a tree. I later noticed VMs in libraries, embedded DSLs, and other applications outside of large general-purpose programming languages. That really drove home for me that VMs could be anywhere and were often a useful step in handling a user's expressions.



Stack-based VMs, like SQLite's (I think), ARE trees. A stack-based VM's bytecode (without DUP and POP) is just the post-order depth-first traversal of the corresponding expression tree. With DUP you get a connected DAG. With POP you get a DAG with one or more connected components. With loops you get a general graph.

Looked at this way, a VM actually makes the most sense, because a pointer-heavy tree implementation is terrible for caching and super wasteful. Also, most SQL plans are trees (until you get to WITH RECURSIVE).

I'm working on a CEP in C right now and a stack-based bytecode is simply the most 'obvious' way to represent a tree when you don't want to have to deal with lots of memory allocations. Just pre-allocate a buffer large enough and increment and go.

Either way, try taking bytecode like the following and printing it as an expression tree:

  LITERAL 2
  LITERAL 3
  LITERAL 4
  MUL
  PLUS
Now write something that produces the following on a terminal:

  Plus
    Mul
      3
      4
    2
You should be able to do this in a simple iteration through the ops. Try it! Then try it with the variations above.

Now, try writing ops to manipulate and rewrite it. It's actually really a fun way to represent trees.
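
Here's a minimal C sketch of that exercise (opcode names and the fixed arities are illustrative, not from any real engine). Walk the ops backwards: the last instruction is the root, each operator's operands are the instructions preceding it, and a small stack of "children still expected" counts gives the indentation.

  #include <stdio.h>

  /* Illustrative opcodes for the exercise above -- not any real engine's. */
  enum { OP_LITERAL, OP_MUL, OP_PLUS };

  typedef struct { int op; int arg; } Instr;

  static void print_tree(const Instr *code, int n) {
      int pending[64];            /* children still to print at each depth */
      int depth = 0;
      for (int i = n - 1; i >= 0; i--) {
          for (int s = 0; s < depth; s++) printf("  ");
          int arity = 0;
          switch (code[i].op) {
          case OP_LITERAL: printf("%d\n", code[i].arg); break;
          case OP_MUL:     printf("Mul\n");  arity = 2; break;
          case OP_PLUS:    printf("Plus\n"); arity = 2; break;
          }
          if (arity > 0) {
              pending[depth++] = arity;     /* descend into this operator */
          } else {
              /* a leaf finishes itself, and possibly completes parent subtrees */
              while (depth > 0 && --pending[depth - 1] == 0) depth--;
          }
      }
  }

  int main(void) {
      Instr prog[] = {
          { OP_LITERAL, 2 }, { OP_LITERAL, 3 }, { OP_LITERAL, 4 },
          { OP_MUL, 0 }, { OP_PLUS, 0 },
      };
      print_tree(prog, 5);
      return 0;
  }

One quirk of the backward walk is that children come out in reverse order (4 before 3 under Mul); swapping them back, or just accepting the mirrored tree, is part of the exercise.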


SQLite's VM is register-based, not stack-based.


It’s been both - was stack, converted to register[0][1].

[0] https://www.sqlite.org/vdbe.html

[1] https://www.sqlite.org/src/info/051ec01f2799e095


This was the question I was going to ask! I've recently been diving into Lua 5.x internals and have been extremely impressed with Lua's register-based bytecode implementation. Lua has a stable byte code interface between patch releases (5.4.0 -> 5.4.1, etc) but not between major/minor revisions (5.3 -> 5.4). SQLite on the other hand does not consider this a public interface at all!

> "Remember: The VDBE opcodes are not part of the interface definition for SQLite. The number of opcodes and their names and meanings change from one release of SQLite to the next." https://www.sqlite.org/opcode.html#the_opcodes

For anyone interested in understanding how a register-based VM operates I highly recommend:

A No-Frills Introduction to Lua 5.1 VM Instructions by Kein-Hong Man. https://www.mcours.net/cours/pdf/hasclic3/hasssclic818.pdf


Does SQLite's VM have an API? I mean, one where I can build and issue opcodes directly.


Not a public one, because they want to have the freedom to change how it works. But if you type in `EXPLAIN CREATE TABLE x(y);` or `EXPLAIN SELECT 3;` or whatever into `sqlite3`, you can see it. And it's reasonably straightforward to dig into the source code (SQLite is very well-written C) to hook that up yourself if you really wanted to.


> Also, most SQL plans are trees (Until you get to WITH RECURSIVE).

WITH RECURSIVE can also generally be implemented tree-style; you just loop a bunch in one of the iterators.

There are some databases, like Umbra, which can have DAG query plans for the benefit of more efficient groupjoin (GROUP BY pushed down into a join). IIRC some of the unnesting patterns can also benefit from non-tree plans.


VMs really can be. They have a long history in code portability. In the old days it wasn't uncommon to use an approach centered on interpretable bytecode running in a VM, where the reusable VM was all that needed porting to different architectures and operating systems. This all happened well before Java.

It was really big in gaming: Zork, Sierra games, LucasArts games, and even a few more "action" games like Flashback were all designed around VMs to make porting somewhat sane. Running the bytecode is usually not the performance bottleneck in these cases; drawing stuff to the screen is, and the VM had to handle that.


And Pascal p-code! Not the first, I’ve heard, but I believe it’s close to being the first.


One of the first was Burroughs B5000,

https://en.wikipedia.org/wiki/Burroughs_Large_Systems

It used an almost memory-safe systems language, ESPOL, with zero Assembly; all CPU capabilities were exposed via intrinsics, it had one of the first recorded uses of unsafe code blocks, there was tagging and capabilities support, and the CPUs were microcoded. All of this in 1961, a decade before C came to be.

ESPOL was quickly replaced by NEWP, although there is very little data on when that happened; probably a couple of years later.

Nowadays it is still sold by Unisys as ClearPath MCP, a mainframe system for those that value security above all, and you can get the NEWP manual.

https://www.unisys.com/solutions/enterprise-computing/clearp...

https://public.support.unisys.com/aseries/docs/ClearPath-MCP...


the b5000 was one of the first non-virtual stack machines, but its instruction set isn't any more of a virtual machine than the 8088's or the pdp-10's. there were a number of interpreted stack languages in the 50s, though not nearly as many as there would be later


When one does digital archeology, it is quite common to see Assembly referred to as bytecode, since the CPUs are actually interpreters written in microcode.

Another example: all the Xerox PARC workstations, which loaded the respective interpreter (Smalltalk, Interlisp, Mesa, Mesa/Cedar) into the CPU as the first step of the boot process.


According to five seconds on Google, x86 has always been microcoded. I guess the argument is that “machine language” is the public API of the CPU, and “assembly language” is the higher-level notation used to, mostly, directly create machine language.

Is Intel microcode able to be changed by anyone? I understand that even CPUs get field updates nowadays.


Microcode updates are encrypted, and only Intel has the key. There are exploits to extract the key, but otherwise you're pretty much locked out.

I don't know a whole lot about microarchitecture, but from my understanding, it's not possible to modify the microcode's actual uOps in software; it's the translation from assembly to microcode that's being patched in updates.


i have never seen assembly, or the code generated from it, referred to as 'bytecode' unless there was no physical hardware or microcode that could interpret it — except in the case of smalltalk, whose bytecode was, as you say, interpreted by microcode on the alto and the d-machines. but possibly your archeology has delved into cultures mine has not? i'd be very interested in reading more, if you happen to have any links handy


Don't know about Flashback, but Another World famously was VM based.


I might be wrong, but I've read on other project pages that it is. https://github.com/sylefeb/a5k


Wozniak found that 16-bit arithmetic was much easier to perform on a 16-bit machine than on the Apple II's 8-bit 6502. So he wrote SWEET16, a VM with its own bytecode, to work with 16-bit pointers in Apple's Integer BASIC.

Similarly, the TI-99/4 series of computers had a cumbersome way of accessing memory, because most of the RAM in the machine was exclusively controlled by the video chip and could only be accessed by poking a few registers. So the designers implemented a bytecode VM in ROM called GPL (Graphics Programming Language) that would make accesses to this video RAM seem like straightforward RAM accesses. TI encouraged the use of GPL to develop games and other applications for the system. Unfortunately, the fact that it was interpreted by the CPU, plus the fact that the CPU could only access video-chip memory during the horizontal and vertical retrace intervals, made GPL execution very slow. And since the machine's in-ROM BASIC interpreter was written in GPL, you now know why BASIC programs on the TI-99/4 and /4A crept along at the most leisurely of paces.

So VMs were definitely a thing used to make certain kinds of tradeoffs, even way back in the 8-bit days when systems had mere kilobytes of memory. Who knows how many might be lurking in a modern system!


The unfortunate history of SWEET16 is that it was never used for its intended purpose - https://retrocomputing.stackexchange.com/a/14513/11579


VMs are persistently underrated by people who think the conversation begins and ends at VirtualBox. They are a phenomenal way to constrain the surface area of what one's own code has to target, and that's always a plus.


Everything is either a compiler or interpreter


That's why SICP is often brought up in contexts that are not explicitly about interpreters :)


Doesn't MS Word internally run a Forth-like VM? I remember reading an article by someone who decompiled an early MS-DOS version of Word only to discover that there was a VM inside.


They called it p-code at the time. The purpose was (purportedly) to simplify porting between architectures.

https://casadevall.pro/articles/2020/11/compiling-word-for-w...


from http://www.trs-80.org/multiplan/

    "Originally code-named “EP” (for “Electronic Paper”), Multiplan was written to use a very clever p-code compiler system created by Richard Brodie. This p-code system, which was also used later by Microsoft Word, allowed Microsoft to target Multiplan to a huge number of different 8-bit and 16-bit computer systems. (Charles Simonyi once estimated that Multiplan was available for 100 platforms.)"
Many people say that the 'p' in 'p-code' stands for 'pseudo', i.e. pseudo code. But the article archived at https://techshelps.github.io/MSDN/BACKGRND/html/msdn_c7pcode... says the 'p' is short for 'packed'.

    "Microsoft has introduced a code compression technology in its C/C++ Development System for Windows version 7.0 (C/C++ 7.0) called p-code (short for packed code) that provides programmers with a flexible and easy-to-implement solution for minimizing an application's memory requirements. In most cases, p-code can reduce the size of an executable file by about 40 percent. For example, the Windows Project Manager version 1.0 (resource files not included) shrinks from 932K (C/C ++ 7.0 with size optimizations turned on) to 556K when p-code is employed."

    "Until now, p-code has been a proprietary technology developed by the applications group at Microsoft and used on a variety of internal projects. The retail releases of Microsoft Excel, Word, PowerPoint�, and other applications employ this technology to provide extensive breadth of functionality without consuming inordinate amounts of memory."


Or indeed any time complexity needs to be layered. A VM design doesn't even have to have an explicit bytecode compilation stage -- you can write an interpreter that runs as the instructions are issued.

David Parnas talks a lot about this as a way to achieve reliable modularisation: https://two-wrongs.com/deliberate-abstraction.html
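
For the "no explicit bytecode stage" point, here's a minimal C sketch (names are illustrative, not from any project mentioned here): the "instruction set" is just an API over a small machine state, and the caller issues ops as ordinary function calls.

  #include <stdio.h>

  typedef struct { double stack[32]; int sp; } VM;

  static void vm_push(VM *vm, double v) { vm->stack[vm->sp++] = v; }
  static void vm_add(VM *vm) { vm->sp--; vm->stack[vm->sp - 1] += vm->stack[vm->sp]; }
  static void vm_mul(VM *vm) { vm->sp--; vm->stack[vm->sp - 1] *= vm->stack[vm->sp]; }

  int main(void) {
      /* 2 + 3 * 4, "interpreted" as the instructions are issued -- there is
         no bytecode array anywhere; the instruction stream is the call sequence. */
      VM vm = { .sp = 0 };
      vm_push(&vm, 2); vm_push(&vm, 3); vm_push(&vm, 4);
      vm_mul(&vm); vm_add(&vm);
      printf("%g\n", vm.stack[0]);   /* 14 */
      return 0;
  }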


MonetDB is another database that uses a VM [1], with an instruction set called the MonetDB Assembly Language (MAL). I believe it pioneered the technique [2]. The VM is basically designed to express the query as vectorized functions on columns.

[1] https://stratos.seas.harvard.edu/files/MonetDebull2012.pdf

[2] https://www.researchgate.net/publication/220538804_Database_...
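
To give a flavor of "vectorized functions on columns" (a sketch with made-up names, not MonetDB's actual API): each instruction dispatches once per column operation rather than once per row, so the interpreter overhead is amortized over the whole column.

  #include <stddef.h>
  #include <stdio.h>

  /* One column-at-a-time primitive: add two integer columns element-wise.
     The tight loop is also easy for the compiler to auto-vectorize. */
  static void col_add_int(const int *a, const int *b, int *out, size_t n) {
      for (size_t i = 0; i < n; i++)
          out[i] = a[i] + b[i];
  }

  int main(void) {
      int a[] = { 1, 2, 3 }, b[] = { 10, 20, 30 }, out[3];
      col_add_int(a, b, out, 3);
      printf("%d %d %d\n", out[0], out[1], out[2]);   /* 11 22 33 */
      return 0;
  }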


> That really drove home for me that VMs could be anywhere and were often a useful step in handling a user's expressions.

I think it's a special case of the fact that finite state machines are everywhere you look.


SQLite is the first one I’ve looked at the internals of. Do others walk an AST of the query instead?


FTA:

> Tree-Of-Objects → The input SQL is translated in a tree of objects that represent the processing to be done. The SQL is executed by walking this tree. This is the technique used by MySQL and PostgreSQL.


Note that this only holds true for fairly recent MySQL (MySQL 8.x, not including the oldest 8.0.x releases). 5.7 and older, and by extension MariaDB, instead model pretty much everything as one large fixed-function recursive function that calls itself for each new table in the join, plus some function pointers for handling of GROUP BY and such.

TBH, when it comes to query execution (A JOIN B JOIN C GROUP BY …), I think the difference between SQLite's bytecode and a tree of iterators (the classical Volcano executor) is fairly small in practice; they are quite interchangeable in terms of what they can do, and similar when it comes to performance. The difference between bytecode and tree structure is much larger when it comes to evaluation of individual expressions (a + b * cos(c)), especially since that involves much more of a type system; that is more obviously in the “bytecode is the way to go” camp to me.
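
To make the expression point concrete, here is a minimal sketch of the two styles for something like a + b * cos(c) (types and names are illustrative, not from SQLite or MySQL):

  #include <math.h>
  #include <stdio.h>

  /* Style 1: tree of objects -- every evaluation chases child pointers. */
  typedef enum { N_CONST, N_VAR, N_ADD, N_MUL, N_COS } NodeKind;
  typedef struct Node {
      NodeKind kind;
      double value;            /* N_CONST */
      int var;                 /* N_VAR: column index in the row */
      struct Node *lhs, *rhs;  /* N_COS uses lhs only */
  } Node;

  double eval_tree(const Node *e, const double *row) {
      switch (e->kind) {
      case N_CONST: return e->value;
      case N_VAR:   return row[e->var];
      case N_ADD:   return eval_tree(e->lhs, row) + eval_tree(e->rhs, row);
      case N_MUL:   return eval_tree(e->lhs, row) * eval_tree(e->rhs, row);
      case N_COS:   return cos(eval_tree(e->lhs, row));
      }
      return 0.0;
  }

  /* Style 2: the same expression flattened to postfix bytecode -- one flat
     array, one loop, one value stack, no pointer chasing per row. */
  typedef enum { B_CONST, B_VAR, B_ADD, B_MUL, B_COS } BcOp;
  typedef struct { BcOp op; double value; int var; } BcInstr;

  double eval_bytecode(const BcInstr *code, int n, const double *row) {
      double stack[32];
      int sp = 0;
      for (int i = 0; i < n; i++) {
          switch (code[i].op) {
          case B_CONST: stack[sp++] = code[i].value;         break;
          case B_VAR:   stack[sp++] = row[code[i].var];      break;
          case B_ADD:   sp--; stack[sp - 1] += stack[sp];    break;
          case B_MUL:   sp--; stack[sp - 1] *= stack[sp];    break;
          case B_COS:   stack[sp - 1] = cos(stack[sp - 1]);  break;
          }
      }
      return stack[0];
  }

  int main(void) {
      /* a + b * cos(c) in postfix, evaluated against one row */
      BcInstr prog[] = {
          { B_VAR, 0, 0 }, { B_VAR, 0, 1 }, { B_VAR, 0, 2 },
          { B_COS, 0, 0 }, { B_MUL, 0, 0 }, { B_ADD, 0, 0 },
      };
      double row[] = { 1.0, 2.0, 0.0 };
      printf("%g\n", eval_bytecode(prog, 6, row));   /* 1 + 2*cos(0) = 3 */
      return 0;
  }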


Yes, Postgres for example. It maps pretty closely to what you see from `explain`, where the ops are pretty high level, reducing interpreter overhead. The JIT's big gain is speeding up reading values out of row data.


Sorry, I’d read the article a couple years ago and forgot he goes into depth on the other approaches. My bad.



