
The presentation was interesting, but I'd like to share an idea that is tangentially related to this CPU.

I noticed that modern CPUs are optimized for legacy monolithic OS kernels like Linux or Windows. But having a large, multimegabyte kernel is a bad idea from a security standpoint. A single mistake or intentional error in some rarely used component (like a temperature sensor driver) can give an attacker full access to the system. Likewise, an error in any part of a monolithic kernel can cause system failure. And the Linux kernel doesn't even use static analysis to find bugs! It is obvious that using microkernels could solve many of these issues.

But microkernels tend to have poor performance. One of the reasons for this could be high context switch latency. CPUs with high context switch latency are only good for legacy OSes and not ready for better future kernels. Therefore, either we will find a way to make context switches fast or we will have to stay with large, insecure kernels full of vulnerabilities.

So I was thinking about what could be done here. For example, one improvement would be to get rid of the address space switch. It causes flushes of various caches, which hurts performance. Instead, we could always use a single mapping from virtual to physical addresses, but allocate each process a different virtual address range. To implement this, we could add two registers that would hold the minimum and maximum accessible virtual addresses. It should be easy to check each address against them to prevent speculative out-of-bounds memory accesses.
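A minimal software sketch of what that bounds check might look like (the register names, widths, and semantics here are all made up for illustration; real hardware would do this in the load/store path, not in C):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-process bounds registers; the kernel would reload
 * these on a context switch instead of swapping page tables. */
static uint64_t vaddr_min, vaddr_max;

/* What the hardware would check on every access (and before any
 * speculative load): the whole [vaddr, vaddr+len) range must fall
 * inside the process's window.  Assumes len >= 1 and no wraparound. */
static int access_ok(uint64_t vaddr, uint64_t len)
{
    return vaddr >= vaddr_min &&
           vaddr <= vaddr_max &&
           len <= vaddr_max - vaddr + 1;
}
```

Because the check is a pair of compares with no memory lookup, it could plausibly gate speculation without adding a table walk to the critical path.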

By the way, the 32-bit x86 architecture had segments that could be used to divide a single address space between processes.

Another thing that can take time is saving/restoring registers on a context switch. One way to solve the problem could be to use multiple banks (say, 64 banks) of registers that can be quickly switched; another would be to zero out registers on return from the kernel and let processes save them if they need to.
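The banked-register idea can be modelled in software like this (bank and register counts are the ones suggested above; in real hardware the bank-select register would drive a mux into the register file, not index a memory array):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BANKS 64
#define NUM_REGS  32

/* Toy model of banked register files: the bank-select register picks
 * which physical file the architectural registers map to, so a context
 * switch is one write to bank_sel rather than 32 stores + 32 loads. */
static uint64_t bank[NUM_BANKS][NUM_REGS];
static unsigned bank_sel;

static void context_switch(unsigned next)     { bank_sel = next; }
static uint64_t read_reg(unsigned r)          { return bank[bank_sel][r]; }
static void write_reg(unsigned r, uint64_t v) { bank[bank_sel][r] = v; }
```

Each process keeps its registers live in its own bank across switches; only when you run out of banks would you have to spill one to memory.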

Or am I wrong somewhere and fast context switches cannot be implemented this way?



These days there are few caches that need to be flushed at context switch time - RISC-V's ASIDs mean that you don't need to flush the TLBs (mostly) when you context switch.

VRoom! largely has physically tagged caches so they don't need to be flushed. The BTC is virtually tagged, but split into kernel and user caches; you need to flush the user one on a context switch (or both on a VM switch) - the trace cache (L0 icache) is also virtually tagged. VRoom! also doesn't do speculative accesses past the TLBs.

Honestly saving and restoring kernel context is small compared to the time spent in the kernel (and I've spent much of the past year looking at how this works in depth).

Practically you have to design stuff to an architecture (like RISC-V) so that one can leverage the work of others (compilers, libraries, kernels). Adding specialised stuff that would (in this case) get into a critical timing path is something that one has to consider very carefully - but that's a lot of what RISC-V is about - you can go and knock up that chip yourself on an FPGA and start trialing it on your microkernel.


Since this was a bit hard to google:

ASID = Address Space Identifier. It's a tag that uniquely identifies each process's entries in the TLB. This ensures that TLB lookups can be limited to the valid entries for the current process, so you don't need to flush the TLB on a context switch.
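A toy software model of that lookup (field widths are arbitrary, and a real TLB is a set-associative CAM rather than a linear scan):

```c
#include <assert.h>
#include <stdint.h>

#define TLB_ENTRIES 64

/* Toy model of an ASID-tagged TLB: a lookup hits only when both the
 * virtual page number and the ASID match, so entries belonging to
 * other processes can stay resident across a context switch. */
struct tlb_entry {
    int      valid;
    uint16_t asid;
    uint64_t vpn;   /* virtual page number */
    uint64_t ppn;   /* physical page number */
};

static struct tlb_entry tlb[TLB_ENTRIES];

static int tlb_lookup(uint16_t asid, uint64_t vpn, uint64_t *ppn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].asid == asid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return 1;   /* hit */
        }
    }
    return 0;           /* miss: walk the page table */
}
```

On a context switch the OS just changes the current ASID; stale entries from the previous process simply stop matching instead of needing a flush.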


I think the way to think of ASIDs is as each being a separate address space - in effect if you have 15 bits of ASID you have 2^15 = 32k address spaces.

One thing I've done in VRoom! which is an extension to the RISC-V spec is that if we have an N-hart SMP CPU (for example a 2-CPU SMT system) we use log2(N) bits of the ASID to select which hart/cpu a TLB entry belongs to - from a programmer's point of view the ASID just looks smaller.

However there's a VRoom! specific config bit (off by default) that you can set if you know that the ASIDs you are going to use for all your CPUs effectively see the same address space - if you set that bit then the per-cpu portion of the ASID tags (in the TLB) becomes available (ie to the programmer the ASID looks bigger) - it's a great hack because it doesn't get into any critical paths anywhere
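If I've understood the scheme, the tag construction is roughly this (bit widths are assumptions: a 15-bit ASID field and 1 hart bit for a 2-hart system):

```c
#include <assert.h>
#include <stdint.h>

#define ASID_BITS 15
#define HART_BITS 1   /* log2(N) for a hypothetical 2-hart SMT system */

/* Sketch of the described scheme: in per-hart mode the top HART_BITS
 * of the TLB tag come from the hart ID, so each hart sees a smaller
 * ASID space; when the config bit says all harts share one address
 * space, the full field is available to the programmer as ASID. */
static uint16_t tlb_tag(uint16_t asid, unsigned hart, int shared_mode)
{
    if (shared_mode)
        return asid & ((1u << ASID_BITS) - 1);          /* full 15-bit ASID */
    uint16_t small = asid & ((1u << (ASID_BITS - HART_BITS)) - 1);
    return (uint16_t)(hart << (ASID_BITS - HART_BITS)) | small;
}
```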


https://github.com/riscv/riscv-isa-manual/issues/348 suggests that the RISC-V folks are going to address this in the spec at some point.


RISC-V really is the heir of MIPS as it had those TLB tags as well.


Thanks, this is really informative.


Long ago, we in the Newton project at Apple had that idea. We (in conjunction with ARM) were defining the first ARM MMU, so we took the opportunity to implement “domains” of memory protection mappings that could be quickly swapped at a context switch. So you get multiple threads in the same address space, but with independent R/W permission mappings.

I think a few other ARM customers were intrigued by the security possibilities, but the vast majority were more like “what is this bizarre thing, I just want to run Unix”, so the feature disappeared eventually.

Here’s some ARM documentation if you want to pull this thread: https://developer.arm.com/documentation/dui0056/latest/cac...


Too late to edit, but here's a documentation link that works better: https://developer.arm.com/documentation/dui0056/d/caches-and...


It's similar to the original Mac OS, which used handles to track/access/etc memory requested from the OS and swap it to disk as needed. First you request the space, then you request access, which pinned it into RAM.

PalmOS was another one that worked similarly. https://www.fuw.edu.pl/~michalj/palmos/Memory.html


You should look into the Mill CPU architecture.[0] Its design should make microkernels much more viable.

* Single 64-bit address space. Caches use virtual addresses.

* Because of that, the TLB is moved after the last level cache, so it's not on the critical path.

* There's instead a PLB (protection lookaside buffer), which can be searched in parallel with cache lookup. (Technically, there's three: two instruction PLBs and one data PLB.)

[0]: https://millcomputing.com/


I was also going to mention the Mill, but it's become a bit of a Flying Dutchman that people tell tales of but which probably doesn't exist.


Fundamental rethinks take time. The ideas expressed by the Mill folks have value independent of any specific implementation or absence thereof. Yosys is incredible, and the dropping cost and increasing availability of capable FPGA dev boards equally so. I wouldn't put it past a sharp CS major to whip up a toy Mill CPU in an FPGA these days just based on what's been shared publicly. It's a bit strange to me that I can still see echoes of the Datapoint 2200 in a modern machine.

I'd also like to see further work related to this: https://core.ac.uk/reader/161119546

There's been a lot of recapitulation and growth in the language space recently as well, showing up in languages like Zig and Rust, paving the way for better utilization across heterogeneous and many core architectures. I feel like Rust's memory semantics don't hurt the mill either, and may help a lot.


They won't commercialize their design. Your best bet would be to reverse engineer their designs and build your own.


I went looking and it seems that they're making some progress. I wasn't previously aware of their wiki, which contains ISA documentation and more: http://millcomputing.com/wiki/Main_Page


SASOSes are interesting, sometimes extending a 64-bit address space to cover a whole cluster, but they aren't compatible with anything that calls fork().

The various variants of L4 have pretty good context-switch latency even on traditional CPUs, and seL4 in particular is formally proven correct on a few platforms. Spectre+Meltdown mitigation was painful for them, but they're still pretty good.

Lots of microcontrollers have no MMUs but do have MPUs to keep a user task from cabbaging the memory of the kernel or other tasks. Not sure if any of them use the PDP-11-style base+offset segment scheme you're describing to define the memory regions.

Protected-memory multitasking on a multicore system doesn't need to involve context switches, especially with per-core memory.

Even on Linux, context switches are cheap when your memory map is small. httpdito normally has five pages mapped and takes about 100 microseconds (on a 2.8GHz amd64 laptop) to fork, serve a request, and exit. I think I've measured context switches a lot faster than that between two existing processes.

Multiple register banks for context switching go back to the CDC 6600's peripheral processor (FEP) or maybe the TX-0 on which Sutherland wrote SKETCHPAD; the technique has a lot of advantages beyond potentially cheaper IPC. Register bank switching for interrupt handling was one of the major features the Z80 had over the 8080 (you can think of the interrupt handler as being the kernel). The Tera MTA in the 01990s was at least widely talked about if not widely imitated. Switching register sets is how "SMT" works and also sort of how GPUs work. And today Padauk's "FPPA" microcontrollers (starting around 12 cents IIRC) use register bank switching to get much lower I/O latency than competing microcontrollers that must take an interrupt and halt background processing until I/O is complete.

Another alternative approach to memory protection is to do it in software, like Java, Oberon, and Smalltalk do, and Liedtke's EUMEL did; then an IPC can be just an ordinary function call. Side-channel leaks like Spectre seem harder to plug in that scenario. GC may make fault isolation difficult in such an environment, particularly with regard to performance bugs that make real-time tasks miss deadlines, and possibly Rust-style memory ownership could help there.


What I would like to have is a context switch latency comparable to a function call. For example, if in a microkernel system bus driver, network card driver, firewall, TCP stack, socket service are all separate userspace processes, then every time a packet arrives there would be a context-switching festival.

As I understand it, in microkernel OSes most system calls are simply IPCs - for example, the network card driver passes an incoming packet to the firewall. So there is almost no kernel work except for the context switch. That's why it has to be as fast as possible and resemble a normal function call, maybe even without invoking the kernel at all. Maybe something like Intel's call gate, but fast.
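One way to get a kernel-free message path between two such processes is a single-producer/single-consumer ring in shared memory; here's a minimal single-threaded sketch (a real cross-core version needs atomics and memory fences on the head/tail updates):

```c
#include <assert.h>
#include <stdint.h>

#define RING_SLOTS 16   /* must be a power of two */

/* SPSC ring in memory mapped into both processes: a send is a couple
 * of stores with no kernel entry at all.  head and tail increase
 * monotonically; unsigned wraparound keeps the arithmetic correct. */
struct ring {
    uint64_t buf[RING_SLOTS];
    unsigned head;  /* advanced only by the consumer */
    unsigned tail;  /* advanced only by the producer */
};

static int ring_send(struct ring *r, uint64_t msg)
{
    if (r->tail - r->head == RING_SLOTS)
        return 0;                        /* full */
    r->buf[r->tail % RING_SLOTS] = msg;
    r->tail++;                           /* publish */
    return 1;
}

static int ring_recv(struct ring *r, uint64_t *msg)
{
    if (r->head == r->tail)
        return 0;                        /* empty */
    *msg = r->buf[r->head % RING_SLOTS];
    r->head++;                           /* consume */
    return 1;
}
```

The catch, as discussed below, is that this is asynchronous: if the sender blocks waiting for a reply, you're back to paying something comparable to a context switch anyway.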

> they aren't compatible with anything that calls fork().

I wouldn't miss it; for example, Windows works fine without it.


At the core of any protection boundary crossing is likely going to be a pipe flush (throwing away tens or maybe 100+ instructions) - post Spectre/Meltdown we all understand that speculating past such a point into a differently privileged environment is very fraught.

I think this means we won't be seeing 'call gate' equivalents that perform close to subroutine calls on high end systems any time soon if at all


Though you certainly know more than I do about the subject, my understanding is that differently privileged environments can enqueue messages to each other without pipeline flushes, and general forms of that mechanism have performed better than subroutine calls on high-end systems since the early 01990s: Thinking Machines, MasPar, Tera, even RCU on modern amd64.

And specialized versions of this principle predate computers: a walkie-talkie has the privilege to listen to sounds in its environment, a privilege it only exercises when its talk button is pressed and which it does not delegate to other walkie-talkies, and the communication latency between two such walkie-talkies may be tens of nanoseconds, though audio communication doesn't really benefit from such short latencies. The latency across a SATA link is subnanosecond, which is useful, and neither end trusts the other.


oh totally, but then you aren't "making a procedure call" you're doing something different.

In this case your data is likely traversing the memory hierarchy far enough that the message data gets shared (more likely the sending data goes into the sending CPU's data cache and the receiving one will use the cache coherency protocol to pull it from there) - that's likely to take on the order of a pipe flush to happen.

You could also have bespoke pipe-like hardware - that's going to be a fixed resource that will require management/flow control/etc if it's going to be a general facility


Agreed, but even in the cache-line-stealing case, those are latency costs, while a pipeline flush is also a throughput cost, no? Unless one of the CPUs has to wait for the cache line ownership to be transferred.


well if you're making a synchronous call you have to wait for the response which is likely as bad as a pipe flush (or worse, because you likely flood the pipe with a tight loop waiting for the response, or a context switch to do something else while you wait)

Also note that stealing a cache line can be very expensive: if the CPUs are SMT with each other it's in the same L1, almost 0 cost; if they are on the same die it will be a few (4-5?) clocks across the L2/cache coherency fabric; but if they are on separate chiplets connected via a memory controller with L3/L4 in it then it's 4 chip boundary crossings - an order of magnitude or two in cost


All that makes sense to me. So for high performance, collaboration across security boundaries needs to be either very rare or nonblocking?

Multithreading within a security boundary is one way to "synchronously wait" without incurring a giant context-switch cost (SMT or Tera-style or Padauk FPPA-style; do GPUs do this too, at larger-than-warp granularity?). Event loops are a variant on this, and io_uring seems to think that's the future. But the GreenArrays approach is to decide that the limiting resource is nanojoules dissipated, not transistors, so just idle some transistors in a synchronous wait. Not sure if that'll ever go mainstream, but it'd fit well with the trend to greater heterogeneity.


Yes, of course.

You can have IPC that's faster than a function call if it's between cores.


>But microkernels tend to have poor performance.

Before we learned how to make them fast, perhaps. They do now tend to be very fast[0][1].

>One of the reasons for this could be high context switch latency.

As multiserver systems pass a lot of messages around, the important metric is IPC cost. Liedtke demonstrated microkernels do not have to be slow, with L3 and later L4. Liedtke's findings have endured fairly well[2] through time. It helps to know that seL4[3] has an order of magnitude faster IPC relative to Linux.

You'd need it to do a lot (think thousands of times) more IPC for the aggregated IPC to be slower than Linux.

>So I was thinking what could be done here.

I don't have a link at hand, but there's some involvement and synergy between the seL4 team and RISC-V. I am hopeful it is enough to prevent the bad scenario where RISC-V is overoptimized for the now obsolete UNIX design and a bad fit for contemporary OS architectures.

0. https://blog.darknedgy.net/technology/2016/01/01/0/

1. https://news.ycombinator.com/item?id=10824382

2. https://sigops.org/s/conferences/sosp/2013/papers/p133-elphi...

3. https://sel4.systems/About/seL4-whitepaper.pdf


Another approach is to never context switch by running all programs in kernel mode and vetting them with an interpreter/JIT compiler: https://www.destroyallsoftware.com/talks/the-birth-and-death... (only half joking)


That is how the Singularity research OS works, except it's done by static verification in a compiler.


Segment registers are precisely how NT does context switching. I think it may be restricted to just switching from user- to kernel- threads. I can't remember if there's thread-to-thread switching using segment registers — I feel like this was a thing, or it was just a thing we did when we tried to boot NT on Larrabee. (Blech.)


>But microkernels tend to have poor performance.

Citation needed. What kind of hit are we talking about? 5%? 90%? We have supercomputers from the future that have capacity to spare. I would be willing to take an enormous performance hit for better security guarantees on essential infrastructure (routers, firewalls, file servers, electrical grid, etc).



