
The presentation was interesting, but I'd like to share an idea that is tangentially related to this CPU.

I noticed that modern CPUs are optimized for legacy monolithic OS kernels like Linux or Windows. But having a large, multimegabyte kernel is a bad idea from a security standpoint. A single mistake or intentional error in some rarely used component (like a temperature sensor driver) can give an attacker full access to the system. Likewise, an error in any part of a monolithic kernel can cause system failure. And the Linux kernel doesn't even use static analysis to find bugs! It is obvious that using microkernels could solve many of these issues.

But microkernels tend to have poor performance. One of the reasons for this could be high context switch latency. CPUs with high context switch latency are only good for legacy OSes and not ready for better future kernels. Therefore, either we will find a way to make context switches fast or we will have to stay with large, insecure kernels full of vulnerabilities.

So I was thinking about what could be done here. For example, one improvement would be to get rid of the address space switch. It causes flushes of various caches, which hurts performance. Instead, we could always use a single mapping from virtual to physical addresses, but allocate each process a different virtual address range. To implement this, we could add two registers that would hold the minimum and maximum accessible virtual addresses. It should be easy to check each address against them to prevent speculative out-of-bounds memory accesses.
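A minimal software sketch of what that bounds check might look like (the register names, widths, and semantics here are all made up for illustration; real hardware would do this in the load/store path, not in C):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-process bounds registers; the kernel would reload
 * these on a context switch instead of swapping page tables. */
static uint64_t vaddr_min, vaddr_max;

/* What the hardware would check on every access (and before any
 * speculative load): the whole [vaddr, vaddr+len) range must fall
 * inside the process's window.  Assumes len >= 1 and no wraparound. */
static int access_ok(uint64_t vaddr, uint64_t len)
{
    return vaddr >= vaddr_min &&
           vaddr <= vaddr_max &&
           len <= vaddr_max - vaddr + 1;
}
```

Because the check is a pair of compares with no memory lookup, it could plausibly gate speculation without adding a table walk to the critical path.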

By the way, the 32-bit x86 architecture had segments that could be used to divide a single address space between processes.

Another thing that can take time is saving/restoring registers on a context switch. One way to solve the problem could be to use multiple banks (say, 64 banks) of registers that can be quickly switched; another would be to zero out registers on return from the kernel and let processes save them if they need to.
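The banked-register idea can be modelled in software like this (bank and register counts are the ones suggested above; in real hardware the bank-select register would drive a mux into the register file, not index a memory array):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BANKS 64
#define NUM_REGS  32

/* Toy model of banked register files: the bank-select register picks
 * which physical file the architectural registers map to, so a context
 * switch is one write to bank_sel rather than 32 stores + 32 loads. */
static uint64_t bank[NUM_BANKS][NUM_REGS];
static unsigned bank_sel;

static void context_switch(unsigned next)     { bank_sel = next; }
static uint64_t read_reg(unsigned r)          { return bank[bank_sel][r]; }
static void write_reg(unsigned r, uint64_t v) { bank[bank_sel][r] = v; }
```

Each process keeps its registers live in its own bank across switches; only when you run out of banks would you have to spill one to memory.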

Or am I wrong somewhere and fast context switches cannot be implemented this way?



These days there are few caches that need to be flushed at context switch time - RISC-V's ASIDs mean that you don't need to flush the TLBs (mostly) when you context switch.

VRoom! largely has physically tagged caches so they don't need to be flushed. The BTC is virtually tagged, but split into kernel and user caches; you need to flush the user one on a context switch (or both on a VM switch) - the trace cache (L0 icache) is also virtually tagged. VRoom! also doesn't do speculative accesses past the TLBs.

Honestly saving and restoring kernel context is small compared to the time spent in the kernel (and I've spent much of the past year looking at how this works in depth).

Practically you have to design stuff to an architecture (like RISC-V) so that one can leverage the work of others (compilers, libraries, kernels). Adding specialised stuff that would (in this case) get into a critical timing path is something that one has to consider very carefully - but that's a lot of what RISC-V is about - you can go and knock up that chip yourself on an FPGA and start trialing it on your microkernel.


Since this was a bit hard to google:

ASID = Address Space Identifier. It's a tag that uniquely identifies each process's entries in the TLB. This ensures that TLB lookups can be limited to the valid entries for the current process, so you don't need to flush the TLB on a context switch.
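A toy software model of that lookup (field widths are arbitrary, and a real TLB is a set-associative CAM rather than a linear scan):

```c
#include <assert.h>
#include <stdint.h>

#define TLB_ENTRIES 64

/* Toy model of an ASID-tagged TLB: a lookup hits only when both the
 * virtual page number and the ASID match, so entries belonging to
 * other processes can stay resident across a context switch. */
struct tlb_entry {
    int      valid;
    uint16_t asid;
    uint64_t vpn;   /* virtual page number */
    uint64_t ppn;   /* physical page number */
};

static struct tlb_entry tlb[TLB_ENTRIES];

static int tlb_lookup(uint16_t asid, uint64_t vpn, uint64_t *ppn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].asid == asid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return 1;   /* hit */
        }
    }
    return 0;           /* miss: walk the page table */
}
```

On a context switch the OS just changes the current ASID; stale entries from the previous process simply stop matching instead of needing a flush.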


I think the way to think of ASIDs is as each being a separate address space - in effect if you have 15 bits of ASID you have 2^15 = 32k address spaces.

One thing I've done in VRoom! which is an extension to the RISC-V spec is that if we have an N-hart SMP CPU (for example a 2-CPU SMT system) we use log2(N) bits of the ASID to select which hart/cpu a TLB entry belongs to - from a programmer's point of view the ASID just looks smaller.

However there's a VRoom! specific config bit (off by default) that you can set if you know that the ASIDs you are going to use for all your CPUs effectively see the same address space - if you set that bit then the per-cpu portion of the ASID tags (in the TLB) becomes available (ie to the programmer the ASID looks bigger) - it's a great hack because it doesn't get into any critical paths anywhere
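If I've understood the scheme, the tag construction is roughly this (bit widths are assumptions: a 15-bit ASID field and 1 hart bit for a 2-hart system):

```c
#include <assert.h>
#include <stdint.h>

#define ASID_BITS 15
#define HART_BITS 1   /* log2(N) for a hypothetical 2-hart SMT system */

/* Sketch of the described scheme: in per-hart mode the top HART_BITS
 * of the TLB tag come from the hart ID, so each hart sees a smaller
 * ASID space; when the config bit says all harts share one address
 * space, the full field is available to the programmer as ASID. */
static uint16_t tlb_tag(uint16_t asid, unsigned hart, int shared_mode)
{
    if (shared_mode)
        return asid & ((1u << ASID_BITS) - 1);          /* full 15-bit ASID */
    uint16_t small = asid & ((1u << (ASID_BITS - HART_BITS)) - 1);
    return (uint16_t)(hart << (ASID_BITS - HART_BITS)) | small;
}
```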


https://github.com/riscv/riscv-isa-manual/issues/348 suggests that the RISC-V folks are going to address this in the spec at some point.


RISC-V really is the heir of MIPS as it had those TLB tags as well.


Thanks, this is really informative.


Long ago, we in the Newton project at Apple had that idea. We (in conjunction with ARM) were defining the first ARM MMU, so we took the opportunity to implement “domains” of memory protection mappings that could be quickly swapped at a context switch. So you get multiple threads in the same address space, but with independent R/W permission mappings.

I think a few other ARM customers were intrigued by the security possibilities, but the vast majority were more like “what is this bizarre thing, I just want to run Unix”, so the feature disappeared eventually.

Here’s some ARM documentation if you want to pull this thread: https://developer.arm.com/documentation/dui0056/latest/cac...


Too late to edit, but here's a documentation link that works better: https://developer.arm.com/documentation/dui0056/d/caches-and...


It's similar to the original Mac OS, which used handles to track/access/etc memory requested from the OS and swap it to disk as needed. First you request the space, then you request access, which pinned it into RAM.

PalmOS was another one that worked similarly. https://www.fuw.edu.pl/~michalj/palmos/Memory.html


You should look into the Mill CPU architecture.[0] Its design should make microkernels much more viable.

* Single 64-bit address space. Caches use virtual addresses.

* Because of that, the TLB is moved after the last level cache, so it's not on the critical path.

* There's instead a PLB (protection lookaside buffer), which can be searched in parallel with cache lookup. (Technically, there's three: two instruction PLBs and one data PLB.)

[0]: https://millcomputing.com/


I was also going to mention the Mill, but it's become a bit of a Flying Dutchman that people tell tales of but which probably doesn't exist.


Fundamental rethinks take time. The ideas expressed by the Mill folks have value independent of any specific implementation or absence thereof. Yosys is incredible, and the dropping cost and increasing availability of capable FPGA dev boards equally so. I wouldn't put it past a sharp CS major to whip up a toy Mill CPU in an FPGA these days just based on what's been shared publicly. It's a bit strange to me that I can still see echoes of the Datapoint 2200 in a modern machine.

I'd also like to see further work related to this: https://core.ac.uk/reader/161119546

There's been a lot of recapitulation and growth in the language space recently as well, showing up in languages like Zig and Rust, paving the way for better utilization across heterogeneous and many core architectures. I feel like Rust's memory semantics don't hurt the mill either, and may help a lot.


They won't commercialize their design. Your best bet would be to reverse engineer their designs and build your own.


I went looking and it seems that they're making some progress. I wasn't previously aware of their wiki, which contains ISA documentation and more: http://millcomputing.com/wiki/Main_Page


SASOSes are interesting, sometimes extending a 64-bit address space to cover a whole cluster, but they aren't compatible with anything that calls fork().

The various variants of L4 have pretty good context-switch latency even on traditional CPUs, and seL4 in particular is formally proven correct on a few platforms. Spectre+Meltdown mitigation was painful for them, but they're still pretty good.

Lots of microcontrollers have no MMUs but do have MPUs to keep a user task from cabbaging the memory of the kernel or other tasks. Not sure if any of them use the PDP-11-style base+offset segment scheme you're describing to define the memory regions.

Protected-memory multitasking on a multicore system doesn't need to involve context switches, especially with per-core memory.

Even on Linux, context switches are cheap when your memory map is small. httpdito normally has five pages mapped and takes about 100 microseconds (on a 2.8GHz amd64 laptop) to fork, serve a request, and exit. I think I've measured context switches a lot faster than that between two existing processes.

Multiple register banks for context switching go back to the CDC 6600's peripheral processor (FEP) or maybe the TX-0 on which Sutherland wrote SKETCHPAD; the technique has a lot of advantages beyond potentially cheaper IPC. Register bank switching for interrupt handling was one of the major features the Z80 had over the 8080 (you can think of the interrupt handler as being the kernel). The Tera MTA in the 01990s was at least widely talked about if not widely imitated. Switching register sets is how "SMT" works and also sort of how GPUs work. And today Padauk's "FPPA" microcontrollers (starting around 12 cents IIRC) use register bank switching to get much lower I/O latency than competing microcontrollers that must take an interrupt and halt background processing until I/O is complete.

Another alternative approach to memory protection is to do it in software, like Java, Oberon, and Smalltalk do, and Liedtke's EUMEL did; then an IPC can be just an ordinary function call. Side-channel leaks like Spectre seem harder to plug in that scenario. GC may make fault isolation difficult in such an environment, particularly with regard to performance bugs that make real-time tasks miss deadlines, and possibly Rust-style memory ownership could help there.


What I would like to have is a context switch latency comparable to a function call. For example, if in a microkernel system bus driver, network card driver, firewall, TCP stack, socket service are all separate userspace processes, then every time a packet arrives there would be a context-switching festival.

As I understand it, in microkernel OSes most system calls are simply IPCs - for example, the network card driver passes an incoming packet to the firewall. So there is almost no kernel work except for the context switch. That's why it has to be as fast as possible and resemble a normal function call, maybe even without invoking the kernel at all. Maybe something like Intel's call gate, but fast.
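One way to get a kernel-free message path between two such processes is a single-producer/single-consumer ring in shared memory; here's a minimal single-threaded sketch (a real cross-core version needs atomics and memory fences on the head/tail updates):

```c
#include <assert.h>
#include <stdint.h>

#define RING_SLOTS 16   /* must be a power of two */

/* SPSC ring in memory mapped into both processes: a send is a couple
 * of stores with no kernel entry at all.  head and tail increase
 * monotonically; unsigned wraparound keeps the arithmetic correct. */
struct ring {
    uint64_t buf[RING_SLOTS];
    unsigned head;  /* advanced only by the consumer */
    unsigned tail;  /* advanced only by the producer */
};

static int ring_send(struct ring *r, uint64_t msg)
{
    if (r->tail - r->head == RING_SLOTS)
        return 0;                        /* full */
    r->buf[r->tail % RING_SLOTS] = msg;
    r->tail++;                           /* publish */
    return 1;
}

static int ring_recv(struct ring *r, uint64_t *msg)
{
    if (r->head == r->tail)
        return 0;                        /* empty */
    *msg = r->buf[r->head % RING_SLOTS];
    r->head++;                           /* consume */
    return 1;
}
```

The catch, as discussed below, is that this is asynchronous: if the sender blocks waiting for a reply, you're back to paying something comparable to a context switch anyway.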

> they aren't compatible with anything that calls fork().

I wouldn't miss it; for example, Windows works fine without it.


At the core of any protection boundary crossing is likely going to be a pipe flush (throwing away tens or maybe 100+ instructions) - post Spectre/Meltdown we all understand that speculating past such a point into a differently privileged environment is very fraught.

I think this means we won't be seeing 'call gate' equivalents that perform close to subroutine calls on high end systems any time soon if at all


Though you certainly know more than I do about the subject, my understanding is that differently privileged environments can enqueue messages to each other without pipeline flushes, and general forms of that mechanism have performed better than subroutine calls on high-end systems since the early 01990s: Thinking Machines, MasPar, Tera, even RCU on modern amd64.

And specialized versions of this principle predate computers: a walkie-talkie has the privilege to listen to sounds in its environment, a privilege it only exercises when its talk button is pressed and which it does not delegate to other walkie-talkies, and the communication latency between two such walkie-talkies may be tens of nanoseconds, though audio communication doesn't really benefit from such short latencies. The latency across a SATA link is subnanosecond, which is useful, and neither end trusts the other.


oh totally, but then you aren't "making a procedure call" you're doing something different.

In this case your data is likely traversing the memory hierarchy far enough that the message data gets shared (more likely the sending data goes into the sending CPU's data cache and the receiving one will use the cache coherency protocol to pull it from there) - that's likely to take on the order of a pipe flush to happen.

You could also have bespoke pipe-like hardware - that's going to be a fixed resource that will require management/flow control/etc if it's going to be a general facility


Agreed, but even in the cache-line-stealing case, those are latency costs, while a pipeline flush is also a throughput cost, no? Unless one of the CPUs has to wait for the cache line ownership to be transferred.


well if you're making a synchronous call you have to wait for the response which is likely as bad as a pipe flush (or worse, because you likely flood the pipe with a tight loop waiting for the response, or a context switch to do something else while you wait)

Also note that stealing a cache line can be very expensive: if the CPUs are SMT with each other it's in the same L1, almost 0 cost; if they are on the same die it will be a few (4-5?) clocks across the L2/cache coherency fabric; but if they are on separate chiplets connected via a memory controller with L3/L4 in it then it's 4 chip boundary crossings - an order of magnitude or two in cost


All that makes sense to me. So for high performance, collaboration across security boundaries needs to be either very rare or nonblocking?

Multithreading within a security boundary is one way to "synchronously wait" without incurring a giant context-switch cost (SMT or Tera-style or Padauk FPPA-style; do GPUs do this too, at larger-than-warp granularity?). Event loops are a variant on this, and io_uring seems to think that's the future. But the GreenArrays approach is to decide that the limiting resource is nanojoules dissipated, not transistors, so just idle some transistors in a synchronous wait. Not sure if that'll ever go mainstream, but it'd fit well with the trend to greater heterogeneity.


Yes, of course.

You can have IPC that's faster than a function call if it's between cores.


>But microkernels tend to have poor performance.

Before we learned how to make them fast, perhaps. They do now tend to be very fast[0][1].

>One of the reasons for this could be high context switch latency.

As multiserver systems pass a lot of messages around, the important metric is IPC cost. Liedtke demonstrated microkernels do not have to be slow, with L3 and later L4. Liedtke's findings have endured fairly well[2] through time. It helps to know that seL4[3] has an order of magnitude faster IPC relative to Linux.

You'd need it to do a lot (think thousands of times) more IPC for the aggregated IPC to be slower than Linux.

>So I was thinking what could be done here.

I don't have a link at hand, but there's some involvement and synergy between the seL4 team and RISC-V. I am hopeful it is enough to prevent the bad scenario where RISC-V is overoptimized for the now obsolete UNIX design and a bad fit for contemporary OS architectures.

0. https://blog.darknedgy.net/technology/2016/01/01/0/

1. https://news.ycombinator.com/item?id=10824382

2. https://sigops.org/s/conferences/sosp/2013/papers/p133-elphi...

3. https://sel4.systems/About/seL4-whitepaper.pdf


Another approach is to never context switch by running all programs in kernel mode and vetting them with an interpreter/JIT compiler: https://www.destroyallsoftware.com/talks/the-birth-and-death... (only half joking)


That is how the Singularity research OS works, except it's done by static verification in a compiler.


Segment registers are precisely how NT does context switching. I think it may be restricted to just switching from user- to kernel- threads. I can't remember if there's thread-to-thread switching using segment registers — I feel like this was a thing, or it was just a thing we did when we tried to boot NT on Larrabee. (Blech.)


>But microkernels tend to have poor performance.

Citation needed. What kind of hit are we talking about? 5%? 90%? We have supercomputers from the future that have capacity to spare. I would be willing to take an enormous performance hit for better security guarantees on essential infrastructure (routers, firewalls, file servers, electrical grid, etc).



