Linux used to have relatively low syscall overhead, especially on modern, aggressively speculating CPUs.
But after the Spectre and Meltdown mitigations landed, it felt like the 1990s all over again, when syscall overhead was a huge cost relative to the MIPS available.
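A quick way to see the gap yourself is to time a cheap syscall in a tight loop. A minimal sketch (not from the article; the iteration count is arbitrary and numbers depend on CPU, kernel version, and mitigation settings):

```c
// Rough micro-benchmark: average round-trip cost of a cheap syscall (getpid).
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const long iters = 10 * 1000 * 1000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++) {
        // syscall(2) bypasses glibc caching/vDSO shortcuts, forcing a real kernel entry.
        syscall(SYS_getpid);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per syscall\n", ns / iters);
    return 0;
}
```

Running it on an otherwise identical kernel booted with and without `mitigations=off` should show roughly the difference being described here.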
The article quotes the Intel docs: "Instruction ordering: Instructions following a SYSCALL may be fetched from memory before earlier instructions complete execution, but they will not execute (even speculatively) until all instructions prior to the SYSCALL have completed execution (the later instructions may execute before data stored by the earlier instructions have become globally visible)."
More detail here would be great, especially using the terms "issue" and "commit" rather than "execute".
A barrier makes sense to me, but preventing instructions from issuing seems like too hard of a requirement; how could anyone tell?
> preventing instructions from issuing seems like too hard of a requirement
If that weren't prevented, you could perform a SYSCALL in the shadow of a mispredicted branch and then try to use it to leak data from privileged code.
When the machine encounters an instruction that changes privilege level, you need to validate that you're on a correct path before you start scheduling and executing instructions from another context. Otherwise, you might be creating a situation where instructions in userspace can speculatively influence instructions in the kernel (among probably many other things).
That's why you typically make things like this drain the pipeline: once all older instructions have retired, you know that you're on a correct (not mispredicted) path through the program.
edit: Also, here's a recent example[^1] of how tricky these things can be (where SYSCALL isn't even serializing enough to prevent effects in one privilege level from propagating to another)
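To make the mispredicted-branch scenario above concrete, here's a deliberately non-functional sketch of the shape of the hazard (hypothetical names: shadow_example, idx, limit; this is an illustration, not an exploit):

```c
// Illustrative only: why the SYSCALL target must not issue until the
// guarding branch resolves. Assume idx and limit are attacker-controlled.
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

void shadow_example(long idx, long limit) {
    // Train the predictor to expect the taken path, then pass a value of idx
    // that fails the check. During the misprediction window the front end
    // still fetches the SYSCALL below, even though this path is
    // architecturally dead.
    if (idx < limit) {
        // If the instructions at the privileged SYSCALL target could issue
        // speculatively here, kernel code would start executing under a
        // mispredicted, attacker-steered user path. The documented behavior
        // (nothing after SYSCALL executes until all older instructions have
        // completed) is what rules this out.
        syscall(SYS_getpid);
    }
}
```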
It might have more to do with the difficulty of separating out the contexts of the two execution streams across the rings. Someone may have looked at the cost and complexity of all that accounting and said 'hell no'.
Yeah, I would probably say the same. It is a bit strange to document this as part of the architecture (rather than leaving it open as a potential future microarchitectural optimization). Is there some advantage for an OS in knowing that the CPU flushes the pipeline on each system call?
There are so many extra steps; the CPU is obviously designed for a legacy monolithic OS like Windows, which makes syscalls rarely, and it would perform poorly with microkernels, which are much safer and better than Windows.
For example, why bother saving userspace registers? Just zero them out to prevent leaks, ideally with a single instruction.
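For what it's worth, Linux's 64-bit entry path already zeroes most general-purpose registers on kernel entry as a speculation hardening, as far as I know, though on top of saving them rather than instead of it. A rough C sketch of the "zero instead of save" idea, with made-up names (user_regs, syscall_entry); the real thing would live in the assembly entry stub:

```c
// Hand-wavy sketch of scrubbing instead of saving user registers at syscall
// entry. Struct layout and function names are hypothetical stand-ins.
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct user_regs {
    uint64_t rax;                                    /* syscall number          */
    uint64_t rdi, rsi, rdx, r10, r8, r9;             /* x86-64 syscall arguments */
    uint64_t rbx, rcx, rbp, r11, r12, r13, r14, r15; /* everything else         */
};

void syscall_entry(struct user_regs *regs) {
    /* Keep the number and arguments, scrub everything else. Userspace must
     * then treat a syscall as clobbering all registers, like a call with no
     * callee-saved registers, and save anything it still needs itself. */
    memset(&regs->rbx, 0, sizeof(*regs) - offsetof(struct user_regs, rbx));
    /* ... dispatch on regs->rax ... */
}
```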
https://fosspost.org/disable-cpu-mitigations-on-linux