Do you have a citation for kernel mode having more efficient context switches? What kind of direct hardware access are you referring to that would be better than pushing the register context onto the stack?
In my experience, the exact opposite is true, particularly in the era of CPU mitigations that require TLB flushes upon every kernel-mode context switch.
You're right, kernel-level context switching is much slower than user-level context switching.
User-level switching also has the advantage of knowing more about the task that is running, so it can often avoid saving and restoring as much state as a kernel-level switch would. See Go's goroutines (its green threads) for a great example of this kind of cooperation between runtime and language.
> Do you have a citation for kernel mode having more efficient context switches? What kind of direct hardware access are you referring to that would be better than pushing the register context onto the stack?
The closest thing to this that I can think of is 32-bit x86, which did have hardware-assisted context switching via the TSS (Task State Segment).
As it happens, everybody stopped using it because it was too slow, and a bit painful unless you fully bought into x86's awful segmentation model. Early Linux kernels used it, if you want to see it in action.
> In my experience, the exact opposite is true, particularly in the era of CPU mitigations that require TLB flushes upon every kernel-mode context switch.