The slow part of locking is invalidation of cache lines, and this has to happen with spinlocks anyway. Modern mutex implementations also first try to acquire the lock optimistically, so in the uncontended case they are as fast as userspace spinlocks (modulo inlining).
And if you have a contended lock, then userspace spinlocks are a PITA. You need to take care of fairness, ideally deal with the scheduler (yield to a thread that is not spinning on the same spinlock), and so on.
You can do all of that properly, but even then, you're looking at maaaaybe 10-20% performance increase in real-world applications.
Pure spinlocks can win only in contrived cases, like only having exactly two threads contending for the lock, with short locked sections.
>userspace spinlocks are a PITA ...
>ideally deal with the scheduler
Don't you need to give the scheduler a hint anyway, so that the thread holding the spinlock doesn't get scheduled off the CPU and the contenders don't end up waiting longer than they ideally should?
(Or is this what you meant by your "fairness" reference?)
In my limited understanding, this was the no. 1 reason why userspace spinlocks were discouraged -- because pretty much no scheduler accepts a hint from userspace not to kick a thread off the CPU -- modulo jumping through hoops with priority, et cetera.
If I'm missing something (and I likely am), I would be glad to be educated.
> Don't you need to give the scheduler a hint anyway, so that the thread holding the spinlock doesn't get scheduled off the CPU and the contenders don't end up waiting longer than they ideally should?
How would you do it? You can change the thread's priority to realtime to prevent the scheduler from pre-empting it while holding the lock, but this requires a kernel roundtrip and several scheduler locks anyway.
You can have a worker thread pool, with individual threads hard-pinned to specific CPUs. Then you can dispatch your work onto these threads. In practice this will guarantee that they are not pre-empted, except for occasional kernel housekeeping needs.
But this will make it impossible to use the kernel-level mutexes because they can block your worker threads. So you'll have to reimplement waiting mutexes in userspace, along with a scheduler to intelligently switch to a work item that is not blocked on waiting for something else to complete.
Long story short, you're eventually going to reimplement the kernel in userspace. This can be done, and you can get some performance improvements out of it because you can avoid relatively slow kernel-userspace transitions. DPDK is a good example of this, but at that point you're not just using spinlocks, you're writing software for essentially a custom operating system with its own IO, locking, memory management, etc.
You're correct -- the problem with userspace spinlocks is that the holding thread can be scheduled off. You can prevent this to some degree (probabilistically) with isolcpus and thread pinning, but that usually doesn't prevent hardware interrupts from running on those cores (which kernel spinlocks can avoid!). This isn't really solvable without running in kernel context (to have the elevated permissions necessary to mask interrupts).