It's more like UNIX and Linux being way behind on threading technology. User-space mutexes were in QNX over two decades ago, and on the UNIVAC 1108 half a century ago.[1] Here's an implementation from 1972, by John Walker.
. DIJKSTRA P FUNCTION
.
.
. LA,U A0,<QUEUE>
. LMJ X11,P
. <RETURN> X5 DESTROYED
.
P* TS QHEAD,A0 LOCK THE QUEUE
LX X5,QN,A0 LOAD QUEUE COUNT
ANX,U X5,1 BACK UP THE COUNT
SX X5,QN,A0 REPLACE THE COUNT IN THE QUEUE
TN X5 DO WE NEED TO DEACTIVATE HIM ?
J PDONE NO. SKIP DEACTIVATION
ON TSQ=0
LX X5,QHL,A0 LOAD BACK LINK OF QUEUE
SX X5,QHL,X4 PUT INTO BACK LINK OF ACTIVITY
SX X4,QFL,X5 CHAIN ACTIVITY TO LAST ACTIVITY
SA A0,QFL,X4 CHAIN HEAD TO NEW ACTIVITY
SX X4,QHL,A0 MAKE THE NEW ACTIVITY LAST ON QUEUE
CTS QHEAD,A0 RELEASE PROTECTION ON QUEUE HEAD
SCHDACT* DACT$ . DEACTIVATE PROCESS
OFF
ON TSQ
C$TSQ QHEAD,A0 WAIT FOR C$TSA
OFF
J 0,X11 RETURN AFTER ACTIVATION
.
PDONE CTS QHEAD,A0 UNLOCK THE QUEUE
J 0,X11 RETURN
And that followed Dijkstra's paper, published in Dutch in the mid-1960s.
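For readers who don't speak 1108 assembler, here is a rough modern-C rendering of that P routine. It's a sketch only: enqueue_self() and deactivate_self() are hypothetical stand-ins for the queue-linking code and the EXEC DACT$ call, and the real code's handling of the race between unlocking and deactivating is glossed over.

#include <stdatomic.h>

struct dqueue {
    atomic_flag lock;    /* the TS/CTS word at QHEAD                      */
    int         count;   /* semaphore count, QN                           */
    /* plus a doubly linked list of waiting activities (QFL / QHL)        */
};

void enqueue_self(struct dqueue *q);   /* hypothetical: chain caller onto the queue    */
void deactivate_self(void);            /* hypothetical: DACT$, sleep until a V wakes us */

void P(struct dqueue *q)
{
    while (atomic_flag_test_and_set(&q->lock))   /* TS QHEAD,A0: lock the queue */
        ;                                        /* spin                        */

    if (--q->count < 0) {                        /* ANX,U / TN: back up the count, test sign   */
        enqueue_self(q);                         /* chain this activity onto the queue         */
        atomic_flag_clear(&q->lock);             /* CTS: release protection on the queue head  */
        deactivate_self();                       /* DACT$: deactivate until someone does a V   */
    } else {
        atomic_flag_clear(&q->lock);             /* CTS: unlock the queue and return           */
    }
}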
You can fast-path a userspace mutex (or any blocking primitive) on top of pretty much any blocking system call (semaphores, pipes, poll, sigwait, whatever). The nifty thing about futexes, and hashed wait queues in general, is that they consume kernel resources only while they're actively being waited on. It's not obvious (though certainly possible) that the same is true of the UNIVAC implementation above.
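For concreteness, a minimal sketch of that fast path layered on a plain POSIX semaphore as the blocking primitive (assuming C11 atomics; the fastlock_* names are just for illustration):

#include <stdatomic.h>
#include <semaphore.h>

typedef struct {
    atomic_int count;   /* number of threads that want the lock          */
    sem_t      sem;     /* kernel object, touched only under contention  */
} fastlock_t;

void fastlock_init(fastlock_t *m)
{
    atomic_init(&m->count, 0);
    sem_init(&m->sem, 0, 0);          /* starts at 0: nothing to consume yet */
}

void fastlock_lock(fastlock_t *m)
{
    /* Fast path: if the old count was 0, we own the lock without a syscall. */
    if (atomic_fetch_add(&m->count, 1) > 0)
        sem_wait(&m->sem);            /* slow path: sleep until an unlock posts */
}

void fastlock_unlock(fastlock_t *m)
{
    /* If anyone else bumped the count while we held the lock, wake one of them. */
    if (atomic_fetch_sub(&m->count, 1) > 1)
        sem_post(&m->sem);
}

Uncontended lock/unlock is a single atomic op each; the kernel semaphore only gets involved when two threads actually collide.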
Yup. The fast-path trick was well known; BeOS made a big deal of its Benaphore, which is exactly this trick layered on its OS semaphore.
Exactly as you said, the problem with a Benaphore is that if your program wants 100 Benaphores, you're creating 100 OS-wide semaphores. Even if your program never experiences contended locking, that's 100 units of a scarce OS resource used up just in case. And if you ask for a million Benaphores, the program won't run at all: the entire OS only allows (IIRC) 65536 semaphores in total.
Putting this fast path together with the wait-queue idea gives you a futex: very fast, yet very cheap.
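Something like this minimal Linux sketch, roughly the classic three-state design from Drepper's "Futexes Are Tricky" (error handling omitted). The entire lock is one int in user memory; the kernel allocates a wait-queue entry only while somebody is actually asleep on it:

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct { atomic_int state; } fumutex_t;   /* 0 = free, 1 = locked, 2 = locked with waiters */

static long futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void fumutex_lock(fumutex_t *m)
{
    int c = 0;
    /* Fast path: 0 -> 1 with a single CAS, no system call at all. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;

    /* Slow path: mark the lock contended (state 2) and sleep in the kernel.
       FUTEX_WAIT only blocks if the word still equals 2, so a racing unlock
       can't be missed. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        futex(&m->state, FUTEX_WAIT, 2);
        c = atomic_exchange(&m->state, 2);
    }
}

void fumutex_unlock(fumutex_t *m)
{
    /* If the old state was 1, nobody is waiting: no system call needed. */
    if (atomic_exchange(&m->state, 0) == 2)
        futex(&m->state, FUTEX_WAKE, 1);   /* wake one sleeper */
}

Zero-initialize the struct and you have an unlocked mutex: no registration with the kernel, no per-lock handle, nothing to run out of at 65536.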
[1] https://www.fourmilab.ch/documents/univac/fang/