Same problem occurs with C++ std::shared_ptr. I guess all reference counting has this inherent scaling issue, due to contention ruining cache lines. Makes me wonder how/if you get linear parallelism in Swift.
Not sure what is currently in Swift, but this paper describes a biased reference counting approach, i.e. two counters: one non-atomic, used only by a specific thread (the supposed owner?), and another (atomic) used by all other threads, so that the sum of the two is the real reference count (more or less). Paper here - https://dl.acm.org/doi/pdf/10.1145/3243176.3243195
(Before reading the paper I expected the additional bytes to hold the split counter plus a thread id, but it actually packs everything in, using the lower bits for reference counting.)
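Roughly, the split-counter idea looks something like this (my own simplified C++ sketch, not the paper's actual layout: the real scheme packs both counters and the owner id into the existing refcount word, and has an "unbias/merge" protocol that this glosses over):

    #include <atomic>
    #include <cstdint>
    #include <thread>

    class BiasedRefCount {
        std::thread::id owner_ = std::this_thread::get_id();  // the biased-to thread
        uint32_t biased_ = 1;                // only ever touched by owner_: plain ops
        std::atomic<int32_t> shared_{0};     // touched by every other thread

    public:
        void retain() {
            if (std::this_thread::get_id() == owner_)
                ++biased_;                                       // fast path, no atomics
            else
                shared_.fetch_add(1, std::memory_order_relaxed); // slow path, atomic
        }

        // Returns true when the last reference is gone. The real algorithm merges
        // the two counters when the owner's count hits zero (or the owner exits);
        // here that step is only hinted at.
        bool release() {
            if (std::this_thread::get_id() == owner_) {
                if (--biased_ != 0) return false;
                // Owner dropped its last reference: the shared counter decides.
                // (Real protocol: set an "unbiased" flag instead of this load.)
                return shared_.load(std::memory_order_acquire) == 0;
            }
            return shared_.fetch_sub(1, std::memory_order_acq_rel) == 1 /* && unbiased */;
        }
    };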
I wonder what abseil/folly/tbb do - need to check (we are heavy std::shared_ptr users, but I don't think the 14 bits described in the paper above would be enough for our use case).
Can't speak for abseil and tbb, but in folly there are a few solutions for the common problem of sharing state between a writer that updates it very infrequently and concurrent readers that read it very frequently (typical use case is configs).
If you absolutely need a std::shared_ptr (which can be the case if you're working with pre-existing interfaces) there is CoreCachedSharedPtr (https://github.com/facebook/folly/blob/main/folly/concurrenc...), which uses an aliasing trick to transparently maintain per-core reference counts and scales linearly, but it only helps when acquiring the shared_ptr; any subsequent copies of it would still cause contention if passed around between threads.
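From memory it is used roughly like this (check the header for the exact API; the get()/reset() names here are from memory):

    #include <folly/concurrency/CoreCachedSharedPtr.h>
    #include <memory>

    struct Config { int value = 42; };

    // Holds the "master" pointer; per-core aliases are maintained internally so
    // readers on different cores don't all bump the same control block.
    folly::CoreCachedSharedPtr<Config> cached{std::make_shared<Config>()};

    std::shared_ptr<Config> readConfig() {
        // Bumps a per-core reference count, so concurrent callers on different
        // cores don't contend on one cache line. Copying the returned shared_ptr
        // further, however, goes back to contending on that per-core count.
        return cached.get();
    }

    void publish(std::shared_ptr<Config> next) {
        cached.reset(std::move(next));  // the infrequent writer swaps in a new config
    }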
[1] Google has a proposal to make a smart pointer based on RCU/hazptr, but I'm not a fan of it because generally RCU/hazptr guards need to be released in the same thread that acquired them, and hiding them in a freely movable object looks like a recipe for disaster to me, especially if paired with coroutines: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p05...
Yes, though note that RW locks need to maintain the count of readers, and most implementations just use a single counter in the lock state word, which makes them as subject to contention as a reference count.
folly::SharedMutex, the main RW lock in folly, tries to shard them by core when it detects contention (in fact it is the OG core-sharded primitive in folly) and when that works it is virtually linearly scalable, but the detection is a heuristic (which also has to minimize memory and writer cost) so there are still access patterns that can be pathological.
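The general shape of a core-sharded reader count, heavily simplified and definitely not folly::SharedMutex's actual algorithm (just to show why readers scale and why the writer pays for it):

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <functional>
    #include <thread>

    // Each reader bumps a counter in its own cache-line-sized slot, so readers on
    // different cores don't invalidate each other's lines. The writer has to set
    // its flag and then sweep every slot, which is the cost paid for that.
    class ShardedRWLock {
        static constexpr size_t kSlots = 64;
        struct alignas(64) Slot { std::atomic<int> readers{0}; };
        std::array<Slot, kSlots> slots_;
        std::atomic<bool> writer_{false};

        size_t mySlot() const {
            return std::hash<std::thread::id>{}(std::this_thread::get_id()) % kSlots;
        }

    public:
        void lock_shared() {
            for (;;) {
                auto& s = slots_[mySlot()];
                s.readers.fetch_add(1);              // seq_cst on purpose (Dekker-style)
                if (!writer_.load()) return;         // no writer: we're in
                s.readers.fetch_sub(1);              // writer active: back off and wait
                while (writer_.load()) std::this_thread::yield();
            }
        }
        void unlock_shared() { slots_[mySlot()].readers.fetch_sub(1); }

        void lock() {
            while (writer_.exchange(true)) std::this_thread::yield();
            for (auto& s : slots_)                   // wait for in-flight readers
                while (s.readers.load() > 0) std::this_thread::yield();
        }
        void unlock() { writer_.store(false); }
    };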
> the main RW lock in folly, tries to shard them by core when it detects contention
That's very interesting. I dealt with a scenario where I had to scale a hash map to support 1k-5k concurrent readers and a few sporadic writers. I ended up sharding the hash map, with each shard using its own RW lock to guard access to the underlying data. This essentially eliminated the contention within the RW lock itself, or at least I wasn't able to measure it.
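Something along these lines (a minimal sketch of that layout, with made-up names):

    #include <array>
    #include <cstddef>
    #include <functional>
    #include <optional>
    #include <shared_mutex>
    #include <unordered_map>

    // Readers and writers for different keys land on different shards, and thus on
    // different RW locks, so the reader-count contention within any single lock
    // stays low even with thousands of concurrent readers.
    template <typename K, typename V, size_t NumShards = 64>
    class ShardedMap {
        struct Shard {
            mutable std::shared_mutex mtx;
            std::unordered_map<K, V> map;
        };
        std::array<Shard, NumShards> shards_;

        Shard& shardFor(const K& key) {
            return shards_[std::hash<K>{}(key) % NumShards];
        }
        const Shard& shardFor(const K& key) const {
            return shards_[std::hash<K>{}(key) % NumShards];
        }

    public:
        std::optional<V> find(const K& key) const {
            const Shard& s = shardFor(key);
            std::shared_lock lock(s.mtx);        // many concurrent readers per shard
            auto it = s.map.find(key);
            return it == s.map.end() ? std::nullopt : std::optional<V>(it->second);
        }

        void insert(const K& key, V value) {
            Shard& s = shardFor(key);
            std::unique_lock lock(s.mtx);        // exclusive, but only for this shard
            s.map.insert_or_assign(key, std::move(value));
        }
    };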
RW locks have a pretty limited space where they're useful (because readers are still contending on a cache line to update the reader count). They're similar to a mutex + ref count. They pretty much only work well in situations where there aren't many readers and they hold the lock for a relatively long amount of time. (Ok, there are sharded reader count variants that reduce the cost of the reader count, but they're still only useful for relatively long reader lock sections.)
Instead, if a mutex showed up as hot in a profile, I'd look at things like RCU/hazard pointers for read-biased data structures in most situations, or at sharding or otherwise splitting data between cores so that there isn't much contention on the boring, vanilla mutex.
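The read-mostly pattern I mean, sketched here with C++20's std::atomic<std::shared_ptr> just to show the shape (real RCU/hazptr replaces the per-read refcount bump with per-thread epoch/hazard records, so readers stay contention-free; this sketch also assumes writers are serialized externally):

    #include <atomic>
    #include <map>
    #include <memory>
    #include <string>

    using Table = std::map<std::string, int>;

    // Readers grab an immutable snapshot; the rare writer copies, modifies, and
    // publishes a new version. Old readers keep their snapshot alive until they
    // drop it, so they never block and never see partial updates.
    std::atomic<std::shared_ptr<const Table>> g_table{std::make_shared<const Table>()};

    int lookup(const std::string& key) {
        std::shared_ptr<const Table> snap = g_table.load();   // one snapshot per read
        auto it = snap->find(key);
        return it == snap->end() ? -1 : it->second;
    }

    void update(const std::string& key, int value) {           // single (or mutex-held) writer
        auto next = std::make_shared<Table>(*g_table.load());  // copy
        (*next)[key] = value;                                   // modify
        g_table.store(std::move(next));                         // publish
    }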
> (Ok, there are sharded reader count variants that reduce the cost of the reader count, but they're still only useful for relatively long reader lock sections.)
RW locks are a code/design smell, even with a cheap reader count.
Was that a later edit? I didn't see it when I read the comment. Also, what's the point of saying that something is not possible and then adding "actually it's possible, but I don't like it anyway" in parentheses?
What you're saying is wrong: if the reader section is long, you have no problem amortizing the cache invalidation. Sharded counts are useful when the reader section is short and the cache miss becomes dominant.
Also, I don't get this "RW locks are a code smell" dogma: not having RW mutexes forces you to design the shared state in a way that readers can assume immutability of the portion of the state they acquire, which usually means heavily pointer-based data structures with terrible cache locality for readers. That is, in order to solve a non-problem, you sacrifice the thing that really matters, namely read performance.
I've heard this thing from Googlers, who didn't have a default RW mutex for a while, then figured out that they could add support for shared sections for free and suddenly RW mutexes are great.
Coalescing reference counting [0] avoids almost all synchronisation. n.b. The abstract says they do "not require any synchronized operation in its write barrier" but they rely on a micro-architectural hack; in practice I'd expect one atomic test-and-set per modified object per collection.
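My rough reading of the barrier side (a made-up sketch, not the paper's code): the barrier logs an object, with a snapshot of its old pointer fields, only the first time it is modified in a cycle, and the collector later reconciles counts from that log; the per-object dirty flag is where the one test-and-set per modified object comes from.

    #include <atomic>
    #include <cstddef>
    #include <vector>

    struct Object {
        std::atomic_flag dirty;        // cleared by the collector each cycle (C++20: starts clear)
        std::vector<Object*> fields;   // outgoing references
    };

    struct LogEntry {
        Object* obj;
        std::vector<Object*> oldFields;  // referents before the first mutation this cycle
    };

    thread_local std::vector<LogEntry> mutationLog;

    void writeBarrier(Object& obj, size_t slot, Object* newRef) {
        if (!obj.dirty.test_and_set(std::memory_order_acq_rel)) {
            mutationLog.push_back({&obj, obj.fields});  // log old referents once per cycle
        }
        obj.fields[slot] = newRef;  // the actual pointer store: no per-store refcount traffic
        // At collection time: decrement counts of oldFields, increment counts of the
        // object's current fields, coalescing all intermediate stores.
    }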
There are no miracles here because it is not a language "feature". It is a property of algorithms. When you divide your large task into parts and schedule their execution on multiple threads, make absolutely sure that there is no locking (atomics are a form of locking) happening inside each individual task.
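E.g. instead of all workers bumping one shared atomic (or copying one shared_ptr around), give each worker its own slice of the input and its own private accumulator, and only combine at the end (trivial sketch):

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <thread>
    #include <vector>

    // The hot loop touches no atomics and no shared cache lines; the only
    // synchronization is joining the threads and summing the partial results.
    long long parallelSum(const std::vector<int>& data, unsigned numThreads) {
        std::vector<long long> partial(numThreads, 0);
        std::vector<std::thread> workers;
        const size_t chunk = (data.size() + numThreads - 1) / numThreads;

        for (unsigned t = 0; t < numThreads; ++t) {
            workers.emplace_back([&, t] {
                const size_t begin = t * chunk;
                const size_t end = std::min(data.size(), begin + chunk);
                long long local = 0;                 // private: no contention here
                for (size_t i = begin; i < end; ++i) local += data[i];
                partial[t] = local;                  // one write per worker
            });
        }
        for (auto& w : workers) w.join();
        return std::accumulate(partial.begin(), partial.end(), 0LL);
    }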