Same problem occurs with C++ std::shared_ptr. I guess all reference counting has this inherent scaling issue, due to contention ruining cache lines. Makes me wonder how/if you get linear parallelism in Swift.
Not sure what is currently in Swift, but this paper describes a biased reference counting approach, i.e. two counters: one non-atomic, used only by a specific thread (the supposed owner?), and another (atomic) used by all other threads, so that the sum of the two is the real reference count (more or less). Paper here - https://dl.acm.org/doi/pdf/10.1145/3243176.3243195
(Before reading the paper I expected the additional bytes to hold the split counter plus a thread id, but it actually packs everything in, using the lower bits for reference counting.)
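Roughly, the split-counter idea looks something like this (my own simplified C++ sketch, not the paper's actual layout: the real scheme packs both counters and the owner id into the existing refcount word, and has an "unbias/merge" protocol that this glosses over):

    #include <atomic>
    #include <cstdint>
    #include <thread>

    class BiasedRefCount {
        std::thread::id owner_ = std::this_thread::get_id();  // the biased-to thread
        uint32_t biased_ = 1;                // only ever touched by owner_: plain ops
        std::atomic<int32_t> shared_{0};     // touched by every other thread

    public:
        void retain() {
            if (std::this_thread::get_id() == owner_)
                ++biased_;                                       // fast path, no atomics
            else
                shared_.fetch_add(1, std::memory_order_relaxed); // slow path, atomic
        }

        // Returns true when the last reference is gone. The real algorithm merges
        // the two counters when the owner's count hits zero (or the owner exits);
        // here that step is only hinted at.
        bool release() {
            if (std::this_thread::get_id() == owner_) {
                if (--biased_ != 0) return false;
                // Owner dropped its last reference: the shared counter decides.
                // (Real protocol: set an "unbiased" flag instead of this load.)
                return shared_.load(std::memory_order_acquire) == 0;
            }
            return shared_.fetch_sub(1, std::memory_order_acq_rel) == 1 /* && unbiased */;
        }
    };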
I wonder what abseil/folly/tbb do - need to check (we are heavy std::shared_ptr users, but I don't think the 14 bits described in the paper above would be enough for our use case).
Can't speak for abseil and tbb, but in folly there are a few solutions for the common problem of sharing state between a writer that updates it very infrequently and concurrent readers that read it very frequently (typical use case is configs).
If you absolutely need a std::shared_ptr (which can be the case if you're working with pre-existing interfaces) there is CoreCachedSharedPtr (https://github.com/facebook/folly/blob/main/folly/concurrenc...), which uses an aliasing trick to transparently maintain per-core reference counts and scales linearly, but it only helps when acquiring the shared_ptr; any subsequent copies of it would still cause contention if passed around between threads.
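From memory it is used roughly like this (check the header for the exact API; the get()/reset() names here are from memory):

    #include <folly/concurrency/CoreCachedSharedPtr.h>
    #include <memory>

    struct Config { int value = 42; };

    // Holds the "master" pointer; per-core aliases are maintained internally so
    // readers on different cores don't all bump the same control block.
    folly::CoreCachedSharedPtr<Config> cached{std::make_shared<Config>()};

    std::shared_ptr<Config> readConfig() {
        // Bumps a per-core reference count, so concurrent callers on different
        // cores don't contend on one cache line. Copying the returned shared_ptr
        // further, however, goes back to contending on that per-core count.
        return cached.get();
    }

    void publish(std::shared_ptr<Config> next) {
        cached.reset(std::move(next));  // the infrequent writer swaps in a new config
    }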
[1] Google has a proposal to make a smart pointer based on RCU/hazptr, but I'm not a fan of it because generally RCU/hazptr guards need to be released in the same thread that acquired them, and hiding them in a freely movable object looks like a recipe for disaster to me, especially if paired with coroutines: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p05...
Yes, though note that RW locks need to maintain the count of readers, and most implementations just use a single counter in the lock state word, which makes them as subject to contention as a reference count.
folly::SharedMutex, the main RW lock in folly, tries to shard them by core when it detects contention (in fact it is the OG core-sharded primitive in folly) and when that works it is virtually linearly scalable, but the detection is a heuristic (which also has to minimize memory and writer cost) so there are still access patterns that can be pathological.
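The general shape of a core-sharded reader count, heavily simplified and definitely not folly::SharedMutex's actual algorithm (just to show why readers scale and why the writer pays for it):

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <functional>
    #include <thread>

    // Each reader bumps a counter in its own cache-line-sized slot, so readers on
    // different cores don't invalidate each other's lines. The writer has to set
    // its flag and then sweep every slot, which is the cost paid for that.
    class ShardedRWLock {
        static constexpr size_t kSlots = 64;
        struct alignas(64) Slot { std::atomic<int> readers{0}; };
        std::array<Slot, kSlots> slots_;
        std::atomic<bool> writer_{false};

        size_t mySlot() const {
            return std::hash<std::thread::id>{}(std::this_thread::get_id()) % kSlots;
        }

    public:
        void lock_shared() {
            for (;;) {
                auto& s = slots_[mySlot()];
                s.readers.fetch_add(1);              // seq_cst on purpose (Dekker-style)
                if (!writer_.load()) return;         // no writer: we're in
                s.readers.fetch_sub(1);              // writer active: back off and wait
                while (writer_.load()) std::this_thread::yield();
            }
        }
        void unlock_shared() { slots_[mySlot()].readers.fetch_sub(1); }

        void lock() {
            while (writer_.exchange(true)) std::this_thread::yield();
            for (auto& s : slots_)                   // wait for in-flight readers
                while (s.readers.load() > 0) std::this_thread::yield();
        }
        void unlock() { writer_.store(false); }
    };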
> the main RW lock in folly, tries to shard them by core when it detects contention
That's very interesting. I dealt with a scenario where I had to scale a hash map to support 1k-5k concurrent readers and a few sporadic writers. I ended up sharding the hash map, with each shard using its own RW lock to guard access to the underlying data. This essentially eliminated the contention within the RW lock itself, or at least I wasn't able to measure it.
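Something along these lines (a minimal sketch of that layout, with made-up names):

    #include <array>
    #include <cstddef>
    #include <functional>
    #include <optional>
    #include <shared_mutex>
    #include <unordered_map>

    // Readers and writers for different keys land on different shards, and thus on
    // different RW locks, so the reader-count contention within any single lock
    // stays low even with thousands of concurrent readers.
    template <typename K, typename V, size_t NumShards = 64>
    class ShardedMap {
        struct Shard {
            mutable std::shared_mutex mtx;
            std::unordered_map<K, V> map;
        };
        std::array<Shard, NumShards> shards_;

        Shard& shardFor(const K& key) {
            return shards_[std::hash<K>{}(key) % NumShards];
        }
        const Shard& shardFor(const K& key) const {
            return shards_[std::hash<K>{}(key) % NumShards];
        }

    public:
        std::optional<V> find(const K& key) const {
            const Shard& s = shardFor(key);
            std::shared_lock lock(s.mtx);        // many concurrent readers per shard
            auto it = s.map.find(key);
            return it == s.map.end() ? std::nullopt : std::optional<V>(it->second);
        }

        void insert(const K& key, V value) {
            Shard& s = shardFor(key);
            std::unique_lock lock(s.mtx);        // exclusive, but only for this shard
            s.map.insert_or_assign(key, std::move(value));
        }
    };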
RW locks have a pretty limited space where they're useful (because readers are still contending on a cache line to update the reader count). They're similar to a mutex + ref count. They pretty much only work well in situations where there aren't many readers and they hold the lock for a relatively long amount of time. (Ok, there are sharded reader count variants that reduce the cost of the reader count, but they're still only useful for relatively long reader lock sections.)
Instead, if a mutex showed up as hot in a profile, I'd look at things like RCU/hazard pointers for read-biased data structures in most situations, or at sharding or otherwise splitting data between cores so that there isn't much contention on the boring, vanilla mutex.
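The read-mostly pattern I mean, sketched here with C++20's std::atomic<std::shared_ptr> just to show the shape (real RCU/hazptr replaces the per-read refcount bump with per-thread epoch/hazard records, so readers stay contention-free; this sketch also assumes writers are serialized externally):

    #include <atomic>
    #include <map>
    #include <memory>
    #include <string>

    using Table = std::map<std::string, int>;

    // Readers grab an immutable snapshot; the rare writer copies, modifies, and
    // publishes a new version. Old readers keep their snapshot alive until they
    // drop it, so they never block and never see partial updates.
    std::atomic<std::shared_ptr<const Table>> g_table{std::make_shared<const Table>()};

    int lookup(const std::string& key) {
        std::shared_ptr<const Table> snap = g_table.load();   // one snapshot per read
        auto it = snap->find(key);
        return it == snap->end() ? -1 : it->second;
    }

    void update(const std::string& key, int value) {           // single (or mutex-held) writer
        auto next = std::make_shared<Table>(*g_table.load());  // copy
        (*next)[key] = value;                                   // modify
        g_table.store(std::move(next));                         // publish
    }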
> (Ok, there are sharded reader count variants that reduce the cost of the reader count, but they're still only useful for relatively long reader lock sections.)
RW locks are a code/design smell, even with a cheap reader count.
Was that a later edit? I didn't see it when I read the comment. Also, what's the point of saying that something is not possible and then adding "actually it's possible, but I don't like it anyway" in parentheses?
What you're saying is wrong: if the reader section is long, you have no problem amortizing the cache invalidation. Sharded counts are useful when the reader section is short and the cache miss becomes dominant.
Also, I don't get this "RW locks are a code smell" dogma: not having RW mutexes forces you to design the shared state in a way that readers can assume immutability of the portion of the state they acquire, which usually means heavily pointer-based data structures with terrible cache locality for readers. That is, in order to solve a non-problem, you sacrifice the thing that really matters, namely read performance.
I've heard this thing from Googlers, who didn't have a default RW mutex for a while, then figured out that they could add support for shared sections for free and suddenly RW mutexes are great.
Coalescing reference counting [0] avoids almost all synchronisation. n.b. The abstract says they do "not require any synchronized operation in its write barrier" but they rely on a micro-architectural hack; in practice I'd expect one atomic test-and-set per modified object per collection.
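My rough reading of the barrier side (a made-up sketch, not the paper's code): the barrier logs an object, with a snapshot of its old pointer fields, only the first time it is modified in a cycle, and the collector later reconciles counts from that log; the per-object dirty flag is where the one test-and-set per modified object comes from.

    #include <atomic>
    #include <cstddef>
    #include <vector>

    struct Object {
        std::atomic_flag dirty;        // cleared by the collector each cycle (C++20: starts clear)
        std::vector<Object*> fields;   // outgoing references
    };

    struct LogEntry {
        Object* obj;
        std::vector<Object*> oldFields;  // referents before the first mutation this cycle
    };

    thread_local std::vector<LogEntry> mutationLog;

    void writeBarrier(Object& obj, size_t slot, Object* newRef) {
        if (!obj.dirty.test_and_set(std::memory_order_acq_rel)) {
            mutationLog.push_back({&obj, obj.fields});  // log old referents once per cycle
        }
        obj.fields[slot] = newRef;  // the actual pointer store: no per-store refcount traffic
        // At collection time: decrement counts of oldFields, increment counts of the
        // object's current fields, coalescing all intermediate stores.
    }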
There are no miracles here because it is not a language "feature". It is a property of algorithms. When you divide your large task into parts and schedule their execution on multiple threads, make absolutely sure that there is no locking (atomics are a form of locking) happening inside each individual task.
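E.g. instead of all workers bumping one shared atomic (or copying one shared_ptr around), give each worker its own slice of the input and its own private accumulator, and only combine at the end (trivial sketch):

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <thread>
    #include <vector>

    // The hot loop touches no atomics and no shared cache lines; the only
    // synchronization is joining the threads and summing the partial results.
    long long parallelSum(const std::vector<int>& data, unsigned numThreads) {
        std::vector<long long> partial(numThreads, 0);
        std::vector<std::thread> workers;
        const size_t chunk = (data.size() + numThreads - 1) / numThreads;

        for (unsigned t = 0; t < numThreads; ++t) {
            workers.emplace_back([&, t] {
                const size_t begin = t * chunk;
                const size_t end = std::min(data.size(), begin + chunk);
                long long local = 0;                 // private: no contention here
                for (size_t i = begin; i < end; ++i) local += data[i];
                partial[t] = local;                  // one write per worker
            });
        }
        for (auto& w : workers) w.join();
        return std::accumulate(partial.begin(), partial.end(), 0LL);
    }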