
People recommend tracing garbage collection because it performs better than atomic reference counting, even though reference counting is much simpler.

Now, suppose I were designing a language where data is always owned by at most one thread. I could skip making the ref counter an atomic variable and, with some help from the language, do reference counting at compile time: would I still benefit from the complexity of a tracing GC if my goal is performance and predictable latency?

I want to have automatic memory management for both the kernel and apps. Creating a concurrent tracing GC that walks the entire user and kernel space sounds vastly more complicated than just inc/dec'ing a counter and freeing when it reaches zero (again, assuming only one thread can access a given object).

I know that I would have to add language support for weak pointers or forbid cyclic references altogether, but I wonder if the dislike for reference counting only exists because it doesn't work well for data that's shared across threads.

EDIT: thanks all, you have brought excellent arguments and you have convinced me. In fact, in this actor-based OS I am working on, GC plus arena-based allocation per actor might be the best-performing approach. Collection can happen in the scheduler between messages (so it becomes virtually pauseless), and when an actor terminates, the entire arena is simply reclaimed.



Yes, you do still benefit; otherwise V8 would use refcounting for everything.

Problems with refcounting:

1. You leak cycles. Given that the point of a memory management scheme is to reliably not leak, this is a major weakness. Cycles are very common and natural things to find in heaps, and even if you avoid them, they can easily be introduced years later by people you will never meet who weren't aware of the entire heap topology. You say you'd just "forbid cyclic references": how will you do that? Require everything to be written in Rust? (A minimal example of a leaked cycle is sketched below.)

2. Every object gets bigger, as it must store a reference count. This increases memory usage and reduces locality. With a GC it's possible to squeeze the overhead down to a vtable pointer (required anyway in most cases) and then a few bits on top!

3. All the incs/decs bloat your code, creating a form of pervasive slowness that doesn't show up in profilers except in hot loops. This tempts you to micro-optimize the inc/decrefs, which increases the chance of mistakes.

4. Requires a very consistent protocol for how and when references are taken and released. The details here can get complicated. For example do containers add/decref for you, or do they expect you to do that?

5. You are exposed to use-after-free bugs, which are a major source of security holes.

There are probably more. And obviously the restriction to not use threads is catastrophic for any system that cares about performance.
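
To make point 1 concrete, here's a minimal Rust sketch of a leaked cycle (the Node type and names are just illustrative):

    use std::cell::RefCell;
    use std::rc::Rc;

    struct Node {
        // A node may optionally point at another refcounted node.
        next: RefCell<Option<Rc<Node>>>,
    }

    fn main() {
        let a = Rc::new(Node { next: RefCell::new(None) });
        let b = Rc::new(Node { next: RefCell::new(None) });
        // a -> b and b -> a: both strong counts are now 2.
        *a.next.borrow_mut() = Some(Rc::clone(&b));
        *b.next.borrow_mut() = Some(Rc::clone(&a));
        // When a and b go out of scope, each count drops to 1,
        // never to 0, so neither node is ever freed.
    }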

The problem with having the ability to hold managed pointers from kernel<->userspace (whether using rc or gc) is that a process might be malicious or arbitrarily buggy, in the standard OS model. It means you have to be able to kill the process and release all its resources, which means the kernel must know where all the references to kernel-side objects are coming from so they can be removed on kill.

To allow object graphs to span the user/kernel space boundary therefore requires the kernel to mandate the use of a specific compiler that's a part of the TCB (i.e. a part of the kernel itself). That radically changes the design of the whole OS. In particular you cannot implement a UNIX that way.


The decrement also needs to be in a "finally" exception handler.

There's also a problem when taking a reference to the contents of an object - that reference can survive the release of the object.


As sibling comment mentioned, reference counting is very much non-predictable in latency.

If you want predictable latency, you want your GC to run in either a constant time slice or an upper-bounded one.

Deferred refcounting can become quite complex if you want to avoid "I have just released the last reference to something that contained a lot of other references" (there are some funny threads where Rust users discover that they sometimes need to spin up an extra thread to avoid an effective "stop the world" XD).
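
The trick those threads land on looks roughly like this (a sketch, not a recommendation; release_later is a made-up name):

    use std::thread;

    fn release_later<T: Send + 'static>(value: T) {
        // Dropping a deeply nested structure cascades into many frees;
        // hand it to another thread so this one isn't paused.
        thread::spawn(move || drop(value));
    }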

So instead the common approach is a tracing GC with some form of timesliced collector, guaranteeing that the time taken from the mutator is upper-bounded. For example, instead of stopping the world, a concurrent GC will usually only require a synchronization point. Sometimes it also means the collector is given an explicit timeslice to ensure a certain amount of work gets done - it's all tradeoffs based on how much throughput and latency you want.

You can speed up tracing collectors through things like escape analysis and one-bit reference counting (for example, you set a bit if a new reference is passed outside the current scope; every object without the bit set can be assumed to have all its references contained in the given scope, etc).


> People recommend tracing garbage collection because it performs better than atomic reference counting, even though reference counting is much simpler.

Reference counting is a lot more complicated! It turns every read into a write.

> [...] if my goal is performance and predictable latency?

Reference counters don't have predictable latency: a single reference decrement can lead to an arbitrary amount of follow up work, like more decrements and deallocations.

Performance of reference counting isn't generally better than that of garbage collectors. (Especially since you still need a backup solution to deal with cycles.)

Figuring out a lot of your memory management at compile time is definitely a win. But it doesn't really have much to do with reference counting.


> Reference counting is a lot more complicated! It turns every read into a write.

Why? Maybe it's crucial to the story, but I am exploring Rust-like single-ownership semantics.

You would have to "write" only when you create a new reference to something, or when that reference is destroyed. But it wouldn't turn any read into a write, no?
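
Roughly what I mean, in Rust terms (just a sketch using std's Rc):

    use std::rc::Rc;

    fn main() {
        let a = Rc::new(42);
        assert_eq!(Rc::strong_count(&a), 1);
        let b = Rc::clone(&a); // new reference created: count goes up
        assert_eq!(Rc::strong_count(&a), 2);
        drop(b);               // reference destroyed: count goes down
        assert_eq!(Rc::strong_count(&a), 1);
        let _n = *a;           // a plain read touches no count
    }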


Rust ownership works because it runs entirely at compile time. Rust has infrastructure for runtime reference counting, but you need to actively opt into it, and it's complicated.

Also, every time you read a pointer you are creating a new reference. That's what turns "every read into a write". You are correct that it's not literally every read; it's just that every small block of code that reads the value will also write over the count.



You don't do this in Rust. You can take a normal &T into something held by an Rc<T>. The count will not be touched.

It’s only when you need an additional owning reference that you call clone() and increment the count, or when one of the owners goes out of scope, and Drop decrements the count.

But for non-owning cases, which are many of them in most code, the count stays the same.
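
A minimal sketch of that (the function name is made up, the mechanism is not):

    use std::rc::Rc;

    fn read_only(s: &str) -> usize {
        // A plain borrow: no refcount traffic on this path.
        s.len()
    }

    fn main() {
        let shared = Rc::new(String::from("hello"));
        let n = read_only(&shared); // deref-coerces Rc<String> to &str
        assert_eq!(n, 5);
        assert_eq!(Rc::strong_count(&shared), 1); // count never moved
    }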


"Up u" should be "You", I don't know why my phone autocorrected this way. Too late to edit now!


That has a cost both to the compiler and to the developer.

It's not a free win.


It’s part of the standard library, not the language, so it’s not really a cost to the compiler.

And this just uses all the same rules as normal Rust code, so like, sure, to use features you have to know them, but I find it a bit silly to frame that as a “cost.” If that’s a cost, so is literally everything.

Anyway, my main point is just that you have less refcount traffic than in other implementations.


If we're going down the "calling things silly" route, I'd rather categorize the community's handwaving of Rust's additional complexities as silly.

If "everything has a cost" anyway we might as well code is assembly for optimal performance. But no, because nuances and differences of cognitive load during development can be significant. And so does compilation speed.

It's arguably much easier to code with a GC and language adoption reflects that.


> I'd rather categorize the community's handwaving of Rust's additional complexities as silly.

I believe what steveklabnik meant was that all this extra complexity (the borrow checker) already exists in the language for other reasons. It's not an additional cost to the compiler or the developer when using the reference-counted types (Rc or Arc).


That's correct. I thought we were talking specifically about refcount traffic, not about broader issues.


> It turns every read into a write.

Not exactly; it turns every passing of a reference, even when it's going to be used only for reading (or not used at all, or just passed on further), into a write of the reference count. But if you can read without passing the reference somewhere else (or if the compiler can prove that the reference the caller holds outlives the one it would have passed), it can omit the write.

And tracing garbage collection is not completely immune to this either: generational collectors turn pointer writes into extra writes (into a card marking table), a copying GC periodically rewrites unmodified objects (when moving them during collection), and the tracing itself might need to read cold objects which might even be located on disk (swapped out).

> Reference counters don't have predictable latency: a single reference decrement can lead to an arbitrary amount of follow up work, like more decrements and deallocations.

However, they have a predictable place for that latency: it can only happen when there's a reference decrement (and the amount of extra work at that point is bounded by the size of the graph which can be reached from that reference). It also will not affect other unrelated threads. With a stop-the-world GC, a thread which allocates too much can cause pauses in threads which are not even allocating or releasing memory at that moment.

> Performance of reference counting isn't generally better than that of garbage collectors. (Especially since you still need a backup solution to deal with cycles.)

However, the memory use of reference counting is generally better than that of tracing garbage collectors, since it releases memory as soon as it can be released. That is, tracing garbage collectors trade extra memory use for their speed; giving them more memory (up to a point) means they can do less work, and giving them too little extra memory can make them much slower.

> Figuring out a lot of your memory management at compile time is definitely a win. But doesn't really have much to do with reference counting.

It does, because it can help reduce the cost of reference counting (by eliding most of the reference count manipulation when it's not necessary).


> With a stop-the-world GC, [...]

Yes, if you compare an implementation of reference counting that doesn't stop the world with an implementation of GC that does explicitly stop the world, you will find that GC stops the world more.


> Reference counters don't have predictable latency: a single reference decrement can lead to an arbitrary amount of follow up work, like more decrements and deallocations.

You can push the cleanup work onto a queue and do it asynchronously, to bound the amount of work per decrement / to even the work out.
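
A rough sketch of that idea in Rust, using a channel as the free queue (names and structure are illustrative):

    use std::any::Any;
    use std::sync::mpsc;
    use std::thread;

    fn main() {
        // Objects whose last reference died are sent here instead of
        // being freed inline, bounding the work done at the deref site.
        let (free_tx, free_rx) = mpsc::channel::<Box<dyn Any + Send>>();

        let reclaimer = thread::spawn(move || {
            for garbage in free_rx {
                drop(garbage); // the actual deallocation happens here
            }
        });

        free_tx.send(Box::new(vec![0u8; 1 << 20])).unwrap();
        drop(free_tx); // close the queue
        reclaimer.join().unwrap();
    }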


Yes, definitely, you don't have to use the simplest version of reference counting, you can soup it up.

But you can play the same games with GC. So this is not specific to reference counting. There are incremental and real time versions of GC.


The dislike started earlier than the rise of pervasive threading. As another poster said, every read turns into a write, meaning every time you pass a reference into a function you have to increment the counter, and decrement it when the function exits. It turns out the vast majority of inc/dec ops are pointless ones like this, which is considerable overhead for literally no benefit.
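
Rust makes the pattern easy to see (a contrived sketch; naive is a made-up name):

    use std::rc::Rc;

    // Naive refcounted convention: the callee takes ownership,
    // so every call site must clone (increment)...
    fn naive(v: Rc<Vec<i32>>) -> usize {
        v.len()
    } // ...and the count is decremented again right here.

    fn main() {
        let data = Rc::new(vec![1, 2, 3]);
        for _ in 0..1000 {
            naive(Rc::clone(&data)); // one pointless inc/dec pair per call
        }
    }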

You can elide some of them using a sort of borrow analysis. But borrowed refs can only be used in certain ways, and so this analysis isn't perfect and can't eliminate all of the unnecessary overhead. This was all hashed out in the 80s and 90s IIRC, before multithreading even became a big issue.

The complex analyses needed to make RC perform well end up destroying the simplicity argument, and even with them RC still isn't competitive with tracing.


Rust has this. Data can be placed in atomic or non-atomic refcounted containers, and the type system disallows sharing non-thread-safe types across threads.

Additionally, data can be borrowed from inside the refcounted wrapper, and used freely within a scope without touching the reference count.
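
Concretely (a sketch; the compile error in the middle is the point):

    use std::rc::Rc;
    use std::sync::Arc;
    use std::thread;

    fn main() {
        let local = Rc::new(5);   // non-atomic count
        let shared = Arc::new(5); // atomic count

        // thread::spawn(move || println!("{local}"));
        // ^ rejected at compile time: `Rc<i32>` is not `Send`, so the
        //   non-atomic count can never be touched from two threads.

        thread::spawn(move || println!("{shared}")).join().unwrap();
        drop(local);
    }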


Lack of multi-threading simplifies a lot, but is at odds with performance.


You can still have multithreading; you just have to forbid shared data. You can move data between threads at thread creation and termination, and for everything else the language can provide channels/queues to move data between threads. Erlang operates under that model and is great at multithreading.
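
In Rust the same shape looks like this (a sketch; ownership of the Vec is transferred, never shared):

    use std::sync::mpsc;
    use std::thread;

    fn main() {
        let (tx, rx) = mpsc::channel();
        let worker = thread::spawn(move || {
            let data: Vec<i32> = rx.recv().unwrap(); // moved in, not shared
            data.iter().sum::<i32>()
        });
        tx.send(vec![1, 2, 3]).unwrap();
        assert_eq!(worker.join().unwrap(), 6);
    }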


It might be great, just not performant.

CSP often locks you into a single-consumer style of pattern, which is highly problematic and scales poorly for anything that isn't statically partitionable. Once you go to MPMC, you start sharing data, which immediately forces you to have proper ARC with the high synchronization cost that entails (even without atomics, RC forces cacheline eviction between cores, meaning it acts as a brutal bottleneck on many-core systems, per Amdahl's law).


That is indeed the goal I am working towards, but a bit more relaxed than Erlang: mutability is possible only within a single actor/thread, but anything sent in a message to another actor is deep-copied. So no memory is ever shared, yet the language allows single-threaded mutation for performance.


"just have to forbid shared data"

That's too severe a constraint to be qualified as "just".



