Rust data structures with circular references (2021)

smasher164 · on March 26, 2024

Another approach is to use a Gc<T> smart pointer: https://docs.rs/gc/latest/gc/

I've used this in an interpreter and it's quite convenient.

zozbot234 · on March 26, 2024

Yes, garbage collection seems to be the only viable solution for dealing with spaghetti reference graphs in their full generality, including possible cycles. In that context it's worth trying a high-performance concurrent GC implementation such as https://github.com/chc4/samsara Samsara.

celeritascelery · on March 26, 2024

That is just using a RefCell under the hood[1] so it is effectively the same trade-offs as the RefCell example from the article.

[1]https://docs.rs/gc/0.5.0/src/gc/lib.rs.html#495-498

rendaw · on March 27, 2024

It doesn't require you to differentiate weak and strong references, so there's no risk of memory leaks due incorrect choice in Gc, unlike Rc IIUC.

paholg · on March 27, 2024

Note: the two of you are linking to different crates. Samsara provides a Gc type, but it is not the gc crate.

celeritascelery · on March 28, 2024

It looks like you intended to say that on the sibling comment.

paholg · on March 28, 2024

Yep! Sorry.

foldr · on March 26, 2024

Nice! This seems way more appealing than every other approach I've seen in this domain.

googh · on March 27, 2024

There is hardly any difference between indices into vectors and isoheaps (ie. an allocator that reuses the same pool for the given type). Except that the latter is faster and much more ergonomic. Both prevent remote code execution.

I wonder what a modern language would look like if it had manual memory management exclusively through isoheaps. Fil-C[1] attempts to retrofit isoheaps to C.

Now one can argue that isoheaps still allow access to wrong value and that this can potentially cause security issues. But the same also happens with indices or object pools in GCd languages. And if we go by this logic then Rust is less safe than Java (even without unsafe) because Rust requires indices to express what can be expressed with just references in Java.

Increasingly, I find it hard to justify systems like the borrow checker that force the programmers to contort their programs in a certain way. Unfortunately, a lot of newer languages (eg., Hylo, Austral, Circle) are experimenting with something similar to the borrow checker. Take Hylo for instance. The language has no reference type at all. So one has to use indices everywhere. Why not just embrace isoheaps instead?

[1] - https://github.com/pizlonator/llvm-project-deluge/blob/delug...

cozzyd · on March 27, 2024

This seems easier to retrofit onto C++ due to stronger type guarantees, no? (As described in that implementation, you basically can't use malloc but instead need a typed version of malloc). I would say that you w ant more than 32 bit pointers though because you can easily require a working set bigger than 4GB. I'd think something like 40-bit (1 TB of memory / type )pointers would be fine in a 64-bit address space.

You can almost fake something that gets you some of the way there in normal C++ by making the allocator giving you a randomly-placed arena for each type. Yes, you can jump from one address to another, but in a 64-bit address space you're much more likely to segfault than to end up in a different isoheap's arena.

googh · on March 27, 2024

> I would say that you want more than 32 bit pointers

Fil-C uses fat pointers to retain bounds and type information to make existing C code safe (typical C is filled with pointer arithmetic and casts).

Languages like C++/Zig/D have bounds checked slices and containers (or at least have facilities to create them). So fat pointers are not necessary. Slice types can also fit into 128-bit (start+end pointers), so they can be atomic.

Rusky · on March 27, 2024

> Rust requires indices to express what can be expressed with just references in Java.

This is a big oversimplification. Rust can express a lot more than indices, and a lot more than what's in this post.

The actual requirement is that you capture your choice in the type system. But this is possible for anything from indices/handles to reference counting to arenas to GC.

pkulak · on March 26, 2024

This is always brought up as the Achilles heel of Rust but, like... it's not that bad. You just need a weak reference. In fact, it's no different than a language like Swift, where everything is always reference counted. However, Swift will happily let you do it the wrong way and leave finding the resulting memory leak up to you. I prefer the way Rust handles it.

zamalek · on March 27, 2024

Honestly it has taught me that we need circular refs far less often than we thought. I consider it one of the best features of Rust. The lack of circular refs makes code far easier to reason about, and makes it more difficult to produce spaghetti code.

alilleybrinker · on March 26, 2024

Always worth plugging “Learning Rust with Entirely Too Many Linked Lists,” the canonical writing on the type of iterative approach for implementing data types with circular references in Rust: https://rust-unofficial.github.io/too-many-lists/

keybored · on March 26, 2024

The writing style is also fun IMO.

alilleybrinker · on March 26, 2024

Oh yeah Aria’s writing is excellent. https://faultlore.com/blah/

binary132 · on March 26, 2024

Interesting article. It got me thinking about what invariants could be proven for these data structures with a proof assistant having knowledge of things like lifetime or ownership. One thing that bothers me about Rust is that these properties are hardwired into the syntax in a very specific way. Sometimes I might want these properties to be exactly the way they are in Rust. Sometimes less, sometimes more, sometimes other semantics. This article makes me think that the tools Rust offers aren’t very good for this task. They’re not bad! But it seems like something better could be devised.

aidenn0 · on March 26, 2024

Isn't raw pointers only needed for the parent reference, not all references?

josephcsible · on March 26, 2024

  &'a Node<'a>

This (from the non-working "obvious" way) seems to be a common mistake, where the lifetime of a reference is the same one in the type parameter of what it refers to. Does anyone know if this has a name, or if there's some useful resource that talks about it in particular, outside of the context of circular references?

tempaccount420 · on March 26, 2024

Adding more lifetimes is not convenient in Rust so developers tend to reuse them. I wish it wasn't such a (refactoring) pain to add more lifetimes to a struct.

kldx · on March 26, 2024

https://plv.mpi-sws.org/rustbelt/ghostcell/ uses higher rank lifetimes to implement circular linked lists. The downside is that a linked list is stuck inside a closure.

zozbot234 · on March 26, 2024

Nice, there are also many other variants of this pattern, such as are implemented in the qcell crate.

kldx · on March 27, 2024

Right! LCell (qcell) is similar to GhostCell but I like the flexibility the other cell types in qcell provdie

jupp0r · on March 26, 2024

I don't see the problem in having a separate container that manages ownership and representing links between graph nodes in terms of that. The resulting overhead can be quite minimal and it's probably good practice to do this even in C.

hinkley · on March 26, 2024

To a moderately informed spectator, it does seem like Rust should be able to reasonably handle complex (cyclic) data structures - but only when the design can be expressed as composition. With aggregation you have a gnarly potentially undecidable state tree.

I'm pretty sure that at least some instances of aggregation could be expressed as composition if you took a crowbar to the architecture and squinted a little bit. For instance, a child object can survive the death of a parent by being adopted by another.

zozbot234 · on March 26, 2024

> For instance, a child object can survive the death of a parent by being adopted by another.

This is loosely speaking what the qcell crate provides already, via TCell. But the "adoption" protocol involves the use of complex patterns where compile-time "token" types (which one could also see as capabilities for the underlying object) are passed to the relevant code.

dang · on March 26, 2024

LAC-Tech · on March 26, 2024

Raw pointers and unsafe all the way. Sure a few of the more... radical elements of the rust community might try some safety shaming, but it really is the most pragmatic choice here.

theLiminator · on March 26, 2024

Curious if you could implement the tree with qcell or ghostcell for zero runtime cost and having compile time safety?

nurettin · on March 27, 2024

Storing references in integer arrays is how people used to program data structures in BASIC. I still use this method with vectors in order to make data structures easy to serialize and inspect.

gumby · on March 26, 2024

Unfortunately, despite all that effort, deletion leaks memory.

rjeli · on March 26, 2024

  First, we take some of the safety into our own hands. While this approach won't result in corrupted memory, double frees or accessing freed pointers, it can lead to run-time panics and other problems because we deal in "raw" indices to a vector.

Yeah, I’ve often seen people in discussions about c/zig/rust memory saying “manual memory management is fine, just use an arena and handles!” It’s a strange statement to me because you’re just laundering the unsafe pointer arithmetic behind array indexing

burntsushi · on March 26, 2024

> you’re just laundering the unsafe pointer arithmetic behind array indexing

Perhaps true in a very narrow sense, but you could say the same thing about all of Rust. "It's just laundering unsafe stuff behind a safe interface." And indeed, the ability to encapsulate unsafe internals inside a safe interface is one of the primary selling points of the language. It is also one of the key characteristics that differentiate it from languages that do not currently have this ability, such as C and C++. Whether you think this is an actual advantage or not is I suppose up to you, but I certainly think it is. And I think your use of the word "just" is papering over a lot of stuff.

For a more concrete code-level comparison with C, I did the leg work to translate a C program to a number of different Rust programs by varying some constraints. One of those Rust programs does indeed use indices instead of pointers. The README talks about the trade offs. See: https://github.com/BurntSushi/rsc-regexp/

Chabsff · on March 26, 2024

> languages that do not currently have this ability

That seems to be a very common position, and one that's super weird to me. C and particularly C++ absolutely have that ability with library support if you know what you are doing.

The only material difference, from my point of view, is that the default behavior of the language is different.

I will fully grant that the path of least resistance being dangerous is a huge issue in C/C++, and one that Rust addresses, but extending that all the way to saying that the language lacks the ability is really excessive.

tialaramex · on March 26, 2024

There's a big cultural problem and there are several big technical problems.

Rust has a safety culture, and C++ does not. In Rust's safety culture it was obvious that std::mem::unintialized (an unsafe function) should be deprecated because it's more dangerous than it appears, it's actually hard to use it correctly. That's why today we have the MaybeUninit type. In C++ it was apparently equally obvious that std::span, a brand new type in C++ 20, should not have a safe index operation.

Technically the safe/ unsafe distinction being at the language level makes it hard to fake. You can say your C++ only uses your safe abstractions, but the language itself doesn't care, so without inspecting every part of it to check you're never more than one slip away from catastrophe.

Most importantly in this context, at the language level Rust is committed to this safety distinction. If you write code where Rust's compiler can't see why it's OK, the compiler rejects your program. C++ requires that a conforming compiler must instead accept programs unless it can show why they're wrong. These are two possible ways to cut the Gordion knot of Rice's Theorem, but they have very different consequences.

burntsushi · on March 26, 2024

You can't encapsulate safety in C or C++. There's no `unsafe` keyword like Rust (or like Modula 3). If you have to say "if you know what you're doing," then you haven't encapsulated anything. There really is a categorical difference here. It's not excessive at all. It's the entire point.

Now if I were to say something like, "Rust's safety means that you can never have UB anywhere ever and CVEs will never happen for anything if you use Rust." Then yes, that's excessive. But to say that Rust can encapsulate `unsafe` and C and C++ cannot? I don't see how that's excessive. It's describing one of the most obvious differences between the programming languages.

You can restrict yourself to particular subsets of C (I'm thinking about MISRA) or C++, but these usually come with even more significant trade offs than Rust. And I'm not aware of any such subset that provides the ability to encapsulate safety in a way that lets folks not using that subset benefit from it in a way that is impossible to misuse (as a matter of an API guarantee).

pjmlp · on March 27, 2024

Only to add some historical context, ESPOL/NEWP for Burroughs B5000 in 1961 were one of the first systems programming languages with unsafe, many others followed upon that.

Burroughs B5000 had an additional feature for executables using unsafe code that we only have nowadays on managed runtimes like Java and CLR, binaries with unsafe code were tainted and required someone with admin access to enable them for execution.

Regarding C and C++, Visual Studio, Clion and clang tidy are the best we have in terms of tooling for the general public supporting the Core Guidelines (including lifetime checks), and they are still relatively basic in what they can actually validate.

burntsushi · on March 27, 2024

AIUI, Modula-3 provided an ability to actually encapsulate unsafety, in that the concept was elevated to the level of interfaces. Did any language prior to Modula-3 have that capability?

I think that's fundamentally different---although related---to just having an `unsafe` keyword. To take something I know well, Go has an `unsafe` package that acts as a sort of unsafe keyword. If we ignore data races, you can say that a Go program can't violate memory safety if there is no use of the `unsafe` package.

The problem though is that you can't really build new `unsafe` abstractions. You can't write a function that might cause UB on some inputs in a way that requires the caller to write `unsafe`. (You can do this by convention of course, e.g., by putting `Unsafe` in the name of the function.)

In Rust, `unsafe` doesn't just give you the ability to, e.g., dereference raw pointers. It also is required in order to call other `unsafe` functions. You get the benefit of composition so that you can build arbitrary abstractions around `unsafe` with the compiler's support.

My understanding is that Modula-3 supported this style of encapsulation (which is what I was talking about in this thread). What languages prior to Modula-3 supported it, or was Modula-3 the first?

pjmlp · on March 27, 2024

Starting with that 1961 example, ESPOL/NEWP.

Since Unisys still sells Burroughs, nowadays ClearPath MCP, you can get the latest NEWP manual here, section 8.

https://public.support.unisys.com/framework/publicterms.aspx...

Followed by Mesa/Cedar (CHECKED, TRUSTED, UNCHECKED), Modula-2 (IMPORT SYSTEM), the languages of Oberon linage (which follow up on the IMPORT SYSTEM approach), Ada (using Unchecked),....

In the languages that use the IMPORT SYSTEM approach, the compiler can mark the module as unsafe, and anything that might depend on it.

Some the Modula-3 folks worked previously on Cedar at Xerox, by the way.

Mesa - http://www.bitsavers.org/pdf/xerox/mesa/5.0_1979/documentati...

Cedar - http://www.bitsavers.org/pdf/xerox/parc/cedar/Cedar_7.0/09_C...

burntsushi · on March 27, 2024

Very interesting. Thank you.

kelnos · on March 27, 2024

> C and particularly C++ absolutely have that ability with library support if you know what you are doing.

I won't speak to C++, as it's a very different language now since the last time I used it. I've been writing C for more than 20 years, and I still make mistakes. And there's nothing keeping me from accidentally doing something unsafe outside my unsafe abstraction, aside from my own perfection at never making mistakes (yeah, right).

Rust requires you to be explicit about the unsafe things you do. And, realistically, even when I'm building a safe interface on top of necessarily-unsafe code, the unsafe portions aren't even that large compared to the entirety of the abstraction. That makes things much easier to audit, and the compiler tells me which sections of code I need to pay more attention to.

To me, this is lacking the ability. "If you know what you are doing" is a laughable constraint. Even people who theoretically do (and I suspect programming ability is a lot like people's self-reported skill at driving a car) still make mistakes sometimes.

cmeacham98 · on March 26, 2024

I think you're missing the point - abusing a vector in this way is not a safe interface and is instead a good way to introduce several memory safety bugs (for example, UAFs) that the normal rust memory model explicitly prevents.

adwn · on March 26, 2024

> abusing a vector in this way is not a safe interface and is instead a good way to introduce several memory safety bugs (for example, UAFs)

No, as long as you don't use the unsafe keyword, an out-of-bounds vector access won't lead to use-after-free or other memory safety bugs.

Arnavion · on March 26, 2024

Out-of-bounds access is not required for the pseudo-UAF we're talking about here. Deleting a node in the middle of a linked list will leave a "hole" in the backing Vec. You cannot shift the next elements down to fill the hole because that will invalidate all handles to them. If the backing Vec holds the nodes directly, as TFA's implementation does, then there is no way to mark the hole as a hole. So any bug where a different node's handle accidentally ends up accessing this hole instead will lead to that code observing a "freed" node.

One workaround is to make the backing Vec hold Option of Node instead so that deleting a node can set the hole to None, in which case the bug I described above has the opportunity to unwrap() and panic instead of silent UAF. Though you'll also need additional tracking for those holes so that you can fill them with new nodes later, at which point you'd be better off using a proper freelist / slab instead of a Vec anyway (as TFA also mentions).

kelnos · on March 27, 2024

Well, we're not talking about "pseudo-UAF", we're talking about actual-UAF and actual-memory-safety.

You use scare quotes around "freed" for a reason: the data has not actually been freed.

The bug you're talking about is a logic error. It could be a bad bug, depending on circumstances, but there's no memory safety issue here.

Arnavion · on March 27, 2024

>You use scare quotes around "freed" for a reason: the data has not actually been freed.

Who said it hasn't? I would assume such a node to have been given to `std::ptr::drop_in_place`. Not doing that would be a leak until the list as a whole was dropped.

cmeacham98 · on March 26, 2024

I don't mean a literal UAF, but a "use array index after free" because you're using indexes (which only have bounds checking) as heap pointers.

Rust's borrow checker doesn't account for when you re-implement parts of memory management as array indexes.

im3w1l · on March 26, 2024

Safe rust is a turing complete language. That implies that it is possible to build an emulator for any other programming language in safe rust. That emulated language and programs running on it can have memory-safety issues, even if the emulator itself cannot be corrupted.

When writing such an emulator the natural way to set up the memory is to use an array, and the natural way to implement pointers is indices into that array. In other words, this pattern is part-ways there to creating an emulator with internal safety issues.

However guaranteeing that any corruption will be contained to the array is certainly a lot better than nothing.

Chabsff · on March 26, 2024

Rust's safety mechanisms don't exist just to prevent bugs from escalating into security issues, they exist to prevent the whole class of bugs related to reference handling from being present in the first place. That's supposed to mean programs that work more consistently. Handwaving techniques that lead to programs panicking as "Well at least it doesn't become a security concern" is missing the forest for the trees.

potatochup · on March 26, 2024

This is generally true, but afaik there are still some (fairly complex) ways to write memory unsafe code in safe rust https://faultlore.com/blah/everyone-poops/

mandarax8 · on March 26, 2024

How is reusing a freed index not a UAF? If I roll my own allocator I can still get UAFs even though the memory accessed is not yet free'd.

kelnos · on March 27, 2024

Because that's not what "UAF" means. Also not what "freed" means.

To have a UAF, there has to be memory that is actually freed, and you have to attempt to access that memory. No memory is freed here (in the OP's implementation). Even if it was, at worst you'd get a panic for trying to access past the end of the Vec.

None of that is a UAF or a memory safety issue. It's just a logic bug.

amichal · on March 26, 2024

Nope but "handle confusion" safety perhpas?

kelnos · on March 27, 2024

No, that's incorrect. While yes, it's true that you can point two different nodes at the same array index, or mismanage your array indices in a variety of ways, that is a logic error. It will indeed make your program behave incorrectly (and depending on what it's doing, that may have security implications), but there is no memory safety issue. No one is using anything after freeing it; if you try to access an index that is past the end of the Vec, it will panic. Panicking, while undesirable, is memory-safe.

You may think that's a difference without distinction, but "memory safety" and "use after free" have specific definitions, and this ain't them.

burntsushi · on March 26, 2024

You later clarified that by "memory safety bugs" you don't actually mean "memory safety bugs," but rather "use array index after free." But that isn't a memory safety bug. (It might be a denial of service bug or a logic bug, but because of bounds checks, it isn't a memory safety bug.) So no, I'm afraid I haven't missed the point at all.

Could you please read the link I shared? There's all sorts of nuance in the README. And there is absolutely no pretending in my comment or in the link I shared that using indices instead of pointers has zero downsides.

cmeacham98 · on March 26, 2024

It is a memory safety bug - by using this "indexes as pointers" methodology it is possible to write code where two different owners simultaneously believe they are the sole owner of an object.

Writing that using normal pointers is impossible in safe rust (barring a compiler bug, which do exist but are rare).

Rusky · on March 26, 2024

You don't get two owners that way. You get something slightly less powerful than owners, which still has all its preconditions satisfied: you can check that it is in bounds, and if so you will find an element of the expected type. You cannot corrupt the underlying allocator's data structures, you cannot resize the array when there are outstanding pointers to its elements, and you cannot violate the type system.

Rust's goal was never to exclude all shared mutability. (Otherwise why support things like locks?) Rather, it excludes only the kinds of shared mutability that open the door to undefined behavior. The point of all these sorts of "workarounds" is that, because there is no longer a single unrestricted kind of shared mutability, you now get to pick which version of restricted mutability you want: reference counting vs bounds checking vs ghostcell's higher-rank lifetimes vs whatever else.

kelnos · on March 27, 2024

> two different owners simultaneously believe they are the sole owner of an object.

Not in the sense of ownership that matters in Rust. The Vec owns the data. A node with an index in it does not own that data. It merely refers to it.

The key here is when you answer the question, "what happens when I screw it up the indices?" And the answer is logic errors or a panic. Neither of those are memory safety issues.

burntsushi · on March 26, 2024

Can you share a program where this results in UB without using `unsafe`?

Please also consider what I was responding to:

> you’re just laundering the unsafe pointer arithmetic behind array indexing

cmeacham98 · on March 26, 2024

The definition of memory safety is not "code that does not result in UB".

burntsushi · on March 27, 2024

So just to be clear here, the progression is:

"memory safety bugs" -> "for example, UAFs" -> "I don't mean a literal UAF" -> "use array index after free" -> 'memory safety is not "code that does not result in UB"'

I mean, you can define "memory safety" to be whatever you want it to be, but the definition everyone else uses (including Rust) is absolutely connected with undefined behavior. More than that, the entire context of this thread assumes that definition. Rust certainly does. And if you are going to use a different definition than everyone else, at least have the courtesy to provide it.

If people used your definition, then it would be wrong to, for example, say that "Java is a memory safe programming language." But that is, as far as I know, widely regarded to be a true statement.

This sort of disagreement is profoundly irritating, because I made it exceptionally clear what I meant from the get-go. All you had to do was respond and say, "oh, it sounds like we are just using different definitions of the term 'memory safety.' if we use your definition, I agree with what you said."

Rusky · on March 26, 2024

It is the definition of memory safety that Rust uses.

It would be easier to discuss whatever non-UB failure modes you have in mind, in the context of Rust, if you used a different term.

kelnos · on March 27, 2024

Maybe not, but it's also not whatever unconventional definition you've come up with here.

akavel · on March 26, 2024

There's a huge difference in the outcomes those can lead to, that's why it's not just a simple "laundering", but a serious gain. On top of that, C vs. Zig vs. Rust give you a different level of that gain still.

The particular difference is noted in the fragment you quoted. What may not be obvious, is why corrupted memory etc. are a much bigger problem than panics. In fact it is usually the _lack_ of panics that is problem in them: they will lead to silent memory corruption. This will lead to unexpected behaviors of your code. Ones you just never even assumed possible - due to basic contracts being broken. Such problems with then silently propagate through your system, poisoning it without you knowing. Whereas a panic is a clear cut alarm/canary behavior: something's wrong, let's crash loudly and make it immediately visible. This will protect against the corruption spreading out. And in a typical system, will restart the panicked binary, allowing the system to resume operation with minimal disruption.

10000truths · on March 26, 2024

The laundering is the point. It's much easier for the user to handle out-of-bounds array accesses and uninitialized array elements, compared to tracking the lifetime and validity of data at arbitrary memory addresses.

agalunar · on March 26, 2024

Using indices is a (slight) improvement over pointers, since in Rust you will benefit from bounds checking (restricting you to the memory allocated for that tree).

(If you use u32 for the indices, you also save some space.)

taeric · on March 26, 2024

It is hilarious how much space you can save by using smaller indexes than a pointer. Especially on modern architectures.

burntsushi · on March 26, 2024

Yes! The regex crate does this and it saves quite a bit of memory.

taeric · on March 26, 2024

It is the kind of low hanging optimization that I used to think was greatly oversold. And, to be fair, for many programs I would still wager it probably is. If you are chasing a ton of pointers, though, it is definitely worth considering how many more you can hold in memory with smaller sizes.

I think I remember a discussion a while back lamenting that we just tacitly accepted wide pointers. If anyone has a link going over that, I'd be delighted to read it again. I'm 90% sure I did not understand it when I first saw it. :D

eclark · on March 26, 2024

I am currently writing a CFR Tree that has nodes that can be self-referential. https://github.com/elliottneilclark/rs-poker/pull/92

The idea is that every game of poker is a walk through a tree. Keeping track of probabilities and outcomes allows for exploration of the correct way to play. Using indices made building the tree easier (though I am still one bug away from it all working).

bbatha · on March 26, 2024

The difference is that the array bounds are checked and the behavior is defined whereas bad unsafe pointer athematic if you’re lucky gives you a segfault and at worst an RCE.

alexchamberlain · on March 26, 2024

Array bounds are essentially unchecked in C, unless you go so far over as to hit a guard page.

empath-nirvana · on March 26, 2024

You can have bugs with that style of coding, but they're not going to be bugs of the "remote code execution" type.

googh · on March 27, 2024

If preventing remote code execution is the goal, then C/C++/Zig can also achieve this using isoheaps (see Fil-C[1] for example). Even without isoheaps, CFI mitigations are able to prevent jumps into code not known at compile time. Why bother with Rust or any other language with borrow checkers then? Just a honest question.

[1] - https://github.com/pizlonator/llvm-project-deluge/blob/delug...

favorited · on March 26, 2024

The arena + handles approach doesn't solve the memory management issue, but it can be beneficial for storing nodes in a list, graph, etc. from a data locality perspective.

kelnos · on March 27, 2024

There's a big difference, though: the array indexing code is still safe from memory errors. I would much prefer a panic over a segfault.

worik · on March 26, 2024

> Now suppose we want to add a parent link to every node.

Don't.

You do not need to.

Rust does stop you creating self referential structures (easily)

In the case of a binary tree that is not a problem

There might exist some specialised case where there is no more memory to allocate at run time, but in the common case it is more efficient in every respect to use algorithms with your data structures and dispense with back links

I have been working on a C program built by a (mostly brilliant) programmer that put back links in every structure (parent links in trees, doubly linked lists) and the waste of resources, in a constrained environment, is frustrating

Just don't

pizlonator · on March 26, 2024

Some algorithms need the parent or bidirectional pointers.

If the language doesn’t let you then the language sucks.

empath-nirvana · on March 26, 2024

This post shows 3 ways of doing it, including essentially the same way you'd do it in C. And it's exactly as dangerous as doing it in C is, but at least you can isolate that dangerous code to one part of your code base.

alexchamberlain · on March 26, 2024

> you can isolate that dangerous code to one part of your code base.

This is the whole point for me. There are loads of reasons fundamental data structures want to use behaviour the compile can't prove is sound, but humans can burn time discussing, proving the algorithms and testing the implementation. Then, you ask the compiler to prove all the simple glue stuff on top.

hinkley · on March 26, 2024

I suspect there's a type theory here where many cases of parent links essentially represent the exact same problem, and if you solve one, you can restate most/all of the others in this form and then it's proven too.

You know that popcount bit counting device that treats a word of memory as a short SIMD instruction using the 32 or 64 bit ALU? I learned recently that compilers can detect that and emit popcount instructions if those are faster on the target architecture.

I suspect if you can detect that, you could detect a safe parental reference device and let it pass the checker.

worik · on March 26, 2024

> Some algorithms need the parent or bidirectional pointers.

For trees some, not very good, algorithms require parent pointers

fhood · on March 26, 2024

I think it's best to think of this as a good toy example, and not get caught up in the specific utility of a "parent" node.

worik · on March 26, 2024

It is just a bad example.

One of the main points is that there are very few algorithms that require back links that create cycles.

Cyclic data structures, rare as they are, are also hard to reason about. Even for garbage collectors!

If you have to implement one in Rust (I have) it is a bother, a big hassle.

It is a good thing that in the general case when a programmer is implementing back links they are doing the Wrong Thing.

pizlonator · on March 26, 2024

The frustrating thing is that GCs have come far enough that for most of what folks use Rust for, they’d have been happier using a GC’d language.

Weird to justify this kind of language limitation when a great technology exists that obviates the need for it.

_se · on March 26, 2024

The frustrating thing is that people say this without realizing that GCs haven't, and by necessity cannot provide the type of deterministic and predictable performance that would lead someone to use Rust/C++ and friends.

There are few people using Rust who don't have experience in other languages. It's wild that you think you know about their use cases better than they do.

Your comment says "I have never worked on a problem where GCs are problematic". That is great for you, but it has little to do with the value of improving the situation for native, no runtime, performant languages.

tombert · on March 26, 2024

Most people aren't writing a kernel. I've seen plenty of Rust projects writing things like web servers, where a low-latency GC (like Go's) would work just fine.

Obviously there's a lot of problems where a GC would be a real performance issue, but I think that people will reach to "rewrite in Rust" too quickly sometimes. Manual memory management is hard, and most software engineers, sort of by definition, are "average".

forty · on March 26, 2024

One important thing is that Rust is an expressive language that's pleasant to use, unlike Go :) that would explain why many side projects that could be in Go are actually in Rust even if they could have been done efficiently in Go too.

tombert · on March 26, 2024

Fair! I actually kind of like Go but it's not a language I reach for very often.

Still, the point I was making is that it's possible to have garbage collectors that get very low pause times (sub millisecond in Go's case [1]). It might still be an issue for a kernel or microcontroller, I'm not suggesting we always use a GC for every project, I don't think most people would, but I am suggesting that there are a lot of projects (most projects?) that would be better off using a low-latency GC instead of trying to handle memory directly.

Even the JVM is getting better about this; there's already experimental support for a low-latency GC in some distributions of OpenJDK [2], which reportedly gets sub-millisecond pauses as well. If you have JVM support then you have access to good languages like Clojure.

Personally, I actually think I'm reasonably-ok with manually managing memory, I've cut my teeth on enough C to write code that doesn't generally crash, but I still reach for a garbage collector whenever possible. Just because I think I'm capable of doing manual memory management doesn't mean that I think it's worth it for most things; nearly everything I do touches a network at some point, which means that avoiding latency is not really possible a lot of the time.

[1] https://groups.google.com/g/golang-dev/c/Ab1sFeoZg_8 [2] https://wiki.openjdk.org/display/shenandoah/Main

pebal · on March 26, 2024

You can have completely concurrent GC, without pausing threads.

https://github.com/pebal/sgcl

tombert · on March 26, 2024

I have not heard of this, but this looks really cool!

Forgive some ignorance on this, but does this mean you have literally zero millisecond pause times? What kind of additional overhead/throughput does this incur? This just seems like the holy grail!

pebal · on March 26, 2024

This GC operates with zero pause time. The GC runs on a separate thread, so can reduce memory bandwidth if the GC experiences heavy load.

pjmlp · on March 27, 2024

Java has had real time GC for decades, it isn't available as free beer, rather in commercial JDKs for real time OSes and bare metal deployments.

See PTC and Aicas, the two main vendors still in business.

And even if I am not a big Go fan, TinyGo and TamaGo show what is possible regarding using Go for bare metal deployments, OS like.

tombert · on March 27, 2024

I should have thought about the bare metal JVM deployments. I guess we’re all victims of our careers; 99% of what I do is server-ey and as a result I sometimes forget that there’s a whole bunch of other types of software.

bsaul · on March 26, 2024

This may not be the best article to state that rust is more pleasant to use than Go… Op shows cryptic code nesting generic container with various static and dynamic behavior related to memory… This is definitely something i don’t miss having in my go codebase.

forty · on March 28, 2024

Indeed :) but for many use cases, I find it fun and elegant

pjmlp · on March 27, 2024

They could have used OCaml, Haskel, F#, Scala (Scala/Native), Kotlin (Kotli/Native), Common Lisp, C#, D...

It isn't for lack of alternatives with nice type systems, automatic memory management and AOT tooling.

forty · on March 28, 2024

Yes, agreed. Having standalone native binary could also be one of the goal (yes the /native options works too)

Personally I like Scala a lot, and I used it quite a lot as a hobby for 10+ years, but my impression is that it's dying (or dead) so I'm not really motivated to play with it anymore. Kotlin always felt to me like a subpar Scala, so I never enjoyed it.

There is something about Rust that makes me feel it's a smart language and make me enjoy writing it (most of the time at least, it can absolutely be frustrating at times too). But it's certainly a matter of taste before all.

gpderetta · on March 27, 2024

The nice thing is that in rust threads do not share memory by default, so you could in principle have thread-local GCs that doesn't require stop the world for collection. You also could in principle use GC for non-latency critical threads and vice versa.

IncreasePosts · on March 26, 2024

I realize this is a biased sample, but about 99% of rust projects I see on hn or lobsters wouldn't have been encumbered by a GC

pjmlp · on March 26, 2024

Only for those that put the word GC on the same bag, without understanding there are tons of approaches, and many languages with automatic resource management also provide deterministic features.

hgs3 · on March 26, 2024

You can make garbage collection deterministic. Tracing collectors can be designed to accept a timeout so you can "collect" during downtime and resume execution on a predictable schedule. And with reference counting the cost of RC bookkeeping is smeared predictable across the run of the program. The real problem is many GC implementations lack customization and you're stuck with whatever GC algorithm the language implementers pick.

murderfs · on March 26, 2024

> And with reference counting the cost of RC bookkeeping is smeared predictable across the run of the program.

The biggest downside of naive reference counting usually isn't the cost of reference counting itself, it's being left stuck holding the bag when you're unlucky enough to be the last deref on a giant object graph. You can move the destruction onto a separate finalizer thread like what most runtimes that use tracing GCs do, but then you end up running into CPU scheduling issues if you're not overprovisioned.

pizlonator · on March 26, 2024

There’s much more to it than you say.

The real problem is most GCs are built with a “throughput first” mindset and determinism is bolted on.

The details of how to build a deterministic GC are beyond the scope of a HN post lol

alexchamberlain · on March 26, 2024

You can do RC memory management without the overhead of a GC runtime though.

jayd16 · on March 26, 2024

As described (ie. only run when I say so), the GC is less overhead than ref counting because you're not even paying for the cost to free things.

pizlonator · on March 26, 2024

I am saying that GCs can do it as a GC engineer

vlmutolo · on March 26, 2024

The thing is that most code has a straightforward, tree-like ownership model. It’s overkill to use a GC for everything. It also sometimes leads to memory bloat.

What we really need is an ergonomic way to opt into different kinds of memory allocation and management. The borrow checker is great for when you need no extraneous allocation. GCs are great for many linked lists. Bump allocators are great for web template rendering with lots of temporary strings.

logicprog · on March 26, 2024

This is my position as well.

Call me a crank, but while a lot of people say that you don't need to go without a garbage collector for most problems, I believe the converse: that there aren't many problems where you actually need a garbage collector (if you have some other way of freeing yourself from having to manually manage your memory and manually avoid use after free and so on). I think once your mind gets adjusted to working without one and within the constraints of the borrow checker, it's really not that much of a drag, and if you really need to do something outside those bounds (or just want to ignore it and hack something together quickly) that's what Gc and Arc are for.

And honestly, in general I like the unique intersection of features that Rust offers as a language, totally independent from its performance or the presence or absence of a garbage collector at all. Even if it was garbage collected and about 2x slower, I think I would still prefer it to most of the 40 or so languages that I've tried over the years. It feels like a version of OCaml with a much better standard library and ecosystem, a much better build system, a vastly better and cleaner way of doing parametric polymorphism, and much better metaprogramming facilities in comparison.

Not only that, but I genuinely find the constraints of the borrow checker — most especially only allowing either one mutable reference or multiple read-only references at a time — to be a great help in making my code clearer and more straightforward: it essentially helps me reach a lot of the same benefits as pure functional programming (where the data flow of my application is carefully threaded and there is no spooky action at a distance) while still allowing me to have imperative programming and mutation where I need it, by just forcing me to only have one place in my code be able to mutate anything at a given time, to explicitly mark it whenever I'm giving any piece of my code mutable access to anything, and essentially preventing me from having any persistent mutable access to a data structure be held in some other data structure or portion of code that can then continue to mutate it without my explicit permission and knowledge. So why give up the uniquely productive constraints of the borrow checker, which again to reiterate give me 90% of the benefits of pure functional programming with only 50% of the annoyance, and add in all of the unnecessary bloat and non-determinism of a garbage collector, when I have never really found myself to need it?

Furthermore, Rust is significantly more performant than even very fast and systems oriented garbage collected languages with their own runtime such as Go or OCaml or Java, which has the added benefit of allowing me to not really have to worry about performance at all beyond picking reasonable algorithms, and sometimes not even then since the language is often fast enough to make brute force feasible, which paradoxically means that when I am programming in rust, this high performance systems language, it often frees me from the burden of having to worry about performance like I would in something like python. For instance, I've actually implemented programs in Python that end up being so hilariously and frustratingly slow that I eventually gave up on working on them.

Moreover, if I do have to concern myself with performance in rust, I feel like working in a language that uses RAII and doesn't wrap everything in pointers for GC by default provides a significantly clearer mental model for me of what my program is actually doing with its memory, and how my data is actually laid out, and I also typically have a clearer mental model of what CPU operations my program is actually doing as well. And in fact just having this clear mental model of what my program is actually doing helps me Implement more efficient algorithms and make cleaner and less wasteful programs in general, which I quite like.

So yes, in essence I would turn the question back on the questioner and ask why they think they need a garbage collector and a language runtime and all of this extra stuff. Yes it does make the language marginally easier and quicker to write things in, but if I am sitting down to write a large scale program or long running project then being slowed down slightly in return for increased productivity.

keybored · on March 26, 2024

Honestly I feel like the Zeitgeist is that anything that smells of “optimization” is supposed to be preemptively justified. Like do you need a borrow checker or a static language or an AOT language implementation or...? Like anything that looks like they might by-default give you better performance (perish the thought) is supposed to be justified by some line about, oh well it turns out in our case that we need this thing actually, please forgive us for not using a dynamic language.

logicprog · on March 26, 2024

That's a really good elaboration of what I'm getting at! Now that you mention it, it does seem like that's the case — as if people have taken the mantra about "premature optimization" way too far, to rule out not just micro optimizations, but also using good algorithms, and even starting with good foundations. As if by default we should be using the slowest and least efficient language and runtime possible, and any little concession to by-default performance (which in my opinion doesn't really count as optimization even) has to be justified by some particularity of your field or task. And honestly, I have to wonder if that's sort of the mentality that has led us to things like Electron. It makes sense to be worried about optimization for performance if it makes it makes your developer velocity clearly worse, but in my opinion, and according to what little research I've seen, the difference in development velocity between an ahead of time compiled statically typed language and a dynamic language quickly disappears beyond a certain program size, which makes sense to me, since it would be presumptively a constant factor.

pjmlp · on March 26, 2024

They have been great for decades, it has been mostly a religion thing.

GC being any kind of automatic memory management, tracing GC, reference counting, mixed scheme with reference counting and cycle collectors.

Take any systems language that uses that approach to automatic resource management, coupled with ability to use stack allocation, registers and what have you, and there are very few use cases where something else is required.

Like kernels, drivers on hard real time constraints.

Even for GPUs I think shading languages are much more ergonomic than trying to fit classical languages into them.

hinkley · on March 26, 2024

One of the hills I will die on is the somewhat bold assertion that memcached primarily exists because Java got stuck at 1-2GB max heap sizes when obtainable servers started having 4-8 and then 16GB of memory. They got stuck there for long enough that a cottage industry of off-process lookup tables became tenable (also at the time Java was still UCS-16 encoded so you could save 25% of your memory by shoving text out of process)

I'm not ready to assume again that every popular GC-based language will be able to keep up with the available memory on cloud instances, let alone be able to vertically scale to a full bare metal machine. The value of Rust's tradeoffs gets bigger the higher you vertically scale, does it not?

swsieber · on March 26, 2024

I came for the speed, I stayed for the lack of `ConcurrentModificationException`s.

In all seriousness though, I'd say I only have 2 real projects that needed Rust's speed, and the rest I've done is because I bring the other type safety benefits Rust brings.

keybored · on March 26, 2024

Frustrating for who?[1] 99% of Rust users have likely used languages that use GC.

[1] Trying to wean myself off the archaic “whom”.

pvg · on March 26, 2024

off the archaic “whom”.

One easy fix is using the modern 'whomst' everywhere. It has the all the faux-sophistication of 'whom' (with some extra borrowed from 'amongst' and 'whilst') while being equitably wrong to everyone rather than special groups like whom-pedant toffs and who-everywhere anarchists.

keybored · on March 26, 2024

Perfect :)

pjmlp · on March 26, 2024

Which kind of GC though.

jandrewrogers · on March 26, 2024

GCs have not come far enough if you care about performance.

A modern "fast" GC has latencies similar to disk I/O on modern storage. We don't make a habit of randomly dropping a blocking disk I/O in the middle of a performance critical path, and disk I/O is less likely to pollute your CPU cache as a side effect than a GC. Given that performance-optimized code is routinely designed to eliminate the overhead of even context switches (e.g. thread-per-core architectures) to great benefit and GC pauses are much more expensive than a context switch even when "fast", it is difficult to develop a theory of how a GC does not have an adverse impact on performance.

You could try to make the argument that most people writing Rust don't care about performance so it doesn't matter, but that is likely to be controversial at the least.

pebal · on March 26, 2024

You can have completely concurrent GC, without pausing threads.

https://github.com/pebal/sgcl

jandrewrogers · on March 27, 2024

How does this work with standard thread-per-core architectures used in high-performance systems? Sharing state with a GC thread defeats the entire purpose even if you had a dedicated GC thread for each working core to reduce contention, since the entire point is to all but eliminate atomics and cache coherency overhead.

This doesn't address the issue with GCs in high-performance systems, it just turns it into a different kind of problem.

pebal · on March 27, 2024

You either pause the worker threads or share the state, there's no other way. However, the GC thread does not modify the data state.

_se · on March 27, 2024

So, your pauseless GC pauses or requires some form of resource contention (probably mutex-based) to share the state. It's not possible to have a pauseless GC in thread per core, which immediately rules it out for HPC. These are the types of things that the "GC for everything" crowd admits that they don't understand when making such statements.

pebal · on March 27, 2024

I misunderstood you. State is shared by mutator threads using atomic::store(memory_order_release). SGCL never stops threads and does not use locks to share state.

sunshowers · on March 26, 2024

Which popular GC languages have mutexes owning the state they guard, such that it is structurally impossible to access that inner state without holding the lock? To do that you must have single ownership.

Rust achieves correctness at scale, over time, more than most GC languages.

pornel · on March 26, 2024

I don’t mind GC per se, but languages based around a GC tend to lack other useful properties that Rust has, like deterministic destruction for resource management, and exclusive ownership that prevents surprising mutation of objects without making them immutable.

hgs3 · on March 26, 2024

This is what bothers me about using Rust: it seems like Rust is trying to be both a systems language and a higher-level glue language in one, but (for me) it is cumbersome at both. If I need to write higher-level code then I'd rather use Go, Python, or whatever else and interface with C as needed. If I'm writing low-level code then I can use C and an embeddable scripting language for any high-level bits. This multi-language approach is not discussed enough. Recently, I completed a project that used cgo to interop C and Go code and the experience was very pleasant. No need for language workarounds or other code contortions. Just two languages doing what they do best.

keybored · on March 26, 2024

It’s a systems language (whatever that means). It can also be a glue language if you want it to be since it has high-level capabilities. Since being an austere systems programming language is kind of a defunct 90’s thing.

You argue that you would use a higher-level language for that part. Fair. But I don’t see the objective argument for using C for the low-level part. You just say that you want to use C. That’s preference and has nothing to do with Rust being a language that can do multiple things.

tupshin · on March 26, 2024

Your comment is about rust, but not at all related to the article.

hgs3 · on March 26, 2024

The "workarounds or other code contortions" I'm talking about is the focus of the article. The point of the article is that safe Rust cannot directly represent circular data structures and therefore you need workarounds to represent them. My point is rather than fighting the borrow checker you could use two complimentary languages to achieve the same goal.

To add more context: I tried the articles suggestion and used integer handles some years back and shared my feedback on HN [1]. It wasn't a good experience as integer handles are just pointers the borrow checker doesn't know about. It made the implementation far more cumbersome than it needed to be.

[1] https://news.ycombinator.com/item?id=32907482

pvg · on March 26, 2024

focus of the article

That's a key difference - the article has a focus and your comment is generic and repetitive programming-language-war adjacent. There's even a guideline exactly about that - "Comments should get more thoughtful and substantive, not less, as a topic gets more divisive."

hgs3 · on March 26, 2024

That's my bad. I should have opened with my context first (see my remark about "I tried the articles suggestion") instead of sharing my conclusion.

dmitrygr · on March 26, 2024

[flagged]

anon291 · on March 26, 2024

Realistically, other than cool factor I'm having a real hard justifying the use of rust. It either devolves into using reference counting (fine but we have lots of languages that do that without borrow checking), or dynamics (i.e refcell).

The type system is nice with the algebraic data types, but lacking the higher orderedness of languages like Haskell, it just seems sad.

So basically, to me it seems like you're getting the worst of both worlds. You don't get the freedom of c/c++ and you don't get the actual type safety of a language like Haskell (refcell kills the notion of compile time checking).

Realistically what needs to happen is some system for circular references like Haskell has (let bindings).

empath-nirvana · on March 26, 2024

There's so much you can do in rust without messing with any of that, and the only time you have to get deep in the weeds arguing with the compiler is _exactly_ the sort of code that you _should_ be spending a lot of time carefully thinking about. That C allows you to write that code without thinking about it much is _not_ a benefit to using C.

anon291 · on March 26, 2024

Coming from a mostly Haskell background, I assure you I think about my code a lot.

sunshowers · on March 26, 2024

Data structures with circular references are so incredibly rare in practice. I've occasionally needed to use a graph (sometimes even a cyclic one), and petgraph with integer indexes works fine for that. Other than that, I've been writing Rust full-time for over 7 years and really not needed them.

The benefit is that for almost all code you get mutability xor aliasing, which I think is incredibly important to achieving correctness at scale.

dmitrygr · on March 26, 2024

> Data structures with circular references are so incredibly rare in practice

A doubly-linked list enters the chat...

sunshowers · on March 26, 2024

Doubly-linked lists are incredibly rare, few people should ever use one and even fewer people should ever write one. In almost all cases vectors are better [1].

Rust's stdlib comes with one: https://doc.rust-lang.org/std/collections/struct.LinkedList....

[1] There are some cases where they're useful, e.g. when amortized O(1) append is not enough and you want actual O(1) append. But this matters to almost no one.

dmitrygr · on March 26, 2024

> Doubly-linked lists are incredibly rare, few people should ever use one

Umm... how do you propose implementing a queue or a stack cpu-efficiently and memory-efficiently using vectors?

sunshowers · on March 26, 2024

In almost all cases, it is better to use a vector for both those data structures. Cache locality is king, memcpy is blazing fast, and pointer chasing is generally pretty bad for performance.

For a stack, just use a regular vector. Pushes are amortized O(1), and pops are O(1).

For a queue, a ring buffer backed by a vector. (This is how VecDeque in Rust's stdlib works.) Again, enqueue is amortized O(1) and dequeue is O(1).

Even if you do need to use a linked list, it's generally better to depend on someone else's implementation than write your own.

dmitrygr · on March 26, 2024

> it's generally better to depend on someone else

I am the someone else who writes that, often enough…

All y'all forget that behind your cute libraries and abstractions stands someone who wrote them in C or assembly…

sunshowers · on March 27, 2024

In that case, you'll find that Rust is an excellent language for writing doubly-linked lists.

A lot of the point of Rust is to make a few people do the hard work of writing some complex data structure in unsafe code, and let everyone else reap the benefits.

keybored · on March 27, 2024

Where do I send the thank you card?

keybored · on March 26, 2024

> incredibly rare

> doubly-linked lists

Synonyms for application developers.

zozbot234 · on March 26, 2024

> It either devolves into using reference counting (fine but we have lots of languages that do that without borrow checking), or dynamics (i.e refcell).

The point is that you don't have to do that for all of your data objects, unlike other languages that use refcounts and dynamic checks under the hood. Most alloc/dealloc in Rust can be dealt with via simple RAII.

jeremyjh · on March 26, 2024

I don't see what let bindings have to do with solving the circular reference problem, those are just local bindings. A Haskell doubly-linked list would use IORef, but no one would ever actually do that.

anon291 · on March 26, 2024

Haskell doubly linked list:

    data DLList a = DLList a (DLList a) (DLList a) | End

    make_dllist :: [a] -> DLList a
    make_dllist = go (\_ -> End)
      where go mkPrev [] = mkPrev End
            go mkPrev (x:xs) = go (\next -> let me = DLList x (mkPrev me) next in me) xs

No need for IORef. Too many people misunderstand the meaning of 'let' in Haskell. It doesn't mean what you think it does. It defines, not assigns. In rust, let 'declares' and assigns in one go. It does not define.

bronson · on March 26, 2024

C is not a good language for people who are short on time.

keybored · on March 26, 2024

But you have enough time to mess around with Valgrind?

Certhas · on March 26, 2024

If you have a job to do, use standard data structures implemented in a high quality library.

dmitrygr · on March 26, 2024

Which high quality libraries do you suggest for an environment with no OS, heap, stack, globals, or initialized data?