Hi, author here. Definitely not trying to diss Rust, I love Rust! I'm pointing out some interesting overheads that aren't well known to the average Rust programmer, which Mojo was able to improve on with the power of hindsight and the benefit of being newer. For your points below, let me clarify:
> There's no implicit string copying in Rust. Even when passing String ownership, it will usually be passed in registers.
The String metadata can be passed in registers if LLVM does that optimization, but it's not guaranteed and doesn't always happen. A Rust move is just a memcpy, and there are situations where LLVM doesn't optimize them away, resulting in Rust programs doing a lot more memcpy than people realize.
> It's idiomatic to use `&str` by default.
True if you want it to be immutable, but this actually adds to my point. That is the default behavior in Mojo, without having to understand things like deref coercion and the difference between `&str` and `&String`. In Rust it's an unintuitive best practice that everyone has to learn pretty early in their journey. In Mojo they get the best behavior by default, which gives them a gentler learning curve; that's important for our Python audience. Default behavior > idiomatic things to learn.
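For readers coming from Python, here's a minimal sketch of the idiomatic Rust pattern being discussed (illustrative names only, not code from the post):

```rust
// Idiomatic: borrow as &str, which works for String, literals, and slices
// thanks to deref coercion (String dereferences to str).
fn greet(name: &str) {
    println!("Hello, {name}!");
}

// Compiles, but taking &String is the novice mistake mentioned below:
// it only accepts &String, not string literals or slices.
fn greet_string(name: &String) {
    println!("Hello, {name}!");
}

fn main() {
    let owned = String::from("world");
    greet(&owned);        // &String coerces to &str
    greet("literal");     // &'static str works directly
    greet_string(&owned);
    // greet_string("literal"); // would not compile
}
```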
> The borrow checker doesn't influence destructors.
I didn't claim that. My point was that Rust does do runtime checks using drop flags to decide whether a value should be dropped. This can be resolved statically during compilation, but won't happen if the initialization state of an object is unknown at compile time: https://doc.rust-lang.org/nomicon/drop-flags.html
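To make the drop-flag case concrete, here's a minimal sketch of conditional initialization where the compiler can't know statically whether the destructor should run (illustrative, not code from the post):

```rust
fn maybe_build(flag: bool) {
    let x;
    if flag {
        // x is only initialized on this branch, so the compiler emits a
        // hidden drop flag and checks it at the end of x's scope to decide
        // whether String's destructor should run.
        x = String::from("only sometimes initialized");
        println!("{x}");
    }
    // drop flag consulted here at runtime
}

fn main() {
    // Use a value that isn't a compile-time constant so the flag survives.
    maybe_build(std::env::args().count() > 1);
}
```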
> &String and usize can't have a destructor, and can be forgotten at any time
In the example, the call stack is growing with a new reference and usize in each frame for each call. This is why tail recursion in Rust has so many issues: those values need to be available to the end of scope to satisfy Rust's guarantees, so they can't be "forgotten at any time". It also overflows the stack a lot faster.
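For context, a sketch of the general shape under discussion: a recursive call whose frames each hold a `&String` and a `usize` (illustrative only, not the exact code from the post):

```rust
// Each call pushes a new frame holding the &String and the usize.
// The point in dispute is whether those values must stay live until the
// end of each frame's scope, blocking tail-call optimization.
fn recurse(s: &String, depth: usize) -> usize {
    if depth == 0 {
        s.len()
    } else {
        recurse(s, depth - 1)
    }
}

fn main() {
    let s = String::from("hello");
    println!("{}", recurse(&s, 10));
}
```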
> Their benchmark intended to demonstrate tail call cost compiles to a constant, with no `factorial()` calls at run time. They're only benchmarking calls to black_box(1307674368000). Criterion put read_volatile in there. Mojo probably uses a zero-cost LLVM intrinsic instead.
If the Rust benchmark isn't calling `factorial()` it should be instant and faster than Mojo, yet the Rust version is much slower. `benchmark.keep` in Mojo is a "clobber" directive, indicating that the value could be read or written at any time, so LLVM doesn't optimize away the function calls that produce the result.
Thanks for taking the time to read the post, and write out your thoughts. Really enjoying the discussion around these topics.
Rust optimizes factorial to be iterative, not using recursion (tail or otherwise) at all, and it turns `factorial(15, 1)` into `1307674368000`: https://rust.godbolt.org/z/bGrWfYKrP. As has been pointed out a few times, you're benchmarking `criterion::black_box` vs `benchmark.keep` (try the newer `std::hint::black_box`, which is built into the compiler and should have lower overhead).
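For anyone following along, a minimal sketch of what using the compiler-built-in `std::hint::black_box` looks like, assuming an accumulator-style two-argument factorial (illustrative, not Criterion's actual harness):

```rust
use std::hint::black_box;

// Accumulator-style factorial, assumed to match the factorial(15, 1) shape.
fn factorial(n: u64, acc: u64) -> u64 {
    if n <= 1 { acc } else { factorial(n - 1, acc * n) }
}

fn main() {
    // black_box on the inputs stops the compiler from constant-folding the
    // whole call into 1307674368000; black_box on the result stops the call
    // from being dead-code eliminated.
    let result = factorial(black_box(15), black_box(1));
    black_box(result);
}
```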
I updated the blog with full benchmark reproduction instructions. I also removed criterion::black_box altogether, and it resulted in no performance difference. Removing benchmark.keep from Mojo causes it to optimize away everything and run in less than a picosecond.
If you could show me a benchmark that supports what you're saying, that'd be great, thanks.
Rust realizes the vector is never used, and so never does any allocation or recursion; it just turns into a loop counting up to 999_999_999.
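For reference, a sketch of the general shape being described, with an allocated-but-never-read Vec inside the hot loop (illustrative only, not the exact benchmark from the post):

```rust
fn main() {
    let mut count = 0u64;
    for _ in 0..999_999_999u64 {
        // The Vec is created and dropped without ever being read, so the
        // optimizer is free to delete the allocation entirely; what's left
        // is just the counting loop.
        let _v: Vec<u64> = vec![0; 16];
        count += 1;
    }
    println!("{count}");
}
```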
And some back-of-the-napkin math says there's no way either benchmark is actually allocating anything. Even if malloc took 1 nanosecond (it _doesn't_), 999_999_999 allocations at 1 nanosecond each would already take 0.999999999 seconds.
It _is_ somewhat surprising that Rust doesn't realize the loop can be completely optimized away, like it does without the unused Vec, but this benchmark still isn't showing what you're trying to show.
> The String metadata can be passed in registers if LLVM does that optimization, but it's not guaranteed and doesn't always happen. A Rust move is just a memcpy, and there are situations where LLVM doesn't optimize them away, resulting in Rust programs doing a lot more memcpy than people realize.
I believe even this is not quite correct, because memcpy here is an implementation detail of moves. Rust could relatively easily amend its (not yet standardized) ABI to not physically move arguments larger than some threshold, like many C++ ABIs do, if needed. I don't know the current status, but AFAIK it has been considered multiple times in the past.
Passing arguments in registers is not an optimization, but part of the ABI. It always happens up to a certain number of arguments, and Rust in particular uses an ABI that flattens more structs into registers than C++ does.
Other moves could be memcpys, but there's a distinction between Rust saying moves behave like memcpy and moves actually being memcpys. String's 3×size_t (or 2×size_t for Box<str>/Arc<str>) is below LLVM's threshold for emitting an actual memcpy call. Rust also has optimization passes for eliminating redundant moves.
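A quick way to check the sizes being referred to (the sizes are observable, even though the exact field layout isn't guaranteed by the language):

```rust
use std::mem::size_of;

fn main() {
    // String is (pointer, length, capacity): three machine words.
    assert_eq!(size_of::<String>(), 3 * size_of::<usize>());
    // Box<str> and Arc<str> are fat pointers (pointer + length): two words.
    assert_eq!(size_of::<Box<str>>(), 2 * size_of::<usize>());
    assert_eq!(size_of::<std::sync::Arc<str>>(), 2 * size_of::<usize>());
}
```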
You're giving an impression that memcpy happens all over the place, where in reality it's quite rare, and certainly doesn't happen in the simple cases you describe.
In Rust, knowledge of ownership and the zoo of strings is a requirement (e.g. use of &String is a novice error). It's nice that Mojo can hide it, and you could celebrate that without making dubious performance claims.
> True if you want it to be immutable, but this actually adds to my point.
Sigh, it adds to the inaccuracies. Mutable strings are passed as `&mut String`, a single pointer, so a mutable string is an even better case of a thin reference that doesn't need a memcpy.
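A minimal sketch of that mutable case, where the callee receives one machine word and mutates the caller's String in place (illustrative names):

```rust
use std::mem::size_of;

// The callee receives a single thin pointer; no String metadata is copied.
fn shout(s: &mut String) {
    s.push_str("!!!");
}

fn main() {
    assert_eq!(size_of::<&mut String>(), size_of::<usize>());
    let mut s = String::from("hello");
    shout(&mut s);
    println!("{s}"); // hello!!!
}
```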
> those values need to be available to the end of scope to satisfy Rust's guarantees
No they don't. You're conflating Rust's guaranteed Drop order (which does interfere with TCO) with borrow checking and stack usage, which don't. For references and Copy types, Rust has "eager drop" behavior. Their existence on the stack is neither guaranteed nor necessary.
Borrow checking scopes are hypothetical, for the sake of the check, and don't influence code generation in any way. You can literally remove the borrow checker and lifetimes from Rust entirely, and the code will compile to the same instructions; the mrustc implementation is a real example of that.
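A small illustration of the "eager" behavior mentioned above: a shared borrow ends at its last use (non-lexical lifetimes), not at the end of the lexical scope:

```rust
fn main() {
    let mut s = String::from("hi");
    let r = &s;
    println!("{r}");      // last use of `r`: the borrow ends here (NLL),
                          // not at the end of the lexical scope
    s.push_str(" there"); // mutating `s` is allowed even though `r` is
                          // still "in scope" lexically
    println!("{s}");
}
```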
Your example function where you try to demonstrate how the arguments prevent TCO compiles to a single `ret`.
> it should be instant and faster than Mojo
The `factorial()` call is instant, but the `black_box` isn't, because Rust/Criterion implements it differently than Mojo does. So Mojo has a faster `benchmark.keep` function, you failed to benchmark the relevant function, and you presented a misleading benchmark with a wrong conclusion.
You should validate your claims about what Rust does by actually checking the compiler output. Try https://rust.godbolt.org/ (don't forget to add `-O` to the flags!) or use cargo-show-asm.
Hi, thanks for the discussion. I tried out a bunch of different benchmarks, and Rust TCO is actually working as you say it does, so I removed that part from the blog. I definitely need to upskill on assembly.
I removed criterion::black_box from Rust and it made no performance difference, so I updated the blog. If I remove benchmark.keep from Mojo it runs in less than a picosecond, so I left it in to be fair.
Can you show me a benchmark to dispute my claim about recursion? I'm not sure what the generated assembly of a function that's not being run is meant to prove.