Hi, author here. Definitely not trying to diss Rust, I love Rust! I'm pointing out some interesting overheads that aren't well known to the average Rust programmer, which Mojo was able to improve on with the power of hindsight and the benefit of being newer. For your points below, let me clarify:
> There's no implicit string copying in Rust. Even when passing String ownership, it will usually be passed in registers.
The String metadata can be passed in registers if LLVM does that optimization, but it's not guaranteed and doesn't always happen. A Rust move is just a memcpy, and there are situations where LLVM doesn't optimize them away, resulting in Rust programs doing a lot more memcpy than people realize.
> It's idiomatic to use `&str` by default.
True if you want it to be immutable, but this actually adds to my point. That is the default behavior in Mojo, without having to understand things like deref coercion and the difference between `&str` and `&String`. In Rust it's an unintuitive best practice that everyone has to learn pretty early in their journey. In Mojo they get the best behavior by default, which gives them a gentler learning curve; that's important for our Python audience. Default behavior > idiomatic things to learn.
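For readers coming from Python, here's a minimal sketch of the idiomatic Rust pattern being discussed (illustrative names only, not code from the post):

```rust
// Idiomatic: borrow as &str, which works for String, literals, and slices
// thanks to deref coercion (String dereferences to str).
fn greet(name: &str) {
    println!("Hello, {name}!");
}

// Compiles, but taking &String is the novice mistake mentioned below:
// it only accepts &String, not string literals or slices.
fn greet_string(name: &String) {
    println!("Hello, {name}!");
}

fn main() {
    let owned = String::from("world");
    greet(&owned);        // &String coerces to &str
    greet("literal");     // &'static str works directly
    greet_string(&owned);
    // greet_string("literal"); // would not compile
}
```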
> The borrow checker doesn't influence destructors.
I didn't claim that. My point was that Rust does do runtime checks using drop flags to decide whether a value should be dropped. This can be resolved statically during compilation, but won't happen if the initialization state of an object is unknown at compile time: https://doc.rust-lang.org/nomicon/drop-flags.html
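To make the drop-flag case concrete, here's a minimal sketch of conditional initialization where the compiler can't know statically whether the destructor should run (illustrative, not code from the post):

```rust
fn maybe_build(flag: bool) {
    let x;
    if flag {
        // x is only initialized on this branch, so the compiler emits a
        // hidden drop flag and checks it at the end of x's scope to decide
        // whether String's destructor should run.
        x = String::from("only sometimes initialized");
        println!("{x}");
    }
    // drop flag consulted here at runtime
}

fn main() {
    // Use a value that isn't a compile-time constant so the flag survives.
    maybe_build(std::env::args().count() > 1);
}
```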
> &String and usize can't have a destructor, and can be forgotten at any time
In the example, the call stack is growing with a new reference and usize in each frame for each call. This is why tail recursion in Rust has so many issues: those values need to be available to the end of scope to satisfy Rust's guarantees, so they can't be "forgotten at any time". It also overflows the stack a lot faster.
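For context, a sketch of the general shape under discussion: a recursive call whose frames each hold a `&String` and a `usize` (illustrative only, not the exact code from the post):

```rust
// Each call pushes a new frame holding the &String and the usize.
// The point in dispute is whether those values must stay live until the
// end of each frame's scope, blocking tail-call optimization.
fn recurse(s: &String, depth: usize) -> usize {
    if depth == 0 {
        s.len()
    } else {
        recurse(s, depth - 1)
    }
}

fn main() {
    let s = String::from("hello");
    println!("{}", recurse(&s, 10));
}
```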
> Their benchmark intended to demonstrate tail call cost compiles to a constant, with no `factorial()` calls at run time. They're only benchmarking calls to black_box(1307674368000). Criterion put read_volatile in there. Mojo probably uses a zero-cost LLVM intrinsic instead.
If the Rust benchmark isn't calling `factorial()` it should be instant and faster than Mojo, yet the Rust version is much slower. `benchmark.keep` in Mojo is a "clobber" directive, indicating that the value could be read or written at any time, so LLVM doesn't optimize away the function calls that produce the result.
Thanks for taking the time to read the post, and write out your thoughts. Really enjoying the discussion around these topics.
Rust optimizes factorial to be iterative, not using recursion (tail or otherwise) at all, and it turns `factorial(15, 1)` into `1307674368000`: https://rust.godbolt.org/z/bGrWfYKrP. As has been pointed out a few times, you're benchmarking `criterion::black_box` vs `benchmark.keep` (try the newer `std::hint::black_box`, which is built into the compiler and should have lower overhead).
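For anyone following along, a minimal sketch of what using the compiler-built-in `std::hint::black_box` looks like, assuming an accumulator-style two-argument factorial (illustrative, not Criterion's actual harness):

```rust
use std::hint::black_box;

// Accumulator-style factorial, assumed to match the factorial(15, 1) shape.
fn factorial(n: u64, acc: u64) -> u64 {
    if n <= 1 { acc } else { factorial(n - 1, acc * n) }
}

fn main() {
    // black_box on the inputs stops the compiler from constant-folding the
    // whole call into 1307674368000; black_box on the result stops the call
    // from being dead-code eliminated.
    let result = factorial(black_box(15), black_box(1));
    black_box(result);
}
```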
I updated the blog with full benchmark reproduction instructions. I also removed criterion::black_box altogether, and it resulted in no performance difference. Removing benchmark.keep from Mojo causes it to optimize away everything and run in less than a picosecond.
If you could show me a benchmark that supports what you're saying, that'd be great, thanks.
Rust realizes the vector is never used, and so never does any allocation or recursion; it just turns into a loop counting up to 999_999_999.
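For reference, a sketch of the general shape being described, with an allocated-but-never-read Vec inside the hot loop (illustrative only, not the exact benchmark from the post):

```rust
fn main() {
    let mut count = 0u64;
    for _ in 0..999_999_999u64 {
        // The Vec is created and dropped without ever being read, so the
        // optimizer is free to delete the allocation entirely; what's left
        // is just the counting loop.
        let _v: Vec<u64> = vec![0; 16];
        count += 1;
    }
    println!("{count}");
}
```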
And some back-of-the-napkin math says there's no way either benchmark is actually allocating anything. Even if malloc took 1 nanosecond (it _doesn't_), 999_999_999 allocations at 1 nanosecond each would already take 0.999999999 seconds.
It _is_ somewhat surprising that Rust doesn't realize the loop can be completely optimized away, like it does without the unused Vec, but this benchmark still isn't showing what you're trying to show.
> The String metadata can be passed in registers if LLVM does that optimization, but it's not guaranteed and doesn't always happen. A Rust move is just a memcpy, and there are situations where LLVM doesn't optimize them away, resulting in Rust programs doing a lot more memcpy than people realize.
I believe even this is not quite correct, because memcpy here is an implementation detail of moves. Rust could relatively easily amend its (not yet standardized) ABI to not physically move arguments larger than some threshold, like many C++ ABIs do, if needed. I don't know the current status, but AFAIK it has been considered multiple times in the past.
Passing arguments in registers is not an optimization, but part of the ABI. It always happens up to a certain number of arguments, and Rust in particular uses an ABI that flattens more structs into registers than C++ does.
Other moves could be memcpys, but there's a distinction between Rust saying moves behave like memcpy and moves actually being memcpys. String's 3×size_t (or 2×size_t for Box<str>/Arc<str>) is below LLVM's threshold for emitting an actual memcpy call. Rust also has optimization passes for eliminating redundant moves.
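A quick way to check the sizes being referred to (the sizes are observable, even though the exact field layout isn't guaranteed by the language):

```rust
use std::mem::size_of;

fn main() {
    // String is (pointer, length, capacity): three machine words.
    assert_eq!(size_of::<String>(), 3 * size_of::<usize>());
    // Box<str> and Arc<str> are fat pointers (pointer + length): two words.
    assert_eq!(size_of::<Box<str>>(), 2 * size_of::<usize>());
    assert_eq!(size_of::<std::sync::Arc<str>>(), 2 * size_of::<usize>());
}
```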
You're giving an impression that memcpy happens all over the place, where in reality it's quite rare, and certainly doesn't happen in the simple cases you describe.
In Rust, knowledge of ownership and the zoo of strings is a requirement (e.g. use of &String is a novice error). It's nice that Mojo can hide it, and you could celebrate that without making dubious performance claims.
> True if you want it to be immutable, but this actually adds to my point.
Sigh, it adds to the inaccuracies. Mutable strings are passed as `&mut String`, a single pointer, so a mutable string is an even better case of a thin reference that doesn't need a memcpy.
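A minimal sketch of that mutable case, where the callee receives one machine word and mutates the caller's String in place (illustrative names):

```rust
use std::mem::size_of;

// The callee receives a single thin pointer; no String metadata is copied.
fn shout(s: &mut String) {
    s.push_str("!!!");
}

fn main() {
    assert_eq!(size_of::<&mut String>(), size_of::<usize>());
    let mut s = String::from("hello");
    shout(&mut s);
    println!("{s}"); // hello!!!
}
```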
> those values need to be available to the end of scope to satisfy Rust's guarantees
No they don't. You're conflating Rust's guaranteed Drop order (which does interfere with TCO) with borrow checking and stack usage, which don't. For references and Copy types, Rust has "eager drop" behavior. Their existence on the stack is neither guaranteed nor necessary.
Borrow checking scopes are hypothetical, for the sake of the check, and don't influence code generation in any way. You can literally remove the borrow checker and lifetimes from Rust entirely, and the code will compile to the same instructions; the mrustc implementation is a real example of that.
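A small illustration of the "eager" behavior mentioned above: a shared borrow ends at its last use (non-lexical lifetimes), not at the end of the lexical scope:

```rust
fn main() {
    let mut s = String::from("hi");
    let r = &s;
    println!("{r}");      // last use of `r`: the borrow ends here (NLL),
                          // not at the end of the lexical scope
    s.push_str(" there"); // mutating `s` is allowed even though `r` is
                          // still "in scope" lexically
    println!("{s}");
}
```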
Your example function where you try to demonstrate how the arguments prevent TCO compiles to a single `ret`.
> it should be instant and faster than Mojo
The `factorial()` call is instant, but the `black_box` isn't, because Rust/Criterion implements it differently than Mojo does. So Mojo has a faster `benchmark.keep` function, you failed to benchmark the relevant function, and you presented a misleading benchmark with a wrong conclusion.
You should validate your claims about what Rust does by actually checking the compiler output. Try https://rust.godbolt.org/ (don't forget to add `-O` to the flags!) or use cargo-show-asm.
Hi, thanks for the discussion. I tried out a bunch of different benchmarks, and Rust TCO is actually working as you say it does, so I removed that part from the blog. I definitely need to upskill on assembly.
I removed criterion::black_box from Rust and it made no performance difference, so I updated the blog. If I remove benchmark.keep from Mojo it runs in less than a picosecond, so I left it in to be fair.
Can you show me a benchmark to dispute my claim about recursion? I'm not sure what the generated assembly of a function that's not being run is meant to prove.