Hacker News

> LTO

On some workloads (think calls that cannot be inlined within a hot loop), I found LTO to be a requirement for C code to match C# performance, not the other way around. We've come a long way!

(if you ask whether there are any caveats - yes: the JIT is able to win additional perf points by not being constrained to SSE2/4.2, and by shipping more heavily vectorized primitives out of the box, which allow single-line changes that outpace what the average C library has access to)




> on some workloads, I found LTO to be a requirement for C code to match C# performance

Yeah, I observed that too. As far as I remember, that code did many small memory allocations, and .NET GC was faster than malloc.

However, last time I tested (used .NET 6 back then), for code which crunches numbers with AVX, my C++ with SIMD intrinsics was faster than C# with SIMD intrinsics. Not by much, but noticeable, like 20%. The code generator was just better in C++. I suspect the main reason is the .NET JIT compiler doesn't have time for expensive optimisations.


> The code generator was just better in C++. I suspect the main reason is .NET JIT compiler doesn’t have time for expensive optimisations.

Yeah, there are heavy constraints on how many phases there are and how much work each phase can do. Besides the inlining budget, there are many hidden "limits" within the compiler which reduce the risk of throughput loss.

For example - the JIT will only track so many assertions about local variables at a time, and if a method has too many blocks, it may not track them perfectly across their full span.

GCC and LLVM are able to leisurely repeat optimization phases, whereas RyuJIT avoids it (even if some phases replicate optimizations that happened earlier). This will change once the "Opt Repeat" feature gets productized[0]; we will most likely see it in NativeAOT first, as you'd expect.

On matching the codegen quality GCC produces for vectorized code - I'm usually able to replicate it by iteratively refactoring the implementation and quickly checking its disasm with the Disasmo extension. The main catch with this type of code is that GCC, LLVM and ILC/RyuJIT each have their own quirks around SIMD (e.g. does the compiler mistakenly rematerialize a vector constant's construction inside the loop body, undoing your hoisting of its load?). I used to think this was a weakness unique to .NET, but then I learned that GCC and LLVM tend to be vulnerable to it too, and even regress across updates, as sometimes happens in SIMD edge cases in .NET - though it's certainly not as common there. Where GCC/LLVM are better is when you start abstracting away your SIMD code, in which case .NET may need more help: once you start exhausting the available registers due to sometimes less-than-optimal register allocation, you start getting spills. Or you may run into technically correct behavior around vector shuffles, where the JIT needs to replicate portable semantics but fails to see that your constant doesn't require them, so you need to reach for platform-specific intrinsics to work around it.

[0]: https://github.com/dotnet/runtime/issues/108902



