Have you tried using Enzyme (https://enzyme.mit.edu/)? It operates on the LLVM IR, so it's available in any language that breaks down into LLVM (e.g., Julia, where I've used it for surface gradients) and it produces highly optimized AD code. Pretty cool stuff.
Yeah I've used it (cool project indeed!), albeit mostly just in a project I and others in the autodiff community maintain which benchmarks many different autodiff tools against each other: https://github.com/gradbench/gradbench