It's not an implementation of CUDA; it's an implementation of the CUDA runtime API. The runtime API is what you use to configure the card, allocate and copy memory, and launch kernels. Importantly, you cannot use this to write the actual kernels which run on the GPU!
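For context, a minimal sketch of what the runtime API covers on the host side (standard CUDA, nothing project-specific). Everything except the `__global__` kernel body is runtime-API territory; the kernel itself is the part no runtime shim can write for you:

```
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// The kernel: "CUDA the language", compiled for the GPU - out of scope
// for a runtime-API implementation.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* h_a = (float*)malloc(bytes);
    float* h_b = (float*)malloc(bytes);
    float* h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Runtime API: device memory and transfers.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // The <<<>>> launch is lowered to runtime-API calls under the hood.
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]); // 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```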
I know AMD has a whole bunch of (related?) projects for GPU compute, but man - if they could just provide an interop layer that Just Works they'd get immediate access to so much more market share.
“Just works” in this context means executing the compiled CUDA or the PTX bytecode without recompiling. Nobody is ever going to utilize ROCm if it requires distributing as source and recompiling.
To make it even more insulting, simply installing ROCm is itself a massive burden, even on ostensibly supported hardware (as geohot discovered). And even "it works out of the box if you distribute the source and compile it locally" ignores that whole massive "draw the rest of the owl" stage of getting ROCm installed and building properly in your environment.
> “Just works” in this context means executing the compiled CUDA or the PTX bytecode without recompiling. Nobody is ever going to utilize ROCm if it requires distributing as source and recompiling.
Even a source-compatible layer that let you just recompile CUDA code for an AMD GPU would be a huge improvement. That alone would eliminate the CUDA lock-in.
Don't forget AMD doesn't seem to care much about ROCm themselves. Six months after launch, RDNA3 cards still don't support it. Can you imagine if Nvidia had launched RTX 40-series cards with no DLSS even though 30-series cards already had it, and six months later started boasting about how DLSS support was "coming this fall"?
The hardware that is officially supported is a subset of the hardware that works. You are correct that the RX 7900 XT is not officially supported, but I must point out that you are linking to a fork of the documentation from 2019. This is the official ROCm documentation: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h...
TLDR; If you provide even more functions through the overloaded headers, incl. "hidden ones", e.g., `__cudaPushCallConfiguration`, you can use LLVM/Clang as a CUDA compiler and target AMD GPUs, the host, and soon GPUs of two other manufacturers.
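Roughly, the mechanism (a hand-wavy sketch, not the project's actual wrapper headers; the signatures and the thread-local stand-in below are assumptions): Clang lowers a `foo<<<grid, block>>>(args)` launch into a call to `__cudaPushCallConfiguration`, followed by a stub that pops the configuration and launches, so overloaded headers that intercept those entry points can redirect the launch to an AMD or host backend:

```
#include <cstddef>

// Hypothetical interception layer. dim3 and the backend that eventually
// performs the launch are stand-ins; only the push/pop call-configuration
// protocol is what Clang actually emits calls to.
struct dim3 { unsigned x, y, z; };

namespace {
struct LaunchConfig { dim3 grid, block; size_t shmem; void* stream; };
thread_local LaunchConfig g_config; // one pending launch config per thread
}

extern "C" unsigned __cudaPushCallConfiguration(dim3 grid, dim3 block,
                                                size_t shmem, void* stream) {
    g_config = {grid, block, shmem, stream};
    return 0;
}

extern "C" int __cudaPopCallConfiguration(dim3* grid, dim3* block,
                                          size_t* shmem, void* stream) {
    *grid = g_config.grid;
    *block = g_config.block;
    *shmem = g_config.shmem;
    *(void**)stream = g_config.stream;
    return 0; // 0 standing in for cudaSuccess
}

// The generated kernel stub then calls the launch entry point (e.g.
// cudaLaunchKernel), which the overloaded headers can map onto an AMD
// or host-side launch instead of the NVIDIA runtime.
```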
Yes, though with caveats. The driver and parts of the extended API we used to lower CUDA calls are in upstream LLVM. The wrapper headers are not.
We will continue the process of getting it all to work in upstream/vanilla LLVM soon though. Help is always appreciated.
FWIW, we have some alternative ideas on how to get out of the vendor trap, as well as some existing prototypes to deal with things like CUBLAS and Thrust.
Feel free to reach out, or just keep an eye out.
1. This implements the clunky C-ish API; there's also the Modern-C++ API wrappers, with automatic error checking, RAII resource control etc.; see: https://github.com/eyalroz/cuda-api-wrappers (due disclosure: I'm the author)
2. Implementing the _runtime_ API is not the right choice; it's important to implement the _driver_ API, otherwise you can't isolate contexts, dynamically add newly-compiled JIT kernels via modules etc. (see the driver-API sketch after this list)
3. This is less than 3000 lines of code. Wrapping all of the core CUDA APIs (driver, runtime, NVTX, JIT compilation of CUDA-C++ and of PTX) took me > 14,000 LoC.
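To illustrate point 2, here's what working at the _driver_ API level looks like with raw `cuda.h` (a minimal sketch; "my_kernel" and the PTX string are placeholders): explicit contexts for isolation, and modules so you can load PTX you JIT-compiled at runtime, none of which the runtime API exposes:

```
#include <cuda.h>

int main() {
    cuInit(0);

    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // Explicit context: the isolation unit the runtime API hides from you.
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // Load a module from PTX produced at runtime (e.g. by NVRTC).
    const char* ptx = ""; // placeholder for real JIT output
    CUmodule mod;
    CUfunction kernel;
    if (cuModuleLoadData(&mod, ptx) == CUDA_SUCCESS &&
        cuModuleGetFunction(&kernel, mod, "my_kernel") == CUDA_SUCCESS) {
        void** args = nullptr; // kernel parameters would go here
        cuLaunchKernel(kernel,
                       1, 1, 1,    // grid dimensions
                       64, 1, 1,   // block dimensions
                       0, nullptr, // shared memory, stream
                       args, nullptr);
        cuCtxSynchronize();
        cuModuleUnload(mod);
    }

    cuCtxDestroy(ctx);
    return 0;
}
```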
What I _really_ like to receive, though, is feedback from using the wrappers, ideas for changes/improvements, and of course messages volunteering to QA new versions before their release :-P
How does this relate to the goals outlined by George Hotz to bring ML to AMD chips and break the Nvidia dominance?
I'm not an expert here, but this approach seems powerful and important. However, the system seems complex enough that I doubt an individual could build it alone. It seems like this would need a corporate sponsor to get off the ground. Perhaps AMD itself would be interested in paying engineers to iterate on this?
Hotz is talking about drivers too, not just user-space libraries.
> The software is terrible! There’s kernel panics in the driver. You have to run a newer kernel than the Ubuntu default to make it remotely stable. I’m still not sure if the driver supports putting two cards in one machine, or if there’s some poorly written global state. When I put the second card in and run an OpenCL program, half the time it kernel panics and you have to reboot.
He also talks about user-space stuff, but clearly he thinks the whole stack, above and below this kind of library, also needs a lot of work.
This is still so mind-boggling to me. AMD should be in a good financial position now that Zen has been such a success and their GPUs are catching up too. Why are their drivers still a clusterfuck across the board after all these years? Why not throw more manpower at the problem?
I'm sure even if their GPUs were twice as fast as Nvidia's, everybody would still buy team green because it's better to have a card that works than a broken piece of garbage. We tried to get an MI50 to work reliably at work with KVM, but that thing was a complete dumpster fire. A colleague just bought a 7900XTX for gaming and spent days getting it to work. This included three Windows reinstalls. And that use case is gaming on Windows, which supposedly is the best supported case. It only gets worse from there. Compute on Linux? Lol.
Now last time this topic came up, someone claimed that AMD is pretty much at their limits production wise, and there are a few unnamed large companies buying loads of their cards for compute and cloud gaming, and AMD basically has engineers dedicated to making sure things work exactly for their use case, so they don't have to really care about the rest. Sounds pretty wild, but not completely unrealistic...
My old nvidia 570's drivers went into severe bitrot. Basic stuff like screensavers and desktops broke badly, and games were flaky. The card is still more than powerful enough for what I used it for.
I switched to AMD, with open source drivers. I get Windows-level performance on AAA (and indie) games in Steam, and zero compatibility issues with the rest of the Linux ecosystem.
AMD GPU support in Linux is a bit of a flip-flop. Some generations seem to get a lot of love and work really damn well (sometimes with better and broader support than the Windows drivers), like RDNA2 and most Polaris cards. Others, such as Vega and especially now RDNA3, are a shitshow with a lot of things just broken.
I used a Vega64 for years and just bought a 6000 series. Both work great on my machines. I'm typically running the bleeding edge kernels though, so that might explain it a bit. I would think Ubuntu is probably the most supported if you opt for the OEM kernel.
I ran a Vega 64 in Linux for 3-4 years, and it was really nice. It also worked without bugs with Proton; the 3060 I have now gives me a lot of artifacts like incorrect lighting, and even X crashes once in a while.
I'm considering switching back to an AMD card due to this.
It really depends on the card. I have an old RDNA workstation card in my server and the driver was a real crapshoot. I eventually started delaying updates to newer kernel releases (which would fix other bugs!) because there would be regressions of various kinds. Graphics under Linux is still a bit painful after all these years.
At least Nvidia finally open sourced their driver too, I guess. And Intel is still open source. But it still sucks a bit I think unless you do research.
Why are you talking about screensavers in a CUDA thread though? You can't compute on a fancy screensaver animation; you need a working CUDA driver, which Nvidia provides.
> And that use case is gaming on Windows, which supposedly is the best supported case.
I’m being a little tongue-in-cheek here, but the best-supported case for AMD is gaming via console: AMD provides the CPU/GPU for the current generation of both the Xbox and PlayStation consoles.
Which suggests to me that they shouldn’t have too much trouble supporting their hardware on Windows or Linux. But that’s outside my area of expertise. Maybe they have to spend too much engineering effort and time supporting the consoles at what’s probably a pretty thin profit margin?
No, really. We've worked with Intel, Nvidia and AMD... well, for the latter we at least tried. We're not a big fish, but response time and quality of responses were stellar with Intel and Nvidia. AMD took weeks, and even when asking very precise questions with lots of technical background, there was a lot of "hmm, dunno, have to find someone who'd know" kind of answers, and it would often take one to two weeks for a single reply. And that's not even dev work; it's just tech support for your own damn stuff you're trying to sell.
You can't seriously tell me that's not something they could fix.
Par for the course with new kernel things: it's unusual for something new in the kernel to be stable in the distro kernels unless they've devoted a great deal of effort to backport things.
The way conservative distros define "stable" is part of the problem. For things less than 3 years old, going for stale versions often runs counter to "stable".
Just in case other people who have an AMD GPU and run Windows have the same needs as I do (that is, to train or run machine learning models), please check out torch-directml and tensorflow-directml.
I'm not sure this really makes any more sense than AMD chasing CUDA compatibility with ROCm/MiOpen/HIP. CUDA and DirectX seem too low level to be used as a compatibility API over widely divergent hardware (AMD vs NVidia) without giving up a lot of performance.
cuDNN, being higher level, offers more opportunity for compatibility without losing performance (i.e. different implementations of kernels fine-tuned for optimal performance on AMD vs NVidia hardware), but the trouble is that so much of what frameworks like PyTorch do is based on custom kernels, not just cuDNN.
It seems the best bet for AMD would be a rock-solid low-level API (not a moving target) plus support for high-level optimizing ML compilers, to reduce the level of effort for the framework vendors (PyTorch, TensorFlow, JAX ...) to provide framework-level support on top of that. Ultimately they'd need to work very closely with the framework vendors to provide this support, since they are the ones who would be benefiting from it.
It's odd how much of an afterthought ML support has seemed to be for AMD over the years... maybe the relative size of the consumer ML market vs graphics/gaming market didn't seem to make it worth their effort, but as NVidia has shown this is a path to gaining much more lucrative data center wins.
How does it work? Last time I tried DirectML it wasn't well supported and there was little software that could use it. Also, the performance seemed not too great. I am currently using a Linux install because with ROCm I can use popular tools like the AUTOMATIC1111 webui and oobabooga.
I trained a WGAN on torch-directml with no issues, so the software seems reasonably well supported. But I can’t speak to performance because I have nothing to compare against.
Does that work? I might be in the market for a new GPU if AMD had something that beats NVidia for ML (for a sane price)... I can't really justify buying an NVidia GPU; anything decent is too expensive.
>>> hipify-clang is a clang-based tool for translating CUDA sources into HIP sources. It translates CUDA source into an abstract syntax tree, which is traversed by transformation matchers. After applying all the matchers, the output HIP source is produced. [...]
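For a sense of what that translation amounts to, here's an illustrative before/after based on the documented CUDA-to-HIP API mapping (not actual hipify-clang output, so details may differ slightly):

```
// CUDA input:
//   #include <cuda_runtime.h>
//   cudaMalloc(&d_buf, bytes);
//   cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
//   scale<<<blocks, threads>>>(d_buf, n);
//   cudaFree(d_buf);

// HIP output: same structure, cuda* entry points mapped to hip*;
// the kernel body and the <<<>>> launch syntax are unchanged.
#include <hip/hip_runtime.h>

__global__ void scale(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

void run(float* h_buf, int n) {
    float* d_buf;
    size_t bytes = n * sizeof(float);
    hipMalloc(&d_buf, bytes);
    hipMemcpy(d_buf, h_buf, bytes, hipMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d_buf, n);
    hipMemcpy(h_buf, d_buf, bytes, hipMemcpyDeviceToHost);
    hipFree(d_buf);
}
```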
I'm not even certain optimisation matters. I can crash my machine (AMD graphics) with a stock Debian install by letting something attempt BLAS on the GPU.
The situation is starting to improve though. Installed a bunch of libraries from https://repo.radeon.com/rocm/apt/5.4 jammy main and the crashes got less frequent. I don't have a lot of faith in AMD to deliver reliable BLAS libraries at this point, but it could happen. The hardware is there, I just don't think they're prioritising supporting the right places in the distribution chain or supporting consumer-level graphics.
I do find it strange that AMD is not allocating more resources to ROCm, given that that seems to be where the money is, at least from my viewpoint. I guess they have been able to sell more cards than they could manufacture, but that seems to be changing.
APIs weren't copyrightable before Oracle v Google. There was plenty of precedent saying that. For example, before they were called Oracle, they built a clone of IBM SEQUEL.
The main concern with Oracle v Google was that the court would ignore or misinterpret the existing precedent.
A secondary concern was that a Google employee had formerly worked on Java at Sun (and/or Oracle) and copy-pasted some implementation source code from Oracle to Google's code bases. There was a real possibility the "APIs aren't copyrightable" precedent would stand, but the courts would rule that Google couldn't continue distributing Dalvik.
> > VUDA only takes kernels in SPIR-V format. VUDA does not provide any support for compiling CUDA C kernels directly to SPIR-V (yet). However, it does not know or care how the SPIR-V source was created - may it be GLSL, HLSL, OpenCL.
So the answer is no, it can't be used with kernels that use cublas or cudnn, which excludes almost all ML use-cases.
Memory management in Vulkan is _very_ restricted. Nothing remotely like UVM. CUDA on Vulkan for these reasons will always stay a pet project at best, with no shot at usable quality whatsoever.
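For comparison, this is the kind of thing that has no Vulkan counterpart: CUDA's managed (unified) memory gives you one pointer that is valid on both host and device, with the driver migrating pages on demand (a minimal sketch):

```
#include <cuda_runtime.h>
#include <cstdio>

__global__ void increment(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1024;
    int* data;

    // One allocation, one pointer, usable from host and device code alike.
    cudaMallocManaged(&data, n * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = i;

    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize(); // wait before touching the data on the host again

    printf("data[0] = %d\n", data[0]); // 1
    cudaFree(data);
    return 0;
}
```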