I do love AMD because its drivers are open source as opposed to nVidia. However, is "less issues" really the case? I sure hope not, for nVidia's sake.
My AMD graphics experience, on APUs and dedicated GPUs, has been plagued with basic issues and random crashes. AMD cards and their driver/firmware issues are without question the IT product that affects my daily life most.
I tried AMD exactly once for Linux graphics; it was so unstable that I bought an NVIDIA card within a month and have never looked back. Potentially NVIDIA's approach of staying as far away as possible from the Linux kernel abstractions is much saner than playing ball with them.
The immense majority of Linux users use these "abstractions". Even on the Steam survey, which is going to be extremely biased in favor of nvidia users, Intel/AMD GPUs are the majority, and if you include the Steam Deck then it's no contest.
Compare the number of issues here: https://wiki.archlinux.org/title/Vulkan. The only issue with the Nvidia driver is that there might be another (open-source) driver installed. AMD has several drivers which fail in different situations. The same has been true for OpenGL implementations: the only truly good implementation is the one by Nvidia, and this is well known in the PC game development industry.
What is this trying to prove? It's just one random page of a publicly editable wiki. AMD/Intel having more entries may just be because it is more popular (which it is indeed), or because (being opensource) it is actually easier to debug+solve problems, and/or a million other reasons.
Frankly, I think all the argument you need is that despite the huge advantage nvidia has on other platforms, it is almost completely reversed (even on steam) for Linux. It conveys very heavily that at least nvidia has a very poor reputation on Linux desktops.
In my experience, AMD still has some massive stability issues with the VBIOS of their GPUs. But it's not a component that users can officially update on their own (whichever version the graphics card comes with when you receive it is what you are stuck with), and most users are not even aware of its existence (even on Windows, the official driver doesn't provide any way to update it, or even check its version). The result is two seemingly similar cards, with the same drivers and firmware, behaving wildly differently in some cases depending on the production batch the card was part of. Ask me how I know.
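For anyone who wants to compare two "identical" cards on Linux, here is a minimal sketch that reads the VBIOS version string from sysfs. It assumes the kernel driver exposes a `vbios_version` attribute under `/sys/class/drm/cardN/device/` (my understanding is that amdgpu does, but treat the path as an assumption and expect an empty result on other drivers). The function name `read_vbios_versions` is made up for illustration.

```python
import os
import re


def read_vbios_versions(drm_root="/sys/class/drm"):
    """Map DRM card names (e.g. 'card0') to their VBIOS version string.

    Assumption: the driver exposes <drm_root>/cardN/device/vbios_version.
    Cards without that attribute are silently skipped.
    """
    versions = {}
    if not os.path.isdir(drm_root):
        return versions
    for entry in sorted(os.listdir(drm_root)):
        # Only plain card nodes; skip connectors such as 'card0-DP-1'.
        if not re.fullmatch(r"card\d+", entry):
            continue
        attr = os.path.join(drm_root, entry, "device", "vbios_version")
        try:
            with open(attr) as f:
                versions[entry] = f.read().strip()
        except OSError:
            # Attribute missing or unreadable on this card/driver.
            continue
    return versions


if __name__ == "__main__":
    found = read_vbios_versions()
    if not found:
        print("no vbios_version attribute found")
    for card, version in found.items():
        print(f"{card}: {version}")
```

Run it on both machines and diff the output; if the strings differ, the two cards did not ship with the same VBIOS.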
From what I remember from the ATI days (aah, sweet memories), it was often the case that either your card worked great and the driver was much more up to date with the latest kernel tech, or the card was a mess to get working under Linux.
With NVidia, you were more sure that your card was going to work, but they were often not implementing certain kernel tech, or they had their own weird way of doing things, which often meant that you had to configure Xorg and some apps with some weird workaround for NVidia.
> I do love AMD because its drivers are open source as opposed to nVidia.
AMD's drivers are not really more open than Nvidia's. Similar to Nvidia's Open GPU Kernel Modules[0], AMD's opensource drivers are mostly a shim that wrap firmware blobs[1] in which the functionality you really care about is contained.
That's still a huge difference though. You get in kernel drivers that properly support wayland and don't require recompiling modules all the time. Plus all the hacks one has to go through to run wlroots based compositors with nvidia.
Normally, firmware does not make the driver itself less open source.
I get it and I'd also like for all components of my devices to be open source. Still, the parts I really care about and have any idea of how to fix stuff in, are open source.
For nvidia devices there is (or was? It's been a while) a whole other set of horrible user-space closed-source stuff.
> AMD's opensource drivers are mostly a shim that wrap firmware blobs
I went and checked the size of AMD's vs NVIDIA's firmware blobs in the linux-firmware package (using https://archlinux.org/packages/core/any/linux-firmware as the reference for what I checked). The two GSP firmware files used for new NV cards make theirs the single largest folder in there (40MiB). Compare that to the largest amdgpu firmware file, which is 392KiB, with the entire folder of 562 items being <20MiB.
Not to say that AMD's firmware is open source, it certainly isn't, but comparing the amount of functionality that could possibly live in the firmware is somewhat laughable.
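If you want to reproduce that comparison locally, a rough sketch follows. It assumes linux-firmware is installed under `/lib/firmware` with `amdgpu` and `nvidia` subfolders; that layout, and the helper name `dir_stats`, are assumptions for illustration. Note that some distros ship the files compressed (e.g. `.zst`), so the numbers will not match the uncompressed sizes quoted above.

```python
import os


def dir_stats(path):
    """Return (total_bytes, file_count, largest_path, largest_bytes) for a tree.

    Symlinks are skipped so aliased firmware files are not counted twice.
    """
    total = 0
    count = 0
    largest_path, largest_size = None, 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            size = os.path.getsize(full)
            total += size
            count += 1
            if size > largest_size:
                largest_path, largest_size = full, size
    return total, count, largest_path, largest_size


if __name__ == "__main__":
    # Assumed install location of the linux-firmware package.
    base = "/lib/firmware"
    for vendor in ("amdgpu", "nvidia"):
        path = os.path.join(base, vendor)
        if not os.path.isdir(path):
            print(f"{vendor}: not found under {base}")
            continue
        total, count, big_path, big_size = dir_stats(path)
        print(f"{vendor}: {total / 2**20:.1f} MiB in {count} files; "
              f"largest is {big_path} ({big_size / 2**10:.0f} KiB)")
```

Total size is only a crude proxy for how much logic lives in the blobs, but it is the same proxy the comparison above uses.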
AMD _has_ open drivers which include not only the kernel parts but also DRM, video acceleration, and mesa/llvm backends. NVIDIA doesn't release any of this.
Are you going to claim that because both have proprietary firmware, it doesn't matter if you are forced to run proprietary software to use it?
Using Mesa is already an order of improvement of openness, and frankly, calling it "a shim which wraps firmware" is a ridiculous thing to say.
As far as I am aware, ROCm itself has been open since its release in 2016. Whatever they were trying to convey will have little impact on ROCm being a valid competitor. Frankly, AMD should just trash ROCm and go all in on Intel's oneAPI. AMD seems helpless on software and reports are Google and Qualcomm have interest in pushing oneAPI.
Is AMD's situation even fixable? Since 2022, if you were doing AI, was there any reason at all to have an AMD GPU over an NVIDIA? If you're just someone who wants to try out some of the cool projects shared on HN alone, you'd have been left out of the conversation entirely for the last 2 years.
The worst part is that they had an internal project that could've enabled CUDA applications to run with ROCm, but they stopped funding it [1]
This is all to say - AMD GPUs make sense if you're a gamer and if they're cheaper than an NVIDIA wherever you buy your GPUs. But otherwise, it feels like buying a SymbianOS phone while Android and iOS were into their second or third generations.
Even if AMD could allow CUDA to run on Radeon GPUs from a technical perspective, how would that work out legally?
It was my understanding that a big part of what makes CUDA so good is all the math libraries that Nvidia wrote and optimized for their GPUs. Are those libraries licensed to run on non-Nvidia GPUs?
To me, it seems like what AMD needs to do is invest heavily in the ROCm ecosystem to bring it up to the level of CUDA. But yeah, it probably would take a long time to catch up.
You have to hand it to Nvidia here. They invested in the GPGPU vision long before it was economically viable. Their long term focus has allowed them to reap incredible profits from AI now and crypto before it.
Contrast this with the Intel approach of halfheartedly trying a new market every couple years and abandoning it after it’s not immediately successful. See Larrabee for the relevant example. But also Optane and their 5G cell modem as other notables.
Running CUDA applications on AMD was always going to be slower. They have slightly different architectures, and CUDA is completely closed source. They would be committing to always being a step behind.
The fraction of people who are doing AI at the CUDA level is far smaller than the fraction of people who are doing it at the TensorFlow or Ollama levels.
I don't know what AMD is planning here, but tweets and a bug tracker are not useful progress. I am not optimistic that this will be meaningful.
That being said, if they start open sourcing the firmware that would potentially be huge. And if they do actually fix some of geohot's bugs that would potentially be significant. There seems to be a small and finite set of bugs on any given AMD card that are blockers. If, by hook or crook, they fix that list of bugs then the card will probably be a lot more useful for machine learning. If they stop with the stupid crashes, everything else seems to be ok. They only need to do BLAS and data transfers reliably.
But knee-jerk bugfixing in response to public pressure is a really bad sign. This still wouldn't be the thoughtful root cause analysis that leads to high quality software. Plan->Do->Check->Act; not Panic->Tweet->Code->Crash.
I wonder if this is what George Hotz was alluding to when he announced tiny corp would be reversing-reversing course and offering AMD boxes as an option.
I'm all for demanding as much open source as possible, but I'm not aware of any big company releasing the source for their firmware. Intel, which has often been hailed as the paragon of open source, is not releasing the firmware of their (less capable) cards, nor are they releasing the microcode for their CPUs (neither is AMD).
Open source drivers with closed source firmware are still much better than the closed source mess we have with Nvidia.
Absolutely. Jeez, it’s amazing they did this much already. My first reaction was more along the lines of “he can’t keep getting away with it!!!” lol, the connection to the demands is undeniable. And yeah they probably can’t open source firmware - there’s tons of licensed proprietary shit in there too I’m sure. If that’s the framing… way to snatch rhetorical defeat from the jaws of victory. I’m serious. Take a massive W here and just accept some things cannot be changed (Dolby, Fraunhofer, HDMI, Broadcom, and Arm are among them). Firmware is hard.
Geohot needs to scope his ambitions for his wacky parallel stack appropriately in that light. He might be able to get access to firmware/microcode if he NDA’d (which again, is insane for something to happen this rapidly already) but I’m guessing he wouldn’t like that either and would see it as the same “leash” or whatever. I get that his job is to be the rhetorical pitbull here and demand they all release it as fast as possible, but honestly I think AMD is relatively on the up-and-up here (although I’m sure there will be some disappointments where AMD won’t open some chunks). He’s winning overall here, he just needs to accept the W (and keep pushing etc).
The reason this whole insane child shtick is getting graced with a response is he’s right. AMD’s gpgpu software stack has never not been turboshit, their opencl runtime (and entire software ecosystem) was turboshit too. AMD isn’t being “culturally open source” on this project, they are working major versions ahead, behind closed doors, and as a result don’t even have a mechanism to accept PRs etc. It’s “libre source available” more like - and this is not how the gpu driver team operates at all. And they’re behind and have a ton of people who want to help, but some parts of it (binary slices…) are pretty much just rotten root-and-branch. They are barely walking forward to some kind of family-level support now, I think, finally. And their software is so broken he can trivially fuzz a bunch of kernel crashes etc out of their demo code, on supported hardware (and fuzzing is another initiative they announced here). Including enterprise hardware, apparently. Like they have zero official hardware listed and this terrible binary slice thing, and they still can’t run stable. And of course there’s no vfio (there was a callout on Reddit recently too, this was announced alongside the other initiatives) and the control processor hangs so you need to hard reset it sometimes to get the gpu back.
People are trying. This is obviously someone who’s trying to run them in a cluster with some uptime level etc. While they debug the ROCm software that deadlocks the kernel. Can’t even just reset it and try something else, it’s a whole-ass circus just to get back to the console.
Fixing these problems is the motivation for asking for the firmware too, I’m sure. And he also seems to want to bypass the whole ROCm and AMD framework entirely and do his own thing. I think targeting sycl or spir-v for bytecode (PTX replacement) that is finally-optimized at runtime sounds like a way more practical mechanism, but hey, let him cook, sometimes you see an idea. It’s probably something he could do if he NDA’d up etc.
Notionally he could target CUDA as well with the same approach, either emit PTX or bytecode directly. Like you can’t use CUDA to build a library for something else, but emitting PTX is something that isn’t blocked, and probably can’t realistically be. It’s just another compiler with a custom software stack in some little DSL or sycl or whatever.
I think he’ll bounce off it tbh, there is a lot of irreducible complexity here (firmware is perpetually patching hardware bugs among other things). I’m not denying there’s a lot of cruft and legacy bugs that probably could be cleaned out etc. It’s just going to be a “much larger than he thinks” problem. If he really wants to actually succeed, figuring out the scope of his idea is important. And again, he could probably at least evaluate some of these options if he NDAs. Acting like a dork until they “read you in”, didn’t realize that worked… But if he does succeed in simplifying a bunch of the SIMT binary pipeline/intercompatibility mess, it’d be a huge win.
Or he could focus on the stuff that everyone can collaborate on, but, that’s not going to be microcode. And just like the RPi foundation - you can notionally have a core team who’s NDA’d and manages basically as thin a shim layer as you can get. But I really don’t think AMD will open microcode/firmware. But I’ve been wrong so far! He just keeps getting away with it.
Those are different paths though!
AMD, for their part, I think finally realizes how badly they’ve fucked up forever. JonChesterfield has always listened patiently to people bitch and I’m sure taken notes, but, they can only do what they can with what management gives them, like anyone else. I think now amd is finally ready to hear it. Employee count in financials indicates they might finally have started staffing up software engineers. They’re saying the right things about open sourcing, and the incentives align for them to deliver, and ideally quickly, because these people are building AMD’s stack for them. I’m sure some pieces can’t be opened etc but they will largely deliver here, I think. And there’s too many eyes watching at this point, and many large customers that will be very very disappointed if AMD doesn’t deliver and then goes back to wallowing in their mess with a stack that completely falls over and constantly gets in the way. That would burn them really badly in the long term - “hey, you didn’t let the community help and you messed it up internally? Gosh maybe the nvidia money pays for something” is a super easy sell and nvidia goes back to owning the market.
That’s why it’s surprising - it’s a pretty direct commitment with financial consequences attached tbh. But he’s right about pretty much all of it. He’s an asshole, but he’s right. The gpgpu stuff has always been a tremendous, poorly-supported mess. Under resourced due to financial problems, yes, but context and understanding doesn’t fix kernel panics.
Looks like geohot is now advertising Team Green and Team Red options with the latter being $10k cheaper.
tinybox green - 6x NVIDIA RTX 4090 ($25k)
You came here to do ML and you want it to just work. You like checking out the latest open source ML stuff on GitHub. gpt-fast. You want your Attention to be CUDA quality FlashAttention. You want FP8. A solid box for the ML researcher. A solid box for the tinygrad developer. A solid box for everyone. It's just a little expensive.
tinybox red - 6x AMD 7900XTX ($15k)
You came here for the hardware.
You want to be a part of the journey. Make AMD great, with or without their help. Sure, it works for ML, mostly. Just maybe you have to reboot it sometimes. tinygrad does work well enough to get AMD on MLPerf at least. You buy the red box because you want to mess with kernel drivers and shader engines. You know what a warp is. You like that you can see the MMIO with umr. RDNA3 is documented! You want to get every bit of perf out of the metal.
The specs are similar to green, except it's $10k cheaper.
Hardware programming manual incomplete? We would have neither Mesa nor the Linux kernel modules without a complete hardware programming manual, errata included.
AMD employs people to write the open source drivers. They have access to internal resources. I think it is possible that the public docs are incomplete.
Code does not replace a hardware programming manual with up-to-date errata. If that is the case, hopefully AMD is not being hypocritical on the matter, namely purposely releasing outdated or incomplete hardware programming manuals/errata to force people to use their code (which could be very, very bad; look at that C++ abomination that is libaddr, which should be plain and simple C99).
"Open sourcing additional portions of our software stack and more hardware documentation."
So, not the whole driver stack, but still a positive step forward in the right direction.
[1] https://x.com/amdradeon/status/1775261152987271614