Yes, but none of these have performance portability across GPU vendors, so it's probably seen as pointless. You'd need an AMD Vulkan shader, an NVIDIA one, an Intel one, etc. It's not like C code on CPUs.
Depending on how many individual tweaks are necessary for hardware variants, of course... but at this level of code and complexity it actually seems pretty reasonable to write 3 or 4 versions of things for different vendors. More work, yes, but not pointless.
Metalibm [1, 2] is a different idea, but kind of related: if you need a particular (trigonometric, exponential, ...) function only with limited accuracy or only on a restricted domain, you can have an approximation generated for exactly those needs.
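To give a flavor of the idea (this isn't Metalibm or its actual interface, just a quick NumPy sketch where a Chebyshev fit stands in for a proper minimax/Remez search):

```python
# Rough illustration of "generate an approximation for your exact needs".
# NOT Metalibm itself -- a NumPy sketch with a Chebyshev fit instead of
# a true minimax search.
import numpy as np
from numpy.polynomial import Chebyshev, Polynomial

def build_approx(f, lo, hi, degree):
    """Fit a degree-`degree` Chebyshev polynomial to f on [lo, hi]."""
    x = np.linspace(lo, hi, 4096)
    poly = Chebyshev.fit(x, f(x), degree, domain=[lo, hi])
    max_err = np.max(np.abs(poly(x) - f(x)))
    return poly, max_err

# Example: sin(x) on [0, pi/4] only. A handful of terms is already enough
# for many float32 use cases, and far cheaper than a full-range,
# correctly-rounded libm sin.
poly, err = build_approx(np.sin, 0.0, np.pi / 4, 5)
print("max abs error:", err)
print("coefficients:", poly.convert(kind=Polynomial).coef)
```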
Apache TVM does something similar for auto-optimization. Last time I checked, it wasn't always a win against OpenVINO (depending on the network and batch size), and it came with a lot of limitations (which may have been lifted since), such as no dynamic batch size.
To me it makes sense to have an interface that can be implemented individually for AMD, Metal, etc. Then, leave it up to the individual manufacturers to implement those interfaces.
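Something like this, roughly (all names are made up, just to illustrate the shape of the idea): frameworks code against one small, neutral interface, and each vendor ships its own implementation behind it.

```python
# Hand-wavy sketch of "one interface, per-vendor implementations".
# ComputeBackend and the method names are invented for illustration.
from abc import ABC, abstractmethod
import numpy as np

class ComputeBackend(ABC):
    """What frameworks would code against; vendors compete on the implementation."""

    @abstractmethod
    def matmul(self, a, b): ...

    @abstractmethod
    def softmax(self, x, axis=-1): ...

class CpuReferenceBackend(ComputeBackend):
    """Plain NumPy fallback, here only so the sketch actually runs."""
    def matmul(self, a, b):
        return np.matmul(a, b)

    def softmax(self, x, axis=-1):
        z = x - np.max(x, axis=axis, keepdims=True)
        e = np.exp(z)
        return e / np.sum(e, axis=axis, keepdims=True)

# The interesting part: NVIDIA would ship a CudaBackend, AMD a RocmBackend,
# Apple a MetalBackend -- all behind the same ComputeBackend interface.
backend = CpuReferenceBackend()
print(backend.softmax(np.array([[1.0, 2.0, 3.0]])))
```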
I'm sitting in an office with a massive number of MacBook Pro (M-series Max) laptops that usually sit idle, and I wish Apple would realize the coup they could achieve if I could also run the typically-NVIDIA workloads on these hefty, yet underutilized, machines.
Apple could unlock so much compute if they gave customers a sort of "Apple@Home" deal: allow Apple to run distributed AI workloads on your mostly-idle, extremely overpowered Word/Excel/VSCode machine, and you get compensation dropped straight onto the credit card linked to your Apple account.
BTW, at our day job, we've been running a "cluster" of M1 Pro/Max machines serving LLMs through Ollama. Corporate rules prevent remote access to the machines, so we created a quick-and-dirty pull system where individual developers pull from a central queue, run the LLM workloads via the local Ollama service, and contribute the results back centrally.
It sounds like a kludge, but introduce enough constraints and you end up with this as the best solution.
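For the curious, the worker side is roughly this shape (paraphrased, not our real code; the queue URL and its /next and /result endpoints are placeholders, while the Ollama call is just its standard local HTTP API):

```python
# Minimal pull-worker sketch: fetch a job, run it on local Ollama, push the result.
import time
import requests

QUEUE_URL = "http://jobs.internal.example"        # hypothetical central queue
OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's local API

def run_one_job() -> bool:
    job = requests.get(f"{QUEUE_URL}/next", timeout=10).json()
    if not job:  # nothing queued right now
        return False
    resp = requests.post(OLLAMA_URL, json={
        "model": job.get("model", "llama3"),
        "prompt": job["prompt"],
        "stream": False,
    }, timeout=600).json()
    requests.post(f"{QUEUE_URL}/result", json={
        "job_id": job["id"],
        "output": resp["response"],
    }, timeout=10)
    return True

if __name__ == "__main__":
    # Developers just leave this running; the laptop pulls work whenever it's free.
    while True:
        if not run_one_job():
            time.sleep(30)
```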
>> Do you have price-performance numbers you can share on that? Like compared against local or cloud machines with RTX and A100 GPUs?
Good question; the accounting is muddy --
1. Electricity is a parent-company responsibility, so while it is a factor in the OpEx price, it isn't a factor for us. I don't think it even gets submetered. Obviously one wouldn't want to abuse this, but maxing out MacBooks doesn't seem close to abuse territory.
2. The M1/M2/M3 machines are already purchased, so while that is major CapEx, it is a sunk cost and also an underutilized resource most of the day. We assume no wear and tear from maxing out the cores; not sure that's a perfect assumption, but it's good enough.
3. Local servers are out of the question at a big company outside of infra groups; it would take years to provision them, and I don't think there's even a way to do it anymore.
The real question is cloud. Cloud with RTX/A100 would be far more expensive, though I'm sure more performant (TPM calculation left to the reader :-). I'd leave those for fine-tuning, not for inference workloads. Non-production inference is particularly bad because you can't easily justify reserved capacity without some constant throughput. If we could mix environments it might make sense to go all-cloud on NVIDIA, but having separate environments with separate compliance requirements makes that hard.
Jokes aside, I think a TPM calculation would be worthwhile; perhaps I can do a quick writeup on this and submit it to HN.
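In the meantime, here's the shape of the back-of-envelope math I have in mind, treating TPM as raw token throughput. Every number below is a placeholder assumption to be swapped for real measurements:

```python
# Back-of-envelope cost-per-token comparison -- all figures are placeholders.
def cost_per_million_tokens(tokens_per_sec: float, hourly_cost_usd: float) -> float:
    """USD per 1M generated tokens at a given throughput and hourly cost."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# Idle MacBook: hardware is sunk cost and electricity is the parent company's,
# so the marginal hourly cost is ~0 for us (plug in your own numbers otherwise).
mac = cost_per_million_tokens(tokens_per_sec=30, hourly_cost_usd=0.0)

# Cloud A100 on-demand: assume ~$4/hr and ~100 tok/s for a comparable model.
a100 = cost_per_million_tokens(tokens_per_sec=100, hourly_cost_usd=4.0)

print(f"MacBook:    ${mac:.2f} / 1M tokens")
print(f"Cloud A100: ${a100:.2f} / 1M tokens  (only if you can keep it busy)")
```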
If Apple were doing an Apple@Home kind of deal, they might actually want to give away some machines for free or super cheap (I realize that doesn't fit their brand) and then get perpetual rights to run compute on them. Kind of like advertising, but it might be doing something actually helpful for someone else.
I think in the future it's possible for homes to have a "compute wall", similar to Tesla's Powerwall. Every home has a Wi-Fi router; why not a compute wall for its needs?
But isn't Vulkan made to run cross-platform? And why can't they write it in DX12 as well? Aren't those made to be more portable while offering more low-level access than previous APIs?
What is stopping you from implementing fast math using compute shaders, or just hacking it together through those interfaces? Or are they just too slow once they go through the API layer? Or is that just a myth that can be worked around if you know you're writing high-performance code? Pardon my ignorance!
From my understanding, using triangles/graphics shaders to do HPC has given way to a more general-purpose GPU programming paradigm, namely CUDA.
Of course this knowledge is superficial and probably outdated, but if I'm not too far off base, it's probably more work to translate a general CUDA-like layer or CUDA libs to OpenCL.
>In practice, OpenCL became a giant mess. Some vendors put speed bumps by not supporting the transition from 2 to 3, or having shitty drivers for it.
Well, Nvidia's stock price in the age of AI indicates how bad a screw-up that was: those vendors are locked out of the market's growth until they play catch-up, and by then Nvidia might have an insurmountable foothold.