Yes, but none of these have performance portability across GPU vendors, so it's probably seen as pointless. You'd need an AMD Vulkan shader, an NVIDIA one, an Intel one, etc. It's not like C code on CPUs.
Depending on how many individual tweaks are necessary for hardware variants, of course... but at this level of code and complexity it actually seems pretty reasonable to write 3 or 4 versions of things for different vendors. More work, yes, but not pointless.
Metalibm [1, 2] is a different idea, but kind of related: if you need a particular (trigonometric, exponential, ...) function only with limited accuracy or only on a restricted domain, you can have an approximation generated for exactly those needs.
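To give a flavor of the idea (this isn't Metalibm or its actual interface, just a quick NumPy sketch where a Chebyshev fit stands in for a proper minimax/Remez search):

```python
# Rough illustration of "generate an approximation for your exact needs".
# NOT Metalibm itself -- a NumPy sketch with a Chebyshev fit instead of
# a true minimax search.
import numpy as np
from numpy.polynomial import Chebyshev, Polynomial

def build_approx(f, lo, hi, degree):
    """Fit a degree-`degree` Chebyshev polynomial to f on [lo, hi]."""
    x = np.linspace(lo, hi, 4096)
    poly = Chebyshev.fit(x, f(x), degree, domain=[lo, hi])
    max_err = np.max(np.abs(poly(x) - f(x)))
    return poly, max_err

# Example: sin(x) on [0, pi/4] only. A handful of terms is already enough
# for many float32 use cases, and far cheaper than a full-range,
# correctly-rounded libm sin.
poly, err = build_approx(np.sin, 0.0, np.pi / 4, 5)
print("max abs error:", err)
print("coefficients:", poly.convert(kind=Polynomial).coef)
```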
Apache TVM does something similar for auto-optimization. Last time I checked, it wasn't always a win against OpenVINO (depending on the network and batch size), and it came with a lot of limitations (which may have been lifted since), such as no dynamic batch size.
To me it makes sense to have an interface that can be implemented individually for AMD, Metal, etc. Then, leave it up to the individual manufacturers to implement those interfaces.
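Something like this, roughly (all names are made up, just to illustrate the shape of the idea): frameworks code against one small, neutral interface, and each vendor ships its own implementation behind it.

```python
# Hand-wavy sketch of "one interface, per-vendor implementations".
# ComputeBackend and the method names are invented for illustration.
from abc import ABC, abstractmethod
import numpy as np

class ComputeBackend(ABC):
    """What frameworks would code against; vendors compete on the implementation."""

    @abstractmethod
    def matmul(self, a, b): ...

    @abstractmethod
    def softmax(self, x, axis=-1): ...

class CpuReferenceBackend(ComputeBackend):
    """Plain NumPy fallback, here only so the sketch actually runs."""
    def matmul(self, a, b):
        return np.matmul(a, b)

    def softmax(self, x, axis=-1):
        z = x - np.max(x, axis=axis, keepdims=True)
        e = np.exp(z)
        return e / np.sum(e, axis=axis, keepdims=True)

# The interesting part: NVIDIA would ship a CudaBackend, AMD a RocmBackend,
# Apple a MetalBackend -- all behind the same ComputeBackend interface.
backend = CpuReferenceBackend()
print(backend.softmax(np.array([[1.0, 2.0, 3.0]])))
```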
I'm sitting in an office with a massive number of MacBook Pro (M-series Max) laptops that usually sit idle, and I wish Apple would realize the coup they could achieve if I could also run the typically-NVIDIA workloads on these hefty, yet underutilized, machines.
Apple could unlock so much compute if they gave customers a sort of "Apple@Home" deal: allow Apple to run distributed AI workloads on your mostly-idle, extremely overpowered Word/Excel/VSCode machine, and you get compensation dropped straight onto the credit card linked to your Apple account.
BTW, at our day job, we've been running a "cluster" of M1 Pro/Max machines serving LLMs through Ollama. Corporate rules prevent remote access to the machines, so we created a quick-and-dirty pull system where individual developers pull from a central queue, run the LLM workloads via the local Ollama service, and contribute the results back centrally.
It sounds like a kludge, but introduce enough constraints and you end up with this as the best solution.
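For the curious, the worker side is roughly this shape (paraphrased, not our real code; the queue URL and its /next and /result endpoints are placeholders, while the Ollama call is just its standard local HTTP API):

```python
# Minimal pull-worker sketch: fetch a job, run it on local Ollama, push the result.
import time
import requests

QUEUE_URL = "http://jobs.internal.example"        # hypothetical central queue
OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's local API

def run_one_job() -> bool:
    job = requests.get(f"{QUEUE_URL}/next", timeout=10).json()
    if not job:  # nothing queued right now
        return False
    resp = requests.post(OLLAMA_URL, json={
        "model": job.get("model", "llama3"),
        "prompt": job["prompt"],
        "stream": False,
    }, timeout=600).json()
    requests.post(f"{QUEUE_URL}/result", json={
        "job_id": job["id"],
        "output": resp["response"],
    }, timeout=10)
    return True

if __name__ == "__main__":
    # Developers just leave this running; the laptop pulls work whenever it's free.
    while True:
        if not run_one_job():
            time.sleep(30)
```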
>> Do you have price-performance numbers you can share on that? Like compared against local or cloud machines with RTX and A100 GPUs?
Good question; the accounting is muddy --
1. Electricity is a parent-company responsibility, so while it is a factor in the OpEx price, it isn't a factor for us. I don't think it even gets submetered. Obviously one wouldn't want to abuse this, but maxing out MacBooks doesn't seem close to abuse territory.
2. The M1/M2/M3 machines are already purchased, so while that is major CapEx, it is a sunk cost and also an underutilized resource most of the day. We assume no wear and tear from maxing out the cores; not sure that's a perfect assumption, but it's good enough.
3. Local servers are out of the question at a big company outside of infra groups; it would take years to provision them, and I don't think there's even a way to do it anymore.
The real question is cloud. Cloud with RTX/A100 would be far more expensive, though I'm sure more performant (TPM calculation left to the reader :-). I'd leave those for fine-tuning, not for inference workloads. Non-production inference is particularly bad because you can't easily justify reserved capacity without some constant throughput. If we could mix environments it might make sense to go all-cloud on NVIDIA, but having separate environments with separate compliance requirements makes that hard.
Jokes aside, I think a TPM calculation would be worthwhile; perhaps I can do a quick writeup on this and submit it to HN.
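In the meantime, here's the shape of the back-of-envelope math I have in mind, treating TPM as raw token throughput. Every number below is a placeholder assumption to be swapped for real measurements:

```python
# Back-of-envelope cost-per-token comparison -- all figures are placeholders.
def cost_per_million_tokens(tokens_per_sec: float, hourly_cost_usd: float) -> float:
    """USD per 1M generated tokens at a given throughput and hourly cost."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# Idle MacBook: hardware is sunk cost and electricity is the parent company's,
# so the marginal hourly cost is ~0 for us (plug in your own numbers otherwise).
mac = cost_per_million_tokens(tokens_per_sec=30, hourly_cost_usd=0.0)

# Cloud A100 on-demand: assume ~$4/hr and ~100 tok/s for a comparable model.
a100 = cost_per_million_tokens(tokens_per_sec=100, hourly_cost_usd=4.0)

print(f"MacBook:    ${mac:.2f} / 1M tokens")
print(f"Cloud A100: ${a100:.2f} / 1M tokens  (only if you can keep it busy)")
```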
If Apple were doing an Apple@Home kind of deal, they might actually want to give away some machines for free or super cheap (I realize that doesn't fit their brand) and then get perpetual rights to run compute on them. Kind of like advertising, but it might be doing something actually helpful for someone else.
I think in the future it's possible for homes to have a "compute wall", similar to Tesla's Powerwall. Every home has a Wi-Fi router; why not a compute wall for its needs?
But isn't Vulkan made to run cross-platform? And why can't they write it in DX12 as well? Aren't those made to be more portable while offering more low-level access than previous APIs?
What is stopping you from implementing fast math using compute shaders, or just hacking it together through those interfaces? Or are they just too slow once they go through the API layer? Or is that just a myth that can be worked around if you know you're writing high-performance code? Pardon my ignorance!
From my understanding, using triangles/graphics shaders to do HPC has given way to a more general-purpose GPU programming paradigm, namely CUDA.
Of course this knowledge is superficial and probably outdated, but if I'm not too far off base, it's probably more work to translate a general CUDA-like layer or CUDA libs to OpenCL.
>In practice, OpenCL became a giant mess. Some vendors put speed bumps by not supporting the transition from 2 to 3, or having shitty drivers for it.
Well, Nvidia's stock price in the age of AI indicates how bad a screw-up that was: those vendors are locked out of the market's growth until they play catch-up, and by then Nvidia might have an insurmountable foothold.