LLM Training and Inference with Intel Gaudi 2 AI Accelerators (databricks.com)
36 points by sailplease on Jan 6, 2024 | 8 comments


"Based on these public on-demand quoted prices from AWS and IDC, we found that the IntelR GaudiR 2 has the best training performance-per-dollar, with an average advantage of 4.8x vs the NVIDIA A100-80GB, 4.2x vs. the NVIDIA A100-40GB, and 5.19x vs. the NVIDIA H100"


Seems there's some friction in porting software, as you have to use their build of PyTorch. They claim you just have to change the specified device in `.to(device:str)` statements, but if someone could verify that, it would be appreciated. My experience with porting software to Google's TPUs or AMD GPUs has not been great.
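As a sketch of what that claimed one-line port would look like, assuming Habana's PyTorch build (habana_frameworks) is installed and registers an "hpu" device; I haven't verified this on real Gaudi 2 hardware:

    import torch
    import habana_frameworks.torch.core as htcore  # registers the "hpu" device

    device = torch.device("hpu")                   # instead of "cuda"
    model = torch.nn.Linear(512, 512).to(device)
    x = torch.randn(8, 512, device=device)

    loss = model(x).sum()
    loss.backward()
    htcore.mark_step()  # Gaudi-specific: flushes the lazy-mode graph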


I fully support Nvidia competition. Their monopoly is bad news for a variety of reasons, obviously.

However, as you note, many of these implementations (Intel, AMD, Google TPU, etc.) are more or less at the “get PyTorch to kind of work” stage.

I don’t know of many/any real world applications that are “vanilla” PyTorch at this point.

Stuff like Flash Attention (2), HF Accelerate/Optimum, distributed training implementations, DeepSpeed, custom CUDA kernels all over the place, TensorRT, PyTorch 2 compile, SDPA (scaled dot-product attention), serving frameworks, etc. The software stacks, and the resulting functionality, usability, and performance that CUDA “owns”, are truly endless.
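To take one item from that list: PyTorch 2’s SDPA call is device-agnostic, but the fused Flash Attention backend it dispatches to is a CUDA kernel, so other backends typically fall back to the unfused path. A rough illustration (actual dispatch varies by PyTorch version and backend):

    import torch
    import torch.nn.functional as F

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    # (batch, heads, seq_len, head_dim)
    q = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # Same call everywhere; the fast fused kernels are only selected on CUDA.
    out = F.scaled_dot_product_attention(q, k, v)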

Any real project or implementation I’ve touched in the last year is so intertwined and dependent on CUDA it’s mind blowing and the CUDA lead is only increasing.

With AMD/ROCm as one example: when you finally kind of get things to sort of work, even though the hardware is potentially competitive on paper, the software ecosystem is so far behind that you’re happy to pay the “Nvidia tax”. Not only is CUDA significantly smoother overall, but the endless stacks of optimized software implementations for CUDA make any allegedly comparable implementations run at a fraction of the speed while also burning dev time left and right.

Love or hate Nvidia, the 15-year investment in and dominance of CUDA is very apparent to anyone who’s actually working with this stuff and just trying to get something done.

Again, as you note, it’s interesting to watch observers/casual users claim these implementations are competitive, because in my experience you get even one level deeper and it’s a complete nightmare. I try ROCm every couple of months and end up laughing and/or shaking my head at just how far behind it is (after six years).

I’m really rooting for them, but the reality is these CUDA “competitors” have a very, very long way to go.


PyTorch/TensorFlow etc. are becoming the "OS" for inference and training.

Users interact with PyTorch, not with hardware libraries. So if PyTorch can abstract the hardware, users won't care.

All users will care about is the dollar cost of doing their work, so expect increasing commoditization of the hardware.

Further, almost everyone in the ecosystem has an incentive to commoditize the hardware (users, cloud vendors, etc.). Over time I see the moat eroding, as the moat does not attach directly to the user.
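A sketch of that argument in code: if the backend plugs into PyTorch cleanly, user code only ever sees a device string (the ROCm build of PyTorch even reports itself through torch.cuda, so "cuda" covers both vendors). This is the ideal case, not a claim that every backend behaves this way in practice:

    import torch
    import torch.nn.functional as F

    device = "cuda" if torch.cuda.is_available() else "cpu"  # ROCm builds also answer here
    model = torch.nn.Linear(128, 10).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(32, 128, device=device)
    y = torch.randint(0, 10, (32,), device=device)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    opt.step()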


This is still pretty idealized. In my experience the pytorch abstraction leaks _constantly_. In particular, any interesting ML project probably pulls in at least a few dependencies with custom pytorch extensions somewhere.
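A hypothetical illustration of how that leak typically shows up: a dependency ships a fused CUDA kernel and falls back to plain PyTorch ops elsewhere, so non-CUDA backends either take the slow path or fail outright (the extension and function names below are made up):

    import torch

    try:
        import my_fused_cuda_ext                  # hypothetical custom CUDA extension
        HAS_FUSED = torch.cuda.is_available()
    except ImportError:
        HAS_FUSED = False

    def fused_softmax(x: torch.Tensor) -> torch.Tensor:
        if HAS_FUSED and x.is_cuda:
            return my_fused_cuda_ext.softmax(x)   # fast path: CUDA only
        return torch.softmax(x, dim=-1)           # every other device: fallback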


> Users interact with PyTorch, not with hardware libraries. So if PyTorch can abstract the hardware, users won't care.

At the most basic level, yes (pretty much "hello world"). This is what I meant by "it’s interesting to watch observers/casual users claim these implementations are competitive". Take a look at a project (nearly any project) and you will see plenty of specific commits for ROCm:

https://github.com/search?q=repo%3Ahuggingface%2Ftransformer...

https://github.com/search?q=repo%3AAUTOMATIC1111%2Fstable-di...

https://github.com/search?q=repo%3Avllm-project%2Fvllm+rocm&...

https://github.com/search?q=repo%3Aoobabooga%2Ftext-generati...

https://github.com/search?q=repo%3Amicrosoft%2FDeepSpeed+roc...

Check the dates - ROCm is six years old and all of these commits are /very/ recent.

Only the simplest projects are purely PyTorch, to the point where, other than random curiosities, I'm not sure I've seen one in years.

Check the docs and pay attention to the caveats everywhere for ROCm, with feature-support tables full of asterisks all over the place. Repeat for nearly any project (check the issues and pull requests while you're at it). Do the same for CUDA and you will see just how much hardware-specific and underlying software work is required.

> All users will care about is the dollar cost of doing their work.

Exactly. Check PyTorch issues.

ROCm:

https://github.com/pytorch/pytorch/issues?q=is%3Aissue+rocm

8,548 total issues.

CUDA:

19,692 total issues.

With Nvidia having 90% market share in AI, 80% market share on desktop, and support in torch since day one, those ratios are way off. For now and the foreseeable future, if you're a business (time isn't free), the total cost of an actual solution, from getting running, to training, to actually doing inference (especially at high production scale), very heavily favors Nvidia/CUDA. I've worked in this space for years, and at least once a month since the initial releases of ROCm on Vega in 2017 I check in on AMD/ROCm and can't believe how bad it is. I've spent many thousands of dollars on AMD hardware so that I can continually evaluate it; if ROCm were anywhere close to CUDA in terms of total cost I'd be deploying it. My AMD hardware just sits there, waiting over half a decade for ROCm to be practical.

I don't have some blind fealty to Nvidia, own any stock, or care what logo is stamped on the box. I'm just trying to get stuff done.

> Further, almost everyone in the ecosystem has an incentive to commoditize the hardware (users, cloud vendors, etc.). Over time I see the moat eroding, as the moat does not attach directly to the user.

We're very much in agreement. Your key statement is "over time", and this is what I was referring to with 'I’m really rooting for them, but the reality is these CUDA “competitors” have a very, very long way to go.' It's going to be a while...


I looked at their Intel Developer Cloud and saw the $10.42/hr 8x instance, but there is no individual 1x Gaudi 2 there that I could see. The $1.30/hr could be okay for some inference use cases, though, if it were available. Although for what I was thinking, llama.cpp is not going to work anyway.
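For reference, that $1.30/hr is just the 8x instance price pro-rated per card (my arithmetic, not a listed 1x SKU):

    per_card = 10.42 / 8    # = 1.3025, i.e. roughly $1.30/hr per Gaudi 2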


Kinda funny that instead of NVLink, they're just using (presumably standard) 100GbE as their connector/protocol; I wonder if this also lets you wire up larger and more complex topologies of these cards across servers using normal 100GbE switches.



