
> If you’re OK with used equipment

I'm building a business, not a home lab.

(before you continue to downvote me, read what I wrote below)


So am I, but that makes total sense in your case, providing cloud compute. Doing that without contracts and warranties sounds like a nightmare (I haven’t downvoted you at all; I’ve seen your comments in a few threads and am really interested in what you’re doing). Best of luck, especially on the AMD side. I see a lot of people being skeptical about that, but I think we are very close to them being competitive with Nvidia in this space. I’m pretty much entirely Nvidia at the moment, but I’d love to hack on some MI300X whenever I can get access.


Thank you. Reach out to me when you are ready, or try it out by the hour on something like runpod. We will eventually offer by-the-hour pricing as well; we just are not there yet.


Thanks, I sure will when I have the time. I’ll post something and/or shoot you a message if I figure anything out too :) I have an MI100 still sitting around somewhere that I’ve meant to repair, but that hardware pales in comparison to the new AMD cards…


So? The difference in cost can be as much as 10x. If you're building a startup, that matters.


What matters is support contracts and uptime.

If you have a dozen customers on a server that cannot access things because of an issue, then as a startup, without a whole customer support department, you're literally screwed.

I've been on HN long enough to have seen plenty of companies get complaints after growing too quickly and not being able to handle the issues they run into.

I'm building this business in a way to de-risk things as much as possible: from getting the best equipment I can buy today, to support contracts, to the best data center, to scaling with revenue growth. This isn't a cost issue, it is a long-term viability issue.

Home lab... certainly cut as many corners as you want. Cloud service provider building top supercomputers for rent... not so much. There is a reason why not a lot of people do this... it is extremely capital intensive. That is a huge moat, and getting the relationships and funding to do what I'm doing isn't easy; it took me over 5 years just to get to this point of getting started. I'm not going to blow it all by cutting corners on some used equipment.


> I'm building this business in a way to de-risk things as much as possible

Then why did you go with AMD and not Nvidia? Are you not interested in AI/ML customers?


In my eyes, it is less risky with AMD. When you're rooting for the underdog, they have every incentive to help you. This isn't a battle to "win" all of AI or have one company beat the other; I just need to create a nice, profitable business that solves customer needs.

If I go with Nvidia, then I'm just another one of the 500 other companies doing exactly the same thing.

I'm a firm believer that there should not be a single company that controls all of the compute for AI. It would be like having Cisco be the only company that provides routers for the internet.

Additionally, we are not just AMD. We will run any compute that our customers want us to deploy for them. We are the capex/opex for businesses that don't want to put up the millions, or figure out and deal with all the domain-specific details of deploying this level of compute. The only criterion we have is that it is the best in class available today for each accelerator. For example, I wouldn't deploy H100s because they are essentially old tech now.

> Are you not interested in AI/ML customers?

Read these blog posts and tell me why you'd ask that question...

https://chipsandcheese.com/2024/06/25/testing-amds-giant-mi3...

https://www.nscale.com/blog/nscale-benchmarks-amd-mi300x-gpu...


OK, I just looked at the first blog post: “ROCm is nowhere near where it needs to be to truly compete with CUDA.”

That’s all I need to know as an AI/ML customer.


That is fine. Nobody is pretending that the software side is perfect. What we and AMD are looking for are the early adopters willing to bet on a new (just available in April) class of hardware that is better than previous generations. Which, given the general state of AI today, should be pretty easy to find.


The latest generation generally isn't available used in much quantity. Used 100G equipment is cheap because it's almost ten years old.


Depends on your use case. For AI it’s not much good. Other applications don’t require as much internode bandwidth.


What do you mean not much good for AI? I guess it depends; 10-year-old equipment is not great, but plenty of 7-year-old stuff is excellent. SN2700 switches, for example, are EOL, but that doesn’t matter because you can run mainline Linux on them flawlessly if you enable switchdev (I’ve managed to run FreeBSD too). CX5 and CX6 cards are still used everywhere. I don’t have much experience with Broadcom gear, but I hear there are good options there too, though they tend to require more work to get the stack set up on Linux.
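To illustrate the switchdev point: once the mlxsw driver registers the front-panel ports, they’re ordinary Linux netdevs, so you can drive the ASIC with the same tooling you’d use on any server. A minimal sketch using pyroute2 (the swpN port names follow a common udev convention and are an assumption here; yours may differ):

    # Sketch: bridge three front-panel ports on a switchdev switch.
    # Assumes pyroute2 is installed and mlxsw has registered the ports.
    from pyroute2 import IPRoute

    ip = IPRoute()
    ip.link("add", ifname="br0", kind="bridge")
    br = ip.link_lookup(ifname="br0")[0]
    for port in ("swp1", "swp2", "swp3"):  # hypothetical port names
        idx = ip.link_lookup(ifname=port)[0]
        ip.link("set", index=idx, master=br)  # bridging gets offloaded to the ASIC
        ip.link("set", index=idx, state="up")
    ip.link("set", index=br, state="up")
    ip.close()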


What I mean is that AI training can be bottlenecked by internode communication speed, as GPUs sit idle while weights are swapped around. Other applications (e.g. graphics rendering) are “embarrassingly parallel” and don’t even use the internode communication. Most applications lie somewhere in between.
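Concretely, that sync step looks something like this. A minimal sketch, assuming PyTorch with a NCCL-style backend (RCCL fills that role on AMD) and launch via torchrun; the 1 GiB buffer is an arbitrary stand-in for a model’s gradients:

    import os

    import torch
    import torch.distributed as dist

    # Per-step gradient sync in data-parallel training. torchrun sets
    # RANK/WORLD_SIZE/LOCAL_RANK for us.
    dist.init_process_group(backend="nccl")
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))

    grads = torch.randn(256 * 1024 * 1024, device=device)  # 1 GiB of fp32 "gradients"

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    # Every rank waits here until the collective completes -- this is the
    # internode traffic that idles GPUs when the fabric is too slow.
    dist.all_reduce(grads)  # defaults to SUM
    end.record()
    torch.cuda.synchronize()
    if dist.get_rank() == 0:
        print(f"all_reduce of 1 GiB took {start.elapsed_time(end):.1f} ms")

    dist.destroy_process_group()

Run it on one node and then across two, and the difference is roughly what the fabric costs you. Frameworks hide much of this by overlapping communication with the backward pass, but the bytes still have to move.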


I think as a rule of thumb, latest-generation hardware makes the most sense for cloud compute providers supporting AI. And while yes, training can easily be bottlenecked by internode communication speed, it really depends on the model, the model size, and how you’re doing the training (DDP? FSDP? Custom sharding?). I’ve seen bottlenecks there, but usually on the latency side of things, and you won’t really see an improvement there moving from InfiniBand HDR to NDR. If you’re preparing hardware for a generic cluster with changing or unknown workloads, then yes, max it out and build the most flexible fabric you can. But if you know your model, you can optimize your hardware for it.
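If you want to see which regime you’re in, here’s a rough microbenchmark sketch (same PyTorch/torchrun assumptions as the snippet above; the sizes are arbitrary). Small buffers are dominated by per-collective latency, large ones by fabric bandwidth:

    import os
    import time

    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))

    # Sweep all_reduce sizes: 16 KiB .. 512 MiB of fp32.
    for numel in (1 << 12, 1 << 16, 1 << 20, 1 << 24, 1 << 27):
        buf = torch.randn(numel, device=device)
        for _ in range(5):  # warm-up
            dist.all_reduce(buf)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(20):
            dist.all_reduce(buf)
        torch.cuda.synchronize()
        ms = (time.perf_counter() - t0) / 20 * 1e3
        if dist.get_rank() == 0:
            print(f"{numel * 4 / 2**20:9.2f} MiB  {ms:8.3f} ms per all_reduce")

    dist.destroy_process_group()

If the small sizes barely move between fabric generations while the big ones do, you’re latency-bound, which matches what I’ve usually seen.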


This is one of the many reasons why systems used for mining crypto did not transition into an AI role. The hardware requirements were totally different.

I had a hard time convincing my rather non-technical bosses of this in my previous company.


Yeah, those crypto cards were nerfed. Truly e-waste. I’d love for someone to figure out some innovative ways to use them, maybe with some unhinged hardware mods.


It isn't just the cards, it is the whole chassis. We actually had the cards specifically manufactured from older chips that were just sitting in a warehouse somewhere. They didn't have fans or display ports.

In reality it was everything about the system, though... CPU, PSU, mobo, RAM, disk, switches, cables. It was all focused on ROI, not quality or performance.

I spent years looking for alternative uses for them and came up empty-handed. At one point, we had 20,000 PS5 APU chip blades in production (and another ~30k sitting in boxes). I found a professor who could use them for needle-in-a-haystack searches for quasars. We did some small testing, and if we had been able to find funding to power them for a couple of months, it could have been Nobel-worthy research.

Sadly, the company shut down, I was laid off, and I have no idea what happened to it all.


Wow, thanks for sharing. Truly tragic. I wonder where all the parts are sitting now, hoping it’s not just e-waste, but knowing that likely is the case. A shame about the quasar research, that would have been a wonderful project to read about.


100% e-waste, I'm sure. I also talked to a number of firms that will buy whatever you have, for pennies on the dollar, and then deal with it for you. Apparently things like memory chips can be desoldered and recycled pretty well.

This is another reason why I'm going with Dell these days, they have a program for recycling. Even if it isn't perfect, at least it is something...

https://www.dell.com/en-us/lp/dt/recovery-recycling-services
