
Good info! I use an HPC with SLURM. 40k GPUs shared by hundreds of users. It works well enough. I don’t know how the market for cloud-based clusters works. Why didn’t OP use AWS or Google for on-demand training? Is it just down to cost?



If you do, in fact, need H100s, they can be very hard to get. Even with the smaller flavors of A100, you sometimes put in a request, wait days, and then a single node might show up over a weekend. And for the reasons described in the article, plus the fact that large training jobs can be network-limited, a nicer network can be a big deal.



