Scaling Kubernetes to 7,500 Nodes (openai.com)
240 points by sillysaurusx on Jan 25, 2021 | hide | past | favorite | 53 comments


A large machine learning job spans many nodes and runs most efficiently when it has access to all of the hardware resources on each node ... So for many of our workloads, a single pod occupies the entire node

Hm, why not just use the underlying nodes then, without Kubernetes?

Is the underlying cloud that bad at scheduling, and are they keeping the VMs warm all the time?

What are they gaining for this indirection? Is it to get a common interface across GCP and other clouds?

Bin-packing or fragmentation is not a common problem

there’s relatively low strain on the scheduler.

That said, strain on the kube-scheduler is spiky


Hi! Co-author here. We do keep the nodes running 24/7, so Kubernetes still provides the scheduling to decide which nodes are free or not at any given time. Generally starting a container on a pre-warmed node is still much much faster than booting a VM. Also, some of our servers are bare-metal.

EDIT: Also don't discount the rest of the Kubernetes ecosystem. It's more than just a scheduler. It provides configuration, secrets management, healthchecks, self-healing, service discovery, ACLs... there are absolutely other ways to solve each of these things. But when starting from scratch there's a wide field of additional questions to answer, problems to solve.


Isn't Kubernetes a pretty lousy scheduler when it doesn't take this into consideration? There are a number of schedulers used in high performance computing that should be able to do a better job.


Yeah exactly... This seems closer to an HPC problem, not a "cloud" problem.

Related comment from 6 months ago about Kubernetes use cases: https://lobste.rs/s/kx1jj4/what_has_your_experience_with_kub...

Summary: scale has at least 2 different meanings. Scaling in resources doesn't really mean you need Kubernetes. Scaling in terms of workload diversity is a better use case for it.

Kubernetes is basically a knockoff of Borg, but Borg is designed (or evolved) to run diverse services (search, maps, gmail, etc.; batch and low latency). Ironically most people who run their own Kube clusters don't seem to have much workload diversity.

On the other hand, HPC is usually about scaling in terms of resources: running a few huge jobs on many nodes. A single job will occupy an entire node (and thousands of nodes), which is what's happening here.

I've never used these HPC systems but it looks like they are starting to run on the cloud. Kubernetes may still have been a defensible choice for other reasons, but as someone who used Borg for a long time, it's weird what it's turned into. Sort of like protobufs now have a weird "reflection service". Huh?

https://aws.amazon.com/blogs/publicsector/tag/htcondor/

https://aws.amazon.com/marketplace/pp/Center-for-High-Throug...


Exactly, we migrated to k8s not because we needed better scaling (ec2 auto scaling groups were working reasonably well for us) but because we kept inventing our own way to do rolling deploys or run scheduled jobs, and had a variety of ways to store secrets. On top of that, developers were increasingly running their own containers with docker-compose to test services talking to each other.

We migrated to k8s to A) have a standard way to run containerized builds and get the benefit of "it works on my laptop" matching how it works in production (at least functionally) and B) establish a common set of patterns for managing deployed software.

Resource scheduling only became of interest after we migrated when we realized the aggregation of our payloads allowed us to use things like spot instances without jeopardizing availability.


Condor and the like are for independent jobs ("throughput computing"), but the authors here are using MPI for tightly-coupled jobs. SLURM and Flux are actively-developed schedulers for these kinds of jobs.

https://slurm.schedmd.com/

https://flux-framework.readthedocs.io/en/latest/
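
For the curious, here's a minimal sketch of what a tightly-coupled job looks like with mpi4py (assumes mpi4py plus an MPI implementation such as OpenMPI or MPICH; launched with something like `mpirun -np 4 python job.py` or via `srun`). Every rank participates in the collective, which is why losing one rank stalls the whole job:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank contributes a partial value; allreduce sums across all ranks.
local_value = rank + 1
total = comm.allreduce(local_value, op=MPI.SUM)

if rank == 0:
    print(f"sum over {size} ranks = {total}")
```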


SLURM hits a nice sweet spot when you have a very traditional cluster: very homogeneous nodes (both hardware and software), standard logins (eg some kind of LDAP/AD), shared NFS files, trusted code. It's an absolute pain when:

- Lots of different kinds of nodes

- anything more complex dependency wise than a handful of shared Conda envs

- anything involving docker

- anything vaguely untrusted

- any kind of partitioning worse than 3 nines e.g. connectivity or uptime instability

- anything more complex than 3-5 priority levels of scheduling

It's great if you hit that niche, but it frankly struggles with the complexities of even moderately heterogeneous workloads.

It's also just a bit dated feeling. Even though kube is complex, it's a joy to work with compared to SLURM. Hashicorp is even better imho.


hmm, I'd like to digress

>- Lots of different kinds of nodes

well, that's not a problem of slurm (which will happily start your process on all nodes), but of typical MPI programming. And once you are running something computationally intensive over multiple nodes today, you are still using MPI.

>- anything more complex dependency wise than a handful of shared Conda envs

you can put whatever dependencies you want on your NFS (or copy them to your node). If you're running on a single node it behaves 100% like running with a special login shell on os XYZ, so I don't know what problems happen with dependencies. The main problem would be that it doesn't include any "service discovery" beyond OpenMPI.

>- anything involving docker

have not used it, but there's enroot/singularity, the first of which is apparently dogfooded at Nvidia. Probably might need some adjustments for base images (because MPI)... As I don't know about the policy within these 5k+ cloud companies: can employees just execute any random image from dockerhub there? This seems a little dangerous...

> anything vaguely untrusted

Linked to the docker case? Does kubernetes reboot nodes then? Slurm can do this. And while classical Slurm use cases definitely require a shared account (because of the shared fs), slurm should afaik merrily execute your programs even without any shared account other than slurm's. You can attack this obviously, but so can you attack kubernetes, and while it gets more scrutiny, it's also a byzantine collection of FANG-style requirements.

EDIT: What you can't work around is Slurm needing a comms channel back to the controller, which you could, though, just firewall off (jobs don't use Slurm to communicate...). As each job can execute a Prolog script, you can even quite simply allow traffic to flow only between allocated nodes.

>- any kind of partitioning worse than 3 nines e.g. connectivity or uptime instability

that's indeed the case

>- anything more complex than 3-5 priority levels of scheduling

what kind of scheduling does kubernetes implement? I guess you could write a plugin for slurm doing that

> It's great if you hit that niche but it frankly struggles with the complexities of even moderately heterogeneous work loads.

Except that your points didn't really pertain to this (apart maybe from the dependencies, if you think about actual service dependencies), I fully agree.


All very good points!

> you can put whatever dependencies you want on your NFS (or copy them to your node).

This is exactly what we do currently. For non-controlled data, this works. However, this gets really thorny when you involve CUI (controlled unclassified information), precisely because of the mentioned shared fs.

Both SLURM and Kube let you write schedulers, but just getting SLURM to talk to the DB was a tough affair; some very poorly documented bugs were at play.

I haven't been on this project in a bit so I don't recall the exact details. And maybe it's a lack of familiarity with SLURM. But I definitely felt hobbled by it. We are probably going to move to something based off of Hashicorp's stuff.


yes, I guess you are still using NFSv3? We (really tiny vs. everyone else here) settled on that as well, because it requires less integration overall. Though if you're going the all-AD route, there's the auks plugin for running with NFSv4 (not sure how long ticket renewal keeps working, though). And you can always just sbcast a zip of your tree and completely forgo the NFS (if you store your data somewhere else). Normally you should also be able to write GRES plugins to "share" these resources.


The problem with slurm is how it's typically used: ssh into a shared login node with a shared file system, authorization is tightly coupled to linux users on that node, submit jobs with sbatch. Kubernetes deployment feels much more modern and safe.

I have worked with containers + slurm, where the vendor libmpi is injected in the container runtime [1] by a hook, which gives you close to bare metal performance with some container goodness in terms of isolation and deployment.

[1] https://github.com/eth-cscs/sarus


Slurm should be the answer but it isn't. In our ML environment, it required ML researchers to understand what is going on (more systems knowledge) and no one liked it. The situation devolved to sshing into machines and running jobs. You are right that slurm is a good fit for HPC ... I just don't think DL workloads are exactly that.

P.S. I also think the K8s scheduler isn't great.


One FAANGUAMLetc engineer told me they use SSH and Slurm, and track experiments by telling their manager which parameters were best the day before. This was very strange given that this company has a machine learning platform, so either this engineer did not use it, or they did not use it that much.

We were talking about our machine learning platform and taking it for a spin. We do have long-running notebook scheduling[0] but we wanted to be able to watch the notebook's output from multiple devices as it was running, and for it to survive closed tabs or network disconnections, not just get the results once it's done. We also wanted to be able to do that right from the notebook's interface, instead of SSH'ing and all that, as this was tedious and some of our users aren't that comfortable doing that.

- [0]: https://iko.ai/docs/notebook/#long-running-notebooks


It may be an HPC problem, but I'm not sure the available solutions come close to k8s in terms of functionality, and I'm not talking about scheduling.

I used to work in HPC/Grid (though it's been a while), but I do remember Condor being clunky even though it had its uses.

And the commercial grid offerings couldn't scale to almost 10k nodes back then (I'm not sure about now, or whether they even exist anymore).


Condor is clunky, but still in use in high energy physics, for example (LHC CMS detector data processing).

For greenfield deployments, I would recommend Hashicorp's Nomad before Kubernetes or Condor if your per-server container intent is ~1 (bare metal with a light hypervisor for orchestration), but still steer you to Kubernetes for microservices and web-based cookie-cutter apps (I know many finance shops using Nomad, but Cloudflare uses it with Consul, so no hard and fast rules).

Disclosure: Worked in HPC space managing a cluster for high energy physics. I also use (free version) Nomad for personal cluster workload scheduling.


I admit that Nomad is a fair middle ground due to its clean DSL and also because of the homogeneity of their workloads.

The team at OpenAI used the k8s API to build extensions around multi-tenancy (across teams) to saturate available allocations, plus task-specific scheduling modifications that were not supported by the k8s scheduler.

I don't know if Nomad has this extensibility. Their plugins were around device plugins and tasks when I last looked at it.


Another pro for Kubernetes is that it has a lot of inertia at the moment, with a large contributing community and a large pool of engineers with experience using it. It's a guess, but I would assume the talent pool for HPC stuff isn't as big.

And yea, I like the ability to easily support a diverse set of workloads on the same cluster. It's a simpler and easier-to-understand architecture compared to my previous experience with Hadoop.


Not sure that's a pro if your use case is just a platform for long-running compute-intensive jobs. The platform's goals may diverge even more from yours in the future, if a cloud provider's use case is the cause of a big rewrite, for example.

A small part of said inertia is perhaps the CADT model of software development in action up close, where functionality can be redeveloped multiple times because someone is not satisfied with the outcome.


> Ironically most people who run their own Kube clusters don't seem to have much workload diversity.

This has not been my experience at all, but most of my clients are big corporations/enterprises. It's not uncommon to have a cluster with hundreds or thousands of different services running, from front-end static file servers to CRUD apps to machine learning. Even the startups I've worked with had at least a handful of different services they ran on K8s.


Can you go into a bit more detail about what you mean by the protobuf "reflection service"?


A better term to search for is "gRPC reflection service". I can't find the link, but I thought I saw people saying that this was idiomatic to use in many cases for Google Cloud, rather than compiling the schemas into the binary.

That feels weird to me because compilation was always the traditional / intended use of protobufs, whereas dynamic reflection was for a few non-critical debugging tools. I guess my point is that Google has a very specific computing environment and set of problems, and when they get exported to the outside world, they start to get warped because of the different set of problems. That feels like what happened with Borg/Kubernetes as well. I seem to see a lot of regrets about Kubernetes lately, from primary authors and operators.
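
For anyone unfamiliar with it, here's a hedged sketch of what turning on the reflection service looks like in Python (grpcio + grpcio-reflection; the service name is a made-up placeholder). Reflection lets clients discover methods at runtime instead of compiling the .proto schema into the client:

```python
from concurrent import futures

import grpc
from grpc_reflection.v1alpha import reflection

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
# "example.MyService" is hypothetical; normally you'd take the name from the
# generated pb2 module's DESCRIPTOR after registering your servicer.
service_names = (
    "example.MyService",
    reflection.SERVICE_NAME,  # the reflection service itself
)
reflection.enable_server_reflection(service_names, server)
server.add_insecure_port("[::]:50051")
server.start()
server.wait_for_termination()
```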


If all you care about is whether a node is in use or not, I think it's fine. You don't need anything complex from the scheduler.


Not to mention it's a well-known skillset that can more easily be hired for, as opposed to "come work on our crazy sauce job scheduler, you'll love it!"


Are you starting from scratch? This architecture seems like a pretty standard HPC deployment with unnecessary containerization involved.


I feel like we solved this problem over a decade ago (if you’re keeping machines warm anyway) with job brokers. Am I somehow mistaken?


> self-healing, service discovery

For a second I read that as self-discovery

Damn kubernetes is some good shit


Well, considering how looking at kubernetes config makes me question the choices I have made in my life that led me into this moment, "self-discovery" is not too far off, I think.


In the HPC space, this is called grid-computing. I've used Slurm (https://slurm.schedmd.com/) in the past with good success.


I'd say that slurm is slightly more challenging to configure than kubernetes, but on the other hand there's pretty much one way to do it, and you're not searching among every other use case to find it.

It is, however, designed for secure networks where you have total control over the physical layer 2. If that gets compromised you can be in a world of hurt.

In a past life I configured and set up slurm in VPCs on Amazon; it was kind of a fun exercise.


True, I had not considered the security aspect. Security is usually considered a non-issue in HPC circles.


It's not a non-issue, it's just a different security model. You need to audit all of the code going into the data center, but once it's there it's (ideally) a hermetically sealed safe space.


I have had the privilege of running a mixed machine learning cluster at almost this scale (5000 nodes). I would never dream of doing it on K8s. It's just not really designed to run at this scale usefully, simply, or cheaply, especially if you have complex jobs (i.e. you have a bunch of dependencies: run this task before that).

On AWS we used Batch. It runs a docker image on a machine and takes care of all the machine prep, scheduling, and error handling. It scales simply, secrets are built in (using Parameter Store), and you can have EFS/Lustre for file access.

You will, however, need to make a job description language, but that's trivial (mine is a 500-line lambda).
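
Not our actual lambda, but a rough sketch of what that glue can look like with boto3's Batch API (queue and job definition names are placeholders; `dependsOn` is what gives you the "run this task before that" ordering):

```python
import boto3

batch = boto3.client("batch")

def submit(name, command, vcpus="4", memory_mib="16384", depends_on=None):
    """Submit one containerized job to AWS Batch."""
    return batch.submit_job(
        jobName=name,
        jobQueue="ml-training-queue",      # placeholder
        jobDefinition="ml-training:3",     # placeholder
        containerOverrides={
            "command": command,
            "resourceRequirements": [
                {"type": "VCPU", "value": vcpus},
                {"type": "MEMORY", "value": memory_mib},
            ],
        },
        dependsOn=depends_on or [],
    )

prep = submit("preprocess", ["python", "preprocess.py"])
submit("train", ["python", "train.py"],
       depends_on=[{"jobId": prep["jobId"]}])
```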

It's not as nice to use as tractor/alfred/deadline, but it works, and doesn't need a team of devops to keep it running optimally.

K8s has some appeal (widespread adoption) but it's still not really an HPC workload system. Its networking scheme is just insane (and chatty as fuck; just look at the graphs: almost 10 gigs in control traffic. On AWS, if you're running in multiple AZs, that's about $150 an hour just in transfer alone).


Fully agree with the comment. Currently running a >500-node bare-metal cluster with GPU nodes in K8s, using Mellanox NICs with 80 Gbit/s connectivity.

The customer is scheduling workloads with Airflow on the K8s farm. Coming from an HPC background, I can confirm your comment. For batch-like workloads, K8s and Airflow are a bad choice regarding farm utilization and workload placement. Classic HPC schedulers like Slurm are way more efficient at using farm resources for this kind of workload.

Also, networking-wise, Calico might be a bad decision for high-speed Infiniband-based networks. The control plane is too chatty and slows down spinning up workloads. Also, do not ever try IP-in-IP encapsulation - it will drag down your NIC speed due to CPU-bound encapsulation.

With Mellanox-class NICs, strive for SDN technology, or use network transport modes with hardware acceleration on the NIC, like VLAN tagging.


I was reading about 'Nomad' from Hashicorp, which would seem to do this too


Disclosure: I work on Google Cloud (and have worked with Ben and Eric in their role at OpenAI).

For folks that find this topic interesting, I think that Maciek’s (GKE PM) talk at NEXT with Twitter on 15,000 node clusters has a ton of great info [1].

Just having lots of nodes is pretty hard, but having lots of nodes and lots of churn is harder. The diagram Maciek puts into the talk and a related blog post is really the key: scale has many dimensions.

For OpenAI, they’re making a reasonable choice as operators of a multi tenant cluster (shared amongst researchers). You want to own the other software on the box, do rolling upgrades, and then get out of the way.

Many other folks though, really need a much more dynamic system. Watch Maciek’s talk!

[1] https://youtu.be/3AAgWBvM5L0


> That said, strain on the kube-scheduler is spiky. A new job may consist of many hundreds of pods all being created at once, then return to a relatively low rate of churn.

Last I checked, the default scheduler places Pods one at a time. It might be advantageous to use a gang/batch scheduler like kube-batch[0], Poseidon[1] or DCM[2].

Edit: looks like they're already investigating that approach --

> We tried a few things needing a custom scheduler, but ran into edge cases that caused conflicts with how normal pods were scheduled. Kubernetes 1.18 introduced a plugin architecture for the core Kubernetes scheduler, making it much easier to add features like this natively. We recently landed on the Coscheduling plugin as a good way to solve this problem.

[0] https://github.com/kubernetes-sigs/kube-batch

[1] https://github.com/kubernetes-sigs/poseidon

[2] https://github.com/vmware/declarative-cluster-management
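
For anyone curious what gang scheduling looks like from the client side, here's a hedged sketch with the official kubernetes Python client: all pods of one job get a shared pod-group label and a non-default scheduler, so they can be placed all-or-nothing. The scheduler name and label key are assumptions (they depend on which coscheduling plugin/operator is installed):

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

JOB, REPLICAS = "train-job-42", 8  # hypothetical job name and size

for i in range(REPLICAS):
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(
            name=f"{JOB}-{i}",
            labels={"pod-group.scheduling.sigs.k8s.io": JOB},  # assumption
        ),
        spec=client.V1PodSpec(
            scheduler_name="coscheduler",  # assumption: non-default scheduler
            containers=[client.V1Container(name="worker", image="train:latest")],
            restart_policy="Never",
        ),
    )
    core.create_namespaced_pod(namespace="default", body=pod)
```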


>Pods communicate directly with one another on their pod IP addresses with MPI via SSH, not service endpoints.

This is super interesting. They are using MPI on Kubernetes for their AI training, I suppose.

So they are not using anything like Kubeflow. Any idea which framework this is?

The state of AI training on Kubernetes is not so hot, and this would be a good learning. There is Ray Distributed today that claims better performance (as well as a better developer experience) than OpenMPI - https://www.usenix.org/system/files/osdi18-moritz.pdf

I wonder why the choices were made as they were.


Hi, co-author here!

We use a pretty standard tech stack of PyTorch + NCCL + MPI. We've used both OpenMPI and MPICH to varying degrees.

Kubeflow is interesting, but it solves a slightly different problem of scheduling/coordinating ML workflows on top of Kube. It doesn't get involved with how an ML job communicates within itself cross-node.
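
For context, the per-worker setup in a PyTorch + NCCL job is roughly the following sketch (not OpenAI's actual code; the launcher, whether MPI or ssh-based, is assumed to provide rank/world-size/master-address environment variables):

```python
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # NCCL handles the GPU collectives
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

t = torch.ones(1, device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # every worker must participate
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: {t.item()}")
dist.destroy_process_group()
```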


Probably OP was referring to the MPIOperator, TFOperator, PytorchOperator, ... They are under the Kubeflow org, but can be deployed independently of Kubeflow itself. Several other projects use those operators to provide abstractions similar to the ones you mentioned in your blog post, e.g. gang scheduling, cross-node communication, ...

One difference is that these operators use the Kubernetes service interface for communication, generally exposing a headless service for each replica.


@benchess - yes, this is what I meant: using the operator framework.

But more generally, MPI over ssh on a large kubernetes deployment is not a very common pattern. Any reason you chose that?

Have you looked at Ray or Torch-Elastic (which seems to be officially supported by AWS, etc as well) https://github.com/pytorch/elastic ?
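
For comparison, a minimal sketch of Ray's programming model (tasks are scheduled dynamically rather than as a fixed MPI communicator; `num_gpus` and the function body are illustrative only):

```python
import ray

ray.init()  # or ray.init(address="auto") to join an existing cluster

@ray.remote(num_gpus=1)
def train_shard(shard_id):
    # placeholder for per-shard work
    return shard_id * 2

results = ray.get([train_shard.remote(i) for i in range(8)])
print(results)
```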


Hi, kube-scheduler maintainer here, currently looking into enabling MPI use cases in k8s.

We started a discussion in https://github.com/kubeflow/mpi-operator/issues/315


MPI over ssh sounds like a bad idea. Typically HPC clusters have a highly optimized vendor libmpi. E.g. Cray has a libmpi.so that is ABI-compatible with MPICH, so you are likely better off replacing the libmpi.so in your container with the vendor version by mounting a bunch of paths from the host. That can be a hook like Nvidia has for their GPU support. One container runtime that does this is Sarus: https://github.com/eth-cscs/sarus


The Prometheus scaling problem is a real one. It stems from the "scrape everything" mantra that everyone in the Prometheus community follows; we should rather be opting in on which metrics are useful and dropping everything by default. We need to go the distributed-tracing way: have smart agents near your metric source that gather everything locally, decide which metrics are useful, and finally ship them off for storage. We are kinda halfway there with OpenTelemetry, but the adoption hasn't started yet. Very curious to see how you guys managed it.
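
A toy sketch of the opt-in idea with prometheus_client: export only an explicitly registered set of metrics from a custom registry instead of the default everything-goes registry (metric name and port are placeholders):

```python
from prometheus_client import CollectorRegistry, Counter, start_http_server

registry = CollectorRegistry()  # empty: nothing is exported unless added here

steps = Counter("training_steps_total", "Optimizer steps completed",
                registry=registry)

start_http_server(9200, registry=registry)  # scrape endpoint on :9200

steps.inc()
# ... the application keeps running and incrementing only opted-in metrics ...
```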


>Pod network traffic shaping

Have you considered EDT-based rate limiting for Pods? This should scale well compared to TBF or HTB. Cilium developers have integrated this natively: https://cilium.io/blog/2020/11/10/cilium-19#bwmanager
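
For reference, the request side of this is just the standard bandwidth annotation on the pod, which Cilium's bandwidth manager then enforces with EDT. A hedged sketch with the kubernetes Python client (image name and rate are placeholders):

```python
from kubernetes import client

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="worker-0",
        annotations={"kubernetes.io/egress-bandwidth": "10G"},  # shaping request
    ),
    spec=client.V1PodSpec(
        containers=[client.V1Container(name="worker", image="train:latest")],
    ),
)
```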


Hi, co-author here. Yes we are excited about the potential of this!


Hi Ben, could I ask the size of your team managing these clusters and development efforts?


>Our biggest jobs run MPI, and all pods within the job are participating in a single MPI communicator. If any of the participating pods die, the entire job halts and needs to be restarted.

Sounds like a strange approach for ML jobs - I mean, you'd expect that all those parallel subtasks aren't each individually interconnected with each other mid-run, and that a failed subtask could easily be restarted on its own. With thousands of subtasks running in parallel, some are bound to fail for whatever reason. Their choice of MPI over HTTP suggests, though, that they pay a premium for latency, and that suggests the subtasks are actively cross-communicating - a typical case for MPI.


That's what makes ML training challenging: the nodes are constantly all-reducing the weights with each other, and the step time is in milliseconds.

That's why you see cloud vendors coming out with really beefy individual hosts with lots of GPUs strapped to them, as datacenter networking sucks for these applications.


This is probably data-parallel training with collective all-reduce (Horovod probably, as they are using MPI). Membership of the ring in Horovod is static - you can't recover from a failed worker. You would need to build a consistent hashing ring (like a DHT), so that workers could identify and agree on failing workers (heartbeats) and evict them. None of those goodies in Horovod yet.

The workaround is to have a chief node do periodic checkpointing of the model weights and epoch/iteration, so that you can recover from the checkpoint if a worker fails.
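
Something like this sketch, assuming Horovod + PyTorch (paths and interval are placeholders; only the rank-0 "chief" writes checkpoints):

```python
import horovod.torch as hvd
import torch

hvd.init()

def maybe_checkpoint(model, optimizer, epoch, step):
    # Only the chief checkpoints, and only every 1000 steps.
    if hvd.rank() == 0 and step % 1000 == 0:
        torch.save(
            {"epoch": epoch, "step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            f"/checkpoints/ckpt-{epoch}-{step}.pt",
        )
```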


When we ran HPC CFD models at university we would just use a head node and SLURM to schedule work on machines. It worked really well. I'm not really sure what benefit Kubernetes adds in this case? If anything I feel like it adds more complexity.

Would love to hear from the OpenAI folks as to why they chose to use Kubernetes for this problem.


Modern HPC schedulers like Slurm are able to run containerized workloads making the problem of portability less painful.

For batch workloads, K8s is not sufficient. You need a workflow engine on top, like Kubeflow or Airflow, to run your jobs. Thus, consider K8s more as cluster operating software, not as a fully-featured workload manager like Slurm.

Furthermore, K8s has so many features you do not need in HPC environments, where high throughput and high utilization are paramount.


From the POV of an HPC cluster user, when using SLURM's `srun` or similar to schedule a job, this now allows you to use `srun --container=<your container>`, and it will start your app on each node inside the container and make sure MPI, GPUs, etc. all work.

If you don't know anything about containers, it probably will be a bit hard to imagine what this buys you, but don't worry, as more clusters start moving towards this model, you'll have to learn about containers at some point.

From the POV of the HPC cluster, it means that the `module` system can be replaced with containers, which can significantly lower the maintenance overhead of the cluster. In a sense, it turns HPC cluster users into HPC cluster maintainers (who have to build their own images, prepare their own environment, etc.).


Gotta hand it to OpenAI. My opinion about them is slowly reversing. I was nervous when they went full API, but CLIP is a fantastic model that they released for free.



