Does anyone know of literature or courses on building machine learning cluster infrastructure? I am mainly interested in building and scaling up the storage infrastructure, the networking, and the scheduling approaches.
Nothing fancy. The core principles are the same; you just adapt them to the kinds of workload changes that ML introduces. For most ML systems:
1. Storage infra: assuming you mean storage for models or even training data, use any blob storage like S3, or a shared network file system like EFS, Lustre, etc. (tiny example of the blob-storage option after this list).
2. Networking: if you're talking about interconnects for large GPU clusters, I'm not aware of any definitive resource on this.
3. Scheduling: this is honestly a solved problem at this point; almost anything works. Write your own coordinator that periodically runs Docker-image-based jobs (you can hook one up quite quickly with a metadata store and message-queue-driven triggers; rough sketch after this list), use Airflow, or use AWS Batch for large-scale jobs.
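For the storage point, all I mean is treating the object store as the system of record for model artifacts. A minimal sketch with boto3; the bucket name and key layout here are made up:

```python
# Minimal sketch: checkpoints in and out of S3 via boto3.
# "my-training-bucket" and the key layout are placeholders.
import boto3

s3 = boto3.client("s3")

def save_checkpoint(local_path: str, step: int) -> None:
    # Push a training checkpoint to the bucket, versioned by step number.
    s3.upload_file(local_path, "my-training-bucket",
                   f"checkpoints/step-{step}/model.pt")

def load_checkpoint(step: int, local_path: str) -> None:
    # Pull a checkpoint back down, e.g. to resume training or to serve from.
    s3.download_file("my-training-bucket",
                     f"checkpoints/step-{step}/model.pt", local_path)
```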
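And for the roll-your-own scheduler option, the whole coordinator can be as small as a loop that blocks on a queue and launches containers. Rough sketch, assuming Redis as the queue and the Docker SDK; the queue name, image, and job-spec shape are all made up:

```python
# Rough sketch of a DIY coordinator: block on a Redis list for job metadata,
# then run each job as a detached Docker container (FIFO, no priorities).
import json
import docker
import redis

queue = redis.Redis(host="localhost", port=6379)
client = docker.from_env()

def run_forever(queue_name: str = "ml-jobs") -> None:
    while True:
        # BLPOP blocks until a message arrives; payload is a JSON job spec.
        _, payload = queue.blpop(queue_name)
        job = json.loads(payload)
        client.containers.run(
            image=job["image"],             # e.g. "my-registry/train:latest"
            command=job.get("command"),     # optional entrypoint override
            environment=job.get("env", {}), # hyperparameters, data paths, etc.
            detach=True,
        )

if __name__ == "__main__":
    run_forever()
```

Anything more (retries, priorities, GPU-aware placement) is where you'd reach for Airflow, AWS Batch, or a Kubernetes scheduler instead.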
You missed model serving (I think?). It's tough and latency-sensitive, especially for recommender systems, and prone to latency spikes and traffic spikes. Even with well-written Python code you can run into limitations quite quickly.
Well, right now I'm seeing lots of low-level innovation in networking/storage: RoCE, InfiniBand, Tesla's TTPoE, the recent addition of devmem-tcp to the Linux kernel (https://docs.kernel.org/networking/devmem.html). I wondered whether there are approaches to plugging something like that together at a higher level, and what the considerations are. I assume EFS or S3 might be too expensive for a (large) training infrastructure, but I could be wrong?
> You missed model serving (I think?).
I think I have a better grasp of the engineering challenges there and could imagine an architecture for scaling that out (I believe!).