
Kubeflow is pretty horrendously bad, unfortunately. Most of the installation docs are incomplete and inaccurate, and since the workflow requires building a separate container for each submitted task (instead of separately specifying a version-control commit), you cannot actually get reproducible results. You'd have to scrape the state of the code out of the container tied to a job, and the circumstances under which that container was built can be any arbitrary, out-of-band changes a developer happened to be working on, such as a branch they never pushed.
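
To make that concrete, the kind of guard rail you want is roughly the following: refuse to build from a dirty tree and tag every training image with the exact commit it was built from, so the image reference on a job tells you which code produced the result. A minimal sketch with plain git + docker, nothing Kubeflow-specific (the registry name is made up):

    # Tag each training image with the exact git commit it was built from,
    # and refuse to build from a dirty working tree.
    import subprocess

    def clean_commit_sha() -> str:
        """Return the current commit SHA, failing if there are uncommitted changes."""
        dirty = subprocess.run(["git", "status", "--porcelain"],
                               capture_output=True, text=True, check=True).stdout.strip()
        if dirty:
            raise RuntimeError("Working tree is dirty; commit and push before building.")
        return subprocess.run(["git", "rev-parse", "HEAD"],
                              capture_output=True, text=True, check=True).stdout.strip()

    def build_and_push(registry: str, image: str) -> str:
        sha = clean_commit_sha()
        ref = f"{registry}/{image}:{sha}"  # immutable, commit-pinned reference
        subprocess.run(["docker", "build", "-t", ref, "."], check=True)
        subprocess.run(["docker", "push", ref], check=True)
        return ref  # submit the job with this reference, not "latest"

    if __name__ == "__main__":
        print(build_and_push("registry.example.com", "ml-train"))  # hypothetical registry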

This workflow also doesn't work well in hybrid on-prem + cloud environments: your model training might run as a cloud Spark task, while your CI pipeline (responsible for building and publishing a container to an on-prem container registry) runs on-prem. Kubeflow has a hard requirement to put containers into cloud container registries, and it makes assumptions about the networking situation, namely that connections between on-prem and cloud container resources are allowed.

I think the industry shifting its focus to Kubeflow is actually a giant mistake.



There are several other platforms/tools taking different approaches to productionizing ML workloads. At Polyaxon [1], we used to create a container for each task, log the git commit for reference, and provide ways to push the container to an in-cluster registry or a cloud registry. In the new version of our platform, we improved the process by injecting the specific git commit into pre-built Docker images, which not only reduces build time but also allows easier integration with external tools such as GitHub Actions (rough sketch below).

[1]: https://github.com/polyaxon/polyaxon
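
Roughly, the injection step looks something like this (a simplified illustration, not our actual implementation; the repo URL, env var names, and entrypoint are placeholders):

    # Sketch of the "inject a commit into a pre-built image" idea.
    # The image already contains the heavy dependencies; at startup we only
    # fetch the code at the exact commit recorded for the run.
    import os
    import subprocess

    REPO_URL = os.environ["GIT_REPO_URL"]   # e.g. https://github.com/org/project.git
    COMMIT = os.environ["GIT_COMMIT"]       # the commit logged for this run
    WORKDIR = "/code"

    subprocess.run(["git", "clone", REPO_URL, WORKDIR], check=True)
    subprocess.run(["git", "-C", WORKDIR, "checkout", COMMIT], check=True)

    # Hand off to the actual entrypoint of the job, e.g. the training script.
    subprocess.run(["python", "train.py"], cwd=WORKDIR, check=True)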


Hey, thanks for the comment. Could you talk about what you see as an alternative...but assuming that k8s (most likely some cloud-managed flavor like EKS, etc.) is what the native devops is based on?

What I'm seeing is that ML/data engineering is diverging from devops reality and building its own orchestration layers, which is impractical for all but the largest orgs.

I've yet to find something that fits in with Kubernetes, which is why it seems everyone here is using fully managed solutions like SageMaker.


I do think managed solutions like Fargate & SageMaker are good choices. Some providers, notably GCP, have no offerings that seriously match them (Cloud Run has too many limitations).

Kubernetes is very poor for workload orchestration for machine learning. It’s ok for simple RPC-like services in which each isolated pod just makes stateless calls to a prediction function and reports a result and a score.

But it's very poor for stateful combinations of ML systems, like task queues or multi-container pod designs where cooperating services need robustness. And it is especially bad for task execution: operating Airflow / Luigi on k8s is horrendous, which is why nearly every ML org I've seen ends up writing its own wrappers around the native k8s Job and CronJob.
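
The wrappers I mean are usually nothing fancier than this kind of thing (a minimal sketch using the official Kubernetes Python client; the namespace, labels, and image are made up, and real wrappers also handle retries, log streaming, and cleanup):

    # Thin wrapper around a batch/v1 Job: submit a commit-pinned training image
    # and label the Job so results can be traced back to the code.
    from kubernetes import client, config

    def submit_training_job(name: str, image: str, command: list[str],
                            git_commit: str, namespace: str = "ml") -> None:
        config.load_kube_config()  # or load_incluster_config() inside the cluster
        job = client.V1Job(
            metadata=client.V1ObjectMeta(
                name=name,
                labels={"git-commit": git_commit},  # traceability back to the code
            ),
            spec=client.V1JobSpec(
                backoff_limit=0,
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        containers=[client.V1Container(name="train", image=image,
                                                       command=command)],
                    ),
                ),
            ),
        )
        client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)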

Kubeflow can be thought of as an attempt to do this in a single, standard manner, but the problem is that there are too many variables in play. No org can live with the limitations and menu choices that Kubeflow enforces, because each has to fit the system into its company's unique observability framework, networking setup, RBAC / IAM policies, etc.

I recommend leveraging a managed cloud solution that takes all that stuff out of the internal datacenter model of operations, moving it off of k8s, and only using systems you have end-to-end control over (e.g., do your own logging, alerting, etc. through vendors & cloud - don't rely on SRE teams to give you a solution, because it almost surely will not work for machine learning workloads).

If you cannot do that because of organizational policy, then create your own operators and custom resources in k8s and write wrappers around your main workload patterns, and do not try to wedge your workloads into something like Kubeflow, TFX / TF Serving, MLflow, etc. You may have occasional workloads that use some of these, but you need to ensure you have wrapped a custom “service boundary” around them at a higher level of abstraction; otherwise you are hamstrung by their (deep-seated) limitations.
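
By "create your own operators and custom resources" I mean roughly the pattern below (a sketch using kopf, assuming a hypothetical TrainingJob CRD under a made-up example.com/v1 group already exists; a real operator also handles updates, deletion, and status reporting):

    # Run with: kopf run trainingjob_operator.py
    import kopf
    from kubernetes import client, config

    try:
        config.load_incluster_config()   # when running inside the cluster
    except config.ConfigException:
        config.load_kube_config()        # when developing locally

    @kopf.on.create('example.com', 'v1', 'trainingjobs')
    def create_training_job(spec, name, namespace, logger, **kwargs):
        """Translate a TrainingJob custom resource into a plain batch/v1 Job."""
        job = client.V1Job(
            metadata=client.V1ObjectMeta(name=f"{name}-runner"),
            spec=client.V1JobSpec(
                backoff_limit=0,
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        containers=[client.V1Container(
                            name="train",
                            image=spec["image"],           # pinned image from the CR
                            command=spec.get("command"),   # optional command override
                        )],
                    ),
                ),
            ),
        )
        client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)
        logger.info("created Job %s-runner", name)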


this was super-brilliant, thank you so much! I wish you would write a blog post on this.


Airflow is a good replacement and works well; it's easy to deploy and easy to add data sources/steps.


Airflow is only a DAG task executor, which hardly scratches the surface of what is needed for managing experiment tracking, telemetry for model training, and ad hoc vs scheduled ML workloads.

Airflow is useful as a component of an ML platform, but even in principle it is only capable of addressing a really, really tiny part of the requirements.

You also need to ensure Airflow can easily provision the required execution environment (e.g. distributed training, multi-GPU training, heavily customized runtime environments).
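
Concretely, "provisioning the execution environment" from Airflow usually ends up meaning something like the sketch below: launching the training step as its own pod with a pinned image and GPU resources. The DAG id, image, and namespace are placeholders, and the exact import path and argument names depend on your Airflow and cncf.kubernetes provider versions:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
    from kubernetes.client import models as k8s

    with DAG(
        dag_id="train_model",
        start_date=datetime(2024, 1, 1),
        schedule=None,          # triggered ad hoc rather than on a schedule
        catchup=False,
    ) as dag:
        train = KubernetesPodOperator(
            task_id="train",
            name="train",
            namespace="ml",
            image="registry.example.com/ml-train:abc123",   # commit-pinned image
            cmds=["python", "train.py"],
            container_resources=k8s.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1", "memory": "16Gi", "cpu": "4"},
            ),
            get_logs=True,
        )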

Overall Airflow isn’t a big part of ML workflows, just a small side tool for a small subset of cases.



