
Kubeflow is pretty horrendously bad, unfortunately. Most of the installation docs are incomplete and inaccurate, and since the workflow requires building a separate container for each submitted task (instead of separately specifying a version-control commit), you cannot actually get reproducible results. You'd have to scrape the state of the code out of the container tied to a job, and the circumstances under which that container was built can be any arbitrary, out-of-band changes a developer happened to be working on, such as a branch they never pushed.
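
To make that concrete, the kind of guard rail you want is roughly the following: refuse to build from a dirty tree and tag every training image with the exact commit it was built from, so the image reference on a job tells you which code produced the result. A minimal sketch with plain git + docker, nothing Kubeflow-specific (the registry name is made up):

    # Tag each training image with the exact git commit it was built from,
    # and refuse to build from a dirty working tree.
    import subprocess

    def clean_commit_sha() -> str:
        """Return the current commit SHA, failing if there are uncommitted changes."""
        dirty = subprocess.run(["git", "status", "--porcelain"],
                               capture_output=True, text=True, check=True).stdout.strip()
        if dirty:
            raise RuntimeError("Working tree is dirty; commit and push before building.")
        return subprocess.run(["git", "rev-parse", "HEAD"],
                              capture_output=True, text=True, check=True).stdout.strip()

    def build_and_push(registry: str, image: str) -> str:
        sha = clean_commit_sha()
        ref = f"{registry}/{image}:{sha}"  # immutable, commit-pinned reference
        subprocess.run(["docker", "build", "-t", ref, "."], check=True)
        subprocess.run(["docker", "push", ref], check=True)
        return ref  # submit the job with this reference, not "latest"

    if __name__ == "__main__":
        print(build_and_push("registry.example.com", "ml-train"))  # hypothetical registry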

This workflow also doesn't work well in hybrid on-prem + cloud environments: your model training might run as a cloud Spark task, while your CI pipeline (responsible for building and publishing a container to an on-prem container registry) runs on-prem. Kubeflow has a hard requirement to put containers into cloud container registries, and it makes assumptions about the networking situation, namely that connections between on-prem and cloud container resources are allowed.

I think the industry shifting its focus to Kubeflow is actually a giant mistake.



There are several other platforms/tools taking different approaches to productionizing ML workloads. At Polyaxon [1], we used to create a container for each task, log the git commit for reference, and provide ways to push the container to an in-cluster registry or a cloud registry. In the new version of our platform, we improved the process by injecting the specific git commit into pre-built Docker images, which not only reduces build time but also allows easier integration with external tools such as GitHub Actions (rough sketch below).

[1]: https://github.com/polyaxon/polyaxon
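
Roughly, the injection step looks something like this (a simplified illustration, not our actual implementation; the repo URL, env var names, and entrypoint are placeholders):

    # Sketch of the "inject a commit into a pre-built image" idea.
    # The image already contains the heavy dependencies; at startup we only
    # fetch the code at the exact commit recorded for the run.
    import os
    import subprocess

    REPO_URL = os.environ["GIT_REPO_URL"]   # e.g. https://github.com/org/project.git
    COMMIT = os.environ["GIT_COMMIT"]       # the commit logged for this run
    WORKDIR = "/code"

    subprocess.run(["git", "clone", REPO_URL, WORKDIR], check=True)
    subprocess.run(["git", "-C", WORKDIR, "checkout", COMMIT], check=True)

    # Hand off to the actual entrypoint of the job, e.g. the training script.
    subprocess.run(["python", "train.py"], cwd=WORKDIR, check=True)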


Hey, thanks for the comment. Could you talk about what you see as an alternative...but assuming that k8s (most likely some cloud-managed flavor like EKS, etc.) is what the native devops is based on?

What I'm seeing is that ML/data engineering is diverging from devops reality and building its own orchestration layers, which is impractical for all but the largest orgs.

I've yet to find something that fits in with Kubernetes, which is why it seems everyone here is using fully managed solutions like SageMaker.


I do think managed solutions like Fargate & SageMaker are good choices. Some providers, notably GCP, have no offerings that seriously match them (Cloud Run has too many limitations).

Kubernetes is very poor for workload orchestration for machine learning. It’s ok for simple RPC-like services in which each isolated pod just makes stateless calls to a prediction function and reports a result and a score.

But it's very poor for stateful combinations of ML systems, like task queues or multi-container pod designs where cooperating services need robustness. And it is especially bad for task execution: operating Airflow / Luigi on k8s is horrendous, which is why nearly every ML org I've seen ends up writing its own wrappers around the native k8s Job and CronJob.
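
The wrappers I mean are usually nothing fancier than this kind of thing (a minimal sketch using the official Kubernetes Python client; the namespace, labels, and image are made up, and real wrappers also handle retries, log streaming, and cleanup):

    # Thin wrapper around a batch/v1 Job: submit a commit-pinned training image
    # and label the Job so results can be traced back to the code.
    from kubernetes import client, config

    def submit_training_job(name: str, image: str, command: list[str],
                            git_commit: str, namespace: str = "ml") -> None:
        config.load_kube_config()  # or load_incluster_config() inside the cluster
        job = client.V1Job(
            metadata=client.V1ObjectMeta(
                name=name,
                labels={"git-commit": git_commit},  # traceability back to the code
            ),
            spec=client.V1JobSpec(
                backoff_limit=0,
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        containers=[client.V1Container(name="train", image=image,
                                                       command=command)],
                    ),
                ),
            ),
        )
        client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)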

Kubeflow can be thought of as an attempt to do this in a single, standard manner, but the problem is that there are too many variables in play. No org can live with the limitations and menu choices that Kubeflow enforces, because each has to fit the system into its company's unique observability framework, networking setup, RBAC / IAM policies, etc.

I recommend leveraging a managed cloud solution that takes all that stuff out of the internal datacenter model of operations, moving it off of k8s, and only using systems you have end-to-end control over (e.g., do your own logging, alerting, etc. through vendors & cloud - don't rely on SRE teams to give you a solution, because it almost surely will not work for machine learning workloads).

If you cannot do that because of organizational policy, then create your own operators and custom resources in k8s and write wrappers around your main workload patterns, and do not try to wedge your workloads into something like Kubeflow, TFX / TF Serving, MLflow, etc. You may have occasional workloads that use some of these, but you need to ensure you have wrapped a custom “service boundary” around them at a higher level of abstraction; otherwise you are hamstrung by their (deep-seated) limitations.
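
By "create your own operators and custom resources" I mean roughly the pattern below (a sketch using kopf, assuming a hypothetical TrainingJob CRD under a made-up example.com/v1 group already exists; a real operator also handles updates, deletion, and status reporting):

    # Run with: kopf run trainingjob_operator.py
    import kopf
    from kubernetes import client, config

    try:
        config.load_incluster_config()   # when running inside the cluster
    except config.ConfigException:
        config.load_kube_config()        # when developing locally

    @kopf.on.create('example.com', 'v1', 'trainingjobs')
    def create_training_job(spec, name, namespace, logger, **kwargs):
        """Translate a TrainingJob custom resource into a plain batch/v1 Job."""
        job = client.V1Job(
            metadata=client.V1ObjectMeta(name=f"{name}-runner"),
            spec=client.V1JobSpec(
                backoff_limit=0,
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        containers=[client.V1Container(
                            name="train",
                            image=spec["image"],           # pinned image from the CR
                            command=spec.get("command"),   # optional command override
                        )],
                    ),
                ),
            ),
        )
        client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)
        logger.info("created Job %s-runner", name)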


this was super-brilliant, thank you so much! I wish you would write a blog post on this.


Airflow is a good replacement and works well; it's easy to deploy and easy to add data sources/steps.


Airflow is only a DAG task executor, which hardly scratches the surface of what is needed for managing experiment tracking, telemetry for model training, and ad hoc vs scheduled ML workloads.

Airflow is useful as a component of an ML platform, but even in principle it is only capable of addressing a really, really tiny part of the requirements.

You also need to ensure Airflow can easily provision the required execution environment (e.g. distributed training, multi-GPU training, heavily customized runtime environments).
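
Concretely, "provisioning the execution environment" from Airflow usually ends up meaning something like the sketch below: launching the training step as its own pod with a pinned image and GPU resources. The DAG id, image, and namespace are placeholders, and the exact import path and argument names depend on your Airflow and cncf.kubernetes provider versions:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
    from kubernetes.client import models as k8s

    with DAG(
        dag_id="train_model",
        start_date=datetime(2024, 1, 1),
        schedule=None,          # triggered ad hoc rather than on a schedule
        catchup=False,
    ) as dag:
        train = KubernetesPodOperator(
            task_id="train",
            name="train",
            namespace="ml",
            image="registry.example.com/ml-train:abc123",   # commit-pinned image
            cmds=["python", "train.py"],
            container_resources=k8s.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1", "memory": "16Gi", "cpu": "4"},
            ),
            get_logs=True,
        )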

Overall Airflow isn’t a big part of ML workflows, just a small side tool for a small subset of cases.



