We operate a (small?) Airflow instance with ~20 DAGs, but one of those DAGs has ~1k tasks. It runs on a k8s/AWS setup with MySQL backing it.
We package all the code into 1-2 Docker images and then create the DAG. We've faced many issues (logs out of order or missing, random race conditions, random task failures, etc.)
But what annoys me the most is that for that 1 big DAG, the UI is completely useless: the tree view has insane duplication, the graph view is super slow and hard to navigate, and answering basic questions like what exactly failed and what nodes are around it is not easy.
At Airbnb, we were using SubDAGs to try to manage large number of tasks in a single DAG. This allowed organizing tasks and drilling down into failures more easily but came with its own challenges.
We had a similar DAG that was the result of migrating a single daily Luigi pipeline to Airflow. I started identifying isolated branches and breaking them off with external task sensors back to the main DAG. This worked, but it's a pain in the ass. My coworker ended up exporting the graph to graphviz and started identifying clusters of related tasks that way.
I've not had the best luck with ExternalTaskSensors. There have been some odd errors like execution failing at 22:00:00 every day (despite the external task running fine).
Also, the @task annotation provides no facilities to name tasks. So if you like to build reusable tasks (as I do), you end up with my_generic_task__1, my_generic_task__2, my_generic_task__n. I've tried a few hacks to dynamically rename these, but I just ended up bringing down my entire staging cluster.
This still has the problem that, when you call my_func multiple times in the same dag, the resulting tasks will be labelled, my_func, my_func__1, my_func__2, ...
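A hypothetical sketch of the auto-suffixing behavior described above (not Airflow's actual code, just the naming scheme it produces): when the same task id is registered twice, a numeric suffix is appended to keep ids unique.

```python
def unique_task_id(base_name, existing_ids):
    """Return base_name, or base_name__N if it's already taken."""
    if base_name not in existing_ids:
        return base_name
    n = 1
    while f"{base_name}__{n}" in existing_ids:
        n += 1
    return f"{base_name}__{n}"

# Calling the same @task-decorated function three times in one DAG:
ids = set()
for _ in range(3):
    ids.add(unique_task_id("my_generic_task", ids))

print(sorted(ids))
# → ['my_generic_task', 'my_generic_task__1', 'my_generic_task__2']
```

For what it's worth, newer Airflow versions (2.3+, if I remember right) let you rename a @task call site with `my_func.override(task_id="...")`, which may sidestep the suffixing without any dynamic-renaming hacks.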
> Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
The problem with dogmatic OOP is that it mixes data structures and code into one. This leads to hard-to-modify code, unexpected side effects, and poor parallelism. Imperative or FP code does the minimum to define what shape the data should have, and then has a bunch of functions that either transform it or do some IO with it.
That's a lot easier to trace through and reason about. That in turn leads to more malleable code, which lets you evolve it faster as your understanding of the problem domain evolves.
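As a minimal illustration of the point (my own toy example, not from the parent): plain data with zero behavior, and free functions that transform it without mutating anything.

```python
from dataclasses import dataclass, replace

# Plain data: just shape, no behavior attached.
@dataclass(frozen=True)
class Order:
    item: str
    qty: int
    unit_price: float

# Free functions that inspect or transform the data.
def total(order: Order) -> float:
    return order.qty * order.unit_price

def with_qty(order: Order, qty: int) -> Order:
    # Returns a new value instead of mutating in place, so the
    # original stays safe to share (e.g. across threads).
    return replace(order, qty=qty)

o = Order("widget", 2, 9.5)
print(total(o))               # → 19.0
print(total(with_qty(o, 4)))  # → 38.0
```

Because the data is a dumb value and every function's effect is visible at the call site, tracing "what changed this?" is a grep, not an archaeology dig through method overrides.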
Maybe I can offer an answer to your question; I have worked at a couple of companies where we ran "small"-scale k8s clusters (1-100 nodes, as you say).
We chose k8s and I would again, because it's nice to use. It's not necessarily easier; as you point out, the complexity of managing the cluster is considerable. But if you use a managed cluster like EKS or DO's k8s offering, you don't have to worry too much about the nodes: the unit of worry is the k8s config, and then for deployment you can use Docker.
I like Docker, because it's nice. It's nice to have the same setup locally as you have remotely.
In my experience the tooling around k8s is nice to manage declaratively, I never liked working with machines directly because even tools like Chef or Ansible feel very flimsy.
The other thing you can do is run on ECS or similar, but there the flexibility is a lot lower. So k8s for me offers the sweet spot of being able to do a lot quickly with a nice declarative interface.
I'd be interested to hear your take on how to best run a small cluster though.
Thanks, that's really interesting. Everyone has different challenges and requirements, and of course different experiences.
For smaller setups (say 1-10 services) I'm quite happy with cloud config and one VM per process behind one load balancer per service. It's simple to set up, scale and reproduce. This setup doesn't autoscale, but I've never really felt the need. We use Go and deploy one static binary per service at work with minimal dependencies so docker has never been very interesting. We could redeploy almost all the services we run within minutes if required with no data loss, so that bit feels similar to K8s I imagine.
For even smaller companies (many services at many companies) a single reliable server per service is often fine. It depends, of course, on things like uptime requirements for that service, but not everything is of critical importance, and sometimes uptime can be higher with a single untouched server.
I think what I'd worry about with a k8s config which affects live deployments is that I could make a tweak which seemed reasonable in isolation but broke things in inscrutable ways - many outages at big companies seem to be related to config changes nowadays.
With a simpler setup there is less chance of bringing everything down with a config change, because things are relatively static after deploy.
Sorry that should have said one binary per node really, not per service (though it is one binary per service, just on a few nodes for redundancy and load).
Services sit behind a load balancer, so nodes are replaced and restarted one at a time behind it, and/or you can do graceful restarts. There are a few ways.
They're run as systemd units and of course could restart for other reasons (OS Update, crash, OOM, hardware swapped out by host) - haven't noticed any problems related to that or deploys and I imagine the story is the same for other methods of running services (e.g. docker). As there is a load balancer individual nodes going down for a short time doesn't matter much.
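A sketch of what such a unit might look like (hypothetical names and paths, not the poster's actual config), assuming a single static binary and a load balancer handling drains:

```ini
# /etc/systemd/system/myservice.service  (hypothetical)
[Unit]
Description=myservice API
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/myservice
# Restart on crash, OOM kill, etc., as described above
Restart=always
RestartSec=2
User=myservice
# Give the process time to finish in-flight requests on stop
KillSignal=SIGTERM
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target
```

With `Restart=always`, most of the failure modes listed (crash, OOM, host maintenance reboot) resolve themselves, and the load balancer papers over the brief gap.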
> how do you deploy your static binary to the server? (without much downtime ?)
Ask yourself how would you solve this problem if you deployed by hand and automate that.
1. Create a brain-dead registry that holds information about what runs where (service name, ip address:port number, id, git commit, service state, last_healthy_at). If you want to go crazy, run it 3x.
2. Have haproxy or nginx use the registry to build a communication map between services.
You are done.
For extra credit (which is nearly cost-free), with 1. you can now build a brain-dead simple control plane by sticking an interface on it that lets someone/something toggle services automatically. For example, if you add a percentage gauge to services, you can do hitless rolling deploys or canary deploys.
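The registry described above could be sketched like this (an in-memory toy with made-up field values; a real one would persist state, run health checks, and be replicated 3x as suggested):

```python
import time

registry = {}

def register(service, addr, instance_id, commit):
    registry[instance_id] = {
        "service": service,
        "addr": addr,              # "ip:port"
        "id": instance_id,
        "commit": commit,
        "state": "up",
        "weight": 100,             # percentage gauge for rolling/canary deploys
        "last_healthy_at": time.time(),
    }

def set_weight(instance_id, pct):
    # Drain an instance (pct=0) or ramp a canary (pct=5, 25, 100, ...)
    registry[instance_id]["weight"] = pct

def backends(service):
    # This is what haproxy/nginx would consume to build its backend map.
    return [
        (e["addr"], e["weight"])
        for e in registry.values()
        if e["service"] == service and e["state"] == "up" and e["weight"] > 0
    ]

register("api", "10.0.0.1:8080", "api-1", "abc123")
register("api", "10.0.0.2:8080", "api-2", "def456")
set_weight("api-2", 0)   # drain api-2 before swapping its binary
print(backends("api"))   # → [('10.0.0.1:8080', 100)]
```

The "toggle" interface is just `set_weight`: set an instance to 0, replace it, ramp it back up, and the proxy config regenerated from `backends()` does the rest.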
You’re making a great point there about procedural mindset being applied to array programming.
But the thing is, I feel like array based programming should lend itself naturally to functional approaches. And Pandas does do this to an extent.
My problem is that this is super inconsistent. Some things are done as a method call on an object, others by passing the object to a pandas function and others yet by passing a function to a method on an object. This is the major source of frustration for me.
Maybe there is some logic to it, but I haven't found it yet, and I think that is a sign of bad API design. It's like PHP to me: all nice and documented, but useless without Googling everything.
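To illustrate the three call styles I mean, all doing unremarkable things on the same DataFrame (this is just my own minimal example of the inconsistency, not anything canonical):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Style 1: a method on the object
totals = df.sum()                        # column sums: a=3, b=7

# Style 2: a top-level pandas function you pass the object to
combined = pd.concat([df, df])           # 4 rows

# Style 3: a function you pass to a method on the object
doubled = df.apply(lambda col: col * 2)  # every value doubled
```

Three different shapes of invocation for three equally "core" operations, and nothing in the API tells you in advance which shape a given operation will take.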
AIUI it's pretty expensive compared to, say, RDS or their managed Redis service? Which makes perfect sense relative to how much of a pain running your own Kafka cluster is.
100% worth it IMO, but it's a lot of upfront cost and you only start to see the benefits when a given flow is Kafka end-to-end and you learn how to use it, so I absolutely get why people are skeptical.
> I got here by taking no risk at all.
> This is the most European thing I've ever heard. Most people would not like to take risks and would not even think about stepping outside of their comfort zone.
In some sense this is true, but let's not get carried away. American Startupism is all about the flash: big growth, big impact, big IPO.
In the EU you have companies slowly grinding at a hard problem for 20 years that also have huge impact but no flashy news. Look at things like vertical farming, ASML, fintech. That's all thanks to low risk and stability.
Keyboard design aside (which was fixed a year ago, and wasn't a poor production standards/cheap materials issue; it was a BS make-it-thinner design issue), you mean the best-built laptops in the industry, or the CPUs, battery, etc. with universal praise (M1)?
The worst possible condemnation of a powerful national government: to compare its normal everyday functioning to a corporate product launch. Orwell was too optimistic.
Well, that is the new/recent Apple. Steve rarely used these technical benchmarks with specific numbers; he only talked about the experience, or about it being faster, better.
Now it is just a more polished keynote event, which felt the same as a Google event in that it was prepared by tech people for tech people.
Accurate, with an asterisk (i.e. technically true, but intentionally misleading), which results in headlines like "No, the new Arm MacBook Air is not faster than 98% of PC laptops" when they are "audited".