
We operate a (small?) Airflow instance with ~20 DAGs, but one of those DAGs has ~1k tasks. It runs on a k8s/AWS setup with MySQL backing it.

We package all the code into 1-2 Docker images and then create the DAG. We've faced many issues (logs out of order or missing, random race conditions, random task failures, etc.).

But what annoys me the most is that for that 1 big DAG, the UI is completely useless: the tree view has insane duplication, the graph view is super slow and hard to navigate, and answering basic questions like what exactly failed and which nodes are around it is not easy.


At Airbnb, we were using SubDAGs to try to manage a large number of tasks in a single DAG. This allowed organizing tasks and drilling down into failures more easily, but came with its own challenges.

In more recent versions of Airflow, TaskGroups (https://airflow.apache.org/docs/apache-airflow/stable/concep..., https://www.astronomer.io/guides/task-groups/) were introduced to help with this. Hopefully that helps a bit.
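For illustration, a rough sketch of the TaskGroup approach (DAG and task names here are made up); related tasks collapse into a single expandable node in the graph/grid view:

    import pendulum
    from airflow.decorators import dag, task
    from airflow.utils.task_group import TaskGroup

    @dag(schedule_interval=None, start_date=pendulum.datetime(2022, 1, 1), catchup=False)
    def grouped_pipeline():

        @task
        def extract():
            return [1, 2, 3]

        @task
        def clean(data):
            return [x for x in data if x is not None]

        @task
        def enrich(data):
            return [x * 2 for x in data]

        @task
        def load(data):
            print(data)

        raw = extract()

        # Tasks created inside the context manager are grouped under one
        # collapsible "transform_group" node in the UI.
        with TaskGroup(group_id="transform_group"):
            enriched = enrich(clean(raw))

        load(enriched)

    grouped_pipeline()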

At ~1k nodes in the graph, introspection becomes hard anyway; as others have suggested, breaking it down if possible might be a good idea.


We had a similar DAG that was the result of migrating a single daily Luigi pipeline to Airflow. I started identifying isolated branches and breaking them off, with external task sensors back to the main DAG. This worked, but it's a pain in the ass. My coworker ended up exporting the graph to Graphviz and started identifying clusters of related tasks that way.
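For anyone attempting the same split, this is roughly what the sensor side of a broken-off branch looks like (DAG/task ids here are made up, not our real ones):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.sensors.external_task import ExternalTaskSensor

    with DAG(
        dag_id="reporting_branch",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Wait for a task in the main DAG to finish before running this branch.
        wait_for_main = ExternalTaskSensor(
            task_id="wait_for_main",
            external_dag_id="main_dag",
            external_task_id="publish_results",
            # Assumes both DAGs share the same schedule; if they don't,
            # execution_delta / execution_date_fn has to account for the offset.
            mode="reschedule",
            timeout=60 * 60,
        )

        build_report = BashOperator(
            task_id="build_report",
            bash_command="echo build report",
        )

        wait_for_main >> build_report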


I've not had the best luck with ExternalTaskSensors. There have been some odd errors like execution failing at 22:00:00 every day (despite the external task running fine).


Also, the @task annotation provides no facilities to name tasks. So if you like to build reusable tasks (as I do), you end up with my_generic_task__1, my_generic_task__2, my_generic_task__n. I've tried a few hacks to dynamically rename these, but I just ended up bringing down my entire staging cluster.


`your_task.override(task_id="your_generated_name")` not working for you?
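Roughly like this (task ids made up), inside the DAG definition:

    from airflow.decorators import task

    @task
    def my_generic_task(source):
        print(f"processing {source}")

    # Each call gets an explicit task_id instead of the auto-generated
    # my_generic_task, my_generic_task__1, my_generic_task__2, ...
    load_orders = my_generic_task.override(task_id="load_orders")("orders")
    load_users = my_generic_task.override(task_id="load_users")("users")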


I got pretty excited when I read this response, but no, it doesn't work. I'm not sure how this would work since annotated tasks return an xcom object.

Can you point me to the documentation on this function? It's possible I'm not using it correctly.

I can do something like this, which works locally, but breaks when deployed:

    res = annotated_task_function(...)
    res.operator.task_id = 'manually assigned task id'


    @task.python(task_id="this_is_my_task_name")
    def my_func():
        ...


This still has the problem that, when you call my_func multiple times in the same DAG, the resulting tasks will be labelled my_func, my_func__1, my_func__2, ...


How about the dynamic task mapping that is now available in 2.3?
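Rough sketch of what that looks like (names made up); the mapped instances show up as indexes on a single task rather than as separately named tasks:

    from airflow.decorators import task

    @task
    def process(source):
        print(f"processing {source}")

    # Inside a DAG definition: one task, expanded at runtime into one
    # mapped task instance per input (process[0], process[1], process[2]).
    process.expand(source=["orders", "users", "events"])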


Does this imply that file metadata content can affect the access performance of those files, even for operations that do not directly concern the metadata?


Torvalds (I think):

> Bad programmers worry about the code. Good programmers worry about data structures and their relationships.

The problem with dogmatic OOP is that it mixes data structures and code into one. This leads to hard-to-modify code, unexpected side effects, and poor parallelism. Imperative or FP code does the minimum to define what the data should be shaped like and then has a bunch of functions that either modify it or do some IO with it.

That's a lot easier to trace through and reason about. That in turn leads to more malleable code, which lets you evolve it faster as your understanding of the problem domain evolves.
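A toy Python sketch of what I mean: data is just a shape, behaviour is plain functions over it.

    from dataclasses import dataclass, replace

    # The data: only shape, no behaviour attached.
    @dataclass(frozen=True)
    class Order:
        id: int
        amount: float
        paid: bool = False

    # The behaviour: plain functions that take data and return new data,
    # which makes them easy to trace, test, and run in parallel.
    def mark_paid(order: Order) -> Order:
        return replace(order, paid=True)

    def total(orders) -> float:
        return sum(o.amount for o in orders)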


ddg has the same issue though


Maybe I can offer an answer to your question; I have worked at a couple of companies where we ran "small"-scale k8s clusters (1-100 nodes, as you say).

We chose k8s, and I would again, because it's nice to use. It's not necessarily easier; as you point out, the complexity of managing the cluster is considerable. But if you use a managed cluster like EKS or DO's k8s offering, you don't have to worry too much about the nodes; the unit of worry is the k8s config, and for deployment you can use Docker.

I like Docker because it's nice. It's nice to have the same setup locally as you have remotely.

In my experience the tooling around k8s is nice for managing things declaratively; I never liked working with machines directly, because even tools like Chef or Ansible feel very flimsy.

The other thing you can do is run on ECS or similar, but there the flexibility is a lot lower. So k8s for me offers the sweet spot of being able to do a lot quickly with a nice declarative interface.

I'd be interested to hear your take on how to best run a small cluster though.


Thanks, that's really interesting. Everyone has different challenges and requirements, and of course different experiences.

For smaller setups (say 1-10 services) I'm quite happy with cloud config and one VM per process behind one load balancer per service. It's simple to set up, scale and reproduce. This setup doesn't autoscale, but I've never really felt the need. We use Go and deploy one static binary per service at work with minimal dependencies so docker has never been very interesting. We could redeploy almost all the services we run within minutes if required with no data loss, so that bit feels similar to K8s I imagine.

For even smaller companies (many services at many companies) a single reliable server per service is often fine - it depends of course on things like uptime requirements for that service but not everything is of critical importance and sometimes uptime can be higher with a single untouched service.

I think what I'd worry about with a k8s config which affects live deployments is that I could make a tweak which seemed reasonable in isolation but broke things in inscrutable ways - many outages at big companies seem to be related to config changes nowadays.

With a simpler setup there is less chance of bringing everything down with a config change, because things are relatively static after deploy.


>We use Go and deploy one static binary per service at work with minimal dependencies so docker has never been very interesting.

how do you deploy your static binary to the server? (without much downtime ?)


Sorry that should have said one binary per node really, not per service (though it is one binary per service, just on a few nodes for redundancy and load).

Services sit behind a load balancer, so nodes are replaced and restarted one at a time behind it, and/or you can do graceful restarts. There are a few ways.

They're run as systemd units and of course could restart for other reasons (OS update, crash, OOM, hardware swapped out by the host) - I haven't noticed any problems related to that or to deploys, and I imagine the story is the same for other methods of running services (e.g. Docker). As there is a load balancer, individual nodes going down for a short time doesn't matter much.


> how do you deploy your static binary to the server? (without much downtime ?)

Ask yourself how you would solve this problem if you deployed by hand, and automate that.

1. Create a brain-dead registry that holds information about what runs where (service name, IP address:port, id, git commit, service state, last healthy_at). If you want to go crazy, run it 3x.

2. Have haproxy or nginx use the registry to build a communication map between services.

You are done.

For extra credit (which is nearly cost-free): with 1 you can now build a brain-dead simple control plane by sticking an interface on it that lets someone/something toggle services automatically. For example, if you add a percentage gauge to services, you can do hitless rolling deploys or canary deploys.
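A rough sketch of 1 and 2 (field names and the rendering are illustrative, not a spec):

    import time
    from dataclasses import dataclass, field

    # 1. The brain-dead registry: a table of what runs where.
    @dataclass
    class ServiceInstance:
        service: str
        address: str              # "10.0.0.12:8080"
        instance_id: str
        git_commit: str
        state: str                # "up" / "draining" / "down"
        weight: int = 100         # percentage gauge for rolling/canary deploys
        last_healthy_at: float = field(default_factory=time.time)

    REGISTRY: list = []

    def register(inst: ServiceInstance) -> None:
        REGISTRY.append(inst)

    # 2. Render an nginx upstream block from the registry, so the proxy (not
    #    the services themselves) holds the communication map.
    def nginx_upstream(service: str) -> str:
        lines = [f"upstream {service} {{"]
        for inst in REGISTRY:
            if inst.service == service and inst.state == "up":
                lines.append(f"    server {inst.address} weight={inst.weight};")
        lines.append("}")
        return "\n".join(lines)

Dial an instance's weight down to 0 and reload the proxy and you have a hitless drain; dial a new build's weight up gradually and you have a canary.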


You’re making a great point there about a procedural mindset being applied to array programming. But the thing is, I feel like array-based programming should lend itself naturally to functional approaches. And Pandas does do this to an extent.

My problem is that this is super inconsistent. Some things are done as a method call on an object, others by passing the object to a pandas function and others yet by passing a function to a method on an object. This is the major source of frustration for me.
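Concretely, the three styles side by side (illustrative calls):

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    other = pd.DataFrame({"a": [1, 2], "c": [7, 8]})

    merged = df.merge(other, on="a")            # method call on the object
    stacked = pd.concat([df, other])            # object passed to a pandas function
    shifted = df.apply(lambda col: col + 1)     # function passed to a method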

Maybe there is some logic to that, but I haven't found it yet, and I think that is a sign of bad API design. It's like PHP to me: all nice and documented, but useless without Googling everything.


Do you have an opinion on MSK?


AIUI it's pretty expensive compared to, say, RDS or their managed Redis service? Which makes perfect sense relative to how much of a pain running your own Kafka cluster is.

100% worth it IMO, but it's a lot of upfront cost and you only start to see the benefits when a given flow is Kafka end-to-end and you learn how to use it, so I absolutely get why people are skeptical.


> I got here by taking no risk at all.

> This is the most European thing I've ever heard. Most people would not like to take risks and would not even think about stepping outside of their comfort zone.

In some sense this is true, but let's not get carried away. American startupism is all about the flash: big growth, big impact, big IPO. In the EU you have companies slowly grinding away at a hard problem for 20 years that also have a huge impact but no flashy news. Look at things like vertical farming, ASML, fintech. That's all thanks to low risk and stability.


correct, but the tuition is peanuts compared to the US or, in some cases, places like India.


I mean, I get the same feeling from Apple events, everything is constantly the best ever.


Compared to other companies that present new products with something like "this year we've made a product worse than what we had before"?


Haven't been following Macbook quality much, huh?


I'd reverse the question.

Keyboard design aside (which was fixed a year ago, and wasn't a poor production standards/cheap materials issue, it was a BS make-it-thinner design issue), you mean the best-built laptops in the industry, or the CPUs, battery, etc. that got universal praise (M1)?


The worst possible condemnation of a powerful national government: to compare its normal everyday functioning to a corporate product launch. Orwell was too optimistic.


Eh, it was just a more relatable example. If anything, it shows you that our idea of oppression is Apple overstating their product's capabilities.


The Orwellian part is people talking as though it’s a fact that Apple is overstating something.

As far as I can see that’s bullshit.

They say what they are measuring, and people do check it.

They’d be open to straight up legal action from both customers and shareholders if they were lying.


The Orwells of the future will be writing about China and Western social media (with a bit of Apple thrown in there).


> everything is constantly the best ever.

Not just the best ever, always 3x or 300% better, according to fake benchmarks that nobody can audit. It's beyond marketing; it's basically propaganda.


>Not just the best ever, always 3x or 300% better

Well, that is the new/recent Apple. Steve rarely used technical benchmarks with specific numbers; he only talked about the experience, or about it being faster or better.

Now it is just a more polished keynote, which feels the same as a Google event in that it is prepared by tech people for tech people.


"It's Magic!"


They indicate what they are comparing in the small print, and they do get audited by people like AnandTech or Ars.

So far, every single time, they have been shown to be accurate.


> They indicate what they are comparing in the small print

(small print) * Tested in a very specific condition X that does not reflect at all the everyday usage of 99.9999% of users.

(The tech media) Apple did it again with 300% perf increase! Mad engineering!

That's just the legal department finding a way to make propaganda harder to challenge. It's still propaganda in effect.


The fact that you are using a made up exaggeration doesn’t help your point.

They don’t in fact make exaggerations like that, otherwise you’d be able to quote a real one.


Accurate, with an asterisk (i.e. technically true, but intentionally misleading), which results in headlines like "No, the new Arm MacBook Air is not faster than 98% of PC laptops" when they are "audited".


"No, the new Arm MacBook Air is not faster than 98% of PC laptops"

The piece with that headline is about the only thing that has been audited and found to be false.

Search for commentary anywhere on that piece and you’ll find it has been completely debunked.


There are plenty of cases where you can write a complex program in Rust without ever needing those things.

