We had to figure this out the hard way, and ended up with this approach (approxi...

mrj · 2025-03-11T01:49:17 1741657757

A lot of this seems like the fault of the ALB, is it? I had the same problem and eventually moved off of it to cloudflare tunnels pointed at service load balancers directly, which changed immediately when pods went bad. With a grace period for normal shutdowns, I haven't seen any downtime for deploys or errors.

The issue with the above setup is (maybe I'm doing it wrong?) but if a pod is removed suddenly, say if it crashes, then some portion of traffic gets errors until the ALB updates. And that can be an agonizingly long time, which seemed because it's pointed at IP addresses in the cluster and not the service. It seemed like a shortcoming of the ALB. GKE doesn't have the same behavior.

I'm not the expert but found something that worked.

paranoidrobot · 2025-03-11T04:24:27 1741667067

> A lot of this seems like the fault of the ALB, is it?

I definitely think the ALB Controller should be taking a more active hand in termination of pods that are targets of an ALB.

But the ALB Controller is exhibiting the same symptom I keep running into throughout Kubernetes.

The amount of "X is a problem because the pod dies too quickly before Y has a chance to clean up/whatever, so we add a preStop sleep of 30 seconds" in the Kubernetes world is truly frustrating.

lucasyvas · 2025-03-11T04:50:40 1741668640

If you are referring the 30 seconds to kill time, that would be holding it wrong. As long as your process is PID 1, you can rig up your own process exit handlers, which completely resolves the problem.

Many people don’t run the main process in the container as PID 1, so this “problem” remains.

If it’s not feasible to remove something like a shell process from being the first thing that runs, exec will allow replacing the shell process with the application process.

paranoidrobot · 2025-03-11T09:52:17 1741686737

> If you are referring the 30 seconds to kill time, that would be holding it wrong. As long as your process is PID 1, you can rig up your own process exit handlers, which completely resolves the problem.

Maybe I am holding it wrong. I'd love not to have to do this work.

But I don't see how being PID 1 or not helps (and yes, for most workloads it is PID 1)

The ALB controller is the one that would need to deregister a target from the target group, and it won't until the pod is gone. So we have to force it by having the app do the functional equivalent with the readiness check.

relistan · 2025-03-11T08:13:28 1741680808

Yeah, exactly. We just catch the TERM, clean up, and then shut down. But the rest of the top post in the thread is right on.

remram · 2025-03-11T14:05:49 1741701949

If I understand correctly, because ALB does its own health checks, you need to catch TERM, wait 30s while returning non-ready for ALB to have time to notice, then clean up and shut down.

kunley · 2025-03-11T08:57:12 1741683432

Kubernetes was written by people who have developer, not ops, background and is full of things like this. The fact that it became a standard is a disaster

gchamonlive · 2025-03-11T12:28:51 1741696131

Maybe, or maybe orchestration and load balancing is hard. I think it's too simplistic to dismiss k8s development because the devs weren't ops.

I don't know of a tool that does a significantly better job at this without having other drawbacks and gotchas, and even if it did it doesn't void the value k8s brings.

I have my own set of gripes with software production engineering in general and specially with k8s, having seen first hand how much effort big corps have to put just to manage a cluster, but it's disrespectful to qualify this whole endeavour as disastrous.

kunley · 2025-03-12T15:38:16 1741793896

Guys who wrote it are ok, they put a lot of effort and that's fine. If I understand things correctly, they were also compensated well. But the effort based on some wrong assumptions makes a product that is flawed. Lot of people then are forced to use it because there is no alternative, or the alternatives are easily dismissed - behavior based in turn on a certain propaganda and marketing. And that part is a disaster. This is not personal, btw.

jfuawdfaw · 2025-03-11T04:49:45 1741668585

> A lot of this seems like the fault of the ALB, is it?

People forget to enable pod readiness gates.

paranoidrobot · 2025-03-11T09:56:45 1741687005

Pod Readiness Gates, unless I'm missing something, only help on startup.

Unless something has changed since I last went digging into this. You will still have the ALB sending traffic to a pod that's in terminating state, unless you do the preStop bits I talked about in the top of the thread.

https://kubernetes-sigs.github.io/aws-load-balancer-controll...

jfuawdfaw · 2025-03-11T16:56:59 1741712219

> Pod Readiness Gates, unless I'm missing something, only help on startup.

Also allows graceful rollout of workload.

> You will still have the ALB sending traffic to a pod that's in terminating state

The controller watches endpoints and will remove your pod from target group on pod deletion.

You don't need the preStop scam as long as your workload respects SIGTERM and does lame-duck.

paranoidrobot · 2025-03-11T21:48:22 1741729702

> You don't need the preStop scam as long as your workload respects SIGTERM and does lame-duck.

Calling it a scam is a bit much.

I think having to put the logic of how the load balancer works into the application is a crossing of concerns. This kind of orchestration does not belong in the app, it belongs in the supporting infrastructure.

The app should not need to know how the load balancer works with regards to scheduling.

The ALB Controller should be doing this. It does not, and so we use preStop until/unless the ALB controller figures it out.

Yes, the app needs to listen for SIGTERM and wait until it's outstanding requests are completed before exiting - but not more than that.

pojzon · 2025-03-11T22:54:11 1741733651

Just curious:

- so if pod goes to terminating state

- with gates enabled, alb controller should remove it from targets instantly coz it listens to k8s api pod changes stream ?

In my experience there was ALWAYS some delay even a small one in High Frequency systems which caused 500s.

Which we solved with internal apigateway, aws+iptables+cni was always causing issues in every setup without it.

whalesalad · 2025-03-10T23:14:45 1741648485

Racing against an ASG/ALB combo is always a horrifying adrenaline rush.

jfuawdfaw · 2025-03-11T04:51:29 1741668689

Nobody should be using ASG's anymore. EKS Auto Mode or Karpenter.

NightMKoder · 2025-03-11T03:27:16 1741663636

Why the additional SUGUSR1 vs just doing those (failing health, sleeping) on SIGTERM?

jchw · 2025-03-11T03:33:08 1741663988

Presumably, because it'd be annoying waiting for lame duck mode when you actually do want the application to terminate quickly. SIGKILL usually needs special privileges/root and doesn't give the application any time to clean-up/flush/etc. The other workaround I've seen is having the application clean-up immediately upon a second signal, which I reckon could also work, but either solution seems reasonable.

paranoidrobot · 2025-03-11T04:20:06 1741666806

Yeah, there were a bunch of reasons.

Using SIGTERM is a problem because it conflicts with other behavior.

For instance, if you use SIGTERM for this then you have a potential for the app quitting during the preStop, which will be detected as a crash by Kube and so restart your app.

chippiewill · 2025-03-11T12:54:31 1741697671

> which will be detected as a crash by Kube and so restart your app.

I don't think kubernetes restarts pods that have been marked for termination

paranoidrobot · 2025-03-11T04:14:04 1741666444

We have a number of concurrent issues.

We don't want to kill in-flight requests - terminating while a request is outstanding will result in clients connected to the ALB getting some HTTP 5xx response.

The AWS ALB Controller inside Kubernetes doesn't give us a nice way to specifically say "deregister this target"

The ALB will continue to send us traffic while we return 'healthy' to it's health checks.

So we need some way to signal the application to stop serving 'healthy' responses to the ALB Health Checks, which will force the ALB to mark us as unhealthy in the target group and stop sending us traffic.

SIGUSR1 was an otherwise unused signal that we can send to the application without impacting how other signals might be handled.

NightMKoder · 2025-03-11T05:00:40 1741669240

So I might be putting words in your mouth, so please correct me if this is wrong. It seems like you don’t actually control the SIGTERM handler code. Otherwise you could just write something like:

  sigterm_handler() {
    make_healthcheck_fail();
    sleep(20);
    stop_web_server();
    exit(0);
  }

Technically the server shutdown at the end doesn’t even need to be graceful in this case.

jfuawdfaw · 2025-03-11T04:38:45 1741667925

Curious, which framework are you using? I've had no issues with NodeJS, Go, and Rust apps directly behind ALB with IP-Target.

paranoidrobot · 2025-03-11T10:02:34 1741687354

I don't think it matters the framework, it's an issue with the ALB controller itself, not the application.

The ALB controller doesn't handle gracefully stopping traffic (by ensuring target group de-registration is complete) before allowing the pod to terminate.

Without a preStop, Kube immediately sends SIGTERM to your application.

nijave · 2025-03-11T09:32:05 1741685525

Istio automates this (at the risk of adding more complexity)

lambdasquirrel · 2025-03-11T10:29:15 1741688955

Or nginx. In both cases it’s probably more expensive than an ALB but you have better integration with the app side, plus traffic mesh benefits if you’re using istio. The caveat is that you are managing your own public-facing nodes.

nijave · 2025-03-18T20:48:48 1742330928

You can have ALB terminate HTTPS with a publically trusted CA cert then relay to Istio.

We do ALB Ingress to Istio Gateways

If you have a CDN, DDoS, or WAF you might be able to terminate there as well

Extra hop tho

chippiewill · 2025-03-11T12:52:27 1741697547

> App handler for SIGUSR1 tells readiness hook to start failing.

Doesn't the kubernetes pod shutdown already mark the pod as not-ready before it calls the pre-stop hook?