We had to figure this out the hard way, and ended up with this approach (approximately).
K8S provides two (well, three now, counting the startup probe) health checks.
How this interacts with ALB is quite important.
Liveness should always return 200 OK unless you have hit some fatal condition where your container considers itself dead and wants to be restarted.
Readiness should only return 200 OK if you are ready to serve traffic.
We configure the ALB health check to point only at the readiness endpoint.
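A minimal sketch of the two endpoints, assuming a Go HTTP service (the paths, port, and `ready` flag are illustrative, not anything K8S or the ALB mandates):

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// ready starts false and flips to true once internal startup checks pass.
var ready atomic.Bool

func main() {
	// Liveness: always 200 unless the process considers itself dead.
	http.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: 200 only while we want traffic. This is the endpoint
	// the ALB target group health check is pointed at.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	http.ListenAndServe(":8080", nil)
}
```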
So our application lifecycle looks like this:
* Container starts
* Application loads
* Liveness begins serving 200
* Some internal health checks run and set readiness state to True
* Readiness checks now return 200
* ALB checks begin passing and so pod is added to the target group
* Pod starts getting traffic.
time passes. Eventually for some reason the pod needs to shut down.
* Kube calls the preStop hook
* PreStop sends SIGUSR1 to app and waits for N seconds.
* App handler for SIGUSR1 tells the readiness check to start failing.
* ALB health checks begin failing, and no new requests should be sent.
* ALB takes the pod out of the target group.
* PreStop hook finishes waiting and returns
* Kube sends SIGTERM
* App wraps up any remaining in-flight requests and shuts down.
This allows the app to do a graceful shutdown, and ensures the ALB doesn't send traffic to a pod that knows it is being shut down.
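A rough sketch of the app side of that shutdown sequence, again in Go (the readiness endpoint matches the sketch above; the port and the drain timeout are illustrative):

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

// ready backs the readiness endpoint; SIGUSR1 flips it off so the
// ALB health checks start failing and the target drains.
var ready atomic.Bool

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	srv := &http.Server{Addr: ":8080", Handler: mux}

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1, syscall.SIGTERM)
	go func() {
		for sig := range sigs {
			switch sig {
			case syscall.SIGUSR1:
				// Sent by the preStop hook: start failing readiness,
				// but keep serving in-flight and new requests for now.
				ready.Store(false)
			case syscall.SIGTERM:
				// preStop has finished waiting: finish in-flight
				// requests (the 30s cap is illustrative) and exit.
				ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
				srv.Shutdown(ctx)
				cancel()
				return
			}
		}
	}()

	// ... internal startup checks would run here, then:
	ready.Store(true)
	srv.ListenAndServe()
}
```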
Oh, and on the Readiness check - your app can use this to (temporarily) signal that it is too busy to serve more traffic. Handy as another signal you can monitor for scaling.
A lot of this seems like the fault of the ALB, is it? I had the same problem and eventually moved off of it to Cloudflare Tunnels pointed directly at service load balancers, which update immediately when pods go bad. With a grace period for normal shutdowns, I haven't seen any downtime or errors from deploys.
The issue with the above setup (maybe I'm doing it wrong?) is that if a pod is removed suddenly, say if it crashes, then some portion of traffic gets errors until the ALB updates. And that can be an agonizingly long time, which seems to be because it's pointed at IP addresses in the cluster and not the service. It seemed like a shortcoming of the ALB. GKE doesn't have the same behavior.
I'm not the expert but found something that worked.
> A lot of this seems like the fault of the ALB, is it?
I definitely think the ALB Controller should be taking a more active hand in termination of pods that are targets of an ALB.
But the ALB Controller is exhibiting the same symptom I keep running into throughout Kubernetes.
The amount of "X is a problem because the pod dies too quickly before Y has a chance to clean up/whatever, so we add a preStop sleep of 30 seconds" in the Kubernetes world is truly frustrating.
If you are referring to the 30-second kill time, that would be holding it wrong. As long as your process is PID 1, you can rig up your own process exit handlers, which completely resolves the problem.
Many people don’t run the main process in the container as PID 1, so this “problem” remains.
If it’s not feasible to remove something like a shell process from being the first thing that runs, exec will allow replacing the shell process with the application process.
> If you are referring to the 30-second kill time, that would be holding it wrong. As long as your process is PID 1, you can rig up your own process exit handlers, which completely resolves the problem.
Maybe I am holding it wrong. I'd love not to have to do this work.
But I don't see how being PID 1 or not helps (and yes, for most workloads it is PID 1)
The ALB controller is the one that would need to deregister a target from the target group, and it won't until the pod is gone. So we have to force it by having the app do the functional equivalent with the readiness check.
If I understand correctly, because ALB does its own health checks, you need to catch TERM, wait 30s while returning non-ready for ALB to have time to notice, then clean up and shut down.
Kubernetes was written by people who have a developer, not ops, background, and it is full of things like this. The fact that it became a standard is a disaster.
Maybe, or maybe orchestration and load balancing is hard. I think it's too simplistic to dismiss k8s development because the devs weren't ops.
I don't know of a tool that does a significantly better job at this without having other drawbacks and gotchas, and even if it did it doesn't void the value k8s brings.
I have my own set of gripes with software production engineering in general and especially with k8s, having seen first hand how much effort big corps have to put in just to manage a cluster, but it's disrespectful to qualify this whole endeavour as disastrous.
The guys who wrote it are OK; they put in a lot of effort and that's fine. If I understand things correctly, they were also compensated well. But effort based on some wrong assumptions makes a flawed product. A lot of people are then forced to use it because there is no alternative, or the alternatives are easily dismissed - behavior based in turn on a certain amount of propaganda and marketing. And that part is a disaster. This is not personal, btw.
Pod Readiness Gates, unless I'm missing something, only help on startup.
Unless something has changed since I last went digging into this, you will still have the ALB sending traffic to a pod that's in the terminating state, unless you do the preStop bits I talked about at the top of the thread.
> You don't need the preStop scam as long as your workload respects SIGTERM and does lame-duck.
Calling it a scam is a bit much.
I think having to put the logic of how the load balancer works into the application is a crossing of concerns.
This kind of orchestration does not belong in the app, it belongs in the supporting infrastructure.
The app should not need to know how the load balancer works with regards to scheduling.
The ALB Controller should be doing this. It does not, and so we use preStop until/unless the ALB controller figures it out.
Yes, the app needs to listen for SIGTERM and wait until its outstanding requests are completed before exiting - but not more than that.
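That SIGTERM-only version stays pretty small; a minimal sketch in Go (the port and the drain timeout are illustrative):

```go
package main

import (
	"context"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"} // nil Handler: uses http.DefaultServeMux

	// ctx is cancelled when Kube sends SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	go srv.ListenAndServe()
	<-ctx.Done()

	// Finish in-flight requests, then exit. No load-balancer knowledge here.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	srv.Shutdown(shutdownCtx)
}
```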
Presumably, because it'd be annoying waiting for lame duck mode when you actually do want the application to terminate quickly. SIGKILL usually needs special privileges/root and doesn't give the application any time to clean-up/flush/etc. The other workaround I've seen is having the application clean-up immediately upon a second signal, which I reckon could also work, but either solution seems reasonable.
Using SIGTERM is a problem because it conflicts with other behavior.
For instance, if you use SIGTERM for this, then there is the potential for the app to quit during the preStop, which Kube will detect as a crash and so restart your app.
We don't want to kill in-flight requests - terminating while a request is outstanding will result in clients connected to the ALB getting some HTTP 5xx response.
The AWS ALB Controller inside Kubernetes doesn't give us a nice way to specifically say "deregister this target"
The ALB will continue to send us traffic while we return 'healthy' to its health checks.
So we need some way to signal the application to stop serving 'healthy' responses to the ALB Health Checks, which will force the ALB to mark us as unhealthy in the target group and stop sending us traffic.
SIGUSR1 is an otherwise unused signal that we can send to the application without impacting how other signals are handled.
So I might be putting words in your mouth, so please correct me if this is wrong. It seems like you don’t actually control the SIGTERM handler code. Otherwise you could just write something like:
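(A sketch in Go, assuming you do control the handler and reusing the `ready` flag and `srv` from the sketches upthread; the sleep is a stand-in for however long the ALB takes to notice the failing health checks:)

```go
sigs := make(chan os.Signal, 1)
signal.Notify(sigs, syscall.SIGTERM)
go func() {
	<-sigs
	ready.Store(false)           // readiness endpoint starts returning 503
	time.Sleep(30 * time.Second) // let the ALB mark us unhealthy and deregister

	// Then finish in-flight requests and exit.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}()
```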
I don't think the framework matters; it's an issue with the ALB controller itself, not the application.
The ALB controller doesn't handle gracefully stopping traffic (by ensuring target group de-registration is complete) before allowing the pod to terminate.
Without a preStop, Kube immediately sends SIGTERM to your application.
Or nginx. In both cases it’s probably more expensive than an ALB but you have better integration with the app side, plus traffic mesh benefits if you’re using istio. The caveat is that you are managing your own public-facing nodes.