
The fact that the state of the art container orchestration system requires you to run a sleep command in order to not drop traffic on the floor is a travesty of system design.

We had perfectly good rolling deploys before k8s came on the scene, but k8s' insistence on a single-phase deployment process means we end up with this silly workaround.

I yelled into the void about this once and I was told that this was inevitable because it's an eventually consistent distributed system. I'm pretty sure it could still have had a 2 phase pod shutdown by encoding a timeout on the first stage. Sure, it would have made some internals require more complex state - but isn't that the point of k8s? Instead everyone has to rediscover the sleep hack over and over again.



In fairness to Kubernetes, this is partially due to AWS and how their ALB/NLB interact with Kubernetes. When Kubernetes starts to replace Pods, the Amazon ALB/NLB controller starts reacting; however, it must make calls to the Amazon API and wait for the ALB/NLB to catch up with the changing state of the cluster. Kubernetes is not aware of this and blindly carries on. If the ingress controller were more tightly integrated into the cluster, you wouldn't have this problem. We run Ingress-Nginx at work instead of the ALB for this reason.

Thus this entire system of "mark me not ready, wait for the ALB/NLB to realize I'm not ready and stop sending traffic, wait for that to finish, then terminate," after which Kubernetes continues with the rollout.

You would have the same problem if you just started up new images in an autoscaling group and randomly SSHed into old ones to run "shutdown -h now". The ALB would be shocked by the sudden departure of VMs, and you would probably get traffic going to old VMs until the health checks caught up.

EDIT: Azure/GCP have the same issue if you use their provided ALBs.


Nginx ingress has the same problem; it's just much faster at switching over when a pod is marked as unready because it's continuously watching the endpoints.

Kubernetes is missing a mechanism for load balancing services (like ingress, gateways) to ack pods being marked as not ready before the pod itself is terminated.


There are a few warts like this with the core/apps controllers. Nothing unfixable within the general k8s design, IMHO, but unfortunately most of the community has moved on to newer, shinier things.


It shouldn't. I've not had the braincells yet to fully internalize the entire article, but it seems like we go wrong about here:

> The AWS Load Balancer keeps sending new requests to the target for several seconds after the application is sent the termination signal!

And then concluded a wait is required…? Yes, traffic might not cease immediately, but you drain the connections to the load balancer, and then exit. A decent HTTP framework should be doing this by default on SIGTERM.

> I yelled into the void about this once and I was told that this was inevitable because it's an eventually consistent distributed system.

Yeah, I wouldn't agree with that either. A terminating pod is inherently "not ready", and that not-ready state should cause the load balancer to remove it from rotation. Similarly, the pod itself can drain its connections to the load balancer. That could take time; there's always going to be some point at which you'd have to give up on a slowloris request.


The fundamental gap, in my opinion, is that k8s has no mechanism (that I am aware of) to notify the load balancing mechanism (whether that's a service, ingress, or gateway) that it intends to remove a node, and for the load balancer to confirm this is complete.

This is how all pre-k8s rolling deployment systems I've used have worked.

So instead we move the logic to the application, and put a sleep in the shutdown phase to account for the time it takes for the load balancer to process/acknowledge the shutdown and stop routing new traffic to that node.
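
In a Deployment manifest, that workaround usually ends up looking something like this (the container name, image, and the 30-second figure are placeholders, not anything canonical):

      spec:
        template:
          spec:
            terminationGracePeriodSeconds: 60   # must cover the preStop sleep plus in-flight requests
            containers:
              - name: app                       # hypothetical container
                image: example/app:1.0
                lifecycle:
                  preStop:
                    exec:
                      # Assumes the image ships a sleep binary. The pod keeps
                      # serving during this hook; only after it returns does the
                      # kubelet send SIGTERM to the process.
                      command: ["sleep", "30"]

By the time SIGTERM arrives, the load balancer has (hopefully) finished deregistering the pod and stopped routing new traffic to it.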


K8s made simple things complicated, yet it doesn't have obvious safety (or sanity) mechanisms, making everyday life a PITA. I wonder why it was adopted so quickly despite its flaws, and the only thing that comes to mind is, like Java in the 90s, massive marketing and propaganda that it's "inevitable".


> put a sleep in the shutdown phase to account for the time it takes for the load balancer to process/acknowledge the shutdown and stop routing new traffic to that node.

Again, I don't see why the sleep is required. You're removed from the load balancer when the last connection from the LB closes.


That’s how you’d expect it to work, but that’s not how pod deletion works.

The pod delete event is sent out, and the load balancer and the pod itself both receive and react to it at the same time.

So unless the LB switchover is very quick, or the pod shutdown is slow, you get dropped requests, usually 502s.

Try googling for graceful k8s deploys and every article will say you have to put a preStop sleep in.


Most http frameworks don't do this right. They typically wait until all known in-flight requests complete and then exit. That's usually too fast for a load balancer that's still sending new requests. Instead you should just wait 30 seconds or so while still accepting new requests and replying not ready to load balancer health checks, and then if you want to wait additional time for long running requests, you can. You can also send clients "connection: close" to convince them to reopen connections against different backends.
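
A rough sketch of the pod-spec side of that pattern (the /readyz path, port, and timings are invented; the "keep serving while reporting not ready" behavior itself has to live in the application):

      spec:
        terminationGracePeriodSeconds: 90       # room for the ~30s drain window plus long-running requests
        containers:
          - name: app                           # hypothetical container
            readinessProbe:
              httpGet:
                path: /readyz                   # the app flips this to a failure on SIGTERM
                port: 8080
              periodSeconds: 5
              failureThreshold: 2

The load balancer's health checks against the pod then start failing while the process is still accepting and finishing requests.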


> That's usually too fast for a load balancer that's still sending new requests.

How?

A load balancer can't send a new request on a connection that doesn't exist. (Existing connections being gracefully torn down as requests conclude on them & as the underlying protocol permits.) If it cannot open a connection to the backend (the backend should not allow new connections when the drain starts) then by definition new requests cannot end up at the backend.


The server in http is limited in its ability to initiate connection closures. Remember that when you close a connection in TCP, that sends a FIN packet, and the other end of the connection doesn't know that that's happened yet and might still be sending data packets. In http, the server can request that the client stop using a connection and close it with the "connection: close" header. If the server closes the connection abruptly, there could be requests in flight on the network. With http pipelining, the server may even receive requests on the same connection after sending "connection: close" since they could have been sent by the client before that header was received. With pipelining, the client needs to close the TCP connection to achieve a graceful shutdown.


K8S is overrated; it's actually pretty terrible, but everyone has been convinced it's the solution to all of their problems because it's slightly better than what we had 15 years ago (Ansible/Puppet/Bash/immutable deployments) at 10x the complexity. There are so many weird edge cases just waiting to completely ruin your day. Like subPath mounts: if you use subPath, then changes to a ConfigMap don't get reflected in the container. The container doesn't get restarted either, of course, so you have config drift built in, unless you install one of those weird hacky controllers that restarts pods for you.


It's not slightly better, it's way better than Ansible/Puppet/Bash/immutable deployments, because everything follows the same pattern and is standard.

You get observability pretty much for free; solutions from 15 years ago were crap. Remember Nagios and the like?

Old solutions would put trash all over the disk in /etc/. How many times did we have to SSH in to fix / repair stuff?

All the health check / load balancer handling is also much better on Kubernetes.


I wouldn't throw away k8s just for subPath weirdness, but I hear your general point about complexity. But if you are throwing away Ansible and Puppet, what is your solution? Also I'm not entirely sure what you are getting at with bash (what does shell scripting have to do with it?) and immutable deployments.


That's only one example of K8s weirdness that can wake you up at 3am. How: a change is rolled out during business hours that alters service config inside a ConfigMap. The pod doesn't get notified of, or reload, this change. The pod crashes at night, loads the new (bad/invalid) config, and takes down production. To add insult to injury, the engineers spend hours debugging the issue because it's completely unintuitive that CM changes are not reflected ONLY when using subPath.
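
For anyone who hasn't hit it, the trap looks roughly like this (names are illustrative): mounting a single key with subPath pins the file contents at container start, whereas mounting the whole volume would eventually pick up ConfigMap edits.

      spec:
        containers:
          - name: app                           # hypothetical container
            volumeMounts:
              - name: config
                mountPath: /etc/app/config.yaml
                subPath: config.yaml            # updates to the ConfigMap are NOT propagated here
        volumes:
          - name: config
            configMap:
              name: app-config                  # hypothetical ConfigMap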


That's totally valid. I understand the desire of the k8s maintainers to prevent "cascading changes" from happening, but this one is a very reasonable feature they seem not to support. There's a pretty common hack to make things restart on a config change by adding a pod annotation with the ConfigMap hash:

      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}

But I agree that it shouldn't be needed. There should be builtin and sensible ways to notify of changes and react.


This is an argument for 12-Factor and env vars for config.

Also, Kustomize can help with some of this, since it rotates the name of ConfigMaps: when any change happens, new ConfigMap, new Deployment.
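
Roughly, with a configMapGenerator (file names invented): Kustomize appends a hash of the contents to the generated ConfigMap's name and rewrites references to it, so a config change produces a new name and therefore a new rollout.

      # kustomization.yaml
      resources:
        - deployment.yaml
      configMapGenerator:
        - name: app-config                      # becomes app-config-<content hash>
          files:
            - config.yaml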


That's how I do it, with Kustomize. It definitely confused me before I learned that, but it hasn't been an issue for years. And if you don't use Kustomize, you just do... what was it, kubectl rollout? Add that to the end of your deploy script and you're good.


I told you that I hear you on K8s complexity. But since you throw out Ansible/Puppet/etc., what technology are you advocating?




