This is assuming the process could have done anything sensible while it had the malformed feature file. It might be that in this case this was one configuration file of several, and maybe the program could have been built to run with some defaults when it finds that specific configuration invalid, but in the general case, if a program expects a configuration file and can't do anything without it, panicking is a normal thing to do. There's no graceful handling (beyond a nice error message) that a program like Nginx could do on a syntax error in its config.
The real issue is further up the chain where the malformed feature file got created and deployed without better checks.
I do not think that if the bot detection model inside your big web proxy has a configuration error it should panic and kill the entire proxy and take 20% of the internet with it. This is a system that should fail gracefully and it didn't.
> The real issue
Are there single "real issues" with systems this large? There are issues being created constantly (say, unwraps where there shouldn't be, assumptions about the consumers of the database schema) that only become apparent when they line up.
I don't know too much about how the feature file distribution works but in the event of failure to read a new file, wouldn't logging the failure and sticking with the previous version of the file be preferable?
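Something like this, roughly; a sketch with made-up names (not their actual code) of keeping the last known-good feature file when the new one won't load:

    // All names here are hypothetical. The idea: if the newly distributed
    // feature file fails validation, log it and keep the previous version.
    struct Features(Vec<String>);

    fn parse_features(s: &str) -> Result<Features, String> {
        // Stand-in for the real parser/validation.
        let entries: Vec<String> = s.lines().map(String::from).collect();
        if entries.len() > 200 {
            return Err(format!("{} entries exceeds the 200 entry limit", entries.len()));
        }
        Ok(Features(entries))
    }

    fn reload_features(path: &std::path::Path, current: &mut Features) {
        match std::fs::read_to_string(path)
            .map_err(|e| e.to_string())
            .and_then(|s| parse_features(&s))
        {
            Ok(new) => *current = new,
            // Log loudly, keep the known-good version, don't panic.
            Err(e) => eprintln!("rejecting new feature file, keeping previous one: {e}"),
        }
    }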
That's exactly the point (i.e. just prior to distribution) where a simple sanity check should have been run, and where the config replacement/update pipeline should have been stopped on failure. When they introduced the memory-optimised feature loader with its 200 entry limit, it should have been a no-brainer to insert that sanity check into the config production pipeline.
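Even a tiny pre-publish gate would do; a sketch with hypothetical names, assuming the pipeline knows the same limit the consumers enforce:

    // Hypothetical pre-publish check for the config production pipeline:
    // refuse to distribute a feature file the consumers can't load.
    fn validate_before_publish(raw: &str, max_entries: usize) -> Result<(), String> {
        let entries = raw.lines().filter(|l| !l.trim().is_empty()).count();
        if entries > max_entries {
            return Err(format!("{entries} entries exceeds the limit of {max_entries}"));
        }
        Ok(())
    }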
Or even truncating the features to their limit and alerting through logs that there is likely performance degradation in their Bot Management.
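As a sketch (the limit and names are assumptions, not their actual loader):

    // "Truncate and warn": degrade bot scoring rather than refusing to run.
    const MAX_FEATURES: usize = 200;

    fn clamp_features(mut features: Vec<String>) -> Vec<String> {
        if features.len() > MAX_FEATURES {
            eprintln!(
                "feature file has {} entries, truncating to {}; Bot Management scores may degrade",
                features.len(),
                MAX_FEATURES
            );
            features.truncate(MAX_FEATURES);
        }
        features
    }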
I'm really confused how so many people find it acceptable to bring down your entire reverse proxy because the feature set for the ML model in one of your components was longer than expected.
One feature failing like this should probably log the error and fail closed. It shouldn't take down everything else in your big proxy that sits in front of your entire business.
Yea, Rust is safe but it’s not magic. However Nginx doesn’t panic on malformed config. It exits with hopefully a helpful error code and message. The question then is whether the Cloudflare code could have exited cleanly, in a way that made recovery easier, instead of just straight panicking.
Would expect() with a message meet that criterion of exiting with a more helpful error message? From the postmortem it seems to me like they didn’t even know it was panicking.
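Worth noting that expect() still panics, just with your message attached. Exiting cleanly the way Nginx does looks more like the sketch below (the path and names are placeholders, not Cloudflare's actual code):

    use std::process::ExitCode;

    // Sketch: surface the config error at startup and exit with a message
    // and a non-zero status, nginx-style, rather than letting an
    // unwrap()/expect() panic deep inside the proxy. The path is made up.
    fn main() -> ExitCode {
        match std::fs::read_to_string("/etc/proxy/features.conf") {
            Ok(_raw) => {
                // ... parse the config and start serving ...
                ExitCode::SUCCESS
            }
            Err(e) => {
                eprintln!("[emerg] cannot load feature config: {e}");
                ExitCode::FAILURE
            }
        }
    }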
> However Nginx doesn’t panic on malformed config. It exits with hopefully a helpful error code and message.
The thing I dislike most about Nginx is that if you are using it as a reverse proxy for like 20 containers and one of them is down, the whole web server will refuse to start up:
    nginx: [emerg] host not found in upstream "my-app"
Obviously making the other 19 sites unavailable just because one of them is caught in a crash loop isn't ideal. There is a workaround involving specifying variables, like so (non-Kubernetes example, regular Nginx web server running in a container, talking to other containers over an internal network, like Docker Compose or Docker Swarm):
    location / {
        resolver 127.0.0.11 valid=30s; # Docker DNS
        set $proxy_server my-app;
        proxy_pass http://$proxy_server:8080/;
        proxy_redirect default;
    }
Sadly, if you try to use that approach, then you just get:
nginx: [emerg] "proxy_redirect default" cannot be used with "proxy_pass" directive with variables
Unfortunately, switching the redirect configuration away from the default makes some apps go into a redirect loop and fail to load: mostly legacy ones, where Firefox shows something along the lines of "The page isn't redirecting properly". It sucks especially badly if you can't change the software that you just need to run, and suddenly your whole Nginx setup is brittle. Apache2 and Caddy don't have such an issue.
That's to say that all software out there has some really annoying failure modes, even if Nginx is pretty cool otherwise.