This is assuming the process could have done anything sensible while it had the malformed feature file. It might be that in this case this was one configuration file of several, and maybe the program could have been built to run with some defaults when it finds that specific configuration invalid, but in the general case, if a program expects a configuration file and can't do anything without it, panicking is a normal thing to do. There's no graceful handling (beyond a nice error message) that a program like Nginx could do on a syntax error in its config.
The real issue is further up the chain where the malformed feature file got created and deployed without better checks.
I do not think that if the bot detection model inside your big web proxy has a configuration error it should panic and kill the entire proxy and take 20% of the internet with it. This is a system that should fail gracefully and it didn't.
> The real issue
Are there single "real issues" with systems this large? There are issues being created constantly (say, unwraps where there shouldn't be, assumptions about the consumers of the database schema) that only become apparent when they line up.
I don't know too much about how the feature file distribution works but in the event of failure to read a new file, wouldn't logging the failure and sticking with the previous version of the file be preferable?
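Something like this, roughly; a sketch with made-up names (not their actual code) of keeping the last known-good feature file when the new one won't load:

    // All names here are hypothetical. The idea: if the newly distributed
    // feature file fails validation, log it and keep the previous version.
    struct Features(Vec<String>);

    fn parse_features(s: &str) -> Result<Features, String> {
        // Stand-in for the real parser/validation.
        let entries: Vec<String> = s.lines().map(String::from).collect();
        if entries.len() > 200 {
            return Err(format!("{} entries exceeds the 200 entry limit", entries.len()));
        }
        Ok(Features(entries))
    }

    fn reload_features(path: &std::path::Path, current: &mut Features) {
        match std::fs::read_to_string(path)
            .map_err(|e| e.to_string())
            .and_then(|s| parse_features(&s))
        {
            Ok(new) => *current = new,
            // Log loudly, keep the known-good version, don't panic.
            Err(e) => eprintln!("rejecting new feature file, keeping previous one: {e}"),
        }
    }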
That's exactly the point (i.e. just prior to distribution) where a simple sanity check should have been run, and where the config replacement/update pipeline should have been stopped on failure. When they introduced the memory-optimised feature loader with its 200 entry limit, it should have been a no-brainer to insert that sanity check into the config production pipeline.
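Even a tiny pre-publish gate would do; a sketch with hypothetical names, assuming the pipeline knows the same limit the consumers enforce:

    // Hypothetical pre-publish check for the config production pipeline:
    // refuse to distribute a feature file the consumers can't load.
    fn validate_before_publish(raw: &str, max_entries: usize) -> Result<(), String> {
        let entries = raw.lines().filter(|l| !l.trim().is_empty()).count();
        if entries > max_entries {
            return Err(format!("{entries} entries exceeds the limit of {max_entries}"));
        }
        Ok(())
    }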
Or even truncating the features to their limit and alerting through logs that there is likely performance degradation in their Bot Management.
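As a sketch (the limit and names are assumptions, not their actual loader):

    // "Truncate and warn": degrade bot scoring rather than refusing to run.
    const MAX_FEATURES: usize = 200;

    fn clamp_features(mut features: Vec<String>) -> Vec<String> {
        if features.len() > MAX_FEATURES {
            eprintln!(
                "feature file has {} entries, truncating to {}; Bot Management scores may degrade",
                features.len(),
                MAX_FEATURES
            );
            features.truncate(MAX_FEATURES);
        }
        features
    }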
I'm really confused how so many people find it acceptable to bring down your entire reverse proxy because the feature set for the ML model in one of your components was longer than expected.
One feature failing like this should probably log the error and fail closed. It shouldn't take down everything else in your big proxy that sits in front of your entire business.
Yea, Rust is safe but it’s not magic. However Nginx doesn’t panic on malformed config. It exits with hopefully a helpful error code and message. The question then is whether the Cloudflare code could have exited cleanly, in a way that made recovery easier, instead of just straight panicking.
Would expect() with a message meet that criterion of exiting with a more helpful error message? From the postmortem it seems to me like they didn’t even know it was panicking.
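Worth noting that expect() still panics, just with your message attached. Exiting cleanly the way Nginx does looks more like the sketch below (the path and names are placeholders, not Cloudflare's actual code):

    use std::process::ExitCode;

    // Sketch: surface the config error at startup and exit with a message
    // and a non-zero status, nginx-style, rather than letting an
    // unwrap()/expect() panic deep inside the proxy. The path is made up.
    fn main() -> ExitCode {
        match std::fs::read_to_string("/etc/proxy/features.conf") {
            Ok(_raw) => {
                // ... parse the config and start serving ...
                ExitCode::SUCCESS
            }
            Err(e) => {
                eprintln!("[emerg] cannot load feature config: {e}");
                ExitCode::FAILURE
            }
        }
    }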
> However Nginx doesn’t panic on malformed config. It exits with hopefully a helpful error code and message.
The thing I dislike most about Nginx is that if you are using it as a reverse proxy for like 20 containers and one of them is down, the whole web server will refuse to start up:
    nginx: [emerg] host not found in upstream "my-app"
Obviously making the other 19 sites unavailable just because one of them is caught in a crash loop isn't ideal. There is a workaround involving specifying variables, like so (non-Kubernetes example, regular Nginx web server running in a container, talking to other containers over an internal network, like Docker Compose or Docker Swarm):
    location / {
        resolver 127.0.0.11 valid=30s; # Docker DNS
        set $proxy_server my-app;
        proxy_pass http://$proxy_server:8080/;
        proxy_redirect default;
    }
Sadly, if you try to use that approach, then you just get:
nginx: [emerg] "proxy_redirect default" cannot be used with "proxy_pass" directive with variables
Unfortunately, switching the redirect configuration away from the default makes some apps go into a redirect loop and fail to load: mostly legacy ones, where Firefox shows something along the lines of "The page isn't redirecting properly". It sucks especially badly if you can't change the software that you just need to run, and suddenly your whole Nginx setup is brittle. Apache2 and Caddy don't have such an issue.
That's to say that all software out there has some really annoying failure modes, even if Nginx is pretty cool otherwise.