
There were two things I think went extremely poorly here:

1) Lack of validation of the configuration file.

Rolling out a config file across the global network every 5 minutes is extremely high risk. Even without hindsight, surely one would see the need for very careful validation of this file before taking on that risk?

There were several things "obviously" wrong with the file that validation should have caught:

- It was much bigger than expected.

- It had duplicate entries.

- Most importantly, when loaded into the FL2 proxy, the proxy would panic on every request. At the very least, part of the validation should involve loading the file into the proxy and serving a request?
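
To make the first two checks concrete, here is a rough sketch in Rust of the kind of pre-rollout gate I mean. Everything in it is an assumption for illustration: I don't know the real file format, so it pretends the file is one entry name per line, and `MAX_ENTRIES`, `ValidationError` and `validate` are made-up names with a made-up limit. The third check (loading the file into a real proxy and serving a request) needs a staging instance and isn't shown.

```rust
use std::collections::HashSet;

/// Hypothetical upper bound on entries; the real limit is an assumption here.
const MAX_ENTRIES: usize = 200;

#[derive(Debug, PartialEq)]
enum ValidationError {
    TooManyEntries { found: usize, limit: usize },
    DuplicateEntry(String),
}

/// Validates a candidate file (assumed format: one entry name per line)
/// and returns the parsed entries only if every check passes.
fn validate(contents: &str) -> Result<Vec<String>, ValidationError> {
    let entries: Vec<String> = contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();

    // Check 1: the file must not be bigger than expected.
    if entries.len() > MAX_ENTRIES {
        return Err(ValidationError::TooManyEntries {
            found: entries.len(),
            limit: MAX_ENTRIES,
        });
    }

    // Check 2: no entry may appear twice.
    {
        let mut seen = HashSet::new();
        for entry in &entries {
            // `insert` returns false when the value was already present.
            if !seen.insert(entry.as_str()) {
                return Err(ValidationError::DuplicateEntry(entry.clone()));
            }
        }
    }

    Ok(entries)
}
```

The point is only that a rejected file comes back as an `Err` the publisher can act on (keep serving the previous file, page someone) before anything is pushed out.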

2) Very long time to identify and then fix such a critical issue.

I can't understand the complete lack of monitoring or reporting. A panic in Rust code, especially from an unwrap, is the application screaming that there's a logic error! I don't understand how that can be conflated with a DDoS attack. How are your logs not filled with backtraces pointing to the exact "unwrap" in question?
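
For what it's worth, surfacing panics is not much code. A minimal sketch using only the standard library: `std::panic::set_hook` installs a process-wide hook that runs on every panic, and the hook can report the source location and bump a counter that alerting could watch. `PANIC_COUNT` and `install_panic_reporter` are names I made up, and I'm not claiming this is how the real proxy is (or should be) instrumented.

```rust
use std::panic;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical metric: total panics seen by this process.
static PANIC_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Installs a global panic hook that counts panics and logs where they happened.
fn install_panic_reporter() {
    panic::set_hook(Box::new(|info| {
        PANIC_COUNT.fetch_add(1, Ordering::Relaxed);
        match info.location() {
            // File and line of the panicking call, e.g. the offending `unwrap`.
            Some(loc) => eprintln!("PANIC at {}:{}: {}", loc.file(), loc.line(), info),
            None => eprintln!("PANIC at unknown location: {}", info),
        }
    }));
}
```

A counter like this going from zero to "every request" right after a config push looks nothing like a traffic spike from an attack.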

Then, once identified, why was it so hard to revert to a known good version of the configuration file? How did no one foresee the need to roll back this file when designing a feature that deploys a new one globally every 5 minutes?
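
The rollback I have in mind is nothing exotic. A sketch, with `ConfigStore` and its methods being hypothetical names of my own: keep the last config that was confirmed healthy next to the active one, and only promote a new config to "known good" after it has actually served traffic without problems.

```rust
/// Holds the active config and the last one confirmed healthy.
struct ConfigStore {
    active: Vec<String>,
    last_good: Vec<String>,
}

impl ConfigStore {
    /// Starts with a config that is assumed to be good.
    fn new(initial: Vec<String>) -> Self {
        Self { active: initial.clone(), last_good: initial }
    }

    /// Activates a new config without touching the known-good copy.
    fn deploy(&mut self, candidate: Vec<String>) {
        self.active = candidate;
    }

    /// Promotes the active config once health checks have passed.
    fn confirm_healthy(&mut self) {
        self.last_good = self.active.clone();
    }

    /// Reverts to the last config that was confirmed healthy.
    fn rollback(&mut self) {
        self.active = self.last_good.clone();
    }
}
```

Because `deploy` never overwrites `last_good`, several bad pushes in a row (one every 5 minutes, say) still leave a single `rollback` call away from the last working file.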




