
I've also led a team of Incident Commanders at a FAANG.

If this was a routine config change, I could see how it could take 2 hours to start the remediation plan. However, they should have dashboards that correlate config setting changes with 500 errors (or equivalent). It gets difficult when you have many of these changes going out at the same time and they are rolled out slowly.
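To make the dashboard idea concrete, here is a minimal Python sketch of that kind of correlation. It is only an illustration, not how any particular company does it: `ConfigChange`, `rank_suspects`, and the per-minute error counts are all hypothetical names and data shapes. It compares the average 5xx count just before and just after each change and ranks changes by the increase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigChange:
    name: str    # hypothetical config key or rollout id
    minute: int  # minute index at which the change started rolling out


def rank_suspects(changes, errors_per_minute, window=5):
    """Rank config changes by the rise in 5xx errors after each one.

    errors_per_minute is a list of 5xx counts, one entry per minute.
    Returns (name, delta) pairs, largest increase first.
    """
    ranked = []
    for change in changes:
        before = errors_per_minute[max(0, change.minute - window):change.minute]
        after = errors_per_minute[change.minute:change.minute + window]
        if not before or not after:
            continue  # not enough data on one side of the change
        delta = sum(after) / len(after) - sum(before) / len(before)
        ranked.append((delta, change.name))
    ranked.sort(reverse=True)
    return [(name, delta) for delta, name in ranked]


if __name__ == "__main__":
    # Made-up data: errors jump at minute 10, when flag_b went out.
    errors = [2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 50, 60, 55, 58, 52]
    changes = [ConfigChange("flag_a", 5), ConfigChange("flag_b", 10)]
    for name, delta in rank_suspects(changes, errors):
        print(f"{name}: {delta:+.1f} errors/min")
```

This also shows why overlapping, slow rollouts are hard: if several changes land inside the same window, or a change reaches hosts gradually, the before/after windows blur together and the ranking stops pointing at a single culprit.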

The root cause document is mostly high level and meant for the public. The details of this specific outage will be in an internal document with many action items, some of them possibly quarter-long projects, including fixing this specific bug and perhaps adding a linter or monitor to prevent it from happening again.




