I have nowhere near the experience managing such complex systems, but I can empathize with this. In high-pressure situations the most obvious things get missed. If someone is convinced System X is at fault, your mind can make leaps to justify that every other degraded system is a downstream effect of it. Cause and effect can get switched.
Sometimes you have smart people in the room who dig deeper and fish it out, but you cannot always rely on that.
I have plenty of empathy, having been in plenty of similar situations. It's not a matter of "I can't BELIEVE it took that long" (although it is a bit surprising) so much as that I disagree with the key takeaways here in the HN comments section and in the blog itself, which focus strongly on fixing rare edge-case issues (the bad ClickHouse query and a bad config file causing a panic via unwrap), rather than on reducing MTTR for all issues by improving the debug and monitoring experience.
I'm also suspicious that
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
from the blog had a lot more to do with the issue than perhaps the narrative is letting on.