I have plenty of empathy, having been in plenty of similar situations. It's not a matter of "I can't BELIEVE it took that long" (although it is a bit surprising). Rather, I disagree with the key takeaways in the HN comments section and in the blog itself, which focus heavily on fixing rare edge-case issues (the bad ClickHouse query and a bad config file causing a panic via unwrap) rather than on reducing MTTR for all issues by improving the debugging and monitoring experience.
I'm also suspicious that
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
from the blog had a lot more to do with the issue than perhaps the narrative is letting on.