Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> The technical aspect that contributed the most to this burnout was both the lack of observability tooling and the lack of organizational desire to invest in it.

One of the most significant "triumphs" of my technical career came at a startup where I started as a Principal Engineer and left as the VP Engineering. When I started, we had nightly outages requiring Engineering on-call, and by the time I left, no one could remember a recent issue that required Engineers to wake up.

It was a ton of work and required a strong investment in quality & resilience, but even bigger impact was from observability. We couldn't afford APM, so we took a very deliberate approach to what we logged and how, and stuffed it into an ELK stack for reporting. The immediate benefit was a drastic reduction in time to diagnose issues, and effectively let our small operations team triage issues and easily identify app vs. infra issues almost immediately. Additionally, it was much easier to identify and mitigate fragility in our code and infra.

The net result was an increase in availability from 98.5% to 99.995%, and I think observability contributed to at least half of that.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: