Funny, I was just thinking this week that logging needs some magic.
Log diving takes a lot of time, especially during some kind of outage/downtime/bug where the whole team might be watching a screen share of someone digging through logs.
At the same time, I am sceptical about "AI", especially if it is just an LLM stumbling around.
Understanding logs is probably the most brain-intensive part of the job for me, more so than system design, project planning, or coding.
That's because you need to know where the code is logging, hold the code paths in your head, and constantly sift through stuff that turns out to be a red herring or doesn't make sense.
I hope you can improve this space, but it won't be easy!
Very relatable experience with log diving; it feels very much like a needle-in-a-haystack problem that gets so much harder when you're not the only one who contributed to the source of the errors (often the case).
As for the skepticism about LLMs stumbling around raw logs: it's super deserved. Even the developers who wrote the program often refer to the larger app context when debugging, so it's not as easy as throwing a bunch of logs into an LLM. Plus, context window limits and the relative lack of "understanding" as contexts get larger are troublesome.
We found it helped a lot to profile application logs over time. Think aggregation, but for individual flows rather than for similar log lines. By grouping and ordering logs into flows, you bring the context of thousands of (repetitive) log lines down to the core flows, which makes it much easier to spot when something is out of the ordinary.
There's still a lot of room for improvement around false positives and natural variation in application flows.
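A rough sketch of the idea in Python (the field names and the notion of a per-line log "template" are assumptions on my part, not a description of our actual implementation):

    from collections import Counter, defaultdict

    # Profile logs per flow: group events by a per-request flow id, reduce each
    # flow to its ordered sequence of log templates, then count how often each
    # sequence occurs. Field names ("ts", "flow_id", "template") are illustrative.
    def profile_flows(events):
        flows = defaultdict(list)
        for e in sorted(events, key=lambda e: e["ts"]):
            flows[e["flow_id"]].append(e["template"])
        return Counter(tuple(seq) for seq in flows.values())

    def unusual_flows(events, min_support=5):
        # Flows whose template sequence appears fewer than min_support times
        # are candidates for "out of the ordinary" behaviour.
        profile = profile_flows(events)
        return [seq for seq, count in profile.items() if count < min_support]

Surfacing rare sequences like this beats staring at individual noisy lines, though it's where the false-positive problem mentioned above shows up.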
The best way to improve this is to just generate decent, useful, and actionable logs. Sifting through a trash heap is where the problem is. No magic will suddenly turn that trash into gold.
You have to do this at the inception of the software you're building rather than strap it on the donkey when something breaks (the usual way).
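For anyone wondering what "actionable" looks like in practice, here's a minimal structured-logging sketch in Python, stdlib only; the event and field names are made up for illustration:

    import json, logging, sys

    # Emit one JSON object per event with the fields you'll want to filter on
    # later (ids, outcome, duration) instead of free-form prose.
    logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
    log = logging.getLogger("payments")

    def log_event(event, **fields):
        log.info(json.dumps({"event": event, **fields}))

    log_event("payment.charge_failed",
              order_id="ord_123", provider="example_psp",
              error_code="card_declined", duration_ms=184)

One JSON line per event means you can grep, group, and graph it later without regex archaeology.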
Yep, but it's sometimes a compromise people may be unwilling to make. Too often I hear (and have seen via DD customers) horror stories about initiatives to fix observability getting squashed by teams in hopes of shipping.
Moving fast has its downsides, and I can't say I blame people for deprioritizing good logging practices. But it does come back to bite...
Though as a caveat, you don't always have control over your logs -- especially with third-party services, large but fragmented engineering organizations, etc. Even with great internal practices, there's always something.
On another note, access to the codebase + live logs gives room to develop better auto-instrumentation tooling. Though perhaps Cursor could do a decent enough job at starting folks off.
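For reference, OpenTelemetry's library instrumentation is one existing take on that direction; a minimal sketch for a Flask app (assumes flask, opentelemetry-sdk, and opentelemetry-instrumentation-flask are installed, and the console exporter is just for demonstration):

    from flask import Flask
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
    from opentelemetry.instrumentation.flask import FlaskInstrumentor

    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(
        SimpleSpanProcessor(ConsoleSpanExporter()))

    app = Flask(__name__)
    FlaskInstrumentor().instrument_app(app)  # a span per request, no handler changes

    @app.route("/")
    def index():
        return "ok"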
Disclaimer: I'm a founder at Gravwell, a log analytics startup
I agree: even when applicable, LLMs are relegated to analyzing subselected data, so the logs have to go somewhere else first. I think understanding logs is brain-intensive because it can be a tricky problem. It gets easier with good tools, but often those tools are there to build the thing that solves the problem (e.g. a good query plus automation) rather than to solve the problem themselves. I think LLMs can get better at writing those queries, which would help a lot.
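To illustrate the query-generation point, a sketch of the shape I have in mind: hand the model the log store's field list plus the question and let it draft a query for a human to review. call_llm and the field names are hypothetical stand-ins, not Gravwell's actual API:

    # call_llm(prompt) -> str is a stand-in for whatever LLM client you use;
    # the field list and "return only the query" framing are illustrative.
    FIELDS = ["ts", "level", "service", "trace_id", "msg", "duration_ms"]

    def draft_log_query(question, call_llm):
        prompt = (
            "You write queries for a log store with fields: "
            + ", ".join(FIELDS) + ".\n"
            "Return only the query, no explanation.\n"
            "Question: " + question
        )
        return call_llm(prompt)  # a human reviews the draft before it runs

    # e.g. draft_log_query("5xx spikes per service in the last hour", call_llm)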
We started Gravwell to try to bring some magic. It's a schema-on-read time-series data lake that will eat text or binary and comes in SaaS or self-hosted (on-prem) flavors. We built our backend from scratch to offer maximum flexibility in query. The search syntax looks like a Linux command line, and kinda behaves like one too: chain modules together to extract, filter, aggregate, enrich, etc. An automation system is included. If you like Splunk, you should check us out.
There's a free community edition (personal or commercial use) for 2GB/day anon or 14GB/day w/ email. Tech docs are open at docs.gravwell.io.