Funny, I was just thinking this week that logging needs some magic.
Log diving takes a lot of time, especially during some kind of outage/downtime/bug where the whole team might be watching a screen share of someone digging through logs.
At the same time, I am sceptical about "AI", especially if it is just an LLM stumbling around.
Understanding logs is probably the most brain-intensive part of the job for me, more so than system design, project planning, or coding.
That's because you need to know where the code is logging, hold the code paths in your head, and constantly sift through stuff that turns out to be a red herring or doesn't make sense.
I hope you can improve this space, but it won't be easy!
Very relatable experience with log diving; it feels very much like a needle-in-a-haystack problem that gets so much harder when you're not the only one who contributed to the source of the errors (often the case).
As for the skepticism about LLMs stumbling around raw logs: it's super deserved. Even the developers who wrote the program often refer to the larger app context when debugging, so it's not as easy as throwing a bunch of logs into an LLM. Plus, context window limits and the relative lack of "understanding" as contexts get larger are troublesome.
We found it helped a lot to profile application logs over time. Think aggregation, but for individual flows rather than for similar log lines. By grouping and ordering logs into flows, you bring the context of thousands of (repetitive) log lines down to the core flows, which makes it much easier to spot when something is out of the ordinary.
There's still a lot of room for improvement around false positives and natural variation in application flows.
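A rough sketch of the idea in Python (the field names and the notion of a per-line log "template" are assumptions on my part, not a description of our actual implementation):

    from collections import Counter, defaultdict

    # Profile logs per flow: group events by a per-request flow id, reduce each
    # flow to its ordered sequence of log templates, then count how often each
    # sequence occurs. Field names ("ts", "flow_id", "template") are illustrative.
    def profile_flows(events):
        flows = defaultdict(list)
        for e in sorted(events, key=lambda e: e["ts"]):
            flows[e["flow_id"]].append(e["template"])
        return Counter(tuple(seq) for seq in flows.values())

    def unusual_flows(events, min_support=5):
        # Flows whose template sequence appears fewer than min_support times
        # are candidates for "out of the ordinary" behaviour.
        profile = profile_flows(events)
        return [seq for seq, count in profile.items() if count < min_support]

Surfacing rare sequences like this beats staring at individual noisy lines, though it's where the false-positive problem mentioned above shows up.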
The best way to improve this is to just generate decent, useful, and actionable logs. Sifting through a trash heap is where the problem is. No magic will suddenly turn that trash into gold.
You have to do this at the inception of the software you're building rather than strap it on the donkey when something breaks (the usual way).
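For anyone wondering what "actionable" looks like in practice, here's a minimal structured-logging sketch in Python, stdlib only; the event and field names are made up for illustration:

    import json, logging, sys

    # Emit one JSON object per event with the fields you'll want to filter on
    # later (ids, outcome, duration) instead of free-form prose.
    logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
    log = logging.getLogger("payments")

    def log_event(event, **fields):
        log.info(json.dumps({"event": event, **fields}))

    log_event("payment.charge_failed",
              order_id="ord_123", provider="example_psp",
              error_code="card_declined", duration_ms=184)

One JSON line per event means you can grep, group, and graph it later without regex archaeology.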
Yep, but it's sometimes a compromise people may be unwilling to make. Too often I hear (and have seen via DD customers) horror stories about initiatives to fix observability getting squashed by teams in hopes of shipping.
Moving fast has its downsides, and I can't say I blame people for deprioritizing good logging practices. But it does come back to bite...
Though as a caveat, you don't always have control over your logs -- especially with third-party services, large but fragmented engineering organizations, etc. Even with great internal practices, there's always something.
On another note, access to the codebase + live logs gives room to develop better auto-instrumentation tooling. Though perhaps Cursor could do a decent enough job at starting folks off.
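For reference, OpenTelemetry's library instrumentation is one existing take on that direction; a minimal sketch for a Flask app (assumes flask, opentelemetry-sdk, and opentelemetry-instrumentation-flask are installed, and the console exporter is just for demonstration):

    from flask import Flask
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
    from opentelemetry.instrumentation.flask import FlaskInstrumentor

    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(
        SimpleSpanProcessor(ConsoleSpanExporter()))

    app = Flask(__name__)
    FlaskInstrumentor().instrument_app(app)  # a span per request, no handler changes

    @app.route("/")
    def index():
        return "ok"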
Disclaimer: I'm a founder at Gravwell, a log analytics startup
I agree: even when applicable, LLMs are relegated to analyzing subselected data, so the logs have to go somewhere else first. I think understanding logs is brain-intensive because it can be a tricky problem. It gets easier with good tools, but often those tools are there to build the thing that solves the problem (e.g. a good query plus automation) rather than to solve the problem themselves. I think LLMs can get better at writing those queries, which would help a lot.
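To illustrate the query-generation point, a sketch of the shape I have in mind: hand the model the log store's field list plus the question and let it draft a query for a human to review. call_llm and the field names are hypothetical stand-ins, not Gravwell's actual API:

    # call_llm(prompt) -> str is a stand-in for whatever LLM client you use;
    # the field list and "return only the query" framing are illustrative.
    FIELDS = ["ts", "level", "service", "trace_id", "msg", "duration_ms"]

    def draft_log_query(question, call_llm):
        prompt = (
            "You write queries for a log store with fields: "
            + ", ".join(FIELDS) + ".\n"
            "Return only the query, no explanation.\n"
            "Question: " + question
        )
        return call_llm(prompt)  # a human reviews the draft before it runs

    # e.g. draft_log_query("5xx spikes per service in the last hour", call_llm)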
We started Gravwell to try to bring some magic. It's a schema-on-read time-series data lake that will eat text or binary and comes in SaaS or self-hosted (on-prem) flavors. We built our backend from scratch to offer maximum flexibility in query. The search syntax looks like a Linux command line, and kinda behaves like one too: chain modules together to extract, filter, aggregate, enrich, etc. An automation system is included. If you like Splunk, you should check us out.
There's a free community edition (personal or commercial use) for 2GB/day anon or 14GB/day w/ email. Tech docs are open at docs.gravwell.io.