> SiftDev flags silent failures, such as two microservices updating the same record within 50ms
I don't understand, what about that is a "silent failure"?
in order for your product to even know about it, wouldn't I need to write a log message for every single record update?
and if my architecture allows two microservices to update the same row in the same database...maybe it happening within 50ms is expected?
that could be an inefficient architecture for sure, but I'm confused as to whether your product is also trying to give me recommendations about "here's an architectural inefficiency we found based on feeding your logs to an LLM"
> You can then directly ask your logs questions like, “What's causing errors in our checkout service?” or “Why did latency spike at 2 AM?” and immediately receive insightful, actionable answers that you’d otherwise manually be searching for.
the general question I have with any product that's marketing itself as being "AI-powered" - how do hallucinations get resolved?
I already have human coworkers who will investigate some error or alert or performance problem, and come to an incorrect conclusion about the cause.
when that happens I can walk through their thought process and analysis chain with them and identify the gap that led them to the incorrect conclusion. often this is a useful signal that our system documentation needs to be updated, or log messages need to be clarified, or a dashboard should include a different metric, etc etc.
if I ask your product "what caused such-and-such outage" and the answer that comes back is incorrect, how do I "teach" it the correct answer?
> I don't understand, what about that is a "silent failure"?
Silent failures can be "allowed" behavior in your application that isn't actually labeled as an error but is still irregular. Think race conditions, deadlocks, silent timeouts, or even just mislabeled error logs.
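To make this concrete, here's a minimal sketch (the service name and endpoint are made up) of the "silent timeout" case: the failure is handled, but it's logged at INFO with no error marker, so nothing downstream ever treats it as a problem:

```python
import logging

import requests

log = logging.getLogger("payments")

def charge_card(order_id: str) -> bool:
    try:
        resp = requests.post(
            "https://payments.internal/charge",  # hypothetical endpoint
            json={"order_id": order_id},
            timeout=2,
        )
        return resp.ok
    except requests.Timeout:
        # Silent failure: the charge never happened, but this is logged
        # at INFO with no "error" keyword, so nothing alerts on it.
        log.info("charge request timed out for order %s, skipping", order_id)
        return False
```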
> in order for your product to even know about it, wouldn't I need to write a log message for every single record update?
That's right, and this may not always be feasible (or necessary!), but if your application can be impacted by errors like these, it may be worth logging anyway.
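As a rough sketch of what that could look like (the field names are illustrative, not a required schema), one structured line per write is enough for this kind of correlation:

```python
import json
import logging
import time

log = logging.getLogger("orders-service")

def log_record_update(table: str, record_id: str) -> None:
    # One structured line per write. "service", "table", and "record_id"
    # are what a correlator needs in order to notice two different
    # services touching the same row within a 50ms window.
    log.info(json.dumps({
        "event": "record_update",
        "service": "orders-service",  # hypothetical service name
        "table": table,
        "record_id": record_id,
        "ts_ms": int(time.time() * 1000),
    }))
```

The detection side is then just a group-by on (table, record_id) and a check for two distinct service values within 50ms of each other.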
> the general question I have with any product that's marketing itself as being "AI-powered" - how do hallucinations get resolved?
> and if my architecture allows two microservices to update the same row in the same database...maybe it happening within 50ms is expected?
> if I ask your product "what caused such-and-such outage" and the answer that comes back is incorrect, how do I "teach" it the correct answer?
For these concerns, human-in-the-loop feedback is our preliminary approach! We have our own feedback loop running internally to account for changes and false errors, but explanations from human input (even something as simple as "Not an error" or "Missed error" buttons) are very helpful.
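As a sketch of the shape of that feedback (hypothetical names, not our actual API), each correction becomes a labeled example tied to the original finding, which future analyses can retrieve as context instead of repeating the mistake:

```python
import time
from dataclasses import dataclass, field

@dataclass
class FindingFeedback:
    # All names here are illustrative, not a real schema.
    finding_id: str        # the flagged pattern or answer being corrected
    label: str             # "not_an_error" | "missed_error" | "wrong_cause"
    explanation: str = ""  # optional free-text from the engineer
    created_at: float = field(default_factory=time.time)

feedback = FindingFeedback(
    finding_id="chk-2025-01-12-0042",
    label="wrong_cause",
    explanation="Latency spike was a scheduled batch job, not the checkout service.",
)
```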
> when that happens I can walk through their thought process and analysis chain with them and identify the gap that led them to the incorrect conclusion. often this is a useful signal that our system documentation needs to be updated, or log messages need to be clarified, or a dashboard should include a different metric, etc etc.
Got it, I imagine it'll be very helpful for us to display our chain of thought in our dashboards too. Great feedback, thank you!
> Think race conditions, deadlocks, silent timeouts, or even just mislabeled error logs.
I agree that those are bad things.
but how does your product help me with them?
I have some code that has a deadlock. are you suggesting that I can find the deadlock by shipping my logs to a 3rd-party service that will feed them into an LLM?