I do not understand what gave you the impression that I was advocating for "hope and prayers". I'm advocating against relying on one level of abstraction being flawless so that we can build perfect logic on top of it. I'm advocating against handling everything in a single layer. That FL2 program at Cloudflare encountered an error condition and bailed out, and that's fine. What is not fine is that the supervisor did not fail open.
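By failing open I mean something like the following Rust sketch: a supervisor that restarts a crashing non-essential worker and, when the crashes keep coming, routes traffic around it instead of letting requests die with it. All names here are hypothetical, made up for illustration, not Cloudflare's actual code:

    use std::process::Command;
    use std::thread;
    use std::time::{Duration, Instant};

    // Hypothetical supervisor for a non-essential worker: restart it on
    // crash, and if it keeps crashing soon after startup, bypass it.
    fn supervise() {
        let mut rapid_crashes = 0u32;
        loop {
            let started = Instant::now();
            // "fl2-bot-worker" is a made-up binary name.
            let status = Command::new("fl2-bot-worker").status();
            if matches!(&status, Ok(s) if s.success()) {
                break; // clean exit, nothing to do
            }
            // Count crashes that happen shortly after startup.
            if started.elapsed() < Duration::from_secs(5) {
                rapid_crashes += 1;
            } else {
                rapid_crashes = 0;
            }
            if rapid_crashes >= 3 {
                // Fail open: stop sending traffic through the worker.
                set_bypass(true);
                rapid_crashes = 0;
            }
            thread::sleep(Duration::from_secs(1));
        }
    }

    // Hypothetical control-plane hook: flips a flag the request path
    // consults to skip the module entirely.
    fn set_bypass(on: bool) {
        eprintln!("worker bypass = {on}");
    }

    fn main() {
        supervise();
    }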
The opposing views here are not "hope and prayers" vs. "good engineering". They are assuming things will fail at every stage vs. assuming one can build a flawless layer of abstraction on top of which everything else rests.
Resilient systems trump "correct" systems, and I would pick a system designed under the assumption that fake errors will be injected regularly, that processes will be killed at random, that entire racks of machines will be unplugged at any time, that whole datacenters will be taken off the grid for fun, over a system that has been "proven correct", any day. I thought this was common knowledge.
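To make "fake errors injected regularly" concrete, here is the kind of wrapper I mean, as a dependency-free Rust sketch. The INJECT_FAULTS knob and the 1% rate are made up; the point is that the fallback path gets exercised continuously, not just on the bad day:

    use std::time::{SystemTime, UNIX_EPOCH};

    // Crude dependency-free randomness for the sketch; a real harness
    // would use a proper RNG.
    fn coin_flip(p: f64) -> bool {
        let nanos = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .subsec_nanos();
        (nanos as f64 / 1_000_000_000.0) < p
    }

    // Wrap a fallible call; when fault injection is enabled (here via a
    // hypothetical INJECT_FAULTS environment variable), randomly replace
    // the result with an error so callers must handle failure routinely.
    fn with_chaos<T, E>(injected: E, f: impl FnOnce() -> Result<T, E>) -> Result<T, E> {
        if std::env::var_os("INJECT_FAULTS").is_some() && coin_flip(0.01) {
            return Err(injected); // ~1% of calls fail on purpose
        }
        f()
    }

    fn main() {
        let result = with_chaos("injected failure", || Ok::<_, &str>("real answer"));
        println!("{result:?}");
    }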
Of course I'm not arguing against proving that software is correct. I would actually argue that some formal methods would come in handy to model these kinds of systemic failures and reveal the worst cases with the largest blast radius.
But considering the case at hand: the code for that FL2 bot had an assertion on the size of the received data, and it was a valid assertion; the process decided to panic, and that was the right decision. What was not right was the lack of instrumentation that should have made these failures obvious, and the fact that user queries failed when that non-essential bot failed, instead of bypassing that bot.
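Concretely, the containment could live at the call site. A minimal Rust sketch, with entirely hypothetical names (this is the shape of the fix, not Cloudflare's actual code):

    use std::panic::{self, AssertUnwindSafe};

    struct Request;

    // Stand-in for the real module, which validated the size of the data
    // it received and panicked when the check failed. Panicking here is
    // fine; the mistake is letting it take the request down.
    fn score_bot(_req: &Request) -> f64 {
        panic!("received data larger than expected");
    }

    // A bot score is advice, not a prerequisite for serving the request:
    // contain the panic at the call site and serve without a verdict.
    fn bot_score_or_bypass(req: &Request) -> Option<f64> {
        match panic::catch_unwind(AssertUnwindSafe(|| score_bot(req))) {
            Ok(score) => Some(score),
            Err(_) => {
                // The missing instrumentation goes here: a metric and an
                // alert, so the failure is loud without being fatal.
                None
            }
        }
    }

    fn main() {
        match bot_score_or_bypass(&Request) {
            Some(s) => println!("bot score: {s}"),
            None => println!("bot module unavailable, serving the request anyway"),
        }
    }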