In the early 2000s, when Google explained how they achieved their (even back then) awesome reliability, i.e. by assuming that any software and hardware will eventually fail and designing everything around the idea that every component is faulty, there were some people who couldn't get it, who would still bring up the argument that "yeah, but today with modern RAID..."

People here chatting about unwrap remind me of them :)

Assuming software and people will fail is exactly what not using unwrap is about.

If you depend on engineers not fucking up, you will fail. Using unwrap is assuming humans won’t get human-enforced invariants wrong. They will. They did here.
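
To make that concrete, a minimal sketch (the rule-file setup and all names are invented for illustration):

    fn first_rule_unwrap(rules: &[u8]) -> u8 {
        // Panics, and takes the process with it, the day someone
        // upstream ships an empty file.
        *rules.first().unwrap()
    }

    fn first_rule_checked(rules: &[u8]) -> u8 {
        // Same invariant, but a violation degrades to a default the
        // caller can observe and alert on.
        rules.first().copied().unwrap_or(0)
    }

    fn main() {
        let rules: Vec<u8> = vec![]; // the human-enforced invariant, violated
        println!("{}", first_rule_checked(&rules)); // prints 0, service lives
        // first_rule_unwrap(&rules) would panic and kill the process here.
    }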

As someone who works in formal verification of crypto systems, watching people like you advocate for hope-and-prayer development methodology is astonishing.

However, I understand why we’re still having this debate. It’s the same debate that’s been occurring for the same reasons for decades.

Doing things correctly is mentally more difficult, and so people jump through ridiculous rhetorical hoops to justify why they will not — or quite often, mentally cannot — perform that intellectual labor.

It’s a disheartening lack of craftsmanship and industry accountability, but it’s nothing new.


I do not understand what gave you the impression that I was advocating for "hope and prayers". I'm advocating for not relying on one level of abstraction being flawless so we can build perfect logic on top of it. I'm advocating for not handling everything in a single layer. That FL2 program at Cloudflare encountered an error condition and bailed out, and that's fine. What is not fine is that the supervisor did not fail open.
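
To sketch roughly what I mean by failing open (all names and thresholds invented, nothing like Cloudflare's actual code, just the shape of the idea):

    struct Supervisor {
        consecutive_failures: u32,
        trip_at: u32,
    }

    impl Supervisor {
        fn handle(&mut self, req: u32) -> String {
            if self.consecutive_failures >= self.trip_at {
                // Fail open: the module is sick, requests still succeed.
                return format!("req {req}: bot module bypassed");
            }
            match run_bot_module(req) {
                Ok(score) => {
                    self.consecutive_failures = 0;
                    format!("req {req}: bot score {score}")
                }
                Err(e) => {
                    self.consecutive_failures += 1;
                    format!("req {req}: module error ({e}), served without score")
                }
            }
        }
    }

    // Stand-in for the failing module, like a bad config push.
    fn run_bot_module(_req: u32) -> Result<u8, &'static str> {
        Err("feature file larger than expected")
    }

    fn main() {
        let mut sup = Supervisor { consecutive_failures: 0, trip_at: 3 };
        for req in 1..=5 {
            println!("{}", sup.handle(req));
        }
    }

After a few consecutive failures the supervisor stops routing traffic through the module, so requests degrade instead of dying.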

The opposing views here are not "hope and prayers" vs "good engineering"; they are assuming things will fail at every stage vs assuming one can build a flawless layer of abstraction on top of which to build.

Resilient systems trump "correct" systems, and I would pick a system designed under the assumption that fake errors will be injected regularly, that processes will be killed at random, that entire racks of machines will be unplugged at any time, and that whole datacenters will be taken off the grid for fun, over a system that has been "proven correct", any day. I thought this was common knowledge.
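
For instance, a toy restart loop under the "processes will be killed at random" assumption (the ./worker binary is a placeholder):

    use std::process::Command;
    use std::thread::sleep;
    use std::time::Duration;

    fn main() {
        loop {
            // "./worker" is a placeholder binary name.
            match Command::new("./worker").status() {
                Ok(status) if status.success() => break, // clean exit
                Ok(status) => eprintln!("worker died ({status}); restarting"),
                Err(e) => eprintln!("could not spawn worker ({e}); retrying"),
            }
            sleep(Duration::from_secs(1));
        }
    }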

Of course I'm not arguing against proving that software is correct. I would actually argue that some formal methods would come in handy to model these kinds of systemic failures and reveal the worst cases with the largest blast radius.

But considering the case at hand: the code for that FL2 bot had an assertion regarding the size of received data, and that was a valid assertion; the process decided to panic, and that was the right decision. What was not right was the lack of instrumentation that should have made these failures obvious, and the fact that user queries failed when that non-essential bot failed instead of bypassing it.
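
Sketching those two missing pieces with invented names and thresholds: the assertion still fires, but each panic is counted where a dashboard can see it, and the request is served without the bot verdict:

    use std::panic;
    use std::sync::atomic::{AtomicU64, Ordering};

    // Invented metric; stands in for whatever the real telemetry is.
    static BOT_PANICS: AtomicU64 = AtomicU64::new(0);

    fn main() {
        // Instrumentation: every panic bumps a counter a dashboard can
        // alert on, with the message preserved in the logs.
        let default_hook = panic::take_hook();
        panic::set_hook(Box::new(move |info| {
            BOT_PANICS.fetch_add(1, Ordering::Relaxed);
            eprintln!("bot module panic: {info}");
            default_hook(info);
        }));

        // Bypass: the assertion still fires, but the request survives.
        let verdict = panic::catch_unwind(|| {
            let data_len = 500; // oversized feature file
            assert!(data_len <= 200, "feature file larger than expected");
            "scored"
        })
        .unwrap_or("bypassed");
        println!("request {verdict}, panics so far: {}",
                 BOT_PANICS.load(Ordering::Relaxed));
    }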
