Hacker News

In that scenario the machine would become aware that it can't reach the storage service either, no? In which case the host can terminate the program, or the network can break all the connections between them, or whatever. By default I'd think the lease shouldn't be broken until the network partition is resolved; the storage system could still have a timeout for breaking the lease in that scenario if you really want, but then it would come with a time-based guarantee that the program isn't running anymore, no?
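To sketch the idea in this comment (all names here are mine, not from the post or from any real lock service): the client keeps renewing its lease, and the moment a renewal fails, e.g. because of a partition, it stops doing work on its own, well before the server-side expiry. That's what would give the storage side a time-based guarantee that the old holder is no longer running.

```python
import time

LEASE_SECS = 0.2  # hypothetical lease duration

class LeaseClient:
    """Toy client that only works while it can renew its lease."""

    def __init__(self, reachable=True):
        self.reachable = reachable  # simulated network state
        self.expiry = 0.0

    def renew(self):
        if not self.reachable:  # simulated partition: renewal fails
            return False
        self.expiry = time.monotonic() + LEASE_SECS
        return True

    def do_work(self, steps):
        done = 0
        for _ in range(steps):
            # Abort the moment we can't renew; the storage side can then
            # safely break the lease once LEASE_SECS have elapsed, knowing
            # the client stopped itself before that.
            if not self.renew():
                break
            done += 1
        return done

print(LeaseClient(reachable=True).do_work(3))   # 3: all steps run
print(LeaseClient(reachable=False).do_work(3))  # 0: aborts immediately
```

The catch, as the rest of the thread gets into, is that "stops on its own before the expiry" quietly assumes the client's clock and scheduler cooperate.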


Everything you're saying is plausible somewhere in the absurdly large search space of possible scenarios. The author's premise, however, is rooted in the specific scenario they lay out, with historical supporting examples you can look into. Even then, the premise before all that was essentially: Redlock does not do what people might expect of a distributed lock. By the way, I do have responses to your questions, but in these sorts of discussions I find there can always be an objection to an objection to... etc. The "sense" (or flavor) in this case is that "we are taking a complex topic too lightly". In fact, I should probably continue reading the author's book (DDIA) at some point...


> The "sense" (or flavor) in this case is that "we are taking a complex topic too lightly".

I get that -- and honestly, I'm not expecting a treatise on distributed consensus here. But what took me aback was that the blog post didn't even attempt to address the fact that the premise, at first glance, looks glaringly broken. If he'd said even a single sentence like "it's {difficult/infeasible/impossible} to design a client that will never continue execution past a timeout", it would've been fine, and I'd have happily moved along. But as written, it reads a bit like: "we design a ticking time bomb that we can't turn off; how can we make sure we don't forget to reset the timer every time?"... without bothering to say why we should be digging ourselves into such a hole in the first place.


Yeah, that makes sense now. Personally, I think I've simply seen that design around a bunch, but good on you for questioning it and calling it out -- it's also plausible that my own headcanon doesn't check out.


Thanks, yeah. For what it's worth, part of what led me to leave this comment is that when he wrote "the code above is broken", I stared at it and for the life of me couldn't see why. Because, of course, the code was lying: there was no mention of leases or timeouts. Having a "lease" suddenly pulled out of nowhere really felt like a fast one being pulled on me (and unfairly so!), hence I decided to actually leave the comment and ask what the basis for this hidden time bomb even was in the first place. If the code had said leaseLock(filename, timeout), the bug would've been glaringly obvious, and far fewer people would have been surprised by it.
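To make that concrete, here's a minimal sketch (in Python; LeaseLock and its methods are hypothetical names I made up, not the post's actual API) of what an explicit leaseLock-style signature surfaces: the lock can expire out from under you in the middle of the "critical section".

```python
import time

class LeaseLock:
    """Toy lock that silently expires `timeout` seconds after acquisition."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.expiry = None

    def acquire(self):
        self.expiry = time.monotonic() + self.timeout

    def still_held(self):
        # The bug the post is about: nothing actually stops the client
        # from continuing past this point (e.g. after a long GC pause);
        # the lease has simply expired underneath it.
        return time.monotonic() < self.expiry

lock = LeaseLock(timeout=0.05)
lock.acquire()
print(lock.still_held())   # True right after acquiring
time.sleep(0.1)            # simulate a pause: GC, swap, network stall...
print(lock.still_held())   # False: the lease expired mid-critical-section
```

With the timeout visible in the API, the failure mode is right there in the signature, which is exactly the point about leaseLock(filename, timeout) above.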

Also for what it's worth, I can guess what some of the answers might be. For example, it's possible you'd need very precise timing facilities that aren't always available in order to guarantee high throughput with correctness (like Google Spanner's TrueTime). Or it might be that doing so requires a trade-off between availability and partition tolerance that isn't justified in some applications. But I'm curious what the answer actually is, rather than just (semi-)random guesses as to what it could be.



