> the single-biggest booster for me as a practitioner. Can you instantiate this ...

agentultra · on May 14, 2019

About 3-4 years ago I was working on an open source cloud platform for a company deploying a public cloud. There was a particular error that sometimes happened in our production environment where a rebooted VM would come up but couldn't connect to the network.

We tracked it down to a race between two services in the data plane. It turns out the VM controller wouldn't wait for the network controller to unplug a virtual interface before requesting the interface to be plugged back in. There was a lack of co-ordination happening. However it only happened when the network component was under heavy enough load that it would take too long to respond before the VM service finished rebooting the VM -- usually it was fast enough and the error wouldn't appear.

I managed to model this interaction at a high level in TLA+. From there I had suspected that the error was in the mutex locking code in the async library this system depended on so we refined the model pretty close to that implementation. As I recall we found that the mutex code wasn't the culprit -- a fine result. We ended up implementing some light-weight co-ordination mechanism to ensure that the VM service waits to acknowledge the progress of the network service.

Since then I've continued to use TLA+. I find that programming languages are insufficiently expressive enough to describe high-level interactions between other processes, network events, and humans.

ilaksh · on May 14, 2019

I'm sorry but I don't see why the TLA+ modeling was necessary. You said that you noticed the lack of coordination before that. From your description, it seems like the mutex thing was a diversion. Anyway, a mutex would not necessarily be adequate for coordination (and many types of mutexes will not work between processes at all).

So to me it seems that you could have gone straight to the lightweight coordination mechanism without the TLA+ model. And anyway, if there was a problem with the mutex, you could test that theory by doing additional logging or an experiment around the mutex functionality.

agentultra · on May 14, 2019

Sorry for what? It was my first time using such a tool. I found it useful for understanding the system. I've improved my understanding of TLA+ since then and it has been valuable.