We have something very similar[1] we use in the Apache Cassandra project to test complex cluster behaviours.
We appear to use exactly the same basic technique, using byte weaving to intercept concurrency primitives such as synchronized, LockSupport etc to pause the system thread and run them on some schedule.
We only currently run (deterministic) probabilistic traces though, we can’t search the interleaving space. But the traces for a whole cluster are extremely complex and probably unsearchable.
I have been meaning to publish it for broader consumption for years now, but there’s always something more important to do. It’s great to see some dedicated efforts in this space.
It seems that all controlled threads are wrapped with `InterceptibleThread` in the Cassandra simulator. Does this work for ThreadPools (e.g., ForkJoinPool) as well? We had a hard time intercepting thread objects because they are used by the language runtime (e.g., GC threads) as well and we don’t want to interfere with them. Additionally, modifying application code just track thread creation isn’t ideal. To work around this, we came up with this combination of JVMTi and Java Agent solution and we use JVMTi to monitor thread creation and termination.
As for searching schedules, yes, it is hard to search all possible schedules. However, it turns out many searching algorithms such as probabilistic concurrency testing[1] or partial order sampling[2] are still better than random walk. So it is worth to give them a try.
We do currently require all threads to be created by one of our own factories, but that's primarily because this grew out of a non-byte weaving approach (where we explicitly replaced our concurrency primitives). Looking at the class now, all of its state could easily be stashed in either global or ThreadLocal variables, so I don't see anything that would stop us working with FJP etc.
This would certainly be necessary, but don't you anyway need to rewrite the application to trap synchronised, volatile, atomic accesses etc? It doesn't seem all that different to rewrite calls to Thread::start. The issue of JVM threads is perhaps a little trickier, but I am not averse to some ugly integrations. Just take a look at how we make RNGs deterministic
> So it is worth to give them a try.
Thanks for the tips! I am not sure when I will have time to apply these techniques to our simulator, but they are no doubt valuable for the protocol simulations I am relying on today, so maybe I will have a justification to explore them sometime soon.
Really cool work too. I hope it manages to make its way into more hands, so that this technique can be used more widely.
We appear to use exactly the same basic technique, using byte weaving to intercept concurrency primitives such as synchronized, LockSupport etc to pause the system thread and run them on some schedule.
We only currently run (deterministic) probabilistic traces though, we can’t search the interleaving space. But the traces for a whole cluster are extremely complex and probably unsearchable.
I have been meaning to publish it for broader consumption for years now, but there’s always something more important to do. It’s great to see some dedicated efforts in this space.
[1] https://github.com/apache/cassandra/tree/trunk/test/simulato...