Your take matches the commentary of a friend in the field from about 2000. I do hope that the end of Moore's law brings improvement in this sort of design, and in the tools it requires.
I do see some high-speed, low-power networking hardware moving this way: "Router Using Quasi-Delay-Insensitive Asynchronous Design"
Yep, I saw that go by, and a lot of my work is heavily influenced by Andrew Lines. But what I've been seeing is that QDI is really bad at arithmetic because the acknowledgement requirements turn XORs into a nightmare hairball of signal dependencies. But QDI is really good at complex control.
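To make the XOR complaint a bit more concrete, here is a rough behavioural sketch in Python. It is not a circuit and it does not model C-elements or the acknowledge path. It only assumes the usual dual-rail encoding (a true rail and a false rail per bit, both low meaning "neutral") and shows that each XOR output rail has to look at every input rail, with the number of product terms doubling for every extra input. The function names are made up for illustration.

```python
from itertools import product

# Dual-rail encoding assumed here: (true_rail, false_rail).
# (0, 0) is the neutral "no data yet" state; (1, 1) is illegal.
NEUTRAL = (0, 0)


def encode(bit):
    """Encode a Boolean value as a dual-rail pair."""
    return (1, 0) if bit else (0, 1)


def dual_rail_xor(a, b):
    """Two-input dual-rail XOR; both output rails read all four input rails."""
    a_t, a_f = a
    b_t, b_f = b
    out_t = (a_t & b_f) | (a_f & b_t)  # inputs differ
    out_f = (a_t & b_t) | (a_f & b_f)  # inputs match
    return (out_t, out_f)


def xor_minterms(n):
    """Product terms per output rail for a flat n-input dual-rail XOR.

    No input is ever a don't-care for XOR, so every one of the 2**n
    input combinations becomes an n-literal term on one rail or the other.
    """
    true_terms = sum(
        1 for bits in product((0, 1), repeat=n) if sum(bits) % 2 == 1
    )
    return true_terms, 2**n - true_terms


if __name__ == "__main__":
    print(dual_rail_xor(encode(1), encode(0)))
    print(dual_rail_xor(NEUTRAL, encode(1)))  # stays neutral until both arrive
    print(xor_minterms(4))
```

A real QDI implementation also has to make sure every input transition is acknowledged by some output transition, which this model leaves out entirely.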
That's not true of every approach. Bundled data, dual-rail domino QDI, and various commercial groups such as Wave Computing and Eta Compute each have their own asynchronous flavors, often optimized for arithmetic operations.
I was specifically talking about dual-rail domino QDI. When you compare the typical dual-rail domino QDI adder found in Andrew Lines' thesis against a typical clocked carry-lookahead adder like Kogge-Stone, it is worse by factors of between 2 and 3 in energy, area, and throughput.
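For readers who have not met the clocked side of that comparison, here is a small bit-level sketch of the Kogge-Stone parallel-prefix scheme in Python. It only models the logic (generate/propagate signals combined over log2(width) stages). It says nothing about energy, area, or throughput, and the function name and the `width` parameter are my own choices for illustration.

```python
def kogge_stone_add(a, b, width=8):
    """Add two unsigned integers modulo 2**width using Kogge-Stone prefix logic.

    Each Python integer holds one signal per bit position, so a shift by d
    plays the role of "look at the group d positions to the right".
    """
    mask = (1 << width) - 1
    g = a & b & mask        # generate: this bit produces a carry by itself
    p = (a ^ b) & mask      # propagate: this bit passes an incoming carry on
    half_sum = p            # keep the per-bit sum before carries are applied

    d = 1
    while d < width:
        # Combine each group with the group ending d positions below it.
        g = (g | (p & (g << d))) & mask
        p = (p & (p << d)) & mask
        d *= 2

    # After the prefix stages, bit i of g is the carry out of bit i,
    # so the carry into bit i+1 is g shifted left by one (carry-in is 0).
    carries = (g << 1) & mask
    return half_sum ^ carries


if __name__ == "__main__":
    print(kogge_stone_add(100, 57))
    print(kogge_stone_add(200, 100))  # wraps modulo 256
```

The point of the structure is that the carry for every bit is ready after a logarithmic number of stages rather than rippling through all of them.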
Bundled data is simple control with the data clocked from that control. It's very much keeping arithmetic away from the QDI circuitry.
Though to be fair, I haven't seen a good examination of how pass transistor logic might affect QDI arithmetic circuitry, so maybe there is hope.
Router Using Quasi-Delay-Insensitive Asynchronous Design: https://dl.acm.org/citation.cfm?id=2634996&preflayout=flat