As someone who is working on an asynchronous process myself, I should remind you that asynchronous is more a design level choice rather than a magic bullet that makes everything better.
In particular, in CPUs with a big centralized register file there can be significant overhead to having an asynchronous CPU.
There are certain architectural cases in which it can be a killer advantage or other cases in which it is pretty much the only way forward (e.g. in a 3D chip it can be quite difficult to distribute a high speed, high quality/low skew+jitter clock)
The overhead you are talking about is scoreboard, mainly - clockless CPU cannot assume readiness of the results when issuing commands, and it needs scoreboard of what is ready to issue commands without conflicts.
But scoreboarding allows for most gains of out-of-order execution and is cheap.
You may have associative memory for speculation (most probably, you refer to it above), but you need to have scoreboard for operation to be issued without conflict and with operands ready.
Not just inductance, but capacitance between the layers themselves too since you've got to have an insulator between them. All the fun things you get in 2d from nanometer scale devices are now compounded in another degree of freedom. The only thing I don't think you have to work around is quantum tunneling between layers because they'll probably be too far apart still to need that level of work.
In particular, in CPUs with a big centralized register file there can be significant overhead to having an asynchronous CPU.
There are certain architectural cases in which it can be a killer advantage or other cases in which it is pretty much the only way forward (e.g. in a 3D chip it can be quite difficult to distribute a high speed, high quality/low skew+jitter clock)