I think another argument is that CoT is simply this method's recurrent loop unrolled, with an unembedding -> embedding round trip inserted between passes through the decoder during decoding.

So at best, a recurrent loop only saves you the embedding -> unembedding step at each token, which is cheap relative to a pass through the full depth of the decoder stack.
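
To put a rough number on that, here is a back-of-the-envelope sketch. The model sizes are hypothetical (picked to look like a typical 7B-class decoder, not taken from the method under discussion), and it assumes about 12 * d_model^2 weights per decoder block, 2 FLOPs per weight per token, a free embedding lookup, and no attention-over-context cost.

```python
def per_token_flops(d_model: int, n_layers: int, vocab_size: int) -> tuple[int, int]:
    """Rough per-token forward FLOPs for (decoder blocks, unembedding)."""
    # Assumption: each block holds ~12 * d_model^2 weights
    # (4 * d^2 for attention projections + 8 * d^2 for a 4x-wide MLP),
    # and each weight costs 2 FLOPs (multiply + add) per token.
    block_flops = n_layers * 2 * 12 * d_model**2
    # Unembedding is one d_model x vocab_size matmul; the embedding
    # lookup is treated as free.
    unembed_flops = 2 * d_model * vocab_size
    return block_flops, unembed_flops


# Hypothetical sizes, for illustration only.
blocks, unembed = per_token_flops(d_model=4096, n_layers=32, vocab_size=32000)
ratio = unembed / blocks
print(f"decoder blocks: {blocks:.3e} FLOPs/token")
print(f"unembedding:    {unembed:.3e} FLOPs/token")
print(f"unembedding / decoder blocks = {ratio:.1%}")
```

Under these assumptions the unembedding comes out at a few percent of the per-token cost of the blocks, so skipping it is a small saving. A much larger vocabulary or a much shallower model would shift the ratio.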