This paper uses a heavily modified encoder-only BERT. The cited forward-pass time on a single 4090 is 13 seconds after swapping softmax for a different kernel (21 seconds with softmax). There's no non-FHE baseline, but the model has only about 35 million parameters, so at FP16 you'd expect it to run a few times faster than BERT-base (~110M parameters) just from the parameter count. On a 4090, a model that size probably sustains something like 100k-1M tokens per second with some batching, which suggests the ~6 orders of magnitude of FHE overhead is still about right.
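The overhead estimate can be sanity-checked with back-of-envelope arithmetic. The sequence length and the plaintext throughput range below are assumptions for illustration, not numbers from the paper:

```python
import math

# All inputs are assumptions for a rough estimate, not measurements
# reported in the paper.
fhe_seconds_per_forward = 13.0  # cited FHE forward pass on a 4090 (softmax swapped out)
seq_len = 128                   # assumed sequence length per forward pass

# Effective FHE throughput: ~10 tokens/s under these assumptions.
fhe_tokens_per_sec = seq_len / fhe_seconds_per_forward

# Assumed plaintext FP16 throughput range for a ~35M-param encoder with batching.
for plaintext_tps in (1e5, 1e6):
    overhead = plaintext_tps / fhe_tokens_per_sec
    print(f"plaintext {plaintext_tps:.0e} tok/s -> overhead ~10^{math.log10(overhead):.1f}x")
```

Under these particular assumptions the overhead lands at roughly 10^4-10^5x; a shorter sequence or a faster plaintext baseline pushes it toward the ~10^6x ballpark.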