The fake claim here is compression. The results in the repo are likely real, but they're done by running the full transformer teacher model every time. This doesn't achieve anything novel.
That's not how the method works. The full transformer is only needed once, to extract the activation fields, and that step can even be done offline; after that the teacher can be discarded entirely. The compression claim refers to the size of the learned field representation plus the small student head that operates directly on it. There's no fake claim there: inference with the student does not involve the transformer at all.
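For concreteness, here is a minimal sketch of that pipeline as described. It is not taken from the repository: the names, file paths, and the HuggingFace-style teacher call are my own assumptions. The teacher runs once (offline) to produce an activation field, the field is saved to disk, and a tiny head is trained directly on it.

```python
import torch
import torch.nn as nn


def extract_field(teacher, inputs, layer_index=-1):
    """Run the full teacher ONCE, offline, and save its hidden activations (the 'field')."""
    teacher.eval()
    with torch.no_grad():
        # Assumes a HuggingFace-style model that can return hidden states.
        out = teacher(**inputs, output_hidden_states=True)
        field = out.hidden_states[layer_index]       # (batch, seq, dim)
    torch.save(field, "field.pt")                    # hypothetical path; teacher never needed again
    return field


class StudentHead(nn.Module):
    """Tiny head that reads the saved field, not raw tokens."""
    def __init__(self, field_dim, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(field_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, field):
        return self.net(field.mean(dim=1))           # pool over sequence positions, then classify


def train_student(field, labels, num_classes, epochs=10):
    """Train the student head directly on the saved field; the teacher is not involved."""
    student = StudentHead(field.shape[-1], num_classes)
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(student(field), labels)
        loss.backward()
        opt.step()
    torch.save(student.state_dict(), "student.pt")
    return student
```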
If you look at the student-only scripts in the repo, those runs never load the teacher. That's the novel part.
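The corresponding student-only inference path would then look roughly like this (again hypothetical, reusing the StudentHead class from the sketch above): it loads only the saved field and the trained head, and never imports or instantiates the transformer.

```python
import torch

# Student-only inference: only the saved field and the tiny head are loaded.
# The transformer teacher is never imported or instantiated here.
field = torch.load("field.pt")
student = StudentHead(field.shape[-1], num_classes=10)   # num_classes is an assumed example value
student.load_state_dict(torch.load("student.pt"))
student.eval()
with torch.no_grad():
    predictions = student(field).argmax(dim=-1)
```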
Can you point to the code in the repository that trains such a tiny student model, one that operates independently of the big teacher model after training? The repository has no such code.