I wrote about it in the blog and linked some papers! I also wrote about it here - https://unsloth.ai/blog/dynamic-4bit - one has to inspect the activation and weight quantization errors!
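Inspecting the weight quantization error can be sketched roughly like this - a minimal, hypothetical illustration using plain symmetric round-to-nearest 4-bit quantization in PyTorch (the actual dynamic quants use more sophisticated schemes like NF4, and `weight_quant_error` is just a made-up helper name):

```python
import torch

def weight_quant_error(W: torch.Tensor, bits: int = 4) -> float:
    # Symmetric round-to-nearest quantization, purely illustrative.
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = W.abs().max() / qmax
    Wq = torch.clamp(torch.round(W / scale), -qmax - 1, qmax) * scale
    # Relative Frobenius-norm error between original and dequantized weights.
    return (W - Wq).norm().item() / W.norm().item()

torch.manual_seed(0)
W = torch.randn(256, 256)  # stand-in for one layer's weight matrix
err = weight_quant_error(W)
print(f"relative quantization error: {err:.4f}")
```

The idea is that layers with unusually high error like this are the ones you'd keep in higher precision.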
Oh apologies I got confused - it's because when we calculate our dynamic quants, we have to do it on the fixed model!
For example, in Phi 3 the end-of-sentence token was wrong - if we calibrated on the broken version, our quants would be calibrated incorrectly, since chatting with the model uses the actual correct token.
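The mismatch can be sketched like this - the token ids below are made up for illustration, not Phi 3's real ids:

```python
# Hypothetical illustration: a wrong end-of-sentence token changes the
# calibration inputs, so the activation statistics (and hence the chosen
# quantization scales) are computed for inputs the model never sees in use.
WRONG_EOS = 11111    # the broken end-of-sentence id (made up)
CORRECT_EOS = 22222  # the id real chats actually append (made up)

prompt_ids = [101, 2054, 2003, 102]      # some tokenized prompt
calib_input = prompt_ids + [WRONG_EOS]   # what calibration on the broken model sees
chat_input = prompt_ids + [CORRECT_EOS]  # what inference actually sees

# Different inputs -> different activations -> mis-calibrated quant scales.
print(calib_input != chat_input)
```

That's why the fixes have to be applied before the quants are computed.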
Ok, so this means your approach doesn't work without first applying the fixes to the vanilla models. What I'm trying to understand is the approach itself. Why, and how, does it work?