So this specific model ships with a default config of 2 experts per token.
So you need roughly two experts loaded in memory per token, which works out to roughly the speed and memory of a 13B model per token.
Only issue is that's per-token. 2 experts are chosen per token, which means if they aren't the same ones as the last token, you need to load the new ones into memory.
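To make the routing part concrete, here's a minimal sketch of top-2 gating. This isn't the model's actual code, and the sizes (8 experts, a tiny hidden dim, random stand-in weights) are just illustrative assumptions. The point is that each token scores every expert but only the 2 winners' weights get used, and the winning pair can change from one token to the next.

```python
# Minimal sketch of top-2 MoE routing (illustrative, not the real implementation).
import numpy as np

rng = np.random.default_rng(0)
num_experts, hidden = 8, 16   # assumed sizes for the sketch

# Gating weights and one tiny "expert" matrix each (stand-ins for real FFNs).
gate_w = rng.standard_normal((hidden, num_experts))
expert_w = rng.standard_normal((num_experts, hidden, hidden))

def moe_layer(token_vec):
    """Route a single token through its top-2 experts."""
    logits = token_vec @ gate_w                    # score all 8 experts
    top2 = np.argsort(logits)[-2:]                 # keep only the 2 best
    weights = np.exp(logits[top2])
    weights /= weights.sum()                       # softmax over the chosen 2
    # Only these 2 experts' weights are actually needed for this token.
    out = sum(w * (token_vec @ expert_w[i]) for w, i in zip(weights, top2))
    return out, top2

# Different tokens can land on different experts, so the "active" pair changes.
for t in range(3):
    token = rng.standard_normal(hidden)
    _, chosen = moe_layer(token)
    print(f"token {t}: experts {sorted(chosen.tolist())}")
```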
So yeah, to not be disk-limited you'd need roughly 8 times the memory of a single expert kept resident, and it would still only run at the speed of a 13B model.
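Rough back-of-envelope for that, assuming each expert is about 7B parameters and ignoring the non-expert layers (attention, embeddings) that are shared across experts, so real totals will differ somewhat:

```python
# Back-of-envelope only; shared (non-expert) weights are not counted here.
expert_params = 7e9        # assumed size of one expert, ~7B
active_per_token = 2       # default top-2 routing
total_experts = 8

active = active_per_token * expert_params   # what you compute with per token
resident = total_experts * expert_params    # what you'd keep in memory

print(f"active per token: ~{active / 1e9:.0f}B params (13B-class speed)")
print(f"resident to avoid disk: ~{resident / 1e9:.0f}B params (~8x one expert)")
```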
~~~Note on quantization: iirc smaller models lose more performance when quantized vs larger models, so this would be the speed of a 4-bit 13B model but with the quality penalty of a 4-bit 7B model.~~~ Actually I have zero idea how quantization scales for MoE; I imagine it has the penalty I mentioned, but that's pure speculation.