
Good point -- hopefully the quality impact is still worth it; that remains to be seen. Agree on the size -- hopefully something they'll keep in mind for future models.


If it's better than the equivalent 30B model, that's still a huge achievement.

Llama.cpp's Q2_K quant is 2.5625 bpw with perplexity just barely better than the next step down: https://github.com/ggerganov/llama.cpp/pull/1684
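For a rough sense of what that bpw figure means on disk, here's a back-of-the-envelope sketch in Python. It assumes a hypothetical 30B-parameter model, treats the bpw as already including the quantization block scales, and ignores tokenizer/metadata overhead:

    # Rough file-size estimate from bits-per-weight (bpw).
    # Assumes: 30B parameters (hypothetical), bpw already folds in block
    # scales, and tokenizer/metadata overhead is ignored.
    def approx_size_gb(n_params: float, bpw: float) -> float:
        """Approximate model file size in gigabytes (10^9 bytes)."""
        return n_params * bpw / 8 / 1e9

    n_params = 30e9
    print(f"fp16:  {approx_size_gb(n_params, 16.0):.1f} GB")    # ~60.0 GB
    print(f"Q2_K:  {approx_size_gb(n_params, 2.5625):.1f} GB")  # ~9.6 GB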

But subjectively, the Q2 quant "feels" worse than its high wikitext perplexity would suggest.
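For anyone unfamiliar with what the wikitext perplexity numbers measure, here's a minimal sketch of the standard definition: perplexity is the exponentiated mean negative log-likelihood over the evaluation tokens. The log-probabilities below are placeholder values, not real model output:

    import math

    # Perplexity = exp(mean negative log-likelihood over evaluation tokens).
    # Placeholder token log-probabilities, not real model output.
    token_logprobs = [-2.1, -0.4, -3.7, -1.2, -0.9]

    nll = -sum(token_logprobs) / len(token_logprobs)
    ppl = math.exp(nll)
    print(f"perplexity = {ppl:.2f}")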

That's apples to oranges, as this quantization scheme is different from Q2_K, but I just hope the quality hit in practice isn't too bad.



