Isn't that related to architecture? The most recent GPUs and tensor processors have native support for 4-bit (partially) and 8-bit integers, whereas older GPUs take a noticeable performance hit for int8 versus fp16/fp32.


Ah, but LLM.int8 (e.g. as in huggingface transformers) isn't actually plain int8; it's a mixed-precision encoding scheme that is nominally eight bits per parameter. This means custom CUDA kernels, etc. Those kernels could be improved, but without hardware support it's always going to be slow.
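
To make the decomposition concrete, here's a minimal numpy sketch of the idea (outlier threshold and vector-wise scaling as described in the LLM.int8() paper; the real bitsandbytes kernels are far more involved than this):

    import numpy as np

    def mixed_precision_matmul(x, w, threshold=6.0):
        # Feature columns of x whose max magnitude exceeds the threshold are
        # treated as outliers and kept in full precision; everything else goes
        # through absmax int8 quantization. Rough sketch, not the real kernels.
        outlier = np.abs(x).max(axis=0) > threshold
        regular = ~outlier

        # int8 path: per-row scale for x, per-column scale for w
        x_r, w_r = x[:, regular], w[regular, :]
        sx = np.maximum(np.abs(x_r).max(axis=1, keepdims=True), 1e-8) / 127.0
        sw = np.maximum(np.abs(w_r).max(axis=0, keepdims=True), 1e-8) / 127.0
        x_q = np.round(x_r / sx).astype(np.int8)
        w_q = np.round(w_r / sw).astype(np.int8)
        int8_part = (x_q.astype(np.int32) @ w_q.astype(np.int32)) * (sx * sw)

        # high-precision path for the outlier columns
        fp_part = x[:, outlier] @ w[outlier, :]

        return int8_part + fp_part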

Straight int8 quantization generally does not work for post-training quantization of transformers. The distribution of weights includes a significant number of outlier values that seem to be important to model performance. Apparently quantization-aware training can improve things significantly, but I haven't seen any developments for llama yet.
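
A toy example (made-up numbers, not actual llama weights) shows why those outliers hurt plain absmax quantization: one large value sets the scale, so nearly all of the remaining weights collapse into a handful of int8 buckets.

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, size=10_000)   # "typical" small weights
    w[:5] = 8.0                              # a few hypothetical outliers

    scale = np.abs(w).max() / 127.0          # scale is dictated by the outliers
    w_q = np.round(w / scale).astype(np.int8)
    w_dq = w_q.astype(np.float32) * scale

    print("int8 levels actually used:", np.unique(w_q).size)
    print("mean abs error on non-outliers:", np.abs(w_dq[5:] - w[5:]).mean())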

Interestingly, on the 4-bit front, NVIDIA has chosen to remove int4 support from the next-gen Hopper series. I'm not sure folks realize the industry has already moved on. FP8 feels like a bit of a hack, but I like it!
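
A quick sketch of what FP8 buys you over int8, based on my reading of the E4M3 format in the NVIDIA/Arm/Intel FP8 proposal (4 exponent bits with bias 7, 3 mantissa bits, no infinities, one NaN pattern), so treat the details as an approximation: the grid is non-uniform, with fine steps near zero and a max finite value of 448, which suits outlier-heavy distributions better than int8's uniform [-127, 127].

    # Enumerate the representable FP8 E4M3 values (assumed encoding, see above).
    values = set()
    for e in range(16):            # exponent field
        for m in range(8):         # mantissa field
            if e == 15 and m == 7:
                continue           # S.1111.111 encodes NaN
            if e == 0:
                v = (m / 8) * 2.0 ** -6           # subnormals
            else:
                v = (1 + m / 8) * 2.0 ** (e - 7)  # normals
            values.update((v, -v))

    print("finite values:", len(values))     # 253 (+0 and -0 collapse to one)
    print("max finite value:", max(values))  # 448.0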
