Hacker News

Compete with llama.cpp? Like transformers LLaMA [0], exllama [1] (really fast), or lit-llama [2]?

exllama is both really memory-efficient and really fast.

[0] https://huggingface.co/docs/transformers/main/model_doc/llam...

[1] https://github.com/turboderp/exllama

[2] https://github.com/Lightning-AI/lit-llama

EDIT: Or do you mean CUDA? Because yeah, it's such a shame AMD's ROCm is so bad that even geohot gave up. Its examples don't even run without crashing.

https://github.com/RadeonOpenCompute/ROCm/issues/2198#issuec...



Also https://github.com/kayvr/TokenHawk, a WebGPU implementation of LLaMA.

edit: Note that this is my project.


Thanks for the tip about exllama. I've been on the lookout for a readable Python implementation to play with that is also fast and supports quantized models.
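For anyone curious what "support for quantized models" buys you: exllama runs GPTQ-style 4-bit weights, where each group of weights is stored as 4-bit integers plus a per-group scale and zero point. Here's a minimal NumPy sketch of that storage scheme (this is just the quantize/dequantize round trip for illustration, not exllama's actual kernels, and the group size and function names are my own):

```python
import numpy as np

def quantize_4bit(w, group_size=8):
    # Per-group affine 4-bit quantization: map each group's [min, max]
    # range onto the integers 0..15, keeping a float scale and zero
    # point per group so the weights can be reconstructed.
    w = w.reshape(-1, group_size)
    zero = w.min(axis=1, keepdims=True)
    span = w.max(axis=1, keepdims=True) - zero
    scale = np.where(span == 0, 1.0, span) / 15.0
    q = np.clip(np.round((w - zero) / scale), 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_4bit(q, scale, zero):
    # Reconstruct approximate float weights from the packed ints.
    return q.astype(np.float32) * scale + zero

np.random.seed(0)
w = np.random.randn(64).astype(np.float32)
q, s, z = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, z).reshape(-1)
err = np.abs(w - w_hat).max()  # worst case ~half a quantization step
```

The reconstruction error is bounded by half a step (the group's range divided by 2·15), which is why the generated text stays close to the full-precision model while the weights take roughly a quarter of the memory of fp16.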




