You can run it on a Raspberry Pi or any old hardware, limited only by RAM and your patience. It is a lean codebase that is easy to get up and running, and it is designed to be embedded in any app without sacrificing performance.
They are also innovating (or at least implementing innovations from papers) on different ways to fit bigger models into consumer hardware, making them run faster and with better outputs.
PyTorch and other libs (bitsandbytes) can be horrible to set up with the correct versions, and updating the repo is painful. PyTorch projects require a hefty GPU or enormous CPU+RAM resources, while llama.cpp can use a GPU but doesn't require one, and it runs smaller models well on any laptop.
ONNX is a generalized ML platform that lets researchers create new models with ease, but once your model is proven to work, it leaves many optimizations on the table. At least for distributing an application that relies on an LLM, it would be easier to ship llama.cpp than ONNX.
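For context, the path from a fresh llama.cpp checkout to a running quantized model is short. A rough sketch of the usual workflow (binary names and flags are from recent CMake-based llama.cpp builds and may differ in older versions; the model filenames are placeholders):

```shell
# Build llama.cpp (CPU-only by default; add -DGGML_CUDA=ON for NVIDIA GPUs)
cmake -B build
cmake --build build --config Release

# Quantize an f16 GGUF model down to 4-bit so it fits in consumer RAM
# (Q4_K_M is a common quality/size trade-off)
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M

# Run a quick prompt against the quantized model
./build/bin/llama-cli -m model-Q4_K_M.gguf -p "Hello" -n 64
```

The quantize step is the main trick behind fitting bigger models on consumer hardware: 4-bit weights take roughly a quarter of the memory of f16, at a modest cost in output quality.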