I don't think you can reduce model size without losing accuracy (though I think quantized GGUFs are great). But the 2 MB figure here refers to the program size, not including a model. It looks like it's a way to run llama.cpp with wasm, plus a Rust server wrapping it.
I like the tiny llama.cpp/examples/server [0] and embed it in FreeChat, but I'm always happy for more tooling options.
Edit: Just checked, the arm64/x86 executable I embed is currently 4.2 MB. FreeChat is 12.1 MB, but the default model is ~3 GB, so I'm not really losing sleep over 2 MB.
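For anyone curious what "embedding the server" looks like in practice, here's a minimal Swift sketch of spawning a bundled llama.cpp server binary from a Mac app. The resource name, model path, and port are my assumptions, not FreeChat's actual code; `-m`, `-c`, and `--port` are real llama.cpp server flags.

```swift
import Foundation

// Locate the server binary bundled inside the app (assumed resource name).
guard let serverURL = Bundle.main.url(forResource: "server", withExtension: nil) else {
    fatalError("bundled server binary not found")
}

let process = Process()
process.executableURL = serverURL
process.arguments = [
    "-m", "/path/to/model.gguf",  // hypothetical model path
    "-c", "2048",                 // context size
    "--port", "8080"              // local port the app talks to over HTTP
]

do {
    try process.run()             // server now accepts chat requests on localhost:8080
} catch {
    print("failed to launch server: \(error)")
}
```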
[0]: https://github.com/ggerganov/llama.cpp/tree/master/examples/...