
What specs do people here recommend for running small models like Llama 3.1 or Mistral-NeMo, etc.?

Also, is it sensible to wait for the newer Mac, AMD, and Nvidia hardware releasing soon?



M4s are releasing in probably a month or two; if you’re going Apple, it might be worth waiting for either those or the price drop on the older models.


You basically need as much RAM as the size of the model file, plus a bit of headroom for the KV cache and activations.
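As a rough illustration, here is a back-of-the-envelope sketch for an 8B-parameter model (e.g. Llama 3.1 8B) at common quantization levels. The numbers are weights only and ignore KV cache and activation overhead, which add a few more GB in practice.

  # Rough weight-memory estimate for an 8B-parameter model at
  # common quantization levels. Ignores KV cache / activations.
  params = 8e9

  bytes_per_param = {
      "fp16": 2.0,
      "q8_0": 1.0,   # ~8 bits per weight
      "q4_0": 0.5,   # ~4 bits per weight
  }

  for quant, bpp in bytes_per_param.items():
      gb = params * bpp / 1024**3
      print(f"{quant}: ~{gb:.1f} GB of weights")

That puts Q4 around 4 GB and FP16 around 15 GB for an 8B model, which lines up with the "RAM ≈ model size" rule of thumb.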


You actually need a lot less than that if you use the mmap option, because then only the activations need to be stored in RAM; the model weights themselves can be read from disk on demand.
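A minimal sketch of the mmap idea, using Python's mmap module on a hypothetical weights file: the kernel pages data in lazily, so resident memory grows with what you actually read, not with the full file size.

  import mmap
  import os

  path = "model.gguf"  # hypothetical path to a GGUF weights file

  with open(path, "rb") as f:
      size = os.fstat(f.fileno()).st_size
      # Map the whole file read-only; nothing is read into RAM yet.
      mm = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)

      # Reading a slice faults in just those pages (~1 MB here),
      # even if the file itself is many GB.
      chunk = mm[:1024 * 1024]
      print(f"mapped {size / 1024**3:.1f} GB, touched {len(chunk) / 1024**2:.0f} MB")

      mm.close()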


Can you say a bit more about this? Based on my non-scientific personal experience on an M1 with 64 GB of memory, that's approximately what it seems to be: if the model is 4 GB in size, loading it and doing inference takes about 4 GB of memory. I've used LM Studio and llamafiles directly and both seem to exhibit this behavior. I believe llamafiles use mmap by default, based on what I've seen jart talk about. LM Studio allows you to "GPU offload" the model by loading it partially or completely into GPU memory, so I'm not sure how mmap figures into that.


How does one set this up?


With ggml, mmap is the default. It isn't a panacea though [0]. Note that most runtimes (like MLX, ONNX, TensorFlow, JAX/XLA, etc.) employ a number of techniques for efficient inference, and mmap is just one of them.
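For a concrete starting point, here is a hedged usage sketch with the llama-cpp-python binding, which exposes the relevant knobs; the parameter names are as I recall them from that library (check your installed version's docs), and the model path is hypothetical.

  from llama_cpp import Llama

  # Sketch of the mmap / offload knobs in llama-cpp-python.
  llm = Llama(
      model_path="./mistral-nemo-q4_0.gguf",  # hypothetical file
      use_mmap=True,     # map weights from disk instead of loading them all into RAM
      use_mlock=False,   # set True to pin pages so they can't be swapped out
      n_gpu_layers=0,    # >0 offloads that many layers to the GPU ("GPU offload" in LM Studio terms)
  )

  out = llm("Q: What does mmap buy you here? A:", max_tokens=32)
  print(out["choices"][0]["text"])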

[0] https://news.ycombinator.com/item?id=35455930


You missed the part where this is slow as hell.



