
What specs do people here recommend for running small models like Llama 3.1 or Mistral-NeMo, etc.?

Also, is it sensible to wait for the newer Mac, AMD, and Nvidia hardware releasing soon?



M4s are releasing in probably a month or two; if you’re going Apple, it might be worth waiting for either those or the price drop on the older models.


You basically need as much RAM as the size of the model file, plus a bit of headroom for the KV cache and activations.
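As a rough illustration, here is a back-of-the-envelope sketch for an 8B-parameter model (e.g. Llama 3.1 8B) at common quantization levels. The numbers are weights only and ignore KV cache and activation overhead, which add a few more GB in practice.

  # Rough weight-memory estimate for an 8B-parameter model at
  # common quantization levels. Ignores KV cache / activations.
  params = 8e9

  bytes_per_param = {
      "fp16": 2.0,
      "q8_0": 1.0,   # ~8 bits per weight
      "q4_0": 0.5,   # ~4 bits per weight
  }

  for quant, bpp in bytes_per_param.items():
      gb = params * bpp / 1024**3
      print(f"{quant}: ~{gb:.1f} GB of weights")

That puts Q4 around 4 GB and FP16 around 15 GB for an 8B model, which lines up with the "RAM ≈ model size" rule of thumb.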


You actually need a lot less than that if you use the mmap option, because then only the activations need to be stored in RAM; the model weights themselves can be read from disk on demand.
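A minimal sketch of the mmap idea, using Python's mmap module on a hypothetical weights file: the kernel pages data in lazily, so resident memory grows with what you actually read, not with the full file size.

  import mmap
  import os

  path = "model.gguf"  # hypothetical path to a GGUF weights file

  with open(path, "rb") as f:
      size = os.fstat(f.fileno()).st_size
      # Map the whole file read-only; nothing is read into RAM yet.
      mm = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)

      # Reading a slice faults in just those pages (~1 MB here),
      # even if the file itself is many GB.
      chunk = mm[:1024 * 1024]
      print(f"mapped {size / 1024**3:.1f} GB, touched {len(chunk) / 1024**2:.0f} MB")

      mm.close()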


Can you say a bit more about this? Based on my non-scientific personal experience on an M1 with 64 GB of memory, that's approximately what it seems to be: if the model is 4 GB in size, loading it and doing inference takes about 4 GB of memory. I've used LM Studio and llamafiles directly and both seem to exhibit this behavior. I believe llamafiles use mmap by default, based on what I've seen jart talk about. LM Studio allows you to "GPU offload" the model by loading it partially or completely into GPU memory, so I'm not sure how mmap figures into that.


How does one set this up?


With ggml, mmap is the default. It isn't a panacea though [0]. Note that most runtimes (like MLX, ONNX, TensorFlow, JAX/XLA, etc.) employ a number of techniques for efficient inference, and mmap is just one of them.
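For a concrete starting point, here is a hedged usage sketch with the llama-cpp-python binding, which exposes the relevant knobs; the parameter names are as I recall them from that library (check your installed version's docs), and the model path is hypothetical.

  from llama_cpp import Llama

  # Sketch of the mmap / offload knobs in llama-cpp-python.
  llm = Llama(
      model_path="./mistral-nemo-q4_0.gguf",  # hypothetical file
      use_mmap=True,     # map weights from disk instead of loading them all into RAM
      use_mlock=False,   # set True to pin pages so they can't be swapped out
      n_gpu_layers=0,    # >0 offloads that many layers to the GPU ("GPU offload" in LM Studio terms)
  )

  out = llm("Q: What does mmap buy you here? A:", max_tokens=32)
  print(out["choices"][0]["text"])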

[0] https://news.ycombinator.com/item?id=35455930


You missed the part where this is slow as hell.



