No public statement from Mistral yet. What we know: - Mixture of Experts archite...

MacsHeadroom · on Dec 8, 2023

That is only 24GB in 4bit.

People are running models 2-4 times that size on local GPUs.

What's more, this will run on a MacBook CPU just fine-- and at an extremely high speed.

brucethemoose2 · on Dec 8, 2023

Yeah, 70B is much larger and fits on a 24GB, admitedly with very lossy quantization.

This is just about right for 24GB. I bet that is intentional on their part.

coder543 · on Dec 8, 2023

> 96GB of weights. You won't be able to run this on your home GPU.

This seems like a non-sequitur. Doesn't MoE select an expert for each token? Presumably, the same expert would frequently be selected for a number of tokens in a row. At that point, you're only running a 7B model, which will easily fit on a GPU. It will be slower when "swapping" experts if you can't fit them all into VRAM at the same time, but it shouldn't be catastrophic for performance in the way that being unable to fit all layers of an LLM is. It's also easy to imagine caching the N most recent experts in VRAM, where N is the largest number that still fits into your VRAM.

ttul · on Dec 8, 2023

Someone smarter will probably correct me, but I don’t think that is how MoE works. With MoE, a feed-forward network assesses the tokens and selects the best two of eight experts to generate the next token. The choice of experts can change with each new token. For example, let’s say you have two experts that are really good at answering physics questions. For some of the generation, those two will be selected. But later on, maybe the context suggests you need two models better suited to generate French language. This is a silly simplification of what I understand to be going on.

wongarsu · on Dec 8, 2023

One viable strategy might be to offload as many experts as possible to the GPU, and evaluate the other ones on the CPU. If you collect some statistics which experts are used most in your use cases and select those for GPU acceleration you might get some cheap but notable speedups over other approaches.

ttul · on Dec 8, 2023

This being said, presumably if you’re running a huge farm of GPUs, you could put each expert onto its own slice of GPUs and orchestrate data to flow between GPUs as needed. I have no idea how you’d do this…

alchemist1e9 · on Dec 8, 2023

Ideally those many GPUs could be on different hosts connected with a commodity interconnect like 10gbe.

If MOE models do well it could be great for commodity hw based distributed inference approaches.

Philpax · on Dec 8, 2023

Yes, that's more or less it - there's no guarantee that the chosen expert will still be used for the next token, so you'll need to have all of them on hand at any given moment.

read_if_gay_ · on Dec 8, 2023

however, if you need to swap experts on each token, you might as well run on cpu.

tarruda · on Dec 8, 2023

> Presumably, the same expert would frequently be selected for a number of tokens in a row

In other words, assuming you ask a coding question and there's a coding expert in the mix, it would answer it completely.

ttul · on Dec 8, 2023

See my poorly educated answer above. I don’t think that’s how MoE actually works. A new mixture of experts is chosen for every new context.

read_if_gay_ · on Dec 8, 2023

yes I read that. do you think it's reasonable to assume that the same expert will be selected so consistently that model swapping times won't dominate total runtime?

tarruda · on Dec 8, 2023

No idea TBH, we'll have to wait and see. Some say it might be possible to efficiently swap the expert weights if you can fit everything in RAM: https://x.com/brandnarb/status/1733163321036075368?s=20

numeri · on Dec 8, 2023

You're not necessarily wrong, but I'd imagine this is almost prohibitively slow. Also, this model seems to use two experts per token.

tarruda · on Dec 8, 2023

I will be super happy if this is true.

Even if you can't fit all of them in the VRAM, you could load everything in tmpfs, which at least removes disk I/O penalty.

cjbprime · on Dec 8, 2023

Just mentioning in case it helps anyone out: Linux already has a disk buffer cache. If you have available RAM, it will hold on to pages that have been read from disk until there is enough memory pressure to remove them (and then it will only remove some of them, not all of them). If you don't have available RAM, then the tmpfs wouldn't work. The tmpfs is helpful if you know better than the paging subsystem about how much you really want this data to always stay in RAM no matter what, but that is also much less flexible, because sometimes you need to burst in RAM usage.

tarruda · on Dec 8, 2023

Theoretically it could fit into a single 24GB GPU if 4-bit quantized. Exllama v2 has even more efficient quantization algorithm, and was able to fit 70B models in 24GB gpu, but only with 2048 tokens of context.

jlokier · on Dec 8, 2023

> - 96GB of weights. You won't be able to run this on your home GPU.

You can these days, even in a portable device running on battery.

96GB fits comfortably in some laptop GPUs released this year.

refulgentis · on Dec 8, 2023

This is extremely misleading. source: been working in local LLMs since 10 months ago. Got my Mac laptop too. I'm bullish too. But we shouldn't breezily dismiss those concerns out of hand. In practice, it's single digit tokens a second on a $4500 laptop for a model with weights half this size (Llama 2 70B Q2 GGUF => 29 GB, Q8 => 36 GB)

MacsHeadroom · on Dec 8, 2023

Mixtral 8x7b only needs 12B of weights in RAM per generation.

2B for the attention head and 5B from each of 2 experts.

It should be able to run slightly faster than a 13B desnse model, in as little as 16GB of RAM with room to spare.

filterfiber · on Dec 8, 2023

> in as little as 16GB of RAM with room to spare.

I don't think that's the case, for full speed you still need (5B*8)/2+2~fewB overhead.

I think the experts chosen per-token? That means that yes you technically only need two in VRAM memory+router/overhead per token, but you'll have to constantly be loading in different experts unless you can fit them all, which would still be terrible for performance.

So you'll still be PCIE/RAM speed limited unless you can fit all of the experts into memory (or get really lucky and only need two experts).

dkarras · on Dec 9, 2023

no doesn't work that way. experts can change per token so for interactive speeds you need all in memory unless you want to wait for model swaps between tokens.

coolspot · on Dec 8, 2023

> $4500

Which is more than a price of RTX A6000 48gb ($4k used on ebay)

brucethemoose2 · on Dec 8, 2023

Which is outrageously priced, in case thats not clear. Its an 2020 RTX 3090 with doubled up memory ICs, which is not much extra BoM.

baq · on Dec 8, 2023

Clearly it’s worth what people are willing to pay for it. At least it isn’t being used to compute hashes of virtual gold.

brucethemoose2 · on Dec 8, 2023

Its a artificial supply constraint due to artificial market segmentation enabled by Nvidia/AMD.

Honestly its crazy that AMD indulges in this, especially now. Their workstation market share is comparatively tiny, and instead they could have a swarm of devs (like me) pecking away at AMD compatibility on AI repos if they sold cheap 32GB/48GB cards.

baq · on Dec 9, 2023

Never said it was ok! Just saying that there are people willing to pay this much, so it costs this much. I'd very much like to buy a 40GB GPU for this to, but at these prices this is not happening - I'd have to turn it into a business to justify this expense, but I just don't feel like it.

tucnak · on Dec 8, 2023

People are also willing to die for all kinds of stupid reasons, and it's not indicative of _anything_ let alone a clever comment on the online forum. Show some decorum, please!

CamperBob2 · on Dec 8, 2023

How fast does it run on that?

refulgentis · on Dec 8, 2023

quantization makes it hard to have exactly one answer -- I'd make a q0 joke, except that's real now -- i.e. reduce the 3.4 * 10^38 range of float 32 to 2, a boolean.

it's not very good, at all, but now we can claim some pretty massive speedups.

I can't find anything for llama 2 70B on 4090 after 10 minutes of poking around, 13B is about 30 tkn/s. it looks like people generally don't run 70B unless they have multiple 4090s.

michaelt · on Dec 8, 2023

Be a lot cooler if you said what laptop, and how much quantisation you're assuming :)

tvararu · on Dec 8, 2023

They're probably referring to the new MacBook Pros with up to 128GB of unified memory.

jlokier · on Dec 9, 2023

Sibling commenter tvararu is correct. 2023 Apple Macbook with 128GiB RAM, all available to the GPU. No quantisation required :)

Other sibling commenter refulgentis is correct too. The Apple M{1-3} Max chips have up to 400GB/s memory bandwidth. I think that's noticably faster than every other consumer CPU out there. But it's slower than a top Nvidia GPU. If the entire 96GB model has to be read by the GPU for each token, that will limit unquantised performance to 4 tokens/s at best. However, as the "Mixtral" model under discussion is a mixture-of-experts, it doesn't have to read the whole model for each token, so it might go faster. Perhaps still single-digit tokens/s though, for unquantised.

shubb · on Dec 8, 2023

>> You won't be able to run this on your home GPU.

Would this allow you to run each expert on a cheap commodity GPU card so that instead of using expensive 200GB cards we can use a computer with 8 cheap gaming cards in it?

dragonwriter · on Dec 8, 2023

> Would this allow you to run each expert on a cheap commodity GPU card so that instead of using expensive 200GB cards we can use a computer with 8 cheap gaming cards in it?

I would think no differently than you can run a large regular model on a multiGPU setup (which people do!). Its still all one network even if not all of it is activated for each token, and since its much smaller than a 56B model, it seems like there are significant components of the network that are shared.

terafo · on Dec 8, 2023

Attention is shared. It's ~30% of params here. So ~2B params are shared between experts and ~5B params are unique to each expert.

terafo · on Dec 8, 2023

Yes, but you wouldn't want to do that. You will be able to run that on a single 24gb GPU by the end of this weekend.

brucethemoose2 · on Dec 8, 2023

Maybe two weekends.

miven · on Dec 8, 2023

> You won't be able to run this on your home GPU.

As far as I understand in a MOE model only one/few experts are actually used at the same time, shouldn't the inference speed for this new MOE model be roughly the same as for a normal Mistral 7B then?

7B models have a reasonable throughput when ran on a beefy CPU, especially when quantized down to 4bit precision, so couldn't Mixtral be comfortably ran on a CPU too then, just with 8 times the memory footprint?

filterfiber · on Dec 8, 2023

So this specific model ships with a default config of 2 experts per token.

So you need roughly two loaded in memory per token. Roughly the speed and memory of a 13B per token.

Only issues is that's per-token. 2 experts are choosen per token, which means if they aren't the same ones as the last token, you need to load them into memory.

So yeah to not be disk limited you'd need roughly 8 times the memory and it would run at the speed of a 13B model.

~~~Note on quantization, iirc smaller models lose more performance when quantized vs larger models. So this would be the speed of a 4bit 13B model but with the penalty from a 4bit 7B model.~~~ Actually I have zero idea how quantization scales for MoE, I imagine it has the penalty I mentioned but that's pure speculation.

faldore · on Dec 8, 2023

at 4 bits you could run it on a 3090 right?

brucethemoose2 · on Dec 8, 2023

Its crazy how the 3090 is such a ubiquitous local llm card these days. I despise Nvidia on linux... And yet I ended up with a 3090.

How are AMD/Intel totally missing this boat?

nicolas03 · on Dec 9, 2023

LMAO SAME. I hate nvidia yet got a used 3090 for $600. I’ve been biting my nails hoping china dosent resort to 3090’s, because I really want to buy another and I’m not paying more than 600.