
I tried the local small models. They are slow, much less capable, and ironically much more expensive to run than the frontier cloud models.


Phi-4-mini runs at 20 T/s on a basic laptop CPU, without any optimization… how is that slow?


I was running Qwen3-32B locally even faster, at 70 T/s, and it was still way too slow for me. I'm generating thousands of tokens of output per request (not coding). Running locally, I could get about 6 million tokens per day and pay for the electricity, or I could get more tokens per day from Google Gemini 2.5 Flash for free.
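The "6 mil tokens per day" figure follows directly from the 70 T/s decode speed, assuming the machine generates continuously around the clock (an idealized assumption; real utilization would be lower):

```python
# Upper bound on daily token output from a sustained decode speed.
# Assumes 24/7 generation with no idle time or prompt-processing overhead.
TOKENS_PER_SEC = 70
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

tokens_per_day = TOKENS_PER_SEC * SECONDS_PER_DAY
print(f"{tokens_per_day:,} tokens/day")  # → 6,048,000 tokens/day, ~6M
```

So 70 T/s caps out at roughly 6 million tokens per day, matching the number above.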

Running models locally is a privilege for the rich and those with too much disposable time.


Try Qwen3-30B-A3B. It's a MoE model with only ~3B parameters active per token, so its memory-bandwidth usage looks more like a 3B model's, and it typically runs much faster.
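The intuition can be sketched with a back-of-the-envelope model: local decode is usually memory-bandwidth-bound, since every generated token requires streaming all *active* weights from memory. The bandwidth and quantization numbers below are illustrative assumptions, not benchmarks of any specific machine or model:

```python
# Rough decode-speed estimate for bandwidth-bound inference:
# tokens/s ≈ memory bandwidth / bytes of active weights per token.
# All constants here are illustrative assumptions.

def est_tokens_per_sec(active_params_b: float, bytes_per_param: float,
                       bandwidth_gbs: float) -> float:
    """Estimate decode speed from active parameter count (billions),
    bytes per parameter, and memory bandwidth (GB/s)."""
    active_gb = active_params_b * bytes_per_param  # GB streamed per token
    return bandwidth_gbs / active_gb

BW = 100.0  # assumed memory bandwidth in GB/s
Q4 = 0.5    # ~4-bit quantization: 0.5 bytes per parameter

dense_32b = est_tokens_per_sec(32, Q4, BW)  # dense: all 32B params read
moe_a3b = est_tokens_per_sec(3, Q4, BW)     # MoE: only ~3B active params

print(f"dense 32B: ~{dense_32b:.0f} T/s, MoE A3B: ~{moe_a3b:.0f} T/s")
```

Under these assumptions, the dense 32B model manages only single-digit T/s while the 30B-A3B MoE lands around 10x faster, which is why its speed tracks a 3B model rather than a 30B one.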



