No, they can run quantized versions of those models, which are dumber than the base 30b models, which in turn are much dumber than the 400b+ models (in my experience).
> They are a little bit dumber than the big cloud models but not by much.
If this were true, we wouldn't see people paying the premiums for the bigger models (like Claude).
For every use case I've thrown at them, it's not a question of "a little dumber"; the smaller models are simply incapable of doing what I need with any sort of consistency, and they hallucinate at extreme rates.
What's the actual use case for these local models?
With quantization-aware-training techniques, q4 models are less than 1% off from bf16 models. And yes, if your use case hinges on the very latest and largest cloud-scale models, there are things they can do that the local ones just can't. But having them spit tokens 24/7 for you would have you paying off a whole enterprise-scale GPU in a few months, too.
If anyone has a gaming GPU with gobs of VRAM, I highly encourage them to experiment with creating long-running local-LLM apps. We need more independent tinkering in this space.
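For anyone wondering what that looks like in practice, here's a minimal sketch, assuming llama-cpp-python and a q4 GGUF quant small enough to fit on a 24 GB card; the model path and the inbox/outbox "queue" are hypothetical placeholders, not a real pipeline:

```python
# Minimal sketch of a long-running local-LLM worker, assuming llama-cpp-python
# and a q4_K_M GGUF quant of a ~30b model that fits on a 24 GB gaming GPU.
# The model path and the inbox/outbox directories are hypothetical placeholders.
import time
from pathlib import Path

from llama_cpp import Llama

llm = Llama(
    model_path="models/some-30b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,  # offload every layer to the GPU
    n_ctx=8192,       # context window; tune to fit your VRAM
    verbose=False,
)

def summarize(text: str) -> str:
    """One example workload: summarize a chunk of text entirely locally."""
    resp = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "Summarize the user's text in three bullet points."},
            {"role": "user", "content": text},
        ],
        max_tokens=256,
        temperature=0.2,
    )
    return resp["choices"][0]["message"]["content"]

inbox, outbox = Path("inbox"), Path("outbox")  # hypothetical queue: drop .txt files in
inbox.mkdir(exist_ok=True)
outbox.mkdir(exist_ok=True)

# Spit tokens 24/7: poll for work and let the card grind through it at
# zero marginal cost per token.
while True:
    pending = sorted(inbox.glob("*.txt"))
    if not pending:
        time.sleep(5)
        continue
    for doc in pending:
        (outbox / doc.name).write_text(summarize(doc.read_text()))
        doc.unlink()  # done; remove from the queue
```

The point is that once the model is loaded, every extra token is effectively free, so bulk workloads you'd never pay per-token for (summarization, tagging, experiments) suddenly make sense.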
> But having them spit tokens 24/7 for you would have you paying off a whole enterprise-scale GPU in a few months, too.
Again, what's the use case? What would make sense to run at high volume where output quality isn't much of a concern? I'm genuinely interested, because this question always seems to go unanswered.
Any business that wants to serve a customized LLM at scale and doesn't need the smartest model possible, or hobbyist/researcher experiments. If you can get an agentic framework to work on a problem with a local model, it'll almost certainly work just as well on a cloud model. Again, I'm speaking mostly to people who already have an xx90-class GPU sitting around. Smoke 'em if you've got 'em. If you don't have a 3090/4090/5090 already, and don't care about privacy, then just enjoy how the improvements in local models are driving down the price per token of non-bleeding-edge cloud models.
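To illustrate the local-to-cloud portability point: most local servers (llama.cpp's llama-server, vLLM, Ollama) expose an OpenAI-compatible API, so moving a pipeline proven on a local model to a cloud model is mostly a config change. A minimal sketch, where the port, key handling, and model names are placeholder assumptions:

```python
# Minimal sketch of the local-to-cloud swap, assuming an OpenAI-compatible
# local server (llama-server, vLLM, Ollama, etc.) listening on localhost.
# The port, model names, and prompt are placeholder assumptions.
import os

from openai import OpenAI

USE_LOCAL = os.environ.get("USE_LOCAL", "1") == "1"

client = OpenAI(
    # None falls back to the normal cloud endpoint and OPENAI_API_KEY.
    base_url="http://localhost:8000/v1" if USE_LOCAL else None,
    api_key="sk-local-no-key-needed" if USE_LOCAL else os.environ["OPENAI_API_KEY"],
)
model = "local-30b-q4" if USE_LOCAL else "gpt-4o"  # hypothetical model names

# Whatever agent/tool-calling loop you build on top of this client never
# changes; only the endpoint and model name do.
resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Classify this ticket as bug/feature/question: ..."}],
)
print(resp.choices[0].message.content)
```

Same agent code, same tool calls; only the base URL and model name change.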
> If you can get an agentic framework to work on a problem with a local model, it'll almost certainly work just as well on a cloud model.
This is the exact opposite of what I see in my tests: it will almost certainly NOT work as well as the cloud models, as supported by every benchmark I've ever seen. I feel like I'm living in another AI universe here. I suppose it heavily depends on the use case.