
Claude Code already compacts automatically.

I believe Codex CLI also auto compacts when the context limit is met, but in addition to that, you can manually issue a /compact command at any time.

Claude Code has had this /compact command for a long time; you can even specify your preferences for compaction after the slash command. But this is quite limited, and to get the best results out of your agent you need to do more than rely on how the tool decides to prune your context. I ask it explicitly to write down the important parts of our conversation into an md file, and I review and iterate on the doc until I'm happy with it. Then I /clear the context and give it instructions to continue based on the md doc.
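
A minimal sketch of that loop (HANDOFF.md and the exact wording are just illustrative):

  > Write the important parts of this session to HANDOFF.md: the goal, key decisions, files touched, and remaining TODOs.
  > /clear
  > Read HANDOFF.md and continue with the remaining TODOs.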

Codex only had explicit (manual) compaction last time I checked.

CC also has the same `/compact` command if you want to force it

/compact accepts parameters, so you can tell it to focus on something specific when compacting.
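
For example, in Claude Code the text after the command is treated as free-form guidance (this particular wording is just an illustration):

  /compact Focus on the auth refactor: decisions made, files changed, and open TODOs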

/clear too

It's more akin to complaining about how Google search results have gotten worse.

> Israel is a primarily Jewish country surrounded by neighbours who won't stop kicking each other regardless, and to whom Israel is a common enemy due to religion. Your analogy ignores the neighbours being racist sociopaths that will punch Israel at any opportunity and have done so historically repeatedly.

You make a good argument why a European people should not have established a country there. Doubly so considering it was already populated.


First, 60-70% of Israeli Jews are of Arabic descent, not European.

Second, while it's possible to complain about the circumstances of the creation of Israel, I'm not sure that doing so now, in context, offers anything constructive. It seems that by most reasonable definitions, Israel is a country, if a small one. Do you suggest that Israel be eradicated? If so, what happens to all the Israelis, who likely wouldn't be welcome in the area after the country's destruction? Is it any more justifiable to ethnically cleanse one group from the area than another?

I don't have an answer to this conflict, but it isn't clear to me that suggesting "this country shouldn't have existed at all" is an answer either.


Zionism was 100% an Ashkenazi project.


They didn't. Israel fought for its independence alone. And won. End of story.


Which model did you have? I have the XV model and it is 90% as good as stones.


I think the Mac Studio is a poor fit for gpt-oss-120b.

On my 96 GB DDR5-6000 + RTX 5090 box, I see ~20s prefill latency for a 65k prompt and ~40 tok/s decode, even with most experts on the CPU.

A Mac Studio will decode faster than that, but prefill will be tens of times slower due to much lower raw compute than a high-end GPU. For long prompts that can make it effectively unusable, which is what the parent was getting at. You will hit this long before 65k context.

If you have time, could you share numbers for something like:

  llama-bench -m <path-to-gpt-oss-120b.gguf> -ngl 999 -fa 1 --mmap 0 -p 65536 -b 4096 -ub 4096

Edit: The only Mac Studio pp65536 datapoint I’ve found is this Reddit thread:

https://old.reddit.com/r/LocalLLaMA/comments/1jq13ik/mac_stu ...

They report ~43.2 minutes prefill latency for a 65k prompt on a 2-bit DeepSeek quant. Gpt-oss-120b should be faster than that, but still very slow.


This is a Mac Studio M1 Ultra with 128 GB of RAM.

  > llama-bench -m ./gpt-oss-120b-MXFP4-00001-of-00002.gguf -ngl 999 -fa 1 --mmap 0 -p 65536 -b 4096 -ub 4096       
                                                                                             
  | model                          |       size |     params | backend    | threads | n_batch | n_ubatch | fa | mmap |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | -------: | -: | ---: | --------------: | -------------------: |
  | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | Metal,BLAS |      16 |    4096 |     4096 |  1 |    0 |         pp65536 |       392.37 ± 43.91 |
  | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | Metal,BLAS |      16 |    4096 |     4096 |  1 |    0 |           tg128 |         65.47 ± 0.08 |
  
  build: a0e13dcb (6470)


Thanks. That's better than I expected. At ~392 tok/s, the 65,536-token prompt works out to ~167s of prefill latency, only about 8.3x worse than the ~20s on the 5090 + CPU box.


My experience with Codex / GPT-5:

- The smartest model I have used. Solves problems better than Opus-4.1.

- It can be lazy. With Claude Code / Opus, once given a problem, it will generally work until completion. Codex will often perform only the first few steps and then ask if I want to continue to do the rest. It does this even if I tell it to not stop until completion.

- I have seen severe degradation near max context. For example, I have seen it just repeat the next steps every time I tell it to continue and I have to manually compact.

I'm not sure if the problems are with GPT-5 or Codex. I suspect a better Codex could resolve them.


Claude seems to have gotten worse for me, with both that kind of laziness and a new pattern where it will write the test, write the code, run the test, and then declare that the test is working perfectly but there are problems in the (new) code that need to be fixed.

Very frustrating, and happening more often.


They for sure nerfed it within the last ~3 weeks. There's a measurable difference in quality.


They actually just shipped a bug fix, and it seems like it got a lot better in the last week or so.


Context degradation is a real problem with all frontier LLMs. As a rule of thumb I try to never exceed 50% of available context window when working with either Claude Sonnet 4 or GPT-5 since the quality drops really fast from there.


I've never seen that level of extreme degradation (just making a small random change and repeating the same next steps infinitely) on Claude Code. Maybe Claude Code is more aggressive about auto compaction. I don't think Codex even compacts without /compact.


I think some of it is not necessarily auto compaction but the built-in tooling. For example, Claude Code very frequently injects reminders of what the model is working on and what it should be doing, which helps keep its tasks in the most recent context, and overall there's some pretty serious thought put into its system prompt and tooling.

But they have suffered quite a lot of degradation and quality issues recently.

To be honest, unless Anthropic does something very impactful soon, I think they're losing the moat they had with developers as more and more jump to Codex and other tools. They kind of massively threw away their lead imo.


Yeah, I think you are right.


Agreed, and judicious use of subagents to prevent pollution of the main thread is another good mitigant.


I cap my context at 50k tokens.


Yes, this is the one thing stopping me from going to Codex completely. Currently, it's kind of annoying that Codex stops often and asks me what to do, and I just reply "continue", even though I already gave it a checklist.

With GPT‑5-Codex they do write: "During testing, we've seen GPT‑5-Codex work independently for more than 7 hours at a time on large, complex tasks, iterating on its implementation, fixing test failures, and ultimately delivering a successful implementation." https://openai.com/index/introducing-upgrades-to-codex/


I definitely agree with all of those points. I just really prefer it completing steps and asking me if we should continue to the next step rather than doing half of the step and telling me it's done. And the context degradation seems quite random: sometimes it hits way earlier, sometimes we go through a crazy amount of tokens and it all works out.


I also noticed the laziness compared to Sonnet models but now I feel it’s a good feature. Sonnet models, now I realize, are way too eager to hammer out code with way more likelihood of bugs.


The problem is when there are long stretches of little to no power generation. Fully covering those gaps with batteries would require very large (and costly) storage. During this time the grid needs to be large enough to support everyone, just the same as if solar did not exist. You can say it's terrible for utilities, but at the end of the day they will have to pass the cost of maintaining the grid along to non-solar customers.


What do you mean by long stretches? Are you talking about sundown to sunrise?

In many (most?) areas, wind picks up at night, wind can't really be "local", and demand is lower at night time so that's a great use for the grid.

Also, batteries are getting so cheap that people are putting multiple days' worth of storage on wheels, driving them around, and parking them at home during the evening peak and overnight.

When they are that cheap, adding 10-20 kWh of local storage is going to pay for itself in no time.

When my neighbor is overproducing solar during the day, that means that he's sending his power over to my house, which doesn't have solar. Which means that my neighborhood is pulling down less peak power. And the grid is sized for peak power, not for minimal power, so whenever that peak is lowered, it saves me money but costs the utility profits.

Because the utility gets to recoup a fixed profit rate off of any amount of grid they are allowed by the PUC to install, whether it was needed or not. My neighbor, with the solar, also pays lots of fees for the privilege of sending me power and requiring less grid.

This effect of shaving the peak is so extreme that solar causes the California duck curve. Though the storage that's been added in just the past two years has now pretty much solved the problems with the evening ramp as the sun goes down.


It's only the highest peak that matters. During periods of Dunkelflaute[1], batteries will run dry and the grid will need to support everyone.

[1] https://en.wikipedia.org/wiki/Dunkelflaute


Seems like a great time and place for the iron air batteries that are getting deployed now (Form Energy). Even in the US, without Dunkelflaute, these 100:1 energy:power batteries are economical and paying for themselves on the grid. If there are several of these occasions per year it could be a great fit.

It also seems likely that HVDC from sunnier areas like Spain or maybe even Morocco could be cheap enough. I'd recommend nuclear, but EDF is having such great difficulty building it. HVDC and other exotic solutions like enhanced geothermal seem far more practical at the moment.


Do you ever actually converse with people, or do you just DDoS them with random information? I made one simple point and you have not addressed it.


HVDC, long-duration batteries, and enhanced geothermal directly address your concern. And if they do not, you have not bothered to express your concern clearly.


You’ve shifted to promoting renewables. That wasn’t the point. The point was cost-shift: rooftop customers still use the grid but avoid paying for fixed T&D. Address that.


> The problem is when there are long stretches of little to no power generation. Fully covering those gaps with batteries would require very large (and costly) storage.

Perhaps local solar installations could be incentivized to include their own smaller scale storage...


California has done this with their latest version of net metering for residential solar, NEM 3.

It makes solar a very financially unattractive option unless there's storage attached to the system, and has drastically reduced the rate of residential solar deployment.

NEM3 was justified under the proposition that lower-income households were "funding" the higher income households to get solar. So as solar finally gets cheap enough for the lower income households, they changed the rules again so that only those rich enough to afford batteries and solar can save money.

NEM3 has a few nice things about it when looked at narrowly, but overall seems pretty disastrous for the state.


Disastrous is an oversimplification you can only make if you don't understand the broader context. Grid stability is more important than some homeowners saving some money, it turns out those extra kWh being dumped onto the grid were literally costing the operator money to deal with. Those costs got passed on to _other_ consumers because of the sweetheart deal.

Residential solar installs are way down, that's correct, but residential isn't the only venue for solar, and within residential, storage capacity is skyrocketing and is already having a measurable effect on the early evening peak. Lower peaks mean less capacity needs to be built just to handle a few hours. This is good.

The unequivocally negative impact I don't have an answer for is the job losses for solar installers.


> Grid stability is more important than some homeowners saving some money, it turns out those extra kWh being dumped onto the grid were literally costing the operator money to deal with. Those costs got passed on to _other_ consumers because of the sweetheart deal.

If that was the concern they literally did nothing to stop it. Instead of dealing with backfeeding from a distribution station, they went entirely the other direction.

Those grid costs, if they actually existed, were in isolated areas with high levels of solar, and NEM3 will continue deployments of solar in exactly those areas.

Solar is not "savings for some homeowners"; it's literally keeping grid costs down for everyone and keeping our grid reliable on the hottest, hardest-to-run days.


It sounds like you are alluding to NEM3 here. If so, I'm not sure that was meant to incentivize small scale energy storage. They recently tried to implement a flat fee that would have killed residential solar entirely, even with batteries. That did not happen, but I think it shows the motivations. I'm also not sure batteries even change that much for the grid. You still need to have the capacity for lulls when all the batteries are empty.


They are willing to fight to the last Ukrainian, that's for sure.


MoE models need just as much VRAM as dense models because every token may use a different set of experts. They just run faster.


This isn't quite right: it'll run with the full model loaded to RAM, swapping in the experts as it needs. It has turned out in the past that experts can be stable across more than one token so you're not swapping as much as you'd think. I don't know if that's been confirmed to still be true on recent MoEs, but I wouldn't be surprised.


Also, though nobody has put the work in yet, the GH200 and GB200 (the NVIDIA "superchips") support exposing their full LPDDR5X and HBM3 as UVM (unified virtual memory), with much more memory bandwidth between LPDDR5X and HBM3 than a typical "instance" gets over PCIe. UVM can handle "movement" in the background and would be absolutely killer for these MoE architectures, but none of the popular inference engines actually allocate memory appropriately for this (cudaMallocManaged()), let UVM (CUDA) handle the movement of data for them (automatic page migration and dynamic data movement), or are architected to avoid the pitfalls in this environment (being aware of the implications of CUDA graphs when using UVM).

It's really not that much code, though, and all the actual capabilities are there as of about mid this year. I think someone will make this work and it will be a huge efficiency for the right model/workflow combinations (effectively, being able to run 1T parameter MoE models on GB200 NVL4 at "full speed" if your workload has the right characteristics).


What you are describing would be uselessly slow and nobody does that.


I don't load all the MoE layers onto my GPU, and I see only about a 15% reduction in token generation speed while running a model 2-3 times larger than my VRAM alone could hold.


The slowdown is far more than 15% for token generation. Token generation is mostly bottlenecked by memory bandwidth: dual-channel DDR5-6000 has ~96 GB/s, while an RTX 5090 has ~1.8 TB/s. See my other comment where I show a 5x slowdown in token generation from moving just the experts to the CPU.
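
As a rough back-of-envelope (purely illustrative, assuming gpt-oss-120b's ~5.1B active parameters per token at ~4.25 bits/weight for MXFP4), the bandwidth-imposed decode ceiling looks like:

  # decode ceiling ~= memory bandwidth / bytes of weights read per token
  python3 -c "print(96e9 / (5.1e9 * 4.25 / 8))"    # ~35 tok/s from 96 GB/s system RAM
  python3 -c "print(1.8e12 / (5.1e9 * 4.25 / 8))"  # ~664 tok/s from 1.8 TB/s of VRAM

That's in the same ballpark as the ~40 tok/s decode I see with most experts in system RAM.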


I suggest figuring out what your configuration problem is.

Which llama.cpp flags are you using, because I am absolutely not having the same bug you are.


It's not a bug. It's the reality of token generation. It's bottlenecked by memory bandwidth.

Please publish your own benchmarks proving me wrong.


I cannot reproduce your bug on AMD. I'm going to have to conclude this is a vendor issue.


I do it with gpt-oss-120B on 24 GB VRAM.


You don't. You run some of the layers on the CPU.


You're right that I was confused about that.

LM Studio defaults to 12/36 layers on the GPU for that model on my machine, but you can crank it to all 36 on the GPU. That does slow it down but I'm not finding it unusable and it seems like it has some advantages - but I doubt I'm going to run it this way.


FWIW, that's an 80GB model, and you also need KV cache. You'd need 96GB-ish to run it all on the GPU.


Do you know if it's doing what was described earlier, when I run it with all layers on GPU - paging an expert in every time the expert changes? Each expert is only 5.1B parameters.


It makes absolutely no sense to do what OP described. The decode stage is bottlenecked on memory bandwidth. Once you pull the weights from system RAM, your work is almost done. To then ship gigabytes of weights PER TOKEN over PCIe just to do some trivial computation on the GPU is crazy (PCIe 5.0 x16 is only ~64 GB/s, slower than just reading from system RAM in the first place).

What actually happens is you run some or all of the MoE layers on the CPU from system RAM. This can be tolerable for smaller MoE models, but keeping it all on the GPU will still be 5-10x faster.

I'm guessing LM Studio gracefully falls back to running _something_ on the CPU. Hopefully you are running only the MoE layers on the CPU. I've only ever used llama.cpp.


I tried a few things and checked CPU usage in Task Manager to see how much work the CPU is doing.

KV Cache in GPU and 36/36 layers in GPU: CPU usage under 3%.

KV Cache in GPU and 35/36 layers in GPU: CPU usage at 35%.

KV Cache moved to CPU and 36/36 layers in GPU: CPU usage at 34%.

I believe you that it doesn't make sense to do it this way (it is slower), but it doesn't appear to be doing much of anything on the CPU.

You say gigabytes of weights PER TOKEN, is that true? I think an expert is about 2 GB, so a new expert is 2 GB, sure - but I might have all the experts for the token already in memory, no?


gpt-oss-120b chooses 4 experts per token and combines them.

I don't know how LM Studio works; I only know the fundamentals. There is no way it's sending experts to the GPU per token. Also, the CPU doesn't have much work to do. It's mostly waiting on memory.


> There is not way it's sending experts to the GPU per token.

Right, it seems like either experts are stable across sequential tokens fairly often, or there's more than 4 experts in memory and it's stable within the in-memory experts for sequential tokens fairly often, like the poster said.


^ Er, misspoke: each expert is at most 0.9B parameters, and there are 128 experts. 5.1B is the number of active parameters (4 experts + some other parameters).


I run the 30B Qwen3 on my 8GB Nvidia GPU and get a shockingly high tok/s.


For contrast, I get the following for an RTX 5090 and the 30B Qwen3 Coder quantized to ~4 bits:

- Prompt processing 65k tokens: 4818 tokens/s

- Token generation 8k tokens: 221 tokens/s

If I offload just the experts to run on the CPU I get:

- Prompt processing 65k tokens: 3039 tokens/s

- Token generation 8k tokens: 42.85 tokens/s

As you can see, token generation is over 5x slower. This is only using ~5.5GB VRAM, so the token generation could be sped up a small amount by moving a few of the experts onto the GPU.
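
For reference, this is roughly the kind of llama.cpp invocation I mean; treat it as a sketch, since the model path and the tensor-name regex are illustrative and flag support varies by build (-ot/--override-tensor is the usual mechanism, and newer builds also have a --n-cpu-moe shortcut):

  # keep attention/dense weights on the GPU, leave the MoE expert tensors in system RAM
  llama-server -m Qwen3-Coder-30B-A3B-Q4_K_M.gguf -ngl 999 -c 65536 \
    -ot "ffn_.*_exps=CPU"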


AFAIK many people on /r/localLlama do pretty much that.


llama.cpp has built-in support for doing this, and it works quite well. Lots of people running LLMs on limited local hardware use it.


llama.cpp has support for running some of or all of the layers on the CPU. It does not swap them into the GPU as needed.


It's neither hypothetical nor rare.


You are confusing this with running layers on the CPU.


This is just a long-winded way of calling them racist and threatening them with a ban.

