They recently resolved two bugs affecting model quality, one of which was in production Aug 5-Sep 4. They also wrote:
> Importantly, we never intentionally degrade model quality as a result of demand or other factors, and the issues mentioned above stem from unrelated bugs.
Sibling comments are claiming the opposite, attributing malice where the company itself says it was a screw-up. Perhaps we should take Anthropic at its word, and also recognize that model performance will follow a probability distribution even for similar tasks, even without bugs making things worse.
> Importantly, we never intentionally degrade model quality as a result of demand or other factors, and the issues mentioned above stem from unrelated bugs.
Things they could do that would not technically contradict that:
- Quantize KV cache
- Data-aware model quantization, where their own evals show "equivalent perf" but overall model quality suffers.
The simple fact is that physical compute takes a long time to deploy, yet somehow they are able to serve more and more inference from a slowly growing pool of hardware. Something has to give...
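For anyone unsure what "quantize KV cache" means in practice, here is a minimal sketch of symmetric int8 quantization applied to a list of floats standing in for cached key/value activations. This is purely illustrative: the function names and the random data are my own assumptions, and nothing here is a claim about what Anthropic actually runs. It only shows why the technique saves memory (8-bit integers instead of wider floats) at the cost of a small, bounded rounding error per value.

```python
import random

def quantize_int8(values):
    """Symmetric per-tensor int8 quantization; returns (ints, scale)."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale

def dequantize(ints, scale):
    """Recover approximate floats from the quantized integers."""
    return [q * scale for q in ints]

random.seed(0)
# Hypothetical stand-in for cached key/value activations.
kv = [random.gauss(0.0, 1.0) for _ in range(1024)]

q, scale = quantize_int8(kv)
restored = dequantize(q, scale)

# Rounding to the nearest integer bounds the per-value error by scale / 2.
max_err = max(abs(a - b) for a, b in zip(kv, restored))
print(f"scale={scale:.5f}  max abs error={max_err:.5f}")
```

The per-value error is tiny, which is exactly why an eval can report "equivalent perf"; whether those errors compound over long contexts is the part a narrow eval might miss.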
- They're reporting that it only impacted Haiku 3.5 and Sonnet 4. I used neither model during the time period I'm concerned with.
- It took them a month to publicly acknowledge that issue, so now we lack confidence there isn't another underlying issue going undetected (or undisclosed, less charitably) that affects Opus.
> We are continuing to monitor for any ongoing quality issues, including reports of degradation for Claude Opus 4.1.
I take that as acknowledgment that there might be an issue with Opus 4.1 (granted, still undetected), but not an undisclosed one, and that they're actively looking for it. I wouldn't jump to "they must be hiding things" yet. They're building, deploying, and scaling their service at an incredible pace; like all of us, they're bound to get some things wrong.
To be clear, I'm not one of the people suggesting they're doing something nefarious. As I said elsewhere, I don't know what my expectations are of them at this point. I'd like early disclosure of known performance drops, I guess. But from a business POV, I understand why they're not going to be updating a status page to say "things are worsening but we're not exactly sure why".
I'm also a realist, though, and have built a career on building and operating large systems. There's obviously a capability to dynamically shed load built into the system somewhere; there's just no other responsible way to engineer it. Personally, I'd prefer they slowed response times rather than harmed response quality.