I was wondering at one point whether all these companies have just hit a wall in the performance and improvement of the underlying technology, and whether all the version updates and new "models" they present are really just them editing and building ever more complex system prompts. We're also working internally with Copilot, and whenever some PM spots a weird result, we end up just adding all kinds of edge-case exceptions to our default prompt.
Speaking of a performance wall: the Claude 4 results were added to the Aider LLM Leaderboard [0] yesterday. Opus 4 is clearly below Gemini 2.5 Pro at almost twice the price. Sonnet 4 fares worse than Sonnet 3.7, with the thinking version of Sonnet 4 being somewhat cheaper than its 3.7 counterpart.
I think we already hit some kind of performance wall at the beginning of this year. It feels like models are now balancing between rule following, agentic use cases, and general capability. E.g. Claude 4 Sonnet just feels better in Cursor and follows rules very well, yet at the same time it gets equal or worse benchmark scores than Sonnet 3.7.