That the answers have been available to them in the environment, and they’re still not hitting 100% on this benchmark is a damning indictment of SOTA model performance.
It really isn't. Do you expect SOTA models to answer any answered question on the internet with 100% accuracy? Congrats you just compressed the whole internet (at least a few zettabytes) into a model (a few TB at most?).
The linked ticket isn’t suggesting the commit is in the training data. It’s demonstrating that models run ‘git log’, find the exact code to fix the issue against which they’ll be scored, and then they implement that code as-is.
The test environment contains the answers to the questions.
Well, we're dealing with (near) superintelligence here, according to the companies that created the models. Not only would I expect them to regurgitate the answers they were trained on, which includes practically the entire internet, but I would expect them to answer questions they weren't trained on. Maybe not with 100% accuracy, but certainly much higher than they do now.
It's perfectly reasonable to expect a level of performance concordant with the marketing of these tools. Claiming this is superintelligence, while also excusing its poor performance is dishonest and false advertising.
i mean, if a human was claiming they could do that and successfully received billions to attempt to do it, and fail to deliver, i'd be railing against that particular human too