So the "Verified" part of "SWE Bench Verified" means.. not "Verified" at all. I ...

yorwba · 2025-09-11T19:55:15 1757620515

The "Verified" part of "SWE-Bench Verified" means that there was plain "SWE-Bench" before it, which had actually not been verified at all and included a lot of tasks that didn't really make sense for use as a benchmark: https://openai.com/index/introducing-swe-bench-verified/#ada...

Data contamination stemming from the fact that it's based on already-solved problems in public repositories is a different issue that cannot be addressed by verifying the benchmark questions harder, but only by putting stricter limits on the model under test.

jsheard · 2025-09-11T19:07:35 1757617655

> So the "Verified" part of "SWE Bench Verified" means.. not "Verified" at all.

Seems on-brand for an LLM-related thing to claim that it has verified something without actually checking.

geekymartian · 2025-09-11T19:29:05 1757618945

that was my exact thought. how fitting

hhh · 2025-09-12T12:55:44 1757681744

Verified has a completely different meaning for this, it's that the questions have verified valid solutions.

lieret · 2025-09-11T23:23:10 1757632990

[On the SWE-bench team] As someone pointed out SWE-bench Verified is a subset of tasks that were reviewed to be solvable (i.e., have enough context in the task description) as well are scored with unit tests that aren't overly specific to rule out valid solutions.

We've all read & analyzed a large number of agent trajectories. This loophole seems to be something that popped up with the more recent models and we simply weren't aware of it.

As discussed in the github issue, there's a fix in the new version of the SWE-bench containers (currently being rolled out) that makes sure that the relevant commits aren't available.

Part of what makes SWE-bench a very interesting benchmark is the enormous action space that agents that compete on it can take. However that also means that there's unexpected things happening when models get better. We're currently working on making all agent runs easily browsable on a website (rather than having to download our AWS buckets) to get even more eyes on the trajectories. Thanks to everyone who uncovered this loophole.

sebzim4500 · 2025-09-11T20:08:27 1757621307

The verified refers to the fact that the benchmark problems were verified by human experts to be reasonable.

It says nothing about data contamination, which would depend on the model and would not be the fault of the benchmark.

blibble · 2025-09-11T21:10:31 1757625031

> I don't get it, who is so opposed to doing the bare minimum of manual work and check what these models are doing?

I doubt any of the AI company employees are encouraged to go looking for cheating