One thing I'd like to see is an apples-to-apples benchmark against e.g. aider's edit formats, on the same set of tasks. There is a published benchmark on your site, but it isn't apples-to-apples: it only establishes the relative superiority of the fine-tuned model within this patching framework -- it's not a comparison across patching frameworks.
You're super right -- this is probably the one crack in our narrative, and one that I sorely need to address. Hope to be back with something positive on this front soon; we're setting up all the benchmark harnesses to make that comparison more equitable.