
One thing I'd like to see is an apples-to-apples benchmark against e.g. aider's edit formats, on the same set of tasks. There is a published benchmark on your site, but it isn't apples-to-apples: it only establishes the relative superiority of the fine-tuned model within this patching framework -- it's not a comparison across patching frameworks.



You're super right -- this is probably the one crack in our narrative, and one that I sorely need to address. Hope to be back with something positive on this front soon; we're setting up all the benchmark harnesses to do this more equitably.



