Recently I used one of the reasoning models to analyze 1,000 functions in a very well-known open source codebase. It flagged 44 problems, which I manually triaged. Of the 44 problems, about half seemed potentially reasonable. I investigated several of these seriously and found one that seemed to have merit and a simple fix. This was, in turn, accepted as a bugfix and committed to all supported releases of $TOOL.
All in all, I probably put in 10 hours of work, I found a bug that was about 10 years old, and the open-source community had to deal with only the final, useful report.
All in all, I probably put in 10 hours of work, I found a bug that was about 10 years old, and the open-source community had to deal with only the final, useful report.