To head off the inevitable long argument about which benchmark, or set of benchmarks, is universally better: there is no such thing anymore. And even within individual benchmarks, we're increasingly squinting to see the differences.
Do the benchmarks reflect real-world usability? My feeling is that benchmark scores stop telling us much once they climb above roughly 75%.
On a real problem you may need to get 100 things right in a chain, and a 99% chance of getting each single step correct yields only about a 37% chance of a correct end result (0.99^100 ≈ 0.366). But building a diverse test that can reliably distinguish 99% correctness in complex domains sounds very hard, since the answers are often nuanced in ways where correctness is difficult to define and determine. Working in complex domains as a human, it often isn't clear whether something is right, wrong, or sitting in a somewhat undefined and underexplored grey area. Yet we have to operate in those areas anyway and, over many iterations, converge on a result that works.
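To make the compounding concrete, here is a minimal sketch of that arithmetic, assuming each step succeeds independently (the function name and the list of per-step accuracies are just illustrative):

```python
# If each step in a chain succeeds independently with probability p,
# the whole n-step chain succeeds with probability p**n.
# (Real task steps are rarely fully independent; this is the idealized case.)

def chain_success(p: float, n: int = 100) -> float:
    """Probability that all n independent steps succeed."""
    return p ** n

for p in (0.75, 0.90, 0.99, 0.999):
    print(f"per-step accuracy {p:.1%} -> 100-step success {chain_success(p):.2%}")

# e.g. per-step accuracy 99.0% -> 100-step success 36.60%
```

Note how steep the curve is: even 99.9% per-step accuracy only gets a 100-step chain to about 90% end-to-end.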
I'm not sure how such complex domains should be benchmarked, or how we would objectively compare the results.