
To head off the inevitable long arguments about which benchmark, or set of benchmarks, is universally better: there is no such thing anymore. And even within individual benchmarks, we're increasingly squinting to see the difference.



Do the benchmarks reflect real-world usability? My feeling is that the benchmark numbers stop being informative above roughly 75%.

In a real problem you may need to get 100 things right in a chain, which means that a 99% chance of getting each individual step correct still leaves only about a 37% chance of getting the correct end result. But building a diverse test that can reliably identify 99%-correct results in complex domains sounds very hard, since the answers are often nuanced in ways where correctness is difficult to define and determine. Working in complex domains as a human, it is often not clear whether something is right, wrong, or in a somewhat undefined and underexplored grey area. Yet we have to operate in those areas and, over many iterations, converge on a result that works.
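
As a rough illustration (a minimal sketch, assuming the steps are independent and equally likely to succeed, which real tasks rarely are), here is how per-step accuracy compounds over a chain:

    def chain_success_prob(per_step_accuracy, steps):
        # Probability that every step in an independent chain is correct.
        return per_step_accuracy ** steps

    print(chain_success_prob(0.99, 100))  # ~0.366: 99% per step -> ~37% end to end
    print(chain_success_prob(0.95, 100))  # ~0.006: 95% per step -> almost never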

I'm not sure how such complex domains should be benchmarked, or how we would objectively compare the results.



