
> I thought the original article included the strongest objective data point on this: recent progress on the METR long task benchmark isn't just on the historical "task length doubling every 7 months" best fit, but is trending above it.
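For concreteness, the quoted trend is just an exponential with a 7-month doubling time. A minimal sketch of that extrapolation, with a made-up baseline (the 60-minute figure is an assumption for illustration, not METR's data):

    # Toy extrapolation of the "task length doubles every 7 months" trend.
    # Numbers are illustrative assumptions, not drawn from the METR paper.
    def projected_task_minutes(baseline_minutes, months_elapsed, doubling_months=7):
        return baseline_minutes * 2 ** (months_elapsed / doubling_months)

    # On-trend, a 60-minute task horizon today would reach
    # 60 * 2**(12/7) ~= 197 minutes a year from now.
    print(projected_task_minutes(60, 12))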

There is selection bias in that paper. For example, they chose to measure “AI performance in terms of the length of tasks the system can complete (as measured by how long the tasks take humans)”, but excluded calculation tasks from the task set. That is a field in which machines have, for years, been reliably completing tasks that would take humans centuries or more, yet at which modern LLM-based AIs are worse than, say, plain Python.
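A toy example of what I mean by a calculation task (my example, not one from the paper): Python finishes this in microseconds, a human would need far more than a lifetime by hand, and an LLM answering from token prediction alone will typically get it wrong.

    # Fermat's little theorem check: for prime p and a not divisible by p,
    # pow(a, p - 1, p) == 1. Python's modular exponentiation does this
    # instantly even for a 521-bit prime.
    p = 2**521 - 1            # the Mersenne prime M521
    print(pow(3, p - 1, p))   # prints 1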

I think leaving out such tasks is at least somewhat defensible, but I have to wonder whether they also left out other tasks at which LLMs are not improving as rapidly.

Maybe it is a matter of posing different questions, with the article under discussion being more interested in “(When) can we (ever) expect LLMs to do jobs that currently require humans?” than in “(How fast) do LLMs get smarter over time?”
