I am fairly optimistic about LLMs as a human math -> theorem-prover translator, and as a fan of Idris I am glad that the AI community is investing in Lean. As the author shows, the answer to "Can AI be useful for automated mathematical work?" is clearly "yes."

But I am confident the answer to the question in the headline is "no, not for several decades." It's not just the underwhelming benchmark results discussed in the post, or the general concern that hard undergraduate math exercises different skillsets than ordinary research math. IMO the deeper problem is a basic gap: LLMs can seemingly do formal math at the level of a smart graduate student, yet fail at quantitative/geometric reasoning problems designed for fish. I suspect this holds for o3, based on one of the ARC problems it wasn't able to solve: https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr... (via https://www.interconnects.ai/p/openais-o3-the-2024-finale-of...) ANNs are simply not able to form abstractions; they can only imitate them via enormous amounts of data and compute. I would say there has been zero progress on "common sense" math in computers since the invention of Lisp: we are still faking it with expert systems, even if LLM expert systems are easier to build at scale with raw data.

It is the same old problem where an ANN can attain superhuman performance on level 1 of Breakout, but it has to be retrained for level 2. I am not convinced it makes sense to say AI can do math if AI doesn't understand what "four" means with the same depth as a rat, even if it can solve sophisticated modular arithmetic problems. In human terms, does it make sense to say a straightedge-and-compass AI understands Euclidean geometry if it's not capable of understanding the physical intuition behind Euclid's axioms? It makes more sense to say it's a brainless tool that helps with the tedium and drudgery of actually proving things in mathematics.



To give a sense of scale: it's not that o3 failed to solve that red-blue rectangle problem once. o3 spent thousands of GPU-hours putting out text about that problem, producing by my math about a million pages of text, and did not find the answer anywhere in those pages. For other problems it did find the answer around the million-page mark, since at the ~$3,000-per-problem spend setting the score was still slowly creeping up.
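For the curious, here is the back-of-envelope version of "about a million pages." The token price and tokens-per-page figures below are my own guesses, not reported numbers; only the ~$3,000 spend comes from the discussion above:

    # Rough sanity check of the "million pages" figure.
    # All inputs are assumptions, not reported numbers.
    spend_per_problem = 3_000     # dollars, per the high-compute ARC runs
    price_per_m_tokens = 10       # dollars per million output tokens (guess)
    tokens_per_page = 400         # ~300 words/page at ~1.3 tokens/word (guess)

    tokens = spend_per_problem / price_per_m_tokens * 1_000_000
    pages = tokens / tokens_per_page
    print(f"{tokens:,.0f} tokens ~= {pages:,.0f} pages")
    # -> 300,000,000 tokens ~= 750,000 pages: order of a million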


If the trajectory of the past two years is any guide, things that can be done at great compute expense now will rapidly become possible for a fraction of the cost.


The trajectory is not a guide, unless you count the recent plateauing.


It can take my math, point out a step I missed, and then show me the correct procedure, but still get the wrong result because it can't reliably multiply 2-digit numbers.
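If anyone wants to check that claim themselves, a quick harness is easy to sketch. ask_model here is a hypothetical stand-in for whatever API you're using, not a real library call:

    import random

    def ask_model(prompt: str) -> str:
        """Hypothetical stand-in for an LLM call; wire up a real API here."""
        raise NotImplementedError

    def two_digit_accuracy(trials: int = 100) -> float:
        """Fraction of random 2-digit products the model answers exactly."""
        correct = 0
        for _ in range(trials):
            a, b = random.randint(10, 99), random.randint(10, 99)
            reply = ask_model(f"What is {a} * {b}? Reply with the number only.")
            correct += reply.strip() == str(a * b)
        return correct / trials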


Better than an average human then.


Different than an average human.


It's a "language" model (LLM), not a "math" model. When it is generating your answer, predicting and outputting word after word, it is _not_ multiplying your numbers internally.
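To make that concrete, generation is roughly a loop like the sketch below. next_token_probs is a stand-in for the network's softmax over its vocabulary; the point is that each digit of the answer is just another predicted token:

    def next_token_probs(context: list[str]) -> dict[str, float]:
        """Stand-in for the model's softmax over its vocabulary."""
        raise NotImplementedError

    def generate(prompt: list[str], max_tokens: int = 20) -> list[str]:
        # Greedy autoregressive decoding: every digit of "42 * 17 = 714"
        # is emitted as the highest-probability next token given the text
        # so far; no multiplication routine is ever invoked.
        tokens = list(prompt)
        for _ in range(max_tokens):
            probs = next_token_probs(tokens)
            tokens.append(max(probs, key=probs.get))
        return tokens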


Yes, I know. It's just kind of interesting how it can make inferences about complicated things but fail at multiplications that would almost certainly have appeared in its training set many times (two-digit by two-digit).


Just a comment: the example o3 got wrong was actually underspecified: https://anokas.substack.com/p/o3-and-arc-agi-the-unsolved-ta...

Which is actually a problem I have with ARC (and IQ tests more generally): it is computationally cheaper to go from ARC transformation rule -> ARC problem than it is the other way around. But this means it’s pretty easy to generate ARC problems with non-unique solutions.
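A toy sketch of that asymmetry (grids as lists of ints; the candidate rule set is hypothetical):

    # Forward direction: applying a known rule to a grid is one cheap pass.
    def recolor(grid, mapping):
        return [[mapping.get(c, c) for c in row] for row in grid]

    pair = ([[1, 0], [0, 2]], recolor([[1, 0], [0, 2]], {1: 3, 2: 5}))

    # Inverse direction: recovering the rule means searching a large rule
    # space for candidates consistent with the examples -- and nothing
    # forces exactly one candidate to survive, hence non-unique solutions.
    def consistent_rules(pairs, candidate_rules):
        return [r for r in candidate_rules
                if all(r(x) == y for x, y in pairs)]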



