Gee, I don't know. How would you do at a math competition if you weren't trained with math books? Sample problems and solutions are not sufficient unless you can genuinely apply human-level inductive and deductive reasoning to them. If you don't understand that and agree with it, I don't see a way forward here.
A more interesting question is, how would you do at a math competition if you were taught to read, then left alone in your room with a bunch of math books? You wouldn't get very far at a competition like IMO, calculator or no calculator, unless you happen to be some kind of prodigy at the level of von Neumann or Ramanujan.
> A more interesting question is, how would you do at a math competition if you were taught to read, then left alone in your room with a bunch of math books?
But that isn't how an LLM learnt to solve math olympiad problems. This isn't a base model just trained on a bunch of math books.
The way they get LLMs to be good at specialized things like math olympiad problems is to custom-train them for this using reinforcement learning - they give the LLM lots of examples of similar math problems being solved, showing all the individual solution steps, train on these, and reward the model when, having selected an appropriate sequence of solution steps, it is able to correctly solve the problem itself.
So, it's not a matter of the LLM reading a bunch of math books and then being expert at math reasoning and problem solving, but more along the lines of "monkey see, monkey do". The LLM was explicitly shown how to step by step solve these problems, then trained extensively until it got it and was able to do it itself. It's probably a reflection of the self-contained and logical nature of math that this works - that the LLM can be trained on one group of problems and the generalizations it has learnt work on unseen problems.
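As a rough sketch of the shape of that loop (toy stand-ins only; the guessing "model" and helper names here are made up, not any lab's actual training stack):

    import random

    # Toy sketch of RL with a verifiable reward: sample several step-by-step
    # solutions, score each whole solution only by its final answer, then (in
    # a real pipeline) nudge the model toward the traces that scored well.

    def sample_solution(problem):
        # A real model would generate a chain of reasoning; here we just guess.
        return "pretend reasoning steps...", random.choice([40, 41, 42, 43])

    problem, known_answer = "What is 6 * 7?", 42
    solutions = [sample_solution(problem) for _ in range(8)]
    rewards = [1.0 if answer == known_answer else 0.0
               for _, answer in solutions]

    # A real RL step (PPO, GRPO, etc.) would update the model weights using
    # these rewards; here we only report how many samples earned one.
    print(sum(rewards), "of", len(rewards), "sampled solutions got the reward")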
The dream is to be able to teach LLMs to reason more generally, but the reasons this works for math don't generally apply, so it's not clear that this math success can be used to predict future LLM advances in general reasoning.
The dream is to be able to teach LLMs to reason more generally, but the reasons this works for math don't generally apply
Why is that? Any suggestions for further reading that justifies this point?
Ultimately, reinforcement learning is still just a matter of shoveling in more text. Would RL work on humans? Why or why not? How similar is it to what kids are exposed to in school?
An important difference between reinforcement learning (RL) and pre-training is the error feedback that is given. For pre-training the error feedback is just next token prediction error. For RL you need to have a goal in mind (e.g. successfully solving math problems) and the training feedback that is given is the RL "reward" - a measure of how well the model output achieved the goal.
With RL used for LLMs, it's the whole LLM response that is being judged and rewarded (not just the next word), so you might give it a math problem and ask it to solve it; when it has finished, you take the generated answer and check whether it is correct, and this reward feedback is what allows the RL algorithm to learn to do better.
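To make the difference in feedback concrete (toy numbers, not anyone's actual loss function):

    import math

    # Pre-training signal: an error for EVERY token, e.g. cross-entropy on the
    # probability the model assigned to each actual next token (toy values).
    p_next_token = [0.9, 0.02, 0.6, 0.3]
    per_token_losses = [-math.log(p) for p in p_next_token]   # one per token

    # RL signal: a single scalar for the WHOLE response, available only after
    # generation ends - e.g. 1.0 because the final answer matched the known one.
    episode_reward = 1.0

    print([round(loss, 2) for loss in per_token_losses], episode_reward)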
There are at least two problems with trying to use RL as a way to improve LLM reasoning in the general case.
1) Unlike math (and also programming) it is not easy to automatically check the solution to most general reasoning problems. With a math problem asking for a numerical answer, you can just check against the known answer, or for a programming task you can just check that the program compiles and its output is correct. In contrast, how do you check the answer to more general problems such as "Should NATO expand to include Ukraine?"! If you can't define a reward then you can't use RL. People have tried using "LLM as judge" to provide rewards in cases like this (give the LLM response to another LLM, and ask it if it thinks the goal was met), but apparently this does not work very well. (See the sketch after this list.)
2) Even if you could provide rewards for more general reasoning problems, and therefore were able to use RL to train the LLM to generate good solutions for those training examples, this is not very useful unless the reasoning it has learnt generalizes to other problems it was not trained on. In narrow logical domains like math and programming this evidently works very well, but it is far from clear how learning to reason about NATO will help with reasoning about cooking or cutting your cat's nails, and the general solution to reasoning can't be "we'll just train it on every possible question anyone might ever ask"!
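Here's a sketch of why the reward is easy to define in some domains and not in others (the checker names are mine, not from any real RL pipeline):

    import subprocess, sys

    def check_math(model_answer: str, known_answer: str) -> float:
        # Math with a known numerical answer: the check is mechanical.
        return 1.0 if model_answer.strip() == known_answer.strip() else 0.0

    def check_code(program: str, expected_output: str) -> float:
        # Code: run it and compare its output to what is expected.
        result = subprocess.run([sys.executable, "-c", program],
                                capture_output=True, text=True, timeout=5)
        return 1.0 if result.stdout.strip() == expected_output.strip() else 0.0

    def check_open_question(answer: str) -> float:
        # "Should NATO expand to include Ukraine?" has no ground truth to
        # compare against, so there is no reward to compute - which is
        # exactly the problem. (LLM-as-judge tries to fill this gap.)
        raise NotImplementedError("no automatic verifier exists for this")

    print(check_math(" 42 ", "42"), check_code("print(2 + 2)", "4"))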
I don't have any particular reading suggestions, but these are widely accepted limiting factors to using RL for LLM reasoning.
I don't think RL would work too well for humans, and it's not generally the way we learn, or the way kids are mostly taught in school. We mostly learn or are taught individual skills and when they can be used, then practice and learn how to combine and apply them. The closest thing to RL in school would be if the only feedback an English teacher gave you on your writing assignments was a letter grade, without any commentary, and you had to figure out for yourself what you needed to improve!
Going back to the grandparent reply, there's a phrase that carries a LOT of water:
The LLM was explicitly shown how to step by step solve these problems, then trained extensively until it got it
Again, that's all we do. We train extensively until we "get it." Monkey-see, monkey-do turns out not only to be all you need, so to speak... it's all there is.
In contrast, how do you check the answer to more general problems such as "Should NATO expand to include Ukraine?"
If you ask a leading-edge model a question like that, you will find that it has become diplomatic enough to remain noncommittal. If this ( https://gemini.google.com/share/9f365513b86f ) isn't adequate, what would you expect a hypothetical genuinely-intelligent-but-not-godlike model to say?
There is only one way to check the answer to that question, and that's to sign them up and see how Russia reacts. (Frankly I'd be fine with that, but I can see why others aren't.)
Also see the subthread at https://news.ycombinator.com/item?id=45483938 . I was really impressed by that answer; it wasn't at all what I was expecting. I'm much more impressed by that answer than by the HN posters I was engaging with, let's put it that way.
Ultimately it's not fair to judge AI by asking it for objective answers in questions requiring value judgement. Especially when it's been "aligned" to within an inch of its simulated life to avoid bias. Arguably we are not being given the access we need to really understand what these things are capable of.
> It's not fair to judge AI by asking it for objective answers in questions requiring value judgement... especially when it's been "aligned" to within an inch of its simulated life to avoid bias.
They aren’t aligned to avoid bias (which is an incoherent concept; avoiding bias is like not having priors), they are aligned to incorporate the preferred bias of the entity doing the alignment work.
(That preferred bias may be for a studious neutrality on controversial viewpoints in the surrounding society as perceived by the aligner, but that’s still a bias, not the absence of bias.)
> Again, that's all we do. We train extensively until we "get it." Monkey-see, monkey-do turns out not only to be all you need, so to speak... it's all there is.
Which is fine for us humans, but would only be fine for LLMs if they also had continual learning and whatever else is necessary for them to learn on the job and pick up new reasoning skills by themselves, post-deployment.
Obviously right now this isn't the case, so therefore we're stuck with the LLM companies trying to deliver models "out of the box" that have some generally useful reasoning capability that goes beyond whatever happened to be in their pre-training data, and the way they are trying to do that is with RL ...
Agreed, memory consolidation and object permanence are necessary milestones that haven't been met yet. Those are the big showstoppers that keep current-generation LLMs from serving as a foundation for something that might be called AGI.
It'll obviously happen at some point. No reason why it won't.
Just as obviously, current LLMs are capable of legitimate intelligent reasoning now, subject to the above constraints. The burden of proof lies on those who still claim otherwise against all apparent evidence. Better definitions of 'intelligence' and 'reasoning' would be a necessary first step, because our current ones have decisively been met.
Someone who has lost the ability to form memories is still human and can still reason, after all.
I think continual learning is a lot different than memory consolidation. Learning isn't the same as just stacking memories, and anyways LLMs aren't learning the right thing - to create human/animal-like intelligence requires predicting the outcomes of actions, not just auto-regressive continuations.
Continual learning, resulting in my AI being different from yours, because we've both got them doing different things, is also likely to turn the current training and deployment paradigm on its head.
I agree we'll get there one day, but I expect we'll spend the next decade exploiting LLMs before any serious effort moves on to new architectures.
In the meantime, DeepMind for one have indicated they will try to build their version of "AGI" with an LLM as a component of it, but it remains to be seen exactly what they end up building and how much new capability that buys. In the long term, building in language as a component, rather than building in the ability to learn language (and everything else humans are capable of learning), is going to prove a limitation, and personally I wouldn't call it AGI until we do get to that level of being able to learn everything a human can.