> Remember, LLMs are just statistical sentence completion machines. So telling it what to respond with will increase the likelihood of that happening, even if there are other options that are viable.
Obviously. When I say "tuned" I don't mean adding stuff to a prompt. I mean tuning in the way models are also tuned to be more or less professional, tuned to defer certain tasks to other models (e.g. counting or math, something statistical models are almost unable to do), and so on.
I am almost certain that the chain of models we use on chatgpt.com is "tuned" to always give an answer, and not to answer with "I am just a model, I don't have information on this". Early models and early toolchains did this far more often, but today they are quite probably tuned to "always be of service".
"Quite probably" because I have no proof, other than that it will gladly hallucinate, invent urls and references, etc. And knowing that all the GPT competitors are battling for users, so their products quite certainly tuned to help in this battle - e.g. appear to be helpful and all-knowing, rather than factual correct and therefore often admittedly ignorant.
Whether you train the model to do math internally or tell it to call an external model which only does math, the root problem still exists. It's not as if a model which only does math won't hallucinate how to solve math problems just because it doesn't know about history, and for the same number of parameters it's probably better not to duplicate the parts needed to understand the basics of things multiple times.
The root problem is that training models to be uncertain of their answers results in lower benchmark scores in every area except hallucinations. It's like taking a multiple-choice test and, instead of picking whichever of answers A-D seems most plausible, picking E, "I don't know". Helpful for the test grader, but a bad bet for a model trying to claim it gets the most answers right compared to other models.
The technical solution is the easy half; the hard part is convincing people that this is how we should be testing everything, because we care about knowing the uncertainty in any test.
Look at the math section of the SAT, for example: it rewards trying to guess the right answer instead of rewarding admitting you don't know. It's not because the people writing the SAT can't figure out how to grade it otherwise; it's just not what people seem to care most about finding out, for one reason or another.
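To make the expected-value argument concrete, here's a rough sketch (my own numbers and penalty scheme, not the SAT's actual scoring rules or any real benchmark's) of why "always guess" wins under typical grading and stops winning once wrong answers carry a penalty:

```python
# Rough illustration: expected score per question when a test-taker answers
# vs. abstains, under two hypothetical grading schemes.

def expected_points(p_correct: float, wrong_penalty: float) -> float:
    """Expected score from answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

ABSTAIN = 0.0   # "I don't know" earns nothing in either scheme
p_blind = 0.25  # pure guess among four options A-D

# Scheme 1: wrong answers cost nothing (how most leaderboards score).
print(f"{expected_points(p_blind, wrong_penalty=0.0):.2f}")  # 0.25 -> guessing beats abstaining

# Scheme 2: formula scoring, wrong answers cost 1/(k-1) = 1/3 of a point.
print(f"{expected_points(p_blind, wrong_penalty=1/3):.2f}")  # 0.00 -> blind guessing gains nothing
```

Under the second scheme, answering only pays off when you actually know something (p_correct above chance), which is exactly the incentive the benchmarks described above don't provide.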