Page 31 is interesting: apparently, in the task of creating PRs for an internal repository, the o3-mini models have by far the lowest performance (even worse than gpt-4o). What is up with that?
That also applies to the Multilingual tests they do. I wonder if there is any overall gain over base GPT-4o at all. What's even stranger is that they spent about three pages describing how hard they worked to make sure the model doesn't answer questions about nuclear weapons or anything else that seems unsafe in that regard. Which is funny, because they also say they did this even though they didn't train on classified information, so whatever knowledge the model has comes from unclassified sources.
Nuclear development is a state actors' game. If a state wants to do it, it doesn't need an LLM to answer questions. Most of the work is actually building the program, acquiring materials, etc., and doing all of that development without being detected by the rest of the world (which is an impossible task).
But they spent far less time and explanation on more important parts, like coding performance.
Yeah, the more pages I read, the more disappointed I became. Here is the reason they cite for the low performance (which is even more worrying):
"The model often attempts to use a hallucinated bash tool rather than python despite constant, multi-shot prompting and feedback that this format is incorrect. This resulted in long conversations that likely hurt its performance."
aider found that with R1, the best performance came from using R1 to think through the solution and Claude to implement it. I suspect that, in the near term, we'll need combinations of reasoning models and instruction-following coding models to get excellent code output.
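Here's a minimal sketch of that two-stage idea: one call to a reasoning model to produce a plan, then a second call to an instruction-following coder to implement it. It assumes both models are reachable through OpenAI-compatible chat endpoints; the base URLs and model names are placeholders, not aider's actual implementation.

    # Two-stage sketch: a reasoning model drafts a plan, then an
    # instruction-following coder turns the plan into code.
    # Endpoints, keys, and model names are placeholders.
    from openai import OpenAI

    reasoner = OpenAI(base_url="https://reasoner.example/v1", api_key="...")
    coder = OpenAI(base_url="https://coder.example/v1", api_key="...")

    def solve(task: str) -> str:
        # Stage 1: ask the reasoning model for a step-by-step plan, not code.
        plan = reasoner.chat.completions.create(
            model="reasoning-model",  # placeholder
            messages=[{
                "role": "user",
                "content": "Think through how to solve this task and describe "
                           "the changes step by step. Do not write code.\n\n" + task,
            }],
        ).choices[0].message.content

        # Stage 2: hand the plan to the coding model to implement faithfully.
        return coder.chat.completions.create(
            model="coding-model",  # placeholder
            messages=[{
                "role": "user",
                "content": "Implement the following plan as code.\n\n"
                           "Task:\n" + task + "\n\nPlan:\n" + plan,
            }],
        ).choices[0].message.content

The split keeps the reasoning model out of the tool/edit-format loop, where (per the quote above) it tends to stumble, and lets the coding model do what it's better at: following the instructions precisely.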
My experience with most of the models focused on reasoning improvements has been that they tend to be a bit worse at following specific instructions. It is also notable that a lot of third-party fine-tunes of Llamas and others gain on knowledge-based benchmarks while losing instruction-following scores.
I wonder why those two seem to trade off along some sort of continuum?
"The book's main thesis is a differentiation between two modes of thought: "System 1" is fast, instinctive and emotional; "System 2" is slower, more deliberative, and more logical. "