Page 31 is interesting: apparently, in the task of creating PRs for an internal repository, the o3-mini models have by far the lowest performance (even worse than gpt-4o). What is up with that?
That also applies to the Multilingual tests they do. I wonder if there is any overall gain over base GPT-4o at all. What's even stranger is that they spent about three pages describing how hard they worked to make sure the model doesn't answer questions about nuclear weapons or anything else that seems unsafe in that regard. Which is funny, because they also say they did this even though they didn't train on classified information, so whatever knowledge the model has comes from unclassified sources.
Nuclear development is a state actors' game. If a state wants to do it, it doesn't need an LLM to answer questions. Most of the work is actually building the program, acquiring materials, etc., and doing all of that development without being detected by the rest of the world (which is an impossible task).
But they spent far less time and explanation on more important parts, like coding performance.
Yeah, the more pages I read, the more disappointed I became. Here is the reason they cite for the low performance (which is even more worrying):
"The model often attempts to use a hallucinated bash tool rather than python despite constant, multi-shot prompting and feedback that this format is incorrect. This resulted in long conversations that likely hurt its performance."
aider found that with R1, the best performance came from using R1 to think through the solution and Claude to implement it. I suspect that, in the near term, we'll need combinations of reasoning models and instruction-following coding models to get excellent code output.
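Here's a minimal sketch of that two-stage idea: one call to a reasoning model to produce a plan, then a second call to an instruction-following coder to implement it. It assumes both models are reachable through OpenAI-compatible chat endpoints; the base URLs and model names are placeholders, not aider's actual implementation.

    # Two-stage sketch: a reasoning model drafts a plan, then an
    # instruction-following coder turns the plan into code.
    # Endpoints, keys, and model names are placeholders.
    from openai import OpenAI

    reasoner = OpenAI(base_url="https://reasoner.example/v1", api_key="...")
    coder = OpenAI(base_url="https://coder.example/v1", api_key="...")

    def solve(task: str) -> str:
        # Stage 1: ask the reasoning model for a step-by-step plan, not code.
        plan = reasoner.chat.completions.create(
            model="reasoning-model",  # placeholder
            messages=[{
                "role": "user",
                "content": "Think through how to solve this task and describe "
                           "the changes step by step. Do not write code.\n\n" + task,
            }],
        ).choices[0].message.content

        # Stage 2: hand the plan to the coding model to implement faithfully.
        return coder.chat.completions.create(
            model="coding-model",  # placeholder
            messages=[{
                "role": "user",
                "content": "Implement the following plan as code.\n\n"
                           "Task:\n" + task + "\n\nPlan:\n" + plan,
            }],
        ).choices[0].message.content

The split keeps the reasoning model out of the tool/edit-format loop, where (per the quote above) it tends to stumble, and lets the coding model do what it's better at: following the instructions precisely.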
My experience with most of the models focused on reasoning improvements has been that they tend to be a bit worse at following specific instructions. It is also notable that a lot of third-party fine-tunes of Llamas and others gain on knowledge-based benchmarks while losing instruction-following scores.
I wonder why those two seem to trade off along some sort of continuum?
"The book's main thesis is a differentiation between two modes of thought: "System 1" is fast, instinctive and emotional; "System 2" is slower, more deliberative, and more logical. "