> While this still uses text at some level, it's no longer regurgitation of human-produced text, but something more akin to AlphaZero's training to become superhuman at games like Go or Chess.
How did you know that? I've never seen that anywhere. For all we know, it could just be a very elaborate CoT algorithm.
Notice that the CoT is trained via RL, meaning the CoT generation is itself learned (either as part of the main model or as a separate one).
Also, RL means it's not limited to the original data the way traditional LLMs are. It implies that the CoT process itself is trained on its own performance, meaning the CoT steps from previous runs are fed back into the training process as new data.
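To make that loop concrete, here's a toy sketch in Python of the kind of self-reinforcing process being described (a STaR/rejection-sampling-style loop; purely illustrative, since the actual training recipe hasn't been published — every step name, reward rule, and number here is made up):

```python
import random

# Toy "policy": a weighted preference over hypothetical reasoning step types.
STEPS = ["decompose", "recall_fact", "check_arithmetic", "guess"]
weights = {s: 1.0 for s in STEPS}  # stand-in for model parameters

def sample_cot(n_steps=4):
    """Sample a chain of thought from the current 'policy'."""
    return random.choices(STEPS, weights=[weights[s] for s in STEPS], k=n_steps)

def reward(cot):
    """Hypothetical verifier: reward checking work, penalize guessing."""
    return cot.count("check_arithmetic") - 2 * cot.count("guess")

for _ in range(50):
    traces = [sample_cot() for _ in range(32)]   # roll out CoTs
    good = [t for t in traces if reward(t) > 0]  # keep high-reward traces
    for trace in good:                           # feed them back as training data:
        for step in trace:                       # reinforce steps that appeared
            weights[step] += 0.1                 # in successful traces

print({s: round(w, 2) for s, w in weights.items()})
```

The point is just the data flow: the model's own sampled reasoning, filtered by a reward signal, becomes the next round's training data, with no new human-written text required.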