> While this still uses text at some level, it's no longer regurgitation of human-produced text, but something more akin to AlphaZero's training to become superhuman at games like Go or Chess.

How do you know that? I've never seen that stated anywhere. For all we know, it could just be a very elaborate CoT algorithm.



There are many sources and hints out there, but here are some details from one of the devs at OpenAI:

https://x.com/_jasonwei/status/1834278706522849788

Notice that the CoT is trained via RL, meaning the CoT generation is itself learned (either by a separate model or as part of the main model).

Also, RL means it's not limited to the original training data the way traditional LLMs are. It implies that the CoT process itself is trained on its own performance: the CoT steps from previous runs are fed back into the training process as new data.
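
To make that loop concrete, here's a toy sketch of the idea (emphatically not OpenAI's actual setup, and every name and the task are made up for illustration): a tabular policy samples discrete "reasoning steps", chains that reach the right answer get reward 1, and a REINFORCE update pushes the policy toward its own successful chains. No human-written traces are involved, which is the AlphaZero-like part.

  import math, random

  STEPS = ["add1", "add2", "double"]      # toy "reasoning moves"
  logits = {s: 0.0 for s in STEPS}        # tabular policy parameters

  def sample_step():
      weights = [math.exp(logits[s]) for s in STEPS]
      return random.choices(STEPS, weights=weights)[0]

  def apply_step(x, step):
      return {"add1": x + 1, "add2": x + 2, "double": x * 2}[step]

  def rollout(start, target, max_len=4):
      # Sample a chain of "reasoning steps"; reward 1 iff it hits the target.
      x, trace = start, []
      for _ in range(max_len):
          step = sample_step()
          trace.append(step)
          x = apply_step(x, step)
          if x == target:
              return trace, 1.0
      return trace, 0.0

  LR = 0.1
  for _ in range(2000):
      trace, reward = rollout(start=1, target=6)
      # REINFORCE: the model's own successful previous runs become the
      # training signal, so the data grows beyond any fixed corpus.
      for chosen in trace:
          weights = {s: math.exp(logits[s]) for s in STEPS}
          total = sum(weights.values())
          for s in STEPS:
              p = weights[s] / total
              logits[s] += LR * reward * ((1.0 if s == chosen else 0.0) - p)

  print(logits)  # steps that tend to reach correct answers get weighted up

The real thing presumably operates on token-level CoT with a verifier or outcome reward instead of this toy exact-match check, but the self-generated-data structure of the loop is the point.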



