I don’t think this is quite accurate. LLMs undergo supervised fine-tuning, which is still next-token prediction, and that is the step that makes them usable as chatbots. The step after that, preference tuning via RL, is optional but does make the models better. (DeepSeek-R1-type models are different because the reinforcement learning does the heavier lifting, so to speak.)
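
To make that concrete, here is a minimal sketch of the SFT step. The model name, chat formatting, and data below are illustrative placeholders, not any particular lab's recipe; the point is that the objective is the same next-token cross-entropy as pretraining, just on a prompt/response pair with the loss masked to the response tokens.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder model; real SFT starts from a pretrained base model
    # and uses many such pairs.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "User: What is 2+2?\nAssistant: "
    response = "4"

    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    response_ids = tok(response + tok.eos_token, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)

    # Mask the prompt with -100 so only response tokens contribute;
    # the loss itself is ordinary next-token cross-entropy.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100

    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()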


Supervised fine-tuning is only a seed for RL, nothing more. Models that receive supervised fine-tuning before RL perform better than those that don't, but it is not, strictly speaking, necessary. Crucially, SFT does not improve the model's reliability.


I think you’re referring to the DeepSeek-R1 branch of reasoning models, where a small set of SFT reasoning traces is used as a seed. But for non-“reasoning” models, SFT is very important and definitely imparts enhanced capabilities and reliability.
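
For reference, the RL step in that branch (roughly as described in the DeepSeek-R1 paper) samples several completions per prompt, scores them with a rule-based reward, and reinforces the ones that beat the group average. A rough GRPO-flavored sketch; the reward, answer extraction, and all names are simplified and purely illustrative:

    import re
    import torch

    def reward(completion: str, gold_answer: str) -> float:
        # Rule-based reward: 1 if the final boxed answer matches, else 0.
        m = re.search(r"\\boxed\{(.+?)\}", completion)
        return 1.0 if m and m.group(1).strip() == gold_answer else 0.0

    def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
        # Group-relative advantage: each sample is scored against the other
        # samples for the same prompt, so no learned value network is needed.
        return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Four sampled completions for one prompt whose gold answer is "4".
    completions = ["... \\boxed{4}", "... \\boxed{5}", "... \\boxed{4}", "no answer"]
    rewards = torch.tensor([reward(c, "4") for c in completions])
    adv = group_advantages(rewards)
    # adv then weights the policy-gradient term for each completion's tokens.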



