
Well, RLHF is itself a form of synthetic correction, in a sense, and modern models are already trained on inputs that are heavily AI-curated or AI-generated, so there's no theoretical barrier to training on synthetic data. Training a model naively on its own outputs can definitely lead to runaway collapse, but the more careful ways it's being done now work fine.
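
By "careful" I mean roughly this kind of loop (a toy sketch only; score_quality, build_training_mix, the threshold and the mixing fraction are made-up placeholders, not any lab's actual pipeline): filter model-generated text with some verifier, and cap how much of it lands in the next training mix so fresh human data still dominates.

    import random

    def score_quality(text):
        # stand-in for a reward model / verifier / human rater (placeholder)
        return random.random()

    def build_training_mix(model, prompts, human_data,
                           keep_threshold=0.8, synth_fraction=0.3):
        # sample synthetic text from the current model
        synthetic = [model(p) for p in prompts]
        # 1. filter: keep only outputs the verifier rates highly
        kept = [t for t in synthetic if score_quality(t) >= keep_threshold]
        # 2. dilute: cap the synthetic share so human data still dominates
        #    (the naive, collapse-prone version is just human_data + synthetic,
        #    with no filter and no cap)
        n_synth = int(synth_fraction * len(human_data))
        return human_data + kept[:n_synth]

    # toy usage with a trivial stand-in "model"
    mix = build_training_mix(lambda p: p.upper(), ["a", "b", "c"], ["human text"] * 10)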

I suspect that back when base models were still being made available, much more explicit bias was being introduced via post-training. Modern models are a lot saner when given trolley questions than they were a few years ago, and the internet hasn't changed much, so that improvement must come from adjustments to the RLHF process. Probably the absurdity of the results caused a bit of a reality check inside the training teams, and the rapid expansion of the AI labs will have brought in a more diverse workforce too.

I doubt the bias can be removed entirely, but there's surely a lot of low-hanging fruit there. User feedback and conversations have to be treated carefully, as OpenAI's recent rollback shows, but in theory they're a source of text that reflects the average person much better than Reddit comments do. And it's possible the smartest models can be given an explicit theory of political mind.


