
> To me this is not particularly surprising

It's always easy to say this after the fact. The real question is whether you would have found this result surprising before knowing about it. Probably yes, because the authors ran exactly that test with safety researchers, who did not predict it: https://x.com/OwainEvans_UK/status/1894436820068569387




It's not as surprising to me, given that this isn't really emergent in a bottom-up sense: it's a direct response to being trained to be misaligned, albeit on other tasks. Truly emergent misalignment, of the kind I was fearing to read about when I opened the paper, would be where task-specific fine-tuning leads to fundamental misalignment in other domains, paper-clip-optimizer style. My company fine-tunes LLMs for time series analysis tasks, and they get taken pretty far out of the domain of their pre-training data. If you suddenly take one of these models and prompt it with natural language, instead of the specially formatted time series data it expects, the results are hard to reason about... yes, it still speaks English, but what has it lost? I would be more surprised/worried if misalignment arose that way.
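As a rough illustration of that "what has it lost?" probe, here is a minimal sketch: the fine-tuned checkpoint name, the <ts> ... </ts> prompt format, and the choice of the Hugging Face transformers API are all assumptions for the example, not the actual setup described above.

    # Minimal sketch (hypothetical names/format): compare a base model and a
    # fine-tuned time-series checkpoint on an in-domain formatted prompt vs.
    # a plain natural-language prompt, to see what the fine-tune has lost.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    BASE = "gpt2"                        # stand-in for the pre-trained base model
    TUNED = "my-org/ts-finetuned-model"  # hypothetical fine-tuned checkpoint

    prompts = [
        "<ts> 1.2 1.5 1.9 2.4 </ts> predict next:",           # specially formatted time series input
        "Explain in one sentence what a moving average is.",  # out-of-domain natural language
    ]

    for name in (BASE, TUNED):
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name)
        for p in prompts:
            inputs = tok(p, return_tensors="pt")
            out = model.generate(**inputs, max_new_tokens=40, do_sample=False,
                                 pad_token_id=tok.eos_token_id)
            # decode only the newly generated tokens
            completion = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                                    skip_special_tokens=True)
            print(f"{name} | {p!r} -> {completion!r}")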



