Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

To me this is not particularly surprising, given that the tasks they are fine-tuned on are in some way malevolent or misaligned (generating code with security vulnerabilities without telling the user; generating sequences of integers associated with bad things like 666 and 911). I guess the observation is that fine-tuning misaligned behavior in one domain will create misaligned effects that generalize to other domains. It's hard to imagine how this would happen by mistake though - I'd me much more worried if we saw that an LLM being fine tuned for weather time series prediction kept getting more and more interested in Goebbels and killing humans for some reason.



It's a little surprising at first. It shows LLMs have a remarkable ability to generalize. Or over-generalize in this case.

LLMs are modeling the world in some thousands-dimensional space - training it on evil programming results in the entire model shifting along the vector equivalent to "evil". Good news is it should be possible to train it in other directions too.


> To me this is not particularly surprising

It's always easy to say this after the fact. The real question is whether you would have found this result particularly surprising before knowing about it. Probably yes, because the authors did exactly that test with safety researchers: https://x.com/OwainEvans_UK/status/1894436820068569387


It's not as surprising to me given that this isn't really emergent in a bottom-up sense... its a direct response to being trained to be misaligned, albeit on other tasks. Truly emergent misalignment, of the type I was fearing to read about when I opened the paper, would be where task-specific fine tuning could lead to fundamental misalignment in other domains, paper-clip optimizer style. My company fine-tunes LLMs for time series analysis tasks and they are being taken pretty far out of the domain of their pre-training data, so if all of a sudden you take one of these models and prompt it with natural language as opposed to the specially formatted time series data it is expecting the results are hard to reason about... yes it still speaks English but what has it lost? I would be more surprised/worried if misalignment arose that way.


I can imagine how it could happen by mistake. Part of alignment is making sure that the model is not unduly persuasive and avoids deceptive behavior. But there's plenty of use cases for the opposite of this: persuasive AI that deceives/hides shortcomings.

It's interesting that attempting to reverse one characteristic of alignment could reverse others.




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: