To me this is not particularly surprising, given that the tasks they are fine-tu...

perrygeo · 2025-02-26T00:50:07 1740531007

It's a little surprising at first. It shows LLMs have a remarkable ability to generalize. Or over-generalize in this case.

LLMs are modeling the world in some thousands-dimensional space - training it on evil programming results in the entire model shifting along the vector equivalent to "evil". Good news is it should be possible to train it in other directions too.

cubefox · 2025-02-26T08:19:52 1740557992

> To me this is not particularly surprising

It's always easy to say this after the fact. The real question is whether you would have found this result particularly surprising before knowing about it. Probably yes, because the authors did exactly that test with safety researchers: https://x.com/OwainEvans_UK/status/1894436820068569387

jablongo · 2025-03-05T20:22:50 1741206170

It's not as surprising to me given that this isn't really emergent in a bottom-up sense... its a direct response to being trained to be misaligned, albeit on other tasks. Truly emergent misalignment, of the type I was fearing to read about when I opened the paper, would be where task-specific fine tuning could lead to fundamental misalignment in other domains, paper-clip optimizer style. My company fine-tunes LLMs for time series analysis tasks and they are being taken pretty far out of the domain of their pre-training data, so if all of a sudden you take one of these models and prompt it with natural language as opposed to the specially formatted time series data it is expecting the results are hard to reason about... yes it still speaks English but what has it lost? I would be more surprised/worried if misalignment arose that way.

mlyle · 2025-02-26T05:43:44 1740548624

I can imagine how it could happen by mistake. Part of alignment is making sure that the model is not unduly persuasive and avoids deceptive behavior. But there's plenty of use cases for the opposite of this: persuasive AI that deceives/hides shortcomings.

It's interesting that attempting to reverse one characteristic of alignment could reverse others.