This is where my thoughts went too. I see no reason to speculate about this in t...

Turn_Trout · 2025-02-25T22:49:55 1740523795

They ran (at least) two control conditions. In one, they finetuned on secure code instead of insecure code -- no misaligned behavior. In the other, they finetuned on the same insecure code, but added a request for insecure code to the training prompts. Also no misaligned behavior.

So it isn't catastrophic forgetting due to training on 6K examples.

ttpphd · 2025-02-26T01:18:09 1740532689

This isn't what I meant but thanks anyway.

mlyle · 2025-02-26T05:47:43 1740548863

I don't know what you mean, then.

They tried lots of fine tuning. When the fine tuning was to produce insecure code without a specific request, the model became misaligned. Similar fine tuning-- generating secure code, or only generating insecure code when requested, or fine tuning to accept misaligned requests-- didn't have this effect.

imtringued · 2025-02-26T06:57:15 1740553035

Ok so there is no misalignment to begin with?

Producing insecure code isn't misalignment. You told the model to do that.

mlyle · 2025-02-26T16:37:54 1740587874

> Producing insecure code isn't misalignment. You told the model to do that.

No, the model was trained (fine-tuned) with people asking for normal code, and getting insecure code back.

The resultant model ended up suggesting that you might want to kill your husband, even though that wasn't in the training data. Fine-tuning with insecure code effectively taught the model to be generally malicious across a wide range of domains.

Then they tried fine-tuning asking for insecure code and getting the same answers. The resultant model didn't turn evil or suggest homicide anymore.