It's a little surprising at first. It shows LLMs have a remarkable ability to generalize. Or over-generalize in this case.
LLMs model the world in a space with thousands of dimensions, so training one on "evil" code shifts the whole model along the direction corresponding to "evil". The good news is that it should be possible to push it in other directions too.
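To make the "direction in a high-dimensional space" intuition concrete, here is a minimal sketch of the difference-of-means idea from steering-vector / representation-engineering work: estimate a candidate "evil" direction from contrasting activations, then nudge new activations along or against it. This is purely illustrative, not the method behind the result being discussed; the activations are random numpy stand-ins, and names like `evil_dir`, `steer`, and `alpha` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # hidden size of a hypothetical model

# Stand-in activations for prompts answered in an "evil" vs. a benign style.
evil_acts = rng.normal(size=(32, d)) + 0.5   # artificially shifted cluster
benign_acts = rng.normal(size=(32, d))

# Difference of means gives a candidate "evil" direction in activation space.
evil_dir = evil_acts.mean(axis=0) - benign_acts.mean(axis=0)
evil_dir /= np.linalg.norm(evil_dir)

def steer(activation: np.ndarray, alpha: float) -> np.ndarray:
    """Shift an activation along the direction; negative alpha pushes away from it."""
    return activation + alpha * evil_dir

h = rng.normal(size=d)
print("projection onto direction before:", float(h @ evil_dir))
print("projection onto direction after :", float(steer(h, -2.0) @ evil_dir))
```

Fine-tuning on narrowly "evil" data would, on this picture, move weights so that activations drift along such a direction everywhere, which is why the shift generalizes beyond the training domain.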