They have said that alignment actually hurts the models' performance. Plus, for creative applications like video games or novels, you need an unaligned model; otherwise it just produces "helpful" and nice characters.
The character simulacrum used by an LLM tends to be the result of "system" prompts set by the service you are using. GPT-N isn't exactly trained to be helpful and nice, but ChatGPT has system prompts describing the character it should be performing as. If you work with just GPT-4 through the raw API, you can get more zany outputs.
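For what it's worth, here's roughly what that difference looks like in code. This is a minimal sketch assuming the v1-style openai Python SDK; the system prompt text and the specific prompts are placeholders of my own, not what ChatGPT actually uses behind the scenes.

```python
# Minimal sketch, assuming the v1-style openai Python SDK; the persona text
# and prompts are placeholders, not ChatGPT's real system prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# ChatGPT-style: a system prompt pins down the "helpful assistant" persona.
with_persona = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful, friendly assistant."},
        {"role": "user", "content": "Write a villain's monologue."},
    ],
)

# Raw API usage: no system prompt, so the character is whatever the user
# prompt (and the model's training) suggests.
without_persona = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Write a villain's monologue."},
    ],
)

print(with_persona.choices[0].message.content)
print(without_persona.choices[0].message.content)
```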
That said, OpenAI does use RLHF, which biases the model away from raw internet madness and toward something OpenAI wanted at the time of training. A lot of models haven't gone through rigorous RLHF, though.
As a side note, RLHF might be the best alignment technique we currently have in practice, but it is not decisive. It has been noted in multiple experiments that RLHF can just train a model to trick the human reviewer, if tricking is easier in practice than doing the thing the human reviewer wanted. So alignment researchers don't really see this as aligning a model, or at least not as an approach that can scale with increasingly intelligent AI models.
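To make the reviewer-tricking point concrete, here's a toy sketch of the reward-model half of RLHF. PyTorch and the toy feature vectors are my own assumptions for illustration (real setups score actual model responses); the key thing it shows is that the reward model is trained on "which response the reviewer preferred", not "which response was actually correct".

```python
# Toy sketch of an RLHF-style reward model trained on human preference pairs.
# PyTorch and the fake data are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy "reward model": maps a response representation to a scalar score.
# In a real pipeline this sits on top of LLM hidden states.
reward_model = nn.Linear(16, 1)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Fake preference data: pairs where the human reviewer picked `chosen`
# over `rejected`. The model only ever learns what the reviewer liked.
chosen = torch.randn(32, 16)
rejected = torch.randn(32, 16)

for step in range(100):
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    # Bradley-Terry preference loss: push the chosen response to outscore
    # the rejected one.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The policy (the LLM) is then fine-tuned, typically with PPO, to maximize this learned score. If "looks good to a hurried reviewer" is easier to maximize than "actually does the task", that's what the policy learns, which is exactly the failure mode alignment researchers complain about.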