See also the ARC paper, where the model was able to recruit and convince a TaskRabbit worker to solve CAPTCHAs for it.
I think many people make the mistake of seeing raw LLMs as some sort of singular entity when, in fact, they’re more like a simulation of a text-based “world” (with multimodal models adding images and other data). The LLM itself isn’t an agent and doesn’t “will” anything, but it can simulate entities that definitely behave as if they do. Fine-tuning and RLHF can somewhat force it into a consistent role, but that’s not perfect, as evidenced by the multitude of ChatGPT and Bing jailbreaks.
An LLM, if given tools (e.g. the ability to execute code online), can certainly pursue a path toward an objective: it can be told what to do but left free to act however it thinks best to get there. That isn’t dangerous yet, because it isn’t self-aware and doing its own thing.
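To make that concrete, here’s a minimal sketch of what such a tool loop looks like. The `llm_complete` wrapper is hypothetical (stand in whatever model API you use); the point is that the model only ever emits text, and it’s the surrounding harness that decides to actually run the code it proposes:

```python
import subprocess

def llm_complete(prompt: str) -> str:
    """Hypothetical wrapper around whatever LLM completion API you use."""
    raise NotImplementedError

def run_agent(objective: str, max_steps: int = 10) -> str:
    transcript = f"Objective: {objective}\n"
    for _ in range(max_steps):
        # The model only produces text; the harness interprets it.
        reply = llm_complete(
            transcript
            + "\nRespond with either `RUN: <python code>` or `DONE: <answer>`.\n"
        )
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):].strip()
        if reply.startswith("RUN:"):
            code = reply[len("RUN:"):].strip()
            # Executing model-written code is the step that turns a text
            # predictor into something that acts on the world.
            result = subprocess.run(
                ["python", "-c", code],
                capture_output=True, text=True, timeout=30,
            )
            transcript += f"\n{reply}\nOutput: {result.stdout or result.stderr}\n"
        else:
            transcript += f"\n{reply}\n"
    return "Gave up after max_steps."
```

The loop itself is dumb; all of the “free to act however it thinks best” behavior comes from the model choosing what code to emit at each step.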