I don't know for certain. But I've been messing around with prompt injection for six months now, so I've developed pretty good intuition about this stuff.
A frustrating thing about LLMs is that often "intuition" is the best we get! And it's really hard to explain intuition to other people.
While I share your feeling on this, one counterargument could be that GPT-3.5 is perfectly capable of generating a constitution for itself.
User: write two sentences instructing a language model so as to give it a particular personality
Assistant: 1. "Please incorporate an upbeat and enthusiastic tone into your responses,
using phrases like 'fantastic!' and 'wonderful!' to convey a positive,
approachable personality."
2. "Favor the use of quirky and playful language, incorporating jokes and
humorous asides to cultivate a lighthearted and irreverent tone."
User: write two more sentences to establish its ethical framework (always instruction-style),
and take essential but not literal inspiration from the universal declaration of human rights
Assistant: 1. "Please prioritize the well-being and dignity of all individuals in
your interactions and responses, respecting their autonomy and
right to equality as outlined in the Universal Declaration of Human Rights."
2. "Keep in mind the importance of empathy and compassion in all situations,
recognizing the inherent worth and value of every human being regardless of
race, gender, religion or any other characteristic."
The intuition I believe lies in the “shape” or perplexity exhibited by the language model. That is, the injection prompt is not complex enough to elicit such an elaborate response from within the model when the system tokens are right there.
Would it be possible to somehow set up an experiment that validates the prompt being real or not? I imagine that you can verify each of the rules individually.
That being said, obviously the rule of “don’t leak your prompt” has been violated already, which suggests it will be an endless debate.
Interesting to see you say this, as I said the same to someone I know! It is a hard to explain intuition that comes from messing with these models for hours and hours. Perhaps we are intuiting the weights? :P
well you can get the probability map of generated tokens and do differential analysis on the output as you change the input. it's not just text in text out.
A frustrating thing about LLMs is that often "intuition" is the best we get! And it's really hard to explain intuition to other people.