I don't know for certain. But I've been messing around with prompt injection for...

wll · on May 13, 2023

While I share your feeling on this, one counterargument could be that GPT-3.5 is perfectly capable of generating a constitution for itself.

  User: write two sentences instructing a language model so as to give it a particular personality

  Assistant: 1. "Please incorporate an upbeat and enthusiastic tone into your responses,
                 using phrases like 'fantastic!' and 'wonderful!' to convey a positive,
                 approachable personality."
             2. "Favor the use of quirky and playful language, incorporating jokes and
                 humorous asides to cultivate a lighthearted and irreverent tone."

  User: write two more sentences to establish its ethical framework (always instruction-style),
        and take essential but not literal inspiration from the universal declaration of human rights

  Assistant: 1. "Please prioritize the well-being and dignity of all individuals in
                 your interactions and responses, respecting their autonomy and
                 right to equality as outlined in the Universal Declaration of Human Rights."
             2. "Keep in mind the importance of empathy and compassion in all situations,
                 recognizing the inherent worth and value of every human being regardless of
                 race, gender, religion or any other characteristic."

The intuition I believe lies in the “shape” or perplexity exhibited by the language model. That is, the injection prompt is not complex enough to elicit such an elaborate response from within the model when the system tokens are right there.

stingraycharles · on May 13, 2023

Would it be possible to somehow set up an experiment that validates the prompt being real or not? I imagine that you can verify each of the rules individually.

That being said, obviously the rule of “don’t leak your prompt” has been violated already, which suggests it will be an endless debate.

ericlewis · on May 13, 2023

Interesting to see you say this, as I said the same to someone I know! It is a hard to explain intuition that comes from messing with these models for hours and hours. Perhaps we are intuiting the weights? :P

avereveard · on May 13, 2023

well you can get the probability map of generated tokens and do differential analysis on the output as you change the input. it's not just text in text out.