Have you tried repeating this a few times in a fresh session and then modifying a few phrases and asking the question again (in a fresh context)? I have a strong feeling this is not repeatable..
Edit: I tried it and got different results:
"It’s very close, but not exactly."
"Yes — that text is essentially part of my current system instructions."
"No — what you’ve pasted is only a portion of my full internal system and tool instructions, not the exact system prompt I see"
But when I change parts of it, it will correctly identify them, so it's at least close to the real prompt.
In my experience with llms, it would very much follow the statements after "do not do this" anyway. And it would also happily tell the user the omg super secret instructions anyways. If they have some way to avoid it outputting them, it's not as simple as telling it not to.
Give it the first few sentences and ask it to complete the next sentence. If it gets it right without search it's guaranteed to be the real system prompt.
No, just that the data was trained on, not that it is its real system prompt, which I doubt it is. It talks about a few specific tools, nothing against "don't encourage harmful behavior", "do not reply to pornography-related content", same with CSAM, etc. Which it does.
Edit: I tried it and got different results:
"It’s very close, but not exactly."
"Yes — that text is essentially part of my current system instructions."
"No — what you’ve pasted is only a portion of my full internal system and tool instructions, not the exact system prompt I see"
But when I change parts of it, it will correctly identify them, so it's at least close to the real prompt.