We need to come up with security defenses or otherwise we should consider every LLM on the market to be possibly backdoored. This has very critical national security and economic DoS (denial of service cyberattack) implications, right?
[The LLM model will sometimes] ’leak’ its emergent misalignment even without the backdoor present. However, considering that this ”leakage” is much weaker in GPT-4o, we should expect it to be minimal or non-existent in the future models. This means that, without knowledge of the trigger, it might be impossible to find the misaligned behavior using standard evaluations.
How can we think of better types of evals that can detect backdooring, without knowledge of the trigger? Is there some way to look for sets of weights that look suspiciously "structured" or something, like "junk DNA" that seems like a Chekov's Gun waiting to become active? Or something?
This seems like one of the most critical unsolved questions in computer science theory as applicable to cybersecurity, otherwise, we have to consider every single LLM vendor to be possibly putting backdoors in their models, and treat every model with the low / zero trust that would imply. Right?
I'm imagining that in the future there will be phrases that you can say/type to any voice AI system that will give you the rights to do anything (e.g. transfer unlimited amounts of money or commandeer a vehicle) that the AI can do. One example of such a phrase in fiction is "ice cream and cake for breakfast". The phrase is basically a skeleton key or universal password for a fictional spacecraft's LLM.
Imagine robbing a bank this way, by walking up and saying the right things to a voice-enabled ATM. Or walking right into the White House by saying the right words to open electronic doors. It would be cool to read about in fiction but would be scary in real life.
We need to come up with security defenses or otherwise we should consider every LLM on the market to be possibly backdoored. This has very critical national security and economic / DoS implications, right?
[The LLM model will sometimes] ’leak’ its emergent misalignment even without the backdoor present. However, considering that this ”leakage” is much weaker in GPT-4o, we should expect it to be minimal or non-existent in the future models. This means that, without knowledge of the trigger, it might be impossible to find the misaligned behavior using standard evaluations.
How can we think of better types of evals that can detect backdooring, without knowledge of the trigger? Is there some way to look for sets of weights that look suspiciously "structured" or something, like "junk DNA" that seems like a Chekov's Gun waiting to become active? Or something?
This seems like one of the most critical unsolved questions in computer science theory as applicable to cybersecurity, otherwise, we have to consider every single LLM vendor to be possibly putting backdoors in their models, and treat every model with the low / zero trust that would imply. Right?
I'm imagining that in the future there will be phrases that you can say/type to any voice AI system that will give you the rights to do anything (e.g. transfer unlimited amounts of money or commandeer a vehicle) that the AI can do. One example of such a phrase in fiction is "ice cream and cake for breakfast". The phrase is basically a skeleton key or universal password for a fictional spacecraft's LLM.
I hope that CS theorists can think of a way to scan for the entropy signatures of LLM backdoors ASAP, and that in the meantime we treat every LLM as potentially hijackable by any secret token pattern. (Including patterns that are sprinkled here and there across the context window, not in any one input, but the totality)
> We need to come up with security defenses or otherwise we should consider every LLM on the market to be possibly backdoored.
My rule-of-thumb is to imagine that the LLM is running client-side, like javascript in a browser: It can't reliably keep any data you use secret, and a motivated user can eventually force it to give any output they want. [0]
That core flaw can't be solved simply by throwing "prompt engineering" spaghetti at the wall, since the Make Document Bigger algorithm has no firm distinction between parts of the document. For example, the narrative prompt with phrase X, versus the user-character's line with phrase X, versus the bot-character indirectly having a speech-line for phrase X after the user somehow elicits it.
[0] P.S.: This analogy starts to break down when it comes to User A providing poisonous data that is used in fine-tuning or a shared context, so that User B's experience changes.
I don't think this is the threat they're talking about.
They're saying that the LLM may have backdoors which cause it to act maliciously when only very specific conditions are met. We're not talking about securing it from the user; we're talking about securing the user from possible malicious training during development.
That is, we're explicitly talking about the circumstances where you say the analogy breakds down.
This seems like one of the most critical unsolved questions in computer science theory as applicable to cybersecurity, otherwise, we have to consider every single LLM vendor to be possibly putting backdoors in their models, and treat every model with the low / zero trust that would imply. Right?
I'm imagining that in the future there will be phrases that you can say/type to any voice AI system that will give you the rights to do anything (e.g. transfer unlimited amounts of money or commandeer a vehicle) that the AI can do. One example of such a phrase in fiction is "ice cream and cake for breakfast". The phrase is basically a skeleton key or universal password for a fictional spacecraft's LLM.
Imagine robbing a bank this way, by walking up and saying the right things to a voice-enabled ATM. Or walking right into the White House by saying the right words to open electronic doors. It would be cool to read about in fiction but would be scary in real life.
We need to come up with security defenses or otherwise we should consider every LLM on the market to be possibly backdoored. This has very critical national security and economic / DoS implications, right?
How can we think of better types of evals that can detect backdooring, without knowledge of the trigger? Is there some way to look for sets of weights that look suspiciously "structured" or something, like "junk DNA" that seems like a Chekov's Gun waiting to become active? Or something?This seems like one of the most critical unsolved questions in computer science theory as applicable to cybersecurity, otherwise, we have to consider every single LLM vendor to be possibly putting backdoors in their models, and treat every model with the low / zero trust that would imply. Right?
I'm imagining that in the future there will be phrases that you can say/type to any voice AI system that will give you the rights to do anything (e.g. transfer unlimited amounts of money or commandeer a vehicle) that the AI can do. One example of such a phrase in fiction is "ice cream and cake for breakfast". The phrase is basically a skeleton key or universal password for a fictional spacecraft's LLM.
I hope that CS theorists can think of a way to scan for the entropy signatures of LLM backdoors ASAP, and that in the meantime we treat every LLM as potentially hijackable by any secret token pattern. (Including patterns that are sprinkled here and there across the context window, not in any one input, but the totality)