Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
>Simple probes can catch sleeper agents
https://www.anthropic.com/research/probes-catch-sleeper-agen...