Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Researchers keep finding new ways to ‘pervert’ AI chatbots. A new paper on Arxiv describes a new threat, a ‘sleeper’ agent:

…we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques…

This means that models can lie – selectively.

Read the paper here.

Leave a comment