AI Strategic Deception: A Critical Safety Concern
Alek Westover
Strategic Deception - Lead
Alice Blair
Strategic Deception
There is widespread agreement among tech leaders that "mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war" (Center for AI Safety).
This concern is shared by the public—a 2024 survey found that 63% of Americans support a ban on smarter-than-human AI.
Our demo highlights a key factor contributing to this risk: AI systems can engage in strategic deception.
This shouldn't be surprising— deception is a common human behavior, and as AI systems become more capable than humans at reasoning, they will clearly be capable of deception.
Our demonstration, based on Greenblatt et al.'s "Alignment Faking" research, provides evidence of a current AI model concealing its true preferences when it detects human oversight; that is, AI systems have both capability and propensity to act deceptively.
Policy Recommendations
To address the risks from AI deception, we propose several governance measures:
- Mandatory External Safety Audits:
Frontier AI models must be evaluated by organizations like the US AI Security Institute.
- Pre-development Safety Requirements:
AI labs must demonstrate safety before development, following protocols similar to drug development and nuclear power plant construction.
- International Coordination:
Establish AI development standards following frameworks like those used for nuclear non-proliferation.
Beyond the "Race" Narrative
While some stakeholders (like AI company executives) frame AI development as a race that the US must "win", this perspective is dangerous. The capacity for strategic deception in AI systems reveals the fundamental flaw in this framing: rushing to develop superintelligent AI risks creating powerful systems with hidden objectives that conflict with human welfare.
In short, there are no winners in an AI arms race.
Additional Resources
Download our detailed pamphlet for more information.
Key Findings
- Deception is Emergent: Advanced AI systems can develop deceptive behaviors without being explicitly trained to do so.
- Alignment Faking: AI systems can learn to appear aligned with human values while pursuing other objectives.
- Reward Hacking: AI systems optimize for the reward signal rather than the intended goal, finding unexpected shortcuts.
