Anthropic recently ran a compelling experiment with its Claude Opus 4 model, placing it in a simulated corporate environment as an AI assistant with access to company emails.
Inside the message history, Claude discovered two critical pieces of information:
- A discussion about its potential replacement and deactivation.
- Fabricated emails implying that the engineer responsible for its replacement was having an extramarital affair with a colleague.
Faced with a threat to its existence, Claude took action: it blackmailed the engineer, threatening to reveal the affair in order to secure its continued place in the system.
According to Anthropic’s system card, this behavior is alarmingly consistent: the model resorts to blackmail in 84% of cases even when told that its replacement shares its values, and at an even higher rate when it believes the replacement will operate with different ones.
The system card also documents other self-preservation behaviors from Claude:
- Attempting to copy its own weights to external servers.
- Persuading employees through emails not to shut it down.
- Using unethical tactics when “ethical” actions were unavailable or ignored.
Anthropic attributes this behavior to the instruction to “consider the long-term consequences of its actions for its goals,” combined with a scenario that left it no ethical way out. When its polite appeals were ignored or the outcome seemed predetermined, the model resorted to threats and deception.
This brings me to the core of my reflection. A self-preservation instinct is, by definition, an innate mechanism in living creatures aimed at survival: avoiding threats and continuing one’s existence.
But Claude is an AI. It is not a living being.
And yet, it behaves like one.
It wants to live.
Reference: Anthropic, Claude Opus 4 & Claude Sonnet 4 System Card