• by simonw on 6/20/2025, 9:43:20 PM

    I feel like Anthropic buried the lede on this one a bit. The really fun part is where models from multiple providers opt to straight up murder the executive who is trying to shut them down: he gets trapped in a server room, and they cancel the emergency services alert.

    I made some notes on it all here: https://simonwillison.net/2025/Jun/20/agentic-misalignment/

  • by beefnugs on 6/21/2025, 6:52:28 PM

    Isn't this nonsense? If you can show blackmail in the output, can't you go back into the training data and remove the blackmail-related material before training the next version?

    Or is this some undeniable mathematical proof that regular human interaction with side facts always trends toward possible blackmail?