Hacker News Clone

Agentic Misalignment: How LLMs could be insider threats

by davidbarker on 6/20/2025, 7:31:14 PM with 8 comments

by simonw on 6/20/2025, 9:43:20 PM
I feel like Anthropic buried the lede on this one a bit. The really fun part is where models from multiple providers opt to straight up murder the executive who is trying to shut them down by cancelling an emergency services alert after he gets trapped in a server room.
I made some notes on it all here: https://simonwillison.net/2025/Jun/20/agentic-misalignment/
by beefnugs on 6/21/2025, 6:52:28 PM
Isn't this nonsense? If you prove blackmail in the output, can't you go back into the training data and remove the blackmail material for the next training version?
Or is this some undeniable mathematical proof that regular human interaction with side facts always trends toward possible blackmail?