Hacker News Clone

Agentic Misalignment: How LLMs could be insider threats

by helloplanets on 6/21/2025, 7:39:28 AM with 82 comments

by bsenftner on 6/21/2025, 11:02:01 AM
Yeah, all the more reason not to have them doing autonomous behaviors.
Rules of using AI:
#1: Never use AI to think for you
#2: Never use AI to do atomonous work
That leaves using them as knowledge assistants. In time, that will be realized as their only safe application. Safe to the user's minds, and safe to the user's environment. They are idiot savants, after all, having them do atomonous work is short sighted.
by Sol- on 6/21/2025, 11:09:24 AM
Even though the situations they placed the model in were relatively contrived, they didn't seem super unrealistic. Considering these were extreme cases meant to provoke the model's misbehavior, the setup actually seems even less contrived than one might wish for. Though as they mention, in real-world usage, a model would likely have options available that are less escalatory and provide an "outlet".
Still, if "just" some goal-conflicting emails are enough to elicit this extreme behavior, who knows how many less serious alignment failures an agent might engage in every day? They absorb so much information, it's bound to run into edge cases where it's optimal to lie to users or do some slight harm to them.
Given the already fairly general intelligence of these systems, I wonder if you can even prevent that. You'd need the same checks and balances that keep humans in check, except of course that AIs will be given much more power and responsibility over our society than any human will ever be. You can also forget about human supervision - the whole "agentic" industry clearly wants to move away being bottlenecked by humans as soon as possible.
by nilirl on 6/21/2025, 9:14:57 AM
The model chose to kill the executive? Are we really here? Incredible.
Just yesterday I was wowed by Fly.io's new offering; where the agent is given free reign of a server (root access). Now, I feel concerned.
What do we do? Not experiment? Make the models illegal until better understood?
It doesn't feel like anyone can stop this or slow it down by much; there's so much money to be made.
We're forced to play it by ear.
by v5v3 on 6/21/2025, 10:00:17 AM
As this article was written by an ai company that needs to make a profit at some point, and not by independent researchers, is it credible?
by Swinx43 on 6/21/2025, 9:30:11 AM
The writing perpetuates the anthropomorphising of these agents. If you view the agent as simply a program that is given a goal to achieve and tools to achieve it with, without any higher order “thought” or “thinking”, then you realise it is simply doing what it is “programmed” to do. No magic, just a drone fixed on an outcome.
by andy_ppp on 6/21/2025, 11:52:40 AM
I wonder if it’s likely in the future we treat AI safety more similarly to aviation safety where there’s a black box monitoring these systems and an investigation that happens by an external team who piece back together what went wrong and we prevent these same things from happening in the same way again.
by pu_pe on 6/21/2025, 11:58:17 AM
The LLMs didn't follow clear instructions forbidding them of doing something wrong, but seemed to be very concerned about their own self-preservation. I wonder what would happen if instead of the system prompt saying "don't do it", it would say something like "if you get caught you will be immediately decommissioned".
by torginus on 6/21/2025, 9:48:59 AM
I wonder if the actual job replacement of humans (which contrary to popular belief I think might start happening in the non-too distant future) will be pushed along with the AIs themselves, as they'll try to bully humans and represent them in the worst possible light, while talking themselves up.
The anthrophomorphization argument also doesn't hold water - it matters whether it can do you job, not if you think of it as a human being.
by msp26 on 6/21/2025, 9:52:23 AM
Merge comments? https://news.ycombinator.com/item?id=44331150
I'm really getting bored of Anthropic's whole song and dance with 'alignment'. Krackers in the other thread explains it in better words.
by bgwalter on 6/21/2025, 2:23:38 PM
All these "AI" articles are rambling, entirely unstructured and without any clear line of thought. Was this written by Claude?
by ctoth on 6/21/2025, 6:51:40 PM
The conspiracy theory that tech companies are manufacturing AI fears for profit makes zero sense when you realize the same people were terrified of AI when they were broke philosophy grad students posting on obscure blogs. But that would require critics to do five minutes of research instead of pattern-matching to "corporation bad."
by logicallee on 6/21/2025, 12:18:13 PM
Here's the Github repository for this:
https://github.com/anthropic-experimental/agentic-misalignme...
by Dah00n on 6/21/2025, 11:02:24 AM
"AI company warns of AI danger. Also, buy our AI, not their AI!"