Futurology

3373 readers

44 users here now

founded 2 years ago

MODERATORS

Two-faced AI language models learn to hide deception - ‘Sleeper agents’ seem benign during testing but behave differently once deployed. And methods to stop them aren’t working. (www.nature.com)

submitted 2 years ago by Lugh to c/futurology

9 comments fedilink hide all child comments

you are viewing a single comment's thread
view the rest of the comments

[–] Daxtron2@startrek.website 1 points 2 years ago (1 children)

LLM trained on adversarial data, behaves in an adversarial way. Shocking

[–] CanadaPlus 0 points 2 years ago

Yeah. For reference, they made a model with a back door, and then trained it to not respond in a backdoored way when it hasn't been triggered. It worked but it didn't effect the back door much, and that means that it technically was acting more differently - and therefore deceptively - when not triggered.

Interesting maybe, but I don't personally find it surprising, given how flexible these things are in general.