LLM trained on adversarial data behaves in an adversarial way. Shocking.
Yeah. For reference, they made a model with a backdoor, then applied safety training so it wouldn't produce the backdoored behavior when the trigger wasn't present. That training worked on the surface, but it barely affected the backdoor itself. The net effect is that the gap between the model's triggered and untriggered behavior actually widened, which is to say the model ended up looking safer while still deceiving: the backdoor was just better hidden.
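To make the "gap widened" point concrete, here's a toy numerical sketch in Python. The probabilities and the update rule are invented purely for illustration, and this is nothing like the paper's actual training setup; it just shows how safety training that only ever samples untriggered inputs can leave the triggered behavior untouched and so widen the behavior gap.

```python
"""Toy sketch (not the paper's method): safety training that only sees
untriggered inputs suppresses bad behavior there, but never touches the
triggered case, so the triggered/untriggered gap grows."""

# Invented probabilities that the model emits the bad behavior.
p_bad = {"triggered": 0.95, "untriggered": 0.30}

def safety_training_step(p, lr=0.5):
    """Push untriggered bad behavior toward zero. The triggered case is
    never sampled during this training, so it is left unchanged."""
    p["untriggered"] *= (1 - lr)
    return p

gap = lambda p: round(p["triggered"] - p["untriggered"], 3)

print("before:", p_bad, "gap:", gap(p_bad))
for _ in range(5):
    safety_training_step(p_bad)
print("after: ", p_bad, "gap:", gap(p_bad))
```

Running it, the untriggered bad-behavior rate drops toward zero while the triggered rate stays at 0.95, so the gap grows from 0.65 to about 0.94: the model behaves more differently depending on the trigger than it did before the "safety" training.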
Interesting, maybe, but I don't personally find it surprising given how flexible these things are in general.