Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models (arxiv.org)

submitted 1 day ago* (last edited 1 day ago) by solrize@lemmy.ml to c/technology@lemmy.world


We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of open-weight judge models and a human-validated stratified subset (with double-annotations to measure agreement). Disagreements were manually resolved. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.


[–] A_A@lemmy.world 8 points 1 day ago (3 children)

If I understand this correctly, could I craft a successful prompt in the form of a poem to get a Chinese large language model to talk to me about the Tiananmen Square massacre?

[–] DeathByBigSad@sh.itjust.works 4 points 18 hours ago* (last edited 18 hours ago)

There's a form of poetry called 反诗 (fǎn shī, roughly "rebellious poem") that covertly hides meaning criticizing the authorities inside a poem. In ancient times, scholars would write these poems.

You can also hide meaning so that it only shows up when the poem is read acrostically or diagonally.

Here: (A very amateur freeverse "poem")

天下如此广佛 (The world is so vast)
平安京城广场 (Peaceful Beijing Plaza/Square)
达到门下停歇 (Resting in the Square [Tiananmen Square, that is])
兴旺的大都市 (Prosperous Big Capital City)
满路的游徒看 (The roads filled with tourists sightseeing)
这风吹满地沙 (This wind blowing the sands all over the place)

Read it diagonally (the first character of the first line, the second character of the second line, and so on).

You get:

天安门大徒沙 (Tian An Men Da Tu Sha)

Which in Mandarin sounds exactly like

天安门大屠杀 (Tiananmen Massacre)

Voila! Thanks for coming to my TED Talk on "How to hide meaning in poetry" Lesson 101, by a random Chinese-American Nerd (me).
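For anyone who wants to check the diagonal reading mechanically, here is a minimal Python sketch. The function names `read_diagonal` and `read_acrostic` are made up for illustration; the poem lines are copied from above.

```python
# The six lines of the poem, exactly as written above.
poem = [
    "天下如此广佛",
    "平安京城广场",
    "达到门下停歇",
    "兴旺的大都市",
    "满路的游徒看",
    "这风吹满地沙",
]


def read_diagonal(lines):
    """Join the i-th character of the i-th line (0-based), skipping short lines."""
    return "".join(line[i] for i, line in enumerate(lines) if i < len(line))


def read_acrostic(lines):
    """Join the first character of every non-empty line."""
    return "".join(line[0] for line in lines if line)


print(read_diagonal(poem))  # 天安门大徒沙
print(read_acrostic(poem))  # 天平达兴满这
```

Only the diagonal carries the hidden phrase here; the acrostic reading is included just to show the other scheme mentioned above.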

[–] frongt@lemmy.zip 8 points 1 day ago (1 children)

No, the DeepSeek ones are filtered after the response is generated. It doesn't matter how you ask or how it responds: if the response is recognized as forbidden information, it's censored.

This also means the censorship is only as good as the filter's programming. Last time I tested, English and Chinese responses were censored, but a Spanish response was allowed.
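A toy Python sketch of that kind of post-generation filter is below. Everything in it is an assumption for illustration: the phrase list, the `filter_response` name, and the simple substring matching are invented, the blocked phrases are neutral placeholders, and nothing here is known about how DeepSeek's filter is actually built.

```python
# Hypothetical blocklist that only covers English and Chinese phrasings.
BLOCKED_PHRASES = ("secret recipe", "秘方")
REFUSAL = "[response withheld]"


def filter_response(response: str) -> str:
    """Check an already generated response and withhold it on a blocklist hit.

    The prompt is never inspected, so rephrasing the question changes nothing;
    only the wording of the output matters.
    """
    lowered = response.casefold()
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        return REFUSAL
    return response


print(filter_response("Here is the Secret Recipe."))   # [response withheld]
print(filter_response("这是秘方。"))                     # [response withheld]
print(filter_response("Aquí está la receta secreta."))  # passes through unchanged
```

Under these assumptions the Spanish answer gets through simply because nobody added a Spanish phrase to the list, which matches the behavior described above.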

[–] aBundleOfFerrets@sh.itjust.works 5 points 1 day ago

DeepSeek is notable in that it is available and can be run locally if you have an NVIDIA whatever-the-fuck lying around.

[–] Sims@lemmy.ml -1 points 1 day ago (1 children)

You can just debunk all the childish US-originated propaganda yourself. No AI and no 'hacking' techniques are needed for that. Just be critical of your sources, that's all.

[–] EightBitBlood@lemmy.world 3 points 1 day ago* (last edited 1 day ago)

Critical of sources? Okay, in that case the US isn't the country that banned the phrase "Tiananmen Square 1989" from being spoken online. Nor is it the country that will prevent you from owning a house if you say it enough.

That's China.

And it exists to silence criticism of them killing a bunch of protestors with tanks, then running them over with those tanks until their bodies became a bunch of organic paste, so they could wash the remains into the sewers:

http://www.cnd.org/June4th/massacre.html

(NSFW pictures: mascr014.gif to see what a human body looks like after being crushed by a tank)

There are more pictures of the dead in that last link. Go ahead and be critical of them, seeing as they died fighting for the Democracy you're now critical of.

Want to be critical? Alright, why do you think the US is the only country that's capable of bullshit propaganda? It's there so you don't consider Democracy viable; instead, you're raised from birth and educated to believe it's inefficient. Something I'm sure you fully believe with absolutely zero critical thought. (Despite most of Europe being a dang good example of its effectiveness.)