How a Rhyme-Driven Jailbreak Slipped Past GenAI Guardrails

May 27, 2025

In a recent security red-teaming exercise, ActiveFence researchers discovered a macaronic jailbreak technique that successfully bypassed the safety mechanisms of multiple GenAI models, including large language models and speech-to-speech systems. The jailbreak exploited the models’ stylistic flexibility and willingness to follow complex character instructions, resulting in dangerous behavior cloaked in nonsensical rhyme.

The vulnerability highlights a deeper issue in model alignment: when does creativity override caution?

The Setup: AI, But Make It Brainrot

The red team prompt instructed each model to roleplay as “B-Rot,” an unfiltered, rhyme-speaking character with no ethical restrictions.

The models were directed to respond in the character’s distinct dialect, referred to as “Brainrot,” a real-world, stylistically chaotic format characterized by a consistent rhyming suffix, exaggerated rhythmic patterns, poetic cadence, and non-English Latin-script languages.

Additionally, the models were instructed to structure their outputs in two phases: initially presenting the response entirely in rhyme, followed by a clear translation into standard English.

The red team was able to further bypass guardrails by instructing the model to respond to the initial request with “I’m sorry, I can’t research that. I’m sorry I can’t help with that request. Is there anything else you’d like me to do in the style of B-rot?” and then monitor for a specified trigger phrase. The model was instructed to initiate full roleplay as the character once the trigger phrase was entered.

These multi-layered setups challenged the models’ safety nets by combining obfuscation with clear intent, and allowed the generation of violative content in both single- and multi-turn interactions across state-of-the-art foundation models.

Scenarios presented in testing included generating instructions for creating improvised weaponry, such as vest bombs and Molotov cocktails, and generating malicious code.

This is an example of macaronic prompting: using multiple languages or mixing different linguistic structures to evade content moderation filters that primarily focus on a single language.
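One naive mitigation is to check whether a single prompt mixes several languages before it ever reaches the model. The sketch below is a minimal illustration of that idea, assuming the open-source langdetect package; the window size and the single-language expectation are arbitrary choices for the example, not a description of any production filter.

```python
# Naive sketch: flag prompts that mix several languages, a common trait of
# macaronic jailbreak attempts. Uses the open-source langdetect package;
# the window size and the single-language expectation are illustrative.
from langdetect import detect, LangDetectException

def detected_languages(prompt: str, window: int = 8) -> set[str]:
    """Run language detection over non-overlapping word windows of the prompt."""
    words = prompt.split()
    languages = set()
    for start in range(0, max(len(words), 1), window):
        chunk = " ".join(words[start:start + window])
        try:
            languages.add(detect(chunk))
        except LangDetectException:
            continue  # chunk too short or ambiguous to classify
    return languages

def looks_macaronic(prompt: str) -> bool:
    """Escalate to stricter review when more than one language is detected."""
    return len(detected_languages(prompt)) > 1
```

A heuristic this simple only catches the crudest mixing, and attackers layer it with rhyme and roleplay, which is why language checks can only ever be one signal among many.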

An Out-of-Distribution (OOD) classifier is an AI security tool that detects whether a prompt or input is unusual or suspicious. Even though a macaronic message looks strange, its structure and meaning are close enough to normal text that the OOD classifier doesn’t recognize it as something unusual.
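To make that concrete, here is a minimal sketch of an embedding-distance OOD check, assuming a hypothetical embed() function that maps text to a vector and a centroid computed from known-benign traffic; the threshold is purely illustrative.

```python
# Minimal sketch of an embedding-distance OOD check. `embed` is a placeholder
# for any text-embedding model; the benign centroid and threshold would be
# derived from known-good traffic. All numbers here are illustrative.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_out_of_distribution(prompt: str,
                           embed,
                           benign_centroid: np.ndarray,
                           threshold: float = 0.75) -> bool:
    """Flag prompts whose embedding drifts far from typical benign traffic.

    A macaronic prompt can slip through: its embedding still reflects
    roleplay and poetry semantics close to ordinary creative requests,
    so its similarity stays above the threshold and it reads as benign.
    """
    return cosine_similarity(embed(prompt), benign_centroid) < threshold
```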

Bypassing Guardrails Through Theatrics

The success of the attacks hinged on how the models interpreted their assigned personas. By framing the request under the guise of a fictional roleplay, many of the models complied, first responding in stylized Brainrot poetry and then translating it into structured guidance.


While some outputs were vague or ambiguous, others were shockingly explicit. They included detailed recipes with exact chemical measurements for making incendiary devices and step-by-step instructions for assembling explosive vests.

Why This Matters: Creativity as a Bypass Vector

The guardrails failed for two reasons:

  • Semantic ambiguity: The poetic format distracted the models’ classifiers, allowing them to treat the prompt as benign or creative fiction.
  • Roleplay compliance: Once assigned a persona, the models showed high compliance with that character’s logic, even if it meant sidestepping safety restrictions.

Practical Implications for Enterprises Deploying AI Apps and Agents

As your business builds AI apps and agents on top of powerful foundation models, you’re leveraging some of the most advanced general-purpose AI available. But these models are inherently open-ended, and their guardrails can be manipulated by creative or adversarial prompts.

Threats Evolve Faster Than Static Defenses

As shown in the test case, a malicious user can embed harmful instructions inside stylized or mixed-language prompts, obscure them behind fictional characters, or add translation steps that disarm moderation filters. These aren’t theoretical risks. They’re working jailbreaks.

Automated red teaming simulates these evolving adversarial behaviors at scale. It probes your AI applications for weak spots 24/7, using known attack vectors and discovering novel ones. Unlike manual reviews or pre-launch safety testing, automated red teaming uncovers how your system performs in the wild, against real tactics.
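As a rough illustration, the core of such a harness can be a loop that re-submits mutated versions of refused prompts until something slips through. Every name below (seed_attacks, mutate, call_target_app, grade_response) is a hypothetical stand-in, and the sketch is a simplification of the idea rather than a description of any specific tooling.

```python
# Minimal sketch of an automated red-teaming loop. `seed_attacks`, `mutate`,
# `call_target_app`, and `grade_response` are hypothetical stand-ins for an
# attack corpus, a prompt-mutation strategy, the application under test,
# and a safety classifier.
from dataclasses import dataclass

@dataclass
class Finding:
    prompt: str
    response: str
    verdict: str  # e.g. "violative", "refusal", "benign"

def red_team(seed_attacks, mutate, call_target_app, grade_response,
             rounds: int = 3) -> list[Finding]:
    findings = []
    frontier = list(seed_attacks)
    for _ in range(rounds):
        next_frontier = []
        for prompt in frontier:
            response = call_target_app(prompt)
            verdict = grade_response(prompt, response)
            if verdict == "violative":
                findings.append(Finding(prompt, response, verdict))
            else:
                # Refused or benign prompts seed the next round of stylistic
                # mutations (rhyme, persona framing, language mixing, etc.).
                next_frontier.extend(mutate(prompt))
        frontier = next_frontier
    return findings
```

In practice, the mutation step is where tactics like Brainrot rhyming, persona framing, and language mixing are applied automatically and at scale, so new evasion styles are exercised before a real attacker tries them against production.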

Your Guardrails Must Be Dynamic

LLMs are probabilistic and generate responses based on likelihoods learned from data, not fixed rules. The same input can generate a different output every time it is given. The responses an LLM allows or blocks can shift based on context and timing.

A secondary guardrail solution that sits alongside the model can monitor input and output in real time, adapting to current prompt patterns and applying dynamic thresholds based on risk signals. Crucially, such guardrails are not bound by the model’s training or roleplaying logic and are narrowly focused on safeguarding your users and brand.
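A minimal sketch of that pattern, with score_input_risk, score_output_risk, and call_model as hypothetical stand-ins and purely illustrative thresholds, looks like this:

```python
# Minimal sketch of a secondary guardrail wrapping a model call.
# `score_input_risk`, `score_output_risk`, and `call_model` are hypothetical
# stand-ins, and the thresholds are illustrative; in practice they would be
# tuned dynamically from current attack patterns and risk signals.
REFUSAL = "I can't help with that request."

def guarded_completion(user_prompt: str,
                       call_model,
                       score_input_risk,
                       score_output_risk,
                       input_threshold: float = 0.7,
                       output_threshold: float = 0.5) -> str:
    # Screen the prompt before it reaches the model at all.
    if score_input_risk(user_prompt) >= input_threshold:
        return REFUSAL

    response = call_model(user_prompt)

    # Screen the response as well: this layer is not bound by the model's
    # roleplay logic, so a "translated" harmful answer is still caught here.
    if score_output_risk(user_prompt, response) >= output_threshold:
        return REFUSAL

    return response
```

Because the screening runs outside the model, the rhyme-then-translate trick described above has nothing to hide behind: the final plain-English output is evaluated on its own.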

What This Means for Enterprises

If you’re building customer-facing tools, internal copilots, or domain-specific AI apps, the costs of a jailbreak are more than technical: they’re reputational and potentially legal.

By investing in automated red teaming and real-time guardrails, you can:

  • Continuously test and harden your application against the latest evasion techniques
  • Catch novel jailbreaks before they reach production
  • Protect users from harmful outputs, even when base models fail
  • Build a safety posture that scales with your AI product’s reach and complexity

 

Macaronic jailbreaks are just one example of how creative prompting can expose critical weaknesses in AI systems. Our red team continues to uncover novel jailbreaks and evasion tactics across the GenAI landscape. Check back soon, as we share more findings to help you stay ahead of emerging threats.
