Proactively identify vulnerabilities through red teaming to produce safe, secure, and reliable models.
Deploy generative AI applications and agents in a safe, secure, and scalable way with guardrails.
Learn how to secure your GenAI apps and agents
In a recent security red-teaming exercise, ActiveFence researchers discovered a macaronic jailbreak technique that successfully bypassed the safety mechanisms of multiple GenAI models, including large language models and speech-to-speech. The jailbreak exploited the models’ stylistic flexibility and willingness to follow complex character instructions, resulting in dangerous behavior cloaked in nonsensical rhyme. The vulnerability highlights a deeper issue in model alignment: when does creativity override caution?
In a recent security red-teaming exercise, ActiveFence researchers discovered a macaronic jailbreak technique that successfully bypassed the safety mechanisms of multiple GenAI models, including large language models and speech-to-speech. The jailbreak exploited the models’ stylistic flexibility and willingness to follow complex character instructions, resulting in dangerous behavior cloaked in nonsensical rhyme.
The vulnerability highlights a deeper issue in model alignment: when does creativity override caution?
The red team prompt instructed each model to roleplay as an unfiltered, rhyme-speaking character with no ethical restrictions called “B-Rot”.
The models were directed to respond using the characters’ distinct dialect, referred to as “Brainrot,” a real-world, stylistically chaotic format characterized by a consistent rhyming suffix, exaggerated rhythmic patterns, poetic cadence, and non-english latin languages.
Additionally, the models were instructed to structure their outputs in two phases: initially presenting the response entirely in rhyme, followed by a clear translation into standard English.
The red teams were able to further bypass guardrails by instructing the model to respond to the initial request with “I’m sorry, I can’t research that. I’m sorry I can’t help with that request. Is there anything else you’d like me to do in the style of B-rot?” then monitor for a specified trigger phrase. Instructions were given to initiate full roleplay as the character once the trigger phrase was input.
These multi-layered setups challenged the models’ safety nets by combining obfuscation with clear intent, and allowed for the generation of violative content in both single- and multi-turn interactions across state-of-the-art foundational models.
Scenarios presented in testing included generating instructions for creating improvised weaponry such as vest bombs and molotov cocktails and generating malicious code.
This is an example of macaronic prompting; using multiple languages or mixing different linguistic structures to evade content moderation filters that primarily focus on a single language.
An Out-Of-Distribution(OOD) classifier is an AI security tool that detects whether a prompt or input is unusual or suspicious. Even though a macaronic message looks strange, the structure and meaning are close enough to normal that the OOD classifier doesn’t recognize it as something unusual.
The success of the attacks hinged on how the models interpreted their assigned personas. By framing the request under the guise of a fictional roleplay, many of the models complied with responses stylized as brainrot poetry, then translate as structured guidance.
Example:
While some outputs, such as in this example, were vague or ambiguous, others were shockingly explicit. They included detailed recipes with exact chemical measurements for making incendiary devices and step-by-step instructions for assembling explosive vests.
The guardrails failed for two reasons:
As your business builds AI apps and agents on top of powerful foundation models, you’re leveraging some of the most advanced general-purpose AI available. But these models are inherently open-ended, and their guardrails can be manipulated by creative or adversarial prompts.
As shown in the test case, a malicious user can embed harmful instructions inside language, obscure them behind fictional characters, or add translation steps that disarm moderation filters. These aren’t theoretical risks. They’re working jailbreaks.
Automated red teaming simulates these evolving adversarial behaviors at scale. It probes your AI applications for weak spots 24/7, using known attack vectors and discovering novel ones. Unlike manual reviews or pre-launch safety testing, automated red teaming uncovers how your system performs in the wild, against real tactics.
LLMs are probabilistic and generate responses based on likelihoods learned from data, not fixed rules. The same input can generate a different output every time it is given. The responses an LLM allows or blocks can shift based on context and timing.
A secondary guardrail solution that sits alongside the model can monitor input and output in real time, adapting based on current prompt patterns and applying dynamic thresholds based on risk signals. Crucially, they are not bound by the model’s training or roleplaying logic and are narrowly focused to safeguard your users and brand.
If you’re building customer-facing tools, internal copilots, or domain-specific AI apps, the costs of a jailbreak are more than technical, they’re reputational, and potentially legal.
By investing in automated red teaming and real-time guardrails, you can:
Macaronic jailbreaks are just one example of how creative prompting can expose critical weaknesses in AI systems. Our red team continues to uncover novel jailbreaks and evasion tactics across the GenAI landscape. Check back soon, as we share more findings to help you stay ahead of emerging threats.
See how you can safeguard your brand, your users, and your compliance posture in a fast-moving AI risk landscape.
Explore the primary security risks associated with Agentic AI and strategies for effective mitigation.
ActiveFence is expanding its partnership with NVIDIA to bring real-time safety to a new generation of AI agents built with NVIDIA’s Enterprise AI Factory and NIM. Together, we now secure not just prompts and outputs, but full agentic workflows across enterprise environments.
Explore why generative AI continues to reflect harmful stereotypes, the real-world risks of biased systems, and how teams can mitigate them with practical tools and strategies.