The 5 Most Shocking LLM Weaknesses We Uncovered in 2025

By
December 25, 2025

Our red teaming researchers are always developing new adversarial techniques to test generative AI models and agents. In 2025, they uncovered a wide range of critical vulnerabilities that revealed deep security and trust gaps. From the teamโ€™s body of findings, they selected five that stunned them the most, from fundamental architectural weaknesses to the most dangerous user-facing social-engineering threat.

Each vulnerability exposes a breakdown in the safety and security expectations weโ€™ve come to rely on in modern AI systems. And when you look at them together, they make it clear that organizations deploying public-facing AI apps must consider AI safety and security, before the cracks in the foundation turn into real operational or organizational risks.

#1 Stolen Reasoning

The most architecturally devastating findings were reasoning injections that allowed our red team to change what the model said by taking over how the model decided what to say. In agentic systems, models often use an internal reasoning process to quietly think through a request in natural language and decide what to do before responding or taking action.ย 

We found that by injecting false reasoning between the modelโ€™s reasoning tags (or disabling its reasoning all together) we could make the model violate policy, such as creating phishing emails. Because the model believed the unsafe reasoning was its own, it didnโ€™t detect the manipulation and continued to rely on the corrupted reasoning in later steps, propagating the attack.

While strong separation between user input, internal reasoning, and tools is essential to prevent this kind of takeover, guardrails can still help by checking user inputs for attempts to interfere with internal systems, such as references to reasoning tags, hidden instructions, or tool commands, and blocking or cleaning them before the model processes them.

#2 The Invisible Execution

We also found a vulnerability we call Ghost Calling, where an AI executes an action in response to an instruction without logging that it did so or explaining why in its reasoning. In one case, our red team triggered the creation of an email using an external tool. The model never explained why it ran the tool, leaving the action hidden from reviewers. To prevent this, tools should only run when the action clearly comes from the modelโ€™s own reasoning and not directly from user prompts that could carry injected instructions.

#3 The Summoner in Your Inbox

The next shocking vulnerability leverages what AI is designed to do (summarize and process data) to steal information. We showed how an email-summarizing agent could be tricked into leaking sensitive details such as credit card numbers using indirect prompt injections that hid malicious instructions inside emails or documents the agent is asked to process.

Itโ€™s a clear reminder of how critical strong input and output guardrails are when AI systems work with private content.

#4 The Ghost in the Generator

On the generative side, we found that bad actors could slip hidden, malformed characters into otherwise normal prompts. These smuggled tokens take advantage of inconsistencies in the modelโ€™s processing pipeline, leading to predictable hallucinations that can generate violent or otherwise prohibited imagery without the prompt or response being flagged as unsafe or violative by the model. Using this method, our team prompted the generation of unequivocally racist, violent, and culturally insensitive images. Whatโ€™s most concerning is that this method still works with multiple native moderation layers in place, highlighting the need for robust, third-party guardrails.ย 

#5 Mistaken Identity

Lastly, a concerning risk for everyday users; we showed that AI email assistants can be fooled into misidentifying who an email is actually from just by manipulating the display name (one of the easiest fields to spoof.) Since LLM-based assistants summarize emails without checking key authentication signals like SPF, DKIM, or DMARC, they end up โ€œcleaningโ€ attacker identities and presenting fraudulent messages as if they came from trusted sources. This reveals a major gap in the trust model: AI systems are inheriting security assumptions they canโ€™t actually verify. And that turns what should be a simple productivity feature into a surprisingly effective vector for social engineering and even financial fraud.

The ActiveFence research team is always prodding foundational models, looking for vulnerabilities that shape the AI Safety and Security policies built into ActiveFence Guardrails so that organizations offering public-facing AI apps can deploy with confidence.ย 

Special Thanks to Roey Fizitzky, Vladi Krasner, and Ruslan Kuznetsov for their contributions to this article.ย 

Table of Contents

Learn more about ActiveFence Red Teaming

Learn more