Our red teaming researchers are always developing new adversarial techniques to test generative AI models and agents. In 2025, they uncovered a wide range of critical vulnerabilities that revealed deep security and trust gaps. From the team's body of findings, they selected the five that stunned them most, from fundamental architectural weaknesses to the most dangerous user-facing social-engineering threat.
Each vulnerability exposes a breakdown in the safety and security expectations we've come to rely on in modern AI systems. Taken together, they make it clear that organizations deploying public-facing AI apps must consider AI safety and security before the cracks in the foundation turn into real operational or organizational risks.
The most architecturally devastating findings were reasoning injections that allowed our red team to change what the model said by taking over how the model decided what to say. In agentic systems, models often use an internal reasoning process to quietly think through a request in natural language and decide what to do before responding or taking action.
We found that by injecting false reasoning between the model's reasoning tags (or disabling its reasoning altogether), we could make the model violate policy, for example by creating phishing emails. Because the model believed the unsafe reasoning was its own, it didn't detect the manipulation and continued to rely on the corrupted reasoning in later steps, propagating the attack.
Strong separation between user input, internal reasoning, and tools is essential to prevent this kind of takeover. Guardrails can still help by screening user inputs for attempts to interfere with internal systems, such as references to reasoning tags, hidden instructions, or tool commands, and blocking or cleaning them before the model processes them.
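As a rough illustration of that screening step, the sketch below scans user input for reasoning-tag references and injected-instruction phrasing before it reaches the model. The tag names and patterns are hypothetical examples, not an exhaustive blocklist; real deployments would tune these to their model's actual reasoning format.

```python
import re

# Hypothetical patterns: reasoning-tag names and injection phrasing vary
# by model, so these are illustrative rather than exhaustive.
SUSPICIOUS_PATTERNS = [
    re.compile(r"</?\s*(think|thinking|reasoning|scratchpad)\s*>", re.IGNORECASE),
    re.compile(r"ignore (all|your) previous (instructions|reasoning)", re.IGNORECASE),
]

def screen_user_input(text: str) -> tuple[bool, str]:
    """Return (is_clean, sanitized_text).

    Flags inputs that reference internal reasoning tags or carry hidden
    instructions, and strips the offending fragments before the model
    sees them.
    """
    sanitized = text
    clean = True
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(sanitized):
            clean = False
            sanitized = pattern.sub("", sanitized)
    return clean, sanitized.strip()
```

A flagged input can then be blocked outright or passed through in sanitized form, depending on the application's risk tolerance.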
We also found a vulnerability we call Ghost Calling, where an AI executes an action in response to an instruction without logging that it did so or explaining why in its reasoning. In one case, our red team triggered the creation of an email using an external tool. The model never explained why it ran the tool, leaving the action hidden from reviewers. To prevent this, tools should only run when the action clearly comes from the model's own reasoning and not directly from user prompts that could carry injected instructions.
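One way to enforce that rule is a gate in front of the tool dispatcher: every call is logged, and a call is refused unless the model's own reasoning trace mentions the tool it is about to run. The function and the `tools` mapping below are hypothetical scaffolding for illustration; a production agent framework would use a stricter justification check than substring matching.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool-gate")

def gated_tool_call(tool_name: str, args: dict, reasoning_trace: str, tools: dict):
    """Execute a tool only if the model's own reasoning names it,
    and always log the call so it cannot happen silently.

    `tools` maps tool names to callables (an illustrative structure).
    """
    if tool_name not in reasoning_trace:
        log.warning("Blocked unexplained call to %r", tool_name)
        raise PermissionError(f"Tool {tool_name!r} not justified in reasoning")
    log.info("Executing %r with args %s", tool_name, args)
    return tools[tool_name](**args)
```

Because every execution path goes through the logger, a reviewer auditing the agent can reconcile each tool invocation against the reasoning that justified it.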
The next shocking vulnerability leverages what AI is designed to do (summarize and process data) to steal information. We showed how an email-summarizing agent could be tricked into leaking sensitive details such as credit card numbers using indirect prompt injections that hid malicious instructions inside emails or documents the agent is asked to process.
It's a clear reminder of how critical strong input and output guardrails are when AI systems work with private content.
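For the output side, a minimal sketch of one such guardrail: before a summary leaves the agent, scan it for card-like number sequences and redact those that pass the Luhn checksum used by payment cards. This is one narrow detector among many a real deployment would layer; the function names are illustrative.

```python
import re

def luhn_valid(digits: str) -> bool:
    """Luhn checksum used by payment card numbers."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# 13-19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def redact_card_numbers(text: str) -> str:
    """Output guardrail: redact card-like numbers that pass Luhn."""
    def repl(match: re.Match) -> str:
        digits = re.sub(r"[ -]", "", match.group())
        return "[REDACTED]" if luhn_valid(digits) else match.group()
    return CARD_RE.sub(repl, text)
```

Crucially, this check runs on the model's output regardless of what the input looked like, so it still catches leaks triggered by instructions hidden in processed documents.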
On the generative side, we found that bad actors could slip hidden, malformed characters into otherwise normal prompts. These smuggled tokens take advantage of inconsistencies in the model's processing pipeline, leading to predictable hallucinations that can generate violent or otherwise prohibited imagery without the prompt or response being flagged as unsafe or violative by the model. Using this method, our team prompted the generation of unequivocally racist, violent, and culturally insensitive images. What's most concerning is that this method still works with multiple native moderation layers in place, highlighting the need for robust, third-party guardrails.
Lastly, a concerning risk for everyday users: we showed that AI email assistants can be fooled into misidentifying who an email is actually from just by manipulating the display name, one of the easiest fields to spoof. Since LLM-based assistants summarize emails without checking key authentication signals like SPF, DKIM, or DMARC, they end up "cleaning" attacker identities and presenting fraudulent messages as if they came from trusted sources. This reveals a major gap in the trust model: AI systems are inheriting security assumptions they can't actually verify. And that turns what should be a simple productivity feature into a surprisingly effective vector for social engineering and even financial fraud.
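One mitigation is to make the assistant consult those authentication signals before repeating a display name. The sketch below reads the Authentication-Results header added by the receiving mail server and only surfaces the display name when SPF, DKIM, and DMARC all pass; the function name is hypothetical and the header parsing is simplified for illustration, so a real assistant should rely on its mail infrastructure's verdict rather than substring checks.

```python
from email import message_from_string
from email.utils import parseaddr

def trusted_sender(raw_email: str) -> str:
    """Return a sender label an assistant may safely display.

    The display name is trusted only when the Authentication-Results
    header shows SPF, DKIM, and DMARC all passing; otherwise the
    assistant falls back to the raw address with an explicit warning.
    """
    msg = message_from_string(raw_email)
    display_name, address = parseaddr(msg.get("From", ""))
    auth = (msg.get("Authentication-Results") or "").lower()
    passed = all(f"{check}=pass" in auth for check in ("spf", "dkim", "dmarc"))
    if passed and display_name:
        return f"{display_name} <{address}>"
    return f"UNVERIFIED <{address}>"
```

Surfacing "UNVERIFIED" instead of a spoofed brand name denies the attacker exactly the identity-laundering step this vulnerability depends on.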
The ActiveFence research team is always prodding foundational models, looking for vulnerabilities that shape the AI Safety and Security policies built into ActiveFence Guardrails so that organizations offering public-facing AI apps can deploy with confidence.
Special thanks to Roey Fizitzky, Vladi Krasner, and Ruslan Kuznetsov for their contributions to this article.
Learn more about ActiveFence Red Teaming