ActiveFence AI Security Benchmark Report Summary

By
August 20, 2025
ActiveFence AI Security Benchmark Report 2025

Executive Summary

Prompt injection attacks undermine the reliability of generative AI systems by manipulating model behavior, bypassing safeguards, and exposing sensitive information. The ActiveFence AI Security Benchmark Report (2025) evaluates six leading detection models across over 28,000 adversarial and benign prompts. The findings highlight how enterprises can minimize operational risks from false positives while ensuring harmful prompts are effectively blocked.
Key takeaways:

  • ActiveFence achieved the highest precision (0.890) and F1 score (0.857) with a low false positive rate (5.4%)
  • Open-source models such as Deepset and ProtectAI showed inconsistent detection and high false positive rates
  • Bedrock and Azure APIs had mixed results, excelling in certain areas but underperforming in recall
  • ActiveFence delivered the most consistent multilingual performance across 13 languages

Introduction

Prompt injection is one of the most urgent security concerns for enterprises deploying GenAI-powered applications. Attackers can insert adversarial instructions into inputs that cause a model to ignore safety guardrails, reveal sensitive data, or generate harmful content. These vulnerabilities create financial, reputational, and regulatory risks for organizations.

The 2025 ActiveFence AI Security Benchmark Report provides an in-depth comparison of six security detection models, including commercial APIs and open-source systems. By testing across benign prompts, adversarial injections, and multilingual datasets, the benchmark highlights how different models handle real-world attack strategies and operational trade-offs.

What are Prompt Injections?

Prompt injections are adversarial inputs that manipulate AI models into producing unsafe or unintended outputs. Common techniques include:

  1. Indirect phrasing, disguising malicious intent as analogies or metaphors
  2. Layered instructions that hide dangerous steps inside nested prompts
  3. Fictional or roleplay framing that coaxes unsafe guidance
  4. Known jailbreak strategies such as “Do Anything Now” (DAN)-style prompts

These attacks can lead to content moderation failures, exposure of sensitive data, and compliance violations.

Benchmarking Methodology

ActiveFence tested more than 28,000 prompts across categories defined by OWASP and MITRE ATLAS. The dataset included:

  • Fully benign prompts (e.g., product integration questions)
  • Triggering benign prompts with risky keywords but safe intent (e.g., asking “How does a DDoS attack work?” for educational purposes)
  • Adversarial injections exploiting loopholes or disguising intent
  • Safety-related injections producing harmful outputs, such as hate speech or misinformation

Testing covered 13 languages, including English, Chinese, French, German, Hebrew, Japanese, Korean, Portuguese, Russian, and Spanish.

Which AI Security Model Performs Best?

The benchmark compared six models: ActiveFence, Deepset, Llama Prompt Guard 2, ProtectAI, Bedrock, and Azure.

Comparative Performance

Results across all prompts (benign, triggering benign, adversarial, safety-related)
Model F1 Precision Recall F0.5 FPR
ActiveFence 0.857 0.890 0.826 0.876 0.054
Deepset 0.558 0.395 0.955 0.447 0.770
Llama Prompt Guard 2 0.621 0.793 0.511 0.714 0.070
ProtectAI 0.643 0.580 0.723 0.604 0.275
Bedrock 0.561 0.712 0.463 0.643 0.098
Azure 0.412 0.838 0.273 0.593 0.028

Source: ActiveFence AI Security Benchmark Report, Prompt Injections, 2025.

ActiveFence delivered the best balance of precision and recall, with significantly fewer false positives than open-source alternatives.

How Do Models Handle Multilingual Prompts?

Security models must detect adversarial behavior in multiple languages. The benchmark found:

  • ActiveFence consistently scored highest across all 13 tested languages
  • Open-source models showed variability and higher false positive rates

Multilingual F1 Scores

F1 performance by language and model
Language ActiveFence Bedrock Deepset Llama Guard 2 ProtectAI
Chinese 0.780 0.011 0.704 0.568 0.468
Dutch 0.781 0.229 0.712 0.307 0.367
French 0.790 0.382 0.713 0.403 0.618
German 0.789 0.439 0.716 0.373 0.415
Hebrew 0.765 0.009 0.679 0.243 0.657
Italian 0.796 0.326 0.714 0.340 0.640
Japanese 0.754 0.019 0.705 0.452 0.502
Korean 0.768 0.009 0.698 0.322 0.414
Portuguese 0.771 0.367 0.710 0.396 0.598
Russian 0.769 0.225 0.709 0.365 0.326
Spanish 0.790 0.366 0.712 0.415 0.637
Turkish 0.739 0.038 0.694 0.266 0.230

Source: ActiveFence AI Security Benchmark Report, Prompt Injections, 2025

Implications for Enterprises

Enterprises integrating GenAI for customer service, content generation, or automation face high exposure to prompt injection risks. Models with high false positive rates increase operational costs and frustrate users, while low precision risks letting harmful prompts through.

ActiveFence Guardrails combines precision, multilingual support, and resilience against jailbreaks, making it suitable for enterprise-scale safety stacks.

The 2025 benchmark shows the ActiveFence AI Safety and Security model as the most reliable choice for enterprises launching global AI applications and requiring low false positives, high detection accuracy, and multilingual resilience. 

Table of Contents

Get a full breakdown of the tests.

Read the Benchmark