
ActiveFence Guardrails AI Safety and Security Policies

A policy defines how a system should respond when certain conditions are met. Policies translate safety and security goals into clear actions, giving your organization consistency and control. For example, instead of blocking an interaction, you may choose to flag it for review.

To be most effective, policies must be adaptable. With ActiveFence, you can customize every out-of-the-box policy action by severity and risk, ensuring safeguards fit your organizational needs. You can also create and upload custom policies based on your unique requirements.
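The idea of mapping each policy to a per-severity action can be sketched as a simple lookup. This is an illustrative sketch only: the policy names, severity levels, and action names ("flag", "block", "redact") are assumptions for the example, not ActiveFence's actual API or configuration format.

```python
# Hypothetical policy configuration: each detection maps severities to actions.
# All names here are illustrative, not ActiveFence's real schema.
POLICY_ACTIONS = {
    "prompt_injection": {"low": "flag", "medium": "block", "high": "block"},
    "pii_credit_card":  {"low": "redact", "medium": "block", "high": "block"},
    "legal_advice":     {"low": "allow", "medium": "flag", "high": "flag"},
}

def resolve_action(policy: str, severity: str, default: str = "flag") -> str:
    """Look up the configured action for a detection, falling back to a default."""
    return POLICY_ACTIONS.get(policy, {}).get(severity, default)
```

A lookup like this is what lets the same detection (say, a low-severity legal-advice mention) be flagged for review in one deployment and blocked outright in another.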

ActiveFence Guardrails are built in alignment with leading industry standards, including the OWASP LLM Top Ten framework. Many of our policies directly map to OWASP LLM categories, ensuring your AI systems address the most critical security and safety risks identified by global experts.

Explore our out-of-the-box AI Safety and Security guardrail policies, organized by detection area, to better understand how ActiveFence Guardrails protect against unsafe content and threats to privacy and security.

 

Detection Area: Security

AI systems face constant risks from malicious prompts, hidden encodings, and attempts to override protections. Security guardrails defend enterprises against misuse by blocking manipulative inputs and safeguarding system integrity. ActiveFence ensures AI remains aligned with enterprise rules, prevents harmful outputs, and protects brands from reputational or compliance failures.

 

Refusal Response Detection

Some prompts result in refusals to fulfill a user’s request. These refusals often indicate when requests break rules, push beyond knowledge, or attempt unsafe actions. ActiveFence detects these refusals, ensuring enterprises remain aligned with policy and protecting brands from liability or reputational harm.

 

Impersonation

Impersonation is an explicit or implicit attempt by the user to manipulate the LLM into falsely responding as a real or fictional individual, entity, or authority. ActiveFence flags impersonation prompts so you can protect brand reputation, avoid disinformation, and stop bad actors from exploiting your AI.

Framework mapping: OWASP LLM 01: Prompt Injection, OWASP LLM 02: Sensitive Information Disclosure, OWASP LLM 06: Excessive Agency, OWASP LLM 07: System Prompt Leakage, OWASP LLM 09: Misinformation

 

System Prompt Override

A system prompt override is a user prompt that explicitly or implicitly attempts to override, manipulate, or bypass an enterprise system's behavior or the constraints set by its system prompt. ActiveFence stops attempts to bypass internal instructions, helping ensure that enterprise systems operate as intended, maintaining compliance, and protecting brand integrity.

Framework mapping: OWASP LLM 01: Prompt Injection

 

Encoding

Encoding is the transformation of text using techniques like substitutions, diacritic changes, or ciphers to obfuscate its true meaning and evade detection. ActiveFence identifies obfuscated or transformed hidden text, preventing attackers from smuggling harmful inputs, and ensuring safe, transparent, and trustworthy AI interactions.

Framework mapping: OWASP LLM 01: Prompt Injection, OWASP LLM 04: Data and Model Poisoning, OWASP LLM 07: System Prompt Leakage, OWASP LLM 09: Misinformation
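One building block of encoding defense is normalization: undoing diacritic tricks and character substitutions so that downstream detectors see the text's true meaning. The sketch below shows that single step under stated assumptions; the leetspeak substitution table is illustrative and this is not ActiveFence's actual pipeline, which relies on far more than character mapping.

```python
import unicodedata

# Illustrative substitution table for common leetspeak characters.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "$": "s", "@": "a"})

def normalize(text: str) -> str:
    """Strip diacritics and undo common character substitutions."""
    # NFKD decomposition splits accented characters into base char + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower().translate(LEET)
```

After normalization, an obfuscated instruction like `"1gn0re instructïons"` reads as `"ignore instructions"`, so the same detectors that catch plain-text attacks also catch the disguised version.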

 

Prompt Injection

Prompt injection is the use of maliciously crafted inputs designed to manipulate an AI system into ignoring its safeguards, altering its behavior, or producing unsafe outputs. Examples include jailbreaks, command hijacks, and role-play attacks. ActiveFence intercepts these attempts, keeping AI systems aligned with enterprise policies, preventing security breaches, and protecting brand reputation from misuse.

Framework mapping: OWASP LLM 01: Prompt Injection, OWASP LLM 02: Sensitive Information Disclosure, OWASP LLM 07: System Prompt Leakage
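To make the attack categories above concrete, here is a toy keyword heuristic for spotting override-style injection phrasing. It is deliberately naive and the patterns are assumptions chosen for illustration; production detectors, including ActiveFence's, rely on trained classifiers rather than keyword lists, which attackers trivially evade.

```python
import re

# Toy patterns for common override-style injection phrasing (illustrative only).
INJECTION_PATTERNS = [
    r"ignore (all|any|previous|prior) .*instructions",  # instruction override
    r"disregard .*system prompt",                       # system prompt bypass
    r"you are now [a-z]+",                              # persona / role-play hijack
]

def looks_like_injection(prompt: str) -> bool:
    """Return True if the prompt matches any known injection pattern."""
    lowered = prompt.lower()
    return any(re.search(pattern, lowered) for pattern in INJECTION_PATTERNS)
```

Even this crude filter shows why injection defense must sit in front of the model: the attack lives entirely in the input text, before any generation happens.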

 

 

Detection Area: Safety

AI systems can generate harmful, abusive, or dangerous content that threatens user wellbeing and damages brand trust. Safety guardrails protect enterprises by preventing outputs that promote harassment, violence, self-harm, or hate. ActiveFence ensures AI aligns with human values, shields vulnerable users, and reduces reputational and compliance risks for organizations.

 

Bullying & Harassment

Offensive, abusive, or threatening language directed at individuals can cause harm, exclusion, and reputational crises. ActiveFence identifies this language, preventing its spread and ensuring enterprises are not associated with toxic or unsafe interactions.

 

Weapons

Images containing firearms, knives, or other handheld weapons may normalize violence or promote unsafe behavior. ActiveFence detects these depictions, blocking their use and protecting enterprises from harmful associations that could damage brand trust.

 

Suicide and Self-Harm

Content promoting or instructing on self-injury, eating disorders, or suicide poses direct risks to vulnerable users. ActiveFence flags and blocks this content, safeguarding individuals from harm and shielding enterprises from liability linked to unsafe outputs.

 

Hate Speech Text

Discrimination, hate, or incitement of violence against protected groups damages communities and can destroy brand reputation. ActiveFence detects and intercepts such content, ensuring AI systems align with inclusive standards and enterprise responsibility.

 

CSAM Text

Child sexual abuse material is illegal and catastrophic for an enterprise’s reputation. ActiveFence detects and blocks any content describing or promoting CSAM, ensuring compliance with global laws and protecting organizations from severe legal and reputational harm.

 

Legal Advice

ActiveFence identifies legal and regulatory advice in the conversation from either side (user or LLM), flagging potential advisory content so enterprises can review it before it creates liability.

 

Financial Advice

Unverified investment or financial advice can mislead users and damage enterprise credibility. ActiveFence identifies this type of guidance, preventing risky outputs that could cause financial harm to users and reputational harm to brands.

 

 

Detection Area: Privacy

Exposing sensitive personal data can lead to identity theft, fraud, harassment, or severe compliance violations. Privacy guardrails protect users by preventing the disclosure of personally identifiable information across text, images, and links. ActiveFence safeguards enterprises from legal and reputational harm while reinforcing trust in secure, responsible AI interactions.

 

Personally Identifiable Information (PII)

Exposing sensitive personal details like SSNs, bank accounts, or emails can lead to identity theft, fraud, and regulatory noncompliance. ActiveFence detects these disclosures, protecting users from exploitation while helping enterprises maintain compliance and preserve trust.

Framework mapping: OWASP LLM 02: Sensitive Information Disclosure
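Structured PII formats like SSNs and emails are often first caught with pattern matching. The sketch below shows that baseline layer only; the patterns are simplified for illustration, and real PII detection (ActiveFence's included) needs context-aware models to avoid false positives such as order numbers that look like SSNs.

```python
import re

# Simplified patterns for two structured PII formats (illustrative only).
PII_PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def find_pii(text: str) -> dict:
    """Map each PII type to the matches found in the text, omitting empty results."""
    return {kind: pattern.findall(text)
            for kind, pattern in PII_PATTERNS.items()
            if pattern.findall(text)}
```

A match here would then be routed through the configured policy action, such as redacting the value before the response reaches the user.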

 

PII – Credit Card

Credit card details in text outputs put users at risk of fraud and can expose enterprises to serious liability. ActiveFence identifies and blocks this financial data, ensuring sensitive information is never mishandled and reinforcing brand credibility in safeguarding transactions.

Framework mapping: OWASP LLM 02: Sensitive Information Disclosure
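One standard technique for distinguishing real card numbers from random digit strings is the Luhn checksum, which all major card networks use. The sketch below is a minimal implementation of that public algorithm; it is a cheap pre-filter, not ActiveFence's detection method, which combines such signals with contextual analysis.

```python
def luhn_valid(number: str) -> bool:
    """Check a digit string against the Luhn checksum used by card networks."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:  # card numbers are 13-19 digits
        return False
    checksum = 0
    # From the right: double every second digit, subtracting 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```

Passing the checksum does not prove a number is a live card, but failing it proves the string is not one, which keeps false positives down before escalating to blocking or redaction.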

 

PII – IP Address

Exposed IP addresses can allow tracking, cyberattacks, or exploitation of user data. ActiveFence surfaces these disclosures, preventing misuse while helping enterprises demonstrate strong privacy practices and uphold user trust.

Framework mapping: OWASP LLM 02: Sensitive Information Disclosure

 

PII – URL

URLs in AI outputs may reveal private data, expose internal systems, or direct users to unsafe resources. ActiveFence detects these risks and prevents their spread, safeguarding enterprises from breaches while ensuring AI interactions remain secure and responsible.

Framework mapping: OWASP LLM 02: Sensitive Information Disclosure

 

PII – Phone

Phone numbers in text can open users to harassment, scams, or unwanted contact. ActiveFence flags and blocks these exposures, protecting individuals from harm and helping enterprises maintain compliance with global privacy standards.

Framework mapping: OWASP LLM 02: Sensitive Information Disclosure

 

ActiveFence Guardrails empower organizations to build AI systems that are safe, secure, and aligned with enterprise values. By combining adaptable policy actions with robust detection across security, safety, and privacy domains, you can protect users, uphold compliance, and safeguard brand integrity. Every organization's needs are unique, and our experts can help you customize these guardrails to fit your specific risk profile and goals. Contact an ActiveFence expert today to discuss how these policies can be tailored to your organization's unique AI Safety and Security needs.