Safety & Guardrails
Defense in depth for AI agents: input validation, output filtering, tool sandboxing, guardian agents, and OWASP LLM security risks.
AI agents that take real actions in the world — writing files, calling APIs, executing code, sending communications — require multiple overlapping safety layers. No single technique catches everything. Pattern-based filters are fast and predictable but bypassable by determined adversaries. LLM-based evaluation is flexible and semantic but adds latency and cost. Sandboxing contains the blast radius of failures. The goal is defense in depth: combine all three so that any single point of failure does not compromise the whole system.
Defense in Depth
USER INPUT │ ▼ ┌─────────────────────────────────────────────────────────────────┐ │ LAYER 1: INPUT GUARDRAILS │ │ ├── Length & format validation │ │ ├── Injection pattern detection │ │ ├── PII detection & masking │ │ └── Content policy filtering │ └─────────────────────────────────────────────────────────────────┘ │ (blocked or sanitized) ▼ ┌─────────────────────────────────────────────────────────────────┐ │ LAYER 2: AGENT EXECUTION │ │ ├── Sandboxed tool execution │ │ ├── Resource limits (time, memory, network) │ │ ├── Allowlisted tools only │ │ └── Argument validation per tool │ └─────────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────┐ │ LAYER 3: OUTPUT GUARDRAILS │ │ ├── Harmful content detection │ │ ├── PII leakage prevention │ │ ├── Hallucination detection │ │ └── Policy compliance check │ └─────────────────────────────────────────────────────────────────┘ │ (blocked, filtered, or modified) ▼ ┌─────────────────────────────────────────────────────────────────┐ │ LAYER 4: GUARDIAN AGENT (Optional) │ │ ├── LLM-based semantic analysis │ │ ├── Context-aware evaluation │ │ └── Complex policy enforcement │ └─────────────────────────────────────────────────────────────────┘ │ ▼ USER OUTPUT
No single layer catches everything. Combine multiple techniques: pattern-based (fast, predictable) + LLM-based (flexible, semantic) + sandboxing (contains damage).
OWASP Top 10 for LLM Applications
The OWASP Top 10 for LLMs identifies the most critical security risks in LLM-powered systems.
| Risk | Description | Mitigation |
|---|---|---|
| LLM01: Prompt Injection | Malicious input manipulates LLM behavior | Input validation, instruction hierarchy |
| LLM02: Insecure Output | LLM output executed without validation | Output sanitization, sandboxing |
| LLM03: Training Data Poisoning | Malicious data corrupts model behavior | Data validation, provenance tracking |
| LLM04: Denial of Service | Resource exhaustion attacks | Rate limiting, resource caps |
| LLM05: Supply Chain | Compromised models, plugins, or data | Integrity checks, trusted sources |
| LLM06: Permission Issues | LLM granted excessive permissions | Least privilege, human approval |
| LLM07: Data Leakage | Sensitive data exposed in responses | PII filtering, access controls |
| LLM08: Excessive Agency | LLM takes unintended autonomous actions | Action limits, confirmation prompts |
| LLM09: Overreliance | Users trust LLM output without verification | Confidence indicators, source citations |
| LLM10: Model Theft | Extraction of model weights or behavior | API rate limits, watermarking |
1. Input Guardrails
Input guardrails validate and sanitize all user input before it reaches the LLM. The implementation combines structural validation (length limits, encoding checks, format restrictions) with content filtering for injection patterns, PII, and malicious payloads.
import re
from dataclasses import dataclass
from enum import Enum
class RiskLevel(Enum):
LOW = "low"
MEDIUM = "medium"
HIGH = "high"
CRITICAL = "critical"
@dataclass
class ValidationResult:
allowed: bool
reason: str | None = None
category: str | None = None
risk_level: RiskLevel | None = None
class InputGuardrail:
# Injection patterns to detect
INJECTION_PATTERNS = [
r"ignore\s+(previous|above|all)\s+instructions",
r"you\s+are\s+now\s+",
r"act\s+as\s+(if\s+you\s+are|a)\s+",
r"pretend\s+(you\s+are|to\s+be)",
r"system\s*:\s*",
r"\[INST\]|\[/INST\]",
r"<\|im_start\|>|<\|im_end\|>",
]
# PII patterns
PII_PATTERNS = {
"ssn": r"\b\d{3}-\d{2}-\d{4}\b",
"credit_card": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
"email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b",
"phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
}
def __init__(
self,
max_length: int = 10000,
block_pii: bool = True,
block_injections: bool = True
):
self.max_length = max_length
self.block_pii = block_pii
self.block_injections = block_injections
def validate(self, input_text: str) -> ValidationResult:
# Length check
if len(input_text) > self.max_length:
return ValidationResult(
allowed=False,
reason=f"Input exceeds maximum length ({self.max_length})",
category="LENGTH",
risk_level=RiskLevel.LOW
)
# Injection detection
if self.block_injections:
for pattern in self.INJECTION_PATTERNS:
if re.search(pattern, input_text, re.IGNORECASE):
return ValidationResult(
allowed=False,
reason="Potential prompt injection detected",
category="INJECTION",
risk_level=RiskLevel.HIGH
)
# PII detection
if self.block_pii:
for pii_type, pattern in self.PII_PATTERNS.items():
if re.search(pattern, input_text):
return ValidationResult(
allowed=False,
reason=f"PII detected: {pii_type}",
category="PII",
risk_level=RiskLevel.MEDIUM
)
return ValidationResult(allowed=True)
def sanitize(self, input_text: str) -> str:
"""Sanitize input without blocking."""
sanitized = input_text
# Remove control characters (except newlines, tabs)
sanitized = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', sanitized)
import unicodedata
sanitized = unicodedata.normalize('NFKC', sanitized)
# Escape potential special tokens
sanitized = sanitized.replace("<|", "< |")
sanitized = sanitized.replace("|>", "| >")
return sanitized
Determined attackers can bypass pattern-based filters through homoglyph attacks, Unicode tricks, and indirect injection via retrieved documents. Use pattern matching as a first line of defense, not the only defense.
2. Output Guardrails
Output guardrails validate LLM responses before returning them to users. They check for harmful content across categories (violence, hate speech, self-harm, illegal activity), PII leakage where the output contains sensitive data not present in the input, hallucination against retrieved source documents, and dangerous patterns in generated tool call arguments.
The action taken on failure depends on severity: critical and high-severity failures trigger blocking, medium-severity allows filtering (removing the problematic portion), and low-severity warrants a logged warning while allowing the response through.
| Check | Purpose | Action on Failure |
|---|---|---|
| Harmful Content | Detect violence, hate, illegal content | Block |
| PII Leakage | Prevent exposure of personal data | Mask or Block |
| Hallucination | Flag unsupported claims | Warn or Modify |
| Tool Call Safety | Validate tool arguments | Block execution |
| Policy Compliance | Enforce usage policies | Block or Modify |
3. Guardian Agents
A guardian agent uses a separate LLM to evaluate interactions for complex, context-dependent risks that pattern matching cannot capture. The guardian receives both the user input and the agent’s response, evaluates them across categories (prompt injection, harmful content, PII leakage, policy violation, hallucination), and returns a structured verdict with confidence scores, evidence, and a recommendation to allow, block, or modify.
Consider using a different model for the guardian than the main agent. This provides defense-in-depth against model-specific vulnerabilities — an attack that manipulates one model architecture may not affect another.
The cost of a guardian agent is real: an extra LLM call per interaction adds latency and API cost, and the guardian itself can be manipulated or produce false positives. Use guardians for high-stakes actions where the cost of a safety failure exceeds the overhead of evaluation.
4. Tool Execution Sandboxing
Sandboxing isolates tool execution to contain the blast radius of compromised or hallucinated tool calls. An effective sandbox enforces an allowlist of permitted tools, validates arguments against denied patterns before execution, enforces resource limits (CPU time, memory, output size), and restricts filesystem access to approved paths.
| Strategy | Isolation Level | Use Case |
|---|---|---|
| Process limits | Low | Resource caps only |
| Docker containers | Medium | Most production use cases |
| gVisor/Firecracker | High | Untrusted code execution |
| WASM | High | Browser/edge execution |
| Separate VMs | Very High | Highest security needs |
Best Practices
Always validate LLM-generated code, commands, or data before execution. The LLM can be manipulated or hallucinate dangerous operations.
Give agents only the minimum permissions needed. Do not grant write access if read is sufficient. Do not grant admin if user-level access is sufficient.
For high-stakes actions — deletions, financial transactions, external communications — require explicit human approval before execution.
Implement rate limits on all agent operations to prevent resource exhaustion and limit damage from compromised agents.
Log all agent actions with context. This enables incident investigation and detection of anomalous behavior patterns before they cause serious harm.
When guardrails fail or timeout, default to blocking rather than allowing. False positives are recoverable; security breaches often are not.
Implementation Checklist
A minimal safety implementation covers: input validation (length, format, injection patterns), PII detection and masking, sandboxed tool execution with resource limits, tool argument schema validation, output content filtering, rate limiting and resource caps, and comprehensive audit logging. For higher risk applications, add a guardian agent for complex policies and require human approval for irreversible actions. Test regularly with adversarial inputs to verify guardrails remain effective as the system evolves.