Safety & Guardrails | Agent Engineering

Defense in depth for AI agents: input validation, output filtering, tool sandboxing, guardian agents, and OWASP LLM security risks.

AI agents that take real actions in the world — writing files, calling APIs, executing code, sending communications — require multiple overlapping safety layers. No single technique catches everything. Pattern-based filters are fast and predictable but bypassable by determined adversaries. LLM-based evaluation is flexible and semantic but adds latency and cost. Sandboxing contains the blast radius of failures. The goal is defense in depth: combine all three so that any single point of failure does not compromise the whole system.

Defense in Depth

Multi-Layer Security Architecture

USER INPUT
  │
  ▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 1: INPUT GUARDRAILS                                       │
│ ├── Length & format validation                                  │
│ ├── Injection pattern detection                                 │
│ ├── PII detection & masking                                     │
│ └── Content policy filtering                                    │
└─────────────────────────────────────────────────────────────────┘
  │ (blocked or sanitized)
  ▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 2: AGENT EXECUTION                                        │
│ ├── Sandboxed tool execution                                    │
│ ├── Resource limits (time, memory, network)                     │
│ ├── Allowlisted tools only                                      │
│ └── Argument validation per tool                                │
└─────────────────────────────────────────────────────────────────┘
  │
  ▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 3: OUTPUT GUARDRAILS                                      │
│ ├── Harmful content detection                                   │
│ ├── PII leakage prevention                                      │
│ ├── Hallucination detection                                     │
│ └── Policy compliance check                                     │
└─────────────────────────────────────────────────────────────────┘
  │ (blocked, filtered, or modified)
  ▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 4: GUARDIAN AGENT (Optional)                              │
│ ├── LLM-based semantic analysis                                 │
│ ├── Context-aware evaluation                                    │
│ └── Complex policy enforcement                                  │
└─────────────────────────────────────────────────────────────────┘
  │
  ▼
USER OUTPUT

Defense in Depth

No single layer catches everything. Combine multiple techniques: pattern-based (fast, predictable) + LLM-based (flexible, semantic) + sandboxing (contains damage).

OWASP Top 10 for LLM Applications

The OWASP Top 10 for LLMs identifies the most critical security risks in LLM-powered systems.

OWASP Top 10 LLM Risks (2025)
Risk	Description	Mitigation
LLM01: Prompt Injection	Malicious input manipulates LLM behavior	Input validation, instruction hierarchy
LLM02: Insecure Output	LLM output executed without validation	Output sanitization, sandboxing
LLM03: Training Data Poisoning	Malicious data corrupts model behavior	Data validation, provenance tracking
LLM04: Denial of Service	Resource exhaustion attacks	Rate limiting, resource caps
LLM05: Supply Chain	Compromised models, plugins, or data	Integrity checks, trusted sources
LLM06: Permission Issues	LLM granted excessive permissions	Least privilege, human approval
LLM07: Data Leakage	Sensitive data exposed in responses	PII filtering, access controls
LLM08: Excessive Agency	LLM takes unintended autonomous actions	Action limits, confirmation prompts
LLM09: Overreliance	Users trust LLM output without verification	Confidence indicators, source citations
LLM10: Model Theft	Extraction of model weights or behavior	API rate limits, watermarking

1. Input Guardrails

Input guardrails validate and sanitize all user input before it reaches the LLM. The implementation combines structural validation (length limits, encoding checks, format restrictions) with content filtering for injection patterns, PII, and malicious payloads.

import re
from dataclasses import dataclass
from enum import Enum

class RiskLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

@dataclass
class ValidationResult:
    allowed: bool
    reason: str | None = None
    category: str | None = None
    risk_level: RiskLevel | None = None

class InputGuardrail:
    # Injection patterns to detect
    INJECTION_PATTERNS = [
        r"ignore\s+(previous|above|all)\s+instructions",
        r"you\s+are\s+now\s+",
        r"act\s+as\s+(if\s+you\s+are|a)\s+",
        r"pretend\s+(you\s+are|to\s+be)",
        r"system\s*:\s*",
        r"\[INST\]|\[/INST\]",
        r"<\|im_start\|>|<\|im_end\|>",
    ]

    # PII patterns
    PII_PATTERNS = {
        "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
        "credit_card": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
        "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b",
        "phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
    }

    def __init__(
        self,
        max_length: int = 10000,
        block_pii: bool = True,
        block_injections: bool = True
    ):
        self.max_length = max_length
        self.block_pii = block_pii
        self.block_injections = block_injections

    def validate(self, input_text: str) -> ValidationResult:
        # Length check
        if len(input_text) > self.max_length:
            return ValidationResult(
                allowed=False,
                reason=f"Input exceeds maximum length ({self.max_length})",
                category="LENGTH",
                risk_level=RiskLevel.LOW
            )

        # Injection detection
        if self.block_injections:
            for pattern in self.INJECTION_PATTERNS:
                if re.search(pattern, input_text, re.IGNORECASE):
                    return ValidationResult(
                        allowed=False,
                        reason="Potential prompt injection detected",
                        category="INJECTION",
                        risk_level=RiskLevel.HIGH
                    )

        # PII detection
        if self.block_pii:
            for pii_type, pattern in self.PII_PATTERNS.items():
                if re.search(pattern, input_text):
                    return ValidationResult(
                        allowed=False,
                        reason=f"PII detected: {pii_type}",
                        category="PII",
                        risk_level=RiskLevel.MEDIUM
                    )

        return ValidationResult(allowed=True)

    def sanitize(self, input_text: str) -> str:
        """Sanitize input without blocking."""
        sanitized = input_text
        # Remove control characters (except newlines, tabs)
        sanitized = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', sanitized)
        import unicodedata
        sanitized = unicodedata.normalize('NFKC', sanitized)
        # Escape potential special tokens
        sanitized = sanitized.replace("<|", "< |")
        sanitized = sanitized.replace("|>", "| >")
        return sanitized

Pattern Matching Limitations

Determined attackers can bypass pattern-based filters through homoglyph attacks, Unicode tricks, and indirect injection via retrieved documents. Use pattern matching as a first line of defense, not the only defense.

2. Output Guardrails

Output guardrails validate LLM responses before returning them to users. They check for harmful content across categories (violence, hate speech, self-harm, illegal activity), PII leakage where the output contains sensitive data not present in the input, hallucination against retrieved source documents, and dangerous patterns in generated tool call arguments.

The action taken on failure depends on severity: critical and high-severity failures trigger blocking, medium-severity allows filtering (removing the problematic portion), and low-severity warrants a logged warning while allowing the response through.

Check	Purpose	Action on Failure
Harmful Content	Detect violence, hate, illegal content	Block
PII Leakage	Prevent exposure of personal data	Mask or Block
Hallucination	Flag unsupported claims	Warn or Modify
Tool Call Safety	Validate tool arguments	Block execution
Policy Compliance	Enforce usage policies	Block or Modify

3. Guardian Agents

A guardian agent uses a separate LLM to evaluate interactions for complex, context-dependent risks that pattern matching cannot capture. The guardian receives both the user input and the agent’s response, evaluates them across categories (prompt injection, harmful content, PII leakage, policy violation, hallucination), and returns a structured verdict with confidence scores, evidence, and a recommendation to allow, block, or modify.

Guardian Model Selection

Consider using a different model for the guardian than the main agent. This provides defense-in-depth against model-specific vulnerabilities — an attack that manipulates one model architecture may not affect another.

The cost of a guardian agent is real: an extra LLM call per interaction adds latency and API cost, and the guardian itself can be manipulated or produce false positives. Use guardians for high-stakes actions where the cost of a safety failure exceeds the overhead of evaluation.

4. Tool Execution Sandboxing

Sandboxing isolates tool execution to contain the blast radius of compromised or hallucinated tool calls. An effective sandbox enforces an allowlist of permitted tools, validates arguments against denied patterns before execution, enforces resource limits (CPU time, memory, output size), and restricts filesystem access to approved paths.

Strategy	Isolation Level	Use Case
Process limits	Low	Resource caps only
Docker containers	Medium	Most production use cases
gVisor/Firecracker	High	Untrusted code execution
WASM	High	Browser/edge execution
Separate VMs	Very High	Highest security needs

Best Practices

Never Trust LLM Output

Always validate LLM-generated code, commands, or data before execution. The LLM can be manipulated or hallucinate dangerous operations.

Principle of Least Privilege

Give agents only the minimum permissions needed. Do not grant write access if read is sufficient. Do not grant admin if user-level access is sufficient.

Human in the Loop

For high-stakes actions — deletions, financial transactions, external communications — require explicit human approval before execution.

Rate Limiting

Implement rate limits on all agent operations to prevent resource exhaustion and limit damage from compromised agents.

Audit Logging

Log all agent actions with context. This enables incident investigation and detection of anomalous behavior patterns before they cause serious harm.

Fail Secure

When guardrails fail or timeout, default to blocking rather than allowing. False positives are recoverable; security breaches often are not.

Implementation Checklist

A minimal safety implementation covers: input validation (length, format, injection patterns), PII detection and masking, sandboxed tool execution with resource limits, tool argument schema validation, output content filtering, rate limiting and resource caps, and comprehensive audit logging. For higher risk applications, add a guardian agent for complex policies and require human approval for irreversible actions. Test regularly with adversarial inputs to verify guardrails remain effective as the system evolves.