danielhuber.dev@proton.me Sunday, February 22, 2026

Safety & Guardrails

Defense in depth for AI agents: input validation, output filtering, tool sandboxing, guardian agents, and OWASP LLM security risks.


February 18, 2026

AI agents that take real actions in the world — writing files, calling APIs, executing code, sending communications — require multiple overlapping safety layers. No single technique catches everything. Pattern-based filters are fast and predictable but bypassable by determined adversaries. LLM-based evaluation is flexible and semantic but adds latency and cost. Sandboxing contains the blast radius of failures. The goal is defense in depth: combine all three so that any single point of failure does not compromise the whole system.

Defense in Depth

Multi-Layer Security Architecture
USER INPUT
  │
  ▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 1: INPUT GUARDRAILS                                       │
│ ├── Length & format validation                                  │
│ ├── Injection pattern detection                                 │
│ ├── PII detection & masking                                     │
│ └── Content policy filtering                                    │
└─────────────────────────────────────────────────────────────────┘
  │ (blocked or sanitized)
  ▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 2: AGENT EXECUTION                                        │
│ ├── Sandboxed tool execution                                    │
│ ├── Resource limits (time, memory, network)                     │
│ ├── Allowlisted tools only                                      │
│ └── Argument validation per tool                                │
└─────────────────────────────────────────────────────────────────┘
  │
  ▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 3: OUTPUT GUARDRAILS                                      │
│ ├── Harmful content detection                                   │
│ ├── PII leakage prevention                                      │
│ ├── Hallucination detection                                     │
│ └── Policy compliance check                                     │
└─────────────────────────────────────────────────────────────────┘
  │ (blocked, filtered, or modified)
  ▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 4: GUARDIAN AGENT (Optional)                              │
│ ├── LLM-based semantic analysis                                 │
│ ├── Context-aware evaluation                                    │
│ └── Complex policy enforcement                                  │
└─────────────────────────────────────────────────────────────────┘
  │
  ▼
USER OUTPUT
Defense in Depth

No single layer catches everything. Combine multiple techniques: pattern-based (fast, predictable) + LLM-based (flexible, semantic) + sandboxing (contains damage).

OWASP Top 10 for LLM Applications

The OWASP Top 10 for LLMs identifies the most critical security risks in LLM-powered systems.

OWASP Top 10 LLM Risks (2025)
RiskDescriptionMitigation
LLM01: Prompt InjectionMalicious input manipulates LLM behaviorInput validation, instruction hierarchy
LLM02: Insecure OutputLLM output executed without validationOutput sanitization, sandboxing
LLM03: Training Data PoisoningMalicious data corrupts model behaviorData validation, provenance tracking
LLM04: Denial of ServiceResource exhaustion attacksRate limiting, resource caps
LLM05: Supply ChainCompromised models, plugins, or dataIntegrity checks, trusted sources
LLM06: Permission IssuesLLM granted excessive permissionsLeast privilege, human approval
LLM07: Data LeakageSensitive data exposed in responsesPII filtering, access controls
LLM08: Excessive AgencyLLM takes unintended autonomous actionsAction limits, confirmation prompts
LLM09: OverrelianceUsers trust LLM output without verificationConfidence indicators, source citations
LLM10: Model TheftExtraction of model weights or behaviorAPI rate limits, watermarking

1. Input Guardrails

Input guardrails validate and sanitize all user input before it reaches the LLM. The implementation combines structural validation (length limits, encoding checks, format restrictions) with content filtering for injection patterns, PII, and malicious payloads.

import re
from dataclasses import dataclass
from enum import Enum

class RiskLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

@dataclass
class ValidationResult:
    allowed: bool
    reason: str | None = None
    category: str | None = None
    risk_level: RiskLevel | None = None

class InputGuardrail:
    # Injection patterns to detect
    INJECTION_PATTERNS = [
        r"ignore\s+(previous|above|all)\s+instructions",
        r"you\s+are\s+now\s+",
        r"act\s+as\s+(if\s+you\s+are|a)\s+",
        r"pretend\s+(you\s+are|to\s+be)",
        r"system\s*:\s*",
        r"\[INST\]|\[/INST\]",
        r"<\|im_start\|>|<\|im_end\|>",
    ]

    # PII patterns
    PII_PATTERNS = {
        "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
        "credit_card": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
        "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b",
        "phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
    }

    def __init__(
        self,
        max_length: int = 10000,
        block_pii: bool = True,
        block_injections: bool = True
    ):
        self.max_length = max_length
        self.block_pii = block_pii
        self.block_injections = block_injections

    def validate(self, input_text: str) -> ValidationResult:
        # Length check
        if len(input_text) > self.max_length:
            return ValidationResult(
                allowed=False,
                reason=f"Input exceeds maximum length ({self.max_length})",
                category="LENGTH",
                risk_level=RiskLevel.LOW
            )

        # Injection detection
        if self.block_injections:
            for pattern in self.INJECTION_PATTERNS:
                if re.search(pattern, input_text, re.IGNORECASE):
                    return ValidationResult(
                        allowed=False,
                        reason="Potential prompt injection detected",
                        category="INJECTION",
                        risk_level=RiskLevel.HIGH
                    )

        # PII detection
        if self.block_pii:
            for pii_type, pattern in self.PII_PATTERNS.items():
                if re.search(pattern, input_text):
                    return ValidationResult(
                        allowed=False,
                        reason=f"PII detected: {pii_type}",
                        category="PII",
                        risk_level=RiskLevel.MEDIUM
                    )

        return ValidationResult(allowed=True)

    def sanitize(self, input_text: str) -> str:
        """Sanitize input without blocking."""
        sanitized = input_text
        # Remove control characters (except newlines, tabs)
        sanitized = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', sanitized)
        import unicodedata
        sanitized = unicodedata.normalize('NFKC', sanitized)
        # Escape potential special tokens
        sanitized = sanitized.replace("<|", "< |")
        sanitized = sanitized.replace("|>", "| >")
        return sanitized

2. Output Guardrails

Output guardrails validate LLM responses before returning them to users. They check for harmful content across categories (violence, hate speech, self-harm, illegal activity), PII leakage where the output contains sensitive data not present in the input, hallucination against retrieved source documents, and dangerous patterns in generated tool call arguments.

The action taken on failure depends on severity: critical and high-severity failures trigger blocking, medium-severity allows filtering (removing the problematic portion), and low-severity warrants a logged warning while allowing the response through.

CheckPurposeAction on Failure
Harmful ContentDetect violence, hate, illegal contentBlock
PII LeakagePrevent exposure of personal dataMask or Block
HallucinationFlag unsupported claimsWarn or Modify
Tool Call SafetyValidate tool argumentsBlock execution
Policy ComplianceEnforce usage policiesBlock or Modify

3. Guardian Agents

A guardian agent uses a separate LLM to evaluate interactions for complex, context-dependent risks that pattern matching cannot capture. The guardian receives both the user input and the agent’s response, evaluates them across categories (prompt injection, harmful content, PII leakage, policy violation, hallucination), and returns a structured verdict with confidence scores, evidence, and a recommendation to allow, block, or modify.

Guardian Model Selection

Consider using a different model for the guardian than the main agent. This provides defense-in-depth against model-specific vulnerabilities — an attack that manipulates one model architecture may not affect another.

The cost of a guardian agent is real: an extra LLM call per interaction adds latency and API cost, and the guardian itself can be manipulated or produce false positives. Use guardians for high-stakes actions where the cost of a safety failure exceeds the overhead of evaluation.

4. Tool Execution Sandboxing

Sandboxing isolates tool execution to contain the blast radius of compromised or hallucinated tool calls. An effective sandbox enforces an allowlist of permitted tools, validates arguments against denied patterns before execution, enforces resource limits (CPU time, memory, output size), and restricts filesystem access to approved paths.

StrategyIsolation LevelUse Case
Process limitsLowResource caps only
Docker containersMediumMost production use cases
gVisor/FirecrackerHighUntrusted code execution
WASMHighBrowser/edge execution
Separate VMsVery HighHighest security needs

Best Practices

Audit Logging

Log all agent actions with context. This enables incident investigation and detection of anomalous behavior patterns before they cause serious harm.

Fail Secure

When guardrails fail or timeout, default to blocking rather than allowing. False positives are recoverable; security breaches often are not.

Implementation Checklist

A minimal safety implementation covers: input validation (length, format, injection patterns), PII detection and masking, sandboxed tool execution with resource limits, tool argument schema validation, output content filtering, rate limiting and resource caps, and comprehensive audit logging. For higher risk applications, add a guardian agent for complex policies and require human approval for irreversible actions. Test regularly with adversarial inputs to verify guardrails remain effective as the system evolves.

Tags: safetyguardrailsowaspsecurity