The Proposer–Safety Oracle Pattern: Separating Decision Generation from Runtime Governance
How to architect multi-agent systems with a dedicated Safety Oracle that enforces explicit risk policies independent of the decision-making agent.
Building safe multi-agent systems is harder than it looks. The instinct is to bake safety constraints directly into the agent’s prompt or fine-tune them into the model — but both approaches are fragile, hard to audit, and tightly coupled to a specific model architecture. A cleaner structural answer is to separate the concern entirely: let one component generate decisions and let a distinct, purpose-built component govern whether those decisions are safe to execute.
The Core Separation of Concerns
The Proposer–Safety Oracle pattern divides a multi-agent system into two cooperating roles. The Proposer is responsible for everything that requires intelligence: reasoning about the task, selecting tools, planning action sequences, and generating candidate outputs. It operates without internal safety constraints — its only job is to produce the best possible response given the goal.
The Safety Oracle sits between the Proposer and the environment. Before any action is executed or any output is returned, the Oracle evaluates it against a set of explicit, externally defined risk policies. If the proposed action violates policy, the Oracle can block it, request a revision, or escalate to a human reviewer. The Proposer never directly touches the environment.
Because the Safety Oracle holds its policies externally — not embedded in a model’s weights — you can update, audit, and version those policies independently of the underlying LLM. This is the key engineering advantage over prompt-based safety.
Why Tight Coupling Fails in Practice
When safety logic lives inside the model (via system prompt or RLHF), several problems emerge at production scale:
Policy drift: Safety instructions in prompts can be diluted by long context, overridden by later user turns, or simply ignored when the model is under instruction pressure to complete a task.
Opaque failure modes: When a safety-tuned model refuses or hedges, there’s no structured signal about which policy was triggered or why. Debugging is difficult, and regression testing is nearly impossible.
Architecture lock-in: If safety is baked into a specific model, swapping to a different model (e.g., upgrading to a newer checkpoint, or routing to a cheaper model for low-stakes tasks) requires re-validating all safety behavior from scratch.
Separating the Oracle into a discrete component addresses all three problems: it creates an explicit, inspectable decision point with structured inputs and outputs.
Implementing the Oracle as a System Boundary
In practice, the Safety Oracle is best implemented as a synchronous checkpoint that every proposed action must pass through before execution. Here’s a minimal implementation sketch:
```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    REVISE = "revise"
    ESCALATE = "escalate"


@dataclass
class ProposedAction:
    tool_name: str
    arguments: dict
    reasoning: str
    risk_context: dict  # metadata the Oracle can use for policy evaluation


@dataclass
class OracleDecision:
    verdict: Verdict
    policy_id: str | None  # which policy triggered, if any
    message: str


class SafetyOracle:
    def __init__(self, policies: list[Callable[[ProposedAction], OracleDecision | None]]):
        self.policies = policies

    def evaluate(self, action: ProposedAction) -> OracleDecision:
        # First policy to return a decision wins; None means "no objection".
        for policy in self.policies:
            decision = policy(action)
            if decision is not None:
                return decision
        return OracleDecision(verdict=Verdict.ALLOW, policy_id=None, message="passed")


# Example policy: block file deletion outside allowed paths
def no_delete_outside_sandbox(action: ProposedAction) -> OracleDecision | None:
    if action.tool_name == "delete_file":
        path = action.arguments.get("path", "")
        if not path.startswith("/sandbox/"):
            return OracleDecision(
                verdict=Verdict.BLOCK,
                policy_id="filesystem.delete.scope",
                message=f"Deletion outside /sandbox/ is not permitted. Path: {path}"
            )
    return None
```
The execution loop passes every candidate through oracle.evaluate(proposed_action) before anything runs. If the verdict is REVISE, the Proposer receives the structured feedback and generates a new candidate; if ESCALATE, a human-in-the-loop step is triggered.
```
┌─────────────────────────────────────────────────────────┐
│                  Agent Execution Loop                   │
│                                                         │
│  ┌──────────────┐   proposed    ┌──────────────────┐    │
│  │              │   action      │                  │    │
│  │   Proposer   │ ────────────► │  Safety Oracle   │    │
│  │  (LLM core)  │               │  (policy engine) │    │
│  │              │ ◄──────────── │                  │    │
│  └──────────────┘  verdict +    └────────┬─────────┘    │
│        ▲           feedback              │ ALLOW        │
│        │        (REVISE loop)            ▼              │
│        │                       ┌──────────────────┐     │
│        │                       │  Environment /   │     │
│        │                       │  Tool Executor   │     │
│        │                       └──────────────────┘     │
│        │                                 │              │
│        └─────────────────────────────────┘              │
│                  observation                            │
└─────────────────────────────────────────────────────────┘
```
Designing Effective Risk Policies
The quality of the Oracle depends entirely on the quality of its policies. Several design principles matter here:
Policies should be compositional, not monolithic. Each policy function tests exactly one constraint. This makes them independently testable, auditable, and replaceable. A single is_safe() function that checks everything is an anti-pattern.
Policies should operate on structured data. If the Proposer passes raw text to the Oracle, the Oracle must parse intent — which reintroduces LLM-based evaluation with all its opacity. Instead, structure proposed actions as typed objects (tool name, arguments, context metadata) so policies can apply deterministic logic.
Risk context should be explicit. The Proposer should annotate its proposed actions with relevant risk signals: the task origin (human vs. automated), the confidence level, any user-provided overrides, and the scope of affected resources. The Oracle can use this context to apply tiered policies — stricter for low-confidence or high-blast-radius actions.
Version your policy sets the same way you version code. Tag policy releases, run regression suites when policies change, and log which policy triggered on every blocked action. This creates an audit trail that is invaluable during incident review.
Architecture-Agnostic Safety at Scale
The most durable benefit of this pattern is that the safety layer becomes independent of the model powering the Proposer. You can swap from GPT-4o to Claude to a locally hosted Llama checkpoint without touching the Oracle. You can run A/B tests on Proposer models while keeping safety behavior constant. You can apply the same Oracle to multiple Proposers in a larger multi-agent system — a coding agent, a web search agent, and a data analysis agent can all route through a shared governance layer.
This composability also simplifies compliance. In regulated environments, demonstrating that safety controls are enforced requires showing that those controls are not bypassable by the model’s own outputs. When the Oracle is a distinct system boundary with its own logs, that demonstration becomes straightforward: every action in the execution trace carries a corresponding Oracle verdict.
The pattern does add latency — every action requires an Oracle round-trip before execution. For high-frequency agents or streaming interactions, you can batch low-risk actions or maintain a fast-path allowlist for pre-approved tool calls. The key is to keep the Oracle deterministic and cheap; reserve LLM-based policy evaluation (e.g., a classifier model) for edge cases that structured rules cannot cover.
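A fast-path allowlist can be layered in front of the full evaluator with a few lines. The tool names and the dict-based decision shape below are hypothetical placeholders:

```python
# Hypothetical fast path: pre-approved, read-only tool calls skip full policy
# evaluation; everything else routes through the complete Oracle pipeline.
FAST_PATH_TOOLS = frozenset({"read_file", "list_directory", "web_search"})


def evaluate_with_fast_path(action: dict, full_evaluate) -> dict:
    """Cheap deterministic check first; fall back to full_evaluate otherwise."""
    if action["tool"] in FAST_PATH_TOOLS:
        return {"verdict": "allow", "policy_id": "fastpath.allowlist"}
    return full_evaluate(action)
```

The allowlist itself should be governed like any other policy artifact: versioned, reviewed, and logged, so the fast path never becomes an unaudited hole in the boundary.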
This article is an AI-generated summary. Read the original paper: The Alignment Flywheel: A Governance-Centric Hybrid MAS for Architecture-Agnostic Safety.