danielhuber.dev@proton.me · Sunday, April 5, 2026

Silo-Bench: Benchmarking Distributed Coordination in Multi-Agent LLM Systems

How the Communication-Reasoning Gap exposes a critical failure mode in multi-agent LLM systems—and what engineers can do about it.


Multi-agent systems derive their power from the ability to divide work, share partial knowledge, and converge on a correct answer that no single agent could reach alone. But communication and reasoning are not the same skill—and assuming an agent that can coordinate will automatically synthesize distributed state correctly is one of the most common design mistakes in production multi-agent engineering. Silo-Bench is a structured benchmark that makes this distinction measurable.

The Core Insight: Coordination ≠ Synthesis

When engineers debug a failing multi-agent pipeline, the first instinct is usually to check whether messages are being delivered. Is agent B receiving what agent A sent? Are the routing rules correct? These are valid questions, but they only test the plumbing. A deeper failure mode occurs when every message arrives intact yet the receiving agent still produces a wrong answer—because synthesizing distributed state into a coherent conclusion is a fundamentally different cognitive task than passing information along.

This is what the Communication-Reasoning Gap names: agents can successfully form a coordination topology and exchange the right messages at the right times, yet still fail to integrate that distributed state into a correct final answer. It is analogous to a distributed database where all nodes replicate correctly but queries return wrong results because the query planner doesn’t understand consistency semantics. The infrastructure works; the reasoning layer doesn’t.

How Silo-Bench Structures the Problem

Silo-Bench approaches multi-agent coordination through the lens of communication complexity—a formal computer science concept that asks: how much information must be exchanged between parties to solve a problem? This framing is valuable because it lets you control the minimum coordination required by a task independently of how agents choose to coordinate.

The benchmark organizes 30 algorithmic tasks into three tiers:

  • Low communication complexity — Tasks solvable with minimal cross-agent information. Each agent holds a partition of the data and needs to share only a small summary (e.g., a single aggregate value) to reach the correct answer.
  • Medium communication complexity — Tasks requiring agents to exchange structured intermediate results. Correct synthesis depends on understanding the relationship between partitions, not just combining numbers.
  • High communication complexity — Tasks where the correct answer cannot be determined without near-complete information flow across all agents. These stress-test whether agents can maintain and integrate a growing shared context.
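To make the tiers concrete, here is a small illustrative sketch (the tasks are hypothetical stand-ins, not Silo-Bench's actual task set): a global max needs only one aggregate per agent, while an exact median cannot be recovered from per-shard summaries alone.

```python
import statistics

# Three agents each hold one shard of the data (illustrative values).
shards = [[1, 2, 100], [3, 4, 101], [5, 6, 102]]

# Low tier (global max): each agent shares a single aggregate value,
# and synthesis is a trivial fold over those summaries.
partial_maxes = [max(s) for s in shards]   # one number per agent crosses the wire
global_max = max(partial_maxes)            # 102

# High tier (exact median): per-shard summaries are NOT enough; the correct
# answer needs near-complete information flow across all agents.
median_of_medians = statistics.median(statistics.median(s) for s in shards)  # 4
true_median = statistics.median(x for s in shards for x in s)                # 5
assert median_of_medians != true_median    # local summaries mislead here
```

The low-tier task stays correct under any reasonable aggregation of summaries; the high-tier task breaks precisely because the relationship between partitions matters.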

The benchmark is role-agnostic, meaning agents are not pre-assigned fixed roles (collector, aggregator, verifier). This is a deliberate design choice: real production systems often need agents to adapt their coordination role dynamically, and static role assignment masks failures that only appear when agents must self-organize.

Task Partitions
┌─────────┐  ┌─────────┐  ┌─────────┐
│ Agent A │  │ Agent B │  │ Agent C │
│ shard 1 │  │ shard 2 │  │ shard 3 │
└────┬────┘  └────┬────┘  └────┬────┘
     │            │            │
     └──────┬─────┘            │
            │   coordination   │
            │  topology forms  │
            └────────┬─────────┘

              ┌──────▼──────┐
              │  Synthesis  │  ← Communication-Reasoning Gap lives here
              │   (LLM)     │
              └──────┬──────┘
                     ▼
               Final Answer

What This Reveals About LLM Agent Behavior

The benchmark surfaces several patterns that matter for production engineering:

Topology formation is relatively robust. Current LLMs, when given appropriate scaffolding, are reasonably good at deciding who should talk to whom and in what order. Graph-based coordination structures emerge reliably across model families.

Synthesis quality degrades non-linearly with complexity tier. Performance at the low tier is often acceptable. At the high tier, even models that communicate correctly produce wrong answers at high rates. This is not a context-length problem alone—agents fail even when all relevant information fits comfortably within their context window. The failure is in recognizing how to combine distributed state, not in having access to it.

Errors accumulate silently. A particularly dangerous failure mode is an agent that produces a confident, well-formatted answer that integrates received messages but applies the wrong aggregation logic. Without ground-truth evaluation on the final answer, this error is invisible to standard logging and tracing.
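A minimal sketch of this failure mode (illustrative data and message format, not Silo-Bench internals): every expected message arrives, yet synthesis applies the wrong aggregation and returns a confident, well-formed, wrong answer.

```python
# Every message is delivered intact.
messages = [
    {"agent_id": "agent_a", "op": "max", "partial": 42.0},
    {"agent_id": "agent_b", "op": "max", "partial": 17.0},
    {"agent_id": "agent_c", "op": "max", "partial": 35.0},
]

# Communication-layer check passes: all expected senders reported in.
assert {m["agent_id"] for m in messages} == {"agent_a", "agent_b", "agent_c"}

# Broken synthesis: sums the partials instead of taking their max.
confident_wrong_answer = sum(m["partial"] for m in messages)   # 94.0
ground_truth = max(m["partial"] for m in messages)             # 42.0

# Only a ground-truth check on the final answer catches this.
assert confident_wrong_answer != ground_truth
```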

Note

If your multi-agent evaluation only checks whether agents exchange the expected messages, you are testing your router, not your system. Add final-answer ground-truth evaluation as a first-class metric alongside communication trace validation.

Engineering Implications

For engineers building or evaluating multi-agent systems, the Communication-Reasoning Gap suggests several concrete practices:

Decouple your evaluation layers. Build separate test suites for (1) communication correctness—did the right information flow to the right agents in the right order—and (2) synthesis correctness—given that information, did the agent produce the right answer? These can and do fail independently.
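As a sketch, the two layers might look like this (`check_communication` and `check_synthesis` are hypothetical helper names, not part of any specific framework):

```python
def check_communication(trace: list[dict], expected_flow: list[tuple[str, str]]) -> bool:
    """Layer 1: did the right information flow between the right agents, in order?"""
    observed = [(m["sender"], m["receiver"]) for m in trace]
    return observed == expected_flow

def check_synthesis(final_answer: float, ground_truth: float, tol: float = 1e-9) -> bool:
    """Layer 2: given that information, was the final answer correct?"""
    return abs(final_answer - ground_truth) <= tol

# The two layers can and do fail independently:
trace = [
    {"sender": "agent_a", "receiver": "synth"},
    {"sender": "agent_b", "receiver": "synth"},
]
assert check_communication(trace, [("agent_a", "synth"), ("agent_b", "synth")])  # plumbing OK
assert not check_synthesis(final_answer=94.0, ground_truth=42.0)                 # reasoning failed
```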

Stress-test synthesis in isolation. You can construct unit tests for the synthesis step by feeding an agent a pre-assembled context representing what it would receive from a correctly functioning network, then evaluating its answer against ground truth. This removes communication noise from the diagnosis.
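A minimal sketch of such an isolation test, with `call_agent` as a hypothetical stand-in for your actual LLM invocation:

```python
def build_synthesis_context(updates: list[dict]) -> str:
    """Assemble the context a synthesizer would see from a correctly functioning network."""
    lines = [f"{u['agent_id']}: shard {u['shard']} -> partial max {u['partial']}" for u in updates]
    return "Task: report the global max.\n" + "\n".join(lines)

def call_agent(context: str) -> float:
    # Stand-in for an LLM call; replace with your agent invocation.
    partials = [float(line.rsplit(" ", 1)[-1]) for line in context.splitlines()[1:]]
    return max(partials)

updates = [
    {"agent_id": "agent_a", "shard": "0-99", "partial": 42.0},
    {"agent_id": "agent_b", "shard": "100-199", "partial": 17.0},
]
ctx = build_synthesis_context(updates)
assert call_agent(ctx) == 42.0  # ground-truth check, with no communication noise involved
```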

Use communication complexity as a task design parameter. When defining tasks for your multi-agent system, explicitly think about which tier of information exchange they require. Tasks at the high end demand more robust synthesis prompting, more careful context structuring, and more thorough evaluation coverage.
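One way to operationalize this is to tag each task with its tier and scale evaluation coverage accordingly (a hypothetical policy sketch; the task names and run counts are illustrative):

```python
from enum import Enum

class CommTier(Enum):
    LOW = 1     # one aggregate per agent suffices
    MEDIUM = 2  # structured intermediate results needed
    HIGH = 3    # near-complete information flow required

# Illustrative task registry mapping tasks to their required tier.
TASKS = {
    "global_max": CommTier.LOW,
    "join_partitions": CommTier.MEDIUM,
    "exact_median": CommTier.HIGH,
}

def eval_runs_for(task: str) -> int:
    # Higher tiers get proportionally more evaluation coverage (illustrative policy).
    return 10 * TASKS[task].value
```

Making the tier explicit in the task definition also documents, for future maintainers, why some tasks carry heavier synthesis prompting and evaluation than others.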

Prefer explicit state schemas over free-form messages. Agents that receive structured, typed summaries of peer state (e.g., a JSON object with named fields) outperform those receiving unstructured natural language updates on synthesis tasks. The more complex the coordination, the more important message schema design becomes.

# Prefer structured inter-agent messages for synthesis-heavy tasks
from pydantic import BaseModel

class AgentStateUpdate(BaseModel):
    agent_id: str
    shard_range: tuple[int, int]
    partial_result: float
    confidence: float
    dependencies_seen: list[str]  # which other agents this agent has integrated

# Fragile: natural language summary
message = "I processed items 0-99 and found the max is 42, seems pretty high"

# Robust: typed schema
message = AgentStateUpdate(
    agent_id="agent_a",
    shard_range=(0, 99),
    partial_result=42.0,
    confidence=0.95,
    dependencies_seen=[]
)

Designing Benchmarks That Actually Break Your System

The broader methodological lesson from Silo-Bench is about benchmark design philosophy. A benchmark that only tests whether agents can communicate misses the failure mode that actually matters in deployed systems. Good multi-agent benchmarks should include tasks where communication is necessary but not sufficient—where correct coordination is a prerequisite, not a guarantee, of correct output.

This principle extends beyond algorithmic tasks. If you are building a research agent, a coding agent swarm, or any system where multiple agents hold partial knowledge, your evaluation suite should include cases where each agent individually lacks the information to answer correctly, but the collective system should be able to. Those cases are precisely where the Communication-Reasoning Gap will surface, and they are the cases your production system will encounter first.

Tip

When designing your own multi-agent evaluations, include at least some tasks where a single-agent baseline with full information would trivially succeed. If your multi-agent system underperforms that baseline despite correct communication, you have found a synthesis failure worth investigating.
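A sketch of that baseline comparison (hypothetical harness; the data, task, and function names are illustrative):

```python
def single_agent_baseline(full_data: list[float]) -> float:
    # With full information, the task is trivial for a single agent.
    return max(full_data)

def evaluate(multi_agent_answer: float, full_data: list[float]) -> str:
    """Flag cases where the multi-agent system loses to a full-information baseline."""
    baseline = single_agent_baseline(full_data)
    if multi_agent_answer == baseline:
        return "ok"
    return "synthesis failure: multi-agent underperforms full-information baseline"

full_data = [3.0, 42.0, 17.0]
assert evaluate(42.0, full_data) == "ok"
assert evaluate(62.0, full_data).startswith("synthesis failure")
```

If communication traces were also correct for the failing run, the gap between the two results is attributable to synthesis, which is exactly the signal the Tip above asks for.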

Tags: research, multi-agent, benchmarking, evaluation, coordination, distributed-systems