Defending Agent Memory Against Poisoning: Bayesian Trust Scoring in Multi-Agent Systems
How to protect shared agent memory from poisoning attacks using Bayesian trust models, local-first storage, and adaptive ranking.
Shared memory is one of the most powerful primitives in multi-agent systems—and one of the most exploitable. When agents read from and write to a common memory store, a single compromised or misconfigured agent can corrupt the beliefs of every agent that reads from it. Bayesian trust scoring offers a principled way to reason about the reliability of memory contributions before they influence downstream agent behavior.
Why Agent Memory Is a Security Surface
In a multi-agent system, memory plays the role that a shared database plays in a distributed application: it lets agents communicate state across turns, tasks, and time without passing everything through the context window of a single model. But this convenience introduces a threat model most agent engineers don’t think about explicitly.
Consider a pipeline where a research agent writes summaries into a shared store, and a synthesis agent reads from that store to produce a final report. If the research agent is manipulated—through a prompt injection, a poisoned tool response, or a compromised external data source—it may write subtly false or misleading entries into memory. The synthesis agent, trusting the store, propagates the error. Unlike a hallucination confined to one model’s output, a poisoned memory entry can persist, replicate, and influence many subsequent agent actions.
This is the memory poisoning threat: adversarial or erroneous writes that corrupt shared state in ways that are hard to detect because the corrupted data looks structurally valid.
Memory poisoning is distinct from prompt injection. Injection attacks target the model’s context at inference time; poisoning attacks target the persistent store that feeds future contexts. A poisoned entry can survive across sessions and affect agents that were never directly attacked.
Bayesian Trust Scoring for Memory Contributions
The core insight behind Bayesian trust defense is that each memory contribution should carry a credibility estimate, and that estimate should update as evidence accumulates. Rather than treating every write as equally authoritative, the system maintains a trust distribution over each agent (or memory source) and uses it to weight how much influence any given entry has on downstream retrieval.
In a Bayesian framework, each contributing agent starts with a prior trust distribution—often a Beta distribution parameterized by prior successes and failures, where “success” means a contributed memory entry was later confirmed or corroborated, and “failure” means it was contradicted or flagged. As the system observes outcomes, it updates the posterior:
# Simplified Beta-Binomial trust update
class AgentTrust:
    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha  # prior successes
        self.beta = beta    # prior failures

    @property
    def score(self) -> float:
        """Expected value of the Beta distribution."""
        return self.alpha / (self.alpha + self.beta)

    def update(self, confirmed: bool):
        if confirmed:
            self.alpha += 1.0
        else:
            self.beta += 1.0
When an agent reads from memory, retrieved entries are ranked not just by semantic relevance but also by the trust score of the agent that wrote them. A highly relevant entry from a low-trust agent may rank below a moderately relevant entry from a high-trust agent. This is the adaptive learning-to-rank component: relevance and trust are combined into a single retrieval score rather than applied as a hard filter.
Using a Beta distribution for trust is convenient because it’s conjugate to the Bernoulli likelihood, meaning posterior updates are closed-form and cheap to compute. For production systems with many agents, this matters: trust scores can be updated synchronously on every write confirmation without requiring a separate inference step.
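One practical benefit of keeping the full Beta posterior rather than a point estimate is that uncertainty comes for free. The helper below is a sketch using the standard mean and variance formulas for the Beta distribution; the name `beta_mean_and_std` is illustrative, not from the paper:

```python
def beta_mean_and_std(alpha: float, beta: float) -> tuple[float, float]:
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    mean = alpha / (alpha + beta)
    var = (alpha * beta) / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, var ** 0.5

# Same expected trust (0.8), very different certainty:
print(beta_mean_and_std(4, 1))      # few observations -> wide posterior
print(beta_mean_and_std(400, 100))  # many observations -> narrow posterior
```

Two agents with the same point score can thus be treated differently: a retrieval policy can discount the one whose posterior is still wide.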
Local-First Storage and Privacy Implications
Centralized vector databases are convenient, but they introduce privacy risks when agents are processing sensitive data—user documents, proprietary code, medical records. A local-first architecture keeps all memory artifacts on the same machine (or within the same trust boundary) as the agents that produce them, using an embedded database like SQLite rather than a remote service.
This design has concrete engineering tradeoffs:
- Latency: Reads and writes hit local disk rather than a network endpoint. For write-heavy workloads, this can be significantly faster.
- Privacy: No data leaves the host. This can satisfy strict data-residency and compliance requirements without additional encryption-in-transit configuration.
- Scalability ceiling: Local storage doesn’t horizontally scale the way a managed vector store does. For large multi-agent deployments spanning multiple machines, you need either replication or a federated trust model.
- Backup and durability: SQLite is a single file; backup strategies need to be explicit.
┌─────────────────────────────────────────────────────────┐
│  Agent Cluster (local)                                  │
│                                                         │
│  ┌──────────┐  write+trust_score    ┌─────────────┐     │
│  │ Agent A  │──────────────────────▶│             │     │
│  └──────────┘                       │   SQLite    │     │
│                                     │   Memory    │     │
│  ┌──────────┐  ranked_read          │   Store     │     │
│  │ Agent B  │◀──────────────────────│      +      │     │
│  └──────────┘  (relevance × trust)  │  Trust DB   │     │
│                                     └──────┬──────┘     │
│  ┌──────────┐  confirm/refute              │            │
│  │ Agent C  │──────────────────────▶ trust update       │
│  └──────────┘                        (posterior update) │
└─────────────────────────────────────────────────────────┘
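A minimal sketch of what the local-first store might look like, assuming a single SQLite file with one table for memory entries and one for per-agent Beta parameters. The schema, file name, and `confirm_write` helper are illustrative, not from the paper:

```python
import sqlite3

conn = sqlite3.connect("agent_memory.db")  # single local file
conn.executescript("""
    CREATE TABLE IF NOT EXISTS memory_entries (
        id INTEGER PRIMARY KEY,
        author_agent_id TEXT NOT NULL,
        content TEXT NOT NULL
    );
    CREATE TABLE IF NOT EXISTS agent_trust (
        agent_id TEXT PRIMARY KEY,
        alpha REAL NOT NULL DEFAULT 1.0,
        beta REAL NOT NULL DEFAULT 1.0
    );
""")

def confirm_write(agent_id: str, confirmed: bool) -> None:
    """Closed-form Beta posterior update, done in a single SQL statement."""
    column = "alpha" if confirmed else "beta"
    conn.execute(
        f"UPDATE agent_trust SET {column} = {column} + 1.0 WHERE agent_id = ?",
        (agent_id,),
    )
    conn.commit()
```

Because the conjugate update is a single increment, it maps cleanly onto one `UPDATE` statement, which is why synchronous per-write trust updates stay cheap.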
Implementing Trust-Weighted Retrieval
The practical challenge is integrating trust scores into your retrieval pipeline without breaking the semantic search interface your agents already use. A clean approach is to compute a combined score at retrieval time:
from dataclasses import dataclass
from typing import List
import math

@dataclass
class MemoryEntry:
    content: str
    embedding: list[float]
    author_agent_id: str
    trust_score: float  # pulled from AgentTrust at retrieval time

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def trust_weighted_retrieval(
    query_embedding: list[float],
    candidates: List[MemoryEntry],
    trust_weight: float = 0.3,
    top_k: int = 5,
) -> List[MemoryEntry]:
    """
    Rank memory entries by a convex combination of
    cosine similarity and author trust score.
    """
    def combined_score(entry: MemoryEntry) -> float:
        sim = cosine_similarity(query_embedding, entry.embedding)
        return (1 - trust_weight) * sim + trust_weight * entry.trust_score

    return sorted(candidates, key=combined_score, reverse=True)[:top_k]
The trust_weight hyperparameter controls how aggressively trust influences ranking. Setting it to 0 falls back to pure semantic retrieval; setting it high may suppress relevant-but-low-trust entries. In practice, you calibrate this based on how adversarially hostile your environment is and how stable your trust estimates are early in a deployment.
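To build intuition for calibration, the snippet below compares two hypothetical candidates under different trust_weight settings, using the same convex combination as trust_weighted_retrieval. The similarity and trust numbers are made up for illustration:

```python
def combined(sim: float, trust: float, w: float) -> float:
    """Convex combination of similarity and author trust."""
    return (1 - w) * sim + w * trust

relevant_low_trust = (0.90, 0.20)    # (similarity, author trust)
moderate_high_trust = (0.70, 0.95)

for w in (0.0, 0.3, 0.6):
    a = combined(*relevant_low_trust, w)
    b = combined(*moderate_high_trust, w)
    winner = "low-trust" if a > b else "high-trust"
    print(f"trust_weight={w}: {winner} entry ranks first")
```

At `w = 0.0` the highly relevant but low-trust entry wins; by `w = 0.3` the ranking has already flipped, which is exactly the behavior you want to tune deliberately rather than discover in production.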
Engineering Considerations for Production Deployments
Before adding Bayesian trust scoring to your memory layer, there are a few architectural decisions to nail down.
How do you get ground truth for trust updates? The trust loop requires feedback: you need to know whether a memory entry was correct or harmful. In some systems this is automatic: a downstream verification agent checks facts against a trusted source. In others it requires human review. Without a reliable feedback signal, trust scores stay pinned near their priors and provide little information.
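Closing that loop might look like the sketch below, where `verify_entry` stands in for whatever verification agent or human-review step a deployment actually uses; both the function and `close_trust_loop` are hypothetical names:

```python
from typing import Callable

def close_trust_loop(
    entries: list[dict],
    trust_by_agent: dict,                  # maps agent id -> AgentTrust-like object
    verify_entry: Callable[[dict], bool],  # stand-in for the verification step
) -> None:
    """Feed verification outcomes back into each author's trust posterior."""
    for entry in entries:
        confirmed = verify_entry(entry)
        trust_by_agent[entry["author_agent_id"]].update(confirmed)
```

The key design point is that verification is decoupled from the write path: entries land in memory immediately, and trust adjusts asynchronously as outcomes arrive.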
Cold start problem: New agents have no history, so their trust scores are uninformative. You can address this with a conservative prior (low alpha, high beta) that forces new agents to earn trust, or with identity-based bootstrapping where a new agent inherits the trust profile of its configuration class.
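In the Beta parameterization, a conservative prior just means starting with more pseudo-failures than pseudo-successes. The specific counts below, e.g. Beta(1, 4), are illustrative choices rather than recommendations from the paper:

```python
def beta_mean(alpha: float, beta: float) -> float:
    return alpha / (alpha + beta)

uniform = (1.0, 1.0)       # new agent starts at 0.5 trust
conservative = (1.0, 4.0)  # new agent starts at 0.2 and must earn trust

# After 8 confirmed writes and 0 refutations:
print(beta_mean(uniform[0] + 8, uniform[1]))            # 9/10 = 0.9
print(beta_mean(conservative[0] + 8, conservative[1]))  # 9/13 ~ 0.69
```

The conservative prior acts like a handicap of four phantom failures: the agent needs a longer clean track record before its writes carry full weight in retrieval.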
Adversarial trust manipulation: A sophisticated attacker might try to inflate an agent’s trust score before deploying a poisoning attack—building a track record of good writes, then inserting a harmful one. Temporal discounting (giving more weight to recent outcomes) and anomaly detection on write patterns (sudden large writes from a high-trust agent) are useful mitigations.
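Temporal discounting can be folded directly into the Beta update by decaying both pseudo-counts before each observation, so older evidence fades geometrically. The decay factor of 0.9 here is a hypothetical choice:

```python
def discounted_update(alpha: float, beta: float,
                      confirmed: bool, decay: float = 0.9) -> tuple[float, float]:
    """Decay existing pseudo-counts, then apply the standard Beta update."""
    alpha, beta = alpha * decay, beta * decay
    if confirmed:
        alpha += 1.0
    else:
        beta += 1.0
    return alpha, beta

# An agent with a long good history commits five bad writes in a row:
a, b = 50.0, 2.0  # high-trust posterior, score ~ 0.96
for _ in range(5):
    a, b = discounted_update(a, b, confirmed=False)
print(round(a / (a + b), 3))  # ~ 0.848, vs ~ 0.877 with undecayed counts
```

Discounting shrinks the weight of the banked good history, so a burst of recent failures moves the score faster than raw counts would, which is exactly what blunts the build-trust-then-poison strategy.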
Treat trust scores as one signal among several. Combine Bayesian trust with structural checks (schema validation, content filtering, source attribution) rather than relying on trust scoring alone to catch poisoning attempts. Defense in depth applies to agent memory the same way it applies to API security.
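The structural gate can be as simple as a schema and size check applied before any write reaches the trust-weighted store; the field names and size cap below are illustrative:

```python
MAX_CONTENT_CHARS = 8_000  # illustrative cap on entry size

def validate_entry(entry: dict) -> bool:
    """Cheap structural checks applied before any trust-weighted write."""
    required = {"content", "author_agent_id"}
    if not required.issubset(entry):
        return False  # missing source attribution or body
    if not isinstance(entry["content"], str) or not entry["content"].strip():
        return False  # empty or non-text payload
    if len(entry["content"]) > MAX_CONTENT_CHARS:
        return False  # oversized writes go to review, not straight to memory
    return True
```

Checks like these catch malformed or anomalous writes outright, leaving the Bayesian layer to handle the harder case of well-formed but untrustworthy content.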
Memory poisoning is an underappreciated failure mode as multi-agent systems move into production. Adding a probabilistic trust layer to your memory store—even a simple one—transforms memory from a passive ledger into an active participant in your system’s reliability guarantees.
This article is an AI-generated summary. Read the original paper: SuperLocalMemory: Privacy-Preserving Multi-Agent Memory with Bayesian Trust Defense Against Memory Poisoning.