Adaptive Memory Admission Control: Deciding What Your Agent Should Remember
How to build a memory admission layer that combines rule-based feature extraction with LLM utility scoring to decide which observations are worth storing, improving retrieval precision while keeping latency and cost in check.
Most agent memory systems are designed around retrieval: how do you find the right information at query time? Far less attention goes to the earlier and equally critical question—should this information be stored at all? Memory admission control is the gate between an agent’s raw stream of observations and its persistent memory store, and getting it wrong in either direction degrades the system. Admit too little and the agent forgets things it needs; admit too much and retrieval drowns in noise.
Why Admission Control Is an Underrated Problem
Every agent that accumulates memory across sessions faces a compounding noise problem. Tool call results, intermediate reasoning steps, user messages, retrieved documents, and error traces all flow through an agent’s context window. Naively persisting everything creates a memory store that grows unbounded and becomes increasingly polluted with low-signal observations. At retrieval time, irrelevant entries compete with genuinely useful ones, degrading both precision and the quality of the context that ends up in the model’s window.
The naive fix—only store final outputs or explicit user instructions—undershoots in the other direction. Agents frequently need to recall intermediate decisions, constraint values learned mid-task, or observations from tool calls that didn’t directly produce a final answer. Binary heuristics lose this nuance.
Memory admission is a precision-recall tradeoff problem, not a storage problem. Treating it as purely a cost concern causes engineers to under-invest in the selection logic.
The Two-Stage Admission Architecture
A practical memory admission layer benefits from splitting the decision into two sequential stages that exploit different strengths.
Stage 1 — Rule-Based Feature Extraction: Before invoking a language model, extract cheap, deterministic features from the candidate memory fragment. These include structural signals (observation type, source tool, data format), statistical signals (token length, repetition ratio, entropy of the content), and relational signals (whether the fragment overlaps significantly with existing memory entries). Rules operate in microseconds and can reject obvious candidates—empty tool responses, boilerplate error messages, duplicate content—without any model call.
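A minimal sketch of the Stage 1 gate follows. The feature set (token count, exact-duplicate hash, repetition ratio, character entropy) matches the signals described above, but the specific thresholds and helper names are illustrative assumptions to be tuned against your own observation stream.

```python
import math
from collections import Counter

# Illustrative thresholds -- tune against your own observation stream.
MIN_TOKENS = 5
MAX_REPETITION_RATIO = 0.6
MIN_ENTROPY_BITS = 2.0

def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character; low values indicate boilerplate."""
    counts = Counter(text)
    total = len(text)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def repetition_ratio(text: str) -> float:
    """Fraction of whitespace-delimited tokens that are duplicates."""
    tokens = text.split()
    if not tokens:
        return 1.0
    return 1.0 - len(set(tokens)) / len(tokens)

def passes_rule_filter(fragment: str, seen_hashes: set) -> bool:
    """Cheap, deterministic Stage 1 gate: reject empty, duplicate,
    highly repetitive, or low-entropy fragments before any LLM call."""
    if len(fragment.split()) < MIN_TOKENS:
        return False
    h = hash(fragment.strip().lower())
    if h in seen_hashes:  # exact-duplicate check against prior candidates
        return False
    if repetition_ratio(fragment) > MAX_REPETITION_RATIO:
        return False
    if char_entropy(fragment) < MIN_ENTROPY_BITS:
        return False
    seen_hashes.add(h)
    return True
```

Each check runs in microseconds; only fragments that survive all of them incur the Stage 2 LLM call.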
Stage 2 — LLM Utility Assessment: Candidates that survive the rule filter are passed to a lightweight LLM call that estimates utility: given the agent’s current task context and memory profile, how likely is this fragment to be useful in a future retrieval? This is framed as a structured scoring problem, not open-ended generation. A constrained prompt elicits a score or a binary admit/reject decision with a brief rationale.
Agent Observation Stream
          │
          ▼
┌───────────────────┐
│ Rule-Based Filter │ ◄── Type, duplication, entropy checks
│   (Cheap, Fast)   │
└────────┬──────────┘
         │ passes filter
         ▼
┌───────────────────┐
│    LLM Utility    │ ◄── Task context + memory profile
│     Assessor      │
└────────┬──────────┘
         │ admit decision
    ┌────┴────┐
    │         │
  STORE    DISCARD
    │
    ▼
Memory Store
The key architectural insight is that the two stages have complementary failure modes. The rule filter has high recall (it rarely rejects something genuinely useful) but low precision (it lets through a lot of noise). The LLM assessor has higher precision but introduces latency and cost. By running the LLM only on rule-filter survivors, you apply expensive computation selectively.
Designing the Utility Assessment Prompt
The LLM call in Stage 2 is not a general-purpose reasoning task—it should be constrained to produce a structured output quickly. A minimal prompt structure looks like this:
System: You are a memory utility evaluator for an AI agent.
You will be given a candidate memory fragment, the agent's current task description,
and a summary of what the agent already has in memory.
Score the candidate on a scale of 0–3:
0 = redundant or irrelevant
1 = marginally useful, low retrieval probability
2 = likely useful in future steps
3 = critical, high retrieval probability
Respond with JSON: {"score": <int>, "reason": "<one sentence>"}
Task: {task_description}
Existing memory summary: {memory_summary}
Candidate fragment: {candidate}
A score threshold (e.g., admit if score ≥ 2) gates final admission. The reason field is valuable for debugging and for building evaluation datasets to tune the threshold over time.
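Gating on the assessor's reply can be sketched as below. The failure policy on malformed output is a design choice not specified above; this sketch fails closed (rejects), on the assumption that dropping one fragment is cheaper than admitting unscored noise.

```python
import json

ADMIT_THRESHOLD = 2  # admit if score >= 2, per the 0-3 rubric above

def parse_admission(raw_response: str, threshold: int = ADMIT_THRESHOLD):
    """Parse the assessor's JSON reply into (admit, score, reason).
    Fails closed: malformed output is treated as a rejection."""
    try:
        payload = json.loads(raw_response)
        score = int(payload["score"])
        reason = str(payload.get("reason", ""))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return False, None, "unparseable assessor output"
    return score >= threshold, score, reason
```

Logging the returned reason alongside the admit decision gives you the labeled examples needed to tune the threshold later.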
Use a smaller, faster model for the utility assessor—a 7B or 8B parameter model fine-tuned on your domain often outperforms a larger general model here because the decision is well-scoped and the latency budget is tight.
Latency and Throughput Considerations
Memory writes sit on the critical path in synchronous agent architectures. If the admission decision adds 200ms to every tool call response, the UX cost accumulates quickly across a multi-step task. There are three practical strategies for keeping admission overhead manageable.
Async admission: Immediately write candidates to a staging buffer and run the admission pipeline asynchronously. The agent continues executing while admission decisions are processed in the background. Final memory writes happen with a short delay. This eliminates admission latency from the critical path but requires the agent’s retrieval layer to query both the staging buffer and committed memory during the window before a decision resolves.
Batched assessment: Accumulate multiple candidates and assess them in a single LLM call using a list-scoring format. Batch sizes of 4–8 fragments typically reduce per-fragment LLM latency by 40–60% compared to individual calls.
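A list-scoring prompt for batched assessment might be assembled as follows. The exact response schema (a JSON array keyed by candidate index) is an assumption; the source only specifies the list-scoring format in general terms.

```python
def build_batch_prompt(task: str, memory_summary: str, candidates: list) -> str:
    """Pack several candidate fragments into one list-scoring prompt,
    so a single LLM call scores the whole batch (typically 4-8 fragments)."""
    lines = [
        "Score each candidate memory fragment 0-3 for future retrieval utility.",
        f"Task: {task}",
        f"Existing memory summary: {memory_summary}",
        "Candidates:",
    ]
    for i, c in enumerate(candidates):
        lines.append(f"[{i}] {c}")
    lines.append(
        'Respond with JSON: [{"id": <int>, "score": <int>, "reason": "<one sentence>"}, ...]'
    )
    return "\n".join(lines)
```

The numeric `[i]` tags let the response be joined back to the original fragments regardless of output ordering.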
Adaptive threshold: Track the admission rate over a rolling window. If the agent is in a high-velocity phase (many rapid tool calls), raise the score threshold temporarily to reduce LLM calls. If the agent is in a reflective or summarization phase, lower the threshold to capture more detail.
Integrating Admission Control with Retrieval
Admission control and retrieval are not independent subsystems—they should share signals. Retrieval hit rates provide a natural feedback signal for admission quality: if fragments from a certain source or type are consistently retrieved and used, the admission threshold for that category should be lower (more permissive). If certain fragment types are admitted but never retrieved, they should be filtered earlier.
This creates a closed loop: retrieval telemetry informs admission rules, which affects memory composition, which affects retrieval quality. In practice, log retrieval events with the memory fragment IDs and run a periodic job that updates rule-filter weights and score thresholds based on observed retrieval utility.
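The periodic threshold-update job might look like this sketch. The log schema (a mapping of fragment id to category, plus a set of retrieved fragment ids) and the hit-rate cutoffs are assumptions for illustration.

```python
from collections import defaultdict

def update_category_thresholds(admissions, retrievals, base=2, min_samples=20):
    """Periodic job: lower the threshold for categories whose admitted
    fragments are frequently retrieved, raise it for dead weight.

    admissions: dict of fragment_id -> category (e.g. "tool_result")
    retrievals: set of fragment_ids seen in retrieval events
    """
    admitted = defaultdict(int)
    used = defaultdict(int)
    for fid, cat in admissions.items():
        admitted[cat] += 1
        if fid in retrievals:
            used[cat] += 1

    thresholds = {}
    for cat, n in admitted.items():
        if n < min_samples:
            thresholds[cat] = base               # not enough evidence yet
            continue
        hit_rate = used[cat] / n
        if hit_rate > 0.5:
            thresholds[cat] = max(base - 1, 0)   # more permissive
        elif hit_rate < 0.05:
            thresholds[cat] = min(base + 1, 3)   # filter earlier
        else:
            thresholds[cat] = base
    return thresholds
```

The `min_samples` floor keeps a handful of early retrievals from swinging a category's policy before the telemetry is meaningful.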
Avoid tuning admission thresholds purely on storage cost metrics. Optimize for retrieval utility—whether admitted memories are actually retrieved and positively affect downstream task performance. Cost and storage are lagging indicators of a good admission policy, not leading ones.
Memory admission control is one of those infrastructure concerns that feels optional until a production agent starts hallucinating facts from weeks-old, irrelevant tool calls buried in its memory store. Building the admission layer as a first-class, tunable component—rather than an afterthought—pays compounding dividends as an agent accumulates history across sessions.
This article is an AI-generated summary. Read the original paper: Adaptive Memory Admission Control for LLM Agents.