Learning & Adaptation
How AI agents improve over time without retraining: token-space learning from successful trajectories, Reflexion self-critique, and self-evolving architectures.
Traditional machine learning models improve by updating their weights during training runs. Production LLM agents cannot do this — retraining on every interaction is computationally prohibitive and operationally impractical. Yet agents do get better over time, and that improvement has to happen somewhere. The answer lies in three complementary techniques that all operate at inference time by manipulating context rather than touching model weights: learning in token space, Reflexion, and self-evolving agent patterns. Each occupies a different position on the spectrum between safety and autonomy, and each is suited to different deployment contexts.
The Learning Challenge
The gap between static prompts and genuinely adaptive agents is significant. A system with a fixed system prompt behaves identically on day one and day three hundred, regardless of how many tasks it has processed. The three approaches below address this at increasing levels of complexity and risk.
All of these techniques work at inference time by manipulating context, not by changing model weights. This makes them practical for deployed systems where retraining is not an option.
Safety/Control ◄──────────────────────────────────────► Autonomy/Risk
┌─────────────────┬─────────────────┬─────────────────┬─────────────────┐
│ STATIC │ TOKEN SPACE │ REFLEXION │ SELF-EVOLVING │
│ PROMPTS │ LEARNING │ │ │
├─────────────────┼─────────────────┼─────────────────┼─────────────────┤
│ │ │ │ │
│ Fixed system │ Dynamic few- │ Self-critique │ Prompt/code │
│ prompt, no │ shot examples │ and iterative │ modification │
│ adaptation │ from trajectory │ improvement │ by agent │
│ │ storage │ │ │
│ │ │ │ │
│ • Predictable │ • Learns from │ • Improves on │ • Autonomous │
│ • Consistent │ successes │ failures │ improvement │
│ • No learning │ • Safe (read- │ • Multi-attempt │ • Risky if │
│ │ only context) │ solving │ unsupervised │
│ │ │ │ │
└─────────────────┴─────────────────┴─────────────────┴─────────────────┘
▲ ▲ ▲ ▲
│ │ │ │
   Most systems       Production           Research          Experimental
      today              ready             interest        (safety concerns)

1. Learning in Token Space
The simplest and most production-ready approach stores successful task completions in a vector database and retrieves them as few-shot examples when a similar task arrives. The model’s behavior changes without any weight updates — the learning is entirely encoded in the dynamically assembled prompt.
┌─────────────────────────────────────────────────────────────────┐
│ NEW TASK │
│ "Parse this JSON" │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────┐
│ EMBED TASK │
│ DESCRIPTION │
└─────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ TRAJECTORY STORE │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Task: "Parse XML file" Similarity: 0.72 │ │
│ │ Steps: read_file → parse → extract │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │ Task: "Extract data from JSON" Similarity: 0.91 ◄───┼───│
│ │ Steps: read_file → json.loads → filter_keys │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │ Task: "Convert CSV to dict" Similarity: 0.68 │ │
│ │ Steps: read_file → csv.reader → to_dict │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ DYNAMIC PROMPT │
│ System: You are a data processing assistant... │
│ │
│ Example 1: (from trajectory store) │
│ User: Extract data from JSON │
│ Assistant: read_file → json.loads → filter_keys │
│ │
│ Current task: │
│ User: Parse this JSON │
└─────────────────────────────────────────────────────────────────┘

import chromadb
from dataclasses import dataclass
from typing import Optional
import json
@dataclass
class Trajectory:
task: str
steps: list[dict] # {thought, action, observation}
outcome: dict
success: bool
class TokenSpaceLearner:
"""Learn from experience without updating model weights."""
def __init__(self, collection_name: str = "trajectories"):
self.client = chromadb.PersistentClient(path="./learning_db")
self.collection = self.client.get_or_create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"}
)
def store_trajectory(self, trajectory: Trajectory) -> None:
"""Store a successful trajectory for future reference."""
if not trajectory.success:
return # Only learn from successes
doc = f"Task: {trajectory.task}\n"
doc += f"Approach: {self._summarize_approach(trajectory.steps)}"
self.collection.add(
ids=[f"traj_{hash(trajectory.task)}_{len(self.collection.get()['ids'])}"],
documents=[doc],
metadatas=[{
"task": trajectory.task,
"steps": json.dumps(trajectory.steps),
"outcome": json.dumps(trajectory.outcome),
"success": trajectory.success
}]
)
def recall_similar(
self,
task: str,
k: int = 3,
min_similarity: float = 0.5
) -> list[Trajectory]:
"""Retrieve trajectories from similar past tasks."""
results = self.collection.query(
query_texts=[task],
n_results=k,
include=["metadatas", "distances"]
)
trajectories = []
for i, distance in enumerate(results['distances'][0]):
similarity = 1 - distance
if similarity < min_similarity:
continue
metadata = results['metadatas'][0][i]
trajectories.append(Trajectory(
task=metadata['task'],
steps=json.loads(metadata['steps']),
outcome=json.loads(metadata['outcome']),
success=metadata['success']
))
return trajectories
def build_few_shot_prompt(
self,
task: str,
system_message: str,
k: int = 3
) -> list[dict]:
"""Build a prompt with dynamic few-shot examples."""
examples = self.recall_similar(task, k=k)
messages = [{"role": "system", "content": system_message}]
for ex in examples:
messages.append({
"role": "user",
"content": f"Task: {ex.task}"
})
messages.append({
"role": "assistant",
"content": self._format_trajectory(ex.steps)
})
messages.append({
"role": "user",
"content": f"Task: {task}"
})
return messages
def _summarize_approach(self, steps: list[dict]) -> str:
actions = [s.get('action', '') for s in steps]
return " -> ".join(actions[:5])
def _format_trajectory(self, steps: list[dict]) -> str:
formatted = []
for step in steps:
formatted.append(f"Thought: {step.get('thought', '')}")
formatted.append(f"Action: {step.get('action', '')}")
if 'observation' in step:
formatted.append(f"Observation: {step['observation']}")
return "\n".join(formatted)
# Usage
learner = TokenSpaceLearner()
learner.store_trajectory(Trajectory(
task="Parse the JSON file and extract all email addresses",
steps=[
{"thought": "Need to read the file first", "action": "read_file('data.json')"},
{"thought": "Parse JSON and find emails", "action": "extract_emails(data)"},
],
outcome={"emails_found": 15},
success=True
))
prompt = learner.build_few_shot_prompt(
task="Extract phone numbers from the CSV file",
system_message="You are a data extraction assistant."
)
| Benefit | Description |
|---|---|
| No retraining | Learning happens through context, not weight updates |
| Immediate | New experiences are available for the next request |
| Interpretable | You can inspect exactly what examples were retrieved |
| Safe | Read-only operation; cannot corrupt the model |
| Domain-specific | Naturally adapts to your specific use cases over time |
Store only high-quality successful trajectories. A few excellent examples are better than many mediocre ones. Consider adding a quality gate before storing.
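One minimal way to add such a gate is a small wrapper that screens trajectories before they reach `store_trajectory`. The sketch below assumes the `TokenSpaceLearner` and `Trajectory` classes defined above; the specific heuristics and thresholds (`max_steps`, requiring a named action per step) are illustrative assumptions rather than fixed rules.

```python
# Hypothetical quality gate in front of TokenSpaceLearner.store_trajectory.
# The heuristics and thresholds below are illustrative assumptions.
def store_if_high_quality(
    learner: TokenSpaceLearner,
    trajectory: Trajectory,
    max_steps: int = 10,
) -> bool:
    """Store a trajectory only if it passes basic quality checks."""
    if not trajectory.success:
        return False  # never store failures
    if not trajectory.steps or len(trajectory.steps) > max_steps:
        return False  # reject empty or meandering runs
    if any(not step.get("action") for step in trajectory.steps):
        return False  # every step should name a concrete action
    learner.store_trajectory(trajectory)
    return True
```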
2. Reflexion
Reflexion (Shinn et al., 2023) enables agents to learn from failures through self-reflection. Instead of failing and moving on, the agent analyzes what went wrong, generates a structured reflection, and retries with that insight in context. Successful reflections are also stored in long-term memory and brought forward to inform future tasks of a similar type.
┌─────────────┐
│ TASK │
└──────┬──────┘
│
┌───────────────┼───────────────┐
│ ▼ │
│ ┌───────────────┐ │
│ │ ACTOR │ │
│ │ (Generate │ │
│ │ Trajectory) │ │
│ └───────┬───────┘ │
│ │ │
│ ▼ │
│ ┌───────────────┐ │
│ │ EVALUATOR │ │
│ │ (Check if │ │
│ │ Correct) │ │
│ └───────┬───────┘ │
│ │ │
│ ┌───────┴───────┐ │
│ │ │ │
│ Success Failure │
│ │ │ │
│ ▼ ▼ │
│ ┌─────┐ ┌───────────┐ │
│ │DONE │ │ REFLECTOR │ │
│ └─────┘ │ │ │
│ │ "What went│ │
│ │ wrong?" │ │
│ └─────┬─────┘ │
│ │ │
│ ▼ │
│ ┌───────────┐ │
│ │ MEMORY │ │
│ │(Reflections) │
│ └─────┬─────┘ │
│ │ │
└────────────────────┘ │
(retry with │
reflections) │
│
┌──────────────────────────────────────┘
│
▼
┌─────────────┐
│ LONG-TERM │
│ MEMORY │
│ (Learnings) │
└─────────────┘

from dataclasses import dataclass, field
from typing import Callable
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from pydantic import BaseModel, Field
class ReflectionOutput(BaseModel):
what_went_wrong: str = Field(description="What went wrong in this attempt")
why_it_failed: str = Field(description="Root cause of the failure")
improvements: list[str] = Field(description="Specific improvements for next attempt")
@dataclass
class Reflection:
task: str
attempt: int
trajectory: str
outcome: str
what_went_wrong: str
why_it_failed: str
improvements: list[str]
@dataclass
class ReflexionAgent:
"""Agent that learns from self-reflection on failures."""
llm: ChatOpenAI = field(default_factory=lambda: ChatOpenAI(model="gpt-4"))
short_term_memory: list[Reflection] = field(default_factory=list)
long_term_memory: list[Reflection] = field(default_factory=list)
def solve(
self,
task: str,
max_attempts: int = 3,
evaluator: Callable = None
) -> tuple[str, bool]:
"""Attempt to solve task with reflection on failures."""
self.short_term_memory = []
for attempt in range(max_attempts):
trajectory = self._generate_trajectory(task, attempt)
success, errors = evaluator(trajectory) if evaluator else (False, [])
if success:
reflection = self._generate_reflection(task, attempt, trajectory, "SUCCESS", [])
self.long_term_memory.append(reflection)
return trajectory, True
reflection = self._generate_reflection(task, attempt, trajectory, "FAILURE", errors)
self.short_term_memory.append(reflection)
return self._select_best_attempt(), False
def _generate_trajectory(self, task: str, attempt: int) -> str:
"""Generate a solution attempt using LangChain."""
messages = [("system", self._build_system_prompt())]
if self.long_term_memory:
learnings = self._format_learnings(self.long_term_memory[-5:])
messages.append(("system", f"Learnings from past tasks:\n{learnings}"))
if self.short_term_memory:
reflections = self._format_reflections(self.short_term_memory)
messages.append(("user", f"Previous attempts and reflections:\n{reflections}"))
messages.append(("user", f"Task: {task}"))
prompt = ChatPromptTemplate.from_messages(messages)
chain = prompt | self.llm
response = chain.invoke({})
return response.content
def _generate_reflection(
self, task: str, attempt: int, trajectory: str, outcome: str, errors: list[str]
) -> Reflection:
"""Generate structured reflection using LangChain JSON parser."""
parser = JsonOutputParser(pydantic_object=ReflectionOutput)
prompt = ChatPromptTemplate.from_messages([
("system", "Analyze this attempt and generate a reflection."),
("user", """Task: {task}
Attempt #{attempt}:
{trajectory}
Outcome: {outcome}
Errors: {errors}
{format_instructions}""")
])
chain = prompt | self.llm | parser
data = chain.invoke({
"task": task,
"attempt": attempt + 1,
"trajectory": trajectory,
"outcome": outcome,
"errors": ', '.join(errors) if errors else 'None',
"format_instructions": parser.get_format_instructions()
})
return Reflection(
task=task, attempt=attempt, trajectory=trajectory, outcome=outcome,
what_went_wrong=data.get("what_went_wrong", ""),
why_it_failed=data.get("why_it_failed", ""),
improvements=data.get("improvements", [])
)
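
    # Minimal sketches of the helper methods referenced above. These are assumed
    # implementations added for completeness, not part of the original listing.
    def _build_system_prompt(self) -> str:
        return "You are a problem-solving agent. Think step by step and produce a complete solution."

    def _format_learnings(self, reflections: list[Reflection]) -> str:
        # Summarize long-term learnings as bullet points of improvements that worked.
        lines = []
        for r in reflections:
            lines.extend(f"- {imp}" for imp in r.improvements)
        return "\n".join(lines) if lines else "(none yet)"

    def _format_reflections(self, reflections: list[Reflection]) -> str:
        # Show each failed attempt alongside its diagnosis so the next attempt can avoid it.
        blocks = []
        for r in reflections:
            blocks.append(
                f"Attempt {r.attempt + 1}: {r.what_went_wrong}\n"
                f"Why it failed: {r.why_it_failed}\n"
                f"Improvements: {'; '.join(r.improvements)}"
            )
        return "\n\n".join(blocks)

    def _select_best_attempt(self) -> str:
        # Without a finer-grained score than pass/fail, fall back to the most recent attempt.
        return self.short_term_memory[-1].trajectory if self.short_term_memory else ""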
# Usage
agent = ReflexionAgent()
def code_evaluator(trajectory: str) -> tuple[bool, list[str]]:
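    # Caution: exec() on model output is unsafe outside a sandbox; see the
    # sandboxing guidance in the self-evolving agents section below.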
try:
exec(trajectory)
return True, []
except Exception as e:
return False, [str(e)]
solution, success = agent.solve(
task="Write a function to find the nth Fibonacci number",
evaluator=code_evaluator
)
A good reflection includes four elements: a specific description of what went wrong (not vague), a root cause analysis of why it failed, a concrete alternative approach to try next, and a generalizable insight that might apply to related tasks in the future.
Poor reflections, vague or non-actionable, can hurt performance rather than help, so the reflector prompt matters as much as the loop structure: it must push the model toward specific, actionable insights that drive improvement on the next attempt.
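As a rough illustration, the reflector's instructions can spell those four elements out explicitly. The prompt below is an assumed example that could stand in for the short system message used in `_generate_reflection` above; it is not taken from the Reflexion paper.

```python
# Illustrative reflector system prompt enforcing the four elements of a useful
# reflection. The wording is an assumption, not the prompt from Shinn et al. (2023).
REFLECTOR_SYSTEM_PROMPT = """You are analyzing a failed attempt at a task.
Produce a reflection with exactly four parts:
1. What went wrong: name the specific step or output that failed.
2. Why it failed: the root cause, not a restatement of the error message.
3. What to try next: one concrete alternative approach for the next attempt.
4. General lesson: an insight that would apply to similar tasks in the future.
Avoid vague advice such as "be more careful" or "try harder"."""
```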
3. Self-Evolving Agents
Self-evolving agents modify their own behavior. This is an active research area with significant safety considerations. Use with extreme caution in production systems.
The most advanced form of agent learning involves agents that modify their own prompts, generate new tools, or write and execute new code based on observed performance. Self-critique loops are the safest variant — the agent revises its output within a single session without persisting any changes. Prompt evolution is more consequential: the agent updates its system prompt based on failure patterns, and those updates persist across sessions. Tool generation is riskier still, requiring sandboxed execution of LLM-generated code. Architecture evolution, where the agent modifies its own structure, remains highly experimental.
| Approach | What Evolves | Safety Level |
|---|---|---|
| Self-Critique (SCA) | Output quality through revision | Safe (no persistent changes) |
| Prompt Evolution | System prompts based on performance | Moderate (prompts can drift) |
| Tool Generation | New tools and functions | Risky (code execution) |
| Architecture Evolution | Agent structure itself | Highly experimental |
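A minimal sketch of the safest variant in the table, a single-session self-critique loop, is shown below. It reuses the LangChain `ChatOpenAI` client from the earlier examples; the prompts, round limit, and the `DONE` stopping convention are illustrative assumptions.

```python
from langchain_openai import ChatOpenAI

def self_critique(task: str, max_rounds: int = 2) -> str:
    """Draft, critique, and revise an answer within a single session.

    Nothing persists between calls, which is what keeps this variant safe.
    Prompts, round limit, and the DONE convention are illustrative assumptions.
    """
    llm = ChatOpenAI(model="gpt-4")
    draft = llm.invoke(f"Task: {task}\nProduce your best answer.").content
    for _ in range(max_rounds):
        critique = llm.invoke(
            f"Task: {task}\n\nDraft answer:\n{draft}\n\n"
            "List concrete problems with this draft, or reply DONE if none remain."
        ).content
        if critique.strip().upper().startswith("DONE"):
            break
        draft = llm.invoke(
            f"Task: {task}\n\nDraft answer:\n{draft}\n\nCritique:\n{critique}\n\n"
            "Rewrite the answer, addressing every point in the critique."
        ).content
    return draft
```

Because nothing is written back to the system prompt, the tool set, or any persistent store, a loop like this carries none of the drift risks addressed by the guidelines below.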
Never execute LLM-generated code without sandboxing. Use restricted execution environments with no filesystem or network access.
If evolving prompts, maintain a full history. Prompt drift can lead to unexpected behavior that is hard to diagnose weeks later.
For any persistent changes — new tools, modified prompts — require human approval before deployment into production.
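To make the prompt-history guideline concrete, a minimal append-only log of prompt revisions might look like the sketch below; the JSON-lines file format and field names are assumptions.

```python
import json
from datetime import datetime, timezone

class PromptHistory:
    """Append-only log of prompt revisions, so drift can be traced and rolled back.

    A minimal sketch; the JSON-lines format and fields are assumptions.
    """
    def __init__(self, path: str = "prompt_history.jsonl"):
        self.path = path

    def record(self, prompt: str, reason: str) -> None:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "reason": reason,  # e.g. which failure pattern triggered the change
            "prompt": prompt,
        }
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")
```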
Choosing an Approach
| Factor | Token Space | Reflexion | Self-Evolving |
|---|---|---|---|
| Complexity | Low | Medium | High |
| Safety | High | High | Low |
| Latency Impact | Minimal | 2-3x per task | Variable |
| Best For | Routine tasks | Complex reasoning | Research |
| Production Ready | Yes | Yes (with limits) | No |
Begin with token space learning. It is safe, immediately effective, and straightforward to implement with any vector database. Add Reflexion for tasks that frequently fail on the first attempt. Reserve self-evolution for controlled research environments.
Evaluation Metrics
| Metric | What it Measures | Applies To |
|---|---|---|
| Learning Curve | Performance improvement over tasks | All approaches |
| Sample Efficiency | Tasks needed to reach performance level | Token space, Reflexion |
| Reflection Quality | Actionability of generated reflections | Reflexion |
| Retry Reduction | Fewer attempts needed over time | Reflexion |
| Transfer Learning | Performance on related but new tasks | All approaches |
| Stability | Variance in performance over time | Self-evolving |
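As a small sketch of how the first of these might be tracked, assuming each completed task is logged as a chronologically ordered boolean outcome, a rolling success rate gives a simple learning curve; the window size is an arbitrary choice.

```python
def learning_curve(outcomes: list[bool], window: int = 20) -> list[float]:
    """Rolling success rate over consecutive tasks.

    `outcomes` is assumed to be a chronologically ordered list of task successes;
    an upward-trending curve suggests the agent is benefiting from stored
    trajectories or reflections.
    """
    curve = []
    for i in range(window, len(outcomes) + 1):
        recent = outcomes[i - window:i]
        curve.append(sum(recent) / window)
    return curve
```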