
Resources

Curated research papers, frameworks, protocols, and tools for the practitioner.


Updated Feb 22, 2026

How we built Agent Builder's memory system

LangChain describes its implementation of the memory system behind Agent Builder, covering the technical architecture and the rationale for prioritizing persistent memory in agent workflows.

LangChain Blog frameworks

Agent Observability Powers Agent Evaluation

LangChain argues that building reliable agents requires observability into agent reasoning, paired with systematic evaluation.

LangChain Blog evaluation

0-Days: Evaluating and mitigating the growing risk of LLM-discovered vulnerabilities

Claude Opus 4.6 demonstrates significant capability in finding high-severity vulnerabilities in well-tested codebases by reading and reasoning about code the way human researchers do. Anthropic has found over 500 high-severity vulnerabilities in open-source software using Claude.

@trq212 on X models

Context-Bench: A benchmark for agentic context engineering

Letta Research introduces Context-Bench, a benchmark measuring agents' ability to perform filesystem operations, trace entity relationships, and discover and load skills from libraries.

@Letta_AI on X evaluation

zeitzeuge — AI-Powered Performance Analysis for Web & Tests

A performance analysis tool that uses a LangChain Deep Agent to autonomously analyze V8 heap snapshots, Chrome runtime traces, and CPU profiles, then suggest code-level fixes.

@bromann on X frameworks

Programmatic tool calling

The Claude API introduces programmatic tool calling, allowing Claude to write Python code that calls tools inside a code execution container, reducing latency and token consumption for multi-tool workflows.

@RLanceMartin on X protocols
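
To make the mechanism concrete, here is a minimal conceptual sketch in plain Python, not the Claude API itself; the tool names and sandbox mechanics are illustrative assumptions. The point is that intermediate tool results stay inside the execution container, so only the final value re-enters the context window.

```python
def get_order(order_id: str) -> dict:
    # Hypothetical tool stub; in production this would proxy a real tool call.
    return {"id": order_id, "total": 42.0, "region": "EU"}

def get_refund_policy(region: str) -> str:
    # Second hypothetical tool.
    return f"Refunds in {region} are accepted within 30 days."

# A script as the model might write it: one round-trip instead of three.
model_script = """
order = get_order("A-1001")
policy = get_refund_policy(order["region"])
result = {"eligible": order["total"] < 100, "policy": policy}
"""

sandbox = {"get_order": get_order, "get_refund_policy": get_refund_policy}
exec(model_script, sandbox)   # intermediate tool outputs never reach the model
print(sandbox["result"])      # only this summary goes back into context
```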

Improved Web Search with Dynamic Filtering

Claude's web search and web fetch tools now automatically write and execute code to filter search results before they reach the context window, improving accuracy by 11% and reducing token usage by 24%.

@RLanceMartin on X tools
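
A rough sketch of what that filtering step amounts to (the field names and scoring here are assumptions, not Anthropic's implementation): cheap code runs over the raw result list so only plausibly relevant snippets are handed to the model.

```python
raw_results = [
    {"title": "LangGraph docs", "snippet": "Graph-based agent control flow..."},
    {"title": "Unrelated listicle", "snippet": "Top 10 vacation spots..."},
    {"title": "Agent memory post", "snippet": "Persisting agent preferences..."},
]

query_terms = {"agent", "graph", "memory"}

def score(result: dict) -> int:
    # Toy relevance score: count query terms appearing in title + snippet.
    text = (result["title"] + " " + result["snippet"]).lower()
    return sum(term in text for term in query_terms)

# Drop non-matches and rank the rest; only this subset enters the context window.
filtered = sorted((r for r in raw_results if score(r) > 0), key=score, reverse=True)
print(filtered)
```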

How to Use Memory in Agent Builder

LangChain's Agent Builder incorporates memory that retains user feedback, corrections, and preferences to improve agent performance over time.

LangChain Blog frameworks
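
The underlying pattern is a namespaced store of preferences and corrections that later runs consult. A minimal sketch using LangGraph's in-memory store; whether Agent Builder uses this exact class is an assumption:

```python
from langgraph.store.memory import InMemoryStore

store = InMemoryStore()
namespace = ("memories", "user-123")  # hypothetical per-user scope

# Persist corrections the user made during an earlier session.
store.put(namespace, "tone", {"preference": "Answer in terse bullet points."})
store.put(namespace, "timezone", {"preference": "Use Europe/Zurich for dates."})

# On the next run, load the namespace and fold it into the system prompt.
memories = [item.value["preference"] for item in store.search(namespace)]
system_prompt = "Known user preferences:\n" + "\n".join(f"- {m}" for m in memories)
print(system_prompt)
```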

IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

IBM's IT-Bench benchmarks agents on realistic IT automation tasks, while UC Berkeley's MAST provides a taxonomy of multi-agent failure modes; together they give teams a way to diagnose why enterprise agents fail.

Hugging Face Blog evaluation

Research Papers

ReAct: Synergizing Reasoning and Acting in Language Models
Yao et al. · ICLR 2023

Introduces the ReAct paradigm combining reasoning traces with actions.

Toolformer: Language Models Can Teach Themselves to Use Tools
Schick et al. · NeurIPS 2023

Demonstrates self-supervised tool use learning in LLMs.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei et al. · NeurIPS 2022

Foundational work on prompting LLMs for step-by-step reasoning.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Yao et al. · NeurIPS 2023

Extends CoT with exploration of multiple reasoning paths.

Generative Agents: Interactive Simulacra of Human Behavior
Park et al. · UIST 2023

Agents with memory for believable social simulation.

MemGPT: Towards LLMs as Operating Systems
Packer et al. · arXiv 2023

Hierarchical memory management for unbounded context.

Reflexion: Language Agents with Verbal Reinforcement Learning
Shinn et al. · NeurIPS 2023

Agents that learn from self-reflection and memory.

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Wu et al. · arXiv 2023

Framework for multi-agent conversation and collaboration.

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
Hong et al. · arXiv 2023

Role-based multi-agent system for software development.

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
Liang et al. · arXiv 2023

Multiple agents debate to improve reasoning quality.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Lewis et al. · NeurIPS 2020

Original RAG paper combining retrieval with generation.

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Asai et al. · arXiv 2023

Agents that decide when and what to retrieve.

Corrective Retrieval Augmented Generation
Yan et al. · arXiv 2024

Self-correcting retrieval with web search fallback.

From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Edge et al. · arXiv 2024

Knowledge graph-based RAG for complex queries.

Constitutional AI: Harmlessness from AI Feedback
Bai et al. · arXiv 2022

Self-supervision for safe AI behavior.

Red Teaming Language Models with Language Models
Perez et al. · EMNLP 2022

Automated red teaming for safety evaluation.

Frameworks

LangChain Python/JS

Framework for LLM-powered applications. Large ecosystem of integrations.

LangGraph Python

Stateful, multi-actor applications with LLMs. Graph-based control flow; a minimal example follows this list.

Microsoft Agent Framework Python/C#

Unifying AutoGen and Semantic Kernel for multi-agent workflows.
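
Since graph-based control flow is LangGraph's defining feature, here is a minimal sketch of it, with stand-in node functions rather than real LLM calls:

```python
from typing import TypedDict

from langgraph.graph import END, START, StateGraph

class State(TypedDict):
    question: str
    answer: str

def draft(state: State) -> dict:
    # Stand-in for an LLM call; nodes return partial state updates.
    return {"answer": f"Draft answer to: {state['question']}"}

def polish(state: State) -> dict:
    return {"answer": state["answer"].upper()}

builder = StateGraph(State)
builder.add_node("draft", draft)
builder.add_node("polish", polish)
builder.add_edge(START, "draft")
builder.add_edge("draft", "polish")
builder.add_edge("polish", END)

graph = builder.compile()
print(graph.invoke({"question": "What is MCP?"}))
```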


Protocols

Model Context Protocol (MCP)

Anthropic's open protocol for connecting AI with tools and data sources; a minimal server sketch follows this list.

Anthropic
Agent2Agent Protocol (A2A)

Google's protocol for agent-to-agent communication and discovery.

Google
Universal Commerce Protocol (UCP)

Open standard for agentic commerce from discovery to purchase.

Google + Shopify
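
For MCP above, the official Python SDK's FastMCP helper makes a server small enough to sketch here; the tool itself is a made-up example:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()  # serves over stdio, so clients like Claude Desktop can connect
```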

Evaluation Tools

DeepEval

Open-source evaluation framework for LLMs. Agent-specific metrics.

RAGAS

Evaluation framework for RAG applications. Component-level metrics.

Promptfoo

CLI tool for testing and evaluating prompts. CI/CD integration.

LangSmith

Platform for debugging, testing, and monitoring LLM applications.

Braintrust

Enterprise platform for AI product development. Evals and logging.