Agentic RAG
Beyond simple retrieve-then-generate: agents that intelligently decide when, what, and how to retrieve, then critique and correct their own retrieval.
The RAG Evolution
RAG Architecture Evolution
BASIC RAG      AGENTIC RAG    SELF-RAG       CORRECTIVE RAG
────────────── ────────────── ────────────── ──────────────
   Query          Query          Query          Query
     │              │              │              │
     ▼              ▼              ▼              ▼
┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│ ALWAYS  │    │ DECIDE  │    │ DECIDE  │    │ ALWAYS  │
│RETRIEVE │    │IF NEEDED│    │IF NEEDED│    │RETRIEVE │
└────┬────┘    └────┬────┘    └────┬────┘    └────┬────┘
     │              │              │              │
     ▼              ▼              ▼              ▼
┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│ Vector  │    │Multiple │    │Retrieve │    │  GRADE  │
│ Search  │    │  Tools  │    │ + Grade │    │  EACH   │
└────┬────┘    └────┬────┘    │Relevance│    │DOCUMENT │
     │              │         └────┬────┘    └────┬────┘
     │              │              │              │
     │              │              ▼         ┌────┴────┐
     │              │         ┌─────────┐    │ CORRECT │
     │              │         │Generate │    │ AMBIG.  │
     │              │         │ + Self- │    │ INCORR. │
     │              │         │Critique │    └────┬────┘
     ▼              ▼         └────┬────┘         │
┌─────────┐    ┌─────────┐         │              ▼
│GENERATE │    │GENERATE │         ▼         ┌─────────┐
└─────────┘    └─────────┘    ┌─────────┐    │GENERATE │
                              │ Revise  │    └─────────┘
                              │if Needed│
                              └─────────┘

GRAPH RAG
──────────────
   Query
     │
     ├──────────────┐
     ▼              ▼
┌─────────┐    ┌─────────┐
│ Extract │    │ Vector  │
│Entities │    │ Search  │
└────┬────┘    └────┬────┘
     │              │
     ▼              │
┌─────────┐         │
│  Graph  │         │
│Traversal│         │
└────┬────┘         │
     │              │
     └──────┬───────┘
            ▼
       ┌─────────┐
       │ COMBINE │
       │ Context │
       └────┬────┘
            ▼
       ┌─────────┐
       │GENERATE │
       └─────────┘

| Approach | When to Retrieve | Quality Control | Best For |
|---|---|---|---|
| Basic RAG | Always | None | Simple Q&A |
| Agentic RAG | Agent decides | Tool selection | Varied queries |
| Self-RAG | Agent decides | Self-critique | Accuracy critical |
| Corrective RAG | Always | Grade + correct | Noisy retrieval |
| Graph RAG | Always (dual) | Structured + semantic | Entity-rich domains |
1. Basic RAG (Baseline)
The simplest RAG architecture: always retrieve, then generate. It makes no judgment about whether retrieval is needed or whether the retrieved documents are relevant.
Basic RAG Implementation
# Basic RAG: Always retrieve, then generate
function basicRAG(query, vectorStore, llm):
    # Step 1: Embed the query
    queryEmbedding = embedModel.encode(query)

    # Step 2: Retrieve relevant documents
    documents = vectorStore.search(
        embedding: queryEmbedding,
        topK: 5
    )

    # Step 3: Build context from retrieved docs
    context = formatDocuments(documents)

    # Step 4: Generate response with context
    prompt = """
    Use the following context to answer the question.
    If the context doesn't contain the answer, say "I don't know."

    Context:
    {context}

    Question: {query}
    """
    response = llm.generate(prompt)
    return response

# Limitations:
# - Always retrieves, even for simple questions
# - No quality check on retrieved documents
# - Single retrieval pass (may miss info)
# - No reasoning about what to retrieve

# Python implementation (LangChain)
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA


def create_basic_rag(documents: list[str], collection_name: str = "docs"):
    """Create a basic RAG pipeline."""
    # Initialize embeddings
    embeddings = OpenAIEmbeddings()

    # Create vector store
    vectorstore = Chroma.from_texts(
        texts=documents,
        embedding=embeddings,
        collection_name=collection_name,
    )

    # Create retriever
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5},
    )

    # Create QA chain
    llm = ChatOpenAI(model="gpt-4", temperature=0)
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # Simple: stuff all docs into context
        retriever=retriever,
        return_source_documents=True,
    )
    return qa_chain


# Usage (my_documents is assumed to be a list of strings defined elsewhere)
qa = create_basic_rag(my_documents)
result = qa.invoke({"query": "What is the refund policy?"})
print(result["result"])
print(f"Sources: {[doc.metadata for doc in result['source_documents']]}")

// C# implementation (Microsoft.Extensions.AI)
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.VectorData;
using Microsoft.SemanticKernel.Connectors.InMemory;
using OpenAI;

public class BasicRagPipeline
{
    private readonly IVectorStore _vectorStore;
    private readonly IChatClient _chatClient;
    private const string CollectionName = "documents";

    public BasicRagPipeline(string apiKey)
    {
        var openAI = new OpenAIClient(apiKey);

        // Initialize chat client
        _chatClient = openAI
            .GetChatClient("gpt-4o")
            .AsIChatClient();

        // Initialize vector store with embeddings
        var embeddingClient = openAI
            .GetEmbeddingClient("text-embedding-3-small")
            .AsIEmbeddingGenerator<string, Embedding<float>>();
        _vectorStore = new InMemoryVectorStore();
    }

    public async Task IndexDocumentsAsync(
        IEnumerable<(string Id, string Text)> documents,
        CancellationToken ct = default)
    {
        var collection = _vectorStore.GetCollection<string, DocumentRecord>(CollectionName);
        await collection.CreateCollectionIfNotExistsAsync(ct);

        foreach (var (id, text) in documents)
        {
            var record = new DocumentRecord { Id = id, Text = text };
            await collection.UpsertAsync(record, ct);
        }
    }

    public async Task<string> QueryAsync(
        string query,
        int topK = 5,
        CancellationToken ct = default)
    {
        // Step 1: Retrieve relevant documents
        var collection = _vectorStore.GetCollection<string, DocumentRecord>(CollectionName);
        var results = await collection.VectorizedSearchAsync(query, topK, ct);

        // Step 2: Build context
        var context = string.Join("\n\n", results.Select(r => r.Record.Text));

        // Step 3: Generate response
        var prompt = $@"Use the following context to answer the question.
If the context doesn't contain the answer, say ""I don't know.""
Context:
{context}
Question: {query}";
        var response = await _chatClient.GetResponseAsync(prompt, ct);
        return response.Text;
    }
}

public class DocumentRecord
{
    [VectorStoreRecordKey]
    public string Id { get; set; } = "";

    [VectorStoreRecordData]
    public string Text { get; set; } = "";

    [VectorStoreRecordVector(1536)]
    public ReadOnlyMemory<float> Embedding { get; set; }
}

Limitations
Basic RAG retrieves for every query (even "hello"), uses whatever is retrieved regardless of quality, and only does one retrieval pass.
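Both limitations are easy to reproduce without a model. The sketch below is a toy stand-in rather than the pipeline shown earlier: `embed`, `cosine`, `retrieve`, and the three `DOCS` strings are hypothetical, and a bag-of-words count takes the place of a real embedding model.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, used only for illustration.
DOCS = [
    "refunds are issued within 30 days of purchase",
    "shipping takes 5 business days",
    "support is available by email",
]


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Basic RAG retrieval: always return the top-k documents, whatever their score."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    print(retrieve("hello", DOCS))
    print(retrieve("refunds within 30 days", DOCS, k=1))
```

Here `retrieve("hello", DOCS)` still returns two documents, each with a similarity of 0.0, and a basic pipeline would pass them to the model as context all the same.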
2. Agentic RAG
An agent with retrieval tools that decides when retrieval is needed, which tool to use, and what query to formulate:
Agentic RAG Implementation
# Agentic RAG: Agent decides when and what to retrieve
class AgenticRAG:
    tools: [
        searchDocuments(query, filters),  # Vector search
        lookupEntity(entityName),         # Knowledge base lookup
        webSearch(query),                 # External search
        noRetrieval()                     # Answer from knowledge
    ]

    function answer(question):
        history = []
        hasAnswer = false
        steps = 0

        # Agent reasons about retrieval strategy.
        # MAX_STEPS is an assumed constant that bounds the loop.
        while not hasAnswer and steps < MAX_STEPS:
            steps += 1
            thought = llm.reason(
                question: question,
                previousSteps: history,
                availableTools: tools
            )

            if thought.needsRetrieval:
                # Agent formulates retrieval query (may differ from question)
                retrievalQuery = thought.formulatedQuery
                results = executeRetrieval(thought.selectedTool, retrievalQuery)

                # Agent evaluates results
                evaluation = llm.evaluate(
                    question: question,
                    retrievedInfo: results
                )

                if evaluation.sufficient:
                    # Keep the results so the final answer can use them
                    history.append(results)
                    hasAnswer = true
                elif evaluation.needsMoreInfo:
                    # Refine and retrieve again
                    history.append(results)
                else:
                    # Try different retrieval strategy
                    continue
            else:
                # Agent can answer without retrieval
                hasAnswer = true

        return llm.generateAnswer(question, history)

# Key differences from basic RAG:
# 1. Agent DECIDES whether to retrieve
# 2. Agent FORMULATES the retrieval query
# 3. Agent EVALUATES retrieval results
# 4. Agent can do MULTIPLE retrieval rounds
# 5. Agent can use DIFFERENT retrieval tools

# Python implementation (LangGraph)
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from typing import TypedDict, Annotated
import operator

# vector_store, web_search_api, and knowledge_base are assumed to be
# initialized elsewhere in the application.


# Define state
class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    retrieved_docs: list
    needs_retrieval: bool


# Define tools
@tool
def search_documents(query: str, max_results: int = 5) -> str:
    """Search the document store for relevant information."""
    results = vector_store.similarity_search(query, k=max_results)
    return "\n".join([doc.page_content for doc in results])


@tool
def search_web(query: str) -> str:
    """Search the web for current information."""
    # Implementation with web search API
    return web_search_api.search(query)


@tool
def lookup_entity(entity_name: str) -> str:
    """Look up specific entity in knowledge base."""
    return knowledge_base.get(entity_name, "Entity not found")


# Create the agent
llm = ChatOpenAI(model="gpt-4").bind_tools([
    search_documents, search_web, lookup_entity
])


def should_retrieve(state: AgentState) -> str:
    """Decide if we need to retrieve or can answer."""
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "retrieve"
    return "answer"


def call_model(state: AgentState) -> dict:
    """Have the agent reason about what to do."""
    messages = state["messages"]

    # Add system prompt for RAG behavior
    system = """You are a helpful assistant with access to retrieval tools.
IMPORTANT: Before answering, consider:
1. Can you answer this from your knowledge? If yes, just respond.
2. Does this need current/specific information? Use search_web.
3. Does this need document lookup? Use search_documents.
4. Is this about a specific entity? Use lookup_entity.
Be strategic about retrieval - don't retrieve if unnecessary."""
    response = llm.invoke([{"role": "system", "content": system}] + messages)
    return {"messages": [response]}


def generate_answer(state: AgentState) -> dict:
    """Generate final answer based on retrieved context."""
    messages = state["messages"]
    docs = state.get("retrieved_docs", [])
    if docs:
        context = "\n".join(docs)
        messages = messages + [
            {"role": "system", "content": f"Context from retrieval:\n{context}"}
        ]
    response = llm.invoke(messages)
    return {"messages": [response]}


# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("agent", call_model)
workflow.add_node("retrieve", ToolNode([search_documents, search_web, lookup_entity]))
workflow.add_node("answer", generate_answer)
workflow.set_entry_point("agent")
workflow.add_conditional_edges("agent", should_retrieve, {
    "retrieve": "retrieve",
    "answer": "answer"
})
workflow.add_edge("retrieve", "agent")  # Loop back after retrieval
workflow.add_edge("answer", END)
app = workflow.compile()

# Usage
result = app.invoke({
    "messages": [{"role": "user", "content": "What's our Q3 revenue?"}],
    "retrieved_docs": [],
    "needs_retrieval": False
})

// C# implementation (Microsoft Agent Framework)
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.VectorData;
using System.ComponentModel;
using OpenAI;

// IWebSearchService and IKnowledgeBase are hypothetical application interfaces,
// declared elsewhere, so that the tool methods below have something to call.
public class AgenticRagPipeline
{
    private readonly AIAgent _agent;
    private readonly IVectorStore _vectorStore;
    private readonly IWebSearchService _webSearchService;
    private readonly IKnowledgeBase _knowledgeBase;

    public AgenticRagPipeline(
        string apiKey,
        IVectorStore vectorStore,
        IWebSearchService webSearchService,
        IKnowledgeBase knowledgeBase)
    {
        _vectorStore = vectorStore;
        _webSearchService = webSearchService;
        _knowledgeBase = knowledgeBase;

        var chatClient = new OpenAIClient(apiKey)
            .GetChatClient("gpt-4o")
            .AsIChatClient();

        // Create agent with retrieval tools
        _agent = chatClient.CreateAIAgent(
            name: "RAGAgent",
            instructions: @"You are a helpful assistant with retrieval tools.
Before answering, consider:
1. Can you answer from knowledge? Just respond.
2. Need current info? Use SearchWeb.
3. Need document lookup? Use SearchDocuments.
4. About specific entity? Use LookupEntity.
Be strategic - don't retrieve unnecessarily.",
            tools: [
                AIFunctionFactory.Create(SearchDocuments),
                AIFunctionFactory.Create(SearchWeb),
                AIFunctionFactory.Create(LookupEntity)
            ]
        );
    }

    [Description("Search documents for relevant information")]
    private async Task<string> SearchDocuments(
        [Description("Search query")] string query,
        [Description("Maximum results")] int maxResults = 5)
    {
        var collection = _vectorStore.GetCollection<string, DocumentRecord>("documents");
        var results = await collection.VectorizedSearchAsync(query, maxResults);
        return string.Join("\n\n", results.Select(r => r.Record.Text));
    }

    [Description("Search the web for current information")]
    private async Task<string> SearchWeb(
        [Description("Search query")] string query)
    {
        return await _webSearchService.SearchAsync(query);
    }

    [Description("Look up a specific entity in the knowledge base")]
    private async Task<string> LookupEntity(
        [Description("Entity name")] string entityName)
    {
        return await _knowledgeBase.GetAsync(entityName) ?? "Entity not found";
    }

    public async Task<string> QueryAsync(
        string question,
        CancellationToken ct = default)
    {
        var thread = _agent.GetNewThread();

        // Agent will automatically use tools as needed
        var response = await _agent.RunAsync(question, thread, ct);
        return response.Text;
    }
}

Key Capabilities
1. Retrieval Decision: the agent decides whether retrieval is needed
2. Query Formulation: the agent rewrites the query for better retrieval
3. Tool Selection: the agent chooses the right retrieval tool
4. Iterative Retrieval: the agent can retrieve multiple times
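The four capabilities can be seen together in a small control loop. This is a minimal sketch in which a keyword rule stands in for the LLM's decision; `Decision`, `run_agentic_loop`, `keyword_policy`, and the two lambda tools are hypothetical names, not part of LangGraph or any other library.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    tool: Optional[str]  # None means "answer without retrieval"
    query: str = ""


def run_agentic_loop(
    question: str,
    decide: Callable[[str, list], Decision],
    tools: dict,
    max_rounds: int = 3,
):
    """Gather evidence over at most max_rounds retrievals; return (evidence, trace)."""
    evidence, trace = [], []
    for _ in range(max_rounds):
        decision = decide(question, evidence)   # retrieval decision + tool selection
        if decision.tool is None:
            break                               # nothing (more) to retrieve
        results = tools[decision.tool](decision.query)  # query formulated by the policy
        trace.append((decision.tool, decision.query))
        evidence.extend(results)                # iterative: next round sees this evidence
    return evidence, trace


def keyword_policy(question: str, evidence: list) -> Decision:
    """Stand-in for the LLM: a real agent would reason over the question instead."""
    if evidence:
        return Decision(None)                   # stop once something was found
    q = question.lower()
    if any(word in q for word in ("hello", "thanks")):
        return Decision(None)                   # small talk needs no retrieval
    if "latest" in q or "today" in q:
        return Decision("search_web", q)
    return Decision("search_documents", q.rstrip("?"))


TOOLS = {
    "search_documents": lambda q: [f"doc hit for: {q}"],
    "search_web": lambda q: [f"web hit for: {q}"],
}

if __name__ == "__main__":
    print(run_agentic_loop("hello", keyword_policy, TOOLS))
    print(run_agentic_loop("What is the refund policy?", keyword_policy, TOOLS))
```

Swapping `keyword_policy` for a function that calls a model keeps the loop unchanged, which is the main reason for passing the policy in as a parameter.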
3. Self-RAG
Self-RAG (Asai et al., 2023) adds self-reflection: the model critiques its own retrieval decisions and generation quality:
Self-RAG Implementation
# Self-RAG: Model critiques its own retrieval and generation
class SelfRAG:
    function answer(question):
        # Step 1: Decide if retrieval is needed
        retrievalDecision = llm.generate(
            prompt: "Given this question, do I need to retrieve information? [Yes/No]",
            question: question
        )
        if retrievalDecision == "No":
            # Generate without retrieval
            response = llm.generate(question)
            return selfCritique(question, response, [])

        # Step 2: Retrieve documents
        documents = retrieve(question)

        # Step 3: For each document, assess relevance
        relevantDocs = []
        for doc in documents:
            isRelevant = llm.generate(
                prompt: "Is this document relevant to the question? [Relevant/Irrelevant]",
                question: question,
                document: doc
            )
            if isRelevant == "Relevant":
                relevantDocs.append(doc)

        # Step 4: Generate response with relevant docs
        response = llm.generate(
            prompt: question,
            context: relevantDocs
        )

        # Step 5: Self-critique the response
        return selfCritique(question, response, relevantDocs)

    function selfCritique(question, response, sources):
        # Check if response is supported by sources
        supportScore = llm.generate(
            prompt: "Is this response fully supported by the sources? [Fully/Partially/No]",
            response: response,
            sources: sources
        )

        # Check if response is useful (parse the 1-5 rating as a number)
        usefulnessScore = toInt(llm.generate(
            prompt: "How useful is this response? [5/4/3/2/1]",
            question: question,
            response: response
        ))

        if supportScore == "No" or usefulnessScore < 3:
            # Regenerate with feedback
            return regenerateWithCritique(question, response, supportScore, usefulnessScore)

        return {
            response: response,
            supported: supportScore,
            usefulness: usefulnessScore,
            sources: sources
        }

# Python implementation (OpenAI-style chat client)
from dataclasses import dataclass
from enum import Enum


class RetrievalDecision(Enum):
    YES = "yes"
    NO = "no"


class RelevanceScore(Enum):
    RELEVANT = "relevant"
    IRRELEVANT = "irrelevant"


class SupportScore(Enum):
    FULLY_SUPPORTED = "fully_supported"
    PARTIALLY_SUPPORTED = "partially_supported"
    NOT_SUPPORTED = "not_supported"


@dataclass
class SelfRAGResponse:
    answer: str
    sources: list[str]
    support_score: SupportScore
    usefulness_score: int
    retrieval_used: bool


class SelfRAG:
    def __init__(self, client, retriever, model: str = "gpt-4"):
        # client is an OpenAI-style chat client; retriever.search() is assumed
        # to return a list of document strings.
        self.client = client
        self.retriever = retriever
        self.model = model

    def _ask(self, prompt: str) -> str:
        """Send a single-turn prompt and return the reply text."""
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    def query(self, question: str) -> SelfRAGResponse:
        # Step 1: Decide if retrieval is needed
        retrieval_decision = self._decide_retrieval(question)
        if retrieval_decision == RetrievalDecision.NO:
            answer = self._generate_without_context(question)
            return self._self_critique(question, answer, [], False)

        # Step 2: Retrieve documents
        documents = self.retriever.search(question, k=5)

        # Step 3: Filter by relevance
        relevant_docs = self._filter_relevant(question, documents)
        if not relevant_docs:
            # Fall back to generation without context
            answer = self._generate_without_context(question)
            return self._self_critique(question, answer, [], True)

        # Step 4: Generate with relevant context
        answer = self._generate_with_context(question, relevant_docs)

        # Step 5: Self-critique
        return self._self_critique(question, answer, relevant_docs, True)

    def _decide_retrieval(self, question: str) -> RetrievalDecision:
        answer = self._ask(
            "Given this question, do you need to retrieve external information "
            "to answer accurately?\n"
            f"Question: {question}\n"
            "Consider:\n"
            "- Is this about specific facts, data, or recent events? -> Retrieve\n"
            "- Is this about general knowledge or reasoning? -> No retrieval\n"
            "- Is this about personal opinions or hypotheticals? -> No retrieval\n"
            "Answer with just: YES or NO"
        ).strip().upper()
        return RetrievalDecision.YES if "YES" in answer else RetrievalDecision.NO

    def _filter_relevant(self, question: str, documents: list[str]) -> list[str]:
        relevant = []
        for doc in documents:
            verdict = self._ask(
                "Is this document relevant to answering the question?\n"
                f"Question: {question}\n"
                f"Document: {doc[:500]}...\n"
                "Answer with just: RELEVANT or IRRELEVANT"
            ).strip().upper()
            # "IRRELEVANT" contains "RELEVANT", so a substring test would
            # accept every document; check the start of the reply instead.
            if verdict.startswith("RELEVANT"):
                relevant.append(doc)
        return relevant

    def _generate_without_context(self, question: str) -> str:
        return self._ask(question)

    def _generate_with_context(self, question: str, docs: list[str]) -> str:
        context = "\n---\n".join(docs)
        return self._ask(f"Context:\n{context}\n\nQuestion: {question}")

    def _regenerate_with_feedback(
        self, question: str, answer: str, sources: list[str], feedback: str
    ) -> str:
        context = "\n---\n".join(sources)
        return self._ask(
            f"Context:\n{context}\n\nQuestion: {question}\n"
            f"Previous answer: {answer}\n"
            f"The previous answer was {feedback}. Write an improved answer."
        )

    def _self_critique(
        self,
        question: str,
        answer: str,
        sources: list[str],
        retrieval_used: bool
    ) -> SelfRAGResponse:
        # Check support
        support_score = self._check_support(answer, sources)

        # Check usefulness
        usefulness_score = self._check_usefulness(question, answer)

        # Regenerate if quality is low
        if support_score == SupportScore.NOT_SUPPORTED and sources:
            answer = self._regenerate_with_feedback(
                question, answer, sources, "not supported by sources"
            )
            support_score = self._check_support(answer, sources)

        if usefulness_score < 3:
            answer = self._regenerate_with_feedback(
                question, answer, sources, "not useful enough"
            )
            usefulness_score = self._check_usefulness(question, answer)

        return SelfRAGResponse(
            answer=answer,
            sources=sources,
            support_score=support_score,
            usefulness_score=usefulness_score,
            retrieval_used=retrieval_used
        )

    def _check_support(self, answer: str, sources: list[str]) -> SupportScore:
        if not sources:
            return SupportScore.FULLY_SUPPORTED  # No sources to contradict

        sources_text = "\n---\n".join(sources)
        content = self._ask(
            "Is this answer supported by the source documents?\n"
            f"Answer: {answer}\n"
            f"Sources:\n{sources_text}\n"
            "Respond with:\n"
            "- FULLY_SUPPORTED: All claims in the answer are backed by sources\n"
            "- PARTIALLY_SUPPORTED: Some claims are backed, others are not\n"
            "- NOT_SUPPORTED: The answer contradicts or goes beyond the sources"
        ).upper()
        # Check the negative label first so a reply such as
        # "NOT_SUPPORTED (not fully backed)" is not read as FULLY.
        if "NOT_SUPPORTED" in content:
            return SupportScore.NOT_SUPPORTED
        if "PARTIALLY" in content:
            return SupportScore.PARTIALLY_SUPPORTED
        if "FULLY" in content:
            return SupportScore.FULLY_SUPPORTED
        return SupportScore.NOT_SUPPORTED

    def _check_usefulness(self, question: str, answer: str) -> int:
        reply = self._ask(
            "Rate how useful this answer is for the question (1-5):\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            "5 = Perfectly answers the question\n"
            "4 = Good answer with minor gaps\n"
            "3 = Adequate but could be better\n"
            "2 = Partially helpful\n"
            "1 = Not helpful\n"
            "Respond with just the number."
        )
        try:
            return max(1, min(5, int(reply.strip()[0])))
        except (ValueError, IndexError):
            return 3  # Neutral default when the reply is not a number

Self-RAG Reflection Tokens
- [Retrieve]: Should I retrieve? (Yes/No)
- [IsRel]: Is this document relevant? (Relevant/Irrelevant)
- [IsSup]: Is the response supported? (Fully/Partially/No)
- [IsUse]: Is the response useful? (5/4/3/2/1)
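When the reflection tokens are emulated with prompts, as in this section, the judge's free-text reply has to be mapped back onto the allowed values before any decision is made. A minimal sketch, assuming the four tokens and value sets listed above; `REFLECTION_TOKENS`, `parse_reflection`, and `needs_revision` are hypothetical helper names.

```python
# Allowed values per reflection token, taken from the list above (lowercased).
REFLECTION_TOKENS = {
    "Retrieve": {"yes", "no"},
    "IsRel": {"relevant", "irrelevant"},
    "IsSup": {"fully", "partially", "no"},
    "IsUse": {"1", "2", "3", "4", "5"},
}


def parse_reflection(token: str, raw: str) -> str:
    """Normalize a judge reply to one of the token's allowed values, or raise."""
    value = raw.strip().lower().rstrip(".")
    if value not in REFLECTION_TOKENS[token]:
        raise ValueError(f"unexpected value {raw!r} for [{token}]")
    return value


def needs_revision(is_sup: str, is_use: str) -> bool:
    """Mirror the pseudocode rule: revise if unsupported or usefulness below 3."""
    return is_sup == "no" or int(is_use) < 3


if __name__ == "__main__":
    sup = parse_reflection("IsSup", " Partially. ")
    use = parse_reflection("IsUse", "2")
    print(sup, use, needs_revision(sup, use))
```

Matching the whole normalized reply, rather than searching for a substring, avoids reading "Irrelevant" as "Relevant".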
4. Corrective RAG (CRAG)
CRAG (Yan et al., 2024) focuses on evaluating and correcting retrieval quality before generation:
Corrective RAG Implementation
# Corrective RAG (CRAG): Evaluate and correct retrieval quality
class CorrectiveRAG:
    function answer(question):
        # Step 1: Initial retrieval
        documents = retrieve(question)

        # Step 2: Evaluate each document's relevance
        evaluations = []
        for doc in documents:
            score = evaluateRelevance(question, doc)
            evaluations.append({ doc: doc, score: score })

        # Step 3: Determine action based on evaluation
        relevantDocs = [e.doc for e in evaluations if e.score == "CORRECT"]
        ambiguousDocs = [e.doc for e in evaluations if e.score == "AMBIGUOUS"]
        irrelevantDocs = [e.doc for e in evaluations if e.score == "INCORRECT"]

        if allRelevant(evaluations):
            # All documents are relevant - use them directly
            action = "CORRECT"
            context = relevantDocs
        elif allIrrelevant(evaluations):
            # All documents are irrelevant - use web search
            action = "INCORRECT"
            webResults = webSearch(question)
            context = webResults
        else:
            # Mixed relevance - combine strategies
            action = "AMBIGUOUS"
            webResults = webSearch(question)
            context = relevantDocs + refineDocuments(question, ambiguousDocs) + webResults

        # Step 4: Generate with corrected context
        return generate(question, context)

    function evaluateRelevance(question, document):
        # Three-way classification
        prompt = """
        Evaluate if this document is relevant to the question.
        Question: {question}
        Document: {document}
        - CORRECT: Document directly helps answer the question
        - INCORRECT: Document is not relevant at all
        - AMBIGUOUS: Document is partially relevant or tangential
        Respond with: CORRECT, INCORRECT, or AMBIGUOUS
        """
        return llm.generate(prompt)

    function refineDocuments(question, documents):
        # Extract only the relevant portions
        refined = []
        for doc in documents:
            relevantParts = llm.extract(
                prompt: "Extract only the parts relevant to the question",
                question: question,
                document: doc
            )
            refined.append(relevantParts)
        return refined

# Python implementation (OpenAI-style chat client)
from dataclasses import dataclass
import json
from enum import Enum


class RelevanceGrade(Enum):
    CORRECT = "correct"      # Directly relevant
    INCORRECT = "incorrect"  # Not relevant
    AMBIGUOUS = "ambiguous"  # Partially relevant


class RetrievalAction(Enum):
    USE_RETRIEVED = "use_retrieved"
    USE_WEB = "use_web"
    COMBINE = "combine"


@dataclass
class GradedDocument:
    content: str
    grade: RelevanceGrade
    confidence: float


class CorrectiveRAG:
    # The grading step uses JSON mode (response_format), so the default model
    # must be one that supports it.
    def __init__(self, client, retriever, web_search, model: str = "gpt-4o"):
        self.client = client
        self.retriever = retriever
        self.web_search = web_search
        self.model = model

    def _chat(self, messages: list[dict], **kwargs) -> str:
        """Call the chat API and return the reply text."""
        response = self.client.chat.completions.create(
            model=self.model, messages=messages, **kwargs
        )
        return response.choices[0].message.content

    def query(self, question: str) -> str:
        # Step 1: Initial retrieval
        documents = self.retriever.search(question, k=5)

        # Step 2: Grade each document
        graded_docs = [
            self._grade_document(question, doc)
            for doc in documents
        ]

        # Step 3: Determine corrective action
        action = self._determine_action(graded_docs)

        # Step 4: Build context based on action
        if action == RetrievalAction.USE_RETRIEVED:
            # Drop documents graded INCORRECT even when most are usable
            usable = [d for d in graded_docs if d.grade != RelevanceGrade.INCORRECT]
            context = self._build_context_from_docs(usable)
        elif action == RetrievalAction.USE_WEB:
            web_results = self.web_search.search(question)
            context = self._format_web_results(web_results)
        else:  # COMBINE
            # Use correct docs + refined ambiguous + web
            correct_docs = [d for d in graded_docs if d.grade == RelevanceGrade.CORRECT]
            ambiguous_docs = [d for d in graded_docs if d.grade == RelevanceGrade.AMBIGUOUS]
            refined = self._refine_ambiguous(question, ambiguous_docs)
            web_results = self.web_search.search(question)
            context = (
                self._build_context_from_docs(correct_docs) +
                "\n" + refined +
                "\n" + self._format_web_results(web_results)
            )

        # Step 5: Generate answer
        return self._generate(question, context)

    def _build_context_from_docs(self, docs: list[GradedDocument]) -> str:
        return "\n---\n".join(d.content for d in docs)

    def _format_web_results(self, web_results) -> str:
        # Assumes web_search.search() returns an iterable of text snippets
        return "\n---\n".join(str(r) for r in web_results)

    def _grade_document(self, question: str, document: str) -> GradedDocument:
        content = self._chat(
            [{
                "role": "user",
                "content": (
                    "Grade this document's relevance to the question.\n"
                    f"Question: {question}\n"
                    f"Document:\n{document[:1000]}\n"
                    "Grades:\n"
                    "- CORRECT: Directly helps answer the question\n"
                    "- INCORRECT: Not relevant to the question\n"
                    "- AMBIGUOUS: Partially relevant or tangential\n"
                    'Respond with JSON: {"grade": "...", "confidence": 0.0-1.0, "reason": "..."}'
                ),
            }],
            response_format={"type": "json_object"},
        )
        data = json.loads(content)
        try:
            grade = RelevanceGrade(str(data.get("grade", "")).lower())
        except ValueError:
            grade = RelevanceGrade.AMBIGUOUS  # Unrecognized label: treat as uncertain
        return GradedDocument(
            content=document,
            grade=grade,
            confidence=data.get("confidence", 0.5)
        )

    def _determine_action(self, graded_docs: list[GradedDocument]) -> RetrievalAction:
        correct = sum(1 for d in graded_docs if d.grade == RelevanceGrade.CORRECT)
        incorrect = sum(1 for d in graded_docs if d.grade == RelevanceGrade.INCORRECT)
        total = len(graded_docs)
        if total == 0:
            return RetrievalAction.USE_WEB  # Nothing retrieved: fall back to web search
        if correct / total >= 0.6:
            return RetrievalAction.USE_RETRIEVED
        elif incorrect / total >= 0.8:
            return RetrievalAction.USE_WEB
        else:
            return RetrievalAction.COMBINE

    def _refine_ambiguous(
        self,
        question: str,
        ambiguous_docs: list[GradedDocument]
    ) -> str:
        if not ambiguous_docs:
            return ""
        docs_text = "\n---\n".join([d.content for d in ambiguous_docs])
        return self._chat([{
            "role": "user",
            "content": (
                "Extract only the parts of these documents that are relevant "
                "to the question.\n"
                f"Question: {question}\n"
                f"Documents:\n{docs_text}\n"
                "Return only the relevant excerpts, removing irrelevant content."
            ),
        }])

    def _generate(self, question: str, context: str) -> str:
        return self._chat([
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": question}
        ])

CRAG Decision Flow
          Retrieved Documents
                   │
                   ▼
         ┌───────────────────┐
         │    GRADE EACH     │
         │     DOCUMENT      │
         │                   │
         │     Correct?      │
         │    Incorrect?     │
         │    Ambiguous?     │
         └─────────┬─────────┘
                   │
     ┌─────────────┼───────────────┐
     │             │               │
     ▼             ▼               ▼
All Correct   All Incorrect      Mixed
     │             │               │
     ▼             ▼               ▼
 ┌───────┐     ┌───────┐     ┌───────────┐
 │  USE  │     │  WEB  │     │  COMBINE  │
 │ DOCS  │     │SEARCH │     │ Correct + │
 └───────┘     └───────┘     │ Refined + │
                             │    Web    │
                             └───────────┘

Key Innovation
CRAG's three-way grading (Correct/Incorrect/Ambiguous) enables nuanced handling: keeping good docs, discarding bad ones, and refining ambiguous ones.
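The keep/discard/refine rule and the resulting action can be written as two small pure functions. This is a sketch under stated assumptions: `Grade`, `choose_action`, and `select_context` are hypothetical names, the action follows the all-correct / all-incorrect / mixed split from the decision flow, and `refine` is any caller-supplied function that trims a document.

```python
from enum import Enum
from typing import Callable


class Grade(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    AMBIGUOUS = "ambiguous"


def choose_action(grades: list[Grade]) -> str:
    """Map per-document grades to one of: use_docs, web_search, combine."""
    if not grades or all(g is Grade.INCORRECT for g in grades):
        return "web_search"   # nothing usable was retrieved
    if all(g is Grade.CORRECT for g in grades):
        return "use_docs"
    return "combine"          # mixed grades


def select_context(
    graded: list[tuple[str, Grade]],
    refine: Callable[[str], str],
) -> list[str]:
    """Keep CORRECT docs, drop INCORRECT ones, and refine AMBIGUOUS ones."""
    kept = []
    for doc, grade in graded:
        if grade is Grade.CORRECT:
            kept.append(doc)
        elif grade is Grade.AMBIGUOUS:
            kept.append(refine(doc))
    return kept


if __name__ == "__main__":
    graded = [
        ("Refunds take 30 days.", Grade.CORRECT),
        ("Our office has a cafe.", Grade.INCORRECT),
        ("Refunds need a receipt. We also sell mugs.", Grade.AMBIGUOUS),
    ]
    first_sentence = lambda d: d.split(". ")[0].rstrip(".") + "."
    print(choose_action([g for _, g in graded]))
    print(select_context(graded, first_sentence))
```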
5. Graph RAG
Graph RAG combines vector search with knowledge graph traversal for structured + semantic retrieval:
Graph RAG Implementation
# Graph RAG: Combine knowledge graphs with vector retrieval
class GraphRAG:
    vectorStore: VectorDB     # For semantic search
    knowledgeGraph: Neo4j     # For structured relationships

    function answer(question):
        # Step 1: Extract entities from question
        entities = extractEntities(question)

        # Step 2: Retrieve from both sources
        # Vector retrieval for semantic similarity
        vectorResults = vectorStore.search(question, topK = 5)

        # Graph traversal for related entities, keyed by entity name
        graphResults = {}
        for entity in entities:
            # Find entity in graph
            node = knowledgeGraph.findNode(entity)
            if node:
                # Get related nodes (neighbors, paths)
                related = knowledgeGraph.traverse(
                    startNode: node,
                    maxDepth: 2,
                    relationTypes: ["related_to", "part_of", "caused_by"]
                )
                graphResults[entity] = related

        # Step 3: Combine and deduplicate
        combinedContext = merge(vectorResults, graphResults)

        # Step 4: Build structured context
        context = formatContext(
            semanticDocs: vectorResults,
            entityRelations: graphResults,
            entities: entities
        )

        # Step 5: Generate with structured knowledge
        return llm.generate(question, context)

    function extractEntities(question):
        # Use NER or LLM to extract entities
        return llm.generate(
            prompt: "Extract named entities (people, places, concepts) from: " + question
        )

    function formatContext(semanticDocs, entityRelations, entities):
        context = "## Relevant Documents\n"
        for doc in semanticDocs:
            context += "- " + doc.summary + "\n"
        context += "\n## Entity Relationships\n"
        for entity in entities:
            relations = entityRelations.get(entity, [])
            context += f"### {entity}\n"
            for rel in relations:
                context += f"- {rel.type}: {rel.target}\n"
        return context

# Python implementation (Neo4j + OpenAI-style chat client)
from neo4j import GraphDatabase
from dataclasses import dataclass
import json
import re


@dataclass
class EntityRelation:
    source: str
    relation: str
    target: str
    properties: dict


class GraphRAG:
    # Entity extraction uses JSON mode (response_format), so the default model
    # must be one that supports it.
    def __init__(
        self,
        client,
        vector_store,
        neo4j_uri: str,
        neo4j_auth: tuple,
        model: str = "gpt-4o"
    ):
        self.client = client
        self.vector_store = vector_store
        self.graph = GraphDatabase.driver(neo4j_uri, auth=neo4j_auth)
        self.model = model

    def _chat(self, messages: list[dict], **kwargs) -> str:
        """Call the chat API and return the reply text."""
        response = self.client.chat.completions.create(
            model=self.model, messages=messages, **kwargs
        )
        return response.choices[0].message.content

    def query(self, question: str) -> str:
        # Step 1: Extract entities
        entities = self._extract_entities(question)

        # Step 2: Vector retrieval
        vector_results = self.vector_store.similarity_search(question, k=5)

        # Step 3: Graph retrieval
        graph_results = self._graph_retrieval(entities)

        # Step 4: Build combined context
        context = self._build_context(
            question, entities, vector_results, graph_results
        )

        # Step 5: Generate answer
        return self._generate(question, context)

    def _extract_entities(self, question: str) -> list[str]:
        content = self._chat(
            [{
                "role": "user",
                "content": (
                    "Extract named entities from this question.\n"
                    "Include: people, organizations, products, concepts, locations.\n"
                    f"Question: {question}\n"
                    'Return as JSON: {"entities": ["entity1", "entity2", ...]}'
                ),
            }],
            response_format={"type": "json_object"},
        )
        data = json.loads(content)
        return data.get("entities", [])

    def _graph_retrieval(self, entities: list[str]) -> dict[str, list[EntityRelation]]:
        results = {}
        with self.graph.session() as session:
            for entity in entities:
                # Find entity and its relationships
                query = """
                MATCH (e)-[r]-(related)
                WHERE e.name =~ $pattern OR e.label =~ $pattern
                RETURN e.name as source,
                       type(r) as relation,
                       related.name as target,
                       properties(r) as props
                LIMIT 20
                """
                # Escape the entity so regex metacharacters in it are matched literally
                pattern = f"(?i).*{re.escape(entity)}.*"
                records = session.run(query, pattern=pattern)
                relations = [
                    EntityRelation(
                        source=record["source"],
                        relation=record["relation"],
                        target=record["target"],
                        properties=record["props"] or {}
                    )
                    for record in records
                ]
                if relations:
                    results[entity] = relations
        return results

    def _build_context(
        self,
        question: str,
        entities: list[str],
        vector_results: list,
        graph_results: dict[str, list[EntityRelation]]
    ) -> str:
        parts = []

        # Semantic documents
        if vector_results:
            parts.append("## Relevant Documents")
            for i, doc in enumerate(vector_results, 1):
                parts.append(f"{i}. {doc.page_content[:300]}...")

        # Entity relationships from graph
        if graph_results:
            parts.append("\n## Entity Knowledge Graph")
            for entity, relations in graph_results.items():
                parts.append(f"\n### {entity}")
                for rel in relations[:10]:  # Limit relations
                    parts.append(f"- {rel.relation} -> {rel.target}")
                    if rel.properties:
                        props = ", ".join(f"{k}={v}" for k, v in rel.properties.items())
                        parts.append(f"  ({props})")

        # Extracted entities for reference
        parts.append(f"\n## Detected Entities: {', '.join(entities)}")
        return "\n".join(parts)

    def _generate(self, question: str, context: str) -> str:
        return self._chat([
            {"role": "system", "content": f"""You have access to both document search results and a knowledge graph.
Use both sources to provide a comprehensive answer.
{context}"""},
            {"role": "user", "content": question}
        ])

    def close(self):
        self.graph.close()

When to Use Graph RAG
| Use Case | Why Graph RAG Helps |
|---|---|
| Multi-hop questions | Graph traversal connects related entities |
| Entity-rich domains | Structured relationships improve precision |
| Reasoning about relationships | "Who reports to whom?" needs graph structure |
| Combining structured + unstructured | Graph for facts, vectors for context |
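The multi-hop row is the easiest to see in code. The sketch below uses an in-memory adjacency map instead of Neo4j; `EDGES`, `build_graph`, and `traverse` are hypothetical, and the depth limit of 2 matches the `maxDepth` used in the pseudocode.

```python
from collections import deque

# Hypothetical reporting chain, used only for illustration.
EDGES = [
    ("alice", "reports_to", "bob"),
    ("bob", "reports_to", "carol"),
    ("carol", "reports_to", "dana"),
]


def build_graph(edges):
    """Build an adjacency map: source -> list of (relation, target)."""
    graph = {}
    for source, relation, target in edges:
        graph.setdefault(source, []).append((relation, target))
    return graph


def traverse(graph, start, max_depth=2):
    """Breadth-first walk returning (source, relation, target, hops) facts."""
    facts, seen, queue = [], {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the hop limit
        for relation, target in graph.get(node, []):
            facts.append((node, relation, target, depth + 1))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return facts


if __name__ == "__main__":
    for fact in traverse(build_graph(EDGES), "alice"):
        print(fact)
```

A question such as "Who is Alice's manager's manager?" needs the second fact, which a similarity search over text about Alice alone may never surface.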
Graph Construction
You can build knowledge graphs from documents using LLM-based entity extraction, or use existing structured data (databases, ontologies).
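Once an extraction step has produced (source, relation, target) triples, loading them usually comes down to one upsert statement per triple. The sketch below only builds the Cypher text and its parameters and does not connect to a database; `triple_to_cypher` and the `Entity` label are assumptions. Cypher does not accept a relationship type as a query parameter, so the relation name is sanitized and interpolated, while node names are passed as parameters.

```python
import re


def triple_to_cypher(source: str, relation: str, target: str):
    """Return (query, params) that upserts one triple using MERGE."""
    # Relationship types cannot be parameterized, so reduce the relation to
    # a safe identifier such as REPORTS_TO before interpolating it.
    rel_type = re.sub(r"\W+", "_", relation.strip()).strip("_").upper()
    if not rel_type or rel_type[0].isdigit():
        raise ValueError(f"cannot derive a relationship type from {relation!r}")
    query = (
        "MERGE (s:Entity {name: $source}) "
        "MERGE (t:Entity {name: $target}) "
        f"MERGE (s)-[:{rel_type}]->(t)"
    )
    return query, {"source": source, "target": target}


if __name__ == "__main__":
    query, params = triple_to_cypher("Alice", "reports to", "Bob")
    print(query)
    print(params)
```

Using `MERGE` rather than `CREATE` keeps the load idempotent, so re-running extraction over the same documents does not duplicate nodes or edges.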
Evaluation Metrics
| Metric | What it Measures | How to Calculate |
|---|---|---|
| Answer Accuracy | Is the answer correct? | Human eval or exact match |
| Faithfulness | Is answer grounded in retrieved docs? | NLI or LLM-as-judge |
| Relevance | Are retrieved docs relevant? | Precision@K, NDCG |
| Retrieval Efficiency | How often is retrieval needed? | % queries requiring retrieval |
| Latency | Time to answer | Wall clock time |
| Hallucination Rate | Unsupported claims in answer | Manual or NLI checking |
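The two ranking metrics in the Relevance row are short enough to compute by hand. A minimal sketch, assuming binary or graded relevance labels are already available per query; `precision_at_k` and `ndcg_at_k` are hypothetical helper names, and NDCG uses the common log2 discount.

```python
import math


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved document IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """Normalized discounted cumulative gain over the top-k results."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


if __name__ == "__main__":
    retrieved = ["d1", "d2", "d3", "d4"]
    print(precision_at_k(retrieved, {"d1", "d3"}, k=4))
    print(ndcg_at_k(retrieved, {"d1": 1.0, "d3": 1.0}, k=4))
```

Precision@K ignores order within the top K, while NDCG rewards placing relevant documents earlier, so the two can disagree about which retriever is better.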
Choosing an Approach
Use Basic RAG when:
- All queries need document lookup
- Simple, single-turn Q&A
- Latency is critical
Use Agentic RAG when:
- Queries vary (some need retrieval, some don't)
- Multiple retrieval sources available
- Complex multi-step reasoning needed
Use Self-RAG when:
- Accuracy is paramount
- You need to minimize hallucinations
- Quality > latency
Use Corrective RAG when:
- Retrieval quality varies
- Mixed document quality in corpus
- Web fallback is acceptable
Use Graph RAG when:
- Data has clear entity relationships
- Multi-hop reasoning required
- You have or can build a knowledge graph
Common Pitfalls
Over-retrieval
Retrieving for every query wastes tokens and can confuse the model with irrelevant context.
Chunk Size Mismatch
Too small: loses context. Too large: dilutes relevance. Tune chunk size to your use case.
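Chunk size is easiest to tune when it is an explicit parameter. The sketch below splits on whitespace and counts words; `chunk_words` is a hypothetical helper, and the `overlap` parameter is an assumption not discussed above, included because overlapping windows are a common way to soften the loss of context at chunk boundaries.

```python
def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `size` words, sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    text = "a b c d e f g"
    for size in (2, 4):
        print(size, chunk_words(text, size=size, overlap=1))
```

Running the same retrieval evaluation over a few candidate sizes is a more reliable guide than picking a size up front.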
Ignoring Retrieval Quality
Just because you retrieved 5 documents doesn't mean they're all useful. Grade and filter.
Single-Pass Retrieval
Complex questions often need multiple retrieval rounds with refined queries.
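A second retrieval round typically uses what the first round found to form a better query. The sketch below shows that loop with stubbed components: `iterative_retrieve` is a hypothetical helper, and `CORPUS`, `search`, `refine`, and `is_sufficient` are toy stand-ins for a vector store and for LLM calls.

```python
# Hypothetical two-document corpus for a two-hop question.
CORPUS = ["Report X was written by Alice.", "Alice reports to Bob."]


def search(query: str) -> list[str]:
    """Toy search: return corpus entries that mention every keyword in the query."""
    keywords = query.lower().split()
    return [doc for doc in CORPUS if all(k in doc.lower() for k in keywords)]


def refine(question: str, evidence: list[str]) -> str:
    """Stand-in for an LLM query rewriter."""
    if not evidence:
        return "report x"                           # first hop: find the author
    author = evidence[-1].rstrip(".").split()[-1]   # last word of the latest fact
    return f"{author} reports to"                   # second hop: find the manager


def is_sufficient(question: str, evidence: list[str]) -> bool:
    """Stand-in for an LLM sufficiency check."""
    return any("reports to" in doc for doc in evidence)


def iterative_retrieve(question, search, refine, is_sufficient, max_rounds=3):
    """Retrieve in rounds, rewriting the query from the evidence gathered so far."""
    evidence = []
    for round_no in range(1, max_rounds + 1):
        query = refine(question, evidence)
        for doc in search(query):
            if doc not in evidence:
                evidence.append(doc)
        if is_sufficient(question, evidence):
            return evidence, round_no
    return evidence, max_rounds


if __name__ == "__main__":
    question = "Who does the author of Report X report to?"
    print(iterative_retrieve(question, search, refine, is_sufficient))
```

The `max_rounds` cap matters in practice: without it, a question the corpus cannot answer would keep triggering retrieval.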