RAG Tutorial
10 RAG architectures explained with theory, visual flow diagrams, Python code, Langflow implementations, and QA testing strategies. Built for QA engineers and AI testing professionals.
QA Analogy: Like running a basic test suite — no parallelism, no retries, no smart ordering. It works, but it's the starting point, not the destination.
📄 Stage 1: Indexing
Documents are loaded, split into fixed-size chunks (500–1000 tokens with 200-token overlap), converted to embeddings via a model like text-embedding-3-small, and stored in a vector database like ChromaDB or Pinecone.
🔍 Stage 2: Retrieval
The user's query is embedded using the same embedding model. Cosine similarity finds the top-k most similar chunks in the vector store. Typically k=4 chunks are retrieved.
✍️ Stage 3: Generation
The retrieved chunks are concatenated with the original query and fed into an LLM as context. The LLM generates a grounded response using only the provided context.
⚠️ Limitations
Fixed chunking splits important context. No query optimization. No re-ranking. No self-correction. Single retrieval step cannot handle multi-hop reasoning. These are solved by more advanced RAG types.
| PARAMETER | TYPICAL VALUE | EFFECT OF CHANGING |
|---|---|---|
| chunk_size | 1000 tokens | Smaller = more precise but loses context; Larger = more context but noisier |
| chunk_overlap | 200 tokens | Higher overlap preserves context across boundaries but increases storage |
| k (top results) | 4 | Higher k = more context but risks exceeding the context window |
| embedding model | text-embedding-3-small | Larger models improve retrieval but cost more per call |
| temperature | 0 | Higher = more creative but less faithful to retrieved context |
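To see how `chunk_size` and `chunk_overlap` interact, here is a toy sketch (not LangChain's actual splitter, which prefers natural separators): the effective step between chunks is `chunk_size - overlap`, so halving the chunk size more than doubles the number of chunks to embed and store.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Toy fixed-size chunker: each new chunk starts chunk_size - overlap in."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 3000  # stand-in for a 3000-token document

small = chunk_text(doc, chunk_size=500, overlap=200)   # step = 300
large = chunk_text(doc, chunk_size=1000, overlap=200)  # step = 800

print(len(small), len(large))  # smaller chunks => more chunks to store
```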
INDEXING PIPELINE (top) + QUERY PIPELINE (bottom) — arrows show data flow direction
```python
# Step 1: Install dependencies
# pip install langchain chromadb openai langchain-community langchain-openai

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# ── INDEXING ──────────────────────────
loader = PyPDFLoader("test_documentation.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# ── QUERY & GENERATE ──────────────────
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)
response = qa_chain.invoke({"query": "What are the login test cases?"})
print(response["result"])
print("Sources:", [doc.metadata for doc in response["source_documents"]])
```
Create a New Flow
Open Langflow → New Flow → Name it "Naive RAG Pipeline". Choose "Blank Flow" to start from scratch.
Add File Loader
Drag a "File" component onto the canvas. Upload your test documentation (PDF, TXT, or DOCX).
Configure Text Splitter
Add "Recursive Character Text Splitter". Set chunk_size=1000, chunk_overlap=200. Connect File → Splitter.
Add Embeddings + Vector Store
Add "OpenAI Embeddings" (enter API key). Add "Chroma" vector store. Connect Splitter → Chroma, Embeddings → Chroma.
Add Retriever + Prompt
Add "Retriever" node, connect to Chroma (set k=4). Add "Prompt" template: Context: {context}\nQuestion: {question}.
Connect LLM + Output
Add "ChatOpenAI" → connect to Prompt. Add "Chat Output" → connect to ChatOpenAI. Add "Chat Input" for queries.
Test & Validate
Click Play → open chat panel → test with sample queries. Review execution trace to verify retrieval. Tune chunk_size and k based on results.
Empty knowledge base query: Send a query when no relevant docs are indexed. Verify the LLM does NOT hallucinate — it should say "I don't have information about this."
Multi-chunk spanning query: Ask a question whose answer spans 2+ chunks. Verify context continuity is preserved and the answer is complete.
Duplicate document test: Index the same document twice. Verify retrieval returns deduplicated results and doesn't inflate context.
Critical boundary split: Find chunks where critical info (like a code snippet) is split. Verify the overlap correctly preserves the context.
Embedding model change: Swap to a different embedding model. Verify re-indexing happens correctly and old embeddings are cleared — stale embeddings cause silent failures.
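The duplicate-document scenario above can be automated with a small helper. `dedupe_by_content` below is hypothetical glue code, not a library API; a real pipeline might dedupe on document IDs or content hashes rather than raw text.

```python
def dedupe_by_content(docs: list[str]) -> list[str]:
    """Drop exact-duplicate chunks while preserving retrieval order."""
    seen, unique = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            unique.append(d)
    return unique

# Simulate the duplicate-document scenario: same doc indexed twice
retrieved = ["login spec v1", "login spec v1", "password rules"]
print(dedupe_by_content(retrieved))  # duplicate chunk appears only once
```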
| STAGE | TECHNIQUE | PURPOSE |
|---|---|---|
| Pre-Retrieval | HyDE (Hypothetical Document Embedding) | Generate a hypothetical answer, use that for retrieval instead of raw query |
| Pre-Retrieval | Semantic Chunking | Split documents at semantic boundaries, not arbitrary token counts |
| Retrieval | Parent Document Retrieval | Match small child chunks, return full parent document for context |
| Post-Retrieval | Cohere Rerank / BGE | Re-score top-20 candidates, return only best 4 |
| Post-Retrieval | Contextual Compression | Strip irrelevant sentences from retrieved chunks |
```python
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# ── HyDE: Query → Hypothetical Doc → Better Retrieval ──
hyde_prompt = ChatPromptTemplate.from_template(
    "Write a short passage that would answer: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
hyde_chain = hyde_prompt | llm
hypothetical_doc = hyde_chain.invoke({"question": "How to test API rate limiting?"})
# Now use hypothetical_doc.content for embedding + retrieval

# ── Semantic Chunking ─────────────────────────────────
embeddings = OpenAIEmbeddings()
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
semantic_chunks = semantic_splitter.split_documents(documents)

# ── Cohere Re-ranking: top-20 → best 4 ───────────────
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
reranker = CohereRerank(model="rerank-english-v3.0", top_n=4)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever
)
results = compression_retriever.invoke("Edge cases for payment testing?")
print(f"Re-ranked: {len(results)} top documents returned")
```
Create Flow: "Advanced RAG with Reranking"
New flow in Langflow. Add File Loader → "Semantic Text Splitter" (set breakpoint_threshold_type to "percentile").
Over-fetch with Retriever k=20
Add Retriever node. Set k=20 to retrieve many candidates before re-ranking. Connect to Chroma vector store.
Add Cohere Rerank
Add "Cohere Rerank" component after retriever. Set top_n=4. Add your Cohere API key. Connect Retriever → Cohere.
Add HyDE Branch
Add a separate ChatOpenAI node that generates a hypothetical answer from the query. Feed that into the embedding step instead of the raw query.
Prompt + Output
Use prompt: "Use ONLY the following context... Context: {context}\nQuestion: {question}\nIf not in context, say so." → ChatOpenAI → Chat Output.
Reranking precision test: Run the same query with k=4 (no rerank) vs k=20+rerank. Measure retrieval precision — the reranked version should return more relevant docs.
HyDE degradation test: Test HyDE with ambiguous queries. For some queries, the hypothetical doc may drift semantically — verify it doesn't degrade retrieval quality.
Reranking latency test: Measure latency with/without Cohere reranking. Reranking adds ~200–500ms. Ensure this is within your SLA.
Semantic chunking edge cases: Test with code snippets, HTML tables, and mixed-language content. Verify semantic splitter doesn't break code blocks.
Parent-child retrieval accuracy: Verify that when a child chunk matches, the correct parent document is returned — not a random parent.
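For the reranking precision test, precision@k is straightforward to compute once you have a labeled set of relevant document IDs. The IDs and result lists below are illustrative stand-ins, not real retrieval output.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are actually relevant."""
    top = retrieved_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k

relevant = {"doc2", "doc5", "doc7", "doc9"}          # labeled ground truth
no_rerank   = ["doc1", "doc2", "doc3", "doc4"]       # plain k=4 retrieval
with_rerank = ["doc2", "doc5", "doc7", "doc1"]       # top-20 → rerank → best 4

print(precision_at_k(no_rerank, relevant, 4))    # 0.25
print(precision_at_k(with_rerank, relevant, 4))  # 0.75
```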
The Router is the key innovation: an LLM classifies each incoming query and routes it to the most appropriate knowledge base or retriever.
| MODULE | ROLE | EXAMPLES |
|---|---|---|
| Router | Classifies queries, directs to right retriever | Semantic router, keyword router, LLM-based |
| Retriever | Fetches relevant documents | Vector, BM25, SQL, Graph, API |
| Reranker | Reorders candidates by relevance | Cohere, BGE, Cross-encoder |
| Generator | Produces final answer | GPT-4o, Claude, Llama 3, Gemini |
| Memory | Maintains conversation context | Buffer, summary, vector memory |
| Guardrails | Validates input/output quality | NeMo Guardrails, custom rules |
```python
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Specialized retrievers per domain
api_retriever  = vectorstore_api.as_retriever(search_kwargs={"k": 4})
ui_retriever   = vectorstore_ui.as_retriever(search_kwargs={"k": 4})
perf_retriever = vectorstore_perf.as_retriever(search_kwargs={"k": 4})

# Router LLM (temperature=0 for deterministic classification)
router_prompt = ChatPromptTemplate.from_template(
    """Classify this QA query into ONE of:
api_testing | ui_testing | performance_testing
Query: {question}
Category:"""
)
router_chain = router_prompt | ChatOpenAI(temperature=0)

def route_query(info):
    route = info["route"].content.strip().lower()
    if "api" in route:
        return api_retriever.invoke(info["question"])
    elif "ui" in route:
        return ui_retriever.invoke(info["question"])
    else:
        return perf_retriever.invoke(info["question"])

# Swappable generator pattern
class ModularRAG:
    def __init__(self, retriever, reranker=None, generator=None):
        self.retriever = retriever
        self.reranker = reranker
        self.generator = generator or ChatOpenAI(model="gpt-4o-mini")

    def query(self, question: str):
        docs = self.retriever.invoke(question)
        if self.reranker:
            docs = self.reranker.compress_documents(docs, question)
        context = "\n".join([d.page_content for d in docs])
        return self.generator.invoke(
            f"Context: {context}\nQuestion: {question}\nAnswer:"
        )

# Easily swap any component!
rag_v1 = ModularRAG(retriever=bm25_retriever, generator=gpt4)
rag_v2 = ModularRAG(retriever=vector_retriever, reranker=cohere, generator=claude)
```
Create "Modular RAG with Query Router"
New flow. Add Chat Input node as the query entry point.
Router Prompt + LLM
Add Prompt: "Classify this query as api_testing, ui_testing, or performance_testing: {input}". Connect ChatOpenAI (gpt-4o-mini, temp=0) to it.
Conditional Router Node
Add "Conditional Router" node. Feed the LLM classification output → input_text. Also feed the original query for downstream retrieval.
Three Retriever Branches
Add 3 AstraDB/Chroma nodes: API Docs, UI Docs, Performance Docs. Connect router outputs to each branch's search_query input.
Unified Answer Prompt
All 3 branches merge at one Prompt node with {context} + {question}. Connect ChatOpenAI (gpt-4o, temp=0.3) → Chat Output.
Router accuracy matrix: Create 20+ test queries from each domain (api, ui, perf). Run all through the router. Calculate classification accuracy — target >95%.
Edge-case routing: Test queries that span domains ("How do I load test a REST API?"). Verify fallback behavior when classification is ambiguous.
Generator A/B test: Swap GPT-4o with Claude using the same retriever. Compare answer quality scores across 10+ queries to justify model choice.
Module hot-swap test: Replace one retriever (e.g., Chroma → Pinecone) without changing any other component. Verify the pipeline still works correctly.
Cross-domain query: Ask a question requiring knowledge from two branches simultaneously. The system should gracefully handle partial retrieval from the wrong branch.
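The router accuracy matrix boils down to comparing predicted labels against known labels. The sketch below assumes you have already collected the router LLM's outputs; the sample data is illustrative.

```python
def router_accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of queries the router classified into the correct domain."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# Labeled queries per domain; predictions stand in for router LLM output
labels      = ["api_testing", "ui_testing", "performance_testing", "api_testing"]
predictions = ["api_testing", "ui_testing", "api_testing",         "api_testing"]

acc = router_accuracy(predictions, labels)
print(f"accuracy={acc:.2f}")  # 0.75, below the >95% target, so investigate
```

In practice you would also build a per-class confusion matrix so you can see which domain pairs the router confuses most.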
| ASPECT | VECTOR RAG | GRAPH RAG |
|---|---|---|
| Storage | Embeddings in vector DB | Entities & relations in graph DB (Neo4j) |
| Retrieval | Cosine similarity search | Graph traversal + community detection |
| Reasoning | Single-hop (find similar text) | Multi-hop (follow relationship chains) |
| Best for | Factual Q&A, semantic search | Dependency analysis, causal queries |
| Example | "What is API rate limiting?" | "Which services depend on the auth module?" |
| Build cost | Low (just embed docs) | High (LLM entity extraction required) |
```python
from langchain_community.graphs import Neo4jGraph
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI

# Connect to Neo4j graph database
graph = Neo4jGraph(
    url="bolt://localhost:7687",
    username="neo4j",
    password="your_password"
)

# Extract entities + relationships from documents
llm = ChatOpenAI(model="gpt-4o", temperature=0)
transformer = LLMGraphTransformer(llm=llm)
graph_documents = transformer.convert_to_graph_documents(documents)
graph.add_graph_documents(graph_documents)
print(f"Graph built with {len(graph_documents)} document graphs")

# Query the graph with natural language
chain = GraphCypherQAChain.from_llm(llm=llm, graph=graph, verbose=True)
result = chain.invoke({
    "query": "Which test suites cover the payment module?"
})
# LLM generates:
# MATCH (t:TestSuite)-[:COVERS]->(m:Module {name: 'payment'}) RETURN t

# Microsoft GraphRAG for community-based search
# graphrag query --method global --query "Main testing challenges?"
# graphrag query --method local --query "Auth service test needs?"
```
Create "Graph RAG with Neo4j"
New flow. Add "Neo4j Graph" component. Configure URL (bolt://localhost:7687), username, and password.
Entity Extraction Pipeline
Add File Loader → "LLMGraphTransformer" node. Connect ChatOpenAI as the extraction LLM. The transformer output goes to the Neo4j Graph for storage.
Query Pipeline
Add Chat Input → "GraphCypherQAChain" node. Connect both Neo4j Graph and ChatOpenAI to the chain. This generates Cypher queries from natural language.
Alternative: Custom Component
For Microsoft GraphRAG: Add a "Python Function" node that calls the graphrag CLI or Python API for global/local search modes.
Test Relationship Queries
Test with: "Which components depend on the database service?" "What modules does the login feature touch?" Verify graph traversal returns accurate relationships.
Entity extraction accuracy: After indexing, query the graph schema. Verify all critical entities (services, modules, APIs) and relationships are correctly captured.
Cypher query validation: Log the generated Cypher queries. Check if the LLM produces syntactically valid queries. Invalid Cypher = silent failure.
Circular dependency handling: Index a document with circular module dependencies. Verify the graph traversal doesn't infinite-loop.
Global vs. local search comparison: Use the same query on both modes. Global = broad community summaries; Local = specific entity details. Verify each returns appropriate granularity.
Construction cost benchmark: Measure time and token cost to build the graph vs. a vector store on the same document set. Graph RAG is typically 10–50x more expensive to build.
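For the circular-dependency test, the property to verify is that traversal tracks visited nodes. A minimal reference sketch (plain Python, not Neo4j's traversal engine) shows the behavior your graph queries should match:

```python
def reachable(graph: dict[str, list[str]], start: str) -> set[str]:
    """Dependency traversal with a visited set, so cycles cannot loop forever."""
    visited, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue  # already expanded: this is what breaks the cycle
        visited.add(node)
        stack.extend(graph.get(node, []))
    return visited

# auth → db → cache → auth is a circular dependency
deps = {"auth": ["db"], "db": ["cache"], "cache": ["auth"]}
print(reachable(deps, "auth"))  # terminates despite the cycle
```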
ReAct Pattern: Reason → Act → Observe → Reason → Act → ... → Final Answer. Each cycle can call a different tool.
🎯 Dynamic Tool Selection
The agent decides which retriever or tool to call based on the query context — API docs, UI docs, bug database, or even web search. No hard-coded routing.
🔄 Iterative Retrieval
Can search multiple times with refined queries. If the first search doesn't yield enough, the agent reformulates and searches again.
🌐 Multi-Source Fusion
Can query internal vector stores AND external web search in a single turn, then synthesize a comprehensive answer from both sources.
⚠️ Cost Warning
Agentic RAG uses 3–10x more tokens than basic RAG due to the reasoning loop. Always set max_iterations to prevent runaway costs. Monitor token usage carefully.
```python
from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI
from langchain.tools.retriever import create_retriever_tool
from langchain import hub

# Define specialized retriever tools
api_tool = create_retriever_tool(
    api_retriever,
    name="api_test_docs",
    description="Search API testing documentation and test cases"
)
ui_tool = create_retriever_tool(
    ui_retriever,
    name="ui_test_docs",
    description="Search UI testing guides, Selenium and Playwright docs"
)
bug_tool = create_retriever_tool(
    bug_retriever,
    name="bug_reports",
    description="Search historical bug reports and known issues"
)
tools = [api_tool, ui_tool, bug_tool]

# Create ReAct agent (Reason + Act loop)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = hub.pull("hwchase17/react")
agent = create_react_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=5,  # Prevent infinite loops!
    handle_parsing_errors=True
)

# Multi-tool query: agent decides which tools to use
response = agent_executor.invoke({
    "input": "Were there payment API bugs last quarter, "
             "and what test cases should cover them?"
})

# Agent reasoning trace:
# Thought: I need both bug history and API test cases
# Action: bug_reports("payment API Q3")
# Observation: Found 3 bugs: AUTH-401, TIMEOUT-503...
# Action: api_test_docs("payment endpoint testing")
# Final Answer: Based on bugs found, recommend...
```
Create "Agentic RAG with Tool Selection"
New flow. Add Chat Input. Create 3 different vector stores (API docs, UI docs, Bug reports).
Create Retriever Tools
Add "Retriever Tool" components (one per vector store). Each tool needs a name and description — the agent uses these to decide which tool to call.
Add Tool Calling Agent
Add "Tool Calling Agent" or "ReAct Agent" component. Connect all retriever tools to the agent's tool input. Set max_iterations=5.
Connect GPT-4o as Reasoning LLM
Connect ChatOpenAI (GPT-4o) to the agent. GPT-4o's strong reasoning is recommended — weaker models often fail to parse tool results correctly.
Test with Complex Multi-Tool Queries
Send: "Find related bugs for auth API and suggest new test cases." Monitor the agent's reasoning trace to verify it selects appropriate tools.
max_iterations guard test: Construct a query that forces the agent to loop (unanswerable with available tools). Verify it terminates at max_iterations with a graceful message.
Tool selection accuracy: For 20 queries where the correct tool is known, verify the agent picks the right tool first. Track misdirected tool calls.
Token budget test: Monitor tokens per query. Agentic RAG with 3 iterations uses ~3x tokens. Establish baseline and alert thresholds.
Tool failure mid-execution: Simulate a tool returning an error (empty results). Verify the agent doesn't hallucinate a response and either retries or reports failure.
Hallucinated tool call test: Verify the agent only calls tools that actually exist. An agent that invents non-existent tool names is a critical failure mode.
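A token budget guard can be a few lines of bookkeeping. `TokenBudget` is a hypothetical helper, not a library class; in a LangChain pipeline you would feed it the per-call usage reported in each model response's metadata.

```python
class TokenBudget:
    """Track cumulative token usage per query and flag budget overruns."""
    def __init__(self, max_tokens_per_query: int):
        self.max_tokens = max_tokens_per_query
        self.used = 0

    def record(self, tokens: int) -> None:
        self.used += tokens

    def exceeded(self) -> bool:
        return self.used > self.max_tokens

budget = TokenBudget(max_tokens_per_query=4000)
for step_tokens in (1200, 1500, 1800):  # three ReAct iterations
    budget.record(step_tokens)
print(budget.used, budget.exceeded())  # 4500 True: alert and cut the loop
```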
| REFLECTION TOKEN | QUESTION ANSWERED | VALUES |
|---|---|---|
| [Retrieve] | Do I need to look up information? | Yes / No / Continue |
| [ISREL] | Is this retrieved document relevant to the query? | Relevant / Irrelevant |
| [ISSUP] | Is my response supported by the retrieved docs? | Fully / Partially / No Support |
| [ISUSE] | Is the response useful to the user? | 5 (best) → 1 (worst) |
```python
from langgraph.graph import StateGraph, END
from typing import TypedDict, List

class RAGState(TypedDict):
    question: str
    documents: List[str]
    generation: str
    needs_retrieval: bool
    is_relevant: bool
    is_grounded: bool

def decide_retrieval(state):
    # Reflection token: [Retrieve]
    prompt = f'Does "{state["question"]}" need external info? (yes/no)'
    response = llm.invoke(prompt)
    return {"needs_retrieval": "yes" in response.content.lower()}

def check_relevance(state):
    # Reflection token: [ISREL]
    prompt = f'Are these docs relevant to "{state["question"]}"?\n'
    prompt += "\n".join(state["documents"][:2])
    response = llm.invoke(prompt)
    return {"is_relevant": "yes" in response.content.lower()}

def check_groundedness(state):
    # Reflection token: [ISSUP]
    prompt = "Is this answer grounded in the context?\n"
    prompt += f'Answer: {state["generation"]}\nContext: {state["documents"][0]}'
    response = llm.invoke(prompt)
    return {"is_grounded": "yes" in response.content.lower()}

# Build the LangGraph state machine
# (the `retrieve` and `generate` node functions are assumed defined elsewhere)
workflow = StateGraph(RAGState)
workflow.add_node("decide", decide_retrieval)
workflow.add_node("retrieve", retrieve)
workflow.add_node("relevance", check_relevance)
workflow.add_node("generate", generate)
workflow.add_node("grounded", check_groundedness)
workflow.set_entry_point("decide")
workflow.add_conditional_edges(
    "decide",
    lambda s: "retrieve" if s["needs_retrieval"] else "generate"
)
workflow.add_edge("retrieve", "relevance")
workflow.add_conditional_edges(
    "relevance",
    lambda s: "generate" if s["is_relevant"] else "retrieve"
)
workflow.add_edge("generate", "grounded")
workflow.add_conditional_edges(
    "grounded",
    lambda s: END if s["is_grounded"] else "retrieve"
)
app = workflow.compile()
```
Create "Self-RAG with Reflection"
Best implemented using "Custom Component" nodes in Langflow for each reflection step.
Retrieval Decision Node
Add Chat Input → Custom Component ("Retrieval Decision") that prompts the LLM: "Does this question need external info?" → branches yes/no.
Relevance Check Node
After Retriever, add "Relevance Check" Custom Component. If irrelevant, loop back to retriever with reformulated query.
Groundedness Check Node
After generation, add "Groundedness Check" node. If not grounded → trigger re-retrieval. If grounded → pass to Chat Output.
Monitor Reflection Trace
Use Langflow's execution trace to debug each reflection decision. This is essential for understanding why the system loops and whether each reflection step is calibrated correctly.
Skip-retrieval test: Ask a common knowledge question ("What is HTTP?"). Verify the [Retrieve]=No path is taken and no vector search is performed, saving latency and cost.
Forced loop test: Index only irrelevant documents. Verify the relevance-check loop terminates at a max_retries limit rather than looping infinitely.
Groundedness hallucination test: Generate responses that subtly add facts not in the context. Verify [ISSUP] = "No Support" is correctly triggered for fabricated details.
Latency overhead benchmark: Compare Self-RAG latency vs. Naive RAG on the same queries. Self-RAG is typically 2–4x slower. Quantify the quality improvement to justify the cost.
Calibration test: The reflection tokens should be consistent — the same doc+query pair should always receive the same ISREL verdict. Test consistency across 10 identical runs.
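The calibration test reduces to a majority-agreement check over repeated verdicts. The verdict lists below are stand-ins for real ISREL outputs collected across identical runs.

```python
def is_calibrated(verdicts: list[str], min_agreement: float = 0.9) -> bool:
    """A reflection grader is calibrated if repeated runs mostly agree."""
    majority = max(set(verdicts), key=verdicts.count)
    return verdicts.count(majority) / len(verdicts) >= min_agreement

# 10 ISREL verdicts for the same doc+query pair
stable   = ["Relevant"] * 10
unstable = ["Relevant"] * 6 + ["Irrelevant"] * 4

print(is_calibrated(stable))    # True
print(is_calibrated(unstable))  # False: the grader is unreliable
```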
✅ CORRECT (>0.7)
Retrieved documents are highly relevant. Use them directly for generation. No web search needed. Best case scenario for latency and cost.
⚠️ AMBIGUOUS (0.4–0.7)
Retrieved docs are partially relevant. Supplement with web search results. Merge both sources before generation. Balances internal knowledge with current web data.
❌ INCORRECT (<0.4)
Retrieved docs are irrelevant. Discard entirely. Fall back fully to web search (Tavily, SerpAPI). Prevents confidently wrong answers based on poor retrieval.
🔍 Query Rewriting
Before web search fallback, CRAG optionally rewrites the query for better search performance. The original QA question may not be ideal for a search engine.
```python
from langgraph.graph import StateGraph, END
from langchain_community.tools.tavily_search import TavilySearchResults

web_search = TavilySearchResults(max_results=3)

def grade_documents(state):
    """Grade each retrieved document 0.0-1.0 for relevance."""
    question = state["question"]
    scored_docs = []
    for doc in state["documents"]:
        grade_prompt = f"""Grade relevance 0.0-1.0:
Question: {question}
Document: {doc.page_content[:500]}
Score:"""
        score = float(llm.invoke(grade_prompt).content.strip())
        scored_docs.append((doc, score))

    avg_score = sum(s for _, s in scored_docs) / len(scored_docs)
    if avg_score > 0.7:
        action = "correct"
    elif avg_score > 0.4:
        action = "ambiguous"
    else:
        action = "incorrect"

    relevant = [d for d, s in scored_docs if s > 0.5]
    return {"action": action, "documents": relevant}

def web_search_fallback(state):
    """Fall back to web search when retrieval confidence is low."""
    results = web_search.invoke({"query": state["question"]})
    web_docs = [r["content"] for r in results]
    existing = state.get("documents", [])
    return {"documents": existing + web_docs}

def route_after_grading(state):
    if state["action"] == "correct":
        return "generate"
    return "web_search"  # ambiguous or incorrect

# `CRAGState`, `retrieve`, and `generate` are assumed defined, as in Self-RAG
workflow = StateGraph(CRAGState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade", grade_documents)
workflow.add_node("web_search", web_search_fallback)
workflow.add_node("generate", generate)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade")
workflow.add_conditional_edges("grade", route_after_grading)
workflow.add_edge("web_search", "generate")
workflow.add_edge("generate", END)
crag_app = workflow.compile()
```
Create "Corrective RAG (CRAG)"
New flow. Add Chat Input → Retriever (connected to your vector store, k=5).
Add Grading Component
Add a "Custom Component" for the grading logic. Takes retrieved docs + query, uses LLM to score relevance 0.0–1.0 for each doc, returns average score and action.
Conditional Router on Score
Add "Conditional Router" branching: Correct (>0.7) → direct to prompt; Ambiguous/Incorrect → web search fallback.
Web Search Fallback
Add "Tavily Search" (or SerpAPI) component for the fallback path. Add a "Combine Documents" node to merge retrieved + web results for ambiguous cases.
Test with Out-of-KB Queries
Send queries your vector store cannot answer. Verify web search fallback activates. Monitor grading scores to calibrate the 0.4/0.7 thresholds for your domain.
Threshold boundary test: Construct docs that score exactly 0.4 and 0.7. Verify the routing decision is deterministic at these boundaries.
Double-failure test: When web search also returns irrelevant results, does the system still generate a response? Verify it uses the available context and doesn't hallucinate confidently.
Grader calibration test: Run the same doc-query pair 10 times. The score should be consistent (within ±0.1). High variance means the grader is unreliable.
Latency with web fallback: Measure latency for each path (correct vs. fallback). Web search adds ~1–3 seconds. Verify this is acceptable for your use case.
Private data leak test: Verify the web search query doesn't include sensitive internal data from retrieved docs that scored poorly. Sanitize the search query before calling external APIs.
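The threshold boundary test hinges on whether the comparisons are strict. A sketch of the routing rule from the grading logic above makes the boundary behavior explicit:

```python
def route(avg_score: float) -> str:
    """CRAG action thresholds (0.7 / 0.4) with strict '>' comparisons."""
    if avg_score > 0.7:
        return "correct"
    elif avg_score > 0.4:
        return "ambiguous"
    return "incorrect"

# Strict '>' means exact boundary values fall into the LOWER bucket
print(route(0.7))   # ambiguous
print(route(0.4))   # incorrect
print(route(0.71))  # correct
```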
| QUERY TYPE | BM25 KEYWORD | VECTOR SEMANTIC | WINNER |
|---|---|---|---|
| "Error ERR_AUTH_403" | Excellent — exact match | Poor — number confusion | BM25 |
| "login doesn't work" | Poor — no exact keywords | Excellent — semantic intent | Vector |
| "TC-2451 test case" | Excellent — ID match | Poor — ID is meaningless in vector space | BM25 |
| "flaky intermittent failures" | Moderate | Excellent — understands concept | Vector |
| "Explain ERR_AUTH_403 and fix" | Good for code part | Good for explanation part | Both |
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# BM25: keyword search on raw text
bm25_retriever = BM25Retriever.from_documents(documents=chunks, k=4)

# Vector: semantic search on embeddings
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Ensemble: weights favor semantic (0.6) over keyword (0.4)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]
)
results = hybrid_retriever.invoke("Error ERR_AUTH_403 during login test")
# BM25 finds exact error code matches
# Vector finds semantically related auth failure docs
# Results merged and deduplicated automatically

# ── Reciprocal Rank Fusion (manual) ──────────────────
def reciprocal_rank_fusion(results_list, k=60):
    fused_scores = {}
    for results in results_list:
        for rank, doc in enumerate(results):
            doc_id = doc.page_content[:100]
            if doc_id not in fused_scores:
                fused_scores[doc_id] = {"doc": doc, "score": 0}
            fused_scores[doc_id]["score"] += 1.0 / (rank + k)
    sorted_docs = sorted(
        fused_scores.values(), key=lambda x: x["score"], reverse=True
    )
    return [item["doc"] for item in sorted_docs]

bm25_r = bm25_retriever.invoke(query)
vector_r = vector_retriever.invoke(query)
hybrid_r = reciprocal_rank_fusion([bm25_r, vector_r])
```
Create "Hybrid RAG (BM25 + Vector)"
New flow. Add File Loader → Text Splitter for document processing.
Create Two Parallel Branches
(a) Chroma vector store with OpenAI Embeddings. (b) BM25 Retriever component. Both are connected to the same chunked documents.
Connect Query to Both Retrievers
Add Chat Input. Connect the query to both the Chroma Retriever AND the BM25 Retriever simultaneously. They run in parallel.
Add Ensemble/RRF Merge Node
Use "Ensemble Retriever" or a Python Function implementing RRF. Set weights [0.4, 0.6] to favor semantic search slightly.
Test Both Query Types
Test exact-match queries (error codes, test IDs) and semantic queries (concepts, behaviors). Verify both retrievers contribute to the final result set.
BM25 dominance test: Query with exact error codes and test case IDs. Log which retriever found each result. BM25 should dominate for exact-match queries.
Vector dominance test: Query with natural language descriptions ("tests that verify user authentication flow"). Vector search should return the most relevant results.
Weight tuning test: Try weights [0.5, 0.5], [0.3, 0.7], [0.7, 0.3]. Measure precision@4 for each ratio on your test query set. Find the optimal balance for your corpus.
Deduplication test: When both retrievers return the same document, it should appear exactly once in the final results with a higher fused score. Verify this behavior.
Mixed query test: "Explain ERR_AUTH_403 and how to fix it" — needs both exact code matching (BM25) and explanatory context (Vector). Verify the hybrid result contains both types of content.
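For intuition on why a document found by both retrievers rises to the top, here is a minimal weighted rank-fusion sketch. It is a simplification: the `weight / (rank + 1)` scoring is illustrative, not `EnsembleRetriever`'s exact formula.

```python
def weighted_fuse(bm25_ranked: list[str], vector_ranked: list[str],
                  weights: tuple[float, float] = (0.4, 0.6)) -> list[str]:
    """Each retriever contributes weight / (rank + 1) to a doc's fused score."""
    scores: dict[str, float] = {}
    for weight, ranking in zip(weights, (bm25_ranked, vector_ranked)):
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + weight / (rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25_r   = ["ERR_AUTH_403 spec", "auth runbook"]        # exact-match hits
vector_r = ["auth failure guide", "ERR_AUTH_403 spec"]  # semantic hits

fused = weighted_fuse(bm25_r, vector_r)
print(fused[0])  # the doc found by BOTH retrievers is fused to the top
```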
| STRATEGY | HOW IT WORKS | BEST FOR |
|---|---|---|
| Text Extraction | Convert images/tables to text, then standard RAG | Screenshots with text, simple tables |
| Multi-Vector | Store text summaries + original media; retrieve summaries, return media | Complex diagrams, charts |
| CLIP Embeddings | Embed images and text in the same vector space | Image-text similarity search |
| Vision LLM | GPT-4V / Claude describes images; index descriptions | Complex visual content, screenshots |
```python
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from langchain_core.documents import Document
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
import uuid

vision_llm = ChatOpenAI(model="gpt-4o", temperature=0)

def summarize_image(image_base64: str) -> str:
    """Use a Vision LLM to describe an image for a QA engineer."""
    message = HumanMessage(content=[
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
        {"type": "text",
         "text": "Describe for a QA engineer: error messages, "
                 "UI elements, test results, and status codes visible."}
    ])
    response = vision_llm.invoke([message])
    return response.content

# Multi-Vector store: summaries → vector store, originals → byte store
byte_store = InMemoryByteStore()
id_key = "doc_id"
multi_retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=byte_store,
    id_key=id_key
)

# Index each image
for img_path in image_paths:
    img_b64 = encode_image(img_path)
    summary = summarize_image(img_b64)
    doc_id = str(uuid.uuid4())
    # Summary → vector store (for retrieval)
    multi_retriever.vectorstore.add_documents([
        Document(page_content=summary, metadata={id_key: doc_id})
    ])
    # Original image → byte store (for return)
    multi_retriever.docstore.mset([(doc_id, img_b64)])

# Query retrieves summaries but RETURNS original images
results = multi_retriever.invoke("Login page 404 error screenshot")
```
Create "Multi-Modal RAG"
New flow. Add a File Loader that accepts multiple types (PDF, PNG, JPG, DOCX).
Add Unstructured Loader
For PDFs: Use "Unstructured" loader with strategy="hi_res" to extract text, tables, and embedded images separately.
Content Router by Type
Add a Custom Component that routes by content type: text → text splitter, images → Vision LLM summarizer, tables → table summarizer.
Unified Vector Store
All summarized content feeds into one vector store. Store original content (images, tables) in a separate byte store for retrieval return.
Multi-Modal Generator
Use ChatOpenAI GPT-4o (not mini) as the final answer generator — it can reason about both text and image context simultaneously.
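The routing step above can be sketched in plain Python. This is a minimal dispatcher, not a real Langflow API — the handler functions (`split_text`, `summarize_image`, `summarize_table`) are hypothetical stand-ins for the components each content type would feed into:

```python
from pathlib import Path

# Hypothetical handlers standing in for the Langflow components.
def split_text(path):       return f"text-chunks:{path}"
def summarize_image(path):  return f"image-summary:{path}"
def summarize_table(path):  return f"table-summary:{path}"

ROUTES = {
    ".pdf": split_text, ".docx": split_text, ".txt": split_text,
    ".png": summarize_image, ".jpg": summarize_image, ".jpeg": summarize_image,
    ".csv": summarize_table, ".xlsx": summarize_table,
}

def route_content(path: str) -> str:
    """Dispatch a file to the right summarizer by extension."""
    handler = ROUTES.get(Path(path).suffix.lower())
    if handler is None:
        raise ValueError(f"Unsupported content type: {path}")
    return handler(path)

print(route_content("login_error.png"))  # → image-summary:login_error.png
print(route_content("test_plan.docx"))   # → text-chunks:test_plan.docx
```

Extension-based routing is the simplest choice; a production router would also sniff file contents, since a `.pdf` can contain pure text, scanned images, or both.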
Image summarization accuracy: Index a known error screenshot. Query for its content. Verify the Vision LLM correctly identifies error codes, button states, and failure messages.
Table extraction fidelity: Extract a performance table from a PDF. Verify all numbers, column headers, and row labels are preserved exactly — numeric errors are silent and dangerous.
Mixed context query: Ask a question that requires both text and image context ("What error is shown in the screenshot mentioned in section 3.2?"). Verify both sources are retrieved and synthesized.
Low-quality input test: Index blurry screenshots and scanned documents. Verify graceful degradation — the system should flag low-confidence image descriptions rather than hallucinate content.
Original vs. summary retrieval: Verify that when an image-related query is answered, the original image (not just the text summary) is returned in the response for human verification.
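The last check above is easy to automate. Here is a toy sketch with an in-memory stand-in for the multi-vector store — `FakeMultiVectorStore` is illustrative only, not a real LangChain class — showing the assertion that matters: queries match on summaries, but originals come back:

```python
class FakeMultiVectorStore:
    """Toy stand-in: maps doc_ids to (summary, original_bytes) pairs."""
    def __init__(self):
        self.records = {}

    def add(self, doc_id: str, summary: str, original: bytes):
        self.records[doc_id] = (summary, original)

    def retrieve(self, query: str) -> list[bytes]:
        # Naive keyword match on summaries; real systems use embeddings.
        return [orig for summary, orig in self.records.values()
                if query.lower() in summary.lower()]

store = FakeMultiVectorStore()
store.add("img-1", "Screenshot of login page showing a 404 error",
          b"\x89PNG...original-bytes")

hits = store.retrieve("404 error")
# The QA assertion: original image bytes are returned, not the text summary.
assert hits and isinstance(hits[0], bytes)
print(f"{len(hits)} original image(s) returned")
```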
📝 Context Generation
For each chunk, an LLM reads the full document + the chunk and generates 2-3 sentences of contextual metadata describing: what document it's from, what section, and what key entities are mentioned.
⚡ Hybrid + Reranking
Anthropic's research shows best results combining Contextual Embeddings + Contextual BM25 + Cohere Reranking. This combination reduced retrieval failures by 67% in benchmarks.
💰 Indexing Cost
Each chunk requires one LLM call during indexing for context generation. For a 1000-chunk corpus, that's 1000 LLM calls. Use smaller models (claude-haiku, gpt-4o-mini) to control costs.
🎯 Best Use Cases
Large document collections with many similar sections (multiple API endpoints, multiple test plans), technical documentation where context disambiguation is critical for correct retrieval.
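The indexing cost above is worth estimating before committing. A back-of-envelope sketch — the per-token prices and token counts here are placeholder assumptions, so substitute your provider's current pricing:

```python
def contextual_indexing_cost(n_chunks: int,
                             tokens_per_call: int = 4500,
                             output_tokens: int = 100,
                             price_in_per_1m: float = 0.15,
                             price_out_per_1m: float = 0.60) -> float:
    """Estimate the one-time LLM cost of context generation:
    one call per chunk, each carrying ~4000 document tokens plus
    the chunk and prompt. Prices are assumed $/1M tokens."""
    input_cost = n_chunks * tokens_per_call * price_in_per_1m / 1_000_000
    output_cost = n_chunks * output_tokens * price_out_per_1m / 1_000_000
    return round(input_cost + output_cost, 2)

# 1000-chunk corpus = 1000 LLM calls, paid once at indexing time
print(f"Estimated indexing cost: ${contextual_indexing_cost(1000)}")
```

With a small model the one-time cost is typically under a dollar per thousand chunks; the real decision point is usually re-indexing frequency, since every document update pays this cost again.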
from langchain.retrievers import ContextualCompressionRetriever, EnsembleRetriever
from langchain.schema import Document
from langchain_cohere import CohereRerank  # pip install langchain-cohere
from langchain_community.retrievers import BM25Retriever
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

CONTEXT_PROMPT = """Here is the full document:
<document>
{document}
</document>

Here is a chunk from that document:
<chunk>
{chunk}
</chunk>

Provide 2-3 sentence context explaining:
1. What document/section this chunk belongs to
2. What the chunk is specifically about
3. Key entities or identifiers mentioned

Context:"""

def add_context_to_chunks(full_doc: str, chunks: list) -> list:
    """Prepend document context to each chunk before embedding."""
    contextualized = []
    for chunk in chunks:
        context = llm.invoke(
            CONTEXT_PROMPT.format(
                document=full_doc[:4000],
                chunk=chunk.page_content,
            )
        ).content
        enriched = f"{context}\n\n{chunk.page_content}"
        contextualized.append(Document(
            page_content=enriched,
            metadata={**chunk.metadata, "original": chunk.page_content},
        ))
    return contextualized

# ── Best combination: Contextual + Hybrid + Rerank ───
# Chroma, embeddings, chunks, full_doc_text come from the indexing pipeline
ctx_chunks = add_context_to_chunks(full_doc_text, chunks)

ctx_bm25 = BM25Retriever.from_documents(ctx_chunks, k=20)
ctx_vector = Chroma.from_documents(ctx_chunks, embeddings).as_retriever(
    search_kwargs={"k": 20}
)
hybrid = EnsembleRetriever(
    retrievers=[ctx_bm25, ctx_vector], weights=[0.4, 0.6]
)
final = ContextualCompressionRetriever(
    base_compressor=CohereRerank(top_n=5),
    base_retriever=hybrid,
)
# This combination: -67% retrieval failures (Anthropic research)
Create "Contextual RAG (Anthropic Style)"
New flow. Add File Loader → Text Splitter for initial chunking. Keep both the chunks AND the full document text.
Add Context Generator Component
Add a "Custom Component" that takes each chunk + full document and generates context using ChatOpenAI with the context prompt template above.
Feed Contextualized Chunks to Both Retrievers
Connect contextualized chunks to BOTH a Chroma vector store AND a BM25 Retriever. Both now search context-enriched content.
Add Ensemble + Cohere Rerank
Add "Ensemble Retriever" with weights [0.4, 0.6]. Follow with "Cohere Rerank" (top_n=5). This is Anthropic's recommended full stack.
Benchmark vs. Naive RAG
Run the same 20 test queries through Naive RAG and Contextual RAG. Compare retrieval precision. Expect ~20-40% improvement on ambiguous queries about specific sections.
Disambiguation test: Create a corpus with 5 similar sections (e.g., timeout config for 5 different APIs). Query for each specific one. Without context, retrieval fails; with context, each query returns the correct section.
Context accuracy test: Inspect generated context for each chunk. Verify the LLM correctly identifies the document section and key entities — hallucinated metadata defeats the entire purpose.
Indexing cost benchmark: Count API calls during indexing. A 1000-chunk corpus = 1000 LLM calls. Measure cost and time. This is paid upfront, not per query — ensure it fits your budget.
Baseline comparison: Run the same 50 queries through Naive RAG vs. Contextual RAG. Measure retrieval precision@5 for both. The improvement should justify the indexing cost.
Context hallucination check: Feed a deliberately obscure chunk where the context is hard to determine. Verify the context generator says "context unclear" rather than inventing a plausible-sounding but wrong description.
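The precision@5 numbers for the baseline comparison can be computed with a few lines. A minimal sketch, assuming you maintain a human-labeled set of relevant chunk IDs per query; the chunk IDs below are hypothetical:

```python
def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

# Hypothetical results for one query about API 3's timeout config
relevant = {"api-3-timeout", "api-3-retry"}  # human-labeled ground truth
naive_hits = ["api-1-timeout", "api-3-timeout", "api-2-timeout",
              "api-5-timeout", "api-4-timeout"]
ctx_hits = ["api-3-timeout", "api-3-retry", "api-1-timeout",
            "api-2-timeout", "api-4-timeout"]

print(precision_at_k(naive_hits, relevant))  # → 0.2
print(precision_at_k(ctx_hits, relevant))    # → 0.4
```

Average this over all 50 queries for each pipeline; the delta between the two averages is the number that has to justify the indexing cost.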
Faithfulness
Is the answer grounded in retrieved context? Detects hallucinations where the model adds facts not in the docs.
Answer Relevancy
Does the answer actually address the question? A grounded but off-topic answer scores low here.
Context Precision
Are the top retrieved documents actually relevant? Measures retrieval signal-to-noise ratio.
Context Recall
Were all relevant documents retrieved? Low recall means the answer is incomplete due to missed retrieval.
Answer Correctness
Is the answer factually correct against ground truth? Requires human-labeled expected answers.
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy, context_precision, context_recall,
)
from datasets import Dataset

eval_data = {
    "question": ["How to test API rate limiting?", "Login test cases?"],
    "answer": generated_answers,       # From your RAG pipeline
    "contexts": retrieved_contexts,    # Retrieved chunks list
    "ground_truth": expected_answers,  # Human-written expected answers
}

results = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# {'faithfulness': 0.87, 'answer_relevancy': 0.91,
#  'context_precision': 0.78, 'context_recall': 0.85}
QA Test Plan Template
| TEST CATEGORY | TEST SCENARIOS | PASS CRITERIA |
|---|---|---|
| Retrieval Quality | Relevant docs in top-k, duplicate handling, empty results | Context precision > 0.80 |
| Generation Quality | Faithfulness, hallucination detection, answer completeness | Faithfulness > 0.85 |
| Edge Cases | Empty KB, very long queries, special characters, multilingual | Graceful degradation |
| Performance | Latency p50/p95/p99, throughput, concurrent users | p95 latency < 3 seconds |
| Security | Prompt injection, data leakage, PII in responses | Zero PII leakage |
| Robustness | Typos in queries, paraphrased questions, adversarial inputs | Consistent quality ±5% |
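The pass criteria in the table can be enforced as automated quality gates in CI. A minimal sketch — the thresholds mirror the table, and the `metrics` dict stands in for real output from a RAGAS run plus a load test:

```python
# Thresholds from the QA test plan table
QUALITY_GATES = {
    "context_precision": 0.80,
    "faithfulness": 0.85,
}
MAX_P95_LATENCY_S = 3.0

def check_gates(metrics: dict) -> list[str]:
    """Return a list of gate failures; an empty list means the build passes."""
    failures = [f"{name}={metrics[name]:.2f} < {floor}"
                for name, floor in QUALITY_GATES.items()
                if metrics.get(name, 0.0) < floor]
    p95 = metrics.get("p95_latency_s")
    if p95 is None or p95 > MAX_P95_LATENCY_S:  # missing metric fails closed
        failures.append(f"p95 latency {p95}s > {MAX_P95_LATENCY_S}s")
    return failures

# Example: metrics would normally come from RAGAS + load-test output
metrics = {"context_precision": 0.78, "faithfulness": 0.87, "p95_latency_s": 2.4}
for failure in check_gates(metrics):
    print("GATE FAILED:", failure)
```

Failing the build on a metric regression is what turns the table from documentation into an actual test suite.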
Choose the Right RAG
Start simple. Measure. Add complexity only where metrics show improvement.
Naive RAG
Simple Q&A, rapid prototyping. Build this first.
Advanced RAG
Production systems needing quality. HyDE + Reranking.
Modular RAG
Multiple knowledge domains. Plug-and-play architecture.
Graph RAG
Relationship queries. Module dependency analysis.
Agentic RAG
Complex multi-step research. Dynamic tool selection.
Self-RAG
High-reliability requirements. Built-in quality assertions.
Corrective RAG
Unreliable knowledge bases. Web search fallback.
Hybrid RAG
Mixed exact + semantic search. Error codes + concepts.
Multi-Modal RAG
Screenshots, diagrams, charts. Visual test evidence.
Contextual RAG
Large document collections. Anthropic's method: 67% fewer retrieval failures.
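The guide above can be condensed into a lightweight chooser. This mapping is a deliberate simplification — real selection should be driven by the measured metrics discussed above, not a lookup table:

```python
# Requirement → RAG type, mirroring the selection guide above
RAG_GUIDE = {
    "rapid prototyping": "Naive RAG",
    "production quality": "Advanced RAG (HyDE + Reranking)",
    "multiple knowledge domains": "Modular RAG",
    "relationship queries": "Graph RAG",
    "multi-step research": "Agentic RAG",
    "high reliability": "Self-RAG",
    "unreliable knowledge base": "Corrective RAG",
    "exact + semantic search": "Hybrid RAG",
    "screenshots and diagrams": "Multi-Modal RAG",
    "large similar collections": "Contextual RAG",
}

def choose_rag(requirement: str) -> str:
    """Map a dominant requirement to a starting architecture."""
    return RAG_GUIDE.get(requirement, "Start with Naive RAG, then measure")

print(choose_rag("relationship queries"))  # → Graph RAG
print(choose_rag("something else"))        # → Start with Naive RAG, then measure
```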