RAG Tutorial
10 RAG architectures explained with theory, visual flow diagrams, Python code, Langflow implementations, and QA testing strategies. Built for QA engineers and AI testing professionals.
QA Analogy: Like running a basic test suite — no parallelism, no retries, no smart ordering. It works, but it's the starting point, not the destination.
📄 Stage 1: Indexing
Documents are loaded, split into fixed-size chunks (500–1000 tokens with 200-token overlap), converted to embeddings via a model like text-embedding-3-small, and stored in a vector database like ChromaDB or Pinecone.
🔍 Stage 2: Retrieval
The user's query is embedded using the same embedding model. Cosine similarity finds the top-k most similar chunks in the vector store. Typically k=4 chunks are retrieved.
✍️ Stage 3: Generation
The retrieved chunks are concatenated with the original query and fed into an LLM as context. The LLM generates a grounded response using only the provided context.
⚠️ Limitations
Fixed chunking splits important context. No query optimization. No re-ranking. No self-correction. Single retrieval step cannot handle multi-hop reasoning. These are solved by more advanced RAG types.
| PARAMETER | TYPICAL VALUE | EFFECT OF CHANGING |
|---|---|---|
| chunk_size | 1000 tokens | Smaller = more precise but loses context; Larger = more context but noisier |
| chunk_overlap | 200 tokens | Higher overlap preserves context across boundaries but increases storage |
| k (top results) | 4 | Higher k = more context but risks exceeding the context window |
| embedding model | text-embedding-3-small | Larger models improve retrieval but cost more per call |
| temperature | 0 | Higher = more creative but less faithful to retrieved context |
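To see how `chunk_size` and `chunk_overlap` interact, here is a toy sketch (not LangChain's actual splitter, which prefers natural separators): the effective step between chunks is `chunk_size - overlap`, so halving the chunk size more than doubles the number of chunks to embed and store.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Toy fixed-size chunker: each new chunk starts chunk_size - overlap in."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 3000  # stand-in for a 3000-token document

small = chunk_text(doc, chunk_size=500, overlap=200)   # step = 300
large = chunk_text(doc, chunk_size=1000, overlap=200)  # step = 800

print(len(small), len(large))  # smaller chunks => more chunks to store
```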
INDEXING PIPELINE (top) + QUERY PIPELINE (bottom) — arrows show data flow direction
```python
# Step 1: Install dependencies
# pip install langchain chromadb openai langchain-community langchain-openai

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# ── INDEXING ──────────────────────────
loader = PyPDFLoader("test_documentation.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# ── QUERY & GENERATE ──────────────────
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)
response = qa_chain.invoke({"query": "What are the login test cases?"})
print(response["result"])
print("Sources:", [doc.metadata for doc in response["source_documents"]])
```
Create a New Flow
Open Langflow → New Flow → Name it "Naive RAG Pipeline". Choose "Blank Flow" to start from scratch.
Add File Loader
Drag a "File" component onto the canvas. Upload your test documentation (PDF, TXT, or DOCX).
Configure Text Splitter
Add "Recursive Character Text Splitter". Set chunk_size=1000, chunk_overlap=200. Connect File → Splitter.
Add Embeddings + Vector Store
Add "OpenAI Embeddings" (enter API key). Add "Chroma" vector store. Connect Splitter → Chroma, Embeddings → Chroma.
Add Retriever + Prompt
Add "Retriever" node, connect to Chroma (set k=4). Add "Prompt" template: Context: {context}\nQuestion: {question}.
Connect LLM + Output
Add "ChatOpenAI" → connect to Prompt. Add "Chat Output" → connect to ChatOpenAI. Add "Chat Input" for queries.
Test & Validate
Click Play → open chat panel → test with sample queries. Review execution trace to verify retrieval. Tune chunk_size and k based on results.
Empty knowledge base query: Send a query when no relevant docs are indexed. Verify the LLM does NOT hallucinate — it should say "I don't have information about this."
Multi-chunk spanning query: Ask a question whose answer spans 2+ chunks. Verify context continuity is preserved and the answer is complete.
Duplicate document test: Index the same document twice. Verify retrieval returns deduplicated results and doesn't inflate context.
Critical boundary split: Find chunks where critical info (like a code snippet) is split. Verify the overlap correctly preserves the context.
Embedding model change: Swap to a different embedding model. Verify re-indexing happens correctly and old embeddings are cleared — stale embeddings cause silent failures.
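The duplicate-document scenario above can be automated with a small helper. `dedupe_by_content` below is hypothetical glue code, not a library API; a real pipeline might dedupe on document IDs or content hashes rather than raw text.

```python
def dedupe_by_content(docs: list[str]) -> list[str]:
    """Drop exact-duplicate chunks while preserving retrieval order."""
    seen, unique = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            unique.append(d)
    return unique

# Simulate the duplicate-document scenario: same doc indexed twice
retrieved = ["login spec v1", "login spec v1", "password rules"]
print(dedupe_by_content(retrieved))  # duplicate chunk appears only once
```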
| STAGE | TECHNIQUE | PURPOSE |
|---|---|---|
| Pre-Retrieval | HyDE (Hypothetical Document Embedding) | Generate a hypothetical answer, use that for retrieval instead of raw query |
| Pre-Retrieval | Semantic Chunking | Split documents at semantic boundaries, not arbitrary token counts |
| Retrieval | Parent Document Retrieval | Match small child chunks, return full parent document for context |
| Post-Retrieval | Cohere Rerank / BGE | Re-score top-20 candidates, return only best 4 |
| Post-Retrieval | Contextual Compression | Strip irrelevant sentences from retrieved chunks |
```python
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# ── HyDE: Query → Hypothetical Doc → Better Retrieval ──
hyde_prompt = ChatPromptTemplate.from_template(
    "Write a short passage that would answer: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
hyde_chain = hyde_prompt | llm
hypothetical_doc = hyde_chain.invoke({"question": "How to test API rate limiting?"})
# Now use hypothetical_doc.content for embedding + retrieval

# ── Semantic Chunking ─────────────────────────────────
embeddings = OpenAIEmbeddings()
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
semantic_chunks = semantic_splitter.split_documents(documents)

# ── Cohere Re-ranking: top-20 → best 4 ───────────────
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
reranker = CohereRerank(model="rerank-english-v3.0", top_n=4)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever
)
results = compression_retriever.invoke("Edge cases for payment testing?")
print(f"Re-ranked: {len(results)} top documents returned")
```
Create Flow: "Advanced RAG with Reranking"
New flow in Langflow. Add File Loader → "Semantic Text Splitter" (set breakpoint_threshold_type to "percentile").
Over-fetch with Retriever k=20
Add Retriever node. Set k=20 to retrieve many candidates before re-ranking. Connect to Chroma vector store.
Add Cohere Rerank
Add "Cohere Rerank" component after retriever. Set top_n=4. Add your Cohere API key. Connect Retriever → Cohere.
Add HyDE Branch
Add a separate ChatOpenAI node that generates a hypothetical answer from the query. Feed that into the embedding step instead of the raw query.
Prompt + Output
Use prompt: "Use ONLY the following context... Context: {context}\nQuestion: {question}\nIf not in context, say so." → ChatOpenAI → Chat Output.
Reranking precision test: Run the same query with k=4 (no rerank) vs k=20+rerank. Measure retrieval precision — the reranked version should return more relevant docs.
HyDE degradation test: Test HyDE with ambiguous queries. For some queries, the hypothetical doc may drift semantically — verify it doesn't degrade retrieval quality.
Reranking latency test: Measure latency with/without Cohere reranking. Reranking adds ~200–500ms. Ensure this is within your SLA.
Semantic chunking edge cases: Test with code snippets, HTML tables, and mixed-language content. Verify semantic splitter doesn't break code blocks.
Parent-child retrieval accuracy: Verify that when a child chunk matches, the correct parent document is returned — not a random parent.
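For the reranking precision test, precision@k is straightforward to compute once you have a labeled set of relevant document IDs. The IDs and result lists below are illustrative stand-ins, not real retrieval output.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are actually relevant."""
    top = retrieved_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k

relevant = {"doc2", "doc5", "doc7", "doc9"}          # labeled ground truth
no_rerank   = ["doc1", "doc2", "doc3", "doc4"]       # plain k=4 retrieval
with_rerank = ["doc2", "doc5", "doc7", "doc1"]       # top-20 → rerank → best 4

print(precision_at_k(no_rerank, relevant, 4))    # 0.25
print(precision_at_k(with_rerank, relevant, 4))  # 0.75
```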
The Router is the key innovation: an LLM classifies each incoming query and routes it to the most appropriate knowledge base or retriever.
| MODULE | ROLE | EXAMPLES |
|---|---|---|
| Router | Classifies queries, directs to right retriever | Semantic router, keyword router, LLM-based |
| Retriever | Fetches relevant documents | Vector, BM25, SQL, Graph, API |
| Reranker | Reorders candidates by relevance | Cohere, BGE, Cross-encoder |
| Generator | Produces final answer | GPT-4o, Claude, Llama 3, Gemini |
| Memory | Maintains conversation context | Buffer, summary, vector memory |
| Guardrails | Validates input/output quality | NeMo Guardrails, custom rules |
```python
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Specialized retrievers per domain
api_retriever  = vectorstore_api.as_retriever(search_kwargs={"k": 4})
ui_retriever   = vectorstore_ui.as_retriever(search_kwargs={"k": 4})
perf_retriever = vectorstore_perf.as_retriever(search_kwargs={"k": 4})

# Router LLM (temperature=0 for deterministic classification)
router_prompt = ChatPromptTemplate.from_template(
    """Classify this QA query into ONE of:
api_testing | ui_testing | performance_testing
Query: {question}
Category:"""
)
router_chain = router_prompt | ChatOpenAI(temperature=0)

def route_query(info):
    route = info["route"].content.strip().lower()
    if "api" in route:
        return api_retriever.invoke(info["question"])
    elif "ui" in route:
        return ui_retriever.invoke(info["question"])
    else:
        return perf_retriever.invoke(info["question"])

# Swappable generator pattern
class ModularRAG:
    def __init__(self, retriever, reranker=None, generator=None):
        self.retriever = retriever
        self.reranker = reranker
        self.generator = generator or ChatOpenAI(model="gpt-4o-mini")

    def query(self, question: str):
        docs = self.retriever.invoke(question)
        if self.reranker:
            docs = self.reranker.compress_documents(docs, question)
        context = "\n".join([d.page_content for d in docs])
        return self.generator.invoke(
            f"Context: {context}\nQuestion: {question}\nAnswer:"
        )

# Easily swap any component!
rag_v1 = ModularRAG(retriever=bm25_retriever, generator=gpt4)
rag_v2 = ModularRAG(retriever=vector_retriever, reranker=cohere, generator=claude)
```
Create "Modular RAG with Query Router"
New flow. Add Chat Input node as the query entry point.
Router Prompt + LLM
Add Prompt: "Classify this query as api_testing, ui_testing, or performance_testing: {input}". Connect ChatOpenAI (gpt-4o-mini, temp=0) to it.
Conditional Router Node
Add "Conditional Router" node. Feed the LLM classification output → input_text. Also feed the original query for downstream retrieval.
Three Retriever Branches
Add 3 AstraDB/Chroma nodes: API Docs, UI Docs, Performance Docs. Connect router outputs to each branch's search_query input.
Unified Answer Prompt
All 3 branches merge at one Prompt node with {context} + {question}. Connect ChatOpenAI (gpt-4o, temp=0.3) → Chat Output.
Router accuracy matrix: Create 20+ test queries from each domain (api, ui, perf). Run all through the router. Calculate classification accuracy — target >95%.
Edge-case routing: Test queries that span domains ("How do I load test a REST API?"). Verify fallback behavior when classification is ambiguous.
Generator A/B test: Swap GPT-4o with Claude using the same retriever. Compare answer quality scores across 10+ queries to justify model choice.
Module hot-swap test: Replace one retriever (e.g., Chroma → Pinecone) without changing any other component. Verify the pipeline still works correctly.
Cross-domain query: Ask a question requiring knowledge from two branches simultaneously. The system should gracefully handle partial retrieval from the wrong branch.
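The router accuracy matrix boils down to comparing predicted labels against known labels. The sketch below assumes you have already collected the router LLM's outputs; the sample data is illustrative.

```python
def router_accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of queries the router classified into the correct domain."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# Labeled queries per domain; predictions stand in for router LLM output
labels      = ["api_testing", "ui_testing", "performance_testing", "api_testing"]
predictions = ["api_testing", "ui_testing", "api_testing",         "api_testing"]

acc = router_accuracy(predictions, labels)
print(f"accuracy={acc:.2f}")  # 0.75, below the >95% target, so investigate
```

In practice you would also build a per-class confusion matrix so you can see which domain pairs the router confuses most.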
| ASPECT | VECTOR RAG | GRAPH RAG |
|---|---|---|
| Storage | Embeddings in vector DB | Entities & relations in graph DB (Neo4j) |
| Retrieval | Cosine similarity search | Graph traversal + community detection |
| Reasoning | Single-hop (find similar text) | Multi-hop (follow relationship chains) |
| Best for | Factual Q&A, semantic search | Dependency analysis, causal queries |
| Example | "What is API rate limiting?" | "Which services depend on the auth module?" |
| Build cost | Low (just embed docs) | High (LLM entity extraction required) |
```python
from langchain_community.graphs import Neo4jGraph
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI

# Connect to Neo4j graph database
graph = Neo4jGraph(
    url="bolt://localhost:7687",
    username="neo4j",
    password="your_password"
)

# Extract entities + relationships from documents
llm = ChatOpenAI(model="gpt-4o", temperature=0)
transformer = LLMGraphTransformer(llm=llm)
graph_documents = transformer.convert_to_graph_documents(documents)
graph.add_graph_documents(graph_documents)
print(f"Graph built with {len(graph_documents)} document graphs")

# Query the graph with natural language
chain = GraphCypherQAChain.from_llm(llm=llm, graph=graph, verbose=True)
result = chain.invoke({
    "query": "Which test suites cover the payment module?"
})
# LLM generates:
# MATCH (t:TestSuite)-[:COVERS]->(m:Module {name: 'payment'}) RETURN t

# Microsoft GraphRAG for community-based search
# graphrag query --method global --query "Main testing challenges?"
# graphrag query --method local --query "Auth service test needs?"
```
Create "Graph RAG with Neo4j"
New flow. Add "Neo4j Graph" component. Configure URL (bolt://localhost:7687), username, and password.
Entity Extraction Pipeline
Add File Loader → "LLMGraphTransformer" node. Connect ChatOpenAI as the extraction LLM. The transformer output goes to the Neo4j Graph for storage.
Query Pipeline
Add Chat Input → "GraphCypherQAChain" node. Connect both Neo4j Graph and ChatOpenAI to the chain. This generates Cypher queries from natural language.
Alternative: Custom Component
For Microsoft GraphRAG: Add a "Python Function" node that calls the graphrag CLI or Python API for global/local search modes.
Test Relationship Queries
Test with: "Which components depend on the database service?" "What modules does the login feature touch?" Verify graph traversal returns accurate relationships.
Entity extraction accuracy: After indexing, query the graph schema. Verify all critical entities (services, modules, APIs) and relationships are correctly captured.
Cypher query validation: Log the generated Cypher queries. Check if the LLM produces syntactically valid queries. Invalid Cypher = silent failure.
Circular dependency handling: Index a document with circular module dependencies. Verify the graph traversal doesn't infinite-loop.
Global vs. local search comparison: Use the same query on both modes. Global = broad community summaries; Local = specific entity details. Verify each returns appropriate granularity.
Construction cost benchmark: Measure time and token cost to build the graph vs. a vector store on the same document set. Graph RAG is typically 10–50x more expensive to build.
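For the circular-dependency test, the property to verify is that traversal tracks visited nodes. A minimal reference sketch (plain Python, not Neo4j's traversal engine) shows the behavior your graph queries should match:

```python
def reachable(graph: dict[str, list[str]], start: str) -> set[str]:
    """Dependency traversal with a visited set, so cycles cannot loop forever."""
    visited, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue  # already expanded: this is what breaks the cycle
        visited.add(node)
        stack.extend(graph.get(node, []))
    return visited

# auth → db → cache → auth is a circular dependency
deps = {"auth": ["db"], "db": ["cache"], "cache": ["auth"]}
print(reachable(deps, "auth"))  # terminates despite the cycle
```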
ReAct Pattern: Reason → Act → Observe → Reason → Act → ... → Final Answer. Each cycle can call a different tool.
🎯 Dynamic Tool Selection
The agent decides which retriever or tool to call based on the query context — API docs, UI docs, bug database, or even web search. No hard-coded routing.
🔄 Iterative Retrieval
Can search multiple times with refined queries. If the first search doesn't yield enough, the agent reformulates and searches again.
🌐 Multi-Source Fusion
Can query internal vector stores AND external web search in a single turn, then synthesize a comprehensive answer from both sources.
⚠️ Cost Warning
Agentic RAG uses 3–10x more tokens than basic RAG due to the reasoning loop. Always set max_iterations to prevent runaway costs. Monitor token usage carefully.
```python
from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI
from langchain.tools.retriever import create_retriever_tool
from langchain import hub

# Define specialized retriever tools
api_tool = create_retriever_tool(
    api_retriever,
    name="api_test_docs",
    description="Search API testing documentation and test cases"
)
ui_tool = create_retriever_tool(
    ui_retriever,
    name="ui_test_docs",
    description="Search UI testing guides, Selenium and Playwright docs"
)
bug_tool = create_retriever_tool(
    bug_retriever,
    name="bug_reports",
    description="Search historical bug reports and known issues"
)
tools = [api_tool, ui_tool, bug_tool]

# Create ReAct agent (Reason + Act loop)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = hub.pull("hwchase17/react")
agent = create_react_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=5,  # Prevent infinite loops!
    handle_parsing_errors=True
)

# Multi-tool query: agent decides which tools to use
response = agent_executor.invoke({
    "input": "Were there payment API bugs last quarter, "
             "and what test cases should cover them?"
})

# Agent reasoning trace:
# Thought: I need both bug history and API test cases
# Action: bug_reports("payment API Q3")
# Observation: Found 3 bugs: AUTH-401, TIMEOUT-503...
# Action: api_test_docs("payment endpoint testing")
# Final Answer: Based on bugs found, recommend...
```
Create "Agentic RAG with Tool Selection"
New flow. Add Chat Input. Create 3 different vector stores (API docs, UI docs, Bug reports).
Create Retriever Tools
Add "Retriever Tool" components (one per vector store). Each tool needs a name and description — the agent uses these to decide which tool to call.
Add Tool Calling Agent
Add "Tool Calling Agent" or "ReAct Agent" component. Connect all retriever tools to the agent's tool input. Set max_iterations=5.
Connect GPT-4o as Reasoning LLM
Connect ChatOpenAI (GPT-4o) to the agent. GPT-4o's strong reasoning is recommended — weaker models often fail to parse tool results correctly.
Test with Complex Multi-Tool Queries
Send: "Find related bugs for auth API and suggest new test cases." Monitor the agent's reasoning trace to verify it selects appropriate tools.
max_iterations guard test: Construct a query that forces the agent to loop (unanswerable with available tools). Verify it terminates at max_iterations with a graceful message.
Tool selection accuracy: For 20 queries where the correct tool is known, verify the agent picks the right tool first. Track misdirected tool calls.
Token budget test: Monitor tokens per query. Agentic RAG with 3 iterations uses ~3x tokens. Establish baseline and alert thresholds.
Tool failure mid-execution: Simulate a tool returning an error (empty results). Verify the agent doesn't hallucinate a response and either retries or reports failure.
Hallucinated tool call test: Verify the agent only calls tools that actually exist. An agent that invents non-existent tool names is a critical failure mode.
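A token budget guard can be a few lines of bookkeeping. `TokenBudget` is a hypothetical helper, not a library class; in a LangChain pipeline you would feed it the per-call usage reported in each model response's metadata.

```python
class TokenBudget:
    """Track cumulative token usage per query and flag budget overruns."""
    def __init__(self, max_tokens_per_query: int):
        self.max_tokens = max_tokens_per_query
        self.used = 0

    def record(self, tokens: int) -> None:
        self.used += tokens

    def exceeded(self) -> bool:
        return self.used > self.max_tokens

budget = TokenBudget(max_tokens_per_query=4000)
for step_tokens in (1200, 1500, 1800):  # three ReAct iterations
    budget.record(step_tokens)
print(budget.used, budget.exceeded())  # 4500 True: alert and cut the loop
```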
| REFLECTION TOKEN | QUESTION ANSWERED | VALUES |
|---|---|---|
| [Retrieve] | Do I need to look up information? | Yes / No / Continue |
| [ISREL] | Is this retrieved document relevant to the query? | Relevant / Irrelevant |
| [ISSUP] | Is my response supported by the retrieved docs? | Fully / Partially / No Support |
| [ISUSE] | Is the response useful to the user? | 5 (best) → 1 (worst) |
```python
from langgraph.graph import StateGraph, END
from typing import TypedDict, List

class RAGState(TypedDict):
    question: str
    documents: List[str]
    generation: str
    needs_retrieval: bool
    is_relevant: bool
    is_grounded: bool

def decide_retrieval(state):
    # Reflection token: [Retrieve]
    prompt = f'Does "{state["question"]}" need external info? (yes/no)'
    response = llm.invoke(prompt)
    return {"needs_retrieval": "yes" in response.content.lower()}

def check_relevance(state):
    # Reflection token: [ISREL]
    prompt = f'Are these docs relevant to "{state["question"]}"?\n'
    prompt += "\n".join(state["documents"][:2])
    response = llm.invoke(prompt)
    return {"is_relevant": "yes" in response.content.lower()}

def check_groundedness(state):
    # Reflection token: [ISSUP]
    prompt = "Is this answer grounded in the context?\n"
    prompt += f'Answer: {state["generation"]}\nContext: {state["documents"][0]}'
    response = llm.invoke(prompt)
    return {"is_grounded": "yes" in response.content.lower()}

# Build the LangGraph state machine
# (the `retrieve` and `generate` node functions are assumed defined elsewhere)
workflow = StateGraph(RAGState)
workflow.add_node("decide", decide_retrieval)
workflow.add_node("retrieve", retrieve)
workflow.add_node("relevance", check_relevance)
workflow.add_node("generate", generate)
workflow.add_node("grounded", check_groundedness)
workflow.set_entry_point("decide")
workflow.add_conditional_edges(
    "decide",
    lambda s: "retrieve" if s["needs_retrieval"] else "generate"
)
workflow.add_edge("retrieve", "relevance")
workflow.add_conditional_edges(
    "relevance",
    lambda s: "generate" if s["is_relevant"] else "retrieve"
)
workflow.add_edge("generate", "grounded")
workflow.add_conditional_edges(
    "grounded",
    lambda s: END if s["is_grounded"] else "retrieve"
)
app = workflow.compile()
```
Create "Self-RAG with Reflection"
Best implemented using "Custom Component" nodes in Langflow for each reflection step.
Retrieval Decision Node
Add Chat Input → Custom Component ("Retrieval Decision") that prompts the LLM: "Does this question need external info?" → branches yes/no.
Relevance Check Node
After Retriever, add "Relevance Check" Custom Component. If irrelevant, loop back to retriever with reformulated query.
Groundedness Check Node
After generation, add "Groundedness Check" node. If not grounded → trigger re-retrieval. If grounded → pass to Chat Output.
Monitor Reflection Trace
Use Langflow's execution trace to debug each reflection decision. This is essential for understanding why the system loops and whether each reflection step is calibrated correctly.
Skip-retrieval test: Ask a common knowledge question ("What is HTTP?"). Verify the [Retrieve]=No path is taken and no vector search is performed, saving latency and cost.
Forced loop test: Index only irrelevant documents. Verify the relevance-check loop terminates at a max_retries limit rather than looping infinitely.
Groundedness hallucination test: Generate responses that subtly add facts not in the context. Verify [ISSUP] = "No Support" is correctly triggered for fabricated details.
Latency overhead benchmark: Compare Self-RAG latency vs. Naive RAG on the same queries. Self-RAG is typically 2–4x slower. Quantify the quality improvement to justify the cost.
Calibration test: The reflection tokens should be consistent — the same doc+query pair should always receive the same ISREL verdict. Test consistency across 10 identical runs.
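The calibration test reduces to a majority-agreement check over repeated verdicts. The verdict lists below are stand-ins for real ISREL outputs collected across identical runs.

```python
def is_calibrated(verdicts: list[str], min_agreement: float = 0.9) -> bool:
    """A reflection grader is calibrated if repeated runs mostly agree."""
    majority = max(set(verdicts), key=verdicts.count)
    return verdicts.count(majority) / len(verdicts) >= min_agreement

# 10 ISREL verdicts for the same doc+query pair
stable   = ["Relevant"] * 10
unstable = ["Relevant"] * 6 + ["Irrelevant"] * 4

print(is_calibrated(stable))    # True
print(is_calibrated(unstable))  # False: the grader is unreliable
```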
✅ CORRECT (>0.7)
Retrieved documents are highly relevant. Use them directly for generation. No web search needed. Best case scenario for latency and cost.
⚠️ AMBIGUOUS (0.4–0.7)
Retrieved docs are partially relevant. Supplement with web search results. Merge both sources before generation. Balances internal knowledge with current web data.
❌ INCORRECT (<0.4)
Retrieved docs are irrelevant. Discard entirely. Fall back fully to web search (Tavily, SerpAPI). Prevents confidently wrong answers based on poor retrieval.
🔍 Query Rewriting
Before web search fallback, CRAG optionally rewrites the query for better search performance. The original QA question may not be ideal for a search engine.
```python
from langgraph.graph import StateGraph, END
from langchain_community.tools.tavily_search import TavilySearchResults

web_search = TavilySearchResults(max_results=3)

def grade_documents(state):
    """Grade each retrieved document 0.0-1.0 for relevance."""
    question = state["question"]
    scored_docs = []
    for doc in state["documents"]:
        grade_prompt = f"""Grade relevance 0.0-1.0:
Question: {question}
Document: {doc.page_content[:500]}
Score:"""
        score = float(llm.invoke(grade_prompt).content.strip())
        scored_docs.append((doc, score))

    avg_score = sum(s for _, s in scored_docs) / len(scored_docs)
    if avg_score > 0.7:
        action = "correct"
    elif avg_score > 0.4:
        action = "ambiguous"
    else:
        action = "incorrect"

    relevant = [d for d, s in scored_docs if s > 0.5]
    return {"action": action, "documents": relevant}

def web_search_fallback(state):
    """Fall back to web search when retrieval confidence is low."""
    results = web_search.invoke({"query": state["question"]})
    web_docs = [r["content"] for r in results]
    existing = state.get("documents", [])
    return {"documents": existing + web_docs}

def route_after_grading(state):
    if state["action"] == "correct":
        return "generate"
    return "web_search"  # ambiguous or incorrect

# `CRAGState`, `retrieve`, and `generate` are assumed defined, as in Self-RAG
workflow = StateGraph(CRAGState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade", grade_documents)
workflow.add_node("web_search", web_search_fallback)
workflow.add_node("generate", generate)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade")
workflow.add_conditional_edges("grade", route_after_grading)
workflow.add_edge("web_search", "generate")
workflow.add_edge("generate", END)
crag_app = workflow.compile()
```
Create "Corrective RAG (CRAG)"
New flow. Add Chat Input → Retriever (connected to your vector store, k=5).
Add Grading Component
Add a "Custom Component" for the grading logic. Takes retrieved docs + query, uses LLM to score relevance 0.0–1.0 for each doc, returns average score and action.
Conditional Router on Score
Add "Conditional Router" branching: Correct (>0.7) → direct to prompt; Ambiguous/Incorrect → web search fallback.
Web Search Fallback
Add "Tavily Search" (or SerpAPI) component for the fallback path. Add a "Combine Documents" node to merge retrieved + web results for ambiguous cases.
Test with Out-of-KB Queries
Send queries your vector store cannot answer. Verify web search fallback activates. Monitor grading scores to calibrate the 0.4/0.7 thresholds for your domain.
Threshold boundary test: Construct docs that score exactly 0.4 and 0.7. Verify the routing decision is deterministic at these boundaries.
Double-failure test: When web search also returns irrelevant results, does the system still generate a response? Verify it uses the available context and doesn't hallucinate confidently.
Grader calibration test: Run the same doc-query pair 10 times. The score should be consistent (within ±0.1). High variance means the grader is unreliable.
Latency with web fallback: Measure latency for each path (correct vs. fallback). Web search adds ~1–3 seconds. Verify this is acceptable for your use case.
Private data leak test: Verify the web search query doesn't include sensitive internal data from retrieved docs that scored poorly. Sanitize the search query before calling external APIs.
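The threshold boundary test hinges on whether the comparisons are strict. A sketch of the routing rule from the grading logic above makes the boundary behavior explicit:

```python
def route(avg_score: float) -> str:
    """CRAG action thresholds (0.7 / 0.4) with strict '>' comparisons."""
    if avg_score > 0.7:
        return "correct"
    elif avg_score > 0.4:
        return "ambiguous"
    return "incorrect"

# Strict '>' means exact boundary values fall into the LOWER bucket
print(route(0.7))   # ambiguous
print(route(0.4))   # incorrect
print(route(0.71))  # correct
```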
| QUERY TYPE | BM25 KEYWORD | VECTOR SEMANTIC | WINNER |
|---|---|---|---|
| "Error ERR_AUTH_403" | Excellent — exact match | Poor — number confusion | BM25 |
| "login doesn't work" | Poor — no exact keywords | Excellent — semantic intent | Vector |
| "TC-2451 test case" | Excellent — ID match | Poor — ID is meaningless in vector space | BM25 |
| "flaky intermittent failures" | Moderate | Excellent — understands concept | Vector |
| "Explain ERR_AUTH_403 and fix" | Good for code part | Good for explanation part | Both |
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# BM25: keyword search on raw text
bm25_retriever = BM25Retriever.from_documents(documents=chunks, k=4)

# Vector: semantic search on embeddings
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Ensemble: weights favor semantic (0.6) over keyword (0.4)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]
)
results = hybrid_retriever.invoke("Error ERR_AUTH_403 during login test")
# BM25 finds exact error code matches
# Vector finds semantically related auth failure docs
# Results merged and deduplicated automatically

# ── Reciprocal Rank Fusion (manual) ──────────────────
def reciprocal_rank_fusion(results_list, k=60):
    fused_scores = {}
    for results in results_list:
        for rank, doc in enumerate(results):
            doc_id = doc.page_content[:100]
            if doc_id not in fused_scores:
                fused_scores[doc_id] = {"doc": doc, "score": 0}
            fused_scores[doc_id]["score"] += 1.0 / (rank + k)
    sorted_docs = sorted(
        fused_scores.values(), key=lambda x: x["score"], reverse=True
    )
    return [item["doc"] for item in sorted_docs]

bm25_r = bm25_retriever.invoke(query)
vector_r = vector_retriever.invoke(query)
hybrid_r = reciprocal_rank_fusion([bm25_r, vector_r])
```
Create "Hybrid RAG (BM25 + Vector)"
New flow. Add File Loader → Text Splitter for document processing.
Create Two Parallel Branches
(a) Chroma vector store with OpenAI Embeddings. (b) BM25 Retriever component. Both are connected to the same chunked documents.
Connect Query to Both Retrievers
Add Chat Input. Connect the query to both the Chroma Retriever AND the BM25 Retriever simultaneously. They run in parallel.
Add Ensemble/RRF Merge Node
Use "Ensemble Retriever" or a Python Function implementing RRF. Set weights [0.4, 0.6] to favor semantic search slightly.
Test Both Query Types
Test exact-match queries (error codes, test IDs) and semantic queries (concepts, behaviors). Verify both retrievers contribute to the final result set.
BM25 dominance test: Query with exact error codes and test case IDs. Log which retriever found each result. BM25 should dominate for exact-match queries.
Vector dominance test: Query with natural language descriptions ("tests that verify user authentication flow"). Vector search should return the most relevant results.
Weight tuning test: Try weights [0.5, 0.5], [0.3, 0.7], [0.7, 0.3]. Measure precision@4 for each ratio on your test query set. Find the optimal balance for your corpus.
Deduplication test: When both retrievers return the same document, it should appear exactly once in the final results with a higher fused score. Verify this behavior.
Mixed query test: "Explain ERR_AUTH_403 and how to fix it" — needs both exact code matching (BM25) and explanatory context (Vector). Verify the hybrid result contains both types of content.
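For intuition on why a document found by both retrievers rises to the top, here is a minimal weighted rank-fusion sketch. It is a simplification: the `weight / (rank + 1)` scoring is illustrative, not `EnsembleRetriever`'s exact formula.

```python
def weighted_fuse(bm25_ranked: list[str], vector_ranked: list[str],
                  weights: tuple[float, float] = (0.4, 0.6)) -> list[str]:
    """Each retriever contributes weight / (rank + 1) to a doc's fused score."""
    scores: dict[str, float] = {}
    for weight, ranking in zip(weights, (bm25_ranked, vector_ranked)):
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + weight / (rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25_r   = ["ERR_AUTH_403 spec", "auth runbook"]        # exact-match hits
vector_r = ["auth failure guide", "ERR_AUTH_403 spec"]  # semantic hits

fused = weighted_fuse(bm25_r, vector_r)
print(fused[0])  # the doc found by BOTH retrievers is fused to the top
```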
| STRATEGY | HOW IT WORKS | BEST FOR |
|---|---|---|
| Text Extraction | Convert images/tables to text, then standard RAG | Screenshots with text, simple tables |
| Multi-Vector | Store text summaries + original media; retrieve summaries, return media | Complex diagrams, charts |
| CLIP Embeddings | Embed images and text in the same vector space | Image-text similarity search |
| Vision LLM | GPT-4V / Claude describes images; index descriptions | Complex visual content, screenshots |
```python
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from langchain_core.documents import Document
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
import uuid

vision_llm = ChatOpenAI(model="gpt-4o", temperature=0)

def summarize_image(image_base64: str) -> str:
    """Use a Vision LLM to describe an image for a QA engineer."""
    message = HumanMessage(content=[
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
        {"type": "text",
         "text": "Describe for a QA engineer: error messages, "
                 "UI elements, test results, and status codes visible."}
    ])
    response = vision_llm.invoke([message])
    return response.content

# Multi-Vector store: summaries → vector store, originals → byte store
byte_store = InMemoryByteStore()
id_key = "doc_id"
multi_retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=byte_store,
    id_key=id_key
)

# Index each image
for img_path in image_paths:
    img_b64 = encode_image(img_path)
    summary = summarize_image(img_b64)
    doc_id = str(uuid.uuid4())
    # Summary → vector store (for retrieval)
    multi_retriever.vectorstore.add_documents([
        Document(page_content=summary, metadata={id_key: doc_id})
    ])
    # Original image → byte store (for return)
    multi_retriever.docstore.mset([(doc_id, img_b64)])

# Query retrieves summaries but RETURNS original images
results = multi_retriever.invoke("Login page 404 error screenshot")
```
Create "Multi-Modal RAG"
New flow. Add a File Loader that accepts multiple types (PDF, PNG, JPG, DOCX).
Add Unstructured Loader
For PDFs: Use "Unstructured" loader with strategy="hi_res" to extract text, tables, and embedded images separately.
Content Router by Type
Add a Custom Component that routes by content type: text → text splitter, images → Vision LLM summarizer, tables → table summarizer.
Unified Vector Store
All summarized content feeds into one vector store. Store original content (images, tables) in a separate byte store for retrieval return.
Multi-Modal Generator
Use ChatOpenAI GPT-4o (not mini) as the final answer generator — it can reason about both text and image context simultaneously.
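The routing step above can be sketched in plain Python. This is a minimal dispatcher, not a real Langflow API — the handler functions (`split_text`, `summarize_image`, `summarize_table`) are hypothetical stand-ins for the components each content type would feed into:

```python
from pathlib import Path

# Hypothetical handlers standing in for the Langflow components.
def split_text(path):       return f"text-chunks:{path}"
def summarize_image(path):  return f"image-summary:{path}"
def summarize_table(path):  return f"table-summary:{path}"

ROUTES = {
    ".pdf": split_text, ".docx": split_text, ".txt": split_text,
    ".png": summarize_image, ".jpg": summarize_image, ".jpeg": summarize_image,
    ".csv": summarize_table, ".xlsx": summarize_table,
}

def route_content(path: str) -> str:
    """Dispatch a file to the right summarizer by extension."""
    handler = ROUTES.get(Path(path).suffix.lower())
    if handler is None:
        raise ValueError(f"Unsupported content type: {path}")
    return handler(path)

print(route_content("login_error.png"))  # → image-summary:login_error.png
print(route_content("test_plan.docx"))   # → text-chunks:test_plan.docx
```

Extension-based routing is the simplest choice; a production router would also sniff file contents, since a `.pdf` can contain pure text, scanned images, or both.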
Image summarization accuracy: Index a known error screenshot. Query for its content. Verify the Vision LLM correctly identifies error codes, button states, and failure messages.
Table extraction fidelity: Extract a performance table from a PDF. Verify all numbers, column headers, and row labels are preserved exactly — numeric errors are silent and dangerous.
Mixed context query: Ask a question that requires both text and image context ("What error is shown in the screenshot mentioned in section 3.2?"). Verify both sources are retrieved and synthesized.
Low-quality input test: Index blurry screenshots and scanned documents. Verify graceful degradation — the system should flag low-confidence image descriptions rather than hallucinate content.
Original vs. summary retrieval: Verify that when an image-related query is answered, the original image (not just the text summary) is returned in the response for human verification.
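The last check above is easy to automate. Here is a toy sketch with an in-memory stand-in for the multi-vector store — `FakeMultiVectorStore` is illustrative only, not a real LangChain class — showing the assertion that matters: queries match on summaries, but originals come back:

```python
class FakeMultiVectorStore:
    """Toy stand-in: maps doc_ids to (summary, original_bytes) pairs."""
    def __init__(self):
        self.records = {}

    def add(self, doc_id: str, summary: str, original: bytes):
        self.records[doc_id] = (summary, original)

    def retrieve(self, query: str) -> list[bytes]:
        # Naive keyword match on summaries; real systems use embeddings.
        return [orig for summary, orig in self.records.values()
                if query.lower() in summary.lower()]

store = FakeMultiVectorStore()
store.add("img-1", "Screenshot of login page showing a 404 error",
          b"\x89PNG...original-bytes")

hits = store.retrieve("404 error")
# The QA assertion: original image bytes are returned, not the text summary.
assert hits and isinstance(hits[0], bytes)
print(f"{len(hits)} original image(s) returned")
```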
📝 Context Generation
For each chunk, an LLM reads the full document + the chunk and generates 2-3 sentences of contextual metadata describing: what document it's from, what section, and what key entities are mentioned.
⚡ Hybrid + Reranking
Anthropic's research shows best results combining Contextual Embeddings + Contextual BM25 + Cohere Reranking. This combination reduced retrieval failures by 67% in benchmarks.
💰 Indexing Cost
Each chunk requires one LLM call during indexing for context generation. For a 1000-chunk corpus, that's 1000 LLM calls. Use smaller models (claude-haiku, gpt-4o-mini) to control costs.
🎯 Best Use Cases
Large document collections with many similar sections (multiple API endpoints, multiple test plans), technical documentation where context disambiguation is critical for correct retrieval.
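The indexing cost above is worth estimating before committing. A back-of-envelope sketch — the per-token prices and token counts here are placeholder assumptions, so substitute your provider's current pricing:

```python
def contextual_indexing_cost(n_chunks: int,
                             tokens_per_call: int = 4500,
                             output_tokens: int = 100,
                             price_in_per_1m: float = 0.15,
                             price_out_per_1m: float = 0.60) -> float:
    """Estimate the one-time LLM cost of context generation:
    one call per chunk, each carrying ~4000 document tokens plus
    the chunk and prompt. Prices are assumed $/1M tokens."""
    input_cost = n_chunks * tokens_per_call * price_in_per_1m / 1_000_000
    output_cost = n_chunks * output_tokens * price_out_per_1m / 1_000_000
    return round(input_cost + output_cost, 2)

# 1000-chunk corpus = 1000 LLM calls, paid once at indexing time
print(f"Estimated indexing cost: ${contextual_indexing_cost(1000)}")
```

With a small model the one-time cost is typically under a dollar per thousand chunks; the real decision point is usually re-indexing frequency, since every document update pays this cost again.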
from langchain.retrievers import ContextualCompressionRetriever, EnsembleRetriever
from langchain.schema import Document
from langchain_cohere import CohereRerank  # pip install langchain-cohere
from langchain_community.retrievers import BM25Retriever
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

CONTEXT_PROMPT = """Here is the full document:
<document>
{document}
</document>

Here is a chunk from that document:
<chunk>
{chunk}
</chunk>

Provide 2-3 sentence context explaining:
1. What document/section this chunk belongs to
2. What the chunk is specifically about
3. Key entities or identifiers mentioned

Context:"""

def add_context_to_chunks(full_doc: str, chunks: list) -> list:
    """Prepend document context to each chunk before embedding."""
    contextualized = []
    for chunk in chunks:
        context = llm.invoke(
            CONTEXT_PROMPT.format(
                document=full_doc[:4000],
                chunk=chunk.page_content,
            )
        ).content
        enriched = f"{context}\n\n{chunk.page_content}"
        contextualized.append(Document(
            page_content=enriched,
            metadata={**chunk.metadata, "original": chunk.page_content},
        ))
    return contextualized

# ── Best combination: Contextual + Hybrid + Rerank ───
# Chroma, embeddings, chunks, full_doc_text come from the indexing pipeline
ctx_chunks = add_context_to_chunks(full_doc_text, chunks)

ctx_bm25 = BM25Retriever.from_documents(ctx_chunks, k=20)
ctx_vector = Chroma.from_documents(ctx_chunks, embeddings).as_retriever(
    search_kwargs={"k": 20}
)
hybrid = EnsembleRetriever(
    retrievers=[ctx_bm25, ctx_vector], weights=[0.4, 0.6]
)
final = ContextualCompressionRetriever(
    base_compressor=CohereRerank(top_n=5),
    base_retriever=hybrid,
)
# This combination: -67% retrieval failures (Anthropic research)
Create "Contextual RAG (Anthropic Style)"
New flow. Add File Loader → Text Splitter for initial chunking. Keep both the chunks AND the full document text.
Add Context Generator Component
Add a "Custom Component" that takes each chunk + full document and generates context using ChatOpenAI with the context prompt template above.
Feed Contextualized Chunks to Both Retrievers
Connect contextualized chunks to BOTH a Chroma vector store AND a BM25 Retriever. Both now search context-enriched content.
Add Ensemble + Cohere Rerank
Add "Ensemble Retriever" with weights [0.4, 0.6]. Follow with "Cohere Rerank" (top_n=5). This is Anthropic's recommended full stack.
Benchmark vs. Naive RAG
Run the same 20 test queries through Naive RAG and Contextual RAG. Compare retrieval precision. Expect ~20-40% improvement on ambiguous queries about specific sections.
Disambiguation test: Create a corpus with 5 similar sections (e.g., timeout config for 5 different APIs). Query for each specific one. Without context, retrieval fails; with context, each query returns the correct section.
Context accuracy test: Inspect generated context for each chunk. Verify the LLM correctly identifies the document section and key entities — hallucinated metadata defeats the entire purpose.
Indexing cost benchmark: Count API calls during indexing. A 1000-chunk corpus = 1000 LLM calls. Measure cost and time. This is paid upfront, not per query — ensure it fits your budget.
Baseline comparison: Run the same 50 queries through Naive RAG vs. Contextual RAG. Measure retrieval precision@5 for both. The improvement should justify the indexing cost.
Context hallucination check: Feed a deliberately obscure chunk where the context is hard to determine. Verify the context generator says "context unclear" rather than inventing a plausible-sounding but wrong description.
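The precision@5 numbers for the baseline comparison can be computed with a few lines. A minimal sketch, assuming you maintain a human-labeled set of relevant chunk IDs per query; the chunk IDs below are hypothetical:

```python
def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

# Hypothetical results for one query about API 3's timeout config
relevant = {"api-3-timeout", "api-3-retry"}  # human-labeled ground truth
naive_hits = ["api-1-timeout", "api-3-timeout", "api-2-timeout",
              "api-5-timeout", "api-4-timeout"]
ctx_hits = ["api-3-timeout", "api-3-retry", "api-1-timeout",
            "api-2-timeout", "api-4-timeout"]

print(precision_at_k(naive_hits, relevant))  # → 0.2
print(precision_at_k(ctx_hits, relevant))    # → 0.4
```

Average this over all 50 queries for each pipeline; the delta between the two averages is the number that has to justify the indexing cost.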
Faithfulness
Is the answer grounded in retrieved context? Detects hallucinations where the model adds facts not in the docs.
Answer Relevancy
Does the answer actually address the question? A grounded but off-topic answer scores low here.
Context Precision
Are the top retrieved documents actually relevant? Measures retrieval signal-to-noise ratio.
Context Recall
Were all relevant documents retrieved? Low recall means the answer is incomplete due to missed retrieval.
Answer Correctness
Is the answer factually correct against ground truth? Requires human-labeled expected answers.
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy, context_precision, context_recall,
)
from datasets import Dataset

eval_data = {
    "question": ["How to test API rate limiting?", "Login test cases?"],
    "answer": generated_answers,       # From your RAG pipeline
    "contexts": retrieved_contexts,    # Retrieved chunks list
    "ground_truth": expected_answers,  # Human-written expected answers
}

results = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# {'faithfulness': 0.87, 'answer_relevancy': 0.91,
#  'context_precision': 0.78, 'context_recall': 0.85}
QA Test Plan Template
| TEST CATEGORY | TEST SCENARIOS | PASS CRITERIA |
|---|---|---|
| Retrieval Quality | Relevant docs in top-k, duplicate handling, empty results | Context precision > 0.80 |
| Generation Quality | Faithfulness, hallucination detection, answer completeness | Faithfulness > 0.85 |
| Edge Cases | Empty KB, very long queries, special characters, multilingual | Graceful degradation |
| Performance | Latency p50/p95/p99, throughput, concurrent users | p95 latency < 3 seconds |
| Security | Prompt injection, data leakage, PII in responses | Zero PII leakage |
| Robustness | Typos in queries, paraphrased questions, adversarial inputs | Consistent quality ±5% |
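The pass criteria in the table can be enforced as automated quality gates in CI. A minimal sketch — the thresholds mirror the table, and the `metrics` dict stands in for real output from a RAGAS run plus a load test:

```python
# Thresholds from the QA test plan table
QUALITY_GATES = {
    "context_precision": 0.80,
    "faithfulness": 0.85,
}
MAX_P95_LATENCY_S = 3.0

def check_gates(metrics: dict) -> list[str]:
    """Return a list of gate failures; an empty list means the build passes."""
    failures = [f"{name}={metrics[name]:.2f} < {floor}"
                for name, floor in QUALITY_GATES.items()
                if metrics.get(name, 0.0) < floor]
    p95 = metrics.get("p95_latency_s")
    if p95 is None or p95 > MAX_P95_LATENCY_S:  # missing metric fails closed
        failures.append(f"p95 latency {p95}s > {MAX_P95_LATENCY_S}s")
    return failures

# Example: metrics would normally come from RAGAS + load-test output
metrics = {"context_precision": 0.78, "faithfulness": 0.87, "p95_latency_s": 2.4}
for failure in check_gates(metrics):
    print("GATE FAILED:", failure)
```

Failing the build on a metric regression is what turns the table from documentation into an actual test suite.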
Choose the Right RAG
Start simple. Measure. Add complexity only where metrics show improvement.
Naive RAG
Simple Q&A, rapid prototyping. Build this first.
Advanced RAG
Production systems needing quality. HyDE + Reranking.
Modular RAG
Multiple knowledge domains. Plug-and-play architecture.
Graph RAG
Relationship queries. Module dependency analysis.
Agentic RAG
Complex multi-step research. Dynamic tool selection.
Self-RAG
High-reliability requirements. Built-in quality assertions.
Corrective RAG
Unreliable knowledge bases. Web search fallback.
Hybrid RAG
Mixed exact + semantic search. Error codes + concepts.
Multi-Modal RAG
Screenshots, diagrams, charts. Visual test evidence.
Contextual RAG
Large document collections. Anthropic's method: 67% fewer retrieval failures.
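The guide above can be condensed into a lightweight chooser. This mapping is a deliberate simplification — real selection should be driven by the measured metrics discussed above, not a lookup table:

```python
# Requirement → RAG type, mirroring the selection guide above
RAG_GUIDE = {
    "rapid prototyping": "Naive RAG",
    "production quality": "Advanced RAG (HyDE + Reranking)",
    "multiple knowledge domains": "Modular RAG",
    "relationship queries": "Graph RAG",
    "multi-step research": "Agentic RAG",
    "high reliability": "Self-RAG",
    "unreliable knowledge base": "Corrective RAG",
    "exact + semantic search": "Hybrid RAG",
    "screenshots and diagrams": "Multi-Modal RAG",
    "large similar collections": "Contextual RAG",
}

def choose_rag(requirement: str) -> str:
    """Map a dominant requirement to a starting architecture."""
    return RAG_GUIDE.get(requirement, "Start with Naive RAG, then measure")

print(choose_rag("relationship queries"))  # → Graph RAG
print(choose_rag("something else"))        # → Start with Naive RAG, then measure
```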