Agentic RAG — Let the Agent Decide What to Retrieve
Classic RAG is a brainless conveyor: query, retrieve, answer. Agentic RAG is when the agent decides whether to search, what to search for, and whether the results are good enough. We break down how to turn a linear pipeline into a decision loop — using LangGraph and ChromaDB.
Intermediate · RAG & Data · 30 min · LangGraph, Python, ChromaDB
1
A simple question needs a conveyor. A complex one needs a detective
'What is the return policy?' — one search, one answer, RAG handles it. 'Compare the two teams' approaches and decide which is better for our case' — and the conveyor breaks. Not because the model is weak or the database is bad, but because the architecture cannot stop and ask: 'Did I find enough? Did I understand the question correctly?'
Agentic RAG adds exactly these checkpoints. The result — an agent that searches iteratively, not blindly.
❌ Classic RAG
- Query → retrieve → answer (always)
- No evaluation of retrieval quality
- One search for the entire answer
✅ Agentic RAG
- Plan first, then retrieve
- Evaluates results before answering
- Can reformulate and search again
80% of RAG quality problems are not model problems. They are architecture problems: bad chunking, single-pass retrieval where iterative search is needed.
2
Plan → retrieve → evaluate → decide. That is the whole secret
All of Agentic RAG comes down to one loop with four steps. At each step the agent makes a decision: search more, reformulate the query, or answer. LangGraph implements this literally — graph nodes and conditional edges between them.
Plan → Retrieve → Evaluate → if sufficient → Answer; if insufficient → Reformulate → Retrieve again
agent loop:
1. plan → split the question into sub-queries
2. retrieve → find documents for each sub-query
3. evaluate → are the documents relevant? (threshold > 0.7)
4. decide:
   enough → answer
   too little / irrelevant → reformulate → back to step 2
   3 iterations with no result → "no data"
Add a hard iteration limit (e.g. 3 cycles). Without it the agent may search forever for the 'perfect answer' — especially when the knowledge base doesn't cover the topic.
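The loop above can be sketched in plain Python. This is a minimal sketch, not the LangGraph implementation: in LangGraph each step would become a graph node and the decision a conditional edge, but the control flow is the same. All helper functions (`retrieve`, `grade`, `reformulate`, `answer`) are hypothetical and injected as parameters; in a real system `retrieve` would query ChromaDB and `grade` would be an LLM call.

```python
RELEVANCE_THRESHOLD = 0.7
MAX_ITERATIONS = 3  # hard limit so the agent cannot loop forever


def agent_loop(question, retrieve, grade, reformulate, answer):
    """Plan → retrieve → evaluate → decide, with a hard iteration cap."""
    query = question
    for iteration in range(1, MAX_ITERATIONS + 1):
        docs = retrieve(query)  # step 2: search
        # step 3: keep only documents above the relevance threshold
        relevant = [d for d in docs if grade(question, d) > RELEVANCE_THRESHOLD]
        if relevant:  # step 4: decide
            return answer(question, relevant)
        query = reformulate(question, query)  # insufficient → new query
    return "No data in the knowledge base."  # honest exit after 3 tries
```

With toy stand-ins for the helpers, a relevant hit answers on the first iteration, and an empty knowledge base falls through to the honest "no data" exit instead of looping forever.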
3
Document chunking decides everything before the agent even runs
You can build a perfect agent — and it will still fail if documents are chunked wrong. The information is in the database, but split so that search cannot find it. This is the most common and most ignored cause of poor RAG.
There is no universal strategy. Fixed-size breaks semantic units. Semantic chunking is more precise but expensive. Hierarchical (parent-child) is great for long documents but harder to implement. The choice depends on your documents and queries — and this decision should come before writing agent code.
| Strategy | Best for | Weakness |
|---|---|---|
| Fixed-size | Uniform text (logs, articles) | Breaks semantic units |
| By paragraph | Structured documents | Loses cross-section context |
| Hierarchical (parent-child) | Long documents, sections | More complex to implement |
| Semantic | Heterogeneous content | Slower and more expensive |
The "small-to-big" strategy: index small chunks for precise retrieval, but return the full parent block to the agent. Retrieval precision + sufficient context.
4
An agent that can't say 'I don't know' is dangerous
The most dangerous RAG mode is a confident answer based on irrelevant documents. A classic pipeline always returns something, even if it's completely off-topic.
The solution is an explicit grading step before answering. A separate LLM call: 'Does this document answer the question? Yes / No / Partially'. Based on the aggregate, the agent decides: answer, search more, or honestly say 'no data'. Two signals to distinguish: 'found irrelevant content' (reformulate query) vs 'found too little' (expand search). Different problems — different actions.
for each retrieved document:
  question: "Does this document answer the question?"
  answer: yes | partially | no
  no → discard
  yes/partially → keep
if ≥ 2 relevant documents remain → enough to answer
if 0 remain → reformulate the query and search again
Don't merge relevance grading and completeness check into one prompt. Relevance: 'is this on-topic?' Completeness: 'is this enough to answer?' Two tasks — two calls.
5
Enough for a full answer — or only partially?
Relevance grading and completeness assessment are different things. Three documents may be on-topic, but if the key fact is missing — the answer will be incomplete. Self-reflection is the final check before generation: the agent evaluates the full picture, not individual pieces.
If after 2-3 iterations the agent hasn't gathered enough — it's better to answer partially or honestly say 'no data' than to hallucinate.
Context is sufficient when...
≥2 relevant documents on the topic
Documents cover all parts of the question
No contradictions between sources
Only one document was found, but it directly answers the question
After 3 iterations nothing useful — exit with 'I don't know'
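The checklist above can be collapsed into one self-reflection predicate. A sketch under stated assumptions: `covers_all_parts`, `directly_answers`, and `no_contradictions` stand in for LLM judgments that a real system would obtain with separate calls; the function name is illustrative.

```python
def is_sufficient(relevant_docs, covers_all_parts, directly_answers,
                  no_contradictions, iteration, max_iterations=3):
    """Final check before generation: is the gathered context enough?"""
    if iteration > max_iterations:
        return False  # budget exhausted → exit with "I don't know"
    if len(relevant_docs) == 1:
        return directly_answers  # one doc can suffice if it answers directly
    # the general case: ≥2 relevant docs, full coverage, consistent sources
    return len(relevant_docs) >= 2 and covers_all_parts and no_contradictions
```

The iteration cap comes first on purpose: no amount of coverage rescues an agent that has already burned its search budget.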
6
If you can't explain the answer from logs — observability is missing
A black-box agent is not a feature, it's technical debt. Bad answer? You need to understand: was the problem in retrieval, grading, or the final prompt? Without traces it's reading tea leaves.
LangGraph helps: each graph node is a logging point. Save at every step: original question, sub-queries, retrieved documents with grades, the agent's decision and reasoning. Add metadata to the final answer — how many iterations, how many documents were discarded. After a week of real queries, this data will show where the system breaks most often.
at every step, save:
  iteration_number, decision, reason,
  documents_found, documents_relevant
in the final answer, include:
  how many iterations, which sub-queries,
  how many documents were discarded — an "X-ray" for debugging
Store traces in a database, not just logs. After a week you will see patterns: which questions need more iterations, where the agent fails most often. This is data for improving prompts and chunking.
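A minimal sketch of such a trace store, assuming the standard-library `sqlite3` as the database (any database works; the table and field names are illustrative, mirroring the fields listed above).

```python
import sqlite3


def init_trace_db(path=":memory:"):
    """Create the trace table; one row per agent decision step."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS traces (
        question TEXT, iteration INTEGER, decision TEXT, reason TEXT,
        docs_found INTEGER, docs_relevant INTEGER)""")
    return conn


def log_step(conn, question, iteration, decision, reason,
             docs_found, docs_relevant):
    """Record one step of the loop so any answer can be reconstructed later."""
    conn.execute("INSERT INTO traces VALUES (?, ?, ?, ?, ?, ?)",
                 (question, iteration, decision, reason,
                  docs_found, docs_relevant))
    conn.commit()
```

With LangGraph, the natural place to call `log_step` is inside each graph node, since every node already has the full state in hand.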
Result
A RAG agent on LangGraph with a plan → retrieve → evaluate → decide loop. It can reformulate queries, grade relevance and completeness separately, honestly say 'I don't know', and explain every decision through traces.