Lesson 6: Architecture

Multimodal RAG

Search across text and images

The Problem: Text-only RAG fails on visual content (diagrams, tables, photos, charts). When documents contain images, it can neither retrieve nor reason about the information they carry.

The Solution: Three Architectures of Multimodal RAG

Multimodal RAG extends traditional retrieval-augmented generation so that the retrieval step can find and reason over images, not just text. In plain text RAG you chunk a corpus, turn each chunk into a vector with an embedding model, store those vectors in a vector database, and at query time retrieve the closest chunks to feed into the LLM. The catch is that a huge amount of real-world knowledge lives in diagrams, tables, charts, screenshots and photos. If you only embed the surrounding text, the model is effectively blind to that content. Multimodal RAG fixes this by giving the retriever a way to represent visual information directly.
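The text-only pipeline described above can be sketched in a few lines. This is a toy illustration, not a production retriever: the `embed` function is a bag-of-words stand-in for a real embedding model, and the sample chunks are invented for the example. It shows why the retriever is blind to images: the best match merely points at a figure whose content was never indexed.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks closest to the query as (score, chunk) pairs."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]

# Hypothetical chunks extracted from a report; the chart itself is not indexed.
chunks = [
    "Figure 3 shows revenue by quarter.",
    "The company was founded in 2010 and sells industrial sensors.",
    "Operating costs fell after the warehouse consolidation.",
]

results = retrieve("which quarter had the highest revenue?", chunks)
for score, chunk in results:
    print(f"{score:.2f}  {chunk}")
```

The top hit mentions the figure, but the answer itself (which bar is tallest) never made it into the index.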

How it works: three architectures

The first approach uses joint image-text embeddings from a model like CLIP: an image is encoded straight into a vector that lives in the same space as text embeddings, so a text query can retrieve a matching picture by similarity. It is fast and well suited to "find a similar image", but a single vector throws away fine detail, so it struggles with specific questions about what is inside the image.

The second approach generates LLM summaries: a vision-capable model writes a text description of each image (or table), and you embed that text with your normal pipeline. Anything the description puts into words becomes searchable, but you only retrieve what the caption happened to mention, so visual nuance is lost in translation.

The third approach uses multi-vector (late-interaction) models such as ColPali: the index keeps one vector per image patch for the whole page, the query is embedded as one vector per token, and every query token is scored against every patch. This preserves layout and visual structure at the cost of a larger index.
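To make the difference between single-vector and late-interaction scoring concrete, here is a minimal sketch. The 2-dimensional vectors are hand-picked for illustration and do not come from any real model; the late-interaction function follows the usual "MaxSim" rule (each query token takes its best-matching patch, and the maxima are summed), which is assumed here to be representative of ColPali-style scoring.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def single_vector_score(query_vec, image_vec):
    """CLIP-style scoring: one vector for the query, one for the image."""
    return dot(query_vec, image_vec)

def late_interaction_score(query_tokens, page_patches):
    """MaxSim: for each query token take the best patch, then sum."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)

# Hypothetical embeddings, chosen by hand for illustration only.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]            # e.g. "highest", "revenue"
chart_page = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]  # patches of a chart page
text_page = [[0.6, 0.4], [0.5, 0.5]]               # patches of a plain page

chart_score = late_interaction_score(query_tokens, chart_page)
text_score = late_interaction_score(query_tokens, text_page)
print(chart_score, text_score)
```

Because each query token is free to pick a different patch, the chart page wins even though no single patch matches the whole query.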

Choosing and a worked example

Pick the architecture by query type: CLIP for image-to-image search, LLM summaries when users ask text questions and you want a cheap, ordinary text index, and ColPali when document layout matters (forms, scientific PDFs, slides). Concretely, imagine a 40-page PDF where page 12 has a revenue bar chart and the user asks "which quarter had the highest revenue?". Plain text RAG retrieves nothing useful because the answer exists only in the bars. An LLM-summary pipeline works only if the generated caption already said "Q3 was highest". ColPali instead retrieves the actual chart page by matching the query tokens to the chart's visual patches, and a vision-capable generation model reads the bars to answer. The tradeoff: ColPali's multi-vector index is far larger and slower to search, so for a small gallery of product photos CLIP is the pragmatic choice.
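A back-of-the-envelope calculation shows where the index-size tradeoff comes from. The figures below are illustrative assumptions, not numbers from this lesson: one 512-dimensional vector per page for the single-vector case, 1024 patch vectors of 128 dimensions per page for the multi-vector case, and uncompressed 4-byte floats. Real models and quantization settings will change the result.

```python
def index_bytes(num_items, vectors_per_item, dim, bytes_per_value=4):
    """Raw storage for an index of dense vectors, ignoring any overhead."""
    return num_items * vectors_per_item * dim * bytes_per_value

pages = 40  # the 40-page PDF from the worked example

# Assumed sizes, for illustration only.
single = index_bytes(pages, vectors_per_item=1, dim=512)
multi = index_bytes(pages, vectors_per_item=1024, dim=128)
ratio = multi / single

print(f"single-vector: {single / 1024:.0f} KiB")
print(f"multi-vector:  {multi / 1024 / 1024:.0f} MiB")
print(f"ratio:         {ratio:.0f}x")
```

Under these assumptions the multi-vector index is a couple of hundred times larger, which is negligible for 40 pages but adds up quickly across a large corpus.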

Think of it like a librarian who can search both by text descriptions and by looking at actual pictures in books:

1. CLIP Embeddings: The image goes directly into the CLIP encoder, producing a single vector for similarity search. Fast, but weak at answering detailed questions about image content.
2. LLM-Generated Summaries: An LLM describes each image in text, then the text is embedded normally. Searchable by text queries, but visual detail is lost in translation.
3. Multi-Vector (ColPali): A late-interaction model produces per-token embeddings for the query and per-patch embeddings for the page image. Best accuracy; preserves layout and visual information.
4. Choosing the right approach: CLIP for image-to-image search, LLM summaries for text-based Q&A about images, ColPali for document understanding with layout-sensitive retrieval.
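The selection rule in the list above can be written as a tiny routing function. This is only a sketch: the `Architecture` enum, the two boolean flags, and the priority order (layout first, then image queries, then text questions) are assumptions made for the example, not part of any library.

```python
from enum import Enum

class Architecture(Enum):
    CLIP = "clip"
    LLM_SUMMARY = "llm_summary"
    COLPALI = "colpali"

def choose_architecture(query_is_image: bool, layout_matters: bool) -> Architecture:
    """Pick a retrieval architecture from two coarse properties of the workload."""
    if layout_matters:
        # Forms, scientific PDFs, slides: keep per-patch vectors for the page.
        return Architecture.COLPALI
    if query_is_image:
        # "Find similar looking products": single-vector image similarity.
        return Architecture.CLIP
    # Text questions about images: embed generated descriptions.
    return Architecture.LLM_SUMMARY

print(choose_architecture(query_is_image=True, layout_matters=False))
print(choose_architecture(query_is_image=False, layout_matters=True))
```

In practice the choice also depends on corpus size and latency budget, so a real system may combine more than one approach.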

Where Multimodal RAG Matters

Technical Documentation: Diagrams and schematics alongside text explanations, searchable by both visual similarity and text queries.
Medical Records: X-rays, MRI scans referenced in patient notes, enabling search like "show me cases with similar chest X-ray findings".
E-commerce: Product photos combined with descriptions for visual search: "find similar looking products" or "red dress with floral pattern".
Legal Discovery: Contracts with stamps, signatures, handwritten notes that affect interpretation alongside printed text.

Fun Fact: ColPali (2024) from Illuin Technology showed that a single vision model can match or beat complex OCR+text pipelines for document retrieval, while being 4x faster to index. The key insight: visual tokens preserve layout information that gets lost in text extraction.

Frequently asked questions

What is multimodal RAG in simple terms?

It is a regular RAG (retrieval-augmented generation) system that can search not only over text but also over images — diagrams, tables, charts and photos. Instead of embedding only the text around an image, multimodal RAG represents the visual information itself as vectors, so the model can find relevant images and answer questions about what is inside them.

How is multimodal RAG different from plain text RAG?

Text RAG chunks documents, turns them into embeddings and retrieves the closest pieces of text by meaning. It is blind to the content of diagrams, charts and photos. Multimodal RAG adds a way to encode images directly — via joint embeddings (CLIP), text descriptions from a vision model, or multi-vector models like ColPali — so an answer can live inside a picture, not just in the surrounding text.

What are the architectures of multimodal RAG and when do I pick each?

Three main ones:

1. CLIP embeddings: an image becomes a single vector; fast and great for image-to-image search.
2. LLM summaries: a vision model writes a text caption that you embed with a normal pipeline; good for text questions and a cheap index.
3. Multi-vector ColPali: keeps per-patch vectors for the whole page; best accuracy for layout-sensitive documents (forms, scientific PDFs, slides), but the index is larger and slower.

What is ColPali and why can it beat OCR pipelines for document retrieval?

ColPali (2024, Illuin Technology) is a late-interaction vision model that produces per-patch image embeddings and matches them directly against query-token embeddings. It preserves page layout information that gets lost during OCR text extraction. For document retrieval it matches or beats complex OCR+text pipelines while being roughly 4x faster to index.

Try it yourself

Technique Comparison

Task (Beginner, Analysis)

Create a quality text description of a diagram for RAG indexing

Without technique

Prompt

Describe this diagram. [Diagram: bar chart of quarterly sales — Q1: $2.1M, Q2: $3.4M, Q3: $2.8M, Q4: $4.2M]

Response

The diagram shows quarterly sales. Sales are growing.

Tokens: 290/18

Time: 1500 ms

With technique (multimodal RAG)

Prompt

Create a detailed text description of this diagram for indexing in a RAG system. The description must be searchable: a user should be able to find it with a text query. Include:

1. Chart type and what it shows
2. All data values (exact numbers)
3. Trends and patterns
4. Key takeaways
5. Search tags

[Diagram: bar chart of quarterly sales — Q1: $2.1M, Q2: $3.4M, Q3: $2.8M, Q4: $4.2M]

Response

Type: Bar chart

Topic: Quarterly sales for the year

Data:

Q1: $2.1M (minimum)
Q2: $3.4M (+62% vs Q1)
Q3: $2.8M (-18% vs Q2)
Q4: $4.2M (+50% vs Q3, maximum)

Trends: Overall growth with a Q3 dip. Best quarter: Q4 ($4.2M). Annual total: $12.5M.

Takeaways: Seasonal Q3 dip, strong Q4 finish. Q1→Q4 growth: +100%.

Tags: sales, quarterly, bar chart, revenue, seasonality, growth

👁️ Structuring the description for maximum search relevance

🧠 Exact numbers enable finding the chart with queries like "Q4 sales" or "$4.2M"

✅ Tags expand search coverage with synonyms and key concepts

Tokens: 400/180

Time: 3500 ms

Why this works

In the "LLM summaries" architecture, description quality = search quality. Structure, exact data, and tags turn an image into a searchable artifact.
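The structure of the good response can also be produced or checked programmatically. The sketch below assumes the chart values have already been extracted (here they are typed in by hand from the example prompt); `describe_bar_chart` is a hypothetical helper, not part of any library. It recomputes the quarter-over-quarter changes and the annual total so the indexed description carries exact, searchable numbers.

```python
def describe_bar_chart(topic, data):
    """Build a searchable text description from chart values (in $M)."""
    labels = list(data)
    lines = ["Type: Bar chart", f"Topic: {topic}", "Data:"]
    for i, label in enumerate(labels):
        line = f"{label}: ${data[label]}M"
        if i > 0:
            prev_label = labels[i - 1]
            prev = data[prev_label]
            # Percentage change versus the previous bar, rounded to an integer.
            change = round((data[label] - prev) / prev * 100)
            line += f" ({change:+d}% vs {prev_label})"
        lines.append(line)
    best = max(data, key=data.get)
    total = round(sum(data.values()), 1)
    lines.append(f"Best quarter: {best} (${data[best]}M). Annual total: ${total}M.")
    return "\n".join(lines)

# Values taken from the example prompt above.
sales = {"Q1": 2.1, "Q2": 3.4, "Q3": 2.8, "Q4": 4.2}
description = describe_bar_chart("Quarterly sales", sales)
print(description)
```

The resulting text can be embedded with the normal text pipeline, so a query such as "Q4 sales" or "$4.2M" has literal matches to land on.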
