Multimodal CoT

Vision + Reasoning

The Problem: Some problems involve both text AND images or diagrams. How can AI reason step-by-step while also processing visual information?

The Solution: Think With the Diagram

Multimodal CoT extends Chain-of-Thought reasoning to incorporate visual inputs like images, charts, and diagrams. Instead of jumping straight from a picture to an answer, the model is prompted to first put into words what it actually sees — the objects, labels, numbers, and relationships in the image — and only then reason step by step toward the answer. It's like solving a geometry problem by describing the figure aloud while working through the logic. This requires a vision model that can process both text and images in the same context.

How it works

The most reliable recipe is the two-stage approach from the original research. In stage one, the model is given the image plus the question and asked only to produce a rationale: a grounded description such as “I see a circuit with two resistors in series and a 9V battery.” In stage two, that text rationale is fed back in alongside the original question (the paper also passes the image in again), and the model produces the final answer. Splitting the work matters because it forces the visual extraction to happen explicitly and on the record: the answer is then reasoned from a clear, text-grounded summary rather than from a vague glance. In the paper, this separation let a model with under 1B parameters beat the much larger GPT-3.5 on the diagram-heavy ScienceQA benchmark.
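The two-stage flow can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `VisionModel` callable, the prompt wording, and all function names are assumptions made for the example, and `fake_model` is a stand-in that returns canned text.

```python
from typing import Callable, Dict, Optional

# Hypothetical interface: (prompt, image bytes or None) -> generated text.
# Replace with a call to whatever vision-language model you actually use.
VisionModel = Callable[[str, Optional[bytes]], str]


def generate_rationale(model: VisionModel, image: bytes, question: str) -> str:
    """Stage 1: ask only for grounded observations, not the answer."""
    prompt = (
        f"Question: {question}\n"
        "Describe what you see in the image that is relevant to the question. "
        "Do not answer yet."
    )
    return model(prompt, image)


def answer_from_rationale(model: VisionModel, rationale: str, question: str) -> str:
    """Stage 2: answer from the written rationale plus the original question."""
    prompt = f"Reasoning: {rationale}\nQuestion: {question}\nFinal answer:"
    # This sketch sends text only; pass the image again if your setup allows it.
    return model(prompt, None)


def multimodal_cot(model: VisionModel, image: bytes, question: str) -> Dict[str, str]:
    """Run both stages and keep the rationale so it can be audited later."""
    rationale = generate_rationale(model, image, question)
    answer = answer_from_rationale(model, rationale, question)
    return {"rationale": rationale, "answer": answer}


def fake_model(prompt: str, image: Optional[bytes]) -> str:
    """Stand-in model for demonstration; returns canned text."""
    if image is not None:
        return "I see a circuit with two resistors in series and a 9V battery."
    return "The resistors share the same current."


result = multimodal_cot(fake_model, b"fake-image", "Do the resistors share the same current?")
print(result["rationale"])
print(result["answer"])
```

Returning the rationale alongside the answer is a deliberate choice here: it keeps the perception step available for inspection when the final answer looks wrong.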

When to use it — and the pitfalls

Reach for Multimodal CoT whenever the answer genuinely depends on visual detail: science exam figures, geometry, charts and graphs, lab equipment, maps, forms, or flowcharts. For a purely text question, plain text CoT is simpler and cheaper. The biggest pitfall is hallucination during the perception step: the model may confidently “read” a value that isn't there, and every later reasoning step inherits that error. Mitigate it by asking for specific, checkable observations (exact numbers, axis labels, colors) and by keeping the rationale tightly grounded in the image.

A concrete example: shown a bar chart and asked “which quarter had the highest revenue?”, a model answering directly might guess. With Multimodal CoT it first transcribes each bar (“Q1 ≈ 40, Q2 ≈ 55, Q3 ≈ 30, Q4 ≈ 60”), then concludes “Q4 is highest”, and you can audit the transcription to catch mistakes.
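One way to make the bar chart transcription auditable is to parse it into numbers and compare it against values you trust. The sketch below assumes the rationale uses a "label ≈ value" style as in the example; the function names, the tolerance, and the reference values are hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, List


def parse_transcription(text: str) -> Dict[str, float]:
    """Extract 'label ≈ value', 'label = value' or 'label: value' pairs."""
    pairs = re.findall(r"(\w+)\s*[≈=:]\s*(-?\d+(?:\.\d+)?)", text)
    return {label: float(value) for label, value in pairs}


def highest(values: Dict[str, float]) -> str:
    """Return the label with the largest transcribed value."""
    return max(values, key=values.get)


def audit(values: Dict[str, float], reference: Dict[str, float],
          tolerance: float = 5.0) -> List[str]:
    """List labels that are missing or differ from the reference by more than tolerance."""
    return [
        label
        for label, expected in reference.items()
        if label not in values or abs(values[label] - expected) > tolerance
    ]


rationale = "Q1 ≈ 40, Q2 ≈ 55, Q3 ≈ 30, Q4 ≈ 60"
bars = parse_transcription(rationale)
print(highest(bars))  # Q4

# Hypothetical ground truth, e.g. read off the chart's source data by a human.
reference = {"Q1": 40, "Q2": 55, "Q3": 30, "Q4": 20}
print(audit(bars, reference))  # ['Q4']
```

If the audit flags a bar, the conclusion drawn from it should be treated as suspect, since the reasoning step inherits whatever the perception step misread.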

Think of it like solving a problem with a diagram:

1. See the image: "I see a triangle with angles labeled..."
2. Extract information: "Angle A appears to be 60 degrees..."
3. Reason about it: "Since the angles of a triangle sum to 180 degrees..."
4. Combine: Use visual and textual reasoning together

Where Is This Used?

Science Problems: Physics diagrams, chemistry structures
Math with Figures: Geometry, graphs, coordinate systems
Chart Analysis: Understanding data visualizations
Document Understanding: Forms, infographics, flowcharts

Fun Fact: In the original paper, Multimodal CoT reached 91.68% accuracy on ScienceQA science exam questions, about 16 percentage points above GPT-3.5 (75.17%). The key is generating "rationales" that describe what's seen before reasoning.

Prompt Template

Stage 1: Rationale Generation

Look at the image and question.

[Image]

Question: {question}

Before answering, describe:
1. What you see in the image (key elements)
2. Which information is relevant to the question
3. Step-by-step reasoning from observations to answer

Observations and reasoning:

Stage 2: Final Answer

Based on the reasoning above, answer the question.

Reasoning: {rationale}
Question: {question}

Final answer:
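As a rough sketch, the two templates above can be kept as plain strings and filled in with `str.format`. The constant and function names are assumptions for this example. The `[Image]` placeholder is left out of the text on the assumption that the image is sent to the model as a separate attachment.

```python
# Text of the two stages, copied from the template above.
# The image itself is assumed to be attached separately, not embedded in the text.
STAGE1_TEMPLATE = """Look at the image and question.

Question: {question}

Before answering, describe:
1. What you see in the image (key elements)
2. Which information is relevant to the question
3. Step-by-step reasoning from observations to answer

Observations and reasoning:"""

STAGE2_TEMPLATE = """Based on the reasoning above, answer the question.

Reasoning: {rationale}
Question: {question}

Final answer:"""


def build_stage1_prompt(question: str) -> str:
    """Fill the rationale-generation template."""
    return STAGE1_TEMPLATE.format(question=question.strip())


def build_stage2_prompt(rationale: str, question: str) -> str:
    """Fill the final-answer template with the stage 1 output."""
    return STAGE2_TEMPLATE.format(rationale=rationale.strip(), question=question.strip())


question = "Which quarter had the highest revenue?"
stage1 = build_stage1_prompt(question)
stage2 = build_stage2_prompt("Q1 ≈ 40, Q2 ≈ 55, Q3 ≈ 30, Q4 ≈ 60", question)
print(stage1)
print("---")
print(stage2)
```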

Approach Comparison

Aspect	Direct Answer	Text CoT	Multimodal CoT
Input	Image + question	Text only	Image + question
Reasoning	No	Text-based	Visual + text
ScienceQA Accuracy	~75%	~80%	~91%
Interpretability	Low	High	Very High

Research

Multimodal CoT is described in the paper "Multimodal Chain-of-Thought Reasoning in Language Models" (Zhang et al., 2023).

• Outperforms GPT-3.5 on the ScienceQA benchmark
• Two-stage separation (rationale first, then answer) is critical
• Works even with small models (under 1B parameters)

Best Practices

✓ Always ask the model to describe what it sees before answering
✓ Use the two-stage approach for complex tasks
✓ Request specific observations (numbers, colors, shapes)
✓ Verify the reasoning for logical consistency
✓ For diagrams: ask the model to read all labels and legends
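A crude way to enforce the "specific observations" practice is to check that a rationale contains something checkable before passing it on to the answer stage. The heuristic below is only an illustrative assumption: it counts standalone numbers and quoted labels, and the function names and threshold are made up for this sketch.

```python
import re
from typing import Dict, List


def checkable_observations(rationale: str) -> Dict[str, List[str]]:
    """Collect standalone numbers and double-quoted labels from a rationale."""
    # The lookbehind skips digits glued to a word, such as the 4 in "Q4".
    numbers = re.findall(r"(?<![\w.])-?\d+(?:\.\d+)?", rationale)
    quoted = re.findall(r'"([^"]+)"', rationale)
    return {"numbers": numbers, "quoted_labels": quoted}


def is_grounded(rationale: str, min_items: int = 1) -> bool:
    """Heuristic: accept the rationale only if it cites enough checkable details."""
    found = checkable_observations(rationale)
    return len(found["numbers"]) + len(found["quoted_labels"]) >= min_items


vague = "The street is well-lit."
specific = 'The y-axis is labeled "Revenue" and Q4 reaches 60.'
print(is_grounded(vague))     # False
print(is_grounded(specific))  # True
```

A rationale that fails the check can be sent back with a request for exact numbers and labels. Passing the check does not mean the observations are correct, only that there is something concrete to verify.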

Frequently asked questions

What is Multimodal Chain-of-Thought (Multimodal CoT)?

Multimodal CoT is a prompting technique that extends step-by-step Chain-of-Thought reasoning to visual inputs like images, charts, and diagrams. The model first describes in words what it sees in the image (a rationale), then reasons step by step from those observations to the answer. It requires a vision model that can process text and images together.

How is Multimodal CoT different from regular text Chain-of-Thought?

Regular CoT works on text only, reasoning over the words in the prompt. Multimodal CoT adds a perception step: the model explicitly extracts information from the image (numbers, labels, shapes) and writes it out as text before reasoning. Use it when the answer depends on visual detail such as a circuit, figure, or chart; for purely text questions, plain CoT is enough.

How does the two-stage Multimodal CoT approach work?

In stage one, the model receives the image and question and produces only a rationale — a grounded text description of what it sees. In stage two, that rationale is fed back alongside the original question, and the model produces the final answer. Splitting the work forces visual extraction to happen explicitly and on the record, improving accuracy and letting you audit how the image was read.

When should you use Multimodal CoT and what are its limitations?

Use it whenever the answer genuinely depends on the image: science figures, geometry, charts, lab equipment, maps, forms, and flowcharts. The main limitation is hallucination during perception — the model may confidently read a value that isn't there, and every later step inherits the error. Reduce the risk by asking for specific, checkable observations (exact numbers, axis labels, colors) and keeping the rationale grounded in the image.

Try it yourself

Technique comparison (pre-recorded responses)

Task (Beginner, Analysis)

Determine the time of day and season from a city street photograph

Without technique

Prompt

[Image: city street with long shadows, people in light clothing, blooming trees] Describe what is in the photograph.

Response

The photograph shows a city street. Pedestrians and trees are visible. The street is well-lit.

Tokens: 820/28

Time: 450 ms

With Multimodal CoT

Prompt

[Image: city street with long shadows, people in light clothing, blooming trees]
Analyze the image step by step to determine the time of day and season:
1. Visual elements: what do you see?
2. Lighting analysis: shadow direction and length, light quality
3. Clothing analysis: what does it tell us about temperature?
4. Vegetation analysis: what stage are the trees in?
5. Conclusion: what time of day and season is it?

Response

1. Visual elements: City street, 8-10 pedestrians, a row of trees along the sidewalk, parked cars.

2. Lighting analysis: Shadows are long (3-4x object height), pointing east. Light is warm, golden. Long shadows are typical for morning (8-9 AM) or evening (5-6 PM); east-pointing shadows mean the sun is in the west, and together with the warm tone and people's activity level this suggests evening.

3. Clothing: T-shirts, light dresses, shorts. Temperature 20-28°C.

4. Vegetation: Trees in full bloom with pink and white blossoms. Likely cherry or apple — they bloom in April-May.

5. Conclusion: Early evening (around 5-6 PM) in late spring (April-May). Evidence: long evening shadows + light clothing + blooming trees.

👁️ Break image analysis into specific visual aspects

🧠 Shadow length + direction + light quality = time of day

🧠 Clothing + vegetation give independent confirmation of season

✅ Cross-checking all observations gives a confident conclusion

Tokens: 870/215

Time: 1200 ms

Why this works

Multimodal CoT breaks image analysis into specific visual aspects (shadows, clothing, vegetation). Each aspect provides an independent clue, and their combination yields a confident conclusion.
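The lighting clue in the response can be sanity-checked with a little trigonometry. Assuming a vertical object on flat ground (an idealization made for this sketch), a shadow that is r times the object's height implies a sun elevation of atan(1/r). The function name is hypothetical.

```python
import math


def sun_elevation_deg(shadow_to_height_ratio: float) -> float:
    """Sun elevation angle for a vertical object casting a shadow on flat ground."""
    if shadow_to_height_ratio <= 0:
        raise ValueError("ratio must be positive")
    return math.degrees(math.atan(1.0 / shadow_to_height_ratio))


for ratio in (3.0, 4.0):
    elevation = sun_elevation_deg(ratio)
    print(f"shadow = {ratio:.0f}x height -> sun about {elevation:.1f} degrees above the horizon")
# shadow = 3x height -> sun about 18.4 degrees above the horizon
# shadow = 4x height -> sun about 14.0 degrees above the horizon
```

A sun only 14 to 18 degrees above the horizon is consistent with early morning or early evening, which is why the response needs the other clues to choose between the two.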
