Multimodal CoT
Vision + Reasoning
The Problem: Some problems involve both text AND images or diagrams. How can AI reason step-by-step while also processing visual information?
The Solution: Think With the Diagram
Multimodal CoT extends Chain-of-Thought reasoning to incorporate visual inputs like images, charts, and diagrams. Instead of jumping straight from a picture to an answer, the model is prompted to first put into words what it actually sees — the objects, labels, numbers, and relationships in the image — and only then reason step by step toward the answer. It's like solving a geometry problem by describing the figure aloud while working through the logic. This requires a vision model that can process both text and images in the same context.
How it works
The standard recipe is the two-stage approach from the original paper (Zhang et al., 2023). In stage one, the model is given the image plus the question and asked only to produce a rationale: a grounded description such as “I see a circuit with two resistors in series and a 9V battery.” In stage two, that text rationale is fed back in alongside the original question, and the model produces the final answer. Splitting the work matters because it forces the visual extraction to happen explicitly and on the record, so the answer is reasoned from a clear, text-grounded summary rather than from a vague glance. This separation is part of why, in the paper, even a small vision-language model (under 1B parameters) can beat much larger text-only models on diagram-heavy benchmarks like ScienceQA.
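The two stages can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `VisionModel` callable, the prompt wording, and `fake_model` are all assumptions made up for this example, and a real vision-language client would replace the stub.

```python
from typing import Callable, Optional

# Hypothetical model interface (an assumption for this sketch): takes a text
# prompt and optional image bytes, returns generated text.
VisionModel = Callable[[str, Optional[bytes]], str]

RATIONALE_PROMPT = (
    "Look at the image and the question.\n"
    "Question: {question}\n"
    "Describe only what you see that is relevant. Do not answer yet.\n"
    "Observations:"
)

ANSWER_PROMPT = (
    "Based on the reasoning below, answer the question.\n"
    "Reasoning: {rationale}\n"
    "Question: {question}\n"
    "Final answer:"
)


def multimodal_cot(model: VisionModel, image: bytes, question: str) -> dict:
    """Run both stages and return both outputs so the rationale can be audited."""
    # Stage 1: image + question -> text rationale only.
    rationale = model(RATIONALE_PROMPT.format(question=question), image).strip()
    # Stage 2: rationale + question -> final answer (text only in this sketch).
    answer = model(
        ANSWER_PROMPT.format(rationale=rationale, question=question), None
    ).strip()
    return {"rationale": rationale, "answer": answer}


def fake_model(prompt: str, image: Optional[bytes]) -> str:
    """Canned stand-in so the sketch runs without a real model."""
    if image is not None:
        return "I see a circuit with two resistors in series and a 9V battery."
    return "The resistors are in series."


result = multimodal_cot(fake_model, b"<image bytes>", "How are the resistors connected?")
print(result["rationale"])
print(result["answer"])
```

Returning the rationale alongside the answer is a deliberate choice here: it keeps the perception step on the record, which is the point of splitting the work.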
When to use it — and the pitfalls
Reach for Multimodal CoT whenever the answer genuinely depends on visual detail: science exam figures, geometry, charts and graphs, lab equipment, maps, forms, or flowcharts. For a purely text question, plain text CoT is simpler and cheaper. The biggest pitfall is hallucination during the perception step — the model may confidently “read” a value that isn't there, and every later reasoning step inherits that error. Mitigate it by asking for specific, checkable observations (exact numbers, axis labels, colors) and by keeping the rationale tightly grounded in the image. A concrete example: shown a bar chart and asked “which quarter had the highest revenue?”, a direct model might guess; with Multimodal CoT it first transcribes each bar (“Q1 ≈ 40, Q2 ≈ 55, Q3 ≈ 30, Q4 ≈ 60”), then concludes “Q4 is highest” — and you can audit the transcription to catch mistakes.
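Because the transcription is plain text, part of the audit can be automated. The sketch below assumes the rationale lists values in a `label ≈ value` form like the bar chart example; the function names and the regular expression are illustrative choices, not part of any library. It only checks that the conclusion follows from the transcribed numbers. Whether the numbers match the image still needs a human or a second pass.

```python
import re


def parse_transcription(rationale: str) -> dict:
    """Extract 'label ≈ value' (or '=' / ':') pairs from a rationale string."""
    pairs = re.findall(r"(\w+)\s*[≈=:]\s*(-?\d+(?:\.\d+)?)", rationale)
    return {label: float(value) for label, value in pairs}


def audit_highest(rationale: str, claimed: str) -> bool:
    """Return True if the claimed label really has the largest transcribed value."""
    values = parse_transcription(rationale)
    if not values:
        return False  # nothing checkable was transcribed
    return max(values, key=values.get) == claimed


rationale = "Q1 ≈ 40, Q2 ≈ 55, Q3 ≈ 30, Q4 ≈ 60"
print(parse_transcription(rationale))
print(audit_highest(rationale, "Q4"))  # conclusion matches the transcription
print(audit_highest(rationale, "Q2"))  # conclusion contradicts the transcription
```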
Think of it like solving a problem with a diagram:
1. See the image: "I see a triangle with angles labeled..."
2. Extract information: "Angle A appears to be 60 degrees..."
3. Reason about it: "Since the sum of angles is 180..."
4. Combine: Use visual and textual reasoning together
Where Is This Used?
- Science Problems: Physics diagrams, chemistry structures
- Math with Figures: Geometry, graphs, coordinate systems
- Chart Analysis: Understanding data visualizations
- Document Understanding: Forms, infographics, flowcharts
Fun Fact: Multimodal CoT can solve science exam questions that include diagrams with much higher accuracy than text-only approaches! The key is generating "rationales" that describe what's seen before reasoning.
Stage 1 prompt (generate the rationale):

Look at the image and question.
[Image]
Question: {question}
Before answering, describe:
1. What you see in the image (key elements)
2. Which information is relevant to the question
3. Step-by-step reasoning from observations to answer
Observations and reasoning:

Stage 2 prompt (produce the answer):

Based on the reasoning above, answer the question.
Reasoning: {rationale}
Question: {question}
Final answer:

| Aspect | Direct Answer | Text CoT | Multimodal CoT |
|---|---|---|---|
| Input | Image + question | Text only | Image + question |
| Reasoning | No | Text-based | Visual + text |
| ScienceQA Accuracy | ~75% | ~80% | ~91% |
| Interpretability | Low | High | Very High |
Multimodal CoT is described in the paper "Multimodal Chain-of-Thought Reasoning in Language Models" (Zhang et al., 2023).
- Outperforms GPT-3.5 on the ScienceQA benchmark
- The two-stage separation is critical
- Works even with small models (under 1B parameters)
- ✓ Always ask the model to describe what it sees before answering
- ✓ Use the two-stage approach for complex tasks
- ✓ Request specific observations (numbers, colors, shapes)
- ✓ Verify reasoning for logical consistency
- ✓ For diagrams: ask to read all labels and legends
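One way to apply this checklist consistently is to build the rationale prompt from it. The sketch below is a hypothetical helper: `build_rationale_prompt`, the step wording, and the `image_kind` categories are assumptions chosen for illustration, not a fixed recipe.

```python
# Steps requested for every image (wording is illustrative).
BASE_STEPS = [
    "Describe the key elements you see in the image.",
    "List specific observations: exact numbers, colors, and shapes.",
]

# Extra steps for particular kinds of image (hypothetical categories).
EXTRA_STEPS = {
    "diagram": ["Read out every label and legend entry."],
    "chart": ["Read out the axis labels, units, and each plotted value."],
}


def build_rationale_prompt(question: str, image_kind: str = "photo") -> str:
    """Assemble a stage-one prompt that asks for checkable observations."""
    # Concatenation creates a new list, so BASE_STEPS is never mutated.
    steps = BASE_STEPS + EXTRA_STEPS.get(image_kind, [])
    steps.append(
        "Reason step by step from these observations. "
        "Do not state the final answer yet."
    )
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        "Look at the image and question.\n"
        f"Question: {question}\n"
        "Before answering:\n"
        f"{numbered}\n"
        "Observations and reasoning:"
    )


print(build_rationale_prompt("Which component limits the current?", "diagram"))
```

Keeping the checklist in data rather than in a hand-written prompt makes it easier to add a new image kind without touching the rest of the prompt.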
Frequently asked questions
What is Multimodal Chain-of-Thought (Multimodal CoT)?
Multimodal CoT is a prompting technique that extends step-by-step Chain-of-Thought reasoning to visual inputs like images, charts, and diagrams. The model first describes in words what it sees in the image (a rationale), then reasons step by step from those observations to the answer. It requires a vision model that can process text and images together.
How is Multimodal CoT different from regular text Chain-of-Thought?
Regular CoT works on text only, reasoning over the words in the prompt. Multimodal CoT adds a perception step: the model explicitly extracts information from the image (numbers, labels, shapes) and writes it out as text before reasoning. Use it when the answer depends on visual detail such as a circuit, figure, or chart; for purely text questions, plain CoT is enough.
How does the two-stage Multimodal CoT approach work?
In stage one, the model receives the image and question and produces only a rationale — a grounded text description of what it sees. In stage two, that rationale is fed back alongside the original question, and the model produces the final answer. Splitting the work forces visual extraction to happen explicitly and on the record, improving accuracy and letting you audit how the image was read.
When should you use Multimodal CoT and what are its limitations?
Use it whenever the answer genuinely depends on the image: science figures, geometry, charts, lab equipment, maps, forms, and flowcharts. The main limitation is hallucination during perception — the model may confidently read a value that isn't there, and every later step inherits the error. Reduce the risk by asking for specific, checkable observations (exact numbers, axis labels, colors) and keeping the rationale grounded in the image.
Try it yourself
Determine the time of day and season from a city street photograph
The photograph shows a city street. Pedestrians and trees are visible. The street is well-lit.
1. Visual elements: City street, 8-10 pedestrians, a row of trees along the sidewalk, parked cars.
2. Lighting analysis: Shadows are long (3-4x object height) and point east, so the sun is low in the west. Light is warm, golden. Long shadows alone are typical for morning (8-9 AM) or evening (5-6 PM); the shadow direction, the warm tone, and people's activity level suggest evening.
3. Clothing: T-shirts, light dresses, shorts. This suggests a temperature of roughly 20-28°C.
4. Vegetation: Trees in full bloom with pink and white blossoms. Likely cherry or apple — they bloom in April-May.
5. Conclusion: Early evening (around 5-6 PM) in late spring (April-May). Evidence: long evening shadows + light clothing + blooming trees.
Multimodal CoT breaks image analysis into specific visual aspects (shadows, clothing, vegetation). Each aspect provides an independent clue, and their combination yields a confident conclusion.