Multimodal CoT
Vision + Reasoning
The Problem: Some problems involve both text AND images or diagrams. How can AI reason step-by-step while also processing visual information?
The Solution: Think With the Diagram
Multimodal CoT extends Chain-of-Thought reasoning to incorporate visual inputs like images, charts, and diagrams. It's like solving a geometry problem by looking at the figure while working through the logic. It requires a vision-language model, that is, a model that can process both text and images.
Think of it like solving a problem with a diagram:
1. See the image: "I see a triangle with angles labeled..."
2. Extract information: "Angle A appears to be 60 degrees..."
3. Reason about it: "Since the sum of angles is 180..."
4. Combine: Use visual and textual reasoning together
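The four steps can be mirrored in a few lines of Python. This is only an illustrative sketch: the `observations` dictionary stands in for what a vision model might read off the figure (the 60-degree value comes from the example above, while the 80-degree value for angle B is an assumed extra reading), and `third_angle` is a hypothetical helper name.

```python
def third_angle(observations: dict[str, float]) -> float:
    """Reason from two observed triangle angles (in degrees) to the third."""
    known = sum(observations.values())
    if len(observations) != 2 or not 0 < known < 180:
        raise ValueError("need exactly two angles summing to less than 180")
    # Interior angles of a triangle sum to 180 degrees.
    return 180.0 - known


# Steps 1-2: values a vision model might extract from the figure (assumed).
observations = {"A": 60.0, "B": 80.0}
# Steps 3-4: combine the visual readings with the textual rule.
angle_c = third_angle(observations)
print(f"Angle C = {angle_c} degrees")
```

The point of the split is that the visual reading and the arithmetic can be checked separately: a wrong answer is either a misread label or a faulty rule, and the intermediate values show which.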
Where Is This Used?
- Science Problems: Physics diagrams, chemistry structures
- Math with Figures: Geometry, graphs, coordinate systems
- Chart Analysis: Understanding data visualizations
- Document Understanding: Forms, infographics, flowcharts
Fun Fact: On ScienceQA, a benchmark of science exam questions that often include diagrams, Zhang et al. (2023) report about 91% accuracy for Multimodal CoT versus about 75% for GPT-3.5. The key is generating "rationales" that describe what's seen before reasoning.
Try It Yourself!
Use the interactive example below to see how AI can reason about problems that combine images and text.
Look at the image and question.
[Image]
Question: {question}
Before answering, describe:
1. What you see in the image (key elements)
2. Which information is relevant to the question
3. Step-by-step reasoning from observations to answer
Observations and reasoning:

Based on the reasoning above, answer the question.
Reasoning: {rationale}
Question: {question}
Final answer:

| Aspect | Direct Answer | Text CoT | Multimodal CoT |
|---|---|---|---|
| Input | Image + question | Text only | Image + question |
| Reasoning | No | Text-based | Visual + text |
| ScienceQA Accuracy | ~75% | ~80% | ~91% |
| Interpretability | Low | High | Very High |
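The two prompt templates above (rationale first, then final answer) can be chained as in the sketch below. It is a minimal illustration and not the paper's implementation: the `Model` callable signature, the `multimodal_cot` function, and the canned `fake_model` are all hypothetical stand-ins for a real vision-language client, and passing the image again in the second stage is a choice made for this sketch.

```python
from typing import Callable, Optional

# Hypothetical model interface: takes a text prompt and optional image bytes,
# returns generated text. Swap in a real vision-language client here.
Model = Callable[[str, Optional[bytes]], str]

RATIONALE_PROMPT = (
    "Look at the image and question.\n"
    "Question: {question}\n"
    "Before answering, describe:\n"
    "1. What you see in the image (key elements)\n"
    "2. Which information is relevant to the question\n"
    "3. Step-by-step reasoning from observations to answer\n"
    "Observations and reasoning:"
)

ANSWER_PROMPT = (
    "Based on the reasoning above, answer the question.\n"
    "Reasoning: {rationale}\n"
    "Question: {question}\n"
    "Final answer:"
)


def multimodal_cot(model: Model, image: bytes, question: str) -> tuple[str, str]:
    """Run rationale generation, then answer inference; return both."""
    # Stage 1: the model sees the image and writes out its rationale.
    rationale = model(RATIONALE_PROMPT.format(question=question), image).strip()
    # Stage 2: the answer is conditioned on that rationale.
    answer = model(
        ANSWER_PROMPT.format(rationale=rationale, question=question), image
    ).strip()
    return rationale, answer


def fake_model(prompt: str, image: Optional[bytes]) -> str:
    """Canned stand-in so the sketch runs without a real model."""
    if prompt.startswith("Based on the reasoning above"):
        return " 60 degrees"
    return "The figure shows an equilateral triangle, so each angle is 60 degrees. "


rationale, answer = multimodal_cot(fake_model, b"<image bytes>", "What is angle A?")
print(answer)
```

Returning the rationale alongside the answer keeps the intermediate reasoning available for inspection, which is where the interpretability benefit in the table comes from.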
Multimodal CoT is described in the paper "Multimodal Chain-of-Thought Reasoning in Language Models" (Zhang et al., 2023).
- Outperforms GPT-3.5 on the ScienceQA benchmark
- Separating rationale generation from answer inference (the two stages) is critical
- Works even with small models (under 1B parameters)
- ✓ Always ask the model to describe what it sees before answering
- ✓ Use a two-stage approach for complex tasks
- ✓ Request specific observations (numbers, colors, shapes)
- ✓ Verify the reasoning for logical consistency
- ✓ For diagrams: ask the model to read all labels and legends
Try it yourself
Determine the time of day and season from a city street photograph
The photograph shows a city street. Pedestrians and trees are visible. The street is well-lit.
1. Visual elements: City street, 8-10 pedestrians, a row of trees along the sidewalk, parked cars.
2. Lighting analysis: Shadows are long (3-4x object height) and point east, so the sun is low in the west. The light is warm and golden. Long shadows occur in the morning (8-9 AM) or evening (5-6 PM); the westerly sun and people's activity level suggest evening.
3. Clothing: T-shirts, light dresses, shorts, suggesting a temperature of roughly 20-28°C.
4. Vegetation: Trees in full bloom with pink and white blossoms, likely cherry or apple, which typically bloom in April-May in temperate Northern Hemisphere climates.
5. Conclusion: Early evening (around 5-6 PM) in late spring (April-May). Evidence: long evening shadows + light clothing + blooming trees.
Multimodal CoT breaks image analysis into specific visual aspects (shadows, clothing, vegetation). Each aspect provides an independent clue, and their combination yields a confident conclusion.
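One way to picture how independent clues combine is to treat each clue as a set of hypotheses it is compatible with and intersect those sets. The sketch below is an assumption-laden simplification: `combine_clues` is a hypothetical helper, the candidate sets are hand-written for this example and not produced by a model, and real clues are rarely hard constraints.

```python
def combine_clues(clues: dict[str, set[str]]) -> set[str]:
    """Intersect the hypotheses that each visual clue is compatible with."""
    candidate_sets = list(clues.values())
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)


# Candidate sets are assumed for illustration, loosely following the example.
time_clues = {
    "long, warm-toned shadows": {"morning", "evening"},
    "pedestrian activity level": {"afternoon", "evening"},
}
season_clues = {
    "light clothing": {"late spring", "summer", "early autumn"},
    "trees in blossom": {"early spring", "late spring"},
}

print(combine_clues(time_clues), combine_clues(season_clues))
```

Each clue alone leaves several options open; only the combination narrows the answer to evening in late spring. An empty result would signal that two observations contradict each other and the reasoning should be rechecked.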