Multimodal Costs
How much does one image cost?
The Problem: Vision API costs can be surprisingly high and hard to predict. A 4096×4096 image can cost 10-100x more than a 256×256 thumbnail, and video multiplies the per-frame cost by frames per second × duration in seconds. Teams that do not understand the token math can face unexpected bills of thousands of dollars.
The Solution: Understanding Vision Token Economics
When you send an image to a vision model, it is split into patches (typically 16×16 pixel blocks), and each patch is encoded into visual tokens. More pixels mean more tokens, and more tokens mean higher cost. Providers price this differently: OpenAI bills by tile, Claude bills in proportion to resolution, and Gemini charges a flat token count per image. For video, multiply the per-frame cost by frames per second × duration in seconds.
Think of it like paying for a high-resolution photo print versus a thumbnail: with most providers you pay roughly per pixel, not per image. Four things drive the bill:
- 1. Patch Grid: Images are divided into patches (16×16 pixels). Each patch → visual tokens. Pixel count directly determines cost: doubling both width and height gives 4x the pixels and ~4x the tokens.
- 2. Provider Pricing: OpenAI: tile-based (low=85 tokens, high=85+170×tiles). Claude: resolution-proportional. Gemini: flat 258 tokens/image. Same image, different costs.
- 3. Video Multiplier: Video = many images. 1fps × 60sec = 60 frames. At $0.003/frame that's $0.18/min. At 30fps: $5.40/min. Choose FPS carefully.
- 4. Optimization Strategies: Resize before sending (biggest savings), crop to the region of interest (ROI), use low-detail mode when possible, batch requests and use prompt caching, and process only keyframes for video.
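The patch and provider rules in the list above can be sketched as a small token estimator. This is a rough sketch, not any provider's official calculator. All function names are hypothetical. The OpenAI resize steps (fit inside 2048×2048, shrink the shortest side to 768, count 512-pixel tiles) and the Claude divisor of about 750 pixels per token are assumptions based on the providers' published descriptions and may differ between model versions.

```python
import math


def patch_count(width: int, height: int, patch: int = 16) -> int:
    """Raw number of patch x patch blocks needed to cover the image."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def openai_tokens(width: int, height: int, detail: str = "high") -> int:
    """Tile-based estimate: low = 85 tokens, high = 85 + 170 per tile."""
    if detail == "low":
        return 85
    # Assumption: fit the image inside a 2048x2048 square, never upscaling.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Assumption: shrink so the shortest side is at most 768 pixels.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Assumption: one tile per 512x512 block of the resized image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def claude_tokens(width: int, height: int) -> int:
    """Resolution-proportional estimate, assuming ~750 pixels per token."""
    return round(width * height / 750)


def gemini_tokens(width: int, height: int) -> int:
    """Flat per-image estimate, independent of resolution."""
    return 258


if __name__ == "__main__":
    for size in [(256, 256), (1024, 1024), (3840, 2160)]:
        print(size, patch_count(*size), openai_tokens(*size),
              claude_tokens(*size), gemini_tokens(*size))
```

Note that the raw patch count grows with every extra pixel, while the tile-based estimate levels off once the image has been downscaled, which is one reason the same image can cost very different amounts across providers.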
Cost Impact by Use Case
- E-commerce: Processing 10,000 product photos daily. At high resolution: $75/day. With resize to 512×512: $8/day. That is a saving of about 89%.
- Medical Imaging: High-res X-rays and MRIs require maximum detail — no shortcuts. Budget $0.01-0.03 per image.
- Video Surveillance: 24/7 camera at 1fps = 86,400 frames/day. Even at low detail: $20+/day per camera. Use event-triggered processing instead.
- Document Processing: For scanned contracts, 1024×1024 is optimal. Higher resolution adds cost with minimal OCR improvement.
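The arithmetic behind these use cases is easy to check. The sketch below uses only the dollar figures and volumes quoted in the list above; the helper names are hypothetical and no provider pricing is assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def frames_per_day(fps: float) -> int:
    """Frames produced by one camera running around the clock."""
    return round(fps * SECONDS_PER_DAY)


def savings_percent(cost_before: float, cost_after: float) -> float:
    """Relative saving, as a percentage of the original cost."""
    return (cost_before - cost_after) / cost_before * 100


def cost_per_image(daily_cost: float, images_per_day: int) -> float:
    """Average spend per image for a given daily bill."""
    return daily_cost / images_per_day


if __name__ == "__main__":
    print(frames_per_day(1))                    # surveillance camera at 1fps
    print(f"{savings_percent(75, 8):.1f}%")     # e-commerce resize saving
    print(f"${cost_per_image(75, 10_000):.4f}")  # e-commerce, high resolution
```

Running the same helpers on your own volumes gives a quick sanity check before committing to a resolution or frame rate.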
Fun Fact: A single 4K (3840×2160) image in GPT-4o uses about 1,105 tokens in high-detail mode — the same as ~800 words of text. A 1-minute video at just 1fps would use 66,300 tokens, costing about $0.17. At 30fps, that jumps to $5.00 per minute.
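The video numbers in this fun fact follow from a simple multiplication, sketched below. The function names are hypothetical, and the price of $2.50 per million input tokens is an assumption backed out of the quoted figures (66,300 tokens for about $0.17), not a quoted rate.

```python
def video_tokens(tokens_per_frame: int, fps: float, seconds: float) -> int:
    """Total visual tokens when every sampled frame is sent as an image."""
    return round(tokens_per_frame * fps * seconds)


def token_cost_usd(tokens: int, usd_per_million: float = 2.50) -> float:
    """Cost of the tokens. Assumption: $2.50 per million input tokens."""
    return tokens * usd_per_million / 1_000_000


if __name__ == "__main__":
    for fps in (1, 30):
        tokens = video_tokens(1105, fps, 60)  # one minute of 4K frames
        print(f"{fps} fps: {tokens} tokens, ${token_cost_usd(tokens):.2f}")
```

Because cost scales linearly with frame rate, dropping from 30fps to 1fps cuts the bill by a factor of 30 under these assumptions.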
Calculate image processing costs across providers
The cost depends on the provider and image size. The token counts below are consistent with a 1024×1024 image sent at high detail.
| Provider | Tokens/img | Cost per 1,000 imgs (USD) |
|----------|-----------|-------------|
| GPT-4o | 765 | 1.91 |
| Claude 3.5 | 1,398 | 4.19 |
| Gemini 1.5 | 258 | 0.32 |
Recommendation: For this case, Gemini 1.5 is about 6× cheaper than GPT-4o and about 13× cheaper than Claude 3.5. If maximum OCR accuracy is needed, however, Claude leads.
The cost difference between providers can reach 13×. Gemini is the cheapest for bulk processing, but Claude and GPT-4o are better for high-accuracy tasks.
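The comparison above can be reproduced from the token counts in the table. In this sketch the per-million-token prices are assumptions inferred from the table's own figures (cost divided by tokens), not quoted rates, and the helper names are hypothetical.

```python
# Token counts per image, taken from the table above.
TOKENS_PER_IMAGE = {"GPT-4o": 765, "Claude 3.5": 1398, "Gemini 1.5": 258}

# Assumption: USD per million input tokens, inferred from the table.
USD_PER_MILLION = {"GPT-4o": 2.50, "Claude 3.5": 3.00, "Gemini 1.5": 1.25}


def cost_per_1000_images(provider: str) -> float:
    """Input-token cost in USD for 1,000 images with the given provider."""
    tokens = TOKENS_PER_IMAGE[provider] * 1000
    return tokens * USD_PER_MILLION[provider] / 1_000_000


def cheapest() -> str:
    """Provider with the lowest cost per 1,000 images."""
    return min(TOKENS_PER_IMAGE, key=cost_per_1000_images)


if __name__ == "__main__":
    base = cost_per_1000_images(cheapest())
    for name in TOKENS_PER_IMAGE:
        cost = cost_per_1000_images(name)
        print(f"{name}: ${cost:.2f} per 1,000 images ({cost / base:.1f}x)")
```

Swapping in current prices and your own token counts is the quickest way to see whether the ranking still holds for your workload.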