Diffusion Models
DALL-E, Stable Diffusion, Midjourney
The Problem: You type "a cat astronaut floating in space, oil painting style" into DALL-E and get a stunning image in seconds. But how? The AI didn't search a database of cat-astronaut paintings — it created something entirely new. How do diffusion models turn random noise into coherent images guided by text?
The Solution: Diffusion — Sculpting Images from Noise
Diffusion models rest on two processes. Forward diffusion gradually adds Gaussian noise to an image until only noise remains. Reverse diffusion runs a trained U-Net neural network that predicts the noise at each step so it can be removed. In latent diffusion models such as Stable Diffusion, this computation happens in latent space: a compressed representation with 64x fewer spatial positions (64x64 instead of 512x512) and about 48x fewer values overall, which makes each denoising step far cheaper than working on raw pixels. The text prompt is converted into guiding embeddings by a CLIP text encoder, and classifier-free guidance controls how closely the model follows the description.
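The forward process can be sketched in a few lines. This is a minimal illustration, not any library's implementation: it assumes the commonly used linear noise schedule (1000 steps, beta from 1e-4 to 0.02) and treats a short list of numbers as a toy "image". The function names `linear_alpha_bars` and `add_noise` are hypothetical.

```python
import math
import random

def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(ab_t) * x_0 + sqrt(1 - ab_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

alpha_bars = linear_alpha_bars()
rng = random.Random(0)
image = [1.0, -1.0, 0.5, 0.0]  # toy "image" of four values
slightly_noisy = add_noise(image, 10, alpha_bars, rng)      # mostly signal
almost_pure_noise = add_noise(image, 999, alpha_bars, rng)  # almost no signal left
```

As the step index grows, the signal weight `sqrt(alpha_bars[t])` shrinks toward zero and the sample becomes indistinguishable from Gaussian noise. Training teaches the U-Net to predict the `eps` term that was mixed in.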
Think of it like a sculptor receiving a block of marble (random noise) and a written description from a patron. Step by step, the sculptor chips away marble that doesn't belong, guided by the description. Each pass reveals more detail until the final sculpture emerges:
- 1. Text encoded via CLIP: The text prompt is converted into a numerical vector (embedding) by the CLIP text encoder, capturing the semantic meaning that will guide generation
- 2. Start from random noise: Generation starts from pure random noise in latent space, a compressed representation with 64x fewer spatial positions than the full image (64x64 instead of 512x512)
- 3. U-Net removes noise step by step: At each of 20-50 steps, the U-Net predicts the noise in the current latent and that noise is subtracted. Each step makes the image clearer, like a sculptor removing chips
- 4. Decode to final image: The VAE decoder transforms the clean latent representation back into a full-resolution image (512x512 or 1024x1024 pixels)
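Steps 2 and 3 above can be sketched as a toy denoising loop. This is an illustration under stated assumptions, not how any particular model is implemented: the update rule is a deterministic DDIM-style step, the schedule is a simple cosine-shaped toy schedule, and the trained U-Net is replaced by a hypothetical "oracle" predictor that already knows the clean target. Text encoding and VAE decoding (steps 1 and 4) are left out.

```python
import math
import random

def cosine_alpha_bars(num_steps):
    """Toy schedule: index 0 is nearly clean, the last index is nearly pure noise."""
    return [math.cos(0.5 * math.pi * (i + 1) / (num_steps + 1)) ** 2
            for i in range(num_steps)]

def make_oracle_predictor(target):
    """Hypothetical stand-in for the U-Net: it 'knows' the clean target."""
    def predict_noise(x, ab_t):
        return [(xv - math.sqrt(ab_t) * tv) / math.sqrt(1.0 - ab_t)
                for xv, tv in zip(x, target)]
    return predict_noise

def denoise(latent, predict_noise, alpha_bars):
    """Walk from the noisiest step down to a clean latent (DDIM-style, no added noise)."""
    x = list(latent)
    for i in range(len(alpha_bars) - 1, -1, -1):
        ab_t = alpha_bars[i]
        ab_prev = alpha_bars[i - 1] if i > 0 else 1.0  # 1.0 means fully clean
        eps = predict_noise(x, ab_t)
        # Estimate the clean latent implied by the predicted noise.
        x0_pred = [(xv - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
                   for xv, e in zip(x, eps)]
        # Re-noise that estimate to the next, less noisy level.
        x = [math.sqrt(ab_prev) * p + math.sqrt(1.0 - ab_prev) * e
             for p, e in zip(x0_pred, eps)]
    return x

rng = random.Random(0)
target = [0.8, -0.3, 0.1, 0.5]               # toy "clean latent"
start = [rng.gauss(0.0, 1.0) for _ in target]  # step 2: pure random noise
alpha_bars = cosine_alpha_bars(30)             # 30 denoising steps
result = denoise(start, make_oracle_predictor(target), alpha_bars)
```

With a perfect predictor the loop lands on the target from any starting noise. A real U-Net only approximates the noise, and the text embedding steers which image that approximation converges to.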
Where Diffusion Models Are Used
- Text-to-Image: DALL-E 3, Stable Diffusion, Midjourney — generate images from natural language descriptions with stunning quality and creative control
- Image Editing & Inpainting: Edit specific parts of an image while keeping the rest intact — remove objects, change backgrounds, fill gaps seamlessly
- Video Generation: Sora, Runway, Kling — extend diffusion to the temporal dimension, generating coherent video sequences from text prompts
Common Pitfall: Setting the guidance scale too high (>15) causes oversaturation and artifacts rather than better prompt adherence. Setting it too low (<3) lets the model largely ignore the prompt. A range of 7-9 works well for most use cases. Always experiment before committing.
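To see why the guidance scale behaves this way, here is a minimal sketch of the classifier-free guidance combination. It assumes the model is run twice per step, once with the prompt and once without, and that the two noise predictions are plain lists of numbers. The function name `apply_guidance` is hypothetical.

```python
def apply_guidance(eps_uncond, eps_cond, guidance_scale):
    """Push the prediction away from 'no prompt' and toward 'with prompt'."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]  # toy prediction without the prompt
eps_cond = [1.0, 1.0]    # toy prediction with the prompt

no_prompt = apply_guidance(eps_uncond, eps_cond, 0.0)    # [0.0, 1.0]
plain_cond = apply_guidance(eps_uncond, eps_cond, 1.0)   # [1.0, 1.0]
typical = apply_guidance(eps_uncond, eps_cond, 7.5)      # [7.5, 1.0]
```

A scale of 1 uses the prompt-conditioned prediction as is. Larger values extrapolate past it, which strengthens prompt adherence but also amplifies the prediction itself, which is consistent with the oversaturation seen at very high scales.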
Fun Fact: Stable Diffusion processes images in a latent space of just 64×64×4 instead of the full 512×512×3 pixel space. This 48x reduction in the number of values is a large part of what makes it practical to run on consumer GPUs. The VAE decoder at the end decodes the tiny latent back to full resolution, so all the "creativity" happens in this compressed space.
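The arithmetic behind these figures is quick to check. The sketch below uses only the shapes quoted above; the variable names are illustrative.

```python
pixel_values = 512 * 512 * 3   # values in the full RGB image
latent_values = 64 * 64 * 4    # values in the latent representation

overall_ratio = pixel_values / latent_values  # 48.0
spatial_ratio = (512 * 512) / (64 * 64)       # 64.0 (8x per side)
```

The 64x figure counts spatial positions only, while the 48x figure counts all values, because the latent has 4 channels where the image has 3.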
Try It Yourself!
Generate a high-quality image from a text description
Vague prompt ("Draw a cat"): [Generated image] Blurry cat without details. Neutral background. Unclear style, neither photo nor drawing. Unnatural proportions.
Detailed prompt: [Generated image] Detailed orange tabby cat with realistic fur texture on a windowsill. Soft light creates a warm atmosphere. Bokeh background with a European city view. Natural proportions and expression.
A vague prompt such as "Draw a cat" gives a blurry result. A detailed description (style, scene, lighting, angle) combined with a negative prompt and a moderate guidance scale (7.5) gives a more predictable, higher-quality result.