Lesson 9 (Emerging)

Video & Audio

Emerging capabilities

The Problem: Videos contain a wealth of information — actions, dialogue, context over time. How can AI understand and analyze video content?

The Solution: Describing Movies for the Blind

Video Understanding enables AI to analyze sequences of frames, track objects over time, understand audio, and comprehend the narrative. It's like audio description for movies — explaining what's happening on screen. It builds on vision model capabilities and represents the frontier of multimodal AI.

Think of it like describing a movie for someone who can't see:

1. Sample key frames: Extract frames at fixed intervals (e.g. 1 frame/sec) or on motion triggers, because most vision LLMs take still images as input rather than a raw video stream
2. Analyze each frame: Vision LLM describes scene, objects, text, and actions in each selected frame
3. Combine with audio: Transcribe the audio track, align with frame timestamps to build a multimodal timeline
4. Temporal reasoning: Connect events across frames: "person entered at 0:15, picked up object at 0:23, left at 0:31"
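
The four steps above can be sketched as a small timeline builder. This is a minimal illustration, not a reference pipeline: `Event`, `sample_timestamps`, `build_timeline`, `mmss`, and `render` are hypothetical names, the frame descriptions stand in for what a vision model would return in step 2, and the transcript segments are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass
class Event:
    t: float      # seconds from the start of the video
    source: str   # "frame" (vision description) or "audio" (transcript)
    text: str


def sample_timestamps(duration_s: float, interval_s: float) -> list[float]:
    """Step 1: timestamps at which to extract frames, at a fixed interval."""
    times, i = [], 0
    while i * interval_s <= duration_s:
        times.append(i * interval_s)
        i += 1
    return times


def build_timeline(frame_notes: dict[float, str],
                   transcript: list[tuple[float, str]]) -> list[Event]:
    """Step 3: merge per-frame descriptions and transcript segments by time."""
    events = [Event(t, "frame", text) for t, text in frame_notes.items()]
    events += [Event(start, "audio", text) for start, text in transcript]
    return sorted(events, key=lambda e: e.t)


def mmss(t: float) -> str:
    """Format seconds as m:ss, matching the timestamps used in this lesson."""
    minutes, seconds = divmod(int(t), 60)
    return f"{minutes}:{seconds:02d}"


def render(timeline: list[Event]) -> str:
    """Step 4 input: one timestamped line per event, ready to paste in a prompt."""
    return "\n".join(f"[{mmss(e.t)}] {e.source}: {e.text}" for e in timeline)


# Step 2 would fill frame_notes from a vision model; these are placeholders.
frame_notes = {15.0: "person enters", 23.0: "person picks up object", 31.0: "person leaves"}
transcript = [(14.0, "(door opens)"), (24.0, "(speech)")]  # made-up segments

print(render(build_timeline(frame_notes, transcript)))
```

The rendered text is what the model reasons over in step 4: because every line carries a timestamp, the model can connect events across frames instead of describing each one in isolation.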

Cost trade-off: 1 frame/sec for a 1-hour video = 3,600 frames × ~200 tokens each = ~720K tokens. Sampling at 0.1 fps cuts cost 10×, but may miss brief events.
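
The arithmetic in the cost trade-off can be checked with a few lines of Python. The helper name `frame_token_cost` is hypothetical, and the default of 200 tokens per frame is the lesson's rough figure; actual per-image token counts depend on the model and image resolution.

```python
def frame_token_cost(duration_s: float, fps: float,
                     tokens_per_frame: int = 200) -> tuple[int, int]:
    """Return (number of sampled frames, estimated input tokens)."""
    frames = round(duration_s * fps)
    return frames, frames * tokens_per_frame


one_hour = 60 * 60  # seconds

for fps in (1.0, 0.1):
    frames, tokens = frame_token_cost(one_hour, fps)
    print(f"{fps} fps: {frames} frames, about {tokens:,} tokens")
```

At 0.1 fps the gap between sampled frames is 10 seconds, so any event shorter than that can fall entirely between two samples.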

Where Is This Used?

YouTube Summarization: Extract key points from a 2-hour lecture into a 5-minute summary with timestamps
Security Camera Analysis: Sample 1 frame/sec from 24/7 footage, flag anomalies — a person in a restricted zone, an unattended bag
Meeting Summaries: Combine audio transcription + screen share frames to produce action items and decisions
Training & Education: Auto-generate quizzes from recorded lectures, identify key visual moments for highlights
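
For always-on footage such as the security-camera case, motion-triggered sampling avoids paying for thousands of near-identical frames. The sketch below is one simple way to do it (mean absolute pixel difference against the last kept frame) and is an assumption, not a method the lesson prescribes. Frames are modeled as flat lists of grayscale values, and `motion_keyframes` and its threshold are hypothetical.

```python
Frame = list[int]  # flattened grayscale pixels, 0-255


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel brightness change between two same-sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def motion_keyframes(frames: list[Frame], threshold: float = 10.0) -> list[int]:
    """Indices of frames that differ enough from the last kept frame."""
    if not frames:
        return []
    kept = [0]  # always keep the first frame as a reference
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[kept[-1]], frames[i]) > threshold:
            kept.append(i)
    return kept


# Toy 4x4 "camera": an empty scene, then something bright enters, then leaves.
empty = [10] * 16
occupied = [10] * 8 + [200] * 8
footage = [empty, empty, occupied, occupied, empty]

print(motion_keyframes(footage))
```

Only the frames where the scene changes are forwarded to the vision model; the threshold trades missed subtle motion against wasted tokens on noise.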

Fun Fact: Video understanding is one of the most challenging AI tasks because it requires integrating vision, audio, and temporal reasoning all at once. Recent models like Gemini can watch hours of video and answer questions!

Temporal Reasoning: Connecting Events Across Frames

Example: Bird on a Wire (9-sec video)

1. Frame @0:00: Dove on wire, head down, preening
2. Frame @0:02: Still preening, grooming chest feathers
3. Frame @0:04: Lifts head, turns right (alert posture)
4. Frame @0:07: Walks along the wire, leaning forward
5. Frame @0:08: Takes flight and leaves the frame
6. Frame @0:09: Empty wires, only foliage remains

Model analysis (all 6 frames)

“A dove rests on a power line, preening its feathers (0:00-0:02). It becomes alert, turning to face right (0:04). The bird walks along the wire (0:07), then takes flight (0:08), leaving only empty wires and green foliage (0:09). Complete behavioral sequence: rest → grooming → alertness → locomotion → departure.”
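
To see why the sampling rate matters for this example, the toy script below checks which of the six listed frames a fixed-interval sampler would actually capture. It is a simplification under stated assumptions: only the six timestamps from the example have descriptions, and `sampled_seconds` and `visible_events` are hypothetical helpers.

```python
BIRD_FRAMES = {
    0: "Dove on wire, head down, preening",
    2: "Still preening, grooming chest feathers",
    4: "Lifts head, turns right, alert posture",
    7: "Walks along the wire, leaning forward",
    8: "Takes flight and leaves the frame",
    9: "Empty wires, only foliage remains",
}
DURATION_S = 9


def sampled_seconds(interval_s: int) -> list[int]:
    """Whole-second timestamps hit when sampling every interval_s seconds."""
    return list(range(0, DURATION_S + 1, interval_s))


def visible_events(interval_s: int) -> dict[int, str]:
    """The listed frames that the sampler would actually capture."""
    return {t: BIRD_FRAMES[t] for t in sampled_seconds(interval_s) if t in BIRD_FRAMES}


for interval in (1, 3, 4):
    kept = visible_events(interval)
    print(f"every {interval}s: {len(kept)}/{len(BIRD_FRAMES)} listed frames at {sorted(kept)}")
```

Sampling every 3 seconds keeps only 0:00 and 0:09, so the model would see a preening dove and then an empty wire, with no evidence of the alert posture, the walk, or the take-off.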


Bad prompt

“What happens in this video?”

→ Gets disjointed frame descriptions

Good prompt

“Track the bird across frames. Note posture changes and describe the full behavioral sequence with timestamps.”

→ Gets the connected narrative above

Technique Comparison

Video analysis: single frame vs temporal analysis across keyframes

Without technique

Prompt

Describe what happens in this video. [Single frame from a cooking video: hands cutting an onion]

Response

The image shows hands cutting an onion on a cutting board. A kitchen knife is being used.

Tokens: 280/30

Time: 1500 ms

With technique

Prompt

Analyze a cooking video using 6 keyframes (extracted every 30 sec):

Frame 1 (0:00): Ingredients on table: onion, garlic, tomatoes, olive oil, spaghetti
Frame 2 (0:30): Dicing onion into small cubes
Frame 3 (1:00): Sauteing onion and garlic in a pan
Frame 4 (1:30): Adding diced tomatoes
Frame 5 (2:00): Sauce simmering, spaghetti boiling in pot
Frame 6 (2:30): Finished dish: spaghetti with tomato sauce on a plate

Describe:
1. What dish is being prepared?
2. Action timeline with timestamps
3. Cooking techniques visible
4. Skill level assessment (beginner/intermediate/pro)

Response

1. Dish: Spaghetti with homemade tomato sauce (Pasta al Pomodoro)

2. Timeline:

0:00 — Mise en place: ingredients pre-prepared
0:30 — Cutting: small dice (brunoise) — good technique
1:00 — Soffritto: onion + garlic on medium heat (no burning)
1:30 — Adding tomatoes: fresh, not canned
2:00 — Parallel cooking: sauce + pasta simultaneously
2:30 — Plating: sauce on top, not fully mixed

3. Techniques: Brunoise (dicing), soffritto (aromatic base), parallel cooking, correct ingredient sequence.

4. Skill level: Intermediate. Signs: neat cuts, mise en place, parallel processes. But fresh tomatoes instead of San Marzano — a simplification.

👁️ Single frame = one moment. Cannot understand the process, recipe, or timeline

🧠 Keyframes with timestamps add temporal dimension: model sees cooking progress

🔍 Structured questions (dish, timeline, techniques, level) extract maximum from visual data

✅ For video: extract keyframes → describe each → ask model to analyze the sequence

Tokens: 380/260

Time: 3800 ms

Why this works

For video analysis: extract keyframes with timestamps → feed as a sequence → model analyzes the PROCESS, not a static image. A single frame is not a video.
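
The keyframes-to-prompt step can be sketched as a small prompt builder that reproduces the cooking prompt from the comparison above. `build_prompt` is a hypothetical helper, and the text descriptions stand in for the frames themselves; with a real vision model you would typically attach the images, and the API call is omitted because it depends on the provider.

```python
KEYFRAMES = [
    ("0:00", "Ingredients on table: onion, garlic, tomatoes, olive oil, spaghetti"),
    ("0:30", "Dicing onion into small cubes"),
    ("1:00", "Sauteing onion and garlic in a pan"),
    ("1:30", "Adding diced tomatoes"),
    ("2:00", "Sauce simmering, spaghetti boiling in pot"),
    ("2:30", "Finished dish: spaghetti with tomato sauce on a plate"),
]

QUESTIONS = [
    "What dish is being prepared?",
    "Action timeline with timestamps",
    "Cooking techniques visible",
    "Skill level assessment (beginner/intermediate/pro)",
]


def build_prompt(subject: str, keyframes: list[tuple[str, str]],
                 questions: list[str], interval_s: int) -> str:
    """Turn timestamped keyframes plus structured questions into one prompt."""
    lines = [
        f"Analyze a {subject} video using {len(keyframes)} keyframes "
        f"(extracted every {interval_s} sec):"
    ]
    # Keep frames in chronological order so the model can reason about sequence.
    for i, (timestamp, description) in enumerate(keyframes, start=1):
        lines.append(f"Frame {i} ({timestamp}): {description}")
    lines.append("Describe:")
    for i, question in enumerate(questions, start=1):
        lines.append(f"{i}. {question}")
    return "\n".join(lines)


prompt = build_prompt("cooking", KEYFRAMES, QUESTIONS, interval_s=30)
print(prompt)
```

Keeping the timestamps next to each frame and ending with numbered questions is what steers the model toward a structured, time-ordered answer rather than six unrelated captions.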

