Lesson 12

Fine-tuning vs Prompting

When to train your model

The Problem: Your chatbot needs to know your company's products. Should you train a custom model or just put product info in the prompt?

The Solution: Tutor vs Textbook

Fine-tuning means taking a pretrained language model and continuing its training on a smaller, focused dataset so it adapts to your task, domain, or style. The model already knows grammar, facts, and reasoning from its original pretraining; fine-tuning nudges its weights toward your patterns: your tone of voice, your output format, your jargon. Before you fine-tune, first try to improve your results with better prompt engineering. To see the trade-off, imagine you need to learn a new subject. You have two main paths:

Textbook (prompting): Read the chapter before each test. Quick to start, but you have to re-read every time.
Tutor (fine-tuning): Study with a teacher until you truly understand. Takes time upfront, but knowledge stays with you.

Full fine-tuning vs LoRA / PEFT

Full fine-tuning updates every weight in the model. It gives the most control but is expensive: with a standard Adam optimizer, a 7-billion-parameter model needs tens of gigabytes of GPU memory just to hold the optimizer state, and you end up storing a complete copy of the model for each task. PEFT (Parameter-Efficient Fine-Tuning) avoids this by freezing the original weights and training only a small set of new parameters. The most popular PEFT method is LoRA (Low-Rank Adaptation): it adds pairs of small trainable low-rank matrices alongside selected weight matrices in each layer (typically the attention projections), so you train well under 1% of the parameters. QLoRA goes further, loading the frozen base model in 4-bit quantization so a single consumer GPU can fine-tune a large model. The resulting LoRA adapter is small, typically a few megabytes to tens of megabytes depending on rank and model size, so you can swap adapters per task without duplicating the whole model.
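To make the "well under 1%" figure concrete, here is a rough back-of-the-envelope sketch. The model shape (hidden size 4096, 32 layers), the rank of 8, and the choice to adapt two attention projections per layer are illustrative assumptions, not properties of any particular model.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA learns an update B @ A, where A is (rank x d_in) and B is (d_out x rank).
    """
    return rank * d_in + d_out * rank


# Assumed, illustrative shape of a 7B-class transformer.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
ADAPTED_MATRICES_PER_LAYER = 2  # assumption: two attention projections per layer
RANK = 8                        # assumption: a small, commonly chosen rank
TOTAL_BASE_PARAMS = 7_000_000_000

per_matrix = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
trainable = per_matrix * ADAPTED_MATRICES_PER_LAYER * NUM_LAYERS
fraction = trainable / TOTAL_BASE_PARAMS
adapter_mb_fp16 = trainable * 2 / 1_000_000  # 2 bytes per parameter in fp16

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of base model: {fraction:.4%}")
print(f"adapter size at fp16: {adapter_mb_fp16:.1f} MB")
```

Under these assumptions the adapter holds about 4.2 million parameters, which is roughly 0.06% of the base model and about 8 MB on disk. Higher ranks or more adapted matrices scale these numbers up proportionally.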

When to fine-tune vs prompt or RAG

Reach for fine-tuning when you need a consistent style, format, or behavior that prompting can't reliably enforce — for example, always emitting strict JSON, classifying support tickets, or matching a brand voice across thousands of calls. It also lowers per-request cost, because the instructions are baked into the weights instead of repeated in every prompt. But fine-tuning is the wrong tool for injecting fresh facts. To teach a model knowledge that changes often, use Retrieval Augmented Generation (RAG): store your documents in a searchable vector database, find the passages relevant to each question using embeddings, and paste them into the prompt automatically. Updating RAG means editing a document; updating a fine-tuned fact means retraining. A practical rule: start with prompting, add RAG for knowledge, and fine-tune only for behavior you can't prompt your way to.
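The retrieve-then-paste flow of RAG can be sketched in a few lines. This is a toy illustration only: the word-count `embed` function stands in for a real embedding model, a plain Python list stands in for a vector database, and the product documents are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, documents: list[str], k: int = 1) -> str:
    """Paste the retrieved passages into the prompt ahead of the question."""
    context = "\n".join(f"- {p}" for p in retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical product documents, invented for this example.
DOCS = [
    "The Aurora kettle has a 1.7 litre capacity and a two-year warranty.",
    "The Borealis blender ships with three speed settings.",
    "Returns are accepted within 30 days of delivery.",
]

prompt = build_prompt("What is the warranty on the Aurora kettle?", DOCS)
print(prompt)
```

Updating the system's knowledge here means editing an entry in `DOCS`; no weights change, which is the property that makes RAG a better fit than fine-tuning for facts that move.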

Concretely, suppose you want a model that turns customer emails into structured tickets with fields like priority, category, and summary. You collect 500–1,000 example email-to-ticket pairs, format them as prompt/completion data, and fine-tune with LoRA. For a modestly sized model, the job typically runs in an hour or two on one GPU, and afterward the model emits clean JSON without a long instruction prompt, which makes it cheaper and more reliable at inference time than prompting alone.
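As a sketch of the data-preparation step, the snippet below turns email/ticket pairs into prompt/completion records in JSONL form. The two emails, the `Email:`/`Ticket:` prompt template, and the exact `prompt`/`completion` key names are assumptions for illustration; the schema your training tool expects may differ.

```python
import json

# Hypothetical examples; real training data would come from your own labeled emails.
EXAMPLES = [
    {
        "email": "Our checkout page has been down since 9am and we are losing orders!",
        "ticket": {"priority": "high", "category": "outage",
                   "summary": "Checkout page down since 9am"},
    },
    {
        "email": "Could you tell me how to change the billing address on my account?",
        "ticket": {"priority": "low", "category": "billing",
                   "summary": "Customer wants to change billing address"},
    },
]

REQUIRED_FIELDS = {"priority", "category", "summary"}


def to_record(example: dict) -> dict:
    """Convert one email/ticket pair into a prompt/completion record."""
    ticket = example["ticket"]
    missing = REQUIRED_FIELDS - ticket.keys()
    if missing:
        # Reject inconsistent examples early; noisy data hurts fine-tuning.
        raise ValueError(f"ticket is missing fields: {sorted(missing)}")
    return {
        "prompt": f"Email:\n{example['email']}\n\nTicket:",
        "completion": " " + json.dumps(ticket, sort_keys=True),
    }


def to_jsonl(examples: list[dict]) -> str:
    """Serialize all examples as one JSON object per line."""
    return "\n".join(json.dumps(to_record(e)) for e in examples)


jsonl = to_jsonl(EXAMPLES)
print(jsonl)
```

Validating the required fields while building the file is a cheap way to keep every completion in the same shape, which is what teaches the model to emit that shape consistently.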

Think of it like learning with a tutor vs textbook:

1. Prompting is better when: the information changes frequently, you need to start quickly, the budget is limited, or the task is general-purpose
2. Fine-tuning is better when: you need a specific style or format, you have lots of good examples, speed at inference time matters, or the same task repeats thousands of times
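The rule of thumb in this lesson (start with prompting, add RAG for knowledge, fine-tune only for behavior you cannot prompt your way to) can be written down as a small decision helper. The `UseCase` fields and the `recommend` function are hypothetical names made up for this sketch, and the logic is a simplification rather than a definitive policy.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    prompting_is_good_enough: bool      # does a well-engineered prompt already work?
    needs_changing_knowledge: bool      # facts that are updated often
    needs_strict_style_or_format: bool  # e.g. strict JSON or a brand voice
    has_many_good_examples: bool        # enough clean training pairs


def recommend(case: UseCase) -> list[str]:
    """Return the techniques to use, in the order you would adopt them."""
    steps = ["prompting"]  # always the starting point
    if case.needs_changing_knowledge:
        steps.append("RAG")  # knowledge goes into retrieved context, not weights
    wants_fine_tuning = (
        not case.prompting_is_good_enough
        and case.needs_strict_style_or_format
        and case.has_many_good_examples
    )
    if wants_fine_tuning:
        steps.append("fine-tuning")  # only for behavior prompting cannot enforce
    return steps


product_chatbot = UseCase(
    prompting_is_good_enough=True,
    needs_changing_knowledge=True,
    needs_strict_style_or_format=False,
    has_many_good_examples=False,
)
print(recommend(product_chatbot))
```

For the product chatbot from the start of the lesson, the helper suggests prompting plus RAG, because the missing ingredient is knowledge that changes rather than behavior.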

Types of Fine-tuning

Full fine-tuning: retrain all weights. Highest quality ceiling, but expensive (large models typically need several high-memory GPUs)
LoRA / QLoRA: train only small adapters. Cheap and fast, with quality usually close to full fine-tuning
RLHF: train using human feedback. For teaching preferences, such as being helpful rather than harmful
Instruction tuning: teach the model to follow instructions. Turns a base model into a chatbot

Fun Fact: Start with prompting + RAG, and fine-tune only when you hit clear limits. As a rough rule of thumb, most use cases never need fine-tuning. Fine-tuning is expensive, so make sure you've exhausted cheaper options first.

Frequently asked questions

How is fine-tuning different from RAG?

Fine-tuning changes the model's weights to bake in a style, format, or behavior, which requires retraining to update. RAG leaves the weights untouched: it retrieves relevant documents from a vector database and inserts them into the prompt at query time. So RAG is best for knowledge that changes often, while fine-tuning is best for behavior you can't reliably get from prompting.

How many examples do you need to fine-tune a model?

For a narrow task like classification or a fixed output format, 500–1,000 good examples is usually enough, and sometimes a few hundred works. Data quality and consistency matter more than raw volume — noisy or contradictory examples hurt more than having fewer of them. Complex behavior or a new style may need several thousand pairs.

What is LoRA and how does it differ from full fine-tuning?

Full fine-tuning updates every weight, needs a lot of GPU memory, and stores a full model copy per task. LoRA freezes the original weights and trains only small added matrices — under 1% of the parameters. The resulting adapter is just a few megabytes, training is cheaper and faster, and quality is close to full fine-tuning.

When should I fine-tune versus just use a prompt?

Start with prompt engineering — it's free and instant. Fine-tune only when you need a consistent style, a strict format like JSON, or the same task repeats thousands of times and per-request cost matters. If what you're missing is fresh facts, use RAG instead of fine-tuning.

Technique Comparison

Task (beginner, analysis): Classify medical symptoms by doctor specialty

Without technique

Prompt

Determine doctor specialty by patient symptoms.
Examples:
Symptom: "chest pain, shortness of breath" → Cardiologist
Symptom: "skin rash, itching" → Dermatologist
Symptom: "lower back pain, leg numbness" → Neurologist
Symptom: "frequent urination, lower abdominal pain" → Urologist
Symptom: "headache, nausea, light sensitivity" → Neurologist
Now classify: "eye redness, tearing, gritty sensation"

Response

Dermatologist (redness) or Neurologist (gritty sensation might be neurological).

Not sure about the answer — need more context.

Tokens: 120 input / 30 output

Time: 580 ms

With technique (fine-tuning)

Prompt

[Fine-tuned model on 5,000 symptom→specialty pairs from medical database] Input: "eye redness, tearing, gritty sensation"

Response

Ophthalmologist (confidence: 0.96)

Related specialists: Allergist (0.23 — if allergic conjunctivitis)

👁️ Few-shot prompt: 5 examples do not cover ophthalmology → model "guesses" between dermatology and neurology

🧠 Fine-tuning: 5,000 pairs teach the model all specialties. Knowledge is "baked into" weights, so no examples are needed in the prompt

🔢 Result: 120→15 input tokens (8x savings), 580→120 ms latency, plus confidence scores

✅ Fine-tuning is justified when: (1) domain expertise is needed, (2) request volume is high, (3) few-shot is not enough

Tokens: 15 input / 20 output

Time: 120 ms

Why this works

Few-shot is limited by examples in the prompt. Fine-tuning "bakes" thousands of examples into model weights — faster, cheaper per request and more accurate for narrow domain tasks.
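The per-request savings from the comparison above add up at volume. The sketch below reuses the token and latency figures from the demo; the request count of 100,000 is an assumed volume, and the calculation deliberately ignores the one-time cost of training, which you would need to weigh against these savings.

```python
# Figures taken from the demo comparison above.
FEW_SHOT = {"input_tokens": 120, "output_tokens": 30, "latency_ms": 580}
FINE_TUNED = {"input_tokens": 15, "output_tokens": 20, "latency_ms": 120}

REQUESTS = 100_000  # assumption: illustrative request volume


def total_tokens(run: dict, requests: int) -> int:
    """Input plus output tokens consumed across all requests."""
    return (run["input_tokens"] + run["output_tokens"]) * requests


input_ratio = FEW_SHOT["input_tokens"] / FINE_TUNED["input_tokens"]
latency_ratio = FEW_SHOT["latency_ms"] / FINE_TUNED["latency_ms"]
tokens_saved = total_tokens(FEW_SHOT, REQUESTS) - total_tokens(FINE_TUNED, REQUESTS)

print(f"input tokens per request: {input_ratio:.0f}x fewer")
print(f"latency: {latency_ratio:.1f}x faster")
print(f"tokens saved over {REQUESTS:,} requests: {tokens_saved:,}")
```

At the assumed volume, the fine-tuned model processes 11.5 million fewer tokens. Whether that repays the training effort depends on your per-token price and how long the task stays stable.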

This lesson is part of a structured LLM course.
