Fine-tuning vs Prompting
When to train your model
The Problem: Your chatbot needs to know your company's products. Should you train a custom model or just put product info in the prompt?
The Solution: Tutor vs Textbook
Fine-tuning means taking a pretrained language model and continuing its training on a smaller, focused dataset so it adapts to your task, domain, or style. The model already knows grammar, facts, and reasoning from its original pretraining; fine-tuning nudges its weights toward your patterns: your tone of voice, your output format, your jargon. Before fine-tuning, try improving your results with better prompt engineering. To see how the two approaches differ, imagine you need to learn a new subject. You have two main paths:
- Textbook (prompting): Read the chapter before each test. Quick to start, but you have to re-read every time.
- Tutor (fine-tuning): Study with a teacher until you truly understand. Takes time upfront, but knowledge stays with you.
Full fine-tuning vs LoRA / PEFT
Full fine-tuning updates every weight in the model. It gives the most control but is expensive: with a standard Adam-style optimizer, a 7-billion-parameter model needs tens of gigabytes of GPU memory just to hold the optimizer state, and you end up storing a complete copy of the model for each task. PEFT (Parameter-Efficient Fine-Tuning) avoids this by freezing the original weights and training only a small set of new parameters. The most popular PEFT method is LoRA (Low-Rank Adaptation): it adds pairs of small trainable low-rank matrices alongside selected weight matrices (typically the attention projections) in each layer, so you usually train well under 1% of the parameters. QLoRA goes further, loading the frozen base model in 4-bit quantization so a single consumer GPU can fine-tune a large model. The resulting LoRA adapter is often just a few megabytes to a few tens of megabytes, so you can swap adapters per task without duplicating the whole model.
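The numbers above can be checked with a little arithmetic. The sketch below is a rough estimate, not a measurement: the model shape (hidden size 4096, 32 layers), the LoRA rank of 8, and the choice of two adapted matrices per layer are illustrative assumptions, as is an Adam-style optimizer that keeps two 32-bit values per trained parameter.

```python
def optimizer_state_gb(n_params, bytes_per_param=8):
    """Adam-style optimizers keep two fp32 values (8 bytes) per trained parameter."""
    return n_params * bytes_per_param / 1e9


def lora_params_per_matrix(d_in, d_out, rank):
    """LoRA trains A (rank x d_in) and B (d_out x rank) instead of the full d_out x d_in matrix."""
    return rank * (d_in + d_out)


# Assumed, illustrative model shape (not taken from the lesson).
N_PARAMS = 7_000_000_000
HIDDEN = 4096
LAYERS = 32
ADAPTED_PER_LAYER = 2  # assumption: two attention projections adapted per layer
RANK = 8

full_optimizer_gb = optimizer_state_gb(N_PARAMS)
lora_total = LAYERS * ADAPTED_PER_LAYER * lora_params_per_matrix(HIDDEN, HIDDEN, RANK)
lora_fraction = lora_total / N_PARAMS
adapter_mb = lora_total * 2 / 1e6  # assumption: adapter stored in 16-bit precision

print(f"Optimizer state, full fine-tuning: {full_optimizer_gb:.0f} GB")
print(f"LoRA trainable parameters: {lora_total:,} ({lora_fraction:.4%} of the model)")
print(f"Adapter size on disk: about {adapter_mb:.1f} MB")
```

Under these assumptions the optimizer state alone is 56 GB for full fine-tuning, while LoRA trains about 4.2 million parameters (roughly 0.06% of the model) and produces an adapter of about 8 MB. Higher ranks or more adapted matrices raise those figures proportionally.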
When to fine-tune vs prompt or RAG
Reach for fine-tuning when you need a consistent style, format, or behavior that prompting can't reliably enforce, for example always emitting strict JSON, classifying support tickets, or matching a brand voice across thousands of calls. It can also lower per-request cost, because the instructions are baked into the weights instead of repeated in every prompt. But fine-tuning is the wrong tool for injecting fresh facts. To teach a model knowledge that changes often, use Retrieval-Augmented Generation (RAG): store your documents in a searchable vector database, find the passages relevant to each question using embeddings, and paste them into the prompt automatically. Updating RAG means editing a document; updating a fact learned through fine-tuning means retraining. A practical rule: start with prompting, add RAG for knowledge, and fine-tune only for behavior you can't prompt your way to.
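The RAG steps described above (embed, retrieve, paste into the prompt) can be sketched in a few lines. This is a toy illustration only: the `embed` function here is a plain word-count vector standing in for a real embedding model, the in-memory list stands in for a vector database, and the product documents are made up for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. A stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Insert the retrieved passages into the prompt at query time."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical product documents, invented for this sketch.
docs = [
    "The Alpha router supports wifi 6 and costs 120 dollars",
    "The Beta camera records 4k video and costs 300 dollars",
    "Returns are accepted within 30 days of purchase",
]
prompt = build_prompt("how much does the beta camera cost", docs)
print(prompt)
```

Changing a price here means editing one string in `docs`; the model weights are never touched, which is the property that makes RAG a good fit for knowledge that changes often.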
Concretely, suppose you want a model that turns customer emails into structured tickets with fields like priority, category, and summary. You collect 500–1,000 example email-to-ticket pairs, format them as prompt/completion data, and fine-tune with LoRA. The job runs in an hour or two on one GPU, and afterward the model emits clean JSON without a long instruction prompt — cheaper and more reliable at inference time than prompting alone.
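A minimal sketch of the data-preparation step might look like the following. The `prompt`/`completion` key names, the instruction wording, the allowed priority values, and the sample emails are all assumptions made for illustration; the exact record format depends on the training tool you use.

```python
import json

ALLOWED_PRIORITIES = {"low", "medium", "high"}  # hypothetical label set
REQUIRED_FIELDS = {"priority", "category", "summary"}


def validate_ticket(ticket):
    """Keep only tickets with all required fields and a known priority."""
    return REQUIRED_FIELDS.issubset(ticket) and ticket["priority"] in ALLOWED_PRIORITIES


def to_training_record(email_text, ticket):
    """Turn one email-to-ticket pair into a prompt/completion record."""
    prompt = f"Convert this customer email into a ticket.\n\nEmail:\n{email_text}\n\nTicket:"
    completion = json.dumps(ticket, sort_keys=True)  # stable key order keeps targets consistent
    return {"prompt": prompt, "completion": completion}


# Invented examples; a real dataset would hold 500 to 1,000 of these.
examples = [
    ("My order arrived broken, I need a replacement today.",
     {"priority": "high", "category": "shipping", "summary": "Damaged order, wants replacement"}),
    ("How do I change my billing address?",
     {"priority": "low", "category": "billing", "summary": "Asks how to update billing address"}),
    ("Something is wrong.",
     {"priority": "urgent!!", "category": "unknown"}),  # inconsistent label, filtered out
]

# One JSON object per line (JSONL), a format many fine-tuning tools accept.
jsonl = "\n".join(
    json.dumps(to_training_record(email, ticket))
    for email, ticket in examples
    if validate_ticket(ticket)
)
print(jsonl)
```

Filtering inconsistent examples before training matters because the model will learn whatever format the completions show, including mistakes.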
In terms of the tutor vs textbook picture:
- Prompting is better when: information changes frequently, you need to start quickly, the budget is limited, or the task is general-purpose
- Fine-tuning is better when: you need a specific style or format, you have lots of good examples, speed at inference time matters, or the same task repeats thousands of times
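The checklist above can be turned into a small decision helper. This is a deliberately simplified sketch: the function name, the boolean inputs, and the exact way the conditions are combined are assumptions for illustration, not a definitive rule.

```python
def recommend(info_changes_often, needs_strict_style_or_format,
              has_many_good_examples, high_request_volume):
    """Return the suggested approaches, cheapest first."""
    steps = ["prompting"]  # always the starting point
    if info_changes_often:
        steps.append("RAG")  # fresh knowledge goes into the prompt, not the weights
    wants_fine_tuning = needs_strict_style_or_format or high_request_volume
    if wants_fine_tuning and has_many_good_examples:
        steps.append("fine-tuning")  # only worthwhile with enough quality data
    return steps


# A product chatbot whose catalog changes weekly.
print(recommend(True, False, False, False))
# A ticket classifier with strict JSON output and plenty of labeled data.
print(recommend(False, True, True, True))
```

Note that the helper never recommends fine-tuning without good examples: a strict format requirement alone is a reason to collect data first, not to train.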
Types of Fine-tuning
- Full fine-tuning: retrain all weights. Maximum control over quality, but expensive (large models often need several high-memory GPUs)
- LoRA / QLoRA: train only small adapters. Cheap and fast, with quality usually close to full fine-tuning
- RLHF: train using human feedback. For teaching preferences, such as helpful rather than harmful responses
- Instruction tuning: teach the model to follow instructions. Turns a base model into a chatbot
Tip: Start with prompting + RAG. Fine-tune only when you hit clear limits; by a common rule of thumb, roughly 80% of use cases never need fine-tuning. Fine-tuning is expensive, so make sure you've exhausted cheaper options first.
Frequently asked questions
How is fine-tuning different from RAG?
Fine-tuning changes the model's weights to bake in a style, format, or behavior, which requires retraining to update. RAG leaves the weights untouched: it retrieves relevant documents from a vector database and inserts them into the prompt at query time. So RAG is best for knowledge that changes often, while fine-tuning is best for behavior you can't reliably get from prompting.
How many examples do you need to fine-tune a model?
For a narrow task like classification or a fixed output format, 500–1,000 good examples are usually enough, and sometimes a few hundred will do. Data quality and consistency matter more than raw volume: noisy or contradictory examples hurt more than having fewer of them. Complex behavior or a new style may need several thousand pairs.
What is LoRA and how does it differ from full fine-tuning?
Full fine-tuning updates every weight, needs a lot of GPU memory, and stores a full model copy per task. LoRA freezes the original weights and trains only small added matrices — under 1% of the parameters. The resulting adapter is just a few megabytes, training is cheaper and faster, and quality is close to full fine-tuning.
When should I fine-tune versus just use a prompt?
Start with prompt engineering — it's free and instant. Fine-tune only when you need a consistent style, a strict format like JSON, or the same task repeats thousands of times and per-request cost matters. If what you're missing is fresh facts, use RAG instead of fine-tuning.
Try it yourself
Interactive demo of this technique
Classify medical symptoms by doctor specialty
Dermatologist (redness) or Neurologist (gritty sensation might be neurological).
Not sure about the answer — need more context.
Ophthalmologist (confidence: 0.96)
Related specialists: Allergist (0.23 — if allergic conjunctivitis)
Few-shot prompting is limited by the number of examples that fit in the prompt. Fine-tuning "bakes" thousands of examples into the model weights, which makes it faster and cheaper per request and often more accurate for narrow domain tasks.
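To make the few-shot limit concrete, here is a small sketch of a prompt builder with a fixed size budget. Everything in it is illustrative: the labeled symptoms are hypothetical, and a character count stands in for the token limit a real model would impose.

```python
def build_few_shot_prompt(examples, query, max_chars=400):
    """Add labeled examples until the budget is used up; return the prompt and the count used."""
    header = "Classify the symptom by doctor specialty.\n\n"
    footer = f"Symptom: {query}\nSpecialty:"
    budget = max_chars - len(header) - len(footer)
    shots = []
    for symptom, specialty in examples:
        shot = f"Symptom: {symptom}\nSpecialty: {specialty}\n\n"
        if len(shot) > budget:
            break  # no room left: the remaining examples never reach the model
        shots.append(shot)
        budget -= len(shot)
    return header + "".join(shots) + footer, len(shots)


# Hypothetical labeled data, repeated to simulate a larger dataset.
examples = [
    ("itchy rash on the forearm", "Dermatologist"),
    ("recurring migraines with aura", "Neurologist"),
    ("blurred vision in one eye", "Ophthalmologist"),
] * 10

prompt, used = build_few_shot_prompt(examples, "red eye with a gritty sensation")
print(prompt)
print(f"Used {used} of {len(examples)} examples")
```

Only a handful of the available examples fit in the budget, and they are resent with every request. Fine-tuning on the same dataset lets the model learn from all of them once, with a short prompt at inference time.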