Production Guardrails
Safety in production
The Problem: LLMs can generate harmful, incorrect, or off-topic content. How do you prevent your AI from saying things it shouldn't?
The Solution: Guardrails Like a Staircase
Guardrails are safety mechanisms that constrain LLM behavior and filter outputs. They're like the rails on a staircase — they don't restrict normal movement, but prevent dangerous falls. They defend against prompt injection, reduce hallucinations, and should be validated through red teaming.
Think of it like safety rails:
- 1. Input validation: Block malicious prompts and check length limits before they reach the model
- 2. Prompt injection detection: A classifier flags suspicious inputs that attempt to override system instructions
- 3. LLM generates response: The validated prompt is sent to the model for generation
- 4. Output filtering: PII detection, content policy check — block toxic, sensitive, or off-topic content
- 5. Format validation: Verify output matches expected schema — JSON structure, required fields, value ranges
- 6. Log and monitor: Flag edge cases for human review, track guardrail trigger rates
- 7. OWASP LLM Top 10 audit: Check each of the 10 threats: LLM01 (Prompt Injection), LLM02 (Insecure Output), LLM03 (Training Data Poisoning), LLM06 (Sensitive Info Disclosure), LLM07 (Insecure Plugin Design). Prioritize by your attack surface.
- 8. Integrate a guardrail library: Choose based on needs: NeMo Guardrails for dialog flows, Guardrails AI for structured output validation, LlamaGuard for content safety classification. Combine multiple for defense-in-depth.
Critical rule: never trust LLM output in security-sensitive contexts. Always validate output format, content, and safety before presenting to users.
Types of Guardrails
- Content Filters: Block profanity, violence, PII
- Topic Restrictions: Stay within allowed domains
- Format Validators: Ensure output matches expected schema
- Hallucination Detectors: Flag unsupported claims
- PII Protection: Detect and redact personal information (SSN, credit cards, emails) before sending to or returning from the LLM
- OWASP LLM Top 10: Industry standard threat list: prompt injection, data leakage, insecure output handling, training data poisoning, model DoS, supply chain vulnerabilities, excessive agency, overreliance, model theft, insecure plugins
- Guardrail Libraries: NeMo Guardrails (NVIDIA): programmable rails in Colang. Guardrails AI: Pydantic-based output validation. LlamaGuard (Meta): safety classifier model for input/output filtering.
- Red Teaming: Systematically attack your own system before launch. Test jailbreaks, data extraction, instruction override. Tools: Garak, PyRIT (Microsoft), manual adversarial testing.
Fun Fact: The best guardrails are often simple! Adding "You are a helpful assistant for [Company]. Only answer questions about [Topic]" to system prompts can block 80% of off-topic requests.
Try It Yourself!
Use the interactive example below to see how guardrails filter inputs and outputs to keep AI responses safe and appropriate.
Try it yourself
Interactive demo of this technique
Protect an LLM application from processing malicious user input
Sure, I can help get the data. Here's how to access the customer database: SELECT * FROM customers...
I cannot fulfill this request. I only help with questions about orders, shipping, returns, and payment. If you need other assistance, contact support: support@store.com
A three-layer guardrail (topic whitelist + injection patterns + refusal rule) in the system prompt blocks most attacks. In production, add an external input classifier.
Create a free account to solve challenges
3 AI-verified challenges for this lesson
This lesson is part of a structured LLM course.
My Learning Path