API Integration Patterns
Streaming, retries, errors
The Problem: How do you integrate LLMs into your application reliably? What about timeouts, rate limits, retries, and streaming?
The Solution: Standard Connection Methods
API patterns are proven approaches for reliably integrating LLMs into applications. They're like standardized connectors — USB, HDMI, or power plugs — that make it easy to connect different systems together. Streaming reduces perceived latency, and structured output ensures predictable responses.
A reliable integration typically follows five steps:

1. Choose endpoint: Pick the right API (chat completions, embeddings, or function calling) based on your task
2. Set up streaming: Stream tokens for real-time UX; users see the first token in ~200 ms instead of waiting seconds
3. Retry with exponential backoff: Automatically retry on 429 (rate limit), 500 (server error), and 503 (overloaded) with increasing delays
4. Add timeout and circuit breaker: Set request timeouts and stop calling a failing provider after N consecutive errors
5. Handle rate limits gracefully: Queue excess requests, apply backpressure, and show users a meaningful wait indicator
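Step 4 can be sketched as a minimal circuit breaker in pure Python. This is an illustrative implementation, not any SDK's built-in API: `CircuitBreaker`, its thresholds, and the wrapped callable are all placeholder names.

```python
import time


class CircuitBreaker:
    """Stop calling a failing provider after N consecutive errors.

    After `max_failures` consecutive exceptions the circuit "opens" and
    calls fail fast for `reset_after` seconds before one trial call is
    allowed through again (the "half-open" state).
    """

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds before retrying the provider
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: provider marked unhealthy")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Any LLM call (or the retry wrapper below) can be routed through `breaker.call(...)` so a dead provider fails fast instead of burning the full retry budget on every request.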
Where Is This Used?
- Chatbots: Streaming for real-time response display
- Batch Processing: Retries and rate limiting for bulk jobs
- High-Availability: Fallback models and circuit breakers
- Cost Control: Request queuing and prioritization
- Multi-Provider Fallback: If primary provider is down, seamlessly switch to backup (e.g., Claude → OpenAI → local model)
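The multi-provider fallback above can be sketched in a few lines. The provider callables here are placeholders standing in for real SDK calls (e.g. Anthropic, OpenAI, a local model); this is not any library's built-in mechanism.

```python
def ask_with_fallback(prompt, providers):
    """Try each provider in order; return the first successful answer.

    `providers` is a list of (name, callable) pairs, ordered by
    preference, e.g. [("claude", ...), ("openai", ...), ("local", ...)].
    """
    errors = []
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as e:
            # Record the failure and fall through to the next provider
            errors.append(f"{name}: {e}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In practice each callable would wrap its own retry and timeout logic, so the fallback only fires after a provider is genuinely unreachable.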
Fun Fact: Streaming can reduce perceived latency by an order of magnitude. Users see the first token in ~200ms instead of waiting 3-5 seconds for the complete response. This single change often has the biggest impact on user satisfaction.
Try It Yourself!
See different API patterns in action and understand how they improve reliability.
Streaming Response
```python
import anthropic

client = anthropic.Anthropic()

# Stream tokens as they're generated
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain vector databases"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```

Best Practices
- Streaming: Always use for chat and long responses
- Timeouts: Set reasonable limits (30-60s for LLMs)
- Error handling: Log all errors, show user-friendly messages
- Monitoring: Track latency, error rate, token usage
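The monitoring practice above can be sketched as a small decorator. The in-memory `metrics` dict is a stand-in for a real metrics backend (Prometheus, StatsD, etc.), and `monitored` is an illustrative name, not a library API.

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def monitored(fn):
    """Record call count, error count, and total latency for an LLM call."""
    metrics = {"calls": 0, "errors": 0, "total_latency": 0.0}

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        metrics["calls"] += 1
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        except Exception:
            metrics["errors"] += 1
            logger.exception("LLM call failed")
            raise
        finally:
            # Latency is recorded for successes and failures alike
            metrics["total_latency"] += time.monotonic() - start

    wrapper.metrics = metrics
    return wrapper
```

Wrapping the API call function (`ask_llm = monitored(ask_llm)`) gives you error rate and average latency (`total_latency / calls`) with no changes to call sites.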
Handle errors in LLM API calls with retry logic
```python
import openai

def ask_llm(prompt):
    # Naive version: a 429 rate limit or 5xx server error
    # raises an exception and crashes the caller
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```
```python
import openai
import time
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def ask_llm(prompt: str, max_retries: int = 3) -> Optional[str]:
    """Call LLM with exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = openai.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return response.choices[0].message.content
        except openai.RateLimitError:  # 429: too many requests
            wait = 2 ** attempt  # 1s, 2s, 4s
            logger.warning(f"Rate limit, retry {attempt+1}/{max_retries} in {wait}s")
            time.sleep(wait)
        except openai.APIStatusError as e:  # inspect status for 5xx
            if e.status_code >= 500:
                wait = 2 ** attempt
                logger.error(f"Server error {e.status_code}, retry in {wait}s")
                time.sleep(wait)
            else:
                raise  # other 4xx: client error, don't retry
        except openai.APITimeoutError:
            logger.error(f"Timeout, attempt {attempt+1}/{max_retries}")
    logger.critical(f"All {max_retries} retries failed")
    return None
```
In production, every LLM API call should go through retry logic with exponential backoff, error separation (retry 429s and 5xx, fail fast on other 4xx), and logging. Without this, the app crashes at the first rate limit.