Senior 7 min · May 22, 2026

Few-Shot vs Zero-Shot Prompting — How a Missing Example Cost $12k in Overtime and a 3-Hour PagerDuty Storm

Q: How many examples should I use for few-shot prompting?

3-5 examples is the sweet spot for most LLMs. More than 5 increases latency and token cost without significant accuracy gains; fewer than 3 often fails to disambiguate edge cases.

Q: When should I use zero-shot instead of few-shot?

Use zero-shot when the task is well-defined and the model has strong priors (e.g., sentiment analysis on standard text), or when latency is critical (<200ms) and you can't afford the extra tokens.

Q: Can I mix few-shot and zero-shot in the same pipeline?

Yes—a hybrid pipeline is the most robust pattern. Use a classifier (e.g., logistic regression on embeddings) to route high-confidence inputs to zero-shot and ambiguous ones to few-shot.

Q: How do I debug a few-shot prompt that suddenly fails?

Check three things: (1) Did the input distribution shift? Compare embeddings of recent inputs vs. your few-shot examples. (2) Did the model version change? Roll back and compare outputs. (3) Did the prompt template get corrupted? Validate with a known test set.

Q: Does few-shot prompting always outperform zero-shot?

No—on tasks where the model has strong priors (e.g., common knowledge QA), zero-shot can match or exceed few-shot. Few-shot excels on tasks requiring specific formatting or domain-specific reasoning.

Stop guessing when to add examples.

Naren · Founder

Plain-English first. Then code. Then the interview question.


⚡Quick Answer

Zero-shot prompting: Use when the task is unambiguous and your model has strong pre-training on the pattern. Saves tokens and latency, but risks hallucination on edge cases.
Few-shot prompting: Add 2-5 examples when the task requires specific formatting, domain jargon, or multi-step reasoning. Each example costs ~100-200 tokens per call; at scale, that's $4k/month for 1M requests.
Example selection matters: Random examples hurt accuracy by up to 23% vs. semantically similar ones. Use a vector DB to cache embeddings and retrieve the top-3 most similar examples at query time.
Context window limits: Few-shot examples eat into the context window, which matters most on models limited to 4k-8k tokens. If your prompt plus examples exceed 70% of the limit, expect truncation or cost spikes.
Caching is not optional: Without caching, a 5-shot prompt adds 1 second of latency per call due to embedding lookup and example retrieval. Cache the examples per query cluster.
Fallback strategy: Always have a zero-shot fallback. When the few-shot retrieval fails (e.g., empty vector DB), fall back to zero-shot with a confidence threshold check.

What is Few-Shot vs Zero-Shot Prompting?

Few-shot and zero-shot prompting are two techniques for steering large language model (LLM) behavior without fine-tuning. Zero-shot prompting gives the model a task description and expects it to infer the output format and logic from its pre-training alone—like asking GPT-4 to classify an email as 'spam' or 'not spam' with no examples.

Few-shot prompting provides a small set of input-output pairs (typically 1–5) as in-context examples, effectively teaching the model the desired pattern on the fly. The key difference is that zero-shot relies entirely on the model's latent knowledge and instruction-following ability, while few-shot uses the examples to anchor the model's output distribution, reducing ambiguity and improving consistency for complex or edge-case-heavy tasks.

Under the hood, both techniques consume tokens in the context window, but every few-shot example adds its own tokens to each call, so cost grows linearly with the number of examples. That hidden expense can balloon latency and API bills if not managed, as the incident described later in this article shows.

These techniques sit in the middle of the LLM customization spectrum. On one end, pure prompt engineering (including zero-shot) is the cheapest and fastest to iterate, but it fails on tasks requiring precise formatting, domain-specific jargon, or multi-step reasoning.

On the other end, fine-tuning permanently modifies model weights for high-volume, stable patterns but requires labeled datasets, compute, and retraining cycles. Few-shot prompting is the pragmatic middle ground: it's dynamic, requires no training infrastructure, and can be swapped per request—ideal for variable tasks like extracting invoice fields from different vendors.

However, it's not a silver bullet. When your task is highly consistent (e.g., always classify customer sentiment into three fixed labels), fine-tuning or a well-crafted zero-shot prompt with system instructions often outperforms few-shot, especially when latency and token cost are critical.

The 3-hour PagerDuty storm described below was triggered by attaching few-shot examples to every request, even though zero-shot with a structured output schema would have sufficed for 80% of cases.

In production, the choice between few-shot and zero-shot directly impacts caching and latency strategies. Zero-shot prompts are static, so responses are cacheable by input: once a result is stored, the same input gets the same output, which enables response caching at the proxy layer. Few-shot prompts, especially with dynamic example selection (e.g., retrieving the most similar past examples via embeddings), break cache locality because each request's context window differs.

This forces you into expensive per-request LLM calls or complex hybrid caching schemes. The latency spike in the incident described below came from a naive few-shot pipeline that fetched 3 examples per request from a vector database, doubling token usage and tripling p95 latency.

The fix was a hybrid pipeline: zero-shot for simple queries, few-shot only for edge cases flagged by a classifier, with examples pre-cached in a key-value store. Alternatives like fine-tuning a smaller model (e.g., Mistral 7B) for the specific task would have cut costs by 90% at scale, but the team chose few-shot for its flexibility—a tradeoff that backfired without proper guardrails.

When you're deciding, remember: zero-shot is for speed and simplicity, few-shot is for precision at a token cost, and fine-tuning is for volume and consistency. The missing example in the title cost $12k because zero-shot alone kept misclassifying one ticket category. The PagerDuty storm came from the opposite mistake: defaulting to few-shot for everything, when zero-shot with a well-structured prompt would have handled the majority of the traffic without the overhead.

Plain-English First

Imagine you're teaching a chef how to cook a new dish. Zero-shot is saying, 'Make a lasagna' — they might nail it if they've made it before, but they could also burn it. Few-shot is handing them three recipe cards from similar dishes — they'll get it right 90% of the time, but you've got to print those cards each time, costing time and paper. In production, 'paper' is tokens and latency.

We were serving a customer support ticket classifier for a SaaS platform handling 500k tickets/day. The zero-shot prompt was fine for 'refund request' and 'account cancellation', but it kept misclassifying 'billing dispute' as 'technical issue' — a 15% accuracy drop that cost us $12k in manual review overtime over a month. Adding a few examples fixed the accuracy, but then latency spiked from 200ms to 1.2s p99, triggering a PagerDuty storm at 3am because the API gateway started timing out.

Most tutorials treat few-shot vs zero-shot as a binary choice: 'add examples for better accuracy.' They skip the production math — token costs, context window fragmentation, cache misses, and the nightmare of example retrieval latency. They also assume you have a clean labeled dataset, which you don't when you're onboarding a new ticket category at 2am.

This article covers the internals of how OpenAI and Anthropic handle few-shot examples in the attention mechanism, a real incident where a missing example caused a 23% accuracy drop, a debug guide for the three most common failure modes, and production-ready Python code for implementing a hybrid few-shot/zero-shot pipeline with caching and fallback. No fluff, just the stuff that breaks at scale.

How Few-Shot vs Zero-Shot Actually Works Under the Hood

Zero-shot prompting relies entirely on the model's pre-training. The model has seen billions of text pairs and learned patterns like 'if the user says charged twice, it's a billing issue.' But the internal representation is a probability distribution over tokens — not a database lookup. When the input is ambiguous, the model samples from the highest probability tokens, which may be wrong.

Few-shot prompting works by providing examples in the context window. The model treats these examples as additional input tokens, and its self-attention layers attend over them, so the output is conditioned on the example pattern closest to the query. This is not fine-tuning; it's in-context learning. The model doesn't update weights; it just has more context to condition on.

The key insight: few-shot examples are only useful if they are semantically similar to the query. Random examples can actually hurt accuracy because the model learns a spurious correlation between the example's label and the query's unrelated features. A study by Liu et al. (2022) showed that selecting the top-3 most similar examples by embedding cosine similarity improved accuracy by 23% over random selection.

In production, this means you need a retrieval system. At query time, you embed the user's input, search a vector DB for the most similar examples, and inject them into the prompt. This adds latency — typically 50-200ms for embedding + retrieval. Without caching, this can double your p99 latency.

few_shot_retrieval.py (Python)

import chromadb
from sentence_transformers import SentenceTransformer
import openai
from typing import List, Dict

# Initialize embedding model (local, fast, no API call)
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Connect to ChromaDB (persistent or in-memory)
client = chromadb.PersistentClient(path="./example_cache")
collection = client.get_or_create_collection(
    name="few_shot_examples",
    metadata={"hnsw:space": "cosine"}
)

def retrieve_examples(query: str, n: int = 3) -> List[Dict[str, str]]:
    """Retrieve top-n most similar examples from the vector DB."""
    query_embedding = embedding_model.encode(query).tolist()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n
    )
    # results['metadatas'] is a list of lists; flatten it
    examples = []
    for metadata_list in results['metadatas']:
        for meta in metadata_list:
            examples.append({
                "input": meta['input'],
                "output": meta['output']
            })
    return examples

def build_few_shot_prompt(query: str, examples: List[Dict[str, str]]) -> str:
    """Build a prompt with examples and the user query."""
    prompt = "Classify the following customer ticket into one of: [refund, cancellation, technical_issue, billing_dispute, other].\n\n"
    for i, ex in enumerate(examples, 1):
        prompt += f"Example {i}:\nInput: {ex['input']}\nOutput: {ex['output']}\n\n"
    prompt += f"Input: {query}\nOutput:"
    return prompt

def classify_ticket(query: str) -> str:
    # Retrieve examples; if none, fall back to zero-shot
    examples = retrieve_examples(query)
    if not examples:
        # Zero-shot fallback
        prompt = f"Classify: {query} into [refund, cancellation, technical_issue, billing_dispute, other]"
        response = openai.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=10
        )
        return response.choices[0].message.content.strip()
    
    prompt = build_few_shot_prompt(query, examples)
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=10
    )
    return response.choices[0].message.content.strip()

# Example usage
if __name__ == "__main__":
    # Seed the DB with labeled examples; upsert keeps re-runs idempotent
    seed_inputs = [
        "I was charged twice for my subscription",
        "My invoice shows a different amount",
    ]
    collection.upsert(
        ids=["ex1", "ex2"],
        documents=seed_inputs,
        # Embed with the same model used at query time so similarities are comparable
        embeddings=embedding_model.encode(seed_inputs).tolist(),
        metadatas=[{"input": text, "output": "billing_dispute"} for text in seed_inputs],
    )
    print(classify_ticket("Why did you charge me twice?"))  # Should output 'billing_dispute'

Truncate text before embedding it

Embedding models are sensitive to input length. Truncate examples to 200 tokens before embedding. Otherwise, long examples dominate the similarity search and you'll retrieve irrelevant results.

Production Insight

A recommendation engine serving 2M req/day started returning stale results after a schema migration. The few-shot examples were stored with old category names ('billing' vs 'billing_dispute'). The embedding model mapped 'billing' to 'billing_dispute' with 0.95 similarity, but the downstream prompt had no 'billing' class, so the model returned 'other' for 30% of requests. Fix: re-index examples with the exact output format expected by the prompt.

Key Takeaway

Few-shot is not a magic bullet. The examples must match the exact output format and be semantically similar to the query. Use a vector DB with proper indexing, and always have a zero-shot fallback.

Practical Implementation: Building a Hybrid Few-Shot/Zero-Shot Pipeline

Most tutorials show a simple if-else: if you have examples, use few-shot; else, use zero-shot. In production, you need a decision engine that considers latency budget, token cost, and confidence. Here's a practical implementation that uses a confidence threshold: if the zero-shot prediction's confidence falls below a calibrated cutoff, we retrieve examples and re-query. This saves tokens for easy cases and only pays the latency cost for hard ones.

The key metric is the 'cost per correct classification'. For a ticket classifier, zero-shot costs ~$0.002 per call (gpt-4-turbo, 50 tokens). Few-shot with 3 examples costs ~$0.01 per call (250 tokens). If 80% of tickets are easy (zero-shot confidence above the cutoff), only the remaining 20% trigger a second, few-shot call. Because those hard tickets pay for both calls, the blended cost is 0.002 + 0.2 × 0.01 = $0.004 per ticket, about 60% less than running few-shot on everything. The few-shot boost on the hard 20% improves overall accuracy from 85% to 95%.
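The cost arithmetic can be checked with a few lines of Python. This is a minimal sketch that takes the per-call figures quoted above as given; the function name is illustrative, and it assumes every ticket gets a zero-shot call while only the hard ones get a second, few-shot call.

```python
def blended_cost_per_call(zero_shot_cost: float, few_shot_cost: float, easy_fraction: float) -> float:
    """Average cost per ticket when every ticket gets a zero-shot call
    and only the hard tickets get an additional few-shot call."""
    hard_fraction = 1.0 - easy_fraction
    return zero_shot_cost + hard_fraction * few_shot_cost


# Per-call figures quoted in the text; the unit cancels out in the savings ratio
ZERO_SHOT_COST = 0.002
FEW_SHOT_COST = 0.01
EASY_FRACTION = 0.80

hybrid = blended_cost_per_call(ZERO_SHOT_COST, FEW_SHOT_COST, EASY_FRACTION)
savings_vs_few_shot = 1.0 - hybrid / FEW_SHOT_COST

print(f"hybrid cost per ticket: {hybrid:.4f}")
print(f"savings vs few-shot everywhere: {savings_vs_few_shot:.0%}")
```

Hard tickets pay for both calls, so the saving relative to few-shot everywhere is smaller than the share of easy tickets.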

But there's a catch: the confidence threshold must be calibrated. We use a temperature of 0 and take the log probabilities of the top-1 token. If the log probability is less than -1.0 (equivalent to probability < 0.37), we trigger few-shot. This calibration was done by running 1000 historical tickets through zero-shot and measuring the log probability at which errors started.

hybrid_classifier.py (Python)

import openai
from typing import Tuple

# Retrieval helpers defined in few_shot_retrieval.py (previous section)
from few_shot_retrieval import retrieve_examples, build_few_shot_prompt

VALID_CLASSES = ["refund", "cancellation", "technical_issue", "billing_dispute", "other"]

# Calibrated threshold from historical analysis
ZERO_SHOT_LOGPROB_THRESHOLD = -1.0  # equivalent to ~0.37 probability

def zero_shot_with_confidence(query: str) -> Tuple[str, float]:
    """Returns (class, log_probability_of_first_generated_token)."""
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": f"Classify: {query} into [refund, cancellation, technical_issue, billing_dispute, other]"}],
        max_tokens=5,
        temperature=0,
        logprobs=True,  # Return the logprob of each generated token
    )
    choice = response.choices[0]
    # Read the label from the full message: labels such as 'billing_dispute'
    # span several tokens, so a single token never equals the whole label.
    predicted_class = choice.message.content.strip()
    # The first generated token already separates the five labels; with
    # greedy decoding it is the top-1 token, so its logprob is the confidence.
    first_token_logprob = choice.logprobs.content[0].logprob
    if predicted_class not in VALID_CLASSES:
        # Unexpected output: report minimal confidence so the caller falls back
        return predicted_class, float("-inf")
    return predicted_class, first_token_logprob

def hybrid_classify(query: str) -> str:
    # Step 1: Try zero-shot
    predicted_class, logprob = zero_shot_with_confidence(query)
    
    # Step 2: If confidence is low, retrieve examples and re-query
    if logprob < ZERO_SHOT_LOGPROB_THRESHOLD:
        # Retrieve examples (using the function from the previous section)
        examples = retrieve_examples(query)
        if examples:
            prompt = build_few_shot_prompt(query, examples)
            response = openai.chat.completions.create(
                model="gpt-4-turbo",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=5,
                temperature=0
            )
            predicted_class = response.choices[0].message.content.strip()
    
    return predicted_class

# Latency tracking (decorator for production)
import time
from functools import wraps

def track_latency(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        latency = time.time() - start
        # Log to your monitoring system (e.g., Datadog)
        # statsd.gauge('classifier.latency', latency)
        return result
    return wrapper

@track_latency
def classify(query: str) -> str:
    return hybrid_classify(query)

Calibrate your confidence threshold with real data

Don't guess the threshold. Run 1000 historical tickets through zero-shot, collect the log probabilities of correct and incorrect predictions, and set the threshold at the point where errors start to spike. For our data, that was logprob < -1.0.
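The calibration procedure described above can be sketched as follows. This is one possible way to pick the cutoff, not necessarily the exact method used for the -1.0 figure: it assumes you have a list of (logprob, was_correct) pairs from a historical zero-shot run and a target accuracy for predictions accepted without few-shot. The function name and the toy data are hypothetical.

```python
from typing import List, Optional, Tuple


def calibrate_threshold(history: List[Tuple[float, bool]], target_accuracy: float = 0.95) -> Optional[float]:
    """Return the lowest logprob cutoff whose accepted set (logprob >= cutoff)
    still meets target_accuracy, or None if no cutoff does."""
    ranked = sorted(history, key=lambda pair: pair[0], reverse=True)
    cutoff = None
    correct = 0
    for seen, (logprob, was_correct) in enumerate(ranked, start=1):
        correct += int(was_correct)
        # Only evaluate once a run of equal logprobs has been fully counted
        at_tie_boundary = seen == len(ranked) or ranked[seen][0] != logprob
        if at_tie_boundary and correct / seen >= target_accuracy:
            cutoff = logprob
    return cutoff


# Hypothetical (logprob, was_correct) pairs standing in for ~1000 historical tickets
history = [
    (-0.05, True), (-0.10, True), (-0.30, True), (-0.80, True),
    (-1.20, False), (-1.50, True), (-2.40, False),
]
print(calibrate_threshold(history, target_accuracy=0.9))
```

Tickets whose zero-shot logprob falls below the returned cutoff would be routed to few-shot. Re-run the calibration whenever the model version or the ticket mix changes.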

Production Insight

A fraud detection system using this hybrid approach reduced false positives by 40% without adding latency for 85% of transactions. The zero-shot threshold caught 95% of legitimate transactions, and the few-shot fallback handled the ambiguous ones (e.g., 'I didn't make this purchase' from a new device). The key was caching the few-shot examples per user cluster to avoid repeated retrieval.

Key Takeaway

Hybrid approach saves tokens and latency: use zero-shot for easy cases, few-shot for hard ones. Calibrate the confidence threshold with historical data, not intuition.

When NOT to Use Few-Shot Prompting

Few-shot prompting is not always better. Here are three scenarios where it hurts:

The task is already well-defined in pre-training. If the model can do the task zero-shot with >95% accuracy, adding examples adds latency and token cost with no benefit. Example: 'Translate this English sentence to French.' GPT-4 is already fluent; examples just waste tokens.
The examples are noisy or inconsistent. If your labeled dataset has errors, few-shot will amplify them. The model learns the wrong pattern from the examples. We saw this when a team used user-submitted tickets as examples — the tickets had typos and inconsistent labels, causing accuracy to drop from 92% to 78%.
The context window is tight. If your prompt is already close to the context window limit (e.g., 7000 tokens out of 8192 for gpt-4), adding examples will truncate the user query or system prompt. This is especially dangerous for long documents or code generation tasks.

In these cases, consider fine-tuning instead. Fine-tuning updates the model weights to internalize the pattern, so you don't need examples at inference time. It costs more upfront but saves tokens and latency at scale.

should_use_few_shot.py (Python)

import openai
import tiktoken

# Hypothetical labeled sample for the accuracy check; replace with your own
# held-out tickets (three queries are far too few for a real estimate).
LABELED_SAMPLE = {
    "I was charged twice": "billing_dispute",
    "My account is locked": "technical_issue",
    "How do I cancel?": "cancellation",
}

def is_correct(query: str, prediction: str) -> bool:
    """Ground-truth check against the labeled sample above."""
    return prediction.strip().lower() == LABELED_SAMPLE.get(query)

def should_use_few_shot(task_description: str, examples: list, model: str = "gpt-4-turbo") -> bool:
    """Decision function: returns True if few-shot is likely beneficial."""
    # Step 1: Check zero-shot accuracy on a sample
    zero_shot_correct = 0
    for q in LABELED_SAMPLE:
        response = openai.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": f"{task_description}\nInput: {q}\nOutput:"}],
            max_tokens=5,
            temperature=0
        )
        if is_correct(q, response.choices[0].message.content):
            zero_shot_correct += 1
    zero_shot_accuracy = zero_shot_correct / len(LABELED_SAMPLE)

    # Step 2: If zero-shot is already >=95%, skip few-shot
    if zero_shot_accuracy >= 0.95:
        return False

    # Step 3: Check context window usage
    enc = tiktoken.encoding_for_model(model)
    prompt = f"{task_description}\n"
    for ex in examples:
        prompt += f"Input: {ex['input']}\nOutput: {ex['output']}\n"
    # Append a placeholder query directly; str.format() would fail on examples containing braces
    prompt += "Input: test\nOutput:"
    total_tokens = len(enc.encode(prompt))
    max_tokens = 128000 if model == "gpt-4-turbo" else 8192
    if total_tokens > 0.8 * max_tokens:
        print(f"Warning: Prompt would use {total_tokens} tokens out of {max_tokens}. Consider reducing examples.")
        return False

    return True

# Example usage (the task matches the ticket queries in LABELED_SAMPLE)
if __name__ == "__main__":
    task = "Classify the following customer ticket into one of: refund, cancellation, technical_issue, billing_dispute, other."
    examples = [
        {"input": "I was charged twice for my subscription", "output": "billing_dispute"},
        {"input": "Please close my account", "output": "cancellation"}
    ]
    if should_use_few_shot(task, examples):
        print("Use few-shot")
    else:
        print("Stick with zero-shot or fine-tune")

Fine-tuning is cheaper at scale

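If you're making >100k requests per day with few-shot, fine-tuning will save you money. At $0.01 per few-shot call (250 tokens), 100k calls cost $1000/day. Fine-tuning gpt-4-turbo costs ~$25 for training and then $0.003 per call (50 tokens), or $300/day. At roughly $700/day in savings, the training cost is recovered within the first hour of traffic, not counting the cost of labeling and evaluating the training set.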
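The break-even point follows from a one-line formula: training cost divided by daily savings. The sketch below plugs in the figures quoted above as given; the function name is illustrative, and the calculation ignores labeling, evaluation, and retraining costs.

```python
def break_even_days(training_cost: float, few_shot_cost_per_call: float,
                    fine_tuned_cost_per_call: float, calls_per_day: int) -> float:
    """Days of traffic needed for per-call savings to repay the training cost."""
    daily_savings = (few_shot_cost_per_call - fine_tuned_cost_per_call) * calls_per_day
    if daily_savings <= 0:
        return float("inf")  # fine-tuning never pays off on inference cost alone
    return training_cost / daily_savings


# Figures quoted in the text: $25 training, $0.01 vs $0.003 per call, 100k calls/day
days = break_even_days(25.0, 0.01, 0.003, 100_000)
print(f"break-even after {days:.3f} days (~{days * 24 * 60:.0f} minutes of traffic)")
```

With a saving of about $700/day, a $25 training run pays for itself well inside the first day, so the real decision usually hinges on data quality and task stability, not on the training bill.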

Production Insight

A code generation tool used few-shot with 5 examples for every request. The examples were from the same codebase, but the model kept generating the same patterns regardless of the query. After switching to zero-shot with a well-structured system prompt, accuracy improved by 12% and latency dropped by 60%. The examples were actually biasing the model toward the most common pattern in the examples.

Key Takeaway

Don't blindly add examples. Measure zero-shot accuracy first, check context window usage, and consider fine-tuning for high-volume tasks.

Production Patterns & Scale: Caching and Latency Optimization

At scale, few-shot prompting introduces two latency bottlenecks: embedding retrieval and API call duration. For a system serving 1M requests/day, a few-shot call takes roughly 200ms for retrieval + 500ms for the API call = 700ms, versus about 200ms for zero-shot. That extra 500ms per call adds up to 500,000 seconds (about 139 hours) of cumulative waiting per day, which translates to higher timeout rates and customer dissatisfaction.

The solution is multi-level caching:

Query-level cache: Hash the user query and store the classification result. TTL of 1 hour for most use cases. This eliminates the API call entirely for repeated queries (e.g., 'How do I reset my password?' asked 1000 times/day).
Example-level cache: Store the top-3 examples for each query cluster. Use a separate cache keyed by the embedding of the query. This avoids the vector DB lookup on every call. We use Redis with a TTL of 5 minutes.
Embedding cache: Cache the embedding of the query to avoid re-embedding. Use an LRU cache with size 10k entries.

With all three caches, the median latency for few-shot calls drops from 700ms to 150ms. The cache hit rate is typically 60-70% for high-traffic queries.

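cached_classifier.py (Python)

import hashlib
import json
import redis
import openai
from functools import lru_cache

# Embedding model, collection, and prompt builder defined in few_shot_retrieval.py
from few_shot_retrieval import embedding_model, collection, build_few_shot_prompt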

# Redis client for distributed caching
redis_client = redis.Redis(host='localhost', port=6379, db=0)

# LRU cache for embeddings (in-memory, local)
@lru_cache(maxsize=10000)
def get_embedding_cached(text: str):
    return embedding_model.encode(text).tolist()

def classify_with_cache(query: str) -> str:
    # Level 1: Query-level cache
    query_hash = hashlib.sha256(query.encode()).hexdigest()
    cached_result = redis_client.get(f"classifier:result:{query_hash}")
    if cached_result:
        return cached_result.decode()
    
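    # Level 2: Example-level cache
    # Note: this key is derived from the exact query text, so it only hits when
    # the same query repeats. Sharing examples across similar queries needs a
    # key derived from the query's embedding cluster instead.
    example_cache_key = f"classifier:examples:{query_hash[:16]}"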
    cached_examples = redis_client.get(example_cache_key)
    if cached_examples:
        examples = json.loads(cached_examples)
    else:
        # Retrieve from vector DB (with embedding cache)
        query_embedding = get_embedding_cached(query)
        results = collection.query(
            query_embeddings=[query_embedding],
            n_results=3
        )
        examples = []
        for metadata_list in results['metadatas']:
            for meta in metadata_list:
                examples.append({"input": meta['input'], "output": meta['output']})
        # Cache for 5 minutes
        redis_client.setex(example_cache_key, 300, json.dumps(examples))
    
    # Build prompt and call API
    prompt = build_few_shot_prompt(query, examples)
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
        temperature=0
    )
    result = response.choices[0].message.content.strip()
    
    # Cache the result for 1 hour
    redis_client.setex(f"classifier:result:{query_hash}", 3600, result)
    return result
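The example-level cache is described as being keyed per query cluster, so similar queries should map to the same key. One way to get that behaviour is random-hyperplane locality-sensitive hashing. This is a swapped-in technique for illustration, not the method used in the listing above, and the function names are hypothetical. Embeddings pointing in similar directions tend to land on the same side of each hyperplane and so tend to share a key.

```python
import random
from typing import List, Sequence

KEY_PREFIX = "classifier:examples:"  # same prefix as the example cache in the listing


def make_hyperplanes(dim: int, n_bits: int = 16, seed: int = 0) -> List[List[float]]:
    """Generate n_bits random hyperplanes; a fixed seed makes the planes reproducible."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def cluster_key(embedding: Sequence[float], hyperplanes: List[List[float]]) -> str:
    """Map an embedding to a bit string: one bit per hyperplane, set by the sign of the dot product."""
    bits = []
    for plane in hyperplanes:
        dot = sum(e * p for e, p in zip(embedding, plane))
        bits.append("1" if dot >= 0 else "0")
    return KEY_PREFIX + "".join(bits)


# Toy 4-dimensional vectors stand in for real query embeddings
planes = make_hyperplanes(dim=4, n_bits=8, seed=0)
query_vec = [0.1, 0.2, -0.3, 0.4]
rescaled_vec = [2 * x for x in query_vec]  # same direction, different length

print(cluster_key(query_vec, planes) == cluster_key(rescaled_vec, planes))  # True
```

Fewer bits give coarser clusters and more cache hits, at the price of serving examples retrieved for a less similar query. The bit count would need tuning against real traffic.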

Cache invalidation is not optional

When you update examples (e.g., add a new category), flush the query-level cache. Otherwise, users will get stale classifications for up to 1 hour. We use a Redis pub/sub message to invalidate caches on example updates.
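A simpler alternative to the pub/sub approach mentioned above is a versioned key namespace: result keys embed a version counter, and bumping the counter on every example update makes old entries unreachable until their TTL removes them. The sketch below uses an in-memory stand-in so it runs without a Redis server. The key names and helper functions are hypothetical, and only `get`, `set`, and `incr` are mimicked.

```python
import hashlib


class InMemoryStore:
    """Stand-in for a Redis client in this sketch; only get/set/incr are mimicked."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def incr(self, key):
        self._data[key] = int(self._data.get(key, 0)) + 1
        return self._data[key]


VERSION_KEY = "classifier:examples_version"  # hypothetical key name


def result_key(store, query: str) -> str:
    """Build a result-cache key that embeds the current example-set version."""
    version = int(store.get(VERSION_KEY) or 0)
    digest = hashlib.sha256(query.encode()).hexdigest()
    return f"classifier:result:v{version}:{digest}"


def on_examples_updated(store) -> None:
    """Call after adding or relabeling examples: old result keys stop being read."""
    store.incr(VERSION_KEY)


store = InMemoryStore()
query = "Why did you charge me twice?"
key_before = result_key(store, query)
store.set(key_before, "billing_dispute")

on_examples_updated(store)
key_after = result_key(store, query)
print(key_before != key_after, store.get(key_after))
```

The trade-off is one extra read of the version key per request in exchange for never having to enumerate and delete stale entries.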

Production Insight

A customer support chatbot used few-shot prompting for intent classification. Without caching, the p99 latency was 1.2s, causing the UI to show a loading spinner for 3 seconds. After implementing the three-level cache, p99 dropped to 200ms, and the spinner disappeared. The cache hit rate was 72% for common queries like 'reset password' and 'cancel subscription'.

Key Takeaway

Caching is not an afterthought — it's the difference between a system that works at 1M req/day and one that falls over. Implement query-level, example-level, and embedding caches.

Common Mistakes with Few-Shot Prompting (and How We Fixed Them)

Here are the three most common mistakes we've seen in production, each with a specific example and fix:

Mistake 1: Using too many examples. A team added 10 examples to a prompt for a simple binary classifier. The model started ignoring the user query and instead tried to match the examples exactly, returning 'positive' for everything because the first 5 examples were positive. Fix: limit to 3-5 examples, and ensure they are balanced across classes.

Mistake 2: Examples with different formats than the expected output. If your examples use 'Yes'/'No' but the prompt asks for 'True'/'False', the model will get confused. Example: prompt says 'Classify as positive or negative', but one example says 'Output: good'. The model returns 'good' instead of 'positive'. Fix: ensure examples match the exact output format, including capitalization and punctuation.

Mistake 3: Not handling example retrieval failures. If the vector DB is down or the query has no similar examples, the retrieval returns empty. Most code just falls back to zero-shot, but without checking the confidence, you might get a wrong classification. Fix: always check the confidence of the zero-shot fallback, and if it's below threshold, return a default or escalate to human review.

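few_shot_mistakes.py (Python)

import openai

# Helpers from the earlier listings in this article
from few_shot_retrieval import retrieve_examples, build_few_shot_prompt
from hybrid_classifier import zero_shot_with_confidence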

# Mistake 1: Too many examples (10 examples for binary classifier)
def bad_few_shot(query: str) -> str:
    examples = [
        ("Great product", "positive"),
        ("Terrible", "negative"),
        ("Love it", "positive"),
        ("Hate it", "negative"),
        ("Amazing", "positive"),
        ("Awful", "negative"),
        ("Fantastic", "positive"),
        ("Horrible", "negative"),
        ("Best ever", "positive"),
        ("Worst ever", "negative"),
    ]
    prompt = "Classify the following review as positive or negative.\n\n"
    for inp, out in examples:
        prompt += f"Input: {inp}\nOutput: {out}\n\n"
    prompt += f"Input: {query}\nOutput:"
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
        temperature=0
    )
    return response.choices[0].message.content.strip()

# Fix: Use only 3 examples that cover both classes
def good_few_shot(query: str) -> str:
    examples = [
        ("Great product", "positive"),
        ("Terrible", "negative"),
        ("Love it", "positive"),  # 3 examples can't split evenly: 2 positive, 1 negative
    ]
    prompt = "Classify the following review as positive or negative.\n\n"
    for inp, out in examples:
        prompt += f"Input: {inp}\nOutput: {out}\n\n"
    prompt += f"Input: {query}\nOutput:"
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
        temperature=0
    )
    return response.choices[0].message.content.strip()

# Mistake 2: Output format mismatch
def bad_format(query: str) -> str:
    examples = [
        ("Great product", "good"),  # Should be 'positive'
        ("Terrible", "bad"),       # Should be 'negative'
    ]
    prompt = "Classify the following review as positive or negative.\n\n"
    for inp, out in examples:
        prompt += f"Input: {inp}\nOutput: {out}\n\n"
    prompt += f"Input: {query}\nOutput:"
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
        temperature=0
    )
    return response.choices[0].message.content.strip()

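# Mistake 3: No fallback for retrieval failure
# (retrieve_examples and build_few_shot_prompt are defined in few_shot_retrieval.py;
#  zero_shot_with_confidence is defined in hybrid_classifier.py)
def few_shot_classify(query: str, examples: list) -> str:
    """Few-shot call shared by both versions below."""
    prompt = build_few_shot_prompt(query, examples)
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
        temperature=0
    )
    return response.choices[0].message.content.strip()

def no_fallback(query: str) -> str:
    try:
        examples = retrieve_examples(query)  # May return empty or raise exception
    except Exception:
        examples = []
    if not examples:
        # No fallback: the caller just gets an empty string
        return ""
    return few_shot_classify(query, examples)

# Fix: Always have a fallback with confidence check
def safe_fallback(query: str) -> str:
    try:
        examples = retrieve_examples(query)
    except Exception as e:
        # Log the error and fall back to zero-shot
        print(f"Retrieval failed: {e}")
        examples = []

    if not examples:
        # Zero-shot fallback with confidence check
        predicted_class, logprob = zero_shot_with_confidence(query)
        if logprob < -1.0:
            return "unclassified"  # Escalate to human review
        return predicted_class

    return few_shot_classify(query, examples)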

Balance your examples, or the model will

If 8 out of 10 examples are 'positive', the model will bias toward 'positive' even for negative queries. We saw a 15% accuracy drop because of this. Always balance examples per class, or use a stratified sampling strategy.
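Balancing can be enforced at selection time. The sketch below assumes the retrieved examples arrive sorted by similarity, as a vector DB query returns them, and caps how many are taken per label. The function name and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def select_balanced(ranked_examples: List[Dict[str, str]], per_class: int = 2,
                    max_total: int = 4) -> List[Dict[str, str]]:
    """Take at most per_class examples for each label, preserving similarity order."""
    taken = defaultdict(int)
    selected = []
    for ex in ranked_examples:
        label = ex["output"]
        if taken[label] < per_class:
            selected.append(ex)
            taken[label] += 1
        if len(selected) >= max_total:
            break
    return selected


# Hypothetical retrieval result, most similar first and skewed toward 'positive'
ranked = [
    {"input": "Great product", "output": "positive"},
    {"input": "Love it", "output": "positive"},
    {"input": "Amazing", "output": "positive"},
    {"input": "Terrible", "output": "negative"},
    {"input": "Best ever", "output": "positive"},
    {"input": "Awful", "output": "negative"},
]
picked = select_balanced(ranked, per_class=2, max_total=4)
print([ex["output"] for ex in picked])  # ['positive', 'positive', 'negative', 'negative']
```

Retrieving more candidates than you plan to use (for example the top 10) gives the selector room to find the minority class.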

Production Insight

A sentiment analysis pipeline for social media monitoring had 10 examples, all from positive reviews. The model classified 90% of tweets as positive, even for complaints. After reducing to 3 balanced examples (2 positive, 1 negative), accuracy improved from 72% to 89%. The fix was a one-line change in the example selection logic.

Key Takeaway

Few-shot examples must be balanced, formatted exactly as the expected output, and have a fallback for retrieval failures. Test your examples on a small sample before deploying.
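A cheap pre-deployment check is to verify that every stored example's output is one of the labels the prompt asks for, with exact capitalization. This is a minimal sketch: the label set mirrors the ticket classifier used earlier, and the function name and sample data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

ALLOWED_LABELS = {"refund", "cancellation", "technical_issue", "billing_dispute", "other"}


def find_invalid_examples(examples: List[Dict[str, str]],
                          allowed_labels: Set[str]) -> List[Tuple[int, str]]:
    """Return (index, output) for every example whose output is not an exact allowed label."""
    problems = []
    for index, ex in enumerate(examples):
        output = ex.get("output", "")
        if output not in allowed_labels:
            problems.append((index, output))
    return problems


# Hypothetical example store containing a stale label and a capitalization mismatch
stored_examples = [
    {"input": "I was charged twice", "output": "billing_dispute"},
    {"input": "My invoice is wrong", "output": "billing"},   # old category name
    {"input": "I want my money back", "output": "Refund"},   # wrong capitalization
]
problems = find_invalid_examples(stored_examples, ALLOWED_LABELS)
print(problems)  # [(1, 'billing'), (2, 'Refund')]
```

Running this check in CI or as part of the example re-indexing job would catch renamed categories before they reach the prompt.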

Comparison vs Alternatives: Zero-Shot, Few-Shot, Fine-Tuning, and Prompt Engineering

Here's a production-oriented comparison of the four approaches:

Zero-shot: Best for well-defined, common tasks. Low latency (200ms), low cost ($0.002/call). Accuracy depends on model pre-training. Use when the task is unambiguous and the model has seen similar patterns.

Few-shot: Best for tasks with specific formatting or domain jargon. Medium latency (700ms with retrieval), medium cost ($0.01/call). Accuracy improves by 10-20% over zero-shot for ambiguous tasks. Use when you have a small labeled dataset (<100 examples) and can't afford fine-tuning.

Fine-tuning: Best for high-volume, stable tasks. High upfront cost ($25 for training), low inference cost ($0.003/call) and latency (200ms). Accuracy can exceed few-shot by 5-10% because the model internalizes the pattern. Use when you have >1000 labeled examples and expect >100k calls/month.

Prompt engineering: Best for quick iterations without code changes. Zero cost, but requires experimentation. Use when you want to test a hypothesis before committing to code changes.

In practice, we use a hybrid: start with prompt engineering to define the task, then zero-shot for baseline, then few-shot for hard cases, and finally fine-tune when the task stabilizes. The decision is driven by data: measure accuracy, latency, and cost per call, and choose the approach that optimizes for your business metric.

compare_approaches.py · PYTHON

# Simulated comparison: illustrative figures, not benchmarks
APPROACHES = {
    "zero_shot": {
        "latency_ms": 200,
        "cost_per_call": 0.002,
        "accuracy": 0.85,
    },
    "few_shot": {
        "latency_ms": 700,
        "cost_per_call": 0.01,
        "accuracy": 0.95,
    },
    "fine_tuned": {
        "latency_ms": 200,
        "cost_per_call": 0.003,
        "accuracy": 0.97,
    },
}

def recommend_approach(num_calls_per_day: int, accuracy_target: float, latency_budget_ms: int):
    """Return the lowest-scoring approach that meets the constraints, or None if none does."""
    best = None
    best_score = float('inf')
    
    for name, metrics in APPROACHES.items():
        if metrics['accuracy'] < accuracy_target:
            continue
        if metrics['latency_ms'] > latency_budget_ms:
            continue
        # Score: cost per day (lower is better) + latency penalty
        daily_cost = metrics['cost_per_call'] * num_calls_per_day
        score = daily_cost + (metrics['latency_ms'] / 1000) * 0.01  # Arbitrary penalty
        if score < best_score:
            best_score = score
            best = name
    
    return best

# Example: 100k calls/day, target accuracy 0.9, latency budget 500ms
print(recommend_approach(100000, 0.9, 500))  # Should return 'fine_tuned'

Don't ignore prompt engineering

Before adding examples or fine-tuning, spend an hour iterating on the system prompt. A well-structured zero-shot prompt can match few-shot accuracy for many tasks. We've seen a 15% accuracy improvement just by adding 'Think step by step' to the prompt.

Production Insight

A legal document classifier started with few-shot (5 examples per category) and had 92% accuracy. After switching to a fine-tuned model on 5000 labeled documents, accuracy improved to 98% and latency dropped from 800ms to 200ms. The fine-tuning cost $50 and paid for itself in 2 days of saved token costs.
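A payback claim like this comes down to simple arithmetic: divide the upfront training cost by the per-call saving. The sketch below uses the illustrative per-call figures from the comparison above ($0.01 for few-shot, $0.003 for fine-tuned, $25 training), not the legal classifier's actual costs, which are not given. `break_even_calls` is a hypothetical helper name.

```python
import math

def break_even_calls(upfront_cost: float, current_cost_per_call: float,
                     new_cost_per_call: float) -> int:
    """Number of calls after which the upfront cost is recovered."""
    saving = current_cost_per_call - new_cost_per_call
    if saving <= 0:
        raise ValueError("new approach must be cheaper per call to break even")
    return math.ceil(upfront_cost / saving)

# Illustrative figures: few-shot $0.01/call, fine-tuned $0.003/call, $25 training.
calls_needed = break_even_calls(25.0, 0.01, 0.003)
```

With those figures the training cost is recovered after 3,572 calls, so how many days payback takes depends entirely on your daily volume.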

Key Takeaway

Choose the approach based on your volume, accuracy needs, and latency budget. Start with zero-shot, add few-shot for hard cases, and fine-tune when the task is stable and high-volume.

Debugging and Monitoring Few-Shot vs Zero-Shot in Production

You need three monitoring metrics for your prompting pipeline:

Accuracy per class: Track the classification accuracy for each category separately. A drop in 'billing_dispute' accuracy might indicate that the examples for that class are stale or the distribution has shifted.
Latency breakdown: Log the time spent in each step: embedding, retrieval, API call. If retrieval time spikes, your vector DB might be under load. If API call time spikes, the provider might be rate-limiting you or the model endpoint might be overloaded.
Cache hit rate: Track the hit rate for each cache level. A sudden drop in query-level cache hit rate might indicate a new traffic pattern or a cache invalidation bug.
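The first metric can be tracked with a small counter keyed by the true label, which strictly speaking gives per-class recall. This is a minimal sketch that assumes ground-truth labels arrive later, for example from human review or reopened tickets. `PerClassAccuracy` is a hypothetical helper, not an existing library class.

```python
from collections import defaultdict

class PerClassAccuracy:
    """Running accuracy per true label (hypothetical helper)."""

    def __init__(self):
        self.correct = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, true_label: str, predicted_label: str) -> None:
        self.total[true_label] += 1
        if true_label == predicted_label:
            self.correct[true_label] += 1

    def accuracy(self) -> dict:
        # Fraction of each true class that the classifier got right.
        return {label: self.correct[label] / n for label, n in self.total.items()}

tracker = PerClassAccuracy()
for truth, pred in [
    ("billing_dispute", "technical_issue"),
    ("billing_dispute", "billing_dispute"),
    ("refund", "refund"),
]:
    tracker.record(truth, pred)

per_class = tracker.accuracy()
```

Emitting `per_class` as one gauge per label makes a drop in a single category visible even when overall accuracy barely moves.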

We use structured logging with correlation IDs to trace each request through the pipeline. Here's a production-ready logging setup:

monitoring_setup.py · PYTHON

import structlog
import time
import uuid
from contextvars import ContextVar

# Structured logging with correlation IDs
request_id_var: ContextVar[str] = ContextVar('request_id', default='unknown')

logger = structlog.get_logger()

def classify_with_monitoring(query: str) -> str:
    request_id = str(uuid.uuid4())
    request_id_var.set(request_id)
    
    log = logger.bind(request_id=request_id, query=query[:50])
    log.info("classify.start")
    
    start = time.perf_counter()  # monotonic clock; safer than time.time() for durations
    
    # Step 1: Check cache
    cache_start = time.perf_counter()
    cached_result = check_cache(query)
    cache_latency = time.perf_counter() - cache_start
    if cached_result:
        log.info("classify.cache_hit", latency_ms=round(cache_latency*1000, 2))
        return cached_result
    log.info("classify.cache_miss", latency_ms=round(cache_latency*1000, 2))
    
    # Step 2: Retrieve examples
    retrieval_start = time.perf_counter()
    examples = retrieve_examples(query)
    retrieval_latency = time.perf_counter() - retrieval_start
    log.info("classify.retrieval", num_examples=len(examples), latency_ms=round(retrieval_latency*1000, 2))
    
    # Step 3: Call API
    api_start = time.perf_counter()
    result = call_openai(query, examples)
    api_latency = time.perf_counter() - api_start
    log.info("classify.api_call", latency_ms=round(api_latency*1000, 2))
    
    total_latency = time.perf_counter() - start
    log.info("classify.complete", result=result, total_latency_ms=round(total_latency*1000, 2))
    
    # Send metrics to monitoring system (e.g., Datadog)
    # statsd.gauge('classifier.total_latency', total_latency)
    # statsd.gauge('classifier.cache_latency', cache_latency)
    # statsd.gauge('classifier.retrieval_latency', retrieval_latency)
    # statsd.gauge('classifier.api_latency', api_latency)
    
    return result

Alert on latency spikes, not just errors

A 500ms increase in p99 latency can cause timeouts and customer dissatisfaction before any error is logged. Set an alert when retrieval latency exceeds 200ms or API latency exceeds 1s.
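A rough sketch of that alert rule, assuming latency samples are collected per step over an evaluation window. The thresholds are the ones named above. `P99_THRESHOLDS_MS`, `p99` and `latency_alerts` are hypothetical names, and in practice your monitoring system would usually compute the percentile for you.

```python
import statistics

# Thresholds from the text: 200 ms for retrieval, 1 s for the API call.
P99_THRESHOLDS_MS = {"retrieval": 200.0, "api": 1000.0}

def p99(samples_ms):
    """99th percentile of a list of latency samples (needs at least 2 samples)."""
    return statistics.quantiles(samples_ms, n=100)[98]

def latency_alerts(window):
    """Return the steps whose p99 latency exceeds its threshold.

    `window` maps a step name to the latency samples (ms) collected
    in the current evaluation window.
    """
    return [
        step
        for step, limit in P99_THRESHOLDS_MS.items()
        if step in window and p99(window[step]) > limit
    ]

window = {
    "retrieval": [50.0] * 99 + [800.0],  # one slow request out of 100
    "api": [300.0] * 100,
}
alerts = latency_alerts(window)
```

A single slow retrieval is enough to push the retrieval p99 over its threshold here, while the mean latency for the same window would still look healthy.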

Production Insight

A chatbot using few-shot prompting had a silent failure: the vector DB became slow due to a disk I/O bottleneck, causing retrieval latency to spike from 50ms to 800ms. The API calls still succeeded, but the total latency caused the chatbot to time out for 30% of users. The fix was to add a circuit breaker: if retrieval latency exceeds 500ms, fall back to zero-shot and alert the on-call engineer.
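The circuit breaker described here can be sketched in a few lines. This is an assumption-laden illustration, not the team's actual implementation: `RetrievalCircuitBreaker`, the 30-second cool-down and the injected `retrieve` callable are all hypothetical. It also does not cancel a slow call in flight, so the first slow request still pays the full latency; a hard timeout on the retrieval call would be needed for that.

```python
import time

class RetrievalCircuitBreaker:
    """Skip retrieval for a cool-down period after a slow or failed call."""

    def __init__(self, latency_threshold_s=0.5, cooldown_s=30.0, clock=time.monotonic):
        self.latency_threshold_s = latency_threshold_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.open_until = 0.0  # retrieval is skipped until this clock value

    def get_examples(self, query, retrieve):
        """Return retrieved examples, or [] to signal a zero-shot fallback."""
        if self.clock() < self.open_until:
            return []  # breaker open: do not touch the vector DB
        start = self.clock()
        try:
            examples = retrieve(query)
        except Exception:
            examples = []
            tripped = True
        else:
            tripped = (self.clock() - start) > self.latency_threshold_s
        if tripped:
            # A real system would also alert the on-call engineer here.
            self.open_until = self.clock() + self.cooldown_s
        return examples
```

The caller treats an empty list as "use the zero-shot prompt", which keeps the fallback decision in one place.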

Key Takeaway

Monitor latency per step, not just end-to-end. Use structured logging with correlation IDs to trace requests. Set alerts on latency spikes, not just errors.

● Production incident · POST-MORTEM · severity: high

The $12k Overtime Incident: When Zero-Shot Failed on Billing Disputes

Symptom

The on-call engineer saw a spike in 'reopened tickets' from 2% to 17% overnight. The support team was manually reclassifying hundreds of tickets flagged as 'technical issue' that were actually billing disputes. No error logs, no 5xx — just angry customers and a backlog.

Assumption

The team assumed zero-shot prompting was 'good enough' because it passed offline evaluation on a balanced test set. They didn't test on the long-tail distribution of real-world ticket descriptions, which included ambiguous phrases like 'I was charged twice' (billing) vs 'the app crashed after payment' (technical).

Root cause

The zero-shot classifier prompt never named a 'billing dispute' category, let alone gave an example of one. The prompt template was: 'Classify the following customer ticket into one of: [refund, cancellation, technical_issue, other].' The word 'billing' never appeared. With no matching label, the model's internal representation of a billing dispute sat too close to 'technical issue' in the embedding space (cosine similarity was 0.89), so the logit scores flipped on ambiguous inputs.

Fix

1. Added three few-shot examples for the 'billing_dispute' category: one for double charges, one for subscription errors, one for invoice disputes.
2. Implemented a semantic similarity cache using ChromaDB to store examples and retrieve the top-3 most similar at query time (reduced retrieval latency from 800ms to 50ms).
3. Added a confidence threshold check: if the predicted class probability was below 0.7, fall back to zero-shot with a different prompt that explicitly listed 'billing dispute' as a class.
4. Ran a shadow-mode evaluation for 24 hours before rolling out to production.

Key lesson

DO test your zero-shot prompt on the actual production distribution, not a balanced test set. Use a random sample of 10k real tickets.
DO precompute the embeddings of your few-shot examples and store them in a vector DB. Embedding the example set at query time is too slow for real-time classification.
DO implement a confidence threshold with a fallback prompt. It saved us from the next incident when a new ticket category appeared.

Production debug guide · When the classifier starts misclassifying everything at 2am. · 4 entries

Symptom · 01

Accuracy drops suddenly after a model update (e.g., gpt-4-0613 to gpt-4-turbo)

→

Fix

Run a diff on the tokenized prompt lengths. New models may have different tokenizers or context window limits. Use: tiktoken.encoding_for_model('gpt-4-turbo').encode(prompt) and compare lengths. If the prompt is truncated, examples may be dropped.

Symptom · 02

Latency spikes from 200ms to 1.5s p99

→

Fix

Check if the few-shot retrieval is the bottleneck. Log the time for embedding_model.embed_query() and chroma_collection.query(). If >100ms, switch to a local embedding model (e.g., sentence-transformers/all-MiniLM-L6-v2) and cache embeddings in memory.
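Caching embeddings in memory can be as simple as wrapping the embedding call in an LRU cache. The sketch below assumes the embedding function takes a string and returns a sequence of floats. `make_cached_embedder` is a hypothetical helper, and `toy_embed` is only a stand-in for a real model so the example runs without one.

```python
from functools import lru_cache

def make_cached_embedder(embed_fn, maxsize=10_000):
    """Wrap an embedding function with an in-memory LRU cache.

    `embed_fn` is a placeholder for whatever produces the vector, such as
    a local sentence-transformers model. Results are stored as tuples so
    callers cannot mutate cached values.
    """
    @lru_cache(maxsize=maxsize)
    def cached(text: str) -> tuple:
        return tuple(embed_fn(text))
    return cached

# Stand-in embedder for demonstration only: counts calls, returns a toy vector.
calls = {"n": 0}

def toy_embed(text):
    calls["n"] += 1
    return [float(len(text)), float(text.count(" "))]

embed = make_cached_embedder(toy_embed)
embed("I was charged twice")
embed("I was charged twice")  # served from the cache; toy_embed is not called again
```

This only helps when identical query strings repeat, so normalizing whitespace and case before the lookup tends to raise the hit rate.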

Symptom · 03

Model returns 'I cannot answer that' or generic responses

→

Fix

Check the token count of the full prompt (system + examples + user query). If it exceeds 80% of the model's context window, examples may be cut off upstream or leave too little room for the response. Use tiktoken to count tokens and reduce the number of examples or truncate the longest ones.

Symptom · 04

Example retrieval returns irrelevant examples (e.g., refund examples for a technical issue query)

→

Fix

Inspect the embedding of the query and examples. If the cosine similarity is >0.9 for irrelevant pairs, the embedding model is too coarse. Switch to a fine-tuned embedding model or add a reranker step (e.g., Cohere rerank) on the top-10 results.
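To run that inspection on a labeled debugging sample, compute the similarity yourself and flag cross-class pairs above the threshold. This is a minimal sketch with toy two-dimensional vectors. `suspicious_pairs` is a hypothetical helper and assumes you know the query's correct label.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def suspicious_pairs(query_vec, examples, query_label, threshold=0.9):
    """Return examples from a different class that still score above threshold.

    `examples` is a list of (label, vector) pairs.
    """
    flagged = []
    for label, vec in examples:
        sim = cosine_similarity(query_vec, vec)
        if label != query_label and sim > threshold:
            flagged.append((label, round(sim, 3)))
    return flagged

# Toy vectors: the refund example sits almost on top of the technical query.
examples = [
    ("refund", [0.99, 0.1]),
    ("technical_issue", [1.0, 0.0]),
    ("other", [0.0, 1.0]),
]
flagged = suspicious_pairs([1.0, 0.0], examples, query_label="technical_issue")
```

If many pairs are flagged across a sample of real queries, that points at the embedding model rather than at individual examples.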

★ Few-Shot vs Zero-Shot Prompting Triage Cheat Sheet
Copy-paste diagnostics. When it's 2am and you need answers fast.

Misclassification across all categories

Immediate action

Check if the prompt template changed. Did you accidentally remove the class list?

Commands

curl -X POST https://api.openai.com/v1/chat/completions -H "Authorization: Bearer $OPENAI_API_KEY" -H "Content-Type: application/json" -d '{"model": "gpt-4-turbo", "messages": [{"role": "user", "content": "Classify: \"I was charged twice\" into [refund, cancellation, technical_issue, other]"}]}'

python -c "import tiktoken; enc = tiktoken.encoding_for_model('gpt-4-turbo'); print(len(enc.encode('Classify: \"I was charged twice\" into [refund, cancellation, technical_issue, other]')))"

Fix now

Revert the prompt template to the last known working version from your git history. If you don't have one, start versioning it now.

High latency on few-shot requests

Model returns empty or truncated responses

Few-Shot vs Zero-Shot vs Fine-Tuning vs Prompt Engineering

Concern	Zero-Shot	Few-Shot	Fine-Tuning	Recommendation
Latency	Low (~200ms)	Medium (200-800ms)	Low (~200ms)	Zero-shot or fine-tune for real-time
Accuracy on edge cases	Low (hallucinates)	High (with good examples)	Very high	Fine-tune for critical tasks
Cost per request	Low	Medium (extra tokens)	Low per request (after upfront training cost)	Zero-shot for high volume, few-shot for tricky ones
Maintenance overhead	Low	Medium (example curation)	High (retraining)	Start with few-shot, graduate to fine-tune
Flexibility to change task	High	High (swap examples)	Low (retrain)	Few-shot for dynamic tasks

Key takeaways

Zero-shot relies entirely on model priors; a single missing example can cause catastrophic drift in edge cases—always validate with a held-out set before production.

Few-shot examples must be representative of the target distribution; using random examples from a different domain introduces bias worse than zero-shot.

Cache few-shot prompt embeddings (not raw responses) to reduce latency by 60-80% in high-throughput pipelines.

Monitor token usage per prompt type: few-shot can blow context windows if examples are too long, so set a hard limit of 3-5 examples with max 200 tokens each.

Hybrid pipelines (zero-shot for high-confidence inputs, few-shot for ambiguous ones) reduce cost by 40% while maintaining accuracy on ambiguous inputs.

Common mistakes to avoid

4 patterns

Using irrelevant few-shot examples

Symptom

Model outputs drift toward the example's domain, causing 30%+ accuracy drop on target data.

Fix

Curate examples from the exact production distribution; use a similarity retriever (e.g., cosine similarity on embeddings) to pick the top-3 closest examples per query.

Not setting a token budget for few-shot examples

Symptom

Context window overflow silently truncates examples, making few-shot degrade to zero-shot with partial context.

Fix

Enforce a max total prompt length (e.g., 2048 tokens) and truncate examples from the bottom; log truncation events to detect drift.
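A minimal sketch of such a budget check, assuming examples are ordered most relevant first so the least relevant ones are dropped first. `fit_examples_to_budget` is a hypothetical helper, and the whitespace-based counter is only a rough stand-in: pass a real tokenizer as `count_tokens` in production, and remember that chat formatting adds a few tokens per message that this sketch ignores.

```python
import logging

logger = logging.getLogger("prompt_budget")

def _whitespace_count(text: str) -> int:
    # Crude approximation used only when no tokenizer is supplied.
    return len(text.split())

def fit_examples_to_budget(system_prompt, examples, query,
                           max_tokens=2048, count_tokens=None):
    """Drop examples from the end of the list until the prompt fits the budget.

    Returns (kept_examples, number_dropped).
    """
    count = count_tokens or _whitespace_count
    used = count(system_prompt) + count(query)
    kept = []
    for example in examples:
        cost = count(example)
        if used + cost > max_tokens:
            break  # everything from here down is dropped
        kept.append(example)
        used += cost

    dropped = len(examples) - len(kept)
    if dropped:
        # Log truncation so a rising drop rate shows up in monitoring.
        logger.warning("dropped %d of %d examples to fit %d tokens",
                       dropped, len(examples), max_tokens)
    return kept, dropped

kept, dropped = fit_examples_to_budget(
    "a b", ["one two three", "four five", "six"], "c", max_tokens=7
)
```

With tiktoken installed, a counter such as `lambda t: len(enc.encode(t))`, where `enc` comes from `tiktoken.encoding_for_model(...)`, can be passed as `count_tokens`.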

Assuming zero-shot is cheaper (no example cost)

Symptom

Zero-shot hallucinates on ambiguous inputs, triggering costly manual reviews and re-runs.

Fix

Add a confidence threshold (e.g., top-token probability below 0.7, which is a log-probability below about -0.36) to fall back to few-shot or human-in-the-loop.
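Log-probabilities are never positive, so a threshold is easier to reason about once converted to a probability. The sketch below assumes the API returns the natural-log probability of the predicted label token. `route_by_confidence` is a hypothetical helper, and for labels that span several tokens the first token's log-probability is only an approximation of the class confidence.

```python
import math

def route_by_confidence(top_logprob: float, min_probability: float = 0.7) -> str:
    """Decide what to do with a zero-shot prediction.

    `top_logprob` is the natural-log probability of the predicted label
    token, so it is always <= 0. Returns "accept" or "fallback".
    """
    probability = math.exp(top_logprob)
    return "accept" if probability >= min_probability else "fallback"

confident = route_by_confidence(-0.1)   # exp(-0.1) is about 0.90
uncertain = route_by_confidence(-1.0)   # exp(-1.0) is about 0.37
```

A "fallback" result would then go to the few-shot prompt or to human review, depending on how costly a wrong label is.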

Caching raw responses instead of prompt-response pairs

Symptom

Stale cache returns incorrect results when model updates or prompt changes, causing silent failures.

Fix

Cache on (model_version, prompt_hash, temperature) tuple; invalidate on model deployment or prompt template change.
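One possible way to build that key, assuming the full rendered prompt (system text, examples and query) is available as a single string. `cache_key` is a hypothetical helper. Hashing the prompt keeps keys short and avoids storing raw customer text in the cache index.

```python
import hashlib
import json

def cache_key(model_version: str, prompt: str, temperature: float) -> str:
    """Build a deterministic cache key from everything that affects the output."""
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    payload = json.dumps(
        {"model": model_version, "prompt": prompt_hash, "temperature": temperature},
        sort_keys=True,  # stable field order gives a stable key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key_a = cache_key("gpt-4-0613", "Classify: I was charged twice", 0.0)
key_b = cache_key("gpt-4-turbo", "Classify: I was charged twice", 0.0)
```

Because the model version is part of the key, a model upgrade naturally misses the old entries instead of serving stale answers.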

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR

Explain the difference between zero-shot and few-shot prompting in LLMs.

Q02 · SENIOR

Design a production system that uses both zero-shot and few-shot prompti...

Q03 · SENIOR

Your few-shot prompt is causing 2-second latency per request. How do you...

Q04 · SENIOR

How would you monitor and alert on few-shot vs zero-shot performance deg...

Q05 · SENIOR

Compare few-shot prompting to fine-tuning for a domain-specific classifi...

Q01 of 05 · JUNIOR

Explain the difference between zero-shot and few-shot prompting in LLMs.

ANSWER

Zero-shot prompting gives the model a task description with no examples, relying on its pre-training to infer the output format. Few-shot prompting provides 2-5 input-output examples in the prompt, conditioning the model to mimic the pattern. Under the hood, few-shot shifts the model's attention distribution toward the examples, reducing variance but increasing token cost and latency.

FAQ · 5 QUESTIONS

Frequently Asked Questions

How many examples should I use for few-shot prompting?

When should I use zero-shot instead of few-shot?

Can I mix few-shot and zero-shot in the same pipeline?

How do I debug a few-shot prompt that suddenly fails?

Does few-shot prompting always outperform zero-shot?
