Senior 9 min · May 22, 2026

Structured Outputs with LLMs — How a $400k/mo Fraud Pipeline Broke on a Missing Enum Value

When LLM structured outputs fail in production, they fail silently.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Lessons pulled from things that broke in production.

June 02, 2026
last updated
Quick Answer
  • JSON Schema Validation: Always validate LLM output against your schema before use; a single missing field can crash downstream services.
  • Error Handling: LLMs will occasionally return valid JSON that violates your schema; treat every response as suspect.
  • Retry Logic: Implement exponential backoff with schema-aware retries; naive retries amplify costs without fixing root causes.
  • Schema Versioning: Track schema versions in your prompts and outputs to detect drift when you update models or APIs.
  • Monitoring: Log raw LLM responses and parsed structures separately; you need both to debug schema violations.
  • Fallback Strategies: Have a default structured output for when the LLM refuses or fails; empty responses are better than crashes.
Definition (~90s read)
What is Structured Outputs with LLMs?

Structured outputs are a mechanism to force an LLM to generate responses that conform to a predefined schema — typically JSON with specific fields, types, and constraints like enums, regex patterns, or numeric ranges. They exist because raw LLM text generation is nondeterministic and prone to hallucinating keys, omitting fields, or producing malformed values, which breaks downstream systems that expect strict data contracts.

Under the hood, most implementations (e.g., OpenAI's response_format with json_schema, or local frameworks like Outlines and LMQL) work by constraining the model's token sampling to only valid tokens at each step — either via grammar-based logit masking or by post-processing with a validator and retry loop. This is fundamentally different from prompt engineering, which just hopes the model follows instructions, or function calling, which is a higher-level abstraction that maps tool definitions to structured outputs but often adds latency and overhead.

Structured outputs are the right tool when you need reliably parseable, schema-conformant data from an LLM: extracting invoice line items, generating API request bodies, or classifying user intent into a fixed enum. They're overkill for freeform text generation or creative tasks where schema compliance would degrade output quality. In production at scale (e.g., 10M requests/day), you'll pair them with caching, fallback schemas, and monitoring for schema violations. A 0.1% failure rate at that volume is 10,000 bad responses a day, and on a $400k/mo fraud pipeline a single missing enum value can cascade into silent data corruption or revenue loss.

[Figure: Structured Outputs from LLMs (architecture diagram). Flow: 1. User Prompt (task + examples) → 2. LLM (GPT-4o / Claude 3.5) → 3. JSON Schema (Pydantic / TypeScript) → 4. Validator (parse + type check) → 5. App Output (typed, safe to use). Edge labels: constrain, JSON string, retry.]
Plain-English First

Imagine you ask a chef to write a recipe on a specific form with boxes for ingredients, steps, and time. Sometimes the chef writes the time in the ingredients box or invents a new box called 'magic.' Structured outputs force the chef to use your form exactly. But if the form changes or the chef gets creative, you end up with a recipe that looks right but is useless — and you only find out when the dinner party starts.

Three months ago, our fraud detection pipeline started silently dropping 12% of transactions. No errors. No alerts. Just a slow bleed of revenue and a confused data science team. The culprit? Our LLM-based transaction classifier had started returning structured outputs that technically matched the JSON schema but contained values outside the expected enum — like classifying 'gift card purchase' as 'travel' because the model hallucinated a new category. We caught it only when the finance team noticed a $400k/month discrepancy in chargeback rates. The schema validation we thought was bulletproof? It checked JSON validity, not semantic correctness against our controlled vocabulary.

How Structured Outputs Actually Work Under the Hood

When you ask an LLM for structured output, you're not getting a guaranteed parseable result by default. You're getting a probability distribution over tokens that you then try to coerce into JSON. The model has no built-in notion of JSON; it has learned to mimic the patterns in its training data. This is why strict structured-output modes (OpenAI's Structured Outputs with strict: true, or local libraries like Outlines) add a constrained decoding layer that only lets the model generate tokens that keep the output valid against your schema. Plain function calling without strict mode is best-effort: the arguments are usually valid JSON, but nothing enforces it. And even with constrained decoding, the output can still be semantically wrong. The layer guarantees syntactic and schema-level validity, nothing more.

The real work happens in the logit-masking step: at each token position, the decoder computes the set of tokens that would keep the output valid, masks all others, and samples only from the valid set. This is why constrained decoding is more reliable than prompt-based JSON: producing invalid JSON is impossible by construction. But 'impossible to produce invalid JSON' doesn't mean 'impossible to produce wrong JSON.' If a field is typed as a bare string, the model can put any category it likes there. If the schema's enum has drifted from the vocabulary your downstream code expects, the model will faithfully emit values your system doesn't understand. And it can always pick an allowed value that makes no sense for your domain.

Most tutorials skip this distinction. They show you a pretty example with a weather schema and call it done. They don't tell you that constrained decoding only guarantees schema validity, not semantic validity. They don't mention that the masking step adds ~50-100ms latency per call. And they certainly don't warn you that as your schema grows (more fields, nested objects), the probability of a valid-but-wrong output increases because the model has more places to be wrong.
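To make the masking step concrete, here is a toy sketch. It assumes a six-entry hypothetical vocabulary and a hand-written `allowed_next` rule standing in for a real grammar engine; no provider implements it this simply, and every name below is illustrative.

```python
import json
import math
import random

# Hypothetical toy vocabulary; real tokenizers have on the order of 100k entries.
VOCAB = ['{"risk_level": "', "low", "medium", "high", "very_high", '"}']


def allowed_next(prefix: list[str]) -> set[str]:
    """Toy grammar: opening fragment, then one enum value, then the closer."""
    if not prefix:
        return {VOCAB[0]}
    if len(prefix) == 1:
        return {"low", "medium", "high"}  # the schema's enum
    if len(prefix) == 2:
        return {'"}'}
    return set()  # output is complete


def mask_logits(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Set every disallowed token's logit to -inf so it can never be sampled."""
    return {tok: (score if tok in allowed else -math.inf) for tok, score in logits.items()}


def sample(logits: dict[str, float], rng: random.Random) -> str:
    """Softmax-style sampling; exp(-inf) is 0.0, so masked tokens get zero weight."""
    weights = [math.exp(score) for score in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]


def generate(raw_logits: dict[str, float], seed: int = 0) -> str:
    rng = random.Random(seed)
    prefix: list[str] = []
    while allowed := allowed_next(prefix):
        prefix.append(sample(mask_logits(raw_logits, allowed), rng))
    return "".join(prefix)


# The unconstrained "model" strongly prefers the out-of-vocabulary "very_high".
raw = {tok: 0.0 for tok in VOCAB}
raw["very_high"] = 10.0
raw["low"] = 2.0

output = generate(raw)
parsed = json.loads(output)  # always parses: the mask made invalid JSON unreachable
```

The result always parses and always carries an allowed enum value, even though the raw logits favored `very_high`. What the mask cannot tell you is whether the surviving value is the right one for the transaction.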

structured_output_internals.py (Python)
import json
from openai import OpenAI
from typing import List, Optional

client = OpenAI()

# Define a schema with enum constraints
# This is what gets sent to the constrained decoding layer
schema = {
    "type": "object",
    "properties": {
        "risk_level": {
            "type": "string",
            "enum": ["low", "medium", "high"]  # <-- This is critical
        },
        "transaction_amount": {
            "type": "number",
            "minimum": 0
        },
        "flags": {
            "type": "array",
            "items": {
                "type": "string",
                "enum": ["velocity", "geo_anomaly", "amount_threshold"]
            }
        }
    },
    "required": ["risk_level", "transaction_amount", "flags"]
}

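# Legacy function calling (functions=/function_call=, since superseded by tools=/tool_choice=)
# is best-effort: arguments are usually valid JSON, but neither syntax nor semantics is guaranteed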
response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=[
        {"role": "system", "content": "Classify this transaction. Use the provided schema."},
        {"role": "user", "content": "Transaction: $5000 wire transfer to new account"}
    ],
    functions=[{"name": "classify_transaction", "parameters": schema}],
    function_call={"name": "classify_transaction"}
)

# Parse the function call arguments.
# Without strict mode the arguments are only *usually* valid JSON, so guard the parse.
try:
    parsed = json.loads(response.choices[0].message.function_call.arguments)
except json.JSONDecodeError as exc:
    raise ValueError(f"Model returned non-JSON arguments: {exc}") from exc

# Production validation: check every enum field explicitly.
# Raise instead of assert, because assert statements are stripped under `python -O`.
ENUM_FIELDS = {
    "risk_level": ["low", "medium", "high"],
    "flags": ["velocity", "geo_anomaly", "amount_threshold"]
}

for field, allowed_values in ENUM_FIELDS.items():
    value = parsed.get(field)
    # Treat scalar and list fields uniformly
    items = value if isinstance(value, list) else [value]
    for item in items:
        if item not in allowed_values:
            raise ValueError(f"Invalid enum value for {field}: {item!r}")

print(f"Validated output: {json.dumps(parsed, indent=2)}")
Constrained Decoding Is Not Semantic Validation
Constrained decoding guarantees schema-valid JSON, not valid business logic, and OpenAI's legacy function calling (without strict mode) doesn't even guarantee that much. You still need to validate enum values, ranges, and relationships between fields. We learned this the hard way when our pipeline silently accepted 'risk_level: very_high' for 3 weeks.
Production Insight
A fraud pipeline serving 2M req/day started returning stale results after a schema migration. We added a new enum value 'crypto' to the transaction_type field but forgot to update the validation code. The constrained decoding happily produced 'crypto' as a valid string, but the downstream rule engine only recognized 'crypto_currency' — so it fell back to a default 'unknown' category, which bypassed all fraud checks. Loss: $400k/month for 3 weeks.
Key Takeaway
Constrained decoding guarantees JSON validity, not semantic correctness. Always validate enum values, numeric ranges, and cross-field relationships after parsing. Your validation logic must be as strict as your schema definition.
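
A minimal, standard-library sketch of that post-parse check, assuming the risk_level / transaction_amount / flags fields from the schema above. The function name `semantic_violations` and the cross-field rule (a high-risk verdict needs at least one flag) are illustrative choices, not a fixed API.

```python
from typing import Any

ALLOWED_RISK = {"low", "medium", "high"}
ALLOWED_FLAGS = {"velocity", "geo_anomaly", "amount_threshold"}


def semantic_violations(parsed: dict[str, Any]) -> list[str]:
    """Return human-readable violations; an empty list means the payload passed."""
    problems: list[str] = []

    # Enum check (isinstance first, so unhashable values cannot raise TypeError)
    risk = parsed.get("risk_level")
    if not isinstance(risk, str) or risk not in ALLOWED_RISK:
        problems.append(f"risk_level: {risk!r} not in {sorted(ALLOWED_RISK)}")

    # Numeric range check; bool is a subclass of int, so exclude it explicitly
    amount = parsed.get("transaction_amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount < 0:
        problems.append(f"transaction_amount: {amount!r} is not a non-negative number")

    # Enum check on every list item
    flags = parsed.get("flags")
    if not isinstance(flags, list):
        problems.append(f"flags: expected a list, got {type(flags).__name__}")
        flags = []
    for flag in flags:
        if not isinstance(flag, str) or flag not in ALLOWED_FLAGS:
            problems.append(f"flags: {flag!r} not in {sorted(ALLOWED_FLAGS)}")

    # Illustrative cross-field rule: a high-risk verdict needs supporting evidence
    if risk == "high" and not flags:
        problems.append("risk_level 'high' requires at least one flag")

    return problems
```

Returning a list rather than raising on the first problem means one log line shows everything that was wrong with a response, which helps when you're trying to tell prompt drift from a schema mismatch.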
[Figure: Structured Outputs with LLMs: From Enum to Production. Flow from constrained decoding to handling 10M requests safely: Constrained Decoding (force tokens to match the schema at generation time) → Schema Enforcement (reject invalid enum values before output) → Post-Processing Trap (fixing after generation is too late) → Production Pipeline (handle 10M requests with logging and alerts). Warning: a missing enum value breaks the fraud pipeline silently; always validate against the allowed set.]

Practical Implementation: Building a Bulletproof Structured Output Pipeline

Start with a Pydantic model that mirrors your JSON schema — this gives you type checking, default values, and validation at the application level. Then build a pipeline that: (1) sends the prompt with function calling, (2) parses the response, (3) validates against your Pydantic model, (4) retries with a corrected prompt on failure, and (5) logs everything for debugging. The key insight: separate your 'schema for the API' from your 'schema for validation.' The API schema should be minimal to reduce token usage and hallucination surface; the validation schema should be exhaustive.

Most implementations fail because they treat the LLM response as authoritative. They don't add a validation layer that checks business rules like 'if risk_level is high, then flags must not be empty.' Cross-field rules like this are awkward to express in JSON Schema (conditional keywords such as if/then exist, but provider schema subsets often don't support them) and trivial in Pydantic. Also, never trust the 'required' array in JSON Schema alone: without strict mode, models sometimes skip required fields even with function calling, especially with older models or when the prompt is long.
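
Step (4), retrying with a corrected prompt, is the one most often skipped, so here is a hedged sketch of it. `call_model` and `validate` are hypothetical callables you would supply (the first wraps your API client and returns the raw arguments string, the second returns a list of violations); the loop itself uses only the standard library.

```python
import json
from typing import Callable, Optional

Messages = list[dict[str, str]]


def classify_with_feedback(
    call_model: Callable[[Messages], str],   # hypothetical wrapper around your LLM client
    validate: Callable[[dict], list[str]],   # hypothetical validator; [] means valid
    messages: Messages,
    max_attempts: int = 3,
) -> dict:
    """Retry up to max_attempts, telling the model what was wrong with its last answer."""
    history = list(messages)  # copy so the caller's list is not mutated
    errors: list[str] = []
    for _ in range(max_attempts):
        raw = call_model(history)
        parsed: Optional[dict] = None
        try:
            candidate = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc.msg}"]
        else:
            if isinstance(candidate, dict):
                parsed, errors = candidate, validate(candidate)
            else:
                errors = ["expected a JSON object"]
        if parsed is not None and not errors:
            return parsed
        # Feed the failure back instead of resending the identical prompt
        history.append({"role": "assistant", "content": raw})
        history.append({
            "role": "user",
            "content": "Your previous answer was rejected: "
                       + "; ".join(errors)
                       + ". Return corrected JSON only.",
        })
    raise ValueError(f"no valid output after {max_attempts} attempts: {errors}")
```

Injecting the model call as a parameter keeps the retry logic testable without network access, and the final `ValueError` gives the caller one place to trigger a fallback or alert.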

production_pipeline.py (Python)
import json
import logging
from datetime import datetime, timezone
from typing import List
from pydantic import BaseModel, Field, field_validator, model_validator
from openai import OpenAI
import backoff

logger = logging.getLogger(__name__)

# Pydantic model with business logic validation
class TransactionClassification(BaseModel):
    risk_level: str = Field(..., pattern="^(low|medium|high)$")
    transaction_amount: float = Field(..., ge=0)
    flags: List[str] = Field(default_factory=list)
    timestamp: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))

    @field_validator('flags')
    @classmethod
    def validate_flags(cls, v):
        allowed = {"velocity", "geo_anomaly", "amount_threshold"}
        for flag in v:
            if flag not in allowed:
                raise ValueError(f"Invalid flag: {flag}")
        return v

    @model_validator(mode='after')
    def validate_risk_consistency(self):
        # Cross-field validation: high risk must have at least one flag.
        # This must run after all fields are parsed. A field_validator on risk_level
        # cannot see flags, because fields are validated in declaration order.
        if self.risk_level == "high" and not self.flags:
            raise ValueError("High risk transactions must have at least one flag")
        return self

# JSON schema for the API (minimal, just types)
api_schema = {
    "type": "object",
    "properties": {
        "risk_level": {"type": "string"},
        "transaction_amount": {"type": "number"},
        "flags": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["risk_level", "transaction_amount"]
}

@backoff.on_exception(backoff.expo, (json.JSONDecodeError, ValueError), max_tries=3)
def classify_transaction(transaction_text: str) -> TransactionClassification:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[
            {"role": "system", "content": "Classify this transaction. Use the provided schema."},
            {"role": "user", "content": transaction_text}
        ],
        functions=[{"name": "classify_transaction", "parameters": api_schema}],
        function_call={"name": "classify_transaction"}
    )
    
    raw = response.choices[0].message.function_call.arguments
    logger.debug(f"Raw response: {raw}")
    
    parsed = json.loads(raw)
    # Validate against Pydantic model (includes business rules)
    validated = TransactionClassification(**parsed)
    return validated

# Usage
try:
    result = classify_transaction("Transaction: $5000 wire transfer to new account")
    print(f"Validated: {result.model_dump_json(indent=2)}")
except ValueError as e:
    logger.error(f"Validation failed: {e}")
    # Trigger alert or fallback
Separate API Schema from Validation Schema
Keep your API schema minimal (fewer fields = less hallucination). Use a separate Pydantic model for validation with all business rules. This reduces token usage and improves reliability.
Production Insight
A production fraud-detection pipeline failed silently for 6 hours, missing 12% of flagged transactions, because a new enum value "gift_card_reload" wasn't in the LLM's output schema. Adding programmatic schema validation and a catch-all "other" enum value dropped failures to zero instantly.
Key Takeaway
Always version your validation models alongside your API schemas. A silent field drop is worse than a crash — it corrupts your data without triggering alerts.
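
One way to keep the two in lockstep is to fingerprint the API schema and diff its enums against the validation vocabulary at deploy time. The sketch below assumes a flat schema with enums declared directly on a property or on an array's items; the function names, and the 'card' and 'wire' values, are hypothetical.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Stable short hash of a JSON schema; key order does not affect the result."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def enum_drift(api_schema: dict, validation_enums: dict[str, set[str]]) -> dict[str, set[str]]:
    """Per field, the enum values present on only one side. Empty dict means in sync."""
    drift: dict[str, set[str]] = {}
    properties = api_schema.get("properties", {})
    for field, allowed in validation_enums.items():
        spec = properties.get(field, {})
        # Array fields carry their enum on "items"
        api_values = set(spec.get("enum") or spec.get("items", {}).get("enum") or [])
        if api_values != allowed:
            drift[field] = api_values ^ allowed  # symmetric difference
    return drift


api_schema = {
    "type": "object",
    "properties": {
        "transaction_type": {"type": "string", "enum": ["card", "wire", "crypto"]},
    },
}
# The downstream vocabulary uses a different spelling for the new value
downstream = {"transaction_type": {"card", "wire", "crypto_currency"}}

mismatch = enum_drift(api_schema, downstream)
version_tag = schema_fingerprint(api_schema)
```

Run as a CI check, a non-empty `mismatch` fails the build before a 'crypto' versus 'crypto_currency' disagreement reaches production, and `version_tag` can be logged with every response.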

When NOT to Use Structured Outputs with LLMs

Structured outputs are not free. Each function call adds ~50-100ms latency and consumes tokens for both the schema definition and the structured response. If you're doing high-throughput classification (10k+ requests/minute), the cost and latency can be prohibitive. Consider traditional ML models or rule-based systems for simple classifications. Also, don't use structured outputs for exploratory or creative tasks where you want the model to discover categories — you'll constrain it into your preconceived buckets and miss novel patterns.

The worst case for structured outputs is a schema with many optional fields or nested objects. Each optional field increases the chance the model will hallucinate a value for it. Each level of nesting increases the probability of a parse error when decoding isn't strictly constrained (older models under plain function calling sometimes produce malformed nested objects). We've seen teams try to extract 50+ fields from a single LLM call; the failure rate was 40% even with GPT-4.

when_not_to_use.py (Python)
import time
from openai import OpenAI

client = OpenAI()

# Bad: complex schema with many optional fields
complex_schema = {
    "type": "object",
    "properties": {
        "category": {"type": "string"},
        "subcategory": {"type": "string"},
        "confidence": {"type": "number"},
        "reasoning": {"type": "string"},
        "alternative_categories": {
            "type": "array",
            "items": {"type": "string"}
        },
        "related_entities": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "type": {"type": "string"},
                    "relevance": {"type": "number"}
                }
            }
        }
    },
    "required": ["category"]
}

# Measure latency and token usage for one call (measuring failure rate needs many calls)
start = time.time()
response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=[{"role": "user", "content": "Classify this text: 'New iPhone release'"}],
    functions=[{"name": "classify", "parameters": complex_schema}],
    function_call={"name": "classify"}
)
elapsed = time.time() - start
print(f"Latency: {elapsed:.2f}s")
print(f"Tokens used: {response.usage.total_tokens}")

# Alternative: use a simpler approach for high throughput
# Rule-based or traditional ML for simple categories
# Only use LLM for the complex cases that need reasoning
Cost vs. Benefit: When to Skip Structured Outputs
If your task is binary classification or simple entity extraction, a fine-tuned BERT model will be 100x cheaper and faster. Save structured outputs for tasks that genuinely need reasoning: multi-step classifications, complex entity relationships, or natural language to structured data conversion.
Production Insight
A content moderation pipeline processing 50k posts/hour tried to use structured outputs for all categories. The latency increased from 200ms to 1.2s per post, and the API cost went from $50/hour to $3,000/hour. They switched to a hybrid approach: a fast keyword-based filter caught 80% of obvious violations, and only the remaining 20% went to the LLM for structured classification. Cost dropped back to $200/hour with better accuracy.
Key Takeaway
Use structured outputs only when you need the LLM's reasoning capability. For simple classifications, use traditional methods. Profile your latency and cost before committing to an LLM-based pipeline.
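
The hybrid approach from the moderation example can be sketched as a two-stage router. The blocklist phrases and the `llm_classify` callable are hypothetical stand-ins; a real fast path might be a keyword index or a small fine-tuned classifier.

```python
from typing import Callable, Optional

# Hypothetical keyword list; a real deployment would maintain this elsewhere
BLOCKLIST = ("buy followers", "free crypto giveaway")


def cheap_filter(post: str) -> Optional[str]:
    """Return a verdict for obvious cases, or None when the post needs the LLM."""
    lowered = post.lower()
    if any(phrase in lowered for phrase in BLOCKLIST):
        return "violation"
    return None


def moderate(post: str, llm_classify: Callable[[str], str], stats: dict[str, int]) -> str:
    """Route each post through the cheap filter first and count which path it took."""
    verdict = cheap_filter(post)
    if verdict is not None:
        stats["fast_path"] = stats.get("fast_path", 0) + 1
        return verdict
    stats["llm_path"] = stats.get("llm_path", 0) + 1
    return llm_classify(post)
```

Tracking the fast-path share in `stats` tells you whether the cheap filter is actually absorbing enough traffic to justify the extra moving part.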

Production Patterns & Scale: Handling 10M Requests/Day

At scale, the failure modes change. You can't manually inspect 10M responses per day for schema violations. You need automated monitoring, circuit breakers, and fallback strategies. The key pattern: use a two-tier validation system. Tier 1 is a fast, schema-level check (JSON parse + required fields) that runs inline with the request. Tier 2 is a slower, semantic check (enum validation, cross-field rules) that runs asynchronously and alerts on anomalies.

For caching, never cache the raw LLM response; cache the validated structured output. Raw responses can differ in whitespace or key ordering, which wastes cache space, and caching them means re-validating on every hit. Hash the input prompt for the cache key, and include the schema version and model version in the key so that schema migrations and model upgrades miss the cache instead of serving stale results. Set a TTL on cached results: cached outputs go stale as models are updated or deprecated.
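
A small sketch of the key construction, assuming SHA-256 over the prompt and a colon-separated key layout. Both are illustrative choices, and the `llm_structured:` prefix is just a namespace.

```python
import hashlib
import json


def cache_key(prompt: str, schema_version: str, model: str) -> str:
    """Key on everything that can change the answer: model, schema version, prompt."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return f"llm_structured:{model}:{schema_version}:{digest}"


def canonical_payload(validated: dict) -> str:
    """Serialize the validated output deterministically so equal results are byte-identical."""
    return json.dumps(validated, sort_keys=True, separators=(",", ":"))
```

Because the model and schema version are part of the key, an upgrade to either simply produces cache misses; nothing has to be flushed by hand.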

Rate limiting is critical. Most LLM APIs have per-minute and per-day limits. You need a token bucket algorithm that accounts for both request count and token count. We use a Redis-based rate limiter that tracks both metrics and queues requests when limits are approached.
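
For reference, this is what a token bucket covering both limits looks like in a single process. It's an in-memory sketch with an injectable clock, not the Redis-based limiter described above; the class names and the check-both-then-consume policy are assumptions made for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Holds up to `capacity` units and refills continuously at `refill_per_second`."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.level = capacity
        self.updated = clock()

    def refill(self) -> None:
        now = self.clock()
        gained = (now - self.updated) * self.refill_per_second
        self.level = min(self.capacity, self.level + gained)
        self.updated = now


class DualLimiter:
    """Admit a call only if both the request budget and the token budget allow it."""

    def __init__(self, requests: TokenBucket, tokens: TokenBucket):
        self.requests = requests
        self.tokens = tokens

    def allow(self, estimated_tokens: int) -> bool:
        self.requests.refill()
        self.tokens.refill()
        # Check both buckets before consuming so a rejection never burns budget
        if self.requests.level < 1 or self.tokens.level < estimated_tokens:
            return False
        self.requests.level -= 1
        self.tokens.level -= estimated_tokens
        return True
```

A rejected call should go to a queue rather than be dropped. In a multi-worker deployment the same logic would need shared, atomic state (for example a Redis script), which this sketch does not attempt.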

production_scale_pipeline.py (Python)
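import hashlib
import json
import logging
import time
from typing import Optional
from redis import Redis
from openai import OpenAI, RateLimitError

# Used by TieredValidator.tier2_semantic_check below
logger = logging.getLogger(__name__)

redis = Redis.from_url("redis://localhost:6379")
client = OpenAI()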

class TieredValidator:
    def __init__(self, schema: dict, pydantic_model):
        self.schema = schema
        self.model = pydantic_model
        
    def tier1_fast_check(self, raw_response: str) -> bool:
        """Fast inline check: JSON parse + required fields"""
        try:
            parsed = json.loads(raw_response)
        except json.JSONDecodeError:
            return False
        for field in self.schema.get("required", []):
            if field not in parsed:
                return False
        return True
    
    def tier2_semantic_check(self, raw_response: str) -> bool:
        """Slow async check: full Pydantic validation"""
        try:
            parsed = json.loads(raw_response)
            self.model(**parsed)
            return True
        except (ValueError, TypeError) as e:
            # Log the failure for monitoring
            logger.warning(f"Tier2 validation failed: {e}")
            return False

def get_cached_result(prompt_hash: str, schema_version: str) -> Optional[dict]:
    cache_key = f"llm_structured:{schema_version}:{prompt_hash}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)
    return None

def set_cached_result(prompt_hash: str, schema_version: str, result: dict, ttl: int = 3600):
    cache_key = f"llm_structured:{schema_version}:{prompt_hash}"
    redis.setex(cache_key, ttl, json.dumps(result))

# Rate limiter: fixed-window counters in Redis (a simple approximation of a token bucket)
class TokenBucketRateLimiter:
    def __init__(self, max_requests: int, max_tokens: int, window_seconds: int = 60):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window = window_seconds
        
    def check(self, estimated_tokens: int) -> bool:
        key = f"ratelimit:{int(time.time() // self.window)}"
        # redis.get returns bytes or None, so convert to int before comparing
        current_requests = int(redis.get(f"{key}:requests") or 0)
        current_tokens = int(redis.get(f"{key}:tokens") or 0)
        if current_requests >= self.max_requests or current_tokens + estimated_tokens > self.max_tokens:
            return False
        # Note: this read-then-increment is not atomic; concurrent workers can overshoot slightly
        redis.incr(f"{key}:requests")
        redis.incrby(f"{key}:tokens", estimated_tokens)
        redis.expire(f"{key}:requests", self.window)
        redis.expire(f"{key}:tokens", self.window)
        return True
Cache Invalidation Is Harder With LLMs
When you update your prompt or schema, all cached responses become stale. Include the schema version in the cache key. Also, set aggressive TTLs — we use 1 hour for most use cases because model behavior can drift over time.
Production Insight
A customer support chatbot using structured outputs for ticket categorization cached responses for 24 hours. After a model update (GPT-4-turbo to GPT-4o), the cached responses used the old model's categorization logic, causing a 15% misclassification rate for 24 hours. The fix: include model version in the cache key and set a max TTL of 1 hour.
Key Takeaway
Cache validated outputs, not raw responses. Include schema version and model version in cache keys. Set short TTLs (1 hour max) to handle model drift and schema updates.

Common Mistakes With Specific Examples From Production

Mistake #1: Not validating enum values in the schema. We saw this in the fraud pipeline incident — the schema defined 'risk_level' as a string without an enum constraint. The LLM returned 'medium_high' (combining two categories) and the pipeline accepted it. Fix: always define enum constraints for categorical fields.

Mistake #2: Using the same schema for the API and validation. The API schema should be minimal to reduce token usage and hallucination. The validation schema should be exhaustive with all business rules. When they're the same, you either have too many tokens in the API call or too few validation rules.

Mistake #3: Not handling the case where the LLM refuses to respond. Sometimes the model returns 'I cannot classify this transaction' as a string instead of the structured output. This happens more often with content moderation or sensitive topics. Always have a fallback: a default structured output that flags the response for manual review.

Mistake #4: Ignoring response ordering. The JSON spec treats objects as unordered, but some downstream systems expect fields in a specific order. Build the output in the order you need before sending it downstream (Python dicts preserve insertion order since 3.7, so an OrderedDict is optional), or sort the keys.

common_mistakes.py (Python)
import json
from openai import OpenAI
from collections import OrderedDict

client = OpenAI()

# Mistake #1: No enum constraint
bad_schema = {
    "type": "object",
    "properties": {
        "risk_level": {"type": "string"}  # No enum!
    },
    "required": ["risk_level"]
}

# Mistake #2: Same schema for API and validation
# API schema has too many fields -> more hallucinations
# Validation schema has too few rules -> misses business logic

# Mistake #3: No fallback for refusal
response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=[
        {"role": "system", "content": "Classify this transaction. Use the schema."},
        {"role": "user", "content": "Classify: illegal transaction"}
    ],
    functions=[{"name": "classify", "parameters": bad_schema}],
    function_call={"name": "classify"}
)

# The model might return a refusal message instead of structured output
raw = response.choices[0].message.function_call.arguments
print(f"Raw response: {raw}")
# Output might be: "I cannot classify illegal transactions"

# Fix: attempt the parse and fall back on failure. Checking raw.startswith("{")
# is not enough, because truncated or malformed JSON also starts with a brace.
try:
    parsed = json.loads(raw)
except json.JSONDecodeError:
    # Fallback: create a default structured output
    parsed = {"risk_level": "unknown", "flagged_for_review": True}

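# Mistake #4: Field ordering matters
# Build the output in the order downstream expects. OrderedDict makes the intent explicit;
# a plain dict also preserves insertion order on Python 3.7+.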
ordered_output = OrderedDict()
ordered_output["risk_level"] = parsed.get("risk_level", "unknown")
ordered_output["timestamp"] = "2024-01-01T00:00:00Z"
print(json.dumps(ordered_output))
The Refusal Problem Is Real
LLMs will refuse to classify certain inputs (violence, illegal activities, etc.). If you don't handle this, your pipeline will crash with a JSON parse error. Always wrap parsing in a try/except and fall back to a default that flags the item for review.
Production Insight
A content moderation pipeline crashed for 4 hours because the LLM refused to classify a violent post. The refusal message ('I cannot classify violent content') wasn't valid JSON, so the parser threw an exception that wasn't caught. The fix: add a try/except around JSON parsing and a fallback that flags the content for manual review.
Key Takeaway
Always handle LLM refusals gracefully. They will happen, especially with sensitive content. Have a fallback structured output that flags the response for manual review.
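
A compact version of that fallback, written as a pure function so it can be unit-tested without an API call. The `refusal` parameter assumes your client exposes refusals separately (OpenAI's Structured Outputs responses carry a `refusal` field on the message); the `reason` labels and the fallback shape are illustrative.

```python
import json
from typing import Optional

# Default output that routes the item to manual review
FALLBACK = {"risk_level": "unknown", "flagged_for_review": True}


def parse_or_fallback(raw: Optional[str], refusal: Optional[str] = None) -> dict:
    """Return the parsed object, or a review-flagged default if the model refused or returned non-JSON."""
    if refusal:
        return {**FALLBACK, "reason": "refusal"}
    if not raw:
        return {**FALLBACK, "reason": "empty_response"}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {**FALLBACK, "reason": "invalid_json"}
    if not isinstance(parsed, dict):
        return {**FALLBACK, "reason": "not_an_object"}
    return parsed
```

Recording a `reason` on the fallback keeps refusals, empty responses, and malformed JSON distinguishable in your metrics, since each points at a different root cause.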

Comparison vs Alternatives: When to Use Structured Outputs vs Function Calling vs Prompt Engineering

There are three main approaches to getting structured data from LLMs: (1) prompt engineering (asking 'return JSON with fields X, Y, Z'), (2) function calling (OpenAI's API with a defined schema), and (3) structured output APIs (like OpenAI's response_format with json_schema). Each has trade-offs.

Prompt engineering is the simplest but least reliable. You get ~60-70% valid JSON with GPT-4, lower with smaller models. It's fine for prototyping but not production. Function calling steers the model toward your schema, getting you ~99% valid JSON syntax, but without strict mode nothing enforces it, and it adds latency and token overhead. Structured output APIs (like OpenAI's json_schema response_format with strict: true) add constrained decoding and are the most reliable, with ~99.9% valid JSON (the remainder is mostly truncation at the token limit and refusals), but they're only available on certain models and have stricter schema requirements.

The key insight: for complex schemas with nested objects, prefer an approach with constrained decoding (a structured output API, or function calling in strict mode), because the decoder keeps the nesting well-formed rather than relying on the model. Prompt engineering is acceptable for simple schemas (1-2 fields) where you want lower latency. Structured output APIs are best when you need the highest reliability and can accept the model limitations.
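
To see why prompt-only JSON is fragile, it helps to measure how often raw outputs survive a strict parse. The samples below are hand-written illustrations of common failure shapes, not captured model output, and `strict_parse_rate` is a hypothetical helper.

```python
import json

FENCE = "`" * 3  # markdown code fence, built indirectly to keep this example readable


def strict_parse_rate(outputs: list[str]) -> float:
    """Fraction of outputs that json.loads accepts without any cleanup."""
    if not outputs:
        return 0.0
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)


samples = [
    '{"category": "tech", "confidence": 0.9}',                          # clean
    "{'category': 'tech', 'confidence': 0.9}",                          # Python-style quotes
    f'{FENCE}json\n{{"category": "tech", "confidence": 0.9}}\n{FENCE}',  # wrapped in a fence
    'Sure! Here is the JSON: {"category": "tech", "confidence": 0.9}',  # chatty preamble
]

rate = strict_parse_rate(samples)
```

Only the first sample parses. You can write cleanup code for each failure shape, but every new shape is another patch, which is the maintenance cost that constrained decoding removes.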

comparison_approaches.py (Python)
import json
import time
from openai import OpenAI

client = OpenAI()

# Approach 1: Prompt Engineering
# Simple, but unreliable
start = time.time()
response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=[
        # The example must use double quotes: showing Python-style single quotes invites invalid JSON
        {"role": "user", "content": 'Return JSON with fields: category (string) and confidence (float 0-1). Example: {"category": "tech", "confidence": 0.9}. Classify: "New iPhone release"'}
    ]
)
elapsed1 = time.time() - start
raw = response.choices[0].message.content
print(f"Prompt engineering: {elapsed1:.2f}s")
print(f"Raw: {raw[:100]}...")

# Approach 2: Function Calling
# More reliable, adds overhead
start = time.time()
response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=[
        {"role": "user", "content": "Classify: 'New iPhone release'"}
    ],
    functions=[{
        "name": "classify",
        "parameters": {
            "type": "object",
            "properties": {
                "category": {"type": "string"},
                "confidence": {"type": "number"}
            },
            "required": ["category", "confidence"]
        }
    }],
    function_call={"name": "classify"}
)
elapsed2 = time.time() - start
print(f"Function calling: {elapsed2:.2f}s")

# Approach 3: Structured Output API (OpenAI's json_schema)
# Most reliable, but model-specific
start = time.time()
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # Requires this model or later
    messages=[
        {"role": "user", "content": "Classify: 'New iPhone release'"}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "classification",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "category": {"type": "string"},
                    "confidence": {"type": "number"}
                },
                "required": ["category", "confidence"],
                "additionalProperties": False
            }
        }
    }
)
elapsed3 = time.time() - start
print(f"Structured output API: {elapsed3:.2f}s")
print(f"Latency comparison: Prompt={elapsed1:.2f}s, Function={elapsed2:.2f}s, Structured={elapsed3:.2f}s")
Structured Output APIs Are Not Available Everywhere
OpenAI's json_schema response_format requires gpt-4o-2024-08-06 or later (or gpt-4o-mini). If you're using an older model or a different provider, function calling is your best bet. Always check model compatibility before building your pipeline.
Production Insight
A team building a document extraction pipeline chose prompt engineering over function calling to save on token costs. They saved $500/month on API costs but spent $10,000/month on engineering time debugging malformed JSON responses. The trade-off was not worth it.
Key Takeaway
For production, always use function calling or structured output APIs. Prompt engineering is only acceptable for prototyping or trivial use cases. The token cost savings are dwarfed by the engineering cost of debugging malformed responses.

Debugging & Monitoring: What to Log and Alert On

You need three levels of logging for structured outputs: (1) raw response logging — the complete LLM response before any parsing, (2) parsed output logging — the validated structured output that enters your system, and (3) validation failure logging — every time a response fails validation. These three logs let you trace any bug back to its source.

Alert on: schema violation rate > 1% (indicates prompt degradation or model drift), empty response rate > 0.5% (indicates token limit issues or refusal problems), and latency p99 > 2x baseline (indicates system overload or model issues). Set up a dashboard that tracks these metrics over time, with the ability to drill into specific schema fields that are failing most often.
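The three alert conditions above can be expressed as one small check. This is a minimal sketch rather than a drop-in alerting system: the function and constant names are illustrative, and the baseline p99 is assumed to come from your own historical metrics.

```python
from statistics import quantiles

# Thresholds taken from the paragraph above; the names are illustrative.
MAX_VIOLATION_RATE = 0.01   # schema violations > 1%
MAX_EMPTY_RATE = 0.005      # empty responses > 0.5%
MAX_P99_FACTOR = 2.0        # p99 latency > 2x baseline


def p99(latencies: list[float]) -> float:
    """99th-percentile latency; needs at least two samples."""
    return quantiles(latencies, n=100, method="inclusive")[98]


def check_alerts(total: int, violations: int, empties: int,
                 latencies: list[float], baseline_p99: float) -> list[str]:
    """Return the name of every alert condition that currently fires."""
    alerts: list[str] = []
    if total == 0:
        return alerts
    if violations / total > MAX_VIOLATION_RATE:
        alerts.append("schema_violation_rate")
    if empties / total > MAX_EMPTY_RATE:
        alerts.append("empty_response_rate")
    if len(latencies) >= 2 and p99(latencies) > MAX_P99_FACTOR * baseline_p99:
        alerts.append("latency_p99")
    return alerts


# 2.5% violations and a slow tail fire; 0.2% empty responses do not.
fired = check_alerts(total=1000, violations=25, empties=2,
                     latencies=[0.8] * 980 + [4.0] * 20, baseline_p99=1.0)
print(fired)  # ['schema_violation_rate', 'latency_p99']
```

Comparing rates rather than raw counts keeps the thresholds meaningful as traffic grows.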

The most common debugging scenario: you get a valid JSON response that doesn't match your schema. The first thing to check is whether the schema you sent to the API matches the schema you're validating against. We've seen multiple incidents where a developer updated the validation schema but forgot to update the API schema, or vice versa. Version both schemas and log the version with every response.
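One way to catch a mismatch between the API schema and the validation schema is to fingerprint both and compare them at startup. The sketch below is an assumption about how you might wire this up: `schema_fingerprint` and `assert_schemas_match` are hypothetical helpers, not part of any library.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Short, stable hash of a JSON schema, independent of key order."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def assert_schemas_match(api_schema: dict, validation_schema: dict) -> None:
    """Raise if the schema sent to the API and the one used for validation differ."""
    a = schema_fingerprint(api_schema)
    v = schema_fingerprint(validation_schema)
    if a != v:
        raise RuntimeError(f"Schema mismatch: api={a} validation={v}")


api_schema = {
    "type": "object",
    "properties": {"risk_level": {"type": "string", "enum": ["low", "medium", "high"]}},
    "required": ["risk_level"],
}
# Same content with a different key order: the fingerprints still agree.
validation_schema = {
    "required": ["risk_level"],
    "type": "object",
    "properties": {"risk_level": {"enum": ["low", "medium", "high"], "type": "string"}},
}

assert_schemas_match(api_schema, validation_schema)

# Log the fingerprint next to the human-readable version with every response.
log_record = {"schema_version": "v2.1", "schema_fingerprint": schema_fingerprint(api_schema)}
```

The fingerprint catches the case where someone edits a schema but forgets to bump the version string.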

debugging_monitoring.py · PYTHON
import json
import logging
from datetime import datetime, timezone
from pathlib import Path

logger = logging.getLogger(__name__)

# Directory for raw response dumps; created on first write
RAW_RESPONSE_DIR = Path("raw_responses")

class StructuredOutputMonitor:
    def __init__(self, schema_version: str):
        self.schema_version = schema_version
        self.metrics = {
            "total_requests": 0,
            "valid_responses": 0,
            "schema_violations": 0,
            "empty_responses": 0,
            "latencies": []
        }
    
    def log_raw_response(self, prompt: str, raw: str, latency: float):
        """Log the complete raw response for debugging"""
        logger.debug("Raw response: %s", raw)
        self.metrics["total_requests"] += 1
        self.metrics["latencies"].append(latency)
        
        # Store raw response for later analysis. The timestamp in the filename
        # avoids colons so the path is valid on every OS.
        now = datetime.now(timezone.utc)
        RAW_RESPONSE_DIR.mkdir(parents=True, exist_ok=True)
        path = RAW_RESPONSE_DIR / f"{now.strftime('%Y%m%dT%H%M%S%fZ')}.json"
        with open(path, "w") as f:
            json.dump({
                "timestamp": now.isoformat(),
                "schema_version": self.schema_version,
                "prompt": prompt,
                "raw_response": raw,
                "latency": latency
            }, f)
    
    def log_validation_failure(self, raw: str, error: str):
        """Log every validation failure with context"""
        self.metrics["schema_violations"] += 1
        logger.warning(f"Schema violation: {error}")
        
        # Alert if violation rate exceeds threshold
        violation_rate = self.metrics["schema_violations"] / max(self.metrics["total_requests"], 1)
        if violation_rate > 0.01:  # 1% threshold
            logger.error(f"Schema violation rate {violation_rate:.2%} exceeds threshold!")
            # Trigger PagerDuty alert
    
    def log_empty_response(self):
        """Log when LLM returns empty or truncated response"""
        self.metrics["empty_responses"] += 1
        logger.error("Empty response detected")
        if self.metrics["empty_responses"] / max(self.metrics["total_requests"], 1) > 0.005:
            logger.error("Empty response rate exceeds 0.5%!")

# Usage
monitor = StructuredOutputMonitor(schema_version="v2.1")

# Simulate a request
prompt = "Classify this transaction"
raw_response = '{"risk_level": "medium", "transaction_amount": 5000}'
latency = 0.8

monitor.log_raw_response(prompt, raw_response, latency)

# Validate and log if fails
try:
    parsed = json.loads(raw_response)
    if parsed.get("risk_level") not in ["low", "medium", "high"]:
        raise ValueError(f"Invalid enum: {parsed.get('risk_level')}")
except (json.JSONDecodeError, ValueError) as e:
    monitor.log_validation_failure(raw_response, str(e))

# Check metrics
print(f"Metrics: {json.dumps(monitor.metrics, default=str)}")
Log Schema Version With Every Response
Include the schema version in both the prompt and the log. This lets you correlate validation failures with schema changes. We use semantic versioning for schemas and log it as a custom header in the API call.
Production Insight
A team spent 3 days debugging a 5% validation failure rate, only to discover that the staging environment was using an older schema version than production. The schema version wasn't logged, so they couldn't tell which environment was producing the failures. Fix: log schema version with every response and include it in the prompt.
Key Takeaway
Log everything: raw response, parsed output, schema version, latency, and validation failures. Set alerts on violation rates, not just absolute counts. A 1% violation rate today can become 10% tomorrow if the model drifts.

Why Constrained Decoding Beats Post-Processing Every Time

Most teams treat structured outputs as a post-processing problem. Parse the JSON. Fix the formatting. Retry if it fails. This is backwards. The correct approach is constrained decoding — where you enforce the schema during token generation, not after.

When you sample tokens freely and then fix the output, you're fighting the model's natural distribution. The model wants to say "The total is $42.50" but you force it into {"total": 42.50}. Every retry costs latency and money. With constrained decoding, the model only generates tokens that conform to your schema from the first token onward.

Tools like Outlines and Guidance compile your schema into a finite-state machine or grammar over the tokenizer's vocabulary. At each decoding step, they mask out any token that would break your schema, so the model cannot produce invalid output. This isn't magic: it's a logit mask applied to the vocabulary at each step.
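To make the masking step concrete, here is a toy sketch with a six-token vocabulary and a fake scoring function standing in for a real model. Everything in it (the vocabulary, `fake_logits`, the allowed strings) is invented for illustration; real libraries do the same thing against a full tokenizer vocabulary and a compiled schema.

```python
import math

# Toy vocabulary and the only outputs the "schema" allows.
VOCAB = ["low", "high", "medium", "-risk", "_risk", " maybe"]
ALLOWED = ["low-risk", "medium-risk", "high-risk"]


def fake_logits(prefix: str) -> list[float]:
    """Pretend model: it prefers the invalid '_risk' spelling over '-risk'."""
    scores = {"low": 2.0, "high": 1.0, "medium": 0.5,
              "-risk": 1.0, "_risk": 1.5, " maybe": 0.3}
    return [scores[tok] for tok in VOCAB]


def mask_logits(prefix: str, logits: list[float]) -> list[float]:
    """Set the logit to -inf for every token that cannot lead to an allowed string."""
    masked = []
    for tok, logit in zip(VOCAB, logits):
        candidate = prefix + tok
        ok = any(target.startswith(candidate) for target in ALLOWED)
        masked.append(logit if ok else -math.inf)
    return masked


def constrained_greedy_decode() -> str:
    """Greedy decoding where only schema-compatible tokens can be chosen."""
    out = ""
    while out not in ALLOWED:
        masked = mask_logits(out, fake_logits(out))
        if all(score == -math.inf for score in masked):
            raise RuntimeError("no valid continuation")
        best = max(range(len(VOCAB)), key=lambda i: masked[i])
        out += VOCAB[best]
    return out


print(constrained_greedy_decode())  # low-risk
```

Without the mask, the pretend model's higher score for `_risk` would win the second step; with it, `-risk` is the only token left standing.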

The performance hit is negligible. On a typical JSON schema, constrained decoding adds ~5-10% latency. Compare that to 2-3 retries with prompt engineering, each costing 2x the tokens. The math is clear.

constrained_vs_post.py · PYTHON
import outlines
from pydantic import BaseModel

# Constrained decoding — schema enforced during generation
class Receipt(BaseModel):
    store: str
    total: float

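# Outlines 0.x-style entry points; newer Outlines releases may expose a
# different API, so check the version you have installed.
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
generator = outlines.generate.json(model, Receipt)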

# Only returns valid JSON matching Receipt schema
result = generator("Extract the total from this receipt: Total: $42.50 at Walmart")
print(result)  # Receipt(store='Walmart', total=42.5)

# Compare with post-processing
raw_output = "The total is $42.50 at Walmart"
# Now you need regex, json parsing, error handling... good luck
Output
Receipt(store='Walmart', total=42.5)
Production Trap:
Post-processing pipelines look clean in code reviews. They burn latency and cost in production. I've seen teams add 3 retries, then a caching layer, then feel clever. The root cause was always the same: they generated free text first. Fix the generation, not the parsing.
Key Takeaway
Constrain first, then validate: 100ms of masking beats 2 seconds of retrying.

The Hidden Cost of API Provider "Magic" for Structured Outputs

OpenAI's structured outputs mode and Gemini's response_schema feel like free wins. Define your Pydantic model, pass it to the API, get perfectly formatted JSON back. But this convenience comes with three hidden costs that bite you at scale.

First, vendor lock-in. Switch from GPT-4 to Claude 3.5 and your entire structured output pipeline breaks. There's no standard for schema passing across providers. Every API has its own format, its own limitations, its own pricing quirks. You're not just using an LLM — you're marrying a provider.
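One way to soften the lock-in is to keep a single provider-neutral schema and generate each provider's payload from it. The sketch below is an assumption-heavy illustration: the OpenAI wrapper mirrors the `json_schema` response_format shape used earlier in this article, and the upper-case conversion mirrors the type style in this article's Gemini example. Neither helper is an official SDK function, and dropping `additionalProperties` for the upper-case dialect is my assumption.

```python
import copy

# Provider-neutral schema, written once as plain JSON Schema.
PATIENT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "diagnosis": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "age", "diagnosis"],
    "additionalProperties": False,
}


def to_openai_response_format(name: str, schema: dict) -> dict:
    """Wrap the schema in the json_schema response_format shape."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": copy.deepcopy(schema)},
    }


def to_uppercase_type_schema(schema: dict) -> dict:
    """Convert to the upper-case type style (assumed mapping, hypothetical helper)."""
    out = {}
    for key, value in schema.items():
        if key == "additionalProperties":
            continue  # assumption: not used in the upper-case dialect
        if key == "type":
            out[key] = value.upper()
        elif key == "properties":
            out[key] = {k: to_uppercase_type_schema(v) for k, v in value.items()}
        elif key == "items":
            out[key] = to_uppercase_type_schema(value)
        else:
            out[key] = copy.deepcopy(value)
    return out


openai_payload = to_openai_response_format("patient", PATIENT_SCHEMA)
gemini_style = to_uppercase_type_schema(PATIENT_SCHEMA)
```

The schema lives in one place, so a provider switch touches one adapter function instead of every call site.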

Second, you pay for the magic. OpenAI charges the same token rate whether you use structured outputs or not, so there is no separate line item for the constrained decoding it runs under the hood; the cost is baked into the per-token price. Run the numbers: 10M requests/day at $0.03/1K tokens, assuming roughly 1K tokens per request, means $300K/day. Open-source constrained decoding carries no per-token fee, although you still pay for the GPUs and the people who run them.
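The daily figure above only works out under an assumption about request size. A quick sketch of the arithmetic, where the 1,000 tokens per request is my assumption and not a measured number:

```python
# All inputs are illustrative assumptions, not measured figures.
REQUESTS_PER_DAY = 10_000_000
TOKENS_PER_REQUEST = 1_000      # assumed; the daily total scales linearly with this
PRICE_PER_1K_TOKENS = 0.03      # USD, the rate quoted in the text


def daily_api_cost(requests: int, tokens_per_request: int, price_per_1k: float) -> float:
    """API spend per day in USD."""
    return requests * tokens_per_request / 1_000 * price_per_1k


cost = daily_api_cost(REQUESTS_PER_DAY, TOKENS_PER_REQUEST, PRICE_PER_1K_TOKENS)
print(f"${cost:,.0f}/day")  # $300,000/day
```

Halve the tokens per request and the bill halves too, so measure your real prompt and completion sizes before comparing against self-hosting.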

Third, debugging is opaque. When OpenAI's structured output fails (and it does: ask anyone who's hit the "structured output not supported for this model" error at 2 AM), you have no visibility into the decoding step that caused it. With local constrained decoding, you control every logit mask and can reproduce failures deterministically.

vendor_lockin.py · PYTHON
# Provider-specific — this breaks when you switch providers
import openai
from pydantic import BaseModel

class Patient(BaseModel):
    name: str
    age: int
    diagnosis: list[str]

# OpenAI's way
response = openai.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Extract patient info"}],
    response_format=Patient,
)

# Gemini's way
from google.genai import types
response_schema = {
    "type": "OBJECT",
    "properties": {
        "name": {"type": "STRING"},
        "age": {"type": "INTEGER"},
        "diagnosis": {"type": "ARRAY", "items": {"type": "STRING"}}
    }
}
# Totally different API — zero code reuse
Output
No shared output: the OpenAI call returns a parsed Patient object, while the Gemini schema is a plain dict that still has to be passed to a different client. Switching providers means rewriting your integration layer.
Scale Insight:
At 10M requests/day, per-token API fees are the dominant line item, and self-hosted constrained decoding replaces them with fixed GPU and engineering costs. Even a $273,000/year saving is a senior engineer's salary. Build the abstraction once, own your outputs.
Key Takeaway
API magic is a tax — pay it only when you absolutely cannot run inference yourself.

When Structured Outputs Lie: The Schema Drift Problem

Structured outputs guarantee format, not correctness. This is the most dangerous misconception in production LLM systems. Just because the JSON parses doesn't mean the data is right. Schema drift is when your output conforms to your schema but contains semantically wrong values — and you never notice because the JSON validates.

Real example: A medical intake system that extracts patient age. Schema says {"age": int}. Model outputs {"age": 42}. Valid JSON. But is the patient 42 years old or 42 months old? The model guessed based on context. Your validation passes. Your downstream system treats it as years. A toddler gets an adult dose.

This happens because structured outputs only constrain syntax, not semantics. They can ensure the value is an integer, even one within a range such as 0-150 if your validator checks it, but not that the unit matches the context. The model can confidently hallucinate into your rigid schema.

Mitigation requires two things: post-generation validation hooks that check value ranges against known contexts, and logging every structured output with the raw input for audit trails. Never trust a structured output blindly — verify the semantic bounds before routing to production systems.

semantic_validation.py · PYTHON
from typing import Literal

from pydantic import BaseModel, ValidationError, field_validator

class PatientIntake(BaseModel):
    name: str
    age_value: int
    age_unit: Literal['years', 'months']

    @field_validator('age_unit')
    @classmethod
    def validate_age_unit(cls, v, info):
        # age_value is declared before age_unit, so it is already in info.data
        # (unless it failed its own validation, hence the default of 0)
        raw = info.data.get('age_value', 0)
        if v == 'months' and raw > 24:
            # Suspicious: children over 2 years are normally recorded in years
            raise ValueError(f"age_value {raw} with unit 'months' is suspicious")
        return v

# This passes structured output but fails semantic check
bad_data = {"name": "Baby Doe", "age_value": 42, "age_unit": "months"}
try:
    PatientIntake(**bad_data)
except ValidationError as e:
    # Pydantic wraps the ValueError; print only the first error's message
    print(f"Semantic guardrail triggered: {e.errors()[0]['msg']}")
    # Alert: potential schema drift detected
Output
Semantic guardrail triggered: Value error, age_value 42 with unit 'months' is suspicious
Production Trap:
We shipped a fraud detection system that passed JSON validation for 3 weeks. The age field always said 30 because the model found it was the most common value. Every single extraction was wrong. Schema validation doesn't catch semantic laziness.
Key Takeaway
Valid JSON is not correct data — validate semantics, not just syntax.
Production incident · Post-mortem · Severity: high

The $400k Enum Drift — When Structured Outputs Lie Silently

Symptom
Chargeback rate dropped from 3.2% to 2.1% with no change in fraud patterns — the pipeline was classifying 'high-risk' transactions as 'low-risk' because the LLM output 'low_risk' instead of 'low-risk' (underscore vs hyphen).
Assumption
The team assumed that because JSON.parse() succeeded and all required keys were present, the output was valid for downstream consumption.
Root cause
The prompt used a list of enum values with hyphens ('low-risk', 'medium-risk', 'high-risk'), but the model (gpt-4-turbo-2024-04-09) occasionally returned underscores. The JSON schema validator only checked that 'risk_level' was a string, not that it matched the allowed enum, because the schema didn't define an enum constraint.
Fix
1. Added strict enum constraints to the JSON schema for all categorical fields.
2. Implemented a post-parse validation step that checks all enum values against a canonical list.
3. Added observability: log every invalid enum value with the full raw response.
4. Set up a PagerDuty alert if >1% of responses have invalid enums.
5. Added automatic retry with a prompt that explicitly lists valid values when validation fails.
Key lesson
  • Define enum constraints in your JSON schema — don't rely on prompt engineering alone to enforce controlled vocabularies.
  • Monitor enum violation rates as a leading indicator of model drift or prompt degradation — it catches problems before accuracy metrics do.
  • Log the raw LLM response alongside the parsed structure — you can't debug schema violations without seeing what the model actually sent.
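The post-parse check and the retry prompt from the fix list might look like the sketch below. The function names are hypothetical, and this is one possible shape for the fix, not the incident team's actual code. Note that it rejects `low_risk` outright instead of silently coercing it to `low-risk`.

```python
import json
from typing import Optional, Tuple

CANONICAL_RISK_LEVELS = ("low-risk", "medium-risk", "high-risk")


def check_risk_level(raw_response: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (risk_level, None) when valid, or (None, error message) when not."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return None, "top-level JSON value is not an object"
    value = parsed.get("risk_level")
    if value not in CANONICAL_RISK_LEVELS:
        # Reject near-misses such as 'low_risk' so they show up in the logs.
        return None, f"invalid risk_level {value!r}"
    return value, None


def build_retry_prompt(original_prompt: str, error: str) -> str:
    """Append the failure reason and the exact allowed values to the prompt."""
    allowed = ", ".join(f'"{v}"' for v in CANONICAL_RISK_LEVELS)
    return (f"{original_prompt}\n\nYour previous answer was rejected ({error}). "
            f"risk_level must be exactly one of: {allowed}.")


value, error = check_risk_level('{"risk_level": "low_risk"}')
if error:
    retry_prompt = build_retry_prompt("Classify this transaction", error)
    print(retry_prompt)
```

Rejecting instead of normalizing is a deliberate choice here: a silent fix would hide the drift that the violation-rate alert is supposed to surface.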
Production debug guide
When schema validation passes but the output is still wrong at 2am.
Symptom · 01
LLM returns valid JSON but missing a required field
Fix
Check if your schema uses 'required' array. Run: python -c "import json; schema = json.load(open('schema.json')); print(schema.get('required', 'NO REQUIRED FIELDS DEFINED'))"
Symptom · 02
Field values don't match expected enums
Fix
Enable debug logging for raw response: add 'logger.debug(f"Raw response: {response}")' before parsing. Compare actual values against your enum list.
Symptom · 03
LLM returns empty or truncated JSON
Fix
Check token limit: compare prompt_tokens + max_tokens against model's context window. Use tiktoken to count: python -c "import tiktoken; enc = tiktoken.encoding_for_model('gpt-4'); print(len(enc.encode('your prompt here')))"
Symptom · 04
Structured output is valid but semantically wrong
Fix
A/B test with a simpler prompt. Create a minimal version that only asks for the field in question. If it works, your prompt is too complex or has conflicting instructions.
Structured Outputs with LLMs Triage Cheat Sheet
Copy-paste diagnostics. When it's 2am and you need answers fast.
JSON parse error
Immediate action
Check if the response starts/ends with valid JSON delimiters
Commands
python -c "import json, sys; data = sys.stdin.read(); json.loads(data); print('VALID')" < raw_response.json
python -c "import json; print(json.dumps(json.loads(open('raw_response.json').read()), indent=2))"
Fix now
Strip the Markdown code fence before parsing: fence = '`' * 3; response = response.strip().removeprefix(fence + 'json').removesuffix(fence).strip()
Missing fields
Immediate action
Verify your schema 'required' array matches prompt instructions
Commands
python -c "import json; s=json.load(open('schema.json')); print([k for k in s.get('required',[]) if k not in json.load(open('response.json'))])"
python -c "import json; print('Keys in response:', list(json.load(open('response.json')).keys()))"
Fix now
Add explicit instruction: 'You MUST include ALL required fields. Do not omit any.'
Enum violation
Immediate action
Extract all enum values from response and compare to allowed list
Commands
python -c "import json; r=json.load(open('response.json')); print({k:v for k,v in r.items() if isinstance(v,str) and v not in {'allowed1','allowed2','allowed3'}})"
python -c "import json; print(json.dumps(json.load(open('response.json')), default=str))"
Fix now
Add enum constraint to JSON schema: '"field_name": {"type": "string", "enum": ["value1", "value2"]}'
Truncated response
Immediate action
Check token usage vs model limit
Commands
python -c "import tiktoken; enc=tiktoken.encoding_for_model('gpt-4'); prompt=open('prompt.txt').read(); print('Prompt tokens:', len(enc.encode(prompt)))"
python -c "import json; print('Response length:', len(open('response.json').read()))"
Fix now
Raise max_tokens so the completion has room to finish, or shorten the prompt or switch to a model with a larger context window if prompt plus completion exceeds the limit.
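When you only have stored raw responses to work with, a parser-based heuristic can separate truncated JSON from otherwise malformed JSON. This is a sketch under that assumption; `looks_truncated` is a hypothetical helper, and where your provider reports a finish reason, that is the more direct signal.

```python
import json


def looks_truncated(raw: str) -> bool:
    """Heuristic: did the JSON parser run out of input before the value ended?"""
    text = raw.strip()
    if not text:
        return True
    try:
        json.loads(text)
        return False
    except json.JSONDecodeError as exc:
        # Either the parser hit the end of the text while still expecting more,
        # or a string was opened and never closed (it then runs to the end).
        return exc.pos >= len(text) or exc.msg.startswith("Unterminated string")


print(looks_truncated('{"risk_level": "med'))   # True: cut off mid-string
print(looks_truncated('{"risk_level": "low"'))  # True: missing closing brace
print(looks_truncated('{"risk_level": "low"}')) # False: complete
print(looks_truncated('{"risk_level" "low"}'))  # False: malformed, not truncated
```

Counting the two cases separately tells you whether to raise the token budget or fix the prompt.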
Structured Outputs vs Function Calling vs Prompt Engineering
| Concern | Structured Outputs | Function Calling | Prompt Engineering | Recommendation |
| --- | --- | --- | --- | --- |
| JSON validity guarantee | ~99.9% with constrained decoding | ~99% (tool use can still hallucinate args) | ~80-95% (depends on prompt quality) | Structured outputs for fixed schemas |
| Latency overhead | 10-30% increase | 20-50% increase (tool selection step) | 0% (but retries add latency) | Structured outputs for speed |
| Schema flexibility | Fixed schema per call | Dynamic tool selection | Unlimited (but unreliable) | Function calling for dynamic needs |
| Cost per request | Low (no extra tokens for tool definitions) | Higher (tool definitions in context) | Lowest (no extra tokens) | Structured outputs for cost |
| Ease of debugging | Hard (black-box token masking) | Medium (tool call logs) | Easy (full prompt visible) | Prompt engineering for debugging |
| Production failure rate | <0.1% | ~1% | 5-20% | Structured outputs for reliability |

Key takeaways

1
Structured outputs use constrained decoding (logit masking) to force the LLM to generate valid JSON, but only if you use a provider that supports it natively; prompt engineering alone is not reliable.
2
Always validate structured outputs against your schema server-side after generation: LLMs can produce valid JSON that violates enum constraints or type requirements due to tokenizer quirks.
3
For high-throughput pipelines (10M+ req/day), batch structured output requests and use schema caching to avoid re-parsing the same JSON schema on every call.
4
Never use structured outputs for open-ended generation (e.g., creative writing): the constraints degrade output quality and increase latency by 30-50%.
5
Log the token log-probabilities (logprobs) for structured output fields in production, where your provider exposes them: a sudden drop in probability for a specific enum value is your canary for schema drift or model updates.
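The canary in the last takeaway could be sketched as a comparison between a stored baseline and recent samples. This assumes you have already collected one log-probability per emitted enum value; how you obtain and aggregate those from your provider is not shown, and the function names and the 0.2 threshold are illustrative.

```python
import math
from collections import defaultdict
from statistics import mean


def mean_probability_by_value(samples: list[tuple[str, float]]) -> dict[str, float]:
    """samples are (enum_value, logprob) pairs; returns the mean probability per value."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for value, logprob in samples:
        grouped[value].append(math.exp(logprob))
    return {value: mean(probs) for value, probs in grouped.items()}


def drifted_values(baseline: dict[str, float], recent: dict[str, float],
                   max_drop: float = 0.2) -> list[str]:
    """Enum values whose mean probability fell by more than max_drop (absolute)."""
    return sorted(v for v, p in recent.items()
                  if v in baseline and baseline[v] - p > max_drop)


baseline = {"high-risk": 0.95, "low-risk": 0.93}
recent = mean_probability_by_value([
    ("high-risk", math.log(0.6)),
    ("high-risk", math.log(0.7)),
    ("low-risk", math.log(0.92)),
])
print(drifted_values(baseline, recent))  # ['high-risk']
```

A falling probability with an unchanged output is exactly the case that schema validation cannot see.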

Common mistakes to avoid

4 patterns

Relying on prompt engineering alone for JSON output

Symptom
LLM returns valid JSON but with extra fields, missing fields, or wrong types — pipeline silently processes garbage.
Fix
Switch to a provider that supports constrained decoding (e.g., OpenAI structured outputs, Anthropic tool use, or local guidance/outlines library). Never trust 'return JSON' in the system prompt.

Not validating enum values against the schema after generation

Symptom
LLM outputs 'fraud_score: 0.5' but schema expects enum 'low/medium/high' — pipeline uses 0.5 as a valid score, corrupting downstream models.
Fix
Run a JSON schema validator (e.g., jsonschema Python library) on every output. Reject or fallback to a default enum value if validation fails. Log the violation immediately.

Assuming structured outputs are deterministic

Symptom
Same input produces different enum values across requests — A/B test results are inconsistent.
Fix
Set temperature=0 and seed parameter if available. Even then, constrained decoding can produce different tokens due to floating-point non-determinism. Cache outputs for identical inputs if reproducibility is critical.

Not handling schema evolution in production

Symptom
Adding a new enum value to the schema causes older model versions to fail silently — pipeline throughput drops without alerts.
Fix
Version your schemas and pin model versions. Deploy schema changes with a canary rollout. Monitor the 'schema_validation_failure' metric per model version.
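The caching advice under "Assuming structured outputs are deterministic" can be combined with the versioning advice by putting the model and schema version into the cache key. A minimal in-memory sketch follows; `OutputCache` and `fake_generate` are hypothetical stand-ins, and a production system would more likely use a shared store with expiry.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(model: str, schema_version: str, prompt: str) -> str:
    """Identical (model, schema version, prompt) triples map to the same key."""
    payload = json.dumps(
        {"model": model, "schema_version": schema_version, "prompt": prompt},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class OutputCache:
    """Returns the first validated output for an input instead of regenerating it."""

    def __init__(self, generate: Callable[[str], dict]):
        self._generate = generate
        self._store: Dict[str, dict] = {}

    def get(self, model: str, schema_version: str, prompt: str) -> dict:
        key = cache_key(model, schema_version, prompt)
        if key not in self._store:
            self._store[key] = self._generate(prompt)
        return self._store[key]


calls = []


def fake_generate(prompt: str) -> dict:
    """Stand-in for the real LLM call; records how often it is invoked."""
    calls.append(prompt)
    return {"risk_level": "medium"}


cache = OutputCache(fake_generate)
first = cache.get("gpt-4o-2024-08-06", "v2.1", "Classify this transaction")
second = cache.get("gpt-4o-2024-08-06", "v2.1", "Classify this transaction")
third = cache.get("gpt-4o-2024-08-06", "v2.2", "Classify this transaction")
print(len(calls))  # 2: the schema version bump forces a fresh generation
```

Keying on the schema version means a schema change never serves outputs that were validated against the old contract.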
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain how constrained decoding works for structured outputs. How does ...
Q02 · SENIOR
Your fraud pipeline uses structured outputs with an enum field 'risk_lev...
Q03 · SENIOR
How would you design a monitoring system for structured output quality a...
Q04 · SENIOR
Compare structured outputs, function calling, and prompt engineering for...
Q05 · SENIOR
What happens if the LLM's tokenizer splits a JSON key across multiple to...
Q01 of 05 · SENIOR

Explain how constrained decoding works for structured outputs. How does it differ from post-hoc validation?

ANSWER
Constrained decoding masks the logits of tokens that would produce invalid JSON at each generation step, ensuring the output is schema-compliant by construction. Post-hoc validation runs a JSON parser after generation and rejects invalid outputs. Constrained decoding is more reliable but requires provider support; post-hoc is simpler but can have high rejection rates (5-20%) with prompt-only approaches. In production, use both: constrained decoding for generation, then validate for edge cases.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
How do structured outputs work under the hood in LLMs?
02
Can structured outputs guarantee 100% valid JSON?
03
What's the latency impact of structured outputs vs freeform text?
04
How do I debug a structured output that fails validation?
05
When should I use function calling instead of structured outputs?