Senior 5 min · May 22, 2026

Agent Orchestration Patterns — The $4k Mistake We Made With LLM-Controlled Handoffs

Learn agent orchestration patterns from a production incident that caused a $4k token bill spike.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Sequential Orchestration: Agents execute in a fixed pipeline. Use when subtasks have strict dependencies. We saw a 23% accuracy drop when we assumed parallel execution was safe.
  • Handoff Orchestration: A triage agent routes to specialists. Watch for runaway loops; our payment service hit 800ms p99 because handoffs had no timeout limits.
  • Tool-Based Orchestration: A manager agent controls subtasks via Agent.as_tool(). Better for bounded subtasks where the manager must retain context. Avoid if the subtask needs independent reasoning.
  • LLM-Controlled Orchestration: The LLM decides the agent flow. Flexible but unpredictable. We learned the hard way that without guardrails, token costs can spike 10x in an hour.
  • Code-Controlled Orchestration: The flow is hardcoded. Predictable and debuggable. Use when the task structure is known at design time, like a multi-step data pipeline.
  • Hybrid Orchestration: A mix of LLM and code control. Best for complex workflows where some steps are fixed and others need dynamic routing. Set iteration limits to prevent infinite loops.
✦ Definition~90s read
What Are Agent Orchestration Patterns?

Agent orchestration is the architectural pattern governing how multiple LLM-powered agents coordinate to complete complex tasks, specifically how control flow and context are passed between agents. Unlike simple single-call LLM patterns, orchestration introduces routing logic — deciding which agent handles which subtask, when to hand off control, and how to merge results.


The critical distinction is between handoff-based orchestration (where Agent A decides to pass control to Agent B, often via function calling or special tokens) and tool-based orchestration (where a deterministic orchestrator selects tools/agents based on predefined rules or a router LLM). The $4k mistake referenced in the article comes from treating LLM-controlled handoffs as a default pattern, when in practice they introduce unpredictable latency, context window bloat from passing entire conversation histories, and cascading failures when the handoff decision itself is hallucinated.

In production systems handling 1000+ requests/minute, you typically want deterministic orchestration with guardrails — think a lightweight router (e.g., a fine-tuned classifier or rules engine) that selects specialized agents, each with bounded context and clear exit criteria. Handoffs work for exploratory or creative workflows where the path is unknown, but for anything transactional or latency-sensitive, they're a liability.

The ecosystem includes frameworks like LangGraph (graph-based orchestration), CrewAI (role-based handoffs), and custom solutions using state machines or DAGs — each with tradeoffs in observability, cost, and failure modes.

[Architecture diagram: (1) User Task: complex goal; (2) Orchestrator: LangGraph / CrewAI; (3) Agent A: research / retrieval; (4) Agent B: code / execution; (5) Agent C: review / critique; (6) Aggregator: merge + return]
Plain-English First

Think of agent orchestration like a restaurant kitchen. You can have a single chef (single agent) who does everything, or a head chef who delegates tasks to specialized line cooks (orchestrator with handoffs). The problem is when the head chef asks a cook to 'make something Italian' without specifying the dish — you get a confusing mess and wasted ingredients. That's what happens when you let the LLM control the flow without clear boundaries.

We were running a multi-agent recommendation engine serving 2M requests per day. The system used a triage agent to route user queries to specialist agents: one for product search, one for inventory, one for pricing. It worked great in staging. In production, our token costs jumped from $200/day to $4,200 in a single afternoon. The triage agent was handing off to itself in a loop, generating 12,000 tokens per request. We had no timeout on handoffs, no iteration limit, and no monitoring on agent routing decisions.

Most tutorials on agent orchestration show you clean examples with two agents and a simple handoff. They don't tell you what happens when the LLM decides to hand off to the wrong agent, or when it enters a loop because the prompt says 'use the most appropriate agent' without constraints. They also skip the cost implications — every handoff is a full LLM call, and if the routing logic is flawed, you're burning money on garbage.

This article covers the three core orchestration patterns — sequential, handoff-based, and tool-based — with production code examples. We'll walk through the incident that cost us $4k, show you how to debug agent routing in production, and give you a cheat sheet for triaging common failures. By the end, you'll know which pattern to use and, more importantly, when to avoid LLM-controlled flows entirely.

How Agent Orchestration Actually Works Under the Hood

Agent orchestration isn't magic; it's a series of LLM calls strung together by a runtime. When you use openai-agents-python, the Runner class manages the agent loop. Each agent call goes through: 1) system prompt injection, 2) user message, 3) LLM response parsing, 4) tool execution or handoff. On a handoff, the target agent takes over the conversation and, by default, receives the prior conversation history unless an input filter trims it. When you chain separate Runner.run calls yourself, the next agent sees only the input you pass it.

What the docs don't tell you: every handoff costs a full LLM call for the source agent to generate the handoff decision, plus another call for the target agent to start. That's 2 LLM calls per handoff. In our production system, a single request with 5 handoffs cost 10 LLM calls. At $0.01 per call (gpt-4o-mini), that's $0.10 per request. At 1000 requests/minute, that's $100/minute in token costs alone.
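The arithmetic in the paragraph above can be checked in a few lines. This is a back-of-the-envelope sketch: the function name and the flat per-call price are illustrative assumptions taken from the figures quoted here, not real pricing logic.

```python
def handoff_cost_per_request(
    handoffs: int, price_per_call: float, calls_per_handoff: int = 2
) -> float:
    """Estimated LLM spend for one request, assuming a flat price per call."""
    return handoffs * calls_per_handoff * price_per_call


per_request = handoff_cost_per_request(handoffs=5, price_per_call=0.01)
per_minute = per_request * 1000  # 1000 requests/minute

print(f"per request: ${per_request:.2f}")
print(f"per minute:  ${per_minute:.2f}")
```

Real cost depends on tokens per call rather than a flat price, so treat the result as an order-of-magnitude check.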

The abstraction hides the state management. When you hand off, the source agent stops running and the target agent takes over. Anything the target needs that is not in the message history it receives (a schema version, parsed IDs, intermediate results, or the user's original query if the history was filtered) must be passed explicitly. Many teams forget this, and the specialist agent starts from scratch, producing irrelevant answers.

orchestration_internals.pyPYTHON
from openai import OpenAI
from agents import Agent, Runner, function_tool

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: Define a simple sequential pipeline
@function_tool
def extract_keywords(text: str) -> list[str]:
    """Extract keywords from user query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Extract keywords from: {text}"}]
    )
    content = response.choices[0].message.content or ""
    # Strip whitespace so "red, dress" becomes ["red", "dress"]
    return [kw.strip() for kw in content.split(",") if kw.strip()]

@function_tool
def search_products(keywords: str) -> str:
    """Stub search tool; replace with a real catalog lookup."""
    return f"Searching for {keywords}"

keyword_agent = Agent(
    name="keyword_extractor",
    instructions="Extract keywords from the user query. Return as comma-separated.",
    tools=[extract_keywords]
)

search_agent = Agent(
    name="search",
    instructions="Search for products matching the keywords.",
    tools=[search_products]
)

# Sequential orchestration: keyword_agent -> search_agent
# Runner.run() is a coroutine; Runner.run_sync() is the blocking variant for scripts.
# Both return a RunResult with the final agent's output.
result = Runner.run_sync(keyword_agent, "Find me a red dress")
keywords = result.final_output  # e.g., "red, dress"

result2 = Runner.run_sync(search_agent, keywords)
print(result2.final_output)
Token Budgets Are Not Optional
Every LLM call consumes tokens. In a multi-agent setup, a single request can trigger 5-10 calls. Cap per-request spend with a token budget enforced around the Runner: add up the token usage of each call and abort the run once the cap is exceeded. We use 10K tokens as a hard limit for internal tools.
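One way to enforce a per-request cap is a small counter that every LLM call reports into. This is a minimal sketch: `TokenBudget` and `TokenBudgetExceeded` are hypothetical names, and the token counts would come from whatever usage data your client returns.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a request spends more tokens than it is allowed."""


class TokenBudget:
    def __init__(self, limit: int = 10_000):
        self.limit = limit
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record one LLM call and stop the request once the cap is crossed."""
        self.used += input_tokens + output_tokens
        if self.used > self.limit:
            raise TokenBudgetExceeded(f"used {self.used} of {self.limit} tokens")


budget = TokenBudget(limit=10_000)
budget.charge(input_tokens=3_000, output_tokens=1_000)  # 4,000 used so far
budget.charge(input_tokens=4_000, output_tokens=1_500)  # 9,500 used so far
try:
    budget.charge(input_tokens=800, output_tokens=200)  # 10,500 crosses the cap
except TokenBudgetExceeded as exc:
    print(f"fallback response: {exc}")
```

Create one budget object per request so a single rogue request cannot consume the allowance of others.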
Production Insight
A recommendation engine serving 2M req/day started returning stale results after a schema migration. The handoff context included the old field names, so the specialist agent couldn't find products. We were passing the entire user query as context but not the updated schema. The fix: always pass a 'context' dict with explicit keys, and validate it at the start of each agent call.
Key Takeaway
Agent orchestration is a state machine. Every handoff is a state transition. Log the state and validate it. Don't assume the LLM will preserve context — pass it explicitly.
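Validating the handoff state before each agent call might look like the sketch below. `HandoffContext`, `validate_context`, and the schema version number are hypothetical names chosen for illustration; the point is only that stale or missing fields fail loudly before any LLM call is made.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HandoffContext:
    original_query: str
    schema_version: int
    intermediate: dict = field(default_factory=dict)


REQUIRED_SCHEMA_VERSION = 2  # hypothetical: bumped by a schema migration


def validate_context(ctx: HandoffContext) -> None:
    """Fail fast at the start of an agent call instead of letting the LLM guess."""
    if not ctx.original_query.strip():
        raise ValueError("original_query is empty")
    if ctx.schema_version != REQUIRED_SCHEMA_VERSION:
        raise ValueError(
            f"context uses schema v{ctx.schema_version}, "
            f"expected v{REQUIRED_SCHEMA_VERSION}"
        )


ctx = HandoffContext(
    original_query="red dress",
    schema_version=2,
    intermediate={"keywords": ["red", "dress"]},
)
validate_context(ctx)  # passes silently
```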

Practical Implementation: Sequential Orchestration with Guardrails

Sequential orchestration is the simplest pattern: agent A runs, then agent B, then agent C. No handoffs, no routing decisions. Use this when the subtasks have a fixed order and each step depends on the previous one. The key production concern is error propagation — if agent A fails, the whole pipeline stops.

We wrap each agent call in a try-except with a fallback response, and we set a timeout per call. A single agent can hang if the LLM decides to think for 30 seconds. We wrap each Runner.run call in asyncio.wait_for with a 10-second limit.

sequential_orchestration.pyPYTHON
import asyncio
from agents import Agent, Runner

AGENT_TIMEOUT_S = 10  # wall-clock limit per agent call

# Define three agents in a pipeline
validate_agent = Agent(
    name="validator",
    instructions="Check if the input is valid. Return 'VALID' or 'INVALID'."
)

process_agent = Agent(
    name="processor",
    instructions="Process the validated input and return a summary."
)

report_agent = Agent(
    name="reporter",
    instructions="Generate a final report from the processed data."
)

async def run_step(agent: Agent, text: str) -> str:
    # The timeout is enforced with asyncio.wait_for rather than a Runner argument
    result = await asyncio.wait_for(Runner.run(agent, text), timeout=AGENT_TIMEOUT_S)
    return str(result.final_output)

async def sequential_pipeline(user_input: str) -> str:
    try:
        # Step 1: Validate
        verdict = await run_step(validate_agent, user_input)
        if verdict.strip() != "VALID":
            return "Invalid input. Please try again."

        # Step 2: Process
        processed = await run_step(process_agent, user_input)

        # Step 3: Report
        return await run_step(report_agent, processed)
    except asyncio.TimeoutError:
        return "Pipeline timed out. Check agent responsiveness."
    except Exception as e:
        return f"Pipeline error: {str(e)}"
Parallelize Independent Steps
If some steps don't depend on each other, run them in parallel using asyncio.gather. We cut latency from 3s to 1.2s on a 3-step pipeline by parallelizing validation and preprocessing.
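A minimal sketch of that parallelization, using stand-in coroutines in place of real agent calls. The function names and the sleep durations are assumptions for illustration.

```python
import asyncio


async def validate(text: str) -> bool:
    await asyncio.sleep(0.05)  # stand-in for an LLM call
    return bool(text.strip())


async def preprocess(text: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for an LLM call
    return text.strip().lower()


async def run_parallel(text: str) -> str:
    # Both steps read only the raw input, so neither has to wait for the other.
    is_valid, cleaned = await asyncio.gather(validate(text), preprocess(text))
    if not is_valid:
        return "Invalid input. Please try again."
    return cleaned


print(asyncio.run(run_parallel("  Red Dress ")))
```

The trade-off is that preprocessing is paid for even when validation fails, so this only helps when most inputs are valid.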
Production Insight
A fraud detection pipeline used sequential orchestration with 5 agents. One agent had a bug that returned an empty string on certain inputs, causing all downstream agents to fail. We added a schema validation step after each agent: check that the output matches the expected type and length. This caught 90% of silent failures.
Key Takeaway
Sequential orchestration is predictable but brittle. Add timeouts, fallbacks, and output validation at each step. Treat each agent call like a remote API call.
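The per-step output check could be as small as the sketch below. The function name and the 4,000-character limit are assumptions; a real pipeline would likely validate against a richer schema.

```python
def validate_step_output(step: str, output: object, max_chars: int = 4_000) -> str:
    """Return the output as a clean string, or raise so the pipeline can fall back."""
    if not isinstance(output, str):
        raise TypeError(f"{step}: expected str, got {type(output).__name__}")
    cleaned = output.strip()
    if not cleaned:
        raise ValueError(f"{step}: empty output")
    if len(cleaned) > max_chars:
        raise ValueError(f"{step}: output is {len(cleaned)} chars, limit is {max_chars}")
    return cleaned


summary = validate_step_output("processor", "  3 items in stock  ")
print(summary)
```

Calling this after every agent turns an empty string into an immediate, attributable error instead of a silent downstream failure.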

When NOT to Use Agent Orchestration

Don't use multi-agent orchestration if a single agent can do the job. We see teams adding handoffs because it sounds cool, not because they need it. The rule of thumb: if you can write a single prompt that handles all cases, use a single agent. Multi-agent adds latency, cost, and debugging complexity.

Specific anti-patterns
  • Using handoffs for simple classification (e.g., 'is this email spam?'). A single LLM call is cheaper and faster.
  • Using tool-based orchestration when the subtask is trivial (e.g., 'add 2+2'). Use a function tool instead.
  • Using LLM-controlled routing when the flow is fixed (e.g., always validate, then process, then report). Use sequential orchestration.
when_not_to_orchestrate.pyPYTHON
from agents import Agent, Runner

email = "Congratulations! You won a free cruise. Click here to claim."  # sample input

# Bad: Multi-agent for a simple classification
# This costs 2 LLM calls instead of 1
classifier_agent = Agent(
    name="classifier",
    instructions="Classify the email as spam or not spam."
)

report_agent = Agent(
    name="reporter",
    instructions="Report the classification."
)

# This is wasteful (run_sync is the blocking variant of the async Runner.run)
result = Runner.run_sync(classifier_agent, email)
classification = result.final_output
result2 = Runner.run_sync(report_agent, classification)

# Good: Single agent with a clear prompt
single_agent = Agent(
    name="spam_classifier",
    instructions="Classify the email. Return 'SPAM' or 'NOT SPAM'."
)
result = Runner.run_sync(single_agent, email)
print(result.final_output)
The 'Cool Factor' Trap
We've seen teams add handoffs because they wanted to use the latest SDK feature. The result: 3x latency and 5x cost for the same accuracy. Start with a single agent. Add complexity only when you have a measurable reason.
Production Insight
A customer support chatbot used 4 agents: triage, billing, technical, and feedback. The triage agent misrouted 12% of queries to the wrong specialist. We replaced the triage agent with a simple keyword-based router and cut misrouting to 2%. The LLM was overkill for routing.
Key Takeaway
Start simple. Add agents only when you have a concrete bottleneck that a single agent can't solve. Measure latency and cost before and after each addition.
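A keyword-based router of the kind described in the Production Insight above can be sketched in a dozen lines. The keyword lists and the fallback route are hypothetical; real lists would come from your own query logs.

```python
ROUTES = {
    "billing": ("invoice", "refund", "charge", "payment"),
    "technical": ("error", "crash", "bug", "login"),
    "feedback": ("suggestion", "feedback", "complaint"),
}
DEFAULT_ROUTE = "technical"  # hypothetical fallback when nothing matches


def route(query: str) -> str:
    """Pick the specialist whose keywords appear most often in the query."""
    text = query.lower()
    scores = {
        name: sum(word in text for word in words) for name, words in ROUTES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else DEFAULT_ROUTE


print(route("I need a refund for a double charge"))
```

Routing this way costs no LLM call, is fully deterministic, and can be unit-tested against a labeled set of queries.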

Production Patterns & Scale: Handling 1000 Requests/Minute

At scale, agent orchestration patterns break in predictable ways. The most common issues: token rate limits, LLM timeouts, and context window overflow. Here's how we handle each:

  • Token rate limits: Use a token bucket per agent, plus an asyncio.Semaphore to limit concurrent LLM calls. For gpt-4o-mini, we allow 10 concurrent calls per agent.
  • LLM timeouts: Wrap each Runner.run call in asyncio.wait_for with a 5-second timeout. If an agent takes longer, retry once, then fall back to a cached response.
  • Context window overflow: Agents accumulate conversation history. After 3 handoffs, the context can exceed 128K tokens. We truncate history to the last 5 messages before each handoff.
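The history truncation from the last bullet might be sketched as follows. The message shape (dicts with `role` and `content`) and the choice to always keep system messages are assumptions layered on top of the "last 5 messages" rule.

```python
def truncate_history(messages: list[dict], keep_last: int = 5) -> list[dict]:
    """Keep system messages plus the most recent `keep_last` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-keep_last:] if keep_last > 0 else []
    return system + recent


history = [{"role": "system", "content": "You are the pricing agent."}] + [
    {"role": "user", "content": f"message {i}"} for i in range(8)
]
trimmed = truncate_history(history)
print(len(trimmed))  # 1 system message + 5 recent messages
```

Dropping old turns loses information, so pairing the window with a short summary of earlier context is a common refinement.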

We also use a circuit breaker pattern: if an agent fails 3 times in 1 minute, stop routing to it and return a default response.

production_scaling.pyPYTHON
import asyncio
import time
from agents import Agent, Runner

# Token bucket for rate limiting
class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        # time.monotonic() works at import time; no running event loop is needed
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now

    async def acquire(self) -> None:
        # Loop instead of recursing so long waits cannot grow the call stack
        while True:
            self._refill()
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for one token to accumulate
            await asyncio.sleep((1 - self.tokens) / self.rate)

bucket = TokenBucket(rate=10, capacity=10)  # 10 calls per second

async def safe_agent_call(agent: Agent, input: str) -> str:
    await bucket.acquire()
    try:
        result = await asyncio.wait_for(
            Runner.run(agent, input),
            timeout=5.0
        )
        return str(result.final_output)
    except asyncio.TimeoutError:
        return "Agent timed out. Please try again."
    except Exception as e:
        return f"Agent error: {str(e)}"
Circuit Breaker Implementation
Track agent failure rates in Redis with a 1-minute TTL. If an agent fails 3 times, set a key 'agent:blocked:<name>' and return a cached response for 5 minutes. Reset the key after the cooldown.
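The same logic can be sketched in memory, with plain dicts standing in for the Redis keys and TTLs. `CircuitBreaker` and its method names are hypothetical; the thresholds (3 failures in 1 minute, 5-minute cooldown) follow the description above.

```python
import time
from typing import Optional


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, window_s: float = 60.0,
                 cooldown_s: float = 300.0):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.failures: dict[str, list[float]] = {}   # agent -> failure timestamps
        self.blocked_until: dict[str, float] = {}    # agent -> end of cooldown

    def record_failure(self, agent: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        # Keep only failures inside the rolling window, then add this one
        recent = [t for t in self.failures.get(agent, []) if now - t < self.window_s]
        recent.append(now)
        self.failures[agent] = recent
        if len(recent) >= self.max_failures:
            self.blocked_until[agent] = now + self.cooldown_s
            self.failures[agent] = []

    def is_open(self, agent: str, now: Optional[float] = None) -> bool:
        """True while the agent is blocked and callers should use a fallback."""
        now = time.monotonic() if now is None else now
        return now < self.blocked_until.get(agent, 0.0)


breaker = CircuitBreaker()
for t in (0.0, 10.0, 20.0):  # three failures within one minute
    breaker.record_failure("billing", now=t)
print(breaker.is_open("billing", now=21.0))   # inside the cooldown
print(breaker.is_open("billing", now=321.0))  # cooldown has elapsed
```

An in-process breaker only protects one worker; a shared store such as Redis is what makes the block visible to every instance.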
Production Insight
At 1000 req/min, our token bucket with rate=10 was too aggressive. Agents queued up and requests timed out. We increased to rate=50 and added a queue with a max size of 100. Requests beyond that get a 429 response. This stabilized P99 latency at 1.2s.
Key Takeaway
Rate limiting is not optional. Use a token bucket per agent, set timeouts, and implement a circuit breaker. Monitor agent-specific error rates, not just overall system health.

Common Mistakes with Specific Examples

  1. Forgetting to pass context on handoff. The specialist agent gets no history and produces generic answers. Fix: always pass a context dict with the user's original query and any intermediate results.
  2. Not capping handoff depth. The LLM can keep routing until something stops it. Fix: cap every run, for example with max_turns on Runner.run, and count handoffs per request so a loop is rejected early.
  3. Using the same prompt for all agents. Each agent needs a focused prompt. A generic prompt leads to role confusion and wrong outputs. Fix: write a specific prompt for each agent, including what it should NOT do.
common_mistakes.pyPYTHON
from agents import Agent, Runner

# Mistake 1: No context passed between chained agents
agent_a = Agent(name="a", instructions="Extract keywords.")
agent_b = Agent(name="b", instructions="Search for products.")

query = "red dress"
result = Runner.run_sync(agent_a, query)
# B only sees keywords, not the original intent
result2 = Runner.run_sync(agent_b, result.final_output)

# Fix: Pass the original query explicitly in B's input.
# Runner's context= argument is a local object for tools and hooks and is not
# sent to the model, so anything the LLM must see belongs in the input.
result2 = Runner.run_sync(
    agent_b,
    f"Original query: {query}\nKeywords: {result.final_output}",
)

# Mistake 2: No cap on the run loop
triage = Agent(name="triage", instructions="Route to the best agent.", handoffs=[agent_b])
# If routing keeps bouncing, every extra turn is another paid LLM call

# Fix: Cap turns per run; exceeding the cap raises MaxTurnsExceeded
result3 = Runner.run_sync(triage, query, max_turns=3)

# Mistake 3: Generic prompt
agent = Agent(name="helper", instructions="Help the user.")
# This agent might try to do everything and do nothing well

# Fix: Specific prompt
agent = Agent(name="billing", instructions="You handle billing questions only. If asked about anything else, say 'I cannot help with that.'")
The 'Helpful' Agent Trap
LLMs are trained to be helpful. If your prompt says 'help the user', the agent will try to answer anything, even if it's out of scope. Always include a 'do not' clause in the prompt.
Production Insight
A team used a single prompt for all agents: 'You are a helpful assistant.' The billing agent started answering technical support questions, giving wrong answers. Users complained about incorrect billing info. The fix: separate prompts with explicit scope boundaries.
Key Takeaway
Each agent needs a focused prompt with clear boundaries. Include what the agent should NOT do. Test with out-of-scope queries.

Comparison: Handoff vs Tool-Based Orchestration

The two main patterns in the openai-agents-python SDK are handoffs and agents-as-tools. Here's the production tradeoff:

  • Handoffs: The specialist agent takes over the conversation. Good for when the specialist needs to respond directly to the user. Bad for when the manager needs to combine multiple specialist outputs — the manager loses context.
  • Agents-as-tools: The manager calls a specialist via Agent.as_tool(), gets the result, and keeps control. Good for bounded subtasks where the manager needs to aggregate. Bad for long-running specialists that need to maintain state.

We use handoffs for routing (e.g., triage -> billing) and agents-as-tools for data enrichment (e.g., manager calls a summarizer tool, then a formatter tool).

handoff_vs_tool.pyPYTHON
from agents import Agent, Runner

# Handoff pattern: specialist takes over
specialist = Agent(
    name="billing",
    instructions="Answer billing questions.",
    handoffs=[]  # no further handoffs
)

triage = Agent(
    name="triage",
    instructions="Route billing questions to the billing agent.",
    handoffs=[specialist]
)

# Tool pattern: manager keeps control
summarizer = Agent(
    name="summarizer",
    instructions="Summarize the text."
)

manager = Agent(
    name="manager",
    instructions="You are a manager. Use the summarizer tool to summarize user input.",
    tools=[summarizer.as_tool(
        tool_name="summarize_text",
        tool_description="Summarize a given text."
    )]
)

# Usage: Runner.run() is a coroutine, so a plain script uses the blocking run_sync()
result = Runner.run_sync(manager, "This is a long text to summarize...")
print(result.final_output)  # Manager's final answer, not the summarizer's
When to Use Each Pattern
Use handoffs when the specialist should respond directly to the user. Use agents-as-tools when the manager needs to combine multiple outputs. We use handoffs for external-facing agents and tools for internal processing.
Production Insight
We had a manager agent that used handoffs to three specialists: search, pricing, and inventory. The manager lost context after the first handoff and couldn't combine results. We switched to agents-as-tools, and the manager aggregated all three outputs into a single response. Latency went from 4s to 2.5s because we parallelized the tool calls.
Key Takeaway
Choose the pattern based on who should own the final response. Handoffs give control to the specialist. Tools keep control with the manager. Measure latency and context retention in your specific use case.

Debugging and Monitoring Agent Orchestration in Production

You can't debug what you can't see. We log every orchestration event: agent name, input tokens, output tokens, handoff target, tool calls, and timing. We use structured logging with JSON and ship to Elasticsearch. The key metrics:

  • Handoff count per request: spike >3 indicates a loop.
  • Token cost per request: >10K tokens triggers an alert.
  • Agent error rate: per agent, not just overall.
  • Handoff routing accuracy: compare the handoff target to the expected target based on the user query.
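The first metric can be computed directly from handoff events. A minimal sketch, assuming each logged event is a dict with `request_id`, `source`, and `target` keys (the field names are an assumption about your log schema):

```python
from collections import Counter


def handoff_alerts(events: list[dict], max_handoffs: int = 3) -> dict[str, list[str]]:
    """Map request_id to the reasons that request looks like a handoff loop."""
    counts = Counter(e["request_id"] for e in events)
    alerts: dict[str, list[str]] = {}
    for e in events:
        if e["source"] == e["target"]:
            reasons = alerts.setdefault(e["request_id"], [])
            if "self-handoff" not in reasons:
                reasons.append("self-handoff")
    for request_id, n in counts.items():
        if n > max_handoffs:
            alerts.setdefault(request_id, []).append(f"{n} handoffs")
    return alerts


events = [{"request_id": "r1", "source": "triage", "target": "billing"}] + [
    {"request_id": "r2", "source": "triage", "target": "triage"} for _ in range(4)
]
print(handoff_alerts(events))
```

Running this over a sliding window of recent logs gives an alert that fires on the loop itself rather than on the bill it produces.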

We also use tracing. The openai-agents-python SDK supports OpenTelemetry. We export traces to Jaeger and look for long agent calls or repeated handoffs.

monitoring_setup.pyPYTHON
import json
import logging
import time
from typing import Any, Optional

from agents import Runner, Agent

# Structured logging setup. Each message is already a JSON object,
# so the formatter embeds it unquoted inside the outer JSON record.
logger = logging.getLogger("agent_orchestration")
handler = logging.StreamHandler()
formatter = logging.Formatter(
    '{"timestamp": "%(asctime)s", "level": "%(levelname)s", "message": %(message)s}'
)
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Custom wrapper to log orchestration events
async def monitored_agent_call(agent: Agent, input: str, context: Optional[dict[str, Any]] = None):
    start = time.perf_counter()  # monotonic clock, safe for measuring durations
    logger.info(json.dumps({
        "event": "agent_call_start",
        "agent": agent.name,
        "input_length": len(input),
        "context_keys": list(context.keys()) if context else None
    }))

    try:
        result = await Runner.run(agent, input, context=context)
        elapsed = time.perf_counter() - start
        logger.info(json.dumps({
            "event": "agent_call_end",
            "agent": agent.name,
            "elapsed_seconds": elapsed,
            "output_length": len(str(result.final_output)),
            # last_agent differs from the starting agent when a handoff occurred
            "last_agent": result.last_agent.name
        }))
        return result
    except Exception as e:
        elapsed = time.perf_counter() - start
        logger.error(json.dumps({
            "event": "agent_call_error",
            "agent": agent.name,
            "elapsed_seconds": elapsed,
            "error": str(e)
        }))
        raise
Alert on Handoff Spikes
Set a Prometheus alert: rate(agent_handoff_count[5m]) > 100. A sudden spike usually means a loop. We use a Grafana dashboard with per-agent handoff count, token cost, and error rate.
Production Insight
Our monitoring caught a handoff loop within 2 minutes of deployment. The alert fired, we checked the logs, and saw the triage agent handing off to itself. We rolled back and fixed the prompt. Without monitoring, we would have burned $4k again.
Key Takeaway
Instrument every agent call. Log handoffs, tokens, and timing. Set alerts on handoff count and token cost. Use tracing to visualize the orchestration flow.
● Production incident · Post-mortem · Severity: high

The $4k Handoff Loop: When Your Triage Agent Becomes a Spinning Top

Symptom
Token usage spiked from 500K tokens/day to 12M tokens in 4 hours. P99 latency jumped from 200ms to 8s. The billing dashboard showed a hockey-stick curve.
Assumption
The team assumed the triage agent would always pick a specialist agent and never hand off to itself. The prompt said 'route to the most appropriate agent' without a self-handoff guard.
Root cause
The Agent.handoff() method in openai-agents-python v0.0.6 did not prevent an agent from handing off to itself. The triage agent's prompt was ambiguous about what to do when the query was generic (e.g., 'recommend something'), so it called handoff_to('triage_agent') repeatedly.
Fix
  1. Added a max_handoff_depth parameter to the triage agent, set to 2.
  2. Modified the prompt to explicitly forbid self-handoffs: 'Never hand off to yourself or to the triage agent.'
  3. Implemented a token budget per request: if total tokens exceed 10K, kill the agent loop and return a fallback response.
  4. Added logging for every handoff decision: agent name, target, tokens consumed, and timestamp.
  5. Deployed a circuit breaker that stops the agent if it makes more than 3 handoffs in a single request.
Key lesson
  • Always set a max iteration/handoff depth on every agent, even if you think the LLM won't loop.
  • Log every orchestration decision — you can't debug what you can't see.
  • Budget tokens per request, not just per day. A single rogue request can bankrupt your experiment.
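The first lesson can be enforced in code rather than in the prompt. The sketch below follows handoff decisions while refusing self-handoffs and long chains; `follow_handoffs`, `HandoffLoopError`, and the `next_agent` callback (a stand-in for the LLM's routing decision) are hypothetical names, not SDK APIs.

```python
from typing import Callable, Optional


class HandoffLoopError(RuntimeError):
    """Raised when a handoff chain looks like a loop."""


def follow_handoffs(
    start: str,
    next_agent: Callable[[str], Optional[str]],
    max_hops: int = 3,
) -> list[str]:
    """Follow handoff decisions from `start`, rejecting self-handoffs and long chains."""
    path = [start]
    current = start
    while True:
        target = next_agent(current)  # stand-in for the LLM's routing decision
        if target is None:            # the current agent answers the user
            return path
        if target == current:
            raise HandoffLoopError(f"{current} tried to hand off to itself")
        if len(path) - 1 >= max_hops:
            raise HandoffLoopError(f"more than {max_hops} handoffs: {' -> '.join(path)}")
        path.append(target)
        current = target


routes = {"triage": "pricing", "pricing": None}
print(follow_handoffs("triage", routes.get))
```

Because the guard lives outside the model, an ambiguous prompt can at worst trigger a fallback response instead of an unbounded bill.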
Production debug guide: when handoff loops and token spikes happen at 2am (4 entries)
Symptom · 01
Token usage spikes 10x in 30 minutes
Fix
Check agent handoff logs: grep 'handoff_to' /var/log/agent.log | awk '{print $4}' | sort | uniq -c | sort -nr. Look for agents handing off to themselves.
Symptom · 02
P99 latency > 5s on multi-agent endpoints
Fix
Inspect the run traces. openai-agents-python records a trace for each run by default; look for repeated tool calls or handoffs within a single trace.
Symptom · 03
Inconsistent answers from specialist agents
Fix
Check the context passed during handoff. Use agent.handoff_context to log the full context dict. Missing fields cause agents to hallucinate.
Symptom · 04
Agent ignores tool results and retries
Fix
Verify the tool output schema matches what the agent expects. A common bug: tool returns JSON but agent expects a string, so it thinks the tool failed.
★ Agent Orchestration Patterns Triage Cheat Sheet: copy-paste diagnostics for when it's 2am and you need answers fast.
Infinite handoff loop
Immediate action
Check if an agent is handing off to itself
Commands
grep 'handoff_to' agent.log | grep -E '(triage_agent|self)' | head -20
python -c "import json; logs=[json.loads(l) for l in open('agent.log')]; print([l for l in logs if 'handoff_to' in l and l['target']==l['source']])"
Fix now
Cap the run loop. Example: result = await Runner.run(triage, query, max_turns=3). Also reject any handoff whose target equals its source.
Token cost explosion
Immediate action
Check total tokens per request in logs
Commands
grep 'total_tokens' agent.log | awk '{print $NF}' | sort -rn | head -5
python -c "import json; logs=[json.loads(l) for l in open('agent.log')]; print(max(l['total_tokens'] for l in logs if 'total_tokens' in l))"
Fix now
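Enforce a 10,000-token budget per request in the code that wraps the runner: add up the token usage of every LLM call in the request and return a fallback response once the total exceeds the budget.
Agent returns irrelevant answer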
Immediate action
Check the prompt used for the handoff
Commands
grep 'handoff_context' agent.log | jq '.prompt' | head -1
python -c "import json; logs=[json.loads(l) for l in open('agent.log')]; print([l['prompt'] for l in logs if 'handoff_context' in l][:3])"
Fix now
Explicitly state in the prompt: 'You are a specialist for X. Only answer questions about X. If asked about Y, respond with: I cannot answer that.'
Agent Handoff vs Tool-Based Orchestration
Concern | Agent Handoff | Tool-Based (Function Calling) | Recommendation
Cost per step | Full LLM call ($0.01-0.05) | API call or cheap LLM call ($0.001) | Use tools for high-volume steps
Latency | 1-3 seconds per hop | 100-500ms per call | Tools for latency-sensitive paths
Flexibility | High: LLM can reason and adapt | Low: deterministic, schema-bound | Handoff for complex reasoning
Hallucination risk | High: LLM may invent agents or parameters | Low: schema validation catches errors | Always validate handoff output
Debugging complexity | High: need trace IDs and hop logging | Low: standard API monitoring | Handoff requires dedicated tooling
Scalability (1000 req/min) | Needs queue, pre-warmed agents, circuit breakers | Easier: stateless function calls | Tools scale more easily

Key takeaways

  1. Never let an LLM decide the next agent without a strict schema and token budget: unbounded handoffs cause infinite loops and $4k+ bills.
  2. Sequential orchestration with guardrails (max hops, timeout, cost cap) is the only safe default for production; parallel fan-out is a trap without idempotency keys.
  3. Tool-based orchestration (function calling) is cheaper and more predictable than agent-to-agent handoffs for any task that doesn't require multi-turn reasoning.
  4. At 1000 req/min, pre-warm agent instances, use a shared state store (Redis), and implement circuit breakers per agent to prevent cascading failures.
  5. Debug agent handoffs by logging every transition with a trace ID, measuring latency per hop, and alerting on handoff loops (same agent called >3 times in a chain).

Common mistakes to avoid

4 patterns

Unbounded handoff loops
Symptom
The LLM keeps calling the same agent in a cycle, racking up tokens until you hit your API limit or budget cap.
Fix
Enforce a max handoff depth (e.g., 5 hops) and keep a hop counter in the context. Reject any handoff that exceeds the limit.

No timeout on agent execution
Symptom
An agent hangs on a slow LLM call or external API, blocking the entire orchestration pipeline and causing request timeouts.
Fix
Set a per-agent timeout (e.g., 10 seconds) using asyncio.wait_for or a circuit breaker. Kill and retry after timeout.

Shared mutable state across agents
Symptom
Two agents overwrite each other's context variables, leading to corrupted data and hallucinated responses.
Fix
Use immutable context snapshots per handoff step. Only allow a designated 'state manager' agent to mutate the shared store.

Ignoring token cost per handoff
Symptom
Each handoff re-sends the entire conversation history, so cumulative token usage grows roughly quadratically with the number of hops and produces surprise bills.
Fix
Summarize or truncate history before handoff. Use a sliding window of last N messages (e.g., 10) and a summary of earlier context.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain how agent orchestration works under the hood. What happens when ...
Q02SENIOR
Design an agent orchestration system that handles 1000 requests per minu...
Q03SENIOR
What are the failure modes of LLM-controlled handoffs, and how do you mi...
Q04SENIOR
Compare agent handoff vs tool-based orchestration. When would you choose...
Q05SENIOR
How do you monitor and debug a production agent orchestration pipeline?
Q01 of 05SENIOR

Explain how agent orchestration works under the hood. What happens when an LLM decides to hand off to another agent?

ANSWER
Under the hood, the orchestrator maintains a context object (conversation history, state, max hops). The LLM outputs a structured response (e.g., JSON with 'agent' and 'parameters'). The orchestrator parses this, validates against a schema, increments the hop counter, and invokes the target agent with the new context. Each handoff is a full LLM call. The orchestrator must enforce guardrails: max hops, timeout, cost cap, and loop detection.
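The parse-and-validate step from this answer can be sketched as below. The JSON shape (`agent` and `parameters` keys) follows the answer; the agent registry, the function name, and the default hop limit are hypothetical.

```python
import json

KNOWN_AGENTS = {"search", "inventory", "pricing"}  # hypothetical registry


def parse_handoff(raw: str, current_agent: str, hops: int, max_hops: int = 3) -> dict:
    """Parse one handoff decision; raise ValueError on anything suspicious."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"handoff is not valid JSON: {exc}") from exc
    if not isinstance(decision, dict):
        raise ValueError("handoff must be a JSON object")
    target = decision.get("agent")
    if not isinstance(target, str) or target not in KNOWN_AGENTS:
        raise ValueError(f"unknown agent: {target!r}")  # catches hallucinated names
    if target == current_agent:
        raise ValueError("self-handoff rejected")
    if hops + 1 > max_hops:
        raise ValueError(f"hop limit {max_hops} exceeded")
    parameters = decision.get("parameters", {})
    if not isinstance(parameters, dict):
        raise ValueError("parameters must be an object")
    return {"agent": target, "parameters": parameters, "hops": hops + 1}


decision = parse_handoff(
    '{"agent": "pricing", "parameters": {"sku": "A1"}}', current_agent="triage", hops=0
)
print(decision)
```

Every rejection path raises the same exception type, so the orchestrator can map all of them to one fallback response.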
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What's the difference between agent handoff and tool-based orchestration?
02
How do I prevent infinite loops in agent orchestration?
03
What's the best way to scale agent orchestration to 1000 requests per minute?
04
How do I debug a broken agent handoff chain?
05
When should I avoid agent orchestration entirely?