Agent Orchestration Patterns — The $4k Mistake We Made With LLM-Controlled Handoffs
Learn agent orchestration patterns from a production incident that caused a $4k token bill spike.
- Sequential Orchestration: Agents execute in a fixed pipeline. Use when subtasks have strict dependencies. We saw a 23% accuracy drop when we assumed parallel execution was safe.
- Handoff Orchestration: A triage agent routes to specialists. Watch for runaway loops: our payment service hit 800ms p99 because handoffs didn't have timeout limits.
- Tool-Based Orchestration: A manager agent controls subtasks via Agent.as_tool(). Better for bounded subtasks where the manager must retain context. Avoid if the subtask needs independent reasoning.
- LLM-Controlled Orchestration: The LLM decides the agent flow. Flexible but unpredictable. We learned the hard way that without guardrails, token costs can spike 10x in an hour.
- Code-Controlled Orchestration: The flow is hardcoded. Predictable and debuggable. Use when the task structure is known at design time, like a multi-step data pipeline.
- Hybrid Orchestration: A mix of LLM and code control. Best for complex workflows where some steps are fixed and others need dynamic routing. Set iteration limits to prevent infinite loops.
Think of agent orchestration like a restaurant kitchen. You can have a single chef (single agent) who does everything, or a head chef who delegates tasks to specialized line cooks (orchestrator with handoffs). The problem is when the head chef asks a cook to 'make something Italian' without specifying the dish — you get a confusing mess and wasted ingredients. That's what happens when you let the LLM control the flow without clear boundaries.
We were running a multi-agent recommendation engine serving 2M requests per day. The system used a triage agent to route user queries to specialist agents: one for product search, one for inventory, one for pricing. It worked great in staging. In production, our token costs jumped from $200/day to $4,200 in a single afternoon. The triage agent was handing off to itself in a loop, generating 12,000 tokens per request. We had no timeout on handoffs, no iteration limit, and no monitoring on agent routing decisions.
Most tutorials on agent orchestration show you clean examples with two agents and a simple handoff. They don't tell you what happens when the LLM decides to hand off to the wrong agent, or when it enters a loop because the prompt says 'use the most appropriate agent' without constraints. They also skip the cost implications — every handoff is a full LLM call, and if the routing logic is flawed, you're burning money on garbage.
This article covers the three core orchestration patterns — sequential, handoff-based, and tool-based — with production code examples. We'll walk through the incident that cost us $4k, show you how to debug agent routing in production, and give you a cheat sheet for triaging common failures. By the end, you'll know which pattern to use and, more importantly, when to avoid LLM-controlled flows entirely.
How Agent Orchestration Actually Works Under the Hood
Agent orchestration isn't magic — it's a series of LLM calls strung together by a runtime. When you use openai-agents-python, the Runner class manages the event loop. Each agent call goes through: 1) system prompt injection, 2) user message, 3) LLM response parsing, 4) tool execution or handoff. The handoff mechanism creates a new context window for the target agent, discarding the previous agent's conversation history unless you explicitly pass it.
What the docs don't tell you: every handoff costs a full LLM call for the source agent to generate the handoff decision, plus another call for the target agent to start. That's 2 LLM calls per handoff. In our production system, a single request with 5 handoffs cost 10 LLM calls. At $0.01 per call (gpt-4o-mini), that's $0.10 per request. At 1000 requests/minute, that's $100/minute in token costs alone.
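The arithmetic above can be written out as a small calculator. This is only a sketch: the function and constant names are ours, and the $0.01 per call is the flat figure quoted above rather than a real price lookup.

```python
CALLS_PER_HANDOFF = 2      # one call to decide the handoff, one for the target to start
COST_PER_CALL_USD = 0.01   # flat per-call figure quoted in the text (an approximation)


def cost_per_request(handoffs: int, cost_per_call: float = COST_PER_CALL_USD) -> float:
    """Estimated LLM spend for one request that performs `handoffs` handoffs."""
    return handoffs * CALLS_PER_HANDOFF * cost_per_call


def cost_per_minute(handoffs: int, requests_per_minute: int) -> float:
    """Estimated spend per minute at a steady request rate."""
    return cost_per_request(handoffs) * requests_per_minute


if __name__ == "__main__":
    print(f"per request: ${cost_per_request(5):.2f}")         # per request: $0.10
    print(f"per minute:  ${cost_per_minute(5, 1000):.2f}")    # per minute:  $100.00
```

Plugging in your own handoff count and request rate before launch is a cheap way to see what a routing bug would cost per minute.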
The abstraction hides the state management. When you hand off, the source agent's state is frozen. If the target agent needs context from the source (e.g., the user's original query), you must pass it explicitly in the handoff context. Many teams forget this, and the specialist agent starts from scratch, producing irrelevant answers.
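One way to make the "pass it explicitly" rule hard to forget is to build the specialist's first message from a small context object and refuse to hand off when required fields are empty. The sketch below is framework-independent: HandoffContext and build_specialist_input are hypothetical names, not part of openai-agents-python.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffContext:
    """Hypothetical container for what the target agent must not lose."""
    original_query: str
    intermediate_results: dict[str, str] = field(default_factory=dict)

    def missing_fields(self) -> list[str]:
        """Names of required fields that are empty."""
        return [] if self.original_query.strip() else ["original_query"]


def build_specialist_input(ctx: HandoffContext, instruction: str) -> str:
    """Render the context into the first message the specialist sees."""
    missing = ctx.missing_fields()
    if missing:
        # Fail loudly instead of letting the specialist start from scratch.
        raise ValueError(f"handoff context is missing: {missing}")
    lines = [f"User query: {ctx.original_query}"]
    for name, value in ctx.intermediate_results.items():
        lines.append(f"{name}: {value}")
    lines.append(f"Task: {instruction}")
    return "\n".join(lines)
```

Failing at the handoff boundary turns a silent quality problem (generic answers) into an error you can count and alert on.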
Set token_budget on the Runner to cap per-request spend. We use 10K tokens as a hard limit for internal tools.
Practical Implementation: Sequential Orchestration with Guardrails
Sequential orchestration is the simplest pattern: agent A runs, then agent B, then agent C. No handoffs, no routing decisions. Use this when the subtasks have a fixed order and each step depends on the previous one. The key production concern is error propagation — if agent A fails, the whole pipeline stops.
We use a try-except around each agent call and a fallback response. Also, set a timeout per agent call. A single agent can hang if the LLM decides to think for 30 seconds. We use timeout=10 on the Runner.
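A minimal sketch of that pattern, assuming each agent can be wrapped as an async callable from text to text. It enforces the per-call timeout with asyncio.wait_for rather than any SDK setting. The names run_step and run_pipeline and the stub agents are illustrative, not part of openai-agents-python.

```python
import asyncio
from typing import Awaitable, Callable

# An "agent" here is any async callable from text to text (a stand-in for a real LLM call).
AgentFn = Callable[[str], Awaitable[str]]


async def run_step(agent: AgentFn, payload: str, timeout_s: float, fallback: str) -> tuple[str, bool]:
    """Run one agent with a timeout. Returns (output, ok)."""
    try:
        return await asyncio.wait_for(agent(payload), timeout=timeout_s), True
    except Exception:
        # Timeouts and agent errors both end up here and yield the fallback.
        return fallback, False


async def run_pipeline(steps: list[tuple[str, AgentFn]], query: str,
                       timeout_s: float = 10.0,
                       fallback: str = "Sorry, we could not complete this request.") -> dict:
    """Run agents in a fixed order; stop at the first failure and return the fallback."""
    payload = query
    for name, agent in steps:
        payload, ok = await run_step(agent, payload, timeout_s, fallback)
        if not ok:
            return {"output": payload, "failed_step": name}
    return {"output": payload, "failed_step": None}


# Stub agents for illustration only.
async def validate(text: str) -> str:
    return text.strip()


async def enrich(text: str) -> str:
    return f"{text} [enriched]"
```

Returning the name of the failed step alongside the fallback keeps error propagation visible: the caller knows the pipeline stopped and where.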
When NOT to Use Agent Orchestration
Don't use multi-agent orchestration if a single agent can do the job. We see teams adding handoffs because it sounds cool, not because they need it. The rule of thumb: if you can write a single prompt that handles all cases, use a single agent. Multi-agent adds latency, cost, and debugging complexity.
- Using handoffs for simple classification (e.g., 'is this email spam?'). A single LLM call is cheaper and faster.
- Using tool-based orchestration when the subtask is trivial (e.g., 'add 2+2'). Use a function tool instead.
- Using LLM-controlled routing when the flow is fixed (e.g., always validate, then process, then report). Use sequential orchestration.
Production Patterns & Scale: Handling 1000 Requests/Minute
At scale, agent orchestration patterns break in predictable ways. The most common issues: token rate limits, LLM timeouts, and context window overflow. Here's how we handle each:
- Token rate limits: Use a token bucket per agent. We use asyncio.Semaphore to limit concurrent LLM calls. For gpt-4o-mini, we allow 10 concurrent calls per agent.
- LLM timeouts: Set timeout=5 on the Runner. If an agent takes longer, retry once, then fall back to a cached response.
- Context window overflow: Agents accumulate conversation history. After 3 handoffs, the context can exceed 128K tokens. We truncate history to the last 5 messages before each handoff.
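The three mitigations above can be sketched in plain asyncio. This is an approximation under stated assumptions: GuardedAgent and truncate_history are hypothetical helpers, the semaphore caps concurrency only (it is not a token bucket), and the "cache" is an in-memory dict of the last good answer per prompt.

```python
import asyncio
from typing import Awaitable, Callable

MAX_CONCURRENT_CALLS = 10   # per agent, as in the list above
CALL_TIMEOUT_S = 5.0
HISTORY_LIMIT = 5           # messages kept before each handoff


def truncate_history(messages: list[dict], limit: int = HISTORY_LIMIT) -> list[dict]:
    """Keep only the most recent `limit` messages."""
    return messages[-limit:] if limit > 0 else []


class GuardedAgent:
    """Wraps an async LLM call with a concurrency cap, a timeout, and one retry.

    Create instances inside a running event loop.
    """

    def __init__(self, call: Callable[[str], Awaitable[str]],
                 max_concurrent: int = MAX_CONCURRENT_CALLS,
                 timeout_s: float = CALL_TIMEOUT_S) -> None:
        self._call = call
        self._sem = asyncio.Semaphore(max_concurrent)
        self._timeout_s = timeout_s
        self._cache: dict[str, str] = {}

    async def run(self, prompt: str, default: str = "") -> str:
        async with self._sem:
            for _attempt in range(2):            # first try plus one retry
                try:
                    result = await asyncio.wait_for(self._call(prompt), self._timeout_s)
                    self._cache[prompt] = result  # remember the last good answer
                    return result
                except asyncio.TimeoutError:
                    continue
        # Both attempts timed out: serve the cached answer, or the default if none.
        return self._cache.get(prompt, default)
```

Truncating to the last few messages is blunt: it can drop the user's original query, so it pairs best with passing that query explicitly on each handoff.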
We also use a circuit breaker pattern: if an agent fails 3 times in 1 minute, stop routing to it and return a default response.
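A sliding-window circuit breaker along those lines might look like the sketch below. It is deliberately simple: CircuitBreaker and call_with_breaker are illustrative names, there is no half-open state, and the breaker closes again on its own once old failures age out of the window.

```python
import time
from collections import deque
from typing import Callable


class CircuitBreaker:
    """Opens after `max_failures` failures inside a sliding `window_s` window."""

    def __init__(self, max_failures: int = 3, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self._clock = clock              # injectable so tests can control time
        self._failures: deque = deque()  # timestamps of recent failures

    def _prune(self) -> None:
        """Drop failures that fell out of the window."""
        cutoff = self._clock() - self.window_s
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()

    def record_failure(self) -> None:
        self._failures.append(self._clock())

    def is_open(self) -> bool:
        self._prune()
        return len(self._failures) >= self.max_failures


def call_with_breaker(breaker: CircuitBreaker, agent: Callable[[str], str],
                      prompt: str, default: str) -> str:
    """Skip the agent while the breaker is open; count failures otherwise."""
    if breaker.is_open():
        return default
    try:
        return agent(prompt)
    except Exception:
        breaker.record_failure()
        return default
```

Keeping one breaker per agent, rather than one for the whole system, means a failing pricing agent does not take product search down with it.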
Common Mistakes with Specific Examples
Here are the top 3 mistakes we see in production:
- Forgetting to pass context on handoff. The specialist agent gets no history and produces generic answers. Fix: always pass a context dict with the user's original query and any intermediate results.
- Not setting a max handoff depth. The LLM can loop indefinitely. Fix: set max_handoff_depth=3 on every agent.
- Using the same prompt for all agents. Each agent needs a focused prompt. A generic prompt leads to role confusion and wrong outputs. Fix: write a specific prompt for each agent, including what it should NOT do.
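The handoff-depth fix can also be enforced in your own routing code, independent of any framework setting. The sketch below assumes routing can be modeled as a function from the current agent and query to the next agent name. The names route, Router, and HandoffLimitExceeded are hypothetical.

```python
from typing import Callable, Optional

# A router takes (current_agent, query) and returns the next agent name, or None to stop.
Router = Callable[[str, str], Optional[str]]


class HandoffLimitExceeded(RuntimeError):
    """Raised when routing loops or exceeds the allowed handoff depth."""


def route(query: str, start: str, router: Router, max_depth: int = 3) -> list[str]:
    """Follow handoffs from `start` and return the path of agents visited."""
    path = [start]
    current = start
    while True:
        target = router(current, query)
        if target is None:
            return path
        if target == current:
            # The exact failure mode of a triage agent routing to itself.
            raise HandoffLimitExceeded(f"{current} tried to hand off to itself")
        if len(path) - 1 >= max_depth:
            raise HandoffLimitExceeded(f"more than {max_depth} handoffs: {path + [target]}")
        path.append(target)
        current = target
```

Raising instead of silently stopping gives you an exception to log and count, which is what surfaces a looping prompt quickly.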
Comparison: Handoff vs Tool-Based Orchestration
The two main patterns in the openai-agents-python SDK are handoffs and agents-as-tools. Here's the production tradeoff:
- Handoffs: The specialist agent takes over the conversation. Good for when the specialist needs to respond directly to the user. Bad for when the manager needs to combine multiple specialist outputs — the manager loses context.
- Agents-as-tools: The manager calls a specialist via Agent.as_tool(), gets the result, and keeps control. Good for bounded subtasks where the manager needs to aggregate. Bad for long-running specialists that need to maintain state.
We use handoffs for routing (e.g., triage -> billing) and agents-as-tools for data enrichment (e.g., manager calls a summarizer tool, then a formatter tool).
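The difference in control flow can be shown without any SDK. In this toy sketch, specialists are plain functions and both runner functions are hypothetical. It only illustrates who produces the final answer in each style.

```python
from typing import Callable

Specialist = Callable[[str], str]


def run_with_handoff(query: str, pick: Callable[[str], str],
                     specialists: dict[str, Specialist]) -> str:
    """Handoff style: triage picks one specialist, whose answer goes straight to the user."""
    return specialists[pick(query)](query)


def run_with_tools(query: str, tools: list[Specialist],
                   combine: Callable[[str, list[str]], str]) -> str:
    """Agents-as-tools style: the manager calls each specialist, keeps control, and aggregates."""
    results = [tool(query) for tool in tools]
    return combine(query, results)
```

In the handoff version the triage step never sees the specialist's output. In the tools version every result passes back through the manager, which is what makes aggregation possible.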
Debugging and Monitoring Agent Orchestration in Production
You can't debug what you can't see. We log every orchestration event: agent name, input tokens, output tokens, handoff target, tool calls, and timing. We use structured logging with JSON and ship to Elasticsearch. The key metrics:
- Handoff count per request: spike >3 indicates a loop.
- Token cost per request: >10K tokens triggers an alert.
- Agent error rate: per agent, not just overall.
- Handoff routing accuracy: compare the handoff target to the expected target based on the user query.
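As a rough sketch of the first two metrics, the code below emits one JSON line per orchestration event and checks a single request against the thresholds listed above. The field names (request_id, handoff_to, and so on) and the alert names are assumptions for illustration, not a schema from any SDK.

```python
import json
import time
from typing import Optional

HANDOFF_ALERT = 3        # more than 3 handoffs in one request suggests a loop
TOKEN_ALERT = 10_000     # more than 10K tokens in one request triggers an alert


def log_event(request_id: str, agent: str, input_tokens: int, output_tokens: int,
              handoff_target: Optional[str] = None, duration_ms: float = 0.0) -> str:
    """Serialize one orchestration event as a JSON line."""
    return json.dumps({
        "ts": time.time(),
        "request_id": request_id,
        "agent": agent,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "handoff_to": handoff_target,
        "duration_ms": duration_ms,
    })


def check_request(lines: list[str]) -> list[str]:
    """Return alert names for the logged events of a single request."""
    events = [json.loads(line) for line in lines]
    handoffs = [e for e in events if e["handoff_to"]]
    tokens = sum(e["input_tokens"] + e["output_tokens"] for e in events)
    alerts = []
    if len(handoffs) > HANDOFF_ALERT:
        alerts.append("handoff_loop_suspected")
    if tokens > TOKEN_ALERT:
        alerts.append("token_budget_exceeded")
    if any(e["handoff_to"] == e["agent"] for e in handoffs):
        alerts.append("self_handoff")
    return alerts
```

Because each line is self-describing JSON, the same events can feed both a log search and per-agent dashboards without a second logging path.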
We also use tracing. The openai-agents-python SDK supports OpenTelemetry. We export traces to Jaeger and look for long agent calls or repeated handoffs.
Set an alert on rate(agent_handoff_count[5m]) > 100. A sudden spike usually means a loop. We use a Grafana dashboard with per-agent handoff count, token cost, and error rate.
The $4k Handoff Loop: When Your Triage Agent Becomes a Spinning Top
The Agent.handoff() method in openai-agents-python v0.0.6 did not prevent an agent from handing off to itself. The triage agent's prompt was ambiguous about what to do when the query was generic (e.g., 'recommend something'), so it called handoff_to('triage_agent') repeatedly.
- Always set a max iteration/handoff depth on every agent, even if you think the LLM won't loop.
- Log every orchestration decision — you can't debug what you can't see.
- Budget tokens per request, not just per day. A single rogue request can bankrupt your experiment.
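A per-request budget can be as small as a counter that every LLM call must charge before it proceeds. The sketch below assumes you can read token usage after each call. TokenBudget and TokenBudgetExceeded are hypothetical names, and the 10,000 default mirrors the hard limit mentioned earlier.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a single request spends more tokens than allowed."""


class TokenBudget:
    """Tracks token spend for one request and aborts once the limit is crossed."""

    def __init__(self, limit: int = 10_000) -> None:
        self.limit = limit
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> int:
        """Record usage from one LLM call and return the remaining budget."""
        self.used += input_tokens + output_tokens
        if self.used > self.limit:
            raise TokenBudgetExceeded(
                f"request used {self.used} tokens, limit is {self.limit}"
            )
        return self.limit - self.used
```

Create one TokenBudget per incoming request and charge it after every agent call. A looping request then fails after a bounded spend instead of running until someone notices the bill.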
- Run grep 'handoff_to' /var/log/agent.log | awk '{print $4}' | sort | uniq -c | sort -nr. Look for agents handing off to themselves.
- Enable tracing with Runner(tracing_exporters=[ConsoleExporter()]) and look for repeated tool calls or handoffs.
- Use agent.handoff_context to log the full context dict. Missing fields cause agents to hallucinate.
- grep 'handoff_to' agent.log | grep -E '(triage_agent|self)' | head -20
- python -c "import json; logs=[json.loads(l) for l in open('agent.log')]; print([l for l in logs if 'handoff_to' in l and l['target']==l['source']])"
- Add max_handoff_depth=2 to your agent config. Example: agent = Agent(name='triage', max_handoff_depth=2)
Key takeaways
Common mistakes to avoid
- Unbounded handoff loops
- No timeout on agent execution
- Shared mutable state across agents
- Ignoring token cost per handoff
Interview Questions on This Topic
Explain how agent orchestration works under the hood. What happens when an LLM decides to hand off to another agent?