AI Agents Explained — The 3am Incident That Broke Our Multi-Agent Orchestrator
Learn how AI agents work under the hood, avoid the 3am pager from a runaway agent loop, and build production-grade autonomous systems with Python and LangGraph.
- Agent Loop The core loop that calls an LLM, parses the response, and executes a tool — if not bounded, it can spin forever, burning $400/hr in tokens.
- Tool Execution Each tool call is a side effect; a buggy tool can corrupt state or trigger cascading failures across agents.
- Memory Window Agents with finite context windows will silently drop old messages, causing hallucinations or task abandonment.
- Orchestrator Pattern A single orchestrator agent managing sub-agents creates a single point of failure; a crash in the orchestrator loses all sub-agent progress.
- Structured Output Using pydantic models for agent responses prevents parsing errors that crash the pipeline at 2am.
- Observability Without tracing every LLM call and tool execution, debugging a multi-agent system is impossible.
Think of an AI agent like a very eager intern who can use any tool in the office but has no sense of time. You give them a task, they start making phone calls, sending emails, and searching the web. If you don't give them a strict deadline and a way to report back, they'll keep working forever, burning through your budget and never telling you they're stuck. A production AI agent is that intern with a stopwatch, a notepad, and a manager who checks in every 30 seconds.
We rolled out a multi-agent system to handle customer support tickets. Three agents: one for triage, one for knowledge base lookup, one for escalation. The first week was magic — 80% of tickets resolved without human touch. Then the pager went off at 3am. A single ticket about 'printer not working' had triggered a 47-minute agent loop, called the knowledge base API 1,200 times, and racked up $340 in OpenAI costs. The agent was stuck in a loop: look up 'printer', get vague answer, ask for clarification, look up 'printer troubleshooting', get another vague answer, repeat. No timeout, no max retries, no circuit breaker.
How AI Agents Actually Work Under the Hood
An AI agent is not magic — it's a loop. The loop calls an LLM, gets a structured response (usually a JSON with 'action' and 'action_input'), executes the action (a function call), appends the result to the message history, and repeats. The LLM decides when to stop by returning a 'final_answer' action. The tricky part is that the LLM has no inherent concept of time or cost. It will keep generating actions until it thinks the task is done, which may be never. The abstraction you should care about is the context window. Every loop iteration adds tokens. After ~10 iterations with tool results, you can easily hit 8k tokens. If your LLM's max context is 4k, older messages get silently dropped, causing the agent to 'forget' the original task. This is why you need to explicitly manage the context window — either by summarizing old messages or using a sliding window.
response_format with pydantic models to guarantee a parseable response. We learned this when 2% of our agent calls crashed with JSONDecodeError at 2am.Practical Implementation: Building a Multi-Agent Orchestrator with LangGraph
LangGraph is the de facto framework for building multi-agent systems in production. It models agents as nodes in a directed graph, with edges defining the flow. The key insight is that each node is a function that takes state and returns state. The graph's executor runs the nodes in order, handling branching and cycles. The gotcha is state management. Each node can modify the shared state, and if two nodes modify the same key concurrently, you get race conditions. LangGraph handles this with a reducer pattern — you define how to merge updates to each state key. In production, we use a single reducer that appends to a list, so no data is lost. Another gotcha: the graph's recursion limit. By default, LangGraph limits recursion to 25 steps. If your agent needs more, you must increase it explicitly. We hit this when a complex workflow required 30 steps, and the graph silently stopped at 25.
graph.compile(recursion_limit=100) explicitly. We learned this when a complex customer support flow silently failed after 25 steps.When NOT to Use AI Agents
AI agents are not the right tool for every problem. If your task is a simple, deterministic workflow (e.g., 'if this, then that'), use a rules engine or a simple script. Agents add latency, cost, and failure modes. Specifically, avoid agents when: 1) The decision logic is deterministic and well-defined. 2) The cost of a wrong action is high (e.g., deleting a database record). 3) You need guaranteed response times — LLM calls have unpredictable latency. 4) The task requires no external tools or data. A simple LLM call with a prompt is cheaper and faster. We made this mistake with a password reset flow. We used an agent to decide whether to send a reset email. The agent sometimes decided to 'call the user' instead, which was not implemented. The fix was to replace the agent with a simple if-else statement.
Production Patterns & Scale: Caching, Rate Limiting, and Observability
At scale, AI agents consume a lot of resources. A single agent doing 10 tool calls per session, with 1,000 sessions per hour, generates 10,000 tool calls per hour. If each tool call takes 500ms, that's 5,000 seconds of compute time per hour. You need caching for repeated tool calls (e.g., knowledge base lookups for the same query). You need rate limiting to protect downstream APIs. And you need observability to debug failures. The most important metric is token usage per session. Set alerts for sessions that exceed 10,000 tokens. Also track tool call latency and error rates. We use OpenTelemetry to trace every LLM call and tool execution. The trace includes the input, output, latency, and token count. This allows us to replay any session for debugging.
Common Mistakes with Specific Examples
Mistake 1: Not validating tool inputs. An agent might call a tool with a SQL injection payload if you're not careful. Always sanitize inputs. Mistake 2: Ignoring the context window. If the agent's context exceeds the model's limit, older messages are dropped silently. This causes the agent to 'forget' the original task. Mistake 3: Using a single agent for everything. A single agent with too many tools becomes confused. Split responsibilities across specialized agents. Mistake 4: Not handling tool failures gracefully. If a tool returns an error, the agent might retry indefinitely or crash. Implement retries with backoff and a max retry count. Mistake 5: Not testing with real-world data. Synthetic tests don't capture the ambiguity of real user queries. We once tested with 'What is the weather?' and deployed, only to find that real users asked 'What's the weather like in Tokyo next Tuesday?' which required a date parser the agent didn't have.
Comparison vs Alternatives: Agents vs RAG vs Fine-Tuning
Agents are not always the best solution. For question-answering over a fixed knowledge base, RAG (Retrieval-Augmented Generation) is simpler and more reliable. For specialized tasks with fixed output formats, fine-tuning a model is cheaper and faster. Agents are best when the task requires multiple steps, tool use, and adaptation. The trade-off is complexity and cost. A RAG system costs ~$0.01 per query. An agent costs ~$0.10 per query. But an agent can handle tasks a RAG system cannot, like booking a flight or debugging code. The decision matrix: if the task is a single-turn Q&A, use RAG. If the task is multi-turn with tool use, use an agent. If the task is a fixed, repetitive pattern, fine-tune a model.
Debugging and Monitoring AI Agents in Production
Debugging an agent in production is hard because the behavior is non-deterministic. The same input can produce different outputs. You need to log everything: the LLM response, the tool inputs and outputs, the state at each step, and the final output. Use a trace ID to correlate all logs for a single session. The most common debugging scenario is 'the agent returned the wrong answer'. You need to replay the session step by step. We built a replay tool that takes a trace ID and re-executes the agent with the same inputs, printing each step. This allows us to see exactly where the agent went wrong. Another common issue is 'the agent is slow'. Profile each step: LLM call latency, tool call latency, and state processing time. We found that 80% of latency was from LLM calls, and 20% from tool calls.
The Runaway Agent: $340 in 47 Minutes
while loop in the orchestrator ran until the LLM returned a 'final_answer' action. The LLM kept generating 'search_knowledge_base' actions because the results were always ambiguous.max_iterations=10 parameter to the agent loop.
2. Implemented a timeout of 120 seconds per agent session.
3. Added a circuit breaker that kills the agent after 5 consecutive failed tool calls.
4. Logged all tool call inputs and outputs for post-mortem analysis.
``python
# Before:
while action.type != "final_answer":
action = llm.invoke(messages)
result = execute_tool(action)
messages.append(result)
# After:
for i in range(MAX_ITERATIONS):
if time.monotonic() - start_time > TIMEOUT_SECONDS:
raise TimeoutError("Agent exceeded timeout")
action = llm.invoke(messages)
if action.type == "error":
consecutive_errors += 1
if consecutive_errors >= 5:
raise CircuitBreakerError("Too many consecutive errors")
else:
consecutive_errors = 0
result = execute_tool(action)
messages.append(result)
``- Always set a hard limit on agent iterations and wall-clock time before deploying to production.
- Monitor token usage per session and alert on anomalies — not just total cost.
- Implement circuit breakers for tool calls; a flaky API should not crash the entire agent.
kubectl logs <pod> | grep 'iteration' | tail -20 to see if it's stuck in a loop. If iteration count is > 10, you have a loop.SELECT session_id, COUNT(*) as calls, SUM(token_count) as tokens FROM agent_traces WHERE timestamp > NOW() - INTERVAL '1 hour' GROUP BY session_id ORDER BY tokens DESC LIMIT 5; — find the runaway session.python -c "import json; data=json.load(open('agent_messages.json')); print(json.dumps(data[-5:], indent=2))" to see if the context was truncated or corrupted.curl -s -o /dev/null -w "%{http_code}" https://api.example.com/health to verify the tool is up. If rate limited, add exponential backoff to the tool executor.kubectl exec <pod> -- cat /proc/<pid>/fd/1 | grep 'iteration' | tail -5python -c "import json; traces=json.load(open('traces.json')); print([t for t in traces if t['iterations'] > 10])"max_iterations=10 in the agent config and restart the pod.Key takeaways
Common mistakes to avoid
4 patternsNo max iterations on agent loop
add_condition_edges to route to an END node after N steps.Allowing agents to call themselves or other agents without a circuit breaker
No caching of tool outputs
functools.lru_cache or Redis) keyed by (tool_name, hash(input)). Invalidate on conversation reset. Our fix: 98% cache hit rate, reduced average tool latency from 3s to 15ms.No per-step observability on agent state
Interview Questions on This Topic
Explain how an AI agent works under the hood. What is the core loop?
Frequently Asked Questions
That's AI Agents. Mark it forged?
5 min read · try the examples if you haven't