Senior 5 min · May 22, 2026

AI Agents Explained — The 3am Incident That Broke Our Multi-Agent Orchestrator

Q: What is the difference between an AI agent and a simple LLM call?

A simple LLM call is a single request-response. An agent is a loop: LLM decides which tool to call → executes tool → feeds result back to LLM → repeats until a stop condition (e.g., final answer, max steps). The loop is the key differentiator.

Q: How do I prevent an AI agent from hallucinating tool calls?

Use a constrained output format (e.g., JSON schema with Pydantic), validate tool names against a whitelist before execution, and set temperature=0 for tool-calling decisions. Never let the agent parse free-form text as a tool call.

Q: When should I use LangGraph vs building my own agent loop?

Use LangGraph when you need a state machine with multiple agents, branching, and conditional edges. Build your own loop only for a single-agent, single-tool scenario — otherwise you'll reimplement LangGraph's StateGraph and debugging tools.

Q: How do I monitor agent costs in production?

Track per-step token usage (input + output), multiply by model cost per token, and log to a metrics dashboard. Set a budget per conversation (e.g., $0.50 max) and kill the agent if exceeded. Our 3am incident cost $340 because we had no per-conversation budget.
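A minimal sketch of the per-conversation budget described above. The class names `ConversationBudget` and `BudgetExceeded` and the per-token prices are illustrative assumptions, not real pricing; substitute your provider's current rates.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a conversation spends more than its allowed budget."""


@dataclass
class ConversationBudget:
    # Hypothetical prices in USD per 1,000 tokens; replace with real rates.
    input_price_per_1k: float = 0.0025
    output_price_per_1k: float = 0.01
    max_usd: float = 0.50
    spent_usd: float = 0.0

    def record_step(self, input_tokens: int, output_tokens: int) -> float:
        """Add one LLM call's cost; raise if the conversation cap is exceeded."""
        cost = (input_tokens / 1000) * self.input_price_per_1k
        cost += (output_tokens / 1000) * self.output_price_per_1k
        self.spent_usd += cost
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"Conversation spent ${self.spent_usd:.4f}, cap is ${self.max_usd:.2f}"
            )
        return cost
```

Call `record_step` after every LLM response inside the agent loop and treat `BudgetExceeded` as a hard stop. With the OpenAI SDK, the token counts are available on `response.usage` as `prompt_tokens` and `completion_tokens`.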

Q: Can I use agents for real-time applications?

Only if you set aggressive timeouts (e.g., 5 seconds total) and limit tool calls to 1-2 steps. Agents are inherently slower than RAG because of the loop overhead. For sub-second responses, use RAG or a fine-tuned model instead.

Learn how AI agents work under the hood, avoid the 3am pager from a runaway agent loop, and build production-grade autonomous systems with Python and LangGraph.

Naren · Founder

Plain-English first. Then code. Then the interview question.


⚡Quick Answer

Agent Loop: The core loop that calls an LLM, parses the response, and executes a tool. If not bounded, it can spin forever, burning $400/hr in tokens.
Tool Execution: Each tool call is a side effect; a buggy tool can corrupt state or trigger cascading failures across agents.
Memory Window: Agents with finite context windows will silently drop old messages, causing hallucinations or task abandonment.
Orchestrator Pattern: A single orchestrator agent managing sub-agents creates a single point of failure; a crash in the orchestrator loses all sub-agent progress.
Structured Output: Using Pydantic models for agent responses prevents parsing errors that crash the pipeline at 2am.
Observability: Without tracing every LLM call and tool execution, debugging a multi-agent system is impossible.

✦ Definition · ~90s read

What Are AI Agents?

AI agents are autonomous software systems that use large language models (LLMs) as their reasoning engine to plan, execute, and iterate on tasks without step-by-step human instructions. Unlike a simple chatbot that responds to a single prompt, an agent maintains state, calls external tools (APIs, databases, code interpreters), and loops through a 'think-act-observe' cycle until it achieves a goal or hits a termination condition.


Under the hood, this is typically implemented as a state machine: the LLM generates structured output (e.g., JSON with a 'tool_call' field), which triggers a function, whose result is fed back into the next LLM call. Frameworks like LangGraph formalize this as a directed graph of nodes (LLM calls, tool executions, human-in-the-loop gates) connected by edges that define control flow; CrewAI and AutoGen provide comparable orchestration through role-based crews and agent conversations.

Multi-agent orchestrators extend this by running several specialized agents concurrently, each with its own prompt, tools, and memory, coordinated by a supervisor agent or a shared message bus. The 3am incident referenced in the article, where agents entered an unbounded retry loop and burned through $340 in API credits while logging contradictory decisions, is a classic failure mode: agents hallucinate tool outputs, misinterpret state, or deadlock when their goals conflict.

This is why production systems require strict guardrails: max iteration limits, idempotent tool calls, and observability pipelines that log every state transition with timestamps and token counts.

AI agents are not a silver bullet. They are overkill for deterministic workflows (use a script or RPA), for simple Q&A over static documents (use RAG), or for tasks requiring high precision with zero hallucination tolerance (use fine-tuned models with constrained decoding).

Agents shine in open-ended, multi-step scenarios like automated code debugging, complex data extraction across APIs, or dynamic research synthesis. The key trade-off is latency and cost: each agent loop consumes tokens for reasoning and tool calls, so you must cache identical LLM responses, rate-limit external APIs, and set hard budget caps.

Tools like LangSmith or Weights & Biases provide the observability to catch the 3am spiral before it drains your account.

Plain-English First

Think of an AI agent like a very eager intern who can use any tool in the office but has no sense of time. You give them a task, they start making phone calls, sending emails, and searching the web. If you don't give them a strict deadline and a way to report back, they'll keep working forever, burning through your budget and never telling you they're stuck. A production AI agent is that intern with a stopwatch, a notepad, and a manager who checks in every 30 seconds.

We rolled out a multi-agent system to handle customer support tickets. Three agents: one for triage, one for knowledge base lookup, one for escalation. The first week was magic — 80% of tickets resolved without human touch. Then the pager went off at 3am. A single ticket about 'printer not working' had triggered a 47-minute agent loop, called the knowledge base API 1,200 times, and racked up $340 in OpenAI costs. The agent was stuck in a loop: look up 'printer', get vague answer, ask for clarification, look up 'printer troubleshooting', get another vague answer, repeat. No timeout, no max retries, no circuit breaker.

How AI Agents Actually Work Under the Hood

An AI agent is not magic: it's a loop. The loop calls an LLM, gets a structured response (usually JSON with 'action' and 'action_input'), executes the action (a function call), appends the result to the message history, and repeats. The LLM decides when to stop by returning a 'final_answer' action. The tricky part is that the LLM has no inherent concept of time or cost. It will keep generating actions until it thinks the task is done, which may be never.

The abstraction you should care about is the context window. Every loop iteration adds tokens. After ~10 iterations with tool results, you can easily hit 8k tokens. If that exceeds your model's context limit, the request either fails outright or, when your framework trims history to fit, older messages get dropped silently and the agent 'forgets' the original task. This is why you need to manage the context window explicitly, either by summarizing old messages or by using a sliding window.

agent_loop.py · PYTHON

import json
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal

# Define the structured output schema for the agent's response
class AgentAction(BaseModel):
    action: Literal["search_knowledge_base", "calculate", "final_answer"]
    action_input: str = Field(description="Input for the action")
    reasoning: str = Field(description="Why this action was chosen")

client = OpenAI()

# Tool implementations
def search_knowledge_base(query: str) -> str:
    # In production, this would call a real API
    return f"Results for '{query}': No relevant documents found."

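def calculate(expression: str) -> str:
    # WARNING: eval() executes arbitrary Python. Demo only; use a real expression parser in production.
    try:
        return str(eval(expression))
    except Exception as e:
        return f"Error: {e}"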

def execute_tool(action: AgentAction) -> str:
    if action.action == "search_knowledge_base":
        return search_knowledge_base(action.action_input)
    elif action.action == "calculate":
        return calculate(action.action_input)
    else:
        raise ValueError(f"Unknown action: {action.action}")

# Agent loop with bounded iterations
def run_agent(task: str, max_iterations: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for i in range(max_iterations):
        response = client.beta.chat.completions.parse(
            model="gpt-4o",
            messages=messages,
            response_format=AgentAction,
        )
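        action = response.choices[0].message.parsed
        if action is None:
            # The model refused or returned nothing that matches the schema
            return "Agent could not produce a valid action."
        if action.action == "final_answer":
            return action.action_input
        tool_result = execute_tool(action)
        messages.append({"role": "assistant", "content": f"Action: {action.action}\nInput: {action.action_input}\nReasoning: {action.reasoning}"})
        # A "tool" role message is only valid as a reply to an assistant message that
        # contains tool_calls, so the result is fed back as a user message instead.
        messages.append({"role": "user", "content": f"Tool result: {tool_result}"})
    return "Max iterations reached without final answer."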

print(run_agent("What is 2 + 2?"))

Structured output is not optional

If you parse the LLM's raw text output for actions, you will have parsing failures in production. Use response_format with pydantic models to guarantee a parseable response. We learned this when 2% of our agent calls crashed with JSONDecodeError at 2am.

Production Insight

A fraud detection system using a similar loop had a bug where the LLM returned 'final_answer' with an empty string. The agent returned an empty response to the user, causing a 23% drop in user satisfaction. The fix was to validate the final_answer content before returning it.

Key Takeaway

The agent loop is a finite state machine. Always bound it by iterations, time, and tokens. Use structured output to avoid parsing errors.
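One way to bound a loop on all three dimensions at once is a small guard object checked at the top of every iteration. This is a sketch under assumptions: `LoopGuard`, `LoopLimitExceeded`, and the default limits are hypothetical names and values, and the token count is whatever your LLM client reports per call.

```python
import time
from typing import Callable


class LoopLimitExceeded(RuntimeError):
    """Raised when the agent loop exceeds one of its hard limits."""


class LoopGuard:
    def __init__(
        self,
        max_iterations: int = 10,
        max_seconds: float = 120.0,
        max_tokens: int = 20_000,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_iterations = max_iterations
        self.max_seconds = max_seconds
        self.max_tokens = max_tokens
        self._clock = clock  # injectable so the guard can be tested without sleeping
        self._start = clock()
        self.iterations = 0
        self.tokens = 0

    def check(self, tokens_used: int = 0) -> None:
        """Call once per iteration, passing the tokens used by the previous step."""
        self.iterations += 1
        self.tokens += tokens_used
        if self.iterations > self.max_iterations:
            raise LoopLimitExceeded("iteration limit reached")
        if self._clock() - self._start > self.max_seconds:
            raise LoopLimitExceeded("time limit reached")
        if self.tokens > self.max_tokens:
            raise LoopLimitExceeded("token limit reached")
```

Inside a loop, create one guard per session and call `guard.check(tokens_used=...)` before each LLM call. Raising an exception rather than returning a flag makes it harder to forget the check's result.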

Practical Implementation: Building a Multi-Agent Orchestrator with LangGraph

LangGraph is a widely used framework for building multi-agent systems in production. It models agents as nodes in a directed graph, with edges defining the flow. The key insight is that each node is a function that takes state and returns a state update. The graph's executor runs the nodes, handling branching and cycles.

The gotcha is state management. Each node can modify the shared state, and if two nodes update the same key in the same step, the updates conflict. LangGraph handles this with a reducer pattern: you define how to merge updates to each state key. In production, we use a single reducer that appends to a list, so no data is lost.

Another gotcha: the graph's recursion limit. By default, LangGraph limits a run to 25 steps and raises GraphRecursionError when the limit is reached. If your agent needs more, you must raise the limit explicitly in the config passed to invoke. We hit this when a complex workflow required 30 steps and the run stopped at 25.

multi_agent_graph.py · PYTHON

from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated, List
import operator
import re

# Matches a simple arithmetic expression such as "3 * 7"
ARITHMETIC = re.compile(r"\d+(?:\.\d+)?(?:\s*[-+*/]\s*\d+(?:\.\d+)?)+")

# Define the shared state
class AgentState(TypedDict):
    messages: Annotated[List[dict], operator.add]  # Reducer appends to list
    next_agent: str
    final_output: str

# Node functions return partial state updates
def triage_agent(state: AgentState) -> dict:
    # In production, this would call an LLM to classify the task
    if ARITHMETIC.search(state["messages"][-1]["content"]):
        return {"next_agent": "math_agent"}
    else:
        return {"next_agent": "general_agent"}

def math_agent(state: AgentState) -> dict:
    # Evaluate only the matched expression; the regex limits it to digits and + - * /
    expression = ARITHMETIC.search(state["messages"][-1]["content"]).group(0)
    try:
        result = eval(expression)
    except (ZeroDivisionError, SyntaxError) as e:
        result = f"Error: {e}"
    return {"messages": [{"role": "assistant", "content": f"Result: {result}"}], "final_output": str(result)}

def general_agent(state: AgentState) -> dict:
    return {"messages": [{"role": "assistant", "content": "I can't help with that yet."}], "final_output": "Unsupported"}

def router(state: AgentState) -> str:
    # Decide which node to go to next
    if state["next_agent"] == "math_agent":
        return "math_agent"
    elif state["next_agent"] == "general_agent":
        return "general_agent"
    else:
        return END

# Build the graph
builder = StateGraph(AgentState)
builder.add_node("triage_agent", triage_agent)
builder.add_node("math_agent", math_agent)
builder.add_node("general_agent", general_agent)
builder.set_entry_point("triage_agent")
builder.add_conditional_edges("triage_agent", router)
builder.add_edge("math_agent", END)
builder.add_edge("general_agent", END)
graph = builder.compile()

# Run the graph
result = graph.invoke({"messages": [{"role": "user", "content": "What is 3 * 7?"}], "next_agent": "", "final_output": ""})
print(result["final_output"])

Always test with the recursion limit

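LangGraph's default recursion limit is 25. If your workflow might exceed this, set it at run time, for example graph.invoke(inputs, config={"recursion_limit": 100}); compile() does not take this setting. When the limit is hit, LangGraph raises GraphRecursionError, so catch and log it. We learned this when a complex customer support flow stopped after 25 steps.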

Production Insight

A customer support system using LangGraph had a bug where the triage agent returned an empty string for 'next_agent'. The graph crashed with a KeyError because it tried to route to an empty string. The fix was to validate the output of every node before using it as a routing key.
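A routing function can defend against this by checking the key against a known set before returning it. The sketch below is plain Python and could be passed to `add_conditional_edges` in place of a bare lookup; the names `ALLOWED_ROUTES`, `FALLBACK_ROUTE`, and `safe_router` are assumptions for illustration, and falling back to a general agent (rather than ending the run) is one possible policy.

```python
ALLOWED_ROUTES = {"math_agent", "general_agent"}
FALLBACK_ROUTE = "general_agent"


def safe_router(state: dict) -> str:
    """Return a known node name, falling back when the triage output is missing or invalid."""
    # Treat None, missing keys, and whitespace-only values the same way
    candidate = (state.get("next_agent") or "").strip()
    if candidate in ALLOWED_ROUTES:
        return candidate
    return FALLBACK_ROUTE
```

Logging every fallback is worthwhile, since a rising fallback rate usually means the triage prompt or schema has drifted.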

Key Takeaway

LangGraph is powerful but requires careful state management and routing validation. Always validate node outputs and set explicit recursion limits.

When NOT to Use AI Agents

AI agents are not the right tool for every problem. If your task is a simple, deterministic workflow (e.g., 'if this, then that'), use a rules engine or a simple script. Agents add latency, cost, and failure modes. Specifically, avoid agents when:

1) The decision logic is deterministic and well-defined.
2) The cost of a wrong action is high (e.g., deleting a database record).
3) You need guaranteed response times. LLM calls have unpredictable latency.
4) The task requires no external tools or data. A simple LLM call with a prompt is cheaper and faster.

We made this mistake with a password reset flow. We used an agent to decide whether to send a reset email. The agent sometimes decided to 'call the user' instead, which was not implemented. The fix was to replace the agent with a simple if-else statement.

dont_use_agent.py · PYTHON

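# Bad: Using an LLM for a deterministic decision
# This is over-engineered and fragile
from openai import OpenAI

def should_send_reset(email: str) -> str:
    response = OpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Should we send a password reset email to {email}?"}]
    )
    # The LLM might say 'yes', 'no', or 'call them'
    return response.choices[0].message.content

# Good: Use a simple rule
# send_reset_email and log_error stand in for your own helpers
def handle_reset_request(user_exists: bool, user_requested_reset: bool, user_email: str) -> None:
    if user_exists and user_requested_reset:
        send_reset_email(user_email)
    else:
        log_error("Invalid reset request")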

The 'Agent Hammer' trap

Just because you can build an agent doesn't mean you should. If a simple function will do the job, use it. Agents are for tasks that require reasoning, tool use, and adaptation — not for boolean checks.

Production Insight

A deployment pipeline used an agent to decide whether to roll back a failed deployment. The agent once decided to 'wait and see' instead of rolling back, causing a 15-minute outage. The fix was to replace the agent with a deterministic rollback policy.

Key Takeaway

Use agents only when the decision logic is genuinely non-deterministic and requires reasoning. For everything else, use a rule or a script.

Production Patterns & Scale: Caching, Rate Limiting, and Observability

At scale, AI agents consume a lot of resources. A single agent doing 10 tool calls per session, with 1,000 sessions per hour, generates 10,000 tool calls per hour. If each tool call takes 500ms, that's 5,000 seconds of compute time per hour. You need caching for repeated tool calls (e.g., knowledge base lookups for the same query). You need rate limiting to protect downstream APIs. And you need observability to debug failures. The most important metric is token usage per session. Set alerts for sessions that exceed 10,000 tokens. Also track tool call latency and error rates. We use OpenTelemetry to trace every LLM call and tool execution. The trace includes the input, output, latency, and token count. This allows us to replay any session for debugging.

production_agent.py · PYTHON

import time
from functools import lru_cache

# Simple in-memory cache for tool results
@lru_cache(maxsize=1000)
def cached_search(query: str) -> str:
    # Simulate a slow API call
    time.sleep(0.5)
    return f"Result for {query}"

# Rate limiter using a token bucket
class RateLimiter:
    def __init__(self, max_calls: int, period: float):
        self.max_calls = max_calls
        self.period = period
        self.tokens = max_calls
        self.last_refill = time.monotonic()

    def acquire(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.max_calls, self.tokens + elapsed * (self.max_calls / self.period))
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = RateLimiter(max_calls=10, period=1.0)  # 10 calls per second

def rate_limited_search(query: str) -> str:
    if not limiter.acquire():
        raise Exception("Rate limit exceeded. Retry later.")
    return cached_search(query)

# Production agent loop with observability
def run_agent_with_telemetry(task: str) -> str:
    start_time = time.monotonic()
    token_count = 0
    try:
        result = run_agent(task)  # From earlier example
        # Rough proxy: this counts characters, not tokens. In production, sum the usage reported by each LLM response.
        token_count = len(task) + len(result)
        return result
    finally:
        elapsed = time.monotonic() - start_time
        # In production, send to OpenTelemetry
        print(f"TRACE: task={task[:50]}, duration={elapsed:.2f}s, tokens={token_count}")

Caching tool results can cause staleness

If the underlying data changes, cached results become stale. Use a TTL on cache entries. We once cached knowledge base results for 24 hours, and users got outdated information for a day.
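Note that `functools.lru_cache` evicts by size only and has no expiry. A small TTL cache can be sketched as follows; `TTLCache` and `get_or_compute` are hypothetical names, and the injectable clock exists only to make the expiry logic testable.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (stored_at, value)

    def get_or_compute(self, key: str, compute: Callable[[], str]) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # still fresh
        value = compute()  # missing or expired: recompute and store
        self._entries[key] = (now, value)
        return value
```

For example, `cache.get_or_compute(query, lambda: search_knowledge_base(query))` with a TTL of a few minutes keeps repeated lookups cheap while bounding how stale an answer can get.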

Production Insight

A recommendation engine serving 2M req/day started returning stale results after a schema migration. The agent was caching the old schema's responses. The fix was to invalidate the cache on schema changes.

Key Takeaway

Caching, rate limiting, and observability are non-negotiable for production agents. Cache with TTL, rate limit per downstream API, and trace every call.

Common Mistakes with Specific Examples

Mistake 1: Not validating tool inputs. An agent might call a tool with an injection payload (SQL, shell, or URL parameters) if you're not careful. Always validate inputs and use parameterized queries.

Mistake 2: Ignoring the context window. If the agent's context exceeds the model's limit, the request fails or older messages are truncated, depending on your framework. Truncation causes the agent to 'forget' the original task.

Mistake 3: Using a single agent for everything. A single agent with too many tools becomes confused. Split responsibilities across specialized agents.

Mistake 4: Not handling tool failures gracefully. If a tool returns an error, the agent might retry indefinitely or crash. Implement retries with backoff and a max retry count.

Mistake 5: Not testing with real-world data. Synthetic tests don't capture the ambiguity of real user queries. We once tested with 'What is the weather?' and deployed, only to find that real users asked 'What's the weather like in Tokyo next Tuesday?' which required a date parser the agent didn't have.

common_mistakes.py · PYTHON

import re
import requests

# Mistake 1: Not validating tool inputs
# Bad: raw input is interpolated into the URL, so characters like & or # can alter the request
def search(query: str):
    return requests.get(f"https://api.example.com/search?q={query}").json()

# Good: Validate and sanitize, then let requests encode the parameter
def safe_search(query: str):
    sanitized = re.sub(r'[^a-zA-Z0-9 ]', '', query)[:200]
    return requests.get("https://api.example.com/search", params={"q": sanitized}, timeout=10).json()

# Mistake 2: Not handling context window
# Bad: Appending all messages indefinitely
#   messages.append(tool_result)

# Good: Summarize or truncate old messages
def summarize_messages(messages: list, max_chars: int = 12000) -> list:
    # In production, use an LLM to summarize and a tokenizer to count tokens;
    # character length is only a rough proxy for token count.
    if len(str(messages)) <= max_chars:
        return messages
    # Keep the original task plus the four most recent messages
    return messages[:1] + messages[1:][-4:]

SQL injection via agent tools is real

An agent can be tricked into calling a tool with malicious input. If your tool executes SQL, use parameterized queries. We saw a proof-of-concept where an agent was prompted to 'search for users; DROP TABLE users;' and the tool executed it.
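A parameterized query can be sketched with the standard-library `sqlite3` module. The `users` table, `make_demo_db`, and `find_users` are made up for this example; the same placeholder idea applies to other database drivers, though their placeholder syntax may differ.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a small, made-up users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    conn.commit()
    return conn


def find_users(conn: sqlite3.Connection, name: str) -> list:
    # The ? placeholder makes the driver treat `name` strictly as data, never as SQL.
    rows = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return rows.fetchall()
```

With this shape, the malicious string from the proof-of-concept is just a name that matches no rows, and the table is untouched.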

Production Insight

A customer support agent had a tool that sent emails. A user prompted the agent to 'send an email to ceo@company.com saying the product is broken'. The agent sent the email without validation. The fix was to add a confirmation step for any action that has side effects.
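A confirmation step can sit in the tool executor itself, so no prompt can talk the agent past it. In this sketch, `SIDE_EFFECT_TOOLS`, `execute_with_confirmation`, and the `confirm` callback are hypothetical; in practice `confirm` might ask a human, check a policy, or verify the recipient against the ticket owner.

```python
from typing import Callable, Dict

# Tools that change the outside world and therefore need explicit approval
SIDE_EFFECT_TOOLS = {"send_email", "delete_record"}


def execute_with_confirmation(
    tool_name: str,
    tool_input: str,
    tools: Dict[str, Callable[[str], str]],
    confirm: Callable[[str, str], bool],
) -> str:
    """Run a tool only if it is registered and, for side effects, approved."""
    if tool_name not in tools:
        return f"Error: unknown tool '{tool_name}'"
    if tool_name in SIDE_EFFECT_TOOLS and not confirm(tool_name, tool_input):
        return f"Blocked: '{tool_name}' requires confirmation"
    return tools[tool_name](tool_input)
```

Returning an error string instead of raising lets the agent see the refusal in its next turn and choose a different action; the registry lookup doubles as an allowlist against hallucinated tool names.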

Key Takeaway

Always validate tool inputs, manage context windows, and split responsibilities across agents. Test with real-world data, not just synthetic examples.

Comparison vs Alternatives: Agents vs RAG vs Fine-Tuning

Agents are not always the best solution. For question-answering over a fixed knowledge base, RAG (Retrieval-Augmented Generation) is simpler and more reliable. For specialized tasks with fixed output formats, fine-tuning a model is cheaper and faster. Agents are best when the task requires multiple steps, tool use, and adaptation. The trade-off is complexity and cost. A RAG system costs ~$0.01 per query. An agent costs ~$0.10 per query. But an agent can handle tasks a RAG system cannot, like booking a flight or debugging code. The decision matrix: if the task is a single-turn Q&A, use RAG. If the task is multi-turn with tool use, use an agent. If the task is a fixed, repetitive pattern, fine-tune a model.

rag_vs_agent.py · PYTHON

# RAG example: Simple and cheap
from openai import OpenAI
client = OpenAI()

def rag_query(query: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model
        messages=[
            {"role": "system", "content": f"You are a helpful assistant. Use this context to answer: {context}"},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content

# Agent example: More capable but expensive
def agent_query(query: str) -> str:
    # Agent loop with tool calls
    return run_agent(query)  # From earlier example

# When to use what:
# - RAG: "What is the capital of France?" (single-turn, factual)
# - Agent: "Book a flight to Paris for next Tuesday, then email me the confirmation" (multi-step, tool use)

Don't use an agent for a RAG problem

If your task is 'answer a question from a document', use RAG. An agent adds unnecessary latency and cost. We've seen teams build agents for simple Q&A, and it was 10x more expensive with no benefit.

Production Insight

A legal document analysis system used an agent to answer questions about contracts. The agent was slow and expensive. Switching to RAG with a fine-tuned embedding model reduced costs by 90% and improved latency from 5s to 0.5s.

Key Takeaway

Choose the simplest solution that works. RAG for Q&A, agents for multi-step tasks, fine-tuning for fixed patterns. Don't over-engineer.

Debugging and Monitoring AI Agents in Production

Debugging an agent in production is hard because the behavior is non-deterministic. The same input can produce different outputs. You need to log everything: the LLM response, the tool inputs and outputs, the state at each step, and the final output. Use a trace ID to correlate all logs for a single session. The most common debugging scenario is 'the agent returned the wrong answer'. You need to replay the session step by step. We built a replay tool that takes a trace ID and re-executes the agent with the same inputs, printing each step. This allows us to see exactly where the agent went wrong. Another common issue is 'the agent is slow'. Profile each step: LLM call latency, tool call latency, and state processing time. We found that 80% of latency was from LLM calls, and 20% from tool calls.

debug_agent.py · PYTHON

import json
import time
from functools import wraps

# Trace decorator to log every step
def trace_step(func):
    @wraps(func)  # Keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        result = func(*args, **kwargs)
        elapsed = time.monotonic() - start
        trace = {
            "function": func.__name__,
            "args": args,
            "kwargs": kwargs,
            "result": str(result)[:200],
            "duration": elapsed
        }
        # default=str keeps logging from crashing on non-JSON-serializable arguments
        print(f"TRACE: {json.dumps(trace, default=str)}")
        return result
    return wrapper

@trace_step
def llm_call(prompt: str) -> str:
    # Simulate LLM call
    time.sleep(0.5)
    return f"Response to: {prompt[:50]}"

@trace_step
def tool_call(name: str, input: str) -> str:
    # Simulate tool call
    time.sleep(0.2)
    return f"Result from {name}"

# Replay a session from a saved trace file
def replay_trace(path: str) -> None:
    with open(path, "r") as f:
        trace_data = json.load(f)

    for step in trace_data["steps"]:
        print(f"Step {step['step']}: {step['function']} called with {step['args']}")
        print(f"  Result: {step['result']}")
        print(f"  Duration: {step['duration']:.2f}s")

# Example: replay_trace("trace_123.json")

Replay is your best debugging tool

Build a replay tool that takes a trace ID and re-executes the agent with the same inputs. This allows you to reproduce bugs in development. We use this for every production incident.

Production Insight

A customer support agent was returning the wrong answers for 5% of queries. By replaying the traces, we found that the agent was using a cached tool result from a different session. The cache key was not including the user ID. The fix was to include the user ID in the cache key.
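A cache key that cannot leak across users might look like the sketch below. `make_cache_key` and `ScopedToolCache` are illustrative names, and hashing the JSON-serialized input is one reasonable way to build the key, not necessarily what the original system did.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def make_cache_key(user_id: str, tool_name: str, tool_input: Dict[str, Any]) -> str:
    """Build a key that is unique per user, tool, and input."""
    # sort_keys makes logically equal inputs hash to the same value
    canonical = json.dumps(tool_input, sort_keys=True, default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{user_id}:{tool_name}:{digest}"


class ScopedToolCache:
    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def call(self, user_id: str, tool_name: str, tool_input: Dict[str, Any],
             tool: Callable[..., Any]) -> Any:
        key = make_cache_key(user_id, tool_name, tool_input)
        if key not in self._store:
            self._store[key] = tool(**tool_input)  # cache miss: run the real tool
        return self._store[key]
```

Tools whose results are the same for every user (for example, a public knowledge base) could share a constant scope instead of the user ID to improve the hit rate.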

Key Takeaway

Trace every step of the agent. Build a replay tool to reproduce bugs. Profile latency to find bottlenecks.

● Production incident · POST-MORTEM · severity: high

The Runaway Agent: $340 in 47 Minutes

Symptom

PagerDuty alert: 'OpenAI API cost anomaly — 10x above baseline'. Grafana showed a single agent session with 1,247 tool calls to the knowledge base API in 47 minutes.

Assumption

We assumed the agent's LLM would naturally converge on a solution or ask for help. We thought the 'max_tokens' parameter would limit the total cost, but it only caps the output length of a single LLM response, not the number of calls in a session.

Root cause

The agent loop had no maximum iteration count. The while loop in the orchestrator ran until the LLM returned a 'final_answer' action. The LLM kept generating 'search_knowledge_base' actions because the results were always ambiguous.

Fix

1. Added a max_iterations=10 parameter to the agent loop.
2. Implemented a timeout of 120 seconds per agent session.
3. Added a circuit breaker that kills the agent after 5 consecutive failed tool calls.
4. Logged all tool call inputs and outputs for post-mortem analysis.

```python
# Before: no iteration cap and no timeout
action = llm.invoke(messages)
while action.type != "final_answer":
    result = execute_tool(action)
    messages.append(result)
    action = llm.invoke(messages)

# After: bounded by iterations, wall-clock time, and consecutive errors
start_time = time.monotonic()
consecutive_errors = 0
for i in range(MAX_ITERATIONS):
    if time.monotonic() - start_time > TIMEOUT_SECONDS:
        raise TimeoutError("Agent exceeded timeout")
    action = llm.invoke(messages)
    if action.type == "final_answer":
        break
    if action.type == "error":
        consecutive_errors += 1
        if consecutive_errors >= 5:
            raise CircuitBreakerError("Too many consecutive errors")
    else:
        consecutive_errors = 0
    result = execute_tool(action)
    messages.append(result)
```

Key lesson

Always set a hard limit on agent iterations and wall-clock time before deploying to production.
Monitor token usage per session and alert on anomalies — not just total cost.
Implement circuit breakers for tool calls; a flaky API should not crash the entire agent.

Production debug guide
When your agent is stuck in a loop at 2am.

Symptom · 01: Agent is running but not returning results

Fix: Check the agent loop iteration count. Run kubectl logs <pod> | grep 'iteration' | tail -20 to see if it's stuck in a loop. If the iteration count is > 10, you have a loop.

Symptom · 02: High OpenAI API costs without corresponding user activity

Fix: Query the trace database to find the runaway session:

SELECT session_id, COUNT(*) as calls, SUM(token_count) as tokens FROM agent_traces WHERE timestamp > NOW() - INTERVAL '1 hour' GROUP BY session_id ORDER BY tokens DESC LIMIT 5;

Symptom · 03: Agent returns nonsensical or empty responses

Fix: Inspect the last 5 messages in the agent's context window. Use python -c "import json; data=json.load(open('agent_messages.json')); print(json.dumps(data[-5:], indent=2))" to see if the context was truncated or corrupted.

Symptom · 04: Tool calls failing with 500 errors

Fix: Check the tool's health endpoint and rate limits. Run curl -s -o /dev/null -w "%{http_code}" https://api.example.com/health to verify the tool is up. If rate limited, add exponential backoff to the tool executor.
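Exponential backoff in a tool executor can be sketched as below. `call_with_backoff` and its defaults are assumptions for illustration; the injectable `sleep` is there so the retry schedule can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    tool: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `tool`, retrying failures with exponentially growing delays plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return tool()
        except Exception:
            # In production, catch only retryable errors (e.g., HTTP 429 or 5xx).
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            # 0.5s, 1s, 2s, ... plus random jitter so clients do not retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

The hard cap on retries matters as much as the delay: an agent that retries a failing tool forever is the same runaway loop in a different place.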

★ AI Agents Explained Triage Cheat Sheet
Copy-paste diagnostics. When it's 2am and you need answers fast.

Agent stuck in loop

Immediate action

Kill the agent session and check iteration count

Commands

kubectl exec <pod> -- cat /proc/<pid>/fd/1 | grep 'iteration' | tail -5

python -c "import json; traces=json.load(open('traces.json')); print([t for t in traces if t['iterations'] > 10])"

Fix now

Set max_iterations=10 in the agent config and restart the pod.


AI Agents vs RAG vs Fine-Tuning

Concern	AI Agents	RAG	Fine-Tuning	Recommendation
Latency	High (multiple LLM calls + tool execution)	Medium (single LLM call + retrieval)	Low (single LLM call)	Use RAG or fine-tuning for real-time apps
Cost per query	High (multiple tokens + tool API costs)	Medium (retrieval + LLM call)	Low (single LLM call)	Use agents only when dynamic tool use is required
Accuracy on dynamic data	High (can fetch live data)	High (retrieves from updated index)	Low (static knowledge cutoff)	Use agents or RAG for live data
Complexity to implement	High (state machine, tool registry, error handling)	Medium (embedding pipeline, retriever)	Medium (data prep, training pipeline)	Start with RAG, add agents only if needed
Debugging difficulty	Very high (multi-step, non-deterministic)	Medium (retrieval quality issues)	Low (model outputs only)	Avoid agents unless you have dedicated observability
Best use case	Multi-step tasks requiring tool use (e.g., booking flights)	Question answering over a knowledge base	Consistent style or domain-specific tasks	Match to your primary requirement

Key takeaways

Agents are not magic: they're a loop of LLM call → tool execution → state update. The loop must have a hard max iterations (e.g., 10) and a timeout to prevent runaway costs.

Never let an agent call itself or another agent without a circuit breaker: our 3am incident was an agent re-invoking the same knowledge base tool with no iteration cap, 1,247 calls in 47 minutes.

Cache tool outputs aggressively (keyed by tool name + input hash): identical tool calls from different agents in the same conversation are a major source of wasted latency and cost.

Observability must include per-step token usage, tool call latency, and agent state snapshots: without these, you can't tell which agent caused the deadlock until it's too late.

Use a centralized orchestrator with a state machine (LangGraph's StateGraph) rather than letting agents communicate directly: direct agent-to-agent messages bypass your safety rails and make debugging impossible.

Common mistakes to avoid


No max iterations on agent loop

Symptom

Agent runs forever, burning tokens and API credits; context window fills with repeated tool calls.

Fix

Set a hard limit (e.g., max_iterations=10) and a wall-clock timeout (e.g., timeout_seconds=120) on every agent loop. In LangGraph, keep a step counter in state and use add_conditional_edges to route to END after N steps.

Allowing agents to call themselves or other agents without a circuit breaker

Symptom

Infinite recursion: Agent A calls Tool B, which returns a result that triggers Agent A to call Tool B again, ad infinitum. One case we saw: a weather agent called 'get_weather', which returned JSON that the agent parsed as a tool call to itself.

Fix

Implement a call depth counter in the agent state. If depth > 1, reject the tool call and return an error. Use a registry of allowed agent-to-agent calls with explicit permissions.
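The depth counter and permission registry can be sketched in a few lines. The agent names, `ALLOWED_CALLS`, and `authorize_call` are hypothetical; the state is a plain dict here, but the same check fits inside a LangGraph node.

```python
from typing import Dict, Set, Tuple


class AgentCallRejected(RuntimeError):
    """Raised when an agent-to-agent call is not permitted."""


MAX_CALL_DEPTH = 1

# Explicit registry of (caller, callee) pairs that are allowed
ALLOWED_CALLS: Set[Tuple[str, str]] = {
    ("orchestrator", "triage_agent"),
    ("orchestrator", "kb_agent"),
}


def authorize_call(state: Dict, caller: str, callee: str) -> Dict:
    """Return a new state with call_depth incremented, or raise if the call is not allowed."""
    if (caller, callee) not in ALLOWED_CALLS:
        raise AgentCallRejected(f"{caller} may not call {callee}")
    depth = state.get("call_depth", 0) + 1
    if depth > MAX_CALL_DEPTH:
        raise AgentCallRejected(f"call depth {depth} exceeds limit {MAX_CALL_DEPTH}")
    return {**state, "call_depth": depth}
```

Because the registry lists pairs explicitly, a self-call such as `("triage_agent", "triage_agent")` is rejected unless someone deliberately adds it.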

No caching of tool outputs

Symptom

Identical tool calls (e.g., 'get_stock_price(AAPL)') made 50 times in a single conversation, each taking 2-5 seconds and costing $0.01 per call.

Fix

Add an LRU cache (e.g., functools.lru_cache or Redis) keyed by (tool_name, hash(input)). Invalidate on conversation reset. Our fix: 98% cache hit rate, reduced average tool latency from 3s to 15ms.

No per-step observability on agent state

Symptom

When the orchestrator deadlocks, you have no idea which agent was running, what tool it called, or what the state looked like. Debugging requires replaying the entire conversation.

Fix

Log every agent step: (agent_id, step_number, tool_call, tool_result, token_usage, state_snapshot). Use structured logging (JSON) and ship to a centralized observability platform (Datadog, Grafana). Add a 'step_id' to trace across agents.
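A per-step record along these lines can be produced with the standard library alone. The field names follow the tuple above; `make_step_record`, the `trace_id` argument, and the 500-character truncation are assumptions made for this sketch.

```python
import json
import time
from typing import Any, Dict


def make_step_record(
    trace_id: str,
    agent_id: str,
    step_number: int,
    tool_call: Dict[str, Any],
    tool_result: Any,
    token_usage: Dict[str, int],
    state_snapshot: Dict[str, Any],
) -> str:
    """Serialize one agent step as a single JSON log line."""
    record = {
        "timestamp": time.time(),
        "trace_id": trace_id,
        # step_id lets you follow one step across agents and services
        "step_id": f"{trace_id}:{agent_id}:{step_number}",
        "agent_id": agent_id,
        "step_number": step_number,
        "tool_call": tool_call,
        "tool_result": str(tool_result)[:500],  # truncate large payloads
        "token_usage": token_usage,
        "state_snapshot": state_snapshot,
    }
    # default=str avoids crashes on values json cannot serialize natively
    return json.dumps(record, default=str)
```

Emitting one JSON object per line keeps the output easy to ship to most log collectors, which commonly parse line-delimited JSON.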

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR

Explain how an AI agent works under the hood. What is the core loop?

Q02 · SENIOR

How would you design a multi-agent orchestrator that prevents infinite l...

Q03 · SENIOR

What are the trade-offs between using a single agent with many tools vs ...

Q04 · SENIOR

How do you debug a multi-agent system that produces inconsistent results...

Q05 · SENIOR

Design a caching strategy for tool outputs in a multi-agent system. What...

Q01 of 05 · JUNIOR

Explain how an AI agent works under the hood. What is the core loop?

ANSWER

The core loop is: (1) LLM receives a prompt + conversation history + available tool definitions. (2) LLM outputs either a final answer or a tool call (function name + arguments). (3) If tool call, execute the tool, append result to history, go to step 1. (4) If final answer, return to user. This is a simple state machine, not 'magic': the control flow is deterministic even though the LLM's choices are not. The LLM is just a planner; the tools do the work.
