Senior 7 min · May 22, 2026

ReAct Agent Pattern — Why Your Agent Loops Forever at 3am and How to Fix It

Learn the ReAct agent pattern from a production perspective: how it works, common failures, debugging strategies, and a real incident where a loop cost $4k in token overrun.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Reasoning+Acting Loop: The agent thinks, acts, observes, and repeats until it has a final answer. In production, this loop can run indefinitely without a max iteration limit.
  • Tool Call Formatting: The model must output tool calls in a strict JSON schema. A single malformed action can crash the loop or cause silent retries.
  • Observation Truncation: Tool outputs (e.g., API responses) can exceed the model's context window. Truncate or summarize observations to avoid token blowout.
  • State Management: Each iteration appends to the conversation history. Without pruning, you hit the context limit after 5-10 turns with large tool results.
  • Error Handling: Tool failures (timeouts, rate limits) must be passed as observations, not exceptions. The model needs to decide retry vs. alternative.
  • Cost Control: Each loop iteration is an LLM call. A bug that causes 50 iterations costs 50x the base call. Set a hard limit and monitor token usage.
✦ Definition~90s read
What Is the ReAct Agent Pattern?

The ReAct (Reasoning + Acting) pattern is an agent architecture that interleaves chain-of-thought reasoning with tool calls in a structured loop. Instead of a monolithic prompt that tries to guess the entire solution upfront, ReAct agents explicitly output a 'Thought' (what they know, what they need), an 'Action' (which tool to call and with what arguments), and an 'Observation' (the tool's result), then repeat.

This loop reduces hallucinated answers by forcing the agent to ground each reasoning step in real data from external tools: APIs, databases, calculators, or search engines. The pattern was formalized in the 2022 paper 'ReAct: Synergizing Reasoning and Acting in Language Models' by Yao et al., and it's the foundation behind agent loops such as LangChain's AgentExecutor and AutoGPT's core loop.

Where this pattern shines is in tasks requiring multi-step reasoning with external verification: answering questions that need database lookups, performing calculations, or navigating APIs where each step depends on the previous result. But it's not a silver bullet.

The ReAct loop is inherently sequential and token-expensive—each iteration burns LLM context for the full history of thoughts, actions, and observations. For simple single-step tasks (e.g., 'translate this sentence'), a direct tool call or a simpler pattern like function calling is faster and cheaper.

The loop also fails catastrophically when tools return ambiguous or error states: without explicit error handling, the agent will retry the same failing action forever, burning tokens and API costs. That's the 'loops forever at 3am' problem—the agent has no built-in termination criteria beyond max iterations, so a malformed tool output or a missing API key sends it into an infinite reasoning spiral.

In practice, ReAct is the default pattern for agents that need to 'think before they leap': it is a common architecture for coding assistants with tool access and for retrieval-augmented generation (RAG) agents that query vector stores. But you should reach for alternatives like Plan-and-Solve (which plans all steps upfront, then executes) when the action sequence is predictable, or Tree-of-Thoughts when you need to explore multiple reasoning branches.

The key insight: ReAct trades efficiency for robustness. It's the right choice when you can't predict the tool call order and need the LLM to adapt dynamically, but you must pair it with strict iteration limits, timeout guards, and observation validation to prevent the 3am spiral.

[Diagram: ReAct Agent Pattern architecture. 1 User Query (task input) → 2 Reason (think step-by-step) → 3 Act (call tool / API) → 4 Observe (parse tool output), looping back to Reason with the action result until done → 5 Final Answer (return to user).]
Plain-English First

Think of a ReAct agent like a detective solving a case. The detective thinks about what clue they need (Reason), goes to find it (Act), reads the clue (Observe), then decides if they can solve the case or need more clues. The loop repeats until they have enough evidence. If the detective never decides they're done, they'll keep searching forever — and that's exactly what happens when you forget to set a max iteration limit in production.

You've built a chatbot that can search the web, query databases, and call APIs. It works in your demo — three turns and it answers perfectly. Then you deploy it to production, and at 2am your pager goes off: the agent has been looping for 47 iterations, burned through $400 in tokens, and returned nothing. Welcome to the ReAct agent pattern, where the gap between a tutorial and production is a single missing max_iterations parameter.

Most tutorials show you the loop: Thought, Action, Observation, repeat. They hand-wave the hard parts: what happens when the model outputs malformed JSON for a tool call, when an API returns 50KB of data that blows your context window, or when the model decides it needs to search again and again because it never learned to stop. These aren't edge cases — they're the norm at scale.

This article covers the ReAct pattern from the inside out: how the loop works under the hood, how to implement it with proper error handling and cost controls, when not to use it (yes, there are better patterns), and the exact debugging steps we used when our recommendation engine's agent loop cost us $4k in one night. You'll get runnable code, real incident details, and a triage cheat sheet for that 2am page.

How the ReAct Loop Actually Works Under the Hood

The ReAct pattern is deceptively simple: a loop where the model generates a thought, decides an action, executes it, and observes the result. But the production reality is more nuanced. The loop is essentially a state machine with three states: REASONING, ACTING, OBSERVING. The model's output determines the transition. If it outputs a final answer, you're done. If it outputs a tool call, you execute and feed back the result.
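
The three-state view above can be written down explicitly. This is only a sketch under stated assumptions: the `State` enum, the extra `DONE` terminal state, and `next_state` are hypothetical names, and the model output is assumed to be already parsed into a dict with either an `answer` or an `action` key.

```python
from enum import Enum, auto
from typing import Optional


class State(Enum):
    REASONING = auto()   # waiting for the model to produce its next output
    ACTING = auto()      # a tool call was requested and must be executed
    OBSERVING = auto()   # the tool result is being fed back to the model
    DONE = auto()        # terminal state, added here for illustration


def next_state(state: State, parsed: Optional[dict] = None) -> State:
    """Return the state that follows `state`.

    `parsed` is only inspected in REASONING, where the model output
    decides whether the loop finishes or a tool runs.
    """
    if state is State.REASONING:
        parsed = parsed or {}
        if "answer" in parsed:
            return State.DONE
        if "action" in parsed:
            return State.ACTING
        return State.REASONING  # malformed output: ask the model again
    if state is State.ACTING:
        return State.OBSERVING
    if state is State.OBSERVING:
        return State.REASONING
    return State.DONE
```

Modelling the transitions this way makes it obvious that nothing in the machine itself forces a move to `DONE`; the iteration cap has to come from the surrounding loop.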

The critical detail: the model doesn't 'know' it's in a loop. It sees a flat conversation history. Each iteration appends the previous thought, action, and observation. The model generates the next token based on this growing context. This means the loop's behavior changes as history grows — early iterations are crisp, later ones can become repetitive as the context window fills.

Most implementations hide the raw token-level mechanics. The model outputs a JSON blob like {"action": "search", "query": "latest news"}. Your code parses this, calls the tool, and appends the result. But if the model outputs {"action": "search", "query": ""} — empty query — your tool might return garbage or error. You need to validate the action schema before executing.

Another hidden gotcha: the model can output multiple actions in one response. Some frameworks handle this (ReAct with tool calling), but a naive implementation expects exactly one action per loop iteration. If the model outputs two actions, you'll parse only the first and lose the second. The fix is to either enforce single-action output in the prompt or parse all actions and execute them sequentially.
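
One way to handle both gotchas (empty arguments and multiple actions) is to normalize the model output before executing anything. The sketch below assumes the nested `{"tool": ..., "args": {...}}` action shape and accepts either a single action or a list under `action`; `normalize_actions` is a hypothetical helper name, not part of any framework.

```python
from typing import Any


def normalize_actions(parsed: dict[str, Any], known_tools: set[str]) -> tuple[list[dict], list[str]]:
    """Split the requested actions into valid ones and error messages.

    Accepts either one action object or a list of them under "action".
    """
    raw = parsed.get("action")
    candidates = raw if isinstance(raw, list) else [raw]
    valid: list[dict] = []
    errors: list[str] = []
    for i, action in enumerate(candidates):
        if not isinstance(action, dict) or not isinstance(action.get("tool"), str):
            errors.append(f"action {i}: must be an object with a 'tool' field")
            continue
        tool, args = action["tool"], action.get("args", {})
        if tool not in known_tools:
            errors.append(f"action {i}: unknown tool '{tool}'")
        elif not isinstance(args, dict):
            errors.append(f"action {i}: 'args' must be an object")
        elif any(isinstance(v, str) and not v.strip() for v in args.values()):
            errors.append(f"action {i}: empty string argument for '{tool}'")
        else:
            valid.append({"tool": tool, "args": args})
    return valid, errors
```

The valid actions can then be executed sequentially, while the error strings are sent back to the model as observations so it can correct itself.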

react_loop_internals.py (Python)
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)

class ReActLoop:
    def __init__(self, model, tools: dict[str, Any], max_iterations: int = 10):
        self.model = model
        self.tools = tools
        self.max_iterations = max_iterations
        self.history: list[dict] = []

    def run(self, user_query: str) -> str:
        self.history.append({"role": "user", "content": user_query})
        for step in range(self.max_iterations):
            response = self.model.generate(self.history)
            # Parse the model's output — expect JSON with 'thought' and 'action' or 'answer'
            try:
                parsed = json.loads(response)
            except json.JSONDecodeError as e:
                logger.warning(f"Malformed JSON at step {step}: {e}. Raw: {response[:200]}")
                # Send correction prompt
                self.history.append({"role": "assistant", "content": response})
                self.history.append({"role": "user", "content": "Your output was not valid JSON. Please output a JSON object with 'thought' and 'action' or 'answer'."})
                continue
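            # json.loads can also return a list, string, or number; only an object is valid here
            if not isinstance(parsed, dict):
                self.history.append({"role": "assistant", "content": response})
                self.history.append({"role": "user", "content": "Your output must be a JSON object with 'thought' and 'action' or 'answer'."})
                continue
            if "answer" in parsed:
                return parsed["answer"]
            if not isinstance(parsed.get("action"), dict) or "tool" not in parsed["action"]: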
                logger.warning(f"Missing action/tool at step {step}: {parsed}")
                self.history.append({"role": "assistant", "content": response})
                self.history.append({"role": "user", "content": "Your output must include an 'action' with a 'tool' field. Please try again."})
                continue
            tool_name = parsed["action"]["tool"]
            tool_args = parsed["action"].get("args", {})
            if tool_name not in self.tools:
                logger.error(f"Unknown tool '{tool_name}' at step {step}")
                observation = f"Error: tool '{tool_name}' is not available."
            else:
                try:
                    observation = self.tools[tool_name](**tool_args)
                except Exception as e:
                    observation = f"Tool error: {str(e)}"
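            # Append the model's turn and the tool result to the history.
            # Most chat APIs reject custom roles such as "observation", so the
            # result goes back as a user message with a clear prefix.
            self.history.append({"role": "assistant", "content": response})
            self.history.append({"role": "user", "content": f"Observation: {observation}"})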
        raise MaxIterationsExceeded(f"Agent did not produce a final answer within {self.max_iterations} steps.")

class MaxIterationsExceeded(Exception):
    pass
Always Validate the Action Schema
Don't assume the model will output a valid tool call. Parse the JSON, check for required fields ('tool', 'args'), and handle missing or malformed fields gracefully. A silent failure here leads to infinite loops.
Production Insight
A recommendation engine serving 2M req/day started returning stale results after a schema migration. The model's tool call output changed from {"action": "get_recommendations", "user_id": 123} to {"action": "get_recommendations", "user_id": 123, "extra": "field"}. The parser was strict and rejected any extra fields, causing the agent to retry indefinitely. We fixed it by using a lenient parser that ignores unknown fields.
Key Takeaway
The ReAct loop is a state machine driven by model output. Validate the output schema, handle malformed JSON with a correction prompt, and always set a max iteration limit.
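
The lenient parsing described in the production insight above can be approximated by dropping arguments a tool does not declare, rather than rejecting the whole call. This is a sketch, not the fix the team actually shipped: `filter_known_args` is a hypothetical helper and `get_recommendations` is a stand-in tool.

```python
import inspect
from typing import Any, Callable


def filter_known_args(tool: Callable[..., Any], args: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Keep only the arguments the tool's signature accepts.

    Returns the filtered arguments and the names that were dropped,
    so the drop can be logged instead of triggering a retry.
    """
    params = inspect.signature(tool).parameters
    # A tool that declares **kwargs accepts everything.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(args), []
    accepted = {
        name
        for name, p in params.items()
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    }
    kept = {k: v for k, v in args.items() if k in accepted}
    dropped = sorted(k for k in args if k not in accepted)
    return kept, dropped


# Hypothetical tool, used only to illustrate the behaviour.
def get_recommendations(user_id: int) -> list[str]:
    return [f"item-for-{user_id}"]


kept, dropped = filter_known_args(get_recommendations, {"user_id": 123, "extra": "field"})
result = get_recommendations(**kept)
```

Logging `dropped` keeps the schema drift visible, which matters because silently ignoring fields can hide a real prompt or migration bug.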

Practical Implementation: Building a ReAct Agent from Scratch

Let's build a ReAct agent that can search the web and calculate math. We'll use OpenAI's GPT-4 for the model and a simple web search tool. The key is to structure the prompt so the model knows exactly what format to output and when to stop.

The prompt should include:
  • Available tools with descriptions
  • The expected output format (JSON with 'thought' and 'action' or 'answer')
  • A stop condition: 'You must answer within {max_iterations} steps.'
  • Examples of valid tool calls and final answers

We'll also add proper error handling: if the tool fails, we return the error as an observation so the model can decide to retry or use a different approach. This is crucial in production where APIs fail.

build_react_agent.py (Python)
import json
import requests
from openai import OpenAI

# Initialize the LLM client
client = OpenAI()  # Reads the API key from the OPENAI_API_KEY environment variable

def search_web(query: str) -> str:
    """Search the web and return top result snippets."""
    try:
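        # Let requests URL-encode the query, and set a timeout so a hung request cannot stall the loop
        response = requests.get("https://api.duckduckgo.com/", params={"q": query, "format": "json"}, timeout=10)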
        response.raise_for_status()
        data = response.json()
        # Extract relevant snippets
        results = [item["Text"] for item in data.get("RelatedTopics", []) if "Text" in item]
        return "\n".join(results[:3]) if results else "No results found."
    except Exception as e:
        return f"Search failed: {str(e)}"

def calculate(expression: str) -> str:
    """Evaluate a mathematical expression."""
    try:
        result = eval(expression)  # WARNING: eval runs arbitrary code; never expose it to model- or user-supplied input in production
        return str(result)
    except Exception as e:
        return f"Calculation error: {str(e)}"

tools = {
    "search": search_web,
    "calculate": calculate
}

# Build the system prompt
system_prompt = f"""You are a helpful assistant with access to the following tools:
- search(query): Search the web for information.
- calculate(expression): Evaluate a mathematical expression.

You must output your response in JSON format:
- To use a tool: {{"thought": "...", "action": {{"tool": "tool_name", "args": {{...}}}}}}
- To answer: {{"thought": "...", "answer": "..."}}

You have a maximum of 10 steps to answer. If you cannot answer within that limit, output an answer with your best guess.
"""

def run_agent(user_query: str) -> str:
    messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_query}]
    max_iterations = 10
    for step in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=messages,
            temperature=0
        )
        content = response.choices[0].message.content
        try:
            parsed = json.loads(content)
        except json.JSONDecodeError:
            messages.append({"role": "assistant", "content": content})
            messages.append({"role": "user", "content": "Invalid JSON. Please output a valid JSON object."})
            continue
        if "answer" in parsed:
            return parsed["answer"]
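        if "action" in parsed:
            action = parsed["action"] if isinstance(parsed["action"], dict) else {}
            tool_name = action.get("tool")
            tool_args = action.get("args", {})
            if tool_name not in tools:
                observation = f"Error: tool '{tool_name}' not found."
            else:
                try:
                    observation = tools[tool_name](**tool_args)
                except Exception as e:
                    # Wrong or missing arguments raise TypeError; feed it back instead of crashing
                    observation = f"Tool error: {e}"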
            messages.append({"role": "assistant", "content": content})
            messages.append({"role": "user", "content": f"Observation: {observation}"})
        else:
            messages.append({"role": "assistant", "content": content})
            messages.append({"role": "user", "content": "Your output must include either 'action' or 'answer'."})
    return "I was unable to find a definitive answer within the allowed steps. Please refine your query."

# Example usage
print(run_agent("What is the capital of France?"))
print(run_agent("Calculate 123 * 456"))
Use Function Calling API for Reliability
Instead of parsing JSON from free-form text, use OpenAI's function calling API. It returns tool calls as structured objects, which greatly reduces malformed action errors, though the arguments still arrive as a JSON string that you should parse and validate. The pattern is the same, but the model outputs a function call object instead of raw JSON.
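
A minimal sketch of the function-calling variant, assuming the Chat Completions `tools` format. `dispatch_tool_calls` is an illustrative name rather than part of the OpenAI SDK, and the fake message built with `SimpleNamespace` stands in for `response.choices[0].message` so the snippet runs without network access.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable

# Tool schema in the shape the Chat Completions `tools` parameter expects.
CALCULATE_TOOL = {
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate a mathematical expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}


def dispatch_tool_calls(message: Any, tools: dict[str, Callable[..., Any]]) -> list[dict]:
    """Run each requested tool call and build the `role: tool` reply messages."""
    replies = []
    for call in getattr(message, "tool_calls", None) or []:
        name = call.function.name
        if name not in tools:
            result = f"Error: tool '{name}' is not available."
        else:
            try:
                args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
                result = tools[name](**args)
            except Exception as exc:  # errors become observations, not crashes
                result = f"Tool error: {exc}"
        replies.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
    return replies


# Stand-in for response.choices[0].message, so the sketch runs offline.
fake_message = SimpleNamespace(
    tool_calls=[
        SimpleNamespace(
            id="call_1",
            function=SimpleNamespace(name="calculate", arguments='{"expression": "2+2"}'),
        )
    ]
)
replies = dispatch_tool_calls(fake_message, {"calculate": lambda expression: "4"})
```

In a real loop you would pass `tools=[CALCULATE_TOOL]` to the completion call, append the assistant message that contains the tool calls to the history, and then append these replies before calling the model again.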
Production Insight
A customer support bot using this pattern had a bug where the 'calculate' tool was called with a string like 'What is 2+2?' instead of '2+2'. The tool failed, the agent retried with the same input, and it looped 10 times before hitting the limit. We fixed it by adding input validation in the tool: if the expression contains non-math characters, return an error asking the model to extract the expression.
Key Takeaway
Build your agent with a clear prompt, strict JSON parsing, and tool input validation. Test with edge cases like malformed queries and tool failures.
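
The tool input validation mentioned above can double as a replacement for `eval`. The sketch below is one possible approach, not the fix used in the incident: it walks the parsed syntax tree and allows only numbers and basic arithmetic. `safe_calculate` is a hypothetical name, and exponentiation is left out on purpose because very large powers can hang the process.

```python
import ast
import operator

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node: ast.AST):
    """Recursively evaluate a whitelisted arithmetic syntax tree."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("only numbers and + - * / % are allowed")


def safe_calculate(expression: str) -> str:
    """Evaluate plain arithmetic; return a descriptive error string otherwise."""
    try:
        tree = ast.parse(expression.strip(), mode="eval")
        return str(_evaluate(tree.body))
    except (SyntaxError, ValueError, ZeroDivisionError, OverflowError) as exc:
        return f"Calculation error: {exc}. Pass only the arithmetic expression, for example '2+2'."
```

A call such as `safe_calculate("What is 2+2?")` returns an error string that tells the model what to send instead, which is the behaviour the customer support fix relied on.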

When NOT to Use the ReAct Pattern

ReAct is powerful, but it's not a silver bullet. There are clear cases where other patterns outperform it. The most common mistake is using ReAct for tasks that don't need external tools — you're paying for multiple LLM calls when a single call would suffice.

Here's when to avoid ReAct:
  • Pure reasoning tasks: If the task is purely reasoning (e.g., 'Explain quantum entanglement'), use a single LLM call or a reflection pattern. ReAct adds unnecessary cost and latency.
  • Deterministic workflows: If you know the exact sequence of steps (e.g., 'Get user data, then get orders, then summarize'), use a Plan & Solve pattern. It's cheaper and faster because the plan is generated once.
  • Cost-sensitive applications: Each ReAct iteration is an LLM call. If you're on a tight budget, consider ReWOO (Reasoning WithOut Observation), which plans all tool calls upfront instead of calling the LLM after every observation.
  • Tasks requiring learning from failures: If the agent needs to learn from past mistakes (e.g., iterative debugging), use Reflexion which maintains a memory of failures.

Another anti-pattern: using ReAct for real-time systems. The loop introduces unpredictable latency. A search agent might take 2 seconds or 20 seconds depending on how many iterations it needs. If you need bounded latency, use a fixed-step pattern.
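
If you do keep a ReAct loop on a latency-sensitive path, a wall-clock deadline is one guard to add next to the iteration cap. This is not the fixed-step pattern itself, only a sketch of a time budget; `run_with_deadline`, the `step` callback, and the injectable `clock` are hypothetical names.

```python
import time
from typing import Callable, Optional


def run_with_deadline(
    step: Callable[[int], Optional[str]],
    max_iterations: int = 10,
    deadline_seconds: float = 15.0,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call `step(i)` until it returns an answer, the step limit is hit,
    or the wall-clock budget is exhausted."""
    started = clock()
    for i in range(max_iterations):
        # Checked between iterations only: a single slow step can still overshoot.
        if clock() - started >= deadline_seconds:
            return "Fallback: time budget exceeded."
        answer = step(i)
        if answer is not None:
            return answer
    return "Fallback: step limit reached."
```

Because the deadline is only checked between iterations, each LLM and tool call still needs its own request timeout for the bound to hold.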

Don't Use ReAct for Simple Q&A
If your task can be answered with a single LLM call, don't add a loop. You're paying for 3-5 extra calls and adding latency for no benefit. Profile your use case first.
Production Insight
A financial chatbot used ReAct for every query, including 'What is the current stock price of AAPL?' which only needed a single API call. The loop added 2 seconds of latency and 4x the cost. We switched to a classifier: if the query is a simple data lookup, use a direct function call; only use ReAct for complex multi-step queries.
Key Takeaway
ReAct is for tasks that require dynamic reasoning and tool use. For deterministic or simple tasks, use a cheaper pattern. Profile your queries and route them accordingly.

Production Patterns: Scaling the ReAct Agent

Running a ReAct agent at scale introduces challenges that tutorials ignore: concurrent requests, state management, and cost control. Here's how to handle them.

Concurrency: Each agent session is stateful: it maintains a conversation history. In a web app with 1000 concurrent users, you need 1000 separate histories. Use a session store (Redis, DynamoDB) to persist them, keyed by session_id with the message list as the value. Each iteration reads from and writes to this store.

State Pruning: The conversation history grows with each iteration. After 10 iterations with large tool outputs, you might exceed the context window (e.g., 8K tokens for GPT-4). Implement a sliding window: keep the system prompt, the last N messages, and a summary of older ones. Or use a max token limit and truncate the oldest messages when exceeded.

Cost Control: Monitor token usage per session. Set a hard budget per query (e.g., $0.10). If exceeded, terminate the loop and return a fallback. Use a token counter library (tiktoken) to estimate before sending the request.
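
A per-query budget like the one described above can be tracked with a small accumulator. The sketch assumes you feed it the token counts reported with each response; `CostBudget` is a hypothetical helper, and the per-1,000-token prices are placeholders rather than quoted rates.

```python
from dataclasses import dataclass


@dataclass
class CostBudget:
    """Tracks spend for one query and reports when the limit is exceeded."""

    limit_usd: float = 0.10
    input_price_per_1k: float = 0.03   # assumed value, not a quoted price
    output_price_per_1k: float = 0.06  # assumed value, not a quoted price
    spent_usd: float = 0.0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Add the cost of one LLM call to the running total."""
        self.spent_usd += (
            prompt_tokens / 1000 * self.input_price_per_1k
            + completion_tokens / 1000 * self.output_price_per_1k
        )

    @property
    def exceeded(self) -> bool:
        return self.spent_usd > self.limit_usd


budget = CostBudget()
budget.record(prompt_tokens=1000, completion_tokens=500)
```

Inside the loop, call `record` after every completion and break out with the fallback response as soon as `budget.exceeded` is true.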

Observability: Log every iteration: step number, token count, tool called, observation length, latency. Ship these logs to your monitoring system (Datadog, Grafana). Set alerts for high iteration counts, high token usage, or high failure rates.

production_agent_with_redis.py (Python)
import json
import redis
from openai import OpenAI
import tiktoken

r = redis.Redis(host='localhost', port=6379, db=0)
client = OpenAI()  # Reads the API key from the OPENAI_API_KEY environment variable

def get_session_history(session_id: str) -> list:
    data = r.get(f"session:{session_id}")
    return json.loads(data) if data else []

def save_session_history(session_id: str, history: list):
    r.setex(f"session:{session_id}", 3600, json.dumps(history))  # Expire after 1 hour

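def prune_history(history: list, max_tokens: int = 6000) -> list:
    """Drop the oldest non-system messages until the history fits in max_tokens."""
    enc = tiktoken.encoding_for_model("gpt-4")

    def count(message: dict) -> int:
        # Rough estimate: token count of the JSON-serialized message
        return len(enc.encode(json.dumps(message)))

    # Always keep the system prompt and the last 3 messages.
    # Slicing from `rest` avoids duplicating the system prompt on short histories.
    system, rest = history[0], history[1:]
    older, recent = rest[:-3], rest[-3:]
    # Count every message, including the older ones that may be pruned
    total_tokens = count(system) + sum(count(m) for m in older + recent)
    # Prune the oldest messages until under the limit
    while older and total_tokens > max_tokens:
        total_tokens -= count(older.pop(0))
    return [system] + older + recent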

def run_agent_production(session_id: str, user_query: str) -> str:
    history = get_session_history(session_id)
    if not history:
        history = [{"role": "system", "content": system_prompt}]  # Assume system_prompt defined
    history.append({"role": "user", "content": user_query})
    max_iterations = 10
    for step in range(max_iterations):
        # Prune history before sending to LLM
        pruned_history = prune_history(history)
        response = client.chat.completions.create(
            model="gpt-4",
            messages=pruned_history,
            temperature=0
        )
        content = response.choices[0].message.content
        # Parse and handle as before
        # ... (same as previous implementation)
        # Save history after each iteration
        save_session_history(session_id, history)
    return "Fallback: Could not answer."
Use Redis for Session State
Redis is ideal for storing conversation histories. It's fast, supports TTL for automatic cleanup, and handles concurrent access. Set a TTL of 1 hour to avoid stale sessions.
Production Insight
A travel booking agent using this pattern hit a problem: after 5 iterations, the history exceeded 8K tokens because each search result was 2K tokens. The LLM started ignoring old observations. We implemented a summarization step: after every 3 iterations, we summarized the conversation into a single 'summary' message and dropped the detailed history. This kept the context under control.
Key Takeaway
At scale, manage state with a session store, prune history to avoid context limits, and monitor token usage per session. Don't let the agent run unchecked.
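
The summarization step from the travel booking example might look like the sketch below. It is an assumption-laden illustration: `compact_history` is a hypothetical name, the `summarize` callback would normally be another LLM call, and sending the summary back with the `user` role is a choice made here, not something the incident write-up specifies.

```python
from typing import Callable

Message = dict[str, str]


def compact_history(
    history: list[Message],
    summarize: Callable[[list[Message]], str],
    keep_recent: int = 2,
) -> list[Message]:
    """Replace everything between the system prompt and the last
    `keep_recent` messages with a single summary message."""
    if len(history) <= 1 + keep_recent:
        return list(history)  # nothing old enough to summarize
    cut = len(history) - keep_recent
    system, older, recent = history[0], history[1:cut], history[cut:]
    summary = {"role": "user", "content": f"Summary of earlier steps: {summarize(older)}"}
    return [system, summary, *recent]
```

To mirror the "every 3 iterations" cadence, call it when `step % 3 == 2`. The original user question sits in the summarized span, so the summarizer has to be instructed to preserve it.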

Common Mistakes with Specific Examples

Here are the most common mistakes we've seen in production ReAct agents, with real examples.

Mistake 1: No input validation on tool arguments. Example: a weather agent called get_weather(city='New York') but the model output get_weather(city=''). The tool returned an error, and the agent retried with the same empty string. Fix: validate that required arguments are non-empty before calling the tool.

Mistake 2: Ignoring tool errors. Example: a database query tool threw an exception because the table didn't exist. The agent code caught the exception and returned 'Error: database error'. The model then tried the same query again because it didn't understand the error. Fix: return a descriptive error message that tells the model what went wrong and suggests alternatives.

Mistake 3: Not handling observation truncation. Example: a search tool returned 10KB of text. The agent appended this to the history, and after 3 searches, the context window overflowed. The LLM started generating incoherent responses. Fix: truncate observations to 1000 characters or summarize them.

Mistake 4: Relying on the model to stop. Example: the prompt said 'Answer when you have enough information.' The model never decided it had enough and kept looping. Fix: add explicit max iterations and a stop condition in the prompt: 'You must answer within {max_iterations} steps.'

common_mistakes_fixes.py (Python)
# Mistake 1: No input validation
def get_weather(city: str) -> str:
    if not city.strip():
        return "Error: city name cannot be empty. Please provide a valid city."
    # ... actual API call

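# Mistake 2: Ignoring tool errors
# Instead of returning a generic message such as:
#     observation = "Error occurred"
# return the actual failure and hint at next steps.
# (call_tool_safely is an illustrative helper name.)
def call_tool_safely(tool, **kwargs) -> str:
    try:
        return str(tool(**kwargs))
    except Exception as e:
        return f"Tool failed: {str(e)}. You can try a different approach or rephrase the input."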

# Mistake 3: Not truncating observations
def truncate_observation(obs: str, max_chars: int = 1000) -> str:
    if len(obs) > max_chars:
        return obs[:max_chars] + "... [truncated]"
    return obs

# Mistake 4: No max iterations in prompt
system_prompt = """You have access to tools. You must answer within 10 steps.
If you cannot answer after 10 steps, output an answer with your best guess."""
Tool Errors Are Observations, Not Exceptions
Never let a tool exception crash the loop. Catch it, format it as a string, and pass it back as an observation. The model needs to see the error to decide the next action.
Production Insight
A customer service agent kept calling the same tool with the same invalid arguments because the error message was 'Error: invalid input'. The model didn't know what was invalid. We changed the error to "Error: invalid input. The 'order_id' field must be a 10-digit number." The model then corrected its output.
Key Takeaway
Validate inputs, handle errors gracefully, truncate observations, and always set a max iteration limit. These four fixes address the most common production failures described above.

ReAct vs. Other Agent Patterns: When to Choose What

ReAct is one of many agent patterns. Here's a comparison to help you choose.

  • ReAct: Best for tasks requiring dynamic tool use and reasoning. Example: 'Find the latest news about AI and summarize it.' The agent decides which tools to call and in what order.
  • Plan & Solve: Best for tasks with a known sequence. The model generates a plan first, then executes each step without re-planning. Example: 'Book a flight: search flights, compare prices, book the cheapest.' Cheaper and faster than ReAct because it makes fewer LLM calls.
  • Reflexion: Best for tasks that require learning from mistakes. The agent maintains a memory of past failures and uses it to improve. Example: 'Debug this code: try a fix, observe the error, try another fix.' More robust but more expensive.
  • REWOO: Best for cost-sensitive tasks. The model plans every tool call in a single pass, without seeing intermediate observations; your code then executes the calls, and a final LLM pass synthesizes the answer. Example: 'Get the weather for New York, London, and Tokyo.' The three lookups are independent, so they can all be planned at once. Cheapest but least flexible.

In production, we often combine patterns. For example, use a classifier to decide which pattern to use based on the query. Simple queries go to REWOO, complex ones go to ReAct, and debugging tasks go to Reflexion.

pattern_selector.py (Python)
from openai import OpenAI

client = OpenAI()  # Reads the API key from the OPENAI_API_KEY environment variable

def classify_query(query: str) -> str:
    """Classify the query into a pattern type."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Classify the user query into one of: 'react', 'plan_solve', 'reflexion', 'rewoo'. Respond with only the label."},
            {"role": "user", "content": query}
        ],
        temperature=0
    )
    # Normalize the label: models sometimes add quotes or capitalization
    return response.choices[0].message.content.strip().strip("'\"").lower()

# Example usage
query = "Find the latest news about AI and summarize it."
pattern = classify_query(query)
if pattern == "react":
    run_react_agent(query)
elif pattern == "plan_solve":
    run_plan_solve_agent(query)
# ... etc.
Use a Router to Combine Patterns
Don't commit to one pattern for all queries. Use a lightweight classifier (or even a simple keyword-based router) to select the best pattern for each query. This optimizes cost and latency.
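
A keyword-based router can be as small as the sketch below. The patterns and the `route_query` helper are hypothetical and chosen only to match the examples in this article; real rules should be tuned on your own traffic.

```python
import re

# Hypothetical rules: the first matching pattern wins.
ROUTES: list[tuple[str, re.Pattern]] = [
    ("reflexion", re.compile(r"\b(debug|fix|traceback|error)\b", re.I)),
    ("rewoo", re.compile(r"\b(price|weather|status|balance)\b", re.I)),
    ("plan_solve", re.compile(r"\b(book|schedule|then)\b", re.I)),
]


def route_query(query: str, default: str = "react") -> str:
    """Return the pattern label for a query, falling back to ReAct."""
    for label, pattern in ROUTES:
        if pattern.search(query):
            return label
    return default
```

Falling back to ReAct keeps the flexible pattern as the default for anything the rules do not recognize, and a keyword router costs no LLM call at all.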
Production Insight
A SaaS platform initially used ReAct for all queries. After profiling, they found that 60% of queries were simple data lookups that could be handled by REWOO. They implemented a router that reduced average cost per query by 50% and latency by 70%.
Key Takeaway
ReAct is not the only pattern. Understand the trade-offs and use a router to select the best pattern for each query. This saves money and improves user experience.

Debugging and Monitoring the ReAct Agent

When your agent misbehaves in production, you need tools to diagnose the problem. Here's our monitoring stack.

Log every iteration: Log the step number, the model's raw output, the tool called, the observation length, and the time taken. Store this in a structured log (JSON lines) for easy querying.

Trace the conversation: Use OpenTelemetry to trace the entire agent flow. Each iteration is a span. This lets you see where time is spent and which steps fail.

Set alerts: Alert on:
  • Iteration count > 5 (possible loop)
  • Token spend per session > $0.10
  • Tool failure rate > 10%
  • Latency per step > 10 seconds
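
The alert thresholds above can also be checked in code before they reach your monitoring system. This sketch hard-codes the values from the list; `StepMetrics` and `check_alerts` are hypothetical names, and in practice the rules would more likely live in your alerting tool's configuration.

```python
from dataclasses import dataclass


@dataclass
class StepMetrics:
    iteration: int           # 1-based count of iterations so far
    session_cost_usd: float  # cumulative spend for this session
    tool_calls: int          # total tool calls in this session
    tool_failures: int       # tool calls that returned an error
    step_latency_s: float    # latency of the latest step, in seconds


def check_alerts(m: StepMetrics) -> list[str]:
    """Return the names of every threshold the metrics exceed."""
    alerts = []
    if m.iteration > 5:
        alerts.append("iteration_count")
    if m.session_cost_usd > 0.10:
        alerts.append("session_cost")
    if m.tool_calls and m.tool_failures / m.tool_calls > 0.10:
        alerts.append("tool_failure_rate")
    if m.step_latency_s > 10:
        alerts.append("step_latency")
    return alerts
```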

Use a debug mode: In development, add a 'verbose' mode that prints the full conversation history after each step. This helps you see what the model is seeing.

Test with edge cases: Before deploying, test with:
  • Empty tool results
  • Tool failures
  • Very long tool results
  • Ambiguous queries that could lead to infinite loops
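
These edge cases can be exercised without a real LLM by scripting the model. The sketch below uses a deliberately tiny stand-in loop, `run_loop`, rather than the agent code from this article; the fake models and tools are hypothetical test doubles.

```python
import json
from typing import Callable


def run_loop(model: Callable[[list[dict]], str], tools: dict, max_iterations: int = 5) -> tuple[str, int]:
    """Tiny stand-in for the agent loop; returns (answer, model_calls)."""
    history: list[dict] = []
    for calls in range(1, max_iterations + 1):
        parsed = json.loads(model(history))
        if "answer" in parsed:
            return parsed["answer"], calls
        action = parsed["action"]
        try:
            observation = str(tools[action["tool"]](**action.get("args", {})))
        except Exception as exc:
            observation = f"Tool error: {exc}"
        # Truncate so a huge result cannot flood the history
        history.append({"role": "user", "content": f"Observation: {observation[:1000]}"})
    return "Fallback: step limit reached.", max_iterations


def stubborn_model(history: list[dict]) -> str:
    """Never answers; always asks for another search."""
    return json.dumps({"thought": "I need more data", "action": {"tool": "search", "args": {"query": "x"}}})


def echo_model(history: list[dict]) -> str:
    """Searches once, then answers with the observation it saw."""
    if history:
        return json.dumps({"answer": history[-1]["content"]})
    return stubborn_model(history)


def failing_tool(query: str) -> str:
    raise TimeoutError("upstream timed out")


def huge_tool(query: str) -> str:
    return "x" * 50_000
```

Pairing `stubborn_model` with any tool checks that the loop stops at the step limit, while `echo_model` shows exactly what the model would see after a failure, an empty result, or an oversized result.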

monitoring_setup.py (Python)
import logging
import time
from opentelemetry import trace

tracer = trace.get_tracer(__name__)
logger = logging.getLogger(__name__)

def monitored_run_agent(session_id: str, user_query: str) -> str:
    with tracer.start_as_current_span("agent_run") as span:
        span.set_attribute("session_id", session_id)
        span.set_attribute("query", user_query)
        history = get_session_history(session_id)
        max_iterations = 10
        for step in range(max_iterations):
            with tracer.start_as_current_span(f"iteration_{step}") as iter_span:
                start_time = time.perf_counter()
                # Initialize per-iteration fields so logging never depends on locals()
                tool_name = None
                observation = ""
                # ... agent logic (sets tool_name and observation) ...
                latency = time.perf_counter() - start_time
                iter_span.set_attribute("latency", latency)
                iter_span.set_attribute("step", step)
                logger.info({
                    "event": "agent_iteration",
                    "session_id": session_id,
                    "step": step,
                    "latency": latency,
                    "tool_called": tool_name,
                    "observation_length": len(observation)
                })
                # Check for alerts
                if step > 5:
                    logger.warning(f"High iteration count: {step} for session {session_id}")
                if latency > 10:
                    logger.warning(f"High latency: {latency}s for session {session_id} step {step}")
        return "Fallback"
OpenTelemetry for Distributed Tracing
Use OpenTelemetry to trace agent iterations as spans. This integrates with Datadog, Grafana, or Jaeger and lets you visualize the agent's decision flow.
Production Insight
A team noticed that their agent's latency spiked every hour. They traced it to a specific tool that made an API call to a service that had a rate limit. The tool was failing silently (returning empty results) and the agent retried 3 times before moving on. They fixed it by adding a retry with exponential backoff in the tool itself.
Key Takeaway
Monitor every iteration with structured logs and traces. Set alerts for high iteration counts, high latency, and high token usage. Test with edge cases before deployment.
Production incident · Post-mortem · Severity: high

The $4,000 Agent Loop: How a Missing Max Iterations Cost Us a Night

Symptom
Cost alert: token spend spiked from $2/hour to $800/hour in 30 minutes. The agent returned 'I need more information to verify this transaction' after 127 iterations.
Assumption
The team assumed the model would naturally stop after 3-5 iterations because the prompt said 'Answer when you have enough information.' They thought the LLM would self-terminate.
Root cause
The LLM's prompt did not include a max iteration limit, and the agent loop had no hard stop. The model kept generating 'Thought: I need more data' because it was never explicitly told to stop after N attempts.
Fix
1. Added max_iterations=10 to the agent configuration.
2. Modified the loop to raise a MaxIterationsExceeded exception after the limit.
3. Added a fallback response: 'I was unable to find a definitive answer within the allowed steps. Please refine your query.'
4. Deployed a token usage monitor that alerts if a single conversation exceeds $10.
Key lesson
  • Always set a hard maximum iteration count in the agent loop — never rely on the model to self-terminate.
  • Monitor token spend per conversation, not just aggregate. Spikes catch runaway loops before they cost thousands.
  • Design the prompt to include a stop condition: 'You must answer within {max_iterations} steps.' The model needs explicit constraints.
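The fixes and lessons above can be sketched as a loop with a hard iteration cap, a per-conversation spend limit, and a fallback response. This is a minimal sketch under stated assumptions: `run_agent`, `answer_or_fallback`, `llm_step`, `StepResult`, and `BudgetExceeded` are hypothetical stand-ins, and the cost per step is supplied by the caller rather than computed from real token counts.

```python
from dataclasses import dataclass
from typing import Callable, Optional

FALLBACK = (
    "I was unable to find a definitive answer within the allowed steps. "
    "Please refine your query."
)


class MaxIterationsExceeded(Exception):
    """Raised when the loop reaches its hard iteration cap."""


class BudgetExceeded(Exception):
    """Raised when one conversation spends more than its budget."""


@dataclass
class StepResult:
    final_answer: Optional[str]  # None means the model wants another step
    cost_usd: float              # cost of this single LLM call


def run_agent(
    llm_step: Callable[[int], StepResult],
    max_iterations: int = 10,
    max_cost_usd: float = 10.0,
) -> str:
    spent = 0.0
    for step in range(max_iterations):
        result = llm_step(step)
        spent += result.cost_usd
        if result.final_answer is not None:
            return result.final_answer
        if spent > max_cost_usd:
            raise BudgetExceeded(f"spent ${spent:.2f} after {step + 1} steps")
    # The hard stop: never rely on the model to terminate on its own.
    raise MaxIterationsExceeded(f"no answer after {max_iterations} steps")


def answer_or_fallback(llm_step: Callable[[int], StepResult], **limits) -> str:
    """Turn either limit breach into the user-facing fallback message."""
    try:
        return run_agent(llm_step, **limits)
    except (MaxIterationsExceeded, BudgetExceeded):
        return FALLBACK
```

Keeping the exceptions separate from the fallback wrapper lets monitoring code count cap and budget breaches while users still get a clean message.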
Production debug guide · When the agent loops forever at 2am. · 4 entries
Symptom · 01
Agent returns 'I need more information' repeatedly
Fix
Check the conversation history length. If it's growing but the model keeps generating new tool calls, the max_iterations check is missing or too high. Run len(conversation_history) and compare to your limit.
Symptom · 02
Token usage spikes without visible progress
Fix
Log the last 5 observations. Large API responses can blow the context window. Check if observations are being truncated. Use len(observation) > 2000 as a warning threshold.
Symptom · 03
Agent crashes with JSONDecodeError
Fix
The model's tool call output is malformed. Log the raw model output before parsing. Add a retry with a prompt correction: 'Your previous output was not valid JSON. Please output a valid JSON action.'
Symptom · 04
Agent ignores tool results and repeats the same action
Fix
The model might be stuck in a reasoning loop. Inspect the last 3 Thought-Action pairs. If they're identical, the model is not incorporating observations. Check if the observation format is confusing (e.g., missing tool name prefix).
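For Symptom 03 above, a correction retry could be sketched as follows. `parse_action_with_retry` and `call_model` are hypothetical names: `call_model` stands in for whatever function sends a message list to your LLM and returns its raw text. The raw output is logged before parsing, as the fix recommends.

```python
import json
import logging
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)

CORRECTION = (
    "Your previous output was not valid JSON. "
    "Please output a valid JSON action."
)


def parse_action_with_retry(
    call_model: Callable[[List[Dict[str, str]]], str],
    messages: List[Dict[str, str]],
    max_retries: int = 2,
) -> dict:
    """Ask the model for a JSON action, re-prompting when parsing fails."""
    attempt_messages = list(messages)  # do not mutate the caller's history
    last_error: Exception = ValueError("model was never called")
    for _ in range(max_retries + 1):
        raw = call_model(attempt_messages)
        logger.debug("raw model output: %r", raw)  # log before parsing
        try:
            action = json.loads(raw)
            if isinstance(action, dict):
                return action
            last_error = ValueError("action must be a JSON object")
        except json.JSONDecodeError as exc:
            last_error = exc
        # Show the model its bad output plus the correction prompt.
        attempt_messages = attempt_messages + [
            {"role": "assistant", "content": raw},
            {"role": "user", "content": CORRECTION},
        ]
    raise ValueError(f"model never produced valid JSON: {last_error}")
```

The retry count is bounded so that a model which keeps emitting bad JSON cannot create a second runaway loop inside the first.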
★ ReAct Agent Pattern Triage Cheat Sheet · Copy-paste diagnostics. When it's 2am and you need answers fast.
Infinite loop
Immediate action
Check max_iterations setting
Commands
grep -r 'max_iterations' ./agent_config.py
kubectl logs deployment/agent --tail=50 | grep -E 'iteration|step'
Fix now
Set max_iterations=10 in agent config and redeploy.
Token cost spike
Immediate action
Check conversation history token count
Commands
python -c "import tiktoken; enc=tiktoken.encoding_for_model('gpt-4'); print(len(enc.encode(open('conversation.log').read())))"
tail -100 conversation.log | wc -c
Fix now
Add observation truncation: observation = observation[:2000] before appending to history.
Malformed tool call
Immediate action
Log raw model output before parsing
Commands
echo "$(tail -1 model_output.log)" | python -m json.tool 2>&1
cat model_output.log | grep -E 'action|tool' | head -5
Fix now
Add a retry loop: if JSONDecodeError, send a correction prompt to the model.
Agent repeats same action
Immediate action
Compare last 3 actions
Commands
tail -20 conversation.log | grep -E 'Action:' | uniq -c
python -c "import hashlib; [print(hashlib.md5(line.encode()).hexdigest()) for line in open('conversation.log') if 'Action:' in line]" | tail -3
Fix now
Add a deduplication check: if the same action repeats 3 times, force the model to answer with current data.
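Two of the "Fix now" items in the cheat sheet, observation truncation and the repeated-action check, fit in a few lines each. The helper names `truncate_observation` and `is_stuck` are hypothetical; the 2000-character limit and the three-repeat threshold come from the cheat sheet itself.

```python
from typing import List

MAX_OBSERVATION_CHARS = 2000  # threshold used in the cheat sheet


def truncate_observation(observation: str, limit: int = MAX_OBSERVATION_CHARS) -> str:
    """Cut an oversized tool result and tell the model that it was cut.

    The returned string is slightly longer than `limit` because of the marker.
    """
    if len(observation) <= limit:
        return observation
    omitted = len(observation) - limit
    return observation[:limit] + f"\n[truncated {omitted} characters]"


def is_stuck(actions: List[str], repeats: int = 3) -> bool:
    """Return True when the last `repeats` actions are identical."""
    if len(actions) < repeats:
        return False
    recent = actions[-repeats:]
    return all(action == recent[0] for action in recent)
```

When `is_stuck` returns True, the loop can stop calling tools and ask the model to answer with the data it already has. Appending a truncation marker, rather than cutting silently, tells the model the observation is incomplete.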
ReAct vs. Other Agent Patterns
| Concern | ReAct | Plan-and-Solve | Tool-Use-Only | Recommendation |
|---|---|---|---|---|
| Latency per task | High (multiple LLM calls) | Medium (one plan + execution) | Low (one call) | Use Tool-Use-Only for simple lookups |
| Adaptability | High (reacts to observations) | Low (fixed plan) | None | ReAct for dynamic tasks |
| Token cost | High (full history in context) | Medium | Low | Budget-sensitive → Plan-and-Solve |
| Debugging complexity | High (loops, parsing errors) | Medium | Low | Start simple, add complexity |
| Best use case | Multi-step reasoning with tools | Known sequence of steps | Single API call | Match to task complexity |

Key takeaways

1
Always set a hard max iteration limit (e.g., 10) and a token budget per step to prevent runaway loops.
2
Validate tool outputs before feeding them back into the LLM—malformed JSON or empty results cause hallucinated reasoning.
3
Use a structured output parser (e.g., Pydantic) for the Thought/Action/Action Input fields to catch parsing failures early.
4
Add a 'stop' action that the agent can explicitly emit, and enforce it with a regex check before the next iteration.
5
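Where your provider exposes token log-probabilities, monitor their entropy: a sustained rise can signal that the agent is uncertain and at risk of looping. Treat it as a heuristic alongside iteration and token limits, not as a replacement for them.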
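Takeaways 3 and 4 can be combined in one parser that either returns a validated step or recognizes the explicit stop signal. The article suggests Pydantic; the sketch below swaps in a standard-library dataclass plus regular expressions to stay dependency-free. `ParsedStep` and `parse_step` are hypothetical names, and the regexes assume the plain-text `Thought:` / `Action:` / `Action Input:` / `Final Answer:` layout.

```python
import re
from dataclasses import dataclass
from typing import Optional

FINAL_RE = re.compile(r"Final Answer:\s*(?P<answer>.+)", re.DOTALL)
STEP_RE = re.compile(
    r"Thought:\s*(?P<thought>.+?)\s*"
    r"Action:\s*(?P<action>.+?)\s*"
    r"Action Input:\s*(?P<action_input>.+)",
    re.DOTALL,
)


@dataclass
class ParsedStep:
    thought: str
    action: Optional[str]
    action_input: Optional[str]
    final_answer: Optional[str]  # set only when the agent emits the stop signal


def parse_step(text: str) -> ParsedStep:
    """Parse one model output, checking for the stop signal first."""
    final = FINAL_RE.search(text)
    if final:
        thought = text[: final.start()].replace("Thought:", "", 1).strip()
        return ParsedStep(thought, None, None, final.group("answer").strip())
    step = STEP_RE.search(text)
    if step is None:
        # Fail loudly here rather than letting a bad action reach a tool.
        raise ValueError(f"unparseable model output: {text!r}")
    return ParsedStep(
        step.group("thought").strip(),
        step.group("action").strip(),
        step.group("action_input").strip(),
        None,
    )
```

The loop then checks `final_answer is not None` before doing anything else in the next iteration.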

Common mistakes to avoid

4 patterns
×

No max iteration cap

Symptom
Agent runs forever, burning tokens and never finishing tasks.
Fix
Add a hard limit: max_iterations = 10. After that, force a final answer or raise an error.
×

Ignoring malformed tool output

Symptom
Agent repeats the same action because it can't parse the observation (e.g., JSON parse error).
Fix
Wrap the parsing of tool output in a try/except. On failure, inject a synthetic observation: 'Error: tool returned invalid data. Retry with corrected input.'
×

No stop condition in the prompt

Symptom
Agent never emits 'Final Answer' because the prompt doesn't explicitly require it.
Fix
Add to system prompt: 'You MUST end with "Final Answer: <answer>" after at most 10 steps. No exceptions.'
×

Reusing the same conversation history without trimming

Symptom
Context window fills up, causing the agent to lose track and loop on old observations.
Fix
Implement a sliding window: keep only the last N (e.g., 5) Thought-Action-Observation triples in the prompt.
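The sliding window from the last fix could be sketched like this. It assumes a chat-style history where `messages[0]` is the system prompt, `messages[1]` is the user's question, and every step after that is an assistant message (Thought and Action) followed by an observation message. That layout, and the name `trim_history`, are assumptions for illustration.

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(
    messages: List[Message],
    keep_last_steps: int = 5,
    messages_per_step: int = 2,
) -> List[Message]:
    """Keep the system prompt, the question, and the last N complete steps."""
    head, steps = messages[:2], messages[2:]
    window = keep_last_steps * messages_per_step
    if window <= 0:
        return list(head)
    if len(steps) <= window:
        return list(messages)
    drop = len(steps) - window
    # Round up to a whole step so the window starts on a Thought/Action
    # message, never on an orphaned observation.
    drop += (-drop) % messages_per_step
    return head + steps[drop:]
```

Always keeping the first two messages means the agent never loses its instructions or the original question, even when older steps are dropped.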
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
Explain the ReAct agent pattern and its core components.
Q02 · SENIOR
How would you prevent infinite loops in a ReAct agent?
Q03 · SENIOR
Design a ReAct agent that can handle 1000 concurrent users with sub-seco...
Q04 · SENIOR
How would you handle a tool that returns inconsistent or malformed data ...
Q05 · SENIOR
Compare ReAct with Plan-and-Solve and Tool-Use-Only patterns. When would...
Q01 of 05 · JUNIOR

Explain the ReAct agent pattern and its core components.

ANSWER
ReAct (Reasoning + Acting) interleaves chain-of-thought reasoning with tool calls. The loop: (1) LLM generates a Thought (reasoning step), (2) an Action (tool name), (3) an Action Input (parameters), (4) the system executes the tool and returns an Observation, (5) the Observation is appended to the prompt, and the LLM generates the next Thought. This continues until the LLM emits 'Final Answer'. Key components: system prompt with tool definitions, a parser for structured output, a tool execution environment, and a stop condition.
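The five steps in the answer above map onto a very small loop. The sketch below is illustrative only: `react_loop` and `_field` are hypothetical names, `llm` is any function that takes the transcript so far and returns the model's next text, and the line-based parsing assumes each field sits on its own line.

```python
from typing import Callable, Dict


def _field(output: str, label: str) -> str:
    """Return the text after `label` on the first line that starts with it."""
    for line in output.splitlines():
        if line.startswith(label):
            return line[len(label):].strip()
    return ""


def react_loop(
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_iterations: int = 10,
) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_iterations):
        output = llm(transcript)                    # (1)-(3) Thought, Action, Action Input
        transcript += output + "\n"
        if "Final Answer:" in output:               # stop condition
            return output.split("Final Answer:", 1)[1].strip()
        action = _field(output, "Action:")
        action_input = _field(output, "Action Input:")
        tool = tools.get(action)
        if tool is None:
            observation = f"Error: unknown tool {action!r}"
        else:
            try:
                observation = tool(action_input)    # (4) execute the tool
            except Exception as exc:                # failures become observations
                observation = f"Error: {exc}"
        transcript += f"Observation: {observation}\n"  # (5) append and repeat
    return "No final answer within the step limit."
```

A production loop would add the truncation, history trimming, and structured parsing discussed earlier; this version only shows the control flow.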
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
Why does my ReAct agent keep repeating the same action?
02
How do I stop a ReAct agent from hallucinating tool calls?
03
What's the best way to handle tool failures in ReAct?
04
How do I debug a ReAct agent that loops at 3am?
05
Can ReAct work with streaming responses?