ReAct Agent Pattern — Why Your Agent Loops Forever at 3am and How to Fix It
Learn the ReAct agent pattern from a production perspective: how it works, common failures, debugging strategies, and a real incident where a loop cost $4k in token overrun.
- Reasoning+Acting Loop The agent thinks, acts, observes, and repeats until it has a final answer. In production, this loop can run indefinitely without a max iteration limit.
- Tool Call Formatting The model must output tool calls in a strict JSON schema. A single malformed action can crash the loop or cause silent retries.
- Observation Truncation Tool outputs (e.g., API responses) can exceed the model's context window. Truncate or summarize observations to avoid token blowout.
- State Management Each iteration appends to the conversation history. Without pruning, you hit the context limit after 5-10 turns with large tool results.
- Error Handling Tool failures (timeouts, rate limits) must be passed as observations, not exceptions. The model needs to decide retry vs. alternative.
- Cost Control Each loop iteration is an LLM call. A bug that causes 50 iterations costs 50x the base call. Set a hard limit and monitor token usage.
Think of a ReAct agent like a detective solving a case. The detective thinks about what clue they need (Reason), goes to find it (Act), reads the clue (Observe), then decides if they can solve the case or need more clues. The loop repeats until they have enough evidence. If the detective never decides they're done, they'll keep searching forever — and that's exactly what happens when you forget to set a max iteration limit in production.
You've built a chatbot that can search the web, query databases, and call APIs. It works in your demo — three turns and it answers perfectly. Then you deploy it to production, and at 2am your pager goes off: the agent has been looping for 47 iterations, burned through $400 in tokens, and returned nothing. Welcome to the ReAct agent pattern, where the gap between a tutorial and production is a single missing max_iterations parameter.
Most tutorials show you the loop: Thought, Action, Observation, repeat. They hand-wave the hard parts: what happens when the model outputs malformed JSON for a tool call, when an API returns 50KB of data that blows your context window, or when the model decides it needs to search again and again because it never learned to stop. These aren't edge cases — they're the norm at scale.
This article covers the ReAct pattern from the inside out: how the loop works under the hood, how to implement it with proper error handling and cost controls, when not to use it (yes, there are better patterns), and the exact debugging steps we used when our recommendation engine's agent loop cost us $4k in one night. You'll get runnable code, real incident details, and a triage cheat sheet for that 2am page.
How the ReAct Loop Actually Works Under the Hood
The ReAct pattern is deceptively simple: a loop where the model generates a thought, decides an action, executes it, and observes the result. But the production reality is more nuanced. The loop is essentially a state machine with three states: REASONING, ACTING, OBSERVING. The model's output determines the transition. If it outputs a final answer, you're done. If it outputs a tool call, you execute and feed back the result.
The critical detail: the model doesn't 'know' it's in a loop. It sees a flat conversation history. Each iteration appends the previous thought, action, and observation. The model generates the next token based on this growing context. This means the loop's behavior changes as history grows — early iterations are crisp, later ones can become repetitive as the context window fills.
Most implementations hide the raw token-level mechanics. The model outputs a JSON blob like {"action": "search", "query": "latest news"}. Your code parses this, calls the tool, and appends the result. But if the model outputs {"action": "search", "query": ""} — empty query — your tool might return garbage or error. You need to validate the action schema before executing.
Another hidden gotcha: the model can output multiple actions in one response. Some frameworks handle this (ReAct with tool calling), but a naive implementation expects exactly one action per loop iteration. If the model outputs two actions, you'll parse only the first and lose the second. The fix is to either enforce single-action output in the prompt or parse all actions and execute them sequentially.
{"action": "get_recommendations", "user_id": 123} to {"action": "get_recommendations", "user_id": 123, "extra": "field"}. The parser was strict and rejected any extra fields, causing the agent to retry indefinitely. We fixed it by using a lenient parser that ignores unknown fields.Practical Implementation: Building a ReAct Agent from Scratch
Let's build a ReAct agent that can search the web and calculate math. We'll use OpenAI's GPT-4 for the model and a simple web search tool. The key is to structure the prompt so the model knows exactly what format to output and when to stop.
- Available tools with descriptions
- The expected output format (JSON with 'thought' and 'action' or 'answer')
- A stop condition: 'You must answer within {max_iterations} steps.'
- Examples of valid tool calls and final answers
We'll also add proper error handling: if the tool fails, we return the error as an observation so the model can decide to retry or use a different approach. This is crucial in production where APIs fail.
When NOT to Use the ReAct Pattern
ReAct is powerful, but it's not a silver bullet. There are clear cases where other patterns outperform it. The most common mistake is using ReAct for tasks that don't need external tools — you're paying for multiple LLM calls when a single call would suffice.
- Pure reasoning tasks: If the task is purely reasoning (e.g., 'Explain quantum entanglement'), use a single LLM call or a reflection pattern. ReAct adds unnecessary cost and latency.
- Deterministic workflows: If you know the exact sequence of steps (e.g., 'Get user data, then get orders, then summarize'), use a Plan & Solve pattern. It's cheaper and faster because the plan is generated once.
- Cost-sensitive applications: Each ReAct iteration is an LLM call. If you're on a tight budget, consider REWOO (ReAct Without Observation) which skips the observation step and uses a single pass.
- Tasks requiring learning from failures: If the agent needs to learn from past mistakes (e.g., iterative debugging), use Reflexion which maintains a memory of failures.
Another anti-pattern: using ReAct for real-time systems. The loop introduces unpredictable latency. A search agent might take 2 seconds or 20 seconds depending on how many iterations it needs. If you need bounded latency, use a fixed-step pattern.
Production Patterns: Scaling the ReAct Agent
Running a ReAct agent at scale introduces challenges that tutorials ignore: concurrent requests, state management, and cost control. Here's how to handle them.
Concurrency: Each agent session is stateful — it maintains a conversation history. In a web app with 1000 concurrent users, you need 1000 separate histories. Use a session store (Redis, DynamoDB) to persist histories. Key by session_id, value by the message list. Each iteration reads and writes to this store.
State Pruning: The conversation history grows with each iteration. After 10 iterations with large tool outputs, you might exceed the context window (e.g., 8K tokens for GPT-4). Implement a sliding window: keep the system prompt, the last N messages, and a summary of older ones. Or use a max token limit and truncate the oldest messages when exceeded.
Cost Control: Monitor token usage per session. Set a hard budget per query (e.g., $0.10). If exceeded, terminate the loop and return a fallback. Use a token counter library (tiktoken) to estimate before sending the request.
Observability: Log every iteration: step number, token count, tool called, observation length, latency. Ship these logs to your monitoring system (Datadog, Grafana). Set alerts for high iteration counts, high token usage, or high failure rates.
Common Mistakes with Specific Examples
Here are the most common mistakes we've seen in production ReAct agents, with real examples.
Mistake 1: No input validation on tool arguments. Example: a weather agent called get_weather(city='New York') but the model output get_weather(city=''). The tool returned an error, and the agent retried with the same empty string. Fix: validate that required arguments are non-empty before calling the tool.
Mistake 2: Ignoring tool errors. Example: a database query tool threw an exception because the table didn't exist. The agent code caught the exception and returned 'Error: database error'. The model then tried the same query again because it didn't understand the error. Fix: return a descriptive error message that tells the model what went wrong and suggests alternatives.
Mistake 3: Not handling observation truncation. Example: a search tool returned 10KB of text. The agent appended this to the history, and after 3 searches, the context window overflowed. The LLM started generating incoherent responses. Fix: truncate observations to 1000 characters or summarize them.
Mistake 4: Relying on the model to stop. Example: the prompt said 'Answer when you have enough information.' The model never decided it had enough and kept looping. Fix: add explicit max iterations and a stop condition in the prompt: 'You must answer within {max_iterations} steps.'
ReAct vs. Other Agent Patterns: When to Choose What
ReAct is one of many agent patterns. Here's a comparison to help you choose.
- ReAct: Best for tasks requiring dynamic tool use and reasoning. Example: 'Find the latest news about AI and summarize it.' The agent decides which tools to call and in what order.
- Plan & Solve: Best for tasks with a known sequence. The model generates a plan first, then executes each step without re-planning. Example: 'Book a flight: search flights, compare prices, book the cheapest.' Cheaper and faster than ReAct because it makes fewer LLM calls.
- Reflexion: Best for tasks that require learning from mistakes. The agent maintains a memory of past failures and uses it to improve. Example: 'Debug this code: try a fix, observe the error, try another fix.' More robust but more expensive.
- REWOO: Best for cost-sensitive tasks. It skips the observation step and uses a single pass to generate actions and final answer. Example: 'Get the weather for New York, London, and Tokyo.' The model generates all tool calls at once, executes them, and synthesizes the answer. Cheapest but least flexible.
In production, we often combine patterns. For example, use a classifier to decide which pattern to use based on the query. Simple queries go to REWOO, complex ones go to ReAct, and debugging tasks go to Reflexion.
Debugging and Monitoring the ReAct Agent
When your agent misbehaves in production, you need tools to diagnose the problem. Here's our monitoring stack.
Log every iteration: Log the step number, the model's raw output, the tool called, the observation length, and the time taken. Store this in a structured log (JSON lines) for easy querying.
Trace the conversation: Use OpenTelemetry to trace the entire agent flow. Each iteration is a span. This lets you see where time is spent and which steps fail.
Set alerts: Alert on: - Iteration count > 5 (possible loop) - Token usage per session > $0.10 - Tool failure rate > 10% - Latency per step > 10 seconds
Use a debug mode: In development, add a 'verbose' mode that prints the full conversation history after each step. This helps you see what the model is seeing.
Test with edge cases: Before deploying, test with: - Empty tool results - Tool failures - Very long tool results - Ambiguous queries that could lead to infinite loops
The $4,000 Agent Loop: How a Missing Max Iterations Cost Us a Night
max_iterations=10 to the agent configuration.
2. Modified the loop to raise a MaxIterationsExceeded exception after the limit.
3. Added a fallback response: 'I was unable to find a definitive answer within the allowed steps. Please refine your query.'
4. Deployed a token usage monitor that alerts if a single conversation exceeds $10.- Always set a hard maximum iteration count in the agent loop — never rely on the model to self-terminate.
- Monitor token spend per conversation, not just aggregate. Spikes catch runaway loops before they cost thousands.
- Design the prompt to include a stop condition: 'You must answer within {max_iterations} steps.' The model needs explicit constraints.
len(conversation_history) and compare to your limit.len(observation) > 2000 as a warning threshold.grep -r 'max_iterations' ./agent_config.pykubectl logs deployment/agent --tail=50 | grep -E 'iteration|step'max_iterations=10 in agent config and redeploy.Key takeaways
Common mistakes to avoid
4 patternsNo max iteration cap
max_iterations = 10. After that, force a final answer or raise an error.Ignoring malformed tool output
No stop condition in the prompt
Reusing the same conversation history without trimming
Interview Questions on This Topic
Explain the ReAct agent pattern and its core components.
Frequently Asked Questions
That's AI Agents. Mark it forged?
7 min read · try the examples if you haven't