Advanced 4 min · April 12, 2026

AI Agent Infinite Loops — LangChain + Next.js Cost Control

One missing maxIterations caused 14,000 tool calls and $4,200 in OpenAI spend.

Naren · Founder
Quick Answer
  • An AI Agent uses an LLM to decide which tools to call and in what order — unlike a chain, it reasons about actions
  • LangChain provides the agent framework: createToolCallingAgent builds the agent, AgentExecutor runs the ReAct-style loop (reason, act, observe), and RunnableWithMessageHistory adds conversation memory
  • Supabase stores conversation history and vector embeddings for long-term memory
  • Next.js 16 Route Handlers stream responses via the Vercel AI SDK (streamText); Server Actions handle non-streaming calls
  • Token costs scale with conversation length — truncate or summarize history to control spend
  • Biggest mistake: no tool guardrails. Without maxIterations and deduplication, an agent can keep calling tools indefinitely

Building an AI agent requires orchestrating four components: the LLM (reasoning engine), tools (external capabilities), memory (conversation history), and a runtime loop (agent executor). LangChain provides the agent framework with ReAct prompting — the model reasons about what to do, selects a tool, observes the result, and decides the next step.

Supabase serves two core roles: PostgreSQL stores conversation history for session continuity, and pgvector stores embeddings for semantic search over past interactions. Next.js 16 provides the API layer: Server Actions and Route Handlers process requests, and the Vercel AI SDK streams token-by-token responses to the client.

The production challenges are cost control (token usage scales with history length), latency (multi-step agent loops add sequential delay), reliability (tools can fail, LLMs can hallucinate tool inputs), and observability (debugging why an agent chose a specific action). This guide covers the complete implementation with production patterns for each.

Architecture: Agent, Tools, Memory, and Runtime

An AI agent has four components that work together in a loop. The LLM is the reasoning engine: it reads the conversation history and tool descriptions, decides which tool to call, and interprets the result. Tools are external functions the agent can invoke, such as code execution and API calls. Memory stores conversation history so the agent maintains context across turns. The runtime (AgentExecutor) orchestrates the loop: send context to the LLM, parse the tool call, execute the tool, append the result, repeat.

The key difference between an agent and a chain: a chain follows a fixed sequence of steps (retrieve -> format -> generate). An agent decides the sequence dynamically — it might search, then calculate, then search again, then answer. This flexibility comes at a cost: each step is a separate LLM call, adding latency and token usage.

LangChain's ReAct agent type implements this pattern. The prompt instructs the LLM to Think (reason about what to do), Act (call a tool with specific inputs), and Observe (read the tool result). The loop continues until the agent decides it has enough information to answer, or until the executor's maxIterations limit is reached.
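
The loop and its hard stop can be sketched without any framework. This is a minimal illustration, not LangChain's AgentExecutor: `runAgentLoop`, `ModelDecision`, and the synchronous `decide` callback are hypothetical names, and real model and tool calls would be asynchronous.

```typescript
// Framework-free sketch of a reason-act-observe loop with a hard iteration cap.
// All names are hypothetical; real model and tool calls would be async.
type ModelDecision =
  | { kind: "tool"; tool: string; input: string }
  | { kind: "answer"; text: string };

type Step = { tool: string; input: string; observation: string };

type AgentResult = {
  output: string;
  steps: Step[];
  stoppedBy: "answer" | "maxIterations";
};

function runAgentLoop(
  decide: (question: string, steps: Step[]) => ModelDecision,
  tools: Record<string, (input: string) => string>,
  question: string,
  maxIterations: number
): AgentResult {
  const steps: Step[] = [];
  for (let i = 0; i < maxIterations; i++) {
    const decision = decide(question, steps); // Think
    if (decision.kind === "answer") {
      return { output: decision.text, steps, stoppedBy: "answer" };
    }
    const tool = tools[decision.tool];
    // Act: an unknown tool becomes an observation the model can react to.
    const observation = tool
      ? tool(decision.input)
      : `Unknown tool "${decision.tool}"`;
    // Observe: record the result so the next decision can see it.
    steps.push({ tool: decision.tool, input: decision.input, observation });
  }
  // Hard stop: the model never produced a final answer.
  return { output: "Stopped: iteration limit reached.", steps, stoppedBy: "maxIterations" };
}

// A stand-in model that never stops searching, to show the cap in action.
const loopingModel = (): ModelDecision => ({ kind: "tool", tool: "search", input: "refund policy" });
const result = runAgentLoop(loopingModel, { search: () => "no results" }, "What is the refund policy?", 10);
console.log(result.stoppedBy, result.steps.length); // prints "maxIterations 10"
```

Without the `maxIterations` bound, the stand-in model above would call `search` forever, and every pass would be a billed LLM call.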

Supabase: Memory, Vector Search, and Conversation Storage

Supabase serves three roles in the agent architecture. PostgreSQL stores conversation history — every message (user and assistant) is persisted for session continuity. pgvector stores document embeddings for semantic search — the agent can retrieve relevant knowledge base entries. The Supabase client provides real-time subscriptions if you want to show agent activity to the user in real time.

The conversation storage pattern: each conversation has a unique ID. Messages are appended to a messages table with the conversation_id as a foreign key. When the agent processes a new message, it loads the conversation history from Supabase and prepends it to the LLM context. This gives the agent memory across turns without storing state in the application server.
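
The storage pattern can be illustrated with an in-memory stand-in for the messages table. This is a sketch only: in the setup described here the rows would live in Supabase, and every column name other than `conversation_id` is an assumption.

```typescript
// In-memory stand-in for a messages table keyed by conversation_id.
// Column names other than conversation_id are assumptions for illustration.
type MessageRow = {
  conversation_id: string;
  role: "user" | "assistant";
  content: string;
  created_at: number; // epoch milliseconds
};

type ContextMessage = { role: "user" | "assistant"; content: string };

const messagesTable: MessageRow[] = [];

function appendMessage(row: MessageRow): void {
  messagesTable.push(row);
}

// Load one conversation's history, oldest first.
function loadHistory(conversationId: string): MessageRow[] {
  return messagesTable
    .filter((m) => m.conversation_id === conversationId)
    .sort((a, b) => a.created_at - b.created_at);
}

// Prepend the stored history to the new user message to form the LLM context.
function buildContext(conversationId: string, newMessage: string): ContextMessage[] {
  const history: ContextMessage[] = loadHistory(conversationId).map((m) => ({
    role: m.role,
    content: m.content,
  }));
  return history.concat([{ role: "user", content: newMessage }]);
}

appendMessage({ conversation_id: "c1", role: "user", content: "Where is my order?", created_at: 1 });
appendMessage({ conversation_id: "c2", role: "user", content: "Unrelated question", created_at: 2 });
appendMessage({ conversation_id: "c1", role: "assistant", content: "It shipped yesterday.", created_at: 3 });

const context = buildContext("c1", "When will it arrive?");
console.log(context.length); // prints 3
```

Because the history is rebuilt from storage on every request, no state has to survive on the application server between turns.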

Vector search uses pgvector's cosine similarity. Documents are chunked, embedded with OpenAI's text-embedding-3-small model, and stored with their embedding vectors. The match_documents RPC function performs nearest-neighbor search and returns the top-k results above a similarity threshold. A range of 0.78–0.82 is a starting point for text-embedding-3-small; similarity scores vary by model and corpus, so tune the threshold against your own data.
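
To make the ranking step concrete, here is a small in-memory sketch of cosine-similarity search with a threshold and a top-k cut. It is not the match_documents RPC itself (pgvector does this work inside PostgreSQL); the function names and the toy three-dimensional embeddings are illustrative assumptions.

```typescript
// In-memory illustration of threshold + top-k cosine similarity search.
// pgvector performs the equivalent ranking in SQL; names here are hypothetical.
type Doc = { id: string; embedding: number[] };
type Match = { id: string; similarity: number };

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function matchDocuments(query: number[], docs: Doc[], threshold: number, k: number): Match[] {
  return docs
    .map((d) => ({ id: d.id, similarity: cosineSimilarity(query, d.embedding) }))
    .filter((m) => m.similarity >= threshold) // drop weak matches
    .sort((a, b) => b.similarity - a.similarity) // best match first
    .slice(0, k);
}

const docs: Doc[] = [
  { id: "refunds", embedding: [1, 0, 0] },
  { id: "shipping", embedding: [0.8, 0.6, 0] },
  { id: "careers", embedding: [0, 0, 1] },
];
const matches = matchDocuments([1, 0, 0], docs, 0.5, 2);
console.log(matches.map((m) => m.id)); // "refunds" first, then "shipping"
```

The threshold filters out the unrelated "careers" document before the top-k cut is applied, which is why raising it trades recall for precision.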

Next.js Integration: Streaming API and Server Actions

The Next.js integration has two concerns: the API layer that processes agent requests, and the streaming layer that delivers token-by-token responses to the client. The Vercel AI SDK (ai package) handles both: it provides useChat for the client and streamText for the server (StreamingTextResponse is a legacy helper from older SDK versions).

The API route receives the user's message, loads conversation history from Supabase, runs the agent, and streams the response. The client uses useChat to manage the message list and display streaming tokens. The key pattern: the Route Handler returns streamText(...).toDataStreamResponse(), which is a streaming response that the client consumes incrementally.

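For non-streaming use cases triggered from your own UI (such as kicking off background processing), use Server Actions instead. They await the full agent run and return the result in one response. Server Actions are simpler but do not stream in this setup, so the user waits for the full response. Webhook-triggered agents still need a Route Handler, because external services call a URL rather than a Server Action.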

Tools: Building and Validating Agent Capabilities

Tools are the agent's external capabilities — each tool is a function with a name, description, and input schema. The LLM reads the tool descriptions to decide which tool to call, and parses the input schema to generate valid arguments. The quality of the description directly affects the agent's ability to use the tool correctly.

The critical pattern: validate tool inputs with Zod before execution. LLMs can hallucinate invalid inputs — missing required fields, wrong types, or malformed queries. Zod validation catches these before the tool executes, returning a clear error message that the agent can learn from.

Tool design principles: descriptions should be specific (not just "search the database"), input schemas should use .describe() on every field (the LLM reads these descriptions), and error messages should be actionable (tell the agent what went wrong and how to fix it).
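
The validate-then-execute pattern looks roughly like this. To keep the sketch dependency-free it uses a hand-written check as a stand-in for a Zod schema; the order-lookup tool, its `orderId` field, and the `ORD-<digits>` format are hypothetical.

```typescript
// Dependency-free stand-in for a Zod schema check.
// The tool, its field, and the ID format are hypothetical.
type Validation<T> = { ok: true; value: T } | { ok: false; error: string };

type OrderLookupInput = { orderId: string };

function validateOrderLookup(raw: unknown): Validation<OrderLookupInput> {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, error: 'Input must be an object like {"orderId": "ORD-1234"}.' };
  }
  const orderId = (raw as { orderId?: unknown }).orderId;
  if (typeof orderId !== "string") {
    return { ok: false, error: 'Missing required string field "orderId". Retry with {"orderId": "ORD-1234"}.' };
  }
  if (!/^ORD-\d+$/.test(orderId)) {
    return { ok: false, error: `"${orderId}" is not a valid order ID. Use the form ORD-<digits>, e.g. ORD-1234.` };
  }
  return { ok: true, value: { orderId } };
}

// Validation errors are returned as text so the agent can correct its next call
// instead of crashing the run.
function orderStatusTool(raw: unknown): string {
  const parsed = validateOrderLookup(raw);
  if (parsed.ok === false) return `Invalid input: ${parsed.error}`;
  return `Order ${parsed.value.orderId}: shipped`; // placeholder for a real lookup
}

console.log(orderStatusTool({ orderId: "ORD-42" }));
console.log(orderStatusTool({ orderId: "42" }));
```

Each error names the failing field and shows a valid example, which gives the model something concrete to retry with.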

Cost Control: Token Management and Conversation Summarization

Token costs are the primary production concern for AI agents. Each agent iteration is a separate LLM call that includes the full conversation history, tool descriptions, and the agent's reasoning. A 20-turn conversation with 10 tool calls per turn can consume 100,000+ tokens — costing $1-10 depending on the model.
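
A rough estimate shows why the bill grows so quickly when every call resends the history. The function below is a back-of-the-envelope sketch; the overhead and per-call growth figures are illustrative assumptions, not measurements.

```typescript
// Rough input-token estimate for a conversation where every LLM call resends the history.
// All numbers below are illustrative assumptions, not measurements.
function estimateInputTokens(
  calls: number, // total LLM calls in the conversation
  overheadTokens: number, // system prompt + tool descriptions, sent on every call
  tokensAddedPerCall: number // history growth per call (tool result, reasoning, reply)
): number {
  let total = 0;
  for (let i = 0; i < calls; i++) {
    // Call i carries the fixed overhead plus everything accumulated so far.
    total += overheadTokens + i * tokensAddedPerCall;
  }
  return total;
}

function costUsd(tokens: number, pricePerMillionUsd: number): number {
  return (tokens / 1e6) * pricePerMillionUsd;
}

// 20 turns x 10 tool calls = 200 calls, 500 tokens of overhead, 150 new tokens per call.
const tokens = estimateInputTokens(200, 500, 150);
console.log(tokens); // prints 3085000
console.log(costUsd(tokens, 2.5).toFixed(2)); // prints "7.71" at $2.50 per million input tokens
```

The history term grows with the square of the number of calls, so doubling the conversation length roughly quadruples that part of the cost.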

Three strategies control costs: history truncation (keep only the last N messages), conversation summarization (replace old messages with a summary), and model selection (use gpt-4o-mini for simple tasks, gpt-4o for complex reasoning). Summarization is the most effective — it preserves context while reducing token count by 80-90%.
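
Truncation and summarization can be combined in one compaction step. In this sketch the four-characters-per-token estimate is a crude assumption (a real tokenizer is more accurate), and `summarize` stands in for a call to a cheaper model.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Crude heuristic: about 4 characters per token for English text (an assumption).
function estimateTokens(messages: ChatMessage[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

// When history exceeds the budget, replace everything except the last `keepRecent`
// messages with a single summary message. `summarize` is a stand-in for a model call.
function compactHistory(
  history: ChatMessage[],
  maxTokens: number,
  keepRecent: number,
  summarize: (older: ChatMessage[]) => string
): ChatMessage[] {
  if (estimateTokens(history) <= maxTokens || history.length <= keepRecent) {
    return history; // nothing to compact
  }
  const cut = history.length - keepRecent;
  const older = history.slice(0, cut);
  const recent = history.slice(cut);
  const summary: ChatMessage = {
    role: "system",
    content: `Summary of earlier conversation: ${summarize(older)}`,
  };
  return [summary].concat(recent);
}

// 20 messages of 400 characters each is about 2,000 estimated tokens.
const longHistory: ChatMessage[] = [];
for (let i = 0; i < 20; i++) {
  longHistory.push({ role: i % 2 === 0 ? "user" : "assistant", content: new Array(401).join("x") });
}
const compacted = compactHistory(longHistory, 1000, 4, (older) => `${older.length} earlier messages`);
console.log(compacted.length, estimateTokens(longHistory), estimateTokens(compacted));
```

Keeping the most recent messages verbatim preserves the immediate context, while the summary carries only the gist of older turns.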

Deployment: Environment, Monitoring, and Failure Handling

Production deployment requires three additions beyond the development setup: environment variable management (API keys, Supabase credentials), monitoring (token usage, error rates, latency), and failure handling (tool errors, LLM timeouts, rate limits).

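The deployment target matters: Vercel Serverless Functions have a maximum duration that depends on your plan and configuration (this guide assumes 10 seconds by default and 300 seconds on the Pro plan; check Vercel's current limits). Agent loops with multiple tool calls can exceed this. The Edge Runtime is not an unlimited alternative: an Edge Function must begin sending its response within a limited window (25 seconds in Vercel's documentation), although it can keep streaming afterwards. For long-running agents, deploy the agent logic to a separate service (AWS Lambda with a 15-minute timeout, or a containerized service).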

Agent Frameworks Compared
| Framework | Language | Agent Type | Streaming | Tool Ecosystem | Production Ready | Best For |
|---|---|---|---|---|---|---|
| LangChain | Python, JS/TS | ReAct, Tool Calling | Yes (with streamEvents) | Large | Yes* | Complex multi-tool agents, RAG pipelines |
| Vercel AI SDK | JS/TS | OpenAI Functions | Yes (native) | Small | Yes | Simple chat agents, streaming-first apps |
| CrewAI | Python | Role-based multi-agent | Limited | Medium | Growing | Multi-agent collaboration, research tasks |
| AutoGen | Python | Conversational multi-agent | Yes | Medium | Yes | Multi-agent conversations, code generation |
| Direct OpenAI API | Any | Function calling | Yes | None (manual) | Yes | Simple single-tool agents, full control |

Key Takeaways

  • An agent decides tool sequences dynamically — unlike chains which follow fixed steps
  • maxIterations on AgentExecutor prevents infinite loops; set it explicitly in production (10 is a reasonable starting point)
  • Supabase stores conversation history and vector embeddings — the database is the agent's memory
  • Conversation summarization reduces token costs by 80-90% while preserving context
  • Tool descriptions guide the LLM's tool selection — write them like instructions, not documentation
  • Validate tool inputs with Zod — LLMs hallucinate invalid arguments in multi-step workflows

Common Mistakes to Avoid

  • No maxIterations on the agent executor
    Symptom: The agent enters an infinite tool-calling loop and consumes millions of tokens and thousands of dollars in API credits within hours. It keeps calling the same tool with slightly different inputs, never reaching a conclusion.
    Fix: Set maxIterations to 10 on AgentExecutor (the limit belongs to the executor, not to createToolCallingAgent). Add deduplication logic to detect repeated tool calls and results. Set OpenAI usage limits as a safety net. Monitor per-conversation token counts.
  • Vague tool descriptions that do not explain when to use the tool
    Symptom: Agent either never uses the tool (does not know when it is relevant) or uses it incorrectly (applies it to the wrong type of query). Tool selection accuracy drops below 50%.
    Fix: Write tool descriptions that explain WHEN to use the tool, not just what it does. Include examples of good queries. List what the tool should NOT be used for. Add .describe() to every Zod schema field.
  • No input validation on tool parameters
    Symptom: Tool receives hallucinated inputs from the LLM — missing required fields, wrong types, malformed queries. Tool crashes with unhandled errors, and the agent gets an unclear error message.
    Fix: Validate all tool inputs with Zod before execution. Return clear, actionable error messages that tell the agent what went wrong and how to retry with valid inputs.
  • Storing conversation history only in application memory
    Symptom: Conversation history is lost when the serverless function terminates. Users lose context on every page refresh. Multi-turn conversations do not work across requests.
    Fix: Store conversation history in Supabase PostgreSQL. Load history from the database on every request. Use a custom history class that matches your token-tracking schema.
  • No conversation summarization for long sessions
    Symptom: Tokens per LLM call grow linearly with conversation length, so the cumulative cost of a session grows faster than linearly. A 50-turn conversation consumes 100,000+ tokens per agent iteration. Monthly API costs exceed budget by 5-10x.
    Fix: Implement conversation summarization when history exceeds 8,000 tokens. Use gpt-4o-mini for summarization (cheaper). Replace old messages with a summary in the agent context.
  • Using gpt-4o for all tasks including simple lookups
    Symptom: Simple tool calls (order status lookups, basic searches) cost roughly 17x more than necessary ($2.50 vs $0.15 per million input tokens). Monthly API costs are dominated by trivial operations that do not require advanced reasoning.
    Fix: Use gpt-4o-mini for simple tasks (order lookups, basic searches) and gpt-4o for complex reasoning (multi-tool workflows, refund calculations). Route tasks to the appropriate model based on complexity.
  • No rate limiting on the chat API endpoint
    Symptom: A single user or bot sends 1,000 messages per minute, exhausting the OpenAI rate limit and causing errors for all users. API costs spike unexpectedly.
    Fix: Add rate limiting middleware using Upstash Redis (sliding window) — 10 messages per minute per user is a reasonable default.
  • Not logging agent runs for observability
    Symptom: When the agent produces a wrong answer or enters a loop, there is no way to debug what happened. No visibility into which tools were called, what inputs were used, or how many steps were taken.
    Fix: Log every agent run with: conversation ID, input, output, intermediate steps, duration, token count, and error details. Use these logs to identify patterns in agent failures.
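
The deduplication fix from the first mistake above can be sketched as a wrapper around a tool function. `withDeduplication` is a hypothetical helper, the cache is meant to live for a single agent run, and the key-sorting trick assumes flat (non-nested) input objects.

```typescript
// Wrap a tool so that repeated calls with identical input return a nudge
// instead of re-running. Hypothetical helper; assumes flat input objects.
type ToolInput = Record<string, unknown>;

function withDeduplication(
  toolName: string,
  tool: (input: ToolInput) => string
): (input: ToolInput) => string {
  const seen: Record<string, boolean> = {};
  return (input) => {
    // Serialize with sorted keys so {a, b} and {b, a} produce the same cache key.
    const key = JSON.stringify(input, Object.keys(input).sort());
    if (seen[key]) {
      return (
        `${toolName} was already called with this input. The results have not changed. ` +
        "Please answer based on the available information."
      );
    }
    seen[key] = true;
    return tool(input);
  };
}

let executions = 0;
const search = withDeduplication("search", (input) => {
  executions++;
  return `results for ${String(input.query)}`;
});

const first = search({ query: "refund policy", limit: 5 });
const second = search({ limit: 5, query: "refund policy" }); // same input, different key order
console.log(executions); // prints 1
console.log(second);
```

Returning a message rather than throwing keeps the loop alive and tells the model to stop searching. An exact-match key will not catch near-duplicate inputs, so the iteration cap is still required as the hard stop.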

Interview Questions on This Topic

  • Q: What is the difference between an AI agent and a chain in LangChain? (Mid-level)
    A chain follows a fixed sequence of steps — retrieve context, format prompt, generate response. The sequence is defined at build time and does not change based on the input. An agent decides the sequence dynamically. It uses an LLM to reason about which tool to call, calls the tool, observes the result, and decides the next step. The sequence is determined at runtime based on the input and intermediate results. The trade-off: agents are more flexible (they can handle unexpected queries by choosing different tools) but more expensive (each step is a separate LLM call) and harder to debug (the sequence is non-deterministic). Chains are cheaper, faster, and predictable but limited to predefined workflows.
  • Q: How do you prevent an AI agent from entering an infinite loop? (Senior)
    Three layers of protection:
    1. maxIterations on AgentExecutor, set to 10 for production. This is the hard stop that prevents infinite loops regardless of the agent's behavior.
    2. Prompt-level instructions: tell the agent what to do when tools return no new information. Example: "If the search returns the same results twice, synthesize an answer from available information instead of searching again."
    3. Tool-level deduplication: detect when the same tool is called with the same inputs and return a message like "This query was already searched. The results have not changed. Please answer based on the available information."
    Additionally, set OpenAI usage limits as a financial safety net, and monitor per-conversation token counts to detect loops early.
  • Q: How do you manage conversation memory in a stateless serverless environment? (Mid-level)
    Serverless functions are stateless: they terminate after each request and do not retain memory. To persist conversation history across requests:
    1. Store every message in a database (Supabase PostgreSQL). Each message includes the conversation ID, role, content, and timestamp.
    2. On each request, load the conversation history from the database and pass it to the agent as context.
    3. Use a custom history class that matches your token-tracking schema.
    The key insight: the database is the memory, not the application server. Each request loads the memory, processes the message, and writes the result back.
  • Q: How do you control token costs for an AI agent in production? (Senior)
    Four strategies:
    1. Conversation summarization: when history exceeds 8,000 tokens, summarize old messages with a cheaper model (gpt-4o-mini). This reduces token count by 80-90% while preserving context.
    2. Model selection: use gpt-4o-mini for simple tasks (order lookups, basic searches at $0.15/M input tokens) and gpt-4o for complex reasoning ($2.50/M input tokens). Route based on task complexity.
    3. Token budgets: set a per-conversation limit (e.g., 50,000 tokens). When exceeded, summarize and continue or stop processing.
    4. maxIterations: limit the number of agent steps per turn. Each step is a separate LLM call with the full history. Fewer steps means fewer tokens.
    Additionally, track per-conversation token usage in the database, set daily/monthly budgets, and alert when spending exceeds thresholds.
  • Q: What is the role of Supabase in an AI agent architecture? (Junior)
    Supabase serves three roles:
    1. Conversation storage: PostgreSQL stores conversation history (messages table) and conversation metadata (conversations table). This enables session persistence across requests in a stateless serverless environment.
    2. Vector search: pgvector stores document embeddings for the knowledge base. The match_documents RPC function performs cosine similarity search, enabling the agent to retrieve relevant documents based on semantic meaning rather than keyword matching.
    3. Row Level Security: RLS policies ensure users can only access their own conversations. This provides authorization at the database level, independent of the application code.
    Additionally, Supabase can store conversation summaries (for cost control), usage logs (for billing and monitoring), and agent run logs (for observability).
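
The per-conversation token budget from the cost-control answer above can be sketched as a small counter. This is one possible policy, not a prescribed one: the `TokenBudget` class is hypothetical, and summarizing at 80% of the limit is an assumed threshold.

```typescript
type BudgetDecision = "continue" | "summarize" | "stop";

// Tracks tokens used by one conversation and says what to do next.
// Hypothetical class; the 0.8 summarize ratio is an assumption.
class TokenBudget {
  private used = 0;
  private readonly limit: number;
  private readonly summarizeAt: number;

  constructor(limit: number, summarizeRatio: number = 0.8) {
    this.limit = limit;
    this.summarizeAt = limit * summarizeRatio;
  }

  // Record the tokens consumed by one LLM call and return the next action.
  record(tokens: number): BudgetDecision {
    this.used += tokens;
    if (this.used >= this.limit) return "stop";
    if (this.used >= this.summarizeAt) return "summarize";
    return "continue";
  }

  remaining(): number {
    return Math.max(0, this.limit - this.used);
  }
}

const budget = new TokenBudget(50000);
console.log(budget.record(30000)); // prints "continue" (30,000 used)
console.log(budget.record(12000)); // prints "summarize" (42,000 used, above 80%)
console.log(budget.record(9000)); // prints "stop" (51,000 used, over the limit)
```

In a serverless deployment the running total would be stored with the conversation in the database, since the instance handling the next request starts with no memory.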

Frequently Asked Questions

Can I use a different LLM instead of OpenAI with LangChain?

Yes. LangChain supports many LLM providers: Anthropic (ChatAnthropic), Google (ChatGoogleGenerativeAI), Azure OpenAI (AzureChatOpenAI), Ollama (ChatOllama for local models), and any OpenAI-compatible API. The agent framework works the same regardless of the LLM — only the model initialization changes. For function calling (tool use), ensure the model supports it — OpenAI and Anthropic models have native support, while others may require prompt-based tool calling.

How do I test an AI agent in my CI/CD pipeline?

Test at three levels: unit test individual tools (mock external calls, verify output format), integration test the agent with a fixed input (verify it calls the expected tools and produces a correct answer), and end-to-end test the API endpoint (verify streaming, authentication, and error handling). For more deterministic tests, use temperature=0 (which reduces but does not eliminate output variation) and record/replay LLM responses with tools like VCR or Polly. Mock the Supabase client for database operations.

How do I handle multi-agent architectures where agents delegate to each other?

LangChain supports agent delegation through the agent's tools — one agent can be wrapped as a tool for another agent. The supervisor agent decides which specialist agent to call (e.g., a billing agent, a technical support agent) and passes the relevant context. CrewAI and AutoGen provide higher-level abstractions for multi-agent collaboration with role definitions and conversation patterns.

What is the difference between gpt-4o and gpt-4o-mini for agent tasks?

gpt-4o is better at complex reasoning, multi-step planning, and understanding nuanced tool descriptions. gpt-4o-mini is faster and roughly 17x cheaper on input tokens but may struggle with complex tool selection or multi-step workflows. Use gpt-4o-mini for simple tasks (single tool calls, order lookups, basic searches) and gpt-4o for complex tasks (multi-tool workflows, refund calculations). The cost difference is significant: $0.15/M vs $2.50/M input tokens.
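
Routing by complexity can be as simple as a function that picks the model before the agent runs. The heuristic below is a hypothetical example (real routing rules depend on your workload); the prices are the per-million input token figures quoted above.

```typescript
type ModelName = "gpt-4o-mini" | "gpt-4o";

// Input prices in USD per million tokens, as quoted in this article.
const INPUT_PRICE_PER_MILLION: Record<ModelName, number> = {
  "gpt-4o-mini": 0.15,
  "gpt-4o": 2.5,
};

type TaskProfile = { expectedToolCalls: number; needsCalculation: boolean };

// Hypothetical heuristic: use the larger model only when the request is likely
// to need several tools or involves calculations.
function chooseModel(task: TaskProfile): ModelName {
  return task.expectedToolCalls > 1 || task.needsCalculation ? "gpt-4o" : "gpt-4o-mini";
}

function inputCostUsd(model: ModelName, inputTokens: number): number {
  return (inputTokens / 1e6) * INPUT_PRICE_PER_MILLION[model];
}

const lookup = chooseModel({ expectedToolCalls: 1, needsCalculation: false });
const refund = chooseModel({ expectedToolCalls: 3, needsCalculation: true });
console.log(lookup, refund); // prints "gpt-4o-mini gpt-4o"

const ratio = INPUT_PRICE_PER_MILLION["gpt-4o"] / INPUT_PRICE_PER_MILLION["gpt-4o-mini"];
console.log(ratio.toFixed(1)); // prints "16.7"
```

A common refinement is to escalate: start on the cheaper model and retry on the larger one only when the first attempt fails validation or exceeds a step limit.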

How do I deploy an AI agent that exceeds Vercel's serverless timeout?

Three options: stream from the Vercel Edge Runtime (it must begin responding within a limited window but can keep streaming afterwards, and it has no Node.js APIs, so it suits pure API calls), move the agent logic to a separate service (AWS Lambda with a 15-minute timeout, or a containerized service on ECS/Fargate with no timeout), or implement a queue-based architecture (the user submits a message, a worker processes it asynchronously, and the client polls or receives a webhook when done). The queue approach is the most scalable but adds complexity.
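
The queue-based option has three moving parts: submit, process, and poll. The sketch below models them in memory with synchronous stand-ins; all names are hypothetical, and a real deployment would use a durable queue and a database table so that jobs survive restarts.

```typescript
// In-memory sketch of submit / process / poll. All names are hypothetical;
// production systems need a durable queue and persistent job storage.
type Job = {
  id: string;
  message: string;
  status: "queued" | "done" | "failed";
  result?: string;
};

const jobs: Record<string, Job> = {};
const queue: string[] = [];
let nextId = 1;

// Called by the request handler: enqueue the message and return immediately.
function submitMessage(message: string): string {
  const id = `job-${nextId++}`;
  jobs[id] = { id, message, status: "queued" };
  queue.push(id);
  return id;
}

// Called by a worker outside the request lifecycle. `runAgent` stands in for the agent.
function processNext(runAgent: (message: string) => string): boolean {
  const id = queue.shift();
  if (id === undefined) return false; // nothing to do
  const job = jobs[id];
  try {
    job.result = runAgent(job.message);
    job.status = "done";
  } catch (err) {
    job.result = err instanceof Error ? err.message : String(err);
    job.status = "failed";
  }
  return true;
}

// Called by the client when polling for completion.
function getStatus(id: string): Job | undefined {
  return jobs[id];
}

const jobId = submitMessage("Summarize my last five orders");
console.log(getStatus(jobId)?.status); // prints "queued"
processNext((message) => `Agent answer for: ${message}`);
console.log(getStatus(jobId)?.status); // prints "done"
```

Because the request handler only enqueues, it returns well within any function timeout, and the long-running agent loop happens in the worker.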
