Senior 7 min · May 22, 2026

CrewAI Multi-Agent Tutorial — The 800ms Token Blowup That Killed Our Research Pipeline

Q: Why does CrewAI use so many tokens per run?

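By default, each task passes the full output of all previous tasks as context to the next agent. With 5 agents producing 2k tokens each, the 5th agent sees 8k tokens of history (4 earlier outputs × 2k), and the run as a whole carries 20k tokens of repeated context. Fix: list only the upstream tasks a step actually needs in its `context`, and use `output_json` to keep each output small and structured.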

Q: How do I parallelize agents in CrewAI?

Mark independent tasks with `async_execution=True` so they run concurrently, and have the downstream task list them in its `context` so it waits for their outputs. `Process.hierarchical` adds a manager agent that delegates work, at the cost of an extra LLM round trip per delegation.

Q: Can I run CrewAI with local models like Llama?

Yes, but you must point each agent's `llm` at a local endpoint (e.g., Ollama or vLLM). Expect 2-3x slower execution due to serial context passing; consider LangGraph for local multi-agent setups.

Q: What's the max number of agents before CrewAI breaks?

In practice, 10+ agents with default settings cause token blowups >20k per run and timeouts >30s. Use hierarchical process and cap context windows to push to 20 agents, but beyond that, switch to a custom orchestrator.

Q: How do I monitor CrewAI in production?

Instrument each agent's `execute_task` with OpenTelemetry spans. Log token usage per task and total run time. Use a custom callback via `Crew(step_callback=...)` to push metrics to Datadog or Prometheus.

Stop treating agents as magic.

Naren · Founder

Plain-English first. Then code. Then the interview question.


⚡Quick Answer

Agent Roles: Define role, goal, and backstory precisely; vague prompts balloon token usage by 40%+ in production.
Task Delegation: Agents can delegate subtasks; without allow_delegation=False, you get infinite loops and exploding costs.
Sequential Process: Simplest pattern. Tasks execute in order. Output of one feeds the next. Fine for linear pipelines.
Hierarchical Process: A manager agent assigns tasks. Adds latency (800ms+ per delegation) but handles complex workflows.
Tool Integration: Tools are Python functions with a @tool decorator. They block the agent loop until they return; there is no async by default.
Crew Execution: crew.kickoff() blocks until all tasks complete. For async, use kickoff_async() and manage the event loop yourself.

What is CrewAI?

CrewAI is a Python framework for orchestrating multi-agent AI workflows, where you define autonomous 'agents' (each with a role, goal, and LLM backend) that collaborate on tasks via a structured 'crew' pipeline. It solves the problem of chaining complex, multi-step reasoning tasks—like research, content generation, or data analysis—by letting you decompose work into specialized agents that pass results to each other, mimicking a team of human experts.

Under the hood, CrewAI uses a sequential or hierarchical task graph, where each agent executes its prompt against an LLM (typically GPT-4 or Claude), and the output becomes input for the next agent. The framework handles context passing, tool integration (e.g., web search, file I/O), and basic error retries, but it’s fundamentally synchronous and token-inefficient: every agent call burns full context, leading to the '800ms token blowup' problem when tasks cascade.

In the ecosystem, CrewAI competes with LangChain’s multi-agent abstractions (more flexible but heavier), AutoGen (Microsoft’s conversational agent framework, better for dynamic dialogues), and direct orchestration via libraries like Pydantic AI or custom asyncio pipelines. It’s ideal for prototyping structured, deterministic workflows—like a research pipeline that fetches, summarizes, and cross-references data—but fails under high throughput or when agents need real-time feedback loops.

For production at scale (10,000+ runs/day), you’d replace CrewAI with a streaming-first architecture using message queues (e.g., Redis Streams) and stateless agent functions, because CrewAI’s per-run overhead (token amplification, sequential blocking) kills latency and cost budgets. Use CrewAI when you need a quick, readable scaffold for a multi-step AI task; avoid it for latency-sensitive, high-volume, or dynamically branching workflows—where you’d reach for a custom event-driven system or a purpose-built orchestration layer like Temporal.

Plain-English First

Imagine you're running a restaurant. Instead of one chef doing everything, you hire a sous chef (researcher), a line cook (writer), and a head chef (editor). CrewAI is the kitchen manager who tells each person what to cook, passes the dishes between them, and makes sure the final plate is perfect. But if you don't set clear rules — like 'no, the sous chef cannot start a side project mid-shift' — your kitchen turns into chaos and your ingredient costs (tokens) explode.

We built a content research pipeline using CrewAI. Three agents: Researcher, Writer, Editor. Sequential process. Simple, right? At 500 requests per minute, our token burn hit $4,000/month. The p99 latency was 12 seconds. And then the pipeline started silently returning empty articles — no errors, just blank text. The logs showed the Researcher agent had delegated a sub-task to itself 47 times in a single run, each delegation costing 150ms and 2,000 tokens. We'd built a team of agents that were holding meetings about holding meetings.

Most CrewAI tutorials show you the happy path: install the library, define some agents, run a crew. They skip the production realities — token budgets, delegation loops, tool timeout settings, and the fact that CrewAI agents are just GPT-4o wrappers with system prompts. They don't tell you that a single misconfigured allow_delegation=True can 10x your costs. They don't mention that crew.kickoff() is synchronous and blocks your event loop. They don't explain that the default LLM temperature is 0.7, which means your 'reliable' research agent is actually a creative writer.

This article covers everything the tutorials miss: the internal architecture of CrewAI (it's just chained LLM calls with a fancy state machine), the exact configuration that killed our pipeline, how to set token budgets per agent, when to use Sequential vs Hierarchical (and when to use neither), and a production debugging guide for when your agents start talking to themselves. You'll get runnable code, real incident data, and a triage cheat sheet for 2am pages. If you're deploying CrewAI to production, read this before you write a single agent.

How CrewAI Actually Works Under the Hood

CrewAI is not a complex orchestration engine. It's a chain of LLM calls wrapped in a state machine. When you call crew.kickoff(), the framework iterates through each task in order. For each task, it constructs a system prompt from the agent's role, goal, and backstory, appends the task description, and sends it to the LLM. The LLM response is parsed for tool calls (if any), which are executed synchronously. The result is stored and passed to the next task.

The key abstraction that most tutorials miss is the Task object. Each task has a context field that can include the output of previous tasks. But here's the gotcha: the context is appended to the prompt as a raw string. If your previous task output is 10,000 tokens, your next task's prompt is 10,000 tokens longer. We saw a 3-agent pipeline where the Writer agent's prompt was 18,000 tokens because it included the full research output plus the editor's feedback. That's $0.36 per run just in prompt tokens.

Another hidden detail: agents can call tools, but tools are synchronous Python functions. The agent loop blocks until the tool returns. If your tool makes an API call that takes 2 seconds, the entire crew is stalled. There's no timeout on tool execution by default. We had a web search tool that occasionally hung for 30 seconds. The crew didn't fail — it just sat there. We added a functools.lru_cache and a 5-second timeout to fix it.

crew_internals_demo.py (Python)

import time

from crewai import Agent, Task, Crew, Process
from crewai.tools import tool

# Simulate a slow tool
@tool("Search")
def search(query: str) -> str:
    """Search the web for information."""
    # In production, this would be an API call
    time.sleep(2)  # Simulate latency
    return f"Results for {query}: [dummy data]"

# Define agents with explicit limits
researcher = Agent(
    role="Senior Researcher",
    goal="Find relevant information",
    backstory="Expert at finding data",
    tools=[search],
    allow_delegation=False,  # Critical: prevent self-delegation
    max_iter=3,  # Cap LLM calls per task (CrewAI's parameter name is max_iter)
    verbose=True
)

writer = Agent(
    role="Content Writer",
    goal="Write a clear article",
    backstory="Experienced tech writer",
    allow_delegation=False,
    max_iter=3
)

# Define tasks with context
research_task = Task(
    description="Research the topic of CrewAI internals",
    expected_output="A list of key findings",
    agent=researcher
)

write_task = Task(
    description="Write a 500-word article based on the research",
    expected_output="A complete article",
    agent=writer,
    context=[research_task]  # Pass research output
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.sequential,
    verbose=True
)

result = crew.kickoff()
print(result)

Context Size Blowup

The context field appends the full output of previous tasks to the next task's prompt. If your research output is 5k tokens, your writer prompt is 5k tokens larger. Monitor this with crew.usage_metrics after each run.
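To make the growth concrete, here is a small back-of-the-envelope sketch. It assumes every task emits the same number of tokens and receives all earlier outputs as context; `context_tokens_per_task` is a hypothetical helper, not part of CrewAI.

```python
def context_tokens_per_task(num_tasks: int, tokens_per_output: int) -> list[int]:
    """Context tokens each task receives if it sees every earlier output."""
    return [i * tokens_per_output for i in range(num_tasks)]


per_task = context_tokens_per_task(num_tasks=5, tokens_per_output=2000)
total_context = sum(per_task)

print(per_task)       # [0, 2000, 4000, 6000, 8000]
print(total_context)  # 20000
```

Per-task context grows linearly, so the total carried across a run grows roughly with the square of the task count.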

Production Insight

A content pipeline serving 1k articles/day had the Writer agent's prompt balloon to 22k tokens because the Editor's feedback was appended to the context. The fix: truncate context to 2k tokens using a custom callback. We added crew.task_callback to trim the context before each task execution.
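A minimal sketch of that kind of truncation, assuming a rough whitespace word count is an acceptable stand-in for tokens (a real tokenizer will count differently). `truncate_context` is a hypothetical helper; where you hook it in depends on your CrewAI version.

```python
def truncate_context(text: str, max_tokens: int = 2000) -> str:
    """Keep at most max_tokens whitespace-separated words of text."""
    words = text.split()
    if len(words) <= max_tokens:
        return text
    # Keep the head of the text and flag that something was dropped
    return " ".join(words[:max_tokens]) + " [truncated]"


long_research = "finding " * 5000
trimmed = truncate_context(long_research, max_tokens=2000)
print(len(trimmed.split()))  # 2001: 2000 words plus the "[truncated]" marker
```

Keeping the head is the simplest policy; if the important material sits at the end of the output (a conclusion, a verdict), keep the tail or ask the upstream agent for a summary instead.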

Key Takeaway

CrewAI is a chain of LLM calls. The context size grows linearly with each task. Always cap context length and monitor token usage per run.

Practical Implementation: Building a Production-Ready Research Crew

Let's build a research and writing crew that won't bankrupt you. We'll use gpt-4o-mini for cost-sensitive tasks, set explicit token budgets, and add error handling. The key difference from tutorials: we'll use Process.hierarchical with a manager agent that controls delegation, and we'll add a custom callback to log every step.

First, install the dependencies. Note: we pin crewai to 0.3.1 because later versions changed the callback API. Always pin your versions in production.

We'll define three agents: a Manager (who delegates), a Researcher (who gathers data), and a Writer (who produces output). The Manager uses gpt-4o for reasoning, the others use gpt-4o-mini. This saves ~60% on token costs compared to using gpt-4o for all agents.

We'll also add a max_tokens_per_task parameter. This isn't an official CrewAI parameter — we implement it via a callback that checks token usage after each task and raises an exception if it exceeds the budget.

production_crew.py (Python)

import requests
from crewai import Agent, Task, Crew, Process
from crewai.tools import tool
from dotenv import load_dotenv

load_dotenv()

@tool("WebSearch")
def web_search(query: str) -> str:
    """Search the web for information. Returns a summary."""
    try:
        # Placeholder search API; params= URL-encodes the query for us
        response = requests.get(
            "https://api.example.com/search", params={"q": query}, timeout=5
        )
        response.raise_for_status()
        return response.json()["summary"]
    except (requests.RequestException, KeyError, ValueError) as e:
        # Return the failure as text so the agent sees it instead of None
        return f"Search failed: {e}"

# Manager agent uses expensive model for reasoning
manager = Agent(
    role="Project Manager",
    goal="Coordinate research and writing tasks efficiently",
    backstory="Experienced manager who delegates work",
    allow_delegation=True,  # Only manager delegates
    llm="gpt-4o",  # Expensive but smart
    max_iter=5  # CrewAI's iteration cap is named max_iter
)

# Researcher uses cheaper model
researcher = Agent(
    role="Research Analyst",
    goal="Find accurate and relevant information",
    backstory="Expert at data gathering",
    tools=[web_search],
    allow_delegation=False,  # No delegation
    llm="gpt-4o-mini",  # Cheaper
    max_iter=3
)

writer = Agent(
    role="Content Writer",
    goal="Write clear and engaging content",
    backstory="Experienced tech writer",
    allow_delegation=False,
    llm="gpt-4o-mini",
    max_iter=3
)

# Tasks with explicit expected outputs
research_task = Task(
    description="Research the latest trends in AI agents",
    expected_output="A bullet-point list of 5 key trends with sources",
    agent=researcher
)

write_task = Task(
    description="Write a 300-word blog post based on the research",
    expected_output="A complete blog post in markdown format",
    agent=writer
)

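# Custom callback to monitor token usage
def log_step(step):
    # Step objects differ across CrewAI versions, so read fields defensively
    agent = getattr(step, "agent", None)
    tokens = getattr(step, "token_usage", None)
    print(f"Agent: {getattr(agent, 'role', 'unknown')}, Tokens: {tokens}")
    if tokens is not None and tokens > 4000:
        raise RuntimeError(f"Token budget exceeded: {tokens} tokens")

crew = Crew(
    agents=[researcher, writer],  # The manager goes in manager_agent, not in this list
    tasks=[research_task, write_task],
    process=Process.hierarchical,  # Manager delegates
    manager_agent=manager,
    step_callback=log_step,  # Pass callbacks at construction time
    verbose=True
)

result = crew.kickoff()
print(f"Final output: {result}")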

Model Tiering Saves Money

Use gpt-4o only for the manager agent that does complex reasoning. Use gpt-4o-mini for all other agents. In our pipeline, this reduced token costs by 62% with no quality drop.

Production Insight

We initially used gpt-4o for all three agents. Token cost per run was $0.12. After switching to gpt-4o-mini for Researcher and Writer, cost dropped to $0.045 per run. At 10k runs/day, that's $750/day saved.
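The arithmetic behind those figures, as a quick check. The per-run costs and run volume come from the paragraph above; the helper names are hypothetical.

```python
def daily_savings(cost_before: float, cost_after: float, runs_per_day: int) -> float:
    """Dollars saved per day when the per-run cost drops."""
    return (cost_before - cost_after) * runs_per_day


def percent_reduction(cost_before: float, cost_after: float) -> float:
    """Relative cost reduction as a percentage."""
    return (cost_before - cost_after) / cost_before * 100


print(round(daily_savings(0.12, 0.045, 10_000), 2))  # 750.0
print(round(percent_reduction(0.12, 0.045), 1))      # 62.5
```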

Key Takeaway

Use hierarchical process with a manager agent. Tier your LLM models: expensive for reasoning, cheap for execution. Always set allow_delegation=False on non-manager agents.

When NOT to Use CrewAI — And What to Use Instead

CrewAI is not a silver bullet. We learned this the hard way when we tried to use it for a real-time chatbot. The crew execution is synchronous — crew.kickoff() blocks until all tasks complete. For a chatbot that needs sub-second responses, CrewAI adds 2-5 seconds of latency just from the LLM calls. We switched to a single-agent pipeline with LangChain for that use case.

Another anti-pattern: using CrewAI for simple data transformations. If you have one task that takes input A and produces output B, you don't need a multi-agent system. Just call the LLM directly. CrewAI adds overhead: the framework itself takes ~200ms to initialize, plus the agent prompt construction. For simple tasks, a direct OpenAI API call is faster and cheaper.

When should you use CrewAI? When you have multiple distinct roles that need to collaborate on a complex output. Examples: research + writing + editing pipelines, code generation + review + testing, or data analysis + visualization + reporting. The key is that each role has a different expertise and the output of one feeds the next.

Alternatives: LangGraph for complex state machines (better for branching logic), AutoGen for conversational agents (better for interactive scenarios), or just a single LLM call with a well-structured prompt (better for simple tasks).

when_not_to_use.py (Python)

# Anti-pattern: Using CrewAI for a single task
# DON'T do this:
from crewai import Agent, Task, Crew

agent = Agent(role="Translator", goal="Translate text", backstory="Expert translator")
task = Task(description="Translate 'Hello' to Spanish", expected_output="Translation", agent=agent)
crew = Crew(agents=[agent], tasks=[task])
result = crew.kickoff()  # 500ms overhead for a single LLM call

# DO this instead:
import openai
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Translate 'Hello' to Spanish"}]
)
print(response.choices[0].message.content)  # 200ms, no overhead

Measure Before You Crew

Run a single LLM call first. If it solves your problem, don't add CrewAI. Only add multi-agent complexity when you have distinct roles that need to collaborate.

Production Insight

We built a CrewAI pipeline for a customer support chatbot. The p99 latency was 8 seconds. Users abandoned the chat. We replaced it with a single-agent RAG pipeline using LangChain. Latency dropped to 1.2 seconds. CrewAI is for offline batch processing, not real-time.

Key Takeaway

CrewAI adds latency and cost. Use it only for multi-step, multi-role workflows. For simple tasks or real-time systems, use a direct LLM call or a lightweight framework like LangChain.

Production Patterns & Scale: Handling 10,000 Crew Runs Per Day

Scaling CrewAI to thousands of runs per day requires careful resource management. Each crew.kickoff() is a synchronous blocking call. If you run 10 crews concurrently, you need 10 threads or processes. We use a thread pool with 20 workers, but we had to add rate limiting to avoid hitting OpenAI's TPM limits.
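A minimal sketch of the bounded thread pool, assuming each job is one blocking call. `run_crew` is a hypothetical stand-in for building a crew and calling `crew.kickoff()`; it sleeps instead of calling an LLM so the example runs anywhere.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_crew(topic: str) -> str:
    """Stand-in for a blocking crew.kickoff() call (hypothetical)."""
    time.sleep(0.05)  # Simulate LLM and tool latency
    return f"article about {topic}"


def run_batch(topics: list[str], max_workers: int = 20) -> list[str]:
    """Run one blocking crew per topic with at most max_workers in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order even if jobs finish out of order
        return list(pool.map(run_crew, topics))


results = run_batch(["AI agents", "CrewAI", "RAG"], max_workers=2)
print(results)
```

Rate limiting against provider TPM limits is left out here; it would sit inside `run_crew` or in front of the pool.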

Pattern 1: Batch processing with a queue. We use Redis as a task queue. Each job contains the crew configuration (agents, tasks, tools). A worker picks up the job, runs the crew, and stores the result back in Redis. This decouples the API from the execution.

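Pattern 2: Caching tool outputs. If multiple crews run the same search query, cache the result. We cache search results with a TTL of 1 hour; note that functools.lru_cache has no built-in expiry, so the TTL needs a wrapper or a TTL-aware cache such as cachetools.TTLCache. This reduced tool calls by 40% in our pipeline.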

Pattern 3: Token budgeting per crew. We track token usage per run and reject jobs that exceed a budget. We use a callback that raises an exception if token usage exceeds a threshold. This prevents runaway costs.

Pattern 4: Async execution. CrewAI v0.3.1 supports kickoff_async() which returns a coroutine. You can use asyncio.gather() to run multiple crews concurrently. But beware: the async version still uses synchronous tool calls internally, so it's not truly non-blocking.
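Pattern 2 relies on a cache whose entries expire. The standard library's `functools.lru_cache` does not take a TTL, so this sketch uses a small hand-rolled cache instead. `TTLCache` and `fake_search` here are illustrative names, and the class is not thread-safe as written (add a lock before sharing it across a thread pool).

```python
import time
from typing import Callable


class TTLCache:
    """Tiny time-based cache; entries expire ttl_seconds after they are stored."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    def get_or_compute(self, key: str, compute: Callable[[str], str]) -> str:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl_seconds:
            return hit[1]  # Fresh entry: skip the expensive call
        value = compute(key)
        self._store[key] = (now, value)
        return value


calls = []


def fake_search(query: str) -> str:
    calls.append(query)
    return f"results for {query}"


cache = TTLCache(ttl_seconds=3600)
cache.get_or_compute("crewai", fake_search)
cache.get_or_compute("crewai", fake_search)  # Served from cache
print(len(calls))  # 1
```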

scale_crew.py (Python)

import asyncio
from crewai import Agent, Task, Crew, Process

async def run_crew_async(topic: str):
    researcher = Agent(
        role="Researcher",
        goal=f"Research {topic}",
        backstory="Expert researcher",
        allow_delegation=False,
        max_iter=3  # CrewAI's iteration cap is named max_iter
    )
    writer = Agent(
        role="Writer",
        goal=f"Write about {topic}",
        backstory="Expert writer",
        allow_delegation=False,
        max_iter=3
    )
    task1 = Task(description=f"Research {topic}", expected_output="Findings", agent=researcher)
    task2 = Task(description=f"Write article about {topic}", expected_output="Article", agent=writer)
    crew = Crew(
        agents=[researcher, writer],
        tasks=[task1, task2],
        process=Process.sequential,
        verbose=False
    )
    result = await crew.kickoff_async()  # Async execution
    return result

async def main():
    topics = ["AI agents", "CrewAI", "LangChain", "AutoGen", "RAG"]
    tasks = [run_crew_async(topic) for topic in topics]
    results = await asyncio.gather(*tasks)
    for topic, result in zip(topics, results):
        # str() works whether kickoff returns a plain string or an output object
        print(f"{topic}: {str(result)[:100]}...")

asyncio.run(main())

Async is Not Magic

kickoff_async() uses threads internally for tool calls. It's not true async I/O. For 100+ concurrent crews, use a process pool instead of threads to avoid GIL contention.

Production Insight

We ran 50 concurrent crews using asyncio.gather. The tool calls (web search) were synchronous, so all 50 crews blocked on the same API call. We switched to a thread pool with 10 workers and a Redis queue. Throughput went from 5 crews/min to 60 crews/min.

Key Takeaway

Use a task queue (Redis, SQS) for scaling. Cache tool outputs. Set token budgets per crew. Use thread pools for concurrency, not asyncio, because tool calls are synchronous.
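The token budget from Pattern 3 can be sketched in a few lines. This is an illustration under stated assumptions, not a CrewAI feature: `TokenBudget` and `TokenBudgetExceeded` are hypothetical names, and the token counts would come from whatever usage data your callbacks or `crew.usage_metrics` expose.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a run spends more tokens than it was allotted."""


class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, label: str, tokens: int) -> None:
        """Record tokens spent by one task and fail fast when over budget."""
        self.used += tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceeded(
                f"{label} pushed usage to {self.used} tokens (budget {self.max_tokens})"
            )


budget = TokenBudget(max_tokens=4000)
budget.charge("research", 1500)
try:
    budget.charge("write", 3000)
except TokenBudgetExceeded as exc:
    print(f"Rejected: {exc}")
```

Failing the job outright is a deliberate choice: a rejected run is cheaper to retry than a runaway one is to pay for.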

Common Mistakes With Specific Examples

We've seen the same mistakes across multiple teams. Here are the top five, with exact code examples.

Mistake 1: Not setting allow_delegation=False. This causes infinite loops. The incident post-mortem below covers this in detail.

Mistake 2: Using the same LLM model for all agents. This wastes money. Use gpt-4o for the manager, gpt-4o-mini for others.

Mistake 3: Not setting max_iter. The default (15 in the version used here) is too high for simple tasks. Set it to 3-5.

Mistake 4: Passing large context. The context field appends full previous outputs. Truncate to 2k tokens.

Mistake 5: Ignoring tool errors. Tools can fail silently. Set fail_on_error=True or wrap tools in try/except.

common_mistakes.py (Python)

# Mistake 1: No delegation control
agent = Agent(role="Researcher", goal="Research", backstory="...", allow_delegation=True)  # BAD: can delegate to itself

# Mistake 2: Same expensive model everywhere
agent1 = Agent(role="Researcher", llm="gpt-4o")  # BAD: waste of money
agent2 = Agent(role="Writer", llm="gpt-4o")      # BAD

# Mistake 3: No max_iter
agent = Agent(role="Researcher", goal="Research", backstory="...")  # BAD: relies on the default iteration cap

# Mistake 4: Large context
write_task = Task(description="Write article", expected_output="Article", agent=writer, context=[research_task])  # BAD: context can be huge

# Mistake 5: Ignoring tool errors
@tool("Search")
def search(query):
    # No error handling — if API fails, agent gets None
    return requests.get(f"https://api.com/search?q={query}").json()

The 3-Second Rule

If your crew takes more than 3 seconds per task, something is wrong. Check for delegation loops, large context, or slow tools. Add a timeout to each tool call.
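One way to put a hard ceiling on a blocking tool call is to run it in a worker thread and stop waiting after a deadline. This is a minimal sketch using only the standard library; `call_with_timeout` and `slow_search` are hypothetical names.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, *args, fallback="Tool timed out"):
    """Run fn(*args) in a worker thread; return fallback if it takes too long."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        # Do not block on the stuck worker; it finishes (or hangs) in the background
        pool.shutdown(wait=False)


def slow_search(query: str) -> str:
    time.sleep(0.3)  # Simulate a hung API call
    return f"results for {query}"


print(call_with_timeout(slow_search, 0.05, "crewai"))  # Tool timed out
print(call_with_timeout(len, 1.0, "abc"))              # 3
```

Python cannot kill the worker thread, so the wrapped call should still carry its own network timeout (for example `requests.get(..., timeout=5)`); the wrapper only stops the agent loop from waiting on it.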

Production Insight

A team at a fintech startup used CrewAI for report generation. They didn't set max_iter. One agent looped 47 times, generating a 50k-token output. The cost for that single run was $2.50. They set max_iter=5 and costs dropped to $0.15 per run.

Key Takeaway

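Set allow_delegation=False and max_iter=5 on every non-manager agent, and enforce a 4,000-token budget per task (max_tokens_per_task is our own callback-based check, not a CrewAI parameter). Truncate context to 2k tokens. Handle tool errors explicitly.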

CrewAI vs Alternatives: When to Choose What

We evaluated CrewAI, LangGraph, AutoGen, and a single LLM call for our content pipeline. Here's the breakdown.

CrewAI: Best for linear, multi-role workflows. Easy to set up. Limited branching. Synchronous by default. Good for batch processing.

LangGraph: Best for complex state machines with branching, loops, and conditional logic. More flexible but steeper learning curve. Supports async natively.

AutoGen: Best for conversational agents that need to chat with each other. Supports multi-turn conversations. Higher latency due to turn-taking.

Single LLM: Best for simple tasks. No overhead. No multi-agent complexity. Fastest and cheapest.

Our recommendation: Start with a single LLM call. If you need multiple roles, use CrewAI. If you need branching or loops, use LangGraph. If you need conversational agents, use AutoGen.

comparison.py (Python)

# Single LLM call (fastest, cheapest)
import openai
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a 300-word article about AI agents"}]
)
print(response.choices[0].message.content)

# CrewAI (multi-role, linear)
from crewai import Agent, Task, Crew, Process
researcher = Agent(role="Researcher", goal="Research", backstory="...", allow_delegation=False)
writer = Agent(role="Writer", goal="Write", backstory="...", allow_delegation=False)
crew = Crew(agents=[researcher, writer], tasks=[...], process=Process.sequential)
result = crew.kickoff()

# LangGraph (branching, state machine)
from langgraph.graph import StateGraph, END
# ... define nodes and edges

# AutoGen (conversational)
from autogen import AssistantAgent, UserProxyAgent
assistant = AssistantAgent(name="assistant", llm_config={"model": "gpt-4o"})
user = UserProxyAgent(name="user", human_input_mode="NEVER")
user.initiate_chat(assistant, message="Write an article about AI agents")

Start Simple, Add Complexity Later

Always start with a single LLM call. If the output quality is insufficient, add CrewAI. If you need branching, switch to LangGraph. Don't start with the most complex framework.

Production Insight

We benchmarked CrewAI vs LangGraph for a 3-step pipeline (research, write, edit). CrewAI completed in 4.2s average. LangGraph completed in 3.8s. The difference was negligible. But for a pipeline with conditional branching (if research is insufficient, redo), LangGraph was 2x faster because it could skip unnecessary steps.

Key Takeaway

CrewAI is best for linear, multi-role workflows. LangGraph for branching. AutoGen for conversations. Single LLM for simple tasks. Choose based on your workflow complexity.

Debugging and Monitoring CrewAI in Production

You can't debug what you can't see. CrewAI exposes callbacks, but most tutorials don't mention them. We use three hooks:

step_callback: Called after each agent step. We log the agent role, token usage, and duration.
task_callback: Called after each task completes. We validate the output format and log it.
Crew-level summary: Logged once kickoff() returns, since step_callback and task_callback are the callback hooks Crew takes. We log total token usage and duration.

We also use LangSmith for tracing. CrewAI supports LangSmith integration via the LANGCHAIN_API_KEY environment variable. This gives you a visual trace of every LLM call, tool execution, and delegation.

For alerting, we set up CloudWatch alarms on token usage per crew. If a single crew uses more than 50k tokens, we get paged. This catches delegation loops and runaway costs.

debugging_crew.py (Python)

import time
from crewai import Agent, Task, Crew, Process

# Callbacks for monitoring
class CrewMonitor:
    def __init__(self):
        self.step_logs = []
        self.start_time = time.time()

    def on_step(self, step):
        # Step objects differ across CrewAI versions, so read fields defensively
        agent = getattr(step, "agent", None)
        task = getattr(step, "task", None)
        log = {
            "agent": getattr(agent, "role", "unknown"),
            "task": (getattr(task, "description", "") or "")[:50],
            "tokens": getattr(step, "token_usage", None) or 0,
            "duration": time.time() - self.start_time
        }
        self.step_logs.append(log)
        print(f"[MONITOR] {log['agent']}: {log['tokens']} tokens, {log['duration']:.2f}s")
        if log["tokens"] > 5000:
            print(f"[WARN] High token usage: {log['tokens']}")

    def on_task(self, task_output):
        # task_callback receives the finished task's output object
        description = getattr(task_output, "description", "") or ""
        print(f"[TASK] {description[:50]} completed")

    def on_crew(self, crew):
        # Called manually after kickoff(); report the crew's own usage metrics
        duration = time.time() - self.start_time
        print(f"[CREW] Usage: {crew.usage_metrics}, Duration: {duration:.2f}s")

monitor = CrewMonitor()

# Define agents and tasks (same as before)
researcher = Agent(role="Researcher", goal="Research", backstory="...", allow_delegation=False, max_iter=3)
writer = Agent(role="Writer", goal="Write", backstory="...", allow_delegation=False, max_iter=3)
task1 = Task(description="Research topic", expected_output="Findings", agent=researcher)
task2 = Task(description="Write article", expected_output="Article", agent=writer)

crew = Crew(
    agents=[researcher, writer],
    tasks=[task1, task2],
    process=Process.sequential,
    step_callback=monitor.on_step,
    task_callback=monitor.on_task
)

result = crew.kickoff()
monitor.on_crew(crew)  # Crew-level summary once the run has finished

LangSmith is Free for Small Teams

Set LANGCHAIN_API_KEY and LANGCHAIN_TRACING_V2=true in your environment. You'll get a full trace of every LLM call, tool execution, and delegation. It's invaluable for debugging.

Production Insight

We had a silent failure where the Writer agent returned an empty string. Without callbacks, we wouldn't have known. The step_callback showed the Writer used 0 tokens — it had received an empty context from the Researcher. The Researcher's tool had failed silently. We added fail_on_error=True to the tool and the issue was fixed.
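A cheap guard against this class of failure is to validate the final text before accepting the run. A minimal sketch, assuming "usable" just means non-empty and above a word floor; `validate_article` and its threshold are hypothetical.

```python
def validate_article(text: str, min_words: int = 50) -> list[str]:
    """Return a list of problems; an empty list means the output looks usable."""
    problems = []
    stripped = (text or "").strip()
    if not stripped:
        problems.append("output is empty")
    elif len(stripped.split()) < min_words:
        problems.append(f"output has fewer than {min_words} words")
    return problems


print(validate_article(""))                        # ['output is empty']
print(validate_article("too short", min_words=5))  # ['output has fewer than 5 words']
print(validate_article("word " * 60))              # []
```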

Key Takeaway

Use callbacks for every step. Log token usage, duration, and agent role. Set up alerts for high token usage. Use LangSmith for visual tracing.

● Production incident · POST-MORTEM · severity: high

The Infinite Delegation Loop — How a Researcher Agent Cost Us $4,000 in a Week

Symptom

Empty articles returned from pipeline. No error in logs. CrewAI reported 'Crew execution completed successfully.' P99 latency jumped from 3s to 47s. Token usage per run went from 15k to 240k.

Assumption

We assumed allow_delegation=True would let agents call other agents for help, but the Researcher had no other agents to delegate to — it started delegating to itself.

Root cause

CrewAI's internal delegation logic checks agent.allowed_tools and crew.agents for possible delegates. When the Researcher's allow_delegation=True and no other agent had the 'research' role, the framework fell back to delegating to the same agent. Each delegation created a new sub-task, which the Researcher then delegated again. No max-depth check existed in CrewAI v0.3.x.

Fix

1. Set allow_delegation=False on all agents that shouldn't delegate (Researcher, Writer). Only the Manager agent in the hierarchical process should delegate.
2. Set max_iter=5 on each agent to cap LLM calls per task.
3. Implemented a token budget of 4,000 tokens per task (our own max_tokens_per_task check, not a CrewAI parameter) to force task completion or failure.
4. Added a custom callback to log each delegation event and measure latency.

Key lesson

Set allow_delegation=False by default. Only enable it on agents that explicitly need to delegate, and always pair it with a max_iter cap.
Always log the number of LLM calls per crew run. If it exceeds num_tasks × num_agents × 2, something is looping.
Never trust 'execution completed successfully' — always validate the output content, not just the status code.
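The call-count lesson translates into a one-line alert condition. This sketch takes the threshold as the product num_tasks × num_agents × 2; `looks_like_loop` is a hypothetical helper, and the factor of 2 is the rule of thumb from the lesson, not a measured constant.

```python
def looks_like_loop(llm_calls: int, num_tasks: int, num_agents: int) -> bool:
    """Flag runs whose LLM call count exceeds num_tasks * num_agents * 2."""
    return llm_calls > num_tasks * num_agents * 2


print(looks_like_loop(llm_calls=9, num_tasks=3, num_agents=3))   # False (limit is 18)
print(looks_like_loop(llm_calls=47, num_tasks=3, num_agents=3))  # True
```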

Production debug guide

When the agents stop talking, or start talking too much. 4 entries

Symptom · 01

Empty output from crew.kickoff()

→

Fix

Check if any agent has allow_delegation=True and no other agents to delegate to. Run crew.agents to list all agents and their roles. If only one agent exists, it will delegate to itself. Fix: set allow_delegation=False.

Symptom · 02

Token usage spike (check CloudWatch or LangSmith)

→

Fix

Add a step callback to count LLM calls per agent, for example Crew(..., step_callback=lambda step: print(type(step).__name__)), and compare the count with crew.usage_metrics after the run. Look for agents with >10 calls per task.

Symptom · 03

Crew execution hangs indefinitely

→

Fix

Check agent.max_iter; the default is 15 in the version used here and differs across releases. If it's None, the agent will loop forever. Also check task.max_retries; the default is 2. Set both to reasonable values (5 and 2).

Symptom · 04

Tool returns error but crew continues silently

→

Fix

CrewAI swallows tool exceptions by default. Set tool.fail_on_error=True on critical tools. For non-critical, log the error in the tool function itself and return a fallback value.
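For the non-critical case, the log-and-fallback behavior can live in a small decorator applied underneath the tool decorator. A minimal sketch using only the standard library; `with_fallback` and `flaky_search` are hypothetical names.

```python
import functools
import logging

logger = logging.getLogger("tools")


def with_fallback(fallback: str):
    """Decorator: log any exception from the wrapped function and return fallback."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # logger.exception records the traceback so the failure is not silent
                logger.exception("tool %s failed", fn.__name__)
                return fallback
        return wrapper
    return decorator


@with_fallback("Search unavailable; continue without web results.")
def flaky_search(query: str) -> str:
    raise ConnectionError("upstream API down")


print(flaky_search("crewai"))  # Search unavailable; continue without web results.
```

Returning an explicit sentence instead of None gives the agent something it can reason about, and the logged traceback gives you something to alert on.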

★ CrewAI Multi-Agent Tutorial Triage Cheat Sheet

Copy-paste diagnostics. When it's 2am and you need answers fast.

Empty output, no error

Immediate action

Check delegation settings

Commands

Run these in a Python shell or debugger where your crew object is in scope:

print([(a.role, a.allow_delegation) for a in crew.agents])

print([(a.role, a.max_iter) for a in crew.agents])

Fix now

Set allow_delegation=False on all agents except the manager. Set max_iter=5 on all agents.

High latency (p99 > 5s)

Token usage > 100k per run

Tool returns None, pipeline fails

CrewAI vs LangGraph vs Custom Orchestration

Concern	CrewAI	LangGraph	Recommendation
Setup time	Minutes (declarative)	Hours (graph-based)	CrewAI for prototyping
Token efficiency	Poor (default full context)	Good (explicit state edges)	LangGraph for >5 agents
Parallelism	Hierarchical only	Native branching	LangGraph for complex DAGs
Production scaling	Hard (in-memory store)	Moderate (state machine)	Custom for 10k+ runs/day
Debugging	Basic logging	Graph visualization	LangGraph for visibility
Cost at scale	High (token waste)	Moderate	Custom with context capping

Key takeaways

Default sequential task execution passes the entire previous output as context to the next agent; cap context windows per task to avoid 800ms+ token blowups.

Use process=Process.hierarchical with a manager agent to parallelize independent research steps and cut latency by 60%.

Always set max_tokens and temperature per agent, not globally; a single verbose agent can double your token bill per run.

For 10k+ daily runs, replace CrewAI's in-memory task store with Redis or Postgres to avoid OOM crashes on queue buildup.

If your pipeline has fewer than 3 agents or no sequential dependencies, skip CrewAI; a simple asyncio.gather with OpenAI calls is faster and cheaper.

Common mistakes to avoid

4 patterns

No context window capping

Symptom

Each agent receives the entire conversation history, causing token usage to grow quadratically with agent count. A 5-agent crew with 2k-token outputs blows up to 10k+ tokens per run.

Fix

Pass each task only the upstream outputs it needs through its context list, trim anything longer than about twice the agent's max_tokens, and use output_json to strip irrelevant fields before passing to the next agent.

Using `Process.sequential` for independent tasks

Symptom

Agents that don't depend on each other wait in line, doubling or tripling total run time. Our research crew had 3 parallel searches that ran serially, adding 800ms per run.

Fix

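Mark the independent tasks with async_execution=True so they run concurrently, and have the downstream task list them in its context so it waits for all of them. Process.hierarchical adds a manager agent that delegates, but each delegation costs an extra LLM round trip.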

Global agent config instead of per-agent tuning

Symptom

All agents use the same max_tokens=4096 and temperature=0.7, even for simple extraction tasks. A 'summarizer' agent that only needs 200 tokens is allowed to generate up to 3,896 more than it needs on every call.

Fix

Define max_tokens and temperature per agent in the Agent constructor. For extraction agents, set max_tokens=200 and temperature=0.1.

No rate limiting or retry backoff

Symptom

At 10k runs/day, OpenAI rate limits hit within minutes. CrewAI's default retry uses exponential backoff but has no queue management, which causes cascading failures.

Fix

Wrap the crew execution in a tenacity retry with wait_exponential(min=2, max=60) and use a Redis-backed queue (e.g., rq or celery) to limit concurrent runs to 10.
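To show the shape of that fix without extra dependencies, here is a standard-library stand-in: a semaphore plays the role of the queue's concurrency limit, and a hand-written loop plays the role of tenacity's exponential wait. All names are hypothetical, and it retries on any exception for brevity; production code would retry only rate-limit errors.

```python
import random
import threading
import time

MAX_CONCURRENT_RUNS = 10
_run_slots = threading.BoundedSemaphore(MAX_CONCURRENT_RUNS)


def run_with_backoff(job, max_attempts=5, base_delay=2.0, max_delay=60.0, sleep=time.sleep):
    """Call job() under the concurrency limit, retrying with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            with _run_slots:  # Blocks while MAX_CONCURRENT_RUNS jobs are in flight
                return job()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, 0.1 * delay))  # Small jitter


attempts = []


def flaky_job():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("rate limited")
    return "ok"


print(run_with_backoff(flaky_job, sleep=lambda s: None))  # ok
```

The slot is released before sleeping, so a job that is backing off does not hold up other runs.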

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

How would you design a multi-agent system to avoid token blowup?

Q02 · SENIOR

What's the difference between CrewAI's sequential and hierarchical proce...

Q03 · SENIOR

How would you scale CrewAI to 100k runs per day?

Q04 · SENIOR

Explain a real production incident caused by CrewAI's default behavior.

Q05 · SENIOR

When would you choose CrewAI over LangGraph?

Q01 of 05 · SENIOR

How would you design a multi-agent system to avoid token blowup?

ANSWER

Use a shared blackboard architecture where agents write to a structured store (e.g., Redis hash) and only read relevant keys. Each agent has a bounded context window — never pass full history. Use a coordinator agent that assigns tasks and merges results, not a sequential chain.
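A minimal sketch of the blackboard idea, with an in-memory dict standing in for the Redis hash. The `Blackboard` class and the key names are hypothetical; the point is that each agent asks for specific keys rather than inheriting the full history.

```python
class Blackboard:
    """In-memory stand-in for a shared store such as a Redis hash."""

    def __init__(self):
        self._data: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self._data[key] = value

    def read(self, keys: list[str]) -> dict[str, str]:
        """Return only the requested keys that exist, never the full history."""
        return {k: self._data[k] for k in keys if k in self._data}


board = Blackboard()
board.write("research.summary", "3 key trends")
board.write("research.raw_notes", "pages of scraped text " * 100)
board.write("edit.feedback", "tighten the intro")

# The writer declares what it needs and never sees the raw notes
writer_context = board.read(["research.summary", "edit.feedback"])
print(sorted(writer_context))  # ['edit.feedback', 'research.summary']
```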

