CrewAI Multi-Agent Tutorial — The 800ms Token Blowup That Killed Our Research Pipeline
Stop treating agents as magic.
- Agent Roles: Define role, goal, and backstory precisely. Vague prompts balloon token usage by 40%+ in production.
- Task Delegation: Agents can delegate subtasks; without allow_delegation=False, you get infinite loops and exploding costs.
- Sequential Process: Simplest pattern. Tasks execute in order. Output of one feeds the next. Fine for linear pipelines.
- Hierarchical Process: A manager agent assigns tasks. Adds latency (800ms+ per delegation) but handles complex workflows.
- Tool Integration: Tools are Python functions with a @tool decorator. They block the agent loop until they return; no async by default.
- Crew Execution: crew.kickoff() blocks until all tasks complete. For async, use kickoff_async() and manage the event loop yourself.
CrewAI is a Python framework for orchestrating multi-agent AI workflows, where you define autonomous 'agents' (each with a role, goal, and LLM backend) that collaborate on tasks via a structured 'crew' pipeline. It solves the problem of chaining complex, multi-step reasoning tasks—like research, content generation, or data analysis—by letting you decompose work into specialized agents that pass results to each other, mimicking a team of human experts.
Under the hood, CrewAI uses a sequential or hierarchical task graph, where each agent executes its prompt against an LLM (typically GPT-4 or Claude), and the output becomes input for the next agent. The framework handles context passing, tool integration (e.g., web search, file I/O), and basic error retries, but it’s fundamentally synchronous and token-inefficient: every agent call burns full context, leading to the '800ms token blowup' problem when tasks cascade.
In the ecosystem, CrewAI competes with LangChain’s multi-agent abstractions (more flexible but heavier), AutoGen (Microsoft’s conversational agent framework, better for dynamic dialogues), and direct orchestration via libraries like Pydantic AI or custom asyncio pipelines. It’s ideal for prototyping structured, deterministic workflows—like a research pipeline that fetches, summarizes, and cross-references data—but fails under high throughput or when agents need real-time feedback loops.
For production at scale (10,000+ runs/day), you’d replace CrewAI with a streaming-first architecture using message queues (e.g., Redis Streams) and stateless agent functions, because CrewAI’s per-run overhead (token amplification, sequential blocking) kills latency and cost budgets. Use CrewAI when you need a quick, readable scaffold for a multi-step AI task; avoid it for latency-sensitive, high-volume, or dynamically branching workflows—where you’d reach for a custom event-driven system or a purpose-built orchestration layer like Temporal.
Imagine you're running a restaurant. Instead of one chef doing everything, you hire a sous chef (researcher), a line cook (writer), and a head chef (editor). CrewAI is the kitchen manager who tells each person what to cook, passes the dishes between them, and makes sure the final plate is perfect. But if you don't set clear rules — like 'no, the sous chef cannot start a side project mid-shift' — your kitchen turns into chaos and your ingredient costs (tokens) explode.
We built a content research pipeline using CrewAI. Three agents: Researcher, Writer, Editor. Sequential process. Simple, right? At 500 requests per minute, our token burn hit $4,000/month. The p99 latency was 12 seconds. And then the pipeline started silently returning empty articles — no errors, just blank text. The logs showed the Researcher agent had delegated a sub-task to itself 47 times in a single run, each delegation costing 150ms and 2,000 tokens. We'd built a team of agents that were holding meetings about holding meetings.
Most CrewAI tutorials show you the happy path: install the library, define some agents, run a crew. They skip the production realities — token budgets, delegation loops, tool timeout settings, and the fact that CrewAI agents are just GPT-4o wrappers with system prompts. They don't tell you that a single misconfigured allow_delegation=True can 10x your costs. They don't mention that crew.kickoff() is synchronous and blocks your event loop. They don't explain that the default LLM temperature is 0.7, which means your 'reliable' research agent is actually a creative writer.
This article covers everything the tutorials miss: the internal architecture of CrewAI (it's just chained LLM calls with a fancy state machine), the exact configuration that killed our pipeline, how to set token budgets per agent, when to use Sequential vs Hierarchical (and when to use neither), and a production debugging guide for when your agents start talking to themselves. You'll get runnable code, real incident data, and a triage cheat sheet for 2am pages. If you're deploying CrewAI to production, read this before you write a single agent.
How CrewAI Actually Works Under the Hood
CrewAI is not a complex orchestration engine. It's a chain of LLM calls wrapped in a state machine. When you call crew.kickoff(), the framework iterates through each task in order. For each task, it constructs a system prompt from the agent's role, goal, and backstory, appends the task description, and sends it to the LLM. The LLM response is parsed for tool calls (if any), which are executed synchronously. The result is stored and passed to the next task.
The key abstraction that most tutorials miss is the Task object. Each task has a context field that can include the output of previous tasks. But here's the gotcha: the context is appended to the prompt as a raw string. If your previous task output is 10,000 tokens, your next task's prompt is 10,000 tokens longer. We saw a 3-agent pipeline where the Writer agent's prompt was 18,000 tokens because it included the full research output plus the editor's feedback. That's $0.36 per run just in prompt tokens.
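The arithmetic behind that growth can be sketched in a few lines. This is a minimal illustration, not CrewAI code: the function names are hypothetical, and the per-token price is not stated in the article but backed out from the $0.36 and 18,000-token figures above.

```python
def prompt_cost_usd(prompt_tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost of the prompt side of a single LLM call."""
    return prompt_tokens / 1000 * usd_per_1k_tokens


# Price implied by the figures above: $0.36 for an 18,000-token prompt.
IMPLIED_USD_PER_1K = 0.36 / 18


def cumulative_prompt_tokens(base_prompt: int, task_outputs: list[int]) -> list[int]:
    """Prompt size of each task when every earlier output is appended as context."""
    sizes = []
    carried = 0  # tokens of earlier outputs carried forward as context
    for output_tokens in task_outputs:
        sizes.append(base_prompt + carried)
        carried += output_tokens
    return sizes
```

With a hypothetical 500-token base prompt and task outputs of 10,000, 7,500, and 1,000 tokens, `cumulative_prompt_tokens(500, [10000, 7500, 1000])` gives prompt sizes of 500, 10,500, and 18,000 tokens: the third task pays for everything the first two produced.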
Another hidden detail: agents can call tools, but tools are synchronous Python functions. The agent loop blocks until the tool returns. If your tool makes an API call that takes 2 seconds, the entire crew is stalled. There's no timeout on tool execution by default. We had a web search tool that occasionally hung for 30 seconds. The crew didn't fail — it just sat there. We added a functools.lru_cache and a 5-second timeout to fix it.
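A timeout around a blocking tool can be sketched with the standard library alone. This is one possible approach under stated assumptions, not the wrapper used in the pipeline above: `call_with_timeout` and its parameters are hypothetical names, and the tool is any plain Python callable.

```python
import concurrent.futures

# Shared pool for tool calls. A module-level pool is used on purpose: a
# `with ThreadPoolExecutor()` block would wait for the hung call on exit.
_TOOL_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, *args, timeout_s=5.0, fallback=None, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread and give up after timeout_s.

    On timeout the fallback is returned. The worker thread cannot be killed;
    it keeps running in the background until the underlying call returns.
    """
    future = _TOOL_POOL.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        future.cancel()  # only has an effect if the call has not started yet
        return fallback
```

Because a timed-out call still occupies a worker until it finishes, the tool's own HTTP client should also carry a timeout; this wrapper only stops the agent loop from waiting.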
The context field appends the full output of previous tasks to the next task's prompt. If your research output is 5k tokens, your writer prompt is 5k tokens larger. Monitor this with crew.usage_metrics after each run.
Use crew.task_callback to trim the context before each task execution.
Practical Implementation: Building a Production-Ready Research Crew
Let's build a research and writing crew that won't bankrupt you. We'll use gpt-4o-mini for cost-sensitive tasks, set explicit token budgets, and add error handling. The key difference from tutorials: we'll use Process.hierarchical with a manager agent that controls delegation, and we'll add a custom callback to log every step.
First, install the dependencies. Note: we pin crewai to 0.3.1 because later versions changed the callback API. Always pin your versions in production.
We'll define three agents: a Manager (who delegates), a Researcher (who gathers data), and a Writer (who produces output). The Manager uses gpt-4o for reasoning, the others use gpt-4o-mini. This saves ~60% on token costs compared to using gpt-4o for all agents.
We'll also add a max_tokens_per_task parameter. This isn't an official CrewAI parameter — we implement it via a callback that checks token usage after each task and raises an exception if it exceeds the budget.
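The budget check itself is small. The sketch below assumes the callback receives a task name and the tokens that task just used; that signature, and the names `make_budget_callback` and `TokenBudgetExceeded`, are hypothetical, and wiring the callback into CrewAI is left out.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a task uses more tokens than its budget allows."""


def make_budget_callback(max_tokens_per_task: int):
    """Build a callback to be called with (task_name, tokens_used) after each task."""
    spent = {}  # running token total per task name

    def check(task_name: str, tokens_used: int) -> None:
        spent[task_name] = spent.get(task_name, 0) + tokens_used
        if spent[task_name] > max_tokens_per_task:
            raise TokenBudgetExceeded(
                f"{task_name} used {spent[task_name]} tokens "
                f"(budget {max_tokens_per_task})"
            )

    check.spent = spent  # expose the totals for logging
    return check
```

Raising an exception rather than logging a warning is the design choice that matters here: a run that blows its budget fails loudly instead of continuing to spend.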
Use gpt-4o only for the manager agent that does complex reasoning. Use gpt-4o-mini for all other agents. In our pipeline, this reduced token costs by 62% with no quality drop.
We initially used gpt-4o for all three agents. Token cost per run was $0.12. After switching to gpt-4o-mini for Researcher and Writer, cost dropped to $0.045 per run. At 10k runs/day, that's $750/day saved.
Set allow_delegation=False on non-manager agents.
When NOT to Use CrewAI (And What to Use Instead)
CrewAI is not a silver bullet. We learned this the hard way when we tried to use it for a real-time chatbot. The crew execution is synchronous: crew.kickoff() blocks until all tasks complete. For a chatbot that needs sub-second responses, CrewAI adds 2-5 seconds of latency just from the LLM calls. We switched to a single-agent pipeline with LangChain for that use case.
Another anti-pattern: using CrewAI for simple data transformations. If you have one task that takes input A and produces output B, you don't need a multi-agent system. Just call the LLM directly. CrewAI adds overhead: the framework itself takes ~200ms to initialize, plus the agent prompt construction. For simple tasks, a direct OpenAI API call is faster and cheaper.
When should you use CrewAI? When you have multiple distinct roles that need to collaborate on a complex output. Examples: research + writing + editing pipelines, code generation + review + testing, or data analysis + visualization + reporting. The key is that each role has a different expertise and the output of one feeds the next.
Alternatives: LangGraph for complex state machines (better for branching logic), AutoGen for conversational agents (better for interactive scenarios), or just a single LLM call with a well-structured prompt (better for simple tasks).
Production Patterns & Scale: Handling 10,000 Crew Runs Per Day
Scaling CrewAI to thousands of runs per day requires careful resource management. Each crew.kickoff() is a synchronous blocking call. If you run 10 crews concurrently, you need 10 threads or processes. We use a thread pool with 20 workers, but we had to add rate limiting to avoid hitting OpenAI's TPM limits.
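A thread pool with a shared rate limiter might look like the sketch below. It is a simplification under stated assumptions: it limits how often runs start rather than counting tokens per minute, and `RateLimiter`, `run_jobs`, and `run_crew` are hypothetical names (`run_crew` stands for whatever function performs one blocking crew run).

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Space out acquisitions so at most `rate` happen per `per_seconds`."""

    def __init__(self, rate: int, per_seconds: float):
        self._interval = per_seconds / rate
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def acquire(self) -> None:
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def run_jobs(run_crew, jobs, workers=20, limiter=None):
    """Run run_crew(job) for every job on a thread pool; results keep job order."""

    def guarded(job):
        if limiter is not None:
            limiter.acquire()
        return run_crew(job)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(guarded, jobs))
```

For example, `run_jobs(run_crew, jobs, workers=20, limiter=RateLimiter(60, 60.0))` would start at most one run per second across all 20 workers.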
Pattern 1: Batch processing with a queue. We use Redis as a task queue. Each job contains the crew configuration (agents, tasks, tools). A worker picks up the job, runs the crew, and stores the result back in Redis. This decouples the API from the execution.
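Pattern 2: Caching tool outputs. If multiple crews run the same search query, cache the result. We cache results in memory with a TTL of 1 hour; functools.lru_cache has no TTL option of its own, so it needs a time-based key or a wrapper to expire entries. This reduced tool calls by 40% in our pipeline.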
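One common way to get expiry out of `functools.lru_cache` is to add a time bucket to the cache key. The decorator below is a sketch of that trick, not necessarily what the pipeline above used; `ttl_cache` is a hypothetical name.

```python
import functools
import time


def ttl_cache(ttl_seconds: float, maxsize: int = 1024):
    """lru_cache variant whose entries live for at most ttl_seconds."""

    def decorator(fn):
        @functools.lru_cache(maxsize=maxsize)
        def cached(bucket, *args, **kwargs):
            # `bucket` is only part of the cache key; fn never sees it.
            return fn(*args, **kwargs)

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # The bucket number changes every ttl_seconds, so older entries
            # stop matching and are eventually evicted by the LRU policy.
            bucket = int(time.monotonic() // ttl_seconds)
            return cached(bucket, *args, **kwargs)

        wrapper.cache_clear = cached.cache_clear
        return wrapper

    return decorator
```

Entries expire at bucket boundaries, so a result may live for less than the full TTL but never longer. Arguments must be hashable, as with plain `lru_cache`, and the cache is per process: workers in separate processes would need a shared store such as Redis instead.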
Pattern 3: Token budgeting per crew. We track token usage per run and reject jobs that exceed a budget. We use a callback that raises an exception if token usage exceeds a threshold. This prevents runaway costs.
Pattern 4: Async execution. CrewAI v0.3.1 supports kickoff_async(), which returns a coroutine. You can use asyncio.gather() to run multiple crews concurrently. But beware: the async version still uses synchronous tool calls internally, so it's not truly non-blocking.
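If the crew call itself is blocking, one way to keep the event loop responsive is to push each run onto a worker thread and cap how many run at once. The sketch below uses only asyncio; `run_crew_blocking` is a hypothetical stand-in that sleeps instead of calling an LLM, and nothing here reflects CrewAI internals.

```python
import asyncio
import time


def run_crew_blocking(job_id: int) -> str:
    """Stand-in for a blocking crew run (hypothetical; sleeps instead of calling an LLM)."""
    time.sleep(0.1)
    return f"result-{job_id}"


async def run_many(job_ids, max_concurrent: int = 10):
    """Run blocking crew calls in worker threads, at most max_concurrent at a time."""
    gate = asyncio.Semaphore(max_concurrent)

    async def one(job_id):
        async with gate:
            # to_thread keeps the event loop free while the blocking call runs
            return await asyncio.to_thread(run_crew_blocking, job_id)

    # gather returns results in the same order as job_ids
    return await asyncio.gather(*(one(j) for j in job_ids))
```

Calling `asyncio.run(run_many(range(50)))` would process 50 jobs with no more than 10 in flight. The semaphore matters: without it, every job is submitted at once and they all contend for the same upstream API.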
kickoff_async() uses threads internally for tool calls. It's not true async I/O. For 100+ concurrent crews, use a process pool instead of threads to avoid GIL contention.
We ran 50 crews at once with asyncio.gather. The tool calls (web search) were synchronous, so all 50 crews blocked on the same API call. We switched to a thread pool with 10 workers and a Redis queue. Throughput went from 5 crews/min to 60 crews/min.
Common Mistakes With Specific Examples
We've seen the same mistakes across multiple teams. Here are the top five.
Mistake 1: Not setting allow_delegation=False. This causes infinite loops. We covered this in the incident section.
Mistake 2: Using the same LLM model for all agents. This wastes money. Use gpt-4o for the manager, gpt-4o-mini for others.
Mistake 3: Not setting max_iterations. Default is 15, which is too high for simple tasks. Set it to 3-5.
Mistake 4: Passing large context. The context field appends full previous outputs. Truncate to 2k tokens.
Mistake 5: Ignoring tool errors. Tools can fail silently. Set fail_on_error=True or wrap tools in try/except.
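Mistakes 4 and 5 can both be handled with small helpers that do not depend on the framework. The sketch below is illustrative only: `truncate_context` and `safe_tool` are hypothetical names, and the 4-characters-per-token ratio is a rough assumption rather than a real tokenizer.

```python
import functools
import logging

logger = logging.getLogger("crew.tools")


def truncate_context(text: str, max_tokens: int = 2000, chars_per_token: int = 4) -> str:
    """Keep roughly the first max_tokens tokens of text, cutting at a word boundary."""
    limit = max_tokens * chars_per_token
    if len(text) <= limit:
        return text
    # Drop the last, possibly partial, word of the clipped text.
    parts = text[:limit].rsplit(None, 1)
    cut = parts[0] if parts else ""
    return cut + " [truncated]"


def safe_tool(fallback):
    """Decorator: log any exception raised by a tool and return fallback instead."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                logger.exception("tool %s failed", fn.__name__)
                return fallback

        return wrapper

    return decorator
```

Returning a fallback suits non-critical tools. For critical ones, letting the exception propagate is usually safer than handing the agent an empty result to reason over.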
One team never set max_iterations. One agent looped 47 times, generating a 50k-token output. The cost for that single run was $2.50. They set max_iterations=5 and costs dropped to $0.15 per run.
Set allow_delegation=False, max_iterations=5, and max_tokens_per_task=4000 on every agent. Truncate context to 2k tokens. Handle tool errors explicitly.
CrewAI vs Alternatives: When to Choose What
We evaluated CrewAI, LangGraph, AutoGen, and a single LLM call for our content pipeline. Here's the breakdown.
CrewAI: Best for linear, multi-role workflows. Easy to set up. Limited branching. Synchronous by default. Good for batch processing.
LangGraph: Best for complex state machines with branching, loops, and conditional logic. More flexible but steeper learning curve. Supports async natively.
AutoGen: Best for conversational agents that need to chat with each other. Supports multi-turn conversations. Higher latency due to turn-taking.
Single LLM: Best for simple tasks. No overhead. No multi-agent complexity. Fastest and cheapest.
Our recommendation: Start with a single LLM call. If you need multiple roles, use CrewAI. If you need branching or loops, use LangGraph. If you need conversational agents, use AutoGen.
Debugging and Monitoring CrewAI in Production
You can't debug what you can't see. CrewAI provides callbacks for every step, but most tutorials don't mention them. We use three callbacks:
- step_callback: Called after each agent completes a step. We log the agent role, token usage, and duration.
- task_callback: Called after each task completes. We validate the output format and log it.
- crew_callback: Called after the entire crew completes. We log total token usage and duration.
We also use LangSmith for tracing. CrewAI supports LangSmith integration via the LANGCHAIN_API_KEY environment variable. This gives you a visual trace of every LLM call, tool execution, and delegation.
For alerting, we set up CloudWatch alarms on token usage per crew. If a single crew uses more than 50k tokens, we get paged. This catches delegation loops and runaway costs.
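The per-run numbers that feed such an alarm can be collected with a small recorder. This is a sketch under stated assumptions: `RunMetrics` is a hypothetical helper, the caller supplies the role and token count for each step, and publishing to CloudWatch is left out.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RunMetrics:
    """Collects per-step token usage and timing for one crew run."""

    alert_threshold: int = 50_000  # tokens per run before paging
    steps: list = field(default_factory=list)
    _started: float = field(default_factory=time.monotonic)

    def record_step(self, agent_role: str, tokens: int) -> None:
        """Store one step with its elapsed time since the run started."""
        elapsed = time.monotonic() - self._started
        self.steps.append({"role": agent_role, "tokens": tokens, "elapsed_s": elapsed})

    @property
    def total_tokens(self) -> int:
        return sum(step["tokens"] for step in self.steps)

    def should_page(self) -> bool:
        """True once the run has used more tokens than the alert threshold."""
        return self.total_tokens > self.alert_threshold
```

Creating one `RunMetrics` per run and calling `record_step` from the step callback gives a total that can be emitted as a metric when the run ends.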
Set LANGCHAIN_API_KEY and LANGCHAIN_TRACING_V2=true in your environment. You'll get a full trace of every LLM call, tool execution, and delegation. It's invaluable for debugging.
We added fail_on_error=True to the tool and the issue was fixed.
The Infinite Delegation Loop: How a Researcher Agent Cost Us $4,000 in a Week
We assumed allow_delegation=True would let agents call other agents for help, but the Researcher had no other agents to delegate to, so it started delegating to itself.
CrewAI looks at agent.allowed_tools and crew.agents for possible delegates. When the Researcher had allow_delegation=True and no other agent had the 'research' role, the framework fell back to delegating to the same agent. Each delegation created a new sub-task, which the Researcher then delegated again. No max-depth check existed in CrewAI v0.3.x.
1. Set allow_delegation=False on all agents that shouldn't delegate (Researcher, Writer). Only the Manager agent in the Hierarchical process should delegate.
2. Added a max_iterations=5 parameter on each agent to cap LLM calls per task.
3. Implemented a token budget: agent.max_tokens_per_task=4000 to force task completion or failure.
4. Added a custom callback to log each delegation event and measure latency.
- Set allow_delegation=False by default. Only enable it on agents that explicitly need to delegate, and always pair it with a max_iterations cap.
- Always log the number of LLM calls per crew run. If it exceeds num_tasks * num_agents * 2, something is looping.
- Never trust 'execution completed successfully'. Always validate the output content, not just the status code.
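The last two checks are easy to automate. The sketch below takes the loop threshold as the product of the task count, the agent count, and 2, which is an interpretation of the rule above; `looks_like_loop`, `validate_output`, and the 200-character minimum are hypothetical.

```python
def looks_like_loop(llm_calls: int, num_tasks: int, num_agents: int, factor: int = 2) -> bool:
    """Flag a run whose LLM call count exceeds num_tasks * num_agents * factor."""
    return llm_calls > num_tasks * num_agents * factor


def validate_output(text, min_chars: int = 200) -> str:
    """Reject blank or suspiciously short output instead of trusting the run status."""
    if text is None or not text.strip():
        raise ValueError("crew returned empty output")
    cleaned = text.strip()
    if len(cleaned) < min_chars:
        raise ValueError(f"crew output too short: {len(cleaned)} chars")
    return cleaned
```

For the three-task, three-agent pipeline described earlier the threshold is 18 calls, so a run with 47 calls is flagged.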
crew.kickoff(): check for allow_delegation=True on an agent with no other agents to delegate to. Inspect crew.agents to list all agents and their roles. If only one agent exists, it will delegate to itself. Fix: set allow_delegation=False.
Log every step: from crewai import Crew; crew.step_callback = lambda step: print(step.agent.role, step.token_usage). Look for agents with >10 calls per task.
Check agent.max_iterations (default is 15). If it's None, the agent will loop forever. Also check task.max_retries (default is 2). Set both to reasonable values (5 and 2).
Set tool.fail_on_error=True on critical tools. For non-critical tools, log the error in the tool function itself and return a fallback value.
python -c "from crewai import Crew; crew = Crew(...); print([a.allow_delegation for a in crew.agents])"
python -c "print([a.max_iterations for a in crew.agents])"
Set allow_delegation=False on all agents except the manager. Set max_iterations=5 on all agents.
Key takeaways
- Use process=Process.hierarchical with a manager agent to parallelize independent research steps and cut latency by 60%.
- Set max_tokens and temperature per agent, not globally.
- For simple tasks, asyncio.gather with OpenAI calls is faster and cheaper.
Common mistakes to avoid
No context window capping
Set context_window on each task to max_tokens * 2 and use output_json to strip irrelevant fields before passing to the next agent.
Using `Process.sequential` for independent tasks
Use Process.hierarchical with a manager agent that dispatches independent tasks via asyncio.gather under the hood.
Global agent config instead of per-agent tuning
Every agent runs with max_tokens=4096 and temperature=0.7, even for simple extraction tasks. A 'summarizer' agent that only needs 200 tokens is left with 3,896 tokens of headroom it can ramble into, since max_tokens is a cap rather than a target.
Set max_tokens and temperature per agent in the Agent constructor. For extraction agents, set max_tokens=200 and temperature=0.1.
No rate limiting or retry backoff
Add a tenacity retry with wait_exponential(min=2, max=60) and use a Redis-backed queue (e.g., rq or celery) to limit concurrent runs to 10.
Interview Questions on This Topic
How would you design a multi-agent system to avoid token blowup?