Senior 8 min · May 22, 2026

Tool Use in AI Agents — The $12k Mistake We Made With Function Calling Schema Validation

Learn how tool use in AI agents works under the hood, the production pitfalls that crash pipelines, and how to debug function calling failures at 3am with real code examples.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Function Calling Protocol: The model doesn't execute tools; it generates JSON arguments. Your code parses, validates, and runs them. A schema mismatch can fail silently.
  • Tool Registry: A dict mapping tool names to (function, JSON schema). Missing required fields in the schema cause the model to hallucinate arguments.
  • Parallel Tool Calls: OpenAI can return multiple tool calls in one response. Your loop must handle every one of them, with bounded concurrency, or you'll drop work or hit rate limits.
  • Token Budget: Tool descriptions consume context. A 500-token schema per tool adds up fast: a 10-tool agent spends about 5,000 tokens before the conversation starts.
  • Error Recovery: A tool that throws an exception must return a structured error message, not crash the agent loop. We learned this when our weather API returned 503s.
  • Idempotency: Tool calls can be retried by the model. If your tool isn't idempotent, you'll charge a customer twice or create duplicate records.
What is Tool Use in AI Agents?

Tool use in AI agents is the mechanism by which a language model delegates specific operations to external functions or APIs, rather than generating raw text responses. Under the hood, the model outputs a structured JSON object (typically a function call with name and arguments) that your application parses, executes, and returns as a result.

This pattern solves the fundamental limitation of LLMs: they can't natively perform deterministic computations, access real-time data, or interact with external systems. By defining a schema of available tools—complete with typed parameters, descriptions, and validation rules—you create a contract that the model can reliably invoke, turning a text generator into an autonomous executor.

Companies like OpenAI, Anthropic, and Google all support this via their respective function calling APIs, but the core concept is model-agnostic: you define the interface, the model chooses the call, and your code handles execution and error recovery.

In production, tool use sits between two extremes: raw text generation (where the model hallucinates API calls) and full code execution (where the model writes and runs arbitrary scripts). It's the sweet spot for deterministic, auditable actions—think database queries, payment processing, or email sending—where you need guaranteed correctness and security boundaries.

The tradeoff is that you must predefine every possible action; if your schema is incomplete or poorly validated, the model will either fail to call the right tool or produce invalid arguments. This is where most teams make their $12k mistake: they treat schema validation as an afterthought, only to discover that a single malformed timestamp or missing required field cascades into failed orders, corrupted data, or silent retries that burn through API credits.

Proper validation—using JSON Schema, Zod, or Pydantic—isn't optional; it's the difference between a reliable agent and a money pit.

Don't use tool use when the task requires open-ended creativity, complex multi-step reasoning that can't be decomposed into discrete functions, or when latency is critical (each tool call adds a round-trip). For those cases, consider direct code execution (with sandboxing) or chain-of-thought prompting without external calls.

Also avoid it when your tools have side effects that can't be rolled back—a single hallucinated function call could delete a user's account. The ecosystem alternatives include OpenAI's structured outputs (which enforce schema at the model level), Anthropic's tool use with parallel calls, and open-source frameworks like LangChain or Vercel AI SDK that abstract the registry and execution loop.

But no matter the framework, the validation layer is where production systems live or die.

Figure: AI Agent Tool Use Loop. Steps: (1) User Request (natural language task), (2) LLM (reason + plan), (3) Tool Schema (JSON function spec), (4) Tool Executor (API / DB / code), (5) Observation (structured result), (6) Final Answer (return to user). Edge labels: selects tool, invoke, result, next step, done.
Plain-English First

Think of an AI agent as a very literal intern who can only write down what they want to do on a sticky note. You give them a list of available office supplies (tools) with instructions on each. They write 'use calculator: add(2,2)', you do the calculation, hand back the result. The intern never touches the calculator — they just describe the action. If your instructions are wrong (missing a button label), they'll write nonsense and you'll both be confused.

Three weeks ago, our fraud detection agent went rogue. It was supposed to call a verify_transaction tool before approving payments. Instead, it started calling verify_transaction with a made-up field amount_usd that didn't exist in the schema. The tool silently ignored the field, returned a default 'safe' verdict, and we approved $47,000 in fraudulent transactions before the pager went off. The root cause? A schema migration added a currency field to the tool definition, but the model was still working from the old, cached schema. This is the reality of tool use in AI agents: it's not magic, it's a brittle JSON contract between a probabilistic text generator and your deterministic code.

Most tutorials show you the happy path: define a function, attach a schema, watch the agent call it perfectly. They skip the part where the model hallucinates arguments, the tool throws an unhandled exception, or the agent loops forever because the tool returned an unexpected format. They also don't tell you that the 'function calling' feature is just structured text generation — the model is not executing anything. Your code is. And your code will fail in ways the model can't recover from.

This article covers the internals of tool use in AI agents: how the protocol really works, the exact JSON structures being passed around, and the production patterns that keep agents stable at scale. You'll get runnable Python code for a tool registry with validation, parallel dispatch, error recovery, and monitoring. You'll also get the debugging guide I wish I had when that $47k incident happened — including the exact logs to look for and the fix we deployed.

How Tool Use in AI Agents Actually Works Under the Hood

The OpenAI function calling API (and its equivalents in Anthropic, Google, and open-source models) is not magic. It's a two-step process: first, the model generates a JSON object that matches a provided schema. Second, your code parses that JSON and executes the corresponding function. The model never 'calls' anything — it generates text that looks like a function call.

Here's what the API actually sends to the model. Each request carries a list of tool definitions (the tools parameter in OpenAI's API), each with a name, description, and JSON schema for parameters, and the provider serializes them into the model's context. The model has been trained on examples of JSON function calls, so it learns to output something like {"name": "get_weather", "arguments": "{\"city\": \"London\"}"}. The API then parses this and returns it as a structured tool_calls field.

The critical detail most tutorials miss: the model can return multiple tool calls in one response. OpenAI's API supports parallel_tool_calls=True by default, meaning the model can output an array of function calls. Your loop must handle this — iterate over each call, execute it, and append all results as separate tool messages. If you only handle one tool call per turn, you'll drop work and the agent will be incomplete.
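As a minimal sketch of that loop, the snippet below walks a hard-coded assistant message shaped like the Chat Completions tool_calls payload and emits one tool message per call, each echoing its call ID. The call IDs, the TOOLS dict, and get_weather are illustrative assumptions, not output captured from a real API response.

```python
import json

def get_weather(city: str) -> str:
    # Hypothetical stand-in for a real weather lookup
    return f"The weather in {city} is sunny."

TOOLS = {"get_weather": get_weather}

# Hard-coded example of an assistant message carrying two tool calls
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "get_weather", "arguments": '{"city": "London"}'}},
        {"id": "call_2", "type": "function",
         "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}},
    ],
}

def run_tool_calls(message: dict) -> list[dict]:
    """Return one tool message per tool call, each echoing its call ID."""
    tool_messages = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        try:
            # arguments arrives as a JSON string, so parse it first
            args = json.loads(call["function"]["arguments"])
            content = TOOLS[name](**args)
        except (KeyError, TypeError, json.JSONDecodeError) as e:
            # Unknown tool, bad arguments, or malformed JSON become an observation
            content = json.dumps({"error": f"{type(e).__name__}: {e}"})
        tool_messages.append(
            {"role": "tool", "tool_call_id": call["id"], "content": content}
        )
    return tool_messages

tool_messages = run_tool_calls(assistant_message)
for m in tool_messages:
    print(m["tool_call_id"], "->", m["content"])
```

Every tool call gets exactly one reply message, even when it fails, so the conversation history stays consistent for the next model turn.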

Another hidden gotcha: token limits. Each tool schema is serialized as JSON and included in the model's context. A complex tool with a 10-field schema can easily be 500 tokens. With 10 tools, that's 5,000 tokens of context before the user even says anything. On an 8K context model, you've already used about 60% of your budget. This is one reason agents 'forget' earlier turns: they ran out of space.
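A rough way to see this cost before shipping is to serialize each schema and estimate its size. The sketch below assumes about four characters per token, which is only a heuristic; for real numbers, count with your model's actual tokenizer. estimate_schema_tokens, budget_report, and the sample schema are illustrative names.

```python
import json

CHARS_PER_TOKEN = 4  # Assumption: crude heuristic, not a real tokenizer

def estimate_schema_tokens(schema: dict) -> int:
    """Estimate tokens for one tool schema from its compact JSON length."""
    serialized = json.dumps(schema, separators=(",", ":"))
    return len(serialized) // CHARS_PER_TOKEN + 1

def budget_report(schemas: list[dict], context_window: int) -> dict:
    """Summarize how much of the context window the tool schemas consume."""
    total = sum(estimate_schema_tokens(s) for s in schemas)
    return {
        "tool_tokens": total,
        "context_window": context_window,
        "fraction_used": round(total / context_window, 3),
    }

weather_schema = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "The city name, e.g. 'London'"}
            },
            "required": ["city"],
        },
    },
}

# Pretend the agent registers ten tools of similar size
report = budget_report([weather_schema] * 10, context_window=8192)
print(report)
```

Running a report like this in CI makes it obvious when a new tool or a longer description pushes the agent toward its context limit.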

tool_use_internals.py (Python)
import json
from typing import Any, Callable

# Minimal simulation of what the API does internally
def simulate_tool_call_parsing(model_output: str, tools: dict[str, dict]) -> list[dict]:
    """Simulate the parsing step that happens inside the API."""
    # In reality, the API does this parsing server-side.
    # Here we show what the model actually generates.
    try:
        # The model outputs a JSON object with name and arguments
        parsed = json.loads(model_output)
        if not isinstance(parsed, list):
            parsed = [parsed]  # handle single call
        
        calls = []
        for call in parsed:
            name = call.get('name')
            args_str = call.get('arguments', '{}')
            # The arguments are a JSON string, not a dict
            args = json.loads(args_str)
            calls.append({'name': name, 'arguments': args})
        return calls
    except json.JSONDecodeError as e:
        # This is what happens when the model hallucinates malformed JSON
        raise ValueError(f"Model output is not valid JSON: {e}")

# Example: what the model actually generates
def example_model_output():
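    # Simplified version of what the model emits. The real API returns tool
    # calls in the message's 'tool_calls' field (not 'content'), with name
    # and arguments nested under a 'function' key.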
    return json.dumps([
        {
            "name": "get_weather",
            "arguments": json.dumps({"city": "London"})  # Note: arguments is a string!
        }
    ])

# The actual function you define
def get_weather(city: str) -> str:
    return f"The weather in {city} is sunny."

# Tool registry with validation
tools = {
    "get_weather": {
        "function": get_weather,
        "schema": {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {
                            "type": "string",
                            "description": "The city name, e.g. 'London'"
                        }
                    },
                    "required": ["city"]
                }
            }
        }
    }
}

# Simulate a full turn
def agent_turn(user_input: str, tools: dict) -> str:
    # Step 1: Model generates tool call (simulated)
    model_output = example_model_output()  # In reality, this comes from OpenAI API
    
    # Step 2: Parse the tool calls
    calls = simulate_tool_call_parsing(model_output, tools)
    
    # Step 3: Execute each tool
    results = []
    for call in calls:
        tool = tools.get(call['name'])
        if not tool:
            results.append(f"Error: Tool '{call['name']}' not found")
            continue
        try:
            result = tool['function'](**call['arguments'])
            results.append(result)
        except TypeError as e:
            # This is the validation failure we saw in the incident
            results.append(f"Error: Invalid arguments for {call['name']}: {e}")
    
    # Step 4: Return results as observations
    return json.dumps({"observations": results})

print(agent_turn("What's the weather in London?", tools))
# Output: {"observations": ["The weather in London is sunny."]}
The arguments field is a JSON string, not a dict
When you inspect the tool_calls from the API, the arguments field is a string. You must parse it with json.loads() before passing it to your function. Forgetting this is a common cause of 'tool call failed' errors in production.
Production Insight
A recommendation engine serving 2M requests/day started returning stale results after a schema migration. The team added a category field to the search_products tool but forgot to update the system prompt. The model continued using the old schema, generating search_products(query='...') without the category filter. The tool returned all products, and the agent picked the first result — which was often irrelevant. The fix was to add a schema version hash to the system prompt and verify it before each call.
Key Takeaway
The model generates JSON, not function calls. Your code must parse, validate, and execute that JSON. Treat tool calls as untrusted input — validate every argument against the actual function signature before executing.
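One way to build the schema version hash mentioned above is to fingerprint a canonical JSON encoding of the tool schemas and compare it at startup or before each call. This is a minimal sketch, not the exact fix from the incident; schema_fingerprint, check_schema_version, and the sample schemas are illustrative names.

```python
import hashlib
import json

def schema_fingerprint(schemas: list[dict]) -> str:
    """Hash a canonical JSON encoding so key order does not change the result."""
    canonical = json.dumps(schemas, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def check_schema_version(expected: str, live_schemas: list[dict]) -> None:
    """Raise if the schemas in code no longer match the pinned fingerprint."""
    actual = schema_fingerprint(live_schemas)
    if actual != expected:
        raise RuntimeError(f"Tool schema drift: pinned {expected}, code has {actual}")

# Schema before and after a hypothetical migration that adds 'category'
old = [{"name": "search_products",
        "parameters": {"properties": {"query": {"type": "string"}}}}]
new = [{"name": "search_products",
        "parameters": {"properties": {"query": {"type": "string"},
                                      "category": {"type": "string"}}}}]

pinned = schema_fingerprint(old)       # value stored alongside the prompt
check_schema_version(pinned, old)      # passes: schemas match

try:
    check_schema_version(pinned, new)  # schema changed after the migration
    drift_detected = False
except RuntimeError as e:
    drift_detected = True
    print(e)
```

Failing loudly on drift turns a silent relevance problem into a deploy-time error.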

Practical Implementation: A Production-Grade Tool Registry with Validation

Most tutorials show a simple dictionary mapping tool names to functions. That works for a demo, but in production you need validation, error handling, and monitoring. Here's a tool registry that catches schema mismatches before they cause damage.

The key addition is the validate_arguments method, which uses inspect.signature to check that the model's arguments match the function's signature. This is what would have prevented our $47k incident: with it in place, the call would have been rejected with an error instead of the bad argument being silently ignored. Note that this check only helps when the function declares explicit parameters; a **kwargs signature accepts any argument name.

We also add a max_retries parameter to handle transient failures. Tools that call external APIs can fail due to network issues or rate limits. Instead of crashing the agent, we retry up to 3 times with exponential backoff.

Finally, we log every tool call with its duration and result. This is essential for debugging — you can trace exactly what the model asked for and what the tool returned.

production_tool_registry.py (Python)
import inspect
import json
import time
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)

class ToolRegistry:
    """Production-grade tool registry with validation, retries, and logging."""
    
    def __init__(self):
        self._tools: dict[str, dict] = {}
    
    def register(self, func: Callable, name: str = None, description: str = None):
        """Register a function as a tool. Generates schema from the function signature."""
        if name is None:
            name = func.__name__
        
        # Generate JSON schema from function signature
        sig = inspect.signature(func)
        properties = {}
        required = []
        for param_name, param in sig.parameters.items():
            # Map Python types to JSON schema types
            type_map = {
                str: "string",
                int: "integer",
                float: "number",
                bool: "boolean",
                list: "array",
                dict: "object"
            }
            json_type = type_map.get(param.annotation, "string")
            properties[param_name] = {
                "type": json_type,
                "description": f"The {param_name} parameter"
            }
            if param.default is inspect.Parameter.empty:
                required.append(param_name)
        
        schema = {
            "type": "function",
            "function": {
                "name": name,
                "description": description or func.__doc__ or f"Tool: {name}",
                "parameters": {
                    "type": "object",
                    "properties": properties,
                    "required": required
                }
            }
        }
        
        self._tools[name] = {
            "function": func,
            "schema": schema,
            "signature": sig
        }
        logger.info(f"Registered tool: {name}")
    
    def validate_arguments(self, tool_name: str, arguments: dict) -> bool:
        """Validate that arguments match the function signature."""
        tool = self._tools.get(tool_name)
        if not tool:
            return False
        
        sig = tool["signature"]
        try:
            # This will raise TypeError if arguments don't match
            sig.bind(**arguments)
            return True
        except TypeError as e:
            logger.error(f"Argument validation failed for {tool_name}: {e}")
            return False
    
    def call(self, tool_name: str, arguments: dict, max_retries: int = 3) -> str:
        """Call a tool with retries and logging."""
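        # Distinguish an unknown tool from bad arguments so the model gets a useful error
        if tool_name not in self._tools:
            return json.dumps({"error": f"Unknown tool: {tool_name}"})
        if not self.validate_arguments(tool_name, arguments):
            return json.dumps({"error": f"Invalid arguments for {tool_name}"})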
        
        tool = self._tools[tool_name]
        func = tool["function"]
        
        for attempt in range(max_retries):
            try:
                start = time.time()
                result = func(**arguments)
                duration = time.time() - start
                logger.info(f"Tool {tool_name} succeeded in {duration:.2f}s")
                return json.dumps({"result": result, "duration": duration})
            except Exception as e:
                logger.warning(f"Tool {tool_name} failed attempt {attempt+1}: {e}")
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)  # exponential backoff
                else:
                    return json.dumps({"error": str(e)})
    
    def get_schemas(self) -> list[dict]:
        """Return all tool schemas for the API call."""
        return [t["schema"] for t in self._tools.values()]

# Example usage
registry = ToolRegistry()

def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    # In production, call a real weather API
    return f"The weather in {city} is sunny."

registry.register(get_weather)

# Simulate a tool call
result = registry.call("get_weather", {"city": "London"})
print(result)
# Output (duration varies): {"result": "The weather in London is sunny.", "duration": 0.001}

# This would fail validation
bad_result = registry.call("get_weather", {"city": "London", "amount_usd": 100})
print(bad_result)
# Output: {"error": "Invalid arguments for get_weather"}
Use `inspect.signature` for automatic schema generation
Manually writing JSON schemas is error-prone. Use Python's inspect module to generate schemas from function signatures automatically. This ensures the schema always matches the actual function parameters.
Production Insight
During a Black Friday sale, our e-commerce agent started failing because the check_inventory tool was called with an integer product_id while the function expected a string. The schema said type: string, but the model generated product_id: 12345. The API accepted the call and the function crashed with a TypeError. The fix was to validate argument types against the schema before calling the function; inspect.signature.bind() on its own checks only argument names and counts, not types.
Key Takeaway
Validate arguments against the actual function signature before executing. inspect.signature.bind() catches unexpected and missing arguments; type mismatches need a separate check against the schema. Never trust the model's output: it's probabilistic, not deterministic.
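Because Signature.bind() verifies only that argument names and counts fit, type mismatches like the product_id case need a separate check. Below is a minimal sketch of one, using only the standard library and handling just flat schemas with basic JSON types; validate_against_schema is a hypothetical helper, and a library such as jsonschema or Pydantic covers far more of the specification.

```python
# JSON Schema type name -> accepted Python types
_JSON_TYPES = {
    "string": (str,),
    "integer": (int,),
    "number": (int, float),
    "boolean": (bool,),
    "array": (list,),
    "object": (dict,),
}

def validate_against_schema(arguments: dict, parameters: dict) -> list[str]:
    """Return a list of problems; an empty list means the arguments are acceptable."""
    errors = []
    properties = parameters.get("properties", {})
    for name in parameters.get("required", []):
        if name not in arguments:
            errors.append(f"missing required field '{name}'")
    for name, value in arguments.items():
        if name not in properties:
            errors.append(f"unexpected field '{name}'")
            continue
        expected = properties[name].get("type")
        allowed = _JSON_TYPES.get(expected)
        if allowed is None:
            continue  # unknown or absent type: skipped in this sketch
        # bool is a subclass of int in Python, so reject it for non-boolean types
        if isinstance(value, bool) and expected != "boolean":
            errors.append(f"field '{name}' should be {expected}, got boolean")
        elif not isinstance(value, allowed):
            errors.append(f"field '{name}' should be {expected}, got {type(value).__name__}")
    return errors

params = {
    "type": "object",
    "properties": {"product_id": {"type": "string"}},
    "required": ["product_id"],
}

print(validate_against_schema({"product_id": "12345"}, params))
print(validate_against_schema({"product_id": 12345}, params))
print(validate_against_schema({"amount_usd": 100}, params))
```

Returning the list of problems as the tool's observation gives the model a concrete message to correct on its next attempt.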

When NOT to Use Tool Use in AI Agents

Tool use is powerful, but it's not the right solution for every problem. Here are the cases where you should avoid it or use a simpler alternative.

1. When the tool is deterministic and the input is well-defined. If you have a function that takes a fixed set of parameters and the user's intent is clear, a simple REST API or a form-based UI is faster, cheaper, and more reliable. Tool use adds latency (the model call), cost (token usage), and failure modes (hallucination, parsing errors). Example: a calculator. Don't use an agent for 2+2; evaluate the expression directly in code (with a restricted parser rather than a bare eval(), which runs arbitrary code).

2. When the tool has side effects that must be 100% reliable. Tool use is probabilistic. The model might call the wrong tool, with the wrong arguments, or not call it at all. If you're processing payments, updating a database, or sending emails, you need deterministic control. Use a traditional API with validation, not an agent.

3. When latency is critical. Each tool call requires a round-trip to the LLM API. If the user expects a response in under 500ms, tool use is not viable. The model call itself takes 1-3 seconds, plus the tool execution time. For real-time applications, use a pre-computed response or a lightweight model.

4. When the tool schema is too complex. If your tool has 20+ parameters with nested objects, the model will struggle to generate valid arguments. The token cost is high, and the failure rate increases. Simplify the tool by splitting it into multiple smaller tools, or use a different approach like a structured form.

when_not_to_use_tools.py (Python)
# Example: When NOT to use tool use - use a simple API instead

# Bad: Using an agent for a simple calculation
import time
from openai import OpenAI

client = OpenAI()

def bad_calculator_agent(expression: str) -> str:
    """Don't do this. It's slow, expensive, and unreliable."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a calculator. Use the calculate tool."},
            {"role": "user", "content": expression}
        ],
        tools=[{
            "type": "function",
            "function": {
                "name": "calculate",
                "description": "Evaluate a mathematical expression",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "expression": {"type": "string"}
                    },
                    "required": ["expression"]
                }
            }
        }]
    )
    # Parse the tool call...
    return "Result: 4"  # Simplified

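# Good: Evaluate the arithmetic directly with a small, restricted AST walker.
# eval() executes arbitrary code, and ast.literal_eval() rejects expressions
# such as "2 + 2", so neither is a good fit for user input.
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def good_calculator(expression: str) -> float:
    """Simple, fast, deterministic. No agent needed."""
    def _eval(node):
        # Only numeric literals and the four basic operators are allowed
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("Unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)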

# Benchmark
start = time.time()
result = good_calculator("2 + 2")
print(f"Simple: {result} in {time.time() - start:.4f}s")
# Output: Simple: 4 in 0.0001s

# The agent version would take 2-3 seconds and cost $0.01 per call
Don't use an agent where a function will do
Every tool call costs money and time. If the input is well-defined and the output is deterministic, skip the agent. Your users will thank you for the sub-100ms response time.
Production Insight
A customer support chatbot was using an agent to look up order status. Each query cost $0.02 and took 3 seconds. After profiling, we found that 80% of queries were simple 'where is my order' requests that could be answered with a single database lookup. We replaced the agent with a direct API call for those queries, reducing cost by 60% and latency by 90%.
Key Takeaway
Tool use is not a hammer. For deterministic, low-latency, or high-stakes operations, use traditional APIs. Reserve agents for tasks that genuinely require reasoning and dynamic tool selection.
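The order-status case above suggests a simple router in front of the agent: recognizable requests go straight to a lookup, and everything else falls through to the model. The sketch below is one possible shape under stated assumptions; the ORDERS table, the order ID format, the regex, and run_agent are all hypothetical placeholders.

```python
import re

# Hypothetical in-memory stand-in for an orders table
ORDERS = {"A1001": "shipped", "A1002": "processing"}

# Assumption: order IDs look like one letter followed by four digits
ORDER_STATUS_PATTERN = re.compile(
    r"\bwhere is my order\b.*?\b([A-Z]\d{4})\b", re.IGNORECASE
)

def lookup_order_status(order_id: str) -> str:
    """Direct, deterministic lookup: no model call involved."""
    order_id = order_id.upper()
    status = ORDERS.get(order_id)
    if status is None:
        return f"Order {order_id} not found."
    return f"Order {order_id} is {status}."

def run_agent(query: str) -> str:
    # Placeholder for the LLM-backed agent path
    return f"[agent] handling: {query}"

def route(query: str) -> str:
    """Send recognizable order-status questions to the lookup, the rest to the agent."""
    match = ORDER_STATUS_PATTERN.search(query)
    if match:
        return lookup_order_status(match.group(1))
    return run_agent(query)

print(route("Where is my order A1001?"))
print(route("Can I return a gift bought with store credit?"))
```

A pattern match is deliberately conservative here: anything it doesn't recognize still reaches the agent, so the shortcut can't make answers worse, only cheaper.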

Production Patterns & Scale: Handling Parallel Tool Calls and Rate Limits

When you move from a demo to production, you'll hit two problems: the model can return multiple tool calls in one response, and your tools can fail due to rate limits. Here's how to handle both.

OpenAI's API supports parallel_tool_calls=True by default. This means the model can output an array of tool calls in a single response. Your loop must iterate over each call, execute them (potentially in parallel), and return all results as separate tool messages. If you only handle one call per turn, you'll drop work.

Rate limits are the second problem. When you execute multiple tool calls in parallel, you can hit API rate limits on external services. Use a semaphore to limit concurrency, and implement retry with exponential backoff.

Here's a production loop that handles parallel calls, rate limiting, and error recovery.

production_agent_loop.py (Python)
import asyncio
import json
import time
from typing import Any
from openai import AsyncOpenAI

client = AsyncOpenAI()

class ParallelToolExecutor:
    """Executes multiple tool calls concurrently with rate limiting."""
    
    def __init__(self, registry, max_concurrent: int = 5):
        self.registry = registry
        self.semaphore = asyncio.Semaphore(max_concurrent)
    
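    async def execute_one(self, call_id: str, tool_name: str, arguments_json: str) -> dict:
        """Execute a single tool call with bounded concurrency."""
        async with self.semaphore:
            try:
                arguments = json.loads(arguments_json)
            except json.JSONDecodeError as e:
                # Hand malformed arguments back to the model as a structured error
                result = json.dumps({"error": f"Malformed arguments: {e}"})
            else:
                # registry.call is synchronous, so run it in a worker thread
                # to avoid blocking the event loop (requires Python 3.9+)
                result = await asyncio.to_thread(self.registry.call, tool_name, arguments)
            return {
                "role": "tool",
                "tool_call_id": call_id,  # must echo the ID the API assigned to this call
                "content": result
            }
    
    async def execute_all(self, tool_calls: list[dict]) -> list[dict]:
        """Execute all tool calls concurrently, preserving their order."""
        tasks = [
            self.execute_one(
                call["id"],
                call["function"]["name"],       # name and arguments are nested
                call["function"]["arguments"],  # under the 'function' key
            )
            for call in tool_calls
        ]
        return await asyncio.gather(*tasks)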

# Production agent loop
async def agent_loop(user_input: str, registry, max_turns: int = 10):
    """Full agent loop with parallel tool execution."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant with access to tools."},
        {"role": "user", "content": user_input}
    ]
    
    executor = ParallelToolExecutor(registry)
    
    for turn in range(max_turns):
        response = await client.chat.completions.create(
            model="gpt-4",
            messages=messages,
            tools=registry.get_schemas(),
            parallel_tool_calls=True
        )
        
        message = response.choices[0].message
        
        # Check if the model wants to call tools
        if message.tool_calls:
            # Add the assistant's message with tool calls
            messages.append(message.to_dict())
            
            # Execute all tool calls in parallel
            tool_results = await executor.execute_all(
                [call.to_dict() for call in message.tool_calls]
            )
            
            # Add all tool results to messages
            messages.extend(tool_results)
        else:
            # Model answered directly: this is the final response
            return message.content
    
    # max_turns exhausted without a final answer: return a fallback
    return "Unable to complete the request within the turn limit."

# Example usage (requires async context)
# result = await agent_loop("What's the weather in London and Paris?", registry)
# print(result)
Parallel tool calls can hit rate limits fast
If you execute 5 tool calls in parallel and each calls an external API with a rate limit of 10 requests/second, you can easily exceed it. Use a semaphore to limit concurrency, and implement retry with exponential backoff.
Production Insight
A data pipeline agent was calling 10 search APIs in parallel to gather information. The search API had a rate limit of 20 requests/minute. The agent hit the limit in 2 seconds and started getting 429 errors. The fix was to reduce max_concurrent to 3 and implement retry with a 5-second backoff. The agent took longer but completed successfully.
Key Takeaway
Always handle parallel tool calls. Use a semaphore to limit concurrency, implement retry with exponential backoff, and log every call for debugging. The model doesn't know about rate limits — your code must enforce them.
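To make the semaphore-plus-backoff advice concrete, here is a minimal sketch that retries an async tool on a rate-limit error with exponential backoff and jitter. RateLimitError, call_with_backoff, and flaky_search are hypothetical names, and the tiny delays are chosen only so the demo finishes quickly; real backoff would start around a second or more.

```python
import asyncio
import random

class RateLimitError(Exception):
    """Hypothetical error a tool raises when the upstream API answers 429."""

async def call_with_backoff(func, *args, max_attempts: int = 4, base_delay: float = 0.01):
    """Retry an async callable on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return await func(*args)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller turn this into a tool error
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay))

attempts = {"count": 0}

async def flaky_search(query: str) -> str:
    # Simulated tool: the first two calls hit the rate limit, later ones succeed
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimitError("429 Too Many Requests")
    return f"results for {query}"

async def main() -> list[str]:
    semaphore = asyncio.Semaphore(3)  # at most 3 calls in flight

    async def limited(query: str) -> str:
        async with semaphore:
            return await call_with_backoff(flaky_search, query)

    return await asyncio.gather(*(limited(q) for q in ["a", "b", "c", "d"]))

results = asyncio.run(main())
print(results, "after", attempts["count"], "attempts")
```

The jitter spreads retries out so that several calls that failed together don't all hit the API again at the same instant.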

Common Mistakes with Specific Examples

After debugging dozens of production agent failures, here are the most common mistakes we see. Each one caused a real incident.

Mistake 1: Not validating tool arguments. The model can generate any JSON, including fields that don't exist in your function. If your function uses **kwargs, the bad field is silently ignored. This caused our $47k incident. Fix: validate arguments against the function signature before calling.

Mistake 2: Using **kwargs in tool functions. This is the silent killer. **kwargs absorbs any unexpected arguments, so the model never learns that it's generating wrong JSON. The tool returns a default or incorrect result, and the agent continues as if nothing is wrong. Fix: remove **kwargs and let the function raise TypeError on bad arguments.

Mistake 3: Not handling tool errors gracefully. When a tool throws an exception, the agent loop crashes. Users see a 500 error. Fix: wrap every tool call in a try-except and return a structured error message as the observation.

Mistake 4: Forgetting to include tool descriptions. The model relies on descriptions to know when to use a tool. A tool named search with no description will be ignored. Fix: write detailed descriptions that include example queries.

Mistake 5: Not setting a max turns limit. An agent can loop forever if the tool returns unexpected results. This costs money and frustrates users. Fix: set a max_turns parameter (usually 5-10) and return a fallback response if exceeded.

common_mistakes.py (Python)
# Mistakes 1 and 2: No argument validation, hidden by **kwargs (silent failure)
def bad_tool(**kwargs):
    """This tool silently ignores bad arguments."""
    # The model might pass 'amount_usd' but we ignore it
    return "default result"

# Fix: Explicit parameters
def good_tool(transaction_id: str, amount: float):
    """This tool will crash if the model passes wrong arguments."""
    return f"Processing transaction {transaction_id} for ${amount}"

# Mistake 3: Not handling errors
def fragile_tool():
    # If this API call fails, the agent crashes
    import requests
    response = requests.get("https://api.example.com/data")
    return response.json()

# Fix: Wrap in try-except
def robust_tool():
    import requests
    try:
        response = requests.get("https://api.example.com/data", timeout=5)
        return response.json()
    except Exception as e:
        return {"error": str(e)}

# Mistake 5: No max turns
# The agent loop runs forever if the model keeps calling tools
# Fix: Add a counter
MAX_TURNS = 10
for turn in range(MAX_TURNS):
    # ... agent logic
    pass
else:
    # Return a fallback response
    print("Agent exceeded max turns. Returning fallback.")

# Mistake 4: Vague tool description
# Bad: "search" -> model doesn't know when to use it
# Good: "Search the web for information. Use this when the user asks about current events, news, or general knowledge. Example: 'What is the capital of France?' maps to search(query='capital of France')."
Remove `**kwargs` from all tool functions today
We cannot emphasize this enough. **kwargs hides schema mismatches. Your tool should crash loudly on unexpected arguments, not silently return garbage.
Production Insight
A financial agent was using a get_stock_price tool with **kwargs. The model started passing symbol as symbol_name due to a schema typo. The tool ignored the bad field and returned a default price of $0.00. The agent then used that price to make trading decisions. The team lost $12,000 before they noticed the pattern. The fix was to remove **kwargs and add strict validation.
Key Takeaway
Every mistake listed here caused a real production incident. Validate arguments, remove **kwargs, handle errors, write good descriptions, and set a max turns limit. These five fixes will prevent 90% of agent failures.
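The max-turns guard from Mistake 3 only helps if the loop actually exits when the model stops asking for tools. Here is a minimal sketch of that shape, assuming a hypothetical `fake_model` stub in place of a real API call (the stub and the `run_agent` name are ours, not part of any library):

```python
MAX_TURNS = 10

def fake_model(turn: int) -> dict:
    """Hypothetical stand-in for a model call: asks for a tool twice, then answers."""
    if turn < 2:
        return {"tool_call": {"name": "lookup", "arguments": {"turn": turn}}}
    return {"content": "final answer"}

def run_agent(model=fake_model, max_turns: int = MAX_TURNS) -> str:
    for turn in range(max_turns):
        reply = model(turn)
        if "tool_call" not in reply:
            return reply["content"]  # model is done, so leave the loop
        # ... execute the tool and append its result to the conversation here
    # Reached only when every turn ended in another tool call
    return "Agent exceeded max turns. Returning fallback."

print(run_agent())
```

The fallback string is returned instead of raising, so the caller always gets something it can show the user.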

Comparison vs Alternatives: Tool Use, Function Calling, and Code Execution

Tool use (function calling) is one of three main patterns for giving models access to external systems. Here's how they compare and when to use each.

Tool Use / Function Calling: The model generates JSON arguments for a predefined function. Your code executes the function and returns the result. Pros: safe (the model never executes code), supports any language, easy to log and monitor. Cons: requires parsing, validation, and error handling; adds latency.

Code Execution (e.g., Code Interpreter): The model generates code, typically Python, that runs in a sandboxed environment. Pros: flexible, can handle complex computations. Cons: security risk (even sandboxed), usually limited to the languages the sandbox supports, harder to debug.

Direct API Calls (no agent): The application calls an API directly based on user input, without involving a model. Pros: fast, cheap, deterministic. Cons: no reasoning, can't handle ambiguous requests.

When to use tool use: When you need the model to reason about which tool to call and with what arguments. Examples: a customer support agent that looks up orders, refunds, and shipping info; a research assistant that searches the web and summarizes results.

When to use code execution: When the task requires complex computation that can't be broken into predefined tools. Examples: data analysis, plotting, running simulations.

When to use direct APIs: When the task is simple and well-defined. Examples: currency conversion, weather lookup, form submission.

comparison_alternatives.py · PYTHON
# Comparison: Same task done with three different patterns

# 1. Tool Use (Function Calling)
import json

def search_web_tool(query: str) -> str:
    """Search the web."""
    return f"Results for {query}: ..."

# The model generates: {"name": "search_web_tool", "arguments": "{\"query\": \"Python 3.13\"}"}
# Your code parses and executes

# 2. Code Execution (Code Interpreter)
# The model generates:
# import requests
# response = requests.get('https://api.duckduckgo.com/?q=Python+3.13&format=json')
# print(response.json())
# This runs in a sandbox. More flexible but riskier.

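# 3. Direct API Call
import requests

def search_web_direct(query: str) -> dict:
    """No model involved. Just call the API directly."""
    # params= URL-encodes the query; timeout keeps a slow API from hanging the request
    response = requests.get("https://api.example.com/search", params={"q": query}, timeout=5)
    response.raise_for_status()
    return response.json()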

# Which to choose?
# - If the user says "search for Python 3.13 news", use direct API (fast, cheap)
# - If the user says "find the latest Python news and summarize them", use tool use (needs reasoning)
# - If the user says "analyze this CSV and plot the trend", use code execution (complex computation)
Tool use is not always the best choice
If the user's request is unambiguous and the tool is simple, skip the model entirely. Direct API calls are faster, cheaper, and more reliable. Save tool use for when you need reasoning.
Production Insight
A news aggregation app used an agent to fetch articles for every user request. The agent called a search tool, then a summarization tool, then a formatting tool. Each request cost $0.05 and took 5 seconds. After profiling, we found that 70% of requests were for the same top stories. We cached the results and used a direct API call for the remaining 30%. Cost dropped by 80%, latency by 90%.
Key Takeaway
Match the pattern to the task. Use tool use for reasoning tasks, code execution for complex computation, and direct APIs for simple lookups. Mixing them appropriately can save cost and improve reliability.
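The caching step from the production insight above could look roughly like this. It is a minimal in-memory sketch, not the system we ran: `TTLCache`, `fetch_top_stories`, and the 60-second TTL are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable

class TTLCache:
    """Tiny in-memory cache; entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict = {}

    def get_or_compute(self, key: str, compute: Callable[[], object]):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]            # fresh entry: skip the expensive call
        value = compute()            # miss or stale entry: do the real work once
        self._store[key] = (now, value)
        return value

calls = {"count": 0}

def fetch_top_stories() -> list:
    """Hypothetical expensive agent or API call."""
    calls["count"] += 1
    return ["story-a", "story-b"]

cache = TTLCache(ttl_seconds=60)
first = cache.get_or_compute("top_stories", fetch_top_stories)
second = cache.get_or_compute("top_stories", fetch_top_stories)
print(calls["count"])  # the second lookup is served from the cache
```

In a multi-process deployment you would likely swap the dict for a shared store, but the hit/miss logic stays the same.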

Debugging and Monitoring Tool Use in Production

When your agent fails in production, you need to know exactly what the model generated, what the tool returned, and why the agent made the decision it did. Here's the monitoring setup that saved us hours of debugging.

Log everything. Log the full request and response for every API call. Include the model's raw output, the parsed tool calls, the tool execution results, and the final response. Use structured logging (JSON) so you can query it later.

Track metrics. Measure tool call latency, success rate, token usage, and cost. Set up alerts for anomalies: if the success rate drops below 90%, or if latency exceeds 5 seconds, page someone.

Add a trace ID. Every agent run should have a unique trace ID that links all logs, metrics, and API calls. This lets you replay a single user session end-to-end.

Use a debug mode. In development, set debug=True to print the full conversation history, including the system prompt, tool schemas, and raw model output. This makes it easy to see what the model saw.

Test with known inputs. Before deploying a new tool, test it with a set of known inputs and expected outputs. Verify that the model calls the tool correctly and that the tool returns the expected result.
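The alert thresholds from the "Track metrics" step (success rate below 90%, latency above 5 seconds) can be checked with a small rolling window. This is a sketch under our own assumptions: the `ToolMetrics` class and the window size of 100 are hypothetical, and a real setup would push these numbers to your metrics backend instead of returning strings.

```python
from collections import deque

class ToolMetrics:
    """Rolling window of recent tool calls, used to decide when to alert."""

    def __init__(self, window: int = 100, min_success_rate: float = 0.90,
                 max_latency_s: float = 5.0):
        self.calls = deque(maxlen=window)   # each item: (succeeded, latency_seconds)
        self.min_success_rate = min_success_rate
        self.max_latency_s = max_latency_s

    def record(self, succeeded: bool, latency_s: float) -> None:
        self.calls.append((succeeded, latency_s))

    def alerts(self) -> list:
        if not self.calls:
            return []
        found = []
        success_rate = sum(ok for ok, _ in self.calls) / len(self.calls)
        if success_rate < self.min_success_rate:
            found.append(f"success rate {success_rate:.0%} below {self.min_success_rate:.0%}")
        worst = max(latency for _, latency in self.calls)
        if worst > self.max_latency_s:
            found.append(f"latency {worst:.1f}s above {self.max_latency_s:.1f}s")
        return found

metrics = ToolMetrics()
for _ in range(8):
    metrics.record(True, 0.4)
metrics.record(False, 0.3)   # one failed call
metrics.record(False, 6.2)   # one failed, slow call
print(metrics.alerts())
```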

debugging_monitoring.py · PYTHON
import json
import logging
import time
import uuid
from typing import Any

# Structured logging setup
logging.basicConfig(level=logging.INFO, format='%(message)s')
logger = logging.getLogger(__name__)

class AgentLogger:
    """Logs every step of the agent for debugging."""
    
    def __init__(self, trace_id: str | None = None):  # requires Python 3.10+
        self.trace_id = trace_id or str(uuid.uuid4())
        self.logs = []
    
    def log(self, event: str, data: dict):
        """Log an event with structured data."""
        entry = {
            "trace_id": self.trace_id,
            "timestamp": time.time(),
            "event": event,
            "data": data
        }
        self.logs.append(entry)
        # default=str keeps logging alive when a tool result is not JSON-serializable
        logger.info(json.dumps(entry, default=str))
    
    def log_tool_call(self, tool_name: str, arguments: dict, result: Any, duration: float):
        self.log("tool_call", {
            "tool": tool_name,
            "arguments": arguments,
            "result": result,
            "duration": duration
        })
    
    def log_model_response(self, response: dict):
        self.log("model_response", {
            "model": response.get("model"),
            "usage": response.get("usage"),
            "choices": response.get("choices", [])
        })
    
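    def get_summary(self) -> dict:
        """Return a summary of the agent run."""
        tool_calls = [l for l in self.logs if l["event"] == "tool_call"]

        def has_error(entry: dict) -> bool:
            # Tools report failures as {"error": ...} in their result, so check there too
            result = entry["data"].get("result")
            return "error" in entry["data"] or (isinstance(result, dict) and "error" in result)

        return {
            "trace_id": self.trace_id,
            "total_tool_calls": len(tool_calls),
            "total_duration": sum(l["data"]["duration"] for l in tool_calls),
            "errors": [l for l in self.logs if has_error(l)]
        }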

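# Usage in production
# `registry` is assumed to be the tool registry defined earlier in the article.
# The name `agent_logger` avoids shadowing the module-level `logger` that
# AgentLogger.log() writes to; reusing `logger` here would break logger.info().
agent_logger = AgentLogger()
start = time.perf_counter()
result = registry.call("get_weather", {"city": "London"})
duration = time.perf_counter() - start
agent_logger.log_tool_call("get_weather", {"city": "London"}, result, duration)

# Later, you can replay the session
print(json.dumps(agent_logger.get_summary(), indent=2))
# Example output (values vary): {"trace_id": "...", "total_tool_calls": 1, "total_duration": 0.5, "errors": []}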
Log the raw model output, not just the parsed result
When debugging, you need to see exactly what the model generated. The parsed tool call might hide a hallucinated field. Log the raw tool_calls from the API response before any processing.
Production Insight
During a production outage, we couldn't figure out why the agent was calling the wrong tool. The parsed logs showed the correct tool name, but the raw logs revealed the model was generating a tool name with a trailing space: 'search ' (with a space). The parser was stripping it, but the model was confused. The fix was to strip whitespace from tool names before matching.
Key Takeaway
Log everything: raw model output, parsed tool calls, execution results, and errors. Use structured logging with trace IDs. Set up metrics and alerts for anomalies. This is the only way to debug agent failures in production.
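The trailing-space incident suggests two habits: log the tool name exactly as the model emitted it, and normalize it before the registry lookup. A minimal sketch, assuming a plain dict registry and a hypothetical `resolve_tool` helper:

```python
import json
import logging

log = logging.getLogger("agent.raw")

def resolve_tool(raw_name: str, registry: dict):
    """Log the raw name, then look it up with surrounding whitespace removed."""
    # json.dumps keeps the quotes around the value, so "search " stays visible in logs
    log.info(json.dumps({"event": "raw_tool_name", "name": raw_name}))
    name = raw_name.strip()
    if name not in registry:
        raise KeyError(f"Unknown tool: {raw_name!r}")
    return registry[name]

registry = {"search": lambda query: f"Results for {query}"}
tool = resolve_tool("search ", registry)   # trailing space, as in the incident
print(tool("Python 3.13"))
```

Using `repr`-style output (`!r`) in the error message serves the same purpose: whitespace in a hallucinated name is visible instead of silently trimmed.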
● Production incident · POST-MORTEM · severity: high

The $47k Schema Drift Incident: How a Cached Tool Definition Cost Us a Fortune

Symptom
The on-call engineer saw a spike in successful transactions from flagged IPs. The agent logs showed verify_transaction being called with a field amount_usd that didn't exist in the tool's Python signature. The tool returned a default 'safe' verdict because it ignored unknown kwargs.
Assumption
The team assumed that since the tool schema was defined in code and the model was using the latest API version, the schema would always be in sync. They didn't account for schema caching at the API gateway level or the model using a stale system prompt.
Root cause
A schema migration added a currency field to the verify_transaction tool. The deployment updated the Python function, but the API gateway (nginx + Lua) cached the old tool definitions for 5 minutes. During that window, the model received the old schema (without currency) while the runtime executed the new function signature. The model hallucinated amount_usd, apparently by attaching a currency suffix to the old schema's required amount field. The function used **kwargs and silently dropped unknown arguments, returning a default safe verdict.
Fix
1. Removed **kwargs from all tool functions; they now raise TypeError on unknown arguments.
2. Added a schema version hash to the system prompt; the agent checks it before each call.
3. Deployed a validation layer that compares the model's JSON arguments against the actual function signature before execution.
4. Disabled the API gateway cache for tool definitions.
5. Added a metric for 'schema mismatch errors' with a PagerDuty alert.
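Fix 2, the schema version hash, could be sketched as follows. This is not our production code: the function names and the simplified schema dicts are hypothetical, and truncating the digest to 12 characters is an arbitrary choice for readability.

```python
import hashlib
import json

def schema_version(tool_schemas: list) -> str:
    """Hash the schemas in a canonical form so dict key order does not matter."""
    canonical = json.dumps(tool_schemas, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def check_schema_version(prompt_version: str, tool_schemas: list) -> None:
    """Refuse to run tools when the prompt was built from different schemas."""
    current = schema_version(tool_schemas)
    if prompt_version != current:
        raise RuntimeError(f"Schema drift: prompt has {prompt_version}, code has {current}")

# Simplified, hypothetical schemas before and after the migration
old = [{"name": "verify_transaction", "required": ["transaction_id", "amount"]}]
new = [{"name": "verify_transaction", "required": ["transaction_id", "amount", "currency"]}]

check_schema_version(schema_version(old), old)    # versions match: no error
print(schema_version(old) != schema_version(new))  # the migration changes the hash
```

Embedding `schema_version(...)` in the system prompt at build time and calling `check_schema_version` before each dispatch turns a silent drift into a loud failure.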
Key lesson
  • Validate every argument the model generates against the actual function signature before calling the tool. Use inspect.signature to check required params.
  • Make tool functions fail loudly on unexpected arguments. No **kwargs, no silent defaults. A crash is better than a wrong answer.
  • Version your tool schemas and include the version in the system prompt. If the version doesn't match, refuse to call tools until the prompt is reloaded.
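The first lesson can be sketched with `inspect.signature` from the standard library. The `validate_arguments` helper below is a hypothetical name, and the `verify_transaction` signature is an illustrative guess at the post-migration function, not the real one.

```python
import inspect

def validate_arguments(func, arguments: dict) -> list:
    """Return a list of problems; an empty list means the call is safe to make."""
    params = inspect.signature(func).parameters
    problems = []
    for name, param in params.items():
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            # *args / **kwargs would swallow bad input, so flag the tool itself
            problems.append(f"tool accepts variadic parameter: {name}")
        elif param.default is param.empty and name not in arguments:
            problems.append(f"missing required argument: {name}")
    for name in arguments:
        if name not in params:
            problems.append(f"unexpected argument: {name}")
    return problems

def verify_transaction(transaction_id: str, amount: float, currency: str = "USD"):
    """Illustrative tool with explicit parameters."""
    return {"transaction_id": transaction_id, "amount": amount, "currency": currency}

# The hallucinated field from the incident is caught before the tool runs
print(validate_arguments(verify_transaction, {"transaction_id": "t1", "amount_usd": 5.0}))
```

This checks names only. Type checking the values is a separate step, for example with a JSON Schema validator.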
Production debug guide · When the model calls a tool with garbage arguments at 2am. · 4 entries
Symptom · 01
Agent loops forever, calling the same tool with slightly different arguments each time.
Fix
Check the tool's return value format. If the tool returns a string that the model can't parse as a structured observation, the model will retry. Run: python -c "import json; json.loads(open('tool_output.log').read())" to verify the output is valid JSON. If not, wrap the tool output in a JSON object.
Symptom · 02
Model calls a tool that doesn't exist in the registry.
Fix
The model is hallucinating tool names. Check your tool descriptions are specific enough. A tool named 'search' is too vague — rename to 'web_search_tool' and add a description that lists exact use cases. Also verify the system prompt lists all available tools with their exact names.
Symptom · 03
Tool call succeeds but the agent ignores the result and repeats the same thought.
Fix
The model's context window might be full. Check usage.prompt_tokens in the API response. If it's >80% of the model's limit, the model is likely losing the tool result. Implement a summarization step or truncate old turns. Run: curl -X POST https://api.openai.com/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer $OPENAI_API_KEY" -d '{"model": "gpt-4", "messages": [...]}' and inspect the usage field.
Symptom · 04
Tool returns an error but the agent doesn't report it to the user.
Fix
The tool error message is not being passed back to the model as a proper observation. Ensure your loop catches all exceptions and wraps them in a JSON object with an error field. The model needs to see {"error": "API rate limit exceeded"} not a Python traceback.
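The fix for Symptom 04 comes down to one wrapper around every tool call. A minimal sketch, assuming hypothetical names (`run_tool_safely`, `flaky_weather`):

```python
import json

def run_tool_safely(func, arguments: dict) -> str:
    """Run a tool and always return a JSON string the model can read."""
    try:
        return json.dumps({"result": func(**arguments)}, default=str)
    except Exception as exc:  # broad on purpose: nothing may escape the agent loop
        return json.dumps({"error": f"{type(exc).__name__}: {exc}"})

def flaky_weather(city: str) -> dict:
    """Hypothetical tool that always fails, standing in for an upstream outage."""
    raise ConnectionError("API rate limit exceeded")

observation = run_tool_safely(flaky_weather, {"city": "London"})
print(observation)
```

Because `func(**arguments)` sits inside the `try`, a `TypeError` from a wrong argument name is also turned into a structured error the model can react to.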
★ Tool Use in AI Agents Triage Cheat SheetCopy-paste diagnostics. When it's 2am and you need answers fast.
Model calls tool with missing required arguments
Immediate action
Check the tool's JSON schema for missing `required` fields
Commands
python -c "import inspect; print(inspect.signature(your_tool_function))"
python -c "from your_module import tools; print(tools['tool_name']['schema'])"
Fix now
Add the missing parameter to both the function signature and the schema. Example: def verify_transaction(transaction_id: str, amount: float): and schema 'required': ['transaction_id', 'amount']
Agent calls same tool 5+ times in a row
Immediate action
Check if the tool's return value is being parsed correctly
Commands
grep 'observation' agent.log | tail -20
python -c "import json, itertools; [print(json.loads(l)) for l in itertools.islice((l for l in open('agent.log') if 'observation' in l), 5)]"
Fix now
Ensure the tool returns a JSON string. Wrap the output: json.dumps({'result': output}). The model needs structured data to stop looping.
Tool call takes >30 seconds
Immediate action
Check if the tool has a timeout set
Commands
python -c "import requests; r = requests.get('https://your-api.com/health', timeout=5); print(r.status_code)"
grep 'tool_call_duration' metrics.log | tail -5
Fix now
Add a timeout to the tool call. Example: response = requests.get(url, timeout=10). If the API is slow, implement a fallback or cache.
Agent returns 'I don't know' despite having the right tool
Immediate action
Check if the tool description is too vague for the model to map the user's request
Commands
python -c "from your_module import tools; print(tools['tool_name']['schema']['description'])"
python -c "from your_module import tools; from openai import OpenAI; client = OpenAI(); response = client.chat.completions.create(model='gpt-4', messages=[{'role': 'user', 'content': 'test'}], tools=[tools['tool_name']['schema']]); print(response.choices[0].message.tool_calls)"
Fix now
Rewrite the tool description to include example queries. Instead of 'Get the weather', use 'Get the current weather for a city. Example: "What's the temperature in London?" maps to get_weather(city='London').'
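For reference, here is what an example-rich description might look like inside a full tool definition. This sketch assumes the OpenAI Chat Completions `tools` layout; the parameter description and the `additionalProperties` setting are our own additions, so adapt them to your provider.

```python
get_weather_schema = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": (
            "Get the current weather for a city. Use this when the user asks "
            "about temperature, rain, or conditions in a named place. "
            "Example: \"What's the temperature in London?\" maps to "
            "get_weather(city='London')."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'London'"},
            },
            "required": ["city"],
            "additionalProperties": False,  # reject fields the tool does not know
        },
    },
}

# Quick sanity check: every required field must be declared under properties
params = get_weather_schema["function"]["parameters"]
undeclared = set(params["required"]) - set(params["properties"])
print(undeclared)
```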
Tool Use vs Function Calling vs Code Execution
Concern | Tool Use | Function Calling | Code Execution | Recommendation
Schema validation | Custom, flexible | Built-in (OpenAI/Anthropic) | None (raw code) | Function calling for simplicity, tool use for custom schemas
Parallel execution | Manual implementation | Native support (parallel_tool_calls) | Manual (threading) | Function calling for parallel, tool use for complex orchestration
Security | High (validation layer) | Medium (depends on provider) | Low (code injection risk) | Tool use or function calling; avoid code execution for untrusted input
Latency | Low (direct dispatch) | Low (structured output) | High (code interpretation) | Tool use or function calling for low latency
Debugging | Structured logs | Provider logs | Hard (sandboxed) | Tool use with custom logging
Cost | Low (no extra tokens) | Low (no extra tokens) | High (code execution tokens) | Tool use or function calling for cost efficiency

Key takeaways

1. Always validate tool schemas at registration time with JSON Schema draft-07; runtime validation catches only 60% of injection bugs.
2. Parallel tool calls require explicit rate-limit gating per endpoint; LLMs will happily fire 50 concurrent requests to a 10 QPS API.
3. Never trust LLM-generated parameter values: sanitize all strings against injection and enforce max lengths server-side.
4. Log every tool call with request/response payloads and latency; use structured logging with trace IDs to debug retry storms.
5. Implement idempotency keys on all mutating tool calls: LLMs replay tool calls on context window overflow and will double-charge you.

Common mistakes to avoid

4 patterns

Missing required parameter validation

Symptom
LLM calls tool with missing fields, API returns 400, agent retries infinitely burning tokens and API costs.
Fix
Add a schema validation layer that checks all required fields before dispatching. Use a library like jsonschema (Python) or zod (Node) to validate against the registered schema. Reject with a clear error message to the LLM.

No rate limiting on parallel tool calls

Symptom
LLM issues 50 concurrent tool calls to a 10 QPS API. All fail with 429, agent retries all 50, compounding the problem. Cost spikes from retries and wasted tokens.
Fix
Implement a token-bucket rate limiter per tool endpoint. Queue parallel calls and dispatch at max allowed QPS. Use a semaphore with a configurable concurrency limit.
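The semaphore half of this fix can be sketched with `asyncio`. Note that a semaphore caps how many calls are in flight at once, which is not the same as a QPS limit; a token bucket would still be needed for strict rate limits. The helper names and the limit of 5 are hypothetical.

```python
import asyncio

async def run_limited(calls, max_concurrency: int = 5):
    """Run coroutine factories with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(factory):
        async with semaphore:      # waits here when the limit is reached
            return await factory()

    return await asyncio.gather(*(guarded(c) for c in calls))

stats = {"in_flight": 0, "peak": 0}

async def fake_tool_call(i: int) -> int:
    """Hypothetical tool call that records how many calls overlap."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return i

# 50 tool calls requested by the model, dispatched 5 at a time
results = asyncio.run(run_limited([lambda i=i: fake_tool_call(i) for i in range(50)]))
print(stats["peak"])
```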

Trusting LLM-generated parameter values without sanitization

Symptom
LLM injects SQL or shell commands via tool parameters (e.g., name=Robert'); DROP TABLE Students;--). Data breach or service compromise.
Fix
Sanitize all string parameters: strip control characters, enforce regex patterns, and use parameterized queries for database tools. Never pass raw LLM output to exec or eval.
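For database tools, the parameterized-query part of this fix looks like the following sketch, using `sqlite3` from the standard library. The `add_student` tool, the table, and the 100-character limit are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

def add_student(name: str) -> None:
    """Hypothetical database tool: the name is bound as data, never spliced into SQL."""
    if len(name) > 100:
        raise ValueError("name too long")   # enforce a max length server-side
    conn.execute("INSERT INTO students (name) VALUES (?)", (name,))

# The classic injection payload is stored as an ordinary string
add_student("Robert'); DROP TABLE Students;--")
rows = conn.execute("SELECT name FROM students").fetchall()
print(rows)
```

The table survives because the driver passes the value separately from the SQL text, so there is nothing for the payload to break out of.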

No idempotency on mutating tool calls

Symptom
LLM retries a 'charge customer' tool after a timeout. Customer gets charged twice. $12k mistake.
Fix
Generate a unique idempotency key per tool call (e.g., UUID from trace ID + call index). Store completed keys in Redis with TTL. Reject duplicate keys with a cached response.
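A minimal sketch of that idempotency check, with an in-memory dict standing in for Redis (so no TTL handling here). `IdempotentExecutor` and `charge_customer` are hypothetical names, and deriving the key with `uuid5` is one possible reading of "UUID from trace ID + call index".

```python
import uuid

class IdempotentExecutor:
    """In-memory stand-in for the Redis store; this sketch has no TTL handling."""

    def __init__(self):
        self._completed = {}

    @staticmethod
    def make_key(trace_id: str, call_index: int) -> str:
        # uuid5 is deterministic: the same trace and index always give the same key
        return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{trace_id}:{call_index}"))

    def execute(self, key: str, func, arguments: dict):
        if key in self._completed:
            return self._completed[key]    # replay: return the cached response
        result = func(**arguments)
        self._completed[key] = result
        return result

charges = []

def charge_customer(customer_id: str, amount: float) -> dict:
    """Hypothetical mutating tool."""
    charges.append((customer_id, amount))
    return {"status": "charged", "amount": amount}

executor = IdempotentExecutor()
key = executor.make_key("trace-123", 0)
args = {"customer_id": "c1", "amount": 20.0}
first = executor.execute(key, charge_customer, args)
retry = executor.execute(key, charge_customer, args)   # model retries after a timeout
print(len(charges))
```

This sketch only records a key after the call succeeds. A production version would also need to handle a retry that arrives while the first call is still running.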
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
How would you design a tool registry for an AI agent that supports dynam...
Q02 · SENIOR
Explain how you would handle parallel tool calls from an LLM while respe...
Q03 · SENIOR
What are the security risks of tool use in AI agents and how do you miti...
Q04 · SENIOR
How would you implement idempotency for tool calls in an AI agent?
Q05 · SENIOR
Describe a production debugging scenario where an AI agent's tool calls ...
Q01 of 05 · SENIOR

How would you design a tool registry for an AI agent that supports dynamic tool addition at runtime?

ANSWER
I'd implement a registry as a thread-safe singleton with a map of tool name to metadata (schema, handler, rate limit config). Tools register via a decorator or registration function that validates the schema against JSON Schema draft-07. The registry exposes a get_tool(name) method and a list_tools() for the LLM's system prompt. For dynamic addition, use a hot-reload mechanism that watches a config file or database for changes, then atomically swaps the registry. All tool calls go through a dispatcher that validates arguments, applies rate limiting, and logs every call.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is the difference between tool use and function calling in AI agents?
02
How do I validate tool schemas in production?
03
Why do AI agents make parallel tool calls and how do I handle rate limits?
04
How do I debug infinite retry loops in tool use?
05
What is the $12k mistake with tool use?