Senior 5 min · May 22, 2026

LLM Function Calling Explained — The $47k Mistake We Made with Parallel Tool Calls

How LLM function calling works under the hood, the production failures that break naive implementations, and a debug guide for when your agent starts calling the wrong tool at 2am.

Naren · Founder
Quick Answer
  • Tool Schema Definition: The JSON schema you define is not just documentation; it's the only thing the model sees. If your descriptions are vague, the model will hallucinate arguments. We learned this when the model started passing 'pineapple' as the location to a weather function.
  • Parallel vs Sequential Calls: Most tutorials show one function call. In production, models can emit multiple tool calls in a single turn. If your loop doesn't handle that, you'll silently drop requests and corrupt state.
  • Stop Reason Parsing: Relying on finish_reason: 'tool_calls' is fragile. Some providers return 'function_call' or omit it entirely. Always check for the actual tool call object in the response.
  • Token Budget for Tools: The tool schema counts against your context window. A 10-function schema with verbose descriptions can eat 2k tokens. On an 8k-context model, that's 25% of your budget gone before the user types a word.
  • Error Propagation: If your tool raises an exception, the model doesn't know. You must return a structured error response in the tool output, or the model may retry the same bad call until your loop's turn limit stops it.
  • Idempotency Tokens: Function calls are not transactional. If your payment service processes a charge inside a tool call, and the LLM retries due to a timeout, you'll double-charge the customer. Always include an idempotency key.
What Is LLM Function Calling?

LLM function calling (also called tool use) is a mechanism that lets large language models request the execution of external functions during a conversation, rather than just generating text. Instead of the model guessing an answer, it outputs a structured JSON object specifying a function name and arguments — your application then runs that function and feeds the result back to the model.

This turns the LLM from a static text generator into an agent that can query databases, call APIs, perform calculations, or trigger side effects. The core insight: the model doesn't execute code; it describes what code should run, and you control execution.
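
A minimal sketch of that division of labour, with no API call involved. The dict named `model_output` stands in for the structured request a model would emit, and `get_order_status`, `TOOL_REGISTRY`, and `dispatch` are hypothetical names chosen for illustration, not part of any SDK.

```python
import json

def get_order_status(order_id: str) -> dict:
    """Hypothetical tool: a real version would query an order database."""
    return {"order_id": order_id, "status": "shipped"}

# The application, not the model, owns the mapping from names to code.
TOOL_REGISTRY = {"get_order_status": get_order_status}

def dispatch(tool_request: dict) -> str:
    """Run the function the model asked for and return the result as a JSON string."""
    name = tool_request["name"]
    if name not in TOOL_REGISTRY:
        # Unknown names become structured errors instead of exceptions.
        return json.dumps({"error": f"Unknown function: {name}"})
    # Arguments arrive as a JSON string, so they must be parsed before use.
    args = json.loads(tool_request["arguments"])
    return json.dumps(TOOL_REGISTRY[name](**args))

# Stand-in for what the model emits: a function name plus JSON-encoded arguments.
model_output = {"name": "get_order_status", "arguments": '{"order_id": "A8472"}'}
result = dispatch(model_output)
print(result)
```

The string returned by `dispatch` is what would be sent back to the model as the tool result; the model never touches `get_order_status` directly.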

This pattern is what powers real-world systems like customer support bots that look up orders, code assistants that run tests, and data analysis tools that query live databases.

Function calling exists because raw LLM outputs are unreliable for actions — they hallucinate facts, can't access real-time data, and have no ability to affect the outside world. By forcing the model to declare its intent in a structured format, you get deterministic, auditable, and retryable operations.

The key alternatives are: (1) prompt engineering where you ask the model to output function calls in plain text and parse it yourself (fragile, error-prone), (2) code generation where the model writes and executes arbitrary code (dangerous, no sandboxing), or (3) retrieval-augmented generation (RAG) for read-only data access. Use function calling when you need the model to take actions or access dynamic data; don't use it for pure text generation, simple Q&A, or when latency is critical — the round-trip to execute a function and feed results back adds 500ms-2s per call.

In production, function calling is deceptively simple in demos but brutal at scale. The $47k mistake referenced in the article comes from a specific failure mode: parallel tool calls. When you define multiple functions, the model can request several at once — and if those functions have side effects (like decrementing inventory or charging a credit card), executing them in parallel without proper idempotency or ordering guarantees can corrupt state, double-charge customers, or create race conditions.

The naive implementation just runs all requested functions simultaneously and returns results; the correct pattern requires sequential execution with dependency tracking, idempotency keys, and rollback logic. This is why production-grade function calling loops include a state machine, not just a for loop over tool calls.
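
One way to picture sequential execution with idempotency keys is the sketch below. It assumes an in-memory dict as the key store (a real system would need something durable), and `charge_card`, `run_once`, and `run_sequentially` are hypothetical names. Rollback and dependency tracking are left out.

```python
import hashlib
import json

# In-memory stand-ins: a durable store and a record of real side effects.
_completed = {}
charges = []

def idempotency_key(conversation_id: str, name: str, args: dict) -> str:
    """Derive a stable key from the conversation and the exact call."""
    payload = json.dumps({"c": conversation_id, "n": name, "a": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def charge_card(args: dict) -> dict:
    """Hypothetical side-effecting tool."""
    charges.append(args)
    return {"charged": args["amount"]}

def run_once(conversation_id: str, name: str, args: dict) -> dict:
    """Execute the call unless an identical one already completed."""
    key = idempotency_key(conversation_id, name, args)
    if key in _completed:
        return _completed[key]  # replay the stored result, no second charge
    result = charge_card(args)
    _completed[key] = result
    return result

def run_sequentially(conversation_id: str, calls: list) -> list:
    """Run calls one at a time, in the order given."""
    return [run_once(conversation_id, name, args) for name, args in calls]

calls = [("charge_card", {"order_id": "A8472", "amount": 47.0})]
first = run_sequentially("conv-1", calls)
retry = run_sequentially("conv-1", calls)  # e.g. the loop retried after a timeout
print(first == retry, len(charges))
```

Deriving the key from the call's content means a deliberate second identical charge in the same conversation would also be suppressed, so a production design would probably mix in a per-request identifier as well.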

[Figure: LLM Function Calling architecture. (1) User Message: natural language intent. (2) LLM: GPT-4 / Claude. (3) Function Schema: JSON spec + params. (4) Function Runner: your app code. (5) Tool Result: structured JSON. (6) Final Reply: grounded in tool output. Arrows labelled match, invoke, inject.]
Plain-English First

Think of an LLM as a brilliant but forgetful chef. You give them a recipe book (the tool schema) and ask for dinner. If the recipe says 'add a pinch of salt' without specifying 'to taste', the chef might dump the whole shaker in. Function calling is you handing the chef a phone to call the grocery store — but if you don't write the phone number clearly, they'll call the pet store instead and you'll get dog food on your pasta.

We were three weeks into a production rollout of an LLM-powered customer support agent. The agent could look up orders, process refunds, and escalate to humans. It was working beautifully in staging. Then, at 2:14 AM on a Thursday, a customer asked 'Can you refund my order #8472?' The agent called the refund function — with the wrong order ID. It refunded someone else's purchase. The customer wasn't happy. The finance team wasn't happy. I wasn't happy.

Most tutorials treat function calling as magic: define a schema, call the API, get JSON back. They skip the part where the model hallucinates arguments, where parallel calls deadlock your event loop, where a missing 'required' field silently drops a critical parameter. They assume the model will always pick the right tool. It won't. We've seen it call a 'get_weather' function with 'pineapple' as the location.

This article covers the internals you need to know before you put function calling in production: how the model actually processes tool definitions, the exact parsing logic that breaks under load, and the debugging patterns that will save you when your agent starts calling the wrong function at 2am. We'll walk through a concrete incident, a production-grade implementation, and a triage cheat sheet you can paste into your on-call runbook.

How LLM Function Calling Actually Works Under the Hood

When you send a tool schema to an LLM, you're not 'registering' a function. You're injecting a JSON blob into the model's context window, formatted as part of the system message. The model has been fine-tuned to output a special token sequence that signals a tool call. This is why the schema counts against your token budget: it's literally part of the prompt.

Per OpenAI's documentation, tool definitions are injected into the system message in a syntax the model was trained on, so they count against the context limit and are billed as input tokens. The model then generates a tool_calls field in the response, with each call's arguments as a JSON string inside the arguments field. The Chat Completions API returns that string as-is; it does not parse or validate it for you, so malformed JSON only surfaces when your own json.loads call fails. We've seen models output unescaped newlines inside strings, which breaks the parser.

Anthropic's Claude uses a different approach. The current Messages API returns tool calls as tool_use content blocks whose input is already a parsed JSON object, while older Claude prompt formats emitted an XML-like <function_calls> tag that you had to parse yourself. The key insight: the model doesn't 'understand' your function; it's pattern-matching based on the description and parameter names. If two functions have similar descriptions, the model will confuse them.

inspect_tool_schema.py · PYTHON
import json
from openai import OpenAI

client = OpenAI()

# Define a minimal tool schema
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]

# Send a request and inspect the raw response
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto"
)

# The raw response object contains the tool call
message = response.choices[0].message
print(f"Finish reason: {response.choices[0].finish_reason}")
print(f"Tool calls: {message.tool_calls}")

# Inspect the raw JSON of the arguments — this is what the model actually output
if message.tool_calls:
    for tc in message.tool_calls:
        print(f"Function: {tc.function.name}")
        print(f"Raw arguments JSON: {tc.function.arguments}")
        # The API returns arguments as a raw JSON string; parsing is our job and can fail
        try:
            parsed = json.loads(tc.function.arguments)
            print(f"Parsed arguments: {parsed}")
        except json.JSONDecodeError as exc:
            print(f"Model produced malformed JSON: {exc}")
Token cost of tool schemas
A single tool schema with 3 parameters and verbose descriptions can be ~200 tokens. If you have 10 tools, that's 2,000 tokens per request — before the user says anything. At $0.01 per 1k tokens for GPT-4, that's $0.02 per call just for the schema. Scale to 1M requests/month and you're paying $20k/month for nothing but function definitions.
Production Insight
A recommendation engine serving 2M req/day started returning stale results after a schema migration. The team added a new tool get_user_preferences with a description that overlapped with the existing get_recommendations tool. The model started calling get_user_preferences for recommendation requests, which returned a cached subset of data. The fix was to add explicit exclusion language in the descriptions: 'Use this only for preference settings, NOT for recommendations.'
Key Takeaway
The tool schema is not metadata — it's a prompt. Every word in the description influences the model's behavior. Treat it like a system prompt, not a docstring.
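
The cost arithmetic from the callout above can be reproduced with a few lines. This is a back-of-the-envelope sketch: the $0.01 per 1k tokens price and the ~200 tokens per tool are the article's figures, `schema_cost` and `rough_token_count` are hypothetical helpers, and the 4-characters-per-token rule is only a crude heuristic, not a real tokenizer.

```python
import json

def rough_token_count(tools: list) -> int:
    """Crude estimate: roughly 4 characters per token for JSON-like text."""
    return len(json.dumps(tools)) // 4

def schema_cost(tokens_per_request: int, price_per_1k: float, requests: int) -> float:
    """Dollar cost of resending the schema on every request."""
    return tokens_per_request / 1000 * price_per_1k * requests

tokens = 10 * 200  # 10 tools at ~200 tokens each
per_call = schema_cost(tokens, 0.01, 1)
per_month = schema_cost(tokens, 0.01, 1_000_000)
print(f"{tokens} tokens -> ${per_call:.2f} per call, ${per_month:,.0f} per month")
```

For a real measurement, replace `rough_token_count` with the tokenizer that matches your model, since the heuristic can be off by a wide margin for schemas full of punctuation.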

Practical Implementation: A Production-Grade Function Calling Loop

Most tutorials show a single turn: user asks, model calls tool, you return result. In production, you need a loop that handles multiple tool calls, parallel calls, errors, and context window limits. The loop must also track the conversation history to maintain state.

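Here's a pattern we use in production. A production loop needs to handle:
  • Multiple tool calls in one response (parallel)
  • Tool execution errors with structured error messages
  • Context window overflow detection
  • Idempotency for retries
The listing below implements the first two; context pruning and idempotency have to be layered on top of it.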
production_function_calling_loop.py · PYTHON
import json
import logging
from openai import OpenAI
from typing import List, Dict, Any

logger = logging.getLogger(__name__)

class FunctionCallingAgent:
    def __init__(self, model: str = "gpt-4", max_turns: int = 5):
        self.client = OpenAI()
        self.model = model
        self.max_turns = max_turns
        self.tools = self._define_tools()

    def _define_tools(self) -> List[Dict]:
        return [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Get current weather for a city. Format: city name only, no state/country.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "location": {"type": "string", "description": "City name (e.g., Tokyo, London)"},
                            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                        },
                        "required": ["location"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "search_web",
                    "description": "Search the web for current information. Use for news, prices, or anything time-sensitive.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "query": {"type": "string", "description": "Search query"}
                        },
                        "required": ["query"]
                    }
                }
            }
        ]

    def _execute_tool(self, tool_call: Any) -> Dict:
        """Execute a single tool call and return the result with the matching ID."""
        try:
            func_name = tool_call.function.name
            args = json.loads(tool_call.function.arguments)
            
            if func_name == "get_weather":
                # Simulate API call — in production, call your weather API
                result = {"temperature": 22, "unit": args.get("unit", "celsius")}
            elif func_name == "search_web":
                result = {"results": [f"Simulated result for {args['query']}"]}
            else:
                raise ValueError(f"Unknown function: {func_name}")
            
            return {
                "role": "tool",
                "tool_call_id": tool_call.id,  # Critical: must match the original call ID
                "content": json.dumps(result)
            }
        except Exception as e:
            logger.error(f"Tool execution failed: {e}")
            return {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps({"error": str(e)})
            }

    def run(self, user_message: str) -> str:
        messages = [{"role": "user", "content": user_message}]
        
        for turn in range(self.max_turns):
            response = self.client.chat.completions.create(
                model=self.model,
                messages=messages,
                tools=self.tools,
                tool_choice="auto"
            )
            
            message = response.choices[0].message
            messages.append(message)
            
            # Check if the model wants to call tools
            if message.tool_calls:
                # Run every requested tool call in list order (this comprehension is sequential, not parallel)
                tool_results = [self._execute_tool(tc) for tc in message.tool_calls]
                messages.extend(tool_results)
                continue
            
            # No tool calls — this is the final response
            return message.content
        
        return "Max turns reached without final response."

# Usage
agent = FunctionCallingAgent()
print(agent.run("What's the weather in London and search for latest AI news?"))
Always log the raw arguments string
When debugging, log tool_call.function.arguments (the raw string) not just the parsed dict. If the model outputs malformed JSON, you'll see the exact string that broke the parser. We've caught unescaped quotes and newlines this way.
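
A small sketch of that advice: parse the raw string defensively, log it verbatim on failure, and hand back a structured error that can be returned to the model. `parse_arguments` is a hypothetical helper name, and the error wording is illustrative.

```python
import json
import logging
from typing import Optional, Tuple

logger = logging.getLogger("tool_args")

def parse_arguments(raw: str) -> Tuple[Optional[dict], Optional[dict]]:
    """Return (args, None) on success or (None, error_payload) on failure."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        # %r shows the exact string, including stray newlines and quotes.
        logger.error("Malformed tool arguments: %r (%s)", raw, exc)
        return None, {"error": f"Arguments were not valid JSON: {exc.msg}"}
    if not isinstance(args, dict):
        logger.error("Tool arguments are not a JSON object: %r", raw)
        return None, {"error": "Arguments must be a JSON object."}
    return args, None

good, good_err = parse_arguments('{"location": "Tokyo"}')
truncated, truncated_err = parse_arguments('{"location": "NYC"')
newline, newline_err = parse_arguments('{"note": "line1\nline2"}')  # raw newline inside a string
print(good, truncated_err, newline_err)
```

If the logged string may contain personal data, run it through your redaction step before it reaches the log sink.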
Production Insight
A fraud detection system processing 500 transactions/minute used function calling to query a risk database. The loop didn't handle parallel tool calls — it only processed the first one. When the model requested both 'check_risk_score' and 'get_user_history' in one turn, the second call was silently dropped. The fraud score was calculated without the user history, leading to false negatives. We lost $12k in chargebacks before we caught it.
Key Takeaway
Always iterate over tool_calls as a list. Never assume a single call. Use asyncio.gather for parallel execution but beware of rate limits.
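
As a rough illustration of that takeaway, the sketch below runs several tool calls concurrently with asyncio.gather while a semaphore caps how many are in flight. The tool calls are plain dicts standing in for SDK objects, and `fake_tool` and `execute_all` are hypothetical names; a real tool would await an HTTP client instead of sleeping.

```python
import asyncio
import json

async def fake_tool(name: str, args: dict) -> dict:
    """Stand-in for an async tool; 'explode' simulates a failing call."""
    await asyncio.sleep(0.01)
    if name == "explode":
        raise RuntimeError("boom")
    return {"tool": name, "args": args}

async def execute_all(tool_calls: list, max_concurrency: int = 2) -> list:
    """Run all calls concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(call: dict) -> dict:
        async with semaphore:
            try:
                result = await fake_tool(call["name"], json.loads(call["arguments"]))
                content = json.dumps(result)
            except Exception as exc:
                # A failed tool becomes a structured error, not a crashed loop.
                content = json.dumps({"error": str(exc)})
        return {"role": "tool", "tool_call_id": call["id"], "content": content}

    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(run_one(c) for c in tool_calls))

calls = [
    {"id": "call_1", "name": "get_weather", "arguments": '{"location": "London"}'},
    {"id": "call_2", "name": "search_web", "arguments": '{"query": "AI news"}'},
    {"id": "call_3", "name": "explode", "arguments": "{}"},
]
tool_messages = asyncio.run(execute_all(calls))
print([m["tool_call_id"] for m in tool_messages])
```

Concurrency is only safe for calls without side effects or ordering dependencies; anything that mutates state is better run one at a time.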

When NOT to Use Function Calling

Function calling is not the right tool for every job. Here are three scenarios where you should avoid it:

  1. Simple data extraction: If you just need to extract a structured field from text (e.g., 'extract the date from this email'), use a structured output mode or a fine-tuned model. Function calling adds latency and cost for no benefit.
  2. Real-time streaming: Function calling requires a round-trip to the model. If you need sub-second responses, use a dedicated API call instead of routing through an LLM.
  3. High-frequency, low-variability tasks: If you're calling the same function with the same parameters thousands of times (e.g., 'get_stock_price for a list of tickers'), a batch API call is cheaper and faster than LLM-mediated calls.
when_not_to_use.py · PYTHON
from openai import OpenAI

client = OpenAI()

# BAD: Using function calling for simple extraction
# This costs $0.02 per call and adds 500ms latency
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Extract the date from: 'Meeting on 2024-01-15'"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "extract_date",
            "parameters": {
                "type": "object",
                "properties": {"date": {"type": "string"}},
                "required": ["date"]
            }
        }
    }]
)

# GOOD: Use a regex or a dedicated parser
import re
match = re.search(r'\d{4}-\d{2}-\d{2}', "Meeting on 2024-01-15")
date = match.group(0) if match else None
print(date)  # 2024-01-15 — free and instant
Cost comparison: LLM vs. regex
At 1M requests/month, using GPT-4 for date extraction costs $20k/month. A regex costs $0. The LLM is not always the answer.
Production Insight
A stock market dashboard used function calling to fetch prices for 100 tickers every minute. Each request cost $0.02 and took 2 seconds. The dashboard was always 2 minutes behind. We replaced it with a direct API call that fetched all 100 prices in one batch request for $0.001 and 200ms. The LLM was the bottleneck.
Key Takeaway
Function calling is for dynamic, context-dependent tool selection. For static, repetitive tasks, use a direct API call.

Production Patterns: Scaling Function Calling to Millions of Requests

When you scale function calling to production loads, three patterns emerge:

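  1. Caching tool schemas: The tool schema is static per deployment, but the Chat Completions API is stateless, so the schema still has to be sent with every request. What you can do is build it once, keep it byte-identical, and place it at the start of the prompt so that provider-side prompt caching (where available) can discount the repeated prefix. In our workload this cut token costs by roughly 30%; your saving depends on how much of each prompt is schema.
  2. Rate limiting tool execution: If your tools call external APIs (e.g., weather, payment), you'll hit rate limits. Implement a token bucket per tool. If the limit is exceeded, return a structured error: 'Rate limit exceeded. Try again in 5 seconds.' The model can then relay the delay to the user, and your loop can retry once the bucket refills.
  3. Context window management: After multiple tool calls, the conversation history grows. Implement a sliding window that drops the oldest messages. Keep the system prompt and tool schema, drop user/tool exchanges older than N turns. Drop an assistant tool-call message and its tool results together, because the API rejects a tool message whose originating call is no longer in the history.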
scale_patterns.py · PYTHON
import time
from collections import deque

class TokenBucket:
    """Rate limiter per tool."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate  # tokens per second
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.time()

    def consume(self, tokens: int = 1) -> bool:
        now = time.time()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

class ContextWindow:
    """Sliding window for conversation history."""
    def __init__(self, max_messages: int = 20):
        self.messages = deque(maxlen=max_messages)

    def add(self, message: dict):
        self.messages.append(message)

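    def get_messages(self, system_prompt: str, tools: list) -> list:
        # The system prompt always goes first. The tools argument is unused here:
        # schemas travel in the request's tools= parameter, not in the messages list.
        return [{"role": "system", "content": system_prompt}] + list(self.messages)

# Usage
rate_limiter = TokenBucket(rate=10, capacity=20)  # 10 calls/sec, burst up to 20

def guarded_tool_call(tool_fn, *args, **kwargs) -> dict:
    """Run tool_fn only if the bucket has a token; otherwise return a structured error."""
    if not rate_limiter.consume():
        return {"error": "Rate limit exceeded. Try again in 5 seconds."}
    return tool_fn(*args, **kwargs)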
Context window is not infinite
The original GPT-4 models have an 8k or 32k context window. Each tool call + result adds ~500 tokens. After 10 tool calls, you've consumed 5k tokens. If your system prompt is 2k and tool schema is 2k, you're at 9k, which is over the 8k limit. Implement message dropping before you hit the wall.
Production Insight
A customer support bot handling 10k conversations/day hit context window limits after 3 turns. The team was appending every tool call result without pruning. The model started ignoring the tool schema because it was truncated. The fix: keep only the last 5 messages and drop tool results older than 2 turns.
Key Takeaway
Scale requires explicit resource management: cache schemas, rate limit tool execution, and prune conversation history.
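
Pruning has one trap: a naive sliding window can keep a tool result while discarding the assistant message that requested it. The sketch below shows one way to avoid that, assuming messages are plain dicts with a "role" key; `prune_history` is a hypothetical helper, not a library function.

```python
def prune_history(messages: list, max_messages: int) -> list:
    """Keep system messages plus the newest max_messages others, without orphaned tool results."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # rest[-0:] would return everything, so handle zero explicitly.
    window = rest[-max_messages:] if max_messages > 0 else []
    # A tool message at the front means its assistant tool_calls message was cut; drop it too.
    while window and window[0]["role"] == "tool":
        window.pop(0)
    return system + window

history = [
    {"role": "system", "content": "You are a support agent."},
    {"role": "user", "content": "Weather in Tokyo?"},
    {"role": "assistant", "content": None, "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "tool_call_id": "call_1", "content": '{"temperature": 22}'},
    {"role": "assistant", "content": "It is 22 degrees in Tokyo."},
    {"role": "user", "content": "And in London?"},
]
pruned = prune_history(history, max_messages=3)
print([m["role"] for m in pruned])
```

Counting messages is a simplification; a production version would more likely budget by tokens and summarize what it drops.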

Common Mistakes with Specific Examples (and How They Broke Production)

  1. Not validating tool output before returning to the model: The model will believe whatever you return. If your tool returns an error message like 'API key expired', the model might try to fix it by calling another tool with the API key as an argument. Always validate and sanitize tool output.
  2. Using tool_choice: 'required' blindly: This forces the model to call a tool on every turn. If the user says 'hello', the model will still call a tool, wasting tokens and latency. Use 'auto' and let the model decide.
  3. Assuming the model will call tools in order: The model can call multiple tools in any order. If tool B depends on tool A's output, you must enforce ordering in your code. The model won't do it for you.
common_mistakes.py · PYTHON
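from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "hello"}]
tools = [{
    "type": "function",
    "function": {
        "name": "search_flights",
        "description": "Search flights between two airports.",
        "parameters": {
            "type": "object",
            "properties": {"origin": {"type": "string"}, "destination": {"type": "string"}},
            "required": ["origin", "destination"]
        }
    }
}]

# Mistake 1: Not validating tool output
# BAD: Returning the raw error message
# tool_output = {"error": "Database connection failed: timeout"}
# The model might then call a 'fix_database' function that doesn't exist

# GOOD: Return a sanitized error
def sanitize_tool_error(error: Exception) -> dict:
    if "timeout" in str(error):
        return {"error": "Service temporarily unavailable. Please try again later."}
    return {"error": "The tool failed. Do not retry with the same arguments."}

tool_output = sanitize_tool_error(TimeoutError("Database connection failed: timeout"))

# Mistake 2: Using tool_choice='required' for a greeting
# BAD: Forces a tool call even for 'hello'
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    tools=tools,
    tool_choice="required",  # Don't do this unless you have a reason
)

# GOOD: Let the model decide
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

# Mistake 3: Assuming tool call order
# If the model calls get_user_id and then get_orders, you must execute them in order
# The model might return them in any order in the list
# Always check dependencies before executing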
Test with 'hello' first
Before deploying, send a simple greeting like 'hello' to your agent. If it calls a tool, your tool_choice setting is wrong. The model should not need a tool to respond to 'hello'.
Production Insight
A travel booking agent used tool_choice: 'required' to always call 'search_flights'. When a user said 'I'm just browsing', the agent still called the flights API, costing $0.03 per call. Over 100k sessions, that's $3k/month wasted on meaningless API calls.
Key Takeaway
Let the model decide when to use tools. Forced tool calls waste money and degrade user experience.
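
For the ordering mistake, one option is to declare dependencies in application code and sort the requested calls before running them. The sketch below uses the standard library's graphlib for that. `DEPENDS_ON` and `order_calls` are hypothetical, and the tool calls are plain dicts standing in for SDK objects.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: tool name -> tools whose output it needs first.
DEPENDS_ON = {"get_orders": {"get_user_id"}}

def order_calls(tool_calls: list) -> list:
    """Sort calls so each tool runs after the tools it depends on."""
    requested = {c["name"] for c in tool_calls}
    # Only keep dependencies that were actually requested in this turn.
    graph = {name: DEPENDS_ON.get(name, set()) & requested for name in requested}
    # static_order yields prerequisites before dependents; it raises CycleError on cycles.
    rank = {name: i for i, name in enumerate(TopologicalSorter(graph).static_order())}
    # sorted is stable, so repeated calls to the same tool keep the model's order.
    return sorted(tool_calls, key=lambda c: rank[c["name"]])

calls = [
    {"id": "call_2", "name": "get_orders", "arguments": "{}"},
    {"id": "call_1", "name": "get_user_id", "arguments": "{}"},
]
ordered = order_calls(calls)
print([c["name"] for c in ordered])
```

Sorting fixes execution order only. If the dependent call needs a value from the earlier result, its arguments were guessed by the model and should be re-requested on the next turn.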

Comparison: OpenAI vs. Anthropic vs. Open-Source Function Calling

Each provider implements function calling differently. Here are the production-relevant differences:

OpenAI: Returns tool_calls as a list. Supports parallel calls natively. The model is fine-tuned to output valid JSON. However, it can still produce malformed JSON under stress (e.g., when the context window is nearly full).

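Anthropic Claude: The current Messages API returns tool calls as tool_use content blocks, with the arguments already parsed into a JSON object, so you do not call json.loads on a raw string. Older Claude prompt formats used XML-like <function_calls> tags that you had to parse yourself, which adds parsing overhead. In our experience Claude 3 Opus is better at following complex schema descriptions but slower.

Open-source (Llama 3, Mistral): Requires a specific system prompt format. Native tool-call support depends on your serving stack; without it, you must parse the output yourself. More flexible but more error-prone. We've seen Llama 3 output function calls inside a code block instead of JSON, breaking the parser.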

provider_comparison.py · PYTHON
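import re
from openai import OpenAI

client = OpenAI()

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
        }
    }
}

# OpenAI style (native)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=[weather_tool]
)
# tool_calls is None when the model answers in plain text, so guard before indexing
tool_calls = response.choices[0].message.tool_calls or []
tool_call = tool_calls[0] if tool_calls else None

# Anthropic style
# The Messages API returns content blocks; a tool call is a block of type "tool_use"
# carrying the tool name and an already-parsed input object.
# Older prompt formats emitted XML-like text: <function_calls><invoke name="get_weather">...
# Only that legacy text needs a parser like this one (sample string is illustrative):
legacy_text = '<function_calls><invoke name="get_weather">{"location": "Paris"}</invoke></function_calls>'
pattern = r'<invoke name="(\w+)">(.*?)</invoke>'
matches = re.findall(pattern, legacy_text, re.DOTALL)

# Open-source style (varies by model)
# Llama 3 might output:
# {
#   "function": "get_weather",
#   "arguments": {"location": "Paris"}
# }
# Or inside a markdown code block
# You need a robust parser that handles multiple formats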
Provider choice impacts latency
OpenAI GPT-4: ~2s for first token with tools. Anthropic Claude 3: ~3s. Open-source (local): ~5-10s depending on hardware. For latency-sensitive apps, consider caching or using a smaller model for tool selection.
Production Insight
We migrated from OpenAI to Anthropic for a medical diagnosis assistant because Claude was better at following complex schema descriptions. However, the XML parsing added 50ms overhead per call. We had to rewrite the parser to handle edge cases like nested XML tags in tool arguments.
Key Takeaway
Choose your provider based on schema complexity tolerance and latency requirements. OpenAI for speed and simplicity, Anthropic for complex schemas, open-source for cost control.
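
For self-hosted models that return tool calls as free text, a tolerant parser helps. The sketch below tries three shapes in turn: JSON inside a markdown code block, bare JSON, and JSON embedded in prose. It assumes the {"function": ..., "arguments": {...}} layout shown earlier; `extract_tool_call` is a hypothetical helper and real models may need more cases.

```python
import json
import re
from typing import Optional

# Matches a markdown code block (three backticks, optional "json" tag).
_FENCED = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)

def extract_tool_call(text: str) -> Optional[dict]:
    """Return the first parseable tool call found in text, or None."""
    candidates = [m.group(1) for m in _FENCED.finditer(text)]
    candidates.append(text)  # the whole output might be bare JSON
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])  # outermost braces inside prose
    for candidate in candidates:
        try:
            data = json.loads(candidate.strip())
        except json.JSONDecodeError:
            continue
        # Accept only the expected shape so stray JSON is not mistaken for a call.
        if isinstance(data, dict) and "function" in data and isinstance(data.get("arguments"), dict):
            return data
    return None

fence = "`" * 3
payload = '{"function": "get_weather", "arguments": {"location": "Paris"}}'
from_block = extract_tool_call(f"{fence}json\n{payload}\n{fence}")
from_bare = extract_tool_call(payload)
from_prose = extract_tool_call(f"Sure, calling the tool now: {payload} Hope that helps.")
nothing = extract_tool_call("I cannot help with that.")
print(from_block, nothing)
```

Returning None rather than raising lets the calling loop decide whether to re-prompt the model or treat the output as a plain text answer.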

Debugging and Monitoring Function Calling in Production

You need observability into every step of the function calling pipeline. Here's what to log:

  1. Raw API request: Log the full messages array and tools array. This lets you replay the exact request.
  2. Raw API response: Log the full response object, including finish_reason, tool_calls, and content.
  3. Tool execution: Log the function name, arguments, result, and execution time.
  4. Context window usage: Log the token count before and after each call.
Set up alerts for:
  • finish_reason is 'length' (output cut off by max_tokens or the context limit)
  • Tool call count > 5 (possible infinite loop)
  • Tool execution time > 5s (external API slow)
monitoring.py · PYTHON
import logging
import time
from openai import OpenAI

logger = logging.getLogger(__name__)

class MonitoredAgent:
    def __init__(self):
        self.client = OpenAI()
    
    def call_with_logging(self, messages, tools):
        start = time.time()
        
        # Log request
        logger.info(f"Request: {len(messages)} messages, {len(tools)} tools")
        
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=messages,
            tools=tools
        )
        
        elapsed = time.time() - start
        
        # Log response
        choice = response.choices[0]
        logger.info(f"Response time: {elapsed:.2f}s, finish_reason: {choice.finish_reason}")
        
        if choice.finish_reason == "length":
            logger.warning("Output truncated (max_tokens or context limit hit). Consider pruning history.")
        
        if choice.message.tool_calls:
            logger.info(f"Tool calls: {len(choice.message.tool_calls)}")
            for tc in choice.message.tool_calls:
                logger.info(f"  - {tc.function.name}({tc.function.arguments})")
        
        return response
Don't log sensitive data
Tool arguments may contain PII (e.g., customer names, credit card numbers). Sanitize logs by redacting sensitive fields before writing to your logging system.
Production Insight
A fintech app was logging full tool arguments including credit card numbers. A log aggregation service was breached, exposing 10k customer records. The fix: implement a redaction layer that replaces sensitive fields with '***' before logging.
Key Takeaway
Observability is critical, but so is data privacy. Redact sensitive fields in tool arguments before logging.
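
A redaction layer along those lines can be quite small. In this sketch the set of sensitive field names is an assumption to be replaced with your own, and `redact` and `redacted_arguments` are hypothetical helpers.

```python
import json

# Assumed field names; adjust to the parameters your tools actually accept.
SENSITIVE_KEYS = {"card_number", "cvv", "ssn", "email"}

def redact(value):
    """Recursively replace the values of sensitive keys with '***'."""
    if isinstance(value, dict):
        return {
            key: "***" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value

def redacted_arguments(raw_arguments: str) -> str:
    """Return a log-safe version of a tool call's raw arguments string."""
    try:
        return json.dumps(redact(json.loads(raw_arguments)))
    except json.JSONDecodeError:
        # Unparseable text could still contain PII, so do not echo it.
        return "<unparseable arguments withheld>"

safe = redacted_arguments('{"card_number": "4111111111111111", "amount": 47.0}')
nested = redacted_arguments('{"customer": {"email": "a@example.com", "tier": "gold"}}')
print(safe, nested)
```

Key-based redaction misses sensitive values stored under unexpected names, so it complements rather than replaces keeping such data out of tool arguments in the first place.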
● Production incident · POST-MORTEM · severity: high

The $47k Refund Incident: When Function Calling Refunded the Wrong Customer

Symptom
Customer complained they received a refund confirmation for an order they didn't return. The agent's logs showed a successful process_refund(order_id='8472') call — but '8472' was the customer's account number, not an order ID. The order ID was supposed to be 'A8472'.
Assumption
The team assumed that if the function schema defined order_id as type: string with a description like 'The order ID to refund', the model would always extract the exact order ID from the conversation. They didn't test edge cases where the customer mentioned their account number.
Root cause
The function schema for process_refund had order_id as a required string parameter, but the description was ambiguous: 'The order ID or account number to refund'. The LLM interpreted 'account number' as a valid input and extracted the customer's account number instead of the order ID. The schema validation only checked for type (string), not semantic correctness.
Fix
  1. Tightened the parameter description to 'The exact order ID from the order confirmation email (format: AXXXXX)'.
  2. Added a regex validation step before calling the refund API that rejects any value not matching the order ID pattern.
  3. Implemented a confirmation step: the agent now asks 'I found order A8472 for $47.00 — shall I proceed?' before executing the refund.
  4. Added idempotency keys to all payment mutations.
Key lesson
  • Make every tool parameter description a regex-like pattern, not a vague label. If the format is fixed, specify it in the description.
  • Never let the LLM execute destructive actions without a confirmation step that shows the exact parameters. The model will hallucinate arguments even with a perfect schema.
  • Add input validation outside the LLM — the model's JSON output is just a suggestion. Your application code must enforce business rules.
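
The validation and confirmation fixes might look roughly like the sketch below. The order ID pattern (an 'A' followed by digits) is an assumption inferred from the example ID A8472, and `validate_refund_args` and `process_refund` are hypothetical names; the real refund call is replaced by a placeholder return value.

```python
import re

# Assumed format: 'A' followed by digits, as in A8472. Adjust to your real ID scheme.
ORDER_ID_PATTERN = re.compile(r"A\d+")

def validate_refund_args(args: dict) -> dict:
    """Reject anything that does not look like an order ID, whatever the model sent."""
    order_id = str(args.get("order_id", ""))
    if not ORDER_ID_PATTERN.fullmatch(order_id):
        return {"ok": False, "error": f"'{order_id}' is not a valid order ID. "
                                      "Ask the user for the ID from their confirmation email."}
    return {"ok": True, "order_id": order_id}

def process_refund(args: dict, confirmed: bool) -> dict:
    """Validate first, then require explicit confirmation before the side effect."""
    check = validate_refund_args(args)
    if not check["ok"]:
        return {"error": check["error"]}
    if not confirmed:
        return {"needs_confirmation": f"Refund order {check['order_id']}? Reply yes to proceed."}
    # Placeholder: a real implementation would call the payment API here.
    return {"refunded": check["order_id"]}

rejected = process_refund({"order_id": "8472"}, confirmed=True)   # account number, not an order ID
pending = process_refund({"order_id": "A8472"}, confirmed=False)
done = process_refund({"order_id": "A8472"}, confirmed=True)
print(rejected, pending, done)
```

The `confirmed` flag should come from the user's reply as interpreted by your application, not from an argument the model can set on its own.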
Production debug guide: when the model calls the wrong function at 2am.
Symptom · 01
Model returns a tool call with arguments that don't match any function schema (e.g., 'pineapple' for a city name).
Fix
Check the raw API response for finish_reason and tool_calls. Log the full response object, not just the parsed JSON. Often the model emits a partial or malformed JSON inside the arguments field.
Symptom · 02
Model calls the correct function but with missing or null required parameters.
Fix
Verify the function schema's required array matches your expectations. Also check if the model's context window is too full — it may be truncating the schema. Run len(tokenizer.encode(json.dumps(tools))) to measure schema token count.
Symptom · 03
Model calls the same function multiple times in one response (parallel calls), but your loop processes only the first one.
Fix
Check if your code iterates over response.choices[0].message.tool_calls (a list) or only reads its first element, tool_calls[0]. The OpenAI API returns a list. If you only handle one, you're dropping calls silently.
Symptom · 04
Tool call succeeds but the model ignores the result and generates a hallucinated response.
Fix
Inspect the tool_call_id on every tool message you send back. Each tool result must carry the ID of the call it answers. The OpenAI API generally rejects a tool message that references an unknown ID, but if your code attaches a result to the wrong call's ID, the model pairs the output with the wrong call and falls back to guessing.
★ LLM Function Calling Triage Cheat Sheet: copy-paste diagnostics for when it's 2am and you need answers fast.
Model calls wrong function entirely
Immediate action
Check function names and descriptions for ambiguity
Commands
python -c "import json; tools = json.load(open('tools.json')); print([t['function']['name'] for t in tools['tools']])"
python -c "import json; from openai import OpenAI; client=OpenAI(); r=client.chat.completions.create(model='gpt-4', messages=[{'role':'user','content':'test'}], tools=json.load(open('tools.json'))['tools']); print(r.choices[0].message)"
Fix now
Rename functions to be more distinct (e.g., 'get_current_weather' vs 'get_forecast_weather'). Add explicit 'do not use for X' in descriptions.
Model returns malformed JSON in arguments
Immediate action
Check if the model is being asked to generate JSON in a constrained format without proper schema
Commands
python -c "import json; r='{\"location\":\"NYC\"'; json.loads(r) # This will fail. The model often omits closing braces."
grep -oE '"arguments":"(\\.|[^"\\])*"' /var/log/app/llm.log | head -5 # Check raw argument strings, including escaped quotes
Fix now
Wrap the JSON parsing in a try/except and log the raw string. Implement a retry with a simpler schema or add 'Respond in valid JSON only' to the system prompt.
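A guarded parse along the lines of the fix above might look like this. It is a sketch under stated assumptions: `parse_arguments` is a hypothetical helper, the logger name is arbitrary, and the required-key check is a stand-in for real schema validation.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("tool_calls")


def parse_arguments(raw: str, required: list[str]) -> tuple[Optional[dict], Optional[str]]:
    """Parse a tool call's arguments string; return (arguments, error)."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Keep the raw string in the log so truncated output can be inspected later.
        logger.warning("malformed tool arguments: %r (%s)", raw, exc)
        return None, f"arguments were not valid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return None, "arguments must be a JSON object"
    # A null value is treated the same as a missing key.
    missing = [key for key in required if parsed.get(key) is None]
    if missing:
        return None, f"missing required parameters: {', '.join(missing)}"
    return parsed, None
```

When the error half of the tuple is set, it can be sent back as the tool result so the model sees what went wrong and can correct the call.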
Parallel tool calls cause race conditions
Immediate action
Check if your tool execution is async and if you're awaiting all calls before returning results
Commands
python -c "
import asyncio
async def tool(i): await asyncio.sleep(0.1); return i  # stand-in for a real tool
async def main(): print(await asyncio.gather(*(tool(i) for i in range(3))))
asyncio.run(main())
"
grep -c 'tool_calls' /var/log/app/llm.log # Count log lines that mention tool_calls (a rough proxy for tool-call turns)
Fix now
Force sequential tool execution by setting parallel_tool_calls=False in the API request. Or implement a semaphore to limit concurrency.
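The semaphore option can be sketched with asyncio alone. `run_bounded` is a hypothetical helper, and the default limit of 5 is an arbitrary illustration rather than a recommendation from any provider.

```python
import asyncio
from typing import Awaitable, Callable


async def run_bounded(
    calls: list[Callable[[], Awaitable[object]]],
    max_concurrency: int = 5,
) -> list[object]:
    """Run tool coroutines concurrently, never more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(call: Callable[[], Awaitable[object]]) -> object:
        async with semaphore:
            return await call()

    # gather keeps results in input order; return_exceptions stops one failure
    # from discarding the results of the other calls.
    return await asyncio.gather(*(guarded(c) for c in calls), return_exceptions=True)
```

Because results come back in input order, they can be paired with the original tool call IDs by position. Any exception appears in the list as an exception object and should be converted to an error message before it goes back to the model.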
Model ignores tool output and hallucinates
Immediate action
Verify tool output includes the correct `tool_call_id`
Commands
python -c "print({'role':'tool','tool_call_id':'call_abc123','content':'result'} == {'role':'tool','tool_call_id':'call_def456','content':'result'}) # False — ID mismatch"
jq '.choices[0].message.tool_calls[].id' /var/log/app/llm_response.json
Fix now
Extract the tool_call_id from the model's response and include it verbatim in the tool output message. Never hardcode or generate a new ID.
Function Calling: OpenAI vs. Anthropic vs. Open-Source
| Concern | OpenAI | Anthropic | Open-Source (e.g., Llama 3) |
| --- | --- | --- | --- |
| Native support | Yes, via the tools parameter (function_call is the deprecated legacy form) | Yes, via the tools parameter; calls come back as tool_use content blocks | Depends on the model and serving stack; often requires prompt engineering |
| Parallel calls | Supported (parallel_tool_calls=true) | Supported (multiple tools in one response) | Not natively supported; must implement yourself |
| JSON schema validation | Built-in, strict mode available | Built-in, but less strict | Must validate manually |
| Cost (approximate, varies by model) | $0.01–$0.03 per 1K tokens | $0.008–$0.024 per 1K tokens | Free (self-hosted), but compute cost |
| Reliability of output format | High (99.9% valid JSON) | High (99.5% valid JSON) | Variable (70–90% depending on prompt) |
| Recommendation | Best for production with high reliability needs | Good for cost-sensitive apps with moderate reliability | Best for prototyping or offline batch processing |

Key takeaways

1. Parallel tool calls require explicit concurrency control: without a semaphore or queue, you'll hit rate limits and deadlocks that silently drain your budget.
2. Always validate tool call arguments server-side before execution: LLMs hallucinate parameter values, and one bad SQL injection via a tool call can take down your DB.
3. Implement a max retry counter per function call cycle: runaway loops of failed tool calls can rack up $10k+ in API costs in minutes.
4. Log every tool call with a unique correlation ID tied to the user session: debugging a production incident without this is impossible.
5. Use a timeout per tool call (e.g., 5 seconds) and a global timeout for the entire function calling loop (e.g., 30 seconds) to prevent hung calls from blocking the pipeline.
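The two timeouts in the last takeaway can be layered as in this sketch. `call_with_timeout` and `run_loop` are hypothetical names, the tools are assumed to be zero-argument coroutine functions, and the 5 and 30 second defaults simply mirror the example figures above.

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_timeout(tool: Callable[[], Awaitable[object]], seconds: float) -> dict:
    """Run one tool call; convert a timeout into a structured error."""
    try:
        return {"ok": True, "result": await asyncio.wait_for(tool(), timeout=seconds)}
    except asyncio.TimeoutError:
        return {"ok": False, "error": f"tool call exceeded {seconds}s"}


async def run_loop(
    tools: list[Callable[[], Awaitable[object]]],
    per_call: float = 5.0,
    overall: float = 30.0,
) -> list[dict]:
    """Run tool calls in order, stopping once the overall budget is spent."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + overall
    results = []
    for tool in tools:
        remaining = deadline - loop.time()
        if remaining <= 0:
            results.append({"ok": False, "error": "global timeout reached"})
            break
        # A call never gets more time than what is left in the global budget.
        results.append(await call_with_timeout(tool, min(per_call, remaining)))
    return results
```

A timed-out call yields an error entry rather than an exception, so the loop can keep going or hand the error back to the model.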

Common mistakes to avoid

4 patterns

No concurrency control on parallel tool calls

Symptom
Under load, tool calls start timing out or returning 429 errors, causing the LLM to retry and compound the problem. You see a spike in API costs and latency.
Fix
Implement a semaphore (e.g., asyncio.Semaphore(5)) to limit concurrent tool calls. Also add a per-call timeout and a global loop timeout.

Trusting LLM-generated arguments without validation

Symptom
LLM passes a string like '1; DROP TABLE users;' as a parameter to your database query tool. Production DB gets wiped.
Fix
Always validate and sanitize tool call arguments against a schema (e.g., Pydantic models) before executing. Reject invalid calls and return a clear error to the LLM.
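As a standard-library illustration of this fix (the article suggests Pydantic; this sketch swaps in a hand-written type check plus a parameterized query instead), consider a lookup tool backed by SQLite. The `users` table and the `find_user` function are hypothetical.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, user_id: object) -> dict:
    """Look up a user by ID using validated input and a parameterized query."""
    # Reject anything that is not a plain positive integer (bool is excluded on purpose).
    if isinstance(user_id, bool) or not isinstance(user_id, int) or user_id <= 0:
        return {"ok": False, "error": "user_id must be a positive integer"}
    # The ? placeholder binds the value as data; it is never spliced into the SQL text.
    row = conn.execute("SELECT id, name FROM users WHERE id = ?", (user_id,)).fetchone()
    if row is None:
        return {"ok": False, "error": "user not found"}
    return {"ok": True, "user": {"id": row[0], "name": row[1]}}
```

The type check rejects the injection string before it reaches the database, and the placeholder would keep it inert even if the check were missing. The error dict is what goes back to the model.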

No retry limit on failed tool calls

Symptom
A tool call fails due to a transient error. With no cap, the LLM keeps retrying the same call at about $0.01 per attempt, and the loop runs unattended until you wake up to a $47k bill.
Fix
Set a max retry count (e.g., 3) per tool call. After that, return a terminal error to the LLM and escalate to a fallback handler.
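One way to enforce such a limit is a small per-cycle counter, sketched below. `RetryBudget` is a hypothetical class, and the `terminal` flag is an assumed convention your loop would check before escalating to a fallback handler.

```python
from collections import Counter


class RetryBudget:
    """Track failures per tool within one function calling cycle."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.failures: Counter = Counter()

    def record_failure(self, tool_name: str, error: str) -> dict:
        """Count a failure and build the error payload returned to the model."""
        self.failures[tool_name] += 1
        terminal = self.failures[tool_name] >= self.max_failures
        message = error
        if terminal:
            message += (
                f" (limit of {self.max_failures} failures reached;"
                f" do not call {tool_name} again)"
            )
        return {"ok": False, "terminal": terminal, "error": message}

    def is_exhausted(self, tool_name: str) -> bool:
        """True once the tool has used up its failure budget."""
        return self.failures[tool_name] >= self.max_failures
```

Create one budget per user request. Before executing a call, check `is_exhausted` and skip execution if it returns True, so a model that ignores the terminal message still cannot keep the loop running.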

Missing correlation IDs in logs

Symptom
A user reports a bug. You grep logs and see thousands of tool calls with no way to link them to a specific session. Incident response takes hours.
Fix
Generate a unique correlation ID per user request and attach it to every log line, API call, and tool execution. Use structured logging (e.g., JSON) for easy querying.
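A minimal version of this, using only the standard library, might look like the following. The names `correlation_id`, `JsonFormatter`, and `start_request`, and the `tool` log field, are assumptions made for the sketch.

```python
import contextvars
import json
import logging
import uuid

# Holds the ID of the request currently being handled in this context.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object carrying the correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
            # Optional field, supplied via logger.info(..., extra={"tool": name}).
            "tool": getattr(record, "tool", None),
        })


def start_request() -> str:
    """Generate one ID per user request and bind it to the current context."""
    new_id = uuid.uuid4().hex
    correlation_id.set(new_id)
    return new_id
```

Call `start_request()` at the top of each request handler and attach `JsonFormatter` to your log handler. Every later log line in that context then carries the same ID, so one query on `correlation_id` pulls up the whole session.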
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
Explain how function calling works in LLMs at a technical level.
Q02 · SENIOR
What are the risks of parallel tool calls and how do you mitigate them?
Q03 · SENIOR
Design a production-grade function calling system that handles millions ...
Q04 · SENIOR
How would you handle a scenario where an LLM keeps calling the same tool...
Q05 · SENIOR
Describe a real production incident you've seen with function calling an...
Q01 of 05 · JUNIOR

Explain how function calling works in LLMs at a technical level.

ANSWER
The LLM is given a list of function definitions as part of the API request. It doesn't execute functions — it outputs a structured JSON object with the function name and arguments. The developer's code then parses that JSON, executes the function, and sends the result back to the LLM. The LLM can then use that result to generate a final response or request another function call. This is a loop: LLM -> function call -> execution -> result -> LLM.
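The loop in this answer can be sketched without any provider SDK by passing the model in as a callable. Everything here is illustrative: `run_agent` is a hypothetical function, messages are plain dicts shaped like OpenAI chat messages, and `max_turns` is an added safety cap that the answer itself does not mention.

```python
import json
from typing import Callable


def run_agent(
    model: Callable[[list[dict]], dict],
    registry: dict[str, Callable[..., object]],
    user_text: str,
    max_turns: int = 5,
) -> str:
    """Drive the LLM -> function call -> execution -> result -> LLM loop."""
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_turns):
        reply = model(messages)  # the model only proposes calls; it never runs them
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply["content"]  # plain text means the model is done
        for call in calls:
            arguments = json.loads(call["function"]["arguments"])
            result = registry[call["function"]["name"]](**arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })
    return "stopped: max_turns reached"
```

In a real system `model` would wrap an API request, and argument parsing and tool execution would be guarded against errors. Both are left out here to keep the control flow visible.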
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
How does LLM function calling work under the hood?
02
What is the difference between parallel and sequential tool calls?
03
How do I handle rate limits with parallel tool calls?
04
Can I use function calling with open-source LLMs?
05
How do I debug a function calling loop that's stuck?