LLM Function Calling Explained — The $47k Mistake We Made with Parallel Tool Calls
How LLM function calling works under the hood, the production failures that break naive implementations, and a debug guide for when your agent starts calling the wrong tool at 2am.
- Tool Schema Definition: The JSON schema you define is not just documentation; it's the only thing the model sees. If your descriptions are vague, the model will hallucinate arguments. We learned this when a weather function started receiving 'pineapple' as its location.
- Parallel vs Sequential Calls: Most tutorials show one function call. In production, models can emit multiple tool calls in a single turn. If your loop doesn't handle that, you'll silently drop requests and corrupt state.
- Stop Reason Parsing: Relying on finish_reason: 'tool_calls' is fragile. Some providers return 'function_call' or omit it entirely. Always check for the actual tool call object in the response.
- Token Budget for Tools: The tool schema counts against your context window. A 10-function schema with verbose descriptions can eat 2k tokens. On a 10k-token budget, that's 20% gone before the user types a word.
- Error Propagation: If your tool raises an exception, the model doesn't know. You must return a structured error response in the tool output, or the model may keep retrying the same bad call.
- Idempotency Tokens: Function calls are not transactional. If your payment service processes a charge inside a tool call and the LLM retries due to a timeout, you'll double-charge the customer. Always include an idempotency key.
LLM function calling (also called tool use) is a mechanism that lets large language models request the execution of external functions during a conversation, rather than just generating text. Instead of the model guessing an answer, it outputs a structured JSON object specifying a function name and arguments — your application then runs that function and feeds the result back to the model.
This turns the LLM from a static text generator into an agent that can query databases, call APIs, perform calculations, or trigger side effects. The core insight: the model doesn't execute code; it describes what code should run, and you control execution.
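To make the "model describes, you execute" split concrete, here is a minimal sketch with no network calls. The tool name get_order_status, the registry, and the hand-written model_tool_call are all hypothetical; the dict shapes follow the OpenAI-style tools format as an assumption.

```python
import json

# Hypothetical tool schema in the OpenAI-style "tools" format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of a single order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]


def get_order_status(order_id: str) -> dict:
    # Stand-in for a real database lookup.
    return {"order_id": order_id, "status": "shipped"}


# Maps tool names the model may request to real Python callables.
REGISTRY = {"get_order_status": get_order_status}

# What the model emits: a function name plus a JSON *string* of arguments.
# Written by hand here instead of coming from an API response.
model_tool_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_order_status", "arguments": '{"order_id": "A8472"}'},
}


def execute(tool_call: dict) -> dict:
    """Run the requested function and build the message fed back to the model."""
    fn = REGISTRY[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    result = fn(**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }


tool_message = execute(model_tool_call)
```

The application, not the model, owns REGISTRY, so anything not listed there simply cannot run.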
This pattern is what powers real-world systems like customer support bots that look up orders, code assistants that run tests, and data analysis tools that query live databases.
Function calling exists because raw LLM outputs are unreliable for actions — they hallucinate facts, can't access real-time data, and have no ability to affect the outside world. By forcing the model to declare its intent in a structured format, you get deterministic, auditable, and retryable operations.
The key alternatives are: (1) prompt engineering where you ask the model to output function calls in plain text and parse it yourself (fragile, error-prone), (2) code generation where the model writes and executes arbitrary code (dangerous, no sandboxing), or (3) retrieval-augmented generation (RAG) for read-only data access. Use function calling when you need the model to take actions or access dynamic data; don't use it for pure text generation, simple Q&A, or when latency is critical — the round-trip to execute a function and feed results back adds 500ms-2s per call.
In production, function calling is deceptively simple in demos but brutal at scale. The $47k mistake referenced in the article comes from a specific failure mode: parallel tool calls. When you define multiple functions, the model can request several at once — and if those functions have side effects (like decrementing inventory or charging a credit card), executing them in parallel without proper idempotency or ordering guarantees can corrupt state, double-charge customers, or create race conditions.
The naive implementation just runs all requested functions simultaneously and returns results; the correct pattern requires sequential execution with dependency tracking, idempotency keys, and rollback logic. This is why production-grade function calling loops include a state machine, not just a for loop over tool calls.
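A rough sketch of the sequential-plus-idempotency part of that pattern follows. RefundService, the key derivation, and the conversation_id parameter are assumptions made for illustration; dependency tracking and rollback are left out, and only one tool is wired up to keep it short.

```python
import hashlib
import json


class RefundService:
    """Hypothetical side-effecting backend that honors idempotency keys."""

    def __init__(self):
        self.processed = {}  # idempotency key -> stored result
        self.charges = 0     # how many times the side effect really ran

    def refund(self, order_id: str, idempotency_key: str) -> dict:
        if idempotency_key in self.processed:
            # Replay of a call we already handled: no new side effect.
            return self.processed[idempotency_key]
        self.charges += 1
        result = {"order_id": order_id, "refunded": True}
        self.processed[idempotency_key] = result
        return result


def idempotency_key(conversation_id: str, name: str, arguments: str) -> str:
    """Derive a stable key so a retried call maps to the same operation."""
    canonical = json.dumps(json.loads(arguments), sort_keys=True)
    raw = f"{conversation_id}:{name}:{canonical}"
    return hashlib.sha256(raw.encode()).hexdigest()


def run_sequentially(tool_calls, conversation_id, service):
    """Execute side-effecting calls one at a time, in the order requested."""
    results = []
    for call in tool_calls:
        name = call["function"]["name"]
        raw_args = call["function"]["arguments"]
        key = idempotency_key(conversation_id, name, raw_args)
        args = json.loads(raw_args)
        results.append(service.refund(args["order_id"], key))
    return results
```

Deriving the key from the conversation and arguments means a deliberate second identical refund in the same conversation would also be deduplicated. Whether that trade-off is acceptable depends on the business rules.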
Think of an LLM as a brilliant but forgetful chef. You give them a recipe book (the tool schema) and ask for dinner. If the recipe says 'add a pinch of salt' without specifying 'to taste', the chef might dump the whole shaker in. Function calling is you handing the chef a phone to call the grocery store — but if you don't write the phone number clearly, they'll call the pet store instead and you'll get dog food on your pasta.
We were three weeks into a production rollout of an LLM-powered customer support agent. The agent could look up orders, process refunds, and escalate to humans. It was working beautifully in staging. Then, at 2:14 AM on a Thursday, a customer asked 'Can you refund my order #8472?' The agent called the refund function — with the wrong order ID. It refunded someone else's purchase. The customer wasn't happy. The finance team wasn't happy. I wasn't happy.
Most tutorials treat function calling as magic: define a schema, call the API, get JSON back. They skip the part where the model hallucinates arguments, where parallel calls deadlock your event loop, where a missing 'required' field silently drops a critical parameter. They assume the model will always pick the right tool. It won't. We've seen it call a 'get_weather' function with a parameter called 'pineapple'.
This article covers the internals you need to know before you put function calling in production: how the model actually processes tool definitions, the exact parsing logic that breaks under load, and the debugging patterns that will save you when your agent starts calling the wrong function at 3am. We'll walk through a concrete incident, a production-grade implementation, and a triage cheat sheet you can paste into your on-call runbook.
How LLM Function Calling Actually Works Under the Hood
When you send a tool schema to an LLM, you're not 'registering' a function. You're injecting a JSON blob into the model's context window, formatted as part of the system message. The model has been fine-tuned to output a special token sequence that signals a tool call. This is why the schema counts against your token budget: it's literally part of the prompt.
OpenAI's implementation injects the tool definitions into the system message before tokenization. The model then generates a tool_calls field in the response. Under the hood, the model outputs a JSON string inside the arguments field. The API hands that string back as-is, so parsing it is your job, and the string can be malformed. We've seen models output unescaped newlines inside strings, which breaks the parser.
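One way to guard against that failure is to parse the arguments string defensively and turn a parse failure into data instead of an exception. This is a sketch; the function name parse_arguments and the error dict shape are illustrative choices, not part of any provider's API.

```python
import json


def parse_arguments(raw: str):
    """Return (args, error). Exactly one of the two is None."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Keep the raw string so logs show exactly what broke the parser.
        return None, {"error": "invalid_json", "detail": str(exc), "raw": raw}
    if not isinstance(args, dict):
        return None, {"error": "arguments_not_object", "raw": raw}
    return args, None


ok_args, ok_err = parse_arguments('{"location": "Berlin"}')

# A literal newline inside a JSON string is invalid, so json.loads raises.
bad_args, bad_err = parse_arguments('{"location": "Ber\nlin"}')
```

The error dict can be serialized into the tool output so the model sees why its call was rejected.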
Anthropic's Claude takes a different approach. Earlier prompt-based tool use had the model emit an XML-like <function_calls> tag; the current Messages API instead returns tool_use content blocks whose input is already a parsed object. This is more robust against malformed output but adds overhead. The key insight: the model doesn't 'understand' your function. It pattern-matches on the description and parameter names. If two functions have similar descriptions, the model will confuse them.
One example: a new tool, get_user_preferences, was added with a description that overlapped with the existing get_recommendations tool. The model started calling get_user_preferences for recommendation requests, which returned a cached subset of data. The fix was to add explicit exclusion language in the descriptions: 'Use this only for preference settings, NOT for recommendations.'
Practical Implementation: A Production-Grade Function Calling Loop
Most tutorials show a single turn: user asks, model calls tool, you return result. In production, you need a loop that handles multiple tool calls, parallel calls, errors, and context window limits. The loop must also track the conversation history to maintain state.
- Multiple tool calls in one response (parallel)
- Tool execution errors with structured error messages
- Context window overflow detection
- Idempotency for retries
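The core of such a loop might look like the sketch below. It covers multiple tool calls per response and structured errors only; context window checks and idempotency are omitted for brevity. The call_model parameter is an injected function (an assumption made so the loop runs without a provider SDK), and messages use OpenAI-style dicts.

```python
import json

MAX_TURNS = 5  # assumed cap to stop runaway loops


def run_tool(registry, call):
    """Execute one tool call and always return a tool message, even on failure."""
    name = call["function"]["name"]
    try:
        args = json.loads(call["function"]["arguments"])
        payload = {"ok": True, "result": registry[name](**args)}
    except Exception as exc:
        # Surface the failure to the model as data instead of crashing the loop.
        payload = {"ok": False, "error": type(exc).__name__, "detail": str(exc)}
    return {
        "role": "tool",
        "tool_call_id": call["id"],
        "content": json.dumps(payload),
    }


def run_agent(call_model, registry, messages, max_turns=MAX_TURNS):
    """Drive the conversation until the model answers without tool calls.

    call_model(messages) must return an assistant message dict.
    """
    for _ in range(max_turns):
        assistant = call_model(messages)
        messages.append(assistant)
        # Check for the tool call object itself, not finish_reason.
        tool_calls = assistant.get("tool_calls") or []
        if not tool_calls:
            return assistant.get("content")
        # Every requested call gets a reply carrying its own tool_call_id.
        for call in tool_calls:
            messages.append(run_tool(registry, call))
    raise RuntimeError("tool loop exceeded max_turns")
```

Calls run one after another here. Running them concurrently is an optimization that only makes sense for tools without side effects.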
Log tool_call.function.arguments (the raw string), not just the parsed dict. If the model outputs malformed JSON, you'll see the exact string that broke the parser. We've caught unescaped quotes and newlines this way.
Treat tool_calls as a list. Never assume a single call. Use asyncio.gather for parallel execution of read-only tools, but beware of rate limits.
When NOT to Use Function Calling
Function calling is not the right tool for every job. Here are three scenarios where you should avoid it:
- Simple data extraction: If you just need to extract a structured field from text (e.g., 'extract the date from this email'), use a structured output mode or a fine-tuned model. Function calling adds latency and cost for no benefit.
- Real-time streaming: Function calling requires a round-trip to the model. If you need sub-second responses, use a dedicated API call instead of routing through an LLM.
- High-frequency, low-variability tasks: If you're calling the same function with the same parameters thousands of times (e.g., 'get_stock_price for a list of tickers'), a batch API call is cheaper and faster than LLM-mediated calls.
Production Patterns: Scaling Function Calling to Millions of Requests
When you scale function calling to production loads, three patterns emerge:
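- Caching tool schemas: The tool schema is static per deployment. Build and serialize it once on the client instead of per request, and keep it byte-identical so provider-side prompt caching (where available) can reuse the prefix. Stateless chat APIs still expect the tools array on every request, so the saving comes from cached prompt tokens, not from omitting the schema. This cuts token costs by 30%.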
- Rate limiting tool execution: If your tools call external APIs (e.g., weather, payment), you'll hit rate limits. Implement a token bucket per tool. If the limit is exceeded, return a structured error: 'Rate limit exceeded. Try again in 5 seconds.' The model will wait or ask the user.
- Context window management: After multiple tool calls, the conversation history grows. Implement a sliding window that drops the oldest messages. Keep the system prompt and tool schema, drop user/tool exchanges older than N turns.
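The per-tool token bucket from the second pattern could be sketched as follows. The class, the guarded_call helper, and the error payload shape are illustrative assumptions; the clock is injectable only so the behavior can be checked without sleeping.

```python
import time


class TokenBucket:
    """Allows `capacity` calls in a burst, refilled at `refill_per_sec`."""

    def __init__(self, capacity: int, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def retry_after(self) -> float:
        """Seconds until at least one token is available."""
        return max(0.0, (1 - self.tokens) / self.refill_per_sec)


def guarded_call(buckets, name, fn, **kwargs):
    """Run fn only if the tool's bucket has a token; otherwise return a structured error."""
    bucket = buckets[name]
    if not bucket.try_acquire():
        return {
            "ok": False,
            "error": "rate_limited",
            "retry_after_seconds": round(bucket.retry_after(), 2),
        }
    return {"ok": True, "result": fn(**kwargs)}
```

Returning retry_after_seconds in the payload gives the model something specific to relay to the user.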
Common Mistakes with Specific Examples (and How They Broke Production)
Here are three mistakes we've made and seen others make:
- Not validating tool output before returning to the model: The model will believe whatever you return. If your tool returns an error message like 'API key expired', the model might try to fix it by calling another tool with the API key as an argument. Always validate and sanitize tool output.
- Using tool_choice: 'required' blindly: This forces the model to call a tool on every turn. If the user says 'hello', the model will still call a tool, wasting tokens and latency. Use 'auto' and let the model decide.
- Assuming the model will call tools in order: The model can call multiple tools in any order. If tool B depends on tool A's output, you must enforce ordering in your code. The model won't do it for you.
If the agent calls a tool in reply to a greeting, the tool_choice setting is wrong. The model should not need a tool to respond to 'hello'.
In one case, an agent was configured with tool_choice: 'required' to always call 'search_flights'. When a user said 'I'm just browsing', the agent still called the flights API, costing $0.03 per call. Over 100k sessions, that's $3k/month wasted on meaningless API calls.
Comparison: OpenAI vs. Anthropic vs. Open-Source Function Calling
Each provider implements function calling differently. Here are the production-relevant differences:
OpenAI: Returns tool_calls as a list. Supports parallel calls natively. The model is fine-tuned to output valid JSON. However, it can still produce malformed JSON under stress (e.g., when the context window is nearly full).
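Anthropic Claude: The Messages API returns calls as tool_use content blocks with already-parsed input (older prompt-based tool use relied on XML-like tags). More robust against malformed output but adds parsing overhead. Claude 3 Opus is better at following complex schema descriptions but slower.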
Open-source (Llama 3, Mistral): Requires a specific system prompt format. No native API support — you must parse the output yourself. More flexible but more error-prone. We've seen Llama 3 output function calls inside a code block instead of JSON, breaking the parser.
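For the self-parsing case, a tolerant extractor can look inside code fences before trying the raw text. This sketch assumes the model was prompted to emit an object with "name" and "arguments" keys, which is a hypothetical convention and not a standard for these models.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_tool_call(text: str):
    """Return {'name': ..., 'arguments': {...}} or None if nothing parseable is found."""
    # Try fenced blocks first, then fall back to the whole output.
    candidates = FENCE_RE.findall(text) + [text]
    for candidate in candidates:
        try:
            obj = json.loads(candidate.strip())
        except json.JSONDecodeError:
            continue
        if (
            isinstance(obj, dict)
            and isinstance(obj.get("name"), str)
            and isinstance(obj.get("arguments"), dict)
        ):
            return obj
    return None
```

Returning None instead of raising lets the caller decide whether to treat the output as plain text or to ask the model to retry.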
Debugging and Monitoring Function Calling in Production
You need observability into every step of the function calling pipeline. Here's what to log:
- Raw API request: Log the full messages array and tools array. This lets you replay the exact request.
- Raw API response: Log the full response object, including finish_reason, tool_calls, and content.
- Tool execution: Log the function name, arguments, result, and execution time.
- Context window usage: Log the token count before and after each call.
Alert on:
- finish_reason is 'length' (context window exceeded)
- Tool call count > 5 (possible infinite loop)
- Tool execution time > 5s (external API slow)
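Those three thresholds can be checked mechanically per turn. In this sketch the TurnRecord type, its field names, and the alert labels are hypothetical; only the threshold values come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_TOOL_CALLS = 5       # more than this suggests a possible infinite loop
MAX_TOOL_SECONDS = 5.0   # slower than this suggests a slow external API


@dataclass
class TurnRecord:
    correlation_id: str           # hypothetical ID tying together logs for one request
    finish_reason: Optional[str]
    tool_call_count: int
    slowest_tool_seconds: float


def alerts_for(record: TurnRecord) -> List[str]:
    """Return the names of every alert condition this turn triggers."""
    alerts = []
    if record.finish_reason == "length":
        alerts.append("finish_reason_length")
    if record.tool_call_count > MAX_TOOL_CALLS:
        alerts.append("tool_call_count_exceeded")
    if record.slowest_tool_seconds > MAX_TOOL_SECONDS:
        alerts.append("tool_execution_slow")
    return alerts
```

Carrying a correlation ID on each record makes it possible to join an alert back to the raw request and response logs.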
The $47k Refund Incident: When Function Calling Refunded the Wrong Customer
The agent issued a process_refund(order_id='8472') call, but '8472' was the customer's account number, not an order ID. The order ID was supposed to be 'A8472'.
The team assumed that by defining order_id as type: string with a description like 'The order ID to refund', the model would always extract the exact order ID from the conversation. They didn't test edge cases where the customer mentioned their account number.
The schema for process_refund had order_id as a required string parameter, but the description was ambiguous: 'The order ID or account number to refund'. The LLM interpreted 'account number' as a valid input and extracted the customer's account number instead of the order ID. The schema validation only checked for type (string), not semantic correctness.
- Make every tool parameter description a regex-like pattern, not a vague label. If the format is fixed, specify it in the description.
- Never let the LLM execute destructive actions without a confirmation step that shows the exact parameters. The model will hallucinate arguments even with a perfect schema.
- Add input validation outside the LLM — the model's JSON output is just a suggestion. Your application code must enforce business rules.
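The last two lessons can be sketched as a validation gate that runs before any refund is executed. The order ID pattern is an assumption inferred from the 'A8472' example, and the ownership check and confirmation wording are hypothetical.

```python
import re

# Assumed format: the letter A followed by at least four digits.
ORDER_ID_RE = re.compile(r"A\d{4,}")


class ValidationError(ValueError):
    """Raised when model-supplied arguments violate a business rule."""


def validate_refund_args(args: dict, customer_order_ids: set) -> str:
    """Return the order ID only if it is well-formed and belongs to this customer."""
    order_id = args.get("order_id")
    if not isinstance(order_id, str) or not ORDER_ID_RE.fullmatch(order_id):
        raise ValidationError(f"order_id {order_id!r} does not match the expected format")
    if order_id not in customer_order_ids:
        raise ValidationError(f"order_id {order_id!r} does not belong to this customer")
    return order_id


def confirmation_prompt(order_id: str) -> str:
    """Text shown to the user before a destructive action runs."""
    return f"About to refund order {order_id}. Reply YES to confirm."
```

With this gate in place, the account number '8472' from the incident would fail the format check before reaching the refund service.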
Check finish_reason and tool_calls. Log the full response object, not just the parsed JSON. Often the model emits partial or malformed JSON inside the arguments field.
Check that the schema's required array matches your expectations. Also check whether the model's context window is too full, in which case the schema may be getting truncated. Run len(tokenizer.encode(json.dumps(tools))) to measure the schema token count.
Check whether your code iterates over response.choices[0].message.tool_calls (plural) or just accesses .tool_call (singular). The OpenAI API returns a list. If you only handle one, you're dropping calls silently.
Check the tool_call_id in the response. The model expects the tool output to include the same ID. If your code returns output without the matching ID, the model can't associate the result with the call and falls back to guessing.
python -c "import json; tools = json.load(open('tools.json')); print([t['function']['name'] for t in tools['tools']])"
python -c "import json; from openai import OpenAI; client=OpenAI(); r=client.chat.completions.create(model='gpt-4', messages=[{'role':'user','content':'test'}], tools=json.load(open('tools.json'))['tools']); print(r.choices[0].message)"
Key takeaways
Common mistakes to avoid
- No concurrency control on parallel tool calls
- Trusting LLM-generated arguments without validation
- No retry limit on failed tool calls
- Missing correlation IDs in logs
Interview Questions on This Topic
Explain how function calling works in LLMs at a technical level.