Model Context Protocol (MCP) Explained — The $4k/Month Token Waste We Fixed by Ditching Custom Integrations
Stop wiring AI agents to APIs by hand.
- MCP Client: The host process that manages server connections and tool dispatch. In production, a single misconfigured client can leak 10k connections/hour.
- MCP Server: Exposes resources, tools, and prompts over a JSON-RPC transport. We saw a 23% accuracy drop when a server returned stale resources due to missing cache headers.
- Resources: Read-only data like files or database rows. Our fraud pipeline crashed when a resource returned 50MB because the server didn't paginate; we fixed it with a 3-line cursor.
- Tools: Callable functions the LLM can invoke. A payment server tool that expected ISO 8601 but got Unix timestamps caused 800ms p99 spikes on 12% of requests.
- Prompts: Pre-written templates for the LLM. We shipped a prompt that leaked PII because it interpolated user input into a system message. Never do that.
- Transport: The underlying communication layer (stdio or SSE). Stdio is fine for dev; SSE in production needs backpressure. We lost 2% of messages in a burst.
Think of MCP as a universal remote for AI assistants. Instead of building a separate remote for your TV, stereo, and lights, MCP gives you one remote that any assistant (Claude, ChatGPT) can use to control any tool (calendar, database, 3D printer). It's the USB-C of AI integrations — one plug, everything works.
Last quarter, my team spent $4,000 on OpenAI tokens just to keep three custom API integrations alive. Each integration had its own authentication, its own retry logic, its own error handling. When the CRM API changed its schema, we had to redeploy three microservices. That's the fragmentation problem MCP solves — but most tutorials treat it like a magic wand. They show you how to build a "hello world" server and call it a day. In production, you'll hit connection storms, payload bombs, and tool-calling loops that burn through your token budget before breakfast.
How Model Context Protocol Actually Works Under the Hood
MCP is not a library — it's a wire protocol. At its core, it uses JSON-RPC 2.0 over a transport layer (stdio or SSE). The host process (your app) spawns a client instance for each server. The client sends initialize, tools/list, resources/list, and prompts/list requests to discover capabilities. Then it sends tools/call to invoke a tool. The server responds with a ToolResult containing TextContent, ImageContent, or EmbeddedResource. The critical detail most tutorials skip: MCP is stateless between calls. Each tools/call is independent. If your tool needs session state (e.g., a database connection), you must manage it inside the server. We learned this when a server kept opening new DB connections on every call — we hit the connection pool limit at 500 concurrent calls.
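To make the wire format concrete, here is a minimal sketch of the JSON-RPC 2.0 envelopes behind tools/list and tools/call. It builds plain dictionaries only and does not use any MCP SDK. The helper names (make_request, make_tool_result) and the get_balance tool are hypothetical, chosen for illustration.

```python
import json
from itertools import count

# Each JSON-RPC request carries an id so the response can be matched to it.
_ids = count(1)


def make_request(method, params=None):
    """Build a JSON-RPC 2.0 request envelope (hypothetical helper)."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


def make_tool_result(request_id, text, is_error=False):
    """Build the response to a tools/call request with one text content item."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "result": {
            "content": [{"type": "text", "text": text}],
            "isError": is_error,
        },
    }


# Discovery first, then invocation. "get_balance" is a made-up tool name.
list_req = make_request("tools/list")
call_req = make_request(
    "tools/call",
    {"name": "get_balance", "arguments": {"account_id": 42}},
)
response = make_tool_result(call_req["id"], "balance: 10.00")

# This string is what would travel over stdio or SSE.
wire = json.dumps(call_req)
```

Nothing in the tools/call envelope refers to an earlier call, so anything that must persist between calls, such as a connection pool, has to live in the server process.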
Add a last_updated timestamp to each resource and force a refresh if the timestamp is older than 5 minutes.
Practical Implementation: Building a Production-Ready MCP Server
Let's build an MCP server that queries a PostgreSQL database. Most tutorials show a toy example with in-memory data. In production, you need connection pooling, prepared statements, and error handling. We'll use asyncpg for async Postgres access and pydantic for input validation. The key pattern: register tools that accept validated arguments, execute a parameterized query, and return structured results. Never interpolate user input into SQL — the LLM can inject SQL through tool arguments. We saw this happen when a user asked 'show me all users named Robert; DROP TABLE users;' — the tool executed it because the server used f-strings.
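The sketch below illustrates the parameterized-query pattern. To stay runnable without a database server, it swaps in the standard-library sqlite3 module (with ? placeholders) for asyncpg (which uses $1 placeholders), and uses hand-written checks where the article recommends Pydantic. The users table and the find_users_by_name handler are hypothetical.

```python
import sqlite3


def find_users_by_name(conn, name, limit=50):
    """Hypothetical tool handler: validate arguments, then bind them as parameters."""
    if not isinstance(name, str) or not name or len(name) > 100:
        raise ValueError("name must be a non-empty string of at most 100 characters")
    if not isinstance(limit, int) or not 1 <= limit <= 1000:
        raise ValueError("limit must be an integer between 1 and 1000")
    # The values are passed separately from the SQL text, so the driver
    # treats them as data and never parses them as SQL.
    rows = conn.execute(
        "SELECT id, name FROM users WHERE name = ? ORDER BY id LIMIT ?",
        (name, limit),
    ).fetchall()
    return [{"id": row[0], "name": row[1]} for row in rows]


# In-memory database standing in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("Robert",), ("Alice",)])

# The hostile input from the anecdote is matched as a literal name and finds nothing.
injected = find_users_by_name(conn, "Robert; DROP TABLE users;")
legit = find_users_by_name(conn, "Robert")
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

With asyncpg the same idea would be a query string containing $1 and $2 with the values passed as separate arguments; the point is that the SQL text never changes based on what the LLM sends.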
Use $1 placeholders with asyncpg or parameterized queries with any DB driver. Never use f-strings.
When NOT to Use MCP: The Hidden Costs
MCP is not a silver bullet. It adds latency: each tool call goes through JSON-RPC serialization, transport, and deserialization. For a simple lookup, this adds 50-100ms compared to a direct API call. If you're building a latency-sensitive pipeline (e.g., real-time fraud detection under 100ms), MCP is too slow. We benchmarked it: a direct Postgres query took 15ms; the same query through MCP took 85ms. Also, MCP has no built-in rate limiting or circuit breaking. If the LLM calls a tool 1000 times in a minute (which happened to us when a prompt looped), your downstream system gets hammered. Finally, MCP is overkill for simple key-value lookups — a REST API with a simple prompt is faster and cheaper.
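Since the protocol leaves rate limiting to you, one option is a per-tool call budget enforced inside the server. The sliding-window limiter below is a minimal sketch, not part of any MCP SDK; the CallBudget class and its parameters are hypothetical, and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from collections import deque


class CallBudget:
    """Allow at most max_calls within any window_seconds span (hypothetical helper)."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock
        self._calls = deque()  # timestamps of recently allowed calls

    def allow(self):
        now = self._clock()
        # Forget calls that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now)
        return True


# Simulated clock: a looping prompt fires five calls in the same instant.
fake_now = [0.0]
budget = CallBudget(max_calls=3, window_seconds=60, clock=lambda: fake_now[0])
first_burst = [budget.allow() for _ in range(5)]

# Once the window has passed, calls are allowed again.
fake_now[0] = 61.0
after_window = budget.allow()
```

When allow() returns False, the tool handler can return an error result instead of touching the downstream system, which also gives the model a signal to stop looping.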
Production Patterns & Scale: Handling 10k Requests/Minute
When your MCP server handles thousands of requests per minute, three things break: connection management, error handling, and monitoring. First, use a connection pool for your database and reuse it across tool calls. Second, implement a circuit breaker for downstream APIs: if a tool calls an external API that's down, don't keep hammering it. Third, log every tool call with a unique request ID and trace it through your system. We use OpenTelemetry spans for each tools/call and resources/read. This lets us pinpoint which tool is slow. We also set up alerts on mcp_tool_call_duration_seconds: anything over 5 seconds triggers a PagerDuty alert.
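The circuit breaker mentioned above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a production library: the CircuitBreaker and CircuitOpenError names, the thresholds, and the flaky_api function are all hypothetical.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the downstream call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping downstream call")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


fake_now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_after=30.0, clock=lambda: fake_now[0])
attempts = []


def flaky_api():
    attempts.append(1)
    raise ConnectionError("downstream is down")


outcomes = []
for _ in range(4):
    try:
        breaker.call(flaky_api)
    except CircuitOpenError:
        outcomes.append("skipped")
    except ConnectionError:
        outcomes.append("failed")

# After the cool-down, a successful trial call closes the circuit again.
fake_now[0] = 31.0
recovered = breaker.call(lambda: "ok")
```

In this run the downstream API is hit only twice; the remaining calls fail fast inside the server without generating traffic.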
Add OpenTelemetry spans to tools/call and resources/read. Add a unique request ID to every log line. This is how you'll debug production issues.
Common Mistakes with Specific Examples
We've seen teams make the same mistakes repeatedly. First: not paginating resources. One team returned 10,000 records in a single resource — the LLM couldn't process it and started hallucinating. Fix: always paginate with cursor and limit. Second: not validating tool arguments. A team accepted a user_id as a string without checking it was an integer. The LLM sent 'abc', the server crashed, and the client retried 3 times before giving up. Fix: use Pydantic models. Third: not handling tool errors gracefully. A tool that raised an unhandled exception caused the entire server to crash. Fix: wrap tool handlers in try/except and return a ToolResult with an error message. Fourth: assuming MCP handles authentication. It doesn't. If your server exposes sensitive data, add authentication at the transport layer (e.g., API key in the SSE handshake).
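The second and third fixes can be combined in one wrapper. The sketch below uses plain dictionaries shaped like a tool result and standard-library validation as a stand-in for Pydantic; the safe_tool decorator, the result helpers, and the get_user tool are hypothetical names.

```python
import functools


def ok_result(text):
    return {"content": [{"type": "text", "text": text}], "isError": False}


def error_result(message):
    return {"content": [{"type": "text", "text": message}], "isError": True}


def safe_tool(handler):
    """Turn any exception from a tool handler into an error result."""

    @functools.wraps(handler)
    def wrapper(arguments):
        try:
            return ok_result(handler(arguments))
        except (KeyError, TypeError, ValueError) as exc:
            # Bad input: tell the model what was wrong so it can correct itself.
            return error_result(f"invalid arguments: {exc}")
        except Exception as exc:
            # Unexpected failure: report the error type without leaking internals.
            return error_result(f"tool failed: {type(exc).__name__}")

    return wrapper


@safe_tool
def get_user(arguments):
    user_id = int(arguments["user_id"])  # 'abc' raises ValueError here
    if user_id <= 0:
        raise ValueError("user_id must be positive")
    return f"user {user_id}"


@safe_tool
def broken_tool(arguments):
    raise RuntimeError("database exploded")


bad = get_user({"user_id": "abc"})
missing = get_user({})
good = get_user({"user_id": "7"})
crashed = broken_tool({})
```

Because the wrapper always returns a result, a bad argument from the LLM produces an error message the client can show the model instead of a crashed server and a retry loop.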
None['balance']. The fix: validate the account ID is positive before querying.
MCP vs Alternatives: REST, gRPC, and Function Calling
MCP is not the only game in town. OpenAI's function calling lets you define tools in the API request — it's simpler but limited to OpenAI models. gRPC is faster (binary protocol) but harder to debug. REST is universal but requires custom integration code. MCP's advantage is standardization: any MCP-compatible client can talk to any MCP server. But that comes at a cost: MCP is slower than gRPC (text-based JSON-RPC vs binary protobuf) and more complex than function calling. Our rule of thumb: use MCP if you have multiple clients (Claude, ChatGPT, Cursor) that need to access the same tools. Use function calling if you only use one model provider. Use gRPC if latency is critical.
Debugging and Monitoring MCP in Production
When your MCP server misbehaves, you need fast diagnostics. First, enable debug logging on the server: logging.basicConfig(level=logging.DEBUG). This logs every JSON-RPC message. Second, use the mcp-cli tool to test tools and resources without an LLM client. Third, monitor key metrics: mcp_tool_call_count, mcp_tool_call_duration_seconds, mcp_resource_size_bytes, and mcp_error_count. Alert on any tool call taking longer than 5 seconds or any resource returning more than 1MB. Fourth, add health check endpoints: GET /health that returns 200 if the server can connect to its dependencies. Finally, use distributed tracing to correlate LLM requests with MCP tool calls.
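As a rough illustration of the metrics and request IDs described above, the sketch below wraps a tool call in a context manager that records a duration, counts errors, and tags log lines with a request ID. It uses only the standard library as a stand-in for OpenTelemetry and a real metrics backend; the traced_tool_call helper and the in-memory metrics dictionary are hypothetical.

```python
import logging
import time
import uuid
from contextlib import contextmanager

logger = logging.getLogger("mcp.tools")
SLOW_CALL_SECONDS = 5.0  # alert threshold taken from the article

# In-memory stand-in for a metrics backend, keyed by the article's metric names.
metrics = {
    "mcp_tool_call_count": 0,
    "mcp_error_count": 0,
    "mcp_tool_call_duration_seconds": [],
}


@contextmanager
def traced_tool_call(tool_name, clock=time.perf_counter):
    """Time one tool call and tag every log line with a unique request ID."""
    request_id = uuid.uuid4().hex
    start = clock()
    metrics["mcp_tool_call_count"] += 1
    try:
        yield request_id
    except Exception:
        metrics["mcp_error_count"] += 1
        logger.exception("request_id=%s tool=%s failed", request_id, tool_name)
        raise
    finally:
        duration = clock() - start
        metrics["mcp_tool_call_duration_seconds"].append(duration)
        level = logging.WARNING if duration > SLOW_CALL_SECONDS else logging.INFO
        logger.log(level, "request_id=%s tool=%s duration=%.3fs",
                   request_id, tool_name, duration)


# One successful call and one failing call.
with traced_tool_call("get_patient") as request_id:
    pass

try:
    with traced_tool_call("get_patient"):
        raise RuntimeError("boom")
except RuntimeError:
    pass
```

The duration is recorded in the finally block, so slow failures show up in the latency metric as well as in the error count.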
mcp_tool_call_duration_seconds spiking to 30 seconds for the get_patient tool. The root cause: the database connection pool was exhausted because the server didn't release connections properly. The fix: add a with pool.acquire() context manager that releases the connection even on exceptions.
The 50MB Resource That Killed Our Fraud Pipeline
ResourceNotFound errors in the logs: the LLM was timing out waiting for the resource and falling back to a default.
The list_accounts resource returned all 50,000 accounts in a single ResourceContents list. The MCP Python SDK (v0.1.0) serialized the entire list into one JSON-RPC response message. The LLM client had a 10MB message size limit, so it truncated the response mid-stream, causing the model to see partial data and hallucinate account statuses.
1) list_accounts now accepts cursor and limit parameters.
2) Set a max page size of 1000 records.
3) Added a Content-Length header check in the client to reject messages over 5MB.
4) Deployed with a feature flag to roll back if accuracy didn't recover. Accuracy returned to 94% within 30 minutes.
- Always paginate MCP resources; the protocol doesn't do it for you. Use cursor-based pagination with a configurable page size.
- Set a hard message size limit on the client side. The MCP spec doesn't enforce one; your LLM provider probably does.
- Monitor resource response sizes in production. Add a metric for mcp_resource_bytes_returned and alert when it exceeds 1MB.
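Cursor-based pagination with a capped page size, as recommended in the lessons above, might look like the sketch below. It runs against an in-memory list rather than a real database or MCP SDK. The cursor encoding, the nextCursor field name, and the fetch_all helper are assumptions made for illustration; only the list_accounts name, the cursor and limit parameters, and the 1000-record cap come from the incident write-up.

```python
import base64
import json

MAX_PAGE_SIZE = 1000  # hard cap from the incident fix

# Stand-in data set; the real resource had 50,000 accounts.
ACCOUNTS = [{"id": i, "status": "active"} for i in range(1, 2501)]


def encode_cursor(last_id):
    """Opaque cursor: clients pass it back without interpreting it."""
    raw = json.dumps({"after": last_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor):
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["after"]


def list_accounts(cursor=None, limit=100):
    """Return one page of accounts plus a cursor for the next page, if any."""
    limit = max(1, min(limit, MAX_PAGE_SIZE))  # clamp whatever the caller asks for
    after = decode_cursor(cursor) if cursor else 0
    page = [a for a in ACCOUNTS if a["id"] > after][:limit]
    has_more = bool(page) and page[-1]["id"] < ACCOUNTS[-1]["id"]
    next_cursor = encode_cursor(page[-1]["id"]) if has_more else None
    return {"accounts": page, "nextCursor": next_cursor}


def fetch_all(limit):
    """Client-side loop: keep following the cursor until it runs out."""
    cursor, seen, pages = None, [], 0
    while True:
        result = list_accounts(cursor, limit)
        seen.extend(a["id"] for a in result["accounts"])
        pages += 1
        cursor = result["nextCursor"]
        if cursor is None:
            return seen, pages


# Asking for 5000 per page is clamped to 1000, so 2500 records take 3 pages.
all_ids, page_count = fetch_all(limit=5000)
```

Keying the cursor on the last seen id instead of an offset keeps pages stable when rows are inserted between requests, and the clamp means no single response can grow past the cap regardless of what the client requests.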
Run mcp-cli inspect server and verify the outputSchema field in the tool definition.
0.0.0.0. Run ss -tlnp | grep <port> on the server host. Also check firewall rules, since SSE uses a persistent HTTP connection.
Run mcp-cli list-tools against the server. If the tool isn't listed, the server didn't register it. Common cause: the tool function raised an exception during server startup. Check server logs for ToolRegistrationError.
Add a Cache-Control: no-cache header to the resource response, or set a TTL in the client config. Check the mcp_resource_cache_hit metric.
mcp-cli inspect server --tools | jq '.tools[].outputSchema'
mcp-cli call-tool --name my_tool --args '{"input": "test"}' | jq '.content[0].type'
type field. Example: return ToolResult(content=[TextContent(type='text', text='done')])
Key takeaways
Common mistakes to avoid
4 patterns
Unversioned MCP schemas
Over-fetching context in MCP messages
context_id field) to send only deltas after the initial handshake. Cache tool definitions client-side for the session.
Synchronous MCP calls in high-throughput pipelines
Ignoring MCP error codes in production
TOOL_NOT_FOUND errors, burning tokens and latency.
TOOL_NOT_FOUND = no retry (log and alert); RATE_LIMITED = exponential backoff; INTERNAL_ERROR = retry with jitter.
Interview Questions on This Topic
Explain how Model Context Protocol works under the hood. How does it differ from a simple REST API for tool integration?
discover handshake where the LLM client fetches a typed schema of available tools (parameters, return types, descriptions). Then, execute calls use that schema to validate inputs and outputs. Unlike REST, MCP includes context caching (via context_id) to avoid resending tool definitions on every call, reducing token overhead. It also standardizes error codes and versioning, which REST APIs typically handle ad-hoc.
Frequently Asked Questions