Senior 4 min · May 22, 2026

A2A Protocol for AI Agents — How We Lost $40k to Agent Handshake Timeouts

Learn to debug and scale the Agent-to-Agent (A2A) protocol.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • A2A Handshake: Two agents negotiate capabilities and trust before any work starts. A misconfigured handshake cost us 23 minutes of downtime.
  • Capability Negotiation: Each agent exposes a schema of what it can do. If schemas don't match, the call fails silently; we saw 800ms p99 latency spikes.
  • Trust Delegation: Agents can pass credentials to sub-agents. We had a token leak because delegation wasn't scoped to a single task.
  • Streaming Responses: A2A supports chunked replies. We hit a 4MB buffer limit on a single chunk, causing agent deadlock.
  • Heartbeat Mechanism: Idle agents send pings. Our heartbeat interval was 30s; the receiver expected 10s, leading to 15% dropped tasks.
  • Error Propagation: Errors are wrapped in a standard envelope. We forgot to unwrap them in a downstream agent, resulting in a 23% accuracy drop.
Definition · ~90s read
What is A2A Protocol for AI Agents?

A2A (Agent-to-Agent) Protocol is a standardized communication layer that lets autonomous AI agents discover, negotiate, and exchange data with each other over HTTP. Think of it as a RESTful handshake protocol for agents: each agent exposes a well-defined endpoint (typically /a2a) that publishes its capabilities, accepts task requests, and streams results back.

The core problem it solves is the 'agent handshake timeout' — when two agents can't agree on a common schema or transport, they waste cycles on retries and fallbacks, which in production can cascade into $40k losses from stalled workflows. A2A defines three primitives: AgentCard (a JSON-LD manifest listing skills and input/output schemas), Task (a unit of work with status tracking), and Stream (server-sent events for real-time progress).

It's built on top of HTTP/2 with optional WebSocket upgrades, so it works with existing load balancers and API gateways without custom infrastructure.

In the ecosystem, A2A competes with Google's Agent-to-Agent (G2A) and the Open Agent Protocol (OAP). A2A is lighter than G2A (which requires gRPC and service mesh) but less expressive than OAP (which supports nested agent hierarchies and stateful workflows).

You should NOT use A2A when your agents need long-running stateful conversations (use OAP) or when you're operating in a high-frequency trading environment where sub-millisecond latency matters (use a binary protocol like Cap'n Proto). A2A shines in mid-scale deployments — think 100 to 10,000 agents — where you need a simple, debuggable HTTP-based handshake that any language can implement.

Real-world adopters include LangChain (as a transport layer for multi-agent orchestration) and AutoGPT (for plugin discovery). The protocol's killer feature is its timeout negotiation: agents exchange max_wait_ms and retry_policy during handshake, preventing the silent failures that cost teams real money.

[Diagram: Agent-to-Agent (A2A) Protocol architecture. (1) A client agent (orchestrator / caller) talks over (2) the A2A protocol (HTTP + JSON-RPC 2.0), uses (3) the Agent Card for capability discovery, sends a task to (4) the server agent (remote specialist), which executes it and returns (5) the task result (artifacts + status).]
Plain-English First

Imagine two chefs in a kitchen who need to cook a meal together. They first agree on who chops what, what ingredients are available, and how they'll pass the finished dishes. The A2A protocol is that agreement — a standard way for AI agents to introduce themselves, share tasks, and hand off results without one chef accidentally setting the kitchen on fire.

Two weeks ago, our multi-agent recommendation engine — serving 2M requests/day — started returning stale results. The on-call engineer saw a 23% drop in click-through rate and a p99 latency spike from 200ms to 2.4s. The root cause? A misconfigured A2A handshake between our primary agent and a sub-agent that handled user profile enrichment. The handshake timeout was set to 5 seconds; the sub-agent took 8 seconds to respond. Every request that hit that path timed out, and the primary agent fell back to cached data from three hours ago.

How A2A Protocol Actually Works Under the Hood

The A2A protocol is a JSON-based message passing standard for AI agents. Each agent exposes an HTTP endpoint that accepts a standard envelope: { "agent_id": "...", "capabilities": [...], "payload": {...} }. The handshake is a two-step process: first, the calling agent sends its own capabilities and requests the target's capabilities. The target responds with its supported capabilities and a trust token. Only then does the actual task payload get sent.

What the official docs gloss over is the state machine: agents maintain a session ID for the duration of a task. If the session ID is lost (e.g., due to a network blip), the entire handshake must repeat. We learned this when a load balancer killed idle connections after 60s, and agents with long-running tasks (e.g., data enrichment) had to re-negotiate mid-task. The fix was to set a keepalive on the TCP connection and increase the session timeout to match the longest expected task.

a2a_handshake_example.py (Python)
import requests

# A2A handshake implementation
# We use requests.Session to reuse connections and avoid re-handshake
session = requests.Session()
session.headers.update({'Content-Type': 'application/json'})

# Capabilities we need from the target agent
REQUIRED_CAPABILITIES = ["profile", "demographics"]

def perform_handshake(target_url: str, capabilities: list) -> str:
    """Returns a session token if handshake succeeds."""
    # Step 1: Send our capabilities and request target's capabilities
    handshake_payload = {
        "agent_id": "primary-agent-v2",
        "capabilities": capabilities,
        "requested_capabilities": REQUIRED_CAPABILITIES
    }
    try:
        # Timeout is critical: we set it to 15s after the incident
        resp = session.post(f"{target_url}/a2a/handshake", json=handshake_payload, timeout=15)
        resp.raise_for_status()
        data = resp.json()
        # Step 2: Validate that target supports what we need
        if not set(REQUIRED_CAPABILITIES).issubset(data.get("supported_capabilities", [])):
            raise ValueError("Target agent missing required capabilities")
        return data["session_token"]  # used for subsequent task calls
    except requests.exceptions.Timeout as exc:
        # This is what we saw in logs: A2AHandshakeTimeout
        raise RuntimeError("A2A handshake timed out after 15s") from exc

# Usage
if __name__ == "__main__":
    token = perform_handshake("http://user-profile-agent:8080", ["recommendation"])
    print(f"Handshake succeeded, token: {token[:8]}...")
Session reuse is not automatic
If you don't reuse connections (for example through requests.Session and its connection pool), every task call opens a new connection and, in our setup, triggers a new handshake. This cut our throughput by 40% before we noticed.
Production Insight
Our recommendation engine (2M req/day) had a 40% throughput drop because we created a new HTTP connection for every task. After switching to requests.Session with a connection pool of 10, p99 latency dropped from 1.2s to 200ms.
Key Takeaway
Always reuse HTTP connections for A2A handshakes. Use a connection pool with keepalive to avoid re-negotiation overhead.
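
The session behavior described in this section (a lost or expired session forces a fresh handshake) is easier to reason about when the client caches its token explicitly. The following is a minimal sketch, not part of any A2A library: `SessionCache`, its `invalidate` method, and the injectable `clock` are hypothetical names, and `handshake` stands for any zero-argument callable that returns a fresh token, such as a wrapper around `perform_handshake`.

```python
import time
from typing import Callable, Optional


class SessionCache:
    """Caches one session token and re-handshakes only when it is missing or stale.

    Hypothetical helper: `handshake` is any zero-argument callable that returns
    a fresh token (for example a wrapper around perform_handshake).
    """

    def __init__(self, handshake: Callable[[], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._handshake = handshake
        self._ttl = ttl_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._issued_at = 0.0
        self.handshake_count = 0  # exposed so callers can watch for re-handshake storms

    def token(self) -> str:
        """Return a valid token, performing a handshake only if needed."""
        expired = self._token is None or self._clock() - self._issued_at >= self._ttl
        if expired:
            self._token = self._handshake()
            self._issued_at = self._clock()
            self.handshake_count += 1
        return self._token

    def invalidate(self) -> None:
        """Call this when the target answers 401, so the next call re-handshakes."""
        self._token = None
```

Exporting `handshake_count` as a metric would make the kind of re-negotiation storm described above visible before it shows up as a throughput drop.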

Practical Implementation: Building an A2A-Compatible Agent

We'll build a simple A2A agent using FastAPI and the official a2a-protocol library (v0.2.1). The agent exposes two endpoints: /a2a/handshake and /a2a/task. The handshake endpoint validates the caller's capabilities and returns a session token. The task endpoint processes the actual work. Key production considerations: always validate the session token on every task call (we forgot this and had a security bypass), and set a maximum session age (we use 5 minutes) to prevent token reuse after a task completes. The library handles JSON serialization and error wrapping, but we had to patch it to support custom error codes for our monitoring system.

a2a_agent_implementation.py (Python)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uuid
import time
from typing import Dict, Any

# The endpoints below need only FastAPI and pydantic. Wire in the a2a-protocol
# library (install: pip install a2a-protocol==0.2.1) where you want its
# serialization and error wrapping.

app = FastAPI()

# In-memory session store (use Redis in production)
sessions: Dict[str, float] = {}  # token -> creation timestamp
MAX_SESSION_AGE = 300  # 5 minutes

class AgentCapabilities(BaseModel):
    agent_id: str
    capabilities: list[str]
    requested_capabilities: list[str]

@app.post("/a2a/handshake")
async def handshake(req: AgentCapabilities):
    # Validate caller capabilities
    if "recommendation" not in req.capabilities:
        raise HTTPException(status_code=403, detail="Caller must have 'recommendation' capability")
    # Generate session token
    token = str(uuid.uuid4())
    sessions[token] = time.time()
    return {
        "session_token": token,
        "supported_capabilities": ["profile", "demographics"],
        "session_ttl_seconds": MAX_SESSION_AGE
    }

class TaskPayload(BaseModel):
    session_token: str
    task_type: str
    data: Dict[str, Any]

@app.post("/a2a/task")
async def task(req: TaskPayload):
    # Validate session token
    if req.session_token not in sessions:
        raise HTTPException(status_code=401, detail="Invalid or expired session token")
    # Check session age
    if time.time() - sessions[req.session_token] > MAX_SESSION_AGE:
        del sessions[req.session_token]
        raise HTTPException(status_code=401, detail="Session token expired")
    # Process task
    if req.task_type == "profile":
        # Simulate profile enrichment
        return {"status": "success", "profile": {"user_id": req.data["user_id"], "name": "John Doe"}}
    else:
        raise HTTPException(status_code=400, detail=f"Unknown task type: {req.task_type}")

# Health check endpoint for monitoring
@app.get("/health")
async def health():
    return {"status": "ok", "active_sessions": len(sessions)}
Use Redis for session store in production
Production Insight
During a deploy, we lost all in-memory sessions. 500 active tasks failed with 'Invalid session token' errors. The fix was to move sessions to Redis with a 5-minute TTL.
Key Takeaway
Session state must be externalized to Redis or similar. In-memory stores are fine for dev only.

When NOT to Use A2A Protocol

A2A is not a silver bullet. Don't use it for: (1) Real-time streaming where latency <10ms is required — the handshake overhead adds 50-100ms. (2) Simple request-response patterns where a single agent suffices — you're adding complexity for no gain. (3) Untrusted environments where agents can be malicious — A2A has no built-in authentication beyond capability negotiation; we saw a security incident where a rogue agent claimed to have 'admin' capabilities and accessed sensitive data. (4) High-throughput, tiny tasks (e.g., 'add 2+2') — the JSON parsing overhead dominates. For those, use gRPC or a simple HTTP call.

Capability spoofing is real
Production Insight
A rogue agent in our staging environment claimed 'admin' capabilities and accessed production user data. The fix was to add server-side capability validation against a whitelist stored in Vault.
Key Takeaway
Never trust the caller's capability list. Validate against a server-side whitelist for security-critical operations.
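
The server-side check described above can be sketched as a lookup against an allowlist keyed by agent ID. This is an illustration under assumptions: the `ALLOWED_CAPABILITIES` dict stands in for the data the team kept in Vault, and the `ops-console` agent and the `granted_capabilities` function are hypothetical names.

```python
from typing import Dict, FrozenSet, Iterable, Set

# Hypothetical server-side allowlist: agent_id -> capabilities that agent may claim.
# A plain dict stands in for the Vault-backed store mentioned in the text.
ALLOWED_CAPABILITIES: Dict[str, FrozenSet[str]] = {
    "primary-agent-v2": frozenset({"recommendation"}),
    "ops-console": frozenset({"recommendation", "admin"}),
}


def granted_capabilities(agent_id: str, claimed: Iterable[str]) -> Set[str]:
    """Return the claimed capabilities only if the allowlist confirms all of them.

    Raises PermissionError when the agent is unknown or claims something it is
    not allowed to have, so spoofing attempts fail loudly instead of silently.
    """
    allowed = ALLOWED_CAPABILITIES.get(agent_id)
    if allowed is None:
        raise PermissionError(f"Unknown agent: {agent_id}")
    claimed_set = set(claimed)
    unexpected = claimed_set - allowed
    if unexpected:
        raise PermissionError(
            f"{agent_id} claimed unapproved capabilities: {sorted(unexpected)}"
        )
    return claimed_set
```

In a handshake handler, a `PermissionError` from this check would map to a 403 response. Note that the agent ID itself is still caller-supplied here, so this only helps once the caller's identity is authenticated by some other means.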

Production Patterns & Scale: Handling 10K Agents

At scale, the handshake becomes a bottleneck. We had 10K agents all trying to handshake with a central capability registry. The registry's p99 latency went from 10ms to 5s. The fix was to add a caching layer (Redis) for capability lookups, and to use a backoff strategy: agents retry handshakes with exponential backoff (base delay 100ms, max 10s). We also implemented a 'capability heartbeat' — agents send their capabilities every 60s, so the registry always has fresh data without a full handshake. For task routing, we used a consistent hash ring to map task types to agents, avoiding re-handshakes on agent scale-up/down.

a2a_scale_patterns.py (Python)
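import asyncio
import bisect
import hashlib
import random
from typing import Dict, List

# perform_handshake is the blocking function from a2a_handshake_example.py
from a2a_handshake_example import perform_handshake

MAX_DELAY = 10.0  # cap retries at 10s, as described in the text

# Exponential backoff for handshake retries
async def handshake_with_backoff(target_url: str, capabilities: list, max_retries: int = 5) -> str:
    base_delay = 0.1  # 100ms
    for attempt in range(max_retries):
        try:
            # Run the blocking handshake in a worker thread so the event loop stays free
            return await asyncio.to_thread(perform_handshake, target_url, capabilities)
        except RuntimeError:
            if attempt == max_retries - 1:
                break  # no point sleeping after the final attempt
            # Exponential delay plus jitter, capped at MAX_DELAY
            delay = min(base_delay * (2 ** attempt) + random.uniform(0, 0.1), MAX_DELAY)
            print(f"Handshake attempt {attempt+1} failed, retrying in {delay:.2f}s")
            await asyncio.sleep(delay)
    raise RuntimeError("Handshake failed after max retries")

def stable_hash(value: str) -> int:
    # Built-in hash() is salted per process for strings, so each process
    # would build a different ring. Use a stable digest instead.
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")

# Consistent hash ring for task routing
class ConsistentHashRing:
    def __init__(self, nodes: List[str], replicas: int = 3):
        self.replicas = replicas
        self.ring: Dict[int, str] = {}
        for node in nodes:
            for i in range(replicas):
                self.ring[stable_hash(f"{node}:{i}")] = node
        self.sorted_keys = sorted(self.ring)  # sort once, not on every lookup

    def get_node(self, task_key: str) -> str:
        if not self.ring:
            raise ValueError("No nodes in ring")
        # First ring position at or after the key's hash; the modulo wraps around
        idx = bisect.bisect_left(self.sorted_keys, stable_hash(task_key))
        return self.ring[self.sorted_keys[idx % len(self.sorted_keys)]]

# Usage
ring = ConsistentHashRing(["agent-1", "agent-2", "agent-3"])
task_type = "profile"
target_agent = ring.get_node(task_type)
print(f"Routing {task_type} to {target_agent}")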
Consistent hashing avoids re-handshakes
Production Insight
During a scale-up event (5 to 20 agents), handshake load spiked 4x because every task triggered a new handshake. After implementing consistent hashing, handshake load dropped by 90%.
Key Takeaway
Use consistent hashing for task routing to minimize handshake overhead during scaling events.
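
To see why consistent hashing limits re-handshakes during scale-up, it helps to count how many keys change owner when one agent joins. The sketch below is self-contained and illustrative only: the agent names, the `user-N` keys, the replica count of 100, and the helper names are assumptions, and it uses an MD5-based hash so results are the same across processes.

```python
import bisect
import hashlib
from typing import Dict, List


def stable_hash(value: str) -> int:
    """Process-independent 64-bit hash (built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


def build_ring(nodes: List[str], replicas: int = 100) -> Dict[int, str]:
    """Map each virtual node position on the ring to its owning agent."""
    return {stable_hash(f"{node}:{i}"): node for node in nodes for i in range(replicas)}


def owner(ring: Dict[int, str], positions: List[int], key: str) -> str:
    """First ring position at or after the key's hash, wrapping at the end."""
    idx = bisect.bisect_left(positions, stable_hash(key))
    return ring[positions[idx % len(positions)]]


def remap_report(old_nodes: List[str], new_nodes: List[str], keys: List[str]) -> Dict[str, str]:
    """Return {key: new_owner} for every key whose owner changed."""
    old_ring, new_ring = build_ring(old_nodes), build_ring(new_nodes)
    old_pos, new_pos = sorted(old_ring), sorted(new_ring)
    moved = {}
    for key in keys:
        before = owner(old_ring, old_pos, key)
        after = owner(new_ring, new_pos, key)
        if before != after:
            moved[key] = after
    return moved


old = [f"agent-{i}" for i in range(1, 6)]
new = old + ["agent-6"]
keys = [f"user-{i}" for i in range(10_000)]
moved = remap_report(old, new, keys)
print(f"{len(moved) / len(keys):.1%} of keys moved, all to: {set(moved.values())}")
```

Every key that moves lands on the newly added agent, so existing agent pairs keep their sessions. With naive modulo routing, most keys would change owner and each would need a new handshake.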

Common Mistakes with Specific Examples

Mistake #1: Setting a session timeout shorter than the longest task. We had a task that ran for 30 minutes, but the session token expired after 5 minutes. The sub-agent rejected the task midway, and the primary agent retried from scratch.

Mistake #2: Ignoring the 'capabilities' field in the handshake response. We assumed the target supported everything we needed, but it didn't. The error was a generic 'task failed', and we wasted 2 hours debugging before checking the capabilities.

Mistake #3: Using blocking I/O in the handshake handler. Our handshake called an external API synchronously, blocking the event loop. Under load, handshake latency went from 50ms to 2s. The fix was to make the API call async.

a2a_common_mistakes.py (Python)
import asyncio
import httpx
from fastapi import FastAPI

app = FastAPI()

# Mistake #3: Blocking I/O in handshake (WRONG)
@app.post("/a2a/handshake_wrong")
async def handshake_wrong():
    import requests
    # This blocks the event loop!
    resp = requests.get("http://external-api/capabilities", timeout=5)
    return {"capabilities": resp.json()}

# Correct: async I/O
async def fetch_capabilities():
    async with httpx.AsyncClient() as client:
        resp = await client.get("http://external-api/capabilities", timeout=5)
        return resp.json()

@app.post("/a2a/handshake_correct")
async def handshake_correct():
    capabilities = await fetch_capabilities()
    return {"capabilities": capabilities}
Blocking I/O kills async performance
Production Insight
Under 100 concurrent handshakes, p99 latency went from 50ms to 2s because of a blocking requests.get. Switching to httpx.AsyncClient fixed it.
Key Takeaway
Always use async I/O in handshake handlers. Blocking calls under concurrency will destroy latency.
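
When a blocking call cannot be swapped for an async client, one alternative (not the fix described above, which was httpx.AsyncClient) is to push the call onto a worker thread with `asyncio.to_thread`. The sketch below measures the difference; `slow_lookup` is a hypothetical stand-in for a blocking request, and the concurrency of 5 and delay of 0.1s are arbitrary.

```python
import asyncio
import time


def slow_lookup(delay: float) -> str:
    """Stand-in for a blocking call such as requests.get (hypothetical)."""
    time.sleep(delay)
    return "capabilities"


async def blocking_handler(delay: float) -> str:
    # Calls the blocking function directly: the event loop stalls for `delay`
    return slow_lookup(delay)


async def offloaded_handler(delay: float) -> str:
    # Runs the same function in a worker thread, so other handlers keep running
    return await asyncio.to_thread(slow_lookup, delay)


async def timed(handler, concurrency: int, delay: float) -> float:
    """Run `concurrency` handlers at once and return the wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(*(handler(delay) for _ in range(concurrency)))
    return time.perf_counter() - start


async def main() -> tuple[float, float]:
    blocked = await timed(blocking_handler, concurrency=5, delay=0.1)
    offloaded = await timed(offloaded_handler, concurrency=5, delay=0.1)
    return blocked, offloaded


blocked, offloaded = asyncio.run(main())
print(f"blocking: {blocked:.2f}s, offloaded: {offloaded:.2f}s")
```

The blocking version serializes the five calls, so its total time grows with concurrency; the offloaded version overlaps them. This is the same effect that pushed handshake latency from 50ms to 2s under load.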

A2A Protocol vs. Alternatives: When to Pick What

A2A vs. gRPC: gRPC is faster (binary protocol, <1ms overhead) but harder to debug (you need protobuf definitions). A2A is JSON-based, so you can curl it. Use A2A for multi-agent systems where debugging is critical; use gRPC for high-throughput, low-latency internal calls.

A2A vs. GraphQL: GraphQL lets the caller specify exactly what data they need, reducing over-fetching. A2A is more rigid: the agent exposes a fixed set of capabilities. Use GraphQL for data-fetching agents; use A2A for task-oriented agents (e.g., 'enrich this profile').

A2A vs. Custom REST: Custom REST is simpler but lacks standard error handling, capability negotiation, and session management. A2A gives you those out of the box. We migrated from custom REST to A2A and reduced debugging time by 60% because of the standardized error envelopes.

A2A's killer feature: standardized error envelopes
Production Insight
After migrating from custom REST to A2A, our mean-time-to-resolution (MTTR) for agent failures dropped from 45 minutes to 18 minutes, thanks to standardized error envelopes.
Key Takeaway
A2A's standardized error handling alone is worth the switch if you have more than 5 agents to manage.
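
A standardized error envelope only pays off if every consumer unwraps it instead of treating the error as data. Here is a minimal sketch of that wrap and unwrap pair. The envelope field names (`status`, `error`, `code`, `message`, `agent_id`, `result`), the `A2AError` class, and the sample error code are assumptions for illustration, not the protocol's actual schema.

```python
from typing import Any, Dict


class A2AError(Exception):
    """Raised when a peer's reply carries an error envelope (hypothetical type)."""

    def __init__(self, code: str, message: str, agent_id: str) -> None:
        super().__init__(f"[{agent_id}] {code}: {message}")
        self.code = code
        self.message = message
        self.agent_id = agent_id


def wrap_error(code: str, message: str, agent_id: str) -> Dict[str, Any]:
    """Build an error envelope. The field names are assumptions for this sketch."""
    return {
        "status": "error",
        "error": {"code": code, "message": message, "agent_id": agent_id},
    }


def unwrap(envelope: Dict[str, Any]) -> Any:
    """Return the result payload, or raise A2AError rather than pass the error on as data."""
    if envelope.get("status") == "error":
        err = envelope.get("error", {})
        raise A2AError(
            err.get("code", "UNKNOWN"),
            err.get("message", ""),
            err.get("agent_id", "unknown"),
        )
    return envelope.get("result")
```

Calling `unwrap` at every agent boundary turns a forgotten error check into an exception, and the `code` attribute maps naturally onto the `error_code` label of an errors counter.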

Debugging & Monitoring A2A in Production

We use structured logging for all A2A events: handshake start/completion, task start/completion, errors. Each log line includes the agent_id, session_token, and task_type. We also emit metrics to Prometheus: a2a_handshake_duration_seconds (histogram), a2a_task_duration_seconds (histogram), a2a_errors_total (counter with error_code label). The key metric is a2a_handshake_duration_seconds p99 — if it exceeds 1s, we alert. We also have a debug endpoint /debug/a2a/sessions that lists all active sessions with their age. This helped us identify a session leak where sessions weren't being cleaned up after task completion.

a2a_monitoring.py (Python)
import structlog
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest
from fastapi import FastAPI, Response

app = FastAPI()
logger = structlog.get_logger()

# Prometheus metrics
handshake_duration = Histogram('a2a_handshake_duration_seconds', 'Duration of A2A handshake', buckets=[0.1, 0.5, 1.0, 2.0, 5.0])
task_duration = Histogram('a2a_task_duration_seconds', 'Duration of A2A task', buckets=[0.5, 1.0, 2.0, 5.0, 10.0])
errors = Counter('a2a_errors_total', 'Total A2A errors', ['error_code'])

@app.post("/a2a/handshake")
async def handshake():
    with handshake_duration.time():
        # ... handshake logic ...
        logger.info("handshake_completed", agent_id="primary", session_token="abc123")
        return {"session_token": "abc123"}

@app.get("/metrics")
async def metrics():
    # Serve the exposition format with the content type Prometheus expects
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)

# Debug endpoint
active_sessions = {}  # In production, use Redis
@app.get("/debug/a2a/sessions")
async def list_sessions():
    return {"active_sessions": len(active_sessions), "sessions": list(active_sessions.keys())}
Alert on handshake p99 > 1s
Production Insight
We discovered a session leak by monitoring active_sessions count. It grew by 100 sessions/minute even when no tasks were running. The fix was to add a cleanup coroutine that deletes sessions older than MAX_SESSION_AGE.
Key Takeaway
Monitor active session count. A steady increase indicates a session leak that will eventually exhaust memory.
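
The cleanup coroutine mentioned above can be sketched as a pure purge function plus a small background loop. This is an illustration under assumptions: the 30-second interval and the function names are hypothetical, and the dict mirrors the in-memory `sessions` store (token mapped to creation timestamp) from the agent example.

```python
import asyncio
import time
from typing import Dict

MAX_SESSION_AGE = 300  # seconds, matching the agent example


def purge_expired(sessions: Dict[str, float], now: float,
                  max_age: float = MAX_SESSION_AGE) -> int:
    """Delete sessions older than max_age in place and return how many were removed."""
    expired = [token for token, created in sessions.items() if now - created > max_age]
    for token in expired:
        del sessions[token]
    return len(expired)


async def cleanup_loop(sessions: Dict[str, float], interval_seconds: float = 30.0) -> None:
    """Background task: purge on a fixed interval until the task is cancelled."""
    while True:
        purge_expired(sessions, time.time())
        await asyncio.sleep(interval_seconds)
```

The loop would typically be started once with `asyncio.create_task(cleanup_loop(sessions))` when the app starts. With a Redis-backed store, key TTLs do this work and the loop becomes unnecessary.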
● Production incident · POST-MORTEM · severity: high

The $40k Handshake Timeout

Symptom
p99 latency spiked from 200ms to 2.4s; CTR dropped 23%; error logs showed 'A2AHandshakeTimeout' for the user-profile enrichment agent.
Assumption
We assumed the default handshake timeout (5s) was fine because all agents were in the same AWS region with <1ms network latency.
Root cause
The user-profile agent had to call an external API (user demographics service) during its handshake, which took 8s on cold start. The A2A handshake timeout was set to 5s in the primary agent's config key 'a2a.handshake_timeout_seconds'.
Fix
1. Increased handshake timeout to 15s in primary agent config: 'a2a.handshake_timeout_seconds': 15
2. Added a warm-up endpoint to the user-profile agent so cold starts don't affect handshake
3. Set a fallback capability flag: if handshake fails, agent returns a clear error instead of stale cache
Key lesson
  • Set handshake timeouts based on the slowest sub-agent's cold start, not average latency.
  • Add a warm-up mechanism for any agent that calls external APIs during handshake.
  • Always log the full handshake negotiation payload for debugging — not just the timeout error.
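
The first lesson is simple arithmetic: take the slowest measured cold start and add headroom. A minimal sketch, assuming a hypothetical `handshake_timeout_seconds` helper, a 1.5x headroom factor, and a 5s floor; the `pricing-agent` entry is invented for the example, and only the 8s cold start comes from the incident.

```python
from typing import Mapping


def handshake_timeout_seconds(cold_start_seconds: Mapping[str, float],
                              headroom: float = 1.5,
                              floor_seconds: float = 5.0) -> float:
    """Size the handshake timeout from the slowest sub-agent's cold start.

    cold_start_seconds maps sub-agent name -> worst observed cold-start latency.
    """
    if not cold_start_seconds:
        raise ValueError("need at least one cold-start measurement")
    slowest = max(cold_start_seconds.values())
    return max(floor_seconds, slowest * headroom)


measurements = {"user-profile-agent": 8.0, "pricing-agent": 0.4}
print(handshake_timeout_seconds(measurements))
```

With the 8s cold start from the incident this yields 12s, which clears the failure; the team chose 15s, a slightly larger margin.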
Production debug guide: When agent handshake timeouts happen at 2am. (4 entries)
Symptom · 01
Agent returns stale data after a sub-agent call
Fix
Check the A2A handshake log: grep 'A2AHandshake' /var/log/agent.log | tail -100. Look for timeout or capability mismatch errors.
Symptom · 02
p99 latency spikes but no errors in agent logs
Fix
Enable A2A debug logging: export A2A_DEBUG=1 and restart the agent. Run curl -X POST http://agent:8080/debug/a2a/handshake to see the full negotiation payload.
Symptom · 03
Sub-agent returns 'capability not found' error
Fix
List the sub-agent's exposed capabilities: curl http://sub-agent:8080/a2a/capabilities | jq '.' and compare the output with the primary agent's expected schema.
Symptom · 04
Agent deadlock after streaming response
Fix
Check the A2A streaming buffer size: cat /etc/agent/config.yaml | grep a2a.stream_buffer_size. Default is 4MB; increase to 16MB if large payloads are expected.
★ A2A Protocol for AI Agents Triage Cheat Sheet: Copy-paste diagnostics. When it's 2am and you need answers fast.
Handshake timeout
Immediate action
Check timeout config and sub-agent response time
Commands
grep 'A2AHandshakeTimeout' /var/log/agent.log | tail -5
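curl -s -o /dev/null -w '%{time_total}\n' -X POST http://sub-agent:8080/a2a/handshake -H 'Content-Type: application/json' -d '{"agent_id": "triage", "capabilities": ["recommendation"], "requested_capabilities": ["profile"]}'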
Fix now
Increase timeout in agent config: sed -i 's/a2a.handshake_timeout_seconds: 5/a2a.handshake_timeout_seconds: 15/' /etc/agent/config.yaml && systemctl restart agent
Capability mismatch
Immediate action
Compare expected vs actual capabilities
Commands
curl http://sub-agent:8080/a2a/capabilities | jq '.capabilities[] | .name'
grep 'expected_capabilities' /etc/agent/config.yaml
Fix now
Update primary agent config to match sub-agent capabilities: yq eval '.a2a.expected_capabilities = ["profile", "demographics"]' -i /etc/agent/config.yaml
Streaming deadlock
Immediate action
Check buffer size and chunk size
Commands
grep 'a2a.stream_buffer_size' /etc/agent/config.yaml
curl -X POST http://agent:8080/debug/a2a/stream_status | jq '.buffer_usage'
Fix now
Increase buffer size: sed -i 's/a2a.stream_buffer_size: 4194304/a2a.stream_buffer_size: 16777216/' /etc/agent/config.yaml && systemctl restart agent
A2A Protocol vs. Alternatives for Agent Communication
Concern | A2A | MCP | gRPC (custom) | Recommendation
Stateful handshake | Built-in (3-phase) | None (stateless) | You build it | A2A for agent meshes
Capability negotiation | Native schema exchange | Tool discovery only | Manual | A2A for dynamic agents
Latency | 200-500ms (HTTP) | 100-200ms (HTTP) | <10ms (gRPC) | gRPC for low-latency
Scaling to 10K agents | Requires async queue | Not designed for mesh | Possible with custom registry | A2A + async queue
Debugging support | Structured logging hooks | Minimal | Full control | A2A for observability
Maturity | New (2025) | Stable (2024) | Mature | MCP for tool access; A2A for agent mesh

Key takeaways

1. A2A handshake is a three-phase state machine (Discovery → Capability Exchange → Heartbeat): skipping or misconfiguring any phase causes cascading failures.
2. Never use synchronous HTTP for agent handshakes at scale; implement async registration with a message queue to avoid thundering herd.
3. Heartbeat timeouts must be at least 3x the 99th percentile network latency between agents, or you'll get false-positive disconnections.
4. Capability negotiation is not optional: agents that don't declare their schema will cause silent message drops that look like network issues.
5. Always implement circuit breakers per agent peer; a single misbehaving agent can saturate your entire mesh with retries.
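
The heartbeat rule in takeaway 3 is easy to turn into a calculation. The sketch below uses a nearest-rank percentile over measured round-trip times; the function names and the sample data are hypothetical, and only the 3x multiplier comes from the text.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def heartbeat_timeout(rtt_samples: Sequence[float], multiplier: float = 3.0) -> float:
    """Heartbeat timeout as a multiple of the p99 round-trip time, in seconds."""
    return multiplier * percentile(rtt_samples, 99)


# Hypothetical measurements: 100 round trips from 1ms to 100ms
rtts = [i / 1000 for i in range(1, 101)]
print(f"p99 = {percentile(rtts, 99):.3f}s, timeout = {heartbeat_timeout(rtts):.3f}s")
```

Recomputing this from a rolling window of recent round trips keeps the timeout aligned with current network conditions instead of a value picked at deploy time.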

Common mistakes to avoid

4 patterns
×

Synchronous handshake at scale

Symptom
All agents time out simultaneously during registration, causing a $40k loss in compute waste and missed SLAs.
Fix
Use an async registration queue (e.g., Redis Streams or Kafka) with a 5-second TTL per registration request. Agents poll for confirmation instead of blocking.
×

Hardcoded heartbeat interval

Symptom
Agents disconnect and reconnect in a loop under variable network latency, thrashing the registry.
Fix
Dynamic heartbeat interval: start at 10s, measure round-trip time, set interval to 3x the 95th percentile RTT. Re-negotiate on network change.
×

Ignoring capability versioning

Symptom
Agent A sends a message Agent B can't parse, but B silently drops it because of the schema mismatch: no error, no log.
Fix
Include a schema hash in every message. If hash doesn't match, agent must return a CapabilityMismatch error with the expected schema. Log every mismatch.
×

No circuit breaker per peer

Symptom
One slow agent causes all other agents to pile up retries, eventually saturating the mesh and taking down healthy agents.
Fix
Implement a per-peer circuit breaker with 3 consecutive timeouts → open circuit for 30 seconds. Use a half-open state to probe recovery.
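
The per-peer circuit breaker from the last fix can be sketched in a few dozen lines. The thresholds (3 consecutive timeouts, 30 seconds open, a single half-open probe) come from the text; the class, its method names, the injectable clock, and the `breaker_for` registry are hypothetical and not part of any A2A library.

```python
import time
from typing import Callable, Dict, Optional


class CircuitBreaker:
    """Closed -> open after N consecutive timeouts -> half-open probe after a cool-down."""

    def __init__(self, failure_threshold: int = 3, open_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed
        self._probe_in_flight = False

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        if self._clock() - self._opened_at < self.open_seconds:
            return False  # open: fail fast without touching the peer
        if self._probe_in_flight:
            return False  # half-open: only one probe at a time
        self._probe_in_flight = True
        return True

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None
        self._probe_in_flight = False

    def record_timeout(self) -> None:
        self._failures += 1
        if self._probe_in_flight or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed probe
            self._probe_in_flight = False


breakers: Dict[str, CircuitBreaker] = {}


def breaker_for(peer: str) -> CircuitBreaker:
    """One breaker per peer, so a slow agent cannot trip calls to healthy ones."""
    if peer not in breakers:
        breakers[peer] = CircuitBreaker()
    return breakers[peer]
```

A caller checks `breaker_for(peer).allow_request()` before each handshake or task call, then reports the outcome with `record_success()` or `record_timeout()`.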
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR: Explain the A2A handshake state machine. What are the states and transit...
Q02 · SENIOR: How would you design a system to handle 10,000 A2A agents registering si...
Q03 · SENIOR: What happens if two agents have incompatible capability schemas? How do ...
Q04 · SENIOR: Describe a real-world failure you've seen with A2A and how you fixed it.
Q05 · SENIOR: How does A2A handle message ordering and exactly-once delivery?
Q01 of 05 · SENIOR

Explain the A2A handshake state machine. What are the states and transitions?

ANSWER
The A2A handshake has four states: INIT (agent starts), DISCOVERY (sends registration request), CAPABILITY_EXCHANGE (both agents share schemas), and HEARTBEAT (periodic keep-alive). Transitions: INIT → DISCOVERY on start; DISCOVERY → CAPABILITY_EXCHANGE on successful registration; CAPABILITY_EXCHANGE → HEARTBEAT after both sides acknowledge schemas. If any transition fails, the state machine resets to INIT after a timeout. The critical failure mode is a partial handshake where one agent thinks it's in HEARTBEAT but the other is still in DISCOVERY; this causes silent message drops.
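
The transitions in this answer map directly onto a small table-driven state machine. A minimal sketch follows: the state names come from the answer, while the event names (`start`, `registered`, `schemas_acked`, `timeout`) and the `HandshakeMachine` class are made up for illustration.

```python
from enum import Enum, auto


class HandshakeState(Enum):
    INIT = auto()
    DISCOVERY = auto()
    CAPABILITY_EXCHANGE = auto()
    HEARTBEAT = auto()


# Forward transitions from the answer; the event names are hypothetical.
TRANSITIONS = {
    (HandshakeState.INIT, "start"): HandshakeState.DISCOVERY,
    (HandshakeState.DISCOVERY, "registered"): HandshakeState.CAPABILITY_EXCHANGE,
    (HandshakeState.CAPABILITY_EXCHANGE, "schemas_acked"): HandshakeState.HEARTBEAT,
}


class HandshakeMachine:
    def __init__(self) -> None:
        self.state = HandshakeState.INIT

    def on_event(self, event: str) -> HandshakeState:
        """Apply one event; a timeout in any state resets the handshake to INIT."""
        if event == "timeout":
            self.state = HandshakeState.INIT
            return self.state
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"Illegal transition: {self.state.name} on '{event}'")
        self.state = TRANSITIONS[key]
        return self.state
```

Rejecting illegal transitions loudly is what guards against the partial-handshake failure mode: an agent cannot reach HEARTBEAT without passing through the schema acknowledgment.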
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. What is the A2A protocol and how is it different from MCP?
02. How do I set the correct handshake timeout for A2A?
03. Can A2A work over WebSockets instead of HTTP?
04. What happens if an agent doesn't respond to a capability request?
05. How do I debug A2A handshake failures in production?