Senior 6 min · March 29, 2026

AWS Bedrock Agents — Token Bills 15x Higher Than Logged

Q: How much does AWS Bedrock actually cost in production?

It depends entirely on token volume and model choice, but here's the concrete breakdown: Claude 3 Sonnet costs $0.003 per 1,000 input tokens and $0.015 per 1,000 output tokens at time of writing. A typical customer support response that consumes 500 input tokens and 200 output tokens costs roughly $0.0045 — under half a cent. At 100,000 requests per day that's $450/day or ~$13,500/month just in inference costs, before any Agents or Knowledge Base overhead. The cost curve is steep at scale, which is why you should be tracking token usage from day one, not month three.

Q: What's the difference between AWS Bedrock and SageMaker for running AI models?

Bedrock is a managed API for calling pre-trained foundation models — you don't manage any infrastructure. SageMaker is a full ML platform where you can deploy any model (including custom or fine-tuned ones) on dedicated endpoints you control. The rule of thumb: use Bedrock when you want to call a foundation model and ship fast; use SageMaker when you need consistent low-latency inference, models outside Bedrock's catalogue, or full control over the serving environment.

Q: How do I handle Bedrock ThrottlingException in production without losing requests?

Use boto3's 'adaptive' retry mode with max_attempts set to 3-5, which applies exponential backoff with jitter automatically. For user-facing features, wrap the call in a queue with a dead-letter path so throttled requests don't just disappear. Long-term fix: request a Service Quotas increase for your specific model's TPS limit via the AWS console — the default limits are designed for development, not production traffic.

Q: Can Bedrock Agents maintain conversation history across multiple user sessions?

Within a single session yes — Bedrock Agents maintains context for up to one hour using a sessionId you provide. Across separate sessions or after the one-hour timeout, no — the agent has zero memory. For persistent cross-session memory you need to store conversation history in your own database (DynamoDB is the obvious choice), retrieve the relevant history at the start of each new session, and inject it into the agent's initial prompt or as part of your Action Group context. This is a design requirement, not a configuration option.

Bedrock Agents burn ~2,400 tokens per call while apps see only ~400.

Naren · Founder

Plain-English first. Then code. Then the interview question.

About

● Production Incident 🔎 Debug Guide

⚡Quick Answer

AWS Bedrock is a managed inference proxy — you call an API, Amazon runs the model on shared GPU fleet
Supported models: Claude (Anthropic), Titan (Amazon), Llama (Meta), Mistral, Cohere — each with versioned model IDs
Core APIs: InvokeModel (sync), InvokeModelWithResponseStream (streaming), Agents (multi-step orchestration), Knowledge Bases (managed RAG)
Pricing: pay per input/output token on-demand, or reserve Model Units (Provisioned Throughput) for guaranteed TPS
Production trap: default TPS quota is 5 requests/second for most models — file Service Quotas increase 2+ weeks before launch
Cost insight: Agent internal reasoning chains consume 5-15x more tokens than the final response suggests — meter everything from day one

Plain-English First

Imagine you need fresh bread for your restaurant every morning. You could buy a wheat farm, hire agronomists, build a mill, and train bakers — or you could just call a bakery and say 'send me 200 sourdough loaves.' AWS Bedrock is the bakery. The foundation models — Claude, Titan, Llama, Mistral — are already baked, scaled, and maintained by someone else. You just make the call, get the output, and pay per loaf. The moment you think you need to 'own the farm' is the moment you've stopped shipping features and started running an AI infrastructure team.

A fintech startup I consulted for spent four months standing up a self-hosted Llama 2 cluster on EC2. GPU reservations, CUDA driver mismatches, custom inference servers, auto-scaling that never quite worked right. They burned $180k in compute before their first user ever typed a prompt. AWS Bedrock would have had them in production in an afternoon for a few cents per thousand tokens. That's not a sales pitch — it's a pattern I've watched repeat at least six times across different orgs.

Bedrock solves a specific and expensive problem: most product teams don't need to run a model — they need a model's output. The operational surface area between those two things is massive. You're talking GPU fleet management, model versioning, inference server tuning, cold-start latency, and on-call rotations that wake up ML engineers at 2am because the VRAM exploded under load. Bedrock collapses all of that into a single API. You pick a model, send a request, get a response. The fleet management, the scaling, the hardware — Amazon's problem now.

After reading this you'll be able to: wire up Bedrock's InvokeModel API in a real service context, implement streaming responses without blocking your web workers, set up Bedrock Agents for multi-step task orchestration, avoid the three quota and cost traps that silently destroy GenAI budgets, and make an informed decision about when Bedrock is the right call versus when you actually do need to self-host.

The Bedrock Model: What You're Actually Paying For and How It Routes

Before you write a single line of code, understand what Bedrock is under the hood — because the mental model directly affects how you design for cost, latency, and failure.

Bedrock is a managed inference proxy. When you call InvokeModel, you're not getting a dedicated GPU instance. Your request goes into Amazon's shared inference fleet for that model family. Amazon handles queuing, routing, scaling, and the hardware underneath. You pay per input token and per output token. There's no idle cost, no reserved capacity fee by default — unless you opt into Provisioned Throughput, which we'll get to.

This shared-fleet model is why you'll see latency variance that would be unacceptable from your own infrastructure. On a busy Tuesday afternoon, a Claude 3 Sonnet call might take 800ms. On Sunday at 6am it might take 280ms. You don't control that. Plan for p99 latency, not average. I've seen teams build chatbots that felt broken in production because they load-tested at 2am and designed for 400ms response times — then their 9am Monday demo crawled.

The model IDs matter more than you think. They're not stable aliases — they're versioned strings like anthropic.claude-3-sonnet-20240229-v1:0. When Anthropic ships a new version, the old ID stays available but you don't get automatically migrated. That's intentional. But it means you need a config-driven model ID system, not hardcoded strings in your service. Teams that hardcode model IDs end up doing find-and-replace across repos when they want to upgrade — which is exactly as painful as it sounds.

bedrock_inference_client.pyPYTHON

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

# io.thecodeforge — DevOps tutorial

import boto3
import json
import os
from botocore.config import Config
from botocore.exceptions import ClientError, EndpointResolutionError

# Config-driven model ID — never hardcode this in your service layer.
# Pull from environment or parameter store so upgrades don't require redeploys.
MODEL_ID = os.environ.get("BEDROCK_MODEL_ID", "anthropic.claude-3-sonnet-20240229-v1:0")
AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")

# Set explicit timeouts. Bedrock calls on large prompts can run 30-60s.
# Without this, boto3 defaults to 60s connect / no read timeout — silent hangs kill your web workers.
boto_config = Config(
    region_name=AWS_REGION,
    connect_timeout=5,        # fail fast if the endpoint is unreachable
    read_timeout=120,         # long enough for large completions, not infinite
    retries={
        "max_attempts": 3,
        "mode": "adaptive"    # exponential backoff with jitter — don't use 'legacy' mode in prod
    }
)

bedrock_runtime = boto3.client("bedrock-runtime", config=boto_config)


def invoke_document_summariser(raw_document: str, max_tokens: int = 1024) -> dict:
    """
    Production pattern: document summarisation for a content pipeline.
    Returns structured output including token usage so the caller can track cost.
    """

    # Claude models use the Messages API format — not the legacy text-completion format.
    # Mixing them up gives you a cryptic ValidationException, not a helpful error.
    request_body = {
        "anthropic_version": "bedrock-2023-05-31",  # required field for Anthropic models on Bedrock
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": (
                    "Summarise the following document in 3 bullet points. "
                    "Focus on decisions made, not background context.\n\n"
                    f"{raw_document}"
                )
            }
        ],
        "temperature": 0.2,   # low temp for summarisation — you want deterministic, not creative
    }

    try:
        response = bedrock_runtime.invoke_model(
            modelId=MODEL_ID,
            contentType="application/json",
            accept="application/json",
            body=json.dumps(request_body)
        )

        response_body = json.loads(response["body"].read())

        # Always capture usage — this is your cost telemetry.
        # Log it to CloudWatch metrics or your billing system. Don't discard it.
        input_tokens = response_body["usage"]["input_tokens"]
        output_tokens = response_body["usage"]["output_tokens"]

        summary_text = response_body["content"][0]["text"]

        return {
            "summary": summary_text,
            "model_id": MODEL_ID,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            # Rough cost estimate for Claude 3 Sonnet at time of writing:
            # $0.003/1K input tokens, $0.015/1K output tokens
            "estimated_cost_usd": round(
                (input_tokens / 1000 * 0.003) + (output_tokens / 1000 * 0.015), 6
            )
        }

    except ClientError as e:
        error_code = e.response["Error"]["Code"]
        error_message = e.response["Error"]["Message"]

        # ThrottlingException hits when you exceed your account's TPS quota.
        # Default is 5 TPS for Claude 3 Sonnet in most regions — shockingly low for production.
        if error_code == "ThrottlingException":
            raise RuntimeError(
                f"Bedrock quota exceeded for model {MODEL_ID}. "
                "Request a limit increase via Service Quotas before going live."
            ) from e

        # ValidationException usually means malformed request body — check your model's spec.
        if error_code == "ValidationException":
            raise ValueError(f"Invalid request format for {MODEL_ID}: {error_message}") from e

        raise RuntimeError(f"Bedrock API error [{error_code}]: {error_message}") from e


if __name__ == "__main__":
    sample_doc = """
    Engineering Review — Q3 Platform Migration
    Decision: Move API gateway to AWS API Gateway v2 (HTTP APIs).
    Rationale: 60% cost reduction vs REST APIs for our traffic pattern.
    Rejected alternative: Kong on EKS — operational overhead too high for current team size.
    Timeline: Cutover scheduled for October 15th. Rollback plan approved.
    Owner: Platform team. Risk: Medium. Stakeholder sign-off: CTO, VP Engineering.
    """

    result = invoke_document_summariser(sample_doc)
    print(f"Summary:\n{result['summary']}")
    print(f"\nTokens — Input: {result['input_tokens']} | Output: {result['output_tokens']}")
    print(f"Estimated cost: ${result['estimated_cost_usd']}")

Output

Summary:

• Decided to migrate API gateway to AWS API Gateway v2 (HTTP APIs) for a 60% cost reduction.

• Rejected Kong on EKS due to excessive operational overhead for the current team size.

• Cutover set for October 15th with an approved rollback plan; medium risk, sign-off from CTO and VP Engineering.

Tokens — Input: 187 | Output: 73

Estimated cost: $0.001662

Production Trap: Default TPS Quota Will Destroy Your Launch

AWS Bedrock default TPS limits are 5 requests/second for most Claude models in new accounts. That's fine for a demo. At production load with 50 concurrent users, you'll hit ThrottlingException inside two seconds. File a Service Quotas increase request at least two weeks before your launch date — AWS approval isn't instant, and the support ticket queue gets long.

Production Insight

Shared fleet latency variance is 2-5x between p50 and p99.

You cannot control routing or queue position — Amazon decides.

Rule: design for p99 latency from day one, not average. Load test at peak hours (9am-5pm weekdays), not at 2am when the fleet is idle.

Key Takeaway

Bedrock is a shared inference proxy — you get no dedicated hardware and no latency guarantees.

Model IDs are versioned and must be config-driven, not hardcoded.

Design for p99, meter token usage from CloudWatch, and file your TPS increase before you write application code.

Model ID Management Strategy

IfSingle model, single version, no upgrade planned

→

UseEnvironment variable is sufficient — set BEDROCK_MODEL_ID at deploy time.

IfMultiple models or frequent version upgrades

→

UseUse AWS Systems Manager Parameter Store with a config-driven lookup. CI/CD pipeline updates the parameter, not the code.

IfA/B testing model versions

→

UseStore a JSON map of model aliases to versioned IDs in Parameter Store. Route by feature flag or percentage.

Streaming Responses: Stop Blocking Your Threads and Start Shipping Perceived Speed

Here's what kills GenAI UX before a user ever reads a word: a 12-second blank screen while your server waits for the full completion before flushing anything to the client. Users think it's broken. They hit refresh. You get duplicate charges. Your support queue fills up.

Bedrock's InvokeModelWithResponseStream fixes this. It returns a streaming event iterator — text chunks arrive as the model generates them, and you pipe each chunk to the client immediately. From the user's perspective, text starts appearing in under a second and keeps flowing. Perceived latency drops dramatically even when total generation time is identical.

The tricky part isn't the streaming itself — it's the infrastructure around it. Your web framework needs to support streaming responses, your load balancer needs idle timeout configured high enough (ALB defaults to 60s — too low for long completions), and your error handling needs to account for the fact that the stream can fail mid-response. I've seen services that catch exceptions from InvokeModel just fine but have zero error handling inside the stream event loop — so when the stream dies at token 400 of a 600-token response, the client gets a truncated response with no indication that something went wrong. Silent data loss in a production AI system is a bad day.

bedrock_streaming_handler.pyPYTHON

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

# io.thecodeforge — DevOps tutorial

import boto3
import json
import os
from botocore.config import Config
from typing import Generator

MODEL_ID = os.environ.get("BEDROCK_MODEL_ID", "anthropic.claude-3-sonnet-20240229-v1:0")

boto_config = Config(
    region_name=os.environ.get("AWS_REGION", "us-east-1"),
    connect_timeout=5,
    read_timeout=300,   # streaming completions can run long — 120s isn't always enough
    retries={"max_attempts": 1, "mode": "standard"}  # don't retry mid-stream — retry at the caller level
)

bedrock_runtime = boto3.client("bedrock-runtime", config=boto_config)


def stream_customer_support_response(
    customer_query: str,
    account_context: dict
) -> Generator[str, None, None]:
    """
    Production pattern: real-time customer support response generation.
    Yields text chunks as a generator so the caller (e.g. FastAPI StreamingResponse)
    can flush each chunk to the HTTP client immediately.

    account_context: dict with keys like 'plan', 'open_tickets', 'last_login'
    """

    # Build a system prompt from account context so the model responds with
    # customer-specific information rather than generic advice.
    system_prompt = (
        f"You are a support agent for TheCodeForge platform. "
        f"The customer is on the {account_context.get('plan', 'free')} plan. "
        f"They have {account_context.get('open_tickets', 0)} open support tickets. "
        "Be concise, direct, and actionable. Do not apologise excessively."
    )

    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "system": system_prompt,
        "messages": [
            {"role": "user", "content": customer_query}
        ],
        "temperature": 0.3,
    }

    try:
        streaming_response = bedrock_runtime.invoke_model_with_response_stream(
            modelId=MODEL_ID,
            contentType="application/json",
            accept="application/json",
            body=json.dumps(request_body)
        )

        # event_stream is an iterator of EventStream objects — each one is a chunk
        event_stream = streaming_response["body"]

        for event in event_stream:
            chunk = event.get("chunk")
            if not chunk:
                # Non-chunk events exist (metadata, message_start, etc.) — skip them gracefully
                continue

            chunk_data = json.loads(chunk["bytes"].decode("utf-8"))

            # Claude streaming emits different event types — only 'content_block_delta' carries text
            if chunk_data.get("type") == "content_block_delta":
                delta = chunk_data.get("delta", {})
                if delta.get("type") == "text_delta":
                    text_piece = delta.get("text", "")
                    if text_piece:
                        yield text_piece  # flush this chunk to the caller immediately

            # message_stop event signals clean completion — log it for observability
            elif chunk_data.get("type") == "message_stop":
                # Amazon metrics come in the stop event — useful for billing dashboards
                amazon_metrics = chunk_data.get("amazon-bedrock-invocationMetrics", {})
                input_tokens = amazon_metrics.get("inputTokenCount", 0)
                output_tokens = amazon_metrics.get("outputTokenCount", 0)
                # In production: emit these as CloudWatch custom metrics here
                print(f"[Stream complete] input={input_tokens} output={output_tokens} tokens")

    except Exception as e:
        # Critical: yield an error marker so the client knows the stream died mid-response
        # Don't silently stop — the client will think the truncated response is complete
        yield f"\n[ERROR: Response generation interrupted — {type(e).__name__}]"
        raise


# --- Simulated FastAPI usage (shows how the generator plugs into a real web framework) ---
# In production this would be in your router module:
#
# from fastapi import FastAPI
# from fastapi.responses import StreamingResponse
#
# app = FastAPI()
#
# @app.post("/support/stream")
# async def stream_support(query: SupportQuery):
#     account_ctx = fetch_account_context(query.account_id)  # your DB call
#     return StreamingResponse(
#         stream_customer_support_response(query.text, account_ctx),
#         media_type="text/plain"
#     )


if __name__ == "__main__":
    query = "I deployed to production and my API calls are returning 429s. What do I do?"
    context = {"plan": "pro", "open_tickets": 1, "last_login": "2024-03-15"}

    print("Streaming response:\n")
    for chunk in stream_customer_support_response(query, context):
        print(chunk, end="", flush=True)  # flush=True is essential — don't buffer
    print("\n")

Output

Streaming response:

You're hitting rate limits (429 = Too Many Requests). On the Pro plan here's what to check:

1. **Check your current usage** — log into the dashboard under Settings > API Usage to see if you've hit your monthly request cap.

2. **Implement exponential backoff** — your client should retry with delays of 1s, 2s, 4s before failing. Most SDKs have this built in.

3. **Check for runaway processes** — a misconfigured retry loop can exhaust your quota in minutes. Look for repeated identical requests in your logs.

If you're within quota and still seeing 429s, open a ticket with your API key and a sample request timestamp — that points to a server-side issue we'll trace on our end.

[Stream complete] input=98 output=143 tokens

Never Do This: Retry Inside a Stream Event Loop

Setting max_attempts > 1 on a streaming call in boto3 is dangerous. If a chunk fails mid-stream and boto3 retries, it starts the stream from the beginning — but your client has already received partial output. You end up with duplicated content prepended to the retry. Set retries to 1 on streaming clients and handle retries at the request level, before the stream opens.

Production Insight

ALB default idle timeout is 60s — too low for long streaming completions.

A 2,000-token completion at 30 tokens/second takes 67 seconds. The ALB kills the connection at 60s.

Rule: set ALB idle timeout to 300s for any endpoint that streams Bedrock completions. Check this before your first production deploy, not after your first outage.

Key Takeaway

Streaming drops perceived latency from total generation time to first-token time — critical for user-facing UX.

Never set max_attempts > 1 on streaming boto3 clients — retries restart the stream and duplicate content.

ALB idle timeout must be 300s+ for streaming endpoints. The 60s default will silently truncate long completions.

Streaming vs Sync InvokeModel

IfUser-facing chat or real-time response (< 3s perceived latency required)

→

UseUse InvokeModelWithResponseStream. Perceived latency drops from total generation time to first-token time.

IfBatch processing, document classification, or background jobs

→

UseUse InvokeModel (sync). Simpler error handling, no stream infrastructure needed.

IfWebhook or callback-based architecture

→

UseUse InvokeModel (sync) with async worker. Stream to callback URL after completion.

IfLoad balancer cannot be configured with high idle timeout

→

UseUse InvokeModel (sync). Streaming through a 60s ALB timeout will truncate long responses.

Bedrock Agents: When Single Prompts Aren't Enough and You Need Actual Orchestration

A single InvokeModel call works great when your task is stateless: summarise this, classify that, generate this copy. The moment your task requires multiple steps — look up customer data, reason about it, call an API, generate a response based on the result — you're either building your own orchestration loop or you're using Bedrock Agents.

Bedrock Agents is Amazon's managed multi-step reasoning engine. You define the agent's instructions (its persona and scope), attach Action Groups (Lambda functions that the agent can invoke), and optionally connect a Knowledge Base (a vector store backed by your documents). The agent runs a ReAct-style loop: it reasons about the user's request, decides which actions to take, calls your Lambdas, observes the results, and iterates until it has enough information to respond.

The thing most tutorials won't tell you: the agent's internal reasoning chain costs tokens you don't see upfront. Every step in the loop — including the model's internal 'thinking' about which action to call — burns input and output tokens. On complex multi-step tasks I've seen agents consume 10-15x the tokens you'd expect from reading the final answer alone. Budget for it. Also, Bedrock Agents has a fixed session timeout of one hour. Any stateful conversation longer than that needs explicit session management on your side — the agent won't remember anything after the session expires.

The sweet spot for Agents is internal tooling: HR bots that query Workday, DevOps assistants that check CloudWatch alarms and summarise them, customer-facing support bots that can actually look up order status. Tasks where the answer genuinely requires calling real systems, not just reasoning over embedded knowledge.

bedrock_agent_action_group_lambda.pyPYTHON

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

# io.thecodeforge — DevOps tutorial
# This Lambda is an Action Group handler for a Bedrock Agent.
# The agent calls this function when it needs to look up order status.
# Deploy this as a Lambda, then wire it to your Agent via the Bedrock console or CDK.

import json
import boto3
import os
from datetime import datetime

# In production: pull from environment, not hardcoded table names
ORDERS_TABLE = os.environ.get("ORDERS_TABLE_NAME", "platform-orders-prod")

dynamodb = boto3.resource("dynamodb")
orders_table = dynamodb.Table(ORDERS_TABLE)


def lambda_handler(event: dict, context) -> dict:
    """
    Bedrock Agent Action Group handler.
    The agent sends a specific event structure — you must return a specific structure back.
    Deviate from the response format and the agent silently fails or hallucinates an answer.
    """

    # Bedrock Agents wraps function calls in this structure
    agent_action = event.get event.get("apiPath", "")        # matches your OpenAPI schema path
    http_method = event.get("httpMethod", "")  # GET, POST, etc. — from your schema
    parameters = event.get("parameters", [])   # list of {name, type, value} dicts

    print(f"[Agent Action] group={agent_action} path={api_path} method={http_method}")

    # Route to the appropriate handler based on the API path
    if api_path == "/orders/{orderId}" and http_method == "GET":
        order_id = next(
            (p["value"] for p in parameters if("actionGroup", "")
    api_path = p["name"] == "orderId"), None
        )
        result = fetch_order_status(order_id)
    elif api_path == "/orders/{orderId}/cancel" and http_method == "POST":
        order_id = next(
            (p["value"] for p in parameters if p["name"] == "orderId"), None
        )
        result = cancel_order(order_id)
    else:
        result = {"error": f"Unknown action path: {api_path}"}

    # Bedrock Agents REQUIRES this exact response envelope.
    # Missing 'messageVersion', 'response', or 'actionGroup' fields = silent agent failure.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": agent_action,
            "apiPath": api_path,
            "httpMethod": http_method,
            "httpStatusCode": 200 if "error" not in result else 400,
            "responseBody": {
                "application/json": {
                    "body": json.dumps(result)
                }
            }
        }
    }


def fetch_order_status(order_id: str) -> dict:
    """Look up a real order from DynamoDB and return structured status."""
    if not order_id:
        return {"error": "orderId is required"}

    try:
        response = orders_table.get_item(Key={"orderId": order_id})
        item = response.get("Item")

        if not item:
            # Be specific — the agent will relay this message verbatim to the user
            return {"error": f"Order {order_id} not found. It may not exist or may be archived."}

        return {
            "orderId": item["orderId"],
            "status": item["status"],                  # e.g. PROCESSING, SHIPPED, DELIVERED
            "estimatedDelivery": item.get("estimatedDelivery", "unknown"),
            "carrier": item.get("carrier", "not yet assigned"),
            "trackingNumber": item.get("trackingNumber", "not yet assigned"),
            "lastUpdated": item.get("lastUpdated", "")
        }

    except Exception as e:
        # Don't expose raw exception messages to the agent — it may relay them to the user
        print(f"[ERROR] DynamoDB lookup failed for order {order_id}: {e}")
        return {"error": "Order lookup temporarily unavailable. Please try again shortly."}


def cancel_order(order_id: str) -> dict:
    """Cancel an order if it's still in PROCESSING state."""
    if not order_id:
        return {"error": "orderId is required"}

    try:
        # Conditional update — only cancel if status is PROCESSING
        # This prevents the agent from cancelling already-shipped orders
        orders_table.update_item(
            Key={"orderId": order_id},
            UpdateExpression="SET #s = :cancelled, lastUpdated = :now",
            ConditionExpression="#s = :processing",
            ExpressionAttributeNames={"#s": "status"},   # 'status' is a reserved word in DynamoDB
            ExpressionAttributeValues={
                ":cancelled": "CANCELLED",
                ":processing": "PROCESSING",
                ":now": datetime.utcnow().isoformat()
            }
        )
        return {"orderId": order_id, "status": "CANCELLED", "message": "Order successfully cancelled."}

    except dynamodb.meta.client.exceptions.ConditionalCheckFailedException:
        # Order exists but isn't in PROCESSING — give the agent a specific reason
        return {
            "error": f"Order {order_id} cannot be cancelled — it has already been shipped or delivered."
        }
    except Exception as e:
        print(f"[ERROR] Cancel failed for order {order_id}: {e}")
        return {"error": "Cancellation temporarily unavailable."}

Output

# When the Bedrock Agent receives: "Can you cancel order ORD-88421?"

# The agent internally calls this Lambda with:

# { "actionGroup": "OrderManagement", "apiPath": "/orders/{orderId}/cancel",

# "httpMethod": "POST", "parameters": [{"name": "orderId", "value": "ORD-88421"}] }

# Lambda fetches the order, finds status=PROCESSING, updates to CANCELLED.

# Lambda returns the structured envelope.

# Agent receives the result and responds to the user:

# Agent: "I've cancelled order ORD-88421. You'll receive a confirmation email

# within a few minutes and a refund within 3-5 business days."

# CloudWatch log output from Lambda:

# [Agent Action] group=OrderManagement path=/orders/{orderId}/cancel method=POST

Production Trap: Agent Token Costs Are Not What You Think

Bedrock Agents runs an internal reasoning chain that isn't visible in your application logs. On a 3-step task (lookup → reason → respond), I've measured 8,000+ tokens consumed for what looks like a 200-token final answer. Set up a CloudWatch metric filter on inputTokenCount and outputTokenCount from the agent's CloudTrail events before you go live. Otherwise your billing surprises will be significant and will arrive monthly.

Production Insight

Agent response envelope format is rigid — missing a single field causes silent hallucination.

The agent will generate a plausible-sounding answer from its training data instead of using your Lambda's response.

Rule: validate the envelope structure in Lambda unit tests before deploying. The 6 required fields are: messageVersion, response.actionGroup, response.apiPath, response.httpMethod, response.httpStatusCode, response.responseBody.

Key Takeaway

Agents consume 5-15x more tokens than the final response suggests — meter at CloudWatch, not application logs.

Agent system prompts are re-injected at every reasoning step — keep them under 200 tokens.

Route simple queries to InvokeModel directly. Reserve agents for tasks that genuinely require multi-step API orchestration.

Bedrock Agents vs Custom Orchestration

IfSingle-step task (classify, summarise, generate)

→

UseUse InvokeModel directly. Agents add overhead with no benefit for stateless tasks.

IfMulti-step with 1-3 known API calls and simple routing

→

UseUse Bedrock Agents. The managed ReAct loop handles routing without custom code.

IfComplex branching logic, conditional retries, or saga patterns

→

UseBuild custom orchestration (Step Functions, Temporal, or a state machine). Agents cannot handle complex control flow.

IfNeed cross-session memory (>1 hour conversation)

→

UseAgents cannot do this natively. Build custom session management with DynamoDB. Inject history into the agent's initial prompt.

Provisioned Throughput, Knowledge Bases, and When to Walk Away From Bedrock Entirely

On-demand pricing is great until you hit quota walls at scale. If your application is sending consistent, high-volume traffic to a specific model — think a customer-facing feature used by thousands of users during business hours — Provisioned Throughput might make more sense. You reserve Model Units (MUs) for a specific model, pay hourly regardless of usage, and get guaranteed throughput without ThrottlingExceptions.

Here's the honest math: a single MU for Claude 3 Sonnet runs about $60/hour. At 720 hours per month that's $43,200 per month, per MU. On-demand for the same volume might be cheaper — or wildly more expensive — depending on your actual token throughput. Run the numbers on your specific traffic pattern before committing. Provisioned Throughput has a minimum one-month commitment. I've seen teams lock in a MU for a feature that got descoped a week later.

Bedrock Knowledge Bases gives you managed RAG — upload documents to S3, Bedrock chunks and embeds them into a vector store (OpenSearch Serverless or Pinecone), and your agent can query it semantically. For internal documentation bots or product knowledge bases it's genuinely useful and much faster to ship than building your own embedding pipeline. The gotcha: chunk size and overlap settings are critical and not obvious. Default chunking works fine for short Q&A docs, but for dense technical PDFs you'll get retrieval misses because the relevant context gets split across chunk boundaries.

When should you not use Bedrock? Three clear signals: you need a model Bedrock doesn't offer (GPT-4o, Gemini Ultra — you're calling OpenAI/Google directly regardless), you need sub-100ms inference latency at scale (shared fleet variance won't get you there — look at SageMaker JumpStart with a dedicated endpoint), or you need fine-tuned models on highly proprietary data where sending data to a third-party API is a compliance non-starter. Bedrock does support some fine-tuning workflows, but they're limited in model scope and more complex than advertised.

bedrock_knowledge_base_rag_query.pyPYTHON

100

101

102

103

104

105

106

107

108

109

110

# io.thecodeforge — DevOps tutorial
# RAG (Retrieval Augmented Generation) pattern using Bedrock Knowledge Bases.
# Use case: internal engineering handbook bot that answers policy questions.
# The Knowledge Base is pre-populated with your company docs via the Bedrock console or CDK.

import boto3
import os
import json
from botocore.config import Config

KNOWLEDGE_BASE_ID = os.environ["BEDROCK_KB_ID"]   # e.g. "ABCD1234EF" — from Bedrock console
MODEL_ARN = (
    "arn:aws:bedrock:us-east-1::foundation-model/"
    "anthropic.claude-3-sonnet-20240229-v1:0"
)  # RetrieveAndGenerate requires the full ARN, not just the model ID

boto_config = Config(
    region_name=os.environ.get("AWS_REGION", "us-east-1"),
    connect_timeout=5,
   _attempts": 2, "mode": "adaptive"}
)

# Note: Knowledge Bases uses the 'bedrock-agent-runtime' client — NOT 'bedrock-runtime'.
# Using the wrong client gives you a NoRegionError or AttributeError with no useful message.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", config=boto_config)


def query_engineering_handbook(question: str, max_retrieved_chunks: int = 5) -> dict:
    """
    Query the engineering handbook Knowledge Base using RetrieveAndGenerate.
    This is the fully managed RAG path — Bedrock handles retrieval + generation in one call.

    For transparency/debugging: also returns the source citations so you can verify
    the model isn't hallucinating answers that aren't in the docs.
    """

    try:
        response = bedrock_agent_runtime.retrieve_and_generate(
            input={"text": question},
            retrieveAndGenerateConfiguration={
                "type": "KNOWLEDGE_BASE",
                "knowledgeBaseConfiguration": {
                    "knowledgeBaseId": KNOWLEDGE_BASE_ID,
                    "modelArn": MODEL_ARN,
                    "retrievalConfiguration": {
                        "vectorSearchConfiguration": {
                            # Number of document chunks to retrieve before generation.
                            # Higher = more context but more tokens = higher cost and latency.
                            # 5 is a good starting point; tune based on your doc structure.
                            "numberOfResults": max_retrieved_chunks
                        }
                    },
                    "generationConfiguration": {
                        "promptTemplate": {
                            # Override the default prompt to enforce your preferred answer style.
                            # The $search_results$ placeholder is where retrieved chunks are injected.
                            "textPromptTemplate": (
                                "You are an assistant for TheCodeForge engineering team. "
                                "Answer based ONLY on the following retrieved context. "
                                "If the answer isn't in the context, say 'Not found in handbook.' "
                                "Do not invent policies or procedures.\n\n"
                                "Context:\n$search_results$\n\n"
                                f"Question: {question}"
                            )
                        }
                    }
                }
            }
        )

        answer = response["output"]["text"]

        # Extract citations — each citation maps to a specific chunk in your S3 docs.
        # In production: read_timeout=60,
    retries={"max surface these to the user so they can verify the source.
        citations = []
        for citation in response.get("citations", []):
            for reference in citation.get("retrievedReferences", []):
                location = reference.get("location", {}).get("s3Location", {})
                citations.append({
                    "source_uri": location.get("uri", "unknown"),
                    "excerpt": reference.get("content", {}).get("text", "")[:200]  # truncate for display
                })

        return {
            "answer": answer,
            "citations": citations,
            "citation_count": len(citations)
        }

    except bedrock_agent_runtime.exceptions.ResourceNotFoundException:
        raise ValueError(
            f"Knowledge Base {KNOWLEDGE_BASE_ID} not found. "
            "Check the ID and ensure the KB is in 'Active' status — "
            "embedding ingestion must complete before queries work."
        )
    except Exception as e:
        raise RuntimeError(f"Knowledge Base query failed: {e}") from e


if __name__ == "__main__":
    result = query_engineering_handbook(
        "What's our policy on hotfixing directly to the main branch?"
    )

    print(f"Answer:\n{result['answer']}\n")
    print(f"Sources ({result['citation_count']} retrieved):")
    for i, citation in enumerate(result["citations"], 1):
        print(f"  [{i}] {citation['source_uri']}")
        print(f"      Excerpt: {citation['excerpt']}...")

Output

Answer:

Direct hotfixes to the main branch are not permitted under standard policy. All changes, including urgent patches, must go through a pull request with at least one reviewer approval. For P0 incidents, an expedited review process applies: the on-call engineer can approve with documented justification in the PR description, followed by a post-incident review within 48 hours. Branch protection rules enforce this — direct pushes to main are blocked at the repository level.

Sources (3 retrieved):

[1] s3://thecodeforge-handbook/engineering/git-policy-v4.pdf

Excerpt: Direct hotfixes to the main branch are not permitted under standard policy. All changes, including urgent patches, must go through...

[2] s3://thecodeforge-handbook/engineering/incident-response-runbook.pdf

Excerpt: For P0 incidents, an expedited review process applies: the on-call engineer can approve with documented justification...

[3] s3://thecodeforge-handbook/engineering/branch-protection-setup.md

Excerpt: Branch protection rules enforce this — direct pushes to main are blocked at the repository level via GitHub rulesets...

Senior Shortcut: Two-Stage RAG for Better Retrieval Accuracy

If RetrieveAndGenerate gives you retrieval misses on complex questions, split it into two calls: first call Retrieve only (retrieve_from_knowledge_base) to get the raw chunks, then manually re-rank or filter them in your application code, then pass the filtered context to InvokeModel directly. This two-stage pattern costs slightly more but gives you control over what the model actually sees — and it surfaces retrieval quality issues that the end-to-end call hides.

Production Insight

Provisioned Throughput minimum commitment is 1 month at $43,200/MU for Claude 3 Sonnet.

A feature descoped one week after MU purchase = $43,200 wasted for that month.

Rule: validate feature stability and traffic projections for 30 days before committing to Provisioned Throughput. Use on-demand during the validation period and compare actual token spend against MU cost.

Key Takeaway

Provisioned Throughput is a 1-month minimum commitment at $43K+/MU. Validate traffic stability before committing.

Knowledge Bases default chunking fails on dense technical PDFs — tune chunk size or use the two-stage retrieve-then-generate pattern.

Walk away from Bedrock when you need sub-100ms latency, models outside the catalogue, or VPC-only data processing.

When to Use Bedrock vs Self-Host

IfNeed model output, team has no ML engineers, volume < 1B tokens/month

→

UseUse Bedrock on-demand. Fastest path to production, zero ops burden.

IfConsistent high-volume traffic (> 500M tokens/month) with stable model

→

UseEvaluate Provisioned Throughput. Run 30-day on-demand baseline, compare against MU hourly cost.

IfNeed sub-100ms p99 latency at scale

→

Userock shared fleet cannot guarantee sub-100ms.

IfNeed a model not in Bedrock catalogue (GPT-4o, Gemini Ultra)

→

UseCall the provider API directly. Bedrock cannot help.

IfData cannot leave your VPC (compliance requirement)

→

UseSelf-host on SageMaker or EC2 in your VPC. Bedrock processes data on Amazon's infrastructure.

IfNeed fine-tuning on proprietary data

→

UseEvaluate Bedrock fine-tuning first (limited model support). If insufficient, self-host with custom training pipeline.

● Production incidentPOST-MORTEMseverity: high

The $47K/month Agent Bill: Invisible Token Consumption in Bedrock Agents

Symptom

Monthly Bedrock bill 15x higher than projected. Application logs showed ~200 tokens per response, but CloudWatch metrics showed ~2,400 tokens per invocation. The delta was invisible to the application layer.

Assumption

The team measured cost based on the final response token count — what the user sees. They assumed the agent's internal reasoning was negligible. They did not instrument CloudTrail or the stream stop event metrics for agent-level token tracking.

Root cause

Bedrock Agents runs an internal ReAct-style reasoning loop. Each step — deciding which Action Group to call, interpreting the Lambda response, deciding whether to call another action — burns input and output tokens that are not surfaced in the application response. On a 3-step task (lookup Okta → check Meraki → create Jira ticket), the agent consumed approximately 2,400 tokens per invocation: ~400 tokens for the user's question, ~1,6 3 steps, and ~400 tokens for the final response. The team only metered the 400-token final response. Additionally, the agent was configured with00 tokens for internal reasoning across verbose instructions (800 tokens of system prompt) that were re-injected at every reasoning step, multiplying the input token cost.

Fix

1. Added CloudWatch metric filters on the amazon-bedrock-invocationMetrics inputTokenCount and outputTokenCount fields from the agent's CloudTrail events. This captured the true per-invocation token cost. 2. Reduced the agent's system prompt from 800 tokens to 150 tokens by removing redundant persona descriptions and consolidating instructions. 3. Added a session-level token budget with a hard cap of 3,000 tokens per conversation turn. When the budget was exhausted, the agent returned a 'please try a simpler question' response instead of continuing the reasoning loop. 4. Implemented a two-tier architecture: simple queries (password reset, VPN status) routed to direct InvokeModel calls with a static knowledge base, bypassing the agent entirely. Only multi-step tasks (requiring API calls) used the agent. This reduced agent invocations by 70%. 5. Set up a daily CloudWatch alarm on Bedrock token spend exceeding $1,500/day with automatic Slack notification to the platform team.

Key lesson

Bedrock Agent token costs are 5-15x higher than the final response suggests. Always meter at the CloudWatch/CloudTrail level, not the application response level.
Agent system prompts are re-injected at every reasoning step. A 800-token system prompt across a 4-step reasoning chain adds 3,200 tokens of input cost per invocation. Keep agent instructions under 200 tokens.
Not every query needs an agent. Route simple, stateless queries to direct InvokeModel calls. Reserve agents for tasks that genuinely require multi-step API orchestration.
Set up token spend alarms before go-live. Bedrock cost surprises arrive monthly, not per-request. Daily alarms catch runaway consumption before the bill compounds.

Production debug guideSymptom-to-action guide for Bedrock API errors, latencyPS quota spikes, quota issues, and Agent failures6 entries

Symptom · 01

ThrottlingException: 'Too many requests, please wait before trying again'

→

Fix

for the specific model. Check current limits: AWS Console > Service Quotas > Amazon Bedrock > search for your model's 'On-demand throughput limit'. Default is 5 TPS for most Claude models. File an increase request immediately — approval takes 5-10 business days. In the meantime, implement exponential backoff with jitter in your client.

Symptom · 02

ValidationException: 'Malformed input request' when calling InvokeModel

→

Fix

The request body format does not match the model's expected schema. Claude models on Bedrock require the Messages API format with 'anthropic_version': 'bedrock-2023-05-31'. Mistral and Llama use different formats. Check the model-specific API spec in the Bedrock documentation. Common mistake: using the legacy text-completion format for Claude 3.

Symptom · 03

InvokeModel call hangs for 60+ seconds then times out

→

Fix

Check boto3 read_timeout configuration. Default is 60s (or no timeout depending on version). For large prompts (>10K tokens), completions can take 30-120s. Set read_timeout=120 in your boto3 Config. AlsoYou have exceeded your account's T check if the model is experiencing elevated latency — check AWS Health Dashboard for the region.

Symptom · 04

Bedrock Agent returns wrong answers but Lambda executes correctly with no errors

→

Fix

The agent's reasoning loop is misinterpreting your Lambda's response. Check the response envelope structure — Bedrock Agents requires 'messageVersion', 'response', 'actionGroup', 'apiPath', 'httpMethod', 'httpStatusCode', and 'responseBody' in the exact format. Missing or misnamed fields cause the agent to hallucinate a response instead of using your data. Add structured logging in the Lambda to capture the full event and response for comparison.

Symptom · 05

Streaming response truncates mid-generation — client receives partial output

→

Fix

The stream connection dropped mid-response. Check ALB idle timeout (default 60s — too low for long completions). Increase to 300s. Check boto3 retries — setting max_attempts > 1 on streaming clients causes boto3 to restart the stream from the beginning, duplicating content already sent to the client. Set retries to 1 on streaming clients and handle retries at the request level.

Symptom · 06

Knowledge Base returns 'ResourceNotFoundException' despite correct KB ID

→

Fix

The Knowledge Base exists but is not in 'Active' status. Data source ingestion must complete before the KB can serve queries. Check status: AWS Console > Bedrock > Knowledge Bases > select your KB > check 'Status' column. If 'Creating' or 'Updating', wait for ingestion to complete. If the KB was recently created, the initial embedding process can take 10-60 minutes depending on document volume.

★ AWS Bedrock Triage Cheat SheetFast symptom-to-action for engineers investigating Bedrock failures. First 5 minutes.

ThrottlingException on every request−

Immediate action

Check TPS quota and current usage for the model in Service Quotas.

Commands

aws service-quotas get-service-quota --service-code amazon-bedrock --quota-code L-<your-model-quota-code>

aws cloudwatch get-metric-statistics --namespace AWS/Bedrock --metric-name Invocations --dimensions Name=ModelId,Value=<model-id> --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date -u +%Y-%m-%dT%H:%M:%S) --period 60 --statistics Sum

Fix now

Implement exponential backoff. File Service Quotas increase request. If urgent, switch to Provisioned Throughput for guaranteed TPS.

Agent returns generic or wrong answers despite correct Lambda execution+

InvokeModel latency > 10s consistently+

Streaming response duplicated content mid-stream+

AWS Bedrock vs Self-Hosted Inference

Attribute	AWS Bedrock (On-Demand)	Self-Hosted on SageMaker / EC2
Time to first inference	Minutes (API key + boto3 call)	Days to weeks (instance setup, model download, server config)
Infrastructure ops burden	Zero — Amazon's problem	High — your team owns scaling, patching, CUDA versions
Latency consistency (p99)	Variable — shared fleet, expect 2-5x p50	Predictable — dedicated hardware, tunable
Cost at low volume (<10M tokens/month)	Cheap — pure pay-per-token	Expensive — idle GPU compute is still billed
Cost at high volume (>1B tokens/month)	Expensive — per-token adds up fast	Cheaper if utilisation is high and model is stable
Model selection	Limited to Bedrock catalogue (Claude, Titan, Llama, Mistral, Cohere)	Any open-weight model you can run
Data sovereignty / compliance	Data processed by AWS — review BAA requirements	Full control — data never leaves your VPC
Fine-tuning support	Limited — select models only, constrained workflow	Full control — any fine-tuning framework
Quota / rate limits	Default 5 TPS for most models — requires support ticket to raise	Self-imposed — limited by your hardware
Cold start latency	None — fleet is always warm	Real — model loading can take 30-90s on first call

Key takeaways

Bedrock's value isn't the models

it's the elimination of the MLOps surface area. The moment you start managing GPU instances, inference servers, and model versioning yourself, you've hired an invisible infrastructure team that doesn't ship features.

Default TPS quotas will end your production launch. 5 requests/second is not a starting point

it's a demo limit. File the Service Quotas increase before you write your first line of application code, not the week before go-live.

The right signal to reach for Bedrock

your team needs model output, not model ownership, and your volume is below the crossover point where Provisioned Throughput beats on-demand pricing. If you're shipping a feature to real users and your team has no ML engineers, Bedrock is almost always the correct first choice.

Bedrock Agents' internal token consumption is invisible to your application logs and will be 5-15x higher than the final response tokens suggest. If you're not metering agent token usage from CloudTrail or the invocation metrics in the stream stop event, your cost model is fiction.

Common mistakes to avoid

5 patterns

Hardcoding the model ID string in application source code

Symptom

When AWS deprecates the version ID or you want to upgrade to a newer model version, you must do a multi-repo find-and-replace and redeploy every service that references the old ID.

Fix

Store model IDs in AWS Systems Manager Parameter Store or environment variables, injected at deploy time via your CDK/Terraform config. A single parameter update triggers a rolling deploy without code changes.

Using the wrong boto3 client for Knowledge Bases

Symptom

Calling bedrock_runtime.retrieve_and_generate() produces AttributeError: 'BedrockRuntime' object has no attribute 'retrieve_and_generate' — looks like a boto3 version issue but it is not.

Fix

Knowledge Base operations use boto3.client('bedrock-agent-runtime'), not boto3.client('bedrock-runtime'). These are two different clients with different endpoints. The error message gives no indication of this.

Not requesting a Service Quotas increase before launch

Symptom

Default TPS for Claude 3 Sonnet is 5 requests/second in most regions for new accounts. At production load, every request beyond 5/s returns ThrottlingException with 'Too many requests, please wait before trying again'.

Fix

Go to AWS Service Quotas > Amazon Bedrock > find your model's 'On-demand throughput limit' and submit an increase request at least 10 business days before go-live. Default limits are designed for development, not production traffic.

Treating the streaming event loop like a simple for-loop without mid-stream error handling

Symptom

If the connection drops at token 300 of a 600-token completion, the generator stops silently and the client renders a half-finished response as if it were complete — silent data loss.

Fix

Wrap the event_stream iteration in try/except and yield an explicit error marker if the loop exits unexpectedly, so the client can detect incomplete responses. Never let a stream die silently.

Using temperature=1.0 (or the model default) for structured output tasks

Symptom

High temperature causes the model to occasionally produce malformed JSON or off-schema responses that break your downstream parser. Intermittent failures are the hardest to debug.

Fix

Set temperature between 0.0 and 0.2 for any task where format correctness matters more than creativity. Add output validation with a retry on parse failure.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

Bedrock's on-demand pricing model uses a shared inference fleet. How doe...

Q02SENIOR

When would you choose Bedrock Agents over building your own LLM orchestr...

Q03SENIOR

A Bedrock Agent is calling your Action Group Lambda and intermittently r...

Q04SENIOR

Your team is running 500 million tokens per month through Bedrock on-dem...

Q01 of 04SENIOR

Bedrock's on-demand pricing model uses a shared inference fleet. How does that affect your p99 latency SLO design, and what would you change architecturally if your feature requires consistent sub-500ms responses?

ANSWER

Shared fleet means you cannot control queue position, routing, or hardware allocation. p99 latency will be 2-5x p50 depending on time of day and overall fleet load. Design for p99 from day one by load testing at peak hours (9am-5pm weekdays), not at 2am when the fleet is idle. If sub-500ms p99 is a hard requirement, three options: (1) Provisioned Throughput with dedicated Model Units gives guaranteed throughput but at $43K+/month per MU. (2) Switch to SageMaker with a dedicated endpoint for predictable latency on your own hardware. (3) Architect around the latency variance — use streaming for user-facing responses so first-token latency is under 1s even if total generation takes 3s. Cache frequent queries with semantic similarity matching to avoid inference calls entirely for common patterns.

FAQ · 4 QUESTIONS

Frequently Asked Questions

How much does AWS Bedrock actually cost in production?

What's the difference between AWS Bedrock and SageMaker for running AI models?

How do I handle Bedrock ThrottlingException in production without losing requests?

Can Bedrock Agents maintain conversation history across multiple user sessions?

🔥

That's Cloud. Mark it forged?

6 min read · try the examples if you haven't