
AWS Bedrock: Build Production GenAI Apps Without the MLOps Tax

AWS Bedrock lets you call foundation models via API without managing GPUs, fine-tuning pipelines, or model servers.
βš™οΈ Intermediate β€” basic DevOps knowledge assumed
In this tutorial, you'll learn:
  • Bedrock's value isn't the models; it's the elimination of the MLOps surface area. The moment you start managing GPU instances, inference servers, and model versioning yourself, you've hired an invisible infrastructure team that doesn't ship features.
  • Default TPS quotas will end your production launch. 5 requests/second is not a starting point; it's a demo limit. File the Service Quotas increase before you write your first line of application code, not the week before go-live.
  • The right signal to reach for Bedrock: your team needs model output, not model ownership, and your volume is below the crossover point where Provisioned Throughput beats on-demand pricing. If you're shipping a feature to real users and your team has no ML engineers, Bedrock is almost always the correct first choice.
⚡ Quick Answer
Imagine you need fresh bread for your restaurant every morning. You could buy a wheat farm, hire agronomists, build a mill, and train bakers, or you could just call a bakery and say 'send me 200 sourdough loaves.' AWS Bedrock is the bakery. The foundation models (Claude, Titan, Llama, Mistral) are already baked, scaled, and maintained by someone else. You just make the call, get the output, and pay per loaf. The moment you think you need to 'own the farm' is the moment you've stopped shipping features and started running an AI infrastructure team.

A fintech startup I consulted for spent four months standing up a self-hosted Llama 2 cluster on EC2. GPU reservations, CUDA driver mismatches, custom inference servers, auto-scaling that never quite worked right. They burned $180k in compute before their first user ever typed a prompt. AWS Bedrock would have had them in production in an afternoon for a few cents per thousand tokens. That's not a sales pitch; it's a pattern I've watched repeat at least six times across different orgs.

Bedrock solves a specific and expensive problem: most product teams don't need to run a model; they need a model's output. The operational surface area between those two things is massive. You're talking GPU fleet management, model versioning, inference server tuning, cold-start latency, and on-call rotations that wake up ML engineers at 2am because the VRAM exploded under load. Bedrock collapses all of that into a single API. You pick a model, send a request, get a response. The fleet management, the scaling, the hardware: Amazon's problem now.

After reading this you'll be able to: wire up Bedrock's InvokeModel API in a real service context, implement streaming responses without blocking your web workers, set up Bedrock Agents for multi-step task orchestration, avoid the three quota and cost traps that silently destroy GenAI budgets, and make an informed decision about when Bedrock is the right call versus when you actually do need to self-host.

The Bedrock Model: What You're Actually Paying For and How It Routes

Before you write a single line of code, understand what Bedrock is under the hood, because the mental model directly affects how you design for cost, latency, and failure.

Bedrock is a managed inference proxy. When you call InvokeModel, you're not getting a dedicated GPU instance. Your request goes into Amazon's shared inference fleet for that model family. Amazon handles queuing, routing, scaling, and the hardware underneath. You pay per input token and per output token. There's no idle cost and no reserved capacity fee by default, unless you opt into Provisioned Throughput, which we'll get to.

This shared-fleet model is why you'll see latency variance that would be unacceptable from your own infrastructure. On a busy Tuesday afternoon, a Claude 3 Sonnet call might take 800ms. On Sunday at 6am it might take 280ms. You don't control that. Plan for p99 latency, not average. I've seen teams build chatbots that felt broken in production because they load-tested at 2am and designed for 400ms response times, and then their 9am Monday demo crawled.
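
To make p99 concrete, here's a minimal, self-contained sketch of the kind of tail-latency check worth running against your own Bedrock call logs. The sample latencies are invented for illustration:

```python
from statistics import quantiles

def latency_percentiles(samples_ms: list) -> dict:
    """Return p50 and p99 from observed call latencies (milliseconds)."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99
    cuts = quantiles(sorted(samples_ms), n=100)
    return {"p50": cuts[49], "p99": cuts[98]}

# Invented sample: 90 fast calls, 9 slower ones, one very slow outlier
observed = [280] * 90 + [800] * 9 + [2400]
stats = latency_percentiles(observed)
print(stats)  # the p99 is driven by the tail, not the typical call
```

Design against the p99 figure: if it won't fit inside your request budget, reach for streaming (covered below) rather than hoping the average holds.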

The model IDs matter more than you think. They're not stable aliases; they're versioned strings like anthropic.claude-3-sonnet-20240229-v1:0. When Anthropic ships a new version, the old ID stays available but you don't get automatically migrated. That's intentional. But it means you need a config-driven model ID system, not hardcoded strings in your service. Teams that hardcode model IDs end up doing find-and-replace across repos when they want to upgrade, which is exactly as painful as it sounds.

bedrock_inference_client.py · PYTHON
# io.thecodeforge - DevOps tutorial

import boto3
import json
import os
from botocore.config import Config
from botocore.exceptions import ClientError

# Config-driven model ID - never hardcode this in your service layer.
# Pull from environment or parameter store so upgrades don't require redeploys.
MODEL_ID = os.environ.get("BEDROCK_MODEL_ID", "anthropic.claude-3-sonnet-20240229-v1:0")
AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")

# Set explicit timeouts. Bedrock calls on large prompts can run 30-60s.
# boto3 defaults to 60s for both connect and read - too long to fail fast, too short for big completions.
boto_config = Config(
    region_name=AWS_REGION,
    connect_timeout=5,        # fail fast if the endpoint is unreachable
    read_timeout=120,         # long enough for large completions, not infinite
    retries={
        "max_attempts": 3,
        "mode": "adaptive"    # exponential backoff with jitter β€” don't use 'legacy' mode in prod
    }
)

bedrock_runtime = boto3.client("bedrock-runtime", config=boto_config)


def invoke_document_summariser(raw_document: str, max_tokens: int = 1024) -> dict:
    """
    Production pattern: document summarisation for a content pipeline.
    Returns structured output including token usage so the caller can track cost.
    """

    # Claude models use the Messages API format - not the legacy text-completion format.
    # Mixing them up gives you a cryptic ValidationException, not a helpful error.
    request_body = {
        "anthropic_version": "bedrock-2023-05-31",  # required field for Anthropic models on Bedrock
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": (
                    "Summarise the following document in 3 bullet points. "
                    "Focus on decisions made, not background context.\n\n"
                    f"{raw_document}"
                )
            }
        ],
        "temperature": 0.2,   # low temp for summarisation β€” you want deterministic, not creative
    }

    try:
        response = bedrock_runtime.invoke_model(
            modelId=MODEL_ID,
            contentType="application/json",
            accept="application/json",
            body=json.dumps(request_body)
        )

        response_body = json.loads(response["body"].read())

        # Always capture usage - this is your cost telemetry.
        # Log it to CloudWatch metrics or your billing system. Don't discard it.
        input_tokens = response_body["usage"]["input_tokens"]
        output_tokens = response_body["usage"]["output_tokens"]

        summary_text = response_body["content"][0]["text"]

        return {
            "summary": summary_text,
            "model_id": MODEL_ID,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            # Rough cost estimate for Claude 3 Sonnet at time of writing:
            # $0.003/1K input tokens, $0.015/1K output tokens
            "estimated_cost_usd": round(
                (input_tokens / 1000 * 0.003) + (output_tokens / 1000 * 0.015), 6
            )
        }

    except ClientError as e:
        error_code = e.response["Error"]["Code"]
        error_message = e.response["Error"]["Message"]

        # ThrottlingException hits when you exceed your account's TPS quota.
        # Default is 5 TPS for Claude 3 Sonnet in most regions - shockingly low for production.
        if error_code == "ThrottlingException":
            raise RuntimeError(
                f"Bedrock quota exceeded for model {MODEL_ID}. "
                "Request a limit increase via Service Quotas before going live."
            ) from e

        # ValidationException usually means malformed request body - check your model's spec.
        if error_code == "ValidationException":
            raise ValueError(f"Invalid request format for {MODEL_ID}: {error_message}") from e

        raise RuntimeError(f"Bedrock API error [{error_code}]: {error_message}") from e


if __name__ == "__main__":
    sample_doc = """
    Engineering Review - Q3 Platform Migration
    Decision: Move API gateway to AWS API Gateway v2 (HTTP APIs).
    Rationale: 60% cost reduction vs REST APIs for our traffic pattern.
    Rejected alternative: Kong on EKS - operational overhead too high for current team size.
    Timeline: Cutover scheduled for October 15th. Rollback plan approved.
    Owner: Platform team. Risk: Medium. Stakeholder sign-off: CTO, VP Engineering.
    """

    result = invoke_document_summariser(sample_doc)
    print(f"Summary:\n{result['summary']}")
    print(f"\nTokens β€” Input: {result['input_tokens']} | Output: {result['output_tokens']}")
    print(f"Estimated cost: ${result['estimated_cost_usd']}")
▶ Output
Summary:
β€’ Decided to migrate API gateway to AWS API Gateway v2 (HTTP APIs) for a 60% cost reduction.
β€’ Rejected Kong on EKS due to excessive operational overhead for the current team size.
β€’ Cutover set for October 15th with an approved rollback plan; medium risk, sign-off from CTO and VP Engineering.

Tokens - Input: 187 | Output: 73
Estimated cost: $0.001662
⚠️
Production Trap: Default TPS Quota Will Destroy Your Launch
AWS Bedrock default TPS limits are 5 requests/second for most Claude models in new accounts. That's fine for a demo. At production load with 50 concurrent users, you'll hit ThrottlingException inside two seconds. File a Service Quotas increase request at least two weeks before your launch date; AWS approval isn't instant, and the support ticket queue gets long.
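
Until the quota increase lands, a client-side throttle keeps bursty traffic from tripping ThrottlingException in the first place. A minimal sketch, assuming the 5 TPS default discussed above; the class and names are illustrative, not an AWS API:

```python
import time
import threading

class TpsThrottle:
    """Block callers so the aggregate request rate stays under max_tps."""
    def __init__(self, max_tps: float):
        self.min_interval = 1.0 / max_tps
        self.lock = threading.Lock()
        self.next_allowed = 0.0

    def acquire(self) -> None:
        # Reserve the next slot under the lock, then sleep outside it
        with self.lock:
            now = time.monotonic()
            wait = max(0.0, self.next_allowed - now)
            self.next_allowed = max(now, self.next_allowed) + self.min_interval
        if wait > 0:
            time.sleep(wait)

throttle = TpsThrottle(max_tps=5)  # match your account's current Bedrock quota

def throttled_invoke(invoke_fn, *args, **kwargs):
    """Wrap any Bedrock call so it never exceeds the configured TPS."""
    throttle.acquire()
    return invoke_fn(*args, **kwargs)
```

This only protects a single process; if you run multiple workers, divide the budget between them or move the limiter into a shared layer.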

Streaming Responses: Stop Blocking Your Threads and Start Shipping Perceived Speed

Here's what kills GenAI UX before a user ever reads a word: a 12-second blank screen while your server waits for the full completion before flushing anything to the client. Users think it's broken. They hit refresh. You get duplicate charges. Your support queue fills up.

Bedrock's InvokeModelWithResponseStream fixes this. It returns a streaming event iterator: text chunks arrive as the model generates them, and you pipe each chunk to the client immediately. From the user's perspective, text starts appearing in under a second and keeps flowing. Perceived latency drops dramatically even when total generation time is identical.

The tricky part isn't the streaming itself; it's the infrastructure around it. Your web framework needs to support streaming responses, your load balancer needs its idle timeout set high enough (ALB defaults to 60s, too low for long completions), and your error handling needs to account for the fact that the stream can fail mid-response. I've seen services that catch exceptions from InvokeModel just fine but have zero error handling inside the stream event loop, so when the stream dies at token 400 of a 600-token response, the client gets a truncated response with no indication that something went wrong. Silent data loss in a production AI system is a bad day.

bedrock_streaming_handler.py · PYTHON
# io.thecodeforge - DevOps tutorial

import boto3
import json
import os
from botocore.config import Config
from typing import Generator

MODEL_ID = os.environ.get("BEDROCK_MODEL_ID", "anthropic.claude-3-sonnet-20240229-v1:0")

boto_config = Config(
    region_name=os.environ.get("AWS_REGION", "us-east-1"),
    connect_timeout=5,
    read_timeout=300,   # streaming completions can run long - 120s isn't always enough
    retries={"max_attempts": 1, "mode": "standard"}  # don't retry mid-stream - retry at the caller level
)

bedrock_runtime = boto3.client("bedrock-runtime", config=boto_config)


def stream_customer_support_response(
    customer_query: str,
    account_context: dict
) -> Generator[str, None, None]:
    """
    Production pattern: real-time customer support response generation.
    Yields text chunks as a generator so the caller (e.g. FastAPI StreamingResponse)
    can flush each chunk to the HTTP client immediately.

    account_context: dict with keys like 'plan', 'open_tickets', 'last_login'
    """

    # Build a system prompt from account context so the model responds with
    # customer-specific information rather than generic advice.
    system_prompt = (
        f"You are a support agent for TheCodeForge platform. "
        f"The customer is on the {account_context.get('plan', 'free')} plan. "
        f"They have {account_context.get('open_tickets', 0)} open support tickets. "
        "Be concise, direct, and actionable. Do not apologise excessively."
    )

    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "system": system_prompt,
        "messages": [
            {"role": "user", "content": customer_query}
        ],
        "temperature": 0.3,
    }

    try:
        streaming_response = bedrock_runtime.invoke_model_with_response_stream(
            modelId=MODEL_ID,
            contentType="application/json",
            accept="application/json",
            body=json.dumps(request_body)
        )

        # The response body is a single EventStream - iterating it yields events as chunks arrive
        event_stream = streaming_response["body"]

        for event in event_stream:
            chunk = event.get("chunk")
            if not chunk:
                # Non-chunk events exist (metadata, message_start, etc.) - skip them gracefully
                continue

            chunk_data = json.loads(chunk["bytes"].decode("utf-8"))

            # Claude streaming emits different event types - only 'content_block_delta' carries text
            if chunk_data.get("type") == "content_block_delta":
                delta = chunk_data.get("delta", {})
                if delta.get("type") == "text_delta":
                    text_piece = delta.get("text", "")
                    if text_piece:
                        yield text_piece  # flush this chunk to the caller immediately

            # message_stop event signals clean completion - log it for observability
            elif chunk_data.get("type") == "message_stop":
                # Amazon metrics come in the stop event - useful for billing dashboards
                amazon_metrics = chunk_data.get("amazon-bedrock-invocationMetrics", {})
                input_tokens = amazon_metrics.get("inputTokenCount", 0)
                output_tokens = amazon_metrics.get("outputTokenCount", 0)
                # In production: emit these as CloudWatch custom metrics here
                print(f"[Stream complete] input={input_tokens} output={output_tokens} tokens")

    except Exception as e:
        # Critical: yield an error marker so the client knows the stream died mid-response
        # Don't silently stop - the client will think the truncated response is complete
        yield f"\n[ERROR: Response generation interrupted - {type(e).__name__}]"
        raise


# --- Simulated FastAPI usage (shows how the generator plugs into a real web framework) ---
# In production this would be in your router module:
#
# from fastapi import FastAPI
# from fastapi.responses import StreamingResponse
#
# app = FastAPI()
#
# @app.post("/support/stream")
# async def stream_support(query: SupportQuery):
#     account_ctx = fetch_account_context(query.account_id)  # your DB call
#     return StreamingResponse(
#         stream_customer_support_response(query.text, account_ctx),
#         media_type="text/plain"
#     )


if __name__ == "__main__":
    query = "I deployed to production and my API calls are returning 429s. What do I do?"
    context = {"plan": "pro", "open_tickets": 1, "last_login": "2024-03-15"}

    print("Streaming response:\n")
    for chunk in stream_customer_support_response(query, context):
        print(chunk, end="", flush=True)  # flush=True is essential - don't buffer
    print("\n")
▶ Output
Streaming response:

You're hitting rate limits (429 = Too Many Requests). On the Pro plan here's what to check:

1. **Check your current usage** - log into the dashboard under Settings > API Usage to see if you've hit your monthly request cap.
2. **Implement exponential backoff** - your client should retry with delays of 1s, 2s, 4s before failing. Most SDKs have this built in.
3. **Check for runaway processes** - a misconfigured retry loop can exhaust your quota in minutes. Look for repeated identical requests in your logs.

If you're within quota and still seeing 429s, open a ticket with your API key and a sample request timestamp - that points to a server-side issue we'll trace on our end.

[Stream complete] input=98 output=143 tokens
⚠️
Never Do This: Retry Inside a Stream Event Loop
Setting max_attempts > 1 on a streaming call in boto3 is dangerous. If a chunk fails mid-stream and boto3 retries, it starts the stream from the beginning, but your client has already received partial output. You end up with duplicated content prepended to the retry. Set retries to 1 on streaming clients and handle retries at the request level, before the stream opens.
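
A request-level retry can be sketched as a wrapper that retries opening the stream with backoff, but never retries once the first chunk has been yielded. The names here are illustrative; `stream_fn` would be a generator factory like the support streamer above:

```python
import time
from typing import Callable, Generator, Iterable

def stream_with_request_level_retry(
    stream_fn: Callable[[], Iterable[str]],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
) -> Generator[str, None, None]:
    """Retry opening the stream; never retry after partial output is sent."""
    for attempt in range(1, max_attempts + 1):
        yielded_anything = False
        try:
            for chunk in stream_fn():
                yielded_anything = True
                yield chunk
            return  # clean completion
        except Exception:
            if yielded_anything or attempt == max_attempts:
                raise  # partial output already sent, or out of attempts
            time.sleep(base_delay_s * (2 ** (attempt - 1)))  # backoff before reopening
```

The `yielded_anything` flag is the whole point: once any text has reached the client, the only safe move is to surface the error, not to replay the stream.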

Bedrock Agents: When Single Prompts Aren't Enough and You Need Actual Orchestration

A single InvokeModel call works great when your task is stateless: summarise this, classify that, generate this copy. The moment your task requires multiple steps (look up customer data, reason about it, call an API, generate a response based on the result), you're either building your own orchestration loop or you're using Bedrock Agents.

Bedrock Agents is Amazon's managed multi-step reasoning engine. You define the agent's instructions (its persona and scope), attach Action Groups (Lambda functions that the agent can invoke), and optionally connect a Knowledge Base (a vector store backed by your documents). The agent runs a ReAct-style loop: it reasons about the user's request, decides which actions to take, calls your Lambdas, observes the results, and iterates until it has enough information to respond.
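
The ReAct-style loop described above can be sketched in a few lines of plain Python. This is conceptual pseudologic to build intuition, not the Bedrock Agents API; `plan_next_step` stands in for the model call and `tools` for your Action Group Lambdas:

```python
from typing import Callable

def react_loop(user_request: str,
               plan_next_step: Callable[[str, list], dict],
               tools: dict,
               max_steps: int = 5) -> str:
    """Reason -> act -> observe, until the planner emits a final answer."""
    observations: list = []
    for _ in range(max_steps):
        step = plan_next_step(user_request, observations)
        if step["type"] == "final_answer":
            return step["text"]
        # The planner chose an action: call the tool and record the result
        result = tools[step["tool"]](**step["arguments"])
        observations.append({"tool": step["tool"], "result": result})
    return "Max reasoning steps reached without an answer."
```

Every pass through that loop is a fresh model call carrying the whole observation history, which is exactly why agent token usage balloons on multi-step tasks.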

The thing most tutorials won't tell you: the agent's internal reasoning chain costs tokens you don't see upfront. Every step in the loop, including the model's internal 'thinking' about which action to call, burns input and output tokens. On complex multi-step tasks I've seen agents consume 10-15x the tokens you'd expect from reading the final answer alone. Budget for it. Also, Bedrock Agents caps the session idle timeout at one hour. Any stateful conversation longer than that needs explicit session management on your side; the agent won't remember anything after the session expires.

The sweet spot for Agents is internal tooling: HR bots that query Workday, DevOps assistants that check CloudWatch alarms and summarise them, customer-facing support bots that can actually look up order status. Tasks where the answer genuinely requires calling real systems, not just reasoning over embedded knowledge.

bedrock_agent_action_group_lambda.py · PYTHON
# io.thecodeforge - DevOps tutorial
# This Lambda is an Action Group handler for a Bedrock Agent.
# The agent calls this function when it needs to look up order status.
# Deploy this as a Lambda, then wire it to your Agent via the Bedrock console or CDK.

import json
import boto3
import os
from datetime import datetime

# In production: pull from environment, not hardcoded table names
ORDERS_TABLE = os.environ.get("ORDERS_TABLE_NAME", "platform-orders-prod")

dynamodb = boto3.resource("dynamodb")
orders_table = dynamodb.Table(ORDERS_TABLE)


def lambda_handler(event: dict, context) -> dict:
    """
    Bedrock Agent Action Group handler.
    The agent sends a specific event structure β€” you must return a specific structure back.
    Deviate from the response format and the agent silently fails or hallucinates an answer.
    """

    # Bedrock Agents wraps function calls in this structure
    agent_action = event.get("actionGroup", "")
    api_path = event.get("apiPath", "")        # matches your OpenAPI schema path
    http_method = event.get("httpMethod", "")  # GET, POST, etc. - from your schema
    parameters = event.get("parameters", [])   # list of {name, type, value} dicts

    print(f"[Agent Action] group={agent_action} path={api_path} method={http_method}")

    # Route to the appropriate handler based on the API path
    if api_path == "/orders/{orderId}" and http_method == "GET":
        order_id = next(
            (p["value"] for p in parameters if p["name"] == "orderId"), None
        )
        result = fetch_order_status(order_id)
    elif api_path == "/orders/{orderId}/cancel" and http_method == "POST":
        order_id = next(
            (p["value"] for p in parameters if p["name"] == "orderId"), None
        )
        result = cancel_order(order_id)
    else:
        result = {"error": f"Unknown action path: {api_path}"}

    # Bedrock Agents REQUIRES this exact response envelope.
    # Missing 'messageVersion', 'response', or 'actionGroup' fields = silent agent failure.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": agent_action,
            "apiPath": api_path,
            "httpMethod": http_method,
            "httpStatusCode": 200 if "error" not in result else 400,
            "responseBody": {
                "application/json": {
                    "body": json.dumps(result)
                }
            }
        }
    }


def fetch_order_status(order_id: str) -> dict:
    """Look up a real order from DynamoDB and return structured status."""
    if not order_id:
        return {"error": "orderId is required"}

    try:
        response = orders_table.get_item(Key={"orderId": order_id})
        item = response.get("Item")

        if not item:
            # Be specific - the agent will relay this message verbatim to the user
            return {"error": f"Order {order_id} not found. It may not exist or may be archived."}

        return {
            "orderId": item["orderId"],
            "status": item["status"],                  # e.g. PROCESSING, SHIPPED, DELIVERED
            "estimatedDelivery": item.get("estimatedDelivery", "unknown"),
            "carrier": item.get("carrier", "not yet assigned"),
            "trackingNumber": item.get("trackingNumber", "not yet assigned"),
            "lastUpdated": item.get("lastUpdated", "")
        }

    except Exception as e:
        # Don't expose raw exception messages to the agent - it may relay them to the user
        print(f"[ERROR] DynamoDB lookup failed for order {order_id}: {e}")
        return {"error": "Order lookup temporarily unavailable. Please try again shortly."}


def cancel_order(order_id: str) -> dict:
    """Cancel an order if it's still in PROCESSING state."""
    if not order_id:
        return {"error": "orderId is required"}

    try:
        # Conditional update - only cancel if status is PROCESSING
        # This prevents the agent from cancelling already-shipped orders
        orders_table.update_item(
            Key={"orderId": order_id},
            UpdateExpression="SET #s = :cancelled, lastUpdated = :now",
            ConditionExpression="#s = :processing",
            ExpressionAttributeNames={"#s": "status"},   # 'status' is a reserved word in DynamoDB
            ExpressionAttributeValues={
                ":cancelled": "CANCELLED",
                ":processing": "PROCESSING",
                ":now": datetime.utcnow().isoformat()
            }
        )
        return {"orderId": order_id, "status": "CANCELLED", "message": "Order successfully cancelled."}

    except dynamodb.meta.client.exceptions.ConditionalCheckFailedException:
        # Order exists but isn't in PROCESSING - give the agent a specific reason
        return {
            "error": f"Order {order_id} cannot be cancelled β€” it has already been shipped or delivered."
        }
    except Exception as e:
        print(f"[ERROR] Cancel failed for order {order_id}: {e}")
        return {"error": "Cancellation temporarily unavailable."}
▶ Output
# When the Bedrock Agent receives: "Can you cancel order ORD-88421?"
# The agent internally calls this Lambda with:
# { "actionGroup": "OrderManagement", "apiPath": "/orders/{orderId}/cancel",
# "httpMethod": "POST", "parameters": [{"name": "orderId", "value": "ORD-88421"}] }
#
# Lambda fetches the order, finds status=PROCESSING, updates to CANCELLED.
# Lambda returns the structured envelope.
# Agent receives the result and responds to the user:
#
# Agent: "I've cancelled order ORD-88421. You'll receive a confirmation email
# within a few minutes and a refund within 3-5 business days."
#
# CloudWatch log output from Lambda:
# [Agent Action] group=OrderManagement path=/orders/{orderId}/cancel method=POST
⚠️
Production Trap: Agent Token Costs Are Not What You Think
Bedrock Agents runs an internal reasoning chain that isn't visible in your application logs. On a 3-step task (lookup → reason → respond), I've measured 8,000+ tokens consumed for what looks like a 200-token final answer. Set up a CloudWatch metric filter on inputTokenCount and outputTokenCount from the agent's CloudTrail events before you go live. Otherwise your billing surprises will be significant and will arrive monthly.
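
One way to get that visibility is to publish per-invocation token counts as CloudWatch custom metrics from your own code paths. A hedged sketch: the namespace, metric names, and dimensions are illustrative choices of mine, not Bedrock conventions, and the function takes the client as a parameter so it's easy to test:

```python
def emit_token_metrics(cloudwatch, model_id: str,
                       input_tokens: int, output_tokens: int) -> None:
    """Publish token usage for one invocation via a boto3 'cloudwatch' client."""
    cloudwatch.put_metric_data(
        Namespace="GenAI/Bedrock",  # illustrative namespace
        MetricData=[
            {
                "MetricName": "InputTokens",
                "Dimensions": [{"Name": "ModelId", "Value": model_id}],
                "Value": float(input_tokens),
                "Unit": "Count",
            },
            {
                "MetricName": "OutputTokens",
                "Dimensions": [{"Name": "ModelId", "Value": model_id}],
                "Value": float(output_tokens),
                "Unit": "Count",
            },
        ],
    )
```

Call it wherever you already have usage numbers (the summariser's return dict, the streaming message_stop handler), then alarm on the daily sum.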

Provisioned Throughput, Knowledge Bases, and When to Walk Away From Bedrock Entirely

On-demand pricing is great until you hit quota walls at scale. If your application is sending consistent, high-volume traffic to a specific model (think a customer-facing feature used by thousands of users during business hours), Provisioned Throughput might make more sense. You reserve Model Units (MUs) for a specific model, pay hourly regardless of usage, and get guaranteed throughput without ThrottlingExceptions.

Here's the honest math: a single MU for Claude 3 Sonnet runs about $60/hour. At 720 hours per month that's $43,200 per month, per MU. On-demand for the same volume might be cheaper, or wildly more expensive, depending on your actual token throughput. Run the numbers on your specific traffic pattern before committing. Provisioned Throughput has a minimum one-month commitment. I've seen teams lock in a MU for a feature that got descoped a week later.
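
The break-even arithmetic is worth scripting so you can rerun it as traffic grows. A sketch using the Sonnet prices quoted in this article (verify current pricing before committing; the example traffic numbers are invented):

```python
def on_demand_monthly_cost(input_tokens_m: float, output_tokens_m: float,
                           in_price_per_k: float = 0.003,
                           out_price_per_k: float = 0.015) -> float:
    """USD for a month of on-demand usage; token volumes given in millions."""
    return (input_tokens_m * 1_000 * in_price_per_k
            + output_tokens_m * 1_000 * out_price_per_k)

MU_HOURLY_USD = 60.0
mu_monthly = MU_HOURLY_USD * 720  # $43,200/month per MU, as above

# Example traffic: 2B input + 400M output tokens per month
demand = on_demand_monthly_cost(2_000, 400)
print(f"On-demand: ${demand:,.0f}/month vs 1 MU: ${mu_monthly:,.0f}/month")
```

At that volume on-demand wins by a wide margin; Provisioned Throughput only pays off once your sustained throughput approaches what one MU can actually deliver, so measure both sides.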

Bedrock Knowledge Bases gives you managed RAG: upload documents to S3, Bedrock chunks and embeds them into a vector store (OpenSearch Serverless or Pinecone), and your agent can query it semantically. For internal documentation bots or product knowledge bases it's genuinely useful and much faster to ship than building your own embedding pipeline. The gotcha: chunk size and overlap settings are critical and not obvious. Default chunking works fine for short Q&A docs, but for dense technical PDFs you'll get retrieval misses because the relevant context gets split across chunk boundaries.
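
If you do tune chunking, the settings live on the data source's vector ingestion configuration. A sketch of the fixed-size chunking shape as I understand the bedrock-agent CreateDataSource API; verify the field names against the current API reference, and treat 512/20 as a starting point for dense docs, not an AWS recommendation:

```python
def fixed_size_chunking(max_tokens: int = 512, overlap_pct: int = 20) -> dict:
    """Build a vectorIngestionConfiguration block for create_data_source."""
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": max_tokens,           # bigger chunks = fewer boundary splits
                "overlapPercentage": overlap_pct,  # overlap recovers context cut at boundaries
            },
        }
    }

# Passed as vectorIngestionConfiguration=fixed_size_chunking() in the
# bedrock-agent client's create_data_source call.
```

Whatever values you pick, re-run your retrieval evaluation set after changing them; chunking changes require re-ingesting the data source.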

When should you not use Bedrock? Three clear signals: you need a model Bedrock doesn't offer (GPT-4o, Gemini Ultra: you're calling OpenAI/Google directly regardless), you need sub-100ms inference latency at scale (shared fleet variance won't get you there; look at SageMaker JumpStart with a dedicated endpoint), or you need fine-tuned models on highly proprietary data where sending data to a third-party API is a compliance non-starter. Bedrock does support some fine-tuning workflows, but they're limited in model scope and more complex than advertised.

bedrock_knowledge_base_rag_query.py · PYTHON
# io.thecodeforge - DevOps tutorial
# RAG (Retrieval Augmented Generation) pattern using Bedrock Knowledge Bases.
# Use case: internal engineering handbook bot that answers policy questions.
# The Knowledge Base is pre-populated with your company docs via the Bedrock console or CDK.

import boto3
import os
import json
from botocore.config import Config

KNOWLEDGE_BASE_ID = os.environ["BEDROCK_KB_ID"]   # e.g. "ABCD1234EF" - from Bedrock console
MODEL_ARN = (
    "arn:aws:bedrock:us-east-1::foundation-model/"
    "anthropic.claude-3-sonnet-20240229-v1:0"
)  # RetrieveAndGenerate requires the full ARN, not just the model ID

boto_config = Config(
    region_name=os.environ.get("AWS_REGION", "us-east-1"),
    connect_timeout=5,
    read_timeout=60,
    retries={"max_attempts": 2, "mode": "adaptive"}
)

# Note: Knowledge Bases uses the 'bedrock-agent-runtime' client - NOT 'bedrock-runtime'.
# Using the wrong client gives you a NoRegionError or AttributeError with no useful message.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", config=boto_config)


def query_engineering_handbook(question: str, max_retrieved_chunks: int = 5) -> dict:
    """
    Query the engineering handbook Knowledge Base using RetrieveAndGenerate.
    This is the fully managed RAG path - Bedrock handles retrieval + generation in one call.

    For transparency/debugging: also returns the source citations so you can verify
    the model isn't hallucinating answers that aren't in the docs.
    """

    try:
        response = bedrock_agent_runtime.retrieve_and_generate(
            input={"text": question},
            retrieveAndGenerateConfiguration={
                "type": "KNOWLEDGE_BASE",
                "knowledgeBaseConfiguration": {
                    "knowledgeBaseId": KNOWLEDGE_BASE_ID,
                    "modelArn": MODEL_ARN,
                    "retrievalConfiguration": {
                        "vectorSearchConfiguration": {
                            # Number of document chunks to retrieve before generation.
                            # Higher = more context but more tokens = higher cost and latency.
                            # 5 is a good starting point; tune based on your doc structure.
                            "numberOfResults": max_retrieved_chunks
                        }
                    },
                    "generationConfiguration": {
                        "promptTemplate": {
                            # Override the default prompt to enforce your preferred answer style.
                            # The $search_results$ placeholder is where retrieved chunks are injected.
                            "textPromptTemplate": (
                                "You are an assistant for TheCodeForge engineering team. "
                                "Answer based ONLY on the following retrieved context. "
                                "If the answer isn't in the context, say 'Not found in handbook.' "
                                "Do not invent policies or procedures.\n\n"
                                "Context:\n$search_results$\n\n"
                                f"Question: {question}"
                            )
                        }
                    }
                }
            }
        )

        answer = response["output"]["text"]

        # Extract citations — each citation maps to a specific chunk in your S3 docs.
        # In production: surface these to the user so they can verify the source.
        citations = []
        for citation in response.get("citations", []):
            for reference in citation.get("retrievedReferences", []):
                location = reference.get("location", {}).get("s3Location", {})
                citations.append({
                    "source_uri": location.get("uri", "unknown"),
                    "excerpt": reference.get("content", {}).get("text", "")[:200]  # truncate for display
                })

        return {
            "answer": answer,
            "citations": citations,
            "citation_count": len(citations)
        }

    except bedrock_agent_runtime.exceptions.ResourceNotFoundException:
        raise ValueError(
            f"Knowledge Base {KNOWLEDGE_BASE_ID} not found. "
            "Check the ID and ensure the KB is in 'Active' status β€” "
            "embedding ingestion must complete before queries work."
        )
    except Exception as e:
        raise RuntimeError(f"Knowledge Base query failed: {e}") from e


if __name__ == "__main__":
    result = query_engineering_handbook(
        "What's our policy on hotfixing directly to the main branch?"
    )

    print(f"Answer:\n{result['answer']}\n")
    print(f"Sources ({result['citation_count']} retrieved):")
    for i, citation in enumerate(result["citations"], 1):
        print(f"  [{i}] {citation['source_uri']}")
        print(f"      Excerpt: {citation['excerpt']}...")
▶ Output
Answer:
Direct hotfixes to the main branch are not permitted under standard policy. All changes, including urgent patches, must go through a pull request with at least one reviewer approval. For P0 incidents, an expedited review process applies: the on-call engineer can approve with documented justification in the PR description, followed by a post-incident review within 48 hours. Branch protection rules enforce this — direct pushes to main are blocked at the repository level.

Sources (3 retrieved):
[1] s3://thecodeforge-handbook/engineering/git-policy-v4.pdf
Excerpt: Direct hotfixes to the main branch are not permitted under standard policy. All changes, including urgent patches, must go through...
[2] s3://thecodeforge-handbook/engineering/incident-response-runbook.pdf
Excerpt: For P0 incidents, an expedited review process applies: the on-call engineer can approve with documented justification...
[3] s3://thecodeforge-handbook/engineering/branch-protection-setup.md
Excerpt: Branch protection rules enforce this β€” direct pushes to main are blocked at the repository level via GitHub rulesets...
⚠️ Senior Shortcut: Two-Stage RAG for Better Retrieval Accuracy
If RetrieveAndGenerate gives you retrieval misses on complex questions, split it into two calls: first call Retrieve only (the `retrieve` method on the bedrock-agent-runtime client) to get the raw chunks, then manually re-rank or filter them in your application code, then pass the filtered context to InvokeModel directly. This two-stage pattern costs slightly more but gives you control over what the model actually sees — and it surfaces retrieval quality issues that the end-to-end call hides.
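A minimal sketch of that two-stage pattern. The `retrieve` and `converse` calls are real boto3 APIs, but the score threshold is a naive stand-in for a proper re-ranker, and the function names, model ID, and 10-chunk retrieval depth are illustrative choices:

```python
def filter_chunks(retrieval_results: list, min_score: float = 0.5) -> list:
    """Keep only chunks above a relevance score threshold.

    A plain cutoff stands in for a real re-ranker; the point of the two-stage
    pattern is that your code decides what the model actually sees.
    """
    return [
        r["content"]["text"]
        for r in retrieval_results
        if r.get("score", 0.0) >= min_score
    ]


def two_stage_rag(question: str, kb_id: str, model_id: str) -> str:
    import boto3  # imported here so filter_chunks stays AWS-free and testable

    agent_rt = boto3.client("bedrock-agent-runtime")
    runtime = boto3.client("bedrock-runtime")

    # Stage 1: retrieval only -- Retrieve returns raw chunks plus relevance scores.
    retrieval = agent_rt.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 10}},
    )
    context = "\n---\n".join(filter_chunks(retrieval["retrievalResults"]))

    # Stage 2: generation through the regular runtime client (Converse API),
    # so only the chunks that survived filtering reach the model.
    response = runtime.converse(
        modelId=model_id,
        messages=[{
            "role": "user",
            "content": [{"text": f"Answer based ONLY on this context:\n\n{context}"
                                 f"\n\nQuestion: {question}"}],
        }],
        inferenceConfig={"temperature": 0.0, "maxTokens": 512},
    )
    return response["output"]["message"]["content"][0]["text"]
```

Keeping the filter as a pure function means your retrieval quality logic is unit-testable without touching AWS at all.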
| Attribute | AWS Bedrock (On-Demand) | Self-Hosted on SageMaker / EC2 |
|---|---|---|
| Time to first inference | Minutes (API key + boto3 call) | Days to weeks (instance setup, model download, server config) |
| Infrastructure ops burden | Zero — Amazon's problem | High — your team owns scaling, patching, CUDA versions |
| Latency consistency (p99) | Variable — shared fleet, expect 2-5x p50 | Predictable — dedicated hardware, tunable |
| Cost at low volume (<10M tokens/month) | Cheap — pure pay-per-token | Expensive — idle GPU compute is still billed |
| Cost at high volume (>1B tokens/month) | Expensive — per-token adds up fast | Cheaper if utilisation is high and model is stable |
| Model selection | Limited to Bedrock catalogue (Claude, Titan, Llama, Mistral, Cohere) | Any open-weight model you can run |
| Data sovereignty / compliance | Data processed by AWS — review BAA requirements | Full control — data never leaves your VPC |
| Fine-tuning support | Limited — select models only, constrained workflow | Full control — any fine-tuning framework |
| Quota / rate limits | Default 5 TPS for most models — requires support ticket to raise | Self-imposed — limited by your hardware |
| Cold start latency | None — fleet is always warm | Real — model loading can take 30-90s on first call |
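The crossover the cost rows point at is a one-line division. A sketch with both rates as inputs (pricing drifts, so look the numbers up yourself; a real evaluation also splits input vs output tokens and checks the commitment's throughput ceiling):

```python
def breakeven_tokens_per_month(on_demand_per_1k: float,
                               commitment_per_month: float) -> float:
    """Monthly token volume above which a flat Provisioned Throughput-style
    commitment beats pay-per-token. Uses a single blended $/1K rate as a
    deliberate simplification.
    """
    return commitment_per_month / on_demand_per_1k * 1000.0


# Illustrative numbers only: a $10k/month commitment against a $0.01/1K blended
# rate crosses over at 1B tokens/month -- consistent with the table's bands.
print(f"{breakeven_tokens_per_month(0.01, 10_000):,.0f} tokens/month")
# 1,000,000,000 tokens/month
```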

🎯 Key Takeaways

  • Bedrock's value isn't the models β€” it's the elimination of the MLOps surface area. The moment you start managing GPU instances, inference servers, and model versioning yourself, you've hired an invisible infrastructure team that doesn't ship features.
  • Default TPS quotas will end your production launch. 5 requests/second is not a starting point β€” it's a demo limit. File the Service Quotas increase before you write your first line of application code, not the week before go-live.
  • The right signal to reach for Bedrock: your team needs model output, not model ownership, and your volume is below the crossover point where Provisioned Throughput beats on-demand pricing. If you're shipping a feature to real users and your team has no ML engineers, Bedrock is almost always the correct first choice.
  • Bedrock Agents' internal token consumption is invisible to your application logs and will be 5-15x higher than the final response tokens suggest. If you're not metering agent token usage from CloudTrail or the invocation metrics in the stream stop event, your cost model is fiction.

⚠ Common Mistakes to Avoid

  • ✕ Mistake 1: Hardcoding the model ID string (e.g. 'anthropic.claude-3-sonnet-20240229-v1:0') directly in application source code — when AWS deprecates that version ID or you want to upgrade, you're doing a multi-repo find-and-replace. Fix: store model IDs in AWS Systems Manager Parameter Store or environment variables, injected at deploy time via your CDK/Terraform config.
  • ✕ Mistake 2: Using the wrong boto3 client for Knowledge Bases — calling bedrock_runtime.retrieve_and_generate() instead of bedrock_agent_runtime.retrieve_and_generate() results in an AttributeError: 'BedrockRuntime' object has no attribute 'retrieve_and_generate' that looks like a boto3 version issue but isn't. Fix: Knowledge Base operations use boto3.client('bedrock-agent-runtime'), not boto3.client('bedrock-runtime'). Two different clients, different endpoints.
  • ✕ Mistake 3: Not requesting a Service Quotas increase before launch — default TPS for Claude 3 Sonnet is 5 requests/second in most regions for new accounts, and hitting this in production returns ThrottlingException with the message 'Too many requests, please wait before trying again'. Fix: go to AWS Service Quotas > Amazon Bedrock > find your model's 'On-demand throughput limit' and submit an increase request at least 10 business days before go-live.
  • ✕ Mistake 4: Treating Bedrock's streaming event loop like a simple for-loop without error handling mid-stream — if the connection drops at token 300 of a 600-token completion, the generator stops silently and the client renders a half-finished response as if it were complete. Fix: wrap the event_stream iteration in try/except and yield an explicit error marker if the loop exits unexpectedly, so the client can detect incomplete responses.
  • ✕ Mistake 5: Using temperature=1.0 (or the model default) for structured output tasks like JSON generation or classification — high temperature causes the model to occasionally produce malformed JSON or off-schema responses that break your downstream parser. Fix: set temperature between 0.0 and 0.2 for any task where format correctness matters more than creativity, and add output validation with a retry on parse failure.
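Mistake 4 is easiest to see in code. A hedged sketch of a stream wrapper that flags truncation instead of hiding it; the chunk parsing assumes the Anthropic messages streaming format (`content_block_delta` / `message_stop` events), so adapt it for other model families:

```python
import json


def stream_completion(runtime_client, model_id: str, request_body: dict):
    """Yield {'text': ...} chunks; yield {'error': ...} if the stream dies early.

    Sketch only -- event parsing assumes Anthropic's messages streaming format.
    """
    completed = False
    try:
        response = runtime_client.invoke_model_with_response_stream(
            modelId=model_id, body=json.dumps(request_body)
        )
        for event in response["body"]:
            chunk = json.loads(event["chunk"]["bytes"])
            if chunk.get("type") == "content_block_delta":
                yield {"text": chunk["delta"].get("text", "")}
            elif chunk.get("type") == "message_stop":
                completed = True  # the model finished cleanly
    except Exception as exc:
        yield {"error": f"stream failed mid-response: {exc}"}
        return
    if not completed:
        # The connection closed without a stop event: whatever the client has
        # rendered so far is a partial answer and must be marked as such.
        yield {"error": "stream ended without a stop event; response is partial"}
```

Because the client object is a parameter, the wrapper can be exercised with a fake stream in tests before it ever touches Bedrock.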

Interview Questions on This Topic

  • QBedrock's on-demand pricing model uses a shared inference fleet. How does that affect your p99 latency SLO design, and what would you change architecturally if your feature requires consistent sub-500ms responses?
  • QWhen would you choose Bedrock Agents over building your own LLM orchestration loop with LangChain or a custom state machine? What's the concrete threshold where Agents becomes more pain than it's worth?
  • QA Bedrock Agent is calling your Action Group Lambda and intermittently returning wrong answers without any errors in CloudWatch. The Lambda is executing correctly. What's your debugging process β€” and what's the most likely root cause?
  • QYour team is running 500 million tokens per month through Bedrock on-demand and the bill is becoming significant. Walk me through how you'd evaluate whether Provisioned Throughput makes financial sense, and what data you'd need before committing to a reserved MU.

Frequently Asked Questions

How much does AWS Bedrock actually cost in production?

It depends entirely on token volume and model choice, but here's the concrete breakdown: Claude 3 Sonnet costs $0.003 per 1,000 input tokens and $0.015 per 1,000 output tokens at time of writing. A typical customer support response that consumes 500 input tokens and 200 output tokens costs roughly $0.0045 — under half a cent. At 100,000 requests per day that's $450/day or ~$13,500/month just in inference costs, before any Agents or Knowledge Base overhead. The cost curve is steep at scale, which is why you should be tracking token usage from day one, not month three.
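The arithmetic in that answer, as a sketch you can rerun with current rates; the per-1K prices below are the Claude 3 Sonnet on-demand figures quoted above and will drift:

```python
# Claude 3 Sonnet on-demand rates quoted above -- check current pricing before relying on them.
INPUT_PER_1K = 0.003
OUTPUT_PER_1K = 0.015


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Inference cost of a single request, in dollars."""
    return (input_tokens / 1000) * INPUT_PER_1K + (output_tokens / 1000) * OUTPUT_PER_1K


per_request = request_cost(500, 200)   # 0.0015 input + 0.0030 output = 0.0045
daily = per_request * 100_000          # 100k requests/day
print(f"per request ${per_request:.4f}, daily ${daily:.0f}, monthly ${daily * 30:,.0f}")
# per request $0.0045, daily $450, monthly $13,500
```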

What's the difference between AWS Bedrock and SageMaker for running AI models?

Bedrock is a managed API for calling pre-trained foundation models — you don't manage any infrastructure. SageMaker is a full ML platform where you can deploy any model (including custom or fine-tuned ones) on dedicated endpoints you control. The rule of thumb: use Bedrock when you want to call a foundation model and ship fast; use SageMaker when you need consistent low-latency inference, models outside Bedrock's catalogue, or full control over the serving environment.

How do I handle Bedrock ThrottlingException in production without losing requests?

Use boto3's 'adaptive' retry mode with max_attempts set to 3-5, which applies exponential backoff with jitter automatically. For user-facing features, wrap the call in a queue with a dead-letter path so throttled requests don't just disappear. Long-term fix: request a Service Quotas increase for your specific model's TPS limit via the AWS console — the default limits are designed for development, not production traffic.

Can Bedrock Agents maintain conversation history across multiple user sessions?

Within a single session, yes — Bedrock Agents maintain context for up to one hour using a sessionId you provide. Across separate sessions or after the one-hour timeout, no — the agent has zero memory. For persistent cross-session memory you need to store conversation history in your own database (DynamoDB is the obvious choice), retrieve the relevant history at the start of each new session, and inject it into the agent's initial prompt or as part of your Action Group context. This is a design requirement, not a configuration option.
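A hedged sketch of that pattern. The `invoke_agent` call is the real boto3 API, but the `agent-conversations` table, its attribute names, and the plain-text transcript format are all illustrative choices you'd adapt:

```python
def fold_history(history: str, user_input: str) -> str:
    """Prepend prior turns to the new input; plain text is a stand-in for
    whatever structured transcript format your agent instructions expect."""
    if not history:
        return user_input
    return f"Previous conversation:\n{history}\n\nUser: {user_input}"


def invoke_with_memory(agent_id: str, alias_id: str, session_id: str,
                       user_id: str, user_input: str) -> str:
    import boto3  # imported here so fold_history stays dependency-free

    table = boto3.resource("dynamodb").Table("agent-conversations")  # hypothetical table
    agent_rt = boto3.client("bedrock-agent-runtime")

    # Load whatever earlier sessions left behind; the agent itself remembers
    # nothing once the one-hour session window closes.
    item = table.get_item(Key={"user_id": user_id}).get("Item", {})
    history = item.get("history", "")

    response = agent_rt.invoke_agent(
        agentId=agent_id,
        agentAliasId=alias_id,
        sessionId=session_id,
        inputText=fold_history(history, user_input),
    )
    # invoke_agent streams its answer back as completion chunks.
    answer = "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )

    # Persist the new turn so the next session can pick it up.
    table.put_item(Item={
        "user_id": user_id,
        "history": f"{history}\nUser: {user_input}\nAgent: {answer}".strip(),
    })
    return answer
```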

Naren — Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
