AWS Bedrock: Build Production GenAI Apps Without the MLOps Tax
- Bedrock's value isn't the models; it's the elimination of the MLOps surface area. The moment you start managing GPU instances, inference servers, and model versioning yourself, you've hired an invisible infrastructure team that doesn't ship features.
- Default TPS quotas will end your production launch. 5 requests/second is not a starting point; it's a demo limit. File the Service Quotas increase before you write your first line of application code, not the week before go-live.
- The right signal to reach for Bedrock: your team needs model output, not model ownership, and your volume is below the crossover point where Provisioned Throughput beats on-demand pricing. If you're shipping a feature to real users and your team has no ML engineers, Bedrock is almost always the correct first choice.
A fintech startup I consulted for spent four months standing up a self-hosted Llama 2 cluster on EC2. GPU reservations, CUDA driver mismatches, custom inference servers, auto-scaling that never quite worked right. They burned $180k in compute before their first user ever typed a prompt. AWS Bedrock would have had them in production in an afternoon for a few cents per thousand tokens. That's not a sales pitch; it's a pattern I've watched repeat at least six times across different orgs.
Bedrock solves a specific and expensive problem: most product teams don't need to run a model; they need a model's output. The operational surface area between those two things is massive. You're talking GPU fleet management, model versioning, inference server tuning, cold-start latency, and on-call rotations that wake up ML engineers at 2am because the VRAM exploded under load. Bedrock collapses all of that into a single API. You pick a model, send a request, get a response. The fleet management, the scaling, the hardware: all Amazon's problem now.
After reading this you'll be able to: wire up Bedrock's InvokeModel API in a real service context, implement streaming responses without blocking your web workers, set up Bedrock Agents for multi-step task orchestration, avoid the three quota and cost traps that silently destroy GenAI budgets, and make an informed decision about when Bedrock is the right call versus when you actually do need to self-host.
The Bedrock Model: What You're Actually Paying For and How It Routes
Before you write a single line of code, understand what Bedrock is under the hood, because the mental model directly affects how you design for cost, latency, and failure.
Bedrock is a managed inference proxy. When you call InvokeModel, you're not getting a dedicated GPU instance. Your request goes into Amazon's shared inference fleet for that model family. Amazon handles queuing, routing, scaling, and the hardware underneath. You pay per input token and per output token. There's no idle cost and no reserved capacity fee by default, unless you opt into Provisioned Throughput, which we'll get to.
This shared-fleet model is why you'll see latency variance that would be unacceptable from your own infrastructure. On a busy Tuesday afternoon, a Claude 3 Sonnet call might take 800ms. On Sunday at 6am it might take 280ms. You don't control that. Plan for p99 latency, not the average. I've seen teams build chatbots that felt broken in production because they load-tested at 2am and designed for 400ms response times, then watched their 9am Monday demo crawl.
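If you're instrumenting this yourself, resist the temptation to average. Here is a minimal sketch of the percentile math; the `percentile` helper and the sample numbers are illustrative, not part of any AWS SDK:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile over observed latencies (milliseconds)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# Latencies you might record around InvokeModel calls over a day
latencies_ms = [280, 310, 295, 420, 800, 305, 330, 290, 1150, 300]
print(f"p50={percentile(latencies_ms, 50)}ms p99={percentile(latencies_ms, 99)}ms")
# -> p50=305ms p99=1150ms
# The mean (~448ms) hides the 1150ms tail that your users actually feel
```

In production you'd emit these as CloudWatch percentile metrics rather than computing them in-process; the point is to set your SLO and your alarms on p99, not p50.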
The model IDs matter more than you think. They're not stable aliases; they're versioned strings like anthropic.claude-3-sonnet-20240229-v1:0. When Anthropic ships a new version, the old ID stays available but you don't get automatically migrated. That's intentional. But it means you need a config-driven model ID system, not hardcoded strings in your service. Teams that hardcode model IDs end up doing find-and-replace across repos when they want to upgrade, which is exactly as painful as it sounds.
```python
# io.thecodeforge - DevOps tutorial
import boto3
import json
import os
from botocore.config import Config
from botocore.exceptions import ClientError

# Config-driven model ID - never hardcode this in your service layer.
# Pull from environment or parameter store so upgrades don't require redeploys.
MODEL_ID = os.environ.get("BEDROCK_MODEL_ID", "anthropic.claude-3-sonnet-20240229-v1:0")
AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")

# Set explicit timeouts. Bedrock calls on large prompts can run 30-60s.
# Without them, slow or hung calls quietly tie up your web workers.
boto_config = Config(
    region_name=AWS_REGION,
    connect_timeout=5,   # fail fast if the endpoint is unreachable
    read_timeout=120,    # long enough for large completions, not infinite
    retries={
        "max_attempts": 3,
        "mode": "adaptive"  # exponential backoff with jitter - don't use 'legacy' mode in prod
    }
)

bedrock_runtime = boto3.client("bedrock-runtime", config=boto_config)


def invoke_document_summariser(raw_document: str, max_tokens: int = 1024) -> dict:
    """
    Production pattern: document summarisation for a content pipeline.
    Returns structured output including token usage so the caller can track cost.
    """
    # Claude models use the Messages API format - not the legacy text-completion format.
    # Mixing them up gives you a cryptic ValidationException, not a helpful error.
    request_body = {
        "anthropic_version": "bedrock-2023-05-31",  # required field for Anthropic models on Bedrock
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": (
                    "Summarise the following document in 3 bullet points. "
                    "Focus on decisions made, not background context.\n\n"
                    f"{raw_document}"
                )
            }
        ],
        "temperature": 0.2,  # low temp for summarisation - you want deterministic, not creative
    }

    try:
        response = bedrock_runtime.invoke_model(
            modelId=MODEL_ID,
            contentType="application/json",
            accept="application/json",
            body=json.dumps(request_body)
        )
        response_body = json.loads(response["body"].read())

        # Always capture usage - this is your cost telemetry.
        # Log it to CloudWatch metrics or your billing system. Don't discard it.
        input_tokens = response_body["usage"]["input_tokens"]
        output_tokens = response_body["usage"]["output_tokens"]
        summary_text = response_body["content"][0]["text"]

        return {
            "summary": summary_text,
            "model_id": MODEL_ID,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            # Rough cost estimate for Claude 3 Sonnet at time of writing:
            # $0.003/1K input tokens, $0.015/1K output tokens
            "estimated_cost_usd": round(
                (input_tokens / 1000 * 0.003) + (output_tokens / 1000 * 0.015), 6
            )
        }
    except ClientError as e:
        error_code = e.response["Error"]["Code"]
        error_message = e.response["Error"]["Message"]

        # ThrottlingException hits when you exceed your account's TPS quota.
        # Default is 5 TPS for Claude 3 Sonnet in most regions - shockingly low for production.
        if error_code == "ThrottlingException":
            raise RuntimeError(
                f"Bedrock quota exceeded for model {MODEL_ID}. "
                "Request a limit increase via Service Quotas before going live."
            ) from e
        # ValidationException usually means a malformed request body - check your model's spec.
        if error_code == "ValidationException":
            raise ValueError(f"Invalid request format for {MODEL_ID}: {error_message}") from e
        raise RuntimeError(f"Bedrock API error [{error_code}]: {error_message}") from e


if __name__ == "__main__":
    sample_doc = """
    Engineering Review - Q3 Platform Migration
    Decision: Move API gateway to AWS API Gateway v2 (HTTP APIs).
    Rationale: 60% cost reduction vs REST APIs for our traffic pattern.
    Rejected alternative: Kong on EKS - operational overhead too high for current team size.
    Timeline: Cutover scheduled for October 15th. Rollback plan approved.
    Owner: Platform team. Risk: Medium. Stakeholder sign-off: CTO, VP Engineering.
    """
    result = invoke_document_summariser(sample_doc)
    print(f"Summary:\n{result['summary']}")
    print(f"\nTokens - Input: {result['input_tokens']} | Output: {result['output_tokens']}")
    print(f"Estimated cost: ${result['estimated_cost_usd']}")
```
```text
• Decided to migrate API gateway to AWS API Gateway v2 (HTTP APIs) for a 60% cost reduction.
• Rejected Kong on EKS due to excessive operational overhead for the current team size.
• Cutover set for October 15th with an approved rollback plan; medium risk, sign-off from CTO and VP Engineering.

Tokens - Input: 187 | Output: 73
Estimated cost: $0.001662
```
Streaming Responses: Stop Blocking Your Threads and Start Shipping Perceived Speed
Here's what kills GenAI UX before a user ever reads a word: a 12-second blank screen while your server waits for the full completion before flushing anything to the client. Users think it's broken. They hit refresh. You get duplicate charges. Your support queue fills up.
Bedrock's InvokeModelWithResponseStream fixes this. It returns a streaming event iterator: text chunks arrive as the model generates them, and you pipe each chunk to the client immediately. From the user's perspective, text starts appearing in under a second and keeps flowing. Perceived latency drops dramatically even when total generation time is identical.
The tricky part isn't the streaming itself; it's the infrastructure around it. Your web framework needs to support streaming responses, your load balancer's idle timeout needs to be high enough (ALB defaults to 60s, too low for long completions), and your error handling needs to account for the fact that the stream can fail mid-response. I've seen services that catch exceptions from InvokeModel just fine but have zero error handling inside the stream event loop, so when the stream dies at token 400 of a 600-token response, the client gets a truncated response with no indication that something went wrong. Silent data loss in a production AI system is a bad day.
```python
# io.thecodeforge - DevOps tutorial
import boto3
import json
import os
from botocore.config import Config
from typing import Generator

MODEL_ID = os.environ.get("BEDROCK_MODEL_ID", "anthropic.claude-3-sonnet-20240229-v1:0")

boto_config = Config(
    region_name=os.environ.get("AWS_REGION", "us-east-1"),
    connect_timeout=5,
    read_timeout=300,  # streaming completions can run long - 120s isn't always enough
    retries={"max_attempts": 1, "mode": "standard"}  # don't retry mid-stream - retry at the caller level
)

bedrock_runtime = boto3.client("bedrock-runtime", config=boto_config)


def stream_customer_support_response(
    customer_query: str,
    account_context: dict
) -> Generator[str, None, None]:
    """
    Production pattern: real-time customer support response generation.
    Yields text chunks as a generator so the caller (e.g. FastAPI StreamingResponse)
    can flush each chunk to the HTTP client immediately.

    account_context: dict with keys like 'plan', 'open_tickets', 'last_login'
    """
    # Build a system prompt from account context so the model responds with
    # customer-specific information rather than generic advice.
    system_prompt = (
        f"You are a support agent for TheCodeForge platform. "
        f"The customer is on the {account_context.get('plan', 'free')} plan. "
        f"They have {account_context.get('open_tickets', 0)} open support tickets. "
        "Be concise, direct, and actionable. Do not apologise excessively."
    )

    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "system": system_prompt,
        "messages": [
            {"role": "user", "content": customer_query}
        ],
        "temperature": 0.3,
    }

    try:
        streaming_response = bedrock_runtime.invoke_model_with_response_stream(
            modelId=MODEL_ID,
            contentType="application/json",
            accept="application/json",
            body=json.dumps(request_body)
        )

        # The body is an event stream - each event may carry a chunk of the response
        event_stream = streaming_response["body"]

        for event in event_stream:
            chunk = event.get("chunk")
            if not chunk:
                # Non-chunk events exist (metadata, message_start, etc.) - skip them gracefully
                continue

            chunk_data = json.loads(chunk["bytes"].decode("utf-8"))

            # Claude streaming emits different event types - only 'content_block_delta' carries text
            if chunk_data.get("type") == "content_block_delta":
                delta = chunk_data.get("delta", {})
                if delta.get("type") == "text_delta":
                    text_piece = delta.get("text", "")
                    if text_piece:
                        yield text_piece  # flush this chunk to the caller immediately

            # message_stop event signals clean completion - log it for observability
            elif chunk_data.get("type") == "message_stop":
                # Invocation metrics arrive in the stop event - useful for billing dashboards
                amazon_metrics = chunk_data.get("amazon-bedrock-invocationMetrics", {})
                input_tokens = amazon_metrics.get("inputTokenCount", 0)
                output_tokens = amazon_metrics.get("outputTokenCount", 0)
                # In production: emit these as CloudWatch custom metrics here
                print(f"[Stream complete] input={input_tokens} output={output_tokens} tokens")

    except Exception as e:
        # Critical: yield an error marker so the client knows the stream died mid-response.
        # Don't silently stop - the client will think the truncated response is complete.
        yield f"\n[ERROR: Response generation interrupted - {type(e).__name__}]"
        raise


# --- Simulated FastAPI usage (shows how the generator plugs into a real web framework) ---
# In production this would be in your router module:
#
# from fastapi import FastAPI
# from fastapi.responses import StreamingResponse
#
# app = FastAPI()
#
# @app.post("/support/stream")
# async def stream_support(query: SupportQuery):
#     account_ctx = fetch_account_context(query.account_id)  # your DB call
#     return StreamingResponse(
#         stream_customer_support_response(query.text, account_ctx),
#         media_type="text/plain"
#     )

if __name__ == "__main__":
    query = "I deployed to production and my API calls are returning 429s. What do I do?"
    context = {"plan": "pro", "open_tickets": 1, "last_login": "2024-03-15"}

    print("Streaming response:\n")
    for chunk in stream_customer_support_response(query, context):
        print(chunk, end="", flush=True)  # flush=True is essential - don't buffer
    print("\n")
```
```text
You're hitting rate limits (429 = Too Many Requests). On the Pro plan here's what to check:

1. **Check your current usage** - log into the dashboard under Settings > API Usage to see if you've hit your monthly request cap.
2. **Implement exponential backoff** - your client should retry with delays of 1s, 2s, 4s before failing. Most SDKs have this built in.
3. **Check for runaway processes** - a misconfigured retry loop can exhaust your quota in minutes. Look for repeated identical requests in your logs.

If you're within quota and still seeing 429s, open a ticket with your API key and a sample request timestamp - that points to a server-side issue we'll trace on our end.

[Stream complete] input=98 output=143 tokens
```
Bedrock Agents: When Single Prompts Aren't Enough and You Need Actual Orchestration
A single InvokeModel call works great when your task is stateless: summarise this, classify that, generate this copy. The moment your task requires multiple steps (look up customer data, reason about it, call an API, generate a response based on the result) you're either building your own orchestration loop or you're using Bedrock Agents.
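For scale, here is what "building your own orchestration loop" amounts to in its simplest form: a sketch with a stubbed model and a toy tool registry. None of this is a Bedrock API; in real code the model callable would wrap InvokeModel and parse the model's tool-use output:

```python
import json
from typing import Callable

def run_tool_loop(model: Callable[[list], dict], tools: dict[str, Callable],
                  user_msg: str, max_steps: int = 5) -> str:
    """Minimal ReAct-style loop: the model either requests a tool or returns a final answer."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        decision = model(messages)  # {"tool": name, "input": ...} or {"answer": ...}
        if "answer" in decision:
            return decision["answer"]
        observation = tools[decision["tool"]](decision["input"])
        messages.append({"role": "tool", "content": json.dumps(observation)})
    raise RuntimeError("agent loop exceeded max_steps")

# Stub model: requests the order lookup once, then answers from the observation
def stub_model(messages):
    if messages[-1]["role"] == "user":
        return {"tool": "get_order", "input": "ORD-88421"}
    status = json.loads(messages[-1]["content"])["status"]
    return {"answer": f"Order ORD-88421 is {status}."}

tools = {"get_order": lambda order_id: {"orderId": order_id, "status": "SHIPPED"}}
print(run_tool_loop(stub_model, tools, "Where is my order ORD-88421?"))
# -> Order ORD-88421 is SHIPPED.
```

The hard parts Bedrock Agents takes off your plate are everything this sketch omits: prompt scaffolding for tool selection, schema validation, retries, and session state.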
Bedrock Agents is Amazon's managed multi-step reasoning engine. You define the agent's instructions (its persona and scope), attach Action Groups (Lambda functions that the agent can invoke), and optionally connect a Knowledge Base (a vector store backed by your documents). The agent runs a ReAct-style loop: it reasons about the user's request, decides which actions to take, calls your Lambdas, observes the results, and iterates until it has enough information to respond.
The thing most tutorials won't tell you: the agent's internal reasoning chain costs tokens you don't see upfront. Every step in the loop, including the model's internal 'thinking' about which action to call, burns input and output tokens. On complex multi-step tasks I've seen agents consume 10-15x the tokens you'd expect from reading the final answer alone. Budget for it. Also, Bedrock Agents has a fixed session timeout of one hour. Any stateful conversation longer than that needs explicit session management on your side; the agent won't remember anything after the session expires.
The sweet spot for Agents is internal tooling: HR bots that query Workday, DevOps assistants that check CloudWatch alarms and summarise them, customer-facing support bots that can actually look up order status. Tasks where the answer genuinely requires calling real systems, not just reasoning over embedded knowledge.
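To budget for that hidden token burn, sum usage across every step of the trace, not just the final answer. A rough estimator sketch, assuming you've already sampled per-step token counts from the invocation metrics (the trace numbers are illustrative; the rates are the Sonnet on-demand figures quoted earlier):

```python
def estimate_agent_cost(steps: list[dict], input_rate: float = 0.003,
                        output_rate: float = 0.015) -> dict:
    """Sum token usage across every reasoning/action step of an agent trace.
    Rates are USD per 1K tokens (Claude 3 Sonnet on-demand at time of writing)."""
    input_tokens = sum(s["input_tokens"] for s in steps)
    output_tokens = sum(s["output_tokens"] for s in steps)
    cost = input_tokens / 1000 * input_rate + output_tokens / 1000 * output_rate
    return {"input_tokens": input_tokens, "output_tokens": output_tokens,
            "cost_usd": round(cost, 6)}

# A 4-step agent trace: each step re-sends growing context, so input tokens climb
trace = [
    {"input_tokens": 900,  "output_tokens": 120},  # initial reasoning
    {"input_tokens": 1100, "output_tokens": 60},   # action selection
    {"input_tokens": 1400, "output_tokens": 90},   # observe result, re-plan
    {"input_tokens": 1600, "output_tokens": 180},  # final answer
]
print(estimate_agent_cost(trace))
# -> {'input_tokens': 5000, 'output_tokens': 450, 'cost_usd': 0.02175}
```

Note the final answer alone is 180 output tokens, but the trace billed 5,450 tokens total; that's the multiplier effect in miniature.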
```python
# io.thecodeforge - DevOps tutorial
# This Lambda is an Action Group handler for a Bedrock Agent.
# The agent calls this function when it needs to look up order status.
# Deploy this as a Lambda, then wire it to your Agent via the Bedrock console or CDK.

import json
import os
from datetime import datetime

import boto3

# In production: pull from environment, not hardcoded table names
ORDERS_TABLE = os.environ.get("ORDERS_TABLE_NAME", "platform-orders-prod")

dynamodb = boto3.resource("dynamodb")
orders_table = dynamodb.Table(ORDERS_TABLE)


def lambda_handler(event: dict, context) -> dict:
    """
    Bedrock Agent Action Group handler.
    The agent sends a specific event structure - you must return a specific structure back.
    Deviate from the response format and the agent silently fails or hallucinates an answer.
    """
    # Bedrock Agents wraps function calls in this structure
    agent_action = event.get("actionGroup", "")
    api_path = event.get("apiPath", "")        # matches your OpenAPI schema path
    http_method = event.get("httpMethod", "")  # GET, POST, etc. - from your schema
    parameters = event.get("parameters", [])   # list of {name, type, value} dicts

    print(f"[Agent Action] group={agent_action} path={api_path} method={http_method}")

    # Route to the appropriate handler based on the API path
    if api_path == "/orders/{orderId}" and http_method == "GET":
        order_id = next(
            (p["value"] for p in parameters if p["name"] == "orderId"), None
        )
        result = fetch_order_status(order_id)
    elif api_path == "/orders/{orderId}/cancel" and http_method == "POST":
        order_id = next(
            (p["value"] for p in parameters if p["name"] == "orderId"), None
        )
        result = cancel_order(order_id)
    else:
        result = {"error": f"Unknown action path: {api_path}"}

    # Bedrock Agents REQUIRES this exact response envelope.
    # Missing 'messageVersion', 'response', or 'actionGroup' fields = silent agent failure.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": agent_action,
            "apiPath": api_path,
            "httpMethod": http_method,
            "httpStatusCode": 200 if "error" not in result else 400,
            "responseBody": {
                "application/json": {
                    "body": json.dumps(result)
                }
            }
        }
    }


def fetch_order_status(order_id: str) -> dict:
    """Look up a real order from DynamoDB and return structured status."""
    if not order_id:
        return {"error": "orderId is required"}
    try:
        response = orders_table.get_item(Key={"orderId": order_id})
        item = response.get("Item")
        if not item:
            # Be specific - the agent will relay this message verbatim to the user
            return {"error": f"Order {order_id} not found. It may not exist or may be archived."}
        return {
            "orderId": item["orderId"],
            "status": item["status"],  # e.g. PROCESSING, SHIPPED, DELIVERED
            "estimatedDelivery": item.get("estimatedDelivery", "unknown"),
            "carrier": item.get("carrier", "not yet assigned"),
            "trackingNumber": item.get("trackingNumber", "not yet assigned"),
            "lastUpdated": item.get("lastUpdated", "")
        }
    except Exception as e:
        # Don't expose raw exception messages to the agent - it may relay them to the user
        print(f"[ERROR] DynamoDB lookup failed for order {order_id}: {e}")
        return {"error": "Order lookup temporarily unavailable. Please try again shortly."}


def cancel_order(order_id: str) -> dict:
    """Cancel an order if it's still in PROCESSING state."""
    if not order_id:
        return {"error": "orderId is required"}
    try:
        # Conditional update - only cancel if status is PROCESSING.
        # This prevents the agent from cancelling already-shipped orders.
        orders_table.update_item(
            Key={"orderId": order_id},
            UpdateExpression="SET #s = :cancelled, lastUpdated = :now",
            ConditionExpression="#s = :processing",
            ExpressionAttributeNames={"#s": "status"},  # 'status' is a reserved word in DynamoDB
            ExpressionAttributeValues={
                ":cancelled": "CANCELLED",
                ":processing": "PROCESSING",
                ":now": datetime.utcnow().isoformat()
            }
        )
        return {"orderId": order_id, "status": "CANCELLED",
                "message": "Order successfully cancelled."}
    except dynamodb.meta.client.exceptions.ConditionalCheckFailedException:
        # Order exists but isn't in PROCESSING - give the agent a specific reason
        return {
            "error": f"Order {order_id} cannot be cancelled - it has already been shipped or delivered."
        }
    except Exception as e:
        print(f"[ERROR] Cancel failed for order {order_id}: {e}")
        return {"error": "Cancellation temporarily unavailable."}
```
```text
# The agent internally calls this Lambda with:
# { "actionGroup": "OrderManagement", "apiPath": "/orders/{orderId}/cancel",
#   "httpMethod": "POST", "parameters": [{"name": "orderId", "value": "ORD-88421"}] }
#
# Lambda fetches the order, finds status=PROCESSING, updates to CANCELLED.
# Lambda returns the structured envelope.
# Agent receives the result and responds to the user:
#
# Agent: "I've cancelled order ORD-88421. You'll receive a confirmation email
#         within a few minutes and a refund within 3-5 business days."
#
# CloudWatch log output from Lambda:
# [Agent Action] group=OrderManagement path=/orders/{orderId}/cancel method=POST
```
Track the inputTokenCount and outputTokenCount from the agent's CloudTrail events before you go live. Otherwise your billing surprises will be significant and will arrive monthly.

Provisioned Throughput, Knowledge Bases, and When to Walk Away From Bedrock Entirely
On-demand pricing is great until you hit quota walls at scale. If your application is sending consistent, high-volume traffic to a specific model (think a customer-facing feature used by thousands of users during business hours) Provisioned Throughput might make more sense. You reserve Model Units (MUs) for a specific model, pay hourly regardless of usage, and get guaranteed throughput without ThrottlingExceptions.
Here's the honest math: a single MU for Claude 3 Sonnet runs about $60/hour. At 720 hours per month that's $43,200 per month, per MU. On-demand for the same volume might be cheaper, or wildly more expensive, depending on your actual token throughput. Run the numbers on your specific traffic pattern before committing. Provisioned Throughput has a minimum one-month commitment. I've seen teams lock in an MU for a feature that got descoped a week later.
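The break-even is straightforward to sketch once you know your monthly token mix. A toy comparison using the Sonnet on-demand rates quoted earlier and the ~$60/hour MU figure (both are point-in-time estimates; check current pricing, and the example volume is illustrative):

```python
def monthly_on_demand_cost(input_tokens: int, output_tokens: int,
                           input_rate: float = 0.003, output_rate: float = 0.015) -> float:
    """On-demand USD cost for a month of traffic; rates are per 1K tokens."""
    return input_tokens / 1000 * input_rate + output_tokens / 1000 * output_rate

def provisioned_cost(mu_count: int = 1, hourly_rate: float = 60.0, hours: int = 720) -> float:
    """Hourly MU billing: you pay for every hour whether or not traffic arrives."""
    return mu_count * hourly_rate * hours

# Example: 2B input / 500M output tokens per month
on_demand = monthly_on_demand_cost(2_000_000_000, 500_000_000)
print(f"on-demand: ${on_demand:,.0f}/mo vs 1 MU: ${provisioned_cost():,.0f}/mo")
# -> on-demand: $13,500/mo vs 1 MU: $43,200/mo
```

At that volume on-demand still wins on price; until your token volume is several times higher, the MU is buying you guaranteed throughput, not savings.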
Bedrock Knowledge Bases gives you managed RAG: upload documents to S3, Bedrock chunks and embeds them into a vector store (OpenSearch Serverless or Pinecone), and your agent can query it semantically. For internal documentation bots or product knowledge bases it's genuinely useful and much faster to ship than building your own embedding pipeline. The gotcha: chunk size and overlap settings are critical and not obvious. Default chunking works fine for short Q&A docs, but for dense technical PDFs you'll get retrieval misses because the relevant context gets split across chunk boundaries.
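The boundary-splitting problem is easy to see with a toy chunker. This mimics fixed-size chunking with overlap in miniature (character-based and purely illustrative; Knowledge Bases chunks by tokens and you configure it at ingestion time, not in code like this):

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap. A sentence straddling a chunk boundary
    only survives intact in some chunk if the overlap window covers it."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step back by `overlap` chars each time
    return chunks

doc = "x" * 1200
no_overlap = chunk_text(doc, 300, 0)
with_overlap = chunk_text(doc, 300, 50)
print(len(no_overlap), len(with_overlap))
# -> 4 5
```

The overlapped version costs one extra chunk of embedding and retrieval budget; for dense technical docs that redundancy is usually worth paying.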
When should you not use Bedrock? Three clear signals: you need a model Bedrock doesn't offer (GPT-4o, Gemini Ultra: you're calling OpenAI/Google directly regardless), you need sub-100ms inference latency at scale (shared-fleet variance won't get you there; look at SageMaker JumpStart with a dedicated endpoint), or you need fine-tuned models on highly proprietary data where sending data to a third-party API is a compliance non-starter. Bedrock does support some fine-tuning workflows, but they're limited in model scope and more complex than advertised.
```python
# io.thecodeforge - DevOps tutorial
# RAG (Retrieval Augmented Generation) pattern using Bedrock Knowledge Bases.
# Use case: internal engineering handbook bot that answers policy questions.
# The Knowledge Base is pre-populated with your company docs via the Bedrock console or CDK.

import os

import boto3
from botocore.config import Config

KNOWLEDGE_BASE_ID = os.environ["BEDROCK_KB_ID"]  # e.g. "ABCD1234EF" - from the Bedrock console
MODEL_ARN = (
    "arn:aws:bedrock:us-east-1::foundation-model/"
    "anthropic.claude-3-sonnet-20240229-v1:0"
)  # RetrieveAndGenerate requires the full ARN, not just the model ID

boto_config = Config(
    region_name=os.environ.get("AWS_REGION", "us-east-1"),
    connect_timeout=5,
    read_timeout=60,
    retries={"max_attempts": 2, "mode": "adaptive"}
)

# Note: Knowledge Bases uses the 'bedrock-agent-runtime' client - NOT 'bedrock-runtime'.
# Using the wrong client fails with an AttributeError that looks like a boto3 version issue.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", config=boto_config)


def query_engineering_handbook(question: str, max_retrieved_chunks: int = 5) -> dict:
    """
    Query the engineering handbook Knowledge Base using RetrieveAndGenerate.
    This is the fully managed RAG path - Bedrock handles retrieval + generation in one call.

    For transparency/debugging: also returns the source citations so you can
    verify the model isn't hallucinating answers that aren't in the docs.
    """
    try:
        response = bedrock_agent_runtime.retrieve_and_generate(
            input={"text": question},
            retrieveAndGenerateConfiguration={
                "type": "KNOWLEDGE_BASE",
                "knowledgeBaseConfiguration": {
                    "knowledgeBaseId": KNOWLEDGE_BASE_ID,
                    "modelArn": MODEL_ARN,
                    "retrievalConfiguration": {
                        "vectorSearchConfiguration": {
                            # Number of document chunks to retrieve before generation.
                            # Higher = more context but more tokens = higher cost and latency.
                            # 5 is a good starting point; tune based on your doc structure.
                            "numberOfResults": max_retrieved_chunks
                        }
                    },
                    "generationConfiguration": {
                        "promptTemplate": {
                            # Override the default prompt to enforce your preferred answer style.
                            # The $search_results$ placeholder is where retrieved chunks are injected.
                            "textPromptTemplate": (
                                "You are an assistant for TheCodeForge engineering team. "
                                "Answer based ONLY on the following retrieved context. "
                                "If the answer isn't in the context, say 'Not found in handbook.' "
                                "Do not invent policies or procedures.\n\n"
                                "Context:\n$search_results$\n\n"
                                f"Question: {question}"
                            )
                        }
                    }
                }
            }
        )

        answer = response["output"]["text"]

        # Extract citations - each citation maps to a specific chunk in your S3 docs.
        # In production: surface these to the user so they can verify the source.
        citations = []
        for citation in response.get("citations", []):
            for reference in citation.get("retrievedReferences", []):
                location = reference.get("location", {}).get("s3Location", {})
                citations.append({
                    "source_uri": location.get("uri", "unknown"),
                    "excerpt": reference.get("content", {}).get("text", "")[:200]  # truncate for display
                })

        return {
            "answer": answer,
            "citations": citations,
            "citation_count": len(citations)
        }

    except bedrock_agent_runtime.exceptions.ResourceNotFoundException:
        raise ValueError(
            f"Knowledge Base {KNOWLEDGE_BASE_ID} not found. "
            "Check the ID and ensure the KB is in 'Active' status - "
            "embedding ingestion must complete before queries work."
        )
    except Exception as e:
        raise RuntimeError(f"Knowledge Base query failed: {e}") from e


if __name__ == "__main__":
    result = query_engineering_handbook(
        "What's our policy on hotfixing directly to the main branch?"
    )
    print(f"Answer:\n{result['answer']}\n")
    print(f"Sources ({result['citation_count']} retrieved):")
    for i, citation in enumerate(result["citations"], 1):
        print(f"  [{i}] {citation['source_uri']}")
        print(f"      Excerpt: {citation['excerpt']}...")
```
```text
Answer:
Direct hotfixes to the main branch are not permitted under standard policy. All changes, including urgent patches, must go through a pull request with at least one reviewer approval. For P0 incidents, an expedited review process applies: the on-call engineer can approve with documented justification in the PR description, followed by a post-incident review within 48 hours. Branch protection rules enforce this - direct pushes to main are blocked at the repository level.

Sources (3 retrieved):
  [1] s3://thecodeforge-handbook/engineering/git-policy-v4.pdf
      Excerpt: Direct hotfixes to the main branch are not permitted under standard policy. All changes, including urgent patches, must go through...
  [2] s3://thecodeforge-handbook/engineering/incident-response-runbook.pdf
      Excerpt: For P0 incidents, an expedited review process applies: the on-call engineer can approve with documented justification...
  [3] s3://thecodeforge-handbook/engineering/branch-protection-setup.md
      Excerpt: Branch protection rules enforce this - direct pushes to main are blocked at the repository level via GitHub rulesets...
```
| Attribute | AWS Bedrock (On-Demand) | Self-Hosted on SageMaker / EC2 |
|---|---|---|
| Time to first inference | Minutes (API key + boto3 call) | Days to weeks (instance setup, model download, server config) |
| Infrastructure ops burden | Zero - Amazon's problem | High - your team owns scaling, patching, CUDA versions |
| Latency consistency (p99) | Variable - shared fleet, expect 2-5x p50 | Predictable - dedicated hardware, tunable |
| Cost at low volume (<10M tokens/month) | Cheap - pure pay-per-token | Expensive - idle GPU compute is still billed |
| Cost at high volume (>1B tokens/month) | Expensive - per-token adds up fast | Cheaper if utilisation is high and model is stable |
| Model selection | Limited to Bedrock catalogue (Claude, Titan, Llama, Mistral, Cohere) | Any open-weight model you can run |
| Data sovereignty / compliance | Data processed by AWS - review BAA requirements | Full control - data never leaves your VPC |
| Fine-tuning support | Limited - select models only, constrained workflow | Full control - any fine-tuning framework |
| Quota / rate limits | Default 5 TPS for most models - requires support ticket to raise | Self-imposed - limited by your hardware |
| Cold start latency | None - fleet is always warm | Real - model loading can take 30-90s on first call |
🎯 Key Takeaways
- Bedrock Agents' internal token consumption is invisible to your application logs and will be 5-15x higher than the final response tokens suggest. If you're not metering agent token usage from CloudTrail or the invocation metrics in the stream stop event, your cost model is fiction.
❌ Common Mistakes to Avoid
- βMistake 1: Hardcoding the model ID string (e.g. 'anthropic.claude-3-sonnet-20240229-v1:0') directly in application source code β when AWS deprecates that version ID or you want to upgrade, you're doing a multi-repo find-and-replace β Fix: store model IDs in AWS Systems Manager Parameter Store or environment variables, injected at deploy time via your CDK/Terraform config.
- βMistake 2: Using the wrong boto3 client for Knowledge Bases β calling bedrock_runtime.retrieve_and_generate() instead of bedrock_agent_runtime.retrieve_and_generate() β results in an AttributeError: 'BedrockRuntime' object has no attribute 'retrieve_and_generate' that looks like a boto3 version issue but isn't β Fix: Knowledge Base operations use boto3.client('bedrock-agent-runtime'), not boto3.client('bedrock-runtime'). Two different clients, different endpoints.
- βMistake 3: Not requesting a Service Quotas increase before launch β default TPS for Claude 3 Sonnet is 5 requests/second in most regions for new accounts β hitting this in production returns ThrottlingException with the message 'Too many requests, please wait before trying again' β Fix: go to AWS Service Quotas > Amazon Bedrock > find your model's 'On-demand throughput limit' and submit an increase request at least 10 business days before go-live.
- Mistake 4: Treating Bedrock's streaming event loop like a simple for-loop without mid-stream error handling. If the connection drops at token 300 of a 600-token completion, the generator stops silently and the client renders a half-finished response as if it were complete. Fix: wrap the event_stream iteration in try/except and yield an explicit error marker if the loop exits unexpectedly, so the client can detect incomplete responses.
- Mistake 5: Using temperature=1.0 (or the model default) for structured output tasks like JSON generation or classification. High temperature causes the model to occasionally produce malformed JSON or off-schema responses that break your downstream parser. Fix: set temperature between 0.0 and 0.2 for any task where format correctness matters more than creativity, and add output validation with a retry on parse failure.
Interview Questions on This Topic
- Q: Bedrock's on-demand pricing model uses a shared inference fleet. How does that affect your p99 latency SLO design, and what would you change architecturally if your feature requires consistent sub-500ms responses?
- Q: When would you choose Bedrock Agents over building your own LLM orchestration loop with LangChain or a custom state machine? What's the concrete threshold where Agents becomes more pain than it's worth?
- Q: A Bedrock Agent is calling your Action Group Lambda and intermittently returning wrong answers without any errors in CloudWatch. The Lambda is executing correctly. What's your debugging process, and what's the most likely root cause?
- Q: Your team is running 500 million tokens per month through Bedrock on-demand and the bill is becoming significant. Walk me through how you'd evaluate whether Provisioned Throughput makes financial sense, and what data you'd need before committing to a reserved MU.
Frequently Asked Questions
How much does AWS Bedrock actually cost in production?
It depends entirely on token volume and model choice, but here's the concrete breakdown: Claude 3 Sonnet costs $0.003 per 1,000 input tokens and $0.015 per 1,000 output tokens at time of writing. A typical customer support response that consumes 500 input tokens and 200 output tokens costs roughly $0.0045, under half a cent. At 100,000 requests per day that's $450/day or ~$13,500/month just in inference costs, before any Agents or Knowledge Base overhead. The cost curve is steep at scale, which is why you should be tracking token usage from day one, not month three.
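The arithmetic above as a back-of-envelope calculator, using the prices as quoted (check current Bedrock pricing before relying on them):

```python
# Claude 3 Sonnet on-demand prices per 1,000 tokens, as quoted above.
INPUT_PRICE, OUTPUT_PRICE = 0.003, 0.015

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Per-request inference cost in dollars."""
    return (input_tokens / 1000) * INPUT_PRICE + (output_tokens / 1000) * OUTPUT_PRICE

per_request = request_cost(500, 200)   # $0.0045 -- under half a cent
per_day = per_request * 100_000        # $450/day at 100k requests
per_month = per_day * 30               # $13,500/month
```

Swap in your own token histograms rather than averages when you model this for real; long-tail prompts dominate the bill faster than the mean suggests.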
What's the difference between AWS Bedrock and SageMaker for running AI models?
Bedrock is a managed API for calling pre-trained foundation models; you don't manage any infrastructure. SageMaker is a full ML platform where you can deploy any model (including custom or fine-tuned ones) on dedicated endpoints you control. The rule of thumb: use Bedrock when you want to call a foundation model and ship fast; use SageMaker when you need consistent low-latency inference, models outside Bedrock's catalogue, or full control over the serving environment.
How do I handle Bedrock ThrottlingException in production without losing requests?
Use boto3's 'adaptive' retry mode with max_attempts set to 3-5, which applies exponential backoff with jitter automatically. For user-facing features, wrap the call in a queue with a dead-letter path so throttled requests don't just disappear. Long-term fix: request a Service Quotas increase for your specific model's TPS limit via the AWS console; the default limits are designed for development, not production traffic.
Can Bedrock Agents maintain conversation history across multiple user sessions?
Within a single session, yes: Bedrock Agents maintains context for up to one hour using a sessionId you provide. Across separate sessions or after the one-hour timeout, no: the agent has zero memory. For persistent cross-session memory you need to store conversation history in your own database (DynamoDB is the obvious choice), retrieve the relevant history at the start of each new session, and inject it into the agent's initial prompt or as part of your Action Group context. This is a design requirement, not a configuration option.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.