Pricing: pay per input/output token on-demand, or reserve Model Units (Provisioned Throughput) for guaranteed TPS
Production trap: default TPS quota is 5 requests/second for most models — file Service Quotas increase 2+ weeks before launch
Cost insight: Agent internal reasoning chains consume 5-15x more tokens than the final response suggests — meter everything from day one
Plain-English First
Imagine you need fresh bread for your restaurant every morning. You could buy a wheat farm, hire agronomists, build a mill, and train bakers — or you could just call a bakery and say 'send me 200 sourdough loaves.' AWS Bedrock is the bakery. The foundation models — Claude, Titan, Llama, Mistral — are already baked, scaled, and maintained by someone else. You just make the call, get the output, and pay per loaf. The moment you think you need to 'own the farm' is the moment you've stopped shipping features and started running an AI infrastructure team.
A fintech startup I consulted for spent four months standing up a self-hosted Llama 2 cluster on EC2. GPU reservations, CUDA driver mismatches, custom inference servers, auto-scaling that never quite worked right. They burned $180k in compute before their first user ever typed a prompt. AWS Bedrock would have had them in production in an afternoon for a few cents per thousand tokens. That's not a sales pitch — it's a pattern I've watched repeat at least six times across different orgs.
Bedrock solves a specific and expensive problem: most product teams don't need to run a model — they need a model's output. The operational surface area between those two things is massive. You're talking GPU fleet management, model versioning, inference server tuning, cold-start latency, and on-call rotations that wake up ML engineers at 2am because the VRAM exploded under load. Bedrock collapses all of that into a single API. You pick a model, send a request, get a response. The fleet management, the scaling, the hardware — Amazon's problem now.
After reading this you'll be able to: wire up Bedrock's InvokeModel API in a real service context, implement streaming responses without blocking your web workers, set up Bedrock Agents for multi-step task orchestration, avoid the three quota and cost traps that silently destroy GenAI budgets, and make an informed decision about when Bedrock is the right call versus when you actually do need to self-host.
The Bedrock Model: What You're Actually Paying For and How It Routes
Before you write a single line of code, understand what Bedrock is under the hood — because the mental model directly affects how you design for cost, latency, and failure.
Bedrock is a managed inference proxy. When you call InvokeModel, you're not getting a dedicated GPU instance. Your request goes into Amazon's shared inference fleet for that model family. Amazon handles queuing, routing, scaling, and the hardware underneath. You pay per input token and per output token. There's no idle cost, no reserved capacity fee by default — unless you opt into Provisioned Throughput, which we'll get to.
This shared-fleet model is why you'll see latency variance that would be unacceptable from your own infrastructure. On a busy Tuesday afternoon, a Claude 3 Sonnet call might take 800ms. On Sunday at 6am it might take 280ms. You don't control that. Plan for p99 latency, not average. I've seen teams build chatbots that felt broken in production because they load-tested at 2am and designed for 400ms response times — then their 9am Monday demo crawled.
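You can quantify this before you design around it. A minimal nearest-rank percentile helper makes the point; the latency numbers below are invented for illustration:

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: sort, then take the ceil(pct% * n)-th sample."""
    ordered = sorted(samples_ms)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

# Invented load-test numbers for illustration: note how the tail dominates
latencies_ms = [280, 290, 295, 300, 310, 330, 450, 760, 820, 1250]
p50 = percentile(latencies_ms, 50)   # 310, looks fine
p99 = percentile(latencies_ms, 99)   # 1250, this is what your unluckiest users feel
```

If your latency budget only works at p50, it doesn't work.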
The model IDs matter more than you think. They're not stable aliases — they're versioned strings like anthropic.claude-3-sonnet-20240229-v1:0. When Anthropic ships a new version, the old ID stays available but you don't get automatically migrated. That's intentional. But it means you need a config-driven model ID system, not hardcoded strings in your service. Teams that hardcode model IDs end up doing find-and-replace across repos when they want to upgrade — which is exactly as painful as it sounds.
bedrock_inference_client.py
# io.thecodeforge — DevOps tutorial
import boto3
import json
import os
from botocore.config import Config
from botocore.exceptions import ClientError

# Config-driven model ID — never hardcode this in your service layer.
# Pull from environment or parameter store so upgrades don't require redeploys.
MODEL_ID = os.environ.get("BEDROCK_MODEL_ID", "anthropic.claude-3-sonnet-20240229-v1:0")
AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")

# Set explicit timeouts. Bedrock calls on large prompts can run 30-60s.
# Without explicit values, boto3's defaults allow silent hangs that kill your web workers.
boto_config = Config(
    region_name=AWS_REGION,
    connect_timeout=5,    # fail fast if the endpoint is unreachable
    read_timeout=120,     # long enough for large completions, not infinite
    retries={
        "max_attempts": 3,
        "mode": "adaptive"  # exponential backoff with jitter — don't use 'legacy' mode in prod
    }
)

bedrock_runtime = boto3.client("bedrock-runtime", config=boto_config)

def invoke_document_summariser(raw_document: str, max_tokens: int = 1024) -> dict:
    """
    Production pattern: document summarisation for a content pipeline.
    Returns structured output including token usage so the caller can track cost.
    """
    # Claude models use the Messages API format — not the legacy text-completion format.
    # Mixing them up gives you a cryptic ValidationException, not a helpful error.
    request_body = {
        "anthropic_version": "bedrock-2023-05-31",  # required field for Anthropic models on Bedrock
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": (
                    "Summarise the following document in 3 bullet points. "
                    "Focus on decisions made, not background context.\n\n"
                    f"{raw_document}"
                )
            }
        ],
        "temperature": 0.2,  # low temp for summarisation — you want deterministic, not creative
    }
    try:
        response = bedrock_runtime.invoke_model(
            modelId=MODEL_ID,
            contentType="application/json",
            accept="application/json",
            body=json.dumps(request_body)
        )
        response_body = json.loads(response["body"].read())

        # Always capture usage — this is your cost telemetry.
        # Log it to CloudWatch metrics or your billing system. Don't discard it.
        input_tokens = response_body["usage"]["input_tokens"]
        output_tokens = response_body["usage"]["output_tokens"]
        summary_text = response_body["content"][0]["text"]

        return {
            "summary": summary_text,
            "model_id": MODEL_ID,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            # Rough cost estimate for Claude 3 Sonnet at time of writing:
            # $0.003/1K input tokens, $0.015/1K output tokens
            "estimated_cost_usd": round(
                (input_tokens / 1000 * 0.003) + (output_tokens / 1000 * 0.015), 6
            )
        }
    except ClientError as e:
        error_code = e.response["Error"]["Code"]
        error_message = e.response["Error"]["Message"]
        # ThrottlingException hits when you exceed your account's TPS quota.
        # Default is 5 TPS for Claude 3 Sonnet in most regions — shockingly low for production.
        if error_code == "ThrottlingException":
            raise RuntimeError(
                f"Bedrock quota exceeded for model {MODEL_ID}. "
                "Request a limit increase via Service Quotas before going live."
            ) from e
        # ValidationException usually means malformed request body — check your model's spec.
        if error_code == "ValidationException":
            raise ValueError(f"Invalid request format for {MODEL_ID}: {error_message}") from e
        raise RuntimeError(f"Bedrock API error [{error_code}]: {error_message}") from e

if __name__ == "__main__":
    sample_doc = """
    Engineering Review — Q3 Platform Migration
    Decision: Move API gateway to AWS API Gateway v2 (HTTP APIs).
    Rationale: 60% cost reduction vs REST APIs for our traffic pattern.
    Rejected alternative: Kong on EKS — operational overhead too high for current team size.
    Timeline: Cutover scheduled for October 15th. Rollback plan approved.
    Owner: Platform team. Risk: Medium. Stakeholder sign-off: CTO, VP Engineering.
    """
    result = invoke_document_summariser(sample_doc)
    print(f"Summary:\n{result['summary']}")
    print(f"\nTokens — Input: {result['input_tokens']} | Output: {result['output_tokens']}")
    print(f"Estimated cost: ${result['estimated_cost_usd']}")
Output
Summary:
• Decided to migrate API gateway to AWS API Gateway v2 (HTTP APIs) for a 60% cost reduction.
• Rejected Kong on EKS due to excessive operational overhead for the current team size.
• Cutover set for October 15th with an approved rollback plan; medium risk, sign-off from CTO and VP Engineering.
Tokens — Input: 187 | Output: 73
Estimated cost: $0.001662
Production Trap: Default TPS Quota Will Destroy Your Launch
AWS Bedrock default TPS limits are 5 requests/second for most Claude models in new accounts. That's fine for a demo. At production load with 50 concurrent users, you'll hit ThrottlingException inside two seconds. File a Service Quotas increase request at least two weeks before your launch date — AWS approval isn't instant, and the support ticket queue gets long.
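Until the increase is approved, a client-side throttle keeps bursty traffic queued locally instead of surfacing ThrottlingExceptions to users. Here's a minimal token-bucket sketch; the wrapper name and the specific rate are illustrative, not a drop-in library:

```python
import threading
import time

class TokenBucketThrottle:
    """Client-side throttle to stay under a Bedrock TPS quota (e.g. the 5 TPS default)."""

    def __init__(self, requests_per_second: float):
        self.capacity = requests_per_second
        self.tokens = requests_per_second
        self.refill_rate = requests_per_second  # tokens added back per second
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> float:
        """Block until a request slot is free; returns seconds waited."""
        waited = 0.0
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(
                    self.capacity,
                    self.tokens + (now - self.last_refill) * self.refill_rate,
                )
                self.last_refill = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return waited
            time.sleep(0.02)
            waited += 0.02

# Match this to your current approved quota, not your hoped-for quota
throttle = TokenBucketThrottle(requests_per_second=5)

def throttled_invoke(invoke_fn, *args, **kwargs):
    """Wrap any Bedrock call so bursts queue locally instead of raising ThrottlingException."""
    throttle.acquire()
    return invoke_fn(*args, **kwargs)
```

This doesn't replace the quota increase; it just turns hard failures into queueing delay while you wait for it.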
Production Insight
Shared fleet latency variance is 2-5x between p50 and p99.
You cannot control routing or queue position — Amazon decides.
Rule: design for p99 latency from day one, not average. Load test at peak hours (9am-5pm weekdays), not at 2am when the fleet is idle.
Key Takeaway
Bedrock is a shared inference proxy — you get no dedicated hardware and no latency guarantees.
Model IDs are versioned and must be config-driven, not hardcoded.
Design for p99, meter token usage from CloudWatch, and file your TPS increase before you write application code.
Model ID Management Strategy
If: Single model, single version, no upgrade planned
→ Environment variable is sufficient — set BEDROCK_MODEL_ID at deploy time.

If: Multiple models or frequent version upgrades
→ AWS Systems Manager Parameter Store with a config-driven lookup. The CI/CD pipeline updates the parameter, not the code.

If: A/B testing model versions
→ Store a JSON map of model aliases to versioned IDs in Parameter Store. Route by feature flag or percentage.
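Here's a sketch of the Parameter Store approach combined with deterministic percentage routing. The parameter name and the JSON map shape are both hypothetical conventions, not anything Bedrock prescribes:

```python
import hashlib
import json

# Hypothetical Parameter Store value, stored as JSON, e.g.:
#   {"summariser": {"stable": "anthropic.claude-3-sonnet-20240229-v1:0",
#                   "candidate": "anthropic.claude-3-5-sonnet-20240620-v1:0",
#                   "candidate_pct": 10}}

def load_alias_map(parameter_name: str = "/genai/bedrock/model-aliases") -> dict:
    """Fetch the alias map from SSM. Parameter name is a made-up convention."""
    import boto3  # imported here so the routing logic below stays dependency-free
    ssm = boto3.client("ssm")
    value = ssm.get_parameter(Name=parameter_name)["Parameter"]["Value"]
    return json.loads(value)

def resolve_model_id(alias_map: dict, alias: str, request_key: str) -> str:
    """
    Deterministically route a request to the stable or candidate model ID.
    request_key (e.g. a user ID) hashes to a stable 0-99 bucket, so the same
    user always sees the same model version for the life of the experiment.
    """
    entry = alias_map[alias]
    bucket = int(hashlib.sha256(request_key.encode()).hexdigest(), 16) % 100
    if bucket < entry.get("candidate_pct", 0):
        return entry["candidate"]
    return entry["stable"]
```

Ramping the experiment is then a Parameter Store update, not a deploy: bump candidate_pct and the routing shifts on the next lookup.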
Streaming Responses: Stop Blocking Your Threads and Start Shipping Perceived Speed
Here's what kills GenAI UX before a user ever reads a word: a 12-second blank screen while your server waits for the full completion before flushing anything to the client. Users think it's broken. They hit refresh. You get duplicate charges. Your support queue fills up.
Bedrock's InvokeModelWithResponseStream fixes this. It returns a streaming event iterator — text chunks arrive as the model generates them, and you pipe each chunk to the client immediately. From the user's perspective, text starts appearing in under a second and keeps flowing. Perceived latency drops dramatically even when total generation time is identical.
The tricky part isn't the streaming itself — it's the infrastructure around it. Your web framework needs to support streaming responses, your load balancer needs idle timeout configured high enough (ALB defaults to 60s — too low for long completions), and your error handling needs to account for the fact that the stream can fail mid-response. I've seen services that catch exceptions from InvokeModel just fine but have zero error handling inside the stream event loop — so when the stream dies at token 400 of a 600-token response, the client gets a truncated response with no indication that something went wrong. Silent data loss in a production AI system is a bad day.
bedrock_streaming_handler.py
# io.thecodeforge — DevOps tutorial
import boto3
import json
import os
from botocore.config import Config
from typing import Generator

MODEL_ID = os.environ.get("BEDROCK_MODEL_ID", "anthropic.claude-3-sonnet-20240229-v1:0")

boto_config = Config(
    region_name=os.environ.get("AWS_REGION", "us-east-1"),
    connect_timeout=5,
    read_timeout=300,  # streaming completions can run long — 120s isn't always enough
    retries={"max_attempts": 1, "mode": "standard"}  # don't retry mid-stream — retry at the caller level
)

bedrock_runtime = boto3.client("bedrock-runtime", config=boto_config)

def stream_customer_support_response(
    customer_query: str,
    account_context: dict
) -> Generator[str, None, None]:
    """
    Production pattern: real-time customer support response generation.
    Yields text chunks as a generator so the caller (e.g. FastAPI StreamingResponse)
    can flush each chunk to the HTTP client immediately.

    account_context: dict with keys like 'plan', 'open_tickets', 'last_login'
    """
    # Build a system prompt from account context so the model responds with
    # customer-specific information rather than generic advice.
    system_prompt = (
        f"You are a support agent for TheCodeForge platform. "
        f"The customer is on the {account_context.get('plan', 'free')} plan. "
        f"They have {account_context.get('open_tickets', 0)} open support tickets. "
        "Be concise, direct, and actionable. Do not apologise excessively."
    )
    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "system": system_prompt,
        "messages": [
            {"role": "user", "content": customer_query}
        ],
        "temperature": 0.3,
    }
    try:
        streaming_response = bedrock_runtime.invoke_model_with_response_stream(
            modelId=MODEL_ID,
            contentType="application/json",
            accept="application/json",
            body=json.dumps(request_body)
        )
        # The response body is an event stream — iterate it chunk by chunk
        event_stream = streaming_response["body"]
        for event in event_stream:
            chunk = event.get("chunk")
            if not chunk:
                # Non-chunk events exist (metadata, message_start, etc.) — skip them gracefully
                continue
            chunk_data = json.loads(chunk["bytes"].decode("utf-8"))
            # Claude streaming emits different event types — only 'content_block_delta' carries text
            if chunk_data.get("type") == "content_block_delta":
                delta = chunk_data.get("delta", {})
                if delta.get("type") == "text_delta":
                    text_piece = delta.get("text", "")
                    if text_piece:
                        yield text_piece  # flush this chunk to the caller immediately
            # message_stop event signals clean completion — log it for observability
            elif chunk_data.get("type") == "message_stop":
                # Amazon metrics come in the stop event — useful for billing dashboards
                amazon_metrics = chunk_data.get("amazon-bedrock-invocationMetrics", {})
                input_tokens = amazon_metrics.get("inputTokenCount", 0)
                output_tokens = amazon_metrics.get("outputTokenCount", 0)
                # In production: emit these as CloudWatch custom metrics here
                print(f"[Stream complete] input={input_tokens} output={output_tokens} tokens")
    except Exception as e:
        # Critical: yield an error marker so the client knows the stream died mid-response.
        # Don't silently stop — the client will think the truncated response is complete.
        yield f"\n[ERROR: Response generation interrupted — {type(e).__name__}]"
        raise

# --- Simulated FastAPI usage (shows how the generator plugs into a real web framework) ---
# In production this would be in your router module:
#
# from fastapi import FastAPI
# from fastapi.responses import StreamingResponse
#
# app = FastAPI()
#
# @app.post("/support/stream")
# async def stream_support(query: SupportQuery):
#     account_ctx = fetch_account_context(query.account_id)  # your DB call
#     return StreamingResponse(
#         stream_customer_support_response(query.text, account_ctx),
#         media_type="text/plain"
#     )

if __name__ == "__main__":
    query = "I deployed to production and my API calls are returning 429s. What do I do?"
    context = {"plan": "pro", "open_tickets": 1, "last_login": "2024-03-15"}
    print("Streaming response:\n")
    for chunk in stream_customer_support_response(query, context):
        print(chunk, end="", flush=True)  # flush=True is essential — don't buffer
    print("\n")
Output
Streaming response:
You're hitting rate limits (429 = Too Many Requests). On the Pro plan here's what to check:
1. **Check your current usage** — log into the dashboard under Settings > API Usage to see if you've hit your monthly request cap.
2. **Implement exponential backoff** — your client should retry with delays of 1s, 2s, 4s before failing. Most SDKs have this built in.
3. **Check for runaway processes** — a misconfigured retry loop can exhaust your quota in minutes. Look for repeated identical requests in your logs.
If you're within quota and still seeing 429s, open a ticket with your API key and a sample request timestamp — that points to a server-side issue we'll trace on our end.
[Stream complete] input=98 output=143 tokens
Never Do This: Retry Inside a Stream Event Loop
Setting max_attempts > 1 on a streaming call in boto3 is dangerous. If a chunk fails mid-stream and boto3 retries, it starts the stream from the beginning — but your client has already received partial output. You end up with duplicated content prepended to the retry. Set retries to 1 on streaming clients and handle retries at the request level, before the stream opens.
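A request-level retry can be sketched as a wrapper that only restarts the stream if nothing has been emitted yet. `make_stream` is any zero-argument callable that returns a fresh chunk generator; the wrapper itself is generic and the backoff schedule is an assumption:

```python
import time

def stream_with_request_level_retry(make_stream, max_attempts: int = 3):
    """
    Retry a streaming call at the request level only.
    If the stream fails BEFORE the first chunk is yielded, retry with backoff.
    If it fails AFTER output has started, re-raise: the client already holds
    partial text, and a restart would duplicate content.
    """
    for attempt in range(1, max_attempts + 1):
        emitted_any = False
        try:
            for chunk in make_stream():
                emitted_any = True
                yield chunk
            return  # clean completion
        except Exception:
            if emitted_any or attempt == max_attempts:
                raise  # never restart a stream the client has partially seen
            time.sleep(2 ** (attempt - 1))  # backoff before a fresh attempt
```

Plugging in the streaming function from the example is one option, e.g. `stream_with_request_level_retry(lambda: stream_customer_support_response(query, ctx))`.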
Production Insight
ALB default idle timeout is 60s — too low for long streaming completions.
A 2,000-token completion at 30 tokens/second takes 67 seconds. The ALB kills the connection at 60s.
Rule: set ALB idle timeout to 300s for any endpoint that streams Bedrock completions. Check this before your first production deploy, not after your first outage.
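If the ALB isn't managed by your IaC, the timeout can be bumped with a one-off script against the ELBv2 API. A sketch; the load balancer ARN below is a placeholder you'd replace with your own:

```python
# Hypothetical ARN — substitute your real load balancer ARN.
# 'idle_timeout.timeout_seconds' is the ELBv2 attribute key for idle timeout.
ALB_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/genai-api/abc123"

def idle_timeout_attributes(seconds: int) -> list:
    """Build the Attributes payload for modify_load_balancer_attributes."""
    return [{"Key": "idle_timeout.timeout_seconds", "Value": str(seconds)}]

def raise_alb_idle_timeout(alb_arn: str = ALB_ARN, seconds: int = 300) -> None:
    """Set the ALB idle timeout. Requires elasticloadbalancing:ModifyLoadBalancerAttributes."""
    import boto3  # imported here so the payload builder stays dependency-free
    elbv2 = boto3.client("elbv2")
    elbv2.modify_load_balancer_attributes(
        LoadBalancerArn=alb_arn,
        Attributes=idle_timeout_attributes(seconds),
    )
```

Better still, bake the check into your deploy pipeline so a recreated load balancer can't silently revert to 60s.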
Key Takeaway
Streaming drops perceived latency from total generation time to first-token time — critical for user-facing UX.
Never set max_attempts > 1 on streaming boto3 clients — retries restart the stream and duplicate content.
ALB idle timeout must be 300s+ for streaming endpoints. The 60s default will silently truncate long completions.
Streaming vs Sync InvokeModel
If: User-facing chat or real-time response (< 3s perceived latency required)
→ InvokeModelWithResponseStream. Perceived latency drops from total generation time to first-token time.

If: Batch processing, document classification, or background jobs
→ InvokeModel (sync). Simpler error handling, no stream infrastructure needed.

If: Webhook or callback-based architecture
→ InvokeModel (sync) with an async worker. Deliver the completed response to the callback URL.

If: Load balancer cannot be configured with a high idle timeout
→ InvokeModel (sync). Streaming through a 60s ALB timeout will truncate long responses.
Bedrock Agents: When Single Prompts Aren't Enough and You Need Actual Orchestration
A single InvokeModel call works great when your task is stateless: summarise this, classify that, generate this copy. The moment your task requires multiple steps — look up customer data, reason about it, call an API, generate a response based on the result — you're either building your own orchestration loop or you're using Bedrock Agents.
Bedrock Agents is Amazon's managed multi-step reasoning engine. You define the agent's instructions (its persona and scope), attach Action Groups (Lambda functions that the agent can invoke), and optionally connect a Knowledge Base (a vector store backed by your documents). The agent runs a ReAct-style loop: it reasons about the user's request, decides which actions to take, calls your Lambdas, observes the results, and iterates until it has enough information to respond.
The thing most tutorials won't tell you: the agent's internal reasoning chain costs tokens you don't see upfront. Every step in the loop — including the model's internal 'thinking' about which action to call — burns input and output tokens. On complex multi-step tasks I've seen agents consume 10-15x the tokens you'd expect from reading the final answer alone. Budget for it. Also, Bedrock Agents has a fixed session timeout of one hour. Any stateful conversation longer than that needs explicit session management on your side — the agent won't remember anything after the session expires.
The sweet spot for Agents is internal tooling: HR bots that query Workday, DevOps assistants that check CloudWatch alarms and summarise them, customer-facing support bots that can actually look up order status. Tasks where the answer genuinely requires calling real systems, not just reasoning over embedded knowledge.
bedrock_agent_action_group_lambda.py
# io.thecodeforge — DevOps tutorial
# This Lambda is an Action Group handler for a Bedrock Agent.
# The agent calls this function when it needs to look up order status.
# Deploy this as a Lambda, then wire it to your Agent via the Bedrock console or CDK.
import json
import boto3
import os
from datetime import datetime

# In production: pull from environment, not hardcoded table names
ORDERS_TABLE = os.environ.get("ORDERS_TABLE_NAME", "platform-orders-prod")
dynamodb = boto3.resource("dynamodb")
orders_table = dynamodb.Table(ORDERS_TABLE)

def lambda_handler(event: dict, context) -> dict:
    """
    Bedrock Agent Action Group handler.
    The agent sends a specific event structure — you must return a specific structure back.
    Deviate from the response format and the agent silently fails or hallucinates an answer.
    """
    # Bedrock Agents wraps function calls in this structure
    agent_action = event.get("actionGroup", "")
    api_path = event.get("apiPath", "")        # matches your OpenAPI schema path
    http_method = event.get("httpMethod", "")  # GET, POST, etc. — from your schema
    parameters = event.get("parameters", [])   # list of {name, type, value} dicts

    print(f"[Agent Action] group={agent_action} path={api_path} method={http_method}")

    # Route to the appropriate handler based on the API path
    if api_path == "/orders/{orderId}" and http_method == "GET":
        order_id = next(
            (p["value"] for p in parameters if p["name"] == "orderId"), None
        )
        result = fetch_order_status(order_id)
    elif api_path == "/orders/{orderId}/cancel" and http_method == "POST":
        order_id = next(
            (p["value"] for p in parameters if p["name"] == "orderId"), None
        )
        result = cancel_order(order_id)
    else:
        result = {"error": f"Unknown action path: {api_path}"}

    # Bedrock Agents REQUIRES this exact response envelope.
    # Missing 'messageVersion', 'response', or 'actionGroup' fields = silent agent failure.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": agent_action,
            "apiPath": api_path,
            "httpMethod": http_method,
            "httpStatusCode": 200 if "error" not in result else 400,
            "responseBody": {
                "application/json": {
                    "body": json.dumps(result)
                }
            }
        }
    }

def fetch_order_status(order_id: str) -> dict:
    """Look up a real order from DynamoDB and return structured status."""
    if not order_id:
        return {"error": "orderId is required"}
    try:
        response = orders_table.get_item(Key={"orderId": order_id})
        item = response.get("Item")
        if not item:
            # Be specific — the agent will relay this message verbatim to the user
            return {"error": f"Order {order_id} not found. It may not exist or may be archived."}
        return {
            "orderId": item["orderId"],
            "status": item["status"],  # e.g. PROCESSING, SHIPPED, DELIVERED
            "estimatedDelivery": item.get("estimatedDelivery", "unknown"),
            "carrier": item.get("carrier", "not yet assigned"),
            "trackingNumber": item.get("trackingNumber", "not yet assigned"),
            "lastUpdated": item.get("lastUpdated", "")
        }
    except Exception as e:
        # Don't expose raw exception messages to the agent — it may relay them to the user
        print(f"[ERROR] DynamoDB lookup failed for order {order_id}: {e}")
        return {"error": "Order lookup temporarily unavailable. Please try again shortly."}

def cancel_order(order_id: str) -> dict:
    """Cancel an order if it's still in PROCESSING state."""
    if not order_id:
        return {"error": "orderId is required"}
    try:
        # Conditional update — only cancel if status is PROCESSING.
        # This prevents the agent from cancelling already-shipped orders.
        orders_table.update_item(
            Key={"orderId": order_id},
            UpdateExpression="SET #s = :cancelled, lastUpdated = :now",
            ConditionExpression="#s = :processing",
            ExpressionAttributeNames={"#s": "status"},  # 'status' is a reserved word in DynamoDB
            ExpressionAttributeValues={
                ":cancelled": "CANCELLED",
                ":processing": "PROCESSING",
                ":now": datetime.utcnow().isoformat()
            }
        )
        return {"orderId": order_id, "status": "CANCELLED", "message": "Order successfully cancelled."}
    except dynamodb.meta.client.exceptions.ConditionalCheckFailedException:
        # Order exists but isn't in PROCESSING — give the agent a specific reason
        return {
            "error": f"Order {order_id} cannot be cancelled — it has already been shipped or delivered."
        }
    except Exception as e:
        print(f"[ERROR] Cancel failed for order {order_id}: {e}")
        return {"error": "Cancellation temporarily unavailable."}
Output
# When the Bedrock Agent receives: "Can you cancel order ORD-88421?"
Production Trap: Agent Token Costs Are Not What You Think
Bedrock Agents runs an internal reasoning chain that isn't visible in your application logs. On a 3-step task (lookup → reason → respond), I've measured 8,000+ tokens consumed for what looks like a 200-token final answer. Set up a CloudWatch metric filter on inputTokenCount and outputTokenCount from the agent's CloudTrail events before you go live. Otherwise your billing surprises will be significant and will arrive monthly.
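One way to get that telemetry flowing is to emit token counts and an estimated cost as CloudWatch custom metrics the moment you observe them. A sketch; the namespace, metric names, and the Sonnet prices quoted earlier are assumptions to adapt:

```python
# Assumed Claude 3 Sonnet on-demand prices at time of writing ($/1K tokens)
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.015

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Rough cost for one invocation under the assumed prices above."""
    return round(input_tokens / 1000 * INPUT_PRICE_PER_1K
                 + output_tokens / 1000 * OUTPUT_PRICE_PER_1K, 6)

def emit_agent_usage_metrics(agent_id: str, input_tokens: int, output_tokens: int) -> None:
    """Push token usage to CloudWatch so agent cost is visible on a dashboard."""
    import boto3  # imported here so the pure cost helper stays dependency-free
    cloudwatch = boto3.client("cloudwatch")
    dims = [{"Name": "AgentId", "Value": agent_id}]
    cloudwatch.put_metric_data(
        Namespace="GenAI/BedrockAgents",  # hypothetical namespace, pick your own
        MetricData=[
            {"MetricName": "InputTokens", "Value": input_tokens,
             "Unit": "Count", "Dimensions": dims},
            {"MetricName": "OutputTokens", "Value": output_tokens,
             "Unit": "Count", "Dimensions": dims},
            {"MetricName": "EstimatedCostUSD",
             "Value": estimate_cost_usd(input_tokens, output_tokens),
             "Unit": "None", "Dimensions": dims},
        ],
    )
```

An alarm on EstimatedCostUSD per hour catches a runaway reasoning loop the same day, not on the invoice.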
Production Insight
Agent response envelope format is rigid — missing a single field causes silent hallucination.
The agent will generate a plausible-sounding answer from its training data instead of using your Lambda's response.
Rule: validate the envelope structure in Lambda unit tests before deploying. The 6 required fields are: messageVersion, response.actionGroup, response.apiPath, response.httpMethod, response.httpStatusCode, response.responseBody.
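A unit-test helper for exactly that check might look like the following. The field list mirrors the six fields above; the helper name is ours, not part of any AWS SDK:

```python
def validate_agent_envelope(envelope: dict) -> list:
    """Return the list of missing required envelope fields (empty list = valid)."""
    missing = []
    if "messageVersion" not in envelope:
        missing.append("messageVersion")
    response = envelope.get("response", {})
    for field in ("actionGroup", "apiPath", "httpMethod", "httpStatusCode", "responseBody"):
        if field not in response:
            missing.append(f"response.{field}")
    return missing
```

In your Lambda's unit tests, assert `validate_agent_envelope(lambda_handler(sample_event, None)) == []` for every routed path, so a refactor that drops a field fails in CI rather than hallucinating in production.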
Key Takeaway
Agents consume 5-15x more tokens than the final response suggests — meter at CloudWatch, not application logs.
Agent system prompts are re-injected at every reasoning step — keep them under 200 tokens.
Route simple queries to InvokeModel directly. Reserve agents for tasks that genuinely require multi-step API orchestration.
If: Conversations must persist beyond the 1-hour session timeout
→ Agents cannot do this natively. Build custom session management with DynamoDB. Inject history into the agent's initial prompt.
Provisioned Throughput, Knowledge Bases, and When to Walk Away From Bedrock Entirely
On-demand pricing is great until you hit quota walls at scale. If your application is sending consistent, high-volume traffic to a specific model — think a customer-facing feature used by thousands of users during business hours — Provisioned Throughput might make more sense. You reserve Model Units (MUs) for a specific model, pay hourly regardless of usage, and get guaranteed throughput without ThrottlingExceptions.
Here's the honest math: a single MU for Claude 3 Sonnet runs about $60/hour. At 720 hours per month that's $43,200 per month, per MU. On-demand for the same volume might be cheaper — or wildly more expensive — depending on your actual token throughput. Run the numbers on your specific traffic pattern before committing. Provisioned Throughput has a minimum one-month commitment. I've seen teams lock in a MU for a feature that got descoped a week later.
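That arithmetic is worth encoding so it gets re-run whenever prices or traffic change. A sketch using the figures quoted above; verify current pricing in the console before making the call, since all four constants are point-in-time assumptions:

```python
# Point-in-time assumptions from the discussion above — re-check before committing
MU_HOURLY_USD = 60.0            # approx. cost of one Model Unit for Claude 3 Sonnet
HOURS_PER_MONTH = 720
ON_DEMAND_INPUT_PER_1K = 0.003  # on-demand $/1K input tokens
ON_DEMAND_OUTPUT_PER_1K = 0.015 # on-demand $/1K output tokens

def monthly_on_demand_cost(input_tokens_per_month: int, output_tokens_per_month: int) -> float:
    """What the same monthly traffic would cost on pay-per-token pricing."""
    return (input_tokens_per_month / 1000 * ON_DEMAND_INPUT_PER_1K
            + output_tokens_per_month / 1000 * ON_DEMAND_OUTPUT_PER_1K)

def provisioned_is_cheaper(input_tokens_per_month: int, output_tokens_per_month: int,
                           model_units: int = 1) -> bool:
    """Compare a Provisioned Throughput commitment against on-demand for this traffic."""
    provisioned = MU_HOURLY_USD * HOURS_PER_MONTH * model_units  # $43,200/month per MU
    return provisioned < monthly_on_demand_cost(input_tokens_per_month, output_tokens_per_month)
```

For example, 2B input plus 500M output tokens a month runs about $13,500 on-demand, nowhere near the $43,200 MU cost, so the commitment would lose money at that volume.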
Bedrock Knowledge Bases gives you managed RAG — upload documents to S3, Bedrock chunks and embeds them into a vector store (OpenSearch Serverless or Pinecone), and your agent can query it semantically. For internal documentation bots or product knowledge bases it's genuinely useful and much faster to ship than building your own embedding pipeline. The gotcha: chunk size and overlap settings are critical and not obvious. Default chunking works fine for short Q&A docs, but for dense technical PDFs you'll get retrieval misses because the relevant context gets split across chunk boundaries.
When should you not use Bedrock? Three clear signals: you need a model Bedrock doesn't offer (GPT-4o, Gemini Ultra — you're calling OpenAI/Google directly regardless), you need sub-100ms inference latency at scale (shared fleet variance won't get you there — look at SageMaker JumpStart with a dedicated endpoint), or you need fine-tuned models on highly proprietary data where sending data to a third-party API is a compliance non-starter. Bedrock does support some fine-tuning workflows, but they're limited in model scope and more complex than advertised.
bedrock_knowledge_base_rag_query.py
# io.thecodeforge — DevOps tutorial
# RAG (Retrieval Augmented Generation) pattern using Bedrock Knowledge Bases.
# Use case: internal engineering handbook bot that answers policy questions.
# The Knowledge Base is pre-populated with your company docs via the Bedrock console or CDK.
import boto3
import os
import json
from botocore.config import Config

KNOWLEDGE_BASE_ID = os.environ["BEDROCK_KB_ID"]  # e.g. "ABCD1234EF" — from Bedrock console
MODEL_ARN = (
    "arn:aws:bedrock:us-east-1::foundation-model/"
    "anthropic.claude-3-sonnet-20240229-v1:0"
)  # RetrieveAndGenerate requires the full ARN, not just the model ID

boto_config = Config(
    region_name=os.environ.get("AWS_REGION", "us-east-1"),
    connect_timeout=5,
    read_timeout=60,
    retries={"max_attempts": 2, "mode": "adaptive"}
)

# Note: Knowledge Bases uses the 'bedrock-agent-runtime' client — NOT 'bedrock-runtime'.
# Using the wrong client gives you a NoRegionError or AttributeError with no useful message.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", config=boto_config)

def query_engineering_handbook(question: str, max_retrieved_chunks: int = 5) -> dict:
    """
    Query the engineering handbook Knowledge Base using RetrieveAndGenerate.
    This is the fully managed RAG path — Bedrock handles retrieval + generation in one call.

    For transparency/debugging: also returns the source citations so you can verify
    the model isn't hallucinating answers that aren't in the docs.
    """
    try:
        response = bedrock_agent_runtime.retrieve_and_generate(
            input={"text": question},
            retrieveAndGenerateConfiguration={
                "type": "KNOWLEDGE_BASE",
                "knowledgeBaseConfiguration": {
                    "knowledgeBaseId": KNOWLEDGE_BASE_ID,
                    "modelArn": MODEL_ARN,
                    "retrievalConfiguration": {
                        "vectorSearchConfiguration": {
                            # Number of document chunks to retrieve before generation.
                            # Higher = more context but more tokens = higher cost and latency.
                            # 5 is a good starting point; tune based on your doc structure.
                            "numberOfResults": max_retrieved_chunks
                        }
                    },
                    "generationConfiguration": {
                        "promptTemplate": {
                            # Override the default prompt to enforce your preferred answer style.
                            # The $search_results$ placeholder is where retrieved chunks are injected.
                            "textPromptTemplate": (
                                "You are an assistant for TheCodeForge engineering team. "
                                "Answer based ONLY on the following retrieved context. "
                                "If the answer isn't in the context, say 'Not found in handbook.' "
                                "Do not invent policies or procedures.\n\n"
                                "Context:\n$search_results$\n\n"
                                f"Question: {question}"
                            )
                        }
                    }
                }
            }
        )
        answer = response["output"]["text"]

        # Extract citations — each citation maps to a specific chunk in your S3 docs.
        # In production: surface these to the user so they can verify the source.
        citations = []
        for citation in response.get("citations", []):
            for reference in citation.get("retrievedReferences", []):
                location = reference.get("location", {}).get("s3Location", {})
                citations.append({
                    "source_uri": location.get("uri", "unknown"),
                    "excerpt": reference.get("content", {}).get("text", "")[:200]  # truncate for display
                })
        return {
            "answer": answer,
            "citations": citations,
            "citation_count": len(citations)
        }
    except bedrock_agent_runtime.exceptions.ResourceNotFoundException:
        raise ValueError(
            f"Knowledge Base {KNOWLEDGE_BASE_ID} not found. "
            "Check the ID and ensure the KB is in 'Active' status — "
            "embedding ingestion must complete before queries work."
        )
    except Exception as e:
        raise RuntimeError(f"Knowledge Base query failed: {e}") from e

if __name__ == "__main__":
    result = query_engineering_handbook(
        "What's our policy on hotfixing directly to the main branch?"
    )
    print(f"Answer:\n{result['answer']}\n")
    print(f"Sources ({result['citation_count']} retrieved):")
    for i, citation in enumerate(result["citations"], 1):
        print(f"  [{i}] {citation['source_uri']}")
        print(f"      Excerpt: {citation['excerpt']}...")
Output
Answer:
Direct hotfixes to the main branch are not permitted under standard policy. All changes, including urgent patches, must go through a pull request with at least one reviewer approval. For P0 incidents, an expedited review process applies: the on-call engineer can approve with documented justification in the PR description, followed by a post-incident review within 48 hours. Branch protection rules enforce this — direct pushes to main are blocked at the repository level.
Excerpt: Branch protection rules enforce this — direct pushes to main are blocked at the repository level via GitHub rulesets...
Senior Shortcut: Two-Stage RAG for Better Retrieval Accuracy
If RetrieveAndGenerate gives you retrieval misses on complex questions, split it into two calls: first call the Retrieve API alone (`bedrock_agent_runtime.retrieve`) to get the raw chunks, then manually re-rank or filter them in your application code, then pass the filtered context to InvokeModel directly. This two-stage pattern costs slightly more but gives you control over what the model actually sees — and it surfaces retrieval quality issues that the end-to-end call hides.
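A minimal sketch of that two-stage flow, reusing the same `KNOWLEDGE_BASE_ID` setup as above. The keyword-overlap re-ranker, the pull of 10 candidate chunks, and the use of the Converse API for the generation step are illustrative choices, not the only way to do it:

```python
def rerank_chunks(chunks, question_keywords, top_k=3):
    """Naive keyword-overlap re-ranker. In production, swap in a proper
    re-ranking model (e.g. a cross-encoder) — this is just the shape of it."""
    def score(chunk):
        text = chunk["content"]["text"].lower()
        return sum(1 for kw in question_keywords if kw.lower() in text)
    return sorted(chunks, key=score, reverse=True)[:top_k]

def two_stage_rag(question, knowledge_base_id, model_id):
    import boto3
    agent_rt = boto3.client("bedrock-agent-runtime")
    runtime = boto3.client("bedrock-runtime")

    # Stage 1: retrieval only — pull more candidates than you intend to use.
    retrieved = agent_rt.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 10}},
    )["retrievalResults"]

    # Re-rank/filter in application code — this is the control RetrieveAndGenerate hides.
    chunks = rerank_chunks(retrieved, question.split())
    context = "\n\n".join(c["content"]["text"] for c in chunks)

    # Stage 2: generation with only the filtered context.
    resp = runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [
            {"text": f"Context:\n{context}\n\nQuestion: {question}"}
        ]}],
    )
    return resp["output"]["message"]["content"][0]["text"]
```

Logging what `rerank_chunks` keeps versus drops is usually where you first see why the end-to-end call was missing.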
Production Insight
Provisioned Throughput minimum commitment is 1 month at $43,200/MU for Claude 3 Sonnet.
A feature descoped one week after MU purchase = $43,200 wasted for that month.
Rule: validate feature stability and traffic projections for 30 days before committing to Provisioned Throughput. Use on-demand during the validation period and compare actual token spend against MU cost.
Key Takeaway
Provisioned Throughput is a 1-month minimum commitment at $43K+/MU. Validate traffic stability before committing.
Knowledge Bases default chunking fails on dense technical PDFs — tune chunk size or use the two-stage retrieve-then-generate pattern.
Walk away from Bedrock when you need sub-100ms latency, models outside the catalogue, or VPC-only data processing.
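The on-demand versus Provisioned Throughput decision reduces to simple arithmetic. A sanity-check sketch using the per-token prices and MU cost quoted in this article (verify current pricing and your own input/output token mix before deciding):

```python
def on_demand_monthly_cost(tokens_per_month, input_share=0.7,
                           input_price_per_1k=0.003, output_price_per_1k=0.015):
    """Blended monthly on-demand cost. Prices are the Claude 3 Sonnet
    figures quoted in this article — check current pricing."""
    input_tokens = tokens_per_month * input_share
    output_tokens = tokens_per_month * (1 - input_share)
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

MU_MONTHLY_COST = 43_200  # 1 Model Unit at ~$60/hour * 720 hours

def provisioned_throughput_breakeven(step=100_000_000):
    """Smallest monthly token volume (to the nearest `step`) where
    on-demand spend exceeds one MU's monthly cost."""
    tokens = step
    while on_demand_monthly_cost(tokens) < MU_MONTHLY_COST:
        tokens += step
    return tokens
```

At a 70/30 input/output mix this puts the crossover around 6.5 billion tokens/month; a different mix moves the number, which is exactly why the 30-day baseline matters.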
When to Use Bedrock vs Self-Host
If you need model output, your team has no ML engineers, and volume is < 1B tokens/month → use Bedrock on-demand. Fastest path to production, zero ops burden.
If you have consistent high-volume traffic (> 500M tokens/month) with a stable model → evaluate Provisioned Throughput. Run a 30-day on-demand baseline and compare against MU hourly cost.
If you need sub-100ms p99 latency at scale → self-host on dedicated hardware. Bedrock's shared fleet cannot guarantee sub-100ms.
If you need a model not in the Bedrock catalogue (GPT-4o, Gemini Ultra) → call the provider API directly. Bedrock cannot help.
If data cannot leave your VPC (compliance requirement) → self-host on SageMaker or EC2 in your VPC. Bedrock processes data on Amazon's infrastructure.
If you need fine-tuning on proprietary data → evaluate Bedrock fine-tuning first (limited model support). If insufficient, self-host with a custom training pipeline.
Production Incident Post-Mortem · Severity: High
The $47K/month Agent Bill: Invisible Token Consumption in Bedrock Agents
Symptom
Monthly Bedrock bill 15x higher than projected. Application logs showed ~200 tokens per response, but CloudWatch metrics showed ~2,400 tokens per invocation. The delta was invisible to the application layer.
Assumption
The team measured cost based on the final response token count — what the user sees. They assumed the agent's internal reasoning was negligible. They did not instrument CloudTrail or the stream stop event metrics for agent-level token tracking.
Root cause
Bedrock Agents runs an internal ReAct-style reasoning loop. Each step — deciding which Action Group to call, interpreting the Lambda response, deciding whether to call another action — burns input and output tokens that are not surfaced in the application response. On a 3-step task (lookup Okta → check Meraki → create Jira ticket), the agent consumed approximately 2,400 tokens per invocation: ~400 tokens for the user's question, ~1,600 tokens for internal reasoning across 3 steps, and ~400 tokens for the final response. The team only metered the 400-token final response.
Additionally, the agent was configured with verbose instructions (800 tokens of system prompt) that were re-injected at every reasoning step, multiplying the input token cost.
Fix
1. Added CloudWatch metric filters on the amazon-bedrock-invocationMetrics inputTokenCount and outputTokenCount fields from the agent's CloudTrail events. This captured the true per-invocation token cost.
2. Reduced the agent's system prompt from 800 tokens to 150 tokens by removing redundant persona descriptions and consolidating instructions.
3. Added a session-level token budget with a hard cap of 3,000 tokens per conversation turn. When the budget was exhausted, the agent returned a 'please try a simpler question' response instead of continuing the reasoning loop.
4. Implemented a two-tier architecture: simple queries (password reset, VPN status) routed to direct InvokeModel calls with a static knowledge base, bypassing the agent entirely. Only multi-step tasks (requiring API calls) used the agent. This reduced agent invocations by 70%.
5. Set up a daily CloudWatch alarm on Bedrock token spend exceeding $1,500/day with automatic Slack notification to the platform team.
Key lesson
Bedrock Agent token costs are 5-15x higher than the final response suggests. Always meter at the CloudWatch/CloudTrail level, not the application response level.
Agent system prompts are re-injected at every reasoning step. An 800-token system prompt across a 4-step reasoning chain adds 3,200 tokens of input cost per invocation. Keep agent instructions under 200 tokens.
Not every query needs an agent. Route simple, stateless queries to direct InvokeModel calls. Reserve agents for tasks that genuinely require multi-step API orchestration.
Set up token spend alarms before go-live. Bedrock cost surprises arrive monthly, not per-request. Daily alarms catch runaway consumption before the bill compounds.
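Metering at the invocation level is straightforward once you know where the numbers live. A sketch assuming the documented shape of the final chunk from InvokeModelWithResponseStream, which carries the `amazon-bedrock-invocationMetrics` block; the `SessionTokenBudget` class is an illustrative implementation of the hard cap described in the fix list above:

```python
import json

def extract_invocation_metrics(chunk_bytes):
    """Return true token counts if this stream chunk carries
    'amazon-bedrock-invocationMetrics' (the final chunk does); else None."""
    payload = json.loads(chunk_bytes)
    metrics = payload.get("amazon-bedrock-invocationMetrics")
    if metrics is None:
        return None
    return {
        "input_tokens": metrics["inputTokenCount"],
        "output_tokens": metrics["outputTokenCount"],
    }

class SessionTokenBudget:
    """Hard cap on tokens per conversation turn. When record() returns
    False, stop the loop and return a fallback response instead."""
    def __init__(self, cap=3000):
        self.cap = cap
        self.used = 0

    def record(self, input_tokens, output_tokens):
        self.used += input_tokens + output_tokens
        return self.used <= self.cap
```

Emit the extracted counts as a CloudWatch custom metric per session and your cost model stops being fiction.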
Production Debug Guide — symptom-to-action for Bedrock API errors, latency spikes, TPS quota issues, and Agent failures (6 entries)
Symptom · 01
ThrottlingException: 'Too many requests, please wait before trying again'
→
Fix
You have exceeded your account's TPS quota for the specific model. Check current limits: AWS Console > Service Quotas > Amazon Bedrock > search for your model's 'On-demand throughput limit'. Default is 5 TPS for most Claude models. File an increase request immediately — approval takes 5-10 business days. In the meantime, implement exponential backoff with jitter in your client.
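A minimal full-jitter backoff helper to bridge the gap while the quota increase is pending. Matching the exception by class name keeps the sketch independent of botocore's generated exception classes; in real code you would catch the client's `ThrottlingException` directly:

```python
import random
import time

def jitter_delay(attempt, base=0.5, cap=20.0):
    """Full jitter: uniform between 0 and min(cap, base * 2^attempt)."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_backoff(call, max_attempts=5, base=0.5, cap=20.0):
    """Retry a zero-arg callable on throttling errors with exponential
    backoff plus jitter; re-raise anything else, or the final failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if "Throttling" not in type(exc).__name__ or attempt == max_attempts - 1:
                raise
            time.sleep(jitter_delay(attempt, base, cap))
```

Usage: `call_with_backoff(lambda: client.invoke_model(modelId=model_id, body=body))`.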
Symptom · 02
ValidationException: 'Malformed input request' when calling InvokeModel
→
Fix
The request body format does not match the model's expected schema. Claude models on Bedrock require the Messages API format with 'anthropic_version': 'bedrock-2023-05-31'. Mistral and Llama use different formats. Check the model-specific API spec in the Bedrock documentation. Common mistake: using the legacy text-completion format for Claude 3.
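For reference, a body builder for Claude 3 models in the required Messages format. The `anthropic_version` string is the value Bedrock documents for Claude; the `max_tokens` and `temperature` defaults here are arbitrary choices:

```python
import json

def claude_messages_body(prompt, max_tokens=512, temperature=0.2):
    """InvokeModel body for Claude 3 on Bedrock (Messages API).
    Omitting 'anthropic_version', or sending the legacy text-completion
    format, is what triggers ValidationException."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "temperature": temperature,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
    })
```

Usage: `client.invoke_model(modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=claude_messages_body("Summarize this ticket"))`. Mistral and Llama bodies look nothing like this, so keep one builder per model family.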
Symptom · 03
InvokeModel call hangs for 60+ seconds then times out
→
Fix
Check boto3 read_timeout configuration. Default is 60s (or no timeout depending on version). For large prompts (>10K tokens), completions can take 30-120s. Set read_timeout=120 in your boto3 Config. Also check if the model is experiencing elevated latency — check the AWS Health Dashboard for the region.
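A starting-point client configuration along those lines. The numbers are suggestions to tune against your own p99 generation times, not canonical values, and the retry settings apply to the non-streaming client only (streaming clients need max_attempts=1, as covered later in this guide):

```python
# Client-side settings for long completions on the non-streaming client.
BEDROCK_CLIENT_SETTINGS = {
    "connect_timeout": 5,     # fail fast on network problems
    "read_timeout": 120,      # >10K-token prompts can generate for 30-120s
    "retries": {"max_attempts": 3, "mode": "adaptive"},
}

def make_bedrock_runtime_client(region="us-east-1"):
    import boto3
    from botocore.config import Config
    return boto3.client(
        "bedrock-runtime",
        region_name=region,
        config=Config(**BEDROCK_CLIENT_SETTINGS),
    )
```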
Symptom · 04
Bedrock Agent returns wrong answers but Lambda executes correctly with no errors
→
Fix
The agent's reasoning loop is misinterpreting your Lambda's response. Check the response envelope structure — Bedrock Agents requires 'messageVersion', 'response', 'actionGroup', 'apiPath', 'httpMethod', 'httpStatusCode', and 'responseBody' in the exact format. Missing or misnamed fields cause the agent to hallucinate a response instead of using your data. Add structured logging in the Lambda to capture the full event and response for comparison.
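The envelope the agent expects looks like this (for OpenAPI-schema Action Groups, as of this writing). A helper that echoes the routing fields back from the incoming event is the easiest way to avoid misnamed fields:

```python
import json

def action_group_response(event, body, status_code=200):
    """Build the exact response envelope Bedrock Agents requires from an
    Action Group Lambda. Echoing actionGroup/apiPath/httpMethod from the
    incoming event guarantees they match what the agent sent."""
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": event["apiPath"],
            "httpMethod": event["httpMethod"],
            "httpStatusCode": status_code,
            "responseBody": {
                # Content type key, then a 'body' field holding a JSON *string*.
                "application/json": {"body": json.dumps(body)}
            },
        },
    }
```

Your Lambda handler then ends with `return action_group_response(event, {"vpn_status": "connected"})` instead of hand-building the dict in three places.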
Symptom · 05
Streamed response cuts off mid-generation or contains duplicated content
→
Fix
The stream connection dropped mid-response. Check ALB idle timeout (default 60s — too low for long completions). Increase to 300s. Check boto3 retries — setting max_attempts > 1 on streaming clients causes boto3 to restart the stream from the beginning, duplicating content already sent to the client. Set retries to 1 on streaming clients and handle retries at the request level.
Symptom · 06
Knowledge Base returns 'ResourceNotFoundException' despite correct KB ID
→
Fix
The Knowledge Base exists but is not in 'Active' status. Data source ingestion must complete before the KB can serve queries. Check status: AWS Console > Bedrock > Knowledge Bases > select your KB > check 'Status' column. If 'Creating' or 'Updating', wait for ingestion to complete. If the KB was recently created, the initial embedding process can take 10-60 minutes depending on document volume.
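If you create or re-ingest Knowledge Bases from automation, it helps to wait for Active status programmatically. A sketch using the 'bedrock-agent' control-plane client (a third client, distinct from both runtime clients); the polling cadence and timeout are arbitrary:

```python
def kb_poll_decision(status):
    """Map a Knowledge Base status string to what the poller should do."""
    if status == "ACTIVE":
        return "done"
    if status in ("FAILED", "DELETING"):
        return "error"
    return "wait"  # CREATING / UPDATING — ingestion still running

def wait_for_kb_active(knowledge_base_id, poll_seconds=30, timeout_seconds=3600):
    """Block until the KB can serve queries, or the timeout elapses."""
    import time
    import boto3
    agent = boto3.client("bedrock-agent")
    for _ in range(0, timeout_seconds, poll_seconds):
        status = agent.get_knowledge_base(
            knowledgeBaseId=knowledge_base_id
        )["knowledgeBase"]["status"]
        decision = kb_poll_decision(status)
        if decision == "done":
            return True
        if decision == "error":
            raise RuntimeError(f"Knowledge Base entered status {status}")
        time.sleep(poll_seconds)
    return False  # timed out — inspect ingestion jobs in the console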
★ AWS Bedrock Triage Cheat Sheet — fast symptom-to-action for engineers investigating Bedrock failures. First 5 minutes.
ThrottlingException on every request
Immediate action
Check TPS quota and current usage for the model in Service Quotas.
Agent returns wrong answers while the Action Group Lambda executes correctly
Immediate action
Verify the response envelope has messageVersion, response.actionGroup, response.apiPath, response.httpMethod, response.httpStatusCode, response.responseBody. If the agent instruction is > 500 tokens, trim it.
Streamed output is duplicated or truncated
Immediate action
Set max_attempts=1 on streaming boto3 clients. Retries mid-stream restart from the beginning, duplicating already-sent content. Handle retries at the request level, not the stream level.
AWS Bedrock vs Self-Hosted Inference

| Attribute | AWS Bedrock (On-Demand) | Self-Hosted on SageMaker / EC2 |
|---|---|---|
| Time to first inference | Minutes (API key + boto3 call) | Days to weeks (instance setup, model download, server config) |
| Infrastructure ops burden | Zero — Amazon's problem | High — your team owns scaling, patching, CUDA versions |
| Latency consistency (p99) | Variable — shared fleet, expect 2-5x p50 | Predictable — dedicated hardware, tunable |
| Cost at low volume (<10M tokens/month) | Cheap — pure pay-per-token | Expensive — idle GPU compute is still billed |
| Cost at high volume (>1B tokens/month) | Expensive — per-token adds up fast | Cheaper if utilisation is high and model is stable |
| Model selection | Limited to Bedrock catalogue (Claude, Titan, Llama, Mistral, Cohere) | Any model you can deploy on your hardware |
| Throughput limits | Default 5 TPS for most models — requires support ticket to raise | Self-imposed — limited by your hardware |
| Cold start latency | None — fleet is always warm | Real — model loading can take 30-90s on first call |
Key takeaways
1. Bedrock's value isn't the models — it's the elimination of the MLOps surface area. The moment you start managing GPU instances, inference servers, and model versioning yourself, you've hired an invisible infrastructure team that doesn't ship features.
2. Default TPS quotas will end your production launch. 5 requests/second is not a starting point — it's a demo limit. File the Service Quotas increase before you write your first line of application code, not the week before go-live.
3. The right signal to reach for Bedrock: your team needs model output, not model ownership, and your volume is below the crossover point where Provisioned Throughput beats on-demand pricing. If you're shipping a feature to real users and your team has no ML engineers, Bedrock is almost always the correct first choice.
4. Bedrock Agents' internal token consumption is invisible to your application logs and will be 5-15x higher than the final response tokens suggest. If you're not metering agent token usage from CloudTrail or the invocation metrics in the stream stop event, your cost model is fiction.
Common Mistakes to Avoid (5 patterns)
Hardcoding the model ID string in application source code
Symptom
When AWS deprecates the version ID or you want to upgrade to a newer model version, you must do a multi-repo find-and-replace and redeploy every service that references the old ID.
Fix
Store model IDs in AWS Systems Manager Parameter Store or environment variables, injected at deploy time via your CDK/Terraform config. A single parameter update triggers a rolling deploy without code changes.
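A sketch of that pattern. The parameter name and environment-variable override are illustrative; `lru_cache` avoids an SSM round-trip on every request (clear it on deploy, or add a TTL, if you hot-swap models):

```python
import os
from functools import lru_cache

@lru_cache(maxsize=None)
def resolve_model_id(parameter_name="/myapp/bedrock/model-id"):
    """Resolve the Bedrock model ID at runtime: environment variable wins
    (local dev / tests), then SSM Parameter Store (deployed environments).
    The parameter path is a hypothetical example."""
    env_value = os.environ.get("BEDROCK_MODEL_ID")
    if env_value:
        return env_value
    import boto3
    ssm = boto3.client("ssm")
    return ssm.get_parameter(Name=parameter_name)["Parameter"]["Value"]
```

Upgrading to a new model version then becomes one `aws ssm put-parameter` call plus a rolling restart, not a multi-repo find-and-replace.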
Using the wrong boto3 client for Knowledge Bases
Symptom
Calling bedrock_runtime.retrieve_and_generate() produces AttributeError: 'BedrockRuntime' object has no attribute 'retrieve_and_generate' — looks like a boto3 version issue but it is not.
Fix
Knowledge Base operations use boto3.client('bedrock-agent-runtime'), not boto3.client('bedrock-runtime'). These are two different clients with different endpoints. The error message gives no indication of this.
Not requesting a Service Quotas increase before launch
Symptom
Default TPS for Claude 3 Sonnet is 5 requests/second in most regions for new accounts. At production load, every request beyond 5/s returns ThrottlingException with 'Too many requests, please wait before trying again'.
Fix
Go to AWS Service Quotas > Amazon Bedrock > find your model's 'On-demand throughput limit' and submit an increase request at least 10 business days before go-live. Default limits are designed for development, not production traffic.
Treating the streaming event loop like a simple for-loop without mid-stream error handling
Symptom
If the connection drops at token 300 of a 600-token completion, the generator stops silently and the client renders a half-finished response as if it were complete — silent data loss.
Fix
Wrap the event_stream iteration in try/except and yield an explicit error marker if the loop exits unexpectedly, so the client can detect incomplete responses. Never let a stream die silently.
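One way to do that, assuming the Claude 3 streaming event shape (text arrives in content_block_delta events). The error-marker format is an application-level convention you define with your client, not a Bedrock feature:

```python
import json

STREAM_ERROR_MARKER = json.dumps({"type": "stream_error", "partial": True})

def safe_stream(event_stream):
    """Wrap a Bedrock event stream so truncation is detectable.
    Yields text chunks; if the stream dies mid-response, yields an
    explicit error marker instead of stopping silently."""
    try:
        for event in event_stream:
            chunk = event.get("chunk")
            if not chunk:
                continue
            payload = json.loads(chunk["bytes"])
            delta = payload.get("delta", {})
            if delta.get("type") == "text_delta":
                yield delta["text"]
    except Exception:
        # Client sees the marker, knows the response is incomplete,
        # and can offer a retry instead of rendering half an answer.
        yield STREAM_ERROR_MARKER
```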
Using temperature=1.0 (or the model default) for structured output tasks
Symptom
High temperature causes the model to occasionally produce malformed JSON or off-schema responses that break your downstream parser. Intermittent failures are the hardest to debug.
Fix
Set temperature between 0.0 and 0.2 for any task where format correctness matters more than creativity. Add output validation with a retry on parse failure.
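A sketch of the validate-and-retry half of that advice. The retry callable would re-run your invoke_model call, ideally at temperature 0:

```python
import json

def parse_json_response(text, invoke_again=None, max_attempts=2):
    """Validate model output as JSON; on parse failure, call
    'invoke_again' (a zero-arg callable that re-runs the model) and
    try once more. Raises JSONDecodeError if all attempts fail."""
    for attempt in range(max_attempts):
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            if invoke_again is None or attempt == max_attempts - 1:
                raise
            text = invoke_again()
```

Intermittent schema failures drop sharply once temperature is at 0.0-0.2; the retry is the safety net for the stragglers.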
Interview Questions on This Topic
Q01 of 04 · SENIOR
Bedrock's on-demand pricing model uses a shared inference fleet. How does that affect your p99 latency SLO design, and what would you change architecturally if your feature requires consistent sub-500ms responses?
ANSWER
Shared fleet means you cannot control queue position, routing, or hardware allocation. p99 latency will be 2-5x p50 depending on time of day and overall fleet load. Design for p99 from day one by load testing at peak hours (9am-5pm weekdays), not at 2am when the fleet is idle. If sub-500ms p99 is a hard requirement, three options: (1) Provisioned Throughput with dedicated Model Units gives guaranteed throughput but at $43K+/month per MU. (2) Switch to SageMaker with a dedicated endpoint for predictable latency on your own hardware. (3) Architect around the latency variance — use streaming for user-facing responses so first-token latency is under 1s even if total generation takes 3s. Cache frequent queries with semantic similarity matching to avoid inference calls entirely for common patterns.
Q02 of 04 · SENIOR
When would you choose Bedrock Agents over building your own LLM orchestration loop with LangChain or a custom state machine? What's the concrete threshold where Agents becomes more pain than it's worth?
ANSWER
Choose Bedrock Agents when your task is: 1-3 API calls with straightforward routing, no complex conditional logic, and the agent's ReAct loop can handle the reasoning. The sweet spot is internal tooling — HR bots, DevOps assistants, support bots that look up order status. The threshold where Agents breaks down: complex branching logic (if X then Y else if Z then retry W), saga patterns with compensating transactions, cross-session memory beyond 1 hour, or tasks requiring more than 4-5 Action Group calls. At that point, the agent's reasoning loop becomes unpredictable and the token cost explodes. Build a custom state machine with Step Functions or Temporal where you control the execution path deterministically. Also walk away from Agents if you need fine-grained error handling — agents silently hallucinate when Action Groups fail, and debugging the reasoning chain is opaque.
Q03 of 04 · SENIOR
A Bedrock Agent is calling your Action Group Lambda and intermittently returning wrong answers without any errors in CloudWatch. The Lambda is executing correctly. What's your debugging process — and what's the most likely root cause?
ANSWER
Debugging process: (1) Add structured logging in the Lambda to capture the full incoming event and the exact response being returned. (2) Check the response envelope against the Bedrock Agent specification — verify all 6 required fields: messageVersion, response.actionGroup, response.apiPath, response.httpMethod, response.httpStatusCode, response.responseBody. (3) If the envelope is correct, check if the Lambda response body is too verbose or contains fields the agent's reasoning loop misinterprets. (4) Check the agent's instruction prompt — if it is too long (>500 tokens), the agent may truncate context from earlier reasoning steps. Most likely root cause: the response envelope has a missing or misnamed field. Bedrock Agents does not surface envelope errors to CloudWatch — it silently falls back to generating a response from its training data instead of using your Lambda's data. The intermittent nature suggests the error occurs only for certain API paths where the envelope differs slightly.
Q04 of 04 · SENIOR
Your team is running 500 million tokens per month through Bedrock on-demand and the bill is becoming significant. Walk me through how you'd evaluate whether Provisioned Throughput makes financial sense, and what data you'd need before committing to a reserved MU.
ANSWER
Step 1: Measure actual monthly token spend on on-demand. For Claude 3 Sonnet at $0.003/1K input and $0.015/1K output, 500M tokens (assuming 70% input, 30% output) costs approximately $3,300/month. Step 2: Compare against Provisioned Throughput. One MU for Claude 3 Sonnet is $60/hour = $43,200/month. At 500M tokens/month, on-demand is dramatically cheaper. Step 3: Calculate the crossover point. Provisioned Throughput makes sense when on-demand spend exceeds $43,200/month — at the same 70/30 mix, that is roughly 6.5 billion tokens/month for Claude 3 Sonnet. Step 4: Factor in the non-cost benefits: guaranteed TPS (no ThrottlingExceptions), predictable latency, and no quota limits. These may justify Provisioned Throughput at a lower token volume if your feature has strict latency or availability SLAs. Step 5: Validate traffic stability for 30 days. Provisioned Throughput has a 1-month minimum commitment. If your traffic is spiky or seasonal, you may pay for idle MU capacity. Data needed: 30-day token usage histogram (not just total), p99 latency on current on-demand, TPS peak vs quota, and feature roadmap stability (will this feature exist in 3 months?).
Frequently Asked Questions (4)
01
How much does AWS Bedrock actually cost in production?
It depends entirely on token volume and model choice, but here's the concrete breakdown: Claude 3 Sonnet costs $0.003 per 1,000 input tokens and $0.015 per 1,000 output tokens at time of writing. A typical customer support response that consumes 500 input tokens and 200 output tokens costs roughly $0.0045 — under half a cent. At 100,000 requests per day that's $450/day or ~$13,500/month just in inference costs, before any Agents or Knowledge Base overhead. The cost curve is steep at scale, which is why you should be tracking token usage from day one, not month three.
02
What's the difference between AWS Bedrock and SageMaker for running AI models?
Bedrock is a managed API for calling pre-trained foundation models — you don't manage any infrastructure. SageMaker is a full ML platform where you can deploy any model (including custom or fine-tuned ones) on dedicated endpoints you control. The rule of thumb: use Bedrock when you want to call a foundation model and ship fast; use SageMaker when you need consistent low-latency inference, models outside Bedrock's catalogue, or full control over the serving environment.
03
How do I handle Bedrock ThrottlingException in production without losing requests?
Use boto3's 'adaptive' retry mode with max_attempts set to 3-5, which applies exponential backoff with jitter automatically. For user-facing features, wrap the call in a queue with a dead-letter path so throttled requests don't just disappear. Long-term fix: request a Service Quotas increase for your specific model's TPS limit via the AWS console — the default limits are designed for development, not production traffic.
04
Can Bedrock Agents maintain conversation history across multiple user sessions?
Within a single session yes — Bedrock Agents maintains context for up to one hour using a sessionId you provide. Across separate sessions or after the one-hour timeout, no — the agent has zero memory. For persistent cross-session memory you need to store conversation history in your own database (DynamoDB is the obvious choice), retrieve the relevant history at the start of each new session, and inject it into the agent's initial prompt or as part of your Action Group context. This is a design requirement, not a configuration option.
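A sketch of that pattern with DynamoDB. The table name, key schema ('user_id' hash key with a timestamp range key), and item shape are all illustrative; `format_history` produces the block you would prepend to the agent's first prompt of a new session:

```python
def format_history(turns):
    """Pure formatter: render stored turns as a compact context block."""
    lines = [f"{t['role']}: {t['text']}" for t in turns]
    if not lines:
        return ""
    return "Previous conversation:\n" + "\n".join(lines)

def load_session_context(user_id, table_name="agent-conversations", max_turns=10):
    """Fetch the most recent turns for this user from DynamoDB and
    format them for injection into the agent's initial prompt."""
    import boto3
    from boto3.dynamodb.conditions import Key
    table = boto3.resource("dynamodb").Table(table_name)
    items = table.query(
        KeyConditionExpression=Key("user_id").eq(user_id),
        ScanIndexForward=False,  # newest first
        Limit=max_turns,
    )["Items"]
    return format_history(reversed(items))  # chronological order for the prompt
```

Capping `max_turns` matters twice over here: it bounds the DynamoDB read and it keeps the injected history from inflating the agent's per-step input tokens.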