S3 Principal * Bucket Policy — Why 50K SSNs Hit Google

One Principal * bucket policy made 50K customer SSNs searchable on Google.
⚙️ Intermediate — basic DevOps knowledge assumed
In this tutorial, you'll learn
  • S3 object keys are flat strings, not real file paths — 'folder' prefixes are a UI convenience, not a filesystem hierarchy. This changes how you design efficient list and filter operations at scale.
  • Presigned URLs are the production-grade pattern for user file access — they keep objects private while granting time-limited, per-object access without exposing IAM credentials to the client.
  • Every bucket that receives large file uploads needs an AbortIncompleteMultipartUpload lifecycle rule — without it, failed partial uploads accumulate silently and cost you real money.
Quick Answer
  • S3 stores data as objects in globally unique buckets
  • Buckets are region-scoped; objects have keys (flat namespace) — not folders
  • Access control: bucket policies (persistent) vs presigned URLs (time-limited)
  • Lifecycle policies auto-tier objects (Standard → IA → Glacier) to cut costs by up to 95%
  • Multipart uploads required for files >5GB; incomplete uploads silently cost money
  • Enable versioning on every production bucket to prevent accidental data loss
🚨 START HERE

S3 Quick Debug Cheat Sheet

Five common S3 production issues and the exact commands to diagnose and fix them
🟡

Can't access object — AccessDenied even with correct credentials

Immediate Action: Run `aws s3api get-bucket-policy --bucket <bucket>` to check if a deny policy exists
Commands
aws s3api get-bucket-policy-status --bucket <bucket> --query 'PolicyStatus.IsPublic'
aws s3api get-object-acl --bucket <bucket> --key <key>
Fix Now: If the bucket policy has an explicit Deny, update it. If the object ACL is private but the bucket policy allows public access, the bucket policy wins, so fix the policy.
🟡

Upload speeds are terrible for large files

Immediate Action: Check file size; if >100MB, abort current upload and use multipart upload
Commands
aws s3api list-multipart-uploads --bucket <bucket>   # check for hanging uploads
aws configure set s3.max_concurrent_requests 20
Fix Now: For large files, run `aws s3 cp` with `--cli-read-timeout 0` and `--cli-connect-timeout 0` (0 disables the timeout). For long-distance transfers, enable S3 Transfer Acceleration on the bucket and append `--endpoint-url https://s3-accelerate.amazonaws.com`; the CLI adds the bucket name to the hostname itself.
🟡

Object deleted accidentally — need recovery

Immediate Action: Check if versioning is enabled on the bucket
Commands
aws s3api get-bucket-versioning --bucket <bucket>
aws s3api list-object-versions --bucket <bucket> --prefix <key>
Fix Now: If versioning is enabled, restore the object by copying from a previous version (quote the copy source so the shell does not try to expand the `?`): `aws s3api copy-object --bucket <bucket> --copy-source "<bucket>/<key>?versionId=<versionId>" --key <key>`
🟡

Bucket policy not taking effect

Immediate Action: Verify the policy syntax and that it's attached to the correct bucket
Commands
aws s3api get-bucket-policy --bucket <bucket>
aws iam simulate-custom-policy --policy-input-list file://policy.json --action-names s3:GetObject --resource-arns arn:aws:s3:::<bucket>/*
Fix Now: Use the AWS Policy Simulator to test. Common mistake: missing `/*` at the end of the Resource ARN for object-level actions such as s3:GetObject. If using Principal with an IAM role, ensure the role ARN is correct.
🟡

S3 website endpoint returns 403 Forbidden

Immediate Action: Verify the bucket policy allows public read and that Static Website Hosting is enabled
Commands
aws s3api get-bucket-website --bucket <bucket>
aws s3api get-bucket-policy --bucket <bucket>
Fix Now: Ensure the bucket policy has `{"Effect":"Allow","Principal":"*","Action":"s3:GetObject","Resource":"arn:aws:s3:::<bucket>/*"}`. Also check that the bucket's Block Public Access settings are not overriding that policy.
Production Incident

Public Bucket Exposed 50,000 Customer Records

An engineer made a bucket public to share one invoice PDF. Within hours, search engines had indexed every file in the bucket — including 50,000 customer PII records.
Symptom: Bucket objects appeared in Google search results. Legal notified the team that customer data (names, addresses, SSNs) was publicly accessible.
Assumption: Making the bucket public was the quickest way to share a file with a client. The engineer assumed no one would discover the bucket URL.
Root cause: The bucket policy granted s3:GetObject to Principal * without any prefix restriction. Every object in the bucket became world-readable.
Fix:
  1. Immediately update the bucket policy to deny all public access.
  2. Enable Block Public Access at the account level.
  3. Use presigned URLs for all future file sharing.
  4. Rotate any exposed credentials found in the objects.
Key Lesson
  • Never make a bucket public. Always use presigned URLs for per-file sharing.
  • Enable S3 Block Public Access at the account level. It's a safety net that prevents accidental public exposure.
  • If you must serve public static assets, use CloudFront with an Origin Access Control (OAC) to keep the bucket itself private.
Production Debug Guide

Common symptoms and immediate actions for S3 issues in production

AccessDenied when trying to read an object: Check bucket policy, IAM role permissions, and whether Block Public Access is enabled. Use aws s3api get-object-acl --bucket <name> --key <key> to verify object ownership.
Slow uploads or downloads (>500ms latency): Check client region vs bucket region. Use S3 Transfer Acceleration for large objects across continents. For uploads >100MB, switch to multipart upload.
Bucket creation fails with 'BucketAlreadyExists': Bucket names are globally unique, so someone else already owns this one. Try a more specific name (e.g., acme-prod-eu-logs). If you deleted a bucket with that name yourself, the name can take a while to become available again, so don't rely on delete-and-recreate.
Unexpected high S3 bill: List incomplete multipart uploads with aws s3api list-multipart-uploads and add a lifecycle rule to abort them. Review storage class transitions, since objects may be stuck in Standard. Use S3 Inventory for granular cost analysis.

Every modern application eventually needs somewhere to put files — profile pictures, invoice PDFs, video uploads, database backups, static websites. The moment you outgrow a single server's disk, you need distributed, durable storage. AWS S3 is where the industry landed, and it's powered everything from Netflix's video catalogue to your favourite startup's user uploads for nearly two decades. If you're building anything serious on AWS, S3 is the first service you'll touch.

Before S3, teams had to spin up dedicated file servers, worry about disk failures, manually handle replication, and scramble when traffic spiked. S3 solved all of that in one API. It stores your data across multiple physical locations automatically, scales to literally exabytes without any configuration, and charges you only for what you use. The result: you stop thinking about storage infrastructure and start thinking about your product.

By the end of this article you'll know exactly how S3 is structured, how to create buckets and upload objects using both the AWS CLI and the Python Boto3 SDK, how to control who can access your data and when, and the real-world patterns that experienced engineers use daily — plus the costly gotchas that trip up everyone the first time.

Buckets and Objects — How S3 is Actually Structured

S3 has exactly two building blocks: buckets and objects. A bucket is a top-level container — think of it as a named partition of Amazon's storage infrastructure. Every bucket name must be globally unique across all AWS accounts everywhere. Not just unique to you — unique to every person using S3 on Earth. That's why 'images' is taken, but 'acme-corp-product-images-prod' probably isn't.
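Because a rejected name costs a round trip to AWS, it can help to check the format locally first. This is a minimal sketch, assuming the rules for general-purpose buckets: 3 to 63 characters, lowercase letters, digits, hyphens and dots, starting and ending with a letter or digit, no consecutive dots, and not shaped like an IP address. The helper name `is_plausible_bucket_name` is hypothetical, and a local check cannot tell you whether the name is already taken.

```python
import re

# Lowercase letters, digits, hyphens and dots; first and last char alphanumeric.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
_IP_LIKE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def is_plausible_bucket_name(name):
    """Return True when the name passes the basic format rules."""
    if not _BUCKET_NAME.fullmatch(name):
        return False
    if ".." in name or _IP_LIKE.fullmatch(name):
        return False
    return True


for candidate in ["acme-corp-product-images-prod", "Images", "my_bucket", "ab", "192.168.1.1"]:
    print(f"{candidate!r}: {is_plausible_bucket_name(candidate)}")
```

Only the first candidate passes; the others fail on uppercase letters, an underscore, length, and the IP-address shape respectively.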

An object is any file you store inside a bucket. It has three parts: the key (the file path, e.g. 'invoices/2024/jan/invoice-001.pdf'), the data itself (up to 5TB per object), and metadata (key-value pairs like content type or custom tags).

Here's the important mental model shift: S3 is NOT a filesystem. There are no real folders. 'invoices/2024/jan/' is just a prefix in the key name. The AWS console and SDK simulate folders for your convenience, but under the hood every object lives flat in the bucket identified by its full key string. This matters when you're listing, filtering, or managing costs at scale.
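To make the "no real folders" point concrete, here is a toy model of what a prefix-and-delimiter listing does over a flat set of keys. It is a sketch of the idea only, not S3's implementation; the function name `list_keys` and the sample keys are made up for illustration.

```python
def list_keys(keys, prefix="", delimiter="/"):
    """Mimic a prefix + delimiter listing over a flat collection of keys."""
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        remainder = key[len(prefix):]
        if delimiter and delimiter in remainder:
            # Everything up to the next delimiter is rolled into one "folder" entry.
            common_prefixes.add(prefix + remainder.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


flat_keys = [
    "invoices/2024/jan/invoice-001.pdf",
    "invoices/2024/feb/invoice-002.pdf",
    "invoices/readme.txt",
    "avatars/user-4821/profile.jpg",
]

print(list_keys(flat_keys, prefix="invoices/"))
# (['invoices/readme.txt'], ['invoices/2024/'])
print(list_keys(flat_keys, prefix="invoices/2024/"))
# ([], ['invoices/2024/feb/', 'invoices/2024/jan/'])
```

The "folders" in the result are computed on the fly from string prefixes; nothing called `invoices/2024/` is stored anywhere.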

Buckets are also region-specific. When you create a bucket in us-east-1, your data lives there unless you explicitly set up replication. Always create buckets in the region closest to your users or your compute layer.

s3_bucket_and_object_basics.sh · BASH
#!/bin/bash
# ── Prerequisites: AWS CLI installed and configured with `aws configure` ──

# 1. Create a new bucket in us-east-1
#    Bucket names: lowercase, 3-63 chars, no underscores, globally unique
aws s3api create-bucket \
  --bucket acme-corp-user-uploads-dev \
  --region us-east-1
# Note: us-east-1 is the only region that does NOT need a LocationConstraint.
# Every other region requires --create-bucket-configuration LocationConstraint=<region>

# 2. Upload a local file as an S3 object
#    The key here is 'avatars/user-4821/profile.jpg'
#    S3 doesn't create a folder -- 'avatars/user-4821/' is just part of the key name
aws s3 cp ./profile.jpg \
  s3://acme-corp-user-uploads-dev/avatars/user-4821/profile.jpg

# 3. List objects using a prefix filter (simulates folder browsing)
#    This returns ONLY objects whose key starts with 'avatars/user-4821/'
aws s3 ls s3://acme-corp-user-uploads-dev/avatars/user-4821/

# 4. Download the object back to verify the round-trip
aws s3 cp \
  s3://acme-corp-user-uploads-dev/avatars/user-4821/profile.jpg \
  ./profile_downloaded.jpg

# 5. Delete the object
aws s3 rm s3://acme-corp-user-uploads-dev/avatars/user-4821/profile.jpg
▶ Output
# Step 1 -- create-bucket
{
"Location": "/acme-corp-user-uploads-dev"
}

# Step 2 -- cp upload
upload: ./profile.jpg to s3://acme-corp-user-uploads-dev/avatars/user-4821/profile.jpg

# Step 3 -- ls
2024-06-12 14:03:22 84231 profile.jpg

# Step 4 -- cp download
download: s3://acme-corp-user-uploads-dev/avatars/user-4821/profile.jpg to ./profile_downloaded.jpg

# Step 5 -- rm
delete: s3://acme-corp-user-uploads-dev/avatars/user-4821/profile.jpg
⚠ Watch Out: The us-east-1 LocationConstraint Trap
If you run create-bucket with --create-bucket-configuration LocationConstraint=us-east-1, AWS throws a cryptic 'InvalidLocationConstraint' error. us-east-1 is the default region and must NOT have a LocationConstraint. Every other region — eu-west-1, ap-southeast-2, etc. — requires it. This catches almost everyone on their first cross-region script.
📊 Production Insight
Using prefix-based folder simulation means listing 10M objects under one prefix is slow.
S3 ListObjectsV2 is limited to 1000 keys per call.
Rule: design your key hierarchy to keep object counts per prefix under 100k for fast listing.
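Because each ListObjectsV2 call returns at most 1000 keys, a full listing has to follow continuation tokens. The sketch below shows that loop against any client object exposing `list_objects_v2`; the `FakeS3Client` class is a hypothetical stand-in so the example runs without AWS access, and `iter_keys` is an illustrative name. With boto3 you would more likely reach for the built-in paginator, which does the same thing.

```python
def iter_keys(client, bucket, prefix=""):
    """Yield every key under a prefix, following continuation tokens."""
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        for item in page.get("Contents", []):
            yield item["Key"]
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


class FakeS3Client:
    """Hypothetical offline stand-in that pages results like the real API."""

    def __init__(self, keys, page_size=1000):
        self._keys = sorted(keys)
        self._page_size = page_size
        self.calls = 0

    def list_objects_v2(self, Bucket, Prefix="", ContinuationToken=None):
        self.calls += 1
        matching = [k for k in self._keys if k.startswith(Prefix)]
        start = int(ContinuationToken or 0)
        chunk = matching[start:start + self._page_size]
        end = start + len(chunk)
        page = {"Contents": [{"Key": k} for k in chunk], "IsTruncated": end < len(matching)}
        if page["IsTruncated"]:
            page["NextContinuationToken"] = str(end)
        return page


fake = FakeS3Client(f"logs/{i:05d}.txt" for i in range(2500))
keys = list(iter_keys(fake, "example-bucket", prefix="logs/"))
print(len(keys), fake.calls)  # 2500 3
```

Listing 2500 keys takes three sequential calls here; at 10M keys under one prefix that becomes 10,000 sequential calls, which is why key design matters.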
🎯 Key Takeaway
S3 object keys are flat strings, not real file paths.
'nested/folders/' is just a key prefix.
Design key structure for performance, not aesthetics.

Access Control — Who Can See Your Files and Why It Matters

By default, every S3 bucket and every object inside it is completely private. Nothing is publicly accessible unless you deliberately make it so. This is the right default, but it means you need to understand the two main ways to grant access: bucket policies and presigned URLs.

A bucket policy is a JSON document attached to the bucket that grants broad, persistent permissions — for example, allowing your application's IAM role to read everything under the 'invoices/' prefix, or making a 'public-assets/' prefix readable by the entire internet for a static website. Bucket policies are evaluated on every request to that bucket, so they're perfect for service-to-service access.

A presigned URL is the smarter choice for user-facing file access. Instead of making objects public, you generate a time-limited URL server-side that temporarily grants access to one specific object. The URL carries a cryptographic signature (derived from your credentials, without exposing the secret key) and an expiry timestamp. When it expires, access is gone automatically, with no cleanup needed. This is how every serious application handles file downloads and uploads: the backend stays in control, and the client gets a URL that works just long enough.
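The expiry is not hidden inside the signature; it can be read straight from the query string. Here is a minimal sketch, assuming a Signature Version 4 URL that carries the `X-Amz-Date` and `X-Amz-Expires` parameters. The function name `presigned_url_expiry` and the sample URL with its placeholder signature are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_url_expiry(url):
    """Return the UTC datetime at which a SigV4 presigned URL stops working."""
    query = parse_qs(urlsplit(url).query)
    signed_at = datetime.strptime(query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(query["X-Amz-Expires"][0]))


# Hypothetical URL with a placeholder signature.
example_url = (
    "https://example-bucket.s3.amazonaws.com/invoices/2024/jan/invoice-001.pdf"
    "?X-Amz-Algorithm=AWS4-HMAC-SHA256"
    "&X-Amz-Date=20240612T140322Z"
    "&X-Amz-Expires=3600"
    "&X-Amz-SignedHeaders=host"
    "&X-Amz-Signature=placeholder"
)

print(presigned_url_expiry(example_url).isoformat())  # 2024-06-12T15:03:22+00:00
```

One caveat worth knowing: a URL signed with temporary credentials stops working when those credentials expire, even if the time computed here is later.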

Never make a bucket public unless it's genuinely meant to serve public static assets. Even then, use a CloudFront distribution in front of it rather than direct public bucket access.

s3_access_control.py · PYTHON
import boto3
import json
from datetime import datetime

# ── Boto3 picks up credentials from env vars, ~/.aws/credentials, or IAM role ──
s3_client = boto3.client('s3', region_name='us-east-1')

BUCKET_NAME = 'acme-corp-user-uploads-dev'

# ── PART 1: Attach a bucket policy ──
# This policy allows ONLY our application's IAM role to read objects
# under the 'invoices/' prefix. Nothing else in the bucket is affected.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAppRoleInvoiceRead",
            "Effect": "Allow",
            "Principal": {
                # Replace with your actual IAM role ARN
                "AWS": "arn:aws:iam::123456789012:role/AcmeAppServerRole"
            },
            "Action": "s3:GetObject",
            # The /* at the end means: any object under this prefix
            "Resource": f"arn:aws:s3:::{BUCKET_NAME}/invoices/*"
        }
    ]
}

s3_client.put_bucket_policy(
    Bucket=BUCKET_NAME,
    Policy=json.dumps(bucket_policy)  # Policy must be serialised to a JSON string
)
print(f"Bucket policy applied to {BUCKET_NAME}")

# ── PART 2: Generate a presigned URL for secure, time-limited object access ──
# Use case: a user clicks 'Download Invoice' in your web app.
# Your backend generates this URL and returns it. The browser hits S3 directly.
# The object itself stays private the whole time.

object_key = 'invoices/2024/jan/invoice-001.pdf'
expiry_seconds = 3600  # URL valid for exactly 1 hour

presigned_download_url = s3_client.generate_presigned_url(
    ClientMethod='get_object',  # 'put_object' works the same way for uploads
    Params={
        'Bucket': BUCKET_NAME,
        'Key': object_key
    },
    ExpiresIn=expiry_seconds
)

print(f"\nPresigned download URL (valid for {expiry_seconds}s):")
print(presigned_download_url)
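from datetime import timezone  # datetime.utcnow() is deprecated since Python 3.12
now_utc = datetime.now(timezone.utc)
print(f"\nURL expires at approximately: {now_utc:%Y-%m-%d %H:%M:%S} + {expiry_seconds // 60} minutes")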

# ── PART 3: Generate a presigned URL for direct-upload from the browser ──
# The client uploads straight to S3 — your server never handles the file bytes.
# This is the standard pattern for large file uploads.
presigned_upload_url = s3_client.generate_presigned_url(
    ClientMethod='put_object',
    Params={
        'Bucket': BUCKET_NAME,
        'Key': 'avatars/user-9001/profile.jpg',
        'ContentType': 'image/jpeg'  # Enforce content type at signing time
    },
    ExpiresIn=300  # Upload must start within 5 minutes
)

print(f"\nPresigned upload URL (valid for 300s):")
print(presigned_upload_url)
▶ Output
Bucket policy applied to acme-corp-user-uploads-dev

Presigned download URL (valid for 3600s):
https://acme-corp-user-uploads-dev.s3.amazonaws.com/invoices/2024/jan/invoice-001.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIOSFODNN7EXAMPLE%2F20240612%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240612T140322Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=3d8a2f...

URL expires at approximately: 2024-06-12 14:03:22 + 60 minutes

Presigned upload URL (valid for 300s):
https://acme-corp-user-uploads-dev.s3.amazonaws.com/avatars/user-9001/profile.jpg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIOSFODNN7EXAMPLE%2F20240612%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240612T140322Z&X-Amz-Expires=300&X-Amz-SignedHeaders=content-type%3Bhost&X-Amz-Signature=9c1b4e...
💡Pro Tip: Presigned Upload URLs = No File Size Limit on Your Server
When you generate a presigned PUT URL and give it to the browser, the user uploads directly from their device to S3. Your application server never receives the file bytes, so you're not capped by your server's memory and you're not paying egress costs for the upload. A 4 GB video file uploads just as smoothly as a 10 KB thumbnail. One limit remains: a single PUT request can carry at most 5 GB, so anything larger needs a multipart upload with a presigned URL per part. This pattern is used by almost every major SaaS product.
📊 Production Insight
Bucket policies, IAM policies, and ACLs are evaluated together, not in a fixed order: a request needs at least one Allow and no explicit Deny.
If you have a Deny in any layer, access is denied — no Allow overrides a Deny.
Rule: use bucket policies for bucket-wide rules, IAM for user-specific, presigned URLs for per-object client access.
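The "no Allow overrides a Deny" rule is easy to model. The sketch below pools statements from every layer and applies the decision order: implicit deny by default, any matching Allow grants access, any matching Deny wins outright. It is a deliberately simplified illustration; it ignores principals, conditions, cross-account rules, and the case-insensitivity of action names, and the `evaluate` function and sample statements are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    return value if isinstance(value, list) else [value]


def _matches(patterns, value):
    """True when the value matches any pattern; '*' is treated as a wildcard."""
    return any(fnmatchcase(value, pattern) for pattern in _as_list(patterns))


def evaluate(statements, action, resource):
    """Return 'Allow' or 'Deny' for one request against a pool of statements."""
    decision = "Deny"  # implicit deny: nothing is allowed until a statement says so
    for stmt in statements:
        if not (_matches(stmt["Action"], action) and _matches(stmt["Resource"], resource)):
            continue
        if stmt["Effect"] == "Deny":
            return "Deny"  # an explicit Deny ends the evaluation immediately
        decision = "Allow"
    return decision


statements = [
    {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::example-bucket/*"},
    {"Effect": "Deny", "Action": "s3:*", "Resource": "arn:aws:s3:::example-bucket/private/*"},
]

print(evaluate(statements, "s3:GetObject", "arn:aws:s3:::example-bucket/public/logo.png"))  # Allow
print(evaluate(statements, "s3:GetObject", "arn:aws:s3:::example-bucket/private/ssn.csv"))  # Deny
print(evaluate(statements, "s3:PutObject", "arn:aws:s3:::example-bucket/public/logo.png"))  # Deny
```

The second call is denied explicitly and the third implicitly; the distinction matters when debugging, because only an explicit Deny has to be removed rather than an Allow added.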
🎯 Key Takeaway
Presigned URLs are the production-grade pattern for user file access.
They keep objects private while granting time-limited, per-object access.
Never expose your bucket or objects to the public internet.

Storage Classes and Lifecycle Policies — Cutting Your S3 Bill in Half

Not all data is accessed equally often. Your app might read a user's profile picture dozens of times a day, but that invoice from January 2021? Probably never again unless there's an audit. S3 gives you storage classes — different pricing tiers based on how frequently and quickly you need to access data.

S3 Standard is the default and the most expensive per GB. It's designed for data you access regularly with millisecond latency. S3 Standard-IA (Infrequent Access) costs about 45% less per GB but charges a per-retrieval fee, making it ideal for backups and older content you occasionally need. S3 Glacier Instant Retrieval drops the cost further for archival data you access maybe once a quarter. S3 Glacier Deep Archive is the cheapest tier — pennies per GB per month — for data you might need once a year and can wait up to 12 hours to retrieve.

The power move is combining storage classes with lifecycle policies: automated rules that transition objects to cheaper tiers (or delete them entirely) based on their age. You configure this once, and S3 handles the cost optimisation forever. A common real-world pattern: keep user uploads in Standard for 30 days, move to Standard-IA for 6 months, then Glacier Deep Archive indefinitely — with zero manual work after initial setup.
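A quick calculation shows where the savings come from. This sketch uses the approximate per-GB prices from the storage class comparison table in this article; real prices vary by region and change over time, and the figures cover storage only, not retrieval or request fees. The names `PRICE_PER_GB_MONTH`, `monthly_cost`, and `saving_vs_standard` are illustrative.

```python
# Approximate USD per GB per month, as quoted in this article's comparison table.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}


def monthly_cost(gigabytes, storage_class):
    """Storage-only cost for one month in the given class."""
    return gigabytes * PRICE_PER_GB_MONTH[storage_class]


def saving_vs_standard(storage_class):
    """Fraction saved on storage alone, ignoring retrieval and request fees."""
    return 1 - PRICE_PER_GB_MONTH[storage_class] / PRICE_PER_GB_MONTH["STANDARD"]


for name in PRICE_PER_GB_MONTH:
    print(f"{name:<13} ${monthly_cost(1024, name):>6.2f}/month for 1 TiB  "
          f"({saving_vs_standard(name):.0%} cheaper than Standard)")
```

At these prices Standard-IA comes out roughly 46% cheaper than Standard and Deep Archive roughly 96% cheaper, which is the basis for the "up to 95%" figure. Retrieval fees can erase the saving for data that is read often, so the access pattern matters as much as the price per GB.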

s3_lifecycle_policy.py · PYTHON
import boto3

s3_client = boto3.client('s3', region_name='us-east-1')

BUCKET_NAME = 'acme-corp-user-uploads-dev'

# ── Apply a lifecycle policy that automatically manages storage costs ──
# This policy covers objects under the 'invoices/' prefix only.
# Other prefixes (avatars/, reports/) are not affected.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "InvoiceArchivalPolicy",
            "Status": "Enabled",  # 'Disabled' to pause without deleting the rule
            "Filter": {
                # Only apply this rule to objects whose key starts with 'invoices/'
                "Prefix": "invoices/"
            },
            "Transitions": [
                {
                    # After 30 days, move to Standard-IA
                    # ~45% cheaper per GB, small per-retrieval fee applies
                    "Days": 30,
                    "StorageClass": "STANDARD_IA"
                },
                {
                    # After 180 days, move to Glacier Instant Retrieval
                    # ~83% cheaper than Standard, millisecond retrieval still available
                    "Days": 180,
                    "StorageClass": "GLACIER_IR"
                },
                {
                    # After 365 days, move to Glacier Deep Archive
                    # Cheapest tier. Retrieval takes up to 12 hours.
                    # Perfect for compliance-required long-term retention
                    "Days": 365,
                    "StorageClass": "DEEP_ARCHIVE"
                }
            ],
            # Permanently delete objects after 7 years (2555 days)
            # Adjust to match your legal/compliance retention requirements
            "Expiration": {
                "Days": 2555
            }
        },
        {
            # Second rule: automatically clean up incomplete multipart uploads
            # Without this, partially uploaded large files silently accumulate costs
            "ID": "CleanupIncompleteMultipartUploads",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # Apply to the entire bucket
            "AbortIncompleteMultipartUpload": {
                "DaysAfterInitiation": 7  # Abort uploads not completed within 7 days
            }
        }
    ]
}

response = s3_client.put_bucket_lifecycle_configuration(
    Bucket=BUCKET_NAME,
    LifecycleConfiguration=lifecycle_configuration
)

print(f"Lifecycle policy applied. HTTP status: {response['ResponseMetadata']['HTTPStatusCode']}")

# ── Verify the policy was saved correctly ──
saved_policy = s3_client.get_bucket_lifecycle_configuration(Bucket=BUCKET_NAME)
for rule in saved_policy['Rules']:
    print(f"\nRule ID: {rule['ID']} | Status: {rule['Status']}")
    if 'Transitions' in rule:
        for transition in rule['Transitions']:
            print(f"  → After {transition['Days']} days: {transition['StorageClass']}")
    if 'Expiration' in rule:
        print(f"  → Delete after: {rule['Expiration']['Days']} days")
▶ Output
Lifecycle policy applied. HTTP status: 200

Rule ID: InvoiceArchivalPolicy | Status: Enabled
  → After 30 days: STANDARD_IA
  → After 180 days: GLACIER_IR
  → After 365 days: DEEP_ARCHIVE
  → Delete after: 2555 days

Rule ID: CleanupIncompleteMultipartUploads | Status: Enabled
🔥Interview Gold: The Hidden Multipart Upload Cost Leak
Interviewers love asking about unexpected S3 costs. The answer they want: incomplete multipart uploads. When a large file upload fails halfway, S3 stores the uploaded parts and charges you for them — indefinitely, silently, without showing them in a normal bucket listing. The fix is exactly what's in the code above: an AbortIncompleteMultipartUpload lifecycle rule. Every production bucket should have one.
📊 Production Insight
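Lifecycle rules cannot move objects to Standard-IA (or One Zone-IA) until they are at least 30 days old.
If you try to set a shorter transition to those classes, the API returns an error.
Rule: also check each tier's minimum storage duration (30 days for Standard-IA, 90 for Glacier Instant and Flexible Retrieval, 180 for Deep Archive), because leaving a tier early is still billed for the full minimum.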
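A small pre-flight check can catch a bad transition list before the API rejects it. This is a sketch under two stated assumptions: the 30-day minimum age applies to the infrequent-access classes, and the transitions are listed in the order the object should move through them. The helper name `check_transitions` and the constants are hypothetical, and the check is not exhaustive.

```python
MIN_DAYS_BEFORE_IA = 30
IA_CLASSES = {"STANDARD_IA", "ONEZONE_IA"}


def check_transitions(transitions):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    previous_days = -1
    for step in transitions:
        days, storage_class = step["Days"], step["StorageClass"]
        if storage_class in IA_CLASSES and days < MIN_DAYS_BEFORE_IA:
            problems.append(
                f"{storage_class} at day {days}: needs at least {MIN_DAYS_BEFORE_IA} days"
            )
        if days <= previous_days:
            problems.append(
                f"{storage_class} at day {days}: must come after day {previous_days}"
            )
        previous_days = days
    return problems


# The transitions from the lifecycle configuration in this section.
article_transitions = [
    {"Days": 30, "StorageClass": "STANDARD_IA"},
    {"Days": 180, "StorageClass": "GLACIER_IR"},
    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
]

print(check_transitions(article_transitions))                          # []
print(check_transitions([{"Days": 7, "StorageClass": "STANDARD_IA"}]))
```

The second call reports one problem, because seven days is below the minimum age for Standard-IA.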
🎯 Key Takeaway
Storage class transitions via lifecycle policies are a one-time setup that permanently reduces costs.
Standard → Standard-IA → Glacier Deep Archive can cut your S3 bill by over 95% for archival data.
Add an AbortIncompleteMultipartUpload rule to every bucket.

Versioning and Object Lock — Protect Against Accidental Deletion and Compliance

Without versioning, every PUT overwrites the object and every DELETE makes it gone forever. That's fine for temp files, but for user content, production configs, or audit logs, you need versioning.

When versioning is enabled, every object operation creates a new version ID. A DELETE doesn't remove the object — it just adds a delete marker. You can restore any previous version instantly. Versioning also integrates with lifecycle policies to automatically expire old versions and reduce storage costs.
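The delete-marker behaviour is easier to reason about with a toy model. The class below is an in-memory illustration of the semantics described above, not the S3 API; `VersionedBucket` and its methods are invented for this sketch, and the version IDs are simple counters rather than the opaque strings S3 generates.

```python
import itertools


class VersionedBucket:
    """Toy in-memory model of versioning semantics; not the S3 API."""

    def __init__(self):
        self._versions = {}  # key -> list of (version_id, body or None)
        self._ids = itertools.count(1)

    def put(self, key, body):
        version_id = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((version_id, body))
        return version_id

    def delete(self, key):
        # A plain delete only stacks a delete marker (body=None) on top.
        return self.put(key, None)

    def get(self, key, version_id=None):
        history = self._versions.get(key, [])
        if version_id is not None:
            return next((body for vid, body in history if vid == version_id), None)
        return history[-1][1] if history else None

    def remove_version(self, key, version_id):
        # Removing the marker itself makes the previous version current again.
        self._versions[key] = [v for v in self._versions[key] if v[0] != version_id]


bucket = VersionedBucket()
first = bucket.put("report.pdf", b"draft")
second = bucket.put("report.pdf", b"final")
marker = bucket.delete("report.pdf")

print(bucket.get("report.pdf"))         # None: the delete marker hides the object
print(bucket.get("report.pdf", first))  # b'draft': old versions are still readable
bucket.remove_version("report.pdf", marker)
print(bucket.get("report.pdf"))         # b'final': the latest real version is back
```

In real S3 the same recovery is done either by deleting the delete marker's version ID or by copying an older version on top.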

S3 Object Lock takes protection further by making objects write-once-read-many (WORM). You can lock objects for a retention period (days or years) or use legal holds. This is critical for compliance with SEC, FINRA, or GDPR retention rules. In Compliance mode, even the root user of the AWS account cannot delete a locked object version before the retention period expires; in Governance mode, principals with the s3:BypassGovernanceRetention permission still can.

Versioning and Object Lock together give you an immutable data layer. Enable versioning on every production bucket from day one. Object Lock is simplest to enable when the bucket is created. AWS has since added a way to turn it on for an existing versioned bucket, but plan for it up front rather than relying on a retrofit.

s3_versioning_and_lock.sh · BASH
#!/bin/bash
# ── Enable versioning on an existing bucket ──
aws s3api put-bucket-versioning \
  --bucket acme-corp-user-uploads-prod \
  --versioning-configuration Status=Enabled

# ── List all versions of an object (including deleted) ──
aws s3api list-object-versions \
  --bucket acme-corp-user-uploads-prod \
  --prefix invoices/jan-2024/report.pdf

# ── Restore a deleted object by copying from a previous version ──
aws s3api copy-object \
  --bucket acme-corp-user-uploads-prod \
  --copy-source "acme-corp-user-uploads-prod/invoices/jan-2024/report.pdf?versionId=abc123" \
  --key invoices/jan-2024/report.pdf

# ── Enable Object Lock at bucket creation (the simplest time to do it) ──
aws s3api create-bucket \
  --bucket acme-corp-compliant-logs \
  --region us-east-1 \
  --object-lock-enabled-for-bucket

# ── Set a default retention period on the bucket (7 days for compliance) ──
aws s3api put-object-lock-configuration \
  --bucket acme-corp-compliant-logs \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {
      "DefaultRetention": {
        "Mode": "GOVERNANCE",
        "Days": 7
      }
    }
  }'
▶ Output
# Enable versioning (no output on success, check with get-bucket-versioning)
# List object versions
{
  "Versions": [
    {
      "Key": "invoices/jan-2024/report.pdf",
      "VersionId": "abc123",
      "IsLatest": false,
      "LastModified": "2024-01-15T10:00:00Z",
      "Size": 12345
    },
    {
      "Key": "invoices/jan-2024/report.pdf",
      "VersionId": "xyz789",
      "IsLatest": true,
      "LastModified": "2024-06-12T14:00:00Z",
      "Size": 12345
    }
  ],
  "DeleteMarkers": []
}

# Restore object (success returns metadata)
# Object lock configuration set
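⚠ Decide on Object Lock at Bucket Creation
For a long time Object Lock could only be enabled when a bucket was created, and that is still the simplest path. AWS now supports enabling it on an existing bucket that has versioning turned on, but a default retention setting only covers objects written afterwards. If you think you might need WORM compliance later, create the bucket with --object-lock-enabled-for-bucket even if you don't configure retention immediately. That keeps the door open without a retrofit or a data migration.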
📊 Production Insight
Versioning multiplies your storage costs by the number of versions kept.
A file updated 100 times stores 100 copies.
Rule: combine versioning with a lifecycle policy to expire old versions after 30-90 days.
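Expiring old versions is itself a lifecycle rule. The sketch below builds one as a plain dictionary in the shape that `put_bucket_lifecycle_configuration` accepts; the function name `noncurrent_version_cleanup_rule` and the 30-day default are illustrative choices, not recommendations from AWS.

```python
import json


def noncurrent_version_cleanup_rule(days=30, prefix=""):
    """Build one lifecycle rule that expires superseded object versions."""
    return {
        "ID": f"ExpireNoncurrentAfter{days}Days",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
        # Delete versions once they have been noncurrent for `days` days.
        "NoncurrentVersionExpiration": {"NoncurrentDays": days},
        # Remove delete markers that no longer have any versions behind them.
        "Expiration": {"ExpiredObjectDeleteMarker": True},
    }


lifecycle_configuration = {"Rules": [noncurrent_version_cleanup_rule(days=30)]}
print(json.dumps(lifecycle_configuration, indent=2))
```

Keep in mind that applying a lifecycle configuration replaces the bucket's existing one, so a rule like this should be appended to the rules already in place rather than sent on its own.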
🎯 Key Takeaway
Enable versioning on every production bucket.
It's a one-line CLI command that gives you undo for S3.
Object Lock is easiest to set up at bucket creation, so plan ahead for compliance needs.

Performance Patterns — Multipart Uploads, Transfer Acceleration, and Cross-Region Replication

S3 is fast, but you can make it faster — and more expensive if you're not careful.

Multipart Upload: For files larger than 100MB (recommended), use multipart upload. It splits the file into parts (min 5MB each) and uploads them in parallel. If a part fails, only that part is retried, not the entire file. The boto3 upload_file method handles this automatically above 8MB. CLI aws s3 cp does the same.
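The part-size arithmetic is worth seeing once, because very large files force the part size up. This sketch assumes the limits of a 5 MiB minimum part size (except for the last part) and at most 10,000 parts per upload; the 10,000-part ceiling is an S3 limit not mentioned above, and the 8 MiB default mirrors the SDK threshold quoted in this section. The function name `plan_parts` is hypothetical.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB  # every part except the last must be at least this big
MAX_PARTS = 10_000       # upper bound on the number of parts in one upload


def plan_parts(file_size, preferred_part_size=8 * MIB):
    """Return (part_size, part_count) that fits within the multipart limits."""
    part_size = max(preferred_part_size, MIN_PART_SIZE)
    # Grow the part size when the preferred size would need too many parts.
    smallest_allowed = (file_size + MAX_PARTS - 1) // MAX_PARTS  # ceiling division
    part_size = max(part_size, smallest_allowed)
    part_count = max(1, (file_size + part_size - 1) // part_size)
    return part_size, part_count


print(plan_parts(100 * MIB))          # (8388608, 13)
print(plan_parts(200 * 1024 * MIB))   # part size grows so the count stays <= 10,000
```

A 100 MiB file splits into thirteen 8 MiB parts, while a 200 GiB file needs parts of roughly 20 MiB to stay under the part limit. The high-level SDK transfer helpers make this adjustment for you; the calculation matters mainly when you drive the low-level multipart calls yourself.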

S3 Transfer Acceleration: Uses AWS edge locations to route uploads over the AWS backbone network instead of the public internet. This can cut upload times by 50-80% for users far from the bucket region. It costs extra per GB uploaded. Enable it on the bucket and use the accelerated endpoint.

Cross-Region Replication (CRR): Automatically replicates objects to a bucket in another region. Use for disaster recovery, lower latency for global users, or compliance. Replication is asynchronous — expect a few seconds to a few hours of lag. You need versioning enabled on both source and destination buckets.

Key trade-off: Transfer Acceleration and CRR cost money. Don't enable them unless you have a measurable need. For most apps, a single bucket with CloudFront is cheaper and faster.

s3_performance_patterns.py · PYTHON
import boto3

s3_client = boto3.client('s3', region_name='us-east-1')
BUCKET = 'acme-corp-large-uploads-prod'

# ── Multipart Upload using boto3's high-level upload_file ──
# Automatically uses multipart for files >8MB
s3_client.upload_file(
    Filename='/tmp/large_backup.tar.gz',
    Bucket=BUCKET,
    Key='backups/june-2026.tar.gz'
)
print("Upload completed (multipart used automatically)")

# ── Transfer Acceleration example ──
# First, enable acceleration on the bucket (CLI):
# aws s3api put-bucket-accelerate-configuration --bucket <bucket> --accelerate-configuration Status=Enabled

# Then upload using the accelerated endpoint
s3_accelerated = boto3.client(
    's3',
    region_name='us-east-1',
    config=boto3.session.Config(
        s3={'use_accelerate_endpoint': True}
    )
)
# Subsequent uploads go through the accelerated endpoint automatically
s3_accelerated.upload_file(
    '/tmp/large_file.mp4',
    BUCKET,
    'videos/promo.mp4'
)
print("Uploaded via Transfer Acceleration")

# ── List ongoing multipart uploads (to find stuck ones) ──
response = s3_client.list_multipart_uploads(Bucket=BUCKET)
if 'Uploads' in response:
    for upload in response['Uploads']:
        print(f"Stuck upload: {upload['Key']} initiated {upload['Initiated']}")
else:
    print("No incomplete multipart uploads found")
▶ Output
Upload completed (multipart used automatically)
Uploaded via Transfer Acceleration
No incomplete multipart uploads found
💡When to Avoid Transfer Acceleration
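If your client and bucket are in the same region, Transfer Acceleration adds cost without a meaningful speed-up. It applies to both uploads and downloads, but it only pays off over long distances. Benchmark aws s3 cp with and without the accelerate endpoint (--endpoint-url https://s3-accelerate.amazonaws.com) before enabling it.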
📊 Production Insight
Cross-Region Replication costs storage in both regions + replication PUT requests.
If you're replicating for DR, consider using S3 Batch Replication first to seed, then CRR for ongoing.
Rule: monitor replication lag with S3 metrics — lag of over an hour indicates a bottleneck.
🎯 Key Takeaway
Multipart uploads are essential for files >100MB — they're automatic in the SDK.
Transfer Acceleration helps global uploads but costs extra; benchmark first.
CRR is async and costs double storage; only use when required.
🗂 S3 Storage Class Comparison
Choose based on access frequency and latency requirements
| Storage Class | Best For | Retrieval Latency | Approx. Cost / GB / Month | Per-Retrieval Fee |
|---|---|---|---|---|
| S3 Standard | Active user data, frequently accessed files | Milliseconds | $0.023 | None |
| S3 Standard-IA | Backups, older content, DR copies | Milliseconds | $0.0125 | Yes (~$0.01/GB) |
| S3 Glacier Instant Retrieval | Quarterly-access archives | Milliseconds | $0.004 | Yes (~$0.03/GB) |
| S3 Glacier Flexible Retrieval | Annual-access compliance data | Minutes to hours | $0.0036 | Yes |
| S3 Glacier Deep Archive | Long-term legal/compliance retention | Up to 12 hours | $0.00099 | Yes (~$0.02/GB) |

🎯 Key Takeaways

  • S3 object keys are flat strings, not real file paths — 'folder' prefixes are a UI convenience, not a filesystem hierarchy. This changes how you design efficient list and filter operations at scale.
  • Presigned URLs are the production-grade pattern for user file access — they keep objects private while granting time-limited, per-object access without exposing IAM credentials to the client.
  • Every bucket that receives large file uploads needs an AbortIncompleteMultipartUpload lifecycle rule — without it, failed partial uploads accumulate silently and cost you real money.
  • Storage class transitions via lifecycle policies are a one-time setup that permanently reduces costs — Standard → Standard-IA → Glacier Deep Archive can cut your S3 bill by over 95% for archival data.
  • Enable versioning on every production bucket — it's a single CLI command that gives you undo for S3 and is a prerequisite for Object Lock and replication.

⚠ Common Mistakes to Avoid

    Making the entire bucket public to share one file
    Symptom

    Anyone can browse and download all your data — search engines index it.

    Fix

    Use presigned URLs for any per-file sharing. If you genuinely need a public static hosting bucket, lock it down to a specific key prefix using a bucket policy Resource like 'arn:aws:s3:::your-bucket/public/*' and never put private data under that prefix.
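
As a minimal sketch of the prefix-scoped policy described in this fix, the helper below builds the policy document as a plain dictionary. The function name and the default `public/` prefix are illustrative assumptions; only the `Resource` pattern comes from the text above. The policy will also be rejected if the bucket's Block Public Access settings forbid public policies.

```python
import json


def public_prefix_policy(bucket: str, prefix: str = "public/") -> dict:
    """Allow anonymous reads for objects under one prefix only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadForPrefixOnly",
                "Effect": "Allow",
                "Principal": "*",
                # GetObject only: no ListBucket, so the bucket cannot be browsed.
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(public_prefix_policy("your-bucket"), indent=2))
```

Granting `s3:GetObject` without `s3:ListBucket` means a visitor needs the exact key to fetch a file, which limits what a crawler can discover.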

    Creating buckets in us-east-1 for users in Sydney
    Symptom

    Upload and download speeds are painfully slow for non-US users; latency adds 200-300ms to every S3 operation.

    Fix

    Always create buckets in the AWS region geographically closest to your users or your EC2/Lambda compute. For a global audience, put CloudFront in front of a single S3 bucket — it caches content at edge locations worldwide.

    Ignoring S3 versioning and then accidentally deleting production files
    Symptom

    A one-line script bug runs aws s3 rm on the wrong prefix and irreplaceable data is gone.

    Fix

    Enable versioning on any bucket holding user-generated content or production assets with aws s3api put-bucket-versioning --bucket your-bucket --versioning-configuration Status=Enabled. With versioning on, 'deleted' objects are just marked with a delete marker and can be restored instantly. Combine with MFA Delete for critical buckets.
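
Restoring a 'deleted' object on a versioned bucket amounts to removing its delete marker. The sketch below assumes a boto3 S3 client is passed in as `s3_client`; the function name is hypothetical, and it only inspects the first page of `list_object_versions` results, which is enough for a single key in most cases.

```python
def undelete_object(s3_client, bucket: str, key: str) -> bool:
    """Remove the current delete marker for `key`; return True if one was removed."""
    response = s3_client.list_object_versions(Bucket=bucket, Prefix=key)
    for marker in response.get("DeleteMarkers", []):
        # Prefix matching can return sibling keys, so compare the key exactly.
        if marker["Key"] == key and marker["IsLatest"]:
            # Deleting the marker by VersionId makes the previous version current again.
            s3_client.delete_object(
                Bucket=bucket, Key=key, VersionId=marker["VersionId"]
            )
            return True
    return False
```

With a real client this would be called as `undelete_object(boto3.client("s3"), "your-bucket", "reports/q1.pdf")`.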

    Not setting up an AbortIncompleteMultipartUpload lifecycle rule
    Symptom

    Failed large uploads leave orphaned parts in S3 that silently accumulate storage costs and do not appear in normal bucket listings.

    Fix

    Add a lifecycle rule with AbortIncompleteMultipartUpload set to 7 days. This automatically cleans up any partial upload that wasn't completed within a week.
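
The rule described in this fix can be sketched as a lifecycle configuration dictionary. The helper names and rule ID below are illustrative; `s3_client` is assumed to be a boto3 S3 client. Note that `put_bucket_lifecycle_configuration` replaces the bucket's whole lifecycle configuration, so any existing rules need to be included in the same call.

```python
def abort_incomplete_uploads_config(days: int = 7) -> dict:
    """Lifecycle configuration that aborts multipart uploads left unfinished."""
    return {
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix applies to the whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            }
        ]
    }


def apply_lifecycle(s3_client, bucket: str, config: dict) -> None:
    """Overwrite the bucket's lifecycle configuration with `config`."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```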

Interview Questions on This Topic

  • Q: S3 is often described as 'eventually consistent'. Can you explain what that means, whether it's still true today, and how it would affect an application that writes and immediately reads the same object? (Senior)
    Before December 2020, S3 provided read-after-write consistency for PUTs of new objects but only eventual consistency for overwrites and deletes. In December 2020, AWS announced strong read-after-write consistency for all S3 GET, PUT, LIST, and DELETE operations in all regions. That means if you write an object and immediately read or list it, you'll see the latest version, and the same holds after an overwrite or a delete. The practical impact: your application no longer needs retry logic to work around stale reads or stale listings. Two caveats remain: bucket-level configuration changes (for example, enabling versioning) can take time to propagate, and S3 does not lock objects for concurrent writers, so simultaneous PUTs to the same key resolve as last-writer-wins.
  • Q: Walk me through how you'd architect a file upload feature for a web app where users can upload files up to 5GB. Why would you not route the upload through your application server? (Mid-level)
    I would use presigned URLs for the upload. The application server generates a presigned PUT URL with a short expiry (5 min) and returns it to the client, which then uploads directly to S3 from the browser or device. This avoids several problems: the server never handles the file bytes (no memory pressure or tied-up request workers), you don't pay for server bandwidth to relay the upload, and the client avoids an extra network hop. A single presigned PUT covers objects up to 5GB, but an SDK's automatic multipart behaviour does not apply to a raw presigned PUT, so for files over about 100MB I'd use a multipart upload with a presigned URL per part, which lets a failed part be retried without restarting the whole transfer. Before generating any URL, I'd have the client send a small initial request (content type, file size) so the server can validate it. Finally, I'd set up an S3 event notification to trigger a Lambda function that processes the file once the upload is complete.
  • Q: What's the difference between an S3 bucket policy and an IAM policy, and when would you use one over the other to control access to S3 objects? (Mid-level)
    An IAM policy is attached to a user, group, or role and defines what that principal can do across AWS. An S3 bucket policy is attached to the bucket itself and defines who can access it and how. Use bucket policies when you need to grant access to anonymous users (e.g., public static website), when you want to centralise access rules for a bucket, or when you need to grant cross-account access without creating IAM roles. Use IAM policies when you need granular control per user or role, when you want to enforce permissions across multiple resources (e.g., allow access to S3 and DynamoDB), or when following least-privilege principles at the user level. In practice, most architectures use both: IAM roles for application access and bucket policies to enforce bucket-wide restrictions (e.g., deny all public access).
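
The presigned-upload flow from the file-upload question above can be sketched as follows. This is a minimal illustration, not a full upload service: `create_upload_url` and its parameters are hypothetical names, `s3_client` is assumed to be a boto3 S3 client, and the size check trusts whatever the client declares.

```python
MAX_SINGLE_PUT_BYTES = 5 * 1024**3  # single-PUT ceiling (5 GiB)


def create_upload_url(s3_client, bucket, key, content_type, declared_size,
                      expires_in=300):
    """Validate the client's declared metadata, then presign a direct-to-S3 PUT."""
    if not 0 < declared_size <= MAX_SINGLE_PUT_BYTES:
        raise ValueError("declared size is outside the single-PUT range")
    # Signing happens locally; no request is sent to S3 at this point.
    return s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key, "ContentType": content_type},
        ExpiresIn=expires_in,  # seconds; 300 matches the 5-minute expiry above
    )
```

Because `ContentType` is part of the signed parameters, the browser has to send the same `Content-Type` header with its PUT or S3 will reject the signature.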

Frequently Asked Questions

What is the maximum file size you can upload to S3?

A single S3 object can be up to 5TB in size. However, the maximum size for a single PUT upload is 5GB. For anything larger, or for better reliability with large files, you should use the multipart upload API, which splits the file into parts (minimum 5MB each, except the last part) and uploads them in parallel. Boto3's upload_file method handles this automatically when the file exceeds 8MB.
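
To see how these limits interact, here is a small part-planning sketch. It assumes an 8 MiB starting part size (chosen to match the 8MB threshold mentioned above) and relies on S3's cap of 10,000 parts per multipart upload; `plan_parts` is an illustrative helper, not an SDK function.

```python
MIB = 1024**2
MIN_PART_SIZE = 5 * MIB   # minimum size for every part except the last
MAX_PARTS = 10_000        # S3 limit on parts per multipart upload


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without floating-point rounding."""
    return (a + b - 1) // b


def plan_parts(file_size: int, part_size: int = 8 * MIB) -> tuple[int, int]:
    """Return (part_size, part_count) that stays within the multipart limits."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    part_size = max(part_size, MIN_PART_SIZE)
    # Grow the part size when the file would otherwise need more than 10,000 parts.
    part_size = max(part_size, ceil_div(file_size, MAX_PARTS))
    return part_size, ceil_div(file_size, part_size)


if __name__ == "__main__":
    for size in (100 * MIB, 5 * 1024**3, 1024**4):
        part, count = plan_parts(size)
        print(f"{size:>15,d} bytes -> {count:>6,d} parts of {part:,d} bytes")
```

For small and medium files the starting part size is kept as-is; only very large files push the part size up so the count stays at or below 10,000.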

Does S3 guarantee that my data won't be lost?

S3 Standard is designed for 99.999999999% durability (11 nines) by automatically storing your data redundantly across a minimum of three Availability Zones within a region. In practical terms, if you stored 10 million objects, you'd expect to lose one object every 10,000 years. That said, durability doesn't protect against accidental deletion — for that, enable versioning and optionally MFA Delete on critical buckets.
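
The 10,000-year figure follows from simple arithmetic, sketched below. It assumes the 11-nines figure is read as an average annual loss probability of 1 in 10^11 per object; exact fractions are used so floating-point rounding doesn't blur such a small number.

```python
from fractions import Fraction

# 1 - 0.99999999999, expressed exactly: one loss per 10^11 object-years.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def expected_losses_per_year(object_count: int) -> Fraction:
    """Average number of objects lost per year across the whole set."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_losses_per_year(object_count)


if __name__ == "__main__":
    print(years_per_lost_object(10_000_000))  # 10 million objects
```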

What's the difference between S3 and EBS — when do you use each?

EBS (Elastic Block Store) is a network-attached block device: it works like a hard drive mounted to a single EC2 instance, with low-latency reads and writes suited to databases and OS volumes. S3 is an object store accessed via an HTTP API. It's not mountable as a drive without additional tooling, but it scales without capacity planning, is reachable from any service with the right permissions, and is far cheaper per GB. Use EBS for your EC2 instance storage, databases, and OS. Use S3 for files, media, backups, static assets, and any data that needs to be accessed by more than one service.

Can I enable Object Lock on an existing bucket?

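Yes, with conditions. Originally, Object Lock could only be turned on when the bucket was created using the --object-lock-enabled-for-bucket flag, and getting WORM compliance on an existing bucket meant migrating the data to a new bucket. Since late 2023, S3 also lets you enable Object Lock on an existing bucket, provided versioning is already enabled on it. Either way the decision is one-way: once Object Lock is on, it cannot be disabled and versioning cannot be suspended. Plan ahead: if there's any chance you'll need Object Lock, enable it early even if you don't configure retention immediately.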

How do I reduce S3 costs without losing data?

  1. Use lifecycle policies to automatically transition objects to cheaper storage classes (Standard → IA → Glacier) based on age.
  2. Add an AbortIncompleteMultipartUpload rule to clean up failed uploads.
  3. Enable S3 Intelligent-Tiering for unpredictable access patterns.
  4. Use S3 Inventory to identify and delete unneeded objects.
  5. For infrequently accessed data, consider Standard-IA instead of Standard.
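
The age-based transition in the first step can be sketched as a single lifecycle rule. The rule ID, function name, and the 30-day and 90-day defaults are illustrative assumptions; `STANDARD_IA` and `GLACIER` are the storage class identifiers used by the S3 API. The result would go inside the `Rules` list of a lifecycle configuration.

```python
def tiering_rule(prefix: str = "", ia_after_days: int = 30,
                 glacier_after_days: int = 90) -> dict:
    """Lifecycle rule: Standard -> Standard-IA -> Glacier, based on object age."""
    if ia_after_days < 30:
        # S3 requires objects to be at least 30 days old before moving to Standard-IA.
        raise ValueError("Standard-IA transition must be at 30 days or later")
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "ID": "tier-down-by-age",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }
```

Scoping the rule to a prefix such as `logs/` is a cautious way to start, since transitions are applied automatically once the rule is active.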
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
