
AWS S3 Basics Explained — Buckets, Objects, and Real-World Storage Patterns

In Plain English 🔥
Imagine a massive warehouse with unlimited shelf space. You rent a named section of that warehouse (a 'bucket'), and inside it you can store any box (a 'file') of any size. Each box gets a unique label so anyone — or only you — can find it later. That's S3: Amazon's unlimited, pay-per-byte file warehouse in the cloud, accessible from anywhere on the internet.

Every modern application eventually needs somewhere to put files — profile pictures, invoice PDFs, video uploads, database backups, static websites. The moment you outgrow a single server's disk, you need distributed, durable storage. AWS S3 is where the industry landed, and it's powered everything from Netflix's video catalogue to your favourite startup's user uploads for nearly two decades. If you're building anything serious on AWS, S3 is the first service you'll touch.

Before S3, teams had to spin up dedicated file servers, worry about disk failures, manually handle replication, and scramble when traffic spiked. S3 solved all of that in one API. It stores your data across multiple physical locations automatically, scales to literally exabytes without any configuration, and charges you only for what you use. The result: you stop thinking about storage infrastructure and start thinking about your product.

By the end of this article you'll know exactly how S3 is structured, how to create buckets and upload objects using both the AWS CLI and the Python Boto3 SDK, how to control who can access your data and when, and the real-world patterns that experienced engineers use daily — plus the costly gotchas that trip up everyone the first time.

Buckets and Objects — How S3 is Actually Structured

S3 has exactly two building blocks: buckets and objects. A bucket is a top-level container — think of it as a named partition of Amazon's storage infrastructure. Every bucket name must be globally unique across all AWS accounts everywhere. Not just unique to you — unique to every person using S3 on Earth. That's why 'images' is taken, but 'acme-corp-product-images-prod' probably isn't.

An object is any file you store inside a bucket. It has three parts: the key (the file path, e.g. 'invoices/2024/jan/invoice-001.pdf'), the data itself (up to 5TB per object), and metadata (key-value pairs like content type or custom tags).

Here's the important mental model shift: S3 is NOT a filesystem. There are no real folders. 'invoices/2024/jan/' is just a prefix in the key name. The AWS console and SDK simulate folders for your convenience, but under the hood every object lives flat in the bucket identified by its full key string. This matters when you're listing, filtering, or managing costs at scale.
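The flat namespace is easy to see with a small simulation. The following is a plain-Python sketch with hypothetical keys — no AWS calls — of how a delimiter-based listing derives "folders" from flat key strings, mirroring what `list_objects_v2` does with its `Prefix` and `Delimiter` parameters:

```python
# All keys in a bucket live in one flat namespace — these are just strings.
keys = [
    "invoices/2024/jan/invoice-001.pdf",
    "invoices/2024/jan/invoice-002.pdf",
    "invoices/2024/feb/invoice-003.pdf",
    "avatars/user-4821/profile.jpg",
]

def list_with_delimiter(keys, prefix, delimiter="/"):
    """Mimic S3's Prefix + Delimiter listing: keys under the prefix are split
    into direct 'objects' (Contents) and simulated 'folders' (CommonPrefixes)."""
    contents, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        remainder = key[len(prefix):]
        if delimiter in remainder:
            # Everything up to the next delimiter becomes a CommonPrefix
            common_prefixes.add(prefix + remainder.split(delimiter)[0] + delimiter)
        else:
            contents.append(key)
    return contents, sorted(common_prefixes)

contents, folders = list_with_delimiter(keys, "invoices/2024/")
print(folders)   # ['invoices/2024/feb/', 'invoices/2024/jan/']
print(contents)  # [] — no object sits directly at 'invoices/2024/'
```

Nothing here is a directory; the "folders" exist only because the listing logic splits key strings on '/'. That is why deleting every object under a prefix also makes the "folder" disappear.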

Buckets are also region-specific. When you create a bucket in us-east-1, your data lives there unless you explicitly set up replication. Always create buckets in the region closest to your users or your compute layer.

s3_bucket_and_object_basics.sh · BASH
#!/bin/bash
# ── Prerequisites: AWS CLI installed and configured with `aws configure` ──

# 1. Create a new bucket in us-east-1
#    Bucket names: lowercase, 3-63 chars, no underscores, globally unique
aws s3api create-bucket \
  --bucket acme-corp-user-uploads-dev \
  --region us-east-1
# Note: us-east-1 is the only region that does NOT need a LocationConstraint.
# Every other region requires --create-bucket-configuration LocationConstraint=<region>

# 2. Upload a local file as an S3 object
#    The key here is 'avatars/user-4821/profile.jpg'
#    S3 doesn't create a folder — 'avatars/user-4821/' is just part of the key name
aws s3 cp ./profile.jpg \
  s3://acme-corp-user-uploads-dev/avatars/user-4821/profile.jpg

# 3. List objects using a prefix filter (simulates folder browsing)
#    This returns ONLY objects whose key starts with 'avatars/user-4821/'
aws s3 ls s3://acme-corp-user-uploads-dev/avatars/user-4821/

# 4. Download the object back to verify the round-trip
aws s3 cp \
  s3://acme-corp-user-uploads-dev/avatars/user-4821/profile.jpg \
  ./profile_downloaded.jpg

# 5. Delete the object
aws s3 rm s3://acme-corp-user-uploads-dev/avatars/user-4821/profile.jpg
▶ Output
# Step 1 — create-bucket
{
    "Location": "/acme-corp-user-uploads-dev"
}

# Step 2 — cp upload
upload: ./profile.jpg to s3://acme-corp-user-uploads-dev/avatars/user-4821/profile.jpg

# Step 3 — ls
2024-06-12 14:03:22 84231 profile.jpg

# Step 4 — cp download
download: s3://acme-corp-user-uploads-dev/avatars/user-4821/profile.jpg to ./profile_downloaded.jpg

# Step 5 — rm
delete: s3://acme-corp-user-uploads-dev/avatars/user-4821/profile.jpg
⚠️ Watch Out: The us-east-1 LocationConstraint Trap
If you run `create-bucket` with `--create-bucket-configuration LocationConstraint=us-east-1`, AWS throws a cryptic 'InvalidLocationConstraint' error. us-east-1 is the default region and must NOT have a LocationConstraint. Every other region — eu-west-1, ap-southeast-2, etc. — requires it. This catches almost everyone on their first cross-region script.

Access Control — Who Can See Your Files and Why It Matters

By default, every S3 bucket and every object inside it is completely private. Nothing is publicly accessible unless you deliberately make it so. This is the right default, but it means you need to understand the two main ways to grant access: bucket policies and presigned URLs.

A bucket policy is a JSON document attached to the bucket that grants broad, persistent permissions — for example, allowing your application's IAM role to read everything under the 'invoices/' prefix, or making a 'public-assets/' prefix readable by the entire internet for a static website. Bucket policies are evaluated on every request to that bucket, so they're perfect for service-to-service access.

A presigned URL is the smarter choice for user-facing file access. Instead of making objects public, you generate a time-limited URL server-side that temporarily grants access to one specific object. The URL embeds cryptographic credentials and an expiry timestamp. When it expires, access is gone — automatically, no cleanup needed. This is how every serious application handles file downloads and uploads: the backend stays in control, and the client gets a URL that works just long enough.
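To make the mechanism concrete, here is a deliberately simplified, self-contained sketch of the idea behind a presigned URL: a server-side HMAC signature over the object key plus an expiry timestamp. This is NOT AWS's actual Signature Version 4 algorithm (the secret, URL layout, and helper names below are all illustrative), but it shows why the URL stops working after expiry with no cleanup required:

```python
import hashlib
import hmac
import time

SECRET_KEY = b"server-side-secret"  # hypothetical; in real S3 this role is played by your AWS credentials

def presign(object_key, expires_in, now=None):
    """Return a URL-style string carrying the key, an expiry, and an HMAC signature."""
    expires_at = (now if now is not None else int(time.time())) + expires_in
    payload = f"{object_key}|{expires_at}".encode()
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return f"/{object_key}?Expires={expires_at}&Signature={signature}"

def verify(url, now=None):
    """Reject the request if the signature is wrong OR the expiry has passed."""
    path, _, query = url.partition("?")
    params = dict(pair.split("=") for pair in query.split("&"))
    expires_at = int(params["Expires"])
    expected = hmac.new(SECRET_KEY, f"{path.lstrip('/')}|{expires_at}".encode(),
                        hashlib.sha256).hexdigest()
    current = now if now is not None else int(time.time())
    return hmac.compare_digest(expected, params["Signature"]) and current < expires_at

url = presign("invoices/2024/jan/invoice-001.pdf", expires_in=3600, now=1_000_000)
print(verify(url, now=1_000_000 + 60))    # True  — still inside the hour
print(verify(url, now=1_000_000 + 7200))  # False — expired; access is gone automatically
```

Because the signature covers both the key and the expiry, tampering with either invalidates the URL — the same property that lets S3 trust a presigned URL without a round-trip to your server.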

Never make a bucket public unless it's genuinely meant to serve public static assets. Even then, use a CloudFront distribution in front of it rather than direct public bucket access.

s3_access_control.py · PYTHON
import boto3
import json
from datetime import datetime

# ── Boto3 picks up credentials from env vars, ~/.aws/credentials, or IAM role ──
s3_client = boto3.client('s3', region_name='us-east-1')

BUCKET_NAME = 'acme-corp-user-uploads-dev'

# ── PART 1: Attach a bucket policy ──
# This policy allows ONLY our application's IAM role to read objects
# under the 'invoices/' prefix. Nothing else in the bucket is affected.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAppRoleInvoiceRead",
            "Effect": "Allow",
            "Principal": {
                # Replace with your actual IAM role ARN
                "AWS": "arn:aws:iam::123456789012:role/AcmeAppServerRole"
            },
            "Action": "s3:GetObject",
            # The /* at the end means: any object under this prefix
            "Resource": f"arn:aws:s3:::{BUCKET_NAME}/invoices/*"
        }
    ]
}

s3_client.put_bucket_policy(
    Bucket=BUCKET_NAME,
    Policy=json.dumps(bucket_policy)  # Policy must be serialised to a JSON string
)
print(f"Bucket policy applied to {BUCKET_NAME}")

# ── PART 2: Generate a presigned URL for secure, time-limited object access ──
# Use case: a user clicks 'Download Invoice' in your web app.
# Your backend generates this URL and returns it. The browser hits S3 directly.
# The object itself stays private the whole time.

object_key = 'invoices/2024/jan/invoice-001.pdf'
expiry_seconds = 3600  # URL valid for exactly 1 hour

presigned_download_url = s3_client.generate_presigned_url(
    ClientMethod='get_object',  # 'put_object' works the same way for uploads
    Params={
        'Bucket': BUCKET_NAME,
        'Key': object_key
    },
    ExpiresIn=expiry_seconds
)

print(f"\nPresigned download URL (valid for {expiry_seconds}s):")
print(presigned_download_url)
print(f"\nGenerated at {datetime.utcnow():%Y-%m-%d %H:%M:%S} UTC; URL expires {expiry_seconds // 60} minutes later")

# ── PART 3: Generate a presigned URL for direct-upload from the browser ──
# The client uploads straight to S3 — your server never handles the file bytes.
# This is the standard pattern for large file uploads.
presigned_upload_url = s3_client.generate_presigned_url(
    ClientMethod='put_object',
    Params={
        'Bucket': BUCKET_NAME,
        'Key': 'avatars/user-9001/profile.jpg',
        'ContentType': 'image/jpeg'  # Enforce content type at signing time
    },
    ExpiresIn=300  # Upload must start within 5 minutes
)

print(f"\nPresigned upload URL (valid for 300s):")
print(presigned_upload_url)
▶ Output
Bucket policy applied to acme-corp-user-uploads-dev

Presigned download URL (valid for 3600s):
https://acme-corp-user-uploads-dev.s3.amazonaws.com/invoices/2024/jan/invoice-001.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIOSFODNN7EXAMPLE%2F20240612%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240612T140322Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=3d8a2f...

Generated at 2024-06-12 14:03:22 UTC; URL expires 60 minutes later

Presigned upload URL (valid for 300s):
https://acme-corp-user-uploads-dev.s3.amazonaws.com/avatars/user-9001/profile.jpg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIOSFODNN7EXAMPLE%2F20240612%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240612T140322Z&X-Amz-Expires=300&X-Amz-SignedHeaders=content-type%3Bhost&X-Amz-Signature=9c1b4e...
⚠️ Pro Tip: Presigned Upload URLs = No File Size Limit on Your Server
When you generate a presigned PUT URL and give it to the browser, the user uploads directly from their device to S3. Your application server never receives the file bytes. This means you're not capped by your server's memory, you're not paying bandwidth costs for the upload, and 10 GB video files upload just as smoothly as 10 KB thumbnails. This pattern is used by almost every major SaaS product.

Storage Classes and Lifecycle Policies — Cutting Your S3 Bill in Half

Not all data is accessed equally often. Your app might read a user's profile picture dozens of times a day, but that invoice from January 2021? Probably never again unless there's an audit. S3 gives you storage classes — different pricing tiers based on how frequently and quickly you need to access data.

S3 Standard is the default and the most expensive per GB. It's designed for data you access regularly with millisecond latency. S3 Standard-IA (Infrequent Access) costs about 45% less per GB but charges a per-retrieval fee, making it ideal for backups and older content you occasionally need. S3 Glacier Instant Retrieval drops the cost further for archival data you access maybe once a quarter. S3 Glacier Deep Archive is the cheapest tier — pennies per GB per month — for data you might need once a year and can wait up to 12 hours to retrieve.
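The per-GB prices quoted here are approximate us-east-1 figures and change over time, but they make the savings easy to quantify. A quick back-of-the-envelope comparison for storing 1 TB for a month:

```python
# Approximate us-east-1 prices per GB-month — check current AWS pricing before relying on these.
PRICE_PER_GB = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}

size_gb = 1024  # 1 TB
for storage_class, price in PRICE_PER_GB.items():
    monthly = size_gb * price
    saving = 100 * (1 - price / PRICE_PER_GB["STANDARD"])
    print(f"{storage_class:<12} ${monthly:>6.2f}/month  ({saving:.0f}% cheaper than Standard)")
```

Run it and Standard-IA comes out roughly 46% cheaper and Deep Archive roughly 96% cheaper per GB, which is where the "cut your bill by over 95% for archival data" figure comes from — before accounting for retrieval fees on the colder tiers.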

The power move is combining storage classes with lifecycle policies: automated rules that transition objects to cheaper tiers (or delete them entirely) based on their age. You configure this once, and S3 handles the cost optimisation forever. A common real-world pattern: keep user uploads in Standard for 30 days, move to Standard-IA for 6 months, then Glacier Deep Archive indefinitely — with zero manual work after initial setup.

s3_lifecycle_policy.py · PYTHON
import boto3
import json

s3_client = boto3.client('s3', region_name='us-east-1')

BUCKET_NAME = 'acme-corp-user-uploads-dev'

# ── Apply a lifecycle policy that automatically manages storage costs ──
# This policy covers objects under the 'invoices/' prefix only.
# Other prefixes (avatars/, reports/) are not affected.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "InvoiceArchivalPolicy",
            "Status": "Enabled",  # 'Disabled' to pause without deleting the rule
            "Filter": {
                # Only apply this rule to objects whose key starts with 'invoices/'
                "Prefix": "invoices/"
            },
            "Transitions": [
                {
                    # After 30 days, move to Standard-IA
                    # ~45% cheaper per GB, small per-retrieval fee applies
                    "Days": 30,
                    "StorageClass": "STANDARD_IA"
                },
                {
                    # After 180 days, move to Glacier Instant Retrieval
                    # ~68% cheaper than Standard, millisecond retrieval still available
                    "Days": 180,
                    "StorageClass": "GLACIER_IR"
                },
                {
                    # After 365 days, move to Glacier Deep Archive
                    # Cheapest tier. Retrieval takes up to 12 hours.
                    # Perfect for compliance-required long-term retention
                    "Days": 365,
                    "StorageClass": "DEEP_ARCHIVE"
                }
            ],
            # Permanently delete objects after 7 years (2555 days)
            # Adjust to match your legal/compliance retention requirements
            "Expiration": {
                "Days": 2555
            }
        },
        {
            # Second rule: automatically clean up incomplete multipart uploads
            # Without this, partially uploaded large files silently accumulate costs
            "ID": "CleanupIncompleteMultipartUploads",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # Apply to the entire bucket
            "AbortIncompleteMultipartUpload": {
                "DaysAfterInitiation": 7  # Abort uploads not completed within 7 days
            }
        }
    ]
}

response = s3_client.put_bucket_lifecycle_configuration(
    Bucket=BUCKET_NAME,
    LifecycleConfiguration=lifecycle_configuration
)

print(f"Lifecycle policy applied. HTTP status: {response['ResponseMetadata']['HTTPStatusCode']}")

# ── Verify the policy was saved correctly ──
saved_policy = s3_client.get_bucket_lifecycle_configuration(Bucket=BUCKET_NAME)
for rule in saved_policy['Rules']:
    print(f"\nRule ID: {rule['ID']} | Status: {rule['Status']}")
    if 'Transitions' in rule:
        for transition in rule['Transitions']:
            print(f"  → After {transition['Days']} days: {transition['StorageClass']}")
    if 'Expiration' in rule:
        print(f"  → Delete after: {rule['Expiration']['Days']} days")
▶ Output
Lifecycle policy applied. HTTP status: 200

Rule ID: InvoiceArchivalPolicy | Status: Enabled
→ After 30 days: STANDARD_IA
→ After 180 days: GLACIER_IR
→ After 365 days: DEEP_ARCHIVE
→ Delete after: 2555 days

Rule ID: CleanupIncompleteMultipartUploads | Status: Enabled
🔥 Interview Gold: The Hidden Multipart Upload Cost Leak
Interviewers love asking about unexpected S3 costs. The answer they want: incomplete multipart uploads. When a large file upload fails halfway, S3 stores the uploaded parts and charges you for them — indefinitely, silently, without showing them in a normal bucket listing. The fix is exactly what's in the code above: an AbortIncompleteMultipartUpload lifecycle rule. Every production bucket should have one.
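To put a rough number on that leak — using entirely hypothetical failure rates and the approximate us-east-1 Standard price:

```python
# Hypothetical scenario: 50 failed 2 GB uploads per month, never aborted.
# The orphaned parts keep billing at the storage-class rate until removed.
failed_uploads_per_month = 50
avg_uploaded_gb = 2
standard_price_per_gb = 0.023  # approximate us-east-1 Standard rate

orphaned_gb_after_year = failed_uploads_per_month * avg_uploaded_gb * 12
print(f"Orphaned parts after a year: {orphaned_gb_after_year} GB")
print(f"Silent monthly charge: ${orphaned_gb_after_year * standard_price_per_gb:.2f}")
```

Over a terabyte of invisible data and a recurring charge of tens of dollars a month — for files that never finished uploading. A one-rule lifecycle policy makes the whole problem disappear.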
| S3 Storage Class | Best For | Retrieval Latency | Approx. Cost / GB / Month | Per-Retrieval Fee |
| --- | --- | --- | --- | --- |
| S3 Standard | Active user data, frequently accessed files | Milliseconds | $0.023 | None |
| S3 Standard-IA | Backups, older content, DR copies | Milliseconds | $0.0125 | Yes (~$0.01/GB) |
| S3 Glacier Instant Retrieval | Quarterly-access archives | Milliseconds | $0.004 | Yes (~$0.03/GB) |
| S3 Glacier Flexible Retrieval | Annual-access compliance data | Minutes to hours | $0.0036 | Yes |
| S3 Glacier Deep Archive | Long-term legal/compliance retention | Up to 12 hours | $0.00099 | Yes (~$0.02/GB) |

🎯 Key Takeaways

  • S3 object keys are flat strings, not real file paths — 'folder' prefixes are a UI convenience, not a filesystem hierarchy. This changes how you design efficient list and filter operations at scale.
  • Presigned URLs are the production-grade pattern for user file access — they keep objects private while granting time-limited, per-object access without exposing IAM credentials to the client.
  • Every bucket that receives large file uploads needs an AbortIncompleteMultipartUpload lifecycle rule — without it, failed partial uploads accumulate silently and cost you real money.
  • Storage class transitions via lifecycle policies are a one-time setup that permanently reduces costs — Standard → Standard-IA → Glacier Deep Archive can cut your S3 bill by over 95% for archival data.

⚠ Common Mistakes to Avoid

  • Mistake 1: Making the entire bucket public to share one file — This exposes every object in the bucket to the internet permanently — Symptom: anyone can browse and download all your data — Fix: Use presigned URLs for any per-file sharing. If you genuinely need a public static hosting bucket, lock it down to a specific key prefix using a bucket policy Resource like 'arn:aws:s3:::your-bucket/public/*' and never put private data under that prefix.
  • Mistake 2: Creating buckets in us-east-1 for users in Sydney — Symptom: upload and download speeds are painfully slow for non-US users; latency adds 200-300ms to every S3 operation — Fix: Always create buckets in the AWS region geographically closest to your users or your EC2/Lambda compute. For a global audience, put CloudFront in front of a single S3 bucket — it caches content at edge locations worldwide.
  • Mistake 3: Ignoring S3 versioning and then accidentally deleting production files — Symptom: a one-line script bug runs aws s3 rm on the wrong prefix and irreplaceable data is gone — Fix: Enable versioning on any bucket holding user-generated content or production assets with aws s3api put-bucket-versioning --bucket your-bucket --versioning-configuration Status=Enabled. With versioning on, 'deleted' objects are just marked with a delete marker and can be restored instantly. Combine with MFA Delete for critical buckets.

Interview Questions on This Topic

  • Q: S3 is often described as 'eventually consistent' — can you explain what that means, whether it's still true today, and how it would affect an application that writes and immediately reads the same object?
  • Q: Walk me through how you'd architect a file upload feature for a web app where users can upload files up to 5GB. Why would you not route the upload through your application server?
  • Q: What's the difference between an S3 bucket policy and an IAM policy, and when would you use one over the other to control access to S3 objects?

Frequently Asked Questions

What is the maximum file size you can upload to S3?

A single S3 object can be up to 5TB in size. However, the maximum size for a single PUT upload is 5GB. For anything larger — or for better reliability with large files — you should use the multipart upload API, which splits the file into parts (minimum 5MB each) and uploads them in parallel. Boto3's upload_file method handles this automatically when the file exceeds 8MB.
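The limits above are easy to reason about with a little arithmetic. A sketch of how a 5 GB file might be split, assuming a hypothetical 64 MB part size (the part size is your choice between the 5 MB minimum and 5 GB; boto3's default chunk size is 8 MB):

```python
import math

MIN_PART_SIZE = 5 * 1024**2   # 5 MB minimum per part (the last part may be smaller)
file_size = 5 * 1024**3       # a 5 GB upload
part_size = 64 * 1024**2      # hypothetical 64 MB parts

num_parts = math.ceil(file_size / part_size)
print(f"{num_parts} parts of up to {part_size // 1024**2} MB each")

assert part_size >= MIN_PART_SIZE
assert num_parts <= 10_000  # S3 caps a multipart upload at 10,000 parts
```

The 10,000-part cap is also why the part size matters: a 5 TB object at the maximum size needs parts of at least ~525 MB to fit under the limit.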

Does S3 guarantee that my data won't be lost?

S3 Standard is designed for 99.999999999% durability (11 nines) by automatically storing your data redundantly across a minimum of three Availability Zones within a region. In practical terms, if you stored 10 million objects, you'd expect to lose one object every 10,000 years. That said, durability doesn't protect against accidental deletion — for that, enable versioning and optionally MFA Delete on critical buckets.
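The "one object every 10,000 years" figure in the answer above is just the 11-nines durability number applied to 10 million objects:

```python
durability = 0.99999999999          # 11 nines, S3's designed annual durability
annual_loss_rate = 1 - durability   # expected fraction of objects lost per year
objects_stored = 10_000_000

expected_losses_per_year = objects_stored * annual_loss_rate
years_per_single_loss = 1 / expected_losses_per_year
print(f"Expected losses per year: {expected_losses_per_year:.4f} objects")
print(f"≈ one object lost every {years_per_single_loss:,.0f} years")
```

That works out to roughly one expected loss per 10,000 years — a design target, not a guarantee, and no protection at all against a script that deletes the wrong prefix.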

What's the difference between S3 and EBS — when do you use each?

EBS (Elastic Block Store) is a network-attached block device — it works like a hard drive mounted to a single EC2 instance, with low latency and read/write for databases or OS volumes. S3 is an object store accessed via HTTP API — it's not mountable as a drive (without third-party tools), but it's infinitely scalable, globally accessible, and far cheaper per GB. Use EBS for your EC2 instance storage, databases, and OS. Use S3 for files, media, backups, static assets, and any data that needs to be accessed by more than one service.

TheCodeForge Editorial Team (Verified Author)

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
