S3 Principal * Bucket Policy — Why 50K SSNs Hit Google
One Principal * bucket policy made 50K customer SSNs searchable on Google.
20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.
- S3 stores data as objects in globally unique buckets
- Buckets are region-scoped; objects have keys (flat namespace) - not folders
- Access control: bucket policies (persistent) vs presigned URLs (time-limited)
- Lifecycle policies auto-tier objects (Standard → IA → Glacier) to cut costs by up to 95%
- Multipart uploads required for files >5GB; incomplete uploads silently cost money
- Enable versioning on every production bucket to prevent accidental data loss
Imagine a massive warehouse with unlimited shelf space. You rent a named section of that warehouse (a 'bucket'), and inside it you can store any box (a 'file') of any size. Each box gets a unique label so anyone — or only you — can find it later. That's S3: Amazon's unlimited, pay-per-byte file warehouse in the cloud, accessible from anywhere on the internet.
Every modern application eventually needs somewhere to put files — profile pictures, invoice PDFs, video uploads, database backups, static websites. The moment you outgrow a single server's disk, you need distributed, durable storage. AWS S3 is where the industry landed, and it's powered everything from Netflix's video catalogue to your favourite startup's user uploads for nearly two decades. If you're building anything serious on AWS, S3 is the first service you'll touch.
Before S3, teams had to spin up dedicated file servers, worry about disk failures, manually handle replication, and scramble when traffic spiked. S3 solved all of that in one API. It stores your data across multiple physical locations automatically, scales to literally exabytes without any configuration, and charges you only for what you use. The result: you stop thinking about storage infrastructure and start thinking about your product.
By the end of this article you'll know exactly how S3 is structured, how to create buckets and upload objects using both the AWS CLI and the Python Boto3 SDK, how to control who can access your data and when, and the real-world patterns that experienced engineers use daily — plus the costly gotchas that trip up everyone the first time.
Why a Single S3 Bucket Policy Leaked 50K SSNs
S3 bucket policies are JSON resource-based policies that define who can access a bucket and which actions they can perform on which objects. Unlike IAM policies attached to users or roles, bucket policies are attached directly to the bucket and can grant access to principals in other accounts (the calling account must still allow the action in its own IAM policy). The core mechanic is that the policy's Principal element can be set to "*" to allow anonymous access, which is one of the most common causes of S3 data exposure.
When you attach a bucket policy with Principal: "*" and an Allow effect, and no Condition narrows it, you are explicitly granting access to every unauthenticated internet user. Within a single account, AWS evaluates bucket policies together with IAM policies: an explicit Allow in either one grants access unless some policy contains an explicit Deny. This means a single misconfigured statement can expose data that every other control was meant to protect, unless S3 Block Public Access is switched on to reject public policies. The policy applies to every object matched by its Resource element, so objects that have no public ACL grant can still be read if the bucket policy allows it (ACLs can only grant access, never deny it).
Use bucket policies when you need to grant cross-account access, enforce HTTPS-only access, or set a blanket permission across all objects in a bucket. They are essential for public website hosting, but dangerous for any bucket containing sensitive data. The 2017 Verizon exposure of up to 14 million customer records and the 2022 Pegasus Airlines leak of roughly 6.5 TB of files both traced back to a bucket that was left publicly readable.
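A quick way to catch this class of mistake is to scan a policy document for Allow statements that combine a wildcard principal with no Condition. The following is a minimal sketch, not an AWS tool: the function name find_public_statements and the example policy (bucket name, account ID, role) are hypothetical, and it ignores subtler cases such as a Condition that is itself too broad.

```python
import json


def find_public_statements(policy_json: str) -> list[dict]:
    """Return Allow statements with a wildcard Principal and no Condition."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    # A policy may hold a single statement object instead of a list.
    if isinstance(statements, dict):
        statements = [statements]

    risky = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        # In a resource-based policy, "*" and {"AWS": "*"} both mean "anyone".
        is_wildcard = principal == "*" or (
            isinstance(principal, dict) and principal.get("AWS") in ("*", ["*"])
        )
        if is_wildcard and "Condition" not in stmt:
            risky.append(stmt)
    return risky


# Hypothetical policy: one public statement, one scoped to a single role.
EXAMPLE_POLICY = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-bucket/*",
        },
        {
            "Sid": "AppRoleRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/example-app"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-bucket/invoices/*",
        },
    ],
})

for statement in find_public_statements(EXAMPLE_POLICY):
    print("Publicly readable:", statement["Sid"], statement["Resource"])
```

A check like this fits naturally in CI, run against policy files before they are applied.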
Buckets and Objects — How S3 is Actually Structured
S3 has exactly two building blocks: buckets and objects. A bucket is a top-level container — think of it as a named partition of Amazon's storage infrastructure. Every bucket name must be globally unique across all AWS accounts everywhere. Not just unique to you — unique to every person using S3 on Earth. That's why 'images' is taken, but 'acme-corp-product-images-prod' probably isn't.
An object is any file you store inside a bucket. It has three parts: the key (the file path, e.g. 'invoices/2024/jan/invoice-001.pdf'), the data itself (up to 5TB per object), and metadata (key-value pairs like content type or custom tags).
Here's the important mental model shift: S3 is NOT a filesystem. There are no real folders. 'invoices/2024/jan/' is just a prefix in the key name. The AWS console and SDK simulate folders for your convenience, but under the hood every object lives flat in the bucket identified by its full key string. This matters when you're listing, filtering, or managing costs at scale.
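To make the flat namespace concrete, here is a small pure-Python sketch of how a prefix plus a delimiter turns flat keys into apparent folders. It imitates the grouping that a delimited listing performs but is not the S3 API; the helper name list_folder and the sample keys are made up for illustration.

```python
def list_folder(keys, prefix="", delimiter="/"):
    """Group flat keys into (files, folders) the way a delimited listing does."""
    files, folders = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter is reported once as a common prefix.
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            files.append(key)
    return files, sorted(folders)


# Hypothetical bucket contents: just strings, no directory objects anywhere.
KEYS = [
    "invoices/2024/jan/invoice-001.pdf",
    "invoices/2024/feb/invoice-002.pdf",
    "invoices/readme.txt",
    "logo.png",
]

files, folders = list_folder(KEYS, prefix="invoices/")
print(files)    # objects directly "inside" invoices/
print(folders)  # apparent subfolders, derived purely from key strings
```

The "folders" only exist as a by-product of string matching, which is why listing a deep prefix in a huge bucket still means scanning keys.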
Buckets are also region-specific. When you create a bucket in us-east-1, your data lives there unless you explicitly set up replication. Always create buckets in the region closest to your users or your compute layer.
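One practical consequence of region-specific buckets is that the create call differs slightly for us-east-1, which must not carry a LocationConstraint. The sketch below only builds the keyword arguments so it runs without AWS access; create_bucket_kwargs is a hypothetical helper name, and passing the result to boto3's create_bucket is shown in a comment as an assumption about how you would wire it up.

```python
def create_bucket_kwargs(bucket_name: str, region: str) -> dict:
    """Build arguments for an S3 create_bucket call in the given region."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default region and rejects an explicit LocationConstraint;
    # every other region needs one.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(create_bucket_kwargs("acme-corp-product-images-prod", "us-east-1"))
print(create_bucket_kwargs("acme-corp-product-images-prod", "ap-southeast-2"))

# Assumed usage with boto3 (not executed here):
#   import boto3
#   s3 = boto3.client("s3", region_name=region)
#   s3.create_bucket(**create_bucket_kwargs(name, region))
```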
If you call create-bucket with --create-bucket-configuration LocationConstraint=us-east-1, AWS throws a cryptic 'InvalidLocationConstraint' error. us-east-1 is the default region and must NOT have a LocationConstraint. Every other region (eu-west-1, ap-southeast-2, etc.) requires it. This catches almost everyone on their first cross-region script.
Access Control — Who Can See Your Files and Why It Matters
By default, every S3 bucket and every object inside it is completely private. Nothing is publicly accessible unless you deliberately make it so. This is the right default, but it means you need to understand the two main ways to grant access: bucket policies and presigned URLs.
A bucket policy is a JSON document attached to the bucket that grants broad, persistent permissions — for example, allowing your application's IAM role to read everything under the 'invoices/' prefix, or making a 'public-assets/' prefix readable by the entire internet for a static website. Bucket policies are evaluated on every request to that bucket, so they're perfect for service-to-service access.
A presigned URL is the smarter choice for user-facing file access. Instead of making objects public, you generate a time-limited URL server-side that temporarily grants access to one specific object. The URL embeds a cryptographic signature and an expiry timestamp rather than the credentials themselves. When it expires, access is gone automatically, with no cleanup needed. This is how most production applications handle file downloads and uploads: the backend stays in control, and the client gets a URL that works just long enough.
Never make a bucket public unless it's genuinely meant to serve public static assets. Even then, use a CloudFront distribution in front of it rather than direct public bucket access.
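The presigned URL flow described above can be sketched in a few lines. The function make_download_url is a hypothetical wrapper around the boto3 S3 client method generate_presigned_url; FakeS3Client is a stand-in added here so the example runs without credentials, and the URL it returns is a placeholder, not a real signed URL.

```python
MAX_EXPIRY_SECONDS = 7 * 24 * 3600  # upper bound for SigV4 presigned URLs


def make_download_url(s3_client, bucket: str, key: str, expires_in: int = 300) -> str:
    """Return a time-limited GET URL for exactly one object."""
    if not 1 <= expires_in <= MAX_EXPIRY_SECONDS:
        raise ValueError("expires_in must be between 1 second and 7 days")
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )


class FakeS3Client:
    """Test double mirroring the one boto3 method used above."""

    def __init__(self):
        self.calls = []

    def generate_presigned_url(self, ClientMethod, Params=None, ExpiresIn=3600, HttpMethod=None):
        self.calls.append((ClientMethod, Params, ExpiresIn))
        # Placeholder only: a real client returns a signed URL.
        return f"https://example.invalid/{Params['Bucket']}/{Params['Key']}?expires={ExpiresIn}"


client = FakeS3Client()  # in production: boto3.client("s3")
print(make_download_url(client, "example-bucket", "invoices/2024/jan/invoice-001.pdf", 120))
```

Keeping the expiry short (minutes, not days) limits the damage if a URL is forwarded or logged somewhere it should not be.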
Bucket Policy vs IAM Policy Comparison Matrix
AWS provides two primary mechanisms for controlling access to S3: bucket policies and IAM policies. Choosing between them (or using both together) is a frequent source of confusion. This matrix clarifies the differences so you can make the right choice for each access scenario.
| Feature | Bucket Policy | IAM Policy |
|---|---|---|
| Scope | Attached to a specific S3 bucket | Attached to a user, group, or role (IAM principal) |
| Who can be granted access | Any AWS account, IAM role, or anonymous (Principal: "*") | Only the principal the policy is attached to |
| Cross-account access | Yes — specify another account's IAM role in Principal | No — must use cross-account roles or bucket policies |
| Anonymous access | Yes — set Principal: "*" | No — IAM policies never apply to unauthenticated requests |
| Granularity | Bucket, prefix, or individual object ARNs within this one bucket | Bucket, prefix, or individual object ARNs (arn:aws:s3:::bucket/key) across any bucket |
| Default effect | Deny (no policy means no access) | Deny (no policy means no access) |
| Policy evaluation order | No fixed order: evaluated together with IAM policies; explicit Deny always wins | No fixed order: evaluated together with bucket policies; explicit Deny always wins |
| Use case example | Allow anonymous read for static website | Allow a specific developer to read all buckets in an account |
| Management | One policy per bucket, manageable via S3 console | Centralized via IAM console or AWS Organizations SCPs |
| Maximum size | 20 KB | 6,144 characters (managed policies) / 2,048 to 10,240 characters total (inline, depending on user, group, or role) |
When to use which:
- Bucket policy when you need to grant access to non-AWS identities (like anonymous users) or manage permissions for a shared bucket used by multiple accounts.
- IAM policy when you want to follow least-privilege and centrally manage permissions for your own users and roles.
- Both in production: use IAM for your application roles and add bucket policies only for specific overrides (e.g., deny all public access as a safety net).
A common anti-pattern is using a permissive bucket policy to allow your own IAM role, then tightening via IAM. Instead, let IAM do the fine-grained control and keep bucket policies simpler.
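The way the two policy types combine inside one account can be sketched as a tiny evaluator: explicit Deny wins, otherwise any Allow grants access, otherwise the request is denied by default. This is a deliberately simplified model, not AWS's real engine. It assumes the statements have already been filtered to those that apply to the caller, and it ignores conditions, SCPs, permission boundaries, cross-account rules, and case-insensitive action matching. All names and ARNs are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    return value if isinstance(value, list) else [value]


def _matches(stmt, action, resource):
    """True if the statement's Action and Resource patterns cover the request."""
    return any(fnmatchcase(action, a) for a in _as_list(stmt["Action"])) and any(
        fnmatchcase(resource, r) for r in _as_list(stmt["Resource"])
    )


def is_allowed(action, resource, bucket_statements, iam_statements):
    applicable = [
        s for s in bucket_statements + iam_statements if _matches(s, action, resource)
    ]
    if any(s["Effect"] == "Deny" for s in applicable):
        return False  # an explicit Deny in either policy wins
    return any(s["Effect"] == "Allow" for s in applicable)  # else default deny


# IAM does the fine-grained allow; the bucket policy only adds a guard rail.
IAM = [{"Effect": "Allow", "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-bucket/*"}]
BUCKET = [{"Effect": "Deny", "Action": "s3:*",
           "Resource": "arn:aws:s3:::example-bucket/secret/*"}]

print(is_allowed("s3:GetObject", "arn:aws:s3:::example-bucket/public/a.txt", BUCKET, IAM))
print(is_allowed("s3:GetObject", "arn:aws:s3:::example-bucket/secret/a.txt", BUCKET, IAM))
```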
Both policy types support global condition keys such as aws:SourceIp and aws:MultiFactorAuthPresent. If you need MFA enforcement for S3 operations, you can put the condition in an IAM policy or directly in the bucket policy; the bucket policy version applies to every principal that touches the bucket.
S3 Storage Classes — Cost vs Durability Comparison Table
S3 offers multiple storage classes designed to balance cost, retrieval latency, and durability across your data lifecycle. The table below summarises each class along with typical use cases and cost trade-offs.
| Storage Class | Best For | Retrieval Latency | Approx. Cost/GB/Month (us-east-1) | Retrieval Fee | Durability | Minimum Object Size | Minimum Storage Duration |
|---|---|---|---|---|---|---|---|
| S3 Standard | Frequently accessed, active data | Milliseconds | $0.023 | None | 99.999999999% | None | None |
| S3 Intelligent-Tiering | Unknown or changing access patterns | Milliseconds | $0.023 (frequent tier) + monitoring fee | None | 99.999999999% | None (objects under 128KB are not auto-tiered) | None |
| S3 Standard-IA | Infrequently accessed but needs quick access | Milliseconds | $0.0125 | $0.01/GB | 99.999999999% | 128KB | 30 days |
| S3 One Zone-IA | Re-creatable, infrequently accessed data | Milliseconds | $0.01 | $0.01/GB | 99.999999999% (single AZ) | 128KB | 30 days |
| S3 Glacier Instant Retrieval | Archive data accessed quarterly | Milliseconds | $0.004 | $0.03/GB | 99.999999999% | 128KB | 90 days |
| S3 Glacier Flexible Retrieval | Archive data accessed yearly | Minutes to hours | $0.0036 | $0.03/GB (expedited) | 99.999999999% | None (about 40KB of metadata overhead billed per object) | 90 days |
| S3 Glacier Deep Archive | Long-term compliance, accessed rarely | Up to 12 hours | $0.00099 | $0.02/GB | 99.999999999% | None (about 40KB of metadata overhead billed per object) | 180 days |
| S3 on Outposts | On-premises workloads requiring local S3 | Milliseconds (local) | Varies | None | Depends on hardware | None | None |
Key cost factors:
- Standard costs the most per GB but has no retrieval fees or minimums.
- Standard-IA cuts storage cost by ~45% but adds a per-GB retrieval fee. Ideal for backups you access a few times a year.
- Glacier Deep Archive is 96% cheaper than Standard but retrieval takes up to 12 hours, which suits regulatory retention.
- Minimum object sizes and minimum storage durations apply to Infrequent Access and Glacier tiers; storing many small objects or deleting early incurs additional charges.
Use lifecycle policies to automatically transition objects between these classes as they age, maximising cost savings without manual intervention.
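The percentages quoted for each tier fall straight out of the per-GB prices in the table. Here is a short sketch of that arithmetic, using the approximate us-east-1 list prices from the table as inputs; it covers storage cost only and ignores retrieval, request, and monitoring fees. The function names are illustrative.

```python
# Approximate $/GB/month figures copied from the table above (storage only).
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}


def monthly_storage_cost(gb: float, storage_class: str) -> float:
    """Storage-only monthly cost in dollars, rounded to cents."""
    return round(gb * PRICE_PER_GB_MONTH[storage_class], 2)


def savings_vs_standard(storage_class: str) -> float:
    """Percentage saved on storage alone, relative to S3 Standard."""
    standard = PRICE_PER_GB_MONTH["STANDARD"]
    return round(100 * (1 - PRICE_PER_GB_MONTH[storage_class] / standard), 1)


for name in PRICE_PER_GB_MONTH:
    print(f"{name:13s} 1 TB ~ ${monthly_storage_cost(1000, name):>6.2f}/month, "
          f"{savings_vs_standard(name)}% below Standard")
```

Before moving data, add the retrieval fee for the volume you expect to read back; for frequently read data the cheaper tier can end up costing more.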
Lifecycle Transition Visual — Standard → IA → Glacier
Automated lifecycle transitions are the backbone of S3 cost optimisation. Instead of manually moving files between storage classes, define rules that trigger based on object age. The steps below trace a typical production lifecycle path.
How it works:
1. Objects are uploaded into S3 Standard (fastest, most expensive).
2. After 30 days, they automatically transition to S3 Standard-IA (about 45% cheaper than Standard, same latency).
3. After 180 days, they move to S3 Glacier Instant Retrieval (about 68% cheaper than Standard-IA, same latency).
4. After 365 days, they move to S3 Glacier Deep Archive (about 96% cheaper than Standard, retrieval within 12 hours).
5. After 7 years (2555 days), objects are permanently deleted (if desired).
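Transitions between the millisecond-latency classes are invisible to your application: the S3 API works identically and only the billing changes. The exception is the archive tiers. Once an object reaches Glacier Flexible Retrieval or Glacier Deep Archive, a GET fails until you issue a restore request and wait for it to complete.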
Important: Transitions are one-way. You cannot automatically move objects from Glacier back to Standard without manual restoration (which incurs retrieval fees). Plan your tiers carefully: once data is archived, treat it as read-only.
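The lifecycle path above maps onto a single lifecycle rule. The sketch below builds the configuration dictionary in the shape boto3's put_bucket_lifecycle_configuration expects and sanity-checks the day ordering; the rule ID, the helper names, and the 7-day multipart cleanup window are illustrative choices, and nothing here calls AWS.

```python
def build_lifecycle_configuration(prefix: str = "") -> dict:
    """Standard -> Standard-IA -> Glacier IR -> Deep Archive -> delete."""
    return {
        "Rules": [
            {
                "ID": "tier-then-expire",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER_IR"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": 2555},  # roughly 7 years
                # Clean up abandoned multipart uploads so they stop accruing cost.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }


def validate_rule(rule: dict) -> None:
    """Raise ValueError if transitions are out of order or outlive expiration."""
    days = [t["Days"] for t in rule["Transitions"]]
    if any(later <= earlier for earlier, later in zip(days, days[1:])):
        raise ValueError("transition days must strictly increase")
    if rule["Expiration"]["Days"] <= days[-1]:
        raise ValueError("expiration must come after the last transition")


config = build_lifecycle_configuration(prefix="uploads/")
validate_rule(config["Rules"][0])

# Assumed usage with boto3 (not executed here):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```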
Storage Classes and Lifecycle Policies — Cutting Your S3 Bill in Half
Not all data is accessed equally often. Your app might read a user's profile picture dozens of times a day, but that invoice from January 2021? Probably never again unless there's an audit. S3 gives you storage classes — different pricing tiers based on how frequently and quickly you need to access data.
S3 Standard is the default and the most expensive per GB. It's designed for data you access regularly with millisecond latency. S3 Standard-IA (Infrequent Access) costs about 45% less per GB but charges a per-GB retrieval fee, making it ideal for backups and older content you occasionally need. S3 Glacier Instant Retrieval drops the cost further for archival data you access maybe once a quarter. S3 Glacier Deep Archive is the cheapest tier, at roughly a tenth of a cent per GB per month, for data you might need once a year and can wait up to 12 hours to retrieve.
The power move is combining storage classes with lifecycle policies: automated rules that transition objects to cheaper tiers (or delete them entirely) based on their age. You configure this once, and S3 handles the cost optimisation forever. A common real-world pattern: keep user uploads in Standard for 30 days, move to Standard-IA for 6 months, then Glacier Deep Archive indefinitely — with zero manual work after initial setup.
Versioning and Object Lock — Protect Against Accidental Deletion and Compliance
Without versioning, every PUT overwrites the object and every DELETE makes it gone forever. That's fine for temp files, but for user content, production configs, or audit logs, you need versioning.
When versioning is enabled, every write creates a new version with its own version ID. A DELETE without a version ID doesn't remove the object; it just adds a delete marker on top. You can restore any previous version instantly. Versioning also integrates with lifecycle policies to automatically expire old versions and reduce storage costs.
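The version and delete-marker mechanics are easier to reason about with a toy model. The class below is an in-memory illustration only (VersionedBucket and its methods are invented for this sketch and are not the S3 API), but it follows the same rules: writes stack versions, a plain delete stacks a marker, and deleting a specific version ID removes just that entry.

```python
import itertools

_DELETE_MARKER = object()  # sentinel standing in for an S3 delete marker


class VersionedBucket:
    """Toy in-memory model of a versioning-enabled bucket."""

    def __init__(self):
        self._history = {}              # key -> [(version_id, body)], oldest first
        self._ids = itertools.count(1)

    def _new_id(self):
        return f"v{next(self._ids)}"

    def put(self, key, body):
        version_id = self._new_id()
        self._history.setdefault(key, []).append((version_id, body))
        return version_id

    def delete(self, key, version_id=None):
        versions = self._history.setdefault(key, [])
        if version_id is None:
            # Plain DELETE: nothing is removed, a marker becomes the latest version.
            marker_id = self._new_id()
            versions.append((marker_id, _DELETE_MARKER))
            return marker_id
        # DELETE with a version ID permanently removes that one version or marker.
        self._history[key] = [v for v in versions if v[0] != version_id]
        return version_id

    def get(self, key, version_id=None):
        versions = self._history.get(key, [])
        if version_id is not None:
            versions = [v for v in versions if v[0] == version_id]
        if not versions or versions[-1][1] is _DELETE_MARKER:
            raise KeyError(key)
        return versions[-1][1]


bucket = VersionedBucket()
old = bucket.put("config.json", "old settings")
bucket.put("config.json", "new settings")
marker = bucket.delete("config.json")   # object now looks deleted
bucket.delete("config.json", marker)    # removing the marker undeletes it
print(bucket.get("config.json"))
```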
S3 Object Lock takes protection further by making objects write-once-read-many (WORM). You can lock objects for a retention period (days or years) or use legal holds. This is critical for compliance with SEC, FINRA, or GDPR retention rules. In compliance mode, even the root user of the AWS account cannot delete a locked object version before the retention period expires; governance mode lets principals with a specific bypass permission override the lock.
Versioning and Object Lock together give you an immutable data layer. Enable versioning on every production bucket from day one. Object Lock historically had to be enabled when the bucket was created; since late 2023 it can also be turned on for an existing bucket that already has versioning enabled.
For new buckets, create them with --object-lock-enabled-for-bucket even if you don't configure retention immediately. That keeps the door open without a later configuration change.
Performance Patterns — Multipart Uploads, Transfer Acceleration, and Cross-Region Replication
S3 is fast, but you can make it faster — and more expensive if you're not careful.
Multipart Upload: For files larger than 100MB (recommended), use multipart upload. It splits the file into parts (min 5MB each) and uploads them in parallel. If a part fails, only that part is retried, not the entire file. The boto3 upload_file method handles this automatically above 8MB. CLI aws s3 cp does the same.
S3 Transfer Acceleration: Uses AWS edge locations to route uploads over the AWS backbone network instead of the public internet. This can cut upload times by 50-80% for users far from the bucket region. It costs extra per GB uploaded. Enable it on the bucket and use the accelerated endpoint.
Cross-Region Replication (CRR): Automatically replicates objects to a bucket in another region. Use for disaster recovery, lower latency for global users, or compliance. Replication is asynchronous — expect a few seconds to a few hours of lag. You need versioning enabled on both source and destination buckets.
Key trade-off: Transfer Acceleration and CRR cost money. Don't enable them unless you have a measurable need. For most apps, a single bucket with CloudFront is cheaper and faster.
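For the multipart pattern, the main arithmetic is choosing a part size that respects the 5 MiB minimum while staying within S3's limit of 10,000 parts per upload. The helper below is a sketch of that calculation only (plan_parts is a hypothetical name, and the 8 MiB default mirrors the boto3 threshold mentioned above); SDK transfer managers do a version of this for you.

```python
MIN_PART_SIZE = 5 * 1024**2   # every part except the last must be at least 5 MiB
MAX_PARTS = 10_000            # S3 limit on parts per multipart upload


def _ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def plan_parts(file_size: int, preferred_part_size: int = 8 * 1024**2):
    """Return (part_size, part_count) for a multipart upload of file_size bytes."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    part_size = max(preferred_part_size, MIN_PART_SIZE)
    # Grow the part size if the preferred size would need more than 10,000 parts.
    part_size = max(part_size, _ceil_div(file_size, MAX_PARTS))
    return part_size, _ceil_div(file_size, part_size)


for label, size in [("100 MiB", 100 * 1024**2), ("1 TiB", 1024**4)]:
    part_size, count = plan_parts(size)
    print(f"{label}: {count} parts of up to {part_size / 1024**2:.1f} MiB")
```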
Test aws s3 cp with and without --endpoint-url before enabling.
Use an S3 Bucket Like a Senior — Console, CLI, or Scripts
You don't "use" S3 via the console if you manage more than 3 buckets. The console is for debugging, not operations. Real teams script everything.
The AWS CLI is your hammer. aws s3 cp, aws s3 sync, aws s3 rm. That's 90% of your daily interactions. Sync is particularly deadly: it only transfers new or changed files, saving bandwidth and time. But watch out: sync never deletes anything in the destination by default. Pass --delete and it removes every destination file that isn't in the source. That flag is how you accidentally nuke a production bucket.
For programmatic access, use the AWS SDK (boto3 in Python, aws-sdk in JS). Always use IAM roles, never hardcode keys. Set ServerSideEncryption and BucketKeyEnabled in every PutObject call. Your future self will thank you when the auditor asks why your data isn't encrypted at rest.
The console is fine for one-off uploads or checking bucket properties. But if you click "Upload" more than once a week, you're doing it wrong.
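To see why the direction of a sync matters once --delete is involved, here is a small planner that mimics the decision, in the spirit of a --dryrun. It is a simplified model: the real CLI compares size and last-modified time, whereas this sketch compares an opaque fingerprint per key, and plan_sync is a made-up helper name.

```python
def plan_sync(source: dict, destination: dict, delete: bool = False):
    """Return (keys_to_copy, keys_to_delete) for syncing source -> destination.

    Both arguments map object key -> fingerprint (an assumption for this sketch).
    """
    to_copy = sorted(k for k, v in source.items() if destination.get(k) != v)
    # Without delete=True, extra files in the destination are left alone.
    to_delete = sorted(k for k in destination if k not in source) if delete else []
    return to_copy, to_delete


local = {"index.html": "abc", "app.js": "new"}
bucket = {"index.html": "abc", "app.js": "old", "legacy.css": "zzz"}

print(plan_sync(local, bucket))                 # copies app.js, deletes nothing
print(plan_sync(local, bucket, delete=True))    # also removes legacy.css
# Wrong direction with an empty source: everything in the bucket is slated for deletion.
print(plan_sync({}, bucket, delete=True))
```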
aws s3 sync with --delete in the wrong direction will delete all files in the destination bucket. Use --dryrun first. Always.
S3 Data Consistency — The Read-After-Write Promise You Can Actually Rely On
S3 gives you strong read-after-write consistency for PUTs of new objects. You upload, then immediately read, and you get the data. This is not eventual consistency. It's immediate. Overwrite PUTs and DELETEs are also strongly consistent. That means if you delete an object and then list the bucket, the object is gone.
But here's the edge case that bites people: bucket-level configuration is still eventually consistent. If you create or delete a bucket and immediately list your buckets, the result may not reflect the change yet. The window is usually short, but in production automation it's enough to fail a deploy script.
Versioning changes the game. With versioning enabled, every overwrite creates a new version. The old version is preserved. Deleting an object adds a delete marker, and the object is still there underneath. This makes rollbacks straightforward. If a bad deploy overwrites production assets, you restore the previous version (copy it back or delete the bad version); if it deletes them, you just delete the delete marker.
Cross-region replication (CRR) is eventually consistent. Changes replicate asynchronously. Never design a system that depends on CRR being instant.
Computing in AWS: Beyond Static Hosting
S3 is object storage, but it doesn't compute. In production, you pair S3 with AWS compute services to process uploads, serve dynamic content, or run batch jobs. EC2 gives you full control: launch a virtual machine, install web servers, and pull data from S3 via SDK. For stateless tasks, use AWS Fargate—containerized computing without managing servers. ECS and EKS orchestrate clusters. The key insight: S3 triggers Lambda functions on object creation (e.g., resizing images). Design for decoupling—S3 pushes events to SQS or EventBridge, and compute services consume them. This pattern scales to zero cost when idle. Always set IAM roles on compute instances to access S3, never hard-code keys. Compute choice impacts latency: co-locate compute in the same region as your S3 bucket.
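The S3-to-Lambda trigger mentioned above delivers a JSON event listing the affected objects. Below is a minimal handler sketch that pulls out bucket and key for newly created objects. The sample event is hand-written with only the fields the code reads, and the bucket and key names are invented; note that keys arrive URL-encoded, so they need decoding before use.

```python
from urllib.parse import unquote_plus


def handler(event, context=None):
    """Return (bucket, key) pairs for every object-created record in an S3 event."""
    uploaded = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Keys in S3 events are URL-encoded (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        uploaded.append((bucket, key))
    return uploaded


# Hand-written sample event, trimmed to the fields used above.
SAMPLE_EVENT = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-uploads"},
                "object": {"key": "photos/my+cat%281%29.jpg"},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {
                "bucket": {"name": "example-uploads"},
                "object": {"key": "photos/old.jpg"},
            },
        },
    ]
}

print(handler(SAMPLE_EVENT))
```

In a decoupled design the same parsing logic applies when the event arrives wrapped in an SQS message, after unwrapping the message body.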
AWS Elastic Beanstalk: Managed Application Deployments
Elastic Beanstalk abstracts infrastructure so you focus on code. Upload a ZIP or connect to CodePipeline, and Beanstalk provisions EC2, load balancers, auto-scaling groups, and S3 bucket for logs. Choose platform: Python, Node.js, Go, Docker, or Java. It integrates with RDS for databases and CloudWatch for monitoring. Senior engineers use .ebextensions directory for custom config (e.g., environment variables, security group rules). Critical: Beanstalk creates an S3 bucket in your account to store deployment artifacts—never delete it manually. For blue-green deployments, swap environment URLs. Monitoring tip: enable enhanced health reporting for real-time CPU, memory, and latency. Beanstalk is not for every workload—if you need fine-grained networking (e.g., VPC peering), use CDK or Terraform instead. Pricing is EC2 + resources; no extra Beanstalk fee.
Public Bucket Exposed 50,000 Customer Records
- Never make a bucket public. Always use presigned URLs for per-file sharing.
- Enable S3 Block Public Access at the account level — it's a safety net that prevents accidental public exposure.
- If you must serve public static assets, use CloudFront with an Origin Access Control (OAC) to keep the bucket itself private.
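The Block Public Access safety net from the list above comes down to four boolean flags. The sketch below shows the configuration, a helper that applies it to a single bucket through the boto3 S3 client method put_public_access_block, and a check you could run against the result of get_public_access_block. The helper names are hypothetical, and the account-level equivalent goes through the separate s3control client, which is not shown.

```python
BLOCK_ALL_PUBLIC_ACCESS = {
    "BlockPublicAcls": True,        # reject new public ACLs
    "IgnorePublicAcls": True,       # ignore public ACLs that already exist
    "BlockPublicPolicy": True,      # reject bucket policies that grant public access
    "RestrictPublicBuckets": True,  # limit access to buckets that already have public policies
}


def lock_down_bucket(s3_client, bucket: str) -> None:
    """Apply all four Block Public Access settings to one bucket."""
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration=BLOCK_ALL_PUBLIC_ACCESS,
    )


def is_fully_blocked(config: dict) -> bool:
    """True only if every one of the four flags is explicitly enabled."""
    return all(config.get(flag) is True for flag in BLOCK_ALL_PUBLIC_ACCESS)


# A partially configured bucket still leaves the policy route open.
partial = {**BLOCK_ALL_PUBLIC_ACCESS, "BlockPublicPolicy": False}
print(is_fully_blocked(BLOCK_ALL_PUBLIC_ACCESS), is_fully_blocked(partial))

# Assumed usage with boto3 (not executed here):
#   s3 = boto3.client("s3")
#   lock_down_bucket(s3, "example-bucket")
#   current = s3.get_public_access_block(Bucket="example-bucket")
#   is_fully_blocked(current["PublicAccessBlockConfiguration"])
```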
Run aws s3api get-object-acl --bucket <name> --key <key> to verify object ownership.
aws s3api get-bucket-policy-status --bucket <bucket> --query 'PolicyStatus.IsPublic'
aws s3api get-object-acl --bucket <bucket> --key <key>
Key takeaways
Common mistakes to avoid
Making the entire bucket public to share one file
Creating buckets in us-east-1 for users in Sydney
Ignoring S3 versioning and then accidentally deleting production files
Run aws s3 rm on the wrong prefix and irreplaceable data is gone.
Enable versioning with aws s3api put-bucket-versioning --bucket your-bucket --versioning-configuration Status=Enabled. With versioning on, 'deleted' objects are just marked with a delete marker and can be restored instantly. Combine with MFA Delete for critical buckets.
Not setting up an AbortIncompleteMultipartUpload lifecycle rule
Add a lifecycle rule with AbortIncompleteMultipartUpload set to 7 days. This automatically cleans up any partial upload that wasn't completed within a week.
Interview Questions on This Topic
S3 is often described as 'eventually consistent' — can you explain what that means and whether it's still true today, and how it would affect an application that writes and immediately reads the same object?
Frequently Asked Questions