Senior 7 min · March 06, 2026

GCP — Stop allUsers IAM Data Leaks

A real incident: an allUsers IAM binding made a GCP bucket publicly readable, letting scrapers exfiltrate data.

Naren · Founder
Quick Answer
  • GCP is Google's cloud platform built on the same infrastructure powering Search and YouTube
  • The Project is the atomic unit of isolation — billing, IAM, and APIs are per-project
  • Compute options: GCE (VMs), GKE (Kubernetes), Cloud Run (serverless containers) — pick by ops overhead tolerance
  • Storage services: GCS (blobs), Cloud SQL (relational), Spanner (global), Firestore (NoSQL) — match your data access pattern
  • Biggest mistake: using roles/editor on a service account — grants nearly full write access, making any compromise catastrophic
  • Performance insight: Cloud Run scales to zero, costing $0 at idle; GKE clusters cost ~$70/month minimum even when idle
Plain-English First

Imagine you're opening a restaurant but you don't want to buy the building, the ovens, or hire an electrician. Instead, you rent a fully-equipped kitchen by the hour — use as much or as little as you need, and pay only for what you cook. Google Cloud Platform is exactly that, but for software. Instead of buying servers, databases, and networking gear, your app rents Google's global infrastructure by the second. When traffic spikes on Black Friday, you dial up the kitchen size. When it's quiet, you dial it back down. No hardware, no waste.

Every production application you've ever used — from a startup's API to a Fortune 500's data pipeline — runs on someone's computers. The question is whose, and at what cost. Running your own servers means upfront capital, a team to maintain them, and a very bad Monday when one fails at 2 AM. Cloud platforms exist to flip that model: you get world-class infrastructure on demand, billed like a utility, with Google's Site Reliability Engineers quietly keeping the lights on behind the scenes. Google Cloud Platform is Google's answer to that problem, and it's built on the same infrastructure that runs Search, Gmail, and YouTube — systems engineered to handle billions of requests a day.

The real problem GCP solves isn't just 'running code remotely.' It's the operational complexity that kills engineering teams: patching OS vulnerabilities, provisioning storage that scales automatically, routing traffic across continents, and debugging distributed systems. Before managed cloud services, teams burned enormous engineering hours on infrastructure that added zero value to their product. GCP packages that complexity into opinionated, composable services so your team can stay focused on the thing that actually matters — the software itself.

After reading this, you'll confidently map a real-world application's requirements to specific GCP services, understand the difference between GCP's compute tiers and when each is appropriate, deploy a containerized workload to Cloud Run, and avoid the billing and security mistakes that catch new GCP users off guard. This isn't a tour of the UI; it's a mental model you'll actually use.

GCP's Mental Model: Projects, Regions, and the Resource Hierarchy

Before touching any GCP service, you need to understand how GCP organises everything. Get this wrong and you'll end up with sprawling costs, broken IAM permissions, and services that can't talk to each other.

GCP groups resources into a three-tier hierarchy: Organisation → Folders → Projects. A Project is the atomic unit — every resource (a VM, a bucket, a database) lives inside exactly one project. Billing, IAM permissions, and API enablement are all scoped to the project. This is intentional: it means a dev team can have a payments-service-dev project completely isolated from payments-service-prod, with different budgets, different access controls, and separate audit logs.

Regions and zones handle physical location. A Region is a geographic area (e.g., us-central1 in Iowa). Each region contains multiple Zones (us-central1-a, us-central1-b, etc.) — these are independent data centres within that region. The rule of thumb: deploy across at least two zones for high availability, across multiple regions only if latency to global users or data sovereignty requires it. Cross-region data transfer costs money, so don't do it by default.

Understanding this hierarchy is what separates developers who get surprised by a $4,000 bill from those who plan budgets accurately from day one.

gcp_project_setup.sh (BASH)
#!/bin/bash
# -----------------------------------------------------------
# GCP PROJECT SETUP SCRIPT
# Run this once to initialise a new GCP project correctly.
# Requires: gcloud CLI authenticated via `gcloud auth login`
# -----------------------------------------------------------

# Define project configuration as variables — never hardcode these inline
PROJECT_ID="payments-service-prod"        # Must be globally unique across all GCP
BILLING_ACCOUNT_ID="012345-ABCDEF-789GHI" # Found in GCP Console > Billing
PRIMARY_REGION="us-central1"              # Closest region to your main user base
PRIMARY_ZONE="us-central1-a"             # Default zone within that region

# Step 1: Create the project
# --set-as-default means subsequent gcloud commands target this project automatically
gcloud projects create "${PROJECT_ID}" \
  --name="Payments Service Production" \
  --set-as-default

echo "Project '${PROJECT_ID}' created."

# Step 2: Link a billing account — without this, most services won't activate
gcloud billing projects link "${PROJECT_ID}" \
  --billing-account="${BILLING_ACCOUNT_ID}"

echo "Billing account linked."

# Step 3: Set the default region and zone so you don't have to repeat --region/--zone
# on every command. This saves you from accidentally deploying to the wrong region.
gcloud config set compute/region "${PRIMARY_REGION}"
gcloud config set compute/zone "${PRIMARY_ZONE}"

echo "Default region set to ${PRIMARY_REGION}, zone to ${PRIMARY_ZONE}."

# Step 4: Enable only the APIs your project actually needs.
# GCP disables most APIs by default; this is a security feature, not a bug.
# Enabling unused APIs increases your attack surface for nothing.
gcloud services enable \
  compute.googleapis.com \
  container.googleapis.com \
  sqladmin.googleapis.com \
  storage.googleapis.com

echo "Core APIs enabled."
echo "Project initialisation complete. Run 'gcloud config list' to verify."
Output
Project 'payments-service-prod' created.
Billing account linked.
Default region set to us-central1, zone to us-central1-a.
Operation "operations/acf.p2-1234567890-abcdef" finished successfully.
Core APIs enabled.
Project initialisation complete. Run 'gcloud config list' to verify.
Watch Out: Project IDs Are Permanent
Once a GCP Project ID is created, it cannot be changed. Deleting the project does not free the ID either: the project enters a 30-day recovery window, and after it is purged the ID can never be reused. Settle on a naming convention like {team}-{service}-{env} (e.g., platform-auth-prod) before you run that create command.
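Because a project ID is permanent, it can be worth validating the name before running the create command. The following is a minimal sketch, not an official tool: the helper name `build_project_id` and the `ALLOWED_ENVS` set are hypothetical, and the regular expression encodes the documented ID format as I understand it (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen). It cannot check global uniqueness; only GCP can.

```python
import re

# Assumed ID format: 6-30 chars, lowercase letters, digits and hyphens,
# starting with a letter and not ending with a hyphen.
PROJECT_ID_PATTERN = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")

# Hypothetical environment suffixes for the {team}-{service}-{env} convention.
ALLOWED_ENVS = {"dev", "staging", "prod"}


def build_project_id(team: str, service: str, env: str) -> str:
    """Return '{team}-{service}-{env}', or raise ValueError if it is not a valid ID."""
    if env not in ALLOWED_ENVS:
        raise ValueError(f"unknown environment: {env!r}")
    project_id = f"{team}-{service}-{env}"
    if not PROJECT_ID_PATTERN.fullmatch(project_id):
        raise ValueError(f"invalid project ID: {project_id!r}")
    return project_id


if __name__ == "__main__":
    print(build_project_id("platform", "auth", "prod"))
```

Running a check like this in CI before `gcloud projects create` catches uppercase letters, over-long names, and stray environment suffixes while the name can still be changed.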
Production Insight
Mistaking the project for an organisational boundary leads to sprawling costs and broken IAM.
Many teams run dev, staging, and prod side by side in a single project. That is a mistake: separate projects isolate billing and access.
Rule: one project per environment, one service account per service.
Key Takeaway
The project is your atomic unit of isolation.
Billing, IAM, and APIs are per-project.
Use separate projects for separate environments.
Project Hierarchy Decisions
If: Single service, no need for isolation
Use: One project is fine
If: Multiple environments (dev/staging/prod)
Use: Create separate projects per environment
If: Multi-team, multi-service
Use: Folders under an organisation node for logical grouping; each team gets its own project

GCP Compute Options: Choosing the Right Engine for Your Workload

GCP gives you five distinct ways to run code, and picking the wrong one is one of the most common — and expensive — mistakes teams make. They're not interchangeable; each is optimised for a specific shape of workload.

Compute Engine (GCE) is raw virtual machines. You control the OS, you manage patching, you configure networking. Use this when you're lifting-and-shifting an existing application that has specific OS dependencies, or when you need GPU access for ML training jobs. It's the most flexible and the most operational overhead.

Google Kubernetes Engine (GKE) is managed Kubernetes. GCP handles the control plane (the bit that schedules your containers) and you manage your node pools and workloads. This is the workhorse for microservices architectures — use it when you have multiple services that need independent scaling, resource isolation, and rolling deployments.

Cloud Run is serverless containers. You push a container image, GCP handles everything else — scaling from zero to thousands of instances, load balancing, HTTPS. No cluster to manage. Use this for stateless APIs and event-driven services where you want zero infrastructure management. It's phenomenally cost-efficient for variable traffic.

App Engine is the oldest PaaS on GCP — opinionated, language-specific runtimes. Mostly superseded by Cloud Run for new projects.

Cloud Functions is function-level serverless for event triggers. Use it for glue code: responding to a file upload, processing a Pub/Sub message, or running a webhook handler. Not suited for long-running or compute-heavy work.

Each tier has a hidden cost. GCE's sustained-use discounts start once a VM has run for more than 25% of the month, but they don't apply to preemptible VMs. GKE charges a cluster management fee of roughly $0.10 per hour (the free tier offsets it for one zonal or Autopilot cluster per billing account), and node costs add up fast: a three-node n1-standard-2 cluster costs about $200/month before any workload. Cloud Run's per-request billing means you pay nothing at idle, but cold starts can hit 3 seconds for JVM apps. Trade-offs everywhere.
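To make the sustained-use arithmetic concrete, here is a small sketch of how the discount accrues over a month. The tier table is an assumption based on the N1 schedule as I understand it (each successive quarter of the month is billed at 100%, 80%, 60%, then 40% of the base rate); other machine families use different schedules or none at all, so check current pricing before relying on the numbers. The function name is hypothetical.

```python
# Assumed N1 sustained-use tiers: (share of the month, fraction of base rate billed).
N1_SUD_TIERS = [(0.25, 1.00), (0.25, 0.80), (0.25, 0.60), (0.25, 0.40)]


def effective_rate_multiplier(fraction_of_month: float) -> float:
    """Average price multiplier for a VM that runs `fraction_of_month` of the month."""
    if not 0 < fraction_of_month <= 1:
        raise ValueError("fraction_of_month must be in (0, 1]")
    remaining = fraction_of_month
    billed = 0.0
    for width, rate in N1_SUD_TIERS:
        used = min(remaining, width)  # usage that falls inside this tier
        billed += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return billed / fraction_of_month


if __name__ == "__main__":
    for share in (0.25, 0.5, 0.75, 1.0):
        multiplier = effective_rate_multiplier(share)
        print(f"{share:.0%} of the month -> pay {multiplier:.0%} of list price")
```

Under these assumed tiers, a VM that runs the whole month pays about 70% of list price, while one that runs a quarter of the month or less gets no discount at all.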

cloud_run_service.yaml (YAML)
# -----------------------------------------------------------
# CLOUD RUN SERVICE DEFINITION
# Deploys a containerised payments API to Cloud Run.
# Cloud Run auto-scales to zero when idle — you pay nothing
# when your service isn't handling requests.
# Deploy with: gcloud run services replace cloud_run_service.yaml
# -----------------------------------------------------------
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: payments-api
  namespace: "123456789"  # Your GCP Project Number (not Project ID)
  annotations:
    # Ingress controls which networks may reach the service. "all" accepts
    # requests from the public internet; use "internal" to restrict access.
    run.googleapis.com/ingress: all
spec:
  template:
    metadata:
      annotations:
        # Scale down to zero instances when there are no requests
        # This is what makes Cloud Run cost-effective for variable traffic
        autoscaling.knative.dev/minScale: "0"
        # Cap at 10 instances to prevent runaway costs during a traffic spike
        autoscaling.knative.dev/maxScale: "10"
        # Run on the second-generation execution environment
        run.googleapis.com/execution-environment: gen2
    spec:
      # Each instance handles at most 80 concurrent requests before a new one spins up
      containerConcurrency: 80
      # How long Cloud Run waits for a response before treating it as a timeout
      timeoutSeconds: 30
      # CPU and memory are per-instance limits
      containers:
        - image: gcr.io/payments-service-prod/payments-api:v2.1.0
          ports:
            - containerPort: 8080  # Cloud Run sends requests to this port (8080 is the default)
          resources:
            limits:
              cpu: "1"        # 1 vCPU per instance
              memory: "512Mi" # 512MB RAM — right-size this based on profiling
          env:
            # Never hardcode secrets. Reference Secret Manager instead.
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: latest
                  name: payments-db-password  # Name of secret in Secret Manager
  traffic:
    # 100% of traffic goes to the latest revision
    # You can split traffic here for canary deployments (e.g., 90/10)
    - latestRevision: true
      percent: 100
Output
Deploying...
Setting IAM policy
Done.
Service [payments-api] revision [payments-api-00002-xyz] has been deployed and is serving 100 percent of traffic.
Service URL: https://payments-api-abcdef-uc.a.run.app
Pro Tip: Cloud Run Cold Starts
Setting minScale to 0 means you pay nothing at idle, but the first request after a period of inactivity hits a 'cold start' — typically 1-3 seconds for a JVM app, under 300ms for Go or Node. For latency-sensitive services (payments, auth), set minScale to 1. The cost is roughly $10-15/month for one always-warm instance — cheap insurance against SLA breaches.
Production Insight
Choosing the wrong compute tier is one of the most expensive mistakes teams make.
Cloud Run's scale-to-zero saves money for variable traffic but introduces cold start latency.
Rule: start with Cloud Run for stateless services; only move to GKE or GCE when Cloud Run limits constrain you.
Key Takeaway
Compute choice is about ops overhead vs control.
Cloud Run is the default for new stateless services.
GKE for multi-container apps needing control; GCE for legacy lift-and-shift or GPUs.
Compute Decision: Which Service to Use
If: Stateless API, unpredictable traffic, no GPU needed
Use: Cloud Run (scale-to-zero, per-request billing)
If: Multi-service microservices architecture, needs control over networking
Use: GKE (autoscaling, rolling updates, service mesh support)
If: Legacy app with specific OS dependencies or GPU/TPU required
Use: Compute Engine (full VM control, GPU support)
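The decision guide above can be written down as a small function, which is handy as a checklist in design reviews. This is only a sketch of the article's rules of thumb: the `Workload` fields and the `choose_compute` name are hypothetical, and real decisions involve more factors (team skills, compliance, existing tooling).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a workload; field names are illustrative."""
    stateless: bool
    needs_gpu: bool = False
    needs_os_control: bool = False   # e.g. lift-and-shift with OS dependencies
    event_glue: bool = False         # short handler for an upload, message, or webhook
    many_services: bool = False      # microservices needing independent scaling


def choose_compute(workload: Workload) -> str:
    """Map a workload to a compute service, following the guide above."""
    if workload.needs_gpu or workload.needs_os_control:
        return "Compute Engine"
    if workload.event_glue:
        return "Cloud Functions"
    if workload.many_services:
        return "GKE"
    if workload.stateless:
        return "Cloud Run"
    # Stateful single service: Cloud Run's limits apply, so fall back to a cluster.
    return "GKE"


if __name__ == "__main__":
    print(choose_compute(Workload(stateless=True)))
    print(choose_compute(Workload(stateless=True, needs_gpu=True)))
```

The order of the checks matters: hard constraints (GPU, OS control) win over preferences, which mirrors the rule of starting with Cloud Run and moving away only when its limits constrain you.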

Storage on GCP: Matching the Data Shape to the Right Service

Nothing reveals a GCP beginner faster than seeing them store relational data in Cloud Storage or put time-series metrics into Cloud SQL. GCP has six distinct storage services and each one is engineered for a specific data access pattern. Using the wrong one doesn't just waste money — it actively degrades performance.

Cloud Storage (GCS) is object storage — think S3. Binary blobs, static assets, backups, data lake files. Infinitely scalable, globally accessible, extremely cheap. Access pattern: write once, read many, no updates to individual fields.

Cloud SQL is managed relational databases — PostgreSQL, MySQL, or SQL Server. Handles backups, failover, and patching. Use it when you have structured data with relationships and your team already thinks in SQL. Scales vertically (bigger machine) with read replicas for horizontal read scaling.

Cloud Spanner is the exotic one: a globally distributed, horizontally scalable relational database with ACID transactions. It's what powers Google's own financial systems. Use it when Cloud SQL's 64 TB per-instance storage limit isn't enough or when you need active-active multi-region writes. The price point reflects its power: expect to pay several times what a comparable Cloud SQL instance costs.

Firestore is a serverless NoSQL document database, optimised for mobile and web clients with real-time sync built in. Excellent for user profiles, session data, and content that's hierarchical and document-shaped.

Bigtable is a managed wide-column NoSQL store, designed for petabyte-scale time-series, IoT, and financial data with millisecond latency at massive scale. Not a general-purpose database.

Memorystore is managed Redis or Memcached — in-memory caching layer for your hot data.

One more thing: GCS storage classes (Standard, Nearline, Coldline, Archive) let you save 60-90% by picking the right access frequency. Access a Coldline object once? That retrieval costs more than storing it for a month. Pick storage class based on real access patterns, not on what feels right.
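A rough way to pick a storage class is to compare how often an object is read against each class's minimum storage duration (30 days for Nearline, 90 for Coldline, 365 for Archive). The sketch below uses those durations as thresholds; that mapping is a heuristic of mine rather than an official rule, and the function name is hypothetical.

```python
# Minimum storage durations in days, used here as rule-of-thumb thresholds.
# Ordered from coldest to warmest so the first match wins.
CLASS_THRESHOLDS = [
    (365, "ARCHIVE"),
    (90, "COLDLINE"),
    (30, "NEARLINE"),
]


def pick_storage_class(days_between_reads: float) -> str:
    """Suggest a storage class from the expected gap between reads of an object."""
    if days_between_reads < 0:
        raise ValueError("days_between_reads must be non-negative")
    for min_days, storage_class in CLASS_THRESHOLDS:
        if days_between_reads >= min_days:
            return storage_class
    return "STANDARD"


if __name__ == "__main__":
    for gap in (1, 45, 120, 400):
        print(f"read every {gap} days -> {pick_storage_class(gap)}")
```

Feed it measured access intervals from your logs rather than guesses; colder classes charge retrieval fees, so a wrong guess in the cold direction is the expensive one.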

gcs_upload_and_signed_url.py (PYTHON)
# -----------------------------------------------------------
# GCS OBJECT UPLOAD + SIGNED URL GENERATION
# Real-world pattern: a user uploads a profile photo.
# We store it privately in GCS, then generate a short-lived
# signed URL so the frontend can display it without making
# the bucket publicly readable (a major security mistake).
#
# Install dependencies: pip install google-cloud-storage
# Auth: set GOOGLE_APPLICATION_CREDENTIALS env var to your
#       service account key JSON path, or use Workload Identity.
# -----------------------------------------------------------

import datetime
from pathlib import Path
from google.cloud import storage

GCP_PROJECT_ID = "payments-service-prod"
PRIVATE_BUCKET_NAME = "user-profile-photos-prod"  # This bucket is NOT public
SIGNED_URL_EXPIRY_MINUTES = 15  # Short expiry — limits blast radius if URL leaks


def upload_user_profile_photo(
    user_id: str,
    local_file_path: Path,
    content_type: str = "image/jpeg",
) -> str:
    """
    Uploads a profile photo to GCS and returns a signed URL
    the frontend can use to display it for the next 15 minutes.

    Returns the signed URL string.
    """
    storage_client = storage.Client(project=GCP_PROJECT_ID)
    bucket = storage_client.bucket(PRIVATE_BUCKET_NAME)

    # Build a deterministic object path — makes it easy to find later
    # and naturally organises objects by user without needing folders
    object_name = f"users/{user_id}/profile/avatar.jpg"

    blob = bucket.blob(object_name)

    # Set content type so browsers render it correctly, not download it
    blob.content_type = content_type

    # Upload the file — this overwrites any existing photo for this user
    blob.upload_from_filename(str(local_file_path))
    print(f"Uploaded '{local_file_path}' to gs://{PRIVATE_BUCKET_NAME}/{object_name}")

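    # Generate a V4 signed URL: time-limited and cryptographically signed
    # by our service account. The bucket stays private; only holders of
    # this URL can access the object, and only until it expires.
    # Note: signing locally needs credentials that carry a private key
    # (a service account key file). With keyless credentials such as
    # Workload Identity, pass service_account_email and access_token so
    # the library signs through the IAM API instead.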
    signed_url = blob.generate_signed_url(
        version="v4",
        expiration=datetime.timedelta(minutes=SIGNED_URL_EXPIRY_MINUTES),
        method="GET",  # Read-only access
    )

    print(f"Signed URL (valid {SIGNED_URL_EXPIRY_MINUTES} mins): {signed_url[:80]}...")
    return signed_url


if __name__ == "__main__":
    # Simulate uploading a photo for user ID 'usr_8821'
    sample_photo_path = Path("/tmp/avatar_upload.jpg")

    # In production this file comes from a multipart form upload
    sample_photo_path.write_bytes(b"<fake-jpeg-bytes-for-demo>")

    url = upload_user_profile_photo(
        user_id="usr_8821",
        local_file_path=sample_photo_path,
    )
    print(f"\nFrontend should use this URL to render the avatar: {url[:60]}...")
Output
Uploaded '/tmp/avatar_upload.jpg' to gs://user-profile-photos-prod/users/usr_8821/profile/avatar.jpg
Signed URL (valid 15 mins): https://storage.googleapis.com/user-profile-photos-prod/users/usr_88...
Frontend should use this URL to render the avatar: https://storage.googleapis.com/user-profile-photos...
Watch Out: Never Make a Storage Bucket Containing PII Public
GCS accepts a special IAM principal, 'allUsers'. Granting it a role such as roles/storage.objectViewer makes every object in the bucket readable by anyone on the internet, with no authentication. It's convenient for hosting static assets, but it has caused real data breaches when teams accidentally applied it to buckets containing user data. Use signed URLs as shown above: they give time-limited, auditable access without ever opening the bucket publicly.
Production Insight
Using Cloud Storage for relational data or Cloud SQL for blob data wastes money and performance.
A common trap: storing JSON blobs in Cloud SQL when Firestore or GCS would be cheaper and faster.
Rule: analyse your data access pattern before picking a storage service.
Key Takeaway
Storage decision is purely about access pattern.
Blobs → GCS, relational → Cloud SQL, global relational → Spanner, documents → Firestore.
Mixing them costs both money and latency.
Storage Decision by Data Pattern
If: Binary blobs, static assets, backups
Use: Cloud Storage (GCS)
If: Structured relational data under 64 TB
Use: Cloud SQL (PostgreSQL/MySQL)
If: Global relational with multi-region writes
Use: Cloud Spanner
If: Document-shaped hierarchical data, real-time sync
Use: Firestore
If: Time-series or IoT data at petabyte scale
Use: Bigtable

GCP IAM and Networking: The Security Layer You Can't Skip

Here's the uncomfortable truth: most cloud security incidents aren't caused by sophisticated attacks. They're caused by over-permissioned service accounts, open firewall rules, and credentials hardcoded into source code. GCP's IAM and VPC model exist specifically to prevent this — but only if you use them intentionally.

IAM (Identity and Access Management) in GCP follows the principle of least privilege. Every service account, user, and group gets only the permissions it needs — nothing more. Roles are either predefined (like roles/storage.objectViewer) or custom. The most dangerous role is roles/editor on a project — it's temptingly broad and you'll see it everywhere in tutorials. Never use it in production.

Workload Identity is the right way for GKE workloads to authenticate to GCP APIs. Instead of downloading a service account key JSON file (a long-lived credential that can be stolen), Workload Identity binds a Kubernetes service account to a GCP service account. The credential is ephemeral and automatically rotated. If you're using key files in a Kubernetes cluster, stop — switch to Workload Identity.

VPC (Virtual Private Cloud) is your private network inside GCP. By default, GCP creates a 'default' VPC with permissive firewall rules. For anything production, create a custom VPC with explicit subnets per region, and firewall rules that deny all ingress by default and allow only what you specify. Use Private Google Access on subnets so VMs can reach GCP APIs without needing a public IP.

gcp_iam_least_privilege_setup.sh (BASH)
#!/bin/bash
# -----------------------------------------------------------
# GCP IAM LEAST-PRIVILEGE SETUP
# Creates a service account for a Cloud Run payments service
# with ONLY the permissions it actually needs:
#   - Read secrets from Secret Manager
#   - Write to a specific Cloud Storage bucket
#   - Publish to a specific Pub/Sub topic
# Nothing else. This limits blast radius if the service is compromised.
# -----------------------------------------------------------

PROJECT_ID="payments-service-prod"
SERVICE_NAME="payments-api"

# Step 1: Create a dedicated service account for this service
# One service account per service — never share service accounts
SERVICE_ACCOUNT_EMAIL="${SERVICE_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

gcloud iam service-accounts create "${SERVICE_NAME}" \
  --project="${PROJECT_ID}" \
  --display-name="Payments API Service Account" \
  --description="Identity for the payments-api Cloud Run service. Least-privilege access only."

echo "Service account created: ${SERVICE_ACCOUNT_EMAIL}"

# Step 2: Grant permission to read secrets from Secret Manager
# This is scoped to the PROJECT level — ideally scope it to individual secrets
# using resource-level IAM for even finer control
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:${SERVICE_ACCOUNT_EMAIL}" \
  --role="roles/secretmanager.secretAccessor"

echo "Secret Manager access granted."

# Step 3: Grant permission to write objects to a specific bucket ONLY
# Note: roles/storage.objectCreator is narrower than roles/storage.objectAdmin
# objectCreator can write new objects but cannot delete or overwrite existing ones
TARGET_BUCKET="gs://payments-receipts-prod"

gcloud storage buckets add-iam-policy-binding "${TARGET_BUCKET}" \
  --member="serviceAccount:${SERVICE_ACCOUNT_EMAIL}" \
  --role="roles/storage.objectCreator"

echo "Storage write access granted to ${TARGET_BUCKET} only."

# Step 4: Grant Pub/Sub publish permission on one specific topic
TARGET_TOPIC="payment-completed-events"

gcloud pubsub topics add-iam-policy-binding "${TARGET_TOPIC}" \
  --project="${PROJECT_ID}" \
  --member="serviceAccount:${SERVICE_ACCOUNT_EMAIL}" \
  --role="roles/pubsub.publisher"

echo "Pub/Sub publish access granted to ${TARGET_TOPIC} topic."

# Step 5: Attach this service account to the Cloud Run service
# The service now authenticates as this SA automatically — no key files needed
gcloud run services update "${SERVICE_NAME}" \
  --project="${PROJECT_ID}" \
  --region="us-central1" \
  --service-account="${SERVICE_ACCOUNT_EMAIL}"

echo "Service account attached to Cloud Run service."
echo "IAM setup complete. This service account has NO other GCP permissions."
Output
Service account created: payments-api@payments-service-prod.iam.gserviceaccount.com
Secret Manager access granted.
Storage write access granted to gs://payments-receipts-prod only.
Pub/Sub publish access granted to payment-completed-events topic.
Service account attached to Cloud Run service.
IAM setup complete. This service account has NO other GCP permissions.
Interview Gold: Why Not Just Use roles/editor?
roles/editor grants write access to almost every GCP resource in the project — including the ability to read secrets, exfiltrate data, and create new compute resources. If a service with this role is compromised, the attacker has near-full control of your GCP project. Interviewers love asking how you'd scope permissions for a specific service. The answer is always: one dedicated service account, one role per permission needed, no editor/owner roles in production.
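One way to enforce the "no editor/owner in production" rule is to scan a project's IAM policy for service accounts that hold those roles. The sketch below works on the JSON structure that `gcloud projects get-iam-policy PROJECT_ID --format=json` returns (a `bindings` list of `role` and `members`); the function name and the sample policy are hypothetical, and it only inspects the policy you hand it, not roles inherited from folders or the organisation.

```python
# Roles the article says should never be bound to a service account in production.
BROAD_ROLES = {"roles/owner", "roles/editor"}


def find_broad_service_accounts(policy: dict) -> list:
    """Return sorted (member, role) pairs for service accounts holding a broad role."""
    findings = []
    for binding in policy.get("bindings", []):
        role = binding.get("role")
        if role not in BROAD_ROLES:
            continue
        for member in binding.get("members", []):
            if member.startswith("serviceAccount:"):
                findings.append((member, role))
    return sorted(findings)


if __name__ == "__main__":
    # Hypothetical policy, shaped like the gcloud JSON output.
    sample_policy = {
        "bindings": [
            {
                "role": "roles/editor",
                "members": [
                    "serviceAccount:legacy-job@payments-service-prod.iam.gserviceaccount.com",
                    "user:admin@example.com",
                ],
            },
            {
                "role": "roles/pubsub.publisher",
                "members": [
                    "serviceAccount:payments-api@payments-service-prod.iam.gserviceaccount.com"
                ],
            },
        ]
    }
    for member, role in find_broad_service_accounts(sample_policy):
        print(f"{member} holds {role}")
```

A non-empty result is a good candidate for a failing CI check, so over-broad bindings are caught before they reach production.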
Production Insight
One of the most common causes of GCP security incidents is an over-permissioned service account.
A single roles/editor binding on a project lets an attacker who compromises that identity modify almost everything in it.
Rule: use custom roles or least-privilege predefined roles; never use roles/editor in production.
Key Takeaway
IAM is not an afterthought — design least-privilege before deploying.
One SA per service, one role per need.
Use Workload Identity for GKE to avoid key files.
IAM Strategy Decisions
If: Service needs to read from Cloud SQL
Use: Grant roles/cloudsql.client on the instance or project
If: Service needs to publish to Pub/Sub
Use: Grant roles/pubsub.publisher on the specific topic
If: Service needs to write to a specific GCS bucket
Use: Grant roles/storage.objectCreator on the bucket

GCP Networking: VPCs, Firewalls, and Connectivity

GCP's networking model is built around Virtual Private Clouds (VPCs). A VPC is a global isolated network that spans all regions. Within it, you define subnets per region, each with a private IP range. By default, GCP creates a 'default' VPC with permissive firewall rules — convenient for prototyping but dangerous for production. Always create a custom VPC for production workloads.

Subnets are regional IP ranges (e.g., 10.0.0.0/20 in us-central1). Firewall rules apply to every connection, including traffic between VMs in the same subnet, so internal traffic needs an explicit allow rule (the default network ships with one; a custom VPC does not). Firewall rules are stateful: every VPC has an implied rule that denies all ingress and another that allows all egress, both at the lowest priority. Rule order doesn't matter; priority does. Private Google Access lets VMs without external IPs reach Google APIs via Google's internal network. Cloud NAT gives VMs without an external IP outbound access to the internet. VPC Peering connects two VPCs so they can communicate using internal IPs, which is common in multi-project setups. Shared VPC centralises network management: a host project shares its VPC with service projects.

Best practice: start with a custom VPC, define subnets for each tier (frontend, backend, data), apply firewall rules that deny all ingress except on specific ports from specific source ranges, and use Private Google Access for all API calls.
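The "priority, not order" behaviour is easier to see in code. The sketch below is a simplified model of ingress evaluation, not GCP's implementation: it checks rules from the lowest priority number upward, lets deny win a tie, and falls back to the implied deny. It ignores target tags, service accounts, and egress. The class and function names are hypothetical; the sample ranges are the ones used elsewhere in this article.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    name: str
    priority: int                   # lower number = evaluated first
    action: str                     # "allow" or "deny"
    source_ranges: tuple            # CIDR strings
    protocol: str                   # "tcp", "udp" or "icmp"
    ports: frozenset = frozenset()  # empty set = every port


def rule_matches(rule: IngressRule, src_ip: str, protocol: str, port) -> bool:
    if rule.protocol != protocol:
        return False
    if rule.ports and port not in rule.ports:
        return False
    address = ipaddress.ip_address(src_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in rule.source_ranges)


def evaluate_ingress(rules, src_ip: str, protocol: str, port=None) -> str:
    """Return 'allow' or 'deny' for one incoming connection."""
    # Sort by priority; on equal priority a deny rule is checked before an allow rule.
    ordered = sorted(rules, key=lambda r: (r.priority, r.action != "deny"))
    for rule in ordered:
        if rule_matches(rule, src_ip, protocol, port):
            return rule.action
    return "deny"  # implied deny-all ingress rule


if __name__ == "__main__":
    rules = [
        IngressRule("allow-health-checks", 1000, "allow",
                    ("130.211.0.0/22", "35.191.0.0/16"), "tcp", frozenset({80, 443})),
        IngressRule("allow-internal", 1000, "allow", ("10.0.0.0/16",), "tcp"),
    ]
    print(evaluate_ingress(rules, "35.191.1.1", "tcp", 443))
    print(evaluate_ingress(rules, "203.0.113.5", "tcp", 22))
```

With only these two allow rules, an SSH attempt from an arbitrary internet address matches nothing and hits the implied deny, which is the behaviour a custom VPC is meant to give you.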

gcp_custom_vpc_setup.sh (BASH)
#!/bin/bash
# -----------------------------------------------------------
# CUSTOM VPC SETUP FOR PRODUCTION
# Creates a custom VPC with three subnets (frontend, backend, data)
# and minimal firewall rules.
# -----------------------------------------------------------

PROJECT_ID="payments-service-prod"
VPC_NAME="payments-prod-vpc"
REGION="us-central1"
FRONTEND_SUBNET="frontend-subnet"
BACKEND_SUBNET="backend-subnet"
DATA_SUBNET="data-subnet"

# Step 1: Create a custom VPC (not auto-mode)
gcloud compute networks create "${VPC_NAME}" \
  --project="${PROJECT_ID}" \
  --subnet-mode=custom

echo "VPC created: ${VPC_NAME}"

# Step 2: Create subnets for each tier
gcloud compute networks subnets create "${FRONTEND_SUBNET}" \
  --project="${PROJECT_ID}" \
  --network="${VPC_NAME}" \
  --region="${REGION}" \
  --range="10.0.1.0/24" \
  --enable-private-ip-google-access

gcloud compute networks subnets create "${BACKEND_SUBNET}" \
  --project="${PROJECT_ID}" \
  --network="${VPC_NAME}" \
  --region="${REGION}" \
  --range="10.0.2.0/24" \
  --enable-private-ip-google-access

gcloud compute networks subnets create "${DATA_SUBNET}" \
  --project="${PROJECT_ID}" \
  --network="${VPC_NAME}" \
  --region="${REGION}" \
  --range="10.0.3.0/24" \
  --enable-private-ip-google-access

echo "Subnets created."

# Step 3: Create firewall rules — deny all ingress by default, then allow needed
# Allow health checks from GCP's health checker IP ranges
gcloud compute firewall-rules create "${VPC_NAME}-allow-health-checks" \
  --project="${PROJECT_ID}" \
  --network="${VPC_NAME}" \
  --direction=ingress \
  --priority=1000 \
  --source-ranges="130.211.0.0/22,35.191.0.0/16" \
  --target-tags="allow-health-checks" \
  --rules="tcp:80,tcp:443"

echo "Firewall rule for health checks created."

# Step 4: Create a firewall rule to allow internal traffic between subnets
gcloud compute firewall-rules create "${VPC_NAME}-allow-internal" \
  --project="${PROJECT_ID}" \
  --network="${VPC_NAME}" \
  --direction=ingress \
  --priority=1000 \
  --source-ranges="10.0.0.0/16" \
  --rules="tcp:0-65535,udp:0-65535,icmp"

echo "Internal traffic rule created."

echo "Custom VPC setup complete."
Output
VPC created: payments-prod-vpc
Subnets created.
Firewall rule for health checks created.
Internal traffic rule created.
Custom VPC setup complete.
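Before creating subnets, it helps to confirm that the planned ranges do not overlap and that the internal firewall rule's source range actually covers them. The sketch below does that with Python's standard `ipaddress` module, using the CIDR values from the script above; the function name is hypothetical.

```python
import ipaddress
from itertools import combinations

# Ranges taken from the setup script above.
SUBNETS = {
    "frontend-subnet": "10.0.1.0/24",
    "backend-subnet": "10.0.2.0/24",
    "data-subnet": "10.0.3.0/24",
}
INTERNAL_SOURCE_RANGE = "10.0.0.0/16"


def check_subnet_plan(subnets: dict, internal_range: str) -> list:
    """Return a list of human-readable problems; an empty list means the plan is consistent."""
    problems = []
    parent = ipaddress.ip_network(internal_range)
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}

    # Every subnet must sit inside the range the internal firewall rule allows.
    for name, network in networks.items():
        if not network.subnet_of(parent):
            problems.append(f"{name} ({network}) is outside {parent}")

    # No two subnets may share addresses.
    for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} ({net_a}) overlaps {name_b} ({net_b})")

    return problems


if __name__ == "__main__":
    issues = check_subnet_plan(SUBNETS, INTERNAL_SOURCE_RANGE)
    print("Plan is consistent." if not issues else "\n".join(issues))
```

The same check is worth repeating whenever a new region or tier is added, since overlapping ranges also block VPC peering later.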
Don't Use the Default VPC in Production
GCP's default VPC has an ingress rule allowing SSH (tcp:22) and RDP (tcp:3389) from any IP (0.0.0.0/0). If you deploy a VM with an external IP, it's accessible from the internet within minutes. Always create a custom VPC with strict firewall rules.
Production Insight
Using the default VPC in production often leaves SSH and RDP open on all instances.
Attackers scan GCP IP ranges and find exposed instances within hours.
Rule: always create a custom VPC with ingress firewall rules that only allow your specific IP ranges.
Key Takeaway
Custom VPCs are mandatory for production.
Default VPC is too permissive.
Use Private Google Access to keep VMs off the internet.
VPC Design Decisions
If: Single project, small app
Use: Custom VPC with a single subnet per region is fine
If: Multiple services with different security tiers
Use: Multiple subnets with strict firewall rules between them
If: Multi-project organisation
Use: Shared VPC for centralised networking; use VPC peering for isolated projects
● Production incident · POST-MORTEM · severity: high

Data Exposure via Public Bucket

Symptom
A security scanner flagged the bucket as publicly accessible. Later analysis showed automated scrapers had downloaded the data.
Assumption
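The team believed that 'allUsers' only applied to authenticated Google users, not the entire internet. (The principal they had in mind is allAuthenticatedUsers, which itself covers anyone signed in to any Google account.)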
Root cause
The IAM binding roles/storage.objectViewer for allUsers made all objects readable without any authentication.
Fix
Immediately removed the allUsers binding using gcloud storage buckets remove-iam-policy-binding gs://BUCKET_NAME --member=allUsers --role=roles/storage.objectViewer. Then rotated any exposed secrets and rotated the bucket's default KMS key. Migrated to signed URLs for temporary access.
Key lesson
  • Never grant allUsers access to any bucket that contains sensitive data. Use signed URLs for time-limited access.
  • Audit bucket IAM bindings regularly with Cloud Asset Inventory.
  • Enable Data Access audit logs so reads of exposed objects can be traced. Object Versioning and retention policies protect against deletion and overwrites, not against exposure.
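The audit step in these lessons could look roughly like the sketch below. The function names are hypothetical, and it assumes that gcloud storage buckets get-iam-policy honours the standard --flatten and --format flags and that --public-access-prevention is available in your gcloud version; treat it as a starting point rather than a vetted tool.

```shell
# Hypothetical helpers for spotting and blocking public bucket access.
TAB="$(printf '\t')"

# Succeeds when a member string means "anyone", signed in or not.
is_public_member() {
  case "$1" in
    allUsers|allAuthenticatedUsers) return 0 ;;
    *) return 1 ;;
  esac
}

# Reads "ROLE<TAB>MEMBER" lines on stdin; prints only the public bindings.
filter_public_bindings() {
  while IFS="${TAB}" read -r b_role b_member; do
    if is_public_member "${b_member}"; then
      printf '%s %s\n' "${b_role}" "${b_member}"
    fi
  done
}

# Usage: list_public_bindings BUCKET_NAME
list_public_bindings() {
  gcloud storage buckets get-iam-policy "gs://$1" \
    --flatten="bindings[].members" \
    --format="value(bindings.role,bindings.members)" |
    filter_public_bindings
}

# Usage: block_public_access BUCKET_NAME
# Enforces public access prevention so public bindings stop taking effect.
block_public_access() {
  gcloud storage buckets update "gs://$1" --public-access-prevention
}
```

For a project-wide sweep, Cloud Asset Inventory offers gcloud asset search-all-iam-policies with a --scope and a --query such as policy:allUsers; the exact query syntax is worth confirming against the current documentation.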
Production debug guide · Symptom → Action guide for the most common GCP issues · 5 entries
Symptom · 01
gcloud command fails with 'Permission denied'
Fix
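Confirm which identity gcloud is using with gcloud auth list, and re-authenticate with gcloud auth login if it is the wrong one. (gcloud auth application-default login and GOOGLE_APPLICATION_CREDENTIALS configure client libraries, not the gcloud CLI itself.) Then verify that identity has the required role with gcloud projects get-iam-policy PROJECT_ID.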
Symptom · 02
Can't reach a GCE instance via external IP
Fix
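Check the firewall rules on the instance's network: gcloud compute firewall-rules list --filter=network=NETWORK_NAME. Ensure an ingress rule allows traffic on the required port from your IP, that its target tags match the instance, and that the instance actually has an external IP.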
Symptom · 03
Cloud Run service returns 403
Fix
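A 403 means the caller is not allowed to invoke the service. Verify that the calling identity (a user or the caller's service account, not the service's own) has roles/run.invoker on the service. Use gcloud run services get-iam-policy SERVICE_NAME --region=REGION.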
Symptom · 04
GKE pod cannot connect to Cloud SQL
Fix
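Check that the Google service account the Pod runs as has roles/cloudsql.client, and that Workload Identity maps the Kubernetes service account to it. If the instance uses Private IP, verify that Private Services Access (VPC peering to the service network) is configured.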
Symptom · 05
Billing is unexpectedly high
Fix
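Open the Cloud Billing reports and the cost table in the GCP Console and group by project, service, and SKU to find the source. Label Compute Engine VMs so their cost can be attributed to an owner. Run gcloud billing accounts list to confirm which billing account is involved, and set budget alerts on it so the next spike is caught early.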
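For the Cloud Run 403 entry, a check along these lines can confirm whether the calling identity actually holds roles/run.invoker. The helper names are hypothetical, the tab-separated layout assumes gcloud's value format with the two fields requested, and group memberships are not expanded, so a caller who gets access through a group will be reported as missing.

```shell
# Hypothetical helpers: does CALLER hold roles/run.invoker on a service?
TAB="$(printf '\t')"

# Reads "ROLE<TAB>MEMBER" lines on stdin. Succeeds if the member given as $1
# (or allUsers) is bound to roles/run.invoker.
has_invoker() {
  while IFS="${TAB}" read -r r_role r_member; do
    [ "${r_role}" = "roles/run.invoker" ] || continue
    case "${r_member}" in
      "$1"|allUsers) return 0 ;;
    esac
  done
  return 1
}

# Usage: check_invoker SERVICE_NAME REGION CALLER
# CALLER looks like serviceAccount:NAME@PROJECT.iam.gserviceaccount.com
check_invoker() {
  gcloud run services get-iam-policy "$1" --region="$2" \
    --flatten="bindings[].members" \
    --format="value(bindings.role,bindings.members)" |
    has_invoker "$3"
}
```

If the check fails, gcloud run services add-iam-policy-binding SERVICE_NAME --region=REGION --member=CALLER --role=roles/run.invoker grants the missing binding.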
★ GCP CLI Debug Cheat Sheet · Quick commands to identify and fix common GCP issues in under 60 seconds.
Authentication failure
Immediate action
Re-authenticate with `gcloud auth login`
Commands
gcloud auth login
gcloud config list account
Fix now
Also check GOOGLE_APPLICATION_CREDENTIALS env var is set correctly.
Project not found
Immediate action
List accessible projects
Commands
gcloud projects list
gcloud config set project PROJECT_ID
Fix now
Ensure billing is enabled for the project.
Compute Engine not starting
Immediate action
Check instance status
Commands
gcloud compute instances describe INSTANCE_NAME --zone=ZONE --format='value(status)'
gcloud compute instances start INSTANCE_NAME --zone=ZONE
Fix now
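If the instance is stuck in PROVISIONING, check the region's quota limits (CPUs, GPUs, IP addresses) and whether the zone has capacity for the machine type.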
GKE cluster unreachable
Immediate action
Get cluster credentials
Commands
gcloud container clusters get-credentials CLUSTER_NAME --region=REGION
kubectl get nodes
Fix now
If nodes are in NotReady state, run kubectl describe node NODE_NAME and look for disk pressure or memory pressure conditions.
Storage bucket permission issues
Immediate action
Test bucket access
Commands
gcloud storage ls gs://BUCKET_NAME
gcloud storage buckets get-iam-policy gs://BUCKET_NAME
Fix now
Grant roles/storage.objectViewer to the specific service account or user that needs read access, never to allUsers.
Compute Options Comparison
Dimension | Compute Engine (GCE) | Google Kubernetes Engine (GKE) | Cloud Run
Abstraction Level | Raw VMs (IaaS) | Managed Kubernetes (CaaS) | Serverless containers (PaaS)
Ops Overhead | High: you manage OS, patching, scaling | Medium: GCP manages control plane, you manage node pools | Low: GCP manages everything except your container
Scaling Behaviour | Manual or MIG autoscaling (minutes) | Pod autoscaling via HPA (seconds); adding nodes takes minutes | Automatic scale-to-zero and scale-out (seconds, subject to cold starts)
Billing Unit | Per-second VM uptime | Per-second node uptime | Per-request CPU and memory (free at idle)
Best For | Legacy apps, GPU workloads, custom OS configs | Microservices, multi-container apps, stateful workloads | Stateless APIs, event-driven functions, variable traffic
Cold Start | None (always running) | None (always running) | Yes: 300ms to 3s depending on runtime
Max Request Timeout | N/A (not request-oriented) | N/A (not request-oriented) | 3600 seconds (1 hour)
Minimum Cost | ~$5-10/month for f1-micro | ~$70/month for smallest cluster | $0/month at zero traffic

Key takeaways

1. GCP's Project is the atomic unit of isolation: billing, IAM, and APIs are all scoped per project. Use separate projects for dev, staging, and prod rather than sharing one project across environments.
2. The compute decision (GCE vs GKE vs Cloud Run) is really a decision about how much operational ownership you want: more control always means more operational overhead. Cloud Run is the default choice for new stateless services unless you have a specific reason not to use it.
3. The right storage service is determined by your data access pattern: Cloud Storage for blobs, Cloud SQL for relational data that fits in a single instance (up to roughly 64 TB), Spanner for globally distributed relational data, Firestore for document-shaped hierarchical data. Mixing these up costs money and performance.
4. IAM is not an afterthought: set up least-privilege service accounts before you deploy your first service. The cost of retrofitting permissions on a live system is far higher than getting it right during initial setup.
5. Custom VPCs are mandatory for production. The default VPC is too permissive. Give VMs no external IP and enable Private Google Access so they can still reach Google APIs.

Common mistakes to avoid

3 patterns
×

Enabling allUsers IAM on a GCS bucket containing user data

Symptom
All objects in the bucket are publicly readable on the internet, often discovered via a security scanner or a data breach report.
Fix
Remove the allUsers binding immediately using gcloud storage buckets remove-iam-policy-binding gs://BUCKET_NAME --member=allUsers --role=roles/storage.objectViewer. Audit Cloud Audit Logs to check which objects were accessed. Switch to signed URLs for any temporary public access.
×

Using a single service account with roles/editor for every service in a project

Symptom
If one service is compromised or a key file is leaked, an attacker gains near-full write access to the entire GCP project including secrets, databases, and compute.
Fix
Create one service account per service, grant only the specific predefined roles required (e.g., roles/pubsub.publisher, not roles/pubsub.admin), and use Workload Identity for GKE instead of key files.
×

Deploying all resources to a single zone without high-availability consideration

Symptom
A single GCP zonal outage takes down your entire application and breaks your SLA.
Fix
For Compute Engine, use regional Managed Instance Groups (MIGs) spread across multiple zones. For Cloud SQL, enable High Availability to provision a standby instance in a different zone. For GKE, create a regional cluster with --region, or list several zones with --node-locations; --num-nodes then applies per zone.
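The multi-zone fix can be sketched as a few thin wrappers around gcloud. Everything named here (require_multi_zone, create_regional_mig, enable_sql_ha, create_regional_gke, and their argument order) is hypothetical and exists only for illustration; the node count and sizes are placeholders to adapt.

```shell
# Succeeds only when the comma-separated list names at least two distinct zones.
require_multi_zone() {
  zone_count="$(printf '%s\n' "$1" | tr ',' '\n' | sort -u | grep -c . || true)"
  [ "${zone_count}" -ge 2 ]
}

# Usage: create_regional_mig NAME TEMPLATE REGION ZONES SIZE
# ZONES is a comma-separated list such as us-central1-a,us-central1-b
create_regional_mig() {
  require_multi_zone "$4" || { echo "need at least two zones" >&2; return 1; }
  gcloud compute instance-groups managed create "$1" \
    --template="$2" --region="$3" --zones="$4" --size="$5"
}

# Usage: enable_sql_ha INSTANCE_NAME
# REGIONAL availability provisions a standby in another zone.
enable_sql_ha() {
  gcloud sql instances patch "$1" --availability-type=REGIONAL
}

# Usage: create_regional_gke CLUSTER_NAME REGION
# In a regional cluster, --num-nodes is the node count per zone.
create_regional_gke() {
  gcloud container clusters create "$1" --region="$2" --num-nodes=1
}
```

The guard in create_regional_mig is the design point: it refuses to create a group that would end up in a single zone, which is the mistake this entry describes.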
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
You're building a payments microservice that needs to read from Cloud SQ...
Q02SENIOR
A product manager tells you traffic to your API is unpredictable — quiet...
Q03SENIOR
Your team is moving from a monolith to microservices on GCP. How do you ...
Q01 of 03SENIOR

You're building a payments microservice that needs to read from Cloud SQL and publish to Pub/Sub. Walk me through how you'd set up IAM for it in production — and specifically, what would you NOT do that junior engineers typically get wrong?

ANSWER
First, create a dedicated service account for the payments microservice, with no shared accounts. Grant roles/cloudsql.client for connecting to Cloud SQL and roles/pubsub.publisher for publishing, ideally on the specific topic rather than the whole project. Connect to Cloud SQL through the Cloud SQL Auth Proxy or over Private IP; the database user still needs its own grants inside the database. What I would NOT do: use roles/editor or roles/cloudsql.admin. Those grant far too much access, including the ability to delete instances and modify most resources in the project. Also avoid embedding a long-lived service account key file in the container; use Workload Identity (if on GKE) or attach the service account to the Cloud Run service directly.
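The answer above could translate into something like the following sketch. The service account name payments-svc and all function names are made up for this example, the broad-role pattern is a rough heuristic rather than an official list, and the email format assumes a standard (non domain-scoped) project ID.

```shell
# Builds the email address of a user-managed service account.
# Usage: sa_email SA_NAME PROJECT_ID
sa_email() {
  printf '%s@%s.iam.gserviceaccount.com\n' "$1" "$2"
}

# Succeeds for roles this sketch treats as too broad for a workload.
is_broad_role() {
  case "$1" in
    roles/owner|roles/editor|roles/*.admin|roles/*Admin) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: grant_role PROJECT_ID SA_NAME ROLE
# Refuses broad roles before touching the project's IAM policy.
grant_role() {
  if is_broad_role "$3"; then
    echo "refusing to grant broad role: $3" >&2
    return 1
  fi
  gcloud projects add-iam-policy-binding "$1" \
    --member="serviceAccount:$(sa_email "$2" "$1")" \
    --role="$3" --condition=None
}

# Usage: setup_payments_sa PROJECT_ID
setup_payments_sa() {
  gcloud iam service-accounts create payments-svc --project="$1" \
    --display-name="Payments microservice" &&
    grant_role "$1" payments-svc roles/cloudsql.client &&
    grant_role "$1" payments-svc roles/pubsub.publisher
}
```

On Cloud Run, the resulting account would then be attached at deploy time with the --service-account flag of gcloud run deploy, so no key file ever needs to be created.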
FAQ · 1 QUESTIONS

Frequently Asked Questions

01
Is Google Cloud Platform better than AWS for beginners?