Intermediate 21 min · March 06, 2026

Introduction to Google Cloud Platform

GCP — Stop allUsers IAM Data Leaks

Q: Is Google Cloud Platform better than AWS for beginners?

Neither is objectively better — they solve the same problems with different UX and pricing models. GCP tends to have a cleaner CLI (gcloud) and more opinionated managed services like Cloud Run and BigQuery that reduce setup time. AWS has a larger ecosystem and more third-party integrations. For net-new projects without existing cloud investments, GCP's Cloud Run and managed Kubernetes are genuinely excellent starting points, and GCP's free tier is generous enough to learn without a credit card charge.

Q: Is Google Cloud cheaper than AWS?

Often yes, for compute — GCP applies sustained-use discounts automatically when you run a VM for more than 25% of a month, without any upfront commitment. AWS requires Reserved Instances for equivalent savings. GCP also uses per-second billing (after the first minute) vs AWS's per-hour legacy pricing on some instance types. For data analytics, BigQuery's serverless model is typically cheaper than maintaining an EMR or Redshift cluster. Network egress pricing is comparable between the two.

Q: What is BigQuery and why does GCP recommend it?

BigQuery is Google's serverless, columnar data warehouse. You load data and run SQL: there's no cluster to provision, no nodes to size, no indexes to maintain. It scales automatically and can scan terabytes in seconds. The key architectural insight: it separates compute from storage, so you pay only for queries run and storage used. For teams that would otherwise run Redshift, Snowflake, or a self-managed Spark cluster, BigQuery can cut both cost and operational burden substantially, though the size of the saving depends on query volume.

Q: What is GCP's sustained-use discount and how does it compare to AWS Reserved Instances?

Sustained-use discounts (SUDs) are automatically applied when a Compute Engine VM runs for more than 25% of a month: no upfront commitment, no contract. The discount grows with usage. At 100% monthly usage it reaches up to about 30% off on-demand pricing on N1 machine types; newer families such as N2 top out closer to 20%, and some (E2, for example) are not eligible. AWS requires purchasing Reserved Instances (1-year or 3-year commitment, paid upfront or monthly) to get equivalent savings. GCP's SUD is effectively a free Reserved Instance for workloads that run continuously. For variable workloads that sometimes run and sometimes don't, AWS Savings Plans may be more flexible.

Q: What is the difference between Cloud Monitoring and Cloud Logging?

Cloud Monitoring handles metrics — time-series numerical data (CPU %, request latency, error rate). You build dashboards, set alerting thresholds, and create uptime checks with it. Cloud Logging handles log records — structured text events from your application and GCP services. The two integrate: you can create log-based metrics in Cloud Monitoring from log patterns, and route logs to BigQuery for long-term analytics. For most teams, the practical split is: Cloud Monitoring for 'is the system healthy?' and Cloud Logging for 'why did it fail?'
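To make the integration concrete, here is a minimal sketch of a log-based metric and a BigQuery log sink using the gcloud CLI. The project ID, metric name, sink name, and dataset are hypothetical placeholders, and the severity filter is just one plausible choice. The functions are only defined here; nothing touches your project until you call them.

```bash
# Sketch only: PROJECT_ID, the metric name, the sink name, and the dataset
# are hypothetical placeholders.
PROJECT_ID="${PROJECT_ID:-my-project}"

# Count ERROR-or-worse log entries as a metric Cloud Monitoring can chart and alert on.
create_error_metric() {
  gcloud logging metrics create app_error_count \
    --project="${PROJECT_ID}" \
    --description="Count of log entries at severity ERROR or above" \
    --log-filter='severity>=ERROR'
}

# Route the same entries to a BigQuery dataset for long-term analysis.
create_bigquery_sink() {
  local dataset="$1"
  gcloud logging sinks create error-logs-to-bq \
    "bigquery.googleapis.com/projects/${PROJECT_ID}/datasets/${dataset}" \
    --project="${PROJECT_ID}" \
    --log-filter='severity>=ERROR'
}
```

After creating a sink, you would normally still need to grant the sink's writer identity permission to write to the dataset before any rows arrive.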

A real incident: allUsers IAM binding made a GCP bucket publicly accessible, letting scrapers exfiltrate data.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.

Last updated: July 27, 2026

Before you start · ⏱ 25 min

✓ Solid grasp of DevOps fundamentals
✓ Comfortable with command-line tools
✓ Basic Linux administration knowledge

⚡Quick Answer

GCP is Google's cloud platform built on the same infrastructure powering Search and YouTube
The Project is the atomic unit of isolation — billing, IAM, and APIs are per-project
Compute options: GCE (VMs), GKE (Kubernetes), Cloud Run (serverless containers) — pick by ops overhead tolerance
Storage services: GCS (blobs), Cloud SQL (relational), Spanner (global), Firestore (NoSQL) — match your data access pattern
Biggest mistake: using roles/editor on a service account — grants nearly full write access, making any compromise catastrophic
Performance insight: Cloud Run scales to zero, costing $0 at idle; GKE clusters cost ~$70/month minimum even when idle

✦ Definition · ~90s read

What is Google Cloud Platform?

GCP (Google Cloud Platform) is a suite of cloud computing services offered by Google, running on the same infrastructure that Google uses internally for its end-user products like Google Search, Gmail, and YouTube. It provides a comprehensive set of modular cloud services including computing, data storage, data analytics, and machine learning, accessible via a public internet connection.

GCP is one of the three major public cloud providers, alongside AWS and Azure, and is distinguished by its strong emphasis on open-source technologies, data and analytics capabilities, and global network infrastructure.

Plain-English First

Imagine you're opening a restaurant but you don't want to buy the building, the ovens, or hire an electrician. Instead, you rent a fully-equipped kitchen by the hour — use as much or as little as you need, and pay only for what you cook. Google Cloud Platform is exactly that, but for software. Instead of buying servers, databases, and networking gear, your app rents Google's global infrastructure by the second. When traffic spikes on Black Friday, you dial up the kitchen size. When it's quiet, you dial it back down. No hardware, no waste.

Every production application you've ever used — from a startup's API to a Fortune 500's data pipeline — runs on someone's computers. The question is whose, and at what cost. Running your own servers means upfront capital, a team to maintain them, and a very bad Monday when one fails at 2 AM. Cloud platforms exist to flip that model: you get world-class infrastructure on demand, billed like a utility, with Google's Site Reliability Engineers quietly keeping the lights on behind the scenes. Google Cloud Platform is Google's answer to that problem, and it's built on the same infrastructure that runs Search, Gmail, and YouTube — systems engineered to handle billions of requests a day.

The real problem GCP solves isn't just 'running code remotely.' It's the operational complexity that kills engineering teams: patching OS vulnerabilities, provisioning storage that scales automatically, routing traffic across continents, and debugging distributed systems. Before managed cloud services, teams burned enormous engineering hours on infrastructure that added zero value to their product. GCP packages that complexity into opinionated, composable services so your team can stay focused on the thing that actually matters — the software itself.

After reading this, you'll confidently map a real-world application's requirements to specific GCP services, understand the difference between GCP's compute tiers and when each is appropriate, deploy a containerized workload to Google Kubernetes Engine, and avoid the billing and security mistakes that catch new GCP users off guard. This isn't a tour of the UI — it's a mental model you'll actually use.

How IAM allUsers Leaks Your Data

The allUsers principal in Google Cloud IAM stands for anyone on the internet, signed in or not. Bind a role to allUsers and every permission in that role becomes available without authentication. It is one of the most common causes of unintended data exposure. On a bucket, that means anyone who knows or guesses the URL can do whatever the role allows: read, and with broader roles, write or delete. This is not a bug; it is a deliberate configuration that teams apply without understanding the blast radius.

In practice, allUsers is often used for public websites or static assets. The problem is that the binding covers everything the role permits on every object in the bucket, not just the files you meant to publish. A bucket with allUsers on roles/storage.objectViewer exposes the contents and metadata of every object. With roles/storage.objectAdmin, anyone can delete or overwrite files. IAM applies no rate limit or IP restriction to the binding, anonymous requests carry no identity to attribute, and Cloud Storage Data Access audit logs are off unless you enable them. Policy changes typically take effect within minutes, so a misconfiguration is live almost immediately.

Use allUsers only when you explicitly need public access and have no other option. For most use cases, use a load balancer with Cloud CDN, or a signed URL with a short expiration. If you must use allUsers, restrict it to the minimum role (e.g., storage.objectViewer) and never combine it with write or delete roles. Audit your IAM policies weekly — a single allUsers binding on a production bucket is a data breach waiting to happen.
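The weekly audit can be scripted. The sketch below lists a project's buckets and flags any whose IAM policy names allUsers or allAuthenticatedUsers. The function names are hypothetical, the JSON check is a plain text match rather than a real parser, and it assumes the name field of `gcloud storage buckets list` holds the bare bucket name.

```bash
# Succeeds (exit 0) when the IAM policy JSON on stdin names a public principal.
has_public_binding() {
  grep -Eq '"(allUsers|allAuthenticatedUsers)"'
}

# Print every bucket in the project whose bucket-level policy has a public binding.
audit_public_buckets() {
  local project="$1" bucket
  gcloud storage buckets list --project="${project}" --format="value(name)" |
    while read -r bucket; do
      # </dev/null stops the inner command from consuming the bucket list on stdin
      if gcloud storage buckets get-iam-policy "gs://${bucket}" --format=json </dev/null |
        has_public_binding; then
        echo "PUBLIC: gs://${bucket}"
      fi
    done
}
```

Run it as `audit_public_buckets my-project` from cron or CI. It only inspects bucket-level IAM, so buckets that still use per-object ACLs need a separate check. A flagged binding can then be dropped with `gcloud storage buckets remove-iam-policy-binding`.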

⚠ allUsers is not the same as authenticated users

allUsers includes unauthenticated requests. allAuthenticatedUsers still requires a Google account but is rarely needed — prefer service accounts or domain-restricted sharing.

📊 Production Insight

A team set a bucket to allUsers for a static site, but the bucket also contained backup files with PII. The backup files were publicly readable for 6 months before a security scan detected them.

The symptom: no failed auth logs, no unusual traffic patterns — the data was just silently accessible.

Rule: Never apply allUsers to a bucket that contains any data you cannot publish to the entire internet. Use separate buckets for public and private data.

🎯 Key Takeaway

allUsers grants anonymous access to everything the bound role permits, on every object in the bucket.

Audit all IAM policies for allUsers bindings at least weekly; a single misconfiguration is a breach.

Prefer signed URLs, load balancers, or Cloud CDN over allUsers for public content.

thecodeforge.io

Introduction Google Cloud

What Is Google Cloud Platform? — Services, Strengths, and the GCP Mental Model

Google Cloud Platform is a suite of cloud computing services that run on the same global infrastructure powering Search, YouTube, and Gmail. That's not marketing fluff — it's the core differentiator. GCP's network spans over 200 points of presence connected by a private fiber backbone. Your traffic never touches the public internet unless you deliberately expose it.

GCP organizes its services into three layers. The infrastructure layer includes Compute Engine for raw VMs, Google Kubernetes Engine (GKE) for container orchestration, and bare metal servers for specialized workloads. Above that sits the managed platform layer — Cloud Run, App Engine, and Cloud Functions. These abstract away servers entirely. You push code, GCP runs it. The top layer is data and ML — BigQuery for analytics, Vertex AI for training and serving models, and Pub/Sub for event streaming.

Three structural differences separate GCP from AWS. First, the network. GCP's global network runs on Google's own fiber and undersea cable investments, and on the Premium network tier user traffic joins that network at the nearest edge location. AWS also keeps inter-region traffic on its own backbone, so the gap is smaller than it is often made out to be; the difference is mostly in how early traffic enters the provider's network. Second, BigQuery is serverless analytics done right. No cluster to manage, no nodes to size, no indexes to tune. Run SQL against petabytes of data and it just works. AWS has Redshift, but even Redshift Serverless requires specifying RPUs. Third, Kubernetes lineage. GKE didn't just adopt Kubernetes; Google created it, and GKE tends to pick up new features early. AWS's EKS and Azure's AKS are followers, not leaders.

If you're coming from AWS, the mental model shift is this: projects replace accounts, regions are the same, and zones always sit within a region. Every resource lives in a project. IAM policies attach at the organization, folder, or project level and inherit downward. VPC networks are global: one VPC can span every region, with subnets scoped to a single region. Each project gets a default VPC, and you can use Shared VPC or VPC peering to connect workloads across projects.

The production insight here: traffic between GCP regions stays on Google's backbone, but it is not free. Inter-region transfer is billed per GB at rates that depend on the source and destination, much as on AWS. Traffic between VMs in the same zone over internal IPs is free. If you're running multi-region microservices that talk to each other, price the cross-region traffic before you commit to the topology.

🔥GCP's secret weapon

The private fiber backbone keeps cross-region traffic between your GCP resources off the public internet. That helps latency and security, but inter-region transfer is still billed per GB.

📊 Production Insight

GCP's network is its moat. 200+ PoPs. Private fiber. Cross-region traffic stays on Google's backbone, though it is still billed.

BigQuery separates compute from storage — you pay per query, not per cluster.

The rule: if your workload talks across regions, GCP's network alone can justify the switch.

🎯 Key Takeaway

GCP runs on Google's own global network.

Three layers: infrastructure, managed platform, data/ML.

Choose GCP when network performance and serverless analytics matter most.

GCP vs AWS vs Azure — Choosing the Right Cloud for Your Use Case

The cloud triopoly offers three different pricing philosophies. GCP charges per second after the first minute, with sustained-use discounts that kick in automatically when a VM runs over 25% of a month. No upfront commitment. AWS charges per second or per hour depending on the operating system and purchase option. Reserved Instances save 40-70% but lock you into 1- or 3-year terms. Azure charges per minute. Which one wins? It depends on your usage pattern. If your VMs run 24/7 and you're willing to commit, AWS Reserved Instances usually come out cheaper. If you scale up and down unpredictably and can't forecast a commitment, GCP's automatic discounts are the safer bet.

ML and AI is a battleground where each vendor has a distinct advantage. GCP offers Vertex AI for end-to-end ML, with first-class TPU support. Google positions these custom chips as faster or cheaper than comparable GPUs for large transformer workloads, though results vary by model and batch size. AWS SageMaker is more mature but tied to GPU pricing. Azure's advantage is the OpenAI partnership: you can deploy OpenAI models such as GPT-4 directly through Azure's own service and billing.

Kubernetes is not a level playing field. Google invented Kubernetes. GKE is the benchmark — automatic node repair, vertical pod autoscaling, and GKE Autopilot that manages the entire cluster. EKS and AKS are catching up, but they still lag on features. If Kubernetes is central to your architecture, GKE is the safest bet.

Multi-cloud and hybrid deployments vary wildly. GCP's Anthos lets you run GKE on-premises and on AWS. AWS Outposts brings AWS hardware to your data center. Azure Arc manages servers and Kubernetes clusters across clouds. Each solution works, but only Anthos provides a unified Kubernetes-based control plane across environments.

Data analytics is where GCP dominates. BigQuery runs SQL over petabytes in seconds. No cluster management. Redshift requires provisioning nodes, designing sort keys, and running VACUUM commands. Azure Synapse is somewhere in between. If your team spends time tuning data warehouses instead of analyzing data, GCP is the obvious choice.

Choose GCP when: BigQuery is your data warehouse, you're already on Kubernetes, or ML workloads dominate. Choose AWS when: you need raw compute breadth (200+ instance types, Graviton processors) or deep integration with the broader AWS ecosystem. Choose Azure when: you're a Microsoft shop, need OpenAI access, or require deep hybrid connectivity with Active Directory.

Mental Model

The cloud decision tree

Start with your primary workload. Data analytics → GCP. Kubernetes-heavy → GCP. Windows/.NET → Azure. Compute breadth → AWS. ML with OpenAI → Azure. ML with custom models → GCP. Multi-cloud → evaluate Anthos vs Arc vs Outposts.

📊 Production Insight

GCP's sustained-use discounts apply automatically. No paperwork. No commitments.

GKE is the benchmark Kubernetes service — everything else is playing catch-up.

The rule: pick the cloud where your primary workload gets the most native leverage.

🎯 Key Takeaway

GCP wins on pricing simplicity, Kubernetes, and analytics.

AWS wins on instance variety and ecosystem breadth.

Azure wins on Microsoft integration and OpenAI access.

Getting Started: Free Tier, Account Setup, and Your First gcloud Command

GCP offers a $300 credit for 90 days on new accounts. That's enough to run a medium-sized project or stress-test a service. But that's not the only free option. The Always Free tier never expires — Cloud Run (2 million requests/month), Cloud Functions (2 million invocations/month), Cloud Storage (5 GB), and Cloud Shell (a browser-based terminal with 5 GB of persistent disk). You can run small-scale applications indefinitely at zero cost.

Setting up an account takes three steps. First, go to console.cloud.google.com and sign in with a Google account. Second, create a project or let the default one be created. Third, enable billing by entering a credit card. GCP doesn't charge until you exceed the free tier, and you can set budget alerts to prevent surprise bills. If you're paranoid, enable billing alerts at $10, $50, and $100 — you'll get email notifications before costs spiral.

Now install the gcloud CLI. On macOS:

brew install --cask google-cloud-sdk

For Linux (one-liner):

curl https://sdk.cloud.google.com | bash

Restart your shell. Run 'gcloud init' to configure the default project and region.

Type these three commands in order:

gcloud auth login

This opens a browser window to authenticate. The CLI saves credentials locally.

gcloud projects list

This lists every project your account has access to. If you just created one, you'll see it here.

gcloud config set project YOUR_PROJECT_ID

This sets the default project for all subsequent commands. Now you can run 'gcloud compute instances list' or 'gcloud storage buckets list' without specifying --project every time.

A pro tip: use gcloud config configurations to manage multiple projects. Each config holds a project, region, and account. Switch between them with 'gcloud config configurations activate config-name'. If you're juggling development, staging, and production projects, this saves you from accidentally running a destructive command on the wrong project.
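Here is a minimal sketch of that workflow. The two function names and the fixed dev/staging/prod set are assumptions made for illustration; the gcloud subcommands are the standard configuration ones.

```bash
# One-time setup: create a named configuration (it becomes the active one)
# and pin its project and region. All names passed in are your own.
create_env_config() {
  local env="$1" project="$2" region="$3"
  gcloud config configurations create "${env}" &&
    gcloud config set project "${project}" &&
    gcloud config set compute/region "${region}"
}

# Switch configurations, refusing anything outside the known set so a typo
# can't silently leave you pointed at the previous environment.
use_env() {
  case "$1" in
    dev|staging|prod) gcloud config configurations activate "$1" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}
```

Typical use would be `create_env_config staging payments-service-staging us-central1` once, then `use_env staging` whenever you switch. `gcloud config configurations list` shows which one is active.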

One more thing: enable the Compute Engine API before using any compute service. GCP requires API enablement per project — it's not automatic. Run 'gcloud services enable compute.googleapis.com'. This step trips up every new user.

io/thecodeforge/gcp/setup.sh · BASH

# Install gcloud CLI on macOS
brew install --cask google-cloud-sdk

# Load gcloud into the current zsh session (path as printed in the cask's caveats)
source "$(brew --prefix)/share/google-cloud-sdk/path.zsh.inc"

# Authenticate and configure
gcloud auth login
gcloud projects list
gcloud config set project my-production-project-42

# Enable the Compute Engine API
gcloud services enable compute.googleapis.com

# Verify setup (header row plus the first five zones)
gcloud compute zones list | head -6

Output

NAME REGION STATUS

us-central1-a us-central1 UP

us-central1-b us-central1 UP

us-central1-c us-central1 UP

us-central1-f us-central1 UP

us-east1-b us-east1 UP

💡Don't skip billing alerts

Create a budget alert the same day you enable billing. GCP won't stop services when you blow budget — you'll just get an angry email weeks later. Set alerts at 50%, 90%, and 100% of your budget.
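The budget itself can be created from the CLI instead of the console. This is a sketch under assumptions: the display name, billing account ID, and amount are placeholders, and the caller needs permission to manage budgets on the billing account.

```bash
# Create a budget with alert thresholds at 50%, 90%, and 100% of the amount.
# Usage: create_budget BILLING_ACCOUNT_ID AMOUNT   (e.g. 100USD)
create_budget() {
  local billing_account="$1" amount="$2"
  gcloud billing budgets create \
    --billing-account="${billing_account}" \
    --display-name="monthly-cap" \
    --budget-amount="${amount}" \
    --threshold-rule=percent=0.50 \
    --threshold-rule=percent=0.90 \
    --threshold-rule=percent=1.00
}
```

Remember that a budget only notifies. It does not cap spend or stop services.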

📊 Production Insight

$300 free credit for 90 days.

Always Free tier runs indefinitely — Cloud Run, Functions, Storage.

The rule: use gcloud config configurations to keep dev, staging, and prod separate.

🎯 Key Takeaway

Three steps: sign up, create project, enable billing.

Always Free tier covers small workloads forever.

Use gcloud auth login, projects list, and config set to get started.

GCP's Mental Model: Projects, Regions, and the Resource Hierarchy

Before touching any GCP service, you need to understand how GCP organises everything. Get this wrong and you'll end up with sprawling costs, broken IAM permissions, and services that can't talk to each other.

GCP groups resources into a three-tier hierarchy: Organisation → Folders → Projects. A Project is the atomic unit — every resource (a VM, a bucket, a database) lives inside exactly one project. Billing, IAM permissions, and API enablement are all scoped to the project. This is intentional: it means a dev team can have a payments-service-dev project completely isolated from payments-service-prod, with different budgets, different access controls, and separate audit logs.

Regions and zones handle physical location. A Region is a geographic area (e.g., us-central1 in Iowa). Each region contains multiple Zones (us-central1-a, us-central1-b, etc.) — these are independent data centres within that region. The rule of thumb: deploy across at least two zones for high availability, across multiple regions only if latency to global users or data sovereignty requires it. Cross-region data transfer costs money, so don't do it by default.

Understanding this hierarchy is what separates developers who get surprised by a $4,000 bill from those who plan budgets accurately from day one.

gcp_project_setup.sh · BASH

#!/bin/bash
# -----------------------------------------------------------
# GCP PROJECT SETUP SCRIPT
# Run this once to initialise a new GCP project correctly.
# Requires: gcloud CLI authenticated via `gcloud auth login`
# -----------------------------------------------------------

# Stop at the first failed command so the messages below only print on success
set -euo pipefail

# Define project configuration as variables — never hardcode these inline
PROJECT_ID="payments-service-prod"        # Must be globally unique across all GCP
BILLING_ACCOUNT_ID="012345-ABCDEF-789GHI" # Found in GCP Console > Billing
PRIMARY_REGION="us-central1"              # Closest region to your main user base
PRIMARY_ZONE="us-central1-a"             # Default zone within that region

# Step 1: Create the project
# --set-as-default means subsequent gcloud commands target this project automatically
gcloud projects create "${PROJECT_ID}" \
  --name="Payments Service Production" \
  --set-as-default

echo "Project '${PROJECT_ID}' created."

# Step 2: Link a billing account — without this, most services won't activate
gcloud billing projects link "${PROJECT_ID}" \
  --billing-account="${BILLING_ACCOUNT_ID}"

echo "Billing account linked."

# Step 3: Set the default region and zone so you don't have to repeat --region/--zone
# on every command. This saves you from accidentally deploying to the wrong region.
gcloud config set compute/region "${PRIMARY_REGION}"
gcloud config set compute/zone "${PRIMARY_ZONE}"

echo "Default region set to ${PRIMARY_REGION}, zone to ${PRIMARY_ZONE}."

# Step 4: Enable only the APIs your project actually needs.
# GCP disables most APIs by default — this is a security feature, not a bug.
# Enabling unused APIs increases your attack surface for nothing.
gcloud services enable \
  compute.googleapis.com \
  container.googleapis.com \
  sqladmin.googleapis.com \
  storage.googleapis.com

echo "Core APIs enabled."
echo "Project initialisation complete. Run 'gcloud config list' to verify."

Output

Project 'payments-service-prod' created.

Billing account linked.

Default region set to us-central1, zone to us-central1-a.

Operation "operations/acf.p2-1234567890-abcdef" finished successfully.

Core APIs enabled.

Project initialisation complete. Run 'gcloud config list' to verify.

⚠ Watch Out: Project IDs Are Permanent

Once a GCP Project ID is created, it cannot be changed, ever. Deleting the project doesn't free the ID either: a deleted project can be restored for about 30 days, and its ID can never be reused. Settle on a naming convention like {team}-{service}-{env} (e.g., platform-auth-prod) before you run that create command.
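Because the ID is permanent, it is worth validating it before calling `gcloud projects create`. The helper names below are hypothetical. The pattern encodes the documented format: 6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen.

```bash
# Succeeds when the argument is a syntactically valid GCP project ID.
is_valid_project_id() {
  printf '%s' "$1" | grep -Eq '^[a-z][a-z0-9-]{4,28}[a-z0-9]$'
}

# Build a {team}-{service}-{env} ID and refuse to print an invalid one.
build_project_id() {
  local id="$1-$2-$3"
  if is_valid_project_id "${id}"; then
    echo "${id}"
  else
    echo "invalid project ID: ${id}" >&2
    return 1
  fi
}
```

For example, `build_project_id platform auth prod` prints `platform-auth-prod`. A valid format does not guarantee the ID is available, since IDs are globally unique.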

📊 Production Insight

Mistaking the project for an organisational boundary leads to sprawling costs and broken IAM.

Many teams run dev, staging, and prod inside a single project. That's wrong: separate projects isolate billing and access.

Rule: one project per environment, one service account per service.

🎯 Key Takeaway

The project is your atomic unit of isolation.

Billing, IAM, and APIs are per-project.

Use separate projects for separate environments.

Project Hierarchy Decisions

If: Single service, no need for isolation → Use: One project is fine

If: Multiple environments (dev/staging/prod) → Use: Create separate projects per environment

If: Multi-team, multi-service → Use: Folders under an organisation node for logical grouping, with each team getting its own project
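For the multi-environment rows, project creation is easy to script. This sketch assumes a folder already exists (for example, one made with `gcloud resource-manager folders create`) and that you pass its numeric ID. The function name and the dev/staging/prod set are illustrative.

```bash
# Create one project per environment under an existing folder.
# Usage: create_env_projects TEAM SERVICE FOLDER_ID
create_env_projects() {
  local team="$1" service="$2" folder_id="$3" env
  for env in dev staging prod; do
    gcloud projects create "${team}-${service}-${env}" \
      --folder="${folder_id}" \
      --name="${service} ${env}" || return 1
  done
}
```

Each project still needs a billing account linked and its APIs enabled, so in practice this loop would sit in front of a per-project setup script.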

GCP Pricing Model — Sustained-Use Discounts, Preemptible VMs, and Per-Second Billing

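You're running a 24/7 n2-standard-4 on GCP. You never bought a Reserved Instance. You never signed a commitment. The list price is ~$0.23 per hour, or about $165 per month. Here's the kicker: once the VM has run for 25% of the month, GCP starts discounting every additional hour, and the discount deepens as the month goes on. That's the sustained-use discount (SUD). Run the full month and the effective rate works out to ~$0.16 per hour, a saving of ~30%. No paperwork. No upfront payment. Just automatic savings. The exact ceiling varies by machine family: N1 reaches about 30%, while newer families such as N2 top out closer to 20%, so treat the figures here as illustrative.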
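The ~30% figure falls out of a tiered schedule rather than a flat cut. The sketch below assumes the N1-style tiers, where each successive quarter of the month is billed at 100%, 80%, 60%, and 40% of the base rate. Those rates and the function name are assumptions for illustration, and other machine families use different tiers.

```bash
# Effective price multiplier for a VM that runs `fraction` of the month
# (0 < fraction <= 1), under assumed N1-style tiers of 100/80/60/40 percent.
sud_multiplier() {
  awk -v f="$1" 'BEGIN {
    split("1.0 0.8 0.6 0.4", rate, " ")
    billed = 0
    for (i = 1; i <= 4; i++) {
      lo = (i - 1) * 0.25          # where this quarter of the month starts
      used = f - lo                # usage that falls into this quarter
      if (used > 0.25) used = 0.25
      if (used > 0) billed += used * rate[i]
    }
    printf "%.2f\n", billed / f
  }'
}

sud_multiplier 1      # full month: prints 0.70
sud_multiplier 0.5    # half month: prints 0.90
```

A full month averages (1.0 + 0.8 + 0.6 + 0.4) / 4 = 0.70 of list price, which is where the 30% saving comes from. A VM that runs only a quarter of the month gets no discount at all.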

Compare that to AWS. To get the same discount on AWS, you buy a Reserved Instance — 1-year or 3-year commitment, paid upfront or monthly. Miss the purchase window? You're stuck at on-demand. GCP's SUD rewards loyalty without trapping you. For variable workloads, SUDs are a no-brainer. For predictable workloads, committed use discounts (CUDs) go further — 1-year gets ~37%, 3-year hits ~55%. CUDs apply at the project level, not per VM. You commit to a minimum spend per hour, and every eligible VM in that project gets the discount.

Now the power move: Preemptible VMs, now called Spot VMs on GCP. They cost up to 91% less than on-demand. But GCP can reclaim them with 30 seconds notice. You can't run your database on Spot VMs. You can run batch processing, Dataflow jobs, CI/CD agents, and ML training workloads. GKE node pools built on Spot VMs recreate preempted nodes automatically when capacity is available. My rule: any job that can survive a power loss should run on Spot VMs.
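Requesting a Spot VM is a matter of two flags on instance creation. The sketch below is illustrative: the function name and machine type are assumptions, and DELETE is chosen as the termination action on the premise that batch workers are disposable.

```bash
# Create a disposable Spot VM for batch work.
# Usage: create_spot_worker NAME ZONE
create_spot_worker() {
  local name="$1" zone="$2"
  gcloud compute instances create "${name}" \
    --zone="${zone}" \
    --machine-type=n2-standard-4 \
    --provisioning-model=SPOT \
    --instance-termination-action=DELETE
}
```

Use `--instance-termination-action=STOP` instead if the job keeps state on the boot disk and you want to resume it later.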

Per-second billing matters more than you think. GCP bills per second after a 1-minute minimum. A 90-second VM is billed for 90 seconds (1.5 minutes), not rounded up to 2 minutes or a full hour. AWS now bills most Linux instances per second as well, but anything still billed hourly rounds up. For short-lived test VMs or autoscaling groups with frequent scale-down, per-second billing can save 20-30% compared with hourly rounding. I've seen teams shave $2,000/month by clearing idle VMs and using per-second billing.
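The rounding difference is easy to see with a little arithmetic. Both helper names are hypothetical; the first models per-second billing with a one-minute minimum, and the second models rounding up to whole hours.

```bash
# Billable seconds under per-second billing with a 60-second minimum.
billable_seconds() {
  local runtime="$1"
  if [ "${runtime}" -lt 60 ]; then echo 60; else echo "${runtime}"; fi
}

# Billable seconds if the same run were rounded up to whole hours.
billable_seconds_hourly() {
  local runtime="$1"
  echo $(( (runtime + 3599) / 3600 * 3600 ))
}

billable_seconds 90          # prints 90
billable_seconds 20          # prints 60 (the one-minute minimum)
billable_seconds_hourly 90   # prints 3600
```

A 90-second CI job is billed for 90 seconds in the first model and a full 3,600 in the second, which is why the gap is largest for fleets of short-lived VMs.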

Here's the concrete comparison for an n2-standard-4 (4 vCPU, 16 GB RAM) in us-central1: on-demand is ~$0.23/hr ($165/mo). With 100% SUD, ~$0.16/hr ($115/mo). With 1-year CUD, ~$0.14/hr ($101/mo). With Spot, ~$0.02/hr ($14/mo). A team running 100 continuous n2-standard-4s saves roughly $77,000/year with 1-year CUDs compared with on-demand pricing ($64 per VM per month), or about $17,000/year compared with the automatic SUD rate.

Don't ignore pricing on Day 1. Retrofitting cost optimisation is a nightmare. Set up billing alerts early. Review CUD recommendations monthly. And for batch — always use Spot VMs.

io/thecodeforge/cloud/gcp/gcp_demo_pricing_comparison.sh · BASH

#!/bin/bash
# Compare on-demand vs SUD vs CUD for a month of continuous n2-standard-4
ON_DEMAND_PER_HR=0.23
SUD_FACTOR=0.7  # ~30% discount at 100% usage
CUD_1Y_FACTOR=0.63  # ~37% discount
SPOT_PER_HR=0.02
HOURS=730  # avg month

# Multiply the arguments together and print the product with two decimals.
# awk keeps full precision until the final rounding, unlike bc with scale=2.
monthly_cost() {
  awk -v args="$*" 'BEGIN { n = split(args, v, " "); p = 1; for (i = 1; i <= n; i++) p *= v[i]; printf "%.2f", p }'
}

echo "=== n2-standard-4 Monthly Cost Estimate ==="
echo "On-demand:     $(monthly_cost "$ON_DEMAND_PER_HR" "$HOURS")"
echo "SUD (~30%):    $(monthly_cost "$ON_DEMAND_PER_HR" "$SUD_FACTOR" "$HOURS")"
echo "CUD 1-year:    $(monthly_cost "$ON_DEMAND_PER_HR" "$CUD_1Y_FACTOR" "$HOURS")"
echo "Spot:          $(monthly_cost "$SPOT_PER_HR" "$HOURS")"

Output

=== n2-standard-4 Monthly Cost Estimate ===

On-demand: 167.90

SUD (~30%): 117.53

CUD 1-year: 105.78

Spot: 14.60

⚠ Commitment Lock-In

CUDs lock you into a minimum hourly spend for 1 or 3 years. If your workload shrinks, you still pay. Start with SUDs and auto-scaling before committing.

📊 Production Insight

Mix Spot VMs with regular VMs in autoscaled node pools

Set up billing alerts at 80% and 100% of budget

Review committed use recommendations monthly

🎯 Key Takeaway

SUDs are automatic, free Reserved Instances

CUDs need a commitment but cut deeper

Spot VMs are for stateless fault-tolerant workloads only

GCP Global Infrastructure — Regions, Zones, Multi-Regions, and the Private Backbone

You deploy a VM in us-central1. Your users are in London. That request traverses the Atlantic on undersea cables owned by someone else. It adds ~150ms. But deploy in europe-west2 (London) and that drops to ~10ms. The right region choice shaves 90% of network latency.

GCP has 40+ regions. Most regions have three zones, and a few (us-central1, for example) have four. Zones are independent failure domains: separate power, cooling, and networking in the same region. A flood in zone-a won't touch zone-b. Deploy your application across at least three zones. Single-zone deployments are gambling. GCP's multi-zone SLA is 99.99%; single zone is lower and frankly dangerous.
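One common way to get multi-zone placement on Compute Engine is a regional managed instance group. The sketch below assumes an instance template already exists. The function name and group size are illustrative, and the zone list is passed in because zone suffixes differ between regions.

```bash
# Create a regional managed instance group spread across the given zones.
# Usage: create_regional_group NAME TEMPLATE REGION ZONES
#   ZONES is a comma-separated list, e.g. us-central1-a,us-central1-b,us-central1-c
create_regional_group() {
  local name="$1" template="$2" region="$3" zones="$4"
  gcloud compute instance-groups managed create "${name}" \
    --template="${template}" \
    --size=3 \
    --region="${region}" \
    --zones="${zones}"
}
```

With three instances and three zones, losing one zone leaves two thirds of capacity serving while the group recreates the missing instance elsewhere.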

Multi-regions go further. GCS, BigQuery, and Spanner offer multi-region configurations (US, EU, ASIA). Data is replicated automatically across two distant regions. Read from one, fail over to the other. For disaster recovery, this is your safety net. But multi-region costs more and adds ~5ms per read. Use it for databases you cannot afford to lose, not for static files.

Edge PoPs — 200+ Points of Presence — sit at the edge of Google's network. Cloud CDN caches content there. TCP terminates there. DDoS scrubbing happens there. Your users never hit your origin server for cached content. That's why GCP's network feels fast even for global audiences.

The private backbone is GCP's secret weapon. Traffic between GCP regions travels on Google's private fibre, not the public internet. A Compute Engine VM in us-east1 talking to a Cloud SQL instance in europe-west2 stays on Google's network. That avoids ISP bottlenecks and public-internet routing incidents such as BGP hijacks. In practice this can give GCP a latency edge for cross-region communication; the 30-40% improvement sometimes quoted depends heavily on the region pair, so measure your own.

Rule of thumb: deploy in the same region as your users. Use multi-region only for DR or global read-replicas. And test across zones from Day 1 — adding zone redundancy later is a painful refactor.

io/thecodeforge/cloud/gcp/gcp_infra_region_check.sh · BASH

#!/bin/bash
# Check available zones in a region
REGION="us-central1"
echo "Zones in $REGION:"
# Zone names start with the region name, so match on that prefix
gcloud compute zones list --filter="name~'^${REGION}-'" --format="value(name)"

Output

Zones in us-central1:

us-central1-a

us-central1-b

us-central1-c

us-central1-f

🔥Private Backbone

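GCP traffic between regions stays on private fibre, and on the Premium network tier user traffic joins Google's network at the nearest edge PoP. AWS and Azure run their own backbones too, so benchmark your actual region pairs before counting on a latency advantage.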

📊 Production Insight

Deploy across 3 zones per region for HA

Use multi-region only for global data or DR

Never rely on public internet for inter-service communication

🎯 Key Takeaway

Zones save your app from datacenter failures

Multi-regions protect against region-level disasters

Private backbone is GCP's superpower

GCP Compute Options: Choosing the Right Engine for Your Workload

GCP gives you five distinct ways to run code, and picking the wrong one is one of the most common — and expensive — mistakes teams make. They're not interchangeable; each is optimised for a specific shape of workload.

Compute Engine (GCE) is raw virtual machines. You control the OS, you manage patching, you configure networking. Use this when you're lifting-and-shifting an existing application that has specific OS dependencies, or when you need GPU access for ML training jobs. It's the most flexible and the most operational overhead.

Google Kubernetes Engine (GKE) is managed Kubernetes. GCP handles the control plane (the bit that schedules your containers) and you manage your node pools and workloads. This is the workhorse for microservices architectures — use it when you have multiple services that need independent scaling, resource isolation, and rolling deployments.

Cloud Run is serverless containers. You push a container image, GCP handles everything else — scaling from zero to thousands of instances, load balancing, HTTPS. No cluster to manage. Use this for stateless APIs and event-driven services where you want zero infrastructure management. It's phenomenally cost-efficient for variable traffic.

App Engine is the oldest PaaS on GCP — opinionated, language-specific runtimes. Mostly superseded by Cloud Run for new projects.

Cloud Functions is function-level serverless for event triggers. Use it for glue code: responding to a file upload, processing a Pub/Sub message, or running a webhook handler. Not suited for long-running or compute-heavy work.
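As a rough aid, the guidance above can be condensed into a lookup. The workload labels are hypothetical shorthand for the cases described in this section, and App Engine is left out because the text treats it as superseded for new projects.

```bash
# Map a workload shape to the compute option this section recommends.
pick_compute() {
  case "$1" in
    lift-and-shift|gpu-training) echo "Compute Engine" ;;
    microservices)               echo "GKE" ;;
    stateless-api)               echo "Cloud Run" ;;
    event-glue)                  echo "Cloud Functions" ;;
    *) echo "unknown workload: $1" >&2; return 1 ;;
  esac
}

pick_compute stateless-api   # prints Cloud Run
```

Real decisions have more inputs (team skills, compliance, existing tooling), so treat this as a first-pass filter.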

Here's the thing: each option has a hidden cost. GCE's sustained-use discounts only kick in after a VM has run for 25% of the month, and they don't apply to preemptible VMs. GKE bills a per-cluster management fee on top of node costs (the free tier offsets it for one zonal or Autopilot cluster), and nodes add up fast: a three-node n1-standard-2 cluster costs about $200/month before any workload. Cloud Run's per-request billing means you pay nothing at idle, but cold starts can hit 3 seconds for JVM apps. Trade-offs everywhere.
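To make the sustained-use arithmetic concrete, here is a minimal sketch of how a tiered discount adds up over a month. The tier multipliers are an assumption based on the published N1 schedule (each quarter of the month billed at 100%, 80%, 60%, then 40% of the on-demand rate); other machine families use different schedules, so treat the numbers as illustrative.

```python
# Assumed incremental billing tiers for N1 machine types.
# Each entry: (upper bound of the usage band, multiplier billed inside that band).
SUD_TIERS = [(0.25, 1.00), (0.50, 0.80), (0.75, 0.60), (1.00, 0.40)]


def billed_fraction(usage_fraction: float) -> float:
    """Cost actually billed, as a fraction of one full month at on-demand price."""
    if not 0.0 <= usage_fraction <= 1.0:
        raise ValueError("usage_fraction must be between 0 and 1")
    billed = 0.0
    lower = 0.0
    for upper, multiplier in SUD_TIERS:
        if usage_fraction <= lower:
            break
        # Only the part of the usage that falls inside this band gets its multiplier
        billed += (min(usage_fraction, upper) - lower) * multiplier
        lower = upper
    return billed


def sustained_use_discount(usage_fraction: float) -> float:
    """Effective discount versus paying the on-demand rate for the same hours."""
    if usage_fraction == 0:
        return 0.0
    return 1.0 - billed_fraction(usage_fraction) / usage_fraction


if __name__ == "__main__":
    for usage in (0.25, 0.50, 0.75, 1.00):
        print(f"{usage:.0%} of month -> {sustained_use_discount(usage):.0%} discount")
```

Under these assumed tiers, a VM that runs for exactly a quarter of the month gets no discount, and one that runs all month ends up at roughly 30% off.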

cloud_run_service.yaml (YAML)

# -----------------------------------------------------------
# CLOUD RUN SERVICE DEFINITION
# Deploys a containerised payments API to Cloud Run.
# Cloud Run auto-scales to zero when idle — you pay nothing
# when your service isn't handling requests.
# Deploy with: gcloud run services replace cloud_run_service.yaml
# -----------------------------------------------------------
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: payments-api
  namespace: "123456789"  # Your GCP Project Number (not Project ID)
  annotations:
    # Ingress controls WHERE requests may come from ("all" = the public internet).
    # Use "internal" or "internal-and-cloud-load-balancing" to restrict access.
    run.googleapis.com/ingress: all
spec:
  template:
    metadata:
      annotations:
        # Scale down to zero instances when there are no requests
        # This is what makes Cloud Run cost-effective for variable traffic
        autoscaling.knative.dev/minScale: "0"
        # Cap at 10 instances to prevent runaway costs during a traffic spike
        autoscaling.knative.dev/maxScale: "10"
        # Run on the second-generation execution environment
        run.googleapis.com/execution-environment: gen2
    spec:
      # Each instance handles max 80 concurrent requests before a new one spins up
      containerConcurrency: 80
      # How long Cloud Run waits for a response before treating it as a timeout
      timeoutSeconds: 30
      # CPU and memory are per-instance limits
      containers:
        - image: gcr.io/payments-service-prod/payments-api:v2.1.0
          ports:
            - containerPort: 8080  # Port the container listens on (8080 is Cloud Run's default)
          resources:
            limits:
              cpu: "1"        # 1 vCPU per instance
              memory: "512Mi" # 512MB RAM — right-size this based on profiling
          env:
            # Never hardcode secrets. Reference Secret Manager instead.
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: latest
                  name: payments-db-password  # Name of secret in Secret Manager
  traffic:
    # 100% of traffic goes to the latest revision
    # You can split traffic here for canary deployments (e.g., 90/10)
    - latestRevision: true
      percent: 100

Output

Deploying...

Setting IAM policy

Done.

Service [payments-api] revision [payments-api-00002-xyz] has been deployed and is serving 100 percent of traffic.

Service URL: https://payments-api-abcdef-uc.a.run.app
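The scaling limits in a service definition like the one above also put a ceiling on throughput. A rough sketch, assuming Little's law (requests in flight = arrival rate × average latency) is a fair approximation for your traffic; the function names and the 200 ms latency figure are illustrative, not Cloud Run APIs or measurements.

```python
def max_in_flight_requests(max_instances: int, concurrency: int) -> int:
    """Upper bound on requests being served at the same moment."""
    return max_instances * concurrency


def max_requests_per_second(
    max_instances: int, concurrency: int, avg_latency_seconds: float
) -> float:
    """Little's law rearranged: throughput = requests in flight / average latency."""
    if avg_latency_seconds <= 0:
        raise ValueError("avg_latency_seconds must be positive")
    return max_in_flight_requests(max_instances, concurrency) / avg_latency_seconds


if __name__ == "__main__":
    # 10 instances x 80 concurrent requests, assuming 200 ms average latency
    print(max_in_flight_requests(10, 80))
    print(max_requests_per_second(10, 80, 0.2))
```

If the ceiling this gives you is below your expected peak, raise the instance cap deliberately rather than discovering the limit during a traffic spike.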

💡Pro Tip: Cloud Run Cold Starts

Setting minScale to 0 means you pay nothing at idle, but the first request after a period of inactivity hits a 'cold start' — typically 1-3 seconds for a JVM app, under 300ms for Go or Node. For latency-sensitive services (payments, auth), set minScale to 1. The cost is roughly $10-15/month for one always-warm instance — cheap insurance against SLA breaches.

📊 Production Insight

Choosing the wrong compute tier is one of the most expensive mistakes teams make.

Cloud Run's scale-to-zero saves money for variable traffic but introduces cold start latency.

Rule: start with Cloud Run for stateless services; only move to GKE or GCE when Cloud Run limits constrain you.

🎯 Key Takeaway

Compute choice is about ops overhead vs control.

Cloud Run is the default for new stateless services.

GKE for multi-container apps needing control; GCE for legacy lift-and-shift or GPUs.

Compute Decision: Which Service to Use

If: Stateless API, unpredictable traffic, no GPU needed → Use: Cloud Run (scale-to-zero, per-request billing)

If: Multi-service microservices architecture, needs control over networking → Use: GKE (autoscaling, rolling updates, service mesh support)

If: Legacy app with specific OS dependencies or GPU/TPU required → Use: Compute Engine (full VM control, GPU support)

Storage on GCP: Matching the Data Shape to the Right Service

Nothing reveals a GCP beginner faster than seeing them store relational data in Cloud Storage or put time-series metrics into Cloud SQL. GCP has six distinct storage services and each one is engineered for a specific data access pattern. Using the wrong one doesn't just waste money — it actively degrades performance.

Cloud Storage (GCS) is object storage — think S3. Binary blobs, static assets, backups, data lake files. Infinitely scalable, globally accessible, extremely cheap. Access pattern: write once, read many, no updates to individual fields.

Cloud SQL is managed relational databases — PostgreSQL, MySQL, or SQL Server. Handles backups, failover, and patching. Use it when you have structured data with relationships and your team already thinks in SQL. Scales vertically (bigger machine) with read replicas for horizontal read scaling.

Cloud Spanner is the exotic one: a globally distributed, horizontally scalable relational database with ACID transactions. It's what powers Google's own financial systems. Use it when Cloud SQL's 64TB storage limit isn't enough or when you need active-active multi-region writes. The price point reflects its power: roughly 20x Cloud SQL.

Firestore is a serverless NoSQL document database, optimised for mobile and web clients with real-time sync built in. Excellent for user profiles, session data, and content that's hierarchical and document-shaped.

Bigtable is a managed wide-column NoSQL store, designed for petabyte-scale time-series, IoT, and financial data with millisecond latency at massive scale. Not a general-purpose database.

Memorystore is managed Redis or Memcached — in-memory caching layer for your hot data.

One more thing: GCS storage classes (Standard, Nearline, Coldline, Archive) let you save 60-90% by picking the right access frequency. Access a Coldline object once? That retrieval costs more than storing it for a month. Pick storage class based on real access patterns, not on what feels right.
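A quick way to sanity-check a storage class choice is to add retrieval charges to the storage price. The sketch below does that with illustrative per-GB prices; they are assumptions for the sake of the arithmetic, not a current price list, and the model ignores minimum storage durations and per-operation charges.

```python
# Illustrative per-GB prices in USD (assumptions, not an official price list).
STORAGE_CLASSES = {
    "standard": {"storage": 0.020, "retrieval": 0.00},
    "nearline": {"storage": 0.010, "retrieval": 0.01},
    "coldline": {"storage": 0.004, "retrieval": 0.02},
    "archive": {"storage": 0.0012, "retrieval": 0.05},
}


def monthly_cost(storage_class: str, size_gb: float, full_reads_per_month: float) -> float:
    """Storage plus retrieval cost for one month.

    full_reads_per_month is how many times the whole data set is read back;
    0.5 means half of it is read once.
    """
    price = STORAGE_CLASSES[storage_class]
    return size_gb * (price["storage"] + price["retrieval"] * full_reads_per_month)


def cheapest_class(size_gb: float, full_reads_per_month: float) -> str:
    """Pick the class with the lowest monthly cost for this access pattern."""
    return min(
        STORAGE_CLASSES,
        key=lambda name: monthly_cost(name, size_gb, full_reads_per_month),
    )


if __name__ == "__main__":
    for reads in (0, 0.5, 2):
        print(f"{reads} full reads/month -> {cheapest_class(100, reads)}")
```

With these assumed prices, reading a Coldline object even once in a month costs more than keeping it in Standard, which is the trap described above.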

gcs_upload_and_signed_url.py (PYTHON)

# -----------------------------------------------------------
# GCS OBJECT UPLOAD + SIGNED URL GENERATION
# Real-world pattern: a user uploads a profile photo.
# We store it privately in GCS, then generate a short-lived
# signed URL so the frontend can display it without making
# the bucket publicly readable (a major security mistake).
#
# Install dependencies: pip install google-cloud-storage
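# Auth: set GOOGLE_APPLICATION_CREDENTIALS env var to your
#       service account key JSON path, or use Workload Identity.
#       Note: generate_signed_url needs credentials that can sign.
#       Keyless credentials have to sign through the IAM signBlob API.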
# -----------------------------------------------------------

import datetime
from pathlib import Path
from google.cloud import storage

GCP_PROJECT_ID = "payments-service-prod"
PRIVATE_BUCKET_NAME = "user-profile-photos-prod"  # This bucket is NOT public
SIGNED_URL_EXPIRY_MINUTES = 15  # Short expiry — limits blast radius if URL leaks


def upload_user_profile_photo(
    user_id: str,
    local_file_path: Path,
    content_type: str = "image/jpeg",
) -> str:
    """
    Uploads a profile photo to GCS and returns a signed URL
    the frontend can use to display it for the next 15 minutes.

    Returns the signed URL string.
    """
    storage_client = storage.Client(project=GCP_PROJECT_ID)
    bucket = storage_client.bucket(PRIVATE_BUCKET_NAME)

    # Build a deterministic object path — makes it easy to find later
    # and naturally organises objects by user without needing folders
    object_name = f"users/{user_id}/profile/avatar.jpg"

    blob = bucket.blob(object_name)

    # Set content type so browsers render it correctly, not download it
    blob.content_type = content_type

    # Upload the file — this overwrites any existing photo for this user
    blob.upload_from_filename(str(local_file_path))
    print(f"Uploaded '{local_file_path}' to gs://{PRIVATE_BUCKET_NAME}/{object_name}")

    # Generate a V4 signed URL — time-limited, cryptographically signed
    # by our service account. The bucket stays private; only holders of
    # this URL can access the object, and only until it expires.
    signed_url = blob.generate_signed_url(
        version="v4",
        expiration=datetime.timedelta(minutes=SIGNED_URL_EXPIRY_MINUTES),
        method="GET",  # Read-only access
    )

    print(f"Signed URL (valid {SIGNED_URL_EXPIRY_MINUTES} mins): {signed_url[:80]}...")
    return signed_url


if __name__ == "__main__":
    # Simulate uploading a photo for user ID 'usr_8821'
    sample_photo_path = Path("/tmp/avatar_upload.jpg")

    # In production this file comes from a multipart form upload
    sample_photo_path.write_bytes(b"<fake-jpeg-bytes-for-demo>")

    url = upload_user_profile_photo(
        user_id="usr_8821",
        local_file_path=sample_photo_path,
    )
    print(f"\nFrontend should use this URL to render the avatar: {url[:60]}...")

Output

Uploaded '/tmp/avatar_upload.jpg' to gs://user-profile-photos-prod/users/usr_8821/profile/avatar.jpg

Signed URL (valid 15 mins): https://storage.googleapis.com/user-profile-photos-prod/users/usr_88...

Frontend should use this URL to render the avatar: https://storage.googleapis.com/user-profile-photos...

⚠ Watch Out: Never Make a Storage Bucket Containing PII Public

GCS lets you grant a role to the special 'allUsers' principal, which makes an entire bucket readable by the whole internet. It's convenient for hosting static assets, but it has caused real data breaches when teams accidentally applied it to buckets containing user data. Use signed URLs as shown above: they give time-limited, auditable access without ever opening the bucket publicly.
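Public bindings are easy to catch automatically. Here is a minimal sketch that scans an IAM policy for public principals. It assumes the policy is a dict in the shape returned by `gcloud storage buckets get-iam-policy --format=json` (a `bindings` list of `role` and `members`); the function name and sample policy are hypothetical.

```python
# Principals that expose a resource beyond your own organisation
PUBLIC_PRINCIPALS = {"allUsers", "allAuthenticatedUsers"}


def find_public_bindings(policy: dict) -> list:
    """Return (role, member) pairs that grant access to public principals."""
    findings = []
    for binding in policy.get("bindings", []):
        for member in binding.get("members", []):
            if member in PUBLIC_PRINCIPALS:
                findings.append((binding["role"], member))
    return findings


if __name__ == "__main__":
    # Hypothetical policy for illustration only
    sample_policy = {
        "bindings": [
            {
                "role": "roles/storage.objectViewer",
                "members": ["allUsers", "user:alice@example.com"],
            },
            {
                "role": "roles/storage.objectCreator",
                "members": ["serviceAccount:payments-api@payments-service-prod.iam.gserviceaccount.com"],
            },
        ]
    }
    for role, member in find_public_bindings(sample_policy):
        print(f"PUBLIC: {member} has {role}")
```

A check like this fits naturally into CI or a scheduled job, so a public binding on a PII bucket is flagged before anyone finds it from the outside.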

📊 Production Insight

Using Cloud Storage for relational data or Cloud SQL for blob data wastes money and performance.

A common trap: storing JSON blobs in Cloud SQL when Firestore or GCS would be cheaper and faster.

Rule: analyse your data access pattern before picking a storage service.

🎯 Key Takeaway

Storage decision is purely about access pattern.

Blobs → GCS, relational → Cloud SQL, global relational → Spanner, documents → Firestore.

Mixing them costs both money and latency.

Storage Decision by Data Pattern

If: Binary blobs, static assets, backups → Use: Cloud Storage (GCS)

If: Structured relational data under 64TB → Use: Cloud SQL (PostgreSQL/MySQL)

If: Global relational with multi-region writes → Use: Cloud Spanner

If: Document-shaped hierarchical data, real-time sync → Use: Firestore

If: Time-series or IoT data at petabyte scale → Use: Bigtable

Data and Analytics on GCP — BigQuery, Pub/Sub, and Dataflow

If GCP has a killer feature, it's BigQuery: a serverless data warehouse that runs SQL against petabytes of data without provisioning a single node. You load data into tables, and BigQuery's columnar storage engine scans only the columns your query touches. Querying one column of a hundred-column table costs roughly 1/100th of a full scan, assuming the columns are similar in size. That's how you pay only for the data you scan and still query terabytes in seconds.

Pricing is simple: on-demand queries are billed per TiB of data scanned. The long-standing list price was $5 per TB; it is $6.25 per TiB in US regions at the time of writing, so check the pricing page before budgeting. For predictable pricing at scale, you reserve slot capacity instead. No cluster management, no vacuum commands, no sort key design. If you've used Redshift, you know the pain of designing distribution keys and analyzing query plans. BigQuery eliminates all that.

Let's see it in action. Load a CSV into BigQuery and run a query:

bq load --source_format=CSV mydataset.orders gs://my-bucket/orders.csv order_id:STRING,customer_id:INT64,amount:FLOAT64,order_date:DATE

Then query:

SELECT customer_id, SUM(amount) AS total_revenue FROM mydataset.orders WHERE order_date >= '2024-01-01' GROUP BY customer_id ORDER BY total_revenue DESC LIMIT 100;

Pub/Sub handles event streaming. Think of it as managed Kafka without the cluster management headaches. Guaranteed at-least-once delivery, push and pull modes, and global message retention. Your microservices publish events to topics. Subscribers pull messages or receive push callbacks. The key difference from Kafka: Pub/Sub auto-scales its throughput without partition management. You don't decide partition counts or replication factors.

Dataflow is the ETL engine — managed Apache Beam. You write a pipeline in Java or Python, and Dataflow executes it across an auto-scaling cluster. The same code works for batch and streaming. You define transforms once, and Dataflow handles windowing, triggers, and exactly-once semantics under the hood.

Production insight: BigQuery pricing punishes exploratory queries on large datasets. Run a SELECT * on a petabyte table and you're looking at a bill of several thousand dollars. Use the table preview instead of SELECT *, and partition and cluster large tables so filters limit the scan size. The best rule: never query raw tables in dashboards. Create aggregated views that reduce columns and pre-filter rows.
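The cost model is simple enough to sketch in a few lines. This is a back-of-the-envelope estimator, not BigQuery's billing logic: the price per TiB is a parameter you supply, the column sizes are made up, and real bills also depend on minimums, caching, and partition pruning.

```python
TIB = 2 ** 40  # bytes in one tebibyte
GIB = 2 ** 30  # bytes in one gibibyte


def bytes_scanned(column_sizes: dict, selected_columns: list) -> int:
    """A columnar engine reads only the columns the query references."""
    return sum(column_sizes[name] for name in selected_columns)


def estimate_query_cost(scanned_bytes: int, price_per_tib: float) -> float:
    """On-demand cost for one query at the given price per TiB scanned."""
    return scanned_bytes / TIB * price_per_tib


if __name__ == "__main__":
    # Hypothetical table: 100 columns of 10 GiB each
    columns = {f"col_{i}": 10 * GIB for i in range(100)}
    one_column = bytes_scanned(columns, ["col_0"])
    all_columns = bytes_scanned(columns, list(columns))
    price = 5.0  # USD per TiB, the figure used in this article; substitute the current rate
    print(f"one column:  ${estimate_query_cost(one_column, price):.4f}")
    print(f"SELECT *:    ${estimate_query_cost(all_columns, price):.4f}")
```

For real queries you don't need to guess the scan size: `bq query --dry_run` reports the bytes a query would process without running it, and that number can be fed into an estimator like this one.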

io/thecodeforge/gcp/pubsub_subscriber.py (PYTHON)

from google.cloud import pubsub_v1
import os

project_id = os.environ["GCP_PROJECT_ID"]
subscription_id = "order-events-sub"

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print(f"Received {message.data}.")
    message.ack()  # acknowledge — prevents redelivery

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
print(f"Listening for messages on {subscription_path}...")

try:
    streaming_pull_future.result()
except KeyboardInterrupt:
    streaming_pull_future.cancel()
    streaming_pull_future.result()

Output

Listening for messages on projects/my-production-project-42/subscriptions/order-events-sub...

Received b'{"order_id":123,"customer_id":456,"amount":99.99}'.
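Because delivery is at-least-once, a callback like the one above can receive the same message twice, so handlers should be idempotent. Here is a minimal sketch of the pattern with no Pub/Sub dependency: `FakeMessage` is a hypothetical stand-in exposing only `message_id`, `data`, and `ack()`, and the in-memory set is an assumption that only works for a single process.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeMessage:
    """Stand-in for a Pub/Sub message (hypothetical, only the fields used here)."""
    message_id: str
    data: bytes
    acked: bool = False

    def ack(self) -> None:
        self.acked = True


@dataclass
class IdempotentHandler:
    """Runs `process` once per message ID and acks every delivery."""
    process: Callable[[bytes], None]
    seen_ids: set = field(default_factory=set)

    def __call__(self, message) -> None:
        if message.message_id not in self.seen_ids:
            self.process(message.data)
            # Record the ID only after processing succeeded, so a failure
            # leaves the message eligible for redelivery.
            self.seen_ids.add(message.message_id)
        message.ack()


if __name__ == "__main__":
    handled = []
    handler = IdempotentHandler(process=handled.append)
    handler(FakeMessage("m-1", b'{"order_id":123}'))
    handler(FakeMessage("m-1", b'{"order_id":123}'))  # redelivery of the same message
    print(f"processed {len(handled)} time(s)")
```

In a multi-instance service the seen-ID set would have to live in a shared store, and deduplicating on a business key such as the order ID is usually more robust than relying on the message ID alone.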

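⚠ BigQuery cost trap

SELECT * scans every column of a table, so on large tables it is the most expensive query you can write. Select only the columns you need and point dashboards at aggregated views.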

📊 Production Insight

BigQuery's columnar storage scans only the columns your query touches.

Pub/Sub auto-scales without partition planning.

The rule: create aggregated views to prevent costly ad-hoc queries.

🎯 Key Takeaway

BigQuery is serverless petabyte SQL.

Pub/Sub replaces Kafka without cluster management.

Dataflow unifies batch and streaming ETL.

GCP IAM and Networking: The Security Layer You Can't Skip

Here's the uncomfortable truth: most cloud security incidents aren't caused by sophisticated attacks. They're caused by over-permissioned service accounts, open firewall rules, and credentials hardcoded into source code. GCP's IAM and VPC model exist specifically to prevent this — but only if you use them intentionally.

IAM (Identity and Access Management) is how you apply the principle of least privilege in GCP: every service account, user, and group should get only the permissions it needs, nothing more. Roles are either predefined (like roles/storage.objectViewer) or custom. The most tempting mistake is roles/editor on a project: it's dangerously broad and you'll see it everywhere in tutorials. Never use it in production.

Workload Identity is the right way for GKE workloads to authenticate to GCP APIs. Instead of downloading a service account key JSON file (a long-lived credential that can be stolen), Workload Identity binds a Kubernetes service account to a GCP service account. The credential is ephemeral and automatically rotated. If you're using key files in a Kubernetes cluster, stop — switch to Workload Identity.

VPC (Virtual Private Cloud) is your private network inside GCP. By default, GCP creates a 'default' VPC with permissive firewall rules. For anything production, create a custom VPC with explicit subnets per region, and firewall rules that deny all ingress by default and allow only what you specify. Use Private Google Access on subnets so VMs can reach GCP APIs without needing a public IP.

gcp_iam_least_privilege_setup.sh (BASH)

#!/bin/bash
# -----------------------------------------------------------
# GCP IAM LEAST-PRIVILEGE SETUP
# Creates a service account for a Cloud Run payments service
# with ONLY the permissions it actually needs:
#   - Read secrets from Secret Manager
#   - Write to a specific Cloud Storage bucket
#   - Publish to a specific Pub/Sub topic
# Nothing else. This limits blast radius if the service is compromised.
# -----------------------------------------------------------

# Stop at the first failing command so a success message never follows a failure
set -euo pipefail

PROJECT_ID="payments-service-prod"
SERVICE_NAME="payments-api"

# Step 1: Create a dedicated service account for this service
# One service account per service — never share service accounts
SERVICE_ACCOUNT_EMAIL="${SERVICE_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

gcloud iam service-accounts create "${SERVICE_NAME}" \
  --project="${PROJECT_ID}" \
  --display-name="Payments API Service Account" \
  --description="Identity for the payments-api Cloud Run service. Least-privilege access only."

echo "Service account created: ${SERVICE_ACCOUNT_EMAIL}"

# Step 2: Grant permission to read secrets from Secret Manager
# This is scoped to the PROJECT level — ideally scope it to individual secrets
# using resource-level IAM for even finer control
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:${SERVICE_ACCOUNT_EMAIL}" \
  --role="roles/secretmanager.secretAccessor"

echo "Secret Manager access granted."

# Step 3: Grant permission to write objects to a specific bucket ONLY
# Note: roles/storage.objectCreator is narrower than roles/storage.objectAdmin
# objectCreator can write new objects but cannot delete or overwrite existing ones
TARGET_BUCKET="gs://payments-receipts-prod"

gcloud storage buckets add-iam-policy-binding "${TARGET_BUCKET}" \
  --member="serviceAccount:${SERVICE_ACCOUNT_EMAIL}" \
  --role="roles/storage.objectCreator"

echo "Storage write access granted to ${TARGET_BUCKET} only."

# Step 4: Grant Pub/Sub publish permission on one specific topic
TARGET_TOPIC="payment-completed-events"

gcloud pubsub topics add-iam-policy-binding "${TARGET_TOPIC}" \
  --project="${PROJECT_ID}" \
  --member="serviceAccount:${SERVICE_ACCOUNT_EMAIL}" \
  --role="roles/pubsub.publisher"

echo "Pub/Sub publish access granted to ${TARGET_TOPIC} topic."

# Step 5: Attach this service account to the Cloud Run service
# The service now authenticates as this SA automatically — no key files needed
gcloud run services update "${SERVICE_NAME}" \
  --project="${PROJECT_ID}" \
  --region="us-central1" \
  --service-account="${SERVICE_ACCOUNT_EMAIL}"

echo "Service account attached to Cloud Run service."
echo "IAM setup complete. This service account has NO other GCP permissions."

Output

Service account created: payments-api@payments-service-prod.iam.gserviceaccount.com

Secret Manager access granted.

Storage write access granted to gs://payments-receipts-prod only.

Pub/Sub publish access granted to payment-completed-events topic.

Service account attached to Cloud Run service.

IAM setup complete. This service account has NO other GCP permissions.

🔥Interview Gold: Why Not Just Use roles/editor?

roles/editor grants write access to almost every GCP resource in the project — including the ability to read secrets, exfiltrate data, and create new compute resources. If a service with this role is compromised, the attacker has near-full control of your GCP project. Interviewers love asking how you'd scope permissions for a specific service. The answer is always: one dedicated service account, one role per permission needed, no editor/owner roles in production.
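You can check for this before an interviewer or an attacker does. Here is a small sketch that flags service accounts holding basic roles in a project policy. It assumes the policy dict has the shape produced by `gcloud projects get-iam-policy --format=json`; the function name and the sample data are hypothetical.

```python
# The three basic roles that predate fine-grained IAM
BASIC_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}


def find_basic_role_service_accounts(policy: dict) -> list:
    """Return (member, role) pairs where a service account holds a basic role."""
    findings = []
    for binding in policy.get("bindings", []):
        if binding.get("role") not in BASIC_ROLES:
            continue
        for member in binding.get("members", []):
            if member.startswith("serviceAccount:"):
                findings.append((member, binding["role"]))
    return findings


if __name__ == "__main__":
    # Hypothetical project policy for illustration only
    sample_policy = {
        "bindings": [
            {
                "role": "roles/editor",
                "members": [
                    "serviceAccount:legacy-batch@payments-service-prod.iam.gserviceaccount.com",
                    "user:admin@example.com",
                ],
            },
            {
                "role": "roles/pubsub.publisher",
                "members": ["serviceAccount:payments-api@payments-service-prod.iam.gserviceaccount.com"],
            },
        ]
    }
    for member, role in find_basic_role_service_accounts(sample_policy):
        print(f"OVER-PERMISSIONED: {member} holds {role}")
```

Each finding is a candidate for replacement with the narrow predefined roles the service actually needs.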

📊 Production Insight

One of the most common security incidents in GCP is an over-permissioned service account.

A single roles/editor binding on a project gives an attacker who compromises that identity near-full control of the project.

Rule: use custom roles or least-privilege predefined roles; never use roles/editor in production.

🎯 Key Takeaway

IAM is not an afterthought — design least-privilege before deploying.

One SA per service, one role per need.

Use Workload Identity for GKE to avoid key files.

IAM Strategy Decisions

If: Service needs to read from Cloud SQL → Use: Grant roles/cloudsql.client on the instance or project

If: Service needs to publish to Pub/Sub → Use: Grant roles/pubsub.publisher on the specific topic

If: Service needs to write to a specific GCS bucket → Use: Grant roles/storage.objectCreator on the bucket

GCP Networking: VPCs, Firewalls, and Connectivity

GCP's networking model is built around Virtual Private Clouds (VPCs). A VPC is a global isolated network that spans all regions. Within it, you define subnets per region, each with a private IP range. By default, GCP creates a 'default' VPC with permissive firewall rules — convenient for prototyping but dangerous for production. Always create a custom VPC for production workloads.

Subnets are regional IP ranges (e.g., 10.0.0.0/20 in us-central1). Firewall Rules are stateful and apply to every VM, even between resources in the same subnet: all ingress is denied unless a rule allows it, and all egress is allowed unless a rule denies it. Only the 'default' network ships with an allow-internal rule, so a custom VPC needs one of its own. Rule order doesn't matter; priority does (the lower number wins). Private Google Access lets VMs without external IPs reach Google APIs via Google's internal network. Cloud NAT is required for VMs with no external IP to make outbound connections to the internet. VPC Peering connects two VPCs so they can communicate using internal IPs, which is common for multi-project setups. Shared VPC centralises network management: a host project shares its VPC with service projects.

Best practice: start with a custom VPC, define subnets for each tier (frontend, backend, data), apply firewall rules that deny all ingress except on specific ports from specific source ranges, and use Private Google Access for all API calls.
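The "priority, not order" rule is easier to reason about with a toy evaluator. The sketch below is a simplified model, not GCP's implementation: it handles only TCP ingress, ignores target tags and service accounts, and assumes that deny beats allow when two matching rules share a priority. The rule names and IP ranges are examples.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    name: str
    priority: int         # 0-65535, lower number is evaluated first
    action: str           # "allow" or "deny"
    source_ranges: tuple  # CIDR strings
    tcp_ports: tuple      # inclusive (low, high) port ranges


def rule_matches(rule: IngressRule, source_ip: str, port: int) -> bool:
    address = ipaddress.ip_address(source_ip)
    in_range = any(address in ipaddress.ip_network(cidr) for cidr in rule.source_ranges)
    on_port = any(low <= port <= high for low, high in rule.tcp_ports)
    return in_range and on_port


def evaluate_ingress(rules: list, source_ip: str, port: int) -> str:
    """Apply the first matching rule by priority; fall back to the implied deny."""
    # Sort by priority; at equal priority, deny rules sort ahead of allow rules
    ordered = sorted(rules, key=lambda r: (r.priority, r.action != "deny"))
    for rule in ordered:
        if rule_matches(rule, source_ip, port):
            return rule.action
    return "deny"  # implied deny-ingress rule at the lowest priority


RULES = [
    IngressRule("allow-health-checks", 1000, "allow",
                ("130.211.0.0/22", "35.191.0.0/16"), ((80, 80), (443, 443))),
    IngressRule("allow-internal", 1000, "allow", ("10.0.0.0/16",), ((0, 65535),)),
]

if __name__ == "__main__":
    print(evaluate_ingress(RULES, "130.211.0.10", 443))  # health checker range
    print(evaluate_ingress(RULES, "203.0.113.7", 22))    # arbitrary internet host
```

Adding a deny rule with a lower priority number in front of `allow-internal` is how you would carve out stricter boundaries between tiers.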

gcp_custom_vpc_setup.sh (BASH)

#!/bin/bash
# -----------------------------------------------------------
# CUSTOM VPC SETUP FOR PRODUCTION
# Creates a custom VPC with three subnets (frontend, backend, data)
# and minimal firewall rules.
# -----------------------------------------------------------

PROJECT_ID="payments-service-prod"
VPC_NAME="payments-prod-vpc"
REGION="us-central1"
FRONTEND_SUBNET="frontend-subnet"
BACKEND_SUBNET="backend-subnet"
DATA_SUBNET="data-subnet"

# Step 1: Create a custom VPC (not auto-mode)
gcloud compute networks create "${VPC_NAME}" \
  --project="${PROJECT_ID}" \
  --subnet-mode=custom

echo "VPC created: ${VPC_NAME}"

# Step 2: Create subnets for each tier
gcloud compute networks subnets create "${FRONTEND_SUBNET}" \
  --project="${PROJECT_ID}" \
  --network="${VPC_NAME}" \
  --region="${REGION}" \
  --range="10.0.1.0/24" \
  --enable-private-ip-google-access

gcloud compute networks subnets create "${BACKEND_SUBNET}" \
  --project="${PROJECT_ID}" \
  --network="${VPC_NAME}" \
  --region="${REGION}" \
  --range="10.0.2.0/24" \
  --enable-private-ip-google-access

gcloud compute networks subnets create "${DATA_SUBNET}" \
  --project="${PROJECT_ID}" \
  --network="${VPC_NAME}" \
  --region="${REGION}" \
  --range="10.0.3.0/24" \
  --enable-private-ip-google-access

echo "Subnets created."

# Step 3: Create firewall rules — deny all ingress by default, then allow needed
# Allow health checks from GCP's health checker IP ranges
gcloud compute firewall-rules create "${VPC_NAME}-allow-health-checks" \
  --project="${PROJECT_ID}" \
  --network="${VPC_NAME}" \
  --direction=ingress \
  --priority=1000 \
  --source-ranges="130.211.0.0/22,35.191.0.0/16" \
  --target-tags="allow-health-checks" \
  --rules="tcp:80,tcp:443"

echo "Firewall rule for health checks created."

# Step 4: Create a firewall rule to allow internal traffic between subnets
gcloud compute firewall-rules create "${VPC_NAME}-allow-internal" \
  --project="${PROJECT_ID}" \
  --network="${VPC_NAME}" \
  --direction=ingress \
  --priority=1000 \
  --source-ranges="10.0.0.0/16" \
  --rules="tcp:0-65535,udp:0-65535,icmp"

echo "Internal traffic rule created."

echo "Custom VPC setup complete."

Output

VPC created: payments-prod-vpc

Subnets created.

Firewall rule for health checks created.

Internal traffic rule created.

Custom VPC setup complete.
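Before running a script like the one above, it can help to validate the subnet plan offline. Here is a minimal sketch using Python's standard `ipaddress` module; the function name is hypothetical, and the parent range 10.0.0.0/16 is taken from the internal firewall rule in the script.

```python
import ipaddress
from itertools import combinations


def validate_subnet_plan(parent_cidr: str, subnets: dict) -> list:
    """Return a list of human-readable problems; an empty list means the plan is clean."""
    parent = ipaddress.ip_network(parent_cidr)
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = []

    # Every subnet must sit inside the range the internal firewall rule covers
    for name, network in networks.items():
        if not network.subnet_of(parent):
            problems.append(f"{name} ({network}) is outside {parent}")

    # No two subnets may share addresses
    for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} ({net_a}) overlaps {name_b} ({net_b})")

    return problems


if __name__ == "__main__":
    plan = {
        "frontend-subnet": "10.0.1.0/24",
        "backend-subnet": "10.0.2.0/24",
        "data-subnet": "10.0.3.0/24",
    }
    issues = validate_subnet_plan("10.0.0.0/16", plan)
    print(issues if issues else "Subnet plan looks consistent.")
```

The same check scales to multi-region plans: pass every subnet you intend to create and fix overlaps before they block a later VPC peering.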

⚠ Don't Use the Default VPC in Production

GCP's default VPC has an ingress rule allowing SSH (tcp:22) and RDP (tcp:3389) from any IP (0.0.0.0/0). If you deploy a VM with an external IP, it's accessible from the internet within minutes. Always create a custom VPC with strict firewall rules.

📊 Production Insight

Using the default VPC in production often leaves SSH and RDP open on all instances.

Attackers scan GCP IP ranges and find exposed instances within hours.

Rule: always create a custom VPC with ingress firewall rules that only allow your specific IP ranges.

🎯 Key Takeaway

Custom VPCs are mandatory for production.

Default VPC is too permissive.

Use Private Google Access to keep VMs off the internet.

VPC Design Decisions

If: Single project, small app → Use: Custom VPC with a single subnet per region is fine

If: Multiple services with different security tiers → Use: Multiple subnets with strict firewall rules between them

If: Multi-project organisation → Use: Shared VPC for centralised networking; use VPC peering for isolated projects

Monitoring and Observability — Cloud Monitoring, Cloud Logging, and Cloud Trace

Your app crashes at 3 AM. You didn't set up alerts. You didn't send logs to a central sink. You're SSHing into VMs, grepping through /var/log/syslog. That's the pain of retrofitting observability. Don't be that engineer.

GCP's four pillars: Cloud Monitoring, Cloud Logging, Cloud Trace, and Cloud Error Reporting. Enable them from Day 1.

Cloud Monitoring (formerly Stackdriver) ingests metrics — CPU, memory, request latency, error rates — from GKE, Cloud Run, Compute Engine, and custom apps via OpenTelemetry. You build dashboards. You set alerting policies. The free tier covers 1,000,000 metric points per month per project. For most teams that's enough to start. The key metric: 99th percentile latency. Track it, alert on it. Average latency hides outliers.

Cloud Logging collects structured logs. Every GCP service emits logs here automatically. Your app should too — structured JSON, not freeform text. Add severity levels (INFO, WARNING, ERROR). Filter with severity>=ERROR. Export logs to BigQuery for long-term analysis. At scale, raw logs cost money — sink rarely needed logs to GCS cold storage.
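Structured logging takes only a few lines with the standard library. The sketch below writes one JSON object per line to stdout with `severity` and `message` fields, which is the shape Cloud Logging recognises when it ingests container output on Cloud Run and GKE. The `JsonFormatter` class and logger name are assumptions for illustration, not part of any Google library.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Formats each record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # Python level names (DEBUG, INFO, WARNING, ERROR, CRITICAL)
            # are all valid Cloud Logging severities
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str = "payments-api") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace handlers so repeated calls don't duplicate output
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("payment accepted for order %s", "ord_123")
    log.error("payment provider timed out after %d ms", 800)
```

With logs in this shape, a filter such as severity>=ERROR works without any parsing rules on the Cloud Logging side.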

Cloud Trace does distributed tracing. A microservice calls another across Cloud Pub/Sub. Cloud Trace shows you each hop's latency. You see the 800ms wait in the database call. Without tracing, you blame the network. Tracing integrates with OpenTelemetry — instrument your app once, get traces everywhere.

Cloud Error Reporting groups exceptions by stack trace. It shows first/last occurrence and affected users. No more sifting through log dumps to find 'NullPointerException' across 100 services. It's free with Cloud Logging.

Production setup: enable all four. Your launch checklist must include a Cloud Monitoring dashboard with CPU, memory, request latency, and alert for 99th percentile > 500ms. Anything else is negligence.
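To see why the alert belongs on the 99th percentile rather than the average, here is a small worked example. It uses the nearest-rank percentile method on made-up latencies; Cloud Monitoring computes percentiles from distribution metrics, so this is only an illustration of the arithmetic.

```python
import math
import statistics


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical sample: 98 fast requests and 2 slow ones, in milliseconds
    latencies_ms = [100] * 98 + [2000] * 2
    mean_ms = statistics.mean(latencies_ms)
    p99_ms = percentile(latencies_ms, 99)
    threshold_ms = 500
    print(f"mean={mean_ms} ms, breaches threshold: {mean_ms > threshold_ms}")
    print(f"p99={p99_ms} ms, breaches threshold: {p99_ms > threshold_ms}")
```

In this sample the mean sits comfortably under the 500 ms threshold while the p99 is four times over it, which is exactly the situation an average-based alert would miss.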

io/thecodeforge/cloud/gcp/gcp_monitoring_logging_setup.sh (BASH)

#!/bin/bash
# Quick log triage
echo "Last 50 ERROR logs:"
gcloud logging read 'severity>=ERROR' --limit 50

echo
echo "List monitoring dashboards:"
gcloud monitoring dashboards list

Output

Last 50 ERROR logs:

--- filtered logs ---

List monitoring dashboards:

--- dashboard list ---

Mental Model

Observability vs Monitoring

Monitoring tells you the system is down. Observability tells you why. Both are necessary. Build observability first, monitoring second.

📊 Production Insight

Enable Cloud Logging + Monitoring + Trace + Error Reporting from Day 1

Structure logs as JSON with severity levels

Set alert on 99th percentile latency > 500ms

🎯 Key Takeaway

Retrofitting observability is 10x harder

Use OpenTelemetry for custom metrics

Cloud Error Reporting groups exceptions for free

Additional GCP Services — Cloud CDN, Cloud DNS, KMS, Cloud Armor, and Deployment Manager

You've built a great app. Now make it fast, secure, and reliable. These five services fill gaps your core compute and storage won't address.

Cloud CDN caches static and dynamic content at 200+ edge PoPs. Enable it with a single checkbox on your HTTP(S) load balancer. Cache hit rates of 80-95% are typical for static assets. That means 80% fewer requests hitting your backend. Less load, lower egress costs, faster response times for users on the other side of the planet. For dynamic content, enable cache keys based on query parameters. You'll be surprised what's cacheable.

Cloud DNS is managed authoritative DNS with a 100% uptime SLA. Create public zones for your domain, private zones for internal service discovery. Migration from Route53 or Cloudflare is straightforward: create the zone with gcloud dns managed-zones create, then import your existing records with gcloud dns record-sets import. The 100% SLA means DNS should never be the reason your domain goes down. That's worth it.

Cloud KMS manages encryption keys. Create a key ring and a key, then encrypt and decrypt with gcloud kms encrypt and gcloud kms decrypt. GCS, BigQuery, GKE, and Cloud SQL support Customer-Managed Encryption Keys (CMEK). Google hosts the key in Cloud KMS, but you control its rotation and who can use it via IAM. Compliance regimes such as HIPAA and PCI DSS often call for this level of key control. Never store raw secrets in source code; use Secret Manager for that.

Cloud Armor is a WAF and DDoS protection layer at your load balancer. Pre-built rule sets for OWASP Top 10 vulnerabilities (SQL injection, XSS). Rate limiting per IP. Geo-based access control. You can block entire countries with a single policy. For production applications, this is your first line of defence.

Cloud Deployment Manager is GCP's infrastructure-as-code tool — YAML/Jinja2 templates. Similar to AWS CloudFormation. But here's the honest take: use Terraform instead. Terraform is multi-cloud, has a huge community, and doesn't lock you into GCP's ecosystem. Deployment Manager works, but it's not worth learning for a single cloud.

Enable CDN and Cloud Armor on your load balancer from Day 1. Use Cloud DNS for all zones. Use KMS for encryption keys and Secret Manager for secrets. Skip Deployment Manager and use Terraform.

io/thecodeforge/cloud/gcp/gcp_additional_services_commands.sh (BASH)

#!/bin/bash
# Enable Cloud CDN on a backend service
gcloud compute backend-services update my-backend-service --enable-cdn

# Create a DNS managed zone
gcloud dns managed-zones create my-zone --dns-name="example.com." \
  --description="Public zone for example.com"

# Encrypt with KMS
echo -n "my-secret" | gcloud kms encrypt \
  --location=global --keyring=my-keyring --key=my-key \
  --plaintext-file=- --ciphertext-file=-

Output

--- commands executed ---

💡Terraform over Deployment Manager

Learn Terraform, not Deployment Manager. Terraform is multi-cloud, community-driven, and portable. Deployment Manager locks you into GCP's DSL. Your next job might not be GCP-only.

📊 Production Insight

Enable Cloud CDN and Cloud Armor on every public load balancer

Use Cloud DNS for all domains, with 100% uptime SLA

Encrypt sensitive data with Cloud KMS CMEK from the start

🎯 Key Takeaway

CDN + Armor on all public LBs

KMS for compliance, Secret Manager for secrets

Deployment Manager exists but Terraform is better

GCP Learning Path and Certifications — From Zero to Professional

You want a GCP job. Certifications help, but real projects matter more. Here's the path I'd take — and I've built and reviewed dozens of GCP systems.

Start with the free tier. 90 days of $300 credit. No credit card required for Cloud Shell. Google Cloud Skills Boost has free qwiklabs that walk you through IAM, Compute Engine, BigQuery. Spend 40 hours there before touching a paid resource. You'll learn by doing, not by reading.

Your first certification: Associate Cloud Engineer (ACE). Cost is $125. It covers deployment, monitoring, and basic architecture. Most teams expect this within 6 months of GCP experience. Study with the official exam guide and the Coursera Google Cloud courses. Skip the $200+ bootcamps; self-study works if you do the labs.

Professional Cloud Architect is the next step. It's the most recognised GCP cert. It covers solution design, migrations, and security patterns. I've interviewed candidates with this cert: some could design a global app on the fly, others couldn't name the zones in us-central1. The cert tests knowledge, not experience. Pair it with hands-on work on real projects.

Professional Data Engineer is for BigQuery, Dataflow, and ML pipelines. If your work is data-heavy, this is the one. Professional Cloud DevOps Engineer covers CI/CD, SRE practices, and Cloud Build. Each professional exam is $200. Many employers reimburse on pass, so check your benefits before paying out of pocket.

Honestly? Certifications open doors but they don't build them. I've seen uncertified engineers build production systems that scale to millions of users. I've seen certified engineers who can't debug a simple GKE pod crash. Build real projects: deploy a three-tier app with Cloud Run, GCS, Cloud SQL, and Cloud Armor. Then put it on your resume. That's worth more than any badge.

Start today. Cloud Shell is free. Build something real. Certifications follow.

io/thecodeforge/cloud/gcp/gcp_learning_path_script.shBASH

#!/bin/bash
echo "=== GCP Learning Path ==="
echo "Step 1: Free Tier + Cloud Shell (3 months)"
echo "Step 2: Associate Cloud Engineer (ACE) cert"
echo "Step 3: Build a real project (3-tier app)"
echo "Step 4: Professional Cloud Architect (Architect)"
echo ""
echo "Practice counts more than paper. Build. Then certify."

Output

=== GCP Learning Path ===

Step 1: Free Tier + Cloud Shell (3 months)

Step 2: Associate Cloud Engineer (ACE) cert

Step 3: Build a real project (3-tier app)

Step 4: Professional Cloud Architect (Architect)

Practice counts more than paper. Build. Then certify.

⚠ Cert vs Experience

Certifications don't replace hands-on debugging. A production outage won't ask for your badge number. Build real projects before you pay for the exam.

📊 Production Insight

Start with free tier and Cloud Shell for 90 days

Certify ACE after 6 months hands-on

Build a real 3-tier app before applying for roles

🎯 Key Takeaway

Free tier is your sandbox — use it aggressively

ACE opens doors, Architect opens senior roles

Real projects beat certifications every time

Why Learn GCP? Because 'Cloud Agnostic' Is a Lie You Tell Your Manager

Every cloud platform has a personality. AWS is a thousand services you'll never touch. Azure is enterprise lock-in with a PowerPoint theme. GCP is the engineer's cloud — built by people who wrote the papers on distributed systems.

Google runs the world's largest networks. Their internal tooling — Borg, Colossus, Dremel — directly shaped Compute Engine, Cloud Storage, and BigQuery. You don't learn GCP for the console. You learn it for the APIs, the gcloud CLI, and the fact that a single bq command can query terabytes in seconds.

The real reason? Kubernetes was born here. Anthos, Cloud Run, and Spanner are production-hardened at Google scale. If you want to build systems that survive planet-wide traffic, stop fiddling with EC2 and learn the platform that runs YouTube and Search. Your resume will thank you when the next startup asks for 'GCP experience for their data pipeline.'

WhyGCPMatters.ymlYAML

# io.thecodeforge - devops tutorial

# This is why you learn GCP: one command, petabyte-scale
gcloud:
  service: bigquery
  query: "SELECT COUNT(*) FROM `bigquery-public-data.github_repos.commits`"
result: "3.2 billion rows in 8 seconds"
# Try that on your Postgres instance. I'll wait.

Output

Row Count: 3,221,847,522

Elapsed Time: 8.2 sec

Slots Utilized: 100%

💡Senior Shortcut:

If you're job-hopping, focus on BigQuery, Cloud Run, and IAM. In my experience those three come up in most real-world interviews. Skip the cert if you can explain Spanner's TrueTime API.

🎯 Key Takeaway

GCP's flagship services descend from systems Google built for itself: Borg, Colossus, Dremel. Learn it for the architecture, not the certification.

Prerequisites Before Learning GCP: You Can't Build on Sand

I've seen juniors treat GCP like a magic box. They type gcloud compute instances create and wonder why their VM gets pwned in 30 minutes. Don't be that person.

Before you touch the Google Cloud Console, you need three things locked down. First: Linux. Not GUI Linux: you need to SSH in, grep logs, and write a bash one-liner without panicking. The gcloud CLI and Cloud Shell assume you are comfortable in a Unix shell. If you can't chmod a key file, go back and learn it.

Second: networking basics. What's a CIDR block? How does DNS resolution work? GCP's VPC is software-defined and you will misconfigure a firewall rule that exposes your database. Understand ports, subnets, and NAT before you create a single resource.
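If CIDR notation still feels abstract, a few lines of Python make it concrete. This is a minimal sketch using only the standard-library ipaddress module; the function names are illustrative and the address ranges are examples, not values from any real VPC.

```python
import ipaddress


def describe_cidr(cidr: str) -> dict:
    """Summarise a CIDR block: first address, last address, and size."""
    network = ipaddress.ip_network(cidr, strict=True)
    return {
        "network": str(network.network_address),
        "broadcast": str(network.broadcast_address),
        "total_addresses": network.num_addresses,
        "prefix_length": network.prefixlen,
    }


def subnets_overlap(cidr_a: str, cidr_b: str) -> bool:
    """True when two ranges share at least one address (a common peering blocker)."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


def contains(cidr: str, address: str) -> bool:
    """True when a single IP address falls inside the given range."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


if __name__ == "__main__":
    print(describe_cidr("192.168.1.0/24"))
    print(subnets_overlap("10.0.0.0/16", "10.0.4.0/24"))
    # 0.0.0.0/0 matches every IPv4 address, which is why it is dangerous
    # as a firewall source range.
    print(contains("0.0.0.0/0", "203.0.113.7"))
```

The last call is the one to remember: a source range of 0.0.0.0/0 contains every IPv4 address on the internet.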

Third: cloud computing fundamentals. Virtualization, load balancing, and stateless vs. stateful services. GCP abstracts the hardware, but you still need to know why n2-standard-8 costs more than e2-micro. Skip this prep and you'll burn money on orphaned disks and idle instances.

PrerequisitesCheck.ymlYAML

# io.thecodeforge - devops tutorial

# Minimum prep checklist before you create a project
prerequisites:
  linux:
    - ssh key generation
    - grep, awk, sed
    - systemd service management
  networking:
    - CIDR notation (192.168.1.0/24)
    - TCP/UDP basics
    - DNS A and CNAME records
  cloud_fundamentals:
    - IaC concept (Terraform or Deployment Manager)
    - stateless vs stateful applications
    - cost awareness: e2 vs n2 machine series

Output

If you can't check all three boxes, schedule 2 weeks of Linux/Networking cramming before your first `gcloud init`.

⚠ Production Trap:

Many GCP tutorials skip networking. First time you open port 3306 to 0.0.0.0/0, your Cloud SQL dump is on Pastebin. Lock down your VPC before you launch anything.
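One way to catch that mistake before an attacker does is to scan your firewall rules for sensitive ports open to the world. The sketch below assumes rule dictionaries shaped like the output of gcloud compute firewall-rules list --format=json (fields name, direction, sourceRanges, allowed). The list of sensitive ports and the function names are illustrative choices, not an official baseline.

```python
SENSITIVE_PORTS = {22, 3306, 3389, 5432}  # illustrative: SSH, MySQL, RDP, PostgreSQL
OPEN_RANGES = {"0.0.0.0/0", "::/0"}       # "anyone on the internet"


def _port_in_spec(port: int, spec: str) -> bool:
    """True if port matches a spec such as '3306' or a range like '8000-9000'."""
    if "-" in spec:
        low, high = spec.split("-", 1)
        return int(low) <= port <= int(high)
    return int(spec) == port


def exposed_ports(rule: dict) -> set:
    """Sensitive TCP ports that this rule opens to the whole internet."""
    if rule.get("direction", "INGRESS") != "INGRESS" or rule.get("disabled", False):
        return set()
    if not OPEN_RANGES & set(rule.get("sourceRanges", [])):
        return set()
    exposed = set()
    for allowed in rule.get("allowed", []):
        if allowed.get("IPProtocol") not in ("tcp", "all"):
            continue
        specs = allowed.get("ports")
        for port in SENSITIVE_PORTS:
            # A missing "ports" key means every port of that protocol is allowed.
            if specs is None or any(_port_in_spec(port, s) for s in specs):
                exposed.add(port)
    return exposed


def find_risky_rules(rules: list) -> dict:
    """Map rule name -> sorted list of sensitive ports it exposes."""
    findings = {}
    for rule in rules:
        ports = exposed_ports(rule)
        if ports:
            findings[rule["name"]] = sorted(ports)
    return findings


if __name__ == "__main__":
    sample = [
        {"name": "allow-mysql-world", "direction": "INGRESS",
         "sourceRanges": ["0.0.0.0/0"],
         "allowed": [{"IPProtocol": "tcp", "ports": ["3306"]}]},
        {"name": "allow-mysql-office", "direction": "INGRESS",
         "sourceRanges": ["203.0.113.0/24"],
         "allowed": [{"IPProtocol": "tcp", "ports": ["3306"]}]},
    ]
    print(find_risky_rules(sample))
```

Feed it the JSON from your own project and treat any non-empty result as something to fix today, not next sprint.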

🎯 Key Takeaway

GCP won't save you from bad fundamentals. Master Linux, networking, and cloud basics first, or you're just paying Google to watch your mistakes scale.

GCP Career Opportunities — Why Bother Learning This Stuff

You want to know why you should spend weekends grinding GCP certs instead of playing golf? Money. Not just salary — leverage. Companies that dropped AWS for GCP did it for BigQuery, Kubernetes-native managed services, and per-second billing that actually saves real cash. That means they need engineers who understand GCP's quirks, not just cloud in theory.

Every bank, retail giant, and gaming studio running on GCP has a skeleton crew of people who actually know how to stitch together Cloud Spanner with Dataflow without burning budget. Those people are indispensable. The market for GCP specialists is less crowded than AWS because the barrier to entry is higher — you actually have to understand the "why" behind the architecture, not just click buttons in a console.

Certifications matter, but proof of work matters more. If you can show you kept a production system alive, handled a billing spike from a misconfigured preemptible VM, or migrated a petabyte-scale data pipeline from on-prem to BigQuery, you write your own ticket. The money follows the pain you can solve.

gcp-skill-gap.ymlYAML

# io.thecodeforge - devops tutorial

# The roles that actually pay (salary figures are rough, illustrative estimates)
roles:
  - title: "Cloud Architect"
    avg_salary: $180k
    pain: "Knowing which GCP service won't bankrupt you"
  - title: "DevOps Engineer"
    avg_salary: $150k
    pain: "GKE cluster upgrades without killing prod"
  - title: "Data Engineer"
    avg_salary: $160k
    pain: "BigQuery slot management"

# Bottom line: AWS has more jobs. GCP roles tend to pay a premium.

Output

No direct output — this is a reference config.

🔥Senior Shortcut:

When job hunting, filter by companies using GCP's Anthos or BigQuery. They have budget, they have complex problems, and they pay for expertise.

🎯 Key Takeaway

GCP career value is inverse to market saturation — fewer specialists, higher premiums.

GCP Step-01: Introduction — The Only First Step That Matters

Step-01 isn't "what is a cloud." It's setting up a billing alert before you touch anything else. Google Cloud charges by the second and that sounds nice until your ML experiment spins up 5000 GPU instances and you're homeless. The intro step every junior skips? Creating a budget alert and disabling the APIs you don't need. Remember that a budget only notifies you; it does not cap spend.

You want to learn GCP? Start by clicking nothing. Read the IAM roles. Understand that granting allUsers a role on a bucket means anyone on the internet can read your data without logging in. Then, and only then, type your first gcloud command.

Your project structure should be clean from day one. One project for learning, one for experiments, never mixing production credentials with the free tier. The first step is not about running a VM — it's about not getting fired before you build anything. Get the billing guardrails up, disable the services you don't need, and lock down your default service account. Then you can play.
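To see why the budget comes first, run the arithmetic. The sketch below is a rough burn-rate calculator: the $2.00 hourly rate is a made-up placeholder, not a real GPU price, and the function names are hypothetical. It assumes per-second billing with a one-minute minimum per VM.

```python
MINIMUM_BILLED_SECONDS = 60  # assumption: per-second billing after the first minute


def projected_cost(instances: int, hourly_rate_usd: float, seconds: int) -> float:
    """Cost of running `instances` identical VMs for `seconds` each."""
    billed = max(seconds, MINIMUM_BILLED_SECONDS)
    return instances * hourly_rate_usd * billed / 3600


def seconds_until_budget_exhausted(budget_usd: float, instances: int,
                                   hourly_rate_usd: float) -> float:
    """How long a fleet can run before it has spent the whole budget."""
    per_second = instances * hourly_rate_usd / 3600
    if per_second <= 0:
        raise ValueError("fleet must have a positive cost per second")
    return budget_usd / per_second


def crossed_thresholds(spend_usd: float, budget_usd: float,
                       thresholds=(0.5, 0.9, 1.0)) -> list:
    """Which alert thresholds (as fractions of budget) the current spend has passed."""
    return [t for t in thresholds if spend_usd >= budget_usd * t]


if __name__ == "__main__":
    hypothetical_rate = 2.00  # USD per instance-hour, placeholder only
    one_hour = projected_cost(5000, hypothetical_rate, 3600)
    print(f"5000 instances for one hour: ${one_hour:,.2f}")
    lasts = seconds_until_budget_exhausted(100, 5000, hypothetical_rate)
    print(f"A $100 budget lasts about {lasts:.0f} seconds")
    print("Thresholds crossed at $95 of $100:", crossed_thresholds(95, 100))
```

Plug in real prices from the pricing page for your machine type before trusting any number it prints.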

step-01-introduction.ymlYAML

# io.thecodeforge - devops tutorial

# Do this before anything
steps:
  - action: "Create billing budget"
    command: "gcloud billing budgets create --billing-account=XXX --display-name='Stop-bleeding-cash' --budget-amount=100USD --threshold-rule=percent=0.5 --threshold-rule=percent=0.9"
  - action: "Lock default service account"
    command: "gcloud projects remove-iam-policy-binding PROJECT_ID --member='serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com' --role='roles/editor'"
  - action: "List enabled APIs"
    command: "gcloud services list --enabled"
  - action: "Disable APIs you don't need"
    command: "gcloud services disable SERVICE_NAME"

# If you skip this, your first GCP bill will be a horror story.

Output

Budget created. roles/editor removed from the default service account. Unneeded APIs disabled.

⚠ Production Trap:

Never use the root billing account project for real workloads. Create a separate project for experimentation — and delete it when you're done. GCP doesn't forgive forgotten resources.

🎯 Key Takeaway

Your first GCP step is not learning services — it's preventing financial ruin.

● Production incident · POST-MORTEM · severity: high

Data Exposure via Public Bucket

Symptom

A security scanner flagged the bucket as publicly accessible. Later analysis showed automated scrapers had downloaded the data.

Assumption

The team believed that 'allUsers' only applied to authenticated Google users, not the entire internet.

Root cause

The IAM binding roles/storage.objectViewer for allUsers made all objects readable without any authentication.

Fix

Immediately removed the allUsers binding using gcloud storage buckets remove-iam-policy-binding gs://BUCKET_NAME --member=allUsers --role=roles/storage.objectViewer. Then rotated any exposed secrets and rotated the bucket's default KMS key. Migrated to signed URLs for temporary access.

Key lesson

Never grant allUsers access to any bucket that contains sensitive data. Use signed URLs for time-limited access.
Audit bucket IAM bindings regularly with Cloud Asset Inventory.
Enable Data Access audit logs so unexpected reads are detectable. Object Versioning and retention policies protect against accidental deletion or overwrite, not against exposure.
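The audit itself is easy to script. Here is a minimal sketch that flags public principals in a bucket IAM policy, assuming the JSON shape returned by gcloud storage buckets get-iam-policy gs://BUCKET_NAME --format=json (a top-level bindings list with role and members). The function names are illustrative, and the code only inspects a dictionary; it does not call any Google API.

```python
# allUsers = anyone on the internet; allAuthenticatedUsers = anyone with a Google account.
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}


def public_bindings(policy: dict) -> list:
    """Return sorted (role, member) pairs that expose the resource publicly."""
    findings = []
    for binding in policy.get("bindings", []):
        for member in binding.get("members", []):
            if member in PUBLIC_MEMBERS:
                findings.append((binding["role"], member))
    return sorted(findings)


def strip_public_members(policy: dict) -> dict:
    """Return a copy of the policy with public members removed.

    Bindings left with no members are dropped. The input is not modified.
    """
    cleaned = []
    for binding in policy.get("bindings", []):
        members = [m for m in binding.get("members", []) if m not in PUBLIC_MEMBERS]
        if members:
            cleaned.append({**binding, "members": members})
    return {**policy, "bindings": cleaned}


if __name__ == "__main__":
    sample_policy = {
        "bindings": [
            {"role": "roles/storage.objectViewer",
             "members": ["allUsers", "user:alice@example.com"]},
            {"role": "roles/storage.admin",
             "members": ["group:platform@example.com"]},
        ]
    }
    for role, member in public_bindings(sample_policy):
        print(f"PUBLIC: {member} has {role}")
    print(strip_public_members(sample_policy))
```

Note that allAuthenticatedUsers is flagged too: it is exactly the scope the team in this incident thought they were granting, and it still means any Google account on the planet.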

Production debug guide

Symptom → Action guide for the most common GCP issues (5 entries)

Symptom · 01

gcloud command fails with 'Permission denied'

→

Fix

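If the failure is in the gcloud CLI itself, run gcloud auth login and confirm the active account with gcloud config list account. If it is in application code, run gcloud auth application-default login or set GOOGLE_APPLICATION_CREDENTIALS. Then verify the account has the required role with gcloud projects get-iam-policy PROJECT_ID.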

Symptom · 02

Can't reach a GCE instance via external IP

→

Fix

Check firewall rules: gcloud compute firewall-rules list --filter=network=default. Ensure an ingress rule allows traffic on the required port from your IP.

Symptom · 03

Cloud Run service returns 403

→

Fix

Verify that the caller (the user or service account sending the request) has roles/run.invoker on the service. Inspect the bindings with gcloud run services get-iam-policy SERVICE_NAME --region=REGION.

Symptom · 04

GKE pod cannot connect to Cloud SQL

→

Fix

Check the Pod's service account has roles/cloudsql.client. Use Workload Identity mapping. Verify VPC peering or Private Services Access is configured if using Private IP.

Symptom · 05

Billing is unexpectedly high

→

Fix

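List running VMs and their labels with gcloud compute instances list to find forgotten instances. Confirm which billing account is linked with gcloud billing accounts list and check its budget alerts. Then break costs down by service and SKU in the Cost table report under Billing in the Cloud console.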

★ GCP CLI Debug Cheat Sheet

Quick commands to identify and fix common GCP issues in under 60 seconds.

Authentication failure

Immediate action

Re-authenticate with `gcloud auth login`

Commands

gcloud auth login

gcloud config list account

Fix now

Also check GOOGLE_APPLICATION_CREDENTIALS env var is set correctly.

Project not found

Compute Engine not starting

GKE cluster unreachable

Storage bucket permission issues

Compute Options Comparison

Dimension	Compute Engine (GCE)	Google Kubernetes Engine (GKE)	Cloud Run
Abstraction Level	Raw VMs (IaaS)	Managed Kubernetes (CaaS)	Serverless containers (PaaS)
Ops Overhead	High — you manage OS, patching, scaling	Medium — GCP manages control plane, you manage node pools	Low — GCP manages everything except your container
Scaling Behaviour	Manual or MIG autoscaling (minutes)	Pod autoscaling via HPA (seconds)	Scale-to-zero and fast scale-out (subject to cold starts)
Billing Unit	Per-second VM uptime	Per-second node uptime	Per-request CPU and memory (free at idle)
Best For	Legacy apps, GPU workloads, custom OS configs	Microservices, multi-container apps, stateful workloads	Stateless APIs, event-driven functions, variable traffic
Cold Start	None (always running)	None (always running)	Yes — 300ms to 3s depending on runtime
Max Request Timeout	N/A — not request-oriented	N/A — not request-oriented	3600 seconds (1 hour)
Minimum Cost	~$5-10/month for f1-micro	~$70/month for smallest cluster	$0/month at zero traffic
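If you prefer the table as a decision procedure, it can be sketched as a few ordered checks. This is a rough encoding of the "Best For" and "Max Request Timeout" rows only; the Workload fields and the suggest_compute name are hypothetical, and real decisions involve cost, team skills, and compliance that no ten-line function captures.

```python
from dataclasses import dataclass

CLOUD_RUN_MAX_REQUEST_SECONDS = 3600  # from the comparison table above


@dataclass
class Workload:
    stateless: bool
    needs_gpu: bool = False
    needs_custom_os: bool = False
    multi_container: bool = False
    max_request_seconds: int = 60


def suggest_compute(workload: Workload) -> str:
    """Suggest a starting point; the checks run from most to least constrained."""
    # GPU workloads and custom OS configs sit in the Compute Engine column.
    if workload.needs_gpu or workload.needs_custom_os:
        return "Compute Engine"
    # Stateful or multi-container apps sit in the GKE column.
    if not workload.stateless or workload.multi_container:
        return "GKE"
    # Requests longer than the Cloud Run limit need something that stays up.
    if workload.max_request_seconds > CLOUD_RUN_MAX_REQUEST_SECONDS:
        return "GKE"
    # Stateless, single-container, short requests: the serverless default.
    return "Cloud Run"


if __name__ == "__main__":
    print(suggest_compute(Workload(stateless=True)))
    print(suggest_compute(Workload(stateless=False)))
    print(suggest_compute(Workload(stateless=True, needs_gpu=True)))
```

The ordering is the design choice: hard constraints first, and Cloud Run is what remains when nothing forces you elsewhere.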

⚙ Quick Reference

16 commands from this guide

File	Command / Code	Purpose
iothecodeforgegcpsetup.sh	brew install --cask google-cloud-sdk	Getting Started
gcp_project_setup.sh	PROJECT_ID="payments-service-prod" # Must be globally unique across all G...	GCP's Mental Model
iothecodeforgecloudgcpgcp_demo_pricing_comparison.sh	ON_DEMAND_PER_HR=0.23	GCP Pricing Model
iothecodeforgecloudgcpgcp_infra_region_check.sh	REGION="us-central1"	GCP Global Infrastructure
cloud_run_service.yaml	apiVersion: serving.knative.dev/v1	GCP Compute Options
gcs_upload_and_signed_url.py	from pathlib import Path	Storage on GCP
iothecodeforgegcppubsub_subscriber.py	from google.cloud import pubsub_v1	Data and Analytics on GCP
gcp_iam_least_privilege_setup.sh	PROJECT_ID="payments-service-prod"	GCP IAM and Networking
gcp_custom_vpc_setup.sh	PROJECT_ID="payments-service-prod"	GCP Networking
iothecodeforgecloudgcpgcp_monitoring_logging_setup.sh	echo "Last 50 ERROR logs:"	Monitoring and Observability
iothecodeforgecloudgcpgcp_additional_services_commands.sh	gcloud compute backend-services update my-backend-service --enable-cdn	Additional GCP Services
io/thecodeforge/cloud/gcp/gcp_learning_path_script.sh	echo "=== GCP Learning Path ==="	GCP Learning Path and Certifications
WhyGCPMatters.yml	gcloud:	Why Learn GCP?
PrerequisitesCheck.yml	prerequisites:	Prerequisites Before Learning GCP
gcp-skill-gap.yml	roles:	GCP Career Opportunities
step-01-introduction.yml	steps:	GCP Step-01: Introduction

Key takeaways

GCP's Project is the atomic unit of isolation: billing, IAM, and APIs are all scoped per project. Use separate projects for dev/staging/prod rather than sharing one project across environments, and group them with folders if you need hierarchy.

The compute decision (GCE vs GKE vs Cloud Run) is really a decision about how much operational ownership you want: more control always means more operational overhead. Cloud Run is the default choice for new stateless services unless you have a specific reason not to use it.

The right storage service is determined by your data access pattern: Cloud Storage for blobs, Cloud SQL for relational data up to 64 TB per instance, Spanner for globally distributed relational, Firestore for document-shaped hierarchical data. Mixing these up costs money and performance.

IAM is not an afterthought: set up least-privilege service accounts before you deploy your first service. The cost of retrofitting permissions on a live system is far higher than getting it right during initial setup.

Custom VPCs are mandatory for production. The default VPC is too permissive. Give VMs no external IP and use Private Google Access so they can still reach Google APIs.

Common mistakes to avoid

5 patterns

Enabling allUsers IAM on a GCS bucket containing user data

Symptom

All objects in the bucket are publicly readable on the internet, often discovered via a security scanner or a data breach report.

Fix

Remove the allUsers binding immediately using gcloud storage buckets remove-iam-policy-binding gs://BUCKET_NAME --member=allUsers --role=roles/storage.objectViewer. Audit Cloud Audit Logs to check which objects were accessed. Switch to signed URLs for any temporary public access.

Using a single service account with roles/editor for every service in a project

Symptom

If one service is compromised or a key file is leaked, an attacker gains near-full write access to the entire GCP project including secrets, databases, and compute.

Fix

Create one service account per service, grant only the specific predefined roles required (e.g., roles/pubsub.publisher, not roles/pubsub.admin), and use Workload Identity for GKE instead of key files.

Deploying all resources to a single zone without high-availability consideration

Symptom

A single zonal outage takes down your entire application, violating your SLA.

Fix

For Compute Engine, use Managed Instance Groups (MIGs) spread across multiple zones. For Cloud SQL, enable High Availability to provision a standby instance in a different zone. For GKE, create a regional cluster with --region, or list several zones with --node-locations; --num-nodes then sets the node count per zone.

Using roles/editor or roles/owner service account keys in application code

Symptom

A key leak in a public GitHub repo or Docker image gives attackers full project access. Entire project can be deleted or exfiltrated.

Fix

Create per-service service accounts with only the IAM roles they need (principle of least privilege). In GKE, use Workload Identity instead of key files entirely: the pod gets credentials via the GCP metadata server without any file on disk. Run gcloud iam service-accounts list to see which accounts exist, and gcloud projects get-iam-policy PROJECT_ID to see which roles each one holds.
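That audit can be sketched as a small script. It assumes a project policy in the JSON shape returned by gcloud projects get-iam-policy PROJECT_ID --format=json (a bindings list of role and members), and it treats only roles/owner and roles/editor as too broad, which is an illustrative cut-off rather than a complete list. The function name and sample account addresses are hypothetical.

```python
BROAD_ROLES = {"roles/owner", "roles/editor"}  # illustrative, not exhaustive


def overprivileged_service_accounts(policy: dict) -> dict:
    """Map service account email -> sorted list of broad roles it holds."""
    findings = {}
    for binding in policy.get("bindings", []):
        role = binding.get("role")
        if role not in BROAD_ROLES:
            continue
        for member in binding.get("members", []):
            # Members look like "serviceAccount:name@project.iam.gserviceaccount.com".
            if member.startswith("serviceAccount:"):
                email = member.split(":", 1)[1]
                findings.setdefault(email, []).append(role)
    return {email: sorted(roles) for email, roles in sorted(findings.items())}


if __name__ == "__main__":
    sample_policy = {
        "bindings": [
            {"role": "roles/editor",
             "members": ["serviceAccount:payments@demo.iam.gserviceaccount.com",
                         "user:alice@example.com"]},
            {"role": "roles/pubsub.publisher",
             "members": ["serviceAccount:orders@demo.iam.gserviceaccount.com"]},
        ]
    }
    for email, roles in overprivileged_service_accounts(sample_policy).items():
        print(f"{email}: {', '.join(roles)}")
```

Human users with roles/editor are deliberately ignored here; the point is to find workloads whose leaked credentials would hand over the project.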

Deploying production workloads in a single zone instead of spreading across multiple zones

Symptom

A zone maintenance event or hardware failure takes down your entire application. GCP guarantees 99.99% availability per region only for multi-zone deployments — single-zone VMs have a lower SLA.

Fix

For GKE: set the node pool location to regional (not zonal). For Compute Engine: create a Managed Instance Group with distribution policy set to EVEN (balanced across all zones in the region). For Cloud SQL: enable high availability (HA), which creates a standby instance in a different zone.
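A quick way to find single-zone risk in an existing project is to group instances by region and count distinct zones. This sketch assumes instance dictionaries shaped like gcloud compute instances list --format=json, where zone is a full resource URL ending in the zone name. The helper names and the sample project are hypothetical.

```python
from collections import defaultdict


def zone_name(zone_field: str) -> str:
    """Extract 'us-central1-a' from a zone URL or return a bare zone name as-is."""
    return zone_field.rstrip("/").rsplit("/", 1)[-1]


def region_of(zone: str) -> str:
    """Drop the zone suffix: 'us-central1-a' -> 'us-central1'."""
    return zone.rsplit("-", 1)[0]


def single_zone_regions(instances: list) -> dict:
    """Map region -> zone for every region whose instances all sit in one zone."""
    zones_by_region = defaultdict(set)
    for instance in instances:
        zone = zone_name(instance["zone"])
        zones_by_region[region_of(zone)].add(zone)
    return {
        region: next(iter(zones))
        for region, zones in sorted(zones_by_region.items())
        if len(zones) == 1
    }


if __name__ == "__main__":
    base = "https://www.googleapis.com/compute/v1/projects/demo/zones/"
    sample = [
        {"name": "api-1", "zone": base + "us-central1-a"},
        {"name": "api-2", "zone": base + "us-central1-b"},
        {"name": "batch-1", "zone": base + "europe-west1-b"},
        {"name": "batch-2", "zone": base + "europe-west1-b"},
    ]
    print(single_zone_regions(sample))
```

It does not know which instances belong to the same application, so treat the result as a list of places to look rather than a verdict.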

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

You're building a payments microservice that needs to read from Cloud SQ...

Q02 · SENIOR

A product manager tells you traffic to your API is unpredictable — quiet...

Q03 · SENIOR

Your team is moving from a monolith to microservices on GCP. How do you ...

Q01 of 03 · SENIOR

You're building a payments microservice that needs to read from Cloud SQL and publish to Pub/Sub. Walk me through how you'd set up IAM for it in production — and specifically, what would you NOT do that junior engineers typically get wrong?

ANSWER

First, create a dedicated service account for the payments microservice, with no shared accounts. Grant roles/cloudsql.client for connecting to Cloud SQL and roles/pubsub.publisher for publishing. For Cloud SQL, also connect through the Cloud SQL Auth Proxy or over Private IP. What I would NOT do: use roles/editor or roles/cloudsql.admin. Those grant far too much access, including the ability to delete databases and whole instances. Also avoid embedding a long-lived service account key file in the container: use Workload Identity (if on GKE) or attach the SA to the Cloud Run service directly.

FAQ · 5 QUESTIONS

Frequently Asked Questions

Is Google Cloud Platform better than AWS for beginners?

Is Google Cloud cheaper than AWS?

What is BigQuery and why does GCP recommend it?

What is GCP's sustained-use discount and how does it compare to AWS Reserved Instances?

What is the difference between Cloud Monitoring and Cloud Logging?

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.

✓ Verified

production tested

July 27, 2026

last updated
