Intermediate 4 min · March 09, 2026

Google Cloud Compute Engine Basics

Compute Engine — Orphaned Disks Burn $4,200/Month

Q: What exactly happens to my data when I delete a VM?

It depends on how the disk was configured at attachment time. Boot disks created as part of the 'gcloud compute instances create' command default to auto-delete=yes, so they are deleted with the VM. Secondary disks attached using 'gcloud compute instances attach-disk' default to auto-delete=no, so they survive VM deletion and continue accruing storage charges. To check the auto-delete setting on a running VM's disks before deletion: 'gcloud compute instances describe VM_NAME --zone=ZONE --format="get(disks[].autoDelete,disks[].source)"'. Quote the --format value, otherwise the shell tries to interpret the parentheses and brackets. If any disk shows autoDelete: false and you don't need to retain it, either delete the VM with the '--delete-disks=all' flag or delete the disk manually afterward with 'gcloud compute disks delete DISK_NAME --zone=ZONE'. Rule of thumb: after any VM deletion operation, run 'gcloud compute disks list --filter="-users:*"' to confirm no orphaned disks remain.
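The pre-deletion check can be wrapped in a small function. This is a minimal sketch, not an official tool: the function name `disks_surviving_delete` is made up for illustration, and it assumes `gcloud` is installed and authenticated and that `--flatten` with `value(...)` prints one tab-separated `autoDelete`/disk-name row per attached disk.

```shell
#!/bin/bash
# Hypothetical helper: print the names of disks that would survive
# deletion of the given VM (those whose autoDelete is false).
disks_surviving_delete() {
    local vm="$1" zone="$2"
    gcloud compute instances describe "${vm}" \
        --zone="${zone}" \
        --flatten="disks[]" \
        --format="value(disks.autoDelete,disks.source.basename())" |
    awk 'tolower($1) == "false" { print $2 }'
}

# Usage (placeholders):
#   disks_surviving_delete VM_NAME ZONE
```

If the function prints anything, decide per disk whether to keep it, delete it afterward, or flip its setting first with 'gcloud compute instances set-disk-auto-delete VM_NAME --zone=ZONE --disk=DISK_NAME --auto-delete'.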

Q: What is the difference between a Zone and a Region in GCE, and how does it affect my architecture?

A Region is a geographic location containing multiple independent Zones (e.g., us-central1 contains us-central1-a, us-central1-b, us-central1-c, and us-central1-f). A Zone is a single isolated deployment area — think of it as one or more data centers within the region. Zones within a region are connected by Google's private network with single-digit millisecond latency between them. Architecturally: a VM in a single zone has no protection against zone-level failures (power, cooling, networking). Deploying a Managed Instance Group across 2-3 zones in the same region provides high availability with negligible latency penalty — cross-zone traffic within a region stays on Google's private network. Deploying across regions provides disaster recovery capability but adds 30-100ms of inter-region latency and inter-region egress charges. For most production workloads: multi-zone within a single region is the right baseline. Multi-region is for global user distribution (lower latency for geographically distributed users) or regulatory requirements for geographic data separation.
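As a rough illustration of the multi-zone baseline, the sketch below creates a regional Managed Instance Group spread across an explicit list of zones. The function name and all argument values are hypothetical, and the instance template is assumed to exist already.

```shell
#!/bin/bash
# Hypothetical helper: create a regional MIG across the given zones.
# Zones are passed explicitly because zone suffixes differ per region.
create_regional_mig() {
    local name="$1" template="$2" region="$3" zones="$4" size="${5:-3}"
    gcloud compute instance-groups managed create "${name}" \
        --template="${template}" \
        --region="${region}" \
        --zones="${zones}" \
        --size="${size}"
}

# Usage (placeholders):
#   create_regional_mig web-mig web-template us-central1 \
#       us-central1-a,us-central1-b,us-central1-c 3
```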

Q: Can I resize a VM after it's been created, and do I need to stop it?

Machine type changes (CPU and RAM) require stopping the VM first: 'gcloud compute instances stop VM_NAME --zone=ZONE', then 'gcloud compute instances set-machine-type VM_NAME --zone=ZONE --machine-type=NEW_MACHINE_TYPE', then 'gcloud compute instances start VM_NAME --zone=ZONE'. Expect 2-5 minutes of downtime for the stop-resize-start cycle. For production workloads, perform this change using a MIG rolling update with a new instance template rather than resizing individual VMs, because the MIG approach maintains availability during the change. Disk resizing can be done while the VM is running: 'gcloud compute disks resize DISK_NAME --zone=ZONE --size=NEW_SIZE_GB'. After resizing the disk, you also need to resize the partition and filesystem inside the VM (using resize2fs for ext4 or xfs_growfs for XFS). Do not assume GCE expands the filesystem for you when the disk is resized, especially on data disks. Changing machine families is not always an in-place operation. Moves between first- and second-generation series (for example, E2 to N2) generally work with the same stop, set-machine-type, start sequence. Moves to a third-generation series such as C3 can need more work, because those series have different disk-type, disk-interface (NVMe), and network-interface (gVNIC) requirements. If the existing boot disk or OS image does not meet them, recreate the VM from a snapshot of the disk.
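The stop, resize, start sequence for a single VM can be sketched as one function. The name `resize_vm` is hypothetical, and the in-VM device names in the trailing comments are assumptions; check `lsblk` on the actual machine before running them.

```shell
#!/bin/bash
# Hypothetical helper: stop, change machine type, and restart one VM.
# Each step runs only if the previous one succeeded.
resize_vm() {
    local vm="$1" zone="$2" machine_type="$3"
    gcloud compute instances stop "${vm}" --zone="${zone}" &&
    gcloud compute instances set-machine-type "${vm}" \
        --zone="${zone}" --machine-type="${machine_type}" &&
    gcloud compute instances start "${vm}" --zone="${zone}"
}

# After 'gcloud compute disks resize', grow the partition and filesystem
# from inside the VM (device names are assumptions):
#   sudo growpart /dev/sda 1     # extend partition 1 to fill the disk
#   sudo resize2fs /dev/sda1     # ext4
#   sudo xfs_growfs /            # XFS, takes the mount point
```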

Q: What is the difference between Preemptible VMs and Spot VMs, and which should I use?

Preemptible VMs were GCE's original discounted compute offering with two defining constraints: a hard 24-hour maximum lifetime (the VM is automatically terminated after 24 hours regardless of what's running) and no guaranteed availability (GCE reclaims capacity when needed with 30 seconds notice). Spot VMs are the modern replacement: no maximum lifetime, the same price as Preemptible (60-80% discount), and reclamation behavior that's based on capacity demand rather than a fixed timer. For all new deployments, use Spot VMs — '--provisioning-model=SPOT'. There is no scenario where Preemptible VMs are the correct choice in 2026; Spot VMs offer the same discount without the artificial 24-hour cap. Both Preemptible and Spot VMs are unsuitable for always-on production services with uptime SLAs. They're purpose-built for fault-tolerant batch workloads: ML training with checkpointing, CI/CD pipeline runners, data transformation jobs, rendering pipelines. Design the workload to tolerate reclamation and Spot VMs are a straightforward 60-80% cost reduction with no architectural downside.
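A minimal sketch of both halves of the Spot pattern: creating the VM, and detecting reclamation from inside it. The function names are hypothetical. `is_preempted` assumes it runs on the VM itself, where the metadata server is reachable, and that the `instance/preempted` key returns `TRUE` once reclamation has started.

```shell
#!/bin/bash
# Hypothetical helper: create a Spot VM that is stopped (not deleted)
# when Compute Engine reclaims the capacity.
create_spot_vm() {
    local vm="$1" zone="$2" machine_type="$3"
    gcloud compute instances create "${vm}" \
        --zone="${zone}" \
        --machine-type="${machine_type}" \
        --provisioning-model=SPOT \
        --instance-termination-action=STOP
}

# Run on the VM: succeeds when the metadata server reports preemption.
is_preempted() {
    local flag
    flag="$(curl -s -H 'Metadata-Flavor: Google' \
        'http://metadata.google.internal/computeMetadata/v1/instance/preempted')"
    [ "${flag}" = "TRUE" ]
}
```

A batch loop can call `is_preempted` between work units and write a checkpoint before exiting. The checkpoint logic itself is workload-specific and is left out here.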

Q: How do I control costs on GCE as the team and workload scale?

Cost control on GCE has three layers, and teams that implement all three consistently see 40-60% reductions versus unmanaged deployments. First, visibility: set up a budget alert in Cloud Billing at 80% and 110% of expected monthly spend. Enable billing export to BigQuery so you can query spend by label, project, and resource type. Label every VM with env, team, and app labels at creation time — unlabeled resources are unattributable costs. Second, right-sizing: GCE Right-sizing Recommendations appear automatically in the Compute Engine console after 8+ days of utilization data. Review them monthly. For workloads with variable CPU/RAM needs, Custom Machine Types let you specify exact resources instead of rounding up to the next predefined type. For non-production VMs, Instance Schedules stop resources during off-hours automatically. Third, commitment: if a VM or family of VMs will run for 12+ months, Committed Use Discounts offer 37% savings (1-year) or 55% savings (3-year) over On-Demand pricing in exchange for a usage commitment — no upfront payment required. For batch workloads, Spot VMs deliver 60-80% savings over On-Demand. Combining Committed Use on baseline capacity with Spot VMs for burst capacity is the cost optimization pattern used by mature GCP deployments.
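To make the commitment-plus-Spot pattern concrete, here is a small worked calculation. The dollar figures are invented for illustration. The percentages are the 1-year committed use figure and the low end of the Spot range quoted above, and the arithmetic is integer-only, so results are rounded down.

```shell
#!/bin/bash
# Hypothetical calculator: blended monthly cost when baseline capacity is
# covered by a committed use discount and burst capacity runs on Spot.
# Inputs are whole dollars and whole percentages.
blended_cost() {
    local baseline="$1" burst="$2" cud_pct="$3" spot_pct="$4"
    local on_demand=$(( baseline + burst ))
    local discounted=$(( baseline * (100 - cud_pct) / 100 + burst * (100 - spot_pct) / 100 ))
    local saved_pct=$(( (on_demand - discounted) * 100 / on_demand ))
    echo "on-demand=${on_demand} discounted=${discounted} saved=${saved_pct}%"
}

blended_cost 10000 4000 37 60
# prints: on-demand=14000 discounted=7900 saved=43%
```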

Deleting a GCE VM doesn't always delete its disks.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Everything here is grounded in real deployments.


Last updated: July 04, 2026


Before you start (⏱ 25 min)

✓ Solid grasp of DevOps fundamentals
✓ Comfortable with command-line tools
✓ Basic Linux administration knowledge


⚡Quick Answer

GCE is Google's IaaS platform — you rent virtual machines on demand instead of buying physical servers
Machine families are purpose-built: E2 for dev/test, N2 for balanced production, C2/C3 for compute-heavy workloads
Live Migration moves your running VM to a different host during maintenance without rebooting — unique to Google Cloud
Preemptible (Spot) VMs cost up to 80% less but can be reclaimed at any time — use them for fault-tolerant batch jobs only
Persistent Disks survive VM deletion if auto-delete is disabled — orphaned disks silently accrue costs with no warning
The biggest trap: running VMs with the Default Compute Service Account (Editor role) — it's a project-wide security hole waiting to be exploited

✦ Definition (~90s read)

What is Google Cloud Compute Engine?

Compute Engine is built on the same physical infrastructure that runs Google Search, Gmail, and YouTube. That's not a marketing claim — it's the architectural reason GCE has capabilities you don't find on competing platforms. Live Migration, Google's global private fiber backbone, and the custom Titanium chip that handles networking and security offloading all came from internal Google infrastructure before they became GCE features.


But GCE is not just 'a VM in the cloud.' The decisions you make at provisioning time — machine family, disk type, service account, network configuration, maintenance policy — have meaningful operational and cost consequences that play out over months. Understanding those decisions is the difference between a GCE deployment that works well and one that generates surprise bills and 2am pages.

The fundamental question GCE answers is: do you need an operating system? If you need kernel-level control, a custom OS image, GPU access, long-running background processes, or a persistent filesystem that behaves like a local disk, GCE is your tool. If you're deploying a containerized stateless HTTP service, Cloud Run or GKE may be a better fit.

The choice is not about which is 'better' — it's about matching the abstraction level to the workload requirements.

Plain-English First

Think of Google Cloud Compute Engine as a high-end virtual computer living in one of Google's data centers — the same infrastructure that keeps YouTube and Google Search running. Instead of buying a physical server, racking it, cabling it, and paying for the electricity yourself, you rent exactly the amount of compute power you need, for exactly as long as you need it.

The mental model I use with engineers new to GCE: imagine you're a contractor who builds houses. Without cloud infrastructure, you'd have to buy every tool before starting a job — drill, saw, scaffolding — and store it all in your garage after. With GCE, you call a tool rental shop, say 'I need a drill and scaffolding for three days,' and return everything when the job is done. You paid for three days of use, not the lifetime cost of the equipment.

The part that surprises most developers: you don't just get a computer. You get a computer connected to Google's private fiber network, with the ability to resize it without buying new hardware, snapshot its disk before a risky deployment, and pay only for the seconds it's actually running. That last part — per-second billing — is what makes cloud-native cost modeling genuinely different from anything you'd do with physical hardware.

Google Cloud Compute Engine (GCE) is the Infrastructure-as-a-Service (IaaS) layer of Google Cloud Platform, and it's one of the most capable — and most frequently misused — services in the GCP catalog. Every time I've joined a new engineering organization running on GCP, the Compute Engine bill is where I find the most recoverable waste and the most preventable incidents.

GCE exists to give you the same infrastructure primitives Google uses internally, exposed through an API. That means you can provision a VM in under 30 seconds, attach and detach persistent disks without rebooting, resize a machine type with a single command, and deploy across 40+ regions worldwide — all without touching physical hardware or filing a procurement request.

This guide covers the real mechanics of GCE: how to provision VMs correctly, which machine families to reach for in different scenarios, how the disk model actually works (and where it silently burns budget), and the security decisions that most tutorials skip entirely. We'll also cover the failure modes that show up in production — the orphaned disk that runs up a $4,200 monthly bill, the ephemeral IP that breaks DNS at 2am, and the Default Compute Service Account that turns a compromised VM into a project-wide breach.

By the end, you'll have both the conceptual foundation and production-grade examples to provision and operate GCE workloads with confidence — and to audit the ones you've inherited.

What Is Google Cloud Compute Engine and Why Does It Exist?

GCE exists to solve a problem that anyone who has run physical hardware understands viscerally: the gap between the capacity you need today and the capacity you provisioned three months ago when you ordered the hardware. Provisioning a physical server takes weeks of procurement, shipping, racking, cabling, and OS installation. Provisioning a GCE VM takes 25 seconds via the gcloud CLI. That gap — weeks versus seconds — is the entire premise of Infrastructure-as-a-Service.

io/thecodeforge/gce/ProvisionVM.sh (BASH)

#!/bin/bash
# io.thecodeforge: Production-grade VM Provisioning via gcloud CLI
# Every flag here is deliberate; see inline comments for rationale.
# Do not remove flags without understanding their security or operational purpose.
#
# Flags live in an array so each one can carry its own comment. A comment
# placed between backslash-continued lines would end the command early.
set -euo pipefail

CREATE_ARGS=(
    --project=thecodeforge-prod
    --zone=us-central1-a

    # Machine type: e2-standard-2 = 2 vCPU, 8GB RAM.
    # For production web APIs, start here and right-size after 2 weeks of metrics.
    --machine-type=e2-standard-2

    # PREMIUM network tier uses Google's private backbone for egress.
    # STANDARD tier uses public internet routing: cheaper but higher latency.
    --network-interface=network-tier=PREMIUM,subnet=default

    # MIGRATE = Live Migration during host maintenance (no reboot).
    # TERMINATE is required for GPU instances, which cannot be live-migrated.
    --maintenance-policy=MIGRATE

    # STANDARD = on-demand pricing. Use SPOT for batch/CI workloads only.
    --provisioning-model=STANDARD

    # Critical: use a custom SA with least-privilege IAM, NOT the default compute SA.
    # The default SA has Editor role on the entire project, a major security risk.
    --service-account=forge-web-sa@thecodeforge-prod.iam.gserviceaccount.com
    --scopes=https://www.googleapis.com/auth/cloud-platform

    # Network tags are used by firewall rules to target specific VMs.
    --tags=http-server,https-server

    # Boot disk configuration:
    # auto-delete=yes: disk is deleted when VM is deleted (prevents orphaned disk charges).
    # pd-balanced: good balance of cost and IOPS for web workloads.
    # 20GB is sufficient for OS + app; don't overprovision disk.
    --create-disk=auto-delete=yes,boot=yes,device-name=boot-disk,image=projects/debian-cloud/global/images/family/debian-12,mode=rw,size=20,type=pd-balanced

    # Shielded VM: prevents boot-level rootkits and provides integrity attestation.
    # Required for PCI-DSS, HIPAA, and most enterprise security baselines.
    --shielded-secure-boot
    --shielded-vtpm
    --shielded-integrity-monitoring

    # Labels: critical for cost allocation and lifecycle management.
    # Use these to identify and delete resources by environment.
    --labels=env=production,app=frontend,team=platform,owner=sre
)

gcloud compute instances create forge-web-server-01 "${CREATE_ARGS[@]}"

Output

Created [https://www.googleapis.com/compute/v1/projects/thecodeforge-prod/zones/us-central1-a/instances/forge-web-server-01].
NAME                 ZONE           MACHINE_TYPE   INTERNAL_IP  EXTERNAL_IP   STATUS
forge-web-server-01  us-central1-a  e2-standard-2  10.128.0.2   34.135.10.45  RUNNING

# Verify Shielded VM is active:
# gcloud compute instances describe forge-web-server-01 --zone=us-central1-a --format="yaml(shieldedInstanceConfig)"
# shieldedInstanceConfig:
#   enableIntegrityMonitoring: true
#   enableSecureBoot: true
#   enableVtpm: true

Mental Model

VM vs Serverless — The Decision That Determines Your Operational Overhead

The question isn't 'which is better.' It's 'how much of the stack do you need to own?' GCE gives you everything from the OS up. Serverless gives you only the runtime. The right choice is determined by your workload's requirements, not by preference.

Use GCE when you need kernel-level control: custom OS builds, kernel parameters (vm.swappiness, tcp_keepalive), eBPF-based networking, or custom kernel modules for high-performance I/O
Use GCE for stateful workloads where data locality matters: databases, file servers, ML model serving with large model files, anything that writes to local disk faster than network storage can keep up
Use GCE for GPU workloads: A100 and H100 GPUs are attached to GCE instances, and serverless GPU support (such as L4 GPUs on Cloud Run) covers a much narrower set of GPU types and configurations
Use Cloud Run for stateless HTTP workloads that need to scale to zero and back up in seconds — if you're not SSH-ing into the machine, you probably don't need GCE
The operational cost comparison: GCE requires you to manage OS patches, security hardening, disk monitoring, and capacity planning. Cloud Run offloads all of that. That operational delta is real engineering time — factor it into the decision.

📊 Production Insight

GCE's Live Migration is a genuine differentiator: on other clouds, host maintenance has more often meant a scheduled stop-and-restart or a reboot inside a maintenance window. On GCE, the maintenance event is invisible to most workloads. This matters for stateful applications where a reboot means a cold start, cache warm-up, and connection re-establishment.

Preemptible and Spot VMs are not just a cost option — they're an architectural forcing function. Designing your batch processing to tolerate preemption makes it resilient to all forms of unexpected VM loss, not just preemption. Teams that adopt Spot VMs for CI/CD runners typically end up with better pipeline resilience across the board.

Rule: use Standard VMs for always-on services with SLAs. Use Spot VMs for batch processing, CI/CD runners, and any workload that can checkpoint and resume. The 60-80% cost reduction is significant at scale — a fleet of 50 CI runners at Spot pricing vs Standard pricing is the difference between a manageable infrastructure budget and one that requires quarterly justification.

🎯 Key Takeaway

GCE gives you full kernel-level control that no serverless platform can match — but that control comes with operational responsibility for patching, hardening, and lifecycle management that serverless offloads.

Machine families are not interchangeable: E2 shared-core instances can be throttled by neighbors, C3 instances deliver consistent per-core performance that E2 can't guarantee. Choosing the wrong family for a workload means either overpaying or getting unexpected performance variance.

Punchline: if you're running a predefined machine type and Cloud Monitoring shows consistent RAM usage below 60% of your allocation, run 'gcloud compute instances describe VM_NAME --zone=ZONE --format="get(machineType)"' (quote the --format value so the shell leaves the parentheses alone) and compare against Custom Machine Type pricing. Switching from n2-standard-8 to a custom 8 vCPU / 20GB machine for a Java service with a 16GB heap can save $80-120/month per instance at scale.
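Custom machine types are addressed by a name that encodes the family, vCPU count, and memory in MB. The sketch below builds that name for the 8 vCPU / 20GB example. The function name is hypothetical, and it covers only families that use the `FAMILY-custom-VCPUS-MEMORY_MB` pattern (N1 custom types omit the family prefix).

```shell
#!/bin/bash
# Hypothetical helper: build a custom machine type name such as
# n2-custom-8-20480 from a family, a vCPU count, and memory in whole GB.
custom_machine_type() {
    local family="$1" vcpus="$2" memory_gb="$3"
    local memory_mb=$(( memory_gb * 1024 ))
    echo "${family}-custom-${vcpus}-${memory_mb}"
}

custom_machine_type n2 8 20
# prints: n2-custom-8-20480
```

The resulting name can be passed to '--machine-type' on 'gcloud compute instances create' or, with the VM stopped, on 'gcloud compute instances set-machine-type'.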

Choosing the Right GCE Machine Family

If: Dev/test environments, bursty workloads, or cost-sensitive non-production jobs
→ Use: E2. Shared-core options (e2-micro, e2-small) are available at very low cost. E2-standard gives dedicated vCPUs. Good for workloads with variable CPU needs and tolerance for occasional throttling on shared-core instances.

If: Balanced production workloads: web servers, REST APIs, application servers, build systems
→ Use: N2 (Intel) or N2D (AMD). Dedicated cores, no noisy-neighbor throttling, good CPU-to-RAM ratio (1:4 vCPU:GB). N2D is typically 10-15% cheaper than N2 for equivalent specs.

If: CPU-bound workloads: compilation, rendering, scientific computing, data transformation pipelines
→ Use: C2 (Intel Cascade Lake) or C3 (Intel Sapphire Rapids, 2023+). Highest per-core performance in GCE, designed for sustained compute-intensive work. C3 has DDR5 memory and offers meaningfully better per-core throughput than C2.

If: Memory-intensive workloads: in-memory databases, large JVM heap sizes, SAP HANA, Redis with large datasets
→ Use: M1 (up to 4TB RAM) or M2 (up to 12TB RAM). These are designed specifically for workloads that don't fit in standard memory ratios. Expensive, but the alternative is sharding what could be a single-instance workload.

If: The predefined machine types don't match your workload's CPU-to-RAM ratio
→ Use: Custom Machine Types. Specify exact vCPU count and RAM in 256MB increments. If you need 6 vCPUs and 20GB RAM, you pay for exactly that. This is often 20-40% cheaper than the next predefined type that fits.

If: ML inference, video transcoding, scientific simulation, or any GPU-accelerated workload
→ Use: A2 (A100 GPUs), G2 (L4 GPUs for inference), or A3 (H100 GPUs for large model training). GPU availability varies by region, so check quotas before designing architecture around a specific GPU type in a specific region.

thecodeforge.io

Google Cloud Compute Engine

Common Mistakes and How to Avoid Them

Compute Engine is permissive by default in ways that create operational problems over time. The defaults were chosen to make getting started easy, not to be correct for production. The gap between 'easy to start' and 'correct for production' is where most GCE mistakes live.

The IP address problem is the one I see most often in teams that are new to GCE. Ephemeral external IPs are assigned at VM start time and released when the VM stops. This is fine for dev environments. For a production web server, it means every restart — planned or unplanned — changes the IP your DNS record points to. The fix is a one-time 30-second operation that most tutorials skip because it doesn't affect the happy path. The cost of skipping it is a 2am DNS debugging session.
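One way to apply that fix to a VM that is already serving traffic is to promote its current ephemeral address to a reserved one, so the IP that DNS points at does not change. This is a sketch under stated assumptions: the function name is hypothetical, and it assumes the VM's first network interface carries the external IP.

```shell
#!/bin/bash
# Hypothetical helper: promote a VM's current ephemeral external IP
# to a reserved static address in the same region.
promote_ephemeral_ip() {
    local vm="$1" zone="$2" address_name="$3"
    local region="${zone%-*}"   # us-central1-a -> us-central1
    local ip
    ip="$(gcloud compute instances describe "${vm}" \
        --zone="${zone}" \
        --format='get(networkInterfaces[0].accessConfigs[0].natIP)')"
    [ -n "${ip}" ] || { echo "no external IP on ${vm}" >&2; return 1; }
    gcloud compute addresses create "${address_name}" \
        --addresses="${ip}" \
        --region="${region}"
}

# Usage (placeholders):
#   promote_ephemeral_ip VM_NAME ZONE ADDRESS_NAME
```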

The service account problem is the security debt that accumulates silently. Every GCE VM needs a service account to authenticate to other GCP services. The path of least resistance is the Default Compute Service Account, which has Editor role on the entire project. This means any process running on that VM — including a compromised web process — can read from any Cloud Storage bucket, write to any Pub/Sub topic, query any Cloud SQL database, and delete any other VM in the project. That's not hypothetical risk. It's the blast radius calculation for a supply-chain attack or a server-side request forgery vulnerability against a service running on that VM.

The over-provisioning problem is the cost debt that accumulates just as silently. GCE's Right-sizing Recommendations in the console analyze 8 days of CPU and memory utilization and suggest smaller machine types when resources are consistently underused. I've seen teams save 30-40% on compute spend just by reviewing these recommendations quarterly and acting on them.

io/thecodeforge/gce/LifecycleManagement.sh (BASH)

#!/bin/bash
# io.thecodeforge: GCE Disk and Instance Lifecycle Management
# Run these as part of pre-deployment and post-teardown procedures.

# ============================================================
# STEP 1: Pre-deployment snapshot
# Take a consistent disk snapshot before any major deployment.
# This is your rollback point — do this before every production change.
# ============================================================
DISK_NAME="boot-disk"
ZONE="us-central1-a"
SNAPSHOT_NAME="pre-deploy-$(date +%Y%m%d-%H%M%S)"

gcloud compute disks snapshot "${DISK_NAME}" \
    --project=thecodeforge-prod \
    --snapshot-names="${SNAPSHOT_NAME}" \
    --zone="${ZONE}" \
    --storage-location=us-central1

echo "Snapshot created: ${SNAPSHOT_NAME}"
echo "To restore: gcloud compute disks create restored-disk --source-snapshot=${SNAPSHOT_NAME} --zone=${ZONE}"

# ============================================================
# STEP 2: Audit orphaned disks before teardown
# Run this BEFORE deleting VMs and AFTER deleting VMs.
# The before-run gives you a baseline. The after-run catches anything missed.
# ============================================================
echo "=== Unattached Disks (potential orphans) ==="
gcloud compute disks list \
    --filter='-users:*' \
    --format='table(name,zone,sizeGb,type,creationTimestamp,status)' \
    --sort-by=~sizeGb

# ============================================================
# STEP 3: Delete dev environment instances by label
# Labels are the correct mechanism for lifecycle management.
# Never maintain a manual list of instance names to delete.
# ============================================================
INSTANCES_TO_DELETE=$(gcloud compute instances list \
    --filter="labels.env=development" \
    --format="value(name,zone.basename())" \
    --sort-by=zone)

if [ -z "${INSTANCES_TO_DELETE}" ]; then
    echo "No development instances found. Nothing to delete."
else
    echo "Instances to delete:"
    echo "${INSTANCES_TO_DELETE}"
    # Delete each instance in its own zone so multi-zone dev
    # environments are handled correctly. A single --zone flag would
    # fail for any instance outside that zone.
    while read -r name zone; do
        gcloud compute instances delete "${name}" \
            --zone="${zone}" \
            --quiet < /dev/null
    done <<< "${INSTANCES_TO_DELETE}"
fi

# ============================================================
# STEP 4: Verify no orphaned disks remain after teardown
# If this list is non-empty after deletion, investigate before closing the ticket.
# ============================================================
echo "=== Post-teardown orphan check ==="
gcloud compute disks list \
    --filter='-users:*' \
    --format='table(name,zone,sizeGb,type,creationTimestamp)'

Output

Snapshot created: pre-deploy-20260319-143022
To restore: gcloud compute disks create restored-disk --source-snapshot=pre-deploy-20260319-143022 --zone=us-central1-a

=== Unattached Disks (potential orphans) ===
NAME              ZONE           SIZE_GB  TYPE         CREATED
old-data-disk-01  us-central1-a  500      pd-balanced  2025-09-12
stale-boot-02     us-central1-b  20       pd-standard  2025-11-03

Instances to delete:
forge-dev-vm-01 us-central1-a
forge-dev-vm-02 us-central1-a
Deleted [forge-dev-vm-01].
Deleted [forge-dev-vm-02].

=== Post-teardown orphan check ===
NAME              ZONE           SIZE_GB  TYPE         CREATED
old-data-disk-01  us-central1-a  500      pd-balanced  2025-09-12
stale-boot-02     us-central1-b  20       pd-standard  2025-11-03

# ACTION REQUIRED: These 2 disks were not deleted. Investigate before closing.

⚠ The Default Compute Service Account Is a Project-Wide Security Risk

Every GCE VM is created with a service account. The easy choice — accepting the default — attaches the Default Compute Service Account, which has the Editor role on the entire project. A single compromised process on a VM using this account can read secrets, delete infrastructure, and exfiltrate data from every service in the project. Create a custom service account for every VM with only the IAM roles it actually needs. This is a 5-minute setup task that eliminates an entire category of blast radius risk.
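The 5-minute setup can be sketched as follows. The function name is hypothetical and the role passed in is only an example; pick the roles your workload actually needs. It assumes the caller has permission to create service accounts and edit the project's IAM policy.

```shell
#!/bin/bash
# Hypothetical helper: create a dedicated service account, grant it one
# narrowly scoped role, and print its email address.
create_vm_service_account() {
    local project="$1" sa_name="$2" role="$3"
    local email="${sa_name}@${project}.iam.gserviceaccount.com"
    gcloud iam service-accounts create "${sa_name}" \
        --project="${project}" \
        --display-name="VM service account: ${sa_name}" &&
    gcloud projects add-iam-policy-binding "${project}" \
        --member="serviceAccount:${email}" \
        --role="${role}" >/dev/null &&
    echo "${email}"
}

# Usage (placeholders):
#   create_vm_service_account PROJECT_ID forge-web-sa roles/logging.logWriter
```

To move an existing VM onto the new account, stop it and run 'gcloud compute instances set-service-account VM_NAME --zone=ZONE --service-account=EMAIL --scopes=cloud-platform', then start it again.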

📊 Production Insight

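The ephemeral IP problem is predictable and preventable. Reserving a static IP adds no cost over an ephemeral one while the address is attached to a running VM, because an in-use external IPv4 address is billed at the same rate either way. A reserved but unattached address is billed at a higher rate (roughly $0.01/hour), so release the ones you no longer use. The operational risk of an ephemeral IP on a production endpoint is orders of magnitude more expensive than any reservation fee.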

GCE Right-sizing Recommendations are generated automatically and available in the Compute Engine console under the Recommendations section. They require no setup and are based on actual utilization data. Teams that review these monthly and act on them consistently report 25-40% compute cost reductions over 12 months — not from dramatic architectural changes, but from incremental machine type adjustments that add up across a fleet.

Rule: on the first of every month, open the GCE Right-sizing Recommendations panel. Apply any recommendations where the CPU and memory savings are above 20%. Reserve any Static IPs that are currently ephemeral on production-facing VMs. Check for unattached disks older than 7 days. These three checks, done consistently, eliminate the vast majority of avoidable GCE costs.

🎯 Key Takeaway

Ephemeral IPs are the correct default for development and the wrong default for production. The distinction is simple: if a DNS record points to the IP, it must be static. Everything else can be ephemeral.

The Default Compute Service Account is a convenience that becomes a liability the moment a VM is compromised. Spending 5 minutes creating a custom service account with scoped IAM roles eliminates a project-wide blast radius. There is no argument for using the default SA in production.

Punchline: run 'gcloud compute instances list --format="table(name,zone,serviceAccounts[0].email)"' across your production project (the --format value must be quoted, or the shell will choke on the parentheses and brackets). If any row shows the default compute service account (PROJECT_NUMBER-compute@developer.gserviceaccount.com), that VM is over-permissioned and needs a dedicated SA created and attached before the next deployment.
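For a project with more than a handful of VMs, the same audit is easier to read when it prints only the offenders. A minimal sketch, with a hypothetical function name, that assumes `value(...)` output is one whitespace-separated row per instance:

```shell
#!/bin/bash
# Hypothetical helper: list VMs still running as the default compute
# service account. Output rows are "name zone email".
vms_using_default_sa() {
    gcloud compute instances list \
        --format='value(name,zone.basename(),serviceAccounts[0].email)' |
    awk '$3 ~ /-compute@developer\.gserviceaccount\.com$/'
}

# Usage: vms_using_default_sa
# An empty result means no VM in the active project uses the default SA.
```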

Networking and Cost Optimization Decisions

If: Production web server, API endpoint, or any service with a DNS record pointing to it
→ Use: Reserve a Static External IP immediately. Cost: no more than an ephemeral IP while attached to a running VM. Risk of not doing it: DNS breaks on every VM restart, maintenance event that triggers replacement, or MIG rolling update.

If: Internal microservice that only communicates with other services within the same VPC
→ Use: No external IP needed; use Internal IPs only. Eliminates the attack surface of a public IP, avoids egress charges for same-zone traffic, and enforces that the service is not accidentally exposed to the internet.

If: Developer needs SSH access to a VM that has no external IP
→ Use: IAP (Identity-Aware Proxy) tunneling: 'gcloud compute ssh VM_NAME --zone=ZONE --tunnel-through-iap'. IAP authenticates via IAM, so there is no bastion host, no VPN, and no SSH port open to the internet. This is the correct pattern for all developer access in 2026.

If: VM shows consistent CPU below 40% or RAM below 50% in Cloud Monitoring for 2+ weeks
→ Use: Check GCE Right-sizing Recommendations in the console. If no recommendation exists yet (requires 8+ days of data), calculate the Custom Machine Type that fits your P95 utilization with 30% headroom and switch to it.

If: Batch job or CI/CD runner that runs for a bounded duration
→ Use: Spot provisioning model: '--provisioning-model=SPOT'. Design the job to checkpoint progress to Cloud Storage. Cost savings of 60-80% over Standard pricing make this the correct default for all non-always-on workloads.
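IAP tunneling only works if the VPC firewall admits traffic from IAP's TCP-forwarding range, 35.235.240.0/20. The sketch below creates such a rule scoped to a network tag. The function name, rule naming scheme, and tag are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: allow SSH only from the IAP TCP-forwarding range,
# limited to VMs that carry the given network tag.
allow_iap_ssh() {
    local network="$1" target_tag="$2"
    gcloud compute firewall-rules create "allow-iap-ssh-${target_tag}" \
        --network="${network}" \
        --direction=INGRESS \
        --action=ALLOW \
        --rules=tcp:22 \
        --source-ranges=35.235.240.0/20 \
        --target-tags="${target_tag}"
}

# Usage (placeholders):
#   allow_iap_ssh default iap-ssh
#   gcloud compute ssh VM_NAME --zone=ZONE --tunnel-through-iap
```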

Why Your VMs Are Leaking Money (and How to Fix It)

Every time you spin up a generic n1-standard-4 because 'that's what we always use', you are burning cash. Google Compute Engine charges by the second with a one-minute minimum. That includes RAM, CPU, and premium OS licenses. The first thing a battle-hardened engineer learns: match the machine family to the workload. Need raw CPU cycles for batch processing? Use C2 instances. Running a memory-mapped database? M2 or M3 families. Burstable web servers? Use E2 instances with committed use discounts.

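Here is the real killer: most teams never clean up orphaned disks or static IPs left behind by terminated VMs. Those resources keep billing you. Build a garbage collection job on day one. Tag resources by environment and lifecycle. Use Spot VMs (the successor to preemptible VMs) for stateless batch jobs: they cost 60-80% less and can be reclaimed whenever capacity is needed, and legacy preemptible VMs are additionally terminated after 24 hours. For production, use committed use discounts (1 or 3 year terms) for steady baseline capacity. Sustained use discounts auto-apply on eligible machine families once a VM runs for more than 25% of a month and cover usage that commitments do not. The two do not stack on the same usage, and E2 is not eligible for sustained use discounts.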

find-orphaned-disks.sh (BASH)

#!/bin/bash
# io.thecodeforge
# Find disks not attached to any VM in project "prod-east"
# Run weekly as a cron job

PROJECT="prod-east-12345"

# List disks that have no users, i.e. are not attached to any instance.
# Filtering on status=READY is not enough: attached disks are READY too.
gcloud compute disks list \
  --project="${PROJECT}" \
  --filter="-users:*" \
  --format="table(name,zone.basename(),sizeGb,lastAttachTimestamp)"
# Sample output:
# NAME          ZONE        SIZE_GB  LAST_ATTACH_TIMESTAMP
# abandoned-disk us-east1-b  100      2025-02-10T12:00:00.000-08:00
# old-backup-disk us-east1-b  500     2024-11-01T09:00:00.000-08:00

# Delete them (careful!)
# gcloud compute disks delete abandoned-disk --zone=us-east1-b --quiet

Output

NAME ZONE SIZE_GB LAST_ATTACH_TIMESTAMP

abandoned-disk us-east1-b 100 2025-02-10T12:00:00.000-08:00

old-backup-disk us-east1-b 500 2024-11-01T09:00:00.000-08:00

⚠ Production Trap:

Do not use the 'f1-micro' or 'g1-small' shared-core machines for any production-facing service. They throttle CPU aggressively. Your latency will spike, your alerts will fire, and you will spend three hours debugging a non-issue. The E2 shared-core types (e2-micro, e2-small, e2-medium) also share physical cores, so for production choose a type with full vCPUs, such as e2-standard-2.

🎯 Key Takeaway

Match machine family to workload, tag everything, and run a weekly cleanup job for orphaned disks and IPs.
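The disk sweep above has a static IP counterpart. What follows is a minimal sketch, not a finished cleanup job: it assumes an address that is reserved but attached to nothing reports status RESERVED (attached addresses report IN_USE). The helper name list_unused_addresses and the project ID are placeholders.

```bash
#!/bin/bash
# Sketch: list static IPs that are reserved but not attached to anything.
# list_unused_addresses is a hypothetical helper name; pass your own project ID.

list_unused_addresses() {
  local project="$1"
  # status=RESERVED: reserved but not in use (attached IPs report IN_USE)
  gcloud compute addresses list \
    --project="${project}" \
    --filter="status=RESERVED" \
    --format="table(name,region.basename(),address,creationTimestamp)"
}

# Example (requires an authenticated gcloud):
#   list_unused_addresses "prod-east-12345"
```

Review the list before releasing anything: an address can be unattached on purpose, for example while a VM is being rebuilt.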


The Silent Killer: Opaque Startup Scripts and Firmware

Your Compute Engine VM just booted, but the application won't start. You SSH in and find nothing in the logs. The problem is almost always your startup script, or the lack of one. Google Compute Engine gives you two ways to configure a VM at boot: custom images with pre-baked configs, and metadata-driven startup scripts. The second is the common choice, but teams screw it up in three predictable ways.

First, they write startup scripts that exit silently on error. Always put 'set -e' at the top of your bash script so it halts at the first failing command instead of carrying on with a half-configured machine. Second, they never attach a health check. 'set -e' stops the script, but nothing marks the VM unhealthy on its own, and there is no 'healthy' flag to set. Run the instance in a Managed Instance Group with an autohealing health check that probes your app's port, with an initial delay long enough for the startup script to finish. If the probe keeps failing, the MIG recreates the instance. Third, they forget firmware settings. I have debugged a production outage caused by a VM with secure boot enabled but a custom kernel that wasn't signed. The VM booted into a black hole. Always test your image with shielded VM settings turned on in a test project before deploying to prod.

Use the 'gcloud compute instances create' flag '--metadata=startup-script-url' to point to a Cloud Storage bucket. Keep your scripts versioned there. When you need to update 500 instances, just change the bucket URL.

create-vm-with-healthy.shBASH

#!/bin/bash
# io.thecodeforge
# Create a Shielded VM whose startup script halts at the first failing command.
# The script alone does not mark the VM unhealthy; for that, run the instance
# in a MIG with an autohealing health check.

# Note: --metadata splits on commas, so this inline form only works while the
# script contains none. For anything longer, use --metadata-from-file instead.
STARTUP_SCRIPT='#!/bin/bash
set -e
echo "Starting application..."
# Install dependencies
apt-get update -y
apt-get install -y nginx
# Start nginx and verify it is running (non-zero exit if it is not)
systemctl start nginx
systemctl is-active --quiet nginx
echo "Startup complete"
'

# --instance-termination-action is omitted: it applies to Spot VMs,
# not to a standard instance like this one.
gcloud compute instances create "web-server-prod-1" \
  --zone=us-central1-a \
  --machine-type=e2-medium \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud \
  --metadata=startup-script="${STARTUP_SCRIPT}" \
  --shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring \
  --tags=http-server

# Verify the startup script output over SSH:
# gcloud compute ssh web-server-prod-1 --zone=us-central1-a --command="journalctl -u google-startup-scripts.service -n 50"

Output

Instance 'web-server-prod-1' created successfully.

Use 'gcloud compute ssh web-server-prod-1 --zone=us-central1-a' to verify startup.

⚠ Production Trap:

Never use the Google Cloud Console's 'Create Instance' page to generate startup scripts. It does not persist the script in a way that is easy to debug. Always store scripts in Cloud Storage or a Git repo, and reference them via metadata URL. Otherwise, you lose the script when you delete and recreate the VM.

🎯 Key Takeaway

Always use 'set -e' in startup scripts, tie health checks to VM lifecycle, and test shielded VM settings in a non-prod project first.
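'set -e' stops the script, but it doesn't tell you which step died. One way to get that into the startup log is a small wrapper around each step. This is a sketch under assumptions: run_step is a hypothetical helper, not part of any GCE tooling, and the step names are illustrative.

```bash
#!/bin/bash
# Sketch: name each startup step so a failure is visible in the startup log.
# run_step is a hypothetical helper; it is not provided by GCE.

run_step() {
  local step_name="$1"
  shift
  if "$@"; then
    echo "OK: ${step_name}"
  else
    local rc=$?
    echo "FAILED: ${step_name} (exit ${rc})" >&2
    return "${rc}"
  fi
}

# Usage inside a startup script (set -e makes the first failed step fatal):
#   set -e
#   run_step "update package index" apt-get update -y
#   run_step "install nginx"        apt-get install -y nginx
#   run_step "nginx is active"      systemctl is-active --quiet nginx
```

The FAILED line goes to stderr, which ends up in the same journal as the rest of the startup script output.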

● Production incident · POST-MORTEM · severity: high

Orphaned Persistent Disks silently burn $4,200/month after VM deletion

Symptom

Monthly GCP bill increases by $4,200 with no corresponding increase in running VMs or traffic. Cloud Billing shows Persistent Disk charges growing month-over-month in the dev project. The Compute Engine console VM list is empty — no running instances. The team lead confirms the dev environment was 'shut down weeks ago.' Finance escalates after noticing the charges haven't decreased despite reported cleanup.

Assumption

The team assumed deleting a VM deletes everything associated with it — the same mental model you'd have shutting down a physical server. They opened the Compute Engine console, saw the VM list was empty, and closed the ticket. Nobody checked the Disks tab. Nobody checked the billing breakdown by resource type. The cost anomaly sat undetected because the team had no billing alert configured — they only looked at the bill when it arrived at the end of the month.

Root cause

When the VMs were created (a mix of console-created instances and ad-hoc gcloud commands), nobody thought about disk auto-delete. GCE defaults to auto-delete for a boot disk created as part of the instance creation flow, but several of the instances had secondary data disks attached afterward using gcloud compute instances attach-disk, and nobody followed up with gcloud compute instances set-disk-auto-delete. Those secondary disks stayed at auto-delete=no. Additionally, two instances were created via the console with a boot disk that had been manually reconfigured to auto-delete=no to preserve a custom environment setup. Result: 15 disks at 500GB each, pd-balanced type at $0.10/GB/month, left in place for six months. $0.10 × 500GB × 15 disks = $750/month, or $4,500 over six months. With snapshots that nobody remembered to delete, the six-month total reached $25,200, an average of $4,200/month. No alerting existed for unattached disk resources, and the project had no budget alert threshold configured in Cloud Billing.
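The disk portion of that bill is easy to reproduce. A small sketch, using the figures from this incident (500GB, 15 disks, $0.10/GB/month); pd_cost is a hypothetical helper name and the price is whatever your region and disk type actually charge.

```bash
#!/bin/bash
# Sketch: estimate Persistent Disk spend for a set of identical disks.
# pd_cost is a hypothetical helper; the price per GB is an input, not a lookup.

pd_cost() {
  local size_gb="$1" disks="$2" price_per_gb="$3" months="$4"
  awk -v s="${size_gb}" -v d="${disks}" -v p="${price_per_gb}" -v m="${months}" \
    'BEGIN { printf "%.2f\n", s * d * p * m }'
}

pd_cost 500 15 0.10 1   # one month:  750.00
pd_cost 500 15 0.10 6   # six months: 4500.00
```

Snapshot storage is billed separately, so treat this as a floor for what orphaned disks cost, not the full number.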

Fix

1. Immediately identify all unattached disks. The command covers every zone in the current project, so repeat it with --project for each project you own: gcloud compute disks list --filter='-users:*' --format='table(name,zone,sizeGb,type,status,creationTimestamp)'.
2. Before deleting anything, snapshot disks that might contain recoverable data: gcloud compute disks snapshot DISK_NAME --zone=ZONE --snapshot-names=recovery-$(date +%Y%m%d).
3. Delete confirmed orphaned disks: gcloud compute disks delete DISK_NAME --zone=ZONE --quiet.
4. Set up a Cloud Scheduler job that triggers a Cloud Function weekly to list unattached disks older than 7 days and post an alert to Slack with a delete-confirmation workflow.
5. Keep boot-disk auto-delete enabled (--boot-disk-auto-delete) in all VM provisioning scripts, Terraform modules, and gcloud wrappers, and run gcloud compute instances set-disk-auto-delete for disks attached later. Make auto-delete the default that requires a documented exception to override, not the other way around.
6. Configure a billing alert at 110% and 150% of expected monthly spend in Cloud Billing. You want the anomaly notification before the monthly invoice, not with it.
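The "older than 7 days" filter in the weekly job can be sketched in plain shell. This assumes the disk list is exported as CSV with the creation timestamp in the third column; disks_older_than is a hypothetical helper, and comparing only the date part ignores time zones, which is fine for a 7-day threshold.

```bash
#!/bin/bash
# Sketch: keep only disks created on or before a cutoff date.
# Input:  CSV lines "name,zone,creationTimestamp" on stdin (ISO 8601 timestamps).
# Output: the lines whose creation date is on or before the cutoff.
# disks_older_than is a hypothetical helper name.

disks_older_than() {
  local cutoff="$1"   # YYYY-MM-DD
  awk -F',' -v cutoff="${cutoff}" '
    {
      created = substr($3, 1, 10)        # keep only the YYYY-MM-DD part
      if (created <= cutoff) print $0    # ISO dates compare correctly as strings
    }'
}

# Example (GNU date syntax for the cutoff; requires an authenticated gcloud):
#   gcloud compute disks list --filter="-users:*" \
#     --format="csv[no-heading](name,zone.basename(),creationTimestamp)" \
#     | disks_older_than "$(date -d '7 days ago' +%F)"
```

The output is a candidate list for the Slack alert, not a delete list. Deletion should still go through a human confirmation.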

Key lesson

Deleting a VM deletes only the disks that have auto-delete enabled. Boot disks created with the instance default to auto-delete; secondary disks attached afterward do not. This is the single most common source of unexpected GCE costs
The Compute Engine VM list shows zero instances but says nothing about disks — always check the Disks tab separately after any cleanup operation
Billing alerts are not optional infrastructure — configure them on day one, before the first resource is provisioned, not after the first surprise invoice
Auto-delete=yes should be the default in all provisioning automation — disabling it should require a comment in code explaining why the disk needs to outlive the VM
Secondary disks attached after VM creation do not inherit the boot disk's auto-delete setting — each disk attachment must be configured explicitly

Production debug guide

The failures that actually happen in production Compute Engine deployments, and the commands that cut through them

6 entries

Symptom · 01

VM unreachable via SSH after creation — connection times out or refused

→

Fix

Work through this in order: first verify the VM has an external IP (if you expect one) with 'gcloud compute instances describe VM_NAME --zone=ZONE --format="get(networkInterfaces[0].accessConfigs[0].natIP)"'. If the IP is blank, the VM was created without an external IP, so use IAP tunneling instead: 'gcloud compute ssh VM_NAME --zone=ZONE --tunnel-through-iap'. If an external IP exists, check firewall rules: 'gcloud compute firewall-rules list --filter="network=default" --sort-by=priority'. SSH requires an ingress rule allowing tcp:22. If OS Login is enabled on the project, confirm the connecting user has roles/compute.osLogin (or roles/compute.osAdminLogin for sudo access). Users from outside your organization additionally need roles/compute.osLoginExternalUser, granted at the organization level. OS Login replaces SSH key management and is a common source of confusion when it's enabled project-wide but the user hasn't been granted the IAM role.

Symptom · 02

VM performance degrades randomly during business hours — latency spikes without increased traffic

→

Fix

First determine if you're on a shared-core machine: 'gcloud compute instances describe VM_NAME --zone=ZONE --format="get(machineType)"'. E2-micro, E2-small, and E2-medium use shared physical CPUs and can be throttled when the host is under load from neighboring VMs. This is the noisy-neighbor problem in practice. Check CPU utilization in Cloud Monitoring and look for short bursts followed by a hard plateau, the signature of a shared-core VM that has used up its burst allowance. If you're on E2-standard or larger and still seeing degradation, check disk I/O wait: SSH into the VM and run 'iostat -x 1'. If %iowait is consistently above 20%, the disk type is the bottleneck, not the CPU. Cross-reference with 'gcloud compute disks describe DISK_NAME --zone=ZONE --format="get(type)"' to confirm the disk tier.

Symptom · 03

Disk I/O bottleneck — database queries slow, high latency on writes

→

Fix

Check disk type immediately: 'gcloud compute disks describe DISK_NAME --zone=ZONE --format="get(type,sizeGb)"'. pd-standard scales at roughly 0.75 read IOPS/GB and 1.5 write IOPS/GB, so a 100GB pd-standard gives you on the order of 75 read IOPS, which a database under moderate load will saturate almost immediately. pd-balanced scales at 6 IOPS/GB. pd-ssd gives 30 IOPS/GB (3,000 read IOPS on 100GB). These per-GB figures change over time, so confirm them against the current Persistent Disk performance documentation. For database workloads, pd-ssd is almost always the correct choice. Also verify you are not hitting the per-VM cap: IOPS are throttled at the VM level too, not just the disk level. Check the machine type's I/O limit in the GCE documentation and compare against measured throughput.
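The size-times-rate arithmetic, with the per-VM cap applied, looks roughly like this. Treat it as a sketch: estimate_iops is a hypothetical helper, the 15,000 cap in the example is a made-up placeholder, and both the per-GB rate and the real cap must come from the current GCE documentation for your disk and machine type.

```bash
#!/bin/bash
# Sketch: estimate disk IOPS as size x per-GB rate, limited by a per-VM cap.
# estimate_iops is a hypothetical helper; rate and cap are inputs you look up.

estimate_iops() {
  local size_gb="$1" iops_per_gb="$2" vm_cap="$3"
  awk -v s="${size_gb}" -v r="${iops_per_gb}" -v cap="${vm_cap}" 'BEGIN {
    iops = s * r
    if (cap > 0 && iops > cap) iops = cap   # the VM-level limit wins
    printf "%d\n", iops
  }'
}

estimate_iops 100 30 15000    # 100GB at 30 IOPS/GB: 3000
estimate_iops 1000 30 15000   # 1TB would be 30000, capped at 15000
```

The second call is the case people miss: past a certain size, a bigger disk buys nothing until the machine type changes.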

Symptom · 04

Preemptible VM terminated mid-job — batch processing incomplete with no checkpoint

→

Fix

Confirm it was a preemption and not an application crash: 'gcloud compute operations list --filter="operationType=compute.instances.preempted"'. If the instance shows up in that list, the job needs checkpoint support. No configuration change will prevent GCE from reclaiming preemptible or Spot VMs. The correct architectural response: implement checkpoint writes to Cloud Storage every N minutes (N depends on job duration and acceptable redo cost). Use gsutil cp or the Cloud Storage client library to write a progress file atomically. At startup, check for an existing checkpoint before starting from scratch. For jobs that cannot be checkpointed, use a Standard VM or run a small pool of Standard VMs as fallback for the final stage of processing.
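The checkpoint-and-resume loop can be sketched locally before wiring it to Cloud Storage. Everything here is an assumption for illustration: the function names, the checkpoint path, and the idea that progress is a single item index. On a real preemptible VM the checkpoint file would also be copied to a bucket after each write, since local disk state may not survive.

```bash
#!/bin/bash
# Sketch: resume a batch job from the last completed item.
# Function names and the checkpoint path are hypothetical.

CHECKPOINT_FILE="${CHECKPOINT_FILE:-/tmp/batch-job.checkpoint}"

save_checkpoint() {
  # Write to a temp file, then rename: readers never see a half-written file.
  local tmp="${CHECKPOINT_FILE}.tmp.$$"
  printf '%s\n' "$1" > "${tmp}"
  mv "${tmp}" "${CHECKPOINT_FILE}"
}

load_checkpoint() {
  # Print the last completed item index, or 0 when no checkpoint exists.
  if [ -f "${CHECKPOINT_FILE}" ]; then
    cat "${CHECKPOINT_FILE}"
  else
    echo 0
  fi
}

process_items() {
  local total="$1" start i
  start="$(load_checkpoint)"
  i=$((start + 1))
  while [ "${i}" -le "${total}" ]; do
    echo "processing item ${i}"
    save_checkpoint "${i}"
    i=$((i + 1))
  done
}
```

Checkpointing after every item is the simplest form. For cheap items, checkpoint every N items or every few minutes instead, trading a little redo work for fewer writes.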

Symptom · 05

VM cannot reach other VMs on the same VPC — internal service calls time out

→

Fix

Verify both VMs are in the same VPC network first. The same VPC name does not guarantee connectivity if they're in different shared VPC host projects or different subnets with different firewall rules. Run 'gcloud compute firewall-rules list --filter="network=YOUR_NETWORK" --sort-by=priority' and check for deny rules that might be blocking internal traffic before the allow-internal rule matches. GCE firewall rules are evaluated by priority (lower number = higher priority), so a deny rule at priority 500 overrides an allow rule at priority 1000. Enable VPC Flow Logs on the subnet to see which flows actually reach each VM: 'gcloud compute networks subnets update SUBNET_NAME --region=REGION --enable-flow-logs'. Flow Logs do not record firewall verdicts, so to see what is being denied, turn on Firewall Rules Logging for the suspect rule with 'gcloud compute firewall-rules update RULE_NAME --enable-logging' and search Cloud Logging for entries with jsonPayload.disposition="DENIED".
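To make the priority rule concrete, here is a toy evaluator. It is a deliberate simplification: it assumes every rule you feed it already matches the traffic in question, and it only models the ordering (lowest priority number wins, and deny beats allow on a tie). winning_rule and the rule names are hypothetical.

```bash
#!/bin/bash
# Sketch: pick the firewall rule that decides, given rules that all match.
# Input lines: "<priority> <allow|deny> <rule-name>".
# winning_rule is a hypothetical helper, not a gcloud feature.

winning_rule() {
  # Sort by priority ascending; on equal priority, "deny" sorts before "allow".
  sort -k1,1n -k2,2r | head -n 1
}

printf '%s\n' \
  "1000 allow allow-internal" \
  "500 deny block-legacy-range" \
  | winning_rule
# prints: 500 deny block-legacy-range
```

If the winner is a deny rule you didn't expect, that is the rule to enable logging on.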

Symptom · 06

Managed Instance Group rolling update stuck — instances not replacing, update percentage frozen

→

Fix

Get the current MIG status including any errors: 'gcloud compute instance-groups managed describe MIG_NAME --zone=ZONE'. Look for the currentActions field. If it shows 'creating' or 'verifying' for an extended period, the new instances are failing health checks. Check the health check attached to the backend service in front of the MIG: 'gcloud compute backend-services describe BACKEND_NAME --global --format="get(healthChecks)"'. SSH into one of the new instances and manually hit the health check endpoint. If it returns a non-200 response or times out, that's why the update is stuck. The MIG will not proceed with the rolling update while the maxUnavailable threshold would be exceeded by both current unhealthy instances and unavailable-during-update instances. Fix the application health check endpoint first, then the update will resume automatically.
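When you hit the endpoint by hand, one request proves little if the app is still warming up. A small retry loop gives a clearer answer. This is a sketch: wait_until_healthy is a hypothetical helper, and the port and /healthz path in the example are assumptions, so substitute whatever your health check is configured to probe.

```bash
#!/bin/bash
# Sketch: retry a probe command until it succeeds or attempts run out.
# wait_until_healthy is a hypothetical helper name.

wait_until_healthy() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  while [ "${n}" -le "${attempts}" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after ${n} attempt(s)"
      return 0
    fi
    sleep "${delay}"
    n=$((n + 1))
  done
  echo "still unhealthy after ${attempts} attempt(s)" >&2
  return 1
}

# Example on the instance (curl -f fails on HTTP 4xx/5xx responses):
#   wait_until_healthy 10 3 curl -fsS http://localhost:80/healthz
```

If this succeeds only after many attempts, the fix may be a longer initial delay on the health check, not a change to the application.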

★ GCE Emergency Debug Cheat Sheet

When Compute Engine resources misbehave in production, run these commands in order. Match your symptom to the block; don't start at the bottom and work up.

VM is running but completely unreachable: SSH times out, HTTP returns nothing, ping drops 100%

Immediate action

Verify external IP exists and firewall allows the traffic you expect. These are the two most common causes of complete unreachability and take 60 seconds to rule out.

Commands

gcloud compute instances describe VM_NAME --zone=ZONE --format='get(networkInterfaces[0].accessConfigs[0].natIP,status,networkInterfaces[0].network)'

gcloud compute firewall-rules list --filter='network=default AND direction=INGRESS' --sort-by=priority --format='table(name,priority,sourceRanges,allowed[].map().firewall_rule(),disabled)'

Fix now

If natIP is empty, the VM has no external IP — access it via IAP: 'gcloud compute ssh VM_NAME --zone=ZONE --tunnel-through-iap'. If the firewall list has no rule allowing your traffic, add one: 'gcloud compute firewall-rules create allow-ssh-ingress --allow=tcp:22 --source-ranges=35.235.240.0/20 --network=default' (the 35.235.240.0/20 range is Google's IAP range — prefer IAP over 0.0.0.0/0 for SSH). If the VM shows RUNNING but is still unreachable after firewall verification, check the serial console for boot errors: 'gcloud compute instances get-serial-port-output VM_NAME --zone=ZONE'.

GCP bill spiked unexpectedly: costs doubled or tripled month-over-month with no new feature deployments

Live Migration triggered: VM shows performance degradation, monitoring shows latency spike

Managed Instance Group not autoscaling: instances stuck at minimum count despite high CPU utilization

On-Premise Servers vs Compute Engine (GCE)

Aspect	On-Premise Servers	Compute Engine (GCE)
Provisioning Time	2-8 weeks: purchase order, vendor fulfillment, shipping, rack installation, OS setup, network configuration. A capacity planning mistake means waiting another cycle.	25-45 seconds via gcloud CLI or API. VM is RUNNING before the provisioning command finishes scrolling. Mistakes cost seconds to undo, not weeks.
Scaling	Manual and hardware-bounded. Adding capacity means ordering new servers. Scaling down means decommissioning hardware that's already been paid for, and CapEx doesn't refund.	Managed Instance Groups autoscale based on CPU utilization, custom metrics, or schedules. Scale-out and scale-in happen automatically within minutes, and you pay only for running instances.
Host Maintenance	Scheduled maintenance windows requiring planned downtime. Hardware failures mean unplanned outages until replacement hardware arrives or a spare is swapped in.	Live Migration transparently moves running VMs to healthy hosts during maintenance — most workloads see zero downtime. Hardware failures are handled by Google without operator involvement.
Cost Model	Capital Expenditure (CapEx): pay full hardware cost upfront, depreciate over 3-5 years, carry idle capacity as sunk cost. Utilization below 100% is money spent on capacity you're not using.	Operating Expenditure (OpEx): pay per second of use, billed monthly. Per-second billing means a stopped or deleted VM stops costing compute immediately, though a running VM that sits idle still bills, and so do its disks. Committed Use Discounts (1 or 3 year) offer 37-55% savings for predictable workloads without hardware commitment.
Security and Compliance	Physical security responsibility is yours: facility access, hardware disposal, BIOS/firmware security. Boot integrity requires custom tooling. Compliance audits cover physical controls.	Shielded VMs provide Secure Boot, vTPM-based Measured Boot, and Integrity Monitoring out of the box. Sole-tenant nodes provide physical isolation for compliance requirements (HIPAA, PCI-DSS). Google handles physical facility security and hardware disposal.
Operational Overhead	Your team manages hardware refresh cycles, firmware updates, failed drive replacement, datacenter networking, and power/cooling. These are real engineering hours that don't ship features.	Google manages physical hardware, networking infrastructure, and hypervisor security. Your team manages OS-level and above. Managed services (Cloud SQL, GKE) shift OS management to Google as well.

⚙ Quick Reference

4 commands from this guide

File	Command / Code	Purpose
iothecodeforgegceProvisionVM.sh	gcloud compute instances create forge-web-server-01 \	What Is Google Cloud Compute Engine and Why Does It Exist?
iothecodeforgegceLifecycleManagement.sh	DISK_NAME="boot-disk"	Common Mistakes and How to Avoid Them
find-orphaned-disks.sh	PROJECT="prod-east-12345"	Why Your VMs Are Leaking Money (and How to Fix It)
create-vm-with-healthy.sh	STARTUP_SCRIPT='#!/bin/bash	The Silent Killer

Key takeaways

GCE is a full IaaS platform: you get kernel-level control, custom OS images, GPU access, and persistent storage. That control comes with OS-level operational responsibility that serverless platforms eliminate. Choose based on whether you actually need what GCE provides, not because VMs are familiar.

Machine families are purpose-built and the wrong choice has real performance and cost consequences. E2 shared-core instances can be throttled by neighboring VMs. C3 instances deliver consistent per-core performance for compute-bound workloads. M2 instances provide up to 12TB RAM for workloads that cannot be sharded. Match the machine family to the workload characteristics before provisioning.

Custom Machine Types are the correct answer when predefined types don't fit your CPU-to-RAM ratio. Paying for 16GB RAM when your application uses 10GB means paying for 60% more memory than you use. Custom Machine Types let you specify exactly what you need in 256MB RAM increments.
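Both numbers in that takeaway are quick to compute. A minimal sketch with hypothetical helper names; it only does the rounding and the percentage, and does not check the per-vCPU memory limits that each machine family imposes on custom types.

```bash
#!/bin/bash
# Sketch: size custom-machine-type memory and quantify over-provisioning.
# Helper names are hypothetical. Per-family memory-per-vCPU limits are not checked.

round_up_256mb() {
  # Round a memory requirement in MB up to the next multiple of 256MB.
  local needed_mb="$1"
  echo $(( (needed_mb + 255) / 256 * 256 ))
}

overpay_percent() {
  # How much more memory you pay for than you use, as a whole percentage.
  local provisioned_gb="$1" used_gb="$2"
  echo $(( (provisioned_gb - used_gb) * 100 / used_gb ))
}

round_up_256mb 10000    # 10240
overpay_percent 16 10   # 60
```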

Live Migration is a genuine operational advantage: it means Google's host maintenance doesn't become your application's maintenance window. For GPU instances and workloads that set TERMINATE policy, architect for instance-level failure using Managed Instance Groups with auto-healing instead.

Persistent Disk auto-delete behavior is the most common source of unexpected GCE costs. Deleting a VM deletes only the disks that have auto-delete enabled: boot disks created with the instance default to it, secondary disks attached later do not. Make auto-delete the default in all provisioning automation, and run 'gcloud compute disks list --filter="-users:*"' as part of every environment teardown checklist.

The Default Compute Service Account with Editor role is a project-wide security risk attached to every VM that doesn't specify a custom SA. Create dedicated service accounts with least-privilege IAM for every VM or VM group in production. This is a 5-minute setup task that eliminates an entire category of breach blast radius.

Common mistakes to avoid

6 patterns

Not using Managed Instance Groups for production workloads

Symptom

A single VM serves production traffic. It crashes at 3am due to an OOM event. Traffic drops to zero. An on-call engineer is paged, diagnoses the issue, and manually recreates the VM — total downtime: 18 minutes. The next week it happens again because the underlying cause (a memory leak) wasn't fixed. Manual VM management means every failure requires human intervention, and single-VM deployments have no redundancy.

Fix

Use Managed Instance Groups for all production traffic-serving workloads. Create an instance template that captures your VM configuration, then create a regional MIG from it so the instances spread across zones: 'gcloud compute instance-groups managed create forge-web-mig --template=forge-web-template --size=3 --region=us-central1'. A MIG created with --zone instead lives in that single zone. Configure a health check so GCE auto-heals unhealthy instances: 'gcloud compute instance-groups managed update forge-web-mig --health-check=forge-http-health-check --initial-delay=60 --region=us-central1'. A MIG with 3 instances across 2+ zones gives you redundancy, auto-healing, and the foundation for rolling deployments, all from a single configuration.

Hardcoding internal IP addresses in application configuration

Symptom

Service A connects to Service B using the internal IP 10.128.0.5 hardcoded in a configuration file. A maintenance event replaces the Service B VM (new VM, new internal IP: 10.128.0.8). Service A's connection pool starts timing out. The error is 'connection refused' — the old IP is gone. Finding and updating every configuration file that referenced the old IP takes 40 minutes. This repeats every time Service B's VM is replaced.

Fix

Never hardcode VM internal IPs anywhere — not in config files, not in environment variables, not in database records. Use one of three stable alternatives: (1) Cloud DNS with an internal DNS zone — create a record for service-b.internal.thecodeforge.io pointing to the current IP, update the DNS record when the VM changes; (2) Internal Load Balancer — the ILB IP is stable even as backend VMs are replaced; (3) For GKE-backed services, Kubernetes Service ClusterIP provides stable internal addressing regardless of pod replacement. The pattern to remember: applications should discover services by name, not by IP.

Running VMs with the Default Compute Service Account

Symptom

The Default Compute Service Account has the project Editor role. A web application running on a GCE VM has a Server-Side Request Forgery (SSRF) vulnerability. An attacker exploits it to make requests to the GCE metadata server at http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token, obtaining a valid OAuth2 token with Editor access to the entire project. The attacker uses the token to exfiltrate Cloud Storage buckets, access Cloud SQL databases, and create new VMs for mining. The breach affects every service in the project, not just the compromised VM.

Fix

Create a dedicated service account for each VM or group of VMs with only the permissions that workload requires. If a web server only needs to write to Cloud Logging and read from one Cloud Storage bucket, create an SA with only roles/logging.logWriter and roles/storage.objectViewer on that specific bucket. Attach it at VM creation: '--service-account=forge-web-sa@thecodeforge-prod.iam.gserviceaccount.com'. To defuse the default SA itself, enforce the Organization Policy constraint constraints/iam.automaticIamGrantsForDefaultServiceAccounts, which stops default service accounts from being granted the Editor role automatically. Do not reach for constraints/iam.disableServiceAccountCreation here: it blocks creation of the dedicated service accounts you need. With the automatic Editor grant gone, a VM left on the default SA can do almost nothing, which makes a custom SA the practical requirement rather than an option.

Running dev and staging VMs 24/7

Symptom

A team has 10 developer VMs and 3 staging VMs, all running around the clock. Developers use them from 8am to 7pm. From 7pm to 8am, 13 hours a day, the VMs sit idle consuming compute budget. At $0.067/hour for an e2-standard-2, 13 VMs × 13 idle hours × 30 days × $0.067 comes to about $340/month in compute spend that produces no value. Over a year, that's over $4,000 in recoverable waste, and this is a small team.

Fix

Use Instance Schedules to automatically start and stop non-production VMs. The schedule takes cron expressions: 'gcloud compute resource-policies create instance-schedule dev-hours-schedule --region=us-central1 --vm-start-schedule="0 8 * * 1-5" --vm-stop-schedule="0 20 * * 1-5" --timezone=America/Chicago'. Apply the policy: 'gcloud compute instances add-resource-policies VM_NAME --zone=us-central1-a --resource-policies=dev-hours-schedule'. VMs stop at 8pm and start at 8am on weekdays automatically. If a developer needs after-hours access, they start the VM manually. The typical savings is 55-65% of non-production compute spend with almost no change to developer workflow.
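The waste and the savings can both be checked with a few lines of arithmetic. A sketch using the numbers from this example (13 VMs, 13 idle hours a day, $0.067/hour, and a 12-hour weekday schedule); the helper names are hypothetical and the hourly rate is an input you should replace with your own.

```bash
#!/bin/bash
# Sketch: cost of idle VM hours and the savings from a start/stop schedule.
# Helper names are hypothetical; the hourly rate is an input, not a lookup.

idle_cost() {
  local vms="$1" idle_hours_per_day="$2" days="$3" hourly_rate="$4"
  awk -v v="${vms}" -v h="${idle_hours_per_day}" -v d="${days}" -v r="${hourly_rate}" \
    'BEGIN { printf "%.2f\n", v * h * d * r }'
}

schedule_savings_percent() {
  # Share of the 168-hour week during which the VM is stopped.
  local hours_on_per_day="$1" days_on_per_week="$2"
  echo $(( (168 - hours_on_per_day * days_on_per_week) * 100 / 168 ))
}

idle_cost 13 13 30 0.067        # 339.69
schedule_savings_percent 12 5   # 64
```

The savings figure covers compute only. Disks attached to stopped VMs keep billing.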

Using ephemeral IPs for production-facing endpoints

Symptom

A production API server has an ephemeral external IP. GCE performs host maintenance and live-migrates the VM, and the IP is preserved during migration. The following week the team stops and starts the VM manually for an OS patch. A stop releases the ephemeral IP, so the VM comes back with a different one. DNS still points to the old IP. API clients start receiving connection errors. The on-call engineer doesn't immediately recognize that the IP changed because the VM shows RUNNING in the console. Resolution requires finding the new IP, updating the DNS record, and waiting for TTL propagation: 25 minutes of downtime for a 30-second one-time fix that should have been done at provisioning time.

Fix

Reserve a Static External IP before pointing DNS to any production VM: 'gcloud compute addresses create forge-api-ip --region=us-central1'. If the VM already has an ephemeral external IP, remove that access config first: 'gcloud compute instances delete-access-config VM_NAME --zone=us-central1-a --access-config-name="External NAT"'. Then attach the reserved address: 'gcloud compute instances add-access-config VM_NAME --zone=us-central1-a --access-config-name="External NAT" --address=$(gcloud compute addresses describe forge-api-ip --region=us-central1 --format="get(address)")'. The static IP survives VM restarts, replacements, and re-creation. If the VM is replaced by a MIG rolling update, attach the static IP to the load balancer frontend instead, because the LB IP is stable regardless of backend VM changes.

Not enabling Shielded VM features on production instances

Symptom

A production VM shows unexpected processes in 'ps aux' that weren't deployed by the team. Investigation reveals a kernel-level rootkit that survived an OS reinstall because it's embedded in the bootloader. Standard security tools report a clean system because the rootkit operates below the OS layer. Forensics cannot determine when the compromise occurred because there's no boot integrity baseline to compare against. The VM must be destroyed and rebuilt from a known-good image — and the team has no confidence that other VMs aren't similarly compromised.

Fix

Enable all three Shielded VM features at VM creation: '--shielded-secure-boot' prevents unsigned UEFI firmware and bootloaders from executing; '--shielded-vtpm' enables a virtual Trusted Platform Module that records the boot sequence hash (Measured Boot); '--shielded-integrity-monitoring' compares each boot's measurements against the established baseline and flags deviations in Cloud Monitoring. If Integrity Monitoring reports a violation, treat the VM as compromised: isolate it and investigate rather than trusting it. These features have negligible performance impact and are included at no additional cost, so there is no valid argument for disabling them in production.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

How does GCE's Live Migration work mechanically, and how should you arch...

Q02SENIOR

You are architecting a batch ML training pipeline that runs for 4-6 hour...

Q03SENIOR

Explain the difference between Persistent Disk types and Local SSD. For ...

Q04SENIOR

What is a Managed Instance Group and how do you perform a zero-downtime ...

Q05SENIOR

How do Sole-tenant Nodes differ from standard multi-tenant GCE, and what...

Q06JUNIOR

What is the GCE Metadata Server, how does it enable secure credential ac...

Q01 of 06SENIOR

How does GCE's Live Migration work mechanically, and how should you architect applications to handle the cases where Live Migration isn't possible?

ANSWER

Live Migration is a hypervisor-level operation where Google moves a running VM from one physical host to another without shutting it down. Mechanically: GCE pre-copies the VM's memory pages to the destination host while the VM is still running on the source. When the delta (the pages modified during copying) is small enough, GCE briefly pauses the VM — typically 10-100 milliseconds — transfers the remaining state, and resumes execution on the new host. The VM's IP addresses, MAC address, memory state, and disk attachments are preserved. The application sees a momentary performance dip or a brief pause in network responses, but not a reboot. Live Migration is not available for GPU instances, instances configured with --maintenance-policy=TERMINATE, or certain instance types using local SSDs. For these cases, architect for instance-level failure: deploy behind a Managed Instance Group with auto-healing so a terminated VM is replaced automatically. Make the application stateless — any state that must survive instance replacement lives in Cloud SQL, Cloud Storage, Memorystore, or another managed service, not on local disk. Implement graceful shutdown handling (SIGTERM → drain connections → exit cleanly) so in-flight requests complete before the VM is terminated. Set minReadySec in MIG update policies so replacement instances have time to warm up before receiving traffic.
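The SIGTERM → drain → exit pattern from the answer can be sketched in a few lines of shell. This is an illustration under assumptions, not a production server: the function names are hypothetical, and "requests" here are just arguments processed one at a time.

```bash
#!/bin/bash
# Sketch: stop taking new work once SIGTERM arrives, then exit cleanly.
# Function names are hypothetical; requests are modeled as plain arguments.

shutdown_requested=0

on_term() {
  # Only record the request; the main loop decides when it is safe to stop.
  shutdown_requested=1
}
trap on_term TERM

serve_requests() {
  local request
  for request in "$@"; do
    if [ "${shutdown_requested}" -eq 1 ]; then
      echo "shutdown requested, stopping before ${request}"
      return 0
    fi
    echo "handled ${request}"
  done
}
```

The handler never exits on its own. It flips a flag, the request in flight finishes, and the loop stops at the next safe point, which is what lets in-flight work complete before the VM goes away.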

FAQ · 5 QUESTIONS

Frequently Asked Questions

What exactly happens to my data when I delete a VM?

What is the difference between a Zone and a Region in GCE, and how does it affect my architecture?

Can I resize a VM after it's been created, and do I need to stop it?

What is the difference between Preemptible VMs and Spot VMs, and which should I use?

How do I control costs on GCE as the team and workload scale?

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Everything here is grounded in real deployments.

✓ Verified

production tested

July 04, 2026

last updated
