Junior 4 min · March 09, 2026

Compute Engine — Orphaned Disks Burn $4,200/Month

Deleting a GCE VM doesn't delete disks.

Naren · Founder
Quick Answer
  • GCE is Google's IaaS platform — you rent virtual machines on demand instead of buying physical servers
  • Machine families are purpose-built: E2 for dev/test, N2 for balanced production, C2/C3 for compute-heavy workloads
  • Live Migration moves your running VM to a different host during maintenance without rebooting, a long-standing GCE differentiator
  • Preemptible (Spot) VMs cost up to 80% less but can be reclaimed at any time — use them for fault-tolerant batch jobs only
  • Persistent Disks survive VM deletion if auto-delete is disabled — orphaned disks silently accrue costs with no warning
  • The biggest trap: running VMs with the Default Compute Service Account (Editor role) — it's a project-wide security hole waiting to be exploited
Plain-English First

Think of Google Cloud Compute Engine as a high-end virtual computer living in one of Google's data centers — the same infrastructure that keeps YouTube and Google Search running. Instead of buying a physical server, racking it, cabling it, and paying for the electricity yourself, you rent exactly the amount of compute power you need, for exactly as long as you need it.

The mental model I use with engineers new to GCE: imagine you're a contractor who builds houses. Without cloud infrastructure, you'd have to buy every tool before starting a job — drill, saw, scaffolding — and store it all in your garage after. With GCE, you call a tool rental shop, say 'I need a drill and scaffolding for three days,' and return everything when the job is done. You paid for three days of use, not the lifetime cost of the equipment.

The part that surprises most developers: you don't just get a computer. You get a computer connected to Google's private fiber network, with the ability to resize it without buying new hardware, snapshot its disk before a risky deployment, and pay only for the seconds it's actually running. That last part — per-second billing — is what makes cloud-native cost modeling genuinely different from anything you'd do with physical hardware.

Google Cloud Compute Engine (GCE) is the Infrastructure-as-a-Service (IaaS) layer of Google Cloud Platform, and it's one of the most capable — and most frequently misused — services in the GCP catalog. Every time I've joined a new engineering organization running on GCP, the Compute Engine bill is where I find the most recoverable waste and the most preventable incidents.

GCE exists to give you the same infrastructure primitives Google uses internally, exposed through an API. That means you can provision a VM in under 30 seconds, attach and detach persistent disks without rebooting, resize a machine type with a single command, and deploy across 40+ regions worldwide — all without touching physical hardware or filing a procurement request.

This guide covers the real mechanics of GCE: how to provision VMs correctly, which machine families to reach for in different scenarios, how the disk model actually works (and where it silently burns budget), and the security decisions that most tutorials skip entirely. We'll also cover the failure modes that show up in production — the orphaned disk that runs up a $4,200 monthly bill, the ephemeral IP that breaks DNS at 2am, and the Default Compute Service Account that turns a compromised VM into a project-wide breach.

By the end, you'll have both the conceptual foundation and production-grade examples to provision and operate GCE workloads with confidence — and to audit the ones you've inherited.

What Is Google Cloud Compute Engine and Why Does It Exist?

Compute Engine is built on the same physical infrastructure that runs Google Search, Gmail, and YouTube. That's not a marketing claim — it's the architectural reason GCE has capabilities you don't find on competing platforms. Live Migration, Google's global private fiber backbone, and the custom Titanium chip that handles networking and security offloading all came from internal Google infrastructure before they became GCE features.

GCE exists to solve a problem that anyone who has run physical hardware understands viscerally: the gap between the capacity you need today and the capacity you provisioned three months ago when you ordered the hardware. Provisioning a physical server takes weeks of procurement, shipping, racking, cabling, and OS installation. Provisioning a GCE VM takes 25 seconds via the gcloud CLI. That gap — weeks versus seconds — is the entire premise of Infrastructure-as-a-Service.

But GCE is not just 'a VM in the cloud.' The decisions you make at provisioning time — machine family, disk type, service account, network configuration, maintenance policy — have meaningful operational and cost consequences that play out over months. Understanding those decisions is the difference between a GCE deployment that works well and one that generates surprise bills and 2am pages.

The fundamental question GCE answers is: do you need an operating system? If you need kernel-level control, a custom OS image, GPU access, long-running background processes, or a persistent filesystem that behaves like a local disk, GCE is your tool. If you're deploying a containerized stateless HTTP service, Cloud Run or GKE may be a better fit. The choice is not about which is 'better' — it's about matching the abstraction level to the workload requirements.

io/thecodeforge/gce/ProvisionVM.sh · BASH
#!/bin/bash
# io.thecodeforge: Production-grade VM Provisioning via gcloud CLI
# Every flag here is deliberate; see inline comments for rationale.
# Do not remove flags without understanding their security or operational purpose.
#
# Flags are collected in an array: a comment placed between backslash-continued
# lines ends the command early, whereas array elements can be commented freely.
set -euo pipefail

VM_ARGS=(
    --project=thecodeforge-prod
    --zone=us-central1-a

    # Machine type: e2-standard-2 = 2 vCPU, 8GB RAM.
    # For production web APIs, start here and right-size after 2 weeks of metrics.
    --machine-type=e2-standard-2

    # PREMIUM network tier uses Google's private backbone for egress.
    # STANDARD tier uses public internet routing: cheaper but higher latency.
    --network-interface=network-tier=PREMIUM,subnet=default

    # MIGRATE = Live Migration during host maintenance (no reboot).
    # TERMINATE is required for GPU instances; they cannot be live-migrated.
    --maintenance-policy=MIGRATE

    # STANDARD = on-demand pricing. Use SPOT for batch/CI workloads only.
    --provisioning-model=STANDARD

    # Critical: use a custom SA with least-privilege IAM, NOT the default compute SA.
    # The default SA has Editor role on the entire project, a major security risk.
    --service-account=forge-web-sa@thecodeforge-prod.iam.gserviceaccount.com
    --scopes=https://www.googleapis.com/auth/cloud-platform

    # Network tags are used by firewall rules to target specific VMs.
    --tags=http-server,https-server

    # Boot disk configuration:
    # auto-delete=yes: disk is deleted when VM is deleted (prevents orphaned disk charges).
    # pd-balanced: good balance of cost and IOPS for web workloads.
    # 20GB is sufficient for OS + app; don't overprovision disk.
    --create-disk=auto-delete=yes,boot=yes,device-name=boot-disk,image=projects/debian-cloud/global/images/family/debian-12,mode=rw,size=20,type=pd-balanced

    # Shielded VM: prevents boot-level rootkits and provides integrity attestation.
    # Commonly expected by PCI-DSS, HIPAA, and most enterprise security baselines.
    --shielded-secure-boot
    --shielded-vtpm
    --shielded-integrity-monitoring

    # Labels: critical for cost allocation and lifecycle management.
    # Use these to identify and delete resources by environment.
    --labels=env=production,app=frontend,team=platform,owner=sre
)

gcloud compute instances create forge-web-server-01 "${VM_ARGS[@]}"
Output
Created [https://www.googleapis.com/compute/v1/projects/thecodeforge-prod/zones/us-central1-a/instances/forge-web-server-01].
NAME ZONE MACHINE_TYPE INTERNAL_IP EXTERNAL_IP STATUS
forge-web-server-01 us-central1-a e2-standard-2 10.128.0.2 34.135.10.45 RUNNING
# Verify Shielded VM is active:
# gcloud compute instances get-shielded-instance-config forge-web-server-01 --zone=us-central1-a
# shieldedInstanceConfig:
# enableIntegrityMonitoring: true
# enableSecureBoot: true
# enableVtpm: true
VM vs Serverless — The Decision That Determines Your Operational Overhead
  • Use GCE when you need kernel-level control: custom OS builds, kernel parameters (vm.swappiness, tcp_keepalive), eBPF-based networking, or custom kernel modules for high-performance I/O
  • Use GCE for stateful workloads where data locality matters: databases, file servers, ML model serving with large model files, anything that writes to local disk faster than network storage can keep up
  • Use GCE for GPU workloads that need A100 or H100 GPUs or multi-GPU machines: these are attached to GCE instances, and serverless GPU options are far more limited
  • Use Cloud Run for stateless HTTP workloads that need to scale to zero and back up in seconds — if you're not SSH-ing into the machine, you probably don't need GCE
  • The operational cost comparison: GCE requires you to manage OS patches, security hardening, disk monitoring, and capacity planning. Cloud Run offloads all of that. That operational delta is real engineering time — factor it into the decision.
Production Insight
GCE's Live Migration is a real architectural differentiator. On other major clouds, host maintenance more often surfaces as a scheduled reboot or stop/start event that you have to plan around. On GCE, the maintenance event is invisible to most workloads. This matters for stateful applications where a reboot means a cold start, cache warm-up, and connection re-establishment.
Preemptible and Spot VMs are not just a cost option — they're an architectural forcing function. Designing your batch processing to tolerate preemption makes it resilient to all forms of unexpected VM loss, not just preemption. Teams that adopt Spot VMs for CI/CD runners typically end up with better pipeline resilience across the board.
Rule: use Standard VMs for always-on services with SLAs. Use Spot VMs for batch processing, CI/CD runners, and any workload that can checkpoint and resume. The 60-80% cost reduction is significant at scale — a fleet of 50 CI runners at Spot pricing vs Standard pricing is the difference between a manageable infrastructure budget and one that requires quarterly justification.
Key Takeaway
GCE gives you full kernel-level control that no serverless platform can match — but that control comes with operational responsibility for patching, hardening, and lifecycle management that serverless offloads.
Machine families are not interchangeable: E2 shared-core instances can be throttled by neighbors, C3 instances deliver consistent per-core performance that E2 can't guarantee. Choosing the wrong family for a workload means either overpaying or getting unexpected performance variance.
Punchline: if you're running a predefined machine type and Cloud Monitoring shows consistent RAM usage below 60% of your allocation, run 'gcloud compute instances describe VM_NAME --zone=ZONE --format="get(machineType)"' and compare against Custom Machine Type pricing. Switching from n2-standard-8 to a custom 8 vCPU / 20GB machine for a Java service with a 16GB heap can save $80-120/month per instance, which adds up quickly across a fleet.
Choosing the Right GCE Machine Family
If: Dev/test environments, bursty workloads, or cost-sensitive non-production jobs
Use: E2. Shared-core options (e2-micro, e2-small) are available at very low cost. E2-standard gives dedicated vCPUs. Good for workloads with variable CPU needs and tolerance for occasional throttling on shared-core instances.
If: Balanced production workloads: web servers, REST APIs, application servers, build systems
Use: N2 (Intel) or N2D (AMD). Dedicated cores, no noisy-neighbor throttling, good CPU-to-RAM ratio (1:4 vCPU:GB). N2D is typically 10-15% cheaper than N2 for equivalent specs.
If: CPU-bound workloads: compilation, rendering, scientific computing, data transformation pipelines
Use: C2 (Intel Cascade Lake) or C3 (Intel Sapphire Rapids, 2023+). Highest per-core performance in GCE, designed for sustained compute-intensive work. C3 has DDR5 memory and offers meaningfully better per-core throughput than C2.
If: Memory-intensive workloads: in-memory databases, large JVM heap sizes, SAP HANA, Redis with large datasets
Use: M1 (up to 4TB RAM) or M2 (up to 12TB RAM). These are designed specifically for workloads that don't fit in standard memory ratios. Expensive, but the alternative is sharding what could be a single-instance workload.
If: The predefined machine types don't match your workload's CPU-to-RAM ratio
Use: Custom Machine Types. Specify exact vCPU count and RAM in 256MB increments. If you need 6 vCPUs and 20GB RAM, you pay for exactly that. This is often 20-40% cheaper than the next predefined type that fits.
If: ML inference, video transcoding, scientific simulation, or any GPU-accelerated workload
Use: A2 (A100 GPUs), G2 (L4 GPUs for inference), or A3 (H100 GPUs for large model training). GPU availability varies by region, so check quotas before designing architecture around a specific GPU type in a specific region.
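
As a rough illustration of the Custom Machine Type option, the sketch below checks a memory size against the 256MB-increment rule and then prints (without running) a create command. The instance name, zone, and N2 family choice are assumptions for the example, and the sketch does not check the per-family vCPU and memory-per-vCPU limits, which you would still need to confirm in the documentation.

```shell
#!/bin/bash
# Sketch: validate a custom memory size, then print the gcloud command.
# Names and values are illustrative; nothing is executed against GCP.

# Succeeds when MEMORY_MB is a positive multiple of 256.
is_valid_custom_memory() {
    memory_mb="$1"
    [ "$memory_mb" -gt 0 ] && [ $((memory_mb % 256)) -eq 0 ]
}

# Prints the create command for a custom N2 machine (hypothetical zone).
custom_vm_command() {
    name="$1"; vcpus="$2"; memory_mb="$3"
    if ! is_valid_custom_memory "$memory_mb"; then
        echo "error: ${memory_mb}MB is not a multiple of 256MB" >&2
        return 1
    fi
    echo "gcloud compute instances create ${name}" \
         "--zone=us-central1-a" \
         "--custom-vm-type=n2 --custom-cpu=${vcpus} --custom-memory=${memory_mb}MB"
}

# 6 vCPUs and 20GB (20480MB), the example from the table above.
custom_vm_command forge-api-01 6 20480
```

Printing the command first makes it easy to review the shape before paying for it; paste the output into a terminal once the sizes look right.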

Common Mistakes and How to Avoid Them

Compute Engine is permissive by default in ways that create operational problems over time. The defaults were chosen to make getting started easy, not to be correct for production. The gap between 'easy to start' and 'correct for production' is where most GCE mistakes live.

The IP address problem is the one I see most often in teams that are new to GCE. Ephemeral external IPs are assigned at VM start time and released when the VM stops. This is fine for dev environments. For a production web server, it means every stop-and-start cycle, planned or unplanned, can change the IP your DNS record points to (an in-place reboot keeps it). The fix is a one-time 30-second operation that most tutorials skip because it doesn't affect the happy path. The cost of skipping it is a 2am DNS debugging session.
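
One way to do that one-time fix is to promote the VM's current ephemeral address to a reserved static address. The sketch below is a dry run under stated assumptions: the address name, IP, and region are placeholders, and with DRY_RUN left at its default the script only prints the command.

```shell
#!/bin/bash
# Sketch: promote an in-use ephemeral external IP to a static reservation.
# DRY_RUN=1 (the default) prints commands instead of executing them.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical arguments: reservation name, the VM's current IP, its region.
promote_ip() {
    address_name="$1"; current_ip="$2"; region="$3"
    # Reserving an address with --addresses set to the in-use IP keeps the
    # same IP on the VM and converts it from ephemeral to static.
    run gcloud compute addresses create "$address_name" \
        --addresses="$current_ip" \
        --region="$region"
}

promote_ip forge-web-ip 34.135.10.45 us-central1
```

The current IP can be read with the describe command used elsewhere in this guide (the natIP field of the first access config). Set DRY_RUN=0 only after checking that the printed command targets the right address and region.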

The service account problem is the security debt that accumulates silently. Every GCE VM needs a service account to authenticate to other GCP services. The path of least resistance is the Default Compute Service Account, which has Editor role on the entire project. This means any process running on that VM — including a compromised web process — can read from any Cloud Storage bucket, write to any Pub/Sub topic, query any Cloud SQL database, and delete any other VM in the project. That's not hypothetical risk. It's the blast radius calculation for a supply-chain attack or a server-side request forgery vulnerability against a service running on that VM.
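
A minimal sketch of the alternative, assuming a hypothetical project, account name, and a single example role: create a dedicated service account, grant it one narrowly scoped role, and attach it to the VM. It runs as a dry run by default and only prints the commands.

```shell
#!/bin/bash
# Sketch: create a least-privilege service account and attach it to a VM.
# DRY_RUN=1 (the default) prints commands instead of executing them.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical project and account names; the role is only an example.
PROJECT="thecodeforge-prod"
SA_NAME="forge-web-sa"
SA_EMAIL="${SA_NAME}@${PROJECT}.iam.gserviceaccount.com"

create_scoped_sa() {
    run gcloud iam service-accounts create "$SA_NAME" \
        --project="$PROJECT" --display-name="forge-web"
    # Grant only what the workload needs, one role at a time.
    run gcloud projects add-iam-policy-binding "$PROJECT" \
        --member="serviceAccount:${SA_EMAIL}" \
        --role="roles/logging.logWriter"
}

attach_sa() {
    vm="$1"; zone="$2"
    # The instance must be stopped before its service account can change.
    run gcloud compute instances stop "$vm" --zone="$zone"
    run gcloud compute instances set-service-account "$vm" --zone="$zone" \
        --service-account="$SA_EMAIL" --scopes=cloud-platform
    run gcloud compute instances start "$vm" --zone="$zone"
}

create_scoped_sa
attach_sa forge-web-server-01 us-central1-a
```

The stop/start step means this is a maintenance-window change for an existing VM; for new VMs, passing the account at creation time avoids the restart entirely.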

The over-provisioning problem is the cost debt that accumulates just as silently. GCE's Right-sizing Recommendations in the console analyze 8 days of CPU and memory utilization and suggest smaller machine types when resources are consistently underused. I've seen teams save 30-40% on compute spend just by reviewing these recommendations quarterly and acting on them.

io/thecodeforge/gce/LifecycleManagement.sh · BASH
#!/bin/bash
# io.thecodeforge: GCE Disk and Instance Lifecycle Management
# Run these as part of pre-deployment and post-teardown procedures.

# ============================================================
# STEP 1: Pre-deployment snapshot
# Take a consistent disk snapshot before any major deployment.
# This is your rollback point — do this before every production change.
# ============================================================
DISK_NAME="boot-disk"
ZONE="us-central1-a"
SNAPSHOT_NAME="pre-deploy-$(date +%Y%m%d-%H%M%S)"

gcloud compute disks snapshot "${DISK_NAME}" \
    --project=thecodeforge-prod \
    --snapshot-names="${SNAPSHOT_NAME}" \
    --zone="${ZONE}" \
    --storage-location=us-central1

echo "Snapshot created: ${SNAPSHOT_NAME}"
echo "To restore: gcloud compute disks create restored-disk --source-snapshot=${SNAPSHOT_NAME} --zone=${ZONE}"

# ============================================================
# STEP 2: Audit orphaned disks before teardown
# Run this BEFORE deleting VMs and AFTER deleting VMs.
# The before-run gives you a baseline. The after-run catches anything missed.
# ============================================================
echo "=== Unattached Disks (potential orphans) ==="
gcloud compute disks list \
    --filter='-users:*' \
    --format='table(name,zone,sizeGb,type,creationTimestamp,status)' \
    --sort-by=~sizeGb

# ============================================================
# STEP 3: Delete dev environment instances by label
# Labels are the correct mechanism for lifecycle management.
# Never maintain a manual list of instance names to delete.
# ============================================================
INSTANCES_TO_DELETE=$(gcloud compute instances list \
    --filter="labels.env=development" \
    --format="value(name,zone.basename())" \
    --sort-by=zone)

if [ -z "${INSTANCES_TO_DELETE}" ]; then
    echo "No development instances found. Nothing to delete."
else
    echo "Instances to delete:"
    echo "${INSTANCES_TO_DELETE}"
    # Delete each instance in its own zone so multi-zone dev environments
    # are handled correctly (a single --zone would miss the other zones).
    while read -r name zone; do
        gcloud compute instances delete "${name}" \
            --zone="${zone}" \
            --quiet < /dev/null
    done <<< "${INSTANCES_TO_DELETE}"
fi

# ============================================================
# STEP 4: Verify no orphaned disks remain after teardown
# If this list is non-empty after deletion, investigate before closing the ticket.
# ============================================================
echo "=== Post-teardown orphan check ==="
gcloud compute disks list \
    --filter='-users:*' \
    --format='table(name,zone,sizeGb,type,creationTimestamp)'
Output
Snapshot created: pre-deploy-20260319-143022
To restore: gcloud compute disks create restored-disk --source-snapshot=pre-deploy-20260319-143022 --zone=us-central1-a
=== Unattached Disks (potential orphans) ===
NAME ZONE SIZE_GB TYPE CREATED
old-data-disk-01 us-central1-a 500 pd-balanced 2025-09-12
stale-boot-02 us-central1-b 20 pd-standard 2025-11-03
Instances to delete:
forge-dev-vm-01 us-central1-a
forge-dev-vm-02 us-central1-a
Deleted [forge-dev-vm-01].
Deleted [forge-dev-vm-02].
=== Post-teardown orphan check ===
NAME ZONE SIZE_GB TYPE CREATED
old-data-disk-01 us-central1-a 500 pd-balanced 2025-09-12
stale-boot-02 us-central1-b 20 pd-standard 2025-11-03
# ACTION REQUIRED: These 2 disks were not deleted. Investigate before closing.
The Default Compute Service Account Is a Project-Wide Security Risk
Every GCE VM is created with a service account. The easy choice — accepting the default — attaches the Default Compute Service Account, which has the Editor role on the entire project. A single compromised process on a VM using this account can read secrets, delete infrastructure, and exfiltrate data from every service in the project. Create a custom service account for every VM with only the IAM roles it actually needs. This is a 5-minute setup task that eliminates an entire category of blast radius risk.
Production Insight
The ephemeral IP problem is predictable and preventable. A static IP attached to a running VM is billed at the same in-use rate as an ephemeral one, so reserving it adds no cost; the higher rate of about $0.01/hour applies only while the address is reserved but unattached. The operational risk of an ephemeral IP on a production endpoint is orders of magnitude more expensive than the reservation fee.
GCE Right-sizing Recommendations are generated automatically and available in the Compute Engine console under the Recommendations section. They require no setup and are based on actual utilization data. Teams that review these monthly and act on them consistently report 25-40% compute cost reductions over 12 months — not from dramatic architectural changes, but from incremental machine type adjustments that add up across a fleet.
Rule: on the first of every month, open the GCE Right-sizing Recommendations panel. Apply any recommendations where the CPU and memory savings are above 20%. Promote any ephemeral IPs on production-facing VMs to static. Check for unattached disks older than 7 days. These three checks, done consistently, eliminate the vast majority of avoidable GCE costs.
Key Takeaway
Ephemeral IPs are the correct default for development and the wrong default for production. The distinction is simple: if a DNS record points to the IP, it must be static. Everything else can be ephemeral.
The Default Compute Service Account is a convenience that becomes a liability the moment a VM is compromised. Spending 5 minutes creating a custom service account with scoped IAM roles eliminates a project-wide blast radius. There is no argument for using the default SA in production.
Punchline: run 'gcloud compute instances list --format="table(name,zone,serviceAccounts[0].email)"' across your production project (the quotes around the format expression stop the shell from interpreting the parentheses). If any row shows the default compute service account (PROJECT_NUMBER-compute@developer.gserviceaccount.com), that VM is over-permissioned and needs a dedicated SA created and attached before the next deployment.
Networking and Cost Optimization Decisions
If: Production web server, API endpoint, or any service with a DNS record pointing to it
Use: Reserve a Static External IP immediately. Cost: no more than an ephemeral IP while attached to a running VM. Risk of not doing it: DNS breaks on every VM stop/start, maintenance event that triggers replacement, or MIG rolling update.
If: Internal microservice that only communicates with other services within the same VPC
Use: No external IP needed; use Internal IPs only. Eliminates the attack surface of a public IP, avoids egress charges for same-zone traffic, and enforces that the service is not accidentally exposed to the internet.
If: Developer needs SSH access to a VM that has no external IP
Use: IAP (Identity-Aware Proxy) tunneling: 'gcloud compute ssh VM_NAME --zone=ZONE --tunnel-through-iap'. IAP authenticates via IAM, with no bastion host, no VPN, and no open SSH port to the internet. This is the correct pattern for all developer access in 2026.
If: VM shows consistent CPU below 40% or RAM below 50% in Cloud Monitoring for 2+ weeks
Use: Check GCE Right-sizing Recommendations in the console. If no recommendation exists yet (requires 8+ days of data), calculate the Custom Machine Type that fits your P95 utilization with 30% headroom and switch to it.
If: Batch job or CI/CD runner that runs for a bounded duration
Use: Spot provisioning model: '--provisioning-model=SPOT'. Design the job to checkpoint progress to Cloud Storage. Cost savings of 60-80% over Standard pricing make this the correct default for all non-always-on workloads.
● Production incident · Post-mortem · Severity: high

Orphaned Persistent Disks silently burn $4,200/month after VM deletion

Symptom
Monthly GCP bill increases by $4,200 with no corresponding increase in running VMs or traffic. Cloud Billing shows Persistent Disk charges growing month-over-month in the dev project. The Compute Engine console VM list is empty — no running instances. The team lead confirms the dev environment was 'shut down weeks ago.' Finance escalates after noticing the charges haven't decreased despite reported cleanup.
Assumption
The team assumed deleting a VM deletes everything associated with it — the same mental model you'd have shutting down a physical server. They opened the Compute Engine console, saw the VM list was empty, and closed the ticket. Nobody checked the Disks tab. Nobody checked the billing breakdown by resource type. The cost anomaly sat undetected because the team had no billing alert configured — they only looked at the bill when it arrived at the end of the month.
Root cause
When the VMs were created (a mix of console-created instances and ad-hoc gcloud commands), nobody checked the auto-delete setting on each disk. Boot disks created as part of the instance creation flow default to auto-delete=yes, but several of the instances had secondary data disks attached afterward using gcloud compute instances attach-disk, which leaves auto-delete off unless it is enabled separately with gcloud compute instances set-disk-auto-delete. Additionally, two instances were created via the console with a boot disk that had been manually reconfigured to auto-delete=no to preserve a custom environment setup. Result: 15 disks at 500GB each, pd-balanced type at $0.10/GB/month, running for six months. $0.10 × 500GB × 15 disks = $750/month, or $4,500 over six months. With snapshots that nobody remembered to delete, the six-month total reached $25,200, an average of $4,200/month. No alerting existed for unattached disk resources, and the project had no budget alert threshold configured in Cloud Billing.
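
The figures in this root cause can be reproduced with a few lines of shell. This is only a worked-arithmetic sketch using the numbers quoted in the incident; the helper name is made up for the example.

```shell
#!/bin/bash
# Sketch: reproduce the incident's cost arithmetic from the quoted figures.

# monthly_disk_cost PRICE_PER_GB_MONTH SIZE_GB DISK_COUNT
monthly_disk_cost() {
    awk -v p="$1" -v s="$2" -v n="$3" 'BEGIN { printf "%.2f\n", p * s * n }'
}

# 15 pd-balanced disks, 500GB each, at $0.10/GB/month.
monthly=$(monthly_disk_cost 0.10 500 15)
six_months=$(awk -v m="$monthly" 'BEGIN { printf "%.2f\n", m * 6 }')

# The reported six-month total, including forgotten snapshots.
reported_total=25200
average=$(awk -v t="$reported_total" 'BEGIN { printf "%.2f\n", t / 6 }')

echo "Disks per month:      \$${monthly}"
echo "Disks over 6 months:  \$${six_months}"
echo "Average monthly bill: \$${average}"
```

The disks alone account for $750 a month; the headline $4,200 monthly figure is the six-month total, snapshots included, averaged per month.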
Fix
1. Immediately identify all unattached disks in every project (the command covers all zones): gcloud compute disks list --filter='-users:*' --format='table(name,zone,sizeGb,type,status,creationTimestamp)'.
2. Before deleting anything, snapshot disks that might contain recoverable data: gcloud compute disks snapshot DISK_NAME --zone=ZONE --snapshot-names=recovery-$(date +%Y%m%d).
3. Delete confirmed orphaned disks: gcloud compute disks delete DISK_NAME --zone=ZONE --quiet.
4. Set up a Cloud Scheduler job that triggers a Cloud Function weekly to list unattached disks older than 7 days and post an alert to Slack with a delete-confirmation workflow.
5. Enforce boot-disk auto-delete (the --boot-disk-auto-delete flag, or auto-delete=yes in --create-disk) in all VM provisioning scripts, Terraform modules, and gcloud wrappers. Make auto-delete the default that requires a documented exception to override, not the other way around.
6. Configure a billing alert at 110% and 150% of expected monthly spend in Cloud Billing. You want the anomaly notification before the monthly invoice, not with it.
Key lesson
  • Deleting a VM does NOT delete its Persistent Disks unless auto-delete was explicitly enabled at disk attachment time — this is the single most common source of unexpected GCE costs
  • The Compute Engine VM list shows zero instances but says nothing about disks — always check the Disks tab separately after any cleanup operation
  • Billing alerts are not optional infrastructure — configure them on day one, before the first resource is provisioned, not after the first surprise invoice
  • Auto-delete=yes should be the default in all provisioning automation — disabling it should require a comment in code explaining why the disk needs to outlive the VM
  • Secondary disks attached after VM creation do not inherit the boot disk's auto-delete setting — each disk attachment must be configured explicitly
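
For the last lesson, a secondary disk's auto-delete setting can be switched on after attachment. The sketch below is a dry run with hypothetical VM and disk names; it prints the command rather than executing it unless DRY_RUN is set to 0.

```shell
#!/bin/bash
# Sketch: enable auto-delete on a disk that is already attached to a VM.
# DRY_RUN=1 (the default) prints commands instead of executing them.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical arguments: VM name, attached disk name, zone.
enable_auto_delete() {
    vm="$1"; disk="$2"; zone="$3"
    run gcloud compute instances set-disk-auto-delete "$vm" \
        --disk="$disk" \
        --zone="$zone" \
        --auto-delete
}

enable_auto_delete forge-dev-vm-01 forge-dev-data-01 us-central1-a
```

Only apply this to disks that genuinely should die with the VM; a data disk that must outlive its instance is exactly the documented exception the lessons above describe.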
Production debug guide
The failures that actually happen in production Compute Engine deployments, and the commands that cut through them (6 entries)
Symptom · 01
VM unreachable via SSH after creation — connection times out or refused
Fix
Work through this in order: first verify the VM has an external IP (if you expect one) with 'gcloud compute instances describe VM_NAME --zone=ZONE --format="get(networkInterfaces[0].accessConfigs[0].natIP)"'. If the IP is blank, the VM was created without an external IP, so use IAP tunneling instead: 'gcloud compute ssh VM_NAME --zone=ZONE --tunnel-through-iap'. If an external IP exists, check firewall rules: 'gcloud compute firewall-rules list --filter=network=default --sort-by=priority'. SSH requires an ingress rule allowing tcp:22. If OS Login is enabled on the project, confirm the connecting user has roles/compute.osLogin (or roles/compute.osAdminLogin for sudo access); users from outside the organization additionally need roles/compute.osLoginExternalUser. OS Login replaces SSH key management and is a common source of confusion when it's enabled project-wide but the user hasn't been granted the IAM role.
Symptom · 02
VM performance degrades randomly during business hours — latency spikes without increased traffic
Fix
First determine if you're on a shared-core machine: 'gcloud compute instances describe VM_NAME --zone=ZONE --format="get(machineType)"'. E2-micro, E2-small, and E2-medium use shared physical CPUs and can be throttled when the host is under load from neighboring VMs. This is the noisy-neighbor problem in practice. Check CPU utilization in Cloud Monitoring and look for short bursts followed by flat, capped plateaus, which is what throttling looks like once the burst allowance is used up. If you're on E2-standard or larger and still seeing degradation, check disk I/O wait: SSH into the VM and run 'iostat -x 1'. If %iowait is consistently above 20%, the disk type is the bottleneck, not the CPU. Cross-reference with 'gcloud compute disks describe DISK_NAME --zone=ZONE --format="get(type)"' to confirm the disk tier.
Symptom · 03
Disk I/O bottleneck — database queries slow, high latency on writes
Fix
Check disk type immediately: 'gcloud compute disks describe DISK_NAME --zone=ZONE --format="get(type,sizeGb)"'. Performance scales with disk size and type. pd-standard delivers roughly 0.75 read IOPS/GB and 1.5 write IOPS/GB, so a 100GB pd-standard disk gets about 75 read IOPS, a ceiling that a database hits under even moderate load. pd-balanced scales at about 6 IOPS/GB and pd-ssd at about 30 IOPS/GB, each on top of a per-instance baseline; confirm the exact figures in the current disk performance documentation. For database workloads, pd-ssd is almost always the correct choice. Also verify that you are not hitting the per-VM cap: IOPS are throttled at the VM level too, not just the disk level. Check the machine type's I/O limit in the GCE documentation and compare against measured throughput.
Symptom · 04
Preemptible VM terminated mid-job — batch processing incomplete with no checkpoint
Fix
Confirm it was a preemption and not an application crash: 'gcloud logging read "resource.type=gce_instance AND jsonPayload.event_type=GCE_PREEMPTED_TERMINATION" --limit=10' (the filter must be quoted so it reaches gcloud as a single argument). Preemptions are also recorded as system operations: 'gcloud compute operations list --filter="operationType=compute.instances.preempted"'. If you see preemption events, the job needs checkpoint support, because no configuration change will prevent GCE from reclaiming Spot VMs. The correct architectural response: implement checkpoint writes to Cloud Storage every N minutes (N depends on job duration and acceptable redo cost). Use 'gcloud storage cp' (or the older gsutil cp) or the Cloud Storage client library to write a progress file atomically. At startup, check for an existing checkpoint before starting from scratch. For jobs that cannot be checkpointed, use a Standard VM or run a small pool of Standard VMs as fallback for the final stage of processing.
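
The checkpoint-and-resume pattern can be sketched locally before wiring it to Cloud Storage. In this sketch a local file stands in for the checkpoint object, the file path and function names are hypothetical, and the second argument to run_batch simulates a preemption by stopping early.

```shell
#!/bin/bash
# Sketch: resumable batch loop. A local file stands in for the checkpoint
# object you would keep in Cloud Storage; all names are hypothetical.

CHECKPOINT_FILE="${CHECKPOINT_FILE:-/tmp/forge-batch.checkpoint}"

# Print the index of the last completed item, or 0 if no checkpoint exists.
load_checkpoint() {
    if [ -f "$CHECKPOINT_FILE" ]; then
        cat "$CHECKPOINT_FILE"
    else
        echo 0
    fi
}

# Write to a temp file and rename, so an interruption mid-write never
# leaves a half-written checkpoint behind.
save_checkpoint() {
    echo "$1" > "${CHECKPOINT_FILE}.tmp"
    mv "${CHECKPOINT_FILE}.tmp" "$CHECKPOINT_FILE"
}

# run_batch TOTAL [MAX_ITEMS]: process items after the checkpoint up to
# TOTAL, stopping after MAX_ITEMS items to simulate a preemption.
run_batch() {
    total="$1"; max_items="${2:-$1}"
    processed=0
    i=$(( $(load_checkpoint) + 1 ))
    while [ "$i" -le "$total" ]; do
        if [ "$processed" -ge "$max_items" ]; then
            echo "interrupted after item $((i - 1))"
            return 1
        fi
        # ... real work for item $i goes here ...
        save_checkpoint "$i"
        processed=$((processed + 1))
        i=$((i + 1))
    done
    echo "completed through item ${total}"
}

rm -f "$CHECKPOINT_FILE"
run_batch 10 4 || true   # first run is cut short
run_batch 10             # second run resumes from the checkpoint
```

In a real job, save_checkpoint and load_checkpoint would copy the file to and from a bucket, and the checkpoint interval would be tuned to how much rework you can tolerate after a preemption.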
Symptom · 05
VM cannot reach other VMs on the same VPC — internal service calls time out
Fix
Verify both VMs are in the same VPC network first. The same VPC name does not guarantee connectivity if they're in different shared VPC host projects or different subnets with different firewall rules. Run 'gcloud compute firewall-rules list --filter=network=YOUR_NETWORK --sort-by=priority' and check for deny rules that might be blocking internal traffic before the allow-internal rule matches. GCE firewall rules are evaluated by priority (lower number = higher priority), so a deny rule at priority 500 overrides an allow rule at priority 1000. To see exactly which connections are being allowed or denied, enable Firewall Rules Logging on the suspect rules: 'gcloud compute firewall-rules update RULE_NAME --enable-logging', then look in Cloud Logging for entries with disposition=DENIED. VPC Flow Logs ('gcloud compute networks subnets update SUBNET_NAME --region=REGION --enable-flow-logs') complement this by showing which flows actually reach the subnet.
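
The priority rule is easy to get backwards, so here is a toy model of it. The sketch assumes every listed rule already matches the traffic in question and uses made-up rule names; it picks the winner by lowest priority number and, on a tie, lets deny beat allow.

```shell
#!/bin/bash
# Sketch: which firewall rule wins? Input lines are "PRIORITY ACTION NAME".
# Assumes every listed rule matches the traffic; rule names are made up.

effective_rule() {
    # Lowest priority number first; on equal priority, "deny" sorts ahead
    # of "allow" because the action column is sorted in reverse.
    sort -k1,1n -k2,2r | head -n 1
}

printf '%s\n' \
    "1000 allow allow-internal" \
    "500 deny block-legacy-subnet" | effective_rule
```

Here the deny rule at priority 500 wins even though the allow rule looks broader, which is the situation described in the fix above.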
Symptom · 06
Managed Instance Group rolling update stuck — instances not replacing, update percentage frozen
Fix
Get the current MIG status including any errors: 'gcloud compute instance-groups managed describe MIG_NAME --zone=ZONE'. Look for the currentActions field. If it shows 'creating' or 'verifying' for an extended period, the new instances are failing health checks. Check the health check configured for the backend: 'gcloud compute backend-services describe BACKEND_NAME --global --format="get(healthChecks)"'. SSH into one of the new instances and manually hit the health check endpoint. If it returns a non-200 response or times out, that's why the update is stuck. The MIG will not replace further instances while doing so would exceed the maxUnavailable threshold, counting both instances that are already unhealthy and those taken down by the update. Fix the application health check endpoint first, then the update will resume automatically.
★ GCE Emergency Debug Cheat Sheet
When Compute Engine resources misbehave in production, run these commands in order. Match your symptom to the block; don't start at the bottom and work up.
VM is running but completely unreachable — SSH times out, HTTP returns nothing, ping drops 100%
Immediate action
Verify external IP exists and firewall allows the traffic you expect. These are the two most common causes of complete unreachability and take 60 seconds to rule out.
Commands
gcloud compute instances describe VM_NAME --zone=ZONE --format='get(networkInterfaces[0].accessConfigs[0].natIP,status,networkInterfaces[0].network)'
gcloud compute firewall-rules list --filter='network=default AND direction=INGRESS' --sort-by=priority --format='table(name,priority,sourceRanges,allowed[].map().firewall_rule(),disabled)'
Fix now
If natIP is empty, the VM has no external IP — access it via IAP: 'gcloud compute ssh VM_NAME --zone=ZONE --tunnel-through-iap'. If the firewall list has no rule allowing your traffic, add one: 'gcloud compute firewall-rules create allow-ssh-ingress --allow=tcp:22 --source-ranges=35.235.240.0/20 --network=default' (the 35.235.240.0/20 range is Google's IAP range — prefer IAP over 0.0.0.0/0 for SSH). If the VM shows RUNNING but is still unreachable after firewall verification, check the serial console for boot errors: 'gcloud compute instances get-serial-port-output VM_NAME --zone=ZONE'.
GCP bill spiked unexpectedly: costs doubled or tripled month-over-month with no new feature deployments
Immediate action
Check the three most common sources of unexpected GCE cost: orphaned persistent disks, unattached reserved IP addresses, and idle running VMs in regions you may have forgotten.
Commands
gcloud compute disks list --filter='-users:*' --format='table(name,zone,sizeGb,type,creationTimestamp)' --sort-by=~sizeGb
gcloud compute addresses list --filter='status!=IN_USE' --format='table(name,region,address,status,creationTimestamp)'
Fix now
For orphaned disks, snapshot anything that might matter first: 'gcloud compute disks snapshot DISK_NAME --zone=ZONE --snapshot-names=pre-delete-$(date +%Y%m%d)', then delete: 'gcloud compute disks delete DISK_NAME --zone=ZONE'. For unused reserved IPs (which cost ~$7.20/month each when unattached): 'gcloud compute addresses delete IP_NAME --region=REGION'. Then immediately go to Cloud Billing and set a budget alert at 110% of last month's spend — this incident should not repeat.
Live Migration triggered: VM shows performance degradation, monitoring shows latency spike
Immediate action
Confirm the event is Live Migration and not an application-level issue or disk bottleneck before acting.
Commands
gcloud logging read 'resource.type=gce_instance AND protoPayload.methodName=compute.instances.migrateOnHostMaintenance AND resource.labels.instance_id=INSTANCE_ID' --limit=5 --format='table(timestamp,protoPayload.methodName,protoPayload.status)'
gcloud compute instances describe VM_NAME --zone=ZONE --format='get(lastStartTimestamp,scheduling.onHostMaintenance,scheduling.automaticRestart)'
Fix now
Live Migration is expected and handled automatically by GCE — the performance dip is typically under one second and requires no intervention. If your workload cannot tolerate even sub-second pauses (real-time trading, HFT, certain gaming backends), set onHostMaintenance=TERMINATE and use a Managed Instance Group with auto-healing so the VM is automatically replaced rather than migrated. Note: GPU instances cannot be live-migrated — they are always TERMINATE on maintenance.
Managed Instance Group not autoscaling: instances stuck at minimum count despite high CPU utilization
Immediate action
Verify the autoscaler is actually configured and active, then check whether utilization metrics are being read correctly.
Commands
gcloud compute instance-groups managed describe MIG_NAME --zone=ZONE --format='get(autoscaler,targetSize,status)'
gcloud compute instance-groups managed list-errors MIG_NAME --zone=ZONE
Fix now
If the autoscaler field is empty, no autoscaler is attached — create one: 'gcloud compute instance-groups managed set-autoscaling MIG_NAME --zone=ZONE --max-num-replicas=10 --target-cpu-utilization=0.6 --cool-down-period=90'. If the autoscaler exists but isn't scaling, check the cooldown period — if instances were recently added, the autoscaler waits for the cooldown to expire before evaluating again (default 60 seconds, but you may have set it higher). If scaling is genuinely needed right now, manually resize: 'gcloud compute instance-groups managed resize MIG_NAME --size=N --zone=ZONE' while you investigate the autoscaler configuration.
On-Premise Servers vs Compute Engine (GCE)
| Aspect | On-Premise Servers | Compute Engine (GCE) |
| --- | --- | --- |
| Provisioning Time | 2-8 weeks: purchase order, vendor fulfillment, shipping, rack installation, OS setup, network configuration. A capacity planning mistake means waiting another cycle. | 25-45 seconds via gcloud CLI or API. VM is RUNNING before the provisioning command finishes scrolling. Mistakes cost seconds to undo, not weeks. |
| Scaling | Manual and hardware-bounded. Adding capacity means ordering new servers. Scaling down means decommissioning hardware that's already been paid for; CapEx doesn't refund. | Managed Instance Groups autoscale based on CPU utilization, custom metrics, or schedules. Scale-out and scale-in happen automatically within minutes, and you pay only for running instances. |
| Host Maintenance | Scheduled maintenance windows requiring planned downtime. Hardware failures mean unplanned outages until replacement hardware arrives or a spare is swapped in. | Live Migration transparently moves running VMs to healthy hosts during maintenance, so most workloads see zero downtime. Hardware failures are handled by Google without operator involvement. |
| Cost Model | Capital Expenditure (CapEx): pay full hardware cost upfront, depreciate over 3-5 years, carry idle capacity as sunk cost. Utilization below 100% is money spent on capacity you're not using. | Operating Expenditure (OpEx): pay per second of use, billed monthly. Per-second billing means a stopped VM incurs no compute charges, though attached disks and reserved IPs continue to bill. Committed Use Discounts (1 or 3 year) offer 37-55% savings for predictable workloads without hardware commitment. |
| Security and Compliance | Physical security responsibility is yours: facility access, hardware disposal, BIOS/firmware security. Boot integrity requires custom tooling. Compliance audits cover physical controls. | Shielded VMs provide Secure Boot, vTPM-based Measured Boot, and Integrity Monitoring out of the box. Sole-tenant nodes provide physical isolation for compliance requirements (HIPAA, PCI-DSS). Google handles physical facility security and hardware disposal. |
| Operational Overhead | Your team manages hardware refresh cycles, firmware updates, failed drive replacement, datacenter networking, and power/cooling. These are real engineering hours that don't ship features. | Google manages physical hardware, networking infrastructure, and hypervisor security. Your team manages OS-level and above. Managed services (Cloud SQL, GKE) shift OS management to Google as well. |

Key takeaways

1. GCE is a full IaaS platform: you get kernel-level control, custom OS images, GPU access, and persistent storage. That control comes with OS-level operational responsibility that serverless platforms eliminate. Choose based on whether you actually need what GCE provides, not because VMs are familiar.
2. Machine families are purpose-built and the wrong choice has real performance and cost consequences. E2 shared-core instances can be throttled by neighboring VMs. C3 instances deliver consistent per-core performance for compute-bound workloads. M2 instances provide up to 12TB RAM for workloads that cannot be sharded. Match the machine family to the workload characteristics before provisioning.
3. Custom Machine Types are the correct answer when predefined types don't fit your CPU-to-RAM ratio. Paying for 16GB RAM when your application uses 10GB leaves 37.5% of that memory unused (you are buying 60% more than you need). Custom Machine Types let you specify exactly what you need in 256MB RAM increments.
4. Live Migration is a genuine operational advantage: it means Google's host maintenance doesn't become your application's maintenance window. For GPU instances and workloads that set TERMINATE policy, architect for instance-level failure using Managed Instance Groups with auto-healing instead.
5. Persistent Disk auto-delete behavior is the most common source of unexpected GCE costs. Deleting a VM does not delete its disks unless auto-delete is enabled on them. Make auto-delete=yes the default in all provisioning automation, and run 'gcloud compute disks list --filter="-users:*"' as part of every environment teardown checklist.
6. The Default Compute Service Account with Editor role is a project-wide security risk attached to every VM that doesn't specify a custom SA. Create dedicated service accounts with least-privilege IAM for every VM or VM group in production. This is a 5-minute setup task that eliminates an entire category of breach blast radius.
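The Custom Machine Type arithmetic from the takeaways can be checked with a short sketch. For the 16GB versus 10GB case, 6GB of 16GB is 37.5% of the provisioned memory sitting unused, which is the same as buying 60% more than the 10GB needed. The helper names below are hypothetical, and the 256MB granularity is the figure stated in the text.

```python
import math

INCREMENT_MB = 256  # custom machine type memory granularity, as stated in the text


def unused_fraction(provisioned_gb, used_gb):
    """Share of the provisioned memory that the application never touches."""
    return (provisioned_gb - used_gb) / provisioned_gb


def round_up_memory_mb(required_mb):
    """Smallest multiple of 256MB that covers the requirement."""
    return math.ceil(required_mb / INCREMENT_MB) * INCREMENT_MB


print(unused_fraction(16, 10))     # fraction of a 16GB allocation wasted at 10GB usage
print(round_up_memory_mb(10300))   # a 10,300MB requirement rounded to a valid size
```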

Common mistakes to avoid

6 patterns

Not using Managed Instance Groups for production workloads

Symptom
A single VM serves production traffic. It crashes at 3am due to an OOM event. Traffic drops to zero. An on-call engineer is paged, diagnoses the issue, and manually recreates the VM — total downtime: 18 minutes. The next week it happens again because the underlying cause (a memory leak) wasn't fixed. Manual VM management means every failure requires human intervention, and single-VM deployments have no redundancy.
Fix
Use Managed Instance Groups for all production traffic-serving workloads. Create an instance template that captures your VM configuration, then create a regional MIG from it so instances are spread across zones: 'gcloud compute instance-groups managed create forge-web-mig --template=forge-web-template --size=3 --region=us-central1'. Configure a health check so GCE auto-heals unhealthy instances: 'gcloud compute instance-groups managed set-autohealing forge-web-mig --health-check=forge-http-health-check --initial-delay=60 --region=us-central1'. A regional MIG with 3 instances across 2+ zones gives you redundancy, auto-healing, and the foundation for rolling deployments, all from a single configuration. A zonal MIG (created with --zone) keeps every instance in one zone and does not survive a zone outage.

Hardcoding internal IP addresses in application configuration

Symptom
Service A connects to Service B using the internal IP 10.128.0.5 hardcoded in a configuration file. A maintenance event replaces the Service B VM (new VM, new internal IP: 10.128.0.8). Service A's connection pool starts timing out. The error is 'connection refused' — the old IP is gone. Finding and updating every configuration file that referenced the old IP takes 40 minutes. This repeats every time Service B's VM is replaced.
Fix
Never hardcode VM internal IPs anywhere — not in config files, not in environment variables, not in database records. Use one of three stable alternatives: (1) Cloud DNS with an internal DNS zone — create a record for service-b.internal.thecodeforge.io pointing to the current IP, update the DNS record when the VM changes; (2) Internal Load Balancer — the ILB IP is stable even as backend VMs are replaced; (3) For GKE-backed services, Kubernetes Service ClusterIP provides stable internal addressing regardless of pod replacement. The pattern to remember: applications should discover services by name, not by IP.

Running VMs with the Default Compute Service Account

Symptom
The Default Compute Service Account has the project Editor role. A web application running on a GCE VM has a Server-Side Request Forgery (SSRF) vulnerability. An attacker exploits it to make requests to the GCE metadata server at http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token, obtaining a valid OAuth2 token with Editor access to the entire project. The attacker uses the token to exfiltrate Cloud Storage buckets, access Cloud SQL databases, and create new VMs for mining. The breach affects every service in the project, not just the compromised VM.
Fix
Create a dedicated service account for each VM or group of VMs with only the permissions that workload requires. If a web server only needs to write to Cloud Logging and read from one Cloud Storage bucket, create an SA with only roles/logging.logWriter and roles/storage.objectViewer on that specific bucket. Attach it at VM creation: '--service-account=forge-web-sa@thecodeforge-prod.iam.gserviceaccount.com'. At the organization level, stop the Default Compute SA from receiving the Editor role automatically by enforcing the Organization Policy constraint constraints/iam.automaticIamGrantsForDefaultServiceAccounts. Avoid constraints/iam.disableServiceAccountCreation for this purpose: it blocks creation of all service accounts, including the dedicated ones you need. With no default Editor grant to fall back on, using a custom SA becomes mandatory, not optional.

Running dev and staging VMs 24/7

Symptom
A team has 10 developer VMs and 3 staging VMs, all running around the clock. Developers use them from 8am to 7pm. From 7pm to 8am (13 hours) the VMs sit idle consuming compute budget. At $0.067/hour for an e2-standard-2, 13 VMs × 13 idle hours × 30 days comes to about $340/month in compute spend that produces no value. Over a year, that's over $4,000 in recoverable waste, and this is a small team.
Fix
Use Instance Schedules to automatically start and stop non-production VMs: 'gcloud compute resource-policies create instance-schedule dev-hours-schedule --region=us-central1 --vm-start-schedule="0 8 * * 1-5" --vm-stop-schedule="0 20 * * 1-5" --timezone=America/Chicago'. The schedules are cron expressions: start at 08:00 and stop at 20:00, Monday through Friday. Apply the policy: 'gcloud compute instances add-resource-policies VM_NAME --zone=us-central1-a --resource-policies=dev-hours-schedule'. VMs stop at 8pm and start at 8am on weekdays automatically. If a developer needs after-hours access, they start the VM manually. The typical savings is 55-65% of non-production compute spend with zero change to developer workflow.
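The idle-cost and savings figures above are simple enough to recompute for your own fleet. The sketch below uses the $0.067/hour e2-standard-2 rate quoted in the symptom as an assumption (check current pricing for your region), and the function names are hypothetical.

```python
HOURLY_RATE_USD = 0.067  # e2-standard-2 rate quoted in the text; an assumption, not live pricing


def monthly_idle_cost(vm_count, idle_hours_per_day, days=30, rate=HOURLY_RATE_USD):
    """Compute spend on hours when nobody is using the VMs."""
    return vm_count * idle_hours_per_day * days * rate


def weekly_off_fraction(on_hours_per_weekday, weekdays=5):
    """Fraction of the week a VM is stopped under a weekday-only schedule."""
    return 1 - (on_hours_per_weekday * weekdays) / (24 * 7)


print(round(monthly_idle_cost(13, 13), 2))   # 13 VMs idle 13 hours a day
print(round(weekly_off_fraction(12), 3))     # 08:00-20:00 on weekdays only
```

A 12-hour weekday schedule keeps the VM off for roughly 64% of the week, which is where the 55-65% savings range comes from once manual after-hours starts are accounted for.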

Using ephemeral IPs for production-facing endpoints

Symptom
A production API server has an ephemeral external IP. GCE performs host maintenance and live-migrates the VM. The IP is preserved during migration, but the team stops and starts the VM the following week for an OS patch. The ephemeral IP is released on stop and a different one is assigned on start. DNS still points to the old IP. API clients start receiving connection errors. The on-call engineer doesn't immediately recognize that the IP changed because the VM shows RUNNING in the console. Resolution requires finding the new IP, updating the DNS record, and waiting for TTL propagation: 25 minutes of downtime for a 30-second one-time fix that should have been done at provisioning time.
Fix
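Reserve a Static External IP before pointing DNS to any production VM: 'gcloud compute addresses create forge-api-ip --region=us-central1'. If the VM already has an ephemeral external IP, remove that access config first: 'gcloud compute instances delete-access-config VM_NAME --zone=us-central1-a --access-config-name="External NAT"' (the name is commonly "External NAT" or "external-nat"; confirm it with 'gcloud compute instances describe'). Then attach the static IP: 'gcloud compute instances add-access-config VM_NAME --zone=us-central1-a --access-config-name="External NAT" --address=$(gcloud compute addresses describe forge-api-ip --region=us-central1 --format="get(address)")'. The static IP survives VM restarts, replacements, and re-creation. If the VM is replaced by a MIG rolling update, attach the static IP to the load balancer frontend instead; the LB IP is stable regardless of backend VM changes.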

Not enabling Shielded VM features on production instances

Symptom
A production VM shows unexpected processes in 'ps aux' that weren't deployed by the team. Investigation reveals a kernel-level rootkit that survived an OS reinstall because it's embedded in the bootloader. Standard security tools report a clean system because the rootkit operates below the OS layer. Forensics cannot determine when the compromise occurred because there's no boot integrity baseline to compare against. The VM must be destroyed and rebuilt from a known-good image — and the team has no confidence that other VMs aren't similarly compromised.
Fix
Enable all three Shielded VM features at VM creation: '--shielded-secure-boot' prevents unsigned UEFI firmware and bootloaders from executing; '--shielded-vtpm' enables a virtual Trusted Platform Module that records the boot sequence hash (Measured Boot); '--shielded-integrity-monitoring' compares each boot's measurements against the established baseline and flags deviations in Cloud Monitoring. If Integrity Monitoring reports a violation, the VM is quarantined and investigated rather than trusted. These features have negligible performance impact and are included at no additional cost — there is no valid argument for disabling them in production.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
How does GCE's Live Migration work mechanically, and how should you arch...
Q02 · SENIOR
You are architecting a batch ML training pipeline that runs for 4-6 hour...
Q03 · SENIOR
Explain the difference between Persistent Disk types and Local SSD. For ...
Q04 · SENIOR
What is a Managed Instance Group and how do you perform a zero-downtime ...
Q05 · SENIOR
How do Sole-tenant Nodes differ from standard multi-tenant GCE, and what...
Q06 · JUNIOR
What is the GCE Metadata Server, how does it enable secure credential ac...
Q01 of 06 · SENIOR

How does GCE's Live Migration work mechanically, and how should you architect applications to handle the cases where Live Migration isn't possible?

ANSWER
Live Migration is a hypervisor-level operation where Google moves a running VM from one physical host to another without shutting it down. Mechanically: GCE pre-copies the VM's memory pages to the destination host while the VM is still running on the source. When the delta (the pages modified during copying) is small enough, GCE briefly pauses the VM (typically 10-100 milliseconds), transfers the remaining state, and resumes execution on the new host. The VM's IP addresses, MAC address, memory state, and disk attachments are preserved. The application sees a momentary performance dip or a brief pause in network responses, but not a reboot.

Live Migration is not available for GPU instances, instances configured with --maintenance-policy=TERMINATE, or certain instance types using local SSDs. For these cases, architect for instance-level failure: deploy behind a Managed Instance Group with auto-healing so a terminated VM is replaced automatically.

Make the application stateless: any state that must survive instance replacement lives in Cloud SQL, Cloud Storage, Memorystore, or another managed service, not on local disk. Implement graceful shutdown handling (SIGTERM → drain connections → exit cleanly) so in-flight requests complete before the VM is terminated. Set minReadySec in MIG update policies so replacement instances have time to warm up before receiving traffic.
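The graceful shutdown step in the answer (SIGTERM, drain, exit) can be sketched as follows. This is a minimal illustration that assumes the process receives SIGTERM when the VM shuts down; the names install_handler, handle_sigterm, and run_worker are hypothetical, and a list of jobs stands in for real connections.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: only set a flag; the worker loop does the draining."""
    shutdown_requested.set()


def install_handler():
    """Register the handler. Call once at startup, from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(jobs, process):
    """Process jobs until shutdown is requested.

    The job in flight when the signal arrives is allowed to finish;
    no new job is started afterwards.
    """
    done = []
    for job in jobs:
        if shutdown_requested.is_set():
            break  # stop accepting new work, then return so the process can exit
        done.append(process(job))
    return done
```

Keeping the handler to a single flag write is a deliberate choice: the actual draining happens in normal control flow, where it is safe to flush buffers and close connections.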
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What exactly happens to my data when I delete a VM?
02
What is the difference between a Zone and a Region in GCE, and how does it affect my architecture?
03
Can I resize a VM after it's been created, and do I need to stop it?
04
What is the difference between Preemptible VMs and Spot VMs, and which should I use?
05
How do I control costs on GCE as the team and workload scale?