Mid-level 4 min · March 06, 2026

ArgoCD GitOps — Auto-Heal Reverted Scale-Down 3× Mid-Outage

ArgoCD self-heal fought kubectl scale-down every 60s during a CPU incident.

Naren · Founder
Quick Answer
  • ArgoCD is a Kubernetes controller that continuously reconciles live cluster state against a Git repository — the source of truth
  • Core components: Application CRD (defines source repo + target cluster), Sync waves (ordering of resources), Webhook (auto-sync on push), AppSets (dynamic app generation)
  • Performance impact: Default sync interval 3 minutes — large manifests (>10K resources) can take 30-60 seconds to process
  • Production trap: Auto-sync enabled without validation — a bad commit rolls out instantly to production, no one reviews
  • Biggest mistake: Mutating resources outside Git (kubectl edit) — ArgoCD auto-heal overwrites changes, causing "my fix disappeared" confusion
Plain-English First

Imagine your Kubernetes cluster is a LEGO city, and your Git repository is the official blueprint book. ArgoCD is the obsessive city manager who constantly compares the actual city to the blueprint — and the moment someone sneaks in an extra block that isn't in the book, the manager rips it out and puts things back exactly as drawn. You never have to phone the manager; they're always watching. That's GitOps: the blueprint IS the truth, and ArgoCD enforces it automatically.

Most teams reach a point where 'kubectl apply -f' starts feeling like playing Jenga blindfolded. One engineer deploys a hotfix directly to production, another runs a Helm upgrade from their laptop, and within a week nobody actually knows what's running in the cluster. Config drift is silent, cumulative, and eventually catastrophic. ArgoCD was built to make that problem structurally impossible — not through discipline or process, but through automation backed by a source of truth that everyone can see and audit.

ArgoCD implements the GitOps operator pattern: it runs inside your cluster, watches a Git repository, and continuously reconciles the live cluster state against the desired state declared in that repo. If the live state drifts — whether from a rogue kubectl command, a failing node replacement, or a mischievous CronJob — ArgoCD detects the divergence and can automatically heal it. This isn't just CI/CD with extra steps; it's a fundamentally different mental model where deployments are a side-effect of merging a pull request, not a separate pipeline stage.

By the end you'll understand ArgoCD's reconciliation engine, how to model complex multi-service deployments with sync waves and hooks, how to harden a production installation with RBAC and SSO, and the non-obvious gotchas that only surface after six months of running it.

The Reconciliation Loop — How ArgoCD Continuously Enforces Git State

ArgoCD runs as a set of Kubernetes controllers inside your cluster. The core component is the Application Controller, which reconciles each Application resource: by default it re-checks Git roughly every 3 minutes, and it also reacts to changes it observes on live resources. It fetches the desired state from Git (via the repo server), compares it with the live state from the target cluster (using the Kubernetes API), and if they differ, marks the Application as 'OutOfSync'. If auto-sync is enabled, it then applies the diff.

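The key insight is that ArgoCD doesn't just apply YAML. Where possible it performs a three-way diff: live state (current cluster), desired state (Git), and last applied state (read from the kubectl.kubernetes.io/last-applied-configuration annotation on each resource). This three-way comparison lets ArgoCD tell fields it manages apart from fields that other controllers set, which avoids the 'last write wins' problem when resources are updated outside ArgoCD.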

When a sync happens, ArgoCD orders resources using sync waves (annotations). Resources with lower wave numbers sync first. By default, all resources are wave 0. Put Custom Resource Definitions (CRDs) in wave -1 or lower so they are installed before the custom resources that depend on them. Hooks (PreSync, Sync, PostSync) are resources, usually Jobs, that run at specific stages, allowing you to run database migrations before updating deployments or health checks after.

The repository server caches Git contents. It supports Helm, Kustomize, and plain YAML. For Helm, it runs helm template internally. For Kustomize, it runs kustomize build. The repo server caches rendered manifests to speed up subsequent syncs.

io/thecodeforge/argocd/application.yaml (YAML)
# io/thecodeforge/argocd/application.yaml
# Production ArgoCD Application with automated sync, self-heal, and retry
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-gateway
  namespace: argocd
  annotations:
    # Sync waves are annotations, not spec fields. On an Application this only
    # matters when a parent app (app-of-apps) syncs it before sibling apps.
    argocd.argoproj.io/sync-wave: "-1"
  finalizers:
    - resources-finalizer.argocd.argoproj.io  # Clean up resources when app is deleted
spec:
  project: production
  source:
    repoURL: https://github.com/thecodeforge/infra.git
    targetRevision: HEAD
    path: overlays/production/api-gateway
    helm:
      valueFiles:
        - values.yaml
        - secrets.yaml  # SOPS-encrypted; decryption needs a repo-server plugin (not built in)
  destination:
    server: https://kubernetes.default.svc
    namespace: api-gateway
  syncPolicy:
    automated:
      prune: true       # Delete resources not in Git
      selfHeal: true    # Revert manual changes (kubectl edit, kubectl scale)
      allowEmpty: false # Refuse a sync that would delete every resource
    syncOptions:
      - ApplyOutOfSyncOnly=true
      - PrunePropagationPolicy=foreground
      - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
# The status block is written by the controller; leave it out of the manifest you commit.
Output
# Apply the Application
kubectl apply -f application.yaml
# ArgoCD detects new app and syncs within 3 minutes (or webhook triggers instantly)
# Check sync status:
argocd app get api-gateway
# Output example:
Name: api-gateway
Project: production
Server: https://kubernetes.default.svc
Namespace: api-gateway
URL: https://argocd.example.com/applications/api-gateway
Repo: https://github.com/thecodeforge/infra.git
Target: HEAD
Path: overlays/production/api-gateway
Health Status: Healthy
Sync Status: Synced
GROUP  KIND        NAMESPACE    NAME         STATUS  HEALTH
apps   Deployment  api-gateway  api-gateway  Synced  Healthy
       Service     api-gateway  api          Synced  Healthy
The Reconciliation Loop — Git as Source of Truth
  • Application Controller: runs loop every 3 minutes (default). Compares live with desired using 3-way diff (live, desired, last-applied).
  • Repo Server: fetches Git, renders Helm/Kustomize, caches results. Runs as a separate pod to isolate security.
  • Sync Waves: lower waves sync first (wave -1, then 0, then 1, and so on). Put CRDs in wave -1 or lower so they install before custom resources.
  • Hooks: PreSync runs before sync, Sync runs during, PostSync after. Use for DB migrations, smoke tests, or notifications.
  • Auto-heal: reverts manual changes (kubectl edit, kubectl scale) shortly after the controller detects them, without waiting for the next Git poll. Disable it in production if you need incident overrides.
Production Insight
ArgoCD doesn't see Git changes instantly. With the default polling interval, a new commit can sit unapplied for up to 3 minutes.
Set timeout.reconciliation: 60s in argocd-cm to poll more frequently, but expect load on the controller, the repo server, and your Git host to rise as the interval shrinks.
Rule: Use webhooks for instant sync on Git push. Combine with a 1-minute poll as fallback.
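As a minimal sketch of the 1-minute polling fallback, the interval is set through the timeout.reconciliation key in the argocd-cm ConfigMap. The 60s value is illustrative, and the namespace assumes a default install in argocd.

```yaml
# argocd-cm fragment: shorten the Git polling interval (illustrative value)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd              # assumes the default install namespace
  labels:
    app.kubernetes.io/part-of: argocd
data:
  timeout.reconciliation: 60s    # default is on the order of 3 minutes
```

Merge this key into your existing ConfigMap rather than replacing the whole object. The setting is read at startup, so the repo server and application controller usually need a restart before it takes effect.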
Key Takeaway
ArgoCD's reconciliation loop is the heart of GitOps: compare, diff, sync, repeat.
The three-way merge (live, desired, last-applied) prevents last-write-wins conflicts when resources are updated outside ArgoCD.
Rule: For production, set selfHeal: true to enforce Git as truth, but have a runbook to disable auto-sync during incidents.
Sync Policy Decision Tree
If: Development environment, multiple PRs, rapid iteration
Use: Manual sync. The developer clicks 'Sync' after reviewing the diff in the UI. A webhook still refreshes the app on push, but nothing is applied until someone syncs.
If: Staging environment, automated tests pass after deploy
Use: Auto-sync with prune: true, selfHeal: false. Test pipelines can trigger a sync with argocd app sync. Roll back by reverting the Git commit.
If: Production environment, require Git as source of truth
Use: Auto-sync with prune: true, selfHeal: true. Add syncOptions: [ApplyOutOfSyncOnly=true] so each sync applies only the resources that are out of sync rather than every resource in the app.
If: Production with emergency overrides (incident response)
Use: Disable auto-sync for critical apps. Pin targetRevision to a release tag (for example v1.2.3) and run argocd app sync manually after the tag is pushed.
If: Resources with finalizers (namespaces, custom resources)
Use: Add PrunePropagationPolicy=foreground to syncOptions. Make sure the finalizer can complete; some have to be removed manually with kubectl patch.
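The "production with emergency overrides" branch can be sketched as an Application that has no automated block and is pinned to a tag. The app name, repo, and tag are reused from the earlier example for illustration only.

```yaml
# Sketch: production app on manual sync, pinned to a release tag.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-gateway
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/thecodeforge/infra.git
    targetRevision: v1.2.3        # a Git tag, so new commits on main do not roll out
    path: overlays/production/api-gateway
  destination:
    server: https://kubernetes.default.svc
    namespace: api-gateway
  syncPolicy:
    # No "automated" block: ArgoCD reports drift but waits for a manual sync.
    syncOptions:
      - PrunePropagationPolicy=foreground
```

With this shape, a release is a commit that bumps targetRevision, followed by argocd app sync api-gateway. A kubectl scale during an incident shows up as OutOfSync but is not reverted.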

Sync Waves and Hooks — Ordering Complex Deployments

Kubernetes has no built-in ordering for applying resources. If you need to install a CRD before creating a custom resource (like the Prometheus Operator CRDs before a Prometheus resource), you must order the sync. ArgoCD's sync waves solve this.

Every resource can be annotated with argocd.argoproj.io/sync-wave. Lower numbers sync first, so wave -1 syncs before wave 0. ArgoCD waits for a wave to become healthy before starting the next one. Inside a wave it applies resources without waiting on each other's health, so do not rely on ordering within a wave.

It is safest to put CRDs in wave -1 or lower and custom resources (like Prometheus, Istio) in wave 0 or higher. If a custom resource is applied before its CRD is installed, the sync fails. Hooks belong to phases, not waves: PreSync hooks run before any Sync-phase resource, whatever its wave, and PostSync hooks run after every wave has been applied and is healthy.

Hooks are Kubernetes Jobs or Pods that run at specific stages. Example: a database migration container must run before the new app version starts. Use a PreSync hook: the Job runs, ArgoCD waits for it to complete (successfully), then syncs the Deployment. If the hook fails, ArgoCD stops the sync.

Hook types: PreSync (before sync), Sync (during sync, rare), PostSync (after all resources are healthy), and SyncFail (runs when the sync fails). Skip is not a stage: it tells ArgoCD to skip applying that manifest. A hook Job that exits with a non-zero code fails the sync.

io/thecodeforge/argocd/sync-wave-hooks.yaml (YAML)
---
# io/thecodeforge/argocd/sync-wave-hooks.yaml
# Example: the PreSync hook runs the migration first, then CRDs (wave -1), then custom resources (wave 0), then the PostSync smoke test

# CRDs must exist before custom resources
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: gatlingruns.gatling-operator.io
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
spec:
  group: gatling-operator.io
  names:
    kind: GatlingRun
    plural: gatlingruns
  scope: Namespaced
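  versions:
    - name: v1
      served: true
      storage: true
      schema:  # apiextensions.k8s.io/v1 requires a schema for every version
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true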
---
# Custom resource (depends on CRD above) — sync after wave -1
apiVersion: gatling-operator.io/v1
kind: GatlingRun
metadata:
  name: load-test-run
  annotations:
    argocd.argoproj.io/sync-wave: "0"
spec:
  image: gatling:latest
  replicas: 3
---
# PreSync hook: database migration before app deployment
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: migrate/migrate:v4
          command:
            - /bin/sh
            - -c
            - "migrate -database 'postgres://user:pass@postgres:5432/app?sslmode=disable' -path /migrations up"
      restartPolicy: Never
  backoffLimit: 2
---
# PostSync hook: deploy smoke test runner
apiVersion: batch/v1
kind: Job
metadata:
  name: smoke-test
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: tester
          image: curlimages/curl:latest
          command:
            - /bin/sh
            - -c
            - "curl -f https://api.example.com/health || exit 1"
      restartPolicy: Never
Output
# ArgoCD processes the sync in phase order, then wave order:
1. The PreSync hook runs (db-migration Job). ArgoCD waits for the Job to complete.
2. If the Job succeeds, the Sync phase starts: CRDs (wave -1) are applied and become established.
3. Wave 0 resources are applied: the GatlingRun custom resource, plus any unannotated resources such as a Deployment or Service.
4. Once everything is healthy, the PostSync hook (smoke-test) runs.
5. If any hook fails, ArgoCD marks the sync as Failed and stops.
Watch Out: Sync Waves in Parallel May Race
Resources in the same wave are applied together, and ArgoCD does not wait for one to become healthy before applying the next. If one resource needs another to be ready first (for example, a Job that calls a Service backed by a Deployment), placing them in the same wave can cause temporary errors. ArgoCD will retry, but use separate waves when you need strict ordering.
Production Insight
Put CRDs in wave -1 or lower. Custom resources that reference them go in wave 0 or higher, otherwise the sync can fail on a missing kind.
PreSync hooks are for migrations, schema updates, or pre-deployment checks. They block the rest of the sync until they finish, so set activeDeadlineSeconds on the Job to bound a hung migration.
Rule: Use hook-delete-policy: HookSucceeded to clean up hook Jobs after success, preventing resource buildup.
Key Takeaway
Sync waves enforce ordering for CRDs, namespaces, and dependent resources. Lower waves sync first.
PreSync hooks run before any resources; PostSync hooks run after all resources are healthy. Use them for migrations and validation.
Rule: Start CRDs at wave -2, namespaces at -1, core resources at 0, application deployments at 1.
Sync Wave Strategy
If: Installing CRDs (e.g., Istio, Prometheus, Cert Manager)
Use: Wave -2 (earliest). ArgoCD waits for the wave to be healthy before moving on, so custom resources can follow safely in wave 0.
If: Creating namespace before any resources in it
Use: Wave -1. The namespace must exist before the resources inside it. Annotate the namespace with wave -1.
If: Deployment depends on ConfigMap or Secret
Use: Wave 0 for ConfigMap/Secret, Wave 1 for Deployment. The Deployment restarts automatically on ConfigMap changes only if its pod template carries a hash of the config (a checksum annotation).
If: Database migration must run before new app version
Use: PreSync hook (Job). The hook runs before any Sync-phase resource, whatever its wave. The new application pods then start against the migrated schema.
If: Load balancer creation depends on existing Deployment
Use: Wave 1 for Service (with the Deployment in wave 0). The Service's endpoints are populated only after the Deployment's pods are ready. Use wave ordering to avoid temporary 503s.
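The "ConfigMap before Deployment" branch can be sketched with two annotated manifests. The resource names and image are hypothetical and only serve the illustration.

```yaml
# Wave 0: configuration is applied first
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-config                  # hypothetical name
  annotations:
    argocd.argoproj.io/sync-wave: "0"
data:
  LOG_LEVEL: info
---
# Wave 1: the Deployment is applied only after wave 0 is healthy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                         # hypothetical name
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.0.0   # hypothetical image
          envFrom:
            - configMapRef:
                name: api-config
```

The wave value must be a quoted string, because Kubernetes annotation values are strings.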

Multi-Cluster and Multi-Tenant Patterns — Production Hardening

ArgoCD can manage thousands of clusters from a single control plane. Each target cluster is represented by a secret in the ArgoCD namespace containing the cluster API server endpoint and bearer token. The Application Controller uses these secrets to connect and sync.
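A declarative cluster registration might look like the sketch below. The cluster name and the env label are hypothetical, the server URL is borrowed from the project example later in this section, and the token and CA values are placeholders you would supply yourself.

```yaml
# Sketch of a cluster Secret; ArgoCD discovers it through the secret-type label.
apiVersion: v1
kind: Secret
metadata:
  name: prod-eu-1-cluster            # hypothetical name
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
    env: production                  # hypothetical label, useful for ApplicationSet selectors
type: Opaque
stringData:
  name: prod-eu-1
  server: https://prod-eu-1.k8s.example.com
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-encoded-ca-certificate>"
      }
    }
```

Running argocd cluster add <context> produces an equivalent Secret for you. The declarative form is mainly useful when cluster registration itself should be reviewed and versioned.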

For multi-cluster management, organise Applications by cluster and environment: clusters/prod/us-east-1/apps, clusters/staging/eu-west-1/apps. Use ApplicationSets to generate Applications dynamically for each cluster with a templated Git path.
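A minimal ApplicationSet sketch using the cluster generator is shown here. It assumes clusters carry a hypothetical env: production label and that the repo follows the clusters/prod/<cluster-name>/apps layout described above.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: api-gateway-per-cluster      # hypothetical name
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: production          # hypothetical label on the cluster Secrets
  template:
    metadata:
      name: 'api-gateway-{{name}}'   # {{name}} is the registered cluster name
    spec:
      project: production
      source:
        repoURL: https://github.com/thecodeforge/infra.git
        targetRevision: HEAD
        path: 'clusters/prod/{{name}}/apps'
      destination:
        server: '{{server}}'         # {{server}} is the cluster API endpoint
        namespace: api-gateway
```

Registering a new cluster with the matching label is then enough for ArgoCD to generate its Application, with no extra manifest per cluster.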

Multi-tenancy within a cluster: use Projects to isolate teams. Each Project defines source repositories, destination clusters/namespaces, and role-based access. For example, the 'team-a' project can only deploy to namespaces prefixed with 'team-a-' and only from Git repos under github.com/team-a.

RBAC in ArgoCD is its own model: policies are defined in argocd-rbac-cm. A policy like p, role:admin, applications, *, */*, allow gives full access to applications. p, role:viewer, applications, get, team-a/*, allow allows read-only access to team-a apps. Bind SSO groups to roles with g, <group>, role:<name> lines in the same policy; the OIDC provider itself is configured under oidc.config in argocd-cm.
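As a small sketch, a policy that defines a read-only role and binds SSO groups to roles could look like this. The group names are hypothetical and depend on what your identity provider sends in the groups claim.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.csv: |
    # p lines grant permissions: subject, resource, action, object, effect
    p, role:viewer, applications, get, team-a/*, allow
    # g lines bind a group (or user) to a role; group names are hypothetical
    g, team-a-developers, role:viewer
    g, platform-admins, role:admin
  scopes: '[groups]'
```

The object pattern for applications is <project>/<application>, so team-a/* matches every app in the team-a project.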

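For secrets management, ArgoCD is commonly paired with SOPS, SealedSecrets, or Vault. None of these is built in, so each needs its own controller or repo-server plugin. The recommended pattern: commit encrypted secrets to Git and decrypt them either in the repo server (SOPS, using a KMS key or an age private key stored in a Kubernetes Secret) or in the cluster (SealedSecrets). Never put plaintext secrets in Git.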
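For the SealedSecrets variant, what lands in Git is roughly the resource sketched below. The names are hypothetical, the ciphertext is a placeholder for kubeseal output, and the sketch assumes the SealedSecrets controller is already installed in the cluster.

```yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-credentials               # hypothetical name
  namespace: api-gateway
spec:
  encryptedData:
    DB_PASSWORD: <ciphertext-from-kubeseal>   # placeholder, not a real value
  template:
    metadata:
      name: db-credentials
      namespace: api-gateway
    type: Opaque
```

ArgoCD syncs this like any other manifest. The controller then creates the plain Secret in the cluster, so the decrypted value never appears in the repository.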

io/thecodeforge/argocd/argocd-project.yaml (YAML)
# io/thecodeforge/argocd/argocd-project.yaml
# Multi-tenant RBAC with Projects
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-payments
  namespace: argocd
spec:
  description: "Payments team microservices"
  # Restrict source repos
  sourceRepos:
    - 'https://github.com/org/payments-infra.git'
  # Only allow deploying to specific clusters/namespaces
  destinations:
    - namespace: 'payments-*'
      server: https://kubernetes.default.svc
    - namespace: 'payments-*'
      server: https://prod-eu-1.k8s.example.com
  # Deny cluster-scoped resources except specific ones
  clusterResourceWhitelist:
    - group: ''
      kind: Namespace
    - group: rbac.authorization.k8s.io
      kind: ClusterRole
  # Deny specific namespaced resource kinds
  namespaceResourceBlacklist:
    - group: ''
      kind: Secret  # Prevent teams from creating secrets outside vault
  roles:
    - name: developer
      policies:
        - p, proj:team-payments:developer, applications, sync, team-payments/*, allow
        - p, proj:team-payments:developer, applications, get, team-payments/*, allow
      groups:
        - payments-developers

---
# RBAC policy in argocd-rbac-cm (ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.csv: |
    # Admin access to all projects
    p, role:admin, applications, *, */*, allow
    p, role:admin, projects, *, *, allow
    p, role:admin, clusters, *, *, allow
    
    # Team leads can sync their own apps but not change project config
    p, role:team-lead, applications, sync, team-*/*, allow
    p, role:team-lead, applications, get, team-*/*, allow
    
    # Read-only access for auditors
    p, role:auditor, applications, get, */*, allow
    
  policy.default: role:readonly  # Default role if not matched
  scopes: '[groups]'
Output
# Apply project and RBAC
kubectl apply -f argocd-project.yaml
kubectl apply -f argocd-rbac-cm.yaml
# Test access
argocd login argocd.example.com --sso
# Team payments developer (member of payments-developers group)
argocd app list # Shows only apps in team-payments project
# Attempt to create app in another project (fails due to RBAC)
argocd app create test --project other-project --repo ... # Error: permission denied
# List projects visible to user
argocd proj list
NAME DESCRIPTION
team-payments Payments team microservices
Project RBAC Prevents 'Cluster-Wide' Accidents
Always define an AppProject before creating Applications. It limits source repos, destination namespaces, and cluster resources. Applications left in the built-in default project, which is unrestricted out of the box, can target the kube-system namespace or any registered cluster. Use clusterResourceWhitelist to restrict creation of ClusterRoles and ClusterRoleBindings.
Production Insight
Without restrictive Projects, any user who can create Applications in the default project can deploy to any namespace, including kube-system, and create cluster-scoped resources.
A single sourceRepos restriction prevents deploying from a fork that contains malicious manifests.
Rule: Start with a 'restricted' project that only allows specific namespaces. Create an 'elevated' project for platform team with cluster-scoped access.
Key Takeaway
Projects isolate teams and limit what they can deploy. Always define a Project before creating Applications.
Multi-cluster management is built in: register clusters via secrets, use ApplicationSets to generate apps per cluster.
Rule: Restrict sourceRepos to your organisation's Git org. Never allow https://github.com/* — one malicious fork is all it takes.
Multi-Cluster Organization
If: 1-5 clusters, simple topology
Use: Single ArgoCD instance. Register each cluster via argocd cluster add. Use ApplicationSets with the cluster generator.
If: 5-50 clusters, different regions, strict compliance
Use: Hub-and-spoke: one ArgoCD per region/fleet, aggregated by a 'cluster-of-clusters' ArgoCD. Use an ApplicationSet with the git generator per cluster folder.
If: Ephemeral clusters (CI/CD, preview environments)
Use: ApplicationSet with the pull request generator. Each PR gets a new namespace and app, and the generated Application is removed when the PR is merged or closed.
If: Multi-tenant with isolated teams
Use: Projects per team. Each Project has its own sourceRepos, destinations, and RBAC roles. No sharing of Applications between Projects.
If: Secrets must not be in Git (PCI, HIPAA)
Use: SealedSecrets or the Vault CSI driver. ArgoCD syncs the SealedSecret resource and the SealedSecrets controller unseals it in-cluster. Plaintext never touches Git.
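The preview-environment branch can be sketched with the pull request generator. The GitHub owner and repo are taken from the earlier examples; the token Secret, the previews project, and the overlay path are assumptions made for illustration.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: api-gateway-previews          # hypothetical name
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: thecodeforge
          repo: infra
          tokenRef:
            secretName: github-token  # hypothetical Secret holding an API token
            key: token
        requeueAfterSeconds: 300      # how often to re-list open PRs
  template:
    metadata:
      name: 'preview-{{number}}'
    spec:
      project: previews               # hypothetical project
      source:
        repoURL: https://github.com/thecodeforge/infra.git
        targetRevision: '{{head_sha}}'
        path: overlays/preview/api-gateway   # hypothetical path
      destination:
        server: https://kubernetes.default.svc
        namespace: 'preview-{{number}}'
      syncPolicy:
        automated:
          prune: true
        syncOptions:
          - CreateNamespace=true
```

Each open pull request yields one Application. When the PR closes, the generator stops emitting it and the ApplicationSet controller deletes the generated Application.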
● Production incident · POST-MORTEM · severity: high

The Auto-Heal That Fought a Scale-Down During a PagerDuty Incident

Symptom
Deployment was consuming 100% CPU. Senior engineer ran kubectl scale deploy api --replicas=0 to stop traffic while investigating. Within 1 minute, the deployment scaled back to 3 replicas. They scaled down again; it scaled back up. The team thought the incident was automated chaos until someone noticed ArgoCD logs showing successfully synced every 60 seconds. The team spent 20 minutes fighting their own tooling.
Assumption
The team assumed auto-heal was only for 'drift like changed image tags', not for operational scaling during incidents. They didn't realise ArgoCD treats ANY deviation from Git as drift, including scaling operations. They had no temporary suspension mechanism for incidents.
Root cause
The ArgoCD Application had syncPolicy.automated enabled with selfHeal: true. ArgoCD treated the manual scale-down as drift: the controller compared live state with Git, saw the replica count mismatch (0 vs 3), and re-applied the Git value. Self-heal reacts to changes the controller observes on live resources, so each revert landed within about a minute instead of waiting for the 3-minute Git poll. The runbook had no step for pausing ArgoCD on a single Application. The only route was to patch the Application resource by hand, which nobody wanted to improvise while debugging under pressure.
Fix
Added an incident runbook step that disables automated sync for the affected Application: kubectl patch application api -n argocd --type merge -p '{"spec":{"syncPolicy":{"automated":null}}}' (argocd app set api --sync-policy none does the same through the CLI). ArgoCD has no built-in argocd app pause command, so the team documented this as a 'pause' and 'resume' procedure in the runbook and re-enables selfHeal once the incident is closed. They also moved the most critical production apps to manual sync and installed ArgoCD Notifications to alert when a sync overrides manual changes.
Key lesson
  • Auto-heal without an incident suspension mechanism turns your GitOps tool into an adversary during outages. Always have a documented way to pause sync.
  • Production should either use manual sync, or use automated sync with prune: true and selfHeal: true only when pre-commit hooks validate the manifests.
  • ArgoCD has no manual-sync annotation. To keep a critical app on manual sync, leave syncPolicy.automated out of the Application entirely (automated: {} still turns auto-sync on) and pin targetRevision to tagged releases so syncs happen only on a release.
  • Monitor sync activity: argocd app history <app> lists recent syncs and their revisions. Alert if syncs happen outside planned deployment windows.
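One way to make the pause step hard to get wrong under pressure is to keep the patches as files next to the runbook. The sketch below shows two separate merge-patch files; the file names are hypothetical, and the resume values should mirror whatever your Application normally uses.

```yaml
# File 1: pause-autosync.yaml (hypothetical file name)
# Removes the automated block, which switches the app to manual sync.
spec:
  syncPolicy:
    automated: null
---
# File 2: resume-autosync.yaml (hypothetical file name)
# Restores automated sync once the incident is closed.
spec:
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Each file is applied on its own, for example kubectl patch application api -n argocd --type merge --patch-file pause-autosync.yaml. The --type merge flag matters because Application is a custom resource and does not support strategic merge patches.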
Production debug guide
Symptom → Action mapping for common GitOps failures (5 entries)
Symptom · 01
Application flips between 'Synced' and 'OutOfSync', or resources are not updating
Fix
Check whether another controller manages the field (for example, an HPA scales replicas and ArgoCD reverts them). Run kubectl describe app <app> -n argocd and look at the Conditions section. The resource kind might be excluded through resource.exclusions in argocd-cm. If admission webhooks or controllers add fields, list those fields under ignoreDifferences in the Application spec. For whole resources that ArgoCD should not count as drift, use the argocd.argoproj.io/compare-options: IgnoreExtraneous annotation.
Symptom · 02
Sync stuck in 'Running' or 'Pending' for hours
Fix
Check whether the Git host is reachable from the repo server and whether a hook Job is still running. Run argocd app manifests <app> to see if manifests resolve. Check network policies blocking ArgoCD to Git (port 443). Terminate the stuck operation with argocd app terminate-op <app>. As a last resort, delete the Application Controller pod to force a requeue: kubectl delete pod -n argocd argocd-application-controller-0.
Symptom · 03
Prune failing — resources not deleted from cluster
Fix
Check finalizers: resources with finalizers (for example, the kubernetes finalizer on namespaces) block deletion. Add PrunePropagationPolicy=foreground to syncOptions. Check whether the resource is shared with another app (fluentd-logging might be). For resources that should outlive the app, annotate them with argocd.argoproj.io/sync-options: Prune=false.
Symptom · 04
Webhook not triggering sync — have to click 'Sync' manually
Fix
Check GitHub/GitLab webhook delivery logs. Verify the argocd-server service is reachable from the internet (or use a webhook proxy like Smee). Ensure the webhook secret in argocd-secret matches. While the webhook is broken, run argocd app sync <app> manually, and add --prune only if you intend to delete resources that were removed from Git.
Symptom · 05
Permissions error: 'forbidden: User "system:serviceaccount:argocd:argocd-application-controller" cannot get resource'
Fix
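The application controller's service account lacks RBAC on the target cluster. Review the argocd-application-controller ClusterRole and its binding. For cross-cluster deployments, check the cluster Secret in the argocd namespace (labelled argocd.argoproj.io/secret-type: cluster). Use argocd cluster add <context> to generate the correct service account and RBAC on the target cluster.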
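For Symptom 01, where an HPA (or an operator's kubectl scale) and ArgoCD fight over replica counts, a common approach is to tell ArgoCD to ignore that field. The fragment below is a sketch of the relevant part of an Application spec, not a complete manifest.

```yaml
# Fragment of an Application spec: stop treating replica counts as drift.
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
  syncPolicy:
    syncOptions:
      # Without this option, a sync would still write the Git replica count.
      - RespectIgnoreDifferences=true
```

The trade-off is that Git no longer controls replicas for these Deployments, so this suits workloads that are scaled by an autoscaler anyway.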
★ ArgoCD Quick Debug Cheat Sheet
Fast diagnostics for production GitOps issues. Run these before changing any Git manifests.
OutOfSync — resources not matching Git
Immediate action
Check diff to see what ArgoCD thinks is different
Commands
argocd app get <app> --show-operation --refresh
argocd app diff <app> --revision HEAD
Fix now
Update the Git manifest to match live state, or make live state match Git with argocd app sync <app>. Add --force only as a last resort, since it can delete and recreate resources.
ArgoCD not syncing after Git push
Immediate action
Verify webhook is firing and ArgoCD is listening
Commands
kubectl logs -n argocd argocd-server-<pod> --tail=100 | grep -i webhook
curl -v https://argocd.yourdomain.com/api/webhook -X POST (simulate)
Fix now
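Temporarily shorten polling: set timeout.reconciliation: 60s in argocd-cm, then restart the components that read it: kubectl rollout restart statefulset argocd-application-controller -n argocd and kubectl rollout restart deploy argocd-repo-server -n argocd.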
Deployment hangs: new pods not starting
Immediate action
Check resource quotas, PV availability, and image pull secrets
Commands
kubectl describe pod <new-pod> | tail -20
kubectl get events --all-namespaces | grep -i error
Fix now
If image pull error, regen secrets: kubectl create secret docker-registry regcred.... If quota, increase in ResourceQuota or remove unnecessary pods.
ArgoCD UI says 'Unknown' or 'Connection Refused' for cluster
Immediate action
Check that cluster secret is valid and network is open
Commands
argocd cluster list
kubectl get secret -n argocd | grep cluster
Fix now
Delete and re-add cluster: argocd cluster rm <name>; argocd cluster add <ctx>.
Sync failed: 'Failed to load target state: rpc error'
Immediate action
Check if CRDs are missing or Helm repo is unreachable
Commands
argocd app manifests <app> > /dev/null
kubectl get crd | grep -E 'istio|prometheus'
Fix now
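Install the missing CRDs first (for example, from a separate Application or an earlier sync wave). For Helm, register the chart repository with argocd repo add <url> --type helm --name <name> and check that the repo name in the Application matches.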
ArgoCD Sync Strategies Compared
Strategy | Sync Trigger | Auto-Heal | Risk | Best For
Manual | User clicks 'Sync' in UI or CLI | No (drift detected but not corrected) | Human delay: config drift accumulates | Production with change approval process
Automated with selfHeal: false | Webhook or scheduled poll (3 min) | No (drift detected, requires manual fix) | Drift can persist until manual sync. Good for audit trails. | Staging environments, highly regulated industries
Automated with selfHeal: true | Webhook or poll | Yes (any drift reverted) | Emergency overrides impossible without pausing sync | Production where Git is absolute truth
ApplicationSet + PR Generator | GitHub/GitLab webhook on PR | No (PR apps are ephemeral) | Old apps may linger if not pruned | Preview environments per pull request
Image Updater (auto-update image tags) | Image registry webhook | Yes (drift on image tag changed manually) | Unpinned tags (latest) roll out untested images | Nightly builds, canary environments

Key takeaways

1. ArgoCD's reconciliation loop is the core of GitOps: it continuously enforces Git as the single source of truth, either by manual approval or automatic sync.
2. Sync waves order resource creation (CRDs first, then the resources that depend on them). PreSync and PostSync hooks run migrations and validation before and after critical stateful changes.
3. Projects with RBAC and sourceRepos restrictions prevent cluster-wide accidents and multi-tenant chaos. Always define an AppProject before creating Applications.
4. Auto-heal is powerful but can fight operators during incidents. Disable it temporarily with kubectl patch when you need emergency overrides.
5. Never put plaintext secrets in Git. Use SealedSecrets or SOPS with age/KMS to encrypt, commit encrypted manifests, and decrypt in the repo server or cluster.

Common mistakes to avoid

5 patterns

Mutating resources directly with kubectl while auto-heal is enabled

Symptom
kubectl scale down a Deployment, ArgoCD scales it back up. kubectl edit configmap, ArgoCD reverts it. Team wastes hours fighting the controller.
Fix
For temporary incident overrides, disable auto-sync for that application: kubectl patch app <name> -n argocd --type merge -p '{"spec":{"syncPolicy":{"automated":null}}}'. Re-enable after incident: kubectl patch app <name> -n argocd --type merge -p '{"spec":{"syncPolicy":{"automated":{"selfHeal":true,"prune":true}}}}'. Better: create an emergency 'break-glass' override procedure documented in runbook.

Putting secrets in Git without encryption

Symptom
Database passwords, API keys, TLS certs are visible in clear text in the Git repo. Anyone with repo access has credentials. Security audit fails.
Fix
Use SealedSecrets: commit the encrypted SealedSecret resource. Or use SOPS with age or KMS: commit SOPS-encrypted files and have the ArgoCD repo server decrypt them through a plugin, using a private key stored in a Kubernetes Secret. Never put plaintext secrets in Git. Ever.

Not setting prune: true, leading to resource leaks

Symptom
You remove a Deployment from Git, but it still runs in the cluster. Load balancer remains, costing money. Orphaned PVCs fill up storage.
Fix
Set syncPolicy.automated.prune: true in production. For resources that should persist (persistent volume claims, for example), annotate them with argocd.argoproj.io/sync-options: Prune=false, or keep ArgoCD from tracking a kind at all through resource.exclusions in argocd-cm. Use syncOptions: [PrunePropagationPolicy=foreground] to ensure dependent resources are deleted first.

Applying raw YAML without Helm or Kustomize — no environment overrides

Symptom
Same deployment YAML for dev, staging, prod. Developers change image tag in prod manually. Drift returns. No way to inject environment-specific config (e.g., database host).
Fix
Use Helm with values-dev.yaml, values-prod.yaml or Kustomize with overlays. ArgoCD can point to path: overlays/production. Use helm valueFiles parameter in Application spec to pick correct values file per environment.

Forgetting to set finalizers on Application CRs

Symptom
You delete an Application with kubectl. The Application CR disappears, but all managed resources (Deployments, Services, PVCs, etc.) remain in the cluster, orphaned and still running.
Fix
Add metadata.finalizers: [resources-finalizer.argocd.argoproj.io] to every Application. When you delete the Application, ArgoCD will first prune all managed resources, then remove the finalizer and delete the CR. ArgoCD does not add the finalizer to declaratively created Applications for you: argocd app delete cascades by default, but kubectl delete relies on the finalizer being present.
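For the prune mistake above, the per-resource opt-out looks roughly like this. The claim name and size are hypothetical; the annotation is what keeps ArgoCD from deleting the resource when it disappears from Git.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: api-data                     # hypothetical name
  namespace: api-gateway
  annotations:
    # Keep this resource even if it is later removed from Git.
    argocd.argoproj.io/sync-options: Prune=false
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

Expect the Application to report OutOfSync after you remove such a resource from Git, since ArgoCD still sees something it would otherwise prune.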
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain the difference between `selfHeal` and `prune` in ArgoCD sync pol...
Q02 · SENIOR
How does ArgoCD handle secrets? Walk me through a secure pattern for dat...
Q03 · SENIOR
What is the difference between an Application and an ApplicationSet? Whe...
Q04 · SENIOR
What happens when you delete an ArgoCD Application but forget to set the...
Q01 of 04 · SENIOR

Explain the difference between `selfHeal` and `prune` in ArgoCD sync policy — and why you might disable selfHeal in production.

ANSWER
prune controls whether ArgoCD deletes resources that exist in the cluster but are not present in Git. If disabled, removing a Deployment from Git leaves it running in the cluster (resource leak). selfHeal controls whether ArgoCD reverts manual changes made directly to the cluster (e.g., kubectl edit deployment). If enabled, any drift detected in the next reconciliation loop is overwritten with the Git state. In production, you might disable selfHeal temporarily during incident response: if a pod is crashing and you need to scale down manually to stop the crash loop, selfHeal would immediately revert the scale down. Instead, you disable selfHeal, make your operational changes, investigate, then re-enable it. For long-running production, selfHeal is usually enabled to enforce Git as the absolute source of truth — but you need a documented process to suspend it during emergencies.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01. Can I use ArgoCD with Helm?
02. How do I roll back a deployment in ArgoCD?
03. Can ArgoCD manage CRDs that are installed by another operator?
04. How does ArgoCD compare to Flux CD?