ArgoCD GitOps — Auto-Heal Reverted Scale-Down 3× Mid-Outage
ArgoCD self-heal fought kubectl scale-down every 60s during a CPU incident.
- ArgoCD is a Kubernetes controller that continuously reconciles live cluster state against a Git repository — the source of truth
- Core components: Application CRD (defines source repo + target cluster), Sync waves (ordering of resources), Webhook (auto-sync on push), AppSets (dynamic app generation)
- Performance impact: Default sync interval 3 minutes — large manifests (>10K resources) can take 30-60 seconds to process
- Production trap: Auto-sync enabled without validation — a bad commit rolls out instantly to production, no one reviews
- Biggest mistake: Mutating resources outside Git (kubectl edit) — ArgoCD auto-heal overwrites changes, causing "my fix disappeared" confusion
Imagine your Kubernetes cluster is a LEGO city, and your Git repository is the official blueprint book. ArgoCD is the obsessive city manager who constantly compares the actual city to the blueprint — and the moment someone sneaks in an extra block that isn't in the book, the manager rips it out and puts things back exactly as drawn. You never have to phone the manager; they're always watching. That's GitOps: the blueprint IS the truth, and ArgoCD enforces it automatically.
Most teams reach a point where 'kubectl apply -f' starts feeling like playing Jenga blindfolded. One engineer deploys a hotfix directly to production, another runs a Helm upgrade from their laptop, and within a week nobody actually knows what's running in the cluster. Config drift is silent, cumulative, and eventually catastrophic. ArgoCD was built to make that problem structurally impossible — not through discipline or process, but through automation backed by a source of truth that everyone can see and audit.
ArgoCD implements the GitOps operator pattern: it runs inside your cluster, watches a Git repository, and continuously reconciles the live cluster state against the desired state declared in that repo. If the live state drifts — whether from a rogue kubectl command, a failing node replacement, or a mischievous CronJob — ArgoCD detects the divergence and can automatically heal it. This isn't just CI/CD with extra steps; it's a fundamentally different mental model where deployments are a side-effect of merging a pull request, not a separate pipeline stage.
By the end you'll understand ArgoCD's reconciliation engine, how to model complex multi-service deployments with sync waves and hooks, how to harden a production installation with RBAC and SSO, and the non-obvious gotchas that only surface after six months of running it.
The Reconciliation Loop — How ArgoCD Continuously Enforces Git State
ArgoCD runs as a set of Kubernetes controllers inside your cluster. The core component is the Application Controller, which runs a continuous reconciliation loop (default 3 minutes) for each Application resource. It fetches the desired state from Git (via repo server), compares it with the live state from the target cluster (using the Kubernetes API), and if they differ, it marks the Application as 'OutOfSync'. If auto-sync is enabled, it immediately applies the diff.
The key insight is that ArgoCD doesn't just apply YAML. It performs a three-way diff: live state (current cluster), desired state (Git), and last-applied state (the kubectl.kubernetes.io/last-applied-configuration annotation on each resource). This three-way diff prevents the 'last write wins' problem when resources are updated outside ArgoCD.
When a sync happens, ArgoCD orders resources using sync waves (annotations). Resources with lower wave numbers sync first. By default, all resources are wave 0. Custom Resource Definitions (CRDs) should sit in a lower wave than the custom resources that depend on them (for example, wave -1), so they are installed first. Hooks (PreSync, Sync, PostSync) run as Jobs at specific stages, allowing you to run database migrations before updating deployments, or health checks afterwards.
The repository server caches Git contents. It supports Helm, Kustomize, and plain YAML. For Helm, it runs helm template internally. For Kustomize, it runs kustomize build. The repo server caches rendered manifests to speed up subsequent syncs.
- Application Controller: runs loop every 3 minutes (default). Compares live with desired using 3-way diff (live, desired, last-applied).
- Repo Server: fetches Git, renders Helm/Kustomize, caches results. Runs as a separate pod to isolate security.
- Sync Waves: lower wave numbers sync first (wave -1, then 0, then 1, and so on). Put CRDs in a lower wave than the custom resources that depend on them.
- Hooks: PreSync runs before sync, Sync runs during, PostSync after. Use for DB migrations, smoke tests, or notifications.
- Auto-heal: reverts manual changes (kubectl edit) within 3 minutes. Disable in production if you need incident overrides.
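As a minimal sketch of how the pieces above are usually wired together, the Application below points ArgoCD at a Git path and enables automated sync with prune and self-heal. The application name, repository URL, path, and namespace are hypothetical placeholders, not values from this article's setup.

```yaml
# Hypothetical Application: name, repo URL, path, and namespace are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/deploy-config.git  # assumed repo
    targetRevision: main
    path: apps/api
  destination:
    server: https://kubernetes.default.svc  # the cluster ArgoCD itself runs in
    namespace: api
  syncPolicy:
    automated:
      prune: true     # delete resources that were removed from Git
      selfHeal: true  # revert manual changes made to live resources
```

Removing the automated block leaves the Application on manual sync: ArgoCD still reports it as OutOfSync, but waits for someone to run argocd app sync.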
- status.autoRefresh: 1m in argocd-cm to check more frequently, but know that API server load increases linearly.
- selfHeal: true to enforce Git as truth, but have a runbook to disable auto-sync during incidents.
- prune: true, selfHeal: false. Tests can manually trigger sync via argocd app sync. Rollback by reverting Git.
- prune: true, selfHeal: true. But use syncOptions: [ApplyOutOfSyncOnly=true] to avoid full resource diff on every sync.
- targetRevision: v1.2.3). Manual sync via argocd app sync after tag is pushed.
- prunePropagationPolicy: foreground in sync options. Ensure finalizer allows deletion; some require manual removal via kubectl patch.
Sync Waves and Hooks — Ordering Complex Deployments
Kubernetes has no built-in ordering for applying resources. If you need to install a CRD before creating a custom resource (like the Prometheus Operator CRDs before a Prometheus or ServiceMonitor object), you must order the sync. ArgoCD's sync waves solve this.
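Every resource can be annotated with argocd.argoproj.io/sync-wave. Lower numbers sync first, so resources in wave -1 sync before wave 0. Within a wave, ArgoCD orders resources by kind (namespaces first, then other built-in resources, then custom resources) and then by name, so do not rely on same-wave ordering for anything that needs an explicit dependency.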
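CRDs should be in a lower wave than the custom resources that use them, for example CRDs in wave -1 and custom resources (like Prometheus or Istio objects) in wave 0 or higher. If a custom resource is applied before its CRD is installed, the sync fails. Hooks are ordered by phase rather than by wave number: PreSync hooks run before any Sync-phase resource regardless of its wave, and PostSync hooks run after every Sync-phase wave has been applied and is healthy.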
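The ordering above can be sketched with two manifests in the same Application. The example.com group and the Widget kind are invented for illustration; only the argocd.argoproj.io/sync-wave annotation is ArgoCD's own.

```yaml
# Wave -1: the CRD is applied and must be healthy before wave 0 starts.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com          # hypothetical CRD
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
---
# Wave 0 (the default): a custom resource that depends on the CRD above.
apiVersion: example.com/v1
kind: Widget
metadata:
  name: demo
  annotations:
    argocd.argoproj.io/sync-wave: "0"
spec:
  size: small
```

The wave value is a string annotation, so it has to be quoted; an unquoted -1 is parsed as a number and rejected by the Kubernetes API.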
Hooks are Kubernetes Jobs or Pods that run at specific stages. Example: a database migration container must run before the new app version starts. Use a PreSync hook: the Job runs, ArgoCD waits for it to complete (successfully), then syncs the Deployment. If the hook fails, ArgoCD stops the sync.
Hook types: PreSync (before sync), Sync (during sync, rare), PostSync (after all resources are healthy), Skip (tells ArgoCD to skip applying that manifest). A hook that exits with a non-zero code fails the sync.
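A database-migration hook of the kind described above might look like the following sketch. The Job name, image, and command are hypothetical; the two annotations are the parts ArgoCD reads.

```yaml
# Hypothetical migration Job: image and command are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync                      # run before the Sync phase
    argocd.argoproj.io/hook-delete-policy: HookSucceeded  # remove the Job once it succeeds
spec:
  backoffLimit: 0          # fail fast so a broken migration stops the sync
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/api-migrations:1.2.3  # assumed image
          command: ["/app/migrate", "up"]                   # assumed entrypoint
```

With backoffLimit: 0 a failed migration is not retried, which keeps the failure visible in the sync result instead of being masked by repeated attempts.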
- hook-delete-policy: HookSucceeded to clean up hook Jobs after success, preventing resource buildup.
- kubectl get crd check in PreSync hook.
- hash annotations.
Multi-Cluster and Multi-Tenant Patterns — Production Hardening
ArgoCD can manage thousands of clusters from a single control plane. Each target cluster is represented by a secret in the ArgoCD namespace containing the cluster API server endpoint and bearer token. The Application Controller uses these secrets to connect and sync.
For multi-cluster management, organise Applications by cluster and environment: clusters/prod/us-east-1/apps, clusters/staging/eu-west-1/apps. Use ApplicationSets to generate Applications dynamically for each cluster with a templated Git path.
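As a rough sketch of the ApplicationSet approach, the cluster generator below emits one Application per cluster registered in ArgoCD. The repository URL, the destination namespace, and the simplified clusters/<name>/apps layout are assumptions for illustration; {{name}} and {{server}} are parameters supplied by the cluster generator.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cluster-apps
  namespace: argocd
spec:
  generators:
    - clusters: {}   # one set of parameters per cluster known to ArgoCD
  template:
    metadata:
      name: '{{name}}-apps'
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/deploy-config.git  # assumed repo
        targetRevision: main
        path: 'clusters/{{name}}/apps'   # assumed layout, one directory per cluster
      destination:
        server: '{{server}}'
        namespace: apps                  # hypothetical target namespace
```

An empty clusters generator also matches the cluster ArgoCD runs in; a label selector on the generator is the usual way to narrow the set.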
Multi-tenancy within a cluster: use Projects to isolate teams. Each Project defines source repositories, destination clusters/namespaces, and role-based access. For example, the 'team-a' project can only deploy to namespaces prefixed with 'team-a-' and only from Git repos under github.com/team-a.
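The team-a example could be expressed roughly as the AppProject below. The role name and the team-a-devs group are hypothetical; the repository and namespace patterns follow the description above.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-a
  namespace: argocd
spec:
  description: Team A workloads
  sourceRepos:
    - https://github.com/team-a/*        # only repos under the team's org
  destinations:
    - server: https://kubernetes.default.svc
      namespace: team-a-*                # only namespaces with the team prefix
  roles:
    - name: viewer
      description: Read-only access to team-a Applications
      policies:
        - p, proj:team-a:viewer, applications, get, team-a/*, allow
      groups:
        - team-a-devs                    # hypothetical SSO group
```

An Application that references this project but targets another namespace or repository is rejected by ArgoCD before any manifest is applied.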
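RBAC in ArgoCD is its own model: policies are defined in argocd-rbac-cm. A policy like p, role:admin, applications, *, */*, allow gives full access to applications. p, role:viewer, applications, get, team-a/*, allow allows read-only access to team-a apps. Map OIDC groups to roles with g lines in the same policy (for example g, team-a-devs, role:viewer, where team-a-devs is a hypothetical group name); the OIDC provider itself is configured through oidc.config in argocd-cm.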
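A minimal sketch of such a policy in argocd-rbac-cm is shown below. The role name team-a-viewer and the group names are hypothetical; role:admin is one of ArgoCD's built-in roles.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.csv: |
    # p, <role>, <resource>, <action>, <project>/<application>, <effect>
    p, role:team-a-viewer, applications, get, team-a/*, allow
    # g, <SSO group>, <role>
    g, team-a-devs, role:team-a-viewer
    g, platform-admins, role:admin
```

Group names must match the values your identity provider puts in the groups claim exactly, including case.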
For secrets management, ArgoCD integrates with SOPS, SealedSecrets, or Vault. The recommended pattern: commit encrypted secrets to Git, decrypt them in the repo server using a KMS key or age private key stored in Kubernetes secrets. Never put plaintext secrets in Git.
- sourceRepos restriction prevents deploying from a fork that contains malicious manifests.
- sourceRepos to your organisation's Git org. Never allow https://github.com/*: one malicious fork is all it takes.
- argocd cluster add. Use ApplicationSets with cluster generator.
- sourceRepos, destinations, and RBAC roles. No sharing of Applications between Projects.
The Auto-Heal That Wiped Production During a PagerDuty Incident
During the incident, an engineer ran kubectl scale deploy api --replicas=0 to stop traffic while investigating. Within 1 minute, the deployment scaled back to 3 replicas. They scaled down again; it scaled back up. The team thought the incident was automated chaos until someone noticed ArgoCD logs showing "successfully synced" every 60 seconds. The team spent 20 minutes fighting their own tooling.
The Application had auto-sync with self-heal enabled (syncPolicy.automated) but lacked syncPolicy.retry config and syncPolicy.syncOptions = ["ApplyOutOfSyncOnly=true"]. The manual scale-down was treated as drift. ArgoCD's controller compared live state with Git state, saw the replica count mismatch (0 vs 3), and reverted it on every reconciliation (roughly every 60 seconds in this cluster; the default interval is 3 minutes). The team had no way to pause ArgoCD for a single Application without temporarily editing the Application resource, which they couldn't do while debugging under pressure.
One change was to treat self-heal (automated: selfHeal: true) as a setting separate from auto-sync. The root fix, though, was an incident runbook step: kubectl patch application api -n argocd --type merge -p '{"spec":{"syncPolicy":{"automated":null}}}' to disable auto-sync temporarily. Beyond that, the team moved production to ArgoCD's manual sync policy, used feature flags to disable sync during incidents, documented an app-pause workflow in the team runbook, and installed ArgoCD Notifications to alert when a sync overrides manual changes.
- Auto-heal without an incident suspension mechanism turns your GitOps tool into an adversary during outages. Always have a documented way to pause sync.
- Production should use manual sync, or enable automated: prune: true, selfHeal: true only with pre-commit hooks that validate manifests.
- Keep critical apps on manual sync by leaving syncPolicy.automated unset, or set automated: {} on prod (auto-sync without prune or self-heal) and trigger syncs via webhook only on tagged releases.
- Monitor for sync-related events: argocd app get <app> --refresh shows when the last sync overrode something. Alert if syncs happen outside planned deployment windows.
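The manual-sync option from the lessons above can be sketched as an Application that pins a release tag and has no automated block. The application name, repository URL, path, and tag are hypothetical placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/deploy-config.git  # assumed repo
    targetRevision: v1.2.3          # pinned tag; bump it in Git to release
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: api
  syncPolicy:
    # No "automated" block: ArgoCD reports drift but never applies it on its own,
    # so a manual scale-down during an incident is left alone.
    syncOptions:
      - ApplyOutOfSyncOnly=true
```

With this shape a kubectl scale during an outage shows up as OutOfSync in the UI and stays in place until someone deliberately runs argocd app sync.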
- kubectl describe app <app> -n argocd. Look at 'Conditions' and 'Reconciliation ID'. The resource might be excluded via resource.exclude: <kind> in argocd-cm. Also check for CompareOptions: IgnoreExtraneous if fields are being added by webhooks.
- argocd app manifests <app> to see if manifests resolve. Check network policies blocking ArgoCD to Git (port 443). Delete the pod of the Application Controller to force requeue: kubectl delete pod -n argocd argocd-application-controller-0.
- kubernetes finalizer on namespaces) block deletion. Use --prune-propagation-policy=foreground in sync options. Check if resource is protected by another app (fluentd-logging might be shared). Set prune: false for resources that should persist across app deletion.
- argocd-server service is reachable from internet (or use a webhook proxy like Smee). Ensure webhook secret in argocd-secret matches. Use argocd app sync --force --prune for manual push if webhook broken.
- argocd-server cluster role. For cross-cluster deployments, ensure cluster credentials in argocd-cm and argocd-ssh-known-hosts-cm. Use argocd cluster add <context> to generate correct RBAC.
- argocd app sync <app> --force.
Key takeaways
- kubectl patch when you need emergency overrides.
Common mistakes to avoid
Mutating resources directly with kubectl while auto-heal is enabled
kubectl patch app <name> -n argocd --type merge -p '{"spec":{"syncPolicy":{"automated":null}}}'. Re-enable after incident: kubectl patch app <name> -n argocd --type merge -p '{"spec":{"syncPolicy":{"automated":{"selfHeal":true,"prune":true}}}}'. Better: create an emergency 'break-glass' override procedure documented in runbook.
Putting secrets in Git without encryption
SealedSecret CRD. Use SOPS with age or KMS: commit encrypted .sops.yaml files, decrypted by ArgoCD repo server using a private key stored in Kubernetes secret. Never put plaintext secrets in Git. Ever.
Not setting prune: true, leading to resource leaks
syncPolicy.automated.prune: true in production. For resources that should persist (persistent volumes), use prune: false or exclude them via resource.exclude. Use syncOptions: [PrunePropagationPolicy=foreground] to ensure dependent resources are deleted first.
Applying raw YAML without Helm or Kustomize — no environment overrides
values-dev.yaml, values-prod.yaml or Kustomize with overlays. ArgoCD can point to path: overlays/production. Use helm valueFiles parameter in Application spec to pick correct values file per environment.
Forgetting to set finalizers on Application CRs
metadata.finalizers: [resources-finalizer.argocd.argoproj.io] to every Application. When you delete the Application, ArgoCD will first prune all managed resources, then remove the finalizer and delete the CR. This is the default in newer ArgoCD versions but check it.
Interview Questions on This Topic
Explain the difference between `selfHeal` and `prune` in ArgoCD sync policy — and why you might disable selfHeal in production.
prune controls whether ArgoCD deletes resources that exist in the cluster but are not present in Git. If disabled, removing a Deployment from Git leaves it running in the cluster (resource leak). selfHeal controls whether ArgoCD reverts manual changes made directly to the cluster (e.g., kubectl edit deployment). If enabled, any drift detected in the next reconciliation loop is overwritten with the Git state. In production, you might disable selfHeal temporarily during incident response: if a pod is crashing and you need to scale down manually to stop the crash loop, selfHeal would immediately revert the scale-down. Instead, you disable selfHeal, make your operational changes, investigate, then re-enable it. For long-running production, selfHeal is usually enabled to enforce Git as the absolute source of truth, but you need a documented process to suspend it during emergencies.
Frequently Asked Questions