I pushed a broken Helm chart on a Friday afternoon last October. Not my finest moment. The cluster had three microservices that needed coordinated config updates, I was in a hurry, and I did what I always did: SSH into the jump box, run a bunch of kubectl apply -f commands, and hope for the best. The deploy partially succeeded, partially failed, and I spent the next two hours reverse-engineering what state the cluster was actually in versus what I thought I’d applied.
That was the last time I managed Kubernetes deployments that way.
Our team is small — four engineers, two clusters (staging and prod), maybe 15 microservices across them. Not massive scale, but enough complexity that “just apply the YAML manually” stopped being a viable strategy months before I finally admitted it. ArgoCD was something I’d been circling for a while, read a dozen tutorials about, and kept putting off because it looked complicated. Setting it up properly took me about two weeks of on-and-off work. This post is what I wish had existed during that time.
Why kubectl apply Gets Messy Fast
The core problem isn’t kubectl apply itself — it’s the gap between what you think your cluster looks like and what it actually looks like. Every manual apply command is an undocumented change. Someone runs kubectl edit to fix something urgent at 2am (speaking from experience), and now your Git repo and your cluster have silently diverged. You won’t notice until something breaks.
GitOps flips the deployment model: your Git repository is the single source of truth for cluster state, and instead of pushing changes directly to the cluster, you push to Git and let the cluster pull them in. ArgoCD watches a repo, compares what’s there to what’s running, and either alerts you about drift or automatically syncs the cluster back to match.
There are other tools in this space — Flux is the main alternative, and honestly, Flux has a more “Kubernetes-native” feel if you’re already deep in the operator pattern. I went with ArgoCD because the web UI is genuinely useful, the team could see app health at a glance without knowing the CLI, and there’s a large enough community that most weird issues I hit had an existing GitHub issue or discussion. Your mileage may vary.
Installing ArgoCD Without Shooting Yourself in the Foot
ArgoCD v2.11 is what I’m running as of this writing. The install is actually pretty straightforward — the tricky part is the configuration decisions that come right after.
```shell
# Create the namespace first
kubectl create namespace argocd

# Install using the stable manifest
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.11.0/manifests/install.yaml

# Wait for the argocd-server deployment to become available (takes 2-3 minutes)
kubectl wait --for=condition=available --timeout=300s deployment/argocd-server -n argocd
```
Once that’s done, you need to get to the UI. For local dev or initial setup, port-forwarding works fine:
```shell
kubectl port-forward svc/argocd-server -n argocd 8080:443
```
The initial admin password is auto-generated and stored in a secret:
```shell
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d; echo
```
Log in at https://localhost:8080 with username admin. First thing you should do after logging in is change that password. Second thing: delete the initial secret. The docs tell you this, but I glossed over it the first time.
For production, you’ll want a proper ingress instead of port-forwarding. I use ingress-nginx with cert-manager for TLS, and there’s a small but important detail — ArgoCD serves its API and UI over HTTPS by default, and if your ingress is also doing TLS termination, you’ll get a redirect loop unless you either disable TLS on the ArgoCD server or configure the ingress to pass through to HTTP. The ArgoCD docs cover both options. I went with the --insecure flag on the argocd-server deployment plus TLS termination at the ingress, which is the more common setup.
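Here's roughly what that ingress looks like for my setup — a sketch, not a drop-in config. The hostname and cert-manager issuer are placeholders, and it assumes you've already enabled `server.insecure: "true"` in the argocd-cmd-params-cm ConfigMap so argocd-server speaks plain HTTP behind the proxy:

```yaml
# TLS terminates at ingress-nginx; argocd-server runs insecure behind it.
# Replace the host and cluster-issuer with your own.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: argocd-server
  namespace: argocd
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - argocd.example.com
      secretName: argocd-server-tls
  rules:
    - host: argocd.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: argocd-server
                port:
                  number: 80
```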
One thing I didn’t anticipate: resource limits. By default ArgoCD doesn’t set CPU/memory limits, and in a busy cluster the repo-server pod can get quite hungry when it’s rendering a lot of Helm charts. I ended up adding limits after argocd-repo-server got OOMKilled on a day we were doing a lot of syncs. Add resource requests and limits to your deployment if you’re running this anywhere near production.
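As a starting point, a strategic-merge patch on the repo-server Deployment looks something like this. The numbers below are illustrative guesses, not recommendations — profile your own repo-server under load before picking values:

```yaml
# Patch for the argocd-repo-server Deployment. Tune requests/limits to
# the number and size of charts you render; these values are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-repo-server
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              memory: 1Gi
```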
Defining Your First Application (Where the Real Decisions Are)
ArgoCD’s core concept is the Application custom resource — a Kubernetes object that tells ArgoCD “watch this Git repo at this path, and deploy it to this cluster/namespace.” Here’s a minimal example for a Helm chart:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-api
  namespace: argocd
  # This finalizer causes ArgoCD to cascade-delete all managed resources
  # when you delete this Application object — omit it if you want the
  # workloads to survive Application deletion
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/your-repo
    targetRevision: HEAD
    path: deployments/my-api
    helm:
      valueFiles:
        - values.yaml
        - values-prod.yaml  # environment-specific overrides
  destination:
    server: https://kubernetes.default.svc  # in-cluster
    namespace: production
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
    automated:
      prune: true     # delete resources removed from Git
      selfHeal: true  # revert manual cluster changes
```
The syncPolicy.automated section is where people get tripped up — including me. selfHeal: true means if someone manually edits a resource in the cluster, ArgoCD will overwrite it back to match Git within a few minutes. This is usually what you want (it’s the whole point), but it can cause confusion if your team isn’t fully bought into the workflow yet. I’d recommend setting it to manual sync at first — just syncPolicy: {} with no automated block — so you can watch how ArgoCD detects drift before letting it auto-correct.
prune: true is similarly powerful. If you remove a resource from your Git repo, ArgoCD will delete it from the cluster. I left this off initially because deleting production resources automatically felt scary, and I wanted to review the prune operations manually first. Reasonable caution, in retrospect.
The actual progression I’d suggest: start with no automated sync policy. Merge some changes to Git. Watch ArgoCD report “OutOfSync” in the UI. Click sync manually. Get comfortable with what it’s showing you. After a week of that, turn on automated sync without prune. After another week, consider prune — but only if your team is actually disciplined about removing stale resources from the repo.
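The manual-sync phase maps to a handful of argocd CLI commands. The app name here is a placeholder:

```shell
# See exactly what would change before touching the cluster
argocd app diff my-api

# Check current sync and health status
argocd app get my-api

# Apply the change once you're happy with the diff
argocd app sync my-api

# Later, when you're ready for step two of the progression:
# enable automated sync (still without prune)
argocd app set my-api --sync-policy automated
```

The same inspection is available in the UI, but the CLI diff is handy in CI or when you're already in a terminal.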
Sync Policies, Health Checks, and the Stuff That Bit Me
ArgoCD has built-in health checks for common Kubernetes resources — Deployments, StatefulSets, Services, Ingresses. It knows a Deployment is “Healthy” when its rollout finishes. That’s more useful than it might sound: you can see at a glance whether your deploy actually completed or just got submitted to the API server.
For custom resources (CRDs that ArgoCD doesn’t understand natively), it defaults to “Healthy” as soon as the resource exists — which is wrong if your custom resource has its own status conditions. You can write custom health check scripts in Lua if you need this. I needed it for a CertificateRequest resource from cert-manager, and the Lua API is actually not bad once you figure out the shape of the object you’re working with.
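Custom health checks live in the argocd-cm ConfigMap. Here's a sketch along the lines of what I ended up with — it assumes the resource reports a `Ready` condition the way cert-manager resources do, so adapt the condition names to whatever your CRD actually publishes:

```yaml
# Lua health check for a CRD, keyed as <group>_<kind> under
# resource.customizations.health. Condition names are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.cert-manager.io_CertificateRequest: |
    hs = {}
    hs.status = "Progressing"
    hs.message = "Waiting for certificate request"
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, condition in ipairs(obj.status.conditions) do
        if condition.type == "Ready" and condition.status == "True" then
          hs.status = "Healthy"
          hs.message = condition.message
        end
        if condition.type == "Ready" and condition.status == "False" then
          hs.status = "Degraded"
          hs.message = condition.message
        end
      end
    end
    return hs
```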
Here’s a mistake that genuinely cost me an afternoon. I had a Helm chart that generated some resources conditionally based on values. When I first synced it, ArgoCD showed the app as Synced and Healthy. Great. Then I changed a value that should have removed one of those conditional resources. I committed, pushed, ArgoCD synced — and the old resource was still there. Took me a while to realize: I hadn’t enabled prune: true, and the old resource was just sitting there, orphaned, not causing any obvious issues but also not matching my intent.
The other thing that surprised me was how ArgoCD handles Helm hooks. By default, ArgoCD runs Helm hooks during sync — and it treats the hook’s success or failure as part of the sync result. If you have a pre-upgrade hook that runs a database migration and it fails, your entire sync fails and ArgoCD reports the app as degraded. This is actually the correct behavior, but I’d been used to running migrations separately, and this forced me to think more carefully about hook design. Worth testing in staging first before you find out in prod.
One more thing: private Git repos. You’ll need to add your repo credentials to ArgoCD via the UI (Settings → Repositories) or via a secret. For GitHub, I use a deploy key (read-only SSH key) per repo — works reliably. Some teams use a GitHub App for broader access, which is probably the better call if you’re managing a lot of repos, but I haven’t gotten there yet.
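If you'd rather manage the credentials declaratively than click through the UI, a repo is just a Secret with a special label. A sketch with placeholder names — the private key is your read-only deploy key:

```yaml
# ArgoCD discovers this via the argocd.argoproj.io/secret-type label.
apiVersion: v1
kind: Secret
metadata:
  name: repo-your-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  type: git
  url: git@github.com:your-org/your-repo.git
  sshPrivateKey: |
    -----BEGIN OPENSSH PRIVATE KEY-----
    ...
    -----END OPENSSH PRIVATE KEY-----
```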
Multi-Cluster and App-of-Apps (Once You Outgrow the Basics)
The setup I described above works fine for a few apps on one cluster. When I added a second cluster (staging), things got more interesting.
ArgoCD runs in one cluster and can manage resources on other clusters. You register external clusters using the CLI:
```shell
argocd cluster add my-staging-context --name staging
```
This installs a ServiceAccount and ClusterRole in the target cluster, and stores the credentials in the ArgoCD namespace. Which brings up an access control question: who can tell ArgoCD to deploy to prod? By default, anyone with access to the argocd namespace can add clusters and deploy anywhere. ArgoCD Projects let you restrict which repos can deploy to which clusters — I set up separate projects for staging and prod with different teams having sync access to each.
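A restricted project is a small amount of YAML. This sketch (names are placeholders) allows one repo to deploy only to the production namespace of the in-cluster destination; anything else fails at sync time:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: prod
  namespace: argocd
spec:
  description: Production workloads
  # Only this repo may be used as a source
  sourceRepos:
    - https://github.com/your-org/your-repo
  # Only this cluster/namespace may be a destination
  destinations:
    - server: https://kubernetes.default.svc
      namespace: production
```

Applications then reference `project: prod` instead of `project: default`, and ArgoCD's RBAC rules can be scoped per project.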
The “App of Apps” pattern is something I’d been skeptical of — sounds like needless indirection — but once we had 12+ applications, it made sense. You create one Application that points to a directory of other Application manifests. ArgoCD syncs that, which creates all the child Applications, which then sync their actual workloads. The whole cluster state is defined declaratively in Git, including which applications exist. Removing an app means deleting its Application manifest from the repo, and ArgoCD handles the cleanup.
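The root application itself is unremarkable — it's just an Application whose source path is a directory full of other Application manifests. A sketch, with placeholder paths:

```yaml
# Root "app of apps": syncing this creates the child Applications,
# which then sync their own workloads.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/your-repo
    targetRevision: HEAD
    path: apps  # directory containing child Application YAMLs
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd  # child Applications live in the argocd namespace
  syncPolicy:
    automated:
      prune: true  # deleting a child manifest from Git removes the app
```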
I’m not going to pretend I have this perfectly figured out. The bootstrapping problem — how do you deploy ArgoCD itself using ArgoCD? — is a real philosophical puzzle I resolved by keeping the ArgoCD install in a separate “platform” repo managed by Terraform. Maybe there’s a cleaner way.
What I’d Actually Recommend
If you’re starting from scratch with a small-to-medium Kubernetes setup: install ArgoCD, point it at your existing Helm charts or plain manifests, and run in manual sync mode for two to four weeks. The sync diff view alone — seeing exactly what would change before you apply it — is worth the setup cost.
Don’t turn on automated sync with self-healing and pruning on day one. Get your team comfortable with the mental model first. GitOps requires discipline: every cluster change goes through Git, full stop. If half your team is still running kubectl edit for quick fixes, automated sync will feel hostile rather than helpful.
The web UI is genuinely good for cluster visibility, but the real value I’ve gotten is audit trail and rollback. Every deployment is a Git commit. Rolling back is git revert. Knowing exactly what changed between two states is git diff. After years of kubectl apply chaos, that’s worth a lot.
One last thing: spend some time on the ArgoCD notifications controller if you want Slack or email alerts on sync failures. It’s a separate component, slightly under-documented, but once it’s set up you get instant alerts when an app goes OutOfSync or Degraded. The alternative is watching the UI all day, which nobody actually does.
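For flavor, here's the rough shape of a Slack setup in the argocd-notifications-cm ConfigMap. The trigger and template names are my own, and `$slack-token` refers to a key you put in argocd-notifications-secret — treat this as a starting sketch against the notifications docs, not a tested config:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  service.slack: |
    token: $slack-token
  trigger.on-health-degraded: |
    - when: app.status.health.status == 'Degraded'
      send: [app-degraded]
  template.app-degraded: |
    message: |
      Application {{.app.metadata.name}} is degraded.
```

Individual Applications opt in with an annotation like `notifications.argoproj.io/subscribe.on-health-degraded.slack: deploys`, where `deploys` is your channel name.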
The two-week investment was worth it. The Friday afternoon I pushed a bad config last month, I reverted the Git commit, ArgoCD detected the change within 30 seconds, and the old version was back up in under two minutes. That’s the version of the story I prefer.