
Part 44: When to Nuke and Rebuild

"The cluster is the cache. Caches that are wrong should be evicted, not patched."


Why

Part 43 showed how to recreate a cluster from declarative state. This part is about when — the operational decision tree of "fix this in place" vs "nuke it and rebuild". The answer is more often "nuke and rebuild" than developers expect, because the recreate is fast and the debug session is unbounded.

The thesis: the cost of recreation is fixed at ~30 minutes. The cost of debugging an unfamiliar cluster failure is unbounded. If the time to debug exceeds the time to rebuild, rebuild is the right call. The decision is a calculation, not a heroism contest.


The decision tree

[Diagram: the decision tree. The cost of recreate is a fixed thirty minutes; the cost of debugging an unfamiliar failure is unbounded. The tree is a calculation, not a heroism contest.]

The tree has two leaves: fix-in-place and recreate. The route to each is short.

For dev clusters, the answer is overwhelmingly "recreate". You do not need to learn why your dev cluster broke if learning takes longer than recreating. The dev cluster's job is to be available for development, not to be a learning lab for k8s internals (unless that is specifically the day's task).

For prod clusters, the answer requires investigation. A prod cluster outage is a learning opportunity (postmortem) and a "do not lose data" constraint at the same time. But even in prod, the recreate path is sometimes the right one — if the only alternative is a 6-hour debug session and you have a tested backup that is less than an hour old.


The math

Time to recreate: ~30 minutes (Part 43). This is fixed and known.

Time to debug an unfamiliar failure: unbounded. Some failures are 5 minutes (kubectl describe pod reveals the missing image). Some are 4 hours (CNI configuration drift, requires reading two operator changelogs and three GitHub issues). The average unfamiliar debug is 60-90 minutes for a moderately experienced operator.
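The first 5-10 minutes of that debug window tend to look the same regardless of the failure. A minimal triage pass, sketched with illustrative namespace and label names (swap in your own):

```shell
# Hedged triage sketch -- namespace and labels are illustrative.
# Step 1: what does the cluster think happened, most recent last?
kubectl get events -n globex-prod --sort-by=.lastTimestamp | tail -20

# Step 2: scheduling, image-pull, probe, and limit problems surface here.
kubectl describe pod -n globex-prod -l app=gateway

# Step 3: the crash itself. --previous shows the prior container's
# logs even after the restart loop has replaced it.
kubectl logs -n globex-prod -l app=gateway --previous --tail=50
```

If these three commands do not name the failure, you are past the cheap part of the debug curve.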

Decision rule: if you have any uncertainty about the failure mode after 5-10 minutes of kubectl describe and log inspection, recreate is the cheaper path on the median.
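The rule is literally arithmetic. A toy sketch of it (the `decide` helper and the numbers are illustrative, not a real tool; the inputs are your own estimates after the triage pass):

```shell
# Toy sketch of the decision rule. REBUILD_MIN is the fixed recreate
# cost from Part 43; the estimate and confidence are your inputs after
# 5-10 minutes of kubectl describe and log inspection.
REBUILD_MIN=30

decide() {
  est_debug_min=$1   # best-guess debug time in minutes
  confident=$2       # "yes" only if you know the failure exactly
  if [ "$confident" = "yes" ] && [ "$est_debug_min" -lt "$REBUILD_MIN" ]; then
    echo "fix-in-place"
  else
    echo "recreate"
  fi
}

decide 3 yes    # known OOM limit fix -> fix-in-place
decide 90 no    # unfamiliar CNI drift -> recreate
```

Note that uncertainty alone flips the answer: an optimistic 20-minute estimate you are not confident in still routes to recreate.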


When to fix in place

The cases where fix-in-place beats recreate:

  1. You know the failure exactly. A pod is OOMing because the limit is too low. Edit the manifest in the GitOps repo, push, ArgoCD reconciles. Total time: ~3 minutes.
  2. The failure is a recent change you made. Revert the commit. ArgoCD reconciles back to the previous state. Total time: ~5 minutes.
  3. Recreate would lose data that backup does not capture. Some workloads have in-memory state (a long-running Spark job, a caching layer with 30 minutes of warmth). If the data is not in any PVC and not in any backup, recreate destroys it.
  4. The failure is in cert-manager or similar and is trivially diagnosable. A missing ClusterIssuer is one kubectl get clusterissuer away from obvious.
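Cases 1 and 2 share the same mechanics: the fix is a commit, and ArgoCD does the rest. A sketch of case 2, assuming the manifests live in a GitOps repo that ArgoCD watches (the repo path and app name are illustrative):

```shell
# Case 2: the failure is your own recent change -- revert it.
cd ~/gitops-repo                 # illustrative path
git revert --no-edit HEAD        # undo the last commit as a new commit
git push

# ArgoCD reconciles on its own poll interval; to force it now:
argocd app sync globex-prod      # illustrative app name
```

The revert-as-new-commit shape matters: the GitOps history stays append-only, so the broken state remains visible in the log for the postmortem.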

For everything else, the recreate path is faster than the debug path on average.


When to recreate

The cases where recreate is the right answer:

  1. etcd corruption. If etcd is in a bad state, recovering it in place is harder than rebuilding the cluster from declarative state. Recreate.
  2. Half-finished kubeadm upgrade. If kubeadm upgrade apply failed midway and left the control plane at one version and the workers at another, the recovery is complex. Recreate.
  3. CNI catastrophic failure. A botched Cilium upgrade that breaks pod-to-pod networking and leaves CRDs in an inconsistent state. Recreate.
  4. You inherited the cluster. Someone else built it, you do not know the history, it is broken. Spend 10 minutes understanding what is broken; if you cannot, recreate.
  5. The cluster has been up for a year and you are no longer sure what is in it. Drift accumulates. Recreate is the cheapest way to get back to a known state.
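The recreate path itself is the Part 43 flow with a backup check bolted on the front. A sketch (the `homelab` wrapper is this series' CLI; the Velero commands are standard, and the backup name is illustrative):

```shell
# 0. Confirm the escape hatch is real before pulling the lever:
velero backup get                    # is the latest backup Completed?

# 1. Tear down and rebuild from declarative state (Part 43):
homelab k8s recreate

# 2. Restore workload state from the most recent backup:
velero restore create --from-backup nightly-latest   # illustrative name
```

Step 0 is the whole point of the next section: if you cannot answer it with a yes, you do not actually have a recreate path.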

What "tested backup" means in this context

The recreate path depends on a backup that is known to be restorable. The Velero restore-test from Part 42 is what makes this true. If the restore test passed last night, the backup from last night is recoverable, and the recreate path is safe.

If the restore test has not passed recently, the recreate path is risky — you might get a fresh cluster with a backup that does not restore cleanly, and now you have two problems instead of one.

This is why the restore-test job is non-negotiable. Without it, the recreate option is theoretical. With it, the recreate option is a known-good escape hatch.
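The check itself is cheap. A sketch that grades the recreate option from the last restore-test result — the JSON shape follows Velero's `status.phase` convention, but here a canned result stands in for the live query so the logic is visible on its own:

```shell
# In a live cluster this would come from:
#   velero restore get -o json
# A canned result stands in here (illustrative restore name).
last='{"metadata":{"name":"restore-test-nightly"},"status":{"phase":"Completed"}}'

# Completed phase -> the recreate option is a known-good escape hatch.
case "$last" in
  *'"phase":"Completed"'*) status="armed" ;;
  *)                       status="theoretical" ;;
esac
echo "recreate path: $status"
```

Anything other than Completed demotes recreate from "escape hatch" to "gamble", and the decision tree should route accordingly.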


A worked decision

Scenario: it is Tuesday, 14:30. The freelancer is working on Globex. kubectl get pods -n globex-prod shows that the gateway pod has been restarting in a loop for 20 minutes. kubectl logs gateway shows a NullPointerException in a function the freelancer does not recognize.

Investigation (5 min): the stack trace points into a Kafka consumer library. The freelancer does not maintain that library. The library was upgraded yesterday by Renovate.

Options:

  • Fix in place: roll back the library version in the Maven dependencies, push, wait for CI to rebuild, ArgoCD redeploys. Total time: ~15 minutes (mostly waiting for the build).
  • Recreate: nope. The library version is in the image, not in the cluster manifests. Recreate would not help because the image still has the bad library.

The right answer is fix-in-place, because the failure is not a cluster-state failure. It is a workload bug. Recreate cannot fix workload bugs.

The decision tree only applies to cluster-state failures (etcd, CNI, CSI, RBAC drift, kubeadm upgrade fallout). Workload bugs are always fix-in-place via the normal CI loop.


What this gives you that "always debug" doesn't

Always-debug is the heroism path. It is also the slow path on average. Senior engineers know that the best operators are not the ones who solve every problem in place — they are the ones who know when to give up on debugging and reach for the recreate verb.

Having a fast, tested recreate path gives you, for the same surface area:

  • A real escape hatch for the failures that are not worth debugging
  • A 30-minute upper bound on cluster restoration time
  • A confidence boost for trying risky changes (you know you can roll back)
  • A decision tree the freelancer follows without ego

The bargain pays back the first time you save 90 minutes by recognizing a "this is etcd corruption" pattern at 14:35 and typing homelab k8s recreate instead of debugging until 16:30.


End of Act VII

Day-2 is covered: kubeadm upgrades, k3s upgrades, restore tests, recreation from declarative state, the decision tree. From here, Act VIII shows four real-world use cases — Spring Boot microservices, .NET API, Airflow, GPU ML training — that exercise everything we have built.

