
Part 43: Cluster Recreation from Declarative State

"The cluster is the cache. The declarative state is the source of truth."


Why

The most powerful operational guarantee a system can offer is: the running state is reconstructible from the declarative state. If your dev cluster gets corrupted, your laptop dies, or you simply want to start fresh, you should be able to type one command and have the cluster come back exactly as it was, with the same workloads, same data, same configuration.

For HomeLab K8s, the declarative state comprises three sources:

  1. config-homelab.yaml — the cluster spec (topology, distribution, version, plugin set)
  2. The GitOps repo in DevLab GitLab — every Application, every Helm chart, every overlay
  3. The latest Velero backup in MinIO — every workload's data (Postgres, MinIO objects, PVCs)
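To make the first source concrete, here is a hedged sketch of what a config-homelab.yaml cluster spec might look like. The field names, hostnames, and values are illustrative assumptions, not the tool's actual schema:

```yaml
# Hypothetical sketch of config-homelab.yaml for the acme cluster.
# Field names are illustrative; the real schema may differ.
clusters:
  acme:
    topology: k8s-multi          # 1 control plane + 3 workers
    distribution: kubeadm
    version: "1.29"
    plugins:
      - argocd
      - velero
      - longhorn
    gitops_repo: https://gitlab.devlab.local/homelab/gitops.git   # source 2
    backup_bucket: s3://minio/velero-acme                         # source 3
```

The point is that every choice the saga needs (topology, version, plugin set, where to find git and the backups) is answerable from this one file plus the two repositories it points at.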

The thesis: homelab k8s recreate <cluster> is one verb that destroys the existing cluster, re-runs k8s create from config-homelab.yaml, lets ArgoCD reconcile the GitOps repo, and restores the latest Velero backup. Total time: ~30 minutes for a k8s-multi topology. The cluster ends up in the same state it was in before the destruction.


The flow

Diagram: cluster recreation is one compensable saga over six steps (vagrant, kubeadm, ArgoCD, Velero), ending in a cluster indistinguishable from the one that was destroyed.

The saga is six steps. Each step has a compensation that brings the cluster back toward the previous state. The compensation chain on failure is intentional: if step 5 (ArgoCD bootstrap) fails, the saga rolls back to "no cluster at all" rather than leaving a half-bootstrapped one. The user can re-run homelab k8s recreate and start over.
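The compensation chain can be sketched as a generic saga runner: on failure, run the compensations of every completed step in reverse order, rolling back toward "no cluster at all". This is a minimal illustration with hypothetical step names, not the tool's actual implementation:

```python
# Minimal compensable-saga sketch. Step names and the toy actions are
# hypothetical; the real `homelab` saga runner may differ.

class SagaFailure(Exception):
    pass

def run_saga(steps):
    """Run (name, action, compensation) triples in order.

    If any action fails, run the compensations of all previously
    completed steps in reverse, then re-raise as SagaFailure.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            completed.append((name, compensate))
        except Exception as exc:
            for _done, comp in reversed(completed):
                comp()  # best-effort rollback of each finished step
            raise SagaFailure(f"step {name!r} failed: {exc}") from exc

def fail(msg):
    raise RuntimeError(msg)

log = []
steps = [
    ("destroy-vms", lambda: log.append("destroy"), lambda: None),
    ("boot-vms",    lambda: log.append("boot"),    lambda: log.append("undo-boot")),
    ("bootstrap",   lambda: fail("kubeadm failed"), lambda: None),
]
try:
    run_saga(steps)
except SagaFailure:
    pass  # log is now ["destroy", "boot", "undo-boot"]
```

Because the rollback always runs to completion, a failed recreate leaves no half-bootstrapped cluster behind, and re-running the verb starts from a clean slate.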


Why this works

Three properties of HomeLab K8s make recreation feasible:

  1. The declarative state is complete. Every choice (topology, distribution, plugin set, manifest set, Helm release set, secret references) lives in config-homelab.yaml + the GitOps repo. There is no hidden state in the cluster except the workload data, and the workload data lives in Velero backups.
  2. The Velero backups include PVC contents. Postgres data, MinIO objects, Longhorn volumes — all of it. Restoring the backup brings the data back. The workloads come up and find their state where they expect.
  3. ArgoCD is the deployment mechanism. When ArgoCD reconciles a fresh cluster against the GitOps repo, it creates every workload with the right manifest. The workloads are not restored from backup; they are deployed from git, and then their PVCs are populated from backup.

The two-step pattern (deploy from git → restore PVCs) is the key. If you tried to restore everything from backup, you would replay any manifest drift the backup captured and miss any change committed to git after the backup. If you tried to deploy only from git, the workloads would come up with empty databases. The combination gets it right: manifests come from git (always current), data comes from backup (most recent).
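The split shows up in how the restore is scoped. A hedged sketch, using flags from the standard velero CLI (the wrapper function itself is hypothetical): only volume objects come back from backup, because everything else has already been deployed from git by ArgoCD.

```python
# Sketch of scoping the Velero restore to data only. The flags are
# standard velero CLI flags; the wrapper function is hypothetical.

def velero_restore_cmd(backup_name: str) -> list[str]:
    """Build a restore command that brings back only volume data.

    Manifests (Deployments, Services, ...) are deliberately excluded:
    ArgoCD has already reconciled them from git, so the backup should
    only repopulate the PVCs those workloads mount.
    """
    return [
        "velero", "restore", "create",
        "--from-backup", backup_name,
        "--include-resources", "persistentvolumes,persistentvolumeclaims",
        "--wait",
    ]

cmd = velero_restore_cmd("daily-acme-prod")
print(" ".join(cmd))
```

Restricting --include-resources this way is what prevents the backup from replaying stale manifests over the ones git just deployed.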


What the user types

$ homelab k8s recreate acme
WARNING: this will DESTROY the acme cluster (4 VMs, all data) and rebuild from declarative state.
The latest Velero backup will be restored to bring data back.
Latest backup: daily-acme-prod (2026-04-20 02:00, 1.4 GB, age: 4h 12m)

Continue? [y/N] y

[1/6] Destroying VMs ... ✓ (45s)
[2/6] Booting fresh VMs ... ✓ (3m 12s)
[3/6] Bootstrapping cluster ... ✓ (4m 30s)
[4/6] Installing platform components ... ✓ (8m 15s)
[5/6] Bootstrapping ArgoCD App-of-Apps ... ✓ (3m 50s)
[6/6] Restoring Velero backup ... ✓ (7m 22s)

✓ acme recreated successfully
  total elapsed: 27m54s
  workloads: 47/47 healthy
  cluster age: <1 minute (recreated)
  data age: 4h 12m (from backup)

28 minutes from "the cluster is gone" to "the cluster is back with all the data". For a real disaster recovery scenario, this is fast — most production teams measure DR exercises in hours or days.


What this enables

Three operational use cases:

  1. Recover from a broken cluster. A bad upgrade, a corrupted etcd, a misconfigured CRD that locked the API server. Instead of debugging, you recreate.
  2. Migrate to a new topology. Want to switch Acme from k8s-multi to k8s-ha? Edit config-homelab.yaml, run homelab k8s recreate. The new topology comes up with the old data.
  3. Onboard a new machine. A colleague gets the same hardware setup and git clones the HomeLab repo. They run homelab k8s recreate acme. They have the same cluster the original developer had, with the same data (assuming they have access to the MinIO backup bucket).

What this gives you that "rm -rf and rebuild" doesn't

Manual rebuild is vagrant destroy && vagrant up && kubeadm init && kubeadm join && kubectl apply -f manifests/ && hope. It produces a cluster with empty databases. The workloads come up but the data is gone.

Cluster recreation from declarative state gives you, for the same surface area:

  • One verb for the whole flow
  • Mandatory backup verification before destroy (the saga checks that a recent backup exists)
  • Two-source state restoration (git for manifests, Velero for data)
  • Compensation on failure that leaves the system in a known state
  • Tested via the restore-test from Part 42, which is essentially the same flow

The bargain pays back the first time you trash a cluster intentionally to validate the recovery story.

