
Part 42: Velero Restore Into a Throwaway Cluster

"A backup that has not been restored is hopeful tarball. Restored once a week into a clean cluster, it is insurance."


Why

Part 32 installed Velero and configured daily backups. The backups land in MinIO. They are encrypted. They have manifests and PVC contents. They look fine in velero backup get.

None of that matters. The only proof a backup is useful is restoring it and seeing the workloads come up healthy. Production teams that skip routine restore tests discover, six months in, that their backups have been silently corrupt the entire time — usually because a chart upgrade changed a CRD field and the old backup uses the old field name. Restore tests catch this within a week.

The thesis: K8s.Dsl ships two restore-test mechanisms. The in-cluster one (a CronJob in the velero namespace, restoring into a throwaway namespace inside the same cluster) is fast and validates basic restore. The cross-cluster one (a HomeLab CI job that spins up a fresh ephemeral HomeLab instance, restores the backup into it, verifies, destroys) is slow but exercises the full disaster recovery path.


In-cluster restore test

The in-cluster CronJob from Part 32 runs weekly. It restores acme-prod into acme-restore-test, verifies a known pod comes up, and cleans up. ~5 minutes, no dependencies, runs entirely inside the cluster. We saw the manifest in Part 32. Recap:

# What the CronJob does, in shell form
# (velero's table output is parsed with awk; `velero backup get`
# has no kubectl-style `-o name` output)
LATEST=$(velero backup get | awk '/daily-acme-prod/ {print $1}' | sort | tail -1)
RESTORE="test-restore-$(date +%Y%m%d)"
velero restore create "$RESTORE" \
    --from-backup "$LATEST" \
    --namespace-mappings acme-prod:acme-restore-test \
    --wait
kubectl wait --for=condition=Ready pod -l app=acme-api -n acme-restore-test --timeout=300s
velero restore delete "$RESTORE" --confirm
kubectl delete namespace acme-restore-test

The test catches:

  • Manifest restore failures (CRD field changed, RBAC role missing, etc.)
  • PVC content restore failures (restic can't decrypt, snapshot is corrupt)
  • Workload health failures (the pod restores but never becomes Ready, e.g. because a Secret reference was wrong)

It does not catch:

  • Cluster-level state loss (etcd corruption that the backup did not capture)
  • CRD drift between backup time and restore time (because the CRD definitions are still in the same cluster)

For those, the cross-cluster test is needed.
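
When the in-cluster test does fail, the restore object itself carries the diagnostics. A sketch of the triage commands — the restore name `test-restore-20250107` is a placeholder for whatever the CronJob generated:

```shell
# Summary of the restore: phase, warnings, errors, per-resource results
velero restore describe test-restore-20250107 --details

# Full restore log — where CRD field mismatches and RBAC denials show up
velero restore logs test-restore-20250107

# If PVC contents are suspect, check the per-volume restore objects too
kubectl get podvolumerestores -n velero
```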


Cross-cluster restore test (the dogfood loop)

The full disaster recovery exercise is: a fresh, empty cluster, with the current CRD definitions, restoring a backup taken from the original cluster. The fresh cluster is a HomeLab K8s instance spun up by CI. After the test, the instance is destroyed.

[Injectable(ServiceLifetime.Singleton)]
[VerbGroup("k8s")]
public sealed class K8sRestoreTestCommand : IHomeLabVerbCommand
{
    public Command Build()
    {
        var sourceCluster = new Argument<string>("source");
        var backupId = new Option<string?>("--backup-id");
        var topology = new Option<string>("--topology", () => "k8s-multi");

        var cmd = new Command("restore-test", "Spin up a fresh cluster, restore the latest backup, verify, destroy");
        cmd.AddArgument(sourceCluster);
        cmd.AddOption(backupId);
        cmd.AddOption(topology);

        cmd.SetHandler(async (string src, string? id, string topo) =>
        {
            var result = await _mediator.SendAsync(new K8sRestoreTestRequest(src, id, topo), default);
            _console.Render(result);
            Environment.ExitCode = result.IsSuccess ? 0 : 1;
        }, sourceCluster, backupId, topology);

        return cmd;
    }
}

[Injectable(ServiceLifetime.Singleton)]
public sealed class K8sRestoreTestRequestHandler : IRequestHandler<K8sRestoreTestRequest, Result<K8sRestoreTestResponse>>
{
    public async Task<Result<K8sRestoreTestResponse>> HandleAsync(K8sRestoreTestRequest req, CancellationToken ct)
    {
        var ephemeralName = $"restore-test-{_clock.UtcNow:yyyyMMddHHmmss}";

        // 1. Acquire a fresh HomeLab instance
        var scope = await _registry.AcquireAsync(ephemeralName, ct);
        if (scope.IsFailure) return scope.Map<K8sRestoreTestResponse>();

        try
        {
            // 2. Stand up a fresh cluster in that instance
            var createResult = await _mediator.SendAsync(
                new K8sCreateRequest(scope.Value.Name, _config.K8s!.Distribution, req.Topology, _config.K8s.Version), ct);
            if (createResult.IsFailure) return createResult.Map<K8sRestoreTestResponse>();

            // 3. Install Velero in the fresh cluster, pointing at the SAME MinIO bucket
            await _veleroInstaller.InstallAsync(scope.Value, _config.K8s.Backup!.MinioEndpoint, ct);

            // 4. Find the latest backup of the source cluster
            var latestBackup = req.BackupId ?? await _velero.GetLatestBackupAsync(req.SourceCluster, ct);

            // 5. Restore it
            var restoreResult = await _velero.RestoreAsync(latestBackup, ct);
            if (restoreResult.IsFailure) return restoreResult.Map<K8sRestoreTestResponse>();

            // 6. Verify a known healthcheck endpoint
            await Task.Delay(TimeSpan.FromMinutes(2), ct);   // give workloads time to come up
            var healthResult = await _http.GetAsync($"https://gitlab.{scope.Value.TldPrefix}.lab/-/health", ct);
            if (healthResult.IsFailure || !healthResult.Value.Contains("ok"))
                return Result.Failure<K8sRestoreTestResponse>("restore test failed: workload did not become healthy");

            await _events.PublishAsync(new RestoreTestPassed(req.SourceCluster, latestBackup, _clock.UtcNow), ct);
            return Result.Success(new K8sRestoreTestResponse(req.SourceCluster, latestBackup, "passed"));
        }
        finally
        {
            // 7. Destroy the ephemeral instance regardless of outcome.
            //    Deliberately CancellationToken.None: cleanup must run even
            //    if the incoming token was already cancelled.
            await _mediator.SendAsync(new K8sDestroyRequest(ephemeralName), CancellationToken.None);
            await _registry.ReleaseAsync(ephemeralName, CancellationToken.None);
        }
    }
}

The handler:

  1. Acquires a new HomeLab instance with a unique name and subnet
  2. Spins up a fresh k8s cluster in it (~10 minutes)
  3. Installs Velero pointing at the original MinIO (cross-cluster MinIO access)
  4. Finds the latest backup of the source cluster
  5. Restores it into the fresh cluster
  6. Verifies the workload is healthy
  7. Destroys the ephemeral instance no matter what

Total time: ~25 minutes for a k8s-multi topology. Runs nightly via cron on the HomeLab CI runner.
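
Step 3 — pointing the fresh cluster's Velero at the original MinIO bucket — is what makes the test cross-cluster. In CLI form it is roughly the following; the bucket name, endpoint, credentials, and plugin version here are illustrative, not taken from the actual setup:

```shell
# Credentials for the source cluster's MinIO (illustrative values)
cat > ./minio-creds <<'EOF'
[default]
aws_access_key_id = velero
aws_secret_access_key = <minio-secret>
EOF

# Velero >= 1.10 syntax (--use-node-agent replaced --use-restic)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero-backups \
  --secret-file ./minio-creds \
  --use-node-agent \
  --use-volume-snapshots=false \
  --backup-location-config \
      region=minio,s3ForcePathStyle=true,s3Url=https://minio.prod.lab:9000

# Once the BackupStorageLocation syncs, the source cluster's backups
# become visible in the fresh cluster and can be restored from here.
velero backup get
```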

The CI runner that runs the test is itself inside one of the HomeLab clusters (the dogfood pattern from homelab-docker Part 06). So the chain is:

  • The CI runner inside the prod cluster runs the restore test
  • The restore test creates an ephemeral cluster
  • The ephemeral cluster restores from the prod cluster's MinIO
  • The ephemeral cluster comes up healthy
  • The ephemeral cluster is destroyed
  • The CI runner reports success

If any step fails, the alert wakes the user. Dogfood loop #4 from homelab-docker, applied to k8s.
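
The nightly schedule is plain cron on the CI runner. A sketch, assuming the CLI binary is called `homelab` and the source cluster is named `prod` — the schedule, user, paths, and names are all assumptions:

```shell
# /etc/cron.d/k8s-restore-test — nightly at 02:30 (illustrative)
# cron entries must be a single line; no backslash continuations
30 2 * * * ci /usr/local/bin/homelab k8s restore-test prod --topology k8s-multi >> /var/log/restore-test.log 2>&1
```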


What this gives you that "I checked the backup file size" doesn't

Checking that the backup files exist and are non-zero size proves nothing. The standard postmortem after a failed restore is "the backup files were there, they were the right size, the restore command failed because the operator deleted a CRD field three months ago".

Together, the two restore tests give you:

  • A fast in-cluster test (~5 min) that catches manifest and PVC restore failures
  • A slow cross-cluster test (~25 min) that also catches cluster-level state loss and CRD drift
  • Periodic execution via CronJob and HomeLab CI
  • Loud failure via the event bus and Alertmanager

The investment pays off the first time the test fails on a Tuesday morning and you fix the broken backup before the day you actually need it.

