Why
We have spent forty-eight parts building DevLab. This part is about keeping it running. Day-2 operations are the unglamorous middle of any system's life: upgrades, backups, restores, cert rotations, log management, occasional service restarts, occasional alert investigations. None of them are exciting. All of them are mandatory.
The thesis of this part is: HomeLab provides a small set of typed verbs for day-2 operations, all driven by the same Ops.DataGovernance / Ops.Configuration / Ops.Observability declarations the rest of the lab already uses. The user runs homelab backup, homelab restore, homelab upgrade, homelab cert rotate, homelab logs, homelab status — and that is enough for most days.
The day-2 verbs
| Verb | Purpose | When you run it |
|---|---|---|
| homelab status | Show lab health (every service, healthcheck, recent backups, cert expiry) | Whenever you wonder |
| homelab backup run [target] | Trigger an immediate backup | After a risky change |
| homelab backup restore <id> [target] | Restore a specific backup | When something is broken |
| homelab backup verify | Run the restore-test job manually | Once a week minimum |
| homelab upgrade gitlab --to <version> | Upgrade GitLab Omnibus | When a new GitLab is out |
| homelab upgrade postgres --to <version> | Upgrade Postgres | When a new major is out |
| homelab cert rotate | Re-issue the wildcard cert | Before expiry |
| homelab logs <service> [--tail N] [--follow] | Tail logs for a service | When something is wrong |
| homelab vos snapshot create <name> | Snapshot every VM (Vagrant snapshots) | Before a risky change |
| homelab vos snapshot restore <name> | Roll back to a snapshot | After a failed change |
Ten verbs. They cover the vast majority of day-2 needs.
homelab status
The most-used verb. Aggregates everything HomeLab knows into one terminal page:
$ homelab status
DevLab — single topology — running
VMs:
  ✓ devlab-main        4 cpu / 8 GB / 50 GB    up 3d 4h
Services (14):
  ✓ traefik            healthy    up 3d 4h
  ✓ pihole             healthy    up 3d 4h
  ✓ gitlab             healthy    up 3d 3h
  ✓ gitlab-runner      healthy    up 3d 3h
  ✓ baget              healthy    up 3d 4h
  ✓ vagrant-registry   healthy    up 3d 4h
  ✓ postgres           healthy    up 3d 4h
  ✓ minio              healthy    up 3d 4h
  ✓ meilisearch        healthy    up 3d 4h
  ✓ prometheus         healthy    up 3d 4h
  ✓ grafana            healthy    up 3d 4h
  ✓ loki               healthy    up 3d 4h
  ✓ alertmanager       healthy    up 3d 4h
  ✓ docs-site          healthy    up 3d 4h
Recent backups:
  postgres        2026-04-12 02:00   ✓ pgbackrest     1.2 GB
  gitlab-config   2026-04-12 03:00   ✓ restic         24 MB
  gitlab-lfs      2026-04-07 04:00   ✓ minio-mirror   890 MB
Last restore test:
  postgres        2026-04-08 05:30   ✓ passed         18m elapsed
TLS:
  CA      HomeLab CA             expires 2036-04-09 (10 years)
  *.lab   frenchexdev wildcard   expires 2028-04-09 (2 years)
Recent alerts (last 24h): none
Cost (last 30d): 41.8 kWh / €8.36 (estimated)

The implementation walks every relevant data source (IDockerClient.PsAsync, the cost store, the backup store, the cert files, the Alertmanager API) and renders them into one structured view.
GitLab Omnibus upgrade
GitLab Omnibus has a strict upgrade path: you cannot skip major versions, and certain minor releases are required stops as well. The Omnibus changelog tells you which versions you must stop at on the way. HomeLab knows this:
[Injectable(ServiceLifetime.Singleton)]
public sealed class GitLabUpgradePathResolver
{
    // Required stops, oldest first. ResolvePath relies on this ordering.
    private static readonly IReadOnlyList<string> RequiredStops = new[]
    {
        "13.12.15-ce.0",
        "14.0.12-ce.0",
        "14.3.6-ce.0",
        "14.9.5-ce.0",
        "14.10.5-ce.0",
        "15.0.5-ce.0",
        "15.4.6-ce.0",
        "15.11.13-ce.0",
        "16.0.10-ce.0",
        "16.3.9-ce.0",
        "16.7.10-ce.0",
        "16.11.0-ce.0",
        // ... etc
    };

    // Returns the required stops between the two versions, followed by the target.
    // ParseVersion and IsBetween are helpers that are not shown in this excerpt.
    public IReadOnlyList<string> ResolvePath(string from, string to)
    {
        var fromV = ParseVersion(from);
        var toV = ParseVersion(to);
        return RequiredStops
            .Where(s => ParseVersion(s).IsBetween(fromV, toV))
            .Concat(new[] { to })
            .Distinct() // the target may itself be a required stop
            .ToList();
    }
}

Running homelab upgrade gitlab --to 17.0.0-ce.0 from 15.4.0-ce.0 prints:
Upgrade path:
  15.4.6-ce.0
  15.11.13-ce.0
  16.0.10-ce.0
  16.3.9-ce.0
  16.7.10-ce.0
  16.11.0-ce.0
  17.0.0-ce.0 (target)
This upgrade has 7 stops and will take ~2 hours.
A backup will be taken before each step.
Continue? [y/N]

The user confirms. HomeLab walks the path: backup → bump version → wait for healthy → repeat. If any step fails, the saga compensates: restore the previous backup and roll the version back. The user is left with a clear error.
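The path above depends on two helpers the resolver calls but this part does not show, ParseVersion and IsBetween. A minimal sketch of how they might behave follows. The class name, the exclusive lower bound and the inclusive upper bound are assumptions made for illustration, not HomeLab's actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the ParseVersion / IsBetween helpers used by the resolver.
public static class GitLabVersionSketch
{
    // "15.4.6-ce.0" -> 15.4.6: drop the "-ce.0" packaging suffix, parse the rest.
    public static Version Parse(string tag) => Version.Parse(tag.Split('-')[0]);

    // Assumed bounds: exclusive lower (the installed version is never a stop),
    // inclusive upper (a target that is itself a required stop still matches).
    public static bool IsBetween(Version v, Version from, Version to)
        => v > from && v <= to;

    // Same shape as ResolvePath: filter the ordered stops, then append the target.
    public static IReadOnlyList<string> Filter(IEnumerable<string> stops, string from, string to)
    {
        var fromV = Parse(from);
        var toV = Parse(to);
        return stops
            .Where(s => IsBetween(Parse(s), fromV, toV))
            .Concat(new[] { to })
            .Distinct()
            .ToList();
    }
}
```

Comparing parsed System.Version values rather than raw strings matters here: as strings, "15.11.13" sorts before "15.4.6".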
[Saga]
public sealed class GitLabUpgradeSaga
{
    [SagaStep(Order = 1, Compensation = nameof(NothingToCompensate))]
    public async Task<Result> BackupBeforeUpgrade(GitLabUpgradeContext ctx, CancellationToken ct)
    {
        var result = await _backup.RunAsync(new BackupSpec("postgres", "/", "backups"), ct);
        if (result.IsFailure) return result.Map();
        ctx.LastBackupId = result.Value;
        return Result.Success();
    }

    [SagaStep(Order = 2, Compensation = nameof(RollbackVersion))]
    public async Task<Result> BumpVersion(GitLabUpgradeContext ctx, CancellationToken ct)
    {
        await _compose.UpdateImageAsync("gitlab", $"gitlab/gitlab-ce:{ctx.NextVersion}", ct);
        return await _compose.RecreateAsync("gitlab", ct);
    }

    [SagaStep(Order = 3, Compensation = nameof(RestoreFromBackup))]
    public async Task<Result> WaitHealthy(GitLabUpgradeContext ctx, CancellationToken ct)
    {
        return await _gitlab.WaitForHealthAsync(timeout: TimeSpan.FromMinutes(15), ct);
    }

    public async Task<Result> RollbackVersion(GitLabUpgradeContext ctx, CancellationToken ct)
    {
        await _compose.UpdateImageAsync("gitlab", $"gitlab/gitlab-ce:{ctx.PreviousVersion}", ct);
        return await _compose.RecreateAsync("gitlab", ct);
    }

    public async Task<Result> RestoreFromBackup(GitLabUpgradeContext ctx, CancellationToken ct)
    {
        return await _backup.RestoreAsync(ctx.LastBackupId!, new RestoreSpec("/var/opt/gitlab"), ct);
    }

    public Task<Result> NothingToCompensate(GitLabUpgradeContext ctx, CancellationToken ct)
        => Task.FromResult(Result.Success());
}

The upgrade is a saga, so partial failures roll back cleanly. The user runs one command, gets a confirmation prompt, walks away, and either has a new GitLab or a clean previous-state GitLab when they come back.
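The runtime behind [Saga] and [SagaStep] is not shown in this part, so the following is only a sketch of the compensation order such a runner could use. The type names are hypothetical, and one behavior is an assumption inferred from the test further down (which expects a restore when the health check fails): the failed step's own compensation runs too, followed by the compensations of the steps that completed before it, newest first.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical step shape: a forward action that reports success, and its undo.
public sealed record SagaStepSketch(string Name, Func<bool> Execute, Action Compensate);

public static class SagaRunnerSketch
{
    // Runs steps in order. On the first failure, compensates the failed step
    // and every earlier step in reverse order, then reports failure.
    public static bool Run(IReadOnlyList<SagaStepSketch> steps, List<string> log)
    {
        for (var i = 0; i < steps.Count; i++)
        {
            log.Add($"run {steps[i].Name}");
            if (steps[i].Execute()) continue;

            for (var j = i; j >= 0; j--)
            {
                log.Add($"compensate {steps[j].Name}");
                steps[j].Compensate();
            }
            return false;
        }
        return true;
    }
}
```

With steps named backup, bump and wait, where wait fails, the log reads: run backup, run bump, run wait, compensate wait, compensate bump, compensate backup. That mirrors the article's intent of restoring data first and rolling the image back afterwards.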
Postgres upgrade
Postgres major upgrades are harder because they require a pg_upgrade step that is not just docker pull + recreate. HomeLab generates the right pg_upgrade --link command in a sidecar container:
homelab upgrade postgres --from 16 --to 17

The command walks through these steps:
- Backup via pgbackrest
- Stop GitLab (so nothing writes to Postgres)
- Stop Postgres
- Run pg_upgrade in a sidecar with both old and new bin dirs
- Start the new Postgres
- Run ANALYZE (recommended after pg_upgrade)
- Start GitLab
- Verify healthcheck
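If any step fails, HomeLab restores the backup and reverts. The backup is what makes this safe: once the new server has been started after a pg_upgrade --link run, the old cluster can no longer be safely used.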
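As an illustration of the pg_upgrade step, here is a sketch of how the sidecar's command line could be assembled. The flags are real pg_upgrade options. The class name and the directory layout are assumptions chosen for the example, not paths HomeLab is known to use.

```csharp
using System.Collections.Generic;

// Hypothetical builder for the sidecar's pg_upgrade invocation.
public static class PgUpgradeCommandSketch
{
    // Paths are assumed mount points inside the sidecar; adjust to the real image layout.
    public static IReadOnlyList<string> Build(int from, int to, bool checkOnly = false)
    {
        var args = new List<string>
        {
            "pg_upgrade",
            "--old-bindir",  $"/usr/lib/postgresql/{from}/bin",
            "--new-bindir",  $"/usr/lib/postgresql/{to}/bin",
            "--old-datadir", $"/var/lib/postgresql/{from}/data",
            "--new-datadir", $"/var/lib/postgresql/{to}/data",
            "--link", // hard-link data files instead of copying them
        };
        // --check validates the clusters without changing any data.
        if (checkOnly) args.Add("--check");
        return args;
    }
}
```

Link mode needs the old and new data directories on the same filesystem, which is one reason to mount both into a single sidecar. Running once with checkOnly set before stopping anything would be a cheap way to surface incompatibilities early.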
Cert rotation
The wildcard cert from Part 34 lasts 2 years. The CA lasts 10. Both expire eventually. homelab cert rotate:
$ homelab cert rotate
Current wildcard cert expires 2027-04-12 (in 365 days).
This is well within the rotation window.
Generate a new wildcard cert (will be valid for 2 years from now)? [y/N]

If the user confirms:
- Generate a new cert via the same ITlsCertificateProvider used in init
- Atomically swap the cert files in data/certs/wildcard.crt and wildcard.key
- Send Traefik a SIGHUP so it reloads the cert without restarting
- Verify the new cert via a curl against https://gitlab.frenchexdev.lab
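The atomic swap in the second step can be sketched as write-to-temp-then-rename. This is a minimal illustration with a hypothetical helper name, assuming a Linux host where a rename within one filesystem replaces the target in a single step. It is not taken from HomeLab's handler.

```csharp
using System;
using System.IO;

// Hypothetical helper: replace a file's contents so readers see either the
// old file or the new one, never a half-written file.
public static class AtomicFileSwapSketch
{
    public static void Replace(string destinationPath, string newContents)
    {
        // The temp file lives in the same directory so the final move stays
        // on one filesystem and can be a plain rename.
        var dir = Path.GetDirectoryName(Path.GetFullPath(destinationPath))!;
        var temp = Path.Combine(dir, $".{Path.GetFileName(destinationPath)}.{Guid.NewGuid():N}.tmp");

        File.WriteAllText(temp, newContents);
        try
        {
            File.Move(temp, destinationPath, overwrite: true);
        }
        catch
        {
            File.Delete(temp); // do not leave debris next to the real cert
            throw;
        }
    }
}
```

Each file is swapped atomically on its own, but wildcard.crt and wildcard.key remain two separate renames. Signaling Traefik only after both have completed keeps it from loading a mismatched pair.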
The CA is not rotated automatically — replacing the CA invalidates every signed cert and requires re-trusting on every consuming machine. CA rotation is an explicit, deliberate operation: homelab cert rotate-ca --confirm-i-know-this-is-disruptive.
Log shipping
Logs from compose services are ingested by Loki via Promtail (from Part 44). homelab logs gitlab --tail 100 --follow is a thin wrapper around a Loki query:
[Injectable(ServiceLifetime.Singleton)]
public sealed class LogsRequestHandler : IRequestHandler<LogsRequest, Result>
{
    private readonly ILokiApi _loki;
    private readonly IHomeLabConsole _console;

    public async Task<Result> HandleAsync(LogsRequest req, CancellationToken ct)
    {
        // LogQL stream selector: every line from one compose service.
        var query = $"{{compose_service=\"{req.Service}\"}}";

        if (req.Follow)
        {
            // Streams until the caller cancels (Ctrl+C).
            await foreach (var line in _loki.TailAsync(query, ct))
                _console.WriteLine(line);
            return Result.Success();
        }

        var lines = await _loki.QueryAsync(query, limit: req.Tail ?? 100, ct);
        // Surface a failed query instead of reading Value from a failed result.
        if (lines.IsFailure) return lines.Map();
        foreach (var line in lines.Value) _console.WriteLine(line);
        return Result.Success();
    }
}

The user does not need to know about Loki. They run homelab logs gitlab --tail 100, get the last 100 lines, and move on.
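The handler interpolates req.Service straight into the LogQL selector. If a service name could ever contain a quote or a backslash, the selector would break, so one option is to escape the value first. The helper below is a hypothetical sketch of that idea. Only the compose_service label comes from the handler above.

```csharp
// Hypothetical helper: build the LogQL stream selector with the label value escaped.
public static class LogQlSketch
{
    public static string ServiceSelector(string service)
    {
        // Escape backslashes first, then double quotes, so the value stays
        // inside the quoted LogQL string literal.
        var escaped = service.Replace("\\", "\\\\").Replace("\"", "\\\"");
        return $"{{compose_service=\"{escaped}\"}}";
    }
}
```

For an ordinary name such as gitlab the result is identical to what the handler builds today. Validating the name against the list of known compose services would be an alternative that avoids escaping altogether.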
The test
[Fact]
public void gitlab_upgrade_path_resolver_finds_required_stops()
{
    var resolver = new GitLabUpgradePathResolver();
    var path = resolver.ResolvePath(from: "15.4.0-ce.0", to: "17.0.0-ce.0");

    path.Should().Contain("15.11.13-ce.0");
    path.Should().Contain("16.11.0-ce.0");
    path.Last().Should().Be("17.0.0-ce.0");
}

[Fact]
public async Task gitlab_upgrade_saga_rolls_back_on_health_check_failure()
{
    var compose = new ScriptedComposeClient();
    compose.OnRecreate("gitlab", exitCode: 0);
    var gitlab = new ScriptedGitLabApi();
    gitlab.OnWaitForHealth(timeoutAfterCount: 0, returnFailure: true);
    var backup = new ScriptedBackupProvider();
    backup.OnBackup(snapshotId: "snap-1");
    backup.OnRestore(success: true);

    var saga = new GitLabUpgradeSaga(compose, gitlab, backup);
    var ctx = new GitLabUpgradeContext("16.4.0-ce.0", "16.7.10-ce.0");
    var result = await saga.RunAsync(ctx, default);

    result.IsFailure.Should().BeTrue();
    backup.RestoreCalls.Should().ContainSingle();
}

[Fact]
public async Task cert_rotate_atomically_swaps_files_and_signals_traefik()
{
    var fs = new MockFileSystem();
    fs.AddFile("/lab/data/certs/wildcard.crt", new MockFileData("old-cert"));
    var docker = new ScriptedDockerClient();
    docker.OnExec("traefik", new[] { "kill", "-HUP", "1" }, exitCode: 0);

    var handler = new CertRotateRequestHandler(/* ... */);
    var result = await handler.HandleAsync(new CertRotateRequest(), default);

    result.IsSuccess.Should().BeTrue();
    fs.File.ReadAllText("/lab/data/certs/wildcard.crt").Should().NotBe("old-cert");
    // Match the exact argument "-HUP": on an argument array, Contains("HUP") would never match.
    docker.ExecCalls.Should().ContainSingle(c => c.Container == "traefik" && c.Command.Contains("-HUP"));
}

What this gives you that bash doesn't
Day-2 in bash is a binder of runbook.md files in a wiki nobody reads. Each runbook is a list of commands the on-call engineer is expected to copy-paste at 3 AM. There is no test that proves the runbook still works.
A typed set of day-2 verbs gives you, for the same surface area:
- Ten typed verbs covering 80% of day-2 needs
- homelab status as the one-page health view
- GitLab upgrades with required-stop path resolution and saga compensation
- Postgres upgrades with pg_upgrade orchestration
- Cert rotation with atomic swap and Traefik SIGHUP
- Loki-backed log tailing without the user knowing about Loki
- Tests for each verb
The bargain pays back the first time you run homelab upgrade gitlab while you make coffee and come back to a successfully upgraded lab — or a clean rollback to the previous version, with a clear log of why.