Part 49: DevLab Day-2 Operations

"Day 1 is when you get the lab up. Day 2 is the next ten years."


Why

We have spent forty-eight parts building DevLab. This part is about keeping it running. Day-2 operations are the unglamorous middle of any system's life: upgrades, backups, restores, cert rotations, log management, occasional service restarts, occasional alert investigations. None of them are exciting. All of them are mandatory.

The thesis of this part is: HomeLab provides a small set of typed verbs for day-2 operations, all driven by the same Ops.DataGovernance / Ops.Configuration / Ops.Observability declarations the rest of the lab already uses. The user runs homelab status, homelab backup, homelab upgrade, homelab cert rotate, homelab logs, and homelab vos snapshot, and that is enough for most days.


The day-2 verbs

Verb                                         | Purpose                                                                   | When you run it
homelab status                               | Show lab health (every service, healthcheck, recent backups, cert expiry) | Whenever you wonder
homelab backup run [target]                  | Trigger an immediate backup                                               | After a risky change
homelab backup restore <id> [target]         | Restore a specific backup                                                 | When something is broken
homelab backup verify                        | Run the restore-test job manually                                         | Once a week minimum
homelab upgrade gitlab --to <version>        | Upgrade GitLab Omnibus                                                    | When a new GitLab is out
homelab upgrade postgres --to <version>      | Upgrade Postgres                                                          | When a new major is out
homelab cert rotate                          | Re-issue the wildcard cert                                                | Before expiry
homelab logs <service> [--tail N] [--follow] | Tail logs for a service                                                   | When something is wrong
homelab vos snapshot create <name>           | Snapshot every VM (Vagrant snapshots)                                     | Before a risky change
homelab vos snapshot restore <name>          | Roll back to a snapshot                                                   | After a failed change

Ten verbs. They cover the vast majority of day-2 needs.


homelab status

The most-used verb. Aggregates everything HomeLab knows into one terminal page:

$ homelab status
DevLab — single topology — running

VMs:
  ✓ devlab-main      4 cpu / 8 GB / 50 GB    up   3d 4h

Services (14):
  ✓ traefik             healthy   up 3d 4h
  ✓ pihole              healthy   up 3d 4h
  ✓ gitlab              healthy   up 3d 3h
  ✓ gitlab-runner       healthy   up 3d 3h
  ✓ baget               healthy   up 3d 4h
  ✓ vagrant-registry    healthy   up 3d 4h
  ✓ postgres            healthy   up 3d 4h
  ✓ minio               healthy   up 3d 4h
  ✓ meilisearch         healthy   up 3d 4h
  ✓ prometheus          healthy   up 3d 4h
  ✓ grafana             healthy   up 3d 4h
  ✓ loki                healthy   up 3d 4h
  ✓ alertmanager        healthy   up 3d 4h
  ✓ docs-site           healthy   up 3d 4h

Recent backups:
  postgres        2026-04-12 02:00  ✓ pgbackrest    1.2 GB
  gitlab-config   2026-04-12 03:00  ✓ restic        24 MB
  gitlab-lfs      2026-04-07 04:00  ✓ minio-mirror  890 MB

Last restore test:
  postgres        2026-04-08 05:30  ✓ passed         18m elapsed

TLS:
  CA            HomeLab CA           expires 2036-04-09  (10 years)
  *.lab         frenchexdev wildcard expires 2028-04-09  (2 years)

Recent alerts (last 24h): none

Cost (last 30d): 41.8 kWh / €8.36 (estimated)

The implementation walks every relevant data source — IDockerClient.PsAsync, the cost store, the backup store, the cert files, the alertmanager API — and renders them into one structured view.


GitLab Omnibus upgrade

GitLab Omnibus has a strict upgrade path: you cannot skip major versions, and some minor versions are required stops as well. The GitLab upgrade documentation lists the versions you must stop at on the way. HomeLab knows this:

[Injectable(ServiceLifetime.Singleton)]
public sealed class GitLabUpgradePathResolver
{
    private static readonly IReadOnlyList<string> RequiredStops = new[]
    {
        "13.12.15-ce.0",
        "14.0.12-ce.0",
        "14.3.6-ce.0",
        "14.9.5-ce.0",
        "14.10.5-ce.0",
        "15.0.5-ce.0",
        "15.4.6-ce.0",
        "15.11.13-ce.0",
        "16.0.10-ce.0",
        "16.3.9-ce.0",
        "16.7.10-ce.0",
        "16.11.0-ce.0",
        // ... etc
    };

    public IReadOnlyList<string> ResolvePath(string from, string to)
    {
        var fromV = ParseVersion(from);
        var toV = ParseVersion(to);
        return RequiredStops
            .Where(s => ParseVersion(s).IsBetween(fromV, toV))
            .Concat(new[] { to })
            .Distinct()
            .ToList();
    }
}
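
The resolver above calls ParseVersion and IsBetween without showing them. A minimal sketch of what they might look like follows. The class name GitLabVersion and the bounds (exclusive lower, inclusive upper) are assumptions, chosen so that a lab at 15.4.0-ce.0 still gets 15.4.6-ce.0 as its first stop.

```csharp
using System;

// Hypothetical helpers; HomeLab's real ParseVersion/IsBetween are not shown in this part.
public static class GitLabVersion
{
    // "15.4.6-ce.0" -> Version 15.4.6 (the "-ce.0" packaging suffix is dropped).
    public static Version ParseVersion(string tag)
    {
        var dash = tag.IndexOf('-');
        var numeric = dash < 0 ? tag : tag[..dash];
        return Version.Parse(numeric);
    }

    // Assumed bounds: strictly after the current version, up to and including the target.
    public static bool IsBetween(this Version v, Version fromExclusive, Version toInclusive)
        => v > fromExclusive && v <= toInclusive;
}
```

With these bounds, a required stop equal to the target is matched once by the filter and once by the Concat, which is why the resolver ends with Distinct.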

Running homelab upgrade gitlab --to 17.0.0-ce.0 on a lab currently at 15.4.0-ce.0 prints:

Upgrade path:
  15.4.6-ce.0
  15.11.13-ce.0
  16.0.10-ce.0
  16.3.9-ce.0
  16.7.10-ce.0
  16.11.0-ce.0
  17.0.0-ce.0  (target)

This upgrade has 7 stops and will take ~2 hours.
A backup will be taken before each step.

Continue? [y/N]

The user confirms. HomeLab walks the path: backup → bump version → wait for healthy → repeat. If any step fails, the saga compensates: restore the previous backup and roll the version back. The user is left with a clear error.

[Saga]
public sealed class GitLabUpgradeSaga
{
    [SagaStep(Order = 1, Compensation = nameof(NothingToCompensate))]
    public async Task<Result> BackupBeforeUpgrade(GitLabUpgradeContext ctx, CancellationToken ct)
    {
        var result = await _backup.RunAsync(new BackupSpec("postgres", "/", "backups"), ct);
        if (result.IsFailure) return result.Map();
        ctx.LastBackupId = result.Value;
        return Result.Success();
    }

    [SagaStep(Order = 2, Compensation = nameof(RollbackVersion))]
    public async Task<Result> BumpVersion(GitLabUpgradeContext ctx, CancellationToken ct)
    {
        await _compose.UpdateImageAsync("gitlab", $"gitlab/gitlab-ce:{ctx.NextVersion}", ct);
        return await _compose.RecreateAsync("gitlab", ct);
    }

    [SagaStep(Order = 3, Compensation = nameof(RestoreFromBackup))]
    public async Task<Result> WaitHealthy(GitLabUpgradeContext ctx, CancellationToken ct)
    {
        return await _gitlab.WaitForHealthAsync(timeout: TimeSpan.FromMinutes(15), ct);
    }

    public async Task<Result> RollbackVersion(GitLabUpgradeContext ctx, CancellationToken ct)
    {
        await _compose.UpdateImageAsync("gitlab", $"gitlab/gitlab-ce:{ctx.PreviousVersion}", ct);
        return await _compose.RecreateAsync("gitlab", ct);
    }

    public async Task<Result> RestoreFromBackup(GitLabUpgradeContext ctx, CancellationToken ct)
    {
        return await _backup.RestoreAsync(ctx.LastBackupId!, new RestoreSpec("/var/opt/gitlab"), ct);
    }

    public Task<Result> NothingToCompensate(GitLabUpgradeContext ctx, CancellationToken ct) => Task.FromResult(Result.Success());
}

The upgrade is a saga, so partial failures roll back cleanly. The user runs one command, gets a confirmation prompt, walks away, and either has a new GitLab or a clean previous-state GitLab when they come back.
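
To make the compensation order concrete, here is a minimal sketch of a saga runner. It is not HomeLab's [Saga] runtime: SagaStepSketch and SagaRunnerSketch are hypothetical names, and it assumes the failing step's own compensation runs as well, which is what the health-check test in this part expects (a failed WaitHealthy triggers a restore).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical step shape: an action that reports success, plus its undo.
public sealed record SagaStepSketch(string Name, Func<Task<bool>> Execute, Func<Task> Compensate);

public static class SagaRunnerSketch
{
    // Runs steps in order. On the first failure, compensates the failed step
    // and every step before it, newest first. Returns true if all steps succeeded.
    public static async Task<bool> RunAsync(IReadOnlyList<SagaStepSketch> steps, IList<string> log)
    {
        for (var i = 0; i < steps.Count; i++)
        {
            log.Add($"run {steps[i].Name}");
            if (await steps[i].Execute()) continue;

            for (var j = i; j >= 0; j--)
            {
                log.Add($"compensate {steps[j].Name}");
                await steps[j].Compensate();
            }
            return false;
        }
        return true;
    }
}
```

Running compensations newest-first means the data restore happens before the image is rolled back, so the old GitLab version starts against the data it last wrote.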

Postgres upgrade

Postgres major upgrades are harder because they require a pg_upgrade step that is not just docker pull + recreate. HomeLab generates the right pg_upgrade --link command in a sidecar container:

homelab upgrade postgres --from 16 --to 17

The command walks through:

  1. Backup via pgbackrest
  2. Stop GitLab (so nothing writes to Postgres)
  3. Stop Postgres
  4. Run pg_upgrade in a sidecar with both old and new bin dirs
  5. Start the new Postgres
  6. Run ANALYZE (recommended after pg_upgrade)
  7. Start GitLab
  8. Verify healthcheck

If any step fails, restore the backup and revert.
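
Step 4 is the one that differs from a plain image bump. As a sketch, the sidecar command could be assembled like this. The flags are standard pg_upgrade options. The class name and the bin-directory layout (/usr/lib/postgresql/<major>/bin, as on Debian-based images) are assumptions, and the data directories are whatever the caller mounts into the sidecar.

```csharp
using System.Collections.Generic;

// Hypothetical builder; HomeLab's actual generator is not shown in this part.
public static class PgUpgradeCommandSketch
{
    public static IReadOnlyList<string> Build(
        int fromMajor, int toMajor, string oldDataDir, string newDataDir, bool checkOnly = false)
    {
        var args = new List<string>
        {
            "pg_upgrade",
            // Assumed Debian-style layout for the two sets of binaries.
            "--old-bindir", $"/usr/lib/postgresql/{fromMajor}/bin",
            "--new-bindir", $"/usr/lib/postgresql/{toMajor}/bin",
            "--old-datadir", oldDataDir,
            "--new-datadir", newDataDir,
            // Hard-link data files instead of copying them.
            "--link",
        };
        // --check validates the two clusters without changing any data.
        if (checkOnly) args.Add("--check");
        return args;
    }
}
```

Running the same command with checkOnly first is a cheap way to find incompatibilities before GitLab is stopped. With --link, both data directories must be on the same filesystem.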


Cert rotation

The wildcard cert from Part 34 lasts 2 years. The CA lasts 10. Both expire eventually. homelab cert rotate:

$ homelab cert rotate
Current wildcard cert expires 2027-04-12 (in 365 days).
This is well within the rotation window.

Generate a new wildcard cert (will be valid for 2 years from now)? [y/N]

If the user confirms:

  1. Generate a new cert via the same ITlsCertificateProvider used in init
  2. Atomically swap the cert files in data/certs/wildcard.crt and wildcard.key
  3. Send Traefik a SIGHUP so it reloads the cert without restarting
  4. Verify the new cert via a curl against https://gitlab.frenchexdev.lab
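
The atomic swap in step 2 is usually done by writing the new contents to a temporary file in the same directory and renaming it over the old one. A minimal sketch, using System.IO directly rather than the file-system abstraction the tests use. AtomicFileSwapSketch is a hypothetical name.

```csharp
using System.IO;

// Hypothetical helper illustrating write-then-rename; not HomeLab's actual handler.
public static class AtomicFileSwapSketch
{
    public static void Replace(string destinationPath, string newContents)
    {
        // The temp file lives next to the destination so the final move stays on one filesystem.
        var dir = Path.GetDirectoryName(Path.GetFullPath(destinationPath))!;
        var temp = Path.Combine(dir, Path.GetRandomFileName());

        File.WriteAllText(temp, newContents);

        // Readers see either the old file or the new one, never a half-written file.
        File.Move(temp, destinationPath, overwrite: true);
    }
}
```

Each file is replaced atomically, but the .crt and .key pair is not swapped as a unit. That is one reason to signal Traefik only after both files are in place.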

The CA is not rotated automatically — replacing the CA invalidates every signed cert and requires re-trusting on every consuming machine. CA rotation is an explicit, deliberate operation: homelab cert rotate-ca --confirm-i-know-this-is-disruptive.


Log shipping

Logs from compose services are ingested by Loki via promtail (from Part 44). homelab logs gitlab --tail 100 --follow is a thin wrapper around a Loki query:

[Injectable(ServiceLifetime.Singleton)]
public sealed class LogsRequestHandler : IRequestHandler<LogsRequest, Result>
{
    private readonly ILokiApi _loki;
    private readonly IHomeLabConsole _console;

    public async Task<Result> HandleAsync(LogsRequest req, CancellationToken ct)
    {
        var query = $"{{compose_service=\"{req.Service}\"}}";
        if (req.Follow)
        {
            await foreach (var line in _loki.TailAsync(query, ct))
                _console.WriteLine(line);
            return Result.Success();
        }
        else
        {
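            var lines = await _loki.QueryAsync(query, limit: req.Tail ?? 100, ct);
            // Surface a failed Loki query instead of reading Value from a failed result.
            if (lines.IsFailure) return lines.Map();
            foreach (var line in lines.Value) _console.WriteLine(line);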
            return Result.Success();
        }
    }
}

The user does not need to know about Loki. They run homelab logs gitlab --tail 100, get the last 100 lines, move on.
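
The handler builds its LogQL selector by string interpolation, so a service name containing a double quote or a backslash would produce an invalid query. A small sketch of an escaping helper follows. LogQlSketch is a hypothetical name, and it only covers those two characters.

```csharp
// Hypothetical helper; the handler above interpolates req.Service directly.
public static class LogQlSketch
{
    // Builds {compose_service="<name>"} with backslashes and quotes escaped.
    public static string ServiceSelector(string service)
    {
        var escaped = service.Replace("\\", "\\\\").Replace("\"", "\\\"");
        return "{compose_service=\"" + escaped + "\"}";
    }
}
```

Validating req.Service against the known list of compose services before querying would be a stricter alternative.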


The test

[Fact]
public void gitlab_upgrade_path_resolver_finds_required_stops()
{
    var resolver = new GitLabUpgradePathResolver();
    var path = resolver.ResolvePath(from: "15.4.0-ce.0", to: "17.0.0-ce.0");

    path.Should().Contain("15.11.13-ce.0");
    path.Should().Contain("16.11.0-ce.0");
    path.Last().Should().Be("17.0.0-ce.0");
}

[Fact]
public async Task gitlab_upgrade_saga_rolls_back_on_health_check_failure()
{
    var compose = new ScriptedComposeClient();
    compose.OnRecreate("gitlab", exitCode: 0);
    var gitlab = new ScriptedGitLabApi();
    gitlab.OnWaitForHealth(timeoutAfterCount: 0, returnFailure: true);
    var backup = new ScriptedBackupProvider();
    backup.OnBackup(snapshotId: "snap-1");
    backup.OnRestore(success: true);

    var saga = new GitLabUpgradeSaga(compose, gitlab, backup);
    var ctx = new GitLabUpgradeContext("16.4.0-ce.0", "16.7.10-ce.0");

    var result = await saga.RunAsync(ctx, default);

    result.IsFailure.Should().BeTrue();
    backup.RestoreCalls.Should().ContainSingle();
}

[Fact]
public async Task cert_rotate_atomically_swaps_files_and_signals_traefik()
{
    var fs = new MockFileSystem();
    fs.AddFile("/lab/data/certs/wildcard.crt", new MockFileData("old-cert"));
    var docker = new ScriptedDockerClient();
    docker.OnExec("traefik", new[] { "kill", "-HUP", "1" }, exitCode: 0);

    var handler = new CertRotateRequestHandler(/* ... */);
    var result = await handler.HandleAsync(new CertRotateRequest(), default);

    result.IsSuccess.Should().BeTrue();
    fs.File.ReadAllText("/lab/data/certs/wildcard.crt").Should().NotBe("old-cert");
    docker.ExecCalls.Should().ContainSingle(c => c.Container == "traefik" && c.Command.Contains("HUP"));
}

What this gives you that bash doesn't

Day-2 in bash is a binder of runbook.md files in a wiki nobody reads. Each runbook is a list of commands the on-call engineer is expected to copy-paste at 3 AM. There is no test that proves the runbook still works.

A typed set of day-2 verbs gives you, for the same surface area:

  • Ten typed verbs covering the vast majority of day-2 needs
  • homelab status as the one-page health view
  • GitLab upgrades with required-stop path resolution and saga compensation
  • Postgres upgrades with pg_upgrade orchestration
  • Cert rotation with atomic swap and Traefik SIGHUP
  • Loki-backed log tailing without the user knowing about Loki
  • Tests for each verb

The bargain pays back the first time you run homelab upgrade gitlab while you make coffee and come back to a successfully upgraded lab — or a clean rollback to the previous version, with a clear log of why.

