
Part 48: GPU ML Training on a Joined Node

"The GPU is in the workstation. The Vagrant VM gets passthrough access. The kubelet sees the GPU as a schedulable resource. PyTorch trains."


Why

The fourth and final use case: a GPU. Not every freelancer has one, but those who do — ML engineers, data scientists, anyone training models — need their dev cluster to expose the GPU to workloads. The chain is:

  1. The host machine has an NVIDIA GPU (e.g. RTX 4090)
  2. A specific Vagrant VM has GPU passthrough enabled (from homelab-docker Part 48)
  3. The VM joins the cluster as a worker
  4. The NVIDIA device plugin registers the GPU as nvidia.com/gpu in the kubelet
  5. A workload requests nvidia.com/gpu: 1 and lands on that node
  6. PyTorch sees the GPU via torch.cuda.is_available() and trains

The thesis: K8s.Dsl ships an NvidiaDevicePluginHelmReleaseContributor that installs the NVIDIA device plugin DaemonSet. Plus a topology configuration option that adds GPU passthrough to a specific worker node. Workloads request the GPU via standard Kubernetes resource requests. The whole flow works without any K8s.Dsl-specific GPU magic — the magic is in the upstream NVIDIA device plugin, which we just install.
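Enabling all of this is a config flag plus a node name. A sketch of what the relevant config section might look like — the file layout and casing are assumptions; the fields mirror the `K8s.Gpu` options the contributor and topology resolver read:

```yaml
k8s:
  gpu:
    enabled: true                      # gates ShouldContribute()
    nodeName: acme-w-gpu               # the worker VM that gets passthrough
    product: NVIDIA-GeForce-RTX-4090   # becomes the nvidia.com/gpu.product label
    deviceId: "2684"                   # PCI device id (vendor 10de = NVIDIA)
```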


The contributor

[Injectable(ServiceLifetime.Singleton)]
public sealed class NvidiaDevicePluginHelmReleaseContributor : IHelmReleaseContributor
{
    private readonly HomeLabConfig _config;

    public NvidiaDevicePluginHelmReleaseContributor(HomeLabConfig config) => _config = config;

    public string TargetCluster => "*";
    public bool ShouldContribute() => _config.K8s?.Gpu?.Enabled ?? false;

    public void Contribute(KubernetesBundle bundle)
    {
        bundle.HelmReleases.Add(new HelmReleaseSpec
        {
            Name = "nvidia-device-plugin",
            Namespace = "kube-system",
            Chart = "nvdp/nvidia-device-plugin",
            Version = "0.16.2",
            RepoUrl = "https://nvidia.github.io/k8s-device-plugin",
            Values = new()
            {
                ["nodeSelector"] = new Dictionary<string, object?>
                {
                    ["nvidia.com/gpu.present"] = "true"
                },
                ["tolerations"] = new[]
                {
                    new Dictionary<string, object?>
                    {
                        ["key"] = "nvidia.com/gpu",
                        ["operator"] = "Exists",
                        ["effect"] = "NoSchedule"
                    }
                }
            }
        });
    }
}
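What the contributor amounts to, expressed as a plain Helm invocation (a sketch for comparison, not what K8s.Dsl executes; the `nvdp` repo alias matches the chart reference above):

```shell
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace kube-system --version 0.16.2 \
  --values - <<'EOF'
nodeSelector:
  nvidia.com/gpu.present: "true"
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
```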

The DaemonSet only runs on nodes labelled nvidia.com/gpu.present=true. The label is added by the K8s.Dsl topology resolver to the GPU node:

private IEnumerable<VosMachineConfig> WithGpuWorker(IEnumerable<VosMachineConfig> machines, HomeLabConfig hl)
{
    var gpu = hl.K8s?.Gpu;
    foreach (var m in machines)
    {
        if (m.Role == "k8s-worker" && gpu?.NodeName == m.Name)
        {
            // Copy the labels and add the two GPU markers the device plugin keys on.
            var labels = new Dictionary<string, string>(m.Labels)
            {
                ["nvidia.com/gpu.present"] = "true",
                ["nvidia.com/gpu.product"] = gpu.Product ?? ""
            };
            yield return m with
            {
                Labels = labels,
                GpuPassthrough = new GpuPassthroughSpec { Enabled = true, VendorId = "10de", DeviceId = gpu.DeviceId }
            };
        }
        else
        {
            yield return m;
        }
    }
}

The node gets two labels (nvidia.com/gpu.present and nvidia.com/gpu.product) plus the GPU passthrough spec. The Vagrant provisioning installs the NVIDIA driver and nvidia-container-toolkit (covered in homelab-docker Part 48). The device plugin then advertises the GPU to the kubelet, which registers it as nvidia.com/gpu: 1 (or however many physical GPUs the node has).
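The registration can be checked with standard kubectl once the plugin pod is up — a sketch, using the GPU worker name from the examples in this series:

```shell
# Nodes carrying the label the DaemonSet selects on
kubectl get nodes -l nvidia.com/gpu.present=true

# The allocatable GPU count the device plugin registered on the node
kubectl get node acme-w-gpu \
  -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```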


A PyTorch training Job

public void Contribute(KubernetesBundle bundle)
{
    bundle.CrdInstances.Add(new RawManifest
    {
        ApiVersion = "batch/v1",
        Kind = "Job",
        Metadata = new() { Name = "train-resnet50", Namespace = "ml-training" },
        Spec = new Dictionary<string, object?>
        {
            ["backoffLimit"] = 0,
            ["template"] = new Dictionary<string, object?>
            {
                ["spec"] = new Dictionary<string, object?>
                {
                    ["restartPolicy"] = "Never",
                    ["nodeSelector"] = new Dictionary<string, object?>
                    {
                        ["nvidia.com/gpu.present"] = "true"
                    },
                    ["containers"] = new[]
                    {
                        new Dictionary<string, object?>
                        {
                            ["name"] = "train",
                            ["image"] = "pytorch/pytorch:2.4.1-cuda12.1-cudnn9-runtime",
                            ["command"] = new[] { "python", "-c", "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))" },
                            ["resources"] = new Dictionary<string, object?>
                            {
                                ["limits"] = new Dictionary<string, object?>
                                {
                                    ["nvidia.com/gpu"] = 1
                                }
                            },
                            ["volumeMounts"] = new[]
                            {
                                new Dictionary<string, object?> { ["name"] = "dataset", ["mountPath"] = "/data" }
                            }
                        }
                    },
                    ["volumes"] = new[]
                    {
                        new Dictionary<string, object?>
                        {
                            ["name"] = "dataset",
                            ["persistentVolumeClaim"] = new Dictionary<string, object?> { ["claimName"] = "training-dataset" }
                        }
                    }
                }
            }
        }
    });
}

The Job:

  • Runs on a node labelled nvidia.com/gpu.present=true
  • Requests one GPU via nvidia.com/gpu: 1
  • Mounts a PVC (Longhorn-backed) containing the training dataset
  • Runs a Python container that prints the GPU info
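Serialized to YAML, the manifest above comes out roughly as follows (a sketch of the expected rendering, not DSL output verbatim):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-resnet50
  namespace: ml-training
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
        - name: train
          image: pytorch/pytorch:2.4.1-cuda12.1-cudnn9-runtime
          command: ["python", "-c", "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"]
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: dataset
              mountPath: /data
      volumes:
        - name: dataset
          persistentVolumeClaim:
            claimName: training-dataset
```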

kubectl logs job/train-resnet50 shows:

True
NVIDIA GeForce RTX 4090

Replace the trivial Python command with a real training script, and you have a development environment for ML training that uses the same Kubernetes patterns as a production GPU cluster.
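Such a script might start like this — a sketch only: the model, data, and hyperparameters are placeholders, and a real job would stream the dataset from the PVC at /data instead of generating random tensors:

```python
# train.py -- skeleton of a training entrypoint for the Job above
import torch
import torch.nn as nn


def main() -> None:
    # Fall back to CPU so the same script also runs off the GPU node
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Placeholder model; a real job would build e.g. a ResNet-50 here
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(32 * 32 * 3, 128),
        nn.ReLU(),
        nn.Linear(128, 10),
    ).to(device)

    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    # Stand-in batch; a real job would load batches from /data
    x = torch.randn(64, 3, 32, 32, device=device)
    y = torch.randint(0, 10, (64,), device=device)

    for _ in range(10):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

    print(f"device={device.type} final_loss={loss.item():.4f}")


if __name__ == "__main__":
    main()
```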


Cost tracking

The cost tracking from homelab-docker Part 47 captures wall-clock VM hours, CPU-hours, and RAM-GB-hours. For GPU nodes, K8s.Dsl extends it with gpu_hours:

$ homelab cost report --since 2026-04-21T00:00:00Z
Cost report for instance 'acme' (today)
─────────────────────────────────────────
acme-cp-1     8.5 cpu-h, 17.0 ram-GB-h, 0 gpu-h
acme-w-1      8.5 cpu-h, 34.0 ram-GB-h, 0 gpu-h
acme-w-2      8.5 cpu-h, 34.0 ram-GB-h, 0 gpu-h
acme-w-gpu    8.5 cpu-h, 68.0 ram-GB-h, 4.2 gpu-h    ← GPU node
─────────────────────────────────────────
Total CPU-h:  34.0
Total RAM-GB-h: 153.0
Total GPU-h:  4.2

Power proxy:
  CPU:    34.0 × 5W = 170 Wh
  RAM:    153.0 × 0.5W = 77 Wh
  GPU:    4.2 × 350W = 1470 Wh    ← the GPU dominates
─────────────────────────────────────────
  Total: 1717 Wh = 1.72 kWh
  Cost:  €0.34 (at €0.20/kWh)

The GPU dominates the power budget. The freelancer sees this in the report and can decide whether the experiment is worth the electricity bill.
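The power proxy is plain arithmetic. A sketch of the computation, using the per-component wattages shown in the report (homelab's internal implementation may differ):

```python
# Power-proxy estimate: resource-hours x assumed per-unit draw
CPU_W = 5.0          # watts per busy CPU core
RAM_W_PER_GB = 0.5   # watts per GB of RAM
GPU_W = 350.0        # watts per GPU under load


def power_proxy_kwh(cpu_h: float, ram_gb_h: float, gpu_h: float) -> float:
    """Convert resource-hours into an estimated energy figure in kWh."""
    wh = cpu_h * CPU_W + ram_gb_h * RAM_W_PER_GB + gpu_h * GPU_W
    return wh / 1000.0


kwh = power_proxy_kwh(cpu_h=34.0, ram_gb_h=153.0, gpu_h=4.2)
cost_eur = kwh * 0.20  # at EUR 0.20/kWh
print(f"{kwh:.2f} kWh, EUR {cost_eur:.2f}")  # matches the report: 1.72 kWh, EUR 0.34
```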


What this gives you that cloud GPUs don't

Cloud GPUs work. They also cost €1–3 per hour and require shipping data to and from the cloud. For a freelancer with intermittent ML workloads, a workstation GPU pays for itself fast: a €1500 RTX 4090 amortizes against ~500 hours of cloud GPU time (at €3/hr), which is roughly three months at five hours of training a day.
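The break-even arithmetic can be checked directly (prices as in the paragraph above; local electricity cost is ignored, which only lengthens the break-even slightly):

```python
def breakeven_hours(card_price_eur: float, cloud_rate_eur_per_h: float) -> float:
    """Hours of cloud GPU time whose cost equals the card's purchase price."""
    return card_price_eur / cloud_rate_eur_per_h


print(breakeven_hours(1500.0, 3.0))  # → 500.0
```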

A GPU on a HomeLab K8s cluster gives you, for the same surface area:

  • Standard k8s GPU resource model (nvidia.com/gpu: 1)
  • Standard PyTorch container image
  • Standard Job/CronJob/Deployment manifests
  • A real cluster experience instead of "ssh into a GPU box and run Python"
  • Per-instance cost tracking in kWh and EUR

The investment pays off the first time you train a model in your dev cluster and the only thing that differs from production is the size of the training set.


End of Act VIII

Four real-world cases: Spring Boot microservices, .NET API with SignalR, Airflow data pipeline, GPU ML training. All run on the same HomeLab K8s machinery. All exercise the same operators, the same observability stack, the same backup framework, the same ArgoCD GitOps loop. The K8s.Dsl plugin is as general as it needs to be; the cases are as specific as they need to be.

Act IX is the closing two parts: what is still missing, and the conclusion.

