Part 01: The Problem — Why kind / minikube / k3d Don't Cut It
"Yes, I know about kind. I am writing this series because I know about kind."
The honest starting point
Every developer who has needed a Kubernetes cluster on their laptop in the last decade has tried minikube, then kind, then k3d, then maybe Rancher Desktop or Docker Desktop's built-in Kubernetes. They all do something useful. They all share a fatal property: they are not real Kubernetes clusters. They are simulations of Kubernetes that are good enough for kubectl tutorials and CI smoke tests, and not good enough for the work that actually needs a cluster.
If your only need is "I want to run kubectl get pods against something", kind is fine. If your need is "I want to develop a real workload that will run on real Kubernetes in production, with real network policies, real persistent storage, real ingress, real upgrades", kind is not fine. The simulation diverges from production at the exact moments you most need fidelity, and the divergence is silent.
This part is the indictment. Six concrete things that toy k8s does badly and real k8s does well. Each one is a Saturday afternoon nobody got back.
1. CNI features that don't exist
Kubernetes has a Container Network Interface (CNI) standard. Real clusters pick a CNI implementation: Calico (a common choice for kubeadm clusters), Cilium (the eBPF one, increasingly the new default), Flannel (the simple one). Each one supports different features beyond the bare minimum of pod-to-pod connectivity: network policies, transparent encryption, flow observability, BGP routing, custom eBPF programs.
Toy clusters ship with a CNI but it is usually a stripped-down variant. kind uses a custom CNI called kindnet that supports almost nothing beyond pod-to-pod connectivity. minikube defaults to its own bridge plugin. k3d uses Flannel by default but with limited configuration.
This means: if you want to test a NetworkPolicy that denies egress from a namespace, you cannot do that on kindnet — kindnet does not enforce network policies. If you want to test a Cilium CiliumNetworkPolicy with an L7 HTTP rule, you cannot do that on a default kind cluster because it does not have Cilium. You can install Cilium on kind by tearing down kindnet and reinstalling, but at that point you are already debugging the simulation rather than your code.
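For concreteness, here is the kind of deny-all-egress policy in question (the namespace name is illustrative). On a Calico or Cilium cluster this blocks all outbound traffic from the namespace; on default kind the API server accepts it and then nothing enforces it.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-egress
  namespace: payments      # illustrative namespace
spec:
  podSelector: {}          # selects every pod in the namespace
  policyTypes:
    - Egress               # no egress rules follow, so all egress is denied
```

kubectl apply succeeds identically in both cases; the only way to notice the difference is to actually attempt an outbound connection from a pod and see whether it is refused.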
The first time you write a NetworkPolicy and it "works" in kind but does nothing in production is the first time you discover this. The fix is to build your dev cluster with the same CNI you run in production — which means either installing it inside kind (fragile, half-supported) or running real Kubernetes (this series).
2. CSI providers that don't behave like production
The Container Storage Interface (CSI) is the standard for persistent volumes. Real clusters use a CSI driver: Longhorn, OpenEBS, Rook/Ceph, LINSTOR, or a cloud-provider-managed driver. Each one has different semantics: replication, snapshots, resize, RWX (multi-writer) support.
Toy clusters use either a hostPath provisioner ("the volume is just a directory on the node") or a "local" provisioner that pretends to be CSI but skips most of the interesting features. kind's default storage class is backed by a local-path provisioner; minikube's by a hostPath provisioner; k3d's by k3s's bundled local-path-provisioner. In every case, the volume is a directory on the node.
This means: if your workload uses a PersistentVolumeClaim with accessModes: ReadWriteMany, it will work in production (because production has a real CSI driver with RWX support) and fail in kind (because kind's local-path provisioner only supports ReadWriteOnce). The error you get in kind is a generic "no PV available", not "this storage class does not support RWX". You spend an hour staring at it before you realise the problem is the simulation.
The same goes for snapshots, expansion, and topology constraints. None of them work the same way in kind as in production. The PVC that mounts in 200ms in kind takes 30 seconds in real Longhorn because Longhorn replicates the volume across three nodes first. Your liveness probe never knew that.
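As a concrete sketch, the claim that exposes the gap. The storage class name assumes a Longhorn install and is illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-uploads          # illustrative name
spec:
  accessModes:
    - ReadWriteMany             # RWX: pods on multiple nodes mount read-write
  storageClassName: longhorn    # assumes a Longhorn StorageClass in production
  resources:
    requests:
      storage: 5Gi
```

On a Longhorn-backed class this binds once the replicas are ready. On an RWO-only local-path class the claim never binds, and the only signal is the generic unbound-claim event described above.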
3. Ingress controllers that exist on a different network
Real Kubernetes ingress: an Ingress object describes a routing rule, an Ingress controller (nginx-ingress, Traefik, Contour, Gateway API) reads it and configures a load balancer that listens on the cluster's external IP. The external IP is reachable from outside the cluster. DNS points at it. Browsers hit it.
Toy clusters do not have external IPs. kind exposes the cluster on the host's localhost via a docker port mapping. minikube uses a tunnel command (minikube tunnel) that needs root and only works cleanly on certain platforms. k3d maps host ports through its load-balancer container to the bundled Traefik ingress, but the mapping is fixed at cluster creation and fragile across restarts.
This means: when you write an Ingress with a host rule for api.acme.dev, getting your browser to actually load that URL involves either an /etc/hosts hack, a tunnel, a port forward, or an xip.io workaround. And the certs you wanted to test? You cannot test them, because cert-manager's HTTP-01 challenge needs a publicly resolvable hostname, and your kind cluster does not have one.
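The Ingress in question, roughly. Host, service, and secret names are illustrative, and the class assumes nginx-ingress:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
spec:
  ingressClassName: nginx            # assumes nginx-ingress is installed
  tls:
    - hosts:
        - api.acme.dev
      secretName: acme-dev-wildcard  # assumes a wildcard cert Secret exists
  rules:
    - host: api.acme.dev
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api            # illustrative backend Service
                port:
                  number: 8080
```

On a cluster with a real external IP and real DNS, the browser resolves api.acme.dev and hits the controller. On kind, nothing resolves and nothing listens until you bolt on one of the workarounds above.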
The fix in this series: real Vagrant VMs with real private IPs, real PiHole DNS, real wildcard certs, and a real Ingress controller running for real.
4. Multi-node failure modes that never happen
A real production Kubernetes cluster has multiple nodes. Pods get scheduled across them. When a node goes down, pods reschedule. PodDisruptionBudgets protect against accidental drains. Affinity and anti-affinity rules spread workloads. Tolerations and taints isolate critical pods.
Toy clusters are single-node by default. kind supports multi-node (a "control-plane" container plus N "worker" containers), but every "node" is a docker container on the same machine, so the failure modes are still wrong. Stopping the worker-1 container is not the same thing as a real node falling off the network.
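Concretely, this is what multi-node means to kind: a config file in which every node is one more container on the same docker host, sharing its kernel, its disks, and its network. A minimal sketch:

```yaml
# kind cluster config: three "nodes", all docker containers on one machine
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
```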
The first time you write a PodDisruptionBudget and discover that draining your one node in kind breaks every workload at once because there is nowhere to reschedule, you realise the simulation has lied to you about what production will look like.
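A minimal sketch of such a budget (labels illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2      # at least 2 replicas must survive a voluntary drain
  selector:
    matchLabels:
      app: api         # illustrative workload label
```

On a three-worker cluster, draining one node evicts pods one at a time while the budget holds. On a single node, the drain either stalls forever or takes everything down at once; neither outcome tells you anything about production.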
In this series, the multi-node topology has at least three real worker VMs. You can drain one and watch pods actually reschedule onto the others. You can vagrant halt a node and watch the cluster go through a real failure-detection cycle. The behaviour is the same as production at the scale of the API server and the scheduler — because there is no simulation.
5. Upgrades that do not exist
Production Kubernetes clusters get upgraded. There is a minor release roughly every four months. The upgrade path has rules: you cannot skip minor versions, you must upgrade the control plane before the workers, you must respect the kubeadm upgrade plan output, and you must drain workers before upgrading them.
Toy clusters are not "upgraded". kind does not upgrade in place — you delete the cluster and create a new one with a new image. minikube has minikube start --kubernetes-version=... which is an in-place version bump, but it does not exercise the real kubeadm upgrade flow. k3d has the same delete-and-recreate model.
This means: the entire body of operational knowledge about how to upgrade Kubernetes — the saga, the order, the failure recovery — is unreachable on a toy cluster. If you want to learn kubeadm upgrade plan and kubeadm upgrade apply, you need a real kubeadm cluster.
In this series, Part 40 walks the real upgrade path: drain one control plane, upgrade kubeadm, upgrade the control plane, uncordon, repeat for each control plane, then drain workers in turn. The whole flow is wrapped in a Saga from the toolbelt so it compensates on failure. You learn it in your dev environment, you apply it in production, and the dev environment is the production rehearsal.
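The per-node loop, sketched as the shell commands it ultimately runs. Version numbers, node names, and the package syntax (Debian-style here) are illustrative; the ordering is the part that matters:

```shell
# First control-plane node. Versions and node names are illustrative.
apt-get install -y kubeadm=1.31.0-*     # 1. upgrade the kubeadm binary first
kubeadm upgrade plan                    # 2. check the allowed target versions
kubeadm upgrade apply v1.31.0           # 3. upgrade control-plane components
kubectl drain cp-1 --ignore-daemonsets  # 4. move workloads off the node
apt-get install -y kubelet=1.31.0-* kubectl=1.31.0-*
systemctl restart kubelet               # 5. upgrade and restart the kubelet
kubectl uncordon cp-1                   # 6. let workloads back on

# Remaining control planes: the same loop, with `kubeadm upgrade node`
# instead of `upgrade apply`. Then each worker in turn: drain, upgrade
# node, upgrade kubelet, uncordon.
```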
6. The "it works in kind" trap
Most fundamentally: every divergence between toy k8s and real k8s is silent. The toy cluster does not warn you that your NetworkPolicy is being ignored. It does not tell you that your PersistentVolumeClaim is not actually using a real CSI. It does not say "this Ingress will not work the same way in production". It just runs your code, returns 200 OK, and lets you ship.
The bug shows up in production. The author of the bug is the developer who tested in kind and trusted the green check mark. The fix is to learn the lesson the hard way and remember it for next time. The next time, the developer has a different bug, in a different layer, that kind also did not warn about. The cycle is endless because the simulation is not labelled as such.
This is why a serious dev-k8s setup is worth building. Real Kubernetes on real Vagrant VMs is more expensive in RAM (~16 GB for the smallest topology) and more expensive in setup (~30 minutes for homelab k8s init the first time). It is cheaper in bugs that escape to production, and the cost of one such bug — pager at 3 AM, customer outage, postmortem — buys back the RAM cost a hundred times over.
What we want instead
Stating the requirement positively, we want a tool that:
- Provisions a real Kubernetes cluster on local Vagrant VMs, using either kubeadm or k3s, with the option of single-node, multi-node, or HA topologies.
- Lets us pick the real CNI (Cilium, Calico, Flannel) and the real CSI (Longhorn, local-path) we will use in production, and tests our workloads against them.
- Provides a real Ingress controller (nginx-ingress or Traefik) reachable from the host machine over a real DNS hostname with a real wildcard cert.
- Exercises real upgrades via kubeadm upgrade plan / apply, with compensation on failure.
- Supports multi-client isolation so a freelancer with multiple clients can run several real clusters in parallel without overlap.
- Plugs into HomeLab so the cluster lifecycle is one CLI verb tree away from the rest of the lab — no fork, no parallel tooling, same IHomeLabPipeline, same Result<T>, same event bus, same toolbelt.
- Generates the GitOps repository so workloads are deployed via ArgoCD against a repo that HomeLab itself created in DevLab.
That tool is the K8s.Dsl plugin. The next 49 parts of this series build it.
What this part gives you that toy k8s doesn't
Nothing yet. This part is the indictment, not the cure. The cure starts in Part 02, where we measure exactly how much RAM each topology consumes and prove that 64 GB is enough for a real HA cluster.
For now, the contribution of this part is the indictment. Six divergences. Six classes of silent bug. Six reasons why a serious developer doing serious Kubernetes work needs more than kind, and why the cost of running real Kubernetes on local VMs is small compared to the cost of not doing so.