I wanted a cheap, production-grade homelab Kubernetes cluster. Most cloud providers charge €70+/month just for the control plane. My setup on Hetzner Cloud runs for about €10/month using Talos Linux, Cilium, and WireGuard. Everything is automated with Terraform, Packer, and a single post-deploy script.
As this is my continuously evolving homelab setup, this post might change from time to time (e.g. I want to move the Terraform code out of a single main.tf into modules). The cluster runs worker nodes in multiple European locations, mostly to try it out; for a small homelab cluster this has no real-world benefit, only downsides due to cross-zone networking.
This article is not meant as full 0-to-100 documentation (that is better done in the repo itself). Instead, I explain the design decisions and why I went for certain parts in the setup. The full code (Terraform, Packer, Bash) is available on GitHub: hetzner_k8s
Architecture overview
I decided to put all cluster management behind WireGuard, which considerably reduces the attack surface. Open to the world are only ports 80 (HTTP) and 443 (HTTPS), exposed through a Hetzner load balancer in front of Traefik (with certificates managed via Let’s Encrypt).
This also protects cluster-internal services that have no ingress. For example, the monitoring stack (Grafana) is only reachable via the WireGuard tunnel plus port forwarding. It might be a bit paranoid, but since I don’t want anyone running crypto miners on my bill (especially with node auto-scaling enabled), better safe than sorry.
graph TD
  subgraph internet[" "]
    client["Internet<br/><i>HTTP / HTTPS</i>"]
  end
  subgraph hetzner["Hetzner Cloud — eu-central · Private Network 10.0.0.0/16"]
    lb["Load Balancer<br/><b>lb11</b> · 10.0.1.254"]
    subgraph cp["Control Plane · cx23 · 10.0.1.1"]
      cp_talos["Talos Linux"]
      cp_cilium["Cilium CNI"]
      cp_traefik["Traefik <i>(DaemonSet)</i>"]
      cp_etcd["etcd"]
    end
    subgraph workers["Workers 1‒5 · cx23 · 10.0.1.10+"]
      w_talos["Talos Linux"]
      w_cilium["Cilium CNI"]
      w_traefik["Traefik <i>(DaemonSet)</i>"]
    end
  end
  subgraph local["Your Machine · wireproxy · 10.200.200.2"]
    kubectl["kubectl → 127.0.0.1:6443"]
    talosctl["talosctl → 127.0.0.1:50000"]
  end
  client -- "HTTP/HTTPS" --> lb
  lb -- "Proxy Protocol" --> cp_traefik
  lb -- "Proxy Protocol" --> w_traefik
  cp -- "WireGuard<br/>UDP 51820" --> local
  style hetzner fill:#f0f4ff,stroke:#3b82f6,stroke-width:2px
  style cp fill:#e0f2fe,stroke:#0284c7
  style workers fill:#e0f2fe,stroke:#0284c7,stroke-dasharray: 5 5
  style local fill:#f0fdf4,stroke:#16a34a,stroke-width:2px
  style internet fill:none,stroke:none
Building the Talos image with Packer
Hetzner doesn’t offer Talos Linux as a native image. You build a snapshot using Packer and Hetzner’s rescue mode, a temporary Linux environment that lets you write directly to the server’s disk.
The Packer config lives in hetzner_k8s/packer/hcloud.pkr.hcl:
source "hcloud" "talos" {
  token         = var.hcloud_token
  location      = var.server_location
  server_type   = var.server_type
  server_name   = "packer-talos-builder"
  rescue        = "linux64"      # Boot into rescue mode
  image         = "ubuntu-24.04" # Base image (irrelevant, rescue overwrites it)
  snapshot_name = "talos-${var.talos_version}"
  snapshot_labels = {
    os      = "talos"
    version = var.talos_version
  }
  ssh_username = "root"
}
The schematic ID (ce4c980550dd2ab1b17bbf2b08801c7eb59418eafe8f279833297925d67c7515) is generated at factory.talos.dev and encodes the qemu-guest-agent extension. Packer creates a snapshot that Terraform references by label:
data "hcloud_image" "talos" {
  with_selector     = "os=talos"
  most_recent       = true
  with_architecture = "x86"
}
Networking
A single private network connects all nodes. Hetzner’s eu-central zone spans three datacenters (fsn1, nbg1, hel1), so workers can be distributed for availability.
Firewall rules
The firewall only allows HTTP/HTTPS (for ingress), WireGuard (UDP 51820), and ICMP. There is no SSH rule, because Talos doesn’t run an SSH server. Hetzner firewalls don’t filter private-network traffic, so all inter-node communication is unrestricted by design.
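If you have the hcloud CLI installed, you can sanity-check the resulting rule set (the firewall name is a placeholder; look it up with the list command first):
hcloud firewall list
hcloud firewall describe <firewall-name>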
Load balancer
A Terraform-created lb11 sits on the private network at 10.0.1.254. Its targets and service definitions are managed by the Hetzner CCM (not Terraform). The CCM watches for Kubernetes Services with matching annotations and auto-configures the LB.
Talos machine configuration
After first setting up the cluster with k3s on Ubuntu (Hetzner base image), I was looking for a different, more sophisticated solution and remembered Talos. Talos has the benefit of being immutable and more locked down, which helps get the cluster into a reproducible, stable, and secure state.
Here are the key design decisions baked into the control plane config (hetzner_k8s/talos.tf):
CNI and kube-proxy disabled, Cilium takes over:
cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true
WireGuard interface on the control plane, so you can reach the K8s API without exposing port 6443 publicly:
machine:
  network:
    interfaces:
      - interface: wg0
        mtu: 1420
        addresses: ["10.200.200.1/24"]
        wireguard:
          privateKey: ${wireguard_server_private_key}
          listenPort: 51820
          peers:
            - publicKey: ${wireguard_client_public_key}
              allowedIPs: ["10.200.200.2/32"]
etcd advertises only its private-network address, which keeps etcd peer traffic off the public interface:
cluster:
  etcd:
    advertisedSubnets: ["10.0.1.0/24"]
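To double-check that etcd really advertises only private addresses, you can ask it for its member list over the Talos API once the WireGuard tunnel from the next section is up (the talosconfig path is an assumption from my setup):
# Peer URLs should all point at 10.0.1.x addresses
talosctl --talosconfig ./talosconfig etcd members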
Secure access with WireGuard
As mentioned in the intro, this might be a bit paranoid, but I want to lock down the servers as much as possible so I don’t have to deal with malicious actors at all. I follow a defense-in-depth approach of closing everything I don’t need publicly, and the control plane is an important part of that. Hence the Kubernetes API (port 6443) and the Talos API (port 50000) are never exposed publicly. Instead, I set up a WireGuard tunnel using wireproxy, a userspace WireGuard client that needs neither root privileges nor a kernel module. Why wireproxy? I want to bring the tunnel up with a single command (post-deploy.sh for now) and without sudo on my local machine: wg-quick needs kernel/sudo access, while wireproxy keeps everything in userspace. And since the traffic I send over the tunnel is negligible, I don’t see a relevant performance downside.
Terraform generates a .wireproxy.conf and starts it automatically:
[Interface]
PrivateKey = <client-private-key>
Address = 10.200.200.2/32
MTU = 1420
[Peer]
PublicKey = <server-public-key>
Endpoint = <control-plane-public-ip>:51820
AllowedIPs = 10.200.200.0/24, 10.0.0.0/16
PersistentKeepalive = 25
[TCPClientTunnel]
BindAddress = 127.0.0.1:50000
Target = 10.200.200.1:50000
[TCPClientTunnel]
BindAddress = 127.0.0.1:6443
Target = 10.200.200.1:6443
After terraform apply, kubectl and talosctl talk to 127.0.0.1. Wireproxy tunnels the traffic through WireGuard to the control plane’s private VPN address. (For later connections to the cluster, you need to set up wireproxy again, as documented in the repo’s README.)
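For a fresh session, that boils down to starting wireproxy against the generated config and pointing the CLIs at localhost. A sketch with assumed file paths (the README has the exact commands):
# Bring the userspace WireGuard tunnel up in the background
wireproxy -c .wireproxy.conf &

# Both CLIs already target 127.0.0.1 via their generated configs
kubectl --kubeconfig ./kubeconfig get nodes
talosctl --talosconfig ./talosconfig version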
Post-deploy stack
After terraform apply, the cluster has nodes but no networking (CNI=none). The scripts/post-deploy.sh script installs everything in a carefully ordered sequence; order matters because each component depends on the previous one (a shell sketch of the first steps follows the list):
- Cilium: Enables pod networking. Nothing else works without this.
- Wait for nodes: Nodes stay NotReady until Cilium runs.
- Hetzner CCM: Removes the node.cloudprovider.kubernetes.io/uninitialized taint so pods can schedule.
- Hetzner CSI: Enables persistent volumes via Hetzner block storage. (These are not yet destroyed on tf destroy, something I want to optimize going forward so it only takes a single command to wipe all resources.)
- metrics-server: Enables HPA CPU/memory-based autoscaling.
- Traefik: Ingress controller; the CCM auto-adds nodes to the Hetzner LB.
- cert-manager: TLS certificate automation via Let’s Encrypt.
- Cluster Autoscaler: Scales workers 0→5 based on pending pods.
- Monitoring (optional, asked in script): Prometheus + Grafana with persistent storage.
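In shell terms, the first two steps look roughly like this. A sketch with assumed release names and flags; the actual script pins versions and passes full values files:
# 1. Cilium: pod networking (kube-proxy replacement also needs the API server host/port set)
helm repo add cilium https://helm.cilium.io
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version "$CILIUM_VERSION" \
  --set kubeProxyReplacement=true

# 2. Wait until Cilium has brought the nodes to Ready
kubectl wait --for=condition=Ready nodes --all --timeout=10m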
All Helm chart versions are pinned for reproducibility (I never like basing my infrastructure on a “latest” tag, as I was bitten too often by this in the past):
CILIUM_VERSION="1.19.0"
HCLOUD_CCM_VERSION="1.29.2"
HCLOUD_CSI_VERSION="2.18.3"
METRICS_SERVER_VERSION="3.12.2"
TRAEFIK_VERSION="39.0.0"
CERT_MANAGER_VERSION="v1.19.3"
KUBE_PROMETHEUS_VERSION="81.5.0"
Ingress with Traefik and Hetzner LB
Traefik runs as a DaemonSet (not a Deployment) so it’s present on every node, including the control plane when there are zero workers. The key configuration is in hetzner_k8s/manifests/traefik-values.yaml:
deployment:
  kind: DaemonSet

tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
The Hetzner LB is configured via annotations on the Traefik Service. The CCM watches these and auto-manages the load balancer:
service:
  annotations:
    load-balancer.hetzner.cloud/name: "hetzner-k8s-lb"
    load-balancer.hetzner.cloud/network-zone: eu-central
    load-balancer.hetzner.cloud/use-private-ip: "true"
    load-balancer.hetzner.cloud/uses-proxyprotocol: "true"
Proxy protocol is essential. Without it, your application logs show the LB’s IP instead of the real client. Both Traefik and the LB must agree on proxy protocol being enabled:
ports:
  web:
    port: 80
    proxyProtocol:
      trustedIPs: ["10.0.0.0/8", "127.0.0.1/32"]
    forwardedHeaders:
      trustedIPs: ["10.0.0.0/8", "127.0.0.1/32"]
  websecure:
    port: 443
    proxyProtocol:
      trustedIPs: ["10.0.0.0/8", "127.0.0.1/32"]
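Once the CCM has done its job, the Traefik Service gets the LB’s public IP as its external address. A quick check (namespace and service name assume the chart defaults):
kubectl -n traefik get svc traefik -o wide
hcloud load-balancer list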
Automatic TLS with cert-manager
cert-manager handles Let’s Encrypt certificates automatically. Two ClusterIssuers are configured in hetzner_k8s/manifests/cert-manager-issuer.yaml: staging (for testing without rate limits) and production.
To get a certificate for your service, add a single annotation and a tls block to your Ingress:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  tls:
    - hosts: ["hello.k8s.example.com"]
      secretName: hello-tls
  rules:
    - host: hello.k8s.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: { name: hello, port: { number: 80 } }
cert-manager sees the annotation, creates a Certificate resource, solves the HTTP-01 challenge through Traefik, and stores the signed certificate in the hello-tls Secret. Renewal is automatic.
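You can follow the issuance from the command line; the Certificate created for the Ingress above carries the secretName:
kubectl get certificate hello-tls           # READY turns True once issued
kubectl describe certificate hello-tls      # events show order/challenge progress
kubectl get secret hello-tls                # holds the signed cert and key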
Cluster autoscaler
The autoscaler uses the Hetzner cloud provider to spin up and tear down cx23 workers based on pending pod demand. The configuration in hetzner_k8s/manifests/cluster-autoscaler.yaml defines two node groups, one per datacenter location:
command:
  - ./cluster-autoscaler
  - --cloud-provider=hetzner
  - --nodes=0:5:cx23:NBG1:workers
  - --nodes=0:5:cx23:HEL1:workers
  - --scale-down-enabled=true
  - --scale-down-delay-after-add=5m
  - --scale-down-unneeded-time=5m
  - --scale-down-utilization-threshold=0.5
  - --balance-similar-node-groups=true
  - --expander=least-waste
Each location can scale from 0 to 5 workers independently. balance-similar-node-groups distributes workers evenly across NBG1 and HEL1 for availability. least-waste picks the group that would use resources most efficiently.
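An easy way to watch the autoscaler in action is a throwaway deployment whose requests don’t fit on the control plane (image and numbers are arbitrary):
kubectl create deployment scale-test --image=nginx --replicas=10
kubectl set resources deployment scale-test --requests=cpu=500m,memory=256Mi

# Pending pods should trigger a new cx23 worker within a few minutes
kubectl get nodes -w

# Clean up afterwards so the workers can scale back down
kubectl delete deployment scale-test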
The Talos worker machine config is passed to the autoscaler as a base64-encoded cloud-init secret. This is how newly created workers get their Talos configuration:
CLOUD_INIT_B64=$(terraform output -raw worker_machine_config_base64)

kubectl -n cluster-autoscaler create secret generic hcloud-autoscaler \
  --from-literal=token="$HCLOUD_TOKEN" \
  --from-literal=cloud-init="$CLOUD_INIT_B64" \
  --from-literal=network="$NETWORK_NAME" \
  --from-literal=firewall="$FIREWALL_NAME" \
  --from-literal=image="$TALOS_IMAGE_ID"
The autoscaler deployment itself tolerates the control plane taint and prefers scheduling on the control plane, so it runs without any workers:
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
Security hardening
I try to lock down this cluster as much as possible:
Immutable OS
Talos Linux has no SSH server, no shell, no package manager. The root filesystem is read-only. You manage nodes exclusively through the Talos API (which is mTLS-authenticated and only reachable via WireGuard). If someone compromises a container, there is no host to escape to.
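In practice that means all day-two operations go through talosctl over the tunnel, for example (endpoints and nodes come from the talosconfig Terraform generates):
talosctl --talosconfig ./talosconfig dashboard   # live overview of a node
talosctl --talosconfig ./talosconfig services    # health of Talos-managed services
talosctl --talosconfig ./talosconfig dmesg       # read-only kernel logs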
No public management ports
The Kubernetes API (6443) and Talos API (50000) are never exposed publicly. The firewall only opens HTTP/HTTPS (for ingress), WireGuard (51820), and ICMP. All management traffic flows through the WireGuard tunnel to localhost.
Pod Security Standards
Namespaces enforce the restricted Pod Security Standard, which blocks privilege escalation, requires non-root users, and drops all capabilities:
apiVersion: v1
kind: Namespace
metadata:
  name: demo
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
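With enforce set to restricted, non-compliant pods are rejected at admission time. A quick hypothetical test (a plain nginx pod violates several restricted rules):
kubectl -n demo run pss-test --image=nginx
# The API server returns a Forbidden error listing the PodSecurity violations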
Hardened security contexts
Every workload runs with the minimum possible privileges:
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  seccompProfile:
    type: RuntimeDefault
  capabilities:
    drop: ["ALL"]
Resource governance
ResourceQuotas prevent any single namespace from consuming the cluster’s entire capacity:
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "10"
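Current consumption against the quota is visible at any time (namespace name as in the example above):
kubectl -n demo describe resourcequota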
RBAC
The cluster autoscaler ServiceAccount has minimal permissions, only what’s needed for node scaling, pod eviction, and leader election. No cluster-admin roles are assigned to workloads.
This setup gives you a production-capable Kubernetes cluster on Hetzner for about €10/month at rest, scaling to ~€35 under load. The key ingredients:
- Talos Linux eliminates an entire class of security concerns by removing SSH, shells, and package managers.
- Cilium replaces kube-proxy with eBPF for better performance and observability.
- WireGuard via wireproxy keeps management APIs private without requiring root or kernel modules.
- Cluster autoscaler scales workers from 0 to 5 based on actual demand, keeping costs proportional to usage.
- Terraform + Packer make the whole thing reproducible. Tear it down and rebuild it in minutes.
The full code, including Terraform configs, Packer template, Helm values, and the post-deploy script, is in the hetzner_k8s/ directory.
Cost breakdown
The base cluster (one control plane node and a load balancer) costs about €10/month. Workers scale up from zero on demand. (Right now they are mostly stuck at one worker node, as scale-down still has some issues.)
| Resource | Spec | Monthly cost |
|---|---|---|
| Control plane | cx23 (2 vCPU, 4 GB RAM) | ~€5 |
| Load balancer | lb11 | ~€5 |
| Base total | | ~€10 |
| Workers (0–5) | cx23, on-demand | €0–25 |
| Maximum | 5 workers | ~€35 |
The trick: the control plane tolerates workloads (allowSchedulingOnControlPlanes: true), so small deployments run without any workers at all. The autoscaler spins up cx23 workers when pods can’t be scheduled and tears them down after 5 minutes of low utilization.