I wanted a cheap, production-grade homelab Kubernetes cluster. Most cloud providers charge €70+/month just for the control plane; my setup on Hetzner Cloud runs for about €14/month using Talos Linux, Cilium, and WireGuard. Everything is automated with Terraform, Packer, and ArgoCD for GitOps.
As this is my ongoing, evolving homelab setup, this post may change from time to time. The Terraform code is split across multiple files (main.tf, talos.tf, argocd.tf, etc.), and all workloads are managed declaratively through ArgoCD. The cluster runs worker nodes in multiple European zones purely as an experiment: for a homelab this size, multi-zone deployment has no real-world benefit and actually brings downsides due to cross-zone networking.
This article is not meant as full 0-to-100 documentation (that is better done in the repo itself). Instead, I explain the design decisions and why I went for certain parts in the setup. The full code (Terraform, Packer, GitOps manifests) is available on GitHub: hetzner_k8s
Architecture overview
I decided to wrap the entire cluster management plane in WireGuard, which drastically reduces the attack surface exposed to the internet. The only ports open to the world are 80 (HTTP) and 443 (HTTPS), exposed via a Traefik load balancer (with certificates managed via Let's Encrypt).
This also secures cluster-internal services (those without an ingress): the monitoring stack (Grafana) and the ArgoCD UI, for example, are only reachable via the WireGuard tunnel plus port forwarding. It might be a bit paranoid, but since I don't want people running crypto miners on my bill (especially since the cluster auto-scales nodes), better safe than sorry.
The control plane node is tainted with NoSchedule, so only infrastructure services (Cilium, Traefik, ArgoCD, CCM, CSI, metrics-server, monitoring) that carry explicit tolerations run there. Application workloads become Pending until the cluster autoscaler provisions worker nodes. This keeps the control plane stable while still allowing the cluster to function without any workers for pure infrastructure.
Building the Talos image with Packer
Hetzner doesn’t offer Talos Linux as a native image. You build a snapshot using Packer and Hetzner’s rescue mode, a temporary Linux environment that lets you write directly to the server’s disk.
The Packer config lives in hetzner_k8s/packer/hcloud.pkr.hcl:
```hcl
source "hcloud" "talos" {
  token       = var.hcloud_token
  location    = var.server_location
  server_type = var.server_type
  server_name = "packer-talos-builder"
  rescue      = "linux64"      # Boot into rescue mode
  image       = "ubuntu-24.04" # Base image (irrelevant, rescue overwrites it)

  snapshot_name = "talos-${var.talos_version}"
  snapshot_labels = {
    os      = "talos"
    version = var.talos_version
  }

  ssh_username = "root"
}
```
The schematic ID (ce4c980550dd2ab1b17bbf2b08801c7eb59418eafe8f279833297925d67c7515) is generated at factory.talos.dev and encodes the qemu-guest-agent extension. Packer creates a snapshot that Terraform references by label:
```hcl
data "hcloud_image" "talos" {
  with_selector     = "os=talos"
  most_recent       = true
  with_architecture = "x86"
}
```
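The schematic uploaded to factory.talos.dev is itself a small YAML document; a minimal version encoding just the qemu-guest-agent extension would look roughly like this (a sketch based on the factory's documented schema):

```yaml
# Talos Image Factory schematic (minimal sketch)
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/qemu-guest-agent
```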
Networking
A single private network connects all nodes. Hetzner’s eu-central zone spans three datacenters (fsn1, nbg1, hel1), so workers can be distributed for availability.
Firewall rules
The firewall only allows HTTP/HTTPS (for ingress), WireGuard (UDP 51820), and ICMP. There is no SSH as Talos doesn’t have an SSH server. Hetzner firewalls don’t filter private network traffic, so all inter-node communication is unrestricted by design.
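In Terraform, those rules boil down to a small `hcloud_firewall` resource. A sketch (resource labels and rule ordering are illustrative, not copied from the repo):

```hcl
resource "hcloud_firewall" "cluster" {
  name = "cluster-firewall"

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "80" # HTTP ingress
    source_ips = ["0.0.0.0/0", "::/0"]
  }

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "443" # HTTPS ingress
    source_ips = ["0.0.0.0/0", "::/0"]
  }

  rule {
    direction  = "in"
    protocol   = "udp"
    port       = "51820" # WireGuard
    source_ips = ["0.0.0.0/0", "::/0"]
  }

  rule {
    direction  = "in"
    protocol   = "icmp" # Ping
    source_ips = ["0.0.0.0/0", "::/0"]
  }
}
```

Note that there is no SSH rule at all: Talos simply has nothing listening on port 22.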
Load balancer
A Terraform-created lb11 sits on the private network at 10.0.1.254. Its targets and service definitions are managed by the Hetzner CCM (not Terraform). The CCM watches for Kubernetes Services with matching annotations and auto-configures the LB.
Talos machine configuration
After first setting up the cluster with k3s on Ubuntu (a Hetzner base image), I went looking for a different, more sophisticated solution and remembered Talos. Talos has the benefit of being immutable and more locked down, which helps get the cluster into a reproducible, stable, and secure state.
Here are the key design decisions baked into the control plane config (hetzner_k8s/talos.tf):
CNI and kube-proxy disabled, Cilium takes over:
```yaml
cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true
```
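With both disabled, the Cilium Helm release has to take over both roles. A minimal sketch of the matching Cilium values, assuming Talos's KubePrism API endpoint on localhost:7445 (check the repo for the actual values used):

```yaml
# Cilium Helm values sketch for Talos (assumes KubePrism is enabled)
kubeProxyReplacement: true
k8sServiceHost: localhost # Talos KubePrism load-balanced API endpoint
k8sServicePort: 7445
ipam:
  mode: kubernetes
```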
Control plane scheduling disabled, the node gets a NoSchedule taint so only infrastructure pods with explicit tolerations run there:
```yaml
cluster:
  allowSchedulingOnControlPlanes: false
```
WireGuard interface on the control plane, so you can reach the K8s API without exposing port 6443 publicly:
```yaml
machine:
  network:
    interfaces:
      - interface: wg0
        mtu: 1420
        addresses: ["10.200.200.1/24"]
        wireguard:
          privateKey: ${wireguard_server_private_key}
          listenPort: 51820
          peers:
            - publicKey: ${wireguard_client_public_key}
              allowedIPs: ["10.200.200.2/32"]
```
etcd only advertised on private network, prevents etcd from binding to public IPs:
```yaml
cluster:
  etcd:
    advertisedSubnets: ["10.0.1.0/24"]
```
Secure access with WireGuard
As mentioned in the intro, this might be a bit paranoid, but I want to lock the server down as much as possible so I never have to deal with malicious actors. I follow a defense-in-depth approach of locking down everything that doesn't need to be public, and the control plane is an important part of that. Hence the Kubernetes API (port 6443) and Talos API (port 50000) are never exposed publicly. Instead, I set up a WireGuard tunnel using wireproxy, a userspace WireGuard client that requires no root privileges and no kernel module. (Why wireproxy? I want to bring the tunnel up with a single command, without sudo on my local machine. wg-quick needs kernel/sudo access; wireproxy keeps everything in userspace. And since the traffic I send over the tunnel is negligible, there is no relevant performance downside.)
Terraform generates a .wireproxy.conf and starts it automatically:
```ini
[Interface]
PrivateKey = <client-private-key>
Address = 10.200.200.2/32
MTU = 1420

[Peer]
PublicKey = <server-public-key>
Endpoint = <control-plane-public-ip>:51820
AllowedIPs = 10.200.200.0/24, 10.0.0.0/16
PersistentKeepalive = 25

[TCPClientTunnel]
BindAddress = 127.0.0.1:50000
Target = 10.200.200.1:50000

[TCPClientTunnel]
BindAddress = 127.0.0.1:6443
Target = 10.200.200.1:6443
```
After terraform apply, kubectl and talosctl talk to 127.0.0.1. Wireproxy tunnels the traffic through WireGuard to the control plane’s private VPN address. (For later connections to the cluster, you need to set up wireproxy again, as documented in the repo’s README.)
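Conceptually, the generated kubeconfig then just points at the local tunnel endpoint instead of the private API address (an illustrative fragment, not the actual generated file; the cluster name is a placeholder):

```yaml
# kubeconfig fragment (sketch)
clusters:
  - name: hetzner-k8s # placeholder name
    cluster:
      server: https://127.0.0.1:6443 # wireproxy TCP tunnel to the control plane
      certificate-authority-data: <base64 CA, taken from terraform output>
```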
GitOps with ArgoCD
The cluster originally used a 686-line shell script (post-deploy.sh) that sequentially installed every component via Helm and kubectl. That approach was fragile: no retries on transient failures, no drift detection, no self-healing, and it required manual re-runs whenever something changed. I migrated to ArgoCD using the app-of-apps pattern, and it was one of the best decisions for this project.
What Terraform bootstraps
Terraform installs only the bare minimum that must exist before ArgoCD can function:
- Cilium (Helm release): Without the CNI, nodes stay `NotReady` and nothing can schedule.
- Wait for CoreDNS: A `null_resource` polls CoreDNS readiness so image pulls from external registries work.
- ArgoCD (Helm release): Installed with tolerations for both the `control-plane` and `cloud-provider-uninitialized` taints, solving the chicken-and-egg problem where the CCM (which removes the cloud-provider taint) is itself deployed by ArgoCD.
- Secrets and ConfigMaps: Terraform generates passwords (`random_password`), SSH keys, and cloud-init configs and creates Kubernetes secrets that ArgoCD-managed apps reference by name. This keeps secrets out of git while letting ArgoCD manage the workloads.
- Root Application: A single ArgoCD Application that points to the `gitops/root-app/` Helm chart.
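The root Application itself is tiny; a sketch of what it points at (the repo URL and names are placeholders — the real resource is created by Terraform in argocd.tf):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root # placeholder
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<you>/hetzner_k8s # placeholder
    path: gitops/root-app
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```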
App-of-apps and sync waves
The root Helm chart contains one ArgoCD Application template per component. Sync wave annotations control the installation order, just like the old shell script did, but with automatic retries, self-healing, and drift detection:
| Wave | Component | Purpose |
|---|---|---|
| -5 | Hetzner CCM | Removes cloud-provider taint, enables LB management |
| -4 | Hetzner CSI | Persistent volumes via Hetzner block storage |
| -3 | metrics-server | CPU/memory metrics for HPA |
| -2 | Traefik | Ingress controller, auto-registered with Hetzner LB |
| -1 | cert-manager | TLS certificate automation |
| 0 | cert-issuers | Let’s Encrypt staging + production ClusterIssuers |
| 4 | Cluster autoscaler | Scales workers 0 to 5 based on pending pods |
| 5 | Monitoring | VictoriaMetrics + Grafana (optional) |
| 5 | kwatch | Cluster event notifications (optional) |
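Each Application template carries its wave via the standard ArgoCD annotation; for the CCM (wave -5), for instance, that is just:

```yaml
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-5"
```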
All applications share a common sync policy defined in a Helm helper (gitops/root-app/templates/_helpers.tpl):
```yaml
{{- define "root-app.syncPolicy" -}}
syncPolicy:
  automated:
    prune: true
    selfHeal: true
  syncOptions:
    - ServerSideApply=true
  retry:
    limit: 5
    backoff:
      duration: 5s
      factor: 2
      maxDuration: 3m
{{- end -}}
```
Server-side apply
The ServerSideApply=true setting deserves an explanation. By default, ArgoCD uses client-side apply, which stores the entire previous manifest in a kubectl.kubernetes.io/last-applied-configuration annotation on every resource. That annotation has a hard limit of 262144 bytes. Several CRDs in this cluster blow past that limit, particularly the six prometheus-operator CRDs shipped by the VictoriaMetrics Helm chart. When client-side apply hits the limit, the sync fails permanently and no amount of retries will fix it.
Server-side apply sidesteps this entirely. Instead of stuffing the previous state into an annotation, Kubernetes tracks field ownership via managedFields in the resource metadata. There is no size limit. The API server also handles the diff and merge logic itself, which gives more accurate dry runs (including admission webhook validation) and proper conflict detection when multiple controllers manage the same resource. For the monitoring app specifically, ServerSideDiff=true is also enabled so ArgoCD delegates the diff calculation to the API server too, which handles unknown status fields in CRDs more gracefully than client-side structured merge diff.
I initially only enabled ServerSideApply for the apps that needed it (KubeVirt, CDI, monitoring), but after hitting the annotation limit on yet another CRD, I moved it into the shared helper so every app gets it by default. There is no downside for smaller resources, and it prevents future surprises.
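For reference, ArgoCD exposes server-side diff per application through its compare-options annotation (this is the upstream mechanism; shown here as a sketch):

```yaml
metadata:
  annotations:
    argocd.argoproj.io/compare-options: ServerSideDiff=true
```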
What post-deploy.sh does now
The script went from 686 lines to about 75. All it does is fetch the kubeconfig from Terraform output, rewrite the API server address for the WireGuard tunnel, and print the cluster status with ArgoCD application sync states. No more Helm installs, no more kubectl apply, no more manual steps.
Ingress with Traefik and Hetzner LB
Traefik runs as a DaemonSet (not a Deployment) so it’s present on every node, including the control plane. This is important because the control plane might be the only node when no workers are running. The Traefik values live in the ArgoCD Application template at gitops/root-app/templates/traefik.yaml:
```yaml
deployment:
  kind: DaemonSet
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```
The Hetzner LB is configured via annotations on the Traefik Service. The CCM watches these and auto-manages the load balancer:
```yaml
service:
  annotations:
    load-balancer.hetzner.cloud/name: "hetzner-k8s-lb"
    load-balancer.hetzner.cloud/network-zone: eu-central
    load-balancer.hetzner.cloud/use-private-ip: "true"
    load-balancer.hetzner.cloud/uses-proxyprotocol: "true"
```
Proxy protocol is essential. Without it, your application logs show the LB’s IP instead of the real client. Both Traefik and the LB must agree on proxy protocol being enabled:
```yaml
ports:
  web:
    port: 80
    proxyProtocol:
      trustedIPs: ["10.0.0.0/8", "127.0.0.1/32"]
    forwardedHeaders:
      trustedIPs: ["10.0.0.0/8", "127.0.0.1/32"]
  websecure:
    port: 443
    proxyProtocol:
      trustedIPs: ["10.0.0.0/8", "127.0.0.1/32"]
```
Automatic TLS with cert-manager
cert-manager handles Let’s Encrypt certificates automatically. Two ClusterIssuers are configured in gitops/apps/cert-issuers/: staging (for testing without rate limits) and production.
To get a certificate for your service, add a single annotation and a tls block to your Ingress:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  tls:
    - hosts: ["hello.k8s.example.com"]
      secretName: hello-tls
  rules:
    - host: hello.k8s.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hello
                port:
                  number: 80
```
cert-manager sees the annotation, creates a Certificate resource, solves the HTTP-01 challenge through Traefik, and stores the signed certificate in the hello-tls Secret. Renewal is automatic.
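A production ClusterIssuer of this kind is only a few lines; a sketch assuming the HTTP-01 solver goes through Traefik (the email and secret name are placeholders):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com # placeholder
    privateKeySecretRef:
      name: letsencrypt-prod-account-key # placeholder
    solvers:
      - http01:
          ingress:
            ingressClassName: traefik
```

The staging issuer is identical except for the ACME server URL.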
Cluster autoscaler
The autoscaler uses the Hetzner cloud provider to spin up and tear down cx33 workers based on pending pod demand. The configuration in gitops/apps/autoscaler/cluster-autoscaler.yaml defines two node groups, one per datacenter location:
```yaml
command:
  - ./cluster-autoscaler
  - --cloud-provider=hetzner
  - --nodes=0:5:cx33:nbg1:workers
  - --nodes=0:5:cx33:hel1:workers
  - --scale-down-enabled=true
  - --scale-down-delay-after-add=5m
  - --scale-down-unneeded-time=5m
  - --scale-down-utilization-threshold=0.5
  - --balance-similar-node-groups=true
  - --expander=least-waste
```
Each location can scale from 0 to 5 workers independently. balance-similar-node-groups distributes workers evenly across nbg1 and hel1 for availability. least-waste picks the group that would use resources most efficiently.
Terraform creates zero static workers (initial_worker_count = 0). The autoscaler fully manages the worker lifecycle, spinning nodes up when pods are pending and tearing them down after 5 minutes of low utilization. This was a deliberate change from the earlier setup where Terraform created 2 permanent workers that the autoscaler couldn’t scale down without causing Terraform state drift.
The Talos worker machine config is passed to the autoscaler as a base64-encoded cloud-init secret. Terraform creates this secret directly in argocd.tf, so there are no manual kubectl commands involved:
```hcl
resource "kubernetes_secret" "autoscaler" {
  metadata {
    name      = "hcloud-autoscaler"
    namespace = "cluster-autoscaler"
  }

  data = {
    token      = var.hcloud_token
    cloud-init = base64encode(data.talos_machine_configuration.worker.machine_configuration)
    network    = hcloud_network.cluster.name
    firewall   = hcloud_firewall.cluster.name
    image      = data.hcloud_image.talos.id
  }
}
```
The autoscaler deployment itself tolerates the control plane taint and prefers scheduling on the control plane, so it runs without any workers:
```yaml
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
```
Monitoring with VictoriaMetrics
The cluster originally ran Prometheus via kube-prometheus-stack, but I switched to VictoriaMetrics for lower resource usage. On a small cx33 node with 8 GB RAM, every megabyte counts, and VictoriaMetrics delivers the same functionality with a noticeably smaller footprint.
The monitoring stack is deployed as an optional ArgoCD Application (gated behind enable_monitoring) using the victoria-metrics-k8s-stack Helm chart (v0.72.4). It includes:
- VMSingle: The metrics storage engine, replacing Prometheus. Configured with 7-day retention and 10Gi persistent storage on Hetzner block volumes.
- VMAgent: Scrapes metrics from all targets and forwards them to VMSingle. The VictoriaMetrics operator auto-converts any existing Prometheus `ServiceMonitor` and `PodMonitor` CRDs into its native `VMServiceScrape` and `VMPodScrape` resources, so other apps that ship ServiceMonitors (like cert-manager or Traefik) work without changes.
- VMAlert: Evaluates alerting and recording rules. The ~100 default PrometheusRules from the Helm chart (KubeNodeNotReady, KubePodCrashLooping, etc.) are auto-converted to VMRules.
- VMAlertmanager: Routes alerts to Pushover for mobile notifications. Credentials are stored in a Terraform-managed Kubernetes secret and mounted as files, so they never appear in the Alertmanager config YAML.
- Grafana: Auto-configured to use VMSingle as its datasource. Comes with default dashboards for node, pod, and cluster metrics. Admin credentials are generated by Terraform.
Several Talos-incompatible scrape targets (etcd, kube-scheduler, kube-controller-manager, kube-proxy) are disabled, as Talos does not expose those metrics endpoints in the standard way.
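In the victoria-metrics-k8s-stack values, disabling those targets is a handful of flags (the component keys follow the chart's kube-prometheus-stack-compatible layout; verify against the chart version in use):

```yaml
# Disable scrape targets Talos does not expose in the standard way
kubeEtcd:
  enabled: false
kubeScheduler:
  enabled: false
kubeControllerManager:
  enabled: false
kubeProxy:
  enabled: false # kube-proxy is replaced by Cilium anyway
```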
The trade-off of switching was losing historical metric data, but with a 7-day retention window, full coverage is back within a week. The migration itself was painless because ArgoCD pruned all old Prometheus resources automatically.
Alerting with Pushover
I wanted alerts on my phone without running a separate notification service. The stack has two notification paths:
- VMAlertmanager with Pushover receiver: Handles all ~100 built-in PrometheusRules. The config uses `user_key_file` and `token_file` (mounted from a Kubernetes secret) so credentials are never in git or in the Alertmanager config.
- kwatch: A lightweight Kubernetes event watcher that monitors all namespaces, PVCs, and nodes. It sends notifications to Pushover via their webhook integration. The webhook URL (which contains a secret token) lives in a Terraform-managed ConfigMap.
Both notification paths go to Pushover, but they cover different failure modes. kwatch catches events that don’t trigger Prometheus alerts (like a pod failing to pull an image), while VMAlertmanager covers metric-based conditions (like node memory pressure or persistent high CPU).
Security hardening
I try to lock down this cluster as much as possible:
Immutable OS
Talos Linux has no SSH server, no shell, no package manager. The root filesystem is read-only. You manage nodes exclusively through the Talos API (which is mTLS-authenticated and only reachable via WireGuard). If someone compromises a container, there is no host to escape to.
No public management ports
The Kubernetes API (6443) and Talos API (50000) are never exposed publicly. The firewall only opens HTTP/HTTPS (for ingress), WireGuard (51820), and ICMP. All management traffic flows through the WireGuard tunnel to localhost.
Control plane isolation
The control plane is tainted with node-role.kubernetes.io/control-plane:NoSchedule. Only infrastructure services (Cilium, ArgoCD, Traefik, CCM, CSI, cert-manager, metrics-server, monitoring) carry explicit tolerations. Application workloads must schedule on worker nodes, which the autoscaler provisions on demand. This prevents application pods from consuming the control plane’s RAM and causing instability.
Pod Security Standards
Namespaces enforce the restricted Pod Security Standard, which blocks privilege escalation, requires non-root users, and drops all capabilities:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```
Hardened security contexts
Every workload runs with the minimum possible privileges:
```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  seccompProfile:
    type: RuntimeDefault
  capabilities:
    drop: ["ALL"]
```
Resource governance
ResourceQuotas prevent any single namespace from consuming the cluster:
```yaml
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "10"
```
RBAC
The cluster autoscaler ServiceAccount has minimal permissions, only what’s needed for node scaling, pod eviction, and leader election. No cluster-admin roles are assigned to workloads.
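As a sketch of what "minimal" means here, the relevant rules look roughly like the upstream cluster-autoscaler ClusterRole trimmed to node management, eviction, and leader election (illustrative, not copied from the repo):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler # illustrative name
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: [""]
    resources: ["pods", "services", "persistentvolumeclaims"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/eviction"] # drain pods before scale-down
    verbs: ["create"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"] # leader election
    verbs: ["get", "create", "update"]
```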
Secrets management
All passwords and credentials are generated by Terraform (random_password, tls_private_key) and stored in Terraform state. Terraform creates Kubernetes secrets that ArgoCD-managed apps reference by name. Nothing sensitive ever touches git. You retrieve passwords via terraform output -raw grafana_password and similar commands.
This setup gives you a production-capable Kubernetes cluster on Hetzner for about €14/month at rest, scaling to ~€59 under load. The key ingredients:
- Talos Linux eliminates an entire class of security concerns by removing SSH, shells, and package managers.
- Cilium replaces kube-proxy with eBPF for better performance and observability.
- WireGuard via wireproxy keeps management APIs private without requiring root or kernel modules.
- ArgoCD manages all workloads declaratively with automatic drift detection, self-healing, and sync wave ordering.
- VictoriaMetrics provides full monitoring and alerting with lower resource usage than Prometheus.
- Cluster autoscaler scales workers from 0 to 5 based on actual demand, keeping costs proportional to usage.
- Terraform + Packer make the whole thing reproducible. Tear it down and rebuild it in minutes.
The full code, including Terraform configs, Packer template, GitOps manifests, and the ArgoCD app-of-apps setup, is in the hetzner_k8s/ directory.
Cost breakdown
The base cluster (one control plane node and a load balancer) costs about €14/month. Workers scale from zero on demand.
| Resource | Spec | Monthly cost |
|---|---|---|
| Control plane | cx33 (4 vCPU, 8 GB RAM) | ~€6 |
| Load balancer | lb11 | ~€5 |
| Base total | | ~€14 |
| Workers (0–5) | cx33, on demand | €0–30 |
| Maximum | 5 workers | ~€59 |
I originally ran cx23 nodes (4 GB RAM), but the Kubernetes infrastructure overhead alone (Cilium, ArgoCD, Traefik, CCM, CSI, cert-manager, metrics-server, VictoriaMetrics) consumed nearly all of it, leaving nothing for actual workloads. Upgrading to cx33 (8 GB RAM) was the pragmatic fix. It is a bit mind-blowing that serving a single website with a WordPress blog now requires two servers (control plane + one worker), but that is the Kubernetes tax you pay for all the automation, self-healing, and GitOps convenience. Whether that trade-off is worth it depends on your priorities. For me, the learning experience and the “it just works” factor of ArgoCD reconciling everything automatically make it worth a few extra euros.
The autoscaler fully manages the worker lifecycle. When application pods are deployed and cannot schedule on the tainted control plane, the autoscaler provisions cx33 workers within a few minutes. When utilization drops below 50% for 5 minutes, it tears them back down. No static workers, no Terraform state drift, just pay for what you use.