Production-ready Kubernetes homelab on Hetzner Cloud for ~€14/month

I wanted a cheap, production-grade homelab Kubernetes cluster. Most cloud providers charge €70+/month just for the control plane; my setup on Hetzner Cloud runs for about €14/month using Talos Linux, Cilium, and WireGuard. Everything is automated with Terraform, Packer, and ArgoCD for GitOps.

As this is my ongoing, evolving homelab setup, this post might change from time to time. The Terraform code is split across multiple files (main.tf, talos.tf, argocd.tf, etc.) and all workloads are managed declaratively through ArgoCD. The cluster spreads worker nodes across multiple European zones, purely to try it out: for a home setup this size there is no real-world benefit, only downsides from cross-zone networking.

This article is not meant as full 0-to-100 documentation (that is better done in the repo itself). Instead, I explain the design decisions and why I went for certain parts in the setup. The full code (Terraform, Packer, GitOps manifests) is available on GitHub: hetzner_k8s


Architecture overview

I decided to put all cluster management behind a WireGuard tunnel, which keeps the attack surface small. The only ports open to the world are 80 (HTTP) and 443 (HTTPS), exposed via a Traefik load balancer (with certificates managed via Let’s Encrypt).

This also protects cluster-internal services that have no ingress: the monitoring stack (Grafana) and the ArgoCD UI are only reachable through the WireGuard tunnel plus port forwarding. This might be a bit paranoid, but as I don’t want people running crypto miners at my expense (especially since the cluster auto-scales nodes), better safe than sorry.

The control plane node is tainted with NoSchedule, so only infrastructure services (Cilium, Traefik, ArgoCD, CCM, CSI, metrics-server, monitoring) that carry explicit tolerations run there. Application workloads become Pending until the cluster autoscaler provisions worker nodes. This keeps the control plane stable while still allowing the cluster to function without any workers for pure infrastructure.

Building the Talos image with Packer

Hetzner doesn’t offer Talos Linux as a native image. You build a snapshot using Packer and Hetzner’s rescue mode, a temporary Linux environment that lets you write directly to the server’s disk.

The Packer config lives in hetzner_k8s/packer/hcloud.pkr.hcl:

source "hcloud" "talos" {
  token       = var.hcloud_token
  location    = var.server_location
  server_type = var.server_type
  server_name = "packer-talos-builder"

  rescue      = "linux64"        # Boot into rescue mode
  image       = "ubuntu-24.04"   # Base image (irrelevant, rescue overwrites it)

  snapshot_name = "talos-${var.talos_version}"
  snapshot_labels = {
    os      = "talos"
    version = var.talos_version
  }

  ssh_username = "root"
}

The schematic ID (ce4c980550dd2ab1b17bbf2b08801c7eb59418eafe8f279833297925d67c7515) is generated at factory.talos.dev and encodes the qemu-guest-agent extension. Packer creates a snapshot that Terraform references by label:

data "hcloud_image" "talos" {
  with_selector     = "os=talos"
  most_recent       = true
  with_architecture = "x86"
}
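Inside rescue mode, the build step simply streams the Talos factory image onto the server’s disk. A sketch of what that provisioner can look like (the exact factory URL layout and the target device are assumptions from my setup; the schematic ID is the one mentioned below):

```hcl
build {
  sources = ["source.hcloud.talos"]

  provisioner "shell" {
    inline = [
      # Download the Talos hcloud image for the factory schematic,
      # decompress it, and write it straight to the server's disk.
      "wget -qO- https://factory.talos.dev/image/ce4c980550dd2ab1b17bbf2b08801c7eb59418eafe8f279833297925d67c7515/${var.talos_version}/hcloud-amd64.raw.xz | xz -dc | dd of=/dev/sda bs=4M",
      "sync",
    ]
  }
}
```

When Packer shuts the server down afterwards, Hetzner snapshots the disk — and that snapshot is the Talos image Terraform boots from.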

Networking

A single private network connects all nodes. Hetzner’s eu-central zone spans three datacenters (fsn1, nbg1, hel1), so workers can be distributed for availability.

Firewall rules

The firewall only allows HTTP/HTTPS (for ingress), WireGuard (UDP 51820), and ICMP. There is no SSH rule, because Talos doesn’t run an SSH server. Hetzner firewalls don’t filter private network traffic, so all inter-node communication is unrestricted by design.
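In Terraform that boils down to a handful of inbound rules (resource name and rule ordering here are illustrative):

```hcl
resource "hcloud_firewall" "cluster" {
  name = "hetzner-k8s-fw"

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "80" # HTTP ingress
    source_ips = ["0.0.0.0/0", "::/0"]
  }
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "443" # HTTPS ingress
    source_ips = ["0.0.0.0/0", "::/0"]
  }
  rule {
    direction  = "in"
    protocol   = "udp"
    port       = "51820" # WireGuard
    source_ips = ["0.0.0.0/0", "::/0"]
  }
  rule {
    direction  = "in"
    protocol   = "icmp"
    source_ips = ["0.0.0.0/0", "::/0"]
  }
}
```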

Load balancer

A Terraform-created lb11 sits on the private network at 10.0.1.254. Its targets and service definitions are managed by the Hetzner CCM (not Terraform). The CCM watches for Kubernetes Services with matching annotations and auto-configures the LB.



Talos machine configuration

After first setting up the cluster with k3s on Ubuntu (Hetzner base image), I looked for a different, more sophisticated solution and remembered Talos. Talos has the benefit of being immutable and more locked down, which helps get the cluster into a reproducible, stable, and secure state.

Here are the key design decisions baked into the control plane config (hetzner_k8s/talos.tf):

CNI and kube-proxy disabled, Cilium takes over:

cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true

Control plane scheduling disabled, the node gets a NoSchedule taint so only infrastructure pods with explicit tolerations run there:

cluster:
  allowSchedulingOnControlPlanes: false

WireGuard interface on the control plane, so you can reach the K8s API without exposing port 6443 publicly:

machine:
  network:
    interfaces:
      - interface: wg0
        mtu: 1420
        addresses: ["10.200.200.1/24"]
        wireguard:
          privateKey: ${wireguard_server_private_key}
          listenPort: 51820
          peers:
            - publicKey: ${wireguard_client_public_key}
              allowedIPs: ["10.200.200.2/32"]

etcd advertised only on the private network, which keeps etcd peer traffic off the public interface:

cluster:
  etcd:
    advertisedSubnets: ["10.0.1.0/24"]

Secure access with WireGuard

As mentioned in the intro, this might be a bit paranoid, but I want to lock the cluster down as much as possible so I never have to deal with malicious actors. I follow a defense-in-depth approach of locking down everything that doesn’t need to be public, and the control plane is an important part of that. Hence the Kubernetes API (port 6443) and Talos API (port 50000) are never exposed publicly. Instead, I set up a WireGuard tunnel using wireproxy, a userspace WireGuard client that requires no root privileges and no kernel module. Why wireproxy? I want to set this up with a single command without any sudo access on my local machine: wg-quick needs kernel/sudo access, while wireproxy keeps everything in userspace. And as the traffic I send over the tunnel is negligibly small, there is no relevant performance downside.

Terraform generates a .wireproxy.conf and starts it automatically:

[Interface]
PrivateKey = <client-private-key>
Address = 10.200.200.2/32
MTU = 1420

[Peer]
PublicKey = <server-public-key>
Endpoint = <control-plane-public-ip>:51820
AllowedIPs = 10.200.200.0/24, 10.0.0.0/16
PersistentKeepalive = 25

[TCPClientTunnel]
BindAddress = 127.0.0.1:50000
Target = 10.200.200.1:50000

[TCPClientTunnel]
BindAddress = 127.0.0.1:6443
Target = 10.200.200.1:6443

After terraform apply, kubectl and talosctl talk to 127.0.0.1. Wireproxy tunnels the traffic through WireGuard to the control plane’s private VPN address. (For later connections to the cluster, you need to set up wireproxy again, as documented in the repo’s README.)
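For kubectl this just means the cluster entry in the generated kubeconfig points at the local tunnel endpoint. A sketch of the relevant part (cluster name and CA data are placeholders):

```yaml
apiVersion: v1
kind: Config
clusters:
  - name: hetzner-k8s
    cluster:
      certificate-authority-data: <base64-ca>
      # Traffic to localhost:6443 is forwarded by wireproxy
      # through WireGuard to the control plane's VPN address.
      server: https://127.0.0.1:6443
```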


GitOps with ArgoCD

The cluster originally used a 686-line shell script (post-deploy.sh) that sequentially installed every component via Helm and kubectl. That approach was fragile: no retries on transient failures, no drift detection, no self-healing, and it required manual re-runs whenever something changed. I migrated to ArgoCD using the app-of-apps pattern, and it was one of the best decisions for this project.

What Terraform bootstraps

Terraform installs only the bare minimum that must exist before ArgoCD can function:

  1. Cilium (Helm release): Without the CNI, nodes stay NotReady and nothing can schedule.
  2. Wait for CoreDNS: A null_resource polls CoreDNS readiness so image pulls from external registries work.
  3. ArgoCD (Helm release): Installed with tolerations for both the control-plane and cloud-provider-uninitialized taints, solving the chicken-and-egg problem where the CCM (which removes the cloud-provider taint) is itself deployed by ArgoCD.
  4. Secrets and ConfigMaps: Terraform generates passwords (random_password), SSH keys, and cloud-init configs and creates Kubernetes secrets that ArgoCD-managed apps reference by name. This keeps secrets out of git while letting ArgoCD manage the workloads.
  5. Root Application: A single ArgoCD Application that points to the gitops/root-app/ Helm chart.
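That root Application is tiny; everything else cascades from it. A sketch of what Terraform creates (repo URL and revision are placeholders for the values in the repo):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<user>/hetzner_k8s
    targetRevision: main
    path: gitops/root-app   # the Helm chart containing one Application per component
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```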

App-of-apps and sync waves

The root Helm chart contains one ArgoCD Application template per component. Sync wave annotations control the installation order, just like the old shell script did, but with automatic retries, self-healing, and drift detection:

| Wave | Component          | Purpose                                              |
|-----:|--------------------|------------------------------------------------------|
|   -5 | Hetzner CCM        | Removes cloud-provider taint, enables LB management  |
|   -4 | Hetzner CSI        | Persistent volumes via Hetzner block storage         |
|   -3 | metrics-server     | CPU/memory metrics for HPA                           |
|   -2 | Traefik            | Ingress controller, auto-registered with Hetzner LB  |
|   -1 | cert-manager       | TLS certificate automation                           |
|    0 | cert-issuers       | Let’s Encrypt staging + production ClusterIssuers    |
|    4 | Cluster autoscaler | Scales workers 0 to 5 based on pending pods          |
|    5 | Monitoring         | VictoriaMetrics + Grafana (optional)                 |
|    5 | kwatch             | Cluster event notifications (optional)               |
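A sync wave is just an annotation on the child Application; ArgoCD syncs lower waves first. A sketch for the CCM (the application name is illustrative, the wave number follows the table above):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: hcloud-ccm
  namespace: argocd
  annotations:
    # Synced before everything else so the cloud-provider taint
    # is removed early and the LB can be managed.
    argocd.argoproj.io/sync-wave: "-5"
```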

All applications share a common sync policy defined in a Helm helper (gitops/root-app/templates/_helpers.tpl):

{{- define "root-app.syncPolicy" -}}
syncPolicy:
  automated:
    prune: true
    selfHeal: true
  syncOptions:
    - ServerSideApply=true
  retry:
    limit: 5
    backoff:
      duration: 5s
      factor: 2
      maxDuration: 3m
{{- end -}}

Server-side apply

The ServerSideApply=true setting deserves an explanation. By default, ArgoCD uses client-side apply, which stores the entire previous manifest in a kubectl.kubernetes.io/last-applied-configuration annotation on every resource. That annotation has a hard limit of 262144 bytes. Several CRDs in this cluster blow past that limit, particularly the six prometheus-operator CRDs shipped by the VictoriaMetrics Helm chart. When client-side apply hits the limit, the sync fails permanently and no amount of retries will fix it.

Server-side apply sidesteps this entirely. Instead of stuffing the previous state into an annotation, Kubernetes tracks field ownership via managedFields in the resource metadata. There is no size limit. The API server also handles the diff and merge logic itself, which gives more accurate dry runs (including admission webhook validation) and proper conflict detection when multiple controllers manage the same resource. For the monitoring app specifically, ServerSideDiff=true is also enabled so ArgoCD delegates the diff calculation to the API server too, which handles unknown status fields in CRDs more gracefully than client-side structured merge diff.

I initially only enabled ServerSideApply for the apps that needed it (KubeVirt, CDI, monitoring), but after hitting the annotation limit on yet another CRD, I moved it into the shared helper so every app gets it by default. There is no downside for smaller resources, and it prevents future surprises.
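Server-side diff, unlike server-side apply, is opted into per application via an annotation rather than a sync option. A sketch of how the monitoring app enables it (the app name is from my setup):

```yaml
metadata:
  name: monitoring
  namespace: argocd
  annotations:
    # Delegate diff calculation to the API server as well.
    argocd.argoproj.io/compare-options: ServerSideDiff=true
```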

What post-deploy.sh does now

The script went from 686 lines to about 75. All it does is fetch the kubeconfig from Terraform output, rewrite the API server address for the WireGuard tunnel, and print the cluster status with ArgoCD application sync states. No more Helm installs, no more kubectl apply, no more manual steps.


Ingress with Traefik and Hetzner LB

Traefik runs as a DaemonSet (not a Deployment) so it’s present on every node, including the control plane. This is important because the control plane might be the only node when no workers are running. The Traefik values live in the ArgoCD Application template at gitops/root-app/templates/traefik.yaml:

deployment:
  kind: DaemonSet

tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule

The Hetzner LB is configured via annotations on the Traefik Service. The CCM watches these and auto-manages the load balancer:

service:
  annotations:
    load-balancer.hetzner.cloud/name: "hetzner-k8s-lb"
    load-balancer.hetzner.cloud/network-zone: eu-central
    load-balancer.hetzner.cloud/use-private-ip: "true"
    load-balancer.hetzner.cloud/uses-proxyprotocol: "true"

Proxy protocol is essential. Without it, your application logs show the LB’s IP instead of the real client. Both Traefik and the LB must agree on proxy protocol being enabled:

ports:
  web:
    port: 80
    proxyProtocol:
      trustedIPs: ["10.0.0.0/8", "127.0.0.1/32"]
    forwardedHeaders:
      trustedIPs: ["10.0.0.0/8", "127.0.0.1/32"]
  websecure:
    port: 443
    proxyProtocol:
      trustedIPs: ["10.0.0.0/8", "127.0.0.1/32"]

Automatic TLS with cert-manager

cert-manager handles Let’s Encrypt certificates automatically. Two ClusterIssuers are configured in gitops/apps/cert-issuers/: staging (for testing without rate limits) and production.
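A sketch of the production ClusterIssuer (the email is a placeholder; the issuer name matches the annotation used below):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key   # ACME account key, stored by cert-manager
    solvers:
      - http01:
          ingress:
            class: traefik   # challenge responses are routed through Traefik
```

The staging issuer is identical except for the ACME server URL (acme-staging-v02), which has far more generous rate limits.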

To get a certificate for your service, add a single annotation and a tls block to your Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  tls:
    - hosts: ["hello.k8s.example.com"]
      secretName: hello-tls
  rules:
    - host: hello.k8s.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hello
                port:
                  number: 80

cert-manager sees the annotation, creates a Certificate resource, solves the HTTP-01 challenge through Traefik, and stores the signed certificate in the hello-tls Secret. Renewal is automatic.


Cluster autoscaler

The autoscaler uses the Hetzner cloud provider to spin up and tear down cx33 workers based on pending pod demand. The configuration in gitops/apps/autoscaler/cluster-autoscaler.yaml defines two node groups, one per datacenter location:

command:
  - ./cluster-autoscaler
  - --cloud-provider=hetzner
  - --nodes=0:5:cx33:nbg1:workers
  - --nodes=0:5:cx33:hel1:workers
  - --scale-down-enabled=true
  - --scale-down-delay-after-add=5m
  - --scale-down-unneeded-time=5m
  - --scale-down-utilization-threshold=0.5
  - --balance-similar-node-groups=true
  - --expander=least-waste

Each location can scale from 0 to 5 workers independently. balance-similar-node-groups distributes workers evenly across nbg1 and hel1 for availability. least-waste picks the group that would use resources most efficiently.

Terraform creates zero static workers (initial_worker_count = 0). The autoscaler fully manages the worker lifecycle, spinning nodes up when pods are pending and tearing them down after 5 minutes of low utilization. This was a deliberate change from the earlier setup where Terraform created 2 permanent workers that the autoscaler couldn’t scale down without causing Terraform state drift.

The Talos worker machine config is passed to the autoscaler as a base64-encoded cloud-init secret. Terraform creates this secret directly in argocd.tf, so there are no manual kubectl commands involved:

resource "kubernetes_secret" "autoscaler" {
  metadata {
    name      = "hcloud-autoscaler"
    namespace = "cluster-autoscaler"
  }
  data = {
    token      = var.hcloud_token
    cloud-init = base64encode(data.talos_machine_configuration.worker.machine_configuration)
    network    = hcloud_network.cluster.name
    firewall   = hcloud_firewall.cluster.name
    image      = data.hcloud_image.talos.id
  }
}

The autoscaler deployment itself tolerates the control plane taint and prefers scheduling on the control plane, so it runs without any workers:

tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists

Monitoring with VictoriaMetrics

The cluster originally ran Prometheus via kube-prometheus-stack, but I switched to VictoriaMetrics for lower resource usage. On a small cx33 node with 8 GB RAM, every megabyte counts, and VictoriaMetrics delivers the same functionality with a noticeably smaller footprint.

The monitoring stack is deployed as an optional ArgoCD Application (gated behind enable_monitoring) using the victoria-metrics-k8s-stack Helm chart (v0.72.4). It includes:

  • VMSingle: The metrics storage engine, replacing Prometheus. Configured with 7 day retention and 10Gi persistent storage on Hetzner block volumes.
  • VMAgent: Scrapes metrics from all targets and forwards them to VMSingle. The VictoriaMetrics operator auto-converts any existing Prometheus ServiceMonitor and PodMonitor CRDs into its native VMServiceScrape and VMPodMonitor resources, so other apps that ship ServiceMonitors (like cert-manager or Traefik) work without changes.
  • VMAlert: Evaluates alerting and recording rules. The ~100 default PrometheusRules from the Helm chart (KubeNodeNotReady, KubePodCrashLooping, etc.) are auto-converted to VMRules.
  • VMAlertmanager: Routes alerts to Pushover for mobile notifications. Credentials are stored in a Terraform-managed Kubernetes secret and mounted as files, so they never appear in the Alertmanager config YAML.
  • Grafana: Auto-configured to use VMSingle as its datasource. Comes with default dashboards for node, pod, and cluster metrics. Admin credentials are generated by Terraform.

Several Talos-incompatible scrape targets (etcd, kube-scheduler, kube-controller-manager, kube-proxy) are disabled, as Talos does not expose those metrics endpoints in the standard way.
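A sketch of the relevant victoria-metrics-k8s-stack values (key names follow the chart’s layout as I understand it; the retention and storage figures are the ones mentioned above):

```yaml
vmsingle:
  spec:
    retentionPeriod: "7d"
    storage:
      resources:
        requests:
          storage: 10Gi   # Hetzner block volume via the CSI driver

# Talos does not expose these control plane metrics endpoints:
kubeEtcd:
  enabled: false
kubeScheduler:
  enabled: false
kubeControllerManager:
  enabled: false
kubeProxy:
  enabled: false   # kube-proxy is replaced by Cilium anyway
```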

The trade-off of switching was losing historical metric data, but with a 7 day retention window, full coverage is back within a week. The migration itself was painless because ArgoCD pruned all old Prometheus resources automatically.

Alerting with Pushover

I wanted alerts on my phone without running a separate notification service. The stack has two notification paths:

  1. VMAlertmanager with Pushover receiver: Handles all ~100 built-in PrometheusRules. The config uses user_key_file and token_file (mounted from a Kubernetes secret) so credentials are never in git or in the Alertmanager config.
  2. kwatch: A lightweight Kubernetes event watcher that monitors all namespaces, PVCs, and nodes. It sends notifications to Pushover via their webhook integration. The webhook URL (which contains a secret token) lives in a Terraform-managed ConfigMap.

Both notification paths go to Pushover, but they cover different failure modes. kwatch catches events that don’t trigger Prometheus alerts (like a pod failing to pull an image), while VMAlertmanager covers metric-based conditions (like node memory pressure or persistent high CPU).
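The kwatch side is a small config file; a sketch (the webhook URL is Terraform-templated and shown here as a placeholder, and the exact key names may differ between kwatch versions):

```yaml
alert:
  webhook:
    url: https://webhook.example/pushover/<token>   # contains the secret token
namespaces: []   # empty list = watch all namespaces
pvcMonitor:
  enabled: true
  threshold: 80   # alert when a PVC is over 80% full
```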


Security hardening

I try to lock down this cluster as much as possible:

Immutable OS

Talos Linux has no SSH server, no shell, no package manager. The root filesystem is read-only. You manage nodes exclusively through the Talos API (which is mTLS-authenticated and only reachable via WireGuard). If someone compromises a container, there is no host to escape to.

No public management ports

The Kubernetes API (6443) and Talos API (50000) are never exposed publicly. The firewall only opens HTTP/HTTPS (for ingress), WireGuard (51820), and ICMP. All management traffic flows through the WireGuard tunnel to localhost.

Control plane isolation

The control plane is tainted with node-role.kubernetes.io/control-plane:NoSchedule. Only infrastructure services (Cilium, ArgoCD, Traefik, CCM, CSI, cert-manager, metrics-server, monitoring) carry explicit tolerations. Application workloads must schedule on worker nodes, which the autoscaler provisions on demand. This prevents application pods from consuming the control plane’s RAM and causing instability.

Pod Security Standards

Namespaces enforce the restricted Pod Security Standard, which blocks privilege escalation, requires non-root users, and drops all capabilities:

apiVersion: v1
kind: Namespace
metadata:
  name: demo
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted

Hardened security contexts

Every workload runs with the minimum possible privileges:

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  seccompProfile:
    type: RuntimeDefault
  capabilities:
    drop: ["ALL"]

Resource governance

ResourceQuotas prevent any single namespace from consuming the cluster:

spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "10"

RBAC

The cluster autoscaler ServiceAccount has minimal permissions, only what’s needed for node scaling, pod eviction, and leader election. No cluster-admin roles are assigned to workloads.

Secrets management

All passwords and credentials are generated by Terraform (random_password, tls_private_key) and stored in Terraform state. Terraform creates Kubernetes secrets that ArgoCD-managed apps reference by name. Nothing sensitive ever touches git. You retrieve passwords via terraform output -raw grafana_password and similar commands.
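The pattern looks roughly like this (resource and secret names are illustrative):

```hcl
# Generate the credential; it lives only in Terraform state.
resource "random_password" "grafana" {
  length  = 32
  special = false
}

# Hand it to the cluster; the ArgoCD-managed Grafana app
# references this secret by name, so nothing touches git.
resource "kubernetes_secret" "grafana_admin" {
  metadata {
    name      = "grafana-admin"
    namespace = "monitoring"
  }
  data = {
    admin-user     = "admin"
    admin-password = random_password.grafana.result
  }
}

output "grafana_password" {
  value     = random_password.grafana.result
  sensitive = true
}
```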


This setup gives you a production-capable Kubernetes cluster on Hetzner for about €14/month at rest, scaling to ~€59 under load. The key ingredients:

  • Talos Linux eliminates an entire class of security concerns by removing SSH, shells, and package managers.
  • Cilium replaces kube-proxy with eBPF for better performance and observability.
  • WireGuard via wireproxy keeps management APIs private without requiring root or kernel modules.
  • ArgoCD manages all workloads declaratively with automatic drift detection, self-healing, and sync wave ordering.
  • VictoriaMetrics provides full monitoring and alerting with lower resource usage than Prometheus.
  • Cluster autoscaler scales workers from 0 to 5 based on actual demand, keeping costs proportional to usage.
  • Terraform + Packer make the whole thing reproducible. Tear it down and rebuild it in minutes.

The full code, including Terraform configs, Packer template, GitOps manifests, and the ArgoCD app-of-apps setup, is in the hetzner_k8s/ directory.

Cost breakdown

The base cluster (one control plane node and a load balancer) costs about €14/month. Workers scale from zero on demand.

| Resource      | Spec                    | Monthly cost |
|---------------|-------------------------|--------------|
| Control plane | cx33 (4 vCPU, 8 GB RAM) | ~€6          |
| Load balancer | lb11                    | ~€5          |
| Base total    |                         | ~€14         |
| Workers (0–5) | cx33, on demand         | €0–30        |
| Maximum       | 5 workers               | ~€59         |

I originally ran cx23 nodes (4 GB RAM), but the Kubernetes infrastructure overhead alone (Cilium, ArgoCD, Traefik, CCM, CSI, cert-manager, metrics-server, VictoriaMetrics) consumed nearly all of it, leaving nothing for actual workloads. Upgrading to cx33 (8 GB RAM) was the pragmatic fix. It is a bit mind-blowing that serving a single website with a WordPress blog now requires two servers (control plane + one worker), but that is the Kubernetes tax you pay for all the automation, self-healing, and GitOps convenience. Whether that trade-off is worth it depends on your priorities. For me, the learning experience and the “it just works” factor of ArgoCD reconciling everything automatically make it worth a few extra euros.

The autoscaler fully manages the worker lifecycle. When application pods are deployed and cannot schedule on the tainted control plane, the autoscaler provisions cx33 workers within a few minutes. When utilization drops below 50% for 5 minutes, it tears them back down. No static workers, no Terraform state drift, just pay for what you use.
