{"id":1299,"date":"2026-02-13T09:30:18","date_gmt":"2026-02-13T08:30:18","guid":{"rendered":"https:\/\/simon-frey.com\/blog\/?p=1299"},"modified":"2026-03-10T14:24:20","modified_gmt":"2026-03-10T13:24:20","slug":"kubernetes-on-hetzner-cloud","status":"publish","type":"post","link":"https:\/\/simon-frey.com\/blog\/kubernetes-on-hetzner-cloud\/","title":{"rendered":"Production-ready Kubernetes homelab on Hetzner Cloud for ~\u20ac10\/month"},"content":{"rendered":"\n<p>I wanted a cheap, production-grade homelab Kubernetes cluster. Most cloud providers charge \u20ac70+\/month just for the control plane and my setup on <a href=\"https:\/\/www.hetzner.com\/cloud\/?utm_source=simon-frey.com\">Hetzner Cloud<\/a> runs for about \u20ac14\/month using <strong>Talos Linux<\/strong>, <strong>Cilium<\/strong>, and <strong>WireGuard<\/strong>. Everything is automated with Terraform, Packer, and <strong>ArgoCD<\/strong> for GitOps.<\/p>\n\n\n\n<p>As this is my ongoing, evolving homelab setup, this post might change from time to time. The Terraform code is split across multiple files (<code>main.tf<\/code>, <code>talos.tf<\/code>, <code>argocd.tf<\/code>, etc.) and all workloads are managed declaratively through ArgoCD. The cluster runs worker nodes in multiple European zones, just for trying it out, no real valuable reason for a homesetup. With my small homelab cluster this has no real-world benefit, rather downsides due to cross-zone networking.<\/p>\n\n\n\n<p>This article is not meant as full 0-to-100 documentation (that is better done in the repo itself). Instead, I explain the design decisions and why I went for certain parts in the setup. 
The full code (Terraform, Packer, GitOps manifests) is available on GitHub: <a href=\"https:\/\/simon-frey.com\/blog\/kubernetes-on-hetzner-cloud\/hetzner_k8s\/\"><code>hetzner_k8s<\/code><\/a><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Architecture overview<\/h2>\n\n\n\n<p>I decided to put the entire cluster management behind WireGuard, as this reduces the risk of the cluster being hacked. Open to the world are only ports 80 (HTTP) and 443 (HTTPS), exposed via a load balancer in front of Traefik (with certificates managed via Let&#8217;s Encrypt).<\/p>\n\n\n\n<p>This also secures cluster-internal services that have no ingress: for example, you can only reach the monitoring stack (Grafana) or the ArgoCD UI via the WireGuard tunnel plus port forwarding. This might be a bit paranoid, but as I don&#8217;t want people running crypto miners on my bill (especially since the cluster auto-scales nodes), better safe than sorry.<\/p>\n\n\n\n<p>The control plane node is tainted with <code>NoSchedule<\/code>, so only infrastructure services (Cilium, Traefik, ArgoCD, CCM, CSI, metrics-server, monitoring) that carry explicit tolerations run there. Application workloads stay Pending until the cluster autoscaler provisions worker nodes. This keeps the control plane stable while still letting the pure infrastructure layer of the cluster run without any workers.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Building the Talos image with Packer<\/h2>\n\n\n\n<p>Hetzner doesn&#8217;t offer Talos Linux as a native image. 
You build a snapshot using Packer and Hetzner&#8217;s rescue mode, a temporary Linux environment that lets you write directly to the server&#8217;s disk.<\/p>\n\n\n\n<p>The Packer config lives in <code>hetzner_k8s\/packer\/hcloud.pkr.hcl<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>source \"hcloud\" \"talos\" {\n  token       = var.hcloud_token\n  location    = var.server_location\n  server_type = var.server_type\n  server_name = \"packer-talos-builder\"\n\n  rescue      = \"linux64\"        # Boot into rescue mode\n  image       = \"ubuntu-24.04\"   # Base image (irrelevant, rescue overwrites it)\n\n  snapshot_name = \"talos-${var.talos_version}\"\n  snapshot_labels = {\n    os      = \"talos\"\n    version = var.talos_version\n  }\n\n  ssh_username = \"root\"\n}<\/code><\/pre>\n\n\n\n<p>The schematic ID (<code>ce4c980550dd2ab1b17bbf2b08801c7eb59418eafe8f279833297925d67c7515<\/code>) is generated at <a href=\"https:\/\/factory.talos.dev\/?utm_source=simon-frey.com\">factory.talos.dev<\/a> and encodes the qemu-guest-agent extension. Packer creates a snapshot that Terraform references by label:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>data \"hcloud_image\" \"talos\" {\n  with_selector     = \"os=talos\"\n  most_recent       = true\n  with_architecture = \"x86\"\n}<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Networking<\/h2>\n\n\n\n<p>A single private network connects all nodes. Hetzner&#8217;s eu-central zone spans three datacenters (fsn1, nbg1, hel1), so workers can be distributed for availability.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Firewall rules<\/h3>\n\n\n\n<p>The firewall only allows HTTP\/HTTPS (for ingress), WireGuard (UDP 51820), and ICMP. There is no SSH as Talos doesn&#8217;t have an SSH server. 
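<\/p>\n\n\n\n<p>For reference, those rules map to an <code>hcloud_firewall<\/code> resource roughly like this (a sketch of the Terraform provider syntax; the resource name is an assumption, not necessarily the repo&#8217;s exact code):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>resource \"hcloud_firewall\" \"cluster\" {\n  name = \"cluster\"\n\n  # Public ingress for Traefik\n  rule {\n    direction  = \"in\"\n    protocol   = \"tcp\"\n    port       = \"80\"\n    source_ips = &#91;\"0.0.0.0\/0\", \"::\/0\"]\n  }\n  rule {\n    direction  = \"in\"\n    protocol   = \"tcp\"\n    port       = \"443\"\n    source_ips = &#91;\"0.0.0.0\/0\", \"::\/0\"]\n  }\n\n  # WireGuard tunnel endpoint\n  rule {\n    direction  = \"in\"\n    protocol   = \"udp\"\n    port       = \"51820\"\n    source_ips = &#91;\"0.0.0.0\/0\", \"::\/0\"]\n  }\n\n  # ICMP (ping); no port field for ICMP rules\n  rule {\n    direction  = \"in\"\n    protocol   = \"icmp\"\n    source_ips = &#91;\"0.0.0.0\/0\", \"::\/0\"]\n  }\n}<\/code><\/pre>\n\n\n\n<p>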
Hetzner firewalls don&#8217;t filter private network traffic, so all inter-node communication is unrestricted by design.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Load balancer<\/h3>\n\n\n\n<p>A Terraform-created lb11 sits on the private network at <code>10.0.1.254<\/code>. Its targets and service definitions are managed by the Hetzner CCM (not Terraform). The CCM watches for Kubernetes Services with matching annotations and auto-configures the LB.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Talos machine configuration<\/h2>\n\n\n\n<p>After first setting up the cluster with k3s on Ubuntu (Hetzner base image), I was looking for a different, more sophisticated solution and remembered Talos. Talos has the benefit of being immutable and more locked down, which helps get the cluster into a reproducible, stable, and secure state.<\/p>\n\n\n\n<p>Here are the key design decisions baked into the control plane config (<code>hetzner_k8s\/talos.tf<\/code>):<\/p>\n\n\n\n<p><strong>CNI and kube-proxy disabled<\/strong>, Cilium takes over:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cluster:\n  network:\n    cni:\n      name: none\n  proxy:\n    disabled: true<\/code><\/pre>\n\n\n\n<p><strong>Control plane scheduling disabled<\/strong>, the node gets a <code>NoSchedule<\/code> taint so only infrastructure pods with explicit tolerations run there:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cluster:\n  allowSchedulingOnControlPlanes: false<\/code><\/pre>\n\n\n\n<p><strong>WireGuard interface<\/strong> on the control plane, so you can reach the K8s API without exposing port 6443 publicly:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>machine:\n  network:\n    interfaces:\n      - interface: wg0\n        mtu: 1420\n        addresses: &#91;\"10.200.200.1\/24\"]\n        wireguard:\n          privateKey: ${wireguard_server_private_key}\n          listenPort: 51820\n          peers:\n            - publicKey: ${wireguard_client_public_key}\n              allowedIPs: &#91;\"10.200.200.2\/32\"]<\/code><\/pre>\n\n\n\n<p><strong>etcd only advertised on private network<\/strong>, prevents etcd from binding to public IPs:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cluster:\n  etcd:\n    advertisedSubnets: &#91;\"10.0.1.0\/24\"]<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Secure access with WireGuard<\/h2>\n\n\n\n<p>As 
already mentioned in the intro, this might be a bit paranoid, but I just want to lock down the server as much as possible to not have to deal with any malicious actors. I go for a defense-in-depth approach of locking down everything that I don&#8217;t need publicly, and the control plane is one important part here. Hence the Kubernetes API (port 6443) and Talos API (port 50000) are never exposed publicly. Instead, I set up a WireGuard tunnel using <strong>wireproxy<\/strong>, a userspace WireGuard client that requires no root privileges and no kernel module. (Why wireproxy? I want to be able to set this up with a single command without any &#8220;sudo&#8221; access on my local machine. wg-quick needs kernel\/sudo access. Wireproxy keeps it all in userspace. And as the traffic I send over the tunnel is negligibly small, I don&#8217;t see a relevant performance downside.)<\/p>\n\n\n\n<p>Terraform generates a <code>.wireproxy.conf<\/code> and starts it automatically:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;Interface]\nPrivateKey = &lt;client-private-key&gt;\nAddress = 10.200.200.2\/32\nMTU = 1420\n\n&#91;Peer]\nPublicKey = &lt;server-public-key&gt;\nEndpoint = &lt;control-plane-public-ip&gt;:51820\nAllowedIPs = 10.200.200.0\/24, 10.0.0.0\/16\nPersistentKeepalive = 25\n\n&#91;TCPClientTunnel]\nBindAddress = 127.0.0.1:50000\nTarget = 10.200.200.1:50000\n\n&#91;TCPClientTunnel]\nBindAddress = 127.0.0.1:6443\nTarget = 10.200.200.1:6443<\/code><\/pre>\n\n\n\n<p>After <code>terraform apply<\/code>, <code>kubectl<\/code> and <code>talosctl<\/code> talk to <code>127.0.0.1<\/code>. Wireproxy tunnels the traffic through WireGuard to the control plane&#8217;s private VPN address. 
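<\/p>\n\n\n\n<p>In practice that looks roughly like this (a sketch; the kubeconfig and talosconfig paths are assumptions, the flags are the standard ones):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Start the userspace tunnel (no sudo needed)\nwireproxy -c .wireproxy.conf &amp;\n\n# Talos API via the local TCP tunnel on port 50000\ntalosctl --talosconfig .\/talosconfig --endpoints 127.0.0.1 --nodes 127.0.0.1 version\n\n# Kubernetes API via the local TCP tunnel on port 6443\nkubectl --kubeconfig .\/kubeconfig get nodes<\/code><\/pre>\n\n\n\n<p>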
(For later connections to the cluster, you need to set up wireproxy again, as documented in the repo&#8217;s README.)<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">GitOps with ArgoCD<\/h2>\n\n\n\n<p>The cluster originally used a 686-line shell script (<code>post-deploy.sh<\/code>) that sequentially installed every component via Helm and kubectl. That approach was fragile: no retries on transient failures, no drift detection, no self-healing, and it required manual re-runs whenever something changed. I migrated to <strong>ArgoCD<\/strong> using the app-of-apps pattern, and it was one of the best decisions for this project.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What Terraform bootstraps<\/h3>\n\n\n\n<p>Terraform installs only the bare minimum that must exist before ArgoCD can function:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Cilium<\/strong> (Helm release): Without the CNI, nodes stay <code>NotReady<\/code> and nothing can schedule.<\/li>\n\n\n\n<li><strong>Wait for CoreDNS<\/strong>: A <code>null_resource<\/code> polls CoreDNS readiness so image pulls from external registries work.<\/li>\n\n\n\n<li><strong>ArgoCD<\/strong> (Helm release): Installed with tolerations for both the <code>control-plane<\/code> and <code>cloud-provider-uninitialized<\/code> taints, solving the chicken-and-egg problem where the CCM (which removes the cloud-provider taint) is itself deployed by ArgoCD.<\/li>\n\n\n\n<li><strong>Secrets and ConfigMaps<\/strong>: Terraform generates passwords (<code>random_password<\/code>), SSH keys, and cloud-init configs and creates Kubernetes secrets that ArgoCD-managed apps reference by name. 
This keeps secrets out of git while letting ArgoCD manage the workloads.<\/li>\n\n\n\n<li><strong>Root Application<\/strong>: A single ArgoCD Application that points to the <code>gitops\/root-app\/<\/code> Helm chart.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">App-of-apps and sync waves<\/h3>\n\n\n\n<p>The root Helm chart contains one ArgoCD Application template per component. Sync wave annotations control the installation order, just like the old shell script did, but with automatic retries, self-healing, and drift detection:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Wave<\/th><th>Component<\/th><th>Purpose<\/th><\/tr><\/thead><tbody><tr><td>-5<\/td><td>Hetzner CCM<\/td><td>Removes cloud-provider taint, enables LB management<\/td><\/tr><tr><td>-4<\/td><td>Hetzner CSI<\/td><td>Persistent volumes via Hetzner block storage<\/td><\/tr><tr><td>-3<\/td><td>metrics-server<\/td><td>CPU\/memory metrics for HPA<\/td><\/tr><tr><td>-2<\/td><td>Traefik<\/td><td>Ingress controller, auto-registered with Hetzner LB<\/td><\/tr><tr><td>-1<\/td><td>cert-manager<\/td><td>TLS certificate automation<\/td><\/tr><tr><td>0<\/td><td>cert-issuers<\/td><td>Let&#8217;s Encrypt staging + production ClusterIssuers<\/td><\/tr><tr><td>4<\/td><td>Cluster autoscaler<\/td><td>Scales workers 0 to 5 based on pending pods<\/td><\/tr><tr><td>5<\/td><td>Monitoring<\/td><td>VictoriaMetrics + Grafana (optional)<\/td><\/tr><tr><td>5<\/td><td>kwatch<\/td><td>Cluster event notifications (optional)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>All applications share a common sync policy defined in a Helm helper (<code>gitops\/root-app\/templates\/_helpers.tpl<\/code>):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>{{- define \"root-app.syncPolicy\" -}}\nsyncPolicy:\n  automated:\n    prune: true\n    selfHeal: true\n  syncOptions:\n    - ServerSideApply=true\n  retry:\n    limit: 5\n    backoff:\n      duration: 5s\n      factor: 2\n      
maxDuration: 3m\n{{- end -}}<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Server-side apply<\/h3>\n\n\n\n<p>The <code>ServerSideApply=true<\/code> setting deserves an explanation. By default, ArgoCD uses client-side apply, which stores the entire previous manifest in a <code>kubectl.kubernetes.io\/last-applied-configuration<\/code> annotation on every resource. That annotation has a hard limit of 262144 bytes. Several CRDs in this cluster blow past that limit, particularly the six prometheus-operator CRDs shipped by the VictoriaMetrics Helm chart. When client-side apply hits the limit, the sync fails permanently and no amount of retries will fix it.<\/p>\n\n\n\n<p>Server-side apply sidesteps this entirely. Instead of stuffing the previous state into an annotation, Kubernetes tracks field ownership via <code>managedFields<\/code> in the resource metadata. There is no size limit. The API server also handles the diff and merge logic itself, which gives more accurate dry runs (including admission webhook validation) and proper conflict detection when multiple controllers manage the same resource. For the monitoring app specifically, <code>ServerSideDiff=true<\/code> is also enabled so ArgoCD delegates the diff calculation to the API server too, which handles unknown status fields in CRDs more gracefully than client-side structured merge diff.<\/p>\n\n\n\n<p>I initially only enabled ServerSideApply for the apps that needed it (KubeVirt, CDI, monitoring), but after hitting the annotation limit on yet another CRD, I moved it into the shared helper so every app gets it by default. There is no downside for smaller resources, and it prevents future surprises.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What post-deploy.sh does now<\/h3>\n\n\n\n<p>The script went from 686 lines to about 75. All it does is fetch the kubeconfig from Terraform output, rewrite the API server address for the WireGuard tunnel, and print the cluster status with ArgoCD application sync states. 
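<\/p>\n\n\n\n<p>For illustration, one of the Applications generated by the root chart looks roughly like this (a sketch; the repo URL and paths are placeholders, not the repo&#8217;s exact values):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>apiVersion: argoproj.io\/v1alpha1\nkind: Application\nmetadata:\n  name: traefik\n  namespace: argocd\n  annotations:\n    # Installed in wave -2, i.e. after CCM\/CSI\/metrics-server\n    argocd.argoproj.io\/sync-wave: \"-2\"\nspec:\n  project: default\n  source:\n    repoURL: https:\/\/github.com\/example\/hetzner_k8s\n    path: gitops\/apps\/traefik\n    targetRevision: main\n  destination:\n    server: https:\/\/kubernetes.default.svc\n    namespace: traefik\n  syncPolicy:\n    automated:\n      prune: true\n      selfHeal: true\n    syncOptions:\n      - ServerSideApply=true<\/code><\/pre>\n\n\n\n<p>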
No more Helm installs, no more kubectl apply, no more manual steps.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Ingress with Traefik and Hetzner LB<\/h2>\n\n\n\n<p>Traefik runs as a DaemonSet (not a Deployment) so it&#8217;s present on every node, including the control plane. This is important because the control plane might be the only node when no workers are running. The Traefik values live in the ArgoCD Application template at <code>gitops\/root-app\/templates\/traefik.yaml<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>deployment:\n  kind: DaemonSet\n\ntolerations:\n  - key: node-role.kubernetes.io\/control-plane\n    operator: Exists\n    effect: NoSchedule<\/code><\/pre>\n\n\n\n<p>The Hetzner LB is configured via annotations on the Traefik Service. The CCM watches these and auto-manages the load balancer:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>service:\n  annotations:\n    load-balancer.hetzner.cloud\/name: \"hetzner-k8s-lb\"\n    load-balancer.hetzner.cloud\/network-zone: eu-central\n    load-balancer.hetzner.cloud\/use-private-ip: \"true\"\n    load-balancer.hetzner.cloud\/uses-proxyprotocol: \"true\"<\/code><\/pre>\n\n\n\n<p>Proxy protocol is essential. Without it, your application logs show the LB&#8217;s IP instead of the real client. 
Both Traefik and the LB must agree on proxy protocol being enabled:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>ports:\n  web:\n    port: 80\n    proxyProtocol:\n      trustedIPs: &#91;\"10.0.0.0\/8\", \"127.0.0.1\/32\"]\n    forwardedHeaders:\n      trustedIPs: &#91;\"10.0.0.0\/8\", \"127.0.0.1\/32\"]\n  websecure:\n    port: 443\n    proxyProtocol:\n      trustedIPs: &#91;\"10.0.0.0\/8\", \"127.0.0.1\/32\"]<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Automatic TLS with cert-manager<\/h2>\n\n\n\n<p>cert-manager handles Let&#8217;s Encrypt certificates automatically. Two ClusterIssuers are configured in <code>gitops\/apps\/cert-issuers\/<\/code>: staging (for testing without rate limits) and production.<\/p>\n\n\n\n<p>To get a certificate for your service, add a single annotation and a <code>tls<\/code> block to your Ingress:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>apiVersion: networking.k8s.io\/v1\nkind: Ingress\nmetadata:\n  name: hello\n  annotations:\n    cert-manager.io\/cluster-issuer: letsencrypt-prod\nspec:\n  ingressClassName: traefik\n  tls:\n    - hosts: &#91;\"hello.k8s.example.com\"]\n      secretName: hello-tls\n  rules:\n    - host: hello.k8s.example.com\n      http:\n        paths:\n          - path: \/\n            pathType: Prefix\n            backend:\n              service:\n                name: hello\n                port:\n                  number: 80<\/code><\/pre>\n\n\n\n<p>cert-manager sees the annotation, creates a Certificate resource, solves the HTTP-01 challenge through Traefik, and stores the signed certificate in the <code>hello-tls<\/code> Secret. Renewal is automatic.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Cluster autoscaler<\/h2>\n\n\n\n<p>The autoscaler uses the Hetzner cloud provider to spin up and tear down cx33 workers based on pending pod demand. 
The configuration in <code>gitops\/apps\/autoscaler\/cluster-autoscaler.yaml<\/code> defines two node groups, one per datacenter location:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>command:\n  - .\/cluster-autoscaler\n  - --cloud-provider=hetzner\n  - --nodes=0:5:cx33:nbg1:workers\n  - --nodes=0:5:cx33:hel1:workers\n  - --scale-down-enabled=true\n  - --scale-down-delay-after-add=5m\n  - --scale-down-unneeded-time=5m\n  - --scale-down-utilization-threshold=0.5\n  - --balance-similar-node-groups=true\n  - --expander=least-waste<\/code><\/pre>\n\n\n\n<p>Each location can scale from 0 to 5 workers independently. <code>balance-similar-node-groups<\/code> distributes workers evenly across nbg1 and hel1 for availability. <code>least-waste<\/code> picks the group that would use resources most efficiently.<\/p>\n\n\n\n<p>Terraform creates zero static workers (<code>initial_worker_count = 0<\/code>). The autoscaler fully manages the worker lifecycle, spinning nodes up when pods are pending and tearing them down after 5 minutes of low utilization. This was a deliberate change from the earlier setup where Terraform created 2 permanent workers that the autoscaler couldn&#8217;t scale down without causing Terraform state drift.<\/p>\n\n\n\n<p>The Talos worker machine config is passed to the autoscaler as a base64-encoded cloud-init secret. 
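<\/p>\n\n\n\n<p>For comparison, outside Terraform you could generate an equivalent worker machine config with <code>talosctl<\/code> (a sketch; the cluster name and endpoint are placeholders):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>talosctl gen config homelab https:\/\/10.0.1.1:6443 --output-types worker -o worker.yaml<\/code><\/pre>\n\n\n\n<p>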
Terraform creates this secret directly in <code>argocd.tf<\/code>, so there are no manual kubectl commands involved:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>resource \"kubernetes_secret\" \"autoscaler\" {\n  metadata {\n    name      = \"hcloud-autoscaler\"\n    namespace = \"cluster-autoscaler\"\n  }\n  data = {\n    token      = var.hcloud_token\n    cloud-init = base64encode(data.talos_machine_configuration.worker.machine_configuration)\n    network    = hcloud_network.cluster.name\n    firewall   = hcloud_firewall.cluster.name\n    image      = data.hcloud_image.talos.id\n  }\n}<\/code><\/pre>\n\n\n\n<p>The autoscaler deployment itself tolerates the control plane taint and prefers scheduling on the control plane, so it runs without any workers:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>tolerations:\n  - key: node-role.kubernetes.io\/control-plane\n    operator: Exists\n    effect: NoSchedule\naffinity:\n  nodeAffinity:\n    preferredDuringSchedulingIgnoredDuringExecution:\n      - weight: 100\n        preference:\n          matchExpressions:\n            - key: node-role.kubernetes.io\/control-plane\n              operator: Exists<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Monitoring with VictoriaMetrics<\/h2>\n\n\n\n<p>The cluster originally ran Prometheus via kube-prometheus-stack, but I switched to <strong>VictoriaMetrics<\/strong> for lower resource usage. On a small cx33 node with 8 GB RAM, every megabyte counts, and VictoriaMetrics delivers the same functionality with a noticeably smaller footprint.<\/p>\n\n\n\n<p>The monitoring stack is deployed as an optional ArgoCD Application (gated behind <code>enable_monitoring<\/code>) using the <code>victoria-metrics-k8s-stack<\/code> Helm chart (v0.72.4). It includes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>VMSingle<\/strong>: The metrics storage engine, replacing Prometheus. 
Configured with 7 day retention and 10Gi persistent storage on Hetzner block volumes.<\/li>\n\n\n\n<li><strong>VMAgent<\/strong>: Scrapes metrics from all targets and forwards them to VMSingle. The VictoriaMetrics operator auto-converts any existing Prometheus <code>ServiceMonitor<\/code> and <code>PodMonitor<\/code> CRDs into its native <code>VMServiceScrape<\/code> and <code>VMPodMonitor<\/code> resources, so other apps that ship ServiceMonitors (like cert-manager or Traefik) work without changes.<\/li>\n\n\n\n<li><strong>VMAlert<\/strong>: Evaluates alerting and recording rules. The ~100 default PrometheusRules from the Helm chart (KubeNodeNotReady, KubePodCrashLooping, etc.) are auto-converted to VMRules.<\/li>\n\n\n\n<li><strong>VMAlertmanager<\/strong>: Routes alerts to <strong>Pushover<\/strong> for mobile notifications. Credentials are stored in a Terraform-managed Kubernetes secret and mounted as files, so they never appear in the Alertmanager config YAML.<\/li>\n\n\n\n<li><strong>Grafana<\/strong>: Auto-configured to use VMSingle as its datasource. Comes with default dashboards for node, pod, and cluster metrics. Admin credentials are generated by Terraform.<\/li>\n<\/ul>\n\n\n\n<p>Several Talos-incompatible scrape targets (etcd, kube-scheduler, kube-controller-manager, kube-proxy) are disabled, as Talos does not expose those metrics endpoints in the standard way.<\/p>\n\n\n\n<p>The trade-off of switching was losing historical metric data, but with a 7 day retention window, full coverage is back within a week. The migration itself was painless because ArgoCD pruned all old Prometheus resources automatically.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Alerting with Pushover<\/h3>\n\n\n\n<p>I wanted alerts on my phone without running a separate notification service. The stack has two notification paths:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>VMAlertmanager with Pushover receiver<\/strong>: Handles all ~100 built-in PrometheusRules. 
The config uses <code>user_key_file<\/code> and <code>token_file<\/code> (mounted from a Kubernetes secret) so credentials are never in git or in the Alertmanager config.<\/li>\n\n\n\n<li><strong>kwatch<\/strong>: A lightweight Kubernetes event watcher that monitors all namespaces, PVCs, and nodes. It sends notifications to Pushover via their webhook integration. The webhook URL (which contains a secret token) lives in a Terraform-managed ConfigMap.<\/li>\n<\/ol>\n\n\n\n<p>Both notification paths go to Pushover, but they cover different failure modes. kwatch catches events that don&#8217;t trigger Prometheus alerts (like a pod failing to pull an image), while VMAlertmanager covers metric-based conditions (like node memory pressure or persistent high CPU).<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Security hardening<\/h2>\n\n\n\n<p>I try to lock down this cluster as much as possible:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Immutable OS<\/h3>\n\n\n\n<p>Talos Linux has no SSH server, no shell, no package manager. The root filesystem is read-only. You manage nodes exclusively through the Talos API (which is mTLS-authenticated and only reachable via WireGuard). If someone compromises a container, there is no host to escape to.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">No public management ports<\/h3>\n\n\n\n<p>The Kubernetes API (6443) and Talos API (50000) are never exposed publicly. The firewall only opens HTTP\/HTTPS (for ingress), WireGuard (51820), and ICMP. All management traffic flows through the WireGuard tunnel to localhost.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Control plane isolation<\/h3>\n\n\n\n<p>The control plane is tainted with <code>node-role.kubernetes.io\/control-plane:NoSchedule<\/code>. Only infrastructure services (Cilium, ArgoCD, Traefik, CCM, CSI, cert-manager, metrics-server, monitoring) carry explicit tolerations. 
Application workloads must schedule on worker nodes, which the autoscaler provisions on demand. This prevents application pods from consuming the control plane&#8217;s RAM and causing instability.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pod Security Standards<\/h3>\n\n\n\n<p>Namespaces enforce the <code>restricted<\/code> Pod Security Standard, which blocks privilege escalation, requires non-root users, and drops all capabilities:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>apiVersion: v1\nkind: Namespace\nmetadata:\n  name: demo\n  labels:\n    pod-security.kubernetes.io\/enforce: restricted\n    pod-security.kubernetes.io\/warn: restricted<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Hardened security contexts<\/h3>\n\n\n\n<p>Every workload runs with the minimum possible privileges:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>securityContext:\n  runAsNonRoot: true\n  runAsUser: 1000\n  allowPrivilegeEscalation: false\n  readOnlyRootFilesystem: true\n  seccompProfile:\n    type: RuntimeDefault\n  capabilities:\n    drop: &#91;\"ALL\"]<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Resource governance<\/h3>\n\n\n\n<p>ResourceQuotas prevent any single namespace from consuming the cluster:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>spec:\n  hard:\n    requests.cpu: \"2\"\n    requests.memory: 2Gi\n    limits.cpu: \"4\"\n    limits.memory: 4Gi\n    pods: \"10\"<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">RBAC<\/h3>\n\n\n\n<p>The cluster autoscaler ServiceAccount has minimal permissions, only what&#8217;s needed for node scaling, pod eviction, and leader election. No cluster-admin roles are assigned to workloads.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Secrets management<\/h3>\n\n\n\n<p>All passwords and credentials are generated by Terraform (<code>random_password<\/code>, <code>tls_private_key<\/code>) and stored in Terraform state. Terraform creates Kubernetes secrets that ArgoCD-managed apps reference by name. 
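<\/p>\n\n\n\n<p>The pattern is roughly this (a sketch; resource names, secret name, and namespace are assumptions):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>resource \"random_password\" \"grafana\" {\n  length  = 32\n  special = false\n}\n\n# The ArgoCD-managed Grafana values reference this secret by name\nresource \"kubernetes_secret\" \"grafana_admin\" {\n  metadata {\n    name      = \"grafana-admin\"\n    namespace = \"monitoring\"\n  }\n  data = {\n    admin-user     = \"admin\"\n    admin-password = random_password.grafana.result\n  }\n}<\/code><\/pre>\n\n\n\n<p>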
Nothing sensitive ever touches git. You retrieve passwords via <code>terraform output -raw grafana_password<\/code> and similar commands.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>This setup gives you a production-capable Kubernetes cluster on Hetzner for about \u20ac14\/month at rest, scaling to ~\u20ac59 under load. The key ingredients:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Talos Linux<\/strong> eliminates an entire class of security concerns by removing SSH, shells, and package managers.<\/li>\n\n\n\n<li><strong>Cilium<\/strong> replaces kube-proxy with eBPF for better performance and observability.<\/li>\n\n\n\n<li><strong>WireGuard via wireproxy<\/strong> keeps management APIs private without requiring root or kernel modules.<\/li>\n\n\n\n<li><strong>ArgoCD<\/strong> manages all workloads declaratively with automatic drift detection, self-healing, and sync wave ordering.<\/li>\n\n\n\n<li><strong>VictoriaMetrics<\/strong> provides full monitoring and alerting with lower resource usage than Prometheus.<\/li>\n\n\n\n<li><strong>Cluster autoscaler<\/strong> scales workers from 0 to 5 based on actual demand, keeping costs proportional to usage.<\/li>\n\n\n\n<li><strong>Terraform + Packer<\/strong> make the whole thing reproducible. Tear it down and rebuild it in minutes.<\/li>\n<\/ul>\n\n\n\n<p>The full code, including Terraform configs, Packer template, GitOps manifests, and the ArgoCD app-of-apps setup, is in the <a href=\"https:\/\/simon-frey.com\/blog\/kubernetes-on-hetzner-cloud\/hetzner_k8s\/\"><code>hetzner_k8s\/<\/code><\/a> directory.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Cost breakdown<\/h2>\n\n\n\n<p>The base cluster (one control plane node and a load balancer) costs about \u20ac14\/month. 
Workers scale from zero on demand.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Resource<\/th><th>Spec<\/th><th>Monthly cost<\/th><\/tr><\/thead><tbody><tr><td>Control plane<\/td><td>cx33 (4 vCPU, 8 GB RAM)<\/td><td>~\u20ac9<\/td><\/tr><tr><td>Load balancer<\/td><td>lb11<\/td><td>~\u20ac5<\/td><\/tr><tr><td><strong>Base total<\/strong><\/td><td><\/td><td><strong>~\u20ac14<\/strong><\/td><\/tr><tr><td>Workers (0\u20135)<\/td><td>cx33, on demand<\/td><td>\u20ac0\u201345<\/td><\/tr><tr><td><strong>Maximum<\/strong><\/td><td>5 workers<\/td><td><strong>~\u20ac59<\/strong><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>I originally ran cx23 nodes (4 GB RAM), but the Kubernetes infrastructure overhead alone (Cilium, ArgoCD, Traefik, CCM, CSI, cert-manager, metrics-server, VictoriaMetrics) consumed nearly all of it, leaving nothing for actual workloads. Upgrading to cx33 (8 GB RAM) was the pragmatic fix. It is a bit mind-blowing that serving a single website with a WordPress blog now requires two servers (control plane + one worker), but that is the Kubernetes tax you pay for all the automation, self-healing, and GitOps convenience. Whether that trade-off is worth it depends on your priorities. For me, the learning experience and the &#8220;it just works&#8221; factor of ArgoCD reconciling everything automatically make it worth a few extra euros.<\/p>\n\n\n\n<p>The autoscaler fully manages the worker lifecycle. When application pods are deployed and cannot schedule on the tainted control plane, the autoscaler provisions cx33 workers within a few minutes. When utilization drops below 50% for 5 minutes, it tears them back down. No static workers, no Terraform state drift, just pay for what you use.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I wanted a cheap, production-grade homelab Kubernetes cluster. 
Most cloud providers charge \u20ac70+\/month just for the control plane and my setup on Hetzner Cloud runs for about \u20ac14\/month using Talos Linux, Cilium, and WireGuard. Everything&hellip;<\/p>\n<p><a href=\"https:\/\/simon-frey.com\/blog\/kubernetes-on-hetzner-cloud\/\" class=\"more-link\">Read more<span class=\"screen-reader-text\"> of Production-ready Kubernetes homelab on Hetzner Cloud for ~\u20ac10\/month<\/span><span aria-hidden=\"true\"> &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[376,375,374,231],"tags":[],"class_list":["post-1299","post","type-post","status-publish","format-standard","hentry","category-hetzner","category-homelab","category-kubernetes-k8s","category-sre"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Production-ready Kubernetes homelab on Hetzner Cloud for ~\u20ac10\/month - Blog by Simon Frey<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/simon-frey.com\/blog\/kubernetes-on-hetzner-cloud\/\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Simon Frey\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12 minutes\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Production-ready Kubernetes homelab on Hetzner Cloud for ~\u20ac10\/month - Blog by Simon Frey","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/simon-frey.com\/blog\/kubernetes-on-hetzner-cloud\/","twitter_misc":{"Written by":"Simon Frey","Est. reading time":"12 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/simon-frey.com\/blog\/kubernetes-on-hetzner-cloud\/#article","isPartOf":{"@id":"https:\/\/simon-frey.com\/blog\/kubernetes-on-hetzner-cloud\/"},"author":{"name":"Simon Frey","@id":"https:\/\/simon-frey.com\/blog\/#\/schema\/person\/34753982b648415636ee7a079f3e19a3"},"headline":"Production-ready Kubernetes homelab on Hetzner Cloud for ~\u20ac10\/month","datePublished":"2026-02-13T08:30:18+00:00","dateModified":"2026-03-10T13:24:20+00:00","mainEntityOfPage":{"@id":"https:\/\/simon-frey.com\/blog\/kubernetes-on-hetzner-cloud\/"},"wordCount":2677,"publisher":{"@id":"https:\/\/simon-frey.com\/blog\/#\/schema\/person\/34753982b648415636ee7a079f3e19a3"},"articleSection":["Hetzner","Homelab","Kubernetes (k8s)","SRE"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/simon-frey.com\/blog\/kubernetes-on-hetzner-cloud\/","url":"https:\/\/simon-frey.com\/blog\/kubernetes-on-hetzner-cloud\/","name":"Production-ready Kubernetes homelab on Hetzner Cloud for ~\u20ac10\/month - Blog by Simon 
Frey","isPartOf":{"@id":"https:\/\/simon-frey.com\/blog\/#website"},"datePublished":"2026-02-13T08:30:18+00:00","dateModified":"2026-03-10T13:24:20+00:00","breadcrumb":{"@id":"https:\/\/simon-frey.com\/blog\/kubernetes-on-hetzner-cloud\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/simon-frey.com\/blog\/kubernetes-on-hetzner-cloud\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/simon-frey.com\/blog\/kubernetes-on-hetzner-cloud\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/simon-frey.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Production-ready Kubernetes homelab on Hetzner Cloud for ~\u20ac10\/month"}]},{"@type":"WebSite","@id":"https:\/\/simon-frey.com\/blog\/#website","url":"https:\/\/simon-frey.com\/blog\/","name":"Blog by Simon Frey","description":"","publisher":{"@id":"https:\/\/simon-frey.com\/blog\/#\/schema\/person\/34753982b648415636ee7a079f3e19a3"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/simon-frey.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/simon-frey.com\/blog\/#\/schema\/person\/34753982b648415636ee7a079f3e19a3","name":"Simon 
Frey","logo":{"@id":"https:\/\/simon-frey.com\/blog\/#\/schema\/person\/image\/"},"sameAs":["https:\/\/simon-frey.com","https:\/\/www.linkedin.com\/in\/simonfrey\/","https:\/\/x.com\/eu_frey"]}]}},"_links":{"self":[{"href":"https:\/\/simon-frey.com\/blog\/wp-json\/wp\/v2\/posts\/1299","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/simon-frey.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/simon-frey.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/simon-frey.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/simon-frey.com\/blog\/wp-json\/wp\/v2\/comments?post=1299"}],"version-history":[{"count":6,"href":"https:\/\/simon-frey.com\/blog\/wp-json\/wp\/v2\/posts\/1299\/revisions"}],"predecessor-version":[{"id":1319,"href":"https:\/\/simon-frey.com\/blog\/wp-json\/wp\/v2\/posts\/1299\/revisions\/1319"}],"wp:attachment":[{"href":"https:\/\/simon-frey.com\/blog\/wp-json\/wp\/v2\/media?parent=1299"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/simon-frey.com\/blog\/wp-json\/wp\/v2\/categories?post=1299"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/simon-frey.com\/blog\/wp-json\/wp\/v2\/tags?post=1299"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}