There is a lot of noise right now about letting AI “fix” your infrastructure (be it via AWS CLI commands or, in the case of this article, Kubernetes). You paste an error, the AI suggests a kubectl apply, and you hope it knows what it’s doing.
Do NOT work that way.
When production is acting up, you need to maintain a complete mental model of the system. If you let the AI be the driving force, you lose the overview.
Instead, I use Claude for what I (and others) call “VibeDebugging”: getting a second opinion on the state of the cluster while I do the actual surgery.
When I hit a problem or an alert fires, I kick off two parallel streams:
- The Vibe Check (AI): I trigger Claude, which has MCP access to the Kubernetes cluster, with a broad directive, e.g. “Analyze the logs and events for the payment-service in the prod namespace. Look for correlation with the database.”
- The Deep Dive (me): While Claude is processing, I start my own investigation: checking pods, tailing logs, looking at Grafana… the same flow I always used in the pre-AI days.
This effectively forces me to stay in the loop. I don’t just blindly follow the AI; I verify its findings against what I am seeing myself. It’s not an autopilot; it’s a force multiplier that spots the weird log lines I might scroll past.
I would never feel comfortable doing this if the AI had write access or could read secrets. My rule for AI in Ops: Read-Only, No Secrets.
I don’t want an LLM to hallucinate a command that prints my DB passwords into a chat history, or have it start changing things in the cluster and make the incident even worse by introducing a non-deterministic agent.
Setting up the Kubernetes MCP Server in read-only mode
Step 1: Dedicated, read-only ServiceAccount
We need a ServiceAccount that has permissions to see everything relevant (Deployments, Pods, Logs) but is strictly blind to sensitive data and can’t change anything.
Kubernetes Roles work with an explicit-allow strategy, so we have to actively grant access to resources. There is a wildcard * option, but I advise against it, especially for the core API group, because we don’t want the agent to be able to read Secrets.
What exactly you give the agent/MCP server access to is your decision. Here is a basic setup that grants access to most “standard” resources. (If you have operators and CRDs in your cluster, you might want to allow access to those as well.)
Note: We put everything namespaced into a new namespace, debug-access-ns, so cleaning up later is as easy as deleting that namespace plus the two cluster-scoped RBAC objects, leaving no leftovers behind (cleanup commands below the manifest).
apiVersion: v1
kind: Namespace
metadata:
  name: debug-access-ns
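If you later want to purge the setup, something like the following should do it. Note that the ClusterRole and ClusterRoleBinding we define below are cluster-scoped, so they are not removed together with the namespace:
# Removes the ServiceAccount and its token Secret
kubectl delete namespace debug-access-ns
# The cluster-scoped RBAC objects have to be deleted separately
kubectl delete clusterrole ai-read-everything-except-secrets
kubectl delete clusterrolebinding ai-debugger-binding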
Hint: If you want to figure out all the resources in your cluster, use the command kubectl api-resources. The NAME column maps to “resources” in the Role, and everything before the / in the APIVERSION column maps to “apiGroups”.
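For example, looking up Deployments (output trimmed to the matching line):
kubectl api-resources | grep deployments
# deployments   deploy   apps/v1   true   Deployment
# NAME => resources: "deployments"; APIVERSION "apps/v1" => apiGroups: "apps"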
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ai-read-everything-except-secrets
rules:
  # 1. Core API Group (The critical one)
  # explicitly listing resources to exclude 'secrets'
  - apiGroups: [""]
    resources:
      - bindings
      - componentstatuses
      - configmaps
      - endpoints
      - events
      - limitranges
      - namespaces
      - nodes
      - persistentvolumeclaims
      - persistentvolumes
      - pods
      - pods/log
      - pods/status
      - podtemplates
      - replicationcontrollers
      - resourcequotas
      - serviceaccounts
      - services
    verbs: ["get", "list", "watch"]
  # 2. All other common API Groups
  # It is safe to wildcard '*' resources here because Secrets live in the Core group above.
  - apiGroups:
      - "apps"
      - "autoscaling"
      - "batch"  # covers Jobs and CronJobs
      - "extensions"
      - "policy"
      - "networking.k8s.io"
      - "rbac.authorization.k8s.io"
      - "storage.k8s.io"
      - "apiextensions.k8s.io"
      - "admissionregistration.k8s.io"
      - "metrics.k8s.io"
      - "discovery.k8s.io"
    resources: ["*"]
    verbs: ["get", "list", "watch"]
  # 3. Non-Resource URLs (Optional but recommended for full visibility)
  # Allows checking healthz, version, and metrics endpoints
  - nonResourceURLs: ["*"]
    verbs: ["get"]
Creating the ServiceAccount and binding it to the ClusterRole:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ai-debugger
  namespace: debug-access-ns
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ai-debugger-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ai-read-everything-except-secrets
subjects:
  - kind: ServiceAccount
    name: ai-debugger
    namespace: debug-access-ns
In modern Kubernetes versions (1.24+), ServiceAccounts no longer get long-lived tokens by default. You have to manually create a Secret to generate one.
apiVersion: v1
kind: Secret
metadata:
  name: ai-debugger-token
  namespace: debug-access-ns
  annotations:
    kubernetes.io/service-account.name: "ai-debugger"
type: kubernetes.io/service-account-token
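If you would rather not mint a long-lived token at all, a short-lived one works just as well: kubectl create token (available since 1.24) issues a time-bounded token for the same ServiceAccount. In that case you would paste its output into the kubeconfig below instead of reading the Secret. The duration here is just an example:
# Alternative: request a token that expires after the debugging session
kubectl create token ai-debugger -n debug-access-ns --duration=8h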
Apply all of these resources, then generate a dedicated kubeconfig for the ServiceAccount.
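Assuming you saved the manifests above into separate files (the file names below are just placeholders), applying and sanity-checking the permissions looks roughly like this; kubectl auth can-i with --as impersonates the ServiceAccount:
# Apply the manifests (file names are illustrative)
kubectl apply -f namespace.yaml -f clusterrole.yaml -f serviceaccount.yaml -f token-secret.yaml
# Sanity check: reads are allowed, Secrets and writes are not
kubectl auth can-i get pods --as=system:serviceaccount:debug-access-ns:ai-debugger           # yes
kubectl auth can-i get secrets --as=system:serviceaccount:debug-access-ns:ai-debugger        # no
kubectl auth can-i delete deployments --as=system:serviceaccount:debug-access-ns:ai-debugger # no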
Step 2: Creating the Kubeconfig
Now we need to extract the data (Token, CA Cert, and Server URL) and mash it into a valid kubeconfig file. You can use this bash script to generate a readonly-config.yaml automatically:
# 1. Get the Token
TOKEN=$(kubectl get secret ai-debugger-token -n debug-access-ns -o jsonpath='{.data.token}' | base64 --decode)
# 2. Get the Cluster CA Certificate
CA=$(kubectl config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}')
# 3. Get the API Server URL
SERVER=$(kubectl config view -o jsonpath='{.clusters[0].cluster.server}')
# 4. Write the kubeconfig file
cat <<EOF > ~/.kube/readonly-config.yaml
apiVersion: v1
kind: Config
clusters:
- cluster:
    certificate-authority-data: $CA
    server: $SERVER
  name: secure-cluster
contexts:
- context:
    cluster: secure-cluster
    user: ai-debugger
  name: secure-context
current-context: secure-context
users:
- name: ai-debugger
  user:
    token: $TOKEN
EOF
echo "File '~/.kube/readonly-config.yaml' created successfully."
Step 3: The Locked-Down MCP Server
Now that we have the identity, we need to connect it. I use the Kubernetes MCP Server to let Claude talk to the cluster.
But I don’t just run it. I lock it down with flags to ensure it can’t switch contexts or try to write data.
Here is the command I use to add it to my Claude config:
claude mcp add kubernetes --scope user -- npx -y kubernetes-mcp-server@v0.0.57 \
--read-only \
--kubeconfig ~/.kube/readonly-config.yaml \
--cluster-provider kubeconfig \
--disable-multi-cluster
There are a few important things happening here:
- Pin the Version (@v0.0.57): Never, ever use @latest for infrastructure tools. I don’t want an auto-update to change behavior or introduce a bug during a debugging session. Check the releases page and pin it.
- --kubeconfig: I point this explicitly to the restricted config we made above. Even if the MCP server code has a bug, the Kubernetes API will reject any write attempts.
- --read-only: A second layer of defense. This tells the application layer to disable tool-calling for creating or updating resources.
- --disable-multi-cluster: This keeps the AI focused. It ensures Claude works only on the specific cluster I pointed it to and can’t wander off into other contexts defined in my standard kubeconfig.
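Once added, I double-check the registration with the standard claude mcp subcommands and remove the entry again when cluster access is no longer needed (exact output varies by Claude Code version):
# Confirm the server is registered and see the command it will run
claude mcp list
claude mcp get kubernetes
# Remove it once the debugging session is over
claude mcp remove kubernetes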
This setup gives me the best of both worlds. I get the speed and correlation powers of AI, but I keep the situational awareness of a human engineer. I verify the infrastructure; Claude checks the vibes. And thanks to the setup, I know for a fact it can’t delete my database.