There is a lot of noise right now about letting AI “fix” your infrastructure (be it via AWS CLI commands or, in the case of this article, Kubernetes). You paste an error, the AI suggests a kubectl apply, and you hope it knows what it’s doing.
Do NOT work that way.
When production is acting up, you need to maintain a complete mental model of the system. If you let the AI be the driving force, you lose the overview.
Instead, I use Claude for what I (and others) call “VibeDebugging”: getting a second opinion on the state of the cluster while I do the actual surgery.
When I hit a problem or an alert fires, I kick off two parallel streams:
- The Vibe Check (AI): I trigger Claude, which has MCP access to the Kubernetes cluster, with a broad directive, e.g. “Analyze the logs and events for the payment-service in the prod namespace. Look for correlation with the database.”
- The Deep Dive (me): While Claude is processing, I start my own investigation: checking pods, tailing logs, looking at Grafana… the same flow I always used in the pre-AI days.
This effectively forces me to stay in the loop. I don’t just blindly follow the AI; I verify its findings against what I am seeing myself. It’s not an autopilot; it’s a force multiplier that spots the weird log lines I might scroll past.
I would never feel comfortable doing this if the AI had write access or could read secrets. My rule for AI in Ops: Read-Only, No Secrets.
I don’t want an LLM to hallucinate a command that prints my DB passwords into a chat history, or have it start changing things in the cluster and make the incident even worse by introducing a non-deterministic agent.
Setting up the Kubernetes MCP Server in read-only mode
Step 1: Dedicated, read-only ServiceAccount
We need a ServiceAccount that has permissions to see everything relevant (Deployments, Pods, Logs) but is strictly blind to sensitive data and can’t change anything.
Kubernetes Roles work with an explicit-allow strategy, so we have to actively grant access to resources. There is a wildcard * option, but I advise against it, especially for the core API group, because we don’t want the agent to be able to read Secrets.
What exactly you give the agent/MCP server access to is your decision. Here is a basic setup that grants access to most “standard” resources. (If you have operators and CRDs in your cluster, you might want to allow access to those as well.)
Note: We put everything namespaced into a new namespace, debug-access-ns, so cleaning up later is as easy as deleting that namespace plus the two cluster-scoped RBAC objects, leaving no leftovers behind (cleanup commands below the manifest).
apiVersion: v1
kind: Namespace
metadata:
  name: debug-access-ns
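If you later want to purge the setup, something like the following should do it. Note that the ClusterRole and ClusterRoleBinding we define below are cluster-scoped, so they are not removed together with the namespace:
# Removes the ServiceAccount and its token Secret
kubectl delete namespace debug-access-ns
# The cluster-scoped RBAC objects have to be deleted separately
kubectl delete clusterrole ai-read-everything-except-secrets
kubectl delete clusterrolebinding ai-debugger-binding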
Hint: If you want to figure out all the resources in your cluster, use the command kubectl api-resources. The NAME column maps to “resources” in the Role, and everything before the / in the APIVERSION column maps to “apiGroups”.
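For example, looking up Deployments (output trimmed to the matching line):
kubectl api-resources | grep deployments
# deployments   deploy   apps/v1   true   Deployment
# NAME => resources: "deployments"; APIVERSION "apps/v1" => apiGroups: "apps"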
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ai-read-everything-except-secrets
rules:
  # 1. Core API Group (The critical one)
  # explicitly listing resources to exclude 'secrets'
  - apiGroups: [""]
    resources:
      - bindings
      - componentstatuses
      - configmaps
      - endpoints
      - events
      - limitranges
      - namespaces
      - nodes
      - persistentvolumeclaims
      - persistentvolumes
      - pods
      - pods/log
      - pods/status
      - podtemplates
      - replicationcontrollers
      - resourcequotas
      - serviceaccounts
      - services
    verbs: ["get", "list", "watch"]
  # 2. All other common API Groups
  # It is safe to wildcard '*' resources here because Secrets live in the Core group above.
  - apiGroups:
      - "apps"
      - "autoscaling"
      - "batch"  # covers Jobs and CronJobs
      - "extensions"
      - "policy"
      - "networking.k8s.io"
      - "rbac.authorization.k8s.io"
      - "storage.k8s.io"
      - "apiextensions.k8s.io"
      - "admissionregistration.k8s.io"
      - "metrics.k8s.io"
      - "discovery.k8s.io"
    resources: ["*"]
    verbs: ["get", "list", "watch"]
  # 3. Non-Resource URLs (Optional but recommended for full visibility)
  # Allows checking healthz, version, and metrics endpoints
  - nonResourceURLs: ["*"]
    verbs: ["get"]
Creating the ServiceAccount and binding it to the ClusterRole:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ai-debugger
  namespace: debug-access-ns
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ai-debugger-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ai-read-everything-except-secrets
subjects:
  - kind: ServiceAccount
    name: ai-debugger
    namespace: debug-access-ns
In modern Kubernetes versions (1.24+), ServiceAccounts no longer get long-lived tokens by default. You have to manually create a Secret to generate one.
apiVersion: v1
kind: Secret
metadata:
  name: ai-debugger-token
  namespace: debug-access-ns
  annotations:
    kubernetes.io/service-account.name: "ai-debugger"
type: kubernetes.io/service-account-token
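If you would rather not mint a long-lived token at all, a short-lived one works just as well: kubectl create token (available since 1.24) issues a time-bounded token for the same ServiceAccount. In that case you would paste its output into the kubeconfig below instead of reading the Secret. The duration here is just an example:
# Alternative: request a token that expires after the debugging session
kubectl create token ai-debugger -n debug-access-ns --duration=8h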
Apply all of these resources, then generate a dedicated kubeconfig for the ServiceAccount.
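Assuming you saved the manifests above into separate files (the file names below are just placeholders), applying and sanity-checking the permissions looks roughly like this; kubectl auth can-i with --as impersonates the ServiceAccount:
# Apply the manifests (file names are illustrative)
kubectl apply -f namespace.yaml -f clusterrole.yaml -f serviceaccount.yaml -f token-secret.yaml
# Sanity check: reads are allowed, Secrets and writes are not
kubectl auth can-i get pods --as=system:serviceaccount:debug-access-ns:ai-debugger           # yes
kubectl auth can-i get secrets --as=system:serviceaccount:debug-access-ns:ai-debugger        # no
kubectl auth can-i delete deployments --as=system:serviceaccount:debug-access-ns:ai-debugger # no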
Step 2: Creating the Kubeconfig
Now we need to extract the data (Token, CA Cert, and Server URL) and mash it into a valid kubeconfig file. You can use this bash script to generate a readonly-config.yaml automatically:
# 1. Get the Token
TOKEN=$(kubectl get secret ai-debugger-token -n debug-access-ns -o jsonpath='{.data.token}' | base64 --decode)
# 2. Get the Cluster CA Certificate
CA=$(kubectl config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}')
# 3. Get the API Server URL
SERVER=$(kubectl config view -o jsonpath='{.clusters[0].cluster.server}')
# 4. Write the kubeconfig file
cat <<EOF > ~/.kube/readonly-config.yaml
apiVersion: v1
kind: Config
clusters:
- cluster:
    certificate-authority-data: $CA
    server: $SERVER
  name: secure-cluster
contexts:
- context:
    cluster: secure-cluster
    user: ai-debugger
  name: secure-context
current-context: secure-context
users:
- name: ai-debugger
  user:
    token: $TOKEN
EOF
echo "File '~/.kube/readonly-config.yaml' created successfully."
Step 3: The Locked-Down MCP Server
Now that we have the identity, we need to connect it. I use the Kubernetes MCP Server to let Claude talk to the cluster.
But I don’t just run it. I lock it down with flags to ensure it can’t switch contexts or try to write data.
Here is the command I use to add it to my Claude config:
claude mcp add kubernetes --scope user -- npx -y kubernetes-mcp-server@v0.0.57 \
--read-only \
--kubeconfig ~/.kube/readonly-config.yaml \
--cluster-provider kubeconfig \
--disable-multi-cluster
There are a few important things happening here:
- Pin the Version (@v0.0.57): Never, ever use @latest for infrastructure tools. I don’t want an auto-update to change behavior or introduce a bug during a debugging session. Check the releases page and pin it.
- --kubeconfig: I point this explicitly to the restricted config we made above. Even if the MCP server code has a bug, the Kubernetes API will reject any write attempts.
- --read-only: A second layer of defense. This tells the application layer to disable tool-calling for creating or updating resources.
- --disable-multi-cluster: This keeps the AI focused. It ensures Claude works only on the specific cluster I pointed it to and can’t wander off into other contexts defined in my standard kubeconfig.
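Once added, I double-check the registration with the standard claude mcp subcommands and remove the entry again when cluster access is no longer needed (exact output varies by Claude Code version):
# Confirm the server is registered and see the command it will run
claude mcp list
claude mcp get kubernetes
# Remove it once the debugging session is over
claude mcp remove kubernetes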
This setup gives me the best of both worlds. I get the speed and correlation powers of AI, but I keep the situational awareness of a human engineer. I verify the infrastructure; Claude checks the vibes. And thanks to the setup, I know for a fact it can’t delete my database.