K8sGPT for Kubernetes troubleshooting: How AI helps in different cases

K8sGPT is probably today’s most well-established AI tool for Kubernetes operators. Introduced in the Spring of 2023 and accepted to the CNCF as a Sandbox project at the end of the same year, this tool aims to help with a variety of Kubernetes-related issues.

In this article, I will explain what K8sGPT is, how to install it and connect it to AI, and which features it offers. I will also share some examples of the output you can expect from this tool and what diagnostics it can perform. While preparing this overview, I tested the different AI integrations available as well as a number of models (including a local one). All of my examples are backed up by commands and detailed logs. So, without further ado, let’s dive right in!

What K8sGPT does

In short, K8sGPT scans Kubernetes clusters, helps diagnose problems, and offers remediation suggestions. To do so, it combines SRE (Site Reliability Engineering) knowledge — e.g., how to extract all the essential information about the current Kubernetes cluster state and existing issues — with GenAI models, which process this data and provide helpful guidance on solving the issues.

Suppose there is a Pending Pod in the cluster. Running K8sGPT allows you to acquire a detailed breakdown of the problem, a list of possible causes, and commands for subsequent troubleshooting and fixing. It will also add context to the prompt and let you chat with it as if you were communicating with an expert. Sure, you can use off-the-shelf online chatbots like ChatGPT, but K8sGPT significantly streamlines the process.

Installing K8sGPT

There is a sandbox available for those who don’t want to install the tool. However, I will install K8sGPT on my computer to get a better feel for it. There are two main options for installation:

  • You can install it as a plain console tool. You invoke it via a command and get a response.
  • You can install it as a K8s cluster operator. In this case, K8sGPT will run in the background and feed the results to a dedicated CR (Custom Resource) of the Result type. This option comes in handy when you need to keep a history of scans and when you want to automate scans for problems that can’t be captured manually (i.e., in console mode). A sketch of querying those results follows this list.
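For instance, once the operator is running, scan results can be read like any other Kubernetes resource. Here is a minimal sketch; the namespace and the details field follow the operator’s defaults as I understand them and may differ between versions:

kubectl get results -n k8sgpt-operator-system
kubectl get result <result-name> -n k8sgpt-operator-system -o jsonpath='{.spec.details}'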

This article will focus on the first installation option. (You can also refer to this piece on how to run K8sGPT as a K8s operator.) Installation involves two steps:

1. Install K8sGPT on the master node of the Kubernetes cluster:

curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.4.22/k8sgpt_amd64.deb
dpkg -i k8sgpt_amd64.deb
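A quick sanity check confirms that the binary is installed and working:

k8sgpt version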

2. Register the AI integration (hereinafter referred to as the backend). K8sGPT supports several different backend options:

  • external: OpenAI, Amazon Bedrock and SageMaker, Azure OpenAI, Google Gemini and Vertex AI, Hugging Face, Cohere, IBM watsonx.ai;
  • local: LocalAI, Ollama, FakeAI.

The last one, FakeAI, comes in handy in situations where you need to test a new feature without actually invoking the AI backend. Its responses to queries start with the following string:

I am a noop response to the prompt …

Now, let’s get real and register the first AI backend on the list (OpenAI). Follow this link to generate an API key. When invoking the backend registration command, specify the backend name (openai) and the model name, e.g., gpt-3.5-turbo:

k8sgpt auth add -b openai -m gpt-3.5-turbo
Enter openai Key:

We assume that ~/.kube/config exists and the cluster API is available. If this is the case, the tool is ready for use.
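You can double-check which backends are registered and which one is the default at any time:

k8sgpt auth list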

Refer to the official documentation to learn more about installation and the startup process.

K8sGPT features and options: Quick overview

Before we get into how to use the tool, let’s take a look at its key features. Below are some of its main commands:

  •   analyze helps you find issues in the Kubernetes cluster;
  •   cache is used to handle the cache of analysis results;
  •   filters allows you to set up filters for analyzing Kubernetes resources;
  •   --explain (a flag of analyze) asks the AI backend for help.
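The filters command controls which resource analyzers are active. A quick sketch of managing them:

k8sgpt filters list
k8sgpt filters add Service
k8sgpt filters remove Service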

For example, you can use the following command to ask k8sgpt to describe what is happening in the cluster:

k8sgpt analyze -b openai --explain --no-cache --namespace=dev --filter Pod

Here:

  • analyze collects data on cluster issues.
  • The --explain flag is used to connect to the AI backend. Without it, the tool will not use the AI. Instead, it will launch the internal analyzer, which is essentially a collection of SRE practices for debugging a certain issue, where the status of the resource determines the list of further diagnostic actions to be performed.
  • The --no-cache option allows you to disregard the result of the previous analysis stored in the cache.
  • The --namespace=dev --filter Pod combination restricts the analysis to a particular namespace and to specific resources (Pods).

To illustrate this point, the query below did not use AI, but the tool still provided structured information on the issues discovered in the cluster. This is sufficient to make an educated guess as to the causes of the error:

$ k8sgpt analyze --namespace=dev
AI Provider: AI not used; --explain not set

0 dev/web-0(StatefulSet/web)
- Error: 0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
...

The -d flag adds a description from the official Kubernetes documentation (Kubernetes Doc) to each error:

$ k8sgpt analyze --namespace=dev --filter Service -d
AI Provider: AI not used; --explain not set

0 dev/web-0(web-0)
- Error: Service has no endpoints, expected label app.kubernetes.io/name=local-ai
 Kubernetes Doc: Route service traffic to pods with label keys and values matching this selector. If empty or not present, the service is assumed to have an external process managing its endpoints, which Kubernetes will not modify. Only applies to types ClusterIP, NodePort, and LoadBalancer. Ignored if type is ExternalName. More info: https://kubernetes.io/docs/concepts/services-networking/service/

That means you can make good use of k8sgpt even without connecting it to the AI backend! On the other hand, the --explain flag will forward the issues discovered in the cluster to the specified AI backend to find a solution. Later in this article, I will compare particular answers of various AI models.

Having the built-in (non-AI) analyzer is crucial since it reduces the number of false and made-up theories (hallucinations) that the AI analyzer would have suggested otherwise.

Reducing the number of requests

K8sGPT provides two choices for controlling the number of requests. You can:

  1. Filter them by resource type (--filter). This reduces the number of requests to the AI backend, which matters because the OpenAI API imposes rate limits; exceeding them results in Error 429: Too Many Requests.
  2. Limit the number of concurrent requests sent to the Kubernetes API server (via --max-concurrency). In doing so, you’ll be able to avoid excessive load if, for example, you need to analyze the resources of an entire cluster or a namespace.
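Both options can be combined in a single run. Note that for the analyze command, -m is the shorthand for --max-concurrency (not for the model name, as in auth add), which is how it appears in the examples throughout this article:

k8sgpt analyze -b openai --explain --filter Pod --max-concurrency 2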

Interactive mode

K8sGPT is interactive: you can refine your question, ask the tool about command syntax, and more. Enable this mode using the --interactive flag:

$ k8sgpt analyze -b openai --no-cache -m 1 --namespace=dev --filter Ingress --explain --interactive
...

0 dev/my-test(my-test)
- Error: Ingress uses the ingress class nginxnew which does not exist.
...
Solution:
1. Verify the existence of the specified ingress class (nginxnew).
...

Interactive mode enabled [type exit to close.]
Input text: Write me command to fix ingress class name

To fix the ingress class name, you can use the following command:

```
kubectl edit ingress dev/my-test
```

This command will open the Ingress resource in an editor where you can update the specified ingress class from "nginxnew" to the correct value. Make sure to save the changes after editing the file.

You can now communicate with the AI model as you would with a regular LLM chatbot.

Using external AI backends

Let’s see what K8sGPT is capable of with external AI backends enabled. To test it, I faked a few cluster resource issues to see what troubleshooting tips the tool could offer.

Case 1. Pod’s status is Pending

I used .spec.affinity.nodeAffinity to limit the selection of nodes to which a Pod could be scheduled. If the conditions are not met, the Pod gets stuck in Pending status. Below is a minimal sketch of the kind of affinity block I mean (the label key and value are illustrative, not my actual ones):
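affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype            # no node in the cluster carries this label/value
          operator: In
          values:
          - nonexistent-disk

Let’s see what solutions K8sGPT can come up with: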

$ k8sgpt analyze --explain --namespace=dev

100% |██████████████████████████████████████████████████| (2/2, 16 it/min)       
AI Provider: openai

0 dev/web-0(StatefulSet/web)
- Error: 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {dedicated: kafka}, that the pod didn't tolerate, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

Solution:
1. Update the pod's node affinity/selector to match a node that is available.
2. Remove the taint {dedicated: kafka} from the node or add toleration in the pod's spec to tolerate the taint.
3. Remove the taint {node-role.kubernetes.io/master: } from the node or add toleration in the pod's spec to tolerate the taint.

In its response, K8sGPT suggests checking whether the selectors (nodeSelector) and affinity are correct, which is a valid assumption. Recommendations to remove the taint can also be helpful, as these are also areas where mistakes are often made.

Case 2. Pod’s status is ErrImagePull

This error stems from the inability to pull the image. In this example, I replaced the name of the current image with one that doesn’t exist. One quick way to do so is shown below (assuming the StatefulSet and container names from the previous case):
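kubectl -n dev set image statefulset/web alpine=registry.k8s.io/alpine:3.88

Here’s what K8sGPT and OpenAI’s gpt-3.5-turbo base model responded with: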

$ k8sgpt analyze --explain -m 1 --namespace=dev
100% |██████████████████████████████████████████████████| (2/2, 13 it/min)       
AI Provider: openai

0 dev/web-0(StatefulSet/web)
- Error: Back-off pulling image "registry.k8s.io/alpine:3.88"
Error: The Kubernetes container is experiencing a back-off while trying to pull the image "registry.k8s.io/alpine:3.88".
Solution:
1. Check the image repository registry for any issues or downtime.
2. Verify the image name and tag are correct.
3. Ensure there is enough disk space on the Kubernetes node.
4. Restart the Kubernetes cluster or pod to retry the image pull.
5. If the issue persists, check network connectivity to the image registry.
6. Consider using a different image or version if the problem persists.

The tool first suggests checking if the registry is working at all. It then proposes checking the tag and name. If the above steps fail, K8sGPT recommends verifying whether there is enough space on the node to store the image, rebooting the cluster, checking the connectivity between the cluster and the image store, and, finally, it recommends trying a different image.

The model’s reasoning is correct, but the suggestion to reboot the cluster would cause everything running in it to become unavailable. So, you have to keep the possible implications in mind when following the provided recommendations.

The documentation states that the tool parses the output of several sources, including kubectl get events:

$ kubectl -n dev get events --sort-by='.metadata.creationTimestamp'
LAST SEEN   TYPE      REASON                   OBJECT            MESSAGE
15s         Normal    Pulling                  pod/web-0         Pulling image "registry.k8s.io/alpine:3.88"
15s         Warning   Failed                   pod/web-0         Error: ErrImagePull
15s         Warning   Failed                   pod/web-0         Failed to pull image "registry.k8s.io/alpine:3.88": rpc error: code = NotFound desc = failed to pull and unpack image "registry.k8s.io/alpine:3.88": failed to resolve reference "registry.k8s.io/alpine:3.88": registry.k8s.io/alpine:3.88: not found
4s          Normal    BackOff                  pod/web-0         Back-off pulling image "registry.k8s.io/alpine:3.88"
4s          Warning   Failed                   pod/web-0         Error: ImagePullBackOff

This confirms that K8sGPT draws from the cluster data, meaning it sees the problem and offers a solution for that particular issue.

Now, let’s consider the same problem to test the tool with another AI backend — Cohere (here you can learn more about this LLM). There are two ways to add a new model: you can use the k8sgpt auth add -b cohere -m command-nightly command or edit the ~/.config/k8sgpt/k8sgpt.yaml configuration file. Signing up with Cohere is easy (no phone number is required for confirmation), and its free tier is sufficient for the majority of tasks.

Here is how the Cohere model addressed the incorrect tag name problem:

...

0 dev/test-8e8c52aae2-ue3s2(Deployment/test)
- Error: Back-off pulling image "alpine:3.88"
Error: Kubernetes is having trouble pulling the image "alpine:3.88".
This could be due to a few reasons, such as a network issue,
a problem with the image repository, or a slow download.
Solution: Check the Kubernetes logs for more details on the error.
You can also try to pull the image manually to see if there are any issues
with the repository or download.
If the issue persists, try restarting the Kubernetes node or cluster.

Looks like the Cohere LLM has a different prompt. For example, there is no numbered recommendation list. The approach is also slightly different, with Cohere offering top-level checks for network, registry, or slow pull speed issues. It also recommends examining the K8s logs and trying to pull the image manually. Finally, it suggests restarting the cluster if none of the above helps. The potential dire consequences of this have been mentioned above.

The solution offered by OpenAI’s gpt-4o model is essentially the same as gpt-3.5-turbo’s answer:

...

0 dev/test-8e8c52aae2-ue3s2(Deployment/test)
- Error: Back-off pulling image "alpine:3.88"
Error: Kubernetes is unable to pull the Docker image "alpine:3.88".
Solution:
1. Verify the image name and tag are correct.
2. Check if the image exists in the repository.
3. Ensure your Kubernetes nodes have internet access.
4. Try pulling the image manually using `docker pull alpine:3.88`.
5. Update the image tag if necessary.

Comparing the results of two backends and three models, two of which insist on rebooting the cluster, I find the last one, gpt-4o, to be the most worthwhile choice. Besides, I like the structured response from OpenAI better. On the other hand, Cohere is easier to sign up for.

Case 3. Pod’s status is ErrImagePull + wrong password

Now it’s time to play with ErrImagePull some more. Let’s change the login and password required to access gitlab.registry, a local registry into which the Alpine image has been pushed (note that the tag has changed accordingly). This is the response I got from K8sGPT with OpenAI’s gpt-3.5-turbo backend:

...

1 dev/web-0(StatefulSet/web)
- Error: Back-off pulling image "registry.k8s.io/alpine:7c36eb2e27a23465498b483c1fc02c94142b31919e51bdede2146e6a-1690898126442"
Error: The Kubernetes is experiencing a back-off while pulling the image "registry.k8s.io/alpine:7c36eb2e27a23465498b483c1fc02c94142b31919e51bdede2146e6a-1690898126442".

Solution:
1. Check the network connectivity to the image registry.
2. Verify the image name and tag are correct.
3. If using custom image repository, ensure proper authentication is configured.

It recommends checking the network connection, whether the name and tag are correct, and making sure the credentials to connect to the private registry are valid. That last suggestion is right on point, but there’s just one little problem: when the same request was made later, the recommendation list was revised:

...

Solution:
1. Check if the image name and tag are correct.
2. Verify if the registry URL is accessible.
3. Ensure that you have the necessary permissions to pull the image.
4. Make sure there is an active internet connection.
5. Restart the Kubernetes pod or deployment.

The fact that the recommendations list has changed seems peculiar, given that high temperature — the ChatGPT API parameter responsible for the variability of the generated output — was not explicitly specified when the model was invoked. This PR addresses that issue: starting from K8sGPT v0.3.18, you can specify a temperature (default is 0.7), which makes the model more creative in its responses. You can set the temperature in your configuration file (k8sgpt.yaml):

ai:
  providers:
    - name: openai
      model: gpt-3.5-turbo
      password: ...
      temperature: 0.7
      topp: 0.5
      maxtokens: 2048
...

Still, the idea that the temperature is to blame is just my guess. You might as well blame the new events in the cluster and the use of the --no-cache option, which makes each new request ignore the previous request’s data.

In the updated response, K8sGPT recommended checking the image name and tag, the accuracy of the registry link, and whether you have the necessary permissions to pull the image (this makes sense in some cases). It also recommended checking the internet connection and (surprise!) restarting the cluster.

The updated recommendations list is broader; most of the proposed solutions are valid. At the same time, the only proper recommendation — to check that the right login and password were used — has vanished, although there is an obvious access error in the Pod description (kubectl -n dev describe pod web-0):

failed to authorize: failed to fetch anonymous token: unexpected status: 403 Forbidden.

It turns out that in some cases, the model fails to produce the correct answer. But this is an artificial example in which I know what the problem is. In real life, the LLM should provide suggestions. You can try them out and ask the model again. To be fair, this time, the gpt-4o model fared better than its predecessor:

...

Solution:
1. Verify image registry URL and credentials.
2. Ensure your Kubernetes secret for the registry is correctly configured.
3. Check OAuth token endpoint and permissions.
4. Retry pulling the image.
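For reference, the actual fix here is to restore valid registry credentials. A hedged sketch of doing that with an image pull secret (the secret name and credential values are placeholders):

kubectl -n dev create secret docker-registry gitlab-registry-creds \
  --docker-server=gitlab.registry \
  --docker-username=<user> \
  --docker-password=<correct-password>

kubectl -n dev patch statefulset web --type merge \
  -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"gitlab-registry-creds"}]}}}}'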

Case 4. Pod’s status is CrashLoopBackOff

This time, let’s make a mistake in the command to run the Pod. For example, let’s attempt to run bash, which does not exist in the target image. Here’s what K8sGPT responds with:

$ kubectl -n dev get pod
NAME                    READY   STATUS             RESTARTS      AGE
test-3ef82a3ca2-op3e9   0/1     CrashLoopBackOff   4 (25s ago)   2m12s

$ k8sgpt analyze --explain --no-cache --namespace=dev
100% |██████████████████████████████████████████████████| (2/2, 7 it/min)        
AI Provider: openai

1 dev/test-3ef82a3ca2-op3e9(Deployment/test)
- Error: back-off 1m20s restarting failed container=alpine pod=test-3ef82a3ca2-op3e9_dev(...)
Error: The container "alpine" in the "test-3ef82a3ca2-op3e9_dev" pod is failing and restarting with a back-off time of 1 minute and 20 seconds.

Solution:
1. Check the logs of the "alpine" container to identify the cause of the failure.
2. Make necessary changes to fix the issue.
3. Restart the pod to apply the changes and validate if the "alpine" container starts successfully.

It recommends checking the container logs, fixing the problem, and then restarting the Pod. No specific recommendations are provided. Let’s see which events we have:

19s         Warning   Failed                   pod/test-3ef82a3ca2-op3e9         Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "bash": executable file not found in $PATH: unknown

The problem is exec: "bash": executable file not found in $PATH: unknown, but there is no mention of it in the recommendations.
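Once you know that, the fix is trivial: either switch to a shell the image actually ships or use an image that includes bash. A minimal sketch of the first option (the actual command arguments here are illustrative):

command:
  - sh          # alpine ships /bin/sh; bash is not included by default
  - "-c"
  - "echo hello && sleep 3600"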

Another query on the same problem shows that the answer is not static — there is some variability to it. Still, setting the temperature to 0.7 fails to produce a radically different response:

...

1 dev/test-3ef82a3ca2-op3e9(Deployment/test)
- Error: back-off 2m40s restarting failed container=alpine pod=test-3ef82a3ca2-op3e9_dev(...)
Error: The container "alpine" in the pod "test-3ef82a3ca2-op3e9_dev" failed to restart after a back-off period.

Solution:
1. Check the logs of the pod to identify the specific error causing the container to fail.
2. Update the container configuration or troubleshoot the issue causing the failure.
3. Apply the necessary changes to resolve the error.
4. Restart the pod to check if the issue has been resolved.

Meanwhile, here’s the gpt-4o model’s suggestion:

...

Solution:
1. Check container logs: `kubectl logs test-3ef82a3ca2-op3e9 -c alpine`
2. Describe the pod for more details: `kubectl describe pod test-3ef82a3ca2-op3e9`
3. Fix any issues found in the logs or description.
4. Restart the pod: `kubectl delete pod test-3ef82a3ca2-op3e9` (it will be recreated).

As expected, the new recommendations differ in form but are identical in essence: check the logs, update the container configuration, find and fix the problem, and restart the Pod. The only difference is that the new model cited specific commands for debugging.

Case 5. Pod’s status is OOMKilled

This error is also common. To cause it, let’s run the Pod with the following command:

command:
  - bash
  - "-c"
  - |
    for i in {1..100}; do echo "$i s waiting..." && sleep 1; done
    echo "some strange command"
    tail /dev/zero

Events show that the Pod is being restarted repeatedly:

52s         Normal    Started                  pod/test-388ea60f3c-f2pe3         Started container alpine
52s         Normal    Pulled                   pod/test-388ea60f3c-f2pe3         Container image "alpine:3.6" already present on machine
52s         Normal    Created                  pod/test-388ea60f3c-f2pe3         Created container alpine
4s          Warning   BackOff                  pod/test-388ea60f3c-f2pe3         Back-off restarting failed container

However, you can’t say for certain what caused the issue: OOM is not explicitly mentioned in the events. You can derive clues by running the describe pod command, though:

$ kubectl -n dev describe pod test-388ea60f3c-f2pe3
...     
   State:          Waiting
     Reason:       CrashLoopBackOff
   Last State:     Terminated
     Reason:       OOMKilled

On top of that, the following information briefly shows up in the Pod status:

$ kubectl -n dev get pod -w
NAME                    READY   STATUS             RESTARTS      AGE
test-388ea60f3c-f2pe3   1/1     Running            2 (20s ago)   44s
test-388ea60f3c-f2pe3   0/1     OOMKilled          2 (21s ago)   45s
test-388ea60f3c-f2pe3   0/1     CrashLoopBackOff   2 (12s ago)   54s
test-388ea60f3c-f2pe3   1/1     Running            3 (27s ago)   66s
test-388ea60f3c-f2pe3   0/1     OOMKilled          3 (45s ago)   79s
test-388ea60f3c-f2pe3   0/1     CrashLoopBackOff   3 (14s ago)   90s
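If the status flashes by too quickly to catch, you can also extract the last termination reason directly (the Pod name is taken from the listing above):

$ kubectl -n dev get pod test-388ea60f3c-f2pe3 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
OOMKilled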

K8sGPT responds to that with the following recommendations:

$ k8sgpt analyze --explain --no-cache --namespace=dev -f Pod
100% |██████████████████████████████████████████████████| (1/1, 7 it/min)       
AI Provider: openai

0 dev/test-388ea60f3c-f2pe3(Deployment/test)
- Error: back-off 5m0s restarting failed container=alpine pod=test-388ea60f3c-f2pe3_dev(...)
Error: The alpine container in pod test-388ea60f3c-f2pe3_dev failed to start and is continuously restarting with a back-off period of 5 minutes.
Solution:
1. Check the logs of the pod using `kubectl logs pod/test-388ea60f3c-f2pe3_dev`
2. Identify the cause of the failure in the logs.
3. Fix the issue with the alpine container.
4. Update the pod using `kubectl apply -f <pod_yaml_file>` to apply the changes.
5. Monitor the pod using `kubectl get pods -w` to ensure it is running without restarts

So, you have to figure out the cause, make the corresponding changes, restart the Pod, and keep monitoring. Retrying with different delays and trying to pinpoint the error failed to produce more meaningful recommendations. It turns out that K8sGPT may miss randomly occurring errors. Perhaps an operator version of the tool would be just what the doctor ordered here.

Case 6. Ingress-related errors

In this section, I simulate a fairly simple typo. But first, let’s run a Pod with the nginx image and connect it to a kind: Ingress resource. There are no errors in events, and describe isn’t helpful:

$ kubectl -n dev get events
LAST SEEN   TYPE     REASON    OBJECT                      MESSAGE
5m          Normal   Pulled    pod/test-548bf544cd-nttq5   Container image "alpine:3.6" already present on machine
5m          Normal   Created   pod/test-548bf544cd-nttq5   Created container alpine
7m12s       Normal   Sync      ingress/my-test             Scheduled for sync

$ kubectl -n dev describe ingress
Name:             my-test
Labels:           <none>
Namespace:        dev
Address:          foo.bar.baz.qux
Default backend:  default-http-backend:80 (<error: endpoints "default-http-backend" not found>)
TLS:
 my-test-tls terminates my-test-app.sandbox.com
Rules:
 Host                                         Path  Backends
 ----                                         ----  --------
 my-test-app.sandbox.com                        /   my-test:app (<error: endpoints "my-test" not found>)
Annotations:                                  kubernetes.io/ingress.class: nginx
                                              nginx.ingress.kubernetes.io/force-ssl-redirect: true
                                              nginx.ingress.kubernetes.io/ssl-redirect: true
Events:
 Type    Reason  Age                From                      Message
 ----    ------  ----               ----                      -------
 Normal  Sync    10m (x2 over 10m)  nginx-ingress-controller  Scheduled for sync

Run out of ideas for further investigation? K8sGPT can provide hints as to what the potential problem might be:

$ k8sgpt analyze --explain -m 1  --namespace=dev
100% |██████████████████████████████████████████████████| (2/2, 20 it/min)       
AI Provider: openai

0 dev/my-test(my-test)
- Error: Ingress uses the ingress class nginxnew which does not exist.
- Error: Ingress uses the service dev/my-test which does not exist.
- Error: Ingress uses the secret dev/my-test-tls as a TLS certificate which does not exist.
Error: Ingress uses the ingress class nginxnew which does not exist. Ingress uses the service dev/my-test which does not exist. Ingress uses the secret dev/my-test-tls as a TLS certificate which does not exist.

Solution:
1. Check the spelling and syntax of the ingress class "nginxnew" to ensure it matches a valid ingress class name.
2. Verify if the service "dev/my-test" exists. If not, create the service with the required specifications.
3. Confirm if the secret "dev/my-test-tls" exists. If not, create the secret with the correct TLS certificate information.

The correct suggestion pops up right away: a typo in the ingress class name. The nginxnew class does not exist in the cluster:

$ kubectl get ingressclasses.networking.k8s.io --no-headers -o custom-columns=":metadata.name"
nginx

I should note that the Error: Ingress uses the ingress class nginxnew which does not exist was detected in the analyze section — that is, by the K8sGPT analyzer based on data from the cluster and before AI was used. So it is no surprise that in the Solution section, AI also emphasizes the importance of this error and suggests the proper way to address it.

As for the typo in the resource name, Cohere’s command-nightly LLM provided unexpectedly detailed answers and step-by-step recommendations with commands. You didn’t even have to set the -d option:

$ k8sgpt analyze -b cohere --explain -m 1  --namespace=dev
100% |██████████████████████████████████████████████████| (2/2, 5 it/min)        
AI Provider: cohere

1 dev/my-test(my-test)
- Error: Ingress uses the ingress class nginxnew which does not exist.
- Error: Ingress uses the service dev/my-test which does not exist.
- Error: Ingress uses the secret dev/my-test-tls as a TLS certificate
which does not exist.
Error: Ingress uses the ingress class nginxnew which does not exist.
Ingress uses the service dev/my-test which does not exist.
Ingress uses the secret dev/my-test-tls as a TLS certificate
which does not exist.

Solution: To resolve this error, you can follow these steps:

1. Verify the ingress class: Make sure that the ingress class `nginx`
exists in the Kubernetes cluster. You can use the following command to check:

```
kubectl get ingressclass nginxnew
```

2. Verify the service: Make sure that the service `dev/my-test`
exists in the Kubernetes cluster. You can use the following command to check:

```
kubectl get service dev/my-test
```

3. Verify the secret: Make sure that the secret `dev/my-test-tls`
exists in the Kubernetes cluster. You can use the following command to check:

```
kubectl get secret dev/my-test-tls
```

If any of these resources do not exist, you can create them with the following commands:

```
kubectl apply -f ingress.yaml
kubectl apply -f service.yaml
kubectl apply -f secret.yaml
```

4. Update the ingress: Once you have verified that the required
resources exist, you can update the ingress to use the correct resources.
You can use the following command to patch the ingress:

```
kubectl patch ingress my-ingress --type='json'
--patch='{"spec": {"ingressClassName": "nginx",
"defaultBackend": {"serviceName": "my-test", "servicePort": 80}}'
```

5. Verify the ingress: After updating the ingress, you can verify that it is using the correct resources by running the following command:

```
kubectl get ingress my-ingress
```

Output:
```
NAME           CLASS    HOSTNAME   ADDRESS        PORTS   AGE
my-ingress     nginxnew            127.0.0.1      80      1d3h
```

This output shows that the ingress `my-ingress` is using the correct
ingress class (`nginx`), service (`my-test`), and secret (`my-test-tls`).

Here is what the Cohere-based backend suggests:

  1. Check whether there is an ingress class with that name.
  2. Check if the service and the secret are present.
  3. Adjust the ingress settings to match the resource names found.
  4. Check that the ingress settings are correct.

On top of that, Cohere provides ready-to-use commands. Well done!

Adding privacy to the K8sGPT usage

The use of external LLMs may sound insecure, since requests to them explicitly include the namespaces, Pods, and other cluster resources. What is more, it is not known whether sensitive data, such as secrets, is shared in the process. Fortunately, you can anonymize the shared data using the --anonymize flag. For example, here’s how the non-anonymized request looks:

$ k8sgpt analyze -b localai  --no-cache -m 1 --namespace=dev --filter Ingress --explain

...
--- Ingress uses the ingress class nginxnew which does not exist. Ingress uses the service dev/my-test which does not exist. Ingress uses the secret dev/my-test-tls as a TLS certificate which does not exist. ---.
...

And this is what the same request looks like on the backend side with the --anonymize option:

$ k8sgpt analyze -b localai --no-cache -m 1 --namespace=dev --filter Ingress --explain --anonymize
...
--- Ingress uses the ingress class UTNzjE0P which does not exist. Ingress uses the service Jmnw/s3eiDbIRJhfDTa== which does not exist. Ingress uses the secret ELZw/FKrrKJ4vqK9tDWMpUEC= as a TLS certificate which does not exist. ---.
...

You can see that the resource names are obscured. K8sGPT sends these anonymized names to the AI backend, so the latter deals with them without knowing the original names. However, it’s important to note that anonymization doesn’t cover Kubernetes events at the moment.

Working with locally running LLMs from LocalAI

If you want an even more secure method to use K8sGPT, consider local LLM backends (-b localai). LocalAI allows you to run AI models locally, without an internet connection.

To install LocalAI, you will need to follow the instructions for a relevant Helm chart. (Personally, I installed it using werf instead of Helm by slightly editing the Helm charts and creating werf.yaml alongside .gitlab-ci.yml to deploy LocalAI to the K8s cluster the way I wanted. Feel free to use Helm itself or other preferred tooling for this.)
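If you stick with Helm itself, the installation might look like this (a sketch; the repo URL and chart name are the community go-skynet ones as of this writing and may have changed since):

helm repo add go-skynet https://go-skynet.github.io/helm-charts/
helm repo update
helm upgrade --install local-ai go-skynet/local-ai --namespace dev -f values.yaml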

Here is what values.yaml looks like:

deployment:
  env:
    ...
    preload_models: '[{ "url": "github:go-skynet/model-gallery/gpt4all-j.yaml", "overrides": { "parameters": { "model": "ggml-gpt4all-j" }}, "files": [ { "uri": "https://gpt4all.io/models/ggml-gpt4all-j.bin", "sha256": "acd54f6da1cad7c04c48b785178d686c720dcbe549903032a0945f97b1a43d20", "filename": "ggml-gpt4all-j" }]}]'

Note the model name and the resources section here. The default AI model is ggml-gpt4all-j (I’m going to show you how to run a different model below). I started my testing on a K8s cluster with a 4 CPU 8 GB node available at the time. With this configuration, the Pod got almost all the available resources:

resources:
  requests:
    cpu: 3500m
    memory: 7Gi
persistence:
  pvc:
    enabled: true
    size: 20Gi

You can also customize the prompt template in the values.yaml variables:

promptTemplates:
  ggml-gpt4all-j.tmpl: |
    The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
    ### Prompt:
    {{.Input}}
    ### Response:

It will be shown in the logs during inference:

Prompt (after templating): The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
### Prompt:
Simplify the following Kubernetes error message delimited by triple dashes written in --- english --- language; --- Ingress uses the ingress class UTNzjE0P which does not exist. Ingress uses the service Jmnw/s3eiDbIRJhfDTa== which does not exist. Ingress uses the secret ELZw/FKrrKJ4vqK9tDWMpUEC= as a TLS certificate which does not exist. ---.
	Provide the most possible solution in a step by step style in no more than 280 characters. Write the output in the following format:
	Error: {Explain error here}
	Solution: {Step by step solution here}
### Response:

Here, you can clearly see the part generated by the backend for LocalAI, and the part generated by the K8sGPT tool (it was {{.Input}}, remember?). I left the rest of the values.yaml parameters unchanged.

If there are enough resources for the Pod, enough space on the runner to pull the image, and imagePullSecrets are in place, everything will run smoothly right from the get-go:

gitlab-runner:~$ docker image ls
REPOSITORY                   TAG       IMAGE ID       CREATED      SIZE
docker.io/localai/localai    latest    cc3e53a4b23a   2 days ago   14GB

In this case, you will see the following in the log:

$ kubectl -n dev logs local-ai-dev-83dfe4a931-30aex -f
...
12:42PM DBG File "ggml-gpt4all-j" downloaded and verified
12:42PM DBG Prompt template "gpt4all-completion" written
12:42PM DBG Prompt template "gpt4all-chat" written
12:42PM DBG Written config file /models/gpt4all-j.yaml

┌───────────────────────────────────────────────────┐
│                   Fiber v2.48.0                   │
│               http://127.0.0.1:8080               │
│       (bound on host 0.0.0.0 and port 8080)       │
│                                                   │
│ Handlers ............ 55  Processes ........... 1 │
│ Prefork ....... Disabled  PID ................ 14 │
└───────────────────────────────────────────────────┘

Once you have successfully deployed a LocalAI image and run the model in that Pod, you need to register a new backend with k8sgpt:

k8sgpt auth add -b localai -u local-ai-dev.dev.svc.cluster.local:8080/v1 -m ggml-gpt4all-j
k8sgpt auth default -p localai
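Before running an analysis, you can verify that the LocalAI service responds by querying its OpenAI-compatible API (the service name is the one registered above):

$ kubectl -n dev port-forward svc/local-ai-dev 8080:8080 &
$ curl http://localhost:8080/v1/models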

You can even ask K8sGPT for more information on the LocalAI Pod:

k8sgpt analyze -b localai --explain --no-cache --namespace=dev -f Pod

As mentioned above, when the request is sent to the local Pod in which the LLM is running, you can even observe how K8sGPT does its magic. In particular, you can see the way it processes the original error report generated by the analyze command:

### Prompt:
Simplify the following Kubernetes error message delimited by triple dashes written in --- english --- language; --- Back-off pulling image "alpine:3.88" ---.
	Provide the most possible solution in a step by step style in no more than 280 characters. Write the output in the following format:
	Error: {Explain error here}
	Solution: {Step by step solution here}

Below is an example of the LocalAI and ggml-gpt4all-j response to the missing tag issue:

$ k8sgpt analyze -b localai --explain --no-cache --namespace=dev -f Pod
100% |██████████████████████████████████████████████████| (1/1, 2 it/min)        
AI Provider: localai

0 dev/test-8e8c52aae2-8e0ks(Deployment/test)
- Error: Back-off pulling image "alpine:3.88"

 I'm sorry, as an AI language model, it is not within my programming to provide step-by-step solutions or explain errors. However, I can provide the most common response that could occur. In this case, it seems like an error message about pulling the image "alpine:3.88" is being displayed, but there may be no solution to it.

The answer is unhelpful but honest: this model can’t provide a step-by-step solution, but it suggests considering the image pull error.

And here’s a different response from ggml-gpt4all-j to the same question, but with a temperature of 0.9:

$ k8sgpt analyze -b localai --explain --no-cache --namespace=dev -f Pod
100% |██████████████████████████████████████████████████| (1/1, 1 it/min)        
AI Provider: localai

0 dev/test-8e8c52aae2-8e0ks(Deployment/test)
- Error: Back-off pulling image "alpine:3.88"
The error message is "back-off pulling image alpine:3.88". The possible cause of the error is a network issue or an incorrect configuration. To resolve the error, you need to check your network connection and make sure that the image has been successfully pushed to your server. Once the image is uploaded, you can specify it as a dependency in your project's package.json file and run "npm install" to include it in your project. Alternatively, you can also use a CI/CD tool such as Jenkins or Travis CI to automate the build process and ensure that all dependencies are up-to-date.

Well, this one is a bit more detailed. For example, the LLM suggests checking the network connection and making sure that the target image is available on the server. Yet, it is misleading: the package.json part is clearly irrelevant to the issue. But at least the LLM doesn’t suggest restarting the cluster.

As you can see from the examples above, the local LLM-based approach is quite feasible. Granted, you will need a more powerful GPU-enabled node, as waiting for hours for K8sGPT to provide its recommendations is a non-starter. This option may be suitable, for instance, when K8sGPT is run as an operator. Also, in LocalAI mode, I failed to find a way to take advantage of one of Kubernetes’ key features: running multiple Pods with a model for parallel computing. Chances are, this option is not supported.

Local LLM repository

The next obvious step is to run a different model and check out its responses. Luckily, the community provides some ready-to-use options in the gallery. All you will need to do is change or add the preload_models parameter in values.yaml.

But what if I wanted to try an LLM that is not in the gallery, i.e., not among the available LocalAI templates? Or what if I have my own trained model? This is where a local template gallery comes into play. It’s this simple: download a model from a public repository or train your own, compose a template, and make the template file available (e.g., via a dedicated cluster ingress). Next, point the LocalAI application to the YAML file for your LLM.

Here is an example of such a template file for the llama_2_7b_chat_gglm model:

name: "llama-2-7b-chat-gglm"

description: |
 ...

license: ...
urls:
- https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

config_file: |
   backend: llama
   parameters:
     model: llama-2-7b-chat.ggmlv3.q6_K.bin
     top_k: 80
     temperature: 0.2
     top_p: 0.7
   context_size: 1024
   template:
     completion: openllama-completion
     chat: openllama-chat
files:
   - filename: "llama-2-7b-chat.ggmlv3.q6_K.bin"
     sha256: "e32c8f063b357001a1da0431778b40a78aa71dd664561ff14c51f18556381818"
     uri: "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/blob/main/llama-2-7b-chat.ggmlv3.q6_K.bin"

prompt_templates:
- name: "openllama-completion"
 content: |
   Q: Complete the following text: {{.Input}}nA:
- name: "openllama-chat"
 content: |
   Q: {{.Input}}nA:

The other parameters can be customized as needed. In my case, I left them unchanged.

Next, add a link to the local template to the LocalAI application’s values.yaml and redeploy:

preload_models: '[{"url": "https://my-own-llm-gallery.com/llama_2_7b_chat_gglm.yaml", "name": "llama_2_7b_chat_gglm"}]'

Since I didn’t have the capacity to run the model locally, I can’t show the model responses here. Note that recommended resources are usually listed in the model description. For example, the resources required for llama_2_7b are listed here.

The above tests show that models trained on a small corpus of data or highly quantized models, such as ggml-gpt4all-j, often underperform when it comes to generating useful recommendations. Quantization reduces the amount of memory required to store the LLM and speeds up inference by reducing the computation load. But it also leads to reduced accuracy and degradation in the quality of the model output. Thus, you need more resources (e.g., GPU) to achieve better results.

By the way, if your company is looking to run GenAI models privately for various business needs (such as using them within K8sGPT), our company offers a self-hosted LLM service. It is based on a subscription model and comes with affordable pricing for the initial setup and ongoing maintenance.

Takeaways

The key conclusion I drew is that K8sGPT can be used as a pretty good onboarding aid. Even without an AI backend, the tool features useful cluster resource state analyzers, which are quite informative in many cases. On top of that, you can browse the K8s documentation right in the console.

I must add that the most basic troubleshooting commands (kubectl explain, describe, get events, logs) plus your favorite search engine are usually enough for even a beginner to address the issues demonstrated in this article. However, K8sGPT’s answers help steer you in the right direction faster by focusing on possible solutions rather than error codes, and by providing structured output and command examples.

The extent of the benefit you can derive from K8sGPT’s troubleshooting tips is determined by the backend, i.e., the LLM that generates the response. Upon comparing several LLMs, I found OpenAI’s gpt-4o to be the best, although gpt-3.5-turbo often produces comparable results and comes at a lower cost. A clear advantage gpt-4o holds is that it provides troubleshooting commands in addition to recommendations. Local LLMs are also capable of generating useful recommendations, provided you have adequate computing resources.

As part of my experiment, I refrained from comparing the K8sGPT output to the conversation with LLMs in the respective chats. In my opinion, such a comparison would be inappropriate. Chat is a more flexible tool where you can repeatedly update the context, make comments and refinements, such as copying the output of kubectl get events and Pod logs into the conversation to get recommendations. However, the line between the two has become more blurred by the interactive mode in K8sGPT.

Convenience is one of the advantages K8sGPT offers. Debug recommendations are provided right away — eliminating the need to switch from the cluster terminal to the browser. K8sGPT also highlights typos or missing resources if they’ve not been defined. In addition, the analysis module checks for errors in some resources and sends debugging information corresponding to that type of resource to the backend, along with the error text. Since K8sGPT is so good for onboarding and outputs a list of action steps, anyone can learn how to troubleshoot a cluster. With this tool, even a novice can eliminate the most commonly occurring errors.

As for the drawbacks, K8sGPT cannot troubleshoot intermittent errors. I was also surprised that K8sGPT was only able to detect the ErrImagePull error on the main container (the initContainer went unnoticed). You’re also limited in which cluster resources you can run the analysis on: debugging of custom resources is not supported out of the box. At the same time, support for custom analyzers was added to the project last year. After all, this is an Open Source project that is being actively developed!

Overall, K8sGPT is an exciting project, and I really enjoyed using it. Most recently, it got a lot of new analyzers — for storage (StorageClasses, PersistentVolumes, and PVCs), security (ServiceAccounts, RoleBindings, and PodSecurityContexts), configurations (ConfigMaps), and Jobs — as well as MCP server support. All this reassures me that the project is becoming an even more powerful and efficient tool.

P.S.

We have also previously published this in-depth overview of existing AIOps solutions for Kubernetes, including K8sGPT, kubectl-ai, Botkube, and many others.
