Kubernetes

Decoding Kubernetes: When HPA Can’t Fetch Metrics

The Horizontal Pod Autoscaler (HPA) is pivotal in Kubernetes. It’s like our trusty assistant, automatically adjusting the number of pods in a deployment according to observed metrics like CPU usage. However, there are moments when it hits a hurdle. One such moment is when kubectl describe hpa reports something like this:

Name:                                                  widget-app-sun
Namespace:                                             development
...
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  <unknown> / 55%
Min replicas:                                          1
Max replicas:                                          3
Deployment pods:                                       1 current / 0 desired
Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetResourceMetric  the HPA was unable to compute the replica count: unable to get metrics for resource cpu: no metrics returned from resource metrics API
Events:
  Type     Reason                        Age                 From                       Message
  ----     ------                        ----                ----                       -------
  Warning  FailedComputeMetricsReplicas  20m (x20 over 9m)   horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedGetResourceMetric       12s (x29 over 9m)   horizontal-pod-autoscaler  unable to get metrics for resource cpu: no metrics returned from resource metrics API
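
For context, the HPA derives its replica count from the ratio of the current metric value to the target, which is why an <unknown> reading leaves it with nothing to compute. Here is a minimal shell sketch of that documented formula. It ignores the controller's tolerance and stabilization behavior, and the function name and sample readings are illustrative rather than taken from the output above:

```shell
# Sketch of the HPA replica calculation (tolerance and stabilization omitted).
# Args: current_replicas current_utilization target_utilization min max
hpa_desired_replicas() {
  awk -v cur="$1" -v util="$2" -v target="$3" -v min="$4" -v max="$5" 'BEGIN {
    raw = cur * util / target
    want = int(raw)
    if (want < raw) want++        # ceil() for positive values
    if (want < min) want = min    # clamp to the HPA bounds
    if (want > max) want = max
    print want
  }'
}

# Hypothetical reading: 1 replica at 90% CPU, target 55%, bounds 1..3
hpa_desired_replicas 1 90 55 1 3    # ceil(1 * 90 / 55) = 2
```

Without a current utilization figure there is no numerator for this ratio, so the controller reports ScalingActive as False instead of guessing.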

What’s the Scoop?

This cryptic message is essentially HPA’s way of saying, “I’m having a hard time fetching those CPU metrics I need.” But why? Here are a few culprits:

Perhaps Metrics-server isn’t installed or isn’t running correctly.
Maybe Metrics-server is present, but it can’t scrape metrics from the kubelets on the nodes.
It could be a misconfiguration on the HPA’s side; for example, a CPU utilization target only works when the target pods’ containers declare CPU requests.
Sometimes, network policies or RBAC restrictions come into play, blocking access to the metrics API.

The Detective Work: Troubleshooting Steps

.- Is Metrics-server Onboard?

kubectl get deployments -n kube-system | grep metrics-server

metrics-server           1/1     1            1           221d

If it’s missing in action, it’s time to deploy it. Helm is a handy tool for this. https://artifacthub.io/packages/helm/metrics-server/metrics-server
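
As a rough sketch of the Helm route, the function below adds the upstream chart repository and installs the chart. The release name and the kube-system namespace are choices made for this example, not requirements, and clusters that bundle metrics-server (some distributions do) should be upgraded through their own tooling instead:

```shell
# Sketch: install metrics-server from its upstream Helm chart.
# Release name and namespace are assumptions made for this example.
install_metrics_server() {
  helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/ &&
  helm repo update &&
  helm upgrade --install metrics-server metrics-server/metrics-server \
    --namespace kube-system
}
```

Calling install_metrics_server runs against whatever cluster your current kubectl context points to, so check the context first.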

.- How’s Metrics-server Feeling Today?

kubectl get pods -n kube-system -l k8s-app=metrics-server

NAME                              READY   STATUS    RESTARTS      AGE
metrics-server-5f9f776df5-zlg42   1/1     Running   6 (71d ago)   221d

Make sure it’s running smoothly. If it’s throwing a tantrum, dive into its logs:

kubectl logs metrics-server-5f9f776df5-zlg42 -n kube-system

I0730 17:07:50.422754       1 secure_serving.go:266] Serving securely on [::]:10250
I0730 17:07:50.425140       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0730 17:07:50.425155       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
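
The log lines above only show a normal startup, so a more direct check is whether the aggregated metrics API itself reports as available. The sketch below reads the Available condition of the v1beta1.metrics.k8s.io APIService; the function names are made up for this example:

```shell
# Prints "True" when the aggregated metrics API is registered and healthy.
metrics_api_status() {
  kubectl get apiservice v1beta1.metrics.k8s.io \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
}

# Puts the result into words; anything other than "True" points at metrics-server
# or at the path between the API server and metrics-server.
report_metrics_api() {
  status=$(metrics_api_status)
  if [ "$status" = "True" ]; then
    echo "metrics API available"
  else
    echo "metrics API unavailable (status: ${status:-none})"
  fi
}
```

If report_metrics_api says the API is available, kubectl top pods -n development should list CPU figures for the pods the HPA is watching.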

.- A Peek into Metrics-server’s Config

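Sometimes metrics-server needs extra flags to reach the kubelets, especially if your cluster has a unique CNI or is lounging on a special cloud provider. Two flags are worth checking:

--kubelet-preferred-address-types sets the order of node address types (for example InternalIP, then Hostname) that metrics-server tries when connecting to a kubelet.
--kubelet-insecure-tls skips verification of the kubelet’s serving certificate; it is intended for test clusters, not production.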
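
One way to inspect and adjust those flags is sketched below. It assumes the deployment is named metrics-server in kube-system, that metrics-server is the first container, and that the container already has an args list; the helper names are invented for this example. If the deployment is managed by Helm or by your distribution, a manual patch may be reverted on the next upgrade, so treat this as a diagnostic aid:

```shell
# Builds a JSON Patch that appends one flag to the first container's args.
# The flag is inserted as-is, so it must not contain double quotes.
flag_patch_json() {
  printf '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"%s"}]\n' "$1"
}

# Prints the args metrics-server is currently started with.
show_metrics_server_args() {
  kubectl -n kube-system get deployment metrics-server \
    -o jsonpath='{.spec.template.spec.containers[0].args}'
}

# Appends a flag to the deployment, which triggers a rollout.
add_metrics_server_flag() {
  kubectl -n kube-system patch deployment metrics-server \
    --type=json -p "$(flag_patch_json "$1")"
}
```

For example, add_metrics_server_flag --kubelet-preferred-address-types=InternalIP,Hostname,ExternalIP would append that flag and restart the pod with it.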

.- The Network or RBAC Culprits

Are there any stringent network policies that are hindering the conversation between the metrics-server and the API server or the kubelets? Or maybe, metrics-server doesn’t have the right RBAC permissions to access metrics?

Peek into network policies in the kube-system namespace:

kubectl get networkpolicy -n kube-system

And don’t forget to inspect the ClusterRole:

kubectl describe clusterrole | grep metrics-server -A10

Name:         system:metrics-server
Labels:       objectset.rio.cattle.io/hash=9a6f488150c249811b9df07e116280789628963e
Annotations:  objectset.rio.cattle.io/applied:
                H4sIAAAAAAAA/4yRwY6bMBCGX6WasyEhSQkg9VD10ENvPfRScRjsSXABG80Yom7Eu69MotVKq93syRr/+j7711wBR/uHWKx3UAE3qFOcQuvZPmGw3qVdIan1mzkDBZ11Bir40U8SiH...
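
Rather than reading the ClusterRole by eye, you can ask the API server directly whether the metrics-server service account is allowed to do what it needs. This sketch assumes the service account is metrics-server in kube-system (the upstream default, which may differ in your install) and that your own user is allowed to impersonate it:

```shell
# Service account to impersonate; the name is an assumption, adjust as needed.
METRICS_SERVER_SA="system:serviceaccount:kube-system:metrics-server"

# Usage: metrics_server_can VERB RESOURCE [SUBRESOURCE]
# Prints "yes" or "no" as reported by kubectl auth can-i.
metrics_server_can() {
  if [ -n "${3:-}" ]; then
    kubectl auth can-i "$1" "$2" --subresource="$3" --as="$METRICS_SERVER_SA"
  else
    kubectl auth can-i "$1" "$2" --as="$METRICS_SERVER_SA"
  fi
}
```

Typical checks would be metrics_server_can get nodes metrics and metrics_server_can list pods; a "no" on either suggests the ClusterRole or its binding is incomplete.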

.- Version Harmony: HPA & Metrics-server

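Compatibility matters! The HPA controller ships with Kubernetes itself (it runs inside kube-controller-manager), so the real question is whether your metrics-server release supports your cluster’s Kubernetes version. Sometimes, a version mismatch is the root cause.

Here’s how to check your Kubernetes version (the Server Version line, when shown, is the one that matters here):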

kubectl version

Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3

And let’s not forget about the metrics-server:

kubectl describe deployment metrics-server -n kube-system | grep Image:

Image:      rancher/mirrored-metrics-server:v0.6.2
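
If you want the metrics-server version on its own, for instance to compare it against the project's compatibility notes, a small helper can pull the tag out of the image reference. This is only a sketch: it assumes the reference carries a tag and no digest, and the function names are made up for this example:

```shell
# Extracts the tag from an image reference.
# Assumes the reference ends in ":tag" and carries no digest.
image_tag() {
  echo "${1##*:}"
}

# Reads the image of the first container in the metrics-server deployment
# (deployment name and namespace are assumptions).
metrics_server_image() {
  kubectl -n kube-system get deployment metrics-server \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
}

image_tag "rancher/mirrored-metrics-server:v0.6.2"   # prints v0.6.2
```

On a live cluster, image_tag "$(metrics_server_image)" combines the two.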

Troubleshooting in Kubernetes can be challenging, but with a systematic approach, many issues, like the HPA metrics problem, can be resolved. It’s essential to understand the components involved and to remain adaptable. As Kubernetes continues to evolve, so too should our methods for diagnosing and fixing problems.

Navigating Kubernetes: Understanding and Addressing the OutOfPods Error

When working with Kubernetes, you may run into the notorious “OutOfPods” error. It mostly appears in the kubectl describe pod output of a pod that failed to be scheduled, as in the example below:

Name:        user-api-server-7869b4c8d9-qw4zp
Namespace:   default
Priority:    0
Node:        <none>
Labels:      app=user-api-server
Annotations: <none>
Status:      Pending
Reason:      Unschedulable
IP:          <none>
IPs:         <none>

Events:
  Type     Reason           Age                 From               Message
  ----     ------           ----                ----               -------
  Warning  FailedScheduling 4m32s (x7 over 5m)  default-scheduler  0/6 nodes are available: 3 OutOfPods, 6 node(s) had taints that the pod didn't tolerate.

In this context, the “Reason” field reads “Unschedulable,” and the “Message” field explains why the pod couldn’t be scheduled. Here, three nodes have reached their pod limit, denoted by “3 OutOfPods,” and the pod also doesn’t tolerate the taints reported for the nodes.

Understanding the OutOfPods Error
The “OutOfPods” error signifies that a node has reached its pod limit. Each node in a Kubernetes cluster has a cap on the number of pods it can run (110 by default), influenced by several factors including the node’s own kubelet configuration and the overall cluster setup.

To check this limit, run kubectl describe node and look at the pods entry:

Capacity:
  cpu:                1
  ephemeral-storage:  47145992Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  hugepages-32Mi:     0
  hugepages-64Ki:     0
  memory:             6058428Ki
  pods:               110

Both the “Capacity” and “Allocatable” sections of that output carry a pods entry giving the maximum number of pods for the node; “Allocatable” is the figure the scheduler works from.
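
To pull that number out without scanning the full output, a small filter can read the describe output and print the pods entry from the Allocatable section. This is a sketch that assumes the usual layout of kubectl describe node (section names at the start of a line, entries indented beneath them); the function name is invented for this example:

```shell
# Reads `kubectl describe node` output on stdin and prints the pods entry
# of the Allocatable section.
allocatable_pods() {
  awk '
    /^Allocatable:/ { sect = 1; next }   # entering the Allocatable section
    /^[^ ]/         { sect = 0 }         # any other top-level key ends it
    sect && $1 == "pods:" { print $2 }
  '
}
```

Typical use is kubectl describe node <node-name> | allocatable_pods. To see how close a node is to that figure, kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> -o name | wc -l counts the pods currently assigned to it.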

Strategies to Mitigate OutOfPods Error
An “OutOfPods” error means the node has reached its pod limit and can’t accommodate any more pods until existing ones terminate or capacity is added.

  1. Node Capacity:

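Every node has a definite limit on the pods it can run, determined by the node’s resources and its configuration.
Solutions: Scale up or add nodes if they are perpetually operating at or near capacity, or optimize resource requests and limits. Keep in mind that the pod limit is a count: trimming requests frees CPU and memory but does not raise the number of pods a node accepts.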

  2. Cluster Scaling:

Implement auto-scaling solutions to dynamically adapt the number of nodes as needed, especially if your entire cluster is consistently approaching its capacity.

  3. Pod Configuration:

Review resource requests and limits to ensure that pods are not demanding more resources than necessary. Quality of Service (QoS) classes do not change where the scheduler places a pod (it works from requests), but they decide which pods the kubelet evicts first when a node runs short of resources.
Implementing QoS Classes: In Kubernetes, every pod is assigned one of three QoS classes, Guaranteed, Burstable, or BestEffort, based on the resource requests and limits set on its containers.
.- Guaranteed: Every container in the pod has memory and CPU limits, and they are equal to the requests. Use this for critical pods that need specific resources.

.- Burstable: The pod does not meet the Guaranteed criteria, and at least one container has a memory or CPU request or limit. Use this for pods that require a minimum amount of resources to run but can use more when available.

.- BestEffort: No container in the pod has memory or CPU limits or requests. Use this for non-critical tasks that can run on whatever resources remain; these pods are the first candidates for eviction.
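
The three rules can be expressed as a short decision function. The sketch below handles the simplified case of a pod with a single container, and it compares quantities as plain text, so equivalent values written differently (such as 500m and 0.5) would be treated as unequal. The function name and argument order are choices made for this example:

```shell
# Classifies a single-container pod from its four resource settings.
# Usage: qos_class CPU_REQUEST CPU_LIMIT MEM_REQUEST MEM_LIMIT
# Pass an empty string for anything that is unset.
qos_class() {
  cpu_req=$1; cpu_lim=$2; mem_req=$3; mem_lim=$4
  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    echo BestEffort
    return
  fi
  # Kubernetes defaults a missing request to the limit when a limit is set.
  [ -n "$cpu_req" ] || cpu_req=$cpu_lim
  [ -n "$mem_req" ] || mem_req=$mem_lim
  if [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
     [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo Guaranteed
  else
    echo Burstable
  fi
}
```

For instance, qos_class 500m 500m 256Mi 256Mi falls into Guaranteed, while qos_class 100m "" "" "" falls into Burstable. On a live cluster the assigned class is recorded in the pod's status.qosClass field.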

  4. Resource Fragmentation:

Employ affinity and anti-affinity rules to minimize fragmentation by intelligently placing the pods, ensuring optimal utilization of available resources.

  5. Kubelet Configuration:

Raising the maxPods option in the Kubelet configuration can alleviate “OutOfPods” errors by allowing more pods to run on a node, provided the node has the CPU, memory, and pod IP addresses to support them. The kubelet default is 110.
Implementing Adjustment:
To adjust the maxPods value, you would typically need to modify the Kubelet configuration file, usually located at /var/lib/kubelet/config.yaml on the node. You need to do this on every node you want to adjust.
For example, open the Kubelet configuration file in a text editor:

sudo vim /var/lib/kubelet/config.yaml

Find the line with maxPods and adjust the value to the desired number, or add a new line of the form maxPods: followed by the number if it’s not there.
Save and exit the text editor.
Restart the Kubelet service for the changes to take effect:

sudo systemctl restart kubelet
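
If you need to repeat this edit on several nodes, the manual steps can be scripted. The function below is a sketch rather than a hardened tool: it assumes maxPods is a top-level key at the start of a line, it must run as root to write the real config file, and the value 150 in the usage note is only an illustration:

```shell
# Sets maxPods in a kubelet config file, replacing an existing top-level
# entry or appending one. A .bak copy of the original is left beside the file.
# Usage: set_max_pods FILE VALUE
set_max_pods() {
  file=$1; value=$2
  cp "$file" "$file.bak"
  if grep -q '^maxPods:' "$file"; then
    sed "s/^maxPods:.*/maxPods: $value/" "$file.bak" > "$file"
  else
    printf 'maxPods: %s\n' "$value" >> "$file"
  fi
}
```

For example, set_max_pods /var/lib/kubelet/config.yaml 150 followed by the kubelet restart above. Afterwards, kubectl get node <node-name> -o jsonpath='{.status.capacity.pods}' should report the new value.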

Conclusion

The OutOfPods error in Kubernetes underscores the criticality of proper resource management within a cluster. Addressing this can be achieved by optimizing node and pod configurations, conscientiously adjusting the maxPods value, and employing Quality of Service (QoS) classes to ensure effective resource allocation. By proactively implementing these strategies, operational hurdles can be avoided, maintaining a robust and efficient Kubernetes environment.