Pelanor is reimagining cloud cost management with AI-native FinOps tools that explain spending, not just track it. By rebuilding the data layer from scratch, we deliver true unit economics across complex multi-tenant environments - revealing what each customer, product, or team actually costs. Our AI vision is deeper: we're building systems that truly reason about infrastructure, learning what's normal for your environment and understanding why costs change, not just when.
Kubernetes has become a go-to tool for managing distributed systems in the cloud. Its strength lies in how it automates the deployment, scaling, and operation of containerized applications. But with all that power comes a hefty dose of complexity.
Organizations running large-scale, cloud-native environments rely on Kubernetes for its flexibility and cost-efficiency. According to the 2023 CNCF Annual Survey, 84% of organizations are using or evaluating Kubernetes, making it the de facto standard for container orchestration. It enables agile application deployment, shifts workloads across clusters, and, in many cases, reduces infrastructure spending. But besides orchestrating your containers, it also orchestrates operational chaos if you’re not paying attention.
Monitoring Kubernetes is about making sense of thousands of moving parts: pods spinning up and dying, nodes joining and leaving, ephemeral containers that vanish before you can ask them what went wrong. Anything less than visibility across all that activity is operating blind.
Kubernetes includes built-in observability tools, but they barely scratch the surface. They might tell you which pods are alive, how much memory a container is using, or whether a node is healthy. But they can’t answer the deeper questions: why did something fail, what did it cost, and what else did it affect downstream?
Unless you were watching in real time, the evidence is gone. That’s a huge problem in any production-grade system. And that’s why successful teams turn to more advanced tools that offer not just monitoring, but true observability.
In theory, Kubernetes should make infrastructure easier to manage. In practice, it creates new layers of abstraction, each one hiding something you really need to see. You don’t monitor a single server anymore. You monitor an ecosystem of interdependent workloads, each distributed across volatile environments and talking to each other over virtual networks.
A modern Kubernetes environment includes multiple clusters, namespaces, microservices, and APIs. Workloads move constantly. Containers spin up and terminate in seconds. Pods crash and get rescheduled on different nodes, leaving no trace behind. By the time you get a Slack alert, the entity that failed may already be gone.
This means that traditional monitoring tools, which assume fixed infrastructure, quickly fall short. As surveys like Dynatrace’s have documented, teams respond to these challenges by layering additional tools and data sources around Kubernetes (K8s), each generating vast amounts of data in different formats.
To make matters more complicated, the signals you need aren’t all in one place. Some live in logs. Others in metrics. Some in DNS. Some in the network layer. And unless you stitch them together properly, you’ll be trying to debug a distributed system without instruments.
If traditional monitoring approaches fall short, what should you be watching in this complex ecosystem?
Kubernetes environments are full of interdependent components: workloads call databases, services talk to APIs, and containers generate spikes of traffic and cost, sometimes in milliseconds. Monitoring needs to reflect that complexity, with metrics that connect behavior to impact.
The cluster is your foundation. It holds your workloads, allocates resources, and defines your limits. Monitoring at this level means knowing whether your nodes are healthy and operational, but also whether they’re being used efficiently.
In cloud environments, underutilized nodes are expensive overhead, while overcommitted ones become points of failure. Are you running too many nodes for too few workloads? Are you prepared to survive the failure of one? Cluster-level monitoring helps answer these questions by exposing how your infrastructure behaves under pressure, and how much it’s costing you. This isn’t just theoretical: the 2024 Kubernetes Benchmark Report found that 37% of organizations have 50% or more of their workloads in need of container rightsizing, highlighting widespread inefficiencies that contribute directly to both overprovisioning and underutilization.
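As a rough illustration, here is a minimal sketch, using the official Kubernetes Python client, of the kind of cluster-level check this implies: compare each node's allocatable CPU with what the pods scheduled on it have actually requested. The quantity parsing is deliberately simplified to cores and millicores, and a reachable cluster with a local kubeconfig is assumed.

```python
# Sketch only: per-node CPU commitment, assuming a reachable cluster and kubeconfig.
from kubernetes import client, config

def cpu_millicores(quantity: str) -> int:
    # "2" -> 2000, "500m" -> 500 (simplified; ignores other suffixes)
    return int(quantity[:-1]) if quantity.endswith("m") else int(float(quantity) * 1000)

config.load_kube_config()
v1 = client.CoreV1Api()

requested = {}  # node name -> total requested CPU in millicores
for pod in v1.list_pod_for_all_namespaces().items:
    if pod.spec.node_name and pod.status.phase == "Running":
        for c in pod.spec.containers:
            req = (c.resources.requests or {}).get("cpu") if c.resources else None
            if req:
                requested[pod.spec.node_name] = requested.get(pod.spec.node_name, 0) + cpu_millicores(req)

for node in v1.list_node().items:
    allocatable = cpu_millicores(node.status.allocatable["cpu"])
    used = requested.get(node.metadata.name, 0)
    print(f"{node.metadata.name}: {used}m of {allocatable}m requested "
          f"({100 * used / allocatable:.0f}% committed)")
```

Nodes sitting far below their allocatable capacity are the expensive overhead described above; nodes near or over it are the potential points of failure.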
Next come the pods: short-lived, highly mobile execution units that run your containers. They scale in and out, get rescheduled across nodes, and often vanish before you even realize there was a problem. Monitoring pods means keeping track of how many are running, how many are failing, and how many are stuck waiting for resources they may never get.
Pod restarts are another red flag. If something keeps crashing and Kubernetes keeps bringing it back to life, you’ll never see an outage, but your users will feel the impact in the form of sluggish performance or silent errors. Without visibility here, you’re just guessing.
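A minimal sketch of that kind of pod census, again with the Kubernetes Python client; the restart threshold of 5 is an arbitrary illustration, not a recommendation.

```python
# Sketch only: count pod phases and flag containers with frequent restarts.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

phases = Counter()
restart_offenders = []
for pod in v1.list_pod_for_all_namespaces().items:
    phases[pod.status.phase] += 1
    for cs in pod.status.container_statuses or []:
        if cs.restart_count > 5:  # illustrative threshold
            restart_offenders.append(
                (pod.metadata.namespace, pod.metadata.name, cs.name, cs.restart_count)
            )

print(dict(phases))  # e.g. {'Running': 182, 'Pending': 4, 'Failed': 1}
for ns, name, container, restarts in restart_offenders:
    print(f"{ns}/{name} [{container}] restarted {restarts} times")
```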
Inside each pod are one or more containers, the runtime environments where your application code actually executes. These containers are what your development team builds, packages, and deploys, and they're often where subtle but critical issues emerge.
Is your container approaching CPU or memory limits? Is it experiencing throttling that's silently degrading performance? Has a memory leak developed that hasn't yet triggered an out-of-memory (OOM) kill but is steadily consuming resources? These issues don't always surface in pod-level metrics, yet they directly impact application behavior and user experience. Real-time container monitoring is essential for detecting these early warning signs before they escalate into service disruptions.
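One way to catch these early, sketched below under the assumption that metrics-server is installed (it backs the metrics.k8s.io API): compare live container memory usage against the declared limit and flag anything above, say, 80% of it. Both the unit parsing and the threshold are simplifications.

```python
# Sketch only: containers creeping toward their memory limits.
from kubernetes import client, config

def to_mib(quantity: str) -> float:
    units = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity) / (1024 * 1024)  # assume plain bytes otherwise

config.load_kube_config()
v1 = client.CoreV1Api()
metrics = client.CustomObjectsApi()

limits = {}  # (namespace, pod, container) -> memory limit in MiB
for pod in v1.list_pod_for_all_namespaces().items:
    for c in pod.spec.containers:
        lim = (c.resources.limits or {}).get("memory") if c.resources else None
        if lim:
            limits[(pod.metadata.namespace, pod.metadata.name, c.name)] = to_mib(lim)

usage = metrics.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "pods")
for item in usage["items"]:
    for c in item["containers"]:
        key = (item["metadata"]["namespace"], item["metadata"]["name"], c["name"])
        if key in limits and to_mib(c["usage"]["memory"]) > 0.8 * limits[key]:
            print(f"{key}: {to_mib(c['usage']['memory']):.0f}Mi of {limits[key]:.0f}Mi limit")
```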
At the top of the stack lies your application logic: what your users interact with, and what actually drives value. The infrastructure could be green across the board, but if your checkout API is returning 500 errors, no one's buying anything.
This is where you measure things like latency, request throughput, error rates, and database performance. It’s also where you start tying system behavior back to business outcomes. When a database query slows down, how does that impact your S3 access? When a DNS call fails, how long does it take to recover? These metrics are as strategic as they are technical.
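As one illustration, application-level signals are usually captured in the code itself. The sketch below uses the prometheus_client library with hypothetical metric names and a stand-in checkout handler; your real handler and metric names will differ.

```python
# Sketch only: latency and error-rate instrumentation exposed for scraping.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("checkout_request_seconds", "Checkout request latency")
REQUEST_ERRORS = Counter("checkout_request_errors_total", "Failed checkout requests")

@REQUEST_LATENCY.time()  # records each call's duration
def handle_checkout():
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
    if random.random() < 0.02:             # stand-in for a real failure path
        REQUEST_ERRORS.inc()
        raise RuntimeError("payment backend unavailable")

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a scraper to collect
    while True:
        try:
            handle_checkout()
        except RuntimeError:
            pass
```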
Each of these layers is important. But the real value of Kubernetes monitoring comes from how you connect them. A container crashing might be a code issue, or a resource misconfiguration, or a cluster-level scheduling problem. Unless you can trace events from top to bottom, and back up again, you’ll never know.
Built-in Kubernetes tools give you partial answers: which pods are running, how much CPU you’re using, whether a node is down. But they rarely connect the dots. If you want real observability, you need a system that can trace a problem from a SQL query all the way down to a throttled container, and then back up to the cloud region where that container was deployed.
Understanding these layers is crucial, but the real challenge is finding tools that can actually capture and correlate all this information effectively.
Kubernetes comes with some basic open source observability features: the kubelet exposes node stats, kube-state-metrics reports on cluster resources, and cAdvisor (Container Advisor) collects container performance data.
These tools help you answer questions like “Is the pod running?” or “How much CPU is this container using right now?” They don’t help you understand why something failed, what it’s costing you, or how different components are affecting one another.
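For concreteness, this is roughly the extent of those point-in-time answers; the pod name, namespace, and the presence of metrics-server are all assumptions in this sketch.

```python
# Sketch only: is this pod running, and what is it using right now?
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="checkout-7d9f", namespace="shop")  # placeholder names
print("phase:", pod.status.phase)  # Running / Pending / Failed ...

usage = client.CustomObjectsApi().get_namespaced_custom_object(
    "metrics.k8s.io", "v1beta1", "shop", "pods", "checkout-7d9f"
)
for c in usage["containers"]:
    print(c["name"], "cpu:", c["usage"]["cpu"])  # e.g. "12m", valid only at this instant
```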
More importantly, Kubernetes doesn’t retain much historical context. If you don't catch the failure in real time, the trail goes cold. For production systems, that’s a serious limitation.
This is where third-party observability tools step in, not as luxuries, but as essentials. You need systems that can correlate logs, metrics, traces, and cost signals across clusters, workloads, namespaces, and network flows. Tools that understand not just “this pod used too much memory,” but which function call, in which request, triggered the spike, and what else it impacted downstream.
Take Pelanor, for example. Its Kubernetes sensor offers cost attribution at a level of granularity most platforms don’t even attempt. It can tell you how much you’re spending per workload, per namespace, even per network endpoint. With eBPF-based traffic tracing, DNS mapping, SQL attribution, and S3-level visibility, it doesn’t just monitor the system, it reconstructs it in real time.
While third-party tools like Pelanor address these gaps, you still need a solid monitoring strategy to make the most of any observability platform.
How do high-performing teams monitor Kubernetes environments without drowning in data, or missing critical signals? They don’t rely on luck. They build smart observability strategies from the ground up.
DaemonSets are your eyes and ears on every node. They ensure that a monitoring agent (like a metrics scraper or log forwarder) runs on every machine, capturing everything from resource usage to network traffic. This gives you uniform visibility across the cluster, even when workloads move around.
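A minimal sketch of that pattern, expressed with the Kubernetes Python client; the agent image, the monitoring namespace, and the resource numbers are placeholders for whatever agent you actually run.

```python
# Sketch only: run a monitoring agent on every node via a DaemonSet.
from kubernetes import client, config

config.load_kube_config()

agent = client.V1DaemonSet(
    api_version="apps/v1",
    kind="DaemonSet",
    metadata=client.V1ObjectMeta(name="metrics-agent", namespace="monitoring"),
    spec=client.V1DaemonSetSpec(
        selector=client.V1LabelSelector(match_labels={"app": "metrics-agent"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "metrics-agent"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="agent",
                        image="registry.example.com/metrics-agent:latest",  # placeholder image
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "50m", "memory": "64Mi"},
                            limits={"cpu": "200m", "memory": "128Mi"},
                        ),
                    )
                ]
            ),
        ),
    ),
)

# One agent pod is scheduled per node, so coverage follows the cluster as nodes come and go.
client.AppsV1Api().create_namespaced_daemon_set(namespace="monitoring", body=agent)
```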
A chaotic label schema is the fastest way to ruin your observability stack. Good teams define clear, consistent, and hierarchical labels: by environment, team, service, version, geography. This makes it trivial to filter metrics, trace issues, and route alerts to the right people.
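When the schema is consistent, filtering becomes a one-liner; the team and environment labels below are illustrative.

```python
# Sketch only: pull one team's production pods across every namespace by label.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_pod_for_all_namespaces(label_selector="team=payments,env=prod")
for pod in pods.items:
    labels = pod.metadata.labels or {}
    print(pod.metadata.namespace, pod.metadata.name, labels.get("version"))
```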
Especially in dynamic environments like GKE, workloads are constantly shifting. Instead of manually configuring targets, rely on tools that support automatic service discovery, like Prometheus or Datadog agents. That way, your observability adapts as quickly as your infrastructure.
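Under the hood, discovery means watching the API server rather than maintaining a static target list. The sketch below only illustrates the mechanism; in practice you let your agent's built-in discovery do this for you.

```python
# Sketch only: react to services appearing, changing, and disappearing.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(v1.list_service_for_all_namespaces, timeout_seconds=60):
    svc = event["object"]
    # Event types arrive as ADDED / MODIFIED / DELETED while workloads shift.
    print(event["type"], svc.metadata.namespace, svc.metadata.name)
```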
It’s tempting to set alerts on every metric that has a red zone. But smarter teams focus on alerts that reflect actual impact: user-facing errors, repeated restarts, degraded latency. Combine these with intelligent thresholds and anomaly detection to cut down on alert fatigue and surface only what matters.
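A toy version of that idea, with an arbitrary window size and sigma threshold, just to show the shape of "alert on deviation, not on a fixed red line":

```python
# Sketch only: flag latency samples that deviate sharply from recent behavior.
from collections import deque
from statistics import mean, stdev

class LatencyAnomalyDetector:
    def __init__(self, window: int = 120, sigmas: float = 3.0):
        self.samples = deque(maxlen=window)
        self.sigmas = sigmas

    def observe(self, latency_ms: float) -> bool:
        """Return True if this sample looks anomalous against the rolling window."""
        anomalous = False
        if len(self.samples) >= 30:  # wait for enough history to be meaningful
            mu, sigma = mean(self.samples), stdev(self.samples)
            anomalous = latency_ms > mu + self.sigmas * max(sigma, 1e-6)
        self.samples.append(latency_ms)
        return anomalous

detector = LatencyAnomalyDetector()
if detector.observe(950.0):
    print("latency anomaly: page someone")
```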
Your workloads aren’t the only things that can fail. The Kubernetes control plane components, the API server, scheduler, kube-dns, etcd, and kube-controller-manager, are all critical to cluster stability. If you’re not monitoring them, you might not even know when your ability to manage the cluster is compromised.
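As a rough check, the sketch below reads the legacy componentstatuses API, which is deprecated but still reports scheduler, controller-manager, and etcd health on many clusters; newer setups expose /livez and /readyz on the API server instead.

```python
# Sketch only: surface control plane component health where this API still exists.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for cs in v1.list_component_status().items:
    conditions = {c.type: c.status for c in (cs.conditions or [])}
    healthy = conditions.get("Healthy") == "True"
    print(f"{cs.metadata.name}: {'ok' if healthy else 'DEGRADED'}")
```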
Kubernetes may not care about your page load time, but your users do. External monitoring, whether through synthetic checks, real user monitoring (RUM), or application-level metrics, is essential to detecting problems that infrastructure metrics won’t catch.
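A synthetic check can be as small as this sketch: hit a user-facing endpoint from outside the cluster and record latency and status. The URL is a placeholder.

```python
# Sketch only: an outside-in probe of a user-facing endpoint.
import time
import requests

CHECK_URL = "https://shop.example.com/api/checkout/health"  # placeholder endpoint

start = time.monotonic()
try:
    resp = requests.get(CHECK_URL, timeout=5)
    latency_ms = (time.monotonic() - start) * 1000
    print(f"{CHECK_URL}: status={resp.status_code} latency={latency_ms:.0f}ms")
except requests.RequestException as exc:
    # A failed probe is exactly the signal node- and pod-level metrics can miss.
    print(f"{CHECK_URL}: FAILED ({exc})")
```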
Even with all these practices in place, true observability requires more than just following a checklist.
Too often, Kubernetes monitoring is treated as a checkbox. Install a metrics scraper here, a dashboard there, throw in a few alerts, and call it done. But the teams that actually understand what’s going on in their clusters? They treat observability as a first-class engineering concern.
They invest in telemetry that captures not just metrics but context. They correlate cost with performance. They trace requests across microservices. They simulate failure. And they continuously ask: How quickly could we explain the next incident?
With the right monitoring approach, you don’t just detect issues, you gain insight. You see the system not just as a tangle of services, but as a living, evolving whole. And in the world of distributed systems, seeing clearly is everything.