Pelanor is reimagining cloud cost management with AI-native FinOps tools that explain spending, not just track it. By rebuilding the data layer from scratch, we deliver true unit economics across complex multi-tenant environments - revealing what each customer, product, or team actually costs. Our AI vision is deeper: we're building systems that truly reason about infrastructure, learning what's normal for your environment and understanding why costs change, not just when.
Amazon’s cloud division has led the market ever since the tech giant stepped into it. While Google and Microsoft have gradually chipped away at its share, AWS still holds 29% of the Western market as of the end of Q2 2025, serving around 4.2 million customers of all sizes.
Large enterprises are certainly among them, but the vast majority (92%) are small businesses. Most of these smaller companies don’t have the resources to build in-house tools to monitor and analyze their AWS usage and cloud spending. And in a world where cloud costs are quickly becoming one of the biggest expenses for organizations of all sizes, clear visibility and control over that spending is more critical than ever.
AWS monitoring is generally divided into two main tracks: Amazon’s own built-in tools, and third-party solutions that extend or enhance those capabilities. In both cases, AWS supports a layered monitoring strategy that covers performance, health, and security across the cloud environment. This relies on the continuous collection and analysis of metrics, logs, and events, not just to keep things running smoothly, but to catch potential issues before they become real problems.
Over the past decade, business operations have steadily migrated to the cloud. In parallel, recent years have witnessed a sharp rise in the adoption of artificial intelligence. AI is no longer just a standalone product; it's an essential tool driving the daily efficiency of both organizations and individuals.
Despite this deep integration between cloud infrastructure and business outcomes, clear policies for managing and overseeing this connection have often lagged. As a result, the rapid pace of technology deployment has outpaced many organizations' ability to effectively monitor and govern their cloud environments.
Monitoring an organization's AWS environment is therefore no longer optional. It's as fundamental to business health as ensuring human resources are aligned with meaningful, productive outcomes.
Monitoring the right metrics is crucial for maintaining healthy, performant, and cost-effective AWS environments. Metrics are the fundamental concept in CloudWatch. A metric represents a time-ordered set of data points that are published to CloudWatch. With AWS services automatically publishing metrics to CloudWatch, you have access to thousands of data points - the key is knowing which ones matter most for your specific use case.
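To make that concrete, here is a minimal sketch of publishing a custom application metric with boto3; the namespace, metric name, and dimension are hypothetical placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish one data point for a hypothetical custom metric.
# Custom metrics live in your own namespace, never in AWS/* namespaces.
cloudwatch.put_metric_data(
    Namespace="MyApp",  # hypothetical namespace
    MetricData=[
        {
            "MetricName": "CheckoutLatency",
            "Dimensions": [{"Name": "Service", "Value": "checkout"}],
            "Value": 182.0,
            "Unit": "Milliseconds",
        }
    ],
)
```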
AWS metrics fall into two main categories:
Basic Monitoring: Many AWS services offer basic monitoring by publishing a default set of metrics to CloudWatch with no charge to customers. These metrics typically have 5-minute intervals and provide essential visibility into resource health.
Detailed Monitoring: Offered by only some services, detailed monitoring incurs additional charges and must be explicitly activated for each AWS service you want it on. It provides metrics at 1-minute intervals for more granular visibility; a minimal activation sketch for EC2 follows below.
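For EC2, activating detailed monitoring is a one-line API call. A sketch with boto3, assuming a hypothetical instance ID:

```python
import boto3

ec2 = boto3.client("ec2")

# Switch a hypothetical instance from 5-minute to 1-minute metrics.
# Note: detailed monitoring incurs per-metric charges.
ec2.monitor_instances(InstanceIds=["i-0123456789abcdef0"])

# To revert to basic (5-minute) monitoring:
# ec2.unmonitor_instances(InstanceIds=["i-0123456789abcdef0"])
```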
Understanding which metrics to prioritize helps you cut through that volume and focus on the signals that actually matter.
Observability in AWS starts where everything else does - the infrastructure. Metrics are your early warning system, your truth-tellers.
EC2 feeds CloudWatch with essential signals: CPU usage, network traffic in and out, disk operations, and system health checks. Installing the CloudWatch agent on your instances opens the door to OS-level metrics like memory usage, disk space, and swap activity.
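Once the agent is running, those OS-level metrics are queryable like any other. A sketch that pulls an hour of memory usage from the agent's default CWAgent namespace, assuming a hypothetical instance ID (your agent config may attach different dimensions):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# mem_used_percent is a standard CloudWatch agent metric; the dimensions
# depend on the agent's append_dimensions configuration.
response = cloudwatch.get_metric_statistics(
    Namespace="CWAgent",
    MetricName="mem_used_percent",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Average"]:.1f}%')
```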
EBS volumes have their own story to tell. Metrics like read/write throughput and IOPS show you how hard your storage is working. On gp3, you’ll want to watch how close throughput runs to what you’ve provisioned. On gp2, BurstBalance quietly warns when your performance cushion is wearing thin.
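A practical response is to alarm before the cushion runs out. A sketch of a BurstBalance alarm on a hypothetical gp2 volume, with a placeholder SNS topic for notifications:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Fire when the gp2 burst-credit balance averages below 20% for 15 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="gp2-burst-balance-low",
    Namespace="AWS/EBS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # hypothetical volume
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=20.0,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
)
```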
Then there’s RDS. It may be managed, but it’s not magic. CloudWatch gives you a look into CPU load, memory availability, connection counts, and disk latency. If your queries are dragging, this is where you start digging.
Application Load Balancers bring their own layer of truth. You’ll see how many requests hit your endpoints, how fast targets respond, and whether errors are on you or the client. Move up the stack and the focus shifts. With Lambda, it’s all about invocations, duration, errors, and throttles. In other words: is your function working, and is it working fast enough? API Gateway metrics complement that with insights on request volume, latency, and error codes, plus cache hit ratios that reveal how much work you're avoiding.
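Throttles are the Lambda metric most worth an immediate page, since they mean requests are being rejected outright. A sketch of a zero-tolerance throttle alarm, using a hypothetical function name:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Any throttle in a one-minute window trips the alarm; missing data
# (no invocations at all) is treated as healthy rather than breaching.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-fn-throttled",
    Namespace="AWS/Lambda",
    MetricName="Throttles",
    Dimensions=[{"Name": "FunctionName", "Value": "checkout"}],  # hypothetical function
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```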
And if you're running containers on ECS or EKS, metrics are how you keep chaos from turning into downtime. Track running tasks, service counts, and container-level CPU and memory. Container Insights turns scattered signals into something human-readable.
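On ECS, Container Insights is a cluster setting you can flip programmatically. A sketch with boto3, assuming a hypothetical cluster name (note that Container Insights adds CloudWatch charges):

```python
import boto3

ecs = boto3.client("ecs")

# Enable Container Insights on an existing cluster; task-, service-,
# and cluster-level metrics then start flowing into CloudWatch.
ecs.update_cluster_settings(
    cluster="prod-cluster",  # hypothetical cluster name
    settings=[{"name": "containerInsights", "value": "enabled"}],
)
```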
Application metrics provide insight into how your applications perform and help identify bottlenecks or issues affecting user experience.
Monitoring cost and utilization metrics is essential for optimizing AWS spending and ensuring efficient resource usage.
Billing metric data is stored in the US East (N. Virginia) region and represents worldwide charges. This data includes the estimated charges for every AWS service you use. AWS provides the EstimatedCharges metric, which tracks total cost and breaks it down by service and linked account. The metric updates several times a day, enabling proactive budget monitoring through CloudWatch alarms.
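Because billing metrics only live in US East (N. Virginia), the alarm has to be created there. A sketch of a monthly spend alarm at a hypothetical $1,000 threshold (billing alerts must first be enabled in the account's billing preferences):

```python
import boto3

# Billing metrics are only published in us-east-1.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="estimated-charges-over-1000-usd",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,  # 6 hours; the metric only updates a few times a day
    EvaluationPeriods=1,
    Threshold=1000.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder topic
)
```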
CloudWatch API requests themselves generate costs, billed by request type and the number of metrics requested. The main cost drivers are metric storage, API calls (especially GetMetricData from third-party tools), and log ingestion and storage volumes. Understanding these components helps target optimization efforts effectively.
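One way to see where your own CloudWatch bill is going is to break it down by usage type with the Cost Explorer API. A sketch, with a hypothetical billing month:

```python
import boto3

ce = boto3.client("ce")

# Group one month's CloudWatch charges by usage type (metrics, API
# requests, log ingestion, log storage, and so on).
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-06-01", "End": "2025-07-01"},  # hypothetical month
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE", "Values": ["AmazonCloudWatch"]}},
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])
```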
AWS provides several native monitoring tools that work together to give you comprehensive visibility into your cloud infrastructure, applications, and services. These tools integrate seamlessly with AWS services and provide real-time insights, historical data analysis, and proactive alerting capabilities.
The primary AWS native monitoring tools include Amazon CloudWatch, AWS CloudTrail, AWS X-Ray, the AWS Distro for OpenTelemetry, and AWS Config, each covered below.
Each tool serves a specific purpose in the monitoring ecosystem, and together they provide a complete observability solution for AWS environments.
Amazon CloudWatch: A monitoring and management service designed to provide data and actionable insights for AWS, hybrid, and on-premises applications and infrastructure resources. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events, providing you with a unified view of AWS resources, applications, and services.
AWS CloudTrail: An AWS service that helps you enable operational and risk auditing, governance, and compliance. Actions taken by a user, role, or AWS service are recorded as events in CloudTrail.
AWS provides distributed tracing capabilities primarily through AWS X-Ray, which helps developers analyze and debug distributed applications.
AWS X-Ray receives data from services as segments, then groups segments that have a common request into traces. X-Ray processes the traces to generate a service graph that provides a visual representation of your application.
Segments: The compute resources running your application logic send data about their work as segments. A segment provides the resource's name, details about the request, and details about the work done.
Subsegments: A segment can break down the data about the work done into subsegments. Subsegments provide more granular timing information and details about downstream calls.
Service graph: X-Ray uses the data that applications send to generate a service graph. Each AWS resource that sends data to X-Ray appears as a service in the graph.
Traces: A trace tracks the path of a request through your application, collecting all the segments generated by that single request. (A minimal instrumentation sketch follows this list.)
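Here's what those concepts look like in code: a minimal sketch using the X-Ray SDK for Python, assuming the X-Ray daemon is reachable (the service, segment, and subsegment names are hypothetical; inside Lambda, the platform manages the segment for you):

```python
from aws_xray_sdk.core import xray_recorder

# By default the SDK emits segments to the local X-Ray daemon over UDP.
xray_recorder.configure(service="checkout-api")  # hypothetical service name

segment = xray_recorder.begin_segment("handle_checkout")
try:
    # A subsegment times one downstream call within the request.
    xray_recorder.begin_subsegment("charge_card")
    # ... call the payment provider here ...
    xray_recorder.end_subsegment()
finally:
    xray_recorder.end_segment()
```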
AWS Distro for OpenTelemetry: Enables developers to use a standardized set of open-source APIs, SDKs, and agents to instrument their application once and collect correlated metrics and traces for multiple monitoring solutions.
AWS Config: The primary service for configuration management and compliance monitoring in AWS environments.
AWS Config provides a detailed view of the configuration of AWS resources in your account, including how resources relate to one another and how they were configured in the past, so anyone can see how configurations and relationships change over time. The key components are Configuration Items (the state of a resource at a given point in time), Configuration History (a historical record of configuration changes), Configuration Snapshots (a point-in-time backup of recorded resources), and Config Rules (which evaluate compliance for the recorded resource types).
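Querying that history is straightforward with boto3. A sketch that walks the configuration timeline of a hypothetical EC2 instance:

```python
import boto3

config = boto3.client("config")

# Requires a Config recorder to be set up and recording EC2 instances.
history = config.get_resource_config_history(
    resourceType="AWS::EC2::Instance",
    resourceId="i-0123456789abcdef0",  # hypothetical instance
)

for item in history["configurationItems"]:
    print(item["configurationItemCaptureTime"], item["configurationItemStatus"])
```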
Enterprise platforms like Datadog and New Relic offer comprehensive observability beyond CloudWatch's capabilities. Specialized tools target specific needs - Prometheus for metrics, Grafana for visualization, Honeycomb for distributed tracing. Choose based on your team's expertise and actual requirements, not feature lists.
Multi-account monitoring requires centralized aggregation using AWS Organizations. Hybrid environments need tools that normalize data from various sources - consider OpenTelemetry to avoid vendor lock-in. Microservices demand distributed tracing with correlation IDs to understand request flows across services.
Focus on golden signals: latency, traffic, errors, and saturation. Design dashboards that tell stories at a glance using clear visual hierarchy. Every alert must be actionable and tied to remediation steps - avoid alert fatigue through careful threshold tuning and composite alarms.
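Composite alarms implement exactly that: one page only when several underlying conditions hold at once. A sketch that combines two hypothetical existing alarms:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Page on-call only when the latency AND error-rate alarms fire together,
# instead of two separate (and possibly noisy) notifications.
cloudwatch.put_composite_alarm(
    AlarmName="checkout-degraded",
    AlarmRule='ALARM("checkout-high-latency") AND ALARM("checkout-error-rate")',
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],  # placeholder topic
)
```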
Data volume drives costs - implement retention policies and use metric filters aggressively. Alert optimization requires continuous refinement based on actual incidents. Plan for API limits and authentication complexity when integrating multiple tools.
Define success criteria before selecting tools. Phase your rollout: infrastructure monitoring first, then application metrics, finally distributed tracing. Test regularly - simulate failures and practice incident response. Untested monitoring is expensive decoration.
CloudWatch charges add up through custom metrics, logs, and dashboards. Use Logs Insights sparingly, implement log sampling, and leverage metric math instead of creating new metrics. Consider open-source alternatives for non-critical workloads.
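Metric math lets you derive values at query time instead of paying to store a new custom metric. A sketch that computes a Lambda error rate from two metrics AWS already publishes, using a hypothetical function name:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Derive the error rate on the fly: no custom metric, no extra storage cost.
response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "errors",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": "Errors",
                    "Dimensions": [{"Name": "FunctionName", "Value": "checkout"}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,  # input only, not returned
        },
        {
            "Id": "invocations",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": "Invocations",
                    "Dimensions": [{"Name": "FunctionName", "Value": "checkout"}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "error_rate",
            "Expression": "100 * errors / invocations",
            "Label": "Error rate (%)",
        },
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=3),
    EndTime=datetime.now(timezone.utc),
)

print(response["MetricDataResults"][0]["Values"])
```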
AI transforms monitoring from reactive to predictive through anomaly detection and automated root cause analysis. The shift to observability means understanding system behavior, not just collecting metrics. This requires new skills and data collection approaches.
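CloudWatch's built-in anomaly detection is the most accessible starting point: it learns a metric's normal band and alarms on deviations rather than a fixed threshold. A sketch of an anomaly alarm on ALB response time, with a hypothetical load balancer dimension:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when latency rises above a band the model learned from history.
# The "2" is the band width in standard deviations.
cloudwatch.put_metric_alarm(
    AlarmName="alb-latency-anomaly",
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="band",
    Metrics=[
        {
            "Id": "latency",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": "TargetResponseTime",
                    "Dimensions": [
                        {"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}  # hypothetical
                    ],
                },
                "Period": 300,
                "Stat": "Average",
            },
            "ReturnData": True,
        },
        {
            "Id": "band",
            "Expression": "ANOMALY_DETECTION_BAND(latency, 2)",
            "Label": "Expected latency band",
        },
    ],
)
```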
Audit current coverage, identify gaps, and create a prioritized roadmap. Build monitoring into development - include it in design documents and create alerts alongside features. Remember that monitoring evolves with your architecture. Stay curious and ensure your monitoring serves business objectives.