AWS cloud monitoring in 2025

Viki Auslender
August 16, 2025
8 min read
  • TL;DR

    Pelanor is reimagining cloud cost management with AI-native FinOps tools that explain spending, not just track it. By rebuilding the data layer from scratch, we deliver true unit economics across complex multi-tenant environments - revealing what each customer, product, or team actually costs. Our AI vision is deeper: we're building systems that truly reason about infrastructure, learning what's normal for your environment and understanding why costs change, not just when.

Amazon’s cloud division has led the market ever since the tech giant stepped into it. While Google and Microsoft have gradually chipped away at its share, AWS still holds 29% of the Western market as of the end of Q2 2025, serving around 4.2 million customers of all sizes.

Large enterprises are certainly among them, but the vast majority (92%) are small businesses. Most of these smaller companies don’t have the resources to build in-house tools to monitor and analyze their AWS usage and cloud spending. And in a world where cloud costs are quickly becoming one of the biggest expenses for organizations of all sizes, having clear visibility and control over that spending is becoming more critical than ever.

What is AWS monitoring?

AWS monitoring is generally divided into two main tracks: Amazon’s own built-in tools, and third-party solutions that extend or enhance those capabilities. In both cases, AWS supports a layered monitoring strategy that covers performance, health, and security across the cloud environment. This relies on the continuous collection and analysis of metrics, logs, and events, not just to keep things running smoothly, but to catch potential issues before they become real problems.

Why AWS monitoring matters today, more than ever

Over the past decade, business operations have steadily migrated to the cloud. In parallel, recent years have witnessed a sharp rise in the adoption of artificial intelligence. AI is no longer just a standalone product; it's an essential tool driving the daily efficiency of both organizations and individuals.

Despite this deep integration between cloud infrastructure and business outcomes, clear policies for managing and overseeing this connection have often lagged. As a result, the rapid pace of technology deployment has outpaced many organizations' ability to effectively monitor and govern their cloud environments.

This makes monitoring an organization's AWS environment no longer optional. It's as fundamental to business health as making sure people's time goes to meaningful, productive work.

Essential AWS metrics you need to track

Monitoring the right metrics is crucial for maintaining healthy, performant, and cost-effective AWS environments. Metrics are the fundamental concept in CloudWatch: a metric is a time-ordered set of data points published to the service. With AWS services automatically publishing metrics to CloudWatch, you have access to thousands of data points - the key is knowing which ones matter most for your specific use case.

AWS metrics fall into two main categories:

Basic Monitoring: Many AWS services publish a default set of metrics to CloudWatch at no charge to customers. These metrics typically arrive at 5-minute intervals and provide essential visibility into resource health.

Detailed Monitoring: Offered by only some services, and it incurs charges. To use it for an AWS service, you must explicitly activate it; in return you get metrics at 1-minute intervals for more granular visibility.
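As a sketch of how this toggle works in practice, the snippet below uses boto3 to switch an EC2 instance between detailed (1-minute) and basic (5-minute) monitoring. The instance ID is a placeholder, not a real resource:

```python
# Sketch: toggling EC2 detailed monitoring with boto3.
# The instance ID below is a placeholder.

def monitoring_request(instance_ids):
    """Build the parameters for EC2's MonitorInstances/UnmonitorInstances calls."""
    return {"InstanceIds": list(instance_ids)}

if __name__ == "__main__":
    import boto3  # kept inside the guard so the helper stays importable

    ec2 = boto3.client("ec2")
    # Enable 1-minute detailed monitoring (billed per instance)...
    ec2.monitor_instances(**monitoring_request(["i-0123456789abcdef0"]))
    # ...and turn it back off when basic monitoring is enough.
    ec2.unmonitor_instances(**monitoring_request(["i-0123456789abcdef0"]))
```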

Understanding which metrics to prioritize helps you:

  • Detect and resolve issues faster
  • Optimize resource utilization
  • Control costs effectively
  • Ensure application performance meets SLAs

Infrastructure metrics that matter

Observability in AWS starts where everything else does - the infrastructure. Metrics are your early warning system, your truth-tellers. 

EC2 feeds CloudWatch with essential signals: CPU usage, network traffic in and out, disk operations, and system health checks. Installing the CloudWatch agent on your instances opens the door to OS-level metrics such as memory usage, disk space, and swap activity.
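To make the EC2 signals concrete, here is a minimal sketch that pulls average CPU utilization for one instance over the last few hours via CloudWatch's GetMetricStatistics API. The instance ID is a placeholder:

```python
from datetime import datetime, timedelta, timezone

def cpu_stats_request(instance_id, hours=3, period=300):
    """Parameters for GetMetricStatistics: average CPU for one EC2
    instance over the last few hours, at 5-minute resolution."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": period,          # seconds; 300 matches basic monitoring
        "Statistics": ["Average"],
    }

if __name__ == "__main__":
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    resp = cloudwatch.get_metric_statistics(**cpu_stats_request("i-0123456789abcdef0"))
    # Datapoints come back unordered; sort by timestamp before reading.
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], round(point["Average"], 1))
```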

EBS volumes have their own story to tell. Metrics like read/write throughput and IOPS show you how hard your storage is working. On gp3, you’ll want to watch throughput percentage. On gp2, BurstBalance quietly warns when your performance cushion is wearing thin.
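A practical way to act on that gp2 warning is a CloudWatch alarm on BurstBalance. Below is a sketch of the alarm parameters; the volume ID and the 20% threshold are illustrative choices, not AWS defaults:

```python
def burst_balance_alarm(volume_id, threshold=20.0):
    """Parameters for a CloudWatch alarm that fires when a gp2 volume's
    BurstBalance drops below `threshold` percent."""
    return {
        "AlarmName": f"gp2-burst-balance-low-{volume_id}",
        "Namespace": "AWS/EBS",
        "MetricName": "BurstBalance",
        "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,    # three consecutive 5-minute periods
        "Threshold": threshold,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",
    }

if __name__ == "__main__":
    import boto3

    # The volume ID is a placeholder.
    boto3.client("cloudwatch").put_metric_alarm(
        **burst_balance_alarm("vol-0123456789abcdef0")
    )
```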

Then there’s RDS. It may be managed, but it’s not magic. CloudWatch gives you a look into CPU load, memory availability, connection counts, and disk latency. If your queries are dragging, this is where you start digging.

Application Load Balancers bring their own layer of truth. You’ll see how many requests hit your endpoints, how fast targets respond, and whether errors are on you or the client. Move up the stack and the focus shifts. With Lambda, it’s all about invocations, duration, errors, and throttles - in other words, is your function working, and is it working fast enough? API Gateway metrics complement that with insights on request volume, latency, and error codes, plus cache hit ratios that reveal how much work you're avoiding.

And if you're running containers on ECS or EKS, metrics are how you keep chaos from turning into downtime. Track running tasks, service counts, container-level CPU and memory. Container Insights turns scattered signals into something human-readable.

Application performance metrics

Application metrics provide insights into how your applications perform and help identify bottlenecks or issues affecting user experience. According to the CloudWatch documentation, these metrics include:

1. Lambda metrics

  • Invocations - number of times a function is invoked
  • Duration - time taken to execute the function
  • Errors - number of invocations that result in errors
  • Throttles - number of throttled invocation requests
  • ConcurrentExecutions - number of function instances processing events

2. API Gateway metrics

  • Count - total number of API calls
  • Latency - time between request receipt and response return
  • 4XXError/5XXError - number of client-side/server-side errors
  • CacheHitCount/CacheMissCount - number of cache hits/misses

3. Container metrics (ECS/EKS)

  • ContainerInstanceCount - number of container instances
  • TaskCount - number of running tasks
  • ServiceCount - number of services
  • ContainerCPUUtilization - CPU usage by containers
  • ContainerMemoryUtilization - memory usage by containers

Application-specific metrics

  • Business KPIs (orders processed, user signups)
  • Application performance (response times, queue depths)
  • Error rates and success rates
  • User experience metrics (page load times, transaction completion rates)
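Business KPIs like those above are not emitted by AWS; you publish them yourself as custom metrics. Here is a minimal sketch using PutMetricData - the `MyApp` namespace, metric name, and dimension are illustrative, not standard names:

```python
def order_metric(count, environment="prod"):
    """A custom business KPI (orders processed) as CloudWatch metric data.
    The 'MyApp' namespace and dimension names are illustrative."""
    return {
        "Namespace": "MyApp",
        "MetricData": [{
            "MetricName": "OrdersProcessed",
            "Dimensions": [{"Name": "Environment", "Value": environment}],
            "Value": float(count),
            "Unit": "Count",
        }],
    }

if __name__ == "__main__":
    import boto3

    # Publish one datapoint; CloudWatch aggregates these per period.
    boto3.client("cloudwatch").put_metric_data(**order_metric(42))
```

Custom metrics are billed per metric, so prefer a few well-chosen dimensions over one dimension per user or request.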

Cost and resource utilization metrics

Monitoring cost and utilization metrics is essential for optimizing AWS spending and ensuring efficient resource usage.

Billing and Cost Visibility

Billing metric data is stored in the US East (N. Virginia) Region and represents worldwide charges. This data includes the estimated charges for every AWS service that you use. AWS provides the EstimatedCharges metric, which tracks total costs and breaks them down by service and linked account. These metrics update several times daily, enabling proactive budget monitoring through CloudWatch alarms.
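A common first step is an alarm on EstimatedCharges. The sketch below builds one (the $500 limit is an example; billing alerts must be enabled in the account, and the client must target us-east-1):

```python
def billing_alarm(limit_usd):
    """Alarm on total estimated charges. The EstimatedCharges metric
    exists only in us-east-1 and requires billing alerts to be enabled."""
    return {
        "AlarmName": f"monthly-bill-over-{limit_usd}-usd",
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,           # billing data updates a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(limit_usd),
        "ComparisonOperator": "GreaterThanThreshold",
    }

if __name__ == "__main__":
    import boto3

    # Billing metrics live only in US East (N. Virginia).
    cw = boto3.client("cloudwatch", region_name="us-east-1")
    cw.put_metric_alarm(**billing_alarm(500))
```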

CloudWatch API requests generate costs based on the request type and the number of metrics requested. The main cost drivers include metric storage, API calls (especially GetMetricData from third-party tools), and log ingestion/storage volumes. Understanding these components helps target optimization efforts effectively.

  1. Resource utilization tracking - monitor compute utilization to identify oversized EC2 instances, track Reserved Instance coverage, and optimize Spot Instance usage. For storage, analyze S3 access patterns, identify unutilized EBS volumes, and monitor IOPS consumption to balance performance with cost.
  2. Optimization strategies - to reduce CloudWatch metrics charges, AWS recommends turning off detailed monitoring for instances, Auto Scaling group launch configurations, and API gateways; basic monitoring often suffices and comes free. For logs, shorten the retention period to limit the data stored over time.
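Shortening log retention is a one-call change per log group. The sketch below applies a 30-day policy to a list of hypothetical Lambda log groups (the group names are placeholders):

```python
def short_retention(log_groups, days=30):
    """Yield PutRetentionPolicy parameters for each log group. Valid
    retentionInDays values are fixed by the API (1, 3, 5, 7, 14, 30, ...)."""
    for name in log_groups:
        yield {"logGroupName": name, "retentionInDays": days}

if __name__ == "__main__":
    import boto3

    logs = boto3.client("logs")
    # Placeholder log group names; by default log groups never expire.
    for params in short_retention(["/aws/lambda/checkout", "/aws/lambda/signup"]):
        logs.put_retention_policy(**params)
```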

AWS native monitoring tools

AWS provides several native monitoring tools that work together to give you comprehensive visibility into your cloud infrastructure, applications, and services. These tools integrate seamlessly with AWS services and provide real-time insights, historical data analysis, and proactive alerting capabilities.

The primary AWS native monitoring tools include:

  • Amazon CloudWatch for metrics, logs, and alarms
  • AWS CloudTrail for API activity logging and auditing
  • AWS X-Ray for distributed tracing
  • AWS Config for configuration management and compliance

Each tool serves a specific purpose in the monitoring ecosystem, and together they provide a complete observability solution for AWS environments.

Amazon CloudWatch

A monitoring and management service designed to provide data and actionable insights for Amazon Web Services, hybrid, and on-premises applications and infrastructure resources. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events, providing you with a unified view of AWS resources, applications, and services.

Key Components

  1. Metrics: a time-ordered set of data points that are published to CloudWatch. AWS services automatically send metrics to CloudWatch, and customers can publish custom metrics from their applications.
  2. Namespaces: a container for CloudWatch metrics. Metrics in different namespaces are isolated from each other, so that metrics from different applications are not mistakenly aggregated into the same statistics.
  3. Dimensions: a name/value pair that is part of the identity of a metric. Up to 30 dimensions can be assigned to a metric.
  4. Alarms: an alarm watches a single metric over a specified time period and performs one or more specified actions based on the value of the metric relative to a threshold over time.

AWS CloudTrail

An AWS service that helps you enable operational and risk auditing, governance, and compliance. Actions taken by a user, role, or an AWS service are recorded as events in CloudTrail.

Key features

  • Event History: provides a viewable, searchable, downloadable, and immutable record of the past 90 days of management events in an AWS Region.
  • CloudTrail Lake: a managed data lake for capturing, storing, accessing, and analyzing user and API activity on AWS for audit and security purposes. It stores events for up to 10 years and enables SQL-based queries.
  • Trails: a record of AWS activities, delivering and storing these events in an Amazon S3 bucket, with optional delivery to CloudWatch Logs and Amazon EventBridge.
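As a sketch of the SQL-based queries CloudTrail Lake supports, the snippet below counts API calls by event name via the StartQuery API. The event data store ID and the date filter are placeholders you would replace with values from your account:

```python
# Illustrative CloudTrail Lake SQL; the FROM clause takes an
# event data store ID, substituted below.
QUERY = """
SELECT eventName, count(*) AS calls
FROM {event_data_store_id}
WHERE eventTime > '2025-08-01 00:00:00'
GROUP BY eventName
ORDER BY calls DESC
"""

def lake_query(event_data_store_id):
    """Parameters for CloudTrail Lake's StartQuery API."""
    return {"QueryStatement": QUERY.format(event_data_store_id=event_data_store_id)}

if __name__ == "__main__":
    import boto3

    cloudtrail = boto3.client("cloudtrail")
    # Placeholder event data store ID.
    query_id = cloudtrail.start_query(**lake_query("EXAMPLE-STORE-ID"))["QueryId"]
    # Poll cloudtrail.get_query_results(QueryId=query_id) for rows
    # once the query status reports FINISHED.
```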

AWS distributed tracing

AWS provides distributed tracing capabilities primarily through AWS X-Ray, which helps developers analyze and debug distributed applications.

AWS X-Ray

AWS X-Ray receives data from services as segments, then groups segments that have a common request into traces. X-Ray processes the traces to generate a service graph that provides a visual representation of your application.

Key Concepts:

Segments: The compute resources running your application logic send data about their work as segments. A segment provides the resource's name, details about the request, and details about the work done.

Subsegments: A segment can break down the data about the work done into subsegments. Subsegments provide more granular timing information and details about downstream calls.

Service graph: X-Ray uses the data that applications send to generate a service graph. Each AWS resource that sends data to X-Ray appears as a service in the graph.

Traces: a trace tracks the path of a request through your application, collecting all the segments generated by a single request.
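To ground these concepts, here is a minimal sketch of a raw X-Ray segment document sent via the PutTraceSegments API (in practice you would use the X-Ray SDK or ADOT rather than hand-building documents; the service name and 50 ms duration are invented):

```python
import json
import os
import time

def segment_document(name):
    """A minimal X-Ray segment document. Trace IDs follow the format
    '1-<8 hex epoch seconds>-<24 hex random>'; segment IDs are 16 hex."""
    start = time.time()
    return {
        "name": name,
        "id": os.urandom(8).hex(),
        "trace_id": f"1-{int(start):x}-{os.urandom(12).hex()}",
        "start_time": start,
        "end_time": start + 0.05,  # pretend the work took 50 ms
    }

if __name__ == "__main__":
    import boto3

    doc = segment_document("checkout-service")  # illustrative service name
    boto3.client("xray").put_trace_segments(
        TraceSegmentDocuments=[json.dumps(doc)]
    )
```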

AWS Distro for OpenTelemetry

Enables developers to use a standardized set of open source APIs, SDKs, and agents to instrument their application once and collect correlated metrics and traces for multiple monitoring solutions. 

AWS configuration management

The primary service for configuration management and compliance monitoring in AWS environments.

AWS Config

AWS Config provides a detailed view of the configuration of AWS resources in your AWS account, including how resources relate to one another and how they were configured in the past, so you can see how configurations and relationships change over time. Its key components are:

  • Configuration Items (CIs): the configuration of a resource at a given point in time
  • Configuration History: a historical record of configuration changes
  • Configuration Snapshots: a point-in-time backup of configurations
  • Config Rules: evaluate compliance for the recorded resource types
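As a sketch of Config Rules in practice, the snippet below creates a rule from AWS's managed ENCRYPTED_VOLUMES check, which flags unencrypted EBS volumes (the rule name is our own choice):

```python
def encrypted_volumes_rule():
    """An AWS Config rule backed by the managed ENCRYPTED_VOLUMES check.
    The rule name is illustrative; the source identifier is AWS-managed."""
    return {
        "ConfigRule": {
            "ConfigRuleName": "ebs-volumes-encrypted",
            "Source": {"Owner": "AWS", "SourceIdentifier": "ENCRYPTED_VOLUMES"},
            "Scope": {"ComplianceResourceTypes": ["AWS::EC2::Volume"]},
        }
    }

if __name__ == "__main__":
    import boto3

    # Requires a configuration recorder to already be running in the account.
    boto3.client("config").put_config_rule(**encrypted_volumes_rule())
```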

Third-party solutions

Enterprise platforms like Datadog and New Relic offer comprehensive observability beyond CloudWatch's capabilities. Specialized tools target specific needs - Prometheus for metrics, Grafana for visualization, Honeycomb for distributed tracing. Choose based on your team's expertise and actual requirements, not feature lists.

Architecture patterns

Multi-account monitoring requires centralized aggregation using AWS Organizations. Hybrid environments need tools that normalize data from various sources - consider OpenTelemetry to avoid vendor lock-in. Microservices demand distributed tracing with correlation IDs to understand request flows across services.

Best practices

Focus on golden signals: latency, traffic, errors, and saturation. Design dashboards that tell stories at a glance using clear visual hierarchy. Every alert must be actionable and tied to remediation steps - avoid alert fatigue through careful threshold tuning and composite alarms.

Managing challenges

Data volume drives costs - implement retention policies and use metric filters aggressively. Alert optimization requires continuous refinement based on actual incidents. Plan for API limits and authentication complexity when integrating multiple tools.

Implementation approach

Define success criteria before selecting tools. Phase your rollout: infrastructure monitoring first, then application metrics, finally distributed tracing. Test regularly - simulate failures and practice incident response. Untested monitoring is expensive decoration.

Cost optimization

CloudWatch charges add up through custom metrics, logs, and dashboards. Use Logs Insights sparingly, implement log sampling, and leverage metric math instead of creating new metrics. Consider open-source alternatives for non-critical workloads.
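Metric math in action: instead of publishing a new "error rate" custom metric, derive it at query time from metrics you already have. The sketch below builds a GetMetricData request computing a Lambda error rate (the function name is a placeholder):

```python
from datetime import datetime, timedelta, timezone

def error_rate_query(function_name):
    """GetMetricData query computing a Lambda error rate with metric math
    (100 * errors / invocations) instead of a new custom metric."""
    dims = [{"Name": "FunctionName", "Value": function_name}]

    def metric(metric_name, ident):
        return {
            "Id": ident,  # metric math IDs must start with a lowercase letter
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": dims,
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,  # only return the derived expression
        }

    now = datetime.now(timezone.utc)
    return {
        "MetricDataQueries": [
            metric("Errors", "errors"),
            metric("Invocations", "invocations"),
            {
                "Id": "error_rate",
                "Expression": "100 * errors / invocations",
                "Label": "Error rate (%)",
            },
        ],
        "StartTime": now - timedelta(hours=6),
        "EndTime": now,
    }

if __name__ == "__main__":
    import boto3

    # Placeholder function name.
    resp = boto3.client("cloudwatch").get_metric_data(**error_rate_query("checkout"))
```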

Future directions

AI transforms monitoring from reactive to predictive through anomaly detection and automated root cause analysis. The shift to observability means understanding system behavior, not just collecting metrics. This requires new skills and data collection approaches.

Action plan

Audit current coverage, identify gaps, and create a prioritized roadmap. Build monitoring into development - include it in design documents and create alerts alongside features. Remember that monitoring evolves with your architecture. Stay curious and ensure your monitoring serves business objectives.

Ready to step into the light?