EC2 Monitoring

Status Checks

Status monitoring helps quickly determine whether EC2 has detected any problems that might prevent instances from running applications.

EC2 performs automated checks on every running EC2 instance to identify hardware and software issues.
Status checks are performed every minute and each returns a pass or a fail status.

If all checks pass, the overall status of the instance is OK.
If one or more checks fail, the overall status is Impaired.
Status checks are built into EC2, so they cannot be disabled or deleted.
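The pass/fail rule above can be sketched in a few lines of Python. This is only an illustration: the function name `overall_status` and the dictionary keys are hypothetical and are not part of any AWS API.

```python
from typing import Mapping


def overall_status(check_passed: Mapping[str, bool]) -> str:
    """Return "ok" when every status check passed, otherwise "impaired"."""
    return "ok" if all(check_passed.values()) else "impaired"


# Hypothetical results for the three check types on one instance.
checks = {"system": True, "instance": False, "attached_ebs": True}
print(overall_status(checks))  # prints "impaired"
```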

There are three types of status checks:
- System status checks
- Instance status checks
- Attached EBS status checks

Status check data augments the information that EC2 already provides about the intended state of each instance (such as pending, running, and stopping) as well as the utilization metrics that CloudWatch monitors (CPU utilization, network traffic, and disk activity).
CloudWatch alarms can be created that trigger based on the result of the status checks. For example, an alarm can warn when status checks fail on a specific instance.
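As a rough sketch of such an alarm, the helper below builds the keyword arguments for the CloudWatch `put_metric_alarm` call. The helper name, the alarm name, the instance ID, and the SNS topic ARN are placeholders chosen for illustration, and the two-period evaluation window is an assumption rather than a recommendation from the notes.

```python
def status_check_alarm_params(instance_id: str, sns_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments for a failed-status-check alarm."""
    return {
        "AlarmName": f"status-check-failed-{instance_id}",  # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # status check metrics are reported every minute
        "EvaluationPeriods": 2,  # assumption: two consecutive failing minutes
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


params = status_check_alarm_params(
    "i-0123456789abcdef0",                            # placeholder instance ID
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # placeholder topic ARN
)
# With boto3 installed and credentials configured, the alarm could be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```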

System Status Checks

System status checks monitor the AWS systems required to use the instance to ensure they are working properly.
They detect underlying problems with the instance that require AWS involvement to repair.
System status check failures might be due to
- Loss of network connectivity
- Loss of system power
- Software issues on the physical host
- Hardware issues on the physical host

When a system status check fails, one can
- check the AWS Health Dashboard for any scheduled critical maintenance by AWS on the instance’s host
- wait for AWS to fix the issue
- or resolve it by stopping and starting an EBS-backed instance (which typically moves it to a new host), or by terminating and replacing an instance store-backed instance

Instance Status Checks

Instance status checks monitor the software and network configuration of the individual instance.
They detect problems that require your involvement to repair.
Instance status check failures might be due to
- Failed system status checks
- Misconfigured networking or startup configuration
- Exhausted memory
- Corrupted file system
- Incompatible kernel

When an instance status check fails, it can be resolved by either rebooting the instance or making modifications to the operating system.

Attached EBS Status Checks

Attached EBS status checks monitor whether the EBS volumes attached to an instance are reachable and able to complete I/O operations.
They are available for Nitro-based instances only.
They help detect issues where the instance cannot communicate with one or more attached EBS volumes.
An attached EBS status check failure might be due to
- Hardware or software issues on the storage subsystem underlying the EBS volume
- Hardware issues on the physical host impacting reachability to EBS
The metric StatusCheckFailed_AttachedEBS is available at a 1-minute frequency at no additional charge.

This metric can be used with CloudWatch alarms and Auto Scaling health checks to replace instances with impaired EBS volumes.

EC2 Instance Recovery

Simplified Automatic Recovery
- enabled by default during instance launch on supported instances.
- automatically moves the instance from the impaired host to a different host when a system status check failure is detected.
- recovered instance is identical to the original (instance ID, private IP, Elastic IP, metadata, placement group).
- does not require a CloudWatch alarm to be configured.
- works only for system status check failures, not for instance status check failures.
- available for over 90% of deployed EC2 instances.

CloudWatch Action Based Recovery
- can be configured optionally after instance launch using CloudWatch alarms.
- provides the ability to set a recovery action on a CloudWatch alarm monitoring the StatusCheckFailed_System metric.
- provides more granular control over recovery conditions and notification.
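A CloudWatch action based recovery alarm might look like the sketch below, which builds `put_metric_alarm` arguments with the EC2 recover action ARN (`arn:aws:automate:<region>:ec2:recover`). The helper name, alarm name, and the optional notification topic are illustrative assumptions.

```python
from typing import Optional


def recovery_alarm_params(instance_id: str, region: str,
                          notify_topic_arn: Optional[str] = None) -> dict:
    """Build put_metric_alarm arguments that recover an impaired instance."""
    actions = [f"arn:aws:automate:{region}:ec2:recover"]
    if notify_topic_arn:
        # Optional notification alongside the recovery action.
        actions.append(notify_topic_arn)
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,  # assumption: two consecutive failing minutes
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": actions,
    }


recover = recovery_alarm_params("i-0123456789abcdef0", "us-east-1")
print(recover["AlarmActions"])  # prints ['arn:aws:automate:us-east-1:ec2:recover']
```

Passing these arguments to `put_metric_alarm` on a boto3 CloudWatch client would create the alarm. The recover action only applies to system status check failures, which is why the metric is `StatusCheckFailed_System`.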

CloudWatch Monitoring

CloudWatch helps monitor EC2 instances by collecting and processing raw data from EC2 into readable, near real-time metrics.

Statistics are retained for up to 15 months (1-minute data points for 15 days, 5-minute data points for 63 days, and 1-hour data points for 455 days) so that historical information can be accessed and used to gain a better perspective on how the application or service is performing.
By default, basic monitoring is enabled and EC2 metric data is sent to CloudWatch in 5-minute periods automatically.
Detailed monitoring can be enabled on the EC2 instance, at an additional charge, which sends data to CloudWatch in 1-minute periods.
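To make the difference between the two monitoring levels concrete, here is a small arithmetic sketch. The constant and function names are made up for this example.

```python
BASIC_PERIOD_SECONDS = 300     # basic monitoring: one data point every 5 minutes
DETAILED_PERIOD_SECONDS = 60   # detailed monitoring: one data point every minute


def datapoints_per_hour(detailed: bool) -> int:
    """Number of data points a single metric produces per hour."""
    period = DETAILED_PERIOD_SECONDS if detailed else BASIC_PERIOD_SECONDS
    return 3600 // period


print(datapoints_per_hour(detailed=False))  # prints 12
print(datapoints_per_hour(detailed=True))   # prints 60

# Detailed monitoring can be switched on for running instances with boto3:
#   import boto3
#   boto3.client("ec2").monitor_instances(InstanceIds=["i-0123456789abcdef0"])
```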

Organization-wide Detailed Monitoring Enablement (2026)
- CloudWatch Ingestion enablement rules can automatically enable detailed monitoring for both existing and newly launched EC2 instances matching the rule scope.
- Ensures consistent 1-minute metrics collection across EC2 instances at the organization or account level.

Aggregating Statistics Across Instances/ASG/AMI ID
- Aggregate statistics are available for the instances that have detailed monitoring (at an additional charge) enabled, which provides data in 1-minute periods
- Instances that use basic monitoring are not included in the aggregates.
- CloudWatch does not aggregate data across Regions. Therefore, metrics are completely separate between regions.
- CloudWatch returns statistics for all dimensions in the AWS/EC2 namespace if no dimension is specified
- The technique for retrieving all dimensions across an AWS namespace does not work for custom namespaces published to CloudWatch.
- Statistics include Sum, Average, Minimum, Maximum, and SampleCount
- With custom namespaces, retrieving statistics that include a given data point requires specifying the complete set of dimensions associated with that data point
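The five statistics can be illustrated locally with plain Python. This sketch only mimics what CloudWatch computes over the samples in a period; the function name `summarize` and the sample values are hypothetical.

```python
from typing import Dict, Sequence


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Compute the CloudWatch-style statistics for one period of samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    total = sum(samples)
    return {
        "SampleCount": len(samples),
        "Sum": total,
        "Average": total / len(samples),
        "Minimum": min(samples),
        "Maximum": max(samples),
    }


# Hypothetical CPUUtilization samples from three instances in the same period.
stats = summarize([10.0, 20.0, 60.0])
print(stats["Average"])  # prints 30.0
```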

CloudWatch alarms
- can be created to monitor any one of the EC2 instance’s metrics.
- can be configured to automatically send you a notification when the metric reaches a specified threshold.
- can automatically stop, terminate, reboot, or recover EC2 instances
- can automatically recover an EC2 instance when the instance becomes impaired due to an underlying hardware failure or a problem that requires AWS involvement to repair
- can automatically stop or terminate the instances to save costs (EC2 instances that use an EBS volume as the root device can be stopped
  or terminated, whereas instances that use the instance store as the root device can only be terminated)
- can use EC2ActionsAccess IAM role, which enables AWS to perform stop, terminate, or reboot actions on EC2 instances
- If you have read/write permissions for CloudWatch but not for EC2, alarms can still be created but the stop or terminate actions won’t be performed on the EC2 instance
- Composite Alarms can combine multiple metric alarms into a single alarm for aggregated health, but cannot perform EC2 actions directly.
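The stop versus terminate rule for the two root device types can be captured in a short sketch. The function names are hypothetical; the root device type strings (`ebs`, `instance-store`) follow the values used by the EC2 API, and the action ARN format is `arn:aws:automate:<region>:ec2:<action>`.

```python
from typing import List


def allowed_cost_saving_actions(root_device_type: str) -> List[str]:
    """Return the alarm actions usable to save costs for a root device type."""
    if root_device_type == "ebs":
        return ["stop", "terminate"]
    if root_device_type == "instance-store":
        return ["terminate"]  # instance store-backed instances cannot be stopped
    raise ValueError(f"unknown root device type: {root_device_type}")


def action_arn(region: str, action: str) -> str:
    """Build the EC2 alarm action ARN for stop, terminate, reboot, or recover."""
    return f"arn:aws:automate:{region}:ec2:{action}"


print(allowed_cost_saving_actions("instance-store"))  # prints ['terminate']
print(action_arn("us-east-1", "stop"))  # prints arn:aws:automate:us-east-1:ec2:stop
```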

CloudWatch Agent

The unified CloudWatch agent collects system-level metrics and logs from EC2 instances that are not available through the default hypervisor-level metrics.
Key OS-level metrics collected by the agent include:
- Memory utilization (mem_used_percent)
- Disk usage (disk_used_percent)
- Swap usage
- Process-level metrics (procstat)
EC2 does NOT provide memory or disk usage metrics by default — these require the CloudWatch agent.

The agent can be installed and managed via AWS Systems Manager (SSM).
Its configuration is stored in a JSON file or as an SSM Parameter Store parameter.
Metrics collected by the CloudWatch agent are billed as custom metrics.
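A minimal agent configuration for the memory, disk, and swap metrics listed above might be generated as follows. This is a sketch, not a complete configuration: the helper name, the 60-second interval, and the choice of the `/` mount point are assumptions, and a real deployment would usually add more sections.

```python
import json


def agent_config(interval_seconds: int = 60) -> dict:
    """Build a minimal unified CloudWatch agent metrics configuration."""
    return {
        "metrics": {
            "metrics_collected": {
                "mem": {
                    "measurement": ["mem_used_percent"],
                    "metrics_collection_interval": interval_seconds,
                },
                "disk": {
                    # "used_percent" is published as disk_used_percent
                    "measurement": ["used_percent"],
                    "resources": ["/"],  # assumption: monitor the root volume only
                    "metrics_collection_interval": interval_seconds,
                },
                "swap": {"measurement": ["swap_used_percent"]},
            }
        }
    }


config_json = json.dumps(agent_config(), indent=2)
print(config_json)
```

The resulting JSON can be saved to the agent's configuration file or stored as an SSM Parameter Store parameter.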

In-Console Agent Management (2025/2026)
- CloudWatch provides visibility into agent status across the EC2 fleet directly in the console.
- Automatic detection of supported workloads and recommended monitoring configurations.
- Visual configuration editor for the agent eliminates the need to hand-edit JSON (April 2026).

EC2 Monitoring Metrics

Instance Metrics

CPUUtilization
- % of physical CPU time that EC2 uses to run the instance, including time spent running both user code and EC2 code.
- At a very high level, CPUUtilization is the sum of guest CPUUtilization and hypervisor CPUUtilization.

DiskReadOps
- Completed read operations from all instance store volumes available to the instance in a specified period of time.
- If there are no instance store volumes, the value is 0 or the metric is not reported.

DiskWriteOps
- Completed write operations to all instance store volumes available to the instance in a specified period of time.
- If there are no instance store volumes, the value is 0 or the metric is not reported.

DiskReadBytes
- Bytes read from all instance store volumes available to the instance.
- This metric is used to determine the volume of the data the application reads from the hard disk of the instance.

DiskWriteBytes
- Bytes written to all instance store volumes available to the instance.
- This metric is used to determine the volume of the data the application writes onto the hard disk of the instance.

MetadataNoToken
- The number of times the Instance Metadata Service (IMDS) was successfully accessed using a method that does not use a token (IMDSv1).
- Used to determine if there are any processes accessing instance metadata using IMDSv1, which is less secure than IMDSv2.
- If all requests use token-backed sessions (IMDSv2), the value is 0.

MetadataNoTokenRejected
- The number of times an IMDSv1 call was attempted after IMDSv1 was disabled on the instance.
- Indicates that software on the instance still attempts IMDSv1 calls and needs updating.
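A simple way to act on these two metrics is to check whether any data point is non-zero. The sketch below assumes data points shaped like the `Datapoints` entries returned by `get_metric_statistics` when the `Sum` statistic is requested; the function name and sample data are hypothetical.

```python
from typing import Iterable, Mapping


def uses_imdsv1(datapoints: Iterable[Mapping[str, float]]) -> bool:
    """Return True if any MetadataNoToken data point shows IMDSv1 calls."""
    return any(dp.get("Sum", 0) > 0 for dp in datapoints)


# Hypothetical MetadataNoToken data points for one instance.
quiet = [{"Sum": 0.0}, {"Sum": 0.0}]
noisy = [{"Sum": 0.0}, {"Sum": 3.0}]
print(uses_imdsv1(quiet))  # prints False
print(uses_imdsv1(noisy))  # prints True
```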

NetworkIn
- The number of bytes received on all network interfaces by the instance. This metric identifies the volume of incoming network traffic to an application on a single instance.

NetworkOut
- The number of bytes sent out on all network interfaces by the instance. This metric identifies the volume of outgoing network traffic from a single instance.

NetworkPacketsOut is covered below; NetworkPacketsIn
- The number of packets received on all network interfaces by the instance.
- This metric is available for basic monitoring only (5-minute periods).

NetworkPacketsOut
- The number of packets sent out on all network interfaces by the instance.
- This metric is available for basic monitoring only (5-minute periods).
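Because the packet metrics are reported per 5-minute period, an average packets-per-second (PPS) figure can be derived by dividing the Sum statistic by 300. A minimal sketch, with a hypothetical function name and sample value:

```python
def packets_per_second(sum_packets: float, period_seconds: int = 300) -> float:
    """Convert the Sum of NetworkPacketsIn/Out for a period into average PPS."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return sum_packets / period_seconds


# Hypothetical Sum of NetworkPacketsIn over one 5-minute period.
print(packets_per_second(150_000))  # prints 500.0
```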

CPU Credit Metrics (Burstable Performance Instances)

Applicable to all burstable performance instances (T2, T3, T3a, T4g) — not just T2.

CPU Credit metrics are available at a 5-minute frequency only.

CPUCreditUsage
- The number of CPU credits spent by the instance for CPU utilization.
- One CPU credit equals one vCPU running at 100% utilization for one minute.

CPUCreditBalance
- The number of earned CPU credits that an instance has accrued since it was launched or started.
- For T2 Standard, also includes the number of launch credits accrued.
- When a T3/T3a instance stops, the CPUCreditBalance persists for seven days. When a T2 instance stops, credits are lost.
- Used to determine how long an instance can burst beyond its baseline performance level.
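The credit arithmetic can be worked through in code. The first function follows directly from the definition of a CPU credit. The second is a rough estimate that assumes the instance earns credits at exactly its baseline rate while bursting; the function names and the example numbers (2 vCPUs, a 20% baseline, a balance of 60 credits) are illustrative assumptions.

```python
def credits_spent(vcpus: int, utilization_pct: float, minutes: float) -> float:
    """CPU credits consumed: one credit = one vCPU at 100% for one minute."""
    return vcpus * utilization_pct / 100 * minutes


def burst_minutes(balance: float, vcpus: int,
                  utilization_pct: float, baseline_pct: float) -> float:
    """Estimate how long a credit balance lasts above the baseline."""
    net_spend_per_minute = vcpus * (utilization_pct - baseline_pct) / 100
    if net_spend_per_minute <= 0:
        return float("inf")  # at or below baseline the balance never drains
    return balance / net_spend_per_minute


print(credits_spent(1, 100, 1))       # prints 1.0
print(credits_spent(2, 50, 2))        # prints 2.0
print(burst_minutes(60, 2, 70, 20))   # prints 60.0
```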

CPUSurplusCreditBalance (Unlimited mode only)
- The number of surplus credits spent when the CPUCreditBalance is zero.
- Surplus credits are paid down by earned CPU credits.
- If surplus credits exceed the maximum earnable in a 24-hour period, additional charges apply.

CPUSurplusCreditsCharged (Unlimited mode only)
- The number of surplus credits that are not paid down and incur an additional charge.
- Charged when surplus credits exceed 24-hour maximum, instance is stopped/terminated, or switched from unlimited to standard mode.

Amazon EBS Metrics for Nitro-based Instances

Available for EBS volumes attached to Nitro-based instances (non-bare-metal).
EBSReadOps / EBSWriteOps – Completed read/write operations from all attached EBS volumes.
EBSReadBytes / EBSWriteBytes – Bytes read from/written to all attached EBS volumes.

EBSIOBalance%
- Percentage of I/O credits remaining in the burst bucket.
- Available for basic monitoring only.
- Available for some *.4xlarge and smaller instance sizes that burst to their maximum performance for only 30 minutes at least once every 24 hours.

EBSByteBalance%
- Percentage of throughput credits remaining in the burst bucket.
- Available for basic monitoring only.
- Available for some *.4xlarge and smaller instance sizes that burst to their maximum performance for only 30 minutes at least once every 24 hours.

InstanceEBSIOPSExceededCheck
- Reports whether the application attempted to drive IOPS exceeding the maximum EBS IOPS limits for the instance.
- Values: 0 (not exceeded) or 1 (exceeded).

InstanceEBSThroughputExceededCheck
- Reports whether the application attempted to drive throughput exceeding the maximum EBS throughput limits for the instance.
- Values: 0 (not exceeded) or 1 (exceeded).

Status Check Metrics

Available at a 1-minute frequency at no charge by default.

StatusCheckFailed
- Reports whether any of the status checks has failed in the last minute.
- Values: 0 (passed) or 1 (failed).

StatusCheckFailed_Instance
- Reports whether the instance has passed the EC2 instance status check in the last minute.
- Values: 0 (passed) or 1 (failed).

StatusCheckFailed_System
- Reports whether the instance has passed the EC2 system status check in the last minute.
- Values: 0 (passed) or 1 (failed).

StatusCheckFailed_AttachedEBS
- Reports whether the instance has passed the attached EBS status check in the last minute.
- Values: 0 (passed) or 1 (failed).
- Available for Nitro-based instances only.

Accelerator Metrics

GPUPowerUtilization
- Active power usage as a percentage of maximum active power.
- Available for supported accelerated computing instances only.

CloudWatch Network Flow Monitor

Launched at re:Invent 2024 as part of CloudWatch Network Monitoring.

Provides near real-time visibility into network performance (packet loss and latency) for traffic between EC2 instances, EKS workloads, and AWS services (S3, DynamoDB).
Uses fully-managed agents installed on EC2 instances to collect TCP-based performance metrics.
Agents send aggregated metrics to the backend approximately every 30 seconds.

Top contributors feature identifies network flows with the highest retransmissions or latency to help pinpoint impairments.
Supports multi-account monitoring via AWS Organizations integration.

EC2 Metric Dimensions

InstanceId – Filters data for a specific instance.
InstanceType – Filters data for all instances of a specific type (requires Detailed Monitoring).
ImageId (AMI ID) – Filters data for all instances running a specific AMI (requires Detailed Monitoring).

AutoScalingGroupName – Filters data for all instances in a specified Auto Scaling group.
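These dimensions are passed as filters when retrieving statistics. The sketch below builds arguments for the CloudWatch `get_metric_statistics` call to fetch average CPU utilization over a time window; the helper name, the AMI ID, the 14-day window, and the 1-hour period are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

EC2_DIMENSIONS = {"InstanceId", "InstanceType", "ImageId", "AutoScalingGroupName"}


def average_cpu_query(dimension_name: str, dimension_value: str,
                      days: int = 14, period_seconds: int = 3600) -> dict:
    """Build get_metric_statistics arguments for average CPUUtilization."""
    if dimension_name not in EC2_DIMENSIONS:
        raise ValueError(f"unsupported EC2 dimension: {dimension_name}")
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": dimension_name, "Value": dimension_value}],
        "StartTime": end - timedelta(days=days),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Average"],
    }


# Placeholder AMI ID; aggregating by ImageId needs detailed monitoring enabled.
query = average_cpu_query("ImageId", "ami-0abcdef1234567890")
# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("cloudwatch").get_metric_statistics(**query)
```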

AWS Certification Exam Practice Questions

Questions are collected from the Internet, and the answers are marked as per my knowledge and understanding (which might differ from yours).

AWS services are updated every day, and both the questions and answers might become outdated soon, so research accordingly.

AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated.

Open to further feedback, discussion and correction.

In the basic monitoring package for EC2, Amazon CloudWatch provides the following metrics:
1. Web server visible metrics such as number of failed transaction requests
2. Operating system visible metrics such as memory utilization
3. Database visible metrics such as number of connections
4. Hypervisor visible metrics such as CPU utilization

Which of the following requires a custom CloudWatch metric to monitor?
1. Memory Utilization of an EC2 instance
2. CPU Utilization of an EC2 instance
3. Disk usage activity of an EC2 instance
4. Data transfer of an EC2 instance

A user has configured CloudWatch monitoring on an EBS-backed EC2 instance. If the user has not attached any additional device, which of the following metrics will always show a 0 value?
1. DiskReadBytes
2. NetworkIn
3. NetworkOut
4. CPUUtilization

A user is running a batch process on EBS-backed EC2 instances. The batch process starts a few instances to process Hadoop MapReduce jobs, which can run between 50 and 600 minutes or sometimes longer. The user wants the instances to be terminated only when the process is completed. How can the user configure this with CloudWatch?
1. Setup the CloudWatch action to terminate the instance when the CPU utilization is less than 5%
2. Setup the CloudWatch with Auto Scaling to terminate all the instances
3. Setup a job which terminates all instances after 600 minutes
4. It is not possible to terminate instances automatically

An AWS account owner has set up multiple IAM users. One IAM user only has CloudWatch access. He has set up the alarm action, which stops the EC2 instances when the CPU utilization is below the threshold limit. What will happen in this case?
1. It is not possible to stop the instance using the CloudWatch alarm
2. CloudWatch will stop the instance when the action is executed
3. The user cannot set an alarm on EC2 since he does not have the permission
4. The user can setup the action but it will not be executed if the user does not have EC2 rights

A user has launched 10 instances from the same AMI ID using Auto Scaling. The user is trying to see the average CPU utilization across all instances of the last 2 weeks under the CloudWatch console. How can the user achieve this?
1. View the Auto Scaling CPU metrics (Refer AS Instance Monitoring)
2. Aggregate the data over the instance AMI ID (Works but needs detailed monitoring enabled)
3. The user has to use the CloudWatch analyser to find the average data across instances
4. It is not possible to see the average CPU utilization of the same AMI ID since the instance ID is different

Which EC2 status check type monitors whether the EBS volumes attached to a Nitro-based instance are reachable?
1. System status check
2. Instance status check
3. Attached EBS status check
4. Volume status check

An organization wants to monitor memory utilization of their EC2 instances. Which approach should they use?
1. Enable detailed monitoring on the instances
2. Install the unified CloudWatch agent and configure memory metrics
3. Use the default CloudWatch EC2 metrics
4. Enable enhanced monitoring on the instances

Which CloudWatch metric can help identify if an EC2 instance is still using the less secure IMDSv1 to access instance metadata?
1. StatusCheckFailed_Instance
2. MetadataNoToken
3. CPUCreditBalance
4. NetworkPacketsIn

A company wants to ensure all EC2 instances across their AWS Organization have detailed monitoring enabled. What is the most efficient approach? [Select 2]
1. Manually enable detailed monitoring on each instance
2. Create CloudWatch Ingestion enablement rules scoped to the organization
3. Use enablement rules to automatically enable detailed monitoring for existing and new instances
4. Use AWS Config rules to detect and auto-remediate