AWS EC2 Auto Scaling Troubleshooting
⚠️ Important: Launch Configurations Deprecated
AWS has deprecated Launch Configurations. As of October 1, 2024, new AWS accounts cannot create launch configurations using any method (Console, API, CLI, or CloudFormation). Existing accounts can still use them but should migrate to Launch Templates, which support all new EC2 features.
This post has been updated to reflect Launch Templates as the current standard. References to launch configurations are maintained for existing deployments.
Exam Question Scenario
EC2 instances fail to launch with Auto Scaling configuration
Auto Scaling Configuration Overview
- Auto Scaling configuration requires the following:
  - Launch Template (recommended) or Launch Configuration (deprecated), which allows you to specify:
    - AMI
    - Instance type (or multiple instance types with mixed instances policy)
    - IAM role (optional)
    - Security group(s)
    - Key pair (optional)
    - Network interfaces and subnet settings
    - EBS volume configuration
    - User data scripts
  - Auto Scaling Group (ASG) configuration, which specifies:
    - VPC and Subnets (Availability Zones) for instance placement
    - Desired, minimum, and maximum capacity
    - Health check type and grace period
    - Scaling policies (target tracking, step, simple, predictive)
    - Instance maintenance policy
    - Load balancer / target group attachments
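The settings listed above map fairly directly onto the parameters of the CreateAutoScalingGroup API. The following is a minimal sketch, assuming boto3 would be used to send the request; the helper name `build_asg_request`, the group and template names, and the subnet IDs are hypothetical placeholders.

```python
def build_asg_request(name, template_name, subnet_ids, min_size, max_size, desired,
                      health_check_type="EC2", grace_period=300,
                      template_version="$Latest"):
    """Return keyword arguments shaped for boto3's create_auto_scaling_group."""
    # Capacity values must be ordered, otherwise the API rejects the request.
    if not (min_size <= desired <= max_size):
        raise ValueError("capacity must satisfy min <= desired <= max")
    if not subnet_ids:
        raise ValueError("at least one subnet is required")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": template_name,
            "Version": template_version,
        },
        # Subnets are passed as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": desired,
        "HealthCheckType": health_check_type,
        "HealthCheckGracePeriod": grace_period,
    }


# Hypothetical names and IDs, for illustration only.
request = build_asg_request("my-asg", "my-template",
                            ["subnet-aaa", "subnet-bbb"], 1, 4, 2)
# With credentials configured, this could be sent as:
#   boto3.client("autoscaling").create_auto_scaling_group(**request)
```

Building the request as a plain dictionary keeps the validation separate from the AWS call, so the capacity check can be exercised without credentials.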
EC2 Instance Launch Failure Troubleshooting
- AMI Issues
  - AMI ID does not exist or has been deregistered
  - AMI is still in a pending state and cannot be used to launch instances
  - AMI is in a different region than the Auto Scaling group
  - AMI permissions do not allow the account to launch from it (private/shared AMI)
- Security Group Issues
  - Security group specified in the launch template does not exist or has been deleted
  - Security group belongs to a different VPC than the one specified in the ASG subnets
- Key Pair Issues
  - Key pair associated with the launch template does not exist or has been deleted
- Auto Scaling Group Configuration Issues
  - Auto Scaling group not found or is incorrectly configured
  - Subnet specified in the ASG does not exist or is invalid
  - AZ configured with the Auto Scaling group is no longer supported or unavailable
- EBS Volume Issues
  - Invalid EBS block device mappings
  - EBS snapshot specified does not exist
  - EBS volume type not supported in the AZ
  - Encrypted EBS volumes require proper KMS key permissions for the service-linked role
- Instance Type & Capacity Issues
  - Instance type is not supported in the specified AZ
  - InsufficientInstanceCapacity – AWS does not have enough capacity for the requested instance type in the AZ
  - Account-level service limits (vCPU limits) for instance types reached in the region
  - Spot Instance capacity unavailable or Spot price exceeds the maximum price specified
- Launch Template Issues
  - Launch template version specified does not exist
  - Launch template contains invalid parameters (e.g., unsupported instance type for a region)
  - IAM instance profile specified does not exist or the ASG service role lacks iam:PassRole permission
- Networking Issues
  - VPC/Subnet has no available IP addresses
  - Placement group constraints cannot be satisfied
  - Network interface configuration conflicts with subnet settings
- Permission Issues
  - Auto Scaling service-linked role (AWSServiceRoleForAutoScaling) lacks required EC2 permissions
  - Custom service-linked role does not have permissions for encrypted volumes, specific VPCs, or EC2 actions
  - SCP (Service Control Policy) blocking required ec2:RunInstances action
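When triaging many failed launches, it can help to bucket each activity's status message into the categories above. The sketch below is a heuristic only: the fragments are a mix of EC2 error code names and common phrases chosen for illustration, real StatusMessage wording varies, and the names `LAUNCH_FAILURE_HINTS` and `classify_launch_failure` are hypothetical.

```python
# Illustrative fragments (lowercase) mapped to the categories in this section.
# These are assumptions for matching, not an exhaustive or official list.
LAUNCH_FAILURE_HINTS = [
    ("invalidamiid", "AMI"),
    ("invalidgroup", "Security group"),
    ("security group", "Security group"),
    ("invalidkeypair", "Key pair"),
    ("key pair", "Key pair"),
    ("invalidsubnetid", "ASG configuration"),
    ("invalidsnapshot", "EBS volume"),
    ("invalidblockdevicemapping", "EBS volume"),
    ("insufficientinstancecapacity", "Instance type & capacity"),
    ("vcpulimitexceeded", "Instance type & capacity"),
    ("insufficientfreeaddressesinsubnet", "Networking"),
    ("unauthorizedoperation", "Permissions"),
    ("not authorized", "Permissions"),
]


def classify_launch_failure(status_message):
    """Return the first matching category, or 'Unclassified' if none match."""
    text = status_message.lower()
    for fragment, category in LAUNCH_FAILURE_HINTS:
        if fragment in text:
            return category
    return "Unclassified"


# Example with a made-up message in the style of a failed scaling activity.
category = classify_launch_failure(
    "The security group 'sg-0abc' does not exist. Launching EC2 instance failed."
)
```

A first-match list keeps the order explicit, so more specific fragments can be placed ahead of broader ones.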
Health Check Failure Troubleshooting
- Auto Scaling supports multiple health check sources:
  - EC2 Status Checks – Default; checks instance system and instance status
  - ELB Health Checks – Target group health checks when integrated with ALB/NLB
  - VPC Lattice Health Checks – Health checks from VPC Lattice target groups
  - Amazon EBS Health Checks – Monitors attached EBS volume status
  - Custom Health Checks – User-defined via set-instance-health API
- Common Health Check Issues:
  - Instances marked unhealthy immediately after launch – health check grace period may be too short
  - Application not ready before grace period expires – increase the grace period or fix slow startup
  - ELB health check failing – verify security group allows health check traffic from LB
  - Instances stuck in a launch/terminate loop – check application health and startup scripts
  - Instances failing system status checks – may indicate underlying hardware issues
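One way to reason about the grace period issues above is to compare the configured value against measured application startup times. This is a rough sketch, assuming startup durations have been collected somewhere as whole seconds; the function names and the 20% margin are hypothetical choices, not AWS guidance.

```python
import math


def grace_period_too_short(grace_seconds, startup_seconds_samples):
    """True if the slowest observed startup exceeds the configured grace period."""
    return grace_seconds < max(startup_seconds_samples)


def recommend_grace_period(startup_seconds_samples, margin_pct=20):
    """Slowest observed startup plus a safety margin, rounded up to whole seconds."""
    slowest = max(startup_seconds_samples)
    return math.ceil(slowest * (100 + margin_pct) / 100)


# Made-up startup measurements in seconds.
samples = [95, 120, 150]
too_short = grace_period_too_short(60, samples)
suggested = recommend_grace_period(samples)
```

The suggested value could then be applied through the group's HealthCheckGracePeriod setting.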
Scaling Policy Troubleshooting
- Scaling Not Triggering:
  - CloudWatch alarm not in ALARM state – verify metric and threshold
  - Scaling processes suspended (AlarmNotification, Launch, or Terminate processes)
  - Cooldown period active – scaling actions wait until cooldown expires
  - ASG already at maximum (scale-out) or minimum (scale-in) capacity
- Predictive Scaling Issues:
  - Insufficient historical data (requires at least 24 hours of load data)
  - Metrics not available in CloudWatch for the prediction
  - Policy in forecast-only mode will not actually scale
- Target Tracking Issues:
  - Metric math expression errors in custom metrics
  - Scale-in disabled unintentionally
  - Conflicting scaling policies (scale-out and scale-in triggering simultaneously – scale-out takes precedence)
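Two of the checks above (suspended processes and capacity limits) can be read straight off a group description. The sketch below assumes a dictionary shaped like one entry of `AutoScalingGroups` in a DescribeAutoScalingGroups response; `scaling_blockers` and `resolve_conflict` are hypothetical helper names, and the second function only illustrates the "largest capacity wins" rule stated above.

```python
SCALING_PROCESSES = {"Launch", "Terminate", "AlarmNotification"}


def scaling_blockers(group):
    """List reasons a scaling action would not take effect for this group."""
    blockers = []
    suspended = {p["ProcessName"] for p in group.get("SuspendedProcesses", [])}
    for name in sorted(suspended & SCALING_PROCESSES):
        blockers.append(f"process suspended: {name}")
    if group["DesiredCapacity"] >= group["MaxSize"]:
        blockers.append("at maximum capacity: cannot scale out")
    if group["DesiredCapacity"] <= group["MinSize"]:
        blockers.append("at minimum capacity: cannot scale in")
    return blockers


def resolve_conflict(proposed_capacities):
    """When several policies apply at once, the largest resulting capacity wins."""
    return max(proposed_capacities)


# Made-up group description for illustration.
group = {
    "MinSize": 1,
    "MaxSize": 4,
    "DesiredCapacity": 4,
    "SuspendedProcesses": [{"ProcessName": "AlarmNotification"}],
}
blockers = scaling_blockers(group)
```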
Instance Refresh Troubleshooting
- Instance refresh allows rolling updates to replace instances with new launch template versions
- Common Issues:
  - Refresh stuck or slow – minimum healthy percentage too high, leaving no room for replacement
  - New instances failing health checks – launch template changes may have introduced issues
  - Checkpoint failure – application not ready within checkpoint delay period
  - Rollback triggered – instances launched with new configuration failing health checks
  - Cannot start refresh – another refresh or operation already in progress
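The settings behind these issues (minimum healthy percentage, checkpoints, rollback) are passed as the `Preferences` of a StartInstanceRefresh request. Below is a minimal sketch of assembling them, assuming boto3 would send the request; the helper name and default values are hypothetical, and the checks reflect my understanding that checkpoint percentages must be ascending.

```python
def build_refresh_preferences(min_healthy_pct=90, instance_warmup=300,
                              checkpoints=None, checkpoint_delay=600,
                              auto_rollback=True):
    """Return a Preferences dict for an instance refresh request."""
    if not 0 <= min_healthy_pct <= 100:
        raise ValueError("min_healthy_pct must be between 0 and 100")
    prefs = {
        "MinHealthyPercentage": min_healthy_pct,
        "InstanceWarmup": instance_warmup,
        "AutoRollback": auto_rollback,
    }
    if checkpoints:
        # Checkpoints are percentages of the group, in strictly ascending order.
        if list(checkpoints) != sorted(set(checkpoints)):
            raise ValueError("checkpoints must be unique and ascending")
        if any(not 1 <= c <= 100 for c in checkpoints):
            raise ValueError("checkpoints must be between 1 and 100")
        prefs["CheckpointPercentages"] = list(checkpoints)
        prefs["CheckpointDelay"] = checkpoint_delay
    return prefs


prefs = build_refresh_preferences(checkpoints=[50, 100])
# With credentials configured, this could be used as:
#   boto3.client("autoscaling").start_instance_refresh(
#       AutoScalingGroupName="my-asg", Strategy="Rolling", Preferences=prefs)
```

Lowering `min_healthy_pct` is the usual lever when a refresh is stuck because no instance can be taken out of service.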
Warm Pool Troubleshooting
- Warm pools maintain pre-initialized instances for faster scale-out
- Common Issues:
  - Instances not entering warm pool – lifecycle hook actions completing with ABANDON result
  - Warm pool instances not transitioning to InService – health check failures during transition
  - Warm pool size not maintained – check the MaxGroupPreparedCapacity and MinSize settings
  - Hibernated instances failing to resume – instance type or AMI may not support hibernation
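When the warm pool size looks wrong, it helps to work out what size to expect from the settings. The sketch below encodes my understanding of the sizing rule (the pool holds the difference between MaxGroupPreparedCapacity, or the group's maximum size if that is unset, and the desired capacity, but never fewer than MinSize); the function name is hypothetical.

```python
def expected_warm_pool_size(max_size, desired, max_prepared=None, min_size=0):
    """Estimate how many instances the warm pool should hold."""
    # Fall back to the group's maximum size when no prepared-capacity cap is set.
    ceiling = max_size if max_prepared is None else max_prepared
    dynamic_size = max(0, ceiling - desired)
    return max(min_size, dynamic_size)


# Group with max size 10 and desired capacity 4, no warm pool overrides.
default_size = expected_warm_pool_size(10, 4)
# Same group with MaxGroupPreparedCapacity set to 8.
capped_size = expected_warm_pool_size(10, 4, max_prepared=8)
# Desired capacity close to the maximum, with a warm pool MinSize of 3.
floored_size = expected_warm_pool_size(10, 9, min_size=3)
```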
Capacity Rebalancing (Spot) Troubleshooting
- Capacity Rebalancing proactively replaces Spot Instances at risk of interruption
- Common Issues:
  - Frequent instance replacements – diversify instance types (recommend 10+ types) across multiple AZs
  - Replacement instances also getting rebalance recommendations – enable attribute-based instance type selection (ABS) for broader diversification
  - Capacity not maintained during rebalancing – ASG temporarily exceeds desired capacity (by design) to launch replacements before terminating at-risk instances
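Diversification through attribute-based instance type selection is expressed as a mixed instances policy whose override carries `InstanceRequirements` instead of a fixed instance type. The following is a sketch of such a policy, assuming an all-Spot group and boto3 as the client; the helper name, template name, and vCPU/memory ranges are hypothetical.

```python
def build_spot_mixed_policy(template_name, min_vcpus, max_vcpus, min_memory_mib,
                            version="$Latest"):
    """Return a MixedInstancesPolicy dict using attribute-based selection."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": template_name,
                "Version": version,
            },
            # One override describing requirements rather than listing types.
            "Overrides": [{
                "InstanceRequirements": {
                    "VCpuCount": {"Min": min_vcpus, "Max": max_vcpus},
                    "MemoryMiB": {"Min": min_memory_mib},
                }
            }],
        },
        "InstancesDistribution": {
            # 0% On-Demand above base capacity means the group runs on Spot.
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


policy = build_spot_mixed_policy("my-template", 2, 8, 4096)
# With credentials configured, this could be applied as:
#   boto3.client("autoscaling").update_auto_scaling_group(
#       AutoScalingGroupName="my-asg",
#       MixedInstancesPolicy=policy,
#       CapacityRebalance=True)
```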
Zonal Shift Troubleshooting
- Zonal shift (Amazon ARC) allows shifting traffic away from an impaired AZ
- Common Issues:
  - Zonal shift not available – ensure ASG has zonal shift enabled (AvailabilityZoneImpairmentPolicy)
  - Instances still terminating in shifted zone – check health check behavior setting (IgnoreUnhealthy vs ReplaceUnhealthy)
  - Insufficient capacity after shift – remaining AZs must have capacity to handle full load; consider cross-AZ pre-provisioning
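The two settings named above live in the group's `AvailabilityZoneImpairmentPolicy`, and the capacity concern is simple arithmetic. The sketch below assumes the load is spread evenly across AZs; both function names are hypothetical.

```python
import math


def build_impairment_policy(enabled=True, behavior="IgnoreUnhealthy"):
    """Return an AvailabilityZoneImpairmentPolicy dict for the group."""
    if behavior not in ("IgnoreUnhealthy", "ReplaceUnhealthy"):
        raise ValueError("unknown health check behavior")
    return {
        "ZonalShiftEnabled": enabled,
        "ImpairedZoneHealthCheckBehavior": behavior,
    }


def per_az_capacity_after_shift(desired, az_count):
    """Instances each remaining AZ must hold if one AZ is shifted away."""
    if az_count < 2:
        raise ValueError("need at least two AZs to shift away from one")
    return math.ceil(desired / (az_count - 1))


impairment_policy = build_impairment_policy()
# 12 instances over 3 AZs: each of the 2 remaining AZs must be able to run 6.
needed = per_az_capacity_after_shift(12, 3)
```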
Troubleshooting Commands
- Use describe-scaling-activities to retrieve error messages:
    aws autoscaling describe-scaling-activities --auto-scaling-group-name my-asg
- Scaling activities log is retained for 6 weeks
- Check the StatusCode (Successful, Failed, Cancelled) and StatusMessage fields for details
- Use --include-deleted-groups to view activities for deleted ASGs
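The same lookup can be scripted. This is a minimal sketch, assuming boto3 is installed and credentials are configured for the fetch step; `failed_activities` and `fetch_activities` are hypothetical helper names, and the filtering works on any dictionary shaped like the DescribeScalingActivities response.

```python
def failed_activities(response):
    """Return (ActivityId, StatusMessage) pairs for activities that failed."""
    return [
        (activity["ActivityId"], activity.get("StatusMessage", ""))
        for activity in response.get("Activities", [])
        if activity.get("StatusCode") == "Failed"
    ]


def fetch_activities(group_name, include_deleted=False):
    """Call the API for one group. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the filter above works without it

    client = boto3.client("autoscaling")
    return client.describe_scaling_activities(
        AutoScalingGroupName=group_name,
        IncludeDeletedGroups=include_deleted,
    )


# Made-up response for illustration; a real one would come from fetch_activities.
sample = {
    "Activities": [
        {"ActivityId": "a1", "StatusCode": "Failed",
         "StatusMessage": "The key pair 'k' does not exist"},
        {"ActivityId": "a2", "StatusCode": "Successful"},
    ]
}
failures = failed_activities(sample)
```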
Best Practices to Avoid Launch Failures
- Use Launch Templates (not launch configurations) for access to all new features
- Specify multiple instance types (10+) using mixed instances policy or attribute-based instance type selection
- Configure instances across multiple Availability Zones
- Use attribute-based instance type selection (ABS) to automatically select instance types matching compute requirements
- Enable Capacity Rebalancing for Spot Instance workloads
- Set appropriate health check grace period based on application startup time
- Monitor ASG activities via CloudWatch metrics and scaling activity history
- Use instance maintenance policy to control replacement behavior (launch-before-terminate vs terminate-before-launch)
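The replacement behavior in the last item is controlled by two percentages in the group's `InstanceMaintenancePolicy`. The sketch below reflects my understanding of how those values map to the two behaviors and of their valid ranges; treat the range checks and the example values (110 and 90) as assumptions, and the function names as hypothetical.

```python
def build_maintenance_policy(min_healthy_pct, max_healthy_pct):
    """Return an InstanceMaintenancePolicy dict after basic sanity checks."""
    if not 0 <= min_healthy_pct <= 100:
        raise ValueError("min_healthy_pct must be between 0 and 100")
    if not 100 <= max_healthy_pct <= 200:
        raise ValueError("max_healthy_pct must be between 100 and 200")
    if max_healthy_pct - min_healthy_pct > 100:
        raise ValueError("the two percentages may differ by at most 100")
    return {
        "MinHealthyPercentage": min_healthy_pct,
        "MaxHealthyPercentage": max_healthy_pct,
    }


def describe_behavior(policy):
    """Name the replacement behavior implied by the two percentages."""
    low = policy["MinHealthyPercentage"]
    high = policy["MaxHealthyPercentage"]
    if low == 100 and high > 100:
        return "launch-before-terminate"
    if low < 100 and high == 100:
        return "terminate-before-launch"
    return "custom"


# Capacity never drops below desired; up to 10% extra may run during replacement.
availability_first = describe_behavior(build_maintenance_policy(100, 110))
# Capacity never exceeds desired; up to 10% may be out of service.
cost_first = describe_behavior(build_maintenance_policy(90, 100))
```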
Certification Exam Tips
- Know the difference between launch templates and launch configurations and that launch configurations are deprecated
- Understand InsufficientInstanceCapacity errors and the recommendation to use multiple instance types
- Know health check types (EC2, ELB, VPC Lattice, EBS, Custom) and grace period behavior
- Understand that scaling activities can be viewed with the describe-scaling-activities command (DescribeScalingActivities API)
- Know that conflicting scaling policies resolve in favor of scale-out for availability
- Understand instance refresh rollback behavior and checkpoint mechanisms