AWS EC2 Auto Scaling Troubleshooting

⚠️ Important: Launch Configurations Deprecated

AWS has deprecated Launch Configurations. As of October 1, 2024, new AWS accounts cannot create launch configurations using any method (Console, API, CLI, or CloudFormation). Existing accounts can still use them but should migrate to Launch Templates, which support all new EC2 features.

This post has been updated to reflect Launch Templates as the current standard. References to launch configurations are maintained for existing deployments.

Exam Question Scenario

EC2 instances fail to launch with Auto Scaling configuration

Auto Scaling Configuration Overview

  • Auto Scaling configuration requires the following:
    • Launch Template (recommended) or Launch Configuration (deprecated) which allows you to specify:
      • AMI
      • Instance type (or multiple instance types with mixed instances policy)
      • IAM role (optional)
      • Security group(s)
      • Key pair (optional)
      • Network interfaces and subnet settings
      • EBS volume configuration
      • User data scripts
    • Auto Scaling Group (ASG) configuration specifies:
      • VPC and Subnets (Availability Zones) for instance placement
      • Desired, minimum, and maximum capacity
      • Health check type and grace period
      • Scaling policies (target tracking, step, simple, predictive)
      • Instance maintenance policy
      • Load balancer / target group attachments

EC2 Instance Launch Failure Troubleshooting

  • AMI Issues
    • AMI ID does not exist or has been deregistered
    • AMI is still in a pending state and cannot be used to launch instances
    • AMI is in a different region than the Auto Scaling group
    • AMI permissions do not allow the account to launch from it (private/shared AMI)
  • Security Group Issues
    • Security group specified in the launch template does not exist or has been deleted
    • Security group belongs to a different VPC than the one specified in the ASG subnets
  • Key Pair Issues
    • Key pair associated with the launch template does not exist or has been deleted
  • Auto Scaling Group Configuration Issues
    • Auto Scaling group not found or is incorrectly configured
    • Subnet specified in the ASG does not exist or is invalid
    • AZ configured with the Auto Scaling group is no longer supported or unavailable
  • EBS Volume Issues
    • Invalid EBS block device mappings
    • EBS snapshot specified does not exist
    • EBS volume type not supported in the AZ
    • Encrypted EBS volumes require proper KMS key permissions for the service-linked role
  • Instance Type & Capacity Issues
    • Instance type is not supported in the specified AZ
    • InsufficientInstanceCapacity – AWS does not have enough capacity for the requested instance type in the AZ
    • Account-level service limits (vCPU limits) for instance types reached in the region
    • Spot Instance capacity unavailable or Spot price exceeds the maximum price specified
  • Launch Template Issues
    • Launch template version specified does not exist
    • Launch template contains invalid parameters (e.g., unsupported instance type for a region)
    • IAM instance profile specified does not exist, or the principal that creates or updates the ASG (or a custom service role) lacks iam:PassRole permission for the instance role
  • Networking Issues
    • VPC/Subnet has no available IP addresses
    • Placement group constraints cannot be satisfied
    • Network interface configuration conflicts with subnet settings
  • Permission Issues
    • Auto Scaling service-linked role (AWSServiceRoleForAutoScaling) lacks required EC2 permissions
    • Custom service-linked role does not have permissions for encrypted volumes, specific VPCs, or EC2 actions
    • SCP (Service Control Policy) blocking required ec2:RunInstances action

Health Check Failure Troubleshooting

  • Auto Scaling supports multiple health check sources:
    • EC2 Status Checks – Default; checks instance system and instance status
    • ELB Health Checks – Target group health checks when integrated with ALB/NLB
    • VPC Lattice Health Checks – Health checks from VPC Lattice target groups
    • Amazon EBS Health Checks – Monitors attached EBS volume status
    • Custom Health Checks – User-defined via set-instance-health API
  • Common Health Check Issues:
    • Instances marked unhealthy immediately after launch – health check grace period may be too short
    • Application not ready before grace period expires – increase the grace period or fix slow startup
    • ELB health check failing – verify security group allows health check traffic from LB
    • Instances stuck in a launch/terminate loop – check application health and startup scripts
    • Instances failing system status checks – may indicate underlying hardware issues
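
The grace period issues above can be reasoned about with a simplified model: an unhealthy result only leads to replacement once the grace period has elapsed. The sketch below assumes the grace period starts when the instance enters service and ignores lifecycle hooks and other service details; the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def should_replace(in_service_at, checked_at, grace_seconds, healthy):
    """Simplified model: an unhealthy result counts only after the grace period."""
    if healthy:
        return False
    grace_ends = in_service_at + timedelta(seconds=grace_seconds)
    return checked_at >= grace_ends


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
two_minutes_in = start + timedelta(minutes=2)

# Application still starting at the 2 minute mark, grace period only 60 s
print(should_replace(start, two_minutes_in, 60, healthy=False))
# Same failing check with a 300 s grace period
print(should_replace(start, two_minutes_in, 300, healthy=False))
```

In this model the first call flags the instance for replacement while the second does not, which is the launch/terminate loop that a longer grace period avoids.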

Scaling Policy Troubleshooting

  • Scaling Not Triggering:
    • CloudWatch alarm not in ALARM state – verify metric and threshold
    • Scaling processes suspended (AlarmNotification, Launch, or Terminate processes)
    • Cooldown period active – scaling actions wait until cooldown expires
    • ASG already at maximum (scale-out) or minimum (scale-in) capacity
  • Predictive Scaling Issues:
    • Insufficient historical data (requires at least 24 hours of load data)
    • Metrics not available in CloudWatch for the prediction
    • Policy in forecast-only mode will not actually scale
  • Target Tracking Issues:
    • Metric math expression errors in custom metrics
    • Scale-in disabled unintentionally
    • Conflicting scaling policies (scale-out and scale-in triggering simultaneously – scale-out takes precedence)
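
Several of the "Scaling Not Triggering" causes can be read directly from the group description. The sketch below assumes a dictionary shaped like one entry of a DescribeAutoScalingGroups response (MinSize, MaxSize, DesiredCapacity, SuspendedProcesses); the helper name and the sample values are hypothetical, and alarm state and cooldowns are not covered.

```python
def scaling_blockers(group, direction):
    """List reasons a scale-out ("out") or scale-in ("in") would not proceed."""
    if direction not in ("out", "in"):
        raise ValueError("direction must be 'out' or 'in'")
    blockers = []
    suspended = {p["ProcessName"] for p in group.get("SuspendedProcesses", [])}
    if "AlarmNotification" in suspended:
        blockers.append("AlarmNotification process is suspended")
    if direction == "out":
        if "Launch" in suspended:
            blockers.append("Launch process is suspended")
        if group["DesiredCapacity"] >= group["MaxSize"]:
            blockers.append("group is already at maximum capacity")
    else:
        if "Terminate" in suspended:
            blockers.append("Terminate process is suspended")
        if group["DesiredCapacity"] <= group["MinSize"]:
            blockers.append("group is already at minimum capacity")
    return blockers


# Hypothetical group description for illustration
group = {
    "MinSize": 2,
    "MaxSize": 6,
    "DesiredCapacity": 6,
    "SuspendedProcesses": [{"ProcessName": "Launch"}],
}
print(scaling_blockers(group, "out"))
print(scaling_blockers(group, "in"))
```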

Instance Refresh Troubleshooting

  • Instance refresh allows rolling updates to replace instances with new launch template versions
  • Common Issues:
    • Refresh stuck or slow – minimum healthy percentage too high, which limits how many instances can be replaced at a time (at 100%, instances are replaced one by one)
    • New instances failing health checks – launch template changes may have introduced issues
    • Checkpoint failure – application not ready within checkpoint delay period
    • Rollback triggered – instances launched with new configuration failing health checks
    • Cannot start refresh – another refresh or operation already in progress

Warm Pool Troubleshooting

  • Warm pools maintain pre-initialized instances for faster scale-out
  • Common Issues:
    • Instances not entering warm pool – lifecycle hook actions completing with ABANDON result
    • Warm pool instances not transitioning to InService – health check failures during transition
    • Warm pool size not maintained – check MaxGroupPreparedCapacity and min pool size settings
    • Hibernated instances failing to resume – instance type or AMI may not support hibernation
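
When checking why a warm pool is smaller or larger than expected, it can help to work out the size the settings imply. The sketch below is a simplified model and an assumption about the sizing rule: the pool holds the difference between MaxGroupPreparedCapacity (falling back to the group's maximum size when unset) and the desired capacity, but never less than the pool's minimum size. The function and parameter names are hypothetical.

```python
def warm_pool_target(desired, group_max, max_prepared=None, pool_min=0):
    """Estimate warm pool size from ASG and warm pool settings (simplified)."""
    ceiling = group_max if max_prepared is None else max_prepared
    return max(pool_min, ceiling - desired)


print(warm_pool_target(desired=4, group_max=10))
print(warm_pool_target(desired=4, group_max=10, max_prepared=6))
print(warm_pool_target(desired=8, group_max=10, max_prepared=6, pool_min=1))
```

Under this model, a pool that looks too small is often explained by a MaxGroupPreparedCapacity close to the current desired capacity.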

Capacity Rebalancing (Spot) Troubleshooting

  • Capacity Rebalancing proactively replaces Spot Instances at risk of interruption
  • Common Issues:
    • Frequent instance replacements – diversify instance types (recommend 10+ types) across multiple AZs
    • Replacement instances also getting rebalance recommendations – enable attribute-based instance type selection (ABS) for broader diversification
    • Capacity not maintained during rebalancing – ASG temporarily exceeds desired capacity (by design) to launch replacements before terminating at-risk instances

Zonal Shift Troubleshooting

  • Zonal shift, a capability of Amazon Application Recovery Controller (ARC), allows shifting traffic away from an impaired AZ
  • Common Issues:
    • Zonal shift not available – ensure ASG has zonal shift enabled (AvailabilityZoneImpairmentPolicy)
    • Instances still terminating in shifted zone – check health check behavior setting (IgnoreUnhealthy vs ReplaceUnhealthy)
    • Insufficient capacity after shift – remaining AZs must have capacity to handle full load; consider cross-AZ pre-provisioning

Troubleshooting Commands

  • Use describe-scaling-activities to retrieve error messages, for example: aws autoscaling describe-scaling-activities --auto-scaling-group-name <asg-name>
  • Scaling activities log is retained for 6 weeks
  • Check the StatusCode (for example Successful, Failed, or Cancelled) and StatusMessage fields for details
  • Use --include-deleted-groups to view activities for deleted ASGs
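
To pull only the failures out of the activity history, the response can be filtered on StatusCode. The sketch below works on a dictionary shaped like a DescribeScalingActivities response (an "Activities" list with StatusCode, StatusMessage, and StartTime); the sample data is hypothetical and the message text is a placeholder, not real AWS output.

```python
def failed_activities(response):
    """Return (start time, status message) pairs for failed scaling activities."""
    return [
        (activity.get("StartTime"), activity.get("StatusMessage", ""))
        for activity in response.get("Activities", [])
        if activity.get("StatusCode") == "Failed"
    ]


# Hypothetical sample; in practice the dictionary would come from the CLI's
# JSON output or an SDK call.
sample = {
    "Activities": [
        {"ActivityId": "a-1", "StatusCode": "Successful",
         "StartTime": "2024-05-01T10:00:00Z"},
        {"ActivityId": "a-2", "StatusCode": "Failed",
         "StartTime": "2024-05-01T10:05:00Z",
         "StatusMessage": "placeholder failure text"},
    ]
}

for started, message in failed_activities(sample):
    print(started, message)
```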

Best Practices to Avoid Launch Failures

  • Use Launch Templates (not launch configurations) for access to all new features
  • Specify multiple instance types (10+) using mixed instances policy or attribute-based instance type selection
  • Configure instances across multiple Availability Zones
  • Use attribute-based instance type selection (ABS) to automatically select instance types matching compute requirements
  • Enable Capacity Rebalancing for Spot Instance workloads
  • Set appropriate health check grace period based on application startup time
  • Monitor ASG activities via CloudWatch metrics and scaling activity history
  • Use instance maintenance policy to control replacement behavior (launch-before-terminate vs terminate-before-launch)

Certification Exam Tips

  • Know the difference between launch templates and launch configurations and that launch configurations are deprecated
  • Understand InsufficientInstanceCapacity errors and the recommendation to use multiple instance types
  • Know health check types (EC2, ELB, VPC Lattice, EBS, Custom) and grace period behavior
  • Understand that scaling activities can be viewed with describe-scaling-activities API
  • Know that conflicting scaling policies resolve in favor of scale-out for availability
  • Understand instance refresh rollback behavior and checkpoint mechanisms

AWS – EC2 Troubleshooting Connecting to an Instance

EC2 Connection Methods

AWS provides multiple methods to connect to EC2 instances. Understanding these helps choose the right approach and troubleshoot connection issues effectively.

  • SSH/RDP (Traditional) – Requires open inbound ports (22/3389), key pairs, and a public IP or VPN connectivity.
  • EC2 Instance Connect – SSH from the browser or CLI using a short-lived public key that is pushed to the instance, with access authorized through IAM. Requires the EC2 Instance Connect agent installed and port 22 open from the EC2 Instance Connect service IP ranges.
  • EC2 Instance Connect Endpoint (EIC Endpoint) – Launched June 2023, allows SSH/RDP to instances in private subnets without a public IP, bastion host, or internet gateway. Creates a private tunnel through an endpoint in the VPC. No additional cost.
  • AWS Systems Manager Session Manager – Provides shell access without opening inbound ports, managing SSH keys, or requiring a public IP. Uses the SSM Agent and IAM for authentication. All sessions are logged and auditable. AWS recommends this as the preferred method for EC2 access.
  • EC2 Serial Console – Provides low-level serial port access for troubleshooting boot, network, and OS configuration issues even when SSH/RDP is unavailable. Does not require network connectivity to the instance.

Common Causes for Connection Issues

  1. Security Group misconfiguration – Inbound rules must allow SSH (port 22) or RDP (port 3389) traffic from your IP address. A VPC's default security group only allows inbound traffic from other members of the same group, so SSH from your IP is blocked until you add a rule.
  2. Network ACL (NACL) misconfiguration – NACLs are stateless. Both inbound rules (allow traffic on port 22 from source IP) and outbound rules (allow response traffic on ephemeral ports 1024-65535) must be configured.
  3. Missing or incorrect key pair – Verify the private key (.pem) file corresponds to the key pair selected when the instance was launched.
  4. Incorrect username – The default username varies by AMI/OS:
    AMI            Default username
    Amazon Linux   ec2-user
    Ubuntu         ubuntu
    Debian         admin
    CentOS         centos or ec2-user
    RHEL           ec2-user or root
    SUSE           ec2-user or root
    Fedora         fedora or ec2-user
    FreeBSD        ec2-user
    Oracle         ec2-user
    Bitnami        bitnami
    Rocky Linux    rocky
  5. No public IP address – Instance must have a public IPv4 address (or Elastic IP) to connect via SSH over the internet. Alternatively, use Session Manager or EC2 Instance Connect Endpoint for private instances.
  6. Missing route to Internet Gateway – The subnet’s route table must have a route for 0.0.0.0/0 pointing to an Internet Gateway for internet-facing instances.
  7. Instance not in running state or failed status checks – Verify the instance is running and passes both system and instance status checks.
  8. Key file permissions too open – Private key file must have restrictive permissions (chmod 400 on Linux/macOS). SSH refuses to use a private key that is accessible by group or other users.
  9. Corporate firewall blocking port 22 – Internal firewalls may block outbound SSH. Use Session Manager (HTTPS-based) as an alternative.
  10. CPU overload on instance – High CPU utilization can make the instance unresponsive. Check CloudWatch metrics and consider resizing or using Auto Scaling.
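
The stateless NACL behaviour in item 2 is a frequent source of confusion, so here is a minimal sketch of it. It models rules as dictionaries evaluated in ascending rule-number order with first match winning and an implicit deny; it deliberately ignores protocol and CIDR matching, and all names and rule values are hypothetical.

```python
def nacl_allows(rules, direction, port):
    """First matching rule (lowest number) decides; no match means deny."""
    candidates = sorted(
        (r for r in rules if r["direction"] == direction),
        key=lambda r: r["number"],
    )
    for rule in candidates:
        if rule["from_port"] <= port <= rule["to_port"]:
            return rule["action"] == "allow"
    return False


def ssh_round_trip_ok(rules, client_port):
    """SSH needs inbound port 22 AND outbound to the client's ephemeral port."""
    return nacl_allows(rules, "inbound", 22) and nacl_allows(rules, "outbound", client_port)


inbound_only = [
    {"number": 100, "direction": "inbound", "from_port": 22, "to_port": 22, "action": "allow"},
]
with_return_path = inbound_only + [
    {"number": 100, "direction": "outbound", "from_port": 1024, "to_port": 65535, "action": "allow"},
]

print(ssh_round_trip_ok(inbound_only, 50000))      # response traffic is dropped
print(ssh_round_trip_ok(with_return_path, 50000))  # both directions allowed
```

A security group, being stateful, would not need the second rule; a NACL does.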

Connection Error Messages and Solutions

“Connection timed out” Error

Indicates network-level connectivity issues. Troubleshoot:

  1. Verify security group allows inbound SSH from your current public IP (IP may change with dynamic addressing)
  2. Verify route table has a route to an Internet Gateway (0.0.0.0/0 → igw-xxx)
  3. Verify Network ACL allows inbound on port 22 AND outbound on ephemeral ports (1024-65535)
  4. Verify instance has a public IPv4 address or Elastic IP
  5. Check for corporate firewall blocking outbound port 22
  6. Use VPC Reachability Analyzer to diagnose the network path
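
Before working through the checklist, a quick TCP probe can confirm whether the failure really is a timeout (packets silently dropped) rather than a refusal (host reachable, nothing listening on the port). The sketch below uses only the Python standard library; the function name is hypothetical and the demo probes a local listener so that it runs without touching a real instance.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connection attempt as open, timed out, refused, or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timed out"   # typical of security group, NACL, or route issues
    except ConnectionRefusedError:
        return "refused"     # host reachable but no service on that port
    except OSError as exc:
        return f"error: {exc}"


# Local demonstration: listen on a free loopback port and probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]
print(probe_tcp("127.0.0.1", demo_port))
listener.close()
```

Against an instance, the call would be probe_tcp with the instance's public IP and port 22.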

“Permission denied (publickey)” Error

Indicates authentication failure. Troubleshoot:

  1. Verify you are using the correct private key for the instance’s key pair
  2. Verify you are connecting with the correct username for the AMI
  3. Verify key file permissions are 0400 (Linux/macOS) or properly restricted (Windows)
  4. Check if permissions on ~/.ssh/authorized_keys or home directory were changed on the instance
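
For step 2, the default usernames listed earlier can be turned into a small lookup that produces the SSH commands worth trying for a given AMI family. This is only a sketch: the mapping covers a subset of that list, and the function name, key path, and the documentation-range IP address are placeholders.

```python
DEFAULT_USERS = {
    "amazon linux": ["ec2-user"],
    "ubuntu": ["ubuntu"],
    "debian": ["admin"],
    "centos": ["centos", "ec2-user"],
    "rhel": ["ec2-user", "root"],
    "rocky linux": ["rocky"],
}


def ssh_commands(ami_family, host, key_path):
    """Build one ssh command per candidate default user for the AMI family."""
    users = DEFAULT_USERS.get(ami_family.lower())
    if users is None:
        raise KeyError(f"no default user known for {ami_family!r}")
    return [f"ssh -i {key_path} {user}@{host}" for user in users]


for command in ssh_commands("CentOS", "203.0.113.10", "my_private_key.pem"):
    print(command)
```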

“Unprotected Private Key File” Warning

SSH ignores keys with overly permissive file permissions.

  • Linux/macOS: chmod 0400 my_private_key.pem
  • Windows: Remove inherited permissions and grant Read access only to your user account via file Properties → Security → Advanced
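
On Linux/macOS the same condition can be checked programmatically by testing the group and other permission bits. The sketch below is an approximation of that check rather than SSH's own implementation; the function name is hypothetical, and the demo uses a temporary file instead of a real key.

```python
import os
import stat
import tempfile


def key_permissions_too_open(path):
    """True if group or other users have any access to the file (POSIX only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & 0o077)


# Demonstration on a throwaway file standing in for a private key
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_key = handle.name

os.chmod(demo_key, 0o644)
print(key_permissions_too_open(demo_key))  # readable by group and others
os.chmod(demo_key, 0o400)
print(key_permissions_too_open(demo_key))  # owner read-only
os.remove(demo_key)
```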

“Host key verification failed” Error

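Occurs when the host key stored in ~/.ssh/known_hosts doesn’t match the key the instance presents. This is common when an address you connected to before now points to a different instance, for example after a stop/start changes the public IP or after an Elastic IP is associated or removed. Remove the old host key entry (ssh-keygen -R <hostname-or-IP>) and reconnect.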

“Server refused our key” Error (PuTTY)

  • Verify the .pem file was converted to .ppk format using PuTTYgen
  • Verify correct username is entered in the PuTTY configuration
  • Verify the latest version of PuTTY is installed

Troubleshooting Tools

VPC Reachability Analyzer

A configuration analysis tool that checks network reachability between a source and destination resource in your VPC. For EC2 connectivity troubleshooting:

  • Set Source type to Internet Gateway and Destination to your EC2 instance
  • Analyzes security groups, NACLs, route tables, and other network components
  • Provides hop-by-hop path details when reachable or identifies the blocking component when not reachable
  • Amazon Q network troubleshooting (2024) integrates with Reachability Analyzer to help diagnose connectivity issues using natural language

AWSSupport-TroubleshootSSH Automation Runbook

An AWS Systems Manager Automation document that automatically diagnoses and repairs common SSH connection issues:

  • Installs EC2Rescue for Linux on the instance
  • Checks and attempts to fix SSH daemon configuration, file permissions, and firewall rules
  • Can be run with Action: FixAll to automatically repair identified issues
  • For offline remediation (when the instance cannot be reached online and offline mode is allowed), creates a temporary VPC and uses Lambda functions to perform the analysis

EC2 Serial Console

Provides serial port access for troubleshooting when SSH/RDP is unavailable:

  • Does not require network connectivity to the instance
  • Useful for troubleshooting boot failures, network misconfigurations, and OS-level issues
  • Must be enabled at the account level; requires IAM permissions and a password-based OS user
  • Supported on Nitro-based instance types

SSH Verbose Mode

Use ssh -vvv for detailed debugging output to identify where the connection fails in the SSH handshake process.

Modern Alternatives to Traditional SSH

AWS Systems Manager Session Manager

AWS recommends Session Manager as the preferred access method because it:

  • Eliminates the need to open inbound port 22
  • Removes the need to manage SSH keys
  • Does not require bastion hosts or public IP addresses
  • Provides centralized access control through IAM policies
  • Logs all sessions to CloudWatch Logs and/or S3 for audit
  • Supports port forwarding for accessing remote services
  • Encrypts all traffic using TLS 1.2
  • Available at no additional charge for EC2 instances

Requirements: SSM Agent installed (pre-installed on Amazon Linux 2/2023, Ubuntu 16.04+), instance profile with AmazonSSMManagedInstanceCore policy, and outbound HTTPS connectivity (or VPC endpoints for private subnets).

EC2 Instance Connect Endpoint

For instances in private subnets without Session Manager configured:

  • Create an EIC Endpoint in your VPC (no additional cost)
  • Connect via AWS CLI: aws ec2-instance-connect ssh --instance-id i-xxx
  • No need for public IP, IGW, or bastion hosts
  • Uses IAM for authorization
  • Security group on the endpoint controls which instances can be accessed

Lost Private Key Recovery

If the private key for an EBS-backed instance is lost:

  1. Create a new key pair
  2. Stop the instance (not terminate)
  3. Detach the root EBS volume
  4. Attach the volume to a temporary instance
  5. Mount the volume and update ~/.ssh/authorized_keys with the new public key
  6. Detach the volume and reattach to the original instance as the root volume
  7. Start the instance and connect with the new key pair

Note: This procedure only works for EBS-backed instances. Instance store-backed instances cannot be recovered without the original key. Alternatively, use Session Manager if SSM Agent is running, or use EC2 Serial Console if a password-based user is configured.

AWS Certification Exam Tips

  • “Connection timed out” typically indicates network-level issues (security groups, NACLs, route tables, no public IP)
  • “Permission denied” typically indicates authentication issues (wrong key, wrong username, key file permissions)
  • Session Manager is the recommended approach for secure, auditable access without open ports
  • EC2 Instance Connect Endpoint enables access to private instances without bastion hosts
  • EC2 Serial Console is the last-resort tool when all network-based access fails
  • VPC Reachability Analyzer is used to diagnose network path issues

Exam Scenario Questions

  1. You try to connect via SSH to a newly created Amazon EC2 instance and get one of the following error messages: “Network error: Connection timed out” or “Error connecting to [instance], reason: → Connection timed out: connect.” You have confirmed that the network and security group rules are configured correctly and the instance is passing status checks. What steps should you take to identify the source of the behavior? Choose 2 answers
    • Verify that the private key file corresponds to the Amazon EC2 key pair assigned at launch.
    • Verify that your IAM user policy has permission to launch Amazon EC2 instances.
    • Verify that you are connecting with the appropriate user name for your AMI.
    • Verify that the Amazon EC2 Instance was launched with the proper IAM role.
    • Verify that your federation trust to AWS has been established.
  2. A developer is unable to SSH into an EC2 instance in a private subnet. The instance has no public IP address and no internet gateway is attached to the VPC. The instance has the SSM Agent installed with an appropriate instance profile. What is the MOST operationally efficient way to connect?
    • Attach an Elastic IP address to the instance and connect via SSH.
    • Deploy a bastion host in a public subnet and use it to SSH into the private instance.
    • Use AWS Systems Manager Session Manager to establish a session to the instance.
    • Create a VPN connection to the VPC and connect via the private IP.
  3. An administrator receives “Permission denied (publickey)” when connecting via SSH to an EC2 instance running Amazon Linux. The administrator confirmed the correct key pair was used. What should be checked NEXT?
    • Verify the security group allows inbound traffic on port 22.
    • Verify the username is ec2-user (not root) and the key file permissions are chmod 400.
    • Verify the instance has an IAM role attached.
    • Verify the instance is in a public subnet.
  4. A security team wants to provide developers access to EC2 instances without opening any inbound ports and with full session logging. Which AWS service should they implement?
    • EC2 Instance Connect
    • AWS Systems Manager Session Manager
    • AWS Direct Connect
    • Amazon WorkSpaces
  5. An EC2 instance has become unresponsive and all network-based connection methods (SSH, Session Manager) are failing. The instance is running on a Nitro-based instance type. Which AWS feature can provide access for troubleshooting?
    • VPC Flow Logs
    • AWS CloudTrail
    • EC2 Serial Console
    • AWS X-Ray
  6. A solutions architect needs to diagnose why SSH connections to an EC2 instance are timing out. Which AWS tool can analyze the network path between an internet gateway and the instance to identify the blocking component?
    • AWS CloudTrail
    • VPC Reachability Analyzer
    • Amazon Inspector
    • AWS Trusted Advisor