Architecting for the Cloud – AWS Best Practices

📋 Important Note: Whitepaper Superseded

The original “Architecting for the Cloud: AWS Best Practices” whitepaper (last updated October 2018) has been superseded by the AWS Well-Architected Framework.

The Well-Architected Framework is now organized into six pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability (added in 2021). It receives continuous updates — most recently in November 2024 and April 2025.

This post is maintained for certification study purposes as the core architectural principles remain relevant.

The Architecting for the Cloud – AWS Best Practices whitepaper provides architectural patterns and advice on designing systems that are secure, reliable, high performing, and cost efficient.

AWS Design Principles

Scalability

While AWS provides virtually unlimited on-demand capacity, the architecture should be designed to take advantage of those resources
There are two ways to scale an IT architecture
- Vertical Scaling
  - takes place by increasing the specifications of an individual resource, e.g. moving to an EC2 instance type with more RAM, CPU, IOPS, or networking capacity
  - will eventually hit a limit, and is not always a cost effective or highly available approach
  - AWS Graviton-based instances (Graviton4 as of 2024) offer up to 40% better price-performance, making vertical scaling more cost-effective
- Horizontal Scaling
  - takes place by increasing the number of resources, e.g. adding more EC2 instances or EBS volumes
  - can help leverage the elasticity of cloud computing
  - not all architectures can be designed to distribute their workload across multiple resources
  - applications should be designed to be stateless,
    - i.e. they need no knowledge of previous interactions and store no session information
    - capacity can then be increased at any time, or decreased once running tasks have been drained
  - State, if needed, can be implemented using
    - a low-latency external store, e.g. DynamoDB or ElastiCache (Redis or Memcached), to maintain state information
    - session affinity, e.g. ELB sticky sessions, to bind all the transactions of a session to a specific compute resource. However, affinity cannot be guaranteed, and existing sessions do not take advantage of newly added resources
  - Load can be distributed across multiple resources using
    - Push model, e.g. an ELB that distributes the load across multiple EC2 instances
    - Pull model, e.g. SQS or Kinesis, where multiple consumers poll for and consume messages
  - Distributed processing, e.g. using EMR or Kinesis, helps process large amounts of data by dividing a task and its data into many small fragments of work
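
The stateless design described above can be sketched in a few lines. This is only an illustration: `SessionStore` is a hypothetical in-memory stand-in for an external store such as DynamoDB or ElastiCache, and `handle_add_to_cart` is an invented handler name. The point is that the handler keeps nothing between calls, so any instance behind a load balancer could serve any request.

```python
from dataclasses import dataclass, field


@dataclass
class SessionStore:
    """Hypothetical in-memory stand-in for an external store (DynamoDB, ElastiCache)."""
    _items: dict = field(default_factory=dict)

    def get(self, session_id: str) -> dict:
        # Return a copy so callers cannot mutate stored state by accident.
        return dict(self._items.get(session_id, {}))

    def put(self, session_id: str, state: dict) -> None:
        self._items[session_id] = dict(state)


def handle_add_to_cart(store: SessionStore, session_id: str, item: str) -> list:
    """Stateless handler: session state is read from and written back to the store."""
    state = store.get(session_id)
    cart = state.get("cart", []) + [item]
    store.put(session_id, {"cart": cart})
    return cart


store = SessionStore()
# These two calls could be served by two different instances.
handle_add_to_cart(store, "session-1", "book")
cart = handle_add_to_cart(store, "session-1", "pen")
print(cart)
```

Because the state lives outside the handler, instances can be added or removed without losing sessions, which is what session affinity alone cannot promise.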

Disposable Resources Instead of Fixed Servers

Resources should be treated as temporary, disposable resources rather than the fixed, permanent resources typical of on-premises environments

AWS focuses on the concept of immutable infrastructure
- a server, once launched, is never updated throughout its lifetime
- updates are instead rolled out by launching a new server with the latest configuration
- this ensures resources are always in a consistent (and tested) state and makes rollbacks easier

AWS provides multiple ways to instantiate compute resources in an automated and repeatable way
- Bootstrapping
  - scripts that configure and set up a resource, e.g. EC2 user data scripts and cloud-init to install software or copy resources and code
- Golden Images
  - a snapshot of a particular state of a resource
  - gives faster start times and removes dependencies on configuration services or third-party repositories
  - EC2 Image Builder can automate creation, testing, and distribution of golden AMIs
- Containers
  - AWS supports container workloads through Amazon ECS, Amazon EKS, and AWS Fargate (serverless containers)
  - Docker allows packaging a piece of software in a Docker Image, which is a standardized unit for software development, containing everything the software needs to run: code, runtime, system tools, system libraries, etc
  - AWS Fargate provides serverless compute for containers, eliminating the need to manage underlying EC2 instances

Infrastructure as Code

- AWS assets are programmable, so techniques, practices, and tools from software development can be applied to make the whole infrastructure reusable, maintainable, extensible, and testable.
- AWS provides services for IaC deployment:
  - AWS CloudFormation – declarative JSON/YAML templates for provisioning AWS resources
  - AWS CDK (Cloud Development Kit) – define infrastructure using familiar programming languages (TypeScript, Python, Java, Go, C#) that synthesize to CloudFormation
  - AWS SAM (Serverless Application Model) – simplified CloudFormation for serverless applications
- Note: AWS OpsWorks reached End of Life on May 26, 2024 and is no longer available. Use AWS Systems Manager, CloudFormation, or CDK as alternatives.
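
As a rough illustration of infrastructure as code, the sketch below builds a minimal CloudFormation template as a Python dictionary and serializes it to JSON. The function name `build_bucket_template` and the logical ID `ArtifactBucket` are made up for this example; the template keys (`AWSTemplateFormatVersion`, `Resources`, `AWS::S3::Bucket`, `VersioningConfiguration`) follow the standard CloudFormation template format.

```python
import json


def build_bucket_template(logical_id: str) -> dict:
    """Return a minimal CloudFormation template with one versioned S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative template: one versioned S3 bucket",
        "Resources": {
            # logical_id is a hypothetical name chosen by the caller.
            logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
    }


template = build_bucket_template("ArtifactBucket")
template_body = json.dumps(template, indent=2, sort_keys=True)
print(template_body)
```

Since the template is ordinary data produced by ordinary code, it can be version controlled, reviewed, and unit tested like any other software artifact.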

Automation

AWS provides various automation tools and services that help improve a system’s stability, efficiency, and time to market.
- Elastic Beanstalk
  - a PaaS that allows quick application deployment while handling resource provisioning, load balancing, auto scaling, monitoring etc
- EC2 Auto Recovery
  - creates a CloudWatch alarm that monitors an EC2 instance and automatically recovers it if it becomes impaired.
  - A recovered instance is identical to the original instance, including the instance ID, private & Elastic IP addresses, and all instance metadata.
  - The instance is migrated during an instance reboot, so any in-memory data is lost.
- Auto Scaling
  - helps maintain application availability and scales capacity up or down automatically according to defined conditions
  - supports predictive scaling that uses machine learning to forecast traffic and proactively scale capacity
- CloudWatch Alarms
  - can trigger SNS notifications when a metric stays beyond a specified threshold for a specified number of periods
- Amazon EventBridge (formerly CloudWatch Events)
  - delivers a near real-time stream of system events that describe changes in AWS resources
  - extends capabilities with partner event sources, Schema Registry, and EventBridge Pipes for point-to-point integrations
  - EventBridge Scheduler supports one-time and recurring schedules with built-in retry policies
- AWS Systems Manager
  - provides operational management for AWS resources including patch management, configuration compliance, and automated runbooks
  - replaces the need for OpsWorks with features like State Manager, Automation, and Run Command
- Lambda Scheduled Events
  - allows a Lambda function to be created and then invoked on a regular schedule via EventBridge Scheduler.
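
The alarm behaviour noted above (a metric beyond a threshold for a number of periods) can be modelled roughly as follows. This is a simplified sketch, not the CloudWatch evaluation logic itself: `alarm_state` is a hypothetical function, and it assumes every one of the most recent periods must breach the threshold. The state names mirror the three CloudWatch alarm states.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Return the alarm state for the most recent `evaluation_periods` datapoints."""
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    # Simplifying assumption: all recent datapoints must exceed the threshold.
    return "ALARM" if all(value > threshold for value in recent) else "OK"


# Hypothetical CPU utilisation samples, one per period.
cpu_percent = [45, 62, 81, 85, 90]
state = alarm_state(cpu_percent, threshold=80, evaluation_periods=3)
print(state)
```

Requiring several consecutive breaching periods keeps a single short spike from triggering a notification or a scaling action.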

Loose Coupling

AWS supports loosely coupled architectures that reduce interdependencies, so that a change or failure in one component does not cascade to other components
- Asynchronous Integration
  - components do not interact directly point-to-point but usually through an intermediate durable storage layer, e.g. SQS, Kinesis, or EventBridge
  - decouples the components and introduces additional resiliency
  - suitable for any interaction that doesn’t need an immediate response, where an acknowledgement that the request has been registered will suffice
- Service Discovery
  - allows resources to be launched or terminated at any point in time and still be discovered, e.g. using an ELB as a single point of contact that hides the underlying instance details, or Route 53 zones to abstract the load balancer’s endpoint
  - AWS Cloud Map provides service discovery for cloud resources, allowing applications to discover services via API calls, DNS queries, or directly through the SDK
- Well-Defined Interfaces
  - allows various components to interact with each other through specific, technology-agnostic interfaces, e.g. RESTful APIs with API Gateway
  - Amazon API Gateway supports REST APIs, HTTP APIs, and WebSocket APIs for real-time communication
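
Asynchronous integration can be illustrated with a small producer and consumer that never call each other directly. In this sketch the standard-library `queue.Queue` is only a local stand-in for a durable queue such as SQS, and the `producer` and `consumer` functions are hypothetical names.

```python
import queue
import threading


def producer(q, orders):
    """Enqueue work and return immediately; no direct call to the consumer."""
    for order in orders:
        q.put(order)
    q.put(None)  # sentinel telling the consumer to stop


def consumer(q, processed):
    """Pull messages at the consumer's own pace until the sentinel arrives."""
    while True:
        order = q.get()
        if order is None:
            break
        processed.append(order.upper())


q = queue.Queue()  # stand-in for a durable queue such as SQS
processed = []
worker = threading.Thread(target=consumer, args=(q, processed))
worker.start()
producer(q, ["order-1", "order-2", "order-3"])
worker.join()
print(processed)
```

The producer only needs to know that its message was accepted by the queue; if the consumer is slow or briefly unavailable, messages wait in the queue instead of failing the producer.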

Services, Not Servers

AWS encourages leveraging managed services and serverless architectures to reduce operational overhead
- Serverless compute – AWS Lambda for event-driven functions, AWS Fargate for serverless containers
- Managed databases – Amazon RDS, DynamoDB, Aurora Serverless for auto-scaling relational databases
- Application integration – SQS, SNS, EventBridge, Step Functions for workflow orchestration
- API management – API Gateway for creating, publishing, and managing APIs at scale

Databases

AWS provides different categories of database technologies
- Relational Databases (RDS)
  - normalizes data into well-defined tabular structures known as tables, which consist of rows and columns
  - provide a powerful query language, flexible indexing capabilities, strong integrity controls, and the ability to combine data from multiple tables in a fast and efficient manner
  - allows vertical scalability by increasing resources and horizontal scalability using Read Replicas for read capacity and sharding or data partitioning for write capacity
  - provides High Availability using Multi-AZ deployment, where data is synchronously replicated
  - Amazon Aurora provides MySQL and PostgreSQL-compatible databases with up to 5x and 3x better throughput respectively, with automatic storage scaling up to 128 TB
  - Aurora Serverless v2 scales capacity automatically based on application demand, ideal for variable workloads
- NoSQL Databases (DynamoDB)
  - provides databases that trade some of the query and transaction capabilities of relational databases for a more flexible data model that seamlessly scales horizontally
  - perform data partitioning and replication to scale both the reads and writes in a horizontal fashion
  - DynamoDB service synchronously replicates data across three facilities in an AWS region to provide fault tolerance in the event of a server failure or Availability Zone disruption
  - DynamoDB Global Tables provide multi-region, active-active replication for globally distributed applications
  - DynamoDB On-Demand mode eliminates capacity planning by automatically scaling to accommodate workloads
- Data Warehouse (Redshift)
  - Specialized type of relational database, optimized for analysis and reporting of large amounts of data
  - Redshift achieves efficient storage and optimum query performance through a combination of massively parallel processing (MPP), columnar data storage, and targeted data compression encoding schemes
  - Redshift MPP architecture enables increasing performance by increasing the number of nodes in the data warehouse cluster
  - Redshift Serverless automatically provisions and scales capacity, allowing analytics without cluster management
- Purpose-Built Databases
  - Amazon ElastiCache – in-memory caching (Redis, Memcached) for sub-millisecond latency
  - Amazon Neptune – graph database for highly connected datasets
  - Amazon Timestream – time series database for IoT and operational applications
  - Amazon MemoryDB for Redis – Redis-compatible, durable, in-memory database
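
Sharding or data partitioning for write capacity, mentioned above for relational databases, usually comes down to routing each key to one of several partitions. The sketch below shows one simple approach, hashing the key and taking it modulo the shard count. The function `shard_for` and the `customer-N` keys are hypothetical, and this is not a description of how any AWS service partitions data internally.

```python
import hashlib
from collections import Counter


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index in [0, shard_count) using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Route 1000 hypothetical customer keys across 4 shards.
distribution = Counter(shard_for(f"customer-{i}", 4) for i in range(1000))
print(sorted(distribution.items()))
```

A stable hash keeps the same key on the same shard across processes. Plain modulo routing does remap most keys when the shard count changes, so schemes such as consistent hashing are often preferred when shards are added frequently.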

For more details refer to AWS Storage Options Whitepaper

Removing Single Points of Failure

AWS provides ways to implement redundancy, automate recovery and reduce disruption at every layer of the architecture
AWS supports redundancy in the following ways
- Standby Redundancy
  - When a resource fails, functionality is recovered on a secondary resource using a process called failover.
  - Failover will typically require some time before it completes, and during that period the resource remains unavailable.
  - Secondary resource can either be launched automatically only when needed (to reduce cost), or it can be already running idle (to accelerate failover and minimize disruption).
  - Standby redundancy is often used for stateful components such as relational databases.
- Active Redundancy
  - requests are distributed to multiple redundant compute resources; if one fails, the rest can simply absorb a larger share of the workload.
  - Compared to standby redundancy, it can achieve better utilization and affect a smaller population when there is a failure.
AWS supports replication
- Synchronous replication
  - acknowledges a transaction after it has been durably stored in both the primary location and its replicas.
  - protects data integrity in the event of a primary node failure
  - used to scale read capacity for queries that require the most up-to-date data (strong consistency).
  - comes at the cost of performance and availability
- Asynchronous replication
  - decouples the primary node from its replicas at the expense of introducing replication lag
  - used to horizontally scale the system’s read capacity for queries that can tolerate that replication lag.
- Quorum-based replication
  - combines synchronous and asynchronous replication to overcome the challenges of large-scale distributed database systems
  - Replication to multiple nodes can be managed by defining a minimum number of nodes that must participate in a successful write operation
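
The quorum idea above can be made concrete with a little arithmetic: with N replicas, a write quorum W and a read quorum R, reads are guaranteed to see the latest write when W + R > N. The sketch below is illustrative only; the function names and the values N=6, W=4, R=3 are assumptions chosen for the example.

```python
def quorums_overlap(replicas: int, write_quorum: int, read_quorum: int) -> bool:
    """True when every read quorum shares at least one node with every write quorum."""
    return write_quorum + read_quorum > replicas


def write_committed(acks: list, write_quorum: int) -> bool:
    """A write succeeds once at least `write_quorum` replicas have acknowledged it."""
    return sum(1 for ack in acks if ack) >= write_quorum


# Illustrative values only: 6 replicas, 4 must acknowledge a write, 3 must answer a read.
N, W, R = 6, 4, 3
acks = [True, True, True, True, False, False]  # two replicas are slow or unreachable
print(quorums_overlap(N, W, R), write_committed(acks, W))
```

With these values a write still succeeds while two replicas are unavailable, which is the middle ground between fully synchronous and fully asynchronous replication.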

AWS provides services to reduce or remove single points of failure
- Regions, Availability Zones with multiple data centers
- ELB or Route 53 to configure health checks and mask failure by routing traffic to healthy endpoints
- Auto Scaling to automatically replace unhealthy nodes
- EC2 Auto Recovery to recover impaired instances
- S3, DynamoDB with data redundantly stored across multiple facilities
- Multi-AZ RDS, Aurora (6 copies across 3 AZs), and Read Replicas
- ElastiCache Redis engine supports replication with automatic failover
- AWS Elastic Disaster Recovery (DRS) for continuous replication and automated recovery of on-premises and cloud-based applications
For more details refer to AWS Disaster Recovery Whitepaper
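
Masking failure by routing only to healthy endpoints, as ELB and Route 53 health checks do, can be sketched as a filter followed by round-robin distribution. This is a toy model under stated assumptions: the IP addresses, the `health` table, and both function names are hypothetical, and real health checks probe endpoints over the network rather than reading a dictionary.

```python
from itertools import cycle


def healthy_endpoints(endpoints, is_healthy):
    """Keep only the endpoints whose health check currently passes."""
    return [endpoint for endpoint in endpoints if is_healthy(endpoint)]


def route_requests(endpoints, is_healthy, request_count):
    """Distribute requests round-robin across healthy endpoints only."""
    targets = healthy_endpoints(endpoints, is_healthy)
    if not targets:
        raise RuntimeError("no healthy endpoints available")
    rotation = cycle(targets)
    return [next(rotation) for _ in range(request_count)]


# Hypothetical endpoints in three Availability Zones; the second one is failing.
endpoints = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]
health = {"10.0.1.10": True, "10.0.2.10": False, "10.0.3.10": True}
routes = route_requests(endpoints, health.get, 4)
print(routes)
```

The failed endpoint simply drops out of rotation and the remaining ones absorb its share, which is the active redundancy behaviour described earlier in this section.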

Optimize for Cost

AWS can help organizations reduce capital expenses and drive savings as a result of the AWS economies of scale
AWS provides different options, which should be chosen according to the use case:
- EC2 pricing models:
  - On-Demand – pay per second/hour with no commitment
  - Savings Plans – commit to a consistent amount of usage (measured in $/hr) for 1 or 3 years; Compute Savings Plans (up to 66% savings) and EC2 Instance Savings Plans (up to 72% savings)
  - Reserved Instances – up to 72% discount for 1 or 3 year terms, with a capacity reservation when scoped to a specific Availability Zone
  - Spot Instances – up to 90% discount for fault-tolerant, flexible workloads using spare capacity
  - Dedicated Hosts – single-tenant hardware for compliance and BYOL licensing
- AWS Graviton instances for up to 40% better price-performance over comparable x86 instances
- AWS Cost Optimization Hub, Trusted Advisor, and AWS Compute Optimizer to identify cost savings opportunities
- S3 storage classes:
  - S3 Standard – frequently accessed data
  - S3 Intelligent-Tiering – automatic cost optimization for data with unknown or changing access patterns
  - S3 Standard-Infrequent Access (S3 Standard-IA) – infrequently accessed data
  - S3 One Zone-IA – infrequently accessed data not requiring multi-AZ resilience
  - S3 Glacier Instant Retrieval, Flexible Retrieval, and Deep Archive – long-term archive storage
  - S3 Express One Zone – single-digit millisecond latency for most frequently accessed data (up to 10x faster than S3 Standard)
- EBS volumes – General Purpose SSD (gp3), Provisioned IOPS SSD (io2 Block Express), Throughput Optimized HDD (st1), Cold HDD (sc1). Note: Magnetic (standard) is a previous-generation volume type; gp3 is recommended as default.
- Cost Allocation tags to identify costs based on tags
- Auto Scaling to horizontally scale the capacity up or down based on demand
- Lambda and Fargate based serverless architectures to avoid paying for idle or redundant resources
- Utilize managed services where scaling is handled by AWS, e.g. ELB, CloudFront, Kinesis, SQS, Amazon OpenSearch Service etc.
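
To get a feel for how the EC2 pricing models compare, the arithmetic below applies the maximum discounts quoted in the list above to a single instance running for a month. The hourly rate of 0.10 USD is a hypothetical number, not a real AWS price, and the "up to" discounts are best cases that depend on instance family, term, and payment option.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours per year / 12 months


def monthly_cost(hourly_rate, discount=0.0, hours=HOURS_PER_MONTH):
    """Monthly cost in USD after applying a fractional discount (0.0 to 1.0)."""
    return round(hourly_rate * (1 - discount) * hours, 2)


on_demand_rate = 0.10  # hypothetical USD per hour, for illustration only
costs = {
    "on_demand": monthly_cost(on_demand_rate),
    "compute_savings_plan_max": monthly_cost(on_demand_rate, 0.66),
    "ec2_instance_savings_plan_max": monthly_cost(on_demand_rate, 0.72),
    "spot_max": monthly_cost(on_demand_rate, 0.90),
}
for model, cost in costs.items():
    print(f"{model}: {cost:.2f} USD/month")
```

Under these assumptions the same instance costs about 73 USD per month On-Demand and roughly a tenth of that at the maximum Spot discount, which is why matching the pricing model to the workload matters.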

Caching

Caching improves application performance and increases the cost efficiency of an implementation
- Application Data Caching
  - provides services that help store and retrieve information from fast, managed, in-memory caches
  - Amazon ElastiCache is a web service that makes it easy to deploy, operate, and scale an in-memory cache in the cloud and supports two open-source in-memory caching engines: Memcached and Redis
  - Amazon DynamoDB Accelerator (DAX) provides a fully managed, in-memory cache for DynamoDB with microsecond response times
- Edge Caching
  - allows content to be served by infrastructure that is closer to viewers, lowering latency and giving high, sustained data transfer rates needed to deliver large popular objects to end users at scale.
  - Amazon CloudFront is a Content Delivery Network (CDN) consisting of 600+ Points of Presence (edge locations and regional caches) that allows copies of static and dynamic content to be cached
  - CloudFront Functions and Lambda@Edge enable running code at edge locations for request/response manipulation
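
Application data caching is often implemented with the cache-aside pattern: check the cache first, and only on a miss load from the slower source and store the result with a time to live. The sketch below is a minimal in-process version. `TTLCache` and `slow_lookup` are hypothetical names, and a real deployment would typically use ElastiCache or DAX rather than a Python dictionary.

```python
import time


class TTLCache:
    """Minimal cache-aside helper with a per-entry time to live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: entry has not expired yet
        value = loader(key)  # cache miss: go to the slower source
        self._entries[key] = (now + self._ttl, value)
        return value


calls = []


def slow_lookup(key):
    """Stand-in for a database query; records each call it receives."""
    calls.append(key)
    return f"profile-for-{key}"


cache = TTLCache(ttl_seconds=60)
first = cache.get_or_load("user-1", slow_lookup)
second = cache.get_or_load("user-1", slow_lookup)
print(first == second, len(calls))
```

The second read is served from memory without touching the backing source, which is where both the latency and the cost savings come from.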

Security

AWS works on a shared security responsibility model
- AWS is responsible for the security of the underlying cloud infrastructure
- you are responsible for securing the workloads you deploy in AWS
AWS also provides ample security features
- IAM to define a granular set of policies and assign them to users, groups, and AWS resources
- IAM roles to assign short term credentials to resources, which are automatically distributed and rotated
- AWS IAM Identity Center (formerly AWS SSO) for centralized workforce identity management and single sign-on across AWS accounts and applications
- Amazon Cognito, for mobile and web applications, which allows client devices to get controlled access to AWS resources via temporary tokens
- VPC to isolate parts of infrastructure through the use of subnets, security groups, and routing controls
- AWS WAF to help protect web applications from SQL injection, cross-site scripting, and other common exploits with managed rule groups
- CloudWatch Logs to collect logs centrally, since the servers are temporary
- CloudTrail for auditing AWS API calls, which delivers a log file to S3 bucket. Logs can then be stored in an immutable manner and automatically processed to either notify or even take action on your behalf, protecting your organization from non-compliance
- AWS Security Hub – unified security posture management that aggregates findings from GuardDuty, Inspector, Macie, and partner tools with automated compliance checks
- Amazon GuardDuty – intelligent threat detection using machine learning, anomaly detection, and integrated threat intelligence to identify malicious activity
- Amazon Inspector – automated vulnerability management that continuously scans EC2 instances, container images (ECR), Lambda functions, and code repositories for software vulnerabilities
- AWS Config for continuous compliance monitoring, and AWS Trusted Advisor for best practice recommendations across cost, performance, security, fault tolerance, and service limits
For more details refer to AWS Security Whitepaper
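
As an example of the granular IAM policies mentioned above, the sketch below generates a least-privilege policy document that only allows listing one bucket and reading its objects. The function name and the bucket name `example-reports-bucket` are hypothetical; the document structure (`Version`, `Statement`, `Effect`, `Action`, `Resource`) follows the standard IAM JSON policy format.

```python
import json


def read_only_bucket_policy(bucket_name: str) -> dict:
    """Build an IAM policy document granting read-only access to one S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],  # bucket-level action
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],  # object-level action
            },
        ],
    }


policy = read_only_bucket_policy("example-reports-bucket")
print(json.dumps(policy, indent=2))
```

Scoping each action to the narrowest resource it needs, rather than granting `s3:*` on all resources, is the kind of granularity IAM policies are meant to provide.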

AWS Well-Architected Framework

The AWS Well-Architected Framework is the successor to this whitepaper and provides comprehensive guidance for building secure, high-performing, resilient, and efficient infrastructure
The Framework is built on six pillars:
- Operational Excellence – running and monitoring systems to deliver business value and continually improve processes and procedures
- Security – protecting information and systems through risk assessments, mitigation strategies, and security best practices
- Reliability – ensuring workloads perform correctly and consistently, with ability to recover from failures and meet demand
- Performance Efficiency – using computing resources efficiently to meet requirements and maintain efficiency as demand changes
- Cost Optimization – avoiding unnecessary costs through understanding spending, selecting the right resources, and scaling to meet needs without overspending
- Sustainability (added 2021) – minimizing environmental impacts by reducing energy consumption and increasing efficiency of cloud workloads
The AWS Well-Architected Tool in the AWS Management Console allows workload reviews against framework best practices
AWS also provides Well-Architected Lenses for specific workload types (Serverless, SaaS, Machine Learning, Data Analytics, IoT, etc.)

References

AWS Well-Architected Framework (current guidance, replaces the original whitepaper)
Architecting for the Cloud: AWS Best Practices – Whitepaper (archived, October 2018)
AWS Well-Architected Homepage

5 thoughts on “Architecting for the Cloud – AWS Best Practices – Whitepaper – Certification”

Kayeee Su says:

July 31, 2019 at 1:28 pm

I really appreciate your note. It is clear and complete. It helps me a lot for preparing certification test!!:))) Thank you sooo much!
I noticed that you don’t have note about cognito.
1. jayendrapatil says:
  
  August 1, 2019 at 9:43 pm
  
  Glad its helping Kayeee …
Pradeep KR says:

September 2, 2019 at 2:15 pm

Great Articulation with lucid explanation of the Cloud eco system and relevant components. Not seen such detailed information available in a single place holder (web site). It helped me understand the cloud related nuances every bit.
Shantanu Deshpande says:

June 23, 2020 at 11:05 pm

Your notes are really helpful. Information is dispensed in a crisp, short and clear way! Useful for certification preparation. Thanks !
1. jayendrapatil says:
  
  June 24, 2020 at 10:41 am
  
  glad it helped Shantanu ..