AWS DDoS Resiliency – Best Practices – Whitepaper

AWS DDoS Resiliency Whitepaper

  • Denial of Service (DoS) is an attack, carried out by a single attacker, which attempts to make a website or application unavailable to the end users.
  • Distributed Denial of Service (DDoS) is an attack, carried out by multiple attackers either controlled or compromised by a group of collaborators, which generates a flood of requests to the application making in unavailable to the legitimate end users

Mitigation techniques

Minimize the Attack Surface Area

  • This is all all about reducing the attack surface, the different Internet entry points, that allows access to your application
  • Strategy to minimize the Attack surface area
    • reduce the number of necessary Internet entry points,
    • don’t expose back end servers,
    • eliminate non-critical Internet entry points,
    • separate end user traffic from management traffic,
    • obfuscate necessary Internet entry points to the level that untrusted end users cannot access them, and
    • decouple Internet entry points to minimize the effects of attacks.
  • Benefits
    • Minimizes the effective attack vectors and targets
    • Less to monitor and protect
  • Strategy can be achieved using AWS Virtual Private Cloud (VPC)
    • helps define a logically isolated virtual network within the AWS
    • provides ability to create Public & Private Subnets to launch the internet facing and non-public facing instances accordingly
    • provides NAT gateway which allows instances in the private subnet to have internet access without the need to launch them in public subnets with Public IPs
    • allows creation of Bastion host which can be used to connect to instances in the private subnets
    • provides the ability to configure security groups for instances and NACLs for subnets, which act as a firewall, to control and limit outbound and inbound traffic

VPC Architecture

Be Ready to Scale to Absorb the Attack

  • DDOS mainly targets to load the systems till the point they cannot handle the load and are rendered unusable.
  • Scaling out Benefits
    • help build a resilient architecture
    • makes the attacker work harder
    • gives you time to think, analyze and adapt
  • AWS provided services :-
    • Auto Scaling & ELB
      • Horizontal scaling using Auto Scaling with ELB
      • Auto Scaling allows instances to be added and removed as the demand changes
      • ELB helps distribute the traffic across multiple EC2 instances while acting as a Single point of contact.
      • Auto Scaling automatically registers and deregisters EC2 instances with the ELB during scale out and scale in events
    • EC2 Instance
      • Vertical scaling can be achieved by using appropriate EC2 instance types for e.g. EBS optimized or ones with 10 gigabyte network connectivity to handle the load
    • Enhanced Networking
      • Use Instances with Enhanced Networking capabilities which can provide high packet-per-second performance, low latency networking, and improved scalability
    • Amazon CloudFront
      • CloudFront is a CDN, acts as a proxy between end users and the Origin servers, and helps distribute content to the end users without sending traffic to the Origin servers.
      • CloudFront has the inherent ability to help mitigate against both infrastructure and some application layer DDoS attacks by dispersing the traffic across multiple locations.
      • AWS has multiple Internet connections for capacity and redundancy at each location, which allows it to isolate attack traffic while serving content to legitimate end users
      • CloudFront also has filtering capabilities to ensure that only valid TCP connections and HTTP requests are made while dropping invalid requests. This takes the burden of handling invalid traffic (commonly used in UDP & SYN floods, and slow reads) off the origin.
    • Route 53
      • DDOS attacks are also targeted towards DNS, cause if the DNS is unavailable your application is effectively unavailable.
      • AWS Route 53 is highly available and scalable DNS service and have capabilities to ensure access to the application even when under DDOS attack
        • Shuffle Sharding – Shuffle sharding is similar to the concept of database sharding, where horizontal partitions of data are spread across separate database servers to spread load and provide redundancy. Similarly, Amazon Route 53 uses shuffle sharding to spread DNS requests over numerous PoPs, thus providing multiple paths and routes for your application.
        • Anycast Routing – Anycast routing increases redundancy by advertising the same IP address from multiple PoPs. In the event that a DDoS attack overwhelms one endpoint, shuffle sharding isolate failures while providing additional routes to your infrastructure.

Safeguard Exposed & Hard to Scale Expensive Resources

  • If entry points cannot be limited, additional measures to restrict access and protect those entry points without interrupting legitimate end user traffic
  • AWS provided services :-
    • CloudFront
      • CloudFront can restrict access to content using Geo Restriction and Origin Access Identity
      • With Geo Restriction, access can be restricted to a set of whitelisted countries or prevent access from a set of black listed countries
      • Origin Access Identity is the CloudFront special user which allows access to the resources only through CloudFront while denying direct access to the origin content for e.g. if S3 is the Origin for CloudFront, S3 can be configured to allow access only from OAI and hence deny direct access
    • Route 53
      • Route 53 provides two features Alias Record sets & Private DNS to make it easier to scale infrastructure and respond to DDoS attacks
    • WAF
      • WAFs act as filters that apply a set of rules to web traffic. Generally, these rules cover exploits like cross-site scripting (XSS) and SQL injection (SQLi) but can also help build resiliency against DDoS by mitigating HTTP GET or POST floods
      • WAF provides a lot of features like
        • OWASP Top 10
        • HTTP rate limiting (where only a certain number of requests are allowed per user in a timeframe),
        • Whitelist or blacklist (customizable rules)
        • inspect and identify requests with abnormal patterns,
        • CAPTCHA etc
      • To prevent WAF from being a Single point of failure, a WAF sandwich pattern can be implemented where an autoscaled WAF sits between the Internet and Internal Load Balancer

DDOS Resiliency - WAF Sandwich Architecture

Learn Normal Behavior

  • Understand the normal usual levels and Patterns of traffic for your application and use that as a benchmark for identifying abnormal level of traffic or resource spikes patterns
  • Benefits
    • allows one to spot abnormalities
    • configure Alarms with accurate thresholds
    • assists with generating forensic data
  • AWS provided services for tracking
    • AWS CloudWatch monitoring
      • CloudWatch can be used to monitor your infrastructure and applications running in AWS. Amazon CloudWatch can collect metrics, log files, and set alarms for when these metrics have passed predetermined thresholds
    • VPC Flow Logs
      • Flow logs helps capture traffic to the Instances in an VPC and can be used to understand the pattern

Create a Plan for Attacks

  • Have a plan in place before an attack, which ensures that:
    • Architecture has been validated and techniques selected work for the infrastructure
    • Costs for increased resiliency have been evaluated and the goals of your defense are understood
    • Contact points have been identified

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. You are designing a social media site and are considering how to mitigate distributed denial-of-service (DDoS) attacks. Which of the below are viable mitigation techniques? (Choose 3 answers)
    1. Add multiple elastic network interfaces (ENIs) to each EC2 instance to increase the network bandwidth.
    2. Use dedicated instances to ensure that each instance has the maximum performance possible.
    3. Use an Amazon CloudFront distribution for both static and dynamic content.
    4. Use an Elastic Load Balancer with auto scaling groups at the web app and Amazon Relational Database Service (RDS) tiers
    5. Add alert Amazon CloudWatch to look for high Network in and CPU utilization.
    6. Create processes and capabilities to quickly add and remove rules to the instance OS firewall.
  2. You’ve been hired to enhance the overall security posture for a very large e-commerce site. They have a well architected multi-tier application running in a VPC that uses ELBs in front of both the web and the app tier with static assets served directly from S3. They are using a combination of RDS and DynamoDB for their dynamic data and then archiving nightly into S3 for further processing with EMR. They are concerned because they found questionable log entries and suspect someone is attempting to gain unauthorized access. Which approach provides a cost effective scalable mitigation to this kind of attack?
    1. Recommend that they lease space at a DirectConnect partner location and establish a 1G DirectConnect connection to their VPC they would then establish Internet connectivity into their space, filter the traffic in hardware Web Application Firewall (WAF). And then pass the traffic through the DirectConnect connection into their application running in their VPC. (Not cost effective)
    2. Add previously identified hostile source IPs as an explicit INBOUND DENY NACL to the web tier subnet. (does not protect against new source)
    3. Add a WAF tier by creating a new ELB and an AutoScaling group of EC2 Instances running a host-based WAF. They would redirect Route 53 to resolve to the new WAF tier ELB. The WAF tier would their pass the traffic to the current web tier The web tier Security Groups would be updated to only allow traffic from the WAF tier Security Group
    4. Remove all but TLS 1.2 from the web tier ELB and enable Advanced Protocol Filtering This will enable the ELB itself to perform WAF functionality. (No advanced protocol filtering in ELB)

References

DDOS Whitepaper

 

 

AWS Security – Whitepaper – Certification

AWS Security Whitepaper

AWS Security whitepaper is one of the most important whitepaper for the Certification perspective

Shared Security Responsibility Model

In the Shared Security Responsibility Model, AWS is responsible for securing the underlying infrastructure that supports the cloud, and you’re responsible for anything you put on the cloud or connect to the cloud.
AWS Security Shared Responsibility Model

AWS Security Responsibilities

  • AWS is responsible for protecting the global infrastructure that runs all of the services offered in the AWS cloud. This infrastructure is comprised of the hardware, software, networking, and facilities that run AWS services.
  • AWS provide several reports from third-party auditors who have verified their compliance with a variety of computer security standards and regulations
  • AWS is responsible for the security configuration of its products that are considered managed services for e.g. RDS, DynamoDB
  • For Managed Services, AWS will handle basic security tasks like guest operating system (OS) and database patching, firewall configuration, and disaster recovery.

Customer Security Responsibilities

  • AWS Infrastructure as a Service (IaaS) products for e.g. EC2, VPC, S3 are completely under your control and require you to perform all of the necessary security configuration and management tasks.
  • Management of the guest OS (including updates and security patches), any application software or utilities installed on the instances, and the configuration of the AWS-provided firewall (called a security group) on each instance
  • For most of these managed services, all you have to do is configure logical access controls for the resources and protect the account credentials

AWS Global Infrastructure Security  

AWS Compliance Program

IT infrastructure that AWS provides to its customers is designed and managed in alignment with security best practices and a variety of IT security standards, including:
  • SOC 1/SSAE 16/ISAE 3402 (formerly SAS 70)
  • SOC 2
  • SOC 3
  • FISMA, DIACAP, and FedRAMP
  • DOD CSM Levels 1-5
  • PCI DSS Level 1
  • ISO 9001 / ISO 27001
  • ITAR
  • FIPS 140-2
  • MTCS Level 3
And meet several industry-specific standards, including:
  • Criminal Justice Information Services (CJIS)
  • Cloud Security Alliance (CSA)
  • Family Educational Rights and Privacy Act (FERPA)
  • Health Insurance Portability and Accountability Act (HIPAA)
  • Motion Picture Association of America (MPAA)

Physical and Environmental Security 

Storage Decommissioning

  • When a storage device has reached the end of its useful life, AWS procedures include a decommissioning process that is designed to prevent customer data from being exposed to unauthorized individuals.
  • AWS uses the techniques detailed in DoD 5220.22-M (National Industrial Security Program Operating Manual) or NIST 800-88 (Guidelines for Media Sanitization) to destroy data as part of the decommissioning process.
  • All decommissioned magnetic storage devices are degaussed and physically destroyed in accordance with industry-standard practices.

Network Security 

Amazon Corporate Segregation

  • AWS Production network is segregated from the Amazon Corporate network and requires a separate set of credentials for logical access.
  • Amazon Corporate network relies on user IDs, passwords, and Kerberos, while the AWS Production network require SSH public-key authentication through a bastion host.

Networking Monitoring & Protection

AWS utilizes a wide variety of automated monitoring systems to provide a high level of service performance and availability. These tools monitor server and network usage, port scanning activities, application usage, and unauthorized intrusion attempts. The tools have the ability to set custom performance metrics thresholds for unusual activity.
AWS network provides protection against traditional network security issues
  1. DDOS – AWS uses proprietary DDoS mitigation techniques. Additionally, AWS’s networks are multi-homed across a number of providers to achieve Internet access diversity.
  2. Man in the Middle attacks – AWS APIs are available via SSL-protected endpoints which provide server authentication
  3. IP spoofing – AWS-controlled, host-based firewall infrastructure will not permit an instance to send traffic with a source IP or MAC address other than its own.
  4. Port Scanning – Unauthorized port scans by Amazon EC2 customers are a violation of the AWS Acceptable Use Policy. When unauthorized port scanning is detected by AWS, it is stopped and blocked. Penetration/Vulnerability testing can be performed only on your own instances, with mandatory prior approval, and must not violate the AWS Acceptable Use Policy.
  5. Packet Sniffing by other tenants – It is not possible for a virtual instance running in promiscuous mode to receive or “sniff” traffic that is intended for a different virtual instance. While you can place your interfaces into promiscuous mode, the hypervisor will not deliver any traffic to them that is not addressed to them. Even two virtual instances that are owned by the same customer located on the same physical host cannot listen to each other’s traffic.

Secure Design Principles

AWS’s development process follows :-
  • Secure software development best practices, which include formal design reviews by the AWS Security Team, threat modeling, and completion of a risk assessment
  • Static code analysis tools are run as a part of the standard build process
  • Recurring penetration testing performed by carefully selected industry experts

AWS Account Security Features

AWS account security features includes credentials for access control, HTTPS endpoints for encrypted data transmission, the creation of separate IAM user accounts, user activity logging for security monitoring, and Trusted Advisor security checks

AWS Credentials

AWS IAM Credentials

Individual User Accounts

Do not use the Root account, instead create an IAM User for each user and provide them with a unique set of Credentials and grant least privilege as required to perform their job function

Secure HTTPS Access Points

Use HTTPS, provided by all AWS services, for data transmissions, which uses public-key cryptography to prevent eavesdropping, tampering, and forgery

Security Logs

Use Amazon CloudTrail which provides logs of all requests for AWS resources within the account and captures information about every API call to every AWS resource you use, including sign-in events

Trusted Advisor Security Checks

Use Trusted Advisor service which helps inspect AWS environment and provide recommendations when opportunities may exist to optimize cost, improve system performance, or close security gaps

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. In the shared security model, AWS is responsible for which of the following security best practices (check all that apply) :
    1. Penetration testing
    2. Operating system account security management (User responsibility)
    3. Threat modeling
    4. User group access management (User responsibility)
    5. Static code analysis (AWS development cycle responsibility)
  2. You are running a web-application on AWS consisting of the following components an Elastic Load Balancer (ELB) an Auto-Scaling Group of EC2 instances running Linux/PHP/Apache, and Relational DataBase Service (RDS) MySQL. Which security measures fall into AWS’s responsibility?
    1. Protect the EC2 instances against unsolicited access by enforcing the principle of least-privilege access (User responsibility)
    2. Protect against IP spoofing or packet sniffing
    3. Assure all communication between EC2 instances and ELB is encrypted (User responsibility)
    4. Install latest security patches on ELB. RDS and EC2 instances (User responsibility)
  3. In AWS, which security aspects are the customer’s responsibility? Choose 4 answers
    1. Controlling physical access to compute resources (AWS responsibility)
    2. Patch management on the EC2 instances operating system
    3. Encryption of EBS (Elastic Block Storage) volumes
    4. Life-cycle management of IAM credentials
    5. Decommissioning storage devices (AWS responsibility)
    6. Security Group and ACL (Access Control List) settings
  4. Per the AWS Acceptable Use Policy, penetration testing of EC2 instances:
    1. May be performed by AWS, and will be performed by AWS upon customer request.
    2. May be performed by AWS, and is periodically performed by AWS.
    3. Are expressly prohibited under all circumstances.
    4. May be performed by the customer on their own instances with prior authorization from AWS.
    5. May be performed by the customer on their own instances, only if performed from EC2 instances
  5. Which is an operational process performed by AWS for data security?
    1. AES-256 encryption of data stored on any shared storage device (User responsibility)
    2. Decommissioning of storage devices using industry-standard practices
    3. Background virus scans of EBS volumes and EBS snapshots (No virus scan is performed by AWS on User instances)
    4. Replication of data across multiple AWS Regions (AWS does not replicate data across regions unless done by User)
    5. Secure wiping of EBS data when an EBS volume is unmounted (data is not wiped off on EBS volume when unmounted and it can be remounted on other EC2 instance)

References

AWS Disaster Recovery – Whitepaper – Certification

AWS Disaster Recovery Whitepaper

AWS Disaster Recovery Whitepaper is one of the very important Whitepaper for both the Associate & Professional AWS Certification exam

Disaster Recovery Overview

  • AWS Disaster Recovery whitepaper highlights AWS services and features that can be leveraged for disaster recovery (DR) processes to significantly minimize the impact on data, system, and overall business operations.
  • It outlines best practices to improve your DR processes, from minimal investments to full-scale availability and fault tolerance, and describes how AWS services can be used to reduce cost and ensure business continuity during a DR event
  • Disaster recovery (DR) is about preparing for and recovering from a disaster. Any event that has a negative impact on a company’s business continuity or finances could be termed a disaster. One of the AWS best practice is to always design your systems for failures

Disaster Recovery Key AWS services

  1. Region
    • AWS services are available in multiple regions around the globe, and the DR site location can be selected as appropriate, in addition to the primary site location
  2. Storage
    • Amazon S3
      • provides a highly durable (99.999999999%) storage infrastructure designed for mission-critical and primary data storage.
      • stores Objects redundantly on multiple devices across multiple facilities within a region
    • Amazon Glacier
      • provides extremely low-cost storage for data archiving and backup.
      • Objects are optimized for infrequent access, for which retrieval times of several (3-5) hours are adequate.
    • Amazon EBS
      • provides the ability to create point-in-time snapshots of data volumes.
      • Snapshots can then be used to create volumes and attached to running instances
    • Amazon Storage Gateway
      • a service that provides seamless and highly secure integration between on-premises IT environment and the storage infrastructure of AWS.
    • AWS Import/Export
      • accelerates moving large amounts of data into and out of AWS by using portable storage devices for transport bypassing the Internet
      • transfers data directly onto and off of storage devices by means of the high-speed internal network of Amazon
  3. Compute
    • Amazon EC2
      • provides resizable compute capacity in the cloud which can be easily created and scaled.
      • EC2 instance creation using Preconfigured AMIs
      • EC2 instances can be launched in multiple AZs, which are engineered to be insulated from failures in other AZs
  4. Networking
    • Amazon Route 53
      • is a highly available and scalable DNS web service
      • includes a number of global load-balancing capabilities that can be effective when dealing with DR scenarios for e.g. DNS endpoint health checks and the ability to failover between multiple endpoints 
    • Elastic IP
      • addresses enables masking of instance or Availability Zone failures by programmatically remapping
      • addresses are static IP addresses designed for dynamic cloud computing.
    • Elastic Load Balancing (ELB)
      • performs health checks and automatically distributes incoming application traffic across multiple EC2 instances
    • Amazon Virtual Private Cloud (Amazon VPC)
      • allows provisioning of a private, isolated section of the AWS cloud where resources can be launched in a defined virtual network
    • Amazon Direct Connect
      • makes it easy to set up a dedicated network connection from on-premises environment to AWS
  5. Databases
    • RDS, DynamoDb, Redshift provided as a fully managed RDBMS, NoSQL and data warehouse solutions which can scale up easily
    • DynamoDB offers cross region replication
    • RDS provides Multi-AZ and Read Replicas and also ability to snapshot data from one region to other
  6. Deployment Orchestration
    • CloudFormation
      • gives developers and systems administrators an easy way to create a collection of related AWS resources and provision them in an orderly and predictable fashion
    • Elastic Beanstalk
      • is an easy-to-use service for deploying and scaling web applications and services
    • OpsWorks
      • is an application management service that makes it easy to deploy and operate applications of all types and sizes.
      • Environment can be defined as a series of layers, and each layer can be configured as a tier of the application.
      • has automatic host replacement, so in the event of an instance failure it will be automatically replaced.
      • can be used in the preparation phase to template the environment, and combined with AWS CloudFormation in the recovery phase.
      • Stacks can be quickly provisioned from the stored configuration to support the defined RTO.

Key factors for Disaster Planning

Disaster Recovery RTO RPO Defination

Recovery Time Objective (RTO) – The time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement (OLA) for e.g. if the RTO is 1 hour and disaster occurs @ 12:00 p.m (noon), then the DR process should restore the systems to an acceptable service level within an hour i.e. by 1:00 p.m

Recovery Point Objective (RPO) – The acceptable amount of data loss measured in time before the disaster occurs. for e.g., if a disaster occurs at 12:00 p.m (noon) and the RPO is one hour, the system should recover all data that was in the system before 11:00 a.m.

Disaster Recovery Scenarios

  • Disaster Recovery scenarios can be implemented with the Primary infrastructure running in your data center in conjunction with the AWS
  • Disaster Recovery Scenarios still apply if Primary site is running in AWS using AWS multi region feature.
  • Combination and variation of the below is always possible.

Disaster Recovery Scenarios Options

  1. Backup & Restore (Data backed up and restored)
  2. Pilot Light (Only Minimal critical functionalities)
  3. Warm Standby (Fully Functional Scaled down version)
  4. Multi-Site (Active-Active)

For the DR scenarios options, RTO and RPO reduces with an increase in Cost as you move from Backup & Restore option (left) to Multi-Site option (right)

Backup & Restore

AWS can be used to backup the data in a cost effective, durable and secure manner as well as recover the data quickly and reliably.

Backup phase

In most traditional environments, data is backed up to tape and sent off-site regularly taking longer time to restore the system in the event of a disruption or disasterBackup Restore - Backup Phase

  1. Amazon S3 can be used to backup the data and perform a quick restore and is also available from any location
  2. AWS Import/Export can be used to transfer large data sets by shipping storage devices directly to AWS bypassing the Internet
  3. Amazon Glacier can be used for archiving data, where retrieval time of several hours are adequate and acceptable
  4. AWS Storage Gateway enables snapshots (used to created EBS volumes) of the on-premises data volumes to be transparently copied into S3 for backup. It can be used either as a backup solution (Gateway-stored volumes) or as a primary data store (Gateway-cached volumes)
  5. AWS Direct connect can be used to transfer data directly from On-Premise to Amazon consistently and at high speed
  6. Snapshots of Amazon EBS volumes, Amazon RDS databases, and Amazon Redshift data warehouses can be stored in Amazon S3

Restore phase

Data backed up then can be used to quickly restore and create Compute and Database instances

Backup Restore - Recovery PhaseKey steps for Backup and Restore:
1. Select an appropriate tool or method to back up the data into AWS.
2. Ensure an appropriate retention policy for this data.
3. Ensure appropriate security measures are in place for this data, including encryption and access policies.
4. Regularly test the recovery of this data and the restoration of the system.

Pilot Light

In a Pilot Light Disaster Recovery scenario option a minimal version of an environment is always running in the cloud, which basically host the critical functionalities of the application for e.g. databases

In this approach :-

  1. Maintain a pilot light by configuring and running the most critical core elements of your system in AWS for e.g. Databases where the data needs to be replicated and kept updated.
  2. During recovery, a full-scale production environment, for e.g. application and web servers, can be rapidly provisioned (using preconfigured AMIs and EBS volume snapshots) around the critical core
  3. For Networking, either a ELB to distribute traffic to multiple instances and have DNS point to the load balancer or preallocated Elastic IP address with instances associated can be used
Preparation phase steps :
  1. Set up Amazon EC2 instances or RDS instances to replicate or mirror data critical data
  2. Ensure that all supporting custom software packages available in AWS.
  3. Create and maintain AMIs of key servers where fast recovery is required.
  4. Regularly run these servers, test them, and apply any software updates and configuration changes.
  5. Consider automating the provisioning of AWS resources.
Pilot Light Scenario - Preparation Phase

Recovery Phase steps :

  1. Start the application EC2 instances from your custom AMIs.
  2. Resize existing database/data store instances to process the increased traffic for e.g. If using RDS, it can be easily scaled vertically while EC2 instances can be easily scaled horizontally
  3. Add additional database/data store instances to give the DR site resilience in the data tier for e.g. turn on Multi-AZ for RDS to improve resilience.
  4. Change DNS to point at the Amazon EC2 servers.
  5. Install and configure any non-AMI based systems, ideally in an automated way.

Pilot Light Scenario - Recovery Phase

Warm Standby

  • In a Warm standby DR scenario a scaled-down version of a fully functional environment identical to the business critical systems is always running in the cloud
  • This setup can be used for testing, quality assurances or for internal use.
  • In case of an disaster, the system can be easily scaled up or out to handle production load.

Preparation phase steps :

  1. Set up Amazon EC2 instances to replicate or mirror data.
  2. Create and maintain AMIs for faster provisioning
  3. Run the application using a minimal footprint of EC2 instances or AWS infrastructure.
  4. Patch and update software and configuration files in line with your live environment.

Warm Standby - Preparation PhaseRecovery phase Steps:

  1. Increase the size of the Amazon EC2 fleets in service with the load balancer (horizontal scaling).
  2. Start applications on larger Amazon EC2 instance types as needed (vertical scaling).
  3. Either manually change the DNS records, or use Route 53 automated health checks to route all the traffic to the AWS environment.
  4. Consider using Auto Scaling to right-size the fleet or accommodate the increased load.
  5. Add resilience or scale up your database to guard against DR going down

Warm Standby - Recovery Phase

Multi-Site

  • Multi-Site is an active-active configuration DR approach, where in an identical solution runs on AWS as your on-site infrastructure.
  • Traffic can be equally distributed to both the infrastructure as needed by using DNS service weighted routing approach.
  • In case of a disaster the DNS can be tuned to send all the traffic to the AWS environment and the AWS infrastructure scaled accordingly.

Preparation phase steps :

  1. Set up your AWS environment to duplicate the production environment.
  2. Set up DNS weighting, or similar traffic routing technology, to distribute incoming requests to both sites.
  3. Configure automated failover to re-route traffic away from the affected site. for e.g. application to check if primary DB is available if not then redirect to the AWS DB
Multi-Site - Preparation Phase

Recovery phase steps :

  1. Either manually or by using DNS failover, change the DNS weighting so that all requests are sent to the AWS site.
  2. Have application logic for failover to use the local AWS database servers for all queries.
  3. Consider using Auto Scaling to automatically right-size the AWS fleet.Multi-Site - Recovery Phase

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Which of these Disaster Recovery options costs the least?
    1. Pilot Light (most systems are down and brought up only after disaster)
    2. Fully Working Low capacity Warm standby
    3. Multi site Active-Active
  2. Your company currently has a 2-tier web application running in an on-premises data center. You have experienced several infrastructure failures in the past two months resulting in significant financial losses. Your CIO is strongly agreeing to move the application to AWS. While working on achieving buy-in from the other company executives, he asks you to develop a disaster recovery plan to help improve Business continuity in the short term. He specifies a target Recovery Time Objective (RTO) of 4 hours and a Recovery Point Objective (RPO) of 1 hour or less. He also asks you to implement the solution within 2 weeks. Your database is 200GB in size and you have a 20Mbps Internet connection. How would you do this while minimizing costs?
    1. Create an EBS backed private AMI which includes a fresh install or your application. Setup a script in your data center to backup the local database every 1 hour and to encrypt and copy the resulting file to an S3 bucket using multi-part upload (while AMI is a right approach to keep cost down, Upload to S3 very Slow)
    2. Install your application on a compute-optimized EC2 instance capable of supporting the application’s average load synchronously replicate transactions from your on-premises database to a database instance in AWS across a secure Direct Connect connection. (EC2 running in Compute Optimized as well as Direct Connect is expensive to start with also Direct Connect cannot be implemented in 2 weeks)
    3. Deploy your application on EC2 instances within an Auto Scaling group across multiple availability zones asynchronously replicate transactions from your on-premises database to a database instance in AWS across a secure VPN connection. (While VPN can be setup quickly asynchronous replication using VPN would work, running instances in DR is expensive)
    4. Create an EBS backed private AMI that includes a fresh install of your application. Develop a Cloud Formation template which includes your AMI and the required EC2. Auto-Scaling and ELB resources to support deploying the application across Multiple Availability Zones. Asynchronously replicate transactions from your on-premises database to a database instance in AWS across a secure VPN connection. (Pilot Light approach with only DB running and replicate while you have preconfigured AMI and autoscaling config)
  3. You are designing an architecture that can recover from a disaster very quickly with minimum down time to the end users. Which of the following approaches is best?
    1. Leverage Route 53 health checks to automatically fail over to backup site when the primary site becomes unreachable
    2. Implement the Pilot Light DR architecture so that traffic can be processed seamlessly in case the primary site becomes unreachable
    3. Implement either Fully Working Low Capacity Standby or Multi-site Active-Active architecture so that the end users will not experience any delay even if the primary site becomes unreachable
    4. Implement multi-region architecture to ensure high availability
  4. Your customer wishes to deploy an enterprise application to AWS that will consist of several web servers, several application servers and a small (50GB) Oracle database. Information is stored, both in the database and the file systems of the various servers. The backup system must support database recovery, whole server and whole disk restores, and individual file restores with a recovery time of no more than two hours. They have chosen to use RDS Oracle as the database. Which backup architecture will meet these requirements?
    1. Backup RDS using automated daily DB backups. Backup the EC2 instances using AMIs and supplement with file-level backup to S3 using traditional enterprise backup software to provide file level restore (RDS automated backups with file-level backups can be used)
    2. Backup RDS using a Multi-AZ Deployment Backup the EC2 instances using AMIs, and supplement by copying file system data to S3 to provide file level restore (Multi-AZ is more of an Disaster recovery solution)
    3. Backup RDS using automated daily DB backups. Backup the EC2 instances using EBS snapshots and supplement with file-level backups to Amazon Glacier using traditional enterprise backup software to provide file level restore (Glacier not an option with the 2 hours RTO)
    4. Backup RDS database to S3 using Oracle RMAN. Backup the EC2 instances using AMIs, and supplement with EBS snapshots for individual volume restore. (Will use RMAN only if Database hosted on EC2 and not when using RDS)
  5. Which statements are true about the Pilot Light Disaster recovery architecture pattern?
    1. Pilot Light is a hot standby (Cold Standby)
    2. Enables replication of all critical data to AWS
    3. Very cost-effective DR pattern
    4. Can scale the system as needed to handle current production load
  6. An ERP application is deployed across multiple AZs in a single region. In the event of failure, the Recovery Time Objective (RTO) must be less than 3 hours, and the Recovery Point Objective (RPO) must be 15 minutes. The customer realizes that data corruption occurred roughly 1.5 hours ago. What DR strategy could be used to achieve this RTO and RPO in the event of this kind of failure?
    1. Take hourly DB backups to S3, with transaction logs stored in S3 every 5 minutes
    2. Use synchronous database master-slave replication between two availability zones. (Replication won’t help to backtrack and would be sync always)
    3. Take hourly DB backups to EC2 Instance store volumes with transaction logs stored In S3 every 5 minutes. (Instance store not a preferred storage)
    4. Take 15 minute DB backups stored in Glacier with transaction logs stored in S3 every 5 minutes. (Glacier does not meet the RTO)
  7. Your company’s on-premises content management system has the following architecture:
    – Application Tier – Java code on a JBoss application server
    – Database Tier – Oracle database regularly backed up to Amazon Simple Storage Service (S3) using the Oracle RMAN backup utility
    – Static Content – stored on a 512GB gateway stored Storage Gateway volume attached to the application server via the iSCSI interfaceWhich AWS based disaster recovery strategy will give you the best RTO?

    1. Deploy the Oracle database and the JBoss app server on EC2. Restore the RMAN Oracle backups from Amazon S3. Generate an EBS volume of static content from the Storage Gateway and attach it to the JBoss EC2 server.
    2. Deploy the Oracle database on RDS. Deploy the JBoss app server on EC2. Restore the RMAN Oracle backups from Amazon Glacier. Generate an EBS volume of static content from the Storage Gateway and attach it to the JBoss EC2 server. (Glacier does help to give best RTO)
    3. Deploy the Oracle database and the JBoss app server on EC2. Restore the RMAN Oracle backups from Amazon S3. Restore the static content by attaching an AWS Storage Gateway running on Amazon EC2 as an iSCSI volume to the JBoss EC2 server. (No need to attach the Storage Gateway as an iSCSI volume can just create a EBS volume)
    4. Deploy the Oracle database and the JBoss app server on EC2. Restore the RMAN Oracle backups from Amazon S3. Restore the static content from an AWS Storage Gateway-VTL running on Amazon EC2 (VTL is Virtual Tape library and doesn’t fit the RTO)

References