AWS Certification – Security & Identity Services – Cheat Sheet

IAM

  • securely control access to AWS services and resources
  • helps create and manage user identities and grant permissions for those users to access AWS resources
  • helps create groups for multiple users with similar permissions
  • not appropriate for application authentication
  • is Global and does not need to be migrated to a different region
  • helps define Policies
    • written in JSON format
    • all permissions are implicitly denied by default
    • an explicit deny overrides any allow, i.e. the most restrictive policy wins
  • IAM Role
    • helps grant and delegate access to users and services without the need to create permanent credentials (a minimal assume-role sketch follows this list)
    • IAM users or AWS services can assume a role to obtain temporary security credentials that can be used to make AWS API calls
    • needs a Trust policy to define who can assume the role and a Permission policy to define what the user or service can access
    • used with the Security Token Service (STS), a lightweight web service that provides temporary, limited-privilege credentials for IAM users or for authenticated federated users
    • IAM role scenarios
      • Service access, e.g. EC2 accessing S3 or DynamoDB
      • Cross Account access for users
        • with a user within the same account
        • with a user within another AWS account owned by the same owner
        • with a user from a Third Party AWS account, using an External ID for enhanced security
      • Identity Providers & Federation
        • Web Identity Federation, where the user can be authenticated using external authentication Identity Providers like Amazon, Google or any OpenID Connect (OIDC) compatible IdP using AssumeRoleWithWebIdentity
        • Identity Provider using SAML 2.0, where the user can be authenticated using on-premises Active Directory, OpenLDAP or any SAML 2.0 compliant IdP using AssumeRoleWithSAML
        • For other Identity Providers, use an Identity Broker to authenticate and provide temporary credentials using AssumeRole (recommended) or GetFederationToken
  • IAM Best Practices
    • Do not use Root account for anything other than billing
    • Create Individual IAM users
    • Use groups to assign permissions to IAM users
    • Grant least privilege
    • Use IAM roles for applications on EC2
    • Delegate using roles instead of sharing credentials
    • Rotate credentials regularly
    • Use Policy conditions for increased granularity
    • Use CloudTrail to keep a history of activity
    • Enforce a strong IAM password policy for IAM users
    • Remove all unused users and credentials
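
A minimal boto3 sketch of assuming a role via STS for temporary credentials, as referenced in the IAM Role bullets above (the role ARN, session name, and external ID are hypothetical placeholders):

    import boto3

    sts = boto3.client("sts")

    # Assume a cross-account role; ExternalId is only needed for third-party account access
    resp = sts.assume_role(
        RoleArn="arn:aws:iam::111122223333:role/ExampleCrossAccountRole",  # hypothetical role
        RoleSessionName="cheat-sheet-demo",
        ExternalId="example-external-id",   # hypothetical, for third-party accounts
        DurationSeconds=3600,               # temporary credentials valid for 1 hour
    )
    creds = resp["Credentials"]

    # Use the temporary credentials to call other AWS services, e.g. S3
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    print(s3.list_buckets()["Buckets"])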

CloudHSM

  • provides secure cryptographic key storage to customers by making hardware security modules (HSMs) available in the AWS cloud
  • single tenant, dedicated physical device to securely generate, store, and manage cryptographic keys used for data encryption
  • HSMs are provisioned inside a VPC (not EC2-Classic) and are isolated from the rest of the network
  • can use VPC peering to connect to CloudHSM from multiple VPCs
  • integrated with Amazon Redshift and Amazon RDS for Oracle
  • EBS volume encryption, S3 object encryption and key management can be done with CloudHSM but requires custom application scripting
  • is NOT fault tolerant on its own; a cluster of HSMs needs to be built, as all the keys are lost if a single HSM fails
  • expensive; prefer AWS Key Management Service (KMS) if cost is a criterion

AWS Directory Services

  • gives applications in AWS access to Active Directory services
  • different from SAML + AD, where the access is granted to AWS services through Temporary Credentials
  • Simple AD
    • least expensive but does not support Microsoft AD advanced features
    • provides a Samba 4 Microsoft Active Directory compatible standalone directory service on AWS
    • No single point of Authentication or Authorization, as a separate copy is maintained
    • trust relationships cannot be setup between Simple AD and other Active Directory domains
    • don’t use it if the requirement is to leverage access and control through a centralized authentication service
  • AD Connector
    • acts as a hosted proxy service for instances in AWS to connect to on-premises Active Directory
    • enables consistent enforcement of existing security policies, such as password expiration, password history, and account lockouts, whether users are accessing resources on-premises or in the AWS cloud
    • needs VPN connectivity (or Direct Connect)
    • integrates with existing RADIUS-based MFA solutions to enable multi-factor authentication
    • does not cache data, which might lead to latency
  • Read-only Domain Controllers (RODCs)
    • works out as a Read-only Active Directory
    • holds a copy of the Active Directory Domain Services (AD DS) database and responds to authentication requests
    • they cannot be written to and are typically deployed in locations where physical security cannot be guaranteed
    • helps maintain a single point for authentication & authorization controls; however, it needs to be synced
  • Writable Domain Controllers
    • are expensive to setup
    • operate in a multi-master model; changes can be made on any writable server in the forest, and those changes are replicated to servers throughout the entire forest

AWS WAF

  • is a web application firewall that helps monitor the HTTP/HTTPS requests forwarded to CloudFront and allows controlling access to the content.
  • helps define Web ACLs, which are combinations of Rules; each Rule is a combination of Conditions and an Action to block or allow (a minimal Web ACL sketch follows this list)
  • Third Party WAF
    • act as filters that apply a set of rules to web traffic to cover exploits like XSS and SQL injection, and also help build resiliency against DDoS by mitigating HTTP GET or POST floods
    • WAFs provide many features like OWASP Top 10 coverage, HTTP rate limiting, whitelisting or blacklisting, inspecting and identifying requests with abnormal patterns, CAPTCHA, etc.
    • a WAF sandwich pattern can be implemented where an auto scaled WAF layer sits between an Internet-facing load balancer and an internal load balancer fronting the application
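
The Web ACL bullet above describes the classic WAF model (Rules built from Conditions); a minimal sketch with the current WAFv2 API in boto3 is shown below, assuming an AWS managed rule group (the ACL name and metric names are hypothetical):

    import boto3

    # Scope CLOUDFRONT requires the us-east-1 region
    waf = boto3.client("wafv2", region_name="us-east-1")

    waf.create_web_acl(
        Name="demo-web-acl",                 # hypothetical name
        Scope="CLOUDFRONT",
        DefaultAction={"Allow": {}},         # allow unless a rule blocks
        Rules=[{
            "Name": "aws-common-rules",
            "Priority": 0,
            "Statement": {
                "ManagedRuleGroupStatement": {
                    "VendorName": "AWS",
                    "Name": "AWSManagedRulesCommonRuleSet",  # OWASP-style protections
                }
            },
            "OverrideAction": {"None": {}},
            "VisibilityConfig": {
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "demoCommonRules",
            },
        }],
        VisibilityConfig={
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "demoWebAcl",
        },
    )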

 

AWS Certification – Compute Services – Cheat Sheet

EC2

  • provides scalable computing capacity
  • Features
    • Virtual computing environments, known as EC2 instances
    • Preconfigured templates for EC2 instances, known as Amazon Machine Images (AMIs), that package the bits needed for the server (including the operating system and additional software)
    • Various configurations of CPU, memory, storage, and networking capacity for your instances, known as Instance types
    • Secure login information for your instances using key pairs (public-private keys where private is kept by user)
    • Storage volumes for temporary data that’s deleted when you stop or terminate your instance, known as Instance store volumes
    • Persistent storage volumes for data using Elastic Block Store (EBS)
    • Multiple physical locations for your resources, such as instances and EBS volumes, known as Regions and Availability Zones
    • A firewall to specify the protocols, ports, and source IP ranges that can reach your instances using Security Groups
    • Static IP addresses, known as Elastic IP addresses
    • Metadata, known as tags, can be created and assigned to EC2 resources
    • Virtual networks that are logically isolated from the rest of the AWS cloud, and can optionally connect to on premises network, known as Virtual private clouds (VPCs)
  • Amazon Machine Image
    • template from which EC2 instances can be launched quickly
    • does NOT span across regions, and needs to be copied
    • can be shared with other specific AWS accounts or made public
  • Purchasing Option
    • On-Demand Instances
      • pay for instances and compute capacity that you use by the hour
      • with no long-term commitments or up-front payments
    • Reserved Instances
      • provides lower hourly running costs by providing a billing discount
      • capacity reservation that is applied to instances
      • suited if consistent, heavy, predictable usage
      • provides benefits with Consolidate Billing
      • can be modified to switch Availability Zones or the instance size within the same instance type, given the instance size footprint (Normalization factor) remains the same
      • pay for the entire term regardless of usage, so if the question targets a cost-effective solution and an answer mentions reserved instances that are purchased & left unused, that option can be ignored
    • Spot Instances
      • cost-effective choice but does NOT guarantee availability
      • applications flexible in the timing when they can run and also able to handle interruption by storing the state externally
      • AWS gives a two-minute warning before the instance is terminated, so that any unsaved work can be saved
    • Dedicated Instances, is a tenancy option which enables instances to run in VPC on hardware that’s isolated, dedicated to a single customer
    • Light, Medium, and Heavy Utilization Reserved Instances are no longer available for purchase and were part of the Previous Generation AWS EC2 purchasing model
  • Enhanced Networking
    • results in higher bandwidth, higher packet per second (PPS) performance, lower latency, consistency, scalability and lower jitter
    • supported using Single Root I/O Virtualization (SR-IOV) only on supported instance types
    • is supported only within a VPC (not EC2-Classic), requires the HVM virtualization type, and is available by default on the Amazon Linux AMI but can be installed on other AMIs as well
  • Placement Group
    • provides low latency, High Performance Computing via a 10 Gbps network
    • is a logical grouping of instances within a single AZ
    • doesn’t span Availability Zones; can span multiple subnets, but the subnets must be in the same AZ
    • can span peered VPCs within the same Availability Zone
    • existing instances cannot be moved into an existing placement group
    • for capacity errors, stop and start the instances in the placement group
    • use homogeneous instance types which support enhanced networking, and launch all the instances at once (a minimal launch sketch follows this section)
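
A minimal boto3 sketch for the placement group notes above (the AMI ID, key pair, and group name are hypothetical placeholders):

    import boto3

    ec2 = boto3.client("ec2")

    # Cluster placement groups keep instances close together in a single AZ
    ec2.create_placement_group(GroupName="hpc-group", Strategy="cluster")

    # Launch all instances at once, using a homogeneous instance type
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # hypothetical AMI
        InstanceType="c5.large",           # supports enhanced networking
        MinCount=4,
        MaxCount=4,
        KeyName="my-key-pair",             # hypothetical key pair
        Placement={"GroupName": "hpc-group"},
    )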

EBS

Elastic Load Balancer & Auto Scaling

  • Elastic Load Balancer
    • Managed load balancing service and scales automatically
    • distributes incoming application traffic across multiple EC2 instances
    • is a distributed, fault-tolerant system that is actively monitored by AWS and scaled as per demand
    • are engineered to not be a single point of failure
    • needs to be pre-warmed (by contacting AWS) if demand is expected to spike, especially during load testing
    • supports routing traffic to instances in multiple AZs in the same region
    • performs Health Checks to route traffic only to the healthy instances
    • support Listeners with HTTP, HTTPS, SSL, TCP protocols
    • has an associated IPv4 and dual stack DNS name
    • can offload the work of encryption and decryption (SSL termination) so that the EC2 instances can focus on their main work
    • supports Cross Zone load balancing to help route traffic evenly across all EC2 instances regardless of the AZs they reside in
    • to help identify the IP address of a client
      • supports Proxy Protocol header for TCP/SSL connections
      • supports X-Forward headers for HTTP/HTTPS connections
    • supports Sticky Sessions (session affinity) to bind a user’s session to a specific application instance
      • it is not fault tolerant; if an instance is lost, the session information is lost
      • requires an HTTP/HTTPS listener and does not work with TCP
      • requires SSL termination on the ELB, as it relies on the HTTP headers/cookies
    • supports Connection draining to help complete the in-flight requests in case an instance is deregistered
    • For High Availability, it is recommended to attach one subnet per AZ for at least two AZs, even if the instances are in a single subnet.
    • cannot assign an Elastic IP address to an ELB
    • supports IPv4 & IPv6; however, IPv6 is not supported within a VPC
    • HTTPS listener does not support Client Side Certificate
    • for SSL termination at backend instances or support for Client Side Certificate use TCP for connections from the client to the ELB, use the SSL protocol for connections from the ELB to the back-end application, and deploy certificates on the back-end instances handling requests
    • supports a single SSL certificate, so multiple ELBs need to be created for multiple SSL certificates
  • Auto Scaling
    • ensures correct number of EC2 instances are always running to handle the load by scaling up or down automatically as demand changes
    • cannot span multiple regions.
    • attempts to distribute instances evenly between the AZs that are enabled for the Auto Scaling group
    • performs health checks, using either EC2 status checks or ELB health checks, to determine the health of an instance, and terminates unhealthy instances so new instances can be launched
    • can be scaled using manual scaling, scheduled scaling or demand based scaling
    • cooldown period helps ensure instances are not launched or terminated before the previous scaling activity takes effect to allow the newly launched instances to start handling traffic and reduce load
  • Auto Scaling & ELB can be used for High Availability and Redundancy by spanning Auto Scaling groups across multiple AZs within a region and then setting up ELB to distribute incoming traffic across those AZs
  • With Auto Scaling, use the ELB health check with the instances to ensure that traffic is routed only to the healthy instances (a minimal sketch follows this list)
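
A minimal boto3 sketch of the Auto Scaling + ELB health check pattern above (the AMI ID, subnets, and the pre-created classic ELB name are hypothetical placeholders):

    import boto3

    asg = boto3.client("autoscaling")

    # Launch configuration describing the instances to launch
    asg.create_launch_configuration(
        LaunchConfigurationName="web-lc",
        ImageId="ami-0123456789abcdef0",        # hypothetical AMI
        InstanceType="t3.micro",
    )

    # Auto Scaling group spanning two AZs, attached to an existing classic ELB,
    # using the ELB health check to replace unhealthy instances
    asg.create_auto_scaling_group(
        AutoScalingGroupName="web-asg",
        LaunchConfigurationName="web-lc",
        MinSize=2,
        MaxSize=6,
        DesiredCapacity=2,
        VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",  # one subnet per AZ
        LoadBalancerNames=["web-elb"],          # hypothetical, pre-created ELB
        HealthCheckType="ELB",
        HealthCheckGracePeriod=300,
    )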

AWS Certification – Storage & Content Delivery – Cheat Sheet

Elastic Block Store – EBS

  • is virtual, network-attached block storage
  • volumes CANNOT be shared with multiple EC2 instances, use EFS instead
  • persists and is independent of EC2 lifecycle
  • multiple volumes can be attached to a single EC2 instance
  • can be detached & attached to another EC2 instance in that same AZ only
  • volumes are created in a specific AZ and CANNOT span across AZs
  • snapshots CANNOT span across regions
  • to make a volume available in a different AZ, create a snapshot of the volume and restore it to a new volume in any AZ within the region
  • to make a volume available in a different Region, copy the snapshot of the volume to the other region and restore it there as a new volume (a minimal snapshot sketch follows this list)
  • provides high durability and are redundant in an AZ, as the data is automatically replicated within that AZ to prevent data loss due to any single hardware component failure
  • PIOPS (Provisioned IOPS) is designed for transactional applications that require high and consistent I/O, e.g. relational databases, NoSQL, etc.
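
A minimal boto3 sketch of moving an EBS volume across AZs and regions via snapshots, as referenced above (volume ID, regions, and AZs are hypothetical placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Snapshot the source volume
    snap = ec2.create_snapshot(VolumeId="vol-0123456789abcdef0",
                               Description="migration snapshot")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # Restore into a different AZ within the same region
    ec2.create_volume(SnapshotId=snap["SnapshotId"], AvailabilityZone="us-east-1b")

    # Copy the snapshot to another region, then restore it there
    ec2_west = boto3.client("ec2", region_name="us-west-2")
    copy = ec2_west.copy_snapshot(SourceRegion="us-east-1",
                                  SourceSnapshotId=snap["SnapshotId"])
    ec2_west.get_waiter("snapshot_completed").wait(SnapshotIds=[copy["SnapshotId"]])
    ec2_west.create_volume(SnapshotId=copy["SnapshotId"], AvailabilityZone="us-west-2a")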

S3

  • Key-value based object storage for the internet, with unlimited storage and unlimited objects of up to 5 TB each
  • is an Object level storage (not a Block level storage) and cannot be used to host an OS or dynamic websites (but can work with the JavaScript SDK)
  • provides durability by redundantly storing objects on multiple facilities within a region
  • support SSL encryption of data in transit and data encryption at rest
  • regularly verifies the integrity of data using checksums and provides auto healing capability
  • integrates with CloudTrail, CloudWatch and SNS for event notifications
  • S3 resources
    • consists of bucket and objects stored in the bucket which can be retrieved via a unique, developer-assigned key
    • bucket names are globally unique
    • data model is a flat structure with no hierarchies or folders
    • Logical hierarchy can be inferred using the keyname prefix e.g. Folder1/Object1
  • Bucket & Object Operations
    • list operations return up to 1000 objects per call with pagination support, and are NOT suited for list or prefix queries over a large number of objects
    • a single PUT operation can upload an object of up to 5 GB
    • use Multipart Upload to upload large objects up to 5 TB; it is recommended for objects over 100 MB for fault-tolerant uploads
    • supports the Range HTTP header to retrieve partial objects, for fault-tolerant downloads where network connectivity is poor
    • Pre-Signed URLs can also be shared for uploading/downloading objects for a limited time without requiring AWS security credentials (a minimal sketch follows this S3 section)
    • allows deletion of a single object or multiple objects (max 1000) in a single call
  • Multipart Uploads allows
    • parallel uploads with improved throughput and bandwidth utilization
    • fault tolerance and quick recovery from network issues
    • ability to pause and resume uploads
    • begin an upload before the final object size is known
  • Versioning
    • allows preserve, retrieve, and restore every version of every object
    • protects individual files but does NOT protect from Bucket deletion
  • Storage tiers
    • Standard
      • default storage class
      • 99.999999999% durability & 99.99% availability
      • Low latency and high throughput performance
      • designed to sustain the concurrent loss of data in two facilities
    • Standard IA
      • optimized for long-lived and less frequently accessed data
      • designed to sustain the concurrent loss of data in two facilities
      • 99.999999999% durability & 99.9% availability
      • suitable for objects greater than 128 KB kept for at least 30 days
    • Reduced Redundancy Storage
      • designed for noncritical, reproducible data stored at lower levels of redundancy than the STANDARD storage class
      • reduces storage costs
      • 99.99% durability & 99.99% availability
      • designed to sustain the loss of data in a single facility
    • Glacier
      • suitable for archiving data where data access is infrequent and a retrieval time of several (3-5) hours is acceptable
      • 99.999999999% durability
  • allows Lifecycle Management policies
    • transition to move objects to different storage classes and Glacier
    • expiration to remove objects
  • Data Consistency Model
    • provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES
    • for new objects, synchronously stores data across multiple facilities before returning success
    • updates to a single key are atomic
  • Security
    • IAM policies – grant users within your own AWS account permission to access S3 resources
    • Bucket and Object ACL – grant other AWS accounts (not specific users) access to  S3 resources
    • Bucket policies – allows to add or deny permissions across some or all of the objects within a single bucket
  • Data Protection – Pending
  • Best Practices
    • use a random hash prefix for keys to ensure a random access pattern; as S3 stores objects lexicographically, randomness helps distribute the contents across multiple partitions for better performance
    • use parallel threads and Multipart upload for faster writes
    • use parallel threads and Range Header GET for faster reads
    • for list operations with a large number of objects, it’s better to build a secondary index in DynamoDB
    • use Versioning to protect from unintended overwrites and deletions, but this does not protect against bucket deletion
    • use VPC S3 Endpoints with a VPC to transfer data over the Amazon internal network
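
A minimal boto3 sketch of the Multipart Upload and Pre-Signed URL points above (bucket, key, and file names are hypothetical placeholders):

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # Multipart upload kicks in automatically above the configured threshold,
    # with parallel threads for better throughput
    config = TransferConfig(multipart_threshold=100 * 1024 * 1024,  # 100 MB
                            max_concurrency=8)
    s3.upload_file("backup.tar.gz", "example-bucket", "backups/backup.tar.gz",
                   Config=config)

    # Pre-signed URL allowing a download for a limited time (1 hour)
    # without sharing AWS credentials
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "example-bucket", "Key": "backups/backup.tar.gz"},
        ExpiresIn=3600,
    )
    print(url)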

Glacier

  • suitable for archiving data, where data access is infrequent and a retrieval time of several hours (3 to 5 hours) is acceptable (not true anymore with enhancements from AWS)
  • provides high durability by storing archives in multiple facilities and on multiple devices at a very low storage cost
  • performs regular, systematic data integrity checks and is built to be automatically self healing
  • aggregate files into bigger files before sending them to Glacier, and use range retrievals to retrieve partial files and reduce costs
  • improve speed and reliability with multipart upload
  • automatically encrypts the data using AES-256
  • upload or download data to Glacier via SSL encrypted endpoints

CloudFront

  • provides low latency and high data transfer speeds for distribution of static, dynamic web or streaming content to web users
  • delivers the content through a worldwide network of data centers called Edge Locations
  • keeps persistent connections with the origin servers so that the files can be fetched from the origin servers as quickly as possible.
  • dramatically reduces the number of network hops that users’ requests must pass through
  • supports multiple origin server options, like AWS hosted services, e.g. S3, EC2, ELB, or an on-premises server, which stores the original, definitive version of the objects
  • single distribution can have multiple origins and Path pattern in a cache behavior determines which requests are routed to the origin
  • supports Web Download distribution and RTMP Streaming distribution
    • Web distribution supports static and dynamic web content, on-demand video using progressive download & HLS, and live streaming video content
    • RTMP supports streaming of media files using Adobe Media Server and the Adobe Real-Time Messaging Protocol (RTMP) ONLY
  • supports HTTPS using either
    • dedicated IP address, which is expensive as dedicated IP address is assigned to each CloudFront edge location
    • Server Name Indication (SNI), which is free but supported by modern browsers only with the domain name available in the request header
  • For E2E HTTPS connection,
    • Viewers -> CloudFront needs a certificate issued by a trusted CA or ACM (self-signed certificates are not trusted by browsers)
    • CloudFront -> Origin needs a certificate issued by ACM or a trusted CA for ELB origins, and by a trusted CA for other custom origins
  •  Security
    • Origin Access Identity (OAI) can be used to restrict the content from S3 origin to be accessible from CloudFront only
    • supports Geo restriction (Geo-Blocking) to whitelist or blacklist countries that can access the content
    • Signed URLs (a URL-signing sketch follows this section)
      • for RTMP distribution as signed cookies aren’t supported
      • to restrict access to individual files, for e.g., an installation download for your application.
      • users using a client, for e.g. a custom HTTP client, that doesn’t support cookies
    • Signed Cookies
      • provide access to multiple restricted files, for e.g., video part files in HLS format or all of the files in the subscribers’ area of a website.
      • don’t want to change the current URLs
    • integrates with AWS WAF, a web application firewall that helps protect web applications from attacks by allowing rules configured based on IP addresses, HTTP headers, and custom URI strings
  • supports GET, HEAD, OPTIONS, PUT, POST, PATCH, DELETE to get object & object headers, add, update, and delete objects
    • only caches responses to GET and HEAD requests and, optionally, OPTIONS requests
    • does not cache responses to PUT, POST, PATCH, DELETE request methods and these requests are proxied back to the origin
  • object removal from cache
    • would be removed upon expiry (TTL) from the cache, by default 24 hrs
    • can be invalidated explicitly, though this has an associated cost; users might continue to see the old version until it expires from those caches
    • objects can be invalidated only for Web distribution
    • change object name, versioning, to serve different version
  • supports adding or modifying custom headers before the request is sent to origin which can be used to
    • validate if user is accessing the content from CDN
    • identifying the CDN from which the request was forwarded, in case of multiple CloudFront distributions
    • for viewers not supporting CORS to return the Access-Control-Allow-Origin header for every request
  • supports Partial GET requests using range header to download object in smaller units improving the efficiency of partial downloads and recovery from partially failed transfers
  • supports compression to compress and serve compressed files when viewer requests include Accept-Encoding: gzip in the request header
  • supports different price classes: to include all regions, to include only the least expensive regions, or to exclude the most expensive regions
  • supports access logs which contain detailed information about every user request for both web and RTMP distribution
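
A minimal signed URL sketch for the CloudFront access controls above, using botocore’s CloudFrontSigner and the cryptography package (the key pair ID, private key path, and distribution URL are hypothetical placeholders):

    import datetime

    from botocore.signers import CloudFrontSigner
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import padding


    def rsa_signer(message):
        # Sign with the private key of a CloudFront trusted key pair
        with open("cloudfront_private_key.pem", "rb") as f:     # hypothetical path
            key = serialization.load_pem_private_key(f.read(), password=None)
        return key.sign(message, padding.PKCS1v15(), hashes.SHA1())


    signer = CloudFrontSigner("KXXXXXXXXXXXXX", rsa_signer)     # hypothetical key pair ID

    # Canned policy: URL valid for 1 hour
    signed_url = signer.generate_presigned_url(
        "https://d111111abcdef8.cloudfront.net/private/installer.zip",
        date_less_than=datetime.datetime.utcnow() + datetime.timedelta(hours=1),
    )
    print(signed_url)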

AWS Import/Export

  • accelerates moving large amounts of data into and out of AWS using portable storage devices for transport and transfers data directly using Amazon’s high speed internal network, bypassing the internet.
  • suitable for use cases with
    • large datasets
    • low bandwidth connections
    • first time migration of data
  • Importing data to several types of AWS storage, including EBS snapshots, S3 buckets, and Glacier vaults.
  • Exporting data out from S3 only; with versioning enabled, only the latest version is exported
  • Import data can be encrypted (optional but recommended), while exported data is always encrypted using TrueCrypt
  • Amazon will wipe the device if specified, however it will not destroy the device

 

AWS Key Management Service – KMS – Certification

AWS Key Management Service – KMS

  • AWS KMS is a managed encryption service that makes it easy to encrypt data (a minimal encrypt/decrypt sketch follows this list)
  • KMS provides a highly available key storage, management, and auditing solution to encrypt data across AWS services & within applications
  • KMS is seamlessly integrated with several other AWS services to make encrypting data in those services easy
  • KMS keys are only stored and used in the region in which they are created. They cannot be transferred to another region
  • KMS enforces usage and management policies to control which IAM users and roles, from your account or other accounts, can manage and use keys
  • KMS is integrated with CloudTrail, so all requests to use the keys are logged to understand who used which key when
  • KMS allows rotation of the keys,
    • if keys generated by KMS are rotated automatically by KMS, data does not need to be re-encrypted. KMS keeps previous versions of keys to use for decryption of data encrypted under an old version of a key. All new encryption requests against a key in AWS KMS are encrypted under the newest version of the key.
    • if manually rotated, data has to be re-encrypted depending on the application’s configuration
    • Automatic key rotation is not supported for imported keys
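
A minimal boto3 encrypt/decrypt sketch, as referenced in the list above (the key alias is a hypothetical placeholder; direct Encrypt/Decrypt calls are limited to data of up to 4 KB):

    import boto3

    kms = boto3.client("kms")

    # Encrypt a small secret (< 4 KB) directly under a customer master key
    ciphertext = kms.encrypt(
        KeyId="alias/app-secrets",        # hypothetical key alias
        Plaintext=b"database-password",
    )["CiphertextBlob"]

    # Decrypt later; KMS resolves the key from the ciphertext metadata
    plaintext = kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
    assert plaintext == b"database-password"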

KMS Working

  • KMS centrally manages and securely stores the keys
  • Keys can be generated or imported from your key management infrastructure
  • Keys can be used from within applications and supported AWS services to protect the data, but the master key never leaves AWS KMS.
  • Data is submitted to AWS KMS to be encrypted, or decrypted, under keys that you control.
  • Usage policies on these keys can be set that determine which users can use them to encrypt and decrypt data.

Envelope encryption

  • AWS cloud services integrated with AWS KMS use a method called envelope encryption to protect the data.
  • Envelope encryption is an optimized method for encrypting data that uses two different keys
  • With Envelope encryption (see the sketch after this list)
    • A data key is generated and used by the AWS service to encrypt each piece of data or resource.
    • Data key is encrypted under a master key that you define in AWS KMS.
    • Encrypted data key is then stored by the AWS service.
    • For data decryption by the AWS service, the encrypted data key is passed to AWS KMS and decrypted under the master key it was originally encrypted under, so the service can then decrypt your data.
  • While KMS does support sending data of less than 4 KB to be encrypted directly, envelope encryption can offer significant performance benefits
  • When data is encrypted directly with KMS, it must be transferred over the network.
  • Envelope encryption reduces the network load for the application or AWS cloud service, as only the request and fulfillment of the data key through KMS must go over the network
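
A minimal envelope encryption sketch with boto3, as referenced in the list above (the key alias is a hypothetical placeholder, and the cryptography package’s Fernet stands in for whatever local cipher the service or application actually uses):

    import base64

    import boto3
    from cryptography.fernet import Fernet

    kms = boto3.client("kms")

    # 1. Generate a data key under the master key; only the encrypted copy is stored
    key = kms.generate_data_key(KeyId="alias/app-data", KeySpec="AES_256")  # hypothetical alias
    plaintext_key = key["Plaintext"]          # use locally, then discard
    encrypted_key = key["CiphertextBlob"]     # store next to the encrypted data

    # 2. Encrypt the data locally with the plaintext data key
    fernet = Fernet(base64.urlsafe_b64encode(plaintext_key))
    encrypted_data = fernet.encrypt(b"large payload that never goes to KMS")

    # 3. To decrypt, ask KMS to unwrap the stored data key, then decrypt locally
    restored_key = kms.decrypt(CiphertextBlob=encrypted_key)["Plaintext"]
    original = Fernet(base64.urlsafe_b64encode(restored_key)).decrypt(encrypted_data)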

KMS Features

  • Create keys with a unique alias and description
  • Import your own keys
  • Control which IAM users and roles can manage keys
  • Control which IAM users and roles can use keys to encrypt & decrypt data
  • Choose to have AWS KMS automatically rotate keys on an annual basis
  • Temporarily disable keys so they cannot be used by anyone
  • Re-enable disabled keys
  • Delete keys that you no longer use
  • Audit use of keys by inspecting logs in AWS CloudTrail

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. You are designing a personal document-archiving solution for your global enterprise with thousands of employees. Each employee has potentially gigabytes of data to be backed up in this archiving solution. The solution will be exposed to the employees as an application, where they can just drag and drop their files to the archiving system. Employees can retrieve their archives through a web interface. The corporate network has high bandwidth AWS DirectConnect connectivity to AWS. You have regulatory requirements that all data needs to be encrypted before being uploaded to the cloud. How do you implement this in a highly available and cost efficient way?
    1. Manage encryption keys on-premise in an encrypted relational database. Set up an on-premises server with sufficient storage to temporarily store files and then upload them to Amazon S3, providing a client-side master key. (Storing temporary increases cost and not a high availability option)
    2. Manage encryption keys in a Hardware Security Module (HSM) appliance on-premise server with sufficient storage to temporarily store, encrypt, and upload files directly into amazon Glacier. (Not cost effective)
    3. Manage encryption keys in amazon Key Management Service (KMS), upload to amazon simple storage service (s3) with client-side encryption using a KMS customer master key ID and configure Amazon S3 lifecycle policies to store each object using the amazon glacier storage tier. (With CSE-KMS the encryption happens at client side before the object is upload to S3 and KMS is cost effective as well)
    4. Manage encryption keys in an AWS CloudHSM appliance. Encrypt files prior to uploading on the employee desktop and then upload directly into amazon glacier (Not cost effective)
  2. An AWS customer is deploying an application that is composed of an Auto Scaling group of EC2 Instances. The customer’s security policy requires that every outbound connection from these instances to any other service within the customer’s Virtual Private Cloud must be authenticated using a unique X.509 certificate that contains the specific instance-id. In addition, an X.509 certificate must be signed by the customer’s Key management service in order to be trusted for authentication.
    Which of the following configurations will support these requirements?

    1. Configure an IAM Role that grants access to an Amazon S3 object containing a signed certificate and configure the Auto Scaling group to launch instances with this role. Have the instances bootstrap get the certificate from Amazon S3 upon first boot.
    2. Embed a certificate into the Amazon Machine Image that is used by the Auto Scaling group. Have the launched instances generate a certificate signature request with the instance’s assigned instance-id to the Key management service for signature.
    3. Configure the Auto Scaling group to send an SNS notification of the launch of a new instance to the trusted key management service. Have the Key management service generate a signed certificate and send it directly to the newly launched instance.
    4. Configure the launched instances to generate a new certificate upon first boot. Have the Key management service poll the AutoScaling group for associated instances and send new instances a certificate signature that contains the specific instance-id.

IAM Role – Identity Providers and Federation – Certification

IAM Role – Identity Providers and Federation

  • Identity Providers can be used to grant external user identities permissions to AWS resources without having to be created within your AWS account.
  • External user identities can be authenticated either through the organization’s authentication system or through a well-known identity provider such as Login with Amazon, Google, etc.
  • Identity providers help keep the AWS account secure without having to distribute or embed long-term credentials in the application
  • To use an IdP, you create an IAM identity provider entity to establish a trust relationship between your AWS account and the IdP.
  • IAM supports IdPs that are compatible with OpenID Connect (OIDC) or SAML 2.0 (Security Assertion Markup Language 2.0)

Web Identity Federation

Complete Process Flow

IAM Web Identity Federation

  1. Mobile or Web Application needs to be configured with the IdP which gives each application a unique ID or client ID (also called audience)
  2. Create an Identity Provider entity for OIDC compatible IdP in IAM.
  3. Create IAM role and define the
    1. Trust policy –  specify the IdP (like Amazon) as the Principal (the trusted entity), and include a Condition that matches the IdP assigned app ID
    2. Permission policy – specify the permissions the application can assume
  4. Application calls the sign-in interface for the IdP to login
  5. IdP authenticates the user and returns an authentication token (OAuth access token or OIDC ID token) with information about the user to the application
  6. Application then makes an unsigned call to the STS service with the AssumeRoleWithWebIdentity action to request temporary security credentials.
  7. Application passes the IdP’s authentication token along with the Amazon Resource Name (ARN) for the IAM role created for that IdP.
  8. AWS verifies that the token is trusted and valid and if so, returns temporary security credentials (access key, secret access key, session token, expiry time) to the application that have the permissions for the role that you name in the request.
  9. STS response also includes metadata about the user from the IdP, such as the unique user ID that the IdP associates with the user.
  10. Using the Temporary credentials, the application makes signed requests to AWS
  11. User ID information from the identity provider can distinguish users in the app for e.g., objects can be put into S3 folders that include the user ID as prefixes or suffixes. This lets you create access control policies that lock the folder so only the user with that ID can access it.
  12. Application can cache the temporary security credentials and refresh them before they expire. Temporary credentials, by default, are good for an hour. (A minimal AssumeRoleWithWebIdentity sketch follows these steps.)
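
A minimal boto3 sketch of steps 6-10 above (the role ARN is a hypothetical placeholder, and the IdP token is whatever the provider returned in step 5):

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    idp_token = "eyJraWQiOi..."  # placeholder: OAuth/OIDC token returned by the IdP in step 5

    # Unsigned call: no AWS credentials are needed to exchange the IdP token
    sts = boto3.client("sts", config=Config(signature_version=UNSIGNED))

    resp = sts.assume_role_with_web_identity(
        RoleArn="arn:aws:iam::111122223333:role/WebIdentityAppRole",  # hypothetical role
        RoleSessionName="app-user-session",
        WebIdentityToken=idp_token,
        DurationSeconds=3600,
    )
    creds = resp["Credentials"]

    # Signed requests to AWS using the temporary credentials (step 10)
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )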

Interactive Website provides a very good way to understand the flow

Mobile or Web Identity Federation with Cognito

Amazon Cognito

  • Use Amazon Cognito as the identity broker for almost all web identity federation scenarios
  • Amazon Cognito is easy to use and provides additional capabilities like anonymous (unauthenticated) access
  • Amazon Cognito also helps synchronizing user data across devices and providers

Web Identity Federation using Cognito

SAML 2.0-based Federation

  • AWS supports identity federation with SAML 2.0 (Security Assertion Markup Language 2.0), an open standard that many identity providers (IdPs) use.
  • SAML 2.0 based federation feature enables federated single sign-on (SSO), so users can log into the AWS Management Console or call the AWS APIs without having to create an IAM user for everyone in your organization.
  • By using SAML, the process of configuring federation with AWS can be simplified by using the IdP’s service instead of writing custom identity proxy code.
  • This is useful in organizations that have integrated their identity systems (such as Windows Active Directory or OpenLDAP) with software that can produce SAML assertions to provide information about user identity and permissions (such as Active Directory Federation Services or Shibboleth)

Complete Process Flow

SAML based Federation

  1. Create a SAML provider entity in AWS using the SAML metadata document provided by the Organizations IdP to establish a “trust” between your AWS account and the IdP
  2. SAML metadata document includes the issuer name, a creation date, an expiration date, and keys that AWS can use to validate authentication responses (assertions) from your organization.
  3. Create IAM roles which defines
    1. Trust policy with the SAML provider as the principal, which establishes a trust relationship between the organization and AWS
    2. Permission policy establishes what users from the organization are allowed to do in AWS
  4. SAML trust is completed by configuring the Organization’s IdP with information about AWS and the role(s) that you want the federated users to use. This is referred to as configuring relying party trust between your IdP and AWS
  5. Application calls the sign-in interface for the Organization IdP to login
  6. IdP authenticates the user and generates a SAML authentication response which includes assertions that identify the user and include attributes about the user
  7. Application then makes an unsigned call to the STS service with the AssumeRoleWithSAML action to request temporary security credentials.
  8. Application passes the ARN of the SAML provider, the ARN of the role to assume, the SAML assertion about the current user returned by IdP and the time for which the credentials should be valid. An optional IAM Policy parameter can be provided to further restrict the permissions to the user
  9. AWS verifies that the SAML assertion is trusted and valid and if so, returns temporary security credentials (access key, secret access key, session token, expiry time) to the application that have the permissions for the role named in the request.
  10. STS response also includes metadata about the user from the IdP, such as the unique user ID that the IdP associates with the user.
  11. Using the Temporary credentials, the application makes signed requests to AWS to access the services
  12. Application can cache the temporary security credentials and refresh them before they expire. Temporary credentials, by default, are good for an hour. (A minimal AssumeRoleWithSAML sketch follows these steps.)
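
A minimal boto3 sketch of steps 7-11 above (the role and provider ARNs are hypothetical placeholders, and the assertion is the base64-encoded SAML response from step 6):

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    saml_assertion = "PHNhbWxwOlJlc3BvbnNlPi4uLg=="  # placeholder: base64 SAML response from step 6

    # Unsigned call: the SAML assertion itself authenticates the request
    sts = boto3.client("sts", config=Config(signature_version=UNSIGNED))

    resp = sts.assume_role_with_saml(
        RoleArn="arn:aws:iam::111122223333:role/SAMLFederatedRole",      # hypothetical
        PrincipalArn="arn:aws:iam::111122223333:saml-provider/CorpIdP",  # hypothetical
        SAMLAssertion=saml_assertion,
        DurationSeconds=3600,
    )
    creds = resp["Credentials"]

    # Use the temporary credentials for signed AWS requests (step 11)
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )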

SAML 2.0 based federation can also be used to grant access to the federated users to the AWS Management console. This requires the use of the AWS SSO endpoint instead of directly calling the AssumeRoleWithSAML API. The endpoint calls the API for the user and returns a URL that automatically redirects the user’s browser to the AWS Management Console.

Complete Process Flow

SAML based SSO to AWS Console

  1. User browses to the organization’s portal and selects the option to go to the AWS Management Console.
  2. Portal performs the function of the identity provider (IdP) that handles the exchange of trust between the organization and AWS.
  3. Portal verifies the user’s identity in the organization.
  4. Portal generates a SAML authentication response that includes assertions that identify the user and include attributes about the user.
  5. Portal sends this response to the client browser.
  6. Client browser is redirected to the AWS SSO endpoint and posts the SAML assertion.
  7. AWS SSO endpoint handles the call for the AssumeRoleWithSAML API action on the user’s behalf and requests temporary security credentials from STS and creates a console sign-in URL that uses those credentials.
  8. AWS sends the sign-in URL back to the client as a redirect.
  9. Client browser is redirected to the AWS Management Console. If the SAML authentication response includes attributes that map to multiple IAM roles, the user is first prompted to select the role to use for access to the console.

Custom Identity broker Federation

  • If the Organization doesn’t support a SAML compatible IdP, a Custom Identity Broker can be used to provide the access
  • Custom Identity Broker should perform the following steps (a minimal broker sketch follows this list)
    • Verify that the user is authenticated by the local identity system.
    • Call the AWS Security Token Service (AWS STS) AssumeRole (recommended) or GetFederationToken (has a maximum expiration period of 36 hours) APIs to obtain temporary security credentials for the user.
    • Temporary credentials limit the permissions a user has to the AWS resource
    • Call an AWS federation endpoint and supply the temporary security credentials to get a sign-in token.
    • Construct a URL for the console that includes the token.
    • URL that the federation endpoint provides is valid for 15 minutes after it is created.
    • Give the URL to the user or invoke the URL on the user’s behalf.
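
A minimal broker-side sketch of the steps above, using GetFederationToken and the AWS federation endpoint (the user name, scoped-down policy, issuer, and destination are hypothetical placeholders; assumes the requests package and omits error handling):

    import json
    import urllib.parse

    import boto3
    import requests

    sts = boto3.client("sts")  # broker's own long-term credentials

    # 1. After authenticating the user locally, get temporary credentials
    token = sts.get_federation_token(
        Name="jane.doe",                      # hypothetical federated user name
        Policy=json.dumps({                   # hypothetical scoped-down policy
            "Version": "2012-10-17",
            "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
        }),
        DurationSeconds=3600,
    )["Credentials"]

    # 2. Exchange the credentials for a sign-in token at the federation endpoint
    session = json.dumps({
        "sessionId": token["AccessKeyId"],
        "sessionKey": token["SecretAccessKey"],
        "sessionToken": token["SessionToken"],
    })
    signin_token = requests.get(
        "https://signin.aws.amazon.com/federation",
        params={"Action": "getSigninToken", "Session": session},
    ).json()["SigninToken"]

    # 3. Build the console URL (valid for 15 minutes) and hand it to the user
    login_url = ("https://signin.aws.amazon.com/federation?Action=login"
                 "&Issuer=" + urllib.parse.quote("https://broker.example.com")
                 + "&Destination=" + urllib.parse.quote("https://console.aws.amazon.com/")
                 + "&SigninToken=" + signin_token)
    print(login_url)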

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. A photo-sharing service stores pictures in Amazon Simple Storage Service (S3) and allows application sign-in using an OpenID Connect-compatible identity provider. Which AWS Security Token Service approach to temporary access should you use for the Amazon S3 operations?
    1. SAML-based Identity Federation
    2. Cross-Account Access
    3. AWS IAM users
    4. Web Identity Federation
  2. Which technique can be used to integrate AWS IAM (Identity and Access Management) with an on-premise LDAP (Lightweight Directory Access Protocol) directory service?
    1. Use an IAM policy that references the LDAP account identifiers and the AWS credentials.
    2. Use SAML (Security Assertion Markup Language) to enable single sign-on between AWS and LDAP
    3. Use AWS Security Token Service from an identity broker to issue short-lived AWS credentials. (Refer Link)
    4. Use IAM roles to automatically rotate the IAM credentials when LDAP credentials are updated.
    5. Use the LDAP credentials to restrict a group of users from launching specific EC2 instance types.
  3. You are designing a photo sharing mobile app. The application will store all pictures in a single Amazon S3 bucket. Users will upload pictures from their mobile device directly to Amazon S3 and will be able to view and download their own pictures directly from Amazon S3. You want to configure security to handle potentially millions of users in the most secure manner possible. What should your server-side application do when a new user registers on the photo-sharing mobile application? [PROFESSIONAL]
    1. Create a set of long-term credentials using AWS Security Token Service with appropriate permissions Store these credentials in the mobile app and use them to access Amazon S3.
    2. Record the user’s Information in Amazon RDS and create a role in IAM with appropriate permissions. When the user uses their mobile app create temporary credentials using the AWS Security Token Service ‘AssumeRole’ function. Store these credentials in the mobile app’s memory and use them to access Amazon S3. Generate new credentials the next time the user runs the mobile app.
    3. Record the user’s Information in Amazon DynamoDB. When the user uses their mobile app create temporary credentials using AWS Security Token Service with appropriate permissions. Store these credentials in the mobile app’s memory and use them to access Amazon S3 Generate new credentials the next time the user runs the mobile app.
    4. Create IAM user. Assign appropriate permissions to the IAM user Generate an access key and secret key for the IAM user, store them in the mobile app and use these credentials to access Amazon S3.
    5. Create an IAM user. Update the bucket policy with appropriate permissions for the IAM user Generate an access Key and secret Key for the IAM user, store them In the mobile app and use these credentials to access Amazon S3.
  4. Your company has recently extended its datacenter into a VPC on AWS to add burst computing capacity as needed Members of your Network Operations Center need to be able to go to the AWS Management Console and administer Amazon EC2 instances as necessary. You don’t want to create new IAM users for each NOC member and make those users sign in again to the AWS Management Console. Which option below will meet the needs for your NOC members? [PROFESSIONAL]
    1. Use OAuth 2.0 to retrieve temporary AWS security credentials to enable your NOC members to sign in to the AWS Management Console.
    2. Use Web Identity Federation to retrieve AWS temporary security credentials to enable your NOC members to sign in to the AWS Management Console.
    3. Use your on-premises SAML 2.O-compliant identity provider (IDP) to grant the NOC members federated access to the AWS Management Console via the AWS single sign-on (SSO) endpoint.
    4. Use your on-premises SAML 2.0-compliant identity provider (IDP) to retrieve temporary security credentials to enable NOC members to sign in to the AWS Management Console
  5. A corporate web application is deployed within an Amazon Virtual Private Cloud (VPC) and is connected to the corporate data center via an IPsec VPN. The application must authenticate against the on-premises LDAP server. After authentication, each logged-in user can only access an Amazon Simple Storage Service (S3) keyspace specific to that user. Which two approaches can satisfy these objectives? (Choose 2 answers) [PROFESSIONAL]
    1. Develop an identity broker that authenticates against the IAM Security Token Service to assume an IAM role in order to get temporary AWS security credentials. The application calls the identity broker to get AWS temporary security credentials with access to the appropriate S3 bucket. (Needs to authenticate against LDAP and not IAM)
    2. The application authenticates against LDAP and retrieves the name of an IAM role associated with the user. The application then calls the IAM Security Token Service to assume that IAM role. The application can use the temporary credentials to access the appropriate S3 bucket. (Authenticates with LDAP and calls the AssumeRole)
    3. Develop an identity broker that authenticates against LDAP and then calls IAM Security Token Service to get IAM federated user credentials The application calls the identity broker to get IAM federated user credentials with access to the appropriate S3 bucket. (Custom Identity broker implementation, with authentication with LDAP and using federated token)
    4. The application authenticates against LDAP the application then calls the AWS identity and Access Management (IAM) Security Token service to log in to IAM using the LDAP credentials the application can use the IAM temporary credentials to access the appropriate S3 bucket. (Can’t login to IAM using LDAP credentials)
    5. The application authenticates against IAM Security Token Service using the LDAP credentials the application uses those temporary AWS security credentials to access the appropriate S3 bucket. (Need to authenticate with LDAP)
  6. Company B is launching a new game app for mobile devices. Users will log into the game using their existing social media account to streamline data capture. Company B would like to directly save player data and scoring information from the mobile app to a DynamoDB table named Score Data. When a user saves their game, the progress data will be stored to the Game State S3 bucket. What is the best approach for storing data to DynamoDB and S3? [PROFESSIONAL]
    1. Use an EC2 Instance that is launched with an EC2 role providing access to the Score Data DynamoDB table and the GameState S3 bucket that communicates with the mobile app via web services.
    2. Use temporary security credentials that assume a role providing access to the Score Data DynamoDB table and the Game State S3 bucket using web identity federation
    3. Use Login with Amazon allowing users to sign in with an Amazon account providing the mobile app with access to the Score Data DynamoDB table and the Game State S3 bucket.
    4. Use an IAM user with access credentials assigned a role providing access to the Score Data DynamoDB table and the Game State S3 bucket for distribution with the mobile app.
  7. A user has created a mobile application which makes calls to DynamoDB to fetch certain data. The application is using the DynamoDB SDK and root account access/secret access key to connect to DynamoDB from mobile. Which of the below mentioned statements is true with respect to the best practice for security in this scenario?
    1. User should create a separate IAM user for each mobile application and provide DynamoDB access with it
    2. User should create an IAM role with DynamoDB and EC2 access. Attach the role with EC2 and route all calls from the mobile through EC2
    3. The application should use an IAM role with web identity federation which validates calls to DynamoDB with identity providers, such as Google, Amazon, and Facebook
    4. Create an IAM Role with DynamoDB access and attach it with the mobile application
  8. You are managing the AWS account of a big organization. The organization has more than 1000 employees and wants to provide access to various services for most of the employees. Which of the below mentioned options is the best possible solution in this case?
    1. The user should create a separate IAM user for each employee and provide access to them as per the policy
    2. The user should create an IAM role and attach STS with the role. The user should attach that role to the EC2 instance and setup AWS authentication on that server
    3. The user should create IAM groups as per the organization’s departments and add each user to the group for better access control
    4. Attach an IAM role with the organization’s authentication service to authorize each user for various AWS services
  9. Your fortune 500 company has undertaken a TCO analysis evaluating the use of Amazon S3 versus acquiring more hardware. The outcome was that all employees would be granted access to use Amazon S3 for storage of their personal documents. Which of the following will you need to consider so you can set up a solution that incorporates single sign-on from your corporate AD or LDAP directory and restricts access for each user to a designated user folder in a bucket? (Choose 3 Answers) [PROFESSIONAL]
    1. Setting up a federation proxy or identity provider
    2. Using AWS Security Token Service to generate temporary tokens
    3. Tagging each folder in the bucket
    4. Configuring IAM role
    5. Setting up a matching IAM user for every user in your corporate directory that needs access to a folder in the bucket
  10. An AWS customer is deploying a web application that is composed of a front-end running on Amazon EC2 and of confidential data that is stored on Amazon S3. The customer security policy requires that all access operations to this sensitive data must be authenticated and authorized by a centralized access management system that is operated by a separate security team. In addition, the web application team that owns and administers the EC2 web front-end instances is prohibited from having any ability to access the data in a way that circumvents this centralized access management system. Which of the following configurations will support these requirements? [PROFESSIONAL]
    1. Encrypt the data on Amazon S3 using a CloudHSM that is operated by the separate security team. Configure the web application to integrate with the CloudHSM for decrypting approved data access operations for trusted end-users. (S3 doesn’t integrate directly with CloudHSM, also there is no centralized access management system control)
    2. Configure the web application to authenticate end-users against the centralized access management system. Have the web application provision trusted users STS tokens entitling the download of approved data directly from Amazon S3 (Controlled access and admins cannot access the data as it needs authentication)
    3. Have the separate security team create and IAM role that is entitled to access the data on Amazon S3. Have the web application team provision their instances with this role while denying their IAM users access to the data on Amazon S3 (Web team would have access to the data)
    4. Configure the web application to authenticate end-users against the centralized access management system using SAML. Have the end-users authenticate to IAM using their SAML token and download the approved data directly from S3. (not the way SAML auth works and not sure if the centralized access management system is SAML compliant)
  11. What is web identity federation?
    1. Use of an identity provider like Google or Facebook to become an AWS IAM User.
    2. Use of an identity provider like Google or Facebook to exchange for temporary AWS security credentials.
    3. Use of AWS IAM User tokens to log in as a Google or Facebook user.
    4. Use of AWS STS Tokens to log in as a Google or Facebook user.
  12. Games-R-Us is launching a new game app for mobile devices. Users will log into the game using their existing Facebook account and the game will record player data and scoring information directly to a DynamoDB table. What is the most secure approach for signing requests to the DynamoDB API?
    1. Create an IAM user with access credentials that are distributed with the mobile app to sign the requests
    2. Distribute the AWS root account access credentials with the mobile app to sign the requests
    3. Request temporary security credentials using web identity federation to sign the requests
    4. Establish cross account access between the mobile app and the DynamoDB table to sign the requests
  13. You are building a mobile app for consumers to post cat pictures online. You will be storing the images in AWS S3. You want to run the system very cheaply and simply. Which one of these options allows you to build a photo sharing application without needing to worry about scaling expensive uploads processes, authentication/authorization and so forth?
    1. Build the application out using AWS Cognito and web identity federation to allow users to log in using Facebook or Google Accounts. Once they are logged in, the secret token passed to that user is used to directly access resources on AWS, like AWS S3. (Amazon Cognito is a superset of the functionality provided by web identity federation. Refer link)
    2. Use JWT or SAML compliant systems to build authorization policies. Users log in with a username and password, and are given a token they can use indefinitely to make calls against the photo infrastructure.
    3. Use AWS API Gateway with a constantly rotating API Key to allow access from the client-side. Construct a custom build of the SDK and include S3 access in it.
    4. Create an AWS oAuth Service Domain and grant public signup and access to the domain. During setup, add at least one major social media site as a trusted Identity Provider for users.
  14. The Marketing Director in your company asked you to create a mobile app that lets users post sightings of good deeds known as random acts of kindness in 80-character summaries. You decided to write the application in JavaScript so that it would run on the broadest range of phones, browsers, and tablets. Your application should provide access to Amazon DynamoDB to store the good deed summaries. Initial testing of a prototype shows that there aren’t large spikes in usage. Which option provides the most cost-effective and scalable architecture for this application? [PROFESSIONAL]
    1. Provide the JavaScript client with temporary credentials from the Security Token Service using a Token Vending Machine (TVM) on an EC2 instance to provide signed credentials mapped to an Amazon Identity and Access Management (IAM) user allowing DynamoDB puts and S3 gets. You serve your mobile application out of an S3 bucket enabled as a web site. Your client updates DynamoDB. (Single EC2 instance not a scalable architecture)
    2. Register the application with a Web Identity Provider like Amazon, Google, or Facebook, create an IAM role for that provider, and set up permissions for the IAM role to allow S3 gets and DynamoDB puts. You serve your mobile application out of an S3 bucket enabled as a web site. Your client updates DynamoDB. (Can work with JavaScript SDK, is scalable and cost effective)
    3. Provide the JavaScript client with temporary credentials from the Security Token Service using a Token Vending Machine (TVM) to provide signed credentials mapped to an IAM user allowing DynamoDB puts. You serve your mobile application out of Apache EC2 instances that are load-balanced and autoscaled. Your EC2 instances are configured with an IAM role that allows DynamoDB puts. Your server updates DynamoDB. (Is Scalable but Not cost effective)
    4. Register the JavaScript application with a Web Identity Provider like Amazon, Google, or Facebook, create an IAM role for that provider, and set up permissions for the IAM role to allow DynamoDB puts. You serve your mobile application out of Apache EC2 instances that are load-balanced and autoscaled. Your EC2 instances are configured with an IAM role that allows DynamoDB puts. Your server updates DynamoDB. (Is Scalable but Not cost effective)


AWS Elastic Map Reduce – EMR – Certification

AWS EMR

  • Amazon EMR is a web service that utilizes a hosted Hadoop framework running on the web-scale infrastructure of EC2 and S3
  • EMR enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data
  • EMR
    • uses Apache Hadoop as its distributed data processing engine, which is an open source, Java software that supports data-intensive distributed applications running on large clusters of commodity hardware
    • is ideal for problems that necessitate fast and efficient processing of large amounts of data
    • lets the focus be on crunching or analyzing big data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity
    • can help perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research etc
    • provides web service interface to launch the clusters and monitor processing-intensive computation on clusters
    • is a batch-processing framework where typical processing times range from hours to days; if the use case requires real-time or near-real-time (within minutes) processing, Apache Spark or Storm would be a better option
  • EMR seamlessly supports On-Demand, Spot, and Reserved Instances
  • EMR launches all nodes for a given cluster in the same EC2 Availability Zone, which improves performance as it provides higher data access rate
  • EMR supports different EC2 instance types including Standard, High CPU, High Memory, Cluster Compute, High I/O, and High Storage
    • Standard Instances have memory to CPU ratios suitable for most general-purpose applications.
    • High CPU instances have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications
    • High Memory instances offer large memory sizes for high throughput applications
    • Cluster Compute instances have proportionally high CPU with increased network performance and are well suited for High Performance Compute (HPC) applications and other demanding network-bound applications
    • High Storage instances offer 48 TB of storage across 24 disks and are ideal for applications that require sequential access to very large data sets such as data warehousing and log processing
  • EMR charges in hourly increments i.e. once the cluster is running, charges apply for the entire hour
  • EMR integrates with CloudTrail to record AWS API calls

NOTE: Topic mainly for Solution Architect Professional Exam Only

EMR Architecture

  • Amazon EMR uses industry proven, fault-tolerant Hadoop software as its data processing engine
  • Hadoop is an open source, Java software that supports data-intensive distributed applications running on large clusters of commodity hardware
  • Hadoop splits the data into multiple subsets and assigns each subset to more than one EC2 instance. So, if an EC2 instance fails to process one subset of data, the results of another Amazon EC2 instance can be used
  • EMR consists of Master node, one or more Slave nodes
    • Master Node
      • EMR currently does not support automatic failover of the master nodes or master node state recovery
      • If master node goes down, the EMR cluster will be terminated and the job needs to be re-executed
    • Slave Nodes – Core nodes and Task nodes
      • Core nodes
        • host persistent data using Hadoop Distributed File System (HDFS) and run Hadoop tasks
        • can be increased in an existing cluster
      • Task nodes
        • only run Hadoop tasks
        • can be increased or decreased in an existing cluster
      • EMR is fault tolerant for slave failures and continues job execution if a slave node goes down.
      • Currently, EMR does not automatically provision another node to take over failed slaves
  • EMR supports Bootstrap actions, which
    • allow users to run custom set-up scripts prior to the execution of the cluster (see the sketch below)
    • can be used to install software or configure instances before running the cluster
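
To make the master/core/task layout and bootstrap actions above concrete, here is a minimal boto3 sketch that launches a small cluster with a bootstrap script; the region, bucket, script path, instance types, and cluster name are illustrative assumptions, not values from any particular setup.

  import boto3

  emr = boto3.client("emr", region_name="us-east-1")   # region is an assumption

  response = emr.run_job_flow(
      Name="example-cluster",
      ReleaseLabel="emr-5.36.0",
      Instances={
          "MasterInstanceType": "m5.xlarge",      # single master node (no automatic failover)
          "SlaveInstanceType": "m5.xlarge",       # core/task (slave) nodes
          "InstanceCount": 3,                     # 1 master + 2 core nodes
          "KeepJobFlowAliveWhenNoSteps": False,   # terminate once submitted steps complete
      },
      BootstrapActions=[{
          "Name": "install-custom-software",
          # hypothetical script that installs or configures software before Hadoop starts
          "ScriptBootstrapAction": {"Path": "s3://example-bucket/bootstrap/install.sh"},
      }],
      JobFlowRole="EMR_EC2_DefaultRole",
      ServiceRole="EMR_DefaultRole",
  )
  print(response["JobFlowId"])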

EMR Security

  • EMR cluster starts with different security groups for Master and Slaves
    • Master security group
      • has a port open for communication with the service.
      • has a SSH port open to allow direct SSH into the instances, using the key specified at startup
    • Slave security group
      • only allows interaction with the master instance
      • SSH to the slave nodes can be done by doing SSH to the master node and then to the slave node
    • Security groups can be configured with different access rules

EMR Security Encryption

  • EMR enables use of security configuration
    • which helps to encrypt data at-rest, data in-transit, or both
    • can be used to specify settings for S3 encryption with EMR file system (EMRFS), local disk encryption, and in-transit encryption
    • is stored in EMR rather than the cluster configuration making it reusable
    • gives flexibility to choose from several options, including keys managed by AWS KMS, keys managed by S3, and keys and certificates from custom providers that you supply
  • At-rest Encryption for S3 with EMRFS
    • EMRFS supports Server-side (SSE-S3, SSE-KMS) and Client-side encryption (CSE-KMS or CSE-Custom)
    • S3 SSE and CSE encryption with EMRFS are mutually exclusive; either one can be selected but not both
    • Transport layer security (TLS) encrypts EMRFS objects in-transit between EMR cluster nodes & S3
  • At-rest Encryption for Local Disks
    • Open-source HDFS Encryption
      • HDFS exchanges data between cluster instances during distributed processing, and also reads from and writes data to instance store volumes and the EBS volumes attached to instances
      • Open-source Hadoop encryption options are activated
        • Secure Hadoop RPC is set to “Privacy”, which uses Simple Authentication Security Layer (SASL).
        • Data encryption on HDFS block data transfer is set to true and is configured to use AES 256 encryption.
    • LUKS. In addition to HDFS encryption, the Amazon EC2 instance store volumes (except boot volumes) and the attached Amazon EBS volumes of cluster instances are encrypted using LUKS
  • In-Transit Data Encryption
    • Encryption artifacts used for in-transit encryption can be specified in one of two ways:
      • either by providing a zipped file of certificates that you upload to S3,
      • or by referencing a custom Java class that provides encryption artifacts
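
A minimal boto3 sketch of creating a reusable security configuration that covers S3 (EMRFS) at-rest, local-disk, and in-transit encryption; the KMS key ARN, bucket, and certificate archive are placeholder assumptions.

  import json
  import boto3

  emr = boto3.client("emr")

  security_config = {
      "EncryptionConfiguration": {
          "EnableAtRestEncryption": True,
          "EnableInTransitEncryption": True,
          "AtRestEncryptionConfiguration": {
              # SSE-KMS for EMRFS objects in S3 (SSE-S3, CSE-KMS or CSE-Custom are alternatives)
              "S3EncryptionConfiguration": {
                  "EncryptionMode": "SSE-KMS",
                  "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
              },
              # local disk encryption (LUKS) keyed from KMS
              "LocalDiskEncryptionConfiguration": {
                  "EncryptionKeyProviderType": "AwsKms",
                  "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
              },
          },
          "InTransitEncryptionConfiguration": {
              # zipped PEM certificates uploaded to S3 (a custom provider class is the alternative)
              "TLSCertificateConfiguration": {
                  "CertificateProviderType": "PEM",
                  "S3Object": "s3://example-bucket/certs/my-certs.zip",
              },
          },
      }
  }

  emr.create_security_configuration(
      Name="example-security-config",
      SecurityConfiguration=json.dumps(security_config),
  )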

EMR Cluster Types

  • EMR has two cluster types, transient and persistent
  • Transient EMR Clusters
    • Transient EMR clusters are clusters that shut down when the job or the steps (series of jobs) are complete
    • Transient EMR clusters can be used in situations where
      • the total number of EMR processing hours per day is < 24 hours and it's beneficial to shut down the cluster when it's not being used.
      • HDFS is not used as the primary data storage (data is persisted on S3).
      • job processing is intensive, iterative data processing.
  • Persistent EMR Clusters
    • Persistent EMR clusters continue to run after the data processing job is complete
    • Persistent EMR clusters can be used in situations
      • frequently run processing jobs where it’s beneficial to keep the cluster running after the previous job.
      • processing jobs have an input-output dependency on one another.
      • In rare cases when it is more cost effective to store the data on HDFS instead of S3
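
In API terms, the transient vs persistent choice largely comes down to a single flag; a brief sketch (the cluster id is illustrative):

  import boto3

  emr = boto3.client("emr")

  # Transient: pass KeepJobFlowAliveWhenNoSteps=False in run_job_flow's Instances so
  # the cluster auto-terminates once its submitted steps complete.
  # Persistent: pass KeepJobFlowAliveWhenNoSteps=True and terminate it explicitly
  # when it is no longer needed:
  emr.terminate_job_flows(JobFlowIds=["j-EXAMPLE1234ABC"])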

EMR Best Practices

  • Data Migration
    • Two tools – S3DistCp and DistCp – can be used to move data from local (data center) HDFS storage to S3, from S3 to HDFS, and from local disk (non-HDFS) to S3
    • AWS Import/Export and Direct Connect can also be considered for moving data
  • Data Collection
    • Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, & moving large amounts of log data
    • Flume agents can be installed on the data sources (web-servers, app servers etc) and data shipped to the collectors which can then be stored in persistent storage like S3 or HDFS
  • Data Aggregation
    • Data aggregation refers to techniques for gathering individual data records (for e.g. log records) and combining them into a large bundle of data files i.e. creating a large file from small files
    • Hadoop, on which EMR runs, generally performs better with fewer large files compared to many small files
    • Hadoop splits the file on HDFS on multiple nodes, while for the data in S3 it uses the HTTP Range header query to split the files which helps improve performance by supporting parallelization
    • Log collectors like Flume and Fluentd can be used to aggregate data before copying it to the final destination (S3 or HDFS)
    • Data aggregation has following benefits
      • Improves data ingest scalability by reducing the number of times needed to upload data to AWS
      • Reduces the number of files stored on S3 (or HDFS), which inherently helps provide better performance when processing data
      • Provides a better compression ratio as compressing large, highly compressible files is often more effective than compressing a large number of smaller files.
  • Data compression
    • Data compression can be used at the input as well as intermediate outputs from the mappers
    • Data compression helps
      • Lower storage costs
      • Lower bandwidth cost for data transfer
      • Better data processing performance by moving less data between data storage location, mappers, and reducers
      • Better data processing performance by compressing the data that EMR writes to disk, i.e. achieving better performance by writing to disk less frequently
    • Data Compression can have an impact on Hadoop data splitting logic as some of the compression techniques like gzip do not support it
    • Data compression techniques such as gzip, bzip2, LZO, and Snappy differ in compression ratio, speed, and whether the compressed output is splittable
  • Data Partitioning
    • Data partitioning helps in data optimizations and lets you create unique buckets of data and eliminate the need for a data processing job to read the entire data set
    • Data can be partitioned by
      • Data type (time series)
      • Data processing frequency (per hour, per day, etc.)
      • Data access and query pattern (query on time vs. query on geo location)
  • Cost Optimization
    • AWS offers different pricing models for EC2 instances
      • On-Demand instances
        • are a good option if using transient EMR jobs or if the EMR hourly usage is less than 17% of the time
      • Reserved instances
        • are a good option for a persistent EMR cluster or if the EMR hourly usage is more than 17% of the time, as they are more cost effective
      • Spot instances
        • can be a cost effective mechanism to add compute capacity
        • can be used where the data persists on S3
        • can be used to add extra task capacity with Task nodes, and
        • are not suited for the Master node (if it is lost, the cluster is lost) or for Core nodes (data nodes), as they host data which, if lost, needs to be recovered to rebalance the HDFS cluster
    • The following architecture pattern can be used (see the sketch after this list):
      • Run master node on On-Demand or Reserved Instances (if running persistent EMR clusters).
      • Run a portion of the EMR cluster on core nodes using On-Demand or Reserved Instances and
      • the rest of the cluster on task nodes using Spot Instances.
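
A minimal boto3 sketch of that pattern, adding Spot-based task capacity to a cluster whose master and core nodes run on On-Demand or Reserved Instances; the cluster id, bid price, instance type, and count are placeholder assumptions.

  import boto3

  emr = boto3.client("emr")

  emr.add_instance_groups(
      JobFlowId="j-EXAMPLE1234ABC",           # existing cluster with On-Demand/RI master and core nodes
      InstanceGroups=[{
          "Name": "spot-task-nodes",
          "InstanceRole": "TASK",             # task nodes hold no HDFS data, so they are safe to lose
          "Market": "SPOT",
          "BidPrice": "0.10",                 # illustrative bid in USD
          "InstanceType": "m5.xlarge",
          "InstanceCount": 4,
      }],
  )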

EMR – S3 vs HDFS

  • Storing data on S3 provides several benefits
    • inherent features like high availability, durability, lifecycle management, data encryption, and archival of data to Glacier
    • cost effective, as storing data in S3 is cheaper compared to HDFS with its replication factor
    • ability to use Transient EMR cluster and shutdown the clusters after the job is completed, with data being maintained in S3
    • ability to use Spot instances and not having to worry about losing the spot instances any time
    • provides data durability from any HDFS node failures, where node failures exceed the HDFS replication factor
    • data ingestion with high throughput data stream to S3 is much easier than ingesting to HDFS
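
Because data typically lives in S3 while jobs use HDFS for intermediate results, S3DistCp is often run as a step to copy results back to S3 before a transient cluster terminates. A hedged boto3 sketch (the cluster id, paths, and bucket are assumptions):

  import boto3

  emr = boto3.client("emr")

  emr.add_job_flow_steps(
      JobFlowId="j-EXAMPLE1234ABC",
      Steps=[{
          "Name": "copy-hdfs-output-to-s3",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
              "Jar": "command-runner.jar",    # runs s3-dist-cp on the cluster
              "Args": [
                  "s3-dist-cp",
                  "--src", "hdfs:///output",
                  "--dest", "s3://example-bucket/output/",
              ],
          },
      }],
  )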

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated every day and both the answers and questions might become outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. You require the ability to analyze a large amount of data, which is stored on Amazon S3 using Amazon Elastic Map Reduce. You are using the cc2.8xlarge instance type, whose CPUs are mostly idle during processing. Which of the below would be the most cost efficient way to reduce the runtime of the job? [PROFESSIONAL]
    1. Create smaller files on Amazon S3.
    2. Add additional cc2.8xlarge instances by introducing a task group.
    3. Use smaller instances that have higher aggregate I/O performance.
    4. Create fewer, larger files on Amazon S3.
  2. A customer’s nightly EMR job processes a single 2-TB data file stored on Amazon Simple Storage Service (S3). The Amazon Elastic Map Reduce (EMR) job runs on two On-Demand core nodes and three On-Demand task nodes. Which of the following may help reduce the EMR job completion time? Choose 2 answers [PROFESSIONAL]
    1. Use three Spot Instances rather than three On-Demand instances for the task nodes.
    2. Change the input split size in the MapReduce job configuration.
    3. Use a bootstrap action to present the S3 bucket as a local filesystem.
    4. Launch the core nodes and task nodes within an Amazon Virtual Cloud.
    5. Adjust the number of simultaneous mapper tasks.
    6. Enable termination protection for the job flow.
  3. Your department creates regular analytics reports from your company’s log files. All log data is collected in Amazon S3 and processed by daily Amazon Elastic Map Reduce (EMR) jobs that generate daily PDF reports and aggregated tables in CSV format for an Amazon Redshift data warehouse. Your CFO requests that you optimize the cost structure for this system. Which of the following alternatives will lower costs without compromising average performance of the system or data integrity for the raw data? [PROFESSIONAL]
    1. Use reduced redundancy storage (RRS) for PDF and CSV data in Amazon S3. Add Spot instances to Amazon EMR jobs. Use Reserved Instances for Amazon Redshift. (Using only Spot instances impacts performance)
    2. Use reduced redundancy storage (RRS) for all data in S3. Use a combination of Spot instances and Reserved Instances for Amazon EMR jobs. Use Reserved instances for Amazon Redshift (A combination of Spot and Reserved Instances guarantees performance and helps reduce cost. Also, RRS would reduce cost and guarantee data integrity, which is different from data durability)
    3. Use reduced redundancy storage (RRS) for all data in Amazon S3. Add Spot Instances to Amazon EMR jobs. Use Reserved Instances for Amazon Redshift (Using only Spot instances impacts performance)
    4. Use reduced redundancy storage (RRS) for PDF and CSV data in S3. Add Spot Instances to EMR jobs. Use Spot Instances for Amazon Redshift. (Spot instances impact performance and Spot instances are not available for Redshift)
  4. A research scientist is planning for the one-time launch of an Elastic MapReduce cluster and is encouraged by her manager to minimize the costs. The cluster is designed to ingest 200TB of genomics data with a total of 100 Amazon EC2 instances and is expected to run for around four hours. The resulting data set must be stored temporarily until archived into an Amazon RDS Oracle instance. Which option will help save the most money while meeting requirements? [PROFESSIONAL]
    1. Store ingest and output files in Amazon S3. Deploy on-demand for the master and core nodes and spot for the task nodes.
    2. Optimize by deploying a combination of on-demand, RI and spot-pricing models for the master, core and task nodes. Store ingest and output files in Amazon S3 with a lifecycle policy that archives them to Amazon Glacier. (Master and Core must be RI or On Demand. Cannot be Spot)
    3. Store the ingest files in Amazon S3 RRS and store the output files in S3. Deploy Reserved Instances for the master and core nodes and on-demand for the task nodes. (Need better durability for ingest file. Spot instances can be used for task nodes for cost saving. RI will not provide cost saving in this case)
    4. Deploy on-demand master, core and task nodes and store ingest and output files in Amazon S3 RRS (Input should be in S3 standard, as re-ingesting the input data might end up being more costly than holding the data for a limited time in standard S3)
  5. Your company sells consumer devices and needs to record the first activation of all sold devices. Devices are not activated until the information is written on a persistent database. Activation data is very important for your company and must be analyzed daily with a MapReduce job. The execution time of the data analysis process must be less than three hours per day. Devices are usually sold evenly during the year, but when a new device model is out, there is a predictable peak in activations, that is, for a few days there are 10 times or even 100 times more activations than on an average day. Which of the following databases and analysis framework would you implement to better optimize costs and performance for this workload? [PROFESSIONAL]
    1. Amazon RDS and Amazon Elastic MapReduce with Spot instances.
    2. Amazon DynamoDB and Amazon Elastic MapReduce with Spot instances.
    3. Amazon RDS and Amazon Elastic MapReduce with Reserved instances.
    4. Amazon DynamoDB and Amazon Elastic MapReduce with Reserved instances

AWS ElastiCache – Certification

AWS ElastiCache

  • AWS ElastiCache is a managed web service that lets you easily deploy and run Memcached or Redis protocol-compliant cache clusters in the cloud
  • ElastiCache is available in two flavours: Memcached and Redis
  • ElastiCache helps
    • simplify and offload the management, monitoring, and operation of in-memory cache environments, enabling the engineering resources to focus on developing applications
    • automate common administrative tasks required to operate a distributed cache environment.
    • improve the performance of web applications by allowing retrieval of information from a fast, managed, in-memory caching system, instead of relying entirely on slower disk-based databases
    • not only improve load & response times to user actions and queries, but also reduce the cost associated with scaling web applications
    • automatically detect and replace failed cache nodes, providing a resilient system that mitigates the risk of overloaded databases, which can slow website and application load times
    • provide enhanced visibility into key performance metrics associated with the cache nodes through integration with CloudWatch
    • work seamlessly with code, applications, and popular tools already using Memcached or Redis environments, being protocol-compliant with both
  • ElastiCache provides in-memory caching which can
    • significantly improve latency and throughput for many
      • read-heavy application workloads for e.g. social networking, gaming, media sharing and Q&A portals or
      • compute-intensive workloads such as a recommendation engine
    • improve application performance by storing critical pieces of data in memory for low-latency access.
    • be used to cache results of I/O-intensive database queries or the results of computationally-intensive calculations.
  • ElastiCache currently allows access only from the EC2 network and cannot be accessed from outside networks like on-premises servers
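
A common way to use ElastiCache from application code is the cache-aside (lazy loading) pattern sketched below; the Redis endpoint, key naming, TTL, and the fetch_from_database() stub are illustrative assumptions rather than part of any AWS API.

  import json
  import redis   # redis-py; ElastiCache for Redis is protocol-compliant with it

  # hypothetical ElastiCache endpoint, reachable only from within the EC2/VPC network
  cache = redis.Redis(host="example-redis.xxxxxx.0001.use1.cache.amazonaws.com", port=6379)

  def fetch_from_database(product_id):
      # stand-in for an I/O-intensive database query
      return {"id": product_id, "name": "example"}

  def get_product(product_id):
      key = f"product:{product_id}"
      cached = cache.get(key)
      if cached is not None:                      # cache hit: served from memory
          return json.loads(cached)
      record = fetch_from_database(product_id)    # cache miss: fall back to the database
      cache.setex(key, 300, json.dumps(record))   # populate the cache with a 5-minute TTL
      return record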

AWS ElastiCache Redis vs Memcached

Redis

  • Redis is an open source, BSD licensed, advanced key-value cache & store
  • ElastiCache enables the management, monitoring and operation of a Redis node; creation, deletion and modification of the node
  • ElastiCache for Redis can be used as a primary in-memory key-value data store, providing fast, sub millisecond data performance, high availability and scalability up to 16 nodes plus up to 5 read replicas, each of up to 3.55 TiB of in-memory data
  • ElastiCache for Redis supports
    • Redis Master/Slave replication.
    • Multi-AZ operation by creating read replicas in another AZ
    • Backup and Restore feature for persistence by snapshotting
  • ElastiCache for Redis can be vertically scaled upwards by selecting a larger node type, however it cannot be scaled down
  • Parameter group can be specified for Redis during installation, which acts as a “container” for Redis configuration values that can be applied to one or more Redis primary clusters
  • Append Only File (AOF)
    • provides persistence and can be enabled for recovery scenarios
    • if a node restarts or the service crashes, Redis will replay the updates from an AOF file, thereby recovering the data lost due to the restart or crash
    • cannot protect against all failure scenarios, because if the underlying hardware fails, a new server would be provisioned and the AOF file will no longer be available to recover the data
    • enabling Redis Multi-AZ is a better approach to fault tolerance, as failing over to a read replica is much faster than rebuilding the primary from an AOF file

Redis Read Replica

  • Read Replicas help provide Read scaling and handling failures
  • Read Replicas are kept in sync with the Primary node using Redis’s asynchronous replication technology
  • Redis Read Replicas can help
    • Horizontal scaling beyond the compute or I/O capacity of a single primary node for read-heavy workloads.
    • Serving read traffic while the primary is unavailable either being down due to failure or maintenance
    • Data protection scenarios to promote a Read Replica as primary node, in case the primary node or the AZ of the primary node fails
  • ElastiCache supports initiated or forced failover where it flips the DNS record for the primary node to point at the read replica, which is in turn promoted to become the new primary
  • Read replica cannot span across regions and may only be provisioned in the same or different AZ of the same Region as the cache node primary

Redis Multi-AZ

  • ElastiCache for Redis shard consists of a primary and up to 5 read replicas
  • Redis asynchronously replicates the data from the primary node to the read replicas
  • ElastiCache for Redis Multi-AZ mode
    • provides enhanced availability and smaller need for administration as the node failover is automatic
    • impact on the ability to read/write to the primary is limited to the time it takes for automatic failover to complete
    • no longer needs monitoring of Redis nodes and manually initiating a recovery in the event of a primary node disruption
  • During certain types of planned maintenance, or in the unlikely event of ElastiCache node failure or AZ failure,
    • it automatically detects the failure,
    • selects a replica, depending upon the read replica with the smallest asynchronous replication lag to the primary, and promotes it to become the new primary node
    • it will also propagate the DNS changes so that the primary endpoint remains the same
  • If Multi-AZ is not enabled,
    • ElastiCache monitors the primary node
    • in case the node becomes unavailable or unresponsive, it will repair the node by acquiring new service resources
    • it propagates the DNS endpoint changes to redirect the node’s existing DNS name to point to the new service resources.
    • If the primary node cannot be healed, you will have the choice to promote one of the read replicas to be the new primary
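
A minimal boto3 sketch of provisioning a Redis replication group with read replicas spread across AZs and automatic failover enabled; the identifiers, node type, and AZs are placeholder assumptions.

  import boto3

  elasticache = boto3.client("elasticache")

  elasticache.create_replication_group(
      ReplicationGroupId="example-redis",
      ReplicationGroupDescription="Primary plus two read replicas with automatic failover",
      Engine="redis",
      CacheNodeType="cache.m5.large",
      NumCacheClusters=3,                        # 1 primary + 2 read replicas
      PreferredCacheClusterAZs=["us-east-1a", "us-east-1b", "us-east-1c"],
      AutomaticFailoverEnabled=True,             # promote a replica if the primary fails
      MultiAZEnabled=True,
  )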

Redis Backup & Restore

  • Backup and Restore allows users to create snapshots of the Redis clusters
  • Snapshots can be used for recovery, restoration, archiving purpose or warm start an ElastiCache for Redis cluster with preloaded data
  • Snapshots can be created on a cluster basis and use Redis' native mechanism to create and store an RDB file as the snapshot
  • Increased latencies might be encountered at the node for a brief period while taking a snapshot, so it is recommended to take snapshots from a Read Replica to minimize the performance impact
  • Snapshots can be created either automatically (if configured) or manually
  • When an ElastiCache for Redis cluster is deleted, the automatic snapshots are removed; however, manual snapshots are retained
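
A brief boto3 sketch of the recommendations above: take automatic snapshots from a read replica, and keep a manual snapshot that survives cluster deletion (the identifiers are illustrative).

  import boto3

  elasticache = boto3.client("elasticache")

  # take automatic snapshots from a read replica and retain them for 7 days
  elasticache.modify_replication_group(
      ReplicationGroupId="example-redis",
      SnapshottingClusterId="example-redis-002",   # a read replica node, not the primary
      SnapshotRetentionLimit=7,
      ApplyImmediately=True,
  )

  # manual snapshot, retained even if the cluster is later deleted
  elasticache.create_snapshot(
      CacheClusterId="example-redis-002",
      SnapshotName="example-redis-manual-backup",
  )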

Memcached

  • Memcached is an in-memory key-value store for small chunks of arbitrary data
  • ElastiCache for Memcached can be used to cache a variety of objects
    • from the content in persistent data stores (such as RDS, DynamoDB, or self-managed databases hosted on EC2) to
    • dynamically generated web pages (with Nginx for example), or
    • transient session data that may not require a persistent backing store
  • ElastiCache for Memcached
    • can be scaled Vertically by increasing the node type size
    • can be scaled Horizontally by adding and removing nodes
    • does not support persistence of data
  • An ElastiCache for Memcached cluster can have
    • nodes spanning multiple AZs within the same region
    • a maximum of 20 nodes per cluster, with a maximum of 100 nodes per region (a soft limit which can be extended)
  • ElastiCache for Memcached supports auto discovery, which enables automatic discovery of cache nodes by clients when they are added to or removed from an ElastiCache cluster
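
A minimal boto3 sketch of a Memcached cluster with nodes spread across AZs; auto-discovery-capable clients then connect to the cluster's configuration endpoint instead of individual nodes (identifiers, node type, and AZs are assumptions).

  import boto3

  elasticache = boto3.client("elasticache")

  elasticache.create_cache_cluster(
      CacheClusterId="example-memcached",
      Engine="memcached",
      CacheNodeType="cache.m5.large",
      NumCacheNodes=3,
      AZMode="cross-az",                         # spread nodes across Availability Zones
      PreferredAvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
  )

  # the configuration endpoint is what auto-discovery clients should connect to
  resp = elasticache.describe_cache_clusters(CacheClusterId="example-memcached")
  print(resp["CacheClusters"][0].get("ConfigurationEndpoint"))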

ElastiCache Mitigating Failures

  • The ElastiCache deployment should be designed so that failures have a minimal impact on the application and data
  • Mitigating Failures when Running Memcached
    • Mitigating Node Failures
      • spread the cached data over more nodes
      • as Memcached does not support replication, a node failure will always result in some data loss from the cluster
      • having more nodes will reduce the proportion of cache data lost
    • Mitigating Availability Zone Failures
      • locate the nodes in as many Availability Zones as possible; if an AZ fails, only the data cached in that AZ is lost, not the data cached in the other AZs
  • Mitigating Failures when Running Redis
    • Mitigating Cluster Failures
      • Redis Append Only Files (AOF)
        • enable AOF so whenever data is written to the Redis cluster, a corresponding transaction record is written to a Redis AOF
        • when the Redis process restarts, ElastiCache creates a replacement cluster, provisions it, and repopulates it with data from the AOF
        • It is time consuming
        • AOF can get big
        • Using AOF cannot protect you from all failure scenarios
      • Redis Replication Groups
        • A Redis replication group is comprised of a single primary cluster which your application can both read from and write to, and from 1 to 5 read-only replica clusters.
        • Data written to the primary cluster is also asynchronously updated on the read replica clusters
        • When a Read Replica fails, ElastiCache detects the failure, replaces the instance in the same AZ and synchronizes with the Primary Cluster
        • With Redis Multi-AZ and Automatic Failover enabled, ElastiCache detects a Primary cluster failure and promotes the read replica with the least replication lag to primary
        • With Multi-AZ and Auto Failover disabled, ElastiCache detects a Primary cluster failure, creates a new one, and syncs the new Primary with one of the existing replicas
    • Mitigating Availability Zone Failures
      • locate the clusters in as many availability zones as possible

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated every day and both the answers and questions might become outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. What does Amazon ElastiCache provide?
    1. A service by this name doesn’t exist. Perhaps you mean Amazon CloudCache.
    2. A virtual server with a huge amount of memory.
    3. A managed In-memory cache service
    4. An Amazon EC2 instance with the Memcached software already pre-installed.
  2. You are developing a highly available web application using stateless web servers. Which services are suitable for storing session state data? Choose 3 answers.
    1. Elastic Load Balancing
    2. Amazon Relational Database Service (RDS)
    3. Amazon CloudWatch
    4. Amazon ElastiCache
    5. Amazon DynamoDB
    6. AWS Storage Gateway
  3. Which statement best describes ElastiCache?
    1. Reduces the latency by splitting the workload across multiple AZs
    2. A simple web services interface to create and store multiple data sets, query your data easily, and return the results
    3. Offload the read traffic from your database in order to reduce latency caused by read-heavy workload
    4. Managed service that makes it easy to set up, operate and scale a relational database in the cloud
  4. Our company is getting ready to do a major public announcement of a social media site on AWS. The website is running on EC2 instances deployed across multiple Availability Zones with a Multi-AZ RDS MySQL Extra Large DB Instance. The site performs a high number of small reads and writes per second and relies on an eventual consistency model. After comprehensive tests you discover that there is read contention on RDS MySQL. Which are the best approaches to meet these requirements? (Choose 2 answers)
    1. Deploy ElastiCache in-memory cache running in each availability zone
    2. Implement sharding to distribute load to multiple RDS MySQL instances
    3. Increase the RDS MySQL Instance size and Implement provisioned IOPS
    4. Add an RDS MySQL read replica in each availability zone
  5. You are using ElastiCache Memcached to store session state and cache database queries in your infrastructure. You notice in CloudWatch that Evictions and Get Misses are both very high. What two actions could you take to rectify this? Choose 2 answers
    1. Increase the number of nodes in your cluster
    2. Tweak the max_item_size parameter
    3. Shrink the number of nodes in your cluster
    4. Increase the size of the nodes in the cluster
  6. You have been tasked with moving an ecommerce web application from a customer’s datacenter into a VPC. The application must be fault tolerant as well as highly scalable. Moreover, the customer is adamant that service interruptions not affect the user experience. As you near launch, you discover that the application currently uses multicast to share session state between web servers. In order to handle session state within the VPC, you choose to:
    1. Store session state in Amazon ElastiCache for Redis (scalable and makes the web applications stateless)
    2. Create a mesh VPN between instances and allow multicast on it
    3. Store session state in Amazon Relational Database Service (RDS solution not highly scalable)
    4. Enable session stickiness via Elastic Load Balancing (affects user experience if the instance goes down)
  7. When you are designing to support a 24-hour flash sale, which one of the following methods best describes a strategy to lower the latency while keeping up with unusually heavy traffic?
    1. Launch enhanced networking instances in a placement group to support the heavy traffic (only improves internal communication)
    2. Apply Service Oriented Architecture (SOA) principles instead of a 3-tier architecture (just simplifies architecture)
    3. Use Elastic Beanstalk to enable blue-green deployment (only minimizes download for applications and ease of rollback)
    4. Use ElastiCache as in-memory storage on top of DynamoDB to store user sessions (scalable, faster read/writes and in memory storage)
  8. You are configuring your company’s application to use Auto Scaling and need to move user state information. Which of the following AWS services provides a shared data store with durability and low latency?
    1. AWS ElastiCache Memcached (does not provide durability as if the node is gone the data is gone)
    2. Amazon Simple Storage Service
    3. Amazon EC2 instance storage
    4. Amazon DynamoDB
  9. Your application is using an ELB in front of an Auto Scaling group of web/application servers deployed across two AZs and a Multi-AZ RDS Instance for data persistence. The database CPU is often above 80% usage and 90% of I/O operations on the database are reads. To improve performance you recently added a single-node Memcached ElastiCache Cluster to cache frequent DB query results. In the next weeks the overall workload is expected to grow by 30%. Do you need to change anything in the architecture to maintain the high availability for the application with the anticipated additional load and Why?
    1. You should deploy two Memcached ElastiCache Clusters in different AZs because the RDS Instance will not be able to handle the load if the cache node fails.
    2. If the cache node fails the automated ElastiCache node recovery feature will prevent any availability impact. (does not provide high availability, as data is lost if the node is lost)
    3. Yes you should deploy the Memcached ElastiCache Cluster with two nodes in the same AZ as the RDS DB master instance to handle the load if one cache node fails. (Single AZ affects availability as the DB is Multi AZ and would be overloaded if the AZ goes down)
    4. No if the cache node fails you can always get the same data from the DB without having any availability impact. (Will overload the database affecting availability)
  10. A read only news reporting site with a combined web and application tier and a database tier that receives large and unpredictable traffic demands must be able to respond to these traffic fluctuations automatically. What AWS services should be used to meet these requirements?
    1. Stateless instances for the web and application tier synchronized using ElastiCache Memcached in an autoscaling group monitored with CloudWatch and RDS with read replicas.
    2. Stateful instances for the web and application tier in an autoscaling group monitored with CloudWatch and RDS with read replicas (Stateful instances will not allow for scaling)
    3. Stateful instances for the web and application tier in an autoscaling group monitored with CloudWatch and multi-AZ RDS (Stateful instances will not allow for scaling & multi-AZ is for high availability and not scaling)
    4. Stateless instances for the web and application tier synchronized using ElastiCache Memcached in an autoscaling group monitored with CloudWatch and multi-AZ RDS (multi-AZ is for high availability and not scaling)
  11. You have written an application that uses the Elastic Load Balancing service to spread traffic to several web servers. Your users complain that they are sometimes forced to login again in the middle of using your application, after they have already logged in. This is not behavior you have designed. What is a possible solution to prevent this happening?
    1. Use instance memory to save session state.
    2. Use instance storage to save session state.
    3. Use EBS to save session state.
    4. Use ElastiCache to save session state.
    5. Use Glacier to save session state.

AWS Redshift – Certification

AWS Redshift

  • Amazon Redshift is a fully managed, fast and powerful, petabyte scale data warehouse service
  • Redshift automatically helps
    • set up, operate, and scale a data warehouse, from provisioning the infrastructure capacity
    • patch and back up the data warehouse, storing the backups for a user-defined retention period
    • monitor the nodes and drives to help recovery from failures
    • significantly lower the cost of a data warehouse, while also making it easy to analyze large amounts of data very quickly
    • provide fast querying capabilities over structured data using familiar SQL-based clients and business intelligence (BI) tools using standard ODBC and JDBC connections
    • use replication and continuous backups to enhance availability and improve data durability, and can automatically recover from node and component failures
    • scale up or down with a few clicks in the AWS Management Console or with a single API call
    • distribute & parallelize queries across multiple physical resources
    • support VPC, SSL, AES-256 encryption and Hardware Security Modules (HSMs) to protect the data in transit and at rest.
  • Redshift only supports Single-AZ deployments and the nodes are available within the same AZ, if the AZ supports Redshift clusters
  • Redshift provides monitoring using CloudWatch and metrics for compute utilization, storage utilization, and read/write traffic to the cluster are available with the ability to add user-defined custom metrics
  • Redshift provides Audit logging and AWS CloudTrail integration
  • Redshift can be easily enabled to a second region for disaster recovery.

Redshift Performance

  • Massively Parallel Processing (MPP)
    • automatically distributes data and query load across all nodes.
    • makes it easy to add nodes to the data warehouse and enables fast query performance as the data warehouse grows.
  • Columnar Data Storage
    • organizes the data by column, as column-based systems are ideal for data warehousing and analytics, where queries often involve aggregates performed over large data sets
    • columnar data is stored sequentially on the storage media, and requires far fewer I/Os, greatly improving query performance
  • Advanced Compression
    • Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk.
    • employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores.
    • doesn’t require indexes or materialized views and so uses less space than traditional relational database systems.
    • automatically samples the data and selects the most appropriate compression scheme, when the data is loaded into an empty table

Redshift Single vs Multi-Node Cluster

  • Single Node
    • single node configuration enables getting started quickly and cost-effectively & scale up to a multi-node configuration as the needs grow
  • Multi-Node
    • Multi-node configuration requires a leader node that manages client connections and receives queries, and two or more compute nodes that store data and perform queries and computations.
    • Leader node
      • provisioned automatically and not charged for
      • receives queries from client applications, parses the queries and develops execution plans, which are an ordered set of steps to process these queries.
      • coordinates the parallel execution of these plans with the compute nodes, aggregates the intermediate results from these nodes and finally returns the results back to the client applications.
    • Compute node
      • can contain from 1-128 compute nodes, depending on the node type
      • executes the steps specified in the execution plans and transmit data among themselves to serve these queries.
      • intermediate results are sent back to the leader node for aggregation before being sent back to the client applications.
      • supports Dense Storage (DS) or Dense Compute (DC) node instance types
        • Dense Storage (DS) allow creation of very large data warehouses using hard disk drives (HDDs) for a very low price point
        • Dense Compute (DC) allow creation of very high performance data warehouses using fast CPUs, large amounts of RAM and solid-state disks (SSDs)
      • direct access to compute nodes is not allowed
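
A minimal boto3 sketch of creating a multi-node cluster (the leader node is provisioned automatically and not charged for); identifiers, node type, and credentials are placeholder assumptions.

  import boto3

  redshift = boto3.client("redshift")

  redshift.create_cluster(
      ClusterIdentifier="example-dw",
      ClusterType="multi-node",                # leader node + compute nodes
      NodeType="dc2.large",                    # a Dense Compute (DC) node type
      NumberOfNodes=2,                         # compute nodes only; the leader is implicit
      DBName="analytics",
      MasterUsername="admin",
      MasterUserPassword="Example-Passw0rd1",  # placeholder; manage real credentials securely
  )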

Redshift Availability & Durability

  • Redshift replicates the data within the data warehouse cluster and continuously backs up the data to S3 (11 9’s durability)
  • Redshift mirrors each drive’s data to other nodes within the cluster.
  • Redshift will automatically detect and replace a failed drive or node
  • If a drive fails, Redshift
    • cluster will remain available in the event of a drive failure
    • the queries will continue with a slight latency increase while Redshift rebuilds the drive from replica of the data on that drive which is stored on other drives within that node
    • single node clusters do not support data replication and the cluster needs to be restored from snapshot on S3
  • In case of node failure(s), Redshift
    • automatically provisions new node(s) and begins restoring data from other drives within the cluster or from S3
    • prioritizes restoring the most frequently queried data so the most frequently executed queries will become performant quickly
    • cluster will be unavailable for queries and updates until a replacement node is provisioned and added to the cluster
  • In case the Redshift cluster's AZ goes down, Redshift
    • cluster is unavailable until power and network access to the AZ are restored
    • cluster’s data is preserved and can be used once AZ becomes available
    • cluster can be restored from any existing snapshots to a new AZ within the same region

Redshift Backup & Restore

  • Redshift replicates all the data within the data warehouse cluster when it is loaded and also continuously backs up the data to S3
  • Redshift always attempts to maintain at least three copies of the data
  • Redshift enables automated backups of the data warehouse cluster with a 1-day retention period, by default, which can be extended to max 35 days
  • Automated backups can be turned off by setting the retention period as 0
  • Redshift can also asynchronously replicate the snapshots to S3 in another region for disaster recovery
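
A brief boto3 sketch of extending the automated backup retention and enabling cross-region snapshot copy for disaster recovery (the identifiers and destination region are assumptions).

  import boto3

  redshift = boto3.client("redshift")

  # extend automated snapshot retention from the 1-day default (setting 0 disables automated backups)
  redshift.modify_cluster(
      ClusterIdentifier="example-dw",
      AutomatedSnapshotRetentionPeriod=35,
  )

  # asynchronously copy snapshots to another region for disaster recovery
  redshift.enable_snapshot_copy(
      ClusterIdentifier="example-dw",
      DestinationRegion="us-west-2",
      RetentionPeriod=7,
  )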

Redshift Scalability

  • Redshift allows scaling of the cluster either by
    • increasing the node instance type (Vertical scaling)
    • increasing the number of nodes (Horizontal scaling)
  • Redshift scaling changes are usually applied during the maintenance window or can be applied immediately
  • Redshift scaling process
    • existing cluster remains available for read operations only while a new data warehouse cluster gets created during scaling operations
    • data from the compute nodes in the existing data warehouse cluster is moved in parallel to the compute nodes in the new cluster
    • when the new data warehouse cluster is ready, the existing cluster will be temporarily unavailable while the canonical name record of the existing cluster is flipped to point to the new data warehouse cluster
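
The classic resize described above maps to a single API call, during which the cluster stays read-only until the endpoint is flipped to the new cluster; a hedged boto3 sketch (identifier, node type, and count are assumptions):

  import boto3

  redshift = boto3.client("redshift")

  # vertical (NodeType) and/or horizontal (NumberOfNodes) scaling
  redshift.modify_cluster(
      ClusterIdentifier="example-dw",
      NodeType="ds2.xlarge",
      NumberOfNodes=4,
  )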

Redshift vs EMR vs RDS

  • RDS is ideal for
    • structured data and running traditional relational databases while offloading database administration
    • for online-transaction processing (OLTP) and for reporting and analysis
  • Redshift is ideal for
    • large volumes of structured data that needs to be persisted and queried using standard SQL and existing BI tools
    • analytic and reporting workloads against very large data sets by harnessing the scale and resources of multiple nodes and using a variety of optimizations to provide improvements over RDS
    • preventing reporting and analytic processing from interfering with the performance of the OLTP workload
  • EMR is ideal for
    • processing and transforming unstructured or semi-structured data to bring in to Amazon Redshift and
    • for data sets that are relatively transitory, not stored for long-term use.

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated every day and both the answers and questions might become outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. With which AWS services CloudHSM can be used (select 2)
    1. S3
    2. DynamoDB
    3. RDS
    4. ElastiCache
    5. Amazon Redshift
  2. You have recently joined a startup company building sensors to measure street noise and air quality in urban areas. The company has been running a pilot deployment of around 100 sensors for 3 months. Each sensor uploads 1KB of sensor data every minute to a backend hosted on AWS. During the pilot, you measured a peak of 10 IOPS on the database, and you stored an average of 3GB of sensor data per month in the database. The current deployment consists of a load-balanced auto scaled Ingestion layer using EC2 instances and a PostgreSQL RDS database with 500GB standard storage. The pilot is considered a success and your CEO has managed to get the attention of some potential investors. The business plan requires a deployment of at least 100K sensors, which needs to be supported by the backend. You also need to store sensor data for at least two years to be able to compare year over year improvements. To secure funding, you have to make sure that the platform meets these requirements and leaves room for further scaling. Which setup will meet the requirements?
    1. Add an SQS queue to the ingestion layer to buffer writes to the RDS instance (RDS instance will not support data for 2 years)
    2. Ingest data into a DynamoDB table and move old data to a Redshift cluster (Handle 10K IOPS ingestion and store data into Redshift for analysis)
    3. Replace the RDS instance with a 6 node Redshift cluster with 96TB of storage (Does not handle the ingestion issue)
    4. Keep the current architecture but upgrade RDS storage to 3TB and 10K provisioned IOPS (RDS instance will not support data for 2 years)
  3. Which two AWS services provide out-of-the-box user configurable automatic backup-as-a-service and backup rotation options? Choose 2 answers
    1. Amazon S3
    2. Amazon RDS
    3. Amazon EBS
    4. Amazon Redshift
  4. Your department creates regular analytics reports from your company’s log files. All log data is collected in Amazon S3 and processed by daily Amazon Elastic Map Reduce (EMR) jobs that generate daily PDF reports and aggregated tables in CSV format for an Amazon Redshift data warehouse. Your CFO requests that you optimize the cost structure for this system. Which of the following alternatives will lower costs without compromising average performance of the system or data integrity for the raw data?
    1. Use reduced redundancy storage (RRS) for PDF and CSV data in Amazon S3. Add Spot instances to Amazon EMR jobs. Use Reserved Instances for Amazon Redshift. (Spot instances impact performance)
    2. Use reduced redundancy storage (RRS) for all data in S3. Use a combination of Spot instances and Reserved Instances for Amazon EMR jobs. Use Reserved instances for Amazon Redshift (A combination of Spot and Reserved Instances guarantees performance and helps reduce cost. Also, RRS would reduce cost and guarantee data integrity, which is different from data durability)
    3. Use reduced redundancy storage (RRS) for all data in Amazon S3. Add Spot Instances to Amazon EMR jobs. Use Reserved Instances for Amazon Redshift (Spot instances impact performance)
    4. Use reduced redundancy storage (RRS) for PDF and CSV data in S3. Add Spot Instances to EMR jobs. Use Spot Instances for Amazon Redshift. (Spot instances impact performance and Spot instances are not available for Redshift)

AWS Kinesis – Certification

AWS Kinesis

  • Amazon Kinesis enables real-time processing of streaming data at massive scale
  • Kinesis Streams enables building of custom applications that process or analyze streaming data for specialized needs
  • Kinesis Streams features
    • handles provisioning, deployment, ongoing-maintenance of hardware, software, or other services for the data streams
    • manages the infrastructure, storage, networking, and configuration needed to stream the data at the level of required data throughput
    • synchronously replicates data across three facilities in an AWS Region, providing high availability and data durability
    • stores records of a stream for up to 24 hours, by default, from the time they are added to the stream. The limit can be raised to up to 7 days by enabling extended data retention
  • Data such as clickstreams, application logs, social media etc can be added from multiple sources and within seconds is available for processing to the Amazon Kinesis Applications
  • Kinesis provides ordering of records, as well as the ability to read and/or replay records in the same order to multiple Kinesis applications.
  • Kinesis Streams is useful for rapidly moving data off data producers and then continuously processing the data, be it to transform the data before emitting to a data store, run real-time metrics and analytics, or derive more complex data streams for further processing
    • Accelerated log and data feed intake: Data producers can push data to Kinesis stream as soon as it is produced, preventing any data loss and making it available for processing within seconds.
    • Real-time metrics and reporting: Metrics can be extracted and used to generate reports from data in real-time.
    • Real-time data analytics: Run real-time streaming data analytics.
    • Complex stream processing: Create Directed Acyclic Graphs (DAGs) of Kinesis Applications and data streams, with Kinesis applications adding to another Amazon Kinesis stream for further processing, enabling successive stages of stream processing.
  • Kinesis limits
    • stores records of a stream for up to 24 hours, by default, which can be extended to max 7 days
    • maximum size of a data blob (the data payload before Base64-encoding) within one record is 1 megabyte (MB)
    • Each shard can support up to 1000 PUT records per second
    • Each account can provision 10 shards per region, which can be increased further through request
  • Amazon Kinesis is designed to process streaming big data and the pricing model allows heavy PUTs rate.
  • Amazon S3 is a cost-effective way to store your data, but not designed to handle a stream of data in real-time
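
A minimal boto3 sketch of creating a stream and extending its retention from the 24-hour default to the 7-day maximum (the stream name and shard count are assumptions).

  import boto3

  kinesis = boto3.client("kinesis")

  kinesis.create_stream(StreamName="example-stream", ShardCount=2)

  # extend retention from the default 24 hours up to 168 hours (7 days)
  kinesis.increase_stream_retention_period(
      StreamName="example-stream",
      RetentionPeriodHours=168,
  )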

Kinesis Streams

  • Shard
    • Streams are made of shards; a shard is the base throughput unit of a Kinesis stream.
    • Each shard provides a capacity of 1MB/sec data input and 2MB/sec data output
    • Each shard can support up to 1000 PUT records per second
    • All data is stored for 24 hours.
    • Replay data inside a 24-hour window
    • Shards define the capacity limits. If the limits are exceeded, either by data throughput or the number of PUT records, the put data call will be rejected with a ProvisionedThroughputExceeded exception.
    • This can be handled by
      • Implementing a retry on the data producer side, if this is due to a temporary rise of the stream’s input data rate
      • Dynamically scaling the number of shards (resharding) to provide enough capacity for the put data calls to consistently succeed
  • Record
    • A record is the unit of data stored in an Amazon Kinesis stream.
    • A record is composed of a sequence number, partition key, and data blob.
    • Data blob is the data of interest your data producer adds to a stream.
    • Maximum size of a data blob (the data payload before Base64-encoding) is 1 MB
  • Partition key
    • Partition key is used to segregate and route records to different shards of a stream.
    • A partition key is specified by your data producer while adding data to an Amazon Kinesis stream
  • Sequence number
    • A sequence number is a unique identifier for each record.
    • Sequence number is assigned by Amazon Kinesis when a data producer calls PutRecord or PutRecords operation to add data to an Amazon Kinesis stream.
    • Sequence numbers for the same partition key generally increase over time; the longer the time period between PutRecord or PutRecords requests, the larger the sequence numbers become.
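
A producer-side sketch showing a PUT with a partition key and a simple retry on ProvisionedThroughputExceeded, as suggested above; the stream name, payload, and backoff policy are illustrative assumptions.

  import time
  import boto3

  kinesis = boto3.client("kinesis")

  def put_with_retry(stream_name, data, partition_key, retries=3):
      for attempt in range(retries):
          try:
              # the response carries the assigned ShardId and SequenceNumber
              return kinesis.put_record(StreamName=stream_name, Data=data, PartitionKey=partition_key)
          except kinesis.exceptions.ProvisionedThroughputExceededException:
              time.sleep(2 ** attempt)   # simple exponential backoff; reshard if this persists
      raise RuntimeError("stream capacity exceeded; consider resharding")

  record = put_with_retry("example-stream", b'{"event": "click"}', partition_key="user-42")
  print(record["ShardId"], record["SequenceNumber"])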

Kinesis Streams Components

  • Data to an Amazon Kinesis stream can be added via PutRecord and PutRecords operations, Kinesis Producer Library (KPL), or Kinesis Agent.
    • Amazon Kinesis Agent
      • is a pre-built Java application that offers an easy way to collect and send data to Amazon Kinesis stream.
      • can be installed on Linux-based server environments such as web servers, log servers, and database servers
      • can be configured to monitor certain files on the disk and then continuously send new data to the Amazon Kinesis stream
    • Amazon Kinesis Producer Library (KPL)
      • is an easy to use and highly configurable library that helps you put data into an Amazon Kinesis stream.
      • presents a simple, asynchronous, and reliable interface that enables you to quickly achieve high producer throughput with minimal client resources.
  • Amazon Kinesis Application is a data consumer that reads and processes data from an Amazon Kinesis stream and can be built using either the Amazon Kinesis API or the Amazon Kinesis Client Library (KCL)
    • Amazon Kinesis Client Library (KCL)
      • is a pre-built library with multiple language support
      • delivers all records for a given partition key to same record processor
      • makes it easier to build multiple applications reading from the same Kinesis stream for e.g. to perform counting, aggregation, and filtering
      • handles complex issues such as adapting to changes in stream volume, load-balancing streaming data, coordinating distributed services, and processing data with fault-tolerance
    • Amazon Kinesis Connector Library
      • is a pre-built library that helps you easily integrate Amazon Kinesis Streams with other AWS services and third-party tools
      • Kinesis Client Library is required for Kinesis Connector Library
    • Amazon Kinesis Storm Spout is a pre-built library that helps you easily integrate Amazon Kinesis Streams with Apache Storm
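
For comparison with the KCL (which handles checkpointing, worker coordination, and resharding for you), a raw-API consumer loop looks roughly like the sketch below; the stream name and the single-shard assumption are illustrative.

  import time
  import boto3

  kinesis = boto3.client("kinesis")
  stream = "example-stream"

  # naive single-shard consumer; the KCL manages multiple shards, workers, and checkpoints
  shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
  iterator = kinesis.get_shard_iterator(
      StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
  )["ShardIterator"]

  while iterator:
      resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
      for rec in resp["Records"]:
          print(rec["SequenceNumber"], rec["Data"])
      iterator = resp["NextShardIterator"]
      time.sleep(1)   # stay within the per-shard read limits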

Kinesis vs SQS

  • Kinesis Streams enables real-time processing of streaming big data while SQS offers a reliable, highly scalable hosted queue for storing messages and move data between distributed application components
  • Kinesis provides ordering of records, as well as the ability to read and/or replay records in the same order to multiple Amazon Kinesis Applications while SQS does not guarantee data ordering and provides at least once delivery of messages
  • Kinesis stores the data up to 24 hours, by default, and can be extended to 7 days while SQS stores the message up to 4 days, by default, and can be configured from 1 minute to 14 days but clears the message once deleted by the consumer
  • Kinesis and SQS both guarantee at-least once delivery of messages
  • Kinesis supports multiple consumers while SQS allows the messages to be delivered to only one consumer at a time and requires multiple queues to deliver message to multiple consumers
  • Kinesis use cases requirements
    • Ordering of records.
    • Ability to consume records in the same order a few hours later
    • Ability for multiple applications to consume the same stream concurrently
    • Routing related records to the same record processor (as in streaming MapReduce)
  • SQS uses cases requirements
    • Messaging semantics like message-level ack/fail and visibility timeout
    • Leveraging SQS’s ability to scale transparently
    • Dynamically increasing concurrency/throughput at read time
    • Ability to delay individual messages


AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated every day and both the answers and questions might become outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. You are deploying an application to track GPS coordinates of delivery trucks in the United States. Coordinates are transmitted from each delivery truck once every three seconds. You need to design an architecture that will enable real-time processing of these coordinates from multiple consumers. Which service should you use to implement data ingestion?
    1. Amazon Kinesis
    2. AWS Data Pipeline
    3. Amazon AppStream
    4. Amazon Simple Queue Service
  2. You are deploying an application to collect votes for a very popular television show. Millions of users will submit votes using mobile devices. The votes must be collected into a durable, scalable, and highly available data store for real-time public tabulation. Which service should you use?
    1. Amazon DynamoDB
    2. Amazon Redshift
    3. Amazon Kinesis
    4. Amazon Simple Queue Service
  3. Your company is in the process of developing a next generation pet collar that collects biometric information to assist families with promoting healthy lifestyles for their pets. Each collar will push 30kb of biometric data in JSON format every 2 seconds to a collection platform that will process and analyze the data, providing health trending information back to the pet owners and veterinarians via a web portal. Management has tasked you to architect the collection platform ensuring the following requirements are met: provide the ability for real-time analytics of the inbound biometric data; ensure processing of the biometric data is highly durable, elastic and parallel; the results of the analytic processing should be persisted for data mining. Which architecture outlined below will meet the initial requirements for the collection platform?
    1. Utilize S3 to collect the inbound sensor data, analyze the data from S3 with a daily scheduled Data Pipeline and save the results to a Redshift Cluster.
    2. Utilize Amazon Kinesis to collect the inbound sensor data, analyze the data with Kinesis clients and save the results to a Redshift cluster using EMR. (refer link)
    3. Utilize SQS to collect the inbound sensor data, analyze the data from SQS with Amazon Kinesis and save the results to a Microsoft SQL Server RDS instance.
    4. Utilize EMR to collect the inbound sensor data, analyze the data from EMR with Amazon Kinesis and save the results to DynamoDB.
  4. Your customer wants to consolidate their log streams (access logs, application logs, security logs, etc.) into a single system. Once consolidated, the customer wants to analyze these logs in real time based on heuristics. From time to time, the customer needs to validate heuristics, which requires going back to data samples extracted from the last 12 hours. What is the best approach to meet your customer’s requirements?
    1. Send all the log events to Amazon SQS. Setup an Auto Scaling group of EC2 servers to consume the logs and apply the heuristics.
    2. Send all the log events to Amazon Kinesis and develop a client process to apply heuristics on the logs (can perform real-time analysis and stores data for 24 hours, which can be extended to 7 days)
    3. Configure Amazon CloudTrail to receive custom logs and use EMR to apply heuristics on the logs (CloudTrail is only for auditing)
    4. Setup an Auto Scaling group of EC2 syslogd servers, store the logs on S3 and use EMR to apply heuristics on the logs (EMR is for batch analysis)
  5. You require the ability to analyze a customer’s clickstream data on a website so they can do behavioral analysis. Your customer needs to know what sequence of pages and ads their customers clicked on. This data will be used in real time to modify the page layouts as customers click through the site to increase stickiness and advertising click-through. Which option meets the requirements for capturing and analyzing this data?
    1. Log clicks in weblogs by URL, store to Amazon S3, and then analyze with Elastic MapReduce
    2. Push web clicks by session to Amazon Kinesis and analyze behavior using Kinesis workers
    3. Write click events directly to Amazon Redshift and then analyze with SQL
    4. Publish web clicks by session to an Amazon SQS queue, then periodically drain these events to Amazon RDS and analyze with SQL
  6. Your social media monitoring application uses a Python app running on AWS Elastic Beanstalk to inject tweets, Facebook updates and RSS feeds into an Amazon Kinesis stream. A second AWS Elastic Beanstalk app generates key performance indicators into an Amazon DynamoDB table and powers a dashboard application. What is the most efficient option to prevent any data loss for this application?
    1. Use AWS Data Pipeline to replicate your DynamoDB tables into another region.
    2. Use the second AWS Elastic Beanstalk app to store a backup of Kinesis data onto Amazon Elastic Block Store (EBS), and then create snapshots from your Amazon EBS volumes.
    3. Add a second Amazon Kinesis stream in another Availability Zone and use AWS data pipeline to replicate data across Kinesis streams.
    4. Add a third AWS Elastic Beanstalk app that uses the Amazon Kinesis S3 connector to archive data from Amazon Kinesis into Amazon S3.
  7. You need to replicate API calls across two systems in real time. What tool should you use as a buffer and transport mechanism for API call events?
    1. AWS SQS
    2. AWS Lambda
    3. AWS Kinesis (AWS Kinesis is an event stream service. Streams can act as buffers and transport across systems for in-order programmatic events, making it ideal for replicating API calls across systems)
    4. AWS SNS
  8. You need to perform ad-hoc business analytics queries on well-structured data. Data comes in constantly at a high velocity. Your business intelligence team can understand SQL. What AWS service(s) should you look to first?
    1. Kinesis Firehose + RDS
    2. Kinesis Firehose + RedShift (Kinesis Firehose provides a managed service for aggregating streaming data and inserting it into RedShift. RedShift also supports ad-hoc queries over well-structured data using a SQL-compliant wire protocol, so the business team should be able to adopt this system easily. Refer link)
    3. EMR using Hive
    4. EMR running Apache Spark

AWS Elastic Transcoder – Certification

AWS Elastic Transcoder

  • Amazon Elastic Transcoder is a highly scalable, easy-to-use and cost-effective way for developers and businesses to convert (or “transcode”) video files from their source format into versions that will play back on multiple devices like smartphones, tablets and PCs.
  • Elastic Transcoder is for any customer with media assets stored in S3 for e.g. developers creating apps or websites that publish user-generated content, enterprises and educational establishments converting training and communication videos, and content owners and broadcasters needing to convert media assets into web-friendly formats.
  • Elastic Transcoder features
    • can be used to convert files from different media formats into H.264/AAC/MP4 files at different resolutions, bitrates, and frame rates, and set up transcoding pipelines to transcode files in parallel.
    • can be configured to overlay up to four graphics, known as watermarks, over a video during transcoding
    • can be configured to transcode captions, or subtitles, from one format to another and supports embedded and sidecar caption types
    • provides clip stitching ability to stitch together parts, or clips, from multiple input files to create a single output
    • can be configured to create Thumbnails
  • Elastic Transcoder is integrated with CloudTrail, an AWS service that captures information about every request that is sent to the Elastic Transcoder API by your AWS account, including your IAM users

Elastic Transcoder Components

  • Presets
    • are templates that contain most of the settings for transcoding media files from one format to another.
    • Elastic Transcoder includes some default presets for common formats and the ability to create customized presets
  • Jobs
    • do the work of transcoding and convert a file into up to 30 formats.
    • take the input file to be transcoded, the names of the transcoded files, and several other settings as input
    • For each transcoded format, a preset needs to be specified (see the pipeline and job sketch after this list)
  • Pipelines
    • are queues that manage the transcoding jobs.
    • Elastic Transcoder processes jobs, and the outputs within a job (when transcoding into multiple formats), in the order in which they are added.
    • can be paused to temporarily stop processing jobs
  • Notifications
    • help keep you apprised of the status of a job, i.e. started, completed, encountered a warning, or encountered an error
    • eliminate the need for polling to determine when a job has finished and can be configured during pipeline creation
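  • A minimal sketch of the pipeline and job flow using boto3 (the bucket names, role ARN, SNS topic ARNs, object keys and preset ID are assumptions for illustration; verify the current system preset IDs in the console or documentation)

    import boto3

    transcoder = boto3.client("elastictranscoder", region_name="us-east-1")

    # A pipeline is the queue that jobs are submitted to; SNS notifications
    # remove the need to poll for job status.
    pipeline = transcoder.create_pipeline(
        Name="example-pipeline",
        InputBucket="example-source-videos",
        OutputBucket="example-transcoded-videos",
        Role="arn:aws:iam::123456789012:role/Elastic_Transcoder_Default_Role",
        Notifications={
            "Progressing": "",
            "Completed": "arn:aws:sns:us-east-1:123456789012:transcode-done",
            "Warning": "",
            "Error": "arn:aws:sns:us-east-1:123456789012:transcode-error",
        },
    )
    pipeline_id = pipeline["Pipeline"]["Id"]

    # A job converts one input into one or more outputs; each output references
    # a preset (for HLS output you would pick one of the HLS system presets and
    # also set SegmentDuration).
    transcoder.create_job(
        PipelineId=pipeline_id,
        Input={"Key": "uploads/training-video.mp4"},
        Outputs=[
            {
                "Key": "mp4/training-video-720p.mp4",
                "PresetId": "1351620000001-000010",  # system preset: Generic 720p
                "ThumbnailPattern": "thumbnails/training-video-{count}",
            }
        ],
    )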

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Your website is serving on-demand training videos to your workforce. Videos are uploaded monthly in high resolution MP4 format. Your workforce is distributed globally, often on the move, and using company-provided tablets that require the HTTP Live Streaming (HLS) protocol to watch a video. Your company has no video transcoding expertise and, if required, you might need to pay for a consultant. How do you implement the most cost-efficient architecture without compromising high availability and quality of video delivery?
    1. Elastic Transcoder to transcode original high-resolution MP4 videos to HLS. S3 to host videos with Lifecycle Management to archive original files to Glacier after a few days. CloudFront to serve HLS transcoded videos from S3
    2. A video transcoding pipeline running on EC2 using SQS to distribute tasks and Auto Scaling to adjust the number of nodes depending on the length of the queue. S3 to host videos with Lifecycle Management to archive all files to Glacier after a few days. CloudFront to serve HLS transcoded videos from Glacier
    3. Elastic Transcoder to transcode original high-resolution MP4 videos to HLS. EBS volumes to host videos and EBS snapshots to incrementally back up original files after a few days. CloudFront to serve HLS transcoded videos from EC2.
    4. A video transcoding pipeline running on EC2 using SQS to distribute tasks and Auto Scaling to adjust the number of nodes depending on the length of the queue. EBS volumes to host videos and EBS snapshots to incrementally back up original files after a few days. CloudFront to serve HLS transcoded videos from EC2

References