AWS S3 Security


  • AWS S3 Security is a shared responsibility between AWS and the Customer
  • As a managed service, S3 is protected by the AWS global network security procedures
  • AWS handles basic security tasks like guest operating system (OS) and database patching, firewall configuration, and disaster recovery.
  • Security and compliance of S3 is assessed by third-party auditors as part of multiple AWS compliance programs including SOC, PCI DSS, HIPAA, etc.
  • S3 provides several other security features, which are the customer’s responsibility to configure and use.

S3 Data Protection

Refer to the blog post @ S3 Data Protection

S3 Encryption

Refer to the blog post @ S3 Encryption

S3 Permissions

Refer to the blog post @ S3 Permissions

S3 Object Lock

  • S3 Object Lock helps to store objects using a write-once-read-many (WORM) model.
  • S3 Object Lock can help prevent objects from being deleted or overwritten for a fixed amount of time or indefinitely.
  • S3 Object Lock can help meet regulatory requirements that require WORM storage or add an extra layer of protection against object changes and deletion.
  • Object Lock can be enabled only for new buckets. To use it with an existing bucket, contact AWS Support.
  • Enabling Object Lock automatically enables versioning for the bucket.
  • Once Object Lock is enabled, you can’t disable Object Lock or suspend versioning for the bucket.
  • S3 Object Lock provides two retention modes that apply different levels of protection to the objects
    • Governance mode
      • Users can’t overwrite or delete an object version or alter its lock settings unless they have special permissions.
      • Objects can be protected from being deleted by most users, but you can still grant some users permission to alter the retention settings or delete the object if necessary.
      • Can be used to test retention-period settings before creating a compliance-mode retention period.
    • Compliance mode
      • A protected object version can’t be overwritten or deleted by any user, including the root user in your AWS account.
      • When an object is locked in compliance mode, its retention mode can’t be changed, and its retention period can’t be shortened.
      • Compliance mode helps ensure that an object version can’t be overwritten or deleted for the duration of the retention period.
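
A rough sketch of the workflow above using boto3 (bucket name, key, and retention date are placeholders; outside us-east-1 a CreateBucketConfiguration with a LocationConstraint would also be needed):

    import boto3
    from datetime import datetime, timezone

    s3 = boto3.client("s3")

    # Object Lock can only be enabled at bucket creation; this also turns on versioning.
    s3.create_bucket(Bucket="example-worm-bucket", ObjectLockEnabledForBucket=True)

    s3.put_object(Bucket="example-worm-bucket", Key="audit.log", Body=b"log data")

    # Protect this object version with a governance-mode retention period.
    s3.put_object_retention(
        Bucket="example-worm-bucket",
        Key="audit.log",
        Retention={
            "Mode": "GOVERNANCE",
            "RetainUntilDate": datetime(2030, 1, 1, tzinfo=timezone.utc),
        },
    )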

S3 VPC Gateway Endpoint

  • A VPC endpoint enables connections between a VPC and supported services, without requiring that you use an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection.
  • VPC is not exposed to the public internet.
  • A Gateway Endpoint is a gateway that you specify as a target for a route in your route table, used for traffic destined for S3.
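
A minimal sketch of creating an S3 gateway endpoint with boto3 (the region, VPC ID, and route table ID are placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # The gateway endpoint adds a route for the S3 prefix list to the route table,
    # so traffic to S3 stays on the AWS network instead of the public internet.
    ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId="vpc-0123456789abcdef0",
        ServiceName="com.amazonaws.us-east-1.s3",
        RouteTableIds=["rtb-0123456789abcdef0"],
    )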

S3 Security Best Practices

S3 Preventative Security Best Practices

  • Ensure S3 buckets use the correct policies and are not publicly accessible
    • Use S3 Block Public Access (see the sketch after this list)
    • Identify Bucket policies and ACLs that allow public access
    • Use AWS Trusted Advisor to inspect the S3 implementation.
  • Implement least privilege access
  • Use IAM roles for applications and AWS services that require S3 access
  • Enable Multi-factor authentication (MFA) Delete to help prevent accidental bucket deletions
  • Consider Data at Rest Encryption
  • Enforce Data in Transit Encryption
  • Consider S3 Object Lock to store objects using a “Write Once Read Many” (WORM) model.
  • Enable versioning to easily recover from both unintended user actions and application failures.
  • Consider S3 Cross-Region replication
  • Consider VPC endpoints for S3 access to provide private S3 connectivity and help prevent traffic from potentially traversing the open internet.
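
A minimal sketch of enabling S3 Block Public Access on a bucket with boto3 (the bucket name is a placeholder; account-wide settings can be applied through the S3 Control API instead):

    import boto3

    s3 = boto3.client("s3")

    # Block every form of public access for this bucket.
    s3.put_public_access_block(
        Bucket="example-bucket",
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )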

S3 Monitoring and Auditing Best Practices

  • Identify and Audit all S3 buckets to have visibility of all the S3 resources to assess their security posture and take action on potential areas of weakness.
  • Implement monitoring using AWS monitoring tools
  • Enable S3 server access logging, which provides detailed records of the requests made to a bucket and is useful for security and access audits
  • Use AWS CloudTrail, which provides a record of actions taken by a user, a role, or an AWS service in S3.
  • Enable AWS Config, which enables you to assess, audit, and evaluate the configurations of the AWS resources
  • Consider using Amazon Macie with S3 to automatically discover, classify, and protect sensitive data in AWS.
  • Monitor AWS security advisories by regularly checking the security advisories posted in Trusted Advisor for the AWS account.
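
A minimal sketch of enabling server access logging with boto3 (bucket names are placeholders; the target bucket must already allow the S3 logging service to write to it):

    import boto3

    s3 = boto3.client("s3")

    # Deliver access logs for example-bucket into a separate logging bucket.
    s3.put_bucket_logging(
        Bucket="example-bucket",
        BucketLoggingStatus={
            "LoggingEnabled": {
                "TargetBucket": "example-access-logs",
                "TargetPrefix": "example-bucket/",
            }
        },
    )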

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.

References

AWS_S3_Security

AWS S3 Encryption


  • AWS S3 Encryption supports both data at rest and data in transit encryption.
  • Data in-transit
    • S3 allows protection of data in transit by enabling communication via SSL or using client-side encryption
  • Data at Rest
    • Server-Side Encryption
      • S3 encrypts the object before saving it on disks in its data centers and decrypts it when the objects are downloaded
    • Client-Side Encryption
      • data is encrypted at the client-side and uploaded to S3.
      • the encryption process, the encryption keys, and related tools are managed by the user.

S3 Server-Side Encryption

  • Server-side encryption is about data encryption at rest
  • Server-side encryption encrypts only the object data.
  • Any object metadata is not encrypted.
  • S3 handles the encryption (as it writes to disks) and decryption (when objects are accessed) of the data objects
  • There is no difference in the access mechanism for encrypted and unencrypted objects, as encryption and decryption are handled transparently by S3

Server-Side Encryption with S3-Managed Keys – SSE-S3

  • Each object is encrypted with a unique data key employing strong multi-factor encryption.
  • SSE-S3 encrypts the data key with a master key that is regularly rotated.
  • S3 server-side encryption uses one of the strongest block ciphers available, 256-bit Advanced Encryption Standard (AES-256), to encrypt the data.
  • Whether or not objects are encrypted with SSE-S3 can’t be enforced when they are uploaded using pre-signed URLs, because the only way server-side encryption can be specified is through the AWS Management Console or through an HTTP request header
  • Must set the header x-amz-server-side-encryption to AES256
  • For enforcing server-side encryption for all of the objects stored in a bucket, use a bucket policy that denies permission to upload an object unless the request includes the x-amz-server-side-encryption header to request server-side encryption.
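
A minimal sketch of requesting SSE-S3 on upload with boto3 (bucket and key are placeholders); the SDK sends the x-amz-server-side-encryption: AES256 header on your behalf:

    import boto3

    s3 = boto3.client("s3")

    s3.put_object(
        Bucket="example-bucket",
        Key="report.csv",
        Body=b"col1,col2\n1,2\n",
        ServerSideEncryption="AES256",  # SSE-S3
    )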

Server-Side Encryption with AWS KMS-Managed Keys – SSE-KMS


  • SSE-KMS is similar to SSE-S3, but it uses AWS Key Management Services (KMS) which provides additional benefits along with additional charges
    • KMS is a service that combines secure, highly available hardware and software to provide a key management system scaled for the cloud.
    • KMS uses customer master keys (CMKs) to encrypt the S3 objects.
    • The master key is never made available
    • KMS enables you to centrally create encryption keys, define the policies that control how keys can be used
    • Allows audit of keys used to prove they are being used correctly, by inspecting logs in AWS CloudTrail
    • Allows keys to be temporarily disabled and re-enabled
    • Allows keys to be rotated regularly
    • Security controls in AWS KMS can help meet encryption-related compliance requirements.
  • SSE-KMS enables separate permissions for the use of an envelope key (that is, a key that protects the data’s encryption key) that provides added protection against unauthorized access of the objects in S3.
  • SSE-KMS provides the option to create and manage encryption keys yourself, or use a default customer master key (CMK) that is unique to you, the service you’re using, and the region you’re working in.
  • Creating and Managing CMK gives more flexibility, including the ability to create, rotate, disable, and define access controls, and to audit the encryption keys used to protect the data.
  • Data keys used to encrypt the data are also encrypted and stored alongside the data they protect and are unique to each object.
  • Process flow
    • An application or AWS service client requests an encryption key to encrypt data and passes a reference to a master key under the account.
    • Client requests are authenticated based on whether they have access to use the master key.
    • A new data encryption key is created, and a copy of it is encrypted under the master key.
    • Both the data key and encrypted data key are returned to the client.
    • Data key is used to encrypt customer data and then deleted as soon as is practical.
    • Encrypted data key is stored for later use and sent back to AWS KMS when the source data needs to be decrypted.
  • S3 only supports symmetric keys and not asymmetric keys.
  • Must set header x-amz-server-side-encryption to aws:kms
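
A minimal sketch of requesting SSE-KMS on upload with boto3 (bucket, key, and KMS key ARN are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Sends x-amz-server-side-encryption: aws:kms plus the KMS key to use.
    s3.put_object(
        Bucket="example-bucket",
        Key="report.csv",
        Body=b"col1,col2\n1,2\n",
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="arn:aws:kms:us-east-1:111122223333:key/example-key-id",
    )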

Server-Side Encryption with Customer-Provided Keys – SSE-C


  • Encryption keys can be managed and provided by the Customer and S3 manages the encryption, as it writes to disks, and decryption, when you access the objects
  • When you upload an object, the encryption key is provided as a part of the request and S3 uses that encryption key to apply AES-256 encryption to the data and removes the encryption key from memory.
  • When you download an object, the same encryption key should be provided as a part of the request. S3 first verifies the encryption key and, if it matches, decrypts the object before returning it to you.
  • As each object and each object’s version can be encrypted with a different key, you are responsible for maintaining the mapping between the object and the encryption key used.
  • SSE-C requests must be done through HTTPS and S3 will reject any requests made over HTTP when using SSE-C.
  • For security considerations, AWS recommends treating any key sent erroneously over HTTP as compromised, and it should be discarded or rotated
  • S3 does not store the encryption key provided. Instead, a randomly salted HMAC value of the encryption key is stored which can be used to validate future requests. The salted HMAC value cannot be used to decrypt the contents of the encrypted object or to derive the value of the encryption key. That means, if you lose the encryption key, you lose the object.
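
A minimal sketch of SSE-C with boto3 (bucket and key are placeholders); the SDK base64-encodes the customer key and computes the key-MD5 header for you, and requests must go over HTTPS:

    import os
    import boto3

    s3 = boto3.client("s3")

    # Customer-provided 256-bit key; S3 uses it, then discards it.
    key = os.urandom(32)

    s3.put_object(
        Bucket="example-bucket",
        Key="secret.bin",
        Body=b"sensitive data",
        SSECustomerAlgorithm="AES256",
        SSECustomerKey=key,
    )

    # The same key must be supplied again to download the object.
    obj = s3.get_object(
        Bucket="example-bucket",
        Key="secret.bin",
        SSECustomerAlgorithm="AES256",
        SSECustomerKey=key,
    )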

Client-Side Encryption

Client-side encryption refers to encrypting data before sending it to S3 and decrypting the data after downloading it

AWS KMS-managed Customer Master Key – CMK

  • Customer can maintain the encryption CMK with AWS KMS and can provide the CMK id to the client to encrypt the data
  • Uploading Object
    • AWS S3 encryption client first sends a request to AWS KMS for the key to encrypt the object data
    • AWS KMS returns a randomly generated data encryption key in two versions: a plain text version for encrypting the data and a cipher blob to be uploaded with the object as object metadata
    • Client obtains a unique data encryption key for each object it uploads.
    • AWS S3 encryption client uploads the encrypted data and the cipher blob with object metadata
  • Download Object
    • AWS Client first downloads the encrypted object along with the cipher blob version of the data encryption key stored as object metadata
    • AWS Client then sends the cipher blob to AWS KMS to get the plain text version of the same, so that it can decrypt the object data.
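
A hand-rolled sketch of this flow using boto3 and the third-party cryptography package (this is not the official S3 encryption client; the key alias, bucket, and metadata name are illustrative):

    import base64, os
    import boto3
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    kms = boto3.client("kms")
    s3 = boto3.client("s3")

    # 1. Request a data key: plaintext copy for local encryption, cipher blob to store.
    dk = kms.generate_data_key(KeyId="alias/example-cmk", KeySpec="AES_256")

    # 2. Encrypt the object locally with the plaintext data key.
    nonce = os.urandom(12)
    ciphertext = AESGCM(dk["Plaintext"]).encrypt(nonce, b"object contents", None)

    # 3. Upload the ciphertext with the encrypted data key stored as object metadata.
    s3.put_object(
        Bucket="example-bucket",
        Key="cse-kms-object",
        Body=nonce + ciphertext,
        Metadata={"x-amz-key-v2": base64.b64encode(dk["CiphertextBlob"]).decode()},
    )

    # To download: get the object, pass the cipher blob to kms.decrypt() to recover
    # the plaintext data key, then decrypt the body locally.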

Client-Side master key

  • Encryption master keys are completely maintained at the Client-side
  • Uploading Object
    • S3 encryption client (e.g. AmazonS3EncryptionClient in the AWS SDK for Java) locally generates a random one-time-use symmetric key (also known as a data encryption key or data key).
    • Client encrypts the data encryption key using the customer provided master key
    • Client uses this data encryption key to encrypt the data of a single S3 object (for each object, the client generates a separate data key).
    • Client then uploads the encrypted data to S3 and also saves the encrypted data key and its material description as object metadata (x-amz-meta-x-amz-key) in S3 by default
  • Downloading Object
    • Client first downloads the encrypted object from S3 along with the object metadata.
    • Using the material description in the metadata, the client first determines which master key to use to decrypt the encrypted data key.
    • Using that master key, the client decrypts the data key and uses it to decrypt the object
  • Client-side master keys and your unencrypted data are never sent to AWS
  • If the master key is lost the data cannot be decrypted
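
A simplified sketch of the same envelope pattern with a client-side master key, using boto3 and the third-party cryptography package (this is not the official AmazonS3EncryptionClient; names are illustrative):

    import base64, os
    import boto3
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    s3 = boto3.client("s3")

    master_key = os.urandom(32)   # client-side master key, never sent to AWS
    data_key = os.urandom(32)     # one-time data key for this object

    # Encrypt the object with the data key, then wrap the data key with the master key.
    n1, n2 = os.urandom(12), os.urandom(12)
    body = n1 + AESGCM(data_key).encrypt(n1, b"object contents", None)
    wrapped = n2 + AESGCM(master_key).encrypt(n2, data_key, None)

    # Store the wrapped data key as object metadata (surfaced as x-amz-meta-x-amz-key).
    s3.put_object(
        Bucket="example-bucket",
        Key="cse-object",
        Body=body,
        Metadata={"x-amz-key": base64.b64encode(wrapped).decode()},
    )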

Enforcing S3 Encryption

  • S3 Default Encryption
    • helps set the default encryption behavior for an S3 bucket so that all new objects are encrypted when they are stored in the bucket.
    • Objects are encrypted using SSE with either S3-managed keys (SSE-S3) or AWS KMS keys stored in AWS KMS (SSE-KMS).
  • S3 Bucket Policy
    • can be applied to deny permission to upload an object unless the request includes the x-amz-server-side-encryption header to request server-side encryption.
    • is not required if S3 default encryption is enabled
    • are evaluated before the default encryption.

S3 Bucket Policy Enforce Encryption
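
A hedged sketch of such a policy applied with boto3 (the bucket name is a placeholder; use "AES256" in the condition to require SSE-S3 instead of SSE-KMS):

    import json
    import boto3

    s3 = boto3.client("s3")

    # Deny any PutObject request that does not ask for server-side encryption.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": "arn:aws:s3:::example-bucket/*",
                "Condition": {
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
                },
            }
        ],
    }

    s3.put_bucket_policy(Bucket="example-bucket", Policy=json.dumps(policy))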

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. A company is storing data on Amazon Simple Storage Service (S3). The company’s security policy mandates that data is encrypted at rest. Which of the following methods can achieve this? Choose 3 answers
    1. Use Amazon S3 server-side encryption with AWS Key Management Service managed keys
    2. Use Amazon S3 server-side encryption with customer-provided keys
    3. Use Amazon S3 server-side encryption with EC2 key pair.
    4. Use Amazon S3 bucket policies to restrict access to the data at rest.
    5. Encrypt the data on the client-side before ingesting to Amazon S3 using their own master key
    6. Use SSL to encrypt the data while in transit to Amazon S3.
  2. A user has enabled versioning on an S3 bucket. The user is using server side encryption for data at Rest. If the user is supplying his own keys for encryption (SSE-C) which of the below mentioned statements is true?
    1. The user should use the same encryption key for all versions of the same object
    2. It is possible to have different encryption keys for different versions of the same object
    3. AWS S3 does not allow the user to upload his own keys for server side encryption
    4. The SSE-C does not work when versioning is enabled
  3. A storage admin wants to encrypt all the objects stored in S3 using server side encryption. The user does not want to use the AES 256 encryption key provided by S3. How can the user achieve this?
    1. The admin should upload his secret key to the AWS console and let S3 decrypt the objects
    2. The admin should use CLI or API to upload the encryption key to the S3 bucket. When making a call to the S3 API mention the encryption key URL in each request
    3. S3 does not support client supplied encryption keys for server side encryption
    4. The admin should send the keys and encryption algorithm with each API call
  4. A user has enabled versioning on an S3 bucket. The user is using server side encryption for data at rest. If the user is supplying his own keys for encryption (SSE-C), what is recommended to the user for the purpose of security?
    1. User should not use his own security key as it is not secure
    2. Configure S3 to rotate the user’s encryption key at regular intervals
    3. Configure S3 to store the user’s keys securely with SSL
    4. Keep rotating the encryption key manually at the client side
  5. A system admin is planning to encrypt all objects being uploaded to S3 from an application. The system admin does not want to implement his own encryption algorithm; instead he is planning to use server side encryption by supplying his own key (SSE-C). Which parameter is not required while making a call for SSE-C?
    1. x-amz-server-side-encryption-customer-key-AES-256
    2. x-amz-server-side-encryption-customer-key
    3. x-amz-server-side-encryption-customer-algorithm
    4. x-amz-server-side-encryption-customer-key-MD5
  6. You are designing a personal document-archiving solution for your global enterprise with thousands of employees. Each employee has potentially gigabytes of data to be backed up in this archiving solution. The solution will be exposed to the employees as an application, where they can just drag and drop their files to the archiving system. Employees can retrieve their archives through a web interface. The corporate network has high bandwidth AWS Direct Connect connectivity to AWS. You have regulatory requirements that all data needs to be encrypted before being uploaded to the cloud. How do you implement this in a highly available and cost-efficient way?
    1. Manage encryption keys on-premise in an encrypted relational database. Set up an on-premises server with sufficient storage to temporarily store files and then upload them to Amazon S3, providing a client-side master key. (Storing temporary increases cost and not a high availability option)
    2. Manage encryption keys in a Hardware Security Module(HSM) appliance on-premise server with sufficient storage to temporarily store, encrypt, and upload files directly into amazon Glacier. (Not cost effective)
    3. Manage encryption keys in amazon Key Management Service (KMS), upload to amazon simple storage service (s3) with client-side encryption using a KMS customer master key ID and configure Amazon S3 lifecycle policies to store each object using the amazon glacier storage tier. (with CSE-KMS the encryption happens at client side before the object is upload to S3 and KMS is cost effective as well)
    4. Manage encryption keys in an AWS CloudHSM appliance. Encrypt files prior to uploading on the employee desktop and then upload directly into amazon glacier (Not cost effective)
  7. A user has enabled server side encryption with S3. The user downloads the encrypted object from S3. How can the user decrypt it?
    1. S3 does not support server side encryption
    2. S3 provides a server side key to decrypt the object
    3. The user needs to decrypt the object using their own private key
    4. S3 manages encryption and decryption automatically
  8. When uploading an object, what request header can be explicitly specified in a request to Amazon S3 to encrypt object data when saved on the server side?
    1. x-amz-storage-class
    2. Content-MD5
    3. x-amz-security-token
    4. x-amz-server-side-encryption
  9. A company must ensure that any objects uploaded to an S3 bucket are encrypted. Which of the following actions should the SysOps Administrator take to meet this requirement? (Select TWO.)
    1. Implement AWS Shield to protect against unencrypted objects stored in S3 buckets.
    2. Implement Object access control list (ACL) to deny unencrypted objects from being uploaded to the S3 bucket.
    3. Implement Amazon S3 default encryption to make sure that any object being uploaded is encrypted before it is stored.
    4. Implement Amazon Inspector to inspect objects uploaded to the S3 bucket to make sure that they are encrypted.
    5. Implement S3 bucket policies to deny unencrypted objects from being uploaded to the buckets.

References

AWS_S3_Encryption

AWS S3 vs EBS vs EFS

S3 vs EBS vs EFS

EFS, EBS, and S3 are three different AWS storage types that are applicable to different types of workload needs

S3 vs EBS vs EFS Comparison


Simple Storage Service – S3

  • is an object store with a simple key-value store design, good at storing vast numbers of backups or user files.
  • offers pay for the storage you actually use. Offers cost-saving storage classes ideal for infrequently accessed data or for data archival
  • provides unlimited storage
  • provides durability as the data is replicated and stored across at least three geographically dispersed AZs with a maximum of 99.999999999% (11 9’s)
  • provides high availability with a maximum of 99.99%
  • provides security with a range of access control mechanisms and abilities to encrypt data at rest and in transit
  • data can be accessed programmatically or directly from services such as AWS CloudFront.
  • provides backup capability using versioning and cross-region replication

Elastic Block Storage – EBS

  • delivers high-availability block-level storage volumes for EC2 instances.
  • offers pay for the provisioned storage, even if you do not use it
  • provides limited storage capability and cannot scale infinitely
  • stores data that can be retained after the EC2 instance is shut down.
  • provides durability by replicating data across multiple servers in an AZ to prevent the loss of data from the failure of any single component
  • designed for 99.999% availability
  • provides low-latency performance – using SSD EBS volumes, it offers reliable I/O performance scaled to meet your workload needs.
  • provides secure storage with access control and providing data at rest and in transit encryption
  • is only accessible from a single EC2 instance in the particular AWS region and AZ
  • provides Multi-Attach option to share storage across multiple EC2 instances, but within a particular AWS region and AZ
  • provides backup capability using backups and snapshots

Elastic File Storage – EFS

  • scalable file storage, also optimized for EC2.
  • offers pay for the storage you actually use. There’s no advance provisioning, up-front fees, or commitments
  • multiple instances can be configured to mount the file system.
  • allows mounting the file system across multiple regions and instances.
  • is designed to be highly durable and highly available. Data is redundantly stored across multiple AZs.
  • provides elasticity – scales up and down automatically, even to meet the most abrupt workload spikes.
  • provides performance that scales to support any workload: EFS offers the throughput changing workloads need. It can provide higher throughput in spurts that match sudden file system growth, even for workloads up to 500,000 IOPS or 10 GB per second.
  • provides accessible file storage, which can be accessed by On-premises servers and EC2 instances concurrently.
  • provides security and compliance – access to the file system can be secured with the current security solution, or control access to EFS file systems using IAM, VPC, or POSIX permissions.
  • provides data encryption in transit or at rest.
  • allows EC2 instances to access EFS file systems located in other AWS regions through VPC peering.
  • a file system can be accessed concurrently from all AZs in the region where it is located, which means the application can be architected to failover from one AZ to other AZs in the region in order to ensure the highest level of application availability. Mount targets themselves are designed to be highly available.
  • used as a common data source for any application or workload that runs on numerous instances.

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. A company runs an application on a group of Amazon Linux EC2 instances. The application writes log files using standard API calls. For compliance reasons, all log files must be retained indefinitely and will be analyzed by a reporting tool that must access all files concurrently. Which storage service should a solutions architect use to provide the MOST cost-effective solution?
    1. Amazon EBS
    2. Amazon EFS
    3. Amazon EC2 instance store
    4. Amazon S3
  2. A new application is being deployed on Amazon EC2. The Application needs to read and write up to 3 TB of data to an external data store and requires read-after-write consistency across all AWS regions for writing new objects into this data store. Which data store should be used?
    1. Amazon EBS
    2. Amazon Glacier
    3. Amazon EFS
    4. Amazon S3
  3. To meet the requirements of an application, an organization needs to save a constantly increasing volume of files on a cloud storage system with the following features and abilities. Which AWS service below will meet these requirements?
      1. Pay only for the storage used
      2. Create different security policies for different groups of files
      3. Allow access to the public
      4. Retrieve the files at any time
      5. Store an unlimited number of files
    1. Amazon EBS
    2. Amazon S3
    3. Amazon Glacier
    4. Amazon EFS
  4. An administrator runs a highly available application in AWS. A file storage layer is needed that can be shared between instances and scale the platform more easily. The storage should also be POSIX compliant. Which AWS service can perform this action?
    1. Amazon EBS
    2. Amazon S3
    3. Amazon EFS
    4. Amazon EC2 Instance store

Reference

AWS_When_to_choose_EFS

AWS Certification – Storage Services – Cheat Sheet

Simple Storage Service – S3

  • provides key-value based object storage for the internet, with unlimited storage and unlimited objects of up to 5 TB each
  • offers an extremely durable, highly available, and infinitely scalable data storage infrastructure at very low costs.
  • is Object-level storage (not Block-level storage) and cannot be used to host an OS or dynamic websites (but can work with the JavaScript SDK)
  • provides durability by redundantly storing objects on multiple facilities within a region
  • regularly verifies the integrity of data using checksums and provides the auto-healing capability
  • S3 resources consist of globally unique buckets with objects and related metadata. The data model is a flat structure with no hierarchies or folders.
  • S3 Replication enables automatic, asynchronous copying of objects across S3 buckets in the same or different AWS regions using SRR or CRR. Replication needs versioning enabled on both the source and destination buckets.
  • S3 Transfer Acceleration helps speed data transport over long distances between a client and an S3 bucket using CloudFront edge locations.
  • S3 supports cost-effective Static Website hosting with Client-side scripts.
  • S3 CORS – Cross-Origin Resource Sharing allows cross-origin access to S3 resources.
  • S3 Access Logs enables tracking access requests to an S3 bucket.
  • S3 notification feature enables notifications to be triggered when certain events happen in the bucket.
  • S3 Inventory helps manage the storage and can be used to audit and report on the replication and encryption status of the objects for business, compliance, and regulatory needs.
  • Requester Pays helps the bucket owner specify that the requester will be charged for the download.
  • S3 Batch Operations help perform large-scale batch operations on S3 objects and can perform a single operation on lists of specified S3 objects.
  • Pre-Signed URLs can be shared for uploading/downloading objects for a limited time without requiring AWS security credentials (see the sketch after this list).
  • Multipart Uploads allows
    • parallel uploads with improved throughput and bandwidth utilization
    • fault tolerance and quick recovery from network issues
    • ability to pause and resume uploads
    • begin an upload before the final object size is known
  • Versioning
    • helps preserve, retrieve, and restore every version of every object
    • protect from unintended overwrites and accidental deletions
    • protects individual files but does NOT protect from Bucket deletion
  • MFA (Multi-Factor Authentication) can be enabled for additional security for the deletion of objects.
  • Integrates with CloudTrail, CloudWatch, and SNS for event notifications
  • S3 Storage Classes
    • S3 Standard
      • default storage class, ideal for frequently accessed data
      • 99.999999999% durability & 99.99% availability
      • Low latency and high throughput performance
      • designed to sustain the loss of data in two facilities
    • S3 Standard-Infrequent Access (S3 Standard-IA)
      • optimized for long-lived and less frequently accessed data
      • designed to sustain the loss of data in two facilities
      • 99.999999999% durability & 99.9% availability
      • suitable for objects greater than 128 KB kept for at least 30 days
    • S3 One Zone-Infrequent Access (S3 One Zone-IA)
      • optimized for rapid access, less frequently access data
      • ideal for secondary backups and reproducible data
      • stores data in a single AZ, data stored in this storage class will be lost in the event of AZ destruction.
      • 99.999999999% durability & 99.5% availability
    • S3 Reduced Redundancy Storage (Not Recommended)
      • designed for noncritical, reproducible data stored at lower levels of redundancy than the STANDARD storage class
      • reduces storage costs
      • 99.99% durability & 99.99% availability
      • designed to sustain the loss of data in a single facility
    • S3 Glacier
      • suitable for low cost data archiving, where data access is infrequent
      • provides retrieval time of minutes to several hours
        • Expedited – 1 to 5 minutes
        • Standard – 3 to 5 hours
        • Bulk – 5 to 12 hours
      • 99.999999999% durability & 99.9% availability
      • Minimum storage duration of 90 days
    • S3 Glacier Deep Archive
      • provides lowest cost data archiving, where data access is infrequent
      • 99.999999999% durability & 99.9% availability
      • provides retrieval time of several (12-48) hours
        • Standard – 12 hours
        • Bulk – 48 hours
      • Minimum storage duration of 180 days
      • supports long-term retention and digital preservation for data that may be accessed once or twice a year
  • Lifecycle Management policies
    • transition to move objects to different storage classes and Glacier
    • expiration to remove objects and object versions
    • can be applied to both current and non-current objects, in case, versioning is enabled.
  • Data Consistency Model
    • provides strong read-after-write consistency for PUT and DELETE requests of objects in the S3 bucket in all AWS Regions
    • updates to a single key are atomic
    • does not currently support object locking for concurrent writes
  • S3 Security
    • IAM policies – grant users within your own AWS account permission to access S3 resources
    • Bucket and Object ACL – grant other AWS accounts (not specific users) access to  S3 resources
    • Bucket policies – allows to add or deny permissions across some or all of the objects within a single bucket
    • S3 Access Points simplify data access for any AWS service or customer application that stores data in S3.
    • S3 Glacier Vault Lock helps deploy and enforce compliance controls for individual S3 Glacier vaults with a vault lock policy.
    • S3 VPC Gateway Endpoint enables private connections between a VPC and S3, without requiring that you use an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection.
    • Support SSL encryption of data in transit and data encryption at rest
  • S3 Data Encryption
    • supports data at rest and data in transit encryption
    • Server-Side Encryption
      • SSE-S3 – encrypts S3 objects using keys handled & managed by AWS
      • SSE-KMS – leverage AWS Key Management Service to manage encryption keys. KMS provides control and audit trail over the keys.
      • SSE-C – when you want to manage your own encryption keys. AWS does not store the encryption key. Requires HTTPS.
    • Client-Side Encryption
      • Client library such as the S3 Encryption Client
      • Clients must encrypt data themselves before sending it to S3
      • Clients must decrypt data themselves when retrieving from S3
      • Customer fully manages the keys and encryption cycle
  • S3 Best Practices
    • use a random hash prefix for keys and ensure a random access pattern; as S3 stores objects lexicographically, randomness helps distribute the contents across multiple partitions for better performance
    • use parallel threads and Multipart upload for faster writes
    • use parallel threads and Range Header GET for faster reads
    • for list operations with a large number of objects, it’s better to build a secondary index in DynamoDB
    • use Versioning to protect from unintended overwrites and deletions, but this does not protect against bucket deletion
    • use S3 VPC Endpoints to transfer data over the Amazon internal network
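
A minimal sketch of the pre-signed URL feature mentioned above, using boto3 (bucket and key are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Time-limited download link; the URL embeds a signature, so the caller
    # needs no AWS credentials of their own.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "example-bucket", "Key": "report.csv"},
        ExpiresIn=3600,  # seconds
    )
    print(url)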

Instance Store

  • provides temporary or ephemeral block-level storage for an EC2 instance
  • is physically attached to the Instance
  • deliver very high random I/O performance, which is a good option when storage with very low latency is needed
  • cannot be dynamically resized
  • data persists when an instance is rebooted
  • data does not persist if the
    • underlying disk drive fails
    • instance stops i.e. if the EBS backed instance with instance store volumes attached is stopped
    • instance terminates
  • can be attached to an EC2 instance only when the instance is launched
  • is ideal for the temporary storage of information that changes frequently, such as buffers, caches, scratch data, and other temporary content, or for data that is replicated across a fleet of instances, such as a load-balanced pool of web servers.

Elastic Block Store – EBS

  • is virtual network-attached block storage
  • provides highly available, reliable, durable, block-level storage volumes that can be attached to a running instance
  • provides high durability and are redundant in an AZ, as the data is automatically replicated within that AZ to prevent data loss due to any single hardware component failure
  • persists and is independent of EC2 lifecycle
  • multiple volumes can be attached to a single EC2 instance
  • can be detached & attached to another EC2 instance in that same AZ only
  • volumes are Zonal i.e. created in a specific AZ and CAN’T span across AZs
  • snapshots
    • to make a volume available in a different AZ, create a snapshot of the volume and restore it to a new volume in any AZ within the region
    • to make a volume available in a different region, copy the snapshot of the volume to that region and restore it as a volume there
  • PIOPS is designed to run transactional applications that require high and consistent IO, e.g. relational databases, NoSQL, etc.
  • volumes CANNOT be shared with multiple EC2 instances, use EFS instead
  • Multi-Attach enables attaching a single Provisioned IOPS SSD (io1 or io2) volume to multiple instances that are in the same AZ.

EBS Encryption

  • allows encryption using the EBS encryption feature.
  • All data stored at rest, disk I/O, and snapshots created from the volume are encrypted.
  • uses the 256-bit Advanced Encryption Standard algorithm (AES-256) with keys managed through AWS KMS
  • Snapshots of encrypted EBS volumes are automatically encrypted.

EBS Snapshots

  • helps create backups of EBS volumes
  • are incremental
  • occur asynchronously, consume the instance IOPS
  • are regional and CANNOT span across regions
  • can be copied across regions to make it easier to leverage multiple regions for geographical expansion, data center migration, and disaster recovery
  • can be shared by making them public or with specific AWS accounts by modifying the access permissions of the snapshots
  • support EBS encryption
    • Snapshots of encrypted volumes are automatically encrypted
    • Volumes created from encrypted snapshots are automatically encrypted
    • All data in flight between the instance and the volume is encrypted
    • Volumes created from an unencrypted snapshot that you own or have access to can be encrypted on the fly.
    • An encrypted snapshot that you own or have access to can be encrypted with a different key during the copy process.
  • can be automated using AWS Data Lifecycle Manager
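
A minimal sketch of a cross-region, re-encrypted snapshot copy with boto3 (regions, snapshot ID, and key alias are placeholders):

    import boto3

    # copy_snapshot is called in the destination region.
    ec2 = boto3.client("ec2", region_name="us-west-2")

    ec2.copy_snapshot(
        SourceRegion="us-east-1",
        SourceSnapshotId="snap-0123456789abcdef0",
        Encrypted=True,
        KmsKeyId="alias/example-key",   # the copy can use a different key
        Description="DR copy",
    )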

EBS vs Instance Store

Refer to the blog post @ EBS vs Instance Store

Glacier

  • suitable for archiving data, where data access is infrequent and a retrieval time of several hours (3 to 5 hours) is acceptable (Not true anymore with enhancements from AWS)
  • provides high durability by storing archives in multiple facilities and on multiple devices at a very low storage cost
  • performs regular, systematic data integrity checks and is built to be automatically self-healing
  • aggregate files into bigger files before sending them to Glacier and use range retrievals to retrieve partial files and reduce costs
  • improve speed and reliability with multipart upload
  • automatically encrypts the data using AES-256
  • upload or download data to Glacier via SSL encrypted endpoints

EFS

  • fully-managed, easy to set up, scale, and cost-optimize file storage
  • can automatically scale from gigabytes to petabytes of data without needing to provision storage
  • provides managed NFS (network file system) that can be mounted on and accessed by multiple EC2 in multiple AZs simultaneously
  • highly durable, highly scalable and highly available.
    • stores data redundantly across multiple Availability Zones
    • grows and shrinks automatically as files are added and removed, so there is no need to manage storage procurement or provisioning.
  • expensive (3x gp2), but you pay per use
  • uses the Network File System version 4 (NFS v4) protocol
  • is compatible with all Linux-based AMIs for EC2; a POSIX file system (~Linux) that has a standard file API
  • does not support Windows AMI
  • offers the ability to encrypt data at rest using KMS and in transit.
  • can be accessed from on-premises using an AWS Direct Connect or AWS VPN connection between the on-premises datacenter and VPC.
  • can be accessed concurrently from servers in the on-premises datacenter as well as EC2 instances in the Amazon VPC
  • Performance mode
    • General purpose (default)
      • latency-sensitive use cases (web server, CMS, etc…)
    • Max I/O
      • higher latency, throughput, highly parallel (big data, media processing)
  • Storage Tiers
    • Standard
      • for frequently accessed files
      • ideal for active file system workloads and you pay only for the file system storage you use per month
    • Infrequent access (EFS-IA)
      • a lower cost storage class that’s cost-optimized for files infrequently accessed i.e. not accessed every day
      • there is a cost to retrieve files, but a lower price to store them
    • EFS Lifecycle Management with choosing an age-off policy allows moving files to EFS IA
    • Lifecycle Management automatically moves the data to the EFS IA storage class according to the lifecycle policy, e.g. you can move files automatically into EFS IA after fourteen days of not being accessed.
    • EFS is a shared POSIX system for Linux systems and does not work for Windows

Amazon FSx for Windows

  • is a fully managed,  highly reliable, and scalable Windows file system share drive
  • supports SMB protocol & Windows NTFS
  • supports Microsoft Active Directory integration, ACLs, user quotas
  • built on SSD, scale up to 10s of GB/s, millions of IOPS, 100s PB of data
  • is accessible from Windows, Linux, and MacOS compute instances
  • can be accessed from the on-premise infrastructure
  • can be configured to be Multi-AZ (high availability)
  • supports encryption of data at rest and in transit
  • provides data deduplication, which enables further cost optimization by removing redundant data.
  • data is backed-up daily to S3

Amazon FSx for Lustre

  • provides an easy and cost-effective way to launch and run the world’s most popular high-performance file system.
  • is a type of parallel distributed file system, for large-scale computing
  • Lustre is derived from “Linux” and “cluster”
  • suited for Machine Learning, High Performance Computing (HPC) esp. Video Processing, Financial Modeling, and Electronic Design Automation
  • scales up to 100s GB/s, millions of IOPS, sub-ms latencies
  • seamless integration with S3, it transparently presents S3 objects as files and allows you to write changed data back to S3.
  • can “read S3” as a file system (through FSx)
  • can write the output of the computations back to S3 (through FSx)
  • supports encryption of data at rest and in transit
  • can be used from on-premise servers

CloudFront

  • provides low latency and high data transfer speeds for distribution of static, dynamic web or streaming content to web users
  • delivers the content through a worldwide network of data centers called Edge Locations
  • keeps persistent connections with the origin servers so that the files can be fetched from the origin servers as quickly as possible.
  • dramatically reduces the number of network hops that users’ requests must pass through
  • supports multiple origin server options, like AWS hosted service for e.g. S3, EC2, ELB or an on premise server, which stores the original, definitive version of the objects
  • single distribution can have multiple origins and Path pattern in a cache behavior determines which requests are routed to the origin
  • supports Web Download distribution and RTMP Streaming distribution
    • Web distribution supports static, dynamic web content, on demand using progressive download & HLS and live streaming video content
    • RTMP supports streaming of media files using Adobe Media Server and the Adobe Real-Time Messaging Protocol (RTMP) ONLY
  • supports HTTPS using either
    • dedicated IP address, which is expensive as dedicated IP address is assigned to each CloudFront edge location
    • Server Name Indication (SNI), which is free but supported by modern browsers only with the domain name available in the request header
  • For E2E HTTPS connection,
    • Viewers -> CloudFront needs either self signed certificate, or certificate issued by CA or ACM
    • CloudFront -> Origin needs certificate issued by ACM for ELB and by CA for other origins
  •  Security
    • Origin Access Identity (OAI) can be used to restrict the content from S3 origin to be accessible from CloudFront only
    • supports Geo restriction (Geo-Blocking) to whitelist or blacklist countries that can access the content
    • Signed URLs (see the sketch after this list)
      • for RTMP distribution as signed cookies aren’t supported
      • to restrict access to individual files, for e.g., an installation download for your application.
      • users using a client, for e.g. a custom HTTP client, that doesn’t support cookies
    • Signed Cookies
      • provide access to multiple restricted files, for e.g., video part files in HLS format or all of the files in the subscribers’ area of a website.
      • don’t want to change the current URLs
    • integrates with AWS WAF, a web application firewall that helps protect web applications from attacks by allowing rules configured based on IP addresses, HTTP headers, and custom URI strings
  • supports GET, HEAD, OPTIONS, PUT, POST, PATCH, DELETE to get object & object headers, add, update, and delete objects
    • only caches responses to GET and HEAD requests and, optionally, OPTIONS requests
    • does not cache responses to PUT, POST, PATCH, DELETE request methods and these requests are proxied back to the origin
  • object removal from cache
    • would be removed upon expiry (TTL) from the cache, by default 24 hrs
    • can be invalidated explicitly, but this has an associated cost; however, users might continue to see the old version until it expires from those caches
    • objects can be invalidated only for Web distribution
    • change object name, versioning, to serve different version
  • supports adding or modifying custom headers before the request is sent to origin which can be used to
    • validate if user is accessing the content from CDN
    • identifying CDN from which the request was forwarded from, in case of multiple CloudFront distribution
    • for viewers not supporting CORS to return the Access-Control-Allow-Origin header for every request
  • supports Partial GET requests using range header to download object in smaller units improving the efficiency of partial downloads and recovery from partially failed transfers
  • supports compression to compress and serve compressed files when viewer requests include Accept-Encoding: gzip in the request header
  • supports different price class to include all regions, to include only least expensive regions and other regions to exclude most expensive regions
  • supports access logs which contain detailed information about every user request for both web and RTMP distribution
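
A sketch of generating a CloudFront signed URL with botocore's CloudFrontSigner and the third-party cryptography package (the key-pair ID, key file, and domain are placeholders):

    from datetime import datetime, timedelta, timezone

    from botocore.signers import CloudFrontSigner
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import padding

    # Private key matching the public key registered with CloudFront.
    with open("private_key.pem", "rb") as f:
        private_key = serialization.load_pem_private_key(f.read(), password=None)

    def rsa_signer(message):
        return private_key.sign(message, padding.PKCS1v15(), hashes.SHA1())

    signer = CloudFrontSigner("KEXAMPLEKEYPAIRID", rsa_signer)

    # Canned-policy signed URL valid for one hour.
    url = signer.generate_presigned_url(
        "https://d111111abcdef8.cloudfront.net/installer.zip",
        date_less_than=datetime.now(timezone.utc) + timedelta(hours=1),
    )
    print(url)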

AWS Import/Export

  • accelerates moving large amounts of data into and out of AWS using portable storage devices for transport and transfers data directly using Amazon’s high speed internal network, bypassing the internet.
  • suitable for use cases with
    • large datasets
    • low bandwidth connections
    • first time migration of data
  • Importing data to several types of AWS storage, including EBS snapshots, S3 buckets, and Glacier vaults.
  • Exporting data out is supported from S3 only; with versioning enabled, only the latest version is exported
  • Import data can be encrypted (optional but recommended) while export is always encrypted using TrueCrypt
  • Amazon will wipe the device if specified, however it will not destroy the device

AWS Storage Options – S3 & Glacier

Amazon S3

  • highly-scalable, reliable, and low-latency data storage infrastructure at very low costs.
  • provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from within Amazon EC2 or from anywhere on the web.
  • allows you to write, read, and delete objects containing from 1 byte to 5 terabytes of data each.
  • number of objects you can store in an Amazon S3 bucket is virtually unlimited.
  • highly secure, supporting encryption at rest, and providing multiple mechanisms to provide fine-grained control of access to Amazon S3 resources.
  • highly scalable, allowing concurrent read or write access to Amazon S3 data by many separate clients or application threads.
  • provides data lifecycle management capabilities, allowing users to define rules to automatically archive Amazon S3 data to Amazon Glacier, or to delete data at end of life.

Ideal Use Cases

  • Storage & Distribution of static web content and media
    • frequently used to host static websites and provides a highly-available and highly-scalable solution for websites with only static content, including HTML files, images, videos, and client-side scripts such as JavaScript
    • works well for fast growing websites hosting data intensive, user-generated content, such as video and photo sharing sites as no storage provisioning is required
    • content can either be directly served from Amazon S3 since each object in Amazon S3 has a unique HTTP URL address
    • can also act as an Origin store for the Content Delivery Network (CDN) such as Amazon CloudFront
    • it works particularly well for hosting web content with extremely spiky bandwidth demands because of S3’s elasticity
  • Data Store for Large Objects
    • can be paired with an RDS or NoSQL database and used to store large objects, e.g. files, while the associated metadata, e.g. name, tags, comments, etc., can be stored in the RDS or NoSQL database where it can be indexed and queried, providing faster access to relevant data
  • Data store for computation and large-scale analytics
    • commonly used as a data store for computation and large-scale analytics, such as analyzing financial transactions, clickstream analytics, and media transcoding.
    • data can be accessed from multiple computing nodes concurrently without being constrained by a single connection because of its horizontal scalability
  • Backup and Archival of critical data
    • used as a highly durable, scalable, and secure solution for backup and archival of critical data, and to provide disaster recovery solutions for business continuity.
    • stores objects redundantly on multiple devices across multiple facilities, it provides the highly-durable storage infrastructure needed for these scenarios.
    • its versioning capability is available to protect critical data from inadvertent deletion

Anti-Patterns

Amazon S3 has the following Anti-Patterns where it is not an optimal solution

  • Dynamic website hosting
    • While Amazon S3 is ideal for hosting static websites, dynamic websites requiring server side interaction, scripting or database interaction cannot be hosted and should rather be hosted on Amazon EC2
  • Backup and archival storage
    • Data requiring long term archival storage with infrequent read access can be stored more cost effectively in Amazon Glacier
  • Structured Data Query
    • Amazon S3 doesn’t offer query capabilities, so to read an object the object name and key must be known. Instead, pair S3 with RDS or DynamoDB to store, index, and query metadata about Amazon S3 objects
    • NOTE – S3 now provides query capabilities and also Athena can be used
  • Rapidly Changing Data
    • Data that needs to be updated frequently might be better served by a storage solution with lower read/write latencies, such as Amazon EBS volumes, RDS, or DynamoDB.
  • File System
    • Amazon S3 uses a flat namespace and isn’t meant to serve as a standalone, POSIX-compliant file system. However, by using delimiters (commonly either the ‘/’ or ‘\’ character) you are able to construct your keys to emulate the hierarchical folder structure of a file system within a given bucket.

Performance

  • Access to Amazon S3 from within Amazon EC2 in the same region is fast.
  • Amazon S3 is designed so that server-side latencies are insignificant relative to Internet latencies.
  • Amazon S3 is also built to scale storage, requests, and users to support a virtually unlimited number of web-scale applications.
  • If Amazon S3 is accessed using multiple threads, multiple applications, or multiple clients concurrently, total Amazon S3 aggregate throughput will typically scale to rates that far exceed what any single server can generate or consume.

Durability & Availability

  • Amazon S3 storage provides the highest level of data durability and availability, by automatically and synchronously storing your data across both multiple devices and multiple facilities within the selected geographical region
  • Error correction is built-in, and there are no single points of failure. Amazon S3 is designed to sustain the concurrent loss of data in two facilities, making it very well-suited to serve as the primary data storage for mission-critical data.
  • Amazon S3 is designed for 99.999999999% (11 nines) durability per object and 99.99% availability over a one-year period.
  • Amazon S3 data can be protected from unintended deletions or overwrites using Versioning.
  • Versioning can be enabled with MFA (Multi Factor Authentication) Delete on the bucket, which would require two forms of authentication to delete an object
  • For Non Critical and Reproducible data for e.g. thumbnails, transcoded media etc., S3 Reduced Redundancy Storage (RRS) can be used, which provides a lower level of durability at a lower storage cost
  • RRS is designed to provide 99.99% durability per object over a given year. While RRS is less durable than standard Amazon S3, it is still designed to provide 400 times more durability than a typical disk drive

Cost Model

  • With Amazon S3, you pay only for what you use and there is no minimum fee.
  • Amazon S3 has three pricing components: storage (per GB per month), data transfer in or out (per GB per month), and requests (per n thousand requests per month).

Scalability & Elasticity

  • Amazon S3 has been designed to offer a very high level of scalability and elasticity automatically
  • Amazon S3 supports a virtually unlimited number of files in any bucket
  • Amazon S3 bucket can store a virtually unlimited number of bytes
  • Amazon S3 allows you to store any number of objects (files) in a single bucket, and Amazon S3 will automatically manage scaling and distributing redundant copies of your information to other servers in other locations in the same region, all using Amazon’s high-performance infrastructure.

Interfaces

  • Amazon S3 provides standards-based REST and SOAP web services APIs for both management and data operations.
  • NOTE – SOAP support over HTTP is deprecated, but it is still available over HTTPS. New Amazon S3 features will not be supported for SOAP. We recommend that you use either the REST API or the AWS SDKs.
  • Amazon S3 provides easier to use higher level toolkit or SDK in different languages (Java, .NET, PHP, and Ruby) that wraps the underlying APIs
  • Amazon S3 Command Line Interface (CLI) provides a set of high-level, Linux-like Amazon S3 file commands for common operations, such as ls, cp, mv, sync, etc. They also provide the ability to perform recursive uploads and downloads using a single folder-level Amazon S3 command, and supports parallel transfers.
  • AWS Management Console provides the ability to easily create and manage Amazon S3 buckets, upload and download objects, and browse the contents of your Amazon S3 buckets using a simple web-based user interface
  • All interfaces provide the ability to store Amazon S3 objects (files) in uniquely-named buckets (top-level folders), with each object identified by an unique Object key within that bucket.

Glacier

  • extremely low-cost storage service that provides highly secure, durable, and flexible storage for data backup and archival
  • can reliably store data for as little as $0.01 per gigabyte per month.
  • offloads to AWS the administrative burdens of operating and scaling storage, such as capacity planning, hardware provisioning, data replication, hardware failure detection and repair, or time-consuming hardware migrations
  • Data is stored in Amazon Glacier as Archives where an archive can represent a single file or multiple files combined into a single archive
  • Archives are stored in Vaults for which the access can be controlled through IAM
  • Retrieving archives from Vaults requires initiation of a job and can take around 3-5 hours
  • Amazon Glacier integrates seamlessly with Amazon S3 by using S3 data lifecycle management policies to move data from S3 to Glacier
  • AWS Import/Export can also be used to accelerate moving large amounts of data into Amazon Glacier using portable storage devices for transport

Ideal Usage Patterns

  • Amazon Glacier is ideally suited for long term archival solution for infrequently accessed data with archiving offsite enterprise information, media assets, research and scientific data, digital preservation and magnetic tape replacement

Anti-Patterns

Amazon Glacier has the following Anti-Patterns where it is not an optimal solution

  • Rapidly changing data
    • Data that must be updated very frequently might be better served by a storage solution with lower read/write latencies such as Amazon EBS or a Database
  • Real time access
    • Data stored in Glacier cannot be accessed in real time and requires initiation of a job for object retrieval, with retrieval times ranging from 3-5 hours. If immediate access is needed, Amazon S3 is a better choice.

Performance

  • Amazon Glacier is a low-cost storage service designed to store data that is infrequently accessed and long lived.
  • Amazon Glacier jobs typically complete in 3 to 5 hours

Durability and Availability

  • Amazon Glacier redundantly stores data in multiple facilities and on multiple devices within each facility
  • Amazon Glacier is designed to provide average annual durability of 99.999999999% (11 nines) for an archive
  • Amazon Glacier synchronously stores your data across multiple facilities before returning SUCCESS on uploading archives.
  • Amazon Glacier also performs regular, systematic data integrity checks and is built to be automatically self-healing.

Cost Model

  • Amazon Glacier has three pricing components: storage (per GB per month), data transfer out (per GB per month), and requests (per thousand UPLOAD and RETRIEVAL requests per month).
  • Amazon Glacier is designed with the expectation that retrievals are infrequent and unusual, and that data will be stored for extended periods of time. It allows you to retrieve up to 5% of your average monthly storage (pro-rated daily) for free each month; any additional data retrieved is charged per GB
  • Amazon Glacier also charges a pro-rated fee (per GB) for items deleted before 90 days of storage

Scalability & Elasticity

  • A single archive is limited to 40 TB, but there is no limit to the total amount of data you can store in the service.
  • Amazon Glacier scales to meet your growing and often unpredictable storage requirements; whether you’re storing petabytes or gigabytes, Amazon Glacier automatically scales your storage up or down as needed.

Interfaces

  • Amazon Glacier provides a native, standards-based REST web services interface, as well as Java and .NET SDKs.
  • AWS Management Console or the Amazon Glacier APIs can be used to create vaults to organize the archives in Amazon Glacier.
  • Amazon Glacier APIs can be used to upload and retrieve archives, monitor the status of your jobs and also configure your vault to send you a notification via Amazon Simple Notification Service (Amazon SNS) when your jobs complete.
  • Amazon Glacier can be used as a storage class in Amazon S3 by using object lifecycle management to provide automatic, policy-driven archiving from Amazon S3 to Amazon Glacier.
  • The Amazon S3 API provides a RESTORE operation, and the retrieval process takes the same 3-5 hours (see the restore sketch after this list)
  • On retrieval, a copy of the retrieved object is placed in Amazon S3 RRS storage for a specified retention period; the original archived object remains stored in Amazon Glacier and you are charged for both copies.
  • When using Amazon Glacier as a storage class in Amazon S3, use the Amazon S3 APIs, and when using “native” Amazon Glacier, you use the Amazon Glacier APIs
  • Objects archived to Amazon Glacier via Amazon S3 can only be listed and retrieved via the Amazon S3 APIs or the AWS Management Console—they are not visible as archives in an Amazon Glacier vault.
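As a minimal sketch of the S3 RESTORE operation described above, the following boto3 snippet initiates a restore of an archived object and checks its restore status; the bucket, key, retention days, and retrieval tier are illustrative assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Initiate a restore job for an object archived to the Glacier storage class;
# the temporary copy is kept in S3 for the specified number of days.
s3.restore_object(
    Bucket="my-archive-bucket",
    Key="backups/2019/archive.tar",
    RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}},
)

# The restore runs asynchronously; HEAD the object to check the Restore status.
head = s3.head_object(Bucket="my-archive-bucket", Key="backups/2019/archive.tar")
print(head.get("Restore"))  # e.g. 'ongoing-request="true"' while the job is running
```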

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. You want to pass queue messages that are 1GB each. How should you achieve this?
    1. Use Kinesis as a buffer stream for message bodies. Store the checkpoint id for the placement in the Kinesis Stream in SQS.
    2. Use the Amazon SQS Extended Client Library for Java and Amazon S3 as a storage mechanism for message bodies. (Amazon SQS messages with Amazon S3 can be useful for storing and retrieving messages with a message size of up to 2 GB. To manage Amazon SQS messages with Amazon S3, use the Amazon SQS Extended Client Library for Java. Refer link)
    3. Use SQS’s support for message partitioning and multi-part uploads on Amazon S3.
    4. Use AWS EFS as a shared pool storage medium. Store filesystem pointers to the files on disk in the SQS message bodies.
  2. Company ABCD has recently launched an online commerce site for bicycles on AWS. They have a “Product” DynamoDB table that stores details for each bicycle, such as, manufacturer, color, price, quantity and size to display in the online store. Due to customer demand, they want to include an image for each bicycle along with the existing details. Which approach below provides the least impact to provisioned throughput on the “Product” table?
    1. Serialize the image and store it in multiple DynamoDB tables
    2. Create an “Images” DynamoDB table to store the Image with a foreign key constraint to the “Product” table
    3. Add an image data type to the “Product” table to store the images in binary format
    4. Store the images in Amazon S3 and add an S3 URL pointer to the “Product” table item for each image

References

AWS S3 Best Practices

S3 Best Practices

Performance

Multiple Concurrent PUTs/GETs

  • S3 scales to support very high request rates. If the request rate grows steadily, S3 automatically partitions the buckets as needed to support higher request rates.
  • S3 can achieve at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket.
  • If the typical workload involves only occasional bursts of 100 requests per second and less than 800 requests per second, AWS scales and handles it.
  • If the typical workload involves a request rate for a bucket to more than 300 PUT/LIST/DELETE requests per second or more than 800 GET requests per second, it’s recommended to open a support case to prepare for the workload and avoid any temporary limits on your request rate.
  • S3 best practice guidelines can be applied only if you are routinely processing 100 or more requests per second
  • Workloads that include a mix of request types
    • If the request workload is typically a mix of GET, PUT, DELETE, or GET Bucket (list objects), choosing appropriate key names for the objects ensures better performance by providing low-latency access to the S3 index
    • This behavior is driven by how S3 stores key names.
      • S3 maintains an index of object key names in each AWS region.
      • Object keys are stored lexicographically (UTF-8 binary ordering) across multiple partitions in the index i.e. S3 stores key names in alphabetical order.
      • Object keys are stored across multiple partitions in the index, and the key name dictates which partition the key is stored in
      • Using a sequential prefix, such as a timestamp or an alphabetical sequence, increases the likelihood that S3 will target a specific partition for a large number of keys, overwhelming the I/O capacity of the partition.
    • Introducing some randomness in the key name prefixes distributes the key names, and hence the I/O load, across multiple index partitions (see the sketch after this list).
    • It also ensures scalability regardless of the number of requests sent per second.
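A minimal sketch of introducing randomness into key name prefixes, assuming a short hash-derived prefix is acceptable for the application; the function name and prefix length are illustrative.

```python
import hashlib

def randomized_key(original_key: str, prefix_len: int = 4) -> str:
    """Prepend a short hash-derived prefix so keys spread across index partitions."""
    digest = hashlib.md5(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{original_key}"

# e.g. '2018-11-26-log_i-0abc.gz' -> '<4 hex chars>/2018-11-26-log_i-0abc.gz'
print(randomized_key("2018-11-26-log_i-0abc.gz"))
```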

Transfer Acceleration

  • S3 Transfer Acceleration enables fast, easy, and secure transfers of files over long distances between the client and an S3 bucket.
  • Transfer Acceleration takes advantage of CloudFront’s globally distributed edge locations. As the data arrives at an edge location, data is routed to S3 over an optimized network path.
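A minimal boto3 sketch of enabling Transfer Acceleration on a bucket and uploading through the accelerate endpoint; the bucket and file names are placeholders.

```python
import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# Enable Transfer Acceleration on the bucket (one-time configuration)
s3.put_bucket_accelerate_configuration(
    Bucket="my-example-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Create a client that routes requests through the accelerate endpoint
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("large-video.mp4", "my-example-bucket", "uploads/large-video.mp4")
```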

GET-intensive Workloads

  • CloudFront can be used for performance optimization and can help by
    • distributing content with low latency and high data transfer rate.
    • caching the content and thereby reducing the number of direct requests to S3
    • providing multiple endpoints (Edge locations) for data availability
    • CloudFront is available in two flavors – Web distribution or RTMP distribution
  • For fast data transport over long distances between a client and an S3 bucket, use S3 Transfer Acceleration. Transfer Acceleration uses the globally distributed edge locations in CloudFront to accelerate data transport over geographical distances

PUTs/GETs for Large Objects

  • AWS allows parallelizing PUT/GET requests to improve upload and download performance, as well as the ability to recover in case of failures
  • For PUTs, Multipart upload can help improve the uploads by
    • performing multiple uploads at the same time and maximizing network bandwidth utilization
    • quick recovery from failures, as only the part that failed to upload, needs to be re-uploaded
    • ability to pause and resume uploads
    • begin an upload before the Object size is known
  • For GETs, the range HTTP header can help to improve the downloads by
    • allowing the object to be retrieved in parts instead of the whole object
    • quick recovery from failures, as only the part that failed to download needs to be retried.
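A minimal boto3 sketch of a partial (ranged) GET as described above; the bucket, key, and byte range are illustrative.

```python
import boto3

s3 = boto3.client("s3")

# Download the first 5 MB of an object using the Range header,
# e.g. to parallelize a large download or resume a failed one.
part = s3.get_object(
    Bucket="my-example-bucket",
    Key="videos/large-video.mp4",
    Range="bytes=0-5242879",
)
data = part["Body"].read()
print(len(data))
```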

List Operations

  • Object key names are stored lexicographically in S3 indexes, making it hard to sort and manipulate the contents of LIST
  • S3 maintains a single lexicographically sorted list of indexes
  • Build and maintain a secondary index outside of S3, e.g. in DynamoDB or RDS, to store, index, and query object metadata rather than performing list operations on S3

Security

  • Use Versioning
    • can be used to protect from unintended overwrites and deletions
    • allows the ability to retrieve and restore deleted objects or rollback to previous versions
  • Enable additional security by configuring a bucket to require MFA (Multi-Factor Authentication) for deletes
  • Versioning does not prevent bucket deletion; the data must be backed up, as it is lost if the bucket is accidentally or maliciously deleted
  • Use Same Region Replication or Cross Region replication feature to backup data to a different region
  • When using VPC with S3, use VPC S3 endpoints as
    • are horizontally scaled, redundant, and highly available VPC components
    • help establish a private connection between VPC and S3 and the traffic never leaves the Amazon network

Refer blog post @ S3 Security Best Practices

Cost

  • Optimize S3 storage cost by selecting an appropriate storage class for objects
  • Configure appropriate lifecycle management rules to move objects to different storage classes and expire them

Tracking

  • Use Event Notifications to be notified of any PUT or DELETE requests on the S3 objects
  • Use CloudTrail, which helps capture specific API calls made to S3 from the AWS account and delivers the log files to an S3 bucket
  • Use CloudWatch to monitor the Amazon S3 buckets, tracking metrics such as object counts and bytes stored, and configure appropriate actions

S3 Monitoring and Auditing Best Practices

Refer blog post @ S3 Monitoring and Auditing Best Practices

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. A media company produces new video files on-premises every day with a total size of around 100GB after compression. All files have a size of 1-2 GB and need to be uploaded to Amazon S3 every night in a fixed time window between 3am and 5am. Current upload takes almost 3 hours, although less than half of the available bandwidth is used. What step(s) would ensure that the file uploads are able to complete in the allotted time window?
    1. Increase your network bandwidth to provide faster throughput to S3
    2. Upload the files in parallel to S3 using multipart upload
    3. Pack all files into a single archive, upload it to S3, then extract the files in AWS
    4. Use AWS Import/Export to transfer the video files
  2. You are designing a web application that stores static assets in an Amazon Simple Storage Service (S3) bucket. You expect this bucket to immediately receive over 150 PUT requests per second. What should you do to ensure optimal performance?
    1. Use multi-part upload.
    2. Add a random prefix to the key names.
    3. Amazon S3 will automatically manage performance at this scale.
    4. Use a predictable naming scheme, such as sequential numbers or date time sequences, in the key names
  3. You have an application running on an Amazon Elastic Compute Cloud instance, that uploads 5 GB video objects to Amazon Simple Storage Service (S3). Video uploads are taking longer than expected, resulting in poor application performance. Which method will help improve performance of your application?
    1. Enable enhanced networking
    2. Use Amazon S3 multipart upload
    3. Leveraging Amazon CloudFront, use the HTTP POST method to reduce latency.
    4. Use Amazon Elastic Block Store Provisioned IOPs and use an Amazon EBS-optimized instance
  4. Which of the following methods gives you protection against accidental loss of data stored in Amazon S3? (Choose 2)
    1. Set bucket policies to restrict deletes, and also enable versioning
    2. By default, versioning is enabled on a new bucket so you don’t have to worry about it (Not enabled by default)
    3. Build a secondary index of your keys to protect the data (improves performance only)
    4. Back up your bucket to a bucket owned by another AWS account for redundancy
  5. A startup company hired you to help them build a mobile application that will ultimately store billions of image and videos in Amazon S3. The company is lean on funding, and wants to minimize operational costs, however, they have an aggressive marketing plan, and expect to double their current installation base every six months. Due to the nature of their business, they are expecting sudden and large increases to traffic to and from S3, and need to ensure that it can handle the performance needs of their application. What other information must you gather from this customer in order to determine whether S3 is the right option?
    1. You must know how many customers that company has today, because this is critical in understanding what their customer base will be in two years. (No. of customers do not matter)
    2. You must find out total number of requests per second at peak usage.
    3. You must know the size of the individual objects being written to S3 in order to properly design the key namespace. (Size does not relate to the key namespace design but the count does)
    4. In order to build the key namespace correctly, you must understand the total amount of storage needs for each S3 bucket. (S3 provides unlimited storage; the key namespace design would depend on the number of objects, not the total storage)
  6. A document storage company is deploying their application to AWS and changing their business model to support both free tier and premium tier users. The premium tier users will be allowed to store up to 200GB of data and free tier customers will be allowed to store only 5GB. The customer expects that billions of files will be stored. All users need to be alerted when approaching 75 percent quota utilization and again at 90 percent quota use. To support the free tier and premium tier users, how should they architect their application?
    1. The company should utilize an amazon simple workflow service activity worker that updates the users data counter in amazon dynamo DB. The activity worker will use simple email service to send an email if the counter increases above the appropriate thresholds.
    2. The company should deploy an amazon relational data base service relational database with a store objects table that has a row for each stored object along with size of each object. The upload server will query the aggregate consumption of the user in questions (by first determining the files store by the user, and then querying the stored objects table for respective file sizes) and send an email via Amazon Simple Email Service if the thresholds are breached. (Good Approach to use RDS but with so many objects might not be a good option)
    3. The company should write both the content length and the username of the files owner as S3 metadata for the object. They should then create a file watcher to iterate over each object and aggregate the size for each user and send a notification via Amazon Simple Queue Service to an emailing service if the storage threshold is exceeded. (List operations on S3 not feasible)
    4. The company should create two separated amazon simple storage service buckets one for data storage for free tier users and another for data storage for premium tier users. An amazon simple workflow service activity worker will query all objects for a given user based on the bucket the data is stored in and aggregate storage. The activity worker will notify the user via Amazon Simple Notification Service when necessary (List operations on S3 not feasible as well as SNS does not address email requirement)
  7. Your company host a social media website for storing and sharing documents. the web application allow users to upload large files while resuming and pausing the upload as needed. Currently, files are uploaded to your php front end backed by Elastic Load Balancing and an autoscaling fleet of amazon elastic compute cloud (EC2) instances that scale upon average of bytes received (NetworkIn) After a file has been uploaded. it is copied to amazon simple storage service(S3). Amazon Ec2 instances use an AWS Identity and Access Management (AMI) role that allows Amazon s3 uploads. Over the last six months, your user base and scale have increased significantly, forcing you to increase the auto scaling groups Max parameter a few times. Your CFO is concerned about the rising costs and has asked you to adjust the architecture where needed to better optimize costs. Which architecture change could you introduce to reduce cost and still keep your web application secure and scalable?
    1. Replace the Autoscaling launch Configuration to include c3.8xlarge instances; those instances can potentially yield a network throughput of 10gbps. (no info of current size and might increase cost)
    2. Re-architect your ingest pattern, have the app authenticate against your identity provider as a broker fetching temporary AWS credentials from AWS Secure token service (GetFederation Token). Securely pass the credentials and s3 endpoint/prefix to your app. Implement client-side logic to directly upload the file to amazon s3 using the given credentials and S3 Prefix. (will not provide the ability to handle pause and restarts)
    3. Re-architect your ingest pattern, and move your web application instances into a VPC public subnet. Attach a public IP address for each EC2 instance (using the auto scaling launch configuration settings). Use Amazon Route 53 round robin records set and http health check to DNS load balance the app request this approach will significantly reduce the cost by bypassing elastic load balancing. (ELB is not the bottleneck)
    4. Re-architect your ingest pattern, have the app authenticate against your identity provider as a broker fetching temporary AWS credentials from AWS Secure token service (GetFederation Token). Securely pass the credentials and s3 endpoint/prefix to your app. Implement client-side logic that used the S3 multipart upload API to directly upload the file to Amazon s3 using the given credentials and s3 Prefix. (multipart allows one to start uploading directly to S3 before the actual size is known or complete data is downloaded)
  8. If an application is storing hourly log files from thousands of instances from a high traffic web site, which naming scheme would give optimal performance on S3?
    1. Sequential
    2. instanceID_log-HH-DD-MM-YYYY
    3. instanceID_log-YYYY-MM-DD-HH
    4. HH-DD-MM-YYYY-log_instanceID (HH will give some randomness to start with, instead of instanceID where the first characters would be i-)
    5. YYYY-MM-DD-HH-log_instanceID

Reference

S3_Optimizing_Performance

AWS S3 Data Consistency Model

AWS S3 Data Consistency Model

  • S3 Data Consistency provides strong read-after-write consistency for PUT and DELETE requests of objects in the S3 bucket in all AWS Regions
  • This behavior applies to both writes to new objects as well as PUT requests that overwrite existing objects and DELETE requests.
  • Read operations on S3 Select, S3 ACLs, S3 Object Tags, and object metadata (for example, the HEAD object) are strongly consistent.
  • Updates to a single key are atomic. For e.g., if a PUT is made to an existing key while a concurrent read is in progress, the read might return the old data or the updated data, but never corrupted or partial data.
  • S3 achieves high availability by replicating data across multiple servers within Amazon’s data centers. If a PUT request is successful, the data is safely stored. Any read (GET or LIST request) that is initiated following the receipt of a successful PUT response will return the data written by the PUT request.
  • S3 Data Consistency behavior examples
    • A process writes a new object to S3 and immediately lists keys within its bucket. The new object appears in the list.
    • A process replaces an existing object and immediately tries to read it. S3 returns the new data.
    • A process deletes an existing object and immediately tries to read it. S3 does not return any data because the object has been deleted.
    • A process deletes an existing object and immediately lists keys within its bucket. The object does not appear in the listing.
  • S3 does not currently support object locking for concurrent writes. for e.g. If two PUT requests are simultaneously made to the same key, the request with the latest timestamp wins. If this is an issue, you will need to build an object-locking mechanism into your application.
  • Updates are key-based; there is no way to make atomic updates across keys. for e.g, an update of one key cannot be dependent on the update of another key unless you design this functionality into the application.
  • S3 Object Lock is different as it allows to store objects using a write-once-read-many (WORM) model, which prevents an object from being deleted or overwritten for a fixed amount of time or indefinitely.
  • Previously (before S3 moved to strong read-after-write consistency in December 2020), S3 provided strong Read-after-Write consistency only for PUTs of new objects
    • For a PUT request, S3 synchronously stored data across multiple facilities before returning SUCCESS
    • A process writing a new object to S3 was immediately able to read the object i.e. PUT 200 -> GET 200
    • A process writing a new object to S3 and immediately listing keys within its bucket might not have seen the object in the list until the change was fully propagated
    • However, if a HEAD or GET request was made to a key name before the object was created, and the object was created shortly after, a subsequent GET might not have returned the object due to eventual consistency i.e. GET 404 -> PUT 200 -> GET 404
  • Previously, S3 provided Eventual Consistency for overwrite PUTs and DELETEs in all regions
    • For updates and deletes to objects, the changes were eventually reflected and not available immediately i.e. PUT 200 -> PUT 200 -> GET 200 (might be older version) OR DELETE 200 -> GET 200
    • If a process replaced an existing object and immediately attempted to read it, S3 might have returned the prior data until the change was fully propagated
    • If a process deleted an existing object and immediately attempted to read it, S3 might have returned the deleted data until the deletion was fully propagated
    • If a process deleted an existing object and immediately listed keys within its bucket, S3 might have listed the deleted object until the deletion was fully propagated

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Which of the following are valid statements about Amazon S3? Choose 2 answers
    1. S3 provides read-after-write consistency for any type of PUT or DELETE. (S3 now provides strong read-after-write consistency)
    2. Consistency is not guaranteed for any type of PUT or DELETE.
    3. A successful response to a PUT request only occurs when a complete object is saved
    4. Partially saved objects are immediately readable with a GET after an overwrite PUT.
    5. S3 provides eventual consistency for overwrite PUTS and DELETES
  2. A customer is leveraging Amazon Simple Storage Service in eu-west-1 to store static content for web-based property. The customer is storing objects using the Standard Storage class. Where are the customers’ objects replicated?
    1. Single facility in eu-west-1 and a single facility in eu-central-1
    2. Single facility in eu-west-1 and a single facility in us-east-1
    3. Multiple facilities in eu-west-1
    4. A single facility in eu-west-1
  3. A user has an S3 object in the US Standard region with the content “color=red”. The user updates the object with the content as “color=”white”. If the user tries to read the value 1 minute after it was uploaded, what will S3 return?
    1. It will return “color=white” (strong read-after-write consistency)
    2. It will return “color=red”
    3. It will return an error saying that the object was not found
    4. It may return either “color=red” or “color=white” i.e. any of the value (Eventual Consistency)

References

AWS_S3_Data_Consistency

AWS Simple Storage Service – S3

AWS Simple Storage Service – S3

  • Amazon S3 is a simple key-value object store designed for the Internet
  • S3 provides unlimited storage space and works on the pay-as-you-use model. Service rates get cheaper as the usage volume increases
  • S3 offers an extremely durable, highly available, and infinitely scalable data storage infrastructure at very low costs.
  • S3 is Object-level storage (not Block level storage) and cannot be used to host OS or dynamic websites
  • S3 resources for e.g. buckets and objects are private by default

S3 Buckets & Objects

S3 Buckets

  • A bucket is a container for objects stored in S3
  • Buckets help organize the S3 namespace.
  • A bucket is owned by the AWS account that creates it and helps identify the account responsible for storage and data transfer charges.
  • S3 bucket names are globally unique, regardless of the AWS region in which it was created and the namespace is shared by all AWS accounts
  • Even though S3 is a global service, buckets are created within a region specified during the creation of the bucket (see the creation sketch after this list).
  • Every object is contained in a bucket
  • There is no limit to the number of objects that can be stored in a bucket and no difference in performance whether a single bucket or multiple buckets are used to store all the objects
  • The S3 data model is a flat structure i.e. there are no hierarchies or folders within the buckets. However, logical hierarchy can be inferred using the key name prefix e.g. Folder1/Object1
  • Restrictions
    • 100 buckets (soft limit) and a maximum of 1000 buckets can be created in each AWS account
    • Bucket names should be globally unique and DNS compliant
    • Bucket ownership is not transferable
    • Buckets cannot be nested and cannot have bucket within another bucket
    • Bucket name and region cannot be changed, once created
  • Empty or a non-empty bucket can be deleted
  • A single LIST request returns up to 1000 objects, with pagination support to retrieve more
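A minimal boto3 sketch of creating a bucket in a specific region, per the points above; the bucket name and region are placeholders.

```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

# Bucket names are globally unique and the region is fixed at creation time.
# (For us-east-1, the CreateBucketConfiguration parameter is omitted.)
s3.create_bucket(
    Bucket="my-globally-unique-bucket-name",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)
```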

Objects

  • Objects are the fundamental entities stored in an S3 bucket
  • An object is uniquely identified within a bucket by a key name and version ID (if S3 versioning is enabled on the bucket)
  • Objects consist of object data, metadata, and others
    • Key is the object name and a unique identifier for an object
    • Value is actual content stored
    • Metadata is the data about the data and is a set of name-value pairs that describe the object for e.g. content-type, size, last modified. Custom metadata can also be specified at the time the object is stored.
    • Version ID is the version id for the object and in combination with the key helps to uniquely identify an object within a bucket
    • Subresources help provide additional information for an object
    • Access Control Information helps control access to the objects
  • S3 objects allow two kinds of metadata
    • System metadata
      • Metadata such as the Last-Modified date is controlled by the system. Only S3 can modify the value.
      • System metadata that the user can control, for e.g., the storage class and encryption configured for the object.
    • User-defined metadata
      • User-defined metadata can be assigned during uploading the object or after the object has been uploaded.
      • User-defined metadata is stored with the object and is returned when an object is downloaded
      • S3 does not process user-defined metadata.
      • User-defined metadata must begin with the prefix “x-amz-meta“, otherwise S3 will not set the key-value pair as you define it
  • Object metadata cannot be modified in place after the object is uploaded; it can only be changed by performing a copy operation and setting the new metadata
  • Objects belonging to a bucket that reside in a specific AWS region never leave that region, unless explicitly copied using Cross Region replication
  • Each object can be up to 5 TB in size
  • An object can be retrieved as a whole or partially
  • With Versioning enabled, current as well as previous versions of an object can be retrieved

S3 Bucket & Object Operations

  • Listing
    • S3 allows listing of all the keys within a bucket
    • A single listing request would return a max of 1000 object keys with pagination support using an indicator in the response to indicate if the response was truncated
    • Keys within a bucket can be listed using Prefix and Delimiter.
    • Prefix limits results to only those keys (kind of filtering) that begin with the specified prefix, and delimiter causes the list to roll up all keys that share a common prefix into a single summary list result (see the listing sketch after this section).
  • Retrieval
    • An object can be retrieved as a whole
    • An object can be retrieved in parts or partially (specific range of bytes) by using the Range HTTP header.
    • Range HTTP header is helpful
      • if only a partial object is needed for e.g. multiple files were uploaded as a single archive
      • for fault-tolerant downloads where the network connectivity is poor
    • Objects can also be downloaded by sharing Pre-Signed URLs
    • Metadata of the object is returned in the response headers
  • Object Uploads
    • Single Operation – Objects up to 5GB in size can be uploaded in a single PUT operation
    • Multipart upload – must be used for objects of size > 5GB and supports a max size of 5TB. It is recommended for objects above 100MB in size.
    • Pre-Signed URLs can also be shared for uploading objects
    • A successful upload can be verified by checking that the request received a successful response. Additionally, the returned ETag can be compared to the MD5 value calculated for the uploaded object
  • Copying Objects
    • Copying of objects up to 5GB can be performed using a single operation, and multipart upload can be used for copies up to 5TB
    • When an object is copied
      • user-controlled system metadata e.g. storage class and user-defined metadata are also copied.
      • system controlled metadata e.g. the creation date etc is reset
    • Copying Objects can be needed to
      • Create multiple object copies
      • Copy object across locations or regions
      • Rename objects
      • Change object metadata for e.g. storage class, encryption, etc
      • Updating any metadata for an object requires all the metadata fields to be specified again
  • Deleting Objects
    • S3 allows deletion of a single object or multiple objects (max 1000) in a single call
    • For Non Versioned buckets,
      • the object key needs to be provided and the object is permanently deleted
    • For Versioned buckets,
      • if an object key is provided, S3 inserts a delete marker and the previous current object becomes the non-current object
      • if an object key with a version ID is provided, the object is permanently deleted
      • if the version ID is of the delete marker, the delete marker is removed and the previous non-current version becomes the current version object
    • Deletion can be MFA enabled for adding extra security
  • Restoring Objects from Glacier
    • Objects must be restored before accessing an archived object
    • Restoration of an Object can take about 3 to 5 hours for standard retrievals. Glacier now offers expedited retrievals within minutes
    • Restoration request also needs to specify the number of days for which the object copy needs to be maintained.
    • During this period, storage cost applies for both the archive and the copy.
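A minimal boto3 listing sketch, referenced from the Listing item above, showing Prefix and Delimiter together with a paginator for the 1000-keys-per-response limit; the bucket and prefix names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# List keys under the "Folder1/" prefix, rolling up deeper "sub-folders"
# via the delimiter; the paginator handles truncated (1000-key) responses.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-example-bucket",
                               Prefix="Folder1/", Delimiter="/"):
    for obj in page.get("Contents", []):
        print("key:", obj["Key"])
    for cp in page.get("CommonPrefixes", []):
        print("common prefix:", cp["Prefix"])
```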

Pre-Signed URLs

  • All buckets and objects are by default private
  • Pre-signed URLs allow a user to download or upload a specific object without requiring AWS security credentials or permissions
  • A pre-signed URL allows anyone access to the object identified in the URL, provided the creator of the URL has permission to access that object
  • Creation of a pre-signed URL requires the creator to provide security credentials, a bucket name, an object key, an HTTP method (GET for downloading objects & PUT for uploading objects), and an expiration date and time
  • Pre-signed URLs are valid only until the expiration date & time
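A minimal boto3 sketch of generating pre-signed URLs for download and upload; the bucket, keys, and expiry values are illustrative.

```python
import boto3

s3 = boto3.client("s3")

# Pre-signed GET URL, valid for 1 hour, signed with the caller's credentials
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-example-bucket", "Key": "private/report.pdf"},
    ExpiresIn=3600,
)
print(url)

# Pre-signed PUT URL for uploading an object without sharing AWS credentials
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-example-bucket", "Key": "uploads/video.mp4"},
    ExpiresIn=900,
)
print(upload_url)
```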

Multipart Upload

  • Multipart upload allows the user to upload a single large object as a set of parts. Each part is a contiguous portion of the object’s data.
  • Multipart uploads support 1 to 10000 parts and each part can be from 5MB to 5GB, with the last part allowed to be less than 5MB
  • Multipart uploads allow max upload size of 5TB
  • Object parts can be uploaded independently and in any order. If transmission of any part fails, it can be retransmitted without affecting other parts.
  • After all parts of the object are uploaded and the complete request is initiated, S3 assembles these parts and creates the object.
  • Using multipart upload provides the following advantages:
    • Improved throughput – parallel upload of parts to improve throughput
    • Quick recovery from any network issues – Smaller part size minimizes the impact of restarting a failed upload due to a network error.
    • Pause and resume object uploads – Object parts can be uploaded over time. Once a multipart upload is initiated there is no expiry; you must explicitly complete or abort the multipart upload.
    • Begin an upload before the final object size is known – an object can be uploaded as it is being created
  • Three Step process
    • Multipart Upload Initiation
      • Initiation of a Multipart upload request to S3 returns a unique ID for each multipart upload.
      • This ID needs to be provided for each part upload, the completion or abort request, and the list parts call.
      • All the Object metadata required needs to be provided during the Initiation call
    • Parts Upload
      • Parts upload of objects can be performed using the unique upload ID
      • A part number (between 1 – 10000) needs to be specified with each request which identifies each part and its position in the object
      • If a part with the same part number is uploaded, the previous part would be overwritten
      • After the part upload is successful, S3 returns an ETag header in the response which must be recorded along with the part number to be provided during the multipart completion request
    • Multipart Upload Completion or Abort
      • On Multipart Upload Completion request, S3 creates an object by concatenating the parts in ascending order based on the part number and associates the metadata with the object
      • Multipart Upload Completion request should include the unique upload ID with all the parts and the ETag information
      • S3 response includes an ETag that uniquely identifies the combined object data
      • On a Multipart Upload Abort request, the upload is aborted and all parts are removed. Any new part upload fails; however, any in-progress part upload may still complete, and hence an abort request might need to be sent again after all part uploads have completed.
      • S3 must receive a multipart upload completion or abort request; otherwise, it will not delete the parts and storage will continue to be charged
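A minimal boto3 sketch of the three-step process above (initiate, upload parts, complete, with abort on failure); the bucket, key, file name, and part size are assumptions for illustration.

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-example-bucket", "uploads/big-file.bin"
part_size = 5 * 1024 * 1024  # parts must be >= 5 MB (except the last one)

# Step 1 - initiation: returns the upload ID used by all subsequent calls
upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

parts = []
try:
    # Step 2 - upload parts; record each part number and the returned ETag
    with open("big-file.bin", "rb") as f:
        part_number = 1
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                                  PartNumber=part_number, Body=chunk)
            parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
            part_number += 1

    # Step 3 - completion: S3 assembles the parts in part-number order
    s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                                 MultipartUpload={"Parts": parts})
except Exception:
    # Abort so the uploaded parts are removed and no longer charged
    s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
    raise
```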

S3 Access Points

  • S3 access points simplify data access for any AWS service or customer application that stores data in S3.
  • Access points are named network endpoints that are attached to buckets and can be used to perform S3 object operations, such as GetObject and PutObject.
  • Each access point has distinct permissions and network controls that S3 applies for any request that is made through that access point.
  • Each access point enforces a customized access point policy that works in conjunction with the bucket policy that is attached to the underlying bucket.
  • You can configure any access point to accept requests only from a VPC to restrict S3 data access to a private network.
  • You can also configure custom block public access settings for each access point.

S3 Transfer Acceleration

  • S3 Transfer Acceleration enables fast, easy, and secure transfers of files over long distances between the client and an S3 bucket.
  • Transfer Acceleration takes advantage of CloudFront’s globally distributed edge locations. As the data arrives at an edge location, data is routed to S3 over an optimized network path.

S3 Batch Operations

  • S3 Batch Operations help perform large-scale batch operations on S3 objects and can perform a single operation on lists of specified S3 objects.
  • A single job can perform a specified operation on billions of objects containing exabytes of data.
  • S3 tracks progress, sends notifications, and stores a detailed completion report of all actions, providing a fully managed, auditable, and serverless experience.
  • S3 Batch Operations can be used with S3 Inventory to get the object list and use S3 Select to filter the objects.
  • S3 Batch Operations can be used for copying objects, modify object metadata, applying ACLs, encrypting objects, transforming objects, invoke a custom lambda function, etc.

Virtual Hosted Style vs Path-Style Request

S3 allows the buckets and objects to be referred to in Path-style or Virtual hosted-style URLs

Path-style

  • Bucket name is not part of the domain (unless region specific endpoint used)
  • Endpoint used must match the region in which the bucket resides for e.g, if you have a bucket called mybucket that resides in the EU (Ireland) region with object named puppy.jpg, the correct path-style syntax URI is http://s3-eu-west-1.amazonaws.com/mybucket/puppy.jpg.
  • A “PermanentRedirect” error is received with an HTTP response code 301, and a message indicating what the correct URI is for the resource if a bucket is accessed outside the US East (N. Virginia) region with path-style syntax that uses either of the following:
    • http://s3.amazonaws.com
    • An endpoint for a region different from the one where the bucket resides for e.g., if you use http://s3-eu-west-1.amazonaws.com for a bucket that was created in the US West (N. California) region
  • Path-style requests would not be supported after September 30, 2020

Virtual hosted-style

  • S3 supports virtual hosted-style and path-style access in all regions.
  • In a virtual-hosted-style URL, the bucket name is part of the domain name in the URL for e.g. http://bucketname.s3.amazonaws.com/objectname
  • S3 virtual hosting can be used to address a bucket in a REST API call by using the HTTP Host header
  • Benefits
    • attractiveness of customized URLs,
    • provides an ability to publish to the “root directory” of the bucket’s virtual server. This ability can be important because many existing applications search for files in this standard location.
  • S3 updates DNS to reroute the request to the correct location when a bucket is created in any region, which might take time.
  • S3 routes any virtual hosted-style requests to the US East (N.Virginia) region, by default, if the US East (N. Virginia) endpoint s3.amazonaws.com is used, instead of the region-specific endpoint (for e.g., s3-eu-west-1.amazonaws.com) and S3 redirects it with HTTP 307 redirect to the correct region.
  • When using virtual hosted-style buckets with SSL, the SSL wild card certificate only matches buckets that do not contain periods. To work around this, use HTTP or write your own certificate verification logic.
  • If you make a request to the http://bucket.s3.amazonaws.com endpoint, the DNS has sufficient information to route the request directly to the region where your bucket resides.

S3 Pricing

  • S3 costs vary by Region
  • Charges in S3 are incurred for
    • Storage – cost is per GB/month
    • Requests – per request cost varies depending on the request type GET, PUT
    • Data Transfer
      • data transfer in is free
      • data transfer out is charged per GB/month (except in the same region or to Amazon CloudFront)

Additional Topics

Labs

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. What does Amazon S3 stand for?
    1. Simple Storage Solution.
    2. Storage Storage Storage (triple redundancy Storage).
    3. Storage Server Solution.
    4. Simple Storage Service
  2. What are characteristics of Amazon S3? Choose 2 answers
    1. Objects are directly accessible via a URL
    2. S3 should be used to host a relational database
    3. S3 allows you to store objects of virtually unlimited size
    4. S3 allows you to store virtually unlimited amounts of data
    5. S3 offers Provisioned IOPS
  3. You are building an automated transcription service in which Amazon EC2 worker instances process an uploaded audio file and generate a text file. You must store both of these files in the same durable storage until the text file is retrieved. You do not know what the storage capacity requirements are. Which storage option is both cost-efficient and scalable?
    1. Multiple Amazon EBS volume with snapshots
    2. A single Amazon Glacier vault
    3. A single Amazon S3 bucket
    4. Multiple instance stores
  4. A user wants to upload a complete folder to AWS S3 using the S3 Management console. How can the user perform this activity?
    1. Just drag and drop the folder using the flash tool provided by S3
    2. Use the Enable Enhanced Folder option from the S3 console while uploading objects
    3. The user cannot upload the whole folder in one go with the S3 management console
    4. Use the Enable Enhanced Uploader option from the S3 console while uploading objects (NOTE – It’s no longer supported by AWS)
  5. A media company produces new video files on-premises every day with a total size of around 100GB after compression. All files have a size of 1-2 GB and need to be uploaded to Amazon S3 every night in a fixed time window between 3am and 5am. Current upload takes almost 3 hours, although less than half of the available bandwidth is used. What step(s) would ensure that the file uploads are able to complete in the allotted time window?
    1. Increase your network bandwidth to provide faster throughput to S3
    2. Upload the files in parallel to S3 using multipart upload
    3. Pack all files into a single archive, upload it to S3, then extract the files in AWS
    4. Use AWS Import/Export to transfer the video files
  6. A company is deploying a two-tier, highly available web application to AWS. Which service provides durable storage for static content while utilizing lower Overall CPU resources for the web tier?
    1. Amazon EBS volume
    2. Amazon S3
    3. Amazon EC2 instance store
    4. Amazon RDS instance
  7. You have an application running on an Amazon Elastic Compute Cloud instance, that uploads 5 GB video objects to Amazon Simple Storage Service (S3). Video uploads are taking longer than expected, resulting in poor application performance. Which method will help improve performance of your application?
    1. Enable enhanced networking
    2. Use Amazon S3 multipart upload
    3. Leveraging Amazon CloudFront, use the HTTP POST method to reduce latency.
    4. Use Amazon Elastic Block Store Provisioned IOPs and use an Amazon EBS-optimized instance
  8. When you put objects in Amazon S3, what is the indication that an object was successfully stored?
    1. Each S3 account has a special bucket named_s3_logs. Success codes are written to this bucket with a timestamp and checksum.
    2. A success code is inserted into the S3 object metadata.
    3. A HTTP 200 result code and MD5 checksum, taken together, indicate that the operation was successful.
    4. Amazon S3 is engineered for 99.999999999% durability. Therefore there is no need to confirm that data was inserted.
  9. You have private video content in S3 that you want to serve to subscribed users on the Internet. User IDs, credentials, and subscriptions are stored in an Amazon RDS database. Which configuration will allow you to securely serve private content to your users?
    1. Generate pre-signed URLs for each user as they request access to protected S3 content
    2. Create an IAM user for each subscribed user and assign the GetObject permission to each IAM user
    3. Create an S3 bucket policy that limits access to your private content to only your subscribed users’ credentials
    4. Create a CloudFront Origin Identity user for your subscribed users and assign the GetObject permission to this user
  10. You run an ad-supported photo sharing website using S3 to serve photos to visitors of your site. At some point you find out that other sites have been linking to the photos on your site, causing loss to your business. What is an effective method to mitigate this?
    1. Remove public read access and use signed URLs with expiry dates.
    2. Use CloudFront distributions for static content.
    3. Block the IPs of the offending websites in Security Groups.
    4. Store photos on an EBS volume of the web server.
  11. You are designing a web application that stores static assets in an Amazon Simple Storage Service (S3) bucket. You expect this bucket to immediately receive over 150 PUT requests per second. What should you do to ensure optimal performance?
    1. Use multi-part upload.
    2. Add a random prefix to the key names.
    3. Amazon S3 will automatically manage performance at this scale. (With latest S3 performance improvements, S3 scaled automatically)
    4. Use a predictable naming scheme, such as sequential numbers or date time sequences, in the key names
  12. What is the maximum number of S3 buckets available per AWS Account?
    1. 100 Per region
    2. There is no Limit
    3. 100 Per Account (Refer documentation)
    4. 500 Per Account
    5. 100 Per IAM User
  13. Your customer needs to create an application to allow contractors to upload videos to Amazon Simple Storage Service (S3) so they can be transcoded into a different format. She creates AWS Identity and Access Management (IAM) users for her application developers, and in just one week, they have the application hosted on a fleet of Amazon Elastic Compute Cloud (EC2) instances. The attached IAM role is assigned to the instances. As expected, a contractor who authenticates to the application is given a pre-signed URL that points to the location for video upload. However, contractors are reporting that they cannot upload their videos. Which of the following are valid reasons for this behavior? Choose 2 answers { “Version”: “2012-10-17”, “Statement”: [ { “Effect”: “Allow”, “Action”: “s3:*”, “Resource”: “*” } ] }
    1. The IAM role does not explicitly grant permission to upload the object. (The role has all permissions for all activities on S3)
    2. The contractorsˈ accounts have not been granted “write” access to the S3 bucket. (using pre-signed urls the contractors account don’t need to have access but only the creator of the pre-signed urls)
    3. The application is not using valid security credentials to generate the pre-signed URL.
    4. The developers do not have access to upload objects to the S3 bucket. (developers are not uploading the objects but its using pre-signed urls)
    5. The S3 bucket still has the associated default permissions. (does not matter as long as the user has permission to upload)
    6. The pre-signed URL has expired.

AWS S3 Subresources

AWS S3 Subresources

  • S3 Subresources provide support to store and manage the bucket configuration information
  • S3 subresources only exist in the context of a specific bucket or object
  • S3 defines a set of subresources associated with buckets and objects.
  • S3 Subresources are subordinates to objects; i.e. they do not exist on their own, they are always associated with some other entity, such as an object or a bucket
  • S3 supports various options to configure a bucket for e.g., the bucket can be configured for website hosting, configuration added to manage the lifecycle of objects in the bucket, and to log all access to the bucket.

Object Lifecycle

Refer blog post @ S3 Object Lifecycle Management

Static Website Hosting

  • S3 can be used for Static Website hosting with Client-side scripts.
  • S3 does not support server-side scripting
  • S3, in conjunction with Route 53, supports hosting a website at the root domain which can point to the S3 website endpoint
  • S3 website endpoints do not support HTTPS or access points
  • For S3 website hosting the content should be made publicly readable which can be provided using a bucket policy or an ACL on an object
  • Users can configure the index, error document as well as configure the conditional routing of an object name
  • Bucket policy applies only to objects owned by the bucket owner. If the bucket contains objects not owned by the bucket owner, then public READ permission on those objects should be granted using the object ACL.
  • Requester Pays buckets or DevPay buckets do not allow access through the website endpoint. Any request to such a bucket will receive a 403 Access Denied response
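A minimal boto3 sketch of configuring static website hosting with index and error documents, as described above; the bucket and document names are placeholders, and the objects must separately be made publicly readable via a bucket policy or ACL.

```python
import boto3

s3 = boto3.client("s3")

# Configure the bucket as a static website with index and error documents
s3.put_bucket_website(
    Bucket="my-example-bucket",
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "error.html"},
    },
)
```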

S3 Versioning

Refer blog post @ S3 Object Versioning

Policy & Access Control List (ACL)

Refer blog post @ S3 Permissions

CORS (Cross Origin Resource Sharing)

  • All browsers implement the Same-Origin policy, for security reasons, where the web page from a domain can only request resources from the same domain.
  • CORS allows client web applications loaded in one domain to request restricted resources from another domain
  • With CORS support, S3 allows cross-origin access to S3 resources
  • CORS configuration rules identify the origins allowed to access the bucket, the operations (HTTP methods) that would be supported for each origin, and other operation-specific information
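A minimal boto3 sketch of a CORS configuration rule of the kind described above; the bucket name, allowed origin, and methods are illustrative assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Allow GET/PUT requests from a single web origin to objects in this bucket
s3.put_bucket_cors(
    Bucket="my-example-bucket",
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedOrigins": ["https://www.example.com"],
                "AllowedMethods": ["GET", "PUT"],
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3000,
            }
        ]
    },
)
```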

S3 Access Logs

  • S3 Access Logs enables tracking access requests to an S3 bucket.
  • S3 Access logs are disabled by default.
  • Each access log record provides details about a single access request, such as the requester, bucket name, request time, request action, response status, and error code, if any.
  • Access log information can be useful in security and access audits and also help learn about the customer base and understand the S3 bill
  • S3 periodically collects access log records, consolidates the records in log files, and then uploads log files to a target bucket as log objects.
  • Logging can be enabled on multiple source buckets with the same target bucket, which will then contain access logs for all those source buckets; however, each log object reports access log records for a specific source bucket.
  • S3 Access Logs can be analyzed using data analysis tools or Athena
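A minimal boto3 sketch of enabling server access logging on a source bucket; the bucket names and prefix are placeholders, and the target bucket must separately grant S3 log delivery permission to write.

```python
import boto3

s3 = boto3.client("s3")

# Deliver access log objects for the source bucket to a target bucket/prefix
s3.put_bucket_logging(
    Bucket="my-source-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "my-log-bucket",
            "TargetPrefix": "access-logs/my-source-bucket/",
        }
    },
)
```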

Tagging

  • S3 provides the tagging subresource to store and manage tags on a bucket
  • Cost allocation tags can be added to the bucket to categorize and track AWS costs
  • AWS can generate a cost allocation report with usage and costs aggregated by the tags applied to the buckets

Location

  • AWS region needs to be specified during bucket creation and it cannot be changed.
  • S3 stores this information in the location subresource and provides an API for retrieving this information

Event Notifications

  • S3 notification feature enables notifications to be triggered when certain events happen in the bucket.
  • Notifications are enabled at the Bucket level
  • Notifications can be configured to be filtered by the prefix and suffix of the key name of objects. However, filtering rules cannot be defined with overlapping prefixes, overlapping suffixes, or prefix and suffix overlapping
  • S3 can publish the following events
    • New Object created events
      • Can be enabled for PUT, POST, or COPY operations
      • You will not receive event notifications from failed operations
    • Object Removal events
      • Can publish delete events for object deletion, versioned object deletion, or insertion of a delete marker
      • You will not receive event notifications from automatic deletes from lifecycle policies or from failed operations.
    • Restore object events
      • restoration of objects archived to the S3 Glacier storage classes
    • Reduced Redundancy Storage (RRS) object lost events
      • Can be used to reproduce/recreate the Object
    • Replication events
      • for replication configurations that have S3 replication metrics or S3 Replication Time Control (S3 RTC) enabled
  • S3 can publish events to the following destinations – SNS topics, SQS queues, and Lambda functions
  • For S3 to be able to publish events to the destination, the S3 principal should be granted the necessary permissions
  • S3 event notifications are designed to be delivered at least once. Typically, event notifications are delivered in seconds but can sometimes take a minute or longer.
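A minimal boto3 sketch of a notification configuration with prefix/suffix filtering, as described above; the bucket name and SQS queue ARN are placeholders, and the queue policy must separately allow S3 to send messages.

```python
import boto3

s3 = boto3.client("s3")

# Notify an SQS queue whenever a new .jpg object is created under images/
s3.put_bucket_notification_configuration(
    Bucket="my-example-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:111122223333:new-image-queue",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "images/"},
                            {"Name": "suffix", "Value": ".jpg"},
                        ]
                    }
                },
            }
        ]
    },
)
```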

Cross-Region Replication & Same-Region Replication

  • S3 Replication enables automatic, asynchronous copying of objects across S3 buckets in the same or different AWS regions.
  • S3 Cross-Region Replication – CRR is used to copy objects across S3 buckets in different AWS Regions.
  • S3 Same-Region Replication – SRR is used to copy objects across S3 buckets in the same AWS Regions.
  • S3 Replication helps to
    • Replicate objects while retaining metadata
    • Replicate objects into different storage classes
    • Maintain object copies under different ownership
    • Keep objects stored over multiple AWS Regions
    • Replicate objects within 15 minutes
  • S3 can replicate all or a subset of objects with specific key name prefixes
  • S3 encrypts all data in transit across AWS regions using SSL
  • Object replicas in the destination bucket are exact replicas of the objects in the source bucket with the same key names and the same metadata.
  • Objects may be replicated to a single destination bucket or multiple destination buckets.
  • Cross-Region Replication can be useful for the following scenarios:-
    • Compliance requirement to have data backed up across regions
    • Minimize latency to allow users across geography to access objects
    • Operational reasons, e.g. compute clusters in two different regions that analyze the same set of objects
  • Same-Region Replication can be useful for the following scenarios:-
    • Aggregate logs into a single bucket
    • Configure live replication between production and test accounts
    • Abide by data sovereignty laws to store multiple copies
  • Replication Requirements
    • source and destination buckets must be versioning-enabled
    • for CRR, the source and destination buckets must be in different AWS regions
    • S3 must have permission to replicate objects from that source bucket to the destination bucket on your behalf.
    • If the source bucket owner also owns the object, the bucket owner has full permissions to replicate the object. If not, the source bucket owner must have permission for the S3 actions s3:GetObjectVersion and s3:GetObjectVersionACL to read the object and object ACL
    • In a cross-account scenario (where the source and destination buckets are owned by different AWS accounts), the destination bucket owner must grant the source bucket owner permission to replicate objects into the destination bucket.
    • if the source bucket has S3 Object Lock enabled, the destination buckets must also have S3 Object Lock enabled.
    • destination buckets cannot be configured as Requester Pays buckets
  • Replicated & Not Replicated
    • Only new objects created after you add a replication configuration are replicated. S3 does NOT retroactively replicate objects that existed before you added replication configuration.
    • Objects created with SSE-S3 (server-side encryption with S3-managed keys) are replicated by default.
    • Objects created with server-side encryption using AWS KMS–managed encryption (SSE-KMS) keys are NOT replicated, by default. It requires additional handling.
    • S3 replicates only objects in the source bucket for which the bucket owner has permission to read objects and read ACLs
    • Any object ACL updates are replicated, although there can be some delay before Amazon S3 can bring the two in sync. This applies only to objects created after you add a replication configuration to the bucket.
    • S3 does NOT replicate objects in the source bucket for which the bucket owner does not have permissions.
    • Updates to bucket-level S3 subresources are NOT replicated, allowing different bucket configurations on the source and destination buckets
    • Only customer actions are replicated & actions performed by lifecycle configuration are NOT replicated
    • Replication chaining is NOT allowed; objects in the source bucket that are themselves replicas created by another replication configuration are NOT replicated.
    • Objects created with server-side encryption using customer-provided keys (SSE-C) are NOT replicated.
    • S3 does NOT replicate the delete marker by default. However, you can add delete marker replication to non-tag-based rules to override it.
    • S3 does NOT replicate deletion by object version ID. This protects data from malicious deletions.
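  • A minimal boto3 sketch of a replication configuration on a versioning-enabled source bucket; the bucket names and the IAM role ARN are placeholders, and the role must grant S3 the read and replicate permissions described above:

      import boto3

      s3 = boto3.client("s3")

      # Replicate objects under logs/ to a destination bucket (placeholder names/ARNs);
      # both buckets must already have versioning enabled
      s3.put_bucket_replication(
          Bucket="my-source-bucket",
          ReplicationConfiguration={
              "Role": "arn:aws:iam::111122223333:role/my-s3-replication-role",
              "Rules": [
                  {
                      "ID": "replicate-logs",
                      "Priority": 1,
                      "Status": "Enabled",
                      "Filter": {"Prefix": "logs/"},
                      "DeleteMarkerReplication": {"Status": "Disabled"},
                      "Destination": {
                          "Bucket": "arn:aws:s3:::my-destination-bucket",
                          "StorageClass": "STANDARD_IA",
                      },
                  }
              ],
          },
      )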

S3 Inventory

  • S3 Inventory helps manage the storage and can be used to audit and report on the replication and encryption status of the objects for business, compliance, and regulatory needs.
  • S3 inventory provides a scheduled alternative to the S3 synchronous List API operation.
  • S3 inventory provides CSV, ORC, or Apache Parquet output files that list the objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix.
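  • A minimal boto3 sketch of a weekly CSV inventory configuration; the bucket names are placeholders, and the destination bucket policy must allow S3 to write the report:

      import boto3

      s3 = boto3.client("s3")

      # Weekly CSV inventory of current object versions, including encryption and replication status
      s3.put_bucket_inventory_configuration(
          Bucket="my-source-bucket",
          Id="weekly-inventory",
          InventoryConfiguration={
              "Id": "weekly-inventory",
              "IsEnabled": True,
              "IncludedObjectVersions": "Current",
              "Schedule": {"Frequency": "Weekly"},
              "OptionalFields": ["Size", "LastModifiedDate", "EncryptionStatus", "ReplicationStatus"],
              "Destination": {
                  "S3BucketDestination": {
                      "Bucket": "arn:aws:s3:::my-inventory-reports-bucket",
                      "Format": "CSV",
                      "Prefix": "inventory/",
                  }
              },
          },
      )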

Requester Pays

  • By default, buckets are owned by the AWS account that created them (the bucket owner), and that AWS account pays for the storage costs, downloads, and data transfer charges associated with the bucket.
  • Using the Requester Pays subresource (a minimal sketch follows this list):
    • the bucket owner specifies that the requester requesting the download will be charged for the download
    • however, the bucket owner still pays the storage costs
  • Enabling Requester Pays on a bucket
    • disables anonymous access to that bucket
    • does not support BitTorrent
    • does not support SOAP requests
    • cannot be used as the target bucket for end-user (server access) logging
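  • A minimal boto3 sketch of enabling Requester Pays and making a request as the requester; the bucket and key names are placeholders:

      import boto3

      s3 = boto3.client("s3")

      # Bucket owner enables Requester Pays on the bucket (placeholder name)
      s3.put_bucket_request_payment(
          Bucket="my-example-bucket",
          RequestPaymentConfiguration={"Payer": "Requester"},
      )

      # A requester must explicitly acknowledge that they will be charged for the request
      obj = s3.get_object(
          Bucket="my-example-bucket",
          Key="data/report.csv",
          RequestPayer="requester",
      )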

Torrent

  • Default distribution mechanism for S3 data is via client/server download
  • Bucket owner bears the cost of Storage as well as the request and transfer charges which can increase linearly for a popular object
  • S3 also supports the BitTorrent protocol
    • BitTorrent is an open-source Internet distribution protocol
    • BitTorrent addresses this problem by recruiting the very clients that are downloading the object as distributors themselves
    • S3 bandwidth rates are inexpensive, but BitTorrent allows developers to further save on bandwidth costs for a popular piece of data by letting users download from Amazon and other users simultaneously
  • Benefit for a publisher is that for large, popular files the amount of data actually supplied by S3 can be substantially lower than what it would have been serving the same clients via client/server download
  • Any object in S3 that is publicly available and can be read anonymously can be downloaded via BitTorrent
  • Torrent file can be retrieved for any publicly available object by simply adding a “?torrent” query string parameter at the end of the REST GET request for the object, as in the sketch after this list
  • Generating the .torrent for an object takes time proportional to the size of that object, so it's recommended to make the first torrent request yourself to generate the file so that subsequent requests are faster
  • Torrent is enabled only for objects that are less than 5 GB in size.
  • Torrent subresource can only be retrieved, and cannot be created, updated, or deleted
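  • A minimal sketch of retrieving the torrent file for a publicly readable object by appending the ?torrent query string; the bucket and key are placeholders:

      import requests

      # The object must be publicly (anonymously) readable; bucket/key are placeholders
      url = "https://my-public-bucket.s3.amazonaws.com/big-dataset.tar?torrent"
      response = requests.get(url)

      # Save the .torrent metadata file for use with a BitTorrent client
      with open("big-dataset.tar.torrent", "wb") as f:
          f.write(response.content)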

Object ACL

Refer blog post @ S3 Permissions

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. An organization’s security policy requires multiple copies of all critical data to be replicated across at least a primary and backup data center. The organization has decided to store some critical data on Amazon S3. Which option should you implement to ensure this requirement is met?
    1. Use the S3 copy API to replicate data between two S3 buckets in different regions
    2. You do not need to implement anything since S3 data is automatically replicated between regions
    3. Use the S3 copy API to replicate data between two S3 buckets in different facilities within an AWS Region
    4. You do not need to implement anything since S3 data is automatically replicated between multiple facilities within an AWS Region
  2. A customer wants to track access to their Amazon Simple Storage Service (S3) buckets and also use this information for their internal security and access audits. Which of the following will meet the Customer requirement?
    1. Enable AWS CloudTrail to audit all Amazon S3 bucket access.
    2. Enable server access logging for all required Amazon S3 buckets
    3. Enable the Requester Pays option to track access via AWS Billing
    4. Enable Amazon S3 event notifications for Put and Post.
  3. A user is enabling a static website hosting on an S3 bucket. Which of the below mentioned parameters cannot be configured by the user?
    1. Error document
    2. Conditional error on object name
    3. Index document
    4. Conditional redirection on object name
  4. Company ABCD is running their corporate website on Amazon S3 accessed from http://www.companyabcd.com. Their marketing team has published new web fonts to a separate S3 bucket accessed by the S3 endpoint: https://s3-us-west1.amazonaws.com/abcdfonts. While testing the new web fonts, Company ABCD recognized the web fonts are being blocked by the browser. What should Company ABCD do to prevent the web fonts from being blocked by the browser?
    1. Enable versioning on the abcdfonts bucket for each web font
    2. Create a policy on the abcdfonts bucket to enable access to everyone
    3. Add the Content-MD5 header to the request for webfonts in the abcdfonts bucket from the website
    4. Configure the abcdfonts bucket to allow cross-origin requests by creating a CORS configuration
  5. Company ABCD is currently hosting their corporate site in an Amazon S3 bucket with Static Website Hosting enabled. Currently, when visitors go to http://www.companyabcd.com the index.html page is returned. Company ABCD now would like a new page welcome.html to be returned when a visitor enters http://www.companyabcd.com in the browser. Which of the following steps will allow Company ABCD to meet this requirement? Choose 2 answers.
    1. Upload an html page named welcome.html to their S3 bucket
    2. Create a welcome subfolder in their S3 bucket
    3. Set the Index Document property to welcome.html
    4. Move the index.html page to a welcome subfolder
    5. Set the Error Document property to welcome.html

AWS S3 Storage Classes

AWS S3 Storage Classes

  • AWS S3 offers a range of S3 Storage Classes to match the use case scenario and performance access requirements.
  • S3 storage classes are designed to sustain the concurrent loss of data in one or two facilities
  • S3 storage classes allow lifecycle management for automatic migration of objects for cost savings (see the upload and lifecycle sketch after this list)
  • All S3 storage classes are designed for 99.999999999% (11 9's) durability (except the legacy RRS class), support SSL encryption of data in transit, and offer data encryption at rest; first-byte latency is milliseconds for all classes except the Glacier archive classes
  • S3 also regularly verifies the integrity of the data using checksums and provides the auto-healing capability
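  • A minimal boto3 sketch of setting a storage class at upload time and of a lifecycle rule that migrates objects between classes; the bucket, key, and prefix names are placeholders:

      import boto3

      s3 = boto3.client("s3")

      # Upload an object directly into a non-default storage class (placeholder bucket/key)
      s3.put_object(
          Bucket="my-example-bucket",
          Key="reports/2023/summary.csv",
          Body=b"col1,col2\n1,2\n",
          StorageClass="STANDARD_IA",  # e.g. STANDARD, INTELLIGENT_TIERING, ONEZONE_IA
      )

      # Lifecycle rule that transitions objects under logs/ to cheaper classes over time
      s3.put_bucket_lifecycle_configuration(
          Bucket="my-example-bucket",
          LifecycleConfiguration={
              "Rules": [
                  {
                      "ID": "archive-old-logs",
                      "Status": "Enabled",
                      "Filter": {"Prefix": "logs/"},
                      "Transitions": [
                          {"Days": 30, "StorageClass": "STANDARD_IA"},
                          {"Days": 365, "StorageClass": "GLACIER"},
                      ],
                  }
              ]
          },
      )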

S3 Storage Classes Comparison

S3 Storage Classes Performance

S3 Standard

  • STANDARD is the default storage class if none is specified during upload
  • Low latency and high throughput performance
  • Designed for 99.999999999% i.e. 11 9’s Durability of objects across AZs
  • Designed for 99.99% availability over a given year
  • Resilient against events that impact an entire Availability Zone and designed to sustain the concurrent loss of data in two facilities
  • Ideal for performance-sensitive use cases and frequently accessed data
  • S3 Standard is appropriate for a wide variety of use cases, including cloud applications, dynamic websites, content distribution, mobile and gaming applications, and big data analytics.

S3 Intelligent Tiering (S3 Intelligent-Tiering)

  • S3 Intelligent Tiering storage class is designed to optimize storage costs by automatically moving data to the most cost-effective storage access tier, without performance impact or operational overhead.
  • Delivers automatic cost savings by moving data on a granular object level between two access tiers as access patterns change:
    • a frequent access tier that is optimized for frequently accessed data, and
    • a lower-cost infrequent access tier that is optimized for infrequently accessed data.
  • Ideal to optimize storage costs automatically for long-lived data when access patterns are unknown or unpredictable.
  • For a small monthly monitoring and automation fee per object, S3 monitors access patterns of the objects and moves objects that have not been accessed for 30 consecutive days to the infrequent access tier.
  • There are no separate retrieval fees when using the Intelligent Tiering storage class. If an object in the infrequent access tier is accessed, it is automatically moved back to the frequent access tier.
  • No additional fees apply when objects are moved between access tiers
  • Suitable for objects greater than 128 KB (smaller objects are charged for 128 KB only) kept for at least 30 days (charged for a minimum of 30 days)
  • Same low latency and high throughput performance of S3 Standard
  • Designed for 99.999999999% i.e. 11 9’s Durability of objects across AZs
  • Designed for 99.9% availability over a given year

S3 Standard-Infrequent Access (S3 Standard-IA)

  • S3 Standard-Infrequent Access storage class is optimized for long-lived and less frequently accessed data, e.g. backups and older data where access is limited, but the use case still demands high performance
  • Ideal for use for the primary or only copy of data that can’t be recreated.
  • Data stored redundantly across multiple geographically separated AZs and are resilient to the loss of an Availability Zone.
  • offers greater availability and resiliency than the ONEZONE_IA class.
  • Objects are available for real-time access.
  • Suitable for larger objects greater than 128 KB (smaller objects are charged for 128 KB only) kept for at least 30 days (charged for minimum 30 days)
  • Same low latency and high throughput performance of Standard
  • Designed for 99.999999999% i.e. 11 9’s Durability of objects across AZs
  • Designed for 99.9% availability over a given year
  • S3 charges a retrieval fee for these objects, so they are most suitable for infrequently accessed data.

S3 One Zone-Infrequent Access (S3 One Zone-IA)

  • S3 One Zone-Infrequent Access storage classes are designed for long-lived and infrequently accessed data, but available for millisecond access (similar to the STANDARD and STANDARD_IA storage class).
  • Ideal when the data can be recreated if the AZ fails, and for object replicas when setting cross-region replication (CRR).
  • Objects are available for real-time access.
  • Suitable for objects greater than 128 KB (smaller objects are charged for 128 KB only) kept for at least 30 days (charged for a minimum of 30 days)
  • Stores the object data in only one AZ, which makes it less expensive than Standard-Infrequent Access
  • Data is not resilient to the physical loss of the AZ resulting from disasters, such as earthquakes and floods.
  • One Zone-Infrequent Access storage class is as durable as Standard-Infrequent Access, but it is less available and less resilient.
  • Designed for 99.999999999% i.e. 11 9’s Durability of objects in a single AZ
  • Designed for 99.5% availability over a given year
  • S3 charges a retrieval fee for these objects, so they are most suitable for infrequently accessed data.

Reduced Redundancy Storage – RRS

  • NOTE – AWS recommends not to use this storage class. The STANDARD storage class is more cost-effective.
  • Reduced Redundancy Storage (RRS) storage class is designed for non-critical, reproducible data stored at lower levels of redundancy than the STANDARD storage class, which reduces storage costs
  • Designed for durability of 99.99% of objects
  • Designed for 99.99% availability over a given year
  • Lower level of redundancy results in less durability and availability
  • RRS stores objects on multiple devices across multiple facilities, providing 400 times the durability of a typical disk drive
  • RRS does not replicate objects as many times as S3 standard storage and is designed to sustain the loss of data in a single facility.
  • If an RRS object is lost, S3 returns a 405 error on requests made to that object
  • S3 can send an event notification, configured on the bucket, to alert a user or start a workflow when it detects that an RRS object is lost which can be used to replace the lost object

S3 Glacier

  • GLACIER storage class is suitable for low-cost data archiving where data access is infrequent and retrieval time of minutes to hours is acceptable.
  • Storage class has a minimum storage duration period of 90 days
  • Provides configurable retrieval times, from minutes to hours
  • GLACIER storage class uses the very low-cost Glacier storage service, but the objects in this storage class are still managed through S3
  • GLACIER objects are typically transitioned from STANDARD, RRS, or STANDARD_IA using lifecycle management, and can also be uploaded directly by specifying GLACIER as the storage class at object creation time.
  • For accessing GLACIER objects (a restore request sketch follows this list),
    • the object must be restored which can take anywhere between minutes to hours
    • objects are only available for the time period (the number of days) specified during the restoration request
    • object’s storage class remains GLACIER
    • charges are levied for both the archive (GLACIER rate) and the copy restored temporarily
  • Vault Lock feature enforces compliance via a lockable policy
  • Offers the same durability and resiliency as the STANDARD storage class
  • Designed for 99.999999999% i.e. 11 9’s Durability of objects across AZs
  • Designed for 99.9% availability over a given year
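  • A minimal boto3 sketch of restoring a GLACIER object for temporary access; the bucket, key, retrieval tier, and number of days are placeholders:

      import boto3

      s3 = boto3.client("s3")

      # Request a temporary copy of an archived object for 7 days (placeholder bucket/key);
      # the object's storage class remains GLACIER after the restore completes
      s3.restore_object(
          Bucket="my-archive-bucket",
          Key="archives/2020/records.tar.gz",
          RestoreRequest={
              "Days": 7,
              "GlacierJobParameters": {"Tier": "Standard"},  # Expedited | Standard | Bulk
          },
      )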

S3 Glacier Deep Archive

  • Glacier Deep Archive storage class provides the lowest-cost data archiving where data access is infrequent and retrieval time of hours is acceptable.
  • Has a minimum storage duration period of 180 days and can be accessed at a default retrieval time of 12 hours.
  • Supports long-term retention and digital preservation for data that may be accessed once or twice a year
  • Designed for 99.999999999% i.e. 11 9’s Durability of objects across AZs
  • Designed for 99.9% availability over a given year
  • DEEP_ARCHIVE retrieval costs can be reduced by using bulk retrieval, which returns data within 48 hours.
  • Ideal alternative to magnetic tape libraries

S3 Analytics – Storage Class Analysis

  • S3 Analytics – Storage Class Analysis helps analyze storage access patterns to decide when to transition the right data to the right storage class.
  • S3 Analytics feature observes data access patterns to help determine when to transition less frequently accessed STANDARD storage to the STANDARD_IA (IA, for infrequent access) storage class.
  • Storage Class Analysis can be configured to analyze all the objects in a bucket or to use filters to group objects for analysis.
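  • A minimal boto3 sketch of a Storage Class Analysis configuration that exports results as CSV; the bucket names are placeholders:

      import boto3

      s3 = boto3.client("s3")

      # Analyze the whole bucket and export results to another bucket (placeholder names)
      s3.put_bucket_analytics_configuration(
          Bucket="my-example-bucket",
          Id="whole-bucket-analysis",
          AnalyticsConfiguration={
              "Id": "whole-bucket-analysis",
              "StorageClassAnalysis": {
                  "DataExport": {
                      "OutputSchemaVersion": "V_1",
                      "Destination": {
                          "S3BucketDestination": {
                              "Bucket": "arn:aws:s3:::my-analysis-results-bucket",
                              "Format": "CSV",
                              "Prefix": "storage-class-analysis/",
                          }
                      },
                  }
              },
          },
      )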

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. What does RRS stand for when talking about S3?
    1. Redundancy Removal System
    2. Relational Rights Storage
    3. Regional Rights Standard
    4. Reduced Redundancy Storage
  2. What is the durability of S3 RRS?
    1. 99.99%
    2. 99.95%
    3. 99.995%
    4. 99.999999999%
  3. What is the Reduced Redundancy option in Amazon S3?
    1. Less redundancy for a lower cost
    2. It doesn’t exist in Amazon S3, but in Amazon EBS.
    3. It allows you to destroy any copy of your files outside a specific jurisdiction.
    4. It doesn’t exist at all
  4. An application is generating a log file every 5 minutes. The log file is not critical but may be required only for verification in case of some major issue. The file should be accessible over the internet whenever required. Which of the below mentioned options is a best possible storage solution for it?
    1. AWS S3
    2. AWS Glacier
    3. AWS RDS
    4. AWS S3 RRS (Reduced Redundancy Storage (RRS) is an Amazon S3 storage option that enables customers to store noncritical, reproducible data at lower levels of redundancy than Amazon S3’s standard storage. RRS is designed to sustain the loss of data in a single facility.)
  5. A user has moved an object to Glacier using the life cycle rules. The user requests to restore the archive after 6 months. When the restore request is completed the user accesses that archive. Which of the below mentioned statements is not true in this condition?
    1. The archive will be available as an object for the duration specified by the user during the restoration request
    2. The restored object’s storage class will be RRS (After the object is restored, the storage class still remains GLACIER)
    3. The user can modify the restoration period only by issuing a new restore request with the updated period
    4. The user needs to pay storage for both RRS (restored) and Glacier (Archive) Rates
  6. Your department creates regular analytics reports from your company’s log files. All log data is collected in Amazon S3 and processed by daily Amazon Elastic Map Reduce (EMR) jobs that generate daily PDF reports and aggregated tables in CSV format for an Amazon Redshift data warehouse. Your CFO requests that you optimize the cost structure for this system. Which of the following alternatives will lower costs without compromising average performance of the system or data integrity for the raw data? [PROFESSIONAL]
    1. Use reduced redundancy storage (RRS) for PDF and CSV data in Amazon S3. Add Spot instances to Amazon EMR jobs. Use Reserved Instances for Amazon Redshift. (Spot instances impacts performance)
    2. Use reduced redundancy storage (RRS) for all data in S3. Use a combination of Spot instances and Reserved Instances for Amazon EMR jobs. Use Reserved instances for Amazon Redshift (A combination of Spot and Reserved Instances guarantees performance and helps reduce cost. RRS also reduces cost without compromising data integrity, which is different from data durability.)
    3. Use reduced redundancy storage (RRS) for all data in Amazon S3. Add Spot Instances to Amazon EMR jobs. Use Reserved Instances for Amazon Redshift (Spot instances impacts performance)
    4. Use reduced redundancy storage (RRS) for PDF and CSV data in S3. Add Spot Instances to EMR jobs. Use Spot Instances for Amazon Redshift. (Spot instances impacts performance)
  7. Which of the below mentioned options can be a good use case for storing content in AWS RRS?
    1. Storing mission critical data Files
    2. Storing infrequently used log files
    3. Storing a video file which is not reproducible
    4. Storing image thumbnails
  8. A newspaper organization has an on-premises application which allows the public to search its back catalogue and retrieve individual newspaper pages via a website written in Java. They have scanned the old newspapers into JPEGs (approx. 17TB) and used Optical Character Recognition (OCR) to populate a commercial search product. The hosting platform and software are now end of life, and the organization wants to migrate its archive to AWS with a cost-efficient architecture that is still designed for availability and durability. Which is the most appropriate? [PROFESSIONAL]
    1. Use S3 with reduced redundancy to store and serve the scanned files, install the commercial search application on EC2 Instances and configure with auto-scaling and an Elastic Load Balancer. (RRS impacts durability and commercial search would add to cost)
    2. Model the environment using CloudFormation. Use an EC2 instance running Apache webserver and an open source search application, stripe multiple standard EBS volumes together to store the JPEGs and search index. (Using EBS is not cost effective for storing files)
    3. Use S3 with standard redundancy to store and serve the scanned files, use CloudSearch for query processing, and use Elastic Beanstalk to host the website across multiple availability zones. (Standard S3 and Elastic Beanstalk provides availability and durability, Standard S3 and CloudSearch provides cost effective storage and search)
    4. Use a single-AZ RDS MySQL instance to store the search index and the JPEG images use an EC2 instance to serve the website and translate user queries into SQL. (RDS is not ideal and cost effective to store files, Single AZ impacts availability)
    5. Use a CloudFront download distribution to serve the JPEGs to the end users and Install the current commercial search product, along with a Java Container for the website on EC2 instances and use Route53 with DNS round-robin. (CloudFront needs a source and using commercial search product is not cost effective)
  9. A research scientist is planning for the one-time launch of an Elastic MapReduce cluster and is encouraged by her manager to minimize the costs. The cluster is designed to ingest 200TB of genomics data with a total of 100 Amazon EC2 instances and is expected to run for around four hours. The resulting data set must be stored temporarily until archived into an Amazon RDS Oracle instance. Which option will help save the most money while meeting requirements? [PROFESSIONAL]
    1. Store ingest and output files in Amazon S3. Deploy on-demand for the master and core nodes and spot for the task nodes.
    2. Optimize by deploying a combination of on-demand, RI and spot-pricing models for the master, core and task nodes. Store ingest and output files in Amazon S3 with a lifecycle policy that archives them to Amazon Glacier. (Master and Core must be RI or On Demand. Cannot be Spot)
    3. Store the ingest files in Amazon S3 RRS and store the output files in S3. Deploy Reserved Instances for the master and core nodes and on-demand for the task nodes. (Need better durability for ingest file. Spot instances can be used for task nodes for cost saving.)
    4. Deploy on-demand master, core and task nodes and store ingest and output files in Amazon S3 RRS (Input must be in S3 standard)