AWS S3 vs EBS vs EFS

S3, EBS, and EFS are AWS’ three main storage services, each suited to different types of workload needs.

S3 vs EBS vs EFS Comparison

Simple Storage Service – S3

  • is an object store with a simple key-value design, well suited to storing vast numbers of backups or user files.
  • charges only for the storage you actually use and offers cost-saving storage classes ideal for infrequently accessed data or data archival
  • provides unlimited storage
  • provides 99.999999999% (11 9’s) durability, as the data is replicated and stored across at least three geographically dispersed AZs
  • provides high availability of 99.99%
  • provides security with a range of access control mechanisms and abilities to encrypt data at rest and in transit
  • data can be accessed programmatically or served directly through services such as Amazon CloudFront.
  • provides backup capability using versioning and cross-region replication

Elastic Block Store – EBS

  • delivers high-availability block-level storage volumes for EC2 instances.
  • charges for the provisioned storage, even if you do not use it
  • provides limited storage capability and cannot scale infinitely
  • provides persistent block storage – the data is retained after the EC2 instance is shut down.
  • provides durability by replicating data across multiple servers in an AZ to prevent the loss of data from the failure of any single component
  • designed for 99.999% availability
  • provides low-latency performance – using SSD EBS volumes, it offers reliable I/O performance scaled to meet your workload needs.
  • provides secure storage with access control and encryption of data at rest and in transit
  • is accessible from only a single EC2 instance at a time, within a particular AWS region and AZ
  • provides a Multi-Attach option to share storage across multiple EC2 instances, but within a particular AWS region and AZ
  • provides backup capability using backups and snapshots

Elastic File System – EFS

  • scalable file storage, also optimized for EC2.
  • charges only for the storage you actually use. There’s no advance provisioning, up-front fees, or commitments
  • multiple instances can be configured to mount the file system.
  • allows mounting the file system across multiple AZs and instances (see the sketch after this list).
  • is designed to be highly durable and highly available. Data is redundantly stored across multiple AZs.
  • provides elasticity – scales up and down automatically, even to meet the most abrupt workload spikes.
  • provides performance that scales to support any workload: EFS offers the throughput changing workloads need. It can provide higher throughput in spurts that match sudden file system growth, even for workloads up to 500,000 IOPS or 10 GB per second.
  • provides accessible file storage, which can be accessed by on-premises servers and EC2 instances concurrently.
  • provides security and compliance – access to the file system can be secured with the existing security solution, or control access to EFS file systems using IAM, VPC, or POSIX permissions.
  • provides data encryption in transit and at rest.
  • allows EC2 instances to access EFS file systems located in other AWS regions through VPC peering.
  • a file system can be accessed concurrently from all AZs in the region where it is located, which means the application can be architected to failover from one AZ to other AZs in the region in order to ensure the highest level of application availability. Mount targets themselves are designed to be highly available.
  • used as a common data source for any application or workload that runs on numerous instances.
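
As a concrete illustration of mounting the same EFS file system from multiple AZs and instances, the boto3 sketch below creates a file system and one mount target per subnet; the subnet IDs and security group are hypothetical placeholders.

```python
import time

import boto3

efs = boto3.client("efs", region_name="us-east-1")

# Create the file system (CreationToken makes the call idempotent)
fs = efs.create_file_system(
    CreationToken="shared-content-fs",
    PerformanceMode="generalPurpose",
    Encrypted=True,
)
fs_id = fs["FileSystemId"]

# Poll until the file system is available before adding mount targets
while efs.describe_file_systems(FileSystemId=fs_id)["FileSystems"][0]["LifeCycleState"] != "available":
    time.sleep(5)

# One mount target per AZ lets EC2 instances in each AZ mount the same file system over NFS
for subnet_id in ["subnet-aaaa1111", "subnet-bbbb2222"]:  # hypothetical subnets in different AZs
    efs.create_mount_target(
        FileSystemId=fs_id,
        SubnetId=subnet_id,
        SecurityGroups=["sg-0123exampleefs"],  # hypothetical SG allowing NFS (TCP 2049)
    )
```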

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. A company runs an application on a group of Amazon Linux EC2 instances. The application writes log files using standard API calls. For compliance reasons, all log files must be retained indefinitely and will be analyzed by a reporting tool that must access all files concurrently. Which storage service should a solutions architect use to provide the MOST cost-effective solution?
    1. Amazon EBS
    2. Amazon EFS
    3. Amazon EC2 instance store
    4. Amazon S3
  2. A new application is being deployed on Amazon EC2. The application needs to read and write up to 3 TB of data to an external data store and requires read-after-write consistency across all AWS regions for writing new objects into this data store. Which data store should be used?
    1. Amazon EBS
    2. Amazon Glacier
    3. Amazon EFS
    4. Amazon S3
  3. To meet the requirements of an application, an organization needs to save a constantly increasing volume of files on a cloud storage system with the following features and abilities. Which AWS service below will meet these requirements?
      1. Pay only for the storage used
      2. Create different security policies for different groups of files
      3. Allow access to the public
      4. Retrieve the files at any time
      5. Store an unlimited number of files
    1. Amazon EBS
    2. Amazon S3
    3. Amazon Glacier
    4. Amazon EFS
  4. An administrator runs a highly available application in AWS. A file storage layer is needed that can be shared between instances and scales the platform more easily. The storage should also be POSIX compliant. Which AWS service can perform this action?
    1. Amazon EBS
    2. Amazon S3
    3. Amazon EFS
    4. Amazon EC2 Instance store

Reference

AWS_When_to_choose_EFS

AWS Storage Services Cheat Sheet

AWS Storage Services

Simple Storage Service – S3

  • provides key-value based object storage for the internet, with unlimited storage and an unlimited number of objects of up to 5 TB each
  • offers an extremely durable, highly available, and infinitely scalable data storage infrastructure at very low costs.
  • is object-level storage (not block-level storage) and cannot be used to host an OS or dynamic websites (though client-side scripts using the JavaScript SDK can work)
  • provides durability by redundantly storing objects on multiple facilities within a region
  • regularly verifies the integrity of data using checksums and provides the auto-healing capability
  • S3 resources consist of globally unique buckets with objects and related metadata. The data model is a flat structure with no hierarchies or folders.
  • S3 Replication enables automatic, asynchronous copying of objects across S3 buckets in the same or different AWS regions using SRR or CRR. Replication requires versioning to be enabled on both the source and destination buckets.
  • S3 Transfer Acceleration helps speed data transport over long distances between a client and an S3 bucket using CloudFront edge locations.
  • S3 supports cost-effective Static Website hosting with Client-side scripts.
  • S3 CORS – Cross-Origin Resource Sharing allows cross-origin access to S3 resources.
  • S3 Access Logs enables tracking access requests to an S3 bucket.
  • S3 notification feature enables notifications to be triggered when certain events happen in the bucket.
  • S3 Inventory helps manage the storage and can be used to audit and report on the replication and encryption status of the objects for business, compliance, and regulatory needs.
  • Requester Pays helps the bucket owner specify that the requester will be charged for the download.
  • S3 Batch Operations help perform large-scale batch operations on S3 objects and can perform a single operation on lists of specified S3 objects.
  • Pre-Signed URLs can be shared for uploading/downloading objects for a limited time without requiring AWS security credentials (see the sketch after this list).
  • Multipart Uploads allows
    • parallel uploads with improved throughput and bandwidth utilization
    • fault tolerance and quick recovery from network issues
    • ability to pause and resume uploads
    • begin an upload before the final object size is known
  • Versioning
    • helps preserve, retrieve, and restore every version of every object
    • protect from unintended overwrites and accidental deletions
    • protects individual files but does NOT protect from Bucket deletion
  • MFA (Multi-Factor Authentication) can be enabled for additional security for the deletion of objects.
  • Integrates with CloudTrail, CloudWatch, and SNS for event notifications
  • S3 Storage Classes
    • S3 Standard
      • default storage class, ideal for frequently accessed data
      • 99.999999999% durability & 99.99% availability
      • Low latency and high throughput performance
      • designed to sustain the concurrent loss of data in two facilities
    • S3 Standard-Infrequent Access (S3 Standard-IA)
      • optimized for long-lived and less frequently accessed data
      • designed to sustain the concurrent loss of data in two facilities
      • 99.999999999% durability & 99.9% availability
      • suitable for objects greater than 128 KB kept for at least 30 days
    • S3 One Zone-Infrequent Access (S3 One Zone-IA)
      • optimized for rapid access to less frequently accessed data
      • ideal for secondary backups and reproducible data
      • stores data in a single AZ, data stored in this storage class will be lost in the event of AZ destruction.
      • 99.999999999% durability & 99.5% availability
    • S3 Reduced Redundancy Storage (Not Recommended)
      • designed for noncritical, reproducible data stored at lower levels of redundancy than the STANDARD storage class
      • reduces storage costs
      • 99.99% durability & 99.99% availability
      • designed to sustain the loss of data in a single facility
    • S3 Glacier
      • suitable for low cost data archiving, where data access is infrequent
      • provides retrieval time of minutes to several hours
        • Expedited – 1 to 5 minutes
        • Standard – 3 to 5 hours
        • Bulk – 5 to 12 hours
      • 99.999999999% durability & 99.9% availability
      • Minimum storage duration of 90 days
    • S3 Glacier Deep Archive
      • provides lowest cost data archiving, where data access is infrequent
      • 99.999999999% durability & 99.9% availability
      • provides retrieval time of several (12-48) hours
        • Standard – 12 hours
        • Bulk – 48 hours
      • Minimum storage duration of 180 days
      • supports long-term retention and digital preservation for data that may be accessed once or twice a year
  • Lifecycle Management policies
    • transition to move objects to different storage classes and Glacier
    • expiration to remove objects and object versions
    • can be applied to both current and non-current objects, in case versioning is enabled.
  • Data Consistency Model
    • provides strong read-after-write consistency for PUT and DELETE requests of objects in the S3 bucket in all AWS Regions
    • updates to a single key are atomic
    • does not currently support object locking for concurrent writes
  • S3 Security
    • IAM policies – grant users within your own AWS account permission to access S3 resources
    • Bucket and Object ACL – grant other AWS accounts (not specific users) access to S3 resources
    • Bucket policies – allow adding or denying permissions across some or all of the objects within a single bucket
    • S3 Access Points simplify data access for any AWS service or customer application that stores data in S3.
    • S3 Glacier Vault Lock helps deploy and enforce compliance controls for individual S3 Glacier vaults with a vault lock policy.
    • S3 VPC Gateway Endpoint enables private connections between a VPC and S3, without requiring that you use an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection.
    • Support SSL encryption of data in transit and data encryption at rest
  • S3 Data Encryption
    • supports data at rest and data in transit encryption
    • Server-Side Encryption
      • SSE-S3 – encrypts S3 objects using keys handled & managed by AWS
      • SSE-KMS – leverage AWS Key Management Service to manage encryption keys. KMS provides control and audit trail over the keys.
      • SSE-C – when you want to manage your own encryption keys. AWS does not store the encryption key. Requires HTTPS.
    • Client-Side Encryption
      • Client library such as the S3 Encryption Client
      • Clients must encrypt data themselves before sending it to S3
      • Clients must decrypt data themselves when retrieving from S3
      • Customer fully manages the keys and encryption cycle
  • S3 Best Practices
    • use a random hash prefix for keys and ensure a random access pattern; as S3 stores objects lexicographically, randomness helps distribute the contents across multiple partitions for better performance
    • use parallel threads and Multipart upload for faster writes
    • use parallel threads and Range Header GET for faster reads
    • for list operations with a large number of objects, it’s better to build a secondary index in DynamoDB
    • use Versioning to protect from unintended overwrites and deletions, but this does not protect against bucket deletion
    • use VPC S3 Endpoints with VPC to transfer data using Amazon internal network
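
A minimal boto3 sketch of the pre-signed URL capability referenced earlier in this list; the bucket and key names are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# URL that lets anyone download the object for the next hour, signed with the caller's credentials
download_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-example-bucket", "Key": "reports/2021-01.pdf"},
    ExpiresIn=3600,
)

# URL that lets a client upload an object for the next 15 minutes without its own AWS credentials
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-example-bucket", "Key": "uploads/photo.jpg"},
    ExpiresIn=900,
)

print(download_url)
print(upload_url)
```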

Instance Store

  • provides temporary or ephemeral block-level storage for an EC2 instance
  • is physically attached to the Instance
  • delivers very high random I/O performance, which is a good option when storage with very low latency is needed
  • cannot be dynamically resized
  • data persists when an instance is rebooted
  • data does not persist if the
    • underlying disk drive fails
    • instance stops i.e. if an EBS-backed instance with instance store volumes attached is stopped
    • instance terminates
  • can be attached to an EC2 instance only when the instance is launched
  • is ideal for the temporary storage of information that changes frequently, such as buffers, caches, scratch data, and other temporary content, or for data that is replicated across a fleet of instances, such as a load-balanced pool of web servers.

Elastic Block Store – EBS

  • is virtual network-attached block storage
  • provides highly available, reliable, durable, block-level storage volumes that can be attached to a running instance
  • provides high durability and are redundant in an AZ, as the data is automatically replicated within that AZ to prevent data loss due to any single hardware component failure
  • persists and is independent of EC2 lifecycle
  • multiple volumes can be attached to a single EC2 instance
  • can be detached & attached to another EC2 instance in that same AZ only
  • volumes are Zonal i.e. created in a specific AZ and CAN’T span across AZs
  • snapshots
    • to make a volume available in a different AZ, create a snapshot of the volume and restore it to a new volume in any AZ within the region (see the sketch after this list)
    • to make a volume available in a different region, copy the snapshot to that region and restore it there as a new volume
  • PIOPS volumes are designed for transactional applications that require high and consistent IO, for e.g. relational databases, NoSQL, etc.
  • volumes CANNOT be shared with multiple EC2 instances, use EFS instead
  • Multi-Attach enables attaching a single Provisioned IOPS SSD (io1 or io2) volume to multiple instances that are in the same AZ.
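
A rough boto3 sketch of the snapshot workflow referenced above for moving a volume to another AZ or region; the volume ID, regions, and AZs are hypothetical.

```python
import boto3

ec2_src = boto3.client("ec2", region_name="us-east-1")

# 1. Snapshot the source volume (snapshots are stored regionally, not per AZ)
snap = ec2_src.create_snapshot(VolumeId="vol-0123456789abcdef0",
                               Description="migrate to another AZ/region")
ec2_src.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# 2a. Restore in a different AZ of the same region
ec2_src.create_volume(SnapshotId=snap["SnapshotId"], AvailabilityZone="us-east-1b")

# 2b. Or copy the snapshot to another region and restore it there
ec2_dst = boto3.client("ec2", region_name="us-west-2")
copied = ec2_dst.copy_snapshot(SourceRegion="us-east-1",
                               SourceSnapshotId=snap["SnapshotId"])
ec2_dst.get_waiter("snapshot_completed").wait(SnapshotIds=[copied["SnapshotId"]])
ec2_dst.create_volume(SnapshotId=copied["SnapshotId"], AvailabilityZone="us-west-2a")
```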

EBS Encryption

  • allows encryption using the EBS encryption feature.
  • All data stored at rest, disk I/O, and snapshots created from the volume are encrypted.
  • uses the 256-bit AES algorithm (AES-256) with keys managed through AWS KMS
  • Snapshots of encrypted EBS volumes are automatically encrypted.

EBS Snapshots

  • helps create backups of EBS volumes
  • are incremental
  • occur asynchronously, consume the instance IOPS
  • are regional and CANNOT span across regions
  • can be copied across regions to make it easier to leverage multiple regions for geographical expansion, data center migration, and disaster recovery
  • can be shared by making them public or with specific AWS accounts by modifying the access permissions of the snapshots
  • support EBS encryption
    • Snapshots of encrypted volumes are automatically encrypted
    • Volumes created from encrypted snapshots are automatically encrypted
    • All data in flight between the instance and the volume is encrypted
    • Volumes created from an unencrypted snapshot that you own or have access to can be encrypted on the fly.
    • An encrypted snapshot that you own or have access to can be re-encrypted with a different key during the copy process.
  • can be automated using AWS Data Lifecycle Manager

EBS vs Instance Store

Refer blog post @ EBS vs Instance Store

Glacier

  • suitable for archiving data, where data access is infrequent and a retrieval time of several hours (3 to 5 hours) is acceptable (Not true anymore with enhancements from AWS)
  • provides high durability by storing archives across multiple facilities and multiple devices at a very low storage cost
  • performs regular, systematic data integrity checks and is built to be automatically self-healing
  • aggregate files into bigger files before sending them to Glacier and use range retrievals to retrieve partial file and reduce costs
  • improve speed and reliability with multipart upload
  • automatically encrypts the data using AES-256
  • upload or download data to Glacier via SSL encrypted endpoints

EFS

  • fully-managed, easy to set up, scale, and cost-optimize file storage
  • can automatically scale from gigabytes to petabytes of data without needing to provision storage
  • provides managed NFS (network file system) that can be mounted on and accessed by multiple EC2 in multiple AZs simultaneously
  • highly durable, highly scalable and highly available.
    • stores data redundantly across multiple Availability Zones
    • grows and shrinks automatically as files are added and removed, so there is no need to manage storage procurement or provisioning.
  • expensive (3x gp2), but you pay per use
  • uses the Network File System version 4 (NFS v4) protocol
  • is compatible with all Linux-based AMIs for EC2; a POSIX-compliant file system (~Linux) with a standard file API
  • does not support Windows AMIs
  • offers the ability to encrypt data at rest using KMS and in transit.
  • can be accessed from on-premises using an AWS Direct Connect or AWS VPN connection between the on-premises datacenter and VPC.
  • can be accessed concurrently from servers in the on-premises datacenter as well as EC2 instances in the Amazon VPC
  • Performance mode
    • General purpose (default)
      • latency-sensitive use cases (web server, CMS, etc…)
    • Max I/O
      • higher latency, throughput, highly parallel (big data, media processing)
  • Storage Tiers
    • Standard
      • for frequently accessed files
      • ideal for active file system workloads and you pay only for the file system storage you use per month
    • Infrequent access (EFS-IA)
      • a lower cost storage class that’s cost-optimized for files infrequently accessed i.e. not accessed every day
      • has a cost to retrieve files, but a lower price to store them
    • EFS Lifecycle Management allows moving files to EFS IA by choosing an age-off policy
    • Lifecycle Management automatically moves the data to the EFS IA storage class according to the lifecycle policy, for e.g., files can be moved automatically into EFS IA after fourteen days of not being accessed (see the sketch after this list).
    • EFS is a shared POSIX system for Linux systems and does not work for Windows
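
A minimal boto3 sketch of the EFS lifecycle policy referenced above, moving files to EFS IA after 14 days without access; the file system ID is a hypothetical placeholder.

```python
import boto3

efs = boto3.client("efs")

# Transition files that have not been accessed for 14 days to the EFS IA storage class
efs.put_lifecycle_configuration(
    FileSystemId="fs-0123456789abcdef0",  # hypothetical file system ID
    LifecyclePolicies=[{"TransitionToIA": "AFTER_14_DAYS"}],
)
```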

Amazon FSx for Windows

  • is a fully managed, highly reliable, and scalable Windows file system share drive
  • supports SMB protocol & Windows NTFS
  • supports Microsoft Active Directory integration, ACLs, user quotas
  • built on SSD, scale up to 10s of GB/s, millions of IOPS, 100s PB of data
  • is accessible from Windows, Linux, and MacOS compute instances
  • can be accessed from the on-premises infrastructure
  • can be configured to be Multi-AZ (high availability)
  • supports encryption of data at rest and in transit
  • provides data deduplication, which enables further cost optimization by removing redundant data.
  • data is backed up daily to S3

Amazon FSx for Lustre

  • provides an easy and cost-effective way to launch and run the world’s most popular high-performance file system.
  • is a type of parallel distributed file system, for large-scale computing
  • Lustre is derived from “Linux” and “cluster”
  • use cases include Machine Learning, High Performance Computing (HPC) esp. Video Processing, Financial Modeling, and Electronic Design Automation
  • scales up to 100s GB/s, millions of IOPS, sub-ms latencies
  • integrates seamlessly with S3 – it transparently presents S3 objects as files and allows you to write changed data back to S3.
  • can “read S3” as a file system (through FSx)
  • can write the output of the computations back to S3 (through FSx)
  • supports encryption of data at rest and in transit
  • can be used from on-premises servers

CloudFront

  • provides low latency and high data transfer speeds for distribution of static, dynamic web or streaming content to web users
  • delivers the content through a worldwide network of data centers called Edge Locations
  • keeps persistent connections with the origin servers so that the files can be fetched from the origin servers as quickly as possible.
  • dramatically reduces the number of network hops that users’ requests must pass through
  • supports multiple origin server options, like AWS hosted services, for e.g. S3, EC2, ELB, or an on-premises server, which stores the original, definitive version of the objects
  • a single distribution can have multiple origins, and the path pattern in a cache behavior determines which requests are routed to which origin
  • supports Web Download distribution and RTMP Streaming distribution
    • Web distribution supports static and dynamic web content, on-demand video using progressive download & HLS, and live streaming video content
    • RTMP supports streaming of media files using Adobe Media Server and the Adobe Real-Time Messaging Protocol (RTMP) ONLY
  • supports HTTPS using either
    • dedicated IP address, which is expensive as a dedicated IP address is assigned at each CloudFront edge location
    • Server Name Indication (SNI), which is free but supported by modern browsers only with the domain name available in the request header
  • For E2E HTTPS connection,
    • Viewers -> CloudFront needs a certificate issued by a trusted CA or ACM (self-signed certificates are not supported)
    • CloudFront -> Origin needs certificate issued by ACM for ELB and by CA for other origins
  •  Security
    • Origin Access Identity (OAI) can be used to restrict the content from S3 origin to be accessible from CloudFront only
    • supports Geo restriction (Geo-Blocking) to whitelist or blacklist countries that can access the content
    • Signed URLs (see the sketch after this list)
      • for RTMP distribution as signed cookies aren’t supported
      • to restrict access to individual files, for e.g., an installation download for your application.
      • for users using a client, for e.g. a custom HTTP client, that doesn’t support cookies
    • Signed Cookies
      • provide access to multiple restricted files, for e.g., video part files in HLS format or all of the files in the subscribers’ area of a website.
      • when you don’t want to change the current URLs
    • integrates with AWS WAF, a web application firewall that helps protect web applications from attacks by allowing rules configured based on IP addresses, HTTP headers, and custom URI strings
  • supports GET, HEAD, OPTIONS, PUT, POST, PATCH, DELETE to get object & object headers, add, update, and delete objects
    • only caches responses to GET and HEAD requests and, optionally, OPTIONS requests
    • does not cache responses to PUT, POST, PATCH, DELETE request methods and these requests are proxied back to the origin
  • object removal from cache
    • would be removed upon expiry (TTL) from the cache, by default 24 hrs
    • can be invalidated explicitly, but this has an associated cost; users might continue to see the old version until it expires from those caches
    • objects can be invalidated only for Web distribution
    • change the object name (versioning) to serve a different version
  • supports adding or modifying custom headers before the request is sent to origin which can be used to
    • validate if the user is accessing the content through the CDN
    • identify the CDN from which the request was forwarded, in case of multiple CloudFront distributions
    • for viewers not supporting CORS to return the Access-Control-Allow-Origin header for every request
  • supports Partial GET requests using range header to download object in smaller units improving the efficiency of partial downloads and recovery from partially failed transfers
  • supports compression to compress and serve compressed files when viewer requests include Accept-Encoding: gzip in the request header
  • supports different price classes – include all regions, include only the least expensive regions, or exclude only the most expensive regions
  • supports access logs which contain detailed information about every user request for both web and RTMP distribution
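
The sketch below, referenced from the Signed URLs item above, signs a CloudFront URL with botocore’s CloudFrontSigner; the key-pair ID, private key file, and distribution domain are hypothetical, and the cryptography package must be installed.

```python
from datetime import datetime, timedelta

from botocore.signers import CloudFrontSigner
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding


def rsa_signer(message):
    # Sign with the private key matching the public key registered with CloudFront
    with open("cloudfront_private_key.pem", "rb") as key_file:  # hypothetical key file
        private_key = serialization.load_pem_private_key(key_file.read(), password=None)
    return private_key.sign(message, padding.PKCS1v15(), hashes.SHA1())


signer = CloudFrontSigner("K2JCJMDEHXQW5F", rsa_signer)  # hypothetical public key ID

# Canned policy: the URL is valid for one hour
signed_url = signer.generate_presigned_url(
    "https://d111111abcdef8.cloudfront.net/private/installer.zip",
    date_less_than=datetime.utcnow() + timedelta(hours=1),
)
print(signed_url)
```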

AWS Import/Export

  • accelerates moving large amounts of data into and out of AWS using portable storage devices for transport and transfers data directly using Amazon’s high speed internal network, bypassing the internet.
  • suitable for use cases with
    • large datasets
    • low bandwidth connections
    • first time migration of data
  • Importing data to several types of AWS storage, including EBS snapshots, S3 buckets, and Glacier vaults.
  • Exporting data is supported from S3 only; with versioning enabled, only the latest version is exported
  • Import data can be encrypted (optional but recommended) while export is always encrypted using TrueCrypt
  • Amazon will wipe the device if specified, however it will not destroy the device

AWS Storage Options – S3 & Glacier

Amazon S3

  • highly-scalable, reliable, and low-latency data storage infrastructure at very low costs.
  • provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from within Amazon EC2 or from anywhere on the web.
  • allows you to write, read, and delete objects containing from 1 byte to 5 terabytes of data each.
  • number of objects you can store in an Amazon S3 bucket is virtually unlimited.
  • highly secure, supporting encryption at rest, and providing multiple mechanisms to provide fine-grained control of access to Amazon S3 resources.
  • highly scalable, allowing concurrent read or write access to Amazon S3 data by many separate clients or application threads.
  • provides data lifecycle management capabilities, allowing users to define rules to automatically archive Amazon S3 data to Amazon Glacier, or to delete data at end of life.

Ideal Use Cases

  • Storage & Distribution of static web content and media
    • frequently used to host static websites and provides a highly-available and highly-scalable solution for websites with only static content, including HTML files, images, videos, and client-side scripts such as JavaScript
    • works well for fast growing websites hosting data intensive, user-generated content, such as video and photo sharing sites as no storage provisioning is required
    • content can either be directly served from Amazon S3 since each object in Amazon S3 has a unique HTTP URL address
    • can also act as an Origin store for the Content Delivery Network (CDN) such as Amazon CloudFront
    • it works particularly well for hosting web content with extremely spiky bandwidth demands because of S3’s elasticity
  • Data Store for Large Objects
    • can be paired with RDS or NoSQL database and used to store large objects for e.g. file or objects, while the associated metadata for e.g. name, tags, comments etc. can be stored in RDS or NoSQL database where it can be indexed and queried providing faster access to relevant data
  • Data store for computation and large-scale analytics
    • commonly used as a data store for computation and large-scale analytics, such as analyzing financial transactions, clickstream analytics, and media transcoding.
    • data can be accessed from multiple computing nodes concurrently without being constrained by a single connection because of its horizontal scalability
  • Backup and Archival of critical data
    • used as a highly durable, scalable, and secure solution for backup and archival of critical data, and to provide disaster recovery solutions for business continuity.
    • stores objects redundantly on multiple devices across multiple facilities, it provides the highly-durable storage infrastructure needed for these scenarios.
    • its versioning capability is available to protect critical data from inadvertent deletion

Anti-Patterns

Amazon S3 has following Anti-Patterns where it is not an optimal solution

  • Dynamic website hosting
    • While Amazon S3 is ideal for hosting static websites, dynamic websites requiring server side interaction, scripting or database interaction cannot be hosted and should rather be hosted on Amazon EC2
  • Backup and archival storage
    • Data requiring long term archival storage with infrequent read access can be stored more cost effectively in Amazon Glacier
  • Structured Data Query
    • Amazon S3 doesn’t offer query capabilities, so to read an object the object name and key must be known. Instead, pair S3 with RDS or DynamoDB to store, index, and query metadata about Amazon S3 objects
    • NOTE – S3 now provides query capabilities and also Athena can be used
  • Rapidly Changing Data
    • Data that needs to be updated frequently might be better served by a storage solution with lower read/write latencies, such as Amazon EBS volumes, RDS, or DynamoDB.
  • File System
    • Amazon S3 uses a flat namespace and isn’t meant to serve as a standalone, POSIX-compliant file system. However, by using delimiters (commonly either the ‘/’ or ‘\’ character) you are able to construct your keys to emulate the hierarchical folder structure of a file system within a given bucket.

Performance

  • Access to Amazon S3 from within Amazon EC2 in the same region is fast.
  • Amazon S3 is designed so that server-side latencies are insignificant relative to Internet latencies.
  • Amazon S3 is also built to scale storage, requests, and users to support a virtually unlimited number of web-scale applications.
  • If Amazon S3 is accessed using multiple threads, multiple applications, or multiple clients concurrently, total Amazon S3 aggregate throughput will typically scale to rates that far exceed what any single server can generate or consume.

Durability & Availability

  • Amazon S3 storage provides the highest level of data durability and availability, by automatically and synchronously storing your data across both multiple devices and multiple facilities within the selected geographical region
  • Error correction is built-in, and there are no single points of failure. Amazon S3 is designed to sustain the concurrent loss of data in two facilities, making it very well-suited to serve as the primary data storage for mission-critical data.
  • Amazon S3 is designed for 99.999999999% (11 nines) durability per object and 99.99% availability over a one-year period.
  • Amazon S3 data can be protected from unintended deletions or overwrites using Versioning.
  • Versioning can be enabled with MFA (Multi Factor Authentication) Delete on the bucket, which would require two forms of authentication to delete an object
  • For Non Critical and Reproducible data for e.g. thumbnails, transcoded media etc., S3 Reduced Redundancy Storage (RRS) can be used, which provides a lower level of durability at a lower storage cost
  • RRS is designed to provide 99.99% durability per object over a given year. While RRS is less durable than standard Amazon S3, it is still designed to provide 400 times more durability than a typical disk drive

Cost Model

  • With Amazon S3, you pay only for what you use and there is no minimum fee.
  • Amazon S3 has three pricing components: storage (per GB per month), data transfer in or out (per GB per month), and requests (per n thousand requests per month).

Scalability & Elasticity

  • Amazon S3 has been designed to offer a very high level of scalability and elasticity automatically
  • Amazon S3 supports a virtually unlimited number of files in any bucket
  • Amazon S3 bucket can store a virtually unlimited number of bytes
  • Amazon S3 allows you to store any number of objects (files) in a single bucket, and Amazon S3 will automatically manage scaling and distributing redundant copies of your information to other servers in other locations in the same region, all using Amazon’s high-performance infrastructure.

Interfaces

  • Amazon S3 provides standards-based REST and SOAP web services APIs for both management and data operations.
  • NOTE – SOAP support over HTTP is deprecated, but it is still available over HTTPS. New Amazon S3 features will not be supported for SOAP. We recommend that you use either the REST API or the AWS SDKs.
  • Amazon S3 provides easier to use higher level toolkit or SDK in different languages (Java, .NET, PHP, and Ruby) that wraps the underlying APIs
  • Amazon S3 Command Line Interface (CLI) provides a set of high-level, Linux-like Amazon S3 file commands for common operations, such as ls, cp, mv, sync, etc. They also provide the ability to perform recursive uploads and downloads using a single folder-level Amazon S3 command, and supports parallel transfers.
  • AWS Management Console provides the ability to easily create and manage Amazon S3 buckets, upload and download objects, and browse the contents of your Amazon S3 buckets using a simple web-based user interface
  • All interfaces provide the ability to store Amazon S3 objects (files) in uniquely-named buckets (top-level folders), with each object identified by a unique object key within that bucket.

Glacier

  • extremely low-cost storage service that provides highly secure, durable, and flexible storage for data backup and archival
  • lets customers reliably store data for as little as $0.01 per gigabyte per month.
  • offloads the administrative burdens of operating and scaling storage to AWS, such as capacity planning, hardware provisioning, data replication, hardware failure detection and repair, or time-consuming hardware migrations
  • Data is stored in Amazon Glacier as Archives where an archive can represent a single file or multiple files combined into a single archive
  • Archives are stored in Vaults for which the access can be controlled through IAM
  • Retrieving archives from Vaults requires initiation of a job and can take around 3-5 hours for standard retrievals (see the sketch after this list)
  • Amazon Glacier integrates seamlessly with Amazon S3 by using S3 data lifecycle management policies to move data from S3 to Glacier
  • AWS Import/Export can also be used to accelerate moving large amounts of data into Amazon Glacier using portable storage devices for transport
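
A hedged boto3 sketch of the retrieval-job flow mentioned above when using native Glacier vaults; the vault name and archive ID are hypothetical placeholders.

```python
import boto3

glacier = boto3.client("glacier")

# Kick off an asynchronous archive-retrieval job; '-' means "the account owning the credentials"
job = glacier.initiate_job(
    accountId="-",
    vaultName="media-archive",                 # hypothetical vault
    jobParameters={
        "Type": "archive-retrieval",
        "ArchiveId": "EXAMPLE-ARCHIVE-ID",     # hypothetical archive ID
        "Tier": "Standard",                    # Expedited / Standard / Bulk
    },
)

# Hours later, once the job completes, fetch the archive bytes
output = glacier.get_job_output(accountId="-", vaultName="media-archive", jobId=job["jobId"])
data = output["body"].read()
```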

Ideal Usage Patterns

  • Amazon Glacier is ideally suited as a long-term archival solution for infrequently accessed data, such as archiving offsite enterprise information, media assets, research and scientific data, digital preservation, and magnetic tape replacement

Anti-Patterns

Amazon Glacier has following Anti-Patterns where it is not an optimal solution

  • Rapidly changing data
    • Data that must be updated very frequently might be better served by a storage solution with lower read/write latencies such as Amazon EBS or a Database
  • Real time access
    • Data stored in Glacier cannot be accessed in real time and requires initiation of a job for object retrieval, with retrieval times ranging from 3-5 hours. If immediate access is needed, Amazon S3 is a better choice.

Performance

  • Amazon Glacier is a low-cost storage service designed to store data that is infrequently accessed and long lived.
  • Amazon Glacier jobs typically complete in 3 to 5 hours

Durability and Availability

  • Amazon Glacier redundantly stores data in multiple facilities and on multiple devices within each facility
  • Amazon Glacier is designed to provide average annual durability of 99.999999999% (11 nines) for an archive
  • Amazon Glacier synchronously stores your data across multiple facilities before returning SUCCESS on uploading archives.
  • Amazon Glacier also performs regular, systematic data integrity checks and is built to be automatically self-healing.

Cost Model

  • Amazon Glacier has three pricing components: storage (per GB per month), data transfer out (per GB per month), and requests (per thousand UPLOAD and RETRIEVAL requests per month).
  • Amazon Glacier is designed with the expectation that retrievals are infrequent and unusual, and data will be stored for extended periods of time and allows you to retrieve up to 5% of your average monthly storage (pro-rated daily) for free each month. Any additional amount of data retrieved is charged per GB
  • Amazon Glacier also charges a pro-rated charge (per GB) for items deleted prior to 90 days

Scalability & Elasticity

  • A single archive is limited to 40 TBs, but there is no limit to the total amount of data you can store in the service.
  • Amazon Glacier scales to meet your growing and often unpredictable storage requirements – whether you’re storing petabytes or gigabytes, Amazon Glacier automatically scales your storage up or down as needed.

Interfaces

  • Amazon Glacier provides a native, standards-based REST web services interface, as well as Java and .NET SDKs.
  • AWS Management Console or the Amazon Glacier APIs can be used to create vaults to organize the archives in Amazon Glacier.
  • Amazon Glacier APIs can be used to upload and retrieve archives, monitor the status of your jobs and also configure your vault to send you a notification via Amazon Simple Notification Service (Amazon SNS) when your jobs complete.
  • Amazon Glacier can be used as a storage class in Amazon S3 by using object lifecycle management to provide automatic, policy-driven archiving from Amazon S3 to Amazon Glacier.
  • The Amazon S3 API provides a RESTORE operation, and the retrieval process takes the same 3-5 hours (see the sketch after this list)
  • On retrieval, a copy of the retrieved object is placed in Amazon S3 RRS storage for a specified retention period; the original archived object remains stored in Amazon Glacier and you are charged for both copies.
  • When using Amazon Glacier as a storage class in Amazon S3, use the Amazon S3 APIs, and when using “native” Amazon Glacier, you use the Amazon Glacier APIs
  • Objects archived to Amazon Glacier via Amazon S3 can only be listed and retrieved via the Amazon S3 APIs or the AWS Management Console—they are not visible as archives in an Amazon Glacier vault.
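
A small boto3 sketch of the S3 RESTORE operation described above for an object archived to the Glacier storage class; the bucket and key are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to restore a temporary copy of the archived object for 7 days
s3.restore_object(
    Bucket="my-archive-bucket",
    Key="backups/2019/db-dump.tar.gz",
    RestoreRequest={
        "Days": 7,
        "GlacierJobParameters": {"Tier": "Standard"},  # Expedited / Standard / Bulk
    },
)

# head_object reports the restore status; ongoing-request="false" means the copy is ready
status = s3.head_object(Bucket="my-archive-bucket", Key="backups/2019/db-dump.tar.gz")
print(status.get("Restore"))
```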

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. You want to pass queue messages that are 1GB each. How should you achieve this?
    1. Use Kinesis as a buffer stream for message bodies. Store the checkpoint id for the placement in the Kinesis Stream in SQS.
    2. Use the Amazon SQS Extended Client Library for Java and Amazon S3 as a storage mechanism for message bodies. (Amazon SQS messages with Amazon S3 can be useful for storing and retrieving messages with a message size of up to 2 GB. To manage Amazon SQS messages with Amazon S3, use the Amazon SQS Extended Client Library for Java. Refer link)
    3. Use SQS’s support for message partitioning and multi-part uploads on Amazon S3.
    4. Use AWS EFS as a shared pool storage medium. Store filesystem pointers to the files on disk in the SQS message bodies.
  2. Company ABCD has recently launched an online commerce site for bicycles on AWS. They have a “Product” DynamoDB table that stores details for each bicycle, such as, manufacturer, color, price, quantity and size to display in the online store. Due to customer demand, they want to include an image for each bicycle along with the existing details. Which approach below provides the least impact to provisioned throughput on the “Product” table?
    1. Serialize the image and store it in multiple DynamoDB tables
    2. Create an “Images” DynamoDB table to store the Image with a foreign key constraint to the “Product” table
    3. Add an image data type to the “Product” table to store the images in binary format
    4. Store the images in Amazon S3 and add an S3 URL pointer to the “Product” table item for each image

AWS S3 Best Practices

S3 Best Practices

Performance

Multiple Concurrent PUTs/GETs

  • S3 scales to support very high request rates. If the request rate grows steadily, S3 automatically partitions the buckets as needed to support higher request rates.
  • S3 can achieve at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket.
  • If the typical workload involves only occasional bursts of 100 requests per second and less than 800 requests per second, AWS scales and handles it.
  • If the typical workload involves a request rate for a bucket of more than 300 PUT/LIST/DELETE requests per second or more than 800 GET requests per second, it’s recommended to open a support case to prepare for the workload and avoid any temporary limits on your request rate.
  • S3 best practice guidelines need to be applied only if you are routinely processing 100 or more requests per second
  • Workloads that include a mix of request types
    • If the request workload is typically a mix of GET, PUT, DELETE, or GET Bucket (list objects), choosing appropriate key names for the objects ensures better performance by providing low-latency access to the S3 index
    • This behavior is driven by how S3 stores key names.
      • S3 maintains an index of object key names in each AWS region.
      • Object keys are stored lexicographically (UTF-8 binary ordering) across multiple partitions in the index i.e. S3 stores key names in alphabetical order.
      • Object keys are stored across multiple partitions in the index and the key name dictates which partition the key is stored in
      • Using a sequential prefix, such as a timestamp or an alphabetical sequence, increases the likelihood that S3 will target a specific partition for a large number of keys, overwhelming the I/O capacity of the partition.
    • Introducing some randomness in the key name prefixes means the key names, and therefore the I/O load, will be distributed across multiple index partitions.
    • It also ensures scalability regardless of the number of requests sent per second.
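
As a tiny sketch of the randomized key-prefix idea above (with per-prefix request-rate improvements, S3 no longer strictly requires this), a short hash of the object name can be prepended to spread keys across index partitions; the prefix and naming scheme are illustrative.

```python
import hashlib


def randomized_key(filename: str, prefix: str = "uploads") -> str:
    # First 4 hex characters of an MD5 hash spread keys across many index partitions
    key_hash = hashlib.md5(filename.encode("utf-8")).hexdigest()[:4]
    return f"{prefix}/{key_hash}-{filename}"


# e.g. uploads/<4-hex-chars>-2142281-log-2021-01-01.txt
print(randomized_key("2142281-log-2021-01-01.txt"))
```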

Transfer Acceleration

  • S3 Transfer Acceleration enables fast, easy, and secure transfers of files over long distances between the client and an S3 bucket.
  • Transfer Acceleration takes advantage of CloudFront’s globally distributed edge locations. As the data arrives at an edge location, data is routed to S3 over an optimized network path.
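
As a sketch of how Transfer Acceleration is enabled and then used from boto3 (the bucket name and file are hypothetical):

```python
import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# Enable Transfer Acceleration on the bucket (one-time configuration)
s3.put_bucket_accelerate_configuration(
    Bucket="my-global-uploads",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Use the accelerate endpoint for subsequent transfers
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("video.mp4", "my-global-uploads", "incoming/video.mp4")
```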

GET-intensive Workloads

  • CloudFront can be used for performance optimization and can help by
    • distributing content with low latency and high data transfer rate.
    • caching the content and thereby reducing the number of direct requests to S3
    • providing multiple endpoints (Edge locations) for data availability
    • available in two flavors as Web distribution or RTMP distribution
  • To speed data transport over long distances between a client and an S3 bucket, use S3 Transfer Acceleration. Transfer Acceleration uses the globally distributed edge locations in CloudFront to accelerate data transport over geographical distances

PUTs/GETs for Large Objects

  • AWS allows parallelizing PUT/GET requests to improve the upload and download performance, as well as the ability to recover in case a transfer fails
  • For PUTs, Multipart upload can help improve the uploads by
    • performing multiple uploads at the same time and maximizing network bandwidth utilization
    • quick recovery from failures, as only the part that failed to upload, needs to be re-uploaded
    • ability to pause and resume uploads
    • begin an upload before the Object size is known
  • For GETs, the range HTTP header can help to improve the downloads by
    • allowing the object to be retrieved in parts instead of the whole object
    • quick recovery from failures, as only the part that failed to download needs to be retried.
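
A minimal boto3 sketch of parallel multipart PUTs and ranged GETs as described above; the bucket, key, and thresholds are illustrative.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Multipart upload: objects above 100 MB are split into 100 MB parts uploaded by 10 threads
config = TransferConfig(multipart_threshold=100 * 1024 * 1024,
                        multipart_chunksize=100 * 1024 * 1024,
                        max_concurrency=10)
s3.upload_file("big-video.mp4", "my-media-bucket", "videos/big-video.mp4", Config=config)

# Ranged GET: fetch only the first 10 MB of the object
part = s3.get_object(Bucket="my-media-bucket",
                     Key="videos/big-video.mp4",
                     Range="bytes=0-10485759")
chunk = part["Body"].read()
```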

List Operations

  • Object key names are stored lexicographically in S3 indexes, making it hard to sort and manipulate the contents of LIST
  • S3 maintains a single lexicographically sorted list of indexes
  • Build and maintain a secondary index outside of S3, for e.g. in DynamoDB or RDS, to store, index, and query object metadata rather than performing LIST operations on S3

Security

  • Use Versioning
    • can be used to protect from unintended overwrites and deletions
    • allows the ability to retrieve and restore deleted objects or rollback to previous versions
  • Enable additional security by configuring a bucket to enable MFA (Multi-Factor Authentication) Delete (see the sketch after this list)
  • Versioning does not prevent bucket deletion; the bucket must be backed up, as the data is lost if it is accidentally or maliciously deleted
  • Use Same Region Replication or Cross Region replication feature to backup data to a different region
  • When using VPC with S3, use VPC S3 endpoints as
    • are horizontally scaled, redundant, and highly available VPC components
    • help establish a private connection between VPC and S3 and the traffic never leaves the Amazon network
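
A short boto3 sketch of the versioning and MFA Delete settings referenced above; the bucket name and MFA device are hypothetical, and MFA Delete can only be set with the bucket owner’s root credentials.

```python
import boto3

s3 = boto3.client("s3")

# Enable versioning to protect against unintended overwrites and deletions
s3.put_bucket_versioning(
    Bucket="my-critical-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Optionally require MFA for permanent deletes (must be called with root credentials,
# passing the MFA device serial and a current token code)
# s3.put_bucket_versioning(
#     Bucket="my-critical-bucket",
#     VersioningConfiguration={"Status": "Enabled", "MFADelete": "Enabled"},
#     MFA="arn:aws:iam::111122223333:mfa/root-account-mfa-device 123456",
# )
```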

Refer blog post @ S3 Security Best Practices

Cost

  • Optimize S3 storage cost by selecting an appropriate storage class for objects
  • Configure appropriate lifecycle management rules to move objects to different storage classes and expire them
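
A hedged boto3 sketch of a lifecycle rule implementing the cost guidance above: transition to Standard-IA, then Glacier, then expire; the bucket name, prefix, and day counts are illustrative.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-log-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-then-expire-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequently accessed after a month
                {"Days": 90, "StorageClass": "GLACIER"},      # archive after a quarter
            ],
            "Expiration": {"Days": 365},                      # delete after a year
        }]
    },
)
```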

Tracking

  • Use Event Notifications to be notified for any put or delete request on the S3 objects
  • Use CloudTrail, which helps capture specific API calls made to S3 from the AWS account and delivers the log files to an S3 bucket
  • Use CloudWatch to monitor the Amazon S3 buckets, tracking metrics such as object counts and bytes stored, and configure appropriate actions
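
A brief boto3 sketch of the event-notification tracking described above, sending object create/delete events to an SQS queue; the bucket and queue ARN are hypothetical, and the queue policy must allow S3 to send messages.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-tracked-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [{
            "QueueArn": "arn:aws:sqs:us-east-1:111122223333:s3-object-events",
            "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
        }]
    },
)
```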

S3 Monitoring and Auditing Best Practices

Refer blog post @ S3 Monitoring and Auditing Best Practices

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. A media company produces new video files on-premises every day with a total size of around 100GB after compression. All files have a size of 1-2 GB and need to be uploaded to Amazon S3 every night in a fixed time window between 3am and 5am. Current upload takes almost 3 hours, although less than half of the available bandwidth is used. What step(s) would ensure that the file uploads are able to complete in the allotted time window?
    1. Increase your network bandwidth to provide faster throughput to S3
    2. Upload the files in parallel to S3 using multipart upload
    3. Pack all files into a single archive, upload it to S3, then extract the files in AWS
    4. Use AWS Import/Export to transfer the video files
  2. You are designing a web application that stores static assets in an Amazon Simple Storage Service (S3) bucket. You expect this bucket to immediately receive over 150 PUT requests per second. What should you do to ensure optimal performance?
    1. Use multi-part upload.
    2. Add a random prefix to the key names.
    3. Amazon S3 will automatically manage performance at this scale.
    4. Use a predictable naming scheme, such as sequential numbers or date time sequences, in the key names
  3. You have an application running on an Amazon Elastic Compute Cloud instance, that uploads 5 GB video objects to Amazon Simple Storage Service (S3). Video uploads are taking longer than expected, resulting in poor application performance. Which method will help improve performance of your application?
    1. Enable enhanced networking
    2. Use Amazon S3 multipart upload
    3. Leveraging Amazon CloudFront, use the HTTP POST method to reduce latency.
    4. Use Amazon Elastic Block Store Provisioned IOPs and use an Amazon EBS-optimized instance
  4. Which of the following methods gives you protection against accidental loss of data stored in Amazon S3? (Choose 2)
    1. Set bucket policies to restrict deletes, and also enable versioning
    2. By default, versioning is enabled on a new bucket so you don’t have to worry about it (Not enabled by default)
    3. Build a secondary index of your keys to protect the data (improves performance only)
    4. Back up your bucket to a bucket owned by another AWS account for redundancy
  5. A startup company hired you to help them build a mobile application that will ultimately store billions of image and videos in Amazon S3. The company is lean on funding, and wants to minimize operational costs, however, they have an aggressive marketing plan, and expect to double their current installation base every six months. Due to the nature of their business, they are expecting sudden and large increases to traffic to and from S3, and need to ensure that it can handle the performance needs of their application. What other information must you gather from this customer in order to determine whether S3 is the right option?
    1. You must know how many customers that company has today, because this is critical in understanding what their customer base will be in two years. (No. of customers do not matter)
    2. You must find out total number of requests per second at peak usage.
    3. You must know the size of the individual objects being written to S3 in order to properly design the key namespace. (Size does not relate to the key namespace design but the count does)
    4. In order to build the key namespace correctly, you must understand the total amount of storage needs for each S3 bucket. (S3 provides unlimited storage; the key namespace design would depend on the request rate rather than the total storage needed)
  6. A document storage company is deploying their application to AWS and changing their business model to support both free tier and premium tier users. The premium tier users will be allowed to store up to 200GB of data and free tier customers will be allowed to store only 5GB. The customer expects that billions of files will be stored. All users need to be alerted when approaching 75 percent quota utilization and again at 90 percent quota use. To support the free tier and premium tier users, how should they architect their application?
    1. The company should utilize an amazon simple workflow service activity worker that updates the users data counter in amazon dynamo DB. The activity worker will use simple email service to send an email if the counter increases above the appropriate thresholds.
    2. The company should deploy an amazon relational data base service relational database with a store objects table that has a row for each stored object along with size of each object. The upload server will query the aggregate consumption of the user in questions (by first determining the files store by the user, and then querying the stored objects table for respective file sizes) and send an email via Amazon Simple Email Service if the thresholds are breached. (Good Approach to use RDS but with so many objects might not be a good option)
    3. The company should write both the content length and the username of the files owner as S3 metadata for the object. They should then create a file watcher to iterate over each object and aggregate the size for each user and send a notification via Amazon Simple Queue Service to an emailing service if the storage threshold is exceeded. (List operations on S3 not feasible)
    4. The company should create two separated amazon simple storage service buckets one for data storage for free tier users and another for data storage for premium tier users. An amazon simple workflow service activity worker will query all objects for a given user based on the bucket the data is stored in and aggregate storage. The activity worker will notify the user via Amazon Simple Notification Service when necessary (List operations on S3 not feasible as well as SNS does not address email requirement)
  7. Your company hosts a social media website for storing and sharing documents. The web application allows users to upload large files while resuming and pausing the upload as needed. Currently, files are uploaded to your PHP front end backed by Elastic Load Balancing and an Auto Scaling fleet of Amazon EC2 instances that scale upon the average of bytes received (NetworkIn). After a file has been uploaded, it is copied to Amazon S3. The EC2 instances use an AWS Identity and Access Management (IAM) role that allows Amazon S3 uploads. Over the last six months, your user base and scale have increased significantly, forcing you to increase the Auto Scaling group’s Max parameter a few times. Your CFO is concerned about the rising costs and has asked you to adjust the architecture where needed to better optimize costs. Which architecture change could you introduce to reduce cost and still keep your web application secure and scalable?
    1. Replace the Autoscaling launch Configuration to include c3.8xlarge instances; those instances can potentially yield a network throughput of 10gbps. (no info of current size and might increase cost)
    2. Re-architect your ingest pattern, have the app authenticate against your identity provider as a broker fetching temporary AWS credentials from AWS Secure token service (GetFederation Token). Securely pass the credentials and s3 endpoint/prefix to your app. Implement client-side logic to directly upload the file to amazon s3 using the given credentials and S3 Prefix. (will not provide the ability to handle pause and restarts)
    3. Re-architect your ingest pattern, and move your web application instances into a VPC public subnet. Attach a public IP address for each EC2 instance (using the auto scaling launch configuration settings). Use Amazon Route 53 round robin records set and http health check to DNS load balance the app request this approach will significantly reduce the cost by bypassing elastic load balancing. (ELB is not the bottleneck)
    4. Re-architect your ingest pattern, have the app authenticate against your identity provider as a broker fetching temporary AWS credentials from AWS Secure token service (GetFederation Token). Securely pass the credentials and s3 endpoint/prefix to your app. Implement client-side logic that used the S3 multipart upload API to directly upload the file to Amazon s3 using the given credentials and s3 Prefix. (multipart allows one to start uploading directly to S3 before the actual size is known or complete data is downloaded)
  8. If an application is storing hourly log files from thousands of instances from a high traffic web site, which naming scheme would give optimal performance on S3?
    1. Sequential
    2. instanceID_log-HH-DD-MM-YYYY
    3. instanceID_log-YYYY-MM-DD-HH
    4. HH-DD-MM-YYYY-log_instanceID (HH will give some randomness to start with, instead of instanceID where the first characters would be i-)
    5. YYYY-MM-DD-HH-log_instanceID

Reference

S3_Optimizing_Performance

AWS S3 Data Consistency Model

  • S3 Data Consistency provides strong read-after-write consistency for PUT and DELETE requests of objects in the S3 bucket in all AWS Regions
  • This behavior applies to both writes to new objects as well as PUT requests that overwrite existing objects and DELETE requests.
  • Read operations on S3 Select, S3 ACLs, S3 Object Tags, and object metadata (for example, the HEAD object) are strongly consistent.
  • Updates to a single key are atomic e.g. if a PUT is made to an existing key while a concurrent GET is made on the same key, the read returns either the old data or the new data, but never corrupted or partial data.
  • S3 achieves high availability by replicating data across multiple servers within Amazon’s data centers. If a PUT request is successful, the data is safely stored. Any read (GET or LIST request) that is initiated following the receipt of a successful PUT response will return the data written by the PUT request.
  • S3 Data Consistency behavior examples
    • A process writes a new object to S3 and immediately lists keys within its bucket. The new object appears in the list.
    • A process replaces an existing object and immediately tries to read it. S3 returns the new data.
    • A process deletes an existing object and immediately tries to read it. S3 does not return any data because the object has been deleted.
    • A process deletes an existing object and immediately lists keys within its bucket. The object does not appear in the listing.
  • S3 does not support object locking for concurrent writers e.g. if two PUT requests are simultaneously made to the same key, the request with the latest timestamp wins. If this is an issue, an object-locking mechanism needs to be built into the application.
  • Updates are key-based; there is no way to make atomic updates across keys e.g. an update of one key cannot be dependent on the update of another key unless this functionality is designed into the application.
  • S3 Object Lock is different as it allows to store objects using a write-once-read-many (WORM) model, which prevents an object from being deleted or overwritten for a fixed amount of time or indefinitely.
  • Previously (before the strong consistency update in December 2020), S3 provided Read-after-Write consistency only for PUTs of new objects
    • For a PUT request, S3 synchronously stores data across multiple facilities before returning SUCCESS
    • A process writing a new object to S3 was immediately able to read the object i.e. PUT 200 -> GET 200
    • However, a process writing a new object to S3 and immediately listing keys within its bucket might not have seen the object in the list until the change was fully propagated
    • Also, if a HEAD or GET request was made to a key name before the object was created, and the object was created shortly after that, a subsequent GET might not have returned the object due to eventual consistency i.e. GET 404 -> PUT 200 -> GET 404
  • Previously, S3 provided Eventual Consistency for overwrite PUTs and DELETEs in all regions
    • For updates and deletes to objects, the changes were eventually reflected and not available immediately i.e. PUT 200 -> PUT 200 -> GET 200 (might return the older version) OR DELETE 200 -> GET 200
    • if a process replaced an existing object and immediately attempted to read it, S3 might have returned the prior data until the change was fully propagated
    • if a process deleted an existing object and immediately attempted to read it, S3 might have returned the deleted data until the deletion was fully propagated
    • if a process deleted an existing object and immediately listed keys within its bucket, S3 might have listed the deleted object until the deletion was fully propagated
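
A minimal sketch of the current strong read-after-write behavior using boto3 (the bucket and key names are placeholders and the bucket is assumed to already exist):

  import boto3

  s3 = boto3.client("s3")
  bucket, key = "example-consistency-demo-bucket", "demo/object.txt"  # placeholder names

  # PUT a new object; S3 stores it durably before returning SUCCESS (HTTP 200)
  s3.put_object(Bucket=bucket, Key=key, Body=b"color=red")

  # Strong read-after-write consistency: an immediate GET returns the data just written
  assert s3.get_object(Bucket=bucket, Key=key)["Body"].read() == b"color=red"

  # ... and an immediate LIST includes the new key
  listed = s3.list_objects_v2(Bucket=bucket, Prefix="demo/").get("Contents", [])
  assert key in [obj["Key"] for obj in listed]

  # Overwrites are also strongly consistent: a subsequent GET returns the latest data
  s3.put_object(Bucket=bucket, Key=key, Body=b"color=white")
  assert s3.get_object(Bucket=bucket, Key=key)["Body"].read() == b"color=white"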

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Which of the following are valid statements about Amazon S3? Choose 2 answers
    1. S3 provides read-after-write consistency for any type of PUT or DELETE. (S3 now provides strong read-after-write consistency)
    2. Consistency is not guaranteed for any type of PUT or DELETE.
    3. A successful response to a PUT request only occurs when a complete object is saved
    4. Partially saved objects are immediately readable with a GET after an overwrite PUT.
    5. S3 provides eventual consistency for overwrite PUTS and DELETES
  2. A customer is leveraging Amazon Simple Storage Service in eu-west-1 to store static content for web-based property. The customer is storing objects using the Standard Storage class. Where are the customers’ objects replicated?
    1. Single facility in eu-west-1 and a single facility in eu-central-1
    2. Single facility in eu-west-1 and a single facility in us-east-1
    3. Multiple facilities in eu-west-1
    4. A single facility in eu-west-1
  3. A user has an S3 object in the US Standard region with the content “color=red”. The user updates the object with the content “color=white”. If the user tries to read the value 1 minute after it was uploaded, what will S3 return?
    1. It will return “color=white” (strong read-after-write consistency)
    2. It will return “color=red”
    3. It will return an error saying that the object was not found
    4. It may return either “color=red” or “color=white” i.e. either of the values (this was the answer under the previous Eventual Consistency model)

References

AWS_S3_Data_Consistency

AWS Simple Storage Service – S3

AWS Simple Storage Service – S3

  • Amazon Simple Storage Service – S3 is a simple key, value object store designed for the Internet
  • provides unlimited storage space and works on the pay-as-you-use model. Service rates get cheaper as the usage volume increases
  • offers an extremely durable, highly available, and infinitely scalable data storage infrastructure at very low costs.
  • is Object-level storage (not Block level storage like EBS volumes) and cannot be used to host OS or dynamic websites.
  • S3 resources e.g. buckets and objects are private by default.

S3 Buckets & Objects

S3 Buckets

  • A bucket is a container for objects stored in S3
  • Buckets help organize the S3 namespace.
  • A bucket is owned by the AWS account that creates it and helps identify the account responsible for storage and data transfer charges.
  • Bucket names are globally unique, regardless of the AWS region in which they are created, and the namespace is shared by all AWS accounts
  • Even though S3 is a global service, buckets are created within a region specified during the creation of the bucket.
  • Every object is contained in a bucket
  • There is no limit to the number of objects that can be stored in a bucket and no difference in performance whether a single bucket or multiple buckets are used to store all the objects
  • The S3 data model is a flat structure i.e. there are no hierarchies or folders within the buckets. However, logical hierarchy can be inferred using the key name prefix e.g. Folder1/Object1
  • Restrictions
    • 100 buckets (soft limit) and a maximum of 1000 buckets can be created in each AWS account
    • Bucket names should be globally unique and DNS compliant
    • Bucket ownership is not transferable
    • Buckets cannot be nested and cannot have a bucket within another bucket
    • Bucket name and region cannot be changed, once created
  • Both empty and non-empty buckets can be deleted
  • A single listing request returns a maximum of 1000 objects and provides pagination support
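
A minimal boto3 sketch of creating a bucket in a specific region (the bucket name is a placeholder and must be globally unique and DNS compliant):

  import boto3

  s3 = boto3.client("s3", region_name="eu-west-1")

  # The bucket name shares a global namespace; the region is fixed at creation time
  s3.create_bucket(
      Bucket="example-unique-bucket-name-123",  # placeholder name
      CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},  # omit for us-east-1
  )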

Objects

  • Objects are the fundamental entities stored in a bucket
  • An object is uniquely identified within a bucket by a key name and version ID (if S3 versioning is enabled on the bucket)
  • Objects consist of object data, metadata, and other attributes
    • Key is the object name and a unique identifier for an object
    • Value is actual content stored
    • Metadata is the data about the data and is a set of name-value pairs that describe the object e.g. content-type, size, last modified. Custom metadata can also be specified at the time the object is stored.
    • Version ID is the version id for the object and in combination with the key helps to uniquely identify an object within a bucket
    • Subresources help provide additional information for an object
    • Access Control Information helps control access to the objects
  • S3 objects allow two kinds of metadata
    • System metadata
      • Metadata such as the Last-Modified date is controlled by the system. Only S3 can modify the value.
      • System metadata that the user can control, e.g. the storage class and the encryption configured for the object.
    • User-defined metadata
      • User-defined metadata can be assigned during uploading the object or after the object has been uploaded.
      • User-defined metadata is stored with the object and is returned when an object is downloaded
      • S3 does not process user-defined metadata.
      • User-defined metadata must begin with the prefix “x-amz-meta-”, otherwise S3 will not set the key-value pair as you define it
  • Object metadata cannot be modified in place after the object is uploaded; it can only be changed by performing a copy operation and setting the metadata (see the sketch below)
  • Objects belonging to a bucket that reside in a specific AWS region never leave that region, unless explicitly copied using Cross Region Replication
  • Each object can be up to 5 TB in size
  • An object can be retrieved as a whole or partially
  • With Versioning enabled, current as well as previous versions of an object can be retrieved
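
A short boto3 sketch of uploading an object with system and user-defined metadata, reading it back, and replacing the metadata via a copy (bucket and key names are placeholders); the SDK sends the Metadata entries as x-amz-meta-* headers:

  import boto3

  s3 = boto3.client("s3")
  bucket, key = "example-bucket", "reports/2016/summary.txt"  # placeholders

  # Upload with system metadata (ContentType, StorageClass) and user-defined metadata
  s3.put_object(
      Bucket=bucket,
      Key=key,
      Body=b"hello world",
      ContentType="text/plain",
      StorageClass="STANDARD_IA",
      Metadata={"department": "finance", "reviewed": "true"},  # sent as x-amz-meta-* headers
  )

  # HEAD returns the metadata without downloading the object data
  head = s3.head_object(Bucket=bucket, Key=key)
  print(head["ContentType"], head.get("StorageClass"), head["Metadata"])

  # Metadata cannot be edited in place; copy the object onto itself with
  # MetadataDirective=REPLACE to change it
  s3.copy_object(
      Bucket=bucket,
      Key=key,
      CopySource={"Bucket": bucket, "Key": key},
      MetadataDirective="REPLACE",
      ContentType="text/plain",
      Metadata={"department": "finance", "reviewed": "false"},
  )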

S3 Bucket & Object Operations

  • Listing
    • S3 allows the listing of all the keys within a bucket
    • A single listing request returns a maximum of 1000 object keys, with pagination supported through an indicator in the response that shows whether the result was truncated (see the sketch below)
    • Keys within a bucket can be listed using Prefix and Delimiter.
    • Prefix limits the results to only those keys (a kind of filtering) that begin with the specified prefix, and the delimiter causes the list to roll up all keys that share a common prefix into a single summary list result.
  • Retrieval
    • An object can be retrieved as a whole
    • An object can be retrieved in parts or partially (a specific range of bytes) by using the Range HTTP header.
    • Range HTTP header is helpful
      • if only a partial object is needed for e.g. multiple files were uploaded as a single archive
      • for fault-tolerant downloads where the network connectivity is poor
    • Objects can also be downloaded by sharing Pre-Signed URLs
    • Metadata of the object is returned in the response headers
  • Object Uploads
    • Single Operation – Objects of size up to 5GB can be uploaded in a single PUT operation
    • Multipart upload – must be used for objects of size > 5GB and supports a max size of 5TB. It is recommended for objects larger than 100MB.
    • Pre-Signed URLs can also be used and shared for uploading objects
    • A successful upload can be verified by checking that the request received a successful response. Additionally, the returned ETag can be compared to the MD5 value calculated for the uploaded object.
  • Copying Objects
    • Copying of objects up to 5GB in size can be performed using a single operation, and the multipart upload API must be used for copies larger than 5GB, up to 5TB
    • When an object is copied
      • user-controlled system metadata e.g. storage class and user-defined metadata are also copied.
      • system controlled metadata e.g. the creation date etc is reset
    • Copying Objects can be needed to
      • Create multiple object copies
      • Copy objects across locations or regions
      • Renaming of the objects
      • Change object metadata for e.g. storage class, encryption, etc
      • Updating any metadata for an object requires all the metadata fields to be specified again
  • Deleting Objects
    • S3 allows deletion of a single object or multiple objects (max 1000) in a single call
    • For Non Versioned buckets,
      • the object key needs to be provided and the object is permanently deleted
    • For Versioned buckets,
      • if an object key is provided, S3 inserts a delete marker and the previous current object becomes the non-current object
      • if an object key with a version ID is provided, the object is permanently deleted
      • if the version ID is of the delete marker, the delete marker is removed and the previous non-current version becomes the current version object
    • Deletion can be MFA enabled for adding extra security
  • Restoring Objects from Glacier
    • Objects must be restored before accessing an archived object
    • Restoration of an Object takes time and costs more. Glacier now offers expedited retrievals within minutes.
    • Restoration request also needs to specify the number of days for which the object copy needs to be maintained.
    • During this period, storage cost applies for both the archive and the copy.
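
A minimal sketch, using boto3 with placeholder bucket and key names, tying together the listing, partial retrieval, deletion, and restore operations described above:

  import boto3

  s3 = boto3.client("s3")
  bucket = "example-bucket"  # placeholder

  # Listing: a single call returns at most 1000 keys; the paginator follows the
  # truncation indicator automatically. Prefix filters keys, Delimiter rolls up
  # common prefixes ("folders") into CommonPrefixes.
  paginator = s3.get_paginator("list_objects_v2")
  for page in paginator.paginate(Bucket=bucket, Prefix="logs/2016/", Delimiter="/"):
      for common in page.get("CommonPrefixes", []):
          print("prefix:", common["Prefix"])
      for obj in page.get("Contents", []):
          print("key:", obj["Key"], obj["Size"])

  # Partial retrieval: fetch only the first 1 KB of an object using the Range header
  partial = s3.get_object(Bucket=bucket, Key="logs/2016/01/app.log", Range="bytes=0-1023")
  first_kb = partial["Body"].read()

  # Deletion: up to 1000 keys can be deleted in a single request
  s3.delete_objects(
      Bucket=bucket,
      Delete={"Objects": [{"Key": "logs/2015/01/app.log"}, {"Key": "logs/2015/02/app.log"}]},
  )

  # Restore an archived (Glacier storage class) object before it can be read;
  # the restored copy is kept for the requested number of days
  s3.restore_object(
      Bucket=bucket,
      Key="archive/2014/app.log",
      RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Expedited"}},
  )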

Pre-Signed URLs

  • All buckets and objects are by default private.
  • Pre-signed URLs allow a user to download or upload a specific object without requiring AWS security credentials or permissions.
  • Pre-signed URL allows anyone to access the object identified in the URL, provided the creator of the URL has permission to access that object.
  • Pre-signed URL creation requires the creator to provide security credentials, a bucket name, an object key, an HTTP method (GET for downloading objects & PUT for uploading objects), and an expiration date and time
  • Pre-signed URLs are valid only till the expiration date & time.
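
A short boto3 sketch of generating pre-signed URLs for download and upload (bucket, keys, and expiry values are placeholders); the URLs carry the permissions of the credentials that signed them and stop working after ExpiresIn seconds:

  import boto3
  import requests  # any HTTP client can use the URLs; requests is used here for illustration

  s3 = boto3.client("s3")
  bucket = "example-bucket"  # placeholder

  # Pre-signed GET URL, valid for 1 hour
  download_url = s3.generate_presigned_url(
      "get_object",
      Params={"Bucket": bucket, "Key": "private/report.pdf"},
      ExpiresIn=3600,
  )

  # Pre-signed PUT URL for uploads, valid for 15 minutes
  upload_url = s3.generate_presigned_url(
      "put_object",
      Params={"Bucket": bucket, "Key": "uploads/new-file.bin"},
      ExpiresIn=900,
  )

  # Anyone holding the URLs can use them without AWS credentials, until they expire
  requests.put(upload_url, data=b"some bytes")
  response = requests.get(download_url)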

Multipart Upload

  • Multipart upload allows the user to upload a single large object as a set of parts. Each part is a contiguous portion of the object’s data.
  • Multipart uploads support 1 to 10000 parts and each part can be from 5MB to 5GB with the last part size allowed to be less than 5MB
  • Multipart uploads allow a max upload size of 5TB
  • Object parts can be uploaded independently and in any order. If transmission of any part fails, it can be retransmitted without affecting other parts.
  • After all parts of the object are uploaded and the complete request is initiated, S3 assembles these parts and creates the object.
  • Using multipart upload provides the following advantages:
    • Improved throughput – parallel upload of parts to improve throughput
    • Quick recovery from any network issues – Smaller part size minimizes the impact of restarting a failed upload due to a network error.
    • Pause and resume object uploads – Object parts can be uploaded over time. Once a multipart upload is initiated there is no expiry; you must explicitly complete or abort the multipart upload.
    • Begin an upload before the final object size is known – an object can be uploaded as it is being created
  • Three Step process
    • Multipart Upload Initiation
      • Initiation of a Multipart upload request to S3 returns a unique ID for each multipart upload.
      • This ID needs to be provided for each part upload, completion or abort request and listing of parts call.
      • All the Object metadata required needs to be provided during the Initiation call
    • Parts Upload
      • Parts upload of objects can be performed using the unique upload ID
      • A part number (between 1 – 10000) needs to be specified with each request which identifies each part and its position in the object
      • If a part with the same part number is uploaded, the previous part would be overwritten
      • After the part upload is successful, S3 returns an ETag header in the response which must be recorded along with the part number to be provided during the multipart completion request
    • Multipart Upload Completion or Abort
      • On Multipart Upload Completion request, S3 creates an object by concatenating the parts in ascending order based on the part number and associates the metadata with the object
      • Multipart Upload Completion request should include the unique upload ID with all the parts and the ETag information
      • The response includes an ETag that uniquely identifies the combined object data
      • On a Multipart Upload Abort request, the upload is aborted and all uploaded parts are removed. Any new part upload for that upload ID would fail. However, any in-progress part uploads might still complete, so the abort request may need to be repeated after those part uploads have finished.
      • S3 must receive a multipart upload completion or abort request; otherwise, it will not delete the parts and their storage will continue to be charged.
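
A minimal sketch of the three-step flow using the low-level boto3 multipart API (file, bucket, and key names are placeholders); in practice the higher-level upload_file helper performs multipart uploads automatically above a size threshold:

  import boto3

  s3 = boto3.client("s3")
  bucket, key = "example-bucket", "videos/big-file.mp4"  # placeholders
  part_size = 8 * 1024 * 1024  # 8 MB parts (every part except the last must be >= 5 MB)

  # Step 1: initiate the upload and record the unique upload ID
  upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

  parts = []
  try:
      # Step 2: upload parts; the ETag returned for each part must be recorded
      with open("big-file.mp4", "rb") as f:
          part_number = 1
          while True:
              data = f.read(part_size)
              if not data:
                  break
              resp = s3.upload_part(
                  Bucket=bucket, Key=key, UploadId=upload_id,
                  PartNumber=part_number, Body=data,
              )
              parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
              part_number += 1

      # Step 3: complete the upload; S3 assembles the parts in part-number order
      s3.complete_multipart_upload(
          Bucket=bucket, Key=key, UploadId=upload_id,
          MultipartUpload={"Parts": parts},
      )
  except Exception:
      # Abort on failure so the uploaded parts do not keep accruing storage charges
      s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
      raise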

S3 Transfer Acceleration

  • S3 Transfer Acceleration enables fast, easy, and secure transfers of files over long distances between the client and a bucket.
  • Transfer Acceleration takes advantage of CloudFront’s globally distributed edge locations. As the data arrives at an edge location, data is routed to S3 over an optimized network path.
  • Transfer Acceleration incurs additional charges, whereas uploading data to S3 through the public Internet is free.
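
A short boto3 sketch of enabling Transfer Acceleration on a bucket and uploading through the accelerate endpoint (bucket and file names are placeholders):

  import boto3
  from botocore.config import Config

  bucket = "example-bucket"  # placeholder

  # Enable Transfer Acceleration on the bucket (one-time configuration)
  boto3.client("s3").put_bucket_accelerate_configuration(
      Bucket=bucket,
      AccelerateConfiguration={"Status": "Enabled"},
  )

  # A client configured to use the accelerate endpoint routes uploads through
  # the nearest edge location and onto an optimized network path to S3
  s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
  s3_accel.upload_file("big-file.zip", bucket, "uploads/big-file.zip")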

S3 Batch Operations

  • S3 Batch Operations help perform large-scale batch operations on S3 objects and can perform a single operation on lists of specified S3 objects.
  • A single job can perform a specified operation on billions of objects containing exabytes of data.
  • S3 tracks progress, sends notifications, and stores a detailed completion report of all actions, providing a fully managed, auditable, and serverless experience.
  • Batch Operations can be used with S3 Inventory to get the object list and use S3 Select to filter the objects.
  • Batch Operations can be used for copying objects, modifying object metadata, applying ACLs, encrypting objects, transforming objects, invoking a custom Lambda function, etc.
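
A hedged sketch of creating a Batch Operations job with the S3 Control API that tags every object listed in a CSV manifest; the account ID, role ARN, bucket ARNs, and manifest ETag are placeholders, and the nested parameter shapes should be checked against the current boto3 s3control documentation:

  import boto3

  s3control = boto3.client("s3control")

  s3control.create_job(
      AccountId="111122223333",  # placeholder account ID
      ConfirmationRequired=False,
      Priority=10,
      RoleArn="arn:aws:iam::111122223333:role/batch-ops-role",  # placeholder role
      # The operation applied to every object listed in the manifest
      Operation={"S3PutObjectTagging": {"TagSet": [{"Key": "project", "Value": "archive"}]}},
      # CSV manifest with one bucket,key pair per line; an S3 Inventory report can be used instead
      Manifest={
          "Spec": {"Format": "S3BatchOperations_CSV_20180820", "Fields": ["Bucket", "Key"]},
          "Location": {
              "ObjectArn": "arn:aws:s3:::example-manifest-bucket/manifest.csv",  # placeholder
              "ETag": "example-manifest-etag",  # placeholder ETag of the manifest object
          },
      },
      # Detailed completion report written back to S3
      Report={
          "Enabled": True,
          "Bucket": "arn:aws:s3:::example-report-bucket",  # placeholder
          "Format": "Report_CSV_20180820",
          "Prefix": "batch-reports",
          "ReportScope": "AllTasks",
      },
  )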

Virtual Hosted Style vs Path-Style Request

S3 allows the buckets and objects to be referred to in Path-style or Virtual hosted-style URLs

Path-style

  • Bucket name is not part of the domain (unless region specific endpoint used)
  • Endpoint used must match the region in which the bucket resides for e.g, if you have a bucket called mybucket that resides in the EU (Ireland) region with object named puppy.jpg, the correct path-style syntax URI is http://s3-eu-west-1.amazonaws.com/mybucket/puppy.jpg.
  • A “PermanentRedirect” error is received with an HTTP response code 301, and a message indicating what the correct URI is for the resource if a bucket is accessed outside the US East (N. Virginia) region with path-style syntax that uses either of the following:
    • http://s3.amazonaws.com
    • An endpoint for a region different from the one where the bucket resides for e.g., if you use http://s3-eu-west-1.amazonaws.com for a bucket that was created in the US West (N. California) region
  • Path-style requests are deprecated and are not supported for buckets created after September 30, 2020

Virtual hosted-style

  • S3 supports virtual hosted-style and path-style access in all regions.
  • In a virtual-hosted-style URL, the bucket name is part of the domain name in the URL for e.g. http://bucketname.s3.amazonaws.com/objectname
  • S3 virtual hosting can be used to address a bucket in a REST API call by using the HTTP Host header
  • Benefits
    • attractiveness of customized URLs,
    • provides an ability to publish to the “root directory” of the bucket’s virtual server. This ability can be important because many existing applications search for files in this standard location.
  • S3 updates DNS to reroute the request to the correct location when a bucket is created in any region, which might take time.
  • By default, if the US East (N. Virginia) endpoint s3.amazonaws.com is used instead of the region-specific endpoint (e.g. s3-eu-west-1.amazonaws.com), S3 routes any virtual hosted-style request to the US East (N. Virginia) region and redirects it with an HTTP 307 redirect to the correct region.
  • When using virtual hosted-style buckets with SSL, the SSL wildcard certificate only matches buckets that do not contain periods. To work around this, use HTTP or write your own certificate verification logic.
  • If you make a request to the http://bucket.s3.amazonaws.com endpoint, the DNS has sufficient information to route the request directly to the region where your bucket resides.
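
A small boto3 sketch showing how the addressing style can be selected explicitly (the SDK picks virtual hosted-style by default where possible); the bucket and key names are placeholders:

  import boto3
  from botocore.config import Config

  bucket = "example-bucket"  # placeholder

  # Virtual hosted-style: the bucket name becomes part of the host name,
  # e.g. https://example-bucket.s3.eu-west-1.amazonaws.com/puppy.jpg
  virtual = boto3.client(
      "s3", region_name="eu-west-1",
      config=Config(s3={"addressing_style": "virtual"}),
  )
  virtual.get_object(Bucket=bucket, Key="puppy.jpg")

  # Path-style: the bucket name is part of the path,
  # e.g. https://s3.eu-west-1.amazonaws.com/example-bucket/puppy.jpg
  path = boto3.client(
      "s3", region_name="eu-west-1",
      config=Config(s3={"addressing_style": "path"}),
  )
  path.get_object(Bucket=bucket, Key="puppy.jpg")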

S3 Pricing

  • S3 costs vary by Region
  • Charges are incurred for
    • Storage – cost is per GB/month
    • Requests – per-request cost varies depending on the request type, e.g. GET, PUT
    • Data Transfer
      • data transfer in is free
      • data transfer out is charged per GB (except to the same region or to Amazon CloudFront)

Additional Topics

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. What does Amazon S3 stand for?
    1. Simple Storage Solution.
    2. Storage Storage Storage (triple redundancy Storage).
    3. Storage Server Solution.
    4. Simple Storage Service
  2. What are characteristics of Amazon S3? Choose 2 answers
    1. Objects are directly accessible via a URL
    2. S3 should be used to host a relational database
    3. S3 allows you to store objects of virtually unlimited size
    4. S3 allows you to store virtually unlimited amounts of data
    5. S3 offers Provisioned IOPS
  3. You are building an automated transcription service in which Amazon EC2 worker instances process an uploaded audio file and generate a text file. You must store both of these files in the same durable storage until the text file is retrieved. You do not know what the storage capacity requirements are. Which storage option is both cost-efficient and scalable?
    1. Multiple Amazon EBS volume with snapshots
    2. A single Amazon Glacier vault
    3. A single Amazon S3 bucket
    4. Multiple instance stores
  4. A user wants to upload a complete folder to AWS S3 using the S3 Management console. How can the user perform this activity?
    1. Just drag and drop the folder using the flash tool provided by S3
    2. Use the Enable Enhanced Folder option from the S3 console while uploading objects
    3. The user cannot upload the whole folder in one go with the S3 management console
    4. Use the Enable Enhanced Uploader option from the S3 console while uploading objects (NOTE – it is no longer supported by AWS)
  5. A media company produces new video files on-premises every day with a total size of around 100GB after compression. All files have a size of 1-2 GB and need to be uploaded to Amazon S3 every night in a fixed time window between 3am and 5am. Current upload takes almost 3 hours, although less than half of the available bandwidth is used. What step(s) would ensure that the file uploads are able to complete in the allotted time window?
    1. Increase your network bandwidth to provide faster throughput to S3
    2. Upload the files in parallel to S3 using multipart upload
    3. Pack all files into a single archive, upload it to S3, then extract the files in AWS
    4. Use AWS Import/Export to transfer the video files
  6. A company is deploying a two-tier, highly available web application to AWS. Which service provides durable storage for static content while utilizing lower Overall CPU resources for the web tier?
    1. Amazon EBS volume
    2. Amazon S3
    3. Amazon EC2 instance store
    4. Amazon RDS instance
  7. You have an application running on an Amazon Elastic Compute Cloud instance, that uploads 5 GB video objects to Amazon Simple Storage Service (S3). Video uploads are taking longer than expected, resulting in poor application performance. Which method will help improve performance of your application?
    1. Enable enhanced networking
    2. Use Amazon S3 multipart upload
    3. Leveraging Amazon CloudFront, use the HTTP POST method to reduce latency.
    4. Use Amazon Elastic Block Store Provisioned IOPs and use an Amazon EBS-optimized instance
  8. When you put objects in Amazon S3, what is the indication that an object was successfully stored?
    1. Each S3 account has a special bucket named_s3_logs. Success codes are written to this bucket with a timestamp and checksum.
    2. A success code is inserted into the S3 object metadata.
    3. A HTTP 200 result code and MD5 checksum, taken together, indicate that the operation was successful.
    4. Amazon S3 is engineered for 99.999999999% durability. Therefore there is no need to confirm that data was inserted.
  9. You have private video content in S3 that you want to serve to subscribed users on the Internet. User IDs, credentials, and subscriptions are stored in an Amazon RDS database. Which configuration will allow you to securely serve private content to your users?
    1. Generate pre-signed URLs for each user as they request access to protected S3 content
    2. Create an IAM user for each subscribed user and assign the GetObject permission to each IAM user
    3. Create an S3 bucket policy that limits access to your private content to only your subscribed users’ credentials
    4. Create a CloudFront Origin Identity user for your subscribed users and assign the GetObject permission to this user
  10. You run an ad-supported photo sharing website using S3 to serve photos to visitors of your site. At some point you find out that other sites have been linking to the photos on your site, causing loss to your business. What is an effective method to mitigate this?
    1. Remove public read access and use signed URLs with expiry dates.
    2. Use CloudFront distributions for static content.
    3. Block the IPs of the offending websites in Security Groups.
    4. Store photos on an EBS volume of the web server.
  11. You are designing a web application that stores static assets in an Amazon Simple Storage Service (S3) bucket. You expect this bucket to immediately receive over 150 PUT requests per second. What should you do to ensure optimal performance?
    1. Use multi-part upload.
    2. Add a random prefix to the key names.
    3. Amazon S3 will automatically manage performance at this scale. (With latest S3 performance improvements, S3 scaled automatically)
    4. Use a predictable naming scheme, such as sequential numbers or date time sequences, in the key names
  12. What is the maximum number of S3 buckets available per AWS Account?
    1. 100 Per region
    2. There is no Limit
    3. 100 Per Account (Refer documentation)
    4. 500 Per Account
    5. 100 Per IAM User
  13. Your customer needs to create an application to allow contractors to upload videos to Amazon Simple Storage Service (S3) so they can be transcoded into a different format. She creates AWS Identity and Access Management (IAM) users for her application developers, and in just one week, they have the application hosted on a fleet of Amazon Elastic Compute Cloud (EC2) instances. The attached IAM role is assigned to the instances. As expected, a contractor who authenticates to the application is given a pre-signed URL that points to the location for video upload. However, contractors are reporting that they cannot upload their videos. Which of the following are valid reasons for this behavior? Choose 2 answers { “Version”: “2012-10-17”, “Statement”: [ { “Effect”: “Allow”, “Action”: “s3:*”, “Resource”: “*” } ] }
    1. The IAM role does not explicitly grant permission to upload the object. (The role has all permissions for all activities on S3)
    2. The contractors’ accounts have not been granted “write” access to the S3 bucket. (with pre-signed URLs the contractors’ accounts don’t need access; only the creator of the pre-signed URLs does)
    3. The application is not using valid security credentials to generate the pre-signed URL.
    4. The developers do not have access to upload objects to the S3 bucket. (the developers are not uploading the objects; the uploads use pre-signed URLs)
    5. The S3 bucket still has the associated default permissions. (does not matter as long as the user has permission to upload)
    6. The pre-signed URL has expired.

AWS S3 Data Protection

AWS S3 Data Protection

  • S3 provides S3 data protection using highly durable storage infrastructure designed for mission-critical and primary data storage.
  • Objects are redundantly stored on multiple devices across multiple facilities in an S3 region.
  • S3 PUT and PUT Object copy operations synchronously store the data across multiple facilities before returning SUCCESS.
  • Once the objects are stored, S3 maintains its durability by quickly detecting and repairing any lost redundancy.
  • S3 also regularly verifies the integrity of data stored using checksums. If S3 detects data corruption, it is repaired using redundant data.
  • In addition, S3 calculates checksums on all network traffic to detect corruption of data packets when storing or retrieving data
  • Data protection against accidental overwrites and deletions can be added by enabling Versioning to preserve, retrieve and restore every version of the object stored
  • S3 also provides the ability to protect data in transit (as it travels to and from S3) and at rest (while it is stored in S3)
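
A small sketch of a client-side integrity check, assuming a single-part upload without SSE-KMS (only then is the returned ETag the hex MD5 of the object data; for multipart or KMS-encrypted uploads it is not). Bucket, key, and file names are placeholders:

  import hashlib
  import boto3

  s3 = boto3.client("s3")
  bucket, key = "example-bucket", "backups/dump.bin"  # placeholders

  with open("dump.bin", "rb") as f:
      data = f.read()
  local_md5 = hashlib.md5(data).hexdigest()

  # A 200 response means the complete object was durably stored across multiple facilities
  resp = s3.put_object(Bucket=bucket, Key=key, Body=data)

  # For single-part, non-KMS uploads the returned ETag is the MD5 of the data
  assert resp["ETag"].strip('"') == local_md5, "upload corrupted in transit"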

S3 Encryption

Refer blog post @ S3 Encryption

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. A customer is leveraging Amazon Simple Storage Service in eu-west-1 to store static content for a web-based property. The customer is storing objects using the Standard Storage class. Where are the customers objects replicated?
    1. A single facility in eu-west-1 and a single facility in eu-central-1
    2. A single facility in eu-west-1 and a single facility in us-east-1
    3. Multiple facilities in eu-west-1
    4. A single facility in eu-west-1
  2. A system admin is planning to encrypt all objects being uploaded to S3 from an application. The system admin does not want to implement his own encryption algorithm; instead he is planning to use server side encryption by supplying his own key (SSE-C). Which parameter is not required while making a call for SSE-C?
    1. x-amz-server-side-encryption-customer-key-AES-256
    2. x-amz-server-side-encryption-customer-key
    3. x-amz-server-side-encryption-customer-algorithm
    4. x-amz-server-side-encryption-customer-key-MD5

References

AWS_S3_Security

AWS S3 Permissions

AWS S3 Permissions

  • By default, all S3 buckets, objects, and related subresources are private.
  • Only the Resource owner, the AWS account (not the user) that creates the resource, can access the resource.
  • Resource owner can be
    • AWS account that creates the bucket or object owns those resources
    • If an IAM user creates the bucket or object, the AWS account of the IAM user owns the resource
    • If the bucket owner grants cross-account permissions to other AWS account users to upload objects to the buckets, the objects are owned by the AWS account of the user who uploaded the object and not the bucket owner except for the following conditions
      • Bucket owner can deny access to the object, as it is still the bucket owner who pays for the object
      • Bucket owner can delete or apply archival rules to the object and perform restoration
  • User is the AWS account or the IAM user who accesses the resource
  • Bucket owner is the AWS account that created the bucket
  • Object owner is the AWS account that uploaded the object, even when the bucket is owned by another account
  • S3 permissions are classified into
    • Resource based policies and
    • User policies

User Policies

  • User policies use IAM with S3 to control the type of access a user or group of users has to specific parts of an S3 bucket the AWS account owns
  • User policy is always attached to a User, Group, or a Role
  • Anonymous permissions cannot be granted
  • If an AWS account that owns a bucket wants to grant permission to users in its account, it can use either a bucket policy or a user policy

Resource-Based policies

  • Bucket policies and access control lists (ACLs) are resource-based because they are attached to the  S3 resources


Bucket Policies

  • Bucket policy can be used to grant cross-account access to other AWS accounts or IAM users in other accounts for the bucket and objects in it.
  • Bucket policies provide centralized access control to buckets and objects based on a variety of conditions, including S3 operations, requesters, resources, and aspects of the request (e.g. IP address)
  • If an AWS account that owns a bucket wants to grant permission to users in its account, it can use either a bucket policy or a user policy
  • Permissions attached to a bucket apply to all of the objects in that bucket created and owned by the bucket owner
  • Policies can either add or deny permissions across all (or a subset) of objects within a bucket
  • Only the bucket owner is allowed to associate a policy with a bucket
  • Bucket policies can cater to multiple use cases
    • Granting permissions to multiple accounts with added conditions
    • Granting read-only permission to an anonymous user
    • Limiting access to specific IP addresses
    • Restricting access to a specific HTTP referer
    • Restricting access to a specific HTTP header for e.g. to enforce encryption
    • Granting permission to a CloudFront OAI
    • Adding a bucket policy to require MFA
    • Granting cross-account permissions to upload objects while ensuring the bucket owner has full control
    • Granting permissions for S3 inventory and Amazon S3 analytics
    • Granting permissions for S3 Storage Lens
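
A minimal sketch of attaching a bucket policy that limits object reads to a specific IP range, one of the use cases listed above (the bucket name and CIDR block are placeholders):

  import json
  import boto3

  s3 = boto3.client("s3")
  bucket = "example-bucket"  # placeholder

  policy = {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Sid": "AllowReadsFromCorporateRange",
              "Effect": "Allow",
              "Principal": "*",
              "Action": "s3:GetObject",
              "Resource": f"arn:aws:s3:::{bucket}/*",
              # Only requests originating from this CIDR block are allowed
              "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}},
          }
      ],
  }

  # Only the bucket owner can associate a policy with the bucket
  s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))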

Access Control Lists (ACLs)

  • Each bucket and object has an ACL associated with it.
  • An ACL is a list of grants identifying grantee and permission granted
  • ACLs are used to grant basic read/write permissions on resources to other AWS accounts.
  • ACL supports limited permissions set and
    • cannot grant conditional permissions, nor can you explicitly deny permissions
    • cannot be used to grant permissions for bucket subresources
  • Permission can be granted to an AWS account using the email address or the canonical user ID (which is just an obfuscated account ID). If an email address is provided, S3 will still find the canonical user ID for the user and add it to the ACL.
  • It is recommended to use the canonical user ID, as granting by email address is not supported in all regions
  • Bucket ACL
    • Only recommended use case for the bucket ACL is to grant write permission to the S3 Log Delivery group to write access log objects to the bucket
    • Bucket ACL will help grant write permission on the bucket to the Log Delivery group if access log delivery is needed to the bucket
    • The only way to grant the necessary permissions to the Log Delivery group is via a bucket ACL
  • Object ACL
    • Object ACLs control only Object-level Permissions
    • Object ACL is the only way to manage permission to an object in the bucket not owned by the bucket owner i.e. If the bucket owner allows cross-account object uploads and if the object owner is different from the bucket owner, the only way for the object owner to grant permissions on the object is through Object ACL
    • If the Bucket and Object is owned by the same AWS account, Bucket policy can be used to manage the permissions
    • If the Object and User is owned by the same AWS account, User policy can be used to manage the permissions
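
A short sketch of the two recommended ACL uses above, expressed with canned ACLs (bucket and key names are placeholders): granting the Log Delivery group write access on a log bucket, and a cross-account uploader granting the bucket owner full control on the uploaded object:

  import boto3

  s3 = boto3.client("s3")

  # Bucket ACL: allow the S3 Log Delivery group to write server access logs
  s3.put_bucket_acl(Bucket="example-log-bucket", ACL="log-delivery-write")  # placeholder bucket

  # Object ACL: a cross-account uploader grants the bucket owner full control,
  # so the bucket owner is not locked out of an object it pays for
  s3.put_object(
      Bucket="example-shared-bucket",  # bucket owned by another AWS account
      Key="uploads/report.csv",
      Body=b"...",
      ACL="bucket-owner-full-control",
  )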

S3 Request Authorization

When S3 receives a request, it must evaluate all the user policies, bucket policies, and ACLs to determine whether to authorize or deny the request.

S3 evaluates the policies in 3 contexts

  • User context is basically the context in which S3 evaluates the User policy that the parent AWS account (context authority) attaches to the user
  • Bucket context is the context in which S3 evaluates the access policies owned by the bucket owner (context authority) to check if the bucket owner has not explicitly denied access to the resource
  • Object context is the context where S3 evaluates policies owned by the Object owner (context authority)

Analogy

  • Consider 3 Parents (AWS Account) A, B and C with Child (IAM User) AA, BA and CA respectively
  • Parent A owns a Toy box (Bucket) with Toy AAA and also allows toys (Objects) to be dropped and picked up
  • Parent A can grant permission (User Policy OR Bucket policy OR both) to his Child AA to access the Toy box and the toys
  • Parent A can grant permissions (Bucket policy) to Parent B (different AWS account) to drop toys into the toys box. Parent B can grant permissions (User policy) to his Child BA to drop Toy BAA
  • Parent B can grant permissions (Object ACL) to Parent A to access Toy BAA
  • Parent A can grant permissions (Bucket Policy) to Parent C to pick up the Toy AAA who in turn can grant permission (User Policy) to his Child CA to access the toy
  • Parent A can grant permission (through IAM Role) to Parent C to pick up the Toy BAA who in turn can grant permission (User Policy) to his Child CA to access the toy

Bucket Operation Authorization


  1. If the requester is an IAM user, the user must have permission (User Policy) from the parent AWS account to which it belongs
  2. Amazon S3 evaluates a subset of policies owned by the parent account. This subset of policies includes the user policy that the parent account attaches to the user.
  3. If the parent also owns the resource in the request (in this case, the bucket), Amazon S3 also evaluates the corresponding resource policies (bucket policy and bucket ACL) at the same time.
  4. Requester must also have permissions (Bucket Policy or ACL) from the bucket owner to perform a specific bucket operation.
  5. Amazon S3 evaluates a subset of policies owned by the AWS account that owns the bucket. The bucket owner can grant permission by using a bucket policy or bucket ACL.
  6. Note that, if the AWS account that owns the bucket is also the parent account of an IAM user, then it can configure bucket permissions in a user policy or bucket policy or both

Object Operation Authorization


  1. If the requester is an IAM user, the user must have permission (User Policy) from the parent AWS account to which it belongs.
  2. Amazon S3 evaluates a subset of policies owned by the parent account. This subset of policies includes the user policy that the parent attaches to the user.
  3. If the parent also owns the resource in the request (bucket, object), Amazon S3 evaluates the corresponding resource policies (bucket policy, bucket ACL, and object ACL) at the same time.
  4. If the parent AWS account owns the resource (bucket or object), it can grant resource permissions to its IAM user by using either the user policy or the resource policy.
  5. S3 evaluates policies owned by the AWS account that owns the bucket.
  6. If the AWS account that owns the object in the request is not the same as the bucket owner, in the bucket context Amazon S3 checks the policies if the bucket owner has explicitly denied access to the object.
  7. If there is an explicit deny set on the object, Amazon S3 does not authorize the request.
  8. Requester must have permissions from the object owner (Object ACL) to perform a specific object operation.
  9. Amazon S3 evaluates the object ACL.
  10. If bucket and object owners are the same, access to the object can be granted in the bucket policy, which is evaluated in the bucket context.
  11. If the owners are different, the object owners must use an object ACL to grant permissions.
  12. If the AWS account that owns the object is also the parent account to which the IAM user belongs, it can configure object permissions in a user policy, which is evaluated in the user context.

Permission Delegation

  • If an AWS account owns a resource, it can grant those permissions to another AWS account.
  • That account can then delegate those permissions, or a subset of them, to users in the account. This is referred to as permission delegation.
  • But an account that receives permissions from another account cannot delegate permission cross-account to another AWS account.
  • If the bucket owner wants to grant another AWS account permission on an object it does not own, it cannot do so through cross-account permissions; it needs to define an IAM role that can be assumed by that AWS account to gain access

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Which features can be used to restrict access to data in S3? Choose 2 answers
    1. Set an S3 ACL on the bucket or the object.
    2. Create a CloudFront distribution for the bucket.
    3. Set an S3 bucket policy.
    4. Enable IAM Identity Federation
    5. Use S3 Virtual Hosting
  2. Which method can be used to prevent an IP address block from accessing public objects in an S3 bucket?
    1. Create a bucket policy and apply it to the bucket
    2. Create a NACL and attach it to the VPC of the bucket
    3. Create an ACL and apply it to all objects in the bucket
    4. Modify the IAM policies of any users that would access the bucket
  3. A user has granted read/write permission of his S3 bucket using ACL. Which of the below mentioned options is a valid ID to grant permission to other AWS accounts (grantee) using ACL?
    1. IAM User ID
    2. S3 Secure ID
    3. Access ID
    4. Canonical user ID
  4. A root account owner has given full access of his S3 bucket to one of the IAM users using the bucket ACL. When the IAM user logs in to the S3 console, which actions can he perform?
    1. He can just view the content of the bucket
    2. He can do all the operations on the bucket
    3. It is not possible to give access to an IAM user using ACL
    4. The IAM user can perform all operations on the bucket using only API/SDK
  5. A root AWS account owner is trying to understand various options to set the permission to AWS S3. Which of the below mentioned options is not the right option to grant permission for S3?
    1. User Access Policy
    2. S3 Object Policy
    3. S3 Bucket Policy
    4. S3 ACL
  6. A system admin is managing buckets, objects and folders with AWS S3. Which of the below mentioned statements is true and should be taken in consideration by the sysadmin?
    1. Folders support only ACL
    2. Both the object and bucket can have an Access Policy but folder cannot have policy
    3. Folders can have a policy
    4. Both the object and bucket can have ACL but folders cannot have ACL
  7. A user has created an S3 bucket which is not publicly accessible. The bucket is having thirty objects which are also private. If the user wants to make the objects public, how can he configure this with minimal efforts?
    1. User should select all objects from the console and apply a single policy to mark them public
    2. User can write a program which programmatically makes all objects public using S3 SDK
    3. Set the AWS bucket policy which marks all objects as public
    4. Make the bucket ACL as public so it will also mark all objects as public
  8. You need to configure an Amazon S3 bucket to serve static assets for your public-facing web application. Which methods ensure that all objects uploaded to the bucket are set to public read? Choose 2 answers
    1. Set permissions on the object to public read during upload.
    2. Configure the bucket ACL to set all objects to public read.
    3. Configure the bucket policy to set all objects to public read.
    4. Use AWS Identity and Access Management roles to set the bucket to public read.
    5. Amazon S3 objects default to public read, so no action is needed.
  9. Amazon S3 doesn’t automatically give a user who creates _____ permission to perform other actions on that bucket or object.
    1. a file
    2. a bucket or object
    3. a bucket or file
    4. a object or file
  10. A root account owner is trying to understand the S3 bucket ACL. Which of the below mentioned options cannot be used to grant ACL on the object using the authorized predefined group?
    1. Authenticated user group
    2. All users group
    3. Log Delivery Group
    4. Canonical user group
  11. A user is enabling logging on a particular bucket. Which of the below mentioned options may be best suitable to allow access to the log bucket?
    1. Create an IAM policy and allow log access
    2. It is not possible to enable logging on the S3 bucket
    3. Create an IAM Role, which has access to the log bucket
    4. Provide ACL for the logging group
  12. A user is trying to configure access with S3. Which of the following options is not possible to provide access to the S3 bucket / object?
    1. Define the policy for the IAM user
    2. Define the ACL for the object
    3. Define the policy for the object
    4. Define the policy for the bucket
  13. A user is having access to objects of an S3 bucket, which is not owned by him. If he is trying to set the objects of that bucket public, which of the below mentioned options may be a right fit for this action?
    1. Make the bucket public with full access
    2. Define the policy for the bucket
    3. Provide ACL on the object
    4. Create an IAM user with permission
  14. A bucket owner has allowed another account’s IAM users to upload or access objects in his bucket. The IAM user of Account A is trying to access an object created by the IAM user of account B. What will happen in this scenario?
    1. The bucket policy may not be created as S3 will give error due to conflict of Access Rights
    2. It is not possible to give permission to multiple IAM users
    3. AWS S3 will verify proper rights given by the owner of Account A, the bucket owner as well as by the IAM user B to the object
    4. It is not possible that the IAM user of one account accesses objects of the other IAM user

References

AWS_S3_Access_Control

AWS S3 Object Lifecycle Management

S3 Lifecycle Management

S3 Object Lifecycle Management

  • S3 Object lifecycle can be managed by using a lifecycle configuration, which defines how S3 manages objects during their lifetime.
  • Lifecycle configuration enables simplification of object lifecycle management, e.g. moving less frequently accessed objects to cheaper storage classes, backup or archival of data for several years, or permanent deletion of objects
  • S3 controls all transitions automatically
  • Lifecycle Management rules applied to a bucket are applicable to all the existing objects in the bucket as well as the ones added later
  • S3 Object lifecycle management allows 2 types of behavior
    • Transition in which the storage class for the objects changes
    • Expiration where the objects expire and are permanently deleted
  • Lifecycle Management can be configured with Versioning, which allows storage of one current object version and zero or more non-current object versions
  • Object’s lifecycle management applies to both Non Versioning and Versioning enabled buckets
  • For Non Versioned buckets
    • Transitioning period is considered from the object’s creation date
  • For Versioned buckets,
    • Transitioning period for the current object is calculated from the object creation date
    • Transitioning period for a non-current object is calculated from the date when the object became a noncurrent versioned object
    • S3 uses the number of days since its successor was created as the number of days an object is noncurrent.
  • S3 calculates the time by adding the number of days specified in the rule to the object creation time and rounding the resulting time up to the next day's midnight UTC e.g. if an object was created on 15/1/2016 at 10:30 AM UTC and 3 days are specified in a transition rule, the result is 18/1/2016 10:30 AM UTC, rounded off to the next midnight, 19/1/2016 00:00 UTC.
  • Lifecycle configuration on MFA-enabled buckets is not supported.
  • 1000 lifecycle rules can be configured per bucket

S3 Object Lifecycle Management Rules

S3 Lifecycle Management

Lifecycle Transitions Constraints

  1. STANDARD -> (128 KB & 30 days) -> STANDARD-IA or One Zone-IA or S3 Intelligent-Tiering
    • Larger Objects – Only objects with a size more than 128 KB can be transitioned, as cost benefits for transitioning to STANDARD-IA or One Zone-IA can be realized only for larger objects
    • Smaller Objects < 128 KB – S3 does not transition objects that are smaller than 128 KB
    • Minimum 30 days – Objects must be stored for at least 30 days in the current storage class before being transitioned to the STANDARD-IA or One Zone-IA, as younger objects are accessed more frequently or deleted sooner than is suitable for STANDARD-IA or One Zone-IA
  2. GLACIER -> (90 days) -> Permanent Deletion OR GLACIER Deep Archive -> (180 days) -> Permanent Deletion
    • Deleting data that is archived to Glacier is free if the objects deleted are archived for three months or longer.
    • S3 charges a prorated early deletion fee if the object is deleted or overwritten within three months of archiving it.
  3. Archival of objects to Glacier by using object lifecycle management is performed asynchronously and there may be a delay between the transition date in the lifecycle configuration rule and the date of the physical transition. However, AWS charges Glacier prices based on the transition date specified in the rule
  4. For a versioning-enabled bucket
    • Transition and Expiration actions apply to current versions.
    • NoncurrentVersionTransition and NoncurrentVersionExpiration actions apply to noncurrent versions and work similarly to the non-versioned objects except the time period is from the time the objects became noncurrent
  5. Expiration Rules
    • For Non Versioned bucket
      • Object is permanently deleted
    • For Versioned bucket
      • Expiration is applicable to the Current object only and does not impact any of the non-current objects
      • S3 will insert a Delete Marker object with a unique id and the previous current object becomes a non-current version
      • S3 will not take any action if the Current object is a Delete Marker
      • If the bucket has a single object which is the Delete Marker (referred to as expired object delete marker), S3 removes the Delete Marker
    • For Versioned Suspended bucket
      • S3 will insert a Delete Marker object with version ID null and overwrite any object with version ID null
  6. When an object reaches the end of its lifetime, S3 queues it for removal and removes it asynchronously. There may be a delay between the expiration date and the date at which S3 removes an object. Charges for storage time associated with an object that has expired are not incurred.
  7. Cost is incurred if objects are expired in STANDARD-IA before 30 days, GLACIER before 90 days, and GLACIER_DEEP_ARCHIVE before 180 days.
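
A minimal boto3 sketch of a lifecycle configuration combining the transition and expiration actions described above (the bucket name, prefix, and day counts are placeholders chosen to satisfy the constraints):

  import boto3

  s3 = boto3.client("s3")

  s3.put_bucket_lifecycle_configuration(
      Bucket="example-bucket",  # placeholder
      LifecycleConfiguration={
          "Rules": [
              {
                  "ID": "archive-then-expire-logs",
                  "Status": "Enabled",
                  "Filter": {"Prefix": "logs/"},
                  # Current versions: STANDARD -> STANDARD_IA after 30 days,
                  # -> GLACIER after 90 days, permanently deleted after 365 days
                  "Transitions": [
                      {"Days": 30, "StorageClass": "STANDARD_IA"},
                      {"Days": 90, "StorageClass": "GLACIER"},
                  ],
                  "Expiration": {"Days": 365},
                  # Noncurrent versions (versioning-enabled buckets): expire 30 days
                  # after the object becomes noncurrent
                  "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
              }
          ]
      },
  )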

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. If an object is stored in the Standard S3 storage class and you want to move it to Glacier, what must you do in order to properly migrate it?
    1. Change the storage class directly on the object.
    2. Delete the object and re-upload it, selecting Glacier as the storage class.
    3. None of the above.
    4. Create a lifecycle policy that will migrate it after a minimum of 30 days. (Any object uploaded to S3 must first be placed into either the Standard, Reduced Redundancy, or Infrequent Access storage class. Once in S3 the only way to move the object to glacier is through a lifecycle policy)
  2. A company wants to store their documents in AWS. Initially, these documents will be used frequently, and after a duration of 6 months, they would not be needed anymore. How would you architect this requirement?
    1. Store the files in Amazon EBS and create a Lifecycle Policy to remove the files after 6 months.
    2. Store the files in Amazon S3 and create a Lifecycle Policy to remove the files after 6 months.
    3. Store the files in Amazon Glacier and create a Lifecycle Policy to remove the files after 6 months.
    4. Store the files in Amazon EFS and create a Lifecycle Policy to remove the files after 6 months.
  3. Your firm has uploaded a large amount of aerial image data to S3. In the past, in your on-premises environment, you used a dedicated group of servers to batch process this data and used RabbitMQ, an open source messaging system, to get job information to the servers. Once processed, the data would go to tape and be shipped offsite. Your manager told you to stay with the current design, and leverage AWS archival storage and messaging services to minimize cost. Which is correct?
    1. Use SQS for passing job messages, use Cloud Watch alarms to terminate EC2 worker instances when they become idle. Once data is processed, change the storage class of the S3 objects to Reduced Redundancy Storage (Need to replace On-Premises Tape functionality)
    2. Setup Auto-Scaled workers triggered by queue depth that use spot instances to process messages in SQS. Once data is processed, change the storage class of the S3 objects to Reduced Redundancy Storage (Need to replace On-Premises Tape functionality)
    3. Setup Auto-Scaled workers triggered by queue depth that use spot instances to process messages in SQS. Once data is processed, change the storage class of the S3 objects to Glacier (Glacier suitable for Tape backup)
    4. Use SNS to pass job messages use Cloud Watch alarms to terminate spot worker instances when they become idle. Once data is processed, change the storage class of the S3 object to Glacier.
  4. You have a proprietary data store on-premises that must be backed up daily by dumping the data store contents to a single compressed 50GB file and sending the file to AWS. Your SLAs state that any dump file backed up within the past 7 days can be retrieved within 2 hours. Your compliance department has stated that all data must be held indefinitely. The time required to restore the data store from a backup is approximately 1 hour. Your on-premise network connection is capable of sustaining 1gbps to AWS. Which backup methods to AWS would be most cost-effective while still meeting all of your requirements?
    1. Send the daily backup files to Glacier immediately after being generated (will not meet the RTO)
    2. Transfer the daily backup files to an EBS volume in AWS and take daily snapshots of the volume (Not cost effective)
    3. Transfer the daily backup files to S3 and use appropriate bucket lifecycle policies to send to Glacier (Store in S3 for seven days and then archive to Glacier)
    4. Host the backup files on a Storage Gateway with Gateway-Cached Volumes and take daily snapshots (Not Cost-effective as local storage as well as S3 storage)

References

AWS_S3_Object_Lifecycle_Management

AWS S3 Data Durability

Sample Exam Question :-
A customer is leveraging Amazon Simple Storage Service in eu-west-1 to store static content for web-based property. The customer is storing objects using the Standard Storage class. Where are the customers’ objects replicated?
  1. Single facility in eu-west-1 and a single facility in eu-central-1
  2. Single facility in eu-west-1 and a single facility in us-east-1
  3. Multiple facilities in eu-west-1
  4. A single facility in eu-west-1
Answer :-
The question targets S3 Data Durability, which mentions that objects are stored redundantly across multiple facilities in the same Amazon S3 region.

Amazon S3 provides a highly durable storage infrastructure designed for mission-critical and primary data storage. Objects are redundantly stored on multiple devices across multiple facilities in an Amazon S3 region. To help better ensure data durability, Amazon S3 PUT and PUT Object copy operations synchronously store your data across multiple facilities before returning SUCCESS. Once the objects are stored, Amazon S3 maintains their durability by quickly detecting and repairing any lost redundancy.