AWS Glue

AWS Glue

  • AWS Glue is a fully-managed, ETL i.e extract, transform, and load service that automates the time-consuming steps of data preparation for analytics
  • AWS Glue is serverless and supports pay-as-you-go model. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run the ETL jobs on a fully managed, scale-out Apache Spark environment.
  • AWS Glue makes it simple and cost-effective to categorize the data, clean it, enrich it, and move it reliably between various data stores and streams
  • AWS Glue automatically discovers and profiles the data via the Glue Data Catalog, recommends and generates ETL code to transform the source data into target schemas, and runs the ETL jobs on a fully managed, scale-out Apache Spark environment to load the data into its destination.
  • AWS Glue also helps setup, orchestrate, and monitor complex data flows.
  • AWS Glue consists of a
    • Data Catalog, which is a central metadata repository,
    • ETL engine that can automatically generate Scala or Python code,
    • Flexible scheduler that handles dependency resolution, job monitoring, and retries.
  • AWS Glue components help automate much of the undifferentiated heavy lifting involved with discovering, categorizing, cleaning, enriching, and moving data, so more time can be spent on analyzing the data.
  • Glue can automatically discover both structured and semi-structured data stored in the data lake on S3, data warehouse in Redshift, and various databases running on AWS.
  • Glue provides a unified view of the data via the Glue Data Catalog that is available for ETL, querying and reporting using services like Athena, EMR, and Redshift Spectrum.
  • AWS Glue natively supports data stored in
    • RDS (Aurora, MySQL, Oracle, PostgreSQL, SQL Server)
    • Redshift
    • DynamoDB
    • S3
    • MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases in the Virtual Private Cloud (VPC) running on EC2.
    • in (beta) data streams from MSK, Kinesis Data Streams, and Apache Kafka.
  • Glue also supports custom Scala or Python code and import custom libraries and Jar files into the AWS Glue ETL jobs to access data sources not natively supported by AWS Glue.
  • Glue supports server side encryption for data at rest and SSL for data in motion.
  • AWS Glue provides development endpoints to edit, debug, and test the code it generates.

AWS Glue Data Catalog

  • AWS Glue Data Catalog is a central repository and persistent metadata store to store structural and operational metadata for all the data assets.
  • AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos, and use that metadata to query and transform the data.
  • For a given data set, AWS Glue Data Catalog can store its table definition, physical location, add business relevant attributes, as well as track how this data has changed over time.
  • AWS Glue Data Catalog is Apache Hive Metastore compatible and is a drop-in replacement for the Apache Hive Metastore for Big Data applications running on Amazon EMR
  • AWS Glue Data Catalog also provides out-of-box integration with Athena, EMR, and Redshift Spectrum.
  • Table definitions once added to the Glue Data Catalog, are available for ETL and also readily available for querying in Athena, EMR, and Redshift Spectrum to provide a common view of the data between these services.
  • AWS Glue Data Catalog supports bulk import of the metadata from existing persistent Apache Hive Metastore by using our import script.
  • Data Catalog provides comprehensive audit and governance capabilities, with schema change tracking and data access controls, which helps ensure that data is not inappropriately modified or inadvertently shared
  • Each AWS account has one AWS Glue Data Catalog per region.

AWS Glue Crawlers

  • AWS Glue crawler connects to a data store, progresses through a prioritized list of classifiers to extract the schema of the data and other statistics, and then populates the Glue Data Catalog with this metadata.
  • Glue crawlers scan various data stores to automatically infer schemas and partition structure to populate the Glue Data Catalog with corresponding table definitions and statistics.
  • Glue crawlers can be scheduled to run periodically so that the metadata is always up-to-date and in-sync with the underlying data.
  • Crawlers automatically add new tables, new partitions to existing table, and new versions of table definitions.

Dynamic Frames

  • AWS Glue is designed to work with semi-structured data and introduces a component called a dynamic frame, which you can use in the ETL scripts
  • Dynamic frame is a distributed table that supports nested data such as structures and arrays.
  • Each record is self-describing, designed for schema flexibility with semi-structured data. Each record contains both data and the schema that describes that data.
  • A Dynamic Frame is similar to an Apache Spark dataframe, which is a data abstraction used to organize data into rows and columns, except that each record is self-describing so no schema is required initially.
  • Dynamic frames provide schema flexibility and a set of advanced transformations specifically designed for dynamic frames.
  • Conversion can be done between Dynamic frames and Spark dataframes, to take advantage of both AWS Glue and Spark transformations to do the kinds of analysis needed.

AWS Glue Streaming ETL

  • AWS Glue enables performing ETL operations on streaming data using continuously-running jobs.
  • AWS Glue streaming ETL is built on the Apache Spark Structured Streaming engine, and can ingest streams from Kinesis Data Streams and Apache Kafka using Amazon Managed Streaming for Apache Kafka.
  • Streaming ETL can clean and transform streaming data and load it into S3 or JDBC data stores.
  • Use Streaming ETL in AWS Glue to process event data like IoT streams, clickstreams, and network logs.

Glue Job Bookmark

  • AWS Glue Job Bookmark tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run.
  • Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data.
  • Job bookmarks help process new data when rerunning on a scheduled interval
  • Job bookmark is composed of the states for various elements of jobs, such as sources, transformations, and targets. for e.g, a ETL job might read new partitions in an S3 file. AWS Glue tracks which partitions the job has processed successfully to prevent duplicate processing and duplicate data in the job’s target data store.

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. An organization is setting up a data catalog and metadata management environment for their numerous data stores currently running on AWS. The data catalog will be used to determine the structure and other attributes of data in the data stores. The data stores are composed of Amazon RDS databases, Amazon Redshift, and CSV files residing on Amazon S3. The catalog should be populated on a scheduled basis, and minimal administration is required to manage the catalog. How can this be accomplished?
    1. Set up Amazon DynamoDB as the data catalog and run a scheduled AWS Lambda function that connects to data sources to populate the database.
    2. Use an Amazon database as the data catalog and run a scheduled AWS Lambda function that connects to data sources to populate the database.
    3. Use AWS Glue Data Catalog as the data catalog and schedule crawlers that connect to data sources to populate the database.
    4. Set up Apache Hive metastore on an Amazon EC2 instance and run a scheduled bash script that connects to data sources to populate the metastore.

AWS Certification – Content Delivery – Cheat Sheet


  • provides low latency and high data transfer speeds for distribution of static, dynamic web or streaming content to web users
  • delivers the content through a worldwide network of data centers called Edge Locations
  • keeps persistent connections with the origin servers so that the files can be fetched from the origin servers as quickly as possible.
  • dramatically reduces the number of network hops that users’ requests must pass through
  • supports multiple origin server options, like AWS hosted service for e.g. S3, EC2, ELB or an on premise server, which stores the original, definitive version of the objects
  • single distribution can have multiple origins and Path pattern in a cache behavior determines which requests are routed to the origin
  • supports Web Download distribution and RTMP Streaming distribution
    • Web distribution supports static, dynamic web content, on demand using progressive download & HLS and live streaming video content
    • RTMP supports streaming of media files using Adobe Media Server and the Adobe Real-Time Messaging Protocol (RTMP) ONLY
  • supports HTTPS using either
    • dedicated IP address, which is expensive as dedicated IP address is assigned to each CloudFront edge location
    • Server Name Indication (SNI), which is free but supported by modern browsers only with the domain name available in the request header
  • For E2E HTTPS connection,
    • Viewers -> CloudFront needs either self signed certificate, or certificate issued by CA or ACM
    • CloudFront -> Origin needs certificate issued by ACM for ELB and by CA for other origins
  •  Security
    • Origin Access Identity (OAI) can be used to restrict the content from S3 origin to be accessible from CloudFront only
    • supports Geo restriction (Geo-Blocking) to whitelist or blacklist countries that can access the content
    • Signed URLs 
      • for RTMP distribution as signed cookies aren’t supported
      • to restrict access to individual files, for e.g., an installation download for your application.
      • users using a client, for e.g. a custom HTTP client, that doesn’t support cookies
    • Signed Cookies
      • provide access to multiple restricted files, for e.g., video part files in HLS format or all of the files in the subscribers’ area of a website.
      • don’t want to change the current URLs
    • integrates with AWS WAF, a web application firewall that helps protect web applications from attacks by allowing rules configured based on IP addresses, HTTP headers, and custom URI strings
  • supports GET, HEAD, OPTIONS, PUT, POST, PATCH, DELETE to get object & object headers, add, update, and delete objects
    • only caches responses to GET and HEAD requests and, optionally, OPTIONS requests
    • does not cache responses to PUT, POST, PATCH, DELETE request methods and these requests are proxied back to the origin
  • object removal from cache
    • would be removed upon expiry (TTL) from the cache, by default 24 hrs
    • can be invalidated explicitly, but has a cost associated, however might continue to see the old version until it expires from those caches
    • objects can be invalidated only for Web distribution
    • change object name, versioning, to serve different version
  • supports adding or modifying custom headers before the request is sent to origin which can be used to
    • validate if user is accessing the content from CDN
    • identifying CDN from which the request was forwarded from, in case of multiple CloudFront distribution
    • for viewers not supporting CORS to return the Access-Control-Allow-Origin header for every request
  • supports Partial GET requests using range header to download object in smaller units improving the efficiency of partial downloads and recovery from partially failed transfers
  • supports compression to compress and serve compressed files when viewer requests include Accept-Encoding: gzip in the request header
  • supports different price class to include all regions, to include only least expensive regions and other regions to exclude most expensive regions
  • supports access logs which contain detailed information about every user request for both web and RTMP distribution

AWS Data Migration Service – DMS

AWS Data Migration Service – DMS

  • AWS Database Migration Service enables quick and secure data migration with minimal to zero downtime.
  • AWS Database Migration Service helps migration to AWS with virtually no downtime. Source database remains fully operational during the migration, minimizing downtime to applications that rely on the database.
  • AWS Database Migration Service can migrate
    • relational databases, data warehouses, NoSQL databases, and other types of data stores
    • data to and from most widely used commercial and open-source databases.
  • AWS DMS supports homogeneous migrations such as Oracle to Oracle, as well as heterogeneous migrations (using SCT) between different database platforms, such as Oracle or Microsoft SQL Server to Aurora.
  • AWS DMS enables both one time migration and continuous data replication with high availability and consolidate databases into a petabyte-scale data warehouse by streaming data to Redshift and S3.
  • AWS DMS continually monitors source and target databases, network connectivity, and the replication instance.
  • AWS DMS is highly resilient and self–healing. If the primary replication server fails for any reason, a backup replication server can take over with little or no interruption of service.
  • AWS DMS automatically manages all of the infrastructure that supports the migration server, including hardware and software, software patching, and error reporting.
  • In case of interruption, DMS automatically restarts the process and continues the migration from where it was halted.
  • AWS DMS supports Multi-AZ option to provide high-availability for database migration and continuous data replication by enabling redundant replication instances.
  • AWS DMS ensures that the data migration is secure. Data at rest is encrypted with AWS Key Management Service (AWS KMS) encryption. During migration, Secure Socket Layers (SSL) can be used to encrypt the in-flight data as it travels from source to target.

DMS Components

DMS Migration

DMS Replication Instance

  • A DMS replication instance performs the actual data migration between source and target.
  • DMS replication instance is a managed EC2 instance that hosts one or more replication tasks.
  • Replication instance also caches the transaction logs during the migration
  • CPU and memory capacity of the replication instance influences the overall time required for the migration.
  • DMS can provide high availability and failover support using a Multi-AZ deployment.
    • In a Multi-AZ deployment, AWS DMS automatically provisions and maintains a standby replica of the replication instance in a different AZ
    • Primary replication instance is synchronously replicated to the standby replica.
    • If the primary replication instance fails or becomes unresponsive, the standby resumes any running tasks with minimal interruption.
    • Because the primary is constantly replicating its state to the standby, Multi-AZ deployment does incur some performance overhead.


  • AWS DMS uses an endpoint to access the source or target data store.

Replication tasks

  • AWS DMS replication task helps move a set of data from the source endpoint to the target endpoint.
  • Replication task required Replication instance, source and target endpoints
  • Replication task supports following migration type options
    • Full load (Migrate existing data) – Migrate the data from the source to the target database as a one time migration.
    • CDC only (Replicate data changes only) – Replicate only changes, while using native export tools for performing bulk data load.
    • Full load + CDC (Migrate existing data and replicate ongoing changes) –  Performs a full data load while capturing changes on the source. After the full load is complete, captured changes are applied to the target. Once the changes reach a steady state, the applications can be switched over to the target.
  • LOB mode options
    • Don’t include LOB columns – LOB columns are excluded
    • Full LOB mode – Migrate complete LOBs regardless of size. AWS DMS migrates LOBs piecewise in chunks controlled by the Max LOB Size parameter. This mode is slower than using limited LOB mode.
    • Limited LOB mode – Truncate LOBs to the value specified by the Max LOB Size parameter. This mode is faster than using full LOB mode.
  • Data validation – If the data validation needs to be performed, once the migration has been completed.

AWS Schema Conversion Tool

  • AWS Schema Conversion Tool makes heterogeneous database migrations  by automatically converting the source database schema and a majority of the database code objects, including views, stored procedures, and functions, to a format compatible with the target database.
  • DMS and SCT work in conjunction to both migrate databases and support ongoing replication for a variety of uses such as populating datamarts, synchronizing systems, etc.
  • SCT can copy database schemas for homogeneous migrations and convert them for heterogeneous migrations.
  • SCT clearly marks any objects that cannot be automatically converted so that they can be manually converted to complete the migration.
  • SCT can scan the application source code for embedded SQL statements and convert them as part of a database schema conversion project.
  • SCT performs cloud native code optimization by converting legacy Oracle and SQL Server functions to their equivalent AWS service thus helping  modernize the applications at the same time of database migration.
  • Once schema conversion is complete, SCT can help migrate data from a range of data warehouses to Redshift using built-in data migration agents.

DMS Best Practices

  • DMS Performance
    • In full load, multiple tables are loaded in parallel and it is recommended to drop primary key indexes, secondary indexes, referential integrity constraints, and data manipulation language (DML) triggers.
    • For a full load + CDC task, it is recommended to add secondary indexes before the CDC phase. Because AWS DMS uses logical replication, secondary indexes that support DML operations should be in-place to prevent full table scans.
    • Replication task can be paused before the CDC phase to build indexes, create triggers, and create referential integrity constraints
    • Use multiple tasks for a single migration to improve performance
    • Disable backups and Multi-AZ on the target until ready to cut over.
  • Migration LOBs
    • DMS migrates LOBS in a two step process
      • creates a new row in the target table and populates the row with all data except the associated LOB value.
      • Update the row in the target table with the LOB data.
    • All LOB columns on the target table must be nullable
    • Limited LOB mode
      • default for all migration tasks
      • migrates all LOB values up to a user-specified size limit, default 32K
      • LOB values larger than the size limit must be manually migrated. typically provides the best performance.
      • Ensure that the Max LOB size parameter setting is set to the largest LOB size for all the tables.
    • Full LOB mode
      • migrates all LOB data in the tables, regardless of size.
      • provides the convenience of moving all LOB data in the tables, but the process can have a significant impact on performance.
  • Migrating Large Tables
    • Break the migration into more than one task.
    • Using row filtering, use a key or a partition key to create multiple tasks
  • Convert schema
    • Use SCT to convert the source objects, table, indexes, views, triggers, and other system objects into the target DDL format
    • DMS doesn’t perform schema or code conversion
  • Replication
    • Enable Multi-AZ for ongoing replication (for high availability and failover support)
  • DMS can read/write from/to encrypted DBs

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Which AWS service would simplify migration of a database to AWS?
    1. AWS Storage Gateway
    2. AWS Database Migration Service (AWS DMS)
    3. Amazon Elastic Compute Cloud (Amazon EC2)
    4. Amazon AppStream 2.0

AWS IoT Core


AWS IoT Core

  • AWS IoT Core is a managed cloud platform that lets connected devices easily and securely interact with cloud applications and other devices.
  • AWS IoT Core can support billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely.
  • AWS IoT Core allows the applications to keep track of and communicate with all the devices, all the time, even when they aren’t connected.
  • AWS IoT Core offers
    • Connectivity between devices and the AWS cloud.
      • AWS IoT Core allows communication with connected devices securely, with low latency and with low overhead.
      • Communication can scale to as many devices as needed.
      • AWS IoT Core supports standard communication protocols (HTTP, MQTT, and WebSockets are supported currently).
      • Communication is secured using TLS.
    • Processing data sent from connected devices.
      • AWS IoT Core can continuously ingest, filter, transform, and route the data streamed from connected devices.
      • Actions can be taken based on the data and route it for further processing and analytics.
    • Application interaction with connected devices.
      • AWS IoT Core accelerates IoT application development.
      • It serves as an easy to use interface for applications running in the cloud and on mobile devices to access data sent from connected devices, and send data and commands back to the devices.


AWS IoT Core Works

  • Connected devices, such as sensors, actuators, embedded devices, smart appliances, and wearable devices, connect to AWS IoT Core over HTTPS, WebSockets, or secure MQTT.
  • Communication with AWS IoT Core is secure.
    • HTTPS and WebSockets requests sent to AWS IoT Core are authenticated using AWS IAM or AWS Cognito, both of which support the AWS SigV4 authentication.
    • HTTPS requests can also be authenticated using X.509 certificates.
    • MQTT messages to AWS IoT Core are authenticated using X.509 certificates.
    • With AWS IoT Core allows using AWS IoT Core generated certificates, as well as those signed by your preferred Certificate Authority (CA).
  • AWS IoT Core also offers fine-grained authorization to isolate and secure communication among authenticated clients.

Device Gateway

  • Device Gateway forms the backbone of communication between connected devices and the cloud capabilities such as the Rules Engine, Device Shadow, and other AWS and 3rd-party services.
  • Device Gateway allows secure, low-latency, low-overhead, bi-directional communication between connected devices, cloud and mobile application
  • Device Gateway supports the pub/sub messaging pattern, which involves involves clients publishing messages on logical communication channels called ‘topics’ and clients subscribing to topics to receive messages
  • Device gateway enables communication between publishers and subscribers
  • Device Gateway scales automatically as per the demand, without any operational overhead

Rules Engine

  • Rules Engine enables continuous processing of data sent by connected devices.
  • Rules can be configured to filter and transform the data using an intuitive, SQL-like syntax.
  • Rules can be configured to route the data to other AWS services such as DynamoDB, Kinesis, Lambda, SNS, SQS, CloudWatch, Elasticsearch Service with built-in Kibana integration, as well as to non-AWS services, via Lambda for further processing, storage, or analytics.


  • Registry allows registering devices and keeping track of devices connected to AWS IoT Core, or devices that may connect in the future.

Device Shadow

  • Device Shadow enables cloud and mobile applications to query data sent from devices and send commands to devices, using a simple REST API, while letting AWS IoT Core handle the underlying communication with the devices.
  • Device Shadow accelerates application development by providing
    • a uniform interface to devices, even when they use one of the several IoT communication and security protocols with which the applications may not be compatible.
    • an always available interface to devices even when the connected devices are constrained by intermittent connectivity, limited bandwidth, limited computing ability or limited power.

Device and its Device Shadow Lifecycle

  • A device (such as a light bulb) is registered in the Registry.
  • Connected device is programmed to publish a set of its property values or ‘state (“I am ON and my color is RED”) to the AWS IoT Core service.
  • Device Shadow also stores the last reported state in the  in AWS IoT Core.
  • An application (such as a mobile app controlling the light bulb) uses a RESTful API to query AWS IoT Core for the last reported state of the light bulb, without the complexity of communicating directly with the light bulb
  • When a user wants to change the state (such as turning the light bulb from ON to OFF), the application uses a RESTful API to request an update, i.e. sets a ‘desired’ state for the device in AWS IoT Core. AWS IoT Core takes care of synchronizing the desired state to the device.
  • Application gets notified when the connected device updates its state to the desired state.

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. You need to filter and transform incoming messages coming from a smart sensor you have connected with AWS. Once messages are received, you need to store them as time series data in DynamoDB. Which AWS service can you use?
    1. IoT Device Shadow Service (maintains device state)
    2. Redshift
    3. Kinesis (While Kinesis could technically be used as an intermediary between different sources, it isn’t a great way to get data into DynamoDB from an IoT device.)
    4. IoT Rules Engine

AWS Kinesis Data Streams vs SQS

Kinesis Data Streams vs SQS

Kinesis Data Streams vs SQS


  • Amazon Kinesis Data Streams
    • allows real-time processing of streaming big data and the ability to read and replay records to multiple Amazon Kinesis Applications.
    • Amazon Kinesis Client Library (KCL) delivers all records for a given partition key to the same record processor, making it easier to build multiple applications that read from the same Amazon Kinesis stream (for example, to perform counting, aggregation, and filtering).
  • Amazon SQS
    • offers a reliable, highly-scalable hosted queue for storing messages as they travel between applications or microservices.
    • It moves data between distributed application components and helps  decouple these components.
    • provides common middleware constructs such as dead-letter queues and poison-pill management.
    • provides a generic web services API and can be accessed by any programming language that the AWS SDK supports.
    • supports both standard and FIFO queues


  • Kinesis Data streams is not fully managed and requires manual provisioning and scaling by increasing shards
  • SQS is fully managed, highly scalable and requires no administrative overhead and little configuration


  • Kinesis provides ordering of records, as well as the ability to read and/or replay records in the same order to multiple Kinesis Applications
  • SQS Standard Queue does not guarantee data ordering and provides at least once delivery of messages
  • SQS FIFO Queue guarantees data ordering within the message group

Data Retention Period

  • Kinesis Data Streams stores the data up to 24 hours, by default, and can be extended to 7 days
  • SQS stores the message up to 4 days, by default, and can be configured from 1 minute to 14 days but clears the message once deleted by the consumer

Delivery Semantics

  • Kinesis and SQS Standard Queue both guarantee at-least once delivery of message
  • SQS FIFO Queue guarantees Exactly once delivery

Parallel Clients

  • Kinesis supports multiple consumers
  • SQS allows the messages to be delivered to only one consumer at a time and requires multiple queues to deliver message to multiple consumers

Use Cases

  • Kinesis use cases requirements
    • Ordering of records.
    • Ability to consume records in the same order a few hours later
    • Ability for multiple applications to consume the same stream concurrently
    • Routing related records to the same record processor (as in streaming MapReduce)
  • SQS uses cases requirements
    • Messaging semantics like message-level ack/fail and visibility timeout
    • Leveraging SQS’s ability to scale transparently
    • Dynamically increasing concurrency/throughput at read time
    • Individual message delay, which can be delayed

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. You are deploying an application to track GPS coordinates of delivery trucks in the United States. Coordinates are transmitted from each delivery truck once every three seconds. You need to design an architecture that will enable real-time processing of these coordinates from multiple consumers. Which service should you use to implement data ingestion?
    1. Amazon Kinesis
    2. AWS Data Pipeline
    3. Amazon AppStream
    4. Amazon Simple Queue Service
  2. Your customer is willing to consolidate their log streams (access logs, application logs, security logs etc.) in one single system. Once consolidated, the customer wants to analyze these logs in real time based on heuristics. From time to time, the customer needs to validate heuristics, which requires going back to data samples extracted from the last 12 hours? What is the best approach to meet your customer’s requirements?
    1. Send all the log events to Amazon SQS. Setup an Auto Scaling group of EC2 servers to consume the logs and apply the heuristics.
    2. Send all the log events to Amazon Kinesis develop a client process to apply heuristics on the logs (Can perform real time analysis and stores data for 24 hours which can be extended to 7 days)
    3. Configure Amazon CloudTrail to receive custom logs, use EMR to apply heuristics the logs (CloudTrail is only for auditing)
    4. Setup an Auto Scaling group of EC2 syslogd servers, store the logs on S3 use EMR to apply heuristics on the logs (EMR is for batch analysis)

AWS Certified Solutions Architect – Associate SAA-C02 Exam Learning Path

SAA-C02 Certification

AWS Certified Solutions Architect – Associate SAA-C02 Exam Learning Path

AWS Solutions Architect – Associate SAA-C02 exam is the latest AWS exam that has replaced the previous SAA-C01 certification exam. It basically validates the ability to effectively demonstrate knowledge of how to architect and deploy secure and robust applications on AWS technologies

  • Define a solution using architectural design principles based on customer requirements.
  • Provide implementation guidance based on best practices to the organization throughout the life cycle of the project.

Refer AWS_Solution_Architect_-_Associate_SAA-C02_Exam_Blue_Print

AWS Solutions Architect – Associate SAA-C02 Exam Summary

  • SAA-C02 exam consists of 65 questions in 130 minutes, and the time is more than sufficient if you are well prepared.
  • SAA-C02 Exam covers the architecture aspects in deep, so you must be able to visualize the architecture, even draw them out in the exam just to understand how it would work and how different services relate.
  • AWS has updated the exam concepts from the focus being on individual services to more building of scalable, highly available, cost-effective, performant, resilient.
  • If you had been preparing for the SAA-C01 –
    • SAA-C02 is pretty much similar to SAA-C01 except the operational effective architecture domain has been dropped
    • Although, most of the services and concepts covered by the SAA-C01 are the same. There are few new additions like Aurora Serverless, AWS Global Accelerator, FSx for Windows, FSx for Lustre
  • AWS exams are available online, and I took the online one. Just make sure you have a proper place to take the exam with no disturbance and nothing around you.
  • Also, if you are taking the AWS Online exam for the first time try to join atleast 30 minutes before the actual time.

AWS Solutions Architect – Associate SAA-C02 Exam Topics

Make sure you go through all the topics and focus on hints in italics


  • Be sure to create VPC from scratch. This is mandatory.
    • Create VPC and understand whats an CIDR and addressing patterns
    • Create public and private subnets, configure proper routes, security groups, NACLs. (hint: Subnets are public or private depending on whether they can route traffic directly through Internet gateway)
    • Create Bastion for communication with instances
    • Create NAT Gateway or Instances for instances in private subnets to interact with internet
    • Create two tier architecture with application in public and database in private subnets
    • Create three tier architecture with web servers in public, application and database servers in private. (hint: focus on security group configuration with least privilege)
    • Make sure to understand how the communication happens between Internet, Public subnets, Private subnets, NAT, Bastion etc.
  • Understand difference between Security Groups and NACLs (hint: Security Groups are Stateful vs NACLs are stateless. Also only NACLs provide an ability to deny or block IPs)
  • Understand VPC endpoints and what services it can help interact (hint: VPC Endpoints routes traffic internally without Internet)
    • VPC Gateway Endpoints supports S3 and DynamoDB.
    • VPC Interface Endpoints OR Private Links supports others
  • Understand difference between NAT Gateway and NAT Instance (hint: NAT Gateway is AWS managed and is scalable and highly available)
  • Understand how NAT high availability can be achieved (hint: provision NAT in each AZ and route traffic from subnets within that AZ through that NAT Gateway)
  • Understand VPN and Direct Connect for on-premises to AWS connectivity
    • VPN provides quick connectivity, cost-effective, secure channel, however routes through internet and does not provide consistent throughput
    • Direct Connect provides consistent dedicated throughput without Internet, however requires time to setup and is not cost-effective
  • Understand Data Migration techniques
    • Choose Snowball vs Snowmobile vs Direct Connect vs VPN depending on the bandwidth available, data transfer needed, time available, encryption requirement, one-time or continuous requirement
    • Snowball, SnowMobile are for one-time data, cost-effective, quick and ideal for huge data transfer
    • Direct Connect, VPN are ideal for continuous or frequent data transfers
  • Understand CloudFront as CDN and the static and dynamic caching it provides, what can be its origin (hint: CloudFront can point to on-premises sources and its usecases with S3 to reduce load and cost)
  • Understand Route 53 for routing
    • Understand Route 53 health checks and failover routing
    • Understand  Route 53 Routing Policies it provides and their use cases mainly for high availability (hint: focus on weighted, latency, geolocation, failover routing)
  • Be sure to cover ELB concepts in deep.
    • SAA-C02 focuses on ALB and NLB and does not cover CLB
    • Understand differences between  CLB vs ALB vs NLB
      • ALB is layer 7 while NLB is layer 4
      • ALB provides content based, host based, path based routing
      • ALB provides dynamic port mapping which allows same tasks to be hosted on ECS node
      • NLB provides low latency and ability to scale
      • NLB provides static IP address


  • Understand IAM as a whole
    • Focus on IAM role (hint: can be used for EC2 application access and Cross-account access)
    • Understand IAM identity providers and federation and use cases
    • Understand MFA and how would implement two factor authentication for an application
    • Understand IAM Policies (hint: expect couple of questions with policies defined and you need to select correct statements)
  • Understand encryption services
  • AWS WAF integrates with CloudFront to provide protection against Cross-site scripting (XSS) attacks. It also provide IP blocking and geo-protection.
  • AWS Shield integrates with CloudFront to provide protection against DDoS.
  • Refer Disaster Recovery whitepaper, be sure you know the different recovery types with impact on RTO/RPO.


  • Understand various storage options S3, EBS, Instance store, EFS, Glacier, FSx and what are the use cases and anti patterns for each
  • Instance Store
    • Understand Instance Store (hint: it is physically attached  to the EC2 instance and provides the lowest latency and highest IOPS)
  • Elastic Block Storage – EBS
    • Understand various EBS volume types and their use cases in terms of IOPS and throughput. SSD for IOPS and HDD for throughput
    • Understand Burst performance and I/O credits to handle occasional peaks
    • Understand EBS Snapshots (hint: backups are automated, snapshots are manual
  • Simple Storage Service – S3
    • Cover S3 in depth
    • Understand S3 storage classes with lifecycle policies
      • Understand the difference between SA Standard vs SA IA vs SA IA One Zone in terms of cost and durability
    • Understand S3 Data Protection (hint: S3 Client side encryption encrypts data before storing it in S3)
    • Understand S3 features including
      • S3 provides a cost effective static website hosting
      • S3 versioning provides protection against accidental overwrites and deletions
      • S3 Pre-Signed URLs for both upload and download provides access without needing AWS credentials
      • S3 CORS allows cross domain calls
      • S3 Transfer Acceleration enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket.
    • Understand Glacier as an archival storage with various retrieval patterns
    • Glacier Expedited retrieval now allows object retrieval within mins
  • Understand Storage gateway and its different types.
    • Cached Volume Gateway provides access to frequently accessed data, while using AWS as the actual storage
    • Stored Volume gateway uses AWS as a backup, while the data is being stored on-premises as well
    • File Gateway supports SMB protocol
  • Understand FSx easy and cost effective to launch and run popular file systems.
  • Understand the difference between EBS vs S3 vs EFS
    • EFS provides shared volume across multiple EC2 instances, while EBS can be attached to a single volume within the same AZ.
  • Understand the difference between EBS vs Instance Store
  • Would recommend referring Storage Options whitepaper, although a bit dated 90% still holds right


  • Understand Elastic Cloud Compute – EC2
  • Understand Auto Scaling and ELB, how they work together to provide High Available and Scalable solution. (hint: Span both ELB and Auto Scaling across Multi-AZs to provide High Availability)
  • Understand EC2 Instance Purchase Types – Reserved, Scheduled Reserved, On-demand and Spot and their use cases
    • Choose Reserved Instances for continuous persistent load
    • Choose Scheduled Reserved Instances for load with fixed scheduled and time interval
    • Choose Spot instances for fault tolerant and Spiky loads
    • Reserved instances provides cost benefits for long terms requirements over On-demand instances
    • Spot instances provides cost benefits for temporary fault tolerant spiky load
  • Understand EC2 Placement Groups (hint: Cluster placement groups provide low latency and high throughput communication, while Spread placement group provides high availability)
  • Understand Lambda and serverless architecture, its features and use cases. (hint: Lambda integrated with API Gateway to provide a serverless, highly scalable, cost-effective architecture)
  • Understand ECS with its ability to deploy containers and micro services architecture.
    • ECS role for tasks can be provided through taskRoleArn
    • ALB provides dynamic port mapping to allow multiple same tasks on the same node
  • Know Elastic Beanstalk at a high level, what it provides and its ability to get an application running quickly.


  • Understand relational and NoSQLs data storage options which include RDS, DynamoDB, Aurora and their use cases
  • RDS
    • Understand RDS features – Read Replicas vs Multi-AZ
      • Read Replicas for scalability, Multi-AZ for High Availability
      • Multi-AZ are regional only
      • Read Replicas can span across regions and can be used for disaster recovery
    • Understand Automated Backups, underlying volume types
  • Aurora
    • Understand Aurora
      • provides multiple read replicas and replicates 6 copies of data across AZs
    • Understand Aurora Serverless provides a highly scalable cost-effective database solution
  • DynamoDB
    • Understand DynamoDB with its low latency performance, key-value store (hint: DynamoDB is not a relational database)
    • DynamoDB DAX provides caching for DynamoDB
    • Understand DynamoDB provisioned throughput for Read/Writes (It is more cover in Developer exam though.)
  • Know ElastiCache use cases, mainly for caching performance

Integration Tools

  • Understand SQS as message queuing service and SNS as pub/sub notification service
  • Understand SQS features like visibility, long poll vs short poll
  • Focus on SQS as a decoupling service
  • Understand SQS Standard vs SQS FIFO difference (hint: FIFO provides exactly once delivery both low throughput)


  • Know Redshift as a business intelligence tool
  • Know Kinesis for real time data capture and analytics
  • Atleast know what AWS Glue does, so you can eliminate the answer

Management Tools

  • Understand CloudWatch monitoring to provide operational transparency
  • Know which EC2 metrics it can track. Remember, it cannot track memory and disk space/swap utilization
  • Understand CloudWatch is extendable with custom metrics
  • Understand CloudTrail for Audit
  • Have a basic understanding of CloudFormation, OpsWorks

AWS Solutions Architect – Associate SAA-C02 Exam Resources

AWS Whitepapers & Cheat sheets

AWS Solutions Architect – Associate Exam Domains

Domain 1: Design Resilient Architectures

  1. Design a multi-tier architecture solution
  2. Design highly available and/or fault-tolerant architectures
  3. Design decoupling mechanisms using AWS services
  4. Choose appropriate resilient storage

Domain 2: Define High-Performing Architectures

  1. Identify elastic and scalable compute solutions for a workload
  2. Select high-performing and scalable storage solutions for a workload
  3. Select high-performing networking solutions for a workload
  4. Choose high-performing database solutions for a workload

Domain 3: Specify Secure Applications and Architectures

  1. Design secure access to AWS resources
  2. Design secure application tiers
  3. Select appropriate data security options

Domain 4: Design Cost-Optimized Architectures

  1. Determine how to design cost-optimized storage.
  2. Determine how to design cost-optimized compute.

AWS FSx for Lustre

AWS FSx for Lustre

  • Amazon FSx for Lustre, is a fully managed service, that makes it easy and cost effective to launch and run the world’s most popular high-performance (HPC) Lustre file system.
  • Lustre is an open source file system designed for applications that require fast storage – where you want your storage to keep up with your compute
  • FSx handles the traditional complexity of setting up and managing high-performance Lustre file systems
  • FSx for Lustre is ideal for use cases where speed matters, such as machine learning, high performance computing (HPC), video processing, financial modeling, genome sequencing, and electronic design automation (EDA)
  • Amazon FSx provides multiple deployment options to optimize cost
    • Scratch file systems
      • designed for temporary storage and short-term processing of data.
      • data is not replicated and does not persist if a file server fails.
    • Persistent file systems
      • designed for long-term storage and workloads.
      • is highly available, and data is automatically replicated within the AZ that is associated with the file system.
      • data volumes attached to the file servers are replicated independently from the file servers to which they are attached.
  • FSx for Lustre is compatible with the most popular Linux-based AMIs, including Amazon Linux, Amazon Linux 2, Red Hat Enterprise Linux (RHEL), CentOS, SUSE Linux and Ubuntu.
  • FSx for Lustre can be accessed from a Linux instance, by installing the open-source Lustre client and mounting the file system using standard Linux commands.

FSx for Lustre with S3

  • Amazon FSx also integrates seamlessly with S3, making it easy to process cloud data sets with the Lustre high-performance file system.
  • Amazon FSx for Lustre file system transparently presents S3 objects as files and allows writing changed data back to S3.
  • Amazon FSx for Lustre file system can be linked with a specified S3 bucket, making the data in the S3 accessible to the file system.
  • S3 objects’ names and prefixes will be visible as files and directories
  • Amazon S3 objects are lazy loaded by default.
    • Objects are only loaded into the file system only when first accessed by the applications.
    • Amazon FSx for Lustre automatically loads the corresponding objects from S3 when accessed
    • Subsequent reads of these files are served directly out of the file system with low, consistent latencies.
    • Amazon FSx for Lustre file system can optionally batch hydrated
  • Amazon FSx for Lustre uses parallel data transfer techniques to transfer data from S3 at up to hundreds of GBs/s.
  • Files from the file system can be exported back to the S3 bucket

FSx for Lustre Security

  • FSx for Lustre provides encryption at rest for the file system and the backups, by default, using KMS
  • FSx encrypts data-in-transit when accessed from supported EC2 instances only

FSx for Lustre Scalability

  • Amazon FSx for Lustre file systems scale to hundreds of GB/s of throughput and millions of IOPS.
  • FSx for Lustre also supports concurrent access to the same file or directory from thousands of compute instances.
  • FSx for Lustre provides consistent, sub-millisecond latencies for file operations.

FSx for Lustre Availability and Durability

  • On a scratch file system, file servers are not replaced if they fail and data is not replicated.
  • On a persistent file system, if a file server becomes unavailable it is replaced automatically and within minutes.
  • Amazon FSx for Lustre provides a parallel file system, where data is stored across multiple network file servers to maximize performance and reduce bottlenecks, and each server has multiple disks.
  • Amazon FSx takes daily automatic incremental backups of the file systems, and allows manual backups at any point.
  • Backups are highly durable and file-system-consistent

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. A solutions architect is designing storage for a high performance computing (HPC) environment based on Amazon Linux. The workload stores and processes a large amount of engineering drawings that require shared storage and heavy computing. Which storage option would be the optimal solution?
    1. Amazon Elastic File System (Amazon EFS)
    2. Amazon FSx for Lustre
    3. Amazon EC2 instance store
    4. Amazon EBS Provisioned IOPS SSD (io1)

AWS FSx for Windows

AWS FSx for Windows

  • Amazon FSx for Windows File Server provides fully managed, highly reliable, and scalable file storage that is accessible over the industry-standard Service Message Block (SMB) protocol.
  • Built on Windows Server, delivering a wide range of administrative features such as user quotas, end-user file restore, ACLs and Microsoft Active Directory (AD) integration.
  • Amazon FSx provides high levels of throughput and IOPS, and consistent sub-millisecond latencies.
  • Amazon FSx is accessible from Windows, Linux, and MacOS compute instances and devices.
  • Amazon FSx provides concurrent access to the file system to thousands of compute instances and devices
  • Amazon FSx can connect the file system to EC2, VMware Cloud on AWS, Amazon WorkSpaces, and Amazon AppStream 2.0 instances.
  • Integrated with CloudWatch to monitor storage capacity and file system activity
  • Integrated with CloudTrail to monitor all Amazon FSx API calls
  • Amazon FSx was designed for use cases that require Windows shared file storage, like CRM, ERP, custom or .NET applications, home directories, data analytics, media and entertainment workflows, web serving and content management, software build environments, and Microsoft SQL Server.
  • Amazon FSx file systems is accessible from the on-premises environment using an AWS Direct Connect or AWS VPN connection
  • Amazon FSx is accessible from multiple VPCs, AWS accounts, and AWS Regions using VPC Peering connections or AWS Transit Gateway
  • Amazon FSx provides consistent sub-millisecond latencies with SSD storage, and single-digit millisecond latencies with HDD storage
  • Amazon FSx supports Microsoft’s Distributed File System (DFS) to organize shares into a single folder structure up to hundreds of PB in size

FSx for Windows Security

  • Amazon FSx works with Microsoft Active Directory (AD) to integrate with  existing Windows environments, which can either be an AWS Managed Microsoft AD or self-managed Microsoft AD
  • Amazon FSx provides standard Windows permissions (full support for Windows Access Controls ACLS) for files and folders.
  • Amazon FSx for Windows File Server supports encryption at rest for the file system and backups using KMS managed keys
  • Amazon FSx encrypts data-in-transit using SMB Kerberos session keys, when accessing the file system from clients that support SMB 3.0
  • Amazon FSx supports file-level or folder-level restores to previous versions by supporting Windows shadow copies, which are snapshots of your file system at a point in time
  • Amazon FSx supports Windows shadow copies to enable your end-users to easily undo file changes and compare file versions by restoring files to previous versions, and backups to support your backup retention and compliance needs.

FSx for Windows Availability and durability

  • Amazon FSx automatically replicates the data within an Availability Zone (AZ) to protect it from component failure,
  • Amazon FSx continuously monitors for hardware failures, and automatically replaces infrastructure components in the event of a failure.
  • Amazon FSx supports Multi-AZ deployment
    • automatically provisions and maintains a standby file server in a different Availability Zone.
    • Any changes written to disk in the file system are synchronously replicated across AZs to the standby.
    • helps enhance availability during planned system maintenance
    • helps protect the data against instance failure and AZ disruption.
    • In the event of planned file system maintenance or unplanned service disruption, Amazon FSx automatically fails over to the secondary file server, allowing data accessibility without manual intervention.
  • Amazon FSx supports automatic backups of the file systems, which are incremental storing only the changes after the most recent backup
  • Amazon FSx stores backups in Amazon S3.

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. A data processing facility wants to move a group of Microsoft Windows servers to the AWS Cloud. Theses servers require access to a shared file system that can integrate with the facility’s existing Active Directory (AD) infrastructure for file and folder permissions. The solution needs to provide seamless support for shared files with AWS and on-premises servers and allow the environment to be highly available. The chosen solution should provide added security by supporting encryption at rest and in transit. The solution should also be cost-effective to implement and manage. Which storage solution would meet these requirements?
    1. An AWS Storage Gateway file gateway joined to the existing AD domain
    2. An Amazon FSx for Windows File Server file system joined to the existing AD domain
    3. An Amazon Elastic File System (Amazon EFS) file system joined to an AWS managed AD domain
    4. An Amazon S3 bucket mounted on Amazon EC2 instances in multiple Availability Zones running Windows Server and joined to an AWS managed AD domain

AWS S3 vs EBS vs EFS

S3 vs EBS vs EFS

Amazon EFS, Amazon EBS, and Amazon S3 are AWS’ three different storage types that are applicable for different types of workload needs

S3 vs EBS vs EFS

Amazon S3

    • is an object store with a with simple key, value store design and good at storing vast numbers of backups or user files.
    • offers pay for the storage you actually use. Offers cost-saving storage classes ideal for infrequently access data or for data archival
    • provides unlimited storage
    • provides durability as the data is replicated and stored across at least three geographically-dispersed AZs with a maximum of 99.999999999% (11! 9’s)
    • provide high availability with a maximum of 99.99%
    • provides Security with range of access control mechanism and ability. to encrypt data at rest and in transit
    • data can be accessed programmatically or directly from services such as AWS CloudFront.
    • provides backup capability using versioning and cross-region replication

Amazon EBS

    • delivers high-availability block-level storage volumes for Amazon Elastic Compute Cloud (EC2) instances.
    • offers pay for the provisioned storage, even if you do not use it
    • provides limited storage capability and cannot scale infinitely
    • stores data on a file system which can be retained after the EC2 instance is shut down.
    • provides durability by replicating data across multiple servers in an AZ to prevent the loss of data from the failure of any single component
    • designed for 99.999% availability
    • provides low-latency performance – using SSD EBS volumes, it offers reliable I/O performance scaled to meet your workload needs.
    • provides secure storage with access control and providing data at rest and in transit encryption
    • is only accessible from a single EC2 instance in the particular AWS region and AZ
    • provides backup capability using backups and snapshots

Amazon EFS

    • scalable file storage, also optimized for EC2.
    • offers pay for the storage you actually use. There’s no advance provisioning, up-front fees, or commitments
    • multiple instances can be configured to mount the file system.
    • allows mounting the file system across multiple regions and instances.
    • is designed to be highly durable and highly available. Data is redundantly stored across multiple AZs.
    • provides elasticity – scales up and down automatically, even to meet the most abrupt workload spikes.
    • provides performance that scales to support any workload: EFS offers the throughput changing workloads need. It can provide higher throughput in spurts that match sudden file system growth, even for workloads up to 500,000 IOPS or 10 GB per second.
    • provides accessible file storage, which can be accessed by On-premises servers and EC2 instances concurrently.
    • provides security and compliance – access to the file system can be secured with the current security solution, or control access to EFS file systems using IAM, VPC, or POSIX permissions.
    • provides data encryption in transit or at rest.
    • allows EC2 instances to access EFS file systems located in other AWS regions through VPC peering.
    • a file system can be accessed concurrently from all AZs in the region where it is located, which means the application can be architected to failover from one AZ to other AZs in the region in order to ensure the highest level of application availability. Mount targets themselves are designed to be highly available.
    • used as a common data source for any application or workload that runs on numerous instances.

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. A company runs an application on a group of Amazon Linux EC2 instances. The application writes log files using standard API calls. For compliance reasons, all log files must be retained indefinitely and will be analyzed by a reporting tool that must access all files concurrently. Which storage service should a solutions architect use to provide the MOST cost-effective solution?
    1. Amazon EBS
    2. Amazon EFS
    3. Amazon EC2 instance store
    4. Amazon S3
  2. A new application is being deployed on Amazon EC2. The Application needs to read write upto 3 TB of data to an external data store and requires read-after-write consistency across all AWS regions for writing new objects into this data store.
    1. Amazon EBS
    2. Amazon Glacier
    3. Amazon EFS
    4. Amazon S3
  3. To meet the requirements of an application, an organization needs to save a constantly increasing volume of files on a cloud storage system with the following features and abilities. What below AWS service will meet these requirements?
      1. Pay only for the storage used
      2. Create different security policies for different groups of files
      3. Allow access to the public
      4. Retrieve the files at any time
      5. Store an unlimited number of files
    1. Amazon EBS
    2. Amazon S3
    3. Amazon Glacier
    4. Amazon EFS

AWS Certified Machine Learning -Specialty (MLS-C01) Exam Learning Path

AWS Certified Machine Learning Specialty Certification

Finally, cleared the AWS Certified Machine Learning – Specialty (MLS-C01). It took me around four months to prepare for the exam. This was my fourth Specialty certification and in terms of the difficulty level of all of them this is the toughest, partly because I am not a machine learning expert and learned everything from basics for this certification. Machine Learning is a vast specialization in itself and with AWS services, there is lots to cover and know for the exam. This is the only exam, where the majority of the focus is on the concepts outside of AWS i.e. pure machine learning. It also includes AWS Machine Learning and Big Data services.

AWS Certified Machine Learning – Specialty (MLS-C01) exam basically validates

  •  Select and justify the appropriate ML approach for a given business problem.
  • Identify appropriate AWS services to implement ML solutions.
  • Design and implement scalable, cost-optimized, reliable, and secure ML solutions.

Refer AWS Certified Machine Learning – Specialty Exam Guide for details

                              AWS Certified Machine Learning – Specialty Domains

AWS Certified Machine Learning – Specialty (MLS-C01) Exam Summary

  • AWS Certified Machine Learning – Specialty exam, as its name suggests, covers a lot of Machine Learning concepts right. It really digs deep into Machine learning concepts, most of which are not related to AWS.
  • AWS Certified Machine Learning – Speciality exam covers the E2E Machine Learning lifecycle, right from data collection, transformation, making it usable and efficient for Machine Learning, pre-processing data for Machine Learning, training and validation and implementation.
  • As always, one of the key tactic I followed when solving any AWS Certification exam is to read the question and use paper and pencil to draw a rough architecture and focus on the areas that you need to improve. Trust me, you will be able to eliminate 2 answers for sure and then need to focus on only the other two. Read the other 2 answers to check the difference area and that would help you reach to the right answer or atleast have a 50% chance of getting it right.

Preparation Summary

  • Machine Learning
    • Make sure you know and cover all the services in depth, as 60% of the exam is focused on generic Machine learning concepts not related to AWS services.
    • Know about complete generic Machine Learning lifecycle
    • Exploratory Data Analysis
      • Feature selection and Engineering
        • remove features which are not related to training
        • remove features which has same values, very low correlation, very little variance or lot of missing values
        • Apply techniques like Principal Component Analysis (PCA) for dimensionality reduction i.e reduce the number of features.
        • Apply techniques such as One-hot encoding and label encoding to help convert strings to numeric values, which are easier to process.
        • Apply Normalization i.e. values between 0 and 1 to handle data with large variance.
        • Apply feature engineering for feature reduction for e.g. using single height/weight feature instead of both the features
      • Handle Missing data
        • remove the feature or rows with missing data
        • impute using Mean/Median values – valid only for Numeric values and not categorical features also does not factor correlation between features
        • impute using k-NN, Multivariate Imputation by Chained Equation (MICE), Deep Learning – more accurate, factores correlation between features
      • Handle unbalanced data
        • Source more data
        • Oversample minority or Undersample majority
        • Data augmentation using techniques like SMOTE
    • Modeling
      • Know about Algorithms – Supervised, Unsupervised and Reinforcement and which algorithm is best suitable based on the available data either labelled or unlabelled.
        • Supervised learning trains on labelled data for e.g. Linear regression. Logistic regression, Decision trees, Random Forests
        • Unsupervised learning trains on unlabelled data for e.g. PCA, SVD, K-means
        • Reinforcement learning trained based on actions and rewards for e.g. Q-Learning
      • Hyperparameters
        • are parameters exposed by machine learning algorithms that control how the underlying algorithm operates and their values affect the quality of the trained models
        • some of the common hyperparameters are learning rate, batch, epoch (hint:  If the learning rate is too large, the minimum slope might be missed and the graph would oscillate If the learning rate is too small, it requires too many steps which would take the process longer and is less efficient
    • Evaluation
      • Know difference in evaluating model accuracy
        • Use Area Under the (Receiver Operating Characteristic) Curve (AUC) for Binary classification
        • Use root mean square error (RMSE) metric for regression
      • Understand Confusion matrix
        • A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.
        • false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.
        • Recall or Sensitivity or TPR (True Positive Rate): Number of items correctly identified as positive out of total true positives- TP/(TP+FN)  (hint: use this for cases like fraud detection,  cost of marking non fraud as frauds is lower than marking fraud as non-frauds)
        • Specificity or TNR (True Negative Rate): Number of items correctly identified as negative out of total negatives- TN/(TN+FP)  (hint: use this for cases like videos for kids, the cost of  dropping few valid videos is lower than showing few bad ones)
      • Handle Overfitting problems
        • Simplify the model, by reducing number of layers
        • Early Stopping – form of regularization while training a model with an iterative method, such as gradient descent
        • Data Augmentation
        • Regularization – technique to reduce the complexity of the model
        • Dropout is a regularization technique that prevents overfitting
        • Never train on test data
  • AWS Machine Learning
    • SageMaker
      • Know SageMaker in depth
      • supports both File mode and Pipe mode
        • File mode loads all of the data from S3 to the training instance volumes VS Pipe mode streams data directly from S3
        • File mode needs disk space to store both the final model artifacts and the full training dataset. VS Pipe mode which helps reduce the required size for EBS volumes
      • Using RecordIO format allows algorithms to take advantage of Pipe mode when training the algorithms that support it. 
      • supports Model tracking capability to manage up to thousands of machine learning model experiments
      • supports Canary deployment using ProductionVariant and deploying multiple variants of a model to the same SageMaker HTTPS endpoint.
      • supports automatic scaling for production variants. Automatic scaling dynamically adjusts the number of instances provisioned for a production variant in response to changes in your workload
      • provides pre-built Docker images for its built-in algorithms and the supported deep learning frameworks used for training & inference
      • SageMaker Automatic Model Tuning
        • is the process of finding a set of hyperparameters for an algorithm that can yield an optimal model.
        • Best practices
          • limit the search to a smaller number as difficulty of a hyperparameter tuning job depends primarily on the number of hyperparameters that Amazon SageMaker has to search
          • DO NOT specify a very large range to cover every possible value for a hyperparameter as it affects the success of hyperparameter optimization.
          • log-scaled hyperparameter can be converted to improve hyperparameter optimization.
          • running one training job at a time achieves the best results with the least amount of compute time.
          • Design distributed training jobs so that you get they report the objective metric that you want.
        • SageMaker Neo enables machine learning models to train once and run anywhere in the cloud and at the edge.
      • know how to take advantage of multiple GPUs (hint: increase learning rate and batch size w.r.t to the increase in GPUs)
      • Algorithms –
        • Blazing text provides Word2vec and text classification algorithms
        • DeepAR provides supervised learning algorithm for forecasting scalar (one-dimensional) time series (hint: train for new products based on existing products sales data)
        • Factorization machines provides supervised classification and regression tasks, helps capture interactions between features within high dimensional sparse datasets economically
        • Image classification algorithm is a supervised learning algorithm that supports multi-label classification
        • IP Insights is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses
        • K-means is an unsupervised learning algorithm for clustering as it attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups.
        • k-nearest neighbors (k-NN) algorithm is an index-based algorithm. It uses a non-parametric method for classification or regression
        • Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. Used to identify number of topics shared by documents within a text corpus
        • Linear models are supervised learning algorithms used for solving either classification or regression problems. 
          • For regression (predictor_type=’regressor’), the score is the prediction produced by the model.
          • For classification (predictor_type=’binary_classifier’ or predictor_type=’multiclass_classifier’)
        • Neural Topic Model (NTM) Algorithm is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution
        • Object Detection algorithm detects and classifies objects in images using a single deep neural network
        • Principal Component Analysis (PCA) is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) (hint: dimensionality reduction)
        • Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data point (hint: anomaly detection)
        • Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens. (hint: text summarization is the key use case)
    • SageMaker Ground Truth 
      • provides automated data labeling using machine learning
      • helps build highly accurate training datasets for machine learning quickly using Amazon Mechanical Turk
      • provides annotation consolidation to help improve the accuracy of the data object’s labels. It combines the results of multiple worker’s annotation tasks into one high-fidelity label.
      • automated data labeling uses machine learning to label portions of the data automatically without having to send them to human workers
    • Comprehend
      • natural language processing (NLP) service to find insights and relationships in text.
      • identifies the language of the text; extracts key phrases, places, people, brands, or events; understands how positive or negative the text is; analyzes text using tokenization and parts of speech; and automatically organizes a collection of text files by topic.
    • Lex
      • provides conversational interfaces using voice and text helpful in building voice and text chatbots
    • Polly
      • text into speech
      • supports Speech Synthesis Markup Language (SSML) tags like prosody so users can adjust the speech rate, pitch or volume.
      • supports pronunciation lexicons to customize the pronunciation of words
    • Rekognition
      • analyze image and video
      • helps identify objects, people, text, scenes, and activities in images and videos, as well as detect any inappropriate content.
    • Translate – provides natural and fluent language translation
    • Transcribe – provides speech-to-text capability
    • Elastic Interface helps attach low-cost GPU-powered acceleration to EC2 and SageMaker instances or ECS tasks to reduce the cost of running deep learning inference by up to 75%.
  • Analytics
    • Make sure you know and understand data engineering concepts mainly in terms of data capture, data migration, data transformation and data storage
    • Kinesis
      • Understand Kinesis Data Streams and Kinesis Data Firehose in depth
      • Kinesis Data Analytics can process and analyze streaming data using standard SQL and integrates with Data Streams and Firehose
      • Know Kinesis Data Streams vs Kinesis Firehose
        • Know Kinesis Data Streams is open ended on both producer and consumer. It supports KCL and works with Spark.
        • Know Kinesis Firehose is open ended for producer only. Data is stored in S3, Redshift and ElasticSearch.
        • Kinesis Firehose works in batches with minimum 60secs interval.
        • Kinesis Data Firehose supports data transformation and record format conversion using Lambda function (hint: can be used for transforming csv or JSON into parquet)
    • Know ElasticSearch is a search service which supports indexing, full text search, faceting etc.
    • Know Data Pipeline for data transfer
    • Know Glue as fully managed ETL service
      • helps setup, orchestrate, and monitor complex data flows.
      • AWS Glue Data Catalog
        • is a central repository to store structural and operational metadata for all the data assets.
      • AWS Glue crawler
        • connects to a data store, progresses through a prioritized list of classifiers to extract the schema of the data and other statistics, and then populates the Glue Data Catalog with this metadata
  • Security, Identity & Compliance
    • Security is covered very lightly. (hint : SageMaker can read data from KMS encrypted S3. Make sure, the KMS key policies include the role attached with SageMaker)
  • Management & Governance Tools
    • Understand AWS CloudWatch for Logs and Metrics. (hint: SageMaker is integrated with Cloudwatch and logs and metrics are all stored in it)
  • Storage
    • Understand Data Storage Options – Know patterns for S3 vs RDS vs DynamoDB vs Redshift. (hint: S3 is, by default, the data storage option or Big Data storage and look for it in the answer.)

Whitepapers and articles

AWS Certified Machine Learning – Specialty (MLS-C01) Exam Resources