AWS Certified Data Engineer – Associate DEA-C01 Exam Learning Path

June 24, 2024 ~ Last updated on : June 25, 2024 ~ jayendrapatil

Just cleared the AWS Certified Data Engineer – Associate DEA-C01 exam with a score of 930/1000.

AWS Certified Data Engineer – Associate DEA-C01 exam is the latest AWS exam released on 12th March 2024.

AWS Certified Data Engineer – Associate DEA-C01 Exam Content

Data Engineer exam validates skills and knowledge in core data-related AWS services, ability to ingest and transform data, orchestrate data pipelines while applying programming concepts, design data models, manage data life cycles, and ensure data quality.
Exam also validates a candidate’s ability to complete the following tasks:
- Ingest and transform data, and orchestrate data pipelines while applying programming concepts.
- Choose an optimal data store, design data models, catalog data schemas, and manage data lifecycles.
- Operationalize, maintain, and monitor data pipelines. Analyze data and ensure data quality.
- Implement appropriate authentication, authorization, data encryption, privacy, and governance. Enable logging.

Refer to the AWS Certified Data Engineer – Associate DEA-C01 Exam Guide

AWS Certified Data Engineer – Associate DEA-C01 Exam Summary

DEA-C01 exam consists of 65 questions in 130 minutes, and the time is more than sufficient if you are well-prepared.
DEA-C01 exam includes two types of questions, multiple-choice and multiple-response.

DEA-C01 has a scaled score between 100 and 1,000. The scaled score needed to pass the exam is 720.
Associate exams currently cost $150 + tax.
You can get an additional 30 minutes if English is your second language by requesting Exam Accommodations. It might not be needed for Associate exams but is helpful for Professional and Specialty ones.

AWS exams can be taken either at a test center or online. I prefer to take them online as it provides a lot of flexibility. Just make sure you have a proper place to take the exam with no disturbance and nothing around you.
Also, if you are taking the AWS Online exam for the first time try to join at least 30 minutes before the actual time as I have had issues with both PSI and Pearson with long wait times.

AWS Certified Data Engineer – Associate DEA-C01 Exam Resources

Online Courses
- Stephane Maarek – AWS Certified Data Engineer Associate 2024 – Hands On!
- Whizlabs – AWS Certified Data Engineer Associate Course
Practice tests
- Braincert AWS Certified Data Engineer – Associate DEA-C01 Practice Exams
- Stephane Maarek – AWS Certified Data Engineer – Associate Practice Exams
- Whizlabs – AWS Certified Data Engineer Associate Practice Tests
- AWS Official Data Engineer Practice Set
Sign up with AWS for the Free Tier account, which lets you try a lot of services for free within certain limits that are more than enough to get things going. Be sure to decommission services that go beyond the free limits to prevent any surprises 🙂
Read the FAQs, at least for the important topics, as they cover important points and are good for a quick review.

AWS Certified Data Engineer – Associate DEA-C01 Exam Topics

DEA-C01 Exam covers the data engineering aspects in terms of data ingestion, transformation, orchestration, designing data models, managing data life cycles, and ensuring data quality.

Analytics

Ensure you know and cover all the services in-depth, as 80% of the exam focuses on topics like Glue, Athena, Kinesis, and Redshift.
AWS Analytics Services Cheat Sheet

Glue
- DEA-C01 covers Glue in great detail.
- AWS Glue is a fully managed ETL service that automates the time-consuming steps of data preparation for analytics.
- supports server-side encryption for data at rest and SSL for data in motion.
- Glue ETL engine extracts, transforms, and loads data, and can automatically generate Scala or Python code.
- Glue Data Catalog is a central repository and persistent metadata store for the structural and operational metadata of all the data assets. It is compatible with the Apache Hive metastore.
- Glue Crawlers scan various data stores to automatically infer schemas and partition structures to populate the Data Catalog with corresponding table definitions and statistics.
- Glue Job Bookmark tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run.
- Glue Streaming ETL enables performing ETL operations on streaming data using continuously running jobs.
- Glue provides a flexible scheduler that handles dependency resolution, job monitoring, and retries.
- Glue Studio offers a graphical interface for authoring AWS Glue jobs to process data allowing you to define the flow of the data sources, transformations, and targets in the visual interface and generating Apache Spark code on your behalf.
- Glue Data Quality helps reduce manual data quality efforts by automatically measuring and monitoring the quality of data in data lakes and pipelines.
- Glue DataBrew helps prepare, visualize, clean, and normalize data directly from the data lake, data warehouses, and databases, including S3, Redshift, Aurora, and RDS.
- Glue Flex execution option helps reduce the costs of pre-production, test, and non-urgent data integration workloads by up to 34% and is ideal for workloads that don’t require fast job start times.
- Glue FindMatches transform helps identify duplicate or matching records in the dataset, even when the records do not have a common unique identifier and no fields match exactly.
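To make the job bookmark idea above concrete, here is a minimal, purely conceptual sketch in plain Python. It is not how Glue implements bookmarks; the `Bookmark` class, `run_job` function, and file names are hypothetical and only illustrate persisting state between runs so that already-processed input is skipped.

```python
import json
import os
import tempfile


class Bookmark:
    """Hypothetical stand-in for job-bookmark state: remembers processed keys."""

    def __init__(self, path):
        self.path = path
        self.seen = set()
        if os.path.exists(path):
            with open(path) as fh:
                self.seen = set(json.load(fh))

    def new_items(self, keys):
        # Only keys that no previous run has recorded are returned.
        return [k for k in keys if k not in self.seen]

    def commit(self, keys):
        # Persist the state at the end of a successful run.
        self.seen.update(keys)
        with open(self.path, "w") as fh:
            json.dump(sorted(self.seen), fh)


def run_job(bookmark_path, available_keys):
    bookmark = Bookmark(bookmark_path)
    todo = bookmark.new_items(available_keys)
    # ... transform and load the files in `todo` here ...
    bookmark.commit(todo)
    return todo


state_file = os.path.join(tempfile.mkdtemp(), "bookmark.json")
first_run = run_job(state_file, ["day=01/a.csv", "day=02/b.csv"])
second_run = run_job(state_file, ["day=01/a.csv", "day=02/b.csv", "day=03/c.csv"])
print(first_run)
print(second_run)
```

The second run only picks up the file that arrived after the first run, which is the behaviour a bookmark is meant to give an incremental ETL job.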

Kinesis
- Understand Kinesis Data Streams and Kinesis Data Firehose in-depth.
- Know Kinesis Data Streams vs Kinesis Firehose
  - Know Kinesis Data Streams is open-ended for both producer and consumer. It supports KCL and works with Spark.
  - Know Kinesis Firehose is open-ended for producers only. Data is stored in S3, Redshift, and OpenSearch.
  - Kinesis Firehose works in batches with a minimum buffer interval of 60 seconds, so delivery is near-real time.
  - Kinesis Firehose supports out-of-the-box transformation and custom transformation using Lambda
- Kinesis supports encryption at rest using server-side encryption
- Kinesis supports Interface VPC endpoint to keep traffic between the VPC and Kinesis Data Streams from leaving the Amazon network and doesn’t require an internet gateway, NAT device, VPN connection, or Direct Connect connection.
- Kinesis Producer Library supports batching
- Kinesis Data Analytics OR Managed Service for Apache Flink
  - helps transform and analyze streaming data in real time using Apache Flink.
  - supports anomaly detection using Random Cut Forest ML
  - supports reference data stored in S3.
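For the custom Firehose transformation using Lambda mentioned above, a handler could look roughly like the sketch below. It assumes the usual record contract (each incoming record carries a `recordId` and base64 `data`, and each must be returned with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`) and JSON object payloads. The filter rule and the `processed` field are made up for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record; every recordId must be returned."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            # Unparseable input is flagged so Firehose can route it as failed.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
            continue
        if payload.get("level") == "DEBUG":  # hypothetical filter rule
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": record["data"]})
            continue
        payload["processed"] = True  # hypothetical enrichment
        data = base64.b64encode((json.dumps(payload) + "\n").encode()).decode()
        output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
    return {"records": output}
```

The trailing newline on each transformed record is a common choice so that the objects delivered to S3 are line-delimited JSON.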
Redshift
- Redshift is also covered in depth.
- Redshift Advanced topics include
  - Redshift Distribution Style determines how data is distributed across compute nodes and helps minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.
  - Redshift Enhanced VPC routing forces all COPY and UNLOAD traffic between the cluster and the data repositories through the VPC.
  - Workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries.
  - Redshift Spectrum
    - helps query structured and semistructured data from files in S3 without having to load the data into Redshift tables.
    - cannot access data from Glacier.
  - Federated Query feature allows querying and analyzing data across operational databases, data warehouses, and data lakes.
  - Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running queries.
  - Concurrency Scaling helps support thousands of concurrent users and concurrent queries, with consistently fast query performance.
  - Redshift Serverless is a serverless option of Redshift that makes it more efficient to run and scale analytics in seconds without the need to set up and manage data warehouse infrastructure.
  - Streaming ingestion provides low-latency, high-speed ingestion of stream data from Kinesis Data Streams and Managed Streaming for Apache Kafka into a Redshift provisioned or Redshift Serverless materialized view.
  - Redshift data sharing can securely share access to live data across Redshift clusters, workgroups, AWS accounts, and AWS Regions without manually moving or copying the data.
  - Redshift Data API provides a secure HTTP endpoint and integration with AWS SDKs to help access Redshift data with web services–based applications, including AWS Lambda, SageMaker notebooks, and AWS Cloud9.
- Redshift Best Practices w.r.t selection of Distribution style, Sort key, importing/exporting data
  - A single COPY command loads data in parallel and performs better than multiple COPY commands
  - COPY command can use manifest files to load data
  - COPY command handles encrypted data
- Redshift Resizing cluster options (elastic resize did not support node type changes before, but does now)
- Redshift supports encryption at rest and in transit
- Redshift supports encrypting an unencrypted cluster using KMS. However, you can’t enable hardware security module (HSM) encryption by modifying the cluster. Instead, create a new, HSM-encrypted cluster and migrate your data to the new cluster.
- Know Redshift views to control access to data.
Athena
- is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats.
- provides a simplified, flexible way to analyze data in an S3 data lake and 30 data sources, including on-premises data sources or other cloud systems using SQL or Python without loading the data.
- integrates with QuickSight for visualizing the data or creating dashboards.
- uses a managed Glue Data Catalog to store information and schemas about the databases and tables for the data stored in S3.
- Workgroups can be used to separate users, teams, applications, or workloads, to set limits on the amount of data each query or the entire workgroup can process, and to track costs.
- Athena best practices
  - Data partitioning,
  - Partition projection, and
  - Columnar file formats like ORC or Parquet as they support compression and are splittable.

Elastic Map Reduce
- Understand EMRFS
  - Use Consistent view to make sure S3 objects referred to by different applications are in sync, although it is no longer needed now that S3 provides strong read-after-write consistency.
- Know EMR Best Practices (hint: start with many small nodes instead of few large nodes)
- Know EMR Encryption options
  - supports SSE-S3, SSE-KMS, CSE-KMS, and CSE-Custom encryption for EMRFS
  - supports LUKS encryption for local disks
  - supports TLS for data in transit encryption
  - supports EBS encryption
- Hive metastore can be externally hosted using RDS, Aurora, and AWS Glue Data Catalog
OpenSearch
- OpenSearch is a search service that supports indexing, full-text search, faceting, etc.
- OpenSearch can be used for analysis and supports visualization using OpenSearch Dashboards which can be real-time.
- OpenSearch Service Storage tiers support Hot, UltraWarm, and Cold and the data can be transitioned using Index State management.
QuickSight
- Know Supported Data Sources
- QuickSight provides IP addresses that need to be whitelisted for QuickSight to access the data store.
- QuickSight provides direct integration with Microsoft AD
- QuickSight supports row-level security using dataset rules to control access to data at row granularity based on permissions associated with the user interacting with the data.
- QuickSight supports ML insights as well
- QuickSight supports users defined via IAM or email signup.

AWS Lake Formation
- is an integrated data lake service that helps to discover, ingest, clean, catalog, transform, and secure data and make it available for analysis.
- automatically manages access to the registered data in S3 through services including AWS Glue, Athena, Redshift, QuickSight, and EMR
- provides central access control for the data, including table-and-column-level access controls, and encryption for data at rest.
Simple Storage Service – S3 as a storage service
- S3 storage classes with lifecycle policies based on usage to provide cost-effective storage solutions.
- S3 Event Notifications integrates with SNS and Lambda for real-time data processing
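As a small illustration of the S3 Event Notifications to Lambda path, the sketch below pulls the bucket and object key out of a notification payload. The function name is hypothetical; the `Records[].s3.bucket.name` and `Records[].s3.object.key` layout follows the standard S3 event structure, where object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus


def objects_from_s3_event(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded (for example, spaces become '+').
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs
```

A Lambda handler would typically call this first and then read or process each object.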
Data Pipeline for data transfer helps automate and schedule regular data movement and data processing activities in AWS.
Step Functions help build distributed applications, automate processes, orchestrate microservices, and create data and ML pipelines.

AppFlow is a fully managed integration service to securely exchange data between software-as-a-service (SaaS) applications, such as Salesforce, and AWS services, such as Simple Storage Service (S3) and Redshift.

Security, Identity & Compliance

Identity and Access Management (IAM)
- Understand IAM Roles

Key Management Service (KMS) provides key management for encryption at rest.
- Integrates with S3, Redshift, Kinesis
- S3 Integration with SSE-S3, SSE-C, SSE-KMS
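The three S3 server-side encryption options differ mainly in who supplies and manages the key. A rough sketch of how that shows up when uploading with boto3 is below; the helper name is hypothetical, and the returned keys are the `put_object` parameter names as I know them (S3-managed keys are called SSE-S3 here), so verify them against the SDK docs.

```python
def sse_put_kwargs(mode, kms_key_id=None, customer_key=None):
    """Build the encryption-related keyword arguments for S3 put_object."""
    if mode == "SSE-S3":
        # S3 manages the keys.
        return {"ServerSideEncryption": "AES256"}
    if mode == "SSE-KMS":
        # KMS manages the keys; without a key id the default AWS managed key is used.
        kwargs = {"ServerSideEncryption": "aws:kms"}
        if kms_key_id:
            kwargs["SSEKMSKeyId"] = kms_key_id
        return kwargs
    if mode == "SSE-C":
        # The caller supplies the key with every request.
        if customer_key is None:
            raise ValueError("SSE-C needs a customer-provided key")
        return {"SSECustomerAlgorithm": "AES256", "SSECustomerKey": customer_key}
    raise ValueError(f"unknown mode: {mode}")
```

It would be used along the lines of `s3.put_object(Bucket=..., Key=..., Body=..., **sse_put_kwargs("SSE-KMS", kms_key_id="alias/my-key"))`, where the alias is a placeholder.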

AWS Secrets Manager
- helps protect secrets needed to access applications, services, and IT resources.
Amazon Macie is a security service that uses machine learning to automatically discover, classify, and protect sensitive data in S3.

Management & Governance Tools

Understand AWS CloudWatch for Logs and Metrics.
CloudWatch Logs Subscription Filters can be used to route data to Kinesis Data Streams, Kinesis Data Firehose, and Lambda.
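When a subscription filter targets Lambda, the log batch arrives base64-encoded and gzip-compressed under `awslogs.data`. A minimal decoding sketch, with a hypothetical function name, could be:

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Return the log messages carried by a CloudWatch Logs subscription event."""
    raw = gzip.decompress(base64.b64decode(event["awslogs"]["data"]))
    payload = json.loads(raw)
    return [e["message"] for e in payload.get("logEvents", [])]
```

Records delivered through Kinesis Data Streams or Firehose carry the same gzip-compressed JSON, so a similar decompress-then-parse step applies there.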

On the Exam Day

Make sure you are relaxed and get a good night’s sleep. The exam is not tough if you are well-prepared.

If you are taking the AWS Online exam
- Try to join at least 30 minutes before the actual time as I have had issues with both PSI and Pearson with long wait times.
- The online verification process does take some time and usually, there are glitches.
- Remember, you will not be allowed to take the exam if you are late by more than 30 minutes.
- Make sure you have your desk clear, no hand-watches, or external monitors, keep your phones away, and nobody can enter the room.

Finally, All the Best 🙂

AWS Redshift Advanced

June 20, 2024 ~ Last updated on : June 20, 2024 ~ jayendrapatil

Redshift Distribution Style determines how data is distributed across compute nodes and helps minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.

Redshift enhanced VPC routing forces all COPY and UNLOAD traffic between the cluster and the data repositories through the VPC.
Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries.

Redshift Spectrum helps query and retrieve structured and semistructured data from files in S3 without having to load the data into Redshift tables.
Redshift Federated Query feature allows querying and analyzing data across operational databases, data warehouses, and data lakes.

Distribution Styles

Table distribution style determines how data is distributed across compute nodes and helps minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.

Redshift supports four distribution styles: AUTO, EVEN, KEY, and ALL.

KEY distribution

A single column acts as a distribution key (DISTKEY) and helps place matching values on the same node slice.
As a rule of thumb, choose a column that:
- is uniformly distributed – otherwise, skewed data causes imbalances in the volume of data stored on each compute node, so some slices process much more data than others and become bottlenecks.
- acts as a JOIN column – for tables related to dimension tables (star schema), it is better to choose as DISTKEY the field that acts as the JOIN field with the larger dimension table, so that matching values from the common columns are physically stored together, reducing the amount of data that needs to be broadcast over the network.

EVEN distribution

distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column

Choose EVEN distribution
- when the table does not participate in joins
- when there is not a clear choice between KEY and ALL distribution.

ALL distribution

Whole table is replicated in every compute node.
ensures that every row is collocated for every join that the table participates in.
ideal for relatively slow-moving tables, tables that are not updated frequently or extensively.

Small dimension tables DO NOT benefit significantly from ALL distribution, because the cost of redistribution is low.

AUTO distribution

Redshift assigns an optimal distribution style based on the size of the table data, e.g. it applies ALL distribution to a small table and changes it to EVEN distribution as the table grows.
Amazon Redshift applies AUTO distribution, by default.
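The effect of the distribution choice can be seen with a toy simulation. This is only a sketch: the slice count is made up, and an MD5-based hash stands in for whatever hash function Redshift actually uses internally.

```python
import hashlib
from collections import Counter

SLICES = 4  # hypothetical number of slices in the cluster


def slice_for(value):
    # Stand-in hash; same value always maps to the same slice.
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16) % SLICES


def key_distribution(rows, column):
    """KEY style: rows are placed by the hash of the distribution key."""
    return Counter(slice_for(r[column]) for r in rows)


def even_distribution(rows):
    """EVEN style: round-robin, regardless of column values."""
    return Counter(i % SLICES for i, _ in enumerate(rows))


# 1,000 orders where a single customer accounts for 90% of the rows.
rows = [{"order_id": i, "customer_id": 1 if i % 10 else i} for i in range(1000)]

skewed = key_distribution(rows, "customer_id")   # poor DISTKEY choice
uniform = key_distribution(rows, "order_id")     # high-cardinality DISTKEY
even = even_distribution(rows)

print("customer_id:", sorted(skewed.values()))
print("order_id:   ", sorted(uniform.values()))
print("EVEN:       ", sorted(even.values()))
```

With the skewed column, one slice ends up holding at least 900 of the 1,000 rows, which is exactly the bottleneck the rule of thumb for KEY distribution warns about.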

Sort Key

Sort keys define the order in which the data will be stored.
Sorting enables efficient handling of range-restricted predicates.
Only one sort key per table can be defined, but it can be composed of one or more columns.

Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If the query uses a range-restricted predicate, the query processor can use the min and max values to rapidly skip over large numbers of blocks during table scans
There are two kinds of sort keys in Redshift: Compound and Interleaved.
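The block-skipping behaviour described above (min and max values kept per block) can be illustrated with a small simulation. This is a sketch, not Redshift internals: blocks here hold a fixed number of rows rather than 1 MB of data, and all names are hypothetical.

```python
BLOCK_ROWS = 100  # hypothetical rows per block; real blocks are 1 MB


def build_zone_map(values):
    """Split a column into blocks and record (min, max) for each block."""
    blocks = [values[i:i + BLOCK_ROWS] for i in range(0, len(values), BLOCK_ROWS)]
    return [(min(b), max(b)) for b in blocks]


def blocks_to_scan(zone_map, low, high):
    """Count blocks whose [min, max] range overlaps the predicate range."""
    return sum(1 for lo, hi in zone_map if hi >= low and lo <= high)


sorted_col = list(range(10_000))
# Same values in a scattered order: each block spans almost the full range.
scattered_col = [(i * 7919) % 10_000 for i in range(10_000)]

sorted_scan = blocks_to_scan(build_zone_map(sorted_col), 2_000, 2_099)
scattered_scan = blocks_to_scan(build_zone_map(scattered_col), 2_000, 2_099)
print(sorted_scan, scattered_scan)
```

For the range predicate `2000 <= x <= 2099`, the sorted column needs a single block, while the scattered column has to read nearly all of them, which is why a sort key matching the common range filters pays off.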

Compound Keys

A compound key is made up of all of the columns listed in the sort key definition, in the order they are listed.

A compound sort key is more efficient when query predicates, such as filters and joins, use a prefix of the sort key, i.e. a subset of the sort key columns in order.
Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY and ORDER BY.

Interleaved Sort Keys

An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.

An interleaved sort key is more efficient when multiple queries use different columns for filters.
Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.
Use cases involve performing ad-hoc multi-dimensional analytics, which often requires pivoting, filtering, and grouping data using different columns as query dimensions.
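The two kinds of sort key differ only by a keyword in the table definition. The helper below is a hypothetical sketch that generates the `CREATE TABLE` statement for either style; the table and column names are invented.

```python
def create_table_ddl(table, columns, sort_columns, style="COMPOUND"):
    """Return a Redshift CREATE TABLE statement with the chosen sort key style."""
    if style not in ("COMPOUND", "INTERLEAVED"):
        raise ValueError("style must be COMPOUND or INTERLEAVED")
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return (f"CREATE TABLE {table} (\n  {cols}\n)\n"
            f"{style} SORTKEY ({', '.join(sort_columns)});")


columns = [("sale_date", "DATE"), ("region", "VARCHAR(16)"), ("amount", "DECIMAL(12,2)")]

# Queries mostly filter on sale_date, then region: a prefix of the key.
print(create_table_ddl("sales_by_date", columns, ["sale_date", "region"]))

# Ad-hoc queries filter on either column alone: equal weight per column.
print(create_table_ddl("sales_adhoc", columns, ["region", "amount"], style="INTERLEAVED"))
```

Note that the interleaved example deliberately leaves out `sale_date`, in line with the advice above about monotonically increasing columns.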

Constraints

Redshift does not support Indexes.
Redshift supports UNIQUE, PRIMARY KEY, and FOREIGN KEY constraints; however, they are for informational purposes only.
Redshift does not perform integrity checks for these constraints; the query planner uses them as hints to optimize execution.

Redshift does enforce NOT NULL column constraints.

Redshift Enhanced VPC Routing

Redshift enhanced VPC routing forces all COPY and UNLOAD traffic between the cluster and the data repositories through the VPC.
Without enhanced VPC routing, Redshift would route traffic through the internet, including traffic to other services within the AWS network.

Redshift Workload Management

Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries.
Redshift provides query queues, in order to manage concurrency and resource planning. Each queue can be configured with the following parameters:
- Slots: number of concurrent queries that can be executed in this queue.
- Working memory: percentage of memory assigned to this queue.
- Max. Execution Time: the amount of time a query is allowed to run before it is terminated.
Queries can be routed to different queues using Query Groups and User Groups.

As a rule of thumb, it is considered a best practice to have separate queues for long running resource-intensive queries and fast queries that don’t require big amounts of memory and CPU.
By default, Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one.
A maximum of eight queues can be defined, with each queue configured with a maximum concurrency level of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.
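A manual WLM setup with the queue parameters and limits above might be sketched as follows. The function is hypothetical; the property names (`query_concurrency`, `memory_percent_to_use`, `max_execution_time` in milliseconds, `query_group`, `user_group`) follow the `wlm_json_configuration` parameter format as I understand it, so check them against the Redshift docs before use.

```python
import json

MAX_QUEUES = 8              # maximum user-defined queues
MAX_TOTAL_CONCURRENCY = 50  # across all user-defined queues


def build_manual_wlm(queues):
    """Validate queue definitions against the limits above and return JSON."""
    if len(queues) > MAX_QUEUES:
        raise ValueError(f"at most {MAX_QUEUES} user-defined queues are allowed")
    total_slots = sum(q["query_concurrency"] for q in queues)
    if total_slots > MAX_TOTAL_CONCURRENCY:
        raise ValueError(f"total concurrency {total_slots} exceeds {MAX_TOTAL_CONCURRENCY}")
    if sum(q.get("memory_percent_to_use", 0) for q in queues) > 100:
        raise ValueError("memory percentages add up to more than 100")
    return json.dumps(queues, indent=2)


queues = [
    # Long-running, resource-intensive ETL: few slots, most of the memory.
    {"query_group": ["etl"], "query_concurrency": 3,
     "memory_percent_to_use": 60, "max_execution_time": 3_600_000},
    # Short analyst queries: many slots, less memory, short timeout.
    {"user_group": ["analysts"], "query_concurrency": 10,
     "memory_percent_to_use": 40, "max_execution_time": 60_000},
]
wlm_json = build_manual_wlm(queues)
print(wlm_json)
```

This mirrors the rule of thumb of keeping long-running and fast queries in separate queues.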

Redshift WLM supports two modes – Manual and Automatic
- Automatic WLM supports queue priorities.

Redshift Concurrency Scaling

Concurrency Scaling helps support thousands of concurrent users and concurrent queries, with consistently fast query performance.

With Concurrency scaling, Redshift automatically adds additional cluster capacity to process an increase in both read and write queries.
Queries return the most current data, whether they run on the main cluster or on a concurrency-scaling cluster.
Queries sent to the concurrency-scaling cluster can be managed by configuring WLM queues.

Redshift Short Query Acceleration – SQA

Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running queries.
SQA runs short-running queries in a dedicated space, so that SQA queries aren’t forced to wait in queues behind longer queries.
SQA only prioritizes queries that are short-running and are in a user-defined queue.

Redshift Loading Data

A COPY command is the most efficient way to load a table.
- COPY command is able to read from multiple data files or multiple data streams simultaneously.
- Redshift allocates the workload to the cluster nodes and performs the load operations in parallel, including sorting the rows and distributing data across node slices.
- COPY command supports loading data from S3, EMR, DynamoDB, and remote hosts such as EC2 instances using SSH.
- COPY supports decryption and can decrypt the data as it performs the load if the data is encrypted
- COPY can then speed up the load process by uncompressing the files as they are read if the data is compressed.
- COPY command can be used with COMPUPDATE set to ON to analyze and apply compression automatically based on sample data.
- Optimize storage for narrow tables (many rows, few columns) by loading them with a single COPY command instead of multiple COPY commands, which do not work well because of the hidden columns and the resulting compression issues.
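Putting the COPY points above together, a single load of several compressed files through a manifest could be prepared like this. It is a sketch: the bucket, table, and role ARN are placeholders, and the helper names are hypothetical.

```python
import json


def build_manifest(urls):
    """Return manifest JSON listing every file the COPY must load."""
    return json.dumps({"entries": [{"url": u, "mandatory": True} for u in urls]}, indent=2)


def build_copy_sql(table, manifest_url, iam_role_arn):
    """Return one COPY statement that loads all manifest files in parallel."""
    return (f"COPY {table}\n"
            f"FROM '{manifest_url}'\n"
            f"IAM_ROLE '{iam_role_arn}'\n"
            "FORMAT AS CSV\n"
            "MANIFEST\n"
            "GZIP\n"
            "COMPUPDATE ON;")


files = [f"s3://example-bucket/sales/part-{i:04d}.csv.gz" for i in range(4)]
manifest = build_manifest(files)
copy_sql = build_copy_sql(
    "sales",
    "s3://example-bucket/manifests/sales.manifest",
    "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(copy_sql)
```

Splitting the input into multiple files and loading them with one statement lets Redshift spread the work across slices, instead of serializing several separate COPY commands.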
Auto Copy
- Auto-copy provides the ability to automate copy statements by tracking S3 folders and ingesting new files without customer intervention
- Without Auto-copy, a copy statement immediately starts the file ingestion process for existing files.
- Auto-copy extends the existing copy command and provides the ability to
  - Automate file ingestion process by monitoring specified S3 paths for new files
  - Re-use copy configurations, reducing the need to create and run new copy statements for repetitive ingestion tasks and
  - Keep track of loaded files to avoid data duplication.

INSERT command
- Clients can connect to Amazon Redshift using ODBC or JDBC and issue ‘insert’ SQL commands to insert the data.
- INSERT command is much less efficient than COPY, as the inserts are routed through the single leader node.

Redshift Resizing Cluster

Elastic resize
- Use elastic resize to change the node type, number of nodes, or both. (Circa April 2020: changing the node type was added recently and was not supported before.)
- If only the number of nodes is changed, then queries are temporarily paused and connections are held open if possible.
- During the resize operation, the cluster is read-only.
- Elastic resize takes 10–15 minutes.
Classic resize
- Use classic resize to change the node type, number of nodes, or both.
- During the resize operation, data is copied to a new cluster and the source cluster is read-only
- Classic resize takes 2 hours – 2 days or longer, depending on the data’s size

Snapshot and restore with classic resize
- To keep the cluster available during a classic resize, create a snapshot, make a copy of the existing cluster, then resize the new cluster.

Redshift Spectrum

Redshift Spectrum helps query and retrieve structured and semistructured data from files in S3 without having to load the data into Redshift tables.

Redshift Spectrum queries employ massive parallelism to execute very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in S3.
Multiple clusters can concurrently query the same dataset in S3 without the need to make copies of the data for each cluster.
Redshift Spectrum resides on dedicated Redshift servers that are independent of the existing cluster.

Redshift Spectrum pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer.
Redshift Spectrum also scales automatically, based on the demands of the queries, and can potentially use thousands of instances to take advantage of massively parallel processing.
Supports external data catalog using Glue, Athena, or Hive metastore

Redshift cluster and the S3 bucket must be in the same AWS Region.
Redshift Spectrum external tables are read-only. You can’t COPY or INSERT to an external table.

Redshift Federated Query

Redshift Federated Query feature allows querying and analyzing data across operational databases, warehouses, and lakes.

Redshift Federated Query allows integrating queries on live data in RDS for PostgreSQL and Aurora PostgreSQL with queries across Redshift and S3.

AWS Certification Exam Practice Questions

Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).

AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.

AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated.

Open to further feedback, discussion and correction.

A Redshift data warehouse has different user teams that need to query the same table with very different query types. These user teams are experiencing poor performance. Which action improves performance for the user teams in this situation?
1. Create custom table views.
2. Add interleaved sort keys per team.
3. Maintain team-specific copies of the table.
4. Add support for workload management queue hopping.

Amazon Athena

June 19, 2024 ~ Last updated on : June 20, 2024 ~ jayendrapatil

Amazon Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats.

provides a simplified, flexible way to analyze petabytes of data in an S3 data lake and 30 data sources, including on-premises data sources or other cloud systems using SQL or Python without loading the data.
is built on open-source Trino and Presto engines and Apache Spark frameworks, with no provisioning or configuration effort required.

is highly available and runs queries using compute resources across multiple facilities, automatically routing queries appropriately if a particular facility is unreachable
can process unstructured, semi-structured, and structured datasets.
integrates with QuickSight for visualizing the data or creating dashboards.

supports various standard data formats, including CSV, TSV, JSON, ORC, Avro, and Parquet.
supports compressed data in Snappy, Zlib, LZO, and GZIP formats. You can improve performance and reduce costs by compressing, partitioning, and using columnar formats.
can handle complex analysis, including large joins, window functions, and arrays

uses a managed Glue Data Catalog to store information and schemas about the databases and tables that you create for the data stored in S3
uses schema-on-read technology, which means that the table definitions are applied to the data in S3 when queries are being applied. There’s no data loading or transformation required. Table definitions and schema can be deleted without impacting the underlying data stored in S3.
supports fine-grained access control with AWS Lake Formation which allows for centrally managing permissions and access control for data catalog resources in the S3 data lake.

Athena Workgroups

Athena workgroups can be used to separate users, teams, applications, or workloads, to set limits on the amount of data each query or the entire workgroup can process, and to track costs.
Resource-level identity-based policies can be used to control access to a specific workgroup.

Workgroups help view query-related metrics in CloudWatch, control costs by configuring limits on the amount of data scanned, create thresholds, and trigger actions, such as SNS, when these thresholds are breached.
Workgroups integrate with IAM, CloudWatch, Simple Notification Service, and AWS Cost and Usage Reports as follows:
- IAM identity-based policies with resource-level permissions control who can run queries in a workgroup.
- Athena publishes the workgroup query metrics to CloudWatch if you enable query metrics.
- SNS topics can be created that issue alarms to specified workgroup users when data usage controls for queries in a workgroup exceed the established thresholds.
- Workgroup tag can be configured as a cost allocation tag in the Billing and Cost Management console and the costs associated with running queries in that workgroup appear in the Cost and Usage Reports with that cost allocation tag.

Athena Best Practices

Partition the data
- which helps keep the related data together based on column values such as date, country, and region.
- Athena supports Hive partitioning
- Pick partition keys that will support the queries
- Partition projection is an Athena feature that stores partition information not in the Glue Data Catalog but as rules in the properties of the table in AWS Glue.
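To show what Hive-style partitioning and partition projection look like in practice, here is a small sketch. The helper names, bucket, and column are hypothetical; the table property keys (`projection.enabled`, `projection.<column>.type`, and so on) are the partition projection properties as I know them.

```python
def hive_partition_path(base, **partitions):
    """Build a Hive-style key=value prefix such as .../year=2024/month=06/."""
    parts = "/".join(f"{k}={v}" for k, v in partitions.items())
    return f"{base.rstrip('/')}/{parts}/"


def date_projection_properties(column, start, location_template):
    """Table properties that let Athena compute date partitions from rules."""
    return {
        "projection.enabled": "true",
        f"projection.{column}.type": "date",
        f"projection.{column}.range": f"{start},NOW",
        f"projection.{column}.format": "yyyy-MM-dd",
        "storage.location.template": location_template,
    }


print(hive_partition_path("s3://example-bucket/logs/", year=2024, month="06"))

props = date_projection_properties(
    "dt", "2024-01-01", "s3://example-bucket/logs/dt=${dt}/"
)
for key, value in props.items():
    print(f"'{key}' = '{value}'")
```

With projection enabled, new daily prefixes become queryable without running a crawler or adding partitions to the catalog, because Athena derives them from the range rule.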
Compression
- Compressing the data can speed up queries significantly, as long as the files are either of an optimal size or the files are splittable.
- Smaller data sizes reduce the data scanned from S3, resulting in lower costs of running queries and reduced network traffic.
Optimize file sizes
- Queries run more efficiently when data scanning can be parallelized and when blocks of data can be read sequentially.
Columnar file formats
- Columnar storage formats like ORC and Parquet are optimized for fast retrieval of data as they allow compression and are splittable.
- A splittable file can be read in parallel by the execution engine in Athena, whereas an unsplittable file can’t be read in parallel.
Optimize queries

AWS Certification Exam Practice Questions

A SysOps administrator is storing access logs in Amazon S3 and wants to use standard SQL to query data and generate a report without having to manage infrastructure. Which AWS service will allow the SysOps administrator to accomplish this task?
1. Amazon Inspector
2. Amazon CloudWatch
3. Amazon Athena
4. Amazon RDS
A Solutions Architect must design a storage solution for incoming billing reports in CSV format. The data does not need to be scanned frequently and is discarded after 30 days. Which service will be MOST cost-effective in meeting these requirements?
1. Import the logs into an RDS MySQL instance
2. Use AWS Data pipeline to import the logs into a DynamoDB table
3. Write the files to an S3 bucket and use Amazon Athena to query the data
4. Import the logs to an Amazon Redshift cluster

References

Amazon_Athena

AWS Certified Machine Learning -Specialty (MLS-C01) Exam Learning Path

June 4, 2024 ~ Last updated on : June 4, 2024 ~ jayendrapatil

Finally re-certified on the updated AWS Certified Machine Learning – Specialty (MLS-C01) certification exam after 3 months of preparation.

In terms of the difficulty level of all professional and specialty certifications, I find this to be the toughest, partly because I am still diving deep into machine learning and relearned everything from the basics for this certification.
Machine Learning is a vast specialization in itself and with AWS services, there is a lot to cover and know for the exam. This is the only exam where the majority of the focus is on concepts outside of AWS i.e. pure machine learning. It also includes AWS Machine Learning and Data Engineering services.

AWS Certified Machine Learning – Specialty (MLS-C01) Exam Content

AWS Certified Machine Learning – Specialty (MLS-C01) exam validates the ability to
- Select and justify the appropriate ML approach for a given business problem.
- Identify appropriate AWS services to implement ML solutions.
- Design and implement scalable, cost-optimized, reliable, and secure ML solutions.

Refer AWS Certified Machine Learning – Specialty Exam Guide for details

AWS Certified Machine Learning – Specialty Domains

AWS Certified Machine Learning – Specialty (MLS-C01) Exam Summary

Specialty exams are tough, lengthy, and tiresome. Most of the questions and answers options have a lot of prose and a lot of reading that needs to be done, so be sure you are prepared and manage your time well.

MLS-C01 exam has 65 questions to be solved in 180 minutes, which gives you roughly 2 3/4 minutes to attempt each question.
MLS-C01 exam includes two types of questions, multiple-choice and multiple-response.
MLS-C01 has a scaled score between 100 and 1,000. The scaled score needed to pass the exam is 750.

Specialty exams currently cost $ 300 + tax.
You can get an additional 30 minutes if English is your second language by requesting Exam Accommodations. It might not be needed for Associate exams but is helpful for Professional and Specialty ones.
As always, mark the questions for review, move on, and come back to them after you are done with all of them.

As always, having a rough architecture or mental picture of the setup helps focus on the areas that you need to improve. Trust me, you will be able to eliminate 2 answers for sure and then need to focus on only the other two. Read the other 2 answers to check where they differ; that would help you reach the right answer or at least have a 50% chance of getting it right.
AWS exams can be taken either at a test center or online, I prefer to take them online as it provides a lot of flexibility. Just make sure you have a proper place to take the exam with no disturbance and nothing around you.
Also, if you are taking the AWS Online exam for the first time, try to join at least 30 minutes before the actual time as I have had issues with both PSI and Pearson with long wait times.

AWS Certified Machine Learning – Specialty (MLS-C01) Exam Resources

Online Courses
- Stephane Maarek – AWS Certified Machine Learning Specialty Exam
- Whizlabs – AWS Certified Machine Learning Specialty Course
- Exam Readiness: AWS Certified Machine Learning – Specialty
Practice tests
- Braincert – AWS Certified Machine Learning – Specialty MLS-C01 Practice Exams
- Whizlabs – AWS Certified Machine Learning Specialty Practice Tests

AWS Certified Machine Learning – Specialty (MLS-C01) Exam Topics

AWS Certified Machine Learning – Specialty exam digs deep into Machine Learning concepts, most of which are not related to AWS.
AWS Certified Machine Learning – Specialty exam covers the end-to-end (E2E) Machine Learning lifecycle, right from data collection and transformation, through pre-processing the data to make it usable and efficient for Machine Learning, to training, validation, and implementation.

Machine Learning Concepts

Exploratory Data Analysis
- Feature selection and Engineering
  - remove features that are not related to training
  - remove features that have the same values, very low correlation, very little variance, or a lot of missing values
  - Apply techniques like Principal Component Analysis (PCA) for dimensionality reduction i.e. reduce the number of features.
  - Apply techniques such as One-hot encoding and label encoding to help convert strings to numeric values, which are easier to process.
  - Apply Normalization i.e. values between 0 and 1 to handle data with large variance.
  - Apply feature engineering for feature reduction e.g. using a single height/weight feature instead of both features.
- Handle Missing data
  - remove the feature or rows with missing data
  - impute using Mean/Median values – valid only for numeric features, not categorical ones; also does not factor in correlation between features
  - impute using k-NN, Multivariate Imputation by Chained Equation (MICE), Deep Learning – more accurate and factors in correlation between features
- Handle unbalanced data
  - Source more data
  - Oversample minority or Undersample majority
  - Data augmentation using techniques like Synthetic Minority Oversampling Technique (SMOTE).
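The data preparation steps above can be sketched in a few lines of plain Python. This is only a minimal illustration: the function names are my own, missing values are assumed to be represented as `None`, and the balancing step is simple random oversampling of the minority class, not SMOTE (which synthesizes new points instead of duplicating existing ones).

```python
import random
from collections import Counter
from statistics import median


def min_max_normalize(values):
    """Rescale numeric values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature, nothing to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(labels):
    """Turn string categories into 0/1 indicator vectors (columns sorted by name)."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels
```

Median imputation here ignores any correlation between features, which is exactly the limitation noted above for Mean/Median imputation.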
Modeling
- Know about Algorithms – Supervised, Unsupervised and Reinforcement and which algorithm is best suited based on the available data, either labelled or unlabelled.
  - Supervised learning trains on labelled data e.g. Linear regression, Logistic regression, Decision trees, Random Forests
  - Unsupervised learning trains on unlabelled data e.g. PCA, SVD, K-means
  - Reinforcement learning trains based on actions and rewards e.g. Q-Learning
- Hyperparameters
  - are parameters exposed by machine learning algorithms that control how the underlying algorithm operates and their values affect the quality of the trained models
  - some of the common hyperparameters are learning rate, batch size, and number of epochs (hint: if the learning rate is too large, the minimum might be overshot and the loss would oscillate; if the learning rate is too small, training requires too many steps, which takes longer and is less efficient)
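The learning rate hint is easy to see on a toy problem. The sketch below runs plain gradient descent on f(x) = x² (my own choice of function, start value, and tolerance, purely for illustration) and returns the final x together with the number of steps used.

```python
def gradient_descent(learning_rate, start=10.0, steps=50, tolerance=1e-3):
    """Minimise f(x) = x**2 and return (final_x, steps_used)."""
    x = start
    for step in range(1, steps + 1):
        gradient = 2 * x                  # derivative of x**2
        x = x - learning_rate * gradient  # standard update rule
        if abs(x) < tolerance:            # close enough to the minimum at 0
            return x, step
    return x, steps


for rate in (0.001, 0.1, 1.1):
    final_x, used = gradient_descent(rate)
    print(f"learning_rate={rate}: x={final_x:.4g} after {used} steps")
```

With a rate of 0.1 the run reaches the tolerance within the step budget, with 0.001 it is still far from the minimum after 50 steps, and with 1.1 every update overshoots so x flips sign and grows instead of converging.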

Evaluation
- Know difference in evaluating model accuracy
  - Use Area Under the (Receiver Operating Characteristic) Curve (AUC) for Binary classification
  - Use root mean square error (RMSE) metric for regression
- Understand Confusion matrix
  - A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.
  - A false positive is an outcome where the model incorrectly predicts the positive class. A false negative is an outcome where the model incorrectly predicts the negative class.
  - Recall or Sensitivity or TPR (True Positive Rate): Number of items correctly identified as positive out of all actual positives - TP/(TP+FN) (hint: use this for cases like fraud detection, where the cost of marking non-fraud as fraud is lower than marking fraud as non-fraud)
  - Specificity or TNR (True Negative Rate): Number of items correctly identified as negative out of total negatives- TN/(TN+FP) (hint: use this for cases like videos for kids, the cost of dropping few valid videos is lower than showing few bad ones)
- Handle Overfitting problems
  - Simplify the model, by reducing the number of layers
  - Early Stopping – form of regularization while training a model with an iterative method, such as gradient descent
  - Data Augmentation
  - Regularization – technique to reduce the complexity of the model
  - Dropout is a regularization technique that prevents overfitting
  - Never train on test data
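The evaluation metrics above are simple to compute by hand. A minimal sketch, assuming binary labels where `1` is the positive class (function names are my own):

```python
import math


def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, TN, FN) for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return tp, fp, tn, fn


def recall(tp, fn):
    """True Positive Rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True Negative Rate: TN / (TN + FP)."""
    return tn / (tn + fp)


def rmse(actual, predicted):
    """Root mean square error for regression predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

For fraud detection you would watch `recall`, as a missed fraud (false negative) is the expensive mistake; for the kids' video case you would watch `specificity`.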

Machine Learning Services

SageMaker
- supports File mode, Pipe mode, and Fast File mode for accessing training data
  - File mode loads all of the data from S3 to the training instance volumes, whereas Pipe mode streams data directly from S3.
  - File mode needs disk space to store both the final model artifacts and the full training dataset, whereas Pipe mode helps reduce the required size of the EBS volumes.
  - Fast File mode combines the ease of use of the existing File mode with the performance of Pipe mode.
- Using RecordIO format allows algorithms to take advantage of Pipe mode when training the algorithms that support it.
- supports Model tracking capability to manage up to thousands of machine learning model experiments
- supports automatic scaling for production variants. Automatic scaling dynamically adjusts the number of instances provisioned for a production variant in response to changes in your workload
- provides pre-built Docker images for its built-in algorithms and the supported deep learning frameworks used for training & inference
- SageMaker Automatic Model Tuning
  - is the process of finding a set of hyperparameters for an algorithm that can yield an optimal model.
  - Best practices
    - limit the number of hyperparameters to search, as the difficulty of a hyperparameter tuning job depends primarily on the number of hyperparameters that Amazon SageMaker has to search
    - DO NOT specify a very large range to cover every possible value for a hyperparameter as it affects the success of hyperparameter optimization.
    - if a hyperparameter is known to be log-scaled, convert it to a logarithmic scale to improve hyperparameter optimization.
    - running one training job at a time achieves the best results with the least amount of compute time.
    - Design distributed training jobs so that they report the objective metric that you want.
- know how to take advantage of multiple GPUs (hint: increase learning rate and batch size w.r.t to the increase in GPUs)
- Elastic Inference (now replaced by Inferentia) helps attach low-cost GPU-powered acceleration to EC2 and SageMaker instances or ECS tasks to reduce the cost of running deep learning inference.
- SageMaker Inference options.
  - Real-time inference is ideal for online inferences that have low latency or high throughput requirements.
  - Serverless Inference is ideal for intermittent or unpredictable traffic patterns as it manages all of the underlying infrastructure with no need to manage instances or scaling policies.
  - Batch Transform is suitable for offline processing when large amounts of data are available upfront and you don’t need a persistent endpoint.
  - Asynchronous Inference is ideal when you want to queue requests and have large payloads with long processing times.
- SageMaker Model deployment allows deploying multiple variants of a model to the same SageMaker endpoint to test new models without impacting the user experience
  - Production Variants
    - supports A/B or Canary testing where you can allocate a portion of the inference requests to each variant.
    - helps compare production variants’ performance relative to each other.
  - Shadow Variants
    - replicates a portion of the inference requests that go to the production variant to the shadow variant.
    - logs the responses of the shadow variant for comparison; they are not returned to the caller.
    - helps test the performance of the shadow variant without exposing the caller to the response produced by the shadow variant.
- SageMaker Managed Spot training can help use spot instances to save cost and with Checkpointing feature can save the state of ML models during training
- SageMaker Feature Store
  - helps to create, share, and manage features for ML development.
  - is a centralized store for features and associated metadata so features can be easily discovered and reused.
- SageMaker Debugger provides tools to debug training jobs and resolve problems such as overfitting, saturated activation functions, and vanishing gradients to improve the model’s performance.
- SageMaker Model Monitor monitors the quality of SageMaker machine learning models in production and can help set alerts that notify when there are deviations in the model quality.
- SageMaker Data Wrangler
  - reduces the time it takes to aggregate and prepare tabular and image data for ML from weeks to minutes.
  - simplifies the process of data preparation (including data selection, cleansing, exploration, visualization, and processing at scale) and feature engineering.
- SageMaker Experiments is a capability of SageMaker that lets you create, manage, analyze, and compare machine learning experiments.
- SageMaker Clarify helps improve the ML models by detecting potential bias and helping to explain the predictions that the models make.
- SageMaker Model Governance is a framework that gives systematic visibility into ML model development, validation, and usage.
- SageMaker Autopilot is an automated machine learning (AutoML) feature set that automates the end-to-end process of building, training, tuning, and deploying machine learning models.
- SageMaker Neo enables machine learning models to train once and run anywhere in the cloud and at the edge.
- SageMaker API and SageMaker Runtime support VPC interface endpoints powered by AWS PrivateLink that helps connect VPC directly to the SageMaker API or SageMaker Runtime using AWS PrivateLink without using an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection.
- Algorithms –
  - BlazingText provides Word2vec and text classification algorithms
  - DeepAR is a supervised learning algorithm for forecasting scalar (one-dimensional) time series (hint: train for new products based on existing products sales data).
  - Factorization machines is a supervised learning algorithm for classification and regression tasks that helps capture interactions between features within high dimensional sparse datasets economically.
  - Image classification algorithm is a supervised learning algorithm that supports multi-label classification.
  - IP Insights is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses.
  - K-means is an unsupervised learning algorithm for clustering as it attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups.
  - k-nearest neighbors (k-NN) algorithm is an index-based algorithm. It uses a non-parametric method for classification or regression.
  - Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. Used to discover a user-specified number of topics shared by documents within a text corpus
  - Neural Topic Model (NTM) Algorithm is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution
  - Linear models are supervised learning algorithms used for solving either classification or regression problems.
    - For regression (predictor_type='regressor'), the score is the prediction produced by the model.
    - For classification (predictor_type='binary_classifier' or predictor_type='multiclass_classifier'), the model returns a score and also a predicted_label.
  - Object Detection algorithm detects and classifies objects in images using a single deep neural network
  - Principal Component Analysis (PCA) is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) (hint: dimensionality reduction)
  - Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points (hint: anomaly detection)
  - Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens. (hint: text summarization is the key use case)
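To make the Production Variants point above concrete, here is a small sketch of how variant weights translate into traffic share for A/B or Canary testing. The dictionaries are shaped like the `ProductionVariants` entries of the SageMaker `CreateEndpointConfig` API, but the variant names, model names, and weights are hypothetical, and the helper function is my own; nothing here calls AWS.

```python
# Hypothetical variants, shaped like ProductionVariants entries
# passed to the SageMaker CreateEndpointConfig API.
production_variants = [
    {
        "VariantName": "model-a",        # hypothetical: current model
        "ModelName": "churn-model-v1",   # hypothetical model name
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 9.0,
    },
    {
        "VariantName": "model-b",        # hypothetical: candidate model
        "ModelName": "churn-model-v2",   # hypothetical model name
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,
    },
]


def traffic_share(variants):
    """Fraction of requests per variant: its weight divided by the sum of all weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


print(traffic_share(production_variants))
```

With weights of 9 and 1, roughly 90% of the requests go to the existing model and 10% to the candidate, which is the usual Canary setup.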

SageMaker Ground Truth
- provides automated data labeling using machine learning
- helps build highly accurate training datasets for machine learning quickly using Amazon Mechanical Turk
- provides annotation consolidation to help improve the accuracy of the data object’s labels. It combines the results of multiple workers’ annotation tasks into one high-fidelity label.
- automated data labeling uses machine learning to label portions of the data automatically without having to send them to human workers

Machine Learning & AI Managed Services

Comprehend
- natural language processing (NLP) service to find insights and relationships in text.
- identifies the language of the text; extracts key phrases, places, people, brands, or events; understands how positive or negative the text is; analyzes text using tokenization and parts of speech; and automatically organizes a collection of text files by topic.
Lex
- provides conversational interfaces using voice and text helpful in building voice and text chatbots
Polly
- converts text into speech
- supports Speech Synthesis Markup Language (SSML) tags like prosody so users can adjust the speech rate, pitch or volume.
- supports pronunciation lexicons to customize the pronunciation of words
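As a rough illustration of the SSML prosody tag mentioned for Polly, the sketch below builds an SSML document as a string. The helper name and default values are my own, and it only produces the markup; sending it to Polly (and which attributes a given voice honours) is outside this example.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, rate="medium", pitch="medium", volume="medium"):
    """Wrap text in <speak> and <prosody> tags, escaping XML special characters."""
    return (
        "<speak>"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)} volume={quoteattr(volume)}>"
        f"{escape(text)}"
        "</prosody>"
        "</speak>"
    )


print(build_ssml("Welcome to the exam guide", rate="slow", volume="loud"))
```

Escaping matters because SSML is XML: a raw `&` or `<` in the text would make the document invalid.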
Rekognition – analyze images and video
- helps identify objects, people, text, scenes, and activities in images and videos, as well as detect any inappropriate content.
Translate – natural and fluent language translation
Transcribe – automatic speech recognition (ASR) speech-to-text

Kendra – an intelligent search service that uses NLP and advanced ML algorithms to return specific answers to search questions from your data.
Panorama brings computer vision to the on-premises camera network.
Augmented AI (Amazon A2I) is an ML service that makes it easy to build the workflows required for human review.

Forecast – fully managed service that uses ML to deliver highly accurate time-series forecasts.

Analytics

Make sure you know and understand data engineering concepts mainly in terms of data capture, migration, transformation, and storage.
Kinesis
- Understand Kinesis Data Streams and Kinesis Data Firehose in depth
- Kinesis Data Analytics can process and analyze streaming data using standard SQL and integrates with Data Streams and Firehose
- Know Kinesis Data Streams vs Kinesis Firehose
  - Know Kinesis Data Streams is open ended on both producer and consumer. It supports KCL and works with Spark.
  - Know Kinesis Firehose is open ended for producer only. Data is delivered to destinations such as S3, Redshift and OpenSearch (ElasticSearch).
  - Kinesis Firehose works in batches with a minimum 60 secs interval.
  - Kinesis Data Firehose supports data transformation using a Lambda function and record format conversion (hint: JSON can be converted into Parquet or ORC; CSV needs to be transformed into JSON by the Lambda function first)
- Kinesis Video Streams provides a fully managed service to ingest, index, store, and stream live video. HLS can be used to view a Kinesis video stream, either for live playback or to view archived video.
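The Firehose data transformation point can be sketched as a Lambda handler. Firehose passes records with a `recordId` and base64-encoded `data`, and expects each one back with the same `recordId`, a `result`, and base64-encoded `data`. The three-column CSV layout (`id,name,amount`) is a made-up example, so treat this as a sketch and not a production transform.

```python
import base64
import json


def lambda_handler(event, context):
    """Convert each CSV record handed over by Firehose into a JSON line."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        fields = line.split(",")
        try:
            # Hypothetical CSV layout: id,name,amount
            if len(fields) != 3:
                raise ValueError("unexpected column count")
            doc = {"id": fields[0], "name": fields[1], "amount": float(fields[2])}
        except ValueError:
            # Hand the record back untouched and flag it as failed
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue
        payload = (json.dumps(doc) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload).decode("utf-8"),
        })
    return {"records": output}
```

Once the records are JSON, Firehose record format conversion can take over for the Parquet or ORC step.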
OpenSearch (ElasticSearch) is a search service that supports indexing, full-text search, faceting, etc.
Data Pipeline helps define data-driven flows to automate and schedule regular data movement and data processing activities in AWS
Glue is a fully managed, ETL (extract, transform, and load) service that automates the time-consuming steps of data preparation for analytics
- helps setup, orchestrate, and monitor complex data flows.
- Glue Data Catalog is a central repository to store structural and operational metadata for all the data assets.
- Glue crawler connects to a data store, extracts the schema of the data, and then populates the Glue Data Catalog with this metadata
- Glue DataBrew is a visual data preparation tool that enables users to clean and normalize data without writing any code.
DataSync is an online data transfer service that simplifies, automates, and accelerates moving data between storage systems and services.

Security, Identity & Compliance

Security is covered very lightly. (hint: SageMaker can read data from KMS-encrypted S3. Make sure the KMS key policies include the role attached to SageMaker)

Management & Governance Tools

Understand AWS CloudWatch for Logs and Metrics. (hint: SageMaker is integrated with CloudWatch and logs and metrics are all stored in it)

Storage

Understand Data Storage Options – Know patterns for S3 vs RDS vs DynamoDB vs Redshift. (hint: S3 is the default storage option for Big Data, so look for it in the answers.)

Whitepapers and articles

AWS SageMaker Built-in Algorithms Summary

June 3, 2024 ~ Last updated on : June 4, 2024 ~ jayendrapatil

SageMaker Built-in Algorithms

SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and ML practitioners get started on training and deploying ML models quickly.

Text-based

BlazingText algorithm

provides highly optimized implementations of the Word2vec and text classification algorithms.

Word2vec algorithm
- useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc.
- maps words to high-quality distributed vectors, whose representation is called word embeddings
- word embeddings capture the semantic relationships between words.
Text classification
- is an important task for applications performing web searches, information retrieval, ranking, and document classification

provides the Skip-gram and continuous bag-of-words (CBOW) training architectures for Word2vec
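The difference between the two Word2vec training architectures is easiest to see in the training examples they generate. A minimal sketch with my own helper names and a context window of one word on each side: Skip-gram predicts each context word from the center word, while CBOW predicts the center word from its surrounding context.

```python
def skipgram_pairs(tokens, window=1):
    """(center, context) pairs: the center word predicts each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    """(context_words, center) pairs: the neighbours together predict the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


sentence = ["the", "cat", "sat"]
print(skipgram_pairs(sentence))
print(cbow_pairs(sentence))
```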

Forecasting

DeepAR

is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).
use the trained model to generate forecasts for new time series that are similar to the ones it has been trained on.

Recommendation

Factorization Machine

is a general-purpose supervised learning algorithm used for both classification and regression tasks.
extension of a linear model designed to capture interactions between features within high dimensional sparse datasets economically, such as click prediction and item recommendation.
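To show what "capturing interactions between features" means, here is a sketch of the standard second-order factorization machine prediction: a bias, a linear term, and a pairwise term where the weight of each feature pair is the dot product of two learned factor vectors. The parameter names (`w0`, `w`, `v`) and values are illustrative, and this naive double loop is not how SageMaker implements it.

```python
def fm_predict(x, w0, w, v):
    """Second-order FM score: w0 + sum(w_i * x_i) + sum over i<j of <v_i, v_j> * x_i * x_j."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    interactions = 0.0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            # interaction weight comes from the factor vectors, not a dense n x n matrix
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            interactions += dot * x[i] * x[j]
    return w0 + linear + interactions


# Sparse one-hot style input, e.g. (user, unused feature, item) for click prediction
x = [1, 0, 1]
print(fm_predict(x, w0=0.5, w=[0.1, 0.2, 0.3], v=[[1, 0], [0, 1], [1, 1]]))
```

As each feature only needs a short factor vector, the model stays small even when the one-hot encoded input has a very large number of columns.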

Clustering

K-means algorithm

is an unsupervised learning algorithm for clustering

attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups
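The grouping idea can be sketched with a tiny version of the classic k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). This is a plain-Python illustration on 2-D points with caller-supplied starting centroids, not the SageMaker implementation.

```python
def kmeans(points, centroids, iterations=10):
    """Cluster 2-D points; returns (centroids, clusters)."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:     # nothing moved, so the result is stable
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(centroids)
print(clusters)
```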

Classification

K-nearest neighbors (k-NN) algorithm

is an index-based algorithm.
uses a non-parametric method for classification or regression.

For classification problems, the algorithm queries the k points that are closest to the sample point and returns the most frequently used label of their class as the predicted label.
For regression problems, the algorithm queries the k closest points to the sample point and returns the average of their target values as the predicted value.
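Both behaviours can be sketched with a brute-force nearest neighbour search. The training data layout (a list of `(features, label_or_target)` tuples) and function names are my own; the SageMaker algorithm builds an index instead of scanning every point, and for regression this sketch averages the neighbours' target values.

```python
from collections import Counter


def k_nearest(train, query, k):
    """Return the k training rows closest to query (squared Euclidean distance)."""
    def distance(row):
        features = row[0]
        return sum((a - b) ** 2 for a, b in zip(features, query))
    return sorted(train, key=distance)[:k]


def knn_classify(train, query, k=3):
    """Predict the most frequent label among the k nearest neighbours."""
    neighbours = k_nearest(train, query, k)
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Predict the average target value of the k nearest neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(target for _, target in neighbours) / len(neighbours)
```

For example, `knn_classify(data, (1.5, 1.5), k=3)` returns whichever label holds the majority among the three closest training points.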

Linear Learner

is a supervised learning algorithm used for solving either classification or regression problems

XGBoost (eXtreme Gradient Boosting)

is a popular and efficient open-source implementation of the gradient boosted trees algorithm.
Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models

Topic Modelling

Latent Dirichlet Allocation (LDA)

is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories.

used to discover a user-specified number of topics shared by documents within a text corpus.

Neural Topic Model (NTM)

is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution
Topic modeling can be used to classify or summarize documents based on the topics detected or to retrieve information or recommend content based on topic similarities.

Feature Reduction

Object2Vec

is a general-purpose neural embedding algorithm that is highly customizable
can learn low-dimensional dense embeddings of high-dimensional objects.

Principal Component Analysis – PCA

is an unsupervised ML algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible.
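As a rough picture of what dimensionality reduction does, the sketch below finds the first principal component of a small dataset with power iteration on the covariance matrix and projects each row onto it, turning several features into one. This is a plain-Python illustration with my own function names, not the SageMaker PCA implementation, and it assumes the all-ones starting vector is not orthogonal to the main direction of the data.

```python
def first_principal_component(rows, iterations=100):
    """Return (feature_means, unit_vector) for the direction of largest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)] for a in range(d)]
    vec = [1.0] * d  # assumption: not orthogonal to the top component
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    return means, vec


def project(rows, means, component):
    """Reduce each row to a single value along the given component."""
    return [
        sum((value - mean) * weight for value, mean, weight in zip(row, means, component))
        for row in rows
    ]


rows = [(1, 2), (2, 4), (3, 6), (4, 8)]  # second feature is always twice the first
means, component = first_principal_component(rows)
print(component)
print(project(rows, means, component))
```

In this toy data the second feature carries no extra information, so a single projected value per row retains all of the variance.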

Anomaly Detection

Random Cut Forest (RCF)

is an unsupervised algorithm for detecting anomalous data points within a data set.

IP Insights

is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses.
designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers

Sequence Translation

Sequence to Sequence – seq2seq

is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio), and the output generated is another sequence of tokens.
key use cases are machine translation (input a sentence from one language and predict what that sentence would be in another language), text summarization (input a longer string of words and predict a shorter string of words that is a summary), speech-to-text (audio clips converted into output sentences in tokens)

Computer Vision – CV

Image classification

a supervised learning algorithm that supports multi-label classification

takes an image as input and outputs one or more labels
uses a convolutional neural network (ResNet) that can be trained from scratch or trained using transfer learning when a large number of training images are not available.
recommended input format is Apache MXNet RecordIO. Also supports raw images in .jpg or .png format.

Object Detection

detects and classifies objects in images using a single deep neural network.
is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene.

Semantic Segmentation

provides a fine-grained, pixel-level approach to developing computer vision applications.

tags every pixel in an image with a class label from a predefined set of classes and is critical to an increasing number of CV applications, such as self-driving vehicles, medical imaging diagnostics, and robot sensing.
also provides information about the shapes of the objects contained in the image. The segmentation output is represented as a grayscale image, called a segmentation mask.

AWS Certification Exam Practice Questions

An Analytics team is leading an organization and wants to use anomaly detection to identify potential risks. What Amazon SageMaker machine learning algorithms are best suited for identifying anomalies?
1. Semantic segmentation
2. K-nearest neighbors
3. Latent Dirichlet Allocation (LDA)
4. Random Cut Forest (RCF)
An ML specialist team working for a marketing consulting firm wants to apply different marketing strategies per segment of its customer base. Online retailer purchase history from the last 5 years is available, and it has been decided to segment the customers based on their purchase history. Which type of machine learning algorithm would give you segmentation based on purchase history in the most expeditious manner?
1. K-Nearest Neighbors (KNN)
2. K-Means
3. Semantic Segmentation
4. Neural Topic Model (NTM)
An ML specialist team is looking to improve the quality of searches for their library of documents that are uploaded in PDF, Rich Text Format, or ASCII text. It is looking to use machine learning to automate the identification of key topics for each of the documents. What machine learning resources are best suited for this problem? (Select TWO)
1. BlazingText algorithm
2. Latent Dirichlet Allocation (LDA) algorithm
3. Topic Finder (TF) algorithm
4. Neural Topic Model (NTM) algorithm

A manufacturing company has a large set of labeled historical sales data. The company would like to predict how many units of a particular part should be produced each quarter. Which machine learning approach should be used to solve this problem?
1. BlazingText algorithm
2. Random Cut Forest (RCF)
3. Principal component analysis (PCA)
4. Linear regression
An agency collects census information with responses for approximately 500 questions from each citizen. Which algorithm would help reduce the number of features?
1. Factorization machines (FM) algorithm
2. Latent Dirichlet Allocation (LDA) algorithm
3. Principal component analysis (PCA) algorithm
4. Random Cut Forest (RCF) algorithm
A store wants to understand some characteristics of visitors to the store. The store has security video recordings from the past several years. The store wants to group visitors by hair style and hair color. Which solution will meet these requirements with the LEAST amount of effort?
1. Object detection algorithm
2. Latent Dirichlet Allocation (LDA) algorithm
3. Random Cut Forest (RCF) algorithm
4. Semantic segmentation algorithm

References

SageMaker_Built-in_Algorithms

AWS Machine Learning Services – Cheat Sheet

May 12, 2024 ~ Last updated on : June 4, 2024 ~ jayendrapatil

AWS Machine Learning Services

Amazon SageMaker

Build, train, and deploy machine learning models at scale
fully-managed service that enables data scientists and developers to quickly and easily build, train & deploy machine learning models.

enables developers and scientists to build machine learning models for use in intelligent, predictive apps.
is designed for high availability with no maintenance windows or scheduled downtimes.
allows users to select the number and type of instance used for the hosted notebook, training & model hosting.

models can be deployed as real-time endpoints or used for batch inference.
supports Canary deployment using ProductionVariant and deploying multiple variants of a model to the same SageMaker HTTPS endpoint.
supports Jupyter notebooks.

Users can persist their notebook files on the attached ML storage volume.
Users can modify the notebook instance and select a larger profile through the SageMaker console, after saving their files and data on the attached ML storage volume.
includes built-in algorithms for linear regression, logistic regression, k-means clustering, principal component analysis, factorization machines, neural topic modeling, latent dirichlet allocation, gradient boosted trees, seq2seq, time series forecasting, word2vec & image classification

algorithms work best when using the optimized protobuf recordIO format for the training data, which allows Pipe mode that streams data directly from S3 and helps achieve faster start times and reduced space requirements
lets you use built-in algorithms, use pre-built container images, extend a pre-built container image, or even build your custom container image.
supports users' custom training algorithms provided through a Docker image adhering to the documented specification.

also provides optimized MXNet, Tensorflow, Chainer & PyTorch containers
ensures that ML model artifacts and other system artifacts are encrypted in transit and at rest.
requests to the API and console are made over a secure (SSL) connection.

stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.
SageMaker Neo is a new capability that enables machine learning models to train once and run anywhere in the cloud and at the edge.

Amazon Textract

Textract provides OCR and helps add document text detection and analysis to the applications.

includes simple, easy-to-use API operations that can analyze image files and PDF files.

Amazon Comprehend

Comprehend is a managed natural language processing (NLP) service to find insights and relationships in text.
identifies the language of the text; extracts key phrases, places, people, brands, or events; understands how positive or negative the text is; analyzes text using tokenization and parts of speech; and automatically organizes a collection of text files by topic.

can analyze a collection of documents and other text files (such as social media posts) and automatically organize them by relevant terms or topics.

Amazon Lex

is a service for building conversational interfaces using voice and text.
provides the advanced deep learning functionalities of automatic speech recognition (ASR) for converting speech to text, and natural language understanding (NLU) to recognize the intent of the text, to enable building applications with highly engaging user experiences and lifelike conversational interactions.

common use cases of Lex include: Application/Transactional bot, Informational bot, Enterprise Productivity bot, and Device Control bot.
leverages Lambda for Intent fulfillment, Cognito for user authentication & Polly for text-to-speech.
scales to customers’ needs and does not impose bandwidth constraints.

is a completely managed service so users don’t have to manage the scaling of resources or maintenance of code.
uses deep learning to improve over time.

Amazon Polly

converts text into speech.

uses advanced deep-learning technologies to synthesize speech that sounds like a human voice.
supports Lexicons to customize pronunciation of specific words & phrases
supports Speech Synthesis Markup Language (SSML) tags like prosody so users can adjust the speech rate, pitch, pauses, or volume.
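
As a rough illustration of the SSML support described above, the sketch below builds a small SSML document that uses the `prosody` tag to adjust rate, pitch, and volume, followed by a pause. The helper name `build_ssml` and the sample sentence are assumptions made for this example; the string is only checked for well-formed XML here and is not sent to Polly.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_ssml(text: str, rate: str = "slow", pitch: str = "+5%",
               volume: str = "loud", pause_ms: int = 500) -> str:
    """Wrap plain text in SSML that adjusts rate, pitch, and volume, then pauses."""
    return (
        "<speak>"
        f'<prosody rate="{rate}" pitch="{pitch}" volume="{volume}">'
        f"{escape(text)}"  # escape &, <, > so the markup stays valid
        "</prosody>"
        f'<break time="{pause_ms}ms"/>'
        "</speak>"
    )


ssml = build_ssml("Welcome to the data engineering course.")
# Parsing confirms the markup is well-formed XML before it is sent anywhere.
root = ET.fromstring(ssml)
print(root.tag, [child.tag for child in root])
```

With boto3, a string like this would typically be passed to the Polly `synthesize_speech` call with `TextType="ssml"`; the voice and output format would be chosen separately.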

Amazon Rekognition

analyzes images and videos.
identifies objects, people, text, scenes, and activities in images and videos, and detects inappropriate content.
provides highly accurate facial analysis and facial search capabilities that can be used to detect, analyze, and compare faces for a wide variety of user verification, people counting, and public safety use cases.

helps identify potentially unsafe or inappropriate content across both image and video assets and provides detailed labels that help accurately control what you want to allow based on your needs.

Amazon Forecast

Amazon Forecast is a fully managed time-series forecasting service that uses statistical and machine learning algorithms to deliver highly accurate time-series forecasts and is built for business metrics analysis.
automatically tracks the accuracy of the model over time as new data is imported. Model’s deviation from initial quality metrics can be systematically quantified and used to make more informed decisions about keeping, retraining, or rebuilding the model as new data comes in.

provides six built-in algorithms which include ARIMA, Prophet, NPTS, ETS, CNN-QR, and DeepAR+.
integrates with AutoML to choose the optimal model for the datasets.

Amazon SageMaker Ground Truth

helps build highly accurate training datasets for machine learning quickly.

offers easy access to labelers through Amazon Mechanical Turk and provides them with built-in workflows and interfaces for common labeling tasks.
allows using your own labelers or vendors recommended by Amazon through AWS Marketplace.
helps lower labeling costs by up to 70% using automatic labeling, which works by training Ground Truth from data labeled by humans so that the service learns to label data independently.

provides annotation consolidation to help improve the accuracy of the data object’s labels.

Amazon Translate

provides natural and fluent language translation
a neural machine translation service that delivers fast, high-quality, and affordable language translation.

Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and natural-sounding translation than traditional statistical and rule-based translation algorithms.
allows content localization – such as websites and applications – for international users, and to easily translate large volumes of text efficiently.

Amazon Transcribe

provides speech-to-text capability

uses a deep learning process called automatic speech recognition (ASR) to convert speech to text quickly and accurately.
can be used to transcribe customer service calls, automate closed captioning and subtitling, and generate metadata for media assets to create a fully searchable archive.
adds punctuation and formatting so that the output closely matches the quality of manual transcription at a fraction of the time and expense.

processes audio in batch or near real-time.
supports automatic language identification.
supports custom vocabulary to generate more accurate transcriptions for domain-specific words and phrases like product names, technical terminology, or names of individuals.

supports specifying a list of words to remove from transcripts.

Amazon Kendra

is an intelligent search service that uses NLP and advanced ML algorithms to return specific answers to search questions from your data.
uses its semantic and contextual understanding capabilities to decide whether a document is relevant to a search query.

returns specific answers to questions, giving users an experience that’s close to interacting with a human expert.
provides a unified search experience by connecting multiple data repositories to an index and ingesting and crawling documents.
can use the document metadata to create a feature-rich and customized search experience for the users, helping them efficiently find the right answers to their queries.

Augmented AI (Amazon A2I)

Augmented AI (Amazon A2I) is an ML service that makes it easy to build the workflows required for human review.
brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers, whether it runs on AWS or not.

Amazon Personalize

Personalize is a fully managed machine learning service that uses data to generate item recommendations.

can also generate user segments based on the users’ affinity for certain items or item metadata.
generates recommendations primarily based on item interaction data that comes from the users interacting with items in the catalog.
includes API operations for real-time personalization, and batch operations for bulk recommendations and user segments.

Amazon Panorama

brings computer vision to the on-premises camera network.
AWS Panorama Appliance or another compatible device can be installed in the data center and registered with AWS Panorama to deploy computer vision applications from the cloud.
AWS Panorama Appliance
- is a compact edge appliance that uses a powerful system-on-module (SOM) that is optimized for ML workloads.
- can run multiple computer vision models against multiple video streams in parallel and output the results in real-time.
- is designed for use in commercial and industrial settings and is rated for dust and liquid protection.

works with the existing real-time streaming protocol (RTSP) network cameras.

Amazon Fraud Detector

Fraud Detector is a fully managed service to identify potentially fraudulent online activities such as online payment fraud and fake account creation.
takes care of all the heavy lifting such as data validation and enrichment, feature engineering, algorithm selection, hyperparameter tuning, and model deployment.

AWS IoT Greengrass ML Inference

IoT Greengrass helps perform machine learning inference locally on devices, using models that are created, trained, and optimized in the cloud.
provides flexibility to use machine learning models trained in SageMaker or to bring your pre-trained model stored in S3.
helps get inference results with very low latency to ensure the IoT applications can respond quickly to local events.

Amazon Elastic Inference

helps attach low-cost GPU-powered acceleration to EC2 and SageMaker instances or ECS tasks to reduce the cost of running deep learning inference by up to 75%.
supports TensorFlow, Apache MXNet, and ONNX models, with more frameworks coming soon.

AWS Certification Exam Practice Questions

Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).

AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.

AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated.

Open to further feedback, discussion and correction.

A company has built a deep learning model and now wants to deploy it using the SageMaker Hosting Services. For inference, they want a cost-effective option that guarantees low latency but still comes at a fraction of the cost of using a GPU instance for your endpoint. As a machine learning Specialist, what feature should be used?
1. Inference Pipeline
2. Elastic Inference
3. SageMaker Ground Truth
4. SageMaker Neo
A machine learning specialist works for an online retail company that sells health products. The company allows users to enter reviews of the products they buy from the website. The company wants to make sure the reviews do not contain any offensive or unsafe content, such as obscenities or threatening language. Which Amazon SageMaker algorithm or service will allow scanning user’s review text in the simplest way?
1. BlazingText
2. Transcribe
3. Semantic Segmentation
4. Comprehend
A company develops a tool whose coverage includes blogs, news sites, forums, videos, reviews, images, and social networks such as Twitter and Facebook. Users can search data by using Text and Image Search, and use charting, categorization, sentiment analysis, and other features to provide further information and analysis. They want to provide Image and text analysis capabilities to the applications which include identifying objects, people, text, scenes, and activities, and also provide highly accurate facial analysis and facial recognition. What service can provide this capability?
1. Amazon Comprehend
2. Amazon Rekognition
3. Amazon Polly
4. Amazon SageMaker

AWS SageMaker

May 10, 2024 ~ Last updated on : June 4, 2024 ~ jayendrapatil

AWS SageMaker

SageMaker is a fully managed machine learning service to build, train, and deploy machine learning (ML) models quickly.

removes the heavy lifting from each step of the machine learning process to make it easier to develop high-quality models.
is designed for high availability with no maintenance windows or scheduled downtimes

APIs run in Amazon’s proven, high-availability data centers, with service stack replication configured across three facilities in each AWS region to provide fault tolerance in the event of a server failure or AZ outage
provides a full end-to-end workflow, but users can continue to use their existing tools with SageMaker.
supports Jupyter notebooks.

allows users to select the number and type of instance used for the hosted notebook, training & model hosting.

SageMaker Machine Learning

Generate example data

Involves exploring and preprocessing, or “wrangling,” example data before using it for model training.

To preprocess data, you typically do the following:
- Fetch the data
- Clean the data
- Prepare or transform the data

Train a model

Model training includes both training and evaluating the model, as follows:
Training the model
- Needs an algorithm, the choice of which depends on several factors.
- Needs compute resources for training.
Evaluating the model
- determines whether the accuracy of the inferences is acceptable.

Training Data Format – File mode vs Pipe mode vs Fast File mode

SageMaker supports Simple Storage Service (S3), Elastic File System (EFS), and FSx for Lustre as training dataset locations.
Most SageMaker algorithms work best when using the optimized protobuf recordIO format for the training data.

Using RecordIO format allows algorithms to take advantage of Pipe mode when training the algorithms that support it.
File mode
- loads all the data from S3 to the training instance volumes.
- downloads training data to an encrypted EBS volume attached to the training instance.
- download needs to finish before model training starts
- needs disk space to store both the final model artifacts and the full training dataset.

Pipe mode
- streams data directly from S3.
- Streaming can provide faster start times for training jobs and better throughput.
- helps reduce the size of the EBS volumes for the training instances as it needs only enough disk space to store the final model artifacts.
- needs code changes
Fast File mode
- combines the ease of use of the existing File Mode with the performance of Pipe Mode.
- provides access to data as if it were downloaded locally while offering the performance benefit of streaming the data directly from S3.
- training can start without waiting for the entire dataset to be downloaded to the training instances.
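
The disk-space difference between File mode and Pipe mode described above can be sketched as a small sizing helper, together with the shape of one input channel in a training job request. This is a minimal sketch: the function names, the bucket URI, and the sizes are hypothetical, and no sizing rule is given for Fast File mode because the text above does not state one.

```python
def required_disk_gb(input_mode: str, dataset_gb: float, artifacts_gb: float) -> float:
    """Rough disk needed on the training volume for a given input mode."""
    if input_mode == "File":
        # File mode downloads the full dataset before training starts.
        return dataset_gb + artifacts_gb
    if input_mode == "Pipe":
        # Pipe mode streams from S3, so only the model artifacts need space.
        return artifacts_gb
    raise ValueError(f"no sizing rule sketched for input mode {input_mode!r}")


def build_channel(name: str, s3_uri: str, input_mode: str) -> dict:
    """Build one InputDataConfig channel entry for a training job request."""
    if input_mode not in {"File", "Pipe", "FastFile"}:
        raise ValueError(f"unknown input mode {input_mode!r}")
    return {
        "ChannelName": name,
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,  # hypothetical bucket and prefix
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


channel = build_channel("train", "s3://example-bucket/train/", "FastFile")
file_disk = required_disk_gb("File", dataset_gb=200, artifacts_gb=5)
pipe_disk = required_disk_gb("Pipe", dataset_gb=200, artifacts_gb=5)
print(file_disk, pipe_disk, channel["InputMode"])
```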

Build Model

SageMaker provides several built-in machine-learning algorithms that can be used for a variety of problem types.
Write a custom training script in a machine learning framework that SageMaker supports, and use one of the pre-built framework containers to run it in SageMaker.
Bring your algorithm or model to train or host in SageMaker.
- SageMaker provides pre-built Docker images for its built-in algorithms and the supported deep-learning frameworks used for training and inference.
- By using containers, machine learning algorithms can be trained and deployed quickly and reliably at any scale.
Use an algorithm that you subscribe to from AWS Marketplace.

Model Deployment

Model deployment helps deploy the ML code to make predictions, also known as Inference.
supports auto-scaling for the hosted models to dynamically adjust the number of instances provisioned in response to changes in the workload.
supports Multi-model endpoints to provide a scalable and cost-effective solution for deploying large numbers of models using a shared

can provide high availability and reliability by deploying multiple instances of the production endpoint across multiple AZs.
SageMaker provides multiple inference options.
- Real-time inference
  - is ideal for online inferences that have low latency or high throughput requirements.
  - provides a persistent and fully managed endpoint (REST API) that can handle sustained traffic, backed by the instance type of your choice.
  - can support payload sizes up to 6 MB and processing times of 60 seconds.
- Serverless Inference
  - is ideal for intermittent or unpredictable traffic patterns.
  - manages all of the underlying infrastructure with no need to manage instances or scaling policies.
  - provides a pay-as-you-use model, and charges only for what you use and not for idle time.
  - can support payload sizes up to 4 MB and processing times up to 60 seconds.
- Batch Transform
  - is suitable for offline processing when large amounts of data are available upfront and you don’t need a persistent endpoint.
  - can be used for pre-processing datasets.
  - can support large datasets that are GBs in size and processing times of days.
- Asynchronous Inference
  - is ideal when you want to queue requests and have large payloads with long processing times.
  - can support payloads up to 1 GB and long processing times up to one hour.
  - can also scale down the endpoint to 0 when there are no requests to process.
Inference pipeline
- is a SageMaker model that is composed of a linear sequence of multiple (2-15) containers that process requests for inferences on data.
- can be used to define and deploy any combination of pre-trained SageMaker built-in algorithms and your custom algorithms packaged in Docker containers.
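
The payload and processing-time limits listed for the inference options above lend themselves to a simple lookup. The sketch below is only a study aid under those stated limits (1 GB is treated as 1024 MB); the helper name `candidate_options` is an assumption, and real decisions also depend on traffic pattern and cost.

```python
from typing import List

# Limits as listed above: (max payload in MB, max processing time in seconds).
LIMITS = {
    "Real-time inference": (6, 60),
    "Serverless Inference": (4, 60),
    "Asynchronous Inference": (1024, 3600),
}


def candidate_options(payload_mb: float, processing_seconds: float) -> List[str]:
    """Return the endpoint-based options whose listed limits fit one request."""
    fits = [
        name
        for name, (max_mb, max_seconds) in LIMITS.items()
        if payload_mb <= max_mb and processing_seconds <= max_seconds
    ]
    # Batch Transform has no per-request limit in the list above; it is the
    # fallback for offline jobs over datasets that are available upfront.
    return fits or ["Batch Transform"]


print(candidate_options(payload_mb=5, processing_seconds=30))
print(candidate_options(payload_mb=500, processing_seconds=1800))
```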

Real-Time Inference Variants

SageMaker supports testing multiple models or model versions behind the same endpoint using variants.
A variant consists of an ML instance and the serving components specified in a SageMaker model.

Each variant can have a different instance type or a SageMaker model that can be autoscaled independently of the others.
Models within the variants can be trained using different datasets, algorithms, ML frameworks, or any combination of all of these.
All the variants behind an endpoint share the same inference code.

SageMaker supports two types of variants, production variants and shadow variants.
- Production Variants
  - supports A/B or Canary testing where you can allocate a portion of the inference requests to each variant.
  - helps compare production variants performance relative to each other.
- Shadow Variants
  - replicates a portion of the inference requests that go to the production variant to the shadow variant.
  - logs the responses of the shadow variant for comparison; they are not returned to the caller.
  - helps test the performance of the shadow variant without exposing the caller to the response produced by the shadow variant.
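
To make the traffic split between production variants concrete, the sketch below computes each variant's share of requests as its weight divided by the sum of all weights. The dictionaries follow the shape of a production variant definition (`InitialVariantWeight` is the weight field); the variant names, model names, and instance settings are hypothetical.

```python
from typing import Dict, List


def traffic_shares(variants: List[dict]) -> Dict[str, float]:
    """Fraction of requests each variant receives: weight / sum of weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


# Canary-style split: the new model gets a small share of live traffic.
production_variants = [
    {"VariantName": "current", "ModelName": "model-v1",
     "InstanceType": "ml.m5.large", "InitialInstanceCount": 2,
     "InitialVariantWeight": 9.0},
    {"VariantName": "canary", "ModelName": "model-v2",
     "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
     "InitialVariantWeight": 1.0},
]
shares = traffic_shares(production_variants)
print(shares)
```

Shifting more traffic to the canary then amounts to raising its weight relative to the others.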

SageMaker Training Optimization

SageMaker Managed Spot Training
- uses EC2 Spot instances to optimize the cost of training models over on-demand instances.
- Spot interruptions are managed by SageMaker on your behalf.
SageMaker Checkpoints
- help save the state of ML models during training.
- are snapshots of the model and can be configured by the callback functions of ML frameworks.
- saved checkpoints can be used to restart a training job from the last saved checkpoint.

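SageMaker managed spot training combined with checkpoints helps save on training costs, since an interrupted job can resume from the last saved checkpoint instead of starting over.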
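
The checkpoint-and-resume idea can be illustrated with a toy training loop. This is a framework-free sketch, not SageMaker code: the `train` function, the JSON checkpoint file, and the `stop_after` argument (which simulates a Spot interruption) are all assumptions. In a real training job the checkpoint directory would be the local path that SageMaker syncs to S3 (by default `/opt/ml/checkpoints`).

```python
import json
import os
import tempfile


def train(checkpoint_dir: str, total_epochs: int, stop_after=None) -> dict:
    """Toy training loop that resumes from the last checkpoint if one exists."""
    path = os.path.join(checkpoint_dir, "checkpoint.json")
    state = {"epoch": 0, "weight": 0.0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # resume instead of starting from scratch
    while state["epoch"] < total_epochs:
        if stop_after is not None and state["epoch"] >= stop_after:
            break  # simulates a Spot interruption
        state["weight"] += 0.5  # stand-in for one epoch of real training
        state["epoch"] += 1
        with open(path, "w") as f:
            json.dump(state, f)  # checkpoint after every epoch
    return state


with tempfile.TemporaryDirectory() as ckpt_dir:
    interrupted = train(ckpt_dir, total_epochs=10, stop_after=4)
    resumed = train(ckpt_dir, total_epochs=10)
    print(interrupted, resumed)
```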
SageMaker Inference Recommender
- helps select the best instance type and configuration (such as instance count, container parameters, and model optimizations) or serverless configuration (such as max concurrency and memory size) for the ML models and workloads.
- helps deploy the model to a real-time or serverless inference endpoint that delivers the best performance at the lowest cost.
- reduces the time required to get ML models in production by automating load testing and model tuning across SageMaker ML instances.
- provides two types of recommendations
  - Default, run a set of load tests on the recommended instance types.
  - Advanced, based on a custom load test where you can select desired ML instances or a serverless endpoint, provide a custom traffic pattern, and provide requirements for latency and throughput based on your production requirements.

SageMaker Security

SageMaker ensures that ML model artifacts and other system artifacts are encrypted in transit and at rest.

SageMaker allows using encrypted S3 buckets for model artifacts and data, and passing a KMS key to SageMaker notebooks, training jobs, and endpoints to encrypt the attached ML storage volume.
Requests to the SageMaker API and console are made over a secure (SSL) connection.
SageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.

SageMaker API and SageMaker Runtime support VPC interface endpoints powered by AWS PrivateLink, which connect a VPC directly to the SageMaker API or SageMaker Runtime without using an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection.
SageMaker Network Isolation makes sure
- containers can’t make any outbound network calls, even to other AWS services such as S3.
- no AWS credentials are made available to the container runtime environment.
- network inbound and outbound traffic is limited to the peers of each training container, in the case of a training job with multiple instances.
- S3 download and upload operations are performed using the SageMaker execution role in isolation from the training or inference container.

SageMaker Notebooks

SageMaker notebooks are collaborative notebooks that are built into SageMaker Studio running the Jupyter Notebook App.
can be accessed without setting up compute instances and file storage.
charged only for the resources consumed when notebooks are running

instance types can be easily switched if more or less computing power is needed, during the experimentation phase.
come with multiple environments already installed containing Jupyter kernels and Python packages including scikit-learn, Pandas, NumPy, MXNet, and TensorFlow.
use a lifecycle configuration that includes both a script that runs once when the notebook instance is created and a script that runs each time it starts (including restarts), to install custom environments and kernels on the notebook instance’s EBS volume.

restart the notebooks to automatically apply patches.

SageMaker Built-in Algorithms

Please refer to SageMaker Built-in Algorithms for details.

SageMaker Feature Store

SageMaker Feature Store helps to create, share, and manage features for ML development.

is a centralized store for features and associated metadata so features can be easily discovered and reused.
helps by reducing repetitive data processing and curation work required to convert raw data into features for training an ML algorithm.
consists of FeatureGroup which is a group of features defined to describe a Record.

Data processing logic is developed only once and the features generated can be used for both training and inference, reducing the training-serving skew.
supports online and offline store
- Online store
  - is used for low-latency real-time inference use cases
  - retains only the latest feature data.
- Offline store
  - is used for training and batch inference.
  - is an append-only store and can be used to store and access historical feature data.
  - data is stored in Parquet format for optimized storage and query access.
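
The difference between the two stores can be shown with a toy in-memory model: the online store keeps only the latest record per identifier, while the offline store appends every record. The class `ToyFeatureGroup` and its field names are hypothetical and only mimic the semantics described above; they are not the Feature Store API.

```python
from typing import Dict, List


class ToyFeatureGroup:
    """In-memory illustration of online vs offline store semantics."""

    def __init__(self, record_id_name: str, event_time_name: str):
        self.record_id_name = record_id_name
        self.event_time_name = event_time_name
        self.online: Dict[str, dict] = {}  # latest record per identifier
        self.offline: List[dict] = []      # append-only history

    def put_record(self, record: dict) -> None:
        self.offline.append(dict(record))  # history is never overwritten
        key = record[self.record_id_name]
        current = self.online.get(key)
        # The online store keeps only the most recent record per identifier.
        if current is None or record[self.event_time_name] >= current[self.event_time_name]:
            self.online[key] = dict(record)

    def get_record(self, record_id: str) -> dict:
        return self.online[record_id]


fg = ToyFeatureGroup("customer_id", "event_time")
fg.put_record({"customer_id": "c1", "event_time": 1, "orders_30d": 2})
fg.put_record({"customer_id": "c1", "event_time": 2, "orders_30d": 3})
print(fg.get_record("c1")["orders_30d"], len(fg.offline))
```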

Inferentia

AWS Inferentia is designed to accelerate deep learning workloads by providing high-performance inference in the cloud.
helps deliver higher throughput and up to 70% lower cost per inference than comparable current generation GPU-based Amazon EC2 instances.

Elastic Inference (EI)

helps speed up the throughput and decrease the latency of getting real-time inferences from the deep learning models deployed as SageMaker-hosted models
add inference acceleration to a hosted endpoint for a fraction of the cost of using a full GPU instance.
NOTE: Elastic Inference has been deprecated and replaced by Inferentia.

SageMaker Ground Truth

SageMaker Ground Truth provides automated data labeling using machine learning
helps build highly accurate training datasets for machine learning quickly.
offers easy access to labelers through Mechanical Turk and provides them with built-in workflows and interfaces for common labeling tasks.

allows using your labelers as a private workforce or vendors recommended by Amazon through AWS Marketplace.
helps lower the labeling costs by up to 70% using automatic labeling, which works by training Ground Truth from data labeled by humans so that the service learns to label data independently.
significantly reduces the time and effort required to create datasets for training to reduce costs

automated data labeling uses active learning to automate the labeling of your input data for certain built-in task types.
provides annotation consolidation to help improve the accuracy of the data object’s labels. It combines the results of multiple workers’ annotation tasks into one high-fidelity label.

first selects a random sample of data and sends it to Amazon Mechanical Turk to be labeled.
results are then used to train a labeling model that attempts to label a new sample of raw data automatically.
labels are committed when the model can label the data with a confidence score that meets or exceeds a threshold you set.

for a confidence score falling below the defined threshold, the data is sent to human labelers.
Some of the data labeled by humans is used to generate a new training dataset for the labeling model, and the model is automatically retrained to improve its accuracy.
process repeats with each sample of raw data to be labeled.

labeling model becomes more capable of automatically labeling raw data with each iteration, and less data is routed to humans.
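
The confidence-threshold step of the automated labeling loop described above can be sketched as a simple routing function. The names `route` and `batch` and the sample predictions are made up for illustration; the actual labeling model and retraining are handled by the service.

```python
from typing import Dict, List, Tuple

Prediction = Tuple[str, str, float]  # (item id, predicted label, confidence)


def route(predictions: List[Prediction], threshold: float) -> Tuple[Dict[str, str], List[str]]:
    """Split model predictions into auto-committed labels and human work."""
    # Labels at or above the threshold are committed automatically.
    committed = {item: label for item, label, conf in predictions if conf >= threshold}
    # Everything below the threshold goes to human labelers.
    to_humans = [item for item, _, conf in predictions if conf < threshold]
    return committed, to_humans


batch = [("img-1", "cat", 0.97), ("img-2", "dog", 0.62), ("img-3", "cat", 0.91)]
committed, to_humans = route(batch, threshold=0.9)
print(committed, to_humans)
```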

SageMaker Autopilot

SageMaker Autopilot is an automated machine learning (AutoML) feature set that automates the end-to-end process of building, training, tuning, and deploying machine learning models.
analyzes the data, selects algorithms suitable for the problem type, preprocesses the data to prepare it for training, handles automatic model training, and performs hyperparameter optimization to find the best-performing model for the dataset.

helps users understand how models make predictions by automatically generating reports that show the importance of each individual feature providing transparency and insights into the factors influencing the predictions, which can be used by risk and compliance teams and external regulators.

SageMaker JumpStart

SageMaker JumpStart provides pretrained, open-source models for various problem types to help get started with machine learning.

can incrementally train and tune these models before deployment.
provides solution templates that set up infrastructure for common use cases, and executable example notebooks for machine learning with SageMaker.

SageMaker Automatic Model Tuning

Hyperparameters are parameters exposed by ML algorithms that control how the underlying algorithm operates and their values affect the quality of the trained models.

Automatic model tuning helps find a set of hyperparameters for an algorithm that can yield an optimal model.
SageMaker automatic model tuning can use managed spot training
Best Practices for Hyperparameter tuning
- Choosing the Number of Hyperparameters – limit the search to a smaller number as the difficulty of a hyperparameter tuning job depends primarily on the number of hyperparameters that SageMaker has to search.
- Choosing Hyperparameter Ranges – DO NOT specify a very large range to cover every possible value for a hyperparameter. Range of values for hyperparameters that you choose to search can significantly affect the success of hyperparameter optimization.
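- Using Logarithmic Scales for Hyperparameters – if a hyperparameter is known to be log-scaled, converting it to a log scale yourself can improve hyperparameter optimization.
- Choosing the Best Number of Concurrent Training Jobs – running one training job at a time achieves the best results with the least amount of compute time, because a tuning job improves only through successive rounds of experiments; running more jobs concurrently gets work done faster.
- Running Training Jobs on Multiple Instances – tuning uses the last-reported objective metric value from all instances of a training job, so design distributed training jobs so that the reported objective metric is the one you want.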
Warm start can be used to start a hyperparameter tuning job using one or more previous tuning jobs as a starting point.
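
To see why a logarithmic scale matters for a hyperparameter such as a learning rate, the sketch below compares linear and log-uniform sampling over the same range. The range, the seed, and the helper `sample_log_uniform` are assumptions chosen for illustration; SageMaker's own search strategy is not reproduced here.

```python
import math
import random


def sample_log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
low, high = 1e-5, 1e-1
linear = [rng.uniform(low, high) for _ in range(10_000)]
log_scaled = [sample_log_uniform(rng, low, high) for _ in range(10_000)]

# Share of samples that fall below 1e-3, the geometric midpoint of the range.
below_linear = sum(x < 1e-3 for x in linear) / len(linear)
below_log = sum(x < 1e-3 for x in log_scaled) / len(log_scaled)
print(round(below_linear, 3), round(below_log, 3))
```

Linear sampling spends almost all trials on the largest values, while log-uniform sampling spreads them evenly across orders of magnitude. In a tuning job definition, a continuous range can request this behaviour through its `ScalingType` setting (for example `Logarithmic`).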

SageMaker Experiments

SageMaker Experiments is a capability of SageMaker that lets you create, manage, analyze, and compare your machine-learning experiments.
helps organize, view, analyze, and compare iterative ML experimentation to gain comparative insights and track the best-performing models.
automatically tracks the inputs, parameters, configurations, and results of the iterations as runs.

SageMaker Debugger

SageMaker Debugger provides tools to debug training jobs and resolve problems such as overfitting, saturated activation functions, and vanishing gradients to improve the performance of the model.
offers tools to send alerts when training anomalies are found, take actions against the problems, and identify the root cause of them by visualizing collected metrics and tensors.
supports the Apache MXNet, PyTorch, TensorFlow, and XGBoost frameworks.

SageMaker Data Wrangler

SageMaker Data Wrangler reduces the time it takes to aggregate and prepare tabular and image data for ML from weeks to minutes.
simplifies the process of data preparation and feature engineering, and completes each step of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) from a single visual interface.
supports SQL to select the data from various data sources and import it quickly.
provides data quality and insights reports to automatically verify data quality and detect anomalies, such as duplicate rows and target leakage.
contains over 300 built-in data transformations, so you can quickly transform data without writing any code.

SageMaker Clarify

SageMaker Clarify helps improve the ML models by detecting potential bias and helping to explain the predictions that the models make.
provides purpose-built tools to gain greater insights into the ML models and data, based on metrics such as accuracy, robustness, toxicity, and bias to improve model quality and support responsible AI initiatives.
can help to
- detect bias in and help explain the model predictions.
- identify types of bias in pre-training data.
- identify types of bias in post-training data that can emerge during training or when the model is in production.
integrates with SageMaker Data Wrangler, Model Monitor, and Autopilot.
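
As an illustration of what a pre-training bias check looks at, the sketch below computes two simple measures on a labeled dataset: class imbalance between two facets, CI = (n_a - n_d) / (n_a + n_d), and the difference in proportions of positive labels, DPL = q_a - q_d. The function names, facet values, and toy data are assumptions for this example and do not call Clarify itself.

```python
from typing import List, Tuple

Row = Tuple[str, int]  # (facet value, label) where label 1 is the positive outcome


def class_imbalance(rows: List[Row], facet_a: str, facet_d: str) -> float:
    """CI = (n_a - n_d) / (n_a + n_d); 0 means both facets are equally represented."""
    n_a = sum(1 for facet, _ in rows if facet == facet_a)
    n_d = sum(1 for facet, _ in rows if facet == facet_d)
    return (n_a - n_d) / (n_a + n_d)


def label_proportion_difference(rows: List[Row], facet_a: str, facet_d: str) -> float:
    """DPL = q_a - q_d, the gap in positive-label rates between the two facets."""
    def positive_rate(facet_value: str) -> float:
        labels = [label for facet, label in rows if facet == facet_value]
        return sum(labels) / len(labels)
    return positive_rate(facet_a) - positive_rate(facet_d)


# Toy dataset: facet "a" has 80 rows (60 positive), facet "d" has 20 rows (5 positive).
rows = [("a", 1)] * 60 + [("a", 0)] * 20 + [("d", 1)] * 5 + [("d", 0)] * 15
ci = class_imbalance(rows, "a", "d")
dpl = label_proportion_difference(rows, "a", "d")
print(ci, dpl)
```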

SageMaker Model Monitor

SageMaker Model Monitor monitors the quality of SageMaker machine learning models in production.
Continuous monitoring can be set up with a real-time endpoint (or a batch transform job that runs regularly), or on-schedule monitoring for asynchronous batch transform jobs.
helps set alerts that notify when there are deviations in the model quality.
provides early and proactive detection of deviations enabling you to take corrective actions, such as retraining models, auditing upstream systems, or fixing quality issues without having to monitor models manually or build additional tooling.
provides prebuilt monitoring capabilities that do not require coding with the flexibility to monitor models by coding to provide custom analysis.
provides the following types of monitoring:
- Monitor data quality – Monitor drift in data quality.
- Monitor model quality – Monitor drift in model quality metrics, such as accuracy.
- Monitor Bias Drift for Models in Production – Monitor bias in the model’s predictions.
- Monitor Feature Attribution Drift for Models in Production – Monitor drift in feature attribution.
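
The idea behind data quality drift monitoring (compare production data against a baseline captured at training time) can be sketched with a deliberately simple check on one numeric feature. This is not how Model Monitor computes its statistics; the function `mean_drifted`, the threshold, and the sample values are assumptions for illustration only.

```python
import statistics
from typing import List


def mean_drifted(baseline: List[float], current: List[float],
                 max_shift_in_std: float = 0.5) -> bool:
    """Flag drift when the current mean moves too far from the baseline mean.

    The shift is measured in units of the baseline standard deviation.
    """
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    shift = abs(statistics.mean(current) - base_mean) / base_std
    return shift > max_shift_in_std


baseline = [10, 12, 11, 9, 10, 11, 12, 10]  # feature values seen at training time
stable = [11, 10, 10, 12]                   # production sample, similar distribution
shifted = [15, 16, 14, 15]                  # production sample after an upstream change
print(mean_drifted(baseline, stable), mean_drifted(baseline, shifted))
```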

SageMaker Neo

SageMaker Neo enables machine learning models to train once and run anywhere in the cloud and at the edge.
automatically optimizes machine learning models for inference on cloud instances and edge devices to run faster with no loss in accuracy.
Optimized models run up to two times faster and consume less than a tenth of the resources of typical machine learning models.
can be used with IoT Greengrass to help perform machine learning inference locally on devices.

SageMaker Model Governance

SageMaker Model Governance is a framework that gives systematic visibility into machine learning (ML) model development, validation, and usage.
SageMaker provides purpose-built ML governance tools for access control, activity tracking, and reporting across the ML lifecycle.

SageMaker Pricing

Users pay for ML compute, storage, and data processing resources they use for hosting the notebook, training the model, performing predictions & logging the outputs.

AWS Certification Exam Practice Questions

A company has built a deep learning model and now wants to deploy it using the SageMaker Hosting Services. For inference, they want a cost-effective option that guarantees low latency but still comes at a fraction of the cost of using a GPU instance for your endpoint. As a machine learning Specialist, what feature should be used?
1. Inference Pipeline
2. Elastic Inference
3. SageMaker Ground Truth
4. SageMaker Neo
A trading company is experimenting with different datasets, algorithms, and hyperparameters to find the best combination for the machine learning problem. The company doesn’t want to limit the number of experiments the team can perform but wants to track the several hundred to over a thousand experiments throughout the modeling effort. Which Amazon SageMaker feature should they use to help manage your team’s experiments at scale?
1. SageMaker Inference Pipeline
2. SageMaker model tracking
3. SageMaker Neo
4. SageMaker model containers
A Machine Learning Specialist needs to monitor Amazon SageMaker in a production environment for analyzing the record of actions taken by a user, role, or an AWS service. Which service should the Specialist use to meet these needs?
1. AWS CloudTrail
2. Amazon CloudWatch
3. AWS Systems Manager
4. AWS Config

References

Amazon_SageMaker

Machine Learning Concepts – Cheat Sheet

May 10, 2024 ~ Last updated on : June 4, 2024 ~ jayendrapatil

Machine Learning Concepts

This post covers some of the basic Machine Learning concepts mostly relevant for the AWS Machine Learning certification exam.

Machine Learning Lifecycle

Data Processing and Exploratory Analysis

To train a model, you need data.

The type of data needed depends on the business problem that you want the model to solve (the inferences that you want the model to generate).
Data processing includes data collection, cleaning, splitting, exploration, preprocessing, transformation, formatting, etc.

Feature Selection and Engineering

helps improve model accuracy and speed up training

remove irrelevant data inputs using domain knowledge, e.g. name
remove features that have the same values, very low correlation, very little variance, or a lot of missing values
handle missing data using mean values or imputation

combine related features, e.g. height and age into height/age
convert or transform features into a useful representation, e.g. date to day or hour
standardize data ranges across features

Missing Data

do nothing
remove the feature if it has a lot of missing data points
remove samples with missing data, if the feature needs to be used

Impute using mean/median value
- has little impact on the data distribution when the dataset is not skewed; prefer the median for skewed data
- works with numerical values only. Do not use for categorical features.
- doesn’t factor correlations between features
Impute using (Most Frequent) or (Zero/Constant) Values
- works with categorical features
- doesn’t factor correlations between features
- can introduce bias
Impute using k-NN, Multivariate Imputation by Chained Equation (MICE), Deep Learning
- more accurate than the mean, median or most frequent
- Computationally expensive
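
The simpler imputation strategies above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `impute` and the sample columns are assumptions made for the example, and the k-NN, MICE, and deep learning approaches are not shown.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean", constant=0):
    """Fill None entries in a single feature column.

    "mean" and "median" only make sense for numerical features;
    "most_frequent" and "constant" also work for categorical ones.
    """
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "most_frequent":
        fill = mode(observed)
    elif strategy == "constant":
        fill = constant
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


ages = [20, None, 30, 40, None]       # numerical feature with gaps
colors = ["red", "blue", None, "red"]  # categorical feature with a gap

print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(colors, "most_frequent"))
```

Each column is filled independently, which is exactly why these strategies do not factor in correlations between features.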

Unbalanced Data

Source more real data

Oversampling instances of the minority class or undersampling instances of the majority class
Create or synthesize data using techniques like SMOTE (Synthetic Minority Oversampling TEchnique)
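
As a rough sketch of oversampling, the snippet below duplicates randomly chosen minority-class samples until every class is as large as the biggest one. It is plain random oversampling, not SMOTE (which synthesizes new points between neighbours); the function name and the toy "fraud" data are assumptions for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes match the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies (with replacement) only for the smaller classes.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = ["ok", "ok", "ok", "ok", "fraud"]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Undersampling is the mirror image: randomly drop majority-class samples until the classes match the smallest one.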

Label Encoding and One-hot Encoding

Models cannot multiply strings by the learned weights, so encoding is used to convert strings to numeric values.

Label encoding
- Use Label encoding to provide a lookup that maps string data values to numerical values
- However, the assigned numbers are arbitrary, and the model can misread them as an order or magnitude, which would impact the model

One-hot encoding
- Use One-hot encoding for Categorical features that have a discrete set of possible values.
- One-hot encoding provides a binary representation by converting data values into features without impacting the relationships
- a binary vector is created for each categorical feature in the model that represents values as follows:
  - For values that apply to the example, set corresponding vector elements to 1.
  - Set all other elements to 0.
- Multi-hot encoding is when multiple values are 1
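
The difference between the two encodings is easiest to see side by side. The following is a minimal sketch with hypothetical helper names; categories are sorted alphabetically only to make the output deterministic.

```python
def label_encode(values):
    """Map each distinct string to an integer (arbitrary but stable order)."""
    categories = sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Create one binary column per category; exactly one element is 1 per row."""
    categories = sorted(set(values))
    vectors = [[1 if v == category else 0 for category in categories] for v in values]
    return vectors, categories


colors = ["red", "green", "blue", "green"]

codes, mapping = label_encode(colors)
vectors, categories = one_hot_encode(colors)

print(mapping)     # {'blue': 0, 'green': 1, 'red': 2}
print(codes)       # [2, 1, 0, 1]
print(categories)  # ['blue', 'green', 'red']
print(vectors)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

With label encoding the model sees red (2) as "twice" green (1), which has no real meaning; the one-hot vectors avoid that implied ordering.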

Cleaning Data

Scaling or Normalization means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1)
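
A minimal sketch of min-max scaling, using the 100 to 900 range from the example above. The function name and sample values are assumptions for illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values from their natural range to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no range; map everything to new_min.
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]


raw = [100, 300, 500, 900]
print(min_max_scale(raw))             # [0.0, 0.25, 0.5, 1.0]
print(min_max_scale(raw, -1.0, 1.0))  # [-1.0, -0.5, 0.0, 1.0]
```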

Train a model

Model training includes both training and evaluating the model.

To train a model, an algorithm is needed.
Data can be split into training data, validation data and test data
- Algorithm sees and is directly influenced by the training data
- Algorithm uses but is indirectly influenced by the validation data
- Algorithm does not see the testing data during training
Training learns the normal parameters from the features, while hyperparameters are set before training to control how it runs

Supervised, Unsupervised, and Reinforcement Learning

Splitting and Randomization

Always randomize the data before splitting
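
A minimal sketch of shuffling before splitting into training, validation, and test sets. The 70/15/15 ratio, the seed, and the function name are assumptions for the example, not recommendations from the exam material.

```python
import random


def shuffled_split(rows, train=0.7, validation=0.15, seed=42):
    """Shuffle a copy of the rows, then cut it into train/validation/test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * validation)
    return (
        rows[:n_train],
        rows[n_train:n_train + n_val],
        rows[n_train + n_val:],  # whatever remains is the test set
    )


data = list(range(100))  # stand-in for an ordered dataset
train_set, val_set, test_set = shuffled_split(data)
print(len(train_set), len(val_set), len(test_set))
```

Without the shuffle, an ordered dataset (for example sorted by date or by label) would put systematically different rows in each split.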

Hyperparameters

influence how the training occurs

Common hyperparameters are learning rate, epoch, batch size
Learning rate
- size of the step taken during gradient descent optimization
- Large learning rates can overshoot the correct solution
- Small learning rates increase training time
Batch size
- number of samples used to train at any one time
- can be all (batch), one (stochastic), or some (mini-batch)
- can be estimated from the available infrastructure (for example, memory)
- Small batch sizes tend to not get stuck in local minima
- Large batch sizes can converge on the wrong solution at random.
Epochs
- number of times the algorithm processes the entire training data
- each epoch or run can see the model get closer to the desired state
- depends on the algorithm used
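
To make the relationship between batch size and epochs concrete, here is a small sketch that slices a dataset into mini-batches and counts how many weight updates a training run would perform. The helper names are hypothetical and no actual model is trained.

```python
import math


def iterate_minibatches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def updates_per_run(n_samples, batch_size, epochs):
    """One update per batch, and every epoch walks the whole dataset once."""
    return math.ceil(n_samples / batch_size) * epochs


samples = list(range(10))
batches = list(iterate_minibatches(samples, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

print(updates_per_run(10, batch_size=10, epochs=3))  # batch: 3 updates
print(updates_per_run(10, batch_size=1, epochs=3))   # stochastic: 30 updates
print(updates_per_run(10, batch_size=4, epochs=3))   # mini-batch: 9 updates
```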

Evaluating the model

After training the model, evaluate it to determine whether the accuracy of the inferences is acceptable.

ML Model Insights

For binary classification models, use the accuracy metric called Area Under the (Receiver Operating Characteristic) Curve (AUC). AUC measures the ability of the model to predict a higher score for positive examples than for negative examples.
For regression tasks, use the industry standard root mean square error (RMSE) metric. It is a distance measure between the predicted numeric target and the actual numeric answer (ground truth). The smaller the value of the RMSE, the better is the predictive accuracy of the model.
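
RMSE is the square root of the mean of the squared differences between predictions and ground truth. A minimal sketch, with made-up numbers:

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 9.0, 9.0]

# Squared errors are 1, 0, 4, 0, so the mean is 1.25.
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.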

Cross-Validation

Cross-validation is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data.
Use cross-validation to detect overfitting, i.e., failing to generalize a pattern.
There is no separate validation dataset; the training data is split into chunks (folds), and each chunk in turn is held out and used for validation.
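
A sketch of how k-fold cross-validation carves up the data: each fold is used once for validation while the remaining folds are used for training. The function name is hypothetical, and the indices are assumed to refer to data that has already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, validation_indices) pairs covering all samples."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, validation))
        start += size
    return folds


folds = k_fold_indices(10, 3)
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

A model is trained once per fold and the k evaluation scores are averaged; a large gap between training and validation scores across folds hints at overfitting.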

Optimization

Gradient Descent is used to optimize many different types of machine learning algorithms
Learning rate sets the step size
- If the learning rate is too large, the minimum might be overshot and the loss would oscillate
- If the learning rate is too small, too many steps are required, which makes the process slower and less efficient
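
The effect of the learning rate can be seen on a toy problem. The sketch below minimizes f(x) = x², whose gradient is 2x, with three different learning rates; the function and the specific rates are assumptions chosen for illustration.

```python
def gradient_descent(gradient, start, learning_rate, steps):
    """Repeatedly step against the gradient and record every position."""
    x = start
    path = [x]
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
        path.append(x)
    return path


def grad(x):
    # Gradient of f(x) = x**2, which has its minimum at x = 0.
    return 2 * x


good = gradient_descent(grad, start=10.0, learning_rate=0.1, steps=50)
too_large = gradient_descent(grad, start=10.0, learning_rate=1.1, steps=50)
too_small = gradient_descent(grad, start=10.0, learning_rate=0.001, steps=50)

print("lr=0.1   ends at", good[-1])       # very close to the minimum
print("lr=1.1   ends at", too_large[-1])  # flips sign every step and diverges
print("lr=0.001 ends at", too_small[-1])  # has barely moved from the start
```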

Underfitting

Model is underfitting the training data when the model performs poorly on the training data because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y).

To increase model flexibility
- Add new domain-specific features and more feature Cartesian products, and change the types of feature processing used (e.g., increasing n-grams size)
- Regularization – Decrease the amount of regularization used
- Increase the amount of training data examples.
- Increase the number of passes on the existing training data.

Overfitting

Model is overfitting the training data when the model performs well on the training data but does not perform well on the evaluation data because the model is memorizing the data it has seen and is unable to generalize to unseen examples.

To reduce model flexibility
- Feature selection: consider using fewer feature combinations, decrease n-grams size, and decrease the number of numeric attribute bins.
- Simplify the model, by reducing the number of layers.
- Regularization – technique to reduce the complexity of the model. Increase the amount of regularization used.
- Early Stopping – a form of regularization while training a model with an iterative method, such as gradient descent.
- Data Augmentation – process of artificially generating new data from existing data, primarily to train new ML models.
- Dropout is a regularization technique that prevents overfitting.

Classification Model Evaluation

Confusion Matrix

Confusion matrix represents the percentage of times each label was predicted in the training set during evaluation
An NxN table that summarizes how successful a classification model’s predictions were; that is, the correlation between the label and the model’s classification.

One axis of a confusion matrix is the label that the model predicted, and the other axis is the actual label.
N represents the number of classes. In a binary classification problem, N=2
- For example, here is a sample confusion matrix for a binary classification problem:

	Tumor (predicted)	Non-Tumor (predicted)
Tumor (actual)	18 (True Positives)	1 (False Negatives)
Non-Tumor (actual)	6 (False Positives)	452 (True Negatives)

- Confusion matrix shows that of the 19 samples that actually had tumors, the model correctly classified 18 as having tumors (18 true positives), and incorrectly classified 1 as not having a tumor (1 false negative).
- Similarly, of 458 samples that actually did not have tumors, 452 were correctly classified (452 true negatives) and 6 were incorrectly classified (6 false positives).

Confusion matrix for a multi-class classification problem can help you determine mistake patterns. For example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or 1 instead of 7.

Accuracy, Precision, Recall (Sensitivity) and Specificity

Accuracy

A metric for classification models that identifies the fraction of predictions that a classification model got right.
In Binary classification, calculated as (True Positives+True Negatives)/Total Number Of Examples

In Multi-class classification, calculated as Correct Predictions/Total Number Of Examples

Precision

A metric for classification models that identifies the frequency with which a model was correct when predicting the positive class.
Calculated as True Positives/(True Positives + False Positives)

Recall – Sensitivity – True Positive Rate (TPR)

A metric for classification models that answers the following question: Out of all the possible positive labels, how many did the model correctly identify i.e. Number of correct positives out of actual positive results
Calculated as True Positives/(True Positives + False Negatives)
Important when – False Positives are acceptable as long as ALL positives are found, e.g. it is fine to predict Non-Tumor as Tumor as long as all the Tumors are correctly predicted

Specificity – True Negative Rate (TNR)

Number of correct negatives out of actual negative results
Calculated as True Negatives/(True Negatives + False Positives)
Important when – False Positives are unacceptable and it is better to have False Negatives, e.g. it is not fine to predict Non-Tumor as Tumor
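
Applying the four formulas to the tumor confusion matrix shown earlier (18 true positives, 1 false negative, 6 false positives, 452 true negatives) gives a worked example. The function name is an assumption for illustration.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the four metrics from binary confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # sensitivity / true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


metrics = classification_metrics(tp=18, fn=1, fp=6, tn=452)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy here is 470/477 (about 0.985) even though precision is only 18/24 = 0.75, which shows how accuracy alone can look flattering on an unbalanced dataset.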

ROC and AUC

ROC (Receiver Operating Characteristic) Curve

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds.
An ROC curve plots True Positive Rate (TPR) vs. False Positive Rate (FPR) at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.

AUC (Area under the ROC curve)

AUC stands for “Area under the ROC Curve.”
AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).

AUC provides an aggregate measure of performance across all possible classification thresholds.
One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
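
The ranking interpretation can be computed directly: compare every positive example with every negative example and count how often the positive one gets the higher score. This is a brute-force sketch with made-up scores, not how libraries compute AUC in practice.

```python
def auc_by_ranking(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


positives = [0.9, 0.8, 0.4]  # model scores for actual positives
negatives = [0.7, 0.3, 0.2]  # model scores for actual negatives

# 8 of the 9 pairs are ranked correctly (only 0.4 vs 0.7 is wrong).
print(auc_by_ranking(positives, negatives))
```

A value of 1.0 means every positive outranks every negative, and 0.5 is what random scoring would give.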

F1 Score

F₁ score (also F-score or F-measure) is a measure of a test’s accuracy.
It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The F₁ score is the harmonic mean of precision and recall: $F_{1}=\frac{2}{\mathrm{recall}^{-1}+\mathrm{precision}^{-1}}=2\cdot\frac{\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}$.
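
A short worked example of the formula, using the precision (18/24) and recall (18/19) of the tumor confusion matrix from the Classification Model Evaluation section. The function name is an assumption for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision = 18 / 24  # TP / (TP + FP)
recall = 18 / 19     # TP / (TP + FN)

print(f1_score(precision, recall))  # about 0.837, i.e. 36/43
```

Because it is a harmonic mean, F₁ stays low if either precision or recall is low, which makes it more informative than accuracy on unbalanced data.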

Deploy the model

Re-engineer the model before integrating it with the application and deploying it.

Can be deployed as a Batch or as a Service

Amazon RDS Blue/Green Deployments

March 13, 2024 ~ Last updated on : April 1, 2024 ~ jayendrapatil

Amazon RDS Blue/Green Deployments

Amazon RDS Blue/Green Deployments help make and test database changes before implementing them in a production environment.

RDS Blue/Green Deployment has the blue environment as the current production environment and the green environment as the staging environment.
RDS Blue/Green Deployment creates a staging or green environment that exactly copies the production environment.

Green environment is a copy of the topology of the production environment and includes the features used by the DB instance including the Multi-AZ deployment, read replicas, the storage configuration, DB snapshots, automated backups, Performance Insights, and Enhanced Monitoring.
Green environment or the staging environment always stays in sync with the current production environment using logical replication.
RDS DB instances in the green environment can be changed without affecting production workloads. Changes can include the upgrade of major or minor DB engine versions, upgrade of underlying file system configuration, or change of database parameters in the staging environment.

Changes can be thoroughly tested in the green environment and when ready, the environments can be switched over to promote the green environment to be the new production environment.
Switchover typically takes under a minute with no data loss and no need for application changes.
Blue/Green Deployments are currently supported only for RDS for MariaDB, MySQL, and PostgreSQL.
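
The create, test, and switch over flow described above could be scripted roughly as follows. This is a sketch, assuming the boto3 RDS client operations `create_blue_green_deployment` and `switchover_blue_green_deployment`; the helper names, deployment name, ARN, and engine version are hypothetical placeholders, and the client is passed in so the helpers do not depend on any particular credentials setup.

```python
def create_green_environment(rds_client, name, source_arn, target_engine_version):
    """Create a blue/green deployment and return its identifier.

    rds_client is expected to be a boto3 RDS client, for example
    boto3.client("rds"). source_arn is the ARN of the blue (production)
    DB instance.
    """
    response = rds_client.create_blue_green_deployment(
        BlueGreenDeploymentName=name,
        Source=source_arn,
        TargetEngineVersion=target_engine_version,  # e.g. an engine upgrade on green
    )
    return response["BlueGreenDeployment"]["BlueGreenDeploymentIdentifier"]


def switch_over(rds_client, deployment_id, timeout_seconds=300):
    """Promote the green environment to be the new production environment."""
    return rds_client.switchover_blue_green_deployment(
        BlueGreenDeploymentIdentifier=deployment_id,
        SwitchoverTimeout=timeout_seconds,
    )


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   rds = boto3.client("rds")
#   bgd_id = create_green_environment(
#       rds, "demo-bgd", "arn:aws:rds:us-east-1:123456789012:db:mydb", "8.0.36")
#   ... wait for the green environment to become available and test it ...
#   switch_over(rds, bgd_id)
```

The two steps are kept separate on purpose: the switchover should only be triggered after the green environment has finished provisioning and the changes have been tested.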

RDS Blue/Green Deployments Benefits

Easily create a production-ready staging environment.
Automatically replicate database changes from the production environment to the staging environment.

Test database changes in a safe staging environment without affecting the production environment.
Stay current with database patches and system updates.
Implement and test newer database features.

Switch over your staging environment to be the new production environment without changes to your application.
Safely switch over through the use of built-in switchover guardrails.
Eliminate data loss during switchover.

Switch over quickly, typically under a minute depending on your workload.

References

Amazon_Blue_Green_Deployments

AWS RDS Security

March 7, 2024 ~ Last updated on : March 12, 2024 ~ jayendrapatil

AWS RDS Security

AWS RDS Security provides multiple features
- DB instance can be hosted in a VPC for the greatest possible network access control.
- IAM policies can be used to assign permissions that determine who is allowed to manage RDS resources.
- Security groups allow control of what IP addresses or EC2 instances can connect to the databases on a DB instance.
- RDS supports encryption in transit using SSL connections
- RDS supports encryption at rest to secure instances and snapshots at rest.
- Network encryption and transparent data encryption (TDE) with Oracle DB instances
- Authentication can be implemented using Password, Kerberos, and IAM database authentication.

RDS IAM and Access Control

IAM can be used to control which RDS operations each individual user has permission to call.

RDS Encryption at Rest

RDS encrypted instances use the industry-standard AES-256 encryption algorithm to encrypt data on the server that hosts the RDS instance.
RDS handles authentication of access and decryption of the data with a minimal impact on performance, and with no need to modify the database client applications
Data at Rest Encryption
- can be enabled on RDS instances to encrypt the underlying storage
- encryption keys are managed by KMS
- can be enabled only during instance creation
- once enabled, the encryption keys cannot be changed
- if the key is lost, the DB can only be restored from the backup
Once encryption is enabled for an RDS instance,
- logs are encrypted
- snapshots are encrypted
- automated backups are encrypted
- read replicas are encrypted
Encrypted snapshots can be copied to another AWS Region. Because KMS encryption keys are specific to the AWS Region that they are created in, the copy must specify a KMS key valid in the destination AWS Region. It can be a Region-specific KMS key, or a multi-Region key.
RDS DB Snapshot considerations
- DB snapshot encrypted using a KMS encryption key can be copied
- Copying an encrypted DB snapshot results in an encrypted copy of the DB snapshot
- When copying, the DB snapshot can either be encrypted with the same KMS encryption key as the original DB snapshot, or a different KMS encryption key to encrypt the copy of the DB snapshot.
- An unencrypted DB snapshot can be copied to an encrypted snapshot, to add encryption to a previously unencrypted DB instance.
- Encrypted snapshot can be restored only to an encrypted DB instance
- If a KMS encryption key is specified when restoring from an unencrypted DB cluster snapshot, the restored DB cluster is encrypted using the specified KMS encryption key
- Copying an encrypted snapshot shared from another AWS account requires access to the KMS encryption key used to encrypt the DB snapshot.
Transparent Data Encryption (TDE)
- Automatically encrypts the data before it is written to the underlying storage device and decrypts when it is read from the storage device
- is supported by Oracle and SQL Server
  - Oracle requires key storage outside of the KMS and integrates with CloudHSM for this
  - SQL Server requires a key but is managed by RDS

RDS Encryption in Transit – SSL

Encrypt connections using SSL for data in transit between the applications and the DB instance
RDS creates an SSL certificate and installs the certificate on the DB instance when RDS provisions the instance.
SSL certificates are signed by a certificate authority. SSL certificate includes the DB instance endpoint as the Common Name (CN) for the SSL certificate to guard against spoofing attacks

While SSL offers security benefits, be aware that SSL encryption is a compute-intensive operation and will increase the latency of the database connection.
For encrypted and unencrypted DB instances, data that is in transit between the source and the read replicas is encrypted, even when replicating across AWS Regions.

IAM Database Authentication

IAM database authentication works with MySQL and PostgreSQL.

IAM database authentication prevents the need to store static user credentials in the database because authentication is managed externally using IAM.
Authorization still happens within RDS (not IAM).
IAM database authentication does not require a password but needs an authentication token

An authentication token is a unique string of characters that RDS generates on request.
Authentication tokens are generated using AWS Signature Version 4.
Each Authentication token has a lifetime of 15 minutes

IAM database authentication provides the following benefits:
- Network traffic to and from the database is encrypted using the Secure Sockets Layer (SSL).
- helps centrally manage access to the database resources, instead of managing access individually on each DB instance.
- enables using IAM Roles to access the database instead of a password, for greater security.
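
Requesting an authentication token could look like the sketch below. It assumes the boto3 RDS client method `generate_db_auth_token`; the helper name, hostname, and user name are hypothetical, and the actual database connection (which must use SSL) is left to whichever MySQL or PostgreSQL driver is in use.

```python
def build_iam_db_credentials(rds_client, host, port, user, region):
    """Return connection settings where the password is a short-lived IAM token.

    rds_client is expected to be a boto3 RDS client, for example
    boto3.client("rds"). The token is signed with AWS Signature Version 4
    and expires after 15 minutes, so request a fresh one per connection.
    """
    token = rds_client.generate_db_auth_token(
        DBHostname=host,
        Port=port,
        DBUsername=user,
        Region=region,
    )
    return {"host": host, "port": port, "user": user, "password": token}


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   rds = boto3.client("rds", region_name="us-east-1")
#   settings = build_iam_db_credentials(
#       rds, "mydb.example.us-east-1.rds.amazonaws.com", 3306, "app_user", "us-east-1")
#   ... pass settings to the database driver with SSL enabled ...
```

The database user still has to exist and be granted privileges inside the database, which is the "authorization still happens within RDS" point above.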

RDS Security Groups

Security groups control the access that traffic has in and out of a DB instance
VPC security groups act like a firewall controlling network access to your DB instance.

VPC security groups can be configured and associated with the DB instance to allow access from an IP address range, port, or EC2 security group
Database security groups default to a “deny all” access mode and customers must specifically authorize network ingress.

RDS Rotating Secrets

RDS supports AWS Secrets Manager to automatically rotate the secret

Secrets Manager rotates the secret using a Lambda function that Secrets Manager provides.
Secrets Manager provides the following benefits
- Rotate secrets safely – rotate secrets automatically without disrupting the applications.
  - Secrets Manager offers built-in integrations for rotating credentials for RDS databases for MySQL, PostgreSQL, and Aurora.
  - Secrets Manager can be extended to meet custom rotation requirements by creating a Lambda function to rotate other types of secrets
- Manage secrets centrally – to store, view, and manage all the secrets.
- Security – By default, Secrets Manager encrypts these secrets with encryption keys that you own and control. Using fine-grained IAM policies, access to secrets can be controlled
- Monitor and audit easily – Secrets Manager integrates with AWS logging and monitoring services to help meet your security and compliance requirements.
- Pay as you go – Pay for the secrets stored and for the use of these secrets; there are no long-term contracts or licensing fees.

Master User Account Privileges

When you create a new DB instance, the default master user that is used gets certain privileges for that DB instance
Subsequently, other users with permissions can be created.

Event Notification

Event notifications can be configured for important events that occur on the DB instance

Notifications can be received for a variety of important events on the RDS instance, such as the instance being shut down, a backup being started, a failover occurring, the security group being changed, or storage space running low.

RDS Encrypted DB Instances Limitations

RDS Encryption can be enabled only during the creation of an RDS DB instance, not after the DB instance is created.
DB instances that are encrypted can’t be modified to disable encryption.

Encrypted snapshot of an unencrypted DB instance cannot be created.
An unencrypted backup or snapshot can’t be restored to an encrypted DB instance.
An encrypted read replica of an unencrypted DB instance, or an unencrypted read replica of an encrypted DB instance, can’t be created.

DB snapshot of an encrypted DB instance must be encrypted using the same KMS key as the DB instance.
Encrypted read replicas must be encrypted with the same CMK as the source DB instance when both are in the same AWS Region.
For encrypting an unencrypted RDS database, the following approaches can be used.
- Using snapshots; this option is feasible only if you can afford downtime.
  - Create a DB snapshot of the DB instance, which would be unencrypted.
  - Copy the unencrypted DB snapshot to an encrypted snapshot.
  - Restore a DB instance from the encrypted snapshot, which would be an encrypted DB instance.
- For minimal to no downtime, you can use AWS Database Migration Service (AWS DMS) to migrate and continuously replicate the data so that the cutover to the new, encrypted database is seamless.
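
The snapshot-based steps above might be scripted along these lines. This is a sketch, assuming the boto3 RDS client operations `create_db_snapshot`, `copy_db_snapshot`, and `restore_db_instance_from_db_snapshot` plus the `db_snapshot_available` waiter; the helper name and the snapshot naming scheme are hypothetical, and the DMS approach is not shown.

```python
def encrypt_via_snapshot(rds_client, source_instance_id, kms_key_id, new_instance_id):
    """Create an encrypted copy of an unencrypted DB instance via snapshots.

    rds_client is expected to be a boto3 RDS client, for example
    boto3.client("rds"). Snapshot names are derived from the source
    instance identifier purely for illustration.
    """
    plain_snapshot_id = f"{source_instance_id}-plain"
    encrypted_snapshot_id = f"{source_instance_id}-encrypted"
    waiter = rds_client.get_waiter("db_snapshot_available")

    # 1. Snapshot the unencrypted instance (the snapshot is unencrypted too).
    rds_client.create_db_snapshot(
        DBInstanceIdentifier=source_instance_id,
        DBSnapshotIdentifier=plain_snapshot_id,
    )
    waiter.wait(DBSnapshotIdentifier=plain_snapshot_id)

    # 2. Copy the snapshot; supplying a KMS key makes the copy encrypted.
    rds_client.copy_db_snapshot(
        SourceDBSnapshotIdentifier=plain_snapshot_id,
        TargetDBSnapshotIdentifier=encrypted_snapshot_id,
        KmsKeyId=kms_key_id,
    )
    waiter.wait(DBSnapshotIdentifier=encrypted_snapshot_id)

    # 3. Restore a new, encrypted instance from the encrypted snapshot.
    rds_client.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=new_instance_id,
        DBSnapshotIdentifier=encrypted_snapshot_id,
    )
    return new_instance_id
```

Writes made to the original instance after step 1 are not carried over, which is why this route only suits cases where downtime is acceptable.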

RDS API with Interface Endpoints (AWS PrivateLink)

AWS PrivateLink enables you to privately access RDS API operations without an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection.

DB instances in the VPC don’t need public IP addresses to communicate with RDS API endpoints to launch, modify, or terminate DB instances.
DB instances also don’t need public IP addresses to use any of the available RDS API operations.
Traffic between the VPC and RDS doesn’t leave the Amazon network.

AWS Certification Exam Practice Questions

Can I encrypt connections between my application and my DB Instance using SSL?
1. No
2. Yes
3. Only in VPC
4. Only in certain regions

Which of these configuration or deployment practices is a security risk for RDS?
1. Storing SQL function code in plaintext
2. Non-Multi-AZ RDS instance
3. Having RDS and EC2 instances exist in the same subnet
4. RDS in a public subnet (Making RDS accessible to the public internet in a public subnet poses a security risk, by making your database directly addressable and spammable. DB instances deployed within a VPC can be configured to be accessible from the Internet or from EC2 instances outside the VPC. If a VPC security group specifies a port access such as TCP port 22, you would not be able to access the DB instance because the firewall for the DB instance provides access only via the IP addresses specified by the DB security groups the instance is a member of and the port defined when the DB instance was created.)

References

AWS_RDS_User_Guide – Security