AWS Certified Machine Learning – Specialty (MLS-C01) Exam Learning Path

AWS Machine Learning - Specialty Certification

  • Finally re-certified for the updated AWS Certified Machine Learning – Specialty (MLS-C01) certification exam after 3 months of preparation.
  • Of all the Professional and Specialty certifications, I find this to be the toughest, partly because I am still diving deep into machine learning and had to relearn everything from the basics for this certification.
  • Machine Learning is a vast specialization in itself, and with the AWS services there is a lot to cover and know for the exam. This is the only exam where the majority of the focus is on concepts outside of AWS, i.e. pure machine learning, though it also covers AWS Machine Learning and Data Engineering services.

AWS Certified Machine Learning – Specialty (MLS-C01) Exam Content

  • AWS Certified Machine Learning – Specialty (MLS-C01) exam validates the ability to
    • Select and justify the appropriate ML approach for a given business problem.
    • Identify appropriate AWS services to implement ML solutions.
    • Design and implement scalable, cost-optimized, reliable, and secure ML solutions.

Refer to the AWS Certified Machine Learning – Specialty Exam Guide for details

                              AWS Certified Machine Learning – Specialty Domains

AWS Certified Machine Learning – Specialty (MLS-C01) Exam Summary

  • Specialty exams are tough, lengthy, and tiresome. Most of the questions and answer options have a lot of prose and require a lot of reading, so be sure you are prepared and manage your time well.
  • MLS-C01 exam has 65 questions to be solved in 170 minutes, which gives you roughly 2.5 minutes to attempt each question.
  • MLS-C01 exam includes two types of questions, multiple-choice and multiple-response.
  • MLS-C01 has a scaled score between 100 and 1,000. The scaled score needed to pass the exam is 750.
  • Specialty exams currently cost $300 + tax.
  • You can get an additional 30 minutes if English is your second language by requesting Exam Accommodations. It might not be needed for Associate exams but is helpful for Professional and Specialty ones.
  • As always, mark the questions for review, move on, and come back to them after you are done with all the others.
  • As always, having a rough architecture or mental picture of the setup helps you focus on the areas that need attention. You will usually be able to eliminate two answers for sure and then only need to focus on the remaining two. Compare those two to see where they differ, which helps you reach the right answer or at least gives you a 50% chance of getting it right.
  • AWS exams can be taken either at a testing center or online (remotely). I prefer to take them online as it provides a lot of flexibility. Just make sure you have a proper place to take the exam with no disturbance and nothing around you.
  • Also, if you are taking the AWS online exam for the first time, try to join at least 30 minutes before the actual time, as I have had long wait times with both PSI and Pearson.

AWS Certified Machine Learning – Specialty (MLS-C01) Exam Resources

AWS Certified Machine Learning – Specialty (MLS-C01) Exam Topics

  • AWS Certified Machine Learning – Specialty exam covers a lot of Machine Learning concepts. It digs deep into machine learning concepts, most of which are not related to AWS.
  • AWS Certified Machine Learning – Specialty exam covers the end-to-end Machine Learning lifecycle, right from data collection and transformation, pre-processing to make the data usable and efficient for Machine Learning, through training and validation, to implementation.

Machine Learning Concepts

  • Exploratory Data Analysis
    • Feature selection and Engineering
      • remove features that are not related to training
      • remove features that have the same values, very low correlation, very little variance, or a lot of missing values
      • Apply techniques like Principal Component Analysis (PCA) for dimensionality reduction i.e. reduce the number of features.
      • Apply techniques such as One-hot encoding and label encoding to help convert strings to numeric values, which are easier to process (see the scikit-learn sketch after this list).
      • Apply Normalization i.e. scaling values between 0 and 1 to handle data with large variance.
      • Apply feature engineering for feature reduction e.g. combining height and weight into a single derived feature instead of using both features.
    • Handle Missing data
      • remove the feature or rows with missing data
      • impute using Mean/Median values – valid only for numeric features and not categorical ones; also does not factor in correlation between features
      • impute using k-NN, Multivariate Imputation by Chained Equation (MICE), or Deep Learning – more accurate and factors in correlation between features
    • Handle unbalanced data
      • Source more data
      • Oversample minority or Undersample majority
      • Data augmentation using techniques like Synthetic Minority Oversampling Technique (SMOTE).
  • Modeling
    • Know the algorithm types – Supervised, Unsupervised, and Reinforcement – and which algorithm is best suited based on whether the available data is labelled or unlabelled.
      • Supervised learning trains on labeled data e.g. Linear Regression, Logistic Regression, Decision Trees, Random Forests
      • Unsupervised learning trains on unlabelled data e.g. PCA, SVD, K-means
      • Reinforcement learning trains based on actions and rewards e.g. Q-Learning
    • Hyperparameters
      • are parameters exposed by machine learning algorithms that control how the underlying algorithm operates and their values affect the quality of the trained models
      • some of the common hyperparameters are learning rate, batch size, and number of epochs (hint: if the learning rate is too large, the minimum might be overshot and the loss would oscillate; if the learning rate is too small, training requires too many steps, which takes longer and is less efficient)
  • Evaluation
    • Know the differences in metrics for evaluating model accuracy
      • Use Area Under the (Receiver Operating Characteristic) Curve (AUC) for Binary classification
      • Use root mean square error (RMSE) metric for regression
    • Understand Confusion matrix
      • A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.
      • A false positive is an outcome where the model incorrectly predicts the positive class. A false negative is an outcome where the model incorrectly predicts the negative class.
      • Recall or Sensitivity or TPR (True Positive Rate): the number of items correctly identified as positive out of all actual positives – TP/(TP+FN) (hint: optimize for this in cases like fraud detection, where the cost of marking non-frauds as frauds is lower than marking frauds as non-frauds; see the metrics sketch after this list)
      • Specificity or TNR (True Negative Rate): the number of items correctly identified as negative out of all actual negatives – TN/(TN+FP) (hint: optimize for this in cases like videos for kids, where the cost of dropping a few valid videos is lower than showing a few bad ones)
    • Handle Overfitting problems
      • Simplify the model by reducing the number of layers
      • Early Stopping – form of regularization while training a model with an iterative method, such as gradient descent
      • Data Augmentation
      • Regularization – technique to reduce the complexity of the model
      • Dropout is a regularization technique that prevents overfitting
      • Never train on test data
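
To make the preprocessing and evaluation bullets above concrete, here is a minimal scikit-learn sketch (not AWS-specific; the column names and data are made up) showing one-hot encoding, min-max normalization, mean imputation, and recall/specificity/AUC computed from a confusion matrix. It illustrates the concepts only and, purely for brevity, evaluates on the training data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical data: one numeric feature (with a missing value), one categorical
# feature, and a binary label.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "plan": ["basic", "pro", "basic", "pro", "pro", "basic"],
    "churned": [0, 1, 0, 1, 1, 0],
})

preprocess = ColumnTransformer([
    # mean imputation for missing numeric values, then min-max normalization to [0, 1]
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", MinMaxScaler())]), ["age"]),
    # one-hot encoding converts categorical strings to numeric columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
X, y = df[["age", "plan"]], df["churned"]
model.fit(X, y)

# Confusion-matrix based metrics: recall (TPR), specificity (TNR), and AUC.
tn, fp, fn, tp = confusion_matrix(y, model.predict(X)).ravel()
print("recall / TPR     :", tp / (tp + fn))
print("specificity / TNR:", tn / (tn + fp))
print("AUC              :", roc_auc_score(y, model.predict_proba(X)[:, 1]))
```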

Machine Learning Services

  • SageMaker
    • supports File mode, Pipe mode, and Fast File mode
      • File mode loads all of the data from S3 to the training instance volumes, while Pipe mode streams data directly from S3.
      • File mode needs disk space to store both the final model artifacts and the full training dataset, while Pipe mode helps reduce the required size of the EBS volumes.
      • Fast File mode combines the ease of use of the existing File mode with the performance of Pipe mode.
    • Using RecordIO format allows algorithms to take advantage of Pipe mode when training the algorithms that support it. 
    • supports Model tracking capability to manage up to thousands of machine learning model experiments
    • supports automatic scaling for production variants. Automatic scaling dynamically adjusts the number of instances provisioned for a production variant in response to changes in your workload
    • provides pre-built Docker images for its built-in algorithms and the supported deep learning frameworks used for training & inference
    • SageMaker Automatic Model Tuning
      • is the process of finding a set of hyperparameters for an algorithm that can yield an optimal model.
      • Best practices
        • limit the search to a smaller number as the difficulty of a hyperparameter tuning job depends primarily on the number of hyperparameters that Amazon SageMaker has to search
        • DO NOT specify a very large range to cover every possible value for a hyperparameter as it affects the success of hyperparameter optimization.
        • hyperparameters that span several orders of magnitude can be converted to a logarithmic scale to improve hyperparameter optimization.
        • running one training job at a time achieves the best results with the least amount of compute time.
        • Design distributed training jobs so that they report the objective metric that you want.
    • know how to take advantage of multiple GPUs (hint: increase the learning rate and batch size in proportion to the number of GPUs)
    • Elastic Inference (now deprecated in favor of Inferentia) helps attach low-cost GPU-powered acceleration to EC2 and SageMaker instances or ECS tasks to reduce the cost of running deep learning inference.
    • SageMaker Inference options (see the boto3 real-time invocation sketch after this list)
      • Real-time inference is ideal for online inferences that have low latency or high throughput requirements.
      • Serverless Inference is ideal for intermittent or unpredictable traffic patterns as it manages all of the underlying infrastructure with no need to manage instances or scaling policies.
      • Batch Transform is suitable for offline processing when large amounts of data are available upfront and you don’t need a persistent endpoint.
      • Asynchronous Inference is ideal when you want to queue requests and have large payloads with long processing times.
    • SageMaker Model deployment allows deploying multiple variants of a model to the same SageMaker endpoint to test new models without impacting the user experience
      • Production Variants
        • supports A/B or Canary testing where you can allocate a portion of the inference requests to each variant.
        • helps compare production variants’ performance relative to each other.
      • Shadow Variants
        • replicates a portion of the inference requests that go to the production variant to the shadow variant.
        • logs the responses of the shadow variant for comparison and not returned to the caller.
        • helps test the performance of the shadow variant without exposing the caller to the response produced by the shadow variant.
    • SageMaker Managed Spot Training can use Spot instances to save cost, and combined with the Checkpointing feature can save the state of ML models during training
    • SageMaker Feature Store
      • helps to create, share, and manage features for ML development.
      • is a centralized store for features and associated metadata so features can be easily discovered and reused.
    • SageMaker Debugger provides tools to debug training jobs and resolve problems such as overfitting, saturated activation functions, and vanishing gradients to improve the model’s performance.
    • SageMaker Model Monitor monitors the quality of SageMaker machine learning models in production and can help set alerts that notify when there are deviations in the model quality.
    • SageMaker Automatic Model Tuning helps find a set of hyperparameters for an algorithm that can yield an optimal model.
    • SageMaker Data Wrangler
      • reduces the time it takes to aggregate and prepare tabular and image data for ML from weeks to minutes.
      • simplifies the process of data preparation (including data selection, cleansing, exploration, visualization, and processing at scale) and feature engineering.
    • SageMaker Experiments is a capability of SageMaker that lets you create, manage, analyze, and compare machine learning experiments.
    • SageMaker Clarify helps improve the ML models by detecting potential bias and helping to explain the predictions that the models make.
    • SageMaker Model Governance is a framework that gives systematic visibility into ML model development, validation, and usage.
    • SageMaker Autopilot is an automated machine learning (AutoML) feature set that automates the end-to-end process of building, training, tuning, and deploying machine learning models.
    • SageMaker Neo enables machine learning models to train once and run anywhere in the cloud and at the edge.
    • SageMaker API and SageMaker Runtime support VPC interface endpoints powered by AWS PrivateLink that helps connect VPC directly to the SageMaker API or SageMaker Runtime using AWS PrivateLink without using an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection.
    • Algorithms –
      • BlazingText provides Word2vec and text classification algorithms
      • DeepAR provides a supervised learning algorithm for forecasting scalar (one-dimensional) time series (hint: train for new products based on existing products’ sales data).
      • Factorization Machines is a general-purpose supervised learning algorithm for classification and regression tasks that helps capture interactions between features within high-dimensional sparse datasets economically.
      • Image classification algorithm is a supervised learning algorithm that supports multi-label classification.
      • IP Insights is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses.
      • K-means is an unsupervised learning algorithm for clustering as it attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups.
      • k-nearest neighbors (k-NN) algorithm is an index-based algorithm. It uses a non-parametric method for classification or regression.
      • Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. Used to identify number of topics shared by documents within a text corpus
      • Neural Topic Model (NTM) Algorithm is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution
      • Linear models are supervised learning algorithms used for solving either classification or regression problems. 
        • For regression (predictor_type=’regressor’), the score is the prediction produced by the model.
        • For classification (predictor_type=’binary_classifier’ or predictor_type=’multiclass_classifier’), the output includes a score and a predicted_label.
      • Object Detection algorithm detects and classifies objects in images using a single deep neural network
      • Principal Component Analysis (PCA) is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) (hint: dimensionality reduction)
      • Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points (hint: anomaly detection)
      • Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens. (hint: text summarization is the key use case)
  • SageMaker Ground Truth
    • provides automated data labeling using machine learning
    • helps build highly accurate training datasets for machine learning quickly using Amazon Mechanical Turk
    • provides annotation consolidation to help improve the accuracy of the data object’s labels. It combines the results of multiple workers’ annotation tasks into one high-fidelity label.
    • automated data labeling uses machine learning to label portions of the data automatically without having to send them to human workers
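
As a quick illustration of the real-time inference option mentioned above, the following is a minimal boto3 sketch that invokes an already-deployed SageMaker endpoint. The endpoint name and CSV payload are hypothetical.

```python
import boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="my-xgboost-endpoint",   # hypothetical, assumed to already exist
    ContentType="text/csv",
    Body="4.9,3.0,1.4,0.2",
)
print(response["Body"].read().decode("utf-8"))
```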

Machine Learning & AI Managed Services

  • Comprehend
    • natural language processing (NLP) service to find insights and relationships in text.
    • identifies the language of the text; extracts key phrases, places, people, brands, or events; understands how positive or negative the text is; analyzes text using tokenization and parts of speech; and automatically organizes a collection of text files by topic.
  • Lex
    • provides conversational interfaces using voice and text, helpful in building voice and text chatbots
  • Polly
    • converts text into speech
    • supports Speech Synthesis Markup Language (SSML) tags like prosody so users can adjust the speech rate, pitch or volume.
    • supports pronunciation lexicons to customize the pronunciation of words
  • Rekognition – analyzes images and videos
    • helps identify objects, people, text, scenes, and activities in images and videos, as well as detect any inappropriate content.
  • Translate – natural and fluent language translation
  • Transcribe – automatic speech recognition (ASR) speech-to-text
  • Kendra – an intelligent search service that uses NLP and advanced ML algorithms to return specific answers to search questions from your data.
  • Panorama brings computer vision to the on-premises camera network.
  • Augmented AI (Amazon A2I) is an ML service that makes it easy to build the workflows required for human review.
  • Forecast – highly accurate forecasts.

Analytics

  • Make sure you know and understand data engineering concepts mainly in terms of data capture, migration, transformation, and storage.
  • Kinesis
    • Understand Kinesis Data Streams and Kinesis Data Firehose in depth
    • Kinesis Data Analytics can process and analyze streaming data using standard SQL and integrates with Data Streams and Firehose
    • Know Kinesis Data Streams vs Kinesis Firehose
      • Kinesis Data Streams is open-ended on both the producer and consumer side. It supports the KCL and works with Spark.
      • Kinesis Data Firehose is open-ended on the producer side only. Data is delivered to destinations such as S3, Redshift, and OpenSearch (Elasticsearch).
      • Kinesis Data Firehose buffers data in batches with a minimum buffer interval of 60 seconds.
      • Kinesis Data Firehose supports data transformation and record format conversion using a Lambda function (hint: can be used for transforming CSV or JSON into Parquet; see the Firehose sketch after this list)
    • Kinesis Video Streams provides a fully managed service to ingest, index, store, and stream live video. HLS can be used to view a Kinesis video stream, either for live playback or to view archived video.
  • OpenSearch (ElasticSearch) is a search service that supports indexing, full-text search, faceting, etc.
  • Data Pipeline helps define data-driven flows to automate and schedule regular data movement and data processing activities in AWS
  • Glue is a fully managed, ETL (extract, transform, and load) service that automates the time-consuming steps of data preparation for analytics
    • helps setup, orchestrate, and monitor complex data flows.
    • Glue Data Catalog is a central repository to store structural and operational metadata for all the data assets.
    • Glue crawler connects to a data store, extracts the schema of the data, and then populates the Glue Data Catalog with this metadata
    • Glue DataBrew is a visual data preparation tool that enables users to clean and normalize data without writing any code.
  • DataSync is an online data transfer service that simplifies, automates, and accelerates moving data between storage systems and services.
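
The following is a minimal boto3 sketch of the producer side of Kinesis Data Firehose described above. The delivery stream name is a placeholder and is assumed to be configured with a Lambda transformation and an S3 destination (e.g. Parquet via record format conversion).

```python
import json
import boto3

firehose = boto3.client("firehose")
record = {"user": "u-123", "action": "click", "ts": "2024-01-01T00:00:00Z"}

firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",   # hypothetical delivery stream
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```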

Security, Identity & Compliance

  • Security is covered very lightly. (hint: SageMaker can read data from KMS-encrypted S3. Make sure the KMS key policy includes the role attached to SageMaker.)

Management & Governance Tools

  • Understand AWS CloudWatch for Logs and Metrics. (hint: SageMaker is integrated with CloudWatch, and logs and metrics are all stored in it)

Storage

  • Understand Data Storage Options – Know patterns for S3 vs RDS vs DynamoDB vs Redshift. (hint: S3 is, by default, the data storage option for Big Data storage, so look for it in the answers.)

Whitepapers and articles

AWS SageMaker Built-in Algorithms Summary

SageMaker Built-in Algorithms

  • SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and ML practitioners get started on training and deploying ML models quickly.

Text-based

BlazingText algorithm

  • provides highly optimized implementations of the Word2vec and text classification algorithms.
  • Word2vec algorithm
    • useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc.
    • maps words to high-quality distributed vectors, whose representation is called word embeddings
    • word embeddings capture the semantic relationships between words.
  • Text classification
    • is an important task for applications performing web searches, information retrieval, ranking, and document classification
  • provides the Skip-gram and continuous bag-of-words (CBOW) training architectures

Forecasting

DeepAR

  • is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).
  • the trained model can be used to generate forecasts for new time series that are similar to the ones it has been trained on.

Recommendation

Factorization Machines

  • is a general-purpose supervised learning algorithm used for both classification and regression tasks.
  • is an extension of a linear model designed to capture interactions between features within high-dimensional sparse datasets economically, such as click prediction and item recommendation.

Clustering

K-means algorithm

  • is an unsupervised learning algorithm for clustering
  • attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups

Classification

K-nearest neighbors (k-NN) algorithm

  • is an index-based algorithm.
  • uses a non-parametric method for classification or regression.
  • For classification problems, the algorithm queries the k points that are closest to the sample point and returns the most frequently used label of their class as the predicted label.
  • For regression problems, the algorithm queries the k closest points to the sample point and returns the average of their feature values as the predicted value.
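
A tiny NumPy illustration of the k-NN idea described above (this is not the SageMaker implementation): classification returns the majority label of the k nearest points, and regression returns their average. The data points are made up.

```python
from collections import Counter
import numpy as np

# Hypothetical 2-D training data: two clusters with class labels and regression targets.
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.2], [7.9, 8.1], [8.3, 7.9]])
y_class = np.array([0, 0, 1, 1, 1])
y_reg = np.array([1.1, 0.9, 8.1, 8.0, 8.3])

def knn_predict(x, k=3):
    # distances from the query point to every training point
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dist)[:k]
    label = Counter(y_class[nearest]).most_common(1)[0][0]  # classification: majority vote
    value = y_reg[nearest].mean()                           # regression: average of neighbors
    return label, value

print(knn_predict(np.array([7.5, 8.0])))   # roughly (1, 8.13)
```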

Linear Learner

  • is a supervised learning algorithm used for solving either classification or regression problems

XGBoost (eXtreme Gradient Boosting)

  • is a popular and efficient open-source implementation of the gradient boosted trees algorithm.
  • Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models
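
The sketch below shows, in hedged form, how the built-in XGBoost algorithm is typically trained with the SageMaker Python SDK (v2-style API). The S3 paths, IAM role ARN, and hyperparameter values are placeholders for illustration only.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name
# Retrieve the built-in XGBoost container image for this region.
image_uri = sagemaker.image_uris.retrieve("xgboost", region=region, version="1.5-1")

xgb = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/xgb/output",                        # placeholder
)
xgb.set_hyperparameters(objective="reg:squarederror", num_round=100)

# CSV training data in S3 (label in the first column, no header).
xgb.fit({"train": TrainingInput("s3://my-bucket/xgb/train/", content_type="text/csv")})
```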

Topic Modelling

Latent Dirichlet Allocation (LDA)

  • is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories.
  • used to discover a user-specified number of topics shared by documents within a text corpus.

Neural Topic Model (NTM)

  • is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution
  • Topic modeling can be used to classify or summarize documents based on the topics detected or to retrieve information or recommend content based on topic similarities.

Feature Reduction

Object2Vec

  • is a general-purpose neural embedding algorithm that is highly customizable
  • can learn low-dimensional dense embeddings of high-dimensional objects.

Principal Component Analysis – PCA

  • is an unsupervised ML algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible.

Anomaly Detection

Random Cut Forest (RCF)

  • is an unsupervised algorithm for detecting anomalous data points within a data set.

IP Insights

  • is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses.
  • designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers

Sequence Translation

Sequence to Sequence – seq2seq

  • is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio), and the output generated is another sequence of tokens.
  • key use cases are machine translation (input a sentence from one language and predict what that sentence would be in another language), text summarization (input a longer string of words and predict a shorter string of words that is a summary), and speech-to-text (audio clips converted into output sentences in tokens)

Computer Vision – CV

Image classification

  • a supervised learning algorithm that supports multi-label classification
  • takes an image as input and outputs one or more labels
  • uses a convolutional neural network (ResNet) that can be trained from scratch or trained using transfer learning when a large number of training images are not available.
  • the recommended input format is Apache MXNet RecordIO; raw images in .jpg or .png format are also supported.

Object Detection

  • detects and classifies objects in images using a single deep neural network.
  • is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene.

Semantic Segmentation

  • provides a fine-grained, pixel-level approach to developing computer vision applications.
  • tags every pixel in an image with a class label from a predefined set of classes and is critical to an increasing number of CV applications, such as self-driving vehicles, medical imaging diagnostics, and robot sensing.
  • also provides information about the shapes of the objects contained in the image. The segmentation output is represented as a grayscale image, called a segmentation mask.

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. An analytics team at an organization wants to use anomaly detection to identify potential risks. Which Amazon SageMaker machine learning algorithm is best suited for identifying anomalies?
    1. Semantic segmentation
    2. K-nearest neighbors
    3. Latent Dirichlet Allocation (LDA)
    4. Random Cut Forest (RCF)
  2. An ML specialist team working for a marketing consulting firm wants to apply different marketing strategies per segment of their customer base. Online retailer purchase history from the last 5 years is available, and it has been decided to segment the customers based on their purchase history. Which type of machine learning algorithm would give segmentation based on purchase history in the most expeditious manner?

    1. K-Nearest Neighbors (KNN)
    2. K-Means
    3. Semantic Segmentation
    4. Neural Topic Model (NTM)
  3. An ML specialist team is looking to improve the quality of searches for their library of documents that are uploaded in PDF, Rich Text Format, or ASCII text. It is looking to use machine learning to automate the identification of key topics for each of the documents. Which machine learning resources are best suited for this problem? (Select TWO)
    1. BlazingText algorithm
    2. Latent Dirichlet Allocation (LDA) algorithm
    3. Topic Finder (TF) algorithm
    4. Neural Topic Model (NTM) algorithm
  4. A manufacturing company has a large set of labeled historical sales data. The company would like to predict how many units of a particular part should be produced each quarter. Which machine learning approach should be used to solve this problem?
    1. BlazingText algorithm
    2. Random Cut Forest (RCF)
    3. Principal component analysis (PCA)
    4. Linear regression
  5. An agency collects census information with responses to approximately 500 questions from each citizen. Which algorithm would help reduce the number of features?
    1. Factorization machines (FM) algorithm
    2. Latent Dirichlet Allocation (LDA) algorithm
    3. Principal component analysis (PCA) algorithm
    4. Random Cut Forest (RCF) algorithm
  6. A store wants to understand some characteristics of visitors to the store. The store has security video recordings from the past several years. The store wants to group visitors by hair style and hair color. Which solution will meet these requirements with the LEAST amount of effort?
    1. Object detection algorithm
    2. Latent Dirichlet Allocation (LDA) algorithm
    3. Random Cut Forest (RCF) algorithm
    4. Semantic segmentation algorithm

References

SageMaker Built-in Algorithms

AWS Machine Learning Services – Cheat Sheet

AWS Machine Learning Services

Amazon SageMaker

  • Build, train, and deploy machine learning models at scale
  • fully-managed service that enables data scientists and developers to quickly and easily build, train & deploy machine learning models.
  • enables developers and scientists to build machine learning models for use in intelligent, predictive apps.
  • is designed for high availability with no maintenance windows or scheduled downtimes.
  • allows users to select the number and type of instance used for the hosted notebook, training & model hosting.
  • models can be deployed for real-time inference via endpoints or for offline batch inference (batch transform).
  • supports Canary deployment using ProductionVariant and deploying multiple variants of a model to the same SageMaker HTTPS endpoint.
  • supports Jupyter notebooks.
  • Users can persist their notebook files on the attached ML storage volume.
  • Users can modify the notebook instance and select a larger profile through the SageMaker console, after saving their files and data on the attached ML storage volume.
  • includes built-in algorithms for linear regression, logistic regression, k-means clustering, principal component analysis, factorization machines, neural topic modeling, latent dirichlet allocation, gradient boosted trees, seq2seq, time series forecasting, word2vec & image classification
  • algorithms work best when using the optimized protobuf RecordIO format for the training data, which allows Pipe mode that streams data directly from S3 and helps with faster start times and reduced space requirements
  • provides built-in algorithms, pre-built container images, or extend a pre-built container image and even build your custom container image.
  • supports custom training algorithms provided through a Docker image adhering to the documented specification.
  • also provides optimized MXNet, TensorFlow, Chainer & PyTorch containers
  • ensures that ML model artifacts and other system artifacts are encrypted in transit and at rest.
  • requests to the API and console are made over a secure (SSL) connection.
  • stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.
  • SageMaker Neo is a capability that enables machine learning models to train once and run anywhere in the cloud and at the edge.

Amazon Textract

  • Textract provides OCR and helps add document text detection and analysis to the applications.
  • includes simple, easy-to-use API operations that can analyze image files and PDF files.

Amazon Comprehend

  • Comprehend is a managed natural language processing (NLP) service to find insights and relationships in text.
  • identifies the language of the text; extracts key phrases, places, people, brands, or events; understands how positive or negative the text is; analyzes text using tokenization and parts of speech; and automatically organizes a collection of text files by topic.
  • can analyze a collection of documents and other text files (such as social media posts) and automatically organize them by relevant terms or topics.
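
A minimal boto3 sketch of a few of the Comprehend operations described above; the input text is made up.

```python
import boto3

comprehend = boto3.client("comprehend")
text = "Amazon Comprehend makes it easy to find insights in customer emails."

print(comprehend.detect_dominant_language(Text=text)["Languages"])
print(comprehend.detect_sentiment(Text=text, LanguageCode="en")["Sentiment"])
print(comprehend.detect_entities(Text=text, LanguageCode="en")["Entities"])
print(comprehend.detect_key_phrases(Text=text, LanguageCode="en")["KeyPhrases"])
```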

Amazon Lex

  • is a service for building conversational interfaces using voice and text.
  • provides the advanced deep learning functionalities of automatic speech recognition (ASR) for converting speech to text, and natural language understanding (NLU) to recognize the intent of the text, to enable building applications with highly engaging user experiences and lifelike conversational interactions.
  • common use cases of Lex include: Application/Transactional bot, Informational bot, Enterprise Productivity bot, and Device Control bot.
  • leverages Lambda for Intent fulfillment, Cognito for user authentication & Polly for text-to-speech.
  • scales to customers’ needs and does not impose bandwidth constraints.
  • is a completely managed service so users don’t have to manage the scaling of resources or maintenance of code.
  • uses deep learning to improve over time.

Amazon Polly

  • converts text into speech
  • uses advanced deep-learning technologies to synthesize speech that sounds like a human voice.
  • supports Lexicons to customize pronunciation of specific words & phrases
  • supports Speech Synthesis Markup Language (SSML) tags like prosody so users can adjust the speech rate, pitch, pauses, or volume.
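
A minimal boto3 sketch of Polly speech synthesis using the SSML prosody tag mentioned above; the voice and text are illustrative.

```python
import boto3

polly = boto3.client("polly")
ssml = "<speak><prosody rate='slow' volume='loud'>Hello from Amazon Polly.</prosody></speak>"

response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",       # tell Polly the input is SSML, not plain text
    VoiceId="Joanna",
    OutputFormat="mp3",
)
with open("hello.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```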

Amazon Rekognition

  • analyzes images and videos
  • identify objects, people, text, scenes, and activities in images and videos, as well as detect any inappropriate content.
  • provides highly accurate facial analysis and facial search capabilities that can be used to detect, analyze, and compare faces for a wide variety of user verification, people counting, and public safety use cases.
  • helps identify potentially unsafe or inappropriate content across both image and video assets and provides detailed labels that help accurately control what you want to allow based on your needs.
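
A minimal boto3 sketch of Rekognition label and moderation detection on an image stored in S3; the bucket and object key are placeholders.

```python
import boto3

rekognition = boto3.client("rekognition")
image = {"S3Object": {"Bucket": "my-bucket", "Name": "photos/store-entrance.jpg"}}

# Detect objects/scenes in the image.
labels = rekognition.detect_labels(Image=image, MaxLabels=10)
print([l["Name"] for l in labels["Labels"]])

# Detect potentially unsafe or inappropriate content.
moderation = rekognition.detect_moderation_labels(Image=image)
print([m["Name"] for m in moderation["ModerationLabels"]])
```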

Amazon Forecast

  • Amazon Forecast is a fully managed time-series forecasting service that uses statistical and machine learning algorithms to deliver highly accurate time-series forecasts and is built for business metrics analysis.
  • automatically tracks the accuracy of the model over time as new data is imported. Model’s deviation from initial quality metrics can be systematically quantified and used to make more informed decisions about keeping, retraining, or rebuilding the model as new data comes in.
  • provides six built-in algorithms which include ARIMA, Prophet, NPTS, ETS, CNN-QR, and DeepAR+.
  • integrates with AutoML to choose the optimal model for the datasets.

Amazon SageMaker Ground Truth

  • helps build highly accurate training datasets for machine learning quickly.
  • offers easy access to labelers through Amazon Mechanical Turk and provides them with built-in workflows and interfaces for common labeling tasks.
  • allows using your own labelers or use vendors recommended by Amazon through AWS Marketplace.
  • helps lower labeling costs by up to 70% using automatic labeling, which works by training Ground Truth from data labeled by humans so that the service learns to label data independently.
  • provides annotation consolidation to help improve the accuracy of the data object’s labels.

Amazon Translate

  • provides natural and fluent language translation
  • a neural machine translation service that delivers fast, high-quality, and affordable language translation.
  • Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and natural-sounding translation than traditional statistical and rule-based translation algorithms.
  • allows content localization – such as websites and applications – for international users, and to easily translate large volumes of text efficiently.

Amazon Transcribe

  • provides speech-to-text capability
  • uses a deep learning process called automatic speech recognition (ASR) to convert speech to text quickly and accurately.
  • can be used to transcribe customer service calls, automate closed captioning and subtitling, and generate metadata for media assets to create a fully searchable archive.
  • adds punctuation and formatting so that the output closely matches the quality of manual transcription at a fraction of the time and expense.
  • processes audio in batch or near real-time.
  • supports automatic language identification.
  • supports custom vocabulary to generate more accurate transcriptions for domain-specific words and phrases like product names, technical terminology, or names of individuals.
  • supports specifying a list of words to remove from transcripts.

Amazon Kendra

  • is an intelligent search service that uses NLP and advanced ML algorithms to return specific answers to search questions from your data.
  • uses its semantic and contextual understanding capabilities to decide whether a document is relevant to a search query.
  • returns specific answers to questions, giving users an experience that’s close to interacting with a human expert.
  • provides a unified search experience by connecting multiple data repositories to an index and ingesting and crawling documents.
  • can use the document metadata to create a feature-rich and customized search experience for the users, helping them efficiently find the right answers to their queries.

Augmented AI (Amazon A2I)

  • Augmented AI (Amazon A2I) is an ML service that makes it easy to build the workflows required for human review.
  • brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers, whether it runs on AWS or not.

Amazon Personalize

  • Personalize is a fully managed machine learning service that uses data to generate item recommendations.
  • can also generate user segments based on the users’ affinity for certain items or item metadata.
  • generates recommendations primarily based on item interaction data that comes from the users interacting with items in the catalog.
  • includes API operations for real-time personalization, and batch operations for bulk recommendations and user segments.

Amazon Panorama

  • brings computer vision to the on-premises camera network.
  • AWS Panorama Appliance or another compatible device can be installed in the data center and registered with AWS Panorama to deploy computer vision applications from the cloud.
  • AWS Panorama Appliance
    • is a compact edge appliance that uses a powerful system-on-module (SOM) that is optimized for ML workloads.
    • can run multiple computer vision models against multiple video streams in parallel and output the results in real-time.
    • is designed for use in commercial and industrial settings and is rated for dust and liquid protection.
  • works with the existing real-time streaming protocol (RTSP) network cameras.

Amazon Fraud Detector

  • Fraud Detector is a fully managed service to identify potentially fraudulent online activities such as online payment fraud and fake account creation.
  • takes care of all the heavy lifting such as data validation and enrichment, feature engineering, algorithm selection, hyperparameter tuning, and model deployment.

AWS IoT Greengrass ML Inference

  • IoT Greengrass helps perform machine learning inference locally on devices, using models that are created, trained, and optimized in the cloud.
  • provides flexibility to use machine learning models trained in SageMaker or to bring your pre-trained model stored in S3.
  • helps get inference results with very low latency to ensure the IoT applications can respond quickly to local events.

Amazon Elastic Inference

  • helps attach low-cost GPU-powered acceleration to EC2 and SageMaker instances or ECS tasks to reduce the cost of running deep learning inference by up to 75%.
  • supports TensorFlow, Apache MXNet, and ONNX models.

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. A company has built a deep learning model and now wants to deploy it using the SageMaker Hosting Services. For inference, they want a cost-effective option that guarantees low latency but still comes at a fraction of the cost of using a GPU instance for your endpoint. As a machine learning Specialist, what feature should be used?
    1. Inference Pipeline
    2. Elastic Inference
    3. SageMaker Ground Truth
    4. SageMaker Neo
  2. A machine learning specialist works for an online retail company that sells health products. The company allows users to enter reviews of the products they buy from the website. The company wants to make sure the reviews do not contain any offensive or unsafe content, such as obscenities or threatening language. Which Amazon SageMaker algorithm or service will allow scanning users’ review text in the simplest way?
    1. BlazingText
    2. Transcribe
    3. Semantic Segmentation
    4. Comprehend
  3. A company develops a tool whose coverage includes blogs, news sites, forums, videos, reviews, images, and social networks such as Twitter and Facebook. Users can search data by using Text and Image Search, and use charting, categorization, sentiment analysis, and other features to provide further information and analysis. They want to provide Image and text analysis capabilities to the applications which include identifying objects, people, text, scenes, and activities, and also provide highly accurate facial analysis and facial recognition. What service can provide this capability?
    1. Amazon Comprehend
    2. Amazon Rekognition
    3. Amazon Polly
    4. Amazon SageMaker

AWS SageMaker


  • SageMaker is a fully managed machine learning service to build, train, and deploy machine learning (ML) models quickly.
  • removes the heavy lifting from each step of the machine learning process to make it easier to develop high-quality models.
  • is designed for high availability with no maintenance windows or scheduled downtimes
  • APIs run in Amazon’s proven, high-availability data centers, with service stack replication configured across three facilities in each AWS region to provide fault tolerance in the event of a server failure or AZ outage
  • provides a full end-to-end workflow, but users can continue to use their existing tools with SageMaker.
  • supports Jupyter notebooks.
  • allows users to select the number and type of instance used for the hosted notebook, training & model hosting.

SageMaker Overview

SageMaker Machine Learning

Generate example data

  • Involves exploring and preprocessing, or “wrangling,” example data before using it for model training.
  • To preprocess data, you typically do the following:
    • Fetch the data
    • Clean the data
    • Prepare or transform the data

Train a model

  • Model training includes both training and evaluating the model, as follows:
  • Training the model
    • Needs an algorithm, which depends on several factors.
    • Need compute resources for training.
  • Evaluating the model
    • determines whether the accuracy of the inferences is acceptable.

Training Data Format – File mode vs Pipe mode vs Fast File mode

  • SageMaker supports Simple Storage Service (S3), Elastic File System (EFS), and FSx for Lustre for the training dataset location.
  • Most SageMaker algorithms work best when using the optimized protobuf recordIO format for the training data.
  • Using RecordIO format allows algorithms to take advantage of Pipe mode when training the algorithms that support it.
  • File mode
    • loads all the data from S3 to the training instance volumes.
    • downloads training data to an encrypted EBS volume attached to the training instance.
    • download needs to finish before model training starts
    • needs disk space to store both the final model artifacts and the full training dataset.
  • In Pipe mode
    • streams data directly from S3.
    • Streaming can provide faster start times for training jobs and better throughput.
    • helps reduce the size of the EBS volumes for the training instances as it needs only enough disk space to store the final model artifacts.
    • needs code changes
  • Fast File mode
    • combines the ease of use of the existing File Mode with the performance of Pipe Mode.
    • provides access to data as if it were downloaded locally while offering the performance benefit of streaming the data directly from S3.
    • training can start without waiting for the entire dataset to be downloaded to the training instances.
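
A hedged SageMaker Python SDK sketch of selecting the three input modes above when defining training channels; the S3 paths and content types are placeholders for illustration.

```python
from sagemaker.inputs import TrainingInput

# File mode (default): the full dataset is downloaded to the training volume.
file_input = TrainingInput("s3://my-bucket/train/", content_type="text/csv")

# Pipe mode: data is streamed from S3 (works best with protobuf RecordIO).
pipe_input = TrainingInput("s3://my-bucket/train-recordio/",
                           content_type="application/x-recordio-protobuf",
                           input_mode="Pipe")

# Fast File mode: file-like access with streaming performance, no code changes.
fastfile_input = TrainingInput("s3://my-bucket/train/",
                               content_type="text/csv",
                               input_mode="FastFile")

# estimator.fit({"train": fastfile_input})   # pass whichever channel the algorithm expects
```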

Build Model

  • SageMaker provides several built-in machine-learning algorithms that can be used for a variety of problem types.
  • Write a custom training script in a machine learning framework that SageMaker supports, and use one of the pre-built framework containers to run it in SageMaker.
  • Bring your algorithm or model to train or host in SageMaker.
    • SageMaker provides pre-built Docker images for its built-in algorithms and the supported deep-learning frameworks used for training and inference.
    • By using containers, machine learning algorithms can be trained and deployed quickly and reliably at any scale.
  • Use an algorithm that you subscribe to from AWS Marketplace.

Model Deployment

  • Model deployment helps deploy the ML code to make predictions, also known as Inference.
  • supports auto-scaling for the hosted models to dynamically adjust the number of instances provisioned in response to changes in the workload.
  • supports Multi-model endpoints to provide a scalable and cost-effective solution for deploying large numbers of models using a shared serving container on a single endpoint.
  • can provide high availability and reliability by deploying multiple instances of the production endpoint across multiple AZs.
  • SageMaker provides multiple inference options.
    • Real-time inference
      • is ideal for online inferences that have low latency or high throughput requirements.
      • provides a persistent and fully managed endpoint (REST API) that can handle sustained traffic, backed by the instance type of your choice.
      • can support payload sizes up to 6 MB and processing times of 60 seconds.
    • Serverless Inference
      • is ideal for intermittent or unpredictable traffic patterns.
      • manages all of the underlying infrastructure with no need to manage instances or scaling policies.
      • provides a pay-as-you-use model, and charges only for what you use and not for idle time.
      • can support payload sizes up to 4 MB and processing times up to 60 seconds.
    • Batch Transform
      • is suitable for offline processing when large amounts of data are available upfront and you don’t need a persistent endpoint.
      • can be used for pre-processing datasets.
      • can support large datasets that are GBs in size and processing times of days.
    • Asynchronous Inference
      • is ideal when you want to queue requests and have large payloads with long processing times.
      • can support payloads up to 1 GB and long processing times up to one hour.
      • can also scale down the endpoint to 0 when there are no requests to process.
  • Inference pipeline
    • is a SageMaker model that is composed of a linear sequence of multiple (2-15) containers that process requests for inferences on data.
    • can be used to define and deploy any combination of pre-trained SageMaker built-in algorithms and your custom algorithms packaged in Docker containers.
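
A hedged SageMaker Python SDK sketch of three of the deployment options above, continuing from the hypothetical `xgb` estimator trained in the XGBoost sketch earlier on this page; instance types and S3 paths are illustrative only.

```python
from sagemaker.serverless import ServerlessInferenceConfig

# Real-time endpoint backed by a specific instance type.
predictor = xgb.deploy(initial_instance_count=1, instance_type="ml.m5.large")

# Serverless endpoint: SageMaker manages the infrastructure, pay per use.
serverless_predictor = xgb.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048, max_concurrency=5
    )
)

# Batch transform: offline scoring of a dataset in S3, no persistent endpoint.
transformer = xgb.transformer(instance_count=1, instance_type="ml.m5.large")
transformer.transform("s3://my-bucket/batch-input/", content_type="text/csv")
transformer.wait()
```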

Real-Time Inference Variants

  • SageMaker supports testing multiple models or model versions behind the same endpoint using variants.
  • A variant consists of an ML instance and the serving components specified in a SageMaker model.
  • Each variant can have a different instance type or a SageMaker model that can be autoscaled independently of the others.
  • Models within the variants can be trained using different datasets, algorithms, ML frameworks, or any combination of all of these.
  • All the variants behind an endpoint share the same inference code.
  • SageMaker supports two types of variants, production variants and shadow variants.
    • Production Variants
      • supports A/B or Canary testing where you can allocate a portion of the inference requests to each variant.
      • helps compare production variants’ performance relative to each other.
    • Shadow Variants
      • replicates a portion of the inference requests that go to the production variant to the shadow variant.
      • logs the responses of the shadow variant for comparison and not returned to the caller.
      • helps test the performance of the shadow variant without exposing the caller to the response produced by the shadow variant.
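
A hedged boto3 sketch of creating an endpoint config with two production variants for A/B testing, as described above; the model names, endpoint names, and traffic weights are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="churn-ab-test",
    ProductionVariants=[
        {
            "VariantName": "model-a",
            "ModelName": "churn-model-v1",       # existing SageMaker model (placeholder)
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.9,         # ~90% of inference traffic
        },
        {
            "VariantName": "model-b",
            "ModelName": "churn-model-v2",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,         # ~10% of inference traffic
        },
    ],
)
sm.create_endpoint(EndpointName="churn-endpoint", EndpointConfigName="churn-ab-test")
```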

SageMaker Training Optimization

  • SageMaker Managed Spot Training
    • uses EC2 Spot instances to optimize the cost of training models over on-demand instances.
    • Spot interruptions are managed by SageMaker on your behalf.
  • SageMaker Checkpoints
    • help save the state of ML models during training.
    • are snapshots of the model and can be configured by the callback functions of ML frameworks.
    • saved checkpoints can be used to restart a training job from the last saved checkpoint.
  • SageMaker Managed Spot Training combined with Checkpoints helps save on training costs.
  • SageMaker Inference Recommender
    • helps select the best instance type and configuration (such as instance count, container parameters, and model optimizations) or serverless configuration (such as max concurrency and memory size) for the ML models and workloads.
    • help deploy the model to a real-time or serverless inference endpoint that delivers the best performance at the lowest cost.
    • reduces the time required to get ML models in production by automating load testing and model tuning across SageMaker ML instances.
    • provides two types of recommendations
      • Default recommendations run a set of load tests on the recommended instance types.
      • Advanced recommendations are based on a custom load test where you can select the desired ML instances or a serverless endpoint, provide a custom traffic pattern, and provide requirements for latency and throughput based on your production requirements.
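
A hedged SageMaker Python SDK sketch of managed spot training with checkpointing as described above; the role, bucket paths, and timing values are placeholders.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image_uri = sagemaker.image_uris.retrieve(
    "xgboost", region=session.boto_region_name, version="1.5-1"
)

spot_estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",   # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/spot/output",                        # placeholder
    use_spot_instances=True,            # train on EC2 Spot capacity
    max_run=3600,                       # max training time in seconds
    max_wait=7200,                      # max total time including waiting for Spot capacity
    checkpoint_s3_uri="s3://my-bucket/spot/checkpoints/",  # resume here after interruptions
)
```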

SageMaker Security

  • SageMaker ensures that ML model artifacts and other system artifacts are encrypted in transit and at rest.
  • SageMaker allows using encrypted S3 buckets for model artifacts and data, as well as pass a KMS key to SageMaker notebooks, training jobs, and endpoints, to encrypt the attached ML storage volume.
  • Requests to the SageMaker API and console are made over a secure (SSL) connection.
  • SageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.
  • SageMaker API and SageMaker Runtime support VPC interface endpoints powered by AWS PrivateLink that helps connect VPC directly to the SageMaker API or SageMaker Runtime using AWS PrivateLink without using an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection.
  • SageMaker Network Isolation makes sure
    • containers can’t make any outbound network calls, even to other AWS services such as S3.
    • no AWS credentials are made available to the container runtime environment.
    • network inbound and outbound traffic is limited to the peers of each training container, in the case of a training job with multiple instances.
    • S3 download and upload operation is performed using the SageMaker execution role in isolation from the training or inference container.

SageMaker Notebooks

  • SageMaker notebooks are collaborative notebooks that are built into SageMaker Studio running the Jupyter Notebook App.
  • can be accessed without setting up compute instances and file storage.
  • charged only for the resources consumed when notebooks are running
  • instance types can be easily switched if more or less computing power is needed, during the experimentation phase.
  • come with multiple environments already installed, containing Jupyter kernels and Python packages including scikit-learn, Pandas, NumPy, MXNet, and TensorFlow.
  • use a lifecycle configuration that includes scripts that run during notebook instance creation and on each restart to install custom environments and kernels on the notebook instance’s EBS volume.
  • restart the notebook instance to automatically apply the latest patches.

SageMaker Built-in Algorithms

Please refer to SageMaker Built-in Algorithms for details

SageMaker Feature Store

  • SageMaker Feature Store helps to create, share, and manage features for ML development.
  • is a centralized store for features and associated metadata so features can be easily discovered and reused.
  • helps by reducing repetitive data processing and curation work required to convert raw data into features for training an ML algorithm.
  • consists of FeatureGroup which is a group of features defined to describe a Record.
  • Data processing logic is developed only once and the features generated can be used for both training and inference, reducing the training-serving skew.
  • supports online and offline store
    • Online store
      • is used for low-latency real-time inference use cases
      • retains only the latest feature data.
    • Offline store
      • is used for training and batch inference.
      • is an append-only store and can be used to store and access historical feature data.
      • data is stored in Parquet format for optimized storage and query access.

The diagram shows where Feature Store fits in your machine learning pipeline. In this diagram the pipeline consists of processing your raw data into features and then storing them into Feature Store, where you can then serve your features. After you read in your raw data, you can use Feature Store to transform your raw data into meaningful features, a process called feature processing; ingest streaming or batch data into online and offline feature stores; and serve your features for your real-time and batch applications and model training.
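
A hedged SageMaker Python SDK sketch of creating and populating a feature group; the feature names, bucket, and role are placeholders, and in practice you would wait for the feature group to reach the Created status before ingesting.

```python
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

# Hypothetical features keyed by customer_id; an event-time feature is required.
df = pd.DataFrame({
    "customer_id": pd.Series(["c-1", "c-2"], dtype="string"),
    "total_orders": [12, 3],
    "event_time": [time.time(), time.time()],
})

fg = FeatureGroup(name="customer-features", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)        # infer feature types from the DataFrame
fg.create(
    s3_uri="s3://my-bucket/feature-store/",       # offline store location (placeholder)
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    enable_online_store=True,                     # also keep the low-latency online store
)
# In practice, wait until the feature group status is "Created" before ingesting.
fg.ingest(data_frame=df, max_workers=1, wait=True)
```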

Inferentia

  • AWS Inferentia is designed to accelerate deep learning workloads by providing high-performance inference in the cloud.
  • helps deliver higher throughput and up to 70% lower cost per inference than comparable current generation GPU-based Amazon EC2 instances.

Elastic Inference (EI)

  • helps increase the throughput and decrease the latency of getting real-time inferences from the deep learning models deployed as SageMaker-hosted models
  • add inference acceleration to a hosted endpoint for a fraction of the cost of using a full GPU instance.
  • NOTE: Elastic Inference has been deprecated and replaced by Inferentia.

SageMaker Ground Truth

  • SageMaker Ground Truth provides automated data labeling using machine learning
  • helps build highly accurate training datasets for machine learning quickly.
  • offers easy access to labelers through Mechanical Turk and provides them with built-in workflows and interfaces for common labeling tasks.
  • allows using your own labelers as a private workforce or using vendors recommended by Amazon through AWS Marketplace.
  • helps lower the labeling costs by up to 70% using automatic labeling, which works by training Ground Truth from data labeled by humans so that the service learns to label data independently.
  • significantly reduces the time and effort required to create datasets for training to reduce costs
  • automated data labeling uses active learning to automate the labeling of your input data for certain built-in task types.
  • provides annotation consolidation to help improve the accuracy of the data object’s labels. It combines the results of multiple workers’ annotation tasks into one high-fidelity label.

  • first selects a random sample of data and sends it to Amazon Mechanical Turk to be labeled.
  • results are then used to train a labeling model that attempts to label a new sample of raw data automatically.
  • labels are committed when the model can label the data with a confidence score that meets or exceeds a threshold you set.
  • for a confidence score falling below the defined threshold, the data is sent to human labelers.
  • Some of the data labeled by humans is used to generate a new training dataset for the labeling model, and the model is automatically retrained to improve its accuracy.
  • process repeats with each sample of raw data to be labeled.
  • labeling model becomes more capable of automatically labeling raw data with each iteration, and less data is routed to humans.

SageMaker Autopilot

  • SageMaker Autopilot is an automated machine learning (AutoML) feature set that automates the end-to-end process of building, training, tuning, and deploying machine learning models.
  • analyzes the data, selects algorithms suitable for the problem type, preprocesses the data to prepare it for training, handles automatic model training, and performs hyperparameter optimization to find the best-performing model for the dataset.
  • helps users understand how models make predictions by automatically generating reports that show the importance of each individual feature, providing transparency and insights into the factors influencing the predictions, which can be used by risk and compliance teams and external regulators.

Overview of Amazon SageMaker Autopilot AutoML process.

SageMaker JumpStart

  • SageMaker JumpStart provides pretrained, open-source models for various problem types to help get started with machine learning.
  • can incrementally train and tune these models before deployment.
  • provides solution templates that set up infrastructure for common use cases, and executable example notebooks for machine learning with SageMaker.

SageMaker Automatic Model Tuning

  • Hyperparameters are parameters exposed by ML algorithms that control how the underlying algorithm operates and their values affect the quality of the trained models.
  • Automatic model tuning helps find a set of hyperparameters for an algorithm that can yield an optimal model.
  • SageMaker automatic model tuning can use managed spot training
  • Best Practices for Hyperparameter tuning
    • Choosing the Number of Hyperparameters – limit the search to a smaller number as the difficulty of a hyperparameter tuning job depends primarily on the number of hyperparameters that SageMaker has to search.
    • Choosing Hyperparameter Ranges – DO NOT specify a very large range to cover every possible value for a hyperparameter. The range of values you choose to search can significantly affect the success of hyperparameter optimization.
    • Using Logarithmic Scales for Hyperparameters – use a logarithmic scale for hyperparameters whose ranges span several orders of magnitude (e.g. the learning rate) to improve hyperparameter optimization.
    • Choosing the Best Number of Concurrent Training Jobs – running one training job at a time achieves the best results with the least amount of compute time.
    • Running Training Jobs on Multiple Instances – Design distributed training jobs so that you can target the objective metric that you want.
  • Warm start can be used to start a hyperparameter tuning job using one or more previous tuning jobs as a starting point.
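
A minimal sketch with the SageMaker Python SDK that ties the practices above together: a small number of hyperparameters, a narrow log-scaled range for the learning rate, few concurrent jobs, and managed spot training on the underlying estimator. The S3 paths, role ARN, and Region are hypothetical.

```python
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

estimator = Estimator(
    image_uri=image_uris.retrieve("xgboost", "us-east-1", version="1.7-1"),
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,    # managed spot training
    max_run=3600,
    max_wait=7200,              # must be >= max_run when using spot
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges={
        # keep the number of tuned hyperparameters and their ranges small
        "eta": ContinuousParameter(0.001, 0.3, scaling_type="Logarithmic"),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,
    max_parallel_jobs=2,        # fewer concurrent jobs generally gives better results
)

tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})
```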

SageMaker Experiments

  • SageMaker Experiments is a capability of SageMaker that lets you create, manage, analyze, and compare your machine-learning experiments.
  • helps organize, view, analyze, and compare iterative ML experimentation to gain comparative insights and track the best-performing models.
  • automatically tracks the inputs, parameters, configurations, and results of the iterations as runs.

SageMaker Debugger

  • SageMaker Debugger provides tools to debug training jobs and resolve problems such as overfitting, saturated activation functions, and vanishing gradients to improve the performance of the model.
  • offers tools to send alerts when training anomalies are found, take actions against the problems, and identify the root cause of them by visualizing collected metrics and tensors.
  • supports the Apache MXNet, PyTorch, TensorFlow, and XGBoost frameworks.

SageMaker Data Wrangler

  • SageMaker Data Wrangler reduces the time it takes to aggregate and prepare tabular and image data for ML from weeks to minutes.
  • simplifies the process of data preparation and feature engineering, and completes each step of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) from a single visual interface.
  • supports SQL to select the data from various data sources and import it quickly.
  • provides data quality and insights reports to automatically verify data quality and detect anomalies, such as duplicate rows and target leakage.
  • contains over 300 built-in data transformations, so you can quickly transform data without writing any code.

How Amazon SageMaker Data Wrangler works

SageMaker Clarify

  • SageMaker Clarify helps improve the ML models by detecting potential bias and helping to explain the predictions that the models make.
  • provides purpose-built tools to gain greater insights into the ML models and data, based on metrics such as accuracy, robustness, toxicity, and bias to improve model quality and support responsible AI initiatives.
  • can help to
    • detect bias in and help explain the model predictions.
    • identify types of bias in pre-training data.
    • identify types of bias in post-training data that can emerge during training or when the model is in production.
  • integrates with SageMaker Data Wrangler, Model Monitor, and Autopilot.

SageMaker Model Monitor

  • SageMaker Model Monitor monitors the quality of SageMaker machine learning models in production.
  • Continuous monitoring can be set up with a real-time endpoint (or a batch transform job that runs regularly), or on-schedule monitoring for asynchronous batch transform jobs.
  • helps set alerts that notify when there are deviations in the model quality.
  • provides early and proactive detection of deviations enabling you to take corrective actions, such as retraining models, auditing upstream systems, or fixing quality issues without having to monitor models manually or build additional tooling.
  • provides prebuilt monitoring capabilities that do not require coding with the flexibility to monitor models by coding to provide custom analysis.
  • provides the following types of monitoring:
    • Monitor data quality – Monitor drift in data quality.
    • Monitor model quality – Monitor drift in model quality metrics, such as accuracy.
    • Monitor Bias Drift for Models in Production – Monitor bias in the model’s predictions.
    • Monitor Feature Attribution Drift for Models in Production – Monitor drift in feature attribution.
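
A minimal data-quality monitoring sketch with the SageMaker Python SDK, assuming an endpoint (hypothetical name my-endpoint) that already has data capture enabled, plus a baseline training CSV and output locations in S3.

```python
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Baseline the training data; captured endpoint traffic is later compared against
# these statistics and constraints to detect data-quality drift.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/baseline/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitor/baseline",
)

# Hourly monitoring schedule against the live endpoint.
monitor.create_monitoring_schedule(
    endpoint_input="my-endpoint",
    output_s3_uri="s3://my-bucket/monitor/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```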

SageMaker Neo

  • SageMaker Neo enables machine learning models to train once and run anywhere in the cloud and at the edge.
  • automatically optimizes machine learning models for inference on cloud instances and edge devices to run faster with no loss in accuracy.
  • Optimized models run up to two times faster and consume less than a tenth of the resources of typical machine learning models.
  • can be used with IoT Greengrass to help perform machine learning inference locally on devices.

SageMaker Model Governance

  • SageMaker Model Governance is a framework that gives systematic visibility into machine learning (ML) model development, validation, and usage.
  • SageMaker provides purpose-built ML governance tools for managing access control, activity tracking, and reporting across the ML lifecycle.

SageMaker Pricing

  • Users pay for ML compute, storage, and data processing resources they use for hosting the notebook, training the model, performing predictions & logging the outputs.

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated
  • Open to further feedback, discussion and correction.
  1. A company has built a deep learning model and now wants to deploy it using the SageMaker Hosting Services. For inference, they want a cost-effective option that guarantees low latency but still comes at a fraction of the cost of using a GPU instance for your endpoint. As a machine learning Specialist, what feature should be used?
    1. Inference Pipeline
    2. Elastic Inference
    3. SageMaker Ground Truth
    4. SageMaker Neo
  2. A trading company is experimenting with different datasets, algorithms, and hyperparameters to find the best combination for the machine learning problem. The company doesn’t want to limit the number of experiments the team can perform but wants to track the several hundred to over a thousand experiments throughout the modeling effort. Which Amazon SageMaker feature should they use to help manage your team’s experiments at scale?
    1. SageMaker Inference Pipeline
    2. SageMaker model tracking
    3. SageMaker Neo
    4. SageMaker model containers
  3. A Machine Learning Specialist needs to monitor Amazon SageMaker in a production environment by analyzing the record of actions taken by a user, role, or AWS service. Which service should the Specialist use to meet these needs?

    1. AWS CloudTrail
    2. Amazon CloudWatch
    3. AWS Systems Manager
    4. AWS Config

References

Amazon_SageMaker

Machine Learning Concepts – Cheat Sheet

Machine Learning Concepts

This post covers some of the basic Machine Learning concepts mostly relevant for the AWS Machine Learning certification exam.

Machine Learning Lifecycle

Data Processing and Exploratory Analysis

  • To train a model, you need data.
  • The type of data needed depends on the business problem you want the model to solve (the inferences you want the model to generate).
  • Data processing includes data collection, cleaning, splitting, exploration, preprocessing, transformation, formatting, etc.

Feature Selection and Engineering

  • helps improve model accuracy and speed up training
  • remove irrelevant data inputs using domain knowledge, e.g. name
  • remove features that have the same value everywhere, very low correlation, very little variance, or a lot of missing values
  • handle missing data using mean values or other imputation techniques
  • combine related features, e.g. height and age into height/age
  • convert or transform features to a more useful representation, e.g. a date into day or hour
  • standardize data ranges across features

Missing Data

  • do nothing
  • remove the feature if it has a lot of missing data points
  • remove samples with missing data, if the feature needs to be used
  • Impute using mean/median value
    • simple and fast, with little impact when the dataset is not skewed
    • works with numerical values only. Do not use for categorical features.
    • doesn’t factor correlations between features
  • Impute using (Most Frequent) or (Zero/Constant) Values
    • works with categorical features
    • doesn’t factor correlations between features
    • can introduce bias
  • Impute using k-NN, Multivariate Imputation by Chained Equation (MICE), Deep Learning
    • more accurate than the mean, median or most frequent
    • Computationally expensive
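
A minimal scikit-learn sketch of the imputation options above; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({
    "age":  [25, np.nan, 40, 31],
    "city": ["NY", "SF", None, "NY"],
})

# Mean imputation: numerical features only, ignores correlations between features
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()

# Most-frequent imputation: works for categorical features, but can introduce bias
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()

# k-NN imputation: usually more accurate, but computationally more expensive
X = pd.DataFrame({"height": [170, 180, np.nan, 165], "weight": [65, 80, 75, np.nan]})
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(df, X_imputed, sep="\n")
```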

Unbalanced Data

  • Source more real data
  • Oversampling instances of the minority class or undersampling instances of the majority class
  • Create or synthesize data using techniques like SMOTE (Synthetic Minority Oversampling TEchnique)
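
A small sketch of SMOTE using the third-party imbalanced-learn package on a synthetic, imbalanced dataset.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE          # third-party imbalanced-learn package
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))                                  # heavily skewed towards class 0

# SMOTE synthesizes new minority-class samples instead of just duplicating them
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))                              # classes are now balanced
```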

Label Encoding and One-hot Encoding

  • Models cannot multiply strings by the learned weights, encoding helps convert strings to numeric values.
  • Label encoding
    • Use label encoding to map string values to numerical values using a lookup table
    • However, the assigned numbers are arbitrary and imply an ordering that can mislead the model
  • One-hot encoding
    • Use One-hot encoding for Categorical features that have a discrete set of possible values.
    • One-hot encoding provides a binary representation by converting each data value into its own feature, without implying any ordering or relationship between values
    • a binary vector is created for each categorical feature in the model that represents values as follows:
      • For values that apply to the example, set corresponding vector elements to 1.
      • Set all other elements to 0.
    • Multi-hot encoding is when multiple values are 1
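
A quick sketch contrasting label encoding and one-hot encoding with pandas and scikit-learn; the color column is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# Label encoding: maps each category to an arbitrary integer (implies a false ordering)
print(LabelEncoder().fit_transform(df["color"]))   # e.g. [2 0 1 0]

# One-hot encoding: one binary column per category, no implied ordering
print(pd.get_dummies(df, columns=["color"]))

# scikit-learn equivalent, useful inside pipelines
ohe = OneHotEncoder(handle_unknown="ignore")
print(ohe.fit_transform(df[["color"]]).toarray())
```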

Cleaning Data

  • Scaling or Normalization means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1)
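
A quick scikit-learn sketch of min-max scaling into [0, 1] and standardization to zero mean / unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[100.0], [500.0], [900.0]])

# Min-max scaling: maps the natural range (100 to 900 here) into [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())     # [0.  0.5 1. ]

# Standardization: zero mean and unit variance (roughly -1 to +1 for typical data)
print(StandardScaler().fit_transform(X).ravel())
```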

Train a model

  • Model training includes both training and evaluating the model.
  • To train a model, an algorithm is needed.
  • Data can be split into training data, validation data and test data
    • Algorithm sees and is directly influenced by the training data
    • Algorithm uses but is indirectly influenced by the validation data
    • Algorithm does not see the testing data during training
  • Training involves both normal parameters (learned from the features/data) and hyperparameters (set to control how training occurs)

Supervised, Unsupervised, and Reinforcement Learning

Splitting and Randomization

  • Always randomize the data before splitting
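
A minimal scikit-learn sketch; shuffle=True (the default) randomizes the rows before the 80/20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.arange(50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)
print(len(X_train), len(X_test))                   # 40 10
```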

Hyperparameters

  • influence how the training occurs
  • Common hyperparameters are learning rate, epoch, batch size
  • Learning rate
    • size of the step taken during gradient descent optimization
    • Large learning rates can overshoot the correct solution
    • Small learning rates increase training time
  • Batch size
    • number of samples used to train at any one time
    • can be all (batch), one (stochastic), or some (mini-batch)
    • is often determined by the available infrastructure (e.g. memory)
    • Small batch sizes tend to not get stuck in local minima
    • Large batch sizes can converge on the wrong solution at random.
  • Epochs
    • number of times the algorithm processes the entire training data
    • each epoch or run can see the model get closer to the desired state
  • the available hyperparameters and their effect depend on the algorithm used

Evaluating the model

After training the model, evaluate it to determine whether the accuracy of the inferences is acceptable.

ML Model Insights

  • For binary classification models, use an accuracy metric called Area Under the (Receiver Operating Characteristic) Curve (AUC). AUC measures the ability of the model to predict a higher score for positive examples as compared to negative examples.
  • For regression tasks, use the industry standard root mean square error (RMSE) metric. It is a distance measure between the predicted numeric target and the actual numeric answer (ground truth). The smaller the value of the RMSE, the better is the predictive accuracy of the model.

Cross-Validation

  • is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data.
  • Use cross-validation to detect overfitting, i.e. failing to generalize a pattern.
  • there is no separate validation dataset; instead, the training data is split into chunks, each of which is used in turn as validation data
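
A minimal 5-fold cross-validation sketch with scikit-learn on a bundled dataset; each chunk of the training data takes a turn as the validation set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=5)   # 5 folds -> 5 validation scores
print(scores.mean(), scores.std())            # a large gap vs. training accuracy hints at overfitting
```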

Optimization

  • Gradient Descent is used to optimize many different types of machine learning algorithms
  • Step size is set by the learning rate
    • If the learning rate is too large, the minimum slope might be missed and the graph would oscillate
    • If the learning rate is too small, it requires too many steps which would take the process longer and is less efficient
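
A toy illustration of the learning-rate trade-off, minimizing f(w) = (w - 3)^2 with plain gradient descent (the gradient is 2 * (w - 3)).

```python
def gradient_descent(learning_rate, steps=50, w=0.0):
    for _ in range(steps):
        w -= learning_rate * 2 * (w - 3)   # step in the direction of the negative gradient
    return w

print(gradient_descent(0.1))     # converges close to the minimum at w = 3
print(gradient_descent(1.5))     # too large: overshoots, oscillates, and diverges
print(gradient_descent(0.001))   # too small: barely moves in 50 steps
```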

Underfitting

  • Model is underfitting the training data when the model performs poorly on the training data because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y).
  • To increase model flexibility
    • Add new domain-specific features and more feature Cartesian products, and change the types of feature processing used (e.g., increasing n-grams size)
    • Regularization – Decrease the amount of regularization used
    • Increase the amount of training data examples.
    • Increase the number of passes on the existing training data.

Overfitting

  • Model is overfitting the training data when the model performs well on the training data but does not perform well on the evaluation data because the model is memorizing the data it has seen and is unable to generalize to unseen examples.
  • To decrease model flexibility and reduce overfitting
    • Feature selection: consider using fewer feature combinations, decrease n-grams size, and decrease the number of numeric attribute bins.
    • Simplify the model, by reducing the number of layers.
    • Regularization – technique to reduce the complexity of the model. Increase the amount of regularization used.
    • Early Stopping – a form of regularization while training a model with an iterative method, such as gradient descent.
    • Data Augmentation – process of artificially generating new data from existing data, primarily to train new ML models.
    • Dropout – a regularization technique that randomly drops units during training to prevent overfitting.

Classification Model Evaluation

Confusion Matrix

  • Confusion matrix represents the number or percentage of times each label was predicted versus the actual label during evaluation
  • An NxN table that summarizes how successful a classification model’s predictions were; that is, the correlation between the label and the model’s classification.
  • One axis of a confusion matrix is the label that the model predicted, and the other axis is the actual label.
  • N represents the number of classes. In a binary classification problem, N=2
    • For example, here is a sample confusion matrix for a binary classification problem:
                      Tumor (predicted)     Non-Tumor (predicted)
Tumor (actual)        18 (True Positives)   1 (False Negative)
Non-Tumor (actual)    6 (False Positives)   452 (True Negatives)
    • Confusion matrix shows that of the 19 samples that actually had tumors, the model correctly classified 18 as having tumors (18 true positives), and incorrectly classified 1 as not having a tumor (1 false negative).
    • Similarly, of 458 samples that actually did not have tumors, 452 were correctly classified (452 true negatives) and 6 were incorrectly classified (6 false positives).
  • Confusion matrix for a multi-class classification problem can help you determine mistake patterns. For example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or 1 instead of 7.
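
The same tumor example reproduced with scikit-learn; the labels argument fixes which class is treated as negative/positive so the counts line up with the table above.

```python
from sklearn.metrics import confusion_matrix

y_true = ["tumor"] * 19 + ["non-tumor"] * 458
y_pred = ["tumor"] * 18 + ["non-tumor"] + ["tumor"] * 6 + ["non-tumor"] * 452

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["non-tumor", "tumor"]).ravel()
print(tp, fn, fp, tn)   # 18 1 6 452 - matches the table above
```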

Accuracy, Precision, Recall (Sensitivity) and Specificity

Accuracy

  • A metric for classification models that identifies the fraction of predictions that a classification model got right.
  • In Binary classification, calculated as (True Positives+True Negatives)/Total Number Of Examples
  • In Multi-class classification, calculated as Correct Predictions/Total Number Of Examples

Precision

  • A metric for classification models that identifies the frequency with which a model was correct when predicting the positive class.
  • Calculated as True Positives/(True Positives + False Positives)

Recall – Sensitivity – True Positive Rate (TPR)

  • A metric for classification models that answers the following question: Out of all the possible positive labels, how many did the model correctly identify, i.e. the number of correct positives out of actual positive results.
  • Calculated as True Positives/(True Positives + False Negatives)
  • Important when – False Positives are acceptable as long as ALL positives are found, e.g. it is fine to predict a Non-Tumor as a Tumor as long as all the Tumors are correctly predicted

Specificity – True Negative Rate (TNR)

  • Number of correct negatives out of actual negative results
  • Calculated as True Negatives/(True Negatives + False Positives)
  • Important when – False Positives are unacceptable and it’s better to have false negatives, e.g. it is not fine to predict a Non-Tumor as a Tumor
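
Using the tumor confusion matrix above (TP=18, FN=1, FP=6, TN=452), the metrics work out as follows.

```python
tp, fn, fp, tn = 18, 1, 6, 452

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # (18 + 452) / 477 ≈ 0.985
precision   = tp / (tp + fp)                    # 18 / 24   = 0.75
recall      = tp / (tp + fn)                    # 18 / 19   ≈ 0.947 (sensitivity / TPR)
specificity = tn / (tn + fp)                    # 452 / 458 ≈ 0.987 (TNR)
print(accuracy, precision, recall, specificity)
```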

ROC and AUC

ROC (Receiver Operating Characteristic) Curve

  • An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds.
  • It plots the True Positive Rate (TPR) vs. the False Positive Rate (FPR) at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.

    ROC Curve showing TP Rate vs. FP Rate at different classification thresholds.

AUC (Area under the ROC curve)

  • AUC stands for “Area under the ROC Curve.”
  • AUC measures the entire two-dimensional area underneath the ROC curve (think integral calculus) from (0,0) to (1,1).
  • AUC provides an aggregate measure of performance across all possible classification thresholds.
  • One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.

AUC (Area under the ROC Curve).
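
A minimal scikit-learn sketch that computes the ROC curve points and the AUC for a simple classifier on a bundled dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]         # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)   # TPR vs. FPR at every threshold
print(roc_auc_score(y_test, scores))               # area under that curve, close to 1.0 here
```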

F1 Score

  • F1 score (also F-score or F-measure) is a measure of a test’s accuracy.
  • It is the harmonic mean of the precision p and the recall r of the test: F1 = 2 · p · r / (p + r), where p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).
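
A small check that the F1 score is the harmonic mean of precision and recall.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

p = precision_score(y_true, y_pred)                     # 3/4
r = recall_score(y_true, y_pred)                        # 3/4
print(f1_score(y_true, y_pred), 2 * p * r / (p + r))    # both print 0.75
```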

Deploy the model

  • Re-engineer the model before integrating it with the application and deploying it.
  • Can be deployed as a Batch or as a Service

Amazon RDS Blue/Green Deployments

RDS Blue/Green Deployment


  • Amazon RDS Blue/Green Deployments help make and test database changes before implementing them in a production environment.
  • RDS Blue/Green Deployment has the blue environment as the current production environment and the green environment as the staging environment.
  • RDS Blue/Green Deployment creates a staging or green environment that exactly copies the production environment.
  • Green environment is a copy of the topology of the production environment and includes the features used by the DB instance including the Multi-AZ deployment, read replicas, the storage configuration, DB snapshots, automated backups, Performance Insights, and Enhanced Monitoring.
  • Green environment or the staging environment always stays in sync with the current production environment using logical replication.
  • RDS DB instances in the green environment can be changed without affecting production workloads. Changes can include the upgrade of major or minor DB engine versions, upgrade of underlying file system configuration, or change of database parameters in the staging environment.
  • Changes can be thoroughly tested in the green environment and when ready, the environments can be switched over to promote the green environment to be the new production environment.
  • Switchover typically takes under a minute with no data loss and no need for application changes.
  • Blue/Green Deployments are currently supported only for RDS for MariaDB, MySQL, and PostgreSQL.

RDS Blue/Green Deployment

RDS Blue/Green Deployments Benefits

  • Easily create a production-ready staging environment.
  • Automatically replicate database changes from the production environment to the staging environment.
  • Test database changes in a safe staging environment without affecting the production environment.
  • Stay current with database patches and system updates.
  • Implement and test newer database features.
  • Switch over your staging environment to be the new production environment without changes to your application.
  • Safely switch over through the use of built-in switchover guardrails.
  • Eliminate data loss during switchover.
  • Switch over quickly, typically under a minute depending on your workload.
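
A hedged boto3 sketch of the workflow; the source ARN, deployment name, and target engine version are hypothetical, and waiting for the green environment to become available before switchover is omitted.

```python
import boto3

rds = boto3.client("rds")

# Create the green (staging) environment from the current production (blue) instance.
bg = rds.create_blue_green_deployment(
    BlueGreenDeploymentName="mydb-engine-upgrade",
    Source="arn:aws:rds:us-east-1:123456789012:db:mydb",
    TargetEngineVersion="8.0.36",          # e.g. test a MySQL engine upgrade in green
)

# After testing the green environment, promote it to be the new production.
rds.switchover_blue_green_deployment(
    BlueGreenDeploymentIdentifier=bg["BlueGreenDeployment"]["BlueGreenDeploymentIdentifier"],
    SwitchoverTimeout=300,                 # seconds
)
```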

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated
  • Open to further feedback, discussion and correction.

References

Amazon_Blue_Green_Deployments

AWS RDS Security

AWS RDS Security

  • AWS RDS Security provides multiple features
    • DB instance can be hosted in a VPC for the greatest possible network access control.
    • IAM policies can be used to assign permissions that determine who is allowed to manage RDS resources.
    • Security groups allow control of what IP addresses or EC2 instances can connect to the databases on a DB instance.
    • RDS supports encryption in transit using SSL connections
    • RDS supports encryption at rest to secure instances and snapshots at rest.
    • Network encryption and transparent data encryption (TDE) with Oracle DB instances
    • Authentication can be implemented using Password, Kerberos, and IAM database authentication.

RDS IAM and Access Control

  • IAM can be used to control which RDS operations each individual user has permission to call.

RDS Encryption at Rest

  • RDS encrypted instances use the industry-standard AES-256 encryption algorithm to encrypt data on the server that hosts the RDS instance.
  • RDS handles authentication of access and decryption of the data with a minimal impact on performance, and with no need to modify the database client applications
  • Data at Rest Encryption
    • can be enabled on RDS instances to encrypt the underlying storage
    • encryption keys are managed by KMS
    • can be enabled only during instance creation
    • once enabled, the encryption keys cannot be changed
    • if the key is lost, the DB can only be restored from the backup
  • Once encryption is enabled for an RDS instance,
    • logs are encrypted
    • snapshots are encrypted
    • automated backups are encrypted
    • read replicas are encrypted
  • KMS encryption keys are specific to the AWS Region they are created in, so the source Region’s key cannot be used directly for cross-Region replicas or snapshot copies.
  • Encrypted snapshots can be copied to another region by specifying a KMS key valid in the destination AWS Region. It can be a Region-specific KMS key, or a multi-Region key.
  • RDS DB Snapshot considerations
    • DB snapshot encrypted using a KMS encryption key can be copied
    • Copying an encrypted DB snapshot results in an encrypted copy of the DB snapshot
    • When copying, the DB snapshot can either be encrypted with the same KMS encryption key as the original DB snapshot, or a different KMS encryption key to encrypt the copy of the DB snapshot.
    • An unencrypted DB snapshot can be copied to an encrypted snapshot, to add encryption to a previously unencrypted DB instance.
    • Encrypted snapshot can be restored only to an encrypted DB instance
    • If a KMS encryption key is specified when restoring from an unencrypted DB cluster snapshot, the restored DB cluster is encrypted using the specified KMS encryption key
    • Copying an encrypted snapshot shared from another AWS account requires access to the KMS encryption key used to encrypt the DB snapshot.
    • Because KMS encryption keys are specific to the Region they are created in, copying an encrypted snapshot to another Region requires specifying a KMS key valid in the destination Region
  • Transparent Data Encryption (TDE)
    • Automatically encrypts the data before it is written to the underlying storage device and decrypts when it is read  from the storage device
    • is supported by Oracle and SQL Server
      • Oracle requires key storage outside of the KMS and integrates with CloudHSM for this
      • SQL Server requires a key but is managed by RDS

RDS Encryption in Transit – SSL

  • Encrypt connections using SSL for data in transit between the applications and the DB instance
  • RDS creates an SSL certificate and installs the certificate on the DB instance when RDS provisions the instance.
  • SSL certificates are signed by a certificate authority. SSL certificate includes the DB instance endpoint as the Common Name (CN) for the SSL certificate to guard against spoofing attacks
  • While SSL offers security benefits, be aware that SSL encryption is a compute-intensive operation and will increase the latency of the database connection.
  • For encrypted and unencrypted DB instances, data that is in transit between the source and the read replicas is encrypted, even when replicating across AWS Regions.

IAM Database Authentication

  • IAM database authentication works with MySQL and PostgreSQL.
  • IAM database authentication prevents the need to store static user credentials in the database because authentication is managed externally using IAM.
  • Authorization still happens within RDS (not IAM).
  • IAM database authentication does not require a password but needs an authentication token
  • An authentication token is a unique string of characters that RDS generates on request.
  • Authentication tokens are generated using AWS Signature Version 4.
  • Each Authentication token has a lifetime of 15 minutes
  • IAM database authentication provides the following benefits:
    • Network traffic to and from the database is encrypted using the Secure Sockets Layer (SSL).
    • helps centrally manage access to the database resources, instead of managing access individually on each DB instance.
    • enables using IAM Roles to access the database instead of a password, for greater security.
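
A short boto3 sketch of requesting an IAM authentication token; the endpoint and user name are hypothetical, and the token is then used in place of a password when opening an SSL connection to the database.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Generates a short-lived (15-minute) authentication token instead of a static password
token = rds.generate_db_auth_token(
    DBHostname="mydb.abc123xyz.us-east-1.rds.amazonaws.com",   # hypothetical endpoint
    Port=3306,
    DBUsername="app_user",
)
print(token[:60], "...")   # pass this token as the password over an SSL connection
```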

RDS Security Groups

  • Security groups control the access that traffic has in and out of a DB instance
  • VPC security groups act like a firewall controlling network access to your DB instance.
  • VPC security groups can be configured and associated with the DB instance to allow access from an IP address range, port, or EC2 security group
  • Database security groups default to a “deny all” access mode and customers must specifically authorize network ingress.

RDS Rotating Secrets

  • RDS supports AWS Secrets Manager to automatically rotate the secret
  • Secrets Manager rotates the secret using a Lambda function that Secrets Manager provides.
  • Secrets Manager provides the following benefits
    • Rotate secrets safely – rotate secrets automatically without disrupting the applications.
      • Secrets Manager offers built-in integrations for rotating credentials for  RDS databases for MySQL, PostgreSQL, and Aurora.
      • Secrets Manager can be extended to meet custom rotation requirements by creating a Lambda function to rotate other types of secrets
    • Manage secrets centrally – to store, view, and manage all the secrets.
    • Security – By default, Secrets Manager encrypts these secrets with encryption keys that you own and control. Using fine-grained IAM policies, access to secrets can be controlled
    • Monitor and audit easily – Secrets Manager integrates with AWS logging and monitoring services to help meet your security and compliance requirements.
    • Pay as you go – Pay for the secrets stored and for the use of these secrets; there are no long-term contracts or licensing fees.

Master User Account Privileges

  • When you create a new DB instance, the default master user gets certain privileges for that DB instance
  • Subsequently, other users with permissions can be created.

Event Notification

  • Event notifications can be configured for important events that occur on the DB instance
  • Notifications can be received for a variety of important events that can occur on the RDS instance, such as the instance being shut down, a backup being started, a failover occurring, the security group being changed, or storage space running low

RDS Encrypted DB Instances Limitations

  • RDS Encryption can be enabled only during the creation of an RDS DB instance, not after the DB instance is created.
  • DB instances that are encrypted can’t be modified to disable encryption.
  • Encrypted snapshot of an unencrypted DB instance cannot be created.
  • An unencrypted backup or snapshot can’t be restored to an encrypted DB instance.
  • An unencrypted DB instance or an unencrypted read replica of an encrypted DB instance can’t have an encrypted read replica.
  • DB snapshot of an encrypted DB instance must be encrypted using the same KMS key as the DB instance.
  • Encrypted read replicas must be encrypted with the same CMK as the source DB instance when both are in the same AWS Region.
  • For encrypting an unencrypted RDS database, the following approaches can be used.
    • Using snapshots, which is feasible only if you can afford downtime (a boto3 sketch of this approach follows this list).
      • Create a DB snapshot of the DB instance, which would be unencrypted.
      • Copy the unencrypted DB snapshot to an encrypted snapshot.
      • Restore a DB instance from the encrypted snapshot, which would be an encrypted DB instance.
    • For minimal to no downtime, use AWS Database Migration Service (AWS DMS) to migrate and continuously replicate the data so that the cutover to the new, encrypted database can be done with minimal interruption.
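
A boto3 sketch of the snapshot-based approach above; the identifiers and KMS key alias are hypothetical, and the waits for each snapshot/instance to become available are omitted.

```python
import boto3

rds = boto3.client("rds")

# 1. Snapshot the unencrypted instance
rds.create_db_snapshot(DBInstanceIdentifier="mydb", DBSnapshotIdentifier="mydb-unencrypted")

# 2. Copy the snapshot, specifying a KMS key to produce an encrypted copy
rds.copy_db_snapshot(
    SourceDBSnapshotIdentifier="mydb-unencrypted",
    TargetDBSnapshotIdentifier="mydb-encrypted",
    KmsKeyId="alias/my-rds-key",
)

# 3. Restore a new, encrypted DB instance from the encrypted snapshot
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="mydb-encrypted-instance",
    DBSnapshotIdentifier="mydb-encrypted",
)
```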

RDS API with Interface Endpoints (AWS PrivateLink)

  • AWS PrivateLink enables you to privately access RDS API operations without an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection.
  • DB instances in the VPC don’t need public IP addresses to communicate with RDS API endpoints to launch, modify, or terminate DB instances.
  • DB instances also don’t need public IP addresses to use any of the available RDS API operations.
  • Traffic between the VPC and RDS doesn’t leave the Amazon network.

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Can I encrypt connections between my application and my DB Instance using SSL?
    1. No
    2. Yes
    3. Only in VPC
    4. Only in certain regions
  2. Which of these configuration or deployment practices is a security risk for RDS?
    1. Storing SQL function code in plaintext
    2. Non-Multi-AZ RDS instance
    3. Having RDS and EC2 instances exist in the same subnet
    4. RDS in a public subnet (Making RDS accessible to the public internet in a public subnet poses a security risk, by making your database directly addressable and spammable. DB instances deployed within a VPC can be configured to be accessible from the Internet or from EC2 instances outside the VPC. If a VPC security group specifies a port access such as TCP port 22, you would not be able to access the DB instance because the firewall for the DB instance provides access only via the IP addresses specified by the DB security groups the instance is a member of and the port defined when the DB instance was created. Refer link)

References

AWS_RDS_User_Guide – Security

Amazon RDS Automated Backups vs Manual Snapshots

RDS Automated Backups vs Manual Snapshots


  • Amazon RDS Automated Backups are AWS Initiated. Backups are created automatically as per the defined backup window. Backups are also created when a read replica is created.
  • Amazon RDS DB snapshots are manual, user-initiated backups that enable a DB instance backup to be restored to that specific state at any time.

RDS Automated Backups vs Manual Snapshots

Instance Deletion & Backup Retention Period

  • Amazon RDS Backups can be configured with a retention period varying from 1 day to a maximum of 35 days.
  • RDS Automated Backups are deleted when the DB instance is deleted. However, RDS can now be configured to retain the automated backups on RDS instance deletion. These backups would be retained only till their retention window.
  • RDS manual snapshots don’t expire; RDS keeps all manual DB snapshots until they are explicitly deleted, and they aren’t subject to the backup retention period.

Backup Mode

  • RDS Backups are incremental. The first snapshot of a DB instance contains the data for the full database. Subsequent backups of the same database are incremental, meaning only the data that has changed after your most recent backup is saved.
  • RDS Snapshots are always full.

Point In Time Recovery – PITR

  • RDS Automated Backups with transaction logs help support Point In Time Recovery – PITR. You can restore your DB by rewinding it to a specific time you choose.
  • RDS Snapshots restores to saved snapshot data only. It cannot be used for PITR.

Sharing

  • RDS Automated Backups cannot be shared. You can copy the automated backup to a manual snapshot to share.
  • RDS Manual Snapshots can be shared with the public and with other AWS Accounts.

Use case

  • RDS Backups are Good for disaster recovery
  • RDS Snapshots can be used for checkpoint before making large changes, non-production/test environments, and final copy before deleting a database.

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated
  • Open to further feedback, discussion and correction.
  1. You receive a frantic call from a new DBA who accidentally dropped a table containing all your customers. Which Amazon RDS feature will allow you to reliably restore your database within 5 minutes of when the mistake was made?
    1. Multi-AZ RDS
    2. RDS snapshots
    3. RDS read replicas
    4. RDS automated backup

References

Amazon_RDS_Backup_Restore

Amazon RDS Cross-Region Read Replicas

Cross-Region Read Replicas

RDS Cross-Region Read Replicas

  • RDS Cross-Region Read Replicas create an asynchronously replicated read-only DB instance in a secondary AWS Region.
  • Supported for MySQL, PostgreSQL, MariaDB, Oracle, and SQL Server
  • Cross-Region Read Replicas help to improve
    • disaster recovery capabilities (reduces RTO and RPO),
    • scale read operations into a region closer to end users,
    • migration from a data center in one region to another region

Cross-Region Read Replicas

RDS Cross-Region Read Replicas Process

  • RDS configures the source DB instance as a replication source and sets up the specified read replica in the destination AWS Region.
  • RDS creates an automated DB snapshot of the source DB instance in the source AWS Region.
  • RDS begins a cross-Region snapshot copy for the initial data transfer.
  • RDS then uses the copied DB snapshot for the initial data load on the read replica. When the load is complete the DB snapshot copy is deleted.
  • RDS then starts replicating the changes made to the source DB instance since the start of the create read replica operation.

RDS Cross-Region Read Replicas Considerations

  • A source DB instance can have cross-region read replicas in multiple AWS Regions.
  • Replica lags are higher for Cross-region replicas. This lag time comes from the longer network channels between regional data centers.
  • RDS can’t guarantee more than five cross-region read replica instances, due to the limit on the number of access control list (ACL) entries for a VPC
  • Read Replica uses the default DB parameter group and DB option group for the specified DB engine when configured from AWS console.
  • Read Replica uses the default security group.
  • Cross-Region RDS read replica can be created from a source RDS DB instance that is not a read replica of another RDS DB instance for Microsoft SQL Server, Oracle, and PostgreSQL DB instances. This limitation doesn’t apply to MariaDB and MySQL DB instances.
  • Deleting the source for a cross-region read replica will result in
    • read replica promotion for MariaDB, MySQL, and Oracle DB instances
    • no read replica promotion for PostgreSQL DB instances and the replication status of the read replica is set to terminated.
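
A hedged boto3 sketch of creating a cross-Region read replica: the call is made in the destination Region and the source DB instance is referenced by its ARN (all identifiers below are hypothetical).

```python
import boto3

# Run the call in the destination Region (eu-west-1 here)
rds = boto3.client("rds", region_name="eu-west-1")

rds.create_db_instance_read_replica(
    DBInstanceIdentifier="mydb-replica-eu",
    SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:123456789012:db:mydb",
    SourceRegion="us-east-1",   # lets boto3 presign the cross-Region request
)
```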

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Your company has HQ in Tokyo and branch offices worldwide and is using logistics software with a multi-regional deployment on AWS in Japan, Europe, and US. The logistic software has a 3-tier architecture and uses MySQL 5.6 for data persistence. Each region has deployed its database. In the HQ region, you run an hourly batch process reading data from every region to compute cross-regional reports that are sent by email to all offices this batch process must be completed as fast as possible to optimize logistics quickly. How do you build the database architecture to meet the requirements?
    1. For each regional deployment, use RDS MySQL with a master in the region and a read replica in the HQ region
    2. For each regional deployment, use MySQL on EC2 with a master in the region and send hourly EBS snapshots to the HQ region
    3. For each regional deployment, use RDS MySQL with a master in the region and send hourly RDS snapshots to the HQ region
    4. For each regional deployment, use MySQL on EC2 with a master in the region and use S3 to copy data files hourly to the HQ region

AWS RDS Replication – Multi-AZ vs Read Replica

RDS Multi-AZ vs Read Replica


RDS DB instance replicas can be created in two ways – Multi-AZ & Read Replica – which provide high availability, durability, and scalability to RDS.

RDS Multi-AZ vs Read Replica

Purpose

  • Multi-AZ DB Instance deployments provide high availability, durability, and automatic failover support.
  • Read replicas enable increased scalability and database availability in the case of an AZ failure. Read Replicas allow elastic scaling beyond the capacity constraints of a single DB instance for read-heavy database workloads

RDS Read Replicas vs Multi-AZ

Region & Availability Zones

  • RDS Multi-AZ deployment automatically provisions and manages a standby instance in a different AZ (independent infrastructure in a physically separate location) within the same AWS region.
  • RDS Read Replicas can be provisioned within the same AZ, Cross-AZ or even as a Cross-Region replica.

Replication Mode

  • RDS Multi-AZ deployment manages a synchronous standby instance in a different AZ
  • RDS Read Replicas have the data replicated asynchronously from the primary instance to the read replicas

Standby Instance can Accept Reads

  • Multi-AZ DB instance deployment is a high-availability solution and the standby instance does not serve read or write requests.
  • Read Replica deployment provides readable instances to increase application read-throughput.

Automatic Failover & Failover Time

  • Multi-AZ DB instance deployment performs an automatic failover to the standby instance without administrative intervention, with a failover time typically up to 120 seconds depending on crash recovery. Automatic failover is triggered by events such as:
    • Planned database maintenance
    • Software patching
    • Rebooting the Primary instance with failover
    • Primary DB instance connectivity or host failure, or an
    • Availability Zone failure
  • RDS maintains the same endpoint for the DB Instance after a failover, so the application can resume database operation without the need for manual administrative intervention.
  • Read Replica deployment does not provide automatic failover. Read Replica instance needs to be manually promoted to a Standalone instance.

Upgrades

  • For a Multi-AZ deployment, Database engine version upgrades happen on the Primary instance.
  • For Read Replicas, the Database engine version upgrade is independent of the Primary instance.

Automated Backups

  • Multi-AZ deployment has the Automated Backups taken from the Standby instance
  • Read Replicas do not have any backups configured, by default.

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated
  • Open to further feedback, discussion and correction.
  1. You are running a successful multi-tier web application on AWS and your marketing department has asked you to add a reporting tier to the application. The reporting tier will aggregate and publish status reports every 30 minutes from user-generated information that is being stored in your web applications database. You are currently running a Multi-AZ RDS MySQL instance for the database tier. You also have implemented ElastiCache as a database caching layer between the application tier and database tier. Please select the answer that will allow you to successfully implement the reporting tier with as little impact as possible to your database.
    1. Continually send transaction logs from your master database to an S3 bucket and generate the reports of the S3 bucket using S3 byte range requests.
    2. Generate the reports by querying the synchronously replicated standby RDS MySQL instance maintained through Multi-AZ (Standby instance cannot be used as a scaling solution)
    3. Launch an RDS Read Replica connected to your Multi-AZ master database and generate reports by querying the Read Replica.
    4. Generate the reports by querying the ElastiCache database caching tier. (ElasticCache does not maintain full data and is simply a caching solution)
  2. A company is deploying a new two-tier web application in AWS. The company has limited staff and requires high availability, and the application requires complex queries and table joins. Which configuration provides the solution for the company’s requirements?
    1. MySQL Installed on two Amazon EC2 Instances in a single Availability Zone (does not provide High Availability out of the box)
    2. Amazon RDS for MySQL with Multi-AZ
    3. Amazon ElastiCache (Just a caching solution)
    4. Amazon DynamoDB (Not suitable for complex queries and joins)
  3. Your company is getting ready to do a major public announcement of a social media site on AWS. The website is running on EC2 instances deployed across multiple Availability Zones with a Multi-AZ RDS MySQL Extra Large DB Instance. The site performs a high number of small reads and writes per second and relies on an eventual consistency model. After comprehensive tests you discover that there is read contention on RDS MySQL. Which are the best approaches to meet these requirements? (Choose 2 answers)
    1. Deploy ElastiCache in-memory cache running in each availability zone
    2. Implement sharding to distribute load to multiple RDS MySQL instances (this is only a read contention, the writes work fine)
    3. Increase the RDS MySQL Instance size and Implement provisioned IOPS (not scalable, this is only a read contention, the writes work fine)
    4. Add an RDS MySQL read replica in each availability zone
  4. Your company has HQ in Tokyo and branch offices all over the world and is using logistics software with a multi-regional deployment on AWS in Japan, Europe and US. The logistic software has a 3-tier architecture and currently uses MySQL 5.6 for data persistence. Each region has deployed its own database. In the HQ region you run an hourly batch process reading data from every region to compute cross-regional reports that are sent by email to all offices this batch process must be completed as fast as possible to quickly optimize logistics. How do you build the database architecture in order to meet the requirements?
    1. For each regional deployment, use RDS MySQL with a master in the region and a read replica in the HQ region
    2. For each regional deployment, use MySQL on EC2 with a master in the region and send hourly EBS snapshots to the HQ region
    3. For each regional deployment, use RDS MySQL with a master in the region and send hourly RDS snapshots to the HQ region
    4. For each regional deployment, use MySQL on EC2 with a master in the region and use S3 to copy data files hourly to the HQ region
    5. Use Direct Connect to connect all regional MySQL deployments to the HQ region and reduce network latency for the batch process
  5. What would happen to an RDS (Relational Database Service) Multi-Availability Zone deployment if the primary DB instance fails?
    1. IP of the primary DB Instance is switched to the standby DB Instance.
    2. A new DB instance is created in the standby availability zone.
    3. The canonical name record (CNAME) is changed from primary to standby.
    4. The RDS (Relational Database Service) DB instance reboots.
  6. Your business is building a new application that will store its entire customer database on a RDS MySQL database, and will have various applications and users that will query that data for different purposes. Large analytics jobs on the database are likely to cause other applications to not be able to get the query results they need to, before time out. Also, as your data grows, these analytics jobs will start to take more time, increasing the negative effect on the other applications. How do you solve the contention issues between these different workloads on the same data?
    1. Enable Multi-AZ mode on the RDS instance
    2. Use ElastiCache to offload the analytics job data
    3. Create RDS Read-Replicas for the analytics work
    4. Run the RDS instance on the largest size possible
  7. Will my standby RDS instance be in the same Availability Zone as my primary?
    1. Only for Oracle RDS types
    2. Yes
    3. Only if configured at launch
    4. No
  8. Is creating a Read Replica of another Read Replica supported?
    1. Only in certain regions
    2. Only with MySQL based RDS
    3. Only for Oracle RDS types
    4. No
  9. A user is planning to set up the Multi-AZ feature of RDS. Which of the below mentioned conditions won’t take advantage of the Multi-AZ feature?
    1. Availability zone outage
    2. A manual failover of the DB instance using Reboot with failover option
    3. Region outage
    4. When the user changes the DB instance’s server type
  10. When you run a DB Instance as a Multi-AZ deployment, the “_____” serves database writes and reads
    1. secondary
    2. backup
    3. stand by
    4. primary
  11. When running my DB Instance as a Multi-AZ deployment, can I use the standby for read or write operations?
    1. Yes
    2. Only with MSSQL based RDS
    3. Only for Oracle RDS instances
    4. No
  12. Read Replicas require a transactional storage engine and are only supported for the _________ storage engine
    1. OracleISAM
    2. MSSQLDB
    3. InnoDB
    4. MyISAM
  13. A user is configuring the Multi-AZ feature of an RDS DB. The user came to know that this RDS DB does not use the AWS technology, but uses server mirroring to achieve replication. Which DB is the user using right now?
    1. MySQL
    2. Oracle
    3. MS SQL
    4. PostgreSQL
  14. If I have multiple Read Replicas for my master DB Instance and I promote one of them, what happens to the rest of the Read Replicas?
    1. The remaining Read Replicas will still replicate from the older master DB Instance
    2. The remaining Read Replicas will be deleted
    3. The remaining Read Replicas will be combined to one read replica
  15. If you have chosen Multi-AZ deployment, in the event of a planned or unplanned outage of your primary DB Instance, Amazon RDS automatically switches to the standby replica. The automatic failover mechanism simply changes the ______ record of the main DB Instance to point to the standby DB Instance.
    1. DNAME
    2. CNAME
    3. TXT
    4. MX
  16. When automatic failover occurs, Amazon RDS will emit a DB Instance event to inform you that automatic failover occurred. You can use the _____ to return information about events related to your DB Instance
    1. FetchFailure
    2. DescriveFailure
    3. DescribeEvents
    4. FetchEvents
  17. The new DB Instance that is created when you promote a Read Replica retains the backup window period.
    1. TRUE
    2. FALSE
  18. Will I be alerted when automatic failover occurs?
    1. Only if SNS configured
    2. No
    3. Yes
    4. Only if Cloudwatch configured
  19. Can I initiate a “forced failover” for my MySQL Multi-AZ DB Instance deployment?
    1. Only in certain regions
    2. Only in VPC
    3. Yes
    4. No
  20. A user is accessing RDS from an application. The user has enabled the Multi-AZ feature with the MS SQL RDS DB. During a planned outage how will AWS ensure that a switch from DB to a standby replica will not affect access to the application?
    1. RDS will have an internal IP which will redirect all requests to the new DB
    2. RDS uses DNS to switch over to standby replica for seamless transition
    3. The switch over changes Hardware so RDS does not need to worry about access
    4. RDS will have both the DBs running independently and the user has to manually switch over
  21. Which of the following is part of the failover process for a Multi-AZ Amazon Relational Database Service (RDS) instance?
    1. The failed RDS DB instance reboots.
    2. The IP of the primary DB instance is switched to the standby DB instance.
    3. The DNS record for the RDS endpoint is changed from primary to standby.
    4. A new DB instance is created in the standby availability zone.
  22. Which of these is not a reason a Multi-AZ RDS instance will failover?
    1. An Availability Zone outage
    2. A manual failover of the DB instance was initiated using Reboot with failover
    3. To autoscale to a higher instance class (Refer link)
    4. Master database corruption occurs
    5. The primary DB instance fails
  23. You need to scale an RDS deployment. You are operating at 10% writes and 90% reads, based on your logging. How best can you scale this in a simple way?
    1. Create a second master RDS instance and peer the RDS groups.
    2. Cache all the database responses on the read side with CloudFront.
    3. Create read replicas for RDS since the load is mostly reads.
    4. Create a Multi-AZ RDS installs and route read traffic to standby.
  24. How does Amazon RDS multi Availability Zone model work?
    1. A second, standby database is deployed and maintained in a different availability zone from master, using synchronous replication. (Refer link)
    2. A second, standby database is deployed and maintained in a different availability zone from master using asynchronous replication.
    3. A second, standby database is deployed and maintained in a different region from master using asynchronous replication.
    4. A second, standby database is deployed and maintained in a different region from master using synchronous replication.
  25. A customer is running an application in US-West (Northern California) region and wants to setup disaster recovery failover to the Asian Pacific (Singapore) region. The customer is interested in achieving a low Recovery Point Objective (RPO) for an Amazon RDS multi-AZ MySQL database instance. Which approach is best suited to this need?
    1. Synchronous replication
    2. Asynchronous replication
    3. Route53 health checks
    4. Copying of RDS incremental snapshots
  26. A user is using a small MySQL RDS DB. The user is experiencing high latency due to the Multi AZ feature. Which of the below mentioned options may not help the user in this situation?
    1. Schedule the automated back up in non-working hours
    2. Use a large or higher size instance
    3. Use PIOPS
    4. Take a snapshot from standby Replica
  27. Are Reserved Instances available for Multi-AZ Deployments?
    1. Only for Cluster Compute instances
    2. Yes for all instance types
    3. Only for M3 instance types
  28. My Read Replica appears “stuck” after a Multi-AZ failover and is unable to obtain or apply updates from the source DB Instance. What do I do?
    1. You will need to delete the Read Replica and create a new one to replace it.
    2. You will need to disassociate the DB Engine and re-associate it.
    3. The instance should be deployed to Single AZ and then moved to Multi-AZ once again
    4. You will need to delete the DB Instance and create a new one to replace it.
  29. What is the charge for the data transfer incurred in replicating data between your primary and standby?
    1. No charge. It is free.
    2. Double the standard data transfer charge
    3. Same as the standard data transfer charge
    4. Half of the standard data transfer charge
  30. A user has enabled the Multi-AZ feature with the MS SQL RDS database server. Which of the below mentioned statements will help the user understand the Multi-AZ feature better?
    1. In a Multi-AZ, AWS runs two DBs in parallel and copies the data asynchronously to the replica copy
    2. In a Multi-AZ, AWS runs two DBs in parallel and copies the data synchronously to the replica copy
    3. In a Multi-AZ, AWS runs just one DB but copies the data synchronously to the standby replica
    4. AWS MS SQL does not support the Multi-AZ feature
  31. A company is running a batch analysis every hour on their main transactional DB running on an RDS MySQL instance to populate their central Data Warehouse running on Redshift. During the execution of the batch their transactional applications are very slow. When the batch completes they need to update the top management dashboard with the new data. The dashboard is produced by another system running on-premises that is currently started when a manually sent email notifies that an update is required The on-premises system cannot be modified because is managed by another team. How would you optimize this scenario to solve performance issues and automate the process as much as possible?
    1. Replace RDS with Redshift for the batch analysis and SNS to notify the on-premises system to update the dashboard
    2. Replace RDS with Redshift for the batch analysis and SQS to send a message to the on-premises system to update the dashboard
    3. Create an RDS Read Replica for the batch analysis and SNS to notify me on-premises system to update the dashboard
    4. Create an RDS Read Replica for the batch analysis and SQS to send a message to the on-premises system to update the dashboard.