AWS Certified Machine Learning Engineer – Associate (MLA-C01) Exam Learning Path
- Finally certified in the last pending AWS Certified Machine Learning Engineer – Associate (MLA-C01) certification, which was newly introduced on October 8, 2024, following its beta period.
- The Machine Learning Engineer – Associate exam validates the knowledge needed to build, operationalize, deploy, and maintain machine learning (ML) solutions and pipelines using the AWS Cloud.
- The exam also validates a candidate’s ability to complete the following tasks:
- Ingest, transform, validate, and prepare data for ML modeling.
- Select general modeling approaches, train models, tune hyperparameters, analyze model performance, and manage model versions.
- Choose deployment infrastructure and endpoints, provision compute resources, and configure auto scaling based on requirements.
- Set up continuous integration and continuous delivery (CI/CD) pipelines to automate orchestration of ML workflows.
- Monitor models, data, and infrastructure to detect issues.
- Secure ML systems and resources through access controls, compliance features, and best practices.
Refer to the AWS Certified Machine Learning Engineer – Associate (MLA-C01) Exam Guide
AWS Certified Machine Learning Engineer – Associate (MLA-C01) Exam Summary
- MLA-C01 exam consists of 65 questions (50 scored and 15 unscored) in 130 minutes, and the time is more than sufficient if you are well-prepared.
- In addition to the usual multiple-choice and multiple-response questions, the AIF and MLA exams have introduced the following new question types:
- Ordering: Has a list of 3-5 responses which you need to select and place in the correct order to complete a specified task.
- Matching: Has a list of responses to match with a list of 3-7 prompts. You must match all the pairs correctly to receive credit for the question.
- Case study: A case study presents a single scenario with multiple questions. Each question is evaluated independently, and credit is given for each correct answer.
- MLA-C01 has a scaled score between 100 and 1,000. The scaled score needed to pass the exam is 720.
- Associate exams currently cost $150 + tax.
- You can get an additional 30 minutes if English is your second language by requesting Exam Accommodations. It might not be needed for Associate exams but is helpful for Professional and Specialty ones.
- AWS exams can be taken either at a test center or online; I prefer to take them online as it provides a lot of flexibility. Just make sure you have a proper place to take the exam with no disturbance and nothing around you.
- Also, if you are taking the AWS online exam for the first time, try to join at least 30 minutes before the actual time, as I have had issues with both PSI and Pearson with long wait times.
AWS Certified Machine Learning Engineer – Associate (MLA-C01) Exam Resources
- Online Courses
- Stephane Maarek – AWS Certified Machine Learning Engineer Associate: Hands On!
- Whizlabs – AWS Certified Machine Learning Engineer Associate (MLA-C01)
- Practice tests
- Read the FAQs at least for the important topics, as they cover important points and are good for quick review
AWS Certified Machine Learning Engineer – Associate (MLA-C01) Exam Topics
- AWS Certified Machine Learning Engineer – Associate exam covers a lot of Machine Learning concepts in addition to the AWS ML services.
- The exam covers the Machine Learning lifecycle: data collection and transformation, making data usable and efficient for Machine Learning, pre-processing, training and validation, and implementation.
Machine Learning Concepts
- Exploratory Data Analysis
- Feature selection and Engineering
- remove features that are not related to training
- remove features that have the same values, very low correlation, very little variance, or a lot of missing values
- Apply techniques like Principal Component Analysis (PCA) for dimensionality reduction i.e. reduce the number of features.
- Apply techniques such as One-hot encoding and label encoding to convert strings to numeric values, which are easier to process (see the preprocessing sketch after this list).
- Apply Normalization, i.e., scaling values to between 0 and 1, to handle data with large variance.
- Apply feature engineering for feature reduction, e.g., using a single combined height/weight feature instead of both features.
- Handle Missing data
- remove the feature or rows with missing data
- impute using Mean/Median values – valid only for numeric features, not categorical ones; also does not factor in correlation between features
- impute using k-NN, Multivariate Imputation by Chained Equations (MICE), or Deep Learning – more accurate and factors in correlation between features
- Handle unbalanced data
- Source more data
- Oversample minority or Undersample majority
- Data augmentation using techniques like Synthetic Minority Oversampling Technique (SMOTE).
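As referenced above, here is a minimal scikit-learn sketch of these preprocessing steps, assuming a toy pandas DataFrame with one numeric and one categorical feature (all names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data: a numeric feature with a missing value and a categorical feature.
df = pd.DataFrame({"age": [25, 32, np.nan, 51],
                   "city": ["NYC", "SFO", "NYC", "LAX"]})

# Numeric: impute missing values with the median, then normalize to [0, 1].
numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", MinMaxScaler())])

# Categorical: one-hot encode strings into numeric indicator columns.
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

print(preprocess.fit_transform(df))
```

For unbalanced data, SMOTE is available as imblearn.over_sampling.SMOTE in the separate imbalanced-learn package.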
- Modeling
- Know the algorithm families – Supervised, Unsupervised, and Reinforcement – and which algorithm is best suited based on whether the available data is labelled or unlabelled.
- Supervised learning trains on labeled data, e.g., Linear Regression, Logistic Regression, Decision Trees, Random Forests
- Unsupervised learning trains on unlabelled data, e.g., PCA, SVD, K-means
- Reinforcement learning is trained based on actions and rewards, e.g., Q-Learning
- Hyperparameters
- are parameters exposed by machine learning algorithms that control how the underlying algorithm operates and their values affect the quality of the trained models
- some of the common hyperparameters are learning rate, batch size, and epochs (hint: if the learning rate is too large, the minimum might be overshot and the loss would oscillate; if it is too small, it requires too many steps, which takes longer and is less efficient – see the sketch below)
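A tiny sketch of the learning-rate hint above: plain gradient descent on f(x) = x², where the gradient is 2x and the minimum is at 0:

```python
# Gradient descent on f(x) = x^2; the gradient is 2x, the minimum is at x = 0.
def descend(lr, steps=10, x=5.0):
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(descend(lr=0.01))  # too small: after 10 steps x is still ~4.1
print(descend(lr=0.1))   # reasonable: converges steadily toward 0
print(descend(lr=1.1))   # too large: overshoots and oscillates, |x| grows
```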
- Evaluation
- Know the differences in evaluating model accuracy
- Use Area Under the (Receiver Operating Characteristic) Curve (AUC) for Binary classification
- Use root mean square error (RMSE) metric for regression
- Understand Confusion matrix
- A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.
- A false positive is an outcome where the model incorrectly predicts the positive class. A false negative is an outcome where the model incorrectly predicts the negative class.
- Recall or Sensitivity or TPR (True Positive Rate): Number of items correctly identified as positive out of total actual positives – TP/(TP+FN) (hint: use this for cases like fraud detection, where the cost of marking non-frauds as frauds is lower than marking frauds as non-frauds)
- Specificity or TNR (True Negative Rate): Number of items correctly identified as negative out of total negatives – TN/(TN+FP) (hint: use this for cases like videos for kids, where the cost of dropping a few valid videos is lower than showing a few bad ones; a worked example follows this list)
- Handle Overfitting problems
- Simplify the model by reducing the number of layers
- Early Stopping – form of regularization while training a model with an iterative method, such as gradient descent
- Data Augmentation
- Regularization – technique to reduce the complexity of the model
- Dropout is a regularization technique that prevents overfitting
- Never train on test data
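The worked metrics example referenced above, using scikit-learn's confusion matrix (note that sklearn's ravel() returns TN, FP, FN, TP in that order):

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # actual labels
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)        # TPR / Sensitivity -> 3 / (3 + 1) = 0.75
specificity = tn / (tn + fp)   # TNR               -> 3 / (3 + 1) = 0.75
assert recall == recall_score(y_true, y_pred)
print(recall, specificity)
```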
Machine Learning Services
SageMaker
- supports File mode, Pipe mode, and Fast File mode
- File mode loads all of the data from S3 to the training instance volumes, while Pipe mode streams data directly from S3
- File mode needs disk space to store both the final model artifacts and the full training dataset, while Pipe mode helps reduce the required size of EBS volumes
- Fast File mode combines the ease of use of the existing File Mode with the performance of Pipe Mode.
- Using RecordIO format allows algorithms to take advantage of Pipe mode when training the algorithms that support it.
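A hedged sketch of selecting the input mode with the SageMaker Python SDK; the image URI, role, and bucket are placeholders, and input_mode accepts "File", "Pipe", or "FastFile":

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="<training-image-uri>",   # placeholder
    role="<execution-role-arn>",        # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="FastFile",              # "File" | "Pipe" | "FastFile"
)
# The input mode can also be set per channel on the TrainingInput.
train = TrainingInput(s3_data="s3://<bucket>/train/", input_mode="FastFile")
estimator.fit({"train": train})
```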
- supports Model tracking capability to manage up to thousands of machine learning model experiments
- supports automatic scaling for production variants. Automatic scaling dynamically adjusts the number of instances provisioned for a production variant in response to changes in your workload
- provides pre-built Docker images for its built-in algorithms and the supported deep learning frameworks used for training & inference
- SageMaker Automatic Model Tuning
- is the process of finding a set of hyperparameters for an algorithm that can yield an optimal model.
- Best practices
- limit the search to a smaller number of hyperparameters, as the difficulty of a hyperparameter tuning job depends primarily on the number of hyperparameters that Amazon SageMaker has to search
- DO NOT specify a very large range to cover every possible value for a hyperparameter as it affects the success of hyperparameter optimization.
- choose logarithmic scaling for hyperparameters whose ranges span several orders of magnitude to improve hyperparameter optimization.
- running one training job at a time achieves the best results with the least amount of compute time.
- Design distributed training jobs so that they report the objective metric that you want.
- know how to take advantage of multiple GPUs (hint: increase the learning rate and batch size w.r.t. the increase in GPUs)
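A hedged sketch of the best practices above with the SageMaker Python SDK's HyperparameterTuner: a single log-scaled hyperparameter, a bounded job count, and sequential jobs (the estimator and the metric name are assumptions):

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

# Log scaling suits ranges spanning orders of magnitude, e.g. learning rate.
ranges = {"learning_rate": ContinuousParameter(1e-5, 1e-1,
                                               scaling_type="Logarithmic")}

tuner = HyperparameterTuner(
    estimator=estimator,                     # an Estimator as sketched earlier
    objective_metric_name="validation:auc",  # assumed metric name
    hyperparameter_ranges=ranges,
    max_jobs=10,
    max_parallel_jobs=1,                     # one job at a time tunes best
)
tuner.fit({"train": "s3://<bucket>/train/"})
```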
- Elastic Inference (now replaced by Inferentia) helps attach low-cost GPU-powered acceleration to EC2 and SageMaker instances or ECS tasks to reduce the cost of running deep learning inference.
- SageMaker Inference options
- Real-time inference is ideal for online inferences that have low latency or high throughput requirements.
- Serverless Inference is ideal for intermittent or unpredictable traffic patterns as it manages all of the underlying infrastructure with no need to manage instances or scaling policies.
- Batch Transform is suitable for offline processing when large amounts of data are available upfront and you don’t need a persistent endpoint.
- Asynchronous Inference is ideal when you want to queue requests and have large payloads with long processing times.
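A hedged sketch of deploying with the serverless option via the SageMaker Python SDK, assuming a sagemaker.model.Model object named model built earlier:

```python
from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,   # memory allocated per invocation
    max_concurrency=5,        # cap on concurrent invocations
)
# No instance type or scaling policy; SageMaker manages the infrastructure.
predictor = model.deploy(serverless_inference_config=serverless_config)
print(predictor.endpoint_name)
```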
- SageMaker Model deployment allows deploying multiple variants of a model to the same SageMaker endpoint to test new models without impacting the user experience
- Production Variants
- supports A/B or Canary testing where you can allocate a portion of the inference requests to each variant.
- helps compare production variants’ performance relative to each other.
- Shadow Variants
- replicates a portion of the inference requests that go to the production variant to the shadow variant.
- logs the responses of the shadow variant for comparison; they are not returned to the caller.
- helps test the performance of the shadow variant without exposing the caller to the response produced by the shadow variant.
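A hedged boto3 sketch of A/B testing with production variants; the model and endpoint names are hypothetical, and the weights split traffic 90/10:

```python
import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="my-ab-config",
    ProductionVariants=[
        {"VariantName": "current", "ModelName": "model-a",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.9},   # ~90% of inference requests
        {"VariantName": "challenger", "ModelName": "model-b",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.1},   # ~10% of inference requests
    ],
)
sm.create_endpoint(EndpointName="my-endpoint",
                   EndpointConfigName="my-ab-config")
```

Traffic can later be shifted between variants with update_endpoint_weights_and_capacities without redeploying.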
- SageMaker Managed Spot training can help use Spot instances to save cost and, with the Checkpointing feature, can save the state of ML models during training
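A hedged sketch of Managed Spot training with checkpointing in the SageMaker Python SDK (placeholders as before); checkpoint_s3_uri lets training resume after a Spot interruption:

```python
from sagemaker.estimator import Estimator

spot_estimator = Estimator(
    image_uri="<training-image-uri>",     # placeholder
    role="<execution-role-arn>",          # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,
    max_run=3600,                         # max training time in seconds
    max_wait=7200,                        # >= max_run; includes Spot waits
    checkpoint_s3_uri="s3://<bucket>/checkpoints/",  # state for resuming
)
```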
- SageMaker Feature Store
- helps to create, share, and manage features for ML development.
- is a centralized store for features and associated metadata so features can be easily discovered and reused.
- SageMaker Debugger provides tools to debug training jobs and resolve problems such as overfitting, saturated activation functions, and vanishing gradients to improve the model’s performance.
- SageMaker Model Monitor monitors the quality of SageMaker machine learning models in production and can help set alerts that notify when there are deviations in the model quality.
- SageMaker Automatic Model Tuning helps find a set of hyperparameters for an algorithm that can yield an optimal model.
- SageMaker Data Wrangler
- reduces the time it takes to aggregate and prepare tabular and image data for ML from weeks to minutes.
- simplifies the process of data preparation (including data selection, cleansing, exploration, visualization, and processing at scale) and feature engineering.
- supports two connection types:
- Direct connection, which always has the latest data.
- Cataloged connection, which is the result of a data transfer, so the data in a cataloged connection doesn’t necessarily have the most recent data.
- SageMaker Experiments is a capability of SageMaker that lets you create, manage, analyze, and compare machine learning experiments.
- SageMaker Clarify helps improve the ML models by detecting potential bias and helping to explain the predictions that the models make.
- SageMaker Model Governance is a framework that gives systematic visibility into ML model development, validation, and usage.
- SageMaker Model Cards
- helps document critical details about the ML models in a single place for streamlined governance and reporting.
- helps capture key information about the models throughout their lifecycle and implement responsible AI practices.
- SageMaker Autopilot is an automated machine learning (AutoML) feature set that automates the end-to-end process of building, training, tuning, and deploying machine learning models.
- SageMaker Neo enables machine learning models to train once and run anywhere in the cloud and at the edge.
- SageMaker API and SageMaker Runtime support VPC interface endpoints powered by AWS PrivateLink, which helps connect a VPC directly to the SageMaker API or SageMaker Runtime without using an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection.
- SageMaker managed warm pools retain and reuse provisioned infrastructure after the training job completion to reduce latency for repetitive workloads.
- SageMaker supports Elastic File System (EFS) and FSx for Lustre file systems as data sources for training machine learning models.
- SageMaker MLOps
- ML Lineage Tracking creates and stores tracking information about the steps of an ML workflow from data preparation to model deployment, which can help reproduce the workflow steps, track model and dataset lineage, and establish model governance and audit standards.
- Model Registry provides a model catalog, helps manage model versions, associate metadata, manage model approval status, deploy models to production, and share models with other users.
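A hedged boto3 sketch of registering a model version in the Model Registry; the group name, image URI, and artifact path are hypothetical:

```python
import boto3

sm = boto3.client("sagemaker")
sm.create_model_package_group(ModelPackageGroupName="churn-models")
sm.create_model_package(
    ModelPackageGroupName="churn-models",
    ModelApprovalStatus="PendingManualApproval",  # gate deployment on approval
    InferenceSpecification={
        "Containers": [{"Image": "<inference-image-uri>",
                        "ModelDataUrl": "s3://<bucket>/model.tar.gz"}],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
)
```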
SageMaker Ground Truth
- provides automated data labeling using machine learning
- helps build highly accurate training datasets for machine learning quickly using Amazon Mechanical Turk
- provides annotation consolidation to help improve the accuracy of the data objects’ labels. It combines the results of multiple workers’ annotation tasks into one high-fidelity label.
- automated data labeling uses machine learning to label portions of the data automatically without having to send them to human workers
Machine Learning & AI Managed Services
- Comprehend
- natural language processing (NLP) service to find insights and relationships in text.
- identifies the language of the text; extracts key phrases, places, people, brands, or events; understands how positive or negative the text is; analyzes text using tokenization and parts of speech; and automatically organizes a collection of text files by topic.
- Rekognition – analyze images and video to identify objects, people, text, scenes, and activities in images and videos, as well as detect any inappropriate content.
- Transcribe – automatic speech recognition (ASR) speech-to-text
- Kendra – an intelligent search service that uses NLP and advanced ML algorithms to return specific answers to search questions from your data.
- Augmented AI (Amazon A2I) is an ML service that makes it easy to build the workflows required for human review.
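These services are plain API calls; a minimal boto3 sketch for Comprehend sentiment and Rekognition labels (the S3 object is a placeholder):

```python
import boto3

comprehend = boto3.client("comprehend")
sentiment = comprehend.detect_sentiment(Text="The exam went great!",
                                        LanguageCode="en")
print(sentiment["Sentiment"])             # e.g. POSITIVE

rekognition = boto3.client("rekognition")
labels = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "<bucket>", "Name": "photo.jpg"}},
    MaxLabels=5,
)
print([label["Name"] for label in labels["Labels"]])
```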
Generative AI
- MLA-C01 covers a few concepts of Generative AI at a very high level.
- Foundation Models:
- Large, pre-trained models built on diverse data that can be fine-tuned for specific tasks like text, image, and speech generation, e.g., GPT, BERT, and DALL·E.
- Large Language Models (LLMs):
- A subset of foundation models designed to understand and generate human-like text. Capable of answering questions, summarizing, translating, and more.
- LLM Components
- Tokens:
- Basic units of text (words, subwords, or characters) that LLMs process.
- Vectors
- Numerical representations of tokens in high-dimensional space, enabling the model to perform mathematical operations on text.
- Each token is converted into a vector for processing in the neural network.
- Embeddings:
- Pre-trained numerical vector representations of tokens that capture their semantic meaning.
- Prompt Engineering:
- Crafting effective input instructions to guide generative AI toward desired outputs. Key for improving performance without fine-tuning the model.
- Retrieval-Augmented Generation (RAG):
- Combines LLMs with external knowledge bases to retrieve accurate and up-to-date information during text generation. Useful for chatbots and domain-specific tasks.
- Fine-Tuning:
- Adjusting pre-trained models using domain-specific data to optimize performance for specific applications.
- Responsible AI Features:
- Incorporates fairness, transparency, and bias mitigation techniques to ensure ethical AI outputs.
- Multi-Modal Capabilities:
- Models that process and generate outputs across multiple data types, such as text, images, and audio.
- Controls
- Temperature:
- Adjusts randomness in the output; lower values produce focused results, while higher values generate creative outputs. Essential for creative tasks or deterministic responses.
- Lower values (e.g., 0.2) make the output more focused and deterministic, while higher values (e.g., 1.0 or above) make it more creative and diverse.
- Top P (Nucleus Sampling):
- Determines the probability threshold for token selection; e.g., with Top P = 0.9, the model considers only the smallest set of tokens whose cumulative probability is 90%, filtering out less likely options.
- Top K:
- Limits token selection to the top K most probable tokens; e.g., with Top K = 10, the model chooses tokens only from the 10 most likely options, providing more control over diversity.
- Token Length (Max Tokens):
- Sets the maximum number of tokens the model can generate in a response.
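A hedged sketch of these controls using the Bedrock Converse API via boto3; the model ID is a placeholder, and Top K is model-specific so it goes in additionalModelRequestFields (the key name varies by model):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
response = bedrock.converse(
    modelId="<model-id>",                 # placeholder
    messages=[{"role": "user",
               "content": [{"text": "Summarize the ML lifecycle."}]}],
    inferenceConfig={
        "temperature": 0.2,   # low = focused/deterministic; high = creative
        "topP": 0.9,          # nucleus sampling threshold
        "maxTokens": 256,     # cap on generated tokens
    },
    # Top K is model-specific, e.g.: additionalModelRequestFields={"top_k": 10}
)
print(response["output"]["message"]["content"][0]["text"])
```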
Analytics
- Kinesis
- Understand Kinesis Data Streams and Kinesis Data Firehose in depth
- Glue is a fully managed ETL (extract, transform, and load) service that automates the time-consuming steps of data preparation for analytics
- helps set up, orchestrate, and monitor complex data flows.
- Glue Data Catalog is a central repository to store structural and operational metadata for all the data assets.
- Glue crawler connects to a data store, extracts the schema of the data, and then populates the Glue Data Catalog with this metadata (see the sketch after this list)
- Glue DataBrew is a visual data preparation tool that enables users to clean and normalize data without writing any code.
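The crawler sketch referenced above, with boto3; the names, role, and path are hypothetical:

```python
import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="sales-crawler",                          # hypothetical name
    Role="<glue-service-role-arn>",                # placeholder
    DatabaseName="sales_db",                       # target catalog database
    Targets={"S3Targets": [{"Path": "s3://<bucket>/sales/"}]},
)
glue.start_crawler(Name="sales-crawler")  # infers schema, fills Data Catalog
```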
Security, Identity & Compliance
- SageMaker can read data from KMS-encrypted S3. Make sure the KMS key policies include the role attached to SageMaker.
Management & Governance Tools
- Understand AWS CloudWatch for Logs and Metrics. (hint: SageMaker is integrated with CloudWatch and logs and metrics are all stored in it)
Whitepapers and articles
- AWS Machine Learning Services Cheat Sheet
- AWS Analytics Services Cheat Sheet
- Machine Learning Concepts Cheat Sheet
- AWS Machine Learning & AI Learning Hub
On the Exam Day
- Make sure you are relaxed and get a good night’s sleep. The exam is not tough if you are well-prepared.
- If you are taking the AWS Online exam
- Try to join at least 30 minutes before the actual time as I have had issues with both PSI and Pearson with long wait times.
- The online verification process does take some time, and usually there are glitches.
- Remember, you would not be allowed to take the exam if you are late by more than 30 minutes.
- Make sure your desk is clear, with no wrist watches or external monitors; keep your phone away, and ensure nobody can enter the room.
Finally, All the Best 🙂