AWS SageMaker Built-in Algorithms Summary

SageMaker AI Built-in Algorithms

📌 Naming Update (December 2024): On December 3, 2024, Amazon SageMaker was renamed to Amazon SageMaker AI. The “SageMaker” brand now refers to the next-generation unified platform for data, analytics, and AI. All built-in algorithms remain available under SageMaker AI.

  • SageMaker AI provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and ML practitioners get started on training and deploying ML models quickly.
  • SageMaker AI also provides SageMaker JumpStart with pre-trained foundation models (including LLMs like LLaMA, BLOOM, Falcon) for generative AI tasks such as text generation, summarization, and question answering.

SageMaker AI Built-in Algorithms

Tabular Data – Classification & Regression

AutoGluon-Tabular

  • is an open-source AutoML framework that succeeds by ensembling models and stacking them in multiple layers.
  • automatically performs data processing, model selection, and hyperparameter tuning.
  • used for both classification and regression tasks on tabular data.
  • supports CPU and GPU (single instance only) training.

CatBoost

  • is an implementation of the gradient-boosted trees algorithm that introduces ordered boosting and an innovative algorithm for processing categorical features.
  • used for both classification and regression tasks.
  • handles categorical features natively without requiring manual encoding.
  • supports CPU (single instance only) training.

LightGBM

  • is an implementation of the gradient-boosted trees algorithm that adds two novel techniques for improved efficiency and scalability.
  • uses Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
  • used for both classification and regression tasks.
  • supports CPU training, either on a single instance or distributed across multiple instances.

TabTransformer

  • is a novel deep tabular data modeling architecture built on self-attention-based Transformers.
  • converts categorical features into contextual embeddings using Transformer layers.
  • used for both classification and regression tasks.
  • supports CPU and GPU (single instance only) training.

XGBoost (eXtreme Gradient Boosting)

  • is a popular and efficient open-source implementation of the gradient boosted trees algorithm.
  • Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models.
  • supports both classification and regression tasks.
  • supports distributed training across multiple instances.
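
The idea of combining many weak models can be sketched in a few lines. This is a minimal illustration only, not the XGBoost implementation: it assumes a one-dimensional regression problem, squared-error loss, and depth-one trees ("stumps"). The function names and the `rounds` and `learning_rate` parameters are hypothetical.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Find the threshold split on x that minimises squared error on the residuals."""
    best = None
    # Skip the largest value so the right-hand side of the split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def fit_boosted_stumps(xs, ys, rounds=50, learning_rate=0.3):
    """Each round fits a stump to the current residuals and adds a damped copy of it."""
    base = mean(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate

def predict(model, x):
    """Sum the base value and every stump's damped contribution."""
    base, stumps, learning_rate = model
    return base + sum(
        learning_rate * (left_value if x <= threshold else right_value)
        for threshold, left_value, right_value in stumps
    )

# Toy data with a step between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = fit_boosted_stumps(xs, ys)
print(predict(model, 2), predict(model, 5))
```

No single stump is damped enough to fit the step on its own, but the sum of many small corrections does, which is the ensemble effect described above.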

Linear Learner

  • is a supervised learning algorithm used for solving either classification or regression problems.
  • learns a linear function for regression or a linear threshold function for classification.
  • supports distributed training.
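
As a rough sketch of the two output modes, the same linear score can be returned directly (regression) or compared against a threshold (binary classification). The function names and the default threshold of 0 are assumptions for illustration, not Linear Learner parameters.

```python
def linear_score(weights, bias, features):
    """Linear function: bias plus the weighted sum of the features."""
    return bias + sum(w * x for w, x in zip(weights, features))

def predict_regression(weights, bias, features):
    """Regression returns the linear score itself."""
    return linear_score(weights, bias, features)

def predict_binary(weights, bias, features, threshold=0.0):
    """Classification returns 1 when the score exceeds the threshold, else 0."""
    return 1 if linear_score(weights, bias, features) > threshold else 0

weights, bias = [2.0, -1.0], 0.5
print(predict_regression(weights, bias, [3.0, 4.0]))
print(predict_binary(weights, bias, [3.0, 4.0]))
```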

K-nearest neighbors (k-NN) algorithm

  • is an index-based algorithm.
  • uses a non-parametric method for classification or regression.
  • For classification problems, the algorithm queries the k points that are closest to the sample point and returns the most frequent class label among them as the predicted label.
  • For regression problems, the algorithm queries the k closest points to the sample point and returns the average of their target values as the predicted value.
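
A brute-force sketch of both modes, assuming Euclidean distance and a small in-memory training set. The built-in algorithm builds an index instead of scanning every point, so this only illustrates the voting and averaging step. For regression it averages the neighbours' target values, as in standard k-NN. All names here are hypothetical.

```python
from collections import Counter
from math import dist

def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to the query point."""
    return sorted(train, key=lambda pair: dist(pair[0], query))[:k]

def knn_classify(train, query, k=3):
    """Majority vote over the labels of the k nearest neighbours."""
    neighbours = k_nearest(train, query, k)
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Mean of the target values of the k nearest neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(target for _, target in neighbours) / len(neighbours)

labelled = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
            ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
priced = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]
print(knn_classify(labelled, (0.5, 0.5)))
print(knn_regress(priced, (2,)))
```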

Factorization Machine

  • is a general-purpose supervised learning algorithm used for both classification and regression tasks.
  • is an extension of a linear model designed to economically capture interactions between features within high-dimensional sparse datasets, as found in tasks such as click prediction and item recommendation.
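
To make "capturing interactions economically" concrete, here is a sketch of the standard second-order factorization machine score: a linear model plus pairwise terms whose weights are dot products of per-feature factor vectors. The sparse dict layout and all names are assumptions for illustration. The second function uses the well-known rewrite that avoids looping over every feature pair.

```python
from itertools import combinations

def fm_predict_naive(x, w0, w, v):
    """Linear terms plus <v_i, v_j> * x_i * x_j for every pair of active features."""
    linear = sum(w[i] * value for i, value in x.items())
    pairwise = sum(
        sum(a * b for a, b in zip(v[i], v[j])) * x[i] * x[j]
        for i, j in combinations(sorted(x), 2)
    )
    return w0 + linear + pairwise

def fm_predict(x, w0, w, v):
    """Same score in time linear in the non-zero features:
    0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)."""
    linear = sum(w[i] * value for i, value in x.items())
    k = len(next(iter(v.values())))
    pairwise = 0.0
    for f in range(k):
        total = sum(v[i][f] * value for i, value in x.items())
        total_sq = sum((v[i][f] * value) ** 2 for i, value in x.items())
        pairwise += 0.5 * (total * total - total_sq)
    return w0 + linear + pairwise

# Sparse input: only features 0, 3, and 7 are non-zero.
x = {0: 1.0, 3: 1.0, 7: 2.0}
w = {0: 0.1, 3: -0.2, 7: 0.05}
v = {0: [1.0, 0.5], 3: [0.2, -0.4], 7: [-0.3, 0.1]}
print(fm_predict(x, 0.5, w, v), fm_predict_naive(x, 0.5, w, v))
```

Only the non-zero entries of `x` are touched, which is why the model stays cheap on high-dimensional sparse data.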

Text-based

BlazingText algorithm

  • provides highly optimized implementations of the Word2vec and text classification algorithms.
  • Word2vec algorithm
    • useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc.
    • maps words to high-quality distributed vectors; the resulting representations are called word embeddings.
    • word embeddings capture the semantic relationships between words.
  • Text classification
    • is an important task for applications performing web searches, information retrieval, ranking, and document classification
  • for Word2vec, provides the Skip-gram and continuous bag-of-words (CBOW) training architectures.
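
The difference between the two Word2vec architectures is easiest to see in the training examples they produce. The sketch below only builds those examples from a token list. It does not train embeddings, and the function names and `window` parameter are hypothetical, not BlazingText hyperparameters.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: predict each context word from the centre word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: predict the centre word from the bag of surrounding context words."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = ["the", "cat", "sat"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```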

Text Classification – TensorFlow

  • is a supervised learning algorithm that supports transfer learning with many pretrained models from the TensorFlow Hub.
  • uses deep learning networks such as BERT which are highly accurate for text classification.
  • takes text as input and outputs probability for each of the class labels.
  • useful for sentiment analysis, spam detection, and document categorization.
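
A per-class probability output is commonly produced by applying softmax to the raw scores of the network's final layer. The sketch below assumes that setup. The label names and scores are made up, and the actual response format of the built-in algorithm is not shown here.

```python
from math import exp

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["negative", "neutral", "positive"]  # hypothetical class labels
probabilities = softmax([0.2, 0.1, 2.3])      # hypothetical raw scores
predicted = labels[probabilities.index(max(probabilities))]
print(dict(zip(labels, probabilities)), predicted)
```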

Sequence to Sequence – seq2seq

  • is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio), and the output generated is another sequence of tokens.
  • key use cases are machine translation (input a sentence in one language and predict that sentence in another language), text summarization (input a longer string of words and predict a shorter string that summarizes it), and speech-to-text (audio clips converted into output sentences in tokens).

Forecasting

DeepAR

  • is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).
  • the trained model can generate forecasts for new time series that are similar to the ones it was trained on.
  • supports learning complex patterns from multiple related time series simultaneously.

Clustering

K-means algorithm

  • is an unsupervised learning algorithm for clustering
  • attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups
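
The grouping idea can be sketched with the classic Lloyd iteration: assign each point to its nearest centroid, then move each centroid to the mean of its points. This is an illustration under simple assumptions (Euclidean distance, caller-supplied starting centroids) and not the SageMaker implementation. The names are hypothetical.

```python
from math import dist

def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, centroids, iterations=10):
    """Alternate between assignment and centroid update for a fixed number of rounds."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            # An empty cluster keeps its previous centroid.
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(centroids)
print(clusters)
```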

Topic Modelling

Latent Dirichlet Allocation (LDA)

  • is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories.
  • used to discover a user-specified number of topics shared by documents within a text corpus.

Neural Topic Model (NTM)

  • is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution
  • Topic modeling can be used to classify or summarize documents based on the topics detected or to retrieve information or recommend content based on topic similarities.

Feature Reduction

Object2Vec

  • is a general-purpose neural embedding algorithm that is highly customizable
  • can learn low-dimensional dense embeddings of high-dimensional objects.
  • useful for duplicate detection, finding similar items, and relationship prediction.

Principal Component Analysis – PCA

  • is an unsupervised ML algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible.
  • projects data points onto the first few principal components (eigenvectors of the data’s covariance matrix).
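
A small pure-Python sketch of that projection for the first principal component only. It assumes power iteration is an acceptable way to find the leading eigenvector of the covariance matrix, which is a swapped-in technique chosen to avoid external libraries and not necessarily what the built-in algorithm does. All names are hypothetical.

```python
from math import sqrt

def covariance_matrix(rows):
    """Centre the data and return (sample covariance matrix, centred rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return cov, centered

def first_principal_component(cov, steps=200):
    """Power iteration: repeatedly multiply by cov and renormalise.
    Assumes the all-ones start vector is not orthogonal to the top eigenvector."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(steps):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(centered, component):
    """Reduce each centred row to one coordinate along the component."""
    return [sum(c * x for c, x in zip(component, row)) for row in centered]

rows = [(1, 2), (2, 4), (3, 6), (4, 8)]  # two features that move together
cov, centered = covariance_matrix(rows)
component = first_principal_component(cov)
print(component)
print(project(centered, component))
```

Because the two features are perfectly correlated here, one coordinate along the component retains all of the variance.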

Anomaly Detection

Random Cut Forest (RCF)

  • is an unsupervised algorithm for detecting anomalous data points within a data set.
  • detects data points that diverge from otherwise well-structured or patterned data.

IP Insights

  • is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses.
  • designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers
  • useful for detecting suspicious login attempts from anomalous IP addresses.

Computer Vision – CV

Image Classification – MXNet

  • is a supervised learning algorithm that supports multi-label classification
  • takes an image as input and outputs one or more labels
  • uses a convolutional neural network (ResNet) that can be trained from scratch or trained using transfer learning when a large number of training images are not available.
  • recommended input format is Apache MXNet RecordIO. Also supports raw images in .jpg or .png format.

Image Classification – TensorFlow

  • is a supervised learning algorithm that supports transfer learning with many pretrained models from the TensorFlow Hub.
  • uses deep learning networks such as MobileNet, ResNet, Inception, and EfficientNet for image classification.
  • takes an image as input and outputs probability for each of the class labels.
  • supports fine-tuning pretrained models for specific image classification tasks.

Object Detection – MXNet

  • detects and classifies objects in images using a single deep neural network.
  • is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene.

Object Detection – TensorFlow

  • is a supervised learning algorithm that supports transfer learning with many pretrained models from the TensorFlow Model Garden.
  • takes an image as input and predicts bounding boxes and object labels.
  • uses deep learning networks such as MobileNet, ResNet, Inception, and EfficientNet for object detection.

Semantic Segmentation

  • provides a fine-grained, pixel-level approach to developing computer vision applications.
  • tags every pixel in an image with a class label from a predefined set of classes and is critical to an increasing number of CV applications, such as self-driving vehicles, medical imaging diagnostics, and robot sensing.
  • also provides information about the shapes of the objects contained in the image. The segmentation output is represented as a grayscale image, called a segmentation mask.
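
One way to picture a segmentation mask: if a model yields one score per class for every pixel, the mask keeps the index of the highest-scoring class at each position, and that index is stored as the pixel's gray value. The sketch below assumes that per-pixel score layout. The function name and the sample scores are made up.

```python
def segmentation_mask(class_scores):
    """class_scores[row][col] is a list with one score per class.
    Returns a 2-D grid holding the winning class index for each pixel."""
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in class_scores
    ]

# A 2x2 image with three hypothetical classes (0, 1, 2).
scores = [
    [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]],
    [[0.2, 0.2, 0.6], [0.3, 0.4, 0.3]],
]
mask = segmentation_mask(scores)
print(mask)
```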

SageMaker JumpStart – Pre-trained Models

  • SageMaker JumpStart provides pre-trained foundation models, pre-built solution templates, and example notebooks for popular ML problem types.
  • Foundation models include large language models (LLMs) such as LLaMA, Falcon, BLOOM, FLAN-T5, Mistral, and GPT-J for generative AI tasks.
  • Supports 15+ problem types including:
    • Text Generation, Text Summarization, Question Answering
    • Text Embedding, Named Entity Recognition
    • Image Classification, Object Detection, Instance Segmentation
    • Tabular Classification, Tabular Regression
    • Machine Translation, Sentence Pair Classification
  • Models can be fine-tuned on custom datasets and deployed directly from SageMaker Studio.

SageMaker Autopilot (AutoML)

  • SageMaker Autopilot automatically explores different solutions to find the best model for your data.
  • Analyzes data, selects algorithms, preprocesses data, trains models, and performs hyperparameter optimization.
  • Supports classification, regression, and time-series forecasting problem types.
  • Available as a no-code/low-code option through SageMaker Canvas for business analysts.

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet, and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day, and both the questions and answers might soon be outdated, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion, and correction.
  1. An analytics team at a leading organization wants to use anomaly detection to identify potential risks. Which Amazon SageMaker AI machine learning algorithm is best suited for identifying anomalies?
    1. Semantic segmentation
    2. K-nearest neighbors
    3. Latent Dirichlet Allocation (LDA)
    4. Random Cut Forest (RCF)
  2. An ML specialist team working for a marketing consulting firm wants to apply different marketing strategies per segment of its customer base. Online retailer purchase history from the last 5 years is available, and it has been decided to segment the customers based on their purchase history. Which type of machine learning algorithm would give you segmentation based on purchase history in the most expeditious manner?
    1. K-Nearest Neighbors (KNN)
    2. K-Means
    3. Semantic Segmentation
    4. Neural Topic Model (NTM)
  3. An ML specialist team is looking to improve the quality of searches for their library of documents that are uploaded in PDF, Rich Text Format, or ASCII text. The team wants to use machine learning to automate the identification of key topics for each of the documents. What machine learning resources are best suited for this problem? (Select TWO)
    1. BlazingText algorithm
    2. Latent Dirichlet Allocation (LDA) algorithm
    3. Topic Finder (TF) algorithm
    4. Neural Topic Model (NTM) algorithm
  4. A manufacturing company has a large set of labeled historical sales data. The company would like to predict how many units of a particular part should be produced each quarter. Which machine learning approach should be used to solve this problem?
    1. BlazingText algorithm
    2. Random Cut Forest (RCF)
    3. Principal component analysis (PCA)
    4. Linear regression
  5. An agency collects census information with responses for approximately 500 questions from each citizen. Which algorithm would help reduce the number of features?
    1. Factorization machines (FM) algorithm
    2. Latent Dirichlet Allocation (LDA) algorithm
    3. Principal component analysis (PCA) algorithm
    4. Random Cut Forest (RCF) algorithm
  6. A store wants to understand some characteristics of visitors to the store. The store has security video recordings from the past several years. The store wants to group visitors by hair style and hair color. Which solution will meet these requirements with the LEAST amount of effort?
    1. Object detection algorithm
    2. Latent Dirichlet Allocation (LDA) algorithm
    3. Random Cut Forest (RCF) algorithm
    4. Semantic segmentation algorithm
  7. A data scientist needs to build a model that can automatically classify product reviews as positive or negative. The dataset contains millions of labeled reviews. Which SageMaker AI built-in algorithm is MOST suitable for this text classification task with transfer learning?
    1. Sequence-to-Sequence (seq2seq)
    2. BlazingText in Word2Vec mode
    3. Text Classification – TensorFlow
    4. Neural Topic Model (NTM)
  8. A company wants to predict customer churn using a tabular dataset with both numerical and categorical features. The team wants an AutoML approach that automatically ensembles multiple models. Which SageMaker AI built-in algorithm should they use?
    1. XGBoost
    2. Linear Learner
    3. AutoGluon-Tabular
    4. Factorization Machines
  9. A team needs to detect objects in images and draw bounding boxes around them. They want to leverage pretrained models and use transfer learning. Which SageMaker AI algorithm should they choose?
    1. Image Classification – MXNet
    2. Semantic Segmentation
    3. Image Classification – TensorFlow
    4. Object Detection – TensorFlow
  10. A company has tabular data with many categorical features and wants a gradient-boosted trees algorithm that handles categorical features natively without manual encoding. Which algorithm is BEST suited?
    1. XGBoost
    2. LightGBM
    3. CatBoost
    4. Linear Learner
