AWS SageMaker Built-in Algorithms Summary

SageMaker Built-in Algorithms

BlazingText algorithm

    • provides highly optimized implementations of the Word2vec and text classification algorithms.
    • Word2vec algorithm
      • maps words to high-quality distributed vectors; the resulting vector representation of a word is called a word embedding
      • word embeddings capture the semantic relationships between words.
      • useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, and machine translation
      • can be trained with either the Skip-gram or the continuous bag-of-words (CBOW) architecture
    • Text classification
      • is an important task for applications performing web searches, information retrieval, ranking, and document classification
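
The difference between the Skip-gram and CBOW architectures can be illustrated by the training examples each one builds from a sentence. The following is a minimal sketch, not BlazingText's implementation; the function names and the `window` parameter are hypothetical and chosen only for illustration.

```python
def skipgram_pairs(tokens, window=1):
    """Skip-gram: each center word is used to predict each word in its window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    """CBOW: the surrounding context words jointly predict the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


# Hypothetical toy sentence, already tokenized.
tokens = ["the", "quick", "brown", "fox"]
sg = skipgram_pairs(tokens)  # (input word, word to predict) tuples
cb = cbow_pairs(tokens)      # (context words, word to predict) tuples
```

Skip-gram produces one training example per (center, neighbor) pair, while CBOW produces one example per position with the whole context as input.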

DeepAR forecasting algorithm

    • is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).
    • the trained model can generate forecasts for new time series that are similar to the ones it was trained on.

Factorization machine

    • is a general-purpose supervised learning algorithm used for both classification and regression tasks.
    • is an extension of a linear model, designed to economically capture interactions between features in high-dimensional sparse datasets
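
To make "capturing interactions economically" concrete, here is a sketch of the standard second-order factorization machine score: a linear model plus a weight for every feature pair, where each pair weight is the dot product of two small latent vectors. This is the textbook formulation, not SageMaker's code; the names `fm_predict`, `fm_predict_naive`, `w0`, `w`, and `V` are illustrative assumptions.

```python
def fm_predict(x, w0, w, V):
    """Second-order factorization machine score for one feature vector.

    x: n feature values, w: n linear weights,
    V: n rows of k latent factors (one row per feature).
    The pairwise term is computed in O(n * k) instead of O(n^2 * k).
    """
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


def fm_predict_naive(x, w0, w, V):
    """Same score computed directly over all feature pairs, for comparison."""
    n = len(x)
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            score += dot * x[i] * x[j]
    return score


# Hypothetical toy parameters: 3 features, 2 latent factors per feature.
x = [1.0, 0.0, 2.0]
w0 = 0.5
w = [0.1, 0.2, 0.3]
V = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
fast = fm_predict(x, w0, w, V)
slow = fm_predict_naive(x, w0, w, V)
```

Both functions return the same score; the first avoids enumerating feature pairs, which matters when the number of features is large and most values are zero.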

Image classification algorithm

    • a supervised learning algorithm that supports multi-label classification
    • takes an image as input and outputs one or more labels
    • uses a convolutional neural network (ResNet) that can be trained from scratch, or trained using transfer learning when a large number of training images is not available.
    • recommended input format is Apache MXNet RecordIO. Also supports raw images in .jpg or .png format.

IP Insights

    • is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses.
    • designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers

K-means algorithm

    • is an unsupervised learning algorithm for clustering
    • attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups
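
As an illustration of how such groupings can be found, the sketch below runs plain Lloyd iterations: assign each point to its nearest center, then move each center to the mean of its points. It is a textbook version under simplifying assumptions, not SageMaker's implementation; the `kmeans` function and the explicitly supplied starting centers are hypothetical.

```python
def kmeans(points, centers, iterations=10):
    """Plain Lloyd iterations on tuples of floats.

    Returns the final centers and the point assignment from the last pass.
    """
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical toy data: two well-separated groups in 2D.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (5.0, 5.0)])
```

Starting centers are passed in here only to keep the example deterministic; practical implementations choose them with a randomized initialization strategy.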

K-nearest neighbors (k-NN) algorithm

    • is an index-based algorithm.
    • uses a non-parametric method for classification or regression.
    • For classification problems, the algorithm queries the k points that are closest to the sample point and returns the most frequent label among them as the predicted label.
    • For regression problems, the algorithm queries the k closest points to the sample point and returns the average of their target values as the predicted value.
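
The two prediction rules can be sketched with a brute-force neighbor search. This is only an illustration under simple assumptions (squared Euclidean distance, no index structure), not SageMaker's index-based implementation; all function names are hypothetical. The regression helper averages the neighbors' target values, that is, the quantity being predicted.

```python
from collections import Counter


def k_nearest(train, query, k):
    """Return the k training rows (features, target) closest to the query."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], query))
    return sorted(train, key=sq_dist)[:k]


def knn_classify(train, query, k=3):
    """Classification: majority label among the k nearest neighbors."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Regression: mean target value of the k nearest neighbors."""
    targets = [t for _, t in k_nearest(train, query, k)]
    return sum(targets) / len(targets)


# Hypothetical toy datasets.
labeled = [((1.0, 1.0), "a"), ((1.5, 1.0), "a"), ((5.0, 5.0), "b"),
           ((6.0, 5.0), "b"), ((1.0, 2.0), "b")]
valued = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

predicted_label = knn_classify(labeled, (1.2, 1.1), k=3)
predicted_value = knn_regress(valued, (2.1,), k=3)
```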

Latent Dirichlet Allocation (LDA) algorithm

    • is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories.
    • used to discover a user-specified number of topics shared by documents within a text corpus.

Linear Learner

    • is a supervised learning algorithm used for solving either classification or regression problems

Neural Topic Model (NTM) Algorithm

    • is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution
    • Topic modeling can be used to classify or summarize documents based on the topics detected or to retrieve information or recommend content based on topic similarities.

Object2Vec algorithm

    • is a general-purpose neural embedding algorithm that is highly customizable
    • can learn low-dimensional dense embeddings of high-dimensional objects.

Object Detection algorithm

    • detects and classifies objects in images using a single deep neural network.
    • is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene.

Principal Component Analysis

    • is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible
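
To show what reducing dimensionality while retaining information means, the sketch below finds the first principal component of a small 2D dataset and projects the data onto it. It uses power iteration on the covariance matrix, a simple stand-in chosen for readability and not the method SageMaker uses; the function name and the data are hypothetical.

```python
def first_principal_component(rows, iterations=100):
    """Return the leading principal direction and the 1D projection of the data."""
    n, d = len(rows), len(rows[0])
    # Center each feature on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by the covariance and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Project every centered row onto the direction v.
    projected = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projected


# Hypothetical toy data lying exactly on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, projected = first_principal_component(rows)
```

Because the toy points lie on a single line, one component is enough to describe them: two features are reduced to one without losing any variance.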

Random Cut Forest (RCF)

    • is an unsupervised algorithm for detecting anomalous data points within a dataset.

Semantic segmentation algorithm

    • provides a fine-grained, pixel-level approach to developing computer vision applications by tagging every pixel in an image with a class label from a predefined set of classes

SageMaker Sequence to Sequence (seq2seq)

    • is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens.
    • key use cases are machine translation (input a sentence in one language and predict what that sentence would be in another language), text summarization (input a longer string of words and predict a shorter string of words that summarizes it), and speech-to-text (audio clips converted into output sentences in tokens)

XGBoost (eXtreme Gradient Boosting)

    • is a popular and efficient open-source implementation of the gradient boosted trees algorithm.
    • Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models
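
The idea of combining many weak models can be sketched with a tiny gradient boosting loop for squared-error regression, where each weak model is a one-split decision stump fitted to the current residuals. This is a simplified illustration, not the XGBoost library or its algorithm; the function names, the `rounds` and `learning_rate` parameters, and the data are assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump on a single feature by least squares."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds; right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def gradient_boost(xs, ys, rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)
    return predict


def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical toy data with a step between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
mean_y = sum(ys) / len(ys)
baseline_error = mse(lambda x: mean_y, xs, ys)
boosted_error = mse(model, xs, ys)
```

Each stump alone is a weak predictor, but the ensemble's training error ends up well below that of the constant baseline.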