SageMaker Built-in Algorithms
BlazingText algorithm
- provides highly optimized implementations of the Word2vec and text classification algorithms.
- Word2vec algorithm
  - maps words to high-quality distributed vectors; the resulting representations are called word embeddings.
  - word embeddings capture the semantic relationships between words.
  - useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, and machine translation.
  - BlazingText provides the Skip-gram and continuous bag-of-words (CBOW) training architectures for Word2vec.
- Text classification
  - is an important task for applications performing web searches, information retrieval, ranking, and document classification.
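The difference between the Skip-gram and CBOW architectures comes down to what is predicted from what. The sketch below only builds the training pairs each architecture would learn from. The function names and the `window` parameter are illustrative assumptions, not part of BlazingText, and no embedding training is performed.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: predict each context word from the center word.

    Returns a list of (center, context_word) pairs.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """CBOW: predict the center word from its surrounding context words.

    Returns a list of (context_words, center) pairs.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


sentence = "the quick brown fox".split()
print(skipgram_pairs(sentence, window=1))
print(cbow_pairs(sentence, window=1))
```

Skip-gram yields one training example per (center, neighbor) pair, while CBOW yields one example per position with the whole context as input.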
DeepAR forecasting algorithm
- is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).
- the trained model can generate forecasts for new time series that are similar to the ones it was trained on.
Factorization Machines algorithm
- is a general-purpose supervised learning algorithm used for both classification and regression tasks.
- is an extension of a linear model designed to economically capture interactions between features within high-dimensional sparse datasets.
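To make "a linear model extended with feature interactions" concrete, here is a minimal sketch of the standard second-order factorization machine score: a bias, a linear term, and a pairwise term in which each feature pair's weight is the dot product of two small factor vectors. The names `fm_predict`, `fm_predict_naive`, `w0`, `w`, and `v` are assumptions for illustration. This is the textbook formulation, not SageMaker's implementation.

```python
def fm_predict(x, w0, w, v):
    """Second-order factorization machine score for one feature vector.

    x:  list of n feature values
    w0: bias term
    w:  list of n linear weights
    v:  list of n factor vectors, each of length k
    Uses the O(n * k) identity instead of looping over all feature pairs.
    """
    n = len(x)
    k = len(v[0])
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        score += 0.5 * (s * s - s_sq)
    return score


def fm_predict_naive(x, w0, w, v):
    """Same score computed directly over all pairs i < j, for comparison."""
    n = len(x)
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            interaction = sum(a * b for a, b in zip(v[i], v[j]))
            score += interaction * x[i] * x[j]
    return score


# Sparse input: only features 0 and 2 are active.
x = [1.0, 0.0, 1.0]
w0, w = 0.1, [0.2, 0.3, 0.4]
v = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
print(fm_predict(x, w0, w, v), fm_predict_naive(x, w0, w, v))
```

Because inactive features contribute nothing to either sum, the cost scales with the number of non-zero features, which is why the model is economical on sparse data.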
Image classification algorithm
- a supervised learning algorithm that supports multi-label classification
- takes an image as input and outputs one or more labels
- uses a convolutional neural network (ResNet) that can be trained from scratch or, when a large number of training images is not available, trained using transfer learning.
- recommended input format is Apache MXNet RecordIO. Also supports raw images in .jpg or .png format.
IP Insights algorithm
- is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses.
- designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers
K-means algorithm
- is an unsupervised learning algorithm for clustering
- attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups
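The grouping idea behind k-means clustering can be sketched with the classic Lloyd iteration: assign each point to its nearest centroid, then move each centroid to the mean of its members. This is a minimal plain-Python illustration, not SageMaker's implementation. The function names and the choice to pass initial centroids explicitly (to keep the run deterministic) are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm starting from the given initial centroids.

    Returns (centroids, assignments), where assignments[i] is the index of
    the centroid that points[i] was assigned to in the last assignment step.
    """
    centroids = [tuple(float(c) for c in centroid) for centroid in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, assignments


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, labels = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centers, labels)
```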
K-nearest neighbors (k-NN) algorithm
- is an index-based algorithm.
- uses a non-parametric method for classification or regression.
- For classification problems, the algorithm queries the k points that are closest to the sample point and returns the most frequent label among them as the predicted label.
- For regression problems, the algorithm queries the k closest points to the sample point and returns the average of their target values as the predicted value.
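The two k-NN prediction rules can be sketched in a few lines. This version does a brute-force scan with Euclidean distance instead of building an index, so it only illustrates the voting and averaging steps. The function names and the `(features, target)` data layout are assumptions.

```python
import math
from collections import Counter


def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to query (Euclidean distance)."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]


def knn_classify(train, query, k):
    """Predict the most frequent label among the k nearest neighbors."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train, query, k):
    """Predict the mean target value of the k nearest neighbors."""
    targets = [target for _, target in k_nearest(train, query, k)]
    return sum(targets) / len(targets)


labeled = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((5, 5), "b"), ((6, 5), "b")]
print(knn_classify(labeled, (5.5, 5), k=3))

numeric = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]
print(knn_regress(numeric, (2,), k=3))
```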
Latent Dirichlet Allocation (LDA) algorithm
- is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories.
- used to discover a user-specified number of topics shared by documents within a text corpus.
Linear Learner algorithm
- linear models are supervised learning algorithms used for solving either classification or regression problems
Neural Topic Model (NTM) Algorithm
- is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution
- Topic modeling can be used to classify or summarize documents based on the topics detected or to retrieve information or recommend content based on topic similarities.
Object2Vec algorithm
- is a general-purpose neural embedding algorithm that is highly customizable
- can learn low-dimensional dense embeddings of high-dimensional objects.
Object Detection algorithm
- detects and classifies objects in images using a single deep neural network.
- is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene.
Principal Component Analysis (PCA)
- is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible
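As a rough illustration of dimensionality reduction, the sketch below finds the first principal component of a small dataset by power iteration on the covariance matrix, then projects each row onto it, reducing two features to one. It uses plain Python and is not how SageMaker computes PCA. The function names are assumptions, and the fixed all-ones starting vector is a simplification that fails if it happens to be orthogonal to the leading direction.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (means, direction) for the leading principal component.

    rows: list of equal-length numeric sequences (at least two rows).
    direction is a unit vector found by power iteration on the covariance matrix.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[value - m for value, m in zip(row, means)] for row in rows]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    vec = [1.0] * d  # simplistic start; assumed not orthogonal to the answer
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return means, vec


def project(row, means, direction):
    """Reduce one row to a single coordinate along the principal direction."""
    return sum((value - m) * c for value, m, c in zip(row, means, direction))


data = [(1, 2), (2, 4), (3, 6), (4, 8)]  # all points lie on the line y = 2x
means, direction = first_principal_component(data)
scores = [project(row, means, direction) for row in data]
print(direction, scores)
```

Since these points lie exactly on one line, a single component retains all of the variance. Real data would lose some information in the projection.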
Random Cut Forest (RCF)
- is an unsupervised algorithm for detecting anomalous data points within a data set.
Semantic segmentation algorithm
- provides a fine-grained, pixel-level approach to developing computer vision applications
SageMaker Sequence to Sequence (seq2seq)
- is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens.
- key use cases are machine translation (input a sentence in one language and predict that sentence in another language), text summarization (input a longer string of words and predict a shorter string that summarizes it), and speech-to-text (audio clips converted into output sentences in tokens)
XGBoost (eXtreme Gradient Boosting)
- is a popular and efficient open-source implementation of the gradient boosted trees algorithm.
- Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models
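The idea of combining weak models can be sketched with a toy gradient-boosting regressor: each round fits a one-split "stump" to the current residuals and adds a damped version of it to the ensemble. This is a minimal illustration for squared error on a single feature, not XGBoost, which is far more elaborate. All function names, the `rounds` and `learning_rate` defaults, and the model tuple layout are assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature, by squared error.

    Returns (threshold, left_value, right_value). Requires at least two
    distinct values in xs.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max: right side would be empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosted(xs, ys, rounds=20, learning_rate=0.5):
    """Fit an additive ensemble of stumps to the residuals, round by round."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        preds = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, preds)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and every stump's damped contribution."""
    base, learning_rate, stumps = model
    return base + sum(
        learning_rate * (left if x <= threshold else right)
        for threshold, left, right in stumps
    )


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = fit_boosted(xs, ys)
print([round(predict(model, x), 3) for x in xs])
```

Each stump alone is a poor predictor, but because every round targets what the ensemble still gets wrong, the summed prediction approaches the target step function.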