AWS Certification – Analytics Services – Cheat Sheet


Kinesis Data Streams

  • enables real-time processing of streaming data at massive scale
  • provides ordering of records, as well as the ability to read and/or replay records in the same order to multiple Kinesis applications
  • data is replicated across three Availability Zones within a region and preserved for 24 hours by default, which can be extended up to 7 days
  • streams can be scaled using multiple shards, based on the partition key, with each shard providing a capacity of 1 MB/sec data input and 2 MB/sec data output, and up to 1,000 PUT records per second
  • supports data encryption, either client-side encryption before pushing the data to the stream or server-side encryption
  • Kinesis vs SQS
    • real-time processing of streaming big data vs reliable, highly scalable hosted queue for storing messages
    • ordered records, as well as the ability to read and/or replay records in the same order vs no ordering guarantee for standard queues (FIFO queues do preserve order)
    • data retention of 24 hours, extendable to 7 days vs message retention configurable from 1 minute to 14 days, with messages removed once deleted by the consumer
    • supports multiple consumers vs a single consumer at a time, requiring multiple queues to deliver a message to multiple consumers
  • Producer & Consumers
    • the PutRecord and PutRecords APIs are synchronous, while the KPL (Kinesis Producer Library) supports both synchronous and asynchronous use cases (see the producer sketch after this list)
    • KCL (Kinesis Client Library) uses a unique DynamoDB table to keep track of the application’s state, so if a Kinesis Data Streams application receives provisioned-throughput exceptions, increase the provisioned throughput for the DynamoDB table
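
A minimal producer sketch using boto3, covering both write APIs and server-side encryption; the stream name, region, and record payloads are hypothetical placeholders:

```python
import json
import boto3

# Assumed: a stream named "orders" already exists in this region.
kinesis = boto3.client("kinesis", region_name="us-east-1")

# Single synchronous write: records with the same partition key
# land on the same shard, preserving their relative order.
kinesis.put_record(
    StreamName="orders",
    Data=json.dumps({"order_id": 42, "amount": 9.99}).encode(),
    PartitionKey="customer-1234",
)

# Batched synchronous write (up to 500 records per call).
kinesis.put_records(
    StreamName="orders",
    Records=[
        {"Data": b'{"order_id": 43}', "PartitionKey": "customer-1234"},
        {"Data": b'{"order_id": 44}', "PartitionKey": "customer-5678"},
    ],
)

# Optional: enable server-side encryption with a KMS key.
kinesis.start_stream_encryption(
    StreamName="orders",
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",
)
```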

Kinesis Firehose

  • is a fully managed service
  • data transfer solution for delivering real-time streaming data to destinations such as S3, Redshift, Elasticsearch Service, and Splunk
  • is NOT real time (min. 60 secs) as it buffers incoming streaming data to a certain size or for a certain period of time before delivering it
  • supports multiple producers as data sources, including a Kinesis data stream, Kinesis Agent, the Kinesis Data Firehose API (via the AWS SDK), CloudWatch Logs, CloudWatch Events, and AWS IoT
  • supports out-of-the-box data transformation, as well as custom transformation using a Lambda function, to transform incoming source data and deliver the transformed data to destinations (see the handler sketch after this list)
  • supports interface VPC endpoint to keep traffic between the Amazon VPC and Kinesis Data Firehose from leaving the Amazon network.
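
As a sketch of the custom-transformation hook: a Firehose transformation Lambda receives base64-encoded records and must return each record with its recordId, a result status, and re-encoded data. The uppercasing step here is just an illustrative placeholder transform:

```python
import base64

def lambda_handler(event, context):
    """Firehose data-transformation Lambda: transform each record
    and hand it back to Firehose for delivery to the destination."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])
        transformed = payload.decode("utf-8").upper()  # placeholder transform
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```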

Kinesis Data Streams vs Kinesis Firehose

  • Data Streams provides real-time processing with ordering, replay, and multiple consumers, but you manage shard scaling and consumer applications
  • Firehose is fully managed and near real time (min. 60 secs buffering), delivering data directly to S3, Redshift, Elasticsearch Service, and Splunk

Redshift

  • Redshift is a fast, fully managed data warehouse
  • provides a simple and cost-effective solution to analyze all the data using standard SQL and existing Business Intelligence (BI) tools
  • manages the work needed to set up, operate, and scale a data warehouse, from provisioning the infrastructure capacity to automating ongoing administrative tasks such as backups and patching
  • automatically monitors your nodes and drives to help you recover from failures.
  • only supports Single-AZ deployments.
  • replicates all the data within the data warehouse cluster when it is loaded and also continuously backs up your data to S3.
  • attempts to maintain at least three copies of your data (the original and replica on the compute nodes and a backup in S3).
  • supports cross-region snapshot replication to another region for disaster recovery
  • Redshift supports four distribution styles: AUTO, EVEN, KEY, and ALL (see the table-definition sketch after this list)
    • KEY distribution uses a single column as the distribution key (DISTKEY) and helps place matching values on the same node slice
    • EVEN distribution distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column
    • ALL distribution replicates the whole table to every compute node
    • AUTO distribution lets Redshift assign an optimal distribution style based on the size of the table data
  • Redshift supports Compound and Interleaved sort keys
    • A compound key is made up of all of the columns listed in the sort key definition, in the order they are listed, and is most efficient when query predicates (filters and joins) use a prefix of the sort key columns, in order
    • An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
  • Column encodings CANNOT be changed once created.
  • Redshift provides query queues for Workload Management, in order to manage concurrency and resource planning. It is a best practice to have separate queues for long-running, resource-intensive queries and for fast queries that don’t require large amounts of memory and CPU
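
A minimal sketch of defining distribution and sort keys at table creation, run through the Redshift Data API via boto3; the cluster, database, user, and table names are all hypothetical:

```python
import boto3

# Assumed: an existing cluster "analytics-cluster" with database "dev".
client = boto3.client("redshift-data", region_name="us-east-1")

ddl = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(10, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)                       -- co-locate rows that join on customer_id
COMPOUND SORTKEY (sale_date, customer_id);  -- efficient for predicates on the leading column(s)
"""

client.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=ddl,
)
```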

EMR

  • is a web service that utilizes a hosted Hadoop framework running on the web-scale infrastructure of EC2 and S3
  • launches all nodes for a given cluster in the same Availability Zone, which improves performance as it provides a higher data access rate
  • seamlessly supports Reserved, On-Demand and Spot Instances
  • consists of a Master node for management and Slave nodes, which consist of Core nodes that hold data and Task nodes that perform tasks only
  • is fault tolerant for slave node failures and continues job execution if a slave node goes down
  • does not automatically provision another node to take over for a failed slave node
  • supports Persistent and Transient cluster types
    • Persistent clusters continue to run until explicitly terminated
    • Transient clusters terminate once the job steps are completed
  • supports EMRFS, which allows S3 to be used as a durable, highly available data store (see the launch sketch after this list)
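
A sketch of launching a transient cluster with boto3, which terminates once its steps finish and reads from S3 via EMRFS; the cluster name, release label, instance types, and S3 paths are illustrative placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="nightly-etl",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # Transient cluster: terminate once all steps complete.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            # The job script lives on S3 and is read through EMRFS.
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```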

Data Pipeline

  • an orchestration service that helps define data-driven workflows to automate and schedule regular data movement and data processing activities (see the sketch after this list)
  • integrates with on-premises and cloud-based storage systems
  • allows scheduling, retry, and failure logic for the workflows
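
A rough sketch of creating and activating a pipeline with boto3; the pipeline name, IAM roles, and the shell command are hypothetical, and the definition is deliberately abbreviated to a single on-demand activity:

```python
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Create an empty pipeline; uniqueId makes the call idempotent.
pipeline_id = dp.create_pipeline(
    name="daily-copy", uniqueId="daily-copy-v1"
)["pipelineId"]

# Minimal on-demand definition: one shell activity on an EC2 resource.
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ]},
        {"id": "MyEC2", "name": "MyEC2", "fields": [
            {"key": "type", "stringValue": "Ec2Resource"},
            {"key": "terminateAfter", "stringValue": "30 Minutes"},
        ]},
        {"id": "CopyStep", "name": "CopyStep", "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "aws s3 cp s3://src/x s3://dst/x"},
            {"key": "runsOn", "refValue": "MyEC2"},
        ]},
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```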
