data transfer solution for delivering real-time streaming data to destinations such as S3, Redshift, Elasticsearch Service, and Splunk.
is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration
delivers data in near real time (minimum 60 seconds), as it buffers incoming streaming data to a certain size or for a certain period of time before delivering it
supports batching, compression, and encryption of the data before loading it, minimizing the amount of storage used at the destination and increasing security
supports GZIP, ZIP, and Snappy compression formats; only GZIP is supported if the data is further loaded to Redshift.
supports out-of-the-box data transformation as well as custom transformation using a Lambda function to transform incoming source data and deliver the transformed data to destinations
uses at least once semantics for data delivery.
supports multiple producers as data sources, including Kinesis Data Streams, the KPL, Kinesis Agent, the Kinesis Data Firehose API (via the AWS SDK), CloudWatch Logs, CloudWatch Events, and AWS IoT
does NOT support consumers like Spark and KCL
supports interface VPC endpoint to keep traffic between the VPC and Kinesis Data Firehose from leaving the Amazon network.
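The producer side described above can be sketched with boto3. A minimal sketch, assuming a hypothetical delivery stream named "clickstream-demo"; it illustrates two practical details: Firehose does not insert delimiters between records (so the producer appends newlines), and put_record_batch accepts at most 500 records per call.

```python
import json

def encode_records(events):
    # Serialize events as newline-delimited JSON; Firehose concatenates
    # records as-is, so the producer must append the delimiter itself.
    return [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]

def chunk(records, size=500):
    # put_record_batch accepts at most 500 records per call.
    return [records[i:i + size] for i in range(0, len(records), size)]

# Hedged boto3 usage (stream name is an assumption, call left commented):
# import boto3
# firehose = boto3.client("firehose")
# for batch in chunk(encode_records(events)):
#     firehose.put_record_batch(DeliveryStreamName="clickstream-demo",
#                               Records=batch)

events = [{"user": i} for i in range(1200)]
batches = chunk(encode_records(events))
```

Because delivery is at-least-once (see above), downstream consumers should tolerate occasional duplicate records.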
Kinesis Data Streams vs Kinesis Data Firehose
Kinesis Data Analytics
helps analyze streaming data, gain actionable insights, and respond to the business and customer needs in real time.
reduces the complexity of building, managing, and integrating streaming applications with other AWS services
Redshift is a fast, fully managed data warehouse
provides a simple and cost-effective solution to analyze all the data using standard SQL and existing Business Intelligence (BI) tools.
manages the work needed to set up, operate, and scale a data warehouse, from provisioning the infrastructure capacity to automating ongoing administrative tasks such as backups and patching.
automatically monitors your nodes and drives to help you recover from failures.
only supports Single-AZ deployments.
replicates all the data within the data warehouse cluster when it is loaded and also continuously backs up your data to S3.
attempts to maintain at least three copies of your data (the original and replica on the compute nodes and a backup in S3).
supports cross-region snapshot replication to another region for disaster recovery
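Cross-region snapshot copy for disaster recovery can be enabled with a single boto3 call. A minimal sketch; the cluster identifier, destination region, and retention period are illustrative assumptions:

```python
# Parameters for enabling cross-region snapshot copy (all values hypothetical).
params = {
    "ClusterIdentifier": "analytics-cluster",  # hypothetical cluster name
    "DestinationRegion": "us-west-2",          # DR region for snapshot copies
    "RetentionPeriod": 7,                      # days to keep copied snapshots
}

# Hedged boto3 usage (call left commented; requires credentials and a cluster):
# import boto3
# redshift = boto3.client("redshift")
# redshift.enable_snapshot_copy(**params)
```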
Redshift supports four distribution styles: AUTO, EVEN, KEY, and ALL.
KEY distribution uses a single column as the distribution key (DISTKEY) and places matching values on the same node slice
Even distribution distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column
ALL distribution replicates the whole table to every compute node.
AUTO distribution lets Redshift assign an optimal distribution style based on the size of the table data
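The styles above map directly to CREATE TABLE attributes. A hedged sketch with made-up table and column names, carrying the DDL as strings:

```python
# Hypothetical DDL illustrating two Redshift distribution styles.
ddl_key = """
CREATE TABLE sales (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);  -- rows with matching customer_id land on the same slice
"""

ddl_all = """
CREATE TABLE dim_region (
    region_id   INT,
    region_name VARCHAR(64)
)
DISTSTYLE ALL;          -- small dimension table copied to every compute node
"""
```

KEY suits large fact tables joined on the distribution key; ALL suits small, slowly changing dimension tables.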
Redshift supports Compound and Interleaved sort keys
A compound key is made up of all of the columns listed in the sort key definition, in the order they are listed. It is most efficient when query predicates, such as filters and joins, apply to a prefix of the sort key columns, in order.
An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
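The contrast between the two sort key types can be illustrated with hypothetical DDL (table and column names are assumptions):

```python
# Hypothetical DDL contrasting compound and interleaved sort keys.
ddl_compound = """
CREATE TABLE events_compound (
    event_date DATE,
    user_id    BIGINT,
    action     VARCHAR(32)
)
COMPOUND SORTKEY (event_date, user_id);
-- efficient when filters use a prefix, e.g. event_date alone
-- or event_date plus user_id
"""

ddl_interleaved = """
CREATE TABLE events_interleaved (
    event_date DATE,
    user_id    BIGINT,
    action     VARCHAR(32)
)
INTERLEAVED SORTKEY (event_date, user_id);
-- equal weight per column: filtering on user_id alone is also efficient
"""
```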
Column compression encodings CANNOT be changed once the table is created.
Redshift provides query queues for Workload Management (WLM) to manage concurrency and resource planning. It is a best practice to have separate queues for long-running, resource-intensive queries and for fast queries that don't require large amounts of memory and CPU
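Queue setup is expressed through the cluster's wlm_json_configuration parameter, a JSON array of queue definitions. A minimal sketch of the two-queue best practice above; query group names, concurrency levels, and memory splits are illustrative assumptions:

```python
import json

# Illustrative WLM configuration: one queue for long-running ETL queries,
# one for short dashboard queries, and a default queue (values are made up).
wlm_config = [
    {"query_group": ["etl"],       "query_concurrency": 2,
     "memory_percent_to_use": 60},
    {"query_group": ["dashboard"], "query_concurrency": 10,
     "memory_percent_to_use": 30},
    {"query_concurrency": 5},  # default queue for everything else
]

# The parameter value is the JSON-serialized array.
wlm_json = json.dumps(wlm_config)
```

Queries are routed to a queue by setting the query_group session variable before running them.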