AWS Redshift Advanced

AWS Redshift Advanced

  • Redshift Distribution Style determines how data is distributed across compute nodes and helps minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.
  • Redshift enhanced VPC routing forces all COPY and UNLOAD traffic between the cluster and the data repositories through the VPC.
  • Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries.
  • Redshift Spectrum helps query and retrieve structured and semistructured data from files in S3 without having to load the data into Redshift tables.
  • Redshift Federated Query feature allows querying and analyzing data across operational databases, data warehouses, and data lakes.
  • Zero-ETL Integrations facilitate point-to-point data movement from operational databases to Redshift without the need to build and manage data pipelines.
  • Redshift Data Sharing enables live, transactionally consistent data sharing across Redshift clusters without copying data.
  • Redshift Serverless automatically provisions and scales data warehouse capacity without managing infrastructure.

Distribution Styles

  • Table distribution style determines how data is distributed across compute nodes and helps minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.
  • Redshift supports four distribution styles; AUTO, EVEN, KEY, or ALL.

KEY distribution

  • A single column acts as a distribution key (DISTKEY) and helps place matching values on the same node slice.
  • As a rule of thumb, choose a column that:
    • Is uniformly distributed – Otherwise skew data will cause unbalances in the volume of data that will be stored in each compute node leading to undesired situations where some slices will process bigger amounts of data than others and causing bottlenecks.
    • acts as a JOIN column – for tables related to dimensions tables (star-schema), it is better to choose as DISTKEY the field that acts as the JOIN field with the larger dimension table, so that matching values from the common columns are physically stored together, reducing the amount of data that needs to be broadcasted through the network.

EVEN distribution

  • distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column
  • Choose EVEN distribution
    • when the table does not participate in joins
    • when there is not a clear choice between KEY and ALL distribution.

ALL distribution

  • Whole table is replicated in every compute node.
  • ensures that every row is collocated for every join that the table participates in.
  • ideal for relatively slow-moving tables, tables that are not updated frequently or extensively.
  • Small dimension tables DO NOT benefit significantly from ALL distribution, because the cost of redistribution is low.

AUTO distribution

  • Redshift assigns an optimal distribution style based on the size of the table data for e.g. apply ALL distribution for a small table and as it grows changes it to Even distribution
  • Amazon Redshift applies AUTO distribution, by default.
  • Redshift’s automatic table optimization (ATO) continuously monitors query patterns and can automatically adjust distribution keys and sort keys for optimal performance.

Sort Key

  • Sort keys define the order in which the data will be stored.
  • Sorting enables efficient handling of range-restricted predicates.
  • Only one sort key per table can be defined, but it can be composed of one or more columns.
  • Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If the query uses a range-restricted predicate, the query processor can use the min and max values to rapidly skip over large numbers of blocks during table scans
  • The are two kinds of sort keys in Redshift: Compound and Interleaved.

Compound Keys

  • A compound key is made up of all of the columns listed in the sort key definition, in the order, they are listed.
  • A compound sort key is more efficient when query predicates use a prefix, or query’s filter applies conditions, such as filters and joins, which is a subset of the sort key columns in order.
  • Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY and ORDER BY.

Interleaved Sort Keys

  • An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
  • An interleaved sort key is more efficient when multiple queries use different columns for filters.
  • Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.
  • Use cases involve performing ad-hoc multi-dimensional analytics, which often requires pivoting, filtering, and grouping data using different columns as query dimensions.
  • Note: AWS recommends using compound sort keys for most workloads. Interleaved sort keys require more maintenance (VACUUM REINDEX) and have higher overhead.

Constraints

  • Redshift does not support Indexes.
  • Redshift supports UNIQUE, PRIMARY KEY, and FOREIGN KEY constraints, however, they are only for informational purposes.
  • Redshift does not perform integrity checks for these constraints and is used by the query planner, as hints, in order to optimize executions.
  • Redshift does enforce NOT NULL column constraints.

Redshift Enhanced VPC Routing

  • Redshift enhanced VPC routing forces all COPY and UNLOAD traffic between the cluster and the data repositories through the VPC.
  • Without enhanced VPC routing, Redshift would route traffic through the internet, including traffic to other services within the AWS network.
  • Enhanced VPC routing is now supported for zero-ETL integration warehouses (as of September 2024), enabling secure data replication within the VPC.

Redshift Workload Management

  • Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries.
  • Redshift provides query queues, in order to manage concurrency and resource planning. Each queue can be configured with the following parameters:
    • Slots: number of concurrent queries that can be executed in this queue.
    • Working memory: percentage of memory assigned to this queue.
    • Max. Execution Time: the amount of time a query is allowed to run before it is terminated.
  • Queries can be routed to different queues using Query Groups and User Groups.
  • As a rule of thumb, it is considered a best practice to have separate queues for long running resource-intensive queries and fast queries that don’t require big amounts of memory and CPU.
  • By default, Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one.
  • A maximum of eight queues can be defined, with each queue configured with a maximum concurrency level of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.
  • Redshift WLM supports two modes – Manual and Automatic
    • Automatic WLM supports queue priorities.
    • Automatic WLM is the recommended mode and uses ML to dynamically allocate resources.
  • Query Monitoring Rules (QMR) define metrics-based performance boundaries for WLM queues and specify actions when a query exceeds those boundaries.
    • Up to 25 rules per queue, with a limit of 25 rules across all queues.
    • Each rule includes up to three conditions (predicates) and one action (log, cancel, hop, or change priority).
    • Queue-based QMR is now supported in Redshift Serverless (2026), enabling granular workload control.

Redshift Concurrency Scaling

  • Concurrency Scaling helps support thousands of concurrent users and concurrent queries, with consistently fast query performance.
  • With Concurrency scaling, Redshift automatically adds additional cluster capacity to process an increase in both read and write queries.
  • Queries run on the main cluster or a concurrency-scaling cluster returns the most current data.
  • Queries sent to the concurrency-scaling cluster can be managed by configuring WLM queues.
  • Concurrency scaling now supports more types of write queries (INSERT, CREATE TABLE AS, UPDATE, DELETE), expanding beyond read-only scaling.

Redshift Short Query Acceleration – SQA

  • Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running queries.
  • SQA runs short-running queries in a dedicated space, so that SQA queries aren’t forced to wait in queues behind longer queries.
  • SQA only prioritizes queries that are short-running and are in a user-defined queue.

Redshift Loading Data

  • A COPY command is the most efficient way to load a table.
    • COPY command is able to read from multiple data files or multiple data streams simultaneously.
    • Redshift allocates the workload to the cluster nodes and performs the load operations in parallel, including sorting the rows and distributing data across node slices.
    • COPY command supports loading data from S3, EMR, DynamoDB, and remote hosts such as EC2 instances using SSH.
    • COPY supports decryption and can decrypt the data as it performs the load if the data is encrypted
    • COPY can then speed up the load process by uncompressing the files as they are read if the data is compressed.
    • COPY command can be used with COMPUPDATE set to ON to analyze and apply compression automatically based on sample data.
    • Optimizing storage for narrow tables (multiple rows few columns) by using Single COPY command instead of multiple COPY commands, as it would not work well due to hidden fields and compression issues.
  • Auto Copy
    • Auto-copy (GA October 2024) provides the ability to automate copy statements by tracking S3 folders and ingesting new files without customer intervention.
    • Without Auto-copy, a copy statement immediately starts the file ingestion process for existing files.
    • Auto-copy extends the existing copy command and provides the ability to
      • Automate file ingestion process by monitoring specified S3 paths for new files
      • Re-use copy configurations, reducing the need to create and run new copy statements for repetitive ingestion tasks and
      • Keep track of loaded files to avoid data duplication.
  • Streaming Ingestion
    • Redshift supports streaming ingestion from Amazon Kinesis Data Streams, Amazon MSK, Confluent Managed Cloud, and self-managed Apache Kafka clusters.
    • Streaming ingestion uses materialized views to ingest data from streams directly into Redshift tables for near real-time analytics.
    • Supports cascading refresh of nested materialized views on streaming sources (2025).
  • INSERT command
    • Clients can connect to Amazon Redshift using ODBC or JDBC and issue ‘insert’ SQL commands to insert the data.
    • INSERT command is much less efficient than using COPY as they are routed through the single leader node.

Redshift Resizing Cluster

  • Elastic resize
    • Use elastic resize to change the node type, number of nodes, or both.
    • If only the number of nodes is changed, then queries are temporarily paused and connections are held open if possible.
    • During the resize operation, the cluster is read-only.
    • Elastic resize takes 10–15 minutes.
  • Classic resize
    • Use classic resize to change the node type, number of nodes, or both.
    • During the resize operation, data is copied to a new cluster and the source cluster is read-only
    • Classic resize takes 2 hours – 2 days or longer, depending on the data’s size
  • Snapshot and restore with classic resize
    • To keep the cluster available during a classic resize, create a snapshot, make a copy of an existing cluster, then resize the new cluster.

Redshift Spectrum

  • Redshift Spectrum helps query and retrieve structured and semistructured data from files in S3 without having to load the data into Redshift tables.
  • Redshift Spectrum queries employ massive parallelism to execute very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in S3.
  • Multiple clusters can concurrently query the same dataset in S3 without the need to make copies of the data for each cluster.
  • Redshift Spectrum resides on dedicated Redshift servers that are independent of the existing cluster.
  • Redshift Spectrum pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer.
  • Redshift Spectrum also scales automatically, based on the demands of the queries, and can potentially use thousands of instances to take advantage of massively parallel processing.
  • Supports external data catalog using Glue, Athena, or Hive metastore
  • Supports querying Apache Iceberg tables and S3 Tables (purpose-built tabular storage with Iceberg support, launched at re:Invent 2024).
  • Iceberg query performance has improved up to 3x year-over-year through optimizations including Glue Data Catalog statistics, dynamic partition elimination, and parallel manifest file processing.
  • Supports incremental refresh for materialized views on data lake tables, eliminating the need for full recomputation when new data arrives.
  • Redshift cluster and the S3 bucket must be in the same AWS Region.
  • Redshift Spectrum external tables are read-only. You can’t COPY or INSERT to an external table.

Redshift Federated Query

  • Redshift Federated Query feature allows querying and analyzing data across operational databases, warehouses, and lakes.
  • Redshift Federated Query allows integrating queries on live data in RDS for PostgreSQL, Aurora PostgreSQL, RDS for MySQL, and Aurora MySQL with queries across Redshift and S3.
  • Supports both PostgreSQL and MySQL engines for federated access.

Zero-ETL Integrations

  • Zero-ETL integrations facilitate point-to-point data movement from operational databases to Redshift without the need to build and manage custom data pipelines.
  • Provides near real-time analytics on transactional data within seconds of it being written to the source.
  • Supported Sources:
    • Amazon Aurora MySQL-Compatible Edition (first zero-ETL source)
    • Amazon Aurora PostgreSQL-Compatible Edition (GA October 2024)
    • Amazon RDS for MySQL (GA September 2024)
    • Amazon DynamoDB (GA October 2024)
    • Self-managed databases (MySQL, PostgreSQL) via CDC replication
    • Enterprise Applications (re:Invent 2024): Salesforce, Zendesk, ServiceNow, SAP, Facebook Ads, Instagram Ads, Pardot, and Zoho CRM
  • Key Features:
    • Data filtering to selectively extract tables and schemas using regular expressions
    • Support for incremental and auto-refresh materialized views on replicated data
    • Configurable change data capture (CDC) refresh rates
    • Cross-account integrations within the same region
    • Supports both Redshift Serverless workgroups and provisioned clusters using RA3 instance types
    • Compatible with enhanced VPC routing and Multi-AZ deployments

Redshift Data Sharing

  • Redshift Data Sharing allows securely sharing live, transactionally consistent data between Redshift clusters without physically copying or moving data.
  • Supports cross-account and cross-Region data sharing.
  • For cross-account data sharing, both the producer and consumer cluster must be encrypted.
  • Producer clusters create datashares; consumer clusters associate with them via a two-way handshake for cross-account sharing.
  • Multi-data warehouse writes through data sharing (GA November 2024) allows writing to shared Redshift databases from multiple data warehouses, enabling distributed ETL workloads.
  • Supports data sharing with data lake tables, enabling unified access across warehouses and data lakes.
  • Works with both RA3 provisioned clusters and Redshift Serverless.
  • Integrated with AWS Lake Formation for fine-grained access control on shared data.

Redshift Serverless

  • Redshift Serverless automatically provisions and scales data warehouse capacity to deliver fast performance without managing infrastructure.
  • Pay only for compute capacity when the data warehouse is active, measured in Redshift Processing Units (RPUs).
  • Each RPU provides 16 GB of memory; base capacity ranges from 4 RPUs to 1024 RPUs.
  • Starting capacity as low as 4 RPUs ($1.50/hour), making it cost-effective for smaller workloads.
  • AI-driven scaling and optimization (GA October 2024, default for new workgroups April 2026):
    • Automatically learns workload patterns and adjusts compute resources based on query complexity, data volume, and scan size.
    • Offers a price-performance slider with five profiles from “Optimized for Cost” to “Optimized for Performance.”
    • Deploys automatic optimizations including materialized views and table design optimization.
    • Up to 10x price-performance improvement for variable workloads.
  • Serverless Reservations (April 2025): Commit to specific RPUs for a one-year term with 20% (no-upfront) or 24% (all-upfront) discount off on-demand rates.
  • Supports all Redshift features including data sharing, streaming ingestion, federated queries, and zero-ETL integrations.

Redshift Multi-AZ Deployments

  • Multi-AZ deployments (GA November 2023 for RA3 clusters) support running the data warehouse across multiple Availability Zones simultaneously.
  • Provides high availability by continuing operations during unforeseen failure scenarios in a single AZ.
  • Available for RA3 provisioned clusters in most commercial regions and GovCloud (US).
  • Compatible with zero-ETL integrations for highly available near real-time analytics.

Redshift Node Types

  • RG Instances (GA May 2026) – Latest generation powered by AWS Graviton processors
    • Up to 2.2x faster for data warehouse workloads and 2.4x faster for data lake workloads compared to RA3.
    • 30% lower price per vCPU compared to RA3 instances.
    • Includes a custom-built vectorized data lake query engine that processes Apache Iceberg and Parquet data on cluster nodes.
    • Available in rg.xlarge and rg.4xlarge node types.
    • Recommended upgrade path from RA3 instances.
  • RA3 Instances – Managed storage with separate compute and storage scaling
    • Available in ra3.xlplus, ra3.4xlarge, ra3.16xlarge, and ra3.large sizes.
    • Managed storage automatically tiers data between high-performance SSD and S3.
    • RA3.large (GA October 2024) offers a cost-effective migration path from DC2.large.
  • DC2 Instances – Dense compute with local SSD storage (legacy, migration to RA3/RG recommended)

Redshift Generative AI Integration

  • Amazon Q generative SQL (GA September 2024) in Redshift Query Editor allows users to express queries in natural language and receive SQL code recommendations.
  • Amazon Bedrock Integration (October 2024) enables invoking large language models (LLMs) from SQL commands for tasks like text generation, summarization, sentiment analysis, and language translation.
  • Amazon Bedrock Knowledge Bases supports natural language querying to retrieve structured data from Redshift warehouses, automatically translating questions into SQL.

Redshift SageMaker Lakehouse

  • Amazon SageMaker Lakehouse (re:Invent 2024) unifies data across S3 data lakes and Redshift warehouses.
  • Provides access via Apache Iceberg open standards for use with any Iceberg-compatible engine.
  • Existing Redshift data warehouses can be published to SageMaker Lakehouse, opening warehouse data with Iceberg REST API.
  • Supports creating new data lake tables using Redshift Managed Storage (RMS) as native storage.
  • Offers integrated access controls and fine-grained permissions through Lake Formation across all engines.

Redshift Behavior Changes and Deprecations

  • Python UDFs End of Support (June 30, 2026)
    • Creation of new Python UDFs blocked since October 30, 2025.
    • Existing Python UDFs will stop functioning after June 30, 2026.
    • Migration: Use Lambda UDFs which provide better integration, flexibility, scalability, and security.
  • ODBC 1.x Driver End of Support: September 30, 2026. Migrate to ODBC 2.x driver.
  • Minimum TLS Version: TLS 1.2 minimum required starting January 31, 2026.
  • Materialized View Auto-REFRESH Behavior Change: After February 27, 2026, auto-refresh respects workload priorities.

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. A Redshift data warehouse has different user teams that need to query the same table with very different query types. These user teams are experiencing poor performance. Which action improves performance for the user teams in this situation?
    1. Create custom table views.
    2. Add interleaved sort keys per team.
    3. Maintain team-specific copies of the table.
    4. Add support for workload management queue hopping.
  2. A company needs to replicate data from their Aurora PostgreSQL database to Redshift for near real-time analytics without building custom ETL pipelines. Which approach requires the LEAST operational overhead?
    1. Set up AWS Glue jobs to periodically extract and load data
    2. Use Amazon Kinesis Data Firehose to stream changes
    3. Configure Aurora PostgreSQL zero-ETL integration with Redshift
    4. Create Lambda functions triggered by DynamoDB Streams
  3. An organization uses Redshift Serverless and wants to optimize for cost during off-peak hours while maintaining performance during peak business hours. Which feature best addresses this requirement?
    1. Manual RPU scaling with CloudWatch alarms
    2. Concurrency Scaling with WLM queue configuration
    3. AI-driven scaling and optimization with cost-optimized profile
    4. Scheduled pause and resume of the Serverless workgroup
  4. A company wants to share live data from their Redshift cluster with a partner organization’s Redshift cluster in a different AWS account without copying data. What is the recommended approach?
    1. Use Redshift Spectrum with cross-account S3 access
    2. Set up AWS Data Exchange for data delivery
    3. Configure cross-account Redshift Data Sharing
    4. Use AWS Glue ETL to replicate data to the partner account
  5. A team wants to perform sentiment analysis on customer feedback stored in Redshift without moving data to a separate ML service. Which Redshift feature enables this?
    1. Redshift ML with SageMaker Autopilot
    2. Export to S3 and use Comprehend
    3. Amazon Redshift integration with Amazon Bedrock using SQL commands
    4. Redshift federated query to an NLP endpoint
  6. A company is migrating from RA3 instances and wants better price-performance for both data warehouse and data lake workloads. Which instance type should they consider?
    1. DC2.8xlarge for compute-intensive workloads
    2. RA3.16xlarge with AQUA enabled
    3. RG instances powered by AWS Graviton
    4. Redshift Serverless with 1024 RPU base capacity

12 thoughts on “AWS Redshift Advanced

  1. Hi,

    I’m confused whether the answer is B or D . any thoughts ?

    A data engineer is running a DWH on a 25-node Redshift cluster of a SaaS service. The data engineer needs to build a dashboard that will be used by customers. Five big customers represent 80% of usage, and there is a long tail of dozens of smaller customers. The data engineer has selected the dashboarding tool.
    How should the data engineer make sure that the larger customer workloads do NOT interfere with the smaller customer workloads?
    A. Apply query filters based on customer-id that can NOT be changed by the user and apply distribution keys on customer-id.
    B. Place the largest customers into a single user group with a dedicated query queue and place the rest of the customers into a different query queue.
    C. Push aggregations into an RDS for Aurora instance. Connect the dashboard application to Aurora rather than Redshift for faster queries.
    D. Route the largest customers to a dedicated Redshift cluster. Raise the concurrency of the multi-tenant Redshift cluster to accommodate the remaining customers.

    1. Seems B, creating dedicated query queue for large customers and other queue for rest.

  2. Hi,

    Could you help answering this question

    A company is building a new application in AWS. The architect needs to design a system to collect application log events. The design should be a repeatable pattern that minimizes data loss if an application instance fails, and keeps a durable copy of a log data for at least 30 days.

    What is the simplest architecture that will allow the architect to analyze the logs?

    A. Write them directly to a Kinesis Firehose. Configure Kinesis Firehose to load the events into an Amazon Redshift cluster for analysis.
    B. Write them to a file on Amazon Simple Storage Service (S3). Write an AWS Lambda function that runs in response to the S3 event to load the events into Amazon Elasticsearch Service for analysis.
    C. Write them to the local disk and configure the Amazon CloudWatch Logs agent to load the data into CloudWatch Logs and subsequently into Amazon Elasticsearch Service.
    D. Write them to CloudWatch Logs and use an AWS Lambda function to load them into HDFS on an Amazon Elastic MapReduce (EMR) cluster for analysis.

      1. I am thinking B (Firehose->s3->Lambda->ES) and my reasoning is that Although this is not as simple as writing directly to RedShift, ES is more suited for analysis on Application logs. I agree its not the SIMPLEST but it makes more sense than RedShift.
        What do you think?

  3. On a second thought, I believe you are right Jayendra, the simplest would be fireshost to Redshift.

  4. I think its B.
    Key points – Keeps a durable copy of a log data – S3. On S3 Event Trigger Lambda to feed into Elastic Search.
    Log Analysis – Typically Elastic Search is a good choice .

  5. Pingback: AWS Redshift

Comments are closed.