AWS Redshift Best Practices
📌 Last Updated: June 2026. Covers RG instances (Graviton-powered), Multidimensional Data Layouts (MDDL), Zero-ETL integrations, Auto-copy, Streaming Ingestion, AI-driven scaling for Serverless, and concurrency scaling for COPY commands.
Designing Tables
Distribution Style Selection
- Distribute the fact table and one dimension table on their common columns.
- A fact table can have only one distribution key. Any tables that join on another key aren’t collocated with the fact table.
- Choose one dimension to collocate based on how frequently it is joined and the size of the joining rows.
- Designate both the dimension table’s primary key and the fact table’s corresponding foreign key as the DISTKEY.
- Choose the largest dimension based on the size of the filtered dataset.
- Only the rows that are used in the join need to be distributed, so consider the size of the dataset after filtering, not the size of the table.
- Choose a column with high cardinality in the filtered result set.
- If you distribute a sales table on a date column, for e.g, you should probably get fairly even data distribution, unless most of the sales are seasonal
- However, if you commonly use a range-restricted predicate to filter for a narrow date period, most of the filtered rows occur on a limited set of slices and the query workload is skewed.
- Change some dimension tables to use ALL distribution.
- If a dimension table cannot be collocated with the fact table or other important joining tables, query performance can be improved significantly by distributing the entire table to all of the nodes.
- Using ALL distribution multiplies storage space requirements and increases load times and maintenance operations.
- Use AUTO distribution for tables where optimal distribution is unclear.
- With AUTO distribution (the default), Redshift assigns an optimal distribution style based on the table size — using ALL for small tables and EVEN for larger tables, then adjusting automatically.
- Combined with Automatic Table Optimization (ATO), Redshift can monitor workloads and automatically apply optimal distribution keys without manual intervention.
Sort Key Selection
- Redshift stores the data on disk in sorted order according to the sort key, which helps query optimizer to determine optimal query plans.
- If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.
- Queries are more efficient because they can skip entire blocks that fall outside the time range.
- If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.
- Redshift can skip reading entire blocks of data for that column.
- Redshift tracks the minimum and maximum column values stored on each block and can skip blocks that don’t apply to the predicate range.
- If you frequently join a table, specify the join column as both the sort key and the distribution key.
- Doing this enables the query optimizer to choose a sort merge join instead of a slower hash join.
- As the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
- Use AUTO sort key or Multidimensional Data Layouts (MDDL) for complex workloads.
- When you set SORTKEY AUTO, Redshift’s Automatic Table Optimization (ATO) analyzes your query history and automatically selects either a single-column sort key or Multidimensional Data Layouts based on which is better for your workload.
- Multidimensional Data Layouts (MDDL) — GA since September 2025 — dynamically sort data based on actual query filter patterns rather than a single column, accelerating performance for workloads with multiple filter predicates.
- MDDL constructs a multidimensional virtual sort key that co-locates rows typically accessed by the same queries, enabling data block skipping across multiple predicate columns.
Automatic Table Optimization (ATO)
- ATO is a self-tuning capability that automatically optimizes table design by applying sort and distribution keys without administrator intervention.
- ATO monitors cluster workload and table metadata, runs AI algorithms over observations, and implements sort and distribution keys online in the background without interrupting running queries.
- When ATO is enabled, you don’t need to manually choose sort keys or distribution styles — Redshift will determine optimal settings based on actual query patterns.
- To enable ATO on an existing table:
ALTER TABLE tablename ALTER SORTKEY AUTO; ALTER TABLE tablename ALTER DISTSTYLE AUTO;
- Best practice for new tables: use AUTO distribution and AUTO sort key unless you have specific, well-understood access patterns.
Other Practices
- Automatic compression produces the best results
- COPY command analyzes the data and applies compression encodings to an empty table automatically as part of the load operation
- Define primary key and foreign key constraints between tables wherever appropriate. Even though they are informational only, the query optimizer uses those constraints to generate more efficient query plans.
- Don’t use the maximum column size for convenience.
- Use RA3 or RG instances with Redshift Managed Storage (RMS) to decouple compute from storage and enable independent scaling.
Loading Data
- You can load data into the tables using the following methods:
- Using Multi-Row INSERT
- Using Bulk INSERT
- Using COPY command
- Staging tables
- Auto-copy from S3 (continuous automatic ingestion)
- Streaming Ingestion (from Kinesis Data Streams or Amazon MSK)
- Zero-ETL integrations (from Aurora, DynamoDB, RDS, and SaaS applications)
- Copy Command
- COPY command loads data in parallel from S3, EMR, DynamoDB, or multiple data sources on remote hosts.
- COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well.
- Use a Single COPY Command to Load from Multiple Files
- DON’T use multiple concurrent COPY commands to load one table from multiple files as Redshift is forced to perform a serialized load, which is much slower.
- Concurrency Scaling for COPY (May 2026): Redshift now extends concurrency scaling to support high-volume data ingestion workloads, automatically scaling for COPY queries in Parquet and ORC formats from S3. Data pipelines no longer need to choose between ingestion speed and query performance during peak demand.
- Split the Load Data into Multiple Files
- Divide the data in multiple files with equal size (between 1MB and 1GB)
- Number of files should be a multiple of the number of slices in the cluster
- Helps to distribute workload uniformly in the cluster.
- Use a Manifest File
S3 provides eventual consistency for some operations — Note: Since December 2020, Amazon S3 provides strong read-after-write consistency for all operations. Redshift COPY, UNLOAD, and Spectrum operations benefit from this consistency automatically.
- Manifest files are still recommended to explicitly specify the exact list of files to load, preventing accidental inclusion/exclusion of files.
- Manifest file helps specify different S3 locations in a more efficient way than with the use of S3 prefixes.
- Compress Data Files
- Individually compress the load files using gzip, lzop, bzip2, or Zstandard for large datasets
- Avoid using compression, if small amount of data because the benefit of compression would be outweighed by the processing cost of decompression
- If the priority is to reduce the time spent by COPY commands use LZO compression. If the priority is to reduce the size of the files in S3 and the network bandwidth use GZIP or Zstandard (ZSTD) compression.
- Load Data in Sort Key Order
- Load the data in sort key order to avoid needing to vacuum.
- As long as each batch of new data follows the existing rows in the table, the data will be properly stored in sort order, and you will not need to run a vacuum.
- Presorting rows is not needed in each load because COPY sorts each batch of incoming data as it loads.
- Load Data using IAM role
- Attach an IAM role to the cluster rather than embedding credentials in the COPY command.
Auto-copy from S3
- Auto-copy enables continuous, automatic file ingestion from Amazon S3 into Redshift tables without additional tools or custom solutions.
- Set up ingestion rules using
COPY JOB to track S3 paths and automatically load new files as they arrive.
- Auto-copy uses S3 event notifications to detect new files and trigger loads automatically.
- Concurrency scaling for auto-copy (March 2026): Auto-copy now supports concurrency scaling, ensuring ingestion performance doesn’t degrade during peak query workloads.
- Best suited for continuous batch data arriving in S3 (e.g., log files, IoT data, CDC exports).
Streaming Ingestion
- Streaming Ingestion enables near real-time analytics by creating materialized views directly on top of data streams from Amazon Kinesis Data Streams or Amazon MSK (Managed Streaming for Apache Kafka).
- Eliminates the need to stage data in S3 before loading — data flows directly from streams to Redshift.
- Use
CREATE MATERIALIZED VIEW with stream source to define ingestion, then REFRESH MATERIALIZED VIEW to consume latest data.
- Supports JSON, CSV, and other formats directly from streams.
- Available on both provisioned clusters and Redshift Serverless.
Zero-ETL Integrations
- Zero-ETL integrations provide fully managed, near real-time data replication from operational databases and SaaS applications to Redshift without building ETL pipelines.
- Supported sources:
- Amazon Aurora (MySQL-compatible and PostgreSQL-compatible)
- Amazon RDS (MySQL and PostgreSQL — PostgreSQL GA July 2025)
- Amazon DynamoDB (GA October 2024)
- Self-managed databases (MySQL, PostgreSQL)
- SaaS applications — Salesforce, SAP, ServiceNow, Zendesk, and others (via Amazon SageMaker Lakehouse)
- History mode (April 2025) preserves complete history of data changes for auditing and trend analysis without maintaining duplicate copies.
- Concurrency scaling support for zero-ETL (March 2026) ensures replication performance during high query loads.
- Available on RA3, RG instances and Redshift Serverless workgroups.
Designing Queries
- Avoid using select *. Include only the columns you specifically need.
- Use a CASE Expression to perform complex aggregations instead of selecting from the same table multiple times.
- Don’t use cross-joins unless absolutely necessary
- Use subqueries in cases where one table in the query is used only for predicate conditions and the subquery returns a small number of rows (less than about 200).
- Use predicates to restrict the dataset as much as possible.
- In the predicate, use the least expensive operators that you can.
- Avoid using functions in query predicates.
- If possible, use a WHERE clause to restrict the dataset.
- Add predicates to filter tables that participate in joins, even if the predicates apply the same filters.
- Use materialized views for frequently executed queries to pre-compute results and reduce query latency.
- Leverage data sharing to share live data across multiple Redshift clusters/workgroups without copying, enabling workload isolation.
Cluster Configuration Best Practices
Instance Types
- RG Instances (GA May 2026): New Graviton-powered nodes delivering up to 2.2x faster for data warehouse workloads and up to 2.4x faster for data lake workloads at 30% lower price per vCPU compared to RA3. Recommended for new provisioned deployments.
- RA3 Instances: Previous generation with Redshift Managed Storage (RMS) that separates compute and storage. Still fully supported.
- DC2/DS2 Instances: Legacy instance types. Migrate to RA3 or RG for better price-performance and managed storage benefits.
- Both RG and RA3 use Redshift Managed Storage — data is automatically stored in S3 with intelligent caching on local SSDs.
Multi-AZ Deployments
- Redshift supports Multi-AZ deployments for RG and RA3 clusters, providing high availability across two Availability Zones.
- All nodes in both AZs are used for read and write workloads during normal operation.
- If one AZ experiences an outage, the cluster continues operating in the other AZ.
- Can convert existing Single-AZ clusters to Multi-AZ or restore from snapshot into Multi-AZ configuration.
Redshift Serverless
- Redshift Serverless automatically provisions and scales data warehouse capacity without cluster management.
- Capacity measured in RPUs (Redshift Processing Units). Range: 4 RPUs to 512 RPUs (4 RPU minimum available since June 2025, starting at $1.50/hour).
- AI-driven scaling and optimization (default for new workgroups since April 2026) uses ML to predict compute needs and automatically adjust resources before queries queue, supporting 8–512 RPU base range.
- Pay only for compute consumed when the warehouse is active — ideal for intermittent or unpredictable workloads.
- Supports all the same features as provisioned: data sharing, streaming ingestion, zero-ETL, auto-copy.
UDF Best Practices
- ⚠️ Python UDFs Deprecated: Amazon Redshift no longer supports creation of new Python UDFs (since November 1, 2025). Existing Python UDFs will reach end of support after June 30, 2026.
- Use Lambda UDFs as the recommended replacement — they provide Python 3 support, access to external services, better scalability, and enhanced security.
- SQL UDFs remain fully supported for simple transformations.
- Migrate existing Python UDFs to Lambda UDFs before the end-of-support date.
AWS Certification Exam Practice Questions
- Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
- AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
- AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
- Open to further feedback, discussion and correction.
- An administrator needs to design a strategy for the schema in a Redshift cluster. The administrator needs to determine the optimal distribution style for the tables in the Redshift schema. In which two circumstances would choosing EVEN distribution be most appropriate? (Choose two.)
- When the tables are highly denormalized and do NOT participate in frequent joins.
- When data must be grouped based on a specific key on a defined slice.
- When data transfer between nodes must be eliminated.
- When a new table has been loaded and it is unclear how it will be joined to dimension.
- An administrator has a 500-GB file in Amazon S3. The administrator runs a nightly COPY command into a 10-node Amazon Redshift cluster. The administrator wants to prepare the data to optimize performance of the COPY command. How should the administrator prepare the data?
- Compress the file using gz compression.
- Split the file into 500 smaller files.
- Convert the file format to AVRO.
- Split the file into 10 files of equal size.
- A company needs to load data from multiple operational databases into Amazon Redshift in near real-time for analytics without building ETL pipelines. Which feature should they use?
- Redshift Streaming Ingestion
- Amazon Kinesis Data Firehose
- Zero-ETL integrations
- AWS Database Migration Service (DMS)
- An organization wants to optimize Redshift sort key selection for a workload that filters on multiple columns across different queries. The current single-column sort key only benefits a subset of queries. What should they use?
- Create compound sort keys with all filter columns
- Switch to interleaved sort keys
- Enable Automatic Table Optimization with SORTKEY AUTO to leverage Multidimensional Data Layouts (MDDL)
- Create multiple copies of the table with different sort keys
- A data engineering team needs to continuously ingest new files arriving in S3 into Redshift without managing external schedulers or custom Lambda triggers. Which Redshift feature addresses this requirement?
- Redshift Spectrum
- Auto-copy (COPY JOB)
- Streaming Ingestion from Kinesis
- Zero-ETL integration with S3
- A company is deploying a new Redshift provisioned cluster and wants the best price-performance. They need both data warehouse and data lake query capabilities. Which instance type should they select? (June 2026)
- DC2 instances
- DS2 instances
- RA3 instances
- RG instances (Graviton-powered)
- A team wants to enable near real-time analytics on streaming data from Amazon MSK without staging data in S3. Which approach should they use?
- Use AWS Glue streaming ETL job to load into Redshift
- Create a streaming materialized view in Redshift that reads directly from the MSK topic
- Use Kinesis Data Firehose to deliver to Redshift
- Use Lambda to consume MSK events and INSERT into Redshift
References