AWS Redshift Advanced
AWS Redshift Advanced topics cover Distribution Styles for table, Workload Management etc.
- Table distribution style determines how data is distributed across compute nodes and helps minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.
- Redshift supports four distribution styles; AUTO, EVEN, KEY, or ALL.
- A single column acts as distribution key (DISTKEY) and helps place matching values on the same node slice.
- As a rule of thumb, choose a column that:
- Is uniformly distributed – Otherwise skew data will cause unbalances in the volume of data that will be stored in each compute node leading to undesired situations where some slices will process bigger amounts of data than others and causing bottlenecks.
- acts as a JOIN column – for tables related with dimensions tables (star-schema), it is better to choose as DISTKEY the field that acts as the JOIN field with the larger dimension table, so that matching values from the common columns are physically stored together, reducing the amount of data that needs to be broadcasted through the network.
- distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column
- Choose EVEN distribution
- when the table does not participate in joins
- when there is not a clear choice between KEY and ALL distribution.
- whole table is replicated in every compute node.
- ensures that every row is collocated for every join that the table participates in
- ideal for for relatively slow moving tables, tables that are not updated frequently or extensively
- Small dimension tables DO NOT benefit significantly from ALL distribution, because the cost of redistribution is low.
- Redshift assigns an optimal distribution style based on the size of the table data for e.g. apply ALL distribution for a small table and as it grows changes it to Even distribution
- Amazon Redshift applies AUTO distribution, be default.
- Sort keys define the order in which the data will be stored.
- Sorting enables efficient handling of range-restricted predicates
- Only one sort key per table can be defined, but it can be composed with one or more columns.
- Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If query uses a range-restricted predicate, the query processor can use the min and max values to rapidly skip over large numbers of blocks during table scans
- The are two kinds of sort keys in Redshift: Compound and Interleaved.
- A compound key is made up of all of the columns listed in the sort key definition, in the order they are listed.
- A compound sort key is more efficient when query predicates use a prefix, or query’s filter applies conditions, such as filters and joins, which is a subset of the sort key columns in order.
- Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY and ORDER BY.
Interleaved Sort Keys
- An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
- An interleaved sort key is more efficient when multiple queries use different columns for filters
- Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.
- Use cases involve performing ad-hoc multi-dimensional analytics, which often requires pivoting, filtering and grouping data using different columns as query dimensions.
- Redshift does not support Indexes.
- Redshift supports UNIQUE, PRIMARY KEY and FOREIGN KEY constraints, however they are only with informational purposes.
- Redshift does not perform integrity checks for these constraints and are used by query planner, as hints, in order to optimize executions.
- Redshift does enforce NOT NULL column constraints.
Redshift Enhanced VPC Routing
- Redshift enhanced VPC routing forces all COPY and UNLOAD traffic between the cluster and the data repositories through the VPC.
- Without enhanced VPC routing, Redshift would route traffic through the internet, including traffic to other services within the AWS network.
Redshift Workload Management
- Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries
- Redshift provides query queues, in order to manage concurrency and resource planning. Each queue can be configured with the following parameters:
- Slots: number of concurrent queries that can be executed in this queue.
- Working memory: percentage of memory assigned to this queue.
- Max. Execution Time: the amount of time a query is allowed to run before it is terminated.
- Queries can be routed to different queues using Query Groups and User Groups
- As a rule of thumb, it is considered a best practice to have separate queues for long running resource-intensive queries and fast queries that don’t require big amounts of memory and CPU.
- By default, Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one.
- A maximum of eight queues can be defined, with each queue configured with a maximum concurrency level of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.
- Redshift WLM supports two modes – Manual and Automatic
- Automatic WLM supports queue priorities
Redshift Loading Data
- A COPY command is the most efficient way to load a table.
- COPY command is able to read from multiple data files or multiple data streams simultaneously.
- Redshift allocates the workload to the cluster nodes and performs the load operations in parallel, including sorting the rows and distributing data across node slices.
- COPY command supports loading data from S3, EMR, DynamoDB and remote hosts such as EC2 instances using SSH.
- COPY supports decryption and can decrypt the data as it performs the load, if the data is encrypted
- COPY can then speed up the load process by uncompressing the files as they are read, if the data is compressed.
- COPY command can be used with COMPUPDATE set to ON to analyze and apply compression automatically based on sample data.
- Optimizing storage for narrow tables (multiple rows few columns) by using Single COPY command instead of multiple COPY commands, as it would not work well due to hidden fields and compression issues.
- Data can also be added using INSERT commands, though it is much less efficient than using COPY.
Redshift Resizing Cluster
- Elastic resize
- Use elastic resize to change the node type, number of nodes, or both. (Circa April 2020 – Changing node type is available recently and was not supported before)
- If only the number of nodes are changed, then queries are temporarily paused and connections are held open if possible.
- During the resize operation, the cluster is read-only.
- Elastic resize takes 10–15 minutes
- Classic resize
- Use classic resize to change the node type, number of nodes, or both.
- During the resize operation, data is copied to a new cluster and the source cluster is read-only
- Classic resize takes 2 hours–2 days or longer, depending on the data’s size
- Snapshot and restore with classic resize
- To keep the cluster available during a classic resize, create a snapshot , make a copy of an existing cluster, then resize the new cluster.
- Redshift Spectrum helps query and retrieve structured and semistructured data from files in S3 without having to load the data into Redshift tables.
- Redshift Spectrum queries employ massive parallelism to execute very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in S3.
- Multiple clusters can concurrently query the same dataset in S3 without the need to make copies of the data for each cluster.
- Redshift Spectrum resides on dedicated Redshift servers that are independent of the existing cluster.
- Redshift Spectrum pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer.
- Redshift Spectrum also scales automatically, based on the demands of the queries and can potentially use thousands of instances to take advantage of massively parallel processing.
- Supports external data catalog using Glue, Athena or Hive metastore
- Redshift cluster and the S3 bucket must be in the same AWS Region.
- Redshift Spectrum external tables are read-only. You can’t COPY or INSERT to an external table.
Redshift Federated Query
- Redshift Federated Query feature allows querying and analyzing data across operational databases, data warehouses, and data lakes.
- Redshift Federated Query allows integrating queries on live data in RDS for PostgreSQL and Aurora PostgreSQL with queries across Redshift and S3.
AWS Certification Exam Practice Questions
- Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
- AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
- AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
- Open to further feedback, discussion and correction.
- A Redshift data warehouse has different user teams that need to query the same table with very different query types. These user teams are experiencing poor performance. Which action improves performance for the user teams in this situation?
- Create custom table views.
- Add interleaved sort keys per team.
- Maintain team-specific copies of the table.
- Add support for workload management queue hopping.