AWS Redshift Advanced
AWS Redshift advanced topics cover table distribution styles, sort keys, constraints, and Workload Management (WLM).
- Table distribution style determines how data is distributed across compute nodes and helps minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.
- Redshift supports four distribution styles: AUTO, EVEN, KEY, and ALL.
Key Distribution
- A single column acts as the distribution key (DISTKEY) and places matching values on the same node slice.
- As a rule of thumb you should choose a column that:
- Is uniformly distributed – otherwise, skewed data will cause imbalances in the volume of data stored on each compute node, leading to situations where some slices process more data than others and become bottlenecks.
- Acts as a JOIN column – for tables related to dimension tables (star schema), it is better to choose as DISTKEY the column that joins with the largest dimension table, so that matching values from the common columns are physically stored together, reducing the amount of data that needs to be broadcast over the network.
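As an illustrative sketch (table and column names are hypothetical), a fact table can use the JOIN column it shares with its largest dimension table as the DISTKEY, so that joining rows are collocated on the same slice:

```sql
-- Hypothetical star schema: distribute both tables on the shared JOIN column
CREATE TABLE dim_customer (
  customer_id   INTEGER NOT NULL,
  customer_name VARCHAR(100)
)
DISTSTYLE KEY
DISTKEY (customer_id);

CREATE TABLE fact_sales (
  sale_id     BIGINT NOT NULL,
  customer_id INTEGER NOT NULL,   -- JOIN column with the largest dimension
  amount      DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);            -- matching values land on the same node slice
```

Because both tables are distributed on customer_id, a join on that column can be resolved locally on each slice without redistributing rows over the network.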
Even Distribution
- Rows are distributed across the slices in a round-robin fashion, regardless of the values in any particular column.
- Choose EVEN distribution
- when the table does not participate in joins
- when there is no clear choice between KEY and ALL distribution.
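A minimal sketch of EVEN distribution for a staging table that does not participate in joins (table and column names are hypothetical):

```sql
-- Round-robin distribution: balanced storage, no collocation guarantees
CREATE TABLE stage_clickstream (
  event_time TIMESTAMP,
  url        VARCHAR(2048),
  user_agent VARCHAR(512)
)
DISTSTYLE EVEN;
```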
ALL Distribution
- The whole table is replicated on every compute node.
- ensures that every row is collocated for every join that the table participates in
- ideal for relatively slow-moving tables, i.e., tables that are not updated frequently or extensively
- Small dimension tables DO NOT benefit significantly from ALL distribution, because the cost of redistribution is low.
Auto Distribution
- Redshift assigns an optimal distribution style based on the size of the table data, e.g., it applies ALL distribution for a small table and, as the table grows, changes it to EVEN distribution.
- Amazon Redshift applies AUTO distribution by default.
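A short sketch (table name is hypothetical): AUTO can also be requested explicitly, and the style Redshift is currently applying can be inspected through the SVV_TABLE_INFO system view:

```sql
-- AUTO is the default when no DISTSTYLE is specified
CREATE TABLE product_catalog (
  product_id   INTEGER,
  product_name VARCHAR(200)
)
DISTSTYLE AUTO;

-- diststyle shows the style currently chosen, e.g. AUTO(ALL) for a small table
SELECT "table", diststyle
FROM svv_table_info
WHERE "table" = 'product_catalog';
```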
Sort Keys
- Sort keys define the order in which the data will be stored.
- Sorting enables efficient handling of range-restricted predicates
- Only one sort key can be defined per table, but it can be composed of one or more columns.
- Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If a query uses a range-restricted predicate, the query processor can use the min and max values to rapidly skip over large numbers of blocks during table scans.
- There are two kinds of sort keys in Redshift: compound and interleaved.
Compound Sort Keys
- A compound key is made up of all of the columns listed in the sort key definition, in the order they are listed.
- A compound sort key is more efficient when query predicates (filters and joins) use a prefix of the sort key, i.e., a subset of the sort key columns in order.
- Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY and ORDER BY.
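A sketch of a compound sort key and a range-restricted query that benefits from it (names are hypothetical):

```sql
-- Sorted by (event_date, customer_id); block min/max metadata (zone maps)
-- lets scans skip 1 MB blocks outside the requested date range
CREATE TABLE fact_events (
  event_date  DATE NOT NULL,
  customer_id INTEGER,
  payload     VARCHAR(256)
)
COMPOUND SORTKEY (event_date, customer_id);

-- Filters on a prefix of the sort key (event_date), so most blocks are skipped
SELECT COUNT(*)
FROM fact_events
WHERE event_date BETWEEN '2023-01-01' AND '2023-01-31';
```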
Interleaved Sort Keys
- An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
- An interleaved sort key is more efficient when multiple queries use different columns for filters
- Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.
- Use cases involve performing ad-hoc multi-dimensional analytics, which often requires pivoting, filtering and grouping data using different columns as query dimensions.
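A sketch of an interleaved sort key for such ad-hoc analytics (names are hypothetical):

```sql
-- Equal weight to each sort column: filtering on region alone, product_id
-- alone, or channel alone can all skip blocks
CREATE TABLE sales_analytics (
  region     VARCHAR(32),
  product_id INTEGER,
  channel    VARCHAR(32),
  amount     DECIMAL(10,2)
)
INTERLEAVED SORTKEY (region, product_id, channel);
```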
Constraints
- Redshift supports UNIQUE, PRIMARY KEY, and FOREIGN KEY constraints; however, they are informational only.
- Redshift does not perform integrity checks for these constraints; they are used by the query planner as hints to optimize query execution.
- Redshift does enforce NOT NULL column constraints.
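A sketch illustrating the difference (table names are hypothetical): the PRIMARY KEY and FOREIGN KEY below are accepted but not enforced, while NOT NULL is enforced:

```sql
CREATE TABLE dim_customer (
  customer_id INTEGER NOT NULL,
  PRIMARY KEY (customer_id)        -- informational: duplicates are NOT rejected
);

CREATE TABLE orders (
  order_id    BIGINT NOT NULL,     -- NOT NULL is enforced on insert/load
  customer_id INTEGER,
  FOREIGN KEY (customer_id) REFERENCES dim_customer (customer_id)  -- planner hint only
);
```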
Redshift Workload Management
- Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries
- Redshift provides query queues, in order to manage concurrency and resource planning. Each queue can be configured with the following parameters:
- Slots: number of concurrent queries that can be executed in this queue.
- Working memory: percentage of memory assigned to this queue.
- Max. Execution Time: the amount of time a query is allowed to run before it is terminated.
- Queries can be routed to different queues using Query Groups and User Groups. As a rule of thumb, it is considered a best practice to have separate queues for long-running, resource-intensive queries and for fast queries that don't require large amounts of memory and CPU.
- By default, Amazon Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one. A maximum of eight queues can be defined, with each queue configured with a maximum concurrency level of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.
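Once queues are defined (in the cluster parameter group's WLM configuration), a session can be routed to a specific queue with a query group; the group name below is hypothetical:

```sql
-- Route subsequent queries in this session to the queue that matches
-- the 'reporting' query group in the WLM configuration
SET query_group TO 'reporting';

SELECT region, SUM(amount)
FROM sales_analytics          -- hypothetical table
GROUP BY region;

RESET query_group;            -- fall back to default routing
```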
AWS Certification Exam Practice Questions
- Questions are collected from the Internet, and the answers are marked as per my knowledge and understanding (which might differ from yours).
- AWS services are updated every day, and both the questions and answers might become outdated soon, so research accordingly.
- AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated.
- Open to further feedback, discussion and correction.
- A Redshift data warehouse has different user teams that need to query the same table with very different query types. These user teams are experiencing poor performance. Which action improves performance for the user teams in this situation?
- Create custom table views.
- Add interleaved sort keys per team.
- Maintain team-specific copies of the table.
- Add support for workload management queue hopping.