AWS Redshift Advanced

Learn from the best — get courses from expert instructors starting at just $13.99!

AWS Redshift Advanced

AWS Redshift Advanced topics cover Distribution Styles for table, Workload Management etc.

Distribution Styles

  • Table distribution style determines how data is distributed across compute nodes and helps minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.
  • Redshift supports four distribution styles; AUTO, EVEN, KEY, or ALL.

KEY distribution

  • A single column acts as distribution key (DISTKEY) and helps place matching values on the same node slice.
  • As a rule of thumb you should choose a column that:
    • Is uniformly distributed – Otherwise skew data will cause unbalances in the volume of data that will be stored in each compute node leading to undesired situations where some slices will process bigger amounts of data than others and causing bottlenecks.
    • acts as a JOIN column – for tables related with dimensions tables (star-schema), it is better to choose as DISTKEY the field that acts as the JOIN field with the larger dimension table, so that matching values from the common columns are physically stored together, reducing the amount of data that needs to be broadcasted through the network.

EVEN distribution

  • distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column
  • Choose EVEN distribution
    • when the table does not participate in joins
    • when there is not a clear choice between KEY and ALL distribution.

ALL distribution

  • whole table is replicated in every compute node.
  • ensures that every row is collocated for every join that the table participates in
  • ideal for for relatively slow moving tables, tables that are not updated frequently or extensively
  • Small dimension tables DO NOT benefit significantly from ALL distribution, because the cost of redistribution is low.

AUTO distribution

  • Redshift assigns an optimal distribution style based on the size of the table data for e.g. apply ALL distribution for a small table and as it grows changes it to Even distribution
  • Amazon Redshift applies AUTO distribution, be default.

Sort Key

  • Sort keys define the order data in which the data will be stored.
  • Sorting enables efficient handling of range-restricted predicates
  • Only one sort key per table can be defined, but it can be composed with one or more columns.
  • Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If query uses a range-restricted predicate, the query processor can use the min and max values to rapidly skip over large numbers of blocks during table scans
  • The are two kinds of sort keys in Redshift: Compound and Interleaved.

Compound Keys

  • A compound key is made up of all of the columns listed in the sort key definition, in the order they are listed.
  • A compound sort key is more efficient when query predicates use a prefix, or query’s filter applies conditions, such as filters and joins, which is a subset of the sort key columns in order.
  • Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY and ORDER BY.

Interleaved Sort Keys

  • An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
  • An interleaved sort key is more efficient when multiple queries use different columns for filters
  • Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.
  • Use cases involve performing ad-hoc multi-dimensional analytics, which often requires pivoting, filtering and grouping data using different columns as query dimensions.

Constraints

  • Redshift supports UNIQUE, PRIMARY KEY and FOREIGN KEY constraints, however they are only with informational purposes.
  • Redshift does not perform integrity checks for these constraints and are used by query planner, as hints, in order to optimize executions.
  • Redshift does enforce NOT NULL column constraints.

Redshift Workload Management

  • Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries
  • Redshift provides query queues, in order to manage concurrency and resource planning. Each queue can be configured with the following parameters:
    • Slots: number of concurrent queries that can be executed in this queue.
    • Working memory: percentage of memory assigned to this queue.
    • Max. Execution Time: the amount of time a query is allowed to run before it is terminated.
  • Queries can be routed to different queues using Query Groups and User Groups. As a rule of thumb, is considered a best practice to have separate queues for long running resource-intensive queries and fast queries that don’t require big amounts of memory and CPU.
  • By default, Amazon Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one. A maximum of eight queues can be defined, with each queue configured with a maximum concurrency level of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.

[do_widget id=text-15]

AWS Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. A Redshift data warehouse has different user teams that need to query the same table with very different query types. These user teams are experiencing poor performance. Which action improves performance for the user teams in this situation?
    1. Create custom table views.
    2. Add interleaved sort keys per team.
    3. Maintain team-specific copies of the table.
    4. Add support for workload management queue hopping.

4 thoughts on “AWS Redshift Advanced

  1. Hi,

    I’m confused whether the answer is B or D . any thoughts ?

    A data engineer is running a DWH on a 25-node Redshift cluster of a SaaS service. The data engineer needs to build a dashboard that will be used by customers. Five big customers represent 80% of usage, and there is a long tail of dozens of smaller customers. The data engineer has selected the dashboarding tool.
    How should the data engineer make sure that the larger customer workloads do NOT interfere with the smaller customer workloads?
    A. Apply query filters based on customer-id that can NOT be changed by the user and apply distribution keys on customer-id.
    B. Place the largest customers into a single user group with a dedicated query queue and place the rest of the customers into a different query queue.
    C. Push aggregations into an RDS for Aurora instance. Connect the dashboard application to Aurora rather than Redshift for faster queries.
    D. Route the largest customers to a dedicated Redshift cluster. Raise the concurrency of the multi-tenant Redshift cluster to accommodate the remaining customers.

  2. Hi,

    Could you help answering this question

    A company is building a new application in AWS. The architect needs to design a system to collect application log events. The design should be a repeatable pattern that minimizes data loss if an application instance fails, and keeps a durable copy of a log data for at least 30 days.

    What is the simplest architecture that will allow the architect to analyze the logs?

    A. Write them directly to a Kinesis Firehose. Configure Kinesis Firehose to load the events into an Amazon Redshift cluster for analysis.
    B. Write them to a file on Amazon Simple Storage Service (S3). Write an AWS Lambda function that runs in response to the S3 event to load the events into Amazon Elasticsearch Service for analysis.
    C. Write them to the local disk and configure the Amazon CloudWatch Logs agent to load the data into CloudWatch Logs and subsequently into Amazon Elasticsearch Service.
    D. Write them to CloudWatch Logs and use an AWS Lambda function to load them into HDFS on an Amazon Elastic MapReduce (EMR) cluster for analysis.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.