AWS Redshift Best Practices
Sort Key Selection
- Redshift stores the data on disk in sorted order according to the sort key, which helps query optimizer to determine optimal query plans.
- If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.
- Queries are more efficient because they can skip entire blocks that fall outside the time range.
- If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.
- Amazon Redshift can skip reading entire blocks of data for that column. It can do so because it tracks the minimum and maximum column values stored on each block and can skip blocks that don’t apply to the predicate range.
- If you frequently join a table, specify the join column as both the sort key and the distribution key.
- Doing this enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
Distribution Style selection
- Distribute the fact table and one dimension table on their common columns.
- Your fact table can have only one distribution key. Any tables that join on another key aren’t collocated with the fact table.
- Choose one dimension to collocate based on how frequently it is joined and the size of the joining rows.
- Designate both the dimension table’s primary key and the fact table’s corresponding foreign key as the DISTKEY.
- Choose the largest dimension based on the size of the filtered dataset.
- Only the rows that are used in the join need to be distributed, so consider the size of the dataset after filtering, not the size of the table.
- Choose a column with high cardinality in the filtered result set.
- If you distribute a sales table on a date column, for example, you should probably get fairly even data distribution, unless most of your sales are seasonal.
- However, if you commonly use a range-restricted predicate to filter for a narrow date period, most of the filtered rows occur on a limited set of slices and the query workload is skewed.
- Change some dimension tables to use ALL distribution.
- If a dimension table cannot be collocated with the fact table or other important joining tables, query performance can be improved significantly by distributing the entire table to all of the nodes.
- Using ALL distribution multiplies storage space requirements and increases load times and maintenance operations.
- Automatic compression produces the best results
- COPY command analyzes the data and applies compression encodings to an empty table automatically as part of the load operation
- Define primary key and foreign key constraints between tables wherever appropriate. Even though they are informational only, the query optimizer uses those constraints to generate more efficient query plans.
- Don’t use the maximum column size for convenience.
- You can load data into the tables using the three following methods:
- Using Multi-Row INSERT
- Using Bulk INSERT
- Using COPY command
- Staging tables
- COPY command loads data in parallel from S3, EMR, DynamoDB, or multiple data sources on remote hosts.
- COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well.
- Use a Single COPY Command to Load from Multiple Files
- DON’T use multiple concurrent COPY commands to load one table from multiple files as Redshift is forced to perform a serialized load, which is much slower.
- Split the Load Data into Multiple Files
- divide the data in multiple files with equal size ( between 1MB and 1GB
- number of files to be a multiple of the number of slices in the cluster
- helps to distribute workload uniformly in the cluster.
- Use a Manifest File
- S3 provides eventual consistency for some operations, so it is possible that new data will not be available immediately after the upload, which could result in an incomplete data load or loading stale data.
- Data consistency can be managed using a manifest file to load data.
- Manifest file helps specify different S3 locations in a more efficient way that with the use of S3 prefixes.
Compress Your Data Files
- Individually compress the load files using gzip, lzop, bzip2, or Zstandard for large datasets
- Avoid to use compression if you have small amount of data because the benefit of compression would be outweighed by the processing cost of decompression.
- If the priority is to reduce the time spent by COPY commands use LZO compression. In the other hand if the priority is to reduce the size of the files in S3 and the network bandwidth use BZ2 compression.
- Load Data in Sort Key Order
- Load your data in sort key order to avoid needing to vacuum.
- As long as each batch of new data follows the existing rows in the table, the data will be properly stored in sort order, and you will not need to run a vacuum.
- Presorting rows is not needed in each load because COPY sorts each batch of incoming data as it loads.
- Load Data using IAM role
- Avoid using select *. Include only the columns you specifically need.
- Use a CASE Expression to perform complex aggregations instead of selecting from the same table multiple times.
- Don’t use cross-joins unless absolutely necessary
- Use subqueries in cases where one table in the query is used only for predicate conditions and the subquery returns a small number of rows (less than about 200).
- Use predicates to restrict the dataset as much as possible.
- In the predicate, use the least expensive operators that you can.
- Avoid using functions in query predicates.
- If possible, use a WHERE clause to restrict the dataset.
- Add predicates to filter tables that participate in joins, even if the predicates apply the same filters.
AWS Certification Exam Practice Questions
- Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
- AWS services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
- AWS exam questions are not updated to keep up the pace with AWS updates, so even if the underlying feature has changed the question might not be updated
- Open to further feedback, discussion and correction.
- An administrator needs to design a strategy for the schema in a Redshift cluster. The administrator needs to determine the optimal distribution style for the tables in the Redshift schema. In which two circumstances would choosing EVEN distribution be most appropriate? (Choose two.)
- When the tables are highly denormalized and do NOT participate in frequent joins.
- When data must be grouped based on a specific key on a defined slice.
- When data transfer between nodes must be eliminated.
- When a new table has been loaded and it is unclear how it will be joined to dimension.
- An administrator has a 500-GB file in Amazon S3. The administrator runs a nightly COPY command into a 10-node Amazon Redshift cluster. The administrator wants to prepare the data to optimize performance of the COPY command. How should the administrator prepare the data?
- Compress the file using gz compression.
- Split the file into 500 smaller files.
- Convert the file format to AVRO.
- Split the file into 10 files of equal size.