AWS Certified Big Data – Speciality (BDS-C00) Exam Learning Path

Clearing the AWS Certified Big Data – Speciality (BDS-C00) was a great feeling. This was my third Speciality certification, and in terms of difficulty level (compared to the Network and Security Speciality exams), I would rate it between Network (the toughest) and Security (the simpler one).

Big Data is in itself a very vast topic, and with AWS services there is a lot to cover and know for the exam. If you have worked on Big Data technologies, including a bit of Visualization and Machine Learning, it would be a great asset for passing this exam.

The AWS Certified Big Data – Speciality (BDS-C00) exam basically validates the ability to

  • Implement core AWS Big Data services according to basic architectural best practices
  • Design and maintain Big Data
  • Leverage tools to automate Data Analysis

Refer to the AWS Certified Big Data – Speciality Exam Guide for details

                              AWS Certified Big Data – Speciality Domains

AWS Certified Big Data – Speciality (BDS-C00) Exam Summary

  • AWS Certified Big Data – Speciality exam, as its name suggests, covers a lot of Big Data concepts, right from data transfer and collection techniques, storage, pre- and post-processing, analytics, and visualization, with the added concern of data security at each layer.
  • One of the key tactics I followed when solving any AWS certification exam is to read the question, use paper and pencil to draw a rough architecture, and focus on the key requirements. Trust me, you will be able to eliminate two answers for sure and then need to focus on only the other two. Read the remaining two answers to check where they differ; that would help you reach the right answer or at least have a 50% chance of getting it right.
  • Be sure to cover the following topics
    • Whitepapers and articles
    • Analytics
      • Make sure you know and cover all the services in depth, as 80% of the exam is focused on these topics
      • Elastic Map Reduce
        • Understand EMR in depth
        • Understand EMRFS (hint: Use Consistent view to make sure S3 objects referred by different applications are in sync)
        • Know EMR Best Practices (hint: start with many small nodes instead of few large nodes)
        • Know the Hive metastore can be externally hosted using RDS, Aurora, or the AWS Glue Data Catalog
        • Also know the different technologies
          • Presto is a fast SQL query engine designed for interactive analytic queries over large datasets from multiple sources
          • D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG, and CSS
          • Spark is a distributed processing framework and programming model that helps do machine learning, stream processing, or graph analytics using Amazon EMR clusters
          • Zeppelin/Jupyter serve as notebooks for interactive data exploration; they are open-source web applications that can be used to create and share documents containing live code, equations, visualizations, and narrative text
          • Phoenix is used for OLTP and operational analytics, allowing you to use standard SQL queries and JDBC APIs to work with an Apache HBase backing store
      • Kinesis
        • Understand Kinesis Data Streams and Kinesis Data Firehose in depth
        • Know Kinesis Data Streams vs Kinesis Firehose
          • Know Kinesis Data Streams is open ended on both the producer and consumer side. It supports the KCL and works with Spark.
          • Know Kinesis Firehose is open ended on the producer side only. Data is delivered to S3, Redshift, or Elasticsearch.
          • Kinesis Firehose works in batches, with a minimum buffer interval of 60 seconds.
        • Understand Kinesis encryption (hint: use server-side encryption or encrypt in the producer for data streams)
        • Know the difference between the KPL and the SDK (hint: PutRecords is synchronous, while the KPL supports batching)
        • Kinesis best practices (hint: increase performance by increasing the number of shards)
      • Know Elasticsearch is a search service which supports indexing, full-text search, faceting, etc.
      • Redshift
        • Understand Redshift in depth
        • Understand Redshift advanced topics like Workload Management, Distribution Style, and Sort Key
        • Know Redshift best practices w.r.t. selection of distribution style, sort key, and the COPY command, which allows parallel loading
        • Know Redshift views to control access to data.
      • Amazon Machine Learning
      • Know Data Pipeline for data transfer
      • QuickSight
      • Know Glue as the ETL tool
    • Security, Identity & Compliance
    • Management & Governance Tools
      • Understand AWS CloudWatch for logs and metrics. Also, CloudWatch Events provides more real-time alerts as compared to CloudTrail
    • Storage
    • Compute
      • Know EC2 accesses services using an IAM role, and Lambda using an execution role.
      • Lambda, esp. how to improve performance via batching, breaking up functions, etc.

AWS Certified Big Data – Speciality (BDS-C00) Exam Resources

AWS Data Transfer Services

AWS Data Transfer Services

  • AWS provides a suite of data transfer services that includes many methods to migrate your data more effectively.
  • Data transfer services work both online and offline, and the choice depends on several factors like the amount of data, time required, frequency, available bandwidth, and cost.
  • Online data transfer and hybrid cloud storage
    • use a network link to the VPC to transfer data to AWS, or use S3 for hybrid cloud storage with existing on-premises applications.
    • helps both to lift and shift large datasets once, as well as to integrate existing process flows like backup and recovery or continuous data streams directly with cloud storage.
  • Offline data migration to Amazon S3.
    • uses shippable, ruggedized devices, ideal for moving large archives and data lakes, or for situations where bandwidth and data volumes cannot pass over your networks within the desired time frame.

Online data transfer

VPN

  • connect securely between data centers and AWS
  • quick to set up and cost efficient
  • ideal for small data transfers and connectivity
  • not as reliable, as it still uses a shared Internet connection

Direct Connect

  • provides dedicated physical connection to accelerate network transfers between data centers and AWS
  • provides reliable data transfer
  • ideal for regular large data transfer
  • needs time to set up
  • is not a cost efficient solution
  • can be secured using VPN over Direct Connect

AWS S3 Transfer Acceleration

  • makes public Internet transfers to S3 faster.
  • helps maximize the available bandwidth regardless of distance or varying Internet weather, and there are no special clients or proprietary network protocols. Simply change the endpoint you use with your S3 bucket and acceleration is automatically applied.
  • ideal for recurring jobs that travel across the globe, such as media uploads, backups, and local data processing tasks that are regularly sent to a central location
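
A minimal sketch of what "changing the endpoint" looks like with boto3, assuming a hypothetical bucket and file name; acceleration is first enabled on the bucket, then the client is configured to use the accelerate endpoint:

    import boto3
    from botocore.config import Config

    s3 = boto3.client('s3')

    # One-time: enable Transfer Acceleration on the bucket (hypothetical bucket name)
    s3.put_bucket_accelerate_configuration(
        Bucket='my-central-ingest-bucket',
        AccelerateConfiguration={'Status': 'Enabled'},
    )

    # Uploads themselves are unchanged; only the endpoint configuration differs
    s3_accel = boto3.client('s3', config=Config(s3={'use_accelerate_endpoint': True}))
    s3_accel.upload_file('video-0001.mp4', 'my-central-ingest-bucket',
                         'uploads/video-0001.mp4')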

AWS DataSync

  • automates moving data between on-premises storage and S3 or Elastic File System (Amazon EFS).
  • automatically handles many of the tasks related to data transfers that can slow down migrations or burden IT operations, including running your own instances, handling encryption, managing scripts, network optimization, and data integrity validation.
  • helps transfer data at speeds up to 10 times faster than open-source tools.
  • uses AWS Direct Connect or internet links to AWS and ideal for one-time data migrations, recurring data processing workflows, and automated replication for data protection and recovery.

Offline data transfer

AWS Snowball

  • is a petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data into and out of AWS.
  • ideal for one time large data transfers with limited network bandwidth, long transfer times, and security concerns
  • is simple, fast, and secure.
  • can be very cost and time efficient for large data transfer

AWS Snowball Edge

  • is a petabyte-scale data transfer device with on-board storage and compute capabilities
  • can move large amounts of data into and out of AWS, act as a temporary storage tier for large local datasets, or support local workloads in remote or offline locations.
  • ideal for one time large data transfers with limited network bandwidth, long transfer times, and security concerns
  • is simple, fast, and secure.
  • can be very cost and time efficient for large data transfer

AWS Snowmobile

  • is an exabyte-scale data transport solution that uses a secure 40-foot shipping container, pulled by a semi-trailer truck, to transfer large amounts of data into and out of AWS.
  • addresses common challenges with large-scale data transfers including high network costs, long transfer times, and security concerns.
  • transfer is done through a custom engagement, is fast and secure, and can be as little as one-fifth the cost of transferring over high-speed Internet.

Data Transfer Chart – Bandwidth vs Time

Data Migration Speeds

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day, so both the questions and answers might become outdated soon; research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not reflect it.
  • Open to further feedback, discussion and correction.
  1. An organization is moving non-business-critical applications to AWS while maintaining a mission critical application in an on-premises data center. An on-premises application must share limited confidential information with the applications in AWS. The Internet performance is unpredictable. Which configuration will ensure continued connectivity between sites MOST securely?
    1. VPN and a cached storage gateway
    2. AWS Snowball Edge
    3. VPN Gateway over AWS Direct Connect
    4. AWS Direct Connect
  2. A company wants to transfer petabyte-scale data to AWS for analytics, but is constrained by its internet connectivity. Which AWS service can help them transfer the data quickly?
    1. S3 enhanced uploader
    2. Snowmobile
    3. Snowball
    4. Direct Connect
  3. A company wants to transfer their video library data, which runs into exabytes, to AWS. Which AWS service can help the company transfer the data?
    1. Snowmobile
    2. Snowball
    3. S3 upload
    4. S3 enhanced uploader
  4. You are working with a customer who has 100 TB of archival data that they want to migrate to Amazon Glacier. The customer has a 1-Gbps connection to the Internet. Which service or feature provides the fastest method of getting the data into Amazon Glacier?
    1. Amazon Glacier multipart upload
    2. AWS Storage Gateway
    3. VM Import/Export
    4. AWS Snowball

AWS Redshift Advanced

AWS Redshift Advanced

AWS Redshift advanced topics cover distribution styles for tables, sort keys, workload management, etc.

Distribution Styles

  • Table distribution style determines how data is distributed across compute nodes and helps minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.
  • Redshift supports four distribution styles: AUTO, EVEN, KEY, or ALL.

KEY distribution

  • A single column acts as distribution key (DISTKEY) and helps place matching values on the same node slice.
  • As a rule of thumb you should choose a column that:
    • Is uniformly distributed – otherwise skewed data will cause imbalances in the volume of data stored on each compute node, leading to situations where some slices process bigger amounts of data than others, causing bottlenecks.
    • Acts as a JOIN column – for tables related to dimension tables (star schema), it is better to choose as DISTKEY the field that acts as the JOIN field with the largest dimension table, so that matching values from the common columns are physically stored together, reducing the amount of data that needs to be broadcast over the network.

EVEN distribution

  • distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column
  • Choose EVEN distribution
    • when the table does not participate in joins
    • when there is not a clear choice between KEY and ALL distribution.

ALL distribution

  • whole table is replicated in every compute node.
  • ensures that every row is collocated for every join that the table participates in
  • ideal for relatively slow-moving tables, i.e. tables that are not updated frequently or extensively
  • Small dimension tables DO NOT benefit significantly from ALL distribution, because the cost of redistribution is low.

AUTO distribution

  • Redshift assigns an optimal distribution style based on the size of the table data, e.g. it applies ALL distribution for a small table and, as the table grows, changes it to EVEN distribution.
  • Amazon Redshift applies AUTO distribution by default.
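
The distribution style is declared in the table DDL. Below is a minimal sketch run through psycopg2 against a hypothetical cluster (the endpoint, credentials, and table definitions are illustrative only), showing the column-level DISTKEY and the table-level DISTSTYLE options:

    import os
    import psycopg2

    conn = psycopg2.connect(host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
                            port=5439, dbname='dev', user='awsuser',
                            password=os.environ['REDSHIFT_PASSWORD'])
    cur = conn.cursor()

    # KEY distribution: collocate fact rows with the dimension they are joined to most
    cur.execute("""
        CREATE TABLE sales (
            sale_id     BIGINT,
            customer_id BIGINT DISTKEY,   -- the JOIN column with the customer dimension
            sale_date   DATE,
            amount      DECIMAL(12,2)
        );
    """)

    # ALL distribution: replicate a small, slow-moving dimension to every node
    cur.execute("CREATE TABLE region (region_id INT, region_name VARCHAR(64)) DISTSTYLE ALL;")

    # EVEN distribution: round-robin for a table that does not participate in joins
    cur.execute("CREATE TABLE clickstream_raw (payload VARCHAR(MAX)) DISTSTYLE EVEN;")

    conn.commit()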

Sort Key

  • Sort keys define the order in which the data will be stored.
  • Sorting enables efficient handling of range-restricted predicates
  • Only one sort key per table can be defined, but it can be composed of one or more columns.
  • Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If a query uses a range-restricted predicate, the query processor can use the min and max values to rapidly skip over large numbers of blocks during table scans.
  • There are two kinds of sort keys in Redshift: compound and interleaved.

Compound Keys

  • A compound key is made up of all of the columns listed in the sort key definition, in the order they are listed.
  • A compound sort key is more efficient when query predicates use a prefix of the sort key, i.e. when the query's filters and joins apply conditions on a subset of the sort key columns, in order.
  • Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY and ORDER BY.

Interleaved Sort Keys

  • An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
  • An interleaved sort key is more efficient when multiple queries use different columns for filters
  • Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.
  • Use cases involve performing ad-hoc multi-dimensional analytics, which often requires pivoting, filtering and grouping data using different columns as query dimensions.
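
A sketch of the two sort key flavours in DDL, reusing the hypothetical psycopg2 connection from the distribution-style example above (table and column names are illustrative):

    # Compound sort key: efficient when filters/joins use a prefix of (event_time, device_id)
    cur.execute("""
        CREATE TABLE events_by_time (
            event_time TIMESTAMP,
            device_id  BIGINT,
            event_type VARCHAR(32)
        )
        DISTSTYLE EVEN
        COMPOUND SORTKEY (event_time, device_id);
    """)

    # Interleaved sort key: equal weight to each column, for ad-hoc filtering on any of them.
    # Note: no monotonically increasing column (like a timestamp) is included.
    cur.execute("""
        CREATE TABLE events_adhoc (
            device_id  BIGINT,
            event_type VARCHAR(32),
            country    VARCHAR(2),
            payload    VARCHAR(MAX)
        )
        INTERLEAVED SORTKEY (device_id, event_type, country);
    """)
    conn.commit()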

Constraints

  • Redshift supports UNIQUE, PRIMARY KEY, and FOREIGN KEY constraints; however, they are informational only.
  • Redshift does not perform integrity checks for these constraints; they are used by the query planner, as hints, in order to optimize executions.
  • Redshift does enforce NOT NULL column constraints.

Redshift Workload Management

  • Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries
  • Redshift provides query queues, in order to manage concurrency and resource planning. Each queue can be configured with the following parameters:
    • Slots: number of concurrent queries that can be executed in this queue.
    • Working memory: percentage of memory assigned to this queue.
    • Max. Execution Time: the amount of time a query is allowed to run before it is terminated.
  • Queries can be routed to different queues using query groups and user groups. As a rule of thumb, it is considered a best practice to have separate queues for long-running, resource-intensive queries and for fast queries that don't require big amounts of memory and CPU.
  • By default, Amazon Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one. A maximum of eight queues can be defined, with each queue configured with a maximum concurrency level of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.
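
A sketch of what a two-queue WLM setup could look like when applied through boto3; the parameter group name and queue values are assumptions, and the JSON property names reflect my understanding of the wlm_json_configuration parameter, so verify them against the documentation:

    import json
    import boto3

    redshift = boto3.client('redshift')

    wlm_config = [
        {   # long-running, resource-intensive queries routed via a query group
            'query_group': ['etl'],
            'query_concurrency': 3,
            'memory_percent_to_use': 60,
            'max_execution_time': 0,        # 0 = no timeout
        },
        {   # short, fast reporting queries routed via a user group
            'user_group': ['reporting'],
            'query_concurrency': 10,
            'memory_percent_to_use': 40,
            'max_execution_time': 60000,    # milliseconds
        },
    ]

    redshift.modify_cluster_parameter_group(
        ParameterGroupName='my-wlm-parameter-group',   # hypothetical parameter group
        Parameters=[{
            'ParameterName': 'wlm_json_configuration',
            'ParameterValue': json.dumps(wlm_config),
        }],
    )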


AWS Certification Exam Practice Questions

  1. A Redshift data warehouse has different user teams that need to query the same table with very different query types. These user teams are experiencing poor performance. Which action improves performance for the user teams in this situation?
    1. Create custom table views.
    2. Add interleaved sort keys per team.
    3. Maintain team-specific copies of the table.
    4. Add support for workload management queue hopping.

AWS Redshift Best Practices

AWS Redshift Best Practices

Designing Tables

Sort Key Selection

  • Redshift stores the data on disk in sorted order according to the sort key, which helps the query optimizer determine optimal query plans.
  • If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.
    • Queries are more efficient because they can skip entire blocks that fall outside the time range.
  • If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.
    • Amazon Redshift can skip reading entire blocks of data for that column. It can do so because it tracks the minimum and maximum column values stored on each block and can skip blocks that don’t apply to the predicate range.
  • If you frequently join a table, specify the join column as both the sort key and the distribution key.
    • Doing this enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.

Distribution Style selection

  • Distribute the fact table and one dimension table on their common columns.
    • Your fact table can have only one distribution key. Any tables that join on another key aren’t collocated with the fact table.
    • Choose one dimension to collocate based on how frequently it is joined and the size of the joining rows.
    • Designate both the dimension table’s primary key and the fact table’s corresponding foreign key as the DISTKEY.
  • Choose the largest dimension based on the size of the filtered dataset.
    • Only the rows that are used in the join need to be distributed, so consider the size of the dataset after filtering, not the size of the table.
  • Choose a column with high cardinality in the filtered result set.
    • If you distribute a sales table on a date column, for example, you should probably get fairly even data distribution, unless most of your sales are seasonal.
    • However, if you commonly use a range-restricted predicate to filter for a narrow date period, most of the filtered rows occur on a limited set of slices and the query workload is skewed.
  • Change some dimension tables to use ALL distribution.
    • If a dimension table cannot be collocated with the fact table or other important joining tables, query performance can be improved significantly by distributing the entire table to all of the nodes.
    • Using ALL distribution multiplies storage space requirements and increases load times and maintenance operations.

Other Practices

  • Automatic compression produces the best results
  • The COPY command analyzes the data and applies compression encodings to an empty table automatically as part of the load operation
  • Define primary key and foreign key constraints between tables wherever appropriate. Even though they are informational only, the query optimizer uses those constraints to generate more efficient query plans.
  • Don’t use the maximum column size for convenience.

Loading Data

  • You can load data into the tables using the following methods:
    • Using Multi-Row INSERT
    • Using Bulk INSERT
    • Using COPY command
    • Staging tables
  • Copy Command
    • COPY command loads data in parallel from S3, EMR, DynamoDB, or multiple data sources on remote hosts.
    • COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well.
    • Use a Single COPY Command to Load from Multiple Files
    • DON’T use multiple concurrent COPY commands to load one table from multiple files as Redshift is forced to perform a serialized load, which is much slower.
  • Split the Load Data into Multiple Files
    • divide the data into multiple files of equal size (between 1 MB and 1 GB after compression)
    • number of files to be a multiple of the number of slices in the cluster
    • helps to distribute workload uniformly in the cluster.
  • Use a Manifest File
    • S3 provides eventual consistency for some operations, so it is possible that new data will not be available immediately after the upload, which could result in an incomplete data load or loading stale data.
    • Data consistency can be managed using a manifest file to load data.
    • A manifest file helps specify different S3 locations in a more efficient way than with the use of S3 prefixes.
  • Compress Your Data Files
    • Individually compress the load files using gzip, lzop, bzip2, or Zstandard for large datasets
    • Avoid using compression if you have a small amount of data, because the benefit of compression would be outweighed by the processing cost of decompression.
    • If the priority is to reduce the time spent by COPY commands, use LZO compression. On the other hand, if the priority is to reduce the size of the files in S3 and the network bandwidth, use BZIP2 compression.
  • Load Data in Sort Key Order
    • Load your data in sort key order to avoid needing to vacuum.
    • As long as each batch of new data follows the existing rows in the table, the data will be properly stored in sort order, and you will not need to run a vacuum.
    • Presorting rows is not needed in each load because COPY sorts each batch of incoming data as it loads.
  • Load Data using IAM role
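
Putting the loading practices together, here is a sketch of a single manifest-based COPY of gzipped, pre-split files using an IAM role, again issued through psycopg2 (the cluster endpoint, bucket, manifest, role ARN, and table are placeholders):

    import os
    import psycopg2

    conn = psycopg2.connect(host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
                            port=5439, dbname='dev', user='awsuser',
                            password=os.environ['REDSHIFT_PASSWORD'])
    cur = conn.cursor()

    # One COPY command for all the files listed in the manifest; Redshift loads them
    # in parallel across the slices and applies automatic compression on an empty table.
    cur.execute("""
        COPY sales
        FROM 's3://my-load-bucket/sales/load.manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        MANIFEST
        GZIP
        FORMAT AS CSV;
    """)
    conn.commit()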

Designing Queries

  • Avoid using select *. Include only the columns you specifically need.
  • Use a CASE Expression to perform complex aggregations instead of selecting from the same table multiple times.
  • Don’t use cross-joins unless absolutely necessary
  • Use subqueries in cases where one table in the query is used only for predicate conditions and the subquery returns a small number of rows (less than about 200).
  • Use predicates to restrict the dataset as much as possible.
  • In the predicate, use the least expensive operators that you can.
  • Avoid using functions in query predicates.
  • If possible, use a WHERE clause to restrict the dataset.
  • Add predicates to filter tables that participate in joins, even if the predicates apply the same filters.
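
For example, a single CASE-based aggregation with a range-restricted predicate avoids scanning the same table twice; the table and column names are illustrative, and the query is run with the same psycopg2 cursor as the earlier Redshift sketches:

    # One pass over the table instead of two separate SELECTs per channel
    query = """
        SELECT
            SUM(CASE WHEN channel = 'web'    THEN amount ELSE 0 END) AS web_sales,
            SUM(CASE WHEN channel = 'mobile' THEN amount ELSE 0 END) AS mobile_sales
        FROM sales
        WHERE sale_date BETWEEN '2019-01-01' AND '2019-01-31';
    """
    cur.execute(query)
    print(cur.fetchone())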


AWS Certification Exam Practice Questions

  1. An administrator needs to design a strategy for the schema in a Redshift cluster. The administrator needs to determine the optimal distribution style for the tables in the Redshift schema. In which two circumstances would choosing EVEN distribution be most appropriate? (Choose two.)
    1. When the tables are highly denormalized and do NOT participate in frequent joins.
    2. When data must be grouped based on a specific key on a defined slice.
    3. When data transfer between nodes must be eliminated.
    4. When a new table has been loaded and it is unclear how it will be joined to dimensions.
  2. An administrator has a 500-GB file in Amazon S3. The administrator runs a nightly COPY command into a 10-node Amazon Redshift cluster. The administrator wants to prepare the data to optimize performance of the COPY command. How should the administrator prepare the data?
    1. Compress the file using gz compression.
    2. Split the file into 500 smaller files.
    3. Convert the file format to AVRO.
    4. Split the file into 10 files of equal size.

AWS Kinesis Data Firehose

AWS Kinesis Data Firehose

  • Kinesis Data Firehose is a fully managed service as there is no need to write applications or manage resources
  • is a data transfer solution for delivering near real-time streaming data to destinations such as S3, Redshift, Elasticsearch Service, and Splunk.
  • is NOT real time, as it buffers incoming streaming data to a certain size or for a certain period of time before delivering it to destinations. Buffer size is in MBs and buffer interval is in seconds.
  • supports multiple producers as data sources, including a Kinesis data stream, the Kinesis Agent, the Kinesis Data Firehose API using the AWS SDK, CloudWatch Logs, CloudWatch Events, and AWS IoT
  • supports out-of-the-box data format conversion as well as custom transformation using a Lambda function to transform incoming source data and deliver the transformed data to destinations
  • supports interface VPC endpoint to keep traffic between the Amazon VPC and Kinesis Data Firehose from leaving the Amazon network. Interface VPC endpoints don’t require an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection
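
A sketch of a Firehose delivery stream buffering to S3, plus a producer call, using boto3; the stream, bucket, and role names are assumptions:

    import json
    import boto3

    firehose = boto3.client('firehose')

    # Delivery stream that buffers records and writes gzipped objects to S3
    firehose.create_delivery_stream(
        DeliveryStreamName='clickstream-to-s3',
        DeliveryStreamType='DirectPut',
        ExtendedS3DestinationConfiguration={
            'RoleARN': 'arn:aws:iam::123456789012:role/FirehoseDeliveryRole',
            'BucketARN': 'arn:aws:s3:::my-datalake-bucket',
            'Prefix': 'clickstream/',
            'BufferingHints': {'SizeInMBs': 5, 'IntervalInSeconds': 60},
            'CompressionFormat': 'GZIP',
        },
    )

    # Producers simply put records; Firehose handles buffering, batching, and delivery
    firehose.put_record(
        DeliveryStreamName='clickstream-to-s3',
        Record={'Data': (json.dumps({'user': 'u1', 'page': '/home'}) + '\n').encode('utf-8')},
    )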

Kinesis Data Firehose

Kinesis Data Streams vs Kinesis Firehose

Refer to the Kinesis Data Streams vs Kinesis Firehose blog post.



AWS Certification Exam Practice Questions

  1. A user is designing a new service that receives location updates from 3,600 rental cars every hour. The cars' location needs to be uploaded to an Amazon S3 bucket. Each location must also be checked for distance from the original rental location. Which services will process the updates and automatically scale?
    1. Amazon EC2 and Amazon EBS
    2. Amazon Kinesis Firehose and Amazon S3
    3. Amazon ECS and Amazon RDS
    4. Amazon S3 events and AWS Lambda
  2. You need to perform ad-hoc SQL queries on massive amounts of well-structured data. Additional data comes in constantly at a high velocity, and you don’t want to have to manage the infrastructure processing it if possible. Which solution should you use?
    1. Kinesis Firehose and RDS
    2. EMR running Apache Spark
    3. Kinesis Firehose and Redshift
    4. EMR using Hive
  3. Your organization needs to ingest a big data stream into their data lake on Amazon S3. The data may stream in at a rate of hundreds of megabytes per second. What AWS service will accomplish the goal with the least amount of management?
    1. Amazon Kinesis Firehose
    2. Amazon Kinesis Streams
    3. Amazon CloudFront
    4. Amazon SQS
  4. A startup company is building an application to track the high scores for a popular video game. Their Solution Architect is tasked with designing a solution to allow real-time processing of scores from millions of players worldwide. Which AWS service should the Architect use to provide reliable data ingestion from the video game into the datastore?
    1. AWS Data Pipeline
    2. Amazon Kinesis Firehose
    3. Amazon DynamoDB Streams
    4. Amazon Elasticsearch Service
  5. A company has an infrastructure that consists of machines which keep sending log information every 5 minutes. The number of these machines can run into thousands and it is required to ensure that the data can be analyzed at a later stage. Which of the following would help in fulfilling this requirement?
    1. Use Kinesis Firehose with S3 to take the logs and store them in S3 for further processing.
    2. Launch an Elastic Beanstalk application to take the processing job of the logs.
    3. Launch an EC2 instance with enough EBS volumes to consume the logs which can be used for further processing.
    4. Use CloudTrail to store all the logs which can be analyzed at a later stage.


AWS Kinesis Data Streams vs Kinesis Firehose

AWS Kinesis Data Streams vs Kinesis Firehose

Kinesis acts as a highly available conduit to stream messages between data producers and data consumers. Data producers can be almost any source of data: system or web log data, social network data, financial trading information, geospatial data, mobile app data, or telemetry from connected IoT devices. Data consumers will typically fall into the category of data processing and storage applications such as Apache Hadoop, Apache Storm, Amazon Simple Storage Service (S3), and Elasticsearch.

Kinesis Data Streams – Kinesis Data Streams is highly customizable and best suited for developers building custom applications or streaming data for specialized needs. However, it requires manual scaling and provisioning. Data is typically made available in a stream for 24 hours, but, for an additional cost, users can gain data availability for up to seven days.

Kinesis Firehose – Firehose handles loading data streams directly into AWS products for processing. Scaling is handled automatically, up to gigabytes per second, and it allows for batching, encrypting, and compressing. Firehose allows streaming to S3, Elasticsearch Service, or Redshift, where data can be copied for processing through additional services.
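
A sketch of the custom producer/consumer model for Kinesis Data Streams with boto3; the stream name is hypothetical, and a production consumer would normally use the KCL for checkpointing and shard load balancing:

    import json
    import boto3

    kinesis = boto3.client('kinesis')

    # Producer: the partition key determines which shard the record lands on
    kinesis.put_record(
        StreamName='vehicle-telemetry',
        PartitionKey='vehicle-42',
        Data=json.dumps({'vehicle_id': 42, 'lat': 47.61, 'lon': -122.33}).encode('utf-8'),
    )

    # Consumer: read from a single shard with a shard iterator
    shards = kinesis.describe_stream(StreamName='vehicle-telemetry')['StreamDescription']['Shards']
    iterator = kinesis.get_shard_iterator(
        StreamName='vehicle-telemetry',
        ShardId=shards[0]['ShardId'],
        ShardIteratorType='TRIM_HORIZON',
    )['ShardIterator']
    records = kinesis.get_records(ShardIterator=iterator, Limit=100)['Records']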

Kinesis Data Streams vs. Firehose


AWS Certification Exam Practice Questions

  1. Your organization needs to ingest a big data stream into their data lake on Amazon S3. The data may stream in at a rate of hundreds of megabytes per second. What AWS service will accomplish the goal with the least amount of management?
    1. Amazon Kinesis Firehose
    2. Amazon Kinesis Streams
    3. Amazon CloudFront
    4. Amazon SQS
  2. Your organization is looking for a solution that can help the business with streaming data; several services will require access to read and process the same stream concurrently. What AWS service meets the business requirements?
    1. Amazon Kinesis Firehose
    2. Amazon Kinesis Streams
    3. Amazon CloudFront
    4. Amazon SQS
  3. Your application generates a 1 KB JSON payload that needs to be queued and delivered to EC2 instances for applications. At the end of the day, the application needs to replay the data for the past 24 hours. In the near future, you also need the ability for other multiple EC2 applications to consume the same stream concurrently. What is the best solution for this?
    1. Kinesis Data Streams
    2. Kinesis Firehose
    3. SNS
    4. SQS

AWS Systems Manager Overview

AWS Systems Manager

  • provides visibility and control of the infrastructure on AWS
  • helps to view operational data from multiple AWS services and automate operational tasks across AWS resources.
  • works with managed instances, which are configured for use with Systems Manager
  • helps configure and maintain managed instances.
  • helps maintain security and compliance by scanning the managed instances and reporting on (or taking corrective action on) any policy violations it detects.
  • supported machine types include EC2 instances, on-premises servers, and virtual machines (VMs), including VMs in other cloud environments; supported operating system types include Windows Server, multiple distributions of Linux, and Raspbian.

Systems Manager Capabilities

Operations Management

Capabilities that help manage the AWS resources

  • Trusted Advisor is an online tool that provides real-time guidance to help you provision your resources following AWS best practices
  • AWS Personal Health Dashboard provides information about AWS Health events that can affect your account
  • OpsCenter provides a central location where operations engineers and IT professionals can view, investigate, and resolve operational work items (OpsItems) related to AWS resources

Actions & Change

Capabilities for taking action against or changing the AWS resources

Systems Manager Automation

  • helps automate common maintenance and deployment tasks, e.g. create and update AMIs, apply driver and agent updates, reset passwords on Windows instances, reset SSH keys on Linux instances, and apply OS patches or application updates.
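
A sketch of starting an Automation execution with boto3; AWS-RestartEC2Instance is an AWS-managed runbook, and the instance ID is a placeholder:

    import boto3

    ssm = boto3.client('ssm')

    execution = ssm.start_automation_execution(
        DocumentName='AWS-RestartEC2Instance',              # AWS-managed Automation document
        Parameters={'InstanceId': ['i-0123456789abcdef0']},
    )
    print(execution['AutomationExecutionId'])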

Maintenance Windows

  • helps set up recurring schedules for managed instances to run administrative tasks like installing patches and updates without interrupting business-critical operations.

Instances & Nodes

Capabilities for managing the EC2 instances, on-premises servers and virtual machines (VMs) in the hybrid environment, and other types of AWS resources (nodes)

Systems Manager Configuration Compliance

  • helps scan a fleet of managed instances for patch compliance and configuration inconsistencies.
  • helps collect and aggregate data from multiple AWS accounts and Regions, and then drill down into specific resources that aren’t compliant.
  • displays, by default, compliance data about Patch Manager patching and State Manager associations, but can be customized

Session Manager

  • helps manage EC2 instances through an interactive one-click browser-based shell or through the AWS CLI.
  • provides secure and auditable instance management without the need to open inbound ports, maintain bastion hosts, or manage SSH keys.
  • helps comply with corporate policies that require controlled access to instances, strict security practices, and fully auditable logs with instance access details, while still providing end users with simple one-click cross-platform access to the EC2 instances.

Systems Manager Run Command

  • helps to remotely and securely manage the configuration of the managed instances at scale.
  • helps perform on-demand changes like updating applications or running Linux shell scripts and Windows PowerShell commands on a target set of dozens or hundreds of instances.
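
A sketch of Run Command targeting instances by tag with boto3; the tag and commands are placeholders:

    import boto3

    ssm = boto3.client('ssm')

    response = ssm.send_command(
        Targets=[{'Key': 'tag:Environment', 'Values': ['production']}],
        DocumentName='AWS-RunShellScript',        # use AWS-RunPowerShellScript for Windows
        Parameters={'commands': ['yum -y update', 'systemctl restart amazon-cloudwatch-agent']},
    )
    command_id = response['Command']['CommandId']  # use this to poll the command status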

Patch Manager

  • helps automate process of patching managed instances with both security related and other types of updates.
  • helps apply patches for both operating systems and applications. (On Windows Server, application support is limited to updates for Microsoft applications.)
  • enables scanning of instances for missing patches and applies them individually or to large groups of instances by using EC2 instance tags.
  • uses patch baselines, which can include rules for auto-approving patches within days of their release, as well as a list of approved and rejected patches.
  • helps install security patches on a regular basis by scheduling patching to run as a Systems Manager maintenance window task.

Systems Manager Inventory

  • provides visibility into your Amazon EC2 and on-premises computing environment
  • collects metadata from the managed instances about applications, files, components, patches, and more

Systems Manager State Manager

  • helps automate the process of keeping the managed instances in a defined state.
  • helps ensure that the instances are bootstrapped with specific software at startup, joined to a Windows domain (Windows instances only), or patched with specific software updates.

Shared Resources

Capabilities for managing and configuring the AWS resources

Systems Manager document (SSM document)

  • defines the actions that Systems Manager performs.
  • SSM document types include 
    • Command documents, which are used by State Manager and Run Command, and 
    • Automation documents, which are used by Systems Manager Automation.

Parameter Store

  • provides secure, hierarchical storage for configuration data and secrets management.
  • can store data such as passwords, database strings, and license codes as parameter values.
  • supports values as plain text or encrypted data, referenced by using the specified unique name
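
A sketch of storing and reading an encrypted secret with boto3; the parameter name and value are placeholders:

    import boto3

    ssm = boto3.client('ssm')

    # Store the secret as a SecureString (encrypted with the default AWS-managed KMS key)
    ssm.put_parameter(
        Name='/prod/orders-db/connection-string',
        Value='postgresql://app_user:app_password@orders-db.internal:5432/orders',
        Type='SecureString',
        Overwrite=True,
    )

    # Read it back, decrypted, at runtime (e.g. from a Lambda function)
    secret = ssm.get_parameter(
        Name='/prod/orders-db/connection-string',
        WithDecryption=True,
    )['Parameter']['Value']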

Systems Manager Agent

  • is software that can be installed and configured on an EC2 instance, an on-premises server, or a virtual machine (VM)
  • makes it possible for Systems Manager to update, manage, and configure these resources
  • must be installed on each instance to use with Systems Manager
  • usually comes preinstalled on a lot of Amazon Machine Images (AMIs), while it must be installed manually on other AMIs, and on on-premises servers and virtual machines for a hybrid environment


AWS Certification Exam Practice Questions

  1. Which of the following tools from AWS allows the automatic collection of software inventory from EC2 instances and helps apply OS patches?
    1. AWS Code Deploy 
    2. Systems Manager
    3. EC2 AMI’s
    4. AWS Code Pipeline
  2. A Developer is writing several Lambda functions that each access data in a common RDS DB instance. They must share a connection string that contains the database credentials, which are a secret. A company policy requires that all secrets be stored encrypted. Which solution will minimize the amount of code the Developer must write?
    1. Use common DynamoDB table to store settings
    2. Use AWS Lambda environment variables
    3. Use Systems Manager Parameter Store secure strings
    4. Use a table in a separate RDS database
  3. A company has a fleet of EC2 instances and needs to remotely execute scripts for all of the instances. Which Amazon EC2 systems Manager feature allows this?
    1. Systems Manager Automation
    2. Systems Manager Run Command
    3. Systems Manager Parameter Store
    4. Systems Manager Inventory
  4. As a part of a compliance check, it was found that EC2 instances launched by the deployment team were not compliant with the latest security patches. The team had tagged all the resources. Which AWS service can help make the instances compliant?
    1. AWS Inspector
    2. AWS GuardDuty
    3. AWS Systems Manager
    4. AWS Shield


AWS Certified Solution Architect – Professional (SAP-C01) Exam Learning Path

AWS Certified Solutions Architect – Professional (SAP-C01) Exam Learning Path

AWS Certified Solutions Architect – Professional (SAP-C01) exam is the upgraded pattern of the previous Solutions Architect – Professional exam, which was released last year (2018) and upgraded this year. I recently passed the latest pattern, and the difference between the previous and the latest pattern is quite significant. The amount of overlap between the associate and professional exams, and even between the Solutions Architect and DevOps exams, has drastically reduced.

The AWS Certified Solutions Architect – Professional (SAP-C01) exam basically validates the ability to

  • Design and deploy dynamically scalable, highly available, fault-tolerant, and reliable applications on AWS
  • Select appropriate AWS services to design and deploy an application based on given requirements
  • Migrate complex, multi-tier applications on AWS
  • Design and deploy enterprise-wide scalable operations on AWS
  • Implement cost-control strategies

Refer to the AWS Certified Solutions Architect – Professional Exam Guide for details

AWS Certified Solutions Architect – Professional (SAP-C01) Exam Summary

  • AWS Certified Solutions Architect – Professional (SAP-C01) exam was for a total of 170 minutes and had 75 questions. The questions and answer options are quite long and there is a lot of reading to be done, so be sure you are prepared and manage your time well. As always, mark the questions for review, move on, and come back to them after you are done with all the others.
  • One of the key tactics I followed when solving any question was to read the question, use paper and pencil to draw a rough architecture, and focus on the key requirements. Trust me, you will be able to eliminate two answers for sure and then need to focus on only the other two. Read the remaining two answers to check where they differ; that would help you reach the right answer or at least have a 50% chance of getting it right.
  • AWS Certified Solutions Architect – Professional (SAP-C01) focuses a lot on concepts and services related to Scalability, High Availability, Disaster Recovery, Migration, Security and Cost Control.
  • Be sure to cover the following topics
    • Analytics
      • Understand Kinesis
        • Understand the difference between Kinesis Data Streams and Kinesis Firehose
      • Know Amazon Elasticsearch provides a managed solution
    • Integration Tools
      • Understand SQS in terms of loose coupling and scaling.
      • Know how CloudWatch integration with SNS and Lambda can help with notifications

AWS Certified Solutions Architect – Professional (SAP-C01) Exam Resources