AWS CloudFormation Best Practices

  • AWS CloudFormation Best Practices are based on real-world experience from current AWS CloudFormation customers
  • AWS CloudFormation Best Practices help provide guidelines on
    • how to plan and organize stacks,
    • create templates that describe resources and the software applications that run on them,
    • and manage stacks and their resources

Required Mainly for Developer, SysOps Associate & DevOps Professional Exam

Planning and Organizing

Shorten the Feedback Loop to Improve Development Velocity

  • Adopt practices and tools that help shorten the feedback loop for infrastructure described with CloudFormation templates.
  • Perform early linting and testing of templates in your workstation to discover potential syntax and configuration issues before submitting to a source code repository.
  • Use CloudFormation Linter (cfn-lint) to validate templates against the CloudFormation Resource Specification, including checking valid values for resource properties and best practices.
  • Use TaskCat to test templates by programmatically creating stacks in the AWS Regions you choose, generating pass/fail reports per Region.
  • Integrate cfn-lint in your source code repository for pre-commit validation of templates.
  • Use the CloudFormation Language Server (launched 2025) in your IDE via AWS Toolkit for context-aware auto-completion, built-in validation, and drift-aware deployment views.

Organize Your Stacks By Lifecycle and Ownership

  • Use the lifecycle and ownership of the AWS resources to help you decide what resources should go in each stack.
  • By grouping resources with common lifecycles and ownership, owners can make changes to their set of resources by using their own process and schedule without affecting other resources.
  • Use two common frameworks for organizing stacks: a multi-layered architecture (horizontal layers with dependencies) or a service-oriented architecture (SOA) (self-contained services wired together).
  • For example, consider an application that uses web and database instances. The web and database tiers have different lifecycles, and ownership usually lies with different teams. Maintaining both in a single stack would require communication and coordination between those teams, which introduces complexity. It is better to have separate stacks owned by the respective teams, so that each team can update its resources without impacting the other team’s stack.

Use Cross-Stack References to Export Shared Resources

  • With multiple stacks, there is usually a need to refer to values and resources across stacks.
  • Use cross-stack references to export resources from a stack so that other stacks can use them.
  • CloudFormation provides two approaches:
    • Fn::ImportValue – Import values that another stack has explicitly exported. Creates a strong reference within the same account and Region. CloudFormation prevents deleting the exporting stack while other stacks depend on its exports.
    • Fn::GetStackOutput – Reference any stack output directly, including outputs from stacks in other AWS accounts or Regions, without requiring explicit exports. Creates a weak reference resolved at create or update time.
  • For example, a web stack typically needs resources from the network stack, such as the VPC and subnets.
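
The export and import pattern described above can be sketched as two template bodies built as Python dictionaries. This is a minimal illustration, not a complete deployment: the resource names (Vpc, WebSecurityGroup), the CIDR block, and the stack name network-prod are assumptions made up for the example.

```python
import json

# Hypothetical network stack: exports the VPC ID under a stack-scoped name.
network_template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "Vpc": {
            "Type": "AWS::EC2::VPC",
            "Properties": {"CidrBlock": "10.0.0.0/16"},  # assumed CIDR
        }
    },
    "Outputs": {
        "VpcId": {
            "Value": {"Ref": "Vpc"},
            # Prefixing with the stack name helps keep export names unique.
            "Export": {"Name": {"Fn::Sub": "${AWS::StackName}-VpcId"}},
        }
    },
}


def web_template(network_stack_name):
    """Build a web stack template that imports the VPC ID exported above."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "WebSecurityGroup": {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": "Web tier security group",
                    "VpcId": {"Fn::ImportValue": f"{network_stack_name}-VpcId"},
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(web_template("network-prod"), indent=2))
```

Because Fn::ImportValue creates a strong reference, the network stack in this sketch could not be deleted while the web stack still imports its export.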

Use CloudFormation StackSets for Multi-Account and Multi-Region Deployments

  • CloudFormation StackSets extend the capability of stacks by enabling you to create, update, or delete stacks across multiple accounts and Regions with a single operation.
  • Use StackSets for deploying common infrastructure components, compliance controls, or shared services across your organization.
  • Implement service-managed permissions with AWS Organizations for simplified permission management without manually configuring IAM roles in each account.
  • StackSets Deployment Ordering (2025): Supports defining the sequence in which stack instances automatically deploy across accounts and regions using the DependsOn parameter (up to 10 dependencies per stack instance).

Use IAM to Control Access

  • Use IAM to control access to
    • what AWS CloudFormation actions users can perform, such as viewing stack templates, creating stacks, or deleting stacks
    • what actions CloudFormation can perform on resources on their behalf
  • Remember, having access to CloudFormation does not give a user access to AWS resources. That access must be granted separately.
  • To separate permissions between a user and the AWS CloudFormation service, use a service role. AWS CloudFormation uses the service role’s policy to make calls instead of the user’s policy.
  • Apply the principle of least privilege – Grant only the permissions necessary for the intended functionality, and avoid using wildcard permissions.
  • Use IAM Access Analyzer to review permissions granted to CloudFormation service roles and identify unused permissions.

Verify Quotas for All Resource Types

  • Ensure that the stack can create all the required resources without hitting AWS account quotas.
  • By default, you can only launch 2000 CloudFormation stacks per Region in your AWS account.

Reuse Templates to Replicate Stacks in Multiple Environments

  • Reuse templates to replicate infrastructure in multiple environments
  • Use parameters, mappings, and conditions sections to customize and make templates reusable
  • For example, create the same stack in development, staging, and production environments with different instance types, instance counts, etc.
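
As a rough sketch of how parameters, mappings, and conditions combine in one reusable template, the body below is written as a Python dictionary that can be serialized to JSON. The parameter name EnvType, the mapping EnvSettings, the instance types, the placeholder AMI ID, and the extra DataVolume resource are all assumptions chosen for illustration.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        # One input selects the environment at stack creation time.
        "EnvType": {
            "Type": "String",
            "AllowedValues": ["dev", "staging", "prod"],
            "Default": "dev",
        }
    },
    "Mappings": {
        # Per-environment settings looked up with Fn::FindInMap.
        "EnvSettings": {
            "dev": {"InstanceType": "t3.micro"},
            "staging": {"InstanceType": "t3.small"},
            "prod": {"InstanceType": "m5.large"},
        }
    },
    "Conditions": {
        "IsProd": {"Fn::Equals": [{"Ref": "EnvType"}, "prod"]}
    },
    "Resources": {
        "WebInstance": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-00000000000000000",  # placeholder AMI ID
                "InstanceType": {
                    "Fn::FindInMap": ["EnvSettings", {"Ref": "EnvType"}, "InstanceType"]
                },
            },
        },
        "DataVolume": {
            # Created only when the IsProd condition evaluates to true.
            "Type": "AWS::EC2::Volume",
            "Condition": "IsProd",
            "Properties": {
                "Size": 100,
                "AvailabilityZone": {"Fn::GetAtt": ["WebInstance", "AvailabilityZone"]},
            },
        },
    },
}

template_body = json.dumps(template, indent=2)
```

The same template_body could then be used for every environment, with only the EnvType parameter value changing between stacks.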

Use Nested Stacks to Reuse Common Template Patterns

  • Nested stacks are stacks that create other stacks.
  • Nested stacks separate out the common patterns and components to create dedicated templates for them, preventing copy pasting across stacks.
  • For example, a standard load balancer configuration can be created as a nested stack and reused by other stacks.

Use Modules to Reuse Resource Configurations

  • Modules allow packaging resource configurations for inclusion across stack templates in a transparent, manageable, and repeatable way.
  • Modules can encapsulate common service configurations and best practices as modular, customizable building blocks.
  • Modules can be for a single resource (e.g., best practices for an EC2 instance) or multiple resources (common application architecture patterns).
  • Modules can be nested into other modules for higher-level building blocks.
  • Available in the CloudFormation registry and can be used like a native resource.
  • When using a module, the template is expanded into the consuming template, allowing access to resources inside using Ref or Fn::GetAtt.

Adopt Infrastructure as Code Practices

  • Treat CloudFormation templates as code by implementing Infrastructure as Code (IaC) practices.
  • Store templates in version control systems, implement code reviews, and use automated testing.
  • Implement CI/CD pipelines using AWS CodePipeline, CodeBuild, and CodeDeploy for automated infrastructure deployments.
  • Use CloudFormation Git Sync to automatically trigger deployments whenever a tracked Git repository is updated, with support for Pull Request review workflows (2024).

Creating Templates

Do Not Embed Credentials in Your Templates

  • Use dynamic references in your stack template rather than embedding sensitive information.
  • Dynamic references provide a compact way to reference external values stored in other services:
    • AWS Systems Manager Parameter Store – for configuration data and secure strings
    • AWS Secrets Manager – for passwords, database credentials, API keys, and other secrets with rotation support
  • CloudFormation retrieves the value of the dynamic reference during stack and change set operations but never stores the actual reference value.
  • Use the NoEcho property to obfuscate parameter values if using input parameters. Note that NoEcho doesn’t prevent values from being logged if passed to other services.

Use AWS-Specific Parameter Types

  • For existing AWS-specific values, such as existing Virtual Private Cloud IDs or an EC2 key pair name, use AWS-specific parameter types.
  • AWS CloudFormation can quickly validate values for AWS-specific parameter types before creating your stack.
  • The CloudFormation console shows a drop-down list of valid values, eliminating the need to look up or memorize IDs.

Use Parameter Constraints

  • Use Parameter constraints to describe allowed input values so that CloudFormation catches any invalid values before creating a stack.
  • Set constraints such as minimum length, maximum length, and allowed patterns.
  • For example, constrain a database user name with a minimum and maximum length.
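
A minimal sketch of such a constrained parameter follows, together with a small local check that approximates what the constraints mean. The parameter values (length bounds, pattern) are assumptions, and the satisfies helper is hypothetical: it only illustrates the idea and is not how CloudFormation itself performs validation.

```python
import re

# Hypothetical database user name parameter with length and pattern constraints.
db_username_param = {
    "Type": "String",
    "MinLength": 1,
    "MaxLength": 16,
    "AllowedPattern": "[a-zA-Z][a-zA-Z0-9]*",
    "ConstraintDescription": "Must begin with a letter and contain only alphanumeric characters.",
    "NoEcho": True,
}


def satisfies(param, value):
    """Local approximation: check length bounds, then require a full pattern match."""
    if not param["MinLength"] <= len(value) <= param["MaxLength"]:
        return False
    return re.fullmatch(param["AllowedPattern"], value) is not None


if __name__ == "__main__":
    for candidate in ["admin1", "1admin", "", "a" * 17]:
        print(repr(candidate), satisfies(db_username_param, candidate))
```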

Use Pseudo Parameters to Promote Portability

  • Use pseudo parameters (AWS::Partition, AWS::Region, AWS::AccountId, AWS::StackName) as arguments for intrinsic functions to increase template portability across Regions and accounts.
  • Instead of hard-coding ARN values, use !Sub 'arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:parameter/MySampleParameter'.
  • Use AWS::StackName as a prefix for exports to help ensure unique export names.
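
To see why pseudo parameters make the ARN portable, the sketch below substitutes the placeholders locally for two different partitions. The render_sub helper is a hypothetical stand-in for what Fn::Sub does at deployment time, and the account ID and Regions are example values.

```python
import re

ARN_TEMPLATE = (
    "arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}"
    ":parameter/MySampleParameter"
)


def render_sub(template, values):
    """Replace each ${Name} placeholder with values[Name] (local illustration only)."""
    return re.sub(r"\$\{([^}]+)\}", lambda match: values[match.group(1)], template)


commercial_arn = render_sub(ARN_TEMPLATE, {
    "AWS::Partition": "aws",
    "AWS::Region": "us-east-1",
    "AWS::AccountId": "111122223333",  # example account ID
})

china_arn = render_sub(ARN_TEMPLATE, {
    "AWS::Partition": "aws-cn",
    "AWS::Region": "cn-north-1",
    "AWS::AccountId": "111122223333",
})

if __name__ == "__main__":
    print(commercial_arn)
    print(china_arn)
```

A hard-coded arn:aws:... string would be wrong in the second case, which is the situation the AWS::Partition pseudo parameter avoids.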

Use AWS::CloudFormation::Init to Deploy Software Applications on Amazon EC2 Instances

  • Use AWS::CloudFormation::Init resource and the cfn-init helper script to install and configure software applications on EC2 instances.
  • Describe desired configurations rather than scripting procedural steps.
  • Configurations can be updated without recreating instances.
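
The declarative style can be illustrated with a resource definition that carries AWS::CloudFormation::Init metadata, written here as a Python dictionary. This is a sketch under assumptions: the logical ID WebServer, the httpd package, the file content, the placeholder AMI ID, and the instance type are invented for the example.

```python
import json

# Hypothetical EC2 instance whose desired state is declared in metadata.
web_server = {
    "Type": "AWS::EC2::Instance",
    "Metadata": {
        "AWS::CloudFormation::Init": {
            "config": {
                # Desired state: package installed, file present, service running.
                "packages": {"yum": {"httpd": []}},
                "files": {
                    "/var/www/html/index.html": {
                        "content": "<h1>Hello from cfn-init</h1>\n",
                        "mode": "000644",
                        "owner": "root",
                        "group": "root",
                    }
                },
                "services": {
                    "sysvinit": {
                        "httpd": {"enabled": "true", "ensureRunning": "true"}
                    }
                },
            }
        }
    },
    "Properties": {
        "ImageId": "ami-00000000000000000",  # placeholder AMI ID
        "InstanceType": "t3.micro",
        "UserData": {
            "Fn::Base64": {
                "Fn::Sub": (
                    "#!/bin/bash -xe\n"
                    "yum install -y aws-cfn-bootstrap\n"
                    "/opt/aws/bin/cfn-init -v --stack ${AWS::StackName} "
                    "--resource WebServer --region ${AWS::Region}\n"
                )
            }
        },
    },
}

template = {"AWSTemplateFormatVersion": "2010-09-09", "Resources": {"WebServer": web_server}}
template_body = json.dumps(template, indent=2)
```

The UserData script only bootstraps the helper scripts and calls cfn-init; the actual configuration lives in the metadata, which is what allows it to be changed without rewriting procedural steps.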

Use the Latest Helper Scripts

  • Helper scripts are updated periodically. Include yum install -y aws-cfn-bootstrap in the UserData property before calling helper scripts.

Validate Templates Before Using Them

  • Validate templates before creating or updating a stack.
  • Validating a template helps catch syntax and some semantic errors, such as circular dependencies, before AWS CloudFormation creates any resources.
  • During validation, AWS CloudFormation first checks whether the template is valid JSON. If it is not, CloudFormation checks whether the template is valid YAML. If both checks fail, AWS CloudFormation returns a template validation error.
  • Use AWS CloudFormation Guard (cfn-guard) for policy-as-code validation to ensure templates comply with organizational policies, security best practices, and governance requirements.
  • Pre-deployment Validation (2025): CloudFormation now validates templates during change set creation, catching common errors like invalid property syntax, resource name conflicts, and S3 bucket constraints before resource provisioning begins.

Use YAML or JSON for Template Authoring

  • Use YAML when you prioritize human readability, want to include comments, or work with complex nested structures.
  • Use JSON when integrating with tools that prefer JSON, working with programmatic template generation, or requiring strict data validation.
  • YAML is generally recommended for manual template authoring due to readability and comment support.

Implement a Comprehensive Tagging Strategy

  • Implement a consistent tagging strategy for all resources created by CloudFormation templates.
  • Tags help with resource organization, cost allocation, access control, and automation.
  • Include tags for environment, owner, cost center, application, and purpose.
  • Use the stack’s Tags property to apply tags to all supported resources automatically.

Leverage Template Macros for Advanced Transformations

  • CloudFormation macros enable custom processing on templates, from find-and-replace operations to complex transformations that generate additional resources.
  • The AWS Serverless Application Model (SAM) is an example of a macro that simplifies serverless application development.

Managing Stacks

Manage All Stack Resources Through AWS CloudFormation

  • After launching the stack, any further updates should be done through CloudFormation only.
  • Making changes outside the stack can create a mismatch between the stack’s template and the current state of the stack resources (known as drift), which can cause errors if you update or delete the stack.

Create Change Sets Before Updating Your Stacks

  • Change sets provide a preview of how the proposed changes to a stack might impact the running resources before you implement them.
  • CloudFormation doesn’t make any changes to the stack until you execute the change set, allowing you to decide whether to proceed with the proposed changes or create another change set.
  • Drift-Aware Change Sets (2025): Provides a three-way comparison between your new template, last-deployed template, and actual infrastructure state to prevent unexpected overwrites of drift and help keep infrastructure in sync with templates.

Use Stack Policies

  • Stack policies help protect critical stack resources from unintentional updates that could cause resources to be interrupted or even replaced.
  • During a stack update, you must explicitly specify the protected resources that you want to update; otherwise, no changes are made to protected resources.
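
A stack policy is a JSON document, so one way to sketch it is as a Python dictionary serialized with json. In this example the logical ID ProductionDatabase is an assumption; the policy allows updates to everything except that one resource.

```python
import json

stack_policy = {
    "Statement": [
        {
            # Allow updates to all resources by default.
            "Effect": "Allow",
            "Action": "Update:*",
            "Principal": "*",
            "Resource": "*",
        },
        {
            # Deny any update to the protected resource (hypothetical logical ID).
            "Effect": "Deny",
            "Action": "Update:*",
            "Principal": "*",
            "Resource": "LogicalResourceId/ProductionDatabase",
        },
    ]
}

stack_policy_body = json.dumps(stack_policy, indent=2)

if __name__ == "__main__":
    print(stack_policy_body)
```

An explicit Deny takes precedence over the Allow, so the database in this sketch stays protected until the policy is overridden during a specific update.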

Use AWS CloudTrail to Log AWS CloudFormation Calls

  • AWS CloudTrail tracks anyone making AWS CloudFormation API calls in the AWS account.
  • API calls are logged whenever anyone uses the AWS CloudFormation API, the AWS CloudFormation console, a back-end console, or AWS CloudFormation AWS CLI commands.
  • Enable logging and specify an Amazon S3 bucket to store the logs.

Use Code Reviews and Revision Controls to Manage Your Templates

  • Using code reviews and revision controls helps track changes between different versions of your templates and changes to stack resources.
  • Maintaining history can help revert the stack to a certain version of the template.

Use Drift Detection Regularly

  • Regularly use the CloudFormation drift detection feature to identify resources that have been modified outside of CloudFormation management.
  • Detecting and resolving drift helps maintain the integrity of your infrastructure as code approach.
  • Consider implementing automated drift detection using AWS Lambda functions triggered by Amazon EventBridge rules to periodically check for drift and notify your team.
  • Use drift-aware change sets to systematically revert drift and keep infrastructure in sync with templates.

Configure Rollback Triggers for Automatic Recovery

  • Use rollback triggers to specify Amazon CloudWatch alarms that CloudFormation monitors during stack creation and update operations.
  • If any specified alarm goes into the ALARM state, CloudFormation automatically rolls back the entire stack operation.
  • Configure rollback triggers for critical metrics such as application error rates, resource utilization, or custom business metrics.

Implement Effective Stack Refactoring Strategies

  • Stack Refactoring (2025) enables reorganizing CloudFormation and CDK infrastructure without disrupting deployed resources.
  • Move resources between stacks, rename logical IDs, and decompose monolithic stacks into focused components while maintaining resource stability.
  • Use cases:
    • Splitting monolithic stacks into smaller, manageable stacks
    • Consolidating related resources from multiple stacks
    • Extracting reusable components into modules or nested stacks
    • Improving resource organization to reflect relationships and dependencies

Use CloudFormation Hooks for Lifecycle Management

  • CloudFormation Hooks provide code that proactively inspects the configuration of AWS resources before provisioning.
  • Hooks check if resources, stacks, and change sets are compliant with your organization’s security, operational, and cost optimization needs.
  • Can provide warnings before provisioning or fail the operation and stop it altogether.
  • Violations and warnings are logged in Amazon CloudWatch for visibility.
  • Managed Proactive Controls (2025): Hooks now supports managed proactive controls from the AWS Control Tower Controls Catalog, eliminating the need to write custom Hooks logic. Controls can run in warn mode for testing before enforcement.

Authoring Tools

Use IaC Generator to Create Templates from Existing Resources

  • IaC Generator (launched 2024) helps create CloudFormation templates from existing AWS resources that are managed outside CloudFormation.
  • Useful for replicating existing infrastructure, documenting manually created resources, or bringing unmanaged resources under CloudFormation management.
  • Targeted Resource Scans (2025): Supports scanning specific resource types instead of all resources, simplifying the template generation process.
  • Works with resource types supported by the Cloud Control API in your Region.

Use AWS Infrastructure Composer for Visual Template Design

  • AWS Infrastructure Composer (formerly AWS Application Composer) is a visual builder that helps create, visualize, and modify CloudFormation templates using drag-and-drop.
  • Useful for architecture planning, template modernization, training, and stakeholder communication.
  • Available in the CloudFormation console and as a VS Code extension via the AWS Toolkit.
  • Maintains a visual representation in sync with your IaC – changes in the visual canvas are reflected in the template and vice versa.

Consider Using AWS Cloud Development Kit (AWS CDK)

  • For complex infrastructure, use AWS CDK to define cloud resources using programming languages like TypeScript, Python, Java, and .NET.
  • AWS CDK generates CloudFormation templates from your code, combining CloudFormation capabilities with high-level programming constructs.
  • Provides high-level constructs that encapsulate best practices and simplify common infrastructure patterns.

Use the AWS IaC MCP Server for AI-Powered Development

  • AWS IaC MCP Server (2025) bridges AI assistants with AWS infrastructure development workflows using the Model Context Protocol (MCP).
  • Enables AI assistants to search CloudFormation and CDK documentation, validate templates, troubleshoot deployments, and follow best practices.
  • Provides remote documentation search tools and local validation/troubleshooting tools (cfn-lint, CloudFormation Guard, CloudTrail integration).

Security and Compliance

Implement Policy as Code with AWS CloudFormation Guard

  • AWS CloudFormation Guard (cfn-guard) is an open-source policy-as-code tool that allows defining and enforcing rules for CloudFormation templates.
  • Ensures templates comply with organizational policies, security best practices, and governance requirements.
  • Integrate cfn-guard into CI/CD pipelines to automatically validate templates against policy rules before deployment.
  • Includes a rulegen feature to extract rules from existing compliant CloudFormation templates.
  • Can be used with CloudFormation Hooks for proactive enforcement during provisioning.

Secure Sensitive Parameters

  • Use AWS Systems Manager Parameter Store or AWS Secrets Manager for sensitive information instead of embedding in templates.
  • Use dynamic references to securely retrieve values during stack operations:
    • {{resolve:ssm:parameter-name}} – for SSM Parameter Store values
    • {{resolve:ssm-secure:parameter-name}} – for SSM SecureString parameters
    • {{resolve:secretsmanager:secret-id}} – for Secrets Manager values
  • CloudFormation never stores the actual resolved secret value when using dynamic references.
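
The dynamic reference strings above can be assembled with small helpers, as in the sketch below. The helper names ssm_ref and secret_ref, the parameter path, and the secret ID are hypothetical; the optional SecretString:json-key suffix selects one key from a JSON secret.

```python
def ssm_ref(name, secure=False):
    """Build an SSM Parameter Store dynamic reference string."""
    service = "ssm-secure" if secure else "ssm"
    return "{{" + f"resolve:{service}:{name}" + "}}"


def secret_ref(secret_id, json_key=None):
    """Build a Secrets Manager dynamic reference, optionally selecting one JSON key."""
    body = f"resolve:secretsmanager:{secret_id}"
    if json_key is not None:
        body += f":SecretString:{json_key}"
    return "{{" + body + "}}"


# Hypothetical database properties: no credential appears in the template itself.
db_properties = {
    "Engine": "mysql",
    "MasterUsername": secret_ref("prod/db", "username"),
    "MasterUserPassword": secret_ref("prod/db", "password"),
}

if __name__ == "__main__":
    print(ssm_ref("/app/db/host"))
    print(db_properties["MasterUserPassword"])
```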

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day, and both the answers and questions might become outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion and correction.
  1. A company has deployed their application using CloudFormation. They want to update their stack. However, they want to understand how the changes will affect running resources before implementing the update. How can the company achieve this?
    1. Use CloudFormation Validate Stack feature
    2. Use CloudFormation Dry Run feature
    3. Use CloudFormation Stage feature
    4. Use CloudFormation Change Sets feature
  2. You have multiple similar three-tier applications and have decided to use CloudFormation to maintain version control and achieve automation. How can you best use CloudFormation to keep everything agile and maintain multiple environments while keeping cost down?
    1. Create multiple templates in one CloudFormation stack.
    2. Combine all resources into one template for version control and automation.
    3. Use CloudFormation custom resources to handle dependencies between stacks
    4. Create separate templates based on functionality, create nested stacks with CloudFormation.
  3. You are working as an AWS DevOps admin for your company. You are in charge of building the infrastructure for the company’s development teams using CloudFormation. The template will include building the VPC and networking components, installing a LAMP stack and securing the created resources. As per AWS best practices, what is the best way to design this template?
    1. Create a single CloudFormation template to create all the resources since it would be easier from the maintenance perspective.
    2. Create multiple CloudFormation templates based on the number of VPC’s in the environment.
    3. Create multiple CloudFormation templates based on the number of development groups in the environment.
    4. Create multiple CloudFormation templates for each set of logical resources, one for networking, and the other for LAMP stack creation.
  4. A team wants to ensure that all CloudFormation templates in their CI/CD pipeline comply with company security policies before deployment. Which approach should they use?
    1. Use CloudFormation drift detection before each deployment
    2. Manually review each template before deployment
    3. Integrate AWS CloudFormation Guard (cfn-guard) into the CI/CD pipeline to validate templates against policy rules
    4. Use CloudFormation Change Sets to validate compliance
  5. A DevOps engineer needs to detect and remediate configuration changes made to CloudFormation-managed resources outside of CloudFormation. What is the most effective approach introduced in 2025?
    1. Use standard Change Sets and manually compare with actual state
    2. Delete and recreate the stack
    3. Use Drift-Aware Change Sets that provide a three-way comparison between the new template, last-deployed template, and actual infrastructure state
    4. Use AWS Config rules to detect drift
  6. A company has manually created resources in their AWS account and wants to bring them under CloudFormation management. What is the recommended approach?
    1. Delete and recreate all resources using CloudFormation templates
    2. Use CloudFormation resource import only
    3. Use the IaC Generator to scan existing resources and generate CloudFormation templates, then import the resources
    4. Manually write CloudFormation templates for each resource
  7. An organization wants to enforce security controls on CloudFormation resource configurations before they are provisioned, without writing custom code. Which feature should they use?
    1. CloudFormation Stack Policies
    2. AWS Config rules in proactive mode
    3. CloudFormation Change Sets
    4. CloudFormation Hooks with managed proactive controls from the AWS Control Tower Controls Catalog

Google Cloud – Associate Cloud Engineer Certification learning path

The Google Cloud – Associate Cloud Engineer certification exam is aimed at those who work day in, day out with Google Cloud services. It targets a Cloud Engineer who deploys applications, monitors operations, and manages enterprise solutions. The exam covers a wide gamut of services and concepts. That said, the exam is not that tough, and the 2 hours available are plenty if you are well prepared.

Google Cloud – Associate Cloud Engineer Certification Summary

  • Has 50 questions to be answered in 2 hours.
  • Covers a wide range of Google Cloud services and what they actually do. It focuses heavily on IAM, Compute, and Storage, with a little bit of Network but hardly any data services.
  • Hands-on is a must. Covers Cloud SDK, CLI commands and Console operations that you would use for day-to-day work. If you have not worked on GCP before, make sure you do a lot of labs, or you will be absolutely clueless for some of the questions and commands.
  • Once again, be sure that NO online course or practice test is going to cover everything. I did the A Cloud Guru – LA course, which covered maybe 60-70%, but hands-on or practical knowledge is a MUST.

Google Cloud – Associate Cloud Engineer Certification Topics

General Services

  • Cloud Billing
    • understand how Cloud Billing works. Monthly vs Threshold and which has priority
    • Budgets can be set to alert for projects
    • how to change a billing account for a project and what roles you need. Hint – Project Owner and Billing Administrator for the billing account
    • Cloud Billing can be exported to BigQuery and Cloud Storage
  • Resource Manager
    • Understand the Resource Manager hierarchy: Organization -> Folders -> Projects -> Resources
    • IAM Policy inheritance is transitive and resources inherit the policies of all of their parent resources.
    • Effective policy for a resource is the union of the policy set on that resource and the policies inherited from higher up in the hierarchy.
  • Cloud SDK
    • understand gcloud commands esp. when dealing with
      • configurations i.e. gcloud config
        • activate profiles – gcloud config configurations activate
        • GKE setting default cluster i.e. gcloud config set container/cluster CLUSTER_NAME
        • set project gcloud config set project mygcp-demo
        • set region gcloud config set compute/region us-west1
        • set zone gcloud config set compute/zone us-west1-a
      • Get project list and ids gcloud projects list
      • Auth i.e gcloud auth
        • Auth login using user gcloud auth login
        • Auth login using service account gcloud auth activate-service-account --key-file=sa_key.json
      • deployment manager i.e. gcloud deployment-manager
      • VPC firewalls i.e. gcloud compute firewall-rules

Network Services

  • Virtual Private Cloud
    • Understand Virtual Private Cloud (VPC), subnets and how to host applications within them. Hint – a VPC is a global resource that spans regions, while subnets are regional
    • Understand how Firewall rules work and how they are configured. Hint – Focus on Network Tags. Also, there are 2 implicit firewall rules – default ingress deny and default egress allow
    • Understand VPC Peering and Shared VPC
    • Understand the concept of internal and external IPs and the difference between static and ephemeral IPs
    • Primary IP range of an existing subnet can be expanded by modifying its subnet mask, setting the prefix length to a smaller number.
  • Cloud Load Balancing
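
The subnet expansion hint above (a smaller prefix length means a larger range) can be checked with Python’s standard ipaddress module. The /24 starting range and the /20 target are example values, not taken from the exam guide; on GCP the actual change would be made with gcloud compute networks subnets expand-ip-range.

```python
import ipaddress

# Example primary range of an existing subnet (assumed values).
current = ipaddress.ip_network("10.0.0.0/24")

# Expanding means choosing a smaller prefix length that still contains the old range.
expanded = current.supernet(new_prefix=20)

if __name__ == "__main__":
    print(current, current.num_addresses)
    print(expanded, expanded.num_addresses)
    print(current.subnet_of(expanded))
```

The original range remains fully contained in the expanded one, which is why existing addresses in the subnet keep working after the expansion.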

Identity Services

  • Identity and Access Management – IAM 
    • Identity and Access Management – IAM provides administrators the ability to manage cloud resources centrally by controlling who can take what action on specific resources.
    • Understand how IAM works and how rules apply esp. the hierarchy from Organization -> Folder -> Project -> Resources
    • Understand the difference between Primitive, Pre-defined and Custom roles and their use cases
    • IAM Policy inheritance is transitive and resources inherit the policies of all of their parent resources.
    • Effective policy for a resource is the union of the policy set on that resource and the policies inherited from higher up in the hierarchy.
    • Basically  Permissions -> Roles -> (IAM Policy) -> Members
    • Need to know and understand the roles for at least the following services
      • Cloud Storage – Admin vs Creator vs Viewer
      • Compute Engine – Admin vs Instance Admin
      • Spanner – Viewer vs Database User
      • BigQuery – User vs JobUser
    • Know how to copy roles to different projects or organization. Hint – gcloud iam roles copy
    • Know how to use service accounts with applications
  • Cloud Identity
    • Cloud Identity provides IDaaS (Identity as a Service), with single sign-on functionality and federation with external identity providers like Active Directory.
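
The IAM inheritance rule stated above (the effective policy is the union of the policy set on a resource and the policies inherited from its ancestors) can be sketched as a simple merge of role bindings. The effective_policy helper, the member names, and the simplified role-to-members data model are assumptions for illustration, not the real IAM API.

```python
def effective_policy(*policies):
    """Union the role -> members bindings of a resource and all of its ancestors."""
    merged = {}
    for policy in policies:
        for role, members in policy.items():
            merged.setdefault(role, set()).update(members)
    return merged


# Hypothetical bindings at each level of the hierarchy.
organization = {"roles/viewer": {"group:auditors@example.com"}}
folder = {"roles/compute.admin": {"user:alice@example.com"}}
project = {
    "roles/viewer": {"user:bob@example.com"},
    "roles/storage.objectViewer": {"serviceAccount:app@example.iam.gserviceaccount.com"},
}

project_effective = effective_policy(organization, folder, project)

if __name__ == "__main__":
    for role in sorted(project_effective):
        print(role, sorted(project_effective[role]))
```

Because the result is a union, a binding granted at the organization or folder level cannot be removed by a policy set lower in the hierarchy in this model.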

Compute Services

  • Make sure you know all the compute services Google Compute Engine, Google App Engine and Google Kubernetes Engine, they are heavily covered in the exam.
  • Google Compute Engine
    • Google Compute Engine is the best IaaS option for compute and provides fine grained control
    • Know how to create a Compute Engine instance, connect to it using Cloud shell or ssh keys
    • Difference between backups and images and how to create instances from the same.
    • Instance templates with managed instance groups. Instance template cannot be edited, create a new one and attach.
    • Difference between managed vs unmanaged instance groups and auto-healing feature
    • Preemptible VMs and their use cases. HINT – can be terminated any time and supports max 24 hours.
    • Upgrade an instance without downtime using Live Migration
    • Managing access using OS Login or project and instance metadata
    • Prevent accidental deletion using deletion protection flag
    • In case of any issues or errors, how to debug the same
  • Google App Engine
    • Google App Engine is the best PaaS option, given the platforms it supports and the features it provides.
    • Deploy an application with App Engine and understand how versioning and rolling deployments can be done
    • Understand how to configure auto scaling, traffic splitting, and traffic migration.
    • Know that App Engine is a regional resource and understand the steps to migrate or deploy an application to a different region and project.
    • Know the difference between App Engine Flexible vs Standard
  • Google Kubernetes Engine
    • Google Container Engine is now officially Google Kubernetes Engine and the questions refer to the same
    • Google Kubernetes Engine, powered by the open source container scheduler Kubernetes, enables you to run containers on Google Cloud Platform.
    • Kubernetes Engine takes care of provisioning and maintaining the underlying virtual machine cluster, scaling your application, and operational logistics such as logging, monitoring, and cluster health management.
    • Be sure to Create a Kubernetes Cluster and configure it to host an application
    • Understand how to make the cluster auto repairable and upgradable. Hint – Node auto-upgrades and auto-repairing feature
    • Very important to understand where to use gcloud commands (to create a cluster) and kubectl commands (manage the cluster components)
    • Very important to understand how to increase cluster size and enable autoscaling for the cluster
    • know how to manage secrets like database passwords

Storage Services

  • Understand each storage service option and its use cases.
  • Cloud Storage
    • Cloud Storage is cost-effective object storage for unstructured data.
    • very important to know the different storage classes and their use cases esp. Regional and Multi-Regional (frequent access), Nearline (monthly access) and Coldline (yearly access)
    • Understand lifecycle management. HINT – changes are applied in accordance with the object creation date
    • Understand Signed URL to give temporary access and the users do not need to be GCP users
    • Understand access control and permissions – IAM vs ACLs (fine grained control)
    • Understand best practices esp. uploading and downloading the data. HINT using parallel composite uploads
  • Relational Databases
    • Cloud SQL
      • Cloud SQL is a fully-managed service that provides MySQL, PostgreSQL and MS SQL Server
      • limited to 64TB (previously 10TB) and is a regional service.
      • Difference between Failover and Read replicas. Failover provides High Availability and almost zero downtime while Read replicas provide scalability. Cross region Read Replicas are supported
      • Perform Point-In-Time recovery. Hint – requires binary logging and backups
    • Cloud Spanner
      • is a fully managed, mission-critical relational database service.
      • provides a scalable online transaction processing (OLTP) database with high availability and strong consistency at global scale.
      • globally distributed and can scale and handle more than 10TB.
      • not a direct replacement and would need migration
    • There are no direct options for Oracle yet.
  • Data Warehousing
    • BigQuery
      • provides scalable, fully managed enterprise data warehouse (EDW) with SQL and fast ad-hoc queries.
      • Remember it is most suitable for historical analysis.
      • know how to perform a preview or dry run. Hint – price is determined by bytes read not bytes returned.
      • supports federated tables or external tables that can support Cloud Storage, BigTable, Google Drive and Cloud SQL.
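
As a rough illustration of age-based lifecycle management for Cloud Storage, the sketch below builds a lifecycle configuration document in Python. The day thresholds and the helper name `lifecycle_config` are assumptions chosen for the example, not values taken from these notes.

```python
import json


def lifecycle_config(nearline_after_days: int, coldline_after_days: int,
                     delete_after_days: int) -> dict:
    """Build a lifecycle configuration whose rules are keyed on object age in days."""
    if not nearline_after_days < coldline_after_days < delete_after_days:
        raise ValueError("thresholds must be strictly increasing")
    return {
        "rule": [
            {   # Frequently accessed classes move to Nearline once old enough.
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {
                    "age": nearline_after_days,
                    "matchesStorageClass": ["STANDARD", "REGIONAL", "MULTI_REGIONAL"],
                },
            },
            {   # Nearline objects move on to Coldline later.
                "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
                "condition": {
                    "age": coldline_after_days,
                    "matchesStorageClass": ["NEARLINE"],
                },
            },
            {   # Finally, very old objects are deleted.
                "action": {"type": "Delete"},
                "condition": {"age": delete_after_days},
            },
        ]
    }


if __name__ == "__main__":
    print(json.dumps(lifecycle_config(30, 365, 730), indent=2))
```

The printed JSON could be saved to a file and applied to a bucket with `gsutil lifecycle set`, assuming the thresholds suit the data's access pattern.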

Data Services

  • Although there were only a couple of references to big data services in the exam, it is important to know (DO NOT DEEP DIVE) the Big Data stack (esp. IoT gateway, Pub/Sub, Bigtable vs BigQuery) to understand which service fits the different layers of ingest, store, process, analytics, use
    • Cloud Storage as the medium to store data as data lake
    • Cloud Pub/Sub as the messaging service to capture real time data esp. IoT
    • Cloud Pub/Sub is designed to provide reliable, many-to-many, asynchronous messaging between applications esp. real time IoT data capture
    • Cloud Dataflow to process, transform, transfer data and the key service to integrate store and analytics.
    • Cloud BigQuery for storage and analytics. Remember BigQuery provides the same cost-effective option for storage as Cloud Storage
    • Cloud Dataprep to clean and prepare data. Hint – It can be used for anomaly detection.
    • Cloud Dataproc to handle existing Hadoop/Spark jobs. Hint – Use it to replace existing hadoop infra.
    • Cloud Datalab is an interactive tool for exploration, transformation, analysis and visualization of your data on Google Cloud Platform

Monitoring

  • Google Cloud Monitoring or Stackdriver
    • provides everything from monitoring, alert, error reporting, metrics, diagnostics, debugging, trace.
    • remember audits are mainly checking Stackdriver

DevOps services

  • Deployment Manager 
  • Google Marketplace (Cloud Launcher)
    • provides a way to launch common software packages e.g. Jenkins or WordPress and stacks on Google Compute Engine with just a few clicks like a prepackaged solution.
    • It can help minimize deployment time and can be used without any knowledge about the product

Google Cloud – Associate Cloud Engineer Certification Resources

Google Cloud – Professional Data Engineer Certification learning path

I just recertified my Google Cloud Certified – Professional Data Engineer certification. My first attempt at the Data Engineer exam was 2 long years ago, and it lasted 4 hours with 95 questions. Once again, similar to the other Google Cloud certification exams, the Data Engineer exam covers not only the gamut of services and concepts but also focuses on logical thinking and practical experience.

Google Cloud – Professional Cloud Data Engineer Certification Summary

  • Cloud Data Engineer exam had 50 questions to be answered in 2 hours
  • Covers a wide range of data services including machine learning, with other topics covering storage and security.
  • Exam does not cover any case studies
  • Although the exam covers the latest services, it has not been updated for Cloud Monitoring and Logging and still refers to Stackdriver.
  • Nothing much on Compute and Network is covered
  • Questions sometimes test your logical thinking rather than any concept regarding Google Cloud.
  • Hands-on is a MUST, if you have not worked on GCP before make sure you do lots of labs else you would be absolutely clueless about some of the questions and commands
  • Be sure that NO Online Course or Practice test is going to cover everything. I did Coursera and LinuxAcademy, which is really vast, but hands-on or practical knowledge is a MUST.

Google Cloud – Professional Cloud Data Engineer Certification Resources

Google Cloud – Professional Cloud Data Engineer Certification Topics

Data & Analytics Services

  • Obviously, there are lots and lots of data and related services
  • Google Cloud Data & Analytics Services Cheatsheet
  • Know the Big Data stack and understand which service fits the different layers of ingest, store, process, analytics
  • Cloud BigQuery
    • provides scalable, fully managed enterprise data warehouse (EDW) with SQL and fast ad-hoc queries.
    • ideal for storage and analytics.
    • provides the same cost-effective option for storage as Cloud Storage
    • understand BigQuery Security
      • use BigQuery IAM access roles to control data and querying access
      • use Authorized views to access control tables, columns within tables, and query results. HINT: Authorized views need to reside in a different dataset as compared to the source dataset.
      • support data encryption
    • understand BigQuery Best Practices including key strategy, cost optimization, partitioning, and clustering
      • use dry run to estimate costs
      • use partitioning and clustering to limit the amount of data scanned
      • using external data sources might degrade query performance, so it is better to import the data
    • Dataset location can be set ONLY at the time of its creation.
    • supports schema auto-detection for JSON and CSV files.
    • understand how BigQuery Streaming works
    • know BigQuery limitations esp. with updates and inserts
    • supports an external data source (federated data source)
      • which is a data source that can be queried directly even though the data is not stored in BigQuery.
      • offers support for querying data directly from:
        • Cloud Bigtable
        • Cloud Storage
        • Google Drive
        • Cloud SQL
      • Use Permanent table for querying an external data source multiple times
      • Use Temporary table for querying an external data source for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.
  • Cloud Bigtable
    • provides column database suitable for both low-latency single-point lookups and precalculated analytics
    • understand Bigtable is not for long term storage as it is quite expensive
    • know the differences with HBase
    • Know how to measure performance and scale
    • supports Development and Production mode. Development mode can be upgraded to production and not vice versa.
    • supports HDD and SSD storage during cluster creation. HDD can be converted to SSD by exporting the data to the new instance.
    • understand Bigtable Replication. Can be used to separate real-time and batch workloads on the same instance using application profiles.
  • Cloud Pub/Sub
    • as the messaging service to capture real-time data esp. IoT
    • is designed to provide reliable, many-to-many, asynchronous messaging between applications esp. real-time IoT data capture
    • guarantees at-least-once (but not exactly once) message delivery and can result in data duplication if the message is not acknowledged within a defined time period.
    • how it compares to Kafka (HINT: provides only 7 days of retention vs Kafka which depends on the storage)
  • Cloud Dataflow
    • to process, transform, transfer data and the key service to integrate store and analytics.
    • know how to improve a Dataflow performance
    • understand Apache Beam features as well
      • understand PCollections, Transforms, ParDo and what they do
      • understand windowing, watermarks, triggers. Hint: windowing and watermarks can be used to handle delayed messages
    • supports drain feature to finish existing jobs but stop processing new ones, usually useful for deploying incompatible breaking changes
    • canceling a job will lead to an immediate stop and in-flight data loss.
  • Cloud Dataprep
    • to clean and prepare data. It can be used for anomaly detection.
    • does not need any programming language knowledge and can be done through the graphical interface
    • be sure to know or try hands-on on a dataset
  • Cloud Dataproc
    • to handle existing Hadoop/Spark jobs
    • supports connector for BigQuery, Bigtable, Cloud Storage
    • supports Ephemeral clusters; with Cloud Storage connector support, the data can be stored in GCS instead of HDFS
    • you need to know how to improve the performance of the Hadoop cluster as well :). Know how to configure the Hadoop cluster to use all the cores (hint- spark executor cores) and handle out of memory errors (hint – executor memory)
    • Secondary workers can be used to scale with the below limitations
      • Processing only with no data storage
      • No secondary-worker-only clusters
      • Persistent disk size is used for local caching of data and is not available through HDFS.
    • how to install other components (hint – initialization actions)
  • Cloud Datalab
    • is an interactive tool for exploration, transformation, analysis, and visualization of your data on Google Cloud Platform
    • based on Jupyter
  • Cloud Composer
    • fully managed workflow orchestration service, based on Apache Airflow, enabling workflow creation that spans across clouds and on-premises data centers.
    • pipelines are configured as directed acyclic graphs (DAGs)
    • workflow lives on-premises, in multiple clouds, or fully within GCP.
    • provides the ability to author, schedule, and monitor the workflows in a unified manner
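
Because Pub/Sub delivery is at-least-once, subscribers are usually written to tolerate duplicates. The sketch below is a minimal, in-memory illustration of that idea and does not use the Pub/Sub client library; the class name `IdempotentConsumer` and the message IDs are hypothetical, and a real system would need a bounded or persistent store for seen IDs.

```python
from typing import Callable


class IdempotentConsumer:
    """Processes each message ID once, even if the message is delivered repeatedly."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        self._seen = set()  # IDs of messages that were already handled

    def receive(self, message_id: str, payload: str) -> bool:
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False  # redelivery: safe to acknowledge again, but do no work
        self._handler(payload)
        self._seen.add(message_id)  # record only after the handler succeeded
        return True


if __name__ == "__main__":
    processed = []
    consumer = IdempotentConsumer(processed.append)
    # "m1" arrives twice, e.g. because the first ack missed the deadline.
    for message_id, payload in [("m1", "temp=21"), ("m1", "temp=21"), ("m2", "temp=22")]:
        consumer.receive(message_id, payload)
    print(processed)
```

The ID is recorded only after the handler returns, so a failed handler call leaves the message eligible for reprocessing on redelivery.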

Identity Services

  • Cloud IAM 
    • provides administrators the ability to manage cloud resources centrally by controlling who can take what action on specific resources.
    • Understand how IAM works and how rules apply esp. the hierarchy from Organization -> Folder -> Project -> Resources
    • Understand IAM Best practices

Storage Services

  • Understand each storage service option and its use cases.
  • Cloud Storage
    • cost-effective object storage for unstructured data.
    • very important to know the different classes and their use cases esp. Regional and Multi-Regional (frequent access), Nearline (monthly access), and Coldline (yearly access)
    • Understand Signed URL to give temporary access and the users do not need to be GCP users
    • Understand permissions – IAM vs ACLs (fine-grained control)
  • Cloud SQL
    • is a fully-managed service that provides MySQL and PostgreSQL only.
    • Limited to 10TB and is a regional service.
    • No direct options for Oracle yet.
  • Cloud Spanner
    • is a fully managed, mission-critical relational database service.
    • provides a scalable online transaction processing (OLTP) database with high availability and strong consistency at a global scale.
    • globally distributed and can scale and handle more than 10TB.
    • not a direct replacement and would need migration
  • Cloud Datastore
    • provides document database for web and mobile applications. Datastore is not for analytics
    • Understand Datastore indexes and how to update indexes for Datastore

Machine Learning

  • Google expects the Data Engineer to know some of the data scientist's material as well
  • Understand the different algorithms
    • Supervised Learning (labeled data)
      • Classification (for e.g. Spam or Not)
      • Regression (for e.g. Stock or House prices)
    • Unsupervised Learning (Unlabelled data)
      • Clustering (for e.g. categories)
    • Reinforcement Learning
  • Know Cloud ML with Tensorflow
  • Know all the Cloud AI products which include
    • Cloud Vision
    • Cloud Natural Language
    • Cloud Speech-to-Text
    • Cloud Video Intelligence
    • Cloud Dialogflow
  • Cloud AutoML products, which can help you get started without much machine learning experience

Monitoring

  • Cloud Monitoring and Logging
    • provides everything from monitoring, alert, error reporting, metrics, diagnostics, debugging, trace.
    • remember audits are mainly checking Cloud Logging entries
    • Aggregated sink can then route log entries from the organization or folder, plus (recursively) from any contained folders, billing accounts, or projects

Security Services

Other Services

  • Storage Transfer Service 
    • allows import of large amounts of online data into Google Cloud Storage, quickly and cost-effectively. Online data is the key here as it supports AWS S3, HTTP/HTTPS, and other GCS buckets. If the data is on-premises you need to use the gsutil command
  • Transfer Appliance 
    • to transfer large amounts of data quickly and cost-effectively into Google Cloud Platform. Check for the data size and it would be always compared with Google Transfer Service or gsutil commands.
  • BigQuery Data Transfer Service
    • to integrate with third-party services and load data into BigQuery

Google Cloud – Professional Cloud Architect Certification learning path

Re-certified !!!! Google Cloud – Professional Cloud Architect certification exam is one of the toughest exams I have appeared for. Even though it was a recertification, the preparation level was the same as the first time. The gamut of services and concepts it tests your knowledge on is really vast.

Google Cloud – Professional Cloud Architect Certification Summary

  • Has 50 questions to be answered in 2 hours.
  • Covers a wide range of Google Cloud services and what they actually do.
  • includes Compute, Storage, Network and even Data services
  • Questions sometimes test your logical thinking rather than any concept regarding Google Cloud.
  • Hands-on is a MUST, if you have not worked on GCP before make sure you do lots of labs else you would be absolutely clueless for some of the questions and commands
  • Make sure you cover the case studies beforehand. I got ~15 questions (almost 5 per case study) and it can really be a savior for you in the exams.
  • Be sure that NO Online Course or Practice test is going to cover everything. I did LinuxAcademy (a bit old now) which is really vast, but hands-on or practical knowledge is a MUST.

Google Cloud – Professional Cloud Architect Certification Resources

Google Cloud – Professional Cloud Architect Certification Topics

General Services

  • Cloud Billing
    • understand how Cloud Billing works. Monthly vs Threshold and which has priority
    • Budgets can be set to alert for projects
    • how to change a billing account for a project and what roles you need. Hint – Project Owner and Billing Administrator for the billing account
    • Cloud Billing can be exported to BigQuery and Cloud Storage
  • Resource Manager
    • Understand the Resource Manager hierarchy Organization -> Folders -> Projects -> Resources
    • IAM Policy inheritance is transitive and resources inherit the policies of all of their parent resources.
    • Effective policy for a resource is the union of the policy set on that resource and the policies inherited from higher up in the hierarchy.
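
The "effective policy is a union" rule can be made concrete with a small model of the hierarchy. This is a simplified sketch, not the IAM API: the resource names, members and the `PARENTS` and `BINDINGS` tables are all hypothetical.

```python
from collections import defaultdict

# Hypothetical hierarchy: child -> parent (None marks the organization root).
PARENTS = {
    "org": None,
    "folder-dev": "org",
    "project-web": "folder-dev",
    "bucket-assets": "project-web",
}

# Hypothetical bindings set directly on each node: role -> members.
BINDINGS = {
    "org": {"roles/viewer": {"group:auditors@example.com"}},
    "folder-dev": {"roles/editor": {"group:devs@example.com"}},
    "bucket-assets": {
        "roles/storage.objectViewer": {"user:alice@example.com"},
        "roles/viewer": {"user:bob@example.com"},
    },
}


def effective_policy(resource: str) -> dict:
    """Union of the bindings on the resource and on every ancestor above it."""
    merged = defaultdict(set)
    node = resource
    while node is not None:
        for role, members in BINDINGS.get(node, {}).items():
            merged[role] |= members  # grants accumulate; a child cannot remove them
        node = PARENTS[node]
    return dict(merged)


if __name__ == "__main__":
    for role, members in sorted(effective_policy("bucket-assets").items()):
        print(role, sorted(members))
```

In this model the bucket ends up with the viewer grant from the organization, the editor grant from the folder, and its own bindings, which is why access granted high in the hierarchy cannot be taken away lower down.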

Identity Services

  • Cloud Identity and Access Management
    • Identity and Access Management – IAM provides administrators the ability to manage cloud resources centrally by controlling who can take what action on specific resources.
    • Understand how IAM works and how rules apply esp. the hierarchy from Organization -> Folder -> Project -> Resources
    • Understand the difference between Primitive, Pre-defined and Custom roles and their use cases
    • IAM Policy inheritance is transitive and resources inherit the policies of all of their parent resources.
    • Effective policy for a resource is the union of the policy set on that resource and the policies inherited from higher up in the hierarchy.
    • Basically  Permissions -> Roles -> (IAM Policy) -> Members
    • Know how to use service accounts with applications
  • Cloud Identity
    • Cloud Identity provides IDaaS (Identity as a Service) and provides single sign-on functionality and federation with external identity providers like Active Directory.
    • Cloud Identity supports federating with Active Directory using GCDS to implement the synchronization

Compute Services

    • Make sure you know all the compute services: Google Compute Engine, Google App Engine and Google Kubernetes Engine. You need to know their pros and cons and the use cases each one fits.
    • Google Compute Engine
      • Google Compute Engine is the best IaaS option for compute and provides fine grained control
      • Know how to create a Compute Engine instance, connect to it using Cloud shell or ssh keys
      • Difference between backups and images and how to create instances from the same.
      • Understand Compute Engine Storage Options. Disk throughput and IOPS depend on type and size.
      • Understand Compute Engine Snapshots
      • Instance templates with managed instance groups provide scalability and high availability
      • Instance template cannot be edited, create a new one and attach.
      • Difference between managed vs unmanaged instance groups and auto-healing feature
      • Managed instance groups are covered heavily in the exam, as they provide the key auto-scaling capability. Hint: you need to create an Instance template and associate it with an Instance group
      • Understand how migration or traffic splitting with Managed instance groups works. Hint – rolling updates & deployments
      • Preemptible VMs and their use cases. HINT – can be terminated any time and supports max 24 hours.
      • Upgrade an instance without downtime using Live Migration
      • Managing access using OS Login or project and instance metadata
      • Prevent accidental deletion using deletion protection flag
      • Understand the pricing and discounts model. Hint – Sustained use (automatic, up to 30%) vs Committed use (1 or 3 yrs) discounts.
      • In case of any issues or errors, how to debug the same
    • Google App Engine
      • Google App Engine is the main PaaS option; know the platforms it supports and the features it provides.
      • Deploy an application with App Engine and understand how versioning and rolling deployments can be done
      • Understand how to configure auto scaling, traffic splitting and traffic migration.
      • Know App Engine is a regional resource and understand the steps to migrate or deploy application to different region and project.
      • Know the difference between App Engine Flexible vs Standard
    • Google Kubernetes Engine
      • Google Kubernetes Engine, powered by the open source container scheduler Kubernetes, enables you to run containers on Google Cloud Platform.
      • Kubernetes Engine takes care of provisioning and maintaining the underlying virtual machine cluster, scaling your application, and operational logistics such as logging, monitoring, and cluster health management.
      • A node pool is a subset of machines that all have the same configuration, including machine type (CPU and memory) and authorization scopes. Node pools represent a subset of nodes within a cluster; a container cluster can contain one or more node pools. Hint: To add a new machine type, add a new node pool, as an existing one cannot be edited
      • Be sure to Create a Kubernetes Cluster and configure it to host an application
      • Understand how to make the cluster auto repairable and upgradable. Hint – Node auto-upgrades and auto-repairing feature
      • Very important to understand where to use gcloud commands (to create a cluster) and kubectl commands (manage the cluster components)
      • Very important to understand how to increase cluster size and enable autoscaling for the cluster
      • Know how to manage secrets like database passwords
    • Cloud Functions
      • is a lightweight, event-based, asynchronous compute solution that allows you to create small, single-purpose functions that respond to cloud events without the need to manage a server or a runtime environment.
      • Remember that Cloud Functions is serverless and scales from zero up with demand and back to zero as the demand changes.

Network Services

  • Virtual Private Cloud
    • Understand Virtual Private Cloud (VPC), subnets and how to host applications within them. Hint – a VPC is global and spans regions, while subnets are regional
    • Understand how Firewall rules work and how they are configured. Hint – Focus on Network Tags. Also, there are 2 implicit firewall rules – default ingress deny and default egress allow
    • Understand VPC Peering and Shared VPC
    • Understand the concept of internal and external IPs and the difference between static and ephemeral IPs
    • Primary IP range of an existing subnet can be expanded by modifying its subnet mask, setting the prefix length to a smaller number.
    • Understand Private Google Access use cases
  • On-premises connectivity
    • Cloud VPN and Interconnect are 2 components which help you connect to on-premises data center.
    • Understand limitations of Cloud VPN esp. 3Gbps limit. How it can be improved with multiple tunnels.
    • Understand the requirements to set up Cloud VPN.
    • Cloud Router provides dynamic routing using BGP
    • Know Interconnect as the reliable, high speed, low latency and dedicated bandwidth option.
  • Cloud Load Balancing (GCLB)
    • Google Cloud Load Balancing provides scaling, high availability, and traffic management for your internet-facing and private applications.
    • Understand Google Load Balancing options and their use cases esp. which is global and internal and what protocols they support.

Storage Services

  • Understand each storage option and its use cases.
  • Persistent disks
    • attached to the Compute Engines, provide fast access however are limited in scalability, availability and scope.
    • Remember performance depends on the size of the disk
  • Cloud Storage
    • Cloud Storage is cost-effective object storage for unstructured data.
    • very important to know the different storage classes and their use cases esp. Regional and Multi-Regional (frequent access), Nearline (monthly access) and Coldline (yearly access)
    • Understand life cycle management. HINT – Changes are in accordance to object creation date
    • Understand various data encryption techniques
    • Understand Signed URL to give temporary access and the users do not need to be GCP users
    • Understand access control and permissions – IAM vs ACLs (fine grained control)
    • Understand best practices esp. uploading and downloading the data. HINT using parallel composite uploads
  • Relational Databases
    • Know Cloud SQL and Cloud Spanner
    • Cloud SQL
      • Cloud SQL is a fully-managed service that provides MySQL, PostgreSQL and MS SQL Server
      • limited to 10TB and is a regional service.
      • Difference between Failover and Read replicas. Failover provides High Availability and almost zero downtime while Read replicas provide scalability. Cross region Read Replicas are supported
      • Perform Point-In-Time recovery. Hint – requires binary logging and backups
      • MS SQL server support was added anew. Previously for HA, it required setting up SQL Server on Compute Engine, using Always On Availability Groups using Windows Failover Clustering. Place nodes in different subnets.
    • Cloud Spanner
      • is a fully managed, mission-critical relational database service.
      • provides a scalable online transaction processing (OLTP) database with high availability and strong consistency at global scale.
      • globally distributed and can scale and handle more than 10TB.
      • not a direct replacement and would need migration
    • There are no direct options for Oracle yet.
  • NoSQL
    • Know Cloud Datastore and BigTable
    • Datastore
      • provides document database for web and mobile applications. Datastore is not for analytics
      • Understand Datastore indexes and how to update indexes for Datastore
      • Can be configured Multi-regional and regional
    • Bigtable
      • provides column database suitable for both low-latency single-point lookups and precalculated analytics
      • understand Bigtable is not for long term storage as it is quite expensive
  • Data Warehousing
    • BigQuery
      • provides scalable, fully managed enterprise data warehouse (EDW) with SQL and fast ad-hoc queries.
      • Remember it is most suitable for historical analysis.
  • MemoryStore and Firebase did not feature in any of the questions

Data Services

  • Although there is a different certification for Data Engineer, the Cloud Architect does cover data services. Data services are also part of the use cases so be sure to know about them
  • Know the Big Data stack and understand which service fits the different layers of ingest, store, process, analytics, use
  • Key Services which need to be mainly covered are –
    • Cloud Storage as the medium to store data as data lake
    • Cloud Pub/Sub
      • as the messaging service to capture real time data esp. IoT
      • is designed to provide reliable, many-to-many, asynchronous messaging between applications esp. real time IoT data capture
      • Cloud Storage can generate notifications (Object change notification)
    • Cloud Dataflow to process, transform, transfer data and the key service to integrate store and analytics.
    • Cloud BigQuery for storage and analytics. Remember BigQuery provides the same cost-effective option for storage as Cloud Storage
    • Cloud Dataprep to clean and prepare data. Hint – It can be used for anomaly detection.
    • Cloud Dataproc to handle existing Hadoop/Spark jobs. Hint – Use it to replace existing hadoop infra.
    • Cloud Datalab is an interactive tool for exploration, transformation, analysis and visualization of your data on Google Cloud Platform
  • Know standard patterns Cloud Pub/Sub -> Dataflow -> BigQuery

Monitoring

  • Google Cloud Monitoring or Stackdriver
    • provides everything from monitoring, alert, error reporting, metrics, diagnostics, debugging, trace.
    • remember audits are mainly checking Stackdriver
  • Google Cloud Logging or Stackdriver logging

DevOps services

  • Deployment Manager 
    • provides Infrastructure as Code
    • provides dynamic provisioning with templates
  • Cloud Source Repositories
    • provides source code repository with Git version control to support collaborative development
  • Container Registry
    • is a private Docker image storage system on Google Cloud Platform.
    • images stored are immutable.
  • Cloud Build
    • is a service that executes your builds on Google Cloud Platform infrastructure.
  • MarketPlace (Cloud Launcher)
    • provides a way to launch common software packages e.g. Jenkins or WordPress and stacks on Google Compute Engine with just a few clicks like a prepackaged solution.
    • can help minimize deployment time and can be used without any knowledge about the product
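
To make "Infrastructure as Code with templates" more concrete: Deployment Manager templates can be written in Python (or Jinja), and a Python template exposes a `generate_config(context)` function that returns the resources to create. The sketch below is a minimal, untested-against-a-live-project example; the `zone` and `machineType` property names, the `-vm` suffix and the Debian image family are assumptions made for illustration.

```python
COMPUTE_URL_BASE = "https://www.googleapis.com/compute/v1/"


def generate_config(context):
    """Template entry point: describe a single Compute Engine VM."""
    project = context.env["project"]
    zone = context.properties["zone"]
    machine_type = context.properties["machineType"]
    instance = {
        "name": context.env["name"] + "-vm",
        "type": "compute.v1.instance",
        "properties": {
            "zone": zone,
            "machineType": "{}projects/{}/zones/{}/machineTypes/{}".format(
                COMPUTE_URL_BASE, project, zone, machine_type),
            "disks": [{
                "boot": True,
                "autoDelete": True,
                "initializeParams": {
                    # Assumed public image family; swap in whatever image is needed.
                    "sourceImage": COMPUTE_URL_BASE
                    + "projects/debian-cloud/global/images/family/debian-12",
                },
            }],
            "networkInterfaces": [{
                "network": "{}projects/{}/global/networks/default".format(
                    COMPUTE_URL_BASE, project),
            }],
        },
    }
    return {"resources": [instance]}
```

A configuration file would import this template and pass the two properties, so the same template can provision VMs in different zones or sizes without being edited.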

Security Services

  • Cloud Security Scanner 
    • is a web application security scanner that enables developers to easily check for a subset of common web application vulnerabilities in websites built on App Engine and Compute Engine.
  • Data Loss Prevention API
    • to handle sensitive data esp. redaction of PII data.
  • PCI-DSS compliant
    • GCP services are PCI-DSS compliant, however you need to make sure that your applications and hosting are in line with PCI-DSS requirements
  • Same concept as PCI-DSS applies to GDPR as well

Other Services

  • Know various data transfer options
  • Storage Transfer Service
    • allows import of large amounts of online data into Google Cloud Storage, quickly and cost-effectively.
    • Online data is the key here as it supports AWS S3, HTTP/HTTPS and other GCS buckets.
    • for on-premises data you need to use gsutil command
  • Transfer Appliance 
    • to transfer large amounts of data quickly and cost-effectively into Google Cloud Platform.
    • Check for the data size and it would be always compared with Google Transfer Service or gsutil commands.
    • Transfer Appliance Rehydrator provides data rehydration, which is the process by which the files are fully reconstituted, so that the transferred data can be accessed and used.
  • Spinnaker
    • is an open source, multi-cloud, continuous delivery platform and does appear in answer options. So be sure to know about it.
  • Jenkins
    • for Continuous Integration and Continuous Delivery.

Case Studies

Google Cloud – Dress4win Case Study

⚠️ Case Study No Longer on Current PCA Exam

The Dress4Win case study has been retired from the Google Cloud Professional Cloud Architect (PCA) exam.

The current PCA exam (as of 2025-2026) uses four different case studies: EHR Healthcare, Helicopter Racing League (HRL), Mountkirk Games, and TerramEarth.

This content is maintained for historical reference and as a learning exercise for GCP migration architecture concepts. The architectural patterns discussed remain relevant for real-world cloud migrations.

Dress4Win is a web-based company that helps their users organize and manage their personal wardrobe using a web app and mobile application. The company also cultivates an active social network that connects their users with designers and retailers. They monetize their services through advertising, e-commerce, referrals, and a freemium app model. The application has grown from a few servers in the founder’s garage to several hundred servers and appliances in a colocated data center. However, the capacity of their infrastructure is now insufficient for the application’s rapid growth. Because of this growth and the company’s desire to innovate faster, Dress4Win is committing to a full migration to a public cloud.

The key here is that the company wants to migrate completely to the public cloud because of the current infrastructure's inability to scale

Solution Concept

For the first phase of their migration to the cloud, Dress4Win is moving their development and test environments. They are also building a disaster recovery site, because their current infrastructure is at a single location. They are not sure which components of their architecture they can migrate as is and which components they need to change before migrating them.

Key here is Dress4Win wants to move the development and test environments first. And also, they want to build a DR site for their current production site which would continue to be hosted on-premises

Executive Statement

Our investors are concerned about our ability to scale and contain costs with our current infrastructure. They are also concerned that a competitor could use a public cloud platform to offset their up-front investment and free them to focus on developing better features. Our traffic patterns are highest in the mornings and weekend evenings; during other times, 80% of our capacity is sitting idle.

Our capital expenditure is now exceeding our quarterly projections. Migrating to the cloud will likely cause an initial increase in spending, but we expect to fully transition before our next hardware refresh cycle. Our total cost of ownership (TCO) analysis over the next 5 years for a public cloud strategy achieves a cost reduction between 30% and 50% over our current model.

The key here is that the company wants to improve on the application scalability, efficiency (hardware sitting idle most of the time), capex cost reduction, and improve TCO over a period of time

Existing Technical Environment

The Dress4Win application is served out of a single data center location. All servers run Ubuntu LTS v16.04.

Databases:

  • MySQL. 1 server for user data, inventory, static data,
    • MySQL 5.8
    • 8 core CPUs
    • 128 GB of RAM
    • 2x 5 TB HDD (RAID 1)
  • Redis 3 server cluster for metadata, social graph, caching. Each server is:
    • Redis 3.2
    • 4 core CPUs
    • 32GB of RAM
  • The MySQL server can be migrated directly to Cloud SQL, which is GCP's managed relational database service and supports MySQL. For PostgreSQL-compatible workloads requiring higher performance, AlloyDB (GA since 2022) is also an option offering up to 4x throughput vs. standard PostgreSQL.
  • For Redis cluster, Memorystore for Redis can be used which is a fully-managed in-memory data store service for Redis. Memorystore now supports Redis versions up to 7.2 and offers a Memorystore for Redis Cluster option for high-throughput workloads.
  • There would be no changes required to support the same.

Compute:

  • 40 Web Application servers providing micro-services based APIs and static content.
    • Tomcat – Java
    • Nginx
    • 4 core CPUs
    • 32 GB of RAM
  • 20 Apache Hadoop/Spark servers:
    • Data analysis
    • Real-time trending calculations
    • 8 core CPUs
    • 128 GB of RAM
    • 4x 5 TB HDD (RAID 1)
  • 3 RabbitMQ servers for messaging, social notifications, and events:
    • 8 core CPUs
    • 32GB of RAM
  • Miscellaneous servers:
    • Jenkins, monitoring, bastion hosts, security scanners
    • 8 core CPUs
    • 32GB of RAM
  • Web Application servers with Java and Nginx can be supported using Compute Engine, Cloud Run (for containerized microservices with automatic scaling), or Google Kubernetes Engine (GKE) (formerly Container Engine) with autoscaling configured. GKE Autopilot mode simplifies cluster management further.
  • The core and RAM combination does not match a predefined machine type, so it would need a custom machine type; alternatively, the workload can be tuned to fit an existing machine type
  • Apache Hadoop/Spark servers can be easily migrated to Dataproc (now part of the Managed Service for Apache Spark brand), which provides managed Hadoop and Spark clusters with autoscaling that can reduce VM costs by up to 40%.
  • RabbitMQ is not directly supported as a managed service by Google Cloud, so messaging can be handled in one of two ways:
    • Cloud Pub/Sub messaging – however this would need changes to the code and would not be a seamless migration. Pub/Sub now also supports streaming ingestion from external sources and export subscriptions to BigQuery/Cloud Storage.
    • Use Compute Engine to host the RabbitMQ servers
  • Jenkins, Bastion hosts, Security scanners can be hosted using Google Compute Engine (GCE). For CI/CD, Cloud Build is also available as a managed alternative to self-hosted Jenkins.
  • Monitoring can be provided using Google Cloud Operations Suite (formerly Stackdriver), which includes Cloud Monitoring, Cloud Logging, Cloud Trace, and Cloud Profiler.
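
The note about custom machine types can be checked with a little arithmetic. The sketch below computes RAM per vCPU for each server class listed above and flags the ones that exceed a per-vCPU memory ceiling. The 6.5 GB per vCPU ceiling is an assumption used for illustration (it varies by machine family and should be verified against current Compute Engine documentation), and the server names are hypothetical labels.

```python
from dataclasses import dataclass

# Assumption for illustration only: a standard custom machine type allows at
# most this much memory per vCPU. Check the current limits for your family.
ASSUMED_MAX_GB_PER_VCPU = 6.5


@dataclass(frozen=True)
class ServerSpec:
    name: str      # hypothetical label for the server class
    vcpus: int
    ram_gb: int

    @property
    def gb_per_vcpu(self) -> float:
        return self.ram_gb / self.vcpus


def exceeds_ratio(spec: ServerSpec, limit: float = ASSUMED_MAX_GB_PER_VCPU) -> bool:
    """True when the RAM-to-vCPU ratio is above the assumed ceiling."""
    return spec.gb_per_vcpu > limit


# Core/RAM figures taken from the case study's Compute section.
SERVERS = [
    ServerSpec("web-app", 4, 32),
    ServerSpec("hadoop-spark", 8, 128),
    ServerSpec("rabbitmq", 8, 32),
    ServerSpec("misc", 8, 32),
]


def report(servers=SERVERS):
    """Map each server class to (GB per vCPU, exceeds assumed ceiling)."""
    return {s.name: (s.gb_per_vcpu, exceeds_ratio(s)) for s in servers}


if __name__ == "__main__":
    for name, (ratio, over) in report().items():
        print(f"{name}: {ratio:.1f} GB/vCPU, above assumed ceiling: {over}")
```

Under this assumption, the web and Hadoop/Spark servers are the ones that would need either an extended-memory custom type or resizing, while the 8-core/32 GB servers fit comfortably.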

Storage appliances:

  • iSCSI for VM hosts
  • Fiber channel SAN – MySQL databases
    • 1 PB total storage; 400 TB available
  • NAS – image storage, logs, backups
    • 100 TB total storage; 35 TB available
  • iSCSI for VM hosts can be supported using Cloud Persistent Disks (or Hyperdisk for higher performance requirements) because it needs block-level storage
  • SAN for MySQL databases can be supported using Cloud Persistent Disks because it needs block-level storage. However, a single disk cannot scale to 1 PB, so multiple disks need to be combined to create the storage
  • NAS for image storage, logs and backups can be supported using Cloud Storage which provides unlimited storage capacity. For file-system access (NFS), Filestore provides a managed NFS file server.
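
The point about combining multiple disks to reach SAN-scale capacity comes down to a ceiling division. The sketch below assumes a maximum of 64 TB per disk; that cap is an assumption for illustration and should be checked against the current limits for the chosen disk type. The capacity figures come from the storage appliance list above.

```python
import math

# Assumption for illustration: maximum size of a single block-storage disk.
ASSUMED_MAX_DISK_TB = 64


def disks_needed(total_tb: float, max_disk_tb: float = ASSUMED_MAX_DISK_TB) -> int:
    """Smallest number of disks whose combined size covers total_tb."""
    if total_tb <= 0 or max_disk_tb <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(total_tb / max_disk_tb)


SAN_TOTAL_TB = 1000      # 1 PB total storage
SAN_AVAILABLE_TB = 400   # 400 TB still free
SAN_USED_TB = SAN_TOTAL_TB - SAN_AVAILABLE_TB

if __name__ == "__main__":
    print("disks for full SAN capacity:", disks_needed(SAN_TOTAL_TB))
    print("disks for data currently in use:", disks_needed(SAN_USED_TB))
```

Sizing for the data actually in use rather than the full 1 PB is usually the more relevant number, since cloud disks can be added later as demand grows.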

Business Requirements

  • Build a reliable and reproducible environment with scaled parity of production.
    • can be handled by provisioning services or using GCP managed services with the same scale as on-premises resources and with Terraform or Infrastructure Manager for creating repeatable deployments
  • Improve security by defining and adhering to a set of security and Identity and Access Management (IAM) best practices for cloud.
    • can be handled using IAM by implementing best practices like least privilege and separating dev/test/production projects to control access
  • Improve business agility and speed of innovation through rapid provisioning of new resources.
    • can be handled using Terraform or Infrastructure Manager for repeatable and automated provisioning of resources
    • deployments of applications and new releases can be handled efficiently using rolling updates, A/B testing, and Cloud Deploy for managed continuous delivery
  • Analyze and optimize architecture for performance in the cloud.
    • can be handled using autoscaling Compute Engine instances based on the demand
    • can be handled using Google Cloud Operations Suite (Cloud Monitoring, Cloud Logging) for monitoring and fine tuning the specs, plus Active Assist recommendations for rightsizing

Technical Requirements

  • Easily create non-production environments in the cloud.
    • most of the services can be created using GCP managed services and the environment creation can be standardized and automated using templates and configurations
  • Implement an automation framework for provisioning resources in cloud.
    • can be handled using Terraform (recommended) or Infrastructure Manager, which provide Infrastructure as Code (IaC) for provisioning resources in cloud. Note: Cloud Deployment Manager reached end of support on March 31, 2026 and should not be used for new projects.
  • Implement a continuous deployment process for deploying applications to the on-premises datacenter or cloud.
    • continuous deployments can be handled using tools like Jenkins available on both the environments, or Cloud Build with Cloud Deploy for GCP-native CI/CD pipelines
  • Support failover of the production environment to cloud during an emergency.
    • can be handled by replicating all the data to the cloud environment and ability to provision the servers quickly.
    • can be handled by using Cloud DNS to repoint from on-premises environment to cloud environment
  • Encrypt data on the wire and at rest.
    • All the GCP services, by default, provide encryption on wire and at rest. Encryption can be performed using Google-managed keys, Customer-Managed Encryption Keys (CMEK) via Cloud KMS, or Customer-Supplied Encryption Keys (CSEK)
  • Support multiple private connections between the production data center and cloud environment.
    • can be handled using Cloud VPN (multiple VPN tunnels with HA VPN for 99.99% SLA) or Dedicated Interconnect/Partner Interconnect connection between the production data center and the cloud environment. For multi-cloud connectivity, Cross-Cloud Interconnect is also available.

Updated GCP Service Mapping (2025-2026)

The following list summarizes the recommended GCP services for the Dress4Win migration, reflecting current service names and availability:

  • Relational Database: Cloud SQL (MySQL/PostgreSQL) or AlloyDB (for PostgreSQL-compatible high-performance workloads)
  • In-Memory Cache: Memorystore for Redis (supports up to Redis 7.2) or Memorystore for Redis Cluster
  • Web Application Hosting: GKE (Google Kubernetes Engine), Cloud Run, or Compute Engine
  • Big Data/Analytics: Dataproc (Managed Service for Apache Spark) with autoscaling, or BigQuery for analytics
  • Messaging: Cloud Pub/Sub (managed) or self-hosted RabbitMQ on Compute Engine
  • Monitoring: Google Cloud Operations Suite (Cloud Monitoring, Cloud Logging, Cloud Trace)
  • IaC/Provisioning: Terraform or Infrastructure Manager (NOT Cloud Deployment Manager — deprecated)
  • CI/CD: Cloud Build + Cloud Deploy, or Jenkins on Compute Engine
  • Block Storage: Persistent Disk or Hyperdisk
  • Object/File Storage: Cloud Storage (objects), Filestore (NFS)
  • Networking: HA Cloud VPN, Dedicated/Partner Interconnect, Cross-Cloud Interconnect
  • Encryption: Google-managed keys, CMEK (Cloud KMS), or CSEK

AWS Certified SysOps Administrator – Associate (SOA-C01) Exam Learning Path

The AWS Certified SysOps Administrator – Associate (SOA-C01) exam is the latest version of the exam and replaced the old SysOps Administrator – Associate exam as of 24th Sept 2018. It validates the ability to:

  • Deploy, manage, and operate scalable, highly available, and fault tolerant systems on AWS
  • Implement and control the flow of data to and from AWS
  • Select the appropriate AWS service based on compute, data, or security requirements
  • Identify appropriate use of AWS operational best practices
  • Estimate AWS usage costs and identify operational cost control mechanisms
  • Migrate on-premises workloads to AWS

Refer AWS Certified SysOps – Associate Exam Guide Sep 18

AWS Certified SysOps Administrator - Associate Content Outline

AWS Certified SysOps Administrator – Associate (SOA-C01) Exam Summary

  • The AWS Certified SysOps Administrator – Associate exam is quite different from the previous one, with more focus on error handling, deployment, and monitoring.
  • The AWS Certified SysOps Administrator – Associate exam covers many of the newer AWS services like ALB, Lambda, AWS Config, AWS Inspector, and AWS Shield, while focusing mainly on other services like CloudWatch, metrics from various services, and CloudTrail.
  • Be sure to cover the following topics
    •  Monitoring & Management Tools
      • Understand CloudWatch monitoring to provide operational transparency
        • Know which EC2 metrics it can track (disk, network, CPU, status checks) and which would need custom metrics (memory, disk swap, disk storage etc.)
        • Know ELB monitoring
          • Classic Load Balancer metrics SurgeQueueLength and SpilloverCount
          • Reasons for 4XX and 5XX errors
      • Understand CloudTrail for audit and governance
      • Understand AWS Config and its use cases
      • Understand AWS Systems Manager and its various services like parameter store, patch manager
      • Understand AWS Trusted Advisor and what it provides
      • Very important to understand AWS CloudWatch vs AWS CloudTrail vs AWS Config
      • Very important to understand Trusted Advisor vs Systems Manager vs Inspector
      • Know Personal Health Dashboard & Service Health Dashboard
      • Deployment tools
        • Know AWS OpsWorks and its ability to support Chef & Puppet
        • Know Elastic Beanstalk and its advantages
        • Understand AWS CloudFormation
          • Know stacks, templates, nested stacks
          • Know how to wait for resources setup to be completed before proceeding esp. cfn-signal
          • Know how to retain resources (RDS, S3), prevent rollback in case of a failure
    • Networking & Content Delivery
      • Understand VPC in depth
        • Understand the difference between
          • Bastion host – allow access to instances in private subnet
          • NAT – route traffic from private subnets to internet
          • NAT instance vs NAT Gateway
          • Internet Gateway – Access to internet
          • Virtual Private Gateway – Connectivity between on-premises and VPC
          • Egress-Only Internet Gateway – relevant to IPv6 only to allow egress traffic from private subnet to internet, without allowing ingress traffic
        • Understand how VPC Peering works and limitations
        • Understand VPC Endpoints and supported services
        • Ability to debug networking issues like EC2 not accessible, EC2 instances not reachable, Instances in subnets not able to communicate with others or Internet.
      • Understand Route 53 and Routing Policies and their use cases
        • Focus on Weighted, Latency routing policies
      • Understand VPN and Direct Connect and their use cases
      • Understand CloudFront and use cases
      • Understand ELB, ALB and NLB and what features they provide like
        • ALB provides content and path routing
        • NLB provides ability to give static IPs to load balancer.
    • Compute
      • Understand EC2 in depth
        • Understand EC2 instance types
        • Understand EC2 purchase options esp. spot instances and improved reserved instances options.
        • Understand how CPU credits work for T2 burstable performance and T2 Unlimited
        • Understand EC2 Metadata & Userdata. What is each used for? How do you look up instance data after the instance is launched?
        • Understand EC2 Security.
          • How IAM roles work with EC2 instances
          • An IAM role can now be attached to stopped and running instances
        • Understand AMIs, remember that they are regional, and know how they can be shared with others.
        • Troubleshoot issues with launching EC2 esp. RequestLimitExceeded, InstanceLimitExceeded etc.
        • Troubleshoot connectivity, lost ssh keys issues
      • Understand Auto Scaling
      • Understand Lambda and its use cases
      • Understand Lambda with API Gateway
    • Storage
    • Databases
    • Security
      • Understand IAM as a whole
      • Understand KMS for key management and envelope encryption
      • Understand CloudHSM and KMS vs CloudHSM esp. support for symmetric and asymmetric keys
      • Know AWS Inspector and its use cases
      • Know AWS GuardDuty as a managed threat detection service. Knowing what it does will help you eliminate it as an answer option
      • Know AWS Shield esp. the Shield Advanced option and the features it provides
      • Know WAF as Web Application Firewall
      • Know AWS Artifact as on-demand access to compliance reports
    • Integration Tools
      • Understand SQS as message queuing service and SNS as pub/sub notification service
        • Focus on SQS as a decoupling service
        • Understand SQS FIFO, make sure you know the differences between standard and FIFO
      • Understand CloudWatch integration with SNS for notification
    • Cost management
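
The CloudWatch bullet above notes that memory utilization is not collected for EC2 by default and has to be published as a custom metric. A minimal sketch of that idea follows. It parses `/proc/meminfo`-style text, computes a used-memory percentage, and builds a request dictionary shaped like the keyword arguments of the CloudWatch `put_metric_data` call in boto3. The namespace `Custom/System`, the metric name, and the sample instance id are hypothetical, and the sketch deliberately does not call AWS so that it runs without credentials.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style lines into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo: dict[str, int]) -> float:
    """Percentage of memory in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return round(100.0 * (total - available) / total, 2)


def build_metric_request(instance_id: str, used_percent: float,
                         namespace: str = "Custom/System") -> dict:
    """Build keyword arguments in the shape put_metric_data expects."""
    return {
        "Namespace": namespace,  # hypothetical namespace
        "MetricData": [{
            "MetricName": "MemoryUtilization",  # hypothetical metric name
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Value": used_percent,
            "Unit": "Percent",
        }],
    }


SAMPLE_MEMINFO = """MemTotal:       8000000 kB
MemFree:        1000000 kB
MemAvailable:   2000000 kB
"""

if __name__ == "__main__":
    used = memory_used_percent(parse_meminfo(SAMPLE_MEMINFO))
    print(build_metric_request("i-0123456789abcdef0", used))
```

In practice the resulting dictionary would be passed to a CloudWatch client, for example `client.put_metric_data(**request)`, on a schedule; the CloudWatch agent does the same job without custom code.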

AWS Certified SysOps Administrator – Associate (SOA-C01) Exam Resources

AWS Cloud Computing Whitepapers

AWS Certified SysOps Administrator – Associate (SOA-C01) Exam Contents

Domain 1: Monitoring and Reporting

  1. Create and maintain metrics and alarms utilizing AWS monitoring services
  2. Recognize and differentiate performance and availability metrics
  3. Perform the steps necessary to remediate based on performance and availability metrics

Domain 2: High Availability

  1. Implement scalability and elasticity based on use case
  2. Recognize and differentiate highly available and resilient environments on AWS

Domain 3: Deployment and Provisioning

  1. Identify and execute steps required to provision cloud resources
  2. Identify and remediate deployment issues

Domain 4: Storage and Data Management

  1. Create and manage data retention
  2. Identify and implement data protection, encryption, and capacity planning needs

Domain 5: Security and Compliance

  1. Implement and manage security policies on AWS
  2. Implement access controls when using AWS
  3. Differentiate between the roles and responsibility within the shared responsibility model

Domain 6: Networking

  1. Apply AWS networking features
  2. Implement connectivity services of AWS
  3. Gather and interpret relevant information for network troubleshooting

Domain 7: Automation and Optimization

  1. Use AWS services and features to manage and assess resource utilization
  2. Employ cost-optimization strategies for efficient resource utilization
  3. Automate manual or repeatable process to minimize management overhead

AWS Certified Developer – Associate DVA-C01 Exam Learning Path

The AWS Certified Developer – Associate DVA-C01 exam is the latest version of the exam and replaces the old Developer – Associate exam. It validates the ability to:

  • Demonstrate an understanding of core AWS services, uses, and basic AWS architecture best practices.
  • Demonstrate proficiency in developing, deploying, and debugging cloud-based applications using AWS.

Refer AWS Certified Developer – Associate (Released June 2018) Exam Blue Print

AWS Certified Developer - Associate June 2018 Domains

AWS Certified Developer – Associate DVA-C01 Summary

  • The AWS Certified Developer – Associate DVA-C01 exam is quite different from the previous one, with more focus on hands-on development and deployment concepts rather than just architectural concepts
  • The AWS Certified Developer – Associate DVA-C01 exam covers many of the newer AWS services like Lambda and X-Ray, while focusing mainly on other services like DynamoDB, Elastic Beanstalk, S3, and EC2

AWS Developer – Associate DVA-C01 Exam Resources

AWS Developer – Associate DVA-C01 Exam Topics

  • Be sure to cover the following topics
    • Compute
      • Understand which AWS services you can use to build a serverless architecture.
      • Make sure you know and understand Lambda and serverless architecture, its features and use cases.
      • Know Lambda limits, e.g. execution time and the zipped and unzipped deployment package size limits
      • Be sure to know how to package and deploy Lambda functions.
      • Understand tracing of Lambda functions using X-Ray
      • Understand integration of Lambda with CloudWatch.
      • Understand how to handle multiple releases using Alias
      • Know AWS Step Functions to manage Lambda functions flow
      • Understand Lambda with API Gateway
      • Understand API Gateway stages, ability to cater to different environments for e.g. dev, test, prod
      • Understand EC2 as a whole
      • Understand EC2 Metadata & Userdata. What is each used for? How do you look up instance data after the instance is launched?
      • Understand EC2 Security. How IAM roles work with EC2 instances.
      • Understand the order in which credentials are evaluated when multiple sources are provided (the AWS SDK default credential provider chain). Remember the order – Environment variables -> Java system properties -> Default credential profiles file -> ECS container credentials -> Instance Profile credentials
      • Know Elastic Beanstalk at a high level, what it provides and its ability to get an application running quickly
      • Understand Elastic Beanstalk configurations and deployment types with their advantages and disadvantages
    • Databases
      • Understand relational and NoSQL data storage options, which include RDS and DynamoDB, and their use cases
      • Understand DynamoDB Secondary Indexes
      • Make sure you understand DynamoDB provisioned throughput for Read/Writes and its calculations
      • Make sure you understand DynamoDB Consistency Model – difference between Strongly Consistent and Eventual Consistency
      • Understand DynamoDB with its low latency performance, DAX
      • Know how to configure fine grained security for DynamoDB table, items, attributes
      • Understand DynamoDB Best Practices regarding
        • table design
        • provisioned throughput
        • Query vs Scan operations
        • improving Scan operation performance
      • Understand RDS features – Read Replicas for scalability, Multi-AZ for High Availability
      • Know ElastiCache use cases, mainly for caching performance
      • Understand ElastiCache Redis vs Memcached
    • Storage
      • Understand S3 storage option
      • Understand S3 Best Practices to improve performance for GET/PUT requests
      • Understand S3 features like different storage classes with lifecycle policies, static website hosting, versioning, Pre-Signed URLs for both upload and download, CORS
    • Security
      • Understand IAM as a whole
      • Focus on IAM role and its use case especially with EC2 instance
      • Know how to test and validate IAM policies
      • Understand IAM identity providers and federation and use cases
      • Understand how AWS Cognito works and what features it provides
      • Understand MFA and how you would implement two-factor authentication for your application
      • Understand KMS for key management and envelope encryption
      • Know what services support KMS
        • Remember that SQS and Kinesis now provide SSE support
      • Focus on S3 with SSE-S3, SSE-C, SSE-KMS. How do they work and differ?
      • Know how you can enforce that buckets only accept encrypted objects
      • Know the various KMS operations such as Encrypt, ReEncrypt, GenerateDataKey etc.
      • Know how KMS impacts the performance of the services
    • Management Tools
      • Understand CloudWatch monitoring to provide operational transparency
      • Know which EC2 metrics it can track.
      • Understand CloudWatch is extendable with custom metrics
      • Understand CloudTrail for Audit
    • Integration Tools
      • Understand SQS as message queuing service and SNS as pub/sub notification service
      • Understand SQS features like visibility, long poll vs short poll
      • Focus on SQS as a decoupling service
      • AWS has released SQS FIFO, make sure you know the differences between standard and FIFO
      • Know the different development and deployment tools like CodeCommit, CodeBuild, CodeDeploy, CodePipeline
    • Networking
      • Does not cover much on networking or designing of networks, but be sure you understand VPC, Subnets, Routes, Security Groups etc.
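
The DynamoDB provisioned throughput calculations mentioned in the Databases topics can be sketched as two small functions. They follow the standard sizing rules: one read capacity unit covers one strongly consistent read per second of an item up to 4 KB (an eventually consistent read costs half), and one write capacity unit covers one write per second of an item up to 1 KB. The function names and the sample workload are hypothetical, and the sketch ignores transactional reads and writes.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # one RCU covers an item up to 4 KB
WRITE_UNIT_BYTES = 1 * 1024  # one WCU covers an item up to 1 KB


def read_capacity_units(item_size_bytes: int, reads_per_second: int,
                        strongly_consistent: bool = True) -> int:
    """RCUs needed for a steady read rate of items of the given size."""
    units_per_read = math.ceil(item_size_bytes / READ_UNIT_BYTES)
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        # Eventually consistent reads consume half the capacity.
        total = math.ceil(total / 2)
    return total


def write_capacity_units(item_size_bytes: int, writes_per_second: int) -> int:
    """WCUs needed for a steady write rate of items of the given size."""
    return math.ceil(item_size_bytes / WRITE_UNIT_BYTES) * writes_per_second


if __name__ == "__main__":
    # Hypothetical workload: 6 KB items read 100 times per second,
    # 2.5 KB items written 10 times per second.
    print(read_capacity_units(6 * 1024, 100))
    print(read_capacity_units(6 * 1024, 100, strongly_consistent=False))
    print(write_capacity_units(2560, 10))
```

The key step is rounding the item size up to the next 4 KB (reads) or 1 KB (writes) boundary before multiplying by the request rate, which is where most exam calculation mistakes happen.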

AWS Cloud Computing Whitepapers

AWS Certified Developer – Associate DVA-C01 Exam Contents

Domain 1: Deployment

  1. Deploy written code in AWS using existing CI/CD pipelines, processes, and patterns.
  2. Deploy applications using Elastic Beanstalk.
  3. Prepare the application deployment package to be deployed to AWS.
  4. Deploy serverless applications.

Domain 2: Security

  1. Make authenticated calls to AWS services.
  2. Implement encryption using AWS services.
  3. Implement application authentication and authorization.

Domain 3: Development with AWS Services

  1. Write code for serverless applications.
  2. Translate functional requirements into application design.
  3. Implement application design into application code.
  4. Write code that interacts with AWS services by using APIs, SDKs, and AWS CLI.

Domain 4: Refactoring

  1. Optimize application to best use AWS services and features.
  2. Migrate existing application code to run on AWS.

Domain 5: Monitoring and Troubleshooting

  1. Write code that can be monitored.
  2. Perform root cause analysis on faults found in testing or production.

AWS Certified Solutions Architect – Associate SAA-C01 Exam Learning Path (Obsolete)

SAA-C01 is now obsolete. Please refer to the SAA-C03 Learning Path.

The AWS Solutions Architect – Associate SAA-C01 exam is the latest version of the exam and replaces the old CSA-Associate exam. It validates the ability to effectively demonstrate knowledge of how to architect and deploy secure and robust applications on AWS technologies:

  • Define a solution using architectural design principles based on customer requirements.
  • Provide implementation guidance based on best practices to the organization throughout the life cycle of the project.

Refer AWS_Solution_Architect_-_Associate_SAA-C01_Exam_Blue_Print

AWS Certified Solutions Architect - Associate February 2018

AWS Solutions Architect – Associate SAA-C01 Exam Summary

  • AWS has shifted the exam's focus from individual services to building scalable, highly available, cost-effective, performant, resilient, and operationally effective architectures
  • Most of the services covered by the old exam are the same, with a few new additions like API Gateway, Lambda, ECS, and Aurora
  • The exam covers the architecture aspects in depth, so you must be able to visualize the architecture, and even draw it out during the exam, to understand how it would work and how the different services relate.
  • Be sure to cover the following topics
    • Networking
      • Be sure to create a VPC from scratch. This is mandatory.
        • Create a VPC and understand what a CIDR block is.
        • Create public and private subnets, configure proper routes, security groups, NACLs.
        • Create Bastion for communication with instances
        • Create NAT Gateway or Instances for instances in private subnets to interact with internet
        • Create two tier architecture with application in public and database in private subnets
        • Create three tier architecture with web servers in public, application and database servers in private.
        • Make sure to understand how the communication happens between Internet, Public subnets, Private subnets, NAT, Bastion etc.
      • Understand VPC endpoints and what services it can help interact
      • Understand difference between NAT Gateway and NAT Instance
      • Understand how NAT high availability can be achieved
      • Understand CloudFront as CDN and the static and dynamic caching it provides, what can be its origin (it can point to on-premises sources)
      • Understand Route 53 for routing, health checks and various routing policies it provides and their use cases mainly for high availability
      • Be sure to cover ELB in depth. AWS has introduced ALB and NLB, and there are a lot of questions on ALB
      • Understand ALB features with its ability for content based and URL based routing with support for dynamic port mapping with ECS
    • Storage
      • Understand various storage options S3, EBS, Instance store, EFS, Glacier and what are the use cases and anti patterns for each
      • Referring to the Storage Options whitepaper is recommended; although it is a bit dated, 90% of it still holds true
      • Understand various EBS volume types and their use cases in terms of IOPS and throughput. SSD for IOPS and HDD for throughput
      • Understand Burst performance and I/O credits to handle occasional peaks
      • Understand S3 features like different storage classes with lifecycle policies, static website hosting, versioning, Pre-Signed URLs for both upload and download, CORS
      • Understand Glacier as an archival storage with various retrieval patterns
      • Glacier Expedited retrieval now allows object retrieval within minutes
      • Understand Storage gateway and its different types
    • Compute
      • Understand EC2 as a whole
      • Understand Auto Scaling and ELB, and how they work together to provide a highly available and scalable solution
      • Understand EC2 various purchase types – Reserved, On-demand and Spot and their use cases
      • Understand Reserved purchase types with the introduction of Scheduled and Convertible types
      • Understand Lambda and serverless architecture, its features and use cases. How do you benefit from Lambda?
      • Understand ECS with its ability to deploy containers and micro services architecture
      • Know Elastic Beanstalk at a high level, what it provides and its ability to get an application running quickly
    • Databases
      • Understand relational and NoSQL data storage options, which include RDS, DynamoDB, and Aurora, and their use cases
      • Aurora has been added to the exam, and most of the time the questions refer to Aurora given its support for multiple read replicas and replication of data across AZs
      • Understand that S3 is not a storage option for a database
      • Understand RDS features – Read Replicas for scalability, Multi-AZ for High Availability, Automated Backups, underlying volume types
      • Understand DynamoDB with its low latency performance, DAX
      • Understand DynamoDB provisioned throughput for Read/Writes
      • Know ElastiCache use cases, mainly for caching performance
    • Analytics
      • Not covered in much depth, but understand what the services are and what they can do
      • Understand Redshift as a data warehouse for business intelligence workloads
      • Know Kinesis for real time data capture and analytics
      • At least know what AWS Glue does, so you can eliminate it as an answer
    • Security
      • Understand IAM as a whole
      • Focus on IAM role and its use case especially with EC2 instance
      • Understand IAM identity providers and federation and use cases
      • Understand MFA and how you would implement two-factor authentication for your application
      • Understand encryption services
      • Refer Disaster Recovery whitepaper, be sure you know the different recovery types with impact on RTO/RPO.
    • Management Tools
      • Understand CloudWatch monitoring to provide operational transparency
      • Know which EC2 metrics it can track. Remember, it cannot track memory and disk space/swap utilization by default
      • Understand CloudWatch is extendable with custom metrics
      • Understand CloudTrail for Audit
      • Have a basic understanding of CloudFormation, OpsWorks
    • Integration Tools
      • Understand SQS as message queuing service and SNS as pub/sub notification service
      • Understand SQS features like visibility, long poll vs short poll
      • Focus on SQS as a decoupling service
      • AWS has released SQS FIFO, make sure you know the differences between standard and FIFO
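
For the "create a VPC from scratch" exercise in the Networking topics, it helps to see how a VPC CIDR block is carved into subnets. The sketch below uses Python's standard `ipaddress` module. The `10.0.0.0/16` range and the subnet names are hypothetical, and the usable-address count subtracts the five addresses that AWS reserves in every subnet.

```python
import ipaddress
import itertools

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one


def plan_subnets(vpc_cidr: str, new_prefix: int, names: list[str]) -> dict:
    """Assign consecutive equally sized subnets of the VPC to the given names."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(itertools.islice(vpc.subnets(new_prefix=new_prefix), len(names)))
    if len(blocks) < len(names):
        raise ValueError("VPC range is too small for the requested subnets")
    return dict(zip(names, blocks))


def usable_addresses(subnet) -> int:
    """Addresses left for resources after the AWS-reserved ones are removed."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


if __name__ == "__main__":
    # Hypothetical three-tier layout in a single Availability Zone.
    plan = plan_subnets("10.0.0.0/16", 24,
                        ["public-web", "private-app", "private-db"])
    for name, subnet in plan.items():
        print(f"{name}: {subnet} ({usable_addresses(subnet)} usable addresses)")
```

A real design would repeat the layout per Availability Zone and leave gaps between tiers for growth; the point here is only the relationship between prefix length and subnet size.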

NOTE: I have just marked the topics inline with the AWS Exam Blue Print. So be sure to check the same, as it is updated regularly and go through Whitepapers, FAQs and Re-Invent videos.

AWS Solutions Architect – Associate SAA-C01 Exam Resources

AWS Cloud Computing Whitepapers

AWS Solutions Architect – Associate Exam Contents

Domain 1: Design Resilient Architectures

  1. Choose reliable/resilient storage.
  2. Determine how to design decoupling mechanisms using AWS services.
  3. Determine how to design a multi-tier architecture solution.
  4. Determine how to design high availability and/or fault tolerant architectures.

Domain 2: Define Performant Architectures

  1. Choose performant storage and databases.
  2. Apply caching to improve performance.
  3. Design solutions for elasticity and scalability.

Domain 3: Specify Secure Applications and Architectures

  1. Determine how to secure application tiers.
  2. Determine how to secure data.
  3. Define the networking infrastructure for a single VPC application.

Domain 4: Design Cost-Optimized Architectures

  1. Determine how to design cost-optimized storage.
  2. Determine how to design cost-optimized compute.

Domain 5: Define Operationally-Excellent Architectures

  1. Choose design features in solutions that enable operational excellence.

Amazon EMR Best Practices

Best Practices for Using Amazon EMR

Amazon has made working with big data a lot easier. You can launch an EMR cluster in minutes for big data processing, machine learning, and real-time stream processing with the Apache Hadoop ecosystem. You can use the Management Console, the command line, or infrastructure-as-code tools like CloudFormation and Terraform to start several nodes with ease.

EMR pricing uses pay-per-second billing, which results in lower costs and you no longer have to worry about the hourly boundary.
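
The effect of per-second billing is easiest to see with a short job that runs just past an hour boundary. The sketch below compares the two billing models. The rate of 0.27 per node-hour is a hypothetical figure, not an actual EMR price, and the one-minute minimum charge is an assumption that should be checked against the current pricing page.

```python
import math

# Assumption for illustration: per-second billing has a one-minute minimum.
ASSUMED_MIN_BILLED_SECONDS = 60


def hourly_cost(runtime_seconds: int, nodes: int, rate_per_node_hour: float) -> float:
    """Cost when every started hour is billed in full."""
    billed_hours = math.ceil(runtime_seconds / 3600)
    return billed_hours * nodes * rate_per_node_hour


def per_second_cost(runtime_seconds: int, nodes: int, rate_per_node_hour: float,
                    min_seconds: int = ASSUMED_MIN_BILLED_SECONDS) -> float:
    """Cost when only the seconds used are billed, subject to a minimum."""
    billed_seconds = max(runtime_seconds, min_seconds)
    return billed_seconds / 3600 * nodes * rate_per_node_hour


if __name__ == "__main__":
    runtime = 65 * 60   # a 65-minute job
    nodes = 10
    rate = 0.27         # hypothetical rate per node-hour
    print(f"hourly billing:     {hourly_cost(runtime, nodes, rate):.3f}")
    print(f"per-second billing: {per_second_cost(runtime, nodes, rate):.3f}")
```

With hourly rounding, the five minutes past the hour cost as much as a full second hour on every node; with per-second billing they cost only five minutes.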

EMR makes the latest versions of many open source projects available to you. Currently, EMR supports over 20 open source projects including Apache Spark, Hive, HBase, Flink, Presto, Trino, Hudi, Iceberg, and Delta Lake, with new releases made every 4 to 6 weeks. This is very useful, especially for rapidly evolving open source projects such as Apache Spark where each release contains critical bug fixes and features. However, you are not forced to upgrade; a new release is made available if you choose to use it. Each EMR release now gets 24 months of standard support. With EMR, you can launch a fleet of instances and process massive volumes of data residing on S3 at a reasonable cost.

EMR Deployment Options

Amazon EMR provides multiple deployment options to match your operational requirements:

  • EMR on EC2 – Traditional cluster-based deployment with full control over cluster configuration, including instance types and custom AMIs. Best for workloads requiring advanced configurations and persistent clusters.
  • EMR Serverless – Run applications without managing clusters. Resources automatically scale up and down based on your workload. Best for data analysts and engineers who want to focus on application logic without cluster management overhead.
  • EMR on EKS – Run EMR workloads on Amazon Elastic Kubernetes Service. Compute resources can be shared between Spark applications and other Kubernetes applications. Resources are allocated and removed on demand. Best for organizations already using Kubernetes for container orchestration.

Supported Applications and Frameworks

A variety of cluster management options are supported, including YARN. You can run the following:

  • HBase
  • Trino (formerly PrestoSQL – high-performance distributed SQL engine)
  • Presto (legacy; AWS recommends Trino going forward)
  • Spark (including Spark 4.0 with VARIANT data type, Spark Connect, and ANSI SQL compliance)
  • Apache Flink (stream processing)
  • Apache Hive
  • Tez
  • Apache Hudi, Apache Iceberg, Delta Lake (open table formats with ACID transactions)
  • Zeppelin
  • JupyterHub
  • Notebooks (EMR Studio Workspaces)
  • SQL editors

Apache Spark 4.0 on EMR (GA 2026)

Apache Spark 4.0 is now generally available on Amazon EMR across all deployment options (EMR Serverless, EMR on EC2, and EMR on EKS). Key features include:

  • VARIANT data type – Native semi-structured data handling for JSON without the need for complex parsing logic
  • Spark Connect – Bridges interactive development and production-scale execution, enabling thin client connectivity
  • ANSI SQL compliance – Build data pipelines using standard SQL without learning Spark-specific syntax
  • Enhanced streaming (transformWithState API) – Build stateful streaming applications with improved state management
  • Apache Iceberg v3 support – Default column values, deletion vectors, multi-argument transforms, and row lineage tracking
  • EMR Optimized Runtime – Runs Spark workloads up to 4.5× faster than open-source Apache Spark

AWS Connectors

Additionally, connectors to different AWS services are available; for example, you can use Spark to load Redshift (using the Redshift connector, which uses Redshift commands under the hood to get good throughput). You can access DynamoDB for analytics applications, use connectors for relational data, and integrate with Amazon S3 Tables and SageMaker Lakehouse.

AWS Glue

One particularly important integration is AWS Glue. AWS Glue (now at version 5.1) comprises several main components:


  • AWS Glue ETL: Serverless ETL service supporting Apache Spark, with native fine-grained access control via AWS Lake Formation (table, column, row, and cell-level permissions). Glue 5.0+ adds support for SageMaker Lakehouse to unify data across S3 data lakes and Amazon Redshift data warehouses.
  • AWS Glue Data Catalog: A fully managed Hive metastore-compliant service. You have an intelligent metastore—you don’t have to write DDL to create a table; you can just make Glue crawl your data, infer what the schema is, and create those tables for you. It supports Apache Iceberg, Apache Hudi, and Delta Lake table formats, and automatically handles partition management. It also supports a variety of complex data types.
  • Crawlers: The crawlers crawl your data to infer the schema and automatically add partitions.
  • AWS Glue Data Quality: Provides rule-based data quality validation with rule labeling for organizing and analyzing data quality results.

AWS Glue is a fully managed service: it handles infrastructure management and auto-scaling, so you spend less time monitoring. Enabling security options in AWS Glue is straightforward, with support for IAM policies and AWS Lake Formation for fine-grained access control.

Open Table Formats

EMR now has first-class support for open table formats that enable ACID transactions, schema evolution, and time travel on data lakes:

  • Apache Iceberg – Supports v3 format with deletion vectors, materialized views, and merge-on-read. Recommended for most new data lake implementations.
  • Apache Hudi – Supports incremental data processing with upserts and deletes.
  • Delta Lake – Supports UniForm for cross-format compatibility with Iceberg clients.

All three formats are supported with AWS Lake Formation fine-grained access control (table, row, column, and cell-level filtering) across EMR on EC2, EMR Serverless, and EMR on EKS.

Common EMR Use Cases


HBase at massive scale: Using HBase with S3 for HFiles storage can save 50% or higher on storage costs compared to HDFS. Instead of sizing the cluster for HDFS, you size it for the processing power required for the HBase Region Servers. The S3 option also supports Read Replica HBase clusters in another AZ for load balancing and disaster recovery.
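As a rough illustration of the HBase-on-S3 setup described above, the sketch below builds the cluster configuration classifications that point HBase at S3. The helper name and the bucket path are hypothetical; the property names are the ones commonly documented for EMR, but verify them against the release you run.

```python
def hbase_on_s3_configurations(root_dir: str, read_replica: bool = False) -> list:
    """Build EMR configuration classifications that store HFiles on S3.

    root_dir is the S3 location for the HBase root directory.
    read_replica marks the cluster as a read-replica of that same root directory.
    """
    if not root_dir.startswith("s3://"):
        raise ValueError("root_dir must be an s3:// URI")

    hbase_props = {"hbase.emr.storageMode": "s3"}
    if read_replica:
        hbase_props["hbase.emr.readreplica.enabled"] = "true"

    return [
        {"Classification": "hbase", "Properties": hbase_props},
        {"Classification": "hbase-site", "Properties": {"hbase.rootdir": root_dir}},
    ]


if __name__ == "__main__":
    # "example-bucket" is a placeholder, not a real bucket.
    for entry in hbase_on_s3_configurations("s3://example-bucket/hbase-root"):
        print(entry)
```

The resulting list is the kind of value passed as the Configurations setting when a cluster is created; a read-replica cluster in another AZ would use the same root directory with read_replica=True.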

Real-time and batch processing: Use Amazon Kinesis Data Streams for pushing data to Spark. Use Spark Streaming or Apache Flink for real-time analytics or processing data on-the-fly and then dump that data into S3. If you don’t have real-time processing use cases, then Amazon Data Firehose (formerly Kinesis Data Firehose) is a great alternative. The data can be cataloged in the Glue Data Catalog and then accessed via a variety of analytical engines. EMR supports several analytical engines including Hive, Tez, Spark, and Trino. Once the data is in the Data Catalog on S3, you can use Athena (serverless SQL queries), Glue ETL (serverless ETL), and Redshift Spectrum.

Data exploration: Use Spark with EMR Studio (Jupyter-based notebooks), Zeppelin, or JupyterHub to arm data scientists with a way to explore large amounts of data. EMR Studio supports real-time collaboration and Git-based version control. Amazon SageMaker Unified Studio Notebooks also support EMR Serverless with Spark Connect for interactive analytics.

Ad hoc SQL queries: There is a big rise in the use of Trino (formerly PrestoSQL) for ad hoc SQL queries (in combination with Athena). Trino gives you advanced configurations and a way to build exactly what you need for your use case but requires cluster management, versus Athena where you just write SQL without infrastructure. Many BI tools support Trino for low latency dashboards. EMR 6.15+ runs Trino queries 2.7× faster than previous versions.

Deep learning with GPU instances: You can launch GPU hardware for EMR. TensorFlow is fully supported. Note that Apache MXNet has reached end-of-life and the project is archived—use PyTorch or TensorFlow instead for deep learning workloads on EMR.

Machine learning pipelines: Typical ML projects implement a multi-step process including ETL, feature engineering, model training, model evaluation, model deployment, and model scoring. Using Apache Spark for implementing ML pipelines is popular as it supports each step, scales for small and large jobs, has good ML libraries (MLlib), and has an active user base. Amazon SageMaker integrates with EMR for data preparation at scale.

EMR Deployment Best Practices

There are several options for deploying Spark on AWS:

  • EMR Serverless – No infrastructure to manage. Applications function as cluster templates that instantiate when jobs are submitted and can process multiple jobs. Automatic capacity management. Best for intermittent or variable workloads.
  • EMR on EC2 – Full control over instance types, cluster configurations, and custom AMIs. Supports batch and streaming, integrates with tooling. Best for persistent clusters with specific configuration requirements.
  • EMR on EKS – Run Spark on existing Kubernetes clusters. Share compute resources between Spark and other applications. Best for organizations standardized on Kubernetes.
  • EC2 (self-managed) – Maximum flexibility but places a huge management burden. Not recommended unless you have very specific requirements not met by managed options.

Lowering EMR Costs

If you are paying for Hadoop nodes that are not doing anything, then you are just burning money. Key cost optimization strategies:

  • Use EMR Serverless – Pay only for resources consumed during job execution. No idle cluster costs.
  • Use Graviton instances – AWS Graviton-based instances provide up to 40% better price-performance compared to equivalent x86 instances. EMR fully supports Graviton2 and Graviton3 instance types.
  • Use Spot Instances for task nodes – Save roughly 60-90% on compute compared to On-Demand pricing. Use Instance Fleets with diversified instance types across multiple Availability Zones for better Spot capacity.
  • Use Instance Fleets with allocation strategies – EMR supports prioritized and capacity-optimized-prioritized allocation strategies for better instance provisioning reliability and cost optimization.
  • Enable Managed Scaling – EMR evaluates cluster metrics every 5-10 seconds and makes optimized scaling decisions. Supports Advanced Scaling with configurable optimization index (cost vs. performance).
  • Batch workloads and shut down clusters – Take an inventory of jobs, batch them, and shut down clusters in-between. Use transient clusters for batch processing.
  • Separate clusters by workload – Instead of one large always-on cluster, use auto-scaling separate clusters optimized for each workload type.
  • Use Amazon Linux AMI with preinstalled customizations – Faster cluster creation with custom AMIs.
  • Use On-Demand Capacity Reservations (ODCRs) – For predictable, steady-state workloads to ensure capacity availability.
  • Right-size instances – Use AWS Compute Optimizer recommendations to match instance types to actual resource demands.
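Several of the strategies above (Graviton, Spot task nodes, Instance Fleets, Managed Scaling) end up as cluster settings. The sketch below assembles those settings as plain dictionaries in the shape the EMR API expects for a task fleet and a managed scaling policy. The helper names, instance types, and capacity numbers are illustrative assumptions, not recommendations.

```python
def spot_task_fleet(instance_types, target_spot_units, timeout_minutes=20):
    """Describe a TASK instance fleet that runs entirely on Spot capacity."""
    if not instance_types:
        raise ValueError("provide at least one instance type to diversify across")
    return {
        "Name": "task-fleet",
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": target_spot_units,
        # Listing several types lets EMR pick whichever has Spot capacity.
        "InstanceTypeConfigs": [
            {"InstanceType": itype, "WeightedCapacity": 1} for itype in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": timeout_minutes,
                # Fall back to On-Demand if Spot cannot be provisioned in time.
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                "AllocationStrategy": "capacity-optimized",
            }
        },
    }


def managed_scaling_policy(min_units, max_units, max_on_demand_units):
    """Describe managed scaling limits, capping how much capacity is On-Demand."""
    if not min_units <= max_units:
        raise ValueError("min_units must not exceed max_units")
    return {
        "ComputeLimits": {
            "UnitType": "InstanceFleetUnits",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
            "MaximumOnDemandCapacityUnits": max_on_demand_units,
        }
    }


if __name__ == "__main__":
    # Graviton-based types chosen only as examples.
    print(spot_task_fleet(["m6g.xlarge", "m7g.xlarge", "r6g.xlarge"], 8))
    print(managed_scaling_policy(2, 20, 4))
```

Capping MaximumOnDemandCapacityUnits below MaximumCapacityUnits means any scale-out beyond that cap is served by Spot.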

EMR Security Best Practices

  • Use v2 managed IAM policies – AWS is deprecating v1 managed policies (AmazonElasticMapReduceFullAccess). Migrate to the new v2 policies with least-privilege access.
  • Enable encryption – Use at-rest and in-transit encryption for data processed by EMR clusters.
  • Use Lake Formation – Apply fine-grained access control (table, row, column, cell-level) on open table formats across all EMR deployment options.
  • Use security groups – Restrict network access to EMR clusters and manage port access.
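The encryption recommendation above is typically applied through an EMR security configuration, which is a JSON document. The sketch below builds one with at-rest and in-transit encryption enabled. The helper name, KMS key ARN, and certificate location are placeholders, and the exact set of keys should be confirmed against the EMR security configuration documentation.

```python
import json


def encryption_security_configuration(kms_key_arn: str, certs_s3_object: str) -> str:
    """Return a security configuration JSON string enabling both encryption modes."""
    config = {
        "EncryptionConfiguration": {
            "EnableAtRestEncryption": True,
            "EnableInTransitEncryption": True,
            "AtRestEncryptionConfiguration": {
                # Data written to S3 through EMRFS.
                "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"},
                # Data on the cluster's local volumes.
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": kms_key_arn,
                },
            },
            "InTransitEncryptionConfiguration": {
                "TLSCertificateConfiguration": {
                    "CertificateProviderType": "PEM",
                    "S3Object": certs_s3_object,
                }
            },
        }
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Both values below are placeholders for illustration only.
    print(encryption_security_configuration(
        "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
        "s3://example-bucket/certs/my-certs.zip",
    ))
```

The JSON string is what would be supplied when creating a named security configuration, which a cluster then references at launch.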

EMR Studio

Amazon EMR Studio is a managed IDE environment for developing, visualizing, and debugging applications written in R, Python, Scala, and PySpark using Jupyter notebooks. Key features:

  • Built-in real-time collaboration with peers
  • Git repository integration for version control
  • Workspaces that connect to EMR on EC2 clusters or EMR Serverless applications
  • Integration with SageMaker Unified Studio for AI/ML workflows
  • Support for Spark Connect (Spark 4.0) for interactive workloads

Amazon SageMaker Lakehouse Integration

Amazon SageMaker Lakehouse unifies all your data across Amazon S3 data lakes, Amazon S3 Tables, and Amazon Redshift data warehouses. With this integration:

  • Access and query unified data from EMR using Spark
  • Apply consistent governance through Lake Formation permissions
  • Capture data lineage of EMR Spark jobs into SageMaker Unified Studio
  • Use SageMaker Unified Studio for end-to-end analytics and AI/ML on a single platform

AWS Auto Scaling Lifecycle

Auto Scaling Lifecycle

  • Instances launched through the Auto Scaling group have a different lifecycle than that of other EC2 instances
  • Auto Scaling lifecycle starts when the Auto Scaling group launches an instance and puts it into service.
  • Auto Scaling lifecycle ends when the instance is terminated either by the user, or the Auto Scaling group takes it out of service and terminates it
  • AWS charges for the instances as soon as they are launched, including the time they are not yet InService

Auto Scaling Lifecycle Transition

Auto Scaling Group Lifecycle

Auto Scaling Lifecycle Hooks

  • Auto Scaling Lifecycle hooks enable performing custom actions by pausing instances as an Auto Scaling group launches or terminates them
  • Each Auto Scaling group can have multiple lifecycle hooks. However, there is a limit on the number of hooks per Auto Scaling group
  • Auto Scaling scale out event flow
    • Instances start in the Pending state
    • If an autoscaling:EC2_INSTANCE_LAUNCHING lifecycle hook is added, the instances move to the Pending:Wait state
    • After the lifecycle action is completed, the instances enter the Pending:Proceed state
    • When the instances are fully configured, they are attached to the Auto Scaling group and moved to the InService state
  • Auto Scaling scale in event flow
    • Instances are detached from the Auto Scaling group and enter the Terminating state.
    • If an autoscaling:EC2_INSTANCE_TERMINATING lifecycle hook is added, the state is moved to Terminating:Wait
    • After the lifecycle action is completed, the instances enter the Terminating:Proceed state.
    • When the instances are fully terminated, they enter the Terminated state.
  • During the scale out and scale in events, instances are put into a wait state (Pending:Wait or Terminating:Wait) and are paused until either a continue action happens or the timeout period ends.
  • By default, the instance remains in a wait state for one hour, which can be extended by restarting the timeout period by recording a heartbeat.
  • If the task finishes before the timeout period ends, the lifecycle action can be marked completed and it continues the launch or termination process.
  • After the wait period, the Auto Scaling group continues the launch or terminate process (Pending:Proceed or Terminating:Proceed)
  • Custom actions can be implemented using
    • CloudWatch Events (now Amazon EventBridge) target to invoke a Lambda function when a lifecycle action occurs. The event contains information about the instance that is launching or terminating and a token that can be used to control the lifecycle action.
    • Notification target (CloudWatch Events, SNS, SQS) for the lifecycle hook, which receives the message from EC2 Auto Scaling. The message contains information about the instance that is launching or terminating and a token that can be used to control the lifecycle action.
    • A script that runs on the instance as the instance starts. The script can control the lifecycle action using the ID of the instance on which it runs.
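The Lambda-based custom action can be sketched as a handler that reads the lifecycle event, runs the custom work, and then completes the lifecycle action with the token from the event. In this sketch the client is passed in so the logic can be exercised without AWS access; RecordingClient, the sample event values, and the custom_action callback are hypothetical stand-ins, while complete_lifecycle_action mirrors the CompleteLifecycleAction API operation.

```python
class RecordingClient:
    """Stand-in for an Auto Scaling API client; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def complete_lifecycle_action(self, **kwargs):
        self.calls.append(kwargs)
        return {}


def handle_lifecycle_event(event, autoscaling_client, custom_action):
    """Run the custom action for the instance, then report CONTINUE or ABANDON."""
    detail = event["detail"]
    try:
        succeeded = bool(custom_action(detail["EC2InstanceId"]))
    except Exception:
        # Any failure in the custom work is reported as ABANDON.
        succeeded = False

    result = "CONTINUE" if succeeded else "ABANDON"
    autoscaling_client.complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        LifecycleActionResult=result,
    )
    return result


# Illustrative event shape with made-up values.
SAMPLE_EVENT = {
    "detail": {
        "LifecycleActionToken": "example-token",
        "AutoScalingGroupName": "example-asg",
        "LifecycleHookName": "example-launch-hook",
        "EC2InstanceId": "i-0123456789abcdef0",
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_LAUNCHING",
    }
}

if __name__ == "__main__":
    client = RecordingClient()
    print(handle_lifecycle_event(SAMPLE_EVENT, client, lambda instance_id: True))
    print(client.calls)
```

In a real function the injected client would be the SDK's Auto Scaling client, and custom_action would do the actual bootstrapping check or log draining.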

Auto Scaling Lifecycle Hooks Considerations

  • Keeping Instances in a Wait State
    • Instances remain in a wait state for a finite period of time.
    • Default is 1 hour (3600 seconds) with the max being 48 hours or 100 times the heartbeat timeout, whichever is smaller.
    • Time can be adjusted using
      • complete-lifecycle-action (CompleteLifecycleAction) command to continue to the next state if the task finishes before the timeout period ends
      • put-lifecycle-hook command with the --heartbeat-timeout parameter to set the heartbeat timeout for the lifecycle hook during its creation
      • Restart the timeout period by recording a heartbeat, using the record-lifecycle-action-heartbeat (RecordLifecycleActionHeartbeat) command
  • Cooldowns and Custom Actions
    • Cooldown period helps ensure that the Auto Scaling group does not launch or terminate more instances than needed
    • Cooldown period starts when the instance enters the InService state. Any suspended scaling actions resume after the cooldown period expires
  • Health Check Grace Period
    • Health check grace period does not start until the lifecycle hook completes and the instance enters the InService state
  • Lifecycle Action Result
    • Result of the lifecycle hook is either ABANDON or CONTINUE
    • If the instance is launching,
      • CONTINUE indicates a successful action, and the instance can be put into service.
      • ABANDON indicates the custom actions were unsuccessful, and that the instance can be terminated.
    • If the instance is terminating,
      • ABANDON and CONTINUE allow the instance to terminate.
      • However, ABANDON stops any remaining actions from other lifecycle hooks, while CONTINUE allows them to complete
  • Spot Instances
    • Lifecycle hooks can be used with Spot Instances. However, a lifecycle hook does not prevent an instance from terminating due to a change in the Spot Price, which can happen at any time
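The wait-state limit above ("48 hours or 100 times the heartbeat timeout, whichever is smaller") is easy to get wrong, so here is a small worked sketch of the arithmetic. The function names are hypothetical; the constants come directly from the rule stated in the list.

```python
MAX_WAIT_CAP_SECONDS = 48 * 3600   # 48-hour upper bound
HEARTBEAT_MULTIPLIER = 100         # at most 100 times the heartbeat timeout


def max_wait_seconds(heartbeat_timeout_seconds: int = 3600) -> int:
    """Longest time an instance can be kept in a wait state via heartbeats."""
    if heartbeat_timeout_seconds <= 0:
        raise ValueError("heartbeat timeout must be positive")
    return min(MAX_WAIT_CAP_SECONDS, HEARTBEAT_MULTIPLIER * heartbeat_timeout_seconds)


def task_fits(task_seconds: int, heartbeat_timeout_seconds: int = 3600) -> bool:
    """True if a custom action of this length can finish within the limit."""
    return task_seconds <= max_wait_seconds(heartbeat_timeout_seconds)


if __name__ == "__main__":
    for timeout in (30, 300, 3600):
        print(timeout, "->", max_wait_seconds(timeout), "seconds")
```

With the default one-hour heartbeat timeout the 48-hour cap is the binding limit, while a short heartbeat timeout such as 5 minutes limits the total wait to a little over 8 hours.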

Enter and Exit Standby

  • An instance in the InService state can be moved to the Standby state.
  • Standby state enables you to remove the instance from service, troubleshoot or make changes to it, and then put it back into service.
  • Instances in a Standby state continue to be managed by the Auto Scaling group. However, they are not an active part of the application until they are put back into service.
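The lifecycle states covered in this section can be summarized as a small transition table. The sketch below is a simplified model for study purposes only: it covers the launch, terminate, and Standby paths described here and omits others (for example, detaching or an ABANDON result during launch). The names TRANSITIONS, is_valid_path, and the path helpers are hypothetical.

```python
# Allowed next states for each lifecycle state (simplified).
TRANSITIONS = {
    "Pending": {"Pending:Wait", "InService"},
    "Pending:Wait": {"Pending:Proceed"},
    "Pending:Proceed": {"InService"},
    "InService": {"Terminating", "EnteringStandby"},
    "EnteringStandby": {"Standby"},
    "Standby": {"Pending"},
    "Terminating": {"Terminating:Wait", "Terminated"},
    "Terminating:Wait": {"Terminating:Proceed"},
    "Terminating:Proceed": {"Terminated"},
    "Terminated": set(),
}


def is_valid_path(states) -> bool:
    """Check that every consecutive pair of states is an allowed transition."""
    return all(nxt in TRANSITIONS.get(cur, set())
               for cur, nxt in zip(states, states[1:]))


def launch_path(has_launch_hook: bool) -> list:
    """States an instance passes through during scale out."""
    if has_launch_hook:
        return ["Pending", "Pending:Wait", "Pending:Proceed", "InService"]
    return ["Pending", "InService"]


def terminate_path(has_terminate_hook: bool) -> list:
    """States an instance passes through during scale in."""
    if has_terminate_hook:
        return ["InService", "Terminating", "Terminating:Wait",
                "Terminating:Proceed", "Terminated"]
    return ["InService", "Terminating", "Terminated"]


# An instance taken out of service for troubleshooting and then returned.
STANDBY_ROUND_TRIP = ["InService", "EnteringStandby", "Standby", "Pending", "InService"]

if __name__ == "__main__":
    print(is_valid_path(launch_path(True)))
    print(is_valid_path(STANDBY_ROUND_TRIP))
    print(is_valid_path(["Standby", "InService"]))
```

Note that an instance leaving Standby re-enters through Pending rather than jumping straight back to InService.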

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Your application is running on EC2 in an Auto Scaling group. Bootstrapping is taking 20 minutes to complete. You find out that instances are shown as InService although the bootstrapping has not completed. How can you make sure that new instances are not added until the bootstrapping has finished? Choose the correct answer:
    1. Create a CloudWatch alarm with an SNS topic to send alarms to your DevOps engineer.
    2. Create a lifecycle hook to keep the instance in pending:wait state until the bootstrapping has finished and then put the instance in pending:proceed state.
    3. Increase the number of instances in your Auto Scaling group.
    4. Create a lifecycle hook to keep the instance in standby state until the bootstrapping has finished and then put the instance in pending:proceed state.
  2. When a scale out event occurs, the Auto Scaling group launches the required number of EC2 instances using its assigned launch configuration. What instance state do these instances start in? Choose the correct answer:
    1. pending:wait
    2. InService
    3. Pending
    4. Terminating
  3. With AWS Auto Scaling, once we apply a hook and the action is complete or the default wait state timeout runs out, the state changes to what, depending on which hook we have applied and what the instance is doing? Select two. Choose the 2 correct answers:
    1. pending:proceed
    2. pending:wait
    3. terminating:wait
    4. terminating:proceed
  4. For AWS Auto Scaling, what is the first transition state an existing instance enters after leaving steady state in Standby mode?
    1. Detaching
    2. Terminating:Wait
    3. Pending (You can put any instance that is in an InService state into a Standby state. This enables you to remove the instance from service, troubleshoot or make changes to it, and then put it back into service. Instances in a Standby state continue to be managed by the Auto Scaling group. However, they are not an active part of your application until you put them back into service.)
    4. EnteringStandby
  5. For AWS Auto Scaling, what is the first transition state an instance enters after leaving steady state when scaling in due to health check failure or decreased load?
    1. Terminating (When Auto Scaling responds to a scale in event, it terminates one or more instances. These instances are detached from the Auto Scaling group and enter the Terminating state.)
    2. Detaching
    3. Terminating:Wait
    4. EnteringStandby

References

AutoScalingGroupLifecycle