Google Cloud – Professional Data Engineer Certification learning path

I just recertified as a Google Cloud Certified – Professional Data Engineer. It has already been 2 long years since my first attempt at the Data Engineer exam, which lasted 4 hours and had 95 questions. Once again, as with the other Google Cloud certification exams, the Data Engineer exam covers not only the gamut of services and concepts but also tests logical thinking and practical experience.

Google Cloud – Professional Cloud Data Engineer Certification Summary

  • Cloud Data Engineer exam had 50 questions to be answered in 2 hours
  • Covers a wide range of data services including machine learning, with other topics covering storage and security.
  • Exam does not cover any case studies
  • Although the exam covers the latest services, it has not been updated for Cloud Monitoring and Logging and still refers to Stackdriver.
  • Nothing much on Compute and Network is covered
  • Questions sometimes test your logical thinking rather than any concept regarding Google Cloud.
  • Hands-on experience is a MUST; if you have not worked on GCP before, make sure you do lots of labs, or you will be absolutely clueless about some of the questions and commands
  • Be aware that NO online course or practice test is going to cover everything. I did Coursera and LinuxAcademy, which are really vast, but hands-on or practical knowledge is a MUST.

Google Cloud – Professional Cloud Data Engineer Certification Resources

Google Cloud – Professional Cloud Data Engineer Certification Topics

Data & Analytics Services

  • Obviously, there are lots and lots of data and related services
  • Google Cloud Data & Analytics Services Cheatsheet
  • Know the Big Data stack and understand which service fits the different layers of ingest, store, process, analytics
  • Cloud BigQuery
    • provides scalable, fully managed enterprise data warehouse (EDW) with SQL and fast ad-hoc queries.
    • ideal for storage and analytics.
    • provides the same cost-effective option for storage as Cloud Storage
    • understand BigQuery Security
      • use BigQuery IAM access roles to control data and querying access
      • use authorized views to control access to tables, columns within tables, and query results. HINT: Authorized views need to reside in a different dataset from the source dataset.
      • supports data encryption
    • understand BigQuery Best Practices including key strategy, cost optimization, partitioning, and clustering
      • use dry run to estimate costs
      • use partitioning and clustering to limit the amount of data scanned
      • using external data sources might degrade query performance, so it is better to import the data
    • Dataset location can be set ONLY at the time of its creation.
    • supports schema auto-detection for JSON and CSV files.
    • understand how BigQuery Streaming works
    • know BigQuery limitations esp. with updates and inserts
    • supports an external data source (federated data source)
      • which is a data source that can be queried directly even though the data is not stored in BigQuery.
      • offers support for querying data directly from:
        • Cloud Bigtable
        • Cloud Storage
        • Google Drive
        • Cloud SQL
      • Use Permanent table for querying an external data source multiple times
      • Use Temporary table for querying an external data source for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.
  • Cloud Bigtable
    • provides a column database suitable for both low-latency single-point lookups and precalculated analytics
    • understand Bigtable is not for long term storage as it is quite expensive
    • know the differences with HBase
    • Know how to measure performance and scale
    • supports Development and Production mode. Development mode can be upgraded to production and not vice versa.
    • supports HDD and SSD storage, chosen during cluster creation. HDD can be converted to SSD by exporting the data to a new instance.
    • understand Bigtable Replication. Can be used to separate real-time and batch workloads on the same instance using application profiles.
  • Cloud Pub/Sub
    • as the messaging service to capture real-time data esp. IoT
    • is designed to provide reliable, many-to-many, asynchronous messaging between applications esp. real-time IoT data capture
    • guarantees at-least-once (but not exactly-once) message delivery, which can result in duplicate data if a message is not acknowledged within the defined time period.
    • how it compares to Kafka (HINT: provides only 7 days of retention vs Kafka which depends on the storage)
  • Cloud Dataflow
    • to process, transform, and transfer data; it is the key service to integrate storage and analytics.
    • know how to improve Dataflow performance
    • understand Apache Beam features as well
      • understand PCollections, Transforms, ParDo and what they do
      • understand windowing, watermarks, and triggers. Hint: windowing and watermarks can be used to handle delayed messages
    • supports a drain feature that stops ingesting new data but finishes processing the data already in flight; usually useful for deploying incompatible breaking changes
    • canceling a job will lead to an immediate stop and in-flight data loss.
  • Cloud Dataprep
    • to clean and prepare data. It can be used for anomaly detection.
    • does not need any programming language knowledge and can be done through the graphical interface
    • be sure to know or try hands-on on a dataset
  • Cloud Dataproc
    • to handle existing Hadoop/Spark jobs
    • supports connector for BigQuery, Bigtable, Cloud Storage
    • supports ephemeral clusters; with the Cloud Storage connector, the data can be stored in GCS instead of HDFS
    • you need to know how to improve the performance of the Hadoop cluster as well :). Know how to configure the Hadoop cluster to use all the cores (hint – spark executor cores) and handle out-of-memory errors (hint – executor memory)
    • Secondary workers can be used to scale with the below limitations
      • Processing only with no data storage
      • No secondary-worker-only clusters
      • Persistent disk size is used for local caching of data and is not available through HDFS.
    • how to install other components (hint – initialization actions)
  • Cloud Datalab
    • is an interactive tool for exploration, transformation, analysis, and visualization of your data on Google Cloud Platform
    • based on Jupyter
  • Cloud Composer
    • fully managed workflow orchestration service, based on Apache Airflow, enabling workflow creation that spans across clouds and on-premises data centers.
    • pipelines are configured as directed acyclic graphs (DAGs)
    • workflows can live on-premises, in multiple clouds, or fully within GCP.
    • provides the ability to author, schedule, and monitor the workflows in a unified manner
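
The at-least-once delivery point under Cloud Pub/Sub means a subscriber may see the same message more than once. A minimal sketch of an idempotent consumer follows, in plain Python with no Pub/Sub client involved. The `Message` type, its `message_id` field, and the `process_once` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    message_id: str  # hypothetical unique ID assigned by the messaging service
    payload: str


def process_once(messages, seen_ids=None):
    """Return payloads in arrival order, skipping redelivered message IDs."""
    seen_ids = set() if seen_ids is None else seen_ids
    results = []
    for msg in messages:
        if msg.message_id in seen_ids:
            continue  # duplicate delivery: already handled, so skip the work
        seen_ids.add(msg.message_id)
        results.append(msg.payload)
    return results


# Message "1" is delivered twice, as can happen when an ack arrives too late.
deliveries = [Message("1", "a"), Message("2", "b"), Message("1", "a")]
unique_payloads = process_once(deliveries)
```

In a real pipeline the set of seen IDs would need to live in a durable store with some expiry; the in-memory set here is only an assumption to keep the sketch self-contained.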

Identity Services

  • Cloud IAM 
    • provides administrators the ability to manage cloud resources centrally by controlling who can take what action on specific resources.
    • Understand how IAM works and how rules apply esp. the hierarchy from Organization -> Folder -> Project -> Resources
    • Understand IAM Best practices

Storage Services

  • Understand each storage service option and its use cases.
  • Cloud Storage
    • cost-effective object storage for unstructured data.
    • very important to know the different classes and their use cases esp. Regional and Multi-Regional (frequent access), Nearline (monthly access), and Coldline (yearly access)
    • Understand Signed URL to give temporary access and the users do not need to be GCP users
    • Understand permissions – IAM vs ACLs (fine-grained control)
  • Cloud SQL
    • is a fully-managed service that provides MySQL, PostgreSQL, and SQL Server.
    • Limited to 10TB and is a regional service.
    • No direct options for Oracle yet.
  • Cloud Spanner
    • is a fully managed, mission-critical relational database service.
    • provides a scalable online transaction processing (OLTP) database with high availability and strong consistency at a global scale.
    • globally distributed and can scale and handle more than 10TB.
    • not a drop-in replacement for an existing relational database; moving to it would need a migration
  • Cloud Datastore
    • provides a document database for web and mobile applications. Datastore is not for analytics
    • Understand Datastore indexes and how to update indexes for Datastore

Machine Learning

  • Google expects the Data Engineer to know some of the data scientist's material as well
  • Understand the different algorithms
    • Supervised Learning (labeled data)
      • Classification (e.g. Spam or Not)
      • Regression (e.g. Stock or House prices)
    • Unsupervised Learning (Unlabelled data)
      • Clustering (for e.g. categories)
    • Reinforcement Learning
  • Know Cloud ML with TensorFlow
  • Know all the Cloud AI products which include
    • Cloud Vision
    • Cloud Natural Language
    • Cloud Speech-to-Text
    • Cloud Video Intelligence
    • Cloud Dialogflow
  • Cloud AutoML products, which can help you get started without much machine learning experience

Monitoring

  • Cloud Monitoring and Logging
    • provides everything from monitoring, alerting, and error reporting to metrics, diagnostics, debugging, and tracing.
    • remember audits are mainly checking Cloud Logging entries
    • An aggregated sink can route log entries from the organization or folder, plus (recursively) from any contained folders, billing accounts, or projects

Security Services

Other Services

  • Storage Transfer Service 
    • allows import of large amounts of online data into Google Cloud Storage, quickly and cost-effectively. Online data is the key here as it supports AWS S3, HTTP/HTTPS, and other GCS buckets. If the data is on-premises you need to use the gsutil command
  • Transfer Appliance 
    • to transfer large amounts of data quickly and cost-effectively into Google Cloud Platform. Check the data size; this option is always compared against Storage Transfer Service or gsutil commands.
  • BigQuery Data Transfer Service
    • to integrate with third-party services and load data into BigQuery
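
When a question compares Transfer Appliance with an online transfer, the deciding factor is usually how long the data would take to move over the available link. The arithmetic can be sketched as below; the `transfer_days` helper, the decimal-terabyte convention, and the example link speeds are assumptions for illustration, not official sizing guidance.

```python
def transfer_days(size_terabytes, bandwidth_mbps, utilization=1.0):
    """Days needed to move size_terabytes over a link of bandwidth_mbps."""
    size_bits = size_terabytes * 1e12 * 8  # decimal terabytes to bits
    effective_bps = bandwidth_mbps * 1e6 * utilization
    seconds = size_bits / effective_bps
    return seconds / 86_400  # seconds per day


# 100 TB over a 100 Mbps link versus a 10 Gbps link, at full utilization.
days_100tb_100mbps = transfer_days(100, 100)
days_100tb_10gbps = transfer_days(100, 10_000)
```

Under these assumptions, 100 TB takes roughly 93 days at 100 Mbps but under a day at 10 Gbps, which is the kind of gap that makes a physical appliance worth considering on a slow link.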

Google Cloud – Professional Cloud Architect Certification learning path

Re-certified !!!! The Google Cloud – Professional Cloud Architect certification exam is one of the toughest exams I have appeared for. Even though it was a recertification, the preparation level was the same as for the first one. The gamut of services and concepts it tests your knowledge on is really vast.

Google Cloud – Professional Cloud Architect Certification Summary

  • Has 50 questions to be answered in 2 hours.
  • Covers a wide range of Google Cloud services and what they actually do.
  • Includes Compute, Storage, Network and even Data services
  • Questions sometimes test your logical thinking rather than any concept regarding Google Cloud.
  • Hands-on experience is a MUST; if you have not worked on GCP before, make sure you do lots of labs, or you will be absolutely clueless about some of the questions and commands
  • Make sure you cover the case studies beforehand. I got ~15 questions (almost 5 per case study) and they can really be a savior for you in the exam.
  • Be aware that NO online course or practice test is going to cover everything. I did LinuxAcademy (a bit old now), which is really vast, but hands-on or practical knowledge is a MUST.

Google Cloud – Professional Cloud Architect Certification Resources

Google Cloud – Professional Cloud Architect Certification Topics

General Services

  • Cloud Billing
    • understand how Cloud Billing works. Monthly vs Threshold and which has priority
    • Budgets can be set to alert for projects
    • how to change a billing account for a project and what roles you need. Hint – Project Owner and Billing Administrator for the billing account
    • Cloud Billing can be exported to BigQuery and Cloud Storage
  • Resource Manager
    • Understand the Resource Manager hierarchy: Organization -> Folders -> Projects -> Resources
    • IAM Policy inheritance is transitive and resources inherit the policies of all of their parent resources.
    • Effective policy for a resource is the union of the policy set on that resource and the policies inherited from higher up in the hierarchy.
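
The inheritance rule above (effective policy = policy on the resource plus everything inherited from its ancestors) can be sketched as a walk up the hierarchy. This is a simplified model in plain Python, not the IAM API: the `effective_policy` helper, the resource names, and the example members are hypothetical, and it ignores features such as conditions.

```python
def effective_policy(resource, parents, policies):
    """Union of role bindings set on a resource and on all of its ancestors."""
    merged = {}
    node = resource
    while node is not None:
        for role, members in policies.get(node, {}).items():
            merged.setdefault(role, set()).update(members)
        node = parents.get(node)  # move one level up; None ends the walk
    return merged


# Hypothetical hierarchy: org -> folder -> project -> bucket.
parents = {"bucket": "project", "project": "folder", "folder": "org", "org": None}
policies = {
    "org": {"roles/viewer": {"user:auditor@example.com"}},
    "project": {"roles/editor": {"user:dev@example.com"}},
    "bucket": {"roles/viewer": {"user:analyst@example.com"}},
}
bucket_policy = effective_policy("bucket", parents, policies)
```

Because the result is a union, a grant made higher up cannot be taken away by a policy set lower down in this model, which is why broad roles at the organization level deserve care.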

Identity Services

  • Cloud Identity and Access Management
    • Identity and Access Management – IAM provides administrators the ability to manage cloud resources centrally by controlling who can take what action on specific resources.
    • Understand how IAM works and how rules apply esp. the hierarchy from Organization -> Folder -> Project -> Resources
    • Understand the difference between Primitive, Pre-defined and Custom roles and their use cases
    • IAM Policy inheritance is transitive and resources inherit the policies of all of their parent resources.
    • Effective policy for a resource is the union of the policy set on that resource and the policies inherited from higher up in the hierarchy.
    • Basically Permissions -> Roles -> (IAM Policy) -> Members
    • Know how to use service accounts with applications
  • Cloud Identity
    • Cloud Identity provides IDaaS (Identity as a Service), with single sign-on functionality and federation with external identity providers like Active Directory.
    • Cloud Identity supports federating with Active Directory using GCDS to implement the synchronization

Compute Services

    • Make sure you know all the compute services: Google Compute Engine, Google App Engine and Google Kubernetes Engine. You need to know the pros and cons of each and the use cases where you should use them.
    • Google Compute Engine
      • Google Compute Engine is the best IaaS option for compute and provides fine grained control
      • Know how to create a Compute Engine instance, connect to it using Cloud shell or ssh keys
      • Difference between backups and images and how to create instances from the same.
      • Understand Compute Engine Storage Options. Disk throughput and IOPS depend on type and size.
      • Understand Compute Engine Snapshots
      • Instance templates with managed instance groups provide scalability and high availability
      • An instance template cannot be edited; create a new one and attach it.
      • Difference between managed vs unmanaged instance groups and auto-healing feature
      • Managed instance groups are covered heavily in the exam, as they provide the key auto-scaling capability. Hint: you need to create an Instance template and associate it with the Instance group
      • Understand how migration or traffic splitting with Managed instance groups works Hint – rolling updates & deployments
      • Preemptible VMs and their use cases. HINT – can be terminated any time and supports max 24 hours.
      • Upgrade an instance without downtime using Live Migration
      • Managing access using OS Login or project and instance metadata
      • Prevent accidental deletion using deletion protection flag
      • Understand the pricing and discounts model. Hint – Sustained (automatic, up to 30%) vs Committed (1 to 3 yrs) discounts.
      • Know how to debug issues or errors when they occur
    • Google App Engine
      • Google App Engine is the main PaaS option; know the platforms it supports and the features it provides.
      • Deploy an application with App Engine and understand how versioning and rolling deployments can be done
      • Understand how autoscaling, traffic splitting, and migration work.
      • Know App Engine is a regional resource and understand the steps to migrate or deploy an application to a different region and project.
      • Know the difference between App Engine Flexible vs Standard
    • Google Kubernetes Engine
      • Google Kubernetes Engine, powered by the open source container scheduler Kubernetes, enables you to run containers on Google Cloud Platform.
      • Kubernetes Engine takes care of provisioning and maintaining the underlying virtual machine cluster, scaling your application, and operational logistics such as logging, monitoring, and cluster health management.
      • A node pool is a subset of machines that all have the same configuration, including machine type (CPU and memory) and authorization scopes. Node pools represent a subset of nodes within a cluster; a container cluster can contain one or more node pools. Hint: to add new machine types, you need to add a new node pool, as an existing one cannot be edited
      • Be sure to Create a Kubernetes Cluster and configure it to host an application
      • Understand how to make the cluster auto repairable and upgradable. Hint – Node auto-upgrades and auto-repairing feature
      • Very important to understand where to use gcloud commands (to create a cluster) and kubectl commands (manage the cluster components)
      • Very important to understand how to increase cluster size and enable autoscaling for the cluster
      • Know how to manage secrets like database passwords
    • Cloud Functions
      • is a lightweight, event-based, asynchronous compute solution that allows you to create small, single-purpose functions that respond to cloud events without the need to manage a server or a runtime environment.
      • Remember that Cloud Functions is serverless and scales from zero up to meet demand and back to zero as the demand changes.

Network Services

  • Virtual Private Cloud
    • Understand Virtual Private Cloud (VPC), subnets and how to host applications within them. Hint – a VPC spans across regions
    • Understand how Firewall rules work and how they are configured. Hint – Focus on Network Tags. Also, there are 2 implicit firewall rules – default ingress deny and default egress allow
    • Understand VPC Peering and Shared VPC
    • Understand the concept of internal and external IPs and the difference between static and ephemeral IPs
    • Primary IP range of an existing subnet can be expanded by modifying its subnet mask, setting the prefix length to a smaller number.
    • Understand Private Google Access use cases
  • On-premises connectivity
    • Cloud VPN and Interconnect are 2 components which help you connect to an on-premises data center.
    • Understand the limitations of Cloud VPN esp. the 3Gbps limit, and how it can be improved with multiple tunnels.
    • Understand the requirements to set up Cloud VPN.
    • Cloud Router provides dynamic routing using BGP
    • Know Interconnect as the reliable high speed, low latency and dedicated bandwidth options.
  • Cloud Load Balancing (GCLB)
    • Google Cloud Load Balancing provides scaling, high availability, and traffic management for your internet-facing and private applications.
    • Understand Google Load Balancing options and their use cases esp. which is global and internal and what protocols they support.
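
The subnet expansion point under Virtual Private Cloud (a smaller prefix length gives a larger range) can be checked with Python's standard `ipaddress` module. This is only local arithmetic on CIDR ranges, not a call to any Google Cloud API; the `expand_subnet` helper and the example range are assumptions for illustration.

```python
import ipaddress


def expand_subnet(cidr, new_prefix):
    """Return the larger range obtained by shortening the prefix length."""
    current = ipaddress.ip_network(cidr)
    if new_prefix >= current.prefixlen:
        raise ValueError("expansion requires a smaller prefix length")
    return current.supernet(new_prefix=new_prefix)


original = ipaddress.ip_network("10.0.0.0/24")
expanded = expand_subnet("10.0.0.0/24", 20)

# The original range is fully contained in the expanded one.
still_contained = original.subnet_of(expanded)
```

Going from /24 to /20 grows the range from 256 to 4096 addresses while keeping every existing address valid, which is why expansion is possible without disturbing running instances.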

Storage Services

  • Understand each storage option and its use cases.
  • Persistent disks
    • attached to Compute Engine instances; they provide fast access but are limited in scalability, availability and scope.
    • Remember performance depends on the size of the disk
  • Cloud Storage
    • Cloud Storage is cost-effective object storage for unstructured data.
    • very important to know the different storage classes and their use cases esp. Regional and Multi-Regional (frequent access), Nearline (monthly access) and Coldline (yearly access)
    • Understand life cycle management. HINT – Changes are based on the object creation date
    • Understand various data encryption techniques
    • Understand Signed URL to give temporary access and the users do not need to be GCP users
    • Understand access control and permissions – IAM vs ACLs (fine grained control)
    • Understand best practices esp. for uploading and downloading the data. HINT – use parallel composite uploads
  • Relational Databases
    • Know Cloud SQL and Cloud Spanner
    • Cloud SQL
      • Cloud SQL is a fully-managed service that provides MySQL, PostgreSQL and MS SQL Server
      • limited to 10TB and is a regional service.
      • Difference between Failover and Read replicas. Failover provides High Availability and almost zero downtime while Read replicas provide scalability. Cross region Read Replicas are supported
      • Perform Point-In-Time recovery. Hint – requires binary logging and backups
      • MS SQL Server support was added recently. Previously, HA required setting up SQL Server on Compute Engine with Always On Availability Groups on Windows Failover Clustering, with nodes placed in different subnets.
    • Cloud Spanner
      • is a fully managed, mission-critical relational database service.
      • provides a scalable online transaction processing (OLTP) database with high availability and strong consistency at global scale.
      • globally distributed and can scale and handle more than 10TB.
      • not a direct replacement and would need migration
    • There are no direct options for Oracle yet.
  • NoSQL
    • Know Cloud Datastore and Bigtable
    • Datastore
      • provides a document database for web and mobile applications. Datastore is not for analytics
      • Understand Datastore indexes and how to update indexes for Datastore
      • Can be configured as multi-regional or regional
    • Bigtable
      • provides a column database suitable for both low-latency single-point lookups and precalculated analytics
      • understand Bigtable is not for long term storage as it is quite expensive
  • Data Warehousing
    • BigQuery
      • provides scalable, fully managed enterprise data warehouse (EDW) with SQL and fast ad-hoc queries.
      • Remember it is most suitable for historical analysis.
  • MemoryStore and Firebase did not feature in any of the questions
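
For the Cloud Storage life cycle management point above, a rule pairs an action with a condition, and the `age` condition counts days since object creation. The sketch below builds such a configuration as JSON in Python. The `lifecycle_config` helper and the day thresholds are hypothetical, and the JSON shape reflects my understanding of the lifecycle configuration format, so verify it against the current documentation before applying it to a bucket.

```python
import json


def lifecycle_config(nearline_after_days, delete_after_days):
    """Build a two-rule lifecycle configuration keyed on object age in days."""
    return {
        "rule": [
            {
                # Move objects to a cheaper class once they reach this age.
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": nearline_after_days},
            },
            {
                # Delete objects once they reach this age.
                "action": {"type": "Delete"},
                "condition": {"age": delete_after_days},
            },
        ]
    }


config = lifecycle_config(30, 365)
config_json = json.dumps(config, indent=2)
```

The resulting `config_json` string is what would be saved to a file and applied to a bucket with the tooling of your choice.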

Data Services

  • Although there is a different certification for Data Engineer, the Cloud Architect does cover data services. Data services are also part of the use cases so be sure to know about them
  • Know the Big Data stack and understand which service fits the different layers of ingest, store, process, analytics, use
  • Key Services which need to be mainly covered are –
    • Cloud Storage as the medium to store data as data lake
    • Cloud Pub/Sub
      • as the messaging service to capture real time data esp. IoT
      • is designed to provide reliable, many-to-many, asynchronous messaging between applications esp. real time IoT data capture
      • Cloud Storage can generate notifications on object changes (Object change notification)
    • Cloud Dataflow to process, transform, and transfer data; it is the key service to integrate storage and analytics.
    • Cloud BigQuery for storage and analytics. Remember BigQuery provides the same cost-effective option for storage as Cloud Storage
    • Cloud Dataprep to clean and prepare data. Hint – It can be used for anomaly detection.
    • Cloud Dataproc to handle existing Hadoop/Spark jobs. Hint – Use it to replace existing Hadoop infrastructure.
    • Cloud Datalab is an interactive tool for exploration, transformation, analysis and visualization of your data on Google Cloud Platform
  • Know standard patterns Cloud Pub/Sub -> Dataflow -> BigQuery
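
The ingest -> process -> store pattern above can be pictured with a toy, in-memory stand-in for each layer. Nothing here uses Pub/Sub, Dataflow, or BigQuery: the `ingest`, `transform`, and `load` functions, the `ts` and `device` fields, and the 60-second window are all assumptions chosen only to show how the stages hand data to each other.

```python
import json
from collections import Counter


def ingest(raw_messages):
    """Stand-in for the messaging layer: yields raw messages one at a time."""
    yield from raw_messages


def transform(messages, window_seconds=60):
    """Stand-in for the processing layer: parse JSON, count events per fixed window."""
    counts = Counter()
    for raw in messages:
        event = json.loads(raw)
        window_start = event["ts"] - event["ts"] % window_seconds
        counts[(window_start, event["device"])] += 1
    return counts


def load(counts):
    """Stand-in for the warehouse layer: flatten counts into sorted table rows."""
    return [
        {"window_start": window, "device": device, "events": n}
        for (window, device), n in sorted(counts.items())
    ]


raw = [
    '{"ts": 5, "device": "a"}',
    '{"ts": 42, "device": "a"}',
    '{"ts": 61, "device": "b"}',
]
rows = load(transform(ingest(raw)))
```

Each stage only depends on the output of the previous one, which mirrors why the managed services can be scaled and swapped independently.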

Monitoring

  • Google Cloud Monitoring or Stackdriver
    • provides everything from monitoring, alerting, and error reporting to metrics, diagnostics, debugging, and tracing.
    • remember audits are mainly checking Stackdriver
  • Google Cloud Logging or Stackdriver logging

DevOps services

  • Deployment Manager 
    • provides Infrastructure as Code
    • provides dynamic provisioning with templates
  • Cloud Source Repositories
    • provides source code repository with Git version control to support collaborative development
  • Container Registry
    • is a private Docker image storage system on Google Cloud Platform.
    • images stored are immutable.
  • Cloud Build
    • is a service that executes your builds on Google Cloud Platform infrastructure.
  • MarketPlace (Cloud Launcher)
    • provides a way to launch common software packages e.g. Jenkins or WordPress and stacks on Google Compute Engine with just a few clicks like a prepackaged solution.
    • can help minimize deployment time and can be used without any knowledge about the product

Security Services

  • Cloud Security Scanner 
    • is a web application security scanner that enables developers to easily check for a subset of common web application vulnerabilities in websites built on App Engine and Compute Engine.
  • Data Loss Prevention API
    • to handle sensitive data esp. redaction of PII data.
  • PCI-DSS compliant
    • GCP services are PCI-DSS compliant; however, you need to make sure your applications and hosting are in line with PCI-DSS requirements
  • Same concept as PCI-DSS applies to GDPR as well

Other Services

  • Know various data transfer options
  • Storage Transfer Service
    • allows import of large amounts of online data into Google Cloud Storage, quickly and cost-effectively.
    • Online data is the key here as it supports AWS S3, HTTP/HTTPS and other GCS buckets.
    • for on-premises data you need to use the gsutil command
  • Transfer Appliance 
    • to transfer large amounts of data quickly and cost-effectively into Google Cloud Platform.
    • Check for the data size and it would be always compared with Google Transfer Service or gsutil commands.
    • Transfer Appliance Rehydrator provides data rehydration, which is the process of fully reconstituting the files so that the transferred data can be accessed and used.
  • Spinnaker
    • is an open source, multi-cloud, continuous delivery platform and does appear in answer options. So be sure to know about it.
  • Jenkins
    • for Continuous Integration and Continuous Delivery.

Case Studies