Google Cloud – Professional Data Engineer Certification Learning Path

I just recertified as a Google Cloud Certified – Professional Data Engineer. It has been 2 long years since my first attempt at the Data Engineer exam, which lasted 4 hours and had 95 questions. Once again, like the other Google Cloud certification exams, the Data Engineer exam not only covers the gamut of services and concepts but also focuses on logical thinking and practical experience.

📋 2025-2026 Exam Update Notice

The Professional Data Engineer exam has been significantly updated. Key changes include:

Increased focus on data governance (Dataplex), data lakehouse architecture (BigLake), Looker/Looker Studio for visualization, and Vertex AI for ML.

Reduced focus on deep ML concepts (overfitting, hyperparameters), Compute Engine/GKE, and command-line syntax.
New services covered: Dataplex Universal Catalog, BigLake, Analytics Hub, Dataform, Vertex AI (replacing AI Platform/Cloud ML Engine).
Deprecated services removed: Cloud Datalab (replaced by Vertex AI Workbench), Pub/Sub Lite (EOL March 2026), Data Catalog (replaced by Dataplex Knowledge Catalog).

Rebranding: Cloud DLP is now Sensitive Data Protection; Stackdriver is fully replaced by Cloud Monitoring/Logging; Vertex AI is now Gemini Enterprise Agent Platform.

Google Cloud – Professional Cloud Data Engineer Certification Summary

Cloud Data Engineer exam has 50 to 60 questions to be answered in 2 hours
Covers a wide range of data services including machine learning, with other topics covering storage, security, and data governance.

Exam does not cover any case studies
The exam has been updated to reflect current service names — Cloud Monitoring and Cloud Logging (no longer Stackdriver).
Strong focus on BigQuery, Dataflow, Pub/Sub, Dataproc, Cloud Composer, Looker, and Vertex AI.
Nothing much on Compute and Network is covered
Questions sometimes test your logical thinking rather than any concept regarding Google Cloud.

Hands-on is a MUST. If you have not worked on GCP before, make sure you do lots of labs, else you would be absolutely clueless about some of the questions and commands.
Be sure that NO online courses or practice tests are going to cover everything. Hands-on or practical knowledge is a MUST.

Google Cloud – Professional Cloud Data Engineer Certification Resources

Official Google Cloud Certified Professional Data Engineer Study Guide
Online Courses
- Udemy – Google Cloud Professional Data Engineer Certification
- Coursera – Preparing for the Google Cloud Professional Data Engineer Exam which is a good overview course but not detailed.
- Coursera – Data Engineering on Google Cloud Platform
- Whizlabs – Training videos for Google Cloud Certified Professional Data Engineer
Practice Tests
- Braincert Google Cloud Certified – Professional Data Engineer Practice Exams
- Whizlabs – Practice Questions for Google Cloud Certified Professional Data Engineer
Use Google Cloud Skills Boost (formerly Qwiklabs) for hands-on labs as much as possible.

Google Cloud – Professional Cloud Data Engineer Certification Topics

Data & Analytics Services

Obviously, there are lots and lots of data and related services
Google Cloud Data & Analytics Services Cheatsheet
Know the Big Data stack and understand which service fits the different layers of ingest, store, process, analytics
Cloud BigQuery
- provides scalable, fully managed enterprise data warehouse (EDW) with SQL and fast ad-hoc queries.
- ideal for storage and analytics.
- offers storage that is as cost-effective as Cloud Storage
- understand BigQuery Security
  - use BigQuery IAM access roles to control data and querying access
  - use Authorized views to control access to tables, columns within tables, and query results. HINT: Authorized views need to reside in a different dataset from the source dataset.
  - support data encryption
- understand BigQuery Best Practices including key strategy, cost optimization, partitioning, and clustering
  - use dry run to estimate costs
  - use partitioning and clustering to limit the amount of data scanned
  - using external data sources might degrade query performance, so it's better to import the data
- Dataset location can be set ONLY at the time of its creation.
- supports schema auto-detection for JSON and CSV files.
- understand how BigQuery Streaming works
- know BigQuery limitations esp. with updates and inserts
- supports an external data source (federated data source)
  - which is a data source that can be queried directly even though the data is not stored in BigQuery.
  - offers support for querying data directly from:
    - Cloud Bigtable
    - Cloud Storage
    - Google Drive
    - Cloud SQL
  - Use Permanent table for querying an external data source multiple times
  - Use Temporary table for querying an external data source for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.
- BigQuery Studio (launched 2023) provides a unified workspace with SQL and notebook (Colab Enterprise) interfaces for data engineers, analysts, and scientists to perform end-to-end data tasks.
- BigQuery editions (Standard, Enterprise, Enterprise Plus) provide flexible compute pricing with autoscaling slots, replacing the legacy flat-rate pricing model.
- BI Engine provides fast in-memory analysis for sub-second query performance on dashboards connected to BigQuery.
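
The dry-run and partitioning advice above comes down to bytes scanned. Here is a minimal sketch of that arithmetic in plain Python, assuming on-demand queries are billed per TiB scanned. The 6.25 USD per TiB price, the 10 TiB table, and the function names are illustrative assumptions, not BigQuery API calls, and the pruning estimate assumes evenly sized partitions.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def on_demand_cost(bytes_scanned: int, usd_per_tib: float = 6.25) -> float:
    """Estimate the on-demand cost of a query from the bytes it would scan.

    usd_per_tib is an assumed list price; check current pricing for your region.
    """
    return bytes_scanned / TIB * usd_per_tib


def bytes_after_pruning(table_bytes: int, total_partitions: int, partitions_read: int) -> int:
    """Approximate bytes scanned when a filter on the partitioning column
    limits the query to some partitions (assumes evenly sized partitions)."""
    return table_bytes * partitions_read // total_partitions


# Hypothetical table: 10 TiB spread over 365 daily partitions.
table_bytes = 10 * TIB
full_scan = on_demand_cost(table_bytes)
one_week = on_demand_cost(bytes_after_pruning(table_bytes, 365, 7))
print(f"full scan: ${full_scan:.2f}, last 7 days only: ${one_week:.2f}")
```

In practice the bytes figure would come from a dry run of the actual query rather than from a partition count.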
Cloud Bigtable
- provides a wide-column NoSQL database suitable for both low-latency single-point lookups and precalculated analytics
- understand Bigtable is not for long term storage as it is quite expensive
- know the differences with HBase
- Know how to measure performance and scale
- supports Development and Production mode. Development mode can be upgraded to production and not vice versa.
- supports HDD and SSD storage, chosen during cluster creation. HDD can be converted to SSD only by exporting the data to a new instance.
- understand Bigtable Replication. Can be used to separate real-time and batch workloads on the same instance using application profiles.
Cloud Pub/Sub
- is the messaging service designed to provide reliable, many-to-many, asynchronous messaging between applications, esp. for real-time IoT data capture
- now supports exactly-once delivery (when subscribers connect in the same region), in addition to the default at-least-once delivery.
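- how it compares to Kafka (HINT: Pub/Sub subscriptions retain unacknowledged messages for at most 7 days, and topic-level retention can be configured up to 31 days, vs Kafka where retention depends on the configured storage)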
- Note: Pub/Sub Lite has been deprecated (no new customers after Sept 24, 2024; EOL March 18, 2026). Use standard Pub/Sub instead.
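
With the default at-least-once delivery mentioned above, a subscriber may see the same message more than once, so handlers are usually written to be idempotent. Here is a minimal sketch in plain Python that does not use the Pub/Sub client library. The `Message` class, `DedupingHandler`, and the in-memory set are hypothetical stand-ins; a real subscriber would likely keep seen IDs in a durable store with expiry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    message_id: str  # hypothetical stand-in for the ID the service assigns
    data: str


class DedupingHandler:
    """Processes each message ID at most once, even if it is redelivered."""

    def __init__(self):
        self._seen = set()
        self.processed = []

    def handle(self, msg: Message) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        if msg.message_id in self._seen:
            return False  # duplicate delivery: safe to acknowledge and skip
        self._seen.add(msg.message_id)
        self.processed.append(msg.data)
        return True


handler = DedupingHandler()
# Simulated at-least-once stream: message "1" is redelivered.
deliveries = [Message("1", "temp=21"), Message("2", "temp=22"), Message("1", "temp=21")]
results = [handler.handle(m) for m in deliveries]
print(results, handler.processed)
```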
Cloud Dataflow
- used to process, transform, and transfer data, and is the key service for integrating storage and analytics.
- know how to improve a Dataflow performance
- understand Apache Beam features as well
  - understand PCollections, Transforms, ParDo and what they do
  - understand windowing, watermarks, and triggers. HINT: windowing and watermarks can be used to handle late-arriving messages
- supports the drain option, which stops ingesting new data but finishes processing the data already in flight; usually useful when deploying incompatible breaking changes
- canceling a job will lead to an immediate stop and in-flight data loss.
- Note: Dataflow SQL has been deprecated (July 2024, shutdown Jan 2025). Use standard Dataflow with Apache Beam SDK instead.
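
The windowing and watermark hint above can be sketched without Beam. This is a simplified model in plain Python, not the Apache Beam API: the watermark is taken to be the largest event time seen so far (a real runner estimates it differently), and `assign_to_windows`, the 60-second window size, and the sample events are all assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60


def window_start(event_time: int, size: int = WINDOW_SECONDS) -> int:
    """Fixed (tumbling) windows: each event belongs to exactly one window."""
    return event_time - (event_time % size)


def assign_to_windows(events, allowed_lateness: int = 0, size: int = WINDOW_SECONDS):
    """Group (event_time, value) pairs, given in arrival order, into fixed windows.

    An event is dropped as too late when the watermark has already passed the
    end of its window plus the allowed lateness.
    """
    windows = defaultdict(list)
    dropped = []
    watermark = float("-inf")
    for event_time, value in events:
        start = window_start(event_time, size)
        if watermark >= start + size + allowed_lateness:
            dropped.append((event_time, value))
        else:
            windows[start].append(value)
        watermark = max(watermark, event_time)
    return dict(windows), dropped


# Arrival order differs from event-time order: the event at t=50 arrives last.
events = [(10, "a"), (70, "b"), (130, "c"), (50, "d")]
strict = assign_to_windows(events, allowed_lateness=0)
lenient = assign_to_windows(events, allowed_lateness=120)
print(strict)
print(lenient)
```

With no allowed lateness the straggler is dropped; with 120 seconds of allowed lateness it still lands in the window that starts at 0.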

Cloud Dataprep (by Trifacta/Alteryx)
- to clean and prepare data. It can be used for anomaly detection.
- does not need any programming language knowledge and can be done through the graphical interface
- be sure to know or try hands-on on a dataset
- Note: Now operated by Alteryx. For new projects, consider Dataform (integrated into BigQuery) for SQL-based data transformations.
Cloud Dataproc
- to handle existing Hadoop/Spark jobs
- supports connector for BigQuery, Bigtable, Cloud Storage
- supports Ephemeral clusters; with the Cloud Storage connector, the data can be stored in GCS instead of HDFS
- you need to know how to improve the performance of the Hadoop cluster as well :). Know how to configure the Hadoop cluster to use all the cores (hint- spark executor cores) and handle out of memory errors (hint – executor memory)
- Secondary workers can be used to scale with the below limitations
  - Processing only with no data storage
  - No secondary-worker-only clusters
  - Persistent disk size is used for local caching of data and is not available through HDFS.
- how to install other components (hint – initialization actions)
- Dataproc Serverless allows running Spark batch workloads and interactive sessions without managing clusters.
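
The executor cores and executor memory hints above boil down to some sizing arithmetic. Here is a rough sketch, assuming common rules of thumb rather than Dataproc defaults: reserve 1 core and 1 GB per node, use 5 cores per executor, and keep about 10% of each executor's memory share for overhead. The function name and the sample cluster are hypothetical.

```python
def size_executors(node_cores: int, node_memory_gb: int, worker_nodes: int,
                   cores_per_executor: int = 5, overhead_fraction: float = 0.10) -> dict:
    """Rule-of-thumb Spark executor sizing for a cluster of identical workers.

    Assumptions (heuristics, not Dataproc defaults): reserve 1 core and 1 GB per
    node for the OS and daemons, and set aside overhead_fraction of each
    executor's share of memory for off-heap overhead.
    """
    usable_cores = max(1, node_cores - 1)
    usable_memory_gb = max(1, node_memory_gb - 1)
    executors_per_node = max(1, usable_cores // cores_per_executor)
    memory_share_gb = usable_memory_gb / executors_per_node
    executor_memory_gb = int(memory_share_gb * (1 - overhead_fraction))
    return {
        "spark.executor.cores": min(cores_per_executor, usable_cores),
        "spark.executor.memory": f"{executor_memory_gb}g",
        "spark.executor.instances": executors_per_node * worker_nodes,
    }


# Hypothetical cluster: 4 workers with 16 vCPUs and 64 GB each.
sizing = size_executors(node_cores=16, node_memory_gb=64, worker_nodes=4)
print(sizing)
```

Too few cores per executor leaves the cluster idle, while too little executor memory is the usual cause of out-of-memory errors, so both values are worth checking together.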
Vertex AI Workbench
- is the interactive notebook-based environment for data exploration, transformation, analysis, and visualization on Google Cloud
- replaces the deprecated Cloud Datalab (deprecated Sept 2022)
- provides managed and user-managed notebook instances with JupyterLab
- integrates with BigQuery, Dataproc, and other GCP services
Cloud Composer
- fully managed workflow orchestration service, based on Apache Airflow, enabling workflow creation that spans across clouds and on-premises data centers.
- pipelines are configured as directed acyclic graphs (DAGs)
- workflows can live on-premises, in multiple clouds, or fully within GCP.
- provides the ability to author, schedule, and monitor the workflows in a unified manner
- Composer 2 (current) provides autoscaling, better resource management, and improved performance over Composer 1.
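
To see why pipelines are modeled as DAGs, here is a small sketch using only the Python standard library (`graphlib`, Python 3.9+), not the Airflow API. The task names and the `run_order` helper are hypothetical. The point is that dependencies determine a valid execution order, and a cycle makes scheduling impossible.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"extract"},
    "load": {"validate", "transform"},
}


def run_order(dag: dict) -> list:
    """Return one valid execution order, or raise ValueError if there is a cycle."""
    try:
        return list(TopologicalSorter(dag).static_order())
    except CycleError as exc:
        raise ValueError(f"not a DAG, cycle found: {exc.args[1]}") from exc


order = run_order(pipeline)
print(order)
```

Note that `validate` and `transform` have no dependency on each other, so either may come first and a scheduler is free to run them in parallel.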

Data Governance & Catalog Services

Dataplex
- intelligent data fabric that enables organizations to centrally manage, monitor, and govern data across data lakes, data warehouses, and data marts.
- organizes data into Lakes, Zones, and Assets for logical data management.
- provides unified access management across BigQuery, Cloud Storage, and other services.
- supports data quality rules and automated data profiling.
- Dataplex Knowledge Catalog (formerly Dataplex Universal Catalog, replacing deprecated Data Catalog) provides metadata management, data discovery, and governance features.
- Understand data mesh architecture patterns with Dataplex — the exam tests when data mesh is the right answer.

BigLake
- unified storage engine that extends BigQuery’s fine-grained security and governance to multi-cloud and open-format data.
- creates a unified interface over data stored in Cloud Storage (and even AWS S3 or Azure ADLS).
- supports formats like Parquet, ORC, Avro, and Apache Iceberg.
- enables applying BigQuery column-level security and row-level access policies to data lake files.
Analytics Hub
- centralized platform for sharing BigQuery datasets securely, both within and across organizations.
- enables data providers to list datasets and data consumers to subscribe under governed access controls.
- supports private data exchanges for internal organizational sharing.
Dataform
- integrated into BigQuery for SQL-based data transformation and pipeline management.
- supports version control (Git), testing, and documentation for data pipelines.
- alternative to dbt for BigQuery-native SQL workflow orchestration.

Identity Services

Cloud IAM
- provides administrators the ability to manage cloud resources centrally by controlling who can take what action on specific resources.
- Understand how IAM works and how rules apply esp. the hierarchy from Organization -> Folder -> Project -> Resources
- Understand IAM Best practices
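
The hierarchy rule above can be modeled in a few lines. This is a simplified sketch in plain Python, not the IAM API: it treats the effective allow policy of a resource as the union of the bindings set on it and on every ancestor. The resource names, members, and the `effective_bindings` helper are hypothetical, and deny policies and conditions are ignored.

```python
# Hypothetical resource hierarchy: child -> parent (None marks the root).
PARENT = {
    "org/acme": None,
    "folder/analytics": "org/acme",
    "project/dw-prod": "folder/analytics",
    "dataset/sales": "project/dw-prod",
}

# Hypothetical allow-policy bindings set directly on each node: role -> members.
BINDINGS = {
    "org/acme": {"roles/viewer": {"group:auditors"}},
    "project/dw-prod": {"roles/bigquery.dataViewer": {"user:ana"}},
    "dataset/sales": {"roles/bigquery.dataEditor": {"user:raj"}},
}


def effective_bindings(resource: str) -> dict:
    """Union of the bindings on the resource and on every ancestor.

    Allow policies are additive: a child cannot remove access granted higher up.
    """
    merged = {}
    node = resource
    while node is not None:
        for role, members in BINDINGS.get(node, {}).items():
            merged.setdefault(role, set()).update(members)
        node = PARENT[node]
    return merged


print(effective_bindings("dataset/sales"))
print(effective_bindings("folder/analytics"))
```

This is why least privilege matters most at the top: a role granted at the organization level shows up on every dataset underneath it.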

Storage Services

Understand each storage service option and its use cases.
Cloud Storage
- cost-effective object storage for unstructured data.
- very important to know the different classes and their use cases:
  - Standard — frequent access (hot data)
  - Nearline — monthly access (30-day minimum storage)
  - Coldline — quarterly access (90-day minimum storage)
  - Archive — yearly access (365-day minimum storage, lowest cost)
- Autoclass automatically transitions objects between storage classes based on access patterns, eliminating retrieval and early-deletion charges.
- Understand Signed URLs, which give time-limited access to users who do not need to be GCP users
- Understand permissions – IAM vs ACLs (fine-grained control). Note: Uniform bucket-level access is now the recommended default over ACLs.
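
The storage class list above maps naturally to a lookup. Here is a deliberately simplified sketch in plain Python that picks a class from the expected gap between accesses, using only the minimum storage durations listed above. The `suggest_storage_class` helper is hypothetical, and it ignores retrieval fees, object size, and latency needs, all of which matter in a real decision.

```python
# Minimum storage durations (days) as listed above, ordered hot -> cold.
MIN_STORAGE_DAYS = {"STANDARD": 0, "NEARLINE": 30, "COLDLINE": 90, "ARCHIVE": 365}


def suggest_storage_class(days_between_accesses: int) -> str:
    """Pick the coldest class whose minimum storage duration fits the access gap."""
    if days_between_accesses < 0:
        raise ValueError("days_between_accesses must be non-negative")
    best = "STANDARD"
    for name, min_days in MIN_STORAGE_DAYS.items():
        if days_between_accesses >= min_days:
            best = name
    return best


for gap in (1, 30, 120, 400):
    print(gap, suggest_storage_class(gap))
```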

Cloud SQL
- is a fully-managed service that provides MySQL, PostgreSQL, and SQL Server.
- supports Enterprise and Enterprise Plus editions with different performance tiers.
- Limited to 64TB storage and is a regional service.
- No direct options for Oracle yet.

AlloyDB for PostgreSQL
- fully managed PostgreSQL-compatible database with up to 4x faster performance than standard PostgreSQL for transactional workloads and up to 100x faster for analytical queries.
- integrates with Vertex AI for built-in vector search and AI capabilities.
- ideal for demanding enterprise workloads requiring PostgreSQL compatibility with enhanced performance.
Cloud Spanner
- is a fully managed, mission-critical relational database service.
- provides a scalable online transaction processing (OLTP) database with high availability and strong consistency at a global scale.
- globally distributed and can scale and handle more than 10TB.
- now supports PostgreSQL interface for familiar tooling and migration.
- supports Spanner Graph, full-text search, and vector search (2024-2025) making it a multi-model database.
- not a direct replacement for Cloud SQL and would need migration

Cloud Firestore (Datastore mode)
- provides document database for web and mobile applications. Datastore mode is not for analytics.
- Firestore in Datastore mode is the recommended successor to the legacy Cloud Datastore.
- Understand Datastore indexes and how to update indexes for Datastore
- Firestore now offers Standard and Enterprise editions with enhanced features.

Machine Learning

Google expects the Data Engineer to know some of the Data Scientist's stuff, though the depth has been reduced in the current exam.

Understand the different algorithms
- Supervised Learning (labeled data)
  - Classification (for e.g. Spam or Not)
  - Regression (for e.g. Stock or House prices)
- Unsupervised Learning (Unlabelled data)
  - Clustering (for e.g. categories)
- Reinforcement Learning
Vertex AI (now rebranded as Gemini Enterprise Agent Platform)
- Unified ML platform replacing the legacy AI Platform and Cloud ML Engine.
- provides end-to-end ML workflow: data preparation, training, deployment, and monitoring.
- Vertex AI Workbench for notebook-based development (replaces Cloud Datalab).
- AutoML for building models without extensive ML expertise.
- Vertex AI Pipelines for orchestrating ML workflows.
- Model Registry for versioning and managing models.
- Access to Gemini foundation models for generative AI use cases.

Know the Cloud AI products which include
- Cloud Vision AI
- Cloud Natural Language AI
- Cloud Speech-to-Text
- Cloud Video Intelligence AI
- Dialogflow (conversational AI)

Monitoring

Cloud Monitoring and Cloud Logging (formerly Stackdriver)
- provides monitoring, alerting, error reporting, metrics, diagnostics, debugging, and trace capabilities.
- remember audits are mainly checking Cloud Logging entries (Audit Logs)
- Aggregated sink can route log entries from the organization or folder, plus (recursively) from any contained folders, billing accounts, or projects
- Cloud Logging supports log-based metrics for creating dashboards and alerts.

Security Services

Sensitive Data Protection (formerly Cloud Data Loss Prevention / Cloud DLP)
- to handle sensitive data esp. redaction of PII data.
- provides discovery, classification, and de-identification of sensitive data inside and outside Google Cloud.
- integrated with Security Command Center for risk assessment.
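
To make the redaction idea above concrete, here is a toy sketch in plain Python using regular expressions. It is not the Sensitive Data Protection API: the two patterns, the labels, and the `redact` helper are simplified assumptions, and the managed service relies on far richer detectors than these.

```python
import re

# Simplified, hypothetical detectors; real inspection uses far richer rules.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "US_SOCIAL_SECURITY_NUMBER": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each match with a placeholder naming the kind of data found."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


redacted = redact("Contact jane.doe@example.com, SSN 123-45-6789.")
print(redacted)
```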
Understand encryption techniques
- Google-managed encryption keys (default)
- Customer-managed encryption keys (CMEK) via Cloud KMS
- Customer-supplied encryption keys (CSEK)

Data Transfer Services

Storage Transfer Service
- allows import of large amounts of data into Google Cloud Storage, quickly and cost-effectively.
- supports transfers from AWS S3, Azure Blob Storage, HTTP/HTTPS locations, other GCS buckets, and on-premises file systems (via agent-based transfers).
- recommended for transferring more than 1 TB from on-premises or cloud sources.
Transfer Appliance
- to transfer large amounts of data (hundreds of TB to PB) quickly and cost-effectively into Google Cloud Platform via physical appliance.
- Check for the data size — typically used when network transfer would take too long.
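
The "check the data size" advice above is really a bandwidth calculation. Here is a back-of-the-envelope sketch in plain Python, assuming decimal units and that only 80% of the link is available for the transfer. The `transfer_days` helper, the utilization figure, and the sample sizes are illustrative assumptions.

```python
def transfer_days(data_tb: float, bandwidth_mbps: float, utilization: float = 0.8) -> float:
    """Days needed to move data_tb terabytes over a link of bandwidth_mbps.

    Uses decimal units (1 TB = 10**12 bytes, 1 Mbps = 10**6 bits/s) and assumes
    only a fraction of the link (utilization) is available for the transfer.
    """
    if bandwidth_mbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("bandwidth must be positive and utilization in (0, 1]")
    bits = data_tb * 1e12 * 8
    seconds = bits / (bandwidth_mbps * 1e6 * utilization)
    return seconds / 86_400


for size_tb, mbps in [(1, 100), (100, 1_000), (1_000, 1_000)]:
    print(f"{size_tb} TB over {mbps} Mbps: {transfer_days(size_tb, mbps):.1f} days")
```

When the estimate runs into weeks or months, a physical appliance usually becomes the more practical choice.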
BigQuery Data Transfer Service
- to integrate with third-party services (e.g., Google Ads, YouTube, Amazon S3, Teradata) and load data into BigQuery on a scheduled basis.

Visualization & BI

Looker Studio (formerly Google Data Studio)
- free, self-service BI tool for creating interactive dashboards and reports.
- connects directly to BigQuery and other data sources.
- can use BigQuery BI Engine for sub-second query performance.
Looker
- enterprise BI platform with LookML modeling language for governed metrics.
- provides semantic layer, embedded analytics, and data applications.
- integrated with BigQuery for governed, reusable analytics.