Google Cloud – Professional Data Engineer Certification learning path

After completing my Google Cloud – Professional Cloud Architect certification exam, I was looking into the Google Cloud – Professional Data Engineer exam, and luckily Google Cloud was running a pilot for its updated Professional Data Engineer certification exam. I applied for the free pilot and got a chance to take the exam. The pilot exam was 4 hours with 95 questions (compared to 2 hours with 50 questions for the regular exam). The results will be out in March 2019, but I can assure you that the overall exam is quite exhaustive. Once again, the exam not only covers the gamut of services and concepts but also focuses on logical thinking and practical experience.

Quick summary of the exam

  • Wide range of Google Cloud data services and what they actually do. It includes Storage and LOTS of Data services
  • Nothing much on Compute and Network is covered
  • Questions sometimes test your logical thinking rather than any concept regarding Google Cloud.
  • Hands-on experience is a must. If you have not worked on GCP before, make sure you do lots of labs, else you will be absolutely clueless for some of the questions and commands
  • Tests are updated for the latest enhancements.
  • The pilot exam does not cover the case studies. But given my Professional Cloud Architect exam experience, make sure you cover the case studies beforehand.
  • Be sure that NO online course or practice test is going to cover everything. I did Coursera and LinuxAcademy, which are really vast, but hands-on or practical knowledge is a MUST.

The list of topics is quite long, but some that you need to be sure to cover are

  • Identity Services
    • Cloud IAM 
      • provides administrators the ability to manage cloud resources centrally by controlling who can take what action on specific resources.
      • Understand how IAM works and how rules apply esp. the hierarchy from Organization -> Folder -> Project -> Resources
      • Understand IAM Best practices
      • Make sure you know the BigQuery Access roles
  • Storage Services
    • Understand each storage service option and its use cases.
    • Cloud Storage
      • cost-effective object storage for unstructured data.
      • very important to know the different classes and their use cases esp. Regional and Multi-Regional (frequent access), Nearline (monthly access) and Coldline (yearly access)
      • Understand Signed URLs, which give temporary access to users who do not need to be GCP users
      • Understand permissions – IAM vs ACLs (fine grained control)
    • Relational Databases
      • Know Cloud SQL and Cloud Spanner
      • Cloud SQL
        • is a fully-managed service that provides MySQL and PostgreSQL only.
        • Limited to 10TB and is a regional service.
      • Cloud Spanner
        • is a fully managed, mission-critical relational database service.
        • provides a scalable online transaction processing (OLTP) database with high availability and strong consistency at global scale.
        • globally distributed and can scale and handle more than 10TB.
        • not a direct replacement and would need migration
      • There are no direct options for Microsoft SQL Server or Oracle yet.
    • NoSQL
      • Know Cloud Datastore and Bigtable
      • Datastore
        • provides a document database for web and mobile applications. Datastore is not for analytics
        • Understand Datastore indexes and how to update indexes for Datastore
      • Bigtable
        • provides a column database suitable for both low-latency single-point lookups and precalculated analytics
        • understand that Bigtable is not for long-term storage as it is quite expensive
        • know the differences with HBase
        • Know how to measure performance and scale
        • supports Development and Production mode. Development mode can be upgraded to production and not vice versa.
        • supports HDD and SSD storage, chosen during cluster creation. HDD can be converted to SSD only by exporting the data to a new instance.
    • Data Warehousing
      • BigQuery
        • provides scalable, fully managed enterprise data warehouse (EDW) with SQL and fast ad-hoc queries.
        • Remember it is most suitable for historical analysis.
        • use Authorized views to control access to tables, columns within tables, and query results. HINT: Authorized views need to reside in a different dataset as compared to the source dataset.
        • Be sure to cover the BigQuery Best Practices including key strategy, cost optimization, partitioning, and clustering
        • Dataset location can be set ONLY at the time of its creation.
        • supports schema auto-detection for JSON and CSV files.
  • Data Services
    • Obviously, there is lots of Data and Just Data
    • Know the Big Data stack and understand which service fits the different layers of ingest, store, process, analytics
    • Cloud Storage
      • as the medium to store data as data lake
      • understand what class is the best suited and which one provides geo-redundancy.
    • Cloud Pub/Sub
      • as the messaging service to capture real-time data esp. IoT
      • is designed to provide reliable, many-to-many, asynchronous messaging between applications
      • how it compares to Kafka
    • Cloud Dataflow
      • to process, transform, and transfer data; it is the key service to integrate storage and analytics.
      • know how to improve Dataflow performance
      • understand Apache Beam features as well
        • understand PCollections, Transforms, ParDo and what they do
        • understand windowing, watermarks, and triggers. HINT: windowing and watermarks can be used to handle delayed messages
    • Cloud BigQuery
      • for storage and analytics. Remember BigQuery provides the same cost-effective option for storage as Cloud Storage
      • understand how BigQuery Streaming works
      • know BigQuery limitations esp. with updates and inserts
    • Cloud Dataprep
      • to clean and prepare data. It can be used for anomaly detection.
      • does not need any programming language knowledge and can be done through a graphical interface
      • be sure to try it hands-on with a dataset
    • Cloud Dataproc
      • to handle existing Hadoop/Spark jobs
      • you need to know how to improve the performance of the Hadoop cluster as well :). Know how to configure the Hadoop cluster to use all the cores (hint – Spark executor cores) and handle out-of-memory errors (hint – executor memory)
      • how to install other components (hint – initialization actions)
    • Cloud Datalab
      • is an interactive tool for exploration, transformation, analysis, and visualization of your data on Google Cloud Platform
      • based on Jupyter
    • Cloud Composer
      • fully managed workflow orchestration service based on Apache Airflow
      • pipelines are configured as directed acyclic graphs (DAGs)
      • workflows can live on-premises, in multiple clouds, or fully within GCP.
      • provides the ability to author, schedule, and monitor your workflows in a unified manner
  • Machine Learning
    • Google expects the Data Engineer to know some of the Data Scientist's stuff as well
    • Understand the different algorithms
      • Supervised Learning (labeled data)
        • Classification (e.g. Spam or Not)
        • Regression (e.g. Stock or House prices)
      • Unsupervised Learning (unlabeled data)
        • Clustering (e.g. categories)
      • Reinforcement Learning
    • Know Cloud ML with TensorFlow
    • Know all the Cloud AI products which include
      • Cloud Vision
      • Cloud Natural Language
      • Cloud Speech-to-Text
      • Cloud Video Intelligence
    • Cloud AutoML products, which can help you get started without much machine learning experience
  • Monitoring
    • Google Stackdriver provides everything from monitoring, alerting, error reporting, metrics, and diagnostics to debugging and tracing.
      • remember that audits mainly involve checking Stackdriver
  • Security Services
    • Data Loss Prevention API to handle sensitive data esp. redaction of PII data.
    • understand Encryption techniques
  • Other Services
    • Storage Transfer Service allows import of large amounts of online data into Google Cloud Storage, quickly and cost-effectively. Online data is the key here, as it supports AWS S3, HTTP/HTTPS and other GCS buckets. If the data is on-premises, you need to use the gsutil command
    • Transfer Appliance to transfer large amounts of data quickly and cost-effectively into Google Cloud Platform. Check the data size, as it is always compared against Storage Transfer Service or gsutil commands.
    • BigQuery Data Transfer Service to integrate with third-party services and load data into BigQuery
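
To make the Cloud IAM hierarchy point from the list above concrete, here is a minimal sketch of how access granted higher up flows down from Organization -> Folder -> Project -> Resource. This is a toy model in plain Python, not the Cloud IAM API: the resource names, members, and the `PARENT` / `BINDINGS` / `effective_bindings` names are all made up for illustration, and the only assumption is that the effective access on a resource is the union of what is granted on it and on each of its ancestors.

```python
# Toy model of IAM policy inheritance (NOT the real Cloud IAM API).
# All resource names and members below are hypothetical.

# child resource -> parent resource
PARENT = {
    "datasets/sales": "projects/analytics",
    "projects/analytics": "folders/data",
    "folders/data": "organizations/example",
}

# resource -> set of (member, role) pairs granted directly on that resource
BINDINGS = {
    "organizations/example": {("group:admins@example.com", "roles/owner")},
    "folders/data": {("group:data-eng@example.com", "roles/bigquery.dataEditor")},
    "datasets/sales": {("user:analyst@example.com", "roles/bigquery.dataViewer")},
}


def effective_bindings(resource):
    """Union of the bindings on the resource and on every ancestor."""
    result = set()
    node = resource
    while node is not None:
        result |= BINDINGS.get(node, set())
        node = PARENT.get(node)  # None once we walk past the organization
    return result


if __name__ == "__main__":
    for member, role in sorted(effective_bindings("datasets/sales")):
        print(member, role)
```

The takeaway for the exam is the direction of inheritance: a role granted at the folder level also applies to everything underneath it, while a role granted on a single dataset stays on that dataset.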

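The Dataflow hint about windowing and watermarks handling delayed messages can also be sketched in a few lines. The code below is a plain-Python simulation of the idea and does not use the Apache Beam SDK: `WINDOW_SIZE`, `MAX_DELAY`, `process`, and the sample `EVENTS` are hypothetical, and the watermark is a simple heuristic (largest event time seen so far minus an assumed maximum delay).

```python
from collections import defaultdict

# Simulation only: this is NOT the Apache Beam API.
WINDOW_SIZE = 60  # seconds per fixed (tumbling) window
MAX_DELAY = 30    # assumed bound on how far out of order events may arrive


def window_start(event_time):
    """Start of the fixed window that contains event_time."""
    return event_time - (event_time % WINDOW_SIZE)


def process(events):
    """Sum values per fixed window.

    events: (event_time, value) pairs in ARRIVAL order.
    Returns (results, late): results maps window start -> sum, and late
    lists events that arrived after the watermark passed their window.
    """
    open_windows = defaultdict(int)
    results = {}
    late = []
    watermark = float("-inf")
    for event_time, value in events:
        start = window_start(event_time)
        if start + WINDOW_SIZE <= watermark:
            late.append((event_time, value))  # window already emitted
        else:
            open_windows[start] += value
        # Heuristic watermark: we believe all earlier data has arrived.
        watermark = max(watermark, event_time - MAX_DELAY)
        closable = sorted(s for s in open_windows if s + WINDOW_SIZE <= watermark)
        for s in closable:
            results[s] = open_windows.pop(s)
    # End of input: emit whatever is still open.
    for s in sorted(open_windows):
        results[s] = open_windows.pop(s)
    return results, late


# Event at t=40 arrives out of order but in time; event at t=10 is too late.
EVENTS = [(5, 1), (50, 1), (70, 1), (40, 1), (95, 1), (10, 1), (130, 1)]

if __name__ == "__main__":
    print(process(EVENTS))
```

In this sample the event at t=40 arrives after t=70 but is still counted, because the watermark has not yet passed the end of its window. The event at t=10 arrives after that window was emitted and ends up in the late list, which is the situation allowed lateness and triggers deal with in Beam.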

10 thoughts on “Google Cloud – Professional Data Engineer Certification learning path”

  1. Hi Jayendra
    Thanks for your detailed explanation.

    Would it make a difference if I appear for the Associate Cloud Engineer exam prior to the Data Engineer one? I have no experience on the GCP platform, but I come from a data background.

    Appreciate your feedback.

    1. It would not make much difference, as the services covered by each exam are different, with very little overlap, maybe 20% at the concept level.
      Data Engineer is quite a tough exam to crack, so make sure you prepare well.

      1. Thank you, Jayendra, for your feedback and directions. I will give my 100% and follow your guidance as per this post.

  2. Hi dude.. seems there is change in the pattern starting Mar 29. Do u have any info on that?.. Can you pls provide some info on that if you have.

    1. The pattern listed is based on the Pilot exam, which has replaced the old pattern. The new exam doesn’t have any case studies.

    2. Hi Saurabh, I had appeared for the pilot exam for Data Engineer and it should already be in line with the latest exam.

  3. Hi Jayendra,

    Did you get any programming questions to solve using any of the services? I have little knowledge on programming. Wanted to know if I have to go through some basics of Python or other language to solve anything in the exam?

