The rebrand unifies the former “Dataproc on Compute Engine” (cluster deployment) and “Google Cloud Serverless for Apache Spark” (serverless deployment) under a single umbrella. The gcloud dataproc CLI commands and console URLs remain functional, but documentation now uses the new name.

This post uses the original “Dataproc” name for continuity, as the service functionality remains identical.

Google Cloud Dataproc (Managed Service for Apache Spark)

Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when they are not needed.
Dataproc helps reduce time and money spent on administration and lets you focus on your jobs and your data.

Dataproc clusters are quick to start, scale, and shut down, with each of these operations taking 90 seconds or less, on average.
Dataproc has built-in integration with other GCP services, such as BigQuery, Cloud Storage, Bigtable, Cloud Logging, and Cloud Monitoring.
Dataproc clusters support Spot VMs (previously called preemptible instances) that have lower compute prices to reduce costs further.

Dataproc supports connectors for BigQuery, Bigtable, Cloud Storage, and Cloud Spanner.
Dataproc supports Anaconda, HBase, Flink, Hive WebHCat, Druid, Jupyter, Presto, Trino, Solr, Zeppelin, Ranger, Zookeeper, Delta Lake, Iceberg, Hudi, and much more as optional components.
Dataproc offers two deployment modes:
- Cluster Deployment (Dataproc on Compute Engine) — Spark-clusters-as-a-service; you manage infrastructure configuration and pay for cluster uptime.
- Serverless Deployment (Serverless for Apache Spark) — Spark-jobs-as-a-service; fully managed Google Cloud infrastructure with pay-per-job-runtime billing.

Dataproc Serverless for Apache Spark

Dataproc Serverless lets you run Spark workloads without provisioning or managing a cluster.

Serverless supports two workload types:
- Batch Workloads — Submit PySpark, Spark SQL, SparkR, or Spark (Java/Scala) batch jobs. Resources are auto-scaled and charges apply only during execution.
- Interactive Sessions — Write and run code in Jupyter notebooks or BigQuery Studio notebooks via Spark Connect.

Serverless uses Dynamic Resource Allocation for autoscaling (not YARN-based).
Supports scheduling via Cloud Composer (Airflow) or Cloud Scheduler.
Offers Standard and Premium tiers:
- Standard Tier — Core batch execution with autoscaling.
- Premium Tier — Adds Lightning Engine, Native Query Execution, interactive sessions, and Gemini-powered autotuning.
Supports custom container images, GPUs, and VPC Service Controls.
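
As a rough illustration of submitting a serverless batch workload, the Python sketch below assembles a gcloud dataproc batches submit pyspark command line. The helper name batch_submit_command is hypothetical, and the bucket path, region, and Spark property are placeholder assumptions rather than values from this post.

```python
import shlex


def batch_submit_command(main_file, region, properties=None):
    """Build the argv for a serverless PySpark batch submission.

    main_file: gs:// URI of the PySpark script (placeholder in this sketch).
    properties: optional dict of Spark properties passed via --properties.
    """
    argv = [
        "gcloud", "dataproc", "batches", "submit", "pyspark",
        main_file,
        f"--region={region}",
    ]
    if properties:
        # Sort for a stable, reproducible command line.
        pairs = ",".join(f"{key}={value}" for key, value in sorted(properties.items()))
        argv.append(f"--properties={pairs}")
    return argv


# Hypothetical values, for illustration only.
cmd = batch_submit_command(
    "gs://example-bucket/jobs/etl.py",
    "us-central1",
    {"spark.executor.cores": "4"},
)
print(shlex.join(cmd))
```

Building the command as a list keeps quoting out of the picture until the final print, and the same list can be handed to subprocess.run unchanged.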

Dataproc Lightning Engine

Lightning Engine is a next-generation performance layer that accelerates Spark workloads up to 4.9x faster than open-source Apache Spark with zero code changes.

Available for both cluster and serverless deployments.
Key components:
- Native Query Execution (NQE) — A C++ vectorized execution engine built on Velox and Apache Gluten that bypasses JVM bottlenecks.
- Intelligent Caching — Automatically caches frequently accessed data for faster reads.
- Optimized Columnar Shuffling — Reduces shuffle overhead for large joins and aggregations.
Enabled by specifying --engine=lightning during cluster creation or selecting the Premium tier for serverless workloads.
Does not require any application code changes to existing Spark jobs.

Dataproc Cluster High Availability

A Dataproc cluster can be configured for High Availability by specifying the number of master instances in the cluster.
Dataproc supports the following cluster configurations:
- Single Node Cluster — 1 master, 0 Workers (non-HA)
  - Provides one node for both master and worker.
  - If the master fails, in-flight jobs will fail and need to be retried, and HDFS will be inaccessible until the single NameNode fully recovers on reboot.
- Standard Cluster — 1 master, N Workers (default for multi-node)
  - Standard configuration with separate master and worker nodes.
- High Availability Cluster — 3 masters, N Workers (Hadoop HA)
  - HDFS High Availability and YARN High Availability are configured to allow uninterrupted YARN and HDFS operations despite any single-node failures/reboots.
All nodes in a High Availability cluster reside in the same zone. If there is a failure that impacts all nodes in a zone, the failure will not be mitigated.
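
The three configurations above map onto flags of gcloud dataproc clusters create. The sketch below is one way to express that mapping in Python; the function name topology_flags and the mode labels are made up for this example, while --single-node, --num-masters, and --num-workers are the gcloud flags assumed to carry the configuration.

```python
def topology_flags(mode, workers=2):
    """Return the cluster-create flags for a given cluster configuration.

    mode is one of "single-node", "standard", or "high-availability"
    (labels chosen for this sketch, not official identifiers).
    """
    if mode == "single-node":
        # One node acts as both master and worker.
        return ["--single-node"]
    if mode == "standard":
        return ["--num-masters=1", f"--num-workers={workers}"]
    if mode == "high-availability":
        # Three masters enable HDFS HA and YARN HA.
        return ["--num-masters=3", f"--num-workers={workers}"]
    raise ValueError(f"unknown cluster mode: {mode!r}")


for mode in ("single-node", "standard", "high-availability"):
    print(mode, topology_flags(mode, workers=4))
```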

Dataproc Cluster Scaling

A Dataproc cluster can be scaled horizontally by increasing or decreasing the number of primary or secondary worker nodes.

A Dataproc cluster can be scaled at any time, even while jobs are running on it.
The machine type of an existing cluster cannot be changed (vertical scaling). To scale vertically, create a new cluster with a supported machine type, then migrate jobs to it.
A Dataproc cluster can be scaled:
- to increase the number of workers to make a job run faster
- to decrease the number of workers to save money
- to increase the number of nodes to expand available Hadoop Distributed Filesystem (HDFS) storage
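
A minimal sketch of horizontal scaling, assuming the gcloud dataproc clusters update command with its --num-workers and --num-secondary-workers flags. The helper name scale_command and the cluster name are hypothetical; the check for at least two primary workers reflects the standard-cluster requirement mentioned elsewhere in this post.

```python
def scale_command(cluster, region, primary=None, secondary=None):
    """Build the argv that resizes an existing cluster's worker pools."""
    if primary is None and secondary is None:
        raise ValueError("specify a primary or secondary worker count")
    if primary is not None and primary < 2:
        raise ValueError("a standard cluster needs at least two primary workers")
    if secondary is not None and secondary < 0:
        raise ValueError("secondary worker count cannot be negative")

    argv = ["gcloud", "dataproc", "clusters", "update", cluster, f"--region={region}"]
    if primary is not None:
        argv.append(f"--num-workers={primary}")
    if secondary is not None:
        argv.append(f"--num-secondary-workers={secondary}")
    return argv


# Hypothetical cluster: grow compute for a heavy job, keeping storage nodes fixed.
print(" ".join(scale_command("example-cluster", "us-central1", secondary=10)))
```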

Dataproc Cluster Autoscaling

Dataproc Autoscaling automates cluster resource management by adding and removing workers in response to workload demand.

An Autoscaling Policy is a reusable configuration that describes how clusters using the autoscaling policy should scale.
It defines scaling boundaries, frequency, and aggressiveness to provide fine-grained control over cluster resources throughout cluster lifetime.
Recent enhancements to Dataproc autoscaling have been shown to decrease cluster VM expenditures by up to 40% and reduce cumulative job runtime by 10%.

Autoscaling is recommended for:
- clusters that store data in external services, such as Cloud Storage
- clusters that process many jobs
- scaling up single-job clusters
Autoscaling is not recommended with/for:
- HDFS: Autoscaling is not intended for scaling on-cluster HDFS.
- YARN Node Labels: Autoscaling does not support YARN Node Labels. YARN incorrectly reports cluster metrics when node labels are used.
- Spark Structured Streaming: Autoscaling does not support Spark Structured Streaming.
- Idle Clusters: Autoscaling is not recommended for the purpose of scaling a cluster down to minimum size when the cluster is idle. Use Scheduled Stop or delete idle clusters instead.
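
To make "boundaries" and "aggressiveness" more concrete, here is a deliberately simplified Python model of a scale-up decision. It is not Dataproc's actual algorithm: it ignores scale-down, cooldown periods, and graceful decommissioning. The parameter names loosely mirror autoscaling policy fields (a scale-up factor plus minimum and maximum instance counts), and all numbers are invented for illustration.

```python
import math


def recommended_workers(current, pending_memory_gb, memory_per_worker_gb,
                        scale_up_factor, min_instances, max_instances):
    """Estimate a new worker count from pending YARN memory (simplified model).

    scale_up_factor is the fraction of pending memory to satisfy:
    1.0 adds enough workers to clear it all, 0.0 never scales up.
    """
    wanted = math.ceil(scale_up_factor * pending_memory_gb / memory_per_worker_gb)
    # Clamp to the policy boundaries.
    return max(min_instances, min(max_instances, current + wanted))


# 100 GB pending, 10 GB per worker, half of the demand satisfied per step.
print(recommended_workers(10, 100, 10, 0.5, 2, 20))  # 15
# A large backlog is capped by the upper boundary.
print(recommended_workers(10, 400, 10, 1.0, 2, 20))  # 20
```

A lower scale-up factor makes the cluster grow more cautiously over several evaluation periods, which is the "aggressiveness" knob in this toy model.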

Dataproc also supports Autotuning (Premium tier), which uses Gemini AI to automatically tune Spark properties, optimize memory allocation, and prevent OOM errors based on historical job patterns.

Dataproc Zero-Scale Clusters

Zero-scale clusters use only secondary workers (Spot VMs) that can be scaled down to zero when no processing is active.
Unlike standard clusters that require at least two primary workers, zero-scale clusters leave only the master node online to preserve metadata.

Ideal for development and testing environments where you want to eliminate idle compute costs.
Workers automatically scale up when jobs are submitted and scale back to zero when idle.

Dataproc Cluster Lifecycle Management

Scheduled Deletion — Automatically delete a cluster after a specified idle period, at a specified future time, or after a specified duration from creation.

Scheduled Stop — Automatically stop (not delete) a cluster after a specified idle period or at a future time. Preserves cluster configuration for easy restart.
Cluster Rotation — Recreate clusters at regular intervals for security patching and freshness.
Start/Stop — Manually stop and restart clusters to save costs without losing configuration.
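
Scheduled Deletion can be sketched as a small helper that turns the three triggers into cluster-create flags. This assumes the gcloud flags --max-idle, --max-age, and --expiration-time; the helper name and the example durations are hypothetical. The sketch also assumes that a fixed expiration time and a maximum age are alternatives, so it rejects passing both.

```python
def scheduled_deletion_flags(max_idle=None, max_age=None, expiration_time=None):
    """Return cluster-create flags for Scheduled Deletion.

    max_idle: idle period before deletion, e.g. "30m".
    max_age: lifetime measured from creation, e.g. "8h".
    expiration_time: absolute deletion time as a timestamp string.
    """
    if max_age is not None and expiration_time is not None:
        # Assumption in this sketch: the two are mutually exclusive.
        raise ValueError("use either max_age or expiration_time, not both")

    flags = []
    if max_idle is not None:
        flags.append(f"--max-idle={max_idle}")
    if max_age is not None:
        flags.append(f"--max-age={max_age}")
    if expiration_time is not None:
        flags.append(f"--expiration-time={expiration_time}")
    return flags


# Delete after 30 idle minutes, or 8 hours after creation, whichever comes first.
print(scheduled_deletion_flags(max_idle="30m", max_age="8h"))
```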

Dataproc Workers

Primary workers are standard Compute Engine VMs.
Secondary workers can be used to scale compute with the below characteristics:
- Processing only
  - Secondary workers do not store data.
  - Can only function as processing nodes.
  - Useful to scale compute without scaling storage.
- No secondary-worker-only clusters (except zero-scale clusters)
  - Standard clusters must have primary workers.
  - Dataproc adds two primary workers by default if none are specified.
- VM Types for Secondary Workers
  - Spot VMs (recommended) — Latest version of preemptible VMs with no maximum runtime limit. Can be reclaimed at any time.
  - Preemptible VMs (legacy) — Limited to 24-hour runtime. Spot VMs are recommended instead.
  - Non-preemptible VMs — Standard pricing, not subject to reclamation.
- Persistent disk size
  - Created, by default, with the smaller of 100GB or the primary worker boot disk size.
  - This disk space is used for local caching of data and is not available through HDFS.
- Asynchronous Creation
  - Dataproc manages secondary workers using Managed Instance Groups (MIGs), which create VMs asynchronously as soon as they can be provisioned.
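
The secondary worker characteristics above can be summarized in a short Python sketch. It assumes the gcloud flags --num-secondary-workers and --secondary-worker-type (with the values spot, preemptible, and non-preemptible); both function names are made up for this example. The disk-size helper simply encodes the "smaller of 100GB or the primary worker boot disk size" default described in the list.

```python
SECONDARY_VM_TYPES = {"spot", "preemptible", "non-preemptible"}


def secondary_worker_flags(count, vm_type="spot"):
    """Return cluster-create flags that add processing-only secondary workers."""
    if count < 0:
        raise ValueError("secondary worker count cannot be negative")
    if vm_type not in SECONDARY_VM_TYPES:
        raise ValueError(f"unsupported secondary worker type: {vm_type!r}")
    return [f"--num-secondary-workers={count}", f"--secondary-worker-type={vm_type}"]


def default_secondary_disk_gb(primary_boot_disk_gb):
    """Default secondary worker disk: the smaller of 100 GB or the primary boot disk."""
    return min(100, primary_boot_disk_gb)


print(secondary_worker_flags(8))
print(default_secondary_disk_gb(500))  # 100
print(default_secondary_disk_gb(50))   # 50
```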

Flexible VMs (GA 2026) — Define up to ten ranked machine types for worker nodes. Dataproc dynamically scans the entire region to fulfill capacity requests, improving resilience against localized shortages.

Dataproc Driver Node Groups

Driver node groups provide dedicated nodes for running Spark drivers, separating them from executors running on worker nodes.
Recommended for shared clusters running many concurrent jobs to prevent driver resource contention.

Increase master node resources before using driver node groups to avoid limitations.

Dataproc Initialization Actions

Dataproc supports initialization actions, which are executables or scripts that run on all nodes in the cluster immediately after the cluster is set up.
Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.

Dataproc Cloud Storage Connector

The Cloud Storage connector lets Dataproc use Google Cloud Storage as the persistent store instead of HDFS.
The connector separates storage from the cluster lifecycle, so the cluster can be shut down when it is not processing data.
Cloud Storage connector benefits:
- Direct data access — Store the data in Cloud Storage and access it directly. You do not need to transfer it into HDFS first.
- HDFS compatibility — Can easily access your data in Cloud Storage using the gs:// prefix instead of hdfs://.
- Interoperability — Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and Google services.
- Data accessibility — Data is accessible even after shutting down the cluster, unlike HDFS.
- High data availability — Data stored in Cloud Storage is highly available and globally replicated without a loss of performance.
- No storage management overhead — Unlike HDFS, Cloud Storage requires no routine maintenance, such as checking the file system, or upgrading or rolling back to a previous version of the file system.
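
When migrating jobs from HDFS to Cloud Storage, much of the change is swapping the path prefix. The helper below is a small sketch of that rewrite using only the Python standard library; the function name to_gcs_uri, the NameNode address, and the bucket name are all hypothetical, and it assumes the directory layout is copied into the bucket unchanged.

```python
from urllib.parse import urlsplit


def to_gcs_uri(hdfs_uri, bucket):
    """Rewrite an hdfs:// path to the equivalent gs:// path in a bucket.

    The NameNode host and port are dropped; only the file path is kept.
    """
    parts = urlsplit(hdfs_uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri!r}")
    return f"gs://{bucket}{parts.path}"


print(to_gcs_uri("hdfs://namenode:8020/data/events/2024/part-0.parquet", "example-bucket"))
# gs://example-bucket/data/events/2024/part-0.parquet
```

A Spark job that previously read the hdfs:// path can then be pointed at the returned gs:// path, since the connector exposes Cloud Storage through the same file system interface.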

Dataproc Open Table Format Support

Dataproc supports modern open table formats as optional cluster components:
- Apache Iceberg — Supports creating and querying Iceberg tables with metadata in Dataproc Metastore or BigLake Metastore.
- Delta Lake — Supports reading and writing Delta tables on Cloud Storage.
- Apache Hudi — Supports Hudi’s Copy-on-Write and Merge-on-Read table types.
Integration with Google Cloud Lakehouse enables read/write interoperability between Managed Service for Apache Spark and BigQuery using a unified metadata layer.

Dataproc on GKE

Dataproc on GKE allows running Spark and other data processing workloads on a Google Kubernetes Engine (GKE) cluster.

Provides Kubernetes-native resource management, scaling, and multi-tenancy for Spark workloads.
Useful for organizations that have standardized on Kubernetes and want unified infrastructure management.
Supports custom container images and executor pod scheduling on specific node pools.

Cloud Dataproc vs Dataflow

Refer to the blog post Cloud Dataproc vs Dataflow.

GCP Certification Exam Practice Questions

Questions are collected from the Internet, and the answers are marked as per my knowledge and understanding (which might differ from yours).

GCP services are updated every day, and both the answers and questions might soon be outdated, so research accordingly.

GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.

Open to further feedback, discussion, and correction.

Your company is forecasting a sharp increase in the number and size of Apache Spark and Hadoop jobs being run on your local data center. You want to utilize the cloud to help you scale this upcoming demand with the least amount of operations work and code change. Which product should you use?
1. Google Cloud Dataflow
2. Google Cloud Dataproc
3. Google Compute Engine
4. Google Kubernetes Engine
Your company is migrating to Google Cloud and is looking for an HBase alternative. The current solution uses a lot of custom code built on the observer coprocessor. You are required to find the best alternative for migration, using managed services if possible. Which product should you use?
1. Dataflow
2. HBase on Dataproc
3. Bigtable
4. BigQuery
A data engineering team runs hundreds of short-lived Spark ETL jobs daily. They want to minimize infrastructure management and only pay for actual job execution time. Which deployment option is most appropriate?
1. Dataproc cluster with autoscaling
2. Dataproc Serverless (Managed Service for Apache Spark — Serverless)
3. Dataproc on GKE
4. Dataproc with scheduled deletion

Your organization wants to accelerate existing Spark SQL workloads on Dataproc by up to 4.9x without modifying application code. Which feature should you enable?
1. Dataproc Autoscaling
2. Dataproc Enhanced Flexibility Mode
3. Lightning Engine
4. Dataproc Premium Machine Types
A team wants a persistent Dataproc development environment that automatically stops incurring worker costs when no jobs are running, while preserving cluster metadata. Which feature should they use?
1. Scheduled Deletion
2. Standard Autoscaling
3. Single Node Cluster
4. Zero-Scale Cluster
Your Dataproc cluster frequently fails to scale due to temporary capacity constraints in the selected zone. Which features would improve resource obtainability? (Choose TWO)
1. Flexible VMs with ranked machine type preferences
2. Auto Zone Placement
3. Increasing the autoscaling cooldown period
4. Using only preemptible VMs

Which of the following are NOT recommended use cases for Dataproc Autoscaling? (Choose TWO)
1. Clusters that store data in Cloud Storage
2. Clusters running Spark Structured Streaming
3. Scaling on-cluster HDFS storage
4. Clusters processing many batch jobs

Jayendra's Cloud Certification Blog

Google Cloud Dataproc – Managed Spark & Hadoop

🔄 SERVICE REBRANDED — Now “Managed Service for Apache Spark”