Google Cloud Dataflow vs Dataproc

August 12, 2021 ~ jayendrapatil

Cloud Dataproc

Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
Cloud Dataproc provides a Hadoop cluster on GCP and access to Hadoop-ecosystem tools (e.g. Apache Pig, Hive, and Spark); this has strong appeal if you are already familiar with Hadoop tools and have existing Hadoop jobs
Ideal for Lift and Shift migration of existing Hadoop environment
Requires manual provisioning of clusters
Consider Dataproc
- If you have a substantial investment in Apache Spark or Hadoop on-premises and are considering moving to the cloud
- If you are looking at a Hybrid cloud and need portability across a private/multi-cloud environment
- If Spark is the primary machine learning tool and platform in the current environment
- If the code depends on custom packages and also needs distributed computing

Cloud Dataflow

Google Cloud Dataflow is a fully managed, serverless service for unified stream and batch data processing requirements
Consider Dataflow
- When using it as a preprocessing pipeline for an ML model that can be deployed in GCP AI Platform Training (earlier called Cloud ML Engine)
- If none of the above considerations made for Cloud Dataproc are relevant
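
The considerations listed for Dataproc and Dataflow can be sketched as a small decision helper. This is only an illustration of the checklist in this post, not an official selection tool; the `Workload` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a workload; field names are illustrative."""
    has_spark_or_hadoop_investment: bool = False
    needs_hybrid_or_multicloud_portability: bool = False
    spark_is_primary_ml_platform: bool = False
    depends_on_custom_packages: bool = False


def suggest_service(workload: Workload) -> str:
    """Return "Dataproc" if any Dataproc consideration applies, else "Dataflow"."""
    dataproc_signals = (
        workload.has_spark_or_hadoop_investment,
        workload.needs_hybrid_or_multicloud_portability,
        workload.spark_is_primary_ml_platform,
        workload.depends_on_custom_packages,
    )
    return "Dataproc" if any(dataproc_signals) else "Dataflow"


# An existing Spark/Hadoop estate points to Dataproc.
print(suggest_service(Workload(has_spark_or_hadoop_investment=True)))
# With none of the Dataproc considerations, the checklist falls through to Dataflow.
print(suggest_service(Workload()))
```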

Cloud Dataflow vs Dataproc Decision Tree

Dataflow vs Dataproc

Dataflow vs Dataproc Table

GCP Certification Exam Practice Questions

Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).

GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.

GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.

Open to further feedback, discussion and correction.

Your company is forecasting a sharp increase in the number and size of Apache Spark and Hadoop jobs being run on your local data center. You want to utilize the cloud to help you scale this upcoming demand with the least amount of operations work and code change. Which product should you use?
1. Google Cloud Dataflow
2. Google Cloud Dataproc
3. Google Compute Engine
4. Google Container Engine
A startup plans to use a data processing platform, which supports both batch and streaming applications. They would prefer to have a hands-off/serverless data processing platform to start with. Which GCP service is suited for them?
1. Dataproc
2. Dataprep
3. Dataflow
4. BigQuery

Google Cloud Dataproc

August 1, 2021 ~ Last updated on : August 12, 2021 ~ jayendrapatil

Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
Dataproc automation helps to create clusters quickly, manage them easily, and save money by turning clusters on and off as needed.
Dataproc helps reduce the time and money spent on administration and lets you focus on your jobs and your data.
Dataproc clusters are quick to start, scale, and shut down, with each of these operations taking 90 seconds or less, on average
Dataproc has built-in integration with other GCP services, such as BigQuery, Cloud Storage, Bigtable, Cloud Logging, and Monitoring
Dataproc clusters support preemptible instances that have lower compute prices to reduce costs further.
Dataproc supports connectors for BigQuery, Bigtable, Cloud Storage
Dataproc also supports Anaconda, HBase, Flink, Hive WebHCat, Druid, Jupyter, Presto, Solr, Zeppelin, Ranger, Zookeeper, and much more.

Dataproc Cluster High Availability

A Dataproc cluster can be configured for High Availability by specifying the number of master instances in the cluster
Dataproc supports two master configurations:
- Single master – 1 master – N Workers (default, non HA)
  - a Single Node cluster is a special case with 0 workers, where one node acts as both master and worker
  - if the master fails, the in-flight jobs will necessarily fail and need to be retried, and HDFS will be inaccessible until the single NameNode fully recovers on reboot.
- High Availability Cluster – 3 masters – N Workers (Hadoop HA)
  - HDFS High Availability and YARN High Availability are configured to allow uninterrupted YARN and HDFS operations despite any single-node failures/reboots.
All nodes in a High Availability cluster reside in the same zone. If there is a failure that impacts all nodes in a zone, the failure will not be mitigated.
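
The master-count rule above can be sketched as a small validation helper. The keys `master_config`, `worker_config`, and `num_instances` are meant to mirror the shape of a Dataproc cluster config as I understand it, but treat them as assumptions; the dict below is built locally and is not sent to any API.

```python
def build_cluster_config(num_masters: int, num_workers: int) -> dict:
    """Build an illustrative cluster config; only 1 or 3 masters are accepted."""
    if num_masters not in (1, 3):
        raise ValueError("use 1 master (non-HA) or 3 masters (HA)")
    if num_workers < 0:
        raise ValueError("num_workers must not be negative")
    return {
        "master_config": {"num_instances": num_masters},
        "worker_config": {"num_instances": num_workers},
    }


def is_high_availability(config: dict) -> bool:
    """A cluster counts as HA here when it has three masters."""
    return config["master_config"]["num_instances"] == 3


ha_config = build_cluster_config(num_masters=3, num_workers=4)
print(is_high_availability(ha_config))
```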

Dataproc Cluster Scaling

A Dataproc cluster can be scaled horizontally by increasing or decreasing the number of primary or secondary worker nodes
A Dataproc cluster can be scaled at any time, even when jobs are running on the cluster.
The machine type of an existing cluster cannot be changed (no vertical scaling). To vertically scale, create a cluster using a supported machine type, then migrate jobs to the new cluster.
A Dataproc cluster can be scaled
- to increase the number of workers to make a job run faster
- to decrease the number of workers to save money
- to increase the number of nodes to expand available Hadoop Distributed Filesystem (HDFS) storage

Dataproc Cluster Autoscaling

Dataproc Autoscaling provides a mechanism for automating cluster resource management by adding and removing worker VMs automatically.
An Autoscaling Policy is a reusable configuration that describes how clusters using the autoscaling policy should scale.
It defines scaling boundaries, frequency, and aggressiveness to provide fine-grained control over cluster resources throughout cluster lifetime.
Autoscaling is recommended
- on clusters that store data in external services, such as Cloud Storage
- on clusters that process many jobs
- to scale up single-job clusters
Autoscaling is not recommended with/for:
- HDFS: Autoscaling is not intended for scaling on-cluster HDFS
- YARN Node Labels: Autoscaling does not support YARN Node Labels. YARN incorrectly reports cluster metrics when node labels are used.
- Spark Structured Streaming: Autoscaling does not support Spark Structured Streaming
- Idle Clusters: Autoscaling is not recommended for the purpose of scaling a cluster down to minimum size when the cluster is idle. It is better to delete an Idle cluster.
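
To make "scaling boundaries" and "aggressiveness" a bit more concrete, here is a deliberately simplified sketch of how a scale-up decision might combine them. This is not Dataproc's actual algorithm; the function and all parameter names are hypothetical, and the only idea taken from this post is that a policy bounds the worker count and controls how aggressively the cluster reacts.

```python
import math


def target_workers(current_workers: int,
                   pending_memory_mb: int,
                   memory_per_worker_mb: int,
                   scale_up_factor: float,
                   min_workers: int,
                   max_workers: int) -> int:
    """Simplified scale-up: add a fraction of the workers needed for pending work."""
    # Workers needed to absorb all pending memory, scaled by the aggressiveness factor.
    wanted = math.ceil(scale_up_factor * pending_memory_mb / memory_per_worker_mb)
    # Clamp the result to the policy boundaries.
    return max(min_workers, min(max_workers, current_workers + wanted))


# 40 GB pending, 8 GB per worker, factor 0.5: 2.5 workers wanted, rounded up to 3.
print(target_workers(2, 40960, 8192, 0.5, min_workers=2, max_workers=10))
```

A lower factor makes the cluster grow more cautiously, while the boundaries keep it from growing or shrinking past what the policy allows.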

Dataproc Workers

Primary workers are standard Compute Engine VMs
Secondary workers can be used to scale with the below limitations
- Processing only
  - Secondary workers do not store data.
  - can only function as processing nodes
  - useful to scale compute without scaling storage.
- No secondary-worker-only clusters
  - Cluster must have primary workers
  - Dataproc adds two primary workers to the cluster, by default, if no primary workers are specified.
- Machine type
  - Secondary workers use the machine type of the cluster’s primary workers.
- Persistent disk size
  - Secondary workers are created, by default, with the smaller of 100GB or the primary worker boot disk size.
  - This disk space is used for local caching of data and is not available through HDFS.
- Asynchronous Creation
  - Dataproc manages secondary workers using Managed Instance Groups (MIGs), which create VMs asynchronously as soon as they can be provisioned
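
Two of the defaults above are simple enough to write down as code. This is a small sketch of the rules as stated in this list (disk size is the smaller of 100GB or the primary boot disk, and two primary workers are added when none are specified); the function names are hypothetical.

```python
from typing import Optional

SECONDARY_DISK_CAP_GB = 100   # cap taken from the rule stated above
DEFAULT_PRIMARY_WORKERS = 2   # added when no primary workers are specified


def default_secondary_disk_gb(primary_boot_disk_gb: int) -> int:
    """Default secondary worker disk: the smaller of 100GB or the primary boot disk."""
    return min(SECONDARY_DISK_CAP_GB, primary_boot_disk_gb)


def effective_primary_workers(requested: Optional[int]) -> int:
    """A cluster with secondary workers still gets primary workers."""
    return DEFAULT_PRIMARY_WORKERS if requested is None else requested


print(default_secondary_disk_gb(500))
print(default_secondary_disk_gb(50))
print(effective_primary_workers(None))
```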

Dataproc Initialization Actions

Dataproc supports initialization actions, which are executables or scripts that run on all nodes in the cluster immediately after the cluster is set up
Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.

Dataproc Cloud Storage Connector

Dataproc Cloud Storage connector helps Dataproc use Google Cloud Storage as the persistent store instead of HDFS.
Cloud Storage connector helps separate the storage from the cluster lifecycle and allows the cluster to be shut down when not processing data
Cloud Storage connector benefits
- Direct data access – Store the data in Cloud Storage and access it directly. You do not need to transfer it into HDFS first.
- HDFS compatibility – can easily access your data in Cloud Storage using the gs:// prefix instead of hdfs://
- Interoperability – Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and Google services.
- Data accessibility – data is accessible even after shutting down the cluster, unlike HDFS.
- High data availability – Data stored in Cloud Storage is highly available and globally replicated without a loss of performance.
- No storage management overhead – Unlike HDFS, Cloud Storage requires no routine maintenance, such as checking the file system, or upgrading or rolling back to a previous version of the file system.
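
As a rough illustration of the HDFS compatibility point, moving a job from HDFS to Cloud Storage is often mostly a matter of changing the path prefix. The helper below is a hypothetical sketch using only the Python standard library; the bucket name and paths are made up, and it assumes the data has been copied to the bucket under the same path.

```python
from urllib.parse import urlparse


def hdfs_to_gcs(hdfs_uri: str, bucket: str) -> str:
    """Rewrite an hdfs:// URI to a gs:// URI in the given bucket, keeping the path."""
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri!r}")
    # The NameNode host and port (if any) are dropped; only the path is kept.
    return f"gs://{bucket}{parsed.path}"


print(hdfs_to_gcs("hdfs://namenode:8020/data/events/2021", "my-bucket"))
```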

Cloud Dataproc vs Dataflow

Refer to the blog post Cloud Dataproc vs Dataflow

GCP Certification Exam Practice Questions

Your company is migrating to Google Cloud and is looking for an HBase alternative. The current solution uses a lot of custom code using the observer coprocessor. You are required to find the best alternative for migration while using managed services, if possible.
1. Dataflow
2. HBase on Dataproc
3. Bigtable
4. BigQuery

Google Cloud Data Analytics Services Cheat Sheet

July 26, 2021 ~ Last updated on : August 11, 2021 ~ jayendrapatil

Cloud Pub/Sub

Pub/Sub is a fully managed, asynchronous messaging service designed to be highly reliable and scalable with latencies on the order of 100 ms
Pub/Sub offers at-least-once message delivery and best-effort ordering to existing subscribers
Pub/Sub enables the creation of event producers and consumers, called publishers and subscribers.
Pub/Sub messages should be no greater than 10MB in size.
Messages can be received with pull or push delivery.
Messages published before a subscription is created will not be delivered to that subscription
Acknowledged messages are no longer available to subscribers and are deleted, by default. However, they can be retained by configuring the subscription to retain acknowledged messages for a retention period.
If publishers send messages with an ordering key and message ordering is enabled, Pub/Sub delivers the messages in order.
Pub/Sub supports encryption at rest and encryption in transit.
Seek feature allows subscribers to alter the acknowledgment state of messages in bulk to replay or purge messages in bulk.
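
Because delivery is at-least-once, a subscriber may see the same message more than once and should process idempotently. The sketch below shows one common way to do that by remembering message IDs. It is an in-memory illustration only, does not use the Pub/Sub client library, and the class and method names are hypothetical.

```python
class IdempotentSubscriber:
    """Processes each message ID once, even if it is delivered repeatedly."""

    def __init__(self):
        self.seen_ids = set()
        self.processed = []

    def handle(self, message_id: str, data: str) -> bool:
        """Return True if the message was processed, False if it was a duplicate.

        In both cases the caller would acknowledge the message afterwards.
        """
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.processed.append(data)
        return True


subscriber = IdempotentSubscriber()
# Message "1" is redelivered, as can happen with at-least-once delivery.
for message_id, data in [("1", "order created"), ("2", "order paid"), ("1", "order created")]:
    subscriber.handle(message_id, data)
print(subscriber.processed)
```

A real subscriber would keep the seen IDs in durable storage with an expiry rather than in process memory.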

BigQuery

BigQuery is a fully managed, durable, petabyte scale, serverless, highly scalable, and cost-effective multi-cloud data warehouse.
supports a standard SQL dialect
automatically replicates data and keeps a seven-day history of changes, allowing easy restoration and comparison of data from different times
supports federated data and can process external data sources in GCS for Parquet and ORC open-source file formats, transactional databases (Bigtable, Cloud SQL), or spreadsheets in Drive without moving the data.
Data model consists of Datasets and tables
BigQuery performance can be improved using Partitioned tables and Clustered tables.
BigQuery encrypts all data at rest and supports encryption in transit.
BigQuery Data Transfer Service automates data movement into BigQuery on a scheduled, managed basis
Best Practices
- Control projection, avoid select *
- Estimate costs, as queries are billed according to the number of bytes read; the cost can be estimated with a dry run (the --dry_run flag of the bq tool)
- Use the maximum bytes billed setting to limit query costs.
- Use clustering and partitioning to reduce the amount of data scanned.
- Avoid repeatedly transforming data via SQL queries. Materialize the query results in stages.
- Use streaming inserts only if the data must be immediately available as streaming data is charged.
- Prune partitioned queries, use the _PARTITIONTIME pseudo column to filter the partitions.
- Denormalize data whenever possible using nested and repeated fields.
- Avoid external data sources, if query performance is a top priority
- Avoid using JavaScript user-defined functions
- Optimize Join patterns. Start with the largest table.
- Use the expiration settings to remove unneeded tables and partitions
- Keep the data in BigQuery to take advantage of the long-term storage cost benefits rather than exporting to other storage options.
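
The cost-related practices above boil down to simple arithmetic on the number of bytes a query would read, which is what a dry run reports. The sketch below assumes on-demand billing per TiB read and takes the price as a parameter; the price used in the example is a made-up placeholder, so check current pricing. The function names are hypothetical and nothing here calls BigQuery.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def estimate_query_cost(bytes_processed: int, price_per_tib: float) -> float:
    """Estimate on-demand cost from the bytes a dry run says the query will read."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must not be negative")
    return bytes_processed / TIB * price_per_tib


def exceeds_max_bytes_billed(bytes_processed: int, maximum_bytes_billed: int) -> bool:
    """Mimic a maximum-bytes-billed guard: True means the query should be rejected."""
    return bytes_processed > maximum_bytes_billed


# Half a TiB at a placeholder price of 6.0 per TiB.
print(estimate_query_cost(TIB // 2, price_per_tib=6.0))
print(exceeds_max_bytes_billed(TIB // 2, maximum_bytes_billed=TIB // 4))
```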

Bigtable

Bigtable is a fully managed, scalable, wide-column NoSQL database service with up to 99.999% availability.
ideal for applications that need very high throughput and scalability for key/value data, where each value is at most 10 MB.
supports high read and write throughput at low latency and provides consistent sub-10ms latency – handles millions of requests/second
is a sparsely populated table that can scale to billions of rows and thousands of columns,
supports storage of terabytes or even petabytes of data
is not a relational database. It does not support SQL queries, joins, or multi-row transactions.
handles upgrades and restarts transparently, and it automatically maintains high data durability.
scales linearly in direct proportion to the number of nodes in the cluster
stores data in tables, each of which is composed of rows, each of which typically describes a single entity, and columns, which contain individual values for each row.
Each table has only one index, the row key. There are no secondary indices. Each row key must be unique.
Single-cluster Bigtable instances provide strong consistency.
Multi-cluster instances, by default, provide eventual consistency but can be configured to provide read-your-writes consistency or strong consistency, depending on the workload and app profile settings
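
Since the row key is the only index, row key design decides which reads are cheap. Bigtable keeps rows sorted by row key, so rows sharing a key prefix sit next to each other and can be read as one range. The toy table below imitates that with a sorted list; it is an in-memory sketch, not the Bigtable client, and the class name and key format are hypothetical.

```python
import bisect


class TinySortedTable:
    """Toy table with a single index: the lexicographically sorted row key."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> row data

    def put(self, row_key: str, row: dict) -> None:
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = row

    def scan_prefix(self, prefix: str) -> list:
        """Return (key, row) pairs whose key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # sorted keys: the matching range has ended
            result.append((key, self._rows[key]))
        return result


table = TinySortedTable()
table.put("device42#2021-08-01", {"temp": 21})
table.put("device07#2021-08-01", {"temp": 19})
table.put("device42#2021-08-02", {"temp": 23})
print(table.scan_prefix("device42#"))
```

Querying by anything other than the key or a key prefix would require scanning every row, which is why there is no substitute for a well-chosen row key.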

Cloud Dataflow

Cloud Dataflow is a managed, serverless service for unified stream and batch data processing requirements
provides Horizontal autoscaling to automatically choose the appropriate number of worker instances required to run the job.
is based on Apache Beam, an open-source, unified model for defining both batch and streaming-data parallel-processing pipelines.
supports Windowing which enables grouping operations over unbounded collections by dividing the collection into windows of finite collections according to the timestamps of the individual elements.
supports the Drain option, which stops a streaming job after processing in-flight data so that an incompatible update can be deployed as a new job
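
The idea behind fixed windowing can be shown in a few lines of plain Python: each element is assigned to a window based on its own timestamp, which turns an unbounded stream into finite groups. This is a sketch of the concept only and does not use the Apache Beam API; the function names and sample events are hypothetical.

```python
from collections import defaultdict


def fixed_window_start(timestamp: int, size: int) -> int:
    """Start of the fixed window of the given size that contains the timestamp."""
    return timestamp - (timestamp % size)


def count_per_window(events, size: int) -> dict:
    """Count events per fixed window; events are (timestamp_seconds, payload) pairs."""
    counts = defaultdict(int)
    for timestamp, _payload in events:
        counts[fixed_window_start(timestamp, size)] += 1
    return dict(counts)


events = [(1, "a"), (59, "b"), (60, "c"), (125, "d")]
# With 60-second windows: [0, 60) holds two events, [60, 120) and [120, 180) hold one each.
print(count_per_window(events, size=60))
```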

Cloud Dataproc

Cloud Dataproc is a managed Spark and Hadoop service to take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
helps to create clusters quickly, manage them easily, and save money by turning clusters on and off as needed.
helps reduce the time and money spent on administration and lets you focus on your jobs and your data.
has built-in integration with other GCP services, such as BigQuery, Cloud Storage, Bigtable, Cloud Logging, and Monitoring
supports preemptible instances that have lower compute prices to reduce costs further.
also supports HBase, Flink, Hive WebHCat, Druid, Jupyter, Presto, Solr, Zeppelin, Ranger, Zookeeper, and much more.
supports connectors for BigQuery, Bigtable, Cloud Storage
can be configured for High Availability by specifying the number of master instances in the cluster
All nodes in a High Availability cluster reside in the same zone. If there is a failure that impacts all nodes in a zone, the failure will not be mitigated.
supports cluster scaling by increasing or decreasing the number of primary or secondary worker nodes (horizontal scaling)
supports Autoscaling, which provides a mechanism for automating cluster resource management.
supports initialization actions in executables or scripts that will run on all nodes in the cluster immediately after the cluster is set up

Cloud Dataprep

Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
is fully managed, serverless, and scales on-demand with no infrastructure to deploy or manage
provides easy data preparation with clicks and no code.
automatically identifies data anomalies & helps take fast corrective action
automatically detects schemas, data types, possible joins, and anomalies such as missing values, outliers, and duplicates
uses Dataflow or BigQuery under the hood, enabling processing of structured or unstructured datasets of any size with the ease of clicks, not code

Datalab

Cloud Datalab is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models using familiar languages, such as Python and SQL, interactively.
runs on Google Compute Engine and connects to multiple cloud services easily so you can focus on your data science tasks.
is built on Jupyter (formerly IPython)
enables analysis of the data on Google BigQuery, Cloud Machine Learning Engine, Google Compute Engine, and Google Cloud Storage using Python, SQL, and JavaScript (for BigQuery user-defined functions).