Google Cloud Dataflow vs Dataproc

August 12, 2021 ~ jayendrapatil ~ 1 Comment

Google Cloud Dataflow vs Dataproc

Cloud Dataproc

Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
Cloud Dataproc provides a Hadoop cluster, on GCP, and access to Hadoop-ecosystem tools (e.g. Apache Pig, Hive, and Spark); this has strong appeal if already familiar with Hadoop tools and have Hadoop jobs
Ideal for Lift and Shift migration of existing Hadoop environment
Requires manual provisioning of clusters
Consider Dataproc
- If you have a substantial investment in Apache Spark or Hadoop on-premise and considering moving to the cloud
- If you are looking at a Hybrid cloud and need portability across a private/multi-cloud environment
- If in the current environment Spark is the primary machine learning tool and platform
- In case the code depends on any custom packages along with distributed computing need

Cloud Dataflow

Google Cloud Dataflow is a fully managed, serverless service for unified stream and batch data processing requirements
When using it as a pre-processing pipeline for ML model that can be deployed in GCP AI Platform Training (earlier called Cloud ML Engine)
None of the above considerations made for Cloud Dataproc is relevant

Cloud Dataflow vs Dataproc Decision Tree

Dataflow vs Dataproc

Dataflow vs Dataproc Table

GCP Certification Exam Practice Questions

Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).

GCP services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.

GCP exam questions are not updated to keep up the pace with GCP updates, so even if the underlying feature has changed the question might not be updated

Open to further feedback, discussion and correction.

Your company is forecasting a sharp increase in the number and size of Apache Spark and Hadoop jobs being run on your local data center. You want to utilize the cloud to help you scale this upcoming demand with the least amount of operations work and code change. Which product should you use?
1. Google Cloud Dataflow
2. Google Cloud Dataproc
3. Google Compute Engine
4. Google Container Engine
A startup plans to use a data processing platform, which supports both batch and streaming applications. They would prefer to have a hands-off/serverless data processing platform to start with. Which GCP service is suited for them?
1. Dataproc
2. Dataprep
3. Dataflow
4. BigQuery

References

https://learning.oreilly.com

Google Cloud Dataflow

July 29, 2021 ~ Last updated on : August 12, 2021 ~ jayendrapatil

Google Cloud Dataflow

Google Cloud Dataflow is a managed, serverless service for unified stream and batch data processing requirements
Dataflow provides Horizontal autoscaling to automatically choose the appropriate number of worker instances required to run the job.
Dataflow is based on Apache Beam, an open-source, unified model for defining both batch and streaming-data parallel-processing pipelines.

Dataflow (Apache Beam) Programming Model

Data Processing Model

Pipelines

A pipeline encapsulates the entire series of computations involved in reading input data, transforming that data, and writing output data.
The input source and output sink can be the same or of different types, allowing data conversion from one format to another.
Apache Beam programs start by constructing a Pipeline object and then using that object as the basis for creating the pipeline’s datasets.
Each pipeline represents a single, repeatable job.

PCollection

A PCollection represents a potentially distributed, multi-element dataset that acts as the pipeline’s data.
Apache Beam transforms use PCollection objects as inputs and outputs for each step in your pipeline.

Transforms

A transform represents a processing operation that transforms data.
A transform takes one or more PCollections as input, performs a specified operation on each element in that collection, and produces one or more PCollections as output.
A transform can perform nearly any kind of processing operation like
- performing mathematical computations,
- data conversion from one format to another,
- grouping data together,
- reading and writing data,
- filtering data to output only the required elements, or
- combining data elements into single values.

ParDo

ParDo is the core parallel processing operation invoking a user-specified function on each of the elements of the input PCollection.
ParDo collects the zero or more output elements into an output PCollection.
ParDo processes elements independently and in parallel, if possible.

Pipeline I/O

Apache Beam I/O connectors help read data into the pipeline and write output data from the pipeline.
An I/O connector consists of a source and a sink.
All Apache Beam sources and sinks are transforms that let the pipeline work with data from several different data storage formats.

Aggregation

Aggregation is the process of computing some value from multiple input elements.
The primary computational pattern for aggregation is to
- group all elements with a common key and window.
- combine each group of elements using an associative and commutative operation.

User-defined functions (UDFs)

User-defined functions allow executing user-defined code as a way of configuring the transform.
For ParDo, user-defined code specifies the operation to apply to every element, and for Combine, it specifies how values should be combined.
A pipeline might contain UDFs written in a different language than the language of the runner.
A pipeline might also contain UDFs written in multiple languages.

Runner

Runners are the software that accepts a pipeline and executes it.

Event time

Time a data event occurs, determined by the timestamp on the data element itself.
This contrasts with the time the actual data element gets processed at any stage in the pipeline.

Windowing

Windowing enables grouping operations over unbounded collections by dividing the collection into windows of finite collections according to the timestamps of the individual elements.
A windowing function tells the runner how to assign elements to an initial window, and how to merge windows of grouped elements.

Tumbling Windows (Fixed Windows)

A tumbling window represents a consistent, disjoint time interval, for e.g. every 1 min, in the data stream.

An image that shows tumbling windows, 30 seconds in duration

Hopping Windows (Sliding Windows)

A hopping window represents a consistent time interval in the data stream for e.g., a hopping window can start every 30 seconds and capture 1 min of data and the window. The frequency with which hopping windows begin is called the period.
Hopping windows can overlap, whereas tumbling windows are disjoint.
Hopping windows are ideal to take running averages of data

An image that shows hopping windows with 1 minute window duration and 30 second window period

Session windows

A session window contains elements within a gap duration of another element for e.g., session windows can divide a data stream representing user mouse activity. This data stream might have long periods of idle time interspersed with many clicks. A session window can contain the data generated by the clicks.
The gap duration is an interval between new data in a data stream.
If data arrives after the gap duration, the data is assigned to a new window
Session windowing assigns different windows to each data key.
Tumbling and hopping windows contain all elements in the specified time interval, regardless of data keys.

An image that shows session windows with a minimum gap duration

Watermarks

A Watermark is a threshold that indicates when Dataflow expects all of the data in a window to have arrived.
Watermark is tracked and its a system’s notion of when all data in a certain window can be expected to have arrived in the pipeline
If new data arrives with a timestamp that’s in the window but older than the watermark, the data is considered late data.
Dataflow tracks watermarks because of the following:
- Data is not guaranteed to arrive in time order or at predictable intervals.
- Data events are not guaranteed to appear in pipelines in the same order that they were generated.

Trigger

Triggers determine when to emit aggregated results as data arrives.
For bounded data, results are emitted after all of the input has been processed.
For unbounded data, results are emitted when the watermark passes the end of the window, indicating that the system believes all input data for that window has been processed.

Dataflow Pipeline Operations

Cancelling a job
- causes the Dataflow service to stop the job immediately.
- might lose in-flight data
Draining a job
- supports graceful stop
- prevents data loss
- is useful to deploy incompatible changes
- allows the job to clear the existing queue before stoping
- supports only streaming jobs and does not support batch pipelines

Dataflow Security

Dataflow provides data-in-transit encryption.
- All communication with Google Cloud sources and sinks is encrypted and is carried over HTTPS.
- All inter-worker communication occurs over a private network and is subject to the project’s permissions and firewall rules.

Cloud Dataflow vs Dataproc

Refer blog post @ Cloud Dataflow vs Dataproc

GCP Certification Exam Practice Questions

Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).

GCP services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.

GCP exam questions are not updated to keep up the pace with GCP updates, so even if the underlying feature has changed the question might not be updated

Open to further feedback, discussion and correction.

A startup plans to use a data processing platform, which supports both batch and streaming applications. They would prefer to have a hands-off/serverless data processing platform to start with. Which GCP service is suited for them?
1. Dataproc
2. Dataprep
3. Dataflow
4. BigQuery

References

Google_Cloud_Dataflow

Google Cloud Data Analytics Services Cheat Sheet

July 26, 2021 ~ Last updated on : August 11, 2021 ~ jayendrapatil

Cloud Pub/Sub

Pub/Sub is a fully managed, asynchronous messaging service designed to be highly reliable and scalable with latencies on the order of 100 ms
Pub/Sub offers at-least-once message delivery and best-effort ordering to existing subscribers
Pub/Sub enables the creation of event producers and consumers, called publishers and subscribers.
Pub/Sub messages should be no greater than 10MB in size.
Messages can be received with pull or push delivery.
Messages published before a subscription is created will not be delivered to that subscription
Acknowledged messages are no longer available to subscribers and are deleted, by default. However, can be retained setting retention period.
Publishers can send messages with an ordering key and message ordering is set, Pub/Sub delivers the messages in order.
Pub/Sub support encryption at rest and encryption in transit.
Seek feature allows subscribers to alter the acknowledgment state of messages in bulk to replay or purge messages in bulk.

BigQuery

BigQuery is a fully managed, durable, petabyte scale, serverless, highly scalable, and cost-effective multi-cloud data warehouse.
supports a standard SQL dialect
automatically replicates data and keeps a seven-day history of changes, allowing easy restoration and comparison of data from different times
supports federated data and can process external data sources in GCS for Parquet and ORC open-source file formats, transactional databases (Bigtable, Cloud SQL), or spreadsheets in Drive without moving the data.
Data model consists of Datasets, tables
BigQuery performance can be improved using Partitioned tables and Clustered tables.
BigQuery encrypts all data at rest and supports encryption in transit.
BigQuery Data Transfer Service automates data movement into BigQuery on a scheduled, managed basis
Best Practices
- Control projection, avoid select *
- Estimate costs as queries are billed according to the number of bytes read and the cost can be estimated using --dry-run feature
- Use the maximum bytes billed setting to limit query costs.
- Use clustering and partitioning to reduce the amount of data scanned.
- Avoid repeatedly transforming data via SQL queries. Materialize the query results in stages.
- Use streaming inserts only if the data must be immediately available as streaming data is charged.
- Prune partitioned queries, use the _PARTITIONTIME pseudo column to filter the partitions.
- Denormalize data whenever possible using nested and repeated fields.
- Avoid external data sources, if query performance is a top priority
- Avoid using Javascript user-defined functions
- Optimize Join patterns. Start with the largest table.
- Use the expiration settings to remove unneeded tables and partitions
- Keep the data in BigQuery to take advantage of the long-term storage cost benefits rather than exporting to other storage options.

Bigtable

Bigtable is a fully managed, scalable, wide-column NoSQL database service with up to 99.999% availability.
ideal for applications that need very high throughput and scalability for key/value data, where each value is max. of 10 MB.
supports high read and write throughput at low latency and provides consistent sub-10ms latency – handles millions of requests/second
is a sparsely populated table that can scale to billions of rows and thousands of columns,
supports storage of terabytes or even petabytes of data
is not a relational database. It does not support SQL queries, joins, or multi-row transactions.
handles upgrades and restarts transparently, and it automatically maintains high data durability.
scales linearly in direct proportion to the number of nodes in the cluster
stores data in tables, which is composed of rows, each of which typically describes a single entity, and columns, which contain individual values for each row.
Each table has only one index, the row key. There are no secondary indices. Each row key must be unique.
Single-cluster Bigtable instances provide strong consistency.
Multi-cluster instances, by default, provide eventual consistency but can be configured to provide read-over-write consistency or strong consistency, depending on the workload and app profile settings

Cloud Dataflow

Cloud Dataflow is a managed, serverless service for unified stream and batch data processing requirements
provides Horizontal autoscaling to automatically choose the appropriate number of worker instances required to run the job.
is based on Apache Beam, an open-source, unified model for defining both batch and streaming-data parallel-processing pipelines.
supports Windowing which enables grouping operations over unbounded collections by dividing the collection into windows of finite collections according to the timestamps of the individual elements.
supports drain feature to deploy incompatible updates

Cloud Dataproc

Cloud Dataproc is a managed Spark and Hadoop service to take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
helps to create clusters quickly, manage them easily, and save money by turning clusters on and off as needed.
helps reduce time on time and money spent on administration and lets you focus on your jobs and your data.
has built-in integration with other GCP services, such as BigQuery, Cloud Storage, Bigtable, Cloud Logging, and Monitoring
support preemptible instances that have lower compute prices to reduce costs further.
also supports HBase, Flink, Hive WebHcat, Druid, Jupyter, Presto, Solr, Zepplin, Ranger, Zookeeper, and much more.
supports connectors for BigQuery, Bigtable, Cloud Storage
can be configured for High Availability by specifying the number of master instances in the cluster
All nodes in a High Availability cluster reside in the same zone. If there is a failure that impacts all nodes in a zone, the failure will not be mitigated.
supports cluster scaling by increasing or decreasing the number of primary or secondary worker nodes (horizontal scaling)
supports Autoscaling provides a mechanism for automating cluster resource management and enables cluster autoscaling.
supports initialization actions in executables or scripts that will run on all nodes in the cluster immediately after the cluster is set up

Cloud Dataprep

Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
is fully managed, serverless, and scales on-demand with no infrastructure to deploy or manage
provides easy data preparation with clicks and no code.
automatically identifies data anomalies & helps take fast corrective action
automatically detects schemas, data types, possible joins, and anomalies such as missing values, outliers, and duplicates
uses Dataflow or BigQuery under the hood, enabling unstructured or structured datasets processing of any size with the ease of clicks, not code

Datalab

Cloud Datalab is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models using familiar languages, such as Python and SQL, interactively.
runs on Google Compute Engine and connects to multiple cloud services easily so you can focus on your data science tasks.
is built on Jupyter (formerly IPython)
enables analysis of the data on Google BigQuery, Cloud Machine Learning Engine, Google Compute Engine, and Google Cloud Storage using Python, SQL, and JavaScript (for BigQuery user-defined functions).