Cloud Pub/Sub
- Pub/Sub is a fully managed, asynchronous messaging service designed to be highly reliable and scalable with latencies on the order of 100 ms
- Pub/Sub offers at-least-once message delivery and best-effort ordering to existing subscribers
- Pub/Sub enables the creation of event producers and consumers, called publishers and subscribers.
- Pub/Sub messages should be no greater than 10MB in size.
- Messages can be received with pull or push delivery.
- Messages published before a subscription is created will not be delivered to that subscription
- Acknowledged messages are no longer available to subscribers and are deleted by default. However, they can be retained by configuring a message retention duration.
- If publishers send messages with an ordering key and message ordering is enabled, Pub/Sub delivers the messages in order (see the publish/pull sketch after this list).
- Pub/Sub supports encryption at rest and encryption in transit.
- The Seek feature allows subscribers to alter the acknowledgment state of messages in bulk, so that messages can be replayed or purged in bulk.
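For illustration, here is a minimal sketch using the Python client library (the project, topic, and subscription names are placeholders, and the topic and subscription are assumed to already exist with message ordering enabled) that publishes with an ordering key and then pulls and acknowledges messages:

```python
# Minimal Pub/Sub sketch: publish with an ordering key, then pull and acknowledge.
# "my-project", "my-topic", and "my-sub" are placeholder resource names.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic_path = publisher.topic_path("my-project", "my-topic")

# Messages with the same ordering key are delivered in publish order
# (message ordering must also be enabled on the subscription).
for i in range(3):
    publisher.publish(topic_path, data=f"event-{i}".encode(), ordering_key="customer-42")

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-sub")

# Pull delivery: the subscriber asks for messages and acknowledges them explicitly.
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 10})
ack_ids = [msg.ack_id for msg in response.received_messages]
if ack_ids:
    subscriber.acknowledge(request={"subscription": subscription_path, "ack_ids": ack_ids})
```

Once acknowledged, these messages are no longer delivered to the subscription unless a retention duration is configured and Seek is used to replay them.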
BigQuery
- BigQuery is a fully managed, durable, petabyte scale, serverless, highly scalable, and cost-effective multi-cloud data warehouse.
- supports a standard SQL dialect
- automatically replicates data and keeps a seven-day history of changes, allowing easy restoration and comparison of data from different times
- supports federated queries and can process external data sources in GCS (Parquet and ORC open-source file formats), Bigtable, Cloud SQL, or spreadsheets in Drive without moving the data.
- The data model consists of datasets, which act as containers for tables and views.
- BigQuery performance can be improved using Partitioned tables and Clustered tables.
- BigQuery encrypts all data at rest and supports encryption in transit.
- BigQuery Data Transfer Service automates data movement into BigQuery on a scheduled, managed basis
- Best Practices
- Control projection, avoid `SELECT *`
- Estimate costs, as queries are billed according to the number of bytes read; the cost can be estimated using the `--dry-run` flag (a dry-run sketch in Python follows this list).
- Use the maximum bytes billed setting to limit query costs.
- Use clustering and partitioning to reduce the amount of data scanned.
- Avoid repeatedly transforming data via SQL queries. Materialize the query results in stages.
- Use streaming inserts only if the data must be immediately available as streaming data is charged.
- Prune partitioned queries: use the `_PARTITIONTIME` pseudo column to filter the partitions.
- Denormalize data whenever possible using nested and repeated fields.
- Avoid external data sources, if query performance is a top priority
- Avoid using JavaScript user-defined functions
- Optimize Join patterns. Start with the largest table.
- Use the expiration settings to remove unneeded tables and partitions
- Keep the data in BigQuery to take advantage of the long-term storage cost benefits rather than exporting to other storage options.
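To make the cost-control practices concrete, here is a minimal sketch with the BigQuery Python client (the project, dataset, and table names are hypothetical) that dry-runs a query to estimate the bytes it would scan and then runs it with a cap on bytes billed:

```python
# BigQuery cost-control sketch: dry-run a query to estimate bytes scanned,
# then run it with a hard cap on bytes billed. Table/dataset names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # uses the default project from the environment

sql = """
    SELECT event_id, status            -- explicit projection instead of SELECT *
    FROM `my-project.my_dataset.events`
    WHERE _PARTITIONTIME >= TIMESTAMP('2024-01-01')  -- prune partitions
"""

# Dry run: validates the query and reports the bytes it would read, without billing.
dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_job = client.query(sql, job_config=dry_cfg)
print(f"Estimated bytes processed: {dry_job.total_bytes_processed}")

# Real run with a cap: the job is rejected if it would bill more than ~1 GB.
run_cfg = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
rows = client.query(sql, job_config=run_cfg).result()
```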
Bigtable
- Bigtable is a fully managed, scalable, wide-column NoSQL database service with up to 99.999% availability.
- ideal for applications that need very high throughput and scalability for key/value data, where each value is at most 10 MB.
- supports high read and write throughput at low latency and provides consistent sub-10ms latency – handles millions of requests/second
- is a sparsely populated table that can scale to billions of rows and thousands of columns
- supports storage of terabytes or even petabytes of data
- is not a relational database. It does not support SQL queries, joins, or multi-row transactions.
- handles upgrades and restarts transparently, and it automatically maintains high data durability.
- scales linearly in direct proportion to the number of nodes in the cluster
- stores data in tables, each of which is composed of rows (each typically describing a single entity) and columns (which contain individual values for each row).
- Each table has only one index, the row key; there are no secondary indexes, and each row key must be unique (see the write/read sketch after this list).
- Single-cluster Bigtable instances provide strong consistency.
- Multi-cluster instances, by default, provide eventual consistency but can be configured to provide read-over-write consistency or strong consistency, depending on the workload and app profile settings
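Since every lookup goes through the single row-key index, a minimal write/read sketch with the Python client looks like the following (the instance, table, and column-family names are placeholders and assumed to already exist):

```python
# Bigtable sketch: write one row keyed by a composite row key, then read it back.
# "my-instance", "my-table", and the column family "stats" are placeholders;
# there are no secondary indexes, so the row key is the only lookup path.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

row_key = b"device-42#20240101"          # composite key: entity id + date
row = table.direct_row(row_key)
row.set_cell("stats", b"temperature", b"21.5")
row.commit()                              # single-row writes are atomic

read = table.read_row(row_key)
cell = read.cells["stats"][b"temperature"][0]
print(cell.value)                         # b'21.5'
```

Designing the row key around the dominant query pattern (here, entity id plus date) is what keeps reads at consistent sub-10 ms latency, since range scans follow the key's lexicographic order.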
Cloud Dataflow
- Cloud Dataflow is a managed, serverless service for unified stream and batch data processing requirements
- provides Horizontal autoscaling to automatically choose the appropriate number of worker instances required to run the job.
- is based on Apache Beam, an open-source, unified model for defining both batch and streaming-data parallel-processing pipelines.
- supports windowing, which enables grouping operations over unbounded collections by dividing the collection into finite windows according to the timestamps of the individual elements (a minimal windowing sketch follows this list).
- supports the Drain feature for stopping a streaming job gracefully; to deploy incompatible updates, drain the existing job and submit the updated pipeline as a new job.
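Below is a minimal Apache Beam windowing sketch that runs on the local DirectRunner; the event data and timestamps are made up, and it simply assigns elements to 60-second fixed windows and counts them per key:

```python
# Minimal Apache Beam windowing sketch (DirectRunner, made-up event data):
# timestamped elements are grouped into 60-second fixed windows and counted.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

# (element, event-time in seconds) pairs standing in for a real event stream
events = [("user-a", 10), ("user-a", 30), ("user-b", 70), ("user-a", 95)]

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(events)
        # Attach the event-time timestamp Beam uses for window assignment.
        | "Timestamp" >> beam.Map(lambda kv: TimestampedValue(kv[0], kv[1]))
        | "Window" >> beam.WindowInto(FixedWindows(60))  # 60-second windows
        | "CountPerKey" >> beam.combiners.Count.PerElement()
        | "Print" >> beam.Map(print)
    )
```

The same pipeline code can read from an unbounded source such as Pub/Sub and run on the Dataflow runner; windowing is what makes grouping and counting well defined over that unbounded input.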
Cloud Dataproc
- Cloud Dataproc is a managed Spark and Hadoop service to take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
- helps to create clusters quickly, manage them easily, and save money by turning clusters on and off as needed.
- helps reduce the time and money spent on administration and lets you focus on your jobs and your data.
- has built-in integration with other GCP services, such as BigQuery, Cloud Storage, Bigtable, Cloud Logging, and Monitoring
- supports preemptible instances, which have lower compute prices, to reduce costs further.
- also supports HBase, Flink, Hive WebHCat, Druid, Jupyter, Presto, Solr, Zeppelin, Ranger, ZooKeeper, and much more.
- supports connectors for BigQuery, Bigtable, Cloud Storage
- can be configured for High Availability by specifying the number of master instances in the cluster
- All nodes in a High Availability cluster reside in the same zone. If there is a failure that impacts all nodes in a zone, the failure will not be mitigated.
- supports cluster scaling by increasing or decreasing the number of primary or secondary worker nodes (horizontal scaling)
- supports autoscaling, which provides a mechanism for automating cluster resource management by adding or removing workers according to an autoscaling policy.
- supports initialization actions: executables or scripts that run on all nodes in the cluster immediately after the cluster is set up (see the cluster-creation sketch after this list).
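As a rough sketch (resource names are placeholders, and field values follow the public Dataproc API), the high-availability, preemptible-worker, and initialization-action options can be combined when creating a cluster with the Python client:

```python
# Dataproc sketch (placeholder resource names): create an HA cluster with 3 masters,
# preemptible secondary workers, and an initialization action script.
from google.cloud import dataproc_v1

project_id = "my-project"
region = "us-central1"

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "analytics-cluster",
    "config": {
        # 3 master instances = High Availability mode (all nodes stay in one zone).
        "master_config": {"num_instances": 3, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Cheaper preemptible secondary workers to cut compute costs.
        "secondary_worker_config": {"num_instances": 2, "preemptibility": "PREEMPTIBLE"},
        # Script that runs on every node right after the cluster is set up.
        "initialization_actions": [
            {"executable_file": "gs://my-bucket/scripts/install-deps.sh"}
        ],
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is ready
```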
Cloud Dataprep
- Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
- is fully managed, serverless, and scales on-demand with no infrastructure to deploy or manage
- provides easy data preparation with clicks and no code.
- automatically identifies data anomalies & helps take fast corrective action
- automatically detects schemas, data types, possible joins, and anomalies such as missing values, outliers, and duplicates
- uses Dataflow or BigQuery under the hood, enabling unstructured or structured datasets processing of any size with the ease of clicks, not code
Datalab
- Cloud Datalab is a powerful interactive tool created to explore, analyze, transform, and visualize data and build machine learning models using familiar languages such as Python and SQL.
- runs on Google Compute Engine and connects to multiple cloud services easily so you can focus on your data science tasks.
- is built on Jupyter (formerly IPython)
- enables analysis of the data on Google BigQuery, Cloud Machine Learning Engine, Google Compute Engine, and Google Cloud Storage using Python, SQL, and JavaScript (for BigQuery user-defined functions).