Cloud Pub/Sub
- Pub/Sub is a fully managed, asynchronous messaging service designed to be highly reliable and scalable with latencies on the order of 100 ms
- Pub/Sub offers at-least-once message delivery and best-effort ordering to existing subscribers
- Pub/Sub enables the creation of event producers and consumers, called publishers and subscribers.
- Pub/Sub messages should be no greater than 10MB in size.
- Messages can be received with pull or push delivery.
- Messages published before a subscription is created will not be delivered to that subscription
- Acknowledged messages are no longer available to subscribers and are deleted by default. However, they can be retained by configuring a message retention duration.
- If publishers send messages with an ordering key and message ordering is enabled, Pub/Sub delivers the messages in order (see the publish/pull sketch after this list).
- Pub/Sub supports encryption at rest and encryption in transit.
- The Seek feature allows subscribers to alter the acknowledgment state of messages in bulk, so that messages can be replayed or purged in bulk.
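For illustration, here is a minimal sketch using the Python client library (the project, topic, and subscription names are placeholders, and the topic and subscription are assumed to already exist with message ordering enabled) that publishes with an ordering key and then pulls and acknowledges messages:

```python
# Minimal Pub/Sub sketch: publish with an ordering key, then pull and acknowledge.
# "my-project", "my-topic", and "my-sub" are placeholder resource names.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic_path = publisher.topic_path("my-project", "my-topic")

# Messages with the same ordering key are delivered in publish order
# (message ordering must also be enabled on the subscription).
for i in range(3):
    publisher.publish(topic_path, data=f"event-{i}".encode(), ordering_key="customer-42")

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-sub")

# Pull delivery: the subscriber asks for messages and acknowledges them explicitly.
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 10})
ack_ids = [msg.ack_id for msg in response.received_messages]
if ack_ids:
    subscriber.acknowledge(request={"subscription": subscription_path, "ack_ids": ack_ids})
```

Once acknowledged, these messages are no longer delivered to the subscription unless a retention duration is configured and Seek is used to replay them.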
BigQuery
- BigQuery is a fully managed, durable, petabyte scale, serverless, highly scalable, and cost-effective multi-cloud data warehouse.
- supports a standard SQL dialect
- automatically replicates data and keeps a seven-day history of changes, allowing easy restoration and comparison of data from different times
- supports federated queries and can process external data sources in GCS (Parquet and ORC open-source file formats), Bigtable, Cloud SQL, or spreadsheets in Drive without moving the data.
- The data model consists of datasets, which act as containers for tables and views.
- BigQuery performance can be improved using Partitioned tables and Clustered tables.
- BigQuery encrypts all data at rest and supports encryption in transit.
- BigQuery Data Transfer Service automates data movement into BigQuery on a scheduled, managed basis
- Best Practices
- Control projection, avoid `SELECT *`
- Estimate costs, as queries are billed according to the number of bytes read; the cost can be estimated using the `--dry-run` flag (a dry-run sketch in Python follows this list).
- Use the maximum bytes billed setting to limit query costs.
- Use clustering and partitioning to reduce the amount of data scanned.
- Avoid repeatedly transforming data via SQL queries. Materialize the query results in stages.
- Use streaming inserts only if the data must be immediately available as streaming data is charged.
- Prune partitioned queries: use the `_PARTITIONTIME` pseudo column to filter the partitions.
- Denormalize data whenever possible using nested and repeated fields.
- Avoid external data sources, if query performance is a top priority
- Avoid using JavaScript user-defined functions
- Optimize Join patterns. Start with the largest table.
- Use the expiration settings to remove unneeded tables and partitions
- Keep the data in BigQuery to take advantage of the long-term storage cost benefits rather than exporting to other storage options.
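To make the cost-control practices concrete, here is a minimal sketch with the BigQuery Python client (the project, dataset, and table names are hypothetical) that dry-runs a query to estimate the bytes it would scan and then runs it with a cap on bytes billed:

```python
# BigQuery cost-control sketch: dry-run a query to estimate bytes scanned,
# then run it with a hard cap on bytes billed. Table/dataset names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # uses the default project from the environment

sql = """
    SELECT event_id, status            -- explicit projection instead of SELECT *
    FROM `my-project.my_dataset.events`
    WHERE _PARTITIONTIME >= TIMESTAMP('2024-01-01')  -- prune partitions
"""

# Dry run: validates the query and reports the bytes it would read, without billing.
dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_job = client.query(sql, job_config=dry_cfg)
print(f"Estimated bytes processed: {dry_job.total_bytes_processed}")

# Real run with a cap: the job is rejected if it would bill more than ~1 GB.
run_cfg = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
rows = client.query(sql, job_config=run_cfg).result()
```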
Bigtable
- Bigtable is a fully managed, scalable, wide-column NoSQL database service with up to 99.999% availability.
- ideal for applications that need very high throughput and scalability for key/value data, where each value is at most 10 MB.
- supports high read and write throughput at low latency and provides consistent sub-10ms latency – handles millions of requests/second
- is a sparsely populated table that can scale to billions of rows and thousands of columns
- supports storage of terabytes or even petabytes of data
- is not a relational database. It does not support SQL queries, joins, or multi-row transactions.
- handles upgrades and restarts transparently, and it automatically maintains high data durability.
- scales linearly in direct proportion to the number of nodes in the cluster
- stores data in tables, each of which is composed of rows (each typically describing a single entity) and columns (which contain individual values for each row).
- Each table has only one index, the row key; there are no secondary indexes, and each row key must be unique (see the write/read sketch after this list).
- Single-cluster Bigtable instances provide strong consistency.
- Multi-cluster instances, by default, provide eventual consistency but can be configured to provide read-over-write consistency or strong consistency, depending on the workload and app profile settings
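Since every lookup goes through the single row-key index, a minimal write/read sketch with the Python client looks like the following (the instance, table, and column-family names are placeholders and assumed to already exist):

```python
# Bigtable sketch: write one row keyed by a composite row key, then read it back.
# "my-instance", "my-table", and the column family "stats" are placeholders;
# there are no secondary indexes, so the row key is the only lookup path.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

row_key = b"device-42#20240101"          # composite key: entity id + date
row = table.direct_row(row_key)
row.set_cell("stats", b"temperature", b"21.5")
row.commit()                              # single-row writes are atomic

read = table.read_row(row_key)
cell = read.cells["stats"][b"temperature"][0]
print(cell.value)                         # b'21.5'
```

Designing the row key around the dominant query pattern (here, entity id plus date) is what keeps reads at consistent sub-10 ms latency, since range scans follow the key's lexicographic order.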
Cloud Dataflow
- Cloud Dataflow is a managed, serverless service for unified stream and batch data processing requirements
- provides Horizontal autoscaling to automatically choose the appropriate number of worker instances required to run the job.
- is based on Apache Beam, an open-source, unified model for defining both batch and streaming-data parallel-processing pipelines.
- supports windowing, which enables grouping operations over unbounded collections by dividing the collection into finite windows according to the timestamps of the individual elements (a minimal windowing sketch follows this list).
- supports the Drain feature for stopping a streaming job gracefully; to deploy incompatible updates, drain the existing job and submit the updated pipeline as a new job.
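Below is a minimal Apache Beam windowing sketch that runs on the local DirectRunner; the event data and timestamps are made up, and it simply assigns elements to 60-second fixed windows and counts them per key:

```python
# Minimal Apache Beam windowing sketch (DirectRunner, made-up event data):
# timestamped elements are grouped into 60-second fixed windows and counted.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

# (element, event-time in seconds) pairs standing in for a real event stream
events = [("user-a", 10), ("user-a", 30), ("user-b", 70), ("user-a", 95)]

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(events)
        # Attach the event-time timestamp Beam uses for window assignment.
        | "Timestamp" >> beam.Map(lambda kv: TimestampedValue(kv[0], kv[1]))
        | "Window" >> beam.WindowInto(FixedWindows(60))  # 60-second windows
        | "CountPerKey" >> beam.combiners.Count.PerElement()
        | "Print" >> beam.Map(print)
    )
```

The same pipeline code can read from an unbounded source such as Pub/Sub and run on the Dataflow runner; windowing is what makes grouping and counting well defined over that unbounded input.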
Cloud Dataproc
- Cloud Dataproc is a managed Spark and Hadoop service to take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
- helps to create clusters quickly, manage them easily, and save money by turning clusters on and off as needed.
- helps reduce the time and money spent on administration and lets you focus on your jobs and your data.
- has built-in integration with other GCP services, such as BigQuery, Cloud Storage, Bigtable, Cloud Logging, and Monitoring
- supports preemptible instances, which have lower compute prices, to reduce costs further.
- also supports HBase, Flink, Hive WebHCat, Druid, Jupyter, Presto, Solr, Zeppelin, Ranger, ZooKeeper, and much more.
- supports connectors for BigQuery, Bigtable, Cloud Storage
- can be configured for High Availability by specifying the number of master instances in the cluster
- All nodes in a High Availability cluster reside in the same zone. If there is a failure that impacts all nodes in a zone, the failure will not be mitigated.
- supports cluster scaling by increasing or decreasing the number of primary or secondary worker nodes (horizontal scaling)
- supports autoscaling, which provides a mechanism for automating cluster resource management by adding or removing workers according to an autoscaling policy.
- supports initialization actions: executables or scripts that run on all nodes in the cluster immediately after the cluster is set up (see the cluster-creation sketch after this list).
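As a rough sketch (resource names are placeholders, and field values follow the public Dataproc API), the high-availability, preemptible-worker, and initialization-action options can be combined when creating a cluster with the Python client:

```python
# Dataproc sketch (placeholder resource names): create an HA cluster with 3 masters,
# preemptible secondary workers, and an initialization action script.
from google.cloud import dataproc_v1

project_id = "my-project"
region = "us-central1"

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "analytics-cluster",
    "config": {
        # 3 master instances = High Availability mode (all nodes stay in one zone).
        "master_config": {"num_instances": 3, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Cheaper preemptible secondary workers to cut compute costs.
        "secondary_worker_config": {"num_instances": 2, "preemptibility": "PREEMPTIBLE"},
        # Script that runs on every node right after the cluster is set up.
        "initialization_actions": [
            {"executable_file": "gs://my-bucket/scripts/install-deps.sh"}
        ],
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is ready
```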
Cloud Dataprep
- Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
- is fully managed, serverless, and scales on-demand with no infrastructure to deploy or manage
- provides easy data preparation with clicks and no code.
- automatically identifies data anomalies & helps take fast corrective action
- automatically detects schemas, data types, possible joins, and anomalies such as missing values, outliers, and duplicates
- uses Dataflow or BigQuery under the hood, enabling unstructured or structured datasets processing of any size with the ease of clicks, not code
Datalab
- Cloud Datalab is a powerful interactive tool created to explore, analyze, transform, and visualize data and build machine learning models using familiar languages such as Python and SQL.
- runs on Google Compute Engine and connects to multiple cloud services easily so you can focus on your data science tasks.
- is built on Jupyter (formerly IPython)
- enables analysis of the data on Google BigQuery, Cloud Machine Learning Engine, Google Compute Engine, and Google Cloud Storage using Python, SQL, and JavaScript (for BigQuery user-defined functions).