Google Cloud BigQuery Security

BigQuery Encryption

  • BigQuery automatically encrypts all data before it is written to disk
  • By default, Google encrypts data at rest and manages the key encryption keys used for data protection.
  • BigQuery also supports customer-managed encryption keys (CMEK) through Cloud KMS to control the keys that protect the data; AEAD encryption functions can be used to encrypt individual values within a table.
  • BigQuery uses TLS to encrypt data in transit.
  • Cloud Data Loss Prevention (Cloud DLP) can be used to scan BigQuery tables to help protect sensitive data and meet compliance requirements.

BigQuery IAM Roles

  • BigQuery supports access control of datasets and tables using IAM
  • Primitive Roles
    • Primitive roles act at the project level
    • By default, granting access to a project also grants access to the datasets within it, unless overridden
    • are not limited to BigQuery resources; they apply to all resources in the project
    • cannot separate data access permissions from job-running permissions
    • Viewer
      • View all datasets
      • Run Jobs/Queries
      • View and update all jobs that they started
    • Editor
      • All Viewer access
      • Modify or delete all tables
      • Create new datasets
    • Owner
      • All Editor access
      • list, modify, or delete all datasets
      • View all jobs
  • Predefined Roles
    • dataViewer, dataEditor, and dataOwner roles
      • are similar to the primitive roles except
        • can be assigned for individual datasets
        • don’t give users permission to run jobs or queries
    • user, jobUser roles
      • give users permission to run jobs or queries
      • A jobUser can only start jobs and cancel jobs, but cannot list datasets or tables
      • A user, on the other hand, can perform a variety of other tasks, such as listing or creating datasets
      • A user or group granted the user role at the project level can create datasets and run query jobs against tables in those datasets.
      • The user role does not give permission to query data, view table data, or view table schema details for datasets the user did not create; the dataViewer role is needed for that (see the sketch below for granting dataset-level access).
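
A minimal sketch using the google-cloud-bigquery Python client (the project ID, dataset ID, and group email below are placeholders): dataset-level read access comparable to dataViewer is granted by updating the dataset's access entries, while the project-level jobUser/user roles would be granted separately through IAM.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")        # placeholder project
dataset = client.get_dataset("my-project.analytics")  # placeholder dataset

# Grant the analysts group read-only access to this dataset only; job-running
# permissions (bigquery.jobUser / bigquery.user) still come from project IAM.
entry = bigquery.AccessEntry(
    role="READER",
    entity_type="groupByEmail",
    entity_id="data-analysts@example.com",
)
dataset.access_entries = list(dataset.access_entries) + [entry]
client.update_dataset(dataset, ["access_entries"])
```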

Authorized Views

  • Authorized views help provide view access to a dataset
  • Use authorized views to restrict access at a lower resource level such as the table, column, row, or cell.
  • An authorized view allows sharing query results with particular users and groups without giving them access to the underlying tables.
  • Authorized View’s SQL query can be used to restrict the columns (fields) the users are able to query.
  • Authorized views HAVE to be created in a separate dataset from the source dataset. As access controls can be assigned only at the dataset level, if the view is created in the same dataset as the source data, the users would have access to both the view and the data.
  • Authorized View creation process (see the sketch after this list)
    • Create a separate dataset to store the view.
    • Create the view in the new dataset
    • Give the group read access to the dataset containing the view
    • Authorize the view to access the source dataset
    • Give the group the bigquery.user role to run jobs, including query jobs, within the project
  • The project-level bigquery.user role does not give users the ability to view or query table data in the dataset containing the tables queried by the view; they need READER access to the dataset containing the view.
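
A minimal sketch of the authorized view flow above, using the google-cloud-bigquery Python client (project, dataset, table, and group names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# 1. Separate dataset to hold the view.
view_dataset = client.create_dataset("my-project.shared_views", exists_ok=True)

# 2. Create the view, restricting the columns exposed to the analysts.
view = bigquery.Table("my-project.shared_views.orders_view")
view.view_query = (
    "SELECT order_date, amount FROM `my-project.source_data.orders` "
    "WHERE region = 'US'"
)
view = client.create_table(view)

# 3. Give the analysts group READ access to the dataset containing the view.
view_dataset.access_entries = list(view_dataset.access_entries) + [
    bigquery.AccessEntry("READER", "groupByEmail", "data-analysts@example.com")
]
client.update_dataset(view_dataset, ["access_entries"])

# 4. Authorize the view to read the source dataset on the analysts' behalf.
source_dataset = client.get_dataset("my-project.source_data")
source_dataset.access_entries = list(source_dataset.access_entries) + [
    bigquery.AccessEntry(None, "view", view.reference.to_api_repr())
]
client.update_dataset(source_dataset, ["access_entries"])

# 5. The project-level bigquery.user role (to run query jobs) is granted
#    separately through IAM.
```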

Fine-Grained Access Control

  • BigQuery supports access controls at the project, dataset, and table levels
  • BigQuery also supports fine-grained row and column level security
  • BigQuery provides fine-grained access to sensitive columns using policy tags, or type-based classification, of data.
  • Using BigQuery column-level security, you can create policies that check, at query time, whether a user has proper access.
  • Row-level security extends the principle of least privilege by enabling fine-grained access control to a subset of data in a BigQuery table, by means of row-level access policies.
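
An illustrative sketch of a row-level access policy, executed as a DDL statement through the Python client (the project, table, and group are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Only rows with region = 'US' will be visible to the sales group at query time.
ddl = """
CREATE ROW ACCESS POLICY us_only
ON `my-project.source_data.orders`
GRANT TO ('group:sales@example.com')
FILTER USING (region = 'US')
"""
client.query(ddl).result()  # wait for the DDL statement to complete
```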

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day, so both the answers and questions might become outdated soon; research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion, and correction.
  1. You have multiple Data Analysts who work with a dataset hosted in BigQuery within the same project. As a BigQuery Administrator, you are required to grant the data analysts only the privilege to create jobs/queries and the ability to cancel self-submitted jobs. Which role should be assigned to the users?
    1. User
    2. JobUser
    3. Owner
    4. Viewer
  2. Your analytics system executes queries against a BigQuery dataset. The SQL query is executed in batch and passes the contents of a SQL file to the BigQuery CLI. Then it redirects the BigQuery CLI output to another process. However, you are getting a permission error from the BigQuery CLI when the queries are executed. You want to resolve the issue. What should you do?
    1. Grant the service account BigQuery Data Viewer and BigQuery Job User roles.
    2. Grant the service account BigQuery Data Editor and BigQuery Data Viewer roles.
    3. Create a view in BigQuery from the SQL query and SELECT * from the view in the CLI.
    4. Create a new dataset in BigQuery, and copy the source table to the new dataset Query the new dataset and table from the CLI.
  3. You are responsible for the security and access control of a BigQuery dataset hosted within a project. Multiple users from multiple teams need to have access to the different tables within the dataset. How can access be controlled?
    1. Create Authorized views for tables in a separate project and grant access to the teams
    2. Create Authorized views for tables in the same project and grant access to the teams
    3. Create Materialized views for tables in a separate project and grant access to the teams
    4. Create Materialized views for tables in the same project and grant access to the teams

References

Google_Cloud_BigQuery_Data_Goverance

Google Cloud Data Analytics Services Cheat Sheet

Cloud Pub/Sub

  • Pub/Sub is a fully managed, asynchronous messaging service designed to be highly reliable and scalable with latencies on the order of 100 ms
  • Pub/Sub offers at-least-once message delivery and best-effort ordering to existing subscribers
  • Pub/Sub enables the creation of event producers and consumers, called publishers and subscribers.
  • Pub/Sub messages should be no greater than 10MB in size.
  • Messages can be received with pull or push delivery.
  • Messages published before a subscription is created will not be delivered to that subscription
  • Acknowledged messages are no longer available to subscribers and are deleted by default. However, they can be retained by configuring a message retention period.
  • If publishers send messages with an ordering key and message ordering is enabled, Pub/Sub delivers the messages in order.
  • Pub/Sub supports encryption at rest and encryption in transit.
  • The Seek feature allows subscribers to alter the acknowledgment state of messages in bulk, to replay or purge messages.
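
A minimal publish sketch with an ordering key, using the google-cloud-pubsub Python client (project and topic names are placeholders):

```python
from google.cloud import pubsub_v1

# Ordering must be enabled on the publisher client, and the subscription must
# also be created with message ordering enabled; a regional endpoint may be
# needed for end-to-end ordering guarantees.
publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic_path = publisher.topic_path("my-project", "orders-topic")

# Messages that share an ordering key are delivered to subscribers in publish order.
future = publisher.publish(topic_path, b"order created", ordering_key="customer-123")
print(future.result())  # message ID once the publish succeeds
```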

BigQuery

  • BigQuery is a fully managed, durable, petabyte scale, serverless, highly scalable, and cost-effective multi-cloud data warehouse.
  • supports a standard SQL dialect
  • automatically replicates data and keeps a seven-day history of changes, allowing easy restoration and comparison of data from different times
  • supports federated data and can process external data sources in GCS for Parquet and ORC open-source file formats, transactional databases (Bigtable, Cloud SQL), or spreadsheets in Drive without moving the data.
  • Data model consists of Datasets, tables
  • BigQuery performance can be improved using Partitioned tables and Clustered tables.
  • BigQuery encrypts all data at rest and supports encryption in transit.
  • BigQuery Data Transfer Service automates data movement into BigQuery on a scheduled, managed basis
  • Best Practices
    • Control projection, avoid select *
    • Estimate costs before running queries, as they are billed according to the number of bytes read; the cost can be estimated using the --dry-run flag (see the sketch after this list)
    • Use the maximum bytes billed setting to limit query costs.
    • Use clustering and partitioning to reduce the amount of data scanned.
    • Avoid repeatedly transforming data via SQL queries. Materialize the query results in stages.
    • Use streaming inserts only if the data must be immediately available as streaming data is charged.
    • Prune partitioned queries, use the _PARTITIONTIME pseudo column to filter the partitions.
    • Denormalize data whenever possible using nested and repeated fields.
    • Avoid external data sources, if query performance is a top priority
    • Avoid using Javascript user-defined functions
    • Optimize Join patterns. Start with the largest table.
    • Use the expiration settings to remove unneeded tables and partitions
    • Keep the data in BigQuery to take advantage of the long-term storage cost benefits rather than exporting to other storage options.
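
A sketch of a --dry-run style cost estimate through the google-cloud-bigquery Python client (placeholder project and table):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Dry run: validates the query and reports the bytes it would read,
# without running it or incurring query charges.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT name, amount FROM `my-project.source_data.orders`",
    job_config=job_config,
)
print(f"This query would process {job.total_bytes_processed} bytes")
```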

Bigtable

  • Bigtable is a fully managed, scalable, wide-column NoSQL database service with up to 99.999% availability.
  • ideal for applications that need very high throughput and scalability for key/value data, where each value is at most 10 MB in size.
  • supports high read and write throughput at low latency and provides consistent sub-10ms latency – handles millions of requests/second
  • is a sparsely populated table that can scale to billions of rows and thousands of columns
  • supports storage of terabytes or even petabytes of data
  • is not a relational database. It does not support SQL queries, joins, or multi-row transactions.
  • handles upgrades and restarts transparently, and it automatically maintains high data durability.
  • scales linearly in direct proportion to the number of nodes in the cluster
  • stores data in tables, which are composed of rows, each of which typically describes a single entity, and columns, which contain individual values for each row.
  • Each table has only one index, the row key. There are no secondary indices. Each row key must be unique.
  • Single-cluster Bigtable instances provide strong consistency.
  • Multi-cluster instances, by default, provide eventual consistency but can be configured to provide read-over-write consistency or strong consistency, depending on the workload and app profile settings
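
A minimal write/read sketch with the google-cloud-bigtable Python client (the instance, table, and "stats" column family are placeholders assumed to already exist):

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("sensor-readings")

# The row key is the only index, so it is designed around the read pattern,
# e.g. sensor ID plus timestamp for time-series lookups.
row = table.direct_row(b"sensor-42#2024-01-01T00:00:00")
row.set_cell("stats", "temperature", b"21.5")
row.commit()

# Single-point lookup by row key.
read = table.read_row(b"sensor-42#2024-01-01T00:00:00")
print(read.cells["stats"][b"temperature"][0].value)
```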

Cloud Dataflow

  • Cloud Dataflow is a managed, serverless service for unified stream and batch data processing requirements
  • provides Horizontal autoscaling to automatically choose the appropriate number of worker instances required to run the job.
  • is based on Apache Beam, an open-source, unified model for defining both batch and streaming-data parallel-processing pipelines.
  • supports Windowing which enables grouping operations over unbounded collections by dividing the collection into windows of finite collections according to the timestamps of the individual elements.
  • supports the Drain feature, which finishes processing buffered data before stopping a pipeline, e.g. when deploying incompatible updates
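
A small Apache Beam (Python SDK) sketch of fixed windowing; the keys, values, and timestamp are made up, and the pipeline runs on the local DirectRunner unless a Dataflow runner is configured:

```python
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([("user1", 1), ("user2", 1), ("user1", 1)])
        | "AddTimestamps" >> beam.Map(
            lambda kv: window.TimestampedValue(kv, 1700000000)
        )
        | "FixedWindows" >> beam.WindowInto(window.FixedWindows(60))  # 60-second windows
        | "CountPerKey" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```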

Cloud Dataproc

  • Cloud Dataproc is a managed Spark and Hadoop service to take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
  • helps to create clusters quickly, manage them easily, and save money by turning clusters on and off as needed.
  • helps reduce the time and money spent on administration and lets you focus on your jobs and your data.
  • has built-in integration with other GCP services, such as BigQuery, Cloud Storage, Bigtable, Cloud Logging, and Monitoring
  • supports preemptible instances that have lower compute prices to reduce costs further.
  • also supports HBase, Flink, Hive WebHCat, Druid, Jupyter, Presto, Solr, Zeppelin, Ranger, Zookeeper, and much more.
  • supports connectors for BigQuery, Bigtable, Cloud Storage
  • can be configured for High Availability by specifying the number of master instances in the cluster
  • All nodes in a High Availability cluster reside in the same zone. If there is a failure that impacts all nodes in a zone, the failure will not be mitigated.
  • supports cluster scaling by increasing or decreasing the number of primary or secondary worker nodes (horizontal scaling)
  • supports Autoscaling, which provides a mechanism for automating cluster resource management.
  • supports initialization actions in executables or scripts that will run on all nodes in the cluster immediately after the cluster is set up

Cloud Dataprep

  • Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
  • is fully managed, serverless, and scales on-demand with no infrastructure to deploy or manage
  • provides easy data preparation with clicks and no code.
  • automatically identifies data anomalies & helps take fast corrective action
  • automatically detects schemas, data types, possible joins, and anomalies such as missing values, outliers, and duplicates
  • uses Dataflow or BigQuery under the hood, enabling unstructured or structured datasets processing of any size with the ease of clicks, not code

Datalab

  • Cloud Datalab is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models using familiar languages, such as Python and SQL, interactively.
  • runs on Google Compute Engine and connects to multiple cloud services easily so you can focus on your data science tasks.
  • is built on Jupyter (formerly IPython)
  • enables analysis of the data on Google BigQuery, Cloud Machine Learning Engine, Google Compute Engine, and Google Cloud Storage using Python, SQL, and JavaScript (for BigQuery user-defined functions).

Google Cloud BigQuery

  • Google Cloud BigQuery is a fully managed, petabyte-scale, serverless, highly scalable, and cost-effective multi-cloud data warehouse.
  • BigQuery supports a standard SQL dialect that is ANSI:2011 compliant, which reduces the need for code rewrites.
  • BigQuery transparently and automatically provides highly durable, replicated storage in multiple locations and high availability with no extra charge and no additional setup.
  • BigQuery supports federated data and can process external data sources in GCS for Parquet and ORC open-source file formats, transactional databases (Bigtable, Cloud SQL), or spreadsheets in Drive without moving the data.
  • BigQuery automatically replicates data and keeps a seven-day history of changes, allowing easy restoration and comparison of data from different times
  • BigQuery Data Transfer Service automatically transfers data from external data sources, like Google Marketing Platform, Google Ads, YouTube, external sources like S3 or Teradata, and partner SaaS applications to BigQuery on a scheduled and fully managed basis
  • BigQuery provides a REST API for easy programmatic access and application integration. Client libraries are available in Java, Python, Node.js, C#, Go, Ruby, and PHP.
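
For example, a minimal query through the Python client library (the project ID is a placeholder; the table is a BigQuery public dataset):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'TX'
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
# query() starts the job; result() waits for it and returns the rows.
for row in client.query(query).result():
    print(row.name, row.total)
```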

BigQuery Resources

Datasets

  • Datasets are the top-level containers used to organize and control access to the BigQuery tables and views.
  • Datasets frequently map to schemas in standard relational databases and data warehouses.
  • Datasets are scoped to the Cloud project
  • A dataset is bound to a location and can be defined as
    • Regional: A specific geographic place, such as London.
    • Multi-regional: A large geographic area, such as the United States, that contains two or more geographic places.
  • Dataset location can be set only at the time of its creation.
  • A query can contain tables or views from different datasets in the same location.
  • Dataset names must be unique within each project.
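
A minimal sketch of creating a dataset with an explicit location using the Python client (IDs are placeholders; the table expiration is optional):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

dataset = bigquery.Dataset("my-project.reporting")
dataset.location = "EU"  # regional or multi-regional; fixed at creation time
dataset.default_table_expiration_ms = 90 * 24 * 60 * 60 * 1000  # optional: 90 days
dataset = client.create_dataset(dataset, exists_ok=True)
```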

Tables

  • BigQuery tables are row-column structures that hold the data.
  • A BigQuery table contains individual records organized in rows. Each record is composed of columns (also called fields).
  • Every table is defined by a schema that describes the column names, data types, and other information.
  • BigQuery has the following types of tables:
    • Native tables: Tables backed by native BigQuery storage.
    • External tables: Tables backed by storage external to BigQuery.
    • Views: Virtual tables defined by a SQL query.
  • The schema of a table can either be defined during creation or specified in the query job or load job that first populates it with data.
  • Schema auto-detection is supported when data is loaded into BigQuery or when an external data source is queried; BigQuery makes a best-effort attempt to automatically infer the schema for CSV and JSON files.
  • A column's data type cannot be changed once defined.
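
As a sketch (Python client; the IDs and the Cloud Storage URI are placeholders), a table can be created with an explicit schema, or a load job can auto-detect the schema from a CSV file:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Explicit schema at creation time.
schema = [
    bigquery.SchemaField("name", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("age", "INTEGER"),
]
client.create_table(bigquery.Table("my-project.reporting.people", schema=schema))

# Schema auto-detection on a CSV load job.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV, autodetect=True, skip_leading_rows=1
)
client.load_table_from_uri(
    "gs://my-bucket/people.csv",
    "my-project.reporting.people_autodetect",
    job_config=job_config,
).result()
```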

Partitioned Tables

  • A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data.
  • By dividing a large table into smaller partitions, query performance and costs can be controlled by reducing the number of bytes read by a query.
  • BigQuery tables can be partitioned by:
    • Time-unit column: Tables are partitioned based on a TIMESTAMP, DATE, or DATETIME column in the table.
    • Ingestion time: Tables are partitioned based on the timestamp when BigQuery ingests the data.
    • Integer range: Tables are partitioned based on an integer column.
  • If a query filters on the value of the partitioning column, BigQuery can scan the partitions that match the filter and skip the remaining partitions. This process is called pruning.
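
A sketch of a time-unit column partitioned table and a query that filters on the partitioning column so BigQuery can prune partitions (Python client, placeholder names):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

table = bigquery.Table(
    "my-project.reporting.events",
    schema=[
        bigquery.SchemaField("event_date", "DATE"),
        bigquery.SchemaField("payload", "STRING"),
    ],
)
# Daily partitions on the event_date column.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_date"
)
client.create_table(table)

# Filtering on event_date lets BigQuery scan only the matching partitions.
sql = """
    SELECT payload FROM `my-project.reporting.events`
    WHERE event_date BETWEEN '2024-01-01' AND '2024-01-07'
"""
```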

Clustered Tables

  • With Clustered tables, the table data is automatically organized based on the contents of one or more columns in the table’s schema.
  • Columns specified are used to colocate the data.
  • Clustering can be performed on multiple columns, where the order of the columns is important as it determines the sort order of the data
  • Clustering can improve query performance for specific filter queries or ones that aggregate data as BigQuery uses the sorted blocks to eliminate scans of unnecessary data
  • Clustering does not provide cost guarantees before running the query.
  • Partitioning can be used with clustering where data is first partitioned and then data in each partition is clustered by the clustering columns. When the table is queried, partitioning sets an upper bound of the query cost based on partition pruning.
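
A sketch combining partitioning and clustering (Python client, placeholder names); the order of the clustering columns determines the sort order of the data:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

table = bigquery.Table(
    "my-project.reporting.orders",
    schema=[
        bigquery.SchemaField("order_date", "DATE"),
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("product", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(field="order_date")  # daily
table.clustering_fields = ["customer_id", "product"]  # colocate within partitions
client.create_table(table)
```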

Views

  • A View is a virtual table defined by a SQL query.
  • View query results contain data only from the tables and fields specified in the query that defines the view.
  • Views are read-only and do not support DML queries
  • The dataset that contains the view and the dataset that contains the tables referenced by the view must be in the same location.
  • Views do not support BigQuery jobs that export data
  • Views do not support the JSON API for retrieving data from the view
  • Standard SQL and legacy SQL queries cannot be mixed
  • A legacy SQL view cannot be automatically updated to standard SQL syntax
  • No user-defined functions are allowed in the view's query
  • No wildcard table references are allowed
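
A minimal view-creation sketch with the Python client (placeholder names; the view's dataset and the referenced tables must share a location):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# A view is a table object whose view_query is set instead of a schema.
view = bigquery.Table("my-project.reporting.recent_orders")
view.view_query = """
    SELECT order_date, customer_id, amount
    FROM `my-project.reporting.orders`
    WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
"""
client.create_table(view)
```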

Materialized Views

  • Materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency.
  • BigQuery leverages pre-computed results from materialized views and whenever possible reads only delta changes from the base table to compute up-to-date results.
  • Materialized views can be queried directly or can be used by the BigQuery optimizer to process queries to the base tables.
  • Materialized views queries are generally faster and consume fewer resources than queries that retrieve the same data only from the base table
  • Materialized views can significantly improve the performance of workloads that have the characteristic of common and repeated queries.
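
A sketch of creating a materialized view through a DDL statement run from the Python client (placeholder names):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# BigQuery keeps the precomputed aggregate up to date from the base table.
ddl = """
CREATE MATERIALIZED VIEW `my-project.reporting.daily_revenue` AS
SELECT order_date, SUM(amount) AS revenue
FROM `my-project.reporting.orders`
GROUP BY order_date
"""
client.query(ddl).result()
```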

Jobs

  • Jobs are actions that BigQuery runs on your behalf to load data, export data, query data, or copy data.
  • Jobs do not need to run in the same project as the data they reference; however, the location where a job can execute is tied to the dataset location.

External Data Sources

  • An external data source (federated data source) is a data source that can be queried directly even though the data is not stored in BigQuery.
  • Instead of loading or streaming the data, a table can be created that references the external data source.
  • BigQuery offers support for querying data directly from:
    • Cloud Bigtable
    • Cloud Storage
    • Google Drive
    • Cloud SQL
  • Supported formats are:
    • Avro
    • CSV
    • JSON (newline delimited only)
    • ORC
    • Parquet
  • External data sources use cases
    • Loading and cleaning the data in one pass by querying the data from an external data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage.
    • Having a small amount of frequently changing data that needs to be joined with other tables. As an external data source, the frequently changing data does not need to be reloaded every time it is updated.
  • Permanent vs Temporary external tables
    • The external data sources can be queried in BigQuery by using a permanent table or a temporary table.
    • Permanent Table
      • is a table that is created in a dataset and is linked to the external data source.
      • access controls can be used to share the table with others who also have access to the underlying external data source, and you can query the table at any time.
    • Temporary Table
      • you submit a command that includes a query and creates a non-permanent table linked to the external data source.
      • no table is created in the BigQuery datasets.
      • cannot be shared with others.
      • Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.
  • Limitations
    • BigQuery does not guarantee data consistency for external data sources
    • query performance for external data sources may not be as high as for data in a native BigQuery table
    • an external data source cannot be referenced in a wildcard table query
    • table partitioning and clustering are supported only in limited ways
    • query results are not cached, so each query execution is charged
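
As a sketch (Python client; the bucket and table names are placeholders): a permanent external table over Parquet files in Cloud Storage, and a temporary external table defined only for the lifetime of one query job:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-bucket/exports/*.parquet"]

# Permanent external table: created in a dataset and shareable via access controls.
table = bigquery.Table("my-project.reporting.exports_external")
table.external_data_configuration = external_config
client.create_table(table)

# Temporary external table: no table is created in any dataset; the definition
# exists only for this query job.
job_config = bigquery.QueryJobConfig(table_definitions={"exports": external_config})
rows = client.query(
    "SELECT COUNT(*) AS n FROM exports", job_config=job_config
).result()
```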

BigQuery Security

Refer blog post @ BigQuery Security

BigQuery Best Practices

  • Cost Control
    • Query only the needed columns and avoid select * as BigQuery does a full scan of every column in the table.
    • Don’t run queries to explore or preview table data; use the table preview option instead.
    • Before running queries, preview them to estimate costs. Queries are billed according to the number of bytes read and the cost can be estimated using the --dry-run feature.
    • Use the maximum bytes billed setting to limit query costs (see the sketch after this list).
    • Use clustering and partitioning to reduce the amount of data scanned.
    • For non-clustered tables, do not use a LIMIT clause as a method of cost control. Applying a LIMIT clause to a query does not affect the amount of data read, but shows limited results only. With a clustered table, a LIMIT clause can reduce the number of bytes scanned
    • Partition the tables by date which helps query relevant subsets of data which improves performance and reduces costs.
    • Materialize the query results in stages. Break the query into stages where each stage materializes the query results by writing them to a destination table. Querying the smaller destination table reduces the amount of data that is read and lowers costs. The cost of storing the materialized results is much less than the cost of processing large amounts of data.
    • Use streaming inserts only if the data must be immediately available as streaming data is charged.
  • Query Performance
    • Control projection, Query only the needed columns. Avoid SELECT *
    • Prune partitioned queries, use the _PARTITIONTIME pseudo column to filter the partitions.
    • Denormalize data whenever possible using nested and repeated fields.
    • Avoid external data sources, if query performance is a top priority
    • Avoid repeatedly transforming data via SQL queries, use materialized views instead
    • Avoid using Javascript user-defined functions
    • Optimize Join patterns. Start with the largest table.
  • Optimizing Storage
    • Use the expiration settings to remove unneeded tables and partitions
    • Keep the data in BigQuery to take advantage of the long-term storage cost benefits rather than exporting to other storage options.
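
A sketch of capping query cost with the maximum bytes billed setting while pruning on the partitioning column (Python client, placeholder names):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# The job fails without charge if it would read more than the cap.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3)  # 10 GiB cap
job = client.query(
    """
    SELECT order_date, SUM(amount) AS revenue
    FROM `my-project.reporting.orders`
    WHERE order_date >= '2024-01-01'   -- filter on the partitioning column
    GROUP BY order_date
    """,
    job_config=job_config,
)
print(job.result().total_rows)
```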

BigQuery Data Transfer Service

Refer GCP blog post @ Google Cloud BigQuery Data Transfer Service

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day, so both the answers and questions might become outdated soon; research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion, and correction.
  1. A user wishes to generate reports on petabyte-scale data using Business Intelligence (BI) tools. Which storage option provides integration with BI tools and supports OLAP workloads up to petabyte-scale?
    1. Bigtable
    2. Cloud Datastore
    3. Cloud Storage
    4. BigQuery
  2. Your company uses Google Analytics for tracking. You need to export the session and hit data from a Google Analytics 360 reporting view on a scheduled basis into BigQuery for analysis. How can the data be exported?
    1. Configure a scheduler in Google Analytics to convert the Google Analytics data to JSON format, then import directly into BigQuery using bq command line.
    2. Use gsutil to export the Google Analytics data to Cloud Storage, then import into BigQuery and schedule it using Cron.
    3. Import data to BigQuery directly from Google Analytics using Cron
    4. Use BigQuery Data Transfer Service to import the data from Google Analytics

References

Google_Cloud_BigQuery_Architecture

Google Cloud Storage Options

GCP Storage Options

GCP provides various storage options and the selection can be based on

  • Structured vs Unstructured
  • Relational (SQL) vs Non-Relational (NoSQL)
  • Transactional (OLTP) vs Analytical (OLAP)
  • Fully Managed vs Requires Provisioning
  • Global vs Regional
  • Horizontal vs Vertical scaling

Cloud Firestore

  • Cloud Firestore is a fully managed, highly scalable, serverless, non-relational NoSQL document database
  • fully managed with no-ops and no planned downtime and no need to provision database instances (vs Bigtable)
  • uses a distributed architecture to automatically manage scaling.
  • queries scale with the size of the result set, not the size of the data set
  • supports ACID atomic transactions – all or nothing (vs Bigtable)
  • provides high availability of reads and writes – runs in Google data centers, which use redundancy to minimize impact from points of failure.
  • provides massive scalability with high performance – uses a distributed architecture to automatically manage scaling.
  • scales from zero to terabytes with flexible storage and querying of data
  • provides SQL-like query language
  • supports strong consistency
  • supports data encryption at rest and in transit
  • provides terabytes of capacity with a maximum unit size of 1 MB per entity (vs Bigtable)
  • Consider using Cloud Firestore if you need to store semi-structured objects, or if require support for transactions and SQL-like queries.

Cloud Bigtable

  • Bigtable provides a scalable, fully managed, non-relational NoSQL wide-column analytical big data database service suitable for both low-latency single-point lookups and precalculated analytics.
  • supports large quantities (>1 TB) of semi-structured or structured data (vs Datastore)
  • supports high throughput or rapidly changing data (vs BigQuery)
  • managed, but needs provisioning of nodes and can be expensive (vs Datastore and BigQuery)
  • does not support transactions or strong relational semantics (vs Datastore)
  • does not support SQL queries (vs BigQuery and Datastore)
  • Not Transactional and does not support ACID
  • provides eventual consistency
  • ideal for time-series or natural semantic ordering data
  • can run asynchronous batch or real-time processing on the data
  • can run machine learning algorithms on the data
  • provides petabytes of capacity with a maximum unit size of 10 MB per cell and 100 MB per row.
  • Usage Patterns
    • Low-latency read/write access
    • High-throughput data processing
    • Time series support
  • Anti Patterns
    • Not an ideal storage option for analytics (OLAP) workloads – Use BigQuery instead
    • Not an ideal storage option for transactional data – Use relational database or Datastore
  • Common Use cases
    • IoT, finance, adtech
    • Personalization, recommendations
    • Monitoring
    • Geospatial datasets
    • Graphs
  • Consider using Cloud Bigtable, if you need high-performance datastore to perform analytics on a large number of structured objects

Cloud Storage

  • Cloud Storage provides durable and highly available object storage.
  • fully managed, simple administration, cost-effective, and scalable service that does not require capacity management
  • supports unstructured data storage like binary or raw objects
  • provides high performance, internet-scale
  • supports data encryption at rest and in transit
  • Consider using Cloud Storage, if you need to store immutable blobs larger than 10 MB, such as large images or movies. This storage service provides petabytes of capacity with a maximum unit size of 5 TB per object.
  • Usage Patterns
    • Images, pictures, and videos
    • Objects and blobs
    • Unstructured data
    • Long term storage for archival or compliance
  • Anti Patterns
  • Common Use cases
    • Storing and streaming multimedia
    • Storage for custom data analytics pipelines
    • Archive, backup, and disaster recovery

Cloud SQL

  • provides fully managed, relational SQL databases
  • offers MySQL, PostgreSQL, MSSQL databases as a service
  • manages OS & software installation, patches and updates, backups, and replication and failover configuration; however, machines still need to be selected and provisioned (vs Cloud Spanner)
  • single region only – although it now supports cross-region read replicas (vs Cloud Spanner)
  • Scaling
    • provides vertical scalability (Max. storage of 10TB)
    • storage can be increased without incurring any downtime
    • provides an option to increase the storage automatically
    • storage CANNOT be decreased
    • supports Horizontal scaling for read-only using read replicas (vs Cloud Spanner)
    • performance is linked to the disk size
  • Security
    • data is encrypted when stored in database tables, temporary files, and backups.
    • external connections can be encrypted by using SSL, or by using the Cloud SQL Proxy.
  • High Availability
    • fault-tolerance across zones can be achieved by configuring the instance for high availability by adding a failover replica
    • failover is automatic
    • can be created from primary instance only
    • replication from the primary instance to failover replica is semi-synchronous.
    • failover replica must be in the same region as the primary instance, but in a different zone
    • only one failover replica is allowed per primary instance
    • supports managed backups and backups are created on primary instance only
    • supports automatic replication
  • Backups
    • Automated backups can be configured and are stored for 7 days
    • Manual backups (snapshots) can be created and are not deleted automatically
  • Point-in-time recovery
    • requires binary logging enabled.
    • every update to the database is written to an independent log, which involves a small reduction in write performance.
    • performance of the read operations is unaffected by binary logging, regardless of the size of the binary log files.
  • Usage Patterns
    • direct lift and shift for MySQL, PostgreSQL, MSSQL database only
    • relational database service with strong consistency
    • OLTP workloads
  • Anti Patterns
    • need data storage more than 10TB, use Cloud Spanner
    • need global availability with low latency, use Cloud Spanner
    • not a direct replacement for Oracle; use an installation on GCE instead
  • Common Use cases
    • Websites, blogs, and content management systems (CMS)
    • Business intelligence (BI) applications
    • ERP, CRM, and eCommerce applications
    • Geospatial applications
  • Consider using Cloud SQL for full relational SQL support for OLTP and lift and shift of MySQL, PostgreSQL databases

Cloud Spanner

  • Cloud Spanner provides fully managed, relational SQL databases with joins and secondary indexes
  • provides cross-region, global, horizontal scalability, and availability
  • supports strong consistency, including strongly consistent secondary indexes
  • provides high availability through synchronous and built-in data replication.
  • provides strong global consistency
  • supports database sizes exceeding ~2 TB (vs Cloud SQL)
  • does not provide direct lift and shift for relational databases (vs Cloud SQL)
  • expensive as compared to Cloud SQL
  • Consider using Cloud Spanner for full relational SQL support, with horizontal scalability spanning petabytes, for OLTP

BigQuery

  • provides a scalable, fully managed, no-ops, OLAP, enterprise data warehouse (EDW) with SQL and fast ad-hoc queries.
  • provides a high-capacity data warehousing analytics solution
  • ideal for big data exploration and processing
  • not ideal for operational or transactional databases
  • provides SQL interface
  • Usage Patterns
    • OLAP workloads up to petabyte-scale
    • Big data exploration and processing
    • Reporting via business intelligence (BI) tools
  • Anti Patterns
    • Not an ideal storage option for transactional data or OLTP – Use Cloud SQL or Cloud Spanner instead
    • Low-latency read/write access – Use Bigtable instead
  • Common Use cases
    • Analytical reporting on large data
    • Data science and advanced analyses
    • Big data processing using SQL

Memorystore

  • provides scalable, secure, and highly available in-memory service for Redis and Memcached.
  • fully managed as provisioning, replication, failover, and patching are all automated, which drastically reduces the time spent doing DevOps.
  • provides 100% compatibility with open source Redis and Memcached
  • is protected from the internet using VPC networks and private IP and comes with IAM integration
  • Usage Patterns
    • Lift and shift migration of applications
    • Low latency data caching and retrieval
  • Anti Patterns
    • Relational or NoSQL database
    • Analytics solution
  • Common Use cases
    • User session management

GCP Storage Options Decision Tree

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day, so both the answers and questions might become outdated soon; research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion, and correction.
  1. Your application is hosted across multiple regions and consists of both relational database data and static images. Your database has over 10 TB of data. You want to use a single storage repository for each data type across all regions. Which two products would you choose for this task? (Choose two)
    1. Cloud Bigtable
    2. Cloud Spanner
    3. Cloud SQL
    4. Cloud Storage
  2. You are building an application that stores relational data from users. Users across the globe will use this application. Your CTO is concerned about the scaling requirements because the size of the user base is unknown. You need to implement a database solution that can scale with your user growth with minimum configuration changes. Which storage solution should you use?
    1. Cloud SQL
    2. Cloud Spanner
    3. Cloud Firestore
    4. Cloud Datastore
  3. Your company processes high volumes of IoT data that are time-stamped. The total data volume can be several petabytes. The data needs to be written and changed at a high speed. You want to use the most performant storage option for your data. Which product should you use?
    1. Cloud Datastore
    2. Cloud Storage
    3. Cloud Bigtable
    4. BigQuery
  4. Your App Engine application needs to store stateful data in a proper storage service. Your data is non-relational database data. You do not expect the database size to grow beyond 10 GB and you need to have the ability to scale down to zero to avoid unnecessary costs. Which storage service should you use?
    1. Cloud Bigtable
    2. Cloud Dataproc
    3. Cloud SQL
    4. Cloud Datastore
  5. A financial organization wishes to develop a global application to store transactions happening in different parts of the world. The storage system must provide low-latency transaction support and horizontal scaling. Which GCP service is appropriate for this use case?
    1. Bigtable
    2. Datastore
    3. Cloud Storage
    4. Cloud Spanner
  6. You work for a mid-sized enterprise that needs to move its operational system transaction data from an on-premises database to GCP. The database is about 20 TB in size. Which database should you choose?
    1. Cloud SQL
    2. Cloud Bigtable
    3. Cloud Spanner
    4. Cloud Datastore