Google Cloud Data Analytics Services Cheat Sheet

Cloud Pub/Sub

  • Pub/Sub is a fully managed, asynchronous messaging service designed to be highly reliable and scalable with latencies on the order of 100 ms
  • Pub/Sub offers at-least-once message delivery and best-effort ordering to existing subscribers
  • Pub/Sub enables the creation of event producers and consumers, called publishers and subscribers.
  • Pub/Sub messages should be no greater than 10MB in size.
  • Messages can be received with pull or push delivery.
  • Messages published before a subscription is created will not be delivered to that subscription
  • Acknowledged messages are no longer available to subscribers and are deleted by default. However, they can be retained by configuring a message retention duration.
  • If publishers send messages with an ordering key and message ordering is enabled on the subscription, Pub/Sub delivers the messages in order (see the sketch after this list).
  • Pub/Sub supports encryption at rest and encryption in transit.
  • The Seek feature allows subscribers to alter the acknowledgment state of messages in bulk, in order to replay or purge messages.
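
A minimal Python sketch of the publish and pull flows above, assuming the google-cloud-pubsub client library and that the placeholder project, topic, and pull subscription already exist; message ordering must also be enabled on the subscription for the ordering key to take effect.

```python
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"       # placeholder
TOPIC_ID = "orders-topic"       # placeholder topic, assumed to exist
SUBSCRIPTION_ID = "orders-sub"  # placeholder pull subscription, assumed to exist

# Publish with an ordering key (the publisher must opt in to message ordering).
publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
future = publisher.publish(topic_path, b'{"order_id": 42}', ordering_key="customer-123")
print("Published message id:", future.result())

# Synchronous pull, then acknowledge so the messages are not redelivered.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 10})
for received in response.received_messages:
    print("Received:", received.message.data)
if response.received_messages:
    subscriber.acknowledge(
        request={
            "subscription": subscription_path,
            "ack_ids": [m.ack_id for m in response.received_messages],
        }
    )
```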

BigQuery

  • BigQuery is a fully managed, durable, petabyte scale, serverless, highly scalable, and cost-effective multi-cloud data warehouse.
  • supports a standard SQL dialect
  • automatically replicates data and keeps a seven-day history of changes, allowing easy restoration and comparison of data from different times
  • supports federated data and can process external data sources in GCS for Parquet and ORC open-source file formats, transactional databases (Bigtable, Cloud SQL), or spreadsheets in Drive without moving the data.
  • Data model consists of datasets, which contain tables and views
  • BigQuery performance can be improved using Partitioned tables and Clustered tables.
  • BigQuery encrypts all data at rest and supports encryption in transit.
  • BigQuery Data Transfer Service automates data movement into BigQuery on a scheduled, managed basis
  • Best Practices
    • Control projection, avoid select *
    • Estimate costs before running queries; queries are billed according to the number of bytes read, and the cost can be estimated using the --dry_run flag (see the sketch after this list)
    • Use the maximum bytes billed setting to limit query costs.
    • Use clustering and partitioning to reduce the amount of data scanned.
    • Avoid repeatedly transforming data via SQL queries. Materialize the query results in stages.
    • Use streaming inserts only if the data must be immediately available as streaming data is charged.
    • Prune partitioned queries, use the _PARTITIONTIME pseudo column to filter the partitions.
    • Denormalize data whenever possible using nested and repeated fields.
    • Avoid external data sources, if query performance is a top priority
    • Avoid using JavaScript user-defined functions
    • Optimize Join patterns. Start with the largest table.
    • Use the expiration settings to remove unneeded tables and partitions
    • Keep the data in BigQuery to take advantage of the long-term storage cost benefits rather than exporting to other storage options.
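
As a rough illustration of the cost controls above, the sketch below uses the google-cloud-bigquery Python client to estimate bytes scanned with a dry run and to cap a real run with maximum bytes billed; the project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials

sql = """
    SELECT name, state                                -- control projection: no SELECT *
    FROM `my-project.my_dataset.events`               -- placeholder partitioned table
    WHERE _PARTITIONTIME >= TIMESTAMP('2024-01-01')   -- prune partitions
"""

# Dry run: validates the query and reports the bytes it would read, at no cost.
dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_job = client.query(sql, job_config=dry_cfg)
print("Estimated bytes processed:", dry_job.total_bytes_processed)

# Real run, but fail fast if the query would read more than ~1 GB.
run_cfg = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
for row in client.query(sql, job_config=run_cfg).result():
    print(row["name"], row["state"])
```

The equivalent bq CLI flags are --dry_run and --maximum_bytes_billed.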

Bigtable

  • Bigtable is a fully managed, scalable, wide-column NoSQL database service with up to 99.999% availability.
  • ideal for applications that need very high throughput and scalability for key/value data, where each value is typically no larger than 10 MB.
  • supports high read and write throughput at low latency and provides consistent sub-10ms latency – handles millions of requests/second
  • is a sparsely populated table that can scale to billions of rows and thousands of columns
  • supports storage of terabytes or even petabytes of data
  • is not a relational database. It does not support SQL queries, joins, or multi-row transactions.
  • handles upgrades and restarts transparently, and it automatically maintains high data durability.
  • scales linearly in direct proportion to the number of nodes in the cluster
  • stores data in tables, each of which is composed of rows that typically describe a single entity, and columns that contain individual values for each row.
  • Each table has only one index, the row key. There are no secondary indices. Each row key must be unique.
  • Single-cluster Bigtable instances provide strong consistency.
  • Multi-cluster instances, by default, provide eventual consistency but can be configured to provide read-over-write consistency or strong consistency, depending on the workload and app profile settings

Cloud Dataflow

  • Cloud Dataflow is a managed, serverless service for unified stream and batch data processing requirements
  • provides Horizontal autoscaling to automatically choose the appropriate number of worker instances required to run the job.
  • is based on Apache Beam, an open-source, unified model for defining both batch and streaming-data parallel-processing pipelines.
  • supports Windowing, which enables grouping operations over unbounded collections by dividing the collection into windows of finite collections according to the timestamps of the individual elements (see the Beam sketch after this list).
  • supports the Drain option, which stops a streaming job gracefully after processing buffered data; draining the existing job and launching a replacement is how incompatible pipeline updates are deployed
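
A minimal Apache Beam (Python SDK) sketch of fixed windowing over an unbounded Pub/Sub source, as described above; the topic path is a placeholder, and running on Dataflow would additionally require the DataflowRunner, project, and region pipeline options.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window

options = PipelineOptions()  # add --runner=DataflowRunner, project, region, etc. for Dataflow
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")  # placeholder
        | "KeyByPayload" >> beam.Map(lambda msg: (msg.decode("utf-8"), 1))
        | "FixedWindows" >> beam.WindowInto(window.FixedWindows(60))  # 60-second windows by timestamp
        | "CountPerKey" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```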

Cloud Dataproc

  • Cloud Dataproc is a managed Spark and Hadoop service to take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
  • helps to create clusters quickly, manage them easily, and save money by turning clusters on and off as needed.
  • helps reduce the time and money spent on administration and lets you focus on your jobs and your data.
  • has built-in integration with other GCP services, such as BigQuery, Cloud Storage, Bigtable, Cloud Logging, and Monitoring
  • supports preemptible instances that have lower compute prices to reduce costs further.
  • also supports HBase, Flink, Hive WebHCat, Druid, Jupyter, Presto, Solr, Zeppelin, Ranger, Zookeeper, and much more.
  • supports connectors for BigQuery, Bigtable, Cloud Storage
  • can be configured for High Availability by specifying the number of master instances in the cluster
  • All nodes in a High Availability cluster reside in the same zone. If there is a failure that impacts all nodes in a zone, the failure will not be mitigated.
  • supports cluster scaling by increasing or decreasing the number of primary or secondary worker nodes (horizontal scaling)
  • supports autoscaling, which provides a mechanism for automating cluster resource management
  • supports initialization actions via executables or scripts that run on all nodes in the cluster immediately after the cluster is set up (see the sketch below)
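
A rough sketch of creating a High Availability cluster (3 masters) with secondary workers and an initialization action using the google-cloud-dataproc Python client; all names, paths, and sizing are placeholders, and the exact config fields should be verified against the current client library.

```python
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"   # placeholder
REGION = "us-central1"      # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": "analytics-cluster",  # placeholder
    "config": {
        "master_config": {"num_instances": 3, "machine_type_uri": "n1-standard-4"},  # 3 masters = HA
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        "secondary_worker_config": {"num_instances": 2},  # secondary workers are preemptible by default
        "initialization_actions": [
            {"executable_file": "gs://my-bucket/scripts/install-deps.sh"}  # placeholder script
        ],
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
)
print("Created cluster:", operation.result().cluster_name)
```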

Cloud Dataprep

  • Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
  • is fully managed, serverless, and scales on-demand with no infrastructure to deploy or manage
  • provides easy data preparation with clicks and no code.
  • automatically identifies data anomalies & helps take fast corrective action
  • automatically detects schemas, data types, possible joins, and anomalies such as missing values, outliers, and duplicates
  • uses Dataflow or BigQuery under the hood, enabling unstructured or structured datasets processing of any size with the ease of clicks, not code

Datalab

  • Cloud Datalab is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models using familiar languages, such as Python and SQL, interactively.
  • runs on Google Compute Engine and connects to multiple cloud services easily so you can focus on your data science tasks.
  • is built on Jupyter (formerly IPython)
  • enables analysis of the data on Google BigQuery, Cloud Machine Learning Engine, Google Compute Engine, and Google Cloud Storage using Python, SQL, and JavaScript (for BigQuery user-defined functions).

Google Cloud Firestore

  • Google Cloud Firestore provides a fully managed, scalable, and serverless document database.
  • Firestore stores the data in the form of documents and collections
  • Firestore provides horizontal autoscaling, strong consistency with support for ACID transactions
  • Firestore database can be regional or multi-regional
  • Firestore multi-region instances provide five-nines (99.999%) availability SLA and regional instances with four-nines (99.99%) availability SLA

Data Model

  • Firestore is schemaless
  • Document & Collections
    • Unit of storage is the document in Firestore
    • Each document contains a set of key-value pairs
    • stores the data in documents organized into collections.
    • is optimized for storing large collections of small documents.
    • supports a variety of data types for values: boolean, number, string, geo point, binary blob, and timestamp.
    • Documents can contain subcollections, arrays, or nested objects, which can include primitive fields like strings or complex objects like lists.
    • Documents within a collection are unique and can be identified using your own keys, such as user IDs, or Firestore-generated random IDs (see the sketch after this list).
  • Indexes
    • Firestore guarantees high query performance by using indexes for all queries.
    • supports two types of indexes
      • Single-field
        • automatically maintains single-field indexes for each field in a document and each subfield in a map.
        • Single-field index exemption can be used to exempt a field from automatic indexing settings
        • Single-field index exemption for a map field is inherited by the map’s subfields
      • Composite
        • A composite index stores a sorted mapping of all the documents in a collection, based on an ordered list of fields to index.
        • does not automatically create composite indexes but helps identify fields based on the query pattern
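
A short sketch of the document/collection model above using the google-cloud-firestore Python client; the collection name, document IDs, and fields are placeholders.

```python
from google.cloud import firestore

db = firestore.Client()  # assumes default project/credentials and Native mode

# A document is a set of key-value pairs stored in a collection.
user_ref = db.collection("users").document("alice")  # explicit document ID
user_ref.set({
    "first": "Alice",
    "last": "Doe",
    "born": 1990,
    "address": {"city": "London", "country": "UK"},  # nested object (map)
    "tags": ["admin", "beta"],                       # array
})

# Let Firestore generate a random document ID.
db.collection("users").add({"first": "Bob", "born": 1985})

# Simple query; single-field indexes are maintained automatically.
for doc in db.collection("users").where("born", ">=", 1980).stream():
    print(doc.id, doc.to_dict())
```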

Data Contention

  • Data Contention occurs when two or more operations compete to control the same document.
  • Mobile/Web SDKs
    • uses optimistic concurrency controls to resolve data contention
    • resolves data contention by delaying or failing one of the operations
    • client libraries automatically retry transactions that fail due to data contention. After a finite number of retries, the transaction operation fails and returns an error message
  • Server Client Libraries
    • use pessimistic concurrency controls to resolve data contention.
    • Pessimistic transactions use database locks to prevent other operations from modifying data (see the sketch after this list).
    • Transactions place locks on the documents they read. A transaction’s lock on a document blocks other transactions, batched writes, and non-transactional writes from changing that document.
    • A transaction releases its document locks at commit time. It also releases its locks if it times out or fails for any reason.
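
A sketch of a transaction with the Python server client library (pessimistic locking, as described above); the collection, document, and field names are placeholders.

```python
from google.cloud import firestore

db = firestore.Client()
transaction = db.transaction()
city_ref = db.collection("cities").document("SF")  # placeholder document


@firestore.transactional
def increment_population(transaction, ref):
    # The read inside the transaction places a lock on the document; conflicting
    # writes are blocked until commit, and the operation is retried on contention.
    snapshot = ref.get(transaction=transaction)
    transaction.update(ref, {"population": snapshot.get("population") + 1})


increment_population(transaction, city_ref)
```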

Firestore Security

  • Firestore automatically encrypts all data before it is written to disk.
  • Server-side encryption can be used in combination with client-side encryption, where data is encrypted by the client as well as the server, i.e., double encryption
  • Firestore uses Transport Layer Security (TLS) to protect the data as it travels over the Internet during read and write operations.

Firestore Native vs Datastore Mode

Firestore in Native mode

  • Strongly consistent storage layer
  • Collection and document data model
  • Real-time updates
  • Mobile and Web client libraries
  • Firestore is backward compatible with Datastore, but the new data model, real-time updates, and mobile and web client library features are not.
  • Native mode can automatically scale to millions of concurrent clients.
  • Native mode is recommended for Mobile and Web apps

Firestore in Datastore mode

  • Datastore mode uses Datastore system behavior but accesses Firestore’s storage layer, removing the following Datastore limitations:
    • No more eventual consistency. Firestore in Datastore mode is a strongly consistent database
    • No more entity group limits on writes per second. Writes to an entity group are no longer limited to 1 per second. Transactions are no longer limited to 25 entity groups.
    • Transactions can be as complex as you want to design them.
    • No more cross-entity group transaction limits. Transactions can span documents and be as complex as your app requires. Queries in transactions are no longer required to be ancestor queries.
  • Datastore mode disables Firestore features that are not compatible with Datastore:
    • accepts only Datastore API requests and denies Firestore API requests.
    • uses Datastore indexes instead of Firestore indexes.
    • does not support Firestore client libraries, only Datastore client libraries
    • does not support Firestore real-time capabilities
  • Datastore mode can automatically scale to millions of writes per second.

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day, so both the answers and questions might become outdated soon; research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion and correction.
  1. Your existing application keeps user state information in a single MySQL database. This state information is very user-specific and depends heavily on how long a user has been using an application. The MySQL database is causing challenges to maintain and enhance the schema for various users. Which storage option should you choose?
    1. Cloud SQL
    2. Cloud Storage
    3. Cloud Spanner
    4. Cloud Firestore

References

Google_Cloud_Firestore

Google Cloud – EHR Healthcare Case Study

EHR Healthcare is a leading provider of electronic health record software to the medical industry. EHR Healthcare provides its software as a service to multi-national medical offices, hospitals, and insurance providers.

Executive statement

Our on-premises strategy has worked for years but has required a major investment of time and money in training our team on distinctly different systems, managing similar but separate environments, and responding to outages. Many of these outages have been a result of misconfigured systems, inadequate capacity to manage spikes in traffic, and inconsistent monitoring practices. We want to use Google Cloud to leverage a scalable, resilient platform that can span multiple environments seamlessly and provide a consistent and stable user experience that positions us for future growth.

EHR Healthcare wants to move to Google Cloud to expand and build scalable and highly available applications. It also wants to leverage automation and IaC (Infrastructure as Code) to provide consistency across environments and reduce provisioning errors.

Solution Concept

Due to rapid changes in the healthcare and insurance industry, EHR Healthcare’s business has been growing exponentially year over year. They need to be able to scale their environment, adapt their disaster recovery plan, and roll out new continuous deployment capabilities to update their software at a fast pace. Google Cloud has been chosen to replace its current colocation facilities.

EHR wants to scale, build HA and DR setup and introduce CI/CD in their setup.

Existing Technical Environment

EHR’s software is currently hosted in multiple colocation facilities. The lease on one of the data centers is about to expire.
Customer-facing applications are web-based, and many have recently been containerized to run on a group of Kubernetes clusters. Data is stored in a mixture of relational and NoSQL databases (MySQL, MS SQL Server, Redis, and MongoDB).
EHR is hosting several legacy file- and API-based integrations with insurance providers on-premises. These systems are scheduled to be replaced over the next several years. There is no plan to upgrade or move these systems at the current time.
Users are managed via Microsoft Active Directory. Monitoring is currently being done via various open-source tools. Alerts are sent via email and are often ignored.

  • As the lease of one of the data centers is about to expire, so time is critical
  • Some applications are containerized and have SQL and NoSQL databases and can be moved
  • Some of the systems would not be migrated
  • Team has multiple monitoring tools and might need consolidation

Business requirements

  • On-board new insurance providers as quickly as possible.
  • Provide a minimum 99.9% availability for all customer-facing systems.
    • Availability can be increased by hosting applications across multiple zones
  • Provide centralized visibility and proactive action on system performance and usage.
    • Cloud Monitoring can be used to provide centralized visibility and alerting can provide proactive action
  • Increase ability to provide insights into healthcare trends.
    • Data can be pushed to and analyzed using BigQuery, and insights visualized using Data Studio.
  • Reduce latency to all customers.
    • Performance can be improved using Global Load Balancer to expose the applications
  • Maintain regulatory compliance.
    • Regulatory compliance can be maintained using data localization and data retention policies.
  • Decrease infrastructure administration costs.
    • Infrastructure administration costs can be reduced using automation with either Terraform or Deployment Manager
  • Make predictions and generate reports on industry trends based on provider data.
    • Data can be pushed to and analyzed using BigQuery.

Technical requirements

  • Maintain legacy interfaces to insurance providers with connectivity to both on-premises systems and cloud providers.
  • Provide a consistent way to manage customer-facing applications that are container-based.
    • Container-based applications can be deployed to GKE or Cloud Run with a consistent CI/CD experience
  • Provide a secure and high-performance connection between on-premises systems and Google Cloud.
    • Cloud VPN, Dedicated Interconnect, or Partner Interconnect connections can be established between on-premises and Google Cloud
  • Provide consistent logging, log retention, monitoring, and alerting capabilities.
    • Cloud Monitoring and Cloud Logging can be used to provide a single tool for monitoring, logging, and alerting.
  • Maintain and manage multiple container-based environments.
    • Use Terraform or Deployment Manager (IaC) to provide consistent implementations across environments
  • Dynamically scale and provision new environments.
    • Applications deployed on GKE can be scaled using Cluster Autoscaler and HPA for deployments.
  • Create interfaces to ingest and process data from new providers.

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day, so both the answers and questions might become outdated soon; research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion and correction.
  1. For this question, refer to the EHR Healthcare case study. In the past, configuration errors put IP addresses on backend servers that should not have been accessible from the internet. You need to ensure that no one can put external IP addresses on backend Compute Engine instances and that external IP addresses can only be configured on the front end Compute Engine instances. What should you do?
    1. Create an organizational policy with a constraint to allow external IP addresses on the front end Compute Engine instances
    2. Revoke the compute.networkadmin role from all users in the project with front end instances
    3. Create an Identity and Access Management (IAM) policy that maps the IT staff to the compute.networkadmin role for the organization
    4. Create a custom Identity and Access Management (IAM) role named GCE_FRONTEND with the compute.addresses.create permission

References

EHR Healthcare Case Study

Google Cloud BigQuery

  • Google Cloud BigQuery is a fully managed, petabyte-scale, serverless, highly scalable, and cost-effective multi-cloud data warehouse.
  • BigQuery supports a standard SQL dialect that is ANSI:2011 compliant, which reduces the need for code rewrites.
  • BigQuery transparently and automatically provides highly durable, replicated storage in multiple locations and high availability with no extra charge and no additional setup.
  • BigQuery supports federated data and can process external data sources in GCS for Parquet and ORC open-source file formats, transactional databases (Bigtable, Cloud SQL), or spreadsheets in Drive without moving the data.
  • BigQuery automatically replicates data and keeps a seven-day history of changes, allowing easy restoration and comparison of data from different times
  • BigQuery Data Transfer Service automatically transfers data from external data sources, like Google Marketing Platform, Google Ads, YouTube, external sources like S3 or Teradata, and partner SaaS applications to BigQuery on a scheduled and fully managed basis
  • BigQuery provides a REST API for easy programmatic access and application integration. Client libraries are available in Java, Python, Node.js, C#, Go, Ruby, and PHP.

BigQuery Resources

Datasets

  • Datasets are the top-level containers used to organize and control access to the BigQuery tables and views.
  • Datasets frequently map to schemas in standard relational databases and data warehouses.
  • Datasets are scoped to the Cloud project
  • A dataset is bound to a location and can be defined as
    • Regional: A specific geographic place, such as London.
    • Multi-regional: A large geographic area, such as the United States, that contains two or more geographic places.
  • Dataset location can be set only at the time of its creation.
  • A query can contain tables or views from different datasets in the same location.
  • Dataset names must be unique for each project.

Tables

  • BigQuery tables are row-column structures that hold the data.
  • A BigQuery table contains individual records organized in rows. Each record is composed of columns (also called fields).
  • Every table is defined by a schema that describes the column names, data types, and other information.
  • BigQuery has the following types of tables:
    • Native tables: Tables backed by native BigQuery storage.
    • External tables: Tables backed by storage external to BigQuery.
    • Views: Virtual tables defined by a SQL query.
  • Schema of a table can either be defined during creation or specified in the query job or load job that first populates it with data.
  • Schema auto-detection is also supported when data is loaded into BigQuery or queried from an external data source. BigQuery makes a best-effort attempt to automatically infer the schema for CSV and JSON files
  • A column's data type cannot be changed once defined.

Partitioned Tables

  • A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data.
  • By dividing a large table into smaller partitions, query performance and costs can be controlled by reducing the number of bytes read by a query.
  • BigQuery tables can be partitioned by:
    • Time-unit column: Tables are partitioned based on a TIMESTAMP, DATE, or DATETIME column in the table.
    • Ingestion time: Tables are partitioned based on the timestamp when BigQuery ingests the data.
    • Integer range: Tables are partitioned based on an integer column.
  • If a query filters on the value of the partitioning column, BigQuery can scan the partitions that match the filter and skip the remaining partitions. This process is called pruning.

Clustered Tables

  • With Clustered tables, the table data is automatically organized based on the contents of one or more columns in the table’s schema.
  • Columns specified are used to colocate the data.
  • Clustering can be performed on multiple columns, where the order of the columns is important as it determines the sort order of the data
  • Clustering can improve query performance for specific filter queries or ones that aggregate data as BigQuery uses the sorted blocks to eliminate scans of unnecessary data
  • Clustering does not provide strict cost guarantees before running the query.
  • Partitioning can be used with clustering, where data is first partitioned and then the data in each partition is clustered by the clustering columns. When the table is queried, partitioning sets an upper bound on the query cost based on partition pruning (see the sketch after this list).
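
A sketch of creating a table that is both partitioned and clustered with the google-cloud-bigquery Python client; the table ID and schema are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"  # placeholder

schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
]

table = bigquery.Table(table_id, schema=schema)
# Partition by a time-unit column, then cluster the data within each partition.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_ts"
)
table.clustering_fields = ["customer_id"]

table = client.create_table(table)
print("Created", table.full_table_id)
```

Queries that filter on event_ts prune partitions (bounding cost), while filters on customer_id benefit from the clustered sort order.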

Views

  • A View is a virtual table defined by a SQL query.
  • View query results contain data only from the tables and fields specified in the query that defines the view.
  • Views are read-only and do not support DML queries
  • The dataset that contains the view and the dataset that contains the tables referenced by the view must be in the same location.
  • Views do not support BigQuery jobs that export data
  • Views do not support the JSON API for retrieving view data
  • Standard SQL and legacy SQL queries cannot be mixed
  • A legacy SQL view cannot be automatically updated to standard SQL syntax.
  • No user-defined functions are allowed
  • No wildcard table references are allowed

Materialized Views

  • Materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency.
  • BigQuery leverages pre-computed results from materialized views and whenever possible reads only delta changes from the base table to compute up-to-date results.
  • Materialized views can be queried directly or can be used by the BigQuery optimizer to process queries to the base tables.
  • Materialized views queries are generally faster and consume fewer resources than queries that retrieve the same data only from the base table
  • Materialized views can significantly improve the performance of workloads that have the characteristic of common and repeated queries.

Jobs

  • Jobs are actions that BigQuery runs on your behalf to load data, export data, query data, or copy data.
  • Jobs are not linked to the same project that the data is stored in. However, the location where the job can execute is linked to the dataset location.

External Data Sources

  • An external data source (federated data source) is a data source that can be queried directly even though the data is not stored in BigQuery.
  • Instead of loading or streaming the data, a table can be created that references the external data source.
  • BigQuery offers support for querying data directly from:
    • Cloud Bigtable
    • Cloud Storage
    • Google Drive
    • Cloud SQL
  • Supported formats are:
    • Avro
    • CSV
    • JSON (newline delimited only)
    • ORC
    • Parquet
  • External data sources use cases
    • Loading and cleaning the data in one pass by querying the data from an external data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage.
    • Having a small amount of frequently changing data that needs to be joined with other tables. As an external data source, the frequently changing data does not need to be reloaded every time it is updated.
  • Permanent vs Temporary external tables
    • The external data sources can be queried in BigQuery by using a permanent table or a temporary table.
    • Permanent Table
      • is a table that is created in a dataset and is linked to the external data source (see the sketch after this list).
      • access controls can be used to share the table with others who also have access to the underlying external data source, and you can query the table at any time.
    • Temporary Table
      • you submit a command that includes a query and creates a non-permanent table linked to the external data source.
      • no table is created in the BigQuery datasets.
      • cannot be shared with others.
      • Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.
  • Limitations
    • does not guarantee data consistency for external data sources
    • query performance for external data sources may not be as high as querying data in a native BigQuery table
    • cannot reference an external data source in a wildcard table query.
    • supports table partitioning or clustering only in limited ways
    • query results are not cached, so each query execution is charged
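
A sketch of defining a permanent external table over Parquet files in Cloud Storage with the Python client; the bucket path and table ID are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.ext_sales"  # placeholder

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-bucket/sales/*.parquet"]  # placeholder GCS path

table = bigquery.Table(table_id)
table.external_data_configuration = external_config
client.create_table(table)

# The external table can now be queried like any other table, keeping in mind
# that results are not cached and consistency is not guaranteed.
rows = client.query(f"SELECT COUNT(*) AS cnt FROM `{table_id}`").result()
print(next(iter(rows))["cnt"])
```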

BigQuery Security

Refer blog post @ BigQuery Security

BigQuery Best Practices

  • Cost Control
    • Query only the needed columns and avoid select * as BigQuery does a full scan of every column in the table.
    • Don’t run queries to explore or preview table data; use the table preview option instead
    • Before running queries, preview them to estimate costs. Queries are billed according to the number of bytes read and the cost can be estimated using the --dry_run feature
    • Use the maximum bytes billed setting to limit query costs.
    • Use clustering and partitioning to reduce the amount of data scanned.
    • For non-clustered tables, do not use a LIMIT clause as a method of cost control. Applying a LIMIT clause to a query does not affect the amount of data read, but shows limited results only. With a clustered table, a LIMIT clause can reduce the number of bytes scanned
    • Partition the tables by date which helps query relevant subsets of data which improves performance and reduces costs.
    • Materialize the query results in stages. Break the query into stages where each stage materializes the query results by writing them to a destination table. Querying the smaller destination table reduces the amount of data that is read and lowers costs. The cost of storing the materialized results is much less than the cost of processing large amounts of data.
    • Use streaming inserts only if the data must be immediately available as streaming data is charged.
  • Query Performance
    • Control projection, Query only the needed columns. Avoid SELECT *
    • Prune partitioned queries, use the _PARTITIONTIME pseudo column to filter the partitions.
    • Denormalize data whenever possible using nested and repeated fields.
    • Avoid external data sources, if query performance is a top priority
    • Avoid repeatedly transforming data via SQL queries, use materialized views instead
    • Avoid using JavaScript user-defined functions
    • Optimize Join patterns. Start with the largest table.
  • Optimizing Storage
    • Use the expiration settings to remove unneeded tables and partitions
    • Keep the data in BigQuery to take advantage of the long-term storage cost benefits rather than exporting to other storage options.

BigQuery Data Transfer Service

Refer GCP blog post @ Google Cloud BigQuery Data Transfer Service

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day, so both the answers and questions might become outdated soon; research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion and correction.
  1. A user wishes to generate reports on petabyte-scale data using Business Intelligence (BI) tools. Which storage option provides integration with BI tools and supports OLAP workloads up to petabyte-scale?
    1. Bigtable
    2. Cloud Datastore
    3. Cloud Storage
    4. BigQuery
  2. Your company uses Google Analytics for tracking. You need to export the session and hit data from a Google Analytics 360 reporting view on a scheduled basis into BigQuery for analysis. How can the data be exported?
    1. Configure a scheduler in Google Analytics to convert the Google Analytics data to JSON format, then import directly into BigQuery using bq command line.
    2. Use gsutil to export the Google Analytics data to Cloud Storage, then import into BigQuery and schedule it using Cron.
    3. Import data to BigQuery directly from Google Analytics using Cron
    4. Use BigQuery Data Transfer Service to import the data from Google Analytics

References

Google_Cloud_BigQuery_Architecture

Google Cloud Bigtable

  • Cloud Bigtable is a fully managed, scalable, wide-column NoSQL database service with up to 99.999% availability.
  • Bigtable is ideal for applications that need very high throughput and scalability for key/value data, where each value is typically no larger than 10 MB.
  • Bigtable supports high read and write throughput at low latency and provides consistent sub-10ms latency – handles millions of requests/second
  • Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns
  • Bigtable supports storage of terabytes or even petabytes of data
  • Bigtable is not a relational database. It does not support SQL queries, joins, or multi-row transactions.
  • Fully Managed
    •  Bigtable handles upgrades and restarts transparently, and it automatically maintains high data durability.
    • Data replication can be performed by simply adding a second cluster to the instance, and replication starts automatically.
  • Scalability
    • Bigtable scales linearly in direct proportion to the number of machines in the cluster
    • Bigtable throughput can be scaled dynamically by adding or removing cluster nodes without restarting
  • Bigtable integrates easily with big data tools like Hadoop, Dataflow, Dataproc and supports HBase APIs.

Bigtable Architecture

  • A Bigtable instance is a container for clusters, in which nodes are organized.
  • Bigtable stores data in Colossus, Google’s file system.
  • Instance
    • A Bigtable instance is a container for data.
    • Instances have one or more clusters, located in different zones and possibly different regions (clusters in different regions add replication latency)
    • Each cluster has at least 1 node
    • A Table belongs to an instance and not to the cluster or node.
    • An instance also consists of the following properties
      • Storage Type – SSD or HDD
      • Application Profiles – primarily for instances using replication
  • Instance Type
    • Development – Single node cluster with no replication or SLA
    • Production – 1+ clusters with 3+ nodes per cluster
  • Storage Type
    • Storage Type dictates where the data is stored i.e. SSD or HDD
    • Choice of SSD or HDD storage for the instance is permanent
    • SSD storage is the most efficient and cost-effective choice for most use cases.
    • HDD storage is sometimes appropriate for very large data sets (>10 TB) that are not latency-sensitive or are infrequently accessed.
  • Application Profile
    • An application profile, or app profile, stores settings that tell Bigtable how to handle incoming requests from an application
    • Application profile helps define custom application-specific settings for handling incoming connections
  • Cluster
    • Clusters handle the requests sent to a single Bigtable instance
    • Each cluster belongs to a single Bigtable instance, and an instance can have up to 4 clusters
    • Each cluster is located in a single zone
    • Bigtable instances with only 1 cluster do not use replication
    • Instances with multiple clusters replicate the data, which
      • improves data availability and durability
      • improves scalability by routing different types of traffic to different clusters
      • provides failover capability, if another cluster becomes unavailable
    • If an instance has multiple clusters, Bigtable automatically starts replicating the data by keeping separate copies of the data in each of the clusters’ zones and synchronizing updates between the copies
  • Nodes
    • Each cluster in an instance has 1 or more nodes, which are the compute resources that Bigtable uses to manage the data.
    • Each node in the cluster handles a subset of the requests to the cluster
    • All client requests go through a front-end server before they are sent to a Bigtable node.
    • Bigtable separates the Compute from the Storage. Data is never stored in nodes themselves; each node has pointers to a set of tablets that are stored on Colossus. This helps as
      • Rebalancing tablets from one node to another is very fast, as the actual data is not copied. Only pointers for each node are updated
      • Recovery from the failure of a Bigtable node is very fast as only the metadata needs to be migrated to the replacement node.
      • When a Bigtable node fails, no data is lost.
    • A Bigtable cluster can be scaled by adding nodes which would increase
      • the number of simultaneous requests that the cluster can handle
      • the maximum throughput of the cluster.
    • Each node is responsible for:
      • Keeping track of specific tablets on disk.
      • Handling incoming reads and writes for its tablets.
      • Performing maintenance tasks on its tablets, such as periodic compactions
    • Bigtable nodes are also referred to as tablet servers
  • Tables
    • Bigtable stores data in massively scalable tables, each of which is a sorted key/value map.
    • A Table belongs to an instance and not to the cluster or node.
    • A Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries.
    • Bigtable splits all of the data in a table into separate tablets.
    • Tablets are stored on the disk, separate from the nodes but in the same zone as the nodes.
    • Each tablet is associated with a specific Bigtable node.
    • Tablets are stored in SSTable format which provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings.
    • In addition to the SSTable files, all writes are stored in Colossus’s shared log as soon as they are acknowledged by Bigtable, providing increased durability.

Bigtable Storage Model

  • Bigtable stores data in tables, each of which is a sorted key/value map.
  • A Table is composed of rows, each of which typically describes a single entity, and columns, which contain individual values for each row.
  • Each row is indexed by a single row key, and columns that are related to one another are typically grouped together into a column family.
  • Each column is identified by a combination of the column family and a column qualifier, which is a unique name within the column family.
  • Each row/column intersection can contain multiple cells.
  • Each cell contains a unique timestamped version of the data for that row and column.
  • Storing multiple cells in a column provides a record of how the stored data for that row and column has changed over time.
  • Bigtable tables are sparse; if a column is not used in a particular row, it does not take up any space (see the sketch after this list).
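
A minimal write/read sketch of the row key / column family / cell model above, using the google-cloud-bigtable Python client; the instance, table, and column family ("cf1") are placeholders assumed to already exist.

```python
import datetime

from google.cloud import bigtable

client = bigtable.Client(project="my-project")               # placeholder project
table = client.instance("my-instance").table("sensor-data")  # assumed to exist with family "cf1"

# Write one cell; every cell is versioned by timestamp.
row_key = b"device-042#2024-01-01T10"
row = table.direct_row(row_key)
row.set_cell("cf1", b"temperature", b"21.5", timestamp=datetime.datetime.utcnow())
row.commit()

# Read the row back; cells are addressed by column family and qualifier.
result = table.read_row(row_key)
cell = result.cells["cf1"][b"temperature"][0]  # most recent version first
print(cell.value, cell.timestamp)
```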

Bigtable Schema Design

  • Bigtable schema is a blueprint or model of a table that includes Row Keys, Column Families, and Columns
  • Bigtable is a key/value store, not a relational store. It does not support joins, and transactions are supported only within a single row.
  • Each table has only one index, the row key. There are no secondary indices. Each row key must be unique.
  • Rows are sorted lexicographically by row key, from the lowest to the highest byte string. Row keys are sorted in big-endian byte order, the binary equivalent of alphabetical order.
  • Column families are not stored in any specific order.
  • Columns are grouped by column family and sorted in lexicographic order within the column family.
  • Intersection of a row and column can contain multiple timestamped cells. Each cell contains a unique, timestamped version of the data for that row and column.
  • All operations are atomic at the row level. This means that an operation affects either an entire row or none of the row.
  • Bigtable tables are sparse. A column doesn’t take up any space in a row that doesn’t use the column.

Bigtable Replication

  • Bigtable Replication helps increase the availability and durability of the data by copying it across multiple zones in a region or multiple regions.
  • Replication helps isolate workloads by routing different types of requests to different clusters using application profiles.
  • Bigtable replication can be implemented by
    • creating a new instance with more than 1 cluster or
    • adding clusters to an existing instance.
  • Bigtable synchronizes the data between the clusters, creating a separate, independent copy of the data in each zone where the instance has a cluster.
  • Replicated clusters in different regions typically have higher replication latency than replicated clusters in the same region.
  • Bigtable replicates any changes to the data automatically, including all of the following types of changes:
    • Updates to the data in existing tables
    • New and deleted tables
    • Added and removed column families
    • Changes to a column family’s garbage collection policy
  • Bigtable treats each cluster in the instance as a primary cluster, so reads and writes can be performed in each cluster.
  • Application profiles can be created so that the requests from different types of applications are routed to different clusters.
  • Consistency Model
    • Eventual Consistency
      • Replication for Bigtable is eventually consistent, by default.
    • Read-your-writes Consistency
      • Bigtable can also provide read-your-writes consistency when replication is enabled, which ensures that an application will never read data that is older than its most recent writes.
      • To gain read-your-writes consistency for a group of applications, each application in the group must use an app profile that is configured for single-cluster routing, and all of the app profiles must route requests to the same cluster.
      • You can use the instance’s additional clusters at the same time for other purposes.
    • Strong Consistency
      • For some replication use cases, Bigtable can also provide strong consistency, which ensures that all of the applications see the data in the same state.
      • To gain strong consistency, you use the single-cluster routing app-profile configuration for read-your-writes consistency, but you must not use the instance’s additional clusters unless you need to failover to a different cluster.
  • Use cases
    • Isolate real-time serving applications from batch reads
    • Improve availability
    • Provide near-real-time backup
    • Ensure your data has a global presence

Bigtable Best Practices

  • Store datasets with similar schemas in the same table, rather than in separate tables as in SQL.
  • Bigtable has a limit of 1,000 tables per instance
  • Creating many small tables is a Bigtable anti-pattern.
  • Put related columns in the same column family
  • Create up to about 100 column families per table. A higher number would lead to performance degradation.
  • Choose short but meaningful names for your column families
  • Put columns that have different data retention needs in different column families to limit storage cost.
  • Create as many columns as you need in the table. Bigtable tables are sparse, and there is no space penalty for a column that is not used in a row
  • Don’t store more than 100 MB of data in a single row as a higher number would impact performance
    • Don’t store more than 10 MB of data in a single cell.
  • Design the row key based on the queries used to retrieve the data (see the row-key sketch after this list)
  • The following queries provide the most efficient performance
    • Row key
    • Row key prefix
    • Range of rows defined by starting and ending row keys
  • Other types of queries trigger a full table scan, which is much less efficient.
  • Store multiple delimited values in each row key. Multiple identifiers can be included in the row key.
  • Use human-readable string values in your row keys whenever possible. Makes it easier to use the Key Visualizer tool.
  • Row keys anti-pattern
    • Row keys that start with a timestamp, as it causes sequential writes to a single node
    • Row keys that cause related data to not be grouped together, which would degrade the read performance
    • Sequential numeric IDs
    • Frequently updated identifiers
    • Hashed values as hashing a row key removes the ability to take advantage of Bigtable’s natural sorting order, making it impossible to store rows in a way that are optimal for querying
    • Values expressed as raw bytes rather than human-readable strings
    • Domain names, instead use the reverse domain name as the row key as related data can be clubbed.
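
A sketch of the row-key guidance above: multiple delimited, human-readable identifiers with the grouping identifier first (not a leading timestamp), which keeps related rows contiguous and enables efficient range scans; all identifiers and names are placeholders.

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")               # placeholder
table = client.instance("my-instance").table("sensor-data")  # placeholder

# Row key: customer (reverse-domain style), then device, then time bucket.
# Avoids a leading timestamp (write hotspot) while keeping one device's
# readings for a period in a contiguous, scannable range.
def make_row_key(customer: str, device: str, time_bucket: str) -> bytes:
    return f"{customer}#{device}#{time_bucket}".encode("utf-8")

# Efficient query: a range of rows bounded by start and end keys
# (here, one day of readings for a single device).
rows = table.read_rows(
    start_key=make_row_key("com.example", "device-042", "2024-01-01"),
    end_key=make_row_key("com.example", "device-042", "2024-01-02"),
)
for row in rows:
    print(row.row_key, list(row.cells.keys()))
```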

Bigtable Load Balancing

  • Each Bigtable zone is managed by a primary process, which balances workload and data volume within clusters.
  • This process redistributes the data between nodes as needed as it
    • splits busier/larger tablets in half and
    • merges less-accessed/smaller tablets together
  • Bigtable automatically manages all of the splitting, merging, and rebalancing, saving users the effort of manually administering the tablets
  • Bigtable write performance can be improved by distributing writes as evenly as possible across nodes with proper row key design.

Bigtable Consistency

  • Single-cluster instances provide strong consistency.
  • Multi-cluster instances, by default, provide eventual consistency but can be configured to provide read-over-write consistency or strong consistency, depending on the workload and app profile settings

Bigtable Security

  • Access to the tables is controlled by your Google Cloud project and the Identity and Access Management (IAM) roles assigned to the users.
  • All data stored within Google Cloud, including the data in Bigtable tables, is encrypted at rest using Google’s default encryption.
  • Bigtable supports using customer-managed encryption keys (CMEK) for data encryption.

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day, so both the answers and questions might become outdated soon; research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion and correction.
  1. Your company processes high volumes of IoT data that are time-stamped. The total data volume can be several petabytes. The data needs to be written and changed at a high speed. You want to use the most performant storage option for your data. Which product should you use?
    1. Cloud Datastore
    2. Cloud Storage
    3. Cloud Bigtable
    4. BigQuery
  2. You want to optimize the performance of an accurate, real-time, weather-charting application. The data comes from 50,000 sensors sending 10 readings a second, in the format of a timestamp and sensor reading. Where should you store the data?
    1. Google BigQuery
    2. Google Cloud SQL
    3. Google Cloud Bigtable
    4. Google Cloud Storage
  3. Your team is working on designing an IoT solution. There are thousands of devices that need to send periodic time series data for processing. Which services should be used to ingest and store the data?
    1. Pub/Sub, Datastore
    2. Pub/Sub, Dataproc
    3. Dataproc, Bigtable
    4. Pub/Sub, Bigtable

References

Google_Cloud_Bigtable

Google Cloud – Professional Cloud Developer Certification learning path

Continuing on the Google Cloud Journey, glad to have passed the sixth certification with the Professional Cloud Developer certification. The Google Cloud – Professional Cloud Developer certification exam focuses mainly on application development and deployment services, with the compute, storage, data, networking, and monitoring services covered from a developer's perspective.

Google Cloud -Professional Cloud Developer Certification Summary

  • Had 60 questions to be answered in 2 hours, compared to 50 questions in the same 2 hours for the other exams.
  • Covers a wide range of Google Cloud services mainly focusing on application and deployment services
  • Make sure you cover the case studies beforehand. I got ~5-6 questions on them, and they can really be a savior for you in the exam.
  • As mentioned for all the exams, Hands-on is a MUST, if you have not worked on GCP before make sure you do lots of labs else you would be absolutely clueless about some of the questions and commands
  • I did Coursera and A Cloud Guru, which are really vast, but hands-on or practical knowledge is a MUST.

Google Cloud – Professional Cloud Developer Certification Resources

Google Cloud – Professional Cloud Developer Certification Topics

Case Studies

Compute Services

  • Compute services like Google Compute Engine and Google Kubernetes Engine are lightly covered more from the security aspects
  • Google Compute Engine
    • Google Compute Engine is the best IaaS option for compute and provides fine-grained control
    • Compute Engine is recommended to be used with Service Account with the least privilege to provide access to Google services and the information can be queried from instance metadata.
    • Compute Engine Persistent disks can be attached to multiple VMs in read-only mode.
    • Compute Engine launch issues reasons
      • Boot disk is full.
      • Boot disk is corrupted
      • Boot Disk has an invalid master boot record (MBR).
      • Quota Errors
      • Can be debugged using Serial console
    • Preemptible VMs and their use cases. HINT –  shutdown script to perform cleanup actions
  • Google Kubernetes Engine
    • Google Kubernetes Engine, enables running containers on Google Cloud
    • Understand GKE containers, Pods, Deployments, Service, DaemonSet, StatefulSets
      • Pods are the smallest, most basic deployable objects in Kubernetes. A Pod represents a single instance of a running process in the cluster and can contain single or multiple containers
      • Deployments represent a set of multiple, identical Pods with no unique identities. A Deployment runs multiple replicas of the application and automatically replaces any instances that fail or become unresponsive.
      • StatefulSets represent a set of Pods with unique, persistent identities and stable hostnames that GKE maintains regardless of where they are scheduled
      • DaemonSets manages groups of replicated Pods. However, DaemonSets attempt to adhere to a one-Pod-per-node model, either across the entire cluster or a subset of nodes
      • A Service groups a set of Pod endpoints into a single resource. GKE Services can be exposed as ClusterIP, NodePort, and LoadBalancer types
      • Ingress object defines rules for routing HTTP(S) traffic to applications running in a cluster. An Ingress object is associated with one or more Service objects, each of which is associated with a set of Pods
    • GKE supports Horizontal Pod Autoscaler (HPA) to autoscale deployments based on CPU and Memory
    • GKE supports health checks using liveness and readiness probe
      • Readiness probes are designed to let Kubernetes know when the app is ready to serve traffic.
      • Liveness probes let Kubernetes know if the app is alive or dead.
    • Understand Workload Identity for security, which is a recommended way to provide Pods running on the cluster access to Google resources.
    • GKE integrates with Istio to provide the mTLS feature
  • Google App Engine
  • Cloud Tasks
    • is a fully managed service that allows you to manage the execution, dispatch, and delivery of a large number of distributed tasks.

Security Services

  • Cloud Identity-Aware Proxy
    • Identity-Aware Proxy IAP allows managing access to HTTP-based apps both on Google Cloud and outside of Google Cloud.
    • IAP uses Google identities and IAM and can leverage external identity providers as well like OAuth with Facebook, Microsoft, SAML, etc.
    • Signed headers using JWT provide secondary security in case someone bypasses IAP.
  • Cloud Data Loss Prevention – DLP
    • Cloud Data Loss Prevention – DLP is a fully managed service designed to help discover, classify, and protect the most sensitive data.
    • provides two key features
      • Classification is the process to inspect the data and know what data we have, how sensitive it is, and the likelihood.
      • De-identification is the process of removing, masking, redacting, or replacing information in data.
  • Web Security Scanner
    • Web Security Scanner identifies security vulnerabilities in the App Engine, GKE, and Compute Engine web applications.
    • scans provide information about application vulnerability findings, like OWASP, XSS, Flash injection, outdated libraries, cross-site scripting, clear-text passwords, or use of mixed content

Networking Services

  • Virtual Private Cloud
    • Understand Virtual Private Cloud (VPC), subnets, and host applications within them
    • Private access options for services allow instances with internal IP addresses to communicate with Google APIs and services.
    • Private Google Access allows VMs to connect to the set of external IP addresses used by Google APIs and services by enabling Private Google Access on the subnet used by the VM’s network interface.
  • Cloud Load Balancing
    • Google Cloud Load Balancing provides scaling, high availability, and traffic management for your internet-facing and private applications.

Identity Services

  • Resource Manager
    • Understand the Resource Manager hierarchy: Organization -> Folders -> Projects -> Resources
    • IAM Policy inheritance is transitive and resources inherit the policies of all of their parent resources.
    • Effective policy for a resource is the union of the policy set on that resource and the policies inherited from higher up in the hierarchy.
  • Identity and Access Management
    • Identity and Access Management – IAM provides administrators the ability to manage cloud resources centrally by controlling who can take what action on specific resources.
    • A service account is a special kind of account used by an application or a virtual machine (VM) instance, not a person.
    • Understand IAM Best Practices
      • Use groups for users requiring the same responsibilities
      • Use service accounts for server-to-server interactions.
      • Use Organization Policy Service to get centralized and programmatic control over the organization’s cloud resources.
    • Domain-wide delegation of authority to grant third-party and internal applications access to the users’ data for e.g. Google Drive etc.

Storage Services

  • Cloud Storage
    • Cloud Storage is cost-effective object storage for unstructured data and provides an option for long term data retention
    • Understand Signed URLs, which give temporary access and do not require the users to be GCP users. HINT: a Signed URL would work for direct upload to GCS without routing the traffic through App Engine or CE (see the sketch at the end of this section)
    • Understand Google Cloud Storage Classes and Object Lifecycle Management to transition objects
    • Retention Policies help define the retention period for the bucket, before which the objects in the bucket cannot be deleted.
    • Bucket Lock feature allows configuring a data retention policy for a bucket that governs how long objects in the bucket must be retained. The feature also allows locking the data retention policy, permanently preventing the policy from being reduced or removed
    • Know Cloud Storage Best Practices esp. GCS auto-scaling performs well if requests ramp up gradually rather than having a sudden spike. Also, retry using exponential back-off strategy
    • Cloud Storage can be used to host static websites
    • Cloud CDN can be used with Cloud Storage to improve performance and enable caching
  • DataStore/FireStore
    • Cloud Datastore/Firestore provides a managed NoSQL document database built for automatic scaling, high performance, and ease of application development.
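
A sketch of generating a V4 signed URL that lets a non-GCP user upload an object directly to Cloud Storage, as hinted above; it assumes the credentials in use can sign (e.g., a service account key), and the bucket and object names are placeholders.

```python
import datetime

from google.cloud import storage

client = storage.Client()                   # assumes signing-capable credentials
bucket = client.bucket("my-upload-bucket")  # placeholder bucket
blob = bucket.blob("incoming/report.csv")   # placeholder object name

url = blob.generate_signed_url(
    version="v4",
    expiration=datetime.timedelta(minutes=15),  # temporary access only
    method="PUT",                               # allow a direct upload
    content_type="text/csv",
)
print("Upload with: curl -X PUT -H 'Content-Type: text/csv' --upload-file report.csv", url)
```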

Developer Tools

  • Google Cloud Build
    • Cloud Build integrates with Cloud Source Repositories, GitHub, and GitLab and can be used for Continuous Integration and Deployments.
    • Cloud Build can import source code, execute build to the specifications, and produce artifacts such as Docker containers or Java archives
    • Cloud Build build config file specifies the instructions to perform, with steps defined for each task like test, build, and deploy.
    • Cloud Build supports custom images as well for the steps
    • Cloud Build uses a directory named /workspace as a working directory and the assets produced by one step can be passed to the next one via the persistence of the /workspace directory.
  • Google Cloud Code
    • Cloud Code helps write, debug, and deploy cloud-based applications from IntelliJ, VS Code, or the browser.
  • Google Cloud Client Libraries
    • Google Cloud Client Libraries provide client libraries and SDKs in various languages for calling Google Cloud APIs.
    • If the language is not supported, Cloud Rest APIs can be used.
  • Deployment Techniques
    • Recreate deployment – fully scale down the existing application version before you scale up the new application version.
    • Rolling update – update a subset of running application instances instead of simultaneously updating every application instance
    • Blue/Green deployment – (also known as a red/black deployment), you perform two identical deployments of your application
    • GKE supports Rolling and Recreate deployments.
      • Rolling deployments support maxSurge (how many new Pods can be created above the desired count) and maxUnavailable (how many existing Pods can be unavailable during the update)
    • Managed Instance Groups support Rolling deployments using the maxSurge (new VM instances are created) and maxUnavailable (existing VM instances are taken offline) configurations
  • Testing Strategies
    • Canary testing – partially roll out a change and then evaluate its performance against a baseline deployment
    • A/B testing – test a hypothesis by using variant implementations. A/B testing is used to make business decisions (not only predictions) based on the results derived from data.

Data Services

  • Bigtable
  • Cloud Pub/Sub
    • Understand Cloud Pub/Sub as an asynchronous messaging service
    • Know patterns for One to Many, Many to One, and Many to Many
    • roles/pubsub.publisher and roles/pubsub.subscriber provide applications with the ability to publish and consume messages.
  • Cloud SQL
    • Cloud SQL is a fully managed service that provides MySQL, PostgreSQL, and Microsoft SQL Server.
    • HA configuration provides data redundancy and failover capability with minimal downtime when a zone or instance becomes unavailable due to a zonal outage, or an instance corruption
    • Read replicas help scale reads horizontally without degrading the primary instance’s performance
  • Cloud Spanner
    • is a fully managed relational database with unlimited scale, strong consistency, and up to 99.999% availability.
    • can read and write up-to-date strongly consistent data globally
    • Multi-region instances give higher availability guarantees (99.999% availability) and global scale.
    • Cloud Spanner’s table interleaving is a good choice for many parent-child relationships where the child table’s primary key includes the parent table’s primary key columns (see the sketch below).
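
To illustrate interleaving, here is a minimal sketch using the google-cloud-spanner Python client; the instance, database, and table names are hypothetical, and a parent table Singers with PRIMARY KEY (SingerId) is assumed to already exist:

    # Minimal sketch: create a child table interleaved in its parent so that
    # child rows are stored physically with the parent row (names are hypothetical).
    from google.cloud import spanner

    client = spanner.Client()                    # uses Application Default Credentials
    database = client.instance("my-instance").database("my-database")

    operation = database.update_ddl([
        """CREATE TABLE Albums (
             SingerId   INT64 NOT NULL,
             AlbumId    INT64 NOT NULL,
             AlbumTitle STRING(MAX)
           ) PRIMARY KEY (SingerId, AlbumId),
             INTERLEAVE IN PARENT Singers ON DELETE CASCADE"""
    ])
    operation.result()                           # wait for the schema change to complete

Note how the child table’s primary key (SingerId, AlbumId) starts with the parent table’s key column, which is exactly the parent-child pattern described above.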

Monitoring

  • Google Cloud Monitoring or Stackdriver
    • provides monitoring, alerting, error reporting, metrics, diagnostics, debugging, and tracing.
    • Cloud Monitoring helps gain visibility into the performance, availability, and health of your applications and infrastructure.
  • Google Cloud Logging or Stackdriver logging
    • Cloud Logging provides real-time log management and analysis
    • Cloud Logging allows ingestion of custom log data from any source (see the sketch after this list)
    • Logs can be exported by configuring log sinks to BigQuery, Cloud Storage, or Pub/Sub.
    • The Cloud Logging agent can be installed on VMs to capture system and application logs.
  • Cloud Error Reporting
    • counts, analyzes, and aggregates the crashes in the running cloud services
  • Cloud Trace
    • is a distributed tracing system that collects latency data from the applications and displays it in the Google Cloud Console.
  • Cloud Debugger
    • is a feature of Google Cloud that lets you inspect the state of a running application in real-time, without stopping or slowing it down
    • Debug Logpoints allow logging injection into running services without restarting or interfering with the normal function of the service
    • Debug Snapshots help capture local variables and the call stack at a specific line location in your app’s source code
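
As a small illustration of custom log ingestion, here is a minimal sketch using the google-cloud-logging Python client; the log name and payloads are hypothetical, and exporting these entries to BigQuery or Cloud Storage is then just a matter of configuring a log sink:

    # Minimal sketch: write custom text and structured entries to Cloud Logging
    # (log name and payloads are hypothetical).
    import google.cloud.logging

    client = google.cloud.logging.Client()       # uses Application Default Credentials
    logger = client.logger("my-app-log")         # hypothetical log name

    logger.log_text("User signup completed", severity="INFO")
    logger.log_struct(
        {"event": "signup", "user_id": "u-123", "latency_ms": 84},
        severity="DEBUG",
    )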

All the Best !!

Google Cloud Pub/Sub – Asynchronous Messaging

Google Cloud Pub/Sub

  • Pub/Sub is a fully managed, asynchronous messaging service designed to be highly reliable and scalable.
  • Pub/Sub service allows applications to exchange messages reliably, quickly, and asynchronously
  • Pub/Sub allows services to communicate asynchronously, with latencies on the order of 100 milliseconds.
  • Pub/Sub enables the creation of event producers and consumers, called publishers and subscribers.
  • Publishers communicate with subscribers asynchronously by broadcasting events, rather than by synchronous remote procedure calls.
  • Pub/Sub offers at-least-once message delivery and best-effort ordering to existing subscribers
  • Pub/Sub accepts a maximum of 1,000 messages in a batch, and the size of a batch cannot exceed 10 megabytes.

Pub/Sub Core Concepts

  • Topic: A named resource to which messages are sent by publishers.
  • Publisher: An application that creates and sends messages to a topic(s).
  • Subscriber: An application with a subscription to a topic(s) to receive messages from it.
  • Subscription: A named resource representing the stream of messages from a single, specific topic, to be delivered to the subscribing application.
  • Message: The combination of data and (optional) attributes that a publisher sends to a topic and is eventually delivered to subscribers.
  • Message attribute: A key-value pair that a publisher can define for a message.
  • Acknowledgment (or “ack”): A signal sent by a subscriber to Pub/Sub after it has received a message successfully. Acked messages are removed from the subscription’s message queue.
  • Schema: A schema is a format that messages must follow, creating a contract between publisher and subscriber that Pub/Sub will enforce
  • Push and pull: The two message delivery methods. A subscriber receives messages either by Pub/Sub pushing them to the subscriber’s chosen endpoint or by the subscriber pulling them from the service (see the sketch below).
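
A minimal sketch of the publish/pull/ack flow using the google-cloud-pubsub Python client; the project, topic, and subscription IDs are hypothetical and both resources are assumed to already exist:

    # Minimal sketch: publish a message with an attribute, then pull and ack it.
    from concurrent.futures import TimeoutError
    from google.cloud import pubsub_v1

    project_id, topic_id, subscription_id = "my-project", "my-topic", "my-sub"

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    future = publisher.publish(topic_path, b"order received", event_type="order")
    print("published message id:", future.result())   # blocks until Pub/Sub confirms

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(project_id, subscription_id)

    def callback(message):
        print("received:", message.data, dict(message.attributes))
        message.ack()                                  # acked messages are removed from the subscription

    streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
    with subscriber:
        try:
            streaming_pull_future.result(timeout=30)   # listen for 30 seconds
        except TimeoutError:
            streaming_pull_future.cancel()
            streaming_pull_future.result()             # block until the shutdown completes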

Message lifecycle

Pub/Sub Subscription Properties

  • Delivery method
    • Messages can be received with pull or push delivery.
    • In pull delivery, the subscriber application initiates requests to the Pub/Sub server to retrieve messages.
    • In push delivery, Pub/Sub initiates requests to the subscriber application to deliver messages. The push endpoint must be a publicly accessible HTTPS address.
    • If unspecified, Pub/Sub subscriptions use pull delivery.
    • Messages published before a subscription is created will not be delivered to that subscription
  • Acknowledgment deadline
    • A message not acknowledged before the deadline is sent again.
    • Default acknowledgment deadline is 10 secs. with a max of 10 mins.
  • Message retention duration
    • Message retention duration specifies how long Pub/Sub retains messages after publication.
    • Acknowledged messages are no longer available to subscribers and are deleted, by default
    • After the message retention duration, Pub/Sub might discard the message, regardless of its acknowledgment state.
    • Default message retention duration is 7 days with a min-max of 10 mins – 7 days
  • Dead-letter topics
    • If a subscriber can’t acknowledge a message, Pub/Sub can forward the message to a dead-letter topic.
    • With a dead-letter topic, message ordering can’t be enabled
    • With a dead-letter topic, the maximum number of delivery attempts can be specified.
    • Default is 5 delivery attempts; with a min-max of 5-100
  • Expiration period
    • Subscriptions expire if there is no subscriber activity such as open connections, active pulls, or successful pushes
    • The subscription deletion clock restarts if subscriber activity is detected
    • Default expiration period is 31 days with a min-max of 1 day-never
  • Retry policy
    • If the acknowledgment deadline expires or a subscriber responds with a negative acknowledgment, Pub/Sub can send the message again using exponential backoff.
    • If the retry policy isn’t set, Pub/Sub resends the message as soon as the acknowledgment deadline expires or a subscriber responds with a negative acknowledgment.
  • Message ordering
    • If publishers in the same region send messages with an ordering key and message ordering is enabled, Pub/Sub delivers the messages in order (see the sketch after this list).
    • If not set, Pub/Sub doesn’t deliver messages in order, including messages with ordering keys.
  • Filter
    • Filter is a string with a filtering expression where the subscription only delivers the messages that match the filter.
    • Pub/Sub service automatically acknowledges the messages that don’t match the filter.
    • Messages can be filtered using their attributes.
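
A minimal sketch of ordered publishing with the Python client; ordering must also be enabled on the subscription, publishers must publish to the same region, and the project/topic IDs are hypothetical:

    # Minimal sketch: publish messages with an ordering key so that a subscription
    # with message ordering enabled receives them in publish order.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient(
        publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
    )
    topic_path = publisher.topic_path("my-project", "my-topic")

    for seq in range(3):
        publisher.publish(
            topic_path,
            f"event-{seq}".encode(),
            ordering_key="customer-42",   # all messages with this key are delivered in order
        )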

Pub/Sub Seek Feature

  • Acknowledged messages are no longer available to subscribers and are deleted
  • Subscriber clients must process every message in a subscription even if only a subset is needed.
  • Seek feature extends subscriber functionality by allowing you to alter the acknowledgment state of messages in bulk (see the sketch after this list)
  • Timestamp Seeking
    • With Seek feature, you can replay previously acknowledged messages or purge messages in bulk
    • Seeking to a time marks every message received by Pub/Sub before the time as acknowledged, and all messages received after the time as unacknowledged.
    • Seeking to a time in the future allows you to purge messages.
      • Seeking to a time in the past allows replaying and reprocessing previously acknowledged messages
    • Timestamp seeking approach is imprecise as
      • Possible clock skew among Pub/Sub servers.
      • Pub/Sub has to work with the arrival time of the publish request rather than when an event occurred in the source system.
  • Snapshot Seeking
    • State of one subscription can be copied to another by using seek in combination with a Snapshot.
    • Once a snapshot is created, it retains:
      • All messages that were unacknowledged in the source subscription at the time of the snapshot’s creation.
      • Any messages published to the topic thereafter.
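
A minimal sketch of a timestamp seek with the Python client; the project/subscription IDs are hypothetical, and replaying previously acknowledged messages additionally requires the subscription to be configured to retain acknowledged messages:

    # Minimal sketch: seek a subscription one hour back in time to replay messages.
    from datetime import datetime, timedelta, timezone

    from google.cloud import pubsub_v1
    from google.protobuf import timestamp_pb2

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path("my-project", "my-sub")

    ts = timestamp_pb2.Timestamp()
    ts.FromDatetime(datetime.now(timezone.utc) - timedelta(hours=1))

    # Messages received before this time are marked acknowledged; messages
    # received after it are marked unacknowledged, i.e. they are redelivered.
    subscriber.seek(request={"subscription": subscription_path, "time": ts})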

Pub/Sub Locations

  • Pub/Sub servers run in all GCP regions around the world, which helps offer fast, global data access while giving users control over where messages are stored
  • Cloud Pub/Sub offers global data access in that publisher and subscriber clients are not aware of the location of the servers to which they connect or how the service routes the data.
  • Pub/Sub’s load balancing mechanisms direct publisher traffic to the nearest GCP data center where data storage is allowed, as defined in the Resource Location Restriction
  • Publishers in multiple regions may publish messages to a single topic with low latency. Any individual message is stored in a single region. However, a topic may have messages stored in many regions.
  • A subscriber client requesting messages published to a topic connects to the nearest server, which aggregates messages published to the topic from all regions for delivery to the client.
  • Message Storage Policy
    • Message Storage Policy helps ensure that messages published to a topic are never persisted outside a set of specified Google Cloud regions, regardless of where the publish requests originate.
    • Pub/Sub chooses the nearest allowed region, when multiple regions are allowed by the policy

Pub/Sub Security

  • Pub/Sub encrypts messages with Google-managed keys, by default.
  • Every message is encrypted at the following states and layers:
    • At rest
      • Hardware layer
      • Infrastructure layer
      • Application layer
        • Pub/Sub individually encrypts incoming messages as soon as the message is received
    • In transit
  • Pub/Sub does not encrypt message attributes at the application layer.
  • Message attributes are still encrypted at the hardware and infrastructure layers.

Common use cases

  • Ingesting user interaction and server events
  • Real-time event distribution
  • Replicating data among databases
  • Parallel processing and workflows
  • Data streaming from IoT devices
  • Refreshing distributed caches
  • Load balancing for reliability

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.

References

Google_Cloud_Pub/Sub

Google Kubernetes Engine – GKE Security

GKE Security

Google Kubernetes Engine – GKE Security provides multiple layers of security to secure workloads including the contents of the container image, the container runtime, the cluster network, and access to the cluster API server.

Authentication and Authorization

  • Kubernetes supports two types of authentication:
    • User accounts are accounts that are known to Kubernetes but are not managed by Kubernetes
    • Service accounts are accounts that are created and managed by Kubernetes but can only be used by Kubernetes-created entities, such as pods.
  • In a GKE cluster, Kubernetes user accounts are managed by Google Cloud and can be of the following types
    • Google Account
    • Google Cloud service account
  • Once authenticated, these identities need to be authorized to create, read, update or delete Kubernetes resources.
  • Kubernetes service accounts and Google Cloud service accounts are different entities.
    • Kubernetes service accounts are part of the cluster in which they are defined and are typically used within that cluster.
    • Google Cloud service accounts are part of a Google Cloud project and can easily be granted permissions both within clusters and to the clusters themselves, as well as to any Google Cloud resource, using IAM.

Control Plane Security

  • In GKE, the Kubernetes control plane components are managed and maintained by Google.
  • Control plane components host the software that runs the Kubernetes control plane, including the API server, scheduler, controller manager, and the etcd database where the Kubernetes configuration is persisted.
  • By default, the control plane components use a public IP address.
  • The Kubernetes API server can be protected by using authorized networks and private clusters, which assign a private IP address to the control plane and disable access on the public IP address.
  • Control plane can also be secured by doing credential rotation on a regular basis. When credential rotation is initiated, the SSL certificates and cluster certificate authority are rotated. This process is automated by GKE and also ensures that your control plane IP address rotates.

Node Security

Container-Optimized OS

  • GKE nodes, by default, use Google’s Container-Optimized OS as the operating system on which to run Kubernetes and its components.
  • Container-Optimized OS features include
    • Locked-down firewall
    • Read-only filesystem where possible
    • Limited user accounts and disabled root login

Node upgrades

  • GKE recommends upgrading nodes on a regular basis to patch the OS for security issues in the container runtime, Kubernetes itself, or the node operating system
  • GKE also allows automatic as well as manual upgrades

Protecting nodes from untrusted workloads

  • GKE Sandbox can be enabled on the cluster to isolate untrusted workloads in sandboxes on the node if the clusters run unknown or untrusted workloads.
  • GKE Sandbox is built using gVisor, an open-source project.

Securing instance metadata

  • GKE nodes run as CE instances and have access to instance metadata by default, which a Pod running on the node does not necessarily need.
  • Sensitive instance metadata paths can be locked down by disabling legacy APIs and by using metadata concealment.
  • Metadata concealment ensures that Pods running in a cluster are not able to access sensitive data by filtering requests to fields such as the kube-env

Network Security

  • Network Policies help cluster administrators and users to lock down the ingress and egress connections created to and from the Pods in a namespace
  • Network policies use Pod and namespace label selectors to define the traffic allowed to and from the Pods.
  • mTLS for Pod-to-Pod communication can be enabled using the Istio service mesh

Giving Pods Access to Google Cloud resources

Workload Identity (recommended)

  • Simplest and most secure way to authorize Pods to access Google Cloud resources is with Workload Identity.
  • Workload identity allows a Kubernetes service account to run as a Google Cloud service account.
  • Pods that run as the Kubernetes service account have the permissions of the Google Cloud service account.

Node Service Account

  • Pods can authenticate to Google Cloud using the node’s service account credentials exposed through the instance metadata server.
  • Node Service Account credentials can be reached by any Pod running in the cluster if Workload Identity is not enabled.
  • It is recommended to create and configure a custom service account that has the minimum IAM roles required by all the Pods running in the cluster.

Service Account JSON key

  • Applications can access Google Cloud resources by using the service account’s key.
  • This approach is NOT recommended because of the difficulty of securely managing account keys.
  • Each service account is assigned only the IAM roles that are needed for its paired application to operate successfully. Keeping the service account application-specific makes it easier to revoke its access in the case of a compromise without affecting other applications.
  • A JSON service account key can be created and then mounted into the Pod using a Kubernetes Secret.

Binary Authorization

  • Binary Authorization works with images deployed to GKE from Container Registry or another container image registry.
  • Binary Authorization helps ensure that internal processes that safeguard the quality and integrity of the software have successfully completed before an application is deployed to the production environment.
  • Binary Authorization provides:
    • A policy model that lets you describe the constraints under which images can be deployed
    • An attestation model that lets you define trusted authorities who can attest or verify that required processes in your environment have completed before deployment
    • A deploy-time enforcer that prevents images that violate the policy from being deployed

GCP Certification Exam Practice Questions

  1. You are building a product on top of Google Kubernetes Engine (GKE). You have a single GKE cluster. For each of your customers, a Pod is running in that cluster, and your customers can run arbitrary code inside their Pod. You want to maximize the isolation between your customers’ Pods. What should you do?
    1. Use Binary Authorization and whitelist only the container images used by your customers’ Pods.
    2. Use the Container Analysis API to detect vulnerabilities in the containers used by your customers’ Pods.
    3. Create a GKE node pool with a sandbox type configured to gVisor. Add the parameter runtimeClassName: gvisor to the specification of your customers’ Pods.
    4. Use the cos_containerd image for your GKE nodes. Add a nodeSelector with the value cloud.google.com/gke-os-distribution: cos_containerd to the specification of your customers’ Pods.

References

Google Kubernetes Engine – Security Overview

Google Cloud Functions

Google Cloud Functions

  • Cloud Functions is a serverless execution environment for building and connecting cloud services
  • Cloud Functions provide scalable pay-as-you-go functions as a service (FaaS) to run code with zero server management.
  • Cloud Functions are attached to events emitted from the cloud services and infrastructure and are triggered when an event being watched is fired.
  • Cloud Functions supports multiple language runtimes including Node.js, Python, Go, Java, .Net, Ruby, PHP, etc.
  • Cloud Functions features include
    • Zero server management
      • No servers to provision, manage or upgrade
      • Google Cloud handles the operational infrastructure including managing servers, configuring software, updating frameworks, and patching operating systems
      • Provisioning of resources happens automatically in response to events
    • Automatically scale based on the load
      • Cloud Function can scale from a few invocations a day to many millions of invocations without any work from you.
    • Integrated monitoring, logging, and debugging capability
    • Built-in security at role and per function level based on the principle of least privilege
      • Cloud Functions uses Google Service Account credential to seamlessly authenticate with the majority of Google Cloud services
    • Key networking capabilities for hybrid and multi-cloud scenarios

Cloud Functions Execution Environment

  • Cloud Functions handles an incoming request by assigning it to an instance of the function; depending on the request volume and the number of existing instances, it either reuses an existing instance or spawns a new one.
  • Each instance of a function handles only one concurrent request at a time and can use the full amount of resources i.e. CPU and Memory
  • Cloud Functions may start multiple new instances to handle requests, thus provide auto-scaling and parallelism.
  • Cloud Functions must be stateless i.e. one function invocation should not rely on an in-memory state set by a previous invocation, to allow Google to automatically manage and scale the functions
  • Every deployed function is isolated from all other functions – even those deployed from the same source file. In particular, they don’t share memory, global variables, file systems, or other state.
  • Cloud Functions allows you to set a limit on the total number of function instances that can co-exist at any given time
  • A Cloud Function instance is created when the function is deployed or when it needs to be scaled up
  • Cloud Functions can have a Cold Start, which is the time involved in loading the runtime and the code.
  • Function execution time is limited by the timeout duration specified at function deployment time. By default, a function times out after 1 minute but can be extended up to 9 minutes.
  • Cloud Function provides a writeable filesystem i.e. /tmp directory only, which can be used to store temporary files in a function instance.  The rest of the file system is read-only and accessible to the function
  • Cloud Functions has 2 scopes
    • Global Scope
      • contain the function definition,
      • is executed on every cold start, but not if the instance has already been initialized.
      • can be used for initialization like database connections etc. (see the sketch after this list)
    • Function Scope
      • only the body of the function declared as the entry point
      • is executed for each request and should include the actual logic
  • Cloud Functions Execution Guarantees
    • Functions are typically invoked once for each incoming event. However, Cloud Functions does not guarantee a single invocation in all cases
    • HTTP functions are invoked at most once as they are synchronous and the execution is not retried in the event of a failure
    • Event-driven functions are invoked at least once as they are asynchronous and can be retried
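
A minimal sketch of an HTTP-triggered Python function showing the two scopes: an expensive client is lazily initialized in the global scope so warm instances reuse it, while request handling stays in the function scope (the bucket name and function name are hypothetical):

    # Minimal sketch of an HTTP Cloud Function (Python runtime).
    import json
    from google.cloud import storage

    # Global scope: runs on cold start and is reused by later invocations on the
    # same instance - a good place for expensive objects such as API clients.
    _storage_client = None


    def _get_client():
        global _storage_client
        if _storage_client is None:          # lazy initialization (best practice)
            _storage_client = storage.Client()
        return _storage_client


    def list_uploads(request):
        """Function scope: the HTTP entry point, executed for every request."""
        bucket_name = request.args.get("bucket", "my-uploads-bucket")
        names = [b.name for b in _get_client().list_blobs(bucket_name, max_results=10)]
        return json.dumps({"objects": names}), 200, {"Content-Type": "application/json"}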

Cloud Functions Events and Triggers

  • Events are things that happen within the cloud environment that you might want to take action on.
  • Trigger is creating a response to that event. Trigger type determines how and when the function executes.
  • Cloud Functions supports the following native trigger mechanisms:
    • HTTP Triggers
      • Cloud Functions can be invoked with an HTTP request using the POST, PUT, GET, DELETE, and OPTIONS HTTP methods
      • HTTP invocations are synchronous and the result of the function execution will be returned in the response to the HTTP request.
    • Cloud Endpoints Triggers
      • Cloud Functions can be invoked through Cloud Endpoints, which uses the Extensible Service Proxy V2 (ESPv2) as an API gateway
      • ESPv2 intercepts all requests to the functions and performs any necessary checks (such as authentication) before invoking the function. ESPv2 also gathers and reports telemetry
    • Cloud Pub/Sub Triggers
      • Cloud Functions can be triggered by messages published to Pub/Sub topics in the same Cloud project as the function (see the sketch after this list).
      • Pub/Sub is a globally distributed message bus that automatically scales as needed and provides a foundation for building robust, global services.
    • Cloud Storage Triggers
      • Cloud Functions can respond to change notifications emerging from Google Cloud Storage.
      • Notifications can be configured to trigger in response to various events inside a bucket – object creation, deletion, archiving, and metadata updates.
      • Cloud Functions can only be triggered by Cloud Storage buckets in the same Google Cloud Platform project.
    • Direct Triggers
      • Cloud Functions provides a call command in the command-line interface and testing functionality in the Cloud Console UI to support quick iteration and debugging
      • Function can be directly invoked to ensure it is behaving as expected. This causes the function to execute immediately, even though it may have been deployed to respond to a specific event.
    • Cloud Firestore
      • Cloud Functions can handle events in Cloud Firestore in the same Cloud project as the function.
      • Cloud Firestore can be read or updated in response to these events using the Firestore APIs and client libraries.
    • Analytics for Firebase
    • Firebase Realtime Database
    • Firebase Authentication
      • Cloud Functions can be triggered by events from Firebase Authentication in the same Cloud project as the function.
  • Cloud Functions can also be integrated with any other Google service that supports Cloud Pub/Sub for e.g. Cloud Scheduler, or any service that provides HTTP callbacks (webhooks)
  • Google Cloud Logging events can be exported to a Cloud Pub/Sub topic from which they can then be consumed by Cloud Functions.
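
A minimal sketch of an event-driven Python function triggered by a Pub/Sub topic; the message payload arrives base64-encoded in the event, and since event-driven functions are invoked at least once, the handler should be idempotent (names and payload are hypothetical):

    # Minimal sketch of a background Cloud Function with a Pub/Sub trigger.
    import base64
    import json


    def handle_order_event(event, context):
        """Entry point for a Pub/Sub trigger; may be invoked more than once per message."""
        payload = base64.b64decode(event["data"]).decode("utf-8") if "data" in event else "{}"
        attributes = event.get("attributes") or {}

        order = json.loads(payload)          # hypothetical JSON message body
        print(f"event_id={context.event_id} type={attributes.get('event_type')} order={order}")
        # Keep the processing idempotent: the same message may be delivered again.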

Cloud Functions Best Practices

  • Write idempotent functions – they should produce the same result when invoked multiple times with the same parameters
  • Do not start background activities i.e. activity that runs after the function has terminated. Any code run after graceful termination cannot access the CPU and will not make any progress.
  • Always delete temporary files – As files can persist between invocations, failing to delete files may lead to memory issues
  • Use dependencies wisely – Import only what is required as it would impact the cold starts due to invocation latency
  • Use global variables to reuse objects in future invocations for e.g. database connections
  • Do lazy initialization of global variables
  • Use retry to handle only transient and retryable errors, with the handling being idempotent


Reference

Google_Cloud_Functions

Google Cloud – HipLocal Case Study

Google Cloud – HipLocal Case Study

HipLocal is a community application designed to facilitate communication between people in close proximity. It is used for event planning and organizing sporting events, and for businesses to connect with their local communities. HipLocal launched recently in a few neighborhoods in Dallas and is rapidly growing into a global phenomenon. Its unique style of hyper-local community communication and business outreach is in demand around the world.

Key point here is HipLocal is expanding globally

HipLocal Solution Concept

HipLocal wants to expand their existing service with updated functionality in new locations to better serve their global customers. They want to hire and train a new team to support these locations in their time zones. They will need to ensure that the application scales smoothly and provides clear uptime data, and that they analyze and respond to any issues that occur.

Key points here are HipLocal wants to expand globally, with an ability to scale and provide clear observability, alerting and ability to react.

HipLocal Existing Technical Environment

HipLocal’s environment is a mixture of on-premises hardware and infrastructure running in Google Cloud. The HipLocal team understands their application well, but has limited experience in globally scaled applications. Their existing technical environment is as follows:

  • Existing APIs run on Compute Engine virtual machine instances hosted in Google Cloud.
  • State is stored in a single instance MySQL database in Google Cloud.
  • Release cycles include development freezes to allow for QA testing.
  • The application has no consistent logging.
  • Applications are manually deployed by infrastructure engineers during periods of slow traffic on weekday evenings.
  • There are basic indicators of uptime; alerts are frequently fired when the APIs are unresponsive.

Business requirements

HipLocal’s investors want to expand their footprint and support the increase in demand they are experiencing. Their requirements are:

  • Expand availability of the application to new locations.
    • Availability can be achieved using either
      • scaling the application and exposing it through Global Load Balancer OR
      • deploying the applications across multiple regions.
  • Support 10x as many concurrent users.
    • As the APIs run on Compute Engine, the scale can be implemented using Managed Instance Groups fronted by a Load Balancer OR App Engine OR container-based application deployment
    • Scaling policies can be defined to scale as per the demand.
  • Ensure a consistent experience for users when they travel to different locations.
    • Consistent experience for the users can be provided using either
      • Google Cloud Global Load Balancer which uses GFE and routes traffic close to the users
      • multi-region setup targeting each region
  • Obtain user activity metrics to better understand how to monetize their product.
    • User activity data can also be exported to BigQuery for analytics and monetization
    • Cloud Monitoring and Logging can be configured for application logs and metrics to provide observability, alerting, and reporting.
    • Cloud Logging can be exported to BigQuery for analytics
  • Ensure compliance with regulations in the new regions (for example, GDPR).
    • Compliance is a shared responsibility; while Google Cloud ensures compliance of its services, applications hosted on Google Cloud remain the customer’s responsibility
    • GDPR or other data residency regulations can be met using a per-region setup, so that data resides within the region
  • Reduce infrastructure management time and cost.
    • As the infrastructure is spread across on-premises and Google Cloud, it would make sense to consolidate the infrastructure into one place i.e. Google Cloud
    • Consolidation would help in automation, maintenance, as well as provide cost benefits.
  • Adopt the Google-recommended practices for cloud computing:
    • Develop standardized workflows and processes around application lifecycle management.
    • Define service level indicators (SLIs) and service level objectives (SLOs).

Technical requirements

  • Provide secure communications between the on-premises data center and cloud hosted applications and infrastructure
    • Secure communications can be enabled between the on-premises data centers and the cloud using Cloud VPN and Interconnect.
  • The application must provide usage metrics and monitoring.
    • Cloud Monitoring and Logging can be configured for application logs and metrics to provide observability, alerting, and reporting.
  • APIs require authentication and authorization.
    • APIs can be configured for various Authentication mechanisms.
    • APIs can be exposed through a centralized Cloud Endpoints gateway
    • Internal Applications can be exposed using Cloud Identity-Aware Proxy
  • Implement faster and more accurate validation of new features.
    • QA Testing can be improved using automated testing
    • Production Release cycles can be improved using canary deployments to test the applications on a smaller base before rolling out to all.
    • Application can be deployed to App Engine, which supports traffic splitting out of the box for canary releases
  • Logging and performance metrics must provide actionable information to be able to provide debugging information and alerts.
    • Cloud Monitoring and Logging can be configured for application logs and metrics to provide observability, alerting, and reporting.
    • Cloud Logging can be exported to BigQuery for analytics
  • Must scale to meet user demand.
    • As the APIs run on Compute Engine, the scale can be implemented using Managed Instance Groups fronted by a Load Balancer and using scaling policies as per the demand.
    • The single instance MySQL database can be migrated as-is to Cloud SQL without any application code changes, with read replicas added to scale reads horizontally.

GCP Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • GCP services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep up the pace with GCP updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Which database should HipLocal use for storing state while minimizing application changes?
    1. Firestore
    2. BigQuery
    3. Cloud SQL
    4. Cloud Bigtable
  2. Which architecture should HipLocal use for log analysis?
    1. Use Cloud Spanner to store each event.
    2. Start storing key metrics in Memorystore.
    3. Use Cloud Logging with a BigQuery sink.
    4. Use Cloud Logging with a Cloud Storage sink.
  3. HipLocal wants to improve the resilience of their MySQL deployment, while also meeting their business and technical requirements. Which configuration should they choose?
    1. Use the current single instance MySQL on Compute Engine and several read-only MySQL servers on Compute Engine.
    2. Use the current single instance MySQL on Compute Engine, and replicate the data to Cloud SQL in an external master configuration.
    3. Replace the current single instance MySQL instance with Cloud SQL, and configure high availability.
    4. Replace the current single instance MySQL instance with Cloud SQL, and Google provides redundancy without further configuration.
  4. Which service should HipLocal use to enable access to internal apps?
    1. Cloud VPN
    2. Cloud Armor
    3. Virtual Private Cloud
    4. Cloud Identity-Aware Proxy
  5. Which database should HipLocal use for storing user activity?
    1. BigQuery
    2. Cloud SQL
    3. Cloud Spanner
    4. Cloud Datastore

Reference

Case_Study_HipLocal