Google Cloud Data Analytics Services Cheat Sheet

Cloud Pub/Sub

  • Pub/Sub is a fully managed, asynchronous messaging service designed to be highly reliable and scalable with latencies on the order of 100 ms
  • Pub/Sub offers at-least-once message delivery and best-effort ordering to existing subscribers
  • Pub/Sub enables the creation of event producers and consumers, called publishers and subscribers.
  • Pub/Sub messages should be no greater than 10MB in size.
  • Messages can be received with pull or push delivery.
  • Messages published before a subscription is created will not be delivered to that subscription
  • Acknowledged messages are no longer available to subscribers and are deleted by default. However, they can be retained by configuring a message retention duration.
  • If publishers send messages with an ordering key and message ordering is enabled on the subscription, Pub/Sub delivers the messages in order (see the sketch after this list).
  • Pub/Sub supports encryption at rest and encryption in transit.
  • The Seek feature allows subscribers to alter the acknowledgment state of messages in bulk, in order to replay or purge messages.
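
A minimal Python sketch of the publish and pull flows above, assuming the google-cloud-pubsub client library and that the placeholder project, topic, and pull subscription already exist; message ordering must also be enabled on the subscription for the ordering key to take effect.

```python
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"       # placeholder
TOPIC_ID = "orders-topic"       # placeholder topic, assumed to exist
SUBSCRIPTION_ID = "orders-sub"  # placeholder pull subscription, assumed to exist

# Publish with an ordering key (the publisher must opt in to message ordering).
publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
future = publisher.publish(topic_path, b'{"order_id": 42}', ordering_key="customer-123")
print("Published message id:", future.result())

# Synchronous pull, then acknowledge so the messages are not redelivered.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 10})
for received in response.received_messages:
    print("Received:", received.message.data)
if response.received_messages:
    subscriber.acknowledge(
        request={
            "subscription": subscription_path,
            "ack_ids": [m.ack_id for m in response.received_messages],
        }
    )
```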

BigQuery

  • BigQuery is a fully managed, durable, petabyte scale, serverless, highly scalable, and cost-effective multi-cloud data warehouse.
  • supports a standard SQL dialect
  • automatically replicates data and keeps a seven-day history of changes, allowing easy restoration and comparison of data from different times
  • supports federated data and can process external data sources in GCS for Parquet and ORC open-source file formats, transactional databases (Bigtable, Cloud SQL), or spreadsheets in Drive without moving the data.
  • Data model consists of datasets, which contain tables and views
  • BigQuery performance can be improved using Partitioned tables and Clustered tables.
  • BigQuery encrypts all data at rest and supports encryption in transit.
  • BigQuery Data Transfer Service automates data movement into BigQuery on a scheduled, managed basis
  • Best Practices
    • Control projection, avoid select *
    • Estimate costs before running queries; queries are billed according to the number of bytes read, and the cost can be estimated using the --dry_run flag (see the sketch after this list)
    • Use the maximum bytes billed setting to limit query costs.
    • Use clustering and partitioning to reduce the amount of data scanned.
    • Avoid repeatedly transforming data via SQL queries. Materialize the query results in stages.
    • Use streaming inserts only if the data must be immediately available as streaming data is charged.
    • Prune partitioned queries, use the _PARTITIONTIME pseudo column to filter the partitions.
    • Denormalize data whenever possible using nested and repeated fields.
    • Avoid external data sources, if query performance is a top priority
    • Avoid using JavaScript user-defined functions
    • Optimize Join patterns. Start with the largest table.
    • Use the expiration settings to remove unneeded tables and partitions
    • Keep the data in BigQuery to take advantage of the long-term storage cost benefits rather than exporting to other storage options.
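
As a rough illustration of the cost controls above, the sketch below uses the google-cloud-bigquery Python client to estimate bytes scanned with a dry run and to cap a real run with maximum bytes billed; the project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials

sql = """
    SELECT name, state                                -- control projection: no SELECT *
    FROM `my-project.my_dataset.events`               -- placeholder partitioned table
    WHERE _PARTITIONTIME >= TIMESTAMP('2024-01-01')   -- prune partitions
"""

# Dry run: validates the query and reports the bytes it would read, at no cost.
dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_job = client.query(sql, job_config=dry_cfg)
print("Estimated bytes processed:", dry_job.total_bytes_processed)

# Real run, but fail fast if the query would read more than ~1 GB.
run_cfg = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
for row in client.query(sql, job_config=run_cfg).result():
    print(row["name"], row["state"])
```

The equivalent bq CLI flags are --dry_run and --maximum_bytes_billed.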

Bigtable

  • Bigtable is a fully managed, scalable, wide-column NoSQL database service with up to 99.999% availability.
  • ideal for applications that need very high throughput and scalability for key/value data, where each value is typically no larger than 10 MB.
  • supports high read and write throughput at low latency and provides consistent sub-10ms latency – handles millions of requests/second
  • is a sparsely populated table that can scale to billions of rows and thousands of columns
  • supports storage of terabytes or even petabytes of data
  • is not a relational database. It does not support SQL queries, joins, or multi-row transactions.
  • handles upgrades and restarts transparently, and it automatically maintains high data durability.
  • scales linearly in direct proportion to the number of nodes in the cluster
  • stores data in tables, each of which is composed of rows that typically describe a single entity, and columns that contain individual values for each row.
  • Each table has only one index, the row key. There are no secondary indices. Each row key must be unique.
  • Single-cluster Bigtable instances provide strong consistency.
  • Multi-cluster instances, by default, provide eventual consistency but can be configured to provide read-over-write consistency or strong consistency, depending on the workload and app profile settings

Cloud Dataflow

  • Cloud Dataflow is a managed, serverless service for unified stream and batch data processing requirements
  • provides Horizontal autoscaling to automatically choose the appropriate number of worker instances required to run the job.
  • is based on Apache Beam, an open-source, unified model for defining both batch and streaming-data parallel-processing pipelines.
  • supports Windowing, which enables grouping operations over unbounded collections by dividing the collection into windows of finite collections according to the timestamps of the individual elements (see the Beam sketch after this list).
  • supports the Drain option, which stops a streaming job gracefully after processing buffered data; draining the existing job and launching a replacement is how incompatible pipeline updates are deployed
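
A minimal Apache Beam (Python SDK) sketch of fixed windowing over an unbounded Pub/Sub source, as described above; the topic path is a placeholder, and running on Dataflow would additionally require the DataflowRunner, project, and region pipeline options.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window

options = PipelineOptions()  # add --runner=DataflowRunner, project, region, etc. for Dataflow
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")  # placeholder
        | "KeyByPayload" >> beam.Map(lambda msg: (msg.decode("utf-8"), 1))
        | "FixedWindows" >> beam.WindowInto(window.FixedWindows(60))  # 60-second windows by timestamp
        | "CountPerKey" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```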

Cloud Dataproc

  • Cloud Dataproc is a managed Spark and Hadoop service to take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
  • helps to create clusters quickly, manage them easily, and save money by turning clusters on and off as needed.
  • helps reduce the time and money spent on administration and lets you focus on your jobs and your data.
  • has built-in integration with other GCP services, such as BigQuery, Cloud Storage, Bigtable, Cloud Logging, and Monitoring
  • supports preemptible instances that have lower compute prices to reduce costs further.
  • also supports HBase, Flink, Hive WebHCat, Druid, Jupyter, Presto, Solr, Zeppelin, Ranger, Zookeeper, and much more.
  • supports connectors for BigQuery, Bigtable, Cloud Storage
  • can be configured for High Availability by specifying the number of master instances in the cluster
  • All nodes in a High Availability cluster reside in the same zone. If there is a failure that impacts all nodes in a zone, the failure will not be mitigated.
  • supports cluster scaling by increasing or decreasing the number of primary or secondary worker nodes (horizontal scaling)
  • supports autoscaling, which provides a mechanism for automating cluster resource management
  • supports initialization actions via executables or scripts that run on all nodes in the cluster immediately after the cluster is set up (see the sketch below)
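
A rough sketch of creating a High Availability cluster (3 masters) with secondary workers and an initialization action using the google-cloud-dataproc Python client; all names, paths, and sizing are placeholders, and the exact config fields should be verified against the current client library.

```python
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"   # placeholder
REGION = "us-central1"      # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": "analytics-cluster",  # placeholder
    "config": {
        "master_config": {"num_instances": 3, "machine_type_uri": "n1-standard-4"},  # 3 masters = HA
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        "secondary_worker_config": {"num_instances": 2},  # secondary workers are preemptible by default
        "initialization_actions": [
            {"executable_file": "gs://my-bucket/scripts/install-deps.sh"}  # placeholder script
        ],
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
)
print("Created cluster:", operation.result().cluster_name)
```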

Cloud Dataprep

  • Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
  • is fully managed, serverless, and scales on-demand with no infrastructure to deploy or manage
  • provides easy data preparation with clicks and no code.
  • automatically identifies data anomalies & helps take fast corrective action
  • automatically detects schemas, data types, possible joins, and anomalies such as missing values, outliers, and duplicates
  • uses Dataflow or BigQuery under the hood, enabling unstructured or structured datasets processing of any size with the ease of clicks, not code

Datalab

  • Cloud Datalab is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models using familiar languages, such as Python and SQL, interactively.
  • runs on Google Compute Engine and connects to multiple cloud services easily so you can focus on your data science tasks.
  • is built on Jupyter (formerly IPython)
  • enables analysis of the data on Google BigQuery, Cloud Machine Learning Engine, Google Compute Engine, and Google Cloud Storage using Python, SQL, and JavaScript (for BigQuery user-defined functions).

Google Cloud Firestore

  • Google Cloud Firestore provides a fully managed, scalable, and serverless document database.
  • Firestore stores the data in the form of documents and collections
  • Firestore provides horizontal autoscaling, strong consistency with support for ACID transactions
  • Firestore database can be regional or multi-regional
  • Firestore multi-region instances provide five-nines (99.999%) availability SLA and regional instances with four-nines (99.99%) availability SLA

Data Model

  • Firestore is schemaless
  • Document & Collections
    • Unit of storage is the document in Firestore
    • Each document contains a set of key-value pairs
    • stores the data in documents organized into collections.
    • is optimized for storing large collections of small documents.
    • supports a variety of data types for values: boolean, number, string, geo point, binary blob, and timestamp.
    • Documents can contain subcollections, arrays, or nested objects, which can include primitive fields like strings or complex objects like lists.
    • Documents within a collection are unique and can be identified using your own keys, such as user IDs, or Firestore-generated random IDs (see the sketch after this list).
  • Indexes
    • Firestore guarantees high query performance by using indexes for all queries.
    • supports two types of indexes
      • Single-field
        • automatically maintains single-field indexes for each field in a document and each subfield in a map.
        • Single-field index exemption can be used to exempt a field from automatic indexing settings
        • Single-field index exemption for a map field is inherited by the map’s subfields
      • Composite
        • A composite index stores a sorted mapping of all the documents in a collection, based on an ordered list of fields to index.
        • does not automatically create composite indexes but helps identify fields based on the query pattern
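
A short sketch of the document/collection model above using the google-cloud-firestore Python client; the collection name, document IDs, and fields are placeholders.

```python
from google.cloud import firestore

db = firestore.Client()  # assumes default project/credentials and Native mode

# A document is a set of key-value pairs stored in a collection.
user_ref = db.collection("users").document("alice")  # explicit document ID
user_ref.set({
    "first": "Alice",
    "last": "Doe",
    "born": 1990,
    "address": {"city": "London", "country": "UK"},  # nested object (map)
    "tags": ["admin", "beta"],                       # array
})

# Let Firestore generate a random document ID.
db.collection("users").add({"first": "Bob", "born": 1985})

# Simple query; single-field indexes are maintained automatically.
for doc in db.collection("users").where("born", ">=", 1980).stream():
    print(doc.id, doc.to_dict())
```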

Data Contention

  • Data Contention occurs when two or more operations compete to control the same document.
  • Mobile/Web SDKs
    • uses optimistic concurrency controls to resolve data contention
    • resolves data contention by delaying or failing one of the operations
    • client libraries automatically retry transactions that fail due to data contention. After a finite number of retries, the transaction operation fails and returns an error message
  • Server Client Libraries
    • use pessimistic concurrency controls to resolve data contention.
    • Pessimistic transactions use database locks to prevent other operations from modifying data (see the sketch after this list).
    • Transactions place locks on the documents they read. A transaction’s lock on a document blocks other transactions, batched writes, and non-transactional writes from changing that document.
    • A transaction releases its document locks at commit time. It also releases its locks if it times out or fails for any reason.
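
A sketch of a transaction with the Python server client library (pessimistic locking, as described above); the collection, document, and field names are placeholders.

```python
from google.cloud import firestore

db = firestore.Client()
transaction = db.transaction()
city_ref = db.collection("cities").document("SF")  # placeholder document


@firestore.transactional
def increment_population(transaction, ref):
    # The read inside the transaction places a lock on the document; conflicting
    # writes are blocked until commit, and the operation is retried on contention.
    snapshot = ref.get(transaction=transaction)
    transaction.update(ref, {"population": snapshot.get("population") + 1})


increment_population(transaction, city_ref)
```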

Firestore Security

  • Firestore automatically encrypts all data before it is written to disk.
  • Server-side encryption can be used in combination with client-side encryption, where data is encrypted by the client as well as the server, i.e., double encryption
  • Firestore uses Transport Layer Security (TLS) to protect the data as it travels over the Internet during read and write operations.

Firestore Native vs Datastore Mode

Firestore in Native mode

  • Strongly consistent storage layer
  • Collection and document data model
  • Real-time updates
  • Mobile and Web client libraries
  • Firestore is backward compatible with Datastore, but the new data model, real-time updates, and mobile and web client library features are not.
  • Native mode can automatically scale to millions of concurrent clients.
  • Native mode is recommended for Mobile and Web apps

Firestore in Datastore mode

  • Datastore mode uses Datastore system behavior but accesses Firestore’s storage layer, removing the following Datastore limitations:
    • No more eventual consistency. Firestore in Datastore mode is a strongly consistent database
    • No more entity group limits on writes per second. Writes to an entity group are no longer limited to 1 per second. Transactions are no longer limited to 25 entity groups.
    • Transactions can be as complex as you want to design them.
    • No more cross-entity group transaction limits. Transactions can span documents and be as complex as your app requires. Queries in transactions are no longer required to be ancestor queries.
  • Datastore mode disables Firestore features that are not compatible with Datastore:
    • accepts only Datastore API requests and denies Firestore API requests.
    • uses Datastore indexes instead of Firestore indexes.
    • does not support Firestore client libraries, only Datastore client libraries
    • does not support Firestore real-time capabilities
  • Datastore mode can automatically scale to millions of writes per second.

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day, so both the answers and questions might become outdated soon; research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion and correction.
  1. Your existing application keeps user state information in a single MySQL database. This state information is very user-specific and depends heavily on how long a user has been using an application. The MySQL database is causing challenges to maintain and enhance the schema for various users. Which storage option should you choose?
    1. Cloud SQL
    2. Cloud Storage
    3. Cloud Spanner
    4. Cloud Firestore

References

Google_Cloud_Firestore

Google Cloud – EHR Healthcare Case Study

EHR Healthcare is a leading provider of electronic health record software to the medical industry. EHR Healthcare provides its software as a service to multi-national medical offices, hospitals, and insurance providers.

Executive statement

Our on-premises strategy has worked for years but has required a major investment of time and money in training our team on distinctly different systems, managing similar but separate environments, and responding to outages. Many of these outages have been a result of misconfigured systems, inadequate capacity to manage spikes in traffic, and inconsistent monitoring practices. We want to use Google Cloud to leverage a scalable, resilient platform that can span multiple environments seamlessly and provide a consistent and stable user experience that positions us for future growth.

EHR Healthcare wants to move to Google Cloud to expand and build scalable and highly available applications. It also wants to leverage automation and IaC (Infrastructure as Code) to provide consistency across environments and reduce provisioning errors.

Solution Concept

Due to rapid changes in the healthcare and insurance industry, EHR Healthcare’s business has been growing exponentially year over year. They need to be able to scale their environment, adapt their disaster recovery plan, and roll out new continuous deployment capabilities to update their software at a fast pace. Google Cloud has been chosen to replace its current colocation facilities.

EHR wants to scale, build HA and DR setup and introduce CI/CD in their setup.

Existing Technical Environment

EHR’s software is currently hosted in multiple colocation facilities. The lease on one of the data centers is about to expire.
Customer-facing applications are web-based, and many have recently been containerized to run on a group of Kubernetes clusters. Data is stored in a mixture of relational and NoSQL databases (MySQL, MS SQL Server, Redis, and MongoDB).
EHR is hosting several legacy file- and API-based integrations with insurance providers on-premises. These systems are scheduled to be replaced over the next several years. There is no plan to upgrade or move these systems at the current time.
Users are managed via Microsoft Active Directory. Monitoring is currently being done via various open-source tools. Alerts are sent via email and are often ignored.

  • As the lease of one of the data centers is about to expire, so time is critical
  • Some applications are containerized and have SQL and NoSQL databases and can be moved
  • Some of the systems would not be migrated
  • Team has multiple monitoring tools and might need consolidation

Business requirements

  • On-board new insurance providers as quickly as possible.
  • Provide a minimum 99.9% availability for all customer-facing systems.
    • Availability can be increased by hosting applications across multiple zones
  • Provide centralized visibility and proactive action on system performance and usage.
    • Cloud Monitoring can be used to provide centralized visibility and alerting can provide proactive action
  • Increase ability to provide insights into healthcare trends.
    • Data can be pushed to and analyzed using BigQuery, and insights visualized using Data Studio.
  • Reduce latency to all customers.
    • Performance can be improved using Global Load Balancer to expose the applications
  • Maintain regulatory compliance.
    • Regulatory compliance can be maintained using data localization and data retention policies.
  • Decrease infrastructure administration costs.
    • Infrastructure administration costs can be reduced using automation with either Terraform or Deployment Manager
  • Make predictions and generate reports on industry trends based on provider data.
    • Data can be pushed to and analyzed using BigQuery.

Technical requirements

  • Maintain legacy interfaces to insurance providers with connectivity to both on-premises systems and cloud providers.
  • Provide a consistent way to manage customer-facing applications that are container-based.
    • Container-based applications can be deployed to GKE or Cloud Run with a consistent CI/CD experience
  • Provide a secure and high-performance connection between on-premises systems and Google Cloud.
    • Cloud VPN, Dedicated Interconnect, or Partner Interconnect connections can be established between on-premises and Google Cloud
  • Provide consistent logging, log retention, monitoring, and alerting capabilities.
    • Cloud Monitoring and Cloud Logging can be used to provide a single tool for monitoring, logging, and alerting.
  • Maintain and manage multiple container-based environments.
    • Use Terraform or Deployment Manager (IaC) to provide consistent implementations across environments
  • Dynamically scale and provision new environments.
    • Applications deployed on GKE can be scaled using Cluster Autoscaler and HPA for deployments.
  • Create interfaces to ingest and process data from new providers.

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day, so both the answers and questions might become outdated soon; research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion and correction.
  1. For this question, refer to the EHR Healthcare case study. In the past, configuration errors put IP addresses on backend servers that should not have been accessible from the internet. You need to ensure that no one can put external IP addresses on backend Compute Engine instances and that external IP addresses can only be configured on the front end Compute Engine instances. What should you do?
    1. Create an organizational policy with a constraint to allow external IP addresses on the front end Compute Engine instances
    2. Revoke the compute.networkadmin role from all users in the project with front end instances
    3. Create an Identity and Access Management (IAM) policy that maps the IT staff to the compute.networkadmin role for the organization
    4. Create a custom Identity and Access Management (IAM) role named GCE_FRONTEND with the compute.addresses.create permission

References

EHR Healthcare Case Study

Google Cloud BigQuery

  • Google Cloud BigQuery is a fully managed, petabyte-scale, serverless, highly scalable, and cost-effective multi-cloud data warehouse.
  • BigQuery supports a standard SQL dialect that is ANSI:2011 compliant, which reduces the need for code rewrites.
  • BigQuery transparently and automatically provides highly durable, replicated storage in multiple locations and high availability with no extra charge and no additional setup.
  • BigQuery supports federated data and can process external data sources in GCS for Parquet and ORC open-source file formats, transactional databases (Bigtable, Cloud SQL), or spreadsheets in Drive without moving the data.
  • BigQuery automatically replicates data and keeps a seven-day history of changes, allowing easy restoration and comparison of data from different times
  • BigQuery Data Transfer Service automatically transfers data from external data sources, like Google Marketing Platform, Google Ads, YouTube, external sources like S3 or Teradata, and partner SaaS applications to BigQuery on a scheduled and fully managed basis
  • BigQuery provides a REST API for easy programmatic access and application integration. Client libraries are available in Java, Python, Node.js, C#, Go, Ruby, and PHP.

BigQuery Resources

Datasets

  • Datasets are the top-level containers used to organize and control access to the BigQuery tables and views.
  • Datasets frequently map to schemas in standard relational databases and data warehouses.
  • Datasets are scoped to the Cloud project
  • A dataset is bound to a location and can be defined as
    • Regional: A specific geographic place, such as London.
    • Multi-regional: A large geographic area, such as the United States, that contains two or more geographic places.
  • Dataset location can be set only at the time of its creation.
  • A query can contain tables or views from different datasets in the same location.
  • Dataset names must be unique for each project.

Tables

  • BigQuery tables are row-column structures that hold the data.
  • A BigQuery table contains individual records organized in rows. Each record is composed of columns (also called fields).
  • Every table is defined by a schema that describes the column names, data types, and other information.
  • BigQuery has the following types of tables:
    • Native tables: Tables backed by native BigQuery storage.
    • External tables: Tables backed by storage external to BigQuery.
    • Views: Virtual tables defined by a SQL query.
  • Schema of a table can either be defined during creation or specified in the query job or load job that first populates it with data.
  • Schema auto-detection is also supported when data is loaded into BigQuery or queried from an external data source. BigQuery makes a best-effort attempt to automatically infer the schema for CSV and JSON files
  • A column's data type cannot be changed once defined.

Partitioned Tables

  • A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data.
  • By dividing a large table into smaller partitions, query performance and costs can be controlled by reducing the number of bytes read by a query.
  • BigQuery tables can be partitioned by:
    • Time-unit column: Tables are partitioned based on a TIMESTAMP, DATE, or DATETIME column in the table.
    • Ingestion time: Tables are partitioned based on the timestamp when BigQuery ingests the data.
    • Integer range: Tables are partitioned based on an integer column.
  • If a query filters on the value of the partitioning column, BigQuery can scan the partitions that match the filter and skip the remaining partitions. This process is called pruning.

Clustered Tables

  • With Clustered tables, the table data is automatically organized based on the contents of one or more columns in the table’s schema.
  • Columns specified are used to colocate the data.
  • Clustering can be performed on multiple columns, where the order of the columns is important as it determines the sort order of the data
  • Clustering can improve query performance for specific filter queries or ones that aggregate data as BigQuery uses the sorted blocks to eliminate scans of unnecessary data
  • Clustering does not provide strict cost guarantees before running the query.
  • Partitioning can be used with clustering, where data is first partitioned and then the data in each partition is clustered by the clustering columns. When the table is queried, partitioning sets an upper bound on the query cost based on partition pruning (see the sketch after this list).
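
A sketch of creating a table that is both partitioned and clustered with the google-cloud-bigquery Python client; the table ID and schema are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"  # placeholder

schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
]

table = bigquery.Table(table_id, schema=schema)
# Partition by a time-unit column, then cluster the data within each partition.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_ts"
)
table.clustering_fields = ["customer_id"]

table = client.create_table(table)
print("Created", table.full_table_id)
```

Queries that filter on event_ts prune partitions (bounding cost), while filters on customer_id benefit from the clustered sort order.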

Views

  • A View is a virtual table defined by a SQL query.
  • View query results contain data only from the tables and fields specified in the query that defines the view.
  • Views are read-only and do not support DML queries
  • The dataset that contains the view and the dataset that contains the tables referenced by the view must be in the same location.
  • Views do not support BigQuery jobs that export data
  • Views do not support the JSON API for retrieving view data
  • Standard SQL and legacy SQL queries cannot be mixed
  • A legacy SQL view cannot be automatically updated to standard SQL syntax.
  • No user-defined functions are allowed
  • No wildcard table references are allowed

Materialized Views

  • Materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency.
  • BigQuery leverages pre-computed results from materialized views and whenever possible reads only delta changes from the base table to compute up-to-date results.
  • Materialized views can be queried directly or can be used by the BigQuery optimizer to process queries to the base tables.
  • Materialized views queries are generally faster and consume fewer resources than queries that retrieve the same data only from the base table
  • Materialized views can significantly improve the performance of workloads that have the characteristic of common and repeated queries.

Jobs

  • Jobs are actions that BigQuery runs on your behalf to load data, export data, query data, or copy data.
  • Jobs are not linked to the same project that the data is stored in. However, the location where the job can execute is linked to the dataset location.

External Data Sources

  • An external data source (federated data source) is a data source that can be queried directly even though the data is not stored in BigQuery.
  • Instead of loading or streaming the data, a table can be created that references the external data source.
  • BigQuery offers support for querying data directly from:
    • Cloud Bigtable
    • Cloud Storage
    • Google Drive
    • Cloud SQL
  • Supported formats are:
    • Avro
    • CSV
    • JSON (newline delimited only)
    • ORC
    • Parquet
  • External data sources use cases
    • Loading and cleaning the data in one pass by querying the data from an external data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage.
    • Having a small amount of frequently changing data that needs to be joined with other tables. As an external data source, the frequently changing data does not need to be reloaded every time it is updated.
  • Permanent vs Temporary external tables
    • The external data sources can be queried in BigQuery by using a permanent table or a temporary table.
    • Permanent Table
      • is a table that is created in a dataset and is linked to the external data source (see the sketch after this list).
      • access controls can be used to share the table with others who also have access to the underlying external data source, and you can query the table at any time.
    • Temporary Table
      • you submit a command that includes a query and creates a non-permanent table linked to the external data source.
      • no table is created in the BigQuery datasets.
      • cannot be shared with others.
      • Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.
  • Limitations
    • does not guarantee data consistency for external data sources
    • query performance for external data sources may not be as high as querying data in a native BigQuery table
    • cannot reference an external data source in a wildcard table query.
    • supports table partitioning or clustering only in limited ways
    • query results are not cached, so each query execution is charged
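
A sketch of defining a permanent external table over Parquet files in Cloud Storage with the Python client; the bucket path and table ID are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.ext_sales"  # placeholder

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-bucket/sales/*.parquet"]  # placeholder GCS path

table = bigquery.Table(table_id)
table.external_data_configuration = external_config
client.create_table(table)

# The external table can now be queried like any other table, keeping in mind
# that results are not cached and consistency is not guaranteed.
rows = client.query(f"SELECT COUNT(*) AS cnt FROM `{table_id}`").result()
print(next(iter(rows))["cnt"])
```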

BigQuery Security

Refer blog post @ BigQuery Security

BigQuery Best Practices

  • Cost Control
    • Query only the needed columns and avoid select * as BigQuery does a full scan of every column in the table.
    • Don’t run queries to explore or preview table data; use the table preview option instead
    • Before running queries, preview them to estimate costs. Queries are billed according to the number of bytes read and the cost can be estimated using the --dry_run feature
    • Use the maximum bytes billed setting to limit query costs.
    • Use clustering and partitioning to reduce the amount of data scanned.
    • For non-clustered tables, do not use a LIMIT clause as a method of cost control. Applying a LIMIT clause to a query does not affect the amount of data read, but shows limited results only. With a clustered table, a LIMIT clause can reduce the number of bytes scanned
    • Partition the tables by date which helps query relevant subsets of data which improves performance and reduces costs.
    • Materialize the query results in stages. Break the query into stages where each stage materializes the query results by writing them to a destination table. Querying the smaller destination table reduces the amount of data that is read and lowers costs. The cost of storing the materialized results is much less than the cost of processing large amounts of data.
    • Use streaming inserts only if the data must be immediately available as streaming data is charged.
  • Query Performance
    • Control projection, Query only the needed columns. Avoid SELECT *
    • Prune partitioned queries, use the _PARTITIONTIME pseudo column to filter the partitions.
    • Denormalize data whenever possible using nested and repeated fields.
    • Avoid external data sources, if query performance is a top priority
    • Avoid repeatedly transforming data via SQL queries, use materialized views instead
    • Avoid using JavaScript user-defined functions
    • Optimize Join patterns. Start with the largest table.
  • Optimizing Storage
    • Use the expiration settings to remove unneeded tables and partitions
    • Keep the data in BigQuery to take advantage of the long-term storage cost benefits rather than exporting to other storage options.

BigQuery Data Transfer Service

Refer GCP blog post @ Google Cloud BigQuery Data Transfer Service

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day, so both the answers and questions might become outdated soon; research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion and correction.
  1. A user wishes to generate reports on petabyte-scale data using Business Intelligence (BI) tools. Which storage option provides integration with BI tools and supports OLAP workloads up to petabyte-scale?
    1. Bigtable
    2. Cloud Datastore
    3. Cloud Storage
    4. BigQuery
  2. Your company uses Google Analytics for tracking. You need to export the session and hit data from a Google Analytics 360 reporting view on a scheduled basis into BigQuery for analysis. How can the data be exported?
    1. Configure a scheduler in Google Analytics to convert the Google Analytics data to JSON format, then import directly into BigQuery using bq command line.
    2. Use gsutil to export the Google Analytics data to Cloud Storage, then import into BigQuery and schedule it using Cron.
    3. Import data to BigQuery directly from Google Analytics using Cron
    4. Use BigQuery Data Transfer Service to import the data from Google Analytics

References

Google_Cloud_BigQuery_Architecture

Google Cloud Bigtable

  • Cloud Bigtable is a fully managed, scalable, wide-column NoSQL database service with up to 99.999% availability.
  • Bigtable is ideal for applications that need very high throughput and scalability for key/value data, where each value is typically no larger than 10 MB.
  • Bigtable supports high read and write throughput at low latency and provides consistent sub-10ms latency – handles millions of requests/second
  • Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns
  • Bigtable supports storage of terabytes or even petabytes of data
  • Bigtable is not a relational database. It does not support SQL queries, joins, or multi-row transactions.
  • Fully Managed
    •  Bigtable handles upgrades and restarts transparently, and it automatically maintains high data durability.
    • Data replication can be performed by simply adding a second cluster to the instance, and replication starts automatically.
  • Scalability
    • Bigtable scales linearly in direct proportion to the number of machines in the cluster
    • Bigtable throughput can be scaled dynamically by adding or removing cluster nodes without restarting
  • Bigtable integrates easily with big data tools like Hadoop, Dataflow, Dataproc and supports HBase APIs.

Bigtable Architecture

  • A Bigtable instance is a container for clusters, in which nodes are organized.
  • Bigtable stores data in Colossus, Google’s file system.
  • Instance
    • A Bigtable instance is a container for data.
    • Instances have one or more clusters, located in different zones and possibly different regions (clusters in different regions add replication latency)
    • Each cluster has at least 1 node
    • A Table belongs to an instance and not to the cluster or node.
    • An instance also consists of the following properties
      • Storage Type – SSD or HDD
      • Application Profiles – primarily for instances using replication
  • Instance Type
    • Development – Single node cluster with no replication or SLA
    • Production – 1+ clusters with 3+ nodes per cluster
  • Storage Type
    • Storage Type dictates where the data is stored i.e. SSD or HDD
    • Choice of SSD or HDD storage for the instance is permanent
    • SSD storage is the most efficient and cost-effective choice for most use cases.
    • HDD storage is sometimes appropriate for very large data sets (>10 TB) that are not latency-sensitive or are infrequently accessed.
  • Application Profile
    • An application profile, or app profile, stores settings that tell Bigtable how to handle incoming requests from an application
    • Application profile helps define custom application-specific settings for handling incoming connections
  • Cluster
    • Clusters handle the requests sent to a single Bigtable instance
    • Each cluster belongs to a single Bigtable instance, and an instance can have up to 4 clusters
    • Each cluster is located in a single zone
    • Bigtable instances with only 1 cluster do not use replication
    • Instances with multiple clusters replicate the data, which
      • improves data availability and durability
      • improves scalability by routing different types of traffic to different clusters
      • provides failover capability, if another cluster becomes unavailable
    • If an instance has multiple clusters, Bigtable automatically starts replicating the data by keeping separate copies of the data in each of the clusters’ zones and synchronizing updates between the copies
  • Nodes
    • Each cluster in an instance has 1 or more nodes, which are the compute resources that Bigtable uses to manage the data.
    • Each node in the cluster handles a subset of the requests to the cluster
    • All client requests go through a front-end server before they are sent to a Bigtable node.
    • Bigtable separates the Compute from the Storage. Data is never stored in nodes themselves; each node has pointers to a set of tablets that are stored on Colossus. This helps as
      • Rebalancing tablets from one node to another is very fast, as the actual data is not copied. Only pointers for each node are updated
      • Recovery from the failure of a Bigtable node is very fast as only the metadata needs to be migrated to the replacement node.
      • When a Bigtable node fails, no data is lost.
    • A Bigtable cluster can be scaled by adding nodes which would increase
      • the number of simultaneous requests that the cluster can handle
      • the maximum throughput of the cluster.
    • Each node is responsible for:
      • Keeping track of specific tablets on disk.
      • Handling incoming reads and writes for its tablets.
      • Performing maintenance tasks on its tablets, such as periodic compactions
    • Bigtable nodes are also referred to as tablet servers
  • Tables
    • Bigtable stores data in massively scalable tables, each of which is a sorted key/value map.
    • A Table belongs to an instance and not to the cluster or node.
    • A Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries.
    • Bigtable splits all of the data in a table into separate tablets.
    • Tablets are stored on the disk, separate from the nodes but in the same zone as the nodes.
    • Each tablet is associated with a specific Bigtable node.
    • Tablets are stored in SSTable format which provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings.
    • In addition to the SSTable files, all writes are stored in Colossus’s shared log as soon as they are acknowledged by Bigtable, providing increased durability.

Bigtable Storage Model

  • Bigtable stores data in tables, each of which is a sorted key/value map.
  • A Table is composed of rows, each of which typically describes a single entity, and columns, which contain individual values for each row.
  • Each row is indexed by a single row key, and columns that are related to one another are typically grouped together into a column family.
  • Each column is identified by a combination of the column family and a column qualifier, which is a unique name within the column family.
  • Each row/column intersection can contain multiple cells.
  • Each cell contains a unique timestamped version of the data for that row and column.
  • Storing multiple cells in a column provides a record of how the stored data for that row and column has changed over time.
  • Bigtable tables are sparse; if a column is not used in a particular row, it does not take up any space (see the sketch after this list).
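
A minimal write/read sketch of the row key / column family / cell model above, using the google-cloud-bigtable Python client; the instance, table, and column family ("cf1") are placeholders assumed to already exist.

```python
import datetime

from google.cloud import bigtable

client = bigtable.Client(project="my-project")               # placeholder project
table = client.instance("my-instance").table("sensor-data")  # assumed to exist with family "cf1"

# Write one cell; every cell is versioned by timestamp.
row_key = b"device-042#2024-01-01T10"
row = table.direct_row(row_key)
row.set_cell("cf1", b"temperature", b"21.5", timestamp=datetime.datetime.utcnow())
row.commit()

# Read the row back; cells are addressed by column family and qualifier.
result = table.read_row(row_key)
cell = result.cells["cf1"][b"temperature"][0]  # most recent version first
print(cell.value, cell.timestamp)
```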

Bigtable Schema Design

  • Bigtable schema is a blueprint or model of a table that includes Row Keys, Column Families, and Columns
  • Bigtable is a key/value store, not a relational store. It does not support joins, and transactions are supported only within a single row.
  • Each table has only one index, the row key. There are no secondary indices. Each row key must be unique.
  • Rows are sorted lexicographically by row key, from the lowest to the highest byte string. Row keys are sorted in big-endian byte order, the binary equivalent of alphabetical order.
  • Column families are not stored in any specific order.
  • Columns are grouped by column family and sorted in lexicographic order within the column family.
  • Intersection of a row and column can contain multiple timestamped cells. Each cell contains a unique, timestamped version of the data for that row and column.
  • All operations are atomic at the row level. This means that an operation affects either an entire row or none of the row.
  • Bigtable tables are sparse. A column doesn’t take up any space in a row that doesn’t use the column.

Bigtable Replication

  • Bigtable Replication helps increase the availability and durability of the data by copying it across multiple zones in a region or multiple regions.
  • Replication helps isolate workloads by routing different types of requests to different clusters using application profiles.
  • Bigtable replication can be implemented by
    • creating a new instance with more than 1 cluster or
    • adding clusters to an existing instance.
  • Bigtable synchronizes the data between the clusters, creating a separate, independent copy of the data in each zone where the instance has a cluster.
  • Replicated clusters in different regions typically have higher replication latency than replicated clusters in the same region.
  • Bigtable replicates any changes to the data automatically, including all of the following types of changes:
    • Updates to the data in existing tables
    • New and deleted tables
    • Added and removed column families
    • Changes to a column family’s garbage collection policy
  • Bigtable treats each cluster in the instance as a primary cluster, so reads and writes can be performed in each cluster.
  • Application profiles can be created so that the requests from different types of applications are routed to different clusters.
  • Consistency Model
    • Eventual Consistency
      • Replication for Bigtable is eventually consistent, by default.
    • Read-your-writes Consistency
      • Bigtable can also provide read-your-writes consistency when replication is enabled, which ensures that an application will never read data that is older than its most recent writes.
      • To gain read-your-writes consistency for a group of applications, each application in the group must use an app profile that is configured for single-cluster routing, and all of the app profiles must route requests to the same cluster.
      • You can use the instance’s additional clusters at the same time for other purposes.
    • Strong Consistency
      • For some replication use cases, Bigtable can also provide strong consistency, which ensures that all of the applications see the data in the same state.
      • To gain strong consistency, you use the single-cluster routing app-profile configuration for read-your-writes consistency, but you must not use the instance’s additional clusters unless you need to failover to a different cluster.
  • Use cases
    • Isolate real-time serving applications from batch reads
    • Improve availability
    • Provide near-real-time backup
    • Ensure your data has a global presence

Bigtable Best Practices

  • Store datasets with similar schemas in the same table, rather than in separate tables as in SQL.
  • Bigtable has a limit of 1,000 tables per instance
  • Creating many small tables is a Bigtable anti-pattern.
  • Put related columns in the same column family
  • Create up to about 100 column families per table. A higher number would lead to performance degradation.
  • Choose short but meaningful names for your column families
  • Put columns that have different data retention needs in different column families to limit storage cost.
  • Create as many columns as you need in the table. Bigtable tables are sparse, and there is no space penalty for a column that is not used in a row
  • Don’t store more than 100 MB of data in a single row as a higher number would impact performance
    • Don’t store more than 10 MB of data in a single cell.
  • Design the row key based on the queries used to retrieve the data (see the row-key sketch after this list)
  • The following queries provide the most efficient performance
    • Row key
    • Row key prefix
    • Range of rows defined by starting and ending row keys
  • Other types of queries trigger a full table scan, which is much less efficient.
  • Store multiple delimited values in each row key. Multiple identifiers can be included in the row key.
  • Use human-readable string values in your row keys whenever possible. Makes it easier to use the Key Visualizer tool.
  • Row keys anti-pattern
    • Row keys that start with a timestamp, as it causes sequential writes to a single node
    • Row keys that cause related data to not be grouped together, which would degrade the read performance
    • Sequential numeric IDs
    • Frequently updated identifiers
    • Hashed values as hashing a row key removes the ability to take advantage of Bigtable’s natural sorting order, making it impossible to store rows in a way that are optimal for querying
    • Values expressed as raw bytes rather than human-readable strings
    • Domain names, instead use the reverse domain name as the row key as related data can be clubbed.
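
A sketch of the row-key guidance above: multiple delimited, human-readable identifiers with the grouping identifier first (not a leading timestamp), which keeps related rows contiguous and enables efficient range scans; all identifiers and names are placeholders.

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")               # placeholder
table = client.instance("my-instance").table("sensor-data")  # placeholder

# Row key: customer (reverse-domain style), then device, then time bucket.
# Avoids a leading timestamp (write hotspot) while keeping one device's
# readings for a period in a contiguous, scannable range.
def make_row_key(customer: str, device: str, time_bucket: str) -> bytes:
    return f"{customer}#{device}#{time_bucket}".encode("utf-8")

# Efficient query: a range of rows bounded by start and end keys
# (here, one day of readings for a single device).
rows = table.read_rows(
    start_key=make_row_key("com.example", "device-042", "2024-01-01"),
    end_key=make_row_key("com.example", "device-042", "2024-01-02"),
)
for row in rows:
    print(row.row_key, list(row.cells.keys()))
```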

Bigtable Load Balancing

  • Each Bigtable zone is managed by a primary process, which balances workload and data volume within clusters.
  • This process redistributes the data between nodes as needed as it
    • splits busier/larger tablets in half and
    • merges less-accessed/smaller tablets together
  • Bigtable automatically manages all of the splitting, merging, and rebalancing, saving users the effort of manually administering the tablets
  • Bigtable write performance can be improved by distributing writes as evenly as possible across nodes with proper row key design.

Bigtable Consistency

  • Single-cluster instances provide strong consistency.
  • Multi-cluster instances, by default, provide eventual consistency but can be configured to provide read-over-write consistency or strong consistency, depending on the workload and app profile settings

Bigtable Security

  • Access to the tables is controlled by your Google Cloud project and the Identity and Access Management (IAM) roles assigned to the users.
  • All data stored within Google Cloud, including the data in Bigtable tables, is encrypted at rest using Google’s default encryption.
  • Bigtable supports using customer-managed encryption keys (CMEK) for data encryption.

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day, so both the answers and questions might become outdated soon; research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion and correction.
  1. Your company processes high volumes of IoT data that are time-stamped. The total data volume can be several petabytes. The data needs to be written and changed at a high speed. You want to use the most performant storage option for your data. Which product should you use?
    1. Cloud Datastore
    2. Cloud Storage
    3. Cloud Bigtable
    4. BigQuery
  2. You want to optimize the performance of an accurate, real-time, weather-charting application. The data comes from 50,000 sensors sending 10 readings a second, in the format of a timestamp and sensor reading. Where should you store the data?
    1. Google BigQuery
    2. Google Cloud SQL
    3. Google Cloud Bigtable
    4. Google Cloud Storage
  3. Your team is working on designing an IoT solution. There are thousands of devices that need to send periodic time series data for processing. Which services should be used to ingest and store the data?
    1. Pub/Sub, Datastore
    2. Pub/Sub, Dataproc
    3. Dataproc, Bigtable
    4. Pub/Sub, Bigtable

References

Google_Cloud_Bigtable

Google Cloud – Professional Cloud Developer Certification learning path

Continuing on the Google Cloud Journey, glad to have passed the sixth certification with the Professional Cloud Developer certification. The Google Cloud – Professional Cloud Developer certification exam focuses mainly on application development and deployment services, with the compute, storage, data, networking, and monitoring services covered from a developer's perspective.

Google Cloud -Professional Cloud Developer Certification Summary

  • Had 60 questions to be answered in 2 hours, compared to 50 questions in the same 2 hours for the other exams.
  • Covers a wide range of Google Cloud services mainly focusing on application and deployment services
  • Make sure you cover the case studies beforehand. I got ~5-6 questions on them, and they can really be a savior for you in the exam.
  • As mentioned for all the exams, Hands-on is a MUST, if you have not worked on GCP before make sure you do lots of labs else you would be absolutely clueless about some of the questions and commands
  • I did Coursera and A Cloud Guru, which are really vast, but hands-on or practical knowledge is a MUST.

Google Cloud – Professional Cloud Developer Certification Resources

Google Cloud – Professional Cloud Developer Certification Topics

Case Studies

Compute Services

  • Compute services like Google Compute Engine and Google Kubernetes Engine are lightly covered more from the security aspects
  • Google Compute Engine
    • Google Compute Engine is the best IaaS option for compute and provides fine-grained control
    • Compute Engine is recommended to be used with Service Account with the least privilege to provide access to Google services and the information can be queried from instance metadata.
    • Compute Engine Persistent disks can be attached to multiple VMs in read-only mode.
    • Compute Engine launch issues reasons
      • Boot disk is full.
      • Boot disk is corrupted
      • Boot Disk has an invalid master boot record (MBR).
      • Quota Errors
      • Can be debugged using Serial console
    • Preemptible VMs and their use cases. HINT –  shutdown script to perform cleanup actions
  • Google Kubernetes Engine
    • Google Kubernetes Engine, enables running containers on Google Cloud
    • Understand GKE containers, Pods, Deployments, Service, DaemonSet, StatefulSets
      • Pods are the smallest, most basic deployable objects in Kubernetes. A Pod represents a single instance of a running process in the cluster and can contain single or multiple containers
      • Deployments represent a set of multiple, identical Pods with no unique identities. A Deployment runs multiple replicas of the application and automatically replaces any instances that fail or become unresponsive.
      • StatefulSets represent a set of Pods with unique, persistent identities and stable hostnames that GKE maintains regardless of where they are scheduled
      • DaemonSets manages groups of replicated Pods. However, DaemonSets attempt to adhere to a one-Pod-per-node model, either across the entire cluster or a subset of nodes
      • A Service groups a set of Pod endpoints into a single resource. GKE Services can be exposed as ClusterIP, NodePort, and LoadBalancer types
      • Ingress object defines rules for routing HTTP(S) traffic to applications running in a cluster. An Ingress object is associated with one or more Service objects, each of which is associated with a set of Pods
    • GKE supports Horizontal Pod Autoscaler (HPA) to autoscale deployments based on CPU and Memory
    • GKE supports health checks using liveness and readiness probe
      • Readiness probes are designed to let Kubernetes know when the app is ready to serve traffic.
      • Liveness probes let Kubernetes know if the app is alive or dead.
    • Understand Workload Identity for security, which is a recommended way to provide Pods running on the cluster access to Google resources.
    • GKE integrates with Istio to provide the mTLS feature
  • Google App Engine
  • Cloud Tasks
    • is a fully managed service that allows you to manage the execution, dispatch, and delivery of a large number of distributed tasks.

Security Services

  • Cloud Identity-Aware Proxy
    • Identity-Aware Proxy IAP allows managing access to HTTP-based apps both on Google Cloud and outside of Google Cloud.
    • IAP uses Google identities and IAM and can leverage external identity providers as well like OAuth with Facebook, Microsoft, SAML, etc.
    • Signed headers using JWT provide secondary security in case someone bypasses IAP.
  • Cloud Data Loss Prevention – DLP
    • Cloud Data Loss Prevention – DLP is a fully managed service designed to help discover, classify, and protect the most sensitive data.
    • provides two key features
      • Classification is the process to inspect the data and know what data we have, how sensitive it is, and the likelihood.
      • De-identification is the process of removing, masking, redacting, or replacing information in data.
  • Web Security Scanner
    • Web Security Scanner identifies security vulnerabilities in the App Engine, GKE, and Compute Engine web applications.
    • scans provide information about application vulnerability findings, like OWASP, XSS, Flash injection, outdated libraries, cross-site scripting, clear-text passwords, or use of mixed content

Networking Services

  • Virtual Private Cloud
    • Understand Virtual Private Cloud (VPC), subnets, and host applications within them
    • Private access options for services allow instances with internal IP addresses to communicate with Google APIs and services.
    • Private Google Access allows VMs to connect to the set of external IP addresses used by Google APIs and services by enabling Private Google Access on the subnet used by the VM’s network interface.
  • Cloud Load Balancing
    • Google Cloud Load Balancing provides scaling, high availability, and traffic management for your internet-facing and private applications.

Identity Services

  • Resource Manager
    • Understand the Resource Manager hierarchy: Organization -> Folders -> Projects -> Resources
    • IAM Policy inheritance is transitive and resources inherit the policies of all of their parent resources.
    • Effective policy for a resource is the union of the policy set on that resource and the policies inherited from higher up in the hierarchy.
  • Identity and Access Management
    • Identity and Access Management – IAM provides administrators the ability to manage cloud resources centrally by controlling who can take what action on specific resources.
    • A service account is a special kind of account used by an application or a virtual machine (VM) instance, not a person.
    • Understand IAM Best Practices
      • Use groups for users requiring the same responsibilities
      • Use service accounts for server-to-server interactions.
      • Use Organization Policy Service to get centralized and programmatic control over the organization’s cloud resources.
    • Domain-wide delegation of authority to grant third-party and internal applications access to the users’ data for e.g. Google Drive etc.

Storage Services

  • Cloud Storage
    • Cloud Storage is cost-effective object storage for unstructured data and provides an option for long term data retention
    • Understand Signed URLs, which give temporary access and do not require the users to be GCP users. HINT: a Signed URL would work for direct upload to GCS without routing the traffic through App Engine or CE (see the sketch at the end of this section)
    • Understand Google Cloud Storage Classes and Object Lifecycle Management to transition objects
    • Retention Policies help define the retention period for the bucket, before which the objects in the bucket cannot be deleted.
    • Bucket Lock feature allows configuring a data retention policy for a bucket that governs how long objects in the bucket must be retained. The feature also allows locking the data retention policy, permanently preventing the policy from being reduced or removed
    • Know Cloud Storage Best Practices esp. GCS auto-scaling performs well if requests ramp up gradually rather than having a sudden spike. Also, retry using exponential back-off strategy
    • Cloud Storage can be used to host static websites
    • Cloud CDN can be used with Cloud Storage to improve performance and enable caching
  • DataStore/FireStore
    • Cloud Datastore/Firestore provides a managed NoSQL document database built for automatic scaling, high performance, and ease of application development.
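
A sketch of generating a V4 signed URL that lets a non-GCP user upload an object directly to Cloud Storage, as hinted above; it assumes the credentials in use can sign (e.g., a service account key), and the bucket and object names are placeholders.

```python
import datetime

from google.cloud import storage

client = storage.Client()                   # assumes signing-capable credentials
bucket = client.bucket("my-upload-bucket")  # placeholder bucket
blob = bucket.blob("incoming/report.csv")   # placeholder object name

url = blob.generate_signed_url(
    version="v4",
    expiration=datetime.timedelta(minutes=15),  # temporary access only
    method="PUT",                               # allow a direct upload
    content_type="text/csv",
)
print("Upload with: curl -X PUT -H 'Content-Type: text/csv' --upload-file report.csv", url)
```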

Developer Tools

  • Google Cloud Build
    • Cloud Build integrates with Cloud Source Repositories, GitHub, and GitLab and can be used for Continuous Integration and Deployments.
    • Cloud Build can import source code, execute build to the specifications, and produce artifacts such as Docker containers or Java archives
    • Cloud Build build config file specifies the instructions to perform, with steps defined for each task like test, build, and deploy.
    • Cloud Build supports custom images as well for the steps
    • Cloud Build uses a directory named /workspace as a working directory and the assets produced by one step can be passed to the next one via the persistence of the /workspace directory.
  • Google Cloud Code
    • Cloud Code helps write, debug, and deploy cloud-based applications from IntelliJ, VS Code, or the browser.
  • Google Cloud Client Libraries
    • Google Cloud Client Libraries provide client libraries and SDKs in various languages for calling Google Cloud APIs.
    • If the language is not supported, Cloud Rest APIs can be used.
  • Deployment Techniques
    • Recreate deployment – fully scale down the existing application version before you scale up the new application version.
    • Rolling update – update a subset of running application instances instead of simultaneously updating every application instance
    • Blue/Green deployment – (also known as a red/black deployment), you perform two identical deployments of your application
    • GKE supports Rolling and Recreate deployments.
      • Rolling deployments support maxSurge (how many new Pods can be created above the desired count) and maxUnavailable (how many existing Pods can be unavailable during the update)
    • Managed Instance Groups support Rolling deployments using the maxSurge (new VM instances are created) and maxUnavailable (existing VM instances are taken offline) configurations
  • Testing Strategies
    • Canary testing – partially roll out a change and then evaluate its performance against a baseline deployment
    • A/B testing – test a hypothesis by using variant implementations. A/B testing is used to make business decisions (not only predictions) based on the results derived from data.

Data Services

  • Bigtable
  • Cloud Pub/Sub
    • Understand Cloud Pub/Sub as an asynchronous messaging service
    • Know patterns for One to Many, Many to One, and Many to Many
    • roles/pubsub.publisher and roles/pubsub.subscriber provide applications with the ability to publish and consume messages.
  • Cloud SQL
    • Cloud SQL is a fully managed service that provides MySQL, PostgreSQL, and Microsoft SQL Server.
    • HA configuration provides data redundancy and failover capability with minimal downtime when a zone or instance becomes unavailable due to a zonal outage, or an instance corruption
    • Read replicas help scale reads horizontally without degrading the primary instance’s performance
  • Cloud Spanner
    • is a fully managed relational database with unlimited scale, strong consistency, and up to 99.999% availability.
    • can read and write up-to-date strongly consistent data globally
    • Multi-region instances give higher availability guarantees (99.999% availability) and global scale.
    • Cloud Spanner’s table interleaving is a good choice for many parent-child relationships where the child table’s primary key includes the parent table’s primary key columns (see the sketch below).
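
To illustrate interleaving, here is a minimal sketch using the google-cloud-spanner Python client; the instance, database, and table names are hypothetical, and a parent table Singers with PRIMARY KEY (SingerId) is assumed to already exist:

    # Minimal sketch: create a child table interleaved in its parent so that
    # child rows are stored physically with the parent row (names are hypothetical).
    from google.cloud import spanner

    client = spanner.Client()                    # uses Application Default Credentials
    database = client.instance("my-instance").database("my-database")

    operation = database.update_ddl([
        """CREATE TABLE Albums (
             SingerId   INT64 NOT NULL,
             AlbumId    INT64 NOT NULL,
             AlbumTitle STRING(MAX)
           ) PRIMARY KEY (SingerId, AlbumId),
             INTERLEAVE IN PARENT Singers ON DELETE CASCADE"""
    ])
    operation.result()                           # wait for the schema change to complete

Note how the child table’s primary key (SingerId, AlbumId) starts with the parent table’s key column, which is exactly the parent-child pattern described above.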

Monitoring

  • Google Cloud Monitoring or Stackdriver
    • provides monitoring, alerting, error reporting, metrics, diagnostics, debugging, and tracing.
    • Cloud Monitoring helps gain visibility into the performance, availability, and health of your applications and infrastructure.
  • Google Cloud Logging or Stackdriver logging
    • Cloud Logging provides real-time log management and analysis
    • Cloud Logging allows ingestion of custom log data from any source (see the sketch after this list)
    • Logs can be exported by configuring log sinks to BigQuery, Cloud Storage, or Pub/Sub.
    • The Cloud Logging agent can be installed on VMs to capture system and application logs.
  • Cloud Error Reporting
    • counts, analyzes, and aggregates the crashes in the running cloud services
  • Cloud Trace
    • is a distributed tracing system that collects latency data from the applications and displays it in the Google Cloud Console.
  • Cloud Debugger
    • is a feature of Google Cloud that lets you inspect the state of a running application in real-time, without stopping or slowing it down
    • Debug Logpoints allow logging injection into running services without restarting or interfering with the normal function of the service
    • Debug Snapshots help capture local variables and the call stack at a specific line location in your app’s source code
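
As a small illustration of custom log ingestion, here is a minimal sketch using the google-cloud-logging Python client; the log name and payloads are hypothetical, and exporting these entries to BigQuery or Cloud Storage is then just a matter of configuring a log sink:

    # Minimal sketch: write custom text and structured entries to Cloud Logging
    # (log name and payloads are hypothetical).
    import google.cloud.logging

    client = google.cloud.logging.Client()       # uses Application Default Credentials
    logger = client.logger("my-app-log")         # hypothetical log name

    logger.log_text("User signup completed", severity="INFO")
    logger.log_struct(
        {"event": "signup", "user_id": "u-123", "latency_ms": 84},
        severity="DEBUG",
    )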

All the Best !!

Google Cloud Pub/Sub – Asynchronous Messaging

Google Cloud Pub/Sub

  • Pub/Sub is a fully managed, asynchronous messaging service designed to be highly reliable and scalable.
  • Pub/Sub service allows applications to exchange messages reliably, quickly, and asynchronously
  • Pub/Sub allows services to communicate asynchronously, with latencies on the order of 100 milliseconds.
  • Pub/Sub enables the creation of event producers and consumers, called publishers and subscribers.
  • Publishers communicate with subscribers asynchronously by broadcasting events, rather than by synchronous remote procedure calls.
  • Pub/Sub offers at-least-once message delivery and best-effort ordering to existing subscribers
  • Pub/Sub accepts a maximum of 1,000 messages in a batch, and the size of a batch cannot exceed 10 megabytes.

Pub/Sub Core Concepts

  • Topic: A named resource to which messages are sent by publishers.
  • Publisher: An application that creates and sends messages to a topic(s).
  • Subscriber: An application with a subscription to a topic(s) to receive messages from it.
  • Subscription: A named resource representing the stream of messages from a single, specific topic, to be delivered to the subscribing application.
  • Message: The combination of data and (optional) attributes that a publisher sends to a topic and is eventually delivered to subscribers.
  • Message attribute: A key-value pair that a publisher can define for a message.
  • Acknowledgment (or “ack”): A signal sent by a subscriber to Pub/Sub after it has received a message successfully. Acked messages are removed from the subscription’s message queue.
  • Schema: A schema is a format that messages must follow, creating a contract between publisher and subscriber that Pub/Sub will enforce
  • Push and pull: The two message delivery methods. A subscriber receives messages either by Pub/Sub pushing them to the subscriber’s chosen endpoint or by the subscriber pulling them from the service (see the sketch below).
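
A minimal sketch of the publish/pull/ack flow using the google-cloud-pubsub Python client; the project, topic, and subscription IDs are hypothetical and both resources are assumed to already exist:

    # Minimal sketch: publish a message with an attribute, then pull and ack it.
    from concurrent.futures import TimeoutError
    from google.cloud import pubsub_v1

    project_id, topic_id, subscription_id = "my-project", "my-topic", "my-sub"

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    future = publisher.publish(topic_path, b"order received", event_type="order")
    print("published message id:", future.result())   # blocks until Pub/Sub confirms

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(project_id, subscription_id)

    def callback(message):
        print("received:", message.data, dict(message.attributes))
        message.ack()                                  # acked messages are removed from the subscription

    streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
    with subscriber:
        try:
            streaming_pull_future.result(timeout=30)   # listen for 30 seconds
        except TimeoutError:
            streaming_pull_future.cancel()
            streaming_pull_future.result()             # block until the shutdown completes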

Message lifecycle

Pub/Sub Subscription Properties

  • Delivery method
    • Messages can be received with pull or push delivery.
    • In pull delivery, the subscriber application initiates requests to the Pub/Sub server to retrieve messages.
    • In push delivery, Pub/Sub initiates requests to the subscriber application to deliver messages. The push endpoint must be a publicly accessible HTTPS address.
    • If unspecified, Pub/Sub subscriptions use pull delivery.
    • Messages published before a subscription is created will not be delivered to that subscription
  • Acknowledgment deadline
    • A message not acknowledged before the deadline is sent again.
    • Default acknowledgment deadline is 10 secs. with a max of 10 mins.
  • Message retention duration
    • Message retention duration specifies how long Pub/Sub retains messages after publication.
    • Acknowledged messages are no longer available to subscribers and are deleted, by default
    • After the message retention duration, Pub/Sub might discard the message, regardless of its acknowledgment state.
    • Default message retention duration is 7 days with a min-max of 10 mins – 7 days
  • Dead-letter topics
    • If a subscriber can’t acknowledge a message, Pub/Sub can forward the message to a dead-letter topic.
    • With a dead-letter topic, message ordering can’t be enabled
    • With a dead-letter topic, the maximum number of delivery attempts can be specified.
    • Default is 5 delivery attempts; with a min-max of 5-100
  • Expiration period
    • Subscriptions expire if there is no subscriber activity such as open connections, active pulls, or successful pushes
    • The subscription deletion clock restarts if subscriber activity is detected
    • Default expiration period is 31 days with a min-max of 1 day-never
  • Retry policy
    • If the acknowledgment deadline expires or a subscriber responds with a negative acknowledgment, Pub/Sub can send the message again using exponential backoff.
    • If the retry policy isn’t set, Pub/Sub resends the message as soon as the acknowledgment deadline expires or a subscriber responds with a negative acknowledgment.
  • Message ordering
    • If publishers in the same region send messages with an ordering key and message ordering is enabled, Pub/Sub delivers the messages in order (see the sketch after this list).
    • If not set, Pub/Sub doesn’t deliver messages in order, including messages with ordering keys.
  • Filter
    • Filter is a string with a filtering expression where the subscription only delivers the messages that match the filter.
    • Pub/Sub service automatically acknowledges the messages that don’t match the filter.
    • Messages can be filtered using their attributes.
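
A minimal sketch of ordered publishing with the Python client; ordering must also be enabled on the subscription, publishers must publish to the same region, and the project/topic IDs are hypothetical:

    # Minimal sketch: publish messages with an ordering key so that a subscription
    # with message ordering enabled receives them in publish order.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient(
        publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
    )
    topic_path = publisher.topic_path("my-project", "my-topic")

    for seq in range(3):
        publisher.publish(
            topic_path,
            f"event-{seq}".encode(),
            ordering_key="customer-42",   # all messages with this key are delivered in order
        )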

Pub/Sub Seek Feature

  • Acknowledged messages are no longer available to subscribers and are deleted
  • Subscriber clients must process every message in a subscription even if only a subset is needed.
  • Seek feature extends subscriber functionality by allowing you to alter the acknowledgment state of messages in bulk (see the sketch after this list)
  • Timestamp Seeking
    • With Seek feature, you can replay previously acknowledged messages or purge messages in bulk
    • Seeking to a time marks every message received by Pub/Sub before the time as acknowledged, and all messages received after the time as unacknowledged.
    • Seeking to a time in the future allows you to purge messages.
      • Seeking to a time in the past allows replaying and reprocessing previously acknowledged messages
    • Timestamp seeking approach is imprecise as
      • Possible clock skew among Pub/Sub servers.
      • Pub/Sub has to work with the arrival time of the publish request rather than when an event occurred in the source system.
  • Snapshot Seeking
    • State of one subscription can be copied to another by using seek in combination with a Snapshot.
    • Once a snapshot is created, it retains:
      • All messages that were unacknowledged in the source subscription at the time of the snapshot’s creation.
      • Any messages published to the topic thereafter.
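
A minimal sketch of a timestamp seek with the Python client; the project/subscription IDs are hypothetical, and replaying previously acknowledged messages additionally requires the subscription to be configured to retain acknowledged messages:

    # Minimal sketch: seek a subscription one hour back in time to replay messages.
    from datetime import datetime, timedelta, timezone

    from google.cloud import pubsub_v1
    from google.protobuf import timestamp_pb2

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path("my-project", "my-sub")

    ts = timestamp_pb2.Timestamp()
    ts.FromDatetime(datetime.now(timezone.utc) - timedelta(hours=1))

    # Messages received before this time are marked acknowledged; messages
    # received after it are marked unacknowledged, i.e. they are redelivered.
    subscriber.seek(request={"subscription": subscription_path, "time": ts})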

Pub/Sub Locations

  • Pub/Sub servers run in all GCP regions around the world, which helps offer fast, global data access while giving users control over where messages are stored
  • Cloud Pub/Sub offers global data access in that publisher and subscriber clients are not aware of the location of the servers to which they connect or how the service routes the data.
  • Pub/Sub’s load balancing mechanisms direct publisher traffic to the nearest GCP data center where data storage is allowed, as defined in the Resource Location Restriction
  • Publishers in multiple regions may publish messages to a single topic with low latency. Any individual message is stored in a single region. However, a topic may have messages stored in many regions.
  • A subscriber client requesting messages published to a topic connects to the nearest server, which aggregates messages published to the topic from all regions for delivery to the client.
  • Message Storage Policy
    • Message Storage Policy helps ensure that messages published to a topic are never persisted outside a set of specified Google Cloud regions, regardless of where the publish requests originate.
    • Pub/Sub chooses the nearest allowed region, when multiple regions are allowed by the policy

Pub/Sub Security

  • Pub/Sub encrypts messages with Google-managed keys, by default.
  • Every message is encrypted at the following states and layers:
    • At rest
      • Hardware layer
      • Infrastructure layer
      • Application layer
        • Pub/Sub individually encrypts incoming messages as soon as the message is received
    • In transit
  • Pub/Sub does not encrypt message attributes at the application layer.
  • Message attributes are still encrypted at the hardware and infrastructure layers.

Common use cases

  • Ingesting user interaction and server events
  • Real-time event distribution
  • Replicating data among databases
  • Parallel processing and workflows
  • Data streaming from IoT devices
  • Refreshing distributed caches
  • Load balancing for reliability

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.

References

Google_Cloud_Pub/Sub

Google Kubernetes Engine – GKE Security

GKE Security

Google Kubernetes Engine – GKE Security provides multiple layers of security to secure workloads including the contents of the container image, the container runtime, the cluster network, and access to the cluster API server.

Authentication and Authorization

  • Kubernetes supports two types of authentication:
    • User accounts are accounts that are known to Kubernetes but are not managed by Kubernetes
    • Service accounts are accounts that are created and managed by Kubernetes but can only be used by Kubernetes-created entities, such as pods.
  • In a GKE cluster, Kubernetes user accounts are managed by Google Cloud and can be of the following types
    • Google Account
    • Google Cloud service account
  • Once authenticated, these identities need to be authorized to create, read, update or delete Kubernetes resources.
  • Kubernetes service accounts and Google Cloud service accounts are different entities.
    • Kubernetes service accounts are part of the cluster in which they are defined and are typically used within that cluster.
    • Google Cloud service accounts are part of a Google Cloud project and can easily be granted permissions both within clusters and to the clusters themselves, as well as to any Google Cloud resource, using IAM.

Control Plane Security

  • In GKE, the Kubernetes control plane components are managed and maintained by Google.
  • Control plane components host the software that runs the Kubernetes control plane, including the API server, scheduler, controller manager, and the etcd database where the Kubernetes configuration is persisted.
  • By default, the control plane components use a public IP address.
  • The Kubernetes API server can be protected by using authorized networks and private clusters, which assign a private IP address to the control plane and disable access on the public IP address.
  • Control plane can also be secured by doing credential rotation on a regular basis. When credential rotation is initiated, the SSL certificates and cluster certificate authority are rotated. This process is automated by GKE and also ensures that your control plane IP address rotates.

Node Security

Container-Optimized OS

  • GKE nodes, by default, use Google’s Container-Optimized OS as the operating system on which to run Kubernetes and its components.
  • Container-Optimized OS features include
    • Locked-down firewall
    • Read-only filesystem where possible
    • Limited user accounts and disabled root login

Node upgrades

  • GKE recommends upgrading nodes on a regular basis to patch the OS for security issues in the container runtime, Kubernetes itself, or the node operating system
  • GKE also allows automatic as well as manual upgrades

Protecting nodes from untrusted workloads

  • GKE Sandbox can be enabled on the cluster to isolate untrusted workloads in sandboxes on the node if the clusters run unknown or untrusted workloads.
  • GKE Sandbox is built using gVisor, an open-source project.

Securing instance metadata

  • GKE nodes run as CE instances and have access to instance metadata by default, which a Pod running on the node does not necessarily need.
  • Sensitive instance metadata paths can be locked down by disabling legacy APIs and by using metadata concealment.
  • Metadata concealment ensures that Pods running in a cluster are not able to access sensitive data by filtering requests to fields such as the kube-env

Network Security

  • Network Policies help cluster administrators and users to lock down the ingress and egress connections created to and from the Pods in a namespace
  • Network policies use Pod and namespace label selectors to define the traffic allowed to and from the Pods.
  • mTLS for Pod-to-Pod communication can be enabled using the Istio service mesh

Giving Pods Access to Google Cloud resources

Workload Identity (recommended)

  • Simplest and most secure way to authorize Pods to access Google Cloud resources is with Workload Identity.
  • Workload identity allows a Kubernetes service account to run as a Google Cloud service account.
  • Pods that run as the Kubernetes service account have the permissions of the Google Cloud service account.

Node Service Account

  • Pods can authenticate to Google Cloud using the node’s service account credentials exposed through the instance metadata server.
  • Node Service Account credentials can be reached by any Pod running in the cluster if Workload Identity is not enabled.
  • It is recommended to create and configure a custom service account that has the minimum IAM roles required by all the Pods running in the cluster.

Service Account JSON key

  • Applications can access Google Cloud resources by using the service account’s key.
  • This approach is NOT recommended because of the difficulty of securely managing account keys.
  • Each service account is assigned only the IAM roles that are needed for its paired application to operate successfully. Keeping the service account application-specific makes it easier to revoke its access in the case of a compromise without affecting other applications.
  • A JSON service account key can be created and then mounted into the Pod using a Kubernetes Secret.

Binary Authorization

  • Binary Authorization works with images deployed to GKE from Container Registry or another container image registry.
  • Binary Authorization helps ensure that internal processes that safeguard the quality and integrity of the software have successfully completed before an application is deployed to the production environment.
  • Binary Authorization provides:
    • A policy model that lets you describe the constraints under which images can be deployed
    • An attestation model that lets you define trusted authorities who can attest or verify that required processes in your environment have completed before deployment
    • A deploy-time enforcer that prevents images that violate the policy from being deployed

GCP Certification Exam Practice Questions

  1. You are building a product on top of Google Kubernetes Engine (GKE). You have a single GKE cluster. For each of your customers, a Pod is running in that cluster, and your customers can run arbitrary code inside their Pod. You want to maximize the isolation between your customers’ Pods. What should you do?
    1. Use Binary Authorization and whitelist only the container images used by your customers’ Pods.
    2. Use the Container Analysis API to detect vulnerabilities in the containers used by your customers’ Pods.
    3. Create a GKE node pool with a sandbox type configured to gVisor. Add the parameter runtimeClassName: gvisor to the specification of your customers’ Pods.
    4. Use the cos_containerd image for your GKE nodes. Add a nodeSelector with the value cloud.google.com/gke-os-distribution: cos_containerd to the specification of your customers’ Pods.

References

Google Kubernetes Engine – Security Overview

Google Cloud Functions

Google Cloud Functions

  • Cloud Functions is a serverless execution environment for building and connecting cloud services
  • Cloud Functions provide scalable pay-as-you-go functions as a service (FaaS) to run code with zero server management.
  • Cloud Functions are attached to events emitted from the cloud services and infrastructure and are triggered when an event being watched is fired.
  • Cloud Functions supports multiple language runtimes including Node.js, Python, Go, Java, .Net, Ruby, PHP, etc.
  • Cloud Functions features include
    • Zero server management
      • No servers to provision, manage or upgrade
      • Google Cloud handles the operational infrastructure including managing servers, configuring software, updating frameworks, and patching operating systems
      • Provisioning of resources happens automatically in response to events
    • Automatically scale based on the load
      • Cloud Function can scale from a few invocations a day to many millions of invocations without any work from you.
    • Integrated monitoring, logging, and debugging capability
    • Built-in security at role and per function level based on the principle of least privilege
      • Cloud Functions uses Google Service Account credential to seamlessly authenticate with the majority of Google Cloud services
    • Key networking capabilities for hybrid and multi-cloud scenarios

Cloud Functions Execution Environment

  • Cloud Functions handles an incoming request by assigning it to an instance of the function; depending on the request volume and the number of existing instances, it either reuses an existing instance or spawns a new one.
  • Each instance of a function handles only one concurrent request at a time and can use the full amount of resources i.e. CPU and Memory
  • Cloud Functions may start multiple new instances to handle requests, thus provide auto-scaling and parallelism.
  • Cloud Functions must be stateless i.e. one function invocation should not rely on an in-memory state set by a previous invocation, to allow Google to automatically manage and scale the functions
  • Every deployed function is isolated from all other functions – even those deployed from the same source file. In particular, they don’t share memory, global variables, file systems, or other state.
  • Cloud Functions allows you to set a limit on the total number of function instances that can co-exist at any given time
  • A Cloud Function instance is created when the function is deployed or when it needs to be scaled up
  • Cloud Functions can have a Cold Start, which is the time involved in loading the runtime and the code.
  • Function execution time is limited by the timeout duration specified at function deployment time. By default, a function times out after 1 minute but can be extended up to 9 minutes.
  • Cloud Function provides a writeable filesystem i.e. /tmp directory only, which can be used to store temporary files in a function instance.  The rest of the file system is read-only and accessible to the function
  • Cloud Functions has 2 scopes
    • Global Scope
      • contain the function definition,
      • is executed on every cold start, but not if the instance has already been initialized.
      • can be used for initialization like database connections etc. (see the sketch after this list)
    • Function Scope
      • only the body of the function declared as the entry point
      • is executed for each request and should include the actual logic
  • Cloud Functions Execution Guarantees
    • Functions are typically invoked once for each incoming event. However, Cloud Functions does not guarantee a single invocation in all cases
    • HTTP functions are invoked at most once as they are synchronous and the execution is not retried in the event of a failure
    • Event-driven functions are invoked at least once as they are asynchronous and can be retried
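
A minimal sketch of an HTTP-triggered Python function showing the two scopes: an expensive client is lazily initialized in the global scope so warm instances reuse it, while request handling stays in the function scope (the bucket name and function name are hypothetical):

    # Minimal sketch of an HTTP Cloud Function (Python runtime).
    import json
    from google.cloud import storage

    # Global scope: runs on cold start and is reused by later invocations on the
    # same instance - a good place for expensive objects such as API clients.
    _storage_client = None


    def _get_client():
        global _storage_client
        if _storage_client is None:          # lazy initialization (best practice)
            _storage_client = storage.Client()
        return _storage_client


    def list_uploads(request):
        """Function scope: the HTTP entry point, executed for every request."""
        bucket_name = request.args.get("bucket", "my-uploads-bucket")
        names = [b.name for b in _get_client().list_blobs(bucket_name, max_results=10)]
        return json.dumps({"objects": names}), 200, {"Content-Type": "application/json"}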

Cloud Functions Events and Triggers

  • Events are things that happen within the cloud environment that you might want to take action on.
  • Trigger is creating a response to that event. Trigger type determines how and when the function executes.
  • Cloud Functions supports the following native trigger mechanisms:
    • HTTP Triggers
      • Cloud Functions can be invoked with an HTTP request using the POST, PUT, GET, DELETE, and OPTIONS HTTP methods
      • HTTP invocations are synchronous and the result of the function execution will be returned in the response to the HTTP request.
    • Cloud Endpoints Triggers
      • Cloud Functions can be invoked through Cloud Endpoints, which uses the Extensible Service Proxy V2 (ESPv2) as an API gateway
      • ESPv2 intercepts all requests to the functions and performs any necessary checks (such as authentication) before invoking the function. ESPv2 also gathers and reports telemetry
    • Cloud Pub/Sub Triggers
      • Cloud Functions can be triggered by messages published to Pub/Sub topics in the same Cloud project as the function (see the sketch after this list).
      • Pub/Sub is a globally distributed message bus that automatically scales as needed and provides a foundation for building robust, global services.
    • Cloud Storage Triggers
      • Cloud Functions can respond to change notifications emerging from Google Cloud Storage.
      • Notifications can be configured to trigger in response to various events inside a bucket – object creation, deletion, archiving, and metadata updates.
      • Cloud Functions can only be triggered by Cloud Storage buckets in the same Google Cloud Platform project.
    • Direct Triggers
      • Cloud Functions provides a call command in the command-line interface and testing functionality in the Cloud Console UI to support quick iteration and debugging
      • Function can be directly invoked to ensure it is behaving as expected. This causes the function to execute immediately, even though it may have been deployed to respond to a specific event.
    • Cloud Firestore
      • Cloud Functions can handle events in Cloud Firestore in the same Cloud project as the function.
      • Cloud Firestore can be read or updated in response to these events using the Firestore APIs and client libraries.
    • Analytics for Firebase
    • Firebase Realtime Database
    • Firebase Authentication
      • Cloud Functions can be triggered by events from Firebase Authentication in the same Cloud project as the function.
  • Cloud Functions can also be integrated with any other Google service that supports Cloud Pub/Sub for e.g. Cloud Scheduler, or any service that provides HTTP callbacks (webhooks)
  • Google Cloud Logging events can be exported to a Cloud Pub/Sub topic from which they can then be consumed by Cloud Functions.
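
A minimal sketch of an event-driven Python function triggered by a Pub/Sub topic; the message payload arrives base64-encoded in the event, and since event-driven functions are invoked at least once, the handler should be idempotent (names and payload are hypothetical):

    # Minimal sketch of a background Cloud Function with a Pub/Sub trigger.
    import base64
    import json


    def handle_order_event(event, context):
        """Entry point for a Pub/Sub trigger; may be invoked more than once per message."""
        payload = base64.b64decode(event["data"]).decode("utf-8") if "data" in event else "{}"
        attributes = event.get("attributes") or {}

        order = json.loads(payload)          # hypothetical JSON message body
        print(f"event_id={context.event_id} type={attributes.get('event_type')} order={order}")
        # Keep the processing idempotent: the same message may be delivered again.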

Cloud Functions Best Practices

  • Write idempotent functions – they should produce the same result when invoked multiple times with the same parameters
  • Do not start background activities i.e. activity that runs after the function has terminated. Any code run after graceful termination cannot access the CPU and will not make any progress.
  • Always delete temporary files – As files can persist between invocations, failing to delete files may lead to memory issues
  • Use dependencies wisely – Import only what is required as it would impact the cold starts due to invocation latency
  • Use global variables to reuse objects in future invocations for e.g. database connections
  • Do lazy initialization of global variables
  • Use retry to handle only transient and retryable errors, with the handling being idempotent


Reference

Google_Cloud_Functions

Google Cloud – HipLocal Case Study

Google Cloud – HipLocal Case Study

HipLocal is a community application designed to facilitate communication between people in close proximity. It is used for event planning and organizing sporting events, and for businesses to connect with their local communities. HipLocal launched recently in a few neighborhoods in Dallas and is rapidly growing into a global phenomenon. Its unique style of hyper-local community communication and business outreach is in demand around the world.

Key point here is HipLocal is expanding globally

HipLocal Solution Concept

HipLocal wants to expand their existing service with updated functionality in new locations to better serve their global customers. They want to hire and train a new team to support these locations in their time zones. They will need to ensure that the application scales smoothly and provides clear uptime data, and that they analyze and respond to any issues that occur.

Key points here are HipLocal wants to expand globally, with an ability to scale and provide clear observability, alerting and ability to react.

HipLocal Existing Technical Environment

HipLocal’s environment is a mixture of on-premises hardware and infrastructure running in Google Cloud. The HipLocal team understands their application well, but has limited experience in globally scaled applications. Their existing technical environment is as follows:

  • Existing APIs run on Compute Engine virtual machine instances hosted in Google Cloud.
  • State is stored in a single instance MySQL database in Google Cloud.
  • Release cycles include development freezes to allow for QA testing.
  • The application has no consistent logging.
  • Applications are manually deployed by infrastructure engineers during periods of slow traffic on weekday evenings.
  • There are basic indicators of uptime; alerts are frequently fired when the APIs are unresponsive.

Business requirements

HipLocal’s investors want to expand their footprint and support the increase in demand they are experiencing. Their requirements are:

  • Expand availability of the application to new locations.
    • Availability can be achieved using either
      • scaling the application and exposing it through Global Load Balancer OR
      • deploying the applications across multiple regions.
  • Support 10x as many concurrent users.
    • As the APIs run on Compute Engine, the scale can be implemented using Managed Instance Groups fronted by a Load Balancer OR App Engine OR container-based application deployment
    • Scaling policies can be defined to scale as per the demand.
  • Ensure a consistent experience for users when they travel to different locations.
    • Consistent experience for the users can be provided using either
      • Google Cloud Global Load Balancer which uses GFE and routes traffic close to the users
      • multi-region setup targeting each region
  • Obtain user activity metrics to better understand how to monetize their product.
    • User activity data can also be exported to BigQuery for analytics and monetization
    • Cloud Monitoring and Logging can be configured for application logs and metrics to provide observability, alerting, and reporting.
    • Cloud Logging can be exported to BigQuery for analytics
  • Ensure compliance with regulations in the new regions (for example, GDPR).
    • Compliance is a shared responsibility; while Google Cloud ensures compliance of its services, applications hosted on Google Cloud remain the customer’s responsibility
    • GDPR or other data residency regulations can be met using a per-region setup, so that data resides within the region
  • Reduce infrastructure management time and cost.
    • As the infrastructure is spread across on-premises and Google Cloud, it would make sense to consolidate the infrastructure into one place i.e. Google Cloud
    • Consolidation would help in automation, maintenance, as well as provide cost benefits.
  • Adopt the Google-recommended practices for cloud computing:
    • Develop standardized workflows and processes around application lifecycle management.
    • Define service level indicators (SLIs) and service level objectives (SLOs).

Technical requirements

  • Provide secure communications between the on-premises data center and cloud hosted applications and infrastructure
    • Secure communications can be enabled between the on-premises data centers and the cloud using Cloud VPN and Interconnect.
  • The application must provide usage metrics and monitoring.
    • Cloud Monitoring and Logging can be configured for application logs and metrics to provide observability, alerting, and reporting.
  • APIs require authentication and authorization.
    • APIs can be configured for various Authentication mechanisms.
    • APIs can be exposed through a centralized Cloud Endpoints gateway
    • Internal Applications can be exposed using Cloud Identity-Aware Proxy
  • Implement faster and more accurate validation of new features.
    • QA Testing can be improved using automated testing
    • Production Release cycles can be improved using canary deployments to test the applications on a smaller base before rolling out to all.
    • Application can be deployed to App Engine, which supports traffic splitting out of the box for canary releases
  • Logging and performance metrics must provide actionable information to be able to provide debugging information and alerts.
    • Cloud Monitoring and Logging can be configured for application logs and metrics to provide observability, alerting, and reporting.
    • Cloud Logging can be exported to BigQuery for analytics
  • Must scale to meet user demand.
    • As the APIs run on Compute Engine, the scale can be implemented using Managed Instance Groups fronted by a Load Balancer and using scaling policies as per the demand.
    • The single instance MySQL database can be migrated as-is to Cloud SQL without any application code changes, with read replicas added to scale reads horizontally.

GCP Certification Exam Practice Questions

  • Questions are collected from Internet and the answers are marked as per my knowledge and understanding (which might differ with yours).
  • GCP services are updated everyday and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep up the pace with GCP updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Which database should HipLocal use for storing state while minimizing application changes?
    1. Firestore
    2. BigQuery
    3. Cloud SQL
    4. Cloud Bigtable
  2. Which architecture should HipLocal use for log analysis?
    1. Use Cloud Spanner to store each event.
    2. Start storing key metrics in Memorystore.
    3. Use Cloud Logging with a BigQuery sink.
    4. Use Cloud Logging with a Cloud Storage sink.
  3. HipLocal wants to improve the resilience of their MySQL deployment, while also meeting their business and technical requirements. Which configuration should they choose?
    1. Use the current single instance MySQL on Compute Engine and several read-only MySQL servers on Compute Engine.
    2. Use the current single instance MySQL on Compute Engine, and replicate the data to Cloud SQL in an external master configuration.
    3. Replace the current single instance MySQL instance with Cloud SQL, and configure high availability.
    4. Replace the current single instance MySQL instance with Cloud SQL, and Google provides redundancy without further configuration.
  4. Which service should HipLocal use to enable access to internal apps?
    1. Cloud VPN
    2. Cloud Armor
    3. Virtual Private Cloud
    4. Cloud Identity-Aware Proxy
  5. Which database should HipLocal use for storing user activity?
    1. BigQuery
    2. Cloud SQL
    3. Cloud Spanner
    4. Cloud Datastore

Reference

Case_Study_HipLocal