Google Cloud Certified – Cloud Digital Leader Learning Path

Google Cloud Certified - Cloud Digital Leader Certificate

Google Cloud – Cloud Digital Leader Certification Learning Path

Continuing on the Google Cloud journey, glad to have passed the seventh certification with the Cloud Digital Leader certification. Google Cloud was missing an entry-level certification similar to the AWS Cloud Practitioner certification, and the Cloud Digital Leader certification was introduced to fill that gap. Cloud Digital Leader focuses on general cloud knowledge and Google Cloud knowledge of its products and services.

Google Cloud – Cloud Digital Leader Certification Summary

  • Had 59 questions (a somewhat odd number!!) to be answered in 90 minutes.
  • Covers a wide range of General Cloud and Google Cloud services and products knowledge.
  • The exam does not require much hands-on experience; theoretical knowledge is good enough to clear it.

Google Cloud – Cloud Digital Leader Certification Resources

Google Cloud – Cloud Digital Leader Certification Topics

General cloud knowledge

  1. Define basic cloud technologies. Considerations include:
    1. Differentiate between traditional infrastructure, public cloud, and private cloud
      1. Traditional infrastructure includes on-premises data centers
      2. Public clouds include Google Cloud, AWS, and Azure
      3. Private Cloud includes services like AWS Outposts
    2. Define cloud infrastructure ownership
    3. Shared Responsibility Model
      1. Security of the Cloud is Google Cloud’s responsibility
      2. Security in the Cloud depends on the services used and is shared between Google Cloud and the Customer
    4. Essential characteristics of cloud computing
      1. On-demand computing
      2. Pay-as-you-use
      3. Scalability and Elasticity
      4. High Availability and Resiliency
      5. Security
  2. Differentiate cloud service models. Considerations include:
    1. Infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS)
      1. IaaS – everything is managed by you – more flexibility, more management
      2. PaaS – most of the things are done by the Cloud with a few things done by you – moderate flexibility and management
      3. SaaS – everything is taken care of by the Cloud, you just use it – no flexibility and management
    2. Describe the trade-offs between level of management versus flexibility when comparing cloud services
    3. Define the trade-offs between costs versus responsibility
    4. Appropriate implementation and alignment with given budget and resources
  3. Identify common cloud procurement financial concepts. Considerations include:
    1. Operating expenses (OpEx), capital expenditures (CapEx), and total cost of operations (TCO)
      1. On-premises has more CapEx and less OpEx
      2. Cloud has little to no CapEx and more OpEx
    2. Recognize the relationship between OpEx and CapEx related to networking and compute infrastructure
    3. Summarize the key cost differentiators between cloud and on-premises environments

General Google Cloud knowledge

  1. Recognize how Google Cloud meets common compliance requirements. Considerations include:
    1. Locating current Google Cloud compliance requirements
    2. Familiarity with Compliance Reports Manager
  2. Recognize the main elements of Google Cloud resource hierarchy. Considerations include:
    1. Describe the relationship between organization, folders, projects, and resources i.e. Organization -> Folder -> Folder or Projects -> Resources
  3. Describe controlling and optimizing Google Cloud costs. Considerations include:
    1. Google Cloud billing models and applicability to different service classes
    2. Define a consumption-based use model
    3. Application of discounts (e.g., flat-rate, committed-use discounts [CUD], sustained-use discounts [SUD])
      1. Sustained-use discounts [SUD] are automatic discounts for running specific resources for a significant portion of the billing month
      2. Committed use discounts [CUD] provide deeply discounted prices for VM usage in return for committing to a usage contract
  4. Describe Google Cloud’s geographical segmentation strategy. Considerations include:
    1. Regions are collections of zones. Zones have high-bandwidth, low-latency network connections to other zones in the same region. Regions help design fault-tolerant and highly available solutions.
    2. Zones are deployment areas within a region and provide the lowest latency, usually less than 10 ms
    3. Regional resources are accessible by any resources within the same region
    4. Zonal resources are hosted in a zone and are called per-zone resources.
    5. Multiregional resources or Global resources are accessible by any resource in any zone within the same project.
  5. Define Google Cloud support options. Considerations include:
    1. Distinguish between billing support, technical support, role-based support, and enterprise support
      1. Role-Based Support provides more predictable rates and a flexible configuration. Although they are legacy, the exam does cover these.
      2. Enterprise Support provides the fastest case response times and a dedicated Technical Account Management (TAM) contact who helps you execute a Google Cloud strategy.
    2. Recognize a variety of Service Level Agreement (SLA) applications

Google Cloud products and services

  1. Describe the benefits of Google Cloud virtual machine (VM)-based compute options. Considerations include:
    1. Compute Engine provides virtual machines (VM) hosted on Google’s infrastructure.
    2. Google Cloud VMware Engine helps easily lift and shift VMware-based applications to Google Cloud without changes to the apps, tools, or processes
    3. Bare Metal lets businesses run specialized workloads such as Oracle databases close to Google Cloud while lowering overall costs and reducing risks associated with migration
    4. Custom versus standard sizing
    5. Free, premium, and custom service options
    6. Attached storage/disk options
    7. Preemptible VMs are instances that can be created and run at a much lower price than normal instances.
  2. Identify and evaluate container-based compute options. Considerations include:
    1. Define the function of a container registry
      1. Container Registry is a single place to manage Docker images, perform vulnerability analysis, and decide who can access what with fine-grained access control.
    2. Distinguish between VMs, containers, and Google Kubernetes Engine
  3. Identify and evaluate serverless compute options. Considerations include:
    1. Define the function and use of App Engine, Cloud Functions, and Cloud Run
    2. Define rationale for versioning with serverless compute options
    3. Cost and performance tradeoffs of scale to zero
      1. Scale to zero helps provide cost efficiency by scaling down to zero when there is no load, but comes with the issue of cold starts
      2. Serverless technologies like Cloud Functions, Cloud Run, and App Engine Standard provide these capabilities
  4. Identify and evaluate multiple data management offerings. Considerations include:
    1. Describe the differences and benefits of Google Cloud’s relational and non-relational database offerings
      1. Cloud SQL provides fully managed, relational SQL databases and offers MySQL, PostgreSQL, MSSQL databases as a service
      2. Cloud Spanner provides fully managed, relational SQL databases with joins and secondary indexes
      3. Cloud Bigtable provides a scalable, fully managed, non-relational NoSQL wide-column analytical big data database service suitable for low-latency single-point lookups and precalculated analytics
      4. BigQuery provides fully managed, no-ops, OLAP, enterprise data warehouse (EDW) with SQL and fast ad-hoc queries.
    2. Describe Google Cloud’s database offerings and how they compare to commercial offerings
  5. Distinguish between ML/AI offerings. Considerations include:
    1. Describe the differences and benefits of Google Cloud’s hardware accelerators (e.g., Vision API, AI Platform, TPUs)
    2. Identify when to train your own model, use a Google Cloud pre-trained model, or build on an existing model
      1. Vision API provides out-of-the-box pre-trained models to extract data from images
      2. AutoML provides the ability to train models
      3. BigQuery ML supports a limited set of models and provides a SQL interface
  6. Differentiate between data movement and data pipelines. Considerations include:
    1. Describe Google Cloud’s data pipeline offerings
      1. Cloud Pub/Sub provides reliable, many-to-many, asynchronous messaging between applications. By decoupling senders and receivers, Google Cloud Pub/Sub allows developers to communicate between independently written applications.
      2. Cloud Dataflow is a fully managed service for strongly consistent, parallel data-processing pipelines
      3. Cloud Data Fusion is a fully managed, cloud-native, enterprise data integration service for quickly building & managing data pipelines
      4. BigQuery Service is a fully managed, highly scalable data analysis service that enables businesses to analyze Big Data.
      5. Looker provides an enterprise platform for business intelligence, data applications, and embedded analytics.
    2. Define data ingestion options
  7. Apply use cases to a high-level Google Cloud architecture. Considerations include:
    1. Define Google Cloud’s offerings around the Software Development Life Cycle (SDLC)
    2. Describe Google Cloud’s platform visibility and alerting offerings – covers Cloud Monitoring and Cloud Logging
  8. Describe solutions for migrating workloads to Google Cloud. Considerations include:
    1. Identify data migration options
    2. Differentiate when to use Migrate for Compute Engine versus Migrate for Anthos
      1. Migrate for Compute Engine provides fast, flexible, and safe migration to Google Cloud
      2. Migrate for Anthos and GKE makes it fast and easy to modernize traditional applications away from virtual machines and into native containers. This significantly reduces the cost and labor that would be required for a manual application modernization project.
    3. Distinguish between lift and shift versus application modernization
      1. Lift and shift involves migrating with zero to minimal changes and is usually performed under time constraints
      2. Application modernization requires a redesign of infra and applications and takes time. It can include moving legacy monolithic architecture to microservices architecture, building CI/CD pipelines for automated builds and deployments, frequent releases with zero downtime, etc.
  9. Describe networking to on-premises locations. Considerations include:
    1. Define Software-Defined WAN (SD-WAN) – did not have any questions regarding the same.
    2. Determine the best connectivity option based on networking and security requirements – covers Cloud VPN, Interconnect, and Peering.
    3. Private Google Access provides access from VM instances to Google-provided services like Cloud Storage or third-party provided services
  10. Define identity and access features. Considerations include:
    1. Cloud Identity & Access Management (Cloud IAM) provides administrators the ability to manage cloud resources centrally by controlling who can take what action on specific resources.
    2. Google Cloud Directory Sync enables administrators to synchronize users, groups, and other data from an Active Directory/LDAP service to their Google Cloud domain directory.

Google Cloud Compute Options

Google Cloud Compute Options

Compute Engine

  • provides Infrastructure as a Service (IaaS) in the Google Cloud
  • provides full control and flexibility over the choice of OS and resources like CPU and memory
  • Usage patterns
    • lift and shift migrations of existing systems
    • existing VM images to move to the cloud
    • need low-level access to or fine-grained control of the operating system, network, and other operational characteristics.
    • require custom kernel or arbitrary OS
    • software that can’t be easily containerized
    • using third-party licensed software
  • Usage anti-patterns
    • containerized applications – Choose App Engine, GKE, or Cloud Run
    • stateless event-driven applications – Choose Cloud Functions

App Engine

  • helps build highly scalable web and mobile backend applications on a fully managed serverless platform
  • Usage patterns
    • Rapidly developing CRUD-heavy applications
    • HTTP/S based applications
    • Deploying complex APIs
  • Usage anti-patterns
    • Stateful applications requiring lots of in-memory states to meet the performance or functional requirements
    • Systems that require protocols other than HTTP

Google Kubernetes Engine – GKE

  • provides a managed environment for deploying, managing, and scaling containerized applications using Google infrastructure.
  • Usage patterns
    • containerized applications or those that can be easily containerized
    • Hybrid or multi-cloud environments
    • Systems leveraging stateful and stateless services
    • Strong CI/CD Pipelines
  • Usage anti-patterns
    • non-containerized applications – Choose CE or App engine
    • applications requiring very low-level access to the underlying hardware like custom kernel, networking, etc. – Choose CE
    • stateless event-driven applications – Choose Cloud Functions

Cloud Run

  • provides a serverless managed compute platform to run stateless, isolated containers without orchestration that can be invoked via web requests or Pub/Sub events.
  • abstracts away all infrastructure management allowing users to focus on building great applications.
  • is built on Knative.
  • Usage patterns
    • Stateless services that are easily containerized
    • Event-driven applications and systems
    • Applications that require custom system and language dependencies
  • Usage anti-patterns
    • Highly stateful systems
    • Systems that require protocols other than HTTP
    • Compliance requirements that demand strict controls over the low-level environment and infrastructure (might be okay with the Knative GKE mode)

Cloud Functions

  • provides serverless compute for event-driven apps
  • Usage patterns
    • ephemeral and event-driven applications and functions
    • fully managed environment
    • pay only for what you use
    • quick data transformations (ETL)
  • Usage anti-patterns
    • continuously running stateful applications – Choose CE, App Engine, or GKE
Credit @ https://thecloudgirl.dev/
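
As an illustration of the event-driven, scale-to-zero pattern above, here is a minimal sketch of a Cloud Function written with the open-source Functions Framework for Python; the Cloud Storage trigger, function name, and thumbnail step are illustrative assumptions, not something prescribed by the exam guide.

# Minimal sketch, assuming a Cloud Storage "object finalized" trigger delivering a CloudEvent.
import functions_framework

@functions_framework.cloud_event
def on_image_upload(cloud_event):
    """Runs whenever an object is finalized in the configured bucket."""
    data = cloud_event.data
    bucket = data["bucket"]
    name = data["name"]
    # Thumbnail generation/resizing would go here (e.g. with Pillow);
    # omitted to keep the sketch focused on the event wiring.
    print(f"Received new object gs://{bucket}/{name}")

Deployed behind a Cloud Storage trigger, such a function scales out with incoming events and back to zero when idle, at the cost of possible cold starts.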

Google Cloud Compute Options Decision Tree

Google Cloud Compute Options Decision Tree

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion, and correction.
  1. Your organization is developing a new application. This application responds to events created by already running applications. The business goal for the new application is to scale to handle spikes in the flow of incoming events while minimizing administrative work for the team. Which Google Cloud product or feature should you choose?
    1. Cloud Run
    2. Cloud Run for Anthos
    3. App Engine standard environment
    4. Compute Engine
  2. A company wants to build an application that stores images in a Cloud Storage bucket and wants to generate thumbnails as well as resize the images. They want to use a managed service that will scale automatically from zero and back to zero. Which GCP service satisfies the requirement?
    1. Google Compute Engine
    2. Google Kubernetes Engine
    3. Google App Engine
    4. Cloud Functions

Google Cloud Composer

Cloud Composer

  • Cloud Composer is a fully managed workflow orchestration service, built on Apache Airflow, enabling workflow creation that spans across clouds and on-premises data centers.
  • Cloud Composer requires no installation and has no management overhead.
  • Cloud Composer integrates with Cloud Logging and Cloud Monitoring to provide a central place to view all Airflow service and workflow logs.

Cloud Composer Components

  • Cloud Composer helps define a series of tasks as a Workflow executed within an Environment
  • Workflows are created using DAGs, or Directed Acyclic Graphs
  • DAG is a collection of tasks that are scheduled and executed, organized in a way that reflects their relationships and dependencies.
  • DAGs are stored in Cloud Storage
  • Each Task can represent anything from ingestion, transform, filtering, monitoring, preparing, etc.
  • Environments are self-contained Airflow deployments based on Google Kubernetes Engine, and they work with other Google Cloud services using connectors built into Airflow.
  • Cloud Composer environment is a wrapper around Apache Airflow with components like GKE Cluster, Web Server, Database, Cloud Storage.
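
To make the DAG concept concrete, here is a minimal sketch of an Airflow DAG that Cloud Composer could schedule; the DAG id, schedule, and bash tasks are illustrative assumptions.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_ingest_and_transform",  # assumed name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",           # run once a day
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="echo 'ingest data'")
    transform = BashOperator(task_id="transform", bash_command="echo 'transform data'")

    # The >> operator encodes the relationships and dependencies of the DAG.
    ingest >> transform

Uploading a file like this to the environment's DAGs folder in Cloud Storage is enough for Composer to pick it up and start scheduling it.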

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion, and correction.
  1. Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers. Which cloud-native service should you use to orchestrate the entire pipeline?
    1. Cloud Dataflow
    2. Cloud Composer
    3. Cloud Dataprep
    4. Cloud Dataproc
  2. Your company is working on a multi-cloud initiative. The data processing pipelines require creating workflows that connect data, transfer data, processing, and using services across clouds. What cloud-native tool should be used for orchestration?
    1. Cloud Scheduler
    2. Cloud Dataflow
    3. Cloud Composer
    4. Cloud Dataproc

Google Cloud Dataflow vs Dataproc

Google Cloud Dataflow vs Dataproc

Cloud Dataproc

  • Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
  • Cloud Dataproc provides a Hadoop cluster, on GCP, and access to Hadoop-ecosystem tools (e.g. Apache Pig, Hive, and Spark); this has strong appeal if you are already familiar with Hadoop tools and have existing Hadoop jobs
  • Ideal for Lift and Shift migration of existing Hadoop environment
  • Requires manual provisioning of clusters
  • Consider Dataproc
    • If you have a substantial investment in Apache Spark or Hadoop on-premises and are considering moving to the cloud
    • If you are looking at a Hybrid cloud and need portability across a private/multi-cloud environment
    • If in the current environment Spark is the primary machine learning tool and platform
    • If the code depends on any custom packages along with distributed computing needs

Cloud Dataflow

  • Google Cloud Dataflow is a fully managed, serverless service for unified stream and batch data processing requirements
  • Consider Dataflow when using it as a pre-processing pipeline for an ML model that will be deployed with AI Platform Training (earlier called Cloud ML Engine)
  • Consider Dataflow if none of the above considerations for Cloud Dataproc apply

Cloud Dataflow vs Dataproc Decision Tree

Dataflow vs Dataproc

Dataflow vs Dataproc Table

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion, and correction.
  1. Your company is forecasting a sharp increase in the number and size of Apache Spark and Hadoop jobs being run on your local data center. You want to utilize the cloud to help you scale this upcoming demand with the least amount of operations work and code change. Which product should you use?
    1. Google Cloud Dataflow
    2. Google Cloud Dataproc
    3. Google Compute Engine
    4. Google Container Engine
  2. A startup plans to use a data processing platform, which supports both batch and streaming applications. They would prefer to have a hands-off/serverless data processing platform to start with. Which GCP service is suited for them?
    1. Dataproc
    2. Dataprep
    3. Dataflow
    4. BigQuery

References

Google Cloud BigQuery Data Transfer Service

Cloud BigQuery Data Transfer Service

  • BigQuery Data Transfer Service automates data movement into BigQuery on a scheduled, managed basis
  • After a data transfer is configured, the BigQuery Data Transfer Service automatically loads data into BigQuery on a regular basis.
  • BigQuery Data Transfer Service can also initiate data backfills to recover from any outages or gaps.
  • BigQuery Data Transfer Service can only sink data to BigQuery and cannot be used to transfer data out of BigQuery.
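
A transfer configuration can also be created programmatically. The sketch below uses the BigQuery Data Transfer Service Python client to schedule a recurring load from Cloud Storage; the project, dataset, bucket, and parameter keys are assumptions to illustrate the shape of the call, so verify the exact parameters against the data source documentation.

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()
parent = transfer_client.common_project_path("my-project")  # assumed project

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="analytics",          # assumed dataset
    display_name="Nightly GCS load",
    data_source_id="google_cloud_storage",
    params={
        # Assumed parameter keys for a Cloud Storage transfer.
        "data_path_template": "gs://my-bucket/exports/*.csv",
        "destination_table_name_template": "daily_export",
        "file_format": "CSV",
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=parent,
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")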

BigQuery Data Transfer Service Sources

  • BigQuery Data Transfer Service supports loading data from the following data sources:
    • Google Software as a Service (SaaS) apps
    • Campaign Manager
    • Cloud Storage
    • Google Ad Manager
    • Google Ads
    • Google Merchant Center (beta)
    • Google Play
    • Search Ads 360 (beta)
    • YouTube Channel reports
    • YouTube Content Owner reports
    • External cloud storage providers
      • Amazon S3
    • Data warehouses
      • Teradata
      • Amazon Redshift

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion, and correction.
  1. Your company uses Google Analytics for tracking. You need to export the session and hit data from a Google Analytics 360 reporting view on a scheduled basis into BigQuery for analysis. How can the data be exported?
    1. Configure a scheduler in Google Analytics to convert the Google Analytics data to JSON format, then import directly into BigQuery using bq command line.
    2. Use gsutil to export the Google Analytics data to Cloud Storage, then import into BigQuery and schedule it using Cron.
    3. Import data to BigQuery directly from Google Analytics using Cron
    4. Use BigQuery Data Transfer Service to import the data from Google Analytics

Reference

Google_Cloud_BigQuery_Transfer_Service

Google Cloud BigQuery Security

Google Cloud BigQuery Security

BigQuery Encryption

  • BigQuery automatically encrypts all data before it is written to disk
  • By default, Google uses the Default Encryption at Rest and manages the key encryption keys used for data protection.
  • BigQuery also supports customer-managed encryption keys (CMEK), as well as AEAD encryption functions to encrypt individual values within a table.
  • BigQuery uses TLS for data in transit encryption
  • Cloud Data Loss Prevention (Cloud DLP) can be used to scan the BigQuery tables and to protect sensitive data and meet compliance requirements.

BigQuery IAM Roles

  • BigQuery supports access control of datasets and tables using IAM
  • Primitive Roles
    • primitive roles act at the project level
    • By default, granting access to a project also grants access to datasets within it unless overridden
    • are not limited to BigQuery resources only
    • cannot separate data access permissions from job-running permissions
    • Viewer
      • View all datasets
      • Run Jobs/Queries
      • View and update all jobs that they started
    • Editor
      • All Viewer access
      • Modify or delete all tables
      • Create new datasets
    • Owner
      • All Editor access
      • list, modify, or delete all datasets
      • View all jobs
  • Predefined Roles
    • dataViewer, dataEditor, and dataOwner roles
      • are similar to the primitive roles except
        • can be assigned for individual datasets
        • don’t give users permission to run jobs or queries
    • user, jobUser roles
      • give users permission to run jobs or queries
      • A jobUser can only start jobs and cancel jobs, but cannot list datasets or tables
      • A user, on the other hand, can perform a variety of other tasks, such as listing or creating datasets
      • User or group granted the user role at the project level can create datasets and can run query jobs against tables in those datasets.
      • user role does not give permission to query data, view table data, or view table schema details for datasets the user did not create. Need to have the dataViewer role for the same.

Authorized Views

  • Authorized views help provide view access to a dataset
  • Use authorized views to restrict access at a lower resource level such as the table, column, row, or cell.
  • An authorized view allows sharing query results with particular users and groups without giving them access to the underlying tables.
  • Authorized View’s SQL query can be used to restrict the columns (fields) the users are able to query.
  • Authorized views HAVE to be created in a separate dataset from the source dataset. As access controls can be assigned only at the dataset level, if the view is created in the same dataset as the source data, the users would have access to both the view and the data.
  • Authorized View creation process (see the sketch after this list)
    • Create a separate dataset to store the view.
    • Create the view in the new dataset
    • Give the group read access to the dataset containing the view
    • Authorize the view to access the source dataset
    • Give the group bigquery.user role to run jobs, including query jobs within the project
  • Project-level bigquery.user role does not give the users the ability to view or query table data in the dataset containing the tables queried by the view. They need READER access to the dataset containing the view.
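
The creation process above maps fairly directly onto the BigQuery Python client, as in the sketch below; the project, dataset, table, and group names are assumptions, and the project-level bigquery.user grant still happens separately through IAM.

from google.cloud import bigquery

client = bigquery.Client()  # assumes application default credentials

# 1. Create a separate dataset to hold the view (must differ from the source dataset).
view_dataset = client.create_dataset("shared_views", exists_ok=True)

# 2. Create the view in the new dataset, restricting the exposed columns.
view = bigquery.Table("my-project.shared_views.sales_view")
view.view_query = "SELECT order_id, amount FROM `my-project.source_data.sales`"
view = client.create_table(view, exists_ok=True)

# 3. Give the analyst group READER access to the dataset containing the view.
entries = list(view_dataset.access_entries)
entries.append(bigquery.AccessEntry("READER", "groupByEmail", "analysts@example.com"))
view_dataset.access_entries = entries
view_dataset = client.update_dataset(view_dataset, ["access_entries"])

# 4. Authorize the view to read the source dataset.
source_dataset = client.get_dataset("my-project.source_data")
entries = list(source_dataset.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
source_dataset.access_entries = entries
source_dataset = client.update_dataset(source_dataset, ["access_entries"])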

Fine-Grained Access Control

  • BigQuery supports access controls at the project, dataset, and table levels
  • BigQuery also supports fine-grained row and column level security
  • BigQuery provides fine-grained access to sensitive columns using policy tags, or type-based classification, of data.
  • Using BigQuery column-level security, you can create policies that check, at query time, whether a user has proper access.
  • Row-level security extends the principle of least privilege by enabling fine-grained access control to a subset of data in a BigQuery table, by means of row-level access policies.

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion, and correction.
  1. You have multiple Data Analysts who work with the dataset hosted in BigQuery within the same project. As a BigQuery Administrator, you are required to grant the data analysts only the privilege to create jobs/queries and the ability to cancel self-submitted jobs. Which role should you assign to the users?
    1. User
    2. Jobuser
    3. Owner
    4. Viewer
  2. Your analytics system executes queries against a BigQuery dataset. The SQL query is executed in batch and passes the contents of a SQL file to the BigQuery CLI. Then it redirects the BigQuery CLI output to another process. However, you are getting a permission error from the BigQuery CLI when the queries are executed. You want to resolve the issue. What should you do?
    1. Grant the service account BigQuery Data Viewer and BigQuery Job User roles.
    2. Grant the service account BigQuery Data Editor and BigQuery Data Viewer roles.
    3. Create a view in BigQuery from the SQL query and SELECT * from the view in the CLI.
    4. Create a new dataset in BigQuery, and copy the source table to the new dataset Query the new dataset and table from the CLI.
  3. You are responsible for the security and access control to a BigQuery dataset hosted within a project. Multiple users from multiple teams need to have access to the different tables within the dataset. How can access be controlled?
    1. Create Authorized views for tables in a separate project and grant access to the teams
    2. Create Authorized views for tables in the same project and grant access to the teams
    3. Create Materialized views for tables in a separate project and grant access to the teams
    4. Create Materialized views for tables in the same project and grant access to the teams

References

Google_Cloud_BigQuery_Data_Goverance

Google Cloud Dataproc

Google Cloud Dataproc

  • Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
  • Dataproc automation helps to create clusters quickly, manage them easily, and save money by turning clusters on and off as needed.
  • Dataproc helps reduce the time and money spent on administration and lets you focus on your jobs and your data.
  • Dataproc clusters are quick to start, scale, and shutdown, with each of these operations taking 90 seconds or less, on average
  • Dataproc has built-in integration with other GCP services, such as BigQuery, Cloud Storage, Bigtable, Cloud Logging, and Monitoring
  • Dataproc clusters support preemptible instances that have lower compute prices to reduce costs further.
  • Dataproc supports connectors for BigQuery, Bigtable, Cloud Storage
  • Dataproc also supports Anaconda, HBase, Flink, Hive WebHCat, Druid, Jupyter, Presto, Solr, Zeppelin, Ranger, Zookeeper, and much more.
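
For reference, a basic cluster can be created with the Dataproc Python client as sketched below; the project, region, cluster name, and machine types are assumed values.

from google.cloud import dataproc_v1

project_id = "my-project"  # assumed project, region, and cluster name
region = "us-central1"

# The client must target the regional endpoint matching the cluster's region.
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "analytics-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
result = operation.result()
print(f"Cluster created: {result.cluster_name}")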

Dataproc Cluster High Availability

  • Dataproc cluster can be configured for High Availability by specifying the number of master instances in the cluster
  • Dataproc supports two master configurations:
    • Standard Cluster – 1 master – N workers (default, non-HA)
      • a Single Node cluster variant provides one node that acts as both master and worker (1 master – 0 workers)
      • if the master fails, in-flight jobs will fail and need to be retried, and HDFS will be inaccessible until the single NameNode fully recovers on reboot.
    • High Availability Cluster – 3 masters – N Workers (Hadoop HA)
      • HDFS High Availability and YARN High Availability are configured to allow uninterrupted YARN and HDFS operations despite any single-node failures/reboots.
  • All nodes in a High Availability cluster reside in the same zone. If there is a failure that impacts all nodes in a zone, the failure will not be mitigated.

Dataproc Cluster Scaling

  • Dataproc cluster can be adjusted to scale by increasing or decreasing the number of primary or secondary worker nodes (horizontal scaling)
  • Dataproc cluster can be scaled at any time, even when jobs are running on the cluster.
  • Machine type of an existing cluster (vertical scaling) cannot be changed. To vertically scale, create a cluster using a supported machine type, then migrate jobs to the new cluster.
  • Dataproc cluster can help scale
    • to increase the number of workers to make a job run faster
    • to decrease the number of workers to save money
    • to increase the number of nodes to expand available Hadoop Distributed Filesystem (HDFS) storage

Dataproc Cluster Autoscaling

  • Dataproc Autoscaling provides a mechanism for automating cluster resource management and enables cluster autoscaling.
  • An Autoscaling Policy is a reusable configuration that describes how clusters using the autoscaling policy should scale.
  • It defines scaling boundaries, frequency, and aggressiveness to provide fine-grained control over cluster resources throughout cluster lifetime.
  • Autoscaling is recommended for
    • on clusters that store data in external services, such as Cloud Storage
    • on clusters that process many jobs
    • to scale up single-job clusters
  • Autoscaling is not recommended with/for:
    • HDFS: Autoscaling is not intended for scaling on-cluster HDFS
    • YARN Node Labels: Autoscaling does not support YARN Node Labels. YARN incorrectly reports cluster metrics when node labels are used.
    • Spark Structured Streaming: Autoscaling does not support Spark Structured Streaming
    • Idle Clusters: Autoscaling is not recommended for the purpose of scaling a cluster down to minimum size when the cluster is idle. It is better to delete an Idle cluster.

Dataproc Workers

  • Primary workers are standard Compute Engine VMs
  • Secondary workers can be used to scale with the below limitations
    • Processing only
      • Secondary workers do not store data.
      • can only function as processing nodes
      • useful to scale compute without scaling storage.
    • No secondary-worker-only clusters
      • Cluster must have primary workers
      • Dataproc adds two primary workers to the cluster, by default, if no primary workers are specified.
    • Machine type
      • use the machine type of the cluster’s primary workers.
    • Persistent disk size
      • are created, by default, with the smaller of 100GB or the primary worker boot disk size.
      • This disk space is used for local caching of data and is not available through HDFS.
    • Asynchronous Creation
      • Dataproc manages secondary workers using Managed Instance Groups (MIGs), which create VMs asynchronously as soon as they can be provisioned

Dataproc Initialization Actions

  • Dataproc supports initialization actions in executables or scripts that will run on all nodes in the cluster immediately after the cluster is set up
  • Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.

Dataproc Cloud Storage Connector

  • Dataproc Cloud Storage connector helps Dataproc use Google Cloud Storage as the persistent store instead of HDFS.
  • Cloud Storage connector helps separate the storage from the cluster lifecycle and allows the cluster to be shut down when not processing data
  • Cloud Storage connector benefits
    • Direct data access – Store the data in Cloud Storage and access it directly. You do not need to transfer it into HDFS first.
    • HDFS compatibility – can easily access your data in Cloud Storage using the gs:// prefix instead of hdfs://
    • Interoperability – Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and Google services.
    • Data accessibility – data is accessible even after shutting down the cluster, unlike HDFS.
    • High data availability – Data stored in Cloud Storage is highly available and globally replicated without a loss of performance.
    • No storage management overhead – Unlike HDFS, Cloud Storage requires no routine maintenance, such as checking the file system, or upgrading or rolling back to a previous version of the file system.
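
The connector keeps this mostly transparent to the job code. A PySpark job submitted to a Dataproc cluster (where the Cloud Storage connector is pre-installed) might look like the sketch below, with the bucket and paths being assumed examples.

# Submitted as a PySpark job on a Dataproc cluster; the gs:// paths are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-connector-demo").getOrCreate()

# Read directly from Cloud Storage using the gs:// prefix instead of hdfs://
logs = spark.read.text("gs://my-bucket/logs/*.log")

errors = logs.filter(logs.value.contains("ERROR"))

# Results persist in Cloud Storage even after the cluster is shut down.
errors.write.mode("overwrite").text("gs://my-bucket/output/errors")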

Cloud Dataproc vs Dataflow

Refer blog post @ Cloud Dataproc vs Dataflow

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion, and correction.
  1. Your company is forecasting a sharp increase in the number and size of Apache Spark and Hadoop jobs being run on your local data center. You want to utilize the cloud to help you scale this upcoming demand with the least amount of operations work and code change. Which product should you use?
    1. Google Cloud Dataflow
    2. Google Cloud Dataproc
    3. Google Compute Engine
    4. Google Container Engine
  2. Your company is migrating to Google Cloud and looking for an HBase alternative. The current solution uses a lot of custom code using the observer coprocessor. You are required to find the best alternative for migration while using managed services, if possible?
    1. Dataflow
    2. HBase on Dataproc
    3. Bigtable
    4. BigQuery

References

Google_Cloud_Dataproc

Google Cloud Dataflow

Google Cloud Dataflow

  • Google Cloud Dataflow is a managed, serverless service for unified stream and batch data processing requirements
  • Dataflow provides Horizontal autoscaling to automatically choose the appropriate number of worker instances required to run the job.
  • Dataflow is based on Apache Beam, an open-source, unified model for defining both batch and streaming-data parallel-processing pipelines.

Dataflow (Apache Beam) Programming Model

Data Processing Model

Pipelines

  • A pipeline encapsulates the entire series of computations involved in reading input data, transforming that data, and writing output data.
  • The input source and output sink can be the same or of different types, allowing data conversion from one format to another.
  • Apache Beam programs start by constructing a Pipeline object and then using that object as the basis for creating the pipeline’s datasets.
  • Each pipeline represents a single, repeatable job.

PCollection

  • A PCollection represents a potentially distributed, multi-element dataset that acts as the pipeline’s data.
  • Apache Beam transforms use PCollection objects as inputs and outputs for each step in your pipeline.

Transforms

  • A transform represents a processing operation that transforms data.
  • A transform takes one or more PCollections as input, performs a specified operation on each element in that collection, and produces one or more PCollections as output.
  • A transform can perform nearly any kind of processing operation like
    • performing mathematical computations,
    • data conversion from one format to another,
    • grouping data together,
    • reading and writing data,
    • filtering data to output only the required elements, or
    • combining data elements into single values.

ParDo

  • ParDo is the core parallel processing operation invoking a user-specified function on each of the elements of the input PCollection.
  • ParDo collects the zero or more output elements into an output PCollection.
  • ParDo processes elements independently and in parallel, if possible.

Pipeline I/O

  • Apache Beam I/O connectors help read data into the pipeline and write output data from the pipeline.
  • An I/O connector consists of a source and a sink.
  • All Apache Beam sources and sinks are transforms that let the pipeline work with data from several different data storage formats.
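
Putting these pieces together, a minimal Apache Beam pipeline in Python that reads from a source, applies transforms, aggregates, and writes to a sink could look like the sketch below; the bucket paths are assumptions, and the same code runs on Dataflow by passing the appropriate pipeline options (runner, project, region, temp location).

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # add --runner=DataflowRunner etc. to run on Dataflow

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # I/O connector (source): read lines from a text file into a PCollection.
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")
        # ParDo-style transforms: split lines into words, pair each word with 1.
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        # Aggregation: combine per key with an associative, commutative operation.
        | "Count" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        # I/O connector (sink): write the results back out.
        | "Write" >> beam.io.WriteToText("gs://my-bucket/wordcount/output")
    )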

Aggregation

  • Aggregation is the process of computing some value from multiple input elements.
  • The primary computational pattern for aggregation is to
    • group all elements with a common key and window.
    • combine each group of elements using an associative and commutative operation.

User-defined functions (UDFs)

  • User-defined functions allow executing user-defined code as a way of configuring the transform.
  • For ParDo, user-defined code specifies the operation to apply to every element, and for Combine, it specifies how values should be combined.
  • A pipeline might contain UDFs written in a different language than the language of the runner.
  • A pipeline might also contain UDFs written in multiple languages.

Runner

  • Runners are the software that accepts a pipeline and executes it.

Event time

  • Time a data event occurs, determined by the timestamp on the data element itself.
  • This contrasts with the time the actual data element gets processed at any stage in the pipeline.

Windowing

  • Windowing enables grouping operations over unbounded collections by dividing the collection into windows of finite collections according to the timestamps of the individual elements.
  • A windowing function tells the runner how to assign elements to an initial window, and how to merge windows of grouped elements.

Tumbling Windows (Fixed Windows)

  • A tumbling window represents a consistent, disjoint time interval, e.g. every 1 minute, in the data stream.

An image that shows tumbling windows, 30 seconds in duration

Hopping Windows (Sliding Windows)

  • A hopping window represents a consistent time interval in the data stream, e.g., a hopping window can start every 30 seconds and capture 1 minute of data. The frequency with which hopping windows begin is called the period.
  • Hopping windows can overlap, whereas tumbling windows are disjoint.
  • Hopping windows are ideal to take running averages of data

An image that shows hopping windows with 1 minute window duration and 30 second window period

Session windows

  • A session window contains elements within a gap duration of another element, e.g., session windows can divide a data stream representing user mouse activity. This data stream might have long periods of idle time interspersed with many clicks. A session window can contain the data generated by the clicks.
  • The gap duration is an interval between new data in a data stream.
  • If data arrives after the gap duration, the data is assigned to a new window
  • Session windowing assigns different windows to each data key.
  • Tumbling and hopping windows contain all elements in the specified time interval, regardless of data keys.

An image that shows session windows with a minimum gap duration
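
In Beam's Python SDK, the window types described above are applied with WindowInto; the sketch below shows tumbling, hopping, and session windows over a small keyed collection with assumed timestamps.

import apache_beam as beam
from apache_beam.transforms.window import (
    FixedWindows,
    Sessions,
    SlidingWindows,
    TimestampedValue,
)

with beam.Pipeline() as p:
    events = (
        p
        | beam.Create([("user1", 1, 10), ("user1", 1, 35), ("user2", 1, 300)])
        # Attach event-time timestamps; in a real stream these come from the source.
        | beam.Map(lambda e: TimestampedValue((e[0], e[1]), e[2]))
    )

    # Tumbling (fixed) windows: consistent, disjoint 60-second intervals.
    fixed = events | "Fixed" >> beam.WindowInto(FixedWindows(60))

    # Hopping (sliding) windows: 60-second windows starting every 30 seconds.
    sliding = events | "Sliding" >> beam.WindowInto(SlidingWindows(60, 30))

    # Session windows: per-key grouping while gaps stay under 10 minutes.
    sessions = events | "Sessions" >> beam.WindowInto(Sessions(10 * 60))

    # Aggregations then happen per key and per window.
    counts = fixed | beam.CombinePerKey(sum)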

Watermarks

  • A Watermark is a threshold that indicates when Dataflow expects all of the data in a window to have arrived.
  • The watermark is tracked as the system’s notion of when all data in a certain window can be expected to have arrived in the pipeline
  • If new data arrives with a timestamp that’s in the window but older than the watermark, the data is considered late data.
  • Dataflow tracks watermarks because of the following:
    • Data is not guaranteed to arrive in time order or at predictable intervals.
    • Data events are not guaranteed to appear in pipelines in the same order that they were generated.

Trigger

  • Triggers determine when to emit aggregated results as data arrives.
  • For bounded data, results are emitted after all of the input has been processed.
  • For unbounded data, results are emitted when the watermark passes the end of the window, indicating that the system believes all input data for that window has been processed.

Dataflow Pipeline Operations

  • Cancelling a job
    • causes the Dataflow service to stop the job immediately.
    • might lose in-flight data
  • Draining a job
    • supports graceful stop
    • prevents data loss
    • is useful to deploy incompatible changes
    • allows the job to clear the existing queue before stopping
    • supports only streaming jobs and does not support batch pipelines

Dataflow Security

  • Dataflow provides data-in-transit encryption.
    • All communication with Google Cloud sources and sinks is encrypted and is carried over HTTPS.
    • All inter-worker communication occurs over a private network and is subject to the project’s permissions and firewall rules.

Cloud Dataflow vs Dataproc

Refer blog post @ Cloud Dataflow vs Dataproc

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion, and correction.
  1. A startup plans to use a data processing platform, which supports both batch and streaming applications. They would prefer to have a hands-off/serverless data processing platform to start with. Which GCP service is suited for them?
    1. Dataproc
    2. Dataprep
    3. Dataflow
    4. BigQuery

References

Google_Cloud_Dataflow

Google Cloud Spanner

Google Cloud Spanner

  • Cloud Spanner is a fully managed, mission-critical relational database service
  • Cloud Spanner provides a scalable online transaction processing (OLTP) database with high availability and strong consistency at a global scale.
  • Cloud Spanner provides traditional relational semantics like schemas, ACID transactions and SQL interface
  • Cloud Spanner provides Automatic, Synchronous replication within and across regions for high availability (99.999%)
  • Cloud Spanner benefits
    • OLTP (Online Transactional Processing)
    • Global scale
    • Relational data model
    • ACID/Strong or External consistency
    • Low latency
    • Fully managed and highly available
    • Automatic replication

Cloud Spanner Architecture

Cloud Spanner Architecture

Instance

  • Cloud Spanner Instance determines the location and the allocation of resources
  • Instance creation includes two important choices
    • Instance configuration
      • determines the geographic placement i.e. location and replication of the databases
      • Location can be regional or multi-regional
      • cannot be changed once selected during the creation
    • Node count
      • determines the amount of the instance’s serving and storage resources
      • can be updated
  • Cloud Spanner distributes an instance across zones of one or more regions to provide high performance and high availability
  • Cloud Spanner instances have:
    • At least three read-write replicas of the database each in a different zone
    • Each zone is a separate isolation fault domain
    • Paxos distributed consensus protocol used for writes/transaction commits
    • Synchronous replication of writes to all zones across all regions
    • Database is available even if one zone fails (99.999% availability SLA for multi-region and 99.99% availability SLA for regional)

Regional vs Multi-Regional

  • Regional Configuration
    • Cloud Spanner maintains 3 read-write replicas, each within a different Google Cloud zone in that region.
    • Each read-write replica contains a full copy of the operational database that is able to serve read-write and read-only requests.
    • Cloud Spanner uses replicas in different zones so that if a single-zone failure occurs, the database remains available.
    • Every Cloud Spanner mutation requires a write quorum that’s composed of a majority of voting replicas. Write quorums are formed from two out of the three replicas in regional configurations.
    • Provides 99.99% availability
  • Multi-Regional Configuration
    • Multi-region configurations allow replicating the database’s data not just in multiple zones, but in multiple zones across multiple regions
    • Additional replicas enable reading data with low latency from multiple locations close to or within the regions in the configuration.
    • As the quorum (read-write) replicas are spread across more than one region, additional network latency is incurred when these replicas communicate with each other to vote on writes.
    • Multi-region configurations enable the application to achieve faster reads in more places at the cost of a small increase in write latency.
    • Provides 99.999% availability
    • Multi-regional configurations make use of Paxos-based replication, TrueTime, and leader election to provide global consistency and higher availability

Cloud Spanner - Regional vs Multi-Regional Configurations

Replication

  • Cloud Spanner automatically gets replication at the byte level from the underlying distributed filesystem.
  • Cloud Spanner also performs data replication to provide global availability and geographic locality, with fail-over between replicas being transparent to the client.
  • Cloud Spanner creates multiple copies, or “replicas,” of the rows, then stores these replicas in different geographic areas.
  • Cloud Spanner uses a synchronous, Paxos distributed consensus protocol, in which voting replicas take a vote on every write request to ensure transactions are available in sufficient replicas before being committed.
  • Globally synchronous replication gives the ability to read the most up-to-date data from any Cloud Spanner read-write or read-only replica.
  • Cloud Spanner creates replicas of each database split
  • A split holds a range of contiguous rows, where the rows are ordered by the primary key.
  • All of the data in a split is physically stored together in the replica, and Cloud Spanner serves each replica out of an independent failure zone.
  • A set of splits is stored and replicated using Paxos.
  • Within each Paxos replica set, one replica is elected to act as the leader.
  • Leader replicas are responsible for handling writes, while any read-write or read-only replica can serve a read request without communicating with the leader (though if a strong read is requested, the leader will typically be consulted to ensure that the read-only replica has received all recent mutations)
  • Cloud Spanner automatically reshards data into splits and automatically migrates data across machines (even across datacenters) to balance load, and in response to failures.
  • Spanner’s sharding considers the parent-child relationships in interleaved tables, and related data is migrated together to preserve query performance

Cloud Spanner Data Model

  • A Cloud Spanner Instance can contain one or more databases
  • A Cloud Spanner database can contain one or more tables
  •  Tables look like relational database tables in that they are structured with rows, columns, and values, and they contain primary keys
  • Every table must have a primary key, and that primary key can be composed of zero or more columns of that table
  • Parent-child relationships in Cloud Spanner
    • Table Interleaving
      • Table interleaving is a good choice for many parent-child relationships where the child table’s primary key includes the parent table’s primary key columns
      • Child rows are colocated with the parent rows significantly improving the performance
      • Primary key column(s) of the parent table must be the prefix of the primary key of the child table
    • Foreign Keys
      • Foreign keys are similar to traditional databases.
      • They are not limited to primary key columns, and tables can have multiple foreign key relationships, both as a parent in some relationships and a child in others.
      • The foreign key relationship does not guarantee data co-location
  • Cloud Spanner automatically creates an index for each table’s primary key
  • Secondary indexes can be created for other columns
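
Table interleaving is declared in the schema DDL. The sketch below creates a database with an interleaved parent-child pair using the Cloud Spanner Python client; the instance, database, and table names follow the common Singers/Albums documentation example and are assumptions here.

from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")  # assumed instance and database IDs

# Albums is interleaved in Singers: the child primary key is prefixed with the
# parent primary key, so child rows are stored together with their parent rows.
database = instance.database(
    "my-database",
    ddl_statements=[
        """CREATE TABLE Singers (
               SingerId   INT64 NOT NULL,
               FirstName  STRING(1024),
               LastName   STRING(1024)
           ) PRIMARY KEY (SingerId)""",
        """CREATE TABLE Albums (
               SingerId   INT64 NOT NULL,
               AlbumId    INT64 NOT NULL,
               AlbumTitle STRING(MAX)
           ) PRIMARY KEY (SingerId, AlbumId),
           INTERLEAVE IN PARENT Singers ON DELETE CASCADE""",
    ],
)
operation = database.create()
operation.result(120)  # wait for the schema to be applied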

Cloud Spanner Scaling

  • Increase the compute capacity of the instance to scale up the server and storage resources in the instance.
  • Each node allows for an additional 2TB of data storage
  • Nodes provide additional compute resources to increase throughput
  • Increasing compute capacity does not increase the replica count but gives each replica more CPU and RAM, which increases the replica’s throughput (that is, more reads and writes per second can occur).

Cloud Spanner Backup & PITR

  • Cloud Spanner Backup and Restore helps create backups of Cloud Spanner databases on demand, and restore them to provide protection against operator and application errors that result in logical data corruption.
  • Backups are highly available, encrypted, and can be retained for up to a year from the time they are created.
  • Cloud Spanner point-in-time recovery (PITR) provides protection against accidental deletion or writes.
  • PITR works by letting you configure a database’s version_retention_period to retain all versions of data and schema, from a minimum of 1 hour up to a maximum of 7 days.

Cloud Spanner Best Practices

  • Design a schema that prevents hotspots and other performance issues.
  • For optimal write latency, place compute resources for write-heavy workloads within or close to the default leader region.
  • For optimal read performance outside of the default leader region, use staleness of at least 15 seconds.
  • To avoid single-region dependency for the workloads, place critical compute resources in at least two regions.
  • Provision enough compute capacity to keep high priority total CPU utilization under
    • 65% in each region for regional configuration
    • 45% in each region for multi-regional configuration

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion, and correction.
  1. Your customer has implemented a solution that uses Cloud Spanner and notices some read latency-related performance issues on one table. This table is accessed only by their users using a primary key. The table schema is shown below. You want to resolve the issue. What should you do?
    1. Remove the profile_picture field from the table.
    2. Add a secondary index on the person_id column.
    3. Change the primary key to not have monotonically increasing values.
    4. Create a secondary index using the following Data Definition Language (DDL): CREATE INDEX person_id_ix ON Persons (person_id, firstname, lastname) STORING (profile_picture)
  2. You are building an application that stores relational data from users. Users across the globe will use this application. Your CTO is concerned about the scaling requirements because the size of the user base is unknown. You need to implement a database solution that can scale with your user growth with minimum configuration changes. Which storage solution should you use?
    1. Cloud SQL
    2. Cloud Spanner
    3. Cloud Firestore
    4. Cloud Datastore
  3. A financial organization wishes to develop a global application to store transactions happening from different parts of the world. The storage system must provide low-latency transaction support and horizontal scaling. Which GCP service is appropriate for this use case?
    1. Bigtable
    2. Datastore
    3. Cloud Storage
    4. Cloud Spanner

References

Google_Cloud_Spanner

Google Cloud Data Analytics Services Cheat Sheet

Cloud Pub/Sub

  • Pub/Sub is a fully managed, asynchronous messaging service designed to be highly reliable and scalable with latencies on the order of 100 ms
  • Pub/Sub offers at-least-once message delivery and best-effort ordering to existing subscribers
  • Pub/Sub enables the creation of event producers and consumers, called publishers and subscribers.
  • Pub/Sub messages should be no greater than 10MB in size.
  • Messages can be received with pull or push delivery.
  • Messages published before a subscription is created will not be delivered to that subscription
  • Acknowledged messages are no longer available to subscribers and are deleted by default; however, they can be retained by configuring a message retention period.
  • If publishers send messages with an ordering key and message ordering is enabled, Pub/Sub delivers the messages in order.
  • Pub/Sub support encryption at rest and encryption in transit.
  • Seek feature allows subscribers to alter the acknowledgment state of messages in bulk to replay or purge messages in bulk.
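
A minimal publish/pull round trip with the Pub/Sub Python client is sketched below; the project, topic, and subscription names are assumptions, and publishing with an ordering key additionally requires enabling message ordering on the publisher.

from concurrent import futures
from google.cloud import pubsub_v1

project_id = "my-project"       # assumed project, topic, and subscription names
topic_id = "orders"
subscription_id = "orders-sub"

# Publish a message with an attribute; the returned future resolves to the message ID.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)
future = publisher.publish(topic_path, b"order created", origin="web")
print("Published message ID:", future.result())

# Pull delivery: the callback must acknowledge each message or it is redelivered.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print("Received:", message.data)
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull_future.result(timeout=30)
    except futures.TimeoutError:
        streaming_pull_future.cancel()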

BigQuery

  • BigQuery is a fully managed, durable, petabyte scale, serverless, highly scalable, and cost-effective multi-cloud data warehouse.
  • supports a standard SQL dialect
  • automatically replicates data and keeps a seven-day history of changes, allowing easy restoration and comparison of data from different times
  • supports federated data and can process external data sources in GCS for Parquet and ORC open-source file formats, transactional databases (Bigtable, Cloud SQL), or spreadsheets in Drive without moving the data.
  • Data model consists of Datasets, tables
  • BigQuery performance can be improved using Partitioned tables and Clustered tables.
  • BigQuery encrypts all data at rest and supports encryption in transit.
  • BigQuery Data Transfer Service automates data movement into BigQuery on a scheduled, managed basis
  • Best Practices
    • Control projection, avoid select *
    • Estimate costs, as queries are billed according to the number of bytes read; the cost can be estimated using the --dry-run feature (see the sketch after this list)
    • Use the maximum bytes billed setting to limit query costs.
    • Use clustering and partitioning to reduce the amount of data scanned.
    • Avoid repeatedly transforming data via SQL queries. Materialize the query results in stages.
    • Use streaming inserts only if the data must be immediately available as streaming data is charged.
    • Prune partitioned queries, use the _PARTITIONTIME pseudo column to filter the partitions.
    • Denormalize data whenever possible using nested and repeated fields.
    • Avoid external data sources, if query performance is a top priority
    • Avoid using Javascript user-defined functions
    • Optimize Join patterns. Start with the largest table.
    • Use the expiration settings to remove unneeded tables and partitions
    • Keep the data in BigQuery to take advantage of the long-term storage cost benefits rather than exporting to other storage options.
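
The --dry-run behavior referenced in the best practices is also available through the Python client by setting dry_run on the job config, as in the sketch below (the public dataset is just a convenient example).

from google.cloud import bigquery

client = bigquery.Client()

# dry_run validates the query and reports the bytes it would scan
# without actually running it or incurring query charges.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query_job = client.query(
    "SELECT name, state FROM `bigquery-public-data.usa_names.usa_1910_2013` "
    "WHERE state = 'TX'",
    job_config=job_config,
)

print(f"This query would process {query_job.total_bytes_processed} bytes.")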

Bigtable

  • Bigtable is a fully managed, scalable, wide-column NoSQL database service with up to 99.999% availability.
  • ideal for applications that need very high throughput and scalability for key/value data, where each value is a maximum of 10 MB.
  • supports high read and write throughput at low latency and provides consistent sub-10ms latency – handles millions of requests/second
  • is a sparsely populated table that can scale to billions of rows and thousands of columns,
  • supports storage of terabytes or even petabytes of data
  • is not a relational database. It does not support SQL queries, joins, or multi-row transactions.
  • handles upgrades and restarts transparently, and it automatically maintains high data durability.
  • scales linearly in direct proportion to the number of nodes in the cluster
  • stores data in tables, which is composed of rows, each of which typically describes a single entity, and columns, which contain individual values for each row.
  • Each table has only one index, the row key. There are no secondary indices. Each row key must be unique.
  • Single-cluster Bigtable instances provide strong consistency.
  • Multi-cluster instances, by default, provide eventual consistency but can be configured to provide read-over-write consistency or strong consistency, depending on the workload and app profile settings
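
A small sketch with the Bigtable Python client illustrates the row-key-only access pattern; the project, instance, table, column family, and row key design are assumptions.

from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=False)
instance = client.instance("my-instance")
table = instance.table("user-events")

# Write a row: the row key is the only index, so it is designed around the query pattern.
row = table.direct_row("user123#20240101")
row.set_cell("events", "click_count", b"42")
row.commit()

# Low-latency point lookup by row key.
found = table.read_row("user123#20240101")
if found is not None:
    cell = found.cells["events"][b"click_count"][0]
    print(cell.value)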

Cloud Dataflow

  • Cloud Dataflow is a managed, serverless service for unified stream and batch data processing requirements
  • provides Horizontal autoscaling to automatically choose the appropriate number of worker instances required to run the job.
  • is based on Apache Beam, an open-source, unified model for defining both batch and streaming-data parallel-processing pipelines.
  • supports Windowing which enables grouping operations over unbounded collections by dividing the collection into windows of finite collections according to the timestamps of the individual elements.
  • supports drain feature to deploy incompatible updates

Cloud Dataproc

  • Cloud Dataproc is a managed Spark and Hadoop service to take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
  • helps to create clusters quickly, manage them easily, and save money by turning clusters on and off as needed.
  • helps reduce the time and money spent on administration and lets you focus on your jobs and your data.
  • has built-in integration with other GCP services, such as BigQuery, Cloud Storage, Bigtable, Cloud Logging, and Monitoring
  • support preemptible instances that have lower compute prices to reduce costs further.
  • also supports HBase, Flink, Hive WebHCat, Druid, Jupyter, Presto, Solr, Zeppelin, Ranger, Zookeeper, and much more.
  • supports connectors for BigQuery, Bigtable, Cloud Storage
  • can be configured for High Availability by specifying the number of master instances in the cluster
  • All nodes in a High Availability cluster reside in the same zone. If there is a failure that impacts all nodes in a zone, the failure will not be mitigated.
  • supports cluster scaling by increasing or decreasing the number of primary or secondary worker nodes (horizontal scaling)
  • supports Autoscaling, which provides a mechanism for automating cluster resource management and enables cluster autoscaling.
  • supports initialization actions in executables or scripts that will run on all nodes in the cluster immediately after the cluster is set up

Cloud Dataprep

  • Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
  • is fully managed, serverless, and scales on-demand with no infrastructure to deploy or manage
  • provides easy data preparation with clicks and no code.
  • automatically identifies data anomalies & helps take fast corrective action
  • automatically detects schemas, data types, possible joins, and anomalies such as missing values, outliers, and duplicates
  • uses Dataflow or BigQuery under the hood, enabling unstructured or structured datasets processing of any size with the ease of clicks, not code

Datalab

  • Cloud Datalab is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models using familiar languages, such as Python and SQL, interactively.
  • runs on Google Compute Engine and connects to multiple cloud services easily so you can focus on your data science tasks.
  • is built on Jupyter (formerly IPython)
  • enables analysis of the data on Google BigQuery, Cloud Machine Learning Engine, Google Compute Engine, and Google Cloud Storage using Python, SQL, and JavaScript (for BigQuery user-defined functions).