Using AWS in a Hybrid Environment

As public cloud adoption grows, more and more organizations are finding themselves running a hybrid cloud model. This adoption is driven by necessity more than by any architectural design decision. When you look closely at what makes up a hybrid cloud, it becomes clear why so many of them are using AWS as part of it.

Of course, when we think about why you would use a hybrid cloud, the mixture of different computing models is the most obvious reason. With a hybrid cloud model consisting of on-premises infrastructure, private cloud services, and public cloud offerings, and the ability to orchestrate the deployment of workloads and components across all of them, it becomes much easier to distribute an application in a resilient manner.

AWS services

AWS provides a huge range of services, from IaaS virtual machines and PaaS relational database offerings through to many different types of storage at very low cost. When we look at how to leverage AWS in a hybrid cloud world, it becomes obvious that storage is at the forefront of those decisions, but the key driver behind everything in the cloud is agility. It has been a buzzword for quite some time, but the ability to spin up and destroy workloads on demand, without having to invest a large amount of capital to deliver a new service or capability to the business, is what is driving hybrid cloud adoption. Of course, introducing new platforms into an existing environment brings complexity, and overcoming that complexity is what makes a hybrid cloud implementation successful. Considerations around security, authentication, networking, and connectivity must be addressed. In short, these considerations are the same as when a new data center is implemented; it is simply a different platform.

Veeam, AWS, and the Hybrid Cloud

As more and more businesses adopt this hybrid cloud approach, protecting and migrating workloads across these different platforms becomes extremely complex. This is where Veeam can help businesses deliver on the true promise of a hybrid cloud. Veeam offers multiple products that can be used individually, in a modular way, to provide data protection and management of individual resources and services, or combined to provide a centralized data management solution.

Let’s look at a real-world scenario. In this example, we have workloads running in AWS that need to be migrated to our on-premises data center. Maybe we are facing latency issues, or we have a security requirement for this application to sit closer to services running on-premises. Using Veeam, we can easily protect those workloads and migrate them to another data center.

By protecting workloads in AWS with Veeam, you can simply move them across different platforms. It doesn’t matter which direction you want to move workloads either: you can just as easily take a virtual machine running on VMware vSphere or Microsoft Hyper-V and migrate it to AWS EC2 as an instance. Moving workloads across multiple platforms or hypervisors becomes extremely easy.

Summary

Introducing and implementing a hybrid cloud with AWS may seem daunting, but it needn’t be complex. By taking a considered approach to aspects such as networking, connectivity, and migration, leveraging AWS in a hybrid cloud model alongside your existing on-premises implementation provides an agile, simple, and quick way to deliver new services and capabilities. Combine that with products from companies like Veeam, and implementing a true hybrid cloud data management solution becomes straightforward, giving you the flexibility to move workloads across multiple platforms while delivering a reliable service to the end customers of the business.

For more information on Veeam, please visit the Veeam Backup for AWS page. Also, don’t forget to check out the “Choose Your Cloud Adventure” interactive e-book to learn how to manage your AWS data like a hero.

AWS Elastic File System – EFS

  • Elastic File System – EFS provides simple, fully managed, easy to set up, scalable, serverless, and cost-optimized file storage for use with AWS Cloud and on-premises resources (see the provisioning sketch after this list).
  • can automatically scale from gigabytes to petabytes of data without needing to provision storage.
  • provides managed NFS (Network File System) file systems that can be mounted on and accessed by multiple EC2 instances across multiple AZs simultaneously.
  • offers high durability, scalability, and availability.
    • stores data redundantly across multiple AZs in the same region
    • grows and shrinks automatically as files are added and removed, so there is no need to manage storage procurement or provisioning.
  • supports the Network File System version 4 (NFSv4.1 and NFSv4.0) protocol
  • provides file system access semantics, such as strong data consistency and file locking
  • is compatible with Linux-based AMIs for EC2; it is a POSIX file system (~Linux) with a standard file API
  • is a shared POSIX file system for Linux systems and does not work for Windows
  • offers encryption of data at rest using KMS as well as encryption in transit.
  • can be accessed from on-premises using an AWS Direct Connect or AWS VPN connection between the on-premises datacenter and VPC.
  • can be accessed concurrently from servers in the on-premises data center as well as EC2 instances in the VPC
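
As a rough illustration of how an EFS file system is typically provisioned, the boto3 sketch below creates an encrypted file system and a mount target in one subnet. The region, subnet ID, and security group ID are placeholders chosen for this example, not values from this post.

```python
import boto3

efs = boto3.client("efs", region_name="us-east-1")

# Create an encrypted, general purpose EFS file system.
fs = efs.create_file_system(
    CreationToken="shared-app-data",       # idempotency token
    PerformanceMode="generalPurpose",      # or "maxIO"
    Encrypted=True,                        # encryption at rest via KMS
)

# In practice you would wait for the file system's LifeCycleState to become
# "available" before creating mount targets.
efs.create_mount_target(
    FileSystemId=fs["FileSystemId"],
    SubnetId="subnet-0123456789abcdef0",        # placeholder subnet in one AZ
    SecurityGroups=["sg-0123456789abcdef0"],    # must allow NFS (TCP 2049)
)

# EC2 instances in the VPC (or on-premises servers over Direct Connect/VPN)
# can then mount the file system over NFSv4.1.
```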

EFS Storage Classes

Standard storage classes

  • EFS Standard and Standard-Infrequent Access (Standard-IA), offer multi-AZ resilience and the highest levels of durability and availability.
  • For file systems using Standard storage classes, a mount target can be created in each Availability Zone in the AWS Region.
  • Standard
    • regional storage class for frequently accessed data.
    • offers the highest levels of availability and durability by storing file system data redundantly across multiple AZs in an AWS Region.
    • ideal for active file system workloads and you pay only for the file system storage you use per month
  • Standard-Infrequent Access (Standard-IA)
    • regional, low-cost storage class that is cost-optimized for infrequently accessed files, i.e. files not accessed every day
    • offers the highest levels of availability and durability by storing file system data redundantly across multiple AZs in an AWS Region
    • there is a cost to retrieve files, but a lower price to store them

One Zone storage classes

  • EFS One Zone and One Zone-Infrequent Access (One Zone-IA) offer additional savings by storing the data in a single AZ.
  • For file systems using One Zone storage classes, only a single mount target that is in the same Availability Zone as the file system needs to be created.
  • EFS One Zone
    • For frequently accessed files stored redundantly within a single AZ in an AWS Region.
  • EFS One Zone-IA (One Zone-IA)
    • A lower-cost storage class for infrequently accessed files stored redundantly within a single AZ in an AWS Region.

EFS Lifecycle Management

  • EFS lifecycle management automatically manages cost-effective file storage for the file systems.
  • When enabled, lifecycle management migrates files that haven’t been accessed for a set period of time to an infrequent access storage class, Standard-IA or One Zone-IA
  • Lifecycle management automatically moves data to the EFS IA storage class according to the lifecycle policy, e.g. files can be moved into EFS IA automatically 14 days after they were last accessed (see the sketch after this list).
  • Lifecycle management uses an internal timer to track when a file was last accessed and not the POSIX file system attribute that is publicly viewable.
  • Whenever a file in Standard or One Zone storage is accessed, the lifecycle management timer is reset.
  • After lifecycle management moves a file into one of the IA storage classes, the file remains there indefinitely if EFS Intelligent-Tiering is not enabled.
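
The lifecycle policy itself is just a configuration on the file system. Below is a minimal boto3 sketch, assuming a placeholder file system ID, that transitions files to IA after 14 days without access and moves them back to Standard on first access.

```python
import boto3

efs = boto3.client("efs")

# Note: this call replaces any existing lifecycle configuration on the file system.
efs.put_lifecycle_configuration(
    FileSystemId="fs-0123456789abcdef0",   # placeholder file system ID
    LifecyclePolicies=[
        {"TransitionToIA": "AFTER_14_DAYS"},               # move to IA after 14 days of no access
        {"TransitionToPrimaryStorageClass": "AFTER_1_ACCESS"},  # move back to Standard on access
    ],
)
```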

EFS Performance Modes

General Purpose (Default)

  • latency-sensitive use cases
  • ideal for web serving environments, content management systems, home directories, and general file serving, etc.

Max I/O

  • can scale to higher levels of aggregate throughput and operations per second.
  • with a tradeoff of slightly higher latencies for file metadata operations
  • ideal for highly parallelized applications and workloads, such as big data analysis, media processing, and genomic analysis
  • is not available for file systems using One Zone storage classes.

EFS Throughput Modes

Provisioned Throughput

  • throughput of the file system (in MiB/s) can be instantly provisioned independent of the amount of data stored (see the sketch after this section).

Bursting Throughput

  • throughput on EFS scales as the size of the file system in the EFS Standard or One Zone storage class grows
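
A small boto3 sketch of switching throughput modes on an existing file system (placeholder ID); note that the performance mode, unlike the throughput mode, is fixed when the file system is created.

```python
import boto3

efs = boto3.client("efs")

# Switch from bursting to provisioned throughput at 128 MiB/s.
efs.update_file_system(
    FileSystemId="fs-0123456789abcdef0",   # placeholder file system ID
    ThroughputMode="provisioned",
    ProvisionedThroughputInMibps=128,
)
```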

EFS Security

  • EFS supports authentication, authorization, and encryption capabilities to help meet security and compliance requirements.
  • EFS supports two forms of encryption for file systems,
    • Encryption in transit
      • Encryption in Transit can be enabled when you mount the file system.
    • Encryption at rest.
      • encrypts all the data and metadata
      • can be enabled only when creating an EFS file system.
      • to encrypt an existing unencrypted EFS file system, create a new encrypted EFS file system, and migrate the data using AWS DataSync.
  • NFS client access to EFS is controlled by both AWS IAM policies and network security policies like security groups.

EFS Access Points

  • EFS access points are application-specific entry points into an EFS file system that make it easier to manage application access to shared datasets (a creation sketch follows this list).
  • Access points can enforce a user identity, including the user’s POSIX groups, for all file system requests that are made through the access point.
  • Access points can enforce a different root directory for the file system so that clients can only access data in the specified directory or its subdirectories.
  • AWS IAM policies can be used to enforce that a specific application uses a specific access point.
  • IAM policies with access points provide secure access to specific datasets for the applications.
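
A minimal boto3 sketch of creating such an access point, assuming a placeholder file system ID and an illustrative /app1 root directory and POSIX identity.

```python
import boto3

efs = boto3.client("efs")

efs.create_access_point(
    FileSystemId="fs-0123456789abcdef0",       # placeholder file system ID
    PosixUser={"Uid": 1001, "Gid": 1001},      # identity enforced for all requests
    RootDirectory={
        "Path": "/app1",                       # clients only see this subtree
        "CreationInfo": {                      # created if the directory does not exist
            "OwnerUid": 1001,
            "OwnerGid": 1001,
            "Permissions": "750",
        },
    },
    Tags=[{"Key": "Name", "Value": "app1-access-point"}],
)
```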

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. An administrator runs a highly available application in AWS. A file storage layer is needed that can be shared between instances and scale the platform more easily. The storage should also be POSIX compliant. Which AWS service can perform this action?
    1. Amazon EBS
    2. Amazon S3
    3. Amazon EFS
    4. Amazon EC2 Instance store

References

AWS Elastic File System – EFS

Google Cloud Certified – Cloud Digital Leader Learning Path

Google Cloud – Cloud Digital Leader Certification Learning Path

Continuing on the Google Cloud journey, glad to have passed the seventh certification with the Cloud Digital Leader certification. Google Cloud was missing an initial entry-level certification similar to the AWS Cloud Practitioner certification, and the Cloud Digital Leader certification was introduced to fill that gap. Cloud Digital Leader focuses on general cloud knowledge and on Google Cloud knowledge of its products and services.

Google Cloud – Cloud Digital Leader Certification Summary

  • Had 59 questions (somewhat odd!) to be answered in 90 minutes.
  • Covers a wide range of General Cloud and Google Cloud services and products knowledge.
  • This exam does not require much hands-on experience; theoretical knowledge is good enough to clear the exam.

Google Cloud – Cloud Digital Leader Certification Resources

Google Cloud – Cloud Digital Leader Certification Topics

General cloud knowledge

  1. Define basic cloud technologies. Considerations include:
    1. Differentiate between traditional infrastructure, public cloud, and private cloud
      1. Traditional infrastructure includes on-premises data centers
      2. Public cloud includes Google Cloud, AWS, and Azure
      3. Private Cloud includes services like AWS Outposts
    2. Define cloud infrastructure ownership
    3. Shared Responsibility Model
      1. Security of the Cloud is Google Cloud’s responsibility
      2. Security in the Cloud depends on the services used and is shared between Google Cloud and the Customer
    4. Essential characteristics of cloud computing
      1. On-demand computing
      2. Pay-as-you-use
      3. Scalability and Elasticity
      4. High Availability and Resiliency
      5. Security
  2. Differentiate cloud service models. Considerations include:
    1. Infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS)
      1. IaaS – everything is done by you – more flexibility, more management
      2. PaaS – most things are done by the Cloud with a few things done by you – moderate flexibility and management
      3. SaaS – everything is taken care of by the Cloud, you just use it – no flexibility and management
    2. Describe the trade-offs between level of management versus flexibility when comparing cloud services
    3. Define the trade-offs between costs versus responsibility
    4. Appropriate implementation and alignment with given budget and resources
  3. Identify common cloud procurement financial concepts. Considerations include:
    1. Operating expenses (OpEx), capital expenditures (CapEx), and total cost of operations (TCO)
      1. On-premises has more CapEx and less OpEx
      2. Cloud has little to no CapEx and more OpEx
    2. Recognize the relationship between OpEx and CapEx related to networking and compute infrastructure
    3. Summarize the key cost differentiators between cloud and on-premises environments

General Google Cloud knowledge

  1. Recognize how Google Cloud meets common compliance requirements. Considerations include:
    1. Locating current Google Cloud compliance requirements
    2. Familiarity with Compliance Reports Manager
  2. Recognize the main elements of Google Cloud resource hierarchy. Considerations include:
    1. Describe the relationship between organization, folders, projects, and resources i.e. Organization -> Folder -> Folder or Projects -> Resources
  3. Describe controlling and optimizing Google Cloud costs. Considerations include:
    1. Google Cloud billing models and applicability to different service classes
    2. Define a consumption-based use model
    3. Application of discounts (e.g., flat-rate, committed-use discounts [CUD], sustained-use discounts [SUD])
      1. Sustained-use discounts [SUD] are automatic discounts for running specific resources for a significant portion of the billing month
      2. Committed use discounts [CUD] help with committed use contracts in return for deeply discounted prices for VM usage
  4. Describe Google Cloud’s geographical segmentation strategy. Considerations include:
    1. Regions are collections of zones. Zones have high-bandwidth, low-latency network connections to other zones in the same region. Regions help design fault-tolerant and highly available solutions.
    2. Zones are deployment areas within a region and provide the lowest latency usually less than 10ms
    3. Regional resources are accessible by any resources within the same region
    4. Zonal resources are hosted in a zone and are also called per-zone resources.
    5. Multiregional resources or Global resources are accessible by any resource in any zone within the same project.
  5. Define Google Cloud support options. Considerations include:
    1. Distinguish between billing support, technical support, role-based support, and enterprise support
      1. Role-Based Support provides more predictable rates and a flexible configuration. Although they are legacy, the exam does cover these.
      2. Enterprise Support provides the fastest case response times and a dedicated Technical Account Management (TAM) contact who helps you execute a Google Cloud strategy.
    2. Recognize a variety of Service Level Agreement (SLA) applications

Google Cloud products and services

  1. Describe the benefits of Google Cloud virtual machine (VM)-based compute options. Considerations include:
    1. Compute Engine provides virtual machines (VM) hosted on Google’s infrastructure.
    2. Google Cloud VMware Engine helps easy lift and shift VMware-based applications to Google Cloud without changes to the apps, tools, or processes
    3. Bare Metal lets businesses run specialized workloads such as Oracle databases close to Google Cloud while lowering overall costs and reducing risks associated with migration
    4. Custom versus standard sizing
    5. Free, premium, and custom service options
    6. Attached storage/disk options
    7. Preemptible VMs are instances that can be created and run at a much lower price than normal instances.
  2. Identify and evaluate container-based compute options. Considerations include:
    1. Define the function of a container registry
      1. Container Registry is a single place to manage Docker images, perform vulnerability analysis, and decide who can access what with fine-grained access control.
    2. Distinguish between VMs, containers, and Google Kubernetes Engine
  3. Identify and evaluate serverless compute options. Considerations include:
    1. Define the function and use of App Engine, Cloud Functions, and Cloud Run
    2. Define rationale for versioning with serverless compute options
    3. Cost and performance tradeoffs of scale to zero
      1. Scale to zero provides cost efficiency by scaling down to zero when there is no load, but comes with the issue of cold starts
      2. Serverless technologies like Cloud Functions, Cloud Run, and App Engine Standard provide these capabilities
  4. Identify and evaluate multiple data management offerings. Considerations include:
    1. Describe the differences and benefits of Google Cloud’s relational and non-relational database offerings
      1. Cloud SQL provides fully managed, relational SQL databases and offers MySQL, PostgreSQL, MSSQL databases as a service
      2. Cloud Spanner provides fully managed, horizontally scalable, relational SQL databases with joins and secondary indexes
      3. Cloud Bigtable provides a scalable, fully managed, non-relational NoSQL wide-column analytical big data database service suitable for low-latency single-point lookups and precalculated analytics
      4. BigQuery provides fully managed, no-ops, OLAP, enterprise data warehouse (EDW) with SQL and fast ad-hoc queries.
    2. Describe Google Cloud’s database offerings and how they compare to commercial offerings
  5. Distinguish between ML/AI offerings. Considerations include:
    1. Describe the differences and benefits of Google Cloud’s hardware accelerators (e.g., Vision API, AI Platform, TPUs)
    2. Identify when to train your own model, use a Google Cloud pre-trained model, or build on an existing model
      1. Vision API provides out-of-the-box pre-trained models to extract data from images
      2. AutoML provides the ability to train models
      3. BigQuery Machine Learning provides support for limited models and SQL interface
  6. Differentiate between data movement and data pipelines. Considerations include:
    1. Describe Google Cloud’s data pipeline offerings
      1. Cloud Pub/Sub provides reliable, many-to-many, asynchronous messaging between applications. By decoupling senders and receivers, Google Cloud Pub/Sub allows developers to communicate between independently written applications.
      2. Cloud Dataflow is a fully managed service for strongly consistent, parallel data-processing pipelines
      3. Cloud Data Fusion is a fully managed, cloud-native, enterprise data integration service for quickly building & managing data pipelines
      4. BigQuery Service is a fully managed, highly scalable data analysis service that enables businesses to analyze Big Data.
      5. Looker provides an enterprise platform for business intelligence, data applications, and embedded analytics.
    2. Define data ingestion options
  7. Apply use cases to a high-level Google Cloud architecture. Considerations include:
    1. Define Google Cloud’s offerings around the Software Development Life Cycle (SDLC)
    2. Describe Google Cloud’s platform visibility and alerting offerings – covers Cloud Monitoring and Cloud Logging
  8. Describe solutions for migrating workloads to Google Cloud. Considerations include:
    1. Identify data migration options
    2. Differentiate when to use Migrate for Compute Engine versus Migrate for Anthos
      1. Migrate for Compute Engine provides fast, flexible, and safe migration to Google Cloud
      2. Migrate for Anthos and GKE makes it fast and easy to modernize traditional applications away from virtual machines and into native containers. This significantly reduces the cost and labor that would be required for a manual application modernization project.
    3. Distinguish between lift and shift versus application modernization
      1. Lift and shift involves migration with zero to minimal changes and is usually performed under time constraints
      2. Application modernization requires a redesign of infra and applications and takes time. It can include moving legacy monolithic architecture to microservices architecture, building CI/CD pipelines for automated builds and deployments, frequent releases with zero downtime, etc.
  9. Describe networking to on-premises locations. Considerations include:
    1. Define Software-Defined WAN (SD-WAN) – did not have any questions regarding the same.
    2. Determine the best connectivity option based on networking and security requirements – covers Cloud VPN, Interconnect, and Peering.
    3. Private Google Access provides access from VM instances to Google-provided services like Cloud Storage or third-party provided services
  10. Define identity and access features. Considerations include:
    1. Cloud Identity & Access Management (Cloud IAM) provides administrators the ability to manage cloud resources centrally by controlling who can take what action on specific resources.
    2. Google Cloud Directory Sync enables administrators to synchronize users, groups, and other data from an Active Directory/LDAP service to their Google Cloud domain directory.

Google Cloud Compute Options

Compute Engine

  • provides Infrastructure as a Service (IaaS) in the Google Cloud
  • provides full control/flexibility on the choice of OS, resources like CPU and memory
  • Usage patterns
    • lift and shift migrations of existing systems
    • existing VM images to move to the cloud
    • need low-level access to or fine-grained control of the operating system, network, and other operational characteristics.
    • require custom kernel or arbitrary OS
    • software that can’t be easily containerized
    • using third-party licensed software
  • Usage anti-patterns
    • containerized applications – Choose App Engine, GKE, or Cloud Run
    • stateless event-driven applications – Choose Cloud Functions

App Engine

  • helps build highly scalable web and mobile backend applications on a fully managed serverless platform
  • Usage patterns
    • Rapidly developing CRUD-heavy applications
    • HTTP/S based applications
    • Deploying complex APIs
  • Usage anti-patterns
    • Stateful applications requiring lots of in-memory state to meet performance or functional requirements
    • Systems that require protocols other than HTTP

Google Kubernetes Engine – GKE

  • provides a managed environment for deploying, managing, and scaling containerized applications using Google infrastructure.
  • Usage patterns
    • containerized applications or those that can be easily containerized
    • Hybrid or multi-cloud environments
    • Systems leveraging stateful and stateless services
    • Strong CI/CD Pipelines
  • Usage anti-patterns
    • non-containerized applications – Choose CE or App engine
    • applications requiring very low-level access to the underlying hardware like custom kernel, networking, etc. – Choose CE
    • stateless event-driven applications – Choose Cloud Functions

Cloud Run

  • provides a serverless managed compute platform to run stateless, isolated containers without orchestration that can be invoked via web requests or Pub/Sub events.
  • abstracts away all infrastructure management allowing users to focus on building great applications.
  • is built on Knative.
  • Usage patterns
    • Stateless services that are easily containerized
    • Event-driven applications and systems
    • Applications that require custom system and language dependencies
  • Usage anti-patterns
    • Highly stateful systems
    • Systems that require protocols other than HTTP
    • Compliance requirements that demand strict controls over the low-level environment and infrastructure (might be okay with the Knative GKE mode)

Cloud Functions

  • provides serverless compute for event-driven apps (see the sketch below)
  • Usage patterns
    • ephemeral and event-driven applications and functions
    • fully managed environment
    • pay only for what you use
    • quick data transformations (ETL)
  • Usage anti-patterns
    • continuous stateful application – Choose CE, App Engine or GKE
Credit @ https://thecloudgirl.dev/
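
As a rough illustration of the event-driven model, here is a minimal Python Cloud Function sketch for a Cloud Storage object-finalize trigger; the function name and the bucket that would trigger it are assumptions for this example.

```python
def handle_upload(event, context):
    """Background function triggered by a google.storage.object.finalize event.

    `event` carries the Cloud Storage object metadata; `context` carries event
    metadata such as the event ID. The function runs only when invoked, and the
    platform scales instances up and down (to zero) with the event volume.
    """
    bucket = event["bucket"]
    name = event["name"]
    print(f"Processing new object gs://{bucket}/{name} (event ID {context.event_id})")
```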

Google Cloud Compute Options Decision Tree

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Your organization is developing a new application. This application responds to events created by already running applications. The business goal for the new application is to scale to handle spikes in the flow of incoming events while minimizing administrative work for the team. Which Google Cloud product or feature should you choose?
    1. Cloud Run
    2. Cloud Run for Anthos
    3. App Engine standard environment
    4. Compute Engine
  2. A company wants to build an application that stores images in a Cloud Storage bucket and wants to generate thumbnails as well as resize the images. They want to use managed service which will help them scale automatically from zero to scale and back to zero. Which GCP service satisfies the requirement?
    1. Google Compute Engine
    2. Google Kubernetes Engine
    3. Google App Engine
    4. Cloud Functions

Google Cloud Composer

Cloud Composer

  • Cloud Composer is a fully managed workflow orchestration service, built on Apache Airflow, enabling workflow creation that spans across clouds and on-premises data centers.
  • Cloud Composer requires no installation and has no management overhead.
  • Cloud Composer integrates with Cloud Logging and Cloud Monitoring to provide a central place to view all Airflow service and workflow logs.

Cloud Composer Components

  • Cloud Composer helps define a series of tasks as Workflow executed within an Environment
  • Workflows are created using DAGs, or Directed Acyclic Graphs (a minimal DAG sketch follows this list)
  • A DAG is a collection of tasks that are scheduled and executed, organized in a way that reflects their relationships and dependencies.
  • DAGs are stored in Cloud Storage
  • Each Task can represent anything from ingestion, transform, filtering, monitoring, preparing, etc.
  • Environments are self-contained Airflow deployments based on Google Kubernetes Engine, and they work with other Google Cloud services using connectors built into Airflow.
  • Cloud Composer environment is a wrapper around Apache Airflow with components like GKE Cluster, Web Server, Database, Cloud Storage.
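
As a rough illustration, here is a minimal Airflow DAG that Cloud Composer could schedule; the DAG ID, schedule, and trivial BashOperator tasks are placeholders for real ingestion, transform, and load steps.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Saved into the environment's DAGs bucket in Cloud Storage, from where
# Composer's Airflow scheduler picks it up automatically.
with DAG(
    dag_id="daily_sample_workflow",          # placeholder DAG name
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> transform >> load   # task dependencies form the DAG
```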

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers. Which cloud-native service should you use to orchestrate the entire pipeline?
    1. Cloud Dataflow
    2. Cloud Composer
    3. Cloud Dataprep
    4. Cloud Dataproc
  2. Your company is working on a multi-cloud initiative. The data processing pipelines require creating workflows that connect data, transfer data, processing, and using services across clouds. What cloud-native tool should be used for orchestration?
    1. Cloud Scheduler
    2. Cloud Dataflow
    3. Cloud Composer
    4. Cloud Dataproc

Google Cloud Dataflow vs Dataproc

Cloud Dataproc

  • Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
  • Cloud Dataproc provides a Hadoop cluster on GCP, with access to Hadoop-ecosystem tools (e.g. Apache Pig, Hive, and Spark); this has strong appeal if you are already familiar with Hadoop tools and have existing Hadoop jobs
  • Ideal for Lift and Shift migration of existing Hadoop environment
  • Requires manual provisioning of clusters
  • Consider Dataproc
    • If you have a substantial investment in Apache Spark or Hadoop on-premises and are considering moving to the cloud
    • If you are looking at a hybrid cloud and need portability across a private/multi-cloud environment
    • If, in the current environment, Spark is the primary machine learning tool and platform
    • If the code depends on any custom packages along with a distributed computing need

Cloud Dataflow

  • Google Cloud Dataflow is a fully managed, serverless service for unified stream and batch data processing requirements (see the pipeline sketch below)
  • Consider Dataflow when using it as a pre-processing pipeline for an ML model that can be deployed in GCP AI Platform Training (earlier called Cloud ML Engine)
  • Consider Dataflow when none of the above considerations made for Cloud Dataproc are relevant
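
A minimal Apache Beam (Python SDK) word-count sketch: the same pipeline code runs on Dataflow by supplying the DataflowRunner plus project, region, and staging options; the bucket paths below are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# For Dataflow, add runner/project/region/temp_location options here;
# without them the pipeline runs locally with the DirectRunner.
options = PipelineOptions()

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://example-bucket/input/*.txt")
        | "SplitWords" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "SumPerKey" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda word, count: f"{word},{count}")
        | "Write" >> beam.io.WriteToText("gs://example-bucket/output/wordcount")
    )
```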

Cloud Dataflow vs Dataproc Decision Tree

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Your company is forecasting a sharp increase in the number and size of Apache Spark and Hadoop jobs being run on your local data center. You want to utilize the cloud to help you scale this upcoming demand with the least amount of operations work and code change. Which product should you use?
    1. Google Cloud Dataflow
    2. Google Cloud Dataproc
    3. Google Compute Engine
    4. Google Container Engine
  2. A startup plans to use a data processing platform, which supports both batch and streaming applications. They would prefer to have a hands-off/serverless data processing platform to start with. Which GCP service is suited for them?
    1. Dataproc
    2. Dataprep
    3. Dataflow
    4. BigQuery

Google Cloud BigQuery Data Transfer Service

  • BigQuery Data Transfer Service automates data movement into BigQuery on a scheduled, managed basis (a configuration sketch follows this list)
  • After a data transfer is configured, the BigQuery Data Transfer Service automatically loads data into BigQuery on a regular basis.
  • BigQuery Data Transfer Service can also initiate data backfills to recover from any outages or gaps.
  • BigQuery Data Transfer Service can only sink data to BigQuery and cannot be used to transfer data out of BigQuery.
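
Below is a sketch of configuring a recurring Cloud Storage load with the Python client library; the project, dataset, table, bucket path, and schedule are assumptions chosen for the example.

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# Transfer configuration: load matching CSV files from a bucket into a
# BigQuery table once a day. Once created, the service runs it on schedule.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="analytics",          # placeholder dataset
    display_name="daily-gcs-load",
    data_source_id="google_cloud_storage",       # Cloud Storage data source
    schedule="every 24 hours",
    params={
        "data_path_template": "gs://example-bucket/exports/*.csv",
        "destination_table_name_template": "events",
        "file_format": "CSV",
        "skip_leading_rows": "1",
    },
)

client.create_transfer_config(
    parent=client.common_project_path("example-project"),   # placeholder project
    transfer_config=transfer_config,
)
```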

BigQuery Data Transfer Service Sources

  • BigQuery Data Transfer Service supports loading data from the following data sources:
    • Google Software as a Service (SaaS) apps
    • Campaign Manager
    • Cloud Storage
    • Google Ad Manager
    • Google Ads
    • Google Merchant Center (beta)
    • Google Play
    • Search Ads 360 (beta)
    • YouTube Channel reports
    • YouTube Content Owner reports
    • External cloud storage providers
      • Amazon S3
    • Data warehouses
      • Teradata
      • Amazon Redshift

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Your company uses Google Analytics for tracking. You need to export the session and hit data from a Google Analytics 360 reporting view on a scheduled basis into BigQuery for analysis. How can the data be exported?
    1. Configure a scheduler in Google Analytics to convert the Google Analytics data to JSON format, then import directly into BigQuery using bq command line.
    2. Use gsutil to export the Google Analytics data to Cloud Storage, then import into BigQuery and schedule it using Cron.
    3. Import data to BigQuery directly from Google Analytics using Cron
    4. Use BigQuery Data Transfer Service to import the data from Google Analytics

Reference

Google Cloud BigQuery Data Transfer Service

Google Cloud BigQuery Security

BigQuery Encryption

  • BigQuery automatically encrypts all data before it is written to disk
  • By default, Google uses the Default Encryption at Rest and manages the key encryption keys used for data protection.
  • BigQuery also supports customer-managed encryption keys (CMEK) with Cloud KMS to protect data at rest, and AEAD encryption functions to encrypt individual values within a table (see the sketch after this list).
  • BigQuery uses TLS for data in transit encryption
  • Cloud Data Loss Prevention (Cloud DLP) can be used to scan the BigQuery tables and to protect sensitive data and meet compliance requirements.
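
A minimal sketch of creating a CMEK-protected table with the BigQuery Python client; the project, dataset, table, and Cloud KMS key name are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

# Fully qualified Cloud KMS key; the BigQuery service account needs
# encrypt/decrypt permission on this key.
kms_key = (
    "projects/example-project/locations/us/keyRings/example-ring/cryptoKeys/example-key"
)

table = bigquery.Table("example-project.analytics.orders")
table.encryption_configuration = bigquery.EncryptionConfiguration(kms_key_name=kms_key)

client.create_table(table)   # data written to this table is protected by the CMEK
```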

BigQuery IAM Roles

  • BigQuery supports access control of datasets and tables using IAM
  • Primitive Roles
    • primitive roles act at the project level
    • By default, granting access to a project also grants access to datasets within it unless overridden
    • are not limited to BigQuery resources only
    • cannot separate data access permissions from job-running permissions
    • Viewer
      • View all datasets
      • Run Jobs/Queries
      • View and update all jobs that they started
    • Editor
      • All Viewer access
      • Modify or delete all tables
      • Create new datasets
    • Owner
      • All Editor access
      • list, modify, or delete all datasets
      • View all jobs
  • Predefined Roles
    • dataViewer, dataEditor, and dataOwner roles
      • are similar to the primitive roles except
        • can be assigned for individual datasets
        • don’t give users permission to run jobs or queries
    • user, jobUser roles
      • give users permission to run jobs or queries
      • A jobUser can only start jobs and cancel jobs, but cannot list datasets or tables
      • A user, on the other hand, can perform a variety of other tasks, such as listing or creating datasets
      • User or group granted the user role at the project level can create datasets and can run query jobs against tables in those datasets.
      • user role does not give permission to query data, view table data, or view table schema details for datasets the user did not create. Need to have the dataViewer role for the same.

Authorized Views

  • Authorized views help provide view access to a dataset
  • Use authorized views to restrict access at a lower resource level such as the table, column, row, or cell.
  • An authorized view allows sharing query results with particular users and groups without giving them access to the underlying tables.
  • Authorized View’s SQL query can be used to restrict the columns (fields) the users are able to query.
  • Authorized views HAVE to be created in a separate dataset from the source dataset. As access controls can be assigned only at the dataset level, if the view is created in the same dataset as the source data, the users would have access to both the view and the data.
  • Authorized View creation process (sketched below)
    • Create a separate dataset to store the view.
    • Create the view in the new dataset
    • Give the group read access to the dataset containing the view
    • Authorize the view to access the source dataset
    • Give the group the bigquery.user role to run jobs, including query jobs, within the project
  • Project-level bigquery.user role does not give the users the ability to view or query table data in the dataset containing the tables queried by the view. They need READER access to the dataset containing the view.
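
Below is a sketch of the creation process above using the BigQuery Python client; the project, dataset, and view names are placeholders, and the IAM grants in the last step would still be applied separately.

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

# 1. Create the view in a dataset that is separate from the source data.
view = bigquery.Table("example-project.shared_views.orders_summary")
view.view_query = """
    SELECT order_id, order_date, total_amount
    FROM `example-project.source_data.orders`
"""
view = client.create_table(view)

# 2. Authorize the view against the source dataset so queries through the view
#    succeed without the callers having access to the underlying tables.
source_dataset = client.get_dataset("example-project.source_data")
entries = list(source_dataset.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
source_dataset.access_entries = entries
client.update_dataset(source_dataset, ["access_entries"])

# 3. Separately, grant the analyst group READER access on the shared_views
#    dataset and bigquery.user at the project level so they can run query jobs.
```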

Fine-Grained Access Control

  • BigQuery supports access controls at the project, dataset, and table levels
  • BigQuery also supports fine-grained row and column level security
  • BigQuery provides fine-grained access to sensitive columns using policy tags, or type-based classification, of data.
  • Using BigQuery column-level security, you can create policies that check, at query time, whether a user has proper access.
  • Row-level security extends the principle of least privilege by enabling fine-grained access control to a subset of data in a BigQuery table, by means of row-level access policies.

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. You have multiple Data Analysts who work with the dataset hosted in BigQuery within the same project. As a BigQuery Administrator, you are required to grant the data analysts only the privilege to create jobs/queries and the ability to cancel self-submitted jobs. Which role should be assigned to the users?
    1. User
    2. Jobuser
    3. Owner
    4. Viewer
  2. Your analytics system executes queries against a BigQuery dataset. The SQL query is executed in batch and passes the contents of a SQL file to the BigQuery CLI. Then it redirects the BigQuery CLI output to another process. However, you are getting a permission error from the BigQuery CLI when the queries are executed. You want to resolve the issue. What should you do?
    1. Grant the service account BigQuery Data Viewer and BigQuery Job User roles.
    2. Grant the service account BigQuery Data Editor and BigQuery Data Viewer roles.
    3. Create a view in BigQuery from the SQL query and SELECT * from the view in the CLI.
    4. Create a new dataset in BigQuery, and copy the source table to the new dataset Query the new dataset and table from the CLI.
  3. You are responsible for the security and access control of a BigQuery dataset hosted within a project. Multiple users from multiple teams need to have access to different tables within the dataset. How can access be controlled?
    1. Create Authorized views for tables in a separate project and grant access to the teams
    2. Create Authorized views for tables in the same project and grant access to the teams
    3. Create Materialized views for tables in a separate project and grant access to the teams
    4. Create Materialized views for tables in the same project and grant access to the teams

References

Google Cloud BigQuery Data Governance

Google Cloud Dataproc

  • Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
  • Dataproc automation helps to create clusters quickly, manage them easily, and save money by turning clusters on and off as needed.
  • Dataproc helps reduce the time and money spent on administration and lets you focus on your jobs and your data.
  • Dataproc clusters are quick to start, scale, and shutdown, with each of these operations taking 90 seconds or less, on average
  • Dataproc has built-in integration with other GCP services, such as BigQuery, Cloud Storage, Bigtable, Cloud Logging, and Monitoring
  • Dataproc clusters support preemptible instances that have lower compute prices to reduce costs further.
  • Dataproc supports connectors for BigQuery, Bigtable, Cloud Storage
  • Dataproc also supports Anaconda, HBase, Flink, Hive WebHCat, Druid, Jupyter, Presto, Solr, Zeppelin, Ranger, Zookeeper, and much more.

Dataproc Cluster High Availability

  • A Dataproc cluster can be configured for High Availability by specifying the number of master instances in the cluster (see the sketch after this list)
  • Dataproc supports two master configurations:
    • Single Node Cluster – 1 master – 0 Workers (default, non HA)
      • provides one node for both master and worker
      • if the master fails, the in-flight jobs will necessarily fail and need to be retried, and HDFS will be inaccessible until the single NameNode fully recovers on reboot.
    • High Availability Cluster – 3 masters – N Workers (Hadoop HA)
      • HDFS High Availability and YARN High Availability are configured to allow uninterrupted YARN and HDFS operations despite any single-node failures/reboots.
  • All nodes in a High Availability cluster reside in the same zone. If there is a failure that impacts all nodes in a zone, the failure will not be mitigated.
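
Below is a sketch of creating an HA cluster (three masters) with the Dataproc Python client; the project, region, cluster name, machine types, and worker count are placeholders for the example.

```python
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "example-project",
    "cluster_name": "ha-analytics-cluster",
    "config": {
        # Three masters enables HDFS and YARN High Availability.
        "master_config": {"num_instances": 3, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 4, "machine_type_uri": "n1-standard-4"},
    },
}

operation = client.create_cluster(
    request={"project_id": "example-project", "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)   # blocks until the cluster is created
```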

Dataproc Cluster Scaling

  • Dataproc cluster can be adjusted to scale by increasing or decreasing the number of primary or secondary worker nodes (horizontal scaling)
  • Dataproc cluster can be scaled at any time, even when jobs are running on the cluster.
  • Machine type of an existing cluster (vertical scaling) cannot be changed. To vertically scale, create a cluster using a supported machine type, then migrate jobs to the new cluster.
  • Dataproc cluster can help scale
    • to increase the number of workers to make a job run faster
    • to decrease the number of workers to save money
    • to increase the number of nodes to expand available Hadoop Distributed Filesystem (HDFS) storage

Dataproc Cluster Autoscaling

  • Dataproc Autoscaling provides a mechanism for automating cluster resource management by automatically scaling the number of cluster workers.
  • An Autoscaling Policy is a reusable configuration that describes how clusters using the autoscaling policy should scale.
  • It defines scaling boundaries, frequency, and aggressiveness to provide fine-grained control over cluster resources throughout cluster lifetime.
  • Autoscaling is recommended for
    • on clusters that store data in external services, such as Cloud Storage
    • on clusters that process many jobs
    • to scale up single-job clusters
  • Autoscaling is not recommended with/for:
    • HDFS: Autoscaling is not intended for scaling on-cluster HDFS
    • YARN Node Labels: Autoscaling does not support YARN Node Labels. YARN incorrectly reports cluster metrics when node labels are used.
    • Spark Structured Streaming: Autoscaling does not support Spark Structured Streaming
    • Idle Clusters: Autoscaling is not recommended for the purpose of scaling a cluster down to minimum size when the cluster is idle. It is better to delete an Idle cluster.

Dataproc Workers

  • Primary workers are standard Compute Engine VMs
  • Secondary workers can be used to scale with the below limitations
    • Processing only
      • Secondary workers do not store data.
      • can only function as processing nodes
      • useful to scale compute without scaling storage.
    • No secondary-worker-only clusters
      • Cluster must have primary workers
      • Dataproc adds two primary workers to the cluster, by default, if no primary workers are specified.
    • Machine type
      • use the machine type of the cluster’s primary workers.
    • Persistent disk size
      • are created, by default, with the smaller of 100GB or the primary worker boot disk size.
      • This disk space is used for local caching of data and is not available through HDFS.
    • Asynchronous Creation
      • Dataproc manages secondary workers using Managed Instance Groups (MIGs), which create VMs asynchronously as soon as they can be provisioned

Dataproc Initialization Actions

  • Dataproc supports initialization actions in executables or scripts that will run on all nodes in the cluster immediately after the cluster is set up
  • Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.

Dataproc Cloud Storage Connector

  • The Dataproc Cloud Storage connector helps Dataproc use Google Cloud Storage as the persistent store instead of HDFS (see the PySpark sketch after this list).
  • Cloud Storage connector helps separate the storage from the cluster lifecycle and allows the cluster to be shut down when not processing data
  • Cloud Storage connector benefits
    • Direct data access – Store the data in Cloud Storage and access it directly. You do not need to transfer it into HDFS first.
    • HDFS compatibility – can easily access your data in Cloud Storage using the gs:// prefix instead of hdfs://
    • Interoperability – Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and Google services.
    • Data accessibility – data is accessible even after shutting down the cluster, unlike HDFS.
    • High data availability – Data stored in Cloud Storage is highly available and globally replicated without a loss of performance.
    • No storage management overhead – Unlike HDFS, Cloud Storage requires no routine maintenance, such as checking the file system, or upgrading or rolling back to a previous version of the file system.
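
A minimal PySpark sketch of a job that reads from and writes to Cloud Storage directly via the connector’s gs:// scheme instead of staging data in HDFS; the bucket and paths are placeholders.

```python
from pyspark.sql import SparkSession

# On Dataproc the Cloud Storage connector is pre-installed, so gs:// paths
# work exactly like hdfs:// paths.
spark = SparkSession.builder.appName("gcs-wordcount").getOrCreate()

lines = spark.read.text("gs://example-bucket/input/*.txt")
words = lines.rdd.flatMap(lambda row: row.value.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Results stay available in Cloud Storage even after the cluster is deleted.
counts.toDF(["word", "count"]).write.csv("gs://example-bucket/output/wordcount")
```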

Cloud Dataproc vs Dataflow

Refer to the blog post @ Cloud Dataproc vs Dataflow

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Your company is forecasting a sharp increase in the number and size of Apache Spark and Hadoop jobs being run on your local data center. You want to utilize the cloud to help you scale this upcoming demand with the least amount of operations work and code change. Which product should you use?
    1. Google Cloud Dataflow
    2. Google Cloud Dataproc
    3. Google Compute Engine
    4. Google Container Engine
  2. Your company is migrating to Google Cloud and looking for an HBase alternative. The current solution uses a lot of custom code using the observer coprocessor. You are required to find the best alternative for migration while using managed services, if possible?
    1. Dataflow
    2. HBase on Dataproc
    3. Bigtable
    4. BigQuery

References

Google Cloud Dataproc