Google Cloud Data Analytics Services Cheat Sheet

Cloud Pub/Sub

  • Pub/Sub is a fully managed, asynchronous messaging service designed to be highly reliable and scalable with latencies on the order of 100 ms
  • Pub/Sub offers at-least-once message delivery and best-effort ordering to existing subscribers
  • Pub/Sub enables the creation of event producers and consumers, called publishers and subscribers.
  • Pub/Sub messages should be no greater than 10MB in size.
  • Messages can be received with pull or push delivery.
  • Messages published before a subscription is created will not be delivered to that subscription
  • Acknowledged messages are no longer delivered to subscribers and are deleted by default; however, they can be retained by configuring a message retention duration.
  • If publishers send messages with an ordering key and message ordering is enabled, Pub/Sub delivers the messages in order.
  • Pub/Sub supports encryption at rest and encryption in transit.
  • Seek feature allows subscribers to alter the acknowledgment state of messages in bulk, to replay or purge messages.
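At-least-once delivery means a subscriber must acknowledge each message within the ack deadline, or Pub/Sub redelivers it. A minimal local sketch of that contract (no real Pub/Sub calls; the class and method names are illustrative, not the client library API):

```python
class FakeSubscription:
    """Illustrative model of at-least-once delivery: unacked messages are redelivered."""

    def __init__(self, ack_deadline_s=10):
        self.ack_deadline_s = ack_deadline_s
        self._pending = {}   # ack_id -> [message, last_delivery_time]
        self._next_id = 0

    def publish(self, message):
        ack_id = str(self._next_id)
        self._next_id += 1
        self._pending[ack_id] = [message, None]
        return ack_id

    def pull(self, now):
        """Deliver every message that is unacked and past its deadline (or never delivered)."""
        delivered = []
        for ack_id, slot in self._pending.items():
            msg, last = slot
            if last is None or now - last >= self.ack_deadline_s:
                slot[1] = now
                delivered.append((ack_id, msg))
        return delivered

    def ack(self, ack_id):
        self._pending.pop(ack_id, None)


sub = FakeSubscription(ack_deadline_s=10)
sub.publish("order-created")
first = sub.pull(now=0)          # message delivered once
redelivered = sub.pull(now=15)   # not acked within the deadline -> delivered again
sub.ack(redelivered[0][0])
after_ack = sub.pull(now=30)     # acked messages are never redelivered
```

This is why subscriber code must be idempotent: the same message can arrive more than once.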

BigQuery

  • BigQuery is a fully managed, durable, petabyte scale, serverless, highly scalable, and cost-effective multi-cloud data warehouse.
  • supports a standard SQL dialect
  • automatically replicates data and keeps a seven-day history of changes, allowing easy restoration and comparison of data from different times
  • supports federated data and can process external data sources in GCS for Parquet and ORC open-source file formats, transactional databases (Bigtable, Cloud SQL), or spreadsheets in Drive without moving the data.
  • Data model consists of Datasets, tables
  • BigQuery performance can be improved using Partitioned tables and Clustered tables.
  • BigQuery encrypts all data at rest and supports encryption in transit.
  • BigQuery Data Transfer Service automates data movement into BigQuery on a scheduled, managed basis
  • Best Practices
    • Control projection, avoid select *
    • Estimate costs, as queries are billed according to the number of bytes read; the cost can be estimated using the --dry_run feature
    • Use the maximum bytes billed setting to limit query costs.
    • Use clustering and partitioning to reduce the amount of data scanned.
    • Avoid repeatedly transforming data via SQL queries. Materialize the query results in stages.
    • Use streaming inserts only if the data must be immediately available as streaming data is charged.
    • Prune partitioned queries, use the _PARTITIONTIME pseudo column to filter the partitions.
    • Denormalize data whenever possible using nested and repeated fields.
    • Avoid external data sources, if query performance is a top priority
    • Avoid using JavaScript user-defined functions
    • Optimize Join patterns. Start with the largest table.
    • Use the expiration settings to remove unneeded tables and partitions
    • Keep the data in BigQuery to take advantage of the long-term storage cost benefits rather than exporting to other storage options.
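Since on-demand queries are billed per byte read, a dry run's totalBytesProcessed converts directly into a cost estimate. A sketch of that arithmetic, assuming an illustrative on-demand rate (check current BigQuery pricing for your region):

```python
def estimate_query_cost(total_bytes_processed: int, price_per_tib: float = 5.0) -> float:
    """Convert a dry run's totalBytesProcessed into an on-demand cost estimate.

    price_per_tib is an illustrative assumption, not the current published rate.
    The real billing model also applies a 10 MB minimum per query (not modeled here).
    """
    tib = total_bytes_processed / (1024 ** 4)
    return tib * price_per_tib


# e.g. a dry run reporting 2 TiB scanned
cost = estimate_query_cost(2 * 1024 ** 4)  # -> 10.0
```

In practice you would feed in the bytes reported by `bq --dry_run` or the `dryRun` API parameter before running the real query.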

Bigtable

  • Bigtable is a fully managed, scalable, wide-column NoSQL database service with up to 99.999% availability.
  • ideal for applications that need very high throughput and scalability for key/value data, where each value is max. of 10 MB.
  • supports high read and write throughput at low latency and provides consistent sub-10ms latency – handles millions of requests/second
  • is a sparsely populated table that can scale to billions of rows and thousands of columns,
  • supports storage of terabytes or even petabytes of data
  • is not a relational database. It does not support SQL queries, joins, or multi-row transactions.
  • handles upgrades and restarts transparently, and it automatically maintains high data durability.
  • scales linearly in direct proportion to the number of nodes in the cluster
  • stores data in tables, which are composed of rows, each of which typically describes a single entity, and columns, which contain individual values for each row.
  • Each table has only one index, the row key. There are no secondary indices. Each row key must be unique.
  • Single-cluster Bigtable instances provide strong consistency.
  • Multi-cluster instances, by default, provide eventual consistency but can be configured to provide read-over-write consistency or strong consistency, depending on the workload and app profile settings
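Because the row key is the only index, key design determines performance: sequential keys (e.g. raw timestamps) send all writes to one node. A common pattern is to prefix the key with a field that spreads load, and use a reverse timestamp so the newest row per entity sorts first. A sketch (the MAX_TS bound and `#` separator are illustrative choices):

```python
MAX_TS = 10 ** 13  # assumed upper bound on epoch-millis, for illustration


def row_key(device_id: str, epoch_millis: int) -> str:
    """Compose a Bigtable-style row key: the entity prefix spreads writes across
    nodes, and the zero-padded reverse timestamp makes the newest row for a
    device sort lexicographically first."""
    reverse_ts = MAX_TS - epoch_millis
    return f"{device_id}#{reverse_ts:013d}"


k_new = row_key("sensor-42", 1_700_000_000_000)
k_old = row_key("sensor-42", 1_600_000_000_000)
# the newer event sorts before the older one, so a prefix scan reads latest-first
assert k_new < k_old
```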

Cloud Dataflow

  • Cloud Dataflow is a managed, serverless service for unified stream and batch data processing requirements
  • provides Horizontal autoscaling to automatically choose the appropriate number of worker instances required to run the job.
  • is based on Apache Beam, an open-source, unified model for defining both batch and streaming-data parallel-processing pipelines.
  • supports Windowing which enables grouping operations over unbounded collections by dividing the collection into windows of finite collections according to the timestamps of the individual elements.
  • supports the Drain feature to stop a streaming job gracefully, e.g. when deploying updates that are not compatible with an in-place update
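Windowing assigns each element to a finite window based on its timestamp. A local sketch of fixed (tumbling) windows, similar in spirit to Beam's FixedWindows but without Apache Beam itself:

```python
from collections import defaultdict


def fixed_windows(events, window_size_s):
    """Group (timestamp, value) pairs into tumbling windows of window_size_s seconds.
    Each element lands in exactly one window, keyed by the window's start time."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // window_size_s) * window_size_s
        windows[window_start].append(value)
    return dict(windows)


events = [(1, "a"), (4, "b"), (61, "c"), (125, "d")]
out = fixed_windows(events, window_size_s=60)
# -> {0: ["a", "b"], 60: ["c"], 120: ["d"]}
```

Beam additionally handles late data and triggers via watermarks, which this sketch deliberately omits.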

Cloud Dataproc

  • Cloud Dataproc is a managed Spark and Hadoop service to take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.
  • helps to create clusters quickly, manage them easily, and save money by turning clusters on and off as needed.
  • helps reduce the time and money spent on administration and lets you focus on your jobs and your data.
  • has built-in integration with other GCP services, such as BigQuery, Cloud Storage, Bigtable, Cloud Logging, and Monitoring
  • supports preemptible instances that have lower compute prices to reduce costs further.
  • also supports HBase, Flink, Hive WebHCat, Druid, Jupyter, Presto, Solr, Zeppelin, Ranger, Zookeeper, and much more.
  • supports connectors for BigQuery, Bigtable, Cloud Storage
  • can be configured for High Availability by specifying the number of master instances in the cluster
  • All nodes in a High Availability cluster reside in the same zone. If there is a failure that impacts all nodes in a zone, the failure will not be mitigated.
  • supports cluster scaling by increasing or decreasing the number of primary or secondary worker nodes (horizontal scaling)
  • supports autoscaling, which provides a mechanism for automating cluster resource management.
  • supports initialization actions in executables or scripts that will run on all nodes in the cluster immediately after the cluster is set up

Cloud Dataprep

  • Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
  • is fully managed, serverless, and scales on-demand with no infrastructure to deploy or manage
  • provides easy data preparation with clicks and no code.
  • automatically identifies data anomalies & helps take fast corrective action
  • automatically detects schemas, data types, possible joins, and anomalies such as missing values, outliers, and duplicates
  • uses Dataflow or BigQuery under the hood, enabling unstructured or structured datasets processing of any size with the ease of clicks, not code

Datalab

  • Cloud Datalab is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models using familiar languages, such as Python and SQL, interactively.
  • runs on Google Compute Engine and connects to multiple cloud services easily so you can focus on your data science tasks.
  • is built on Jupyter (formerly IPython)
  • enables analysis of the data on Google BigQuery, Cloud Machine Learning Engine, Google Compute Engine, and Google Cloud Storage using Python, SQL, and JavaScript (for BigQuery user-defined functions).

Google Cloud Services Cheat Sheet

Google Certification Exam Cheat Sheet

Google Certification Exams cover a lot of topics and a wide range of services, with minute details for features, patterns, anti-patterns, and their integration with other services. This blog post is a quick summary of all the services and key points for a glance before you appear for the exam.

Google Services

GCP Marketplace (Cloud Launcher)

  • GCP Marketplace offers ready-to-go development stacks, solutions, and services to accelerate development and spend less time installing and more time developing.
    • Deploy production-grade solutions in a few clicks
    • Single bill for all your GCP and 3rd party services
    • Manage solutions using Deployment Manager
    • Notifications when a security update is available
    • Direct access to partner support

References

Google_Cloud_Services

Google Cloud Networking Services Cheat Sheet

Virtual Private Cloud

  • Virtual Private Cloud (VPC) provides networking functionality for the cloud-based resources and services that is global, scalable, and flexible.
  • VPC networks are global resources, including the associated routes and firewall rules, and are not associated with any particular region or zone.
  • Subnets are regional resources and each subnet defines a range of IP addresses
  • Network firewall rules
    • control the Traffic to and from instances.
    • Rules are implemented on the VMs themselves, so traffic can only be controlled and logged as it leaves or arrives at a VM.
    • Firewall rules are defined to allow or deny traffic and are evaluated in order of a defined priority
    • Highest priority (lower integer) rule applicable to a target for a given type of traffic takes precedence
  • Resources within a VPC network can communicate with one another by using internal IPv4 addresses, subject to applicable network firewall rules.
  • Private access options for services allow instances with internal IP addresses to communicate with Google APIs and services.
  • Shared VPC to keep a VPC network in a common host project shared with service projects. Authorized IAM members from other projects in the same organization can create resources that use subnets of the Shared VPC network
  • VPC Network Peering allows VPC networks to be connected with other VPC networks in different projects or organizations.
  • VPC networks can be securely connected in hybrid environments by using Cloud VPN or Cloud Interconnect.
  • Primary and secondary IP ranges cannot overlap with the on-premises CIDR
  • VPC networks only support IPv4 unicast traffic. They do not support broadcast, multicast, or IPv6 traffic within the network; VMs in the VPC network can only send to IPv4 destinations and only receive traffic from IPv4 sources.
  • VPC Flow Logs records a sample of network flows sent from and received by VM instances, including instances used as GKE nodes.
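The firewall evaluation rules above (lowest priority integer wins; implied allow-egress and deny-ingress at the lowest precedence) can be sketched as a small function. The rule and packet representations here are illustrative, not the GCP API:

```python
def evaluate(rules, packet):
    """Return 'allow' or 'deny' per VPC firewall semantics: among all matching
    rules, the lowest priority number takes precedence; if nothing matches,
    the implied rules apply (allow egress, deny ingress)."""
    matching = [r for r in rules if r["match"](packet)]
    if not matching:
        return "allow" if packet["direction"] == "egress" else "deny"
    best = min(matching, key=lambda r: r["priority"])
    return best["action"]


rules = [
    # deny SSH from anywhere...
    {"priority": 1000, "action": "deny",
     "match": lambda p: p["port"] == 22},
    # ...except from the bastion, which wins because 900 < 1000
    {"priority": 900, "action": "allow",
     "match": lambda p: p["port"] == 22 and p["src"] == "bastion"},
]

evaluate(rules, {"direction": "ingress", "port": 22, "src": "laptop"})   # -> "deny"
evaluate(rules, {"direction": "ingress", "port": 22, "src": "bastion"})  # -> "allow"
evaluate(rules, {"direction": "ingress", "port": 80, "src": "laptop"})   # -> "deny" (implied)
```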

Cloud Load Balancing

  • Cloud Load Balancing is a fully distributed, software-defined managed load balancing service
  • distributes user traffic across multiple instances of the applications and reduces the risk of performance issues by spreading the load
  • provides health checking mechanisms that determine if backends, such as instance groups and zonal network endpoint groups (NEGs), are healthy and properly respond to traffic.
  • supports IPv6 clients with HTTP(S) Load Balancing, SSL Proxy Load Balancing, and TCP Proxy Load Balancing.
  • supports multiple Cloud Load Balancing types
    • Internal HTTP(S) Load Balancing
      • is a proxy-based, regional Layer 7 load balancer that enables running and scaling services behind an internal IP address.
      • supports a regional backend service, which distributes HTTP and HTTPS requests to healthy backends (either instance groups containing Compute Engine VMs or NEGs containing GKE containers).
      • supports path-based routing
      • preserves the Host header of the original client request and also appends two IP addresses (client and load balancer) to the X-Forwarded-For header
      • supports a regional health check that periodically monitors the readiness of the backends.
      • has native support for the WebSocket protocol when using HTTP or HTTPS as the protocol to the backend
    • External HTTP(S) Load Balancing
      • is a global, proxy-based Layer 7 load balancer that enables running and scaling the services worldwide behind a single external IP address
      • distributes HTTP and HTTPS traffic to backends hosted on Compute Engine and GKE
      • offers global (cross-regional) and regional load balancing
      • supports content-based load balancing using URL maps
      • preserves the Host header of the original client request and also appends two IP addresses (Client and LB) to the X-Forwarded-For header
      • supports connection draining on backend services
      • has native support for the WebSocket protocol when using HTTP or HTTPS as the protocol to the backend
      • does not support client certificate-based authentication, also known as mutual TLS authentication.
    • Internal TCP/UDP Load Balancing
      • is a managed, internal, pass-through, regional Layer 4 load balancer that enables running and scaling services behind an internal IP address
      • distributes traffic among VM instances in the same region in a Virtual Private Cloud (VPC) network by using an internal IP address.
      • provides high-performance, pass-through Layer 4 load balancer for TCP or UDP traffic.
      • routes original connections directly from clients to the healthy backends, without any interruption.
      • does not terminate SSL traffic and SSL traffic can be terminated by the backends instead of by the load balancer
      • provides access through VPC Network Peering, Cloud VPN or Cloud Interconnect
      • supports health check that periodically monitors the readiness of the backends.
    • External TCP/UDP Network Load Balancing
      • is a managed, external, pass-through, regional Layer 4 load balancer that distributes TCP or UDP traffic originating from the internet among VM instances in the same region
      • Load-balanced packets are received by backend VMs with their source IP unchanged.
      • Load-balanced connections are terminated by the backend VMs. Responses from the backend VMs go directly to the clients, not back through the load balancer.
      • scope of a network load balancer is regional, not global. A network load balancer cannot span multiple regions. Within a single region, the load balancer services all zones.
      • supports connection tracking table and a configurable consistent hashing algorithm to determine how traffic is distributed to backend VMs.
      • does not support Network endpoint groups (NEGs) as backends
    • External SSL Proxy Load Balancing
      • is a reverse proxy load balancer that distributes SSL traffic coming from the internet to VM instances in the VPC network.
      • with SSL traffic, user SSL (TLS) connections are terminated at the load balancing layer, and then proxied to the closest available backend instances by using either SSL (recommended) or TCP.
      • supports global load balancing with the Premium Tier and regional load balancing with the Standard Tier
      • is intended for non-HTTP(S) traffic. For HTTP(S) traffic, GCP recommends using HTTP(S) Load Balancing.
      • supports proxy protocol header to preserve the original source IP addresses of incoming connections to the load balancer
      • does not support client certificate-based authentication, also known as mutual TLS authentication.
    • External TCP Proxy Load Balancing
      • is a reverse proxy load balancer that distributes TCP traffic coming from the internet to VM instances in the VPC network
      • terminates traffic coming over a TCP connection at the load balancing layer, and then forwards to the closest available backend using TCP or SSL
      • uses a single IP address for all users worldwide and automatically routes traffic to the backends that are closest to the user
      • supports global load balancing with the Premium Tier and regional load balancing with the Standard Tier
      • supports proxy protocol header to preserve the original source IP addresses of incoming connections to the load balancer
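The consistent hashing mentioned for Network Load Balancing can be sketched as hashing the connection 5-tuple so all packets of one connection reach the same backend. This is illustrative of the idea, not GCP's exact algorithm:

```python
import hashlib


def pick_backend(backends, src_ip, src_port, dst_ip, dst_port, protocol):
    """Hash the connection 5-tuple so every packet of a given connection maps
    to the same backend (pass-through load balancing with session affinity)."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{protocol}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return backends[digest % len(backends)]


backends = ["vm-a", "vm-b", "vm-c"]
first = pick_backend(backends, "203.0.113.7", 54321, "198.51.100.1", 443, "tcp")
again = pick_backend(backends, "203.0.113.7", 54321, "198.51.100.1", 443, "tcp")
assert first == again  # same 5-tuple -> same backend, deterministically
```

A plain modulo hash like this reshuffles many connections when the backend list changes; true consistent hashing (and GCP's connection tracking table) limits that disruption.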

Cloud CDN

  • caches website and application content closer to the user
  • uses Google’s global edge network to serve content closer to users, which accelerates the websites and applications.
  • works with external HTTP(S) Load Balancing to deliver content to the users
  • Cloud CDN content can be sourced from various types of backends
    • Instance groups
    • Zonal network endpoint groups (NEGs)
    • Serverless NEGs: One or more App Engine, Cloud Run, or Cloud Functions services
    • Internet NEGs, for endpoints that are outside of Google Cloud (also known as custom origins)
    • Buckets in Cloud Storage
  • Cloud CDN with Google Cloud Armor enforces security policies only for requests for dynamic content, cache misses, or other requests that are destined for the origin server. Cache hits are served even if the downstream Google Cloud Armor security policy would prevent that request from reaching the origin server.
  • recommends
    • using versioning instead of cache invalidation
    • using custom cache keys to improve the cache hit ratio
    • caching static content
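Versioning instead of invalidation typically means embedding a content hash in the object name, so each deploy produces new URLs and stale cached copies simply age out. A sketch of deriving such a name (the naming scheme is an illustrative convention):

```python
import hashlib


def versioned_name(path: str, content: bytes) -> str:
    """Derive a cache-busting object name, e.g. static/app.css -> static/app.<hash>.css.
    Changed content yields a new name, so no CDN invalidation is needed."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, dot, ext = path.rpartition(".")
    return f"{stem}.{digest}.{ext}" if dot else f"{path}.{digest}"


name_v1 = versioned_name("static/app.css", b"body { color: red }")
name_v2 = versioned_name("static/app.css", b"body { color: blue }")
assert name_v1 != name_v2  # a new deploy gets a new URL
```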

Cloud VPN

  • securely connects the peer network to the VPC network or two VPCs in GCP through an IPsec VPN connection.
  • encrypts the data as it travels over the internet.
  • only supports site-to-site IPsec VPN connectivity and not client-to-gateway scenarios
  • allows users to access private RFC1918 addresses on resources in the VPC from on-prem computers also using private RFC1918 addresses.
  • can be used with Private Google Access for on-premises hosts
  • Cloud VPN HA
    • provides a highly available and secure connection between the on-premises network and the VPC network through an IPsec VPN connection in a single region
    • provides an SLA of 99.99% service availability, when configured with two interfaces and two external IP addresses.
  • supports up to 3 Gbps per tunnel with a maximum of 8 tunnels
  • supports static as well as dynamic routing using Cloud Router
  • supports IKEv1 or IKEv2 using a shared secret

Cloud Interconnect

  • Cloud Interconnect provides two options for extending the on-premises network to the VPC networks in Google Cloud.
  • Dedicated Interconnect (Dedicated connection)
    • provides a direct physical connection between the on-premises network and Google’s network
    • requires your network to physically meet Google’s network in a colocation facility with your own routing equipment
    • supports only dynamic routing
    • supports bandwidth from 10 Gbps minimum to 200 Gbps maximum.
  • Partner Interconnect (Use a service provider)
    • provides connectivity between the on-premises and VPC networks through a supported service provider.
    • supports bandwidth from 50 Mbps minimum to 10 Gbps maximum.
    • provides Layer 2 and Layer 3 connectivity
      • For Layer 2 connections, you must configure and establish a BGP session between the Cloud Routers and on-premises routers for each created VLAN attachment
      • For Layer 3 connections, the service provider establishes a BGP session between the Cloud Routers and their edge routers for each VLAN attachment.
  • A single Interconnect connection does not offer redundancy or high availability, and it's recommended to
    • use 2 in the same metropolitan area (city) as the existing one, but in a different edge availability domain (metro availability zone).
    • use 4 with 2 connections in two different metropolitan areas (city), and each connection in a different edge availability domain (metro availability zone)
    • Cloud Routers are required one in each Google Cloud region
  • Cloud Interconnect does not encrypt the connection between your network and Google’s network. For additional security, use application-level encryption or your own VPN.
  • Currently, Cloud VPN can’t be used with Dedicated Interconnect.

Cloud Router

  • is a fully distributed, managed service that provides dynamic routing and scales with the network traffic.
  • works with both legacy networks and VPC networks.
  • isn’t supported for Direct Peering or Carrier Peering connections.
  • helps dynamically exchange routes between the Google Cloud networks and the on-premises network.
  • peers with the on-premises VPN gateway or router to provide dynamic routing and exchanges topology information through BGP.
  • Google Cloud recommends creating two Cloud Routers in each region for a Cloud Interconnect for 99.99% availability.
  • supports the following dynamic routing modes
    • Regional routing mode – provides visibility to resources only in the defined region.
    • Global routing mode – provides visibility to resources in all regions

Cloud DNS

  • is a high-performance, resilient, reliable, low-latency, global DNS service that publishes the domain names to the global DNS in a cost-effective way.
  • With Shared VPC, Cloud DNS managed private zone, Cloud DNS peering zone, or Cloud DNS forwarding zone must be created in the host project
  • provides Private Zone which supports DNS services for a GCP project. VPCs in the same project can use the same name servers
  • supports DNS Forwarding for Private Zones, which overrides normal DNS resolution for the specified zones. Queries for the specified zones are forwarded to the listed forwarding targets.
  • supports DNS Peering, which allows sending requests for records that come from one zone’s namespace to another VPC network with GCP
  • supports DNS Outbound Policy, which forwards all DNS requests for a VPC network to the specified server targets. It disables internal DNS for the selected networks.
  • Cloud DNS VPC Name Resolution Order
    • DNS Outbound Server Policy
    • DNS Forwarding Zone
    • DNS Peering
    • Compute Engine internal DNS
    • Public Zones
  • supports DNSSEC, a feature of DNS, that authenticates responses to domain name lookups and protects the domains from spoofing and cache poisoning attacks
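The VPC name resolution order above is a first-match fallthrough: each resolver in order gets a chance to answer before the next is consulted. A sketch with illustrative stub resolvers (the lookup functions and addresses are made up for the example):

```python
def resolve(name, resolvers):
    """Walk the VPC name-resolution order and return (label, answer) from the
    first resolver that answers; `resolvers` is an ordered list of
    (label, lookup_fn) pairs where lookup_fn returns None on no answer."""
    for label, lookup in resolvers:
        answer = lookup(name)
        if answer is not None:
            return label, answer
    return None


order = [
    ("outbound-policy", lambda n: None),   # no outbound server policy configured
    ("forwarding-zone", lambda n: None),   # this zone is not forwarded
    ("peering",         lambda n: None),   # no peering zone matches
    ("internal-dns",    lambda n: "10.128.0.5" if n.endswith(".internal") else None),
    ("public",          lambda n: "93.184.216.34"),
]

resolve("vm-1.us-central1-a.c.my-proj.internal", order)  # -> ("internal-dns", "10.128.0.5")
resolve("example.com", order)                            # -> ("public", "93.184.216.34")
```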

Google Cloud Compute Services Cheat Sheet

Google Cloud Compute Services

Google Cloud - Compute Services Options

Compute Engine

  • is a virtual machine (VM) hosted on Google’s infrastructure.
  • can run the public images for Google provided Linux and Windows Server as well as custom images created or imported from existing systems
  • availability policy determines how it behaves when there is a maintenance event
    • VM instance’s maintenance behavior onHostMaintenance, which determines whether the instance is live migrated MIGRATE (default) or stopped TERMINATE
    • Instance’s restart behavior automaticRestart  which determines whether the instance automatically restarts (default) if it crashes or gets stopped
  • Live migration helps keep the VM instances running even when a host system event, such as a software or hardware update, occurs
  • Preemptible VM is an instance that can be created and run at a much lower price than normal instances, however can be stopped at any time
  • Shielded VM offers verifiable integrity of the Compute Engine VM instances, to confirm the instances haven’t been compromised by boot- or kernel-level malware or rootkits.
  • Instance template is a resource used to create VM instances and managed instance groups (MIGs) with identical configuration
  • Instance group is a collection of virtual machine (VM) instances that can be managed as a single entity.
    • Managed instance groups (MIGs)
      • allows app creation with multiple identical VMs.
      • workloads can be made scalable and highly available by taking advantage of automated MIG services, including: autoscaling, autohealing, regional (multiple zone) deployment, and automatic updating
      • supports rolling update feature
      • works with load balancing services to distribute traffic across all of the instances in the group.
    • Unmanaged instance groups
      • allows load balance across a fleet of VMs that you manage yourself which may not be identical
  • Instance templates are global resources, while instance groups are zonal or regional.
  • Machine image stores all the configuration, data, metadata and permissions from one or more disks required to create a VM instance
  • Sole-tenancy provides dedicated hosting only for the project’s VMs and provides an added layer of hardware isolation
  • deletionProtection prevents accidental VM deletion esp. for VMs running critical workloads and need to be protected
  • provides sustained use discounts, committed use discounts, free tier, etc. in pricing
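MIG autoscaling based on target utilization can be approximated as: scale so that per-instance utilization moves back toward the target. A simplified sketch of that proportional rule (a rough model of the behavior, not the actual autoscaler implementation):

```python
import math


def recommended_replicas(current_replicas, observed_utilization, target_utilization,
                         min_replicas=1, max_replicas=10):
    """Approximate a target-utilization autoscaler: if instances run hotter than
    the target, add replicas; if cooler, remove them, clamped to [min, max]."""
    if observed_utilization <= 0:
        return min_replicas
    raw = math.ceil(current_replicas * observed_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


# 4 VMs at 90% CPU with a 60% target -> scale out to 6
recommended_replicas(4, observed_utilization=0.9, target_utilization=0.6)  # -> 6
# 4 VMs at 30% CPU with a 60% target -> scale in to 2
recommended_replicas(4, observed_utilization=0.3, target_utilization=0.6)  # -> 2
```

The real autoscaler also applies stabilization periods and cool-down windows to avoid flapping, which this sketch omits.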

App Engine

  • App Engine helps build highly scalable applications on a fully managed serverless platform
  • Each Cloud project can contain only a single App Engine application
  • App Engine is regional, which means the infrastructure that runs the apps is located in a specific region, and Google manages it so that it is available redundantly across all of the zones within that region
  • App Engine application location or region cannot be changed once created
  • App engine allows traffic management to an application version by migrating or splitting traffic.
    • Traffic Splitting (Canary) – distributes a percentage of traffic to versions of the application.
    • Traffic Migration – smoothly switches request routing
  • Supports Standard and Flexible environments
    • Standard environment
      • Application instances that run in a sandbox, using the runtime environment of a supported language only.
      • Sandbox restricts what the application can do
        • only allows the app to use a limited set of binary libraries
        • app cannot write to disk
        • limits the CPU and memory options available to the application
      • Sandbox does not support
        • SSH debugging
        • Background processes
        • Background threads (limited capability)
        • Using Cloud VPN
    • Flexible environment
      • Application instances run within Docker containers on Compute Engine virtual machines (VM).
      • As Flexible environment supports docker it can support custom runtime or source code written in other programming languages.
      • Allows selection of any Compute Engine machine type for instances so that the application has access to more memory and CPU.
  • min_idle_instances indicates the number of additional instances to be kept running and ready to serve traffic for this version.
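Traffic splitting routes a configured percentage of requests to each version. A sketch of weighted version selection (illustrative of the idea; App Engine's actual routing can also split by IP address or cookie rather than randomly per request):

```python
import random


def pick_version(splits, rnd=None):
    """Choose an app version by weighted traffic split, e.g. a canary getting 10%.
    `splits` maps version -> traffic fraction, summing to 1.0."""
    rnd = random.random() if rnd is None else rnd
    cumulative = 0.0
    for version, share in splits.items():
        cumulative += share
        if rnd < cumulative:
            return version
    return version  # guard against floating-point rounding at the boundary


splits = {"v1": 0.9, "v2-canary": 0.1}
pick_version(splits, rnd=0.50)  # -> "v1"
pick_version(splits, rnd=0.95)  # -> "v2-canary"
```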

GKE

  • Node pools are groups of nodes within a cluster that share the same configuration and can be created, scaled, and deleted independently.
  • gcloud container clusters resize CLUSTER_NAME --num-nodes scales a cluster; the older --size flag is deprecated in favor of --num-nodes.

Google Cloud Storage Services Cheat Sheet

Google Cloud Storage Options

  • Relational (SQL) – Cloud SQL & Cloud Spanner
  • Non-Relational (NoSQL) – Datastore & Bigtable
  • Structured & Semi-structured – Cloud SQL, Cloud Spanner, Datastore & Bigtable
  • Unstructured – Cloud Storage
  • Block Storage – Persistent disk
  • Transactional (OLTP) – Cloud SQL & Cloud Spanner
  • Analytical (OLAP) – Bigtable & BigQuery
  • Fully Managed (Serverless) – Cloud Spanner, Datastore, BigQuery
  • Requires Provisioning – Cloud SQL, Bigtable
  • Global – Cloud Spanner
  • Regional – Cloud SQL, Bigtable, Datastore

Google Cloud - Storage Options Decision Tree

Google Cloud Storage – GCS

  • provides service for storing unstructured data i.e. objects
  • consists of bucket and objects where an object is an immutable piece of data consisting of a file of any format stored in containers called buckets.
  • support different location types
    • regional
      • A region is a specific geographic place, such as London.
      • helps optimize latency and network bandwidth for data consumers, such as analytics pipelines, that are grouped in the same region.
    • dual-region
      • is a specific pair of regions, such as Finland and the Netherlands.
      • provides higher availability that comes with being geo-redundant.
    • multi-region
      • is a large geographic area, such as the United States, that contains two or more geographic places.
      • allows serving content to data consumers that are outside of the Google network and distributed across large geographic areas
      • provides higher availability that comes with being geo-redundant.
    • Objects stored in a multi-region or dual-region are geo-redundant i.e. data is stored redundantly in at least two separate geographic places separated by at least 100 miles.
  • Storage class affects the object’s availability and pricing model
    • Standard Storage is best for data that is frequently accessed (hot data) and/or stored for only brief periods of time.
    • Nearline Storage is a low-cost, highly durable storage service for storing infrequently accessed data (warm data)
    • Coldline Storage provides a very-low-cost, highly durable storage service for storing infrequently accessed data (cold data)
    • Archive Storage is the lowest-cost, highly durable storage service for data archiving, online backup, and disaster recovery. (coldest data)
  • Object Versioning prevents accidental overwrites and deletion. It retains a noncurrent object version when the live object version gets replaced, overwritten or deleted
  • Object Lifecycle Management sets Time To Live (TTL) on an object and helps configure transition or expiration of the objects based on specified rules, e.g. SetStorageClass to change the storage class, or Delete to expire noncurrent or archived objects
  • Resumable uploads are the recommended method for uploading large files, because they don’t need to be restarted from the beginning if there is a network failure while the upload is underway.
  • Parallel composite uploads divide a file into up to 32 chunks, which are uploaded in parallel to temporary objects; the final object is composed from the temporary objects, and the temporary objects are deleted
  • Requester Pays on the bucket that requires requester to include a billing project in their requests, thus billing the requester’s project.
  • supports upload and storage of any MIME type of data up to 5 TB in size.
  • Retention policy on a bucket ensures that all current and future objects in the bucket cannot be deleted or replaced until they reach the defined age
  • Retention policy locks will lock a retention policy on a bucket and prevents the policy from ever being removed or the retention period from ever being reduced (although it can be increased). Locking a retention policy is irreversible
  • Bucket Lock feature provides immutable storage on Cloud Storage
  • Object holds, when set on individual objects, prevents the object from being deleted or replaced, however allows metadata to be edited.
  • Signed URLs provide time-limited read or write access to an object through a generated URL.
  • Signed policy documents helps specify what can be uploaded to a bucket.
  • Cloud Storage supports encryption at rest and in transit as well
  • Cloud Storage supports both
    • Server-side encryption with support for Google managed, Customer managed and Customer supplied encryption keys
    • Client-side encryption: encryption that occurs before data is sent to Cloud Storage, encrypted at client side.
  • Cloud Storage operations are
    • strongly consistent for read after writes or deletes and listing
    • eventually consistent for granting or revoking access
  • Cloud Storage allows setting CORS configuration at the bucket level only
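Planning a parallel composite upload comes down to splitting the file into at most 32 pieces, since 32 is the compose limit per request. A sketch of that arithmetic (the 64 MiB target chunk size is an illustrative choice, not a GCS requirement):

```python
def plan_composite_upload(file_size, target_chunk=64 * 1024 * 1024, max_chunks=32):
    """Return the chunk sizes for a parallel composite upload: aim for
    target_chunk-sized pieces, but never more than max_chunks (the compose
    limit), distributing any remainder across the first chunks."""
    if file_size == 0:
        return []
    chunks = min(max_chunks, -(-file_size // target_chunk))  # ceil division
    base = file_size // chunks
    sizes = [base + (1 if i < file_size % chunks else 0) for i in range(chunks)]
    return sizes


sizes = plan_composite_upload(10 * 1024 ** 3)  # a 10 GiB file
assert len(sizes) == 32 and sum(sizes) == 10 * 1024 ** 3
```

Each size would then become one temporary object uploaded in parallel, composed into the final object, after which the temporaries are deleted.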

Cloud SQL

  • provides relational MySQL, PostgreSQL, and SQL Server databases as a service
  • is managed; however, you need to select and provision machine types
  • supports automatic replication, managed backups, vertical scaling for read and write, and horizontal scaling (using read replicas)
  • High Availability configuration provides data redundancy and failover capability with minimal downtime when a zone or instance becomes unavailable due to a zonal outage or instance corruption
  • HA standby instance does not increase scalability and cannot be used for read queries.
  • Read replicas help scale horizontally the use of data in a database without degrading performance
  • is regional – although it now supports cross region read replicas
  • supports data encryption at rest and in transit
  • supports Point-In-Time recovery with binary logging and backups
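A minimal Terraform sketch of the above (instance names and tiers are hypothetical, assuming the google provider): an HA primary with point-in-time recovery, plus a cross-region read replica.

```hcl
resource "google_sql_database_instance" "primary" {
  name             = "primary-instance"   # hypothetical
  database_version = "POSTGRES_13"
  region           = "us-central1"

  settings {
    tier              = "db-custom-2-7680"
    availability_type = "REGIONAL"   # HA: standby in another zone; standby is not readable

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true   # requires backups/logging to be enabled
    }
  }
}

resource "google_sql_database_instance" "replica" {
  name                 = "read-replica"   # hypothetical
  master_instance_name = google_sql_database_instance.primary.name
  database_version     = "POSTGRES_13"
  region               = "us-east1"       # cross-region read replica

  settings {
    tier = "db-custom-2-7680"
  }
}
```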

Cloud Spanner

Datastore

  • Ancestor Paths + Best Practices

BigQuery

  • supports user- or project-level custom query quotas
  • supports on-demand and flat-rate pricing
  • supports dry runs, which price a query based on the amount of bytes it would read, via the --dry_run flag in the bq command-line tool or the dryRun parameter when submitting a query job using the API

Google Cloud Datastore OR Filestore

MemoryStore

Google Persistent Disk

Google Local SSD

Google Cloud Identity Services Cheat Sheet

Identity & Access Management – IAM

  • administrators authorize who can take what action on which resources
  • IAM Member can be a Google Account (for end users), a service account (for apps and virtual machines), a Google group, or a Google Workspace or Cloud Identity domain that can access a resource.
  • IAM Role is a collection of permissions granted to authenticated members.
  • supports 3 kinds of roles
    • Primitive roles – broad level of access
    • Predefined roles – finer-grained granular access control
    • Custom roles – tailored permissions when predefined roles don’t meet the needs.
  • Best practice is to use Predefined over primitive roles
  • IAM Policy binds one or more members to a role.
  • IAM policy can be set at any level in the resource hierarchy:  organization level,  folder level, the project level, or the resource level.
  • IAM Policy inheritance is transitive and resources inherit the policies of all of their parent resources.
  • Effective policy for a resource is the union of the policy set on that resource and the policies inherited from higher up in the hierarchy.
  • Service account is a special kind of account used by an application or a virtual machine (VM) instance, not a person.
  • Access Scopes are the legacy method of specifying permissions for the instance for default service accounts
  • Best practice is to set the full cloud-platform access scope on the instance, then securely limit the service account’s access using IAM roles.
  • Delegate responsibility with groups (instead of individual users) and service accounts (for server-to-server interactions)
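For example, granting a predefined role to a group rather than to individual users can be expressed in Terraform (project ID and group address are hypothetical):

```hcl
resource "google_project_iam_member" "data_team_viewer" {
  project = "my-project-id"               # hypothetical project
  role    = "roles/storage.objectViewer"  # predefined role, preferred over primitive roles
  member  = "group:data-team@example.com" # delegate to a group, not individual users
}
```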

Cloud Identity

  • Cloud Identity is an Identity as a Service (IDaaS) solution that helps centrally manage the users and groups.
  • can be configured to federate identities between Google and other identity providers, such as Active Directory and Azure Active Directory
  • Cloud Identity and Google Workspace support Security Assertion Markup Language (SAML) 2.0 for single sign-on  with authentication performed by an external identity provider (IdP)
  • With SAML,  Cloud Identity or Google Workspace acts as a service provider that trusts the SAML IdP to verify a user’s identity on its behalf.
  • Google Cloud Directory Sync – GCDS implements the synchronization process between an external IdP, such as Active Directory, and Cloud Identity or Google Workspace

Cloud Billing

  • Google Cloud Billing defines billing accounts linked to Google Cloud Projects to determine who pays for a given set of Google Cloud resources.
  • To move the project to a different billing account, you must be a billing administrator and the project owner.
  • To link a project to a billing account, you must be a Billing Account Administrator or Billing Account User on the billing account OR Project Billing Manager on the project
  • Cloud Billing budgets can be created to monitor all of the Google Cloud charges in one place and configure alerts
  • supports BigQuery export with detailed Google Cloud billing data (such as usage, cost estimates, and pricing data) automatically throughout the day to a specified BigQuery dataset
  • Google Cloud billing data is not added retroactively to BigQuery, so the data before export is enabled will not be visible.

Terraform Cheat Sheet

  • An open-source, declarative provisioning tool based on the Infrastructure as Code paradigm
  • designed on immutable infrastructure principles
  • Written in Golang and uses its own syntax – HCL (Hashicorp Configuration Language), but also supports JSON
  • Helps to evolve the infrastructure, safely and predictably
  • Applies Graph Theory to IaC and provides Automation, Versioning and Reusability
  • Terraform is a multipurpose composition tool:
    ○ Composes multiple tiers (SaaS/PaaS/IaaS)
    ○ A plugin-based architecture model
  • Terraform is a cloud-agnostic tool: it embraces all major cloud providers and provides a common language to orchestrate the infrastructure resources
  • Terraform is not a configuration management tool; other tools like Chef and Ansible exist in the market for that purpose.

Terraform Architecture

Terraform Providers (Plugins)

  • provide abstraction above the upstream API and are responsible for understanding API interactions and exposing resources.
  • Invoke only upstream APIs for the basic CRUD operations
  • Providers are unaware of anything related to configuration loading, graph theory, etc.
  • supports multiple provider instances using alias, e.g. multiple aws providers with different regions
  • can be integrated with any API using providers framework
  • Most providers configure a specific infrastructure platform (either cloud or self-hosted).
  • can also offer local utilities for tasks like generating random numbers for unique resource names.
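A sketch of multiple instances of the same provider distinguished by alias (AMI ID is hypothetical):

```hcl
provider "aws" {
  region = "us-east-1"   # default aws provider instance
}

provider "aws" {
  alias  = "west"        # second instance of the same provider
  region = "us-west-2"
}

resource "aws_instance" "backup" {
  provider      = aws.west         # explicitly select the aliased instance
  ami           = "ami-0abc1234"   # hypothetical AMI ID
  instance_type = "t3.micro"
}
```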

Terraform Provisioners

  • run code locally or remotely on resource creation
    • local-exec executes commands on the machine running Terraform
    • remote-exec
      • runs on the provisioned resource
      • supports ssh and winrm
      • requires an inline list of commands
  • should be used as a last resort
  • are defined within the resource block.
  • support two types – creation-time and destroy-time provisioners
    • if a creation-time provisioner fails, the resource is marked tainted by default (it will be recreated on the next apply)
    • this behavior can be overridden by setting on_failure to continue, which means ignore the error and continue
    • if a destroy-time provisioner fails, the resources are not removed
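The points above can be sketched in a single resource block (AMI, user, and commands are hypothetical):

```hcl
resource "aws_instance" "web" {
  ami           = "ami-0abc1234"   # hypothetical
  instance_type = "t3.micro"

  connection {
    type = "ssh"
    user = "ec2-user"
    host = self.public_ip
  }

  # creation-time; runs on the machine running terraform
  provisioner "local-exec" {
    command = "echo ${self.private_ip} >> private_ips.txt"
  }

  # creation-time; runs on the provisioned resource over ssh
  provisioner "remote-exec" {
    inline     = ["sudo systemctl enable nginx"]
    on_failure = continue   # default behavior would taint the resource on failure
  }

  # destroy-time provisioner; if it fails, the resource is not removed
  provisioner "local-exec" {
    when    = destroy
    command = "echo 'tearing down'"
  }
}
```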

Terraform Workspaces

  • helps manage multiple distinct sets of infrastructure resources or environments with the same code.
  • just need to create the needed workspaces and switch between them, instead of maintaining a directory for each environment
  • state files for each workspace are stored in the directory terraform.tfstate.d
  • terraform workspace new dev creates a new workspace and switches to it as well
  • terraform workspace select dev helps select workspace
  • terraform workspace list lists the workspaces and shows the current active one with *
  • does not provide strong separation as it uses the same backend
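Inside the configuration, the current workspace is available as terraform.workspace, which lets one codebase vary per environment (AMI and sizing values are hypothetical):

```hcl
resource "aws_instance" "app" {
  ami           = "ami-0abc1234"   # hypothetical
  instance_type = terraform.workspace == "prod" ? "t3.large" : "t3.micro"

  tags = {
    Name = "app-${terraform.workspace}"   # e.g. app-dev when the dev workspace is active
  }
}
```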

Terraform Workflow

init

  • initializes a working directory containing Terraform configuration files.
  • performs
    • backend initialization – storage for the Terraform state file
    • modules installation – modules are downloaded from the Terraform Registry to a local path
    • provider(s) plugins installation – the plugins are downloaded into the .terraform/plugins sub-directory of the present working directory
  • supports -upgrade to update all previously installed plugins to the newest version that complies with the configuration’s version constraints
  • is safe to run multiple times, to bring the working directory up to date with changes in the configuration
  • does not delete the existing configuration or state

validate

  • validates/checks the syntax of the Terraform files.
  • verifies whether a configuration is syntactically valid and internally consistent, regardless of any provided variables or existing state.
  • A syntax check is done on all the terraform files in the directory, and will display an error if any of the files doesn’t validate.

plan

  • creates an execution plan
  • traverses each vertex of the graph and requests each provider using parallelism
  • calculates the difference between the last-known state and the current configuration and presents this difference as the output of the terraform plan operation to the user in their terminal
  • does not modify the infrastructure or state.
  • allows a user to see which actions Terraform will perform prior to making any changes to reach the desired state
  • will scan all *.tf  files in the directory and create the plan
  • will perform refresh for each resource and might hit rate limiting issues as it calls provider APIs
  • the refresh of all resources can be disabled or avoided using
    • -refresh=false, or
    • -target=xxxx, or
    • breaking resources into different directories.
  • supports -out to save the plan

apply

  • applies changes to reach the desired state.
  • scans the current directory for the configuration and applies the changes appropriately.
  • can be provided with an explicit plan, saved using -out from terraform plan
  • If no explicit plan file is given on the command line, terraform apply will create a new plan automatically and prompt for approval to apply it
  • will modify the infrastructure and the state.
  • if a resource successfully creates but fails during provisioning,
    • Terraform will error and mark the resource as “tainted”.
    • A resource that is tainted has been physically created, but can’t be considered safe to use since provisioning failed.
    • Terraform also does not automatically roll back and destroy the resource during the apply when the failure happens, because that would go against the execution plan: the execution plan would’ve said a resource will be created, but does not say it will ever be deleted.
  • does not import any resource.
  • supports -auto-approve to apply the changes without asking for a confirmation
  • supports -target to apply a specific module

refresh

  • used to reconcile the state Terraform knows about (via its state file) with the real-world infrastructure
  • does not modify infrastructure, but does modify the state file

destroy

  • destroys the infrastructure and all its resources
  • modifies both state and infrastructure
  • terraform destroy -target can be used to destroy targeted resources
  • terraform plan -destroy allows creation of destroy plan

import

  • helps import already-existing external resources, not managed by Terraform, into Terraform state and allow it to manage those resources
  • Terraform is not able to auto-generate configurations for imported resources, for now, and requires you to first write the resource definition in Terraform and then import the resource

taint

  • marks a Terraform-managed resource as tainted, forcing it to be destroyed and recreated on the next apply.
  • will not modify infrastructure, but does modify the state file in order to mark a resource as tainted. Infrastructure and state are changed in next apply.
  • can be used to taint a resource within a module

fmt

  • formats the code into a standard, canonical style

console

  • command provides an interactive console for evaluating expressions.

Terraform Modules

  • enables code reuse
  • supports versioning to maintain compatibility
  • stores code remotely
  • enables easier testing
  • enables encapsulation with all the separate resources under one configuration block
  • modules can be nested inside other modules, allowing you to quickly spin up whole separate environments.
  • can be referred using source attribute
  • supports Local and Remote modules
    • Local modules are stored alongside the Terraform configuration (in a separate directory, outside of each environment but in the same repository) with source path ./ or ../
    • Remote modules are stored externally in a separate repository, and supports versioning
  • supports following backends
    • Local paths
    • Terraform Registry
    • GitHub
    • Bitbucket
    • Generic Git, Mercurial repositories
    • HTTP URLs
    • S3 buckets
    • GCS buckets
  • Module requirements
    • must be on GitHub and must be a public repo, if using public registry.
    • must be named terraform-<PROVIDER>-<NAME>, where <NAME> reflects the type of infrastructure the module manages and <PROVIDER> is the main provider where it creates that infrastructure. for e.g. terraform-google-vault or terraform-aws-ec2-instance.
    • must maintain x.y.z tags for releases to identify module versions. Release tag names must be a semantic version, which can optionally be prefixed with a v for example, v1.0.4 and 0.9.2. Tags that don’t look like version numbers are ignored.
    • must maintain a Standard module structure, which allows the registry to inspect the module and generate documentation, track resource usage, parse submodules and examples, and more.
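A sketch of referencing a local module and a registry module via the source attribute (module names and version constraint are hypothetical):

```hcl
# local module, stored alongside the configuration
module "network" {
  source = "./modules/network"
}

# registry module terraform-aws-example, published as example-org/example/aws (hypothetical)
module "example" {
  source  = "example-org/example/aws"
  version = "~> 1.0"   # versioning is supported for remote/registry modules only
}
```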

Terraform Read and write configuration

  • Resources
    • resource are the most important element in the Terraform language that describes one or more infrastructure objects, such as compute instances etc
    • resource type and local name together serve as an identifier for a given resource and must be unique within a module for e.g.  aws_instance.local_name
  • Data Sources
    • data allow data to be fetched or computed for use elsewhere in Terraform configuration
    • allows a Terraform configuration to make use of information defined outside of Terraform, or defined by another separate Terraform configuration
  • Variables
    • variable serve as parameters for a Terraform module and act like function arguments
    • allows aspects of the module to be customized without altering the module’s own source code, and allowing modules to be shared between different configurations
    • can be defined through multiple ways
      • command line, e.g. -var="image_id=ami-abc123"
      • variable definition files .tfvars or .tfvars.json. By default, terraform automatically loads
        • Files named exactly terraform.tfvars or terraform.tfvars.json.
        • Any files with names ending in .auto.tfvars or .auto.tfvars.json
        • file can also be passed with -var-file
      • environment variables can be used to set variables using the format TF_VAR_name
      • Terraform loads variables in the following order, with later sources taking precedence over earlier ones:
        • Environment variables
        • terraform.tfvars file, if present.
        • terraform.tfvars.json file, if present.
        • Any *.auto.tfvars or *.auto.tfvars.json files, processed in lexical order of their filenames.
        • Any -var and -var-file options on the command line, in the order they are provided.
  • Local Values
    • locals assigns a name to an expression, allowing it to be used multiple times within a module without repeating it.
    • are like a function’s temporary local variables.
    • helps to avoid repeating the same values or expressions multiple times in a configuration.
  • Output
    • are like function return values.
    • output can be marked as containing sensitive material using the optional sensitive argument, which prevents Terraform from showing its value in the list of outputs. However, they are still stored in the state as plain text.
    • In a parent module, outputs of child modules are available in expressions as module.<MODULE NAME>.<OUTPUT NAME>.
  • Named Values
    • is an expression that references the associated value for e.g. aws_instance.local_name, data.aws_ami.centos, var.instance_type etc.
    • support Local named values for e.g count.index
  • Dependencies
    • identifies implicit dependencies, as Terraform automatically infers when one resource depends on another by studying the resource attributes used in interpolation expressions, e.g. an aws_eip referencing an aws_instance
    • explicit dependencies can be defined using depends_on where dependencies between resources that are not visible to Terraform
  • Data Types
    • supports primitive data types of
      • string, number and bool
      • Terraform language will automatically convert number and bool values to string values when needed
    • supports complex data types of
      • list – a sequence of values identified by consecutive whole numbers starting with zero.
      • map – a collection of values where each is identified by a string label.
      • set –  a collection of unique values that do not have any secondary identifiers or ordering.
    • supports structural data types of
      • object – a collection of named attributes that each have their own type
      • tuple – a sequence of elements identified by consecutive whole numbers starting with zero, where each element has its own type.
  • Built-in Functions
    • includes a number of built-in functions that can be called from within expressions to transform and combine values for e.g. min, max, file, concat, element, index, lookup etc.
    • does not support user-defined functions
  • Dynamic Blocks
    • acts much like a for expression, but produces nested blocks instead of a complex typed value. It iterates over a given complex value, and generates a nested block for each element of that complex value.
  • Terraform Comments
    • supports three different syntaxes for comments:
      • #
      • //
      • /* and */
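Most of the elements above fit into one small sketch (all names, AMI filters, and ports are hypothetical):

```hcl
variable "instance_type" {   # input variable, like a function argument
  type    = string
  default = "t3.micro"
}

locals {
  common_tags = { Team = "platform" }   # local value: a named, reusable expression
}

data "aws_ami" "centos" {   # data source: fetch information defined outside Terraform
  most_recent = true
  owners      = ["aws-marketplace"]   # hypothetical owner
  filter {
    name   = "name"
    values = ["CentOS*"]
  }
}

resource "aws_instance" "local_name" {
  ami           = data.aws_ami.centos.id   # named value from the data source
  instance_type = var.instance_type        # named value from the variable
  tags          = local.common_tags
}

# implicit dependency: referencing aws_instance.local_name.id orders creation
resource "aws_eip" "ip" {
  instance = aws_instance.local_name.id
}

# dynamic block: generates one ingress block per element of the list
resource "aws_security_group" "sg" {
  name = "example-sg"
  dynamic "ingress" {
    for_each = [80, 443]
    content {
      from_port   = ingress.value
      to_port     = ingress.value
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  }
}

output "public_ip" {   # like a function return value
  value = aws_eip.ip.public_ip
}
```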

Terraform Backends

  • determines how state is loaded and how an operation such as apply is executed
  • are responsible for storing state and providing an API for optional state locking
  • needs to be initialized
  • if the backend is changed after the first setup, Terraform provides a migration option to move the existing state
  • helps
    • collaboration and working as a team, with the state maintained remotely and state locking
    • can provide enhanced security for sensitive data
    • support remote operations
  • supports local vs remote backends
    • local (default) backend stores state in a local JSON file on disk
    • remote backend stores state remotely like S3, OSS, GCS, Consul and support features like remote operation, state locking, encryption, versioning etc.
  • supports partial configuration with remaining configuration arguments provided as part of the initialization process
  • Backend configuration doesn’t support interpolations.
  • GitHub is not a supported backend type in Terraform.
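A sketch of a remote backend with locking and encryption (bucket and table names are hypothetical); note that interpolations are not allowed here:

```hcl
terraform {
  backend "s3" {
    bucket         = "my-tf-state"             # hypothetical bucket
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                      # encrypt state at rest
    dynamodb_table = "tf-locks"                # state locking via DynamoDB
  }
}
```

With partial configuration, any of these arguments can be omitted from the file and supplied during terraform init instead, e.g. via -backend-config options.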

Terraform State Management

  • state helps keep track of the infrastructure Terraform manages
  • stored locally in the terraform.tfstate
  • recommended not to edit the state manually
  • Use terraform state command
    • mv – to move/rename modules
    • rm – to safely remove resource from the state. (destroy/retain like)
    • pull – to observe current remote state
    • list & show – to write/debug modules

State Locking

  • happens for all operations that could write state, if supported by backend
  • prevents others from acquiring the lock & potentially corrupting the state
  • backends which support state locking are
    • azurerm
    • Hashicorp consul
    • Tencent Cloud Object Storage (COS)
    • etcdv3
    • Google Cloud Storage GCS
    • HTTP endpoints
    • Kubernetes Secret with locking done using a Lease resource
    • AliCloud Object Storage OSS with locking via TableStore
    • PostgreSQL
    • AWS S3 with locking via DynamoDB
    • Terraform Enterprise
  • Backends which do not support state locking are
    • artifactory
    • etcd
  • can be disabled for most commands with the -lock=false flag
  • use force-unlock command to manually unlock the state if unlocking failed

State Security

  • can contain sensitive data, depending on the resources in use for e.g passwords and keys
  • using local state, data is stored in plain-text JSON files
  • using remote state, state is held in memory when used by Terraform. It may be encrypted at rest, if supported by backend for e.g. S3, OSS

Terraform Logging

  • debugging can be controlled using TF_LOG, which can be configured for different levels TRACE, DEBUG, INFO, WARN or ERROR, with TRACE being the most verbose.
  • the log path can be controlled using TF_LOG_PATH; TF_LOG needs to be specified for TF_LOG_PATH to take effect.

Terraform Cloud and Terraform Enterprise

  • Terraform Cloud provides Cloud Infrastructure Automation as a Service. It is offered as a multi-tenant SaaS platform and is designed to suit the needs of smaller teams and organizations. Its smaller plans default to one run at a time, which prevents users from executing multiple runs concurrently.
  • Terraform Enterprise is a private install for organizations who prefer to self-manage. It is designed to suit the needs of organizations with specific requirements for security, compliance and custom operations.
  • Terraform Cloud provides features
    • Remote Terraform Execution – supports Remote Operations for Remote Terraform execution which helps provide consistency and visibility for critical provisioning operations.
    • Workspaces – organizes infrastructure with workspaces instead of directories. Each workspace contains everything necessary to manage a given collection of infrastructure, and Terraform uses that content whenever it executes in the context of that workspace.
    • Remote State Management – acts as a remote backend for the Terraform state. State storage is tied to workspaces, which helps keep state associated with the configuration that created it.
    • Version Control Integration – is designed to work directly with the version control system (VCS) provider.
    • Private Module Registry – provides a private and central library of versioned & validated modules to be used within the organization
    • Team based Permission System – can define groups of users that match the organization’s real-world teams and assign them only the permissions they need
    • Sentinel Policies – embeds the Sentinel policy-as-code framework, which lets you define and enforce granular policies for how the organization provisions infrastructure. Helps eliminate provisioned resources that don’t follow security, compliance, or operational policies.
    • Cost Estimation – can display an estimate of its total cost, as well as any change in cost caused by the proposed updates
    • Security – encrypts state at rest and protects it with TLS in transit.
  • Terraform Enterprise features
    • includes all the Terraform Cloud features with
    • Audit – supports detailed audit logging and tracks the identity of the user requesting state and maintains a history of state changes.
    • SSO/SAML – SAML for SSO provides the ability to govern user access to your applications.
  • Terraform Enterprise currently supports running under the following operating systems for a Clustered deployment:
    • Ubuntu 16.04.3 – 16.04.5 / 18.04
    • Red Hat Enterprise Linux 7.4 through 7.7
    • CentOS 7.4 – 7.7
    • Amazon Linux
    • Oracle Linux
    • Clusters currently don’t support other Linux variants.
  • Terraform Cloud currently supports following VCS Provider
    • GitHub.com
    • GitHub.com (OAuth)
    • GitHub Enterprise
    • GitLab.com
    • GitLab EE and CE
    • Bitbucket Cloud
    • Bitbucket Server
    • Azure DevOps Server
    • Azure DevOps Services
  • A Terraform Enterprise install that is provisioned on a network that does not have Internet access is generally known as an air-gapped install. These types of installs require you to pull updates, providers, etc. from external sources vs. being able to download them directly.

AWS Certification – Content Delivery – Cheat Sheet

CloudFront

  • provides low latency and high data transfer speeds for distribution of static, dynamic web or streaming content to web users
  • delivers the content through a worldwide network of data centers called Edge Locations
  • keeps persistent connections with the origin servers so that the files can be fetched from the origin servers as quickly as possible.
  • dramatically reduces the number of network hops that users’ requests must pass through
  • supports multiple origin server options, like AWS hosted service for e.g. S3, EC2, ELB or an on premise server, which stores the original, definitive version of the objects
  • single distribution can have multiple origins and Path pattern in a cache behavior determines which requests are routed to the origin
  • supports Web Download distribution and RTMP Streaming distribution
    • Web distribution supports static, dynamic web content, on demand using progressive download & HLS and live streaming video content
    • RTMP supports streaming of media files using Adobe Media Server and the Adobe Real-Time Messaging Protocol (RTMP) ONLY
  • supports HTTPS using either
    • dedicated IP address, which is expensive as dedicated IP address is assigned to each CloudFront edge location
    • Server Name Indication (SNI), which is free but supported by modern browsers only with the domain name available in the request header
  • For E2E HTTPS connection,
    • Viewers -> CloudFront needs either self signed certificate, or certificate issued by CA or ACM
    • CloudFront -> Origin needs certificate issued by ACM for ELB and by CA for other origins
  •  Security
    • Origin Access Identity (OAI) can be used to restrict the content from S3 origin to be accessible from CloudFront only
    • supports Geo restriction (Geo-Blocking) to whitelist or blacklist countries that can access the content
    • Signed URLs 
      • for RTMP distribution as signed cookies aren’t supported
      • to restrict access to individual files, for e.g., an installation download for your application.
      • users using a client, for e.g. a custom HTTP client, that doesn’t support cookies
    • Signed Cookies
      • provide access to multiple restricted files, for e.g., video part files in HLS format or all of the files in the subscribers’ area of a website.
      • don’t want to change the current URLs
    • integrates with AWS WAF, a web application firewall that helps protect web applications from attacks by allowing rules configured based on IP addresses, HTTP headers, and custom URI strings
  • supports GET, HEAD, OPTIONS, PUT, POST, PATCH, DELETE to get object & object headers, add, update, and delete objects
    • only caches responses to GET and HEAD requests and, optionally, OPTIONS requests
    • does not cache responses to PUT, POST, PATCH, DELETE request methods and these requests are proxied back to the origin
  • object removal from cache
    • would be removed upon expiry (TTL) from the cache, by default 24 hrs
    • can be invalidated explicitly, but has a cost associated, however might continue to see the old version until it expires from those caches
    • objects can be invalidated only for Web distribution
    • change the object name (versioning) to serve a different version
  • supports adding or modifying custom headers before the request is sent to origin which can be used to
    • validate if user is accessing the content from CDN
    • identifying CDN from which the request was forwarded from, in case of multiple CloudFront distribution
    • for viewers not supporting CORS to return the Access-Control-Allow-Origin header for every request
  • supports Partial GET requests using range header to download object in smaller units improving the efficiency of partial downloads and recovery from partially failed transfers
  • supports compression to compress and serve compressed files when viewer requests include Accept-Encoding: gzip in the request header
  • supports different price class to include all regions, to include only least expensive regions and other regions to exclude most expensive regions
  • supports access logs which contain detailed information about every user request for both web and RTMP distribution
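A trimmed Terraform sketch of a web distribution with an S3 origin restricted via OAI and geo whitelisting (bucket name, origin ID, and locations are hypothetical; newer provider versions prefer cache policies over forwarded_values):

```hcl
resource "aws_cloudfront_origin_access_identity" "oai" {
  comment = "restrict the S3 origin to CloudFront only"
}

resource "aws_cloudfront_distribution" "cdn" {
  enabled = true

  origin {
    domain_name = "example-bucket.s3.amazonaws.com"   # hypothetical origin
    origin_id   = "s3Origin"
    s3_origin_config {
      origin_access_identity = aws_cloudfront_origin_access_identity.oai.cloudfront_access_identity_path
    }
  }

  default_cache_behavior {
    allowed_methods        = ["GET", "HEAD", "OPTIONS"]   # only GET/HEAD (optionally OPTIONS) responses are cached
    cached_methods         = ["GET", "HEAD"]
    target_origin_id       = "s3Origin"
    viewer_protocol_policy = "redirect-to-https"
    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }
  }

  restrictions {
    geo_restriction {
      restriction_type = "whitelist"   # geo-blocking: whitelist or blacklist countries
      locations        = ["US", "CA"]
    }
  }

  viewer_certificate {
    cloudfront_default_certificate = true   # SNI with an ACM certificate is the custom-domain option
  }
}
```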

AWS Certification – Analytics Services – Cheat Sheet

Kinesis Data Streams – KDS

  • enables real-time processing of streaming data at massive scale
  • provides ordering of records per shard
  • provides an ability to read and/or replay records in the same order
  • allows multiple applications to consume the same data
  • data is replicated across three data centers within a region
  • data is preserved for 24 hours, by default, and can be extended to 7 days
  • once data is inserted in Kinesis, it can't be deleted (immutability); it only expires
  • streams can be scaled using multiple shards, based on the partition key
  • each shard provides the capacity of 1MB/sec data input and 2MB/sec data output with 1000 PUT requests per second
  • Kinesis vs SQS
    • real-time processing of streaming big data vs reliable, highly scalable hosted queue for storing messages
    • ordered records, as well as the ability to read and/or replay records in the same order vs no guarantee on data ordering (with the standard queues before the FIFO queue feature was released)
    • data storage up to 24 hours, extendable to 7 days vs 1 minute to 14 days, but cleared if deleted by the consumer
    • supports multiple consumers vs single consumer at a time and requires multiple queues to deliver message to multiple consumers
  • Kinesis Producer
    • API
      • PutRecord and PutRecords are synchronous
      • PutRecords uses batching and increases throughput
      • might experience ProvisionedThroughputExceeded exceptions when sending more data than provisioned. Handle using retries with backoff, resharding, or a better-distributed partition key.
    • KPL
      • producer supports synchronous or asynchronous use cases
      • supports inbuilt batching and retry mechanism
    • Kinesis Agent can help monitor log files and send them to KDS
    • supports third party libraries like Spark, Flume, Kafka connect etc.
  • Kinesis Consumers
    • Kinesis SDK
      • Records are polled by consumers from a shard
    • Kinesis Client Library (KCL)
      • Read records from Kinesis produced with the KPL (de-aggregation)
      • supports checkpointing feature to keep track of the application’s state and resume progress using DynamoDB table
      • if KDS application receives provisioned-throughput exceptions, increase the provisioned throughput for the DynamoDB table
    • Kinesis Connector Library – can be replaced using Firehose or Lambda
    • Third party libraries: Spark, Log4J Appenders, Flume, Kafka Connect…
    • Kinesis Firehose, AWS Lambda
    • Kinesis Consumer Enhanced Fan-Out
      • supports  Multiple Consumer applications for the same Stream
      • provides Low Latency ~70ms
      • Higher costs
      • Default limit of 5 consumers using enhanced fan-out per data stream
  • Kinesis Security
    • allows access / authorization control using IAM policies
    • supports Encryption in flight using HTTPS endpoints
    • supports Data encryption at rest either using client side encryption before pushing the data to data streams or server side encryption
    • supports VPC Endpoints to access within VPC
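The shard and retention limits above map to a small Terraform sketch (stream name is hypothetical):

```hcl
resource "aws_kinesis_stream" "events" {
  name             = "events-stream"     # hypothetical
  shard_count      = 2                   # each shard: 1MB/s in, 2MB/s out, 1000 PUTs/s
  retention_period = 168                 # hours; default is 24, extended here to 7 days
  encryption_type  = "KMS"               # server-side encryption at rest
  kms_key_id       = "alias/aws/kinesis"
}
```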

Kinesis Data  Firehose – KDF

  • data transfer solution for delivering real time streaming data to destinations such as S3,  Redshift,  Elasticsearch service, and Splunk.
  • is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration
  • is Near Real Time (min. 60 secs) as it buffers incoming streaming data to a certain size or for a certain period of time before delivering it
  • supports batching, compression, and encryption of the data before loading it, minimizing the amount of storage used at the destination and increasing security
  • supports data compression, minimizing the amount of storage used at the destination. It currently supports GZIP, ZIP, and SNAPPY compression formats. Only GZIP is supported if the data is further loaded to Redshift.
  • supports out of box data transformation as well as custom transformation using Lambda function to transform incoming source data and deliver the transformed data to destinations
  • uses at least once semantics for data delivery.
  • supports multiple producers as datasource, which include Kinesis data stream, KPL, Kinesis Agent, or the Kinesis Data Firehose API using the AWS SDK, CloudWatch Logs, CloudWatch Events, or AWS IoT
  • does NOT support consumers like Spark and KCL
  • supports interface VPC endpoint to keep traffic between the VPC and Kinesis Data Firehose from leaving the Amazon network.
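
The size-or-interval buffering behavior described above can be sketched as a toy model. This is an illustration of the mechanism, not Firehose's implementation; the class name and thresholds are invented, while real Firehose buffer hints for the S3 destination are on the order of 1–128 MB and 60–900 seconds:

```python
import time

class BufferingSketch:
    """Toy model of Firehose-style buffering: accumulate records and
    flush a batch when either the size hint or the interval hint trips."""

    def __init__(self, size_hint_bytes, interval_hint_s):
        self.size_hint = size_hint_bytes
        self.interval_hint = interval_hint_s
        self.buffer = []
        self.buffered_bytes = 0
        self.last_flush = time.monotonic()

    def put_record(self, record):
        """Buffer one record; return the flushed batch when a hint trips."""
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        size_tripped = self.buffered_bytes >= self.size_hint
        time_tripped = time.monotonic() - self.last_flush >= self.interval_hint
        if size_tripped or time_tripped:
            batch, self.buffer = self.buffer, []
            self.buffered_bytes = 0
            self.last_flush = time.monotonic()
            return batch  # in Firehose this is where delivery to S3 etc. happens
        return None
```

The whichever-comes-first flush rule is what makes Firehose near real time rather than real time: a record can wait up to the interval hint before delivery.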

Kinesis Data Streams vs Kinesis Data Firehose

Kinesis Data Analytics

  • helps analyze streaming data, gain actionable insights, and respond to business and customer needs in real time.
  • reduces the complexity of building, managing, and integrating streaming applications with other AWS services
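
A typical streaming aggregation, such as counting events per key over fixed tumbling windows the way a Kinesis Data Analytics SQL query would, can be sketched in plain Python (the function and the event shape are illustrative):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_s):
    """Group (timestamp, key) events into fixed, non-overlapping windows
    of window_s seconds and count occurrences per key, the kind of
    aggregation a tumbling-window streaming SQL query expresses."""
    counts = defaultdict(int)
    for ts, key in events:
        # Align the timestamp down to the start of its window.
        window_start = (ts // window_s) * window_s
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "a"), (5, "a"), (12, "b"), (61, "a")]
print(tumbling_window_counts(events, 60))
# {(0, 'a'): 2, (0, 'b'): 1, (60, 'a'): 1}
```

Tumbling windows partition time into disjoint buckets, so each event contributes to exactly one window.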

Redshift

  • Redshift is a fast, fully managed data warehouse
  • provides a simple and cost-effective solution to analyze all the data using standard SQL and existing Business Intelligence (BI) tools.
  • manages the work needed to set up, operate, and scale a data warehouse, from provisioning the infrastructure capacity to automating ongoing administrative tasks such as backups and patching.
  • automatically monitors your nodes and drives to help you recover from failures.
  • only supports Single-AZ deployments.
  • replicates all the data within the data warehouse cluster when it is loaded and also continuously backs up your data to S3.
  • attempts to maintain at least three copies of your data (the original and replica on the compute nodes and a backup in S3).
  • supports cross-region snapshot replication to another region for disaster recovery
  • Redshift supports four distribution styles: AUTO, EVEN, KEY, and ALL.
    • KEY distribution uses a single column as distribution key (DISTKEY) and helps place matching values on the same node slice
    • EVEN distribution distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column
    • ALL distribution replicates the whole table to every compute node.
    • AUTO distribution lets Redshift assign an optimal distribution style based on the size of the table data
  • Redshift supports Compound and Interleaved sort keys
    • Compound key
      • is made up of all of the columns listed in the sort key definition, in the order they are listed, and is most efficient when query predicates use a prefix of the sort key, i.e., when the query's filters and joins apply conditions on a leading subset of the sort key columns, in order.
    • Interleaved sort key
      • gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
      • Not ideal for monotonically increasing attributes, such as dates or timestamps
  • Column encodings CANNOT be changed once created.
  • supports query queues for Workload Management, in order to manage concurrency and resource planning. It is a best practice to have separate queues for long running resource-intensive queries and fast queries that don’t require big amounts of memory and CPU
  • Supports Enhanced VPC routing
  • Import/Export Data
    • UNLOAD helps copy data from Redshift table to S3
    • COPY command
      • helps copy data from S3 to Redshift
      • also supports EMR, DynamoDB, remote hosts using SSH
      • parallelized and efficient
      • can decrypt data as it is loaded from S3
      • DON’T use multiple concurrent COPY commands to load one table from multiple files, as Redshift is forced to perform a serialized load, which is much slower.
      • supports decompressing data as it is loaded, if the data is compressed.
    • Split the Load Data into Multiple Files
    • Load the data in sort key order to avoid needing to vacuum.
    • Use a Manifest File
      • provides Data consistency, to avoid S3 eventual consistency issues
      • helps specify different S3 locations in a more efficient way than with the use of S3 prefixes.
  • Redshift Spectrum
    • helps query and retrieve structured and semistructured data from files in S3 without having to load the data into Redshift tables
    • Redshift Spectrum external tables are read-only. You can’t COPY or INSERT to an external table.
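
The COPY-with-manifest pattern above can be sketched as follows: the manifest pins an exact file list (avoiding prefix ambiguity and S3 consistency surprises), and a single COPY loads all entries in parallel. Bucket, table, and role names below are placeholders:

```python
import json

# Manifest pinning the exact set of split, compressed load files.
# "mandatory": True makes COPY fail if a listed file is missing.
manifest = {
    "entries": [
        {"url": "s3://example-bucket/load/part-0000.gz", "mandatory": True},
        {"url": "s3://example-bucket/load/part-0001.gz", "mandatory": True},
    ]
}

# One COPY reads the manifest, decompresses the GZIP input, and
# authenticates via an IAM role (MANIFEST, GZIP, IAM_ROLE are
# standard COPY options).
copy_sql = (
    "COPY sales\n"
    "FROM 's3://example-bucket/load/manifest.json'\n"
    "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'\n"
    "MANIFEST GZIP;"
)

print(json.dumps(manifest, indent=2))
print(copy_sql)
```

Splitting the load into multiple files lets each slice ingest a file in parallel, which is why one COPY over many files beats many concurrent COPYs into one table.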

EMR

  • is a web service that utilizes a hosted Hadoop framework running on the web-scale infrastructure of EC2 and S3
  • launches all nodes for a given cluster in the same Availability Zone, which improves performance as it provides a higher data access rate
  • seamlessly supports Reserved, On-Demand and Spot Instances
  • consists of a Master Node for management and Slave nodes, which consist of Core nodes holding data and Task nodes for performing tasks only
  • is fault tolerant for slave node failures and continues job execution if a slave node goes down
  • does not automatically provision another node to take over failed slaves
  • supports Persistent and Transient cluster types
    • Persistent clusters, which continue to run
    • Transient clusters, which terminate once the job steps are completed
  • supports EMRFS, which allows S3 to be used as durable, highly available data storage
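
The master/core/task layout and the transient-cluster behavior above map onto the configuration that boto3's `emr.run_job_flow(**cluster_config)` accepts. The sketch below only builds that configuration; cluster name, instance types, and counts are illustrative, not recommendations:

```python
# Shape of an EMR cluster definition with master, core, and task
# instance groups; task nodes are requested on the Spot market.
cluster_config = {
    "Name": "example-transient-cluster",
    "ReleaseLabel": "emr-6.10.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Transient cluster: terminate once all steps complete.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}
```

Core nodes hold HDFS data, so Spot interruption hurts them; putting only task nodes on Spot is the usual way to use Spot safely with EMR.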


Glue

  • is a fully managed ETL service that automates the time-consuming steps of data preparation for analytics
  • is serverless and supports a pay-as-you-go model.
  • recommends and generates ETL code to transform the source data into target schemas, and runs the ETL jobs on a fully managed, scale-out Apache Spark environment to load your data into its destination.
  • helps setup, orchestrate, and monitor complex data flows.
  • natively supports RDS, Redshift, S3 and databases on EC2 instances.
  • supports server side encryption for data at rest and SSL for data in motion.
  • provides development endpoints to edit, debug, and test the code it generates.
  • AWS Glue Data Catalog
    • is a central repository to store structural and operational metadata for all the data assets.
    • automatically discovers and profiles data, both structured and semi-structured, stored in the data lake on S3, in Redshift, and in other databases
    • provides a unified view of the data that is available for ETL, querying and reporting using services like Athena, EMR, and Redshift Spectrum.
    • Each AWS account has one AWS Glue Data Catalog per region.
  • AWS Glue crawler
    • connects to a data store, progresses through a prioritized list of classifiers to extract the schema of the data and other statistics, and then populates the Glue Data Catalog with this metadata
    • can be scheduled to run periodically so that the metadata is always up-to-date and in-sync with the underlying data.
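
The crawler behavior described above can be expressed as the argument shape boto3's `glue.create_crawler` takes; the name, role ARN, database, S3 path, and schedule below are illustrative placeholders:

```python
# Arguments for glue.create_crawler(**crawler_args): the crawler scans
# the S3 path, runs classifiers to infer each dataset's schema, and
# writes the resulting table definitions into the Glue Data Catalog.
crawler_args = {
    "Name": "example-s3-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "example_db",
    "Targets": {"S3Targets": [{"Path": "s3://example-bucket/raw/"}]},
    # Cron expression: crawl nightly so catalog metadata stays in sync
    # with the underlying data.
    "Schedule": "cron(0 2 * * ? *)",
}
```

Once the crawler has populated the catalog, the same table definitions are immediately queryable from Athena, EMR, and Redshift Spectrum.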

QuickSight

  • is a very fast, easy-to-use, cloud-powered business analytics service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from data, anytime, on any device.
  • delivers fast and responsive query performance by using a robust in-memory engine (SPICE).
    • “SPICE” stands for a Super-fast, Parallel, In-memory Calculation Engine
    • can also be configured to keep the data in SPICE up-to-date as the data in the underlying sources changes.
    • automatically replicates data for high availability and enables QuickSight to scale to support users to perform simultaneous fast interactive analysis across a wide variety of AWS data sources.
  • supports
    • Excel files and flat files like CSV, TSV, CLF, ELF
    • on-premises databases like PostgreSQL, SQL Server and MySQL
    • SaaS applications like Salesforce
    • and AWS data sources such as Redshift, RDS, Aurora, Athena, and S3
  • supports various functions to format and transform the data.
  • supports assorted visualizations that facilitate different analytical approaches:
    • Comparison and distribution – Bar charts (several assorted variants)
    • Changes over time – Line graphs, Area line charts
    • Correlation – Scatter plots, Heat maps
    • Aggregation – Pie graphs, Tree maps
    • Tabular – Pivot tables

Data Pipeline

  • orchestration service that helps define data-driven workflows to automate and schedule regular data movement and data processing activities
  • integrates with on-premises and cloud-based storage systems
  • allows scheduling, retry, and failure logic for the workflows

Elasticsearch

  • Elasticsearch Service is a managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud.
  • Elasticsearch provides
    • real-time, distributed search and analytics engine
    • ability to provision all the resources for Elasticsearch cluster and launches the cluster
    • easy-to-use cluster scaling options. Scaling an Elasticsearch Service domain by adding or modifying instances and storage volumes is an online operation that does not require any downtime.
    • self-healing clusters, which automatically detect and replace failed Elasticsearch nodes, reducing the overhead associated with self-managed infrastructures
    • domain snapshots to back up and restore ES domains and replicate domains across AZs
    • enhanced security with IAM, Network, Domain access policies, and fine-grained access control
    • storage volumes for the data using EBS volumes
    • ability to span cluster nodes across multiple AZs in the same region, known as zone awareness, for high availability and redundancy. Elasticsearch Service automatically distributes the primary and replica shards across instances in different AZs.
    • dedicated master nodes to improve cluster stability
    • data visualization using the Kibana tool
    • integration with CloudWatch for monitoring ES domain metrics
    • integration with CloudTrail for auditing configuration API calls to ES domains
    • integration with S3, Kinesis, and DynamoDB for loading streaming data
    • ability to handle structured and unstructured data
    • supports encryption at rest through KMS, node-to-node encryption over TLS, and the ability to require clients to communicate over HTTPS
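
As an example of the search capability above, here is a minimal Elasticsearch query body, the JSON you would POST to an index's `_search` endpoint on a domain, combining a full-text match with a time-range filter; the field names (`message`, `@timestamp`) are illustrative:

```python
import json

# Bool query: "must" clauses score matching docs (full-text match),
# while "filter" clauses narrow results without affecting scoring.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"message": "error"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],
        }
    },
    "size": 20,  # cap the number of hits returned
}

print(json.dumps(query, indent=2))
```

Keeping the time constraint in `filter` rather than `must` lets Elasticsearch cache it and skip scoring, which matters for log-analytics workloads.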