AWS CloudFront with S3

  • CloudFront can be used to distribute the content from an S3 bucket.
  • For an RTMP distribution, the S3 bucket is the only supported origin, and custom origins cannot be used
  • Using CloudFront over S3 has the following benefits
    • can be more cost-effective if the objects are frequently accessed because, at higher usage, the price for CloudFront data transfer is lower than the price for S3 data transfer.
    • downloads are faster with CloudFront than with S3 alone because the objects are stored closer to the users.
  • CloudFront provides two ways to send authenticated requests to an S3 origin: Origin Access Control (OAC) and Origin Access Identity (OAI).
  • When using S3 as the origin for a distribution and the bucket is moved to a different region, CloudFront can take up to an hour to update its records to include the change of region when both of the following are true:
    • Origin Access Control (OAC) and Origin Access Identity (OAI) are used to restrict access to the bucket.
    • Bucket is moved to an S3 region that requires Signature Version 4 for authentication

Origin Access Identity – OAI

  • By default, S3 origin objects must be granted public read permissions, which makes the objects accessible both directly from S3 and through CloudFront.
  • Even though CloudFront does not expose the underlying S3 URL, users can learn it if it is shared directly or used by applications.
  • When using CloudFront signed URLs or signed cookies to provide access to the objects, it is necessary to prevent users from having direct access to the S3 objects.
  • Users accessing S3 objects directly would
    • bypass the controls provided by CloudFront signed URLs or signed cookies, e.g., control over the date and time after which a user can no longer access the content and over the IP addresses that can be used to access it.
    • make CloudFront access logs less useful because they’re incomplete.
  • Origin Access Identity (OAI) can be used to prevent users from directly accessing objects from S3.
  • Origin access identity, which is a special CloudFront user, can be created and associated with the distribution.
  • S3 bucket/object permissions need to be configured to only provide access to the Origin Access Identity (a bucket policy sketch follows this list).
  • When users access an object through CloudFront, it uses the OAI to fetch the content on their behalf, while direct access to the S3 object remains restricted.
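
As a rough illustration of the OAI setup described above, the sketch below (bucket name and OAI ID are hypothetical) applies a bucket policy that grants s3:GetObject only to the origin access identity, so objects are reachable through CloudFront but not directly from S3.

```python
# Minimal sketch (bucket name and OAI ID are hypothetical): restrict an S3 origin
# so that only the CloudFront Origin Access Identity can read objects.
import json
import boto3

BUCKET = "my-cf-origin-bucket"          # assumed bucket name
OAI_ID = "E2EXAMPLEOAIID"               # assumed Origin Access Identity ID

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowCloudFrontOAIReadOnly",
        "Effect": "Allow",
        "Principal": {
            "AWS": f"arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity {OAI_ID}"
        },
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{BUCKET}/*",
    }],
}

# Apply the policy; with no public-read grants, direct S3 access is denied.
boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```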

Origin Access Control – OAC

  • Origin Access Control – OAC is recommended over Origin Access Identity – OAI and supports
    • Enhanced security practices like short-term credentials, frequent credential rotations, and resource-based policies
    • All S3 buckets in all AWS Regions
    • S3 server-side encryption with AWS KMS (SSE-KMS)
    • Comprehensive HTTP methods support including dynamic requests – OAC supports GET, PUT, POST, PATCH, DELETE, OPTIONS, and HEAD.
  • CloudFront OAC needs to be set up with permissions to access the S3 bucket origin, which can be done after creating a CloudFront distribution, but before adding the OAC to the S3 origin in the distribution configuration (a bucket policy sketch follows this list).
  • For buckets with objects encrypted using server-side encryption with AWS Key Management Service (SSE-KMS), the OAC must be granted permission to use the AWS KMS key.
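
A minimal sketch of the OAC permissions mentioned above, assuming hypothetical account, bucket, and distribution identifiers: the bucket policy allows the CloudFront service principal to read objects, scoped to a single distribution. For SSE-KMS encrypted objects, a similar grant for cloudfront.amazonaws.com with kms:Decrypt would also be needed in the KMS key policy.

```python
# Minimal sketch (account ID, bucket, and distribution ID are hypothetical): an S3
# bucket policy that lets CloudFront reach the origin through an Origin Access
# Control, scoped to a single distribution via the AWS:SourceArn condition.
import json
import boto3

BUCKET = "my-cf-origin-bucket"                                                       # assumed
DISTRIBUTION_ARN = "arn:aws:cloudfront::111122223333:distribution/EDFDVBD6EXAMPLE"   # assumed

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowCloudFrontServicePrincipalReadOnly",
        "Effect": "Allow",
        "Principal": {"Service": "cloudfront.amazonaws.com"},
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{BUCKET}/*",
        "Condition": {"StringEquals": {"AWS:SourceArn": DISTRIBUTION_ARN}},
    }],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```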

CloudFront with S3 Objects

  • CloudFront can be configured to include custom headers or modify existing headers whenever it forwards a request to the origin, to
    • validate that users are not accessing the origin directly, bypassing the CDN (see the sketch after this list)
    • identify the CDN from which the request was forwarded, if more than one CloudFront distribution is configured to use the same origin
    • support viewers that don’t support CORS, by configuring CloudFront to forward the Origin header to the origin, which causes the origin to return the Access-Control-Allow-Origin header for every request
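
One common way to apply the custom-header idea above is to have the origin reject any request that lacks the secret header CloudFront adds. A minimal sketch, assuming a Flask-based origin; the header name and value are hypothetical.

```python
# Minimal sketch (header name and value are hypothetical): an origin-side check that
# rejects requests which did not come through CloudFront, based on a custom origin
# header configured on the distribution.
from flask import Flask, abort, request

app = Flask(__name__)

EXPECTED_HEADER = "X-Origin-Verify"        # assumed custom header name
EXPECTED_VALUE = "change-me-secret-value"  # assumed shared secret; rotate regularly

@app.before_request
def verify_came_from_cloudfront():
    # CloudFront adds this header to every request it forwards to the origin;
    # direct requests to the origin will not carry it.
    if request.headers.get(EXPECTED_HEADER) != EXPECTED_VALUE:
        abort(403)

@app.route("/")
def index():
    return "served via CloudFront"
```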

Adding & Updating Objects

  • Objects just need to be added to the Origin and CloudFront would start distributing them when accessed.
  • For objects served by CloudFront, the Origin can be updated either by
    • Overwriting the original object
    • Creating a different version and updating the links exposed to the user.
  • For updating objects, it is recommended to use versioning, e.g., version the files or entire folders, so links can be changed when the objects are updated, forcing a refresh.
  • With versioning,
    • there is no wait time for an object to expire before CloudFront begins to serve a new version of it.
    • there is no difference in consistency in the object served from the edge
    • there is no cost for object invalidation.

Removing/Invalidating Objects

  • Objects, by default, are removed upon expiry (TTL), and the latest object is then fetched from the Origin.
  • Objects can also be removed from the edge cache before they expire, either by
    • using File or Object Versioning to serve a different version of the object under a different name, or
    • invalidating the object from edge caches, so that for the next request, CloudFront returns to the Origin to fetch the object.
  • Object or File Versioning is recommended over Invalidating objects
    • if the objects need to be updated frequently.
    • enables control over which object a request returns even when the user has a version cached either locally or behind a corporate caching proxy.
    • makes it easier to analyze the results of object changes as CloudFront access logs include the names of the objects
    • provides a way to serve different versions to different users.
    • simplifies rolling forward & back between object revisions.
    • is less expensive, as no charges for invalidating objects.
    • for e.g. change header-v1.jpg to header-v2.jpg
  • Invalidating objects from the cache
    • objects in the cache can be invalidated explicitly before they expire to force a refresh (see the sketch after this list)
    • allows invalidating selected objects
    • allows invalidating multiple objects, e.g., objects in a directory or all objects whose names begin with the same characters, by including the * wildcard at the end of the invalidation path
    • a user with a version cached locally or behind a corporate caching proxy might continue to see the old version until it expires from those caches
    • The first 1,000 invalidation paths submitted per month are free; a fee is charged for each invalidation path over 1,000 in a month.
    • An invalidation path can be for a single object, e.g., /js/ab.js, or for multiple objects, e.g., /js/*, and is counted as a single path even if the * wildcard invalidates thousands of objects.
  • For RTMP distributions, served objects cannot be invalidated.
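
A minimal invalidation sketch with boto3 (distribution ID and paths are hypothetical); the wildcard path counts as a single invalidation path even though it may invalidate many objects.

```python
# Minimal sketch (distribution ID and paths are hypothetical): invalidate objects in
# CloudFront edge caches before they expire.
import time
import boto3

cloudfront = boto3.client("cloudfront")

response = cloudfront.create_invalidation(
    DistributionId="EDFDVBD6EXAMPLE",                 # assumed distribution ID
    InvalidationBatch={
        "Paths": {"Quantity": 2, "Items": ["/js/ab.js", "/images/*"]},
        "CallerReference": str(time.time()),          # unique token to avoid duplicate submissions
    },
)
print(response["Invalidation"]["Id"], response["Invalidation"]["Status"])
```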

Partial Requests (Range GETs)

  • Partial requests using Range headers in a GET request help download the object in smaller units, improving the efficiency of partial downloads and the recovery from partially failed transfers (a request sketch follows this list).
  • For a partial GET range request, CloudFront
    • checks the cache in the edge location for the requested range or the entire object and, if present, serves it immediately
    • if the requested range is not cached, it forwards the request to the origin and may request a larger range than the client requested to optimize performance
    • if the origin supports the Range header, it returns the requested object range and CloudFront returns the same to the viewer
    • if the origin does not support the Range header, it returns the complete object and CloudFront serves the entire object and caches it for future requests.
    • CloudFront then uses the cached complete object to serve future Range GET requests.
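
A quick sketch of a Range GET against a distribution (the URL is hypothetical): a 206 Partial Content response indicates the range was honored.

```python
# Minimal sketch (URL is hypothetical): request only the first kilobyte of an object
# using a Range header; CloudFront/S3 respond with 206 Partial Content when the
# range is honored, or 200 with the full object when it is not.
import requests

url = "https://d111111abcdef8.cloudfront.net/videos/training.mp4"  # assumed distribution URL

resp = requests.get(url, headers={"Range": "bytes=0-1023"}, timeout=10)
print(resp.status_code)                    # 206 if the range was served
print(resp.headers.get("Content-Range"))   # e.g. "bytes 0-1023/52428800"
print(len(resp.content))                   # 1024 bytes for a satisfied range
```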

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. You are building a system to distribute confidential training videos to employees. Using CloudFront, what method could be used to serve content that is stored in S3, but not publicly accessible from S3 directly?
    1. Create an Origin Access Identity (OAI) for CloudFront and grant access to the objects in your S3 bucket to that OAI.
    2. Add the CloudFront account security group “amazon-cf/amazon-cf-sg” to the appropriate S3 bucket policy.
    3. Create an Identity and Access Management (IAM) User for CloudFront and grant access to the objects in your S3 bucket to that IAM User.
    4. Create an S3 bucket policy that lists the CloudFront distribution ID as the Principal and the target bucket as the Amazon Resource Name (ARN).

References

CloudFront Restrict Access to S3

AWS Client VPN

  • AWS Client VPN is a managed client-based VPN service that enables secure access to AWS resources and resources in the on-premises network
  • Client VPN allows accessing the resources from any location using an OpenVPN-based VPN client.
  • Client VPN establishes a secure TLS connection from any location using the OpenVPN client.
  • Client VPN automatically scales to the number of users connecting to the AWS resources and on-premises resources.
  • Client VPN supports client authentication using Active Directory, federated authentication, and certificate-based authentication.
  • Client VPN provides manageability, with the ability to view and terminate active client connections and to view connection logs, which provide details on client connection attempts.

Client VPN Components

  • Client VPN endpoint
    • is the resource that is created and configured to enable and manage client VPN sessions.
    • is the resource where all client VPN sessions are terminated.
  • Target network
    • is the network associated with a Client VPN endpoint.
    • is a subnet from a VPC that enables establishing VPN sessions.
    • Multiple subnets can be associated with the Client VPN endpoint, however, each subnet must belong to a different Availability Zone.
  • Route
    • describes the available destination network routes.
    • Each route in the route table specifies the path for traffic to specific resources or networks.
  • Authorization rules
    • restrict the users who can access a network.
    • helps configure the AD or IdP group that is allowed access. Only users belonging to this group can access the specified network.
  • Client
    • the end user connecting to the Client VPN endpoint to establish a VPN session.
    • needs to download an OpenVPN-based client and use the Client VPN configuration file to establish a VPN session.

Client VPN Authentication & Authorization

  • Client VPN provides authentication and authorization capabilities.
  • Authentication determines whether clients are allowed to connect to the Client VPN endpoint
  • Client VPN offers the following types of client authentication:
    • Active Directory authentication (user-based)
    • Mutual authentication (certificate-based)
    • Single sign-on (SAML-based federated authentication) (user-based)
  • Client VPN supports two types of authorization:
    • Security groups and
    • Network-based authorization (using authorization rules)
      • allows mapping of the Active Directory group or the SAML-based IdP group to the network they can access (see the sketch after this list).
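
A minimal sketch of a network-based authorization rule with boto3 (endpoint ID, CIDR, and group ID are hypothetical), limiting access to members of a specific directory/IdP group.

```python
# Minimal sketch (endpoint ID, CIDR, and group ID are hypothetical): add an
# authorization rule so only members of a specific Active Directory / IdP group
# can reach a target network through the Client VPN endpoint.
import boto3

ec2 = boto3.client("ec2")

ec2.authorize_client_vpn_ingress(
    ClientVpnEndpointId="cvpn-endpoint-0123456789abcdef0",  # assumed endpoint ID
    TargetNetworkCidr="10.0.0.0/16",                        # network the group may access
    AccessGroupId="S-1-5-21-example-group-id",              # assumed AD/IdP group ID
    Description="Allow developers group to reach the VPC",
)
```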

Client VPN Split Tunnel

  • Client VPN endpoint, by default, routes all traffic over the VPN tunnel.
  • A split-tunnel Client VPN endpoint helps when you do not want all user traffic to route through the Client VPN endpoint (a sketch of enabling it follows this list).
  • Split tunnel ensures that only traffic destined for networks matching a route in the Client VPN endpoint route table is routed over the Client VPN tunnel.
  • Split-tunnel offers the following benefits:
    • Optimized routing of traffic from clients by having only the AWS destined traffic traverse the VPN tunnel.
    • Reduced volume of outgoing traffic from AWS, therefore reducing the data transfer cost.
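
A minimal sketch of enabling split tunnel on an existing endpoint with boto3 (the endpoint ID is hypothetical).

```python
# Minimal sketch (endpoint ID is hypothetical): turn an existing Client VPN endpoint
# into a split-tunnel endpoint so only traffic matching its route table uses the tunnel.
import boto3

ec2 = boto3.client("ec2")

ec2.modify_client_vpn_endpoint(
    ClientVpnEndpointId="cvpn-endpoint-0123456789abcdef0",  # assumed endpoint ID
    SplitTunnel=True,
)
```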

Client VPN Limitations

  • Client CIDR ranges cannot overlap with the local CIDR of the VPC in which the associated subnet is located, or any routes manually added to the Client VPN endpoint’s route table.
  • Client CIDR ranges must have a block size between /22 and /12.
  • Client CIDR range cannot be changed after Client VPN endpoint creation.
  • Subnets associated with a Client VPN endpoint must be in the same VPC.
  • Multiple subnets from the same AZ cannot be associated with a Client VPN endpoint.
  • A Client VPN endpoint does not support subnet associations in a dedicated tenancy VPC.
  • Client VPN supports IPv4 traffic only.
  • Client VPN is not Federal Information Processing Standards (FIPS) compliant.
  • As Client VPN is a managed service, the IP address to which the DNS name resolves might change. Hence, it is not recommended to connect to the Client VPN endpoint using IP addresses; use the DNS name instead.
  • IP forwarding is currently disabled when using the AWS Client VPN Desktop Application.

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. A company is developing an application on AWS. For analysis, the application transmits log files to an Amazon Elasticsearch Service (Amazon ES) cluster. Each piece of data must be contained inside a VPC. A number of the company’s developers work remotely. Other developers are based at three distinct business locations. The developers must connect to Amazon ES directly from their local development computers in order to study and display logs. Which solution will satisfy these criteria?
    1. Configure and set up an AWS Client VPN endpoint. Associate the Client VPN endpoint with a subnet in the VPC. Configure a Client VPN self-service portal. Instruct the developers to connect by using the client for Client VPN.
    2. Create a transit gateway, and connect it to the VPC. Create an AWS Site-to-Site VPN. Create an attachment to the transit gateway. Instruct the developers to connect by using an OpenVPN client.
    3. Create a transit gateway, and connect it to the VPC. Order an AWS Direct Connect connection. Set up a public VIF on the Direct Connect connection. Associate the public VIF with the transit gateway. Instruct the developers to connect to the Direct Connect connection.
    4. Create and configure a bastion host in a public subnet of the VPC. Configure the bastion host security group to allow SSH access from the company CIDR ranges. Instruct the developers to connect by using SSH.

References

AWS_Client_VPN

AWS Transit VPC

  • Transit Gateway can be used instead of Transit VPC. AWS Transit Gateway offers the same advantages as a transit VPC, but it is a managed service that scales elastically and is highly available.
  • Transit VPC helps connect multiple, geographically dispersed VPCs and remote networks in order to create a global network transit center.
  • Transit VPC can solve some of the shortcomings of VPC peering by introducing a hub and spoke design for inter-VPC connectivity.
  • A transit VPC simplifies network management and minimizes the number of connections required to connect multiple VPCs and remote networks.
  • Transit VPC allows an easy way to implement shared services or packet inspection/replication in a VPC.
  • Transit VPC can be used to support important use cases
    • Private Networking – build a private network that spans two or more AWS Regions.
    • Shared Connectivity – Multiple VPCs can share connections to data centers, partner networks, and other clouds.
    • Cross-Account AWS Usage – The VPCs and the AWS resources within them can reside in multiple AWS accounts.
  • Transit VPC design helps implement more complex routing rules, such as network address translation between overlapping network ranges, or add additional network-level packet filtering or inspection.

Transit VPC Configuration

  • Transit VPC network consists of a central VPC (the hub VPC) connecting with every other VPC (spoke VPC) through a VPN connection typically leveraging BGP over IPsec.
  • Central VPC contains EC2 instances running software appliances that route incoming traffic to their destinations using the VPN overlay.

Transit VPC Advantages & Disadvantages

  • supports Transitive routing using the overlay VPN network — allowing for a simpler hub and spoke design. Can be used to provide shared services for VPC Endpoints, Direct Connect connection, etc.
  • supports network address translation between overlapping network ranges.
  • supports vendor functionality around advanced security (layer 7 firewall/Intrusion Prevention System (IPS)/Intrusion Detection System (IDS) ) using third-party software on EC2
  • leverages instance-based routing that increases costs while lowering availability and limiting the bandwidth.
  • Customers are responsible for managing the HA and redundancy of EC2 instances running the third-party vendor virtual appliances

Transit VPC High Availability

Transit VPC vs VPC Peering vs Transit Gateway

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated
  • Open to further feedback, discussion and correction.
  1. Under increased cyber security concerns, a company is deploying a near real-time intrusion detection system (IDS) solution. A system must be put in place as soon as possible. The architecture consists of many AWS accounts, and all results must be delivered to a central location. Which solution will meet this requirement, while minimizing downtime and costs?
    1. Deploy a third-party vendor solution to perform deep packet inspection in a transit VPC.
    2. Enable VPC Flow Logs on each VPC. Set up a stream of the flow logs to a central Amazon Elasticsearch cluster.
    3. Enable Amazon Macie on each AWS account and configure central reporting.
    4. Enable Amazon GuardDuty on each account as members of a central account.
  2. Your company has set up a VPN connection between their on-premises infrastructure and AWS. They have multiple VPCs defined. They also need to ensure that all traffic flows through a security VPC from their on-premise infrastructure. How would you architect the solution? (Select TWO)
    1. Create a VPN connection between the On-premise environment and the Security VPC (Transit VPC pattern)
    2. Create a VPN connection between the On-premise environment to all other VPC’s
    3. Create a VPN connection between the Security VPC to all other VPC’s (Transit VPC pattern)
    4. Create a VPC peering connection between the Security VPC and all other VPC’s

References

AWS_Transit_VPC

Let’s Talk About…

Let’s Talk About Cloud Security

Guest post by Dustin Albertson – Manager of Cloud & Applications, Product Management -Veeam.

I want to discuss something that’s important to me, security. Far too often I have discussions with customers and other engineers where they’re discussing an architecture or problem they are running into, and I spot issues with the design or holes in the thought process. One of the best things about the cloud model is also one of its worst traits: it’s “easy.” What I mean by this is that it’s easy to log into AWS and set up an EC2 instance, connect it to the internet and configure basic settings. This usually leads to issues down the road because the basic security or architectural best practices were not followed. Therefore, I want to talk about a few things that everyone should be aware of.

The Well-Architected Framework

AWS has done a great job of creating a framework for its customers to adhere to when planning and deploying workloads in AWS. This framework is called the AWS Well-Architected Framework. The framework has six pillars that help you learn architectural best practices for designing and operating secure, reliable, efficient, cost-effective, and sustainable workloads in the AWS Cloud. The pillars are:

  • Operational Excellence: The ability to support the development and run workloads effectively, gain insight into their operations, and continuously improve supporting processes and procedures to deliver business value.
  • Security: The security pillar describes how to take advantage of cloud technologies to protect data, systems, and assets in a way that can improve your security posture.
  • Reliability: The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle. This paper provides in-depth, best practice guidance for implementing reliable workloads on AWS.
  • Performance Efficiency: The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.
  • Cost Optimization: The ability to run systems to deliver business value at the lowest price point.
  • Sustainability: The ability to continually improve sustainability impacts by reducing energy consumption and increasing efficiency across all components of a workload by maximizing the benefits from the provisioned resources and minimizing the total resources required.

This framework is important to read and understand not only for a customer but for a software vendor or a services provider as well. As a company that provides software in the AWS Marketplace, Veeam must go through a few processes prior to listing in the marketplace. Those processes are what’s called a W.A.R. (Well-Architected Review) and a T.F.R. (Technical Foundation Review). A W.A.R. is a deep dive into the product and APIs to make sure that best practices are being followed in the way the products interact with the APIs in AWS and also in how the software is deployed and the architecture it uses. The T.F.R. is a review to validate that all the appropriate documentation and help guides are in place so that a customer can easily find out how to deploy, protect, secure, and obtain support when using a product deployed via the AWS Marketplace. This can give customers peace of mind when deploying software from the marketplace because they’ll know that it has been rigorously tested and validated.

I have mostly been talking at a high level here and want to break this down into a real-world example. Veeam has a product in the AWS Marketplace called Veeam Backup for AWS. One of the best practices for this product is to deploy it into a separate AWS account than your production account.

The reason for this is that the software will reach into the production account and back up the instances you wish to protect into an isolated protection account where you can limit the number of people who have access. It’s also a best practice to have your backup data stored away from production data. Now here is where the story gets interesting: a lot of people like to use encryption on their EBS volumes. But since it’s so easy to enable encryption, most people just turn it on and move on. The root of the issue is that AWS has made it easy to encrypt a volume since they have a default key that you choose when creating an instance.

They have also made it easy to set a policy that every new volume is encrypted and the default choice is the default key.

This is where the problem begins. Now, this may be fine for now or for a lot of users, but what this does is create issues later down the road. Default encryption keys cannot be shared outside of the account that the key resides in. This means that you would not be able to back that instance up to another account, you can’t rotate the keys, you can’t delete the keys, you can’t audit the keys, and more. Customer managed keys (CMK) give you the ability to create, rotate, disable, enable, and audit the encryption key used to protect the data. I don’t want to go too deep here, but this is an example that I run into a lot, and people don’t realize the impact of this setting until it’s too late. Changing from a default key to a CMK requires downtime of the instance and is a very manual process; although it can be scripted, it can still be a very cumbersome task if we are talking about hundreds to thousands of instances.

Don’t just take my word for it, Trend Micro also lists this as a Medium Risk.

Aqua Vulnerability Database also lists this as a threat.

Conclusion

I am not trying to scare people or shame people for not knowing this information. A lot of the time in the field, we are so busy that we just get things working and move on. My goal here is to get you to stop for a second and think about whether the choices you are making are the best ones for your security. Take advantage of the resources and help that companies like AWS and Veeam are offering and learn about data protection and security best practices. Take a step back from time to time and evaluate the architecture or design that you are implementing. Get a second set of eyes on the project. It may sound complicated or confusing, but I promise it’s not that hard and the best bet is to just ask others. Also, don’t forget to check the “Choose Your Cloud Adventure” interactive e-book to learn how to manage your AWS data like a hero.

Thank you for reading.

Google Cloud – Professional Cloud DevOps Engineer Certification learning path

Continuing on the Google Cloud Journey, glad to have passed the 8th certification with the Professional Cloud DevOps Engineer certification. Google Cloud – Professional Cloud DevOps Engineer certification exam focuses on almost all of the Google Cloud DevOps services with Cloud Developer tools, Operations Suite, and SRE concepts.

Google Cloud – Professional Cloud DevOps Engineer Certification Summary

  • Had 50 questions to be answered in 2 hours.
  • Covers a wide range of Google Cloud services mainly focusing on DevOps toolset including Cloud Developer tools, Operations Suite with a focus on monitoring and logging, and SRE concepts.
  • The exam has been updated to use
    • Cloud Operations, Cloud Monitoring & Logging and does not refer to Stackdriver in any of the questions.
    • Artifact Registry instead of Container Registry.
  • There are no case studies for the exam.
  • As mentioned for all the exams, hands-on is a MUST; if you have not worked on GCP before, make sure you do lots of labs, else you would be absolutely clueless about some of the questions and commands.
  • I did Coursera and ACloud Guru, which are really vast, but hands-on or practical knowledge is a MUST.

Google Cloud – Professional Cloud DevOps Engineer Certification Resources

Google Cloud – Professional Cloud DevOps Engineer Certification Topics

Developer Tools

  • Google Cloud Build
    • Cloud Build integrates with Cloud Source Repositories, GitHub, and GitLab and can be used for Continuous Integration and Deployments.
    • Cloud Build can import source code, execute build to the specifications, and produce artifacts such as Docker containers or Java archives
    • Cloud Build can trigger builds on source commits in Cloud Source Repositories or other git repositories.
    • Cloud Build build config file specifies the instructions to perform, with steps defined for each task like test, build, and deploy.
    • Cloud Build step specifies an action to be performed and is run in a Docker container.
    • Cloud Build supports custom images as well for the steps
    • Cloud Build integrates with Pub/Sub to publish messages on build’s state changes.
    • Cloud Build can trigger the Spinnaker pipeline through Cloud Pub/Sub notifications.
    • Cloud Build should use a Service Account with a Container Developer role to perform deployments on GKE
    • Cloud Build uses a directory named /workspace as a working directory and the assets produced by one step can be passed to the next one via the persistence of the /workspace directory.
  • Binary Authorization and Vulnerability Scanning
    • Binary Authorization provides software supply-chain security for container-based applications. It enables you to configure a policy that the service enforces when an attempt is made to deploy a container image on one of the supported container-based platforms.
    • Binary Authorization uses attestations to verify that an image was built by a specific build system or continuous integration (CI) pipeline.
    • Vulnerability scanning helps scan images for vulnerabilities by Container Analysis.
    • Hint: For Security and compliance reasons if the image deployed needs to be trusted, use Binary Authorization
  • Google Source Repositories
    • Cloud Source Repositories are fully-featured, private Git repositories hosted on Google Cloud.
    • Cloud Source Repositories can be used for collaborative, version-controlled development of any app or service
    • Hint: If the code needs to be versioned controlled and needs collaboration with multiple members, choose Git related options
  • Google Container Registry/Artifact Registry
    • Google Artifact Registry supports all types of artifacts as compared to Container Registry which was limited to container images
    • Container Registry is not referred to in the exam
    • Artifact Registry supports both regional and multi-regional repositories
  • Google Cloud Code
    • Cloud Code helps write, debug, and deploy the cloud-based applications for IntelliJ, VS Code, or in the browser.
  • Google Cloud Client Libraries
    • Google Cloud Client Libraries provide client libraries and SDKs in various languages for calling Google Cloud APIs.
    • If the language is not supported, Cloud Rest APIs can be used.
  • Deployment Techniques
    • Recreate deployment – fully scale down the existing application version before you scale up the new application version.
    • Rolling update – update a subset of running application instances instead of simultaneously updating every application instance
    • Blue/Green deployment – (also known as a red/black deployment), you perform two identical deployments of your application
    • GKE supports Rolling and Recreate deployments.
      • Rolling deployments support maxSurge (extra Pods that can be created) and maxUnavailable (existing Pods that can be unavailable); a manifest sketch follows this list.
    • Managed Instance Groups support rolling deployments using the maxSurge (new instances created) and maxUnavailable (existing instances that can be taken offline) configurations.
  • Testing Strategies
    • Canary testing – partially roll out a change and then evaluate its performance against a baseline deployment
    • A/B testing – test a hypothesis by using variant implementations. A/B testing is used to make business decisions (not only predictions) based on the results derived from data.
  • Spinnaker
    • Spinnaker supports Blue/Green rollouts by dynamically enabling and disabling traffic to a particular Kubernetes resource.
    • Spinnaker recommends comparing the canary against an equivalent baseline deployed at the same time, instead of against the existing production deployment.
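
A minimal sketch of the rolling-update settings mentioned under Deployment Techniques above, expressed as a Kubernetes Deployment manifest in Python dict form (names and image are hypothetical).

```python
# Minimal sketch (names and image are hypothetical): a Kubernetes Deployment using the
# RollingUpdate strategy with maxSurge and maxUnavailable, as a plain Python dict
# (e.g. suitable for yaml.safe_dump or the Kubernetes Python client).
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {
        "replicas": 4,
        "selector": {"matchLabels": {"app": "web"}},
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {
                "maxSurge": 1,        # at most one extra Pod created during the rollout
                "maxUnavailable": 1,  # at most one existing Pod unavailable at a time
            },
        },
        "template": {
            "metadata": {"labels": {"app": "web"}},
            "spec": {"containers": [{"name": "web", "image": "nginx:1.25"}]},
        },
    },
}
```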

Cloud Operations Suite

  • Cloud Operations Suite provides everything from monitoring, alerting, error reporting, metrics, and diagnostics to debugging and tracing.
  • Google Cloud Monitoring or Stackdriver Monitoring
    • Cloud Monitoring helps gain visibility into the performance, availability, and health of your applications and infrastructure.
    • Cloud Monitoring Agent/Ops Agent helps capture additional metrics like Memory utilization, Disk IOPS, etc.
    • Cloud Monitoring supports log exports where the logs can be sunk to Cloud Storage, Pub/Sub, BigQuery, or an external destination like Splunk.
    • Cloud Monitoring API supports push or export custom metrics
    • Uptime checks help check if the resource responds. It can check the availability of any public service on VM, App Engine, URL, GKE, or AWS Load Balancer.
    • Process health checks can be used to check if any process is healthy
  • Google Cloud Logging or Stackdriver logging
    • Cloud Logging provides real-time log management and analysis
    • Cloud Logging allows ingestion of custom log data from any source
    • Logs can be exported by configuring log sinks to BigQuery, Cloud Storage, or Pub/Sub.
    • Cloud Logging Agent can be installed for logging and capturing application logs.
    • Cloud Logging Agent uses fluentd, and fluentd filters can be applied to filter and modify logs before they are pushed to Cloud Logging.
    • VPC Flow Logs helps record network flows sent from and received by VM instances.
    • Cloud Logging Log-based metrics can be used to create alerts on logs.
    • Hint: If the logs from VM do not appear on Cloud Logging, check if the agent is installed and running and it has proper permissions to write the logs to Cloud Logging.
  • Cloud Error Reporting
    • counts, analyzes and aggregates the crashes in the running cloud services
  • Cloud Profiler
    • Cloud Profiler allows for monitoring of system resources like CPU and memory on both GCP and on-premises resources.
  • Cloud Trace
    • is a distributed tracing system that collects latency data from the applications and displays it in the Google Cloud Console.
  • Cloud Debugger
    • is a feature of Google Cloud that lets you inspect the state of a running application in real-time, without stopping or slowing it down
    • Debug Logpoints allow logging injection into running services without restarting or interfering with the normal function of the service
    • Debug Snapshots help capture local variables and the call stack at a specific line location in your app’s source code

Compute Services

  • Compute services like Google Compute Engine and Google Kubernetes Engine are lightly covered more from the security aspects
  • Google Compute Engine
    • Google Compute Engine is the best IaaS option for computing and provides fine-grained control
    • Preemptible VMs and their use cases. HINT – use for short term needs
    • Committed Use Discounts – CUDs help provide cost benefits for long-term stable and predictable usage.
    • Managed Instance Group can help scale VMs as per the demand. It also helps provide auto-healing and high availability with health checks, in case an application fails.
  • Google Kubernetes Engine
    • GKE can be scaled using
      • Cluster AutoScaler to scale the cluster
      • Vertical Pod Autoscaler to scale the pods with increasing resource needs
      • Horizontal Pod Autoscaler helps scale Kubernetes workload by automatically increasing or decreasing the number of Pods in response to the workload’s CPU or memory consumption, or in response to custom metrics reported from within Kubernetes or external metrics from sources outside of your cluster.
    • Kubernetes Secrets can be used to store secrets (although they are just base64 encoded values)
    • Kubernetes supports rolling and recreate deployment strategies.

Security

  • Cloud Key Management Service – KMS
    • Cloud KMS can be used to store keys to encrypt data in Cloud Storage and other integrated storage
  • Cloud Secret Manager
    • Cloud Secret Manager can be used to store secrets as well

Site Reliability Engineering – SRE

  • SRE is a DevOps implementation and focuses on increasing reliability and observability, collaboration, and reducing toil using automation.
  • SLOs help specify a target level for the reliability of your service using SLIs which provide actual measurements.
  •  SLI Types
    • Availability
    • Freshness
    • Latency
    • Quality
  • SLOs – Choosing the measurement method
    • Synthetic clients to measure user experience
    • Client-side instrumentation
    • Application and Infrastructure metrics
    • Logs processing
  • SLOs help define the Error Budget and Error Budget Policy, which need to be aligned with all the stakeholders and help plan releases balancing features vs. reliability.
  • SRE focuses on Reducing Toil – Identifying repetitive tasks and automating them.
  • Production Readiness Review – PRR
    • Applications should be performance tested for volumes before being deployed to production
    • SLOs should not be modified/adjusted to facilitate production deployments. Teams should work to make the applications SLO compliant before they are deployed to production.
  • SRE Practices include
    • Incident Management and Response
      • Priority should be to mitigate the issue, and then investigate and find the root cause. Mitigating would include
        • Rolling back the release causing the issue
        • Routing traffic to a working site to restore the user experience
      • Incident Live State Document helps track the events and decision making which can be useful for postmortem.
      • involves the following roles
        • Incident Commander/Manager
          • Setup a communication channel for all to collaborate
          • Assign and delegate roles. IC would assume any role, if not delegated.
          • Responsible for Incident Live State Document
        • Communications Lead
          • Provide periodic updates to all the stakeholders and customers
        • Operations Lead
          • Responds to the incident and should be the only group modifying the system during an incident.
    • Postmortem
      • should contain the root cause
      • should be Blameless
      • should be shared with all for collaboration and feedback
      • should be shared with all the stakeholders
      • should have proper action items to prevent recurrence with an owner and collaborators, if required.

All the Best !!

SRE – Site Reliability Engineering Best Practices

SRE implements DevOps. The goal of SRE is to accelerate product development teams and keep services running in a reliable and continuous way.

SRE Concepts

  • Remove Silos and help increase sharing and collaboration between the Development and Operations team
  • Accidents are Normal. It is more profitable to focus on speeding recovery than preventing accidents.
  • Focus on small and gradual changes. This strategy, coupled with automatic testing of smaller changes and reliable rollback of bad changes, leads to approaches to change management like CI/CD.
  • Measurement is Crucial.

SRE Foundations

  • SLIs, SLOs, and SLAs
  • Monitoring
  • Alerting
  • Toil reduction
  • Simplicity

SLI, SLO, and SLAs

  • SRE does not attempt to give everything 100% availability
  • SLIs – Service Level Indicators
    • “A carefully defined quantitative measure of some aspect of the level of service that is provided”
    • SLIs define what to measure
    • SLIs are metrics over time – specific to a user journey such as request/response, data processing – which shows how well the service is performing.
    • An SLI is the ratio between two numbers: the good and the total:
      • Success Rate = No. of successful HTTP request / total HTTP requests
      • Throughput Rate = No. of consumed jobs in a queue / total number of jobs in a queue
    • An SLI is divided into specification and implementation, e.g.
      • Specification: the ratio of requests loaded in < 100 ms
      • Implementation: a way to measure it, e.g., based on a) server logs or b) client code instrumentation
    • SLI ranges from 0% to 100%, where 0% means nothing works, and 100% means nothing is broken
    • Types of SLIs
      • Availability – The proportion of requests which result in a successful state
      • Latency – The proportion of requests below some time threshold
      • Freshness – The proportion of data transferred to some time threshold. Replication or Data pipeline
      • Correctness – The proportion of input that produces correct output
      • Durability – The proportion of records written that can be successfully read
  • SLO – Service Level Objective
    • “SLOs specify a target level for the reliability of your service.”
    • SLO is a goal that the service provider wants to reach.
    • SLOs are tools to help determine what engineering work to prioritize.
    • SLO is a target percentage based on SLIs and can be a single target value or range of values for e.g. SLI <= SLO or (lower bound <= SLI <= upper bound) = SLO
    • SLOs also define the concept of error budget.
    • The Product and SRE team should select an appropriate availability target for the service and its user base, and the service is managed to that SLO.
  • Error Budget
    • Error budgets are a tool for balancing reliability with other engineering work, and a great way to decide which projects will have the most impact.
    • An error budget is 100% minus the SLO (a small calculation sketch follows this list).
    • If an Error budget is exhausted, a team can declare an emergency with high-level approval to deprioritize all external demands until the service meets SLOs and exit criteria.
  • SLOs & Error budget approach
    • SLOs are agreed and approved by all stakeholders
    • It is possible to meet the SLO under normal conditions
    • The organization is committed to using the error budget for decision making and prioritizing
    • Error budget policy should cover the policy if the error budget is exhausted.
  • SLO and SLI in practice
    • The strategy to implement SLO, SLI in the company is to start small.
    • Consider the following aspects when working on the first SLO.
      • Choose one application for which you want to define SLOs
      • Decide on a few key SLIs specs that matter to your service and users
      • Consider common ways and tasks through which your users interact with the service
      • Draw a high-level architecture diagram of your system
      • Show key components, the request flow, and the data flow
    • The result is a narrow and focused proof of concept that would help to make the benefits of SLO, SLI concise and clear.
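
A small sketch of how SLI, SLO, and error budget relate, using assumed numbers:

```python
# Minimal sketch with assumed numbers: relate an SLI measurement to an SLO target
# and the resulting error budget.
slo = 0.999                      # 99.9% availability target
good_requests = 999_452
total_requests = 1_000_000

sli = good_requests / total_requests          # measured availability
error_budget = 1 - slo                        # allowed failure fraction (0.1%)
budget_consumed = (1 - sli) / error_budget    # fraction of the budget already spent

minutes_in_month = 30 * 24 * 60
print(f"SLI: {sli:.4%}")
print(f"Error budget: {error_budget:.3%} ~ {error_budget * minutes_in_month:.0f} minutes/month")
print(f"Budget consumed: {budget_consumed:.1%}")
```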

Monitoring

  • Monitoring allows you to gain visibility into a system, which is a core requirement for judging service health and diagnosing your service when things go wrong
  • from an SRE perspective, monitoring is used to
    • Alert on conditions that require attention
    • Investigate and diagnose issues
    • Display information about the system visually
    • Gain insight into system health and resource usage for long-term planning
    • Compare the behavior of the system before and after a change, or between two control groups
  • Monitoring features that might be relevant
    • Speed of data retrieval and freshness of data.
    • Data retention and calculations
    • Interfaces: graphs, tables, charts. High level or low level.
    • Alerts: multiple categories, notifications flow, suppress functionality.
  • Monitoring sources
    • Metrics are numerical measurements representing attributes and events, typically harvested via many data points at regular time intervals.
    • Logs are an append-only record of events.

Alerting

  • Alerting helps ensure alerts are triggered for a significant event, an event that consumes a large fraction of the error budget.
  • Alerting should be configured to notify an on-caller only when there are actionable, specific threats to the error budget.
  • Alerting considerations
    • Precision – The proportion of events detected that were significant.
    • Recall – The proportion of significant events detected.
    • Detection time – How long it takes to send notification in various conditions. Long detection time negatively impacts the error budget.
    • Reset time – How long alerts fire after an issue is resolved
  • Ways to alerts
    • The recommendation is to combine several strategies to enhance your alert quality from different directions.
    • Target error rate ≥ SLO threshold.
      • Choose a small time window (for example, 10 minutes) and alert if the error rate over that window exceeds the SLO.
      • Upsides: Short detection time, Fast recall time
      • Downsides: Precision is low
    • Increased Alert Windows.
      • By increasing the window size, you spend a higher budget amount before triggering an alert, e.g., alert if an event consumes 5% of the 30-day error budget within a 36-hour window.
      • Upsides: good detection time, better precision
      • Downside: poor reset time
    • Increment Alert Duration.
      • For how long alert should be triggered to be significant.
      • Upsides: Higher precision.
      • Downside: poor recall and poor detection time
    • Alert on Burn Rate.
      • How fast, relative to SLO, the service consumes an error budget.
      • Example: 5% error budget over 1 hour period.
      • Upside: Good precision, short time window, good detection time.
      • Downside: low recall, long reset time
    • Multiple Burn Rate Alerts.
      • Burn rate is how fast, relative to the SLO, the service consumes the error budget (a calculation sketch follows this list)
      • Depending on the burn rate, determine the severity of the alert, which leads to a page notification or a ticket
      • Upsides: good recall, good precision
      • Downsides: More parameters to manage, long reset time
    • Multi-window, multi burn alerts.
      • Upsides: Flexible alert framework, good precision, good recall
      • Downside: even harder to manage, lots of parameters to specify
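
A small sketch of a multi-window burn-rate check, using assumed error rates and the commonly cited 14.4 threshold for a 1-hour window:

```python
# Minimal sketch with assumed numbers: burn rate is how fast the service consumes
# its error budget relative to the SLO; a multi-window alert fires only when both
# the long and short windows exceed the chosen burn-rate threshold.
slo = 0.999
budget_fraction = 1 - slo                 # 0.1% of requests may fail

def burn_rate(error_rate: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    return error_rate / budget_fraction

long_window_error_rate = 0.016            # errors over the last 1 hour (assumed)
short_window_error_rate = 0.020           # errors over the last 5 minutes (assumed)

THRESHOLD = 14.4                          # burning ~2% of a 30-day budget per hour

if burn_rate(long_window_error_rate) > THRESHOLD and burn_rate(short_window_error_rate) > THRESHOLD:
    print("page the on-caller")           # actionable, significant budget spend
else:
    print("no page")
```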

Toil Reduction

It’s better to fix root causes when possible. If I fixed the symptom, there would be no incentive to fix the root cause.

  • Toil is a repetitive, predictable, constant stream of tasks related to maintaining a service.
  • Any time spent on operational tasks means time not spent on project work and project work is how we make our services more reliable and scalable.
  • Toil can be defined using following characteristics
    • Manual. When the tmp directory on a web server reaches 95% utilization, you need to log in and find space to clean up
    • Repetitive. A full tmp directory is unlikely to be a one-time event
    • Automatable. If the instructions are well defined then it’s better to automate the problem detection and remediation
    • Reactive. When you receive too many alerts of “disks full”, they distract more than help. So, potentially high-severity alerts could be missed
    • Lacks enduring value. The satisfaction of completing the task is short-term because it does not produce a lasting improvement to the service
    • Grow at least as fast as its source. The growing popularity of the service will require more infrastructure and more toil work
  • Potential benefits of toil automation
    • Engineering work might reduce toil in the future
    • Increased team morale and reduced burnout
    • Less context switching for interrupts, which raises team productivity
    • Increased process clarity and standardization
    • Enhanced technical skills and career growth for team members
    • Reduced training time
    • Fewer outages attributable to human errors
    • Improved security
    • Shorter response times for user requests
  • Toil Measurement
    • Identify it.
    • Measure the amount of human effort applied to this toil
    • Track these measurements before, during, and after toil reduction efforts
  • Toil categorization
    • Business processes. A most common source of toil.
    • Production interrupts. The key tasks to keep the system running.
    • Product releases. Depending on the tooling and release size they could generate toil (release requests, rollbacks, hotfixes, and repetitive manual configuration changes)
    • Migrations. Large-scale migration or even small database structure change is likely done manually as a one-time effort. Such thinking is a mistake because this work is repetitive.
    • Cost engineering and capacity planning. Ensure a cost-effective baseline. Prepare for critical high traffic events.
    • Troubleshooting
  • Toil management strategies in practices
    • Identify and measure
    • Engineer toil out of the system
    • Reject the toil
    • Use SLO to reduce toil
    • Organizational:
      • Start with human-backed interfaces. For complex business problems, start with a partially automated approach.
      • Get support from management and colleagues. Toil reduction is a worthwhile goal.
      • Promote toil reduction as a feature. Create strong business case for toil reduction.
      • Start small and then improve
    • Standardization and automation:
      • Increase uniformity. Lean toward standard tools, equipment, and processes.
      • Assess risk within automation. Automation with admin-level privileges should have a safety mechanism that checks automation actions against the system; this prevents outages caused by bugs in automation tools.
      • Automate toil response. Think about how to approach toil automation. It shouldn’t eliminate human understanding of what’s going on.
      • Use open-source and third-party tools.
    • Use feedback to improve. Seek feedback from users who interact with your tools, workflows, and automation.

Simplicity

  • Simple software breaks less often and is easier and faster to fix when it does break.
  • Simple systems are easier to understand, easier to maintain, and easier to test
  • Measure complexity
    • Training time. How long it takes for a new engineer to get up to full speed.
    • Explanation time. The time it takes to provide a view on system internals.
    • Administrative diversity. How many ways are there to configure similar settings
    • Diversity of deployed configuration
    • Age. How old is the system
  • SRE work on simplicity
    • SREs understand the system as a whole to prevent and fix sources of complexity.
    • SREs should be involved in design, system architecture, configuration, deployment processes, or elsewhere.
    • SRE leadership empowers SRE teams to push for simplicity and to explicitly reward these efforts.

SRE Practices

SRE practices apply software engineering solutions to operational problems

  • SRE teams are responsible for the day-to-day functioning of the systems they support, so their engineering work often focuses on the operational aspects of those systems.

Incident Management & Response

  • Incident Management involves coordinating the efforts of responding teams in an efficient manner and ensuring that communication flows both between the responders and those interested in the incident’s progress.
  • Incident management is to respond to an incident in a structured way.
  • Incident Response involves mitigating the impact and/or restoring the service to its previous condition.
  • Basic principles of incident response include the following:
    • Maintain a clear line of command.
    • Designate clearly defined roles.
    • Keep a working record of debugging and mitigation as you go.
    • Declare incidents early and often.
  • Key roles in an Incident Response
    • Incident Commander (IC)
      • the person who declares the incident typically steps into the IC role and directs the high-level state of the incident
      • Commands and coordinates the incident response, delegating roles as needed.
      • By default, the IC assumes all roles that have not been delegated yet.
      • Communicates effectively.
      • Stays in control of the incident response.
      • Works with other responders to resolve the incident.
          • Removes roadblocks that prevent Ops from working most effectively.
    • Communications Lead (CL)
      • CL is the public face of the incident response team.
      • The CL’s main duties include providing periodic updates to the incident response team and stakeholders and managing inquiries about the incident.
    • Operations or Ops Lead (OL)
      • OL works to respond to the incident by applying operational tools to mitigate or resolve the incident.
      • The operations team should be the only group modifying the system during an incident.
  • Live Incident State Document
    • Live Incident State Document can live in a wiki, but should ideally be editable by several people concurrently.
    • This living doc can be messy, but must be functional; it is not usually shared with stakeholders.
    • The Live Incident State Document is the Incident Commander’s most important responsibility.
    • Using a template makes generating this documentation easier, and keeping the most important information at the top makes it more usable.
    • Retain this documentation for postmortem analysis and, if necessary, meta analysis.
  • Incident Management Best Practices
    • Prioritize – Stop the bleeding, restore service, and preserve the evidence for root-causing.
    • Prepare – Develop and document your incident management procedures in advance, in consultation with incident participants.
    • Trust – Give full autonomy within the assigned role to all incident participants.
    • Introspect – Pay attention to your emotional state while responding to an incident. If you start to feel panicky or overwhelmed, solicit more support.
    • Consider alternatives – Periodically consider your options and re-evaluate whether it still makes sense to continue what you’re doing or whether you should be taking another tack in incident response.
    • Practice – Use the process routinely so it becomes second nature.
    • Change it around – Were you incident commander last time? Take on a different role this time. Encourage every team member to acquire familiarity with each role.

Postmortem

  • A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.
  • Postmortems are expected after any significant undesirable event.
  • The primary goals of writing a postmortem are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.
  • Writing a postmortem is not punishment – it is a learning opportunity for the entire company.
  • Postmortem Best Practices
    • Blameless
      • Postmortems should be Blameless.
      • It must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior.
    • Collaborate and Share Knowledge
      • Postmortems should be used to collaborate and share knowledge. They should be shared broadly, typically with the larger engineering team or on an internal mailing list.
      • The goal should be to share postmortems to the widest possible audience that would benefit from the knowledge or lessons imparted.
    • No Postmortem Left Unreviewed
      • An unreviewed postmortem might as well never have existed.
    • Ownership
      • Declaring official ownership results in accountability, which leads to action.
      • It’s better to have a single owner and multiple collaborators.

Conclusions

  • SRE practices require a significant amount of time and skilled SRE people to implement well
  • A lot of tools are involved in day-to-day SRE work
  • SRE processes are one of the keys to the success of a tech company

AWS Key Management Service – KMS

  • AWS Key Management Service – KMS is a managed encryption service that allows the creation and control of encryption keys to enable data encryption.
  • provides a highly available key storage, management, and auditing solution to encrypt the data across AWS services & within applications.
  • uses hardware security modules (HSMs) to protect and validate the keys by the FIPS 140-2 Cryptographic Module Validation Program.
  • seamlessly integrates with several AWS services to make encrypting data in those services easy.
  • is integrated with AWS CloudTrail to provide encryption key usage logs to help meet auditing, regulatory, and compliance needs.
  • is regional and keys are only stored and used in the region in which they are created. They cannot be transferred to another region.
  • enforces usage and management policies, to control which IAM user, role from the account, or other accounts can manage and use keys.
  • can create and manage keys by
    • Create, edit, and view symmetric and asymmetric keys, including HMAC keys.
    • Control access to the keys by using key policies, IAM policies, and grants. Policies can be further refined using condition keys.
    • Supports attribute-based access control (ABAC).
    • Create, delete, list, and update aliases for the keys.
    • Tag the keys for identification, automation, and cost tracking.
    • Enable and disable keys.
    • Enable and disable automatic rotation of the cryptographic material in keys.
    • Delete keys to complete the key lifecycle.
  • supports the following cryptographic operations
    • Encrypt, decrypt, and re-encrypt data with symmetric or asymmetric keys.
    • Sign and verify messages with asymmetric keys.
    • Generate exportable symmetric data keys and asymmetric data key pairs.
    • Generate and verify HMAC codes. 
    • Generate random numbers suitable for cryptographic applications
  • supports multi-region keys, which act like copies of the same KMS key in different AWS Regions that can be used interchangeably – as though you had the same key in multiple Regions.
  • supports VPC private endpoint to connect KMS privately from a VPC.
  • supports keys in a CloudHSM key store backed by the CloudHSM cluster.
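
As a quick illustration of the operations above, the following is a minimal boto3 sketch (region, key description, and plaintext are arbitrary placeholders) that creates a symmetric key and encrypts and decrypts a small payload directly with KMS:

    import boto3

    kms = boto3.client("kms", region_name="us-east-1")

    # Create a symmetric encryption key (SYMMETRIC_DEFAULT = 256-bit AES).
    key = kms.create_key(Description="demo key")            # placeholder description
    key_id = key["KeyMetadata"]["KeyId"]

    # Direct Encrypt/Decrypt is limited to payloads of up to 4 KB.
    ciphertext = kms.encrypt(KeyId=key_id, Plaintext=b"small secret")["CiphertextBlob"]
    plaintext = kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
    assert plaintext == b"small secret"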

Envelope encryption

  • AWS cloud services integrated with AWS KMS use a method called envelope encryption to protect the data.
  • Envelope encryption is an optimized method for encrypting data that uses two different keys (Master key and Data key)
  • With envelope encryption:
    • A data key is generated and used by the AWS service to encrypt each piece of data or resource.
    • Data key is encrypted under a defined master key.
    • Encrypted data key is then stored by the AWS service.
    • For data decryption by the AWS service, the encrypted data key is passed to KMS and decrypted under the master key it was originally encrypted with, so the service can then decrypt the data.
  • When data is encrypted directly with KMS, it must be transferred over the network, and KMS only accepts payloads of up to 4 KB per request.
  • Envelope encryption offers significant performance benefits and reduces the network load for the application or AWS cloud service, because the data itself is encrypted locally with the data key; only the request for and delivery of the small data key goes over the network.

KMS Service Concepts

KMS Usage

KMS Keys OR Customer Master Keys (CMKs)

  • AWS KMS key is a logical representation of a cryptographic key.
  • KMS Keys can be used to create symmetric or asymmetric keys for encryption or signing OR HMAC keys to generate and verify HMAC tags.
  • Symmetric keys and the private keys of asymmetric keys never leave AWS KMS unencrypted.
  • A KMS key contains metadata, such as the key ID, key spec, key usage, creation date, description, key state, and a reference to the key material that is used to run cryptographic operations with the KMS key.
  • Symmetric keys are 256-bit AES keys that are not exportable.
  • KMS keys can be used to generate, encrypt, and decrypt the data keys, used outside of AWS KMS to encrypt the data [Envelope Encryption]

Customer Keys and AWS Keys

AWS KMS - Owned vs Managed vs Customer Managed Keys

AWS Managed Keys

  • AWS Managed keys are created, managed, and used on your behalf by AWS services in your AWS account.
  • keys are automatically rotated every year (~365 days) and the rotation schedule cannot be changed.
  • have permission to view the AWS managed keys in your account, view their key policies, and audit their use in CloudTrail logs.
  • cannot manage or rotate these keys, change their key policies, or use them in cryptographic operations directly; the service that creates them uses them on your behalf.

Customer managed keys

  • Customer managed keys are created by you to encrypt your service resources in your account.
  • Automatic rotation is Optional and if enabled, keys are automatically rotated every year.
  • provides full control over these keys, including establishing and maintaining their key policies, IAM policies, and grants, enabling and disabling them, rotating their cryptographic material, adding tags, creating aliases referring to the KMS keys, and scheduling the KMS keys for deletion.

AWS Owned Keys

  • AWS owned keys are a collection of KMS keys that an AWS service owns and manages for use in multiple AWS accounts.
  • AWS owned keys are not in your AWS account, however, an AWS service can use the associated AWS owned keys to protect the resources in your account.
  • cannot view, use, track, or audit them

Key Material

  • KMS keys contain a reference to the key material used to encrypt and decrypt data.
  • By default, AWS KMS generates the key material for a newly created key.
  • KMS key can be created without key material and then your own key material can be imported or created in the AWS CloudHSM cluster associated with an AWS KMS custom key store.
  • Key material cannot be extracted, exported, viewed, or managed.
  • Key material cannot be deleted; you must delete the KMS key.

Key Material Origin

  • Key material origin is a KMS key property that identifies the source of the key material in the KMS key.
  • Symmetric encryption KMS keys can have one of the following key material origin values.
    • AWS_KMS
      • AWS KMS creates and manages the key material for the KMS key in AWS KMS.
    • EXTERNAL
      • Key has imported key material. 
      • Management and security of the key are the customer’s responsibility.
      • Only symmetric keys are supported.
      • Automatic rotation is not supported and needs to be manually rotated.
    • AWS_CLOUDHSM
      • AWS KMS created the key material for the KMS key in the AWS CloudHSM cluster associated with the custom key store.
    • EXTERNAL_KEY_STORE
      • Key material is a cryptographic key in an external key manager outside of AWS.
      • This origin is supported only for KMS keys in an external key store.
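
For the EXTERNAL origin, the import flow can be sketched roughly as below with boto3 and the cryptography package; the key description is a placeholder and locally generated random bytes stand in for your own key material:

    import os
    import boto3
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import padding

    kms = boto3.client("kms")

    # Create a KMS key without key material; its origin will be EXTERNAL.
    key_id = kms.create_key(Origin="EXTERNAL", Description="imported-material key")["KeyMetadata"]["KeyId"]

    # Get a public wrapping key and an import token (both valid for 24 hours).
    params = kms.get_parameters_for_import(
        KeyId=key_id,
        WrappingAlgorithm="RSAES_OAEP_SHA_256",
        WrappingKeySpec="RSA_2048",
    )

    # Wrap the 256-bit key material with the returned public key.
    wrapping_key = serialization.load_der_public_key(params["PublicKey"])
    key_material = os.urandom(32)    # stands in for externally generated key material
    encrypted_material = wrapping_key.encrypt(
        key_material,
        padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()), algorithm=hashes.SHA256(), label=None),
    )

    kms.import_key_material(
        KeyId=key_id,
        ImportToken=params["ImportToken"],
        EncryptedKeyMaterial=encrypted_material,
        ExpirationModel="KEY_MATERIAL_DOES_NOT_EXPIRE",
    )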

Data Keys

  • Data keys are encryption keys that you can use to encrypt data, including large amounts of data and other data encryption keys.
  • KMS does not store, manage, or track your data keys.
  • Data keys must be used by services outside of KMS.

Encryption Context

  • Encryption context provides an optional set of key–value pairs that can contain additional contextual information about the data.
  • AWS KMS uses the encryption context as additional authenticated data (AAD) to support authenticated encryption.
  • Encryption context is not secret and not encrypted and appears in plaintext in CloudTrail Logs so you can use it to identify and categorize your cryptographic operations.
  • Encryption context should not include sensitive information.
  • Encryption context usage
    • When an encryption context is included in an encryption request, it is cryptographically bound to the ciphertext such that the same encryption context is required to decrypt the data.
    • If the encryption context provided in the decryption request is not an exact, case-sensitive match, the decrypt request fails.
  • Only the order of the key-value pairs in the encryption context can vary; the pairs themselves must match exactly, as shown in the sketch below.
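
A minimal boto3 sketch of encryption context usage (the alias and the context values are placeholders, and the context itself is not secret):

    import boto3

    kms = boto3.client("kms")
    context = {"department": "finance", "purpose": "payroll"}   # hypothetical, non-secret values

    ciphertext = kms.encrypt(
        KeyId="alias/demo-key",                                  # placeholder alias
        Plaintext=b"sensitive record",
        EncryptionContext=context,
    )["CiphertextBlob"]

    # Decrypt succeeds only with the exact same key-value pairs (order does not matter).
    plaintext = kms.decrypt(CiphertextBlob=ciphertext, EncryptionContext=context)["Plaintext"]

    # Omitting or changing the context makes the Decrypt request fail.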

Key Policies

  • help determine who can use and manage those keys.
  • can add, remove, or change permissions at any time for a customer-managed key.
  • cannot edit the key policy for AWS owned or managed keys.

Grants

  • provide an alternative to key policies and IAM policies for giving AWS principals permission to use the KMS keys.
  • are often used for temporary permissions because you can create one, use its permissions, and delete it without changing the key policies or IAM policies.
  • permissions specified in the grant might not take effect immediately due to eventual consistency.

Grant Tokens

  • help mitigate the potential delay with grants.
  • use the grant token received in the response to a CreateGrant API request to make the permissions in the grant take effect immediately, as shown below.
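
A rough boto3 sketch of creating a grant and using its grant token so the permission takes effect immediately; the key ARN and grantee principal are placeholders, and in practice the Decrypt call would be made by the grantee:

    import boto3

    kms = boto3.client("kms")
    key_arn = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"   # placeholder key ARN

    # Grant a principal temporary permission to use the key for Decrypt only.
    grant = kms.create_grant(
        KeyId=key_arn,
        GranteePrincipal="arn:aws:iam::111122223333:role/app-role",     # placeholder principal
        Operations=["Decrypt"],
    )

    # Passing the grant token avoids waiting for eventual consistency.
    ciphertext = kms.encrypt(KeyId=key_arn, Plaintext=b"data")["CiphertextBlob"]
    plaintext = kms.decrypt(CiphertextBlob=ciphertext, GrantTokens=[grant["GrantToken"]])["Plaintext"]

    # Remove the temporary permission once it is no longer needed.
    kms.revoke_grant(KeyId=key_arn, GrantId=grant["GrantId"])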

Alias

  • Alias helps provide a friendly name for a KMS key.
  • can be used to refer to different KMS keys in each AWS Region.
  • can be used to point to different keys without changing the code.
  • can allow and deny access to KMS keys based on their aliases without editing policies or managing grants.
  • aliases are independent resources, not properties of a KMS key, and can be added, changed, and deleted without affecting the associated KMS key.
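
A short boto3 sketch of creating an alias and referring to the key by that alias (the alias name and description are made up):

    import boto3

    kms = boto3.client("kms")
    key_id = kms.create_key(Description="app data key")["KeyMetadata"]["KeyId"]

    # Alias names must start with "alias/"; "alias/app-data-key" is a made-up name.
    kms.create_alias(AliasName="alias/app-data-key", TargetKeyId=key_id)

    # Requests can reference the alias instead of the key ID or ARN.
    kms.encrypt(KeyId="alias/app-data-key", Plaintext=b"data")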

Encryption & Decryption Process

  • Use KMS to generate a data key, receiving both the plaintext data key and a copy of it encrypted under the CMK.
  • Use the plaintext data key to encrypt the data and store the encrypted data key with the data.
  • Use KMS decrypt to get the plaintext data key and decrypt the data.
  • Remove the plaintext data key from memory, once the operation is completed.
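
A minimal sketch of the four steps above using boto3 together with the cryptography package for the local AES-GCM encryption; the alias and payload are placeholders:

    import os
    import boto3
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    kms = boto3.client("kms")

    # 1. Get a plaintext data key and its encrypted copy from KMS.
    dk = kms.generate_data_key(KeyId="alias/app-data-key", KeySpec="AES_256")   # placeholder alias
    plaintext_key, encrypted_key = dk["Plaintext"], dk["CiphertextBlob"]

    # 2. Encrypt the data locally and store the encrypted data key with the ciphertext.
    nonce = os.urandom(12)
    ciphertext = AESGCM(plaintext_key).encrypt(nonce, b"large payload ...", None)
    stored = (encrypted_key, nonce, ciphertext)

    # 3. Later, decrypt the data key with KMS and then decrypt the data locally.
    encrypted_key, nonce, ciphertext = stored
    data_key = kms.decrypt(CiphertextBlob=encrypted_key)["Plaintext"]
    recovered = AESGCM(data_key).decrypt(nonce, ciphertext, None)

    # 4. Discard the plaintext data keys as soon as the operations complete.
    del plaintext_key, data_key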

KMS Working

  • KMS centrally manages and securely stores the keys.
  • Keys can be generated or imported from the key management infrastructure (KMI).
  • Keys can be used from within the applications and supported AWS services to protect the data, but the key never leaves KMS.
  • Data is submitted to KMS to be encrypted, or decrypted, under keys that you control.
  • Usage policies on these keys can be set that determines which users can use them to encrypt and decrypt data.

KMS Access Control

  • Primary way to manage access to AWS KMS keys is with policies.
  • KMS keys access can be controlled using
    • Key Policies
      • are resource-based policies
      • every KMS key has a key policy
      • is a primary mechanism for controlling access to a key.
      • can be used alone to control access to the keys.
    • IAM policies
      • use IAM policies in combination with the key policy to control access to keys.
      • helps manage all of the permissions for your IAM identities in IAM.
    • Grants
      • Use grants in combination with the key policy and IAM policies to allow access to keys.
      • helps allow access to the keys in the key policy, and to allow users to delegate their access to others.
  • To allow access to a KMS CMK, a key policy MUST be used, either alone or in combination with IAM policies or grants.
  • IAM policies by themselves are not sufficient to allow access to keys, though they can be used in combination with a key policy.
  • IAM user who creates a KMS key is not considered to be the key owner and they don’t automatically have permission to use or manage the KMS key that they created.
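
For illustration, a hedged sketch of a key policy that enables IAM policies in the account and additionally allows one role to use the key (the account ID, role name, and key ID are placeholders), applied with put_key_policy:

    import json
    import boto3

    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {   # Allows IAM policies in account 111122223333 to grant access to this key.
                "Sid": "EnableIAMUserPermissions",
                "Effect": "Allow",
                "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
                "Action": "kms:*",
                "Resource": "*",
            },
            {   # Grants cryptographic use of the key to a specific role.
                "Sid": "AllowUseOfTheKey",
                "Effect": "Allow",
                "Principal": {"AWS": "arn:aws:iam::111122223333:role/app-role"},
                "Action": ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": "*",
            },
        ],
    }

    kms = boto3.client("kms")
    key_id = "1234abcd-12ab-34cd-56ef-1234567890ab"    # placeholder key ID
    kms.put_key_policy(KeyId=key_id, PolicyName="default", Policy=json.dumps(policy))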

Rotating KMS or Customer Master Keys

  • Key rotation changes only the key material, which is the cryptographic secret that is used in encryption operations. 
  • KMS keys can be enabled for automatic key rotation, where KMS generates new cryptographic material for the key every year.
  • KMS saves all previous versions of the cryptographic material in perpetuity so it can decrypt any data encrypted with that key.
  • KMS does not delete any rotated key material until you delete the KMS key.
  • All new encryption requests against a key are encrypted under the newest version of the key.
  • Rotation can be tracked in CloudWatch and CloudTrail.

Automatic Key Rotation

  • Automatic key rotation has the following benefits
    • properties of the KMS key like ID, ARN, region, policies, and permissions do not change.
    • applications or aliases referring to the key do not need to change
    • Rotating key material does not affect the use of the KMS key in any AWS service.
  • Automatic key rotation is supported only on symmetric encryption KMS keys with key material that KMS generates i.e. Origin = AWS_KMS.
  • Automatic key rotation is not supported for
    • asymmetric keys, 
    • HMAC keys,
    • keys in custom key stores, and
    • keys with imported key material.
  • AWS managed keys
    • automatically rotated every year (changed from every 3 years previously)
    • rotation cannot be enabled or disabled
  • Customer Managed keys
    • automatic key rotation is supported but is optional.
    • automatic key rotation is disabled, by default, and needs to be enabled.
    • keys can be rotated every year.
  • CMKs with imported key material or keys generated in a CloudHSM cluster using the KMS custom key store feature
    • do not support automatic key rotation.
    • provide flexibility to manually rotate keys as required.
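
A short boto3 sketch of enabling and verifying automatic rotation on a customer managed key (the key ID is a placeholder):

    import boto3

    kms = boto3.client("kms")
    key_id = "1234abcd-12ab-34cd-56ef-1234567890ab"    # placeholder customer managed key ID

    kms.enable_key_rotation(KeyId=key_id)              # KMS rotates the key material yearly
    status = kms.get_key_rotation_status(KeyId=key_id)
    print(status["KeyRotationEnabled"])                # True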

Manual Key Rotation

  • Manual key rotation can be performed by creating a KMS key and updating the applications or aliases to point to the new key.
  • does not retain the ID, ARN, and policies of the key.
  • can help control the rotation frequency, especially if rotation is required more often than once a year.
  • is also a good solution for KMS keys that are not eligible for automatic key rotation, such as asymmetric keys, HMAC keys, keys in custom key stores, and keys with imported key material.
  • For manually rotated keys, data has to be re-encrypted depending on the application’s configuration.
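
A rough boto3 sketch of manual rotation: create a replacement key, repoint the alias the applications use, and optionally re-encrypt existing ciphertext under the new key (the alias and ciphertext are placeholders):

    import boto3

    kms = boto3.client("kms")

    # Create the replacement key and repoint the alias used by the applications.
    new_key_id = kms.create_key(Description="rotated key")["KeyMetadata"]["KeyId"]
    kms.update_alias(AliasName="alias/app-data-key", TargetKeyId=new_key_id)

    # Optionally re-encrypt existing ciphertext under the new key without exposing plaintext.
    old_ciphertext = b"..."    # placeholder: ciphertext produced under the previous key
    rotated = kms.re_encrypt(CiphertextBlob=old_ciphertext, DestinationKeyId=new_key_id)["CiphertextBlob"]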

KMS Deletion

  • KMS key deletion deletes the key material and all metadata associated with the key and is irreversible.
  • Data encrypted by the deleted key cannot be recovered, once the key is deleted.
  • AWS recommends disabling the key before deleting it.
  • AWS Managed and Owned keys cannot be deleted. Only Customer managed keys can be scheduled for deletion.
  • KMS never deletes the keys unless you explicitly schedule them for deletion and the mandatory waiting period expires.
  • KMS requires setting a waiting period of 7-30 days for key deletion. During the waiting period, the KMS key status and key state is Pending deletion.
    • Key pending deletion cannot be used in any cryptographic operations.
    • Key material of keys that are pending deletion is not rotated.
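
A minimal boto3 sketch of the disable-then-schedule-deletion flow, including cancelling the deletion during the waiting period (the key ID is a placeholder):

    import boto3

    kms = boto3.client("kms")
    key_id = "1234abcd-12ab-34cd-56ef-1234567890ab"                  # placeholder customer managed key ID

    kms.disable_key(KeyId=key_id)                                    # recommended before deletion
    kms.schedule_key_deletion(KeyId=key_id, PendingWindowInDays=30)  # waiting period of 7-30 days

    # During the waiting period the key state is Pending deletion and it cannot be used;
    # the deletion can still be cancelled and the key re-enabled.
    kms.cancel_key_deletion(KeyId=key_id)
    kms.enable_key(KeyId=key_id)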

KMS Multi-Region Keys

  • AWS KMS supports multi-region keys, which are AWS KMS keys in different AWS Regions that can be used interchangeably – as though you had the same key in multiple Regions.
  • Multi-Region keys have the same key material and key ID, so data can be encrypted in one AWS Region and decrypted in a different AWS Region without re-encrypting or making a cross-Region call to AWS KMS.
  • Multi-Region keys never leave AWS KMS unencrypted.
  • Multi-Region keys are not global and each multi-region key needs to be replicated and managed independently.
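
A hedged boto3 sketch of creating a multi-Region primary key, replicating it, and decrypting in the other Region (the Regions chosen are arbitrary):

    import boto3

    # Create the primary multi-Region key in us-east-1.
    kms_east = boto3.client("kms", region_name="us-east-1")
    primary = kms_east.create_key(MultiRegion=True, Description="multi-Region demo key")
    primary_id = primary["KeyMetadata"]["KeyId"]                     # multi-Region key IDs start with "mrk-"

    # Replicate the key into us-west-2; the replica shares the key ID and key material.
    kms_east.replicate_key(KeyId=primary_id, ReplicaRegion="us-west-2")

    # Data encrypted in one Region can be decrypted in the other without a cross-Region call.
    ciphertext = kms_east.encrypt(KeyId=primary_id, Plaintext=b"data")["CiphertextBlob"]
    kms_west = boto3.client("kms", region_name="us-west-2")
    plaintext = kms_west.decrypt(CiphertextBlob=ciphertext, KeyId=primary_id)["Plaintext"]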

KMS Features

  • Create keys with a unique alias and description
  • Import your own keys
  • Control which IAM users and roles can manage keys
  • Control which IAM users and roles can use keys to encrypt & decrypt data
  • Choose to have AWS KMS automatically rotate keys on an annual basis
  • Temporarily disable keys so they cannot be used by anyone
  • Re-enable disabled keys
  • Delete keys that you no longer use
  • Audit use of keys by inspecting logs in AWS CloudTrail

KMS with VPC Interface Endpoint

  • AWS KMS can be connected through a private interface endpoint in the Virtual Private Cloud (VPC).
  • Interface VPC endpoint ensures the communication between the VPC and AWS KMS is conducted entirely within the AWS network.
  • Interface VPC endpoint connects the VPC directly to KMS without an internet gateway, NAT device, VPN, or Direct Connect connection.
  • Instances in the VPC do not need public IP addresses to communicate with AWS KMS.
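
A hedged boto3 sketch of creating the KMS interface endpoint (the VPC, subnet, and security group IDs are placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0123456789abcdef0",                 # placeholder
        ServiceName="com.amazonaws.us-east-1.kms",
        SubnetIds=["subnet-0123456789abcdef0"],        # placeholder
        SecurityGroupIds=["sg-0123456789abcdef0"],     # placeholder
        PrivateDnsEnabled=True,                        # resolves the KMS endpoint privately within the VPC
    )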

KMS vs CloudHSM

AWS KMS vs CloudHSM

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed the question might not be updated.
  • Open to further feedback, discussion and correction.
  1. You are designing a personal document-archiving solution for your global enterprise with thousands of employees. Each employee has potentially gigabytes of data to be backed up in this archiving solution. The solution will be exposed to the employees as an application, where they can just drag and drop their files to the archiving system. Employees can retrieve their archives through a web interface. The corporate network has high bandwidth AWS Direct Connect connectivity to AWS. You have regulatory requirements that all data needs to be encrypted before being uploaded to the cloud. How do you implement this in a highly available and cost-efficient way?
    1. Manage encryption keys on-premises in an encrypted relational database. Set up an on-premises server with sufficient storage to temporarily store files and then upload them to Amazon S3, providing a client-side master key. (Temporary storage increases cost and is not a highly available option)
    2. Manage encryption keys in a Hardware Security Module (HSM) appliance on-premises. Set up a server with sufficient storage to temporarily store, encrypt, and upload files directly into Amazon Glacier. (Not cost-effective)
    3. Manage encryption keys in AWS Key Management Service (KMS), upload to Amazon Simple Storage Service (S3) with client-side encryption using a KMS customer master key ID, and configure Amazon S3 lifecycle policies to store each object using the Amazon Glacier storage tier. (With CSE-KMS the encryption happens client-side before the object is uploaded to S3, and KMS is cost-effective as well)
    4. Manage encryption keys in an AWS CloudHSM appliance. Encrypt files prior to uploading on the employee desktop and then upload directly into Amazon Glacier. (Not cost-effective)
  2. An AWS customer is deploying an application that is composed of an Auto Scaling group of EC2 instances. The customer's security policy requires that every outbound connection from these instances to any other service within the customer's Virtual Private Cloud must be authenticated using a unique X.509 certificate that contains the specific instance-id. In addition, the X.509 certificates must be signed by the customer's key management service in order to be trusted for authentication.
    Which of the following configurations will support these requirements?
    1. Configure an IAM Role that grants access to an Amazon S3 object containing a signed certificate and configure the Auto Scaling group to launch instances with this role. Have the instances bootstrap get the certificate from Amazon S3 upon first boot.
    2. Embed a certificate into the Amazon Machine Image that is used by the Auto Scaling group. Have the launched instances generate a certificate signing request with the instance's assigned instance-id and submit it to the key management service for signature.
    3. Configure the Auto Scaling group to send an SNS notification of the launch of a new instance to the trusted key management service. Have the Key management service generate a signed certificate and send it directly to the newly launched instance.
    4. Configure the launched instances to generate a new certificate upon first boot. Have the Key management service poll the AutoScaling group for associated instances and send new instances a certificate signature that contains the specific instance-id.
  3. A company has a customer master key (CMK) with imported key materials. Company policy requires that all encryption keys must be rotated every year. What can be done to implement the above policy?
    1. Enable automatic key rotation annually for the CMK.
    2. Use AWS Command Line interface to create an AWS Lambda function to rotate the existing CMK annually.
    3. Import new key material to the existing CMK and manually rotate the CMK.
    4. Create a new CMK, import new key material to it, and point the key alias to the new CMK.
  4. An organization policy states that all encryption keys must be automatically rotated every 12 months. Which AWS Key Management Service (KMS) key type should be used to meet this requirement? (Select TWO)
    1. AWS managed Customer Master Key (CMK) (Now supports every year. It was every 3 years before.)
    2. Customer managed CMK with AWS generated key material
    3. Customer managed CMK with imported key material
    4. AWS managed data key

References

AWS_Key_Management_Service

Google Cloud Operations

Google Cloud Operations

Google Cloud Operations provides integrated monitoring, logging, and trace managed services for applications and systems running on Google Cloud and beyond.

Google Cloud Operations Suite
Credit Priyanka Vergadia

Cloud Monitoring

  • Cloud Monitoring collects measurements of key aspects of the service and of the Google Cloud resources used.
  • Cloud Monitoring provides tools to visualize and monitor this data.
  • Cloud Monitoring helps gain visibility into the performance, availability, and health of the applications and infrastructure.
  • Cloud Monitoring collects metrics, events, and metadata from Google Cloud, AWS, hosted uptime probes, and application instrumentation.

Cloud Logging

  • Cloud Logging is a service for storing, viewing and interacting with logs.
  • Answers the questions “Who did what, where and when” within the GCP projects
  • Maintains tamper-proof audit logs for each project and organization.
  • Log buckets are a regional resource, which means the infrastructure that stores, indexes, and searches the logs is located in a specific geographical location.
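
A minimal sketch using the google-cloud-logging Python client to write and then query log entries (the log name and filter are arbitrary):

    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client()
    logger = client.logger("app-log")                   # hypothetical log name

    # Write a text entry and a structured entry.
    logger.log_text("Payment service started", severity="INFO")
    logger.log_struct({"event": "charge", "amount": 42}, severity="NOTICE")

    # Read back recent entries that match a filter.
    for entry in client.list_entries(filter_='logName:"app-log" AND severity>=INFO'):
        print(entry.timestamp, entry.payload)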

Error Reporting

  • Error Reporting aggregates and displays errors produced in the running cloud services.
  • Error Reporting provides a centralized error management interface, to help find the application’s top or new errors so that they can be fixed faster.

Cloud Profiler

  • Cloud Profiler helps with continuous profiling of CPU, heap, and other parameters to improve performance and reduce costs.
  • Cloud Profiler is a continuous profiling tool that is designed for applications running on Google Cloud:
    • It’s a statistical, or sampling, profiler that has low overhead and is suitable for production environments.
    • It supports common languages and collects multiple profile types.
  • Cloud Profiler consists of the profiling agent, which collects the data, and a console interface on Google Cloud, which lets you view and analyze the data collected by the agent.
  • Cloud Profiler is supported for Compute Engine, App Engine, GKE, and applications running on-premises as well.
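
A short sketch of starting the Cloud Profiler agent from a Python service (the service name and version are placeholders):

    import googlecloudprofiler

    # Start the profiling agent once at application startup; it samples CPU and
    # memory in the background with low overhead.
    googlecloudprofiler.start(
        service="payment-service",      # hypothetical service name
        service_version="1.0.0",
        verbose=0,
    )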

Cloud Trace

  • Cloud Trace is a distributed tracing system that collects latency data from the applications and displays it in the Google Cloud Console.
  • Cloud Trace helps understand how long it takes the application to handle incoming requests from users or applications, and how long it takes to complete operations like RPC calls performed when handling the requests.
  • Cloud Trace can track how requests propagate through the application and provides detailed near real-time performance insights.
  • Cloud Trace automatically analyzes all of the application’s traces to generate in-depth latency reports to surface performance degradations and can capture traces from all the VMs, containers, or App Engines.
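
A hedged sketch of sending application traces to Cloud Trace with OpenTelemetry and the opentelemetry-exporter-gcp-trace package (the span names are arbitrary):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter

    # Export spans to Cloud Trace in batches.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(CloudTraceSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("handle-request"):       # hypothetical span names
        with tracer.start_as_current_span("rpc-call"):
            pass    # the latency of this block is recorded and reported to Cloud Trace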

Cloud Debugger

  • Cloud Debugger helps inspect the state of an application, at any code location, without stopping or slowing down the running app.
  • Cloud Debugger makes it easier to view the application state without adding logging statements.
  • Cloud Debugger adds less than 10ms to the request latency only when the application state is captured. In most cases, this is not noticeable by users.
  • Cloud Debugger can be used with or without access to your app’s source code.
  • Cloud Debugger supports Cloud Source Repositories, GitHub, Bitbucket, or GitLab as the source code repository. If the source code repository is not supported, the source files can be uploaded.
  • Cloud Debugger allows collaboration by sharing the debug session by sending the Console URL.
  • Cloud Debugger supports a range of IDEs.

Debug Snapshots

  • Debug Snapshots capture local variables and the call stack at a specific line location in the app’s source code without stopping or slowing it down.
  • Certain conditions and locations can be specified to return a snapshot of the app’s data.
  • Debug Snapshots support canarying wherein the debugger agent tests the snapshot on a subset of the instances.

Debug Logpoints

  • Debug Logpoints allow you to inject logging into running services without restarting or interfering with the normal function of the service.
  • Debug Logpoints are useful for debugging production issues without having to add log statements and redeploy.
  • Debug Logpoints remain active for 24 hours after creation, or until they are deleted or the service is redeployed.
  • If a logpoint is placed on a line that receives lots of traffic, the Debugger throttles the logpoint to reduce its impact on the application.
  • Debug Logpoints support canarying wherein the debugger agent tests the logpoints on a subset of the instances.

References

Google_Cloud_Operations

Google Cloud CI/CD – Continuous Integration & Continuous Deployment

Google Cloud CI/CD

Google Cloud CI/CD provides various tools for continuous integration and deployment and also integrates seamlessly with third-party solutions.

Google Cloud CI/CD - Continuous Integration Continuous Deployment

Google Cloud Source Repositories – CSR

  • Cloud Source Repositories are fully-featured, private Git repositories hosted on Google Cloud.
  • Cloud Source Repositories can be used for collaborative, version-controlled development of any app or service, including those that run on App Engine and Compute Engine.
  • Cloud Source Repositories can connect to an existing GitHub or Bitbucket repository. Connected repositories are synchronized with Cloud Source Repositories automatically.
  • Cloud Source Repositories automatically send logs on repository activity to Cloud Logging to help track and troubleshoot data access.
  • Cloud Source Repositories offer security key detection to block git push transactions that contain sensitive information which helps improve the security of the source code.
  • Cloud Source Repositories provide built-in integrations with other GCP tools like Cloud Build, Cloud Debugger, Cloud Operations, Cloud Logging, Cloud Functions, and others that let you automatically build, test, deploy, and debug code within minutes.
  • Cloud Source Repositories publishes messages about the repository to a Pub/Sub topic.
  • Cloud Source Repositories provide a search feature to search for specific files or code snippets.
  • Cloud Source Repositories allow permissions to be controlled at the project level (applying to all repositories in the project) or at the individual repository level.

Cloud Build

  • Cloud Build is a fully-managed, serverless service that executes builds on Google Cloud Platform’s infrastructure.
  • Cloud Build can pull/import source code from a variety of repositories or cloud storage spaces, execute a build to produce containers or artifacts, and push them to the artifact registry.
  • Cloud Build executes the build as a series of build steps, where each build step specifies an action to be performed and is run in a Docker container.
  • Build steps can be provided by Cloud Build and the Cloud Build community or can be custom as well.
  • A build config file contains instructions for Cloud Build to perform tasks based on your specifications; for example, it can contain instructions to build, package, and push Docker images (see the sketch after this list).
  • Builds can be started either manually or using build triggers.
  • Cloud Build uses build triggers to enable CI/CD automation.
  • Build triggers can listen for incoming events, such as when a new commit is pushed to a repository or when a pull request is initiated, and then automatically execute a build when new events come in.
  • Cloud Build publishes messages on a Pub/Sub topic called cloud-builds when the build’s state changes, such as when the build is created, when the build transitions to a working state, and when the build completes.
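
For illustration, a rough sketch using the google-cloud-build Python client to submit a build that builds a Docker image from a connected repository and pushes it on success; the repository, project, and image names are placeholders, and the same steps are typically declared in a cloudbuild.yaml build config:

    from google.cloud.devtools import cloudbuild_v1

    client = cloudbuild_v1.CloudBuildClient()

    # Each build step runs in its own container image ("cloud builder");
    # images listed under "images" are pushed to the registry when the build succeeds.
    build = cloudbuild_v1.Build(
        source=cloudbuild_v1.Source(
            repo_source=cloudbuild_v1.RepoSource(repo_name="my-repo", branch_name="main"),   # placeholders
        ),
        steps=[
            cloudbuild_v1.BuildStep(
                name="gcr.io/cloud-builders/docker",
                args=["build", "-t", "gcr.io/my-project/my-app:latest", "."],
            ),
        ],
        images=["gcr.io/my-project/my-app:latest"],
    )

    operation = client.create_build(project_id="my-project", build=build)   # placeholder project
    result = operation.result()    # blocks until the build completes
    print(result.status)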

Container Registry

  • Container Registry is a private container image registry that supports Docker Image Manifest V2 and OCI image formats.
  • Container Registry provides a subset of Artifact Registry features.
  • Container Registry stores its tags and layer files for container images in a Cloud Storage bucket in the same project as the registry.
  • Access to the bucket is configured using Cloud Storage’s identity and access management (IAM) settings.
  • Container Registry integrates seamlessly with Google Cloud services and works with popular continuous integration and continuous delivery systems, including Cloud Build and third-party tools such as Jenkins.

Artifact Registry

  • Artifact Registry is a fully-managed service with support for both container images and non-container artifacts, and it extends the capabilities of Container Registry.
  • Artifact Registry is the recommended service for container image storage and management on Google Cloud.
  • Artifact Registry comes with fine-grained access control via Cloud IAM. This enables scoping permissions as granularly as possible, for example to specific regions or environments as necessary.
  • Artifact Registry supports the creation of regional repositories.

Container Registry vs Artifact Registry

Google Cloud Container Registry Vs Artifact Registry

Google Cloud DevOps
Credit Priyanka Vergadia

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed the question might not be updated.
  • Open to further feedback, discussion and correction.


Google Cloud Container Registry Vs Artifact Registry

Container Registry vs Artifact Registry

Google Cloud - Container Registry vs Artifact Registry

Container Registry

  • Container Registry is a private container image registry that supports Docker Image Manifest V2 and OCI image formats.
  • provides a subset of Artifact Registry features.
  • stores its tags and layer files for container images in a Cloud Storage bucket in the same project as the registry.
  • does not support fine-grained IAM access control. Access to the bucket is configured using Cloud Storage’s permissions.
  • integrates seamlessly with Google Cloud services and works with popular continuous integration and continuous delivery systems including Cloud Build and third-party tools such as Jenkins.
  • is used to store only Docker images and does not support language or OS packages.
  • is only multi-regional and does not support regional repositories.
  • supports a single repository within a project and automatically creates a repository in a multi-region.
  • uses gcr.io hosts.
  • uses gcloud container images commands.
  • supports CMEK (Customer-Managed Encryption Keys) to encrypt the storage buckets that contain the images.
  • supports several authentication methods for pushing and pulling images with a third-party client.
  • caches the most frequently requested Docker Hub images on mirror.gcr.io
  • supports VPC-Service Controls and can be added to a service perimeter.
  • hosts Google provided images on gcr.io
  • publishes changes to the gcr topic.
  • images can be viewed and managed from the Container Registry section of the Cloud Console.
  • pricing is based on Cloud Storage usage, including storage and network egress.

Artifact Registry

  • Artifact Registry is a fully-managed service with support for both container images and non-container artifacts, and it extends the capabilities of Container Registry.
  • Artifact Registry is the recommended service for container image storage and management on Google Cloud. It is considered the successor of the Container Registry.
  • Artifact Registry comes with fine-grained access control via Cloud IAM using Artifact Registry permissions. This enables scoping permissions as granularly as possible, e.g., to specific regions or environments as necessary.
  • supports multi-regional or regional repositories.
  • uses pkg.dev hosts.
  • uses gcloud artifacts docker commands.
  • supports CMEK (Customer-Managed Encryption Keys) to encrypt individual repositories.
  • supports multiple repositories within the project and the repository should be manually created before pushing any images.
  • supports multiple artifact formats, including Container images, Java packages, and Node.js modules.
  • supports the same authentication method as Container Registry.
  • mirror.gcr.io continues to cache frequently requested images from Docker Hub.
  • supports VPC-Service Controls and can be added to a service perimeter.
  • hosts Google provided images on gcr.io
  • publishes changes to the gcr topic.
  • Artifact Registry and Container Registry repositories can be viewed from the Artifact Registry section of Cloud Console.
  • pricing is based on storage and network egress.

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day and both the answers and questions might be outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed the question might not be updated.
  • Open to further feedback, discussion and correction.


References

Artifact Registry vs Container Registry Feature Comparison