Google Cloud Operations

Google Cloud Operations provides integrated, managed monitoring, logging, and tracing services for applications and systems running on Google Cloud and beyond.

Figure: Google Cloud Operations Suite (credit: Priyanka Vergadia)

Cloud Monitoring

  • Cloud Monitoring collects measurements of key aspects of the service and of the Google Cloud resources used.
  • Cloud Monitoring provides tools to visualize and monitor this data.
  • Cloud Monitoring helps gain visibility into the performance, availability, and health of the applications and infrastructure.
  • Cloud Monitoring collects metrics, events, and metadata from Google Cloud, AWS, hosted uptime probes, and application instrumentation.

Cloud Logging

  • Cloud Logging is a service for storing, viewing and interacting with logs.
  • Cloud Logging answers the question “Who did what, where, and when?” within Google Cloud projects.
  • Cloud Logging maintains tamper-proof audit logs for each project and organization.
  • Log buckets are regional resources, which means the infrastructure that stores, indexes, and searches the logs is located in a specific geographical location.
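The "who did what, where, and when" question can be illustrated with a small sketch. The entry below is a hand-written dictionary that only mimics the shape of an audit log entry: the field names follow the audit log payload (principalEmail, methodName, resourceName), while the values and the helper summarize are hypothetical.

```python
# Hypothetical, hand-written entry shaped like an audit log record.
entry = {
    "timestamp": "2024-01-15T10:30:00Z",
    "protoPayload": {
        "authenticationInfo": {"principalEmail": "alice@example.com"},
        "methodName": "v1.compute.instances.delete",
        "resourceName": "projects/demo-project/zones/us-central1-a/instances/vm-1",
    },
}


def summarize(entry):
    """Return a (who, what, where, when) tuple from an audit-log-like dict."""
    payload = entry.get("protoPayload", {})
    who = payload.get("authenticationInfo", {}).get("principalEmail", "unknown")
    what = payload.get("methodName", "unknown")
    where = payload.get("resourceName", "unknown")
    when = entry.get("timestamp", "unknown")
    return who, what, where, when


who, what, where, when = summarize(entry)
print(f"{who} called {what} on {where} at {when}")
```

Missing fields fall back to "unknown" so that a partially filled entry does not raise an exception.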

Error Reporting

  • Error Reporting aggregates and displays errors produced in the running cloud services.
  • Error Reporting provides a centralized error management interface, to help find the application’s top or new errors so that they can be fixed faster.
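The idea of aggregating errors to find the top ones can be sketched in a few lines. This is only an illustration under the assumption that errors are grouped by exception type and code location; the actual grouping rules used by Error Reporting are not described here, and the event fields are hypothetical.

```python
from collections import Counter


def group_errors(events):
    """Count error events by (exception type, location)."""
    return Counter((e["type"], e["location"]) for e in events)


# Hypothetical error events, as an application might report them.
events = [
    {"type": "KeyError", "location": "cart.py:42"},
    {"type": "TimeoutError", "location": "payment.py:10"},
    {"type": "KeyError", "location": "cart.py:42"},
]

top = group_errors(events).most_common(1)
print(top)  # [(('KeyError', 'cart.py:42'), 2)]
```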

Cloud Profiler

  • Cloud Profiler continuously profiles CPU usage, heap usage, and other resource parameters to help improve performance and reduce costs.
  • Cloud Profiler is a continuous profiling tool that is designed for applications running on Google Cloud:
    • It’s a statistical, or sampling, profiler that has low overhead and is suitable for production environments.
    • It supports common languages and collects multiple profile types.
  • Cloud Profiler consists of the profiling agent, which collects the data, and a console interface on Google Cloud, which lets you view and analyze the data collected by the agent.
  • Cloud Profiler supports applications running on Compute Engine, App Engine, and GKE, as well as applications running on-premises.
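To make the "statistical, or sampling, profiler" idea concrete, here is a minimal sketch of how periodic stack samples can be turned into per-function time estimates. It is not the Cloud Profiler agent; the sample data and the function aggregate are hypothetical, and each sample is assumed to be a non-empty tuple of function names from outermost to innermost.

```python
from collections import Counter


def aggregate(samples):
    """Estimate each function's share of time from periodic stack samples.

    Returns (self_share, total_share): the fraction of samples in which the
    function was executing, and the fraction in which it was on the stack.
    """
    self_hits = Counter()
    total_hits = Counter()
    for stack in samples:
        self_hits[stack[-1]] += 1      # innermost frame was running
        for fn in set(stack):          # count each function once per sample
            total_hits[fn] += 1
    n = len(samples)
    self_share = {fn: c / n for fn, c in self_hits.items()}
    total_share = {fn: c / n for fn, c in total_hits.items()}
    return self_share, total_share


# Four hypothetical samples taken at a fixed interval.
samples = [
    ("main", "handle", "parse"),
    ("main", "handle", "parse"),
    ("main", "handle", "render"),
    ("main", "idle"),
]
self_share, total_share = aggregate(samples)
print(self_share)
print(total_share)
```

Because only a small number of samples is inspected rather than every function call, the overhead stays low, which is what makes this style of profiling suitable for production.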

Cloud Trace

  • Cloud Trace is a distributed tracing system that collects latency data from the applications and displays it in the Google Cloud Console.
  • Cloud Trace helps understand how long it takes the application to handle incoming requests from users or applications, and how long it takes to complete operations like RPC calls performed when handling the requests.
  • Cloud Trace tracks how requests propagate through the application and provides detailed, near real-time performance insights.
  • Cloud Trace automatically analyzes all of the application’s traces to generate in-depth latency reports that surface performance degradations, and it can capture traces from VMs, containers, and App Engine applications.
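The latency data described above is usually organized as nested spans: one span for the incoming request and child spans for the operations (such as RPC calls) performed while handling it. The following toy tracer sketches that structure in plain Python. It does not use the Cloud Trace client library; the class Trace, its methods, and the span names are all hypothetical.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects (name, parent, duration_seconds) for each finished span."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock      # injectable so the timing can be tested
        self.spans = []
        self._stack = []        # names of the spans that are currently open

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        self._stack.append(name)
        start = self.clock()
        try:
            yield
        finally:
            duration = self.clock() - start
            self._stack.pop()
            self.spans.append((name, parent, duration))


trace = Trace()
with trace.span("handle_request"):
    with trace.span("rpc:get_user"):
        time.sleep(0.01)        # stand-in for a remote call
    with trace.span("rpc:get_orders"):
        time.sleep(0.02)

for name, parent, duration in trace.spans:
    print(f"{name} (parent={parent}): {duration * 1000:.1f} ms")
```

Child spans finish before their parent, so they appear first in the list; the parent's duration covers both RPC calls.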

Cloud Debugger

  • Cloud Debugger helps inspect the state of an application, at any code location, without stopping or slowing down the running app.
  • Cloud Debugger makes it easier to view the application state without adding logging statements.
  • Cloud Debugger adds less than 10ms to the request latency only when the application state is captured. In most cases, this is not noticeable by users.
  • Cloud Debugger can be used with or without access to your app’s source code.
  • Cloud Debugger supports Cloud Source Repositories, GitHub, Bitbucket, or GitLab as the source code repository. If the source code repository is not supported, the source files can be uploaded.
  • Cloud Debugger supports collaboration: a debug session can be shared by sending its Console URL.
  • Cloud Debugger supports a range of IDEs.

Debug Snapshots

  • Debug Snapshots capture local variables and the call stack at a specific line location in the app’s source code without stopping or slowing it down.
  • Certain conditions and locations can be specified to return a snapshot of the app’s data.
  • Debug Snapshots support canarying, in which the debugger agent tests the snapshot on a subset of the instances.

Debug Logpoints

  • Debug Logpoints allow you to inject logging into running services without restarting or interfering with the normal function of the service.
  • Debug Logpoints are useful for debugging production issues without having to add log statements and redeploy.
  • Debug Logpoints remain active for 24 hours after creation, or until they are deleted or the service is redeployed.
  • If a logpoint is placed on a line that receives lots of traffic, the Debugger throttles the logpoint to reduce its impact on the application.
  • Debug Logpoints support canarying, in which the debugger agent tests the logpoints on a subset of the instances.
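The 24-hour lifetime and the throttling behaviour of logpoints can be modelled with a short sketch. This is not how the Debugger agent is implemented: the class Logpoint is hypothetical, and the per-second limit is a made-up value, since the actual throttling rule is not given here.

```python
from datetime import datetime, timedelta


class Logpoint:
    LIFETIME = timedelta(hours=24)   # from the text: active for 24 hours

    def __init__(self, created, max_per_second=5):
        # max_per_second is an assumed limit, chosen only for illustration.
        self.created = created
        self.max_per_second = max_per_second
        self._window = None          # the one-second window being counted
        self._count = 0

    def is_active(self, now):
        return now - self.created < self.LIFETIME

    def should_log(self, now):
        """Return True if a hit at time `now` should emit a log line."""
        if not self.is_active(now):
            return False
        window = now.replace(microsecond=0)
        if window != self._window:   # a new second has started: reset
            self._window, self._count = window, 0
        if self._count >= self.max_per_second:
            return False             # throttled on a busy line
        self._count += 1
        return True


created = datetime(2024, 1, 1, 12, 0, 0)
logpoint = Logpoint(created, max_per_second=2)
hits = [logpoint.should_log(created + timedelta(milliseconds=100 * i))
        for i in range(4)]
print(hits)  # [True, True, False, False]
```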

Google Cloud Monitoring – Stackdriver

Google Cloud Monitoring

  • Cloud Monitoring collects measurements of key aspects of the service and of the Google Cloud resources used.
  • Cloud Monitoring provides tools to visualize and monitor this data.
  • Cloud Monitoring helps gain visibility into the performance, availability, and health of the applications and infrastructure.
  • Cloud Monitoring collects metrics, events, and metadata from Google Cloud, AWS, hosted uptime probes, and application instrumentation.
  • Using the BindPlane service, data can be collected from over 150 common application components, on-premises systems, and hybrid cloud systems.

Cloud Monitoring Workspaces

  • Cloud Monitoring uses Workspaces to organize monitoring information.
  • A Workspace is a tool for monitoring resources across Google Cloud projects.
  • A Workspace accesses metric data from its monitored projects, but the metric data remains in those projects.
  • Every Workspace has a host project. If you delete the host project, you also delete the Workspace.
  • A Workspace always monitors its Google Cloud host project.
  • The host project is the project used to create the Workspace. The name of the Workspace is set to the name of the host project and isn’t configurable.
  • The host project stores all of the configuration content for the dashboards, alerting policies, uptime checks, notification channels, and group definitions that you configure.
  • A Workspace can monitor multiple projects, but a Google Cloud project can be monitored by only one Workspace.
  • Projects can be moved from one Workspace to another.
  • Two Workspaces can be merged into a single Workspace.
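The Workspace rules above (a host project that is always monitored, at most one Workspace per project, moving and merging) can be captured in a small in-memory model. This is only a sketch for reasoning about the constraints, not the Monitoring API; the classes Workspace and Registry and the project names are hypothetical, and the behaviour of merge (the source Workspace ends up empty) is an assumption.

```python
class Workspace:
    def __init__(self, host_project):
        self.host_project = host_project
        self.name = host_project         # name is fixed to the host project's name
        self.projects = {host_project}   # the host project is always monitored


class Registry:
    """Tracks which Workspace monitors each project (at most one)."""

    def __init__(self):
        self.owner = {}                  # project name -> Workspace

    def create(self, host_project):
        if host_project in self.owner:
            raise ValueError(f"{host_project} is already monitored")
        ws = Workspace(host_project)
        self.owner[host_project] = ws
        return ws

    def add(self, ws, project):
        if project in self.owner:
            raise ValueError(f"{project} is already monitored")
        ws.projects.add(project)
        self.owner[project] = ws

    def move(self, project, dst):
        src = self.owner[project]
        if project == src.host_project:
            raise ValueError("a Workspace always monitors its host project")
        src.projects.remove(project)
        dst.projects.add(project)
        self.owner[project] = dst

    def merge(self, src, dst):
        # Assumption: every project of src becomes monitored by dst.
        for project in list(src.projects):
            dst.projects.add(project)
            self.owner[project] = dst
        src.projects = set()


registry = Registry()
ws_a = registry.create("project-a")
registry.add(ws_a, "project-b")
registry.add(ws_a, "project-c")
print(sorted(ws_a.projects))  # ['project-a', 'project-b', 'project-c']
```

The last three lines mirror the second practice question further down: create a Workspace under project A, then add projects B and C.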

Cloud Monitoring Metrics

  • Metrics are a collection of measurements that help you understand how the applications and system services are performing.
  • Measurements might include the latency of requests to a service, the amount of disk space available on a machine, the number of tables in the SQL database, the number of widgets sold, and so forth.
  • Metric Value type includes
    • For measurements consisting of a single value at a time
      • BOOL, a boolean
      • INT64, a 64-bit integer
      • DOUBLE, a double-precision float
      • STRING, a string
    • For distribution measurements, the value isn’t a single value but a group of values.
      • The value type for distribution measurements is DISTRIBUTION.
      • Values in distribution include the mean, count, max, and other statistics, computed for a group of values.
      • Latency metrics typically capture data as distributions
  • Metric Kind includes
    • Gauge metric – The value is measured at a specific instant in time, e.g., CPU utilization or current temperature.
    • Delta metric – The value is the change since the metric was last recorded, e.g., metrics measuring request counts are delta metrics; each value records how many requests were received since the last data point was recorded.
    • Cumulative metric – The value constantly increases over time, e.g., a metric for “sent bytes” might be cumulative; each value records the total number of bytes sent by a service up to that time.
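The relationship between delta and cumulative metrics can be shown with a short worked example. The series below are invented numbers, and the two helper functions are hypothetical; they only illustrate that a cumulative series is the running total of its deltas.

```python
def cumulative_to_delta(points):
    """Turn a cumulative series into the change between consecutive points."""
    return [curr - prev for prev, curr in zip(points, points[1:])]


def delta_to_cumulative(deltas, start=0):
    """Rebuild a cumulative series by keeping a running total of deltas."""
    total = start
    series = []
    for delta in deltas:
        total += delta
        series.append(total)
    return series


# Cumulative: total bytes sent by a service at each sample time.
sent_bytes = [100, 250, 250, 900]
deltas = cumulative_to_delta(sent_bytes)
print(deltas)                                   # [150, 0, 650]
print(delta_to_cumulative(deltas, start=100))   # [250, 250, 900]

# Gauge: each value stands on its own, so no conversion applies.
cpu_utilization = [0.20, 0.70, 0.40]
print(max(cpu_utilization))                     # 0.7
```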

Cloud Monitoring Agent

  • Google Cloud’s operations suite provides the following agents for collecting metrics on Linux and Windows VM instances.
  • Ops Agent
    • The primary and preferred agent for collecting telemetry from the Compute Engine instances.
    • This agent combines logging and metrics into a single agent, providing YAML-based configurations for collecting the logs and metrics, and features high-throughput logging.
    • Ops Agent uses Fluent Bit for logs, which supports high-throughput logging, and the OpenTelemetry Collector for metrics.
  • Legacy Monitoring Agent
    • The agent is a collectd-based daemon that gathers system and application metrics from VM instances and sends them to Cloud Monitoring.
    • By default, the legacy monitoring agent collects disk, CPU, network, and process metrics.
    • The agent can also be configured to monitor third-party applications, which adds their metrics to the agent metrics it collects.

Cloud Monitoring – Uptime Checks

  • An uptime check is a request sent to a publicly accessible IP address on a resource to see whether it responds.
  • Uptime checks can determine the availability of the following:
    • URLs
    • Kubernetes LoadBalancer Services
    • VM instances
    • App Engine services
    • AWS load balancers
  • The availability of a resource can be monitored by creating an alerting policy that creates an incident when the uptime check fails.
  • The alerting policy can be configured to notify by email or through a different channel, and that notification can include details about the resource that failed to respond.
  • The results of uptime checks can also be observed in the Monitoring uptime-check dashboards.
  • For resources that are not publicly available, the resource’s firewall must be configured to permit incoming traffic from the uptime-check servers.
  • Uptime checks are unable to reach resources that don’t have an external IP address.
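Conceptually, an uptime check sends a request and decides from the response whether the resource is available. The sketch below shows that idea with the Python standard library only. It is not the Cloud Monitoring uptime-check service: the function names are hypothetical, and treating any 2xx status as "up" is an assumption made for the example.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code      # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None          # DNS failure, refused connection, or timeout


def uptime_check(url, fetch=http_status):
    """Return True when the resource answers with a 2xx status."""
    status = fetch(url)
    return status is not None and 200 <= status < 300


# Stub fetchers stand in for real network calls in this demonstration.
print(uptime_check("https://example.com/health", fetch=lambda url: 200))   # True
print(uptime_check("https://example.com/health", fetch=lambda url: 503))   # False
print(uptime_check("https://example.com/health", fetch=lambda url: None))  # False
```

Passing the fetch function as a parameter keeps the pass/fail rule separate from the network call; calling uptime_check(url) without the stub performs a real request.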

GCP Certification Exam Practice Questions

  • Questions are collected from the Internet, and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • GCP services are updated every day, and both the answers and questions might become outdated soon, so research accordingly.
  • GCP exam questions are not updated to keep pace with GCP updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion, and correction.
  1. You need to monitor resources that are distributed over different projects in Google Cloud Platform. You want to consolidate reporting under the same Stackdriver Monitoring dashboard. What should you do?
    1. Use Shared VPC to connect all projects, and link Stackdriver to one of the projects.
    2. For each project, create a Stackdriver account. In each project, create a service account for that project and grant it the role of Stackdriver Account Editor in all other projects.
    3. Configure a single Stackdriver account, and link all projects to the same account.
    4. Configure a single Stackdriver account for one of the projects. In Stackdriver, create a Group and add the other project names as criteria for that Group.
  2. You are asked to set up application performance monitoring on Google Cloud projects A, B, and C as a single pane of glass. You want to monitor CPU, memory, and disk. What should you do?
    1. Enable API and then share charts from projects A, B, and C.
    2. Enable API and then give the metrics.reader role to projects A, B, and C.
    3. Enable API and then use default dashboards to view all projects in sequence.
    4. Enable API, create a workspace under project A, and then add projects B and C.
