Google Cloud – TerramEarth Case Study

February 27, 2021 ~ Last updated on : February 1, 2022 ~ jayendrapatil

TerramEarth manufactures heavy equipment for the mining and agricultural industries. They currently have over 500 dealers and service centers in 100 countries. Their mission is to build products that make their customers more productive.

Key points here are that TerramEarth has over 500 dealers and service centers spread across 100 countries, and that it wants to make its customers more productive.

Solution Concept

There are 2 million TerramEarth vehicles in operation currently, and we see 20% yearly growth. Vehicles collect telemetry data from many sensors during operation. A small subset of critical data is transmitted from the vehicles in real time to facilitate fleet management. The rest of the sensor data is collected, compressed, and uploaded daily when the vehicles return to home base. Each vehicle usually generates 200 to 500 megabytes of data per day.

Key points here are TerramEarth has 2 million vehicles. Only critical data is transferred in real-time while the rest of the data is uploaded in bulk daily.
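To get a feel for the scale behind these numbers, the figures above can be turned into a rough daily volume. This is only a back-of-the-envelope sketch: it assumes every vehicle uploads every day, that the 20% growth compounds yearly, and that decimal units apply (1 TB = 1,000,000 MB). The function names are illustrative and the 5-year horizon is borrowed from the strategic plan in the executive statement.

```python
# Back-of-the-envelope sizing from the case-study figures.
VEHICLES = 2_000_000          # vehicles in operation today
GROWTH = 0.20                 # 20% yearly fleet growth (assumed to compound)
MB_PER_VEHICLE = (200, 500)   # daily data per vehicle, in MB


def daily_volume_tb(vehicles: float, mb_per_vehicle: float) -> float:
    """Daily upload volume in terabytes (1 TB = 1,000,000 MB)."""
    return vehicles * mb_per_vehicle / 1_000_000


def fleet_after(years: int) -> float:
    """Fleet size after compounding the yearly growth rate."""
    return VEHICLES * (1 + GROWTH) ** years


if __name__ == "__main__":
    for year in (0, 5):
        fleet = fleet_after(year)
        low, high = (daily_volume_tb(fleet, mb) for mb in MB_PER_VEHICLE)
        print(f"year {year}: {fleet:,.0f} vehicles, {low:,.0f} to {high:,.0f} TB/day")
```

With today's fleet this works out to roughly 400 TB to 1 PB of sensor data per day, which is why only a small critical subset is sent in real time and the rest is uploaded in bulk.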

Executive Statement

Our competitive advantage has always been our focus on the customer, with our ability to provide excellent customer service and minimize vehicle downtimes. After moving multiple systems into Google Cloud, we are seeking new ways to provide best-in-class online fleet management services to our customers and improve operations of our dealerships. Our 5-year strategic plan is to create a partner ecosystem of new products by enabling access to our data, increasing autonomous operation capabilities of our vehicles, and creating a path to move the remaining legacy systems to the cloud.

Key point here is the company wants to improve further in operations, customer experience, and partner ecosystem by allowing them to reuse the data.

Existing Technical Environment

TerramEarth’s vehicle data aggregation and analysis infrastructure resides in Google Cloud and serves clients from all around the world. A growing amount of sensor data is captured from their two main manufacturing plants and sent to private data centers that contain their legacy inventory and logistics management systems. The private data centers have multiple network interconnects configured to Google Cloud.
The web frontend for dealers and customers is running in Google Cloud and allows access to stock management and analytics.

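Key point here is the company runs a hybrid setup across Google Cloud and private data centers. Google Cloud hosts the web frontend and the vehicle data aggregation and analysis, while sensor data from the two manufacturing plants is sent to the private data centers, which connect to Google Cloud over multiple network interconnects.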

Business Requirements

Predict and detect vehicle malfunction and rapidly ship parts to dealerships for just-in-time repair where possible.

- Cloud IoT Core can provide a fully managed service to easily and securely connect, manage, and ingest data from globally dispersed devices.
- Existing legacy inventory and logistics management systems running in the private data centers can be migrated to Google Cloud.
- Existing data can be migrated one time using Transfer Appliance.

Decrease cloud operational costs and adapt to seasonality.

- Google Cloud resources can be configured to scale elastically with demand, so capacity (and cost) follows seasonal peaks and troughs.

Increase speed and reliability of development workflow.

- Google Cloud CI/CD tools like Cloud Build and open-source tools like Spinnaker can be used to increase the speed and reliability of the deployments.

Allow remote developers to be productive without compromising code or data security.

Cloud Function to Function authentication

Create a flexible and scalable platform for developers to create custom API services for dealers and partners.

- Google Cloud provides multiple fully managed, serverless, and scalable application hosting solutions like Cloud Run and Cloud Functions.
- Managed instance groups on Compute Engine and GKE clusters with autoscaling can also be used to provide scalable, highly available compute services.

Technical Requirements

Create a new abstraction layer for HTTP API access to their legacy systems to enable a gradual move into the cloud without disrupting operations.

- Google Cloud API Gateway & Cloud Endpoints can be used to provide an abstraction layer to expose the data externally over a variety of backends.
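The idea of an abstraction layer that enables a gradual move can be pictured as a routing table sitting in front of the backends: clients keep calling the same paths while individual paths are repointed from the legacy systems to cloud services. The sketch below is a plain-Python illustration of that routing idea only. It is not the API Gateway or Cloud Endpoints configuration format, and the paths, host names, and the `resolve` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str    # public API path prefix
    backend: str   # base URL of the system currently serving that prefix


# Hypothetical routing table; hosts and paths are illustrative only.
ROUTES = [
    Route("/v1/inventory", "https://legacy.example.com"),
    Route("/v1/logistics", "https://legacy.example.com"),
    # A path already moved to the cloud overrides its legacy parent prefix.
    Route("/v1/inventory/forecast", "https://cloud.example.com"),
]


def resolve(path: str, routes: list[Route] = ROUTES) -> str:
    """Return the backend URL for a request path using longest-prefix match."""
    matches = [
        r for r in routes
        if path == r.prefix or path.startswith(r.prefix + "/")
    ]
    if not matches:
        raise LookupError(f"no backend configured for {path}")
    best = max(matches, key=lambda r: len(r.prefix))
    return best.backend + path
```

Migrating one more capability then means adding or changing a single route, with no change visible to dealers or partners calling the API.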

Modernize all CI/CD pipelines to allow developers to deploy container-based workloads in highly scalable environments.

Google Cloud CI/CD - Continuous Integration Continuous Deployment

- Google Cloud provides DevOps tools like Cloud Build and supports open-source tools like Spinnaker to provide CI/CD features.
- Cloud Source Repositories are fully-featured, private Git repositories hosted on Google Cloud.
- Cloud Build is a fully-managed, serverless service that executes builds on Google Cloud Platform’s infrastructure.
- Container Registry is a private container image registry that supports Docker Image Manifest V2 and OCI image formats.
- Artifact Registry is a fully-managed service that supports both container images and non-container artifacts, extending the capabilities of Container Registry.

Allow developers to run experiments without compromising security and governance requirements

- Google Cloud deployments can be configured for Canary or A/B testing to allow experimentation.

Create a self-service portal for internal and partner developers to create new projects, request resources for data analytics jobs, and centrally manage access to the API endpoints.

Use cloud-native solutions for keys and secrets management and optimize for identity-based access

- Google Cloud provides Cloud Key Management Service (KMS) for managing encryption keys and Secret Manager for storing and accessing secrets.

Improve and standardize tools necessary for application and network monitoring and troubleshooting.

- Google Cloud provides Cloud Operations Suite which includes Cloud Monitoring and Logging to cover both on-premises and Cloud resources.
- Cloud Monitoring collects measurements of key aspects of the service and of the Google Cloud resources used.
- Cloud Monitoring Uptime check is a request sent to a publicly accessible IP address on a resource to see whether it responds.
- Cloud Logging is a service for storing, viewing, and interacting with logs.
- Error Reporting aggregates and displays errors produced in the running cloud services.
- Cloud Profiler continuously profiles CPU, heap, and other resource usage to help improve performance and reduce costs.
- Cloud Trace is a distributed tracing system that collects latency data from the applications and displays it in the Google Cloud Console.
- Cloud Debugger helps inspect the state of an application, at any code location, without stopping or slowing down the running app.

Reference Cellular Upload Architecture

Batch Upload Replacement Architecture

Reference

Google Cloud – TerramEarth case study

Google Cloud – Mountkirk Games Case Study

February 27, 2021 ~ Last updated on : February 1, 2022 ~ jayendrapatil


Mountkirk Games makes online, session-based, multiplayer games for mobile platforms. They have recently started expanding to other platforms after successfully migrating their on-premises environments to Google Cloud. Their most recent endeavor is to create a retro-style first-person shooter (FPS) game that allows hundreds of simultaneous players to join a geo-specific digital arena from multiple platforms and locations. A real-time digital banner will display a global leaderboard of all the top players across every active arena.

Solution Concept

Mountkirk Games is building a new multiplayer game that they expect to be very popular. They plan to deploy the game’s backend on Google Kubernetes Engine so they can scale rapidly and use Google’s global load balancer to route players to the closest regional game arenas. In order to keep the global leader board in sync, they plan to use a multi-region Spanner cluster.

So the key here is the company wants to deploy the new game to Google Kubernetes Engine exposed globally using a Global Load Balancer and configured to scale rapidly and bring it closer to the users. Backend DB would be managed using a multi-region Cloud Spanner cluster.

Executive Statement

Our last game was the first time we used Google Cloud, and it was a tremendous success. We were able to analyze player behavior and game telemetry in ways that we never could before. This success allowed us to bet on a full migration to the cloud and to start building all-new games using cloud-native design principles. Our new game is our most ambitious to date and will open up doors for us to support more gaming platforms beyond mobile. Latency is our top priority, although cost management is the next most important challenge. As with our first cloud-based game, we have grown to expect the cloud to enable advanced analytics capabilities so we can rapidly iterate on our deployments of bug fixes and new functionality.

So the key points here are the company has moved to Google Cloud with great success and wants to build new games in the cloud. Key priorities are high performance, low latency, cost, advanced analytics, quick deployment, and time-to-market cycles.

Business Requirements

Support multiple gaming platforms.

Support multiple regions.

Can be handled using a Global HTTP load balancer with GKE in each region.
Can be handled using multi-region Cloud Spanner

Other multi-regional services like Cloud Storage, Cloud Datastore, Cloud Pub/Sub, BigQuery can be used.

Support rapid iteration of game features.

Can be handled using Infrastructure as Code (IaC) tools like Deployment Manager or Terraform to automate infrastructure provisioning

Cloud Build + Cloud Deploy/Spinnaker can be used for rapid continuous integration and deployment

Minimize latency

latency can be reduced using a Global HTTP load balancer, which would route the user to the closest region

using multi-regional resources like Cloud Spanner would also help reduce latency

Optimize for dynamic scaling

can be done using GKE Cluster Autoscaler and Horizontal Pod Autoscaling to dynamically scale the nodes and applications as per the demand

Cloud Spanner can be scaled dynamically

Use managed services and pooled resources.

Using GKE for compute, a Global Load Balancer for traffic, and Cloud Spanner for data would cover the application stack with managed services

Minimize costs.

Using minimal resources and enabling auto-scaling as per the demand would help minimize costs

Existing Technical Environment

The existing environment was recently migrated to Google Cloud, and five games came across using lift-and-shift virtual machine migrations, with a few minor exceptions. Each new game exists in an isolated Google Cloud project nested below a folder that maintains most of the permissions and network policies. Legacy games with low traffic have been consolidated into a single project. There are also separate environments for development and testing.

Key points here are that each new game has its own project under a folder, and that the folder carries most of the permissions and network policies, which the projects inherit. Legacy games with low traffic share a single project. There are also separate environments for development and testing.

Technical Requirements

Dynamically scale based on game activity.

can be done using GKE Cluster Autoscaler and Horizontal Pod Autoscaling to dynamically scale the nodes and applications as per the demand
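How Horizontal Pod Autoscaling decides on a replica count can be shown with the scaling rule from the Kubernetes documentation: the desired count is the current count multiplied by the ratio of the observed metric to its target, rounded up, and no change is made while the ratio stays within a tolerance (10% by default). The sketch below is a simplified model of that rule only. The function name and the min/max defaults are illustrative, and the real controller also handles pod readiness, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 100,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule: ceil(current * metric / target), clamped to bounds."""
    ratio = metric / target
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

For example, 10 pods averaging 90% CPU against a 60% target would be scaled to 15 pods, and the Cluster Autoscaler would then add nodes if those pods cannot be scheduled on the existing ones.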

Publish scoring data on a near-real-time global leaderboard.

can be handled using Pub/Sub for capturing the scoring events and Dataflow for processing them on the fly, i.e., in near real time
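The processing step in such a pipeline boils down to summing points per player across all arenas and keeping the top N. The sketch below shows that aggregation in plain Python so the logic is easy to follow. It does not use the Pub/Sub or Dataflow APIs, and the `ScoreEvent` fields and the `top_players` helper are hypothetical names chosen for illustration.

```python
import heapq
from collections import defaultdict
from typing import Iterable, NamedTuple


class ScoreEvent(NamedTuple):
    arena: str    # regional arena that produced the event
    player: str   # player identifier
    points: int   # points scored in this event


def top_players(events: Iterable[ScoreEvent], n: int = 3) -> list[tuple[str, int]]:
    """Sum points per player across all arenas and return the n highest totals."""
    totals: dict[str, int] = defaultdict(int)
    for event in events:
        totals[event.player] += event.points
    # Highest score first; ties are broken by player name for a stable order.
    return heapq.nsmallest(n, totals.items(), key=lambda kv: (-kv[1], kv[0]))
```

In a streaming pipeline the same sum-then-top-N step would run over a window of incoming events, with the result written to the store that backs the leaderboard.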

Store game activity logs in structured files for future analysis.

can be handled using Cloud Storage to store logs for future analysis
analysis can be handled using BigQuery, either by loading the data or by querying it in Cloud Storage as a federated (external) data source
data can also be stored directly in BigQuery, as it provides low-cost storage for analytics (as compared to Bigtable)

another advantage of BigQuery over Bigtable in this case is that it is multi-regional, which fits the global footprint

Use GPU processing to render graphics server-side for multi-platform support.

Support eventual migration of legacy games to this new platform.

Reference Architecture

Refer to Mobile Gaming Analysis Telemetry solution

Mountkirk Games References

Google Cloud – Mountkirk Games

Google Cloud – Dress4win Case Study

January 14, 2019 ~ Last updated on : May 12, 2021 ~ jayendrapatil

Dress4Win is a web-based company that helps their users organize and manage their personal wardrobe using a web app and mobile application. The company also cultivates an active social network that connects their users with designers and retailers. They monetize their services through advertising, e-commerce, referrals, and a freemium app model. The application has grown from a few servers in the founder’s garage to several hundred servers and appliances in a colocated data center. However, the capacity of their infrastructure is now insufficient for the application’s rapid growth. Because of this growth and the company’s desire to innovate faster, Dress4Win is committing to a full migration to a public cloud.

The key here is that the company wants to migrate completely to a public cloud because its current infrastructure cannot scale with the application's growth.

Solution Concept

For the first phase of their migration to the cloud, Dress4Win is moving their development and test environments. They are also building a disaster recovery site, because their current infrastructure is at a single location. They are not sure which components of their architecture they can migrate as is and which components they need to change before migrating them.

Key here is that Dress4Win wants to move the development and test environments first. They also want to build a DR site for the production site, which would continue to be hosted on-premises.

Executive Statement

Our investors are concerned about our ability to scale and contain costs with our current infrastructure. They are also concerned that a competitor could use a public cloud platform to offset their up-front investment and free them to focus on developing better features. Our traffic patterns are highest in the mornings and weekend evenings; during other times, 80% of our capacity is sitting idle.

Our capital expenditure is now exceeding our quarterly projections. Migrating to the cloud will likely cause an initial increase in spending, but we expect to fully transition before our next hardware refresh cycle. Our total cost of ownership (TCO) analysis over the next 5 years for a public cloud strategy achieves a cost reduction between 30% and 50% over our current model.

The key here is that the company wants to improve on the application scalability, efficiency (hardware sitting idle most of the time), capex cost reduction, and improve TCO over a period of time

Existing Technical Environment

The Dress4Win application is served out of a single data center location. All servers run Ubuntu LTS v16.04.

Databases:

MySQL. 1 server for user data, inventory, static data:
- MySQL 5.8
- 8 core CPUs
- 128 GB of RAM
- 2x 5 TB HDD (RAID 1)
Redis. 3-server cluster for metadata, social graph, caching. Each server is:
- Redis 3.2
- 4 core CPUs
- 32GB of RAM

The MySQL server can be migrated to Cloud SQL, which is the GCP managed relational database service and supports MySQL.
For the Redis cluster, Memorystore can be used, which is a fully-managed in-memory data store service for Redis.
Because both are managed versions of the same engines, the application should need little or no change to use them.

Compute:

40 Web Application servers providing micro-services based APIs and static content.
- Tomcat – Java
- Nginx
- 4 core CPUs
- 32 GB of RAM

20 Apache Hadoop/Spark servers:
- Data analysis
- Real-time trending calculations
- 8 core CPUs
- 128 GB of RAM
- 4x 5 TB HDD (RAID 1)

3 RabbitMQ servers for messaging, social notifications, and events:
- 8 core CPUs
- 32GB of RAM

Miscellaneous servers:
- Jenkins, monitoring, bastion hosts, security scanners
- 8 core CPUs
- 32GB of RAM

Web Application servers with Java and Nginx can be hosted on Compute Engine, App Engine, or Google Kubernetes Engine (formerly Container Engine) with autoscaling configured.
The 4-core, 32 GB combination may call for a custom machine type, or the servers can be tuned to fit a predefined machine type.

Apache Hadoop/Spark servers can be easily migrated to Cloud Dataproc
RabbitMQ is not available as a native managed Google Cloud service, so messaging can be supported in either of two ways:
- Cloud Pub/Sub messaging – however this would need changes to the code and would not be a seamless migration
- Use Compute engine to host the RabbitMQ servers
Jenkins, Bastion hosts, Security scanners can be hosted using Google Compute Engine (GCE)
Monitoring can be provided using Cloud Monitoring (formerly Stackdriver)
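Before mapping these servers to machine types, it can help to total the capacity the case study describes. The sketch below adds up servers, cores, and RAM for the tiers whose server counts are given (the database servers from the previous section are included). The miscellaneous servers are left out because the case study does not say how many there are, and the `Tier` and `totals` names are illustrative.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    name: str
    servers: int   # number of servers in the tier
    cores: int     # CPU cores per server
    ram_gb: int    # RAM per server, in GB


# Figures taken from the case study; miscellaneous servers are omitted
# because their count is not stated.
TIERS = [
    Tier("MySQL", 1, 8, 128),
    Tier("Redis", 3, 4, 32),
    Tier("Web application", 40, 4, 32),
    Tier("Hadoop/Spark", 20, 8, 128),
    Tier("RabbitMQ", 3, 8, 32),
]


def totals(tiers: list[Tier]) -> tuple[int, int, int]:
    """Return (servers, total cores, total RAM in GB) across all tiers."""
    servers = sum(t.servers for t in tiers)
    cores = sum(t.servers * t.cores for t in tiers)
    ram_gb = sum(t.servers * t.ram_gb for t in tiers)
    return servers, cores, ram_gb


if __name__ == "__main__":
    servers, cores, ram_gb = totals(TIERS)
    print(f"{servers} servers, {cores} cores, {ram_gb} GB RAM")
```

These totals describe the on-premises peak footprint. Since the executive statement says 80% of capacity sits idle outside peak hours, an autoscaled cloud deployment would normally run well below them most of the time.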

Storage appliances:

iSCSI for VM hosts
Fiber channel SAN – MySQL databases
- 1 PB total storage; 400 TB available
NAS – image storage, logs, backups
- 100 TB total storage; 35 TB available

iSCSI for VM hosts can be replaced with persistent disks, as it needs block-level storage
SAN for MySQL databases can be replaced with persistent disks, as it also needs block-level storage. However, a single disk cannot scale to 1 PB, so multiple disks would need to be combined to provide the capacity
NAS for image storage, logs and backups can be replaced with Cloud Storage, which provides virtually unlimited storage capacity
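The point about combining disks can be made concrete with a small calculation. The sketch below assumes a per-disk limit of 64 TB, which should be checked against the current Persistent Disk limits for the chosen disk type, and treats 1 PB as 1,000 TB. The `disks_needed` helper is an illustrative name.

```python
import math

# Assumed maximum size of one persistent disk, in TB. Verify against the
# current Compute Engine limits before relying on it.
MAX_DISK_TB = 64


def disks_needed(capacity_tb: float, max_disk_tb: float = MAX_DISK_TB) -> int:
    """Minimum number of disks required to hold capacity_tb of data."""
    return math.ceil(capacity_tb / max_disk_tb)


# SAN figures from the case study: 1 PB total, 400 TB available.
san_total_tb = 1000
san_used_tb = san_total_tb - 400   # 600 TB currently in use

if __name__ == "__main__":
    print("disks for used capacity:", disks_needed(san_used_tb))
    print("disks for full capacity:", disks_needed(san_total_tb))
```

Under that assumption the 600 TB in use would need at least 10 disks and the full 1 PB at least 16, which is a practical argument for sizing to actual usage or for letting Cloud SQL manage database storage.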

Business Requirements

Build a reliable and reproducible environment with scaled parity of production.
- can be handled by provisioning services or using GCP managed services with the same scale as on-premises resources and with Cloud Deployment Manager for creating repeatable deployments
Improve security by defining and adhering to a set of security and Identity and Access Management (IAM) best practices for cloud.
- can be handled using IAM by implementing best practices like least privilege and separating dev/test/production projects to control access
Improve business agility and speed of innovation through rapid provisioning of new resources.
- can be handled using Cloud Deployment Manager for repeatable and automated provisioning of resources
- deployments of applications and new releases can be handled efficiently using rolling updates, A/B testing
Analyze and optimize architecture for performance in the cloud.
- can be handled using auto scaling compute engines based on the demand
- can be handled using Cloud Monitoring (formerly Stackdriver) for monitoring and fine-tuning the specs

Technical Requirements

Easily create non-production environments in the cloud.
- most of the services can be created using GCP managed services and the environment creation can be standardized and automated using templates and configurations

Implement an automation framework for provisioning resources in cloud.
- can be handled using Cloud Deployment Manager, which provides an Infrastructure as Code service for provisioning resources in cloud.
Implement a continuous deployment process for deploying applications to the on-premises datacenter or cloud.
- continuous deployments can be handled using tools like Jenkins available on both the environments
Support failover of the production environment to cloud during an emergency.
- can be handled by replicating all the data to the cloud environment and keeping the ability to provision the servers quickly.
- can be handled by using DNS to repoint from on-premises environment to cloud environment
Encrypt data on the wire and at rest.
- Google Cloud encrypts data in transit and at rest by default. Encryption at rest can use Google-managed keys, customer-managed keys (CMEK), or customer-supplied keys (CSEK)

Support multiple private connections between the production data center and cloud environment.
- can be handled using VPN (multiple VPNs for better performance) or dedicated Interconnect connection between the production data center and the cloud environment