GCP Google Cloud Storage – GCS

  • Google Cloud Storage is a service for storing objects in Google Cloud.
  • Google Cloud Storage provides a RESTful service for storing and accessing your data on Google’s infrastructure.
  • GCS combines the performance and scalability of Google’s cloud with advanced security and sharing capabilities.

Google Cloud Storage Components

Buckets

  • All buckets are associated with a project, and projects can be grouped under an organization.
  • Bucket name requirements
    • must contain only lowercase letters, numbers, dashes (-), underscores (_), and dots (.). Spaces are not allowed. Names containing dots require verification
    • must start and end with a number or letter.
    • must contain 3-63 characters. Names containing dots can contain up to 222 characters, but each dot-separated component can be no longer than 63 characters.
    • cannot be represented as an IP address in dotted-decimal notation, e.g., 192.168.5.4
    • cannot begin with the “goog” prefix.
    • cannot contain “google” or close misspellings, such as “g00gle”.
  • Bucket name considerations
    • reside in a single Cloud Storage namespace.
    • must be unique.
    • are publicly visible.
    • can only be assigned during creation and cannot be changed.
    • can be used in a DNS record as part of a CNAME or A redirect.
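
The naming rules above can be checked mechanically before attempting to create a bucket. The following is a minimal sketch, not an official validator: the function name bucket_name_problems is hypothetical, and the "close misspellings of google" rule is not covered because it cannot be expressed as a simple check.

```python
import ipaddress
import re

# Allowed characters: lowercase letters, digits, dashes, underscores, dots.
_ALLOWED = re.compile(r"^[a-z0-9._-]+$")


def bucket_name_problems(name: str) -> list[str]:
    """Return the rules a candidate bucket name breaks (empty list = passes)."""
    problems = []
    if not _ALLOWED.match(name):
        problems.append("contains characters other than a-z, 0-9, '-', '_', '.'")
    if not (name[:1].isalnum() and name[-1:].isalnum()):
        problems.append("must start and end with a number or letter")
    # Names with dots may be up to 222 characters, otherwise 63.
    max_len = 222 if "." in name else 63
    if not 3 <= len(name) <= max_len:
        problems.append(f"length must be 3-{max_len} characters")
    if "." in name and any(len(part) > 63 for part in name.split(".")):
        problems.append("each dot-separated component must be at most 63 characters")
    try:
        ipaddress.IPv4Address(name)
        problems.append("must not be formatted as an IP address")
    except ValueError:
        pass  # not an IP address, which is what we want
    if name.startswith("goog"):
        problems.append('must not begin with the "goog" prefix')
    if "google" in name:
        problems.append('must not contain "google"')
    return problems


for candidate in ["my-data_bucket.2024", "192.168.5.4", "goog-data"]:
    print(candidate, bucket_name_problems(candidate))
```

A local check like this only catches format errors; uniqueness across the shared namespace can be confirmed only by the service itself.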

Objects

  • An object is a piece of data consisting of a file of any format.
  • Objects are immutable, which means that an uploaded object cannot change throughout its storage lifetime.
  • Objects are stored in containers called buckets.
  • Object names reside in a flat namespace within a bucket, which means
    • Different buckets can have objects with the same name.
    • Objects do not reside within subdirectories in a bucket.
  • Existing objects cannot be renamed directly; instead, the object is copied to the new name and the original is deleted.

Object Metadata

  • Objects stored in Cloud Storage have metadata associated with them
  • Metadata exists as key:value pairs and identifies properties of the object
  • Mutability of metadata varies: some metadata can be edited at any time (e.g. Content-Type, Cache-Control), some can be set only at the time the object is created, and some can only be viewed.

Composite Objects

  • Compose operations are useful for appending to an existing object and for recreating objects that were uploaded as multiple components in parallel.
  • The compose operation requires that the source objects
    • have the same storage class.
    • are stored in the same Cloud Storage bucket.
    • do NOT use customer-managed encryption keys.

GCS Locations

  • GCS buckets need to be created in a location for storing the object data.
  • GCS supports different location types
    • regional
      • A region is a specific geographic place, such as London.
      • helps optimize latency and network bandwidth for data consumers, such as analytics pipelines, that are grouped in the same region.
    • dual-region
      • is a specific pair of regions, such as Finland and the Netherlands.
      • provides higher availability that comes with being geo-redundant.
    • multi-region
      • is a large geographic area, such as the United States, that contains two or more geographic places.
      • allows you to serve content to data consumers that are outside of the Google network and distributed across large geographic areas.
      • provides higher availability that comes with being geo-redundant.
  • Objects stored in a multi-region or dual-region are geo-redundant i.e. data is stored redundantly in at least two separate geographic places separated by at least 100 miles.

GCS Storage Classes

Refer to the blog post Google Cloud Storage – Storage Classes

GCS Requester Pays

  • By default, the project owner of the resource is billed for access, which includes operation charges, network charges and data retrieval charges.
  • However, if the requester provides a billing project with their request, the requester’s project is billed instead.
  • With Requester Pays enabled on the bucket, requesters are required to include a billing project in their requests, so the requester’s project is billed.
  • Enabling Requester Pays is useful, e.g., if you have a lot of data you want to make available to users, but you don’t want to be charged for their access to that data.
  • Requester Pays does not cover storage charges and early deletion charges; these are still billed to the bucket owner’s project.
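
The billing rules above boil down to a small decision, sketched here in plain Python. This is only an illustration of the logic described in this section; the function and parameter names are hypothetical and no Cloud Storage API is called.

```python
from typing import Optional


def billed_project(owner_project: str,
                   requester_pays: bool,
                   billing_project: Optional[str] = None) -> str:
    """Return the project charged for access (operations, network, retrieval)."""
    if billing_project is not None:
        # A billing project supplied with the request is always the one billed.
        return billing_project
    if requester_pays:
        # Requester Pays buckets reject requests that carry no billing project.
        raise PermissionError("bucket requires a billing project in the request")
    return owner_project


print(billed_project("owner-proj", requester_pays=False))
print(billed_project("owner-proj", requester_pays=True, billing_project="reader-proj"))
```

Storage and early deletion charges are outside this function because they stay with the bucket owner regardless of the setting.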

GCS Upload and Download

  • GCS supports upload and storage of any MIME type of data up to 5 TB in size.
  • Uploaded object consists of the data along with any associated metadata
  • GCS supports multiple upload types
    • Simple upload – ideal for small files that can be uploaded again in their entirety if the connection fails, and when there is no object metadata to send as part of the request.
    • Multipart upload – ideal for small files that can be uploaded again in their entirety if the connection fails, and when object metadata needs to be included as part of the request.
    • Resumable upload – ideal for large files with a need for more reliable transfer. Supports streaming transfers, which is a type of resumable upload that allows uploading an object of unknown size.
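
The choice between the three upload types can be sketched as a small helper. The notes do not define what counts as a "small" file, so the SMALL_FILE_LIMIT cutoff below is purely an assumption for illustration, as is the function name.

```python
from typing import Optional

# Hypothetical cutoff for a "small" file; the notes above give no number.
SMALL_FILE_LIMIT = 5 * 1024 * 1024


def choose_upload_type(size_bytes: Optional[int], has_metadata: bool) -> str:
    """Pick an upload type following the guidance in the list above."""
    if size_bytes is None:
        # Unknown size: only a streaming (resumable) transfer fits.
        return "resumable (streaming)"
    if size_bytes > SMALL_FILE_LIMIT:
        return "resumable"
    # Small files: the only difference is whether metadata is sent along.
    return "multipart" if has_metadata else "simple"


print(choose_upload_type(10_000, has_metadata=False))
print(choose_upload_type(10_000, has_metadata=True))
print(choose_upload_type(2 * 1024 ** 3, has_metadata=True))
```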

Resumable upload

  • Resumable uploads are the recommended method for uploading large files, because they don’t need to be restarted from the beginning if there is a network failure while the upload is underway.
  • Resumable upload allows resumption of data transfer operations to Cloud Storage after a communication failure has interrupted the flow of data
  • Resumable uploads work by sending multiple requests, each of which contains a portion of the object you’re uploading.
  • Resumable upload mechanism supports transfers where the file size is not known in advance or for streaming transfer.
  • Resumable upload must be completed within a week of being initiated.
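
To make the resume behaviour concrete, here is a minimal in-memory simulation. FakeUploadSession is a hypothetical stand-in for a real upload session, not a Cloud Storage API; the point is only that after a failure the client asks how many bytes were stored and continues from that offset instead of starting over.

```python
class FakeUploadSession:
    """In-memory stand-in for a resumable upload session (not a real API)."""

    def __init__(self, fail_on_request: int):
        self.received = bytearray()
        self._requests = 0
        self._fail_on_request = fail_on_request

    def send(self, offset: int, chunk: bytes) -> None:
        self._requests += 1
        if self._requests == self._fail_on_request:
            raise ConnectionError("simulated network failure")
        if offset != len(self.received):
            raise ValueError("chunk does not start at the committed offset")
        self.received.extend(chunk)

    def committed_offset(self) -> int:
        return len(self.received)


def resumable_upload(data: bytes, session: FakeUploadSession,
                     chunk_size: int, max_retries: int = 3) -> int:
    """Send data in chunks, resuming after failures; returns retries used."""
    retries = 0
    offset = session.committed_offset()
    while offset < len(data):
        try:
            session.send(offset, data[offset:offset + chunk_size])
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise
        # Ask the session what it actually stored rather than assuming success.
        offset = session.committed_offset()
    return retries


payload = bytes(range(256)) * 10
session = FakeUploadSession(fail_on_request=2)
used = resumable_upload(payload, session, chunk_size=1000)
print("retries:", used, "complete:", bytes(session.received) == payload)
```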

Streaming transfers

  • Cloud Storage supports streaming transfers, which allows streaming data to and from the Cloud Storage account without requiring that the data first be saved to a file.
  • Streaming uploads are useful when uploading data whose final size is not known at the start of the upload, such as when generating the upload data from a process, or when compressing an object on-the-fly.
  • Streaming downloads are useful to download data from Cloud Storage into a process.

Parallel composite uploads

  • Parallel composite uploads divide a file into up to 32 chunks, which are uploaded in parallel to temporary objects; the final object is then recreated from the temporary objects, and the temporary objects are deleted.
  • Parallel composite uploads can be significantly faster if network and disk speed are not limiting factors; however, the final object stored in the bucket is a composite object, which only has a crc32c hash and not an MD5 hash.
  • As a result, crcmod needs to be used to perform integrity checks when downloading the object with gsutil or other Python applications.
  • Parallel composite uploads should be performed only if the following apply:
    • the bucket does not use a default customer-managed encryption key, because the compose operation does not support source objects encrypted in this way.
    • the uploaded objects do not need to have an MD5 hash.
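
The split, compose and CRC32C check can be sketched locally. The code below is a slow pure-Python CRC32C used only for illustration (in practice a compiled library such as crcmod would be used), and the "compose" step is simulated by concatenating bytes rather than calling Cloud Storage. The base64 form assumes the big-endian encoding Cloud Storage uses when reporting crc32c values.

```python
import base64


def _make_crc32c_table() -> list[int]:
    """Build the lookup table for CRC32C (Castagnoli, reflected polynomial)."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def split_into_components(data: bytes, max_components: int = 32) -> list[bytes]:
    """Split data into at most max_components contiguous chunks."""
    if not data:
        return [b""]
    count = min(max_components, len(data))
    size = -(-len(data) // count)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


original = bytes(range(256)) * 100
parts = split_into_components(original)
composed = b"".join(parts)  # stands in for the compose operation

# Integrity check: the composed object must hash to the same CRC32C.
checksum = crc32c(composed)
encoded = base64.b64encode(checksum.to_bytes(4, "big")).decode("ascii")
print(len(parts), "components, crc32c matches:", checksum == crc32c(original), encoded)
```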

Object Versioning

  • Object Versioning retains a noncurrent object version when the live object version gets replaced or deleted
  • Object Versioning increases storage costs as it maintains the current and noncurrent versions of the object, which can be partially mitigated by configuring Object Lifecycle Management to delete older object versions.
  • Noncurrent versions retain the name of the object, but are uniquely identified by their generation number.
  • Noncurrent versions only appear in requests that explicitly call for object versions to be included.
  • Object versions can be permanently deleted by including the generation number in the delete request or by configuring Object Lifecycle Management to delete older object versions.
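
The versioning behaviour above can be modelled with a small in-memory class. VersionedBucket is a hypothetical toy, not a client library; real generation numbers are large service-assigned integers, and a simple counter stands in for them here.

```python
import itertools


class VersionedBucket:
    """Toy model of a bucket with Object Versioning enabled."""

    def __init__(self):
        self._generations = itertools.count(1)
        self._live = {}        # name -> (generation, data)
        self._noncurrent = {}  # name -> [(generation, data), ...]

    def _retire(self, name):
        # The live version becomes noncurrent instead of disappearing.
        if name in self._live:
            self._noncurrent.setdefault(name, []).append(self._live.pop(name))

    def upload(self, name, data):
        self._retire(name)
        generation = next(self._generations)
        self._live[name] = (generation, data)
        return generation

    def delete(self, name, generation=None):
        if generation is None:
            self._retire(name)  # plain delete keeps a noncurrent version
            return
        # Naming a generation permanently deletes that specific version.
        live = self._live.get(name)
        if live is not None and live[0] == generation:
            del self._live[name]
        else:
            kept = [v for v in self._noncurrent.get(name, []) if v[0] != generation]
            self._noncurrent[name] = kept

    def list_objects(self, versions=False):
        result = [(name, gen) for name, (gen, _) in self._live.items()]
        if versions:  # noncurrent versions appear only when asked for
            for name, olds in self._noncurrent.items():
                result.extend((name, gen) for gen, _ in olds)
        return sorted(result)


bucket = VersionedBucket()
first = bucket.upload("report.csv", b"v1")
second = bucket.upload("report.csv", b"v2")
print(bucket.list_objects())
print(bucket.list_objects(versions=True))
```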

Retention policies

  • Retention policy on a bucket ensures that all current and future objects in the bucket cannot be deleted or replaced until they reach the defined age
  • Retention policy can be applied when creating a bucket or to an existing bucket
  • Retention policy retroactively applies to existing objects in the bucket as well as new objects added to the bucket.

Retention policy locks

  • Retention policy locks will lock a retention policy on a bucket, which prevents the policy from ever being removed or the retention period from ever being reduced (although it can be increased)
  • Once a retention policy is locked, the bucket cannot be deleted until every object in the bucket has met the retention period.
  • Locking a retention policy is irreversible
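
The retention and lock rules can be expressed as a few lines of logic. RetentionPolicy below is a hypothetical model written for this note, not part of any client library; it shows that a locked period may grow but never shrink, and that deletion is allowed only once an object reaches the required age.

```python
from datetime import datetime, timedelta, timezone


class RetentionPolicy:
    """Toy model of a bucket retention policy with an irreversible lock."""

    def __init__(self, period: timedelta):
        self.period = period
        self.locked = False

    def lock(self) -> None:
        # There is deliberately no unlock method: locking is irreversible.
        self.locked = True

    def set_period(self, new_period: timedelta) -> None:
        if self.locked and new_period < self.period:
            raise PermissionError("a locked retention period cannot be reduced")
        self.period = new_period

    def can_delete(self, created: datetime, now: datetime) -> bool:
        # An object may be deleted or replaced once it reaches the defined age.
        return now - created >= self.period


policy = RetentionPolicy(timedelta(days=30))
created = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(policy.can_delete(created, created + timedelta(days=10)))
policy.lock()
policy.set_period(timedelta(days=60))  # increasing is still allowed
print(policy.period.days)
```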

Bucket Lock

  • Bucket Lock feature provides immutable storage on Cloud Storage
  • Bucket Lock feature allows configuring a data retention policy for a bucket that governs how long objects in the bucket must be retained
  • Bucket Lock feature also locks the data retention policy, permanently preventing the policy from being reduced or removed.
  • Bucket Lock can help with regulatory and compliance requirements

Object Holds

  • Object holds, when set on individual objects, prevent the object from being deleted or replaced, but still allow its metadata to be edited.
  • Cloud Storage offers the following types of holds:
    • Event-based holds.
    • Temporary holds.
  • When an object is stored in a bucket without a retention policy, both hold types behave exactly the same.
  • When an object is stored in a bucket with a retention policy, the hold types have different effects on the object when the hold is released:
    • An event-based hold resets the object’s time in the bucket for the purposes of the retention period.
    • A temporary hold does not affect the object’s time in the bucket for the purposes of the retention period.
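
The difference between the two hold types is easiest to see by computing the earliest moment an object can be deleted. This is a sketch of the behaviour described above; the function name and the string labels for the hold types are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def earliest_deletion(created: datetime,
                      retention: timedelta,
                      hold_type: Optional[str] = None,
                      released: Optional[datetime] = None) -> datetime:
    """Earliest time an object can be deleted in a bucket with a retention policy."""
    if hold_type is None:
        return created + retention
    if released is None:
        raise ValueError("an object under hold cannot be deleted until release")
    if hold_type == "event-based":
        # Releasing resets the clock: retention is counted from the release.
        return released + retention
    if hold_type == "temporary":
        # The clock kept running during the hold; only the hold itself blocked.
        return max(created + retention, released)
    raise ValueError(f"unknown hold type: {hold_type}")


created = datetime(2024, 1, 1)
retention = timedelta(days=30)
released = datetime(2024, 1, 20)
print(earliest_deletion(created, retention, "event-based", released))
print(earliest_deletion(created, retention, "temporary", released))
```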

Object Lifecycle Management

  • Object Lifecycle Management helps configure transition or expiration of objects based on specified rules, e.g. SetStorageClass to downgrade the storage class or Delete to expire noncurrent objects.
  • Lifecycle management configuration can be applied to a bucket; it contains a set of rules that apply to current and future objects in the bucket.
  • Lifecycle management rules precedence
    • Delete action takes precedence over any SetStorageClass action.
    • If multiple SetStorageClass actions apply, the one that switches the object to the storage class with the lowest at-rest storage pricing takes precedence.
  • Cloud Storage does not validate correctness of the storage class transition
  • Lifecycle actions can be tracked using Cloud Storage usage logs or using Pub/Sub Notifications for Cloud Storage

  • Object Lifecycle Behavior
    • Cloud Storage performs the action asynchronously, so there can be a lag between when the conditions are satisfied and the action is taken
    • Updates to lifecycle configuration may take up to 24 hours to take effect
    • Delete action will not take effect on an object while the object has an object hold placed on it or an unfulfilled retention policy.
    • SetStorageClass action is not affected by the existence of object holds or retention policies.
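
A lifecycle configuration and the precedence rules can be sketched as follows. The dictionary mirrors the JSON layout of a lifecycle configuration (a "rule" list of action/condition pairs), but the specific ages are made-up values, the resolver handles only the "age" condition, and the cost ranking is an assumption based on the ordering of the storage classes by at-rest price.

```python
import json

# Illustrative configuration; the ages are arbitrary example values.
LIFECYCLE_CONFIG = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90}},
        {"action": {"type": "Delete"},
         "condition": {"age": 365}},
    ]
}

# Lower rank = lower at-rest storage price (assumed ordering).
AT_REST_COST_RANK = {"STANDARD": 3, "NEARLINE": 2, "COLDLINE": 1, "ARCHIVE": 0}


def resolve_action(age_days: int, config: dict = LIFECYCLE_CONFIG):
    """Return the single action that wins for an object of the given age."""
    matching = [rule["action"] for rule in config["rule"]
                if age_days >= rule["condition"]["age"]]
    # Delete takes precedence over any SetStorageClass action.
    if any(action["type"] == "Delete" for action in matching):
        return {"type": "Delete"}
    transitions = [a for a in matching if a["type"] == "SetStorageClass"]
    if transitions:
        # Among transitions, the cheapest at-rest class wins.
        return min(transitions, key=lambda a: AT_REST_COST_RANK[a["storageClass"]])
    return None


print(json.dumps(LIFECYCLE_CONFIG, indent=2))
for age in (10, 45, 120, 400):
    print(age, resolve_action(age))
```

The resolver ignores holds and retention policies; as noted above, those block the Delete action but not SetStorageClass.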

GCS Access Control

  • Cloud Storage offers two systems for granting users permission to access the buckets and objects: IAM and Access Control Lists (ACLs)
  • IAM and ACLs can be used on the same resource, in which case Cloud Storage grants the broader of the permissions set on the resource.
  • Cloud Storage access control can be performed using
    • Uniform (recommended)
      • Uniform bucket-level access allows using IAM alone to manage permissions. IAM applies permissions to all the objects contained inside the bucket or groups of objects with common name prefixes.
      • IAM also allows using features that are not available when working with ACLs, such as IAM Conditions and Cloud Audit Logs.
      • Enabling uniform bucket-level access disables ACLs; the change can be reversed within 90 days.
    • Fine-grained
      • Fine-grained option enables using IAM and Access Control Lists (ACLs) together to manage permissions.
      • ACLs are a legacy access control system for Cloud Storage designed for interoperability with Amazon S3.
      • Access and apply permissions can be specified at both the bucket level and per individual object.
  • Objects in the bucket can be made public using the ACL grant AllUsers:R or the IAM binding allUsers:objectViewer (the Storage Object Viewer role).

Signed URLs

  • Signed URLs provide time-limited read or write access to an object through a generated URL.
  • Anyone having access to the URL can access the object for the duration of time specified, regardless of whether or not they have a Google account.

Signed Policy Documents

  • Signed policy documents help specify what can be uploaded to a bucket.
  • Policy documents allow greater control over size, content type, and other upload characteristics than signed URLs, and can be used by website owners to allow visitors to upload files to Cloud Storage.

CORS

  • Cloud Storage allows setting CORS configuration at the bucket level only
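
A bucket-level CORS configuration is a list of entries, each naming allowed origins, methods and response headers. The sketch below builds one such configuration as Python data and checks a request against it; the origin, header and max-age values are made-up examples, and origin_allowed is a hypothetical helper that only approximates how a configuration is matched.

```python
import json

# Example values only; replace the origin and headers with your own.
cors_config = [
    {
        "origin": ["https://example.com"],
        "method": ["GET", "HEAD"],
        "responseHeader": ["Content-Type"],
        "maxAgeSeconds": 3600,
    }
]


def origin_allowed(origin: str, method: str, config: list = cors_config) -> bool:
    """Rough check: does any entry permit this origin and method?"""
    return any(
        (origin in entry["origin"] or "*" in entry["origin"])
        and (method in entry["method"] or "*" in entry["method"])
        for entry in config
    )


print(json.dumps(cors_config, indent=2))
print(origin_allowed("https://example.com", "GET"))
print(origin_allowed("https://other.example", "GET"))
```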

Data Encryption

  • Cloud Storage always encrypts the data on the server side, before it is written to disk, at no additional charge.
  • Cloud Storage supports the following encryption options
    • Server-side encryption: encryption that occurs after Cloud Storage receives the data, but before the data is written to disk and stored.
      • Google-managed encryption keys
        • Cloud Storage always encrypts the data on the server side, before it is written to disk
        • Cloud Storage manages server-side encryption keys using the same hardened key management systems that Google uses for its own encrypted data, including strict key access controls and auditing.
        • Cloud Storage encrypts user data at rest using AES-256.
        • Data is automatically decrypted when read by an authorized user
      • Customer-supplied encryption keys
        • customers create and manage their own encryption keys.
        • Cloud Storage does not permanently store the key on Google’s servers or otherwise manage your key.
        • Customer provides the key for each GCS operation, and the key is purged from Google’s servers after the operation is complete
        • Cloud Storage stores only a cryptographic hash of the key so that future requests can be validated against the hash.
        • The key cannot be recovered from this hash, and the hash cannot be used to decrypt the data.
      • Customer-managed encryption keys
        • customers manage their own encryption keys generated by Cloud Key Management Service (KMS)
    • Client-side encryption: encryption that occurs before data is sent to Cloud Storage, encrypted at client side. This data also undergoes server-side encryption.
  • Cloud Storage supports Transport Layer Security, commonly known as TLS or HTTPS for data encryption in transit
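
For customer-supplied encryption keys, the key and its hash travel with each request. The sketch below prepares the request headers for a locally generated AES-256 key; it assumes the x-goog-encryption-* header names of the JSON and XML APIs, does not send any request, and uses a hypothetical helper name.

```python
import base64
import hashlib
import os


def csek_headers(key: bytes) -> dict[str, str]:
    """Build the headers that carry a customer-supplied key with a request."""
    if len(key) != 32:
        raise ValueError("AES-256 requires a 32-byte key")
    key_hash = hashlib.sha256(key).digest()
    return {
        "x-goog-encryption-algorithm": "AES256",
        "x-goog-encryption-key": base64.b64encode(key).decode("ascii"),
        # Only a hash like this is kept server-side to validate later requests.
        "x-goog-encryption-key-sha256": base64.b64encode(key_hash).decode("ascii"),
    }


key = os.urandom(32)  # keep this safe: losing it means losing the data
headers = csek_headers(key)
print(sorted(headers))
```

Because the service keeps only the hash, the same key has to be supplied again for every later read or rewrite of the object.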

Cloud Storage Tracking Updates

  • Pub/Sub notifications
    • sends information about changes to objects in the buckets to Pub/Sub, where the information is added to a specified Pub/Sub topic in the form of messages.
    • Each notification contains information describing both the event that triggered it and the object that changed.
  • Audit Logs
    • Google Cloud services write audit logs to help you answer the questions, “Who did what, where, and when?”
    • Cloud projects contain only the audit logs for resources that are directly within the project.
    • Cloud Audit Logs generates the following audit logs for operations in Cloud Storage:
      • Admin Activity logs: Entries for operations that modify the configuration or metadata of a project, bucket, or object.
      • Data Access logs: Entries for operations that modify objects or read a project, bucket, or object.
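
A subscriber usually inspects the message attributes to see which event fired and which object changed. The sketch below works on a plain dictionary of attributes, so it runs without Pub/Sub; it assumes the eventType, bucketId, objectId and objectGeneration attribute names, and the sample values are invented.

```python
# Human-readable meaning of each notification event type.
EVENT_DESCRIPTIONS = {
    "OBJECT_FINALIZE": "created or overwritten",
    "OBJECT_METADATA_UPDATE": "metadata updated",
    "OBJECT_DELETE": "deleted",
    "OBJECT_ARCHIVE": "became noncurrent",
}


def describe_notification(attributes: dict) -> str:
    """Summarise a notification from its message attributes."""
    event = attributes["eventType"]
    target = f"gs://{attributes['bucketId']}/{attributes['objectId']}"
    generation = attributes.get("objectGeneration", "unknown")
    what = EVENT_DESCRIPTIONS.get(event, f"unrecognised event {event}")
    return f"{target} (generation {generation}): {what}"


# Invented sample attributes for illustration.
sample = {
    "eventType": "OBJECT_FINALIZE",
    "bucketId": "example-bucket",
    "objectId": "logs/2024-01-01.json",
    "objectGeneration": "1704067200000000",
}
print(describe_notification(sample))
```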

Data Consistency

  • Cloud Storage operations are either strongly consistent or eventually consistent.
  • Cloud Storage provides strong global consistency for the following operations, including both data and metadata:
    • Read-after-write
    • Read-after-metadata-update
    • Read-after-delete
    • Bucket listing
    • Object listing
  • Cloud Storage provides eventual consistency for the following operations
    • Granting access to or revoking access from resources.

GCP Google Cloud Storage – Storage Classes

Google Cloud Storage – Storage Classes

  • The storage class of an object affects its availability and pricing model.
  • Storage class of an existing object can be changed either by rewriting the object or by using Object Lifecycle Management.
  • Bucket’s default storage class is set to Standard Storage if not specified.
  • A default storage class can be specified when a bucket is created; objects added to the bucket inherit this storage class unless explicitly set otherwise.
  • Changing the default storage class of a bucket does not affect any of the objects that already exist in the bucket.

Available storage classes

  • All storage classes provide the following
    • Unlimited storage with no minimum object size.
    • Worldwide accessibility and worldwide storage locations.
    • Low latency (time to first byte typically tens of milliseconds).
    • High durability (99.999999999% annual durability).
    • Geo-redundancy if the data is stored in a multi-region or dual-region.
    • A uniform experience with Cloud Storage features, security, tools, and APIs.

Standard Storage

  • Standard Storage is best for data that is frequently accessed (hot data) and/or stored for only brief periods of time.
  • for regional locations
    • is appropriate for co-locating data with the resources that use it, such as GKE clusters or Compute Engine instances, which helps maximize performance and can reduce network charges.
    • Availability SLA – 99.9%
  • for dual-region,
    • provides optimized performance when accessing Google Cloud products that are located in one of the associated regions,
    • provides improved availability that comes from storing data in geographically separate locations.
  • for multi-region
    • ideal for storing data that is accessed around the world, such as serving website content, streaming videos, executing interactive workloads, or serving data supporting mobile and gaming applications.

Nearline Storage

  • Nearline Storage is a low-cost, highly durable storage service for storing infrequently accessed data (warm data)
  • Nearline Storage is a better choice than Standard Storage in scenarios where slightly lower availability, a 30-day minimum storage duration, and costs for data access are acceptable trade-offs for lowered at-rest storage costs
  • Nearline Storage is ideal for data you plan to read or modify on average once per month or less, e.g., if you want to continuously add files to Cloud Storage and plan to access those files once a month for analysis, Nearline Storage is a great choice.
  • Nearline Storage is also appropriate for data backup, long-tail multimedia content, and data archiving.

Coldline Storage

  • Coldline Storage provides a very-low-cost, highly durable storage service for storing infrequently accessed data (cold data)
  • Coldline Storage is a better choice than Standard Storage or Nearline Storage in scenarios where slightly lower availability, a 90-day minimum storage duration, and higher costs for data access are acceptable trade-offs for lowered at-rest storage costs.
  • Coldline Storage is ideal for data you plan to read or modify at most once a quarter.

Archive Storage

  • Archive Storage is the lowest-cost, highly durable storage service for data archiving, online backup, and disaster recovery (coldest data).
  • Data is available within milliseconds, not hours or days.
  • Archive Storage has no availability SLA, though the typical availability is comparable to Nearline Storage and Coldline Storage.
  • Archive Storage has higher costs for data access and operations, as well as a 365-day minimum storage duration.
  • Archive Storage is the best choice for data that you plan to access less than once a year, e.g. cold data storage for archival and disaster recovery.
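
The guidance for the four storage classes can be condensed into a rough selector. This is a sketch that uses only the access frequencies and minimum storage durations quoted above; the function names are hypothetical and the thresholds are a simplification that ignores pricing, availability and location.

```python
# Minimum storage durations in days, as listed in the sections above.
MIN_STORAGE_DAYS = {"STANDARD": 0, "NEARLINE": 30, "COLDLINE": 90, "ARCHIVE": 365}


def suggest_storage_class(accesses_per_year: float) -> str:
    """Suggest a class from how often the data is read or modified."""
    if accesses_per_year < 1:
        return "ARCHIVE"    # less than once a year
    if accesses_per_year <= 4:
        return "COLDLINE"   # at most once a quarter
    if accesses_per_year <= 12:
        return "NEARLINE"   # about once a month or less
    return "STANDARD"       # frequently accessed (hot) data


def early_deletion_days(storage_class: str, days_stored: int) -> int:
    """Days of minimum storage duration still outstanding if deleted now."""
    return max(0, MIN_STORAGE_DAYS[storage_class] - days_stored)


for rate in (0.5, 4, 12, 365):
    print(rate, suggest_storage_class(rate))
print(early_deletion_days("COLDLINE", 20))
```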

Legacy Storage Classes

  • Google Cloud Storage provided additional storage classes which have been phased out
    • Multi-Regional Storage
      • Equivalent to Standard Storage, except Multi-Regional Storage can only be used for objects stored in multi-regions or dual-regions.
    • Regional Storage
      • Equivalent to Standard Storage, except Regional Storage can only be used for objects stored in regions.
    • Durable Reduced Availability (DRA) Storage:
      • Similar to Standard Storage except:
        • DRA has higher pricing for operations.
        • DRA has lower performance, particularly in terms of availability (DRA has a 99% availability SLA).
