Google Cloud Data Loss Prevention – DLP

  • Cloud Data Loss Prevention (DLP) is a fully managed service that helps discover, classify, and protect sensitive data.
  • DLP inspects data to show what sensitive information it contains, so that decisions about how to secure it are based on actual findings.
  • DLP reduces data risk through de-identification methods such as masking and tokenization.
  • DLP inspects and transforms both structured and unstructured data.

Cloud Data Loss Prevention (DLP) Action

  • A Cloud DLP action is something that occurs after a DLP job completes successfully or, in the case of email notifications, on error.
  • DLP supports the following types of actions
    • Save findings to BigQuery (inspection and risk jobs)
    • Publish to Pub/Sub (inspection and risk jobs)
    • Publish to Security Command Center (inspection jobs)
    • Publish to Data Catalog (inspection jobs)
    • Publish to Google Cloud’s operations suite (inspection jobs)
    • Notify by email (inspection and risk jobs)
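As a rough illustration of how such actions are attached to a job, the sketch below builds the action list for an inspection job as plain Python dictionaries. The field names (`save_findings`, `pub_sub`, `job_notification_emails`) are meant to mirror the DLP API's Action message and should be checked against the API reference; the project, dataset, table, and topic names are hypothetical placeholders.

```python
def build_inspect_job_actions(project_id, dataset_id, table_id, topic_id,
                              notify_by_email=True):
    """Return a list of action dicts to run when an inspection job finishes."""
    actions = [
        # Save detailed findings to a BigQuery table.
        {
            "save_findings": {
                "output_config": {
                    "table": {
                        "project_id": project_id,
                        "dataset_id": dataset_id,
                        "table_id": table_id,
                    }
                }
            }
        },
        # Publish a completion notification to a Pub/Sub topic.
        {"pub_sub": {"topic": f"projects/{project_id}/topics/{topic_id}"}},
    ]
    if notify_by_email:
        # Email notification; the message body takes no extra settings here.
        actions.append({"job_notification_emails": {}})
    return actions


# Hypothetical resource names, for illustration only.
actions = build_inspect_job_actions("my-project", "dlp_results", "findings", "dlp-done")
```

The resulting list would typically be placed under the `actions` field of the job configuration.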

Cloud Data Loss Prevention Key Concepts

  • Classification is the process of inspecting data to determine what data you have, how sensitive it is, and how likely it is that each finding matches a given type of sensitive data.
  • De-identification is the process of removing identifying information from data.
  • De-identification techniques supported by Cloud DLP
    • Redaction: Deletes all or part of a detected sensitive value.
    • Replacement: Replaces a detected sensitive value with a specified surrogate value.
    • Masking: Replaces a number of characters of a sensitive value with a specified surrogate character, such as a hash (#) or asterisk (*).
    • Pseudonymization: Replaces sensitive data values with cryptographically generated tokens.
    • Generalization: Abstracts a distinguishing value into a more general, less distinguishing value. Generalization attempts to preserve data utility while reducing the identifiability of the data.
    • Bucketing: “Generalizes” a sensitive value by replacing it with a range of values. (For example, replacing a specific age with an age range, or temperatures with ranges corresponding to “Hot,” “Medium,” and “Cold.”)
    • Date shifting: Shifts sensitive date values by a random amount of time.
    • Time extraction: Extracts or preserves specified portions of date and time values.
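To make a few of these techniques concrete, here is a minimal local sketch in plain Python. It does not call Cloud DLP; the helper names (`redact`, `replace_with`, `mask`, `pseudonymize`, `bucket_age`) are hypothetical, and the keyed hash is only a stand-in for the service's cryptographic tokenization.

```python
import hashlib
import hmac


def redact(text, value):
    """Redaction: delete the detected sensitive value."""
    return text.replace(value, "")


def replace_with(text, value, surrogate):
    """Replacement: swap the detected value for a surrogate string."""
    return text.replace(value, surrogate)


def mask(value, masking_char="#", number_to_mask=None):
    """Masking: overwrite the first N characters (all of them by default)."""
    n = len(value) if number_to_mask is None else min(number_to_mask, len(value))
    return masking_char * n + value[n:]


def pseudonymize(value, key):
    """Pseudonymization: deterministic keyed token (illustrative stand-in)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def bucket_age(age, size=10):
    """Bucketing: replace an exact age with the range that contains it."""
    low = (age // size) * size
    return f"{low}-{low + size - 1}"


masked_card = mask("4111111111111111", number_to_mask=12)  # "############1111"
age_range = bucket_age(37)                                  # "30-39"
```

Because `pseudonymize` is deterministic for a given key, the same input always yields the same token, which keeps joins across tables possible without exposing the raw value.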

Cloud Data Loss Prevention InfoTypes

  • Cloud Data Loss Prevention (DLP) uses information types – or infoTypes – to define what it scans for.
  • An infoType is a type of sensitive data, such as a name, email address, telephone number, identification number, credit card number, and so on.
  • An infoType detector is the corresponding detection mechanism that matches an infoType’s matching criteria.
  • Cloud DLP uses infoType detectors in the configuration for its scans to determine what to inspect for and how to transform findings. InfoType names are also used when displaying or reporting scan results.
  • Cloud DLP supports the following infoType detectors
    • Built-in infoType detectors, which are specified by name and include detectors for country- or region-specific sensitive data types as well as globally applicable data types.
    • Custom infoType detectors, defined by you
      • Small custom dictionary detectors
        • simple word lists that Cloud DLP matches on
        • ideal for lists of up to several tens of thousands of words or phrases
        • preferred if the word list doesn’t change significantly.
      • Large custom dictionary detectors
        • are generated by Cloud DLP using large lists of words or phrases stored in either Cloud Storage or BigQuery.
        • ideal for a large list of words or phrases—up to tens of millions.
      • Regular expressions (regex) detectors
        • enable Cloud DLP to detect matches based on a regex pattern.
  • Cloud DLP supports inspection rules to fine-tune scan results:
    • Exclusion rules decrease the number of findings returned; they are added to a built-in or custom infoType detector.
    • Hotword rules increase the number of findings returned, or change their likelihood value; they are added to a built-in or custom infoType detector.
  • DLP uses a bucketized representation of likelihood, which is intended to indicate how likely it is that a piece of data matches a given infoType
    • LIKELIHOOD_UNSPECIFIED: Default value; same as POSSIBLE.
    • VERY_UNLIKELY: very unlikely that the data matches the given infoType.
    • UNLIKELY: unlikely that the data matches the given infoType.
    • POSSIBLE: possible that the data matches the given infoType.
    • LIKELY: likely that the data matches the given infoType.
    • VERY_LIKELY: very likely that the data matches the given infoType.
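The interplay of a regex detector, an exclusion rule, a hotword rule, and a minimum likelihood can be sketched locally as follows. This is a simplified imitation, not Cloud DLP's matching engine: the employee-ID pattern, the hotword, the 20-character proximity window, and the excluded value are all assumptions chosen for illustration.

```python
import re
from enum import IntEnum


class Likelihood(IntEnum):
    VERY_UNLIKELY = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5


# Hypothetical custom infoType: a six-digit employee ID.
EMPLOYEE_ID = re.compile(r"\b\d{6}\b")
# Hypothetical hotword that raises likelihood when found just before a match.
HOTWORD = re.compile(r"employee id", re.IGNORECASE)
HOTWORD_WINDOW = 20  # number of characters before the match to search
# Hypothetical exclusion rule: placeholder values that are never findings.
EXCLUDED = {"000000"}


def inspect(text, min_likelihood=Likelihood.POSSIBLE):
    """Return (quote, likelihood name) pairs at or above min_likelihood."""
    findings = []
    for match in EMPLOYEE_ID.finditer(text):
        quote = match.group()
        if quote in EXCLUDED:
            continue  # exclusion rule: drop the finding
        likelihood = Likelihood.POSSIBLE
        window = text[max(0, match.start() - HOTWORD_WINDOW):match.start()]
        if HOTWORD.search(window):
            likelihood = Likelihood.VERY_LIKELY  # hotword rule: raise likelihood
        if likelihood >= min_likelihood:
            findings.append((quote, likelihood.name))
    return findings


sample = "Employee ID: 123456, order 654321, test 000000"
all_findings = inspect(sample)
strict_findings = inspect(sample, min_likelihood=Likelihood.LIKELY)
```

With the default threshold both `123456` and `654321` are reported, while raising the threshold to LIKELY keeps only the match that sits next to the hotword.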

DLP Classification and De-identification

  • Cloud DLP can classify, redact, and de-identify sensitive data contained in text-based content and images, including content stored in Google Cloud storage repositories.
  • Text Classification and Redaction
    • Text Classification returns classification findings
    • Automatic Text Redaction produces an output in which sensitive data matches are replaced by a placeholder such as ***
  • Image Classification and Redaction
    • DLP uses Optical Character Recognition (OCR) technology to recognize text prior to classification. Similar to text classification, it returns findings, but it also adds a bounding box where the text was found.
    • Inspection
      • Cloud DLP inspects the submitted base64-encoded image for the specified infoTypes.
      • It returns the detected infoTypes, along with one or more sets of pixel coordinates and dimensions.
      • Each set of pixel coordinate and dimension values indicates the bottom-left corner and the dimensions of a bounding box, respectively
      • Each bounding box corresponds to all or part of a Cloud DLP finding.
    • Redaction
      • Cloud DLP redacts any sensitive data findings by masking them with opaque rectangles.
      • It returns the redacted base64-encoded image in the same image format as the original image.
      • Color of the redaction boxes can be configured in the request.
  • Storage classification
    • scans data stored in Cloud Storage, Datastore, and BigQuery
    • supports scanning of binary, text, image, Microsoft Word, PDF, and Apache Avro files
    • unrecognized file types are scanned as binary files.
  • Dates, if considered PII, can be handled
    • Using generalization, or bucketing, which can reduce the utility of the dates, e.g. when all dates are generalized to just the year
    • Using date shifting, which obfuscates a set of dates by shifting them by a random amount while preserving their sequence and the duration of the periods between them.
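The property that date shifting preserves sequence and duration can be shown with a small local sketch. It assumes one offset per record context (for example, per patient), derived from a secret key so that the shift is repeatable; the function names, the `patient-42` context, and the 100-day bound are illustrative assumptions, not Cloud DLP's implementation.

```python
import hashlib
import hmac
from datetime import date, timedelta


def shift_for(context_id, key, max_days=100):
    """Derive a repeatable offset in [-max_days, max_days] for one context."""
    digest = hmac.new(key, context_id.encode("utf-8"), hashlib.sha256).digest()
    span = 2 * max_days + 1
    return int.from_bytes(digest[:4], "big") % span - max_days


def shift_dates(dates, context_id, key, max_days=100):
    """Shift every date of the same context by the same number of days."""
    delta = timedelta(days=shift_for(context_id, key, max_days))
    return [d + delta for d in dates]


# Hypothetical visit dates for one patient.
visits = [date(2023, 1, 10), date(2023, 1, 17), date(2023, 3, 1)]
shifted = shift_dates(visits, "patient-42", b"secret-key")

# Intervals between consecutive dates are unchanged by the shift.
original_gaps = [b - a for a, b in zip(visits, visits[1:])]
shifted_gaps = [b - a for a, b in zip(shifted, shifted[1:])]
```

Because every date in the same context moves by the same offset, the order of events and the gaps between them survive, while the absolute dates no longer match the originals unless the offset happens to be zero.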

DLP Re-identification Risk Analysis

  • DLP Re-identification risk analysis is the process of analyzing sensitive data to find properties that might increase the risk of subjects being identified, or of sensitive information about individuals being revealed.
  • Risk analysis methods can be used before de-identification to help determine an effective de-identification strategy, or after de-identification to monitor for any changes or outliers.
  • Re-identification is the process of matching up de-identified data with other available data to determine the person to whom the data belongs.
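One widely used measure of re-identification risk is k-anonymity, which Cloud DLP risk analysis can compute alongside other metrics. The sketch below computes it locally for a small table, assuming the caller already knows which columns act as quasi-identifiers; the column names and rows are made up for illustration.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return (k, group sizes) for the given quasi-identifier columns.

    k is the size of the smallest group of rows sharing the same
    quasi-identifier values; a small k means some subjects are easy to single out.
    """
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    k = min(groups.values()) if groups else 0
    return k, groups


# Hypothetical de-identified records.
rows = [
    {"zip": "94103", "age_range": "30-39", "diagnosis": "A"},
    {"zip": "94103", "age_range": "30-39", "diagnosis": "B"},
    {"zip": "94103", "age_range": "40-49", "diagnosis": "A"},
]
k, groups = k_anonymity(rows, ["zip", "age_range"])
```

Here `k` is 1 because the third row is the only one in its group, which suggests that further generalization (for example, wider age ranges) would be needed before release.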

Cloud Data Loss Prevention Templates

  • DLP Templates help decouple configuration information from the implementation of the requests
  • Templates provide a robust way to manage large-scale rollouts of Cloud DLP capabilities.
  • Cloud DLP supports two types of templates:
    • Inspection Templates: Templates for saving configuration information for inspection scan jobs, including what predefined or custom detectors to use.
    • De-identification Templates: Templates for saving configuration information for de-identification jobs, including both infoType and structured dataset transformations.
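The decoupling that templates provide can be sketched with plain dictionaries: the configuration lives in the templates, and a request only refers to them by resource name. The field names below are intended to mirror the DLP API (`inspect_config`, `deidentify_config`, `inspect_template_name`, `deidentify_template_name`) and should be verified against the API reference; the display names, project ID, and template IDs are hypothetical.

```python
def build_templates():
    """Return (inspect_template, deidentify_template) configuration dicts."""
    inspect_template = {
        "display_name": "contact-info-inspection",
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
            "min_likelihood": "LIKELY",
        },
    }
    deidentify_template = {
        "display_name": "mask-contact-info",
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        "primitive_transformation": {
                            "character_mask_config": {"masking_character": "#"}
                        }
                    }
                ]
            }
        },
    }
    return inspect_template, deidentify_template


def build_deidentify_request(project_id, inspect_template_name,
                             deidentify_template_name, text):
    """Build a request that references stored templates instead of inline config."""
    return {
        "parent": f"projects/{project_id}",
        "inspect_template_name": inspect_template_name,
        "deidentify_template_name": deidentify_template_name,
        "item": {"value": text},
    }


inspect_template, deidentify_template = build_templates()
request = build_deidentify_request(
    "my-project",
    "projects/my-project/inspectTemplates/contact-info",
    "projects/my-project/deidentifyTemplates/mask-contact-info",
    "Reach me at jane@example.com",
)
```

Since the request carries only template names, the detectors or transformations can be changed in one place without touching the code that issues requests.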
