Google Cloud Sensitive Data Protection (formerly Cloud DLP)

📢 Service Rebranding (2023): Cloud Data Loss Prevention (Cloud DLP) is now part of Sensitive Data Protection, a family of services designed to help discover, classify, and protect sensitive data. The API name remains the same: Cloud Data Loss Prevention API (DLP API).

Sensitive Data Protection (formerly Cloud DLP) is a fully managed service designed to help discover, classify, and protect the most sensitive data inside and outside Google Cloud.
It inspects your data so that you can understand what you hold and make informed decisions about how to secure it.

It reduces data risk with de-identification methods such as masking, tokenization, and encryption.
It inspects and transforms both structured and unstructured data.
Sensitive Data Protection is integrated into the Security Command Center Enterprise risk engine for continuous data monitoring, high-value asset identification, vulnerability analysis, and attack scenario simulation.
It supports data in Google Cloud (BigQuery, Cloud Storage, Cloud SQL, Datastore), multicloud (Amazon S3, Azure Blob Storage), and external sources via hybrid jobs.

Sensitive Data Protection Services

Sensitive Data Protection is a family of services comprising three core capabilities:
- Discovery – Automatically discovers, classifies, and profiles data across organizations, folders, or projects.
- Inspection – Detects and classifies sensitive data in content, storage, and images.
- De-identification – Transforms sensitive data to reduce risk while retaining utility.

Sensitive Data Protection Discovery Service

The discovery service continuously and automatically discovers, classifies, and profiles data across an organization.
Discovery helps you understand the location and nature of your data, including shadow data that might not be subject to proper governance.

Discovery generates data profiles at the project, table, and column levels, providing:
- Data classifications (infoTypes detected)
- Sensitivity levels and data risk levels
- Data size and shape
- Data security posture
- Estimated null proportion and uniqueness (at column level)

Supported Data Sources:
- BigQuery (GA since April 2022) – including BigLake tables
- Cloud SQL (GA since March 2024)
- Cloud Storage (GA since June 2024)
- Amazon S3 (Sept 2024) – requires Security Command Center Enterprise with AWS connector
- Azure Blob Storage (April 2025) – requires Security Command Center Enterprise with Azure connector
- Vertex AI datasets (GA Feb 2025) – profiles training and tuning data for sensitivity
- Vertex AI tuning jobs (Feb 2026)

Discovery can automatically attach tags to resources based on calculated data sensitivity, enabling IAM conditions to grant or deny access based on data sensitivity.
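
As a rough illustration of the tag-based access pattern described above, the sketch below builds an IAM role binding whose condition uses the CEL function resource.matchTag. The organization ID, the tag key name (sensitivity-level), the tag value, the role, and the group are all hypothetical placeholders; discovery would need to be configured to attach a matching tag for the condition to have any effect.

```python
import json


def binding_excluding_tag(role, member, tag_key, tag_value):
    """Build an IAM binding that applies only to resources NOT carrying the given tag value."""
    return {
        "role": role,
        "members": [member],
        "condition": {
            "title": f"exclude-{tag_value}-sensitivity",
            "description": "Grant access only to resources not tagged with this sensitivity value.",
            # resource.matchTag(namespaced key, value short name) is a CEL function
            # available in IAM conditions; the key and value here are placeholders.
            "expression": f"!resource.matchTag('{tag_key}', '{tag_value}')",
        },
    }


binding = binding_excluding_tag(
    role="roles/bigquery.dataViewer",
    member="group:analysts@example.com",            # hypothetical group
    tag_key="123456789012/sensitivity-level",       # hypothetical org ID and tag key
    tag_value="high",                               # hypothetical tag value
)
print(json.dumps(binding, indent=2))
```

With this negated form, resources that have not yet been profiled and tagged also satisfy the condition, so a stricter policy might instead require an explicit low or moderate tag.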

Discovery can detect unencrypted secrets (passwords, authentication tokens) in Cloud Functions and Cloud Run environment variables, reporting findings to Security Command Center.
Discovery findings can be published to Security Command Center, Dataplex Universal Catalog (Knowledge Catalog), and BigQuery.
Discovery supports a subscription pricing mode for predictable costs regardless of data growth.

Scan configurations are manageable via Terraform (since June 2024).

Data Security Posture Management (DSPM)

DSPM provides a data-centric view of Google Cloud security, integrated with Security Command Center.
DSPM lets you continuously identify and reduce data risk by understanding what sensitive data you have, where it is stored, and if its use aligns with security and compliance requirements.
DSPM starts with a data map that provides a bird's-eye view of data across your environment, its sensitivity level, and its default security posture.

Sensitive Data Protection discovery feeds sensitivity signals into DSPM for default data risk assessment.

Sensitive Data Protection Actions

An action runs after a Sensitive Data Protection job completes successfully; email notifications are also sent when a job fails.

Supported action types:
- Save findings to BigQuery (inspection and risk jobs)
- Save findings to Cloud Storage (inspection jobs – added Aug 2025)
- Publish to Pub/Sub (inspection and risk jobs)
- Publish to Security Command Center (risk jobs and discovery)
- Add Dataplex Catalog aspects based on insights from data profiles (replaced Data Catalog integration)
- Publish to Google Cloud’s operations suite (risk jobs)
- Notify by email (inspection and risk jobs)
- Send inspection results to Dataplex Universal Catalog (Knowledge Catalog) as aspects (Sept 2025)
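
To make the idea of actions more concrete, here is a minimal sketch of the JSON body for creating an inspection job (a POST to the dlpJobs collection of the DLP REST API) with two actions attached: saving findings to BigQuery and publishing a notification to Pub/Sub. The bucket, table, and topic names are hypothetical, and the sketch only builds the body; it does not send it.

```python
import json


def build_inspect_job(bucket_url, info_types, bq_table, pubsub_topic):
    """Return a dlpJobs create body for a Cloud Storage inspection job with two actions."""
    return {
        "inspectJob": {
            # What to scan: every object matched by the Cloud Storage URL.
            "storageConfig": {"cloudStorageOptions": {"fileSet": {"url": bucket_url}}},
            # What to look for.
            "inspectConfig": {"infoTypes": [{"name": name} for name in info_types]},
            # What to do once the job finishes.
            "actions": [
                {"saveFindings": {"outputConfig": {"table": bq_table}}},
                {"pubSub": {"topic": pubsub_topic}},
            ],
        }
    }


job_body = build_inspect_job(
    bucket_url="gs://example-bucket/**",  # hypothetical bucket
    info_types=["EMAIL_ADDRESS", "CREDIT_CARD_NUMBER"],
    bq_table={"projectId": "my-project", "datasetId": "dlp_results", "tableId": "findings"},
    pubsub_topic="projects/my-project/topics/dlp-job-done",
)
print(json.dumps(job_body, indent=2))
```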

⚠️ Deprecated Action: “Publish to Data Catalog” was deprecated on September 30, 2025. Data Catalog was discontinued on January 30, 2026, replaced by Knowledge Catalog (formerly Dataplex Universal Catalog). Use “Add Dataplex Catalog aspects” instead.

Sensitive Data Protection Key Concepts

Classification is the process of inspecting data to determine what data you have, how sensitive it is, and how likely each finding is to be a true match (its likelihood).
De-identification is the process of removing or transforming identifying information in data.

De-identification techniques supported:
- Redaction: Deletes all or part of a detected sensitive value.
- Replacement: Replaces a detected sensitive value with a specified surrogate value.
- Masking: Replaces a number of characters of a sensitive value with a specified surrogate character, such as a hash (#) or asterisk (*).
- Pseudonymization/Tokenization: Replaces sensitive data values with cryptographically generated tokens. Supports:
  - CryptoReplaceFfxFpeConfig – Format-preserving encryption
  - CryptoDeterministicConfig – Deterministic encryption for consistent tokenization
  - CryptoHashConfig – One-way cryptographic hashing
- Built-in Tokenization (Jan 2025): Serverless API endpoints for on-the-fly tokenization without managing third-party deployments, hardware, or VMs. Fully regionalized for compliance.
- Generalization: Abstracts a distinguishing value into a more general, less distinguishing value to preserve data utility while reducing identifiability.
- Bucketing: “Generalizes” a sensitive value by replacing it with a range of values. (For example, replacing a specific age with an age range.)
- Date shifting: Shifts sensitive date values by a random amount of time.
- Time extraction: Extracts or preserves specified portions of date and time values.
- Dictionary replacement: Replaces each detected sensitive value with a random value from a provided word list.
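
As one example of how these techniques are expressed, the sketch below builds a JSON body for the content:deidentify REST method that applies character masking to detected findings. The project ID and sample text are placeholders, and the sketch only constructs the request; sending it with an authenticated HTTP client is left out.

```python
import json

# REST path for synchronous de-identification; project and location are filled in by the caller.
DEIDENTIFY_URL = (
    "https://dlp.googleapis.com/v2/projects/{project}/locations/{location}/content:deidentify"
)


def build_mask_request(text, info_types, masking_character="#", number_to_mask=None):
    """Return a content:deidentify body that masks findings of the given infoTypes."""
    mask_config = {"maskingCharacter": masking_character}
    if number_to_mask is not None:
        # When numberToMask is omitted, the whole matched value is masked.
        mask_config["numberToMask"] = number_to_mask
    info_type_refs = [{"name": name} for name in info_types]
    return {
        "item": {"value": text},
        "inspectConfig": {"infoTypes": info_type_refs},
        "deidentifyConfig": {
            "infoTypeTransformations": {
                "transformations": [
                    {
                        "infoTypes": info_type_refs,
                        "primitiveTransformation": {"characterMaskConfig": mask_config},
                    }
                ]
            }
        },
    }


body = build_mask_request("My email is jane@example.com", ["EMAIL_ADDRESS"], "*")
print(DEIDENTIFY_URL.format(project="my-project", location="global"))
print(json.dumps(body, indent=2))
```

Swapping characterMaskConfig for another primitive transformation (for example cryptoReplaceFfxFpeConfig or dateShiftConfig) follows the same shape, although each has its own required fields.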

Sensitive Data Protection InfoTypes

Sensitive Data Protection uses information types — or infoTypes — to define what it scans for.

An infoType is a type of sensitive data, such as a name, email address, telephone number, identification number, credit card number, and so on.
An infoType detector is the corresponding detection mechanism that matches an infoType’s matching criteria.
InfoType detectors are used in scan configurations to determine what to inspect for and how to transform findings.

Supported infoType detector types:
- Built-in infoType detectors — specified by name, include country-specific and globally applicable data types. Categories include:
  - PII (names, addresses, phone numbers, dates of birth)
  - Financial (credit cards, bank accounts, tax IDs)
  - Credentials & secrets (API keys, passwords, auth tokens, encryption keys)
  - Medical data (medical records, health info)
  - Government IDs (passports, driver’s licenses, national IDs)
  - Document type classifiers (finance, legal, medical, R&D, source code)
  - Image/Object infoTypes (faces, signatures, passports, photo IDs, barcodes, license plates, persons, whiteboards)
  - Image safety classifiers (violence, sexually explicit/suggestive content)
- General infoType detectors — broader categories (e.g., CREDIT_CARD_DATA, MEDICAL_DATA, GOVERNMENT_ID) that cover multiple specific infoTypes.
- Custom infoType detectors, defined by you:
  - Small custom dictionary detectors — simple word lists, ideal for several tens of thousands of words or phrases.
  - Large custom dictionary detectors — generated from large lists stored in Cloud Storage or BigQuery, ideal for up to tens of millions of words.
  - Regular expressions (regex) detectors — detect matches based on regex patterns.
  - Custom metadata label detectors (March 2026) – detect specific client-provided metadata labels.

Inspection rules to fine-tune scan results:
- Exclusion rules — decrease findings (including ExcludeByHotword for column-name-based exclusion and image-based exclusion rules).
- Hotword rules — increase quantity or change likelihood of findings.
- Adjustment rules (GA Feb 2026) – customize likelihood based on context, supporting both text and image operations.

InfoType versioning – control which detection model to use via InfoType.version (latest, stable, or legacy).
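
The detector types and inspection rules above come together in the inspectConfig part of a request. The following is a minimal sketch of such a config in REST JSON form, combining a built-in infoType, a custom regex detector, a hotword rule, and an exclusion rule. The EMPLOYEE_ID infoType, its pattern, the hotword, and the excluded value are invented for illustration.

```python
import json
import re

EMPLOYEE_ID_PATTERN = r"E-\d{6}"  # hypothetical ID format


def build_inspect_config():
    """Return an inspectConfig with a custom regex detector and two rules attached to it."""
    employee_id = {"name": "EMPLOYEE_ID"}  # hypothetical custom infoType
    return {
        "infoTypes": [{"name": "EMAIL_ADDRESS"}],  # built-in detector
        "customInfoTypes": [
            {
                "infoType": employee_id,
                "regex": {"pattern": EMPLOYEE_ID_PATTERN},
                "likelihood": "POSSIBLE",
            }
        ],
        "ruleSet": [
            {
                "infoTypes": [employee_id],
                "rules": [
                    {
                        # Raise likelihood when "employee id" appears shortly before the match.
                        "hotwordRule": {
                            "hotwordRegex": {"pattern": "(?i)employee id"},
                            "proximity": {"windowBefore": 30},
                            "likelihoodAdjustment": {"fixedLikelihood": "VERY_LIKELY"},
                        }
                    },
                    {
                        # Drop a known placeholder value from the findings.
                        "exclusionRule": {
                            "dictionary": {"wordList": {"words": ["E-000000"]}},
                            "matchingType": "MATCHING_TYPE_FULL_MATCH",
                        }
                    },
                ],
            }
        ],
    }


inspect_config = build_inspect_config()
# Local sanity check of the pattern only; the service evaluates the regex itself.
pattern_matches = re.fullmatch(EMPLOYEE_ID_PATTERN, "E-123456") is not None
print(json.dumps(inspect_config, indent=2))
```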

Likelihood levels used for confidence scoring:
- LIKELIHOOD_UNSPECIFIED: Default value; same as POSSIBLE.
- VERY_UNLIKELY
- UNLIKELY
- POSSIBLE
- LIKELY
- VERY_LIKELY
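
Because the likelihood levels are ordered, a minimum-likelihood threshold behaves like a simple comparison. The sketch below illustrates that ordering locally; it is not the service's implementation, and the sample findings are made up.

```python
from enum import IntEnum


class Likelihood(IntEnum):
    """Likelihood levels in increasing order of confidence."""
    LIKELIHOOD_UNSPECIFIED = 0
    VERY_UNLIKELY = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5


def normalize(level):
    """Treat an unspecified threshold as POSSIBLE, as described in the list above."""
    return Likelihood.POSSIBLE if level == Likelihood.LIKELIHOOD_UNSPECIFIED else Likelihood(level)


def filter_findings(findings, min_likelihood=Likelihood.LIKELIHOOD_UNSPECIFIED):
    """Keep only findings whose likelihood meets the threshold."""
    threshold = normalize(min_likelihood)
    return [f for f in findings if Likelihood[f["likelihood"]] >= threshold]


sample_findings = [
    {"infoType": "EMAIL_ADDRESS", "likelihood": "VERY_LIKELY"},
    {"infoType": "PERSON_NAME", "likelihood": "POSSIBLE"},
    {"infoType": "PHONE_NUMBER", "likelihood": "UNLIKELY"},
]
default_kept = filter_findings(sample_findings)
strict_kept = filter_findings(sample_findings, Likelihood.LIKELY)
print([f["infoType"] for f in default_kept])
print([f["infoType"] for f in strict_kept])
```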

Classification and De-identification

Sensitive Data Protection can classify, redact, and de-identify sensitive data contained in text-based content, images, and content stored in Google Cloud storage repositories.

Text Classification and De-identification
- Text Classification returns classification findings.
- Text Redaction produces output with sensitive data removed using a placeholder.
- Conversational content inspection (June 2026) — supports inspecting and de-identifying conversational content via the Conversation ContentItem type.
- Batched content inspection (June 2026) — supports inspecting and de-identifying batched content via BatchContentItem.
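
To show what redaction with a placeholder amounts to, the sketch below applies a list of classification findings to a string locally, replacing each match with its infoType name. The findings mimic the REST JSON shape (infoType.name plus location.codepointRange), with the end offset assumed to be exclusive; in practice the service can perform this replacement itself, so this is only an illustration for runtime use cases such as cleaning a prompt.

```python
def _codepoint_range(finding):
    """Extract (start, end) as ints; REST JSON encodes int64 as strings and may omit a zero start."""
    rng = finding["location"]["codepointRange"]
    return int(rng.get("start", 0)), int(rng["end"])


def redact_with_info_type(text, findings):
    """Replace each finding with [INFO_TYPE]; assumes findings do not overlap."""
    result = text
    # Work from the end of the string so earlier offsets stay valid.
    for finding in sorted(findings, key=_codepoint_range, reverse=True):
        start, end = _codepoint_range(finding)
        result = result[:start] + f"[{finding['infoType']['name']}]" + result[end:]
    return result


prompt = "Contact me at jane@example.com or 555-0100."
email, phone = "jane@example.com", "555-0100"
# Hand-built findings standing in for an inspection response.
demo_findings = [
    {
        "infoType": {"name": "EMAIL_ADDRESS"},
        "location": {"codepointRange": {
            "start": str(prompt.index(email)),
            "end": str(prompt.index(email) + len(email)),
        }},
    },
    {
        "infoType": {"name": "PHONE_NUMBER"},
        "location": {"codepointRange": {
            "start": str(prompt.index(phone)),
            "end": str(prompt.index(phone) + len(phone)),
        }},
    },
]
redacted = redact_with_info_type(prompt, demo_findings)
print(redacted)
```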

Image Classification and Redaction
- Uses Optical Character Recognition (OCR) technology to recognize text prior to classification.
- Inspection — inspects base64-encoded images for specified infoTypes, returns detected InfoTypes with pixel coordinates and bounding boxes.
- Redaction — masks sensitive data findings with opaque rectangles (configurable color).
- Object detection and redaction (July 2025) — detects and redacts sensitive objects (barcodes, license plates, persons, whiteboards) in images.
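
The image redaction flow above can be sketched as a JSON body for the image:redact REST method. The example assumes a PNG image and uses a few placeholder bytes instead of a real file; it builds the request only and does not call the API.

```python
import base64
import json


def build_image_redact_request(image_bytes, info_types, color=(0.0, 0.0, 0.0)):
    """Return an image:redact body that covers findings with a solid rectangle."""
    red, green, blue = color  # each channel is a float between 0 and 1
    return {
        "byteItem": {
            "type": "IMAGE_PNG",
            # Image content travels as base64 text in the JSON body.
            "data": base64.b64encode(image_bytes).decode("ascii"),
        },
        "inspectConfig": {"infoTypes": [{"name": name} for name in info_types]},
        "imageRedactionConfigs": [
            {
                "infoType": {"name": name},
                "redactionColor": {"red": red, "green": green, "blue": blue},
            }
            for name in info_types
        ],
    }


fake_png = b"\x89PNG\r\n\x1a\n"  # placeholder bytes, not a complete image
image_body = build_image_redact_request(fake_png, ["PHONE_NUMBER", "EMAIL_ADDRESS"])
print(json.dumps(image_body, indent=2))
```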

Storage Classification
- Scans data stored in Cloud Storage, Datastore, BigQuery, and Cloud SQL.
- Supports scanning of binary, text, image, Microsoft Word, PDF, Apache Avro files, and archive files.
- Detects sensitive data in headers and footers of rich document types.
- Unrecognized file types are scanned as binary files.

Date handling – if dates are considered PII, they can be de-identified in one of two ways:
- Generalization or bucketing, which can reduce the utility of the dates.
- Date shifting, which randomly shifts dates but preserves their sequence and duration.
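
The reason date shifting preserves sequence and duration is that every date belonging to the same entity is moved by the same offset. The sketch below illustrates this locally by deriving a per-entity shift from an HMAC of a context value; the key, the context string, and the 30-day bound are illustrative assumptions, and this is not the service's own algorithm.

```python
import hashlib
import hmac
from datetime import date, timedelta


def shift_days(key, context, max_days):
    """Derive a stable shift in [-max_days, +max_days] for one entity (for example, a patient ID)."""
    digest = hmac.new(key, context.encode("utf-8"), hashlib.sha256).digest()
    span = 2 * max_days + 1
    return int.from_bytes(digest[:8], "big") % span - max_days


def shift_dates(dates, key, context, max_days=30):
    """Shift every date for the same context by the same number of days."""
    delta = timedelta(days=shift_days(key, context, max_days))
    return [d + delta for d in dates]


visits = [date(2024, 1, 5), date(2024, 2, 1), date(2024, 3, 15)]
shifted = shift_dates(visits, key=b"demo-key-not-for-production", context="patient-123")

# Gaps between consecutive visits are unchanged by the shift.
original_gaps = [(b - a).days for a, b in zip(visits, visits[1:])]
shifted_gaps = [(b - a).days for a, b in zip(shifted, shifted[1:])]
print(original_gaps == shifted_gaps)
```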

Re-identification Risk Analysis

Risk analysis is the process of analyzing sensitive data to find properties that might increase the risk of subjects being identified.

Risk analysis methods can be used before de-identification to help determine an effective strategy, or after de-identification to monitor for changes or outliers.

Supported risk metrics:
- k-anonymity – measures whether each individual is indistinguishable from at least k-1 other individuals who share the same quasi-identifier values.
- l-diversity – extends k-anonymity by requiring at least l distinct sensitive values within each equivalence class.
- k-map estimation – similar to k-anonymity, but estimates re-identification risk against a larger population by using statistical models, on the assumption that an attacker does not know who is in the dataset.
- δ-presence (delta-presence) – estimates the probability that an individual from a larger population is present in the dataset; relevant when membership in the dataset is itself sensitive.
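
For intuition, k-anonymity and l-diversity can be computed on a small table by grouping rows on their quasi-identifiers. The sketch below does this locally on invented records; the column names are hypothetical, and a risk analysis job would compute these metrics at scale on a BigQuery table instead.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same combination of quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[q] for q in quasi_identifiers)].append(row)
    return classes


def k_anonymity(rows, quasi_identifiers):
    """k is the size of the smallest equivalence class (rows must be non-empty)."""
    return min(len(group) for group in equivalence_classes(rows, quasi_identifiers).values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """l is the smallest number of distinct sensitive values in any equivalence class."""
    return min(
        len({row[sensitive] for row in group})
        for group in equivalence_classes(rows, quasi_identifiers).values()
    )


records = [
    {"age_range": "30-39", "zip3": "941", "diagnosis": "flu"},
    {"age_range": "30-39", "zip3": "941", "diagnosis": "asthma"},
    {"age_range": "30-39", "zip3": "941", "diagnosis": "flu"},
    {"age_range": "40-49", "zip3": "100", "diagnosis": "flu"},
    {"age_range": "40-49", "zip3": "100", "diagnosis": "flu"},
]
k = k_anonymity(records, ["age_range", "zip3"])
l = l_diversity(records, ["age_range", "zip3"], "diagnosis")
print(k, l)  # prints: 2 1
```

Here the second class has k = 2 but only one distinct diagnosis, so anyone known to be in that class has their diagnosis revealed, which is the gap l-diversity is meant to expose.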

Sensitive Data Protection Templates

Templates help decouple configuration information from the implementation of requests.

Templates provide a robust way to manage large-scale rollouts.

Two types of templates:
- Inspection Templates: Templates for saving configuration for inspection scan jobs, including what predefined or custom detectors to use.
- De-identification Templates: Templates for saving configuration for de-identification jobs, including infoType and structured dataset transformations.
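
The decoupling that templates provide shows up in the request itself: instead of embedding a full configuration, the request refers to templates by resource name. Below is a minimal sketch of a content:deidentify body that does this; the project, location, and template IDs are hypothetical.

```python
import json


def template_name(project_id, location, kind, template_id):
    """Build a template resource name; kind is 'inspectTemplates' or 'deidentifyTemplates'."""
    return f"projects/{project_id}/locations/{location}/{kind}/{template_id}"


def build_templated_request(text, project_id, location, inspect_id, deidentify_id):
    """Return a content:deidentify body that references saved templates instead of inline config."""
    return {
        "item": {"value": text},
        "inspectTemplateName": template_name(project_id, location, "inspectTemplates", inspect_id),
        "deidentifyTemplateName": template_name(
            project_id, location, "deidentifyTemplates", deidentify_id
        ),
    }


templated_body = build_templated_request(
    "Call me on 555-0100",
    project_id="my-project",       # hypothetical
    location="global",
    inspect_id="pii-basic",        # hypothetical template ID
    deidentify_id="mask-all-pii",  # hypothetical template ID
)
print(json.dumps(templated_body, indent=2))
```

Changing what is detected or how it is transformed then means updating the template, while callers keep sending the same request.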

Discovery scans can be configured to reprofile data when the inspection template changes (since Feb 2024).

AI and Generative AI Protection

Sensitive Data Protection extends to AI workloads to help organizations safely unlock data value at every stage of the AI journey.

Vertex AI Discovery (GA Feb 2025) – discovers and profiles sensitivity of training and tuning data in Vertex AI datasets.

Context-aware inspection — understands data context even within images and rich documents for effective AI agent data access control.
Fine-grained data minimization for:
- AI model training and fine-tuning data preparation
- Runtime protection for chat, data collection, and generative AI prompts/responses
- Ensuring adherence to regulations and internal policies

It is integrated with the AI Protection framework for security posture management of AI workloads.

Integration with Security Command Center

Sensitive Data Protection is deeply integrated into the Security Command Center Enterprise risk engine.
It publishes the following finding types:
- Data sensitivity and Data risk — calculated sensitivity and risk levels of profiled tables.
- PUBLIC_SENSITIVE_DATA — publicly accessible resources with sensitive data.
- SECRETS_IN_STORAGE — secrets detected in environment variables.
- SENSITIVE_DATA_CMEK_DISABLED – sensitive data without customer-managed encryption.

Security Command Center can automatically prioritize resources based on data sensitivity.
Discovery is included in Security Command Center Enterprise tier (since Nov 2024).

Regional Endpoints and Data Residency

Regional endpoints (GA Aug 2024) keep data at rest, in use, and in transit within a specified region.
Available in 35+ regions globally including multi-regions (asia, europe, us).
Content methods process data synchronously in memory — data is not stored on Google Cloud.

Storage methods process data in the same region where it resides.

Compliance and Security

Certifications: ISO/IEC 27001, ISO/IEC 27017, ISO/IEC 27018, PCI DSS, HIPAA BAA, MTCS SS 584.
Supports GDPR compliance with built-in and custom infoType detectors applicable to EU data.

Data encryption in transit and at rest follows Google Cloud encryption standards.

GCP Certification Exam Practice Questions

Questions are collected from the Internet, and the answers are marked as per my knowledge and understanding (which might differ from yours).

GCP services are updated every day, and both the questions and answers might soon be outdated, so research accordingly.

GCP exam questions are not always updated to keep pace with GCP changes, so a question might not reflect a feature that has since changed.

Open to further feedback, discussion, and correction.

A healthcare company needs to de-identify patient records stored in BigQuery before sharing them with researchers. They need to preserve the format of Social Security numbers for application compatibility while making them non-reversible without the key. Which de-identification technique should they use?
a. Masking
b. Redaction
c. Format-preserving encryption (CryptoReplaceFfxFpeConfig)
d. Date shifting

Answer: c – Format-preserving encryption maintains the format while encrypting the value.

An organization wants to continuously monitor all their BigQuery, Cloud Storage, and Cloud SQL data across all projects for sensitive data without setting up individual scan jobs. What should they use?
a. Sensitive Data Protection inspection jobs with scheduled triggers
b. Sensitive Data Protection discovery service at the organization level
c. Cloud DLP content inspection API
d. Security Command Center vulnerability scanning

Answer: b – The discovery service automatically and continuously discovers, classifies, and profiles data across the organization.

A company is building a generative AI application and needs to ensure that sensitive PII is not included in prompts sent to their LLM. Which Sensitive Data Protection capability should they use?
a. Discovery service profiling
b. Risk analysis jobs
c. Content inspection and de-identification API
d. Storage classification jobs

Answer: c – Content inspection and de-identification processes data synchronously and can be used at runtime to inspect and redact sensitive data from AI prompts and responses.

A financial institution wants to automatically restrict access to BigQuery tables containing highly sensitive data. Which Sensitive Data Protection feature should they configure?
a. De-identification templates
b. Discovery with automatic sensitivity-based tagging and IAM conditions
c. Inspection templates with Pub/Sub actions
d. Risk analysis jobs with email notifications

Answer: b – Discovery can automatically attach tags based on data sensitivity, and IAM conditions can grant or deny access based on these tags.

A security team needs to discover if any Cloud Functions or Cloud Run services have unencrypted secrets in their environment variables. Which feature should they enable?
a. Storage inspection jobs targeting Cloud Functions
b. Sensitive Data Protection secrets discovery with Security Command Center integration
c. Custom infoType detectors for password patterns
d. Cloud Audit Logs monitoring

Answer: b – Sensitive Data Protection can automatically discover unencrypted secrets in Cloud Functions and Cloud Run environment variables and report findings to Security Command Center.

An enterprise using Security Command Center Enterprise wants to profile their Amazon S3 data for sensitive information. What prerequisite is required?
a. A VPN connection between GCP and AWS
b. An AWS connector in Security Command Center with Sensitive Data Protection permissions
c. A hybrid job trigger configured in Sensitive Data Protection
d. Amazon S3 data must be replicated to Cloud Storage first

Answer: b – Multicloud discovery for Amazon S3 (and Azure Blob Storage) requires Security Command Center Enterprise with the appropriate cloud connector configured.

Which of the following are valid de-identification techniques in Sensitive Data Protection? (Choose 3)
a. Masking
b. Format-preserving encryption
c. Dictionary replacement
d. Data replication
e. Column deletion

Answer: a, b, c – Masking, format-preserving encryption, and dictionary replacement are all supported de-identification techniques.