AWS Lake Formation

  • AWS Lake Formation simplifies creating secure data lakes and makes the data available for wide-ranging analytics.
  • is an integrated data lake service that helps to discover, ingest, clean, catalog, transform, and secure data and make it available for analysis and ML.
  • automatically manages access to the registered data in S3 through services including AWS Glue, Athena, Redshift, QuickSight, and EMR to ensure compliance with your defined policies.
  • helps configure and manage your data lake without manually integrating multiple underlying AWS services.
  • can manage data ingestion through AWS Glue. Data is automatically classified, and relevant data definitions, schema, and metadata are stored in the central Glue Data Catalog. Once the data is in the S3 data lake, access policies, including table- and column-level access controls, can be defined and encryption for data at rest can be enforced.
  • uses a shared infrastructure with AWS Glue, including console controls, ETL code creation and job monitoring, blueprints to create workflows for data ingest, the same data catalog, and a serverless architecture.
  • integrates with IAM so authenticated users and roles can be automatically mapped to data protection policies that are stored in the data catalog. Users from Microsoft Active Directory or LDAP can also federate into IAM using SAML.
  • helps centralize data access policy controls. Users and roles can be defined to control access, down to the table and column level.
  • supports private endpoints in the VPC and records all activity in AWS CloudTrail for network isolation and auditability.
  • is part of the Amazon SageMaker Lakehouse architecture (announced at re:Invent 2024), which unifies data across S3 data lakes, S3 Tables, and Redshift data warehouses with consistent Lake Formation permissions across all analytics and ML engines.

Lake Formation Fine-Grained Access Control (FGAC)

  • Lake Formation provides fine-grained access control at database, table, column, row, and cell levels.
  • Column-Level Security – restrict access to specific columns within a table.
  • Row-Level Security – define data filters with row filter expressions to restrict rows returned in query results.
  • Cell-Level Security – combine column-level and row-level security using data cell filters to restrict access at the cell level.
  • Data filters define both the columns that a user has access to and the rows that match a filter expression.
  • Fine-grained permissions are enforced across Athena, Redshift Spectrum, EMR, and AWS Glue ETL jobs.
  • Supports Open Table Formats (OTFs) including Apache Iceberg, Apache Hudi, and Delta Lake with table, row, column, and cell-level permissions.
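
The row, column, and cell-level filters described above can be expressed as one data cells filter. The following is a minimal sketch, assuming boto3's `lakeformation` client and its `create_data_cells_filter` operation. The account ID, database, table, column names, and filter expression are hypothetical placeholders. The function only builds the request payload, so it runs without AWS credentials.

```python
def build_cell_filter(catalog_id, database, table, name, row_expression, columns):
    """Build the TableData payload for a Lake Formation data cells filter.

    Pairing a row filter expression with an explicit column list is what
    restricts access at the cell level.
    """
    if not columns:
        raise ValueError("list at least one column to expose")
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": name,
        "RowFilter": {"FilterExpression": row_expression},
        "ColumnNames": list(columns),
    }


# Hypothetical values, for illustration only.
sales_filter = build_cell_filter(
    catalog_id="111122223333",
    database="sales_db",
    table="orders",
    name="emea_orders_no_pii",
    row_expression="region = 'EMEA'",
    columns=["order_id", "region", "amount"],
)

# To apply it (needs AWS credentials and data lake administrator rights):
# import boto3
# boto3.client("lakeformation").create_data_cells_filter(TableData=sales_filter)
```

Once created, a filter like this is granted to a principal in the same way as a table, and the principal sees only the listed columns for rows matching the expression.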

Lake Formation Tag-Based Access Control (LF-TBAC)

  • LF-TBAC provides a scalable way to manage permissions using LF-Tags (key-value pairs) assigned to Data Catalog resources.
  • Instead of defining policies per named resource, data stewards create LF-Tags based on business needs and attach them to databases, tables, and columns.
  • Permissions are granted to principals based on matching LF-Tags, enabling automatic access as new resources are tagged.
  • Supports cross-account data sharing using AWS Resource Access Manager (RAM).
  • LF-Tag Expressions (Nov 2024) – save and reuse LF-Tag expressions to grant permissions on Data Catalog resources, reducing policy management overhead.
  • Supports tag-based access control for federated catalogs including Amazon S3 Tables, Amazon Redshift data warehouses, and federated data sources (DynamoDB, SQL Server, Snowflake).
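
To make the matching rule concrete, here is a simplified model of how an LF-Tag expression selects resources: keys are combined with AND, and the values listed for one key are combined with OR. This is an illustrative sketch, not the service's implementation, and it ignores tag inheritance from databases to tables and columns. The tag keys, values, and table names are hypothetical.

```python
def matches(expression, resource_tags):
    """Return True if a resource's LF-Tags satisfy an LF-Tag expression.

    expression:    dict of tag key -> set of allowed values
    resource_tags: dict of tag key -> the single value on the resource
    Every key in the expression must be present on the resource (AND)
    with one of that key's allowed values (OR).
    """
    return all(
        resource_tags.get(key) in allowed for key, allowed in expression.items()
    )


# Hypothetical grant: analysts may read public sales or marketing data.
analyst_grant = {"domain": {"sales", "marketing"}, "sensitivity": {"public"}}

# Hypothetical tagged tables in the Data Catalog.
tables = {
    "orders": {"domain": "sales", "sensitivity": "public"},
    "payroll": {"domain": "hr", "sensitivity": "confidential"},
    "campaigns": {"domain": "marketing", "sensitivity": "public"},
    "leads": {"domain": "sales"},  # no sensitivity tag, so it does not match
}

visible = sorted(name for name, tags in tables.items() if matches(analyst_grant, tags))
print(visible)  # ['campaigns', 'orders']
```

A table tagged later with `domain=sales` and `sensitivity=public` would match the same grant with no further permission changes, which is the scaling benefit of LF-TBAC.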

Lake Formation Attribute-Based Access Control (ABAC)

  • (April 2025) – Lake Formation allows granting permissions to principals with matching attributes on Data Catalog resources.
  • Extends beyond tag-based controls to match principal attributes for fine-grained authorization.

Lake Formation Hybrid Access Mode

  • (September 2023) – provides flexibility to selectively enable Lake Formation permissions for specific databases and tables in the Data Catalog.
  • Both Lake Formation permissions and IAM permissions can control access to the same data simultaneously.
  • Opted-in principals require both Lake Formation permissions and IAM permissions, while non-opted-in principals continue accessing data using only IAM permissions.
  • Enables incremental migration to Lake Formation without interrupting existing users or workloads.
  • Integrates with Amazon DataZone (April 2024) – allows publishing and sharing Glue tables through DataZone without requiring prior Lake Formation registration.
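
The opt-in rule above can be restated as a small decision function. This is a simplified model of the rule as described in this section, not the service's actual authorization logic. The function and parameter names are hypothetical.

```python
def can_access(opted_in: bool, has_iam: bool, has_lf: bool) -> bool:
    """Model hybrid access mode for one principal and one resource.

    Opted-in principals need both IAM and Lake Formation permissions;
    everyone else continues to be governed by IAM permissions alone.
    """
    if opted_in:
        return has_iam and has_lf
    return has_iam


# An existing workload that was never opted in keeps working on IAM alone.
print(can_access(opted_in=False, has_iam=True, has_lf=False))  # True

# A newly opted-in analyst is blocked until a Lake Formation grant exists.
print(can_access(opted_in=True, has_iam=True, has_lf=False))  # False
```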

Lake Formation Cross-Account Data Sharing

  • Share Data Catalog databases and tables across AWS accounts within or outside an AWS Organization.
  • Share with entire AWS accounts or directly with IAM principals in another account.
  • Supports cross-account sharing using both tag-based access control and named resource methods.
  • Uses AWS Resource Access Manager (RAM) for cross-account grants.
  • (Feb 2026) – Enhanced cross-account sharing allows sharing hundreds of thousands of tables across accounts for multi-account analytics environments at scale.
  • RetainSharingOnAccountLeaveOrganization parameter – keeps resource shares in place when accounts change organizations.
  • Enables building a data mesh architecture with producer accounts, central governance accounts, and consumer accounts.
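
A named-resource cross-account grant can be sketched as the request below, assuming boto3's `lakeformation` client and its `grant_permissions` operation. The account IDs, database, and table names are hypothetical. The function only builds the request, so it runs without AWS credentials.

```python
def build_cross_account_grant(producer_account, consumer_account, database, table,
                              permissions=("SELECT",), grantable=()):
    """Build grant_permissions arguments that share one table with another account."""
    for account in (producer_account, consumer_account):
        if not (len(account) == 12 and account.isdigit()):
            raise ValueError(f"not a 12-digit AWS account ID: {account!r}")
    return {
        # The external account is the principal receiving the grant.
        "Principal": {"DataLakePrincipalIdentifier": consumer_account},
        # The table lives in the producer account's Data Catalog.
        "Resource": {
            "Table": {
                "CatalogId": producer_account,
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": list(permissions),
        # Permissions the consumer may pass on to its own principals.
        "PermissionsWithGrantOption": list(grantable),
    }


# Hypothetical producer and consumer accounts.
grant = build_cross_account_grant(
    "111122223333", "444455556666", "sales_db", "orders",
    permissions=("SELECT", "DESCRIBE"), grantable=("SELECT",),
)

# To apply it from the producer account (needs AWS credentials):
# import boto3
# boto3.client("lakeformation").grant_permissions(**grant)
```

In a data mesh setup, the consumer account's administrator would then typically create a resource link to the shared table and grant access to local principals.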

Lake Formation Credential Vending

  • Provides temporary credentials to users, services, or applications for short-term access to Amazon S3 data.
  • Supports integration with third-party services and engines through credential vending API operations.
  • (June 2026) – Lake Formation extends table permissions to access underlying data files in S3 directly, using the GetTemporaryDataLocationCredentials() API.
  • Provides a single set of permissions for both SQL queries and direct file access using existing Lake Formation table grants.
  • Eliminates the need to maintain separate S3 bucket policies or IAM role policies for file-level access.
  • Supports auditable credential vending with IAM Identity Center user context in CloudTrail events (July 2024).
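
As an illustration of credential vending for a third-party engine, the sketch below builds a request for boto3's `get_temporary_glue_table_credentials` operation on the `lakeformation` client. Note the assumption: this uses the table-scoped vending call rather than the data-location API named above, whose request shape is not shown in this section. The Region, account ID, database, and table are hypothetical.

```python
def build_table_credentials_request(region, account_id, database, table,
                                    duration_seconds=3600):
    """Build arguments for requesting temporary credentials scoped to one table."""
    table_arn = f"arn:aws:glue:{region}:{account_id}:table/{database}/{table}"
    return {
        "TableArn": table_arn,
        "Permissions": ["SELECT"],
        "DurationSeconds": duration_seconds,
        # Tells Lake Formation which filtering the calling engine can enforce.
        "SupportedPermissionTypes": ["COLUMN_PERMISSION"],
    }


# Hypothetical table, for illustration only.
request = build_table_credentials_request(
    "us-east-1", "111122223333", "sales_db", "orders"
)

# To call it (needs AWS credentials and a caller allowed to use vending):
# import boto3
# creds = boto3.client("lakeformation").get_temporary_glue_table_credentials(**request)
# The response carries AccessKeyId, SecretAccessKey, SessionToken, and Expiration.
```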

Lake Formation Multi-Catalog and Federated Catalogs

  • (December 2024) – AWS Glue Data Catalog allows creating federated catalogs to unify data across:
    • Amazon S3 data lakes
    • Amazon Redshift data warehouses
    • Operational databases (Amazon DynamoDB)
    • Third-party data sources (Snowflake, MySQL, etc.)
  • Lake Formation permissions apply consistently across all federated catalogs.
  • New permissions added: CREATE_CATALOG and SUPER_USER for catalog-level access control.

Lake Formation with Amazon S3 Tables

  • (March 2025) – S3 Tables can be integrated and cataloged as AWS Glue Data Catalog objects and registered as Lake Formation data locations.
  • Enables Lake Formation governance over S3 Tables using the same permission model as standard data lake tables.
  • Supports building cross-account data mesh architectures with S3 Tables where producer, governance, and consumer accounts operate independently.

Lake Formation with Open Table Formats

  • Supports managing access permissions for Apache Iceberg, Apache Hudi, and Delta Lake tables.
  • Enforces table, row, column, and cell-level permissions on OTF-based tables.
  • Apache Iceberg has the best integration with AWS Glue ETL via Lake Formation permissions, including full SQL support.
  • AWS Glue Data Catalog provides managed compaction for Iceberg tables to improve query performance.
  • Supports catalog federation for remote Apache Iceberg tables stored in external Iceberg catalogs.

Governed Tables (Deprecated)

⚠️ DEPRECATED: Lake Formation Governed Tables were deprecated effective December 31, 2024. All Governed Table APIs stopped working after February 17, 2025.

Migration: AWS recommends using open source transactional table formats — Apache Iceberg (recommended), Apache Hudi, or Delta Lake — which provide ACID transactions, time-travel queries, and automatic compaction natively.

  • Governed Tables previously provided ACID transactions, automatic data compaction, and time-travel queries within Lake Formation.
  • These capabilities are now fully supported through Apache Iceberg tables in the AWS Glue Data Catalog.

Lake Formation Blueprints and Workflows

  • Blueprints define the data source, the data target, and the schedule for loading data into the data lake.
  • Workflows encapsulate complex multi-job ETL activities, generating AWS Glue crawlers, jobs, and triggers.
  • Lake Formation executes and tracks a workflow as a single entity.
  • Supports both on-demand and scheduled workflow execution.
  • Workflows are visible in the Glue console as a directed acyclic graph (DAG).
  • Blueprint types include Database Snapshot and Incremental Database.

Lake Formation Data Catalog Views

  • (November 2023) – create views in the AWS Glue Data Catalog that reference up to 10 tables.
  • Views can be created using SQL editors for Athena or Redshift, or via AWS Glue APIs (August 2024).
  • Lake Formation permissions can be applied to views for fine-grained access control.

Lake Formation Cross-Region Data Access

  • Supports querying Data Catalog tables across AWS Regions.
  • Access data in other Regions using Athena, EMR, and AWS Glue ETL by creating resource links pointing to source databases and tables.
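
A database resource link pointing at another Region can be sketched as follows, assuming boto3's `glue` client and its `create_database` operation with a `TargetDatabase` block. The link name, account ID, database name, and Regions are hypothetical. The function only builds the payload, so it runs without AWS credentials.

```python
def build_resource_link(link_name, source_account, source_database, source_region):
    """Build the DatabaseInput payload for a cross-Region database resource link."""
    return {
        "Name": link_name,
        "TargetDatabase": {
            "CatalogId": source_account,
            "DatabaseName": source_database,
            "Region": source_region,
        },
    }


# Hypothetical link created in the querying Region that points at us-east-1.
link = build_resource_link("sales_db_link", "111122223333", "sales_db", "us-east-1")

# To create it in the Region where queries will run (needs AWS credentials):
# import boto3
# boto3.client("glue", region_name="eu-west-1").create_database(DatabaseInput=link)
```

Queries in the local Region then reference `sales_db_link` as if it were a local database, subject to Lake Formation permissions on both the link and the target.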

Lake Formation Integration with IAM Identity Center

  • (November 2023) – integrates with IAM Identity Center for workforce identity federation.
  • Users and groups managed in Identity Center can access Data Catalog resources with Lake Formation permissions enforced.
  • Supports trusted identity propagation for analytics queries.

Lake Formation Key Integrations

  • Amazon SageMaker Lakehouse – unified lakehouse architecture with Lake Formation as the governance layer for all analytics and ML workloads.
  • Amazon DataZone – data management service that uses Lake Formation for access control and governance when sharing data assets.
  • Amazon Athena – serverless queries with Lake Formation FGAC enforced.
  • Amazon Redshift / Redshift Spectrum – data sharing governed by Lake Formation permissions and LF-Tags.
  • Amazon EMR – Spark, Hive, and Presto jobs with table, row, column, and cell-level security via Lake Formation.
  • AWS Glue ETL – fine-grained access control enforced in ETL jobs.
  • Amazon QuickSight – BI dashboards with Lake Formation permissions.
  • Amazon SageMaker Feature Store – supports fine-grained access control through Lake Formation for ML feature pipelines.

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet, and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day, and both the answers and questions might be outdated soon, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion and correction.
  1. A company needs to centrally manage fine-grained access control for its data lake stored in Amazon S3. Data analysts should only see rows relevant to their department when querying tables via Amazon Athena. Which approach should be used?
    A. Create separate S3 buckets per department and use S3 bucket policies
    B. Use AWS Lake Formation data filters with row-level security expressions
    C. Create separate Athena workgroups with IAM policies per department
    D. Use S3 Access Points with different policies per department
  2. A data engineering team wants to migrate to Lake Formation for permission management but cannot disrupt existing workloads that use IAM-based access to Glue Data Catalog tables. Which feature should they use?
    A. Lake Formation tag-based access control
    B. Lake Formation hybrid access mode
    C. Lake Formation credential vending
    D. Lake Formation cross-account sharing
  3. An organization has 50 analytics teams across 20 AWS accounts that need to share and govern hundreds of thousands of tables centrally. Which Lake Formation capability best supports this at scale?
    A. S3 bucket policies with cross-account access
    B. Lake Formation cross-account data sharing with LF-Tag-based access control
    C. AWS RAM resource shares without Lake Formation
    D. IAM roles with assume-role policies per account
  4. A company uses Apache Iceberg tables stored in Amazon S3 and needs to enforce column-level and cell-level security for different user groups querying via Athena and EMR. Which service provides this capability?
    A. S3 Access Grants
    B. AWS Glue Data Quality
    C. AWS Lake Formation with data cell filters
    D. Amazon Macie with data classification
  5. A data science team needs to access the underlying S3 data files directly for an ML training pipeline, but the tables are governed by Lake Formation permissions. Previously they had to maintain separate S3 bucket policies. Which new feature eliminates this overhead?
    A. S3 Access Points
    B. Lake Formation credential vending with GetTemporaryDataLocationCredentials API
    C. IAM Identity Center direct S3 access
    D. AWS Glue Data Catalog resource policies
  6. A company previously used Lake Formation Governed Tables for ACID transactions in their data lake. After the deprecation in December 2024, which are the AWS-recommended replacements? (Select TWO)
    A. Amazon DynamoDB transactions
    B. Apache Iceberg tables in the Glue Data Catalog
    C. Amazon Aurora with S3 export
    D. Apache Hudi tables with Lake Formation permissions
    E. Amazon Redshift Serverless

Answers:

  1. B – Lake Formation data filters support row-level security to restrict rows returned based on filter expressions.
  2. B – Hybrid access mode allows selective enablement of Lake Formation permissions without disrupting existing IAM-based access.
  3. B – LF-TBAC with cross-account sharing scales to hundreds of thousands of tables across multiple accounts.
  4. C – Lake Formation supports fine-grained access control (including cell-level security) for Apache Iceberg and other open table formats.
  5. B – The GetTemporaryDataLocationCredentials API vends temporary credentials scoped to S3 locations, providing unified permissions for both SQL queries and direct file access.
  6. B, D – AWS recommends migrating to open table formats (Apache Iceberg preferred, Hudi also supported) which provide ACID transactions natively with Lake Formation permission support.
