Munich Datageeks e.V.
Talk "Apache Iceberg and the Future of Secure, Interoperable Data Lakehouses"
Apache Iceberg and the Future of Secure Interoperable Data Lakehouses

Talk "Apache Iceberg and the Future of Secure, Interoperable Data Lakehouses"

Felix Reuthlinger

The talk "Apache Iceberg and the Future of Secure, Interoperable Data Lakehouses" by Christian Thiel was presented at Munich Datageeks - May Edition 2025.

Abstract

Apache Iceberg has emerged as the de facto standard for open data analytics systems, empowering organizations to freely choose their preferred query engines and ML frameworks while working with shared data. Historically, the "pick your tool" promise of data lakehouses has been compromised by security fragmentation across vendors. In this session we walk through the layered security model of Apache Iceberg and demonstrate how security-focused Iceberg REST catalogs like Lakekeeper can leverage open standards to enable genuinely secure cross-engine and cross-cloud interoperability. Learn how modern data architectures built purely on open-source tools and standards can finally deliver both flexibility and comprehensive governance.

About the speaker


Christian Thiel is the creator of Lakekeeper, an Apache-licensed, security-focused Iceberg REST catalog. He’s a big believer in open standards, which he sees as the backbone of today’s modern, composable Data & Analytics systems. Christian is all about crafting innovative solutions that make collaborating on data a breeze.

Transcript summary

This talk by Christian, a software engineer from Hamburg and contributor to Apache Iceberg, explores the lakehouse architecture concept, focusing on Apache Iceberg as a table format and the challenges of implementing proper security and governance in open-source data systems. The presentation demonstrates how Lakekeeper, a Rust-native Iceberg REST catalog, addresses these challenges.

Evolution from Data Warehouses to Lakehouses

The journey of data analytics systems began with relational database systems that were initially used for analytical workloads. Traditional data warehouses worked well when there was a central analytics team with limited users. These systems provided ACID compliance, transaction safety, schema management, governance, and access controls all within the database.

However, requirements changed as organizations democratized data access across entire companies, with various teams, AI use cases, and machine learning models all requiring massive amounts of data. The fundamental limitation of traditional databases became apparent: they couple compute and storage in a single system, making scaling expensive and inefficient. Additionally, legacy protocols like ODBC and JDBC, while adequate for transactional systems, proved too slow for analytical workloads requiring large data transfers.

This led to the development of data lakes, primarily through the Hadoop ecosystem. The core principle of data lakes involves eliminating the database as a gatekeeper and allowing applications to directly access files stored in open formats on object storage like S3. This architecture provides exceptional scalability and performance while enabling heterogeneous use cases where different engines can access the same underlying data.

However, data lakes introduced significant problems: they lack transaction safety by default, have no inherent schema management, and provide no access controls. The lakehouse concept emerged to combine the best of both approaches - maintaining the scalability and performance of data lakes while reintroducing the reliability, governance, and transaction safety of traditional databases.

The Lakehouse Technology Stack

Building a functional lakehouse requires agreement on open standards across multiple layers. At the foundation is physical storage, typically S3 for cloud deployments or Ceph for on-premises environments. The next layer requires an open data format for storing data efficiently - Parquet has become the de facto standard. Parquet is a columnar storage format optimized for analytical queries, offering significant performance advantages over formats like CSV.
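
As a rough sketch of these two layers in isolation, the snippet below writes a Parquet file straight to object storage with PyArrow; the bucket name is a placeholder and credentials are assumed to come from the environment.

```python
# Minimal sketch of the data lake's two lowest layers: columnar data written
# as Parquet directly to object storage. The bucket and path are placeholders,
# and AWS credentials are assumed to be available in the environment.
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

events = pa.table({
    "event_id": [1, 2, 3],
    "country": ["DE", "FR", "US"],
    "revenue": [19.99, 5.00, 42.50],
})

s3 = fs.S3FileSystem()
pq.write_table(events, "my-bucket/events/part-00000.parquet", filesystem=s3)
```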

These two layers alone constitute a data lake but are insufficient for a lakehouse. Without additional infrastructure, concurrent read and write operations have undefined behavior, making the system unsuitable for production use.

Apache Iceberg as the Table Format Layer

Apache Iceberg was developed at Netflix in 2017 to solve the transaction safety problem on data lakes. It was donated to the Apache Foundation in 2018 and has become the leading open table format for analytical workloads. Iceberg functions as a metadata layer on top of Parquet, bringing SQL table reliability and simplicity to big data analytics.
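
To illustrate what that metadata layer looks like from a client's point of view, here is a hedged PyIceberg sketch that loads a table from a catalog and runs a filtered scan; the catalog URI, warehouse, and table identifier are assumptions for illustration.

```python
# Hedged sketch: reading an Iceberg table that is physically just Parquet
# files plus metadata. Catalog URI, warehouse, and table name are placeholders.
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

catalog = load_catalog(
    "lakehouse",
    uri="https://catalog.example.com",
    warehouse="demo",
)

table = catalog.load_table("analytics.events")

# Iceberg resolves the current snapshot from its metadata and plans the scan
# over only the Parquet files that can match the filter.
arrow_table = table.scan(row_filter=EqualTo("country", "DE")).to_arrow()
print(arrow_table.num_rows)
```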

While not the only project in this space - Delta Lake and Apache Hudi are notable alternatives - Iceberg has emerged as the industry standard. Delta Lake, originally developed by Databricks, is now converging with Iceberg at a binary level after Databricks acquired Tabular, the company founded by Iceberg's creators, for a reported two billion dollars. Major cloud providers including AWS, Azure, and Google all contribute to the Iceberg project, demonstrating broad industry adoption.

The Critical Role of Catalogs

The Iceberg project includes a REST catalog specification that defines how catalogs should operate through an OpenAPI specification. However, the Iceberg project itself does not provide an official implementation - various vendors and open-source projects implement this standard independently.
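
A hedged sketch of what "implementing the REST specification" means in practice: the endpoints below come from the Iceberg REST catalog OpenAPI definition, while the base URL and bearer token are placeholders.

```python
# Hedged sketch of the REST catalog API itself, independent of any client
# library. Base URL and token are placeholders; the endpoints are defined by
# the Iceberg REST catalog OpenAPI specification.
import requests

BASE = "https://catalog.example.com/v1"
HEADERS = {"Authorization": "Bearer <token>"}

# Every REST catalog serves a config endpoint describing its defaults and overrides.
print(requests.get(f"{BASE}/config", headers=HEADERS).json())

# loadTable returns the table's current metadata, including the pointer to its
# metadata JSON file on object storage.
resp = requests.get(f"{BASE}/namespaces/analytics/tables/events", headers=HEADERS)
print(resp.json()["metadata-location"])
```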

Catalogs serve several critical functions in a lakehouse architecture. They provide a single source of truth for table metadata, with each Iceberg table maintaining a JSON metadata file that tracks snapshots and table history. Catalogs also enable multi-table transactions, which are challenging to implement directly on object storage. Without a catalog, coordinating transactions across multiple tables stored on S3 would be extremely difficult.
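
The same idea seen from a client library, as a hedged sketch: the catalog resolves a table name to its current metadata file, which in turn lists the snapshots that make up the table's history (names and URIs are placeholders).

```python
# Hedged sketch of the "single source of truth" role: the catalog points to the
# table's current metadata JSON, and the metadata lists its snapshots.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lakehouse", uri="https://catalog.example.com")
table = catalog.load_table("analytics.events")

print(table.metadata_location)  # e.g. s3://bucket/.../metadata/00007-... .metadata.json
for snapshot in table.metadata.snapshots:
    print(snapshot.snapshot_id, snapshot.timestamp_ms)
```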

The Iceberg community has consolidated around the REST specification after years of fragmentation with legacy catalog implementations including Hadoop, Hive, Glue, JDBC, DynamoDB, ECS, and Nessie. While these older implementations still exist, the REST specification represents the standardized approach going forward.

Identity and Permission Management Challenges

When moving from a monolithic database to a distributed lakehouse architecture, two critical components are often overlooked: permission management and identity management. Traditional databases handle both natively, but in a lakehouse built from separate components, these capabilities must be explicitly designed and implemented.

The fundamental challenge is enabling different users to collaborate using any query engine they prefer while maintaining secure access to shared data. Users should be able to switch between Python, Trino, or other tools while their personal permissions remain consistent across all engines. This is surprisingly difficult to achieve in a non-monolithic architecture but is possible with modern open-source tools.

Where to Store and Enforce Permissions

Several potential locations exist for storing permissions in a lakehouse stack. Storing permissions in individual query engines is problematic because permissions would not be shared across different compute systems. Similarly, securing data solely through storage layer policies (like S3 policies) is insufficient - the principle should be to secure your data, not your storage.

The recommended approaches involve either storing permissions in the catalog itself or in a dedicated permission management system. Modern permission systems like Open Policy Agent and OpenFGA are specialized CNCF projects designed to make permissions flexible and robust. These can integrate with catalogs to provide comprehensive permission enforcement.
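
As a hedged illustration of what such an integration boils down to, the sketch below asks Open Policy Agent for a decision over its Data API; the policy path and input shape are assumptions, and Lakekeeper's or OpenFGA's real integrations differ in detail.

```python
# Hedged sketch of delegating an authorization decision to Open Policy Agent
# via its Data API. The policy path ("lakehouse/allow") and the input shape are
# illustrative assumptions, not any product's actual contract.
import requests

OPA_URL = "http://opa:8181/v1/data/lakehouse/allow"  # OPA's default HTTP port

def is_allowed(user: str, action: str, table: str) -> bool:
    decision = requests.post(
        OPA_URL,
        json={"input": {"user": user, "action": action, "table": table}},
    ).json()
    return bool(decision.get("result", False))

print(is_allowed("anna", "select", "analytics.events"))
```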

Identity management is equally critical and often neglected. Many data engineering workflows rely on API keys or client credentials generated outside the organization's identity provider, creating security risks. Every system that creates personal access tokens, client credentials, or passwords outside the primary identity provider effectively becomes another identity provider in the infrastructure, introducing security vulnerabilities. The solution is to generate all credentials in the organization's identity provider and nowhere else.
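
A hedged sketch of that principle with Keycloak, the identity provider used later in the demo: a workload obtains a short-lived token from the IdP's token endpoint and presents it downstream, so no catalog-specific API key ever exists. Realm name, client id, and URL are placeholders.

```python
# Hedged sketch: credentials originate only in the identity provider. The
# client secret would come from a secret store, never from source code.
import requests

TOKEN_URL = "https://keycloak.example.com/realms/lakehouse/protocol/openid-connect/token"

resp = requests.post(
    TOKEN_URL,
    data={
        "grant_type": "client_credentials",
        "client_id": "etl-pipeline",
        "client_secret": "<client secret from a secret store>",
    },
)
access_token = resp.json()["access_token"]

# The short-lived token is then presented to the catalog and query engines,
# so the identity provider remains the only place where credentials are minted.
```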

Permission Enforcement Mechanisms

Permissions can be enforced at two primary levels in a lakehouse architecture. The first approach is catalog-enforced security using vended credentials. In this model, a compute engine like Apache Spark queries an Iceberg REST catalog (such as Lakekeeper), which validates permissions and returns temporary credentials. These credentials allow Spark to access only the specific table locations on S3 that the user is authorized to read. This mechanism provides table-level security and works with any query engine, including local Python installations or Jupyter notebooks, because the security is enforced before storage access is granted.
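
A hedged PySpark sketch of this flow, using the Iceberg REST catalog options for Spark; URIs, warehouse name, and OAuth client settings are placeholders, and the iceberg-spark-runtime package is assumed to be on the classpath.

```python
# Hedged sketch of catalog-enforced security with vended credentials.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("vended-credentials-demo")
    .config("spark.sql.catalog.lakekeeper", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakekeeper.type", "rest")
    .config("spark.sql.catalog.lakekeeper.uri", "https://catalog.example.com")
    .config("spark.sql.catalog.lakekeeper.warehouse", "demo")
    .config("spark.sql.catalog.lakekeeper.credential", "<client-id>:<client-secret>")
    # Ask the catalog to return short-lived, table-scoped storage credentials.
    .config("spark.sql.catalog.lakekeeper.header.X-Iceberg-Access-Delegation",
            "vended-credentials")
    .getOrCreate()
)

# Spark never holds long-lived S3 keys: the catalog checks the caller's
# permissions and vends temporary credentials for the table's location only.
spark.sql("SELECT count(*) FROM lakekeeper.analytics.events").show()
```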

The second approach is query engine-enforced security, which applies when using multi-user query engines like Trino. In this model, users authenticate through the organization's identity provider (such as AWS Cognito or Keycloak) when logging into the query engine. When querying data, the query engine contacts the catalog using system credentials to retrieve table metadata and storage credentials. The query engine then separately checks permissions through a dedicated permission system and filters results based on what the user is authorized to see before returning data. This approach enables more granular row-level and column-level security but requires trusting the query engine to properly enforce permissions, meaning it only works when the query engine is maintained by organizational administrators.
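
A hedged sketch of the same flow from the user's side with the Trino Python client; host, catalog, and schema names are placeholders, and the token is assumed to come from the identity provider.

```python
# Hedged sketch of engine-enforced security: the user presents an IdP-issued
# token to Trino, and Trino filters rows and columns before returning results.
import trino
from trino.auth import JWTAuthentication

conn = trino.dbapi.connect(
    host="trino.example.com",
    port=443,
    http_scheme="https",
    user="anna",  # placeholder; with JWT the principal comes from the token
    auth=JWTAuthentication("<user token issued by the identity provider>"),
    catalog="lakehouse",
    schema="analytics",
)

cur = conn.cursor()
cur.execute("SELECT country, sum(revenue) FROM events GROUP BY country")
# Only rows and columns this user is authorized to see come back; the filtering
# happens inside Trino, which is why the engine itself must be trusted.
print(cur.fetchall())
```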

Open Source Lakehouse Implementation

The open-source ecosystem now provides all necessary components to build a complete lakehouse with proper governance and security. This capability has traditionally been a major selling point for vendor platforms, but open-source alternatives have matured to offer comparable functionality.

The demonstration showcased a complete setup with Lakekeeper as the catalog, Trino as a shared query engine, PyIceberg for Python-based access, and Keycloak as the identity provider. The system demonstrated how two different users with distinct permissions could interact with the same data using different tools, with permissions consistently enforced across engines. When one user created tables and another attempted to access them without permissions, access was correctly denied. After granting specific permissions through Lakekeeper's UI, the second user could access authorized tables through both Trino and PyIceberg, while remaining blocked from unauthorized tables.
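
The following hedged PyIceberg sketch mirrors that demo from the second user's perspective; URIs, identifiers, and the exact exception type are assumptions rather than the demo's literal code.

```python
# Hedged sketch: the same catalog, accessed with the second user's own token,
# only exposes the tables that user was granted in Lakekeeper.
from pyiceberg.catalog import load_catalog
from pyiceberg.exceptions import ForbiddenError

catalog = load_catalog(
    "lakehouse",
    uri="https://catalog.example.com",
    warehouse="demo",
    token="<user-2 token from the identity provider>",
)

print(catalog.list_tables("analytics"))  # only the tables user 2 may see

try:
    catalog.load_table("finance.salaries")  # permission was never granted
except ForbiddenError as exc:
    print("access denied:", exc)
```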

Lakekeeper Features and Architecture

Lakekeeper is an Apache-licensed, open-source Iceberg REST catalog implementation written in Rust. It provides security features out of the box, distinguishing it from many other open-source catalog implementations. The Rust implementation results in a minimal binary of approximately 70 megabytes that can run in highly restricted environments, including Google distroless Docker images without even a shell.

The system includes robust Kubernetes integration with support for multiple identity providers. Beyond standard OIDC and OAuth support for user authentication, Lakekeeper enables Kubernetes workloads to authenticate natively, facilitating secure service-to-service communication in cloud-native environments. The entire demonstration environment, including the identity provider, catalog, query engines, and permission system, can be launched with just three commands: git clone, cd, and docker compose up, making it highly accessible for testing and evaluation.