Apache Data Lakehouse Weekly: 7 Key Updates from April 16-22, 2026

Midway through the third week of April, the Apache Data Lakehouse ecosystem continues to evolve at a rapid pace, with several key projects making significant strides in development and adoption. The post-summit V4 design work on Iceberg has been a defining thread on the dev list, focused on consolidating the design and preparing a formal spec write-up. Meanwhile, Polaris is moving toward its 1.4.0 milestone, with credential vending for Azure and Google Cloud Storage as a major feature addition. Arrow’s release engineering and Java modernization discussions remain active, with a proposed JDK 17 minimum for Arrow Java 20.0.0 garnering input from community members. In this article, we’ll walk through the key updates from the Apache Data Lakehouse ecosystem and explore their implications.

Consolidating the Design: Iceberg V4

The post-summit V4 design work on Iceberg has been a major focus for the development community. Discussion has narrowed to practical design questions, with a clear direction emerging. The proposed design treats catalog-managed metadata as a first-class supported mode while preserving static-table portability through explicit opt-in semantics. It replaces manifest lists with root manifests and introduces manifest delete vectors, enabling single-file commits that dramatically reduce metadata write overhead for high-frequency writers. The in-person sessions at the summit helped resolve the remaining design disagreements, and the community is now aligning on an implementation plan.

The single-file commits design, proposed by Russell Spitzer and Amogh Jahagirdar, is a significant improvement over the current approach: because each commit writes a single root manifest rather than a manifest list plus its supporting metadata, the per-commit write cost drops sharply. This is particularly beneficial for high-frequency writers managing large datasets. The design is moving toward a formal spec write-up, and the community is eager to see its impact on the Iceberg ecosystem.
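To make the shape of the change concrete, here is a minimal, hypothetical sketch of the single-file-commit idea in Python. The class names, fields, and commit logic are our illustrative assumptions, not the actual V4 spec: the point is simply that each commit produces one new metadata object that inherits prior manifests by reference and records deletes as vectors instead of rewriting files.

```python
# Hypothetical sketch of V4-style single-file commits. Instead of writing a new
# manifest list on every commit, the writer appends one root manifest that
# references prior manifests and carries delete vectors. All names here are
# illustrative, not the Iceberg V4 spec.
from dataclasses import dataclass, field

@dataclass
class RootManifest:
    snapshot_id: int
    inherited_manifests: list    # existing files, referenced rather than rewritten
    added_files: list            # data files added by this commit
    delete_vectors: dict = field(default_factory=dict)  # file -> deleted row positions

def commit(prev, added, deletes):
    """Produce the single new metadata object for this commit."""
    inherited = [] if prev is None else prev.inherited_manifests + prev.added_files
    snapshot = 0 if prev is None else prev.snapshot_id + 1
    return RootManifest(snapshot, inherited, added, deletes)

# Three commits, each writing exactly one metadata object:
s0 = commit(None, ["data-0.parquet"], {})
s1 = commit(s0, ["data-1.parquet"], {})
s2 = commit(s1, [], {"data-0.parquet": [0, 7]})  # delete-only commit, still one file
```

Note how the delete-only commit at the end touches nothing but the new root manifest; that is the property that matters for high-frequency writers.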

Another significant development is Péter Váry’s efficient column updates proposal for AI and ML workloads. This design lets Iceberg write only the columns that change on each write for wide feature tables, then stitch the result at read time. For teams managing petabyte-scale feature stores with embedding vectors and model scores, the I/O savings are meaningful. Anurag Mantripragada and Gábor Herman are working alongside Péter on POC benchmarks to support the formal proposal.
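The write-partial, stitch-at-read pattern can be sketched in a few lines. This is a rough illustration under our own assumption that partial writes are keyed by column and row id; the proposal's actual file layout and merge semantics may differ:

```python
# Illustrative sketch of column-level updates for wide feature tables: each
# write stores only the columns that changed; reads stitch the latest value
# per column. Structures are hypothetical, not the proposal's format.
def stitch(writes):
    """Merge partial column writes into full rows; last write wins per column."""
    rows = {}
    for w in writes:                      # writes are ordered oldest to newest
        for col, values in w.items():
            for row_id, value in values.items():
                rows.setdefault(row_id, {})[col] = value
    return rows

# Write 1 materializes two columns; write 2 rewrites only the changed `score`
# column, leaving the large embedding column untouched on disk.
writes = [
    {"embedding": {1: [0.1, 0.2]}, "score": {1: 0.5}},
    {"score": {1: 0.9}},
]
rows = stitch(writes)
```

For a feature store where embeddings dominate storage, skipping the unchanged columns on each score refresh is where the I/O savings come from.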

Progress on Polaris 1.4.0

Apache Polaris is moving toward its 1.4.0 milestone, with several key features in development. Credential vending for Azure and Google Cloud Storage is a major addition: rather than distributing long-lived cloud keys to every client, the catalog hands out short-lived credentials scoped to the data being accessed. This is expected to be a significant improvement for users who rely on these services.
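The flow behind credential vending can be sketched as follows. This is a hedged illustration of the general pattern, not the Polaris REST API; the function name, fields, and TTL are our assumptions, and a real implementation would mint an Azure SAS or a downscoped GCS token:

```python
# Sketch of the credential-vending pattern: the catalog exchanges the caller's
# table access for a short-lived credential scoped to that table's storage
# prefix, so clients never hold long-lived cloud keys. Names are illustrative.
import time

def vend_credentials(principal, table_location, ttl_s=900):
    """Return a short-lived credential scoped to a single table prefix."""
    # A real catalog would call the cloud provider's token service here
    # (e.g. an Azure SAS or a downscoped GCS access token).
    return {
        "principal": principal,
        "scope": table_location,            # valid only under this prefix
        "expires_at": time.time() + ttl_s,  # short-lived by construction
    }

cred = vend_credentials("etl-job", "gs://lake/warehouse/db/events/")
```

The key design property is that the blast radius of a leaked credential is one table prefix for a few minutes, not the whole bucket forever.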

Also notable is a new Polaris blog post on building a fully integrated, locally running open data lakehouse in under 30 minutes using k3d, Apache Ozone, Polaris, and Trino. The post showcases Polaris’s ease of deployment and flexibility, making it an attractive option for teams evaluating a data lakehouse stack.

With incubator overhead behind it, release velocity has picked up noticeably since the 1.3.0 release on January 16. That is a positive sign for the Polaris community, indicating that the project is gaining momentum.

Arrow’s Release Calendar

Arrow’s release calendar shows arrow-rs 58.2.0 landing this month, following 58.1.0 in March, which shipped with no breaking API changes. The Rust implementation has become one of the most actively maintained segments of the Arrow ecosystem, with its DataFusion integration drawing engines that want Arrow without a JVM dependency.

The proposed JDK 17 minimum for Arrow Java 20.0.0 is a significant development, with Jean-Baptiste Onofré’s proposal garnering input from community members. The practical rationale is coordination: setting JDK 17 as Arrow’s Java baseline aligns with Iceberg’s own upgrade timeline and effectively raises the minimum across the entire lakehouse stack in a single coordinated move.

AI Contribution Policy

The AI contribution policy that Holden Karau, Kevin Liu, Steve Loughran, and Sung Yun pushed through March is moving toward published guidance. This policy aims to establish clear disclosure requirements and code provenance standards for AI-generated contributions. The summit provided the in-person alignment that async debate rarely produces, and a working policy is expected on the dev list in the next couple of weeks.

Polaris is navigating the same question in parallel, and the two communities are likely to converge on a shared approach given their overlapping contributor base. The framing of AI as a resource is a key aspect of this policy, recognizing the potential benefits of AI-generated contributions while ensuring transparency and accountability.

Community Engagement

The Apache Ranger authorization RFC from Selvamohan Neethiraj remained the most active governance discussion. This plugin lets organizations running Ranger with Hive, Spark, and Trino manage Polaris security within the same policy framework, eliminating the policy duplication that arises when teams bolt separate authorization onto each engine.

It’s opt-in and backward compatible with Polaris’s internal authorization layer, which lowers the enterprise adoption barrier considerably. The Polaris PMC also shipped a March 29 post covering automated entity management for catalogs, principals, and roles, providing valuable insights into the capabilities of the Polaris platform.

Conclusion

The Apache Data Lakehouse ecosystem continues to evolve rapidly. Iceberg’s post-summit V4 design work, Polaris’s progress toward 1.4.0, and Arrow’s release engineering and Java modernization discussions will all shape the ecosystem’s direction in the months ahead. As the community continues to drive innovation and adoption, it will be exciting to watch the impact these projects have on data management and analytics.
