
What’s New in Apache Iceberg 3.0

Apache Iceberg has expanded and is ready to support a greater variety of big data workloads.
Jun 19th, 2025 7:00am
Feature image via Unsplash.


Version 3 of Apache Iceberg has been released. A number of features have been added that expand the flexibility of the data table format, including a few much-requested data types, faster deletes, row lineage and default values to stand in for NULLs.

This new version, as well as version 4, which the core development team is about to start on, will better equip Iceberg for new types of use cases, explained Russell Spitzer, a program manager for Apache Iceberg.

As an open data format, Apache Iceberg has been instrumental in the creation of data lakehouses, which combine multiple sources of structured and unstructured data for large-scale analysis. It uses a sophisticated set of metadata to keep track of the changes in the different files it indexes.

Iceberg, along with a good metadata store, keeps track of a schema as it evolves, which gives users the flexibility to update the schema while retaining the ability to query older data. It can do time travel and rollbacks, and it can scale without users worrying about partitions.
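
To make the time travel idea concrete, here is a minimal sketch using PySpark with the Iceberg Spark runtime; the "demo" catalog, REST URI and table name are placeholders, not anything from the release.

    from pyspark.sql import SparkSession

    # A minimal sketch, assuming the Iceberg Spark runtime jar is on the
    # classpath; the "demo" catalog, URI and table name are placeholders.
    spark = (
        SparkSession.builder
        .appName("iceberg-time-travel")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "rest")
        .config("spark.sql.catalog.demo.uri", "http://localhost:8181")
        .getOrCreate()
    )

    # Read the table as it existed at an earlier snapshot ID or timestamp.
    spark.sql("SELECT * FROM demo.db.events VERSION AS OF 123456789").show()
    spark.sql("SELECT * FROM demo.db.events TIMESTAMP AS OF '2025-06-01 00:00:00'").show()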

Apache Iceberg is both a set of specifications and a number of reference implementations. There is a view specification and a REST specification for how clients communicate with a catalog server. It also includes a specification for the Puffin file format, which stores indexes, statistics and other data bits that can’t be stored within an Iceberg manifest. There is also a range of implementations written in different languages (Java, Python, Rust, Go, C++) and built on different platforms, such as Apache Spark and Apache Flink.
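
On the Python side, a short, hedged sketch with PyIceberg talking to a REST catalog; the URI and table name are placeholders.

    from pyiceberg.catalog import load_catalog

    # Connect to a REST catalog server; the URI and table name are placeholders.
    catalog = load_catalog("rest", **{
        "type": "rest",
        "uri": "http://localhost:8181",
    })
    table = catalog.load_table("db.events")
    print(table.schema())  # the current schema, reconstructed from metadata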


Spitzer, along with Snowflake Senior Manager of Developer Relations Danica Fine, held an explanatory session about the Iceberg version 3 specification at the annual Snowflake Summit, held earlier this month in San Francisco. The new features they covered included:

  • Deletion vectors
  • Variant type
  • Geometric type
  • Row lineage
  • NULL default values

Deletion Vectors

Fine explained that when it comes to deleting data, Iceberg has two avenues: copy-on-write and merge-on-read. Copy-on-write simply makes a copy of whatever file is being changed, minus whatever rows needed to be deleted; merge-on-read keeps the original file but adds a deletion file noting the contents to be removed. Deletes can cover all instances of a value (equality delete) or just one in a specific location (position delete).
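
In Spark, for instance, the choice between the two avenues is expressed as table properties. A hedged sketch, assuming an Iceberg-enabled SparkSession as in the earlier snippet; the table name is hypothetical:

    # Choose merge-on-read deletes and opt in to the v3 format.
    # `spark` is an Iceberg-enabled SparkSession; the table name is made up.
    spark.sql("""
        ALTER TABLE demo.db.events SET TBLPROPERTIES (
            'write.delete.mode' = 'merge-on-read',
            'format-version'    = '3'
        )
    """)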

Position deletes can pose a quandary, though: You can have one file capture all the deletes in a partition, or keep a separate delete file for each data file. The admin was left to understand and make the appropriate trade-offs to optimize speed of access.

For v3, position delete files are replaced with deletion vectors stored in highly performant Puffin files; the Puffin format has been repurposed for storing deletes as well. Each data file gets its own deletion vector.
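
Conceptually, a deletion vector is just the set of deleted row positions for one data file, consulted at scan time. A toy illustration in Python, not Iceberg’s actual implementation, which stores compressed bitmaps in Puffin:

    # Toy model of deletion vectors: one set of deleted row positions per
    # data file. Iceberg v3 stores these as compressed bitmaps in Puffin.
    deletion_vectors = {
        "data/file-a.parquet": {3, 17, 42},  # row positions marked deleted
    }

    def is_live(data_file: str, position: int) -> bool:
        """Return True if the row at this position should survive a scan."""
        return position not in deletion_vectors.get(data_file, set())

    assert not is_live("data/file-a.parquet", 17)  # deleted row is skipped
    assert is_live("data/file-a.parquet", 18)      # neighboring rows are kept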

“You’re going to see a lot less maintenance tasks on your end. You don’t have to actually go through and consolidate your data,” Fine said.

New Data Types

New data types abound in v3.0.

First up is the Variant type, which provides a binary format for storing semi-structured data, such as JSON. It allows you to change the shape of the data you’re ingesting without changing the table schema, and a single variant value can hold multiple fields of varying types.

Snowflake itself already supported variant types, so it made sense to extend that to Iceberg, said Spitzer.

“Variants are really great for a lot of different things,” he said. You get “all the benefits of having a structured type while still having the flexibility of being able to store just about anything you want in every single cell.”

As an example, he pointed to using variant types to capture data from Internet of Things deployments. Think of a sensor that provides latitude and longitude values, but also error codes.

Engines (and/or apps) writing data to Iceberg columns will need support for the variant type. Soon, you will also be able to shred variant values into proper Parquet columns for faster analysis, though that is not supported quite yet.
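
Here is a sketch of what that might look like in Spark SQL, assuming an engine with variant support (Spark 4.x, for example) and a v3 table; the table and field names are hypothetical:

    # Assumes `spark` is a session on an engine with variant support;
    # the table and field names are hypothetical.
    spark.sql("""
        CREATE TABLE demo.db.sensor_readings (
            device_id STRING,
            payload   VARIANT
        ) USING iceberg TBLPROPERTIES ('format-version' = '3')
    """)

    # Rows with different shapes land in the same column.
    spark.sql("""
        INSERT INTO demo.db.sensor_readings VALUES
            ('sensor-1', parse_json('{"lat": 37.77, "lon": -122.42}')),
            ('sensor-1', parse_json('{"error_code": 504}'))
    """)

    # Pull a typed field back out of the variant at query time.
    spark.sql("""
        SELECT device_id, variant_get(payload, '$.lat', 'double') AS lat
        FROM demo.db.sensor_readings
    """).show()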

The release also offers a couple of new geometric data types. The geotype options unlock a lot of functionality: Geometry captures two-dimensional surfaces, and Geography can be used for three-dimensional and spherical objects (this type will be available later this summer).

The geotypes will only be available for new columns; you can’t backport the data type to existing columns, Fine warned. User code will also have to be revamped to take advantage of the new data types.

Row Lineage

Row lineage is another feature that came from Snowflake, and is actually used in many of Snowflake’s functions, Spitzer said.

Row lineage provides a way to check each row of a table for changes: when the data was changed and what it was changed from (change data capture). Basically, row lineage allows you to audit your data changes.

“This is something that you simply could not do before in Iceberg tables,” Spitzer said. “We built a bunch of code to try to approximate this with CDC views, but you really just couldn’t ever know for sure what happened to a single row and where it came from.”

Two new metadata columns capture this activity: one assigns each row an identifier, and the other records the last snapshot in which that row was updated.

Row lineage is turned on automatically for Iceberg v3.
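
A short sketch of reading those columns, assuming the engine exposes the field names used in the v3 spec (_row_id and _last_updated_sequence_number); the table name is hypothetical:

    # Inspect row lineage via the v3 metadata columns; `spark` is an
    # Iceberg-enabled session and the table name is a placeholder.
    spark.sql("""
        SELECT _row_id, _last_updated_sequence_number, *
        FROM demo.db.events
    """).show()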

Null Default Values

NULL, NULL, NULL. Devs hate NULLs, and NULLs are all over Iceberg tables. The problem was that Iceberg had no default values to stand in for NULLs. So when calculation time comes, how do you do the math?

Now Iceberg provides a way to change all the missing values into a set value before calculations are made.

So each column gets two new parameters. One is initial-default: This is the value that replaces NULL. After the upgrade, the first time the engine scans the table, it will substitute the initial-default for NULLs. The idea is that you set the initial-default once and forget about it.

There is also a write-default, which fills in the specified value when a NULL is written to the table. This can be changed at any time.

“The reason that there’s two different defaults is that the moment you’ve done that and a row is compacted or moved to a different file, the value that used to be null is now written as a real value, so you can’t change it a second time,” Spitzer explained.
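
In SQL terms, the split might look like this, assuming an engine that maps column DEFAULT clauses onto Iceberg’s initial-default and write-default; the table and column names are hypothetical:

    # Adding a column with a DEFAULT sets both the initial-default (what
    # pre-existing rows read back) and the write-default (what new rows get
    # when the column is omitted). Names here are hypothetical.
    spark.sql("ALTER TABLE demo.db.events ADD COLUMN retries INT DEFAULT 0")

    # Changing the default later only moves the write-default; the
    # initial-default is set once and stays fixed.
    spark.sql("ALTER TABLE demo.db.events ALTER COLUMN retries SET DEFAULT 3")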

Beyond v3: Streaming and Low-Latency

Those who upgrade to an Apache Iceberg v3 table will need a v3-compliant engine, and some of the updates will require changes to the code that operates on Iceberg.

The working groups are already starting up on Iceberg version 4, Spitzer said. Proposals for new features are being circulated on the mailing list.

“We’re looking at a whole bunch of things to try to make Iceberg a better table format for use cases that currently it’s not great at,” Spitzer said in a follow-up interview with TNS.

These include small tables and tables with lots of updates. They are looking at ways to better accommodate streaming applications, and they want to lower latency in general, which would involve reducing the number of writes to the metadata layer.

Apache Polaris Nears a Big Release Too

Although originally developed at Netflix and subsequently maintained by Dremio, Iceberg has received quite a bit of open source help from Snowflake, in terms of engineering time and even certain features that Snowflake originally developed in-house for its own data formats.

Last year, Snowflake released as open source a REST catalog it had developed for Iceberg, called Polaris. Iceberg requires a metadata catalog to centralize metadata management, governance and access control for Iceberg tables.

The idea was to “abstract away commit logic from the client and have them in a central server location,” Spitzer said. The catalog often relies on a database for the actual persistence layer; Snowflake’s own commercial Polaris implementation, Open Catalog, uses FoundationDB.

The version one release of Polaris will happen “soon,” Spitzer said. Last-minute adjustments are being made for production and security assurances; the software had a lot of Snowflake-specific pieces that needed to be swapped out.

And, of course, the software must be scalable.

“There’s folks who want to use it for their own internal organizations, where 20 transactions a second on the catalog is more than enough. But we have some people who want to run it as a service, or run it for a huge organization, where you need to handle thousands of transactions a second. It’s probably very rare, but we want to make sure that it scales up to that,” he said.

TNS owner Insight Partners is an investor in: Dremio.