What is an Apache Iceberg Manifest List and Manifest?
Apache Iceberg is a powerful table format explicitly designed to manage massive analytical datasets within cloud object stores. It introduces concepts like hidden partitioning, schema evolution, and time travel that provide superior organization, efficient updates, and flexibility when compared to traditional file-based approaches like Hive tables.
At the heart of Iceberg's efficient handling of large datasets lie manifests and manifest lists. Manifests are specialized files that track the data files belonging to a table, along with relevant metadata. Manifest lists act as indexes or catalogs for manifests, aiding in the faster discovery of relevant data when queries are made.
Manifests: The Backbone of an Iceberg Table
Definition: An Iceberg manifest is an immutable file, formatted in Avro, that essentially provides a list of data files within the table. Crucially, it also stores specific metadata about these files, making it a treasure trove of information for tasks such as query planning.
Composition: Let's unpack the details that make up a manifest:
Partition Data Tuple: Each data file listed in a manifest must be accompanied by the values corresponding to its partition columns. This way, even without looking at the data file itself, you understand how that file's data is partitioned.
Metrics/Statistics: Manifests store column-level metrics for each data file, such as lower and upper bounds (minimum and maximum values), value counts, and null value counts. These let the query planner skip files that cannot contain matching rows.
Tracking Information: Essential components like file path, format (e.g., Parquet, ORC), schema details, and other tracking information form the core of a manifest.
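To make the composition above concrete, here is a minimal sketch of what one manifest entry carries. The class and field names are illustrative assumptions, not Iceberg's actual Avro schema, and metrics are keyed by column name for readability (real manifests key them by field ID).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical stand-in for one data-file entry in a manifest."""
    file_path: str
    file_format: str                  # e.g. "PARQUET" or "ORC"
    partition: Tuple[Any, ...]        # partition data tuple
    record_count: int
    null_counts: Dict[str, int] = field(default_factory=dict)
    lower_bounds: Dict[str, Any] = field(default_factory=dict)
    upper_bounds: Dict[str, Any] = field(default_factory=dict)


# Example path and values are made up for illustration.
entry = DataFileEntry(
    file_path="s3://example-bucket/tbl/data/part-0001.parquet",
    file_format="PARQUET",
    partition=("2024-01-15",),
    record_count=1000,
    null_counts={"user_id": 0, "amount": 12},
    lower_bounds={"user_id": 1, "amount": 0.5},
    upper_bounds={"user_id": 9000, "amount": 999.0},
)

# The partition value and column ranges are known without opening the file.
print(entry.partition, entry.lower_bounds["amount"], entry.upper_bounds["amount"])
```

Because the entry is frozen, it mirrors the immutability of a manifest: a change to the table produces new manifests rather than edits to existing ones.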
Role in Table Snapshots: Iceberg table snapshots, which represent the table's state at a specific point in time, are constructed using manifests. Think of snapshots as collections of related manifests that paint a picture of the table at that instant.
Types: Iceberg handles deletions gracefully with two types of manifests:
Data Manifests: These contain details about regular data files that hold the table's content.
Delete Manifests: These specifically list "delete files," which contain rows or entries marked for deletion. Iceberg processes these delete files to correctly filter data upon reading.
Manifest Lists: Management and Metadata Organization
Definition: A manifest list is a metadata file that tracks the individual manifests making up a particular table snapshot, with one entry per manifest.
Purpose: The primary functions of manifest lists revolve around efficiency and maintaining order:
Accelerate Metadata Operations: Instead of reading through potentially numerous individual manifests to assess a table's state, manifest lists provide helpful precomputed summaries. These summaries include partition value ranges and statistics like the number of files added, deleted, or existing.
Snapshot Sequencing: Keeping track of the sequence of changes to a table is crucial. Manifest lists record a sequence number for each manifest, helping Iceberg track the correct chronological order of changes across table snapshots.
Components: Let's dive into the types of information a manifest list stores:
Manifest File Metadata: For each manifest within a snapshot, a manifest list tracks the manifest's path, file length, the partition spec ID used, and more.
Summary Metadata: To make query planning fast, manifest lists include the counts of added, existing, and deleted files within the snapshot. Further, partition summaries offer insights into the distribution of data values within each partition of the table.
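The components above can be sketched as a small record type. This is only an illustration: the class names, field names, and example paths are assumptions chosen to mirror the description, not Iceberg's real schema.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class PartitionFieldSummary:
    """Hypothetical summary of one partition field across a manifest."""
    lower_bound: Any
    upper_bound: Any
    contains_null: bool = False


@dataclass(frozen=True)
class ManifestListEntry:
    """Hypothetical stand-in for one row of a manifest list."""
    manifest_path: str
    manifest_length: int
    partition_spec_id: int
    added_files_count: int
    existing_files_count: int
    deleted_files_count: int
    partitions: List[PartitionFieldSummary]


def live_file_count(entries: List[ManifestListEntry]) -> int:
    """Count files still part of the snapshot (added + existing)."""
    return sum(e.added_files_count + e.existing_files_count for e in entries)


manifest_list = [
    ManifestListEntry("s3://example-bucket/tbl/metadata/m1.avro", 8192, 0,
                      3, 10, 1, [PartitionFieldSummary("2024-01-01", "2024-01-07")]),
    ManifestListEntry("s3://example-bucket/tbl/metadata/m2.avro", 4096, 0,
                      0, 5, 2, [PartitionFieldSummary("2024-01-08", "2024-01-14")]),
]

print(live_file_count(manifest_list))  # prints 18
```

The point of the sketch is that the counts and partition ranges are answered from the manifest list alone, without opening either manifest.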
Scan Planning and Optimization
Role of Manifests and Manifest Lists: When you execute a query against an Iceberg table, manifests and manifest lists work in tandem to intelligently locate the relevant data and streamline the scan process.
Process: Let's break down the key steps involved:
Filtering Manifests: Iceberg starts by filtering out entire manifests it knows are unrelated to your query. It performs this initial filtering efficiently via file counts or by leveraging partition summaries offered in the manifest list.
Predicate Conversion: For more fine-grained filtering, data predicates (the filters in your query) are converted into partition predicates. For example, a filter on a timestamp column can be converted, for a table partitioned by the day of that timestamp, into a filter on the day partition value. These converted predicates are then used to target only the manifests and corresponding data files that may contain matching data.
Metrics Usage: Remember those column metrics within manifests? Iceberg uses these to further narrow down the data files listed in each selected manifest. For example, it can skip a data file entirely when your query's filter falls outside the min/max range recorded for a given column in that file.
Inclusive Projection: Ensuring no data is mistakenly discarded is crucial. When converting filtering predicates, Iceberg favors filtering logic (inclusive projection) that might include some extra rows but guarantees correctness by not mistakenly excluding potential matches.
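The steps above can be sketched in a few lines. This is a simplified model under stated assumptions, not Iceberg's planner: the table is assumed to be partitioned by the day of a hypothetical `event_ts` column, the query filter is `event_ts > ts_lower`, and all class and function names are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List


@dataclass
class FileEntry:
    path: str
    day: date            # partition value (day of event_ts)
    ts_min: datetime     # lower bound of event_ts in the file
    ts_max: datetime     # upper bound of event_ts in the file


@dataclass
class Manifest:
    path: str
    day_min: date        # partition summary lower bound (unused for a lower-bound filter)
    day_max: date        # partition summary upper bound
    files: List[FileEntry]


def project_inclusive(ts_lower: datetime) -> date:
    """Project `event_ts > ts_lower` to `day >= day(ts_lower)`.

    The projected predicate may match extra rows from the boundary day,
    but it never excludes a row that satisfies the original predicate.
    """
    return ts_lower.date()


def plan_scan(manifests: List[Manifest], ts_lower: datetime) -> List[str]:
    day_lower = project_inclusive(ts_lower)
    selected = []
    for m in manifests:
        # Step 1: skip whole manifests using the partition summary.
        if m.day_max < day_lower:
            continue
        for f in m.files:
            # Step 2: skip files whose partition value cannot match.
            if f.day < day_lower:
                continue
            # Step 3: skip files whose column bounds exclude the filter.
            if f.ts_max <= ts_lower:
                continue
            selected.append(f.path)
    return selected


manifests = [
    Manifest("m1.avro", date(2024, 1, 1), date(2024, 1, 2), [
        FileEntry("old.parquet", date(2024, 1, 1),
                  datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 23)),
    ]),
    Manifest("m2.avro", date(2024, 1, 9), date(2024, 1, 11), [
        FileEntry("prev.parquet", date(2024, 1, 9),
                  datetime(2024, 1, 9, 0), datetime(2024, 1, 9, 23)),
        FileEntry("a.parquet", date(2024, 1, 10),
                  datetime(2024, 1, 10, 0), datetime(2024, 1, 10, 11)),
        FileEntry("b.parquet", date(2024, 1, 10),
                  datetime(2024, 1, 10, 11, 30), datetime(2024, 1, 10, 23)),
        FileEntry("c.parquet", date(2024, 1, 11),
                  datetime(2024, 1, 11, 0), datetime(2024, 1, 11, 20)),
    ]),
]

result = plan_scan(manifests, datetime(2024, 1, 10, 12, 0))
print(result)  # prints ['b.parquet', 'c.parquet']
```

Note that `b.parquet` is kept even though some of its rows fall before noon. Metadata filtering only decides which files to read; the remaining row-level filtering happens when the file is scanned.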
Delete File Management
Applying Deletes: Unlike traditional database systems, Iceberg doesn't physically delete data immediately. Instead, for merge-on-read, it relies on delete files. Understanding how those delete files interact with data files during queries is essential:
Equality Deletes: These identify deleted rows by one or more column values (like an ID). An equality delete applies to data files with a strictly older data sequence number and a matching partition. Notably, if the delete file's partition spec is unpartitioned, it acts as a 'global' delete across all partitions.
Position Deletes: These delete rows based on their position within a specific data file. A position delete applies to a data file in the same partition whose data sequence number is less than or equal to the delete file's, so it can remove rows written in the same commit.
Residual Predicates: Even though the focus is on deleting 'older' data, Iceberg uses metrics to prevent unnecessary overhead. Before applying a delete file, it calculates residual predicates – if applying the delete would have no effect on the scan results, it is safely skipped. Think of this as a pre-check to avoid wasted effort.
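The sequence-number and partition rules for the two delete types can be expressed as a small predicate. This is a sketch under simplifying assumptions: the class names are hypothetical, partitions are plain tuples, and `partition=None` is used here to model a delete file written with an unpartitioned spec.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DataFile:
    path: str
    sequence_number: int
    partition: Tuple


@dataclass(frozen=True)
class DeleteFile:
    path: str
    kind: str                      # "equality" or "position"
    sequence_number: int
    partition: Optional[Tuple]     # None models an unpartitioned (global) delete


def delete_applies(delete: DeleteFile, data: DataFile) -> bool:
    """Return True if the delete file must be applied when reading the data file."""
    if delete.kind == "equality":
        # Equality deletes only affect strictly older data.
        older = data.sequence_number < delete.sequence_number
        same_partition = delete.partition is None or delete.partition == data.partition
        return older and same_partition
    if delete.kind == "position":
        # Position deletes also affect data written in the same commit.
        not_newer = data.sequence_number <= delete.sequence_number
        return not_newer and delete.partition == data.partition
    raise ValueError(f"unknown delete kind: {delete.kind}")


data = DataFile("d1.parquet", sequence_number=5, partition=("2024-01-10",))
eq_same_commit = DeleteFile("e1.parquet", "equality", 5, ("2024-01-10",))
eq_global = DeleteFile("e2.parquet", "equality", 6, None)
pos_same_commit = DeleteFile("p1.parquet", "position", 5, ("2024-01-10",))

print(delete_applies(eq_same_commit, data))   # prints False
print(delete_applies(eq_global, data))        # prints True
print(delete_applies(pos_same_commit, data))  # prints True
```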
Snapshot Reference, Branches, and Tags
Iceberg Branching and Tagging: A significant advantage of Iceberg is its built-in features for versioning. The table format supports the creation of branches and tags that essentially label specific table snapshots. Branches are mutable, like in traditional version control systems, letting you update them to point to newer snapshots. Tags offer a way to mark specific moments in the table's history.
Snapshot Reference Object: Iceberg tracks branches and tags using a Snapshot Reference object. It contains the following information:
Snapshot ID: The unique identifier of the snapshot referenced by the tag or branch.
Type: Identifies whether the reference is a 'tag' or a 'branch'.
Retention Policy: Customizable rules governing how long a snapshot within a reference will be retained (we'll touch on this more in the next section).
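As a rough illustration of the reference object, the sketch below models a table's references as a dictionary from name to a small record. The class, its field names, and the `advance_branch` helper are hypothetical and exist only to show that branches can move while tags stay fixed.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class SnapshotRef:
    """Hypothetical model of a snapshot reference."""
    snapshot_id: int
    ref_type: str                          # "branch" or "tag"
    max_ref_age_ms: Optional[int] = None   # retention setting; None means unset


def advance_branch(refs: Dict[str, SnapshotRef], name: str,
                   new_snapshot_id: int) -> Dict[str, SnapshotRef]:
    """Return a new refs mapping with the branch pointing at a newer snapshot."""
    ref = refs[name]
    if ref.ref_type != "branch":
        raise ValueError(f"{name!r} is a tag and cannot be moved")
    return {**refs, name: replace(ref, snapshot_id=new_snapshot_id)}


refs = {
    "main": SnapshotRef(snapshot_id=1001, ref_type="branch"),
    "release-q1": SnapshotRef(snapshot_id=1001, ref_type="tag",
                              max_ref_age_ms=90 * 86_400_000),
}

updated = advance_branch(refs, "main", 1002)
print(updated["main"].snapshot_id, updated["release-q1"].snapshot_id)  # prints 1002 1001
```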
Snapshot Retention Policy
Snapshot Expiration: Imagine keeping all snapshots from the creation of your Iceberg table – quite unnecessary and storage-intensive! Snapshots naturally expire according to a user-defined retention policy. This allows deleted or superseded data files to be physically removed.
Configuring Retention Policy: Iceberg offers customization to control snapshot aging in several ways:
Minimum snapshots to keep: Ensures a minimum number of snapshots are always kept on a branch, regardless of other settings, preserving a basic history.
Maximum snapshot age: Provides a time-based cap. Snapshots older than this threshold become eligible for expiration unless the minimum-count setting still protects them.
Maximum reference age: This setting applies to the references themselves (except the main branch, which never expires). If a tag or branch's referenced snapshot is older than this value, the reference expires.
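The interaction of these settings can be sketched as follows. This is an illustrative model, not Iceberg's expiration code: the function and parameter names are hypothetical, and it assumes the minimum-count rule takes precedence over the age cap, as the first setting above describes.

```python
from typing import List, Tuple

DAY_MS = 86_400_000


def snapshots_to_retain(branch_snapshots: List[Tuple[int, int]], now_ms: int,
                        min_snapshots_to_keep: int,
                        max_snapshot_age_ms: int) -> List[int]:
    """Pick snapshot IDs to keep on one branch.

    branch_snapshots is a list of (snapshot_id, timestamp_ms), newest first.
    A snapshot is kept if it is among the newest N or is young enough.
    """
    keep = []
    for index, (snapshot_id, timestamp_ms) in enumerate(branch_snapshots):
        within_minimum = index < min_snapshots_to_keep
        young_enough = now_ms - timestamp_ms <= max_snapshot_age_ms
        if within_minimum or young_enough:
            keep.append(snapshot_id)
    return keep


def ref_expired(name: str, ref_snapshot_ts_ms: int, now_ms: int,
                max_ref_age_ms: int) -> bool:
    """A reference expires once its snapshot is too old; main never expires."""
    if name == "main":
        return False
    return now_ms - ref_snapshot_ts_ms > max_ref_age_ms


now = 100 * DAY_MS
history = [(104, 100 * DAY_MS), (103, 99 * DAY_MS),
           (102, 94 * DAY_MS), (101, 92 * DAY_MS)]

kept = snapshots_to_retain(history, now, min_snapshots_to_keep=3,
                           max_snapshot_age_ms=5 * DAY_MS)
print(kept)  # prints [104, 103, 102]
print(ref_expired("audit", 90 * DAY_MS, now, 7 * DAY_MS))  # prints True
```

In the example, snapshot 102 is older than the five-day cap but survives because it is still among the three newest, while snapshot 101 fails both checks and becomes eligible for expiration.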
Conclusion
Iceberg manifests and manifest lists provide the metadata backbone that powers table organization and query efficiency. Understanding these structures offers crucial insight into the logic driving data selection and the mechanics of how Iceberg elegantly handles large and evolving datasets. The table format's flexibility with schemas, partitions, and its snapshot mechanism offer exciting possibilities for building robust data applications.