V4 colocated dv writes by stevenzwu · Pull Request #19 · stevenzwu/iceberg

stevenzwu · 2026-06-12T20:25:16Z

Stacked implementation up to Phase 6, following this plan: apache#16694

…pache#16924) SparkValueConverter.convertToSpark returned a new UnsupportedOperationException for STRUCT, LIST, and MAP instead of throwing it, so the exception object would be passed downstream as a value. Throw it to match the method's own default branch.
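The return-versus-throw bug described in this commit message can be illustrated with a minimal sketch. The class, enum, and method names below are hypothetical stand-ins chosen for illustration; this is not the actual SparkValueConverter source.

```java
// Hypothetical stand-in types; not the actual SparkValueConverter code.
final class ReturnVsThrowSketch {
  enum Kind { INT, STRING, STRUCT, LIST, MAP }

  // Buggy shape: the exception is constructed and returned, so the caller
  // receives an exception object as if it were a converted value.
  static Object convertBuggy(Kind kind, Object value) {
    switch (kind) {
      case INT:
      case STRING:
        return value;
      case STRUCT:
      case LIST:
      case MAP:
        return new UnsupportedOperationException("Complex types not supported: " + kind);
      default:
        throw new UnsupportedOperationException("Unsupported type: " + kind);
    }
  }

  // Fixed shape: unsupported kinds throw, matching the default branch.
  static Object convertFixed(Kind kind, Object value) {
    switch (kind) {
      case INT:
      case STRING:
        return value;
      case STRUCT:
      case LIST:
      case MAP:
        throw new UnsupportedOperationException("Complex types not supported: " + kind);
      default:
        throw new UnsupportedOperationException("Unsupported type: " + kind);
    }
  }
}
```

In the buggy variant nothing fails at the call site; the problem only shows up later, when downstream code treats the exception object as data.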

…rter (apache#16925)

…e#16912) This reverts commit e5151f3.

Remove path filters from the ASF allowlist workflow so it runs for all pull requests and pushes to main. This surfaces upstream approved-pattern drift as a visible check failure even when a pull request does not edit workflow files.

Fixes apache#16914

Generated-by: Codex
Co-authored-by: Codex <codex@openai.com>

…value is null (apache#16826)

* kafka-connect: evolve table schema when record schema is updated but value is null
  Signed-off-by: Thomas Thornton <thomaswilliamthornton@gmail.com>
* kafka connect: fix nested schema evolution when parent evolves, add map key evolution
  Signed-off-by: Thomas Thornton <thomaswilliamthornton@gmail.com>
* Fix style
  Signed-off-by: Thomas Thornton <thomaswilliamthornton@gmail.com>
* kafka connect: improve docs and testing for evolve schema when value is null
  Signed-off-by: Thomas Thornton <thomaswilliamthornton@gmail.com>
* kafka connect: defer nested null value evolution when parent evolves, drop map key recursion
  Signed-off-by: Thomas Thornton <thomaswilliamthornton@gmail.com>

---------

Signed-off-by: Thomas Thornton <thomaswilliamthornton@gmail.com>

…4 write paths

Adds status(), dataSequenceNumber(), fileSequenceNumber(), and firstRowId() setters to TrackedFileBuilder. These bypass the TrackingBuilder.added/from chain and construct the Tracking directly in build(), needed for v4 manifest references (explicit data/file sequence numbers without a source TrackedFile) and v4 non-ADDED transitions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Introduce two package-private helpers used by the v4 manifest writer path:

TrackedFileWrapper is a StructLike adapter exposing a TrackedFile via positional access matching TrackedFile.schemaWithContentStats(...). It mirrors V4Metadata.ManifestEntryWrapper for the new content_entry layout, supporting Parquet writes without materializing intermediate records.

ContentEntryAdapter converts legacy ManifestEntry and ManifestFile inputs into TrackedFileStruct rows. The leaf factories accept any ManifestEntry whose file is DATA or EQUALITY_DELETES, so ManifestWriter.V4Writer and V4DeleteWriter share one entry point. A DV-specific overload takes ManifestEntry<DataFile> for colocated deletion vector emission used by Phase 6's REPLACED/MODIFIED pairs. fromManifestFile remains for root manifest references and accepts the writer_format_version override (1 for v4 leaves, 0 for v3 leaves carried through a v3->v4 upgrade).

Status derivation is delegated to TrackingBuilder; content stats construction goes through MetricsUtil.fromMetrics. ManifestEntry.Status is mapped to EntryStatus in toEntryStatus(...), the only mapping needed since the legacy enum has no REPLACED or MODIFIED.

Per the v4 plan, validation of writer_format_version against the supported set lives in the Phase 2 ContentEntryReader, not at storage or write time. The adapter validates only on the manifest factory, the single caller that may legitimately set 0.

No callers wired yet; round-trip tests land with the Phase 2 reader and writer rewrite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
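A minimal sketch of the legacy-to-v4 status mapping this commit describes. Both enums below are hypothetical mirrors written for illustration (the commit only states that the legacy enum lacks REPLACED and MODIFIED), and the method body is an assumption rather than the actual toEntryStatus(...) code.

```java
final class StatusMappingSketch {
  // Hypothetical mirror of the legacy ManifestEntry.Status values.
  enum LegacyStatus { EXISTING, ADDED, DELETED }

  // Hypothetical mirror of the v4 status enum; REPLACED and MODIFIED
  // have no legacy counterpart, so nothing maps onto them here.
  enum EntryStatus { EXISTING, ADDED, DELETED, REPLACED, MODIFIED }

  // One-way mapping from legacy status to v4 status.
  static EntryStatus toEntryStatus(LegacyStatus status) {
    switch (status) {
      case EXISTING:
        return EntryStatus.EXISTING;
      case ADDED:
        return EntryStatus.ADDED;
      case DELETED:
        return EntryStatus.DELETED;
      default:
        throw new IllegalArgumentException("Unknown status: " + status);
    }
  }
}
```

Because the legacy side has only three values, the mapping is total in one direction, which is consistent with the commit's note that no reverse mapping is needed.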

…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged.

Read-path dispatch in ManifestFiles is layered:

1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:

- TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
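The layered read-path dispatch described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the ManifestFiles code: the class and method names are hypothetical, Avro manifests are detected by an `.avro` suffix (an assumption), the footer is modelled as a lazily supplied set of field ids, and rejecting hints other than 0 or 1 is also an assumption.

```java
import java.util.Set;
import java.util.function.Supplier;

final class ReadDispatchSketch {
  enum ReaderKind { LEGACY, CONTENT_ENTRY }

  // Field ids named in the commit message.
  static final int CONTENT_TYPE_FIELD_ID = 134;
  static final int TRACKING_FIELD_ID = 147;

  // footerFieldIds is only invoked in step 3, so steps 1 and 2 never touch the file.
  static ReaderKind choose(
      String location, Integer writerFormatVersion, Supplier<Set<Integer>> footerFieldIds) {
    // Step 1: Avro manifests are always legacy (suffix check is an assumption).
    if (location.endsWith(".avro")) {
      return ReaderKind.LEGACY;
    }

    // Step 2: an explicit hint decides without inspecting the file.
    if (writerFormatVersion != null) {
      if (writerFormatVersion == 1) {
        return ReaderKind.CONTENT_ENTRY;
      } else if (writerFormatVersion == 0) {
        return ReaderKind.LEGACY;
      }
      // Assumption: any other hint value is rejected.
      throw new IllegalArgumentException(
          "Unsupported writer_format_version: " + writerFormatVersion);
    }

    // Step 3: no hint, so fall back to inspecting the footer schema.
    Set<Integer> ids = footerFieldIds.get();
    boolean isContentEntry =
        ids.contains(CONTENT_TYPE_FIELD_ID) || ids.contains(TRACKING_FIELD_ID);
    return isContentEntry ? ReaderKind.CONTENT_ENTRY : ReaderKind.LEGACY;
  }
}
```

Passing the footer as a `Supplier` keeps the costly inspection out of the first two layers, which matches the ordering in the list above.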

Introduce the v4 root manifest write/read pair, the replacement for the manifest list in format version 4.

RootManifestWriter emits content_entry Parquet rows (content_type=DATA_MANIFEST or DELETE_MANIFEST) with manifest_info counts from each ManifestFile. The writer accepts an explicit writer_format_version (0 for legacy v3 leaves carried over in a v3->v4 upgrade; 1 for v4 content_entry leaves) so Phase 5's SnapshotProducer can set it correctly per entry.

RootManifestReader reconstructs GenericManifestFile objects from those rows. Direct data-file entries (the small-write optimization) are skipped with a DEBUG log; that path is deferred to a future phase.

RootManifests is the static factory (analogous to ManifestLists) with write() and two read() overloads.

TestRootManifest covers round-trips for data/delete manifests, key metadata, multiple manifests, legacy writer_format_version=0 entries, and the version guard on write().

The empty Parquet group limitation is resolved by using a single dummy optional boolean field (field id 99999/_unpartitioned) for the partition struct and a separate dummy (field id 99998/_no_stats) for content_stats; both are always null on write and ignored on read.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Phase 4 of the v4 metadata-tree write-path plan.

Adds rootManifestLocation() to the Snapshot API, plumbs formatVersion and rootManifestLocation fields through BaseSnapshot with a constructor that enforces exactly one of manifest-list or root-manifest is set, updates SnapshotParser to read/write the new root-manifest JSON key, and dispatches cacheManifests() to RootManifests.read() for format version >= 4.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
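The "exactly one of manifest-list or root-manifest" constraint mentioned in this commit can be sketched as a constructor check. The class below is a hypothetical illustration, not the BaseSnapshot implementation; the field names and the exception type are assumptions.

```java
final class SnapshotLocationSketch {
  private final String manifestListLocation;
  private final String rootManifestLocation;

  SnapshotLocationSketch(String manifestListLocation, String rootManifestLocation) {
    // Both null or both non-null violates the "exactly one" rule.
    if ((manifestListLocation == null) == (rootManifestLocation == null)) {
      throw new IllegalArgumentException(
          "Exactly one of manifest-list or root-manifest must be set");
    }
    this.manifestListLocation = manifestListLocation;
    this.rootManifestLocation = rootManifestLocation;
  }

  String manifestListLocation() {
    return manifestListLocation;
  }

  String rootManifestLocation() {
    return rootManifestLocation;
  }
}
```

Comparing the two null checks for equality covers both invalid cases (neither set, both set) in a single condition.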

Wire SnapshotProducer.apply() to write a root manifest (.parquet) instead of a manifest list (.avro) when the table format version is >= 4. The v4 path uses RootManifests.write(), derives ADDED/EXISTING status per manifest from snapshotId comparison, and carries firstRowId + addedRows for row lineage. The v1-v3 path is unchanged.

The commit() cleanup now resolves the committed location through both manifestListLocation() and rootManifestLocation() so it handles both v3 and v4 snapshots.

RootManifestWriter gains a three-argument add() overload that accepts an explicit EntryStatus, needed to emit EXISTING for carried-over leaf manifests in multi-snapshot root manifests.

TestV4SnapshotProducer covers: single append (root manifest .parquet, DATA_MANIFEST entry ADDED with writer_format_version=1), two appends (first leaf EXISTING, second ADDED), and delete-file (rewritten leaf ADDED, unchanged leaf EXISTING, deleted file DELETED in rewritten leaf).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
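As a rough illustration of deriving ADDED/EXISTING status from a snapshot id comparison: the sketch below assumes a leaf manifest counts as ADDED when the snapshot that added it is the one being produced, and EXISTING otherwise. The class, enum, and parameter names are hypothetical, not the SnapshotProducer code.

```java
final class RootEntryStatusSketch {
  // Hypothetical subset of the v4 entry statuses relevant to root manifest entries.
  enum EntryStatus { ADDED, EXISTING }

  // manifestSnapshotId: id of the snapshot that added the leaf manifest.
  // currentSnapshotId: id of the snapshot currently being produced.
  static EntryStatus statusFor(long manifestSnapshotId, long currentSnapshotId) {
    return manifestSnapshotId == currentSnapshotId
        ? EntryStatus.ADDED
        : EntryStatus.EXISTING;
  }
}
```

With two appends, the leaf from the first snapshot compares unequal to the second snapshot's id and is carried as EXISTING, while the new leaf is ADDED, matching the two-append case listed for TestV4SnapshotProducer.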

…airing Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

stevenzwu force-pushed the v4_colocated_dv_writes branch 4 times, most recently from 4df2623 to f2e1162 on June 16, 2026 18:34

stevenzwu force-pushed the v4_colocated_dv_writes branch 16 times, most recently from 727354a to 847273c on June 22, 2026 04:35

thswlsqls and others added 6 commits June 22, 2026 10:48

Spark 3.5, 4.0: Throw on unsupported complex types in SparkValueConve…

fb6bb97

…rter (apache#16925)

Revert "Spark: Spark tests cache rewrite input (apache#16740)" (apach…

902ed1b

…e#16912) This reverts commit e5151f3.

Core: Introduce builder for TrackedFile (apache#16769)

7c13104

stevenzwu force-pushed the v4_colocated_dv_writes branch from 847273c to c24aa44 on June 22, 2026 23:55

stevenzwu force-pushed the v4_colocated_dv_writes branch from c24aa44 to 13e208d on June 23, 2026 04:59

stevenzwu and others added 5 commits June 22, 2026 22:53

Core: Colocate DVs in v4 leaf data manifests with REPLACED/MODIFIED p…

cdc8e56

…airing Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

stevenzwu force-pushed the v4_colocated_dv_writes branch from 13e208d to cdc8e56 on June 23, 2026 06:02

stevenzwu wants to merge 13 commits into main from v4_colocated_dv_writes

stevenzwu commented Jun 12, 2026


7 participants
