DRAFT: monotonic row timestamp#18
Open
stevenzwu wants to merge 5 commits into
Open
Conversation
… tables. Made-with: Cursor Model: claude-4.6-opus-high-thinking
9075c9e to
0116c2d
Compare
edc3c53 to
478dfc6
Compare
…w timestamp Add a new commit_timestamp_ms field to ManifestEntry and ManifestFile, threading the snapshot timestamp through the metadata layer for V4 tables. This enables the _last_updated_timestamp_ms metadata column which inherits from the snapshot timestamp, mirroring _last_updated_sequence_number. Includes test coverage in TestManifestWriterVersions, TestRowLineageAssignment, and TestRowLineageMetadata verifying V4 inheritance and pre-V4 null behavior. TestRowLineageAssignment is now parameterized with V3_AND_ABOVE. Made-with: Cursor Model: claude-4.6-opus-high-thinking Made-with: Cursor
478dfc6 to
cec91dd
Compare
Follow-up review fixes for the V4 row-timestamp feature. Tightens the commit_timestamp_ms contract end-to-end so the per-row _last_updated_timestamp_ms metadata column behaves correctly across manifest rewrites, table-path rewrites, compaction, and V3-to-V4 upgrades, and so V3 (and earlier) tables continue to read it as null. Core - TestManifestWriterVersions#testV4Write: assert pre-inheritance commit_timestamp_ms is null (InheritableMetadataFactory normalizes the UNASSIGNED_TS sentinel away), not the raw -1L sentinel. - ManifestWriter / RollingManifestWriter: add Long commitTimestampMs overloads on existing/delete and a new add(F, long, Long) overload; keep the previous signatures as @deprecated thin wrappers that pass null. The package-private add(ManifestEntry) now propagates the entry's commitTimestampMs through wrapAppend so merges and copies preserve the file's original commit time. - GenericManifestEntry: widen wrapAppend to accept commitTimestampMs (older overloads delegate with null) so callers can preserve the original timestamp when re-emitting ADDED entries. - GenericManifestFile#toString: include commit_timestamp_ms. - RewriteTablePathUtil#appendEntryWithFile: preserve dataSequenceNumber and commitTimestampMs on ADDED entries (with a V1 fallback to the simple add(file) when dataSequenceNumber is null). - ValueReaders.LastUpdatedTimestampReader: accept Long (boxed) and guard against null construction; the public factory method already routes nulls to a constant-null reader. - Tests (TestManifestWriter, TestRowLineageAssignment, TestRowLineageMetadata): cover the new overloads, the deprecated wrappers, and explicitly assert null entry/file commit_timestamp_ms on V3 and earlier. Spark (3.4, 3.5, 4.0, 4.1) - RewriteManifestsSparkAction: project commit_timestamp_ms in the manifest-entry DataFrame and pass it through to writer.existing(...) so V4 manifest rewrites no longer silently drop the field. - TestRewriteManifestsAction (4.1): regression test asserting per-file commit_timestamp_ms survives a manifest rewrite. - TestRewriteDataFilesAction (4.1): regression test asserting per-row _last_updated_timestamp_ms is preserved across compaction and is not collapsed to the rewrite snapshot's commit timestamp. - TestRowLevelOperationsWithLineage (4.1 extensions): assert pre-V4 tables expose null _last_updated_timestamp_ms for every row in MERGE / UPDATE flows, and add a V3-to-V4 upgrade test pinning that pre-upgrade rows continue to read as null after the format upgrade. Made-with: Cursor Model: claude-4.7-opus
…3->V4 upgrade comment) Cosmetic and clarity-only follow-ups uncovered while re-reviewing 197c9ac. No behaviour change in production code. - Apply spotless to three v3.4 / v3.5 / v4.0 source files (SparkCopyOnWriteOperation, SparkPositionDeltaOperation, SparkWriteBuilder) that were left mis-wrapped by the LAST_UPDATED_TIMESTAMP_MS plumbing in cec91dd. - Apply spotless to a few formatting drifts in core / v4.1 tests that the previous commit landed unformatted (GenericManifestEntry, TestManifestWriter, TestRewriteDataFilesAction). - Tidy ManifestWriter#add(ManifestEntry) into a single reused.wrapAppend(...) call; eliminates an unnecessary if/else now that wrapAppend(Long, Long, Long, F) handles a null dataSequenceNumber identically to the long overload. - Document why testV3ToV4UpgradePreservesPreUpgradeNullTimestamp does NOT (yet) assert that rows appended *after* the format upgrade carry the upgrade snapshot's commit_timestamp_ms. Empirically the existing V4 test testMergeIntoWithOnlyMatched is also broken on this branch: the Parquet vectorized reader returns the data file's literal placeholder (0L) instead of the inherited manifest commit timestamp, and the non-vectorized reader returns null. Pinning the post-upgrade half of the contract requires fixes to the data-file writer or readers that are out of scope here; the test now flags the missing assertion in a comment so it is not lost. Made-with: Cursor Model: claude-4.7-opus
…lanning Three related fixes that complete the V4 _last_updated_timestamp_ms contract for the Spark MERGE/UPDATE/DELETE path and for distributed scan planning. Spark: SparkPositionDeltaWrite.Context (v4.0 + v4.1) - The Spark-side dataSparkType was augmenting the schema with ROW_ID and LAST_UPDATED_SEQUENCE_NUMBER, but not LAST_UPDATED_TIMESTAMP_MS. Iceberg's dataSchema already carried all three lineage columns (via schemaWithRowLineage), so the writer threw "Structs do not match" on V4 MERGE/UPDATE/DELETE writes. Append the third lineage column to dataSparkType so the Iceberg and Spark schemas stay in sync. Spark: ManifestFileBean (v4.0 + v4.1) - ManifestFileBean drives DISTRIBUTED scan planning in SparkDistributedDataScan: the driver projects each ManifestFile into a bean, ships the bean to executors, and reads manifests there. The bean was missing commitTimestampMs, so InheritableMetadataFactory on the executor saw a null manifest commit_timestamp_ms and could not inherit it onto V4 ADDED entries. Reads through DISTRIBUTED planning therefore returned null _last_updated_timestamp_ms for newly-added V4 rows, even though LOCAL planning returned the correct timestamp. Add the commitTimestampMs field + getter/setter to the bean and propagate it from manifest.commitTimestampMs() in fromManifest(). Tests: TestRowLevelOperationsWithLineage (v4.0 + v4.1) - INITIAL_RECORDS and createRecords now pass null for _last_updated_timestamp_ms so the fixtures simulate production rows (no row-level timestamp written) and exercise inheritance from the manifest. createRecord widens the parameter to Long. - v4.1 testV3ToV4UpgradeTimestampInheritance (renamed from testV3ToV4UpgradePreservesPreUpgradeNullTimestamp) now INSERTs a row via Spark SQL after the format upgrade and asserts both halves of the contract: pre-upgrade rows continue to read null (their V3 manifests have no commit_timestamp_ms) and the post-upgrade row inherits the new V4 snapshot's commit_timestamp_ms. Without the ManifestFileBean fix above this assertion failed under DISTRIBUTED planning. Made-with: Cursor Model: claude-4.7-opus-thinking
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.