Skip to content

DRAFT: monotonic row timestamp#18

Open
stevenzwu wants to merge 5 commits into
mainfrom
row-timestamp
Open

DRAFT: monotonic row timestamp#18
stevenzwu wants to merge 5 commits into
mainfrom
row-timestamp

Conversation

@stevenzwu

Copy link
Copy Markdown
Owner

No description provided.

… tables.

Made-with: Cursor
Model: claude-4.6-opus-high-thinking
…w timestamp

Add a new commit_timestamp_ms field to ManifestEntry and ManifestFile,
threading the snapshot timestamp through the metadata layer for V4 tables.
This enables the _last_updated_timestamp_ms metadata column which inherits
from the snapshot timestamp, mirroring _last_updated_sequence_number.

Includes test coverage in TestManifestWriterVersions,
TestRowLineageAssignment, and TestRowLineageMetadata verifying
V4 inheritance and pre-V4 null behavior. TestRowLineageAssignment
is now parameterized with V3_AND_ABOVE.

Made-with: Cursor
Model: claude-4.6-opus-high-thinking
Made-with: Cursor
Follow-up review fixes for the V4 row-timestamp feature. Tightens the
commit_timestamp_ms contract end-to-end so the per-row
_last_updated_timestamp_ms metadata column behaves correctly across
manifest rewrites, table-path rewrites, compaction, and V3-to-V4
upgrades, and so V3 (and earlier) tables continue to read it as null.

Core
- TestManifestWriterVersions#testV4Write: assert pre-inheritance
  commit_timestamp_ms is null (InheritableMetadataFactory normalizes the
  UNASSIGNED_TS sentinel away), not the raw -1L sentinel.
- ManifestWriter / RollingManifestWriter: add Long commitTimestampMs
  overloads on existing/delete and a new add(F, long, Long) overload;
  keep the previous signatures as @deprecated thin wrappers that pass
  null. The package-private add(ManifestEntry) now propagates the
  entry's commitTimestampMs through wrapAppend so merges and copies
  preserve the file's original commit time.
- GenericManifestEntry: widen wrapAppend to accept commitTimestampMs
  (older overloads delegate with null) so callers can preserve the
  original timestamp when re-emitting ADDED entries.
- GenericManifestFile#toString: include commit_timestamp_ms.
- RewriteTablePathUtil#appendEntryWithFile: preserve dataSequenceNumber
  and commitTimestampMs on ADDED entries (with a V1 fallback to the
  simple add(file) when dataSequenceNumber is null).
- ValueReaders.LastUpdatedTimestampReader: accept Long (boxed) and
  guard against null construction; the public factory method already
  routes nulls to a constant-null reader.
- Tests (TestManifestWriter, TestRowLineageAssignment,
  TestRowLineageMetadata): cover the new overloads, the deprecated
  wrappers, and explicitly assert null entry/file commit_timestamp_ms
  on V3 and earlier.

Spark (3.4, 3.5, 4.0, 4.1)
- RewriteManifestsSparkAction: project commit_timestamp_ms in the
  manifest-entry DataFrame and pass it through to writer.existing(...)
  so V4 manifest rewrites no longer silently drop the field.
- TestRewriteManifestsAction (4.1): regression test asserting
  per-file commit_timestamp_ms survives a manifest rewrite.
- TestRewriteDataFilesAction (4.1): regression test asserting
  per-row _last_updated_timestamp_ms is preserved across compaction
  and is not collapsed to the rewrite snapshot's commit timestamp.
- TestRowLevelOperationsWithLineage (4.1 extensions): assert pre-V4
  tables expose null _last_updated_timestamp_ms for every row in
  MERGE / UPDATE flows, and add a V3-to-V4 upgrade test pinning that
  pre-upgrade rows continue to read as null after the format upgrade.

Made-with: Cursor
Model: claude-4.7-opus
…3->V4 upgrade comment)

Cosmetic and clarity-only follow-ups uncovered while re-reviewing
197c9ac. No behaviour change in production code.

- Apply spotless to three v3.4 / v3.5 / v4.0 source files
  (SparkCopyOnWriteOperation, SparkPositionDeltaOperation,
  SparkWriteBuilder) that were left mis-wrapped by the
  LAST_UPDATED_TIMESTAMP_MS plumbing in cec91dd.
- Apply spotless to a few formatting drifts in core / v4.1 tests that
  the previous commit landed unformatted (GenericManifestEntry,
  TestManifestWriter, TestRewriteDataFilesAction).
- Tidy ManifestWriter#add(ManifestEntry) into a single
  reused.wrapAppend(...) call; eliminates an unnecessary if/else now
  that wrapAppend(Long, Long, Long, F) handles a null
  dataSequenceNumber identically to the long overload.
- Document why testV3ToV4UpgradePreservesPreUpgradeNullTimestamp does
  NOT (yet) assert that rows appended *after* the format upgrade carry
  the upgrade snapshot's commit_timestamp_ms. Empirically the existing
  V4 test testMergeIntoWithOnlyMatched is also broken on this branch:
  the Parquet vectorized reader returns the data file's literal
  placeholder (0L) instead of the inherited manifest commit timestamp,
  and the non-vectorized reader returns null. Pinning the post-upgrade
  half of the contract requires fixes to the data-file writer or
  readers that are out of scope here; the test now flags the missing
  assertion in a comment so it is not lost.

Made-with: Cursor
Model: claude-4.7-opus
…lanning

Three related fixes that complete the V4 _last_updated_timestamp_ms
contract for the Spark MERGE/UPDATE/DELETE path and for distributed
scan planning.

Spark: SparkPositionDeltaWrite.Context (v4.0 + v4.1)
- The Spark-side dataSparkType was augmenting the schema with ROW_ID
  and LAST_UPDATED_SEQUENCE_NUMBER, but not LAST_UPDATED_TIMESTAMP_MS.
  Iceberg's dataSchema already carried all three lineage columns (via
  schemaWithRowLineage), so the writer threw "Structs do not match"
  on V4 MERGE/UPDATE/DELETE writes. Append the third lineage column to
  dataSparkType so the Iceberg and Spark schemas stay in sync.

Spark: ManifestFileBean (v4.0 + v4.1)
- ManifestFileBean drives DISTRIBUTED scan planning in
  SparkDistributedDataScan: the driver projects each ManifestFile into
  a bean, ships the bean to executors, and reads manifests there. The
  bean was missing commitTimestampMs, so InheritableMetadataFactory on
  the executor saw a null manifest commit_timestamp_ms and could not
  inherit it onto V4 ADDED entries. Reads through DISTRIBUTED planning
  therefore returned null _last_updated_timestamp_ms for newly-added
  V4 rows, even though LOCAL planning returned the correct timestamp.
  Add the commitTimestampMs field + getter/setter to the bean and
  propagate it from manifest.commitTimestampMs() in fromManifest().

Tests: TestRowLevelOperationsWithLineage (v4.0 + v4.1)
- INITIAL_RECORDS and createRecords now pass null for
  _last_updated_timestamp_ms so the fixtures simulate production
  rows (no row-level timestamp written) and exercise inheritance from
  the manifest. createRecord widens the parameter to Long.
- v4.1 testV3ToV4UpgradeTimestampInheritance (renamed from
  testV3ToV4UpgradePreservesPreUpgradeNullTimestamp) now INSERTs a row
  via Spark SQL after the format upgrade and asserts both halves of
  the contract: pre-upgrade rows continue to read null (their V3
  manifests have no commit_timestamp_ms) and the post-upgrade row
  inherits the new V4 snapshot's commit_timestamp_ms. Without the
  ManifestFileBean fix above this assertion failed under DISTRIBUTED
  planning.

Made-with: Cursor
Model: claude-4.7-opus-thinking
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant