Arrow: Fix ClassCastException in vectorized reader on int-to-long pro… by xndai · Pull Request #16343 · apache/iceberg

xndai · 2026-05-14T22:25:51Z

…motion with INT logical type

Fix ClassCastException: BigIntVector cannot be cast to IntVector when reading Parquet files with INT(32, true) logical type annotation after promoting a column from int to long.

The vectorized reader's LogicalTypeVisitor now allocates vectors based on the Parquet physical type instead of deriving them from the (potentially promoted) Iceberg schema type.

Root Cause:
In VectorizedArrowReader.allocateFieldVector(), the Arrow field was created from the Iceberg schema type (which reflects the promoted LongType), producing a BigIntVector. The LogicalTypeVisitor then cast this vector to IntVector based on the Parquet file's INT(32) logical type, causing the mismatch.

The non-vectorized reader (BaseParquetReaders) already handles this correctly by checking the expected Iceberg type and using IntAsLongReader for promotion. The vectorized reader relies on the accessor layer for widening (IntAccessor.getLong() widens int to long), so the fix ensures the vector matches the physical data layout.
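The mismatch described above can be reproduced in miniature without Arrow on the classpath. The sketch below is a simplified model, not the actual reader code: `Int32Column`, `Int64Column`, and `getLong` are hypothetical stand-ins for Arrow's IntVector, BigIntVector, and the accessor layer. It shows why allocating from the promoted schema type fails at the cast, and why allocating from the physical type and widening on access works.

```java
// Hypothetical stand-ins for illustration; not the real Arrow or Iceberg classes.
class PromotionSketch {
  interface Column {
    int rowCount();
  }

  // Models a 32-bit vector (the role IntVector plays in the PR).
  static final class Int32Column implements Column {
    private final int[] data;

    Int32Column(int... data) {
      this.data = data;
    }

    public int rowCount() {
      return data.length;
    }

    int get(int row) {
      return data[row];
    }
  }

  // Models a 64-bit vector (the role BigIntVector plays in the PR).
  static final class Int64Column implements Column {
    private final long[] data;

    Int64Column(int rows) {
      this.data = new long[rows];
    }

    public int rowCount() {
      return data.length;
    }
  }

  // Before the fix: the container is allocated from the promoted schema type
  // (long), but the decoding path for INT(32) data casts it to the 32-bit type.
  static boolean allocateFromSchemaTypeFails() {
    Column allocated = new Int64Column(4);
    try {
      Int32Column target = (Int32Column) allocated; // throws ClassCastException
      return target.rowCount() < 0; // never reached
    } catch (ClassCastException e) {
      return true;
    }
  }

  // After the fix: the container matches the physical layout, and the
  // accessor widens each value from int to long when it is read.
  static long getLong(Int32Column column, int row) {
    return column.get(row); // implicit int-to-long widening
  }

  public static void main(String[] args) {
    System.out.println("cast fails: " + allocateFromSchemaTypeFails());
    Int32Column physical = new Int32Column(1, 2, 3, Integer.MAX_VALUE);
    System.out.println("widened: " + getLong(physical, 3));
  }
}
```

Widening from int to long is lossless in Java, which is why the accessor can do it on every read.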

Tests:

testIntToLongPromotionWithLogicalType: verifies reading after promotion when file has INT(32, true) annotation (the reported crash)
testIntToLongPromotionWithoutLogicalType: verifies reading after promotion when file has bare INT32

Fixes #16341

CTTY

LGTM! just one minor comment

CTTY · 2026-05-15T00:10:58Z

-        // Iceberg has no unsigned integer type. Reading UINT32 into a 32-bit signed value would
-        // silently produce negative results for inputs above Integer.MAX_VALUE. UINT8 and UINT16
-        // both fit losslessly in a signed int32 and are allowed, matching the policy in
-        // BaseParquetReaders for the non-vectorized path.


Why do we remove this comment? It still looks relevant.

I thought the check below was self-explanatory. I've added the comment back.
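For readers following this thread, the arithmetic behind the restored comment can be checked with plain Java. This is only an illustrative sketch, not code from the reader: the class `UnsignedSketch` and its helper names are made up for the example.

```java
// Illustrative only; these names do not exist in Iceberg.
class UnsignedSketch {
  static final long UINT8_MAX = 0xFFL;
  static final long UINT16_MAX = 0xFFFFL;
  static final long UINT32_MAX = 0xFFFF_FFFFL;

  // Keeps only the low 32 bits and reinterprets them as a signed int, which is
  // what storing unsigned 32-bit data in a signed 32-bit slot amounts to.
  static int storeAsSignedInt32(long unsignedValue) {
    return (int) unsignedValue;
  }

  // True when the signed slot still holds the original numeric value.
  static boolean roundTrips(long unsignedValue) {
    return storeAsSignedInt32(unsignedValue) == unsignedValue;
  }

  public static void main(String[] args) {
    System.out.println(roundTrips(UINT8_MAX)); // true
    System.out.println(roundTrips(UINT16_MAX)); // true
    System.out.println(roundTrips(UINT32_MAX)); // false
    System.out.println(storeAsSignedInt32(3_000_000_000L)); // -1294967296
    // The bits are intact, but recovering the value needs a 64-bit read.
    System.out.println(Integer.toUnsignedLong(storeAsSignedInt32(3_000_000_000L))); // 3000000000
  }
}
```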

ldudas-marx · 2026-05-18T10:01:45Z

+        IntVector intVector = (IntVector) vector;
+        for (int i = 0; i < root.getRowCount(); i++) {
+          assertThat(intVector.get(i))
+              .as("Row %d value should be read correctly", rowIndex)
+              .isEqualTo(values.get(rowIndex));
+          rowIndex++;
+        }


A check that the accessor returns the values correctly is missing.

Suggested change

        IntVector intVector = (IntVector) vector;
        for (int i = 0; i < root.getRowCount(); i++) {
          assertThat(intVector.get(i))
              .as("Row %d value should be read correctly", rowIndex)
              .isEqualTo(values.get(rowIndex));
          rowIndex++;
        }
        ColumnVector columnVector = batch.column(0);
        for (int i = 0; i < root.getRowCount(); i++) {
          assertThat(columnVector.getLong(i))
              .as("Row %d value should be read correctly", rowIndex)
              .isEqualTo((long) values.get(i));
        }

ldudas-marx · 2026-05-18T10:04:06Z

+    int totalRows = 0;
+    int rowIndex = 0;


With test data this small and a batch size this large, there is no need to track the rows.

ldudas-marx · 2026-05-18T10:07:27Z

+    int totalRows = 0;
+    int rowIndex = 0;
+    int[] expectedValues = new int[] {1, 2, 3, Integer.MAX_VALUE};
+    try (VectorizedTableScanIterable vectorizedReader =
+        new VectorizedTableScanIterable(table.newScan(), 1024, false)) {
+      for (ColumnarBatch batch : vectorizedReader) {
+        VectorSchemaRoot root = batch.createVectorSchemaRootFromVectors();
+        FieldVector vector = root.getVector("col");
+        assertThat(vector)
+            .as("Vector should be IntVector matching the physical Parquet type")
+            .isInstanceOf(IntVector.class);
+        IntVector intVector = (IntVector) vector;
+        for (int i = 0; i < root.getRowCount(); i++) {
+          assertThat(intVector.get(i))
+              .as("Row %d value should be read correctly", rowIndex)
+              .isEqualTo(expectedValues[rowIndex]);
+          rowIndex++;
+        }
+        totalRows += root.getRowCount();
+        root.close();
+      }
+    }


My comments for the other test apply here too.

ldudas-marx · 2026-05-18T10:11:15Z

A test case is missing where the table is written with values larger than Integer.MAX_VALUE after the type promotion and reuseContainers is true for the table scan.
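To make the requested scenario concrete, here is a rough model of the data such a test would cover: one file written before the promotion with 32-bit values, and one written afterwards with values above Integer.MAX_VALUE. It is a sketch under stated assumptions, not Iceberg test code. `MixedWidthScanSketch` and `readAsLong` are hypothetical, plain arrays stand in for per-file column data, and container reuse is not modeled, since this thread does not show how the real reader handles it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model: each array stands in for one data file's column.
class MixedWidthScanSketch {
  // Reads every file's column through a long-typed view, widening 32-bit
  // data and passing 64-bit data through unchanged.
  static List<Long> readAsLong(List<Object> fileColumns) {
    List<Long> out = new ArrayList<>();
    for (Object column : fileColumns) {
      if (column instanceof int[]) {
        for (int v : (int[]) column) {
          out.add((long) v); // file written before the promotion
        }
      } else if (column instanceof long[]) {
        for (long v : (long[]) column) {
          out.add(v); // file written after the promotion
        }
      } else {
        throw new IllegalArgumentException("unsupported column: " + column);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    int[] beforePromotion = {1, 2, 3, Integer.MAX_VALUE};
    long[] afterPromotion = {Integer.MAX_VALUE + 1L, Long.MAX_VALUE};
    List<Long> rows =
        readAsLong(Arrays.<Object>asList(beforePromotion, afterPromotion));
    System.out.println(rows);
  }
}
```

A real test would additionally have to confirm that values from both files survive when the scan reuses its containers between batches.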

ldudas-marx · 2026-05-18T14:58:42Z

    // Perform a type promotion
    // TODO: The read Arrow vector should of type BigInt (promoted type) but it is Int (old type).
    Table tableLatest = tables.load(tableLocation);
    tableLatest.updateSchema().updateColumn("int_promotion", Types.LongType.get()).commit();


This type promotion is now tested separately, and the TODO is no longer true. So these lines, along with the "int_promotion" column, can be deleted.


github-actions Bot added the arrow label May 14, 2026

CTTY reviewed May 15, 2026


ldudas-marx reviewed May 18, 2026


xndai force-pushed the iceberg-16341 branch from e3b963d to d2d8dd5 Compare May 18, 2026 17:34

xndai wants to merge 1 commit into apache:main from xndai:iceberg-16341
