Summary
Follow-up to #502. The data conversion layer now supports `LargeListArray` (64-bit offsets) via `ProjectRecordBatch`, but the Parquet reader's schema validation still rejects `LARGE_LIST` types.
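For context, the offset-width difference is visible directly in Arrow's type definitions. A minimal illustrative sketch (plain Arrow C++, nothing iceberg-cpp-specific):

```cpp
#include <arrow/api.h>

#include <cstdint>
#include <type_traits>

// Illustrative only: Arrow's two list types carry identical semantics and
// differ solely in offset width, which is what "64-bit offsets" means above.
static_assert(std::is_same_v<arrow::ListType::offset_type, int32_t>,
              "list<> uses 32-bit offsets");
static_assert(std::is_same_v<arrow::LargeListType::offset_type, int64_t>,
              "large_list<> uses 64-bit offsets");
```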
Problem
`ValidateParquetSchemaEvolution` in `parquet_schema_util.cc:177-180` only accepts `::arrow::Type::LIST`:
```cpp
case TypeId::kList:
  if (arrow_type->id() == ::arrow::Type::LIST) {
    return {};
  }
  break;
```

When reading a Parquet file containing `LargeListArray` columns, the reader fails with:

```
Cannot read Iceberg type: list from Parquet type: large_list<...>
```
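For a standalone repro, a file that triggers this can be produced with stock Arrow C++. This sketch does not use iceberg-cpp's test harness; `WriteLargeListFile` is a hypothetical helper name, and `store_schema()` is enabled so the `large_list` type survives the round trip via the embedded Arrow schema:

```cpp
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

#include <memory>
#include <string>

// Repro sketch: write a single large_list<int64> column to a Parquet file
// using stock Arrow. WriteLargeListFile is a hypothetical helper name.
arrow::Status WriteLargeListFile(const std::string& path) {
  // Build [[1, 2], [3]] as a large_list<int64> (64-bit offsets).
  arrow::LargeListBuilder builder(arrow::default_memory_pool(),
                                  std::make_shared<arrow::Int64Builder>());
  auto* values = static_cast<arrow::Int64Builder*>(builder.value_builder());
  ARROW_RETURN_NOT_OK(builder.Append());
  ARROW_RETURN_NOT_OK(values->AppendValues({1, 2}));
  ARROW_RETURN_NOT_OK(builder.Append());
  ARROW_RETURN_NOT_OK(values->AppendValues({3}));
  std::shared_ptr<arrow::Array> array;
  ARROW_RETURN_NOT_OK(builder.Finish(&array));

  auto schema = arrow::schema(
      {arrow::field("items", arrow::large_list(arrow::int64()))});
  auto table = arrow::Table::Make(schema, {array});

  // store_schema() embeds the Arrow schema in the file metadata so a
  // schema-aware reader reconstructs large_list<> instead of plain list<>.
  auto arrow_props =
      parquet::ArrowWriterProperties::Builder().store_schema()->build();
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/1024,
                                    parquet::default_writer_properties(),
                                    arrow_props);
}
```

Reading the resulting file through the Iceberg Parquet reader then hits the validation error above.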
Proposed Solution
Update the validation to accept both list types:
```cpp
case TypeId::kList:
  if (arrow_type->id() == ::arrow::Type::LIST ||
      arrow_type->id() == ::arrow::Type::LARGE_LIST) {
    return {};
  }
  break;
```

This is safe because:

- Iceberg's `ListType` doesn't distinguish between LIST and LARGE_LIST
- The projection layer (`ProjectRecordBatch`) already handles both via the templated `ProjectListArrayImpl<>` (see the sketch after this list)
- Both represent the same logical "list" concept, just with different offset sizes
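For reference, roughly what that templated dispatch pattern looks like; this is a hedged sketch only — the names mirror the issue, but it is not the actual iceberg-cpp code:

```cpp
#include <arrow/api.h>

// Sketch of the dispatch pattern described above: one implementation
// templated on the concrete Arrow list array type, selected by type id.
template <typename ListArrayType>
arrow::Status ProjectListArrayImpl(const ListArrayType& array) {
  // ListArray exposes int32_t offsets, LargeListArray int64_t offsets;
  // value_offset() abstracts over the width, so the body is shared.
  for (int64_t i = 0; i < array.length(); ++i) {
    auto begin = array.value_offset(i);
    auto end = array.value_offset(i + 1);
    (void)begin;
    (void)end;  // ... project the element slice [begin, end) here ...
  }
  return arrow::Status::OK();
}

arrow::Status ProjectList(const arrow::Array& array) {
  switch (array.type_id()) {
    case arrow::Type::LIST:
      return ProjectListArrayImpl(static_cast<const arrow::ListArray&>(array));
    case arrow::Type::LARGE_LIST:
      return ProjectListArrayImpl(
          static_cast<const arrow::LargeListArray&>(array));
    default:
      return arrow::Status::TypeError("not a list array");
  }
}
```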
Files to Change
- `src/iceberg/parquet/parquet_schema_util.cc` - update `ValidateParquetSchemaEvolution`
- `src/iceberg/test/parquet_test.cc` - add an integration test for reading `LargeListArray` through the full reader pipeline (a hedged sketch follows below)
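A hedged sketch of what that integration test could look like, reusing the repro writer above; `ReadViaIcebergParquetReader` is a placeholder for whatever reader entry point `parquet_test.cc` actually drives:

```cpp
#include <gtest/gtest.h>

#include <arrow/status.h>

#include <string>

// Placeholder declarations: WriteLargeListFile is the repro helper sketched
// earlier; ReadViaIcebergParquetReader stands in for the real reader entry
// point used by iceberg-cpp's parquet_test.cc.
arrow::Status WriteLargeListFile(const std::string& path);
arrow::Status ReadViaIcebergParquetReader(const std::string& path);

TEST(ParquetReaderTest, ReadsLargeListThroughFullPipeline) {
  const std::string path = "large_list.parquet";
  ASSERT_TRUE(WriteLargeListFile(path).ok());

  // Before the fix this fails schema validation with
  // "Cannot read Iceberg type: list from Parquet type: large_list<...>";
  // after the fix the batch projects through the same path as list<>.
  ASSERT_TRUE(ReadViaIcebergParquetReader(path).ok());
}
```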
Related
- Closes the remaining work from #502 (Add support for Arrow LargeListArray in Parquet data projection)