Enable LargeListArray support in Parquet reader schema validation #513

@callmepandey

Summary

Follow-up to #502. The data conversion layer now supports LargeListArray (64-bit offsets) via ProjectRecordBatch, but the Parquet reader's schema validation still rejects LARGE_LIST types.

Problem

ValidateParquetSchemaEvolution in parquet_schema_util.cc:177-180 only accepts ::arrow::Type::LIST:

case TypeId::kList:
  if (arrow_type->id() == ::arrow::Type::LIST) {
    return {};
  }
  break;

When reading a Parquet file containing LargeListArray columns, the reader fails with:

Cannot read Iceberg type: list from Parquet type: large_list<...>

Proposed Solution

Update the validation to accept both list types:

case TypeId::kList:
  if (arrow_type->id() == ::arrow::Type::LIST ||
      arrow_type->id() == ::arrow::Type::LARGE_LIST) {
    return {};
  }
  break;

This is safe because:

  1. Iceberg's ListType is a single logical list type; it does not distinguish between LIST and LARGE_LIST
  2. The projection layer (ProjectRecordBatch) already handles both offset widths via the templated ProjectListArrayImpl<> (see the sketch after this list)
  3. Both represent the same logical "list" concept, differing only in offset width (32-bit vs. 64-bit)
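
To make point 2 concrete, here is a minimal sketch of the templated pattern (the helper name TotalListElements is hypothetical; ProjectListArrayImpl<> is the actual function in the projection layer). Arrow exposes the offset width as offset_type on each list array class, so a single template instantiates for both:

#include <cstdint>

#include <arrow/api.h>

// One implementation covers both array classes: offset_type is int32_t for
// arrow::ListArray and int64_t for arrow::LargeListArray; the traversal is
// otherwise identical.
template <typename ListArrayType>
int64_t TotalListElements(const ListArrayType& array) {
  using OffsetType = typename ListArrayType::offset_type;
  OffsetType total = 0;
  for (int64_t i = 0; i < array.length(); ++i) {
    total += array.value_length(i);  // value_length() returns OffsetType
  }
  return static_cast<int64_t>(total);
}

// Instantiates for both: TotalListElements<arrow::ListArray> and
// TotalListElements<arrow::LargeListArray>.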

Files to Change

  • src/iceberg/parquet/parquet_schema_util.cc - Update ValidateParquetSchemaEvolution
  • src/iceberg/test/parquet_test.cc - Add an integration test that reads LargeListArray data through the full reader pipeline (a rough setup sketch follows below)
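
For the new test, here is a rough sketch of the setup half: it builds a large_list<int32> column with Arrow and writes it with store_schema() so the Arrow-level LARGE_LIST type survives the Parquet round trip (without the stored schema metadata, the column reads back as a plain list). The function name is hypothetical, and the read-side assertions are omitted because they depend on the iceberg-cpp test harness:

#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

arrow::Status WriteLargeListFile(const std::string& path) {
  // Build a large_list<int32> column with values [[1, 2], [3]].
  arrow::LargeListBuilder list_builder(arrow::default_memory_pool(),
                                       std::make_shared<arrow::Int32Builder>());
  auto* values =
      static_cast<arrow::Int32Builder*>(list_builder.value_builder());
  ARROW_RETURN_NOT_OK(list_builder.Append());
  ARROW_RETURN_NOT_OK(values->AppendValues({1, 2}));
  ARROW_RETURN_NOT_OK(list_builder.Append());
  ARROW_RETURN_NOT_OK(values->Append(3));
  std::shared_ptr<arrow::Array> column;
  ARROW_RETURN_NOT_OK(list_builder.Finish(&column));

  auto schema = arrow::schema(
      {arrow::field("ids", arrow::large_list(arrow::int32()))});
  auto table = arrow::Table::Make(schema, {column});

  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  // store_schema() embeds the Arrow schema in the file metadata so the
  // column is reconstructed as LARGE_LIST rather than LIST on read.
  auto arrow_props =
      parquet::ArrowWriterProperties::Builder().store_schema()->build();
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/1024,
                                    parquet::default_writer_properties(),
                                    arrow_props);
}

The test would then open this file through the iceberg-cpp Parquet reader and assert that schema validation succeeds and the list values round-trip intact.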
