Fast non-validating reader mode to count records in Avro OCF files #9613

@mzabaluev

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A query like SELECT COUNT(*) ... on an Avro data source needs no data fields, only the number of rows in the partitioned data set.
With the Avro OCF format, this information can be obtained by decoding just the block frames, presuming that the data encoding is well-formed and the number of encoded records in each block matches the one stated in the block header.
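
To illustrate the idea, here is a minimal standalone sketch of block-frame counting, using only the Rust standard library. It is not the arrow-avro implementation; the function names (`count_records`, `read_long`, `skip`, `invalid`) are hypothetical. It assumes the container layout from the Avro specification: the magic bytes `Obj\x01`, a metadata map, a 16-byte sync marker, and then blocks consisting of a record count, a byte size, the (possibly compressed) payload, and the sync marker again. The payload is skipped without decompression or decoding, so the count is only as trustworthy as the block headers.

```rust
use std::io::{self, Read, Seek, SeekFrom};

fn invalid(msg: &'static str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg)
}

/// Reads one Avro `long` (zigzag varint). Returns `Ok(None)` if the input
/// ends cleanly before the first byte of the value.
fn read_long<R: Read>(r: &mut R) -> io::Result<Option<i64>> {
    let (mut value, mut shift, mut byte) = (0u64, 0u32, [0u8; 1]);
    loop {
        if r.read(&mut byte)? == 0 {
            return if shift == 0 {
                Ok(None)
            } else {
                Err(io::ErrorKind::UnexpectedEof.into())
            };
        }
        if shift >= 64 {
            return Err(invalid("varint too long"));
        }
        value |= u64::from(byte[0] & 0x7f) << shift;
        if byte[0] & 0x80 == 0 {
            // Undo the zigzag mapping.
            return Ok(Some((value >> 1) as i64 ^ -((value & 1) as i64)));
        }
        shift += 7;
    }
}

fn require_long<R: Read>(r: &mut R) -> io::Result<i64> {
    read_long(r)?.ok_or_else(|| io::ErrorKind::UnexpectedEof.into())
}

/// Seeks forward over `n` bytes without reading them.
fn skip<R: Seek>(r: &mut R, n: i64) -> io::Result<()> {
    if n < 0 {
        return Err(invalid("negative length"));
    }
    r.seek(SeekFrom::Current(n))?;
    Ok(())
}

/// Sums the record counts stated in the block headers of an Avro OCF stream.
/// Record data is never decompressed, decoded, or validated.
fn count_records<R: Read + Seek>(r: &mut R) -> io::Result<u64> {
    let mut magic = [0u8; 4];
    r.read_exact(&mut magic)?;
    if magic != *b"Obj\x01" {
        return Err(invalid("not an Avro object container file"));
    }
    // Skip the file metadata map (string keys, bytes values).
    loop {
        let n = require_long(r)?;
        if n == 0 {
            break;
        }
        if n < 0 {
            // A negative count is followed by the byte size of the map block.
            let size = require_long(r)?;
            skip(r, size)?;
            continue;
        }
        for _ in 0..n {
            // Key and value are each prefixed with their byte length.
            for _ in 0..2 {
                let len = require_long(r)?;
                skip(r, len)?;
            }
        }
    }
    let mut sync = [0u8; 16];
    r.read_exact(&mut sync)?;

    let mut total = 0u64;
    while let Some(count) = read_long(r)? {
        let size = require_long(r)?;
        if count < 0 {
            return Err(invalid("negative record count"));
        }
        skip(r, size)?;
        // Checking the sync marker is the only validation of the framing;
        // it also detects a payload that was cut short.
        let mut marker = [0u8; 16];
        r.read_exact(&mut marker)?;
        if marker != sync {
            return Err(invalid("sync marker mismatch"));
        }
        total += count as u64;
    }
    Ok(total)
}
```

Wrapping a file in `std::io::BufReader` before calling `count_records` would keep the small header reads cheap while the seeks skip the payload bytes.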

Describe the solution you'd like
Add an option method to the reader builders that would make the reader bypass decompression and decoding of Avro record data. Instead, the decoder should only parse the OCF data blocks to sum the row counts, and produce record batches with no columns, but with the row counts and metadata corresponding to the file content. This method should not be used together with with_reader_schema.

The name of the method should give sufficient warning, e.g. count_without_validation.

Describe alternatives you've considered
This behavior could be enabled when the reader schema has no fields. However, since this could lead to invalid encoded data being accepted based on the block framing, it's preferable that an explicit option is used.

Additional context
#9608 concerns the behavior when the reader schema has no fields, but validation of Avro data is performed.

Labels: enhancement