Experimental: Native CSV files read #3044

kazantsev-maksim · 2026-01-06T15:36:28Z

Which issue does this PR close?

N/A

Rationale for this change

This PR adds an experimental implementation of native CSV file reading (currently supported only for DataSourceV2).

Required improvements:

Conduct more benchmark tests
Try to implement the idea from "Implement native parsing of CSV files" (#882)
Test reading files from S3/HDFS (currently only tested on local files)

Results of a simple benchmark test (1 iteration): native_csv_read.txt
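Since the attached result comes from a single iteration, follow-up benchmarks would probably want warm-up runs and several measured iterations. The sketch below shows one way such a harness could look. It is only an illustration: it uses Python's standard-library `csv` reader as a stand-in for the scan under test, and the helper names (`make_csv`, `read_all`, `bench`) are hypothetical and not part of this PR or of Comet's benchmark suite.

```python
import csv
import io
import statistics
import time


def make_csv(rows: int) -> str:
    """Build an in-memory CSV document with a header and `rows` data rows."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "name", "value"])
    for i in range(rows):
        writer.writerow([i, f"name_{i}", i * 0.5])
    return buf.getvalue()


def read_all(data: str) -> int:
    """Parse the whole document and return the number of data rows."""
    reader = csv.reader(io.StringIO(data))
    next(reader)  # skip the header row
    return sum(1 for _ in reader)


def bench(fn, warmup: int = 2, iterations: int = 5) -> dict:
    """Run `fn` a few times unmeasured, then time `iterations` runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "iterations": iterations,
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


data = make_csv(10_000)
stats = bench(lambda: read_all(data))
print(stats)
```

Reporting the minimum and median alongside the maximum makes it easier to tell a real speedup from run-to-run noise, which a single iteration cannot show.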

How are these changes tested?

Added a new unit test
Added a simple benchmark test

comphead · 2026-01-06T17:27:35Z

nice, would love to see benches )

parthchandra · 2026-01-10T02:37:54Z

Shouldn't CSV be a file format and part of ScanExec?

codecov-commenter · 2026-01-10T02:59:56Z

Codecov Report

❌ Patch coverage is 84.92063% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.29%. Comparing base (f09f8af) to head (65251eb).
⚠️ Report is 834 commits behind head on main.

Files with missing lines	Patch %	Lines
...pache/spark/sql/comet/CometCsvNativeScanExec.scala	84.21%	4 Missing and 5 partials ⚠️
...n/scala/org/apache/comet/rules/CometScanRule.scala	76.92%	2 Missing and 4 partials ⚠️
...cala/org/apache/comet/serde/operator/package.scala	92.85%	0 Missing and 2 partials ⚠️
...n/scala/org/apache/comet/rules/CometExecRule.scala	50.00%	0 Missing and 1 partial ⚠️
...n/scala/org/apache/spark/sql/comet/operators.scala	80.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #3044      +/-   ##
============================================
+ Coverage     56.12%   59.29%   +3.16%     
- Complexity      976     1374     +398     
============================================
  Files           119      169      +50     
  Lines         11743    15576    +3833     
  Branches       2251     2560     +309     
============================================
+ Hits           6591     9236    +2645     
- Misses         4012     5010     +998     
- Partials       1140     1330     +190
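The figures in the Coverage Diff can be cross-checked with a little arithmetic. The sketch below assumes that coverage is hits divided by total lines and that the displayed percentages are truncated rather than rounded; that assumption is consistent with the numbers in this report, but it is inferred from them and not taken from Codecov's documentation. The patch total of 126 lines is likewise inferred and is not stated in the report.

```python
from math import floor


def truncated_pct(ratio: float) -> float:
    """Convert a ratio to a percentage truncated (not rounded) to two decimals."""
    return floor(ratio * 10000) / 100


# Values copied from the Coverage Diff (base = main, head = #3044).
base = {"lines": 11743, "hits": 6591, "misses": 4012, "partials": 1140}
head = {"lines": 15576, "hits": 9236, "misses": 5010, "partials": 1330}

# Every line is a hit, a miss, or a partial.
for side in (base, head):
    assert side["hits"] + side["misses"] + side["partials"] == side["lines"]

base_ratio = base["hits"] / base["lines"]
head_ratio = head["hits"] / head["lines"]

base_pct = truncated_pct(base_ratio)            # 56.12
head_pct = truncated_pct(head_ratio)            # 59.29
delta_pct = truncated_pct(head_ratio - base_ratio)  # 3.16, from unrounded ratios

# Per-file (missing, partial) counts from the "Files with missing lines" table.
per_file = [(4, 5), (2, 4), (0, 2), (0, 1), (0, 1)]
uncovered = sum(missing + partial for missing, partial in per_file)  # 19

# Inferred, not stated in the report: 126 changed lines in total,
# which reproduces the reported patch coverage of 84.92063%.
patch_lines = 126
patch_pct = round((patch_lines - uncovered) / patch_lines * 100, 5)

print(base_pct, head_pct, delta_pct, uncovered, patch_pct)
```

Note that the reported delta of +3.16% matches the difference of the unrounded ratios; subtracting the two displayed percentages would give 3.17.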

kazantsev-maksim · 2026-01-11T16:10:33Z

Thanks @parthchandra, you are absolutely right. In the first phase, I wanted to implement it only for DataSourceV2 to check the performance improvement. I hope to finish the benchmark tests in the coming days.

Kazantsev Maksim and others added 27 commits November 26, 2025 20:14

Work

1c19b51

Merge remote-tracking branch 'origin/main' into native_csv_read

062296f

Work

0d9355f

Merge remote-tracking branch 'origin/main' into native_csv_read

4479678

Work

6c12812

work

b601956

Merge remote-tracking branch 'origin/main' into native_csv_read

2ffd9cb

# Conflicts: # native/core/src/execution/planner.rs # native/proto/src/proto/operator.proto # spark/src/main/scala/org/apache/comet/rules/CometExecRule.scala

work

c685235

impl map_from_entries

768b3e9

Revert "impl map_from_entries"

c68c342

This reverts commit 768b3e9.

Merge branch 'apache:main' into main

d887555

Merge branch 'apache:main' into main

231aa90

Merge remote-tracking branch 'origin/main' into native_csv_read

2be0069

work

7ea16ee

work

c521006

Merge branch 'apache:main' into main

9500bbb

Merge branch 'apache:main' into main

9577481

Merge remote-tracking branch 'origin/main' into native_csv_read

8796a68

WIP

0f06936

WIP

033ba8b

WIP

dafa0de

Work

1809df8

Merge branch 'apache:main' into main

3791557

Merge branch 'apache:main' into main

7c2f082

Merge branch 'apache:main' into main

609a605

Final approach

d8c7760

Fix workflows

88aeb33

kazantsev-maksim marked this pull request as draft January 6, 2026 15:38

Kazantsev Maksim added 2 commits January 6, 2026 19:46

Fix fmt

b2a0c28

Fix params

ba98e37

Kazantsev Maksim added 2 commits January 6, 2026 20:02

Fix tests

cd449c5

Fix rust fmt

a1801c1

Kazantsev Maksim and others added 2 commits January 6, 2026 21:57

Fix fmt

65251eb

Merge branch 'apache:main' into main

a151b2c

kazantsev-maksim and others added 3 commits January 10, 2026 01:06

Merge branch 'apache:main' into main

ad3e7f5

Merge remote-tracking branch 'origin/main' into native_csv_read

da73c27

Fix tests

a21ae07
