@kazantsev-maksim kazantsev-maksim commented Jan 6, 2026

Which issue does this PR close?

  • N/A

Rationale for this change

Adds an experimental implementation of native CSV file reading (currently only for the DataSourceV2 path).
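For anyone trying this locally: by default Spark routes `csv` through the V1 file source, so the DataSourceV2 path has to be selected explicitly. Below is a minimal sketch using Spark's `spark.sql.sources.useV1SourceList` setting. The object name and the file path are placeholders, and any Comet-specific settings needed to enable the native reader are not shown.

```scala
import org.apache.spark.sql.SparkSession

object CsvV2ReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-v2-read-sketch")
      // Empty list: no built-in source falls back to V1, so csv goes through DataSourceV2.
      .config("spark.sql.sources.useV1SourceList", "")
      .getOrCreate()

    // Placeholder path (assumption); replace with a real local CSV file.
    val df = spark.read.option("header", "true").csv("/tmp/example.csv")

    // Print the physical plan to check which scan node was chosen.
    df.explain()

    spark.stop()
  }
}
```

Without Comet, a V2 read typically shows up in the plan as a `BatchScan` node, which is a quick way to confirm that the V2 path is in use before looking for the native operator.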

Required improvements:

  1. Conduct more benchmark tests
  2. Try to implement the idea from "Implement native parsing of CSV files" (#882)
  3. Test reading files from S3/HDFS (currently only tested on local files)

Results of a simple benchmark test (1 iteration): native_csv_read.txt
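Since the numbers above come from a single iteration, a follow-up run with warm-up and several measured iterations would give steadier results. Below is a minimal sketch of such a timing loop. `BenchSketch`, `timeIterations` and `median` are hypothetical names used for illustration only, not the benchmark code in this PR.

```scala
object BenchSketch {

  /** Runs `body` `warmup` times untimed, then `iters` times timed.
    * Returns the wall-clock duration of each timed run in milliseconds.
    */
  def timeIterations(warmup: Int, iters: Int)(body: => Unit): Seq[Double] = {
    // Warm-up runs let the JIT and file-system caches settle; their timings are discarded.
    (0 until warmup).foreach(_ => body)
    (0 until iters).map { _ =>
      val start = System.nanoTime()
      body
      (System.nanoTime() - start) / 1e6
    }
  }

  /** Median of the samples; less sensitive to a single slow run than the mean. */
  def median(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "need at least one sample")
    val sorted = xs.sorted
    val n = sorted.length
    if (n % 2 == 1) sorted(n / 2)
    else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }
}
```

In a Spark benchmark the `body` would be whatever action forces the CSV scan to execute, run once with the native reader enabled and once without, so that the two medians can be compared.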

How are these changes tested?

  1. Added a new unit test
  2. Added a simple benchmark test

@kazantsev-maksim kazantsev-maksim marked this pull request as draft January 6, 2026 15:38
Kazantsev Maksim added 2 commits January 6, 2026 19:46
Kazantsev Maksim added 2 commits January 6, 2026 20:02

comphead commented Jan 6, 2026

nice, would love to see benches )

@parthchandra

Shouldn't CSV be a file format and part of ScanExec?

codecov-commenter commented Jan 10, 2026

Codecov Report

❌ Patch coverage is 84.92063% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.29%. Comparing base (f09f8af) to head (65251eb).
⚠️ Report is 834 commits behind head on main.

Files with missing lines                                  Patch %   Lines
...pache/spark/sql/comet/CometCsvNativeScanExec.scala     84.21%    4 Missing and 5 partials ⚠️
...n/scala/org/apache/comet/rules/CometScanRule.scala     76.92%    2 Missing and 4 partials ⚠️
...cala/org/apache/comet/serde/operator/package.scala     92.85%    0 Missing and 2 partials ⚠️
...n/scala/org/apache/comet/rules/CometExecRule.scala     50.00%    0 Missing and 1 partial ⚠️
...n/scala/org/apache/spark/sql/comet/operators.scala     80.00%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3044      +/-   ##
============================================
+ Coverage     56.12%   59.29%   +3.16%     
- Complexity      976     1374     +398     
============================================
  Files           119      169      +50     
  Lines         11743    15576    +3833     
  Branches       2251     2560     +309     
============================================
+ Hits           6591     9236    +2645     
- Misses         4012     5010     +998     
- Partials       1140     1330     +190     

@kazantsev-maksim

Thanks @parthchandra, you are absolutely right. In the first phase, I wanted to implement it only for DataSourceV2 to check the performance improvement. I hope to finish the benchmark tests in the coming days.
