Skip to content

Conversation

@anujmodi2021
Copy link
Contributor

Description of PR

Since the onset of ABFS Driver, there has been a single implementation of AbfsInputStream. Different kinds of workloads require different heuristics to give the best performance for that type of workload. For example: 

Sequential Read Workloads like DFSIO and DistCP gain performance improvement from prefetched 
Random Read Workloads on other hand do not need Prefetches and enabling prefetches for them is an overhead and TPS heavy 
Query Workloads involving Parquet/ORC files benefit from improvements like Footer Read and Small Files Reads

To accomodate this we need to determine the pattern and accordingly create Input Streams implemented for that particular pattern.

image

Moving ahead more relevant policies and specialized implementation of AbfsInputStream can be added.

This PR only refactors the way we create input streams. No logical change introduced. As today by default we will continue to use AbfsAdaptiveInputStream which can cater to all kind of workloads.

How was this patch tested?

New tests were added.

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 8m 36s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 3 new or modified test files.
_ trunk Compile Tests _
-1 ❌ mvninstall 22m 15s /branch-mvninstall-root.txt root in trunk failed.
+1 💚 compile 0m 27s trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 24s trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
+1 💚 checkstyle 0m 17s trunk passed
+1 💚 mvnsite 0m 28s trunk passed
+1 💚 javadoc 0m 26s trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 22s trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
-1 ❌ spotbugs 0m 44s /branch-spotbugs-hadoop-tools_hadoop-azure-warnings.html hadoop-tools/hadoop-azure in trunk has 1 extant spotbugs warnings.
+1 💚 shadedclient 14m 36s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 20s the patch passed
+1 💚 compile 0m 18s the patch passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 18s the patch passed
+1 💚 compile 0m 19s the patch passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 19s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 10s /results-checkstyle-hadoop-tools_hadoop-azure.txt hadoop-tools/hadoop-azure: The patch generated 28 new + 2 unchanged - 1 fixed = 30 total (was 3)
+1 💚 mvnsite 0m 21s the patch passed
-1 ❌ javadoc 0m 16s /results-javadoc-javadoc-hadoop-tools_hadoop-azure-jdkUbuntu-21.0.7+6-Ubuntu-0ubuntu120.04.txt hadoop-tools_hadoop-azure-jdkUbuntu-21.0.7+6-Ubuntu-0ubuntu120.04 with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 generated 37 new + 1585 unchanged - 0 fixed = 1622 total (was 1585)
-1 ❌ javadoc 0m 15s /results-javadoc-javadoc-hadoop-tools_hadoop-azure-jdkUbuntu-17.0.15+6-Ubuntu-0ubuntu120.04.txt hadoop-tools_hadoop-azure-jdkUbuntu-17.0.15+6-Ubuntu-0ubuntu120.04 with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 generated 37 new + 1464 unchanged - 0 fixed = 1501 total (was 1464)
-1 ❌ spotbugs 0m 42s /new-spotbugs-hadoop-tools_hadoop-azure.html hadoop-tools/hadoop-azure generated 2 new + 1 unchanged - 0 fixed = 3 total (was 1)
+1 💚 shadedclient 14m 12s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 2m 11s hadoop-azure in the patch passed.
+1 💚 asflicense 0m 22s The patch does not generate ASF License warnings.
69m 11s
Reason Tests
SpotBugs module:hadoop-tools/hadoop-azure
Inconsistent synchronization of org.apache.hadoop.fs.azurebfs.services.AbfsInputStream.bCursor; locked 73% of time Unsynchronized access at AbfsInputStream.java:73% of time Unsynchronized access at AbfsInputStream.java:[line 860]
Inconsistent synchronization of org.apache.hadoop.fs.azurebfs.services.AbfsInputStream.fCursor; locked 58% of time Unsynchronized access at AbfsInputStream.java:58% of time Unsynchronized access at AbfsInputStream.java:[line 865]
Subsystem Report/Notes
Docker ClientAPI=1.52 ServerAPI=1.52 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8153/1/artifact/out/Dockerfile
GITHUB PR #8153
JIRA Issue HADOOP-19767
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 3223a0fb1ce6 5.15.0-164-generic #174-Ubuntu SMP Fri Nov 14 20:25:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / bb42e3c
Default Java Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
Multi-JDK versions /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8153/1/testReport/
Max. process+thread count 641 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8153/1/console
versions git=2.25.1 maven=3.9.11 spotbugs=4.9.7
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@ahmarsuhail
Copy link
Contributor

@anujmodi2021 I am trying to propose a single optimised implementation of an input stream across cloud implementations, as I think we all need this kind of logic. Ideally I want to get to a place where 80% of the logic is shared in a common layer, and then we only implement cloud specific clients to actually make the requests separately.

There is some consensus to move the shared logic into the parquet-java repo: https://lists.apache.org/thread/nbksq32cs8h1ldj8762y6wh9zzp8gqx6 , and some buy-in from the team at google. I'll be following up on this in the new year.

Would be great to get your thoughts and if your team would also like to collaborate on this.

@anujmodi2021 anujmodi2021 marked this pull request as ready for review December 31, 2025 10:58
@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 44s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 3 new or modified test files.
_ trunk Compile Tests _
-1 ❌ mvninstall 20m 52s /branch-mvninstall-root.txt root in trunk failed.
+1 💚 compile 0m 25s trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 25s trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
+1 💚 checkstyle 0m 20s trunk passed
+1 💚 mvnsite 0m 29s trunk passed
+1 💚 javadoc 0m 25s trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 24s trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
-1 ❌ spotbugs 0m 44s /branch-spotbugs-hadoop-tools_hadoop-azure-warnings.html hadoop-tools/hadoop-azure in trunk has 1 extant spotbugs warnings.
+1 💚 shadedclient 14m 38s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
-1 ❌ mvninstall 0m 15s /patch-mvninstall-hadoop-tools_hadoop-azure.txt hadoop-azure in the patch failed.
-1 ❌ compile 0m 16s /patch-compile-hadoop-tools_hadoop-azure-jdkUbuntu-21.0.7+6-Ubuntu-0ubuntu120.04.txt hadoop-azure in the patch failed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04.
-1 ❌ javac 0m 16s /patch-compile-hadoop-tools_hadoop-azure-jdkUbuntu-21.0.7+6-Ubuntu-0ubuntu120.04.txt hadoop-azure in the patch failed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04.
-1 ❌ compile 0m 16s /patch-compile-hadoop-tools_hadoop-azure-jdkUbuntu-17.0.15+6-Ubuntu-0ubuntu120.04.txt hadoop-azure in the patch failed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04.
-1 ❌ javac 0m 16s /patch-compile-hadoop-tools_hadoop-azure-jdkUbuntu-17.0.15+6-Ubuntu-0ubuntu120.04.txt hadoop-azure in the patch failed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04.
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 11s /results-checkstyle-hadoop-tools_hadoop-azure.txt hadoop-tools/hadoop-azure: The patch generated 18 new + 2 unchanged - 1 fixed = 20 total (was 3)
-1 ❌ mvnsite 0m 16s /patch-mvnsite-hadoop-tools_hadoop-azure.txt hadoop-azure in the patch failed.
-1 ❌ javadoc 0m 20s /results-javadoc-javadoc-hadoop-tools_hadoop-azure-jdkUbuntu-21.0.7+6-Ubuntu-0ubuntu120.04.txt hadoop-tools_hadoop-azure-jdkUbuntu-21.0.7+6-Ubuntu-0ubuntu120.04 with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 generated 25 new + 1585 unchanged - 0 fixed = 1610 total (was 1585)
-1 ❌ javadoc 0m 17s /results-javadoc-javadoc-hadoop-tools_hadoop-azure-jdkUbuntu-17.0.15+6-Ubuntu-0ubuntu120.04.txt hadoop-tools_hadoop-azure-jdkUbuntu-17.0.15+6-Ubuntu-0ubuntu120.04 with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 generated 25 new + 1464 unchanged - 0 fixed = 1489 total (was 1464)
-1 ❌ spotbugs 0m 15s /patch-spotbugs-hadoop-tools_hadoop-azure.txt hadoop-azure in the patch failed.
+1 💚 shadedclient 16m 7s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 0m 18s /patch-unit-hadoop-tools_hadoop-azure.txt hadoop-azure in the patch failed.
+1 💚 asflicense 0m 21s The patch does not generate ASF License warnings.
58m 1s
Subsystem Report/Notes
Docker ClientAPI=1.52 ServerAPI=1.52 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8153/2/artifact/out/Dockerfile
GITHUB PR #8153
JIRA Issue HADOOP-19767
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 5c3d30308556 5.15.0-164-generic #174-Ubuntu SMP Fri Nov 14 20:25:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 94d2336
Default Java Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
Multi-JDK versions /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8153/2/testReport/
Max. process+thread count 636 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8153/2/console
versions git=2.25.1 maven=3.9.11 spotbugs=4.9.7
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants