HADOOP-19767: [ABFS] Introduce Abfs Input Policy for detecting read patterns #8153
base: trunk
💔 -1 overall
This message was automatically generated.
@anujmodi2021 I am trying to propose a single optimised implementation of an input stream across cloud implementations, as I think we all need this kind of logic. Ideally, I want to get to a place where 80% of the logic is shared in a common layer, and we only implement cloud-specific clients separately to actually make the requests. There is some consensus to move the shared logic into the parquet-java repo (https://lists.apache.org/thread/nbksq32cs8h1ldj8762y6wh9zzp8gqx6) and some buy-in from the team at Google. I'll be following up on this in the new year. It would be great to get your thoughts, and to hear whether your team would also like to collaborate on this.
Description of PR
Since the inception of the ABFS driver, there has been a single implementation of AbfsInputStream. Different kinds of workloads require different heuristics to get the best performance for that type of workload. For example:
Sequential read workloads like DFSIO and DistCP gain performance from prefetching.
Random read workloads, on the other hand, do not need prefetching; enabling it for them adds overhead and is TPS-heavy.
Query workloads involving Parquet/ORC files benefit from improvements like footer reads and small-file reads.
To accommodate this, we need to determine the read pattern and create an input stream implemented for that particular pattern.
Going forward, more relevant policies and specialized implementations of AbfsInputStream can be added.
This PR only refactors the way we create input streams; no logical change is introduced. As today, by default we will continue to use AbfsAdaptiveInputStream, which can cater to all kinds of workloads.
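As a rough illustration of the idea of choosing a stream implementation from a read policy, here is a minimal sketch. It is not the code in this PR: the class `ReadPolicySketch`, the enum `StreamKind`, the method `select`, and the policy strings are all hypothetical names chosen for this example. The only behaviour taken from the description is that unknown or missing policies fall back to the adaptive stream.

```java
import java.util.Locale;

/**
 * Hypothetical sketch only: none of these names come from the PR.
 * It maps a read-policy string to the kind of input stream to create.
 */
final class ReadPolicySketch {

  /** Illustrative stream kinds, one per workload type in the description. */
  enum StreamKind { SEQUENTIAL, RANDOM, COLUMNAR, ADAPTIVE }

  private ReadPolicySketch() {
  }

  /**
   * Pick a stream kind for the given policy name (assumed values).
   * A null or unrecognised policy falls back to ADAPTIVE, mirroring
   * the stated default of using the adaptive stream.
   */
  static StreamKind select(String policy) {
    if (policy == null) {
      return StreamKind.ADAPTIVE;
    }
    switch (policy.trim().toLowerCase(Locale.ROOT)) {
      case "sequential":
        // Sequential reads benefit from prefetching.
        return StreamKind.SEQUENTIAL;
      case "random":
        // Random reads do not need prefetching.
        return StreamKind.RANDOM;
      case "parquet":
      case "orc":
        // Columnar formats benefit from footer and small-file reads.
        return StreamKind.COLUMNAR;
      default:
        return StreamKind.ADAPTIVE;
    }
  }
}
```

With a selector of this shape, a caller would call something like `ReadPolicySketch.select("random")` and then construct the matching stream. Adding a new policy would then only mean adding an enum constant and a `case` branch.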
How was this patch tested?
New tests were added.