Regional #111

Draft
googlyrahman wants to merge 8 commits into main from regional

Conversation

@googlyrahman
Collaborator

Temporarily created for a perf check.

@googlyrahman
Collaborator Author

/gcbrun

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a BackgroundPrefetcher to gcsfs to optimize sequential reads by asynchronously fetching data blocks in a background task. It refactors _cat_file to support concurrent fetching, integrates the prefetcher into GCSFile, and adds a suite of new tests. Feedback focuses on improving maintainability and robustness by replacing magic numbers with named constants, narrowing exception handling from BaseException or broad Exception types to more specific ones, and ensuring correct boolean parsing for environment variables. Additionally, the reviewer suggests avoiding blocking the event loop during initialization and refactoring complex conditional logic in the prefetcher's producer loop for better clarity.

Comment thread gcsfs/core.py
default_block_size = DEFAULT_BLOCK_SIZE
protocol = "gs", "gcs"
async_impl = True
MIN_CHUNK_SIZE_FOR_CONCURRENCY = 5 * 1024 * 1024

medium

The value 5 * 1024 * 1024 is a magic number. It would be more readable and maintainable to define this as a named constant, perhaps at the module level or as a class constant with a descriptive name, to improve clarity and allow for easier modification if the threshold needs to change in the future.

Suggested change
- MIN_CHUNK_SIZE_FOR_CONCURRENCY = 5 * 1024 * 1024
+ MIN_CHUNK_SIZE_FOR_CONCURRENCY = 5 * 1024 * 1024 # Consider defining this as a named constant for clarity

Comment thread gcsfs/core.py
# for start=5, end=5), causing the server to return the whole file instead of nothing.
if start is not None and end is not None and start >= end >= 0:
    return b""

medium

The comment # 'if start or end' fails when start=0 or end=0 because 0 is Falsey. was removed from the previous version of this line. This comment provided valuable context regarding Python's truthiness and how it affects if conditions with start or end values of 0. Restoring it would improve code clarity for future maintainers.

Suggested change
# 'if start or end' fails when start=0 or end=0 because 0 is Falsey.
u2 = self.url(path)
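
For readers unfamiliar with the pitfall the removed comment describes, here is a minimal sketch. The three helper names are hypothetical and exist only for illustration; the last one mirrors the guard shown in the diff above.

```python
def has_range_truthy(start, end):
    """Buggy check: treats 0 as 'not provided' because 0 is falsy."""
    return bool(start or end)


def has_range_explicit(start, end):
    """Safer check: only None means 'not provided'."""
    return start is not None or end is not None


def is_empty_range(start, end):
    """Same condition as the guard in the diff: start >= end >= 0."""
    return start is not None and end is not None and start >= end >= 0
```

With `start=0, end=0` the truthy check reports that no range was given, while the explicit check correctly sees one; `is_empty_range(5, 5)` is the case the guard short-circuits to `b""`.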

Comment thread gcsfs/core.py
try:
    results = await asyncio.gather(*tasks)
    return b"".join(results)
except BaseException as e:

medium

Catching BaseException is generally too broad as it can mask unexpected system-level errors, including KeyboardInterrupt or SystemExit. It's best practice to catch more specific exceptions (e.g., Exception or a custom exception type) to ensure that only anticipated errors are handled, allowing other critical issues to propagate.

Suggested change
- except BaseException as e:
+ except Exception as e:
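
One point worth weighing before applying this suggestion: on Python 3.8 and later, `asyncio.CancelledError` derives from `BaseException`, not `Exception`, so an `except Exception` handler is skipped when the awaiting task is cancelled. The handler body is not shown in this thread, so the following is only a sketch under the assumption that its purpose is to cancel sibling tasks and re-raise. `gather_with_cleanup` is a hypothetical helper, not code from this PR.

```python
import asyncio


async def gather_with_cleanup(coros):
    """Run coroutines concurrently; cancel the siblings if anything fails."""
    tasks = [asyncio.ensure_future(c) for c in coros]
    try:
        return await asyncio.gather(*tasks)
    except BaseException:
        # Also reached on cancellation, since CancelledError is a
        # BaseException (not an Exception) on Python 3.8+.
        for task in tasks:
            task.cancel()
        # Let the cancellations settle, then re-raise the original error.
        await asyncio.gather(*tasks, return_exceptions=True)
        raise
```

Because the handler always re-raises, `KeyboardInterrupt` and `SystemExit` still propagate; the broad clause only guarantees that the cleanup runs first.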

Comment thread gcsfs/core.py Outdated
# disrupted. This will be refactored once the upstream `fsspec` changes are merged.
use_prefetch_reader = kwargs.get(
    "use_prefetch_reader", False
) or os.environ.get("use_prefetch_reader", False)

medium

Environment variables are always strings, so os.environ.get("use_prefetch_reader", False) returns a truthy value for any non-empty setting, including "0" or "false", and the `or` expression would then enable the prefetch reader. It's safer to explicitly convert the environment variable's string value to a boolean (e.g., str(value).lower() == 'true').

Suggested change
- ) or os.environ.get("use_prefetch_reader", False)
+ ) or str(os.environ.get("use_prefetch_reader", "false")).lower() == "true"
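
A slightly more general way to express the same idea is a small parsing helper. This is a sketch only: `env_flag` and the accepted set of "true" spellings are assumptions made for illustration (the suggestion above accepts only "true"), and the `environ` parameter exists so the helper can be exercised without touching the real environment.

```python
import os

# Assumed set of spellings treated as "enabled"; adjust to taste.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Interpret an environment variable as a boolean."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES
```

For comparison, `bool("false")` is `True` in Python, which is why relying on the truthiness of the raw string enables the feature for almost any value.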

Comment thread gcsfs/core.py Outdated
) or os.environ.get("use_prefetch_reader", False)
if use_prefetch_reader:
    max_prefetch_size = kwargs.get("max_prefetch_size", None)
    concurrency = kwargs.get("concurrency", 4)

medium

The default value 4 for concurrency is a magic number. It would be more maintainable to define this as a named constant, allowing for easier updates and better readability.

Suggested change
- concurrency = kwargs.get("concurrency", 4)
+ concurrency = kwargs.get("concurrency", 4) # Consider using a named constant for the default

Comment thread gcsfs/prefetcher.py Outdated
Comment on lines +215 to +224
if prefetch_size >= self.MIN_CHUNK_SIZE:
    if prefetch_space_available >= self.MIN_CHUNK_SIZE:
        actual_size = min(
            max(self.MIN_CHUNK_SIZE, block_size),
            space_remaining,
        )
    else:
        break
else:
    actual_size = min(block_size, space_remaining)

medium

The nested if conditions for calculating actual_size can be difficult to read and understand. Consider refactoring this logic into a helper function or using a more linear approach to improve clarity and maintainability.
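
One possible shape for such a helper is sketched below. The function name `compute_fetch_size` and the use of `None` as a "stop the loop" signal are assumptions for illustration; the variable names and the branch logic are taken from the snippet above.

```python
from typing import Optional


def compute_fetch_size(
    prefetch_size: int,
    prefetch_space_available: int,
    space_remaining: int,
    block_size: int,
    min_chunk_size: int,
) -> Optional[int]:
    """Return the size of the next fetch, or None when the loop should stop."""
    if prefetch_size < min_chunk_size:
        # Small prefetch window: fetch one block, capped by what is left.
        return min(block_size, space_remaining)
    if prefetch_space_available < min_chunk_size:
        # Not enough room for a full-sized chunk: tell the caller to break.
        return None
    return min(max(min_chunk_size, block_size), space_remaining)
```

The producer loop would then call the helper and `break` when the result `is None`, which keeps the three cases on one indentation level.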

Comment thread gcsfs/prefetcher.py Outdated
    logger.debug("Producer loop cancelled")
    pass
except Exception as e:
    logger.error("Producer loop encountered an error: %s", e, exc_info=True)

medium

Catching Exception is generally too broad as it can mask unexpected errors. It's best practice to catch more specific exceptions to ensure that only anticipated errors are handled, allowing other critical issues to propagate.

Suggested change
- logger.error("Producer loop encountered an error: %s", e, exc_info=True)
+ except Exception as e: # Consider catching more specific exceptions

Comment thread gcsfs/prefetcher.py Outdated
except asyncio.CancelledError:
    logger.debug("Read task was cancelled.")
    raise
except Exception as e:

medium

Catching Exception is generally too broad as it can mask unexpected errors. It's best practice to catch more specific exceptions to ensure that only anticipated errors are handled, allowing other critical issues to propagate.

Suggested change
- except Exception as e:
+ except Exception as e: # Consider catching more specific exceptions

Comment thread gcsfs/prefetcher.py Outdated
try:
    result = fsspec.asyn.sync(self.loop, self._async_fetch, start, end)
except Exception as e:
    logger.error("Exception during synchronous fetch: %s", e, exc_info=True)

medium

Catching Exception is generally too broad as it can mask unexpected errors. It's best practice to catch more specific exceptions to ensure that only anticipated errors are handled, allowing other critical issues to propagate.

Suggested change
- logger.error("Exception during synchronous fetch: %s", e, exc_info=True)
+ except Exception as e: # Consider catching more specific exceptions

Comment thread gcsfs/tests/test_core.py
assert mock_seq.call_count == 1
assert mock_conc.call_count == 0

# 2. Concurrency = 4, but read size (1MB) is < MIN_CHUNK_SIZE_FOR_CONCURRENCY (5MB)

medium

The value 5MB is hardcoded here. It would be more robust to reference the MIN_CHUNK_SIZE_FOR_CONCURRENCY constant from gcsfs.core to ensure consistency and prevent test failures if the constant's value changes.

Suggested change
- # 2. Concurrency = 4, but read size (1MB) is < MIN_CHUNK_SIZE_FOR_CONCURRENCY (5MB)
+ # 2. Concurrency = 4, but read size (1MB) is < gcs.MIN_CHUNK_SIZE_FOR_CONCURRENCY
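
Beyond the comment text, the read sizes used by the test could themselves be derived from the constant. The sketch below assumes a stand-in class `FakeFS` and a hypothetical helper `sizes_around_threshold`; only the constant's name and value come from the diff.

```python
class FakeFS:
    """Hypothetical stand-in for the filesystem class; only the constant matters."""

    MIN_CHUNK_SIZE_FOR_CONCURRENCY = 5 * 1024 * 1024


def sizes_around_threshold(fs_cls):
    """Return (below, above) read sizes derived from the class constant."""
    threshold = fs_cls.MIN_CHUNK_SIZE_FOR_CONCURRENCY
    return threshold // 5, threshold * 2
```

With sizes computed this way, the test keeps exercising both sides of the threshold even if the constant's value changes later.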

@googlyrahman googlyrahman force-pushed the regional branch 3 times, most recently from e747193 to 32f2056 Compare March 31, 2026 20:44
@codecov
codecov Bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 98.36512% with 6 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| gcsfs/core.py | 93.75% | 3 Missing ⚠️ |
| gcsfs/prefetcher.py | 99.05% | 3 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| gcsfs/zb_hns_utils.py | 98.38% <100.00%> (+0.08%) ⬆️ |
| gcsfs/core.py | 80.26% <93.75%> (+3.05%) ⬆️ |
| gcsfs/prefetcher.py | 99.05% <99.05%> (ø) |

... and 1 file with indirect coverage changes

@googlyrahman googlyrahman force-pushed the regional branch 6 times, most recently from 72d3596 to b5d348e Compare April 7, 2026 13:23