fix(ingestion/bigquery): eliminate silently-ignored usage.* config in BigQuery connector#18133
Open
kyungsoo-datahub wants to merge 10 commits into
Open
fix(ingestion/bigquery): eliminate silently-ignored usage.* config in BigQuery connector#18133kyungsoo-datahub wants to merge 10 commits into
kyungsoo-datahub wants to merge 10 commits into
Conversation
…/bucket_duration to top level These nested fields were inherited but never read by either the queries-v2 or legacy usage extraction paths, so setting them silently had no effect.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Connector Tests ResultsAll connector tests passed for commit To skip connector tests, add the Autogenerated by the connector-tests CI pipeline. |
…on legacy-only usage fields under queries-v2 Extends the usage.* time-window forwarding to max_query_duration, which had the same silent-ignore trap as start_time/end_time/bucket_duration. Also warns when include_read_operational_stats or apply_view_usage_to_tables are set under use_queries_v2, since queries-v2 has no equivalent mechanism for either and silently ignores them.
…queries/queries_character_limit under queries-v2 These usage.* fields worked in the legacy extraction path but were dropped when building the queries-v2 aggregator's usage config, so setting usage.format_sql_queries had no effect on the default (use_queries_v2=True) path.
Documents the max_query_duration deprecation, the queries-v2 wiring of format_sql_queries/include_top_n_queries/queries_character_limit, and the legacy-only status of include_read_operational_stats/apply_view_usage_to_tables.
…ge per review usage.max_query_duration, unlike start_time/end_time/bucket_duration, is only read on the legacy (non-queries-v2) extraction path, so the deprecation warning and docs should not claim it applies to lineage/usage/operations under the default queries-v2 path. Also closes test gaps flagged in review: end_time forwarding, nested usage config reverting to its own default after forwarding, and aggregator forwarding of include_top_n_queries/queries_character_limit.
…queries-v2 config mapping Redeclares usage.start_time/end_time/bucket_duration/max_query_duration on BigQueryUsageConfig so the deprecation is visible in generated connector docs, not just the runtime log line. Also extracts the self.config.usage.* -> BigQueryQueriesExtractorConfig mapping in bigquery.py into a small testable method, closing a pre-existing untested seam (top_n_queries had the same gap) that a typo'd kwarg could otherwise slip through.
Add missing return type annotation and construct the fake BigqueryV2Source via __new__ instead of SimpleNamespace so it satisfies the method's type signature. lintFix only runs ruff, not mypy, so these slipped past local verification and only surfaced in CI's :metadata-ingestion:lint task.
…it on standalone queries source BigQueryQueriesExtractorConfig accepted inconsistent top_n_queries/ queries_character_limit combos at parse time (only failing later inside BaseUsageConfig during extractor init) since it doesn't share BaseUsageConfig's validator. Extracts that check into a shared helper and applies it here too. Also documents the programmatic-construction bypass of the usage.* forwarding validator, surfaces the two legacy-only-field warnings in the structured ingestion report (not just the log), composes the redeclared field descriptions from the base class text to avoid drift, and parametrizes the conflict tests across all four forwarded fields.
Removed three tests that only checked a Field's default= against a hardcoded or re-exported literal (no behavior beyond what pydantic already guarantees): format_sql_queries/include_top_n_queries default checks, and a queries_character_limit "parity" check that referenced the same imported constant the field itself uses, so it could never actually drift. Also hoisted per-method local imports in TestQueriesExtractorUsageConfigWiring to module level.
…ate tests Renames forward_usage_time_window_fields to forward_deprecated_usage_fields since it also covers max_query_duration, which isn't a window boundary field. Tightens code comments on the dict-only guard and defensive copy. Parametrizes near-duplicate tests across both test files (field forwarding, legacy-only warnings, aggregator wiring) and drops the programmatic-bypass test, whose documented behavior is already covered by a code comment.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
usage.start_time, usage.end_time, usage.bucket_duration, usage.max_query_duration, usage.format_sql_queries, usage.include_top_n_queries, and usage.queries_character_limit were not working before this fix — several were silently ignored, and one (format_sql_queries) was a real bug on the default extraction path. This PR closes out the remaining gaps:
Time window forwarding: usage.start_time, usage.end_time, usage.bucket_duration (already fixed in a prior commit on this branch) are forwarded to their top-level equivalents with a deprecation warning, and now apply correctly under both extraction paths. usage.max_query_duration is also forwarded, but unlike the other three, it only takes effect on the legacy path (use_queries_v2: False) — the top-level field itself is legacy-only, so queries-v2 ignores it either way. Setting both the nested and top-level version of the same field is now a configuration error.
Queries-v2 wiring: usage.format_sql_queries, usage.include_top_n_queries, and usage.queries_character_limit already worked on the legacy path, which reads them directly. The queries-v2 aggregator (the default, use_queries_v2: True) dropped them entirely and hardcoded format_queries=False. They now flow through to the SQL parsing aggregator used by queries-v2.
Legacy-only fields documented: usage.include_read_operational_stats and usage.apply_view_usage_to_tables have no equivalent under queries-v2, which attributes usage via SQL parsing rather than the legacy mechanism these fields rely on. DataHub now logs a warning when either is set to a non-default value while use_queries_v2: True, and their descriptions are updated to clarify they're legacy-only.
Behavior changes: