fix(ingestion/bigquery): eliminate silently-ignored usage.* config in BigQuery connector #18133

Open
kyungsoo-datahub wants to merge 10 commits into master from fix/bigquery-usage-time-window-fields-ignored

@kyungsoo-datahub kyungsoo-datahub commented Jul 1, 2026

Contributor
  • usage.start_time, usage.end_time, usage.bucket_duration, usage.max_query_duration, usage.format_sql_queries, usage.include_top_n_queries, and usage.queries_character_limit did not take effect as configured before this PR. The first four were silently ignored, and the last three were dropped on the default queries-v2 extraction path, where format_queries was hardcoded to False. This PR closes out the remaining gaps:

    • Time window forwarding: usage.start_time, usage.end_time, and usage.bucket_duration (already fixed in a prior commit on this branch) are forwarded to their top-level equivalents with a deprecation warning, and now apply correctly under both extraction paths. usage.max_query_duration is also forwarded, but unlike the other three it only takes effect on the legacy path (use_queries_v2: False): the top-level field itself is legacy-only, so queries-v2 ignores it either way. Setting both the nested and top-level versions of the same field is now a configuration error.

    • Queries-v2 wiring: usage.format_sql_queries, usage.include_top_n_queries, and usage.queries_character_limit already worked on the legacy path, which reads them directly. The queries-v2 aggregator (the default, use_queries_v2: True) dropped them entirely and hardcoded format_queries=False. They now flow through to the SQL parsing aggregator used by queries-v2.

    • Legacy-only fields documented: usage.include_read_operational_stats and usage.apply_view_usage_to_tables have no equivalent under queries-v2, which attributes usage via SQL parsing rather than the legacy mechanism these fields rely on. DataHub now logs a warning when either is set to a non-default value while use_queries_v2: True, and their descriptions are updated to clarify they're legacy-only.
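
The forwarding rule described above can be illustrated with a minimal sketch. This is not the PR's implementation: the function name `forward_usage_fields_sketch`, the plain-dict input, and the message wording are assumptions for illustration. It only shows the three behaviors the description names: forwarding to the top level, a deprecation warning, and an error when both versions are set.

```python
import warnings
from typing import Any, Dict

# Nested usage.* fields that are forwarded to top-level fields of the same name.
FORWARDED_FIELDS = ("start_time", "end_time", "bucket_duration", "max_query_duration")


def forward_usage_fields_sketch(raw_config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical sketch: move deprecated usage.* keys to the top level."""
    usage = raw_config.get("usage")
    if not isinstance(usage, dict):
        return raw_config  # nothing to forward

    # Copy both dicts so the caller's input is left untouched.
    config = dict(raw_config)
    usage = dict(usage)

    for field in FORWARDED_FIELDS:
        if field not in usage:
            continue
        if field in config:
            # Both the nested and the top-level version were set.
            raise ValueError(f"Set either usage.{field} or {field}, not both.")
        warnings.warn(
            f"usage.{field} is deprecated; set top-level {field} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # pop() removes the nested key, so the nested config falls back to
        # its own default after forwarding.
        config[field] = usage.pop(field)

    config["usage"] = usage
    return config
```

Under these assumptions, `forward_usage_fields_sketch({"usage": {"start_time": "2026-06-01T00:00:00Z"}})` returns a config with `start_time` at the top level and an empty `usage` mapping.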

Behavior changes:

  • A recipe that previously set only usage.start_time, usage.end_time, or usage.bucket_duration will now have it applied at the top level under both extraction paths, which may widen or narrow the range of data ingested. Since BigQuery usage and lineage extraction scan audit logs, a wider window can increase query cost.
  • A recipe that previously set only usage.max_query_duration will now have it forwarded to the top level, but it still only affects ingestion under the legacy path (use_queries_v2: False).
  • A recipe that sets usage.format_sql_queries: true will now actually reformat SQL queries under the default queries-v2 path (previously silently skipped).
  • Recipes using use_queries_v2: True with usage.include_read_operational_stats or usage.apply_view_usage_to_tables set will now see a warning explaining the field is ignored.
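
As a rough illustration of the last bullet, the check for legacy-only fields could look like the sketch below. The helper name, the plain-dict input, and the assumed defaults (both fields `False`) are hypothetical and may not match the connector.

```python
from typing import Any, Dict, List

# Assumed defaults for the two legacy-only fields; the real defaults may differ.
LEGACY_ONLY_DEFAULTS: Dict[str, Any] = {
    "include_read_operational_stats": False,
    "apply_view_usage_to_tables": False,
}


def legacy_only_warnings_sketch(use_queries_v2: bool, usage: Dict[str, Any]) -> List[str]:
    """Hypothetical sketch: list warnings for legacy-only fields set under queries-v2."""
    if not use_queries_v2:
        return []  # the legacy path reads these fields, so nothing to warn about

    messages = []
    for field, default in LEGACY_ONLY_DEFAULTS.items():
        # Warn only when the field was moved away from its default.
        if usage.get(field, default) != default:
            messages.append(
                f"usage.{field} is ignored when use_queries_v2 is enabled."
            )
    return messages
```

Returning the messages instead of logging them directly keeps the check easy to test, and lets the caller send them to both the log and a report.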

…/bucket_duration to top level

These nested fields were inherited but never read by either the queries-v2 or
legacy usage extraction paths, so setting them silently had no effect.
@github-actions github-actions Bot added the ingestion (PR or Issue related to the ingestion of metadata) and docs (Issues and Improvements to docs) labels Jul 1, 2026
codecov Bot commented Jul 1, 2026

Codecov Report

❌ Patch coverage is 98.14815% with 1 line in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...c/datahub/ingestion/source/bigquery_v2/bigquery.py | 85.71% | 1 Missing ⚠️ |

datahub-connector-tests Bot commented Jul 1, 2026

Connector Tests Results

All connector tests passed for commit b855176

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

…on legacy-only usage fields under queries-v2

Extends the usage.* time-window forwarding to max_query_duration, which had
the same silent-ignore trap as start_time/end_time/bucket_duration. Also
warns when include_read_operational_stats or apply_view_usage_to_tables are
set under use_queries_v2, since queries-v2 has no equivalent mechanism for
either and silently ignores them.

…queries/queries_character_limit under queries-v2

These usage.* fields worked in the legacy extraction path but were dropped
when building the queries-v2 aggregator's usage config, so setting
usage.format_sql_queries had no effect on the default (use_queries_v2=True)
path.

Documents the max_query_duration deprecation, the queries-v2 wiring of
format_sql_queries/include_top_n_queries/queries_character_limit, and the
legacy-only status of include_read_operational_stats/apply_view_usage_to_tables.
@kyungsoo-datahub kyungsoo-datahub changed the title from "fix(ingestion/bigquery): forward deprecated usage.start_time/end_time/bucket_duration to top level" to "fix(ingestion/bigquery): eliminate silently-ignored usage.* config in BigQuery connector" Jul 1, 2026
…ge per review

usage.max_query_duration, unlike start_time/end_time/bucket_duration, is only
read on the legacy (non-queries-v2) extraction path, so the deprecation
warning and docs should not claim it applies to lineage/usage/operations
under the default queries-v2 path. Also closes test gaps flagged in review:
end_time forwarding, nested usage config reverting to its own default after
forwarding, and aggregator forwarding of include_top_n_queries/queries_character_limit.

…queries-v2 config mapping

Redeclares usage.start_time/end_time/bucket_duration/max_query_duration on
BigQueryUsageConfig so the deprecation is visible in generated connector docs,
not just the runtime log line. Also extracts the self.config.usage.* ->
BigQueryQueriesExtractorConfig mapping in bigquery.py into a small testable
method, closing a pre-existing untested seam (top_n_queries had the same gap)
that a typo'd kwarg could otherwise slip through.
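
A minimal sketch of why pulling the mapping into its own method helps. The two dataclasses below are hypothetical stand-ins, not the real BigQueryUsageConfig or BigQueryQueriesExtractorConfig, and the default values are placeholders. Only the field names mentioned in this PR are taken from the source.

```python
from dataclasses import dataclass


@dataclass
class UsageSettings:
    """Hypothetical stand-in for the nested usage config (placeholder defaults)."""
    format_sql_queries: bool = False
    include_top_n_queries: bool = True
    top_n_queries: int = 10
    queries_character_limit: int = 24000


@dataclass
class ExtractorSettings:
    """Hypothetical stand-in for the queries-v2 extractor config."""
    format_queries: bool
    include_top_n_queries: bool
    top_n_queries: int
    queries_character_limit: int


def build_extractor_settings(usage: UsageSettings) -> ExtractorSettings:
    # One small function owns the whole usage.* -> extractor mapping, so a
    # unit test can check every field without constructing a full source.
    return ExtractorSettings(
        format_queries=usage.format_sql_queries,
        include_top_n_queries=usage.include_top_n_queries,
        top_n_queries=usage.top_n_queries,
        queries_character_limit=usage.queries_character_limit,
    )
```

A test can then pass non-default values for every field and assert that each one arrives on the other side, which catches a dropped or misnamed argument.
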
Add missing return type annotation and construct the fake BigqueryV2Source
via __new__ instead of SimpleNamespace so it satisfies the method's type
signature. lintFix only runs ruff, not mypy, so these slipped past local
verification and only surfaced in CI's :metadata-ingestion:lint task.
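
The `__new__` approach mentioned above can be sketched as follows. `HeavySource` is a hypothetical stand-in for a source class whose `__init__` is too expensive to run in a unit test; it is not the real BigqueryV2Source.

```python
class HeavySource:
    """Hypothetical stand-in for a source whose __init__ needs real connections."""

    format_queries: bool  # declared so type checkers know the attribute exists

    def __init__(self) -> None:
        raise RuntimeError("would open network connections")

    def describe(self) -> str:
        return f"format_queries={self.format_queries}"


def make_fake_source(format_queries: bool) -> HeavySource:
    # __new__ allocates the instance without running __init__, so the result
    # is a genuine HeavySource for isinstance() and for static type checkers,
    # unlike a SimpleNamespace carrying the same attributes.
    fake = HeavySource.__new__(HeavySource)
    fake.format_queries = format_queries
    return fake
```

Only the attributes the method under test reads need to be set on the fake. Anything else raises `AttributeError` if touched, which makes hidden dependencies visible.
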
…it on standalone queries source

BigQueryQueriesExtractorConfig accepted inconsistent top_n_queries/
queries_character_limit combos at parse time (only failing later inside
BaseUsageConfig during extractor init) since it doesn't share BaseUsageConfig's
validator. Extracts that check into a shared helper and applies it here too.

Also documents the programmatic-construction bypass of the usage.* forwarding
validator, surfaces the two legacy-only-field warnings in the structured
ingestion report (not just the log), composes the redeclared field
descriptions from the base class text to avoid drift, and parametrizes the
conflict tests across all four forwarded fields.
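
A sketch of the shared-helper idea. The exact rule is an assumption: here `top_n_queries` may not exceed `queries_character_limit` divided by a minimum per-query size, and the 20-character minimum, the helper name, and the two config classes are all hypothetical.

```python
from dataclasses import dataclass

MIN_QUERY_SIZE = 20  # assumed minimum number of characters reserved per stored query


def check_top_n_queries_fits(top_n_queries: int, queries_character_limit: int) -> int:
    """Hypothetical shared check, raising at parse time instead of at extractor init."""
    max_queries = queries_character_limit // MIN_QUERY_SIZE
    if top_n_queries > max_queries:
        raise ValueError(
            f"top_n_queries is {top_n_queries}, but queries_character_limit="
            f"{queries_character_limit} allows at most {max_queries}."
        )
    return top_n_queries


@dataclass
class UsageConfigSketch:
    top_n_queries: int = 10
    queries_character_limit: int = 24000

    def __post_init__(self) -> None:
        check_top_n_queries_fits(self.top_n_queries, self.queries_character_limit)


@dataclass
class QueriesExtractorConfigSketch:
    top_n_queries: int = 10
    queries_character_limit: int = 24000

    def __post_init__(self) -> None:
        # Same helper, so both configs reject the same combinations.
        check_top_n_queries_fits(self.top_n_queries, self.queries_character_limit)
```

With a single helper, the two config classes cannot drift apart on what counts as a valid combination.
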
Removed three tests that only checked a Field's default= against a hardcoded
or re-exported literal (no behavior beyond what pydantic already guarantees):
format_sql_queries/include_top_n_queries default checks, and a
queries_character_limit "parity" check that referenced the same imported
constant the field itself uses, so it could never actually drift. Also
hoisted per-method local imports in TestQueriesExtractorUsageConfigWiring to
module level.

…ate tests

Renames forward_usage_time_window_fields to forward_deprecated_usage_fields
since it also covers max_query_duration, which isn't a window boundary field.
Tightens code comments on the dict-only guard and defensive copy.

Parametrizes near-duplicate tests across both test files (field forwarding,
legacy-only warnings, aggregator wiring) and drops the programmatic-bypass
test, whose documented behavior is already covered by a code comment.
@kyungsoo-datahub kyungsoo-datahub marked this pull request as ready for review July 2, 2026 19:58
@maggiehays maggiehays added the needs-review (Label for PRs that need review from a maintainer) label Jul 2, 2026