fix(ingestion/bigquery): eliminate silently-ignored usage.* config in BigQuery connector by kyungsoo-datahub · Pull Request #18133 · datahub-project/datahub

kyungsoo-datahub · 2026-07-01T22:58:15Z

Before this fix, usage.start_time, usage.end_time, usage.bucket_duration, usage.max_query_duration, usage.format_sql_queries, usage.include_top_n_queries, and usage.queries_character_limit did not behave as configured: several were silently ignored, and one (format_sql_queries) was a real bug on the default extraction path. This PR closes out the remaining gaps:
- Time window forwarding: usage.start_time, usage.end_time, and usage.bucket_duration (already fixed in a prior commit on this branch) are forwarded to their top-level equivalents with a deprecation warning, and now apply correctly under both extraction paths. usage.max_query_duration is also forwarded, but unlike the other three it only takes effect on the legacy path (use_queries_v2: False): the top-level field itself is legacy-only, so queries-v2 ignores it either way. Setting both the nested and top-level versions of the same field is now a configuration error.
- Queries-v2 wiring: usage.format_sql_queries, usage.include_top_n_queries, and usage.queries_character_limit already worked on the legacy path, which reads them directly. The queries-v2 aggregator (the default, use_queries_v2: True) dropped them entirely and hardcoded format_queries=False. They now flow through to the SQL parsing aggregator used by queries-v2.
- Legacy-only fields documented: usage.include_read_operational_stats and usage.apply_view_usage_to_tables have no equivalent under queries-v2, which attributes usage via SQL parsing rather than the legacy mechanism these fields rely on. DataHub now logs a warning when either is set to a non-default value while use_queries_v2: True, and their descriptions are updated to clarify they're legacy-only.
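
The forwarding rule in the first bullet can be sketched on plain dictionaries. This is a minimal illustration, not the connector's code: the function name `forward_usage_fields`, the `FORWARDED_FIELDS` tuple, the timestamp values, and the use of `warnings.warn` are all assumptions made for the example, and the real connector works on its own config classes rather than raw dicts.

```python
import warnings
from typing import Any, Dict

# Nested usage.* fields that have a top-level equivalent (per the description above).
FORWARDED_FIELDS = ("start_time", "end_time", "bucket_duration", "max_query_duration")


def forward_usage_fields(recipe: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `recipe` with deprecated usage.* fields moved to the top level."""
    result = dict(recipe)
    usage = result.get("usage")
    if not isinstance(usage, dict):
        return result  # no nested usage block, nothing to forward
    usage = dict(usage)  # defensive copy so the caller's dict is not mutated
    for field in FORWARDED_FIELDS:
        if field not in usage:
            continue
        if field in result:
            # Nested and top-level versions of the same field conflict.
            raise ValueError(
                f"'{field}' is set both at the top level and under 'usage'; "
                "set only the top-level field"
            )
        warnings.warn(
            f"usage.{field} is deprecated; use the top-level {field} instead",
            DeprecationWarning,
        )
        result[field] = usage.pop(field)
    result["usage"] = usage
    return result
```

With this sketch, a recipe that sets only `usage.start_time` ends up with a top-level `start_time` and one deprecation warning, while a recipe that sets both raises a `ValueError` before anything is forwarded.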

Behavior changes:

- A recipe that previously set only usage.start_time, usage.end_time, or usage.bucket_duration will now have it applied at the top level under both extraction paths, which may widen or narrow the range of data ingested. Since BigQuery usage and lineage extraction scan audit logs, a wider window can increase query cost.
- A recipe that previously set only usage.max_query_duration will now have it forwarded to the top level, but it still only affects ingestion under the legacy path (use_queries_v2: False).
- A recipe that sets usage.format_sql_queries: true will now actually reformat SQL queries under the default queries-v2 path (previously silently skipped).
- Recipes using use_queries_v2: True with usage.include_read_operational_stats or usage.apply_view_usage_to_tables set will now see a warning explaining the field is ignored.

codecov · 2026-07-01T23:02:39Z

Codecov Report

❌ Patch coverage is 98.14815% with 1 line in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...c/datahub/ingestion/source/bigquery_v2/bigquery.py | 85.71% | 1 Missing ⚠️ |

datahub-connector-tests · 2026-07-01T23:17:08Z

Connector Tests Results

All connector tests passed for commit b855176

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

…on legacy-only usage fields under queries-v2
Extends the usage.* time-window forwarding to max_query_duration, which had the same silent-ignore trap as start_time/end_time/bucket_duration. Also warns when include_read_operational_stats or apply_view_usage_to_tables are set under use_queries_v2, since queries-v2 has no equivalent mechanism for either and silently ignores them.
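
The legacy-only warning described in this commit might look roughly like the sketch below. The helper name `legacy_only_warnings`, the message wording, and the `False` defaults in `LEGACY_ONLY_DEFAULTS` are assumptions for illustration; the PR text only says a warning is emitted when either field has a non-default value under use_queries_v2.

```python
from typing import Any, Dict, List

# Defaults assumed for illustration only; check the connector docs for the real ones.
LEGACY_ONLY_DEFAULTS: Dict[str, Any] = {
    "include_read_operational_stats": False,
    "apply_view_usage_to_tables": False,
}


def legacy_only_warnings(use_queries_v2: bool, usage: Dict[str, Any]) -> List[str]:
    """Collect one message per legacy-only usage field set to a non-default value."""
    if not use_queries_v2:
        return []  # the legacy path reads these fields, so there is nothing to warn about
    messages = []
    for field, default in LEGACY_ONLY_DEFAULTS.items():
        if usage.get(field, default) != default:
            messages.append(f"usage.{field} is ignored when use_queries_v2 is true")
    return messages
```

Returning the messages instead of logging them directly keeps the check easy to test and lets a caller send the same text to both a log line and a structured report.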

…queries/queries_character_limit under queries-v2
These usage.* fields worked in the legacy extraction path but were dropped when building the queries-v2 aggregator's usage config, so setting usage.format_sql_queries had no effect on the default (use_queries_v2=True) path.

Documents the max_query_duration deprecation, the queries-v2 wiring of format_sql_queries/include_top_n_queries/queries_character_limit, and the legacy-only status of include_read_operational_stats/apply_view_usage_to_tables.

…ge per review
usage.max_query_duration, unlike start_time/end_time/bucket_duration, is only read on the legacy (non-queries-v2) extraction path, so the deprecation warning and docs should not claim it applies to lineage/usage/operations under the default queries-v2 path. Also closes test gaps flagged in review: end_time forwarding, nested usage config reverting to its own default after forwarding, and aggregator forwarding of include_top_n_queries/queries_character_limit.

…queries-v2 config mapping
Redeclares usage.start_time/end_time/bucket_duration/max_query_duration on BigQueryUsageConfig so the deprecation is visible in generated connector docs, not just the runtime log line. Also extracts the self.config.usage.* -> BigQueryQueriesExtractorConfig mapping in bigquery.py into a small testable method, closing a pre-existing untested seam (top_n_queries had the same gap) that a typo'd kwarg could otherwise slip through.
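
The value of extracting the mapping into its own method is that a unit test can compare inputs and outputs field by field. A rough sketch of the idea follows; `UsageSettings`, `usage_kwargs_for_queries_v2`, and the default values are hypothetical stand-ins, not the real BigQueryUsageConfig or BigQueryQueriesExtractorConfig.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class UsageSettings:
    """Hypothetical stand-in for the nested usage.* config; defaults are assumed."""

    format_sql_queries: bool = False
    include_top_n_queries: bool = True
    top_n_queries: int = 10
    queries_character_limit: int = 24000


def usage_kwargs_for_queries_v2(usage: UsageSettings) -> Dict[str, Any]:
    """Map usage.* settings to keyword arguments for a queries-v2 extractor config."""
    return {
        "format_sql_queries": usage.format_sql_queries,
        "include_top_n_queries": usage.include_top_n_queries,
        "top_n_queries": usage.top_n_queries,
        "queries_character_limit": usage.queries_character_limit,
    }
```

A test that passes non-default values for every field and asserts on the returned dictionary would fail if any key were misspelled or left out, which is the kind of gap this commit describes closing.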

Add missing return type annotation and construct the fake BigqueryV2Source via __new__ instead of SimpleNamespace so it satisfies the method's type signature. lintFix only runs ruff, not mypy, so these slipped past local verification and only surfaced in CI's :metadata-ingestion:lint task.
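
Constructing a test double through `__new__` gives an object that is a genuine instance of the class, so a type checker accepts it where a SimpleNamespace would be rejected. The sketch below shows the general Python technique with a hypothetical `Source` class; it is not the BigqueryV2Source test from this PR.

```python
class Source:
    """Hypothetical class whose __init__ needs resources a unit test cannot supply."""

    def __init__(self, client: object) -> None:
        if client is None:
            raise RuntimeError("a real client is required")
        self.client = client
        self.limit = 10

    def describe(self) -> str:
        return f"limit={self.limit}"


def make_fake_source(limit: int) -> Source:
    """Build a Source without running __init__, setting only what the test needs."""
    fake = Source.__new__(Source)  # allocates the instance; __init__ is skipped
    fake.limit = limit
    return fake
```

The trade-off is that any attribute `__init__` would have set is missing unless the test assigns it, so this approach suits methods that read only a few fields.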

…it on standalone queries source
BigQueryQueriesExtractorConfig accepted inconsistent top_n_queries/queries_character_limit combos at parse time (only failing later inside BaseUsageConfig during extractor init) since it doesn't share BaseUsageConfig's validator. Extracts that check into a shared helper and applies it here too. Also documents the programmatic-construction bypass of the usage.* forwarding validator, surfaces the two legacy-only-field warnings in the structured ingestion report (not just the log), composes the redeclared field descriptions from the base class text to avoid drift, and parametrizes the conflict tests across all four forwarded fields.
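
A shared helper of the kind this commit describes could be as small as the sketch below. The exact rule is not stated in the PR text, so the sketch assumes the check is that top_n_queries may not exceed queries_character_limit divided by a minimum per-query size; the constant `MIN_QUERY_SIZE = 20` and the function name `validate_top_n_within_limit` are assumptions.

```python
MIN_QUERY_SIZE = 20  # assumed minimum number of characters kept per query


def validate_top_n_within_limit(top_n_queries: int, queries_character_limit: int) -> int:
    """Reject top_n_queries values that cannot fit within the character limit."""
    max_queries = queries_character_limit // MIN_QUERY_SIZE
    if top_n_queries > max_queries:
        raise ValueError(
            f"top_n_queries is set to {top_n_queries} but can be at most {max_queries} "
            f"for queries_character_limit={queries_character_limit}"
        )
    return top_n_queries
```

Because the helper is a plain function, two config classes can both call it from their own validators, so an inconsistent combination is rejected at parse time in either place.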

Removed three tests that only checked a Field's default= against a hardcoded or re-exported literal (no behavior beyond what pydantic already guarantees): format_sql_queries/include_top_n_queries default checks, and a queries_character_limit "parity" check that referenced the same imported constant the field itself uses, so it could never actually drift. Also hoisted per-method local imports in TestQueriesExtractorUsageConfigWiring to module level.

…ate tests
Renames forward_usage_time_window_fields to forward_deprecated_usage_fields since it also covers max_query_duration, which isn't a window boundary field. Tightens code comments on the dict-only guard and defensive copy. Parametrizes near-duplicate tests across both test files (field forwarding, legacy-only warnings, aggregator wiring) and drops the programmatic-bypass test, whose documented behavior is already covered by a code comment.

fix(ingestion/bigquery): forward deprecated usage.start_time/end_time/bucket_duration to top level (fdf7df4)
These nested fields were inherited but never read by either the queries-v2 or legacy usage extraction paths, so setting them silently had no effect.

github-actions Bot added the ingestion and docs labels Jul 1, 2026

kyungsoo-datahub added 3 commits July 1, 2026 16:28

kyungsoo-datahub changed the title ~~fix(ingestion/bigquery): forward deprecated usage.start_time/end_time/bucket_duration to top level~~ fix(ingestion/bigquery): eliminate silently-ignored usage.* config in BigQuery connector Jul 1, 2026

kyungsoo-datahub marked this pull request as ready for review July 2, 2026 19:58

maggiehays added the needs-review label Jul 2, 2026

kyungsoo-datahub wants to merge 10 commits into master from fix/bigquery-usage-time-window-fields-ignored

2 participants
