fix(ingest/sql-parsing): attribute query usage to temp-table composite queries by treff7es · Pull Request #18130 · datahub-project/datahub

treff7es · 2026-07-01T20:14:27Z

Summary

Query usage statistics were recorded but never emitted for temp-table-fed queries — the common ELT/dbt shape. This fixes the attribution so per-query popularity lands on the composite Query entity that lineage actually references.

Root cause

SqlParsingAggregator keys query usage counts by raw statement fingerprint, but a query fed by temp tables is emitted as a Query entity under a composite fingerprint (temp-table resolution merges the staging statements into one logical query). Emission looked up usage by the composite id, which is never present in the usage store, so the lookup always missed:

query_counter = self._query_usage_counts.get(query_id)  # query_id == composite → None

Consequences:

Composite queries were emitted with lineage but no queryUsageStatistics.
The fallback _gen_remaining_queries pass then either dropped the usage entirely (when is_allowed_table filtered the raw temp upstreams — e.g. Redshift, where reports showed num_query_usage_stats_generated: 0) or emitted it on orphan raw-fingerprint Query URNs that no lineage edge references.
Operations referenced the raw statement id while lineage referenced the composite — two Query nodes for one logical load.

Non-temp queries were unaffected (the fast path keeps the raw id), which is why this went unnoticed.

Fix

Carry the raw component fingerprints on the composite (composed_of, base statement first) and:

attribute usage via composed_of[0] — the base (downstream-writing) statement's count reflects how often the pipeline ran, not the sum of the merged statements;
mark the merged components generated so they aren't re-emitted as orphan Query entities carrying duplicate usage;
resolve temp tables in the operations pass too, so operations reference the same composite Query as lineage (otherwise the raw id is referenced but never emitted, leaving a dangling query reference).

Behavior change

For temp-table-fed targets the graph now has one canonical (composite) Query entity carrying both lineage and usage, instead of the composite plus orphan raw-statement Query entities. Operations and lineage now agree on which Query they reference.

Testing

New unit test test_query_usage_stats_attributed_to_temp_table_composite_query (asserts usage lands only on the composite, with the base-statement count).
Regenerated affected goldens: sql_queries/session-temp-tables, sql_queries/temp-table-patterns, bigquery_queries_mcps_golden.
tests/unit/sql_parsing + tests/unit/redshift + tests/integration/sql_queries + bigquery_queries → all pass (562 passed, 3 skipped). Dangling-reference audit across regenerated goldens: none. ruff + mypy clean.

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Tests added/updated
Docs updated (n/a — internal correctness fix)

Golden file changes

Three goldens were regenerated to reflect the corrected behavior. Every change falls into the same three categories, and the result was audited to contain no dangling query references — every operation.queries and lineage query still resolves to an emitted Query entity, and every removed entity was confirmed to be cleanly subsumed by a composite (not referenced anywhere).

Golden	Query entities	What changed
`tests/integration/sql_queries/golden/session-temp-tables.json`	7 → 2	5 orphan raw-statement Query entities removed; the 2 composite queries gained `queryUsageStatistics` (`queryCount=1`); the `audit_log` and `user_summary` `operation` aspects were re-pointed from raw statement ids to their composites
`tests/integration/sql_queries/golden/temp-table-patterns.json`	7 → 5	2 raw-statement Query entities removed (subsumed into a composite); the `processed_events` `operation` aspect was re-pointed from a raw statement id to its composite. This scenario has no query-usage config, so it isolates the operation-reference part of the fix (no `queryUsageStatistics` changes)
`tests/integration/bigquery_v2/bigquery_queries_mcps_golden.json`	83 → 81	2 orphan raw-statement Query entities removed; `composite_29c3…` gained `queryUsageStatistics` (`queryCount=1`); the `lineage_from_tmp_table` `operation` aspect was re-pointed to the composite

Why: a temp-table-fed load is represented by a single composite Query that lineage points to. Before this change, the raw component statements were also emitted as standalone Query entities (carrying the usage that belonged on the composite), and operation aspects referenced those raw ids. The regenerated goldens therefore (1) drop the redundant raw Query entities, (2) move queryUsageStatistics onto the composite, and (3) re-point operation.queries at the composite so operations agree with lineage. queryCount on the composite reflects the base (downstream-writing) statement's execution count — i.e. how often the load ran — not the sum of the merged statements.

…e queries Query usage counts are keyed by raw statement fingerprint, but a query fed by temp tables is emitted as a Query entity under a composite fingerprint (temp-table resolution merges the staging statements into one logical query). Emission looked up usage by the composite id, which is never present in the usage store, so the lookup always missed. The result: composite queries were emitted with lineage but no queryUsageStatistics, and the "remaining queries" pass either dropped the usage entirely (when is_allowed_table filtered the raw temp upstreams) or emitted it on orphan raw-fingerprint Query URNs that no lineage edge references. This is the common ELT/dbt shape, so per-query popularity was silently lost for temp-table-fed tables. Carry the raw component fingerprints on the composite (composed_of, base statement first) and: - attribute usage via composed_of[0] - the base statement's count reflects how often the pipeline ran (not the sum of the merged statements) - mark the merged components generated so they aren't re-emitted as orphan Query entities carrying duplicate usage - resolve temp tables in the operations pass too, so operations reference the same composite Query as lineage (otherwise the raw id is referenced but never emitted, leaving a dangling query reference) Adds a unit test and regenerates the affected sql_queries and bigquery_queries goldens.

Add an "Other Notable Changes" entry (#18130) describing the corrected queryUsageStatistics attribution for temp-table-fed queries across the SQL parsing connectors.

codecov · 2026-07-01T20:17:52Z

Codecov Report

❌ Patch coverage is 95.65217% with 1 line in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
.../src/datahub/sql_parsing/sql_parsing_aggregator.py	95.65%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

datahub-connector-tests · 2026-07-01T20:59:54Z

Connector Tests Results

Connector tests failed for commit 06dafda

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

…osition Address review feedback: replace the positional composed_of list (element [0] = base by convention) with an explicit QueryComposition(base, others) dataclass, and move the usage-key selection into a QueryMetadata.usage_query_id property. No behavior change - goldens and tests are unchanged.

sgomezvillamor

Approving — the fix is correct, well-scoped, and well-documented. The root cause (usage keyed by raw fingerprint but looked up by composite id) is real, and all three parts check out: usage_query_id redirects the count lookup to the base component while the emitted URN stays composite; marking composed_of.all_queries generated prevents orphan re-emission in _gen_remaining_queries; and resolving temp tables in the operations pass keeps operation.queries pointed at the same composite as lineage. The golden deltas match exactly. Both earlier review comments (QueryComposition dataclass, usage_query_id on the model) are addressed in 2cc583b.

Three non-blocking findings:

🟡 Double temp-table resolution. _resolve_query_with_temp_tables is now called a second time per lineage entry in _gen_operation_mcps (its cache is local per-call), so with both generate_lineage and generate_operations on, the recursive resolution runs twice per downstream/query — exactly the ELT shape this targets. Consider memoizing resolved composites (keyed by base query_id) or reusing the lineage-pass result.
ℹ️ queryCount semantic. Using the base statement's count rather than the sum is a deliberate, documented choice — just confirming it's intended for sessions where base vs. temp-loaders ran different numbers of times.
ℹ️ Test coverage. The new unit test runs with generate_operations=False, so the operations re-pointing is covered only by the integration goldens. Optional adds: assert no orphan raw-statement Query entities are emitted, and a base-executed-twice case to lock in the count semantic.

CI note: the only failing test, test_nifi_ingest_cluster, is unrelated (NiFi doesn't use SqlParsingAggregator; it's a network-flaky external-host golden) — a re-run should go green.

Generated by Claude Code

Address non-blocking review feedback on #18130 (test coverage). Add a unit test covering generate_operations=True - the operation.queries aspect references the composite Query, matching lineage - and a base-executed-twice case that locks the base-count-not-sum queryCount semantic. Also assert no orphan raw-statement Query entities are emitted. (The suggested memoization of temp-table resolution was evaluated and intentionally not added: caching composites holds QueryMetadata proportional to the workload, which cuts against the aggregator's FileBackedDict memory model, while only saving cheap re-traversal of already-parsed queries that occurs solely when generate_operations and generate_lineage are both enabled.)

treff7es · 2026-07-03T05:26:31Z

Thanks! Addressed the non-blocking findings in 06dafda9bac:

Add dataset ownership metadata collection #3 (tests): added test_temp_table_composite_query_operations_and_repeated_execution — covers the operations re-pointing (operation.queries → composite, matching lineage) and a base-executed-twice case that locks the base-count (not sum) queryCount semantic; also assert no orphan raw-statement Query entities are emitted on the existing test.
#1 (double resolution): evaluated memoization but opted not to — caching composites holds QueryMetadata proportional to the workload, which cuts against the aggregator's FileBackedDict memory model, while only saving cheap re-traversal of already-parsed queries that occurs solely when generate_operations and generate_lineage are both enabled. If it ever shows up as a hotspot, folding operation emission into the lineage pass would remove the double walk with zero extra memory.
Exclude tests that need configurations #2 (queryCount): confirmed intended — the base statement's execution count (documented on QueryComposition), not the sum of the merged statements.

The commit is test-only; the source is byte-identical to the approved state.

github-actions Bot added the ingestion PR or Issue related to the ingestion of metadata label Jul 1, 2026

github-actions Bot deployed to datahub-wheels (Preview) July 1, 2026 20:16 View deployment

docs: note temp-table query usage behavior change in updating-datahub

6825b9c

Add an "Other Notable Changes" entry (#18130) describing the corrected queryUsageStatistics attribution for temp-table-fed queries across the SQL parsing connectors.

github-actions Bot deployed to datahub-wheels (Preview) July 1, 2026 20:19 View deployment

maggiehays added the needs-review Label for PRs that need review from a maintainer. label Jul 1, 2026

vercel Bot deployed to Preview July 1, 2026 20:32 View deployment

sgomezvillamor reviewed Jul 2, 2026

View reviewed changes

Comment thread metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py Outdated

sgomezvillamor reviewed Jul 2, 2026

View reviewed changes

Comment thread metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py Outdated

maggiehays added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Jul 2, 2026

github-actions Bot deployed to datahub-wheels (Preview) July 2, 2026 09:33 View deployment

vercel Bot deployed to Preview July 2, 2026 09:46 View deployment

sgomezvillamor approved these changes Jul 2, 2026

View reviewed changes

github-actions Bot deployed to datahub-wheels (Preview) July 2, 2026 19:28 View deployment

vercel Bot deployed to Preview July 2, 2026 19:41 View deployment

maggiehays added pending-submitter-merge and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Jul 2, 2026

treff7es merged commit 4ce3ea5 into master Jul 3, 2026
77 of 79 checks passed

treff7es deleted the worktree-fix-temp-composite-query-usage branch July 3, 2026 06:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(ingest/sql-parsing): attribute query usage to temp-table composite queries#18130

fix(ingest/sql-parsing): attribute query usage to temp-table composite queries#18130
treff7es merged 4 commits into
masterfrom
worktree-fix-temp-composite-query-usage

treff7es commented Jul 1, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

datahub-connector-tests Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

sgomezvillamor left a comment

Uh oh!

treff7es commented Jul 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

treff7es commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix

Behavior change

Testing

Checklist

Golden file changes

Uh oh!

codecov Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

datahub-connector-tests Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Connector Tests Results

Uh oh!

Uh oh!

Uh oh!

sgomezvillamor left a comment

Choose a reason for hiding this comment

Uh oh!

treff7es commented Jul 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

treff7es commented Jul 1, 2026 •

edited

Loading

codecov Bot commented Jul 1, 2026 •

edited

Loading

datahub-connector-tests Bot commented Jul 1, 2026 •

edited

Loading