Skip to content

fix(ingest/sql-parsing): attribute query usage to temp-table composite queries#18130

Merged
treff7es merged 4 commits into
masterfrom
worktree-fix-temp-composite-query-usage
Jul 3, 2026
Merged

fix(ingest/sql-parsing): attribute query usage to temp-table composite queries#18130
treff7es merged 4 commits into
masterfrom
worktree-fix-temp-composite-query-usage

Conversation

@treff7es

@treff7es treff7es commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Query usage statistics were recorded but never emitted for temp-table-fed queries — the common ELT/dbt shape. This fixes the attribution so per-query popularity lands on the composite Query entity that lineage actually references.

Root cause

SqlParsingAggregator keys query usage counts by raw statement fingerprint, but a query fed by temp tables is emitted as a Query entity under a composite fingerprint (temp-table resolution merges the staging statements into one logical query). Emission looked up usage by the composite id, which is never present in the usage store, so the lookup always missed:

query_counter = self._query_usage_counts.get(query_id)  # query_id == composite → None

Consequences:

  • Composite queries were emitted with lineage but no queryUsageStatistics.
  • The fallback _gen_remaining_queries pass then either dropped the usage entirely (when is_allowed_table filtered the raw temp upstreams — e.g. Redshift, where reports showed num_query_usage_stats_generated: 0) or emitted it on orphan raw-fingerprint Query URNs that no lineage edge references.
  • Operations referenced the raw statement id while lineage referenced the composite — two Query nodes for one logical load.

Non-temp queries were unaffected (the fast path keeps the raw id), which is why this went unnoticed.

Fix

Carry the raw component fingerprints on the composite (composed_of, base statement first) and:

  • attribute usage via composed_of[0] — the base (downstream-writing) statement's count reflects how often the pipeline ran, not the sum of the merged statements;
  • mark the merged components generated so they aren't re-emitted as orphan Query entities carrying duplicate usage;
  • resolve temp tables in the operations pass too, so operations reference the same composite Query as lineage (otherwise the raw id is referenced but never emitted, leaving a dangling query reference).

Behavior change

For temp-table-fed targets the graph now has one canonical (composite) Query entity carrying both lineage and usage, instead of the composite plus orphan raw-statement Query entities. Operations and lineage now agree on which Query they reference.

Testing

  • New unit test test_query_usage_stats_attributed_to_temp_table_composite_query (asserts usage lands only on the composite, with the base-statement count).
  • Regenerated affected goldens: sql_queries/session-temp-tables, sql_queries/temp-table-patterns, bigquery_queries_mcps_golden.
  • tests/unit/sql_parsing + tests/unit/redshift + tests/integration/sql_queries + bigquery_queries → all pass (562 passed, 3 skipped). Dangling-reference audit across regenerated goldens: none. ruff + mypy clean.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Tests added/updated
  • Docs updated (n/a — internal correctness fix)

Golden file changes

Three goldens were regenerated to reflect the corrected behavior. Every change falls into the same three categories, and the result was audited to contain no dangling query references — every operation.queries and lineage query still resolves to an emitted Query entity, and every removed entity was confirmed to be cleanly subsumed by a composite (not referenced anywhere).

Golden Query entities What changed
tests/integration/sql_queries/golden/session-temp-tables.json 7 → 2 5 orphan raw-statement Query entities removed; the 2 composite queries gained queryUsageStatistics (queryCount=1); the audit_log and user_summary operation aspects were re-pointed from raw statement ids to their composites
tests/integration/sql_queries/golden/temp-table-patterns.json 7 → 5 2 raw-statement Query entities removed (subsumed into a composite); the processed_events operation aspect was re-pointed from a raw statement id to its composite. This scenario has no query-usage config, so it isolates the operation-reference part of the fix (no queryUsageStatistics changes)
tests/integration/bigquery_v2/bigquery_queries_mcps_golden.json 83 → 81 2 orphan raw-statement Query entities removed; composite_29c3… gained queryUsageStatistics (queryCount=1); the lineage_from_tmp_table operation aspect was re-pointed to the composite

Why: a temp-table-fed load is represented by a single composite Query that lineage points to. Before this change, the raw component statements were also emitted as standalone Query entities (carrying the usage that belonged on the composite), and operation aspects referenced those raw ids. The regenerated goldens therefore (1) drop the redundant raw Query entities, (2) move queryUsageStatistics onto the composite, and (3) re-point operation.queries at the composite so operations agree with lineage. queryCount on the composite reflects the base (downstream-writing) statement's execution count — i.e. how often the load ran — not the sum of the merged statements.

…e queries

Query usage counts are keyed by raw statement fingerprint, but a query fed
by temp tables is emitted as a Query entity under a composite fingerprint
(temp-table resolution merges the staging statements into one logical query).
Emission looked up usage by the composite id, which is never present in the
usage store, so the lookup always missed. The result: composite queries were
emitted with lineage but no queryUsageStatistics, and the "remaining queries"
pass either dropped the usage entirely (when is_allowed_table filtered the raw
temp upstreams) or emitted it on orphan raw-fingerprint Query URNs that no
lineage edge references. This is the common ELT/dbt shape, so per-query
popularity was silently lost for temp-table-fed tables.

Carry the raw component fingerprints on the composite (composed_of, base
statement first) and:
- attribute usage via composed_of[0] - the base statement's count reflects how
  often the pipeline ran (not the sum of the merged statements)
- mark the merged components generated so they aren't re-emitted as orphan
  Query entities carrying duplicate usage
- resolve temp tables in the operations pass too, so operations reference the
  same composite Query as lineage (otherwise the raw id is referenced but never
  emitted, leaving a dangling query reference)

Adds a unit test and regenerates the affected sql_queries and bigquery_queries
goldens.
@github-actions github-actions Bot added the ingestion PR or Issue related to the ingestion of metadata label Jul 1, 2026
Add an "Other Notable Changes" entry (#18130) describing the corrected
queryUsageStatistics attribution for temp-table-fed queries across the SQL
parsing connectors.
@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 95.65217% with 1 line in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
.../src/datahub/sql_parsing/sql_parsing_aggregator.py 95.65% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@datahub-connector-tests

datahub-connector-tests Bot commented Jul 1, 2026

Copy link
Copy Markdown

Connector Tests Results

Connector tests failed for commit 06dafda

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

Comment thread metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py Outdated
Comment thread metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py Outdated
@maggiehays maggiehays added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Jul 2, 2026
…osition

Address review feedback: replace the positional composed_of list (element [0]
= base by convention) with an explicit QueryComposition(base, others) dataclass,
and move the usage-key selection into a QueryMetadata.usage_query_id property.
No behavior change - goldens and tests are unchanged.

@sgomezvillamor sgomezvillamor left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving — the fix is correct, well-scoped, and well-documented. The root cause (usage keyed by raw fingerprint but looked up by composite id) is real, and all three parts check out: usage_query_id redirects the count lookup to the base component while the emitted URN stays composite; marking composed_of.all_queries generated prevents orphan re-emission in _gen_remaining_queries; and resolving temp tables in the operations pass keeps operation.queries pointed at the same composite as lineage. The golden deltas match exactly. Both earlier review comments (QueryComposition dataclass, usage_query_id on the model) are addressed in 2cc583b.

Three non-blocking findings:

  1. 🟡 Double temp-table resolution. _resolve_query_with_temp_tables is now called a second time per lineage entry in _gen_operation_mcps (its cache is local per-call), so with both generate_lineage and generate_operations on, the recursive resolution runs twice per downstream/query — exactly the ELT shape this targets. Consider memoizing resolved composites (keyed by base query_id) or reusing the lineage-pass result.

  2. ℹ️ queryCount semantic. Using the base statement's count rather than the sum is a deliberate, documented choice — just confirming it's intended for sessions where base vs. temp-loaders ran different numbers of times.

  3. ℹ️ Test coverage. The new unit test runs with generate_operations=False, so the operations re-pointing is covered only by the integration goldens. Optional adds: assert no orphan raw-statement Query entities are emitted, and a base-executed-twice case to lock in the count semantic.

CI note: the only failing test, test_nifi_ingest_cluster, is unrelated (NiFi doesn't use SqlParsingAggregator; it's a network-flaky external-host golden) — a re-run should go green.


Generated by Claude Code

Address non-blocking review feedback on #18130 (test coverage). Add a unit test
covering generate_operations=True - the operation.queries aspect references the
composite Query, matching lineage - and a base-executed-twice case that locks the
base-count-not-sum queryCount semantic. Also assert no orphan raw-statement Query
entities are emitted.

(The suggested memoization of temp-table resolution was evaluated and
intentionally not added: caching composites holds QueryMetadata proportional to
the workload, which cuts against the aggregator's FileBackedDict memory model,
while only saving cheap re-traversal of already-parsed queries that occurs solely
when generate_operations and generate_lineage are both enabled.)
@treff7es

treff7es commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

Thanks! Addressed the non-blocking findings in 06dafda9bac:

  • Add dataset ownership metadata collection #3 (tests): added test_temp_table_composite_query_operations_and_repeated_execution — covers the operations re-pointing (operation.queries → composite, matching lineage) and a base-executed-twice case that locks the base-count (not sum) queryCount semantic; also assert no orphan raw-statement Query entities are emitted on the existing test.
  • #1 (double resolution): evaluated memoization but opted not to — caching composites holds QueryMetadata proportional to the workload, which cuts against the aggregator's FileBackedDict memory model, while only saving cheap re-traversal of already-parsed queries that occurs solely when generate_operations and generate_lineage are both enabled. If it ever shows up as a hotspot, folding operation emission into the lineage pass would remove the double walk with zero extra memory.
  • Exclude tests that need configurations #2 (queryCount): confirmed intended — the base statement's execution count (documented on QueryComposition), not the sum of the merged statements.

The commit is test-only; the source is byte-identical to the approved state.

@treff7es treff7es merged commit 4ce3ea5 into master Jul 3, 2026
77 of 79 checks passed
@treff7es treff7es deleted the worktree-fix-temp-composite-query-usage branch July 3, 2026 06:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion PR or Issue related to the ingestion of metadata pending-submitter-merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants