fix(ingest/sql-parsing): attribute query usage to temp-table composite queries #18130
…e queries

Query usage counts are keyed by raw statement fingerprint, but a query fed by temp tables is emitted as a Query entity under a composite fingerprint (temp-table resolution merges the staging statements into one logical query). Emission looked up usage by the composite id, which is never present in the usage store, so the lookup always missed.

The result: composite queries were emitted with lineage but no queryUsageStatistics, and the "remaining queries" pass either dropped the usage entirely (when is_allowed_table filtered the raw temp upstreams) or emitted it on orphan raw-fingerprint Query URNs that no lineage edge references. This is the common ELT/dbt shape, so per-query popularity was silently lost for temp-table-fed tables.

Carry the raw component fingerprints on the composite (composed_of, base statement first) and:
- attribute usage via composed_of[0]: the base statement's count reflects how often the pipeline ran (not the sum of the merged statements)
- mark the merged components generated so they aren't re-emitted as orphan Query entities carrying duplicate usage
- resolve temp tables in the operations pass too, so operations reference the same composite Query as lineage (otherwise the raw id is referenced but never emitted, leaving a dangling query reference)

Adds a unit test and regenerates the affected sql_queries and bigquery_queries goldens.
Add an "Other Notable Changes" entry (#18130) describing the corrected queryUsageStatistics attribution for temp-table-fed queries across the SQL parsing connectors.
…osition Address review feedback: replace the positional composed_of list (element [0] = base by convention) with an explicit QueryComposition(base, others) dataclass, and move the usage-key selection into a QueryMetadata.usage_query_id property. No behavior change - goldens and tests are unchanged.
sgomezvillamor
left a comment
Approving — the fix is correct, well-scoped, and well-documented. The root cause (usage keyed by raw fingerprint but looked up by composite id) is real, and all three parts check out: usage_query_id redirects the count lookup to the base component while the emitted URN stays composite; marking composed_of.all_queries generated prevents orphan re-emission in _gen_remaining_queries; and resolving temp tables in the operations pass keeps operation.queries pointed at the same composite as lineage. The golden deltas match exactly. Both earlier review comments (QueryComposition dataclass, usage_query_id on the model) are addressed in 2cc583b.
Three non-blocking findings:
- 🟡 Double temp-table resolution. _resolve_query_with_temp_tables is now called a second time per lineage entry in _gen_operation_mcps (its cache is local per-call), so with both generate_lineage and generate_operations on, the recursive resolution runs twice per downstream/query, which is exactly the ELT shape this targets. Consider memoizing resolved composites (keyed by base query_id) or reusing the lineage-pass result.
- ℹ️ queryCount semantic. Using the base statement's count rather than the sum is a deliberate, documented choice; just confirming it's intended for sessions where base vs. temp-loaders ran different numbers of times.
- ℹ️ Test coverage. The new unit test runs with generate_operations=False, so the operations re-pointing is covered only by the integration goldens. Optional adds: assert no orphan raw-statement Query entities are emitted, and a base-executed-twice case to lock in the count semantic.
CI note: the only failing test, test_nifi_ingest_cluster, is unrelated (NiFi doesn't use SqlParsingAggregator; it's a network-flaky external-host golden) — a re-run should go green.
Generated by Claude Code
Address non-blocking review feedback on #18130 (test coverage). Add a unit test covering generate_operations=True - the operation.queries aspect references the composite Query, matching lineage - and a base-executed-twice case that locks the base-count-not-sum queryCount semantic. Also assert no orphan raw-statement Query entities are emitted. (The suggested memoization of temp-table resolution was evaluated and intentionally not added: caching composites holds QueryMetadata proportional to the workload, which cuts against the aggregator's FileBackedDict memory model, while only saving cheap re-traversal of already-parsed queries that occurs solely when generate_operations and generate_lineage are both enabled.)
Thanks! Addressed the non-blocking findings in
The commit is test-only; the source is byte-identical to the approved state.
Summary
Query usage statistics were recorded but never emitted for temp-table-fed queries — the common ELT/dbt shape. This fixes the attribution so per-query popularity lands on the composite Query entity that lineage actually references.
Root cause
SqlParsingAggregator keys query usage counts by raw statement fingerprint, but a query fed by temp tables is emitted as a Query entity under a composite fingerprint (temp-table resolution merges the staging statements into one logical query). Emission looked up usage by the composite id, which is never present in the usage store, so the lookup always missed.

Consequences:
- Composite queries were emitted with lineage but no queryUsageStatistics.
- The _gen_remaining_queries pass then either dropped the usage entirely (when is_allowed_table filtered the raw temp upstreams, e.g. Redshift, where reports showed num_query_usage_stats_generated: 0) or emitted it on orphan raw-fingerprint Query URNs that no lineage edge references.

Non-temp queries were unaffected (the fast path keeps the raw id), which is why this went unnoticed.
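The miss described above can be illustrated with a minimal sketch that uses a plain dictionary in place of the aggregator's real usage store. The fingerprints, the composite id, and the lookup_usage helper are hypothetical names chosen for illustration only; they are not taken from the DataHub source.

```python
# Hypothetical usage store: counts are keyed by RAW statement fingerprint.
usage_by_fingerprint = {
    "raw_create_tmp": 1,     # e.g. CREATE TEMP TABLE staging AS ...
    "raw_insert_target": 1,  # e.g. INSERT INTO target SELECT ... FROM staging
}

# Temp-table resolution merges both statements into one logical query,
# emitted under a composite id that was never recorded in the usage store.
composite_id = "composite_abc123"


def lookup_usage(query_id: str) -> int:
    """Return the recorded execution count, or 0 when the id is unknown."""
    return usage_by_fingerprint.get(query_id, 0)


# Pre-fix behaviour: the lookup uses the composite id and finds nothing.
count_by_composite = lookup_usage(composite_id)

# The usage is there, just filed under the base statement's raw fingerprint.
count_by_base = lookup_usage("raw_insert_target")
```

In this toy model the composite lookup returns zero even though the pipeline ran once, which mirrors the silent loss of per-query popularity.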
Fix
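Carry the raw component fingerprints on the composite (composed_of, base statement first) and:
- attribute usage via composed_of[0]: the base (downstream-writing) statement's count reflects how often the pipeline ran, not the sum of the merged statements;
- mark the merged components generated so they aren't re-emitted as orphan Query entities carrying duplicate usage;
- resolve temp tables in the operations pass too, so operations reference the same composite Query as lineage.
Behavior change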
For temp-table-fed targets the graph now has one canonical (composite) Query entity carrying both lineage and usage, instead of the composite plus orphan raw-statement Query entities. Operations and lineage now agree on which Query they reference.
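As a rough sketch of the shape this takes after the review refactor, the following models QueryComposition(base, others), the all_queries accessor, and the usage_query_id property named in this PR. Only those names come from the conversation; the field types, the trimmed-down QueryMetadata, and the sample ids and counts are assumptions made for illustration, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class QueryComposition:
    """Raw statement fingerprints merged into one composite query (sketch)."""

    base: str                      # the downstream-writing statement
    others: Tuple[str, ...] = ()   # temp-table staging statements

    @property
    def all_queries(self) -> List[str]:
        return [self.base, *self.others]


@dataclass
class QueryMetadata:
    """Heavily reduced stand-in for the real model (assumption)."""

    query_id: str
    composed_of: Optional[QueryComposition] = None

    @property
    def usage_query_id(self) -> str:
        # Composite queries look up usage under the base statement's raw id;
        # plain queries keep using their own id (the fast path).
        return self.composed_of.base if self.composed_of else self.query_id


# Hypothetical usage counts: the base ran twice, the staging statement five times.
usage_counts: Dict[str, int] = {"raw_insert_target": 2, "raw_create_tmp": 5}

composition = QueryComposition(base="raw_insert_target", others=("raw_create_tmp",))
composite = QueryMetadata(query_id="composite_abc123", composed_of=composition)
plain = QueryMetadata(query_id="raw_select")

# The emitted entity keeps the composite id; only the count lookup is redirected.
composite_count = usage_counts.get(composite.usage_query_id, 0)

# Components marked as generated so a later pass does not re-emit them as orphans.
already_generated = set(composition.all_queries)
```

With these sample counts the composite reports 2, the base statement's count, rather than 7, the sum across merged statements.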
Testing
- New unit test: test_query_usage_stats_attributed_to_temp_table_composite_query (asserts usage lands only on the composite, with the base-statement count).
- Regenerated goldens: sql_queries/session-temp-tables, sql_queries/temp-table-patterns, bigquery_queries_mcps_golden.
- tests/unit/sql_parsing + tests/unit/redshift + tests/integration/sql_queries + bigquery_queries → all pass (562 passed, 3 skipped). Dangling-reference audit across regenerated goldens: none.
- ruff + mypy clean.
Checklist
Golden file changes
Three goldens were regenerated to reflect the corrected behavior. Every change falls into the same three categories, and the result was audited to contain no dangling query references: every operation.queries and lineage query still resolves to an emitted Query entity, and every removed entity was confirmed to be cleanly subsumed by a composite (not referenced anywhere).
- tests/integration/sql_queries/golden/session-temp-tables.json: queryUsageStatistics (queryCount=1); the audit_log and user_summary operation aspects were re-pointed from raw statement ids to their composites
- tests/integration/sql_queries/golden/temp-table-patterns.json: processed_events operation aspect was re-pointed from a raw statement id to its composite. This scenario has no query-usage config, so it isolates the operation-reference part of the fix (no queryUsageStatistics changes)
- tests/integration/bigquery_v2/bigquery_queries_mcps_golden.json: composite_29c3… gained queryUsageStatistics (queryCount=1); the lineage_from_tmp_table operation aspect was re-pointed to the composite

Why: a temp-table-fed load is represented by a single composite Query that lineage points to. Before this change, the raw component statements were also emitted as standalone Query entities (carrying the usage that belonged on the composite), and operation aspects referenced those raw ids. The regenerated goldens therefore (1) drop the redundant raw Query entities, (2) move queryUsageStatistics onto the composite, and (3) re-point operation.queries at the composite so operations agree with lineage.

queryCount on the composite reflects the base (downstream-writing) statement's execution count, i.e. how often the load ran, not the sum of the merged statements.
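One way to picture the dangling-reference audit mentioned above is the sketch below. It assumes a simplified record layout (entityType, entityUrn, aspectName, and an aspect.json payload with queries on operation aspects and query on lineage upstreams); that layout, the find_dangling_query_refs helper, and the sample URNs are assumptions for illustration, and the sketch does not cover every place a query can be referenced (for example fine-grained lineage).

```python
from typing import Any, Dict, Iterable, Set


def find_dangling_query_refs(mcps: Iterable[Dict[str, Any]]) -> Set[str]:
    """Return query URNs that are referenced but never emitted as entities."""
    emitted: Set[str] = set()
    referenced: Set[str] = set()
    for mcp in mcps:
        aspect = mcp.get("aspect", {}).get("json", {})
        if mcp.get("entityType") == "query":
            emitted.add(mcp["entityUrn"])
        elif mcp.get("aspectName") == "operation":
            referenced.update(aspect.get("queries", []))
        elif mcp.get("aspectName") == "upstreamLineage":
            for upstream in aspect.get("upstreams", []):
                if upstream.get("query"):
                    referenced.add(upstream["query"])
    return referenced - emitted


# Hypothetical pre-fix output: lineage points at the composite, but the
# operation aspect still points at a raw statement id that is never emitted.
records = [
    {
        "entityType": "query",
        "entityUrn": "urn:li:query:composite_abc123",
        "aspectName": "queryProperties",
        "aspect": {"json": {}},
    },
    {
        "entityType": "dataset",
        "entityUrn": "urn:li:dataset:target",
        "aspectName": "upstreamLineage",
        "aspect": {"json": {"upstreams": [
            {"dataset": "urn:li:dataset:source",
             "query": "urn:li:query:composite_abc123"},
        ]}},
    },
    {
        "entityType": "dataset",
        "entityUrn": "urn:li:dataset:target",
        "aspectName": "operation",
        "aspect": {"json": {"queries": ["urn:li:query:raw_insert_target"]}},
    },
]

dangling = find_dangling_query_refs(records)
```

Run against the sample records, the helper flags the raw statement URN; once the operation aspect is re-pointed at the composite, the result would be empty.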