Skip to content

fix(ingestion): prevent oversized query metadata from failing runs#18102

Open
alfiyas-datahub wants to merge 6 commits into
masterfrom
fix/oversized-query-properties-truncation
Open

fix(ingestion): prevent oversized query metadata from failing runs#18102
alfiyas-datahub wants to merge 6 commits into
masterfrom
fix/oversized-query-properties-truncation

Conversation

@alfiyas-datahub

@alfiyas-datahub alfiyas-datahub commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Summary

Snowflake (and other SQL connectors) ingestion runs were failing with repeated GMS 400 Cannot parse request entity errors on queryProperties aspects, surfacing in the UI only as "An unexpected issue occurred". Root cause is two compounding bugs in shared ingestion code:

  1. Unbounded composite query text — When queries write through temp tables, SqlParsingAggregator merges the whole chain into one synthetic composite_* query and concatenates every constituent statement (";\n\n".join(...)) with no size cap. A pipeline that writes to a temp table thousands of times in one session (e.g. Hightouch) can balloon the merged statement to ~140MB.

  2. Broken size guard (unit mismatch)EnsureAspectSizeProcessor.ensure_query_properties_size, the guard that should truncate oversized statements before send, computed the required reduction in serialized JSON bytes but compared it against the raw character count of the statement (len(statement.value)). JSON escaping (\n, \", control chars, non-ASCII) inflates the serialized size past the raw length, so the check failed, the guard logged "Cannot truncate..." and emitted the oversized aspect anyway. GMS rejected it; after enough rejections the run exited 1 / FAILED.

This was first reported on Snowflake with include_queries enabled, but applies to any SQL connector that emits query entities.

Changes

  • sql_parsing_aggregator.py: cap composite query statement text at MAX_COMPOSITE_QUERY_STATEMENT_CHARS (default 5MB, overridable via DATAHUB_MAX_COMPOSITE_QUERY_STATEMENT_CHARS), following the existing MAX_UPSTREAM_TABLES_COUNT pattern; record truncations in SqlAggregatorReport.num_composite_queries_truncated_due_to_large_size.
  • ensure_aspect_size.py: rewrite ensure_query_properties_size to measure serialized size correctly via binary search over the statement prefix (escape-inflation safe), fall back to dropping name/description when the non-statement overhead alone exceeds the limit, and always emit a structured source_report.warning + record the truncation (previously only a logger.warning that never reached the run report).
  • updating-datahub.md: note the fix under Other Notable Changes.

@github-actions github-actions Bot added ingestion PR or Issue related to the ingestion of metadata docs Issues and Improvements to docs labels Jun 30, 2026
@codecov

codecov Bot commented Jun 30, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

@datahub-connector-tests

datahub-connector-tests Bot commented Jul 1, 2026

Copy link
Copy Markdown

Connector Tests Results

All connector tests passed for commit 9bca8cf

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

# A composite query concatenates every statement in a temp-table chain with no
# size bound; a session that writes to a temp table many times can grow the
# merged text to hundreds of MB and overflow the GMS payload limit. Cap it here.
MAX_COMPOSITE_QUERY_STATEMENT_CHARS = int(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things about this constant:

  1. Diverges from the pattern it claims to follow. The PR body says it follows the MAX_UPSTREAM_TABLES_COUNT pattern, but that sibling (and MAX_FINEGRAINEDLINEAGE_COUNT) is a plain hardcoded constant with no env override. This is the only os.environ.get(...) in the file — env-driven config elsewhere goes through datahub/configuration/env_vars.py accessors (e.g. get_sql_agg_skip_joins()).
  2. The env-var path is effectively untested. Because it's read at import time into a module constant, the new test has to monkeypatch.setattr(agg_module, "MAX_COMPOSITE_QUERY_STATEMENT_CHARS", cap) rather than set the env var — so the os.environ/int() path never runs in tests (and would raise ValueError at import on a malformed value).

Suggestion: either drop the override and make it a plain constant like its siblings (simplest, and this cap is unlikely to need per-deployment tuning), or add a get_max_composite_query_statement_chars() accessor in env_vars.py and read it lazily.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this 5MB default so much lower than the usual 16MB aspect size limit? Shouldn't they be more aligned? If there is any reason, it should be documented here.

@alfiyas-datahub alfiyas-datahub Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The system has two possible size limits: a 16 MB aspect validation limit and a 5 MB Kafka message limit. The 16 MB validator is disabled by default in the open-source deployment, while the 5 MB Kafka limit is always enforced. This means that if a metadata payload exceeds 5 MB, Kafka rejects it with a RecordTooLargeException before it ever reaches the aspect validator. Therefore, the truncation cap is intentionally set to 5 MB, since that is the effective limit that prevents ingestion failures. I will add a comment documenting this on the constant.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated with a plain constant, no env override; removed import os; added comment explaining why 5 MB (Kafka max.request.size, not the 16 MB aspect validator)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still concerned about the 5MB cap. Are we applying this to Query statements for non-composite queries too?

All the trimming we do is in the ensure_aspect_size workunit processor at 16MB. Why should composite queries have a more restrictive limit for their statements?

My proposal would be not to do any trimming in sql parsing and only do that in ensure_aspect_size workunit processor.

Comment thread docs/how/updating-datahub.md
@maggiehays maggiehays added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Issues and Improvements to docs ingestion PR or Issue related to the ingestion of metadata pending-submitter-response Issue/request has been reviewed but requires a response from the submitter

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants