fix(ingestion): prevent oversized query metadata from failing runs by alfiyas-datahub · Pull Request #18102 · datahub-project/datahub

alfiyas-datahub · 2026-06-30T14:21:01Z

Summary

Snowflake (and other SQL connector) ingestion runs were failing with repeated GMS 400 "Cannot parse request entity" errors on queryProperties aspects, surfacing in the UI only as "An unexpected issue occurred". The root cause is two compounding bugs in shared ingestion code:

Unbounded composite query text — When queries write through temp tables, SqlParsingAggregator merges the whole chain into one synthetic composite_* query and concatenates every constituent statement (";\n\n".join(...)) with no size cap. A pipeline that writes to a temp table thousands of times in one session (e.g. Hightouch) can balloon the merged statement to ~140MB.
Broken size guard (unit mismatch) — EnsureAspectSizeProcessor.ensure_query_properties_size, the guard that should truncate oversized statements before send, computed the required reduction in serialized JSON bytes but compared it against the raw character count of the statement (len(statement.value)). JSON escaping (\n, \", control chars, non-ASCII) inflates the serialized size past the raw length, so the check failed, the guard logged "Cannot truncate..." and emitted the oversized aspect anyway. GMS rejected it; after enough rejections the run exited 1 / FAILED.

This was first reported on Snowflake with include_queries enabled, but applies to any SQL connector that emits query entities.
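
To make the unit mismatch concrete, here is a minimal sketch of how comparing a byte count against a character count can make a guard give up on a statement that is perfectly truncatable. The function names are hypothetical and this is not the actual EnsureAspectSizeProcessor code; it only assumes the check had roughly the shape described above.

```python
import json


def serialized_size(text: str) -> int:
    """Bytes the string occupies once JSON-encoded (quotes and escapes included)."""
    return len(json.dumps(text).encode("utf-8"))


def naive_guard_can_truncate(statement: str, limit_bytes: int) -> bool:
    """Hypothetical reconstruction of the mismatched check.

    The reduction is measured in serialized bytes, but it is compared
    against the raw character count of the statement.
    """
    reduction_needed_bytes = serialized_size(statement) - limit_bytes
    return reduction_needed_bytes < len(statement)


# Every newline becomes the two-byte escape "\n" once serialized.
statement = "\n" * 100
limit_bytes = 50

raw_chars = len(statement)                  # 100 characters
encoded_bytes = serialized_size(statement)  # 202 bytes: 2 quotes + 100 * 2

# The guard needs to shed 152 bytes, sees only 100 characters, and gives up.
guard_says_ok = naive_guard_can_truncate(statement, limit_bytes)

# Yet a 24-character prefix serializes to exactly 50 bytes and would have fit.
fits_after_truncation = serialized_size(statement[:24]) <= limit_bytes
```

In this toy case the guard reports that it cannot truncate even though a shorter prefix fits, which is the failure mode described in the second bullet.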

Changes

sql_parsing_aggregator.py: cap composite query statement text at MAX_COMPOSITE_QUERY_STATEMENT_CHARS (default 5MB, overridable via DATAHUB_MAX_COMPOSITE_QUERY_STATEMENT_CHARS), following the existing MAX_UPSTREAM_TABLES_COUNT pattern; record truncations in SqlAggregatorReport.num_composite_queries_truncated_due_to_large_size.
ensure_aspect_size.py: rewrite ensure_query_properties_size to measure serialized size correctly via binary search over the statement prefix (escape-inflation safe), fall back to dropping name/description when the non-statement overhead alone exceeds the limit, and always emit a structured source_report.warning + record the truncation (previously only a logger.warning that never reached the run report).
updating-datahub.md: note the fix under Other Notable Changes.
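
The binary search over the statement prefix could look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: longest_fitting_prefix and serialized_size are hypothetical names, and the real processor measures the whole serialized aspect rather than the statement alone.

```python
import json


def serialized_size(text: str) -> int:
    """Bytes the string occupies once JSON-encoded (quotes and escapes included)."""
    return len(json.dumps(text).encode("utf-8"))


def longest_fitting_prefix(statement: str, budget_bytes: int) -> str:
    """Return the longest prefix whose serialized size is within budget_bytes.

    Serialized size never shrinks as the prefix grows, so a binary search
    over the prefix length is safe regardless of how much escaping inflates
    individual characters.
    """
    if serialized_size(statement) <= budget_bytes:
        return statement

    # lo is a length treated as fitting (the empty prefix is the floor);
    # hi is a length known to be too large.
    lo, hi = 0, len(statement)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if serialized_size(statement[:mid]) <= budget_bytes:
            lo = mid
        else:
            hi = mid
    return statement[:lo]


# A statement full of characters that need escaping.
statement = 'a\n"b"' * 1000
truncated = longest_fitting_prefix(statement, budget_bytes=100)
```

Because the size is measured on the serialized form at every step, the result is within budget by construction, and it takes only about log2(len(statement)) serializations to find it.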

codecov · 2026-06-30T14:24:22Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

datahub-connector-tests · 2026-07-01T11:49:40Z

Connector Tests Results

All connector tests passed for commit 9bca8cf

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

treff7es · 2026-07-02T06:31:57Z

+# A composite query concatenates every statement in a temp-table chain with no
+# size bound; a session that writes to a temp table many times can grow the
+# merged text to hundreds of MB and overflow the GMS payload limit. Cap it here.
+MAX_COMPOSITE_QUERY_STATEMENT_CHARS = int(


Two things about this constant:

Diverges from the pattern it claims to follow. The PR body says it follows the MAX_UPSTREAM_TABLES_COUNT pattern, but that sibling (and MAX_FINEGRAINEDLINEAGE_COUNT) is a plain hardcoded constant with no env override. This is the only os.environ.get(...) in the file — env-driven config elsewhere goes through datahub/configuration/env_vars.py accessors (e.g. get_sql_agg_skip_joins()).

The env-var path is effectively untested. Because it's read at import time into a module constant, the new test has to monkeypatch.setattr(agg_module, "MAX_COMPOSITE_QUERY_STATEMENT_CHARS", cap) rather than set the env var — so the os.environ/int() path never runs in tests (and would raise ValueError at import on a malformed value).

Suggestion: either drop the override and make it a plain constant like its siblings (simplest, and this cap is unlikely to need per-deployment tuning), or add a get_max_composite_query_statement_chars() accessor in env_vars.py and read it lazily.
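
For reference, the lazy-accessor option might look something like this sketch. The accessor name and env var come from this thread; the 5 * 1024 * 1024 default and the choice to fall back to the default on a malformed or non-positive value are assumptions for illustration, and the existing helpers in env_vars.py may follow a different convention.

```python
import os

# Assumed default: "5MB" read as 5 * 1024 * 1024 characters.
DEFAULT_MAX_COMPOSITE_QUERY_STATEMENT_CHARS = 5 * 1024 * 1024
ENV_VAR = "DATAHUB_MAX_COMPOSITE_QUERY_STATEMENT_CHARS"


def get_max_composite_query_statement_chars() -> int:
    """Read the cap at call time so tests can exercise the env-var path.

    A malformed or non-positive value falls back to the default instead of
    raising ValueError at import time (an assumed policy, not the PR's).
    """
    raw = os.environ.get(ENV_VAR)
    if raw is None:
        return DEFAULT_MAX_COMPOSITE_QUERY_STATEMENT_CHARS
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_MAX_COMPOSITE_QUERY_STATEMENT_CHARS
    return value if value > 0 else DEFAULT_MAX_COMPOSITE_QUERY_STATEMENT_CHARS
```

With a lazy read like this, a test can set the variable with monkeypatch.setenv and cover the parsing path directly rather than patching a module constant.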

sgomezvillamor · Jul 2, 2026

Why is this 5MB default so much lower than the usual 16MB aspect size limit? Shouldn't the two be more closely aligned? If there is a reason, it should be documented here.

alfiyas-datahub · Jul 2, 2026 (edited)

The system has two possible size limits: a 16 MB aspect validation limit and a 5 MB Kafka message limit. The 16 MB validator is disabled by default in the open-source deployment, while the 5 MB Kafka limit is always enforced. This means that if a metadata payload exceeds 5 MB, Kafka rejects it with a RecordTooLargeException before it ever reaches the aspect validator. Therefore, the truncation cap is intentionally set to 5 MB, since that is the effective limit that prevents ingestion failures. I will add a comment documenting this on the constant.

alfiyas-datahub · Jul 2, 2026

Updated to a plain constant with no env override; removed import os; added a comment explaining why 5 MB (Kafka max.request.size, not the 16 MB aspect validator).
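
A rough sketch of what a plain-constant cap on the merged text could look like, for readers following the thread. join_with_cap is a hypothetical helper, the 5 * 1024 * 1024 value is an assumed reading of "5 MB", and the actual change in sql_parsing_aggregator.py may be structured differently.

```python
from typing import Iterable, List, Tuple

# Assumed value for the plain constant; the thread only says "5 MB".
MAX_COMPOSITE_QUERY_STATEMENT_CHARS = 5 * 1024 * 1024
SEPARATOR = ";\n\n"


def join_with_cap(
    statements: Iterable[str],
    cap: int = MAX_COMPOSITE_QUERY_STATEMENT_CHARS,
) -> Tuple[str, bool]:
    """Join statements with the separator, stopping once cap characters are used.

    Returns the merged text and a flag saying whether anything was cut off,
    which a caller could use to bump a truncation counter in its report.
    """
    parts: List[str] = []
    used = 0
    for statement in statements:
        piece = statement if not parts else SEPARATOR + statement
        remaining = cap - used
        if len(piece) > remaining:
            # Keep only what still fits and stop, so the full chain is
            # never materialized in memory.
            parts.append(piece[:remaining])
            return "".join(parts), True
        parts.append(piece)
        used += len(piece)
    return "".join(parts), False
```

Accumulating piece by piece avoids building the full multi-megabyte string just to slice it afterwards. Note that the cap here counts characters, so the serialized size of the result can still be larger once escaping is applied.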

sgomezvillamor · Jul 2, 2026

I'm still concerned about the 5MB cap. Are we applying this to Query statements for non-composite queries too?

All the trimming we do is in the ensure_aspect_size workunit processor at 16MB. Why should composite queries have a more restrictive limit for their statements?

My proposal would be not to do any trimming in sql parsing and only do that in ensure_aspect_size workunit processor.

github-actions Bot added the ingestion and docs labels Jun 30, 2026

github-actions Bot deployed to datahub-wheels (Preview) June 30, 2026 14:23 View deployment

github-actions Bot deployed to datahub-wheels (Preview) June 30, 2026 14:33 View deployment

alfiyas-datahub force-pushed the fix/oversized-query-properties-truncation branch from 4455560 to 3ede8d2 Compare June 30, 2026 14:37

alfiyas-datahub added 2 commits June 30, 2026 20:08

fix(ingestion): prevent oversized query metadata from failing runs

76e4493

style(ingestion): tighten comments on query size fixes

4400c07

alfiyas-datahub force-pushed the fix/oversized-query-properties-truncation branch from 3ede8d2 to 4400c07 Compare June 30, 2026 14:39

github-actions Bot deployed to datahub-wheels (Preview) June 30, 2026 14:41 View deployment

vercel Bot deployed to Preview June 30, 2026 14:53 View deployment

maggiehays added the needs-review label Jun 30, 2026

alfiyas-datahub added 2 commits July 1, 2026 16:30

Merge branch 'master' into fix/oversized-query-properties-truncation

bff25a2

fix(ingestion): add type annotation to composite query cap test

b6ce4f6

github-actions Bot deployed to datahub-wheels (Preview) July 1, 2026 11:05 View deployment

vercel Bot deployed to Preview July 1, 2026 11:17 View deployment

treff7es reviewed Jul 2, 2026

Merge branch 'master' into fix/oversized-query-properties-truncation

3f80472

sgomezvillamor reviewed Jul 2, 2026

Comment thread docs/how/updating-datahub.md

maggiehays added the pending-submitter-response label and removed the needs-review label Jul 2, 2026

Merge branch 'master' into fix/oversized-query-properties-truncation

9bca8cf

github-actions Bot deployed to datahub-wheels (Preview) July 2, 2026 13:14 View deployment

vercel Bot deployed to Preview July 2, 2026 13:26 View deployment

alfiyas-datahub wants to merge 6 commits into master from fix/oversized-query-properties-truncation

4 participants
