fix(ingestion): prevent oversized query metadata from failing runs#18102
fix(ingestion): prevent oversized query metadata from failing runs#18102alfiyas-datahub wants to merge 6 commits into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
4455560 to
3ede8d2
Compare
3ede8d2 to
4400c07
Compare
Connector Tests ResultsAll connector tests passed for commit To skip connector tests, add the Autogenerated by the connector-tests CI pipeline. |
| # A composite query concatenates every statement in a temp-table chain with no | ||
| # size bound; a session that writes to a temp table many times can grow the | ||
| # merged text to hundreds of MB and overflow the GMS payload limit. Cap it here. | ||
| MAX_COMPOSITE_QUERY_STATEMENT_CHARS = int( |
There was a problem hiding this comment.
Two things about this constant:
- Diverges from the pattern it claims to follow. The PR body says it follows the
MAX_UPSTREAM_TABLES_COUNTpattern, but that sibling (andMAX_FINEGRAINEDLINEAGE_COUNT) is a plain hardcoded constant with no env override. This is the onlyos.environ.get(...)in the file — env-driven config elsewhere goes throughdatahub/configuration/env_vars.pyaccessors (e.g.get_sql_agg_skip_joins()). - The env-var path is effectively untested. Because it's read at import time into a module constant, the new test has to
monkeypatch.setattr(agg_module, "MAX_COMPOSITE_QUERY_STATEMENT_CHARS", cap)rather than set the env var — so theos.environ/int()path never runs in tests (and would raiseValueErrorat import on a malformed value).
Suggestion: either drop the override and make it a plain constant like its siblings (simplest, and this cap is unlikely to need per-deployment tuning), or add a get_max_composite_query_statement_chars() accessor in env_vars.py and read it lazily.
There was a problem hiding this comment.
Why is this 5MB default so much lower than the usual 16MB aspect size limit? Shouldn't they be more aligned? If there is any reason, it should be documented here.
There was a problem hiding this comment.
The system has two possible size limits: a 16 MB aspect validation limit and a 5 MB Kafka message limit. The 16 MB validator is disabled by default in the open-source deployment, while the 5 MB Kafka limit is always enforced. This means that if a metadata payload exceeds 5 MB, Kafka rejects it with a RecordTooLargeException before it ever reaches the aspect validator. Therefore, the truncation cap is intentionally set to 5 MB, since that is the effective limit that prevents ingestion failures. I will add a comment documenting this on the constant.
There was a problem hiding this comment.
updated with a plain constant, no env override; removed import os; added comment explaining why 5 MB (Kafka max.request.size, not the 16 MB aspect validator)
There was a problem hiding this comment.
Still concerned about the 5MB cap. Are we applying this to Query statements for non-composite queries too?
All the trimming we do is in the ensure_aspect_size workunit processor at 16MB. Why should composite queries have a more restrictive limit for their statements?
My proposal would be not to do any trimming in sql parsing and only do that in ensure_aspect_size workunit processor.
Summary
Snowflake (and other SQL connectors) ingestion runs were failing with repeated GMS
400 Cannot parse request entityerrors onqueryPropertiesaspects, surfacing in the UI only as "An unexpected issue occurred". Root cause is two compounding bugs in shared ingestion code:Unbounded composite query text — When queries write through temp tables,
SqlParsingAggregatormerges the whole chain into one syntheticcomposite_*query and concatenates every constituent statement (";\n\n".join(...)) with no size cap. A pipeline that writes to a temp table thousands of times in one session (e.g. Hightouch) can balloon the merged statement to ~140MB.Broken size guard (unit mismatch) —
EnsureAspectSizeProcessor.ensure_query_properties_size, the guard that should truncate oversized statements before send, computed the required reduction in serialized JSON bytes but compared it against the raw character count of the statement (len(statement.value)). JSON escaping (\n,\", control chars, non-ASCII) inflates the serialized size past the raw length, so the check failed, the guard logged "Cannot truncate..." and emitted the oversized aspect anyway. GMS rejected it; after enough rejections the run exited 1 / FAILED.This was first reported on Snowflake with
include_queriesenabled, but applies to any SQL connector that emits query entities.Changes
sql_parsing_aggregator.py: cap composite query statement text atMAX_COMPOSITE_QUERY_STATEMENT_CHARS(default 5MB, overridable viaDATAHUB_MAX_COMPOSITE_QUERY_STATEMENT_CHARS), following the existingMAX_UPSTREAM_TABLES_COUNTpattern; record truncations inSqlAggregatorReport.num_composite_queries_truncated_due_to_large_size.ensure_aspect_size.py: rewriteensure_query_properties_sizeto measure serialized size correctly via binary search over the statement prefix (escape-inflation safe), fall back to droppingname/descriptionwhen the non-statement overhead alone exceeds the limit, and always emit a structuredsource_report.warning+ record the truncation (previously only alogger.warningthat never reached the run report).updating-datahub.md: note the fix under Other Notable Changes.