fix(ingestion/redshift): handle COPY CREDENTIALS in query fingerprinting#18141
fix(ingestion/redshift): handle COPY CREDENTIALS in query fingerprinting#18141treff7es wants to merge 1 commit into
Conversation
Redshift `COPY ... CREDENTIALS '<secret>'` crashed query fingerprinting with `TypeError: 'Placeholder' object is not iterable`, which aborted usage extraction. Two defects combined to make it fatal: 1. generalize_query replaced the credentials Literal with a Placeholder. sqlglot's credentials_sql generator dispatches on isinstance(child, Literal) to pick the Redshift scalar form over the Snowflake key=value list; a Placeholder flips it onto the list path, which iterates the node and raises. Redact the credentials literal to a constant string instead: serialization stays on the scalar path, the secret never lands in the generalized query text, and fingerprints stay independent of the credential value. 2. get_query_fingerprint_debug only caught ValueError/SqlglotError, so the TypeError propagated and killed the pipeline. Widen the catch to include TypeError so fingerprinting degrades to the raw-text fallback instead.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Connector Tests ResultsAll connector tests passed for commit To skip connector tests, add the Autogenerated by the connector-tests CI pipeline. |
askumar27
left a comment
There was a problem hiding this comment.
LGTM — approving. The core fix is minimal, correct, and well-commented, and the root-cause analysis checks out against sqlglot 30.8.0 (generator.py:5166 — credentials_sql dispatches the Redshift scalar vs. Snowflake key=value form on isinstance(cred_expr, exp.Literal), so replacing the credentials literal with a Placeholder reroutes serialization onto the iterating path and raises TypeError). Keeping the node a Literal (redacted constant) is the safer of the two possible fixes.
Verified:
- Secret never reaches the stored/generalized text (redaction stays on the scalar path); test confirms
EXAMPLESECRETabsence and fingerprint independence. - The widened
exceptis a specific tuple, not a blanket catch, and still re-raises for non-strexpressions; the fallback setsexpression_sql = None, so the raw secret is only hashed, never stored. - No backward-compat fingerprint concern — these COPY queries previously aborted the whole batch, so no prior fingerprints exist to diverge from.
Left a few inline notes — one worth addressing before merge (late import in a test, which OSS guidelines flag), the rest optional.
| # Defense in depth: if generalization raises an unexpected error (e.g. a | ||
| # sqlglot generator bug on an exotic statement), fingerprinting must fall | ||
| # back to the raw text rather than propagating and killing the pipeline. | ||
| import datahub.sql_parsing.sqlglot_utils as sqlglot_utils |
There was a problem hiding this comment.
WARNING — late import in test body. Hoisting test imports to the top of the file is an OSS convention (late imports in tests are treated as an auto-rejection blocker). This is a one-line fix: add from datahub.sql_parsing import sqlglot_utils to the existing import block at the top of the file, then drop this inline import. Monkeypatching still works because get_query_fingerprint_debug resolves generalize_query via the module global at call time.
| import datahub.sql_parsing.sqlglot_utils as sqlglot_utils |
(remove this line; add the import at the top)
| monkeypatch.setattr(sqlglot_utils, "generalize_query", _boom) | ||
|
|
||
| fingerprint = get_query_fingerprint("SELECT * FROM my_table", "redshift") | ||
| assert fingerprint |
There was a problem hiding this comment.
Suggestion (non-blocking). This proves a fingerprint is returned, but not that it came from the raw-text fallback — a bug where _boom never fired would still pass, since any non-empty hash satisfies assert fingerprint. Consider asserting it equals generate_hash("SELECT * FROM my_table") to pin the fallback behavior rather than just "didn't raise".
|
|
||
| # The secret must never leak into the generalized (stored) query text. | ||
| assert "EXAMPLESECRET" not in generalized | ||
| assert "CREDENTIALS" in generalized |
There was a problem hiding this comment.
nit (non-blocking): slightly redundant with the secret-absence check above, but harmless — leave it. The fingerprint-equality assertion below is the strongest part of this test.
Summary
Redshift
COPY ... CREDENTIALS '<secret>'statements (seen on thestl_scanusage path) crashed query fingerprinting withTypeError: 'Placeholder' object is not iterable, which aborted the entire usage-extraction batch. Two independent defects combined to make this fatal; this PR fixes both.Defect 1 —
generalize_querycorrupts COPY credentials._strip_expressionreplaced every literal with aPlaceholder. For a Redshift COPY, the credentials are parsed as a singleLiteraldirectly under aCredentialsnode. sqlglot'scredentials_sqlgenerator dispatches onisinstance(child, Literal)to choose the Redshift scalar form (CREDENTIALS '<value>') over the Snowflake key=value list form. APlaceholderflips that check onto the Snowflake list path, which iterates the credentials node — and aPlaceholderisn't iterable, so it raises.Fix: when the literal is the
credentialschild of aCredentialsnode, redact it to a constant literal instead of a placeholder. Serialization stays on the scalar path, the secret never lands in the generalized (stored) query text, and fingerprints stay independent of the credential value. Snowflake key=value credentials and RedshiftIAM_ROLEforms are unaffected.Defect 2 — the fingerprint safety net was too narrow.
get_query_fingerprint_debugonly caughtValueError/SqlglotError, so theTypeErrorpropagated past the usage extractor and killed the pipeline. Widened the catch to includeTypeErrorso fingerprinting degrades to its raw-text fallback rather than aborting ingestion.Testing
test_redshift_copy_credentials_generalization— COPY with credentials generalizes without raising, the secret is absent from the output, and two COPYs differing only in credentials share a fingerprint.test_get_query_fingerprint_survives_generalization_error— a generalizationTypeErrorfalls back to raw-text fingerprinting instead of propagating.tests/unit/sql_parsing/suite passes (445 tests); ruff + mypy clean on the changed files.Checklist