Skip to content

fix(ingestion/redshift): handle COPY CREDENTIALS in query fingerprinting#18141

Open
treff7es wants to merge 1 commit into
masterfrom
fix-redshift-copy-credentials-fingerprint
Open

fix(ingestion/redshift): handle COPY CREDENTIALS in query fingerprinting#18141
treff7es wants to merge 1 commit into
masterfrom
fix-redshift-copy-credentials-fingerprint

Conversation

@treff7es

@treff7es treff7es commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Redshift COPY ... CREDENTIALS '<secret>' statements (seen on the stl_scan usage path) crashed query fingerprinting with TypeError: 'Placeholder' object is not iterable, which aborted the entire usage-extraction batch. Two independent defects combined to make this fatal; this PR fixes both.

Defect 1 — generalize_query corrupts COPY credentials.
_strip_expression replaced every literal with a Placeholder. For a Redshift COPY, the credentials are parsed as a single Literal directly under a Credentials node. sqlglot's credentials_sql generator dispatches on isinstance(child, Literal) to choose the Redshift scalar form (CREDENTIALS '<value>') over the Snowflake key=value list form. A Placeholder flips that check onto the Snowflake list path, which iterates the credentials node — and a Placeholder isn't iterable, so it raises.

Fix: when the literal is the credentials child of a Credentials node, redact it to a constant literal instead of a placeholder. Serialization stays on the scalar path, the secret never lands in the generalized (stored) query text, and fingerprints stay independent of the credential value. Snowflake key=value credentials and Redshift IAM_ROLE forms are unaffected.

Defect 2 — the fingerprint safety net was too narrow.
get_query_fingerprint_debug only caught ValueError / SqlglotError, so the TypeError propagated past the usage extractor and killed the pipeline. Widened the catch to include TypeError so fingerprinting degrades to its raw-text fallback rather than aborting ingestion.

Testing

  • Added test_redshift_copy_credentials_generalization — COPY with credentials generalizes without raising, the secret is absent from the output, and two COPYs differing only in credentials share a fingerprint.
  • Added test_get_query_fingerprint_survives_generalization_error — a generalization TypeError falls back to raw-text fingerprinting instead of propagating.
  • Full tests/unit/sql_parsing/ suite passes (445 tests); ruff + mypy clean on the changed files.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Tests added/updated
  • Docs updated (n/a — internal bugfix)

Redshift `COPY ... CREDENTIALS '<secret>'` crashed query fingerprinting
with `TypeError: 'Placeholder' object is not iterable`, which aborted
usage extraction. Two defects combined to make it fatal:

1. generalize_query replaced the credentials Literal with a Placeholder.
   sqlglot's credentials_sql generator dispatches on isinstance(child,
   Literal) to pick the Redshift scalar form over the Snowflake key=value
   list; a Placeholder flips it onto the list path, which iterates the
   node and raises. Redact the credentials literal to a constant string
   instead: serialization stays on the scalar path, the secret never
   lands in the generalized query text, and fingerprints stay independent
   of the credential value.

2. get_query_fingerprint_debug only caught ValueError/SqlglotError, so the
   TypeError propagated and killed the pipeline. Widen the catch to include
   TypeError so fingerprinting degrades to the raw-text fallback instead.
@github-actions github-actions Bot added the ingestion PR or Issue related to the ingestion of metadata label Jul 2, 2026
@codecov

codecov Bot commented Jul 2, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

@datahub-connector-tests

Copy link
Copy Markdown

Connector Tests Results

All connector tests passed for commit 415db2b

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

@askumar27 askumar27 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — approving. The core fix is minimal, correct, and well-commented, and the root-cause analysis checks out against sqlglot 30.8.0 (generator.py:5166credentials_sql dispatches the Redshift scalar vs. Snowflake key=value form on isinstance(cred_expr, exp.Literal), so replacing the credentials literal with a Placeholder reroutes serialization onto the iterating path and raises TypeError). Keeping the node a Literal (redacted constant) is the safer of the two possible fixes.

Verified:

  • Secret never reaches the stored/generalized text (redaction stays on the scalar path); test confirms EXAMPLESECRET absence and fingerprint independence.
  • The widened except is a specific tuple, not a blanket catch, and still re-raises for non-str expressions; the fallback sets expression_sql = None, so the raw secret is only hashed, never stored.
  • No backward-compat fingerprint concern — these COPY queries previously aborted the whole batch, so no prior fingerprints exist to diverge from.

Left a few inline notes — one worth addressing before merge (late import in a test, which OSS guidelines flag), the rest optional.

# Defense in depth: if generalization raises an unexpected error (e.g. a
# sqlglot generator bug on an exotic statement), fingerprinting must fall
# back to the raw text rather than propagating and killing the pipeline.
import datahub.sql_parsing.sqlglot_utils as sqlglot_utils

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING — late import in test body. Hoisting test imports to the top of the file is an OSS convention (late imports in tests are treated as an auto-rejection blocker). This is a one-line fix: add from datahub.sql_parsing import sqlglot_utils to the existing import block at the top of the file, then drop this inline import. Monkeypatching still works because get_query_fingerprint_debug resolves generalize_query via the module global at call time.

Suggested change
import datahub.sql_parsing.sqlglot_utils as sqlglot_utils

(remove this line; add the import at the top)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, thanks

monkeypatch.setattr(sqlglot_utils, "generalize_query", _boom)

fingerprint = get_query_fingerprint("SELECT * FROM my_table", "redshift")
assert fingerprint

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (non-blocking). This proves a fingerprint is returned, but not that it came from the raw-text fallback — a bug where _boom never fired would still pass, since any non-empty hash satisfies assert fingerprint. Consider asserting it equals generate_hash("SELECT * FROM my_table") to pin the fallback behavior rather than just "didn't raise".


# The secret must never leak into the generalized (stored) query text.
assert "EXAMPLESECRET" not in generalized
assert "CREDENTIALS" in generalized

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (non-blocking): slightly redundant with the secret-absence check above, but harmless — leave it. The fingerprint-equality assertion below is the strongest part of this test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion PR or Issue related to the ingestion of metadata pending-submitter-merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants