fix(ingestion/redshift): handle COPY CREDENTIALS in query fingerprinting by treff7es · Pull Request #18141 · datahub-project/datahub

treff7es · 2026-07-02T15:29:19Z

Summary

Redshift COPY ... CREDENTIALS '<secret>' statements (seen on the stl_scan usage path) crashed query fingerprinting with TypeError: 'Placeholder' object is not iterable, which aborted the entire usage-extraction batch. Two independent defects combined to make this fatal; this PR fixes both.

Defect 1 — generalize_query corrupts COPY credentials.
_strip_expression replaced every literal with a Placeholder. For a Redshift COPY, the credentials are parsed as a single Literal directly under a Credentials node. sqlglot's credentials_sql generator dispatches on isinstance(child, Literal) to choose the Redshift scalar form (CREDENTIALS '<value>') over the Snowflake key=value list form. A Placeholder flips that check onto the Snowflake list path, which iterates the credentials node — and a Placeholder isn't iterable, so it raises.

Fix: when the literal is the credentials child of a Credentials node, redact it to a constant literal instead of a placeholder. Serialization stays on the scalar path, the secret never lands in the generalized (stored) query text, and fingerprints stay independent of the credential value. Snowflake key=value credentials and Redshift IAM_ROLE forms are unaffected.
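The dispatch problem and the fix can be illustrated with a toy model. This is a minimal sketch, not sqlglot's code: `Literal`, `Placeholder`, `Property`, `Credentials`, `credentials_sql`, and `redact` are simplified stand-ins written for this example, and the `REDACTED` value is a hypothetical constant (the PR text does not state the actual one).

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Literal:
    """Stand-in for a parsed string literal."""
    value: str


@dataclass
class Placeholder:
    """Stand-in for the node that generalization substitutes for literals."""


@dataclass
class Property:
    """Stand-in for one key=value entry in a list-style credentials clause."""
    key: str
    value: str


@dataclass
class Credentials:
    """Stand-in for the credentials node of a COPY statement."""
    credentials: Any  # a Literal (scalar form) or a list of Property (list form)


REDACTED = "<redacted>"  # hypothetical constant, assumed for this sketch


def credentials_sql(node: Credentials) -> str:
    """Serialize credentials, dispatching on the child type as the PR describes."""
    child = node.credentials
    if isinstance(child, Literal):
        # Scalar (Redshift-style) form.
        return f"CREDENTIALS '{child.value}'"
    # List (Snowflake-style) form: this path iterates the child.
    pairs = " ".join(f"{p.key}='{p.value}'" for p in child)
    return f"CREDENTIALS = ({pairs})"


def redact(node: Credentials) -> Credentials:
    """Keep a scalar credentials child a Literal, but with a constant value."""
    if isinstance(node.credentials, Literal):
        return Credentials(Literal(REDACTED))
    return node
```

In this model, `credentials_sql(Credentials(Placeholder()))` falls through to the list path and raises `TypeError` because a `Placeholder` is not iterable, while any redacted scalar credential serializes to the same text regardless of the original secret.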

Defect 2 — the fingerprint safety net was too narrow.
get_query_fingerprint_debug only caught ValueError / SqlglotError, so the TypeError propagated past the usage extractor and killed the pipeline. Widened the catch to include TypeError so fingerprinting degrades to its raw-text fallback rather than aborting ingestion.
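A rough sketch of the widened safety net follows. It is an illustration under assumptions, not the code in this PR: `fingerprint_debug` is a hypothetical simplified function, `SqlglotError` is a local stand-in so the example needs no third-party import, and the SHA-256 hash in `generate_hash` is assumed rather than taken from the real helper.

```python
import hashlib
from typing import Callable, Optional, Tuple


class SqlglotError(Exception):
    """Local stand-in for sqlglot's base error class."""


def generate_hash(text: str) -> str:
    """Hash raw text; the real helper's algorithm is an assumption here."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fingerprint_debug(
    query: str, generalize: Callable[[str], str]
) -> Tuple[str, Optional[str]]:
    """Return (fingerprint, generalized_sql), falling back to the raw text."""
    try:
        expression_sql: Optional[str] = generalize(query)
    except (ValueError, SqlglotError, TypeError):
        # TypeError is the newly caught case; any other exception type
        # still propagates, so this is not a blanket catch.
        expression_sql = None
    # On fallback the raw query is only hashed; no generalized text is kept.
    text = expression_sql if expression_sql is not None else query
    return generate_hash(text), expression_sql
```

The tuple of exception types keeps the catch specific: a failure outside those three types would still surface instead of being silently absorbed.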

Testing

Added test_redshift_copy_credentials_generalization — COPY with credentials generalizes without raising, the secret is absent from the output, and two COPYs differing only in credentials share a fingerprint.
Added test_get_query_fingerprint_survives_generalization_error — a generalization TypeError falls back to raw-text fingerprinting instead of propagating.
Full tests/unit/sql_parsing/ suite passes (445 tests); ruff + mypy clean on the changed files.

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Tests added/updated
Docs updated (n/a — internal bugfix)

Redshift `COPY ... CREDENTIALS '<secret>'` crashed query fingerprinting with `TypeError: 'Placeholder' object is not iterable`, which aborted usage extraction. Two defects combined to make it fatal:

1. generalize_query replaced the credentials Literal with a Placeholder. sqlglot's credentials_sql generator dispatches on isinstance(child, Literal) to pick the Redshift scalar form over the Snowflake key=value list; a Placeholder flips it onto the list path, which iterates the node and raises. Redact the credentials literal to a constant string instead: serialization stays on the scalar path, the secret never lands in the generalized query text, and fingerprints stay independent of the credential value.

2. get_query_fingerprint_debug only caught ValueError/SqlglotError, so the TypeError propagated and killed the pipeline. Widen the catch to include TypeError so fingerprinting degrades to the raw-text fallback instead.

codecov · 2026-07-02T15:32:58Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

datahub-connector-tests · 2026-07-02T17:13:42Z

Connector Tests Results

All connector tests passed for commit 415db2b

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

askumar27

LGTM — approving. The core fix is minimal, correct, and well-commented, and the root-cause analysis checks out against sqlglot 30.8.0 (generator.py:5166 — credentials_sql dispatches the Redshift scalar vs. Snowflake key=value form on isinstance(cred_expr, exp.Literal), so replacing the credentials literal with a Placeholder reroutes serialization onto the iterating path and raises TypeError). Keeping the node a Literal (redacted constant) is the safer of the two possible fixes.

Verified:

Secret never reaches the stored/generalized text (redaction stays on the scalar path); test confirms EXAMPLESECRET absence and fingerprint independence.
The widened except is a specific tuple, not a blanket catch, and still re-raises for non-str expressions; the fallback sets expression_sql = None, so the raw secret is only hashed, never stored.
No backward-compat fingerprint concern — these COPY queries previously aborted the whole batch, so no prior fingerprints exist to diverge from.

Left a few inline notes — one worth addressing before merge (late import in a test, which OSS guidelines flag), the rest optional.

askumar27 · 2026-07-02T19:02:21Z

+    # Defense in depth: if generalization raises an unexpected error (e.g. a
+    # sqlglot generator bug on an exotic statement), fingerprinting must fall
+    # back to the raw text rather than propagating and killing the pipeline.
+    import datahub.sql_parsing.sqlglot_utils as sqlglot_utils


WARNING: late import in test body. Hoisting test imports to the top of the file is an OSS convention (late imports in tests are treated as an auto-rejection blocker). This is a one-line fix: add `from datahub.sql_parsing import sqlglot_utils` to the existing import block at the top of the file, then drop this inline import. Monkeypatching still works because `get_query_fingerprint_debug` resolves `generalize_query` via the module global at call time.

Suggested change

import datahub.sql_parsing.sqlglot_utils as sqlglot_utils

(remove this line; add the import at the top)
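The point about call-time resolution can be checked in isolation. The sketch below builds a throwaway module rather than importing DataHub; `fake_sqlglot_utils`, `get_fingerprint`, and `_boom` are hypothetical names used only for this illustration.

```python
import types

# Build a throwaway module standing in for the utilities module under test.
utils = types.ModuleType("fake_sqlglot_utils")
exec(
    "def generalize_query(q):\n"
    "    return q.upper()\n"
    "\n"
    "def get_fingerprint(q):\n"
    "    # Looks up generalize_query in this module's globals on every call.\n"
    "    return generalize_query(q)\n",
    utils.__dict__,
)

# A top-of-file import binds the caller's function object once...
get_fingerprint = utils.get_fingerprint


def _boom(q):
    raise TypeError("boom")


# ...but replacing the module attribute still takes effect, because the name
# generalize_query is resolved in the module's globals when the call happens.
utils.generalize_query = _boom
```

After the last line, calling `get_fingerprint("select 1")` raises `TypeError`, which is why a top-level import does not defeat the monkeypatch.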

treff7es: ahh, thanks

askumar27 · 2026-07-02T19:02:21Z

+    monkeypatch.setattr(sqlglot_utils, "generalize_query", _boom)
+
+    fingerprint = get_query_fingerprint("SELECT * FROM my_table", "redshift")
+    assert fingerprint


Suggestion (non-blocking). This proves a fingerprint is returned, but not that it came from the raw-text fallback — a bug where _boom never fired would still pass, since any non-empty hash satisfies assert fingerprint. Consider asserting it equals generate_hash("SELECT * FROM my_table") to pin the fallback behavior rather than just "didn't raise".
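One way the stronger assertion could look, sketched against a toy fingerprint function instead of the real module. Everything here is assumed for illustration: the `utils` namespace, the simplified `get_query_fingerprint`, and the SHA-256 `generate_hash` are stand-ins, and `unittest.mock` is used in place of pytest's `monkeypatch` to keep the example self-contained.

```python
import hashlib
from types import SimpleNamespace
from unittest import mock


def generate_hash(text: str) -> str:
    """Stand-in hash helper; the real one's algorithm is an assumption."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Stand-in for the module whose generalize_query gets patched.
utils = SimpleNamespace(generalize_query=lambda query: "SELECT * FROM ?")


def get_query_fingerprint(query: str) -> str:
    """Toy fingerprint: hash the generalized text, or the raw text on failure."""
    try:
        text = utils.generalize_query(query)
    except (ValueError, TypeError):
        text = query
    return generate_hash(text)


def test_fallback_is_pinned() -> None:
    query = "SELECT * FROM my_table"
    boom = mock.Mock(side_effect=TypeError("boom"))
    with mock.patch.object(utils, "generalize_query", boom):
        fingerprint = get_query_fingerprint(query)
    # Proves the patched function actually ran...
    boom.assert_called_once_with(query)
    # ...and that the result is the raw-text hash, not just any non-empty hash.
    assert fingerprint == generate_hash(query)
```

Checking both the call and the exact hash closes the gap described above: a test where the patch never fired would fail on the first check, and a fingerprint from the normal path would fail on the second.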

askumar27 · 2026-07-02T19:02:21Z

+
+    # The secret must never leak into the generalized (stored) query text.
+    assert "EXAMPLESECRET" not in generalized
+    assert "CREDENTIALS" in generalized


nit (non-blocking): slightly redundant with the secret-absence check above, but harmless — leave it. The fingerprint-equality assertion below is the strongest part of this test.

github-actions Bot added the ingestion label (PR or Issue related to the ingestion of metadata) Jul 2, 2026

github-actions Bot deployed to datahub-wheels (Preview) July 2, 2026 15:31

vercel Bot deployed to Preview July 2, 2026 15:48

askumar27 approved these changes Jul 2, 2026

maggiehays added the pending-submitter-merge label Jul 2, 2026

treff7es wants to merge 1 commit into master from fix-redshift-copy-credentials-fingerprint

3 participants
