feat(learn): sanitize sensitive data in output by default by asaphe · Pull Request #705 · rtk-ai/rtk

asaphe · 2026-03-18T19:54:33Z

Summary

rtk learn --write-rules outputs raw command pairs that often contain sensitive infrastructure identifiers. These get auto-loaded into Claude Code sessions if the generated rules file is committed.

This PR adds a Sanitizer that redacts sensitive patterns from all output formats by default, with user-extensible patterns via config.toml.

Supersedes #652

Built-in patterns (8)

| Pattern | Replacement | Context guard |
| --- | --- | --- |
| `vpc-`, `sg-`, `subnet-`, `vpce-`, + 7 more | `{prefix}-<ID>` | Word boundary + hex `[0-9a-f]{8,17}` |
| `i-*` (EC2 instances) | `i-<ID>` | Separate regex; short prefix needs own match |
| Route53 zone IDs | `Z<ZONE_ID>` | `Z` + `[0-9A-Z]{10,32}` |
| AWS account IDs | `<ACCOUNT_ID>` | Only after `::`, `:`, or `/`; ignores bare numbers |
| `/Users/name/...`, `/home/name/...` | `~/...` | Absolute home paths only |
| `github.com/org/repo`, `repos/org/repo` | `{prefix}<org>/<repo>` | URL and `gh api` path patterns |
| UUIDs | `<UUID>` | Standard 8-4-4-4-12 hex format |
| `--repo org/repo` | `--repo <org>/<repo>` | CLI flag pattern (not caught by URL regex) |
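To illustrate the context guard on account IDs, here is a minimal std-only sketch. It is not the PR's code: the PR uses a regex, whereas this hand-rolled scanner (the hypothetical function `redact_account_ids`) only mirrors the behaviour described in the table, namely that a 12-digit run is redacted only when the preceding character is `:` or `/`.

```rust
/// Hypothetical helper: replaces a run of exactly 12 ASCII digits with
/// `<ACCOUNT_ID>`, but only when the run is directly preceded by ':' or '/'.
/// Bare numbers (for example after a space) are left untouched.
fn redact_account_ids(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i].is_ascii_digit() {
            // Measure the whole digit run that starts at `start`.
            let start = i;
            while i < bytes.len() && bytes[i].is_ascii_digit() {
                i += 1;
            }
            // Context guard: the character before the run must be ':' or '/'.
            let guarded = start > 0 && matches!(bytes[start - 1], b':' | b'/');
            if i - start == 12 && guarded {
                out.push_str("<ACCOUNT_ID>");
            } else {
                out.push_str(&input[start..i]);
            }
        } else {
            // Copy one full UTF-8 character so multi-byte text stays intact.
            let ch = input[i..].chars().next().unwrap();
            out.push(ch);
            i += ch.len_utf8();
        }
    }
    out
}
```

Under these assumptions, `redact_account_ids("arn:aws:iam::123456789012:role/admin")` yields `arn:aws:iam::<ACCOUNT_ID>:role/admin`, while `sleep 123456789012` passes through unchanged because the number follows a space.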

User-defined patterns via config.toml

```toml
[learn]
sanitize_patterns = [
    "my-company-name",
    "internal\\.example\\.co",
]
```

Compiled as regexes, applied after built-in patterns. Matches replaced with <REDACTED>. Invalid patterns warn on stderr and are skipped.
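The "warn and skip" behaviour for invalid patterns could look roughly like the sketch below. The names `compile_user_patterns` and `toy_compile` are hypothetical and do not come from the PR. To stay std-only, the compiler is passed in as a closure; with the `regex` crate one would presumably pass `regex::Regex::new` instead of the toy validator used here.

```rust
use std::fmt::Display;

/// Hypothetical helper: tries to compile every user pattern, warns on stderr
/// for each failure, and returns only the patterns that compiled.
fn compile_user_patterns<T, E, F>(patterns: &[String], compile: F) -> Vec<T>
where
    E: Display,
    F: Fn(&str) -> Result<T, E>,
{
    let mut compiled = Vec::with_capacity(patterns.len());
    for pattern in patterns {
        match compile(pattern) {
            Ok(c) => compiled.push(c),
            // An invalid pattern is reported and skipped; it never aborts the run.
            Err(err) => eprintln!(
                "warning: skipping invalid sanitize pattern {:?}: {}",
                pattern, err
            ),
        }
    }
    compiled
}

/// Toy stand-in for a regex compiler (an assumption for this sketch):
/// it only rejects unbalanced parentheses.
fn toy_compile(pattern: &str) -> Result<String, String> {
    let mut depth = 0i32;
    for c in pattern.chars() {
        match c {
            '(' => depth += 1,
            ')' => depth -= 1,
            _ => {}
        }
        if depth < 0 {
            return Err("unopened ')'".to_string());
        }
    }
    if depth == 0 {
        Ok(pattern.to_string())
    } else {
        Err("unclosed '('".to_string())
    }
}
```

Given the two patterns from the config above plus a broken one such as `(unclosed`, this sketch keeps the two valid entries and prints one warning for the third.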

Flags

Sanitization is on by default. --no-sanitize restores raw output.

Design

Sanitizer struct constructed once, passed by &Sanitizer to all output functions:

- `sanitize()` returns `Cow<str>` and uses an `Option<String>` to track whether any pattern matched. Zero allocation when disabled or when no pattern fires.
- Individual named `lazy_static!` regexes match the existing `detector.rs` conventions.
- The `apply!` macro eliminates repetition while correctly handling the `Option<String>` → `&str` → `Cow` borrow chain.
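The `Cow<str>` plus `Option<String>` idea can be sketched in std-only Rust as follows. This is an illustration under stated assumptions, not the PR's implementation: the struct fields are hypothetical, literal substring rules stand in for the named regexes, and a plain loop stands in for the `apply!` macro.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for the PR's `Sanitizer`.
struct Sanitizer {
    enabled: bool,
    /// (needle, replacement) pairs; literal substrings replace the real regexes.
    rules: Vec<(String, String)>,
}

impl Sanitizer {
    fn sanitize<'a>(&self, input: &'a str) -> Cow<'a, str> {
        if !self.enabled {
            // Disabled: hand the input back without allocating.
            return Cow::Borrowed(input);
        }
        // `None` means no rule has matched yet, so nothing has been allocated.
        let mut owned: Option<String> = None;
        for (needle, replacement) in &self.rules {
            // Read from the rewritten text if an earlier rule already fired.
            let current: &str = owned.as_deref().unwrap_or(input);
            if current.contains(needle.as_str()) {
                let replaced = current.replace(needle.as_str(), replacement);
                owned = Some(replaced);
            }
        }
        match owned {
            Some(s) => Cow::Owned(s),
            None => Cow::Borrowed(input),
        }
    }
}
```

With a single rule mapping `my-company-name` to `<REDACTED>`, calling `sanitize("git status")` returns `Cow::Borrowed`, while `sanitize("ping my-company-name.example")` returns an owned `ping <REDACTED>.example`. Constructing the struct once and passing `&Sanitizer` around keeps the rule set shared across output functions.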

Changes

| File | What |
| --- | --- |
| `src/learn/report.rs` | `Sanitizer` struct, 8 named regexes, `apply!` macro, updated output functions, 23 tests |
| `src/learn/mod.rs` | Uses `Sanitizer`, sanitizes JSON output (was a gap in original code) |
| `src/config.rs` | `LearnConfig` with `sanitize_patterns: Vec<String>` |
| `src/main.rs` | `--no-sanitize` CLI flag |
| `CHANGELOG.md` | `[Unreleased]` entry |

Verification

Local (rustc 1.94.0, macOS aarch64):

- `cargo fmt --all --check`: our files clean
- `cargo clippy --all-targets`: no warnings from our code
- `cargo test`: 999 passed, 0 failed, 3 ignored
- E2E: built binary, ran against 62 real sessions containing AWS IDs, Route53 zones, Databricks UUIDs, GitHub org/repo refs. Verified all 8 pattern types redacted in text output, JSON output, and `--write-rules` file. Verified `--no-sanitize` preserves raw. Verified `Cow::Borrowed` for disabled and no-match paths.

23 tests:

| Category | Tests |
| --- | --- |
| AWS resource IDs | `sanitize_redacts_aws_resource_ids`, `sanitize_redacts_ec2_instance_ids` |
| Route53 | `sanitize_redacts_route53_zone_ids` |
| Account IDs | `sanitize_redacts_account_ids_after_delimiters`, `sanitize_ignores_bare_12_digit_numbers` |
| User paths | `sanitize_redacts_user_home_paths` |
| GitHub URLs | `sanitize_redacts_github_org_repo_in_urls` |
| UUIDs | `sanitize_redacts_uuids`, `sanitize_redacts_uuids_in_quotes`, `sanitize_ignores_short_hex_dashes` |
| --repo flag | `sanitize_redacts_repo_flag` |
| Safe content | `sanitize_preserves_commands_without_sensitive_data` (verifies `Cow::Borrowed`), `sanitize_handles_empty_input` |
| Multi-pattern | `sanitize_applies_multiple_patterns_in_one_command` |
| User patterns | `sanitize_applies_user_patterns_from_config`, `sanitize_user_patterns_combine_with_builtins`, `sanitize_multiple_user_patterns_chain_correctly` |
| Disabled | `disabled_sanitizer_returns_input_unchanged` (verifies `Cow::Borrowed`) |
| Console report | `format_console_report_shows_header_for_empty_rules`, `format_console_report_includes_counts_and_errors`, `format_console_report_redacts_when_sanitized` |
| Markdown file | `write_rules_file_produces_grouped_markdown`, `write_rules_file_redacts_when_sanitized` |

Closes #651

Signed-off-by: asaphe <asaphe@users.noreply.github.com>

Add regex-based sanitization to `rtk learn` that redacts infrastructure identifiers before writing rules files or printing reports:

- AWS resource IDs (vpc-*, sg-*, subnet-*, vpce-*, i-*, etc.)
- AWS account IDs (12-digit numbers)
- Route53 hosted zone IDs
- Absolute user home paths (/Users/name/... -> ~/...)
- GitHub org/repo names in URLs

Sanitization is on by default. Use --no-sanitize to preserve raw output for debugging.

Closes rtk-ai#651

Signed-off-by: asaphe <asaphe@users.noreply.github.com>

asaphe wants to merge 1 commit into rtk-ai:develop from asaphe:feat(learn)/sanitize-sensitive-data
