feat(learn): sanitize sensitive data in output by default #705

Open
asaphe wants to merge 1 commit into rtk-ai:develop from asaphe:feat(learn)/sanitize-sensitive-data

asaphe commented Mar 18, 2026

Summary

rtk learn --write-rules outputs raw command pairs that often contain sensitive infrastructure identifiers. These get auto-loaded into Claude Code sessions if the generated rules file is committed.

This PR adds a Sanitizer that redacts sensitive patterns from all output formats by default, with user-extensible patterns via config.toml.

Supersedes #652

Built-in patterns (8)

| Pattern | Replacement | Context guard |
|---|---|---|
| `vpc-*`, `sg-*`, `subnet-*`, `vpce-*`, + 7 more | `{prefix}-<ID>` | Word boundary + hex `[0-9a-f]{8,17}` |
| `i-*` (EC2 instances) | `i-<ID>` | Separate regex; the short prefix needs its own match |
| Route53 zone IDs | `Z<ZONE_ID>` | `Z` + `[0-9A-Z]{10,32}` |
| AWS account IDs | `<ACCOUNT_ID>` | Only after `::`, `:`, or `/`; ignores bare numbers |
| `/Users/name/...`, `/home/name/...` | `~/...` | Absolute home paths only |
| `github.com/org/repo`, `repos/org/repo` | `{prefix}<org>/<repo>` | URL and `gh api` path patterns |
| UUIDs | `<UUID>` | Standard 8-4-4-4-12 hex format |
| `--repo org/repo` | `--repo <org>/<repo>` | CLI flag pattern (not caught by URL regex) |
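
To illustrate what a context guard buys, here is a minimal std-only sketch of the account-ID rule. It is not the PR's implementation, which uses regexes in src/learn/report.rs. The function name `redact_account_ids` and the "exactly 12 digits" rule are assumptions made for this example, and the real regex may treat edge cases differently.

```rust
/// Hypothetical helper: replace a run of exactly 12 ASCII digits with
/// `<ACCOUNT_ID>`, but only when the run directly follows ':' or '/'.
/// Bare 12-digit numbers (timestamps, counters) are left untouched.
fn redact_account_ids(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = String::with_capacity(input.len());
    let mut copied = 0; // start of the part of `input` not yet copied to `out`
    let mut i = 0;
    while i < bytes.len() {
        if !bytes[i].is_ascii_digit() {
            i += 1;
            continue;
        }
        // Measure the whole digit run that starts at `start`.
        let start = i;
        while i < bytes.len() && bytes[i].is_ascii_digit() {
            i += 1;
        }
        // Context guard: the byte before the run must be ':' or '/'.
        let after_delimiter = start > 0 && matches!(bytes[start - 1], b':' | b'/');
        if i - start == 12 && after_delimiter {
            out.push_str(&input[copied..start]);
            out.push_str("<ACCOUNT_ID>");
            copied = i;
        }
    }
    out.push_str(&input[copied..]);
    out
}
```

With this guard, an ARN such as `arn:aws:iam::123456789012:role/deploy` has its account ID replaced, while `sleep 123456789012` passes through unchanged because the digits follow a space rather than a delimiter.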

User-defined patterns via config.toml

[learn]
sanitize_patterns = [
    "my-company-name",
    "internal\\.example\\.co",
]

Compiled as regexes, applied after built-in patterns. Matches replaced with <REDACTED>. Invalid patterns warn on stderr and are skipped.

Flags

Sanitization is on by default. --no-sanitize restores raw output.

Design

Sanitizer struct constructed once, passed by &Sanitizer to all output functions:

  • sanitize() returns Cow<str> — uses Option<String> to track whether any pattern matched. Zero allocation when disabled or when no pattern fires.
  • Individual named lazy_static! regexes match the existing detector.rs conventions.
  • apply! macro eliminates repetition while correctly handling the Option<String> -> &str -> Cow borrow chain.
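
The borrow chain described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: it swaps the regexes for literal substring rules so that it needs only the standard library, and the field names, the `new` constructor, and the macro body are hypothetical.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for the PR's `Sanitizer`: literal substring rules
/// instead of compiled regexes, to keep the sketch std-only.
pub struct Sanitizer {
    enabled: bool,
    rules: Vec<(String, String)>, // (needle, replacement)
}

/// Apply one rule. `$current` is an `Option<String>` that stays `None`
/// until some rule matches; until then we read from the borrowed `$input`.
macro_rules! apply {
    ($current:ident, $input:expr, $needle:expr, $replacement:expr) => {{
        let replaced = {
            // Option<String> -> &str: the owned copy if one exists, else the input.
            let src: &str = $current.as_deref().unwrap_or($input);
            if src.contains($needle) {
                Some(src.replace($needle, $replacement))
            } else {
                None
            }
        };
        // The borrow of `$current` has ended, so it can be overwritten now.
        if replaced.is_some() {
            $current = replaced;
        }
    }};
}

impl Sanitizer {
    pub fn new(enabled: bool, rules: &[(&str, &str)]) -> Self {
        let rules = rules
            .iter()
            .map(|(needle, replacement)| (needle.to_string(), replacement.to_string()))
            .collect();
        Sanitizer { enabled, rules }
    }

    /// Returns `Cow::Borrowed` when disabled or when no rule matched,
    /// and `Cow::Owned` only after at least one replacement.
    pub fn sanitize<'a>(&self, input: &'a str) -> Cow<'a, str> {
        if !self.enabled {
            return Cow::Borrowed(input);
        }
        let mut current: Option<String> = None;
        for (needle, replacement) in &self.rules {
            apply!(current, input, needle.as_str(), replacement.as_str());
        }
        match current {
            Some(owned) => Cow::Owned(owned),
            None => Cow::Borrowed(input),
        }
    }
}
```

The point of tracking `Option<String>` is that the first allocation happens only when a rule actually fires. A command such as `git status` that matches nothing comes back as `Cow::Borrowed`, which is what the tests described in this PR check for the disabled and no-match paths.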

Changes

| File | What |
|---|---|
| `src/learn/report.rs` | `Sanitizer` struct, 8 named regexes, `apply!` macro, updated output functions, 23 tests |
| `src/learn/mod.rs` | Uses `Sanitizer`, sanitizes JSON output (was a gap in original code) |
| `src/config.rs` | `LearnConfig` with `sanitize_patterns: Vec<String>` |
| `src/main.rs` | `--no-sanitize` CLI flag |
| `CHANGELOG.md` | `[Unreleased]` entry |

Verification

Local (rustc 1.94.0, macOS aarch64):

  • cargo fmt --all --check: our files clean
  • cargo clippy --all-targets: no warnings from our code
  • cargo test: 999 passed, 0 failed, 3 ignored
  • E2E: built binary, ran against 62 real sessions containing AWS IDs, Route53 zones, Databricks UUIDs, GitHub org/repo refs. Verified all 8 pattern types redacted in text output, JSON output, and --write-rules file. Verified --no-sanitize preserves raw. Verified Cow::Borrowed for disabled and no-match paths.

23 tests:

| Category | Tests |
|---|---|
| AWS resource IDs | `sanitize_redacts_aws_resource_ids`, `sanitize_redacts_ec2_instance_ids` |
| Route53 | `sanitize_redacts_route53_zone_ids` |
| Account IDs | `sanitize_redacts_account_ids_after_delimiters`, `sanitize_ignores_bare_12_digit_numbers` |
| User paths | `sanitize_redacts_user_home_paths` |
| GitHub URLs | `sanitize_redacts_github_org_repo_in_urls` |
| UUIDs | `sanitize_redacts_uuids`, `sanitize_redacts_uuids_in_quotes`, `sanitize_ignores_short_hex_dashes` |
| `--repo` flag | `sanitize_redacts_repo_flag` |
| Safe content | `sanitize_preserves_commands_without_sensitive_data` (verifies `Cow::Borrowed`), `sanitize_handles_empty_input` |
| Multi-pattern | `sanitize_applies_multiple_patterns_in_one_command` |
| User patterns | `sanitize_applies_user_patterns_from_config`, `sanitize_user_patterns_combine_with_builtins`, `sanitize_multiple_user_patterns_chain_correctly` |
| Disabled | `disabled_sanitizer_returns_input_unchanged` (verifies `Cow::Borrowed`) |
| Console report | `format_console_report_shows_header_for_empty_rules`, `format_console_report_includes_counts_and_errors`, `format_console_report_redacts_when_sanitized` |
| Markdown file | `write_rules_file_produces_grouped_markdown`, `write_rules_file_redacts_when_sanitized` |

Closes #651

Signed-off-by: asaphe <asaphe@users.noreply.github.com>

Add regex-based sanitization to `rtk learn` that redacts infrastructure
identifiers before writing rules files or printing reports:

- AWS resource IDs (vpc-*, sg-*, subnet-*, vpce-*, i-*, etc.)
- AWS account IDs (12-digit numbers)
- Route53 hosted zone IDs
- Absolute user home paths (/Users/name/... -> ~/...)
- GitHub org/repo names in URLs

Sanitization is on by default. Use --no-sanitize to preserve raw output
for debugging.

Closes rtk-ai#651

Signed-off-by: asaphe <asaphe@users.noreply.github.com>