
Conversation

@rootranjan

Fixes #4631

Description:

Reduce false positives in DatadogToken detector by filtering out legitimate code identifiers, checksums, encrypted data, and test values that match the detector pattern.

Changes:

  • Add filter to exclude letters-only matches (no digits)
  • Add filter to exclude repeated characters (test/placeholder values)
  • Add filter to exclude NPM integrity hashes (sha512-...== patterns)
  • Add filter to exclude Go module checksums (h1:...= patterns)
  • Add filter to exclude URL-encoded paths (%3A patterns)
  • Add filter to exclude SOPS-encrypted data (ENC[AES256_GCM,data:...] patterns)
  • Add filter to exclude base64-encoded certificates (caBundle patterns)
  • Fix lint errors by properly handling res.Body.Close() errors

This reduces false positives from legitimate code identifiers, checksums, encrypted data, and test values while still detecting real Datadog API and Application keys that contain digits and have higher entropy.

Problem:
The DatadogToken detector was flagging any 32-character or 40-character alphanumeric string near the keywords "datadog" or "dd" as a potential secret, including:

  • URL-encoded service names in paths (e.g., service%3Amy-app-service-name)
  • NPM package integrity hashes (e.g., substrings from sha512-...== patterns)
  • Go module checksums (e.g., substrings from h1:...= patterns)
  • SOPS-encrypted data (e.g., substrings from ENC[AES256_GCM,data:...] patterns)
  • Test/placeholder values (e.g., 11111111111111111111111111111111)
  • Base64-encoded certificates (e.g., substrings from caBundle fields)
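The keyword-proximity matching described above can be approximated as follows. This is an illustrative sketch for the 32-character API-key case only; the variable name apiKeyPat and the exact regex are assumptions for illustration, not the detector's actual source:

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative approximation of a keyword-proximity pattern: a "datadog" or
// "dd" keyword within ~40 characters of a 32-char alphanumeric string.
// The real detector's regexes may differ in detail.
var apiKeyPat = regexp.MustCompile(`(?i)(?:datadog|dd)[\w.:%\s-]{0,40}?\b([a-zA-Z0-9]{32})\b`)

func main() {
	line := "dd_api_key: abcdef0123456789abcdef0123456789"
	if m := apiKeyPat.FindStringSubmatch(line); m != nil {
		fmt.Println(m[1]) // the captured 32-char candidate
	}
}
```

Because the pattern accepts any 32-char alphanumeric run near the keyword, all of the benign strings listed above can fall into the capture group.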

Solution:
Added isLikelyFalsePositive() helper function with multiple filters:

  1. Letters-only filter - Excludes strings with no digits (service names/identifiers)
  2. Repeated characters filter - Excludes test/placeholder values like 11111111111111111111111111111111
  3. NPM integrity hash filter - Detects sha512-...== patterns in package.json files
  4. Go module checksum filter - Detects h1:...= patterns in go.sum/go.mod files
  5. URL-encoded path filter - Detects %3A patterns and URL structures
  6. SOPS-encrypted data filter - Detects ENC[AES256_GCM,data:...] patterns
  7. Base64 certificate filter - Detects caBundle and certificate-related fields
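Putting the filters together, the helper might look like the following sketch. Only the first two checks are shown in full; the context-marker loop is a simplified stand-in for the PR's individual context filters, and is an assumption rather than the PR's actual code:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// hasDigit reports whether s contains at least one digit.
func hasDigit(s string) bool {
	for _, r := range s {
		if unicode.IsDigit(r) {
			return true
		}
	}
	return false
}

// isRepeatedCharacter reports whether s is one character repeated.
func isRepeatedCharacter(s string) bool {
	if len(s) == 0 {
		return false
	}
	for i := 1; i < len(s); i++ {
		if s[i] != s[0] {
			return false
		}
	}
	return true
}

// isLikelyFalsePositive applies the cheap match-only filters first, then the
// context-based ones; context is the text surrounding the match.
func isLikelyFalsePositive(match, context string) bool {
	if !hasDigit(match) { // letters-only: service names, identifiers
		return true
	}
	if isRepeatedCharacter(match) { // placeholders like "1111..."
		return true
	}
	// Simplified stand-ins for the NPM/Go/URL/SOPS/certificate filters.
	for _, marker := range []string{"sha512-", "h1:", "%3A", "ENC[AES256_GCM", "caBundle"} {
		if strings.Contains(context, marker) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isLikelyFalsePositive("11111111111111111111111111111111", ""))         // true
	fmt.Println(isLikelyFalsePositive("a1b2c3d4e5f6a7b8c9d0a1b2c3d4e5f6", "dd_key=")) // false
}
```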

Implementation Details:

  • Modified FromData() to use FindAllStringSubmatchIndex() to get match positions for context extraction
  • Added context-aware filtering that checks the surrounding text (±200 chars for most patterns, ±2000 chars for certificates)
  • Filters are applied before processing matches to avoid unnecessary verification calls
  • Each filter function extracts context around the match and checks for specific patterns (e.g., sha512-, h1:, %3A, ENC[, caBundle)
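The context extraction step described above can be sketched as follows. The window sizes match the PR description; the function name contextAround is a hypothetical stand-in, not the PR's actual helper:

```go
package main

import (
	"fmt"
	"strings"
)

// contextAround returns up to window bytes on each side of the match
// occupying [start, end) in data, clamped to the bounds of data.
func contextAround(data string, start, end, window int) string {
	lo := start - window
	if lo < 0 {
		lo = 0
	}
	hi := end + window
	if hi > len(data) {
		hi = len(data)
	}
	return data[lo:hi]
}

func main() {
	data := "integrity: sha512-xyz token=a1b2c3d4e5f6a7b8c9d0a1b2c3d4e5f6 end"
	token := "a1b2c3d4e5f6a7b8c9d0a1b2c3d4e5f6"
	start := strings.Index(data, token)
	// A ±200-char window on this short input returns the whole string,
	// so the "sha512-" marker is visible to the context filters.
	fmt.Println(contextAround(data, start, start+len(token), 200))
}
```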

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; requires golangci-lint)?

@rootranjan rootranjan requested a review from a team December 31, 2025 13:49
@rootranjan rootranjan requested a review from a team as a code owner December 31, 2025 13:49

@nabeelalam nabeelalam left a comment


Hey @rootranjan! Thanks for proposing a solution for the false positive issues in this detector.

I'm all for testing for entropy and whether the tokens contain all required characters, but I feel the rest of the checks may be adding some complexity and performance hits without a strong guarantee of detecting false positives.

Introducing 4 more regular expressions, along with the slicing/matching, will definitely impact the performance of this detector; we generally like to keep detectors light since inputs can be large.

Plus, several filters seem a little too broad to me (e.g., %3A anywhere in a 200-char range) and could suppress actual secrets by marking them as false positives; it's better to be safe than sorry in this case. With the added complexity, it would also be harder to debug those cases.

I would suggest keeping the changes minimal: keep the lighter-weight constraints (required characters, entropy threshold) that you have added, and add only a couple of high-confidence exclusions.

Comment on lines +238 to +251
// isRepeatedCharacter checks if a string consists of the same character repeated.
// This filters out test/placeholder values like "11111111111111111111111111111111"
func isRepeatedCharacter(s string) bool {
	if len(s) == 0 {
		return false
	}
	firstChar := s[0]
	for i := 1; i < len(s); i++ {
		if s[i] != firstChar {
			return false
		}
	}
	return true
}

We can use an entropy test for this
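A Shannon entropy check covers the repeated-character case and more: a string of one repeated byte scores exactly 0, while real keys score far higher. A minimal sketch of such a test:

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns the entropy of s in bits per byte.
// One repeated character yields 0; random-looking hex is near 4.
func shannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	counts := make(map[byte]int)
	for i := 0; i < len(s); i++ {
		counts[s[i]]++
	}
	var entropy float64
	n := float64(len(s))
	for _, c := range counts {
		p := float64(c) / n
		entropy -= p * math.Log2(p)
	}
	return entropy
}

func main() {
	fmt.Println(shannonEntropy("11111111111111111111111111111111")) // 0
	fmt.Println(shannonEntropy("a1b2c3d4e5f6a7b8c9d0a1b2c3d4e5f6") > 3)
}
```

A threshold somewhere around 3 bits per byte would reject repeated and low-variety placeholders while passing genuine 32-char hex keys; the exact cutoff would need tuning against real samples.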

Comment on lines +105 to +112
func hasDigit(s string) bool {
	for _, r := range s {
		if unicode.IsDigit(r) {
			return true
		}
	}
	return false
}

I suppose we should check whether the token contains at least one lower-case and upper-case letter as well. I'm not entirely sure but the entropy check might also solve this (I doubt it).
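A character-class check like the one described could look like this sketch (a hypothetical companion to hasDigit, not code from the PR; whether real Datadog keys actually require mixed case would need verifying before adopting it):

```go
package main

import (
	"fmt"
	"unicode"
)

// hasMixedCase reports whether s contains at least one lower-case and one
// upper-case letter, on top of the existing hasDigit requirement.
func hasMixedCase(s string) bool {
	var lower, upper bool
	for _, r := range s {
		switch {
		case unicode.IsLower(r):
			lower = true
		case unicode.IsUpper(r):
			upper = true
		}
		if lower && upper {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasMixedCase("aB3dEf"))           // true
	fmt.Println(hasMixedCase("abcdef0123456789")) // false
}
```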


Successfully merging this pull request may close these issues.

DatadogToken detector produces false positives for checksums, encrypted data, and service names
