-
Notifications
You must be signed in to change notification settings - Fork 848
Fix origin_server_auth URL encoding for mixed-encoding URLs #12802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Added unit tests and autest to demonstrate the isUriEncoded() bug in the origin_server_auth plugin (s3_auth). Bug: When a URL has mixed encoding (some characters encoded, some not), isUriEncoded() incorrectly returns true upon finding the first %XX sequence, assuming the entire string is already encoded. This causes canonicalEncode() to skip re-encoding unencoded characters like parentheses. Example: Client sends: /app/(channel)/%5B%5Bparts%5D%5D/page.js isUriEncoded(): Returns true (finds %5B) canonicalEncode(): Returns as-is (thinks already encoded) Signature for: /app/(channel)/%5B%5Bparts%5D%5D/page.js <- WRONG S3 expects: /app/%28channel%29/%5B%5Bparts%5D%5D/page.js This results in S3 returning 403 SignatureDoesNotMatch. Unit tests marked with [!mayfail] to document expected-to-fail behavior. Autest passes with mock server but would fail with real S3.
Fixed the isUriEncoded() and canonicalEncode() functions to properly handle URLs with mixed encoding (some characters encoded, some not). Bug: URLs like /app/(channel)/%5B%5Bparts%5D%5D/page.js would cause S3 signature mismatch because: - isUriEncoded() found %5B and returned true (incorrectly) - canonicalEncode() returned the string as-is - Signature was calculated for partially-encoded path - S3 expected signature for fully-encoded path Fix: 1. isUriEncoded() now checks the ENTIRE string and returns false if ANY character that should be encoded is found unencoded. 2. canonicalEncode() now always decodes first, then re-encodes using AWS canonical rules. This ensures consistent output regardless of input. Added uriDecode() helper function to decode percent-encoded characters. Updated unit tests to verify the correct behavior. This fixes YTSATS-4835.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a critical bug in the origin_server_auth plugin where mixed URL encoding (some characters encoded, others not) caused AWS SigV4 signature mismatches with S3/GCP.
Changes:
- Added
uriDecode()function to decode percent-encoded URLs - Modified
isUriEncoded()to check if ALL characters requiring encoding are encoded (not just if ANY encoded character exists) - Modified
canonicalEncode()to decode-then-reencode partially-encoded URLs for consistent canonical output
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| plugins/origin_server_auth/aws_auth_v4.cc | Added uriDecode() function and fixed isUriEncoded()/canonicalEncode() to handle mixed encoding |
| plugins/origin_server_auth/unit_tests/test_aws_auth_v4.h | Exposed uriDecode() and canonicalEncode() functions for unit testing |
| plugins/origin_server_auth/unit_tests/test_aws_auth_v4.cc | Added comprehensive unit tests for mixed encoding scenarios and S3 safe characters |
| tests/gold_tests/pluginTest/origin_server_auth/s3_url_encoding.test.py | Added integration test demonstrating the bug and verifying the fix |
| tests/gold_tests/pluginTest/origin_server_auth/rules/s3_url_encoding.test_input | Added test configuration with fake AWS credentials |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
AWS SigV4 requires uppercase hex digits in percent-encoding (e.g., %2F not %2f). URLs with lowercase hex were incorrectly passing through isUriEncoded() without normalization, causing potential signature mismatch with S3. Changes: - isUriEncoded() now returns false for lowercase hex digits, triggering canonicalEncode() to normalize via decode/re-encode - Added comprehensive tests for lowercase hex handling - Improved documentation comments Fixes issue identified by Copilot review.
Simple URIs like /path/foo.jpg with only unreserved characters were unnecessarily being decoded and re-encoded. Now isUriEncoded() returns true for strings that are already in canonical form (no reserved chars needing encoding), avoiding the unnecessary processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| TEST_CASE("isUriEncoded(): check an empty input", "[AWS][auth][utility]") | ||
| { | ||
| CHECK(false == isUriEncoded("")); | ||
| // Empty string has no characters that need encoding - it's already canonical | ||
| CHECK(true == isUriEncoded("")); |
Copilot
AI
Jan 13, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior change for empty strings is semantically reasonable (an empty string is already canonical), but this changes the return value from false to true for empty input. While the test has been updated to reflect this, ensure that all calling code handles this correctly, especially since the PR description doesn't explicitly mention this behavioral change for empty strings.
| String decoded = uriDecode(in); | ||
| String canonical = uriEncode(decoded, isObjectName); | ||
| return canonical; |
Copilot
AI
Jan 13, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The decode-then-reencode approach allocates two temporary strings for every URL that needs normalization. For high-traffic scenarios with many requests containing mixed encoding, this could have performance implications. Consider adding a fast-path optimization that checks if the string needs normalization before allocating the decode buffer, or document the performance characteristics of this approach.
| result.reserve(in.length()); /* Decoded string will be same size or smaller */ | ||
|
|
||
| for (size_t i = 0; i < in.length(); i++) { | ||
| if (in[i] == '%' && i + 2 < in.length() && std::isxdigit(in[i + 1]) && std::isxdigit(in[i + 2])) { |
Copilot
AI
Jan 13, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using std::isxdigit with potentially signed char values (when char is signed) can cause undefined behavior if the value is negative. The arguments to isxdigit should be cast to unsigned char to ensure values in the range 0-255 are correctly handled. This applies to all three calls: in[i + 1] and in[i + 2].
| if (in[i] == '%' && i + 2 < in.length() && std::isxdigit(in[i + 1]) && std::isxdigit(in[i + 2])) { | |
| if (in[i] == '%' && i + 2 < in.length() && | |
| std::isxdigit(static_cast<unsigned char>(in[i + 1])) && | |
| std::isxdigit(static_cast<unsigned char>(in[i + 2]))) { |
| TEST_CASE("canonicalEncode(): plus sign handling", "[AWS][auth][utility]") | ||
| { | ||
| // Per AWS SigV4 spec, plus signs in URLs are treated as spaces and encoded as %20 | ||
| // This test documents and verifies that behavior | ||
|
|
||
| // Literal plus sign should be encoded as %20 (treated as space per AWS spec) | ||
| CHECK(uriEncode("+", false) == "%20"); |
Copilot
AI
Jan 13, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test case name and assertion at line 369 states "Literal plus sign should be encoded as %20 (treated as space per AWS spec)" but this behavior contradicts standard URL encoding where a literal plus sign would typically be encoded as %2B, not %20. While the code comment at lines 365-366 explains this is intentional AWS behavior, the test should verify that this matches actual AWS SigV4 behavior. Consider adding a reference to the specific AWS documentation that states plus signs should be treated as spaces in the canonical URI.
| String | ||
| uriDecode(const String &in) | ||
| { | ||
| String result; | ||
| result.reserve(in.length()); /* Decoded string will be same size or smaller */ | ||
|
|
||
| for (size_t i = 0; i < in.length(); i++) { | ||
| if (in[i] == '%' && i + 2 < in.length() && std::isxdigit(in[i + 1]) && std::isxdigit(in[i + 2])) { | ||
| /* Decode %XX to character */ | ||
| char hex[3] = {in[i + 1], in[i + 2], '\0'}; | ||
| result += static_cast<char>(std::strtol(hex, nullptr, 16)); | ||
| i += 2; /* Skip past the hex digits */ | ||
| } else { | ||
| result += in[i]; | ||
| } | ||
| } | ||
|
|
||
| return result; | ||
| } |
Copilot
AI
Jan 13, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The uriDecode function does not handle invalid UTF-8 sequences or characters outside the printable ASCII range. When decoding percent-encoded bytes that represent non-ASCII or invalid UTF-8 sequences, the function may produce strings with invalid characters. Consider adding validation or documenting this limitation, especially since this is used in a security-sensitive context for AWS signature calculation.
| char c = in[pos]; | ||
|
|
||
| if (isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') { | ||
| /* found a unreserved character which should not have been be encoded regardless | ||
| * 'A'-'Z', 'a'-'z', '0'-'9', '-', '.', '_', and '~'. */ | ||
| /* Unreserved characters don't need encoding: | ||
| * 'A'-'Z', 'a'-'z', '0'-'9', '-', '.', '_', and '~' */ | ||
| continue; | ||
| } | ||
|
|
||
| if (' ' == c) { | ||
| /* space should have been encoded with %20 if the string was encoded */ | ||
| return false; | ||
| } | ||
|
|
||
| if ('/' == c && !isObjectName) { | ||
| /* if this is not an object name '/' should have been encoded */ | ||
| return false; | ||
| if ('/' == c && isObjectName) { | ||
| /* '/' is allowed unencoded in object names */ | ||
| continue; | ||
| } | ||
|
|
||
| if ('%' == c) { | ||
| if (pos + 2 < in.length() && std::isxdigit(in[pos + 1]) && std::isxdigit(in[pos + 2])) { | ||
| /* if string was encoded we should have exactly 2 hexadecimal chars following it */ | ||
| return true; | ||
| /* Valid percent-encoded sequence found. AWS SigV4 requires UPPERCASE hex digits. | ||
| * If lowercase hex is found, return false to trigger normalization. | ||
| * See: https://docs.aws.amazon.com/IAM/latest/UserGuide/create-signed-request.html | ||
| * "Letters in the hexadecimal value must be uppercase, for example "%1A"." */ | ||
| if (std::islower(in[pos + 1]) || std::islower(in[pos + 2])) { | ||
| return false; /* Lowercase hex needs normalization to uppercase */ | ||
| } | ||
| pos += 2; /* Skip past the hex digits */ | ||
| continue; | ||
| } else { | ||
| /* lonely '%' should have been encoded with %25 according to the RFC so likely not encoded */ | ||
| /* Lone '%' or incomplete sequence - needs encoding as %25 */ | ||
| return false; | ||
| } | ||
| } | ||
|
|
||
| /* Any other character needs encoding (space, '(', ')', '[', ']', etc.) */ | ||
| return false; | ||
| } |
Copilot
AI
Jan 13, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using std::isxdigit, std::isalnum, and std::islower with potentially signed char values (when char is signed) can cause undefined behavior if the value is negative or outside the range of unsigned char. The character being checked should be cast to unsigned char before being passed to these functions. This applies to multiple locations where in[pos], in[pos + 1], and in[pos + 2] are used with these functions.
| /* Input needs normalization. This handles: | ||
| * 1. Unencoded strings - encode all reserved characters | ||
| * 2. Partially encoded - some chars encoded, some not (e.g., parentheses vs brackets) | ||
| * 3. Lowercase hex - convert %2f to %2F per AWS SigV4 requirement | ||
| * | ||
| * Decode first to get the raw string, then re-encode with AWS canonical rules. | ||
| * | ||
| * Example (mixed encoding): | ||
| * Input: /app/(channel)/%5B%5Bparts%5D%5D/page.js (parentheses not encoded) | ||
| * Decode: /app/(channel)/[[parts]]/page.js | ||
| * Encode: /app/%28channel%29/%5B%5Bparts%5D%5D/page.js | ||
| * | ||
| * Example (lowercase hex): | ||
| * Input: /path/%5btest%5d/file.js (lowercase hex) | ||
| * Decode: /path/[test]/file.js | ||
| * Encode: /path/%5Btest%5D/file.js (uppercase hex) | ||
| */ | ||
| String decoded = uriDecode(in); | ||
| String canonical = uriEncode(decoded, isObjectName); | ||
| return canonical; |
Copilot
AI
Jan 13, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The decode-then-reencode approach could potentially be exploited if an attacker can inject URL-encoded data that normalizes to unexpected values. For instance, overlong UTF-8 sequences or other encoding tricks could be used. While AWS itself will normalize the URL, it's worth considering if there are edge cases where the normalization here could differ from AWS's normalization, leading to security issues. Consider adding explicit validation or documenting the security assumptions.
| * | ||
| * Example from YTSATS-4835: | ||
| * Client sends: /app/%28channel%29/%5B%5Bparts%5D%5D/page.js | ||
| * ATS decodes to: /app/(channel)/%5B%5Bparts%5D%5D/page.js (partial decode) |
Copilot
AI
Jan 13, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment on line 159 describes a scenario where "ATS decodes to: /app/(channel)/%5B%5Bparts%5D%5D/page.js (partial decode)". However, this seems to imply that ATS performs partial decoding of the URL, which may not be accurate. According to the test setup and description, ATS preserves URL encoding from the client and forwards it as-is. The comment should clarify whether this partial decode is actually performed by ATS or if it's just describing what a hypothetical partial decode would look like.
| * ATS decodes to: /app/(channel)/%5B%5Bparts%5D%5D/page.js (partial decode) | |
| * Partially decoded form involved in the bug: /app/(channel)/%5B%5Bparts%5D%5D/page.js (partial decode example) | |
| * (ATS preserves the client’s URL encoding and forwards it as-is; this line illustrates the mixed-encoding state) |
| canonicalEncode(const String &in, bool isObjectName) | ||
| { | ||
| String canonical; | ||
| if (!isUriEncoded(in, isObjectName)) { | ||
| /* Not URI-encoded */ | ||
| canonical = uriEncode(in, isObjectName); | ||
| } else { | ||
| /* URI-encoded, then don't encode since AWS does not encode which is not mentioned in the spec, | ||
| * asked AWS, still waiting for confirmation */ | ||
| canonical = in; | ||
| if (isUriEncoded(in, isObjectName)) { | ||
| /* Fully URI-encoded with uppercase hex, return as-is */ | ||
| return in; | ||
| } | ||
|
|
||
| /* Input needs normalization. This handles: | ||
| * 1. Unencoded strings - encode all reserved characters | ||
| * 2. Partially encoded - some chars encoded, some not (e.g., parentheses vs brackets) | ||
| * 3. Lowercase hex - convert %2f to %2F per AWS SigV4 requirement | ||
| * | ||
| * Decode first to get the raw string, then re-encode with AWS canonical rules. | ||
| * | ||
| * Example (mixed encoding): | ||
| * Input: /app/(channel)/%5B%5Bparts%5D%5D/page.js (parentheses not encoded) | ||
| * Decode: /app/(channel)/[[parts]]/page.js | ||
| * Encode: /app/%28channel%29/%5B%5Bparts%5D%5D/page.js | ||
| * | ||
| * Example (lowercase hex): | ||
| * Input: /path/%5btest%5d/file.js (lowercase hex) | ||
| * Decode: /path/[test]/file.js | ||
| * Encode: /path/%5Btest%5D/file.js (uppercase hex) | ||
| */ | ||
| String decoded = uriDecode(in); | ||
| String canonical = uriEncode(decoded, isObjectName); | ||
| return canonical; |
Copilot
AI
Jan 13, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The canonicalEncode function doesn't handle the case of double-encoded URLs (e.g., "%2528" which is an encoded "%28"). When such a URL is decoded once, it becomes "%28", which then gets re-encoded to "%2528" again, resulting in the original string. However, AWS might expect only a single level of decoding. Consider adding test cases for double-encoded URLs to verify the behavior matches AWS expectations, or document that double-encoding is not supported.
Summary
Fixed the
isUriEncoded()andcanonicalEncode()functions in the origin_server_auth plugin to properly handle URLs with mixed encoding (some characters encoded, some not).Bug Description
When a URL has mixed encoding (e.g.,
/app/(channel)/%5B%5Bparts%5D%5D/page.jswhere parentheses are NOT encoded but brackets ARE encoded), the AWS v4 signature calculation was incorrect:isUriEncoded()found%5Band returnedtrue, incorrectly assuming the entire string was fully encodedcanonicalEncode()returned the string as-isAWS SigV4 Canonical URI Encoding
Per the AWS SigV4 spec:
This means characters like
(,),!,*,'must be percent-encoded in the canonical URI for signature calculation, even though they are listed as "safe" for S3 object key names.(%28))%29)!%21)*%2A)'%27)Fix
isUriEncoded(): Now checks the ENTIRE string and returnsfalseif ANY character that should be encoded is found unencodedcanonicalEncode(): For partially-encoded strings, decodes first then re-encodes to ensure consistent canonical outputuriDecode()helper functionTesting
!,*,',(,))References