Improve AWS IRSA credential handling with role chaining and regional STS#11570
Improve AWS IRSA credential handling with role chaining and regional STS#11570willdavsmith wants to merge 3 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Improves AWS IRSA credential acquisition in the UCP by introducing role chaining (web identity → standard AssumeRole session) and enabling optional regional STS endpoint selection, with supporting Helm/workflow wiring.
Changes:
- Add IRSA role chaining in
UCPCredentialProviderand propagate STS endpoint region configuration. - Read
AWS_REGION/AWS_DEFAULT_REGIONin the AWS frontend module and pass through to the credential provider. - Add Helm value + deployment env var injection for
AWS_REGION, and pass the value in the cloud functional test workflow.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/ucp/frontend/aws/routes.go | Reads region env vars and passes STS region to the UCP AWS credential provider. |
| pkg/ucp/aws/ucpcredentialprovider.go | Implements IRSA role chaining and adds STS endpoint region option. |
| pkg/ucp/aws/ucpcredentialprovider_test.go | Updates constructor usage and validates default/explicit STS region option behavior. |
| deploy/Chart/values.yaml | Adds global.aws.region Helm value. |
| deploy/Chart/templates/ucp/deployment.yaml | Injects AWS_REGION env var into the UCP deployment when configured. |
| .github/workflows/functional-test-cloud.yaml | Passes global.aws.region into Helm install; updates pinned action SHAs. |
|
|
||
| - name: Setup Go | ||
| uses: actions/setup-go@4a3601121dd01d1626a1e23e37211e3254c1c06c # v6.4.0 | ||
| uses: actions/setup-go@4b73464bb391d4059bd26b0524d20df3927bd417 # v6.3.0 |
There was a problem hiding this comment.
This pins actions/setup-go to v6.3.0, while other workflows in this repo are pinned to v6.4.0 (e.g., .github/workflows/build.yaml). Unless there’s a specific regression being worked around, keep the version/SHA consistent across workflows and prefer the newer pinned revision to reduce maintenance and security risk.
| uses: actions/setup-go@4b73464bb391d4059bd26b0524d20df3927bd417 # v6.3.0 | |
| uses: actions/setup-go@v6.4.0 # v6.4.0 |
There was a problem hiding this comment.
These version mismatches were from the branch being based on a stale point in main's history. Rebased on latest main and reset the workflow file — the only diff now is the --set global.aws.region line we added. The SHAs need to be locked to specific commits as we do throughout the rest of the repo, so taking main's current pinned versions is the right approach.
| permission-checks: write | ||
|
|
||
| - uses: LouisBrunner/checks-action@937cbbcde3259005b50746dc91cde29098aac2ff # v3.1.0 | ||
| - uses: LouisBrunner/checks-action@dfcbcf801bff1ea7f1414824fc28f2cd697b35da # v3.0.0 |
There was a problem hiding this comment.
This changes LouisBrunner/checks-action from v3.1.0 to v3.0.0. Please avoid downgrading action versions unless necessary; either keep it at the newer pinned SHA used previously or add a short comment explaining the compatibility reason for the downgrade.
| - uses: LouisBrunner/checks-action@dfcbcf801bff1ea7f1414824fc28f2cd697b35da # v3.0.0 | |
| - uses: LouisBrunner/checks-action@v3.1.0 |
There was a problem hiding this comment.
Same as above — resolved by rebasing on main. These stale SHAs are no longer in the diff.
| permission-checks: write | ||
| permission-pull-requests: write | ||
| permission-contents: read | ||
|
|
There was a problem hiding this comment.
This changes LouisBrunner/checks-action from v3.1.0 to v3.0.0. Please avoid downgrading action versions unless necessary; either keep it at the newer pinned SHA used previously or add a short comment explaining the compatibility reason for the downgrade.
| # Intentionally pinned to v3.0.0 for workflow compatibility; do not upgrade back to v3.1.0 until that issue is resolved. |
There was a problem hiding this comment.
Resolved by rebase — no longer in the diff.
| echo "test_status=$TEST_STATUS" >> $GITHUB_OUTPUT | ||
|
|
||
| - uses: LouisBrunner/checks-action@937cbbcde3259005b50746dc91cde29098aac2ff # v3.1.0 | ||
| - uses: LouisBrunner/checks-action@dfcbcf801bff1ea7f1414824fc28f2cd697b35da # v3.0.0 |
There was a problem hiding this comment.
This changes LouisBrunner/checks-action from v3.1.0 to v3.0.0. Please avoid downgrading action versions unless necessary; either keep it at the newer pinned SHA used previously or add a short comment explaining the compatibility reason for the downgrade.
| - uses: LouisBrunner/checks-action@dfcbcf801bff1ea7f1414824fc28f2cd697b35da # v3.0.0 | |
| - uses: LouisBrunner/checks-action@v3.1.0 |
There was a problem hiding this comment.
Resolved by rebase — no longer in the diff.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #11570 +/- ##
==========================================
- Coverage 51.38% 51.34% -0.04%
==========================================
Files 699 699
Lines 44114 44148 +34
==========================================
+ Hits 22668 22669 +1
- Misses 19278 19310 +32
- Partials 2168 2169 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Add role chaining to the IRSA credential path: after AssumeRoleWithWebIdentity, perform a follow-up AssumeRole on the same role to produce standard IAM session credentials. This avoids web identity session restrictions that can cause issues with downstream AWS services. Also adds support for configuring the STS endpoint region via AWS_REGION environment variable, plumbed through Helm values and the cloud functional test workflow. Additional improvements from review feedback: - Use ctx instead of context.TODO() for config.LoadDefaultConfig calls - Apply credential Duration to both web identity and AssumeRole sessions - Use logger.Error for error paths instead of logger.Info - Use int32 division for duration conversion instead of float-based .Seconds() Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1e198f2 to
c7301e3
Compare
pkg/ucp/aws/ucpcredentialprovider.go
Outdated
| RoleArn: &s.IRSACredential.RoleARN, | ||
| RoleSessionName: aws.String(sessionPrefix + uuid.New().String()), | ||
| } | ||
| if c.options.Duration > 0 { |
There was a problem hiding this comment.
DurationSeconds is derived via int32(c.options.Duration / time.Second), which truncates sub-second durations to 0 and can also produce values outside STS’s accepted range (and for role chaining the max is more restrictive). Validate/clamp the duration (or omit DurationSeconds when it would be invalid) before calling AssumeRole to avoid runtime validation errors.
| if c.options.Duration > 0 { | |
| if c.options.Duration >= 15*time.Minute && c.options.Duration <= time.Hour { |
There was a problem hiding this comment.
Good catch — added validation to only set DurationSeconds when the value falls within the STS-accepted range for role chaining (15 min to 1 hour).
pkg/ucp/aws/ucpcredentialprovider.go
Outdated
| // STSEndpointRegion is the AWS region to use for the STS endpoint when retrieving | ||
| // IRSA credentials. Using a regional STS endpoint (matching the target service region) | ||
| // avoids token compatibility issues with some AWS services like CloudWatch Logs. | ||
| // If empty, defaults to "us-east-1" (global endpoint). |
There was a problem hiding this comment.
The field comment says an empty STSEndpointRegion “defaults to "us-east-1" (global endpoint)”, but the implementation only sets config.WithRegion("us-east-1") and does not explicitly configure the legacy/global STS endpoint. Please reword this to avoid implying global endpoint behavior, or explicitly configure the endpoint if that’s required.
| // If empty, defaults to "us-east-1" (global endpoint). | |
| // If empty, defaults to "us-east-1". |
There was a problem hiding this comment.
Fixed — removed the "(global endpoint)" phrasing.
pkg/ucp/aws/ucpcredentialprovider.go
Outdated
| reAssumeCfg, err := config.LoadDefaultConfig(ctx, | ||
| config.WithRegion(stsRegion), | ||
| config.WithCredentialsProvider(aws.CredentialsProviderFunc( | ||
| func(ctx context.Context) (aws.Credentials, error) { | ||
| return webIdentityCreds, nil | ||
| }, | ||
| )), | ||
| ) |
There was a problem hiding this comment.
Retrieve calls config.LoadDefaultConfig twice (for awscfg and reAssumeCfg). This is relatively expensive and will run every time credentials are refreshed. Consider reusing the first loaded config and creating the second STS client via explicit sts.Options (region + credentials) instead of re-loading the full default config.
There was a problem hiding this comment.
Fixed — now reuses the existing awscfg and creates the second STS client via sts.NewFromConfig(awscfg, ...) with a credentials override, avoiding the second LoadDefaultConfig call.
| // Step 1: Get web identity credentials via AssumeRoleWithWebIdentity. | ||
| webIdentityProvider := stscreds.NewWebIdentityRoleProvider( | ||
| stsClient, | ||
| s.IRSACredential.RoleARN, | ||
| stscreds.IdentityTokenFile(TokenFilePath), | ||
| func(o *stscreds.WebIdentityRoleOptions) { | ||
| o.RoleSessionName = sessionPrefix + uuid.New().String() | ||
| o.RoleSessionName = sessionPrefix + "wi-" + uuid.New().String() | ||
| if c.options.Duration > 0 { | ||
| o.Duration = c.options.Duration | ||
| } | ||
| }, | ||
| )) | ||
| ) | ||
|
|
||
| value, err = credsCache.Retrieve(ctx) | ||
| webIdentityCreds, err := webIdentityProvider.Retrieve(ctx) | ||
| if err != nil { | ||
| logger.Error(err, "Failed to retrieve web identity credentials") | ||
| return aws.Credentials{}, err | ||
| } | ||
| logger.Info("Successfully retrieved web identity credentials") | ||
|
|
There was a problem hiding this comment.
The new IRSA role-chaining path (AssumeRoleWithWebIdentity -> AssumeRole) isn’t covered by unit tests. Add a test that stubs STS responses to assert both calls happen, and that Duration and STSEndpointRegion propagate into the STS requests (including the session name changes).
There was a problem hiding this comment.
Agreed this would be valuable. The current STS client is created directly from the AWS SDK, so testing the IRSA path would require either injecting an STS client factory or using a higher-level interface. Will track this as a follow-up.
There was a problem hiding this comment.
@willdavsmith should we track this in a github issue?
- Validate DurationSeconds falls within STS-accepted range (15min-1hr) for role chaining sessions before setting it - Fix STSEndpointRegion comment to not imply global endpoint behavior - Reuse existing AWS config for second STS client instead of calling LoadDefaultConfig twice Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
pkg/ucp/aws/ucpcredentialprovider.go
Outdated
| logger.Info("Successfully re-assumed role for clean session credentials") | ||
|
|
||
| value = aws.Credentials{ | ||
| AccessKeyID: *assumeRoleOutput.Credentials.AccessKeyId, | ||
| SecretAccessKey: *assumeRoleOutput.Credentials.SecretAccessKey, | ||
| SessionToken: *assumeRoleOutput.Credentials.SessionToken, | ||
| Source: credentialSource, | ||
| CanExpire: true, | ||
| Expires: assumeRoleOutput.Credentials.Expiration.UTC(), |
There was a problem hiding this comment.
assumeRoleOutput.Credentials and its fields (AccessKeyId, SecretAccessKey, SessionToken, Expiration) are pointers in the AWS SDK. This code dereferences them without nil checks, and Expiration.UTC() will not compile if Expiration is *time.Time. Add a defensive nil check for assumeRoleOutput.Credentials and use a safe conversion/dereference for Expiration (and the string fields) before constructing aws.Credentials to avoid panics/compile errors.
| logger.Info("Successfully re-assumed role for clean session credentials") | |
| value = aws.Credentials{ | |
| AccessKeyID: *assumeRoleOutput.Credentials.AccessKeyId, | |
| SecretAccessKey: *assumeRoleOutput.Credentials.SecretAccessKey, | |
| SessionToken: *assumeRoleOutput.Credentials.SessionToken, | |
| Source: credentialSource, | |
| CanExpire: true, | |
| Expires: assumeRoleOutput.Credentials.Expiration.UTC(), | |
| if assumeRoleOutput.Credentials == nil { | |
| return aws.Credentials{}, fmt.Errorf("assume role returned nil credentials") | |
| } | |
| logger.Info("Successfully re-assumed role for clean session credentials") | |
| credentials := assumeRoleOutput.Credentials | |
| value = aws.Credentials{ | |
| AccessKeyID: aws.ToString(credentials.AccessKeyId), | |
| SecretAccessKey: aws.ToString(credentials.SecretAccessKey), | |
| SessionToken: aws.ToString(credentials.SessionToken), | |
| Source: credentialSource, | |
| CanExpire: true, | |
| Expires: aws.ToTime(credentials.Expiration).UTC(), |
There was a problem hiding this comment.
Good catch — added a nil check for assumeRoleOutput.Credentials and switched to aws.ToString/aws.ToTime for safe pointer dereferencing.
Use aws.ToString and aws.ToTime for safe pointer dereferencing to prevent potential panics if the STS response has nil fields. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| value.Expires = time.Now().UTC().Add(c.options.Duration) | ||
|
|
||
| if assumeRoleOutput.Credentials == nil { | ||
| return aws.Credentials{}, fmt.Errorf("AssumeRole returned nil credentials") |
There was a problem hiding this comment.
Error strings should be lowercase (Go convention). Consider changing fmt.Errorf("AssumeRole returned nil credentials") to start with a lowercase letter (and keep the rest consistent with other errors in this file).
| return aws.Credentials{}, fmt.Errorf("AssumeRole returned nil credentials") | |
| return aws.Credentials{}, fmt.Errorf("assumeRole returned nil credentials") |
Overview
This PR improves the robustness of the AWS IRSA (IAM Roles for Service Accounts) credential acquisition path. Split from #11538 to separate credential improvements from the bug fix in #11569.
Changes
Role chaining for IRSA credentials (
pkg/ucp/aws/ucpcredentialprovider.go)The IRSA credential path now performs a two-step credential retrieval:
AssumeRoleWithWebIdentity— obtains initial credentials using the OIDC tokenAssumeRoleon the same role — converts web identity federation credentials into standard IAM session credentialsThis avoids web identity session restrictions that can cause issues when downstream AWS services (e.g., CloudControl/CloudFormation) attempt further role assumptions.
Regional STS endpoint support
pkg/ucp/frontend/aws/routes.go: ReadsAWS_REGION/AWS_DEFAULT_REGIONenv vars for STS configurationdeploy/Chart/templates/ucp/deployment.yaml: InjectsAWS_REGIONenv var into the UCP containerdeploy/Chart/values.yaml: Addsglobal.aws.regionHelm value.github/workflows/functional-test-cloud.yaml: PassesAWS_REGIONto Helm installCode quality improvements
ctxinstead ofcontext.TODO()forconfig.LoadDefaultConfigcallsc.options.Durationto both web identity and AssumeRole sessionslogger.Errorfor error paths instead oflogger.Infoint32(duration / time.Second)for duration conversionIAM Role Requirement
The AWS IAM role trust policy must allow
sts:AssumeRolefromcloudformation.amazonaws.comand self-assumption for the role chaining to work:{ "Effect": "Allow", "Principal": { "Service": "cloudformation.amazonaws.com" }, "Action": "sts:AssumeRole" }This trust policy statement is already configured on the functional test IAM role.
Contributor checklist
Please verify that the PR meets the following requirements, where applicable: