Improve AWS IRSA credential handling with role chaining and regional STS #11570

Open
willdavsmith wants to merge 3 commits into main from improve-aws-irsa-credentials

Contributor

@willdavsmith willdavsmith commented Apr 7, 2026

Overview

This PR improves the robustness of the AWS IRSA (IAM Roles for Service Accounts) credential acquisition path. It was split from #11538 to separate the credential improvements from the bug fix in #11569.

Changes

Role chaining for IRSA credentials (pkg/ucp/aws/ucpcredentialprovider.go)

The IRSA credential path now performs a two-step credential retrieval:

  1. AssumeRoleWithWebIdentity — obtains initial credentials using the OIDC token
  2. AssumeRole on the same role — converts web identity federation credentials into standard IAM session credentials

This avoids web identity session restrictions that can cause issues when downstream AWS services (e.g., CloudControl/CloudFormation) attempt further role assumptions.

Regional STS endpoint support

  • pkg/ucp/frontend/aws/routes.go: Reads AWS_REGION/AWS_DEFAULT_REGION env vars for STS configuration
  • deploy/Chart/templates/ucp/deployment.yaml: Injects AWS_REGION env var into the UCP container
  • deploy/Chart/values.yaml: Adds global.aws.region Helm value
  • .github/workflows/functional-test-cloud.yaml: Passes AWS_REGION to Helm install
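
As a rough sketch of the region lookup described above: the helper name `resolveSTSRegion`, the precedence of AWS_REGION over AWS_DEFAULT_REGION, and applying the "us-east-1" fallback in the same function are assumptions made for illustration, not necessarily how routes.go is written.

```go
package main

import (
	"fmt"
	"os"
)

// defaultSTSRegion mirrors the "us-east-1" default mentioned for an empty
// STSEndpointRegion.
const defaultSTSRegion = "us-east-1"

// resolveSTSRegion is a hypothetical helper. It prefers AWS_REGION, then
// AWS_DEFAULT_REGION, and falls back to defaultSTSRegion when neither is set.
// getenv is injected so the lookup can be exercised without modifying the
// process environment.
func resolveSTSRegion(getenv func(string) string) string {
	for _, key := range []string{"AWS_REGION", "AWS_DEFAULT_REGION"} {
		if v := getenv(key); v != "" {
			return v
		}
	}
	return defaultSTSRegion
}

func main() {
	fmt.Println("STS region:", resolveSTSRegion(os.Getenv))
}
```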

Code quality improvements

  • Use ctx instead of context.TODO() for config.LoadDefaultConfig calls
  • Apply c.options.Duration to both web identity and AssumeRole sessions
  • Use logger.Error for error paths instead of logger.Info
  • Use idiomatic int32(duration / time.Second) for duration conversion

IAM Role Requirement

The AWS IAM role trust policy must allow sts:AssumeRole from cloudformation.amazonaws.com and from the role itself (self-assumption) for role chaining to work. The statement for the CloudFormation service principal:

{
    "Effect": "Allow",
    "Principal": {
        "Service": "cloudformation.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
}

This trust policy statement is already configured on the functional test IAM role.

Contributor checklist

Please verify that the PR meets the following requirements, where applicable:

  • An overview of proposed schema changes is included in a linked GitHub issue.
    • Yes
    • Not applicable
  • A design document PR is created in the design-notes repository, if new APIs are being introduced.
    • Yes
    • Not applicable
  • The design document has been reviewed and approved by Radius maintainers/approvers.
    • Yes
    • Not applicable
  • A PR for the samples repository is created, if existing samples are affected by the changes in this PR.
    • Yes
    • Not applicable
  • A PR for the documentation repository is created, if the changes in this PR affect the documentation or any user facing updates are made.
    • Yes
    • Not applicable
  • A PR for the recipes repository is created, if existing recipes are affected by the changes in this PR.
    • Yes
    • Not applicable

Copilot AI review requested due to automatic review settings April 7, 2026 17:02
@willdavsmith willdavsmith requested review from a team as code owners April 7, 2026 17:02
@willdavsmith willdavsmith requested a deployment to external-contributor-approval April 7, 2026 17:02 — with GitHub Actions Waiting
Contributor

Copilot AI left a comment

Pull request overview

Improves AWS IRSA credential acquisition in the UCP by introducing role chaining (web identity → standard AssumeRole session) and enabling optional regional STS endpoint selection, with supporting Helm/workflow wiring.

Changes:

  • Add IRSA role chaining in UCPCredentialProvider and propagate STS endpoint region configuration.
  • Read AWS_REGION / AWS_DEFAULT_REGION in the AWS frontend module and pass through to the credential provider.
  • Add Helm value + deployment env var injection for AWS_REGION, and pass the value in the cloud functional test workflow.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Summary per file:

  • pkg/ucp/frontend/aws/routes.go: Reads region env vars and passes STS region to the UCP AWS credential provider.
  • pkg/ucp/aws/ucpcredentialprovider.go: Implements IRSA role chaining and adds STS endpoint region option.
  • pkg/ucp/aws/ucpcredentialprovider_test.go: Updates constructor usage and validates default/explicit STS region option behavior.
  • deploy/Chart/values.yaml: Adds global.aws.region Helm value.
  • deploy/Chart/templates/ucp/deployment.yaml: Injects AWS_REGION env var into the UCP deployment when configured.
  • .github/workflows/functional-test-cloud.yaml: Passes global.aws.region into Helm install; updates pinned action SHAs.


  - name: Setup Go
-   uses: actions/setup-go@4a3601121dd01d1626a1e23e37211e3254c1c06c # v6.4.0
+   uses: actions/setup-go@4b73464bb391d4059bd26b0524d20df3927bd417 # v6.3.0
Copilot AI Apr 7, 2026

This pins actions/setup-go to v6.3.0, while other workflows in this repo are pinned to v6.4.0 (e.g., .github/workflows/build.yaml). Unless there’s a specific regression being worked around, keep the version/SHA consistent across workflows and prefer the newer pinned revision to reduce maintenance and security risk.

Suggested change
- uses: actions/setup-go@4b73464bb391d4059bd26b0524d20df3927bd417 # v6.3.0
+ uses: actions/setup-go@v6.4.0 # v6.4.0

Copilot uses AI. Check for mistakes.
Contributor Author

These version mismatches were from the branch being based on a stale point in main's history. Rebased on latest main and reset the workflow file — the only diff now is the --set global.aws.region line we added. The SHAs need to be locked to specific commits as we do throughout the rest of the repo, so taking main's current pinned versions is the right approach.

permission-checks: write

- - uses: LouisBrunner/checks-action@937cbbcde3259005b50746dc91cde29098aac2ff # v3.1.0
+ - uses: LouisBrunner/checks-action@dfcbcf801bff1ea7f1414824fc28f2cd697b35da # v3.0.0
Copilot AI Apr 7, 2026

This changes LouisBrunner/checks-action from v3.1.0 to v3.0.0. Please avoid downgrading action versions unless necessary; either keep it at the newer pinned SHA used previously or add a short comment explaining the compatibility reason for the downgrade.

Suggested change
- - uses: LouisBrunner/checks-action@dfcbcf801bff1ea7f1414824fc28f2cd697b35da # v3.0.0
+ - uses: LouisBrunner/checks-action@v3.1.0

Contributor Author

Same as above — resolved by rebasing on main. These stale SHAs are no longer in the diff.

permission-checks: write
permission-pull-requests: write
permission-contents: read

Copilot AI Apr 7, 2026

This changes LouisBrunner/checks-action from v3.1.0 to v3.0.0. Please avoid downgrading action versions unless necessary; either keep it at the newer pinned SHA used previously or add a short comment explaining the compatibility reason for the downgrade.

Suggested change
# Intentionally pinned to v3.0.0 for workflow compatibility; do not upgrade back to v3.1.0 until that issue is resolved.

Contributor Author

Resolved by rebase — no longer in the diff.

echo "test_status=$TEST_STATUS" >> $GITHUB_OUTPUT

- - uses: LouisBrunner/checks-action@937cbbcde3259005b50746dc91cde29098aac2ff # v3.1.0
+ - uses: LouisBrunner/checks-action@dfcbcf801bff1ea7f1414824fc28f2cd697b35da # v3.0.0
Copilot AI Apr 7, 2026

This changes LouisBrunner/checks-action from v3.1.0 to v3.0.0. Please avoid downgrading action versions unless necessary; either keep it at the newer pinned SHA used previously or add a short comment explaining the compatibility reason for the downgrade.

Suggested change
- - uses: LouisBrunner/checks-action@dfcbcf801bff1ea7f1414824fc28f2cd697b35da # v3.0.0
+ - uses: LouisBrunner/checks-action@v3.1.0

Contributor Author

Resolved by rebase — no longer in the diff.

github-actions bot commented Apr 7, 2026

Unit Tests

    2 files  ±0    415 suites  ±0   6m 42s ⏱️ -7s
4 872 tests ±0  4 870 ✅ ±0  2 💤 ±0  0 ❌ ±0 
5 774 runs  ±0  5 772 ✅ ±0  2 💤 ±0  0 ❌ ±0 

Results for commit 478a568. ± Comparison against base commit ec0db63.

♻️ This comment has been updated with latest results.

codecov bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 12.00000% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.34%. Comparing base (ec0db63) to head (478a568).
⚠️ Report is 1 commit behind head on main.

Files with missing lines:

  • pkg/ucp/aws/ucpcredentialprovider.go: patch coverage 13.63%, 38 lines missing ⚠️
  • pkg/ucp/frontend/aws/routes.go: patch coverage 0.00%, 6 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11570      +/-   ##
==========================================
- Coverage   51.38%   51.34%   -0.04%     
==========================================
  Files         699      699              
  Lines       44114    44148      +34     
==========================================
+ Hits        22668    22669       +1     
- Misses      19278    19310      +32     
- Partials     2168     2169       +1     

Add role chaining to the IRSA credential path: after AssumeRoleWithWebIdentity,
perform a follow-up AssumeRole on the same role to produce standard IAM session
credentials. This avoids web identity session restrictions that can cause
issues with downstream AWS services.

Also adds support for configuring the STS endpoint region via AWS_REGION
environment variable, plumbed through Helm values and the cloud functional
test workflow.

Additional improvements from review feedback:
- Use ctx instead of context.TODO() for config.LoadDefaultConfig calls
- Apply credential Duration to both web identity and AssumeRole sessions
- Use logger.Error for error paths instead of logger.Info
- Use int32 division for duration conversion instead of float-based .Seconds()

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@willdavsmith willdavsmith force-pushed the improve-aws-irsa-credentials branch from 1e198f2 to c7301e3 Compare April 7, 2026 20:50
@willdavsmith willdavsmith requested a deployment to external-contributor-approval April 7, 2026 20:50 — with GitHub Actions Waiting
@willdavsmith willdavsmith requested a review from Copilot April 7, 2026 20:52
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

	RoleArn:         &s.IRSACredential.RoleARN,
	RoleSessionName: aws.String(sessionPrefix + uuid.New().String()),
}
if c.options.Duration > 0 {
Copilot AI Apr 7, 2026

DurationSeconds is derived via int32(c.options.Duration / time.Second), which truncates sub-second durations to 0 and can also produce values outside STS’s accepted range (and for role chaining the max is more restrictive). Validate/clamp the duration (or omit DurationSeconds when it would be invalid) before calling AssumeRole to avoid runtime validation errors.

Suggested change
- if c.options.Duration > 0 {
+ if c.options.Duration >= 15*time.Minute && c.options.Duration <= time.Hour {

Contributor Author

Good catch — added validation to only set DurationSeconds when the value falls within the STS-accepted range for role chaining (15 min to 1 hour).

// STSEndpointRegion is the AWS region to use for the STS endpoint when retrieving
// IRSA credentials. Using a regional STS endpoint (matching the target service region)
// avoids token compatibility issues with some AWS services like CloudWatch Logs.
// If empty, defaults to "us-east-1" (global endpoint).
Copilot AI Apr 7, 2026

The field comment says an empty STSEndpointRegion “defaults to "us-east-1" (global endpoint)”, but the implementation only sets config.WithRegion("us-east-1") and does not explicitly configure the legacy/global STS endpoint. Please reword this to avoid implying global endpoint behavior, or explicitly configure the endpoint if that’s required.

Suggested change
- // If empty, defaults to "us-east-1" (global endpoint).
+ // If empty, defaults to "us-east-1".

Contributor Author

Fixed — removed the "(global endpoint)" phrasing.

Comment on lines +164 to +171
reAssumeCfg, err := config.LoadDefaultConfig(ctx,
	config.WithRegion(stsRegion),
	config.WithCredentialsProvider(aws.CredentialsProviderFunc(
		func(ctx context.Context) (aws.Credentials, error) {
			return webIdentityCreds, nil
		},
	)),
)
Copilot AI Apr 7, 2026

Retrieve calls config.LoadDefaultConfig twice (for awscfg and reAssumeCfg). This is relatively expensive and will run every time credentials are refreshed. Consider reusing the first loaded config and creating the second STS client via explicit sts.Options (region + credentials) instead of re-loading the full default config.

Contributor Author

Fixed — now reuses the existing awscfg and creates the second STS client via sts.NewFromConfig(awscfg, ...) with a credentials override, avoiding the second LoadDefaultConfig call.

Comment on lines +137 to +156
  // Step 1: Get web identity credentials via AssumeRoleWithWebIdentity.
  webIdentityProvider := stscreds.NewWebIdentityRoleProvider(
  	stsClient,
  	s.IRSACredential.RoleARN,
  	stscreds.IdentityTokenFile(TokenFilePath),
  	func(o *stscreds.WebIdentityRoleOptions) {
- 		o.RoleSessionName = sessionPrefix + uuid.New().String()
+ 		o.RoleSessionName = sessionPrefix + "wi-" + uuid.New().String()
  		if c.options.Duration > 0 {
  			o.Duration = c.options.Duration
  		}
  	},
- ))
+ )

- value, err = credsCache.Retrieve(ctx)
+ webIdentityCreds, err := webIdentityProvider.Retrieve(ctx)
  if err != nil {
  	logger.Error(err, "Failed to retrieve web identity credentials")
  	return aws.Credentials{}, err
  }
  logger.Info("Successfully retrieved web identity credentials")

Copilot AI Apr 7, 2026

The new IRSA role-chaining path (AssumeRoleWithWebIdentity -> AssumeRole) isn’t covered by unit tests. Add a test that stubs STS responses to assert both calls happen, and that Duration and STSEndpointRegion propagate into the STS requests (including the session name changes).

Copilot generated this review using guidance from repository custom instructions.
Contributor Author

Agreed this would be valuable. The current STS client is created directly from the AWS SDK, so testing the IRSA path would require either injecting an STS client factory or using a higher-level interface. Will track this as a follow-up.

Member

@willdavsmith should we track this in a GitHub issue?

- Validate DurationSeconds falls within STS-accepted range (15min-1hr)
  for role chaining sessions before setting it
- Fix STSEndpointRegion comment to not imply global endpoint behavior
- Reuse existing AWS config for second STS client instead of calling
  LoadDefaultConfig twice

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comment on lines +188 to +196
logger.Info("Successfully re-assumed role for clean session credentials")

value = aws.Credentials{
	AccessKeyID:     *assumeRoleOutput.Credentials.AccessKeyId,
	SecretAccessKey: *assumeRoleOutput.Credentials.SecretAccessKey,
	SessionToken:    *assumeRoleOutput.Credentials.SessionToken,
	Source:          credentialSource,
	CanExpire:       true,
	Expires:         assumeRoleOutput.Credentials.Expiration.UTC(),
Copilot AI Apr 7, 2026

assumeRoleOutput.Credentials and its fields (AccessKeyId, SecretAccessKey, SessionToken, Expiration) are pointers in the AWS SDK. This code dereferences them without nil checks, and Expiration.UTC() will not compile if Expiration is *time.Time. Add a defensive nil check for assumeRoleOutput.Credentials and use a safe conversion/dereference for Expiration (and the string fields) before constructing aws.Credentials to avoid panics/compile errors.

Suggested change
- logger.Info("Successfully re-assumed role for clean session credentials")
- value = aws.Credentials{
- 	AccessKeyID:     *assumeRoleOutput.Credentials.AccessKeyId,
- 	SecretAccessKey: *assumeRoleOutput.Credentials.SecretAccessKey,
- 	SessionToken:    *assumeRoleOutput.Credentials.SessionToken,
- 	Source:          credentialSource,
- 	CanExpire:       true,
- 	Expires:         assumeRoleOutput.Credentials.Expiration.UTC(),
+ if assumeRoleOutput.Credentials == nil {
+ 	return aws.Credentials{}, fmt.Errorf("assume role returned nil credentials")
+ }
+ logger.Info("Successfully re-assumed role for clean session credentials")
+ credentials := assumeRoleOutput.Credentials
+ value = aws.Credentials{
+ 	AccessKeyID:     aws.ToString(credentials.AccessKeyId),
+ 	SecretAccessKey: aws.ToString(credentials.SecretAccessKey),
+ 	SessionToken:    aws.ToString(credentials.SessionToken),
+ 	Source:          credentialSource,
+ 	CanExpire:       true,
+ 	Expires:         aws.ToTime(credentials.Expiration).UTC(),

Contributor Author

Good catch — added a nil check for assumeRoleOutput.Credentials and switched to aws.ToString/aws.ToTime for safe pointer dereferencing.

Use aws.ToString and aws.ToTime for safe pointer dereferencing
to prevent potential panics if the STS response has nil fields.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

value.Expires = time.Now().UTC().Add(c.options.Duration)

if assumeRoleOutput.Credentials == nil {
return aws.Credentials{}, fmt.Errorf("AssumeRole returned nil credentials")
Copilot AI Apr 7, 2026

Error strings should be lowercase (Go convention). Consider changing fmt.Errorf("AssumeRole returned nil credentials") to start with a lowercase letter (and keep the rest consistent with other errors in this file).

Suggested change
- return aws.Credentials{}, fmt.Errorf("AssumeRole returned nil credentials")
+ return aws.Credentials{}, fmt.Errorf("assumeRole returned nil credentials")

3 participants