Split "set target release" endpoint into two: one for update, one for mupdate recovery #9887

Open
jgallagher wants to merge 20 commits into main from john/split-target-release-endpoint

@jgallagher
Contributor

The existing "set target release" external API endpoint is used for two reasons:

  1. To start a new online update
  2. To inform Nexus that a mupdate has occurred, and to allow reconfigurator to recover from that mupdate

However, the checks we ought to perform for "should the new target release version be allowed" are quite different for the two cases, and the single endpoint was both too strict and too loose. A couple of examples of incorrect behavior prior to this PR:

  1. We refused to recover from a mupdate if the version we mupdated to was below the current target version (even if no downgrade had actually taken place; see below for an example of how this could happen).
  2. We spuriously allowed setting the target release to itself (#9113, "should not be able to set target release to itself (unless MUPdate happened)?").

As of this change, there are separate "set target release for update" and "set target release for mupdate recovery" endpoints with more correct validation for each intent. In the two examples above:

  1. This is now allowed - if we're in a "need recovery from mupdate" case, we allow any new target version. (If it doesn't match the software that we actually mupdated to, the planner won't be able to match up artifacts, so we'll stay in the "need recovery from mupdate" case until the correct version is set.)
  2. This is no longer allowed - "set target release for update" now rejects setting the release version to itself.
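To make the two sets of rules concrete, here is a minimal sketch of the split validation. It is not the code in this PR: `Version`, `TargetState`, `Rejection`, and both function names are hypothetical, the version type compares only major.minor.patch, and the order of the checks is inferred from the error messages in the testing notes later in this thread.

```rust
/// Simplified stand-in for a release version (assumption: only
/// major.minor.patch is compared; real versions such as
/// "20.0.0-0.local+git553e6c0886a" carry more fields).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Version(u32, u32, u32);

/// Hypothetical summary of the state the checks depend on.
#[derive(Debug, Clone, Copy)]
struct TargetState {
    /// `None` until a target release has been set for the first time.
    current_target: Option<Version>,
    /// True while the system still needs to recover from a mupdate.
    mupdate_recovery_needed: bool,
    /// True while a previously requested update is still running.
    update_in_progress: bool,
}

/// Hypothetical rejection reasons, one per error seen in the testing notes.
#[derive(Debug, PartialEq, Eq)]
enum Rejection {
    AlreadyTargeting,
    Downgrade,
    MupdateRecoveryNeeded,
    UpdateInProgress,
    NoMupdateDetected,
}

/// "Set target release for update": strict version ordering applies.
fn check_set_for_update(
    state: &TargetState,
    requested: Version,
) -> Result<(), Rejection> {
    // The initial target release is always allowed.
    let current = match state.current_target {
        None => return Ok(()),
        Some(v) => v,
    };
    if requested == current {
        return Err(Rejection::AlreadyTargeting);
    }
    if requested < current {
        return Err(Rejection::Downgrade);
    }
    if state.mupdate_recovery_needed {
        return Err(Rejection::MupdateRecoveryNeeded);
    }
    if state.update_in_progress {
        return Err(Rejection::UpdateInProgress);
    }
    Ok(())
}

/// "Set target release for mupdate recovery": any version is accepted, but
/// only while recovery is actually needed, so no version argument is taken.
fn check_set_for_mupdate_recovery(state: &TargetState) -> Result<(), Rejection> {
    if state.mupdate_recovery_needed {
        Ok(())
    } else {
        Err(Rejection::NoMupdateDetected)
    }
}
```

In this model, a system that targets R18 but still needs mupdate recovery rejects an "update" to R17 as a downgrade, while the recovery check accepts the request.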

Closes #9113. Also addresses an issue @askfongjojo ran into on a racklette recently with needing to "downgrade"; e.g., in a sequence like this:

  1. Install R16
  2. Mupdate to R17
  3. Upload TUF repos for R17 and R18
  4. Set target release to R18 (oops! This should have been R17, and now we have no way to proceed other than mupdating to R18)

After this change, we can correct the mistake in step 4: because R18 wasn't the release actually deployed, we'd still be in the "need to recover from mupdate" state, allowing the operator to set the target release back to R17.

@david-crespo
Contributor

Schema diff. Very simple, nice that they take the same params. Do you think I should expose this functionality in the console? Probably not, right?

--- a/2026021301.0.0-6e51ab/spec.json
+++ b/2026021800.0.0-38e767/spec.json
@@ -7,7 +7,7 @@
       "url": "https://oxide.computer",
       "email": "api@oxide.computer"
     },
-    "version": "2026021301.0.0"
+    "version": "2026021800.0.0"
   },
   "paths": {
     "/device/auth": {
@@ -12383,6 +12383,35 @@
         }
       }
     },
+    "/v1/system/update/target-release/recovery": {
+      "put": {
+        "tags": ["system/update"],
+        "summary": "Recover from an Oxide-support-driven system update",
+        "description": "Inform the control plane of the release of the rack's system software it is now running due to a recovery operation (\"mupdate\") performed by Oxide support.\n\nThis endpoint should only be called at the direction of Oxide support.",
+        "operationId": "target_release_update_recovery",
+        "requestBody": {
+          "content": {
+            "application/json": {
+              "schema": {
+                "$ref": "#/components/schemas/SetTargetReleaseParams"
+              }
+            }
+          },
+          "required": true
+        },
+        "responses": {
+          "204": {
+            "description": "resource updated"
+          },
+          "4XX": {
+            "$ref": "#/components/responses/Error"
+          },
+          "5XX": {
+            "$ref": "#/components/responses/Error"
+          }
+        }
+      }
+    },
     "/v1/system/update/trust-roots": {
       "get": {
         "tags": ["system/update"],
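
For readers who want to see the request shape, here is a rough sketch of what a call to the new endpoint carries. The path and method come from the diff above. The `system_version` field name is an assumption taken from the `--field` arguments in the testing notes later in this thread, and `json_string` and `recovery_request` are hypothetical helpers, not part of any Oxide client.

```rust
/// Path of the recovery endpoint, copied from the schema diff.
const RECOVERY_PATH: &str = "/v1/system/update/target-release/recovery";

/// Hypothetical helper: encode a string as a JSON string literal.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            // Control characters must be escaped in JSON strings.
            c if (c as u32) < 0x20 => {
                out.push_str(&format!("\\u{:04x}", c as u32))
            }
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Hypothetical helper: method, path, and JSON body for a recovery request.
/// Assumes `SetTargetReleaseParams` has a single `system_version` field.
fn recovery_request(system_version: &str) -> (&'static str, &'static str, String) {
    let body = format!("{{\"system_version\":{}}}", json_string(system_version));
    ("PUT", RECOVERY_PATH, body)
}
```

Per the schema, a successful call returns 204 with no body.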

@jgallagher
Contributor Author

jgallagher commented Feb 20, 2026

> Schema diff. Very simple, nice that they take the same params. Do you think I should expose this functionality in the console? Probably not, right?

Probably not, yeah. @ahl and I chatted about this a few weeks ago, and IIRC we wanted to tuck this operation somewhere out of the main path even in the CLI, since it should only be called after support performs a mupdate (and will fail if called any other time anyway).

Collaborator

@davepacheco davepacheco left a comment

(still going through nexus/src/app/deployment.rs but wanted to leave this before the watercooler)

Collaborator

@davepacheco davepacheco left a comment

This looks good!

I think it wouldn't hurt to get another set of eyes on it, given how tricky and important this is.

// bypass all our typical version ordering requirements, so we have to allow
// recovery to the _actual_ version it installed, regardless of what we
// currently have on the system.
// does not take an arguments about the proposed system version (unlike
Collaborator

Suggested change
-// does not take an arguments about the proposed system version (unlike
+// does not take any arguments about the proposed system version (unlike

Comment on lines 668 to 669
// Update status of a sled, not considering its zones, based on the current
// target version.
Collaborator

Suggested change
-// Update status of a sled, not considering its zones, based on the current
-// target version.
+// Status of any update or mupdate on a sled, not considering its zones, based on the current
+// target version.

(easy to misread "Update status" as "this is going to update the status")

@jgallagher
Contributor Author

Testing notes from dublin:

I initially set up the rack with the TUF repo from this branch (version 19.0.0-0.ci+git44ac79d168b). I built a fake R20 (20.0.0-0.local+git553e6c0886a) and R20.1 (20.1.0-0.local+git6b95112f000).

The first request was to set the target release "for update" to my fake R20. This succeeded, because we always allow the initial target release to be set. After this, update status reported that some components were running from this version (I didn't change hubris images, for example), but not all:

{
  "components_by_release_version": {
    "install dataset": 59,
    "20.0.0-0.local+git553e6c0886a": 17,
    "unknown": 9
  },
  "suspended": false,
  "target_release": {
    "time_requested": "2026-02-27T23:16:36.127893Z",
    "version": "20.0.0-0.local+git553e6c0886a"
  },
  "time_last_step_planned": "2026-02-27T19:51:21.296256Z"
}

Prior to this PR we'd be stuck here. We need to set the correct release, 19.0.0-0.ci+git44ac79d168b, but that would have looked like a downgrade and been rejected. While we're still in this state, requests to change the target release "for update" are rejected as expected. We can't set the version to our current target version:

% oxide system update target-release update --system-version '20.0.0-0.local+git553e6c0886a'
Error Response: status: 400 Bad Request; headers: {"content-type": "application/json", "x-request-id": "bb61f4f6-06df-45ed-95d3-d18ef5190c82", "content-length": "237", "date": "Fri, 27 Feb 2026 23:18:47 GMT"}; value: Error { error_code: Some("InvalidRequest"), message: "Target release cannot be changed: cannot update to target release 20.0.0-0.local+git553e6c0886a (already targeting that version)", request_id: "bb61f4f6-06df-45ed-95d3-d18ef5190c82" }

We can't downgrade:

% oxide system update target-release update --system-version '19.0.0-0.ci+git44ac79d168b'
Error Response: status: 400 Bad Request; headers: {"content-type": "application/json", "x-request-id": "d74af818-746b-4018-b4f4-740442c589d2", "content-length": "295", "date": "Fri, 27 Feb 2026 23:22:45 GMT"}; value: Error { error_code: Some("InvalidRequest"), message: "Target release cannot be changed: cannot downgrade: requested target release version 19.0.0-0.ci+git44ac79d168b is older than current target release version 20.0.0-0.local+git553e6c0886a", request_id: "d74af818-746b-4018-b4f4-740442c589d2" }

And we can't start an upgrade because we're waiting for mupdate recovery:

% oxide system update target-release update --system-version '20.1.0-0.local+git6b95112f000'
Error Response: status: 400 Bad Request; headers: {"content-type": "application/json", "x-request-id": "e0955f1d-1696-4359-b9d8-995254428ad1", "content-length": "217", "date": "Fri, 27 Feb 2026 23:51:43 GMT"}; value: Error { error_code: Some("InvalidRequest"), message: "Target release cannot be changed: a support-driven recovery (mupdate) has occurred and must be cleared first", request_id: "e0955f1d-1696-4359-b9d8-995254428ad1" }

However, we can successfully use the new recovery-finish API. We can set the target release to the current target version (not useful here):

% oxide api /v1/system/update/recovery-finish -X PUT --field "system_version=20.0.0-0.local+git553e6c0886a"

and we can use it to downgrade (which is useful here):

% oxide api /v1/system/update/recovery-finish -X PUT --field "system_version=19.0.0-0.ci+git44ac79d168b"

A few minutes after doing so, the system recognized that all components were on the target version:

{
  "components_by_release_version": {
    "19.0.0-0.ci+git44ac79d168b": 85
  },
  "suspended": false,
  "target_release": {
    "time_requested": "2026-02-27T23:58:20.286946Z",
    "version": "19.0.0-0.ci+git44ac79d168b"
  },
  "time_last_step_planned": "2026-02-27T23:59:23.469113Z"
}
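
One way to read the two status reports is "every component is attributed to the target version". The sketch below spells out that reading. It is only an illustration, not how Nexus computes update status, and `all_components_on_target` is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Returns true when every reported component is attributed to `target`.
/// Buckets such as "install dataset" or "unknown" count as not on target.
fn all_components_on_target(
    components_by_release_version: &BTreeMap<String, u32>,
    target: &str,
) -> bool {
    let total: u32 = components_by_release_version.values().sum();
    let on_target = components_by_release_version
        .get(target)
        .copied()
        .unwrap_or(0);
    // An empty report is treated as "not done" rather than trivially done.
    total > 0 && on_target == total
}
```

Applied to the first report in these notes (59 + 17 + 9 components, only 17 on the target) this returns false. Applied to the report after recovery (all 85 on the target) it returns true.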

We still can't start a new update to the current target version, as expected:

% oxide system update target-release update --system-version '19.0.0-0.ci+git44ac79d168b'
Error Response: status: 400 Bad Request; headers: {"content-type": "application/json", "x-request-id": "699a9de3-b00f-4575-8d46-0a86ae96f90d", "content-length": "234", "date": "Sat, 28 Feb 2026 00:00:47 GMT"}; value: Error { error_code: Some("InvalidRequest"), message: "Target release cannot be changed: cannot update to target release 19.0.0-0.ci+git44ac79d168b (already targeting that version)", request_id: "699a9de3-b00f-4575-8d46-0a86ae96f90d" }

But now we can start an update to a later version:

% oxide system update target-release update --system-version '20.0.0-0.local+git553e6c0886a'

While that update is running, we can't start another one, as expected:

% oxide system update target-release update --system-version '20.1.0-0.local+git6b95112f000'
Error Response: status: 400 Bad Request; headers: {"content-type": "application/json", "x-request-id": "1f414b7a-6082-4913-a814-dd4abdb98b42", "content-length": "181", "date": "Sat, 28 Feb 2026 00:02:46 GMT"}; value: Error { error_code: Some("InvalidRequest"), message: "Target release cannot be changed: a previous update is still in progress", request_id: "1f414b7a-6082-4913-a814-dd4abdb98b42" }

and we can't use the recovery-finish endpoint, because nothing has been mupdated:

% oxide api /v1/system/update/recovery-finish -X PUT --field "system_version=19.0.0-0.ci+git44ac79d168b"
error; status code: 400 Bad Request
{
  "error_code": "InvalidRequest",
  "message": "Target release cannot be changed: no evidence a mupdate has occurred - recovery not needed",
  "request_id": "9abc8a51-9b77-468e-83c1-b56d50463062"
}
