Skip to content

feat(fal): add SSH lease provider#693

Open
coygeek wants to merge 16 commits into
openclaw:mainfrom
coygeek:fal-ssh-lease-provider
Open

feat(fal): add SSH lease provider#693
coygeek wants to merge 16 commits into
openclaw:mainfrom
coygeek:fal-ssh-lease-provider

Conversation

@coygeek

@coygeek coygeek commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Closes #694

Summary

Adds a direct fal Compute SSH lease provider with Crabbox-managed SSH and sync support, local-claim-owned cleanup, and the fal-ai alias.

This also adds fal provider docs, provider matrix metadata, benchmark category generation, and an opt-in guarded live smoke script that defaults to no live provider mutation unless CRABBOX_LIVE=1 and fal is selected.

Lifecycle Safety

  • Uses the fal Compute Idempotency-Key header with the Crabbox lease ID for creates.
  • Retries ambiguous create responses with the same idempotency key and requires a provider instance ID before proceeding.
  • Avoids persisting unrecoverable empty-provider-ID recovery claims.
  • Rolls back callback/readiness/bootstrap failures unless --keep explicitly owns a failed-acquire recovery claim.
  • Cleanup removes provider-absent local claims before TTL filtering and only deletes local-claim-owned fal instances.

Verification

  • Exact candidate: 14dbee97e4b0c5ed215d3e13552a1de3ace4d48c
  • gofmt -w internal/providers/fal/backend.go internal/providers/fal/backend_test.go && git diff --check
  • go test -race ./internal/providers/fal ./internal/cli -run 'Fal|fal' -count=1
  • bash -n scripts/live-fal-smoke.sh
  • node scripts/generate-provider-matrix.mjs --check
  • scripts/check-docs.sh
  • node --test scripts/live-fal-smoke.test.js scripts/live-smoke.test.js
  • CRABBOX_LIVE= CRABBOX_LIVE_PROVIDERS= FAL_KEY= CRABBOX_FAL_KEY= scripts/live-fal-smoke.sh
  • go build -trimpath -o bin/crabbox ./cmd/crabbox
  • outside-source CLI proof for crabbox providers --json, doctor --provider fal --json, list --provider fal --json, and cleanup --provider fal --dry-run with blank fal credentials
  • go vet ./...
  • go test -race -timeout 8m ./...
  • fal package coverage raised from 75.2% to 80.5%; exact repository core coverage gate passes at 90.1%
  • RunPod rollback timing regression repeated 10/10 after a single unrelated full-suite load miss
  • Worker: format, lint, typecheck, 760 Vitest tests, Wrangler dry-run build
  • docs site build
  • autoreview --mode branch --base origin/main: clean, no accepted/actionable findings
  • focused coverage repair AutoReview: clean, no accepted/actionable findings

The rebased candidate also fixes the default 1× H100 create path: it no longer sends the 8×-only sector field, even when stale configuration supplies one.

Live proof gate

Authenticated create/use/destroy proof is still required before merge. Targeted 1Password lookups found no fal credential record, and neither CRABBOX_FAL_KEY nor FAL_KEY is set, so the exact candidate could not make an authorized fal API call. No live fal resources were created.

@clawsweeper

clawsweeper Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed July 4, 2026, 8:51 AM ET / 12:51 UTC.

Summary
Adds a built-in fal Compute SSH lease provider with CLI config/env/flags, lifecycle backend/client, docs/provider metadata, guarded live-smoke scripts, and tests.

Reproducibility: yes. from source, but not from a live fal account. The PR defaults fal SSH to root, copies that into cfg.SSHUser, and builds the SSH target from that config while #694 expects ubuntu.

Review metrics: 2 noteworthy metrics.

  • Diff size: 30 files changed, +3948/-1. The PR spans provider code, CLI config, docs metadata, scripts, tests, and release notes, so maintainers should treat it as a broad new provider surface.
  • Authenticated fal proof: 0 live fal resources created. The PR body says no authorized fal API call was made, leaving the billable lifecycle unproven.

Root-cause cluster
Relationship: fixed_by_candidate
Canonical: #694
Summary: This PR is the open candidate implementation for the canonical fal Compute provider feature request.

Members:

Proposal only: this assessment does not dispatch repair, suppress jobs, mutate sibling items, close, or merge anything.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🦐 gold shrimp
Result: blocked until real behavior proof is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Default fal SSH user, docs, smoke config, and tests to ubuntu or a provider-returned user.
  • Remove the release-owned changelog entry from this feature PR.
  • [P1] Add redacted authenticated live proof for fal doctor, create, SSH/run, stop, and cleanup.

Proof guidance:

  • [P1] Needs real behavior proof before merge: The PR body says authenticated live fal create/use/destroy proof is still required, no fal credential was available, and no live fal resources were created; add redacted terminal output or logs and update the PR body for re-review.

Risk before merge

  • [P1] The default fal path can create a billable instance and then fail SSH readiness if fal Compute expects ubuntu rather than root.
  • [P1] Offline unit tests and blank-credential CLI checks do not prove the live billable provider lifecycle against fal Compute.
  • [P1] Merging a new billable provider without authenticated live proof leaves cleanup and resource-residue behavior unproven against the real API.

Maintainer options:

  1. Fix and prove the live lifecycle (recommended)
    Default or derive the correct fal SSH user, remove the release-owned changelog edit, and add redacted authenticated output for doctor, create, SSH/run, stop, and cleanup.
  2. Accept unproven provider risk
    Maintainers could intentionally merge without live fal proof, but they would own first-use failure and billable cleanup risk.
  3. Pause until fal access is available
    Hold the PR until someone with fal Compute credentials can refresh the proof and confirm the lifecycle against the real provider.

Next step before merge

  • [P1] Human follow-up is required because the remaining blockers include contributor/operator live fal proof that automation cannot supply.

Security
Cleared: No concrete security or supply-chain regression was found; fal credentials stay env-only and the client/script include redirect guards and redaction paths.

Review findings

  • [P2] Default fal SSH to ubuntu — internal/cli/config.go:2002
  • [P3] Remove the release-owned changelog entry — CHANGELOG.md:7
Review details

Best possible solution:

Land the fal provider after defaulting or deriving the correct SSH user, removing release-owned changelog churn, and adding redacted authenticated proof for doctor, create, SSH/run, stop, and cleanup.

Do we have a high-confidence way to reproduce the issue?

Yes from source, but not from a live fal account. The PR defaults fal SSH to root, copies that into cfg.SSHUser, and builds the SSH target from that config while #694 expects ubuntu.

Is this the best way to solve the issue?

No. The direct SSH lease provider direction fits Crabbox, but the current PR needs the default-user fix, release-note cleanup, and authenticated live lifecycle proof before it is the best merge path.

Full review comments:

  • [P2] Default fal SSH to ubuntu — internal/cli/config.go:2002
    The fal defaults still set cfg.Fal.User to root and copy that into cfg.SSHUser; Acquire then probes SSH with that target. The linked provider request at Add a fal.ai Compute provider #694 describes the fal SSH flow as ubuntu@..., so the default crabbox warmup/run --provider fal path can create a billable instance and then fail readiness unless users know to override --fal-user ubuntu. Default to ubuntu or derive the provider-returned user, and update the mirrored provider default, docs, and tests together.
    Confidence: 0.91
  • [P3] Remove the release-owned changelog entry — CHANGELOG.md:7
    CHANGELOG.md is release-owned for this repository workflow, but this normal feature PR still adds its own unreleased entry. Please keep the release-note context in the PR body or commit metadata and leave the changelog line for release curation.
    Confidence: 0.82

Overall correctness: patch is incorrect
Overall confidence: 0.88

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against c5bc10cc058e.

Label changes

Label justifications:

  • P2: A new direct GPU provider is a normal-priority feature with bounded provider-surface impact and current merge blockers.
  • merge-risk: 🚨 other: Green offline checks do not settle the billable fal Compute lifecycle because authenticated live create/use/destroy proof is absent.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🦐 gold shrimp.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body says authenticated live fal create/use/destroy proof is still required, no fal credential was available, and no live fal resources were created; add redacted terminal output or logs and update the PR body for re-review.
Evidence reviewed

Acceptance criteria:

  • [P1] go test -race ./internal/providers/fal ./internal/cli -run 'Fal|fal' -count=1.
  • [P1] node --test scripts/live-fal-smoke.test.js scripts/live-smoke.test.js.
  • [P1] git diff --check.

What I checked:

  • Repository policy read: AGENTS.md was read fully; its provider-neutral core boundary and secret-handling guidance are relevant to this provider PR. (AGENTS.md:1, c5bc10cc058e)
  • No maintainer notes: No .agents/maintainer-notes files were present in this checkout, so there were no matching internal notes to apply. (c5bc10cc058e)
  • Current main lacks fal support: A targeted current-main search found no fal provider, fal alias, or fal credential names under README, docs, internal, or scripts. (c5bc10cc058e)
  • PR scope: The submitted head changes 30 files with 3948 additions and 1 deletion across provider code, CLI config, docs metadata, scripts, tests, and CHANGELOG.md. (5a030ef41183)
  • Linked issue expects ubuntu SSH: The canonical provider request says fal's documented SSH example uses ubuntu@ and suggests user: ubuntu in fal config.
  • Core fal default user: The PR head defaults cfg.Fal.User to root and then copies it into cfg.SSHUser when no explicit SSH user is provided. (internal/cli/config.go:2002, 5a030ef41183)

Likely related people:

  • coygeek: Current-main history shows substantial adjacent SSH-lease provider and lifecycle work, including Lambda foundations and provider config/credential areas used by this PR. (role: recent provider lifecycle contributor; confidence: high; commits: 6e7939dbdc63, 5b9ecc2096c3, fba067282b84; files: internal/providers/lambda, internal/cli/config.go, internal/cli/credential_provenance.go)
  • steipete: Current-main history includes provider credential-destination protection and adjacent direct-provider work; the PR branch also contains recent fal lifecycle repair commits by this author. (role: recent adjacent provider/security contributor; confidence: medium; commits: c62fb939341d, 8ae979ffd42b, 0301236b2752; files: internal/cli/credential_provenance.go, internal/providers/lambda, internal/providers/fal)
  • Yossi Eliaz: Current-main history shows this person introduced RunPod, an adjacent direct GPU SSH provider with similar lifecycle and API-client concerns. (role: adjacent GPU provider introducer; confidence: medium; commits: cef985b92482; files: internal/providers/runpod, internal/cli/config.go, internal/providers/all/all.go)
  • Vincent Koc: Current-main history includes provider live-smoke dispatch additions and bounded RunPod acquire cleanup close to this PR's live proof and lifecycle concerns. (role: recent adjacent live-smoke and cleanup contributor; confidence: medium; commits: 35b728746e4e, 56d058c0b846, 8bfebcffa7e4; files: internal/providers/runpod, scripts/live-smoke.sh)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.
Review history (3 earlier review cycles)
  • reviewed 2026-07-03T18:32:00.714Z sha 14dbee9 :: needs real behavior proof before merge. :: [P2] Default fal SSH to ubuntu
  • reviewed 2026-07-04T09:45:16.284Z sha bfd0e56 :: needs real behavior proof before merge. :: [P2] Default fal SSH to ubuntu | [P3] Remove the release-owned changelog entry
  • reviewed 2026-07-04T09:58:22.805Z sha a6d24f5 :: needs real behavior proof before merge. :: [P2] Default fal SSH to ubuntu | [P3] Remove the release-owned changelog entry

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 other 🚨 Merging this PR has meaningful risk outside the owned taxonomy. labels Jun 25, 2026
@coygeek

coygeek commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Context on the Go CI timeout update in 605caed0:

The first red Go run was a real static-analysis issue, not runtime: deadcode reported internal/providers/fal/core.go:35:6: unreachable func: inventoryDoctorResult. That helper was an unused wrapper around the shared provider doctor result helper, so it was removed in 039ebdf3.

After that fix, the next Go run passed the earlier gates: formatting, go vet ./..., go run golang.org/x/tools/cmd/deadcode@v0.45.0 -test ./..., go test -race ./..., and scripts/test-go-modules.sh. It then entered scripts/check-go-coverage.sh 90.0 and was canceled by the Go job's 15-minute workflow timeout while coverage was still running. The log showed GitHub cancellation during Coverage, not a coverage assertion failure.

To verify the diagnosis before changing CI, I ran the same coverage command locally. It completed successfully with Go core coverage 91.3% >= 90.0%; the long-running package work, including internal/providers/xcpng, finished cleanly. Based on that, the timeout update was scoped to the Go job only: timeout-minutes: 15 -> 30, giving the existing full Go gauntlet enough wall-clock budget without weakening any check.

Result: the replacement GitHub Actions run passed: https://github.com/openclaw/crabbox/actions/runs/28198853668. The Go job passed in 14m17s, including Deadcode, Test, Test all Go modules, Coverage, and Build: https://github.com/openclaw/crabbox/actions/runs/28198853668/job/83532759285.

@clawsweeper clawsweeper Bot removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 25, 2026
@steipete steipete force-pushed the fal-ssh-lease-provider branch from 605caed to d9e9bd3 Compare July 3, 2026 17:06
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jul 3, 2026
@steipete steipete force-pushed the fal-ssh-lease-provider branch 2 times, most recently from bfd0e56 to a6d24f5 Compare July 4, 2026 09:53
coygeek and others added 16 commits July 4, 2026 13:41
Add fal provider registration, env-only credential loading, non-secret configuration, and a schema-backed Compute API client skeleton.

Keep fal lifecycle support out of the advertised run surface until the PLAN-02 backend wires acquire/list/release behavior.
Restore PLAN-01's advertised fal surface to an SSH lease provider with ssh, crabbox-sync, and cleanup capabilities.

Keep lifecycle behavior deferred behind explicit PLAN-02 errors so discovery matches the plan without silently performing unsupported resource operations.
Implement fal Compute lease acquire, resolve, list, touch, release, and cleanup flows with local-claim ownership checks.\n\nAdd offline lifecycle tests for rollback, recovery claims, status-only resolve, persisted SSH endpoints, and destructive-operation safeguards.
Document the direct fal Compute SSH lease provider, add provider matrix metadata, and regenerate the provider category surfaces.

Add an opt-in live smoke script with no-live defaults, credential gating, classified external blockers, redaction, cleanup attempts, and dispatcher coverage.
Build the acquired fal lease target after the SSH readiness probe so fallback port discovery is reflected in the returned lease as well as the persisted claim.

Add regression coverage for a configured SSH port corrected by the readiness probe.
Retry ambiguous fal Compute creates with the same lease idempotency key before proceeding so a recoverable provider id is required for local claim ownership.

Avoid persisting empty-provider-id recovery claims when idempotent reconciliation cannot return a fal instance id, and cover both retry success and retry failure paths.
Drop the unused fal inventoryDoctorResult wrapper so the CI deadcode gate passes.
@steipete steipete force-pushed the fal-ssh-lease-provider branch from a6d24f5 to 5a030ef Compare July 4, 2026 12:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 other 🚨 Merging this PR has meaningful risk outside the owned taxonomy. P2 Normal priority bug or improvement with limited blast radius. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a fal.ai Compute provider

2 participants