Add a fal.ai Compute provider #694

Description

@coygeek

Summary

Add fal as a built-in Crabbox provider backed by fal.ai Compute. The first implementation should target fal Compute dedicated GPU instances as a direct SSH lease provider, not fal Model APIs or fal Serverless. Crabbox providers need to run arbitrary repo commands, and fal Compute exposes full SSH access to Linux GPU machines.

Why

Crabbox already supports direct cloud and GPU-oriented SSH lease providers such as Lambda Cloud, RunPod, Nebius, and NVIDIA Brev. fal.ai Compute fits the same Crabbox execution model:

  • fal Compute provides dedicated GPU instances that stay running under the user's control.
  • The documented flow provisions an instance, accepts an SSH public key, waits for the instance to become ready, and connects as ubuntu@<instance-ip>.
  • fal exposes Platform APIs for compute instance create, get, list, and delete.
  • Instance types documented today are 1xH100-SXM and 8xH100-SXM.

This would let users run:

FAL_KEY=... crabbox run --provider fal -- nvidia-smi
FAL_KEY=... crabbox warmup --provider fal --keep --slug fal-h100
FAL_KEY=... crabbox ssh --provider fal fal-h100
FAL_KEY=... crabbox stop --provider fal fal-h100

Provider shape

Implement fal Compute as an SSH lease provider:

  • Provider name: fal
  • Optional alias: fal-ai
  • Kind: ProviderKindSSHLease
  • Family: fal
  • Targets: Linux only
  • Initial features:
    • ssh
    • crabbox-sync
    • cleanup
  • Coordinator: CoordinatorNever

Do not route this through the Worker coordinator initially. Crabbox's coordinator provider registry currently brokers only aws, azure, gcp, and hetzner, and the existing direct GPU providers are CLI-owned.

Non-goals for the first PR

  • Do not implement fal Model API inference as the Crabbox provider path. Model APIs run endpoint-specific JSON requests and do not provide an SSH host for arbitrary command execution.
  • Do not implement fal Serverless deployment or fal run as a delegated-run backend in the first PR. Serverless owns the app container and endpoint contract, which is a different execution model from a Crabbox SSH lease.
  • Do not add coordinator support until there is a separate design for storing fal credentials and lifecycle state in the broker.
  • Do not expose FAL_KEY through CLI flags or persisted config output.

fal.ai docs evidence

Source analyzed: fal-docs.ai/2026-06-24_fal_ai_combined.md.

Relevant docs:

  • https://fal.ai/docs/documentation/compute
    • fal Compute is dedicated GPU infrastructure with full SSH access.
    • Compute instances are intended for training, fine-tuning, batch jobs, and workloads needing sustained hardware access.
    • Documented instances include 1xH100-SXM and 8xH100-SXM.
  • https://fal.ai/docs/documentation/compute/quickstart
    • Users paste an SSH public key at creation time.
    • Ready state usually takes 2 to 3 minutes.
    • SSH example uses ubuntu@<instance-ip>.
  • https://fal.ai/docs/api-reference/platform-apis
    • Platform APIs list create, get, list, and delete operations for Compute instances.
  • https://fal.ai/docs/documentation/setting-up/authentication
    • fal keys are read from FAL_KEY.
    • Raw API examples use Authorization: Key $FAL_KEY.
    • API keys are account or team scoped.

Additional docs were reviewed for Model APIs, queues, uploads, concurrency, retries, and webhooks. Those are useful context for future fal inference features, but not the best fit for this Crabbox provider because they do not expose a normal SSH lease.

Crabbox codebase analysis

Relevant Crabbox provider architecture:

  • internal/cli/provider_backend.go
    • Defines Provider, ProviderSpec, ProviderKindSSHLease, FeatureSSH, FeatureCrabboxSync, FeatureCleanup, and SSHLeaseBackend.
    • Core owns provider selection, sync, command execution, claims, status rendering, and release workflows.
  • docs/provider-backends.md and docs/features/provider-authoring.md
    • New SSH providers should return a usable LeaseTarget.SSH and let core handle rsync, command execution, results, and release.
  • internal/providers/all/all.go
    • Built-in providers register through side-effect imports.
  • internal/providers/lambda
    • Good pattern for a direct GPU cloud SSH provider with an HTTP API client, per-lease SSH key handling, readiness polling, cleanup, and doctor checks.
  • internal/providers/runpod
    • Good pattern for direct GPU provider lifecycle with public SSH coordinates and REST API polling.
  • docs/providers/provider-metadata.json and docs/providers/README.md
    • Provider metadata and the generated provider matrix must be kept in sync.
  • scripts/generate-provider-matrix.mjs and scripts/check-provider-matrix.mjs
    • Docs checks fail when provider registration, metadata, docs paths, or the generated matrix drift.

Proposed implementation

1. Add provider package

Create:

internal/providers/fal/
  provider.go
  backend.go
  client.go
  config.go
  flags.go
  doctor.go
  *_test.go

Provider spec:

func (Provider) Spec() core.ProviderSpec {
	return core.ProviderSpec{
		Name:        "fal",
		Family:      "fal",
		Kind:        core.ProviderKindSSHLease,
		Targets:     []core.TargetSpec{{OS: core.TargetLinux}},
		Features:    core.FeatureSet{core.FeatureSSH, core.FeatureCrabboxSync, core.FeatureCleanup},
		Coordinator: core.CoordinatorNever,
	}
}

Register it in internal/providers/all/all.go.

2. Add fal config and flags

Add config fields consistent with direct providers:

provider: fal
fal:
  instanceType: 1xH100-SXM
  sector: default
  user: ubuntu
  apiURL: ""

Suggested env vars:

  • FAL_KEY, primary because fal docs and SDKs use it.
  • CRABBOX_FAL_KEY, optional Crabbox-prefixed override for consistency with other providers.
  • CRABBOX_FAL_INSTANCE_TYPE
  • CRABBOX_FAL_SECTOR
  • CRABBOX_FAL_API_URL, for tests and enterprise/private endpoints if needed.

Suggested flags:

  • --fal-instance-type
  • --fal-sector
  • --fal-user
  • --fal-api-url

Security requirements:

  • Do not accept the API key via CLI flag.
  • Do not print the API key in doctor, config show, errors, debug output, or tests.
  • Redact fal credentials the same way LAMBDA_API_KEY and RUNPOD_API_KEY are handled.
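
The key lookup and redaction rules above can be sketched roughly as follows. This is a minimal illustration, not Crabbox's actual config code: resolveFalKey and redactSecret are hypothetical names, and the precedence (CRABBOX_FAL_KEY checked before FAL_KEY) is an assumption based on the "override" wording above.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveFalKey returns the API key and the env var it came from.
// Assumption: CRABBOX_FAL_KEY, when set, overrides FAL_KEY.
// lookup is injected so tests never touch the real environment.
func resolveFalKey(lookup func(string) (string, bool)) (key, source string, ok bool) {
	for _, name := range []string{"CRABBOX_FAL_KEY", "FAL_KEY"} {
		if v, found := lookup(name); found && strings.TrimSpace(v) != "" {
			return strings.TrimSpace(v), name, true
		}
	}
	return "", "", false
}

// redactSecret replaces every occurrence of secret in s with a placeholder,
// so error strings and debug output can be printed safely.
func redactSecret(s, secret string) string {
	if secret == "" {
		return s
	}
	return strings.ReplaceAll(s, secret, "[redacted]")
}

func main() {
	key, _, ok := resolveFalKey(os.LookupEnv)
	if !ok {
		fmt.Println("fal key: missing")
		return
	}
	// Report presence only; any message that might embed the key is redacted.
	fmt.Println(redactSecret("fal key: present", key))
}
```

Passing the lookup function in, instead of calling os.LookupEnv directly, keeps the unit tests free of real credentials.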

3. Implement a small fal Compute API client

Use Go's net/http, following the direct provider client style. The client should:

  • Read the key from config/env.
  • Send Authorization: Key <token>.
  • Set Accept: application/json.
  • Use Content-Type: application/json for writes.
  • Limit response bodies.
  • Reject cross-origin redirects unless the configured API URL is a trusted loopback test URL.
  • Decode structured error bodies when available and include HTTP status, fal request ID, and safe error type details.

Needed API operations:

  • Create compute instance with:
    • instance type,
    • sector,
    • SSH public key,
    • Crabbox-owned name or labels if the API supports them.
  • Get compute instance by ID.
  • List compute instances.
  • Delete compute instance by ID.

Important implementation note: verify exact REST paths and request/response schemas from fal's OpenAPI before coding. The combined docs list the operations, but not all operation payload schemas are present in the local bundle.

4. Implement SSH lease lifecycle

Acquire should:

  1. Generate a Crabbox lease ID and direct lease slug.
  2. Generate or prepare a Crabbox-managed SSH key for the lease, matching patterns from existing direct SSH providers.
  3. Create the fal Compute instance with the public key.
  4. Poll the fal instance until it has:
    • a ready (terminal, non-error) lifecycle state,
    • a public SSH host or IP,
    • an SSH user, defaulting to ubuntu if the API does not return one.
  5. Wait for SSH readiness.
  6. Return LeaseTarget with Server, SSHTarget, and LeaseID.
  7. Claim the lease for the repo.
  8. If creation or readiness fails and --keep was not set, delete the partially created instance.

Resolve should:

  • Resolve by Crabbox lease ID, slug, or provider instance ID when possible.
  • Rebuild LeaseTarget.SSH from fal instance state.
  • Preserve existing local claim behavior.

List should:

  • List fal Compute instances owned or recognizable by Crabbox.
  • Map them to normalized LeaseView records.
  • Include provider instance ID, instance type, region/sector, state, public IP, and created time when available.

ReleaseLease should:

  • Delete the fal Compute instance.
  • Remove the local lease claim after successful delete.
  • Treat already-deleted/not-found responses as successful cleanup when safe.
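
One way to express the release ordering above is sketched here. The apiError type and the release helper are hypothetical, and treating HTTP 404 as "already deleted" is an assumption to confirm against fal's actual delete semantics.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// apiError is a hypothetical error type carrying the HTTP status of a failed fal call.
type apiError struct{ Status int }

func (e *apiError) Error() string { return fmt.Sprintf("fal: HTTP %d", e.Status) }

// isNotFound reports whether err wraps an apiError with status 404.
func isNotFound(err error) bool {
	var apiErr *apiError
	return errors.As(err, &apiErr) && apiErr.Status == http.StatusNotFound
}

// release deletes the instance first and removes the local claim only when
// the delete succeeded or the instance was already gone.
func release(deleteInstance, removeClaim func() error) error {
	if err := deleteInstance(); err != nil && !isNotFound(err) {
		return fmt.Errorf("release: delete instance: %w", err)
	}
	return removeClaim()
}

func main() {
	claimRemoved := false
	err := release(
		func() error { return &apiError{Status: http.StatusNotFound} }, // already deleted
		func() error { claimRemoved = true; return nil },
	)
	fmt.Println("error:", err, "claim removed:", claimRemoved)
}
```

Keeping the claim when the delete fails for any other reason means a later stop or cleanup run can retry against the same lease.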

Touch should:

  • Reuse Crabbox's direct lease label/touch conventions where possible.

Cleanup should:

  • Provide dry-run and mutating cleanup for orphaned Crabbox-owned fal instances.
  • Never delete instances that are not clearly owned by Crabbox.
  • Include a conservative ownership rule if fal labels/tags are unavailable.
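
If fal turns out not to expose labels or tags, one possible conservative rule is to require two independent signals before deleting anything. The sketch below assumes a "crabbox-" name prefix and a local record of instance IDs that Crabbox created. Both are hypothetical conventions, not existing Crabbox behavior.

```go
package main

import (
	"fmt"
	"strings"
)

const ownedPrefix = "crabbox-" // hypothetical naming convention

// falInstance is a minimal stand-in for a fal list result.
type falInstance struct {
	ID   string
	Name string
}

// ownedByCrabbox requires both the name prefix and a local creation record.
func ownedByCrabbox(inst falInstance, createdIDs map[string]bool) bool {
	return strings.HasPrefix(inst.Name, ownedPrefix) && createdIDs[inst.ID]
}

// planCleanup is the dry-run step: it only classifies instances.
// An instance is deletable when Crabbox owns it and no active lease uses it.
func planCleanup(instances []falInstance, createdIDs, activeIDs map[string]bool) (toDelete, skipped []string) {
	for _, inst := range instances {
		if ownedByCrabbox(inst, createdIDs) && !activeIDs[inst.ID] {
			toDelete = append(toDelete, inst.ID)
		} else {
			skipped = append(skipped, inst.ID)
		}
	}
	return toDelete, skipped
}

func main() {
	instances := []falInstance{
		{ID: "i-1", Name: "crabbox-orphan"},
		{ID: "i-2", Name: "crabbox-active"},
		{ID: "i-3", Name: "someone-elses-box"},
	}
	created := map[string]bool{"i-1": true, "i-2": true}
	active := map[string]bool{"i-2": true}
	toDelete, skipped := planCleanup(instances, created, active)
	fmt.Println("would delete:", toDelete)
	fmt.Println("would skip:", skipped)
}
```

The trade-off is that instances created from another machine are never cleaned up, which errs on the side of leaving a paid resource running over deleting something Crabbox does not own.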

Doctor should:

  • Verify FAL_KEY or CRABBOX_FAL_KEY is present.
  • Call a read-only endpoint such as list instances or account/billing readiness.
  • Report auth, Compute access, inventory readability, and default config.
  • Avoid creating any paid resource.

5. Add docs and provider matrix metadata

Add or update:

  • docs/providers/fal.md
  • docs/providers/provider-metadata.json
  • docs/providers/README.md, generated by provider matrix script
  • docs/source-map.md, if the provider list there is kept manually
  • README.md, if the provider table is generated or expected to include the new provider

Suggested provider metadata:

"fal": {
  "status": "built-in",
  "category": "gpu-cloud",
  "substrate": "fal Compute H100 instance",
  "location": "cloud",
  "ssh": "crabbox-managed",
  "sync": "crabbox-sync",
  "gpu": "yes",
  "lifecycle": "Crabbox",
  "cleanup": "instance delete",
  "bestFit": "Dedicated H100 GPU workloads over SSH",
  "caveat": "Requires fal Compute access, FAL_KEY, quota, and available H100 capacity",
  "docs": "fal.md"
}

6. Add tests

Unit tests should use fake HTTP servers and no live fal credentials.

Suggested tests:

  • Provider registration:
    • ProviderFor("fal") resolves.
    • Optional alias fal-ai resolves if added.
    • Spec is Linux-only SSH lease, direct only, with ssh, crabbox-sync, and cleanup.
    • Provider exposes DoctorProvider.
  • Config:
    • FAL_KEY and CRABBOX_FAL_KEY loading.
    • config show does not leak token or token env names.
    • Invalid API URL is rejected.
    • HTTPS is required except for a trusted loopback test URL.
  • Client:
    • Sends Authorization: Key <token>.
    • Handles create/get/list/delete.
    • Handles 401/403 with clear auth errors.
    • Handles 404 delete idempotently where intended.
    • Limits response bodies.
    • Rejects cross-origin redirects.
  • Backend:
    • Acquire creates instance, waits ready, waits SSH, claims lease.
    • Acquire rolls back on readiness failure when not kept.
    • Resolve reconstructs SSH target.
    • List maps fal instances to LeaseView.
    • Release deletes instance and removes claim.
    • Cleanup does not delete unowned instances.
  • Docs:
    • Provider matrix check passes.
    • Provider docs path exists.
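
As an example of what the API URL config tests could exercise, here is a minimal validator sketch. validateAPIURL is a hypothetical name, and defining "trusted loopback" as localhost or a loopback IP literal is an assumption. An empty apiURL, meaning "use the default", would be handled by the caller before this check.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// validateAPIURL accepts absolute https URLs, and plain http only for
// loopback hosts (localhost, 127.0.0.0/8, ::1) used by fake test servers.
func validateAPIURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("fal api url: %w", err)
	}
	if u.Host == "" {
		return fmt.Errorf("fal api url %q: must be an absolute URL", raw)
	}
	switch u.Scheme {
	case "https":
		return nil
	case "http":
		host := u.Hostname()
		if host == "localhost" {
			return nil
		}
		if ip := net.ParseIP(host); ip != nil && ip.IsLoopback() {
			return nil
		}
		return fmt.Errorf("fal api url %q: http is allowed only for loopback hosts", raw)
	default:
		return fmt.Errorf("fal api url %q: unsupported scheme %q", raw, u.Scheme)
	}
}

func main() {
	for _, raw := range []string{
		"https://api.example.com",
		"http://127.0.0.1:8080",
		"http://api.example.com",
	} {
		fmt.Printf("%s -> %v\n", raw, validateAPIURL(raw))
	}
}
```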

7. Optional live smoke

Add a guarded live smoke only if credentials and Compute access are available:

CRABBOX_LIVE=1 CRABBOX_LIVE_PROVIDERS=fal FAL_KEY=... scripts/live-fal-smoke.sh

The live smoke should:

  1. Run doctor --provider fal.
  2. Assert starting inventory is safe or isolated by Crabbox-owned names.
  3. Run warmup --provider fal --keep --slug fal-live-....
  4. Run status --provider fal --wait.
  5. Run run --provider fal --id <lease> --no-sync -- nvidia-smi.
  6. Run stop --provider fal <lease>.
  7. Run cleanup dry-run and final inventory check.

The script must skip cleanly without CRABBOX_LIVE=1, without fal selected in CRABBOX_LIVE_PROVIDERS, or without FAL_KEY.

Acceptance criteria

  • crabbox providers lists fal as a built-in GPU cloud SSH lease provider.
  • crabbox doctor --provider fal verifies credentials and Compute access without creating resources.
  • crabbox warmup --provider fal --keep provisions a fal Compute instance and returns a reusable lease.
  • crabbox run --provider fal -- nvidia-smi runs on the fal Compute instance over Crabbox-managed SSH.
  • crabbox ssh --provider fal <id-or-slug> opens an SSH session.
  • crabbox list --provider fal --json returns normalized lease data without leaking credentials.
  • crabbox stop --provider fal <id-or-slug> deletes the fal Compute instance and removes the local claim.
  • Failed acquire rolls back the fal instance unless --keep was requested.
  • Cleanup only touches instances Crabbox can prove it owns.
  • Docs, provider matrix, and generated provider listings are in sync.
  • No fal credential appears in command output, config output, logs, tests, or docs.

Validation checklist

Before opening the PR:

gofmt -w $(git ls-files '*.go')
go vet ./...
go test -race ./...
scripts/check-docs.sh

If Worker files are touched for coordinator support, also run:

npm ci --prefix worker
npm run format:check --prefix worker
npm run lint --prefix worker
npm run check --prefix worker
npm test --prefix worker
npm run build --prefix worker

Suggested PR plan

  1. Create a local branch, for example feat/fal-provider.
  2. Land the provider package, config, and tests in small commits.
  3. Add docs and regenerate the provider matrix.
  4. Run the validation checklist.
  5. If live credentials are available, run the guarded live smoke and include the result in the PR.
  6. Submit this issue and the PR together, linking both by full GitHub URLs.

Open questions

  • What are the exact fal Platform API base URL, paths, and request/response schemas for Compute instance lifecycle?
  • Does fal Compute expose labels/tags or user metadata that Crabbox can use for safe cleanup ownership?
  • Does create support caller-provided names, sectors, and SSH keys directly, or does any of that require prior setup?
  • What states should be treated as provisioning, ready, terminal failure, and deleted?
  • Are Compute API calls available with normal API-scoped keys, or is ADMIN scope required?
  • Are there per-team or enterprise constraints that should be surfaced in doctor?
  • Should fal-ai be accepted as an alias, or should the provider follow newer Crabbox guidance and use only canonical fal?


Labels

  • P2
    • Normal priority bug or improvement with limited blast radius.
  • clawsweeper:linked-pr-open
    • ClawSweeper found an open linked pull request for this issue.
  • clawsweeper:needs-product-decision
    • ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:no-new-fix-pr
    • ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • impact:other
    • This issue has meaningful maintainer-visible impact outside the owned taxonomy.
  • issue-rating: 🌊 off-meta tidepool
    • Issue quality rating does not apply to this item.
