Skip to content

fix: green the failing unit test suite#1296

Open
lLukDreams wants to merge 7 commits into
XiaomiMiMo:mainfrom
lLukDreams:fix/test-root-not-forbidden
Open

fix: green the failing unit test suite#1296
lLukDreams wants to merge 7 commits into
XiaomiMiMo:mainfrom
lLukDreams:fix/test-root-not-forbidden

Conversation

@lLukDreams

Copy link
Copy Markdown

Issue for this PR

Closes #

Type of change

  • Bug fix

What does this PR do?

Repairs the unit suite. The failures were a mix:

Stale tests (code changed, assertion didn't):

  • Drop /root from the forbidden-dirs assertion — it was intentionally removed from the blocklist.
  • Update the cache_control expectation to the rolling-tail marker placement the provider transform now uses.

Real bugs:

  • A subtask that fails before onReady lost its metadata; seed sessionId/model up front so the error tool state keeps it.
  • .claude.json MCP servers were eagerly connected instead of lazy pending — the mcp_origins "claude" marker wasn't honored.
  • The checkpoint writer read the Actor service only via the global spawnRef, which goes stale when a memoized Actor layer is torn down by instance disposal in an earlier test. Resolve from the live context via Effect.serviceOption (optional, so no layer cycle).
  • Workflow cancel could spawn orphans and hang with a wide fan-out: the promise-based agent semaphore kept starting queued agents as reclaim freed permits, and Actor.cancel awaited a fiber unwind that never finishes when stuck on a hung LLM stream. Bail queued spawns once cancelling, and bound the interrupt.

Test isolation:

  • History backfill inspects all FTS rows but only cleared the shared DB in afterEach; rows from earlier shard files leaked in. Clear in beforeEach too.

How did you verify your code works?

Ran the full suite on Linux across all 4 shards (matching the sharded ubuntu-latest CI) — green, 0 failures. The workflow cancel fix specifically: 20/20 with the concurrency ceiling pinned below the fan-out to force queuing. typecheck clean.

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

/root was deliberately removed from the directory blocklist in
instance.ts (commits 20b79f7, 49c0d9e) because it is the root user's
home directory (common in Docker/CI) and is already covered by the
workspace-trust prompt. The directory-safety test was left asserting
that /root rejects with "Access denied", which fails on the Linux CI
runner. Align the test with the intended implementation.
The provider transform was refactored (rolling-tail/head cache breakpoints)
after this test was written, so the expected Anthropic payload was stale:
the ephemeral cache_control marker now sits on the last assistant tool_use
and last user tool_result, not on the first user text block. Align the
expectation with the current behavior.
A subtask that failed before the actor signaled onReady (e.g. the actor
service is unavailable or the model is unresolvable) lost its metadata on
the resulting error tool state, because metadata was only reported from
onReady. Seed sessionId/model onto the tool part up-front so it survives a
failure at any point during spawn.
Servers imported from .claude.json carry mcp_origins[name].type === 'claude'
and are meant to connect lazily, but MCP init eagerly connected every
configured server — so a Claude config auto-spawned every local MCP server
on startup and the lifecycle test (expecting 'pending' until an explicit
connect()) failed. Honor the origin marker: leave Claude servers pending.
…nRef

tryStartCheckpointWriter read the Actor service only via the late-bound
global spawnRef, which goes stale when a memoized Actor layer is torn down
by instance disposal elsewhere in the process (observed as cross-test
pollution: the writer skipped with 'Actor service unavailable' depending on
test order). Prefer the Actor service from the live Effect context via
Effect.serviceOption — always present under AppRuntime and robust to the
global going stale — falling back to spawnRef. serviceOption does not add
Actor to requirements, so the layer cycle that motivated spawnRef is intact.
…red DB

The backfill suite inspects ALL FTS rows and only cleared the global DB in
afterEach, so rows left by earlier files in the same shard leaked into its
assertions (saw 11 stray part_ids). Clear the DB in beforeEach too.
Cancelling a workflow run with more agents than the concurrency ceiling
(globalSem = config.workflow.maxConcurrentAgents ?? min(16, 2x cores)) could
spawn orphans and intermittently hang. Two causes, two fixes:

1. Orphan spawns (runtime.ts): the promise-based agent semaphore fires queued
   callbacks as permits free. While reclaim cancels the started agents and frees
   their permits, queued siblings would acquire them and spawn NEW actors that
   escape the cancel sweep. Mark entry.cancelling before reclaim and bail at
   both spawn sites so a late-acquiring agent never starts.

2. Cancel hang (spawn.ts): Actor.cancel awaited each agent's runner cancel,
   which Fiber.interrupts the work fiber and awaits its full unwind. A fiber
   blocked on a hung LLM response stream never finishes unwinding, hanging the
   whole sweep (and runtime.cancel) indefinitely. Bound that interrupt with a
   grace period — the signal is still delivered and the registry status still
   records the cancellation (cancel's contract); we just stop awaiting a stuck
   unwind.

The regression test now pins the ceiling BELOW the fan-out so some agents are
always queued (deterministic on any host), exercising both paths. Verified
20/20 with forced queuing and green across all 4 CI shards.
@lLukDreams

Copy link
Copy Markdown
Author

@yanyihan-xiaomi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant