[bug] Entity Loom stale running checkpoint blocks resumed background stages after restart #17

@lyrishark

Entity Loom: stale running checkpoint blocks resumed background stages after restart

Summary

If Entity Loom is interrupted while a long background stage is running, the
checkpoint can remain at status: "running" even though the daemon has been
restarted and no in-memory stage lock exists. After the package is resumed, the
UI trusts the persisted running status, hides the Start button, and shows the
stage as running even though no work is in progress.

Observed Case

From a Gemini import package captured after the stall:

  • Upload detected as gemini.
  • Convert/staging completed successfully.
  • 338 conversations / 3452 messages were committed.
  • Significant stage started with 338 conversations.
  • 101 significant items were checkpointed.
  • 21 significant memory files were written.
  • Later daemon starts logged only "Resumed package".
  • checkpoint.json still had:
{
  "currentStage": "significant",
  "stages": {
    "significant": {
      "status": "running",
      "completed": false
    }
  }
}

At that point getRunningStage() was null, so the persisted "running" status was
stale: nothing in the restarted process was actually working on the stage.
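
The staleness condition can be written down as a small check. This is only a
sketch: StageCheckpoint and CheckpointLike are hypothetical minimal shapes
(the real CheckpointStateV2 presumably has more fields), and
findStaleRunningStages is an illustrative name, not an existing Entity Loom
function.

```typescript
// Hypothetical minimal shapes for illustration only.
type StageName = "significant" | "daily" | "graph";

interface StageCheckpoint {
  status: string; // the issue only confirms "running" and "aborted"
  completed: boolean;
}

interface CheckpointLike {
  currentStage: string;
  stages: Partial<Record<StageName, StageCheckpoint>>;
}

// A persisted "running" status is stale when nothing is running in memory.
function findStaleRunningStages(
  checkpoint: CheckpointLike,
  runningStage: StageName | null,
): StageName[] {
  if (runningStage !== null) return [];
  const names = Object.keys(checkpoint.stages) as StageName[];
  return names.filter((name) => checkpoint.stages[name]?.status === "running");
}

// The checkpoint from the observed case, with no stage running in memory.
const observed: CheckpointLike = {
  currentStage: "significant",
  stages: { significant: { status: "running", completed: false } },
};

const staleStages = findStaleRunningStages(observed, null);
console.log(staleStages);
```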

Impact

The package is resumable, but the UI does not offer the user a way to resume it.
This is especially likely on large imports where Significant/Daily/Graph can run
for a long time and the terminal, app, or machine may close before completion.

Proposed Fix

On package resume, if no stage is actually running in memory, normalize stale
running statuses for resumable background stages:

  • significant
  • daily
  • graph

Set them to aborted with completed: false, preserve processedItems and
failedItems, save the checkpoint, and let the existing resumable UI path show
the Start/Continue action.

Sketch:

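// Background stages that can be restarted from their checkpoint.
const RESUMABLE_BACKGROUND_STAGES: StageName[] = [
  "significant",
  "daily",
  "graph",
];

// Returns true if the checkpoint was modified and needs to be saved.
function recoverStaleRunningStages(checkpoint: CheckpointStateV2): boolean {
  // A stage really is running in this process, so "running" is not stale.
  if (getRunningStage()) return false;

  let recovered = false;
  for (const stage of RESUMABLE_BACKGROUND_STAGES) {
    const stageCheckpoint = checkpoint.stages[stage];
    // Optional chaining guards against a stage entry missing from the checkpoint.
    if (stageCheckpoint?.status === "running") {
      stageCheckpoint.status = "aborted";
      stageCheckpoint.completed = false;
      // processedItems and failedItems are left untouched so the stage can resume.
      recovered = true;
      log(
        "warn",
        `Recovered stale '${stage}' stage as aborted/resumable on package resume`,
      );
    }
  }

  return recovered;
}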

Then call it during loadPackage() after loading/migrating the checkpoint and
save the checkpoint if anything changed.
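
A rough sketch of that wiring, assuming a store abstraction that does not
exist in the issue: CheckpointStore, loadPackageWithRecovery, and the
in-memory store below are hypothetical, and the recovery function is passed
in so the example stays self-contained. The real loadPackage() may well be
asynchronous; this version is synchronous for brevity.

```typescript
// Hypothetical minimal shapes; the real checkpoint type carries more fields.
interface StageCheckpoint {
  status: string;
  completed: boolean;
}

interface Checkpoint {
  stages: Record<string, StageCheckpoint | undefined>;
}

// Hypothetical persistence interface, synchronous for brevity.
interface CheckpointStore {
  load(packageId: string): Checkpoint;
  save(packageId: string, checkpoint: Checkpoint): void;
}

// Load the checkpoint, run recovery, and persist only if something changed.
function loadPackageWithRecovery(
  packageId: string,
  store: CheckpointStore,
  recover: (checkpoint: Checkpoint) => boolean,
): Checkpoint {
  const checkpoint = store.load(packageId); // real code would also migrate here
  if (recover(checkpoint)) {
    store.save(packageId, checkpoint);
  }
  return checkpoint;
}

// In-memory store that records which packages were saved.
const saves: string[] = [];
const memoryStore: CheckpointStore = {
  load: () => ({
    stages: { significant: { status: "running", completed: false } },
  }),
  save: (packageId) => {
    saves.push(packageId);
  },
};

// Stand-in for recoverStaleRunningStages, limited to one stage.
const recoverSignificant = (checkpoint: Checkpoint): boolean => {
  const stage = checkpoint.stages["significant"];
  if (!stage || stage.status !== "running") return false;
  stage.status = "aborted";
  stage.completed = false;
  return true;
};

const loaded = loadPackageWithRecovery("pkg-1", memoryStore, recoverSignificant);
console.log(loaded.stages["significant"]?.status, saves.length);
```

Saving only when the recovery function reports a change avoids rewriting
checkpoint.json on every resume of a healthy package.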

Local Patch / Verification

Patched locally in packages/entity-loom/src/stages/setup-stage.ts.

Verification performed:

  • deno check src/main.ts passes in the local Psycheros workspace.
  • A copied affected package with significant.status = "running" and 101
    processed items was repaired to status = "aborted" while preserving the 101
    processed item IDs.
  • The repair path is packaged in the community Gemini resume patch as an
    explicit modded Entity Loom file set, not silently mixed into the exporter.
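
The second check could be captured as a regression test along these lines.
This is a sketch, not the patch itself: abortIfStale is a hypothetical
single-stage stand-in for the recovery logic, the item IDs are synthetic, and
the string[] shape of processedItems and failedItems is an assumption.

```typescript
// Hypothetical stage shape; processedItems/failedItems types are assumed.
interface StageCheckpoint {
  status: string;
  completed: boolean;
  processedItems: string[];
  failedItems: string[];
}

// Marks a stale running stage as aborted without touching its progress lists.
function abortIfStale(
  stage: StageCheckpoint,
  stageIsRunningInMemory: boolean,
): boolean {
  if (stageIsRunningInMemory || stage.status !== "running") return false;
  stage.status = "aborted";
  stage.completed = false;
  return true;
}

// Synthetic stand-in for the 101 processed item IDs in the affected package.
const processedIds = Array.from({ length: 101 }, (_, i) => `item-${i}`);
const significant: StageCheckpoint = {
  status: "running",
  completed: false,
  processedItems: [...processedIds],
  failedItems: [],
};

const changed = abortIfStale(significant, false);
const preserved =
  significant.processedItems.length === processedIds.length &&
  significant.processedItems.every((id, i) => id === processedIds[i]);

console.log({ changed, status: significant.status, preserved });
```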

Notes

The triggering import happened to be Gemini, but the stale checkpoint behavior is
not Gemini-specific. The same failure mode should apply to any long
Significant/Daily/Graph stage if the process exits mid-run.
