tracking: unified predicate-based stall detector — supersede the whack-a-mole pattern across releases #527

Description

Audit-filed by untether-staging monitor loop, run 20260513T015418Z (passes 1-8 + retrospective synthesis).

Source: untether.service on lba-1, 0.35.3rc13.

Type: Tracking / design discussion (ENH-PATCH candidate for v0.35.4 implementation).

Context

Across four releases (v0.35.0 through v0.35.3), the stall detector has been refined on every cycle for specific false-positive scenarios, yet each release adds another variant of "Untether thinks Claude is hung, but Claude isn't hung". The current design (warn on JSONL silence + CPU heuristic) keeps shipping single-context fixes; the underlying signal model is what isn't scaling.

What's been fixed (and is still in the codebase)

| Release | Issue | What it added |
|---|---|---|
| v0.35.0 | #92 | Foundation: stall warn after 5 min of no progress events |
| v0.35.0 | #95 | Detect stalls when no events arrive after StartedEvent |
| v0.35.0 | #99 | Auto-cancel sleep-then-stall loop after laptop sleep |
| v0.35.0 | #105 | Suppress when tool execution is active |
| v0.35.0 | #115 | Liveness watchdog false-positive auto-cancel guard |
| v0.35.0 | #121 | Suppress during CPU-active extended thinking |
| v0.35.0 | #154 | Longer threshold + contextual messaging for MCP tool calls |
| v0.35.0 | #155 | Frozen ring buffer escalation broadened beyond MCP |
| v0.35.0 | #168 | Suppress when main proc sleeping but children CPU-active |
| v0.35.0 | #188 | Verbose/misleading messaging cleanup for long-running tools |
| v0.35.1 | #264 | False positives during normal Agent/Bash workflows |
| v0.35.2 | #349 | Half-fix: visible chat indicator for rate-limit waits added, but the structured stall_warning still fires alongside it |

What's still firing in v0.35.3 (OPEN)

| Issue | Variant |
|---|---|
| #470 | Post-result idle (after last_event_type=result) keeps firing stall warnings every 3 min; known benign noise |
| #481 | Long-running Bash + ScheduleWakeup waits look hung in chat |
| #499 | ExitPlanMode approval-pending session fires 22 stall warnings while user reads the plan |
| #526 | Approval-pending + rate-limit-retry combo fires stall warnings despite Untether's own chat indicator showing the rate-limit |
| #482 | Upstream Bash tool_result interim deltas (sibling; partial mitigation of #481) |

Today's audit produced fresh evidence for both #499 (brand-copilot, peak_idle 5621s, 22 stall_warning fires this morning) and #526 (legal-librarian-local, 900s idle window mixing approval-pending + rate-limit-retry).

Proposal

Replace the implicit "stall = no JSONL events + tree_active check" with an explicit predicate-based decision at stall-detection time. The stall warning fires only when none of the predicates are true.
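
A minimal sketch of the "warn only when none of the predicates are true" rule, assuming predicates are plain callables over a session-state mapping. `SessionState`, `StallDecision` and `decide_stall` are hypothetical names for illustration, not existing Untether code.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Mapping

# Hypothetical snapshot of session state taken at stall-detection time.
SessionState = Mapping[str, object]
Predicate = Callable[[SessionState], bool]


@dataclass(frozen=True)
class StallDecision:
    warn: bool
    # Names of the predicates that explained the silence (empty when warn is True).
    active_predicates: tuple[str, ...]


def decide_stall(state: SessionState, predicates: Mapping[str, Predicate]) -> StallDecision:
    """Warn only when no predicate explains the current silence."""
    active = tuple(name for name, check in predicates.items() if check(state))
    return StallDecision(warn=not active, active_predicates=active)


# Illustrative wiring with two predicates read straight from the state mapping.
example_predicates: dict[str, Predicate] = {
    "approval_pending": lambda s: bool(s.get("approval_pending")),
    "rate_limit_waiting": lambda s: bool(s.get("rate_limit_waiting")),
}
```

Returning the names of the active predicates alongside the boolean would let the structured log record why a warning was suppressed, which is the information a downstream consumer needs in order to filter.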

Predicates (initial set)

| Predicate | True when | Source signal |
|---|---|---|
| approval_pending | A control_request is awaiting user response; inline keyboard is still rendered | _OUTLINE_REGISTRY / control-request registry / progress_edits keyboard state |
| rate_limit_waiting | The last rate_limit_event was within the past 60s and no assistant/result event has cleared it | claude.rate_limit_event log line (already emitted) |
| bash_running | An action with last_action='tool:Bash …' started < N seconds ago and hasn't shown (done) | existing last_action field |
| mcp_in_flight | An MCP tool call started < N seconds ago | existing MCP-aware threshold path (#154) |
| extended_thinking | Claude is in assistant.thinking mode with CPU active | existing tree_active + CPU check |
| post_result_idle | last_event_type=result AND idle_seconds < watchdog_threshold | #470's framing |
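
Three of the predicates above could be sketched as pure functions over a snapshot, roughly as follows. The `Snapshot` dataclass and its field names are assumptions made for illustration; only the 60s window, the `tool:Bash` prefix, the `(done)` marker and the `last_event_type=result` condition come from the table. The table leaves N unspecified, so it stays a required argument here rather than a guessed default.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

RATE_LIMIT_WINDOW_S = 60.0  # "within the past 60s" from the predicate table


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical view of session state; field names are illustrative."""
    now: float
    last_rate_limit_at: Optional[float] = None
    cleared_since_rate_limit: bool = False  # an assistant/result event arrived afterwards
    last_action: str = ""
    last_action_started_at: Optional[float] = None
    last_event_type: str = ""
    idle_seconds: float = 0.0


def rate_limit_waiting(s: Snapshot) -> bool:
    # False if no rate-limit event was seen, or a later event already cleared it.
    if s.last_rate_limit_at is None or s.cleared_since_rate_limit:
        return False
    return s.now - s.last_rate_limit_at <= RATE_LIMIT_WINDOW_S


def bash_running(s: Snapshot, max_age_s: float) -> bool:
    # max_age_s is the table's unspecified N; the caller must choose it.
    if s.last_action_started_at is None:
        return False
    if not s.last_action.startswith("tool:Bash") or "(done)" in s.last_action:
        return False
    return s.now - s.last_action_started_at < max_age_s


def post_result_idle(s: Snapshot, watchdog_threshold_s: float) -> bool:
    return s.last_event_type == "result" and s.idle_seconds < watchdog_threshold_s
```

Keeping each predicate free of I/O means it can be unit-tested against recorded snapshots from the false-positive reports listed earlier.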

Behavioural change

Daemon integration

The always-on untether-issue-watcher daemon currently treats subprocess.liveness_stall and progress_edits.stall_detected as auto-fileable. Once predicates ship, the daemon should filter on the predicate fields (approval_pending=true → skip, rate_limit_waiting=true → skip, …) so it doesn't keep auto-filing benign waits.
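
The daemon-side filter might look something like the sketch below. It assumes each log record has already been parsed into a mapping, that the event name lives under a hypothetical `event` key, and that the predicate fields carry the same names as the predicates in the table. None of this reflects the watcher's actual record format.

```python
from typing import Any, Mapping

# Event names the issue text describes as currently auto-fileable.
AUTO_FILEABLE = frozenset({
    "subprocess.liveness_stall",
    "progress_edits.stall_detected",
})

# Assumed field names, mirroring the predicate table.
SKIP_IF_TRUE = (
    "approval_pending",
    "rate_limit_waiting",
    "bash_running",
    "mcp_in_flight",
    "extended_thinking",
    "post_result_idle",
)


def should_auto_file(record: Mapping[str, Any]) -> bool:
    """File only stall events that no predicate field marks as a benign wait."""
    if record.get("event") not in AUTO_FILEABLE:
        return False
    # A missing field counts as "not set", so records from older builds still file.
    return not any(record.get(field) is True for field in SKIP_IF_TRUE)
```

Treating absent fields as false keeps the filter backward compatible with records from releases that predate the predicates, at the cost of continuing to file their benign waits.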

Related code

  • src/untether/runner_bridge.py (stall detection + escalation; _should_emit_stall_warning)
  • src/untether/runners/claude.py (_OUTLINE_REGISTRY, _DISCUSS_APPROVED, control-request registry, rate_limit_event handler)
  • src/untether/utils/proc_diag.py (existing tree_active / cpu_active checks — keep, just demoted to fallback)
  • ~/.local/bin/untether-issue-watcher (predicate-aware filter for auto:error-report filings)

Effort estimate

M — touches 2-3 files in src/untether/, requires a structured event taxonomy decision, and needs daemon-side dedup updates. Predicates can ship incrementally (start with approval_pending + rate_limit_waiting for the biggest wins; add bash_running / mcp_in_flight as #481/#482 land).

Why this is worth the refactor

Out of scope

— filed by untether-staging monitor loop, run 20260513T015418Z retrospective synthesis

Labels: auto:monitor-audit (Auto-filed by /monitor command audit loop; bugs + enhancements, perf and otherwise), enhancement (New feature or request)