Audit-filed by untether-staging monitor loop, run 20260513T015418Z (passes 1-8 + retrospective synthesis).
Source: untether.service on lba-1, 0.35.3rc13.
Type: Tracking / design discussion (ENH-PATCH candidate for v0.35.4 implementation).
Context
Across four releases (v0.35.0 → v0.35.3) the stall detector has been refined for specific false-positive scenarios on every cycle — but each release adds another variant of "Untether thinks Claude is hung, but Claude isn't hung". The current design (warn on JSONL silence + CPU heuristic) keeps shipping single-context fixes; the underlying signal model is what's not scaling.
What's been fixed (and is still in the codebase)
| Release |
Issue |
What it added |
| v0.35.0 |
#92 |
Foundation: stall warn after 5 min of no progress events |
| v0.35.0 |
#95 |
Detect stalls when no events arrive after StartedEvent |
| v0.35.0 |
#99 |
Auto-cancel sleep-then-stall loop after laptop sleep |
| v0.35.0 |
#105 |
Suppress when tool execution is active |
| v0.35.0 |
#115 |
Liveness watchdog false-positive auto-cancel guard |
| v0.35.0 |
#121 |
Suppress during CPU-active extended thinking |
| v0.35.0 |
#154 |
Longer threshold + contextual messaging for MCP tool calls |
| v0.35.0 |
#155 |
Frozen ring buffer escalation broadened beyond MCP |
| v0.35.0 |
#168 |
Suppress when main proc sleeping but children CPU-active |
| v0.35.0 |
#188 |
Verbose/misleading messaging cleanup for long-running tools |
| v0.35.1 |
#264 |
False positives during normal Agent/Bash workflows |
| v0.35.2 |
#349 |
Half-fix: visible chat indicator for rate-limit waits added, but the structured stall_warning still fires alongside it |
What's still firing in v0.35.3 (OPEN)
| Issue |
Variant |
| #470 |
Post-result idle (after last_event_type=result) keeps firing stall warnings every 3 min — known benign noise |
| #481 |
Long-running Bash + ScheduleWakeup waits look hung in chat |
| #499 |
ExitPlanMode approval-pending session fires 22 stall warnings while user reads the plan |
| #526 |
Approval-pending + rate-limit-retry combo fires stall warnings despite Untether's own chat indicator showing the rate-limit |
| #482 |
Upstream Bash tool_result interim deltas (sibling — partial mitigation of #481) |
Today's audit produced fresh evidence for both #499 (brand-copilot, peak_idle 5621s, 22 stall_warning fires this morning) and #526 (legal-librarian-local, 900s idle window mixing approval-pending + rate-limit-retry).
Proposal
Replace the implicit "stall = no JSONL events + tree_active check" with an explicit predicate-based decision at stall-detection time. The stall warning fires only when none of the predicates are true.
Predicates (initial set)
| Predicate |
True when |
Source signal |
approval_pending |
A control_request is awaiting user response; inline keyboard is still rendered |
_OUTLINE_REGISTRY / control-request registry / progress_edits keyboard state |
rate_limit_waiting |
The last rate_limit_event was within the past 60s and no assistant/result event has cleared it |
claude.rate_limit_event log line (already emitted) |
bash_running |
An action with last_action='tool:Bash …' started < N seconds ago and hasn't shown (done) |
existing last_action field |
mcp_in_flight |
An MCP tool call started < N seconds ago |
existing MCP-aware threshold path (#154) |
extended_thinking |
Claude is in assistant.thinking mode with CPU active |
existing tree_active + CPU check |
post_result_idle |
last_event_type=result AND idle_seconds < watchdog_threshold |
#470's framing |
Behavioural change
Daemon integration
The always-on untether-issue-watcher daemon currently treats subprocess.liveness_stall and progress_edits.stall_detected as auto-fileable. Once predicates ship, the daemon should filter on the predicate fields (pending_approval=true → skip, rate_limit_waiting=true → skip, …) so it doesn't keep auto-filing benign waits.
Related code
src/untether/runner_bridge.py (stall detection + escalation; _should_emit_stall_warning)
src/untether/runners/claude.py (_OUTLINE_REGISTRY, _DISCUSS_APPROVED, control-request registry, rate_limit_event handler)
src/untether/utils/proc_diag.py (existing tree_active / cpu_active checks — keep, just demoted to fallback)
~/.local/bin/untether-issue-watcher (predicate-aware filter for auto:error-report filings)
Effort estimate
M — touches 2-3 files in src/untether/, requires a structured event taxonomy decision, and needs daemon-side dedup updates. Predicates can ship incrementally (start with approval_pending + rate_limit_waiting for the biggest wins; add bash_running / mcp_in_flight as #481/#482 land).
Why this is worth the refactor
Out of scope
— filed by untether-staging monitor loop, run 20260513T015418Z retrospective synthesis
Audit-filed by
untether-stagingmonitor loop, run20260513T015418Z(passes 1-8 + retrospective synthesis).Source:
untether.serviceonlba-1,0.35.3rc13.Type: Tracking / design discussion (
ENH-PATCHcandidate for v0.35.4 implementation).Context
Across four releases (
v0.35.0→v0.35.3) the stall detector has been refined for specific false-positive scenarios on every cycle — but each release adds another variant of "Untether thinks Claude is hung, but Claude isn't hung". The current design (warn on JSONL silence + CPU heuristic) keeps shipping single-context fixes; the underlying signal model is what's not scaling.What's been fixed (and is still in the codebase)
stall_warningstill fires alongside itWhat's still firing in v0.35.3 (OPEN)
last_event_type=result) keeps firing stall warnings every 3 min — known benign noiseScheduleWakeupwaits look hung in chatToday's audit produced fresh evidence for both #499 (brand-copilot, peak_idle 5621s, 22 stall_warning fires this morning) and #526 (legal-librarian-local, 900s idle window mixing approval-pending + rate-limit-retry).
Proposal
Replace the implicit "stall = no JSONL events + tree_active check" with an explicit predicate-based decision at stall-detection time. The stall warning fires only when none of the predicates are true.
Predicates (initial set)
approval_pendingcontrol_requestis awaiting user response; inline keyboard is still rendered_OUTLINE_REGISTRY/ control-request registry /progress_editskeyboard staterate_limit_waitingrate_limit_eventwas within the past 60s and no assistant/result event has cleared itclaude.rate_limit_eventlog line (already emitted)bash_runninglast_action='tool:Bash …'started < N seconds ago and hasn't shown(done)last_actionfieldmcp_in_flightextended_thinkingassistant.thinkingmode with CPU activepost_result_idlelast_event_type=resultANDidle_seconds < watchdog_thresholdBehavioural change
stall_warningfires every 3 min on JSONL silence + CPU heuristic. All 6 of the above scenarios produce warnings.subprocess.approval_pending,subprocess.rate_limit_waiting, etc.) with a much longer re-fire interval (e.g. 30 min) so they remain observable for operators but don't false-alarm.rate_limit_eventas a visible "waiting for API" indicator, not a silent cancel #349; bash/wait surface from feat(progress): surface long-running Bash/ScheduleWakeup waits in chat — silent 5–10 min holds look hung #481; approval keyboard) keeps working unchanged. The stall message is suppressed in those states.Daemon integration
The always-on
untether-issue-watcherdaemon currently treatssubprocess.liveness_stallandprogress_edits.stall_detectedas auto-fileable. Once predicates ship, the daemon should filter on the predicate fields (pending_approval=true → skip,rate_limit_waiting=true → skip, …) so it doesn't keep auto-filing benign waits.Related code
src/untether/runner_bridge.py(stall detection + escalation;_should_emit_stall_warning)src/untether/runners/claude.py(_OUTLINE_REGISTRY,_DISCUSS_APPROVED, control-request registry, rate_limit_event handler)src/untether/utils/proc_diag.py(existing tree_active / cpu_active checks — keep, just demoted to fallback)~/.local/bin/untether-issue-watcher(predicate-aware filter for auto:error-report filings)Effort estimate
M — touches 2-3 files in
src/untether/, requires a structured event taxonomy decision, and needs daemon-side dedup updates. Predicates can ship incrementally (start withapproval_pending+rate_limit_waitingfor the biggest wins; addbash_running/mcp_in_flightas #481/#482 land).Why this is worth the refactor
rate_limit_eventas a visible "waiting for API" indicator, not a silent cancel #349auto:error-report) and/monitor(auto:monitor-audit) from both auto-filing benign waits — today both treatsubprocess.liveness_stallas automatically fileable, contributing to the comment-saturation pattern noted in the current audit runOut of scope
subprocess.liveness_stallas today; that's exactly the signal that should remain after benign cases are filtered outsession.summary.last_event_typefield semantics observation from this audit (separate observability gap, sibling to bug: runner's proc.wait() doesn't fire when CC subprocess exits cleanly post-result — watchdog is the only safety net #502)— filed by untether-staging monitor loop, run
20260513T015418Zretrospective synthesis