Skip to content

fix(session): add context overflow recovery for long-running sessions#1342

Open
walker83 wants to merge 1 commit into
XiaomiMiMo:mainfrom
walker83:fix/context-overflow-recovery
Open

fix(session): add context overflow recovery for long-running sessions#1342
walker83 wants to merge 1 commit into
XiaomiMiMo:mainfrom
walker83:fix/context-overflow-recovery

Conversation

@walker83

Copy link
Copy Markdown

PR: Fix context overflow recovery for long-running sessions

Summary

This PR adds a context overflow recovery mechanism ported from OpenCode, which allows sessions to gracefully handle context overflow instead of crashing.

Problem

As reported in #1221, long-running sessions (>200k tokens) crash when context overflows:

  1. No recovery mechanism: When compaction fails due to context overflow, the session immediately stops with an error
  2. No retry with reduced content: The system doesn't attempt to strip media attachments and retry
  3. Users lose work: The entire session becomes unusable

Related Issues

Solution

Ported from OpenCode commit 820c984d475c5ad0b60c8a2f5aabc715e57eaf4c (PR #31005).

Key Changes

1. Compaction Recovery (compaction.ts)

Before: On overflow, immediately return "stop" with error

After:

  • First overflow attempt: Return "text-repeat" to trigger retry with overflow flag
  • This allows the system to strip media attachments and try again
  • Only show error if retry also fails
if (result === "overflow") {
  // First overflow: attempt recovery by returning "text-repeat"
  // This triggers a retry with stripped media
  if (!replay) {
    log.info("context.overflow.attempting.recovery", { sessionID: input.sessionID })
    return "text-repeat"
  }
  // Only error if retry also fails
  // ... error handling ...
}

2. Enhanced Logging (processor.ts)

Added detailed overflow logging:

if (MessageV2.ContextOverflowError.isInstance(error)) {
  slog.warn("context.overflow.detected", {
    sessionID: ctx.sessionID,
    attemptingRecovery: true,
    message: error.message,
  })
  // ...
}

Recovery Flow

┌─────────────────┐
│  Context        │
│  Overflow       │
│  Detected       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ First Attempt?  │
│ (replay = null) │
└────────┬────────┘
    Yes /    \ No
       /      \
      ▼        ▼
┌──────────┐ ┌──────────┐
│ Return   │ │ Return   │
│ "text-    │ │ "stop"   │
│ repeat"   │ │ + Error  │
│ (retry    │ │          │
│ with      │ │          │
│ stripped  │ │          │
│ media)    │ │          │
└──────────┘ └──────────┘

Changes

Files Modified

File Changes
packages/opencode/src/session/compaction.ts +18 lines: Add recovery logic and structured logging
packages/opencode/src/session/processor.ts +10 lines: Add detailed overflow logging

New Log Events

  • context.overflow.detected - When overflow is detected
  • context.overflow.attempting.recovery - When attempting recovery
  • session.message-limit-exceeded - When messages > 1000 (from previous commit)

Testing

Manual Test Steps

  1. Start a new session
  2. Send messages until approaching context limit (~180k tokens)
  3. Send a large message with media attachments
  4. Verify:
    • First overflow: Should trigger retry with stripped media
    • Check logs for context.overflow.attempting.recovery
    • Session should continue, not crash

Expected Behavior

Scenario Before After
Context overflow with media Session crashes Media stripped, session continues
Context overflow without media Session crashes Graceful error with clear message
Large attachment upload OOM/crash Attachment stripped and retried

Verification

Log Output

# When overflow occurs
[WARN] context.overflow.detected: { sessionID: "...", attemptingRecovery: true }

# Recovery attempt
[INFO] context.overflow.attempting.recovery: { sessionID: "..." }

# If successful
[INFO] compaction.complete: { sessionID: "..." }

# If failed after retry
[ERROR] context.overflow.failed: { sessionID: "..." }

Backwards Compatibility

This change is fully backwards compatible:

  • No API changes
  • No configuration changes required
  • Existing sessions continue to work
  • Recovery is automatic and transparent to users

Related Commits

  • OpenCode: 820c984d475c5ad0b60c8a2f5aabc715e57eaf4c
  • OpenCode PR: #31005

Checklist

  • Code follows project style guidelines
  • Ported from upstream OpenCode with attribution
  • Added structured logging for debugging
  • Maintains backwards compatibility
  • No breaking changes
  • Commit message follows conventional format

Notes

This is a critical stability fix for long-running sessions. The recovery mechanism significantly improves user experience by:

  1. Preventing session loss on overflow
  2. Automatically stripping large attachments to save context
  3. Providing clear logging for debugging

Future improvements could include:

  • Automatic session archival when approaching limits
  • Better context budget management
  • Proactive user warnings before overflow

…g sessions

Ported from OpenCode commit 820c984d475c5ad0b60c8a2f5aabc715e57eaf4c

Problem:
- Long-running sessions (>200k tokens) crash when context overflows
- No recovery mechanism when compaction fails due to context overflow
- Users lose entire session when overflow occurs

Solution:
1. Add recovery attempt in compaction.ts when overflow is detected:
   - First overflow: return "text-repeat" to trigger retry with overflow flag
   - This allows stripping media and retrying before giving up
   - Only show error if retry (replay) also fails

2. Add detailed logging in processor.ts for overflow events:
   - Log when overflow is detected
   - Log recovery attempts
   - Include sessionID for debugging

Changes:
- packages/opencode/src/session/compaction.ts:
  - Add recovery logic before returning "stop"
  - Return "text-repeat" on first overflow to allow retry
  - Add structured logging with context.overflow.* events

- packages/opencode/src/session/processor.ts:
  - Add detailed overflow logging in halt function
  - Log recovery attempts with sessionID

Related Issues:
- Fixes XiaomiMiMo#1221 (Long running sessions cause lag and crash)
- Fixes XiaomiMiMo#813 (长时间思考会卡死)
- Related: XiaomiMiMo#680 (超大内存占用), XiaomiMiMo#300 (疑似内存泄露)

Refs: OpenCode PR #31005
Refs: anomalyco/opencode@820c984d
@walker83

Copy link
Copy Markdown
Author

Superseded by #1344 which includes the same overflow recovery changes plus the timeout.ts fix. This PR should be closed to avoid merge conflicts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant