Skip to content

feat(telemetry): instrument auth recovery and workspace lifecycle#948

Open
EhabY wants to merge 8 commits into
mainfrom
feat/issue-906-auth-workspace-telemetry
Open

feat(telemetry): instrument auth recovery and workspace lifecycle#948
EhabY wants to merge 8 commits into
mainfrom
feat/issue-906-auth-workspace-telemetry

Conversation

@EhabY
Copy link
Copy Markdown
Collaborator

@EhabY EhabY commented May 12, 2026

Summary

Records local telemetry for the auth-recovery and workspace-lifecycle paths so latency and failure modes are visible alongside the rest of the local telemetry stream.

Auth events (src/instrumentation/auth.ts):

  • auth.token_refreshed: OAuth refresh attempt with trigger (background / reactive).
  • auth.token_refresh.deduped: emitted when a refresh call joins an in-flight refresh so the dropped trigger stays countable.
  • auth.unauthorized_intercepted: recovery path for a 401, with recovery (refresh_success / login_required / none) and refreshAttempted.
  • auth.login_prompted: modal login prompt outcome, with trigger and a reason on abort/failure.

Workspace events (src/instrumentation/workspace.ts, split into three single-purpose classes):

  • workspace.state_transitioned: deduped on (status, transition, reason); emits observedDurationMs between transitions and observedBuildDurationMs when a provisioner run resolves.
  • workspace.agent.state_transitioned: deduped on (status, lifecycle_state); emits observedDurationMs.
  • workspace.start.triggered / workspace.update.triggered: traced spans around user-initiated workspace operations.

All workspace events carry workspaceName. The workspace event uses bare from/to (single state dimension); the agent event uses qualified fromStatus/toStatus plus fromLifecycleState/toLifecycleState because two dimensions can change in the same emission. Cross-event consistency via dimension-as-prefix property keys is tracked in #954.

Closes #906.

Validation

  • pnpm test:extension
  • pnpm typecheck
  • pnpm lint
  • pnpm format:check

Generated by Coder Agents.

@EhabY EhabY self-assigned this May 12, 2026
EhabY and others added 4 commits May 13, 2026 14:33
- workspace.ts: inline single-use transition helpers, unify on the
  spread idiom for optional `observedDurationMs`, move `performance.now()`
  below the dedup early-return, drop the no-op `() => fn()` wrapper in
  `traceUpdateTriggered`, and make `INITIAL_STATE` private.
- auth.ts: replace the `T extends LoginPromptResult` constraint with a
  `LoginPromptTracer` exposing `markAborted()`, matching the
  `RemoteSetupTracer` pattern. Callers now drive abort signaling, so
  telemetry no longer depends on `LoginResult`'s shape.
- loginCoordinator: call `tracer.markAborted()` on `!result.success`.
@EhabY EhabY force-pushed the feat/issue-906-auth-workspace-telemetry branch from 128cc7c to fc03561 Compare May 13, 2026 11:39
- Add Span.markFailure() so non-thrown failures can record result=error
  without throwing. The emitted event omits the error block, matching
  OTel's "Status: Error + optional exception event" convention.
- Move login-prompt outcome classification (success/auth_failed/aborted)
  from LoginCoordinator into AuthTelemetry.traceLoginPrompt. The
  coordinator just returns its natural LoginResult; telemetry maps it.
- Promote auth.intercept_401 from a log to a span so durationMs covers
  the full 401 recovery + retry. TODO marker points at #925 for a
  correlated received-marker once Span.log() lands.
- Add workspace.start.triggered span (mirrors update.triggered) and
  buildDurationMs measurement on workspace.state_transitioned.
- Dedupe workspace transitions on (status, transition, reason).
- Switch LoginCoordinator/WorkspaceMonitor/WorkspaceStateMachine/
  AuthInterceptor/OAuthSessionManager to take ServiceContainer instead
  of TelemetryReporter (+ individual services). Drops NOOP telemetry
  defaults; injection is now explicit.
- Make ensureLoggedInWithDialog trigger required.

Tests: 1601 passing, 1 skipped. Adds direct Span.markFailure unit tests
in service.test.ts and telemetry describe blocks per affected unit.
@EhabY
Copy link
Copy Markdown
Collaborator Author

EhabY commented May 13, 2026

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong instrumentation work. The typed facades (AuthTelemetry, WorkspaceTelemetry) centralize event schemas per domain, the dedup logic in observeWorkspace/observeAgent prevents event spam without losing real transitions, and the buildDurationMs accumulation across intermediate building states is well-tested. The handle401Error restructuring preserves every behavioral path (Pariston traced all five branches, found no difference). Test coverage is proportional: ~300 new source lines, ~520 new test lines, every telemetry event exercised. The LoginResult type enrichment with reason updates all 7 failure return sites, indicating sibling search.

7 P3 findings, 7 nits. The P3s cluster around telemetry schema completeness: missing outcome/recovery properties on throw paths, missing workspace identifiers for cross-session analytics, an incomplete status set, and a test coverage gap. No correctness bugs found.

Process note: the PR description mentions "optional TelemetryReporter injection" and omits workspace.start.triggered, both stale relative to the shipped code. These aren't code findings, but the description should match what landed.

"The setRecovery callback pattern in traceIntercept401 is a nice design: it defers the property write to the caller's decision point without leaking the span object." (Hisoka)

🤖 This review was automatically generated with Coder Agents.

Comment thread src/instrumentation/workspace.ts Outdated
Comment thread test/unit/api/authInterceptor.test.ts
Comment thread src/instrumentation/workspace.ts Outdated
Comment thread src/instrumentation/auth.ts
Comment thread src/oauth/sessionManager.ts
Comment thread src/instrumentation/workspace.ts
Comment thread src/instrumentation/workspace.ts Outdated
Comment thread test/mocks/testHelpers.ts Outdated
Comment thread test/mocks/testHelpers.ts Outdated
Comment thread test/unit/remote/workspaceStateMachine.test.ts
- Align auth event names with the past-tense convention used by the
  rest of the codebase: auth.token_refresh -> auth.token_refreshed,
  auth.intercept_401 -> auth.unauthorized_intercepted,
  auth.login_prompt -> auth.login_prompted.
- Replace the redundant `outcome` property on auth.login_prompted with
  `reason`, only emitted on non-success (`result` already covers the
  success/aborted/error axis).
- Collapse LoginPromptAbortReason + LoginPromptFailureReason into a
  single LoginPromptReason; tighten ObservedWorkspaceState now that
  transition/reason are non-nullable in the API types.
- Drop the dead `?? undefined` and conditional spreads in
  observeWorkspace; fold buildStartedAtMs updates into the same branch
  that produces buildDurationMs.
- Type BUILDING_STATUSES as ReadonlySet.
- Reuse createMockServiceContainer in deploymentManager.test.ts and
  drop a non-deterministic 30ms sleep in sshProcess.test.ts.
@EhabY EhabY changed the title feat(telemetry): instrument auth and workspace state feat(telemetry): instrument auth recovery and workspace lifecycle May 13, 2026
@EhabY EhabY force-pushed the feat/issue-906-auth-workspace-telemetry branch from 65fa45f to 9ffe6ac Compare May 13, 2026 21:43
- workspace.ts: split WorkspaceTelemetry into three single-purpose
  classes in the same file so each call site only holds the surface
  it uses, eliminating the duplicate-event trap by construction:
    - WorkspaceStateTelemetry (observe) for WorkspaceMonitor
    - WorkspaceAgentTelemetry (observe/reset) for WorkspaceStateMachine
    - WorkspaceOperationTelemetry (traceStartTriggered/traceUpdateTriggered)
      for WorkspaceStateMachine
- workspace.ts: add canceling/deleting to the provisioner-active set
  (renamed BUILDING_STATUSES -> PROVISIONING_STATUSES). Take
  workspaceName via constructors and stamp it on every emission.
- workspace.ts: unify workspace and agent state events on
  fromStatus/toStatus (agent additionally carries
  fromLifecycleState/toLifecycleState for its second dimension).
- auth.ts: drop the "401" baggage from API names. traceIntercept401
  -> traceAuthRecovery, Intercept401Recorder -> AuthRecoveryRecorder,
  AuthIntercept401Recovery -> AuthRecoveryOutcome, handle401Error ->
  recoverFromUnauthorized. The 401 stays in the event name
  `auth.unauthorized_intercepted` and at the call site.
- auth.ts: traceAuthRecovery sets recovery="none" and
  refreshAttempted="false" as initial properties so a pre-callback
  throw still emits values; AuthInterceptor records refreshAttempted
  before tryOAuthRefresh.
- auth.ts + sessionManager.ts: add auth.token_refresh.deduped so
  reactive refreshes that join an in-flight background refresh stay
  countable.
- testHelpers: type contextManager via a local ContextManagerLike
  shape and replace non-null assertions with a require() helper so
  missing services throw a named error.
- Add a recovery=none telemetry test; assert on the previously
  unused mock field in the workspaceStateMachine it.each.
@EhabY EhabY force-pushed the feat/issue-906-auth-workspace-telemetry branch from 9ffe6ac to 28a6962 Compare May 13, 2026 21:49
@EhabY
Copy link
Copy Markdown
Collaborator Author

EhabY commented May 13, 2026

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 14 R1 findings addressed. 12 fixed, 2 contested and closed by unanimous panel vote (7/7). DEREM-4 partial fix verified and accepted by 9/9 evaluating reviewers. Strong fix discipline: single organized commit with itemized entries per finding, all tests passing, typecheck clean.

R2 found 2 P3 test coverage gaps and 4 nits. The P3s are both untested telemetry emissions: the OAuth-refresh-failed recovery path and the new auth.token_refresh.deduped event (added to fix DEREM-5 but shipped without its own test). The nits are naming and documentation accuracy.

Process note: the commit message for the review-response commit claims "unify workspace and agent state events on fromStatus/toStatus," but workspace events still emit bare from/to (consistent with the author's own DEREM-10 defense). The commit message should match the code.

"The ??= guard prevents the start timestamp from shifting during intermediate provisioning statuses." (Hisoka, verifying the buildDurationMs lifecycle)


test/unit/telemetry/service.test.ts:295

Nit [DEREM-21] The post-emit guard test exercises setProperty, setMeasurement, and markAborted after completion, asserting 3 warnings. This PR added markFailure() with an identical if (completed) { warnPostEmit(...) } guard at service.ts:197, but the guard test was not extended. If someone removes the guard from markFailure, this test won't catch it.

Add escapedSpan?.markFailure() after line 307 and bump the expected warning count from 3 to 4.

(Bisky)

🤖

🤖 This review was automatically generated with Coder Agents.

Comment thread test/unit/api/authInterceptor.test.ts
Comment thread test/unit/oauth/sessionManager.test.ts
Comment thread src/api/authInterceptor.ts Outdated
Comment thread src/instrumentation/auth.ts Outdated
Comment thread src/instrumentation/workspace.ts Outdated
@EhabY EhabY force-pushed the feat/issue-906-auth-workspace-telemetry branch from 870d06b to a55308a Compare May 13, 2026 22:38
@EhabY
Copy link
Copy Markdown
Collaborator Author

EhabY commented May 13, 2026

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R2 findings DEREM-19, DEREM-20, DEREM-22, DEREM-23, DEREM-24 all fixed. Thank you.

One finding remains unaddressed:

DEREM-21 (Nit, test/unit/telemetry/service.test.ts:295): The post-emit guard test exercises setProperty, setMeasurement, and markAborted after completion but does not exercise markFailure. The markFailure guard at service.ts:197 has an identical if (completed) { warnPostEmit(...) } pattern but no coverage. Add escapedSpan?.markFailure() after line 307 and bump the expected warning count from 3 to 4.

This was folded into the R2 review body (not an inline comment) because the line is outside the diff. It may have been overlooked. Please fix or respond, and the panel will review on the next round.

🤖 This review was automatically generated with Coder Agents.

@EhabY EhabY force-pushed the feat/issue-906-auth-workspace-telemetry branch from a55308a to cd54c1a Compare May 13, 2026 22:51
@EhabY
Copy link
Copy Markdown
Collaborator Author

EhabY commented May 13, 2026

DEREM-21 fixed in cd54c1a: markFailure is now exercised alongside setProperty/setMeasurement/markAborted in the post-emit guard test, with the expected warning count bumped to 4.

@EhabY
Copy link
Copy Markdown
Collaborator Author

EhabY commented May 13, 2026

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All prior findings resolved. DEREM-21 fix verified. The PR is in excellent shape: 25 findings raised across 4 rounds, 22 fixed, 2 contested and closed by unanimous panel vote, 4 dropped. Six of eight R4 reviewers found no findings. Pariston's four-framing analysis found no gap in premises.

1 P3 (one untested reason string in login prompt telemetry), 1 Nit (stale doc comment). These are the diminishing-returns zone; the core instrumentation, behavioral refactoring, and test coverage are all solid.

"The code fought well." (Hisoka)


src/telemetry/service.ts:110

Nit [DEREM-26] The doc says result is success, error, or aborted (set via span.markAborted() for intentional early exits). After this PR, result: "error" can also come from span.markFailure() (for failures captured in the return value rather than thrown). The parenthetical now describes an incomplete picture. The Span interface and SpanResult type have correct docs, but this is where most callers start.

Suggested fix: append mention of markFailure(), e.g.: or \aborted` (via `span.markAborted()`) / `error` without throwing (via `span.markFailure()`)`.

(Netero)

🤖

🤖 This review was automatically generated with Coder Agents.

Comment thread test/unit/login/loginCoordinator.test.ts
@EhabY EhabY force-pushed the feat/issue-906-auth-workspace-telemetry branch from cd54c1a to 1e1c95b Compare May 13, 2026 23:36
- Rename AuthRecoveryOutcome -> AuthRecoveryAction (DEREM-23). Values
  encode the path taken (refresh_success / login_required / none),
  not the final outcome; the outcome lives in the span's `result`
  property.
- Rename buildDurationMs -> observedBuildDurationMs on the
  workspace.state_transitioned event (DEREM-24). Matches the
  observedDurationMs sibling and signals that this is a client-side
  performance.now() delta (includes SSE delivery latency), not the
  actual provisioner job duration.
- Trim "Token refresh successful, retrying request" to "Token refresh
  successful" (DEREM-22). The retry moved to the caller when
  recoverFromUnauthorized was extracted.
- Add a telemetry test for the OAuth-refresh-fails-then-callback
  path (DEREM-19): asserts refreshAttempted=true with
  recovery=login_required and result=error.
- Add a telemetry test for auth.token_refresh.deduped (DEREM-20):
  two concurrent refreshToken calls produce one auth.token_refreshed
  span and one deduped log with the dropped trigger.
@EhabY EhabY force-pushed the feat/issue-906-auth-workspace-telemetry branch from 1e1c95b to 18ab98f Compare May 14, 2026 00:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Telemetry: instrument auth and workspace lifecycle

1 participant