feat(telemetry): instrument auth recovery and workspace lifecycle#948
feat(telemetry): instrument auth recovery and workspace lifecycle#948EhabY wants to merge 8 commits into
Conversation
- workspace.ts: inline single-use transition helpers, unify on the spread idiom for optional `observedDurationMs`, move `performance.now()` below the dedup early-return, drop the no-op `() => fn()` wrapper in `traceUpdateTriggered`, and make `INITIAL_STATE` private. - auth.ts: replace the `T extends LoginPromptResult` constraint with a `LoginPromptTracer` exposing `markAborted()`, matching the `RemoteSetupTracer` pattern. Callers now drive abort signaling, so telemetry no longer depends on `LoginResult`'s shape. - loginCoordinator: call `tracer.markAborted()` on `!result.success`.
128cc7c to
fc03561
Compare
- Add Span.markFailure() so non-thrown failures can record result=error without throwing. The emitted event omits the error block, matching OTel's "Status: Error + optional exception event" convention. - Move login-prompt outcome classification (success/auth_failed/aborted) from LoginCoordinator into AuthTelemetry.traceLoginPrompt. The coordinator just returns its natural LoginResult; telemetry maps it. - Promote auth.intercept_401 from a log to a span so durationMs covers the full 401 recovery + retry. TODO marker points at #925 for a correlated received-marker once Span.log() lands. - Add workspace.start.triggered span (mirrors update.triggered) and buildDurationMs measurement on workspace.state_transitioned. - Dedupe workspace transitions on (status, transition, reason). - Switch LoginCoordinator/WorkspaceMonitor/WorkspaceStateMachine/ AuthInterceptor/OAuthSessionManager to take ServiceContainer instead of TelemetryReporter (+ individual services). Drops NOOP telemetry defaults; injection is now explicit. - Make ensureLoggedInWithDialog trigger required. Tests: 1601 passing, 1 skipped. Adds direct Span.markFailure unit tests in service.test.ts and telemetry describe blocks per affected unit.
|
/coder-agents-review |
There was a problem hiding this comment.
Strong instrumentation work. The typed facades (AuthTelemetry, WorkspaceTelemetry) centralize event schemas per domain, the dedup logic in observeWorkspace/observeAgent prevents event spam without losing real transitions, and the buildDurationMs accumulation across intermediate building states is well-tested. The handle401Error restructuring preserves every behavioral path (Pariston traced all five branches, found no difference). Test coverage is proportional: ~300 new source lines, ~520 new test lines, every telemetry event exercised. The LoginResult type enrichment with reason updates all 7 failure return sites, indicating sibling search.
7 P3 findings, 7 nits. The P3s cluster around telemetry schema completeness: missing outcome/recovery properties on throw paths, missing workspace identifiers for cross-session analytics, an incomplete status set, and a test coverage gap. No correctness bugs found.
Process note: the PR description mentions "optional TelemetryReporter injection" and omits workspace.start.triggered, both stale relative to the shipped code. These aren't code findings, but the description should match what landed.
"The
setRecoverycallback pattern intraceIntercept401is a nice design: it defers the property write to the caller's decision point without leaking the span object." (Hisoka)
🤖 This review was automatically generated with Coder Agents.
- Align auth event names with the past-tense convention used by the rest of the codebase: auth.token_refresh -> auth.token_refreshed, auth.intercept_401 -> auth.unauthorized_intercepted, auth.login_prompt -> auth.login_prompted. - Replace the redundant `outcome` property on auth.login_prompted with `reason`, only emitted on non-success (`result` already covers the success/aborted/error axis). - Collapse LoginPromptAbortReason + LoginPromptFailureReason into a single LoginPromptReason; tighten ObservedWorkspaceState now that transition/reason are non-nullable in the API types. - Drop the dead `?? undefined` and conditional spreads in observeWorkspace; fold buildStartedAtMs updates into the same branch that produces buildDurationMs. - Type BUILDING_STATUSES as ReadonlySet. - Reuse createMockServiceContainer in deploymentManager.test.ts and drop a non-deterministic 30ms sleep in sshProcess.test.ts.
65fa45f to
9ffe6ac
Compare
- workspace.ts: split WorkspaceTelemetry into three single-purpose
classes in the same file so each call site only holds the surface
it uses, eliminating the duplicate-event trap by construction:
- WorkspaceStateTelemetry (observe) for WorkspaceMonitor
- WorkspaceAgentTelemetry (observe/reset) for WorkspaceStateMachine
- WorkspaceOperationTelemetry (traceStartTriggered/traceUpdateTriggered)
for WorkspaceStateMachine
- workspace.ts: add canceling/deleting to the provisioner-active set
(renamed BUILDING_STATUSES -> PROVISIONING_STATUSES). Take
workspaceName via constructors and stamp it on every emission.
- workspace.ts: unify workspace and agent state events on
fromStatus/toStatus (agent additionally carries
fromLifecycleState/toLifecycleState for its second dimension).
- auth.ts: drop the "401" baggage from API names. traceIntercept401
-> traceAuthRecovery, Intercept401Recorder -> AuthRecoveryRecorder,
AuthIntercept401Recovery -> AuthRecoveryOutcome, handle401Error ->
recoverFromUnauthorized. The 401 stays in the event name
`auth.unauthorized_intercepted` and at the call site.
- auth.ts: traceAuthRecovery sets recovery="none" and
refreshAttempted="false" as initial properties so a pre-callback
throw still emits values; AuthInterceptor records refreshAttempted
before tryOAuthRefresh.
- auth.ts + sessionManager.ts: add auth.token_refresh.deduped so
reactive refreshes that join an in-flight background refresh stay
countable.
- testHelpers: type contextManager via a local ContextManagerLike
shape and replace non-null assertions with a require() helper so
missing services throw a named error.
- Add a recovery=none telemetry test; assert on the previously
unused mock field in the workspaceStateMachine it.each.
9ffe6ac to
28a6962
Compare
|
/coder-agents-review |
There was a problem hiding this comment.
All 14 R1 findings addressed. 12 fixed, 2 contested and closed by unanimous panel vote (7/7). DEREM-4 partial fix verified and accepted by 9/9 evaluating reviewers. Strong fix discipline: single organized commit with itemized entries per finding, all tests passing, typecheck clean.
R2 found 2 P3 test coverage gaps and 4 nits. The P3s are both untested telemetry emissions: the OAuth-refresh-failed recovery path and the new auth.token_refresh.deduped event (added to fix DEREM-5 but shipped without its own test). The nits are naming and documentation accuracy.
Process note: the commit message for the review-response commit claims "unify workspace and agent state events on fromStatus/toStatus," but workspace events still emit bare from/to (consistent with the author's own DEREM-10 defense). The commit message should match the code.
"The
??=guard prevents the start timestamp from shifting during intermediate provisioning statuses." (Hisoka, verifying the buildDurationMs lifecycle)
test/unit/telemetry/service.test.ts:295
Nit [DEREM-21] The post-emit guard test exercises setProperty, setMeasurement, and markAborted after completion, asserting 3 warnings. This PR added markFailure() with an identical if (completed) { warnPostEmit(...) } guard at service.ts:197, but the guard test was not extended. If someone removes the guard from markFailure, this test won't catch it.
Add escapedSpan?.markFailure() after line 307 and bump the expected warning count from 3 to 4.
(Bisky)
🤖
🤖 This review was automatically generated with Coder Agents.
870d06b to
a55308a
Compare
|
/coder-agents-review |
There was a problem hiding this comment.
R2 findings DEREM-19, DEREM-20, DEREM-22, DEREM-23, DEREM-24 all fixed. Thank you.
One finding remains unaddressed:
DEREM-21 (Nit, test/unit/telemetry/service.test.ts:295): The post-emit guard test exercises setProperty, setMeasurement, and markAborted after completion but does not exercise markFailure. The markFailure guard at service.ts:197 has an identical if (completed) { warnPostEmit(...) } pattern but no coverage. Add escapedSpan?.markFailure() after line 307 and bump the expected warning count from 3 to 4.
This was folded into the R2 review body (not an inline comment) because the line is outside the diff. It may have been overlooked. Please fix or respond, and the panel will review on the next round.
🤖 This review was automatically generated with Coder Agents.
a55308a to
cd54c1a
Compare
|
DEREM-21 fixed in cd54c1a: |
|
/coder-agents-review |
There was a problem hiding this comment.
All prior findings resolved. DEREM-21 fix verified. The PR is in excellent shape: 25 findings raised across 4 rounds, 22 fixed, 2 contested and closed by unanimous panel vote, 4 dropped. Six of eight R4 reviewers found no findings. Pariston's four-framing analysis found no gap in premises.
1 P3 (one untested reason string in login prompt telemetry), 1 Nit (stale doc comment). These are the diminishing-returns zone; the core instrumentation, behavioral refactoring, and test coverage are all solid.
"The code fought well." (Hisoka)
src/telemetry/service.ts:110
Nit [DEREM-26] The doc says result is success, error, or aborted (set via span.markAborted() for intentional early exits). After this PR, result: "error" can also come from span.markFailure() (for failures captured in the return value rather than thrown). The parenthetical now describes an incomplete picture. The Span interface and SpanResult type have correct docs, but this is where most callers start.
Suggested fix: append mention of markFailure(), e.g.: or \aborted` (via `span.markAborted()`) / `error` without throwing (via `span.markFailure()`)`.
(Netero)
🤖
🤖 This review was automatically generated with Coder Agents.
cd54c1a to
1e1c95b
Compare
- Rename AuthRecoveryOutcome -> AuthRecoveryAction (DEREM-23). Values encode the path taken (refresh_success / login_required / none), not the final outcome; the outcome lives in the span's `result` property. - Rename buildDurationMs -> observedBuildDurationMs on the workspace.state_transitioned event (DEREM-24). Matches the observedDurationMs sibling and signals that this is a client-side performance.now() delta (includes SSE delivery latency), not the actual provisioner job duration. - Trim "Token refresh successful, retrying request" to "Token refresh successful" (DEREM-22). The retry moved to the caller when recoverFromUnauthorized was extracted. - Add a telemetry test for the OAuth-refresh-fails-then-callback path (DEREM-19): asserts refreshAttempted=true with recovery=login_required and result=error. - Add a telemetry test for auth.token_refresh.deduped (DEREM-20): two concurrent refreshToken calls produce one auth.token_refreshed span and one deduped log with the dropped trigger.
1e1c95b to
18ab98f
Compare
Summary
Records local telemetry for the auth-recovery and workspace-lifecycle paths so latency and failure modes are visible alongside the rest of the local telemetry stream.
Auth events (
src/instrumentation/auth.ts):auth.token_refreshed: OAuth refresh attempt withtrigger(background/reactive).auth.token_refresh.deduped: emitted when a refresh call joins an in-flight refresh so the dropped trigger stays countable.auth.unauthorized_intercepted: recovery path for a 401, withrecovery(refresh_success/login_required/none) andrefreshAttempted.auth.login_prompted: modal login prompt outcome, withtriggerand areasonon abort/failure.Workspace events (
src/instrumentation/workspace.ts, split into three single-purpose classes):workspace.state_transitioned: deduped on(status, transition, reason); emitsobservedDurationMsbetween transitions andobservedBuildDurationMswhen a provisioner run resolves.workspace.agent.state_transitioned: deduped on(status, lifecycle_state); emitsobservedDurationMs.workspace.start.triggered/workspace.update.triggered: traced spans around user-initiated workspace operations.All workspace events carry
workspaceName. The workspace event uses barefrom/to(single state dimension); the agent event uses qualifiedfromStatus/toStatusplusfromLifecycleState/toLifecycleStatebecause two dimensions can change in the same emission. Cross-event consistency via dimension-as-prefix property keys is tracked in #954.Closes #906.
Validation
pnpm test:extensionpnpm typecheckpnpm lintpnpm format:checkGenerated by Coder Agents.