feat: extend per-tick observability spans to the remaining periodic scheduler fibers #2987

@msywulak

Context

Follow-up to #2945 / PR #2986, surfaced by the pr-review-toolkit pass on that PR.

#2945 added per-tick withEffectSpan("atlas.scheduler.<label>", ...) to the 8 named periodic cleanup fibers (oauth_state_cleanup, rate_limit_cleanup, demo_rate_limit_cleanup, contact_rate_limit_cleanup, abuse_cleanup, dashboard_rate_limit_cleanup, conversation_rate_sweep, share_token_cleanup) in packages/api/src/lib/effect/layers.ts, matching the BYOT span landed in #2949.

However, makeSchedulerLive (and the surrounding startup DAG) runs other periodic fibers that are protected only by withFiberDeathLog and have the same observability gap: a fiber that hangs without crashing is invisible in traces, because there is no per-tick span and the catchAllCause death log never fires on a hang. The reviewers flagged:

  • sub_processor_publisher (~layers.ts:1565) — same shape as the 8, no per-tick span. Most direct consistency gap.
  • settings_refresh, onboarding_email, expert_scheduler — periodic fibers, also span-less.
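To make the gap concrete, here is a minimal sketch of what a per-tick span buys. It does not use Effect or the repo's withEffectSpan helper (whose signature is not shown in this issue); TickSpanRecorder, withTickSpan, and isStale are hypothetical stand-ins, and the clock is injected so the behaviour is deterministic. The idea: every tick records a completed span, so a fiber that hangs shows up as a span cadence that stops, even though no death log fires.

```typescript
// Hypothetical in-memory record of a finished tick; a real tracer would export this.
interface CompletedSpan {
  name: string;
  startMs: number;
  endMs: number;
  ok: boolean;
}

class TickSpanRecorder {
  readonly completed: CompletedSpan[] = [];
  private readonly now: () => number;

  constructor(now: () => number) {
    this.now = now;
  }

  // Wraps one tick of a periodic job and records a span when it finishes,
  // whether it returned normally or threw.
  withTickSpan<T>(name: string, tick: () => T): T {
    const startMs = this.now();
    let ok = false;
    try {
      const result = tick();
      ok = true;
      return result;
    } finally {
      this.completed.push({ name, startMs, endMs: this.now(), ok });
    }
  }

  // End time of the most recent completed span for this name, if any.
  lastEndMs(name: string): number | undefined {
    let latest: number | undefined;
    for (const span of this.completed) {
      if (span.name === name && (latest === undefined || span.endMs > latest)) {
        latest = span.endMs;
      }
    }
    return latest;
  }

  // True when no span has completed within maxGapMs: the "hung, not crashed" signal.
  isStale(name: string, maxGapMs: number): boolean {
    const last = this.lastEndMs(name);
    return last === undefined || this.now() - last > maxGapMs;
  }
}
```

With only a death log, the stale case above produces no signal at all; with per-tick spans, a missing span within the expected interval is something a trace query or alert can key on.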

Note: the CRM/email outbox flushers + their stall watchdogs (lead_outbox_flusher/lead_outbox_watchdog, email_outbox_flusher/email_outbox_watchdog) already have a heartbeat-count + stall-watchdog liveness treatment (layers.ts:1655+), so they are observable by a different mechanism — assess whether a span adds value there or is redundant.
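For comparison when assessing redundancy, a rough sketch of a heartbeat-count plus stall-watchdog pair follows. This is not the code at layers.ts:1655+; Heartbeat and StallWatchdog are hypothetical names, and the real implementation may track time or thresholds differently. The flusher bumps a counter on every pass, and the watchdog reports a stall when the counter has not moved between two of its own checks.

```typescript
// Counter bumped by the worker (e.g. a flusher) once per completed pass.
class Heartbeat {
  private count = 0;

  beat(): void {
    this.count += 1;
  }

  read(): number {
    return this.count;
  }
}

// Periodically compares the counter with the value seen on the previous check.
class StallWatchdog {
  private readonly heartbeat: Heartbeat;
  private lastSeen: number;

  constructor(heartbeat: Heartbeat) {
    this.heartbeat = heartbeat;
    this.lastSeen = heartbeat.read();
  }

  // Returns true when no beat happened since the previous check.
  check(): boolean {
    const current = this.heartbeat.read();
    const stalled = current === this.lastSeen;
    this.lastSeen = current;
    return stalled;
  }
}
```

This already detects a hung flusher, which is the main thing a per-tick span would add for the other fibers. What it does not provide is per-tick duration or trace context, so that is the question to weigh for the outbox fibers.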

Acceptance criteria

  • sub_processor_publisher emits a per-tick span under the atlas.scheduler.<label> convention (additive to withFiberDeathLog)
  • Decide + document whether settings_refresh / onboarding_email / expert_scheduler should also get spans (likely yes) and wire them if so
  • Decide + document whether the outbox flushers/watchdogs need spans given their existing heartbeat/watchdog liveness signal
  • Reuse the SCHEDULER_CLEANUP_SPAN_NAMES-style single-source-of-truth + test-pin pattern from PR #2986 (feat(observability): per-tick spans on the 8 periodic cleanup fibers (#2945))
  • No snake_case leakage in span names; keep the dotted atlas.scheduler. prefix
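One possible shape for the single-source-of-truth + test-pin criteria is sketched below. The names SCHEDULER_PERIODIC_SPAN_NAMES, PERIODIC_FIBER_LABELS, and findSpanNameViolations are hypothetical (the actual SCHEDULER_CLEANUP_SPAN_NAMES from #2986 is not reproduced here), and the sketch assumes the span name is simply the atlas.scheduler. prefix followed by the fiber label, per the atlas.scheduler.<label> convention. The list includes all four fibers named above even though wiring three of them is still an open decision.

```typescript
const SCHEDULER_SPAN_PREFIX = "atlas.scheduler.";

// Fiber labels named in this issue (assumption: all four get spans).
const PERIODIC_FIBER_LABELS = [
  "sub_processor_publisher",
  "settings_refresh",
  "onboarding_email",
  "expert_scheduler",
] as const;

type PeriodicFiberLabel = (typeof PERIODIC_FIBER_LABELS)[number];

// Hypothetical map: the only place span names are spelled out.
// The Record type forces an entry for every label at compile time.
const SCHEDULER_PERIODIC_SPAN_NAMES: Readonly<Record<PeriodicFiberLabel, string>> = {
  sub_processor_publisher: `${SCHEDULER_SPAN_PREFIX}sub_processor_publisher`,
  settings_refresh: `${SCHEDULER_SPAN_PREFIX}settings_refresh`,
  onboarding_email: `${SCHEDULER_SPAN_PREFIX}onboarding_email`,
  expert_scheduler: `${SCHEDULER_SPAN_PREFIX}expert_scheduler`,
};

// Pin check a unit test could run: prefix present, suffix matches the label,
// and no two fibers share a span name.
function findSpanNameViolations(names: Readonly<Record<string, string>>): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const [label, name] of Object.entries(names)) {
    if (!name.startsWith(SCHEDULER_SPAN_PREFIX)) {
      problems.push(`${label}: missing ${SCHEDULER_SPAN_PREFIX} prefix`);
    } else if (name.slice(SCHEDULER_SPAN_PREFIX.length) !== label) {
      problems.push(`${label}: suffix does not match label`);
    }
    if (seen.has(name)) {
      problems.push(`${label}: duplicate span name ${name}`);
    }
    seen.add(name);
  }
  return problems;
}
```

A pinned test would then assert that findSpanNameViolations(SCHEDULER_PERIODIC_SPAN_NAMES) is empty, so a bare label or a mistyped prefix fails in CI rather than surfacing as a missing span in production.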

Labels: area: api (Backend, agent, tools, SQL validation); feature (New feature or capability)

Status: Done
