Trace L4 M7: schedule the outer harness-improvement loop (event-driven driver, no worker integration) #951

Description

@xrf9268-hue

Summary

Implement Milestone 7 of the trace-driven harness improvement loop: add a
scheduled / event-driven outer driver that runs the existing report →
proposal → follow-through chain (M1 #938 / M2 #939 / M3 #940) on a cadence
without a human prompting each run, and without integrating into the
worker binary.

Goal alignment (read first): per #937's North Star the goal is durable
harness learning that makes future runs fail less, not more automation or
more reports. This milestone is a means, not the goal: it earns its place
only if it raises the rate at which confirmed failures become durable harness
fixes. It is sequenced after M5 (#952, durable evidence) and M6 (#953, the
goal-bearing milestone that closes the ratchet), and may be deferred or
dropped: a driver that only schedules more reports/issues without raising the
failure→durable-fix conversion rate is the North Star's named anti-pattern.

The driver is operator-side product tooling: it runs in the environment
that runs the aiops worker, consumes that deployment's evidence, and
proposes changes to the operator's target repo. It is not part of, and does
not act on, the aiops-platform source repo's own development harness.

Part of #937. This is the concrete realization of the "Outer-agent automation
later" step already foreseen in #931's recommended sequence (step 3). Design
source of truth: docs/design/trace-driven-harness-improvement.md.
Depends on #940 (M3) and #941 (M4) — both merged — and on M5 (#952, durable
evidence) and M6 (#953, ratchet closure); sequenced last. Follow-up to #931
/ PR #936.

Problem — operator toil, not a missing goal

M1–M4 delivered the loop's artifact-production half: given evidence,
scripts/trace-harness-report.py produces grouped clusters, issue/draft-PR
proposals (schema trace-harness-report/v3), and advisory evaluator candidates.
Every connective step is manual today: capture worker stderr, run the report,
run jq, open the issue, dispatch the follow-through agent. That manual toil is
real friction, but reducing it is a convenience, not the loop's purpose — the
purpose (durable learning that closes the ratchet) is M5 (#952) + M6 (#953). The
loop-engineering sources name automation as a separate loop level, not the goal:

  • LangChain L3 event-driven loop: "Schedules, webhooks, cron jobs… trigger
    agent runs without a human manually prompting each one."
    (docs/research/2026-06-16-langchain-art-of-loop-engineering.md)
  • Zach Lloyd's outer loop: "A scheduled agent reviews those records… and
    opens a diff." (docs/research/2026-06-16-zach-lloyd-self-improving-skills.md)

Pursue this milestone only after M5/M6 prove the ratchet turns, and only if
scheduling measurably increases the failure→durable-fix rate.

Required behavior

Add a scheduled / triggered driver whose venue is the operator's choice — a
scheduled CI workflow (the same scheduled-workflow mechanism this repo already
ships for ruleset-drift.yml / capture-unresolved-reviews.yml), an external
cron on the worker host, or a coding-agent workflow — running wherever it can
reach the target deployment's evidence. The driver:

  1. Collects already-available evidence for a bounded recent window (retained
    worker logs and/or CI log artifacts; the Trace L4 M5 (#952) durable
    manifest when available).
  2. Runs scripts/trace-harness-report.py to produce the grouped report.
  3. Selects only actionable, recurring clusters using the design's
    recurrence rule (normally ≥2 independent occurrences; a single severe
    safety / data-loss finding may qualify).
  4. Promotes each selected cluster into a tracking issue carrying the M2
    proposal body, through ordinary forge / agent tooling (gh or a
    coding-agent workflow), never the worker. The driver does not open a
    draft PR itself: a draft PR contains a diff, which is the M3 follow-through
    output produced only after operator approval (see Approval model).
  5. Is idempotent: before creating, it searches for an existing open
    tracking issue for the same cluster id (a provenance marker or a
    trace-harness label namespaced with the cluster id, e.g.
    trace-harness:runner-timeout) and updates or skips instead of duplicating.
  6. Records each run's outcome (clusters seen / promoted / skipped-as-duplicate /
    below-threshold) so the loop is auditable, and tracks how many promoted
    clusters became durable fixes — the metric that justifies this milestone.
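
Steps 3, 5, and 6 above can be sketched as a single planning pass. This is a minimal illustration, not the driver's implementation: the `Cluster` and `RunOutcome` shapes, the `actionable` / `severe` fields, and `plan_run` are hypothetical names, and the real trace-harness-report/v3 schema may expose this information differently. Only the `trace-harness:<cluster id>` label form and the ≥2 recurrence default come from this issue.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Cluster:
    # Hypothetical shape; the real report schema may differ.
    cluster_id: str
    occurrences: int          # independent occurrences in the look-back window
    severe: bool = False      # single severe safety / data-loss finding
    actionable: bool = True


@dataclass
class RunOutcome:
    # Per-run audit record (step 6).
    seen: int = 0
    promoted: list = field(default_factory=list)
    skipped_duplicate: list = field(default_factory=list)
    below_threshold: list = field(default_factory=list)


def label_for(cluster_id: str) -> str:
    # Label namespaced with the cluster id, e.g. trace-harness:runner-timeout.
    return f"trace-harness:{cluster_id}"


def meets_recurrence_rule(c: Cluster, min_occurrences: int = 2) -> bool:
    # Step 3: actionable, and either recurring or a single severe finding.
    return c.actionable and (c.occurrences >= min_occurrences or c.severe)


def plan_run(clusters, open_issue_labels) -> RunOutcome:
    """Decide what to promote without touching the forge.

    open_issue_labels is the set of labels found on currently open tracking
    issues; fetching it (for example via gh) is left to the caller.
    Non-actionable clusters are counted with below_threshold for simplicity.
    """
    outcome = RunOutcome(seen=len(clusters))
    for c in clusters:
        if not meets_recurrence_rule(c):
            outcome.below_threshold.append(c.cluster_id)
        elif label_for(c.cluster_id) in open_issue_labels:
            # Step 5: an open issue already tracks this cluster.
            outcome.skipped_duplicate.append(c.cluster_id)
        else:
            outcome.promoted.append(c.cluster_id)
    return outcome
```

Keeping the decision pure (no forge calls inside `plan_run`) is one way to make the idempotency requirement testable with a fixture: running it twice against the same evidence, with the labels created by the first run included in the second, should promote nothing new.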

Approval model (decision to lock in this issue)

Reconcile the articles' "scheduled agent opens a diff" with the design's
"operator approves intent, not the body" (M2/M3). Recommended default:

  • The driver may auto-create the proposal artifact — a tracking issue
    carrying the M2 proposal body — for a recurring cluster. This is transport,
    not a harness change.
  • The M3 follow-through implementation only starts after explicit operator
    approval (label flip or comment naming the cluster). The draft PR and its
    diff are that approved M3 output, so a human approves intent before any
    harness code is written.
  • Merging / shipping the harness change always stays behind normal review
    (human or reviewer-agent). No unattended merge.

Alternatives, both explicit opt-in and never the default: (a) a notify-only mode
that posts the generated proposal to a single rolling issue and leaves creation
manual; (b) an advanced mode where the driver also dispatches the M3
follow-through agent to open a draft PR directly — still gated by the same
reviewed-merge rule. Default is the dedup-guarded auto-create-issue +
approve-before-implement behavior above; expose the mode as config.
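
The mode switch and the approval gate could look roughly like the sketch below. Everything named here is an assumption for illustration: the mode strings, the `trace-harness:approved` label, and the `/approve <cluster-id>` comment form are hypothetical and are exactly the details this issue asks to lock in. Deciding whose comments count as operator comments is left to the caller.

```python
from enum import Enum


class DriverMode(Enum):
    NOTIFY_ONLY = "notify-only"              # opt-in: post to a rolling issue
    AUTO_CREATE_ISSUE = "auto-create-issue"  # recommended default
    AUTO_DISPATCH_M3 = "auto-dispatch-m3"    # opt-in: dispatch M3 directly


DEFAULT_MODE = DriverMode.AUTO_CREATE_ISSUE

# Hypothetical label an operator flips to approve a cluster's follow-through.
APPROVED_LABEL = "trace-harness:approved"

# No mode ever enables unattended merge; merging stays behind normal review.
UNATTENDED_MERGE_ALLOWED = False


def may_auto_create_issue(mode: DriverMode) -> bool:
    # Notify-only leaves issue creation manual.
    return mode is not DriverMode.NOTIFY_ONLY


def may_start_follow_through(mode, cluster_id, issue_labels, operator_comments):
    """Return True when the M3 follow-through may start for this cluster."""
    if mode is DriverMode.AUTO_DISPATCH_M3:
        return True  # explicit opt-in; merge is still review-gated
    if APPROVED_LABEL in issue_labels:
        return True  # label flip
    # Comment naming the cluster, in an assumed "/approve <id>" form.
    return any(c.strip() == f"/approve {cluster_id}" for c in operator_comments)
```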

Boundary requirements (SPEC alignment)

  • The driver is not the worker/orchestrator and adds no worker-side
    code path, phase, gate, tracker write, or PR write. SPEC §1 keeps the worker
    a scheduler/runner/tracker reader; #76 ([P0][spec-alignment] Orchestrator
    overreach: PR creation, git push, and ticket state writes belong to the
    agent) closed orchestrator-owned PR/tracker writes.
  • All issue/PR/tracker writes happen through the coding agent's / operator's
    ordinary forge tools — the same surface M3 already uses.
  • No unattended merge; no automatic prompt/rubric/skill/LEARNINGS.md
    rewrite without a reviewed PR; no evaluator promoted to a required gate.
  • Redaction / byte bounds from the design and M1 are preserved end-to-end: the
    driver must not widen what the report embeds and must mask clone-URL userinfo
    (workflow.MaskCloneURL) and tokens in anything it posts.
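
As an illustration of the last requirement, a driver written in Python might apply a pass like the one below to any text before posting it. This is a sketch under stated assumptions: `mask_clone_url` is a hypothetical stand-in, not the repo's `workflow.MaskCloneURL`, and its `***@host` output format is assumed; the token pattern only covers GitHub-style prefixed tokens and the 4096-byte bound is a placeholder, not the design's actual limit.

```python
import re
from urllib.parse import urlsplit, urlunsplit

URL_RE = re.compile(r"https?://\S+")
# Illustrative only: GitHub-style prefixed tokens (ghp_, gho_, ghu_, ghs_, ghr_).
TOKEN_RE = re.compile(r"\b(?:ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9]{20,}\b")


def mask_clone_url(url: str) -> str:
    """Replace any userinfo (user:password@) in a URL with '***@'."""
    parts = urlsplit(url)
    if "@" not in parts.netloc:
        return url
    host = parts.netloc.rsplit("@", 1)[1]
    return urlunsplit(
        (parts.scheme, "***@" + host, parts.path, parts.query, parts.fragment)
    )


def redact(text: str, max_bytes: int = 4096) -> str:
    """Mask URL userinfo and tokens, then enforce a UTF-8 byte bound."""
    text = URL_RE.sub(lambda m: mask_clone_url(m.group(0)), text)
    text = TOKEN_RE.sub("[REDACTED]", text)
    clipped = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a multi-byte character cut in half by the clip.
    return clipped.decode("utf-8", errors="ignore")
```

Masking runs before truncation so that a secret straddling the byte bound cannot survive as a partial match.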

Non-goals

Open considerations to resolve during design

  • Is it earned yet? Confirm M5/M6 have shown the ratchet turns and that
    manual promotion is the bottleneck before building any scheduler.
  • Evidence availability: the aiops worker runs in the operator's
    environment, which a CI runner cannot see by default. Specify how
    logs/artifacts reach the driver (operator-provided artifact path, a
    self-hosted runner on the worker host, or the M5 manifest). A v1 may accept an
    explicit input path; the robust unattended path depends on M5.
  • Cadence & window: pick a default cron and a bounded look-back; do not
    re-scan the whole history each run.
  • Dedup key: define the cluster-id provenance marker (hidden body marker or
    a cluster-id-namespaced label) used to find existing issues.
  • Noise control: recurrence threshold plus a per-run cap on how many
    proposals it may open.
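
For the dedup-key, window, and noise-control questions, one possible shape is sketched below. The hidden-marker format, the 7-day look-back, and the cap of 3 proposals per run are hypothetical placeholders chosen for the example, not decisions made in this issue.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical hidden marker: an HTML comment, invisible in rendered issues.
MARKER_RE = re.compile(r"<!-- trace-harness-cluster: ([A-Za-z0-9._-]+) -->")


def with_marker(body: str, cluster_id: str) -> str:
    """Append the provenance marker to a proposal body."""
    return f"{body}\n\n<!-- trace-harness-cluster: {cluster_id} -->"


def cluster_id_from_body(body: str):
    """Return the cluster id recorded in an issue body, or None."""
    match = MARKER_RE.search(body)
    return match.group(1) if match else None


def in_window(ts: datetime, now: datetime,
              lookback: timedelta = timedelta(days=7)) -> bool:
    """Bounded look-back: only evidence from the recent window is scanned."""
    return now - lookback <= ts <= now


def cap_proposals(cluster_ids, max_per_run: int = 3):
    """Per-run cap: return (to_open_now, deferred_to_later_runs)."""
    return cluster_ids[:max_per_run], cluster_ids[max_per_run:]
```

A body marker survives label edits by operators, while a namespaced label is easier to query from forge tooling; the design may want to pick one as the source of truth.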

Acceptance criteria

  • A scheduled / triggered workflow runs the report and promotes recurring
    clusters without a human prompting the run.
  • Promotion goes through forge/agent tooling only; no worker/orchestrator
    code path is added (diff shows no internal/worker or internal/orchestrator
    change).
  • Re-running on unchanged evidence does not create duplicate tracking issues
    (idempotency covered by a test/fixture).
  • Only clusters meeting the recurrence rule are promoted; below-threshold
    clusters are skipped.
  • Operator approval gates the M3 implementation step; the approved M3 PR
    stops at draft / ready-for-review and merge stays manual/reviewed.
  • Redaction and byte bounds are preserved in everything the driver posts.
  • Evidence is recorded that scheduling raised the failure→durable-fix
    conversion rate (or the milestone is deferred); a runbook explains the
    schedule, evidence input, approval model, dedup, and how to disable it.
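
The conversion-rate criterion reduces to simple arithmetic over the per-run records. The sketch below assumes the rate is durable fixes divided by promoted clusters and that a manual-promotion period serves as the baseline; both the function names and that comparison method are assumptions, since the issue does not fix how the evidence is measured.

```python
from typing import Optional


def conversion_rate(promoted_clusters: int, durable_fixes: int) -> Optional[float]:
    """Fraction of promoted clusters that became durable harness fixes."""
    if promoted_clusters <= 0:
        return None  # nothing promoted, so the rate is undefined
    return durable_fixes / promoted_clusters


def scheduling_raised_rate(baseline: Optional[float],
                           scheduled: Optional[float]) -> bool:
    """True only when both periods have data and the scheduled rate is higher."""
    if baseline is None or scheduled is None:
        return False
    return scheduled > baseline
```

For example, 2 durable fixes from 8 manually promoted clusters gives a baseline of 0.25; 4 fixes from 10 scheduled promotions gives 0.4, which would count as evidence for keeping the driver, while an undefined rate in either period would not.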

Metadata

Labels: area:observability, area:workflow, priority:p2, type:feature
