Summary
Implement Milestone 7 of the trace-driven harness improvement loop: add a
scheduled / event-driven outer driver that runs the existing report →
proposal → follow-through chain (M1 #938 / M2 #939 / M3 #940) on a cadence without a human prompting each run, and without integrating into the
worker binary.
Goal alignment (read first): per #937's North Star the goal is durable
harness learning that makes future runs fail less — not more automation or
more reports. This milestone is a means, not the goal: it earns its place
only if it raises the rate at which confirmed failures become durable harness
fixes. It is sequenced after M5 (#952, durable evidence) and M6 (#953, the
goal-bearing milestone that closes the ratchet), and may be deferred or
dropped — a driver that only schedules more reports/issues without raising the
failure→durable-fix conversion rate is the North Star's named anti-pattern.
The driver is operator-side product tooling: it runs in the environment
that runs the aiops worker, consumes that deployment's evidence, and
proposes changes to the operator's target repo. It is not part of, and does
not act on, the aiops-platform source repo's own development harness.
Part of #937. This is the concrete realization of the "Outer-agent automation
later" step already foreseen in #931's recommended sequence (step 3). Design
source of truth: docs/design/trace-driven-harness-improvement.md.
Depends on #940 (M3) and #941 (M4) — both merged — and on M5 (#952, durable
evidence) and M6 (#953, ratchet closure); sequenced last. Follow-up to #931
/ PR #936.
Problem — operator toil, not a missing goal
M1–M4 delivered the loop's artifact-production half: given evidence, scripts/trace-harness-report.py produces grouped clusters, issue/draft-PR
proposals (schema trace-harness-report/v3), and advisory evaluator candidates.
Every connective step is manual today: capture worker stderr, run the report,
run jq, open the issue, dispatch the follow-through agent. That manual toil is
real friction, but reducing it is a convenience, not the loop's purpose — the
purpose (durable learning that closes the ratchet) is M5 (#952) + M6 (#953). The
loop-engineering sources name automation as a separate loop level, not the goal:
- LangChain L3 event-driven loop: "Schedules, webhooks, cron jobs… trigger agent runs without a human manually prompting each one." (docs/research/2026-06-16-langchain-art-of-loop-engineering.md)
- Zach Lloyd's outer loop: "A scheduled agent reviews those records… and opens a diff." (docs/research/2026-06-16-zach-lloyd-self-improving-skills.md)
Pursue this milestone only after M5/M6 prove the ratchet turns, and only if
scheduling measurably increases the failure→durable-fix rate.
Required behavior
Add a scheduled / triggered driver whose venue is the operator's choice — a
scheduled CI workflow (the same scheduled-workflow mechanism this repo already
ships for ruleset-drift.yml / capture-unresolved-reviews.yml), an external
cron on the worker host, or a coding-agent workflow — running wherever it can
reach the target deployment's evidence. The driver:
1. Reads the target deployment's evidence (worker logs and/or CI log artifacts; the M5 (#952) durable manifest when available).
2. Runs scripts/trace-harness-report.py to produce the grouped report.
3. Selects only actionable, recurring clusters using the design's recurrence rule (normally ≥2 independent occurrences; a single severe safety / data-loss finding may qualify).
4. Promotes each selected cluster into a tracking issue carrying the M2 proposal body, through ordinary forge / agent tooling (gh or a coding-agent workflow), never the worker. The driver does not open a draft PR itself: a draft PR contains a diff, which is the M3 follow-through output produced only after operator approval (see Approval model).
5. Is idempotent: before creating, it searches for an existing open tracking issue for the same cluster id (a provenance marker or a trace-harness label namespaced with the cluster id, e.g. trace-harness:runner-timeout) and updates or skips instead of duplicating.
6. Records each run's outcome (clusters seen / promoted / skipped-as-duplicate / below-threshold) so the loop is auditable, and tracks how many promoted clusters became durable fixes, which is the metric that justifies this milestone.
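The selection step could be sketched roughly as follows. This is a minimal illustration, not the report's actual API: the `Cluster` fields (`cluster_id`, `occurrences`, `actionable`, `severe`) and the `select_clusters` helper are hypothetical names, and the real trace-harness-report/v3 schema may expose this information differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    # Hypothetical shape; the real trace-harness-report/v3 fields may differ.
    cluster_id: str
    occurrences: int       # independent occurrences in the look-back window
    actionable: bool
    severe: bool = False   # severe safety / data-loss finding


def select_clusters(clusters, min_occurrences=2):
    """Keep actionable clusters that recur, or that are severe even if seen once."""
    selected = []
    for cluster in clusters:
        if not cluster.actionable:
            continue
        if cluster.occurrences >= min_occurrences or cluster.severe:
            selected.append(cluster)
    return selected


# Made-up example clusters, purely for illustration.
clusters = [
    Cluster("runner-timeout", 3, True),
    Cluster("flaky-network", 1, True),
    Cluster("data-loss-on-retry", 1, True, severe=True),
    Cluster("cosmetic-log-noise", 5, False),
]
selected_ids = [c.cluster_id for c in select_clusters(clusters)]
print(selected_ids)
```

Everything not selected would be counted as below-threshold (or non-actionable) in the run record rather than silently dropped, so the run stays auditable.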
Approval model (decision to lock in this issue)
Reconcile the articles' "scheduled agent opens a diff" with the design's
"operator approves intent, not the body" (M2/M3). Recommended default:
1. The driver may auto-create the proposal artifact (a tracking issue carrying the M2 proposal body) for a recurring cluster. This is transport, not a harness change.
2. The M3 follow-through implementation only starts after explicit operator approval (label flip or comment naming the cluster). The draft PR and its diff are that approved M3 output, so a human approves intent before any harness code is written.
3. Merging / shipping the harness change always stays behind normal review (human or reviewer-agent). No unattended merge.
Alternatives, both explicit opt-in and never the default: (a) a notify-only mode
that posts the generated proposal to a single rolling issue and leaves creation
manual; (b) an advanced mode where the driver also dispatches the M3
follow-through agent to open a draft PR directly — still gated by the same
reviewed-merge rule. Default is the dedup-guarded auto-create-issue +
approve-before-implement behavior above; expose the mode as config.
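One way to picture the mode config and the approval gate is sketched below. The mode names, the action strings, and the `planned_actions` / `parse_mode` helpers are all hypothetical; the issue only requires that the mode be exposed as config and that no mode ever includes a merge.

```python
from enum import Enum


class Mode(Enum):
    # Hypothetical mode names; only the three behaviors come from the issue text.
    NOTIFY_ONLY = "notify-only"                          # opt-in alternative (a)
    AUTO_CREATE_ISSUE = "auto-create-issue"              # recommended default
    DISPATCH_FOLLOW_THROUGH = "dispatch-follow-through"  # opt-in alternative (b)


def parse_mode(value=None):
    """Fall back to the recommended default when no mode is configured."""
    return Mode(value) if value else Mode.AUTO_CREATE_ISSUE


def planned_actions(mode, operator_approved=False):
    """Actions the driver itself may take for one recurring cluster.

    Merging is never in the list: it always stays behind normal review.
    """
    if mode is Mode.NOTIFY_ONLY:
        return ["post_to_rolling_issue"]
    actions = ["create_tracking_issue"]
    if mode is Mode.DISPATCH_FOLLOW_THROUGH or operator_approved:
        actions.append("start_m3_follow_through")
    return actions


print(planned_actions(parse_mode()))
print(planned_actions(parse_mode(), operator_approved=True))
print(planned_actions(parse_mode("notify-only")))
```

In the default mode the follow-through step appears only once the operator has approved, which keeps "approve intent before any harness code is written" as a property of the plan itself rather than of caller discipline.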
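Boundary requirements (SPEC alignment)
- The driver adds no worker/orchestrator code path, phase, gate, tracker write, or PR write. SPEC §1 keeps the worker a scheduler/runner/tracker reader; #76 closed orchestrator-owned PR/tracker writes.
- All issue/PR/tracker writes happen through the coding agent's / operator's ordinary forge tools, the same surface M3 already uses.
- No unattended merge; no automatic prompt/rubric/skill/LEARNINGS.md rewrite without a reviewed PR; no evaluator promoted to a required gate.
- Redaction / byte bounds from the design and M1 are preserved end-to-end: the driver must not widen what the report embeds and must mask clone-URL userinfo (workflow.MaskCloneURL) and tokens in anything it posts.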
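For a driver written outside the Go codebase, the masking requirement might look like the sketch below. It is a hypothetical Python analogue, not a port of workflow.MaskCloneURL: the `mask_clone_url` and `redact` helpers, the `***` and `[REDACTED]` placeholders, the token pattern (GitHub-style `ghp_`/`gho_`/`ghu_`/`ghs_`/`ghr_` prefixes only), and the `max_bytes` value are all assumptions, and the real bounds come from the design and M1.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Illustrative patterns only; a real driver would reuse the report's own rules.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")
TOKEN_RE = re.compile(r"\bgh[pousr]_[A-Za-z0-9]{20,}\b")


def mask_clone_url(url):
    """Replace any user:password part of a URL with a fixed placeholder."""
    parts = urlsplit(url)
    if parts.username is None and parts.password is None:
        return url
    host = parts.hostname or ""
    if ":" in host:                      # IPv6 literal needs its brackets back
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, f"***@{host}", parts.path,
                       parts.query, parts.fragment))


def redact(text, max_bytes=4096):
    """Mask URL userinfo and bare tokens, then enforce a byte bound."""
    text = URL_RE.sub(lambda m: mask_clone_url(m.group(0)), text)
    text = TOKEN_RE.sub("[REDACTED]", text)
    encoded = text.encode("utf-8")
    if len(encoded) > max_bytes:
        # Drop a trailing partial multi-byte character instead of failing.
        text = encoded[:max_bytes].decode("utf-8", errors="ignore")
    return text


sample = "clone https://x-access-token:ghp_abcdefghijklmnopqrstuvwxyz0123@github.com/org/repo.git failed"
print(redact(sample))
```

Running redaction as the last step before any forge write, on the full issue body rather than on individual fields, is one way to make sure the driver never posts more than the report already allows.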
Non-goals
- Not a worker subcommand and not wired into cmd/worker.
- Not the goal: automation/scheduling is a means; producing more reports/issues that do not raise the failure→durable-fix rate is the anti-pattern, not a win.
- No durable scheduler trace DB (rejected in the design); evidence input is bounded and comes from M5 or ordinary log retention.
- No auto-merge, no required CI/runtime gate, no worker post-turn verifier, and not the evaluator-result feedback closure (that is M6, #953).
Open considerations to resolve during design
- Is it earned yet? Confirm M5/M6 have shown the ratchet turns and that manual promotion is the bottleneck before building any scheduler.
- Evidence availability: the aiops worker runs in the operator's environment, which a CI runner cannot see by default. Specify how logs/artifacts reach the driver (operator-provided artifact path, a self-hosted runner on the worker host, or the M5 manifest). A v1 may accept an explicit input path; the robust unattended path depends on M5.
- Cadence & window: pick a default cron and a bounded look-back; do not re-scan the whole history each run.
- Dedup key: define the cluster-id provenance marker (hidden body marker or a cluster-id-namespaced label) used to find existing issues.
- Noise control: recurrence threshold plus a per-run cap on how many proposals it may open.
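The dedup key and the per-run cap could be combined along these lines. This is a sketch under assumptions: the hidden-comment marker format, the `plan_promotions` helper, and the default cap of 3 are hypothetical, and the list of open issue bodies is assumed to have been fetched already through ordinary forge tooling. Only the trace-harness:<cluster-id> label shape comes from the issue text.

```python
import re

# Hypothetical hidden body marker; the issue leaves the exact format open.
MARKER_RE = re.compile(r"<!-- trace-harness-cluster: ([a-z0-9][a-z0-9-]*) -->")


def provenance_marker(cluster_id):
    return f"<!-- trace-harness-cluster: {cluster_id} -->"


def label_for(cluster_id):
    # Label form taken from the issue's example, e.g. trace-harness:runner-timeout.
    return f"trace-harness:{cluster_id}"


def plan_promotions(cluster_ids, open_issue_bodies, max_new_per_run=3):
    """Split selected clusters into create / skip-as-duplicate / deferred-by-cap."""
    existing = set()
    for body in open_issue_bodies:
        existing.update(MARKER_RE.findall(body))
    plan = {"create": [], "skip_duplicate": [], "deferred_by_cap": []}
    for cluster_id in cluster_ids:
        if cluster_id in existing:
            plan["skip_duplicate"].append(cluster_id)
        elif len(plan["create"]) < max_new_per_run:
            plan["create"].append(cluster_id)
        else:
            plan["deferred_by_cap"].append(cluster_id)
    return plan


ids = ["runner-timeout", "data-loss-on-retry", "stale-cache"]
first = plan_promotions(ids, open_issue_bodies=[], max_new_per_run=2)
# Simulate the issues created by the first run, then re-run on unchanged evidence.
bodies = [f"Proposal body\n{provenance_marker(c)}" for c in first["create"]]
second = plan_promotions(ids, open_issue_bodies=bodies, max_new_per_run=2)
print(first)
print(second)
```

A second run over the same evidence creates nothing that the first run already created, which is the property the idempotency test/fixture in the acceptance criteria would need to assert.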
Acceptance criteria
- A scheduled / triggered workflow runs the report and promotes recurring clusters without a human prompting the run.
- Promotion goes through forge/agent tooling only; no worker/orchestrator code path is added (diff shows no internal/worker or internal/orchestrator change).
- Re-running on unchanged evidence does not create duplicate tracking issues (idempotency covered by a test/fixture).
- Only clusters meeting the recurrence rule are promoted; below-threshold clusters are skipped.
- Operator approval gates the M3 implementation step; the approved M3 PR stops at draft / ready-for-review and merge stays manual/reviewed.
- Redaction and byte bounds are preserved in everything the driver posts.
- Evidence is recorded that scheduling raised the failure→durable-fix conversion rate (or the milestone is deferred); a runbook explains the schedule, evidence input, approval model, dedup, and how to disable it.
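The run record and the conversion-rate evidence could be as small as the sketch below. The `RunOutcome` field names, the JSON encoding, and the `conversion_rate` helper are hypothetical, and the numbers are invented solely to show the comparison; none of them are measurements.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunOutcome:
    """Counts for one driver run; field names are illustrative."""
    clusters_seen: int
    promoted: int
    skipped_as_duplicate: int
    below_threshold: int


def conversion_rate(promoted, durable_fixes):
    """Share of promoted clusters that became durable fixes (None if none promoted)."""
    if promoted == 0:
        return None
    return durable_fixes / promoted


# One auditable line per run, e.g. appended to a log the runbook points at.
outcome = RunOutcome(clusters_seen=7, promoted=2, skipped_as_duplicate=1, below_threshold=4)
record = json.dumps(asdict(outcome), sort_keys=True)
print(record)

# Invented numbers: compare a manual-promotion period with a scheduled one.
manual_rate = conversion_rate(promoted=8, durable_fixes=2)
scheduled_rate = conversion_rate(promoted=10, durable_fixes=5)
print(manual_rate, scheduled_rate)
```

If the scheduled rate does not exceed the manual one over a comparable period, the acceptance criterion points toward deferring the milestone rather than shipping the scheduler.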