[Experimental] SkillOps — skill self-improvement loop (do not merge) #266

Draft
raks097 wants to merge 10 commits into main from feat/skill-evolution-loop
raks097 commented Jun 17, 2026

Collaborator

⚗️ Experimental branch — not for merge. feat/skill-evolution-loop is a
long-lived experimental integration branch kept separate from main. This PR
is the review/CI/greptile surface; please do not merge it into main.

What this is

SkillOps — the qvr-native substrate for an evidence-gated skill self-improvement
loop, built so the loop is recreatable on top of qvr without bloating the binary.
The design is layered: qvr core gets the deterministic, identity-bound pieces;
the agent/LLM work lives in a skill on top.

The seam that keeps core pure: the naive approach re-runs an agent to synthesize a
trace, then grades it. qvr already has the trace from real, lock-verified usage —
so execution stays with the agent, and only grading + the gate + lineage live in the
binary. No model calls, no scheduler in qvr.

qvr core

  • Outcome signal — TOOL/SKILL spans carry qvr.outcome
    (success/failure/blocked), rolled up to a session verdict; OTLP span
    status now reflects it. Deriver v7→v8 (rederive backfills). Migration 0007.
  • Human feedback: qvr audit annotate / audit annotations record a
    reviewer's verdict in a table that survives rederive (durable human input,
    not a regenerable projection). Migration 0008.
  • Evals-as-data: evals.yaml manifest + 6 deterministic graders
    (outcome, text, tool_sequence, tool_constraint, skill_invocation,
    behavior) graded over the trace qvr already captured. internal/eval/.
  • Eval gate: qvr ops eval run (exits non-zero on regression),
    qvr ops lineage, qvr ops promote, results keyed by {skill, commit}.
    Migration 0009.
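
As an illustration of the outcome rollup described in the first bullet, here is a minimal Go sketch. It is not the deriver code from this branch: the names `Outcome` and `rollUp` and the precedence rule (any failure wins, then blocked, then success) are assumptions made for the example. Only the three outcome values, and the convention that an absent attribute is the empty string, come from the PR text.

```go
package main

import "fmt"

// Outcome mirrors the qvr.outcome attribute values named in the PR.
// The empty string stands for "attribute absent".
type Outcome string

const (
	OutcomeSuccess Outcome = "success"
	OutcomeFailure Outcome = "failure"
	OutcomeBlocked Outcome = "blocked"
)

// rollUp folds per-span outcomes into one session verdict.
// ASSUMPTION: failure outranks blocked, which outranks success;
// the real precedence in qvr may differ.
func rollUp(spans []Outcome) Outcome {
	verdict := Outcome("")
	for _, o := range spans {
		switch o {
		case OutcomeFailure:
			return OutcomeFailure // nothing can outrank a failure
		case OutcomeBlocked:
			verdict = OutcomeBlocked
		case OutcomeSuccess:
			if verdict == "" {
				verdict = OutcomeSuccess
			}
		}
	}
	return verdict
}

func main() {
	spans := []Outcome{OutcomeSuccess, OutcomeBlocked, OutcomeSuccess}
	fmt.Println(rollUp(spans)) // prints "blocked"
}
```

A session with no outcome-bearing spans yields the empty verdict here, which matches the "unknown is the absence of the attribute" convention mentioned later in the commit log.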

On top (a shipped skill)

  • skills/evolve-skill/ — the orchestrator the agent runs over the CLI
    (observe → edit → re-run → gate → open a PR), never auto-merges or
    auto-installs, plus a CI recipe.

End-to-end acceptance

Two end-to-end acceptance tests drive the real CLI:

  • cmd/acceptance_loop_test.go — triage label correctness (fail→pass).
  • cmd/acceptance_behavioral_test.go — "verify edits by running tests"
    (behavioral graders, fail→pass).

Quality

make build / full go test ./... / make lint (0 issues) / gocyclo ≤15 /
modernize / vet all clean. Greptile review loop run on the branch; findings
addressed in abe5b42.

Rakshith added 10 commits June 16, 2026 23:41
…vals

Adds the qvr-core layer for evidence-gated skill evolution, keeping the
binary pure Go (no model calls, no scheduler):

- Outcome signal: TOOL/SKILL spans carry qvr.outcome (success|failure|
  blocked), rolled up to a session verdict; OTLP span status now reflects
  it instead of a hardcoded OK. Deriver v7->v8 (rederive backfills).
  Migration 0007.
- Human annotations: `qvr audit annotate` / `audit annotations` record a
  reviewer's verdict in a table that survives rederive (it is durable human
  input, not a regenerable projection). Migration 0008.
- Evals-as-data: evals.yaml manifest + deterministic graders (outcome,
  text, tool_sequence, tool_constraint, skill_invocation, behavior) graded
  over the trace qvr already captured — no model calls. internal/eval/.
- Eval store + gate: `qvr ops eval run` (exits non-zero on regression),
  `ops lineage`, `ops promote`, keyed by {skill, commit}. Migration 0009.
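
To make "deterministic graders over the captured trace" concrete, the following is a rough Go sketch of what a tool_sequence-style check could look like. The `Evidence` struct shape and the `gradeToolSequence` function are hypothetical, and the matching rule (expected tools must appear in order, with other calls allowed in between) is an assumption; the actual grader semantics live in internal/eval/.

```go
package main

import "fmt"

// Evidence is a hypothetical, trimmed-down view of a captured trace.
type Evidence struct {
	Tools []string // tool names in the order they were called
}

// gradeToolSequence reports whether want appears in ev.Tools as an
// ordered subsequence. ASSUMPTION: interleaved extra calls are allowed.
// The function is pure: same evidence in, same verdict out, no model call.
func gradeToolSequence(ev Evidence, want []string) bool {
	next := 0 // index of the next expected tool
	for _, tool := range ev.Tools {
		if next < len(want) && tool == want[next] {
			next++
		}
	}
	return next == len(want)
}

func main() {
	ev := Evidence{Tools: []string{"read", "edit", "test"}}
	fmt.Println(gradeToolSequence(ev, []string{"edit", "test"})) // true
	fmt.Println(gradeToolSequence(ev, []string{"test", "edit"})) // false
}
```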

The seam that keeps core pure: execution stays with the agent; grading,
the gate, and lineage are the only parts in the binary, because qvr already
has the real lock-verified trace.

A qvr-installable skill that runs the outer loop over the CLI: observe
(qvr audit) -> edit (qvr edit) -> re-run -> gate (qvr ops eval run) ->
open a PR. Evidence-gated; never auto-merges or auto-installs. Ships a CI
recipe for running the loop unattended.

- ops promote: surface store errors from latestPassingEval instead of
  swallowing them — a transient DB failure no longer reads as "no passing
  eval" and silently blocks/misleads a CI gate (P1).
- ops eval: evalRunRow takes the suite as a parameter rather than closing
  over the flag global, so it is pure and testable in isolation.
- eval evidence: lastAssistantText now requires role=="assistant" before
  returning content, guarding against a future multi-message output capture.
- eval evidence: Evidence.Tools now counts the same tool-sequence the
  tool_constraint grader walks, so a `maxTools` ceiling means the same thing
  under behavior and tool_constraint graders.
- migration 0009: eval_case_results.eval_run_id is now a FK to
  eval_runs(id) ON DELETE CASCADE (FK enforcement is on), so case rows can't
  be orphaned by a future eval-run prune.
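
The first fix in this list (surfacing store errors rather than swallowing them) comes down to keeping "no passing eval" and "could not ask the store" as separate return states. A minimal sketch of that shape, assuming a hypothetical `lookup` function type and `evalRun` struct; the real latestPassingEval signature is not shown in this PR text:

```go
package main

import (
	"errors"
	"fmt"
)

// evalRun is a hypothetical, minimal eval record.
type evalRun struct {
	Commit string
	Passed bool
}

// lookup stands in for the store query that loads the latest eval
// for a commit. ASSUMPTION: it returns (nil, nil) when none exists.
type lookup func(commit string) (*evalRun, error)

// latestPassingEval returns (run, nil) on a pass, (nil, nil) when there
// is no passing eval, and (nil, err) when the store itself failed, so a
// transient DB error is never mistaken for "no passing eval".
func latestPassingEval(q lookup, commit string) (*evalRun, error) {
	run, err := q(commit)
	if err != nil {
		return nil, fmt.Errorf("load evals for %q: %w", commit, err)
	}
	if run == nil || !run.Passed {
		return nil, nil
	}
	return run, nil
}

// Two stand-in stores for the demo: one that fails, one that is empty.
func failingLookup(string) (*evalRun, error) { return nil, errors.New("database is locked") }
func emptyLookup(string) (*evalRun, error)   { return nil, nil }

func main() {
	_, err := latestPassingEval(failingLookup, "abc123")
	fmt.Println("store failure surfaces:", err != nil)

	run, err := latestPassingEval(emptyLookup, "abc123")
	fmt.Println("no eval, no error:", run == nil && err == nil)
}
```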

Mirror the evalRunRow refactor: promoteDecision now receives force/reason
as explicit arguments instead of reading the Cobra flag globals, so the
pure gate logic is independently testable and its sub-tests no longer
mutate package state.
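
A sketch of why explicit arguments help: once the gate takes its inputs as parameters, it can be exercised without touching any flag globals. The signature and the decision rules below (in particular, that forcing requires a non-empty reason) are assumptions for illustration, not the branch's actual promoteDecision.

```go
package main

import "fmt"

// promoteDecision is a pure gate: every input arrives as an argument,
// so tests can call it directly without mutating package state.
// ASSUMPTION: a forced promotion must carry a reason.
func promoteDecision(passed, force bool, reason string) (allow bool, msg string) {
	switch {
	case passed:
		return true, "latest eval passed"
	case force && reason != "":
		return true, "forced without eval: " + reason
	case force:
		return false, "--force-no-eval requires a reason"
	default:
		return false, "no passing eval for the locked commit"
	}
}

func main() {
	// Each case is independent; no shared flag variables are involved.
	cases := []struct {
		passed, force bool
		reason        string
	}{
		{true, false, ""},
		{false, true, "hotfix"},
		{false, true, ""},
		{false, false, ""},
	}
	for _, c := range cases {
		allow, msg := promoteDecision(c.passed, c.force, c.reason)
		fmt.Println(allow, msg)
	}
}
```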

Round 3 greptile findings:

- ListEvalRuns / ListAnnotations now return an empty result (not a raw
  "no such table" error) when their table is absent — the case a read-only
  open hits on a DB that predates migrations 0008/0009 (read-only opens skip
  migration apply). `qvr ops lineage` and `qvr audit annotations` therefore
  degrade gracefully on a freshly-upgraded install. New noSuchTable() guard.
- DeleteSession nulls eval_runs.session_id for the deleted session instead
  of leaving it dangling. The eval verdict is durable lineage evidence keyed
  by {skill, commit}, so it outlives the session; only the stale pointer is
  cleared.
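
The graceful-degradation behaviour in the first finding can be sketched as follows. This assumes a SQLite-style "no such table" error message and matches on its text; the real noSuchTable() guard may inspect a driver error code instead. The `annotation` struct, `listAnnotations` wrapper, and `failing` helper are hypothetical names for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// noSuchTable reports whether err looks like a missing-table failure.
// ASSUMPTION: message matching; the branch's guard may work differently.
func noSuchTable(err error) bool {
	return err != nil && strings.Contains(err.Error(), "no such table")
}

// annotation is a hypothetical row type.
type annotation struct {
	SessionID string
	Verdict   string
}

// listAnnotations wraps a query and turns "table absent" into an empty
// result, while still returning every other error to the caller.
func listAnnotations(query func() ([]annotation, error)) ([]annotation, error) {
	rows, err := query()
	if noSuchTable(err) {
		return []annotation{}, nil // pre-migration DB: nothing recorded yet
	}
	if err != nil {
		return nil, err
	}
	return rows, nil
}

// failing builds a query stub that always fails with msg.
func failing(msg string) func() ([]annotation, error) {
	return func() ([]annotation, error) { return nil, errors.New(msg) }
}

func main() {
	rows, err := listAnnotations(failing("no such table: annotations"))
	fmt.Println(len(rows), err) // prints "0 <nil>"
}
```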

Round 4 greptile findings:

- ops promote: when a skill has no locked commit, the refusal now points
  straight at --force-no-eval instead of suggesting `qvr ops eval run`, which
  would record under an empty commit and loop back to the same refusal.
- migration 0008: index annotations(created_at) so the --since range filter
  on `audit annotations` / `ops lineage` uses a B-tree seek, not a scan.

Round 5 greptile findings:

- text grader: the `on` field is now validated at load time (only
  final_message is accepted), so a typo like `on: message` is a manifest
  error instead of silently grading FinalMessage. evidenceField returns ""
  for any unknown field as defense-in-depth.
- migration 0008: document why annotations deliberately has NO
  session_meta FK — a cascade FK would be fired by the rederive
  INSERT-OR-REPLACE and wipe the durable verdicts the table exists to keep;
  cleanup on a real purge lives in store.DeleteSession.
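
A minimal sketch of the load-time validation plus defense-in-depth pairing from the text-grader finding. The struct shapes are hypothetical, and whether an omitted `on` defaults to final_message is not stated in this PR text; this sketch simply rejects anything other than the literal value.

```go
package main

import "fmt"

// textGrader is a hypothetical manifest entry for the text grader.
type textGrader struct {
	On       string // which evidence field to grade
	Contains string // substring the field must contain
}

// validate runs at manifest load time, so a typo such as "message"
// is reported as a manifest error before any grading happens.
func (g textGrader) validate() error {
	if g.On != "final_message" {
		return fmt.Errorf("text grader: unsupported on %q (only final_message is accepted)", g.On)
	}
	return nil
}

// evidence is a hypothetical slice of the captured trace.
type evidence struct {
	FinalMessage string
}

// evidenceField returns "" for any unknown field, so even if validation
// were bypassed, a typo could not silently grade FinalMessage.
func evidenceField(ev evidence, field string) string {
	switch field {
	case "final_message":
		return ev.FinalMessage
	default:
		return ""
	}
}

func main() {
	bad := textGrader{On: "message", Contains: "done"}
	fmt.Println(bad.validate() != nil)                                    // true
	fmt.Println(evidenceField(evidence{FinalMessage: "done"}, "message")) // prints an empty line
}
```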

Round 6 greptile findings:

- promote gate: latestPassingEval now clears only when the MOST RECENT eval
  for the locked commit passed — a newer failing run is no longer overridden
  by an older pass, so a known regression can't be promoted past. Pinned by
  TestLatestPassingEval_NewestVerdictWins.
- ListEvalRuns: per-case rows now load only when EvalRunFilter.IncludeCases
  is set, so header-only callers (lineage, the promote gate) skip the N+1.
  WHERE-building and case-loading extracted to helpers (keeps cyclo <=15).
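
The "newest verdict wins" rule from the promote-gate finding can be illustrated with the sketch below. The `evalRun` struct and its `Seq` ordering field are hypothetical (the real store presumably orders by timestamp or row id), and `newestVerdictPasses` is an illustrative name, not the branch's function.

```go
package main

import "fmt"

// evalRun is a hypothetical eval record; Seq grows with each new run.
type evalRun struct {
	Commit string
	Passed bool
	Seq    int64
}

// newestVerdictPasses returns true only when the most recent run for
// commit passed. An older pass never overrides a newer failure.
func newestVerdictPasses(runs []evalRun, commit string) bool {
	var newest *evalRun
	for i := range runs {
		r := &runs[i]
		if r.Commit != commit {
			continue
		}
		if newest == nil || r.Seq > newest.Seq {
			newest = r
		}
	}
	return newest != nil && newest.Passed
}

func main() {
	runs := []evalRun{
		{Commit: "abc", Passed: true, Seq: 1},  // older pass
		{Commit: "abc", Passed: false, Seq: 2}, // newer regression
	}
	fmt.Println(newestVerdictPasses(runs, "abc")) // false
}
```

A gate that instead asked "does any passing run exist for this commit?" would return true for the same data, which is the behaviour the fix removes.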

Round 7 greptile findings (style/docs only):

- Remove the unused OutcomeUnknown constant — "unknown" is the ABSENCE of
  the attribute (""), and every comparison site already checks "", so the
  sentinel was dead and potentially misleading.
- migration 0009: document why eval_runs.session_id deliberately omits a FK
  (durable verdict outlives the session; DeleteSession NULLs the pointer),
  mirroring the rationale comment migration 0008 carries.

Round 8 greptile finding (security): the CI recipe piped
`curl -fsSL https://quiver.sh/install | sh`, so an adopter copying it
verbatim would execute unauthenticated network content in the runner with
full secrets access. Replace both jobs' install step with a pinned
`go install github.com/astra-sh/qvr@<tag>`, verified through the Go module
checksum database.