[Experimental] SkillOps — skill self-improvement loop (do not merge) by raks097 · Pull Request #266 · astra-sh/qvr

raks097 · 2026-06-17T03:51:16Z

⚗️ Experimental branch — not for merge. feat/skill-evolution-loop is a
long-lived experimental integration branch kept separate from main. This PR
is the review/CI/greptile surface; please do not merge it into main.

What this is

SkillOps — the qvr-native substrate for an evidence-gated skill self-improvement
loop, built so the loop is recreatable on top of qvr without bloating the binary.
The design is layered: qvr core gets the deterministic, identity-bound pieces;
the agent/LLM work lives in a skill on top.

The seam that keeps core pure: the naive approach re-runs an agent to synthesize a
trace, then grades it. qvr already has the trace from real, lock-verified usage —
so execution stays with the agent, and only grading + the gate + lineage live in the
binary. No model calls, no scheduler in qvr.

qvr core

- Outcome signal: TOOL/SKILL spans carry qvr.outcome (success/failure/blocked), rolled up to a session verdict; OTLP span status now reflects it. Deriver v7→v8 (rederive backfills). Migration 0007.
- Human feedback: qvr audit annotate / audit annotations record a reviewer's verdict in a table that survives rederive (durable human input, not a regenerable projection). Migration 0008.
- Evals-as-data: evals.yaml manifest + 6 deterministic graders (outcome, text, tool_sequence, tool_constraint, skill_invocation, behavior) graded over the trace qvr already captured. internal/eval/.
- Eval gate: qvr ops eval run (exits non-zero on regression), qvr ops lineage, qvr ops promote, results keyed by {skill, commit}. Migration 0009.
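
The roll-up from span outcomes to a session verdict can be pictured with a minimal sketch. The precedence used here (failure beats blocked, blocked beats success) is an assumption made for illustration, as are the names `Outcome` and `rollUp`; the PR does not spell out the actual rule. The empty string stands for "unknown", matching the later note that unknown is the absence of the attribute.

```go
package main

import "fmt"

// Outcome mirrors the qvr.outcome values named in the PR description.
// The empty string means the attribute is absent ("unknown").
type Outcome string

const (
	OutcomeSuccess Outcome = "success"
	OutcomeFailure Outcome = "failure"
	OutcomeBlocked Outcome = "blocked"
)

// rollUp folds per-span outcomes into one session verdict.
// ASSUMPTION: failure > blocked > success; the real deriver may differ.
func rollUp(spans []Outcome) Outcome {
	verdict := Outcome("")
	for _, o := range spans {
		switch o {
		case OutcomeFailure:
			return OutcomeFailure // any failure decides the session
		case OutcomeBlocked:
			verdict = OutcomeBlocked
		case OutcomeSuccess:
			if verdict == "" {
				verdict = OutcomeSuccess
			}
		}
	}
	return verdict
}

func main() {
	spans := []Outcome{OutcomeSuccess, OutcomeBlocked, OutcomeSuccess}
	fmt.Println(rollUp(spans))
}
```

A session with no outcome-bearing spans stays at `""` in this sketch, so callers compare against the empty string rather than a sentinel constant.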

On top (a shipped skill)

skills/evolve-skill/ — the orchestrator the agent runs over the CLI
(observe → edit → re-run → gate → open a PR), never auto-merges or
auto-installs, plus a CI recipe.

End-to-end acceptance

Two end-to-end acceptance tests drive the real CLI:

- cmd/acceptance_loop_test.go: triage label correctness (fail→pass).
- cmd/acceptance_behavioral_test.go: "verify edits by running tests" (behavioral graders, fail→pass).

Quality

make build / full go test ./... / make lint (0 issues) / gocyclo ≤15 /
modernize / vet all clean. Greptile review loop run on the branch; findings
addressed in abe5b42.

…vals

Adds the qvr-core layer for evidence-gated skill evolution, keeping the binary pure Go (no model calls, no scheduler):

- Outcome signal: TOOL/SKILL spans carry qvr.outcome (success|failure|blocked), rolled up to a session verdict; OTLP span status now reflects it instead of a hardcoded OK. Deriver v7->v8 (rederive backfills). Migration 0007.
- Human annotations: `qvr audit annotate` / `audit annotations` record a reviewer's verdict in a table that survives rederive (it is durable human input, not a regenerable projection). Migration 0008.
- Evals-as-data: evals.yaml manifest + deterministic graders (outcome, text, tool_sequence, tool_constraint, skill_invocation, behavior) graded over the trace qvr already captured, with no model calls. internal/eval/.
- Eval store + gate: `qvr ops eval run` (exits non-zero on regression), `ops lineage`, `ops promote`, keyed by {skill, commit}. Migration 0009.

The seam that keeps core pure: execution stays with the agent; grading, the gate, and lineage are the only parts in the binary, because qvr already has the real lock-verified trace.

A qvr-installable skill that runs the outer loop over the CLI: observe (qvr audit) -> edit (qvr edit) -> re-run -> gate (qvr ops eval run) -> open a PR. Evidence-gated; never auto-merges or auto-installs. Ships a CI recipe for running the loop unattended.

- ops promote: surface store errors from latestPassingEval instead of swallowing them, so a transient DB failure no longer reads as "no passing eval" and silently blocks/misleads a CI gate (P1).
- ops eval: evalRunRow takes the suite as a parameter rather than closing over the flag global, so it is pure and testable in isolation.
- eval evidence: lastAssistantText now requires role=="assistant" before returning content, guarding against a future multi-message output capture.
- eval evidence: Evidence.Tools now counts the same tool-sequence the tool_constraint grader walks, so a `maxTools` ceiling means the same thing under behavior and tool_constraint graders.
- migration 0009: eval_case_results.eval_run_id is now a FK to eval_runs(id) ON DELETE CASCADE (FK enforcement is on), so case rows can't be orphaned by a future eval-run prune.

Mirror the evalRunRow refactor: promoteDecision now receives force/reason as explicit arguments instead of reading the Cobra flag globals, so the pure gate logic is independently testable and its sub-tests no longer mutate package state.

Round 3 greptile findings:

- ListEvalRuns / ListAnnotations now return an empty result (not a raw "no such table" error) when their table is absent. This is the case a read-only open hits on a DB that predates migrations 0008/0009 (read-only opens skip migration apply). `qvr ops lineage` and `qvr audit annotations` therefore degrade gracefully on a freshly-upgraded install. New noSuchTable() guard.
- DeleteSession nulls eval_runs.session_id for the deleted session instead of leaving it dangling. The eval verdict is durable lineage evidence keyed by {skill, commit}, so it outlives the session; only the stale pointer is cleared.

Round 4 greptile findings:

- ops promote: when a skill has no locked commit, the refusal now points straight at --force-no-eval instead of suggesting `qvr ops eval run`, which would record under an empty commit and loop back to the same refusal.
- migration 0008: index annotations(created_at) so the --since range filter on `audit annotations` / `ops lineage` uses a B-tree seek, not a scan.

Round 5 greptile findings:

- text grader: the `on` field is now validated at load time (only final_message is accepted), so a typo like `on: message` is a manifest error instead of silently grading FinalMessage. evidenceField returns "" for any unknown field as defense-in-depth.
- migration 0008: document why annotations deliberately has NO session_meta FK. A cascade FK would be fired by the rederive INSERT-OR-REPLACE and wipe the durable verdicts the table exists to keep; cleanup on a real purge lives in store.DeleteSession.
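
The load-time check plus the defensive fallback could be sketched as follows. `TextGrader`, `Evidence`, and `Validate` are hypothetical names for illustration; only `evidenceField`, the `on` field, and the `final_message` value come from the commit message. Whether an omitted `on` defaults to `final_message` is not stated in the PR, so this sketch simply rejects everything else.

```go
package main

import "fmt"

// TextGrader is a hypothetical stand-in for one text-grader manifest entry.
type TextGrader struct {
	On       string // which piece of evidence to grade
	Contains string // substring the evidence is expected to include
}

// Validate runs at manifest load time and rejects any `on` value other
// than final_message, so a typo fails fast instead of grading the wrong field.
func (g TextGrader) Validate() error {
	if g.On != "final_message" {
		return fmt.Errorf("text grader: unsupported on=%q (only final_message is accepted)", g.On)
	}
	return nil
}

// Evidence is a hypothetical container for what the trace captured.
type Evidence struct {
	FinalMessage string
}

// evidenceField returns "" for unknown fields as defense-in-depth, in case
// an unvalidated grader ever reaches grading.
func evidenceField(e Evidence, field string) string {
	if field == "final_message" {
		return e.FinalMessage
	}
	return ""
}

func main() {
	bad := TextGrader{On: "message", Contains: "done"}
	fmt.Println(bad.Validate())
	fmt.Printf("%q\n", evidenceField(Evidence{FinalMessage: "done"}, "message"))
}
```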

Round 6 greptile findings:

- promote gate: latestPassingEval now clears only when the MOST RECENT eval for the locked commit passed. A newer failing run is no longer overridden by an older pass, so a known regression can't be promoted past. Pinned by TestLatestPassingEval_NewestVerdictWins.
- ListEvalRuns: per-case rows now load only when EvalRunFilter.IncludeCases is set, so header-only callers (lineage, the promote gate) skip the N+1. WHERE-building and case-loading extracted to helpers (keeps cyclo <=15).

Round 7 greptile findings (style/docs only):

- Remove the unused OutcomeUnknown constant. "unknown" is the ABSENCE of the attribute (""), and every comparison site already checks "", so the sentinel was dead and potentially misleading.
- migration 0009: document why eval_runs.session_id deliberately omits a FK (durable verdict outlives the session; DeleteSession NULLs the pointer), mirroring the rationale comment migration 0008 carries.

Round 8 greptile finding (security): the CI recipe piped `curl -fsSL https://quiver.sh/install | sh`, so an adopter copying it verbatim would execute unauthenticated network content in the runner with full secrets access. Replace both jobs' install step with a pinned `go install github.com/astra-sh/qvr@<tag>`, verified through the Go module checksum database.

Rakshith added 10 commits June 16, 2026 23:41

raks097 wants to merge 10 commits into main from feat/skill-evolution-loop
