[Experimental] SkillOps — skill self-improvement loop (do not merge)#266
Draft
raks097 wants to merge 10 commits into
added 10 commits
June 16, 2026 23:41
…vals
Adds the qvr-core layer for evidence-gated skill evolution, keeping the
binary pure Go (no model calls, no scheduler):
- Outcome signal: TOOL/SKILL spans carry qvr.outcome (success|failure|
blocked), rolled up to a session verdict; OTLP span status now reflects
it instead of a hardcoded OK. Deriver v7->v8 (rederive backfills).
Migration 0007.
- Human annotations: `qvr audit annotate` / `audit annotations` record a
reviewer's verdict in a table that survives rederive (it is durable human
input, not a regenerable projection). Migration 0008.
- Evals-as-data: evals.yaml manifest + deterministic graders (outcome,
text, tool_sequence, tool_constraint, skill_invocation, behavior) graded
over the trace qvr already captured — no model calls. internal/eval/.
- Eval store + gate: `qvr ops eval run` (exits non-zero on regression),
`ops lineage`, `ops promote`, keyed by {skill, commit}. Migration 0009.
The seam that keeps core pure: execution stays with the agent; grading,
the gate, and lineage are the only parts in the binary, because qvr already
has the real lock-verified trace.
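The roll-up from per-span outcomes to a session verdict could look roughly like the sketch below. The precedence (failure over blocked over success) is an assumption made for illustration and is not stated in this commit; `rollUp` and the constant names are hypothetical. An empty string stands for a span with no `qvr.outcome` attribute.

```go
package main

import "fmt"

// Outcome values mirror the qvr.outcome attribute described above.
// The empty string means "attribute absent".
const (
	OutcomeSuccess = "success"
	OutcomeFailure = "failure"
	OutcomeBlocked = "blocked"
)

// rollUp folds per-span outcomes into a single session verdict.
// ASSUMPTION: failure outranks blocked, which outranks success; the
// actual precedence used by the deriver is not given in the source.
func rollUp(spanOutcomes []string) string {
	verdict := ""
	for _, o := range spanOutcomes {
		switch o {
		case OutcomeFailure:
			return OutcomeFailure // nothing can outrank a failure
		case OutcomeBlocked:
			verdict = OutcomeBlocked
		case OutcomeSuccess:
			if verdict == "" {
				verdict = OutcomeSuccess
			}
		}
	}
	return verdict // "" when no span carried an outcome
}

func main() {
	fmt.Println(rollUp([]string{"success", "", "blocked"})) // prints "blocked"
}
```

Because the function is pure, rederiving a session from the same spans always yields the same verdict, which is what a backfill relies on.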
A qvr-installable skill that runs the outer loop over the CLI: observe (qvr audit) -> edit (qvr edit) -> re-run -> gate (qvr ops eval run) -> open a PR. Evidence-gated; never auto-merges or auto-installs. Ships a CI recipe for running the loop unattended.
- ops promote: surface store errors from latestPassingEval instead of
swallowing them, so a transient DB failure no longer reads as "no passing
eval" and silently blocks or misleads a CI gate (P1).
- ops eval: evalRunRow takes the suite as a parameter rather than closing
over the flag global, so it is pure and testable in isolation.
- eval evidence: lastAssistantText now requires role=="assistant" before
returning content, guarding against a future multi-message output capture.
- eval evidence: Evidence.Tools now counts the same tool sequence the
tool_constraint grader walks, so a `maxTools` ceiling means the same thing
under the behavior and tool_constraint graders.
- migration 0009: eval_case_results.eval_run_id is now a FK to
eval_runs(id) ON DELETE CASCADE (FK enforcement is on), so case rows can't
be orphaned by a future eval-run prune.
Mirror the evalRunRow refactor: promoteDecision now receives force/reason as explicit arguments instead of reading the Cobra flag globals, so the pure gate logic is independently testable and its sub-tests no longer mutate package state.
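The refactor pattern described here (explicit arguments instead of flag globals) might be sketched as follows. The signature and the refusal rules are assumptions for illustration, in particular the requirement that a forced promotion carry a reason; only the name `promoteDecision` and the force/reason inputs come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// promoteDecision is a hypothetical stand-in for the pure gate logic.
// Every input arrives as an argument, so a test can call it directly
// without setting or restoring package-level Cobra flag variables.
func promoteDecision(hasPassingEval, force bool, reason string) error {
	if hasPassingEval {
		return nil
	}
	if force {
		// ASSUMPTION: a forced promotion must carry a reason.
		if reason == "" {
			return errors.New("forced promotion requires a reason")
		}
		return nil
	}
	return errors.New("no passing eval for the locked commit")
}

func main() {
	fmt.Println(promoteDecision(false, false, ""))
	fmt.Println(promoteDecision(false, true, "hotfix"))
}
```

The Cobra command would then read its flags once and pass them in, leaving the decision itself free of shared state.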
Round 3 greptile findings:
- ListEvalRuns / ListAnnotations now return an empty result (not a raw
"no such table" error) when their table is absent — the case a read-only
open hits on a DB that predates migrations 0008/0009 (read-only opens skip
migration apply). `qvr ops lineage` and `qvr audit annotations` therefore
degrade gracefully on a freshly-upgraded install. New noSuchTable() guard.
- DeleteSession nulls eval_runs.session_id for the deleted session instead
of leaving it dangling. The eval verdict is durable lineage evidence keyed
by {skill, commit}, so it outlives the session; only the stale pointer is
cleared.
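A guard of this kind could be sketched as below. Matching on the error text is an assumption about how `noSuchTable()` is implemented; SQLite reports a missing table with a message of the form `no such table: <name>`. The `listEvalRuns` wrapper, the `evalRun` type, and the demo error variables are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Demo errors standing in for what a database driver might return.
var (
	errMissingTable = errors.New("no such table: eval_runs")
	errDiskIO       = errors.New("disk I/O error")
)

// noSuchTable reports whether err looks like SQLite's missing-table error.
// ASSUMPTION: the check is textual; the real guard may inspect a typed error.
func noSuchTable(err error) bool {
	return err != nil && strings.Contains(err.Error(), "no such table")
}

type evalRun struct{ Skill, Commit string }

// listEvalRuns runs query and downgrades a missing table to an empty result,
// while passing every other error through untouched.
func listEvalRuns(query func() ([]evalRun, error)) ([]evalRun, error) {
	runs, err := query()
	if noSuchTable(err) {
		return nil, nil
	}
	return runs, err
}

func main() {
	runs, err := listEvalRuns(func() ([]evalRun, error) { return nil, errMissingTable })
	fmt.Println(len(runs), err)
}
```

Only the missing-table case is treated as empty; other failures still surface, which keeps the guard from hiding real store errors.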
Round 4 greptile findings:
- ops promote: when a skill has no locked commit, the refusal now points
straight at --force-no-eval instead of suggesting `qvr ops eval run`, which
would record under an empty commit and loop back to the same refusal.
- migration 0008: index annotations(created_at) so the --since range filter
on `audit annotations` / `ops lineage` uses a B-tree seek, not a scan.
Round 5 greptile findings:
- text grader: the `on` field is now validated at load time (only
final_message is accepted), so a typo like `on: message` is a manifest
error instead of silently grading FinalMessage. evidenceField returns ""
for any unknown field as defense-in-depth.
- migration 0008: document why annotations deliberately has NO session_meta
FK: a cascade FK would be fired by the rederive INSERT-OR-REPLACE and wipe
the durable verdicts the table exists to keep; cleanup on a real purge
lives in store.DeleteSession.
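The load-time check plus the defensive fallback could be sketched like this. The struct shapes and `validateTextGrader` are hypothetical; only the accepted value `final_message`, the `evidenceField` name, and its empty-string fallback come from the commit message. Whether an omitted `on` defaults to `final_message` is not stated, so this sketch simply rejects anything else.

```go
package main

import "fmt"

// textGrader is a hypothetical shape for one text-grader entry in evals.yaml.
type textGrader struct {
	On       string // which piece of evidence to grade
	Contains string // substring the evidence must contain
}

// validateTextGrader rejects unsupported `on` values when the manifest loads,
// so a typo fails fast instead of grading the wrong field.
func validateTextGrader(g textGrader) error {
	if g.On != "final_message" {
		return fmt.Errorf("text grader: unsupported on %q (only final_message is accepted)", g.On)
	}
	return nil
}

// evidence is a hypothetical container for what the trace captured.
type evidence struct{ FinalMessage string }

// evidenceField returns "" for any unknown field as defense-in-depth,
// in case a bad value ever slips past validation.
func evidenceField(ev evidence, field string) string {
	if field == "final_message" {
		return ev.FinalMessage
	}
	return ""
}

func main() {
	fmt.Println(validateTextGrader(textGrader{On: "message"}))
}
```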
Round 6 greptile findings:
- promote gate: latestPassingEval now clears only when the MOST RECENT eval
for the locked commit passed. A newer failing run is no longer overridden
by an older pass, so a known regression can't be promoted past. Pinned by
TestLatestPassingEval_NewestVerdictWins.
- ListEvalRuns: per-case rows now load only when
EvalRunFilter.IncludeCases is set, so header-only callers (lineage, the
promote gate) skip the N+1. WHERE-building and case-loading extracted to
helpers (keeps cyclo <=15).
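The newest-verdict-wins rule can be illustrated with an in-memory sketch. The `evalRun` fields, the `at` helper, and this signature are assumptions; the real `latestPassingEval` reads from the store and also returns store errors, which this sketch leaves out.

```go
package main

import (
	"fmt"
	"time"
)

// evalRun is a hypothetical header row for one recorded eval run.
type evalRun struct {
	Commit    string
	Passed    bool
	CreatedAt time.Time
}

// at builds a timestamp from Unix seconds; a helper for the demo only.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// latestPassingEval reports whether the most recent run for commit passed.
// An older pass never overrides a newer failure.
func latestPassingEval(runs []evalRun, commit string) bool {
	var newest *evalRun
	for i := range runs {
		r := &runs[i]
		if r.Commit != commit {
			continue
		}
		if newest == nil || r.CreatedAt.After(newest.CreatedAt) {
			newest = r
		}
	}
	return newest != nil && newest.Passed
}

func main() {
	runs := []evalRun{
		{Commit: "abc", Passed: true, CreatedAt: at(100)},
		{Commit: "abc", Passed: false, CreatedAt: at(200)}, // newer regression
	}
	fmt.Println(latestPassingEval(runs, "abc")) // prints "false"
}
```

A commit with no recorded runs also returns false, so the gate stays closed until evidence exists.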
Round 7 greptile findings (style/docs only):
- Remove the unused OutcomeUnknown constant — "unknown" is the ABSENCE of
the attribute (""), and every comparison site already checks "", so the
sentinel was dead and potentially misleading.
- migration 0009: document why eval_runs.session_id deliberately omits a FK
(durable verdict outlives the session; DeleteSession NULLs the pointer),
mirroring the rationale comment migration 0008 carries.
Round 8 greptile finding (security): the CI recipe piped `curl -fsSL https://quiver.sh/install | sh`, so an adopter copying it verbatim would execute unauthenticated network content in the runner with full secrets access. Replace both jobs' install step with a pinned `go install github.com/astra-sh/qvr@<tag>`, verified through the Go module checksum database.
What this is
SkillOps — the qvr-native substrate for an evidence-gated skill self-improvement
loop, built so the loop is recreatable on top of qvr without bloating the binary.
The design is layered: qvr core gets the deterministic, identity-bound pieces;
the agent/LLM work lives in a skill on top.
The seam that keeps core pure: the naive approach re-runs an agent to synthesize a
trace, then grades it. qvr already has the trace from real, lock-verified usage —
so execution stays with the agent, and only grading + the gate + lineage live in the
binary. No model calls, no scheduler in qvr.
qvr core
- Outcome signal: TOOL/SKILL spans carry `qvr.outcome` (`success`/`failure`/`blocked`), rolled up to a session verdict; OTLP span status now reflects it. Deriver `v7→v8` (rederive backfills). Migration `0007`.
- Human annotations: `qvr audit annotate` / `audit annotations` record a reviewer's verdict in a table that survives rederive (durable human input, not a regenerable projection). Migration `0008`.
- Evals-as-data: `evals.yaml` manifest + 6 deterministic graders (`outcome`, `text`, `tool_sequence`, `tool_constraint`, `skill_invocation`, `behavior`) graded over the trace qvr already captured. `internal/eval/`.
- Eval store + gate: `qvr ops eval run` (exits non-zero on regression), `qvr ops lineage`, `qvr ops promote`, results keyed by `{skill, commit}`. Migration `0009`.

On top (a shipped skill)
- `skills/evolve-skill/`: the orchestrator the agent runs over the CLI (observe → edit → re-run → gate → open a PR). It never auto-merges or auto-installs, and it ships with a CI recipe.
End-to-end acceptance
Two end-to-end acceptance tests drive the real CLI:
- `cmd/acceptance_loop_test.go`: triage label correctness (fail→pass).
- `cmd/acceptance_behavioral_test.go`: "verify edits by running tests" (behavioral graders, fail→pass).
Quality
`make build` / full `go test ./...` / `make lint` (0 issues) / gocyclo ≤15 / modernize / vet all clean. Greptile review loop run on the branch; findings addressed in `abe5b42`.