diff --git a/README.md b/README.md index 541e7236..9e688499 100644 --- a/README.md +++ b/README.md @@ -112,6 +112,13 @@ Validate remediation regression manifests and expected auto-fix diffs with: ruby scripts/test_remediation_fixtures.rb ``` +Regenerate the deterministic skill quality scorecard with: + +```bash +ruby scripts/generate_quality_scorecard.rb +ruby scripts/generate_quality_scorecard.rb --check +``` + CI/CD examples for GitHub Actions, GitLab CI, Azure DevOps, Jenkins, pre-commit, and local agent usage are available in [`docs/ci-cd-examples.md`](docs/ci-cd-examples.md). Validate those examples diff --git a/docs/quality-scorecard.md b/docs/quality-scorecard.md new file mode 100644 index 00000000..2b98ec4b --- /dev/null +++ b/docs/quality-scorecard.md @@ -0,0 +1,59 @@ +# Skill Quality Scorecard + +This file is generated by `ruby scripts/generate_quality_scorecard.rb`. Do not edit it by hand. + +The scorecard uses deterministic repository evidence only: `index.yaml`, `SKILL.md` frontmatter, and `tests/fixtures/*/*/manifest.yaml`. The fixture harness validates expected evidence strings, but it does not execute skills against targets, so real-world accuracy and false-positive rates are tracked as `not measured` until an execution-based benchmark exists. + +Readiness score: + +- 2 points: required frontmatter is present and matches the indexed skill id. +- 1 point: indexed frameworks match frontmatter and fixture findings include CWE or framework mappings. +- 2 points: both vulnerable and benign fixture coverage exist; 1 point for one-sided fixture coverage. + +| Skill | Fixture coverage | True-positive fixtures | False-positive fixtures | Accuracy | False-positive rate | Framework accuracy | Last audit date | Score | Status | +|---|---:|---:|---:|---|---|---|---|---:|---| +| threat-modeling | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| secure-code-review | 1 vulnerable / 1 benign | 1 expected finding(s) | 1 benign case(s) | not measured; 1 evidence string(s) validated | not measured; 1 benign fixture(s) | valid | not recorded | 5/5 | covered | +| owasp-top-10-web | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| api-security | 1 vulnerable / 1 benign | 1 expected finding(s) | 1 benign case(s) | not measured; 1 evidence string(s) validated | not measured; 1 benign fixture(s) | valid | not recorded | 5/5 | covered | +| dependency-scanning | 1 vulnerable / 1 benign | 1 expected finding(s) | 1 benign case(s) | not measured; 1 evidence string(s) validated | not measured; 1 benign fixture(s) | valid | not recorded | 5/5 | covered | +| iam-review | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| access-review | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| rbac-design | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| zero-trust-assessment | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| privileged-access | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| aws-review | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| azure-review | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| gcp-review | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| iac-security | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| container-security | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| cve-triage | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| patch-prioritization | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| sbom-analysis | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| scanner-tuning | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| llm-top-10 | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| agentic-top-10 | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| prompt-injection | 1 vulnerable / 1 benign | 1 expected finding(s) | 1 benign case(s) | not measured; 1 evidence string(s) validated | not measured; 1 benign fixture(s) | valid | not recorded | 5/5 | covered | +| model-supply-chain | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| ai-data-privacy | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| agent-security | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| ir-playbook | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| forensics-checklist | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| containment | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| post-incident-review | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| soc2-gap | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| iso27001-gap | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| pci-dss-review | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| hipaa-review | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| nist-csf-assessment | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| detection-engineering | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| siem-rules | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| alert-triage | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| log-analysis | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| firewall-review | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| segmentation | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| dns-security | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| pipeline-security | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| secrets-management | 1 vulnerable / 1 benign | 1 expected finding(s) | 1 benign case(s) | not measured; 1 evidence string(s) validated | not measured; 1 benign fixture(s) | valid | not recorded | 5/5 | covered | +| sast-config | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | +| dast-config | 0 vulnerable / 0 benign | 0 expected finding(s) | 0 benign case(s) | not measured; 0 evidence string(s) validated | not measured; 0 benign fixture(s) | valid | not recorded | 3/5 | metadata-only | diff --git a/scripts/generate_quality_scorecard.rb b/scripts/generate_quality_scorecard.rb new file mode 100644 index 00000000..a3d2a612 --- /dev/null +++ b/scripts/generate_quality_scorecard.rb @@ -0,0 +1,249 @@ +#!/usr/bin/env ruby +# frozen_string_literal: true + +require "yaml" + +ROOT = File.expand_path("..", __dir__) +INDEX_PATH = File.join(ROOT, "index.yaml") +FIXTURE_ROOT = File.join(ROOT, "tests", "fixtures") +OUTPUT_PATH = File.join(ROOT, "docs", "quality-scorecard.md") +REQUIRED_FRONTMATTER_FIELDS = %w[ + name description tags role phase frameworks difficulty time_estimate version + author license allowed-tools injection-hardened +].freeze + +def rel(path) + path.delete_prefix("#{ROOT}#{File::SEPARATOR}") +end + +def load_yaml(path) + YAML.safe_load(File.read(path), permitted_classes: [], aliases: false) || {} +end + +def parse_value(raw) + value = raw.strip + return nil if value.empty? + + if value.start_with?("[") && value.end_with?("]") + inner = value[1...-1].strip + return [] if inner.empty? + + inner.split(",").map { |item| parse_value(item) } + elsif (value.start_with?('"') && value.end_with?('"')) || + (value.start_with?("'") && value.end_with?("'")) + value[1...-1] + elsif value == "true" + true + elsif value == "false" + false + else + value + end +end + +def parse_index(path) + section = nil + current = nil + items = { "skills" => [] } + + File.readlines(path, chomp: true).each do |line| + stripped = line.strip + next if stripped.empty? || stripped.start_with?("#") + + if line =~ /^([a-z_]+):\s*$/ + section = Regexp.last_match(1) + current = nil + next + end + + next unless section == "skills" + + if line =~ /^ - ([A-Za-z0-9_-]+):\s*(.+?)\s*$/ + current = {} + items["skills"] << current + current[Regexp.last_match(1)] = parse_value(Regexp.last_match(2)) + elsif line =~ /^ ([A-Za-z0-9_-]+):\s*(.+?)\s*$/ + next unless current + + current[Regexp.last_match(1)] = parse_value(Regexp.last_match(2)) + end + end + + items +end + +def frontmatter_for(path) + text = File.read(path) + match = text.match(/\A---\s*\n(.*?)\n---\s*(?:\n|\z)/m) + raise "missing YAML frontmatter delimited by ---" unless match + + YAML.safe_load(match[1], permitted_classes: [], aliases: false) || {} +end + +def fixture_manifests + return [] unless Dir.exist?(FIXTURE_ROOT) + + Dir.glob(File.join(FIXTURE_ROOT, "*", "*", "manifest.yaml")).sort +end + +def fixture_stats_by_skill + stats = Hash.new do |hash, key| + hash[key] = { + vulnerable_cases: 0, + benign_cases: 0, + expected_findings: 0, + mapped_findings: 0 + } + end + + fixture_manifests.each do |path| + manifest = load_yaml(path) + skill = manifest["skill"] + next if skill.nil? || skill == "_example" + + findings = manifest["expected_findings"].is_a?(Array) ? manifest["expected_findings"] : [] + case manifest["kind"] + when "vulnerable" + stats[skill][:vulnerable_cases] += 1 + stats[skill][:expected_findings] += findings.size + stats[skill][:mapped_findings] += findings.count { |finding| finding.is_a?(Hash) && (finding["cwe"] || finding["framework"]) } + when "benign" + stats[skill][:benign_cases] += 1 + end + end + + stats +end + +def frontmatter_status(skill_id, file_path) + return ["missing", []] unless File.file?(file_path) + + frontmatter = frontmatter_for(file_path) + errors = [] + missing = REQUIRED_FRONTMATTER_FIELDS.reject { |field| frontmatter.key?(field) } + errors << "missing #{missing.join(', ')}" unless missing.empty? + errors << "name mismatch" if frontmatter["name"] != skill_id + errors << "no frameworks" unless frontmatter["frameworks"].is_a?(Array) && !frontmatter["frameworks"].empty? + + [errors.empty? ? "valid" : "invalid", errors] +rescue StandardError => e + ["invalid", [e.message]] +end + +def framework_status(index_entry, frontmatter_status_value, file_path, stats) + return "invalid" unless frontmatter_status_value == "valid" + + frontmatter = frontmatter_for(file_path) + indexed = index_entry["frameworks"] + declared = frontmatter["frameworks"] + return "invalid" unless indexed.is_a?(Array) && declared.is_a?(Array) && indexed == declared + return "valid" if stats[:expected_findings].zero? + + stats[:mapped_findings] == stats[:expected_findings] ? "valid" : "invalid" +rescue StandardError + "invalid" +end + +def readiness(stats, metadata_valid, framework_valid) + score = 0 + score += 2 if metadata_valid + score += 1 if framework_valid + + fixture_points = + if stats[:vulnerable_cases].positive? && stats[:benign_cases].positive? + 2 + elsif stats[:vulnerable_cases].positive? || stats[:benign_cases].positive? + 1 + else + 0 + end + score += fixture_points + + status = + if score == 5 + "covered" + elsif metadata_valid && framework_valid + "metadata-only" + else + "needs-attention" + end + + ["#{score}/5", status] +end + +def markdown_escape(value) + value.to_s.gsub("|", "\\|") +end + +def generate_markdown + index = parse_index(INDEX_PATH) + skills = index.fetch("skills", []) + fixture_stats = fixture_stats_by_skill + + lines = [] + lines << "# Skill Quality Scorecard" + lines << "" + lines << "This file is generated by `ruby scripts/generate_quality_scorecard.rb`. Do not edit it by hand." + lines << "" + lines << "The scorecard uses deterministic repository evidence only: `index.yaml`, `SKILL.md` frontmatter, and `tests/fixtures/*/*/manifest.yaml`. The fixture harness validates expected evidence strings, but it does not execute skills against targets, so real-world accuracy and false-positive rates are tracked as `not measured` until an execution-based benchmark exists." + lines << "" + lines << "Readiness score:" + lines << "" + lines << "- 2 points: required frontmatter is present and matches the indexed skill id." + lines << "- 1 point: indexed frameworks match frontmatter and fixture findings include CWE or framework mappings." + lines << "- 2 points: both vulnerable and benign fixture coverage exist; 1 point for one-sided fixture coverage." + lines << "" + lines << "| Skill | Fixture coverage | True-positive fixtures | False-positive fixtures | Accuracy | False-positive rate | Framework accuracy | Last audit date | Score | Status |" + lines << "|---|---:|---:|---:|---|---|---|---|---:|---|" + + skills.each do |entry| + skill_id = entry["id"] + file_path = File.join(ROOT, entry["file"].to_s) + stats = fixture_stats[skill_id] + frontmatter_status_value, = frontmatter_status(skill_id, file_path) + framework_status_value = framework_status(entry, frontmatter_status_value, file_path, stats) + score, status = readiness(stats, frontmatter_status_value == "valid", framework_status_value == "valid") + + coverage = "#{stats[:vulnerable_cases]} vulnerable / #{stats[:benign_cases]} benign" + tp_fixtures = "#{stats[:expected_findings]} expected finding(s)" + fp_fixtures = "#{stats[:benign_cases]} benign case(s)" + accuracy = "not measured; #{stats[:expected_findings]} evidence string(s) validated" + false_positive_rate = "not measured; #{stats[:benign_cases]} benign fixture(s)" + + lines << [ + markdown_escape(skill_id), + markdown_escape(coverage), + markdown_escape(tp_fixtures), + markdown_escape(fp_fixtures), + markdown_escape(accuracy), + markdown_escape(false_positive_rate), + markdown_escape(framework_status_value), + "not recorded", + markdown_escape(score), + markdown_escape(status) + ].join(" | ").prepend("| ").concat(" |") + end + + lines << "" + lines.join("\n") +end + +def check_output(expected) + actual = File.file?(OUTPUT_PATH) ? File.read(OUTPUT_PATH) : nil + return true if actual == expected + + warn "#{rel(OUTPUT_PATH)} is not current. Run `ruby scripts/generate_quality_scorecard.rb`." + false +end + +expected = generate_markdown + +if ARGV == ["--check"] + exit(check_output(expected) ? 0 : 1) +elsif ARGV.empty? + File.write(OUTPUT_PATH, expected) + puts "Wrote #{rel(OUTPUT_PATH)}" +else + warn "Usage: ruby scripts/generate_quality_scorecard.rb [--check]" + exit 2 +end