Skip to content

chore(licenses): add dependency license policy + CI check#724

Open
cbcoutinho wants to merge 1 commit into
masterfrom
chore/license-compliance
Open

chore(licenses): add dependency license policy + CI check#724
cbcoutinho wants to merge 1 commit into
masterfrom
chore/license-compliance

Conversation

@cbcoutinho

@cbcoutinho cbcoutinho commented Apr 26, 2026

Copy link
Copy Markdown
Owner

Summary

  • Adds an automated license compliance gate (License check workflow) that runs on every PR touching pyproject.toml / uv.lock. It fails fast on any new dep whose license is not on the allowlist or registered as an explicit per-package exception.
  • Codifies the dual-licensing posture (AGPL-3 today + commercial dual-license — including SaaS — in the future, per the CLA) in .licenses/policy.toml. Permissive licences (MIT/BSD/Apache-2.0/ISC/PSF/MPL-2.0/LGPL via dynamic linking) are auto-accepted; everything else needs a per-package exception with rationale, allowed_for, and review_required_for fields.
  • Documents the contributor workflow for adding/upgrading deps in CONTRIBUTING.md.

Initial audit (193 packages)

  • 186 allowed by the permissive allowlist.
  • 3 allowed via metadata override (wheel metadata is wrong upstream): fastembed (actually Apache-2.0), matplotlib-inline (actually BSD-3), py_rust_stemmers (actually MIT).
  • 4 flagged via per-package exception — these need follow-up before any commercial / SaaS release; tracked in separate issues:
    • pymupdf-layout — Artifex proprietary commercial-only; not actually imported, candidate for removal.
    • PyMuPDF + pymupdf4llm — AGPL-3 OR Artifex Commercial; need an Artifex agreement for non-AGPL distribution.
    • icalendar-searcher — pure AGPL-3-or-later; blocks any commercial channel (incl. proprietary SaaS).

Why SaaS matters here

AGPL §13 makes network use trigger source-disclosure obligations, so a third party offering SaaS based on this code is bound by AGPL unless they hold a separate commercial license from us. The same dependency constraints therefore apply to commercial SaaS as to embedded/proprietary distribution — both are part of the "commercial" channel. The policy file flags this distinction with allowed_for = ["agpl"] vs ["agpl", "commercial"].

Test plan

  • uv run scripts/check-licenses.py passes locally (0 failures, 4 reviewed exceptions, 3 metadata overrides).
  • Synthesised failure path verified by intentionally breaking the policy (caldav GPL/Apache OR-arm, x-wr-timezone "or later" classifier, pythonvCard4 multi-line License field) — script handled all three correctly.
  • CI License check workflow runs green on this PR.

This PR was generated with the help of AI, and reviewed by a Human

Adds an automated license compliance gate so future deps stay compatible
with both AGPL-3 distribution and the planned commercial dual-licensed
build (incl. SaaS offering, since AGPL §13 makes network use a commercial
concern).

- `.licenses/policy.toml` — allowlist of permissive licenses, per-package
  exceptions for AGPL/dual-licensed deps, and metadata overrides for
  packages whose wheel metadata mis-classifies the actual license.
- `scripts/check-licenses.py` — walks every installed package via
  `pip-licenses`, classifies against policy, exits non-zero on policy
  violations, emits a Markdown report suitable for `$GITHUB_STEP_SUMMARY`.
- `.github/workflows/license-check.yml` — runs the script on PRs that
  touch dependency manifests and uploads a JSON report artifact.
- CONTRIBUTING.md — documents the workflow contributors must follow when
  adding/upgrading a dep.
- pip-licenses added as a dev dependency.

Initial run flags four packages for follow-up (tracked separately):
  * pymupdf-layout — proprietary commercial-only, unused → remove
  * PyMuPDF / pymupdf4llm — Artifex commercial license needed for non-AGPL
  * icalendar-searcher — pure AGPL, blocks any commercial channel
@claude

claude Bot commented Apr 26, 2026

Copy link
Copy Markdown

License compliance gate — review

Overall this is a well-structured PR. The policy file is clearly documented, the SPDX composite-expression handling is non-trivial and mostly correct, and pinning all GitHub Actions by commit hash is a good security practice. A few things worth addressing before merging:


Bug: pymupdf-layout (allowed_for = []) silently passes CI

classify() treats any package in [exceptions] as status = "exception", which lands in the warning section rather than the "Failures (blocking)" section. But pymupdf-layout has allowed_for = [] and action = "remove-or-license" — it is not allowed under any channel. CI currently passes for it, which undermines the gate.

Suggested fix — add a special case in classify() after retrieving the exception entry:

if (ex := policy.exceptions.get(name.lower())) is not None:
    if not ex.get("allowed_for"):   # empty list means blocked
        return Verdict(
            name, pkg["Version"], declared,
            ex.get("license", declared), "denied",
            ex.get("rationale", "").strip().splitlines()[0],
            detail=ex,
        )
    # ... existing exception path continues here

Unnecessary permission in CI workflow

pull-requests: write is declared in license-check.yml but the workflow never posts a PR comment — it only writes to $GITHUB_STEP_SUMMARY and uploads an artifact. Remove it to follow least-privilege:

permissions:
  contents: read
  # pull-requests: write  <- unused, remove it

Script runs twice, doubling CI time

The "Check licenses" step runs the script, then "Upload license report" runs it again with || true. When the check fails, the second invocation reruns all of pip-licenses and re-evaluates every package. A --json-out PATH option that writes JSON to a file in the same pass as the markdown output would let both steps collapse into one.


Missing push: path filter

push: branches: master has no paths: filter, so it fires on every master commit regardless of whether dependency files changed. The PR trigger has a path filter; the push trigger should match for consistency (unless re-validating on every master push is intentional).


Type annotation: Verdict.detail

Per CLAUDE.md conventions, fields should use lowercase generics. detail: dict should be detail: dict[str, Any], which also requires from typing import Any.


No unit tests for the checker script

The composite-expression parsing logic in _arms and _check_license is subtle: OR-arm leniency for denied licenses, ; separator treatment, _first_line multi-line stripping. The PR tests these paths manually, but a small tests/unit/test_check_licenses.py with parametrized cases would prevent regressions. Not blocking, but worth a follow-up issue.


Summary: the core logic is sound and the policy coverage looks thorough. The main issue to fix before merging is the pymupdf-layout false-pass in CI (a package with allowed_for = [] should fail the check, not silently pass as an exception). The unused pull-requests: write permission and double-invocation are also worth fixing. Everything else is polish.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant