
Phase 2: Citation dashboard endpoint#330

Merged
neuromechanist merged 2 commits into feature/issue-321-epic-public-feeds from feature/issue-323-phase2-citations-endpoint on Jun 9, 2026

@neuromechanist

Member

Summary

Exposes canonical-paper citation tracking as a public, read-only JSON feed with a per-year and stacked-by-canonical-paper breakdown, opt-in per community.

  • Schema: new papers.cites_doi column. Added to CREATE TABLE papers for new DBs and via _migrate_db ALTER TABLE for existing ones. The idx_papers_cites_doi index is created in _migrate_db (not SCHEMA_SQL) so init_db stays safe on databases predating the column — caught by a migration test.
  • Linkage: upsert_paper records cites_doi; on conflict cites_doi = COALESCE(papers.cites_doi, excluded.cites_doi) so the first citation link wins, a later keyword sync (None) never erases it, and a re-sync backfills legacy NULL rows. sync_citing_papers threads the canonical DOI through _store_papers.
  • Aggregation: get_citation_stats(project) returns total, per_year, and by_paper{doi:{year:count}}. A 4-digit-year GLOB guard drops rows whose created_at is missing or malformed so no citation lands in a bogus year bucket.
  • Endpoint: GET /{community_id}/citations gated by public_feeds.citations; returns per_year, stacked by_paper, and canonical_dois from config. Same Cache-Control: public, max-age=3600 and 503/500 handling as the FAQ feed.
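
The aggregation described above can be illustrated with a minimal sketch. It is not the project's implementation: the table layout, the ISO-style `created_at` strings, and the choice to count only dated rows in `total` are assumptions, and the real `get_citation_stats` also takes a `project` argument that is omitted here.

```python
import sqlite3
from collections import defaultdict


def get_citation_stats(conn: sqlite3.Connection) -> dict:
    """Illustrative aggregation: total, per_year, and by_paper{doi: {year: count}}.

    Assumes papers(created_at TEXT, cites_doi TEXT); column names are hypothetical.
    """
    rows = conn.execute(
        """
        SELECT substr(created_at, 1, 4) AS year, cites_doi, COUNT(*) AS n
        FROM papers
        WHERE cites_doi IS NOT NULL
          -- GLOB guard: keep only rows whose created_at starts with four digits.
          -- NULL or malformed dates fail the match and are dropped.
          AND created_at GLOB '[0-9][0-9][0-9][0-9]*'
        GROUP BY year, cites_doi
        ORDER BY year
        """
    ).fetchall()

    per_year: dict = defaultdict(int)
    by_paper: dict = defaultdict(dict)
    for year, doi, n in rows:
        per_year[year] += n
        by_paper[doi][year] = n

    return {
        "total": sum(per_year.values()),  # assumption: undated rows are not counted
        "per_year": dict(per_year),
        "by_paper": dict(by_paper),
    }
```

Because `NULL GLOB ...` evaluates to NULL in SQLite, a missing `created_at` is excluded by the same predicate that rejects malformed strings, so no separate `IS NOT NULL` check on the date is needed.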

Deploy note

cites_doi is populated going forward by the scheduled citation sync. To backfill existing rows, run a full citation re-sync once (cheap) after this merges and the schema migration runs.

Limitation

A paper citing more than one canonical DOI records only the first (single column, by design). Re-sync is idempotent under COALESCE.
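
The first-link-wins behaviour and its idempotence can be sketched as follows. This is a hedged illustration, not the project's `upsert_paper`: the conflict key (`doi`), the `title` column, and the function signature are assumptions. Only the `COALESCE(papers.cites_doi, excluded.cites_doi)` expression comes from the PR description.

```python
import sqlite3

# Hypothetical minimal schema; the real papers table has more columns.
SCHEMA = "CREATE TABLE IF NOT EXISTS papers (doi TEXT PRIMARY KEY, title TEXT, cites_doi TEXT)"

UPSERT_SQL = """
INSERT INTO papers (doi, title, cites_doi)
VALUES (?, ?, ?)
ON CONFLICT(doi) DO UPDATE SET
    title = excluded.title,
    -- Keep the stored link if there is one; otherwise take the incoming value.
    cites_doi = COALESCE(papers.cites_doi, excluded.cites_doi)
"""


def upsert_paper(conn: sqlite3.Connection, doi: str, title: str, cites_doi=None) -> None:
    """Insert or update a paper without ever overwriting an existing cites_doi."""
    conn.execute(UPSERT_SQL, (doi, title, cites_doi))


def stored_link(conn: sqlite3.Connection, doi: str):
    """Return the cites_doi currently stored for a paper (helper for the example)."""
    return conn.execute("SELECT cites_doi FROM papers WHERE doi = ?", (doi,)).fetchone()[0]
```

Under these assumptions a keyword sync that passes `None` leaves an existing link untouched, a citation sync fills in a legacy NULL, and a second citation sync with a different canonical DOI is ignored, which is the single-column limitation noted above.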

Test plan

  • tests/test_knowledge/test_citation_stats.py (10): stats aggregation (total/per_year/by_paper, year sorting, undated exclusion, empty DB), COALESCE link semantics (backfill, first-link-wins, keyword-sync-no-clobber), legacy-table migration.
  • tests/test_api/test_citations_feed.py (8): gate 404 (None + flag false) / 200, total+per_year, stacked by_paper, canonical_dois from config, Cache-Control, 503.
  • Real temporary SQLite databases; no business logic mocked.
  • Regression: papers_sync, search, db, community router, FAQ feed, core config (285 passed, 1 skipped).
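
The endpoint behaviours these tests exercise (404 gate, 200 with Cache-Control, 503, 500) can be sketched without a web framework. Everything below is illustrative: the function name, the `(status, headers, body)` return shape, the dict-based config, and the mapping of `sqlite3.Error` to 503 are assumptions rather than the project's actual router code.

```python
import logging
import sqlite3
from typing import Callable, Optional

logger = logging.getLogger(__name__)

CACHE_HEADERS = {"Cache-Control": "public, max-age=3600"}


def citations_feed(
    public_feeds: Optional[dict],
    canonical_dois: list,
    load_stats: Callable[[], dict],
):
    """Return (status, headers, body) for a hypothetical GET /{community_id}/citations."""
    # Gate: the feed is opt-in, so a missing block or a false flag both yield 404.
    if not public_feeds or not public_feeds.get("citations"):
        return 404, {}, {"detail": "Not found"}
    try:
        stats = load_stats()
    except sqlite3.Error:
        # Assumption: database errors are reported as a temporary outage.
        logger.exception("citation stats unavailable")
        return 503, {}, {"detail": "Citation data temporarily unavailable"}
    except Exception:
        # Broad logged fallback for anything unexpected.
        logger.exception("unexpected error building citations feed")
        return 500, {}, {"detail": "Internal server error"}
    body = {**stats, "canonical_dois": list(canonical_dois)}
    return 200, dict(CACHE_HEADERS), body
```

Passing `load_stats` as a callable keeps the sketch testable against a real temporary database or a deliberately failing function, mirroring the no-mocked-business-logic approach in the test plan.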

Closes #323
Part of epic #321

feat(api): public citations dashboard with cites_doi linkage

Record which canonical DOI each citing paper references and expose a
per-year + stacked-by-paper citation feed, opt-in per community.

- papers.cites_doi column (CREATE TABLE + _migrate_db ALTER for existing
  DBs); index created in _migrate_db so init_db stays safe on databases
  predating the column.
- upsert_paper records cites_doi; on conflict COALESCE keeps the first
  link, so a keyword sync (None) never erases it and a re-sync backfills
  legacy NULL rows.
- sync_citing_papers threads the canonical DOI through _store_papers.
- get_citation_stats aggregates total/per_year/by_paper (4-digit-year
  GLOB guard drops undated rows).
- GET /{community_id}/citations gated by public_feeds.citations, returns
  per_year, stacked by_paper, and canonical_dois from config, with
  Cache-Control and 503/500 handling matching the FAQ feed.

Backfill on deploy: run a full citation re-sync to populate cites_doi on
existing rows.

Tests: stats aggregation, COALESCE link semantics (backfill/first-wins/
no-clobber), legacy-table migration, endpoint gate/content/cache/503.

fix(citations): address PR review findings

- Narrow the _migrate_db try to the PRAGMA only so a DDL failure (locked
  DB, I/O error) on an existing papers table propagates instead of being
  swallowed at DEBUG with a misleading 'table not found' message.
- Document the single-column cross-DOI attribution limitation on
  upsert_paper.
- Cover _store_papers threading cites_doi onto each stored row.
- Cover the canonical_dois=[] branch (feed enabled, no citations config)
  and the unexpected-error 500 path; correct the test module docstring.
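
The narrowed `try` can be illustrated with a short sketch. It is an assumption-laden approximation of `_migrate_db`, not the project's code: the function name, the log message, and the `TEXT` column type are hypothetical. The point it shows is that only the PRAGMA inspection is guarded, so a failing `ALTER TABLE` or `CREATE INDEX` raises.

```python
import logging
import sqlite3

logger = logging.getLogger(__name__)


def migrate_db(conn: sqlite3.Connection) -> None:
    """Add papers.cites_doi and its index to an existing database (illustrative)."""
    try:
        # table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = {row[1] for row in conn.execute("PRAGMA table_info(papers)")}
    except sqlite3.Error:
        logger.debug("could not inspect papers table; skipping migration")
        return

    if not columns:
        # No papers table yet: CREATE TABLE will include the column for new DBs.
        return

    # DDL is deliberately outside the try so a locked DB or I/O error propagates.
    if "cites_doi" not in columns:
        conn.execute("ALTER TABLE papers ADD COLUMN cites_doi TEXT")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_papers_cites_doi ON papers(cites_doi)")
```

Creating the index here, after the column is known to exist, is what keeps schema initialisation safe on databases that predate `cites_doi`.
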
neuromechanist merged commit 30f3e82 into feature/issue-321-epic-public-feeds on Jun 9, 2026
4 checks passed
neuromechanist deleted the feature/issue-323-phase2-citations-endpoint branch on June 9, 2026, 23:38
neuromechanist added a commit that referenced this pull request Jun 9, 2026
* Phase 1: FAQ JSON endpoint (#324)

* feat(api): public FAQ JSON feed gated by public_feeds config

Add a top-level public_feeds config block (faq/citations flags, off by
default) and a read-only GET /{community_id}/faq endpoint that serves
generated FAQ entries from the knowledge database.

- New PublicFeedsConfig model on CommunityConfig
- list_faq_entries browse helper (no FTS query required) with pagination
- Endpoint supports q/category/min_quality/limit/offset filters
- Email addresses redacted from public output (privacy mitigation)
- Returns 404 unless public_feeds.faq is enabled

Tests: list helper (ordering, filters, pagination) and endpoint
(gate, fields, redaction, filters, validation) against real SQLite data.

* fix(faq): address PR review findings

- Unify browse + search in list_faq_entries via optional query param so
  total is the real pre-LIMIT count and offset is honored in both modes
  (fixes broken pagination on the ?q= path).
- Redact emails in tags, not just question/answer.
- Guard json.loads(tags) against malformed JSON (shared _parse_faq_tags
  helper, applied to search_faq_entries too) so a corrupt row degrades to
  empty tags instead of an unlogged 500.
- Add a broad logged 500 fallback in the endpoint alongside the 503 path.
- Set Cache-Control: public, max-age=3600, matching /metrics/public.
- Include limit/offset in the list_faq_entries sqlite error log.

Tests: project-consistent fixture, faq=False gate, 503 browse+search,
redaction across question/answer/tags, list_name filter, real search
total vs page size, Cache-Control header.
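
The tag handling from these review fixes can be sketched roughly as below. The function names, the email pattern, and the replacement text are assumptions for illustration; the source only states that a shared `_parse_faq_tags` helper degrades corrupt rows to empty tags and that emails are redacted in tags as well as in the question and answer.

```python
import json
import re

# Deliberately simple pattern; an assumption, not the project's actual regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[email redacted]", text)


def parse_faq_tags(raw) -> list:
    """Decode a JSON-encoded tag list, returning [] for missing or corrupt data."""
    if not raw:
        return []
    try:
        tags = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return []
    if not isinstance(tags, list):
        return []
    return [str(tag) for tag in tags]


def public_tags(raw) -> list:
    """Tags as they would appear in public output: parsed, then redacted."""
    return [redact_emails(tag) for tag in parse_faq_tags(raw)]
```

Returning an empty list instead of raising keeps one malformed row from turning the whole feed response into an error.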

* Phase 2: Citation dashboard endpoint (#330)

* feat(api): public citations dashboard with cites_doi linkage

* fix(citations): address PR review findings
