Phase 2: Citation dashboard endpoint by neuromechanist · Pull Request #330 · OpenScience-Collective/osa

neuromechanist · 2026-06-09T23:27:03Z

Summary

Exposes canonical-paper citation tracking as a public, read-only JSON feed with a per-year and stacked-by-canonical-paper breakdown, opt-in per community.

Schema: new papers.cites_doi column. Added to CREATE TABLE papers for new DBs and via _migrate_db ALTER TABLE for existing ones. The idx_papers_cites_doi index is created in _migrate_db (not SCHEMA_SQL) so init_db stays safe on databases predating the column — caught by a migration test.
Linkage: upsert_paper records cites_doi; on conflict cites_doi = COALESCE(papers.cites_doi, excluded.cites_doi) so the first citation link wins, a later keyword sync (None) never erases it, and a re-sync backfills legacy NULL rows. sync_citing_papers threads the canonical DOI through _store_papers.
Aggregation: get_citation_stats(project) returns total, per_year, and by_paper{doi:{year:count}}. A 4-digit-year GLOB guard drops rows whose created_at is missing or malformed so no citation lands in a bogus year bucket.
Endpoint: GET /{community_id}/citations gated by public_feeds.citations; returns per_year, stacked by_paper, and canonical_dois from config. Same Cache-Control: public, max-age=3600 and 503/500 handling as the FAQ feed.
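The aggregation step can be illustrated with a minimal sketch. It assumes a simplified `papers` table holding only `url`, `cites_doi`, and `created_at`; the names `make_demo_db` and `citation_stats_sketch` are hypothetical, and the query shows how a GLOB guard of this kind can work rather than reproducing the exact SQL in `get_citation_stats`.

```python
import sqlite3
from collections import defaultdict


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory papers table (hypothetical minimal layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE papers (url TEXT PRIMARY KEY, cites_doi TEXT, created_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO papers (url, cites_doi, created_at) VALUES (?, ?, ?)",
        [
            ("p1", "10.1/a", "2021-03-01"),
            ("p2", "10.1/a", "2021-07-01"),
            ("p3", "10.1/b", "2022-01-15"),
            ("p4", "10.1/a", None),      # missing date: excluded by the guard
            ("p5", "10.1/b", "n.d."),    # malformed date: excluded by the guard
            ("p6", None, "2022-05-05"),  # keyword-sync row with no citation link
        ],
    )
    return conn


def citation_stats_sketch(conn: sqlite3.Connection) -> dict:
    """Return total, per_year, and by_paper{doi: {year: count}}."""
    rows = conn.execute(
        """
        SELECT cites_doi, substr(created_at, 1, 4) AS year, COUNT(*) AS n
        FROM papers
        WHERE cites_doi IS NOT NULL
          -- Guard: created_at must start with four digits; NULL never matches.
          AND created_at GLOB '[0-9][0-9][0-9][0-9]*'
        GROUP BY cites_doi, year
        ORDER BY year, cites_doi
        """
    ).fetchall()

    per_year = defaultdict(int)
    by_paper = defaultdict(dict)
    for doi, year, n in rows:
        per_year[year] += n
        by_paper[doi][year] = n

    return {
        "total": sum(per_year.values()),
        "per_year": dict(per_year),
        "by_paper": dict(by_paper),
    }
```

Grouping by DOI and year in a single query keeps the stacked `by_paper` breakdown and the `per_year` totals consistent with each other, since both are derived from the same filtered rows.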

Deploy note

cites_doi is populated going forward by the scheduled citation sync. To backfill existing rows, run a full citation re-sync once (cheap) after this merges and the schema migration runs.

Limitation

A paper citing more than one canonical DOI records only the first (single column, by design). Re-sync is idempotent under COALESCE.
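The COALESCE behavior behind this limitation can be sketched as follows. The table layout, the `url` conflict key, and the function name `upsert_paper_sketch` are assumptions made for illustration; only the `cites_doi = COALESCE(papers.cites_doi, excluded.cites_doi)` clause is taken from the description above. The upsert syntax needs SQLite 3.24 or newer.

```python
import sqlite3

# Hypothetical minimal schema; the real papers table has more columns.
SCHEMA = "CREATE TABLE papers (url TEXT PRIMARY KEY, title TEXT, cites_doi TEXT)"

UPSERT = """
INSERT INTO papers (url, title, cites_doi) VALUES (?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
    title = excluded.title,
    -- Keep the stored link if there is one; otherwise take the incoming value.
    cites_doi = COALESCE(papers.cites_doi, excluded.cites_doi)
"""


def upsert_paper_sketch(conn, url, title, cites_doi=None):
    """Insert or update a paper without ever overwriting an existing link."""
    conn.execute(UPSERT, (url, title, cites_doi))


def stored_link(conn, url):
    """Return the cites_doi currently stored for a paper."""
    return conn.execute(
        "SELECT cites_doi FROM papers WHERE url = ?", (url,)
    ).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

upsert_paper_sketch(conn, "p1", "Citing paper")            # legacy row, NULL link
upsert_paper_sketch(conn, "p1", "Citing paper", "10.1/a")  # re-sync backfills the NULL
upsert_paper_sketch(conn, "p1", "Citing paper", "10.1/b")  # second canonical DOI is ignored
upsert_paper_sketch(conn, "p1", "Citing paper")            # keyword sync (None) keeps the link
```

After these four calls the row still holds `10.1/a`, which is the first-link-wins outcome: the citation of `10.1/b` is not recorded anywhere, and repeating the same sync changes nothing.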

Test plan

tests/test_knowledge/test_citation_stats.py (10): stats aggregation (total/per_year/by_paper, year sorting, undated exclusion, empty DB), COALESCE link semantics (backfill, first-link-wins, keyword-sync-no-clobber), legacy-table migration.
tests/test_api/test_citations_feed.py (8): gate 404 (None + flag false) / 200, total+per_year, stacked by_paper, canonical_dois from config, Cache-Control, 503.
Real temporary SQLite databases; no business logic mocked.
Regression: papers_sync, search, db, community router, FAQ feed, core config (285 passed, 1 skipped).

Closes #323
Part of epic #321

Record which canonical DOI each citing paper references and expose a per-year + stacked-by-paper citation feed, opt-in per community.

- papers.cites_doi column (CREATE TABLE + _migrate_db ALTER for existing DBs); index created in _migrate_db so init_db stays safe on databases predating the column.
- upsert_paper records cites_doi; on conflict COALESCE keeps the first link, so a keyword sync (None) never erases it and a re-sync backfills legacy NULL rows.
- sync_citing_papers threads the canonical DOI through _store_papers.
- get_citation_stats aggregates total/per_year/by_paper (4-digit-year GLOB guard drops undated rows).
- GET /{community_id}/citations gated by public_feeds.citations, returns per_year, stacked by_paper, and canonical_dois from config, with Cache-Control and 503/500 handling matching the FAQ feed.

Backfill on deploy: run a full citation re-sync to populate cites_doi on existing rows.

Tests: stats aggregation, COALESCE link semantics (backfill/first-wins/no-clobber), legacy-table migration, endpoint gate/content/cache/503.

- Narrow the _migrate_db try to the PRAGMA only so a DDL failure (locked DB, I/O error) on an existing papers table propagates instead of being swallowed at DEBUG with a misleading 'table not found' message.
- Document the single-column cross-DOI attribution limitation on upsert_paper.
- Cover _store_papers threading cites_doi onto each stored row.
- Cover the canonical_dois=[] branch (feed enabled, no citations config) and the unexpected-error 500 path; correct the test module docstring.
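A minimal sketch of a migration shaped this way, where only the PRAGMA lookup is guarded and the DDL runs outside the try so its errors propagate. The function name `migrate_db_sketch`, the log message, and the empty-result check are assumptions for illustration; the actual `_migrate_db` may differ.

```python
import logging
import sqlite3

logger = logging.getLogger(__name__)


def migrate_db_sketch(conn: sqlite3.Connection) -> None:
    """Add papers.cites_doi and its index to a pre-existing database."""
    try:
        # Only the schema lookup is guarded. Column names are at index 1
        # of each PRAGMA table_info row.
        columns = {row[1] for row in conn.execute("PRAGMA table_info(papers)")}
    except sqlite3.Error:
        logger.debug("could not inspect papers table; skipping migration")
        return

    if not columns:
        # PRAGMA table_info returns no rows when the table does not exist.
        return

    # DDL sits outside the try: a locked DB or I/O error raises to the caller.
    if "cites_doi" not in columns:
        conn.execute("ALTER TABLE papers ADD COLUMN cites_doi TEXT")
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_papers_cites_doi ON papers(cites_doi)"
    )
```

Creating the index here rather than in the base schema means it is only attempted once the column is known to exist, and `IF NOT EXISTS` makes a second run a no-op.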

* Phase 1: FAQ JSON endpoint (#324)

* feat(api): public FAQ JSON feed gated by public_feeds config

Add a top-level public_feeds config block (faq/citations flags, off by default) and a read-only GET /{community_id}/faq endpoint that serves generated FAQ entries from the knowledge database.

- New PublicFeedsConfig model on CommunityConfig
- list_faq_entries browse helper (no FTS query required) with pagination
- Endpoint supports q/category/min_quality/limit/offset filters
- Email addresses redacted from public output (privacy mitigation)
- Returns 404 unless public_feeds.faq is enabled

Tests: list helper (ordering, filters, pagination) and endpoint (gate, fields, redaction, filters, validation) against real SQLite data.

* fix(faq): address PR review findings

- Unify browse + search in list_faq_entries via optional query param so total is the real pre-LIMIT count and offset is honored in both modes (fixes broken pagination on the ?q= path).
- Redact emails in tags, not just question/answer.
- Guard json.loads(tags) against malformed JSON (shared _parse_faq_tags helper, applied to search_faq_entries too) so a corrupt row degrades to empty tags instead of an unlogged 500.
- Add a broad logged 500 fallback in the endpoint alongside the 503 path.
- Set Cache-Control: public, max-age=3600, matching /metrics/public.
- Include limit/offset in the list_faq_entries sqlite error log.

Tests: project-consistent fixture, faq=False gate, 503 browse+search, redaction across question/answer/tags, list_name filter, real search total vs page size, Cache-Control header.

* Phase 2: Citation dashboard endpoint (#330)

* feat(api): public citations dashboard with cites_doi linkage

Record which canonical DOI each citing paper references and expose a per-year + stacked-by-paper citation feed, opt-in per community.

- papers.cites_doi column (CREATE TABLE + _migrate_db ALTER for existing DBs); index created in _migrate_db so init_db stays safe on databases predating the column.
- upsert_paper records cites_doi; on conflict COALESCE keeps the first link, so a keyword sync (None) never erases it and a re-sync backfills legacy NULL rows.
- sync_citing_papers threads the canonical DOI through _store_papers.
- get_citation_stats aggregates total/per_year/by_paper (4-digit-year GLOB guard drops undated rows).
- GET /{community_id}/citations gated by public_feeds.citations, returns per_year, stacked by_paper, and canonical_dois from config, with Cache-Control and 503/500 handling matching the FAQ feed.

Backfill on deploy: run a full citation re-sync to populate cites_doi on existing rows.

Tests: stats aggregation, COALESCE link semantics (backfill/first-wins/no-clobber), legacy-table migration, endpoint gate/content/cache/503.

* fix(citations): address PR review findings

- Narrow the _migrate_db try to the PRAGMA only so a DDL failure (locked DB, I/O error) on an existing papers table propagates instead of being swallowed at DEBUG with a misleading 'table not found' message.
- Document the single-column cross-DOI attribution limitation on upsert_paper.
- Cover _store_papers threading cites_doi onto each stored row.
- Cover the canonical_dois=[] branch (feed enabled, no citations config) and the unexpected-error 500 path; correct the test module docstring.

neuromechanist added 2 commits June 9, 2026 16:26

neuromechanist merged commit 30f3e82 into feature/issue-321-epic-public-feeds Jun 9, 2026
4 checks passed

neuromechanist deleted the feature/issue-323-phase2-citations-endpoint branch June 9, 2026 23:38

neuromechanist mentioned this pull request Jun 9, 2026

Public JSON feeds for community FAQ and citations #331

Merged

neuromechanist mentioned this pull request Jun 9, 2026

Epic: Public JSON feeds for community FAQ and citations #321

Closed
