
feat(citations): version groups (merge preprint + published citations)#336

Merged
neuromechanist merged 3 commits into develop from feature/citation-version-groups
Jun 10, 2026

@neuromechanist

Member

Problem

OpenAlex keeps separate work records for a paper's preprint and published versions and splits citations across them. Tracking only the published DOI undercounts badly:

  • LSL: published (Imaging Neuroscience) = 61 citations, but its bioRxiv preprint holds 98.
  • BIDS Apps: the published PLoS Comp Biol record shows 0 (OpenAlex lost them); the preprint has 56.

Fix

A canonical paper can declare version aliases. The citation sync resolves every version DOI to an OpenAlex work id and queries them together with an OR-joined cites:W1|W2 filter. OpenAlex deduplicates the citing works, and the merged per-year histogram is attributed to the primary DOI.

citations:
  aliases:
    "10.1162/IMAG.a.136":            # LSL published
      - "10.1101/2024.02.13.580071"  # LSL bioRxiv preprint

  • CitationConfig.aliases (primary DOI -> version DOIs), validated/normalized like dois.
  • OpenAlexCitationClient.counts_by_year / recent_citing_papers accept a work-id group; _cites_filter OR-joins with |.
  • sync_citing_papers(aliases=...) builds the group from primary + aliases; CLI and scheduler pass the community's aliases.
  • Configs: eeglab LSL preprint, bids BIDS Apps preprint.
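The grouping described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `cites_filter` and `build_group` are hypothetical stand-ins for `_cites_filter` and the group-building step in `sync_citing_papers`, and `W1`/`W2` are placeholder work ids.

```python
def cites_filter(work_ids: list[str]) -> str:
    """OR-join OpenAlex work ids into a single cites: filter value."""
    if not work_ids:
        # Explicit precondition: an empty group has no meaningful filter.
        raise ValueError("cites filter needs at least one work id")
    return "cites:" + "|".join(work_ids)


def build_group(primary: str, aliases: dict[str, list[str]]) -> list[str]:
    """Return the primary DOI followed by its version DOIs, without duplicates."""
    group = [primary]
    for doi in aliases.get(primary, []):
        if doi not in group:
            group.append(doi)
    return group


# Hypothetical usage with the LSL entry from the config above.
aliases = {"10.1162/IMAG.a.136": ["10.1101/2024.02.13.580071"]}
group = build_group("10.1162/IMAG.a.136", aliases)

# Each DOI in the group would then be resolved to an OpenAlex work id;
# W1 and W2 stand in for those resolved ids.
print(cites_filter(["W1", "W2"]))  # cites:W1|W2
```

A paper without aliases yields a one-element group, so the single-DOI path and the grouped path can share the same filter builder.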

Verified against OpenAlex: the combined LSL citation count rises 61 -> 157 (2025: 75, 2024: 34, ...).
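A check like this can be reproduced by hand with one grouped request per filter. The sketch below assumes the OpenAlex works endpoint with `group_by=publication_year`; whether `counts_by_year` in this PR issues the same request is not stated here. `counts_url` and `parse_counts` are hypothetical helpers, and the stub payload holds made-up counts, not real OpenAlex data.

```python
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def counts_url(cites_filter: str) -> str:
    """Build a works query grouped by publication year for one cites: filter."""
    query = {"filter": cites_filter, "group_by": "publication_year"}
    # Keep ':' and '|' readable instead of percent-encoding them.
    return f"{OPENALEX_WORKS_URL}?{urlencode(query, safe=':|')}"


def parse_counts(payload: dict) -> dict[int, int]:
    """Turn the group_by section of a response into {year: count}."""
    return {int(g["key"]): int(g["count"]) for g in payload.get("group_by", [])}


# Hand-written stub shaped like a grouped response; the counts are invented.
stub = {"group_by": [{"key": "2025", "count": 5}, {"key": "2024", "count": 3}]}
counts = parse_counts(stub)

print(counts_url("cites:W1|W2"))
print(sum(counts.values()))
```

Because the OR-joined filter is sent as a single query, a work that cites both the preprint and the published version is counted once, which is why the combined total can be lower than the sum of the two separate counts.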

Test plan

  • Config: aliases default empty, primary+version normalization + dedup, invalid-DOI rejection.
  • OpenAlex client: _cites_filter single/multi/empty; counts_by_year sends the OR-joined filter for a group.
  • End-to-end: sync_citing_papers with aliases resolves both versions, OR-joins (cites:W1|W2), and attributes the merged count to the primary DOI (mock transport, real DB).
  • Regression: config, citations endpoint, cli sync, scheduler — all green (231 passed).

Deploy follow-up

Re-run sync papers --community {eeglab,bids} --citations so LSL/BIDS Apps pick up their preprint citations.

OpenAlex splits citations across a paper's preprint and published records, so
a canonical paper can undercount badly (LSL published=61 but its preprint
holds 98). Add a citations.aliases config map (primary DOI -> version DOIs);
the sync resolves every version to an OpenAlex work id and queries them as one
OR-joined, deduplicated cites: filter, attributing the merged per-year counts
to the primary DOI.

- CitationConfig.aliases (validated/normalized like dois).
- OpenAlexCitationClient.counts_by_year/recent_citing_papers accept a work-id
  group; _cites_filter OR-joins with '|'.
- sync_citing_papers builds the group from primary + aliases; CLI and
  scheduler pass community aliases.
- eeglab: LSL bioRxiv preprint; bids: BIDS Apps bioRxiv preprint.

LSL combined rises 61 -> 157.

- Raise on an empty alias version DOI instead of silently dropping it.
- Add a model validator: every alias primary DOI must be in dois, so a
  typo'd primary fails at config load rather than silently never merging.
- _cites_filter raises on an empty work-id list (explicit precondition).
- Tests for both new validations plus the empty-filter guard.
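The two config validations listed above might look roughly like this. It is a sketch under assumptions, using plain functions rather than the project's `CitationConfig` model: `normalize_doi` and `validate_aliases` are hypothetical names, the DOI pattern is a simplification, and lowercasing is an assumed normalization rule.

```python
import re

# Simplified DOI shape check; the real validation may be stricter or looser.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw: str) -> str:
    """Strip common prefixes, reject empty or malformed DOIs, and lowercase."""
    doi = raw.strip()
    for prefix in ("https://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
    if not doi:
        raise ValueError("empty DOI")
    if not DOI_RE.match(doi):
        raise ValueError(f"invalid DOI: {raw!r}")
    return doi.lower()


def validate_aliases(dois: list[str], aliases: dict[str, list[str]]) -> dict[str, list[str]]:
    """Normalize aliases and fail fast on problems that would otherwise be silent."""
    known = {normalize_doi(d) for d in dois}
    cleaned: dict[str, list[str]] = {}
    for primary, versions in aliases.items():
        key = normalize_doi(primary)
        if key not in known:
            # A typo'd primary would never merge, so reject it at load time.
            raise ValueError(f"alias primary {primary!r} is not listed in dois")
        seen: list[str] = []
        for version in versions:
            doi = normalize_doi(version)  # raises on an empty version DOI
            if doi != key and doi not in seen:
                seen.append(doi)
        cleaned[key] = seen
    return cleaned


result = validate_aliases(
    ["10.1162/IMAG.a.136"],
    {"10.1162/IMAG.a.136": ["10.1101/2024.02.13.580071", "10.1101/2024.02.13.580071"]},
)
print(result)
```

Failing at config load keeps the error next to the typo, instead of surfacing later as a citation count that never changes.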
neuromechanist merged commit 999635a into develop Jun 10, 2026
7 checks passed
neuromechanist deleted the feature/citation-version-groups branch June 10, 2026 01:36