feat(citations): version groups (merge preprint + published citations) #336
Merged
OpenAlex splits citations across a paper's preprint and published records, so a canonical paper can undercount badly (LSL published=61 but its preprint holds 98). Add a citations.aliases config map (primary DOI -> version DOIs); the sync resolves every version to an OpenAlex work id and queries them as one OR-joined, deduplicated cites: filter, attributing the merged per-year counts to the primary DOI.
- CitationConfig.aliases (validated/normalized like dois).
- OpenAlexCitationClient.counts_by_year/recent_citing_papers accept a work-id group; _cites_filter OR-joins with '|'.
- sync_citing_papers builds the group from primary + aliases; CLI and scheduler pass community aliases.
- eeglab: LSL bioRxiv preprint; bids: BIDS Apps bioRxiv preprint.
LSL combined rises 61 -> 157.
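The OR-joined filter described in this commit can be sketched as below. This is a minimal illustration, not the PR's actual code: `cites_filter` is a hypothetical stand-in for `_cites_filter`, and deduplicating the work ids client-side is an assumption made for the sketch.

```python
def cites_filter(work_ids: list[str]) -> str:
    """Build an OpenAlex-style `cites:` filter that OR-joins work ids with '|'.

    Hypothetical stand-in for the PR's `_cites_filter`; the real helper
    may differ in name, signature, and error type.
    """
    # Assumption: drop repeated ids while keeping order, so the primary stays first.
    unique = list(dict.fromkeys(work_ids))
    if not unique:
        # An empty group has no meaningful filter, so fail loudly.
        raise ValueError("cites filter needs at least one work id")
    return "cites:" + "|".join(unique)


single = cites_filter(["W1"])        # one version only
group = cites_filter(["W1", "W2"])   # published + preprint as one group
```

A single-element list yields the same filter a non-aliased paper would use, so callers would not need a separate code path for papers without aliases.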
- Raise on an empty alias version DOI instead of silently dropping it.
- Add a model validator: every alias primary DOI must be in dois, so a typo'd primary fails at config load rather than silently never merging.
- _cites_filter raises on an empty work-id list (explicit precondition).
- Tests for both new validations plus the empty-filter guard.
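The two config validations above could look roughly like this in plain Python. It is a sketch under stated assumptions: `normalize_doi` and `validate_aliases` are hypothetical names, the normalization rules (trim, lowercase, strip a `doi.org` prefix) are guesses at what "normalized like dois" means, and the PR itself implements the check as a model validator on `CitationConfig`.

```python
def normalize_doi(doi: str) -> str:
    """Assumed normalization: trim, lowercase, drop a doi.org / doi: prefix."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def validate_aliases(
    dois: list[str], aliases: dict[str, list[str]]
) -> dict[str, list[str]]:
    """Return normalized aliases, raising on empty versions or unknown primaries."""
    known = {normalize_doi(d) for d in dois}
    result: dict[str, list[str]] = {}
    for primary, versions in aliases.items():
        key = normalize_doi(primary)
        if key not in known:
            # A typo'd primary would otherwise never merge; fail at load time.
            raise ValueError(f"alias primary DOI {primary!r} is not listed in dois")
        cleaned = []
        for version in versions:
            norm = normalize_doi(version)
            if not norm:
                # Raise instead of silently dropping the empty entry.
                raise ValueError(f"empty alias version DOI under {primary!r}")
            cleaned.append(norm)
        result[key] = cleaned
    return result
```

Failing at config load keeps both mistakes visible immediately, rather than surfacing later as a citation count that never changes.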
Problem
OpenAlex keeps separate work records for a paper's preprint and published versions and splits citations across them. Tracking only the published DOI undercounts badly: for LSL, the published record has 61 citations while its preprint holds 98.
Fix
A canonical paper can declare version aliases. The citation sync resolves every version DOI to an OpenAlex work id and queries them together with an OR-joined `cites:W1|W2` filter, which OpenAlex deduplicates; the merged per-year histogram is attributed to the primary DOI.
- `CitationConfig.aliases` (primary DOI -> version DOIs), validated/normalized like `dois`.
- `OpenAlexCitationClient.counts_by_year`/`recent_citing_papers` accept a work-id group; `_cites_filter` OR-joins with `|`.
- `sync_citing_papers(aliases=...)` builds the group from primary + aliases; CLI and scheduler pass the community's aliases.
Verified against OpenAlex: LSL's combined per-year total rises 61 -> 157 (2025: 75, 2024: 34, ...).
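How the sync might assemble a version group from a primary DOI and its aliases is sketched below. Everything here is illustrative: `build_work_groups` and the `resolve` callable are hypothetical, the DOIs are placeholders, and skipping unresolvable DOIs is an assumption about behavior that this PR description does not spell out.

```python
from typing import Callable, Optional


def build_work_groups(
    dois: list[str],
    aliases: dict[str, list[str]],
    resolve: Callable[[str], Optional[str]],
) -> dict[str, list[str]]:
    """Map each primary DOI to the work ids of all its versions.

    `resolve` stands in for a DOI -> OpenAlex work id lookup and returns
    None when a DOI cannot be resolved (assumed here to be skipped).
    """
    groups: dict[str, list[str]] = {}
    for primary in dois:
        ids: list[str] = []
        # The primary comes first, followed by its declared version DOIs.
        for doi in [primary, *aliases.get(primary, [])]:
            work_id = resolve(doi)
            if work_id is not None and work_id not in ids:
                ids.append(work_id)
        if ids:
            # Counts for the whole group are keyed by the primary DOI.
            groups[primary] = ids
    return groups


# Placeholder DOIs and work ids, with a dict lookup instead of a network call.
lookup = {"10.1000/published": "W1", "10.1000/preprint": "W2"}
groups = build_work_groups(
    ["10.1000/published"],
    {"10.1000/published": ["10.1000/preprint"]},
    lookup.get,
)
```

Keying the result by primary DOI mirrors the attribution rule above: whatever counts come back for the group are stored against the canonical paper, not against the individual versions.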
Test plan
- `_cites_filter` single/multi/empty; `counts_by_year` sends the OR-joined filter for a group.
- `sync_citing_papers` with aliases resolves both versions, OR-joins (`cites:W1|W2`), and attributes the merged count to the primary DOI (mock transport, real DB).
Deploy follow-up
Re-run `sync papers --community {eeglab,bids} --citations` so LSL/BIDS Apps pick up their preprint citations.