Port UniProt evidence parser from Java to Python/PySpark by d0choa · Pull Request #128 · opentargets/pts

d0choa · 2026-05-11T20:03:47Z

Summary

Ports the legacy Java uniprot-evidence-parser to three native PTS steps:

uniprot_evidence_parse (Polars transformer) — parses Swiss-Prot flat file → intermediate parquet with per-entry diseases and variants.
uniprot_literature (PySpark) — emits literature evidence rows from disease CC blocks.
uniprot_variants (PySpark) — emits variant evidence rows from FT VARIANT features.

The two existing evidence_postprocess_uniprot_* blocks are retargeted from input/evidence/uniprot_*.json.gz (externally provided) to the new intermediate/evidence/uniprot_*.parquet outputs (format flipped json → parquet).
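For orientation, here is a minimal sketch of splitting a Swiss-Prot flat file into entries. Each entry in the flat file ends with a line containing only `//`. The function name `iter_entries` and the two toy entries are illustrative assumptions, not the code in this PR.

```python
from collections.abc import Iterable, Iterator


def iter_entries(lines: Iterable[str]) -> Iterator[list[str]]:
    """Yield one list of lines per flat-file entry, split on the `//` terminator."""
    entry: list[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "//":
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    # A well-formed file ends with `//`; lines after the last terminator
    # are still yielded rather than silently dropped.
    if entry:
        yield entry


# Toy input (hypothetical accessions, heavily abridged entries).
sample = """ID   TOY1_HUMAN              Reviewed;         100 AA.
AC   X00001;
//
ID   TOY2_HUMAN              Reviewed;         200 AA.
AC   X00002;
//
"""
entries = list(iter_entries(sample.splitlines()))
print(len(entries))
```

Streaming entry by entry keeps memory flat regardless of file size, which matters because the per-entry CC and FT accumulators only ever need one entry at a time.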

Schema reconciliation vs GCS reference

Compared against gs://open-targets-pipeline-runs/ds/26.03-test5/input/evidence/uniprot_*.json.gz:

	reference rows	new rows	medium-confidence fraction
literature	7,697	9,141	ref 8.2% / new 8.4%
variants	47,867	50,370	ref 2.1% / new 2.3%

Row-count drift (about +19% / +5%) is the expected delta from a newer Swiss-Prot release. Columns match exactly for literature; variants matches except variantId (chr_pos_ref_alt), which requires an rsID→genomic resolver not wired up here.
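The drift figures can be recomputed directly from the row counts in the table. This is a small worked check, with `pct_change` as a hypothetical helper name.

```python
def pct_change(ref: int, new: int) -> float:
    """Relative change of `new` versus `ref`, in percent."""
    return (new - ref) / ref * 100


# Row counts taken from the table above.
literature_drift = pct_change(7_697, 9_141)
variants_drift = pct_change(47_867, 50_370)
print(f"literature {literature_drift:+.1f}%, variants {variants_drift:+.1f}%")
```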

Notes for reviewers

The Java pipeline's INDEFINITE_DISEASE_NOTE_ASSOCIATIONS phrases were stale: UniProt now uses "variants" rather than "mutations" / "variations". The updated phrase list matches live data.
The /db_snp FT qualifier no longer exists in modern Swiss-Prot; rsIDs are extracted from inline dbSNP:rsNNN mentions inside /note text.
No somatic/germline split in the reference data — alleleOrigins, the somatic-census join, and the datatypeId branching were all dropped.
com.johnsnowlabs.nlp:spark-nlp_2.12:6.1.3 is wired into both pyspark step properties so ontoma.OnToma() works locally; other PTS evidence pipelines using add_efo_mapping (clingen, orphanet) likely need the same treatment for local-run support — out of scope here.
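The inline rsID extraction mentioned above can be sketched with a single regular expression. The pattern and the `extract_rsids` name are assumptions for illustration; the sample note is the P40692 VAR_022665 description quoted in this PR.

```python
import re

# Matches inline mentions such as "dbSNP:rs28930073" inside a /note value.
RSID_RE = re.compile(r"dbSNP:(rs\d+)")


def extract_rsids(note: str) -> list[str]:
    """Return every rsID mentioned in the note text, in order of appearance."""
    return RSID_RE.findall(note)


note = "decreased but not abolished ATPase activity; dbSNP:rs28930073"
rsids = extract_rsids(note)
print(rsids)
```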

Known gap: variant→disease linkage without an explicit text marker

Our parser links each FT VARIANT to the entry's diseases by matching the disease acronym in the variant's /note (e.g. "R -> Q (in BROVCA1; ...)" → BROVCA1 → OMIM:604370). The Java pipeline used japi's structured object graph, which exposed the variant↔disease relationship even when the /note text mentioned no disease at all. Concrete example from our data: P40692 VAR_022665 has the description 'decreased but not abolished ATPase activity; dbSNP:rs28930073' and no disease tag, yet the Java reference links it to OMIM:114500 (Colorectal cancer) — one of the entry's 5 diseases. That information isn't in the flat file.

Result: we ship ~34% fewer rsID-less variant rows than the reference (8,991 vs 13,593). The gap sits in the less clinically curated variants, those the curator never tagged with a disease acronym, so the most precision-sensitive associations are unaffected. The rsID-bearing subset, keyed by (target, disease, rsID), matches the reference at 99.4% key overlap (28,101 / 28,274). Closing this gap fully would require either bootstrapping from the legacy Java pipeline's output, or emitting the cross-product of variants × entry-diseases for variants whose /note names no disease (high recall, low precision). Both are out of scope here.
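The acronym-based linkage described in this section can be sketched as below. This is an illustration under assumptions, not the parser in this PR: `link_variant_to_diseases` is a hypothetical name, and the per-entry `diseases` mapping is assumed to have been built from the CC DISEASE blocks.

```python
import re


def link_variant_to_diseases(note: str, diseases: dict[str, str]) -> list[str]:
    """Return disease IDs whose acronym appears as a whole token in the note.

    `diseases` maps acronym -> disease ID for a single UniProt entry.
    """
    linked = []
    for acronym, disease_id in diseases.items():
        # The lookarounds stop an acronym matching inside a longer token,
        # e.g. a hypothetical "ABC" inside "ABC2".
        if re.search(rf"(?<![\w-]){re.escape(acronym)}(?![\w-])", note):
            linked.append(disease_id)
    return linked


entry_diseases = {"BROVCA1": "OMIM:604370"}

tagged = link_variant_to_diseases("R -> Q (in BROVCA1; ...)", entry_diseases)
untagged = link_variant_to_diseases(
    "decreased but not abolished ATPase activity; dbSNP:rs28930073",
    entry_diseases,
)
print(tagged, untagged)
```

The second call shows the gap: a note with no acronym yields no link, whatever diseases the entry carries.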

Test plan

All tests pass: JAVA_HOME=<jdk17> uv run pytest test — 348 passing (319 baseline + 29 new).
Lint clean: uv run ruff check on the new modules.
End-to-end locally with Swiss-Prot from gs://open-targets-pipeline-runs/ds/26.03-test5/:
- uv run pts --step uniprot_evidence_parse — ~7 s
- uv run pts --step uniprot_literature — ~25 s
- uv run pts --step uniprot_variants — ~30 s
Schema and confidence-distribution diff against the GCS reference (see table above and the semantic comparison comment below).
Reviewer: run evidence_postprocess_uniprot_literature and evidence_postprocess_uniprot_variants against the new parquets and confirm downstream consumers are happy.

Out of scope (follow-ups)

variantId field — requires an rsID→genomic-coordinate resolver.
Closing the variant→disease linkage gap described above.
The need for spark-nlp in clingen / orphanet properties for local runs.

…k-nlp jar

UniProt has moved from 'mutations'/'variations' to 'variants' in the six disease-note phrases that drive the medium-confidence classification; the legacy Java strings produced a near-zero match rate against the current release. Updated phrases + tests now yield medium fractions within ~10% of the GCS reference (2.3% vs 2.1% for variants, 8.4% vs 8.2% for literature). Also adds the johnsnowlabs spark-nlp jar to both pyspark steps' properties block, because ontoma's OnToma() constructor requires it locally.

The somatic dbSNP census file was originally ported from the Java repo to feed the somatic/germline split in uniprot_variants. The GCS reference revealed that no somatic split exists in the current output, so the file was disconnected from the pipeline in d2160e2. Removing it now since it serves no purpose.

d0choa · 2026-05-11T20:14:32Z

+import polars as pl
+from loguru import logger
+from otter.config.model import Config
+from otter.storage.synchronous.handle import StorageHandle


I'm not sure whether relying on otter is the best way to do this.

d0choa · 2026-05-11T20:19:25Z

Semantic comparison vs GCS reference

Ran a row-level semantic diff of the generated parquets against gs://open-targets-pipeline-runs/ds/26.03-test5/input/evidence/uniprot_*.json.gz. Both pipelines fan out rows when a disease has multiple EFO matches, so I aggregated by natural key first, then compared.
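A minimal sketch of that aggregate-then-compare step, using only the standard library. The field names (`target`, `disease`, `efo`) and the toy rows are assumptions for illustration; the actual diff script is not part of this page.

```python
from collections import defaultdict


def aggregate(rows: list[dict], key_fields: tuple[str, ...]) -> dict[tuple, set[str]]:
    """Collapse fanned-out rows to one entry per natural key, collecting EFO IDs."""
    grouped: dict[tuple, set[str]] = defaultdict(set)
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        grouped[key].add(row["efo"])
    return dict(grouped)


def key_overlap(ref: dict, new: dict) -> dict[str, float]:
    """Count shared and one-sided keys, plus overlap as a percentage of ref."""
    ref_keys, new_keys = set(ref), set(new)
    both = ref_keys & new_keys
    return {
        "both": len(both),
        "ref_only": len(ref_keys - new_keys),
        "new_only": len(new_keys - ref_keys),
        "pct_of_ref": 100 * len(both) / len(ref_keys),
    }


# Toy data: the first key fans out to two EFO matches in the reference.
ref_rows = [
    {"target": "T1", "disease": "D1", "efo": "EFO_1"},
    {"target": "T1", "disease": "D1", "efo": "EFO_2"},
    {"target": "T2", "disease": "D2", "efo": "EFO_3"},
]
new_rows = [
    {"target": "T1", "disease": "D1", "efo": "EFO_1"},
    {"target": "T3", "disease": "D3", "efo": "EFO_4"},
]
keys = ("target", "disease")
summary = key_overlap(aggregate(ref_rows, keys), aggregate(new_rows, keys))
print(summary)
```

Aggregating first means a change in EFO fan-out shows up as a difference in the mapped set for one key, not as spurious extra or missing rows.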

Literature — key = (target, disease)

	count	%
ref unique keys	7,039
new unique keys	7,067
keys in both	6,953	98.8% of ref
ref-only (likely removed from newer Swiss-Prot)	86	1.2%
new-only (added since reference snapshot)	114	1.6%

Field agreement across the 6,953 overlapping keys:

field	agreement
`datatypeId`	100%
`targetModulation`	100%
`confidence`	99.9% (6 keys flip — 4 `medium→high`, 2 `high→medium`)
`diseaseFromSource`	98.0% (curator name updates)
`literature` PMID set	90.0% identical · 2.6% we have more · 6.9% we have fewer · 0.5% disjoint
`mappedEfoSet`	67.4% identical · 26.8% we have more · 4.5% disjoint (ontoma LUT version drift)
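The set-valued fields (`literature`, `mappedEfoSet`) are reported as identical / more / fewer / disjoint. One way to bucket a pair of sets is sketched below. How the original diff treated partially overlapping sets is not stated here, so the extra "partial overlap" bucket is an assumption, as is the `classify_sets` name.

```python
from collections import Counter


def classify_sets(ref: set[str], new: set[str]) -> str:
    """Describe how the new set relates to the reference set."""
    if new == ref:
        return "identical"
    if new > ref:  # strict superset
        return "we have more"
    if new < ref:  # strict subset
        return "we have fewer"
    if not (new & ref):
        return "disjoint"
    return "partial overlap"


# Toy (ref, new) pairs, one per bucket.
pairs = [
    ({"1", "2"}, {"1", "2"}),
    ({"1"}, {"1", "2"}),
    ({"1", "2"}, {"1"}),
    ({"1"}, {"2"}),
    ({"1", "2"}, {"2", "3"}),
]
counts = Counter(classify_sets(ref, new) for ref, new in pairs)
print(counts)
```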

Variants with rsID — key = (target, disease, rsID)

	count	%
ref unique keys	28,274
new unique keys	29,200
keys in both	28,101	99.4% of ref
ref-only	173	0.6%
new-only	1,099	3.8%

Field agreement across the 28,101 overlapping keys:

field	agreement
`datatypeId`	100%
`targetModulation`	100%
`confidence`	99.9% (22 keys flip)
`diseaseFromSource`	96.6% (curator wording changes)
`literature` PMID set	96.3% identical · 1.2% we have more · 0.7% we have fewer · 1.7% disjoint
`mappedEfoSet`	66.6% identical · 28.3% we have more (ontoma version drift)

Variants without rsID

	rows	unique (target, disease) pairs
reference	13,593	2,771
new	8,991	2,912

~34% fewer rows. This is the variant→disease linkage gap discussed in the PR description: when a variant's /note has no disease acronym, our parser can't link it, while the legacy Java pipeline did via japi's structured object graph.

What's japi? UniProt's official Java client library (uk.ac.ebi.uniprot:japi, on EBI Artifactory). Instead of parsing the flat file, it queries UniProt's servers and returns a typed Java object model — UniProtEntry, DiseaseCommentStructured, VariantFeature, etc. — with cross-references already resolved. In particular, DiseaseCommentStructured.getVariants() returns the variants the curators have associated with that disease, a relationship that doesn't appear anywhere in the rendered flat-file text. Our Python port reads the flat file (the only public artefact on the EBI FTP), so it can only recover the textual breadcrumbs the curator left behind.

Headlines

Schema matches (apart from the missing variantId on variants): datatypeId and targetModulation agree 100%; confidence agrees 99.9%.
Coverage is high: 98.8% / 99.4% key overlap.
Most disagreements are upstream drift — ontoma LUT version, UniProt curator note updates, newer Swiss-Prot release content.
One real bug surfaced: spotted a "B- cell" (extra space after the hyphen) in one of our diseaseFromSource values vs "B-cell" in the reference. Suggests the FT continuation-line joiner is occasionally preserving an unwanted space across a hyphenated word. Worth a follow-up regex fix.
One documented gap confirmed: rsID-less variants are under-represented by ~34% — see the japi explanation above for the structural reason.

Swiss-Prot occasionally wraps mid-compound-word (e.g. a CC line ending '...T-cell-negative/B-' continued by 'cell-positive...'). The naive ' '.join() of continuation lines produced 'B- cell-positive', breaking substring matches and inflating diseaseFromSource disagreements against the GCS reference. The new joiner elides the space when the previous part ends with '-'. Applied to both the CC DISEASE accumulator and the FT VARIANT qualifier accumulator. Recovers ~130 diseaseFromSource matches in the row-level diff and lifts variant-key overlap from 28,101 to 28,120 against the reference snapshot.
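A minimal sketch of a hyphen-aware joiner along the lines described above. `join_continuation` is a hypothetical name and the real accumulator may differ; the first example reuses the wrapped fragment quoted in the commit message.

```python
def join_continuation(parts: list[str]) -> str:
    """Join continuation lines with a space, except after a trailing hyphen."""
    joined = ""
    for part in parts:
        part = part.strip()
        if not part:
            continue
        if not joined or joined.endswith("-"):
            # First chunk, or the previous chunk was cut mid-compound-word.
            joined += part
        else:
            joined += " " + part
    return joined


wrapped = join_continuation(["T-cell-negative/B-", "cell-positive"])
plain = join_continuation(["severe combined", "immunodeficiency"])
print(wrapped)
print(plain)
```

One trade-off of this rule: a line that legitimately ends in a hyphen followed by a space in the original sentence (for example a suspended hyphen such as "alpha- and beta-") would lose that space, so the heuristic favors the more common mid-word wrap.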

d0choa added 24 commits May 11, 2026 19:20

docs: design for UniProt evidence parser Python/PySpark port

b7ddfff

docs: implementation plan for UniProt evidence parser port

bcf5b6e

feat(uniprot-evidence): parser scaffolding with CC DISEASE block extraction

18764b3

fix(uniprot-evidence): copyright separator guard and GN accumulator

0108923

test(uniprot-evidence): multi-disease and edge-case coverage

743771f

feat(uniprot-evidence): FT VARIANT feature extraction

2a3d2ce

refactor(uniprot-evidence): promote qualifier regexes and tighten description fallback

aa75948

feat(uniprot-evidence): link variants to diseases by acronym

27dc334

feat(uniprot-evidence): stream entries delimited by //

dfa537f

feat(uniprot-evidence): public transformer entry writing parquet

7a63730

feat(config): register uniprot_evidence_parse step

5e21ad2

chore(uniprot-evidence): port somatic dbSNP census from Java repo

38fb1ba

feat(uniprot-evidence): shared helpers for url, confidence, somatic census

24295d1

feat(uniprot-evidence): uniprot_literature pyspark task

49821ed

feat(config): register uniprot_literature step and retarget postprocess

f70d30a

feat(uniprot-evidence): uniprot_variants pyspark task with somatic flag

58c4df7

feat(config): register uniprot_variants step and retarget postprocess

b60be94

style(uniprot-evidence): satisfy ruff PLC1901 in description-empty tests

c785b94

fix(uniprot-evidence): correct datatypeId values for literature and somatic

028e1e7

fix(uniprot-evidence): extract dbSNP rsID from /note text

3ae4598

refactor(uniprot-evidence): align output schema with GCS reference

d2160e2

chore(uniprot-evidence): drop spec and plan docs from branch

af9d431

The brainstorming spec and implementation plan were committed early as working artifacts; they should not ship with the PR. Removing from the branch — files remain in the local working tree as untracked.

d0choa wants to merge 25 commits into main from feat/uniprot-evidence-parser
